Node Affinity, Pod Affinity, Anti-Affinity, Taints & Tolerations¶
- Kubernetes provides a rich set of mechanisms to control where Pods are scheduled in your cluster.
- In this lab we will deep-dive into every scheduling constraint available:
`nodeSelector`, Node Affinity, Pod Affinity, Pod Anti-Affinity, Taints, Tolerations, and Topology Spread Constraints.
- We will build real-world examples - from GPU node pools to zone-aware deployments and multi-tenant cluster isolation.
- By the end of this lab you will have mastered fine-grained Pod placement and be able to design sophisticated scheduling strategies for production clusters.
What will we learn?¶
- Why Pod scheduling constraints exist and when to use each mechanism
- `nodeSelector` - simple, label-based node filtering
- Node Affinity - expressive node selection with operators, required vs. preferred rules
- Pod Affinity - co-locate Pods with other Pods (same node, same zone)
- Pod Anti-Affinity - spread Pods away from each other
- Taints - repel Pods from Nodes
- Tolerations - allow Pods onto tainted Nodes
- Topology Spread Constraints - balance Pods evenly across topology domains
- Real-world patterns: GPU pools, zone spreading, multi-tenant isolation, database co-location
- How to combine all mechanisms for complex production requirements
- Scheduling internals and the decision pipeline
Official Documentation & References¶
| Resource | Link |
|---|---|
| Assign Pods to Nodes | kubernetes.io/docs |
| Node Affinity | kubernetes.io/docs/affinity |
| Taints and Tolerations | kubernetes.io/docs/taints |
| Topology Spread Constraints | kubernetes.io/docs/topology |
| kube-scheduler | kubernetes.io/docs/kube-scheduler |
| Pod Priority & Preemption | kubernetes.io/docs/priority |
| Well-Known Node Labels | kubernetes.io/docs/reference/labels |
Introduction¶
The Kubernetes Scheduling Pipeline¶
When you create a Pod, the kube-scheduler selects which Node it runs on. The scheduler runs through several phases:
flowchart TD
A["New Pod created\n(no NodeName)"] --> B["Filtering Phase\n(Predicates)"]
B --> C{"Any nodes\npassed filter?"}
C -- No --> D["Pod stays Pending\nEvent: FailedScheduling"]
C -- Yes --> E["Scoring Phase\n(Priorities)"]
E --> F["Highest-score Node\nselected"]
F --> G["Pod bound to Node\nkubelet starts Pod"]
subgraph "Filtering checks include:"
B1["nodeSelector / nodeName"]
B2["Node Affinity (required)"]
B3["Pod Affinity / Anti-Affinity (required)"]
B4["Taints / Tolerations"]
B5["Resource availability (CPU, Mem)"]
B6["Topology Spread Constraints"]
end
subgraph "Scoring factors include:"
E1["Node Affinity (preferred) weight"]
E2["Pod Affinity (preferred) weight"]
E3["Least requested resources"]
E4["Image locality"]
end

Scheduling Mechanism Overview¶
| Mechanism | Direction | Hardness | Scope |
|---|---|---|---|
| `nodeName` | Pod → Node | Hard | Single node |
| `nodeSelector` | Pod → Node | Hard | Label match |
| Node Affinity (required) | Pod → Node | Hard | Operators, multi-label |
| Node Affinity (preferred) | Pod → Node | Soft | With weights |
| Pod Affinity (required) | Pod → Pod | Hard | Topology domain |
| Pod Affinity (preferred) | Pod → Pod | Soft | With weights |
| Pod Anti-Affinity (required) | Pod ↔ Pod | Hard | Topology domain |
| Pod Anti-Affinity (preferred) | Pod ↔ Pod | Soft | With weights |
| Taint `NoSchedule` | Node repels Pod | Hard | New pods excluded |
| Taint `NoExecute` | Node repels Pod | Hard | Existing pods evicted |
| Taint `PreferNoSchedule` | Node repels Pod | Soft | Avoid if possible |
| Topology Spread | Pod distribution | Hard/Soft | Arbitrary topology |
Terminology¶
| Term | Description |
|---|---|
| Node | A physical or virtual machine in the Kubernetes cluster |
| Node Label | A key-value pair attached to a Node used for selection |
| Taint | A key-value-effect triple on a Node that repels Pods |
| Toleration | A key-value-effect triple on a Pod that permits scheduling on a tainted Node |
| Affinity | A set of rules the scheduler uses to prefer or require specific placement |
| Anti-Affinity | Rules to keep Pods away from specific locations or other Pods |
| topologyKey | A node label key that defines the topology domain (e.g., kubernetes.io/hostname, topology.kubernetes.io/zone) |
| Required (hard) | requiredDuringSchedulingIgnoredDuringExecution - the Pod won’t schedule if the rule can’t be satisfied |
| Preferred (soft) | preferredDuringSchedulingIgnoredDuringExecution - the scheduler tries to honor but will schedule anyway |
| IgnoredDuringExecution | Already-running Pods are NOT evicted if rules change after scheduling |
| weight | Integer 1–100 given to a preferred rule; used in scoring |
| Topology Spread Constraint | Rule that limits how unevenly Pods can be distributed across topology domains |
| maxSkew | Maximum difference in Pod count between the most and least loaded topology domain |
| whenUnsatisfiable | What to do when spread can’t be satisfied: DoNotSchedule (hard) or ScheduleAnyway (soft) |
Common kubectl Commands¶
kubectl label - Add, update, or remove labels on nodes
Syntax: kubectl label nodes <node-name> <key>=<value>
Description: Labels are key-value pairs attached to Nodes (and any Kubernetes object). They are the foundation of all affinity rules and nodeSelector.
# List all nodes
kubectl get nodes
# Show all labels on nodes
kubectl get nodes --show-labels
# Label a node
kubectl label nodes node-1 environment=production
# Label with multiple keys at once
kubectl label nodes node-1 environment=production tier=frontend
# Overwrite an existing label (requires --overwrite)
kubectl label nodes node-1 environment=staging --overwrite
# Remove a label (append a minus sign)
kubectl label nodes node-1 environment-
# Label all nodes matching a selector
kubectl label nodes -l kubernetes.io/role=worker disk-type=ssd
# Show node labels formatted as a table
kubectl get nodes -o custom-columns=NAME:.metadata.name,LABELS:.metadata.labels
kubectl taint - Add and remove taints on nodes
Syntax: kubectl taint nodes <node-name> <key>=<value>:<effect>
Description: Taints prevent Pods from being scheduled on a Node unless they have a matching toleration.
# Add a taint with NoSchedule effect
kubectl taint nodes node-1 dedicated=gpu:NoSchedule
# Add a taint with NoExecute effect (evicts running pods)
kubectl taint nodes node-1 maintenance=true:NoExecute
# Add a taint with PreferNoSchedule effect (soft)
kubectl taint nodes node-1 spot-instance=true:PreferNoSchedule
# Remove a taint (append a minus sign after the effect)
kubectl taint nodes node-1 dedicated=gpu:NoSchedule-
# Remove all taints with a given key regardless of value/effect
kubectl taint nodes node-1 dedicated-
# Show all taints on all nodes
kubectl describe nodes | grep -A3 "Taints:"
# Taint ALL nodes in the cluster
kubectl taint nodes --all dedicated=shared:PreferNoSchedule
kubectl describe node - Inspect node labels, taints, and allocatable resources
# Full node description
kubectl describe node node-1
# Show just labels
kubectl get node node-1 -o jsonpath='{.metadata.labels}' | jq
# Show just taints
kubectl get node node-1 -o jsonpath='{.spec.taints}'
# Show topology labels (zone / region)
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
REGION:.metadata.labels."topology\.kubernetes\.io/region",\
ZONE:.metadata.labels."topology\.kubernetes\.io/zone"
# Check which node a pod landed on
kubectl get pods -o wide
# Watch pod scheduling events
kubectl get events --sort-by='.lastTimestamp' | grep FailedScheduling
kubectl get - Filter pods and nodes by label
# List pods on a specific node
kubectl get pods --field-selector spec.nodeName=node-1
# List pods with nodeSelector labels
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeSelector}{"\n"}{end}'
# List all nodes with a specific label
kubectl get nodes -l environment=production
# List nodes with topology zone labels
kubectl get nodes -l topology.kubernetes.io/zone=us-east-1a
Part 01 - nodeSelector (Simple Node Selection)¶
- `nodeSelector` is the simplest way to constrain a Pod to specific Nodes.
- It is a map of label key-value pairs - the Pod will only be scheduled on Nodes that have all the specified labels.
- It is less expressive than Node Affinity but easier to read for simple cases.
Step 01.01 - Label a Node¶
# Label a node for SSD storage
kubectl label nodes node-1 disk-type=ssd
# Label another node for HDD storage
kubectl label nodes node-2 disk-type=hdd
# Verify
kubectl get nodes --show-labels | grep disk-type
Step 01.02 - Schedule a Pod with nodeSelector¶
# pod-nodeselector.yaml
apiVersion: v1
kind: Pod
metadata:
name: ssd-pod
spec:
nodeSelector:
disk-type: ssd # Must match this exact label
containers:
- name: app
image: nginx:1.25
kubectl apply -f pod-nodeselector.yaml
kubectl get pod ssd-pod -o wide # Verify it landed on an SSD node
Step 01.03 - nodeSelector vs Node Affinity¶
| Feature | nodeSelector | Node Affinity |
|---|---|---|
| Operators | = only | In, NotIn, Exists, DoesNotExist, Gt, Lt |
| Logic | AND (all labels must match) | AND within a term, OR between terms |
| Soft preferences | No | Yes (preferred) |
| Multiple label conditions | Yes (all must match) | Yes (with full OR/AND logic) |
- Prefer Node Affinity for all new workloads. Use `nodeSelector` only for simple backward-compatible cases.
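For comparison, here is the `ssd-pod` from Step 01.02 expressed with Node Affinity instead of `nodeSelector` - a sketch with the same `disk-type: ssd` constraint (the pod name `ssd-pod-affinity` is illustrative):

```yaml
# pod-nodeselector-as-affinity.yaml (illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: ssd-pod-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disk-type      # same label as the nodeSelector version
                operator: In
                values: [ssd]
  containers:
    - name: app
      image: nginx:1.25
```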
Part 02 - Node Affinity¶
- Node Affinity is the next-generation `nodeSelector`. It lets you express complex label requirements using operators and supports both hard and soft rules.
- All Node Affinity rules live under `spec.affinity.nodeAffinity`.
Node Affinity Rule Types¶
| Rule | Description |
|---|---|
| `requiredDuringSchedulingIgnoredDuringExecution` | Hard - Pod won’t schedule if rule can’t be met |
| `preferredDuringSchedulingIgnoredDuringExecution` | Soft - scheduler tries to honor, but schedules anyway |
IgnoredDuringExecution
Both rule types end with IgnoredDuringExecution. This means if a Node’s labels change after a Pod is scheduled, the Pod is not evicted. A requiredDuringSchedulingRequiredDuringExecution variant has been proposed but is not implemented.
Node Affinity Operators¶
| Operator | Description | Example |
|---|---|---|
| `In` | Label value is in the set | environment In [production, staging] |
| `NotIn` | Label value is NOT in the set | environment NotIn [development] |
| `Exists` | Label key exists (any value) | gpu Exists |
| `DoesNotExist` | Label key does not exist | spot-node DoesNotExist |
| `Gt` | Label value numerically > specified | storage-gb Gt 100 |
| `Lt` | Label value numerically < specified | storage-gb Lt 1000 |
Step 02.01 - Required Node Affinity (Hard Rule)¶
# Setup - label nodes
kubectl label nodes node-1 environment=production zone=us-east
kubectl label nodes node-2 environment=development zone=us-west
# pod-required-affinity.yaml
apiVersion: v1
kind: Pod
metadata:
name: prod-pod
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: environment
operator: In
values:
- production
- staging # Pod can go to production OR staging nodes
containers:
- name: app
image: nginx:1.25
kubectl apply -f pod-required-affinity.yaml
kubectl get pod prod-pod -o wide
kubectl describe pod prod-pod | grep -E "Node:|Affinity"
Pod stays Pending if no matching Nodes exist
If no Node has environment=production or environment=staging, the Pod will remain in Pending state with event FailedScheduling: 0/N nodes are available: N node(s) didn't match Pod's node affinity/selector.
Step 02.02 - Preferred Node Affinity (Soft Rule with Weight)¶
# pod-preferred-affinity.yaml
apiVersion: v1
kind: Pod
metadata:
name: preferred-pod
spec:
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 80 # Strong preference (out of 100)
preference:
matchExpressions:
- key: environment
operator: In
values:
- production
- weight: 20 # Weak preference
preference:
matchExpressions:
- key: zone
operator: In
values:
- us-east
containers:
- name: app
image: nginx:1.25
- The scheduler adds the `weight` value as a bonus score for each Node that matches that preference. In the example above, a Node matching both preferences would gain 80 + 20 = 100 bonus points.
- A Pod CAN be scheduled on a Node that matches neither preference - the rules are purely advisory.
Step 02.03 - Combining Required and Preferred Rules¶
# pod-combined-affinity.yaml
apiVersion: v1
kind: Pod
metadata:
name: combined-affinity-pod
spec:
affinity:
nodeAffinity:
# HARD: MUST be production or staging
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: environment
operator: In
values:
- production
- staging
# SOFT: prefer us-east zone within those nodes
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: zone
operator: In
values:
- us-east
containers:
- name: app
image: nginx:1.25
Step 02.04 - Understanding OR and AND Logic¶
`nodeSelectorTerms` - OR logic; `matchExpressions` - AND logic

- Multiple entries in `nodeSelectorTerms` are combined with OR - the Pod can match ANY of them.
- Multiple entries in a single `matchExpressions` list are combined with AND - ALL must be satisfied.
# OR logic example: pod can go to either SSD nodes OR GPU nodes
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions: # First term (SSD)
- key: disk-type
operator: In
values: [ssd]
- matchExpressions: # Second term (GPU) - OR with first
- key: hardware
operator: In
values: [gpu]
# AND logic example: pod must be on SSD AND production nodes
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions: # Both conditions in the same term = AND
- key: disk-type
operator: In
values: [ssd]
- key: environment
operator: In
values: [production]
Step 02.05 - NotIn Operator (Exclude Nodes)¶
# Avoid scheduling on spot/preemptible instances for critical workloads
apiVersion: v1
kind: Pod
metadata:
name: critical-pod
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node-lifecycle
operator: NotIn
values:
- spot
- preemptible
containers:
- name: app
image: nginx:1.25
Step 02.06 - Exists and DoesNotExist Operators¶
# Must run on a node that has ANY value for the 'gpu' label
apiVersion: v1
kind: Pod
metadata:
name: any-gpu-pod
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: gpu
operator: Exists # Any value for 'gpu' label is acceptable
containers:
- name: app
image: nvidia/cuda:12.0-base
# Must run on a node that has NO 'special-hardware' label at all
apiVersion: v1
kind: Pod
metadata:
name: standard-pod
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: special-hardware
operator: DoesNotExist
containers:
- name: app
image: nginx:1.25
Step 02.07 - Gt and Lt Operators (Numeric Comparison)¶
# Label nodes with numeric values
kubectl label nodes node-1 memory-gb=16
kubectl label nodes node-2 memory-gb=64
kubectl label nodes node-3 memory-gb=128
# Schedule only on nodes with memory-gb > 32
apiVersion: v1
kind: Pod
metadata:
name: high-memory-pod
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: memory-gb
operator: Gt
values:
- "32" # Value must be a string, comparison is numeric
containers:
- name: app
image: nginx:1.25
# Schedule on nodes with memory-gb between 32 and 256 (exclusive)
apiVersion: v1
kind: Pod
metadata:
name: medium-memory-pod
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: memory-gb
operator: Gt
values: ["32"]
- key: memory-gb
operator: Lt
values: ["256"]
containers:
- name: app
image: nginx:1.25
Step 02.08 - Node Affinity in a Deployment¶
# deployment-node-affinity.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-app
spec:
replicas: 3
selector:
matchLabels:
app: web-app
template:
metadata:
labels:
app: web-app
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: environment
operator: In
values: [production]
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 50
preference:
matchExpressions:
- key: zone
operator: In
values: [us-east-1a]
- weight: 50
preference:
matchExpressions:
- key: zone
operator: In
values: [us-east-1b]
containers:
- name: web
image: nginx:1.25
ports:
- containerPort: 80
Part 03 - Pod Affinity (Co-locate Pods)¶
- Pod Affinity allows you to influence the scheduler to place Pods near other Pods.
- This is useful when Pods benefit from being on the same Node or in the same topology domain (e.g., same availability zone) to reduce network latency.
- Pod Affinity rules live under `spec.affinity.podAffinity`.
How topologyKey Works¶
flowchart LR
subgraph "Zone us-east-1a"
subgraph "node-1"
P1["frontend-abc\napp=frontend"]
P2["cache-xyz (new)"]
end
subgraph "node-2"
P3["frontend-def\napp=frontend"]
P4["cache-uvw (new)"]
end
end
subgraph "Zone us-east-1b"
subgraph "node-3"
P5["backend-pod"]
end
end
note["topologyKey: kubernetes.io/hostname\n→ Co-locate on same NODE as matching Pod\n\ntopologyKey: topology.kubernetes.io/zone\n→ Co-locate in same ZONE as matching Pod"]

Common topologyKey values

- `kubernetes.io/hostname` - same physical/virtual Node
- `topology.kubernetes.io/zone` - same availability zone
- `topology.kubernetes.io/region` - same cloud region
- Any custom label key on your Nodes
Step 03.01 - Required Pod Affinity (Co-locate on Same Node)¶
# two-pods-same-node.yaml
# First: deploy the "anchor" pod that others want to co-locate with
apiVersion: v1
kind: Pod
metadata:
name: cache-pod
labels:
app: cache
tier: caching
spec:
containers:
- name: redis
image: redis:7
ports:
- containerPort: 6379
---
# Second: deploy the app that MUST be on the same Node as a cache pod
apiVersion: v1
kind: Pod
metadata:
name: app-pod
spec:
affinity:
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- cache
topologyKey: kubernetes.io/hostname # Same node as a cache pod
containers:
- name: app
image: nginx:1.25
Step 03.02 - Required Pod Affinity (Co-locate in Same Zone)¶
# frontend-with-zone-affinity.yaml
apiVersion: v1
kind: Pod
metadata:
name: frontend-pod
spec:
affinity:
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app: backend # Must be in same zone as a backend pod
topologyKey: topology.kubernetes.io/zone
containers:
- name: frontend
image: nginx:1.25
Step 03.03 - Preferred Pod Affinity¶
# pod-preferred-pod-affinity.yaml
apiVersion: v1
kind: Pod
metadata:
name: app-prefer-near-cache
spec:
affinity:
podAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- cache
topologyKey: kubernetes.io/hostname # Prefer same node, but not required
containers:
- name: app
image: nginx:1.25
Step 03.04 - Pod Affinity in a Deployment (Sidecar Co-location Pattern)¶
# sidecar-affinity-deployment.yaml
# This deployment ensures each app pod is on the same node as a log-agent pod
apiVersion: apps/v1
kind: Deployment
metadata:
name: app-with-sidecar-affinity
spec:
replicas: 3
selector:
matchLabels:
app: my-app
template:
metadata:
labels:
app: my-app
spec:
affinity:
podAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: log-agent
topologyKey: kubernetes.io/hostname
containers:
- name: app
image: nginx:1.25
Step 03.05 - Scoping Pod Affinity to Specific Namespaces¶
# Pod affinity targeting pods in a specific namespace
apiVersion: v1
kind: Pod
metadata:
name: cross-namespace-affinity-pod
spec:
affinity:
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app: database
topologyKey: kubernetes.io/hostname
namespaces: # Look for matching pods in these namespaces
- data-plane
- production
# If 'namespaces' is omitted, only the Pod's own namespace is searched
# To search ALL namespaces, use 'namespaceSelector: {}' - an empty 'namespaces' list does NOT match all
containers:
- name: app
image: nginx:1.25
Part 04 - Pod Anti-Affinity (Spread Pods Apart)¶
- Pod Anti-Affinity is the opposite of Pod Affinity - it ensures Pods are placed away from other Pods.
- Its primary use is high availability: spreading replicas across Nodes, zones, or regions so a single failure doesn’t take down your entire service.
- Anti-Affinity rules live under `spec.affinity.podAntiAffinity`.
Step 04.01 - Required Anti-Affinity (One Pod per Node)¶
# deployment-one-per-node.yaml
# This deployment ensures no two replicas end up on the same node
apiVersion: apps/v1
kind: Deployment
metadata:
name: ha-nginx
spec:
replicas: 3
selector:
matchLabels:
app: ha-nginx
template:
metadata:
labels:
app: ha-nginx
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- ha-nginx # Avoid nodes that already have this app
topologyKey: kubernetes.io/hostname
containers:
- name: nginx
image: nginx:1.25
ports:
- containerPort: 80
kubectl apply -f deployment-one-per-node.yaml
kubectl get pods -o wide # Each pod should land on a different node
Required Anti-Affinity can leave Pods Pending
If you have `replicas: 5` but only 3 Nodes, 2 Pods will stay Pending because no Node is available that doesn’t already have a matching Pod. Use `preferredDuringSchedulingIgnoredDuringExecution` if this is a concern.
Step 04.02 - Preferred Anti-Affinity (Prefer Different Nodes)¶
# deployment-prefer-spread.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-spread
spec:
replicas: 5
selector:
matchLabels:
app: web-spread
template:
metadata:
labels:
app: web-spread
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: web-spread
topologyKey: kubernetes.io/hostname
containers:
- name: web
image: nginx:1.25
Step 04.03 - Zone-Level Anti-Affinity (HA Across Zones)¶
# First, verify your nodes have zone labels
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
ZONE:.metadata.labels."topology\.kubernetes\.io/zone"
# deployment-zone-ha.yaml
# Ensures each replica lands in a different availability zone
apiVersion: apps/v1
kind: Deployment
metadata:
name: zone-ha-app
spec:
replicas: 3
selector:
matchLabels:
app: zone-ha-app
template:
metadata:
labels:
app: zone-ha-app
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app: zone-ha-app
topologyKey: topology.kubernetes.io/zone # One per zone
containers:
- name: app
image: nginx:1.25
Step 04.04 - Combining Pod Affinity and Anti-Affinity¶
# This pod:
# - MUST be in the same zone as at least one 'backend' pod (affinity)
# - MUST NOT be on the same node as another 'frontend' pod (anti-affinity)
apiVersion: apps/v1
kind: Deployment
metadata:
name: frontend
spec:
replicas: 3
selector:
matchLabels:
app: frontend
template:
metadata:
labels:
app: frontend
spec:
affinity:
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app: backend
topologyKey: topology.kubernetes.io/zone
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app: frontend
topologyKey: kubernetes.io/hostname
containers:
- name: frontend
image: nginx:1.25
Part 05 - Taints and Tolerations¶
- Taints are applied to Nodes and repel Pods. Tolerations are applied to Pods and allow them to be scheduled onto tainted Nodes.
- Together they implement a whitelist model: a tainted Node only accepts Pods that explicitly opt in via tolerations.
- This is the opposite of Node Affinity (which is a pull mechanism from the Pod side).
flowchart LR
subgraph "Tainted Node"
N["node-1\nTaint: dedicated=gpu:NoSchedule"]
end
P1["Pod A\n(no toleration)"] --> X["❌ Cannot schedule\non node-1"]
P2["Pod B\ntoleration: dedicated=gpu:NoSchedule"] --> N
X -.-> OtherNode["Scheduled on\nother node"]

Taint Effects¶
| Effect | New Pods | Existing Pods |
|---|---|---|
| `NoSchedule` | Blocked (hard) | Not affected |
| `PreferNoSchedule` | Avoided (soft) | Not affected |
| `NoExecute` | Blocked (hard) | Evicted if no matching toleration |
Step 05.01 - Apply and Remove Taints¶
# Add NoSchedule taint
kubectl taint nodes node-1 dedicated=gpu:NoSchedule
# Add NoExecute taint (evicts non-tolerating existing pods immediately)
kubectl taint nodes node-2 maintenance=true:NoExecute
# Add PreferNoSchedule taint (soft - pods go elsewhere if possible)
kubectl taint nodes node-3 spot-instance=true:PreferNoSchedule
# View taints on all nodes
kubectl describe nodes | grep -A2 "Taints:"
# Remove a specific taint
kubectl taint nodes node-1 dedicated=gpu:NoSchedule-
# Remove all taints with a key (any value or effect)
kubectl taint nodes node-1 dedicated-
Step 05.02 - Pod Without Toleration¶
# pod-no-toleration.yaml
# This pod cannot be scheduled on node-1 which has dedicated=gpu:NoSchedule
apiVersion: v1
kind: Pod
metadata:
name: regular-pod
spec:
containers:
- name: app
image: nginx:1.25
kubectl apply -f pod-no-toleration.yaml
# If ALL nodes are tainted, the pod stays Pending
kubectl describe pod regular-pod | grep -A5 "Events:"
# Output: 0/N nodes are available: N node(s) had untolerated taint {dedicated: gpu}
Step 05.03 - Pod With Equal Toleration¶
# pod-equal-toleration.yaml
apiVersion: v1
kind: Pod
metadata:
name: gpu-app
spec:
tolerations:
- key: "dedicated"
operator: "Equal" # Must match both key AND value
value: "gpu"
effect: "NoSchedule"
containers:
- name: app
image: nginx:1.25
Step 05.04 - Pod With Exists Toleration¶
# pod-exists-toleration.yaml
# Tolerates ANY taint with key 'dedicated', regardless of value
apiVersion: v1
kind: Pod
metadata:
name: flexible-pod
spec:
tolerations:
- key: "dedicated"
operator: "Exists" # Matches any value for this key
effect: "NoSchedule"
containers:
- name: app
image: nginx:1.25
Step 05.05 - Tolerate All Taints (Wildcard)¶
# This pod tolerates ALL taints on ALL nodes
# WARNING: Only use for cluster-system pods (DaemonSets, CNI, etc.)
apiVersion: v1
kind: Pod
metadata:
name: omnipotent-pod
spec:
tolerations:
- operator: "Exists" # No key, no effect = match anything
containers:
- name: app
image: nginx:1.25
Tolerate-All is dangerous
Toleration operator: Exists with no key matches ALL taints. This is used by DaemonSets that must run everywhere (kube-proxy, CNI plugins, log agents). Never use it in application workloads.
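For reference, this is how the pattern typically looks in a DaemonSet - a sketch (the `node-agent` name and busybox image are illustrative placeholders for a real node agent):

```yaml
# daemonset-tolerate-all.yaml (illustrative)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      tolerations:
        - operator: "Exists"   # tolerate every taint so the agent runs on ALL nodes
      containers:
        - name: agent
          image: busybox
          command: ["sleep", "infinity"]
```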
Step 05.06 - NoExecute Effect and tolerationSeconds¶
# pod-maintenance-toleration.yaml
# This pod has 10 minutes to finish before being evicted during maintenance
apiVersion: v1
kind: Pod
metadata:
name: long-running-job
spec:
tolerations:
- key: "maintenance"
operator: "Equal"
value: "true"
effect: "NoExecute"
tolerationSeconds: 600 # Stay up to 10 minutes after the taint is applied
containers:
- name: job
image: busybox
command: ["sleep", "infinity"]
# Simulate Node maintenance - taint the node
kubectl taint nodes node-1 maintenance=true:NoExecute
# The pod will continue running for up to 600 seconds, then be evicted
kubectl get pods -o wide --watch
Step 05.07 - Multiple Tolerations on a Single Pod¶
# pod-multi-toleration.yaml
# This pod can run on nodes with multiple specific taints
apiVersion: v1
kind: Pod
metadata:
name: multi-tolerant-pod
spec:
tolerations:
- key: "dedicated"
operator: "Equal"
value: "gpu"
effect: "NoSchedule"
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
- key: "maintenance"
operator: "Equal"
value: "true"
effect: "NoExecute"
tolerationSeconds: 60
containers:
- name: gpu-app
image: nginx:1.25
Step 05.08 - Tolerating Multiple Effects for the Same Key¶
# Tolerate both NoSchedule and NoExecute for the same key
apiVersion: v1
kind: Pod
metadata:
name: dual-effect-pod
spec:
tolerations:
- key: "node-role"
operator: "Equal"
value: "edge"
effect: "NoSchedule"
- key: "node-role"
operator: "Equal"
value: "edge"
effect: "NoExecute"
# Shortcut: omit 'effect' to match ALL effects for this key/value pair
# - key: "node-role"
# operator: "Equal"
# value: "edge"
containers:
- name: app
image: nginx:1.25
Step 05.09 - Built-in Kubernetes Taints¶
Kubernetes automatically adds these taints to Nodes in various states:
| Taint | Effect | Added when |
|---|---|---|
| `node.kubernetes.io/not-ready` | NoExecute | Node Ready condition is False |
| `node.kubernetes.io/unreachable` | NoExecute | Node Ready condition is Unknown |
| `node.kubernetes.io/memory-pressure` | NoSchedule | Node has memory pressure |
| `node.kubernetes.io/disk-pressure` | NoSchedule | Node has disk pressure |
| `node.kubernetes.io/pid-pressure` | NoSchedule | Node has PID pressure |
| `node.kubernetes.io/unschedulable` | NoSchedule | Node is cordoned |
| `node.kubernetes.io/network-unavailable` | NoSchedule | Node network not configured |
| `node.cloudprovider.kubernetes.io/uninitialized` | NoSchedule | Cloud provider not finished initializing |
# Application pods should tolerate node.kubernetes.io/not-ready briefly
# to avoid unnecessary rescheduling during transient node issues
apiVersion: apps/v1
kind: Deployment
metadata:
name: resilient-app
spec:
replicas: 3
selector:
matchLabels:
app: resilient-app
template:
metadata:
labels:
app: resilient-app
spec:
tolerations:
# Kubernetes default tolerations - pods wait 5 minutes before rescheduling
- key: "node.kubernetes.io/not-ready"
operator: "Exists"
effect: "NoExecute"
tolerationSeconds: 300
- key: "node.kubernetes.io/unreachable"
operator: "Exists"
effect: "NoExecute"
tolerationSeconds: 300
containers:
- name: app
image: nginx:1.25
Step 05.10 - Taint-based Node Isolation for Control Plane¶
# Control-plane nodes are automatically tainted:
kubectl describe node <control-plane> | grep Taints
# Taints: node-role.kubernetes.io/control-plane:NoSchedule
# Allow a pod to run on control-plane nodes (e.g., for monitoring)
apiVersion: v1
kind: Pod
metadata:
name: control-plane-monitor
spec:
tolerations:
- key: "node-role.kubernetes.io/control-plane"
operator: "Exists"
effect: "NoSchedule"
nodeSelector:
node-role.kubernetes.io/control-plane: ""
containers:
- name: monitor
image: busybox
command: ["sleep", "infinity"]
Part 06 - Combining Node Affinity, Taints, and Tolerations¶
- Using Node Affinity alone means Pods from other teams can also go to those nodes.
- Using Taints alone ensures no uninvited Pods land there, but your own Pods still won’t be attracted to the nodes.
- The pattern is: Taint the node (repel others) + use Affinity (attract your Pods).
flowchart TD
subgraph "Dedicated GPU Node Pool"
N["gpu-node-1\nTaint: dedicated=gpu:NoSchedule\nLabel: hardware=gpu"]
end
P1["Random Pod (no toleration)"] --> X["❌ Blocked by Taint"]
P2["GPU Workload\n(toleration + node affinity)"] --> N
style N fill:#2d6a2d,color:#fff

Step 06.01 - Dedicated Node Pool (Taint + Node Affinity)¶
# Setup dedicated GPU nodes
kubectl label nodes gpu-node-1 hardware=gpu accelerator=nvidia
kubectl label nodes gpu-node-2 hardware=gpu accelerator=nvidia
# Taint them to repel non-GPU workloads
kubectl taint nodes gpu-node-1 dedicated=gpu:NoSchedule
kubectl taint nodes gpu-node-2 dedicated=gpu:NoSchedule
# gpu-workload.yaml
apiVersion: v1
kind: Pod
metadata:
name: ml-training-job
spec:
tolerations:
- key: "dedicated"
operator: "Equal"
value: "gpu"
effect: "NoSchedule"
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: hardware
operator: In
values: [gpu]
- key: accelerator
operator: In
values: [nvidia]
containers:
- name: train
image: nvidia/cuda:12.0-base
resources:
limits:
nvidia.com/gpu: "1"
Step 06.02 - Multi-Tenant Cluster Team Isolation¶
# Allocate nodes to team-alpha
kubectl label nodes node-1 node-2 team=alpha
kubectl taint nodes node-1 node-2 team=alpha:NoSchedule
# Allocate nodes to team-beta
kubectl label nodes node-3 node-4 team=beta
kubectl taint nodes node-3 node-4 team=beta:NoSchedule
# team-alpha deployment - only lands on alpha nodes
apiVersion: apps/v1
kind: Deployment
metadata:
name: alpha-service
namespace: team-alpha
spec:
replicas: 2
selector:
matchLabels:
app: alpha-service
template:
metadata:
labels:
app: alpha-service
spec:
tolerations:
- key: "team"
operator: "Equal"
value: "alpha"
effect: "NoSchedule"
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: team
operator: In
values: [alpha]
containers:
- name: service
image: nginx:1.25
Step 06.03 - Spot Instance Nodes with Graceful Fallback¶
# Label and taint spot instances
kubectl label nodes spot-1 spot-2 node-lifecycle=spot
kubectl taint nodes spot-1 spot-2 spot-instance=true:PreferNoSchedule
# batch-job.yaml - prefer spot, but fall back to on-demand
apiVersion: batch/v1
kind: Job
metadata:
name: batch-data-processor
spec:
template:
spec:
tolerations:
- key: "spot-instance"
operator: "Equal"
value: "true"
effect: "PreferNoSchedule"
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: node-lifecycle
operator: In
values: [spot]
restartPolicy: OnFailure
containers:
- name: processor
image: busybox
command: ["sh", "-c", "echo processing && sleep 60"]
# critical-service.yaml - explicitly AVOID spot instances
apiVersion: apps/v1
kind: Deployment
metadata:
name: critical-api
spec:
replicas: 3
selector:
matchLabels:
app: critical-api
template:
metadata:
labels:
app: critical-api
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node-lifecycle
operator: NotIn
values: [spot, preemptible]
containers:
- name: api
image: nginx:1.25
Part 07 - Topology Spread Constraints¶
- Topology Spread Constraints give you explicit control over how Pods are distributed across topology domains.
- Unlike Anti-Affinity (which blocks placement), Topology Spread Constraints let you specify the maximum imbalance allowed.
- They are more predictable and flexible than Anti-Affinity for distribution scenarios.
Key Fields¶
| Field | Description |
|---|---|
| maxSkew | Maximum difference in Pod count between the most and least loaded domains. maxSkew: 1 means no domain may have more than 1 extra Pod |
| topologyKey | The node label that defines the topology domains (e.g., zone, hostname) |
| whenUnsatisfiable | DoNotSchedule (hard) or ScheduleAnyway (soft - best effort) |
| labelSelector | Which Pods to count when computing skew |
| minDomains | Minimum number of topology domains that must exist before the constraint is considered satisfiable |
| nodeAffinityPolicy | Whether to honor the Pod's nodeAffinity/nodeSelector when computing skew: Honor (default) or Ignore |
| nodeTaintsPolicy | Whether to honor node taints when computing skew: Honor or Ignore (default) |
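The two node-inclusion policy fields from the table are not exercised in the steps below; a minimal sketch of how they would appear on a constraint (the `app` label value is illustrative, and both fields require a recent Kubernetes version - they entered beta in v1.26):

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    nodeAffinityPolicy: Honor  # exclude nodes filtered out by this Pod's nodeAffinity/nodeSelector (default)
    nodeTaintsPolicy: Honor    # also exclude tainted nodes this Pod does not tolerate (default is Ignore)
    labelSelector:
      matchLabels:
        app: example-app       # illustrative label
```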
Step 07.01 - Basic Zone Spreading¶
# deployment-zone-spread.yaml
# Spread pods evenly across zones, allowing max 1 extra pod in any zone
apiVersion: apps/v1
kind: Deployment
metadata:
name: zone-spread-app
spec:
replicas: 6
selector:
matchLabels:
app: zone-spread-app
template:
metadata:
labels:
app: zone-spread-app
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule # Hard constraint
labelSelector:
matchLabels:
app: zone-spread-app
containers:
- name: app
image: nginx:1.25
Step 07.02 - Node-Level Spreading¶
# deployment-node-spread.yaml
# Spread pods evenly across nodes (max 1 extra pod per node)
apiVersion: apps/v1
kind: Deployment
metadata:
name: node-spread-app
spec:
replicas: 9
selector:
matchLabels:
app: node-spread-app
template:
metadata:
labels:
app: node-spread-app
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname # Each node is a domain
whenUnsatisfiable: ScheduleAnyway # Soft - still schedules if can't spread
labelSelector:
matchLabels:
app: node-spread-app
containers:
- name: app
image: nginx:1.25
Step 07.03 - Multi-Level Constraints (Zone AND Node)¶
# deployment-multi-spread.yaml
# Spread evenly across zones first, then evenly across nodes within zones
apiVersion: apps/v1
kind: Deployment
metadata:
name: multi-level-spread
spec:
replicas: 12
selector:
matchLabels:
app: multi-level-spread
template:
metadata:
labels:
app: multi-level-spread
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone # Zone-level spread
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: multi-level-spread
- maxSkew: 1
topologyKey: kubernetes.io/hostname # Node-level spread
whenUnsatisfiable: ScheduleAnyway # Best effort for nodes
labelSelector:
matchLabels:
app: multi-level-spread
containers:
- name: app
image: nginx:1.25
Step 07.04 - minDomains (Ensure Minimum Domain Count)¶
# deployment-min-domains.yaml
# Only schedule if at least 3 zones are available
apiVersion: apps/v1
kind: Deployment
metadata:
name: min-domain-app
spec:
replicas: 6
selector:
matchLabels:
app: min-domain-app
template:
metadata:
labels:
app: min-domain-app
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
minDomains: 3 # Must have at least 3 zones available
labelSelector:
matchLabels:
app: min-domain-app
containers:
- name: app
image: nginx:1.25
Step 07.05 - Topology Spread Constraints vs Anti-Affinity¶
| Aspect | Anti-Affinity (required) | Topology Spread Constraint |
|---|---|---|
| Semantics | One pod per domain (binary) | Balance up to maxSkew |
| Flexibility | Rigid - pod Pending if >1 per domain | Flexible - maxSkew allows limited stacking |
| Multiple replicas per domain | Not possible (required) | Yes, balanced by maxSkew |
| Partial failure handling | Pod stays pending | ScheduleAnyway for soft behavior |
| Multiple topology levels | Requires chaining | Native multi-constraint support |
Part 08 - Real-World Scenarios¶
Scenario 08.01 - Database + Stateful Service Co-Location¶
# Deploy a Redis cache that MUST be co-located with application pods
# Application pods get ~1ms latency to Redis when on the same node
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: redis-local
spec:
selector:
matchLabels:
app: redis-local
template:
metadata:
labels:
app: redis-local
tier: cache
spec:
tolerations:
- operator: "Exists" # Run on all nodes including tainted ones
containers:
- name: redis
image: redis:7-alpine
ports:
- containerPort: 6379
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-app
spec:
replicas: 6
selector:
matchLabels:
app: web-app
template:
metadata:
labels:
app: web-app
spec:
affinity:
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app: redis-local # MUST be on same node as Redis
topologyKey: kubernetes.io/hostname
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: web-app # Prefer different nodes for web app replicas
topologyKey: kubernetes.io/hostname
containers:
- name: web
image: nginx:1.25
Scenario 08.02 - High-Availability Web Service (3 Zones)¶
# ha-web-service.yaml
# Web: must spread across 3 zones, one replica per zone minimum
apiVersion: apps/v1
kind: Deployment
metadata:
name: ha-web
spec:
replicas: 6
selector:
matchLabels:
app: ha-web
template:
metadata:
labels:
app: ha-web
tier: web
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: ha-web
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node-type
operator: In
values: [web, general]
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
tier: web
topologyKey: kubernetes.io/hostname
containers:
- name: web
image: nginx:1.25
Scenario 08.03 - Node Maintenance (Drain Workflow)¶
# 1. Cordon the node (adds node.kubernetes.io/unschedulable taint)
kubectl cordon node-1
# 2. Drain the node (evicts all pods gracefully)
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data --grace-period=60
# 3. Perform maintenance...
# 4. Uncordon the node to allow scheduling again
kubectl uncordon node-1
# Check scheduling status
kubectl get nodes
# STATUS: Ready (not SchedulingDisabled) = uncordoned
Scenario 08.04 - Burstable Workloads on Spot Instances¶
# production-baseline.yaml - always on on-demand nodes
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-baseline
spec:
replicas: 2
selector:
matchLabels:
app: api
tier: baseline
template:
metadata:
labels:
app: api
tier: baseline
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node-lifecycle
operator: In
values: [on-demand]
containers:
- name: api
image: nginx:1.25
---
# burst-capacity.yaml - scale-out pods go to cheaper spot instances
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-burst
spec:
replicas: 0 # HPA will scale this up during bursts
selector:
matchLabels:
app: api
tier: burst
template:
metadata:
labels:
app: api
tier: burst
spec:
tolerations:
- key: "spot-instance"
operator: "Exists"
effect: "NoSchedule"
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: node-lifecycle
operator: In
values: [spot]
containers:
- name: api
image: nginx:1.25
Scenario 08.05 - Dedicated Nodes for Monitoring Stack¶
# Create a dedicated monitoring node pool
kubectl label nodes monitoring-1 monitoring-2 role=monitoring
kubectl taint nodes monitoring-1 monitoring-2 role=monitoring:NoSchedule
# prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
namespace: monitoring
spec:
replicas: 2
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
tolerations:
- key: "role"
operator: "Equal"
value: "monitoring"
effect: "NoSchedule"
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: role
operator: In
values: [monitoring]
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app: prometheus
topologyKey: kubernetes.io/hostname
containers:
- name: prometheus
image: prom/prometheus:latest
ports:
- containerPort: 9090
Part 09 - DaemonSets and System Pods¶
- DaemonSets run one Pod per Node.
- System DaemonSets (CNI, kube-proxy, log agents) need to run even on tainted Nodes.
- They use a wildcard toleration or specific tolerations for known system taints.
Step 09.01 - DaemonSet on All Nodes Including Tainted¶
# node-log-agent.yaml
# Log agent that runs on ALL nodes, including control-plane and tainted nodes
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: log-agent
namespace: kube-system
spec:
selector:
matchLabels:
app: log-agent
template:
metadata:
labels:
app: log-agent
spec:
tolerations:
# Standard system node taints
- key: node-role.kubernetes.io/control-plane
operator: Exists
effect: NoSchedule
- key: node-role.kubernetes.io/master
operator: Exists
effect: NoSchedule
# Node lifecycle taints
- key: node.kubernetes.io/not-ready
operator: Exists
effect: NoExecute
- key: node.kubernetes.io/unreachable
operator: Exists
effect: NoExecute
- key: node.kubernetes.io/disk-pressure
operator: Exists
effect: NoSchedule
- key: node.kubernetes.io/memory-pressure
operator: Exists
effect: NoSchedule
# Custom workload taints - agents still need to run on GPU/special nodes
- key: dedicated
operator: Exists
effect: NoSchedule
containers:
- name: log-agent
image: fluent/fluent-bit:latest
Step 09.02 - DaemonSet on a Subset of Nodes (Node Affinity)¶
# gpu-node-daemonset.yaml
# Driver installer that runs only on GPU nodes
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: nvidia-driver-installer
namespace: gpu-system
spec:
selector:
matchLabels:
app: nvidia-driver-installer
template:
metadata:
labels:
app: nvidia-driver-installer
spec:
tolerations:
- key: "dedicated"
operator: "Equal"
value: "gpu"
effect: "NoSchedule"
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: hardware
operator: In
values: [gpu]
containers:
- name: installer
image: nvidia/gpu-operator:latest
Part 10 - Observability and Debugging Scheduling Issues¶
Step 10.01 - Check Why a Pod is Pending¶
# Describe the pod for scheduling events
kubectl describe pod <pod-name>
# Look for the "Events" section - FailedScheduling explains why
# Common messages:
# "0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector"
# "0/3 nodes are available: 3 node(s) had untolerated taint {dedicated: gpu}"
# "0/3 nodes are available: 3 node(s) didn't match pod anti-affinity rules"
# "0/3 nodes are available: 1 Insufficient cpu, 2 Insufficient memory"
# Get all scheduler events
kubectl get events -n <namespace> --field-selector reason=FailedScheduling
# Sort by time
kubectl get events --sort-by='.lastTimestamp' | grep Failed
# Watch scheduling events in real time
kubectl get events -w | grep -E "FailedScheduling|Scheduled"
Step 10.02 - Verify Node Labels Match Affinity Rules¶
# Check if a node has the required label
kubectl get node node-1 -o jsonpath='{.metadata.labels}' | jq
# Check with a specific label key
kubectl get nodes -l environment=production
# Show which nodes match a specific label selector
kubectl get nodes --selector='environment in (production,staging)'
Step 10.03 - Verify Taints on Nodes¶
# Show taints on all nodes in a table
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
TAINTS:.spec.taints
# Check if a pod's tolerations match node taints
kubectl get pod <pod-name> -o jsonpath='{.spec.tolerations}' | jq
kubectl get node <node-name> -o jsonpath='{.spec.taints}' | jq
Step 10.04 - Simulate Scheduling (Dry Run)¶
# Try applying a pod definition to see if it would schedule
kubectl apply -f pod.yaml --dry-run=server
# Use kubectl describe to see what nodes are eligible after a failed schedule
kubectl describe pod <pending-pod> | grep -A 20 "Events:"
Part 11 - Complete Scenario: Production-Grade Multi-Tier Application¶
This scenario combines everything to deploy a 3-tier application (frontend, backend, database) with:
- Zone-spread frontend pods
- Backend co-located in the same zones as frontend
- Database on dedicated storage nodes
- Monitoring agents on every node
- Spot-instance nodes for batch jobs
# ===== Setup: Label and taint cluster nodes =====
# Zone assignment (cloud providers set these automatically)
kubectl label nodes node-1 node-2 topology.kubernetes.io/zone=us-east-1a
kubectl label nodes node-3 node-4 topology.kubernetes.io/zone=us-east-1b
kubectl label nodes node-5 node-6 topology.kubernetes.io/zone=us-east-1c
# Node types
kubectl label nodes node-1 node-3 node-5 node-type=web
kubectl label nodes node-2 node-4 node-6 node-type=storage disk-type=ssd
# Storage nodes are dedicated
kubectl taint nodes node-2 node-4 node-6 dedicated=storage:NoSchedule
# frontend-deployment.yaml - spread across zones, one per node
apiVersion: apps/v1
kind: Deployment
metadata:
name: frontend
spec:
replicas: 3
selector:
matchLabels:
app: frontend
tier: web
template:
metadata:
labels:
app: frontend
tier: web
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: frontend
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node-type
operator: In
values: [web]
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app: frontend
topologyKey: kubernetes.io/hostname
containers:
- name: frontend
image: nginx:1.25
ports:
- containerPort: 80
# backend-deployment.yaml - co-located in same zones as frontend
apiVersion: apps/v1
kind: Deployment
metadata:
name: backend
spec:
replicas: 3
selector:
matchLabels:
app: backend
tier: api
template:
metadata:
labels:
app: backend
tier: api
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node-type
operator: In
values: [web]
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
tier: web # Be in same zone as frontend
topologyKey: topology.kubernetes.io/zone
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: backend
topologyKey: kubernetes.io/hostname
containers:
- name: backend
image: nginx:1.25
ports:
- containerPort: 8080
# database-statefulset.yaml - dedicated storage nodes
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: database
spec:
serviceName: database
replicas: 3
selector:
matchLabels:
app: database
tier: data
template:
metadata:
labels:
app: database
tier: data
spec:
tolerations:
- key: "dedicated"
operator: "Equal"
value: "storage"
effect: "NoSchedule"
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node-type
operator: In
values: [storage]
- key: disk-type
operator: In
values: [ssd]
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app: database
topologyKey: topology.kubernetes.io/zone # One DB replica per zone
containers:
- name: db
image: postgres:16
env:
- name: POSTGRES_PASSWORD
value: "changeme"
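After applying the three manifests, placement can be spot-checked from the command line (the `tier` label values follow the manifests above):

```shell
# Show node assignment for all three tiers side by side
kubectl get pods -l 'tier in (web, api, data)' -o wide
# Database pods should list only the tainted storage nodes
kubectl get pods -l tier=data -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName
```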
Exercises¶
Exercise 1: Schedule a Pod only on nodes with ≥ 16 CPU cores labeled as cpu-cores
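**Solution:** one possible approach, assuming the nodes carry an integer-valued `cpu-cores` label (the `Gt` operator compares label values as integers and is strictly greater-than, so `15` means ≥ 16):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cpu-heavy-pod          # illustrative name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: cpu-cores
                operator: Gt   # numeric comparison against the node's label value
                values: ["15"] # strictly greater than 15, i.e. >= 16 cores
  containers:
    - name: app
      image: nginx:1.25
```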
Exercise 2: Deploy 4 replicas of an app with HARD anti-affinity across nodes, then observe Pending pods
**Solution:**
# First, deploy with 4 replicas - if the cluster has fewer than 4 nodes, some pods will stay Pending
apiVersion: apps/v1
kind: Deployment
metadata:
name: anti-affinity-test
spec:
replicas: 4
selector:
matchLabels:
app: anti-affinity-test
template:
metadata:
labels:
app: anti-affinity-test
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app: anti-affinity-test
topologyKey: kubernetes.io/hostname
containers:
- name: app
image: nginx:1.25
Exercise 3: Taint a node with NoExecute and observe pod eviction, then add a toleration with tolerationSeconds: 120
**Solution:**
# Deploy some pods across the nodes (kubectl run no longer supports --replicas, so use a Deployment)
kubectl create deployment test-evict --image=nginx --replicas=3
# Find which node one of the pods is on
kubectl get pods -o wide
# Taint that node with NoExecute - all pods without toleration will be evicted
kubectl taint nodes <node-name> eviction-test=true:NoExecute
# Watch the pods being evicted and rescheduled
kubectl get pods -o wide --watch
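For the second part of the exercise, a sketch of a pod that tolerates the taint for a bounded time (the pod name is illustrative) - it keeps running on the tainted node for 120 seconds after the taint is applied, then is evicted:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: timed-toleration-pod   # illustrative name
spec:
  tolerations:
    - key: "eviction-test"
      operator: "Equal"
      value: "true"
      effect: "NoExecute"
      tolerationSeconds: 120   # evicted 120s after the taint is applied
  containers:
    - name: app
      image: nginx:1.25
```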
Exercise 4: Deploy 9 pods spread evenly across 3 zones with maxSkew=1, then scale to 12 pods
**Solution:**
apiVersion: apps/v1
kind: Deployment
metadata:
name: spread-exercise
spec:
replicas: 9
selector:
matchLabels:
app: spread-exercise
template:
metadata:
labels:
app: spread-exercise
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: spread-exercise
containers:
- name: app
image: nginx:1.25
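The second half of the exercise - scaling from 9 to 12 pods - can be observed as follows (assuming three zones, the scheduler should place one extra pod per zone, ending at 4/4/4):

```shell
kubectl scale deployment spread-exercise --replicas=12
# Watch the distribution; with maxSkew=1 no zone may hold more than one extra pod
kubectl get pods -l app=spread-exercise -o wide --watch
```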
Exercise 5: Create a "data locality pattern" - Redis DaemonSet + application pods that MUST be on the same node as Redis
**Solution:**
# Step 1: DaemonSet ensures Redis runs on every node
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: redis-local
spec:
selector:
matchLabels:
type: redis-local
template:
metadata:
labels:
type: redis-local
spec:
containers:
- name: redis
image: redis:7-alpine
ports:
- containerPort: 6379
---
# Step 2: App pods MUST be co-located with Redis (same node)
apiVersion: apps/v1
kind: Deployment
metadata:
name: latency-sensitive-app
spec:
replicas: 4
selector:
matchLabels:
app: latency-sensitive-app
template:
metadata:
labels:
app: latency-sensitive-app
spec:
affinity:
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
type: redis-local # Must be on same node as Redis DaemonSet pod
topologyKey: kubernetes.io/hostname
containers:
- name: app
image: nginx:1.25
env:
- name: REDIS_HOST
value: "localhost" # Redis is on the same node
Cleanup¶
# Remove all test pods
kubectl delete pod --all -n default
# Remove test deployments
kubectl delete deployment ha-nginx web-spread zone-ha-app --ignore-not-found
kubectl delete deployment gpu-app multi-level-spread spread-exercise --ignore-not-found
kubectl delete deployment frontend backend ha-web critical-api api-baseline --ignore-not-found
kubectl delete statefulset database --ignore-not-found
kubectl delete daemonset redis-local log-agent --ignore-not-found
# Remove labels from nodes (replace node-1, node-2 with your actual node names)
kubectl label nodes --all environment- zone- disk-type- hardware- --overwrite 2>/dev/null || true
kubectl label nodes --all accelerator- team- node-lifecycle- role- --overwrite 2>/dev/null || true
kubectl label nodes --all node-type- memory-gb- cpu-cores- --overwrite 2>/dev/null || true
# Remove taints from nodes
kubectl taint nodes --all dedicated- 2>/dev/null || true
kubectl taint nodes --all maintenance- 2>/dev/null || true
kubectl taint nodes --all spot-instance- 2>/dev/null || true
kubectl taint nodes --all team- role- eviction-test- 2>/dev/null || true
# Verify nodes are clean
kubectl describe nodes | grep -A3 "Taints:"
kubectl get nodes --show-labels
Summary¶
In this lab you learned:
| Topic | Covered |
|---|---|
nodeSelector | Simple exact-match node label filtering |
| Node Affinity | Rich operators (In, NotIn, Exists, DoesNotExist, Gt, Lt), required + preferred, weight-based scoring, AND/OR logic |
| Pod Affinity | Co-locate Pods on same node or same topology domain, namespace scoping |
| Pod Anti-Affinity | Spread Pods across nodes and zones, hard + soft variants |
| Taints | Node-side repulsion with NoSchedule, PreferNoSchedule, NoExecute effects |
| Tolerations | Pod-side opt-in with Equal/Exists operators, tolerationSeconds |
| Built-in Taints | System lifecycle taints (not-ready, unreachable, disk-pressure, etc.) |
| Topology Spread | Balance Pod counts across topology domains with maxSkew, minDomains |
| Patterns | GPU pools, zone-HA, spot bursting, multi-tenant isolation, data locality |
| Debugging | FailedScheduling events, label/taint inspection, dry-run |
Decision Guide¶
flowchart TD
Q1{"Do you need to\ncontrol which NODES\nPods go to?"}
Q1 -- Yes --> Q2{"Simple exact\nlabel match?"}
Q2 -- Yes --> A1["Use nodeSelector"]
Q2 -- No --> A2["Use Node Affinity\n(required or preferred)"]
Q1 -- No --> Q3{"Do you need to\nplace Pods relative\nto OTHER Pods?"}
Q3 -- "Near each other" --> A3["Use Pod Affinity"]
Q3 -- "Away from each other" --> A4["Use Pod Anti-Affinity"]
Q3 -- "Balanced spread" --> A5["Use Topology Spread\nConstraints"]
Q3 -- No --> Q4{"Do you need to\nREPEL pods from\na node?"}
Q4 -- Yes --> A6["Use Taints on Node\n+ Tolerations on Pod"]
Q4 -- No --> A7["Default scheduler\nbehavior is fine"]
style A1 fill:#1a73e8,color:#fff
style A2 fill:#1a73e8,color:#fff
style A3 fill:#34a853,color:#fff
style A4 fill:#34a853,color:#fff
style A5 fill:#fbbc04,color:#000
style A6 fill:#ea4335,color:#fff
Troubleshooting¶
- Pod stuck in Pending:
Check the pod events for scheduling failures - the event message tells you exactly which constraint was violated:
kubectl describe pod <pod-name> | grep -A10 "Events:"
# Common messages:
# "0/N nodes are available: N node(s) didn't match Pod's node affinity/selector"
# "0/N nodes are available: N node(s) had untolerated taint {key: value}"
# "0/N nodes are available: N node(s) didn't match pod anti-affinity rules"
- Anti-affinity blocking new pods:
If you have more replicas than nodes and use requiredDuringSchedulingIgnoredDuringExecution anti-affinity, the excess pods stay Pending. Switch to the preferredDuringSchedulingIgnoredDuringExecution variant so extra replicas can still be placed.
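A minimal sketch of the soft variant (the `app` label value is illustrative):

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: my-app      # illustrative label
          topologyKey: kubernetes.io/hostname
```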
- Taint applied but pods still running on the node:
NoSchedule only blocks new pods. Use NoExecute to evict existing pods:
kubectl describe nodes | grep -A3 "Taints:"
# Change NoSchedule to NoExecute if you want to evict running pods
kubectl taint nodes <node> key=value:NoExecute
- Labels not matching affinity rules:
Verify that node labels actually exist and match your affinity selectors:
kubectl get nodes --show-labels | grep <expected-label>
kubectl get node <node-name> -o jsonpath='{.metadata.labels}' | jq
- Topology spread constraints not balancing evenly:
Ensure your nodes carry the expected topology labels (topology.kubernetes.io/zone, kubernetes.io/hostname); nodes missing the topologyKey label are excluded from the skew calculation. Check with kubectl get nodes --show-labels.
Next Steps¶
- Explore Pod Priority and Preemption for workload prioritization.
- Learn about Descheduler for rebalancing pods after scheduling decisions.
- Try combining scheduling constraints with PodDisruptionBudgets (Lab 17) for production-grade high availability.
- Explore Custom Schedulers (Lab 19) for when built-in scheduling constraints aren’t enough.
- Practice scheduling tasks in the Kubernetes Scheduling Tasks section.