Network security in Azure Kubernetes Service (AKS) is rarely a one-and-done configuration. Most teams start with the default VNet integration, then discover gaps when a pod accidentally reaches the internet, or when a cluster upgrade breaks connectivity. This guide is for platform engineers and DevOps leads who already know the basics—they've deployed a cluster, maybe set up Azure Firewall—but need a structured checklist to harden their AKS networking without overcomplicating it. We'll walk through the critical decisions, common traps, and maintenance patterns that keep clusters secure over time.
Where AKS Network Security Gets Real
The moment your AKS cluster needs to talk to on-premises systems, or when multiple teams share a cluster, network security becomes a first-class concern. In a typical project, the networking team hands over a VNet with a /24 subnet for nodes, and maybe a /24 for pods if using Azure CNI. The first surprise: pod IPs consume VNet addresses even if they never leave the cluster. With Azure CNI, each pod gets a VNet IP, so a cluster with 500 pods needs 500 usable IPs in the pod subnet. Teams often underestimate this and run out of addresses during scaling events.
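The address math is easy to check before the subnet is handed over. The sketch below is a minimal illustration using Python's standard `ipaddress` module; it assumes only that Azure reserves five addresses in every subnet, and the CIDR ranges are hypothetical examples rather than values from any real cluster.

```python
import ipaddress

# Azure reserves 5 addresses in every subnet (network, gateway,
# two for Azure DNS, and broadcast).
AZURE_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Return how many addresses Azure can actually assign in a subnet."""
    network = ipaddress.ip_network(cidr, strict=True)
    return max(network.num_addresses - AZURE_RESERVED_PER_SUBNET, 0)


def fits(cidr: str, required_ips: int) -> bool:
    """True if the subnet has at least `required_ips` assignable addresses."""
    return usable_ips(cidr) >= required_ips


if __name__ == "__main__":
    # Hypothetical pod subnets checked against the 500-pod example.
    for cidr in ("10.240.1.0/24", "10.240.0.0/23", "10.240.0.0/22"):
        print(cidr, usable_ips(cidr), fits(cidr, 500))
```

A /24 yields 251 assignable addresses, so it cannot hold 500 pods. A /23 yields 507, which fits on paper but leaves almost no headroom for scaling events or upgrades.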
Another real-world scenario: a microservices team deploys a service that calls an external API. Without explicit egress controls, that traffic goes through the node's default route—often straight to the internet via the load balancer's outbound IP. If the API is malicious or the traffic is intercepted, the cluster is exposed. This is where the checklist mindset helps: verify egress path before production, not after an incident.
The core tension is between simplicity and control. Kubenet is simpler and conserves VNet IPs, but it depends on a route table (UDRs) attached to the node subnet to route pod traffic between nodes. Azure CNI offers more features (such as Azure network policies and Windows node pools) but consumes VNet IPs and adds complexity. Our checklist helps you decide which model fits your constraints and how to secure the chosen path.
Key Decision: CNI vs. Kubenet
For most production clusters, Azure CNI is the recommended choice because it supports network policies, advanced routing, and integration with Azure services. However, if your cluster is small or you're in a VNet with limited IP space, Kubenet can work with careful egress routing. The trade-off: Kubenet pods use a separate private range (10.244.0.0/16 by default) that is NATed to the node's IP, so traffic leaving a node carries the node's address rather than the pod's, and pod-level network policy with Kubenet is limited to Calico.
Egress Path Audit
Before going live, map every egress path from your cluster. Use Azure Firewall or a third-party NVA to inspect and restrict outbound traffic. Even if you don't need egress filtering now, set up the route table and firewall rules early—retrofitting later requires downtime and reconfiguration.
Foundations Readers Confuse
A common misconception is that AKS network security equals network policies. Network policies (Calico or Azure) primarily control pod-to-pod traffic within the cluster. Standard Kubernetes network policies can restrict egress by IP range, but they cannot filter by FQDN or inspect traffic, and they are no substitute for perimeter controls on traffic entering from outside. Another confusion: service endpoints and Private Link are often mixed up. Service endpoints extend your VNet identity to Azure services over the Microsoft backbone, but they do not make the service private: the service still has a public endpoint. Private Link, on the other hand, creates a private IP in your VNet that maps to the service, so public access can be disabled entirely.
Teams also confuse pod subnets with node subnets. In Azure CNI, pods get IPs from a dedicated subnet (or the same subnet as nodes, depending on configuration). If you place pods and nodes in the same subnet, you lose the ability to apply different NSG rules to pods vs. nodes. A better practice is to use separate subnets: one for nodes (with NSG allowing inbound traffic from the load balancer) and one for pods (with NSG only allowing intra-cluster traffic and specific egress).
NSG vs. Azure Firewall
Network Security Groups (NSGs) are sometimes assumed to be stateless, but they are stateful: return traffic for an allowed flow is permitted automatically. They operate at layer 3/4 and are applied at the subnet or NIC level. Azure Firewall is a layer 3-7 managed service with threat intelligence and application rules. Many teams start with NSGs, then hit a scenario where they need to block a specific domain (e.g., *.malware.com) and realize NSGs can't do that, because NSG rules match on IP addresses, ports, and protocols rather than hostnames. The checklist: use NSGs for basic perimeter security (e.g., block SSH from the internet), and Azure Firewall for granular outbound filtering and inspection.
Service Mesh vs. Network Policy
Some teams think a service mesh (like Istio or Linkerd) replaces network policies. It doesn't. A service mesh provides mTLS, traffic splitting, and observability at the application layer, but it still relies on underlying network policies for pod-level segmentation. Use both: network policies for baseline isolation, service mesh for advanced traffic management.
Patterns That Usually Work
After reviewing dozens of AKS deployments, a few patterns consistently deliver security without excessive complexity. The first is forced tunneling with Azure Firewall. In this setup, all egress traffic from the cluster (including pod traffic) routes through Azure Firewall via a route table on the cluster subnets whose 0.0.0.0/0 route has the firewall's private IP as its next hop. This ensures no traffic bypasses inspection. The firewall rules can then allow only specific FQDNs for container registries, OS updates, and external APIs.
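A quick way to confirm the default route is in place is to check an exported route list. The sketch below is illustrative only: it assumes each route is a dictionary with `addressPrefix`, `nextHopType`, and `nextHopIpAddress` keys (chosen to resemble the JSON shape Azure tooling emits for routes), and the firewall IP `10.0.1.4` is a hypothetical value.

```python
from typing import Iterable, Mapping, Optional

Route = Mapping[str, str]


def find_default_route(routes: Iterable[Route]) -> Optional[Route]:
    """Return the first route whose prefix is 0.0.0.0/0, if any."""
    for route in routes:
        if route.get("addressPrefix") == "0.0.0.0/0":
            return route
    return None


def egress_goes_through_firewall(routes: Iterable[Route], firewall_private_ip: str) -> bool:
    """True only if the default route points at the firewall's private IP."""
    default = find_default_route(routes)
    if default is None:
        return False
    return (
        default.get("nextHopType") == "VirtualAppliance"
        and default.get("nextHopIpAddress") == firewall_private_ip
    )


if __name__ == "__main__":
    # Hypothetical route table contents for illustration.
    routes = [
        {
            "name": "default-to-firewall",
            "addressPrefix": "0.0.0.0/0",
            "nextHopType": "VirtualAppliance",
            "nextHopIpAddress": "10.0.1.4",
        }
    ]
    print(egress_goes_through_firewall(routes, "10.0.1.4"))
```

Running a check like this against every subnet that hosts nodes or pods catches the common case where the route table exists but was never associated with one of the subnets.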
Another working pattern is using Private Link for critical Azure services (e.g., Azure Container Registry, Key Vault). Instead of exposing these services over the internet, you create private endpoints in the cluster's VNet. This reduces attack surface and keeps traffic within the Microsoft backbone. The catch: private endpoints cost money and require DNS configuration (private DNS zones).
For pod-to-pod communication, the recommended pattern is to use Azure Network Policy with a default-deny ingress policy and explicit allow rules for each service. This prevents lateral movement if a pod is compromised. For example, a web frontend should only allow ingress from the load balancer, and the backend API should only allow ingress from the frontend.
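As a rough sketch of the default-deny pattern, the snippet below builds two `networking.k8s.io/v1` NetworkPolicy manifests as Python dictionaries and prints them as JSON, which `kubectl apply -f -` accepts. The namespace `shop`, the `app: frontend` and `app: backend` labels, and port 8080 are assumptions made up for this example.

```python
import json


def default_deny_ingress(namespace: str) -> dict:
    """Policy selecting every pod in the namespace and allowing no ingress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        # An empty podSelector matches all pods; listing Ingress with no
        # ingress rules denies all inbound traffic to them.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
    }


def allow_ingress(namespace: str, name: str, target: dict, source: dict, port: int) -> dict:
    """Policy allowing TCP ingress to `target` pods from `source` pods only."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": target},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [{"podSelector": {"matchLabels": source}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical namespace, labels, and port.
    policies = [
        default_deny_ingress("shop"),
        allow_ingress("shop", "backend-from-frontend",
                      {"app": "backend"}, {"app": "frontend"}, 8080),
    ]
    for policy in policies:
        print(json.dumps(policy, indent=2))
```

Network policies are additive, so the allow rule opens one path on top of the deny-all baseline without weakening it for other pods.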
Checklist: Production-Ready AKS Networking
- Use Azure CNI for clusters with >50 pods or needing network policies.
- Create separate subnets for nodes, pods, and Azure Firewall (if used).
- Apply NSGs to node subnet: allow inbound from Azure Load Balancer (ports 80, 443, 8080 etc.) and deny all other inbound from internet.
- Set up a route table with 0.0.0.0/0 → Azure Firewall private IP for egress.
- Configure Azure Firewall application rules for required FQDNs (e.g., *.azurecr.io, *.microsoft.com, *.docker.io).
- Enable Azure Policy for AKS to enforce network policies and other security constraints.
- Use Private Link for ACR, Key Vault, and SQL Database if accessed from cluster.
- Enable diagnostic logs for Azure Firewall and NSGs to audit traffic.
Anti-Patterns and Why Teams Revert
One of the most common anti-patterns is deploying AKS with the default network configuration and assuming it's secure. The default AKS cluster (Kubenet, in many existing deployments) has no explicit egress rules, meaning pods can reach any internet destination through the cluster load balancer's outbound public IP. Teams often don't realize this until a security audit or a data breach. The fix, adding a firewall and route table, can cause connectivity issues if not planned carefully, so teams sometimes revert to the insecure default.
Another anti-pattern is over-using network policies without understanding the performance impact. Each network policy rule adds iptables entries on the node, and with hundreds of rules, pod startup latency increases. Some teams write policies for every microservice, resulting in thousands of rules. The better approach is to group services by security tier and apply policies at the namespace level.
Teams also misconfigure SNAT on Azure Firewall. By default, the firewall SNATs internet-bound traffic to one of its public IPs, while traffic to private (RFC 1918) destinations is not SNATed. If an external service allowlists source IPs, it needs the firewall's public IP, not the node or load balancer IPs the cluster used before the firewall was introduced. Many teams forget this and wonder why external services reject their traffic.
Why Teams Revert to Simpler Setups
The main reason teams revert is complexity. Forced tunneling with Azure Firewall requires careful routing: you must add a route for the firewall's subnet itself (to avoid a routing loop) and ensure that management traffic (e.g., from AKS control plane) bypasses the firewall. If the firewall goes down, the cluster loses egress entirely. Some teams find that the operational overhead outweighs the security benefits, especially for smaller clusters. In those cases, a simpler setup with NSGs and a NAT gateway might be sufficient.
Maintenance, Drift, and Long-Term Costs
Network security in AKS isn't a set-it-and-forget-it task. Over time, firewall rules accumulate stale entries, NSG rules become outdated, and pod IP ranges need expansion. Without regular audits, the network configuration drifts from the intended security posture. For example, a developer might add a temporary firewall rule to test an external API and forget to remove it. Months later, that rule becomes a security hole.
Cost is another long-term factor. Azure Firewall has a base cost plus data processing fees. For clusters with high egress traffic, the firewall can become a significant line item. Private Endpoints also incur monthly charges per endpoint. Teams should estimate these costs upfront and include them in the cluster budget. A cost-saving pattern is to use Azure Firewall only for outbound traffic and rely on NSGs for inbound, but even then, the firewall's base cost may be hard to justify for small clusters.
Regular maintenance tasks include: reviewing and pruning firewall rules quarterly, testing failover scenarios (e.g., what happens if the firewall is down?), and updating private DNS zones when services change. Use Azure Policy to enforce tagging and prevent creation of public IPs on nodes. Also, monitor network metrics like SNAT port utilization, since exhaustion causes intermittent outbound connection failures that are hard to diagnose.
Checklist: Ongoing Network Security
- Monthly review of Azure Firewall logs for denied traffic patterns.
- Quarterly audit of NSG rules and firewall rules; remove unused rules.
- Test egress path after any cluster upgrade or VNet change.
- Monitor SNAT port usage on nodes and firewall; scale up if needed.
- Update private DNS zones when services are added or removed.
- Use Azure Resource Graph to inventory all network resources and detect drift.
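The quarterly pruning step can be partly automated once rule names and their last log hit are exported. The sketch below is a minimal illustration under that assumption: the rule names, the `last_hit` mapping, and the 90-day threshold are all hypothetical, and extracting the data from firewall logs is left out.

```python
from datetime import date, timedelta
from typing import Iterable, Mapping, Optional


def stale_rules(
    rule_names: Iterable[str],
    last_hit: Mapping[str, Optional[date]],
    today: date,
    max_age_days: int = 90,
) -> list:
    """Return rules with no recorded hit, or none within `max_age_days`."""
    cutoff = today - timedelta(days=max_age_days)
    stale = []
    for name in rule_names:
        hit = last_hit.get(name)
        # A rule that never appears in the logs is a pruning candidate too.
        if hit is None or hit < cutoff:
            stale.append(name)
    return sorted(stale)


if __name__ == "__main__":
    # Hypothetical rule inventory and last-hit dates.
    rules = ["allow-acr", "allow-os-updates", "temp-test-external-api"]
    hits = {"allow-acr": date(2024, 6, 28), "allow-os-updates": date(2024, 6, 1)}
    print(stale_rules(rules, hits, today=date(2024, 6, 30)))
```

Treat the output as a review list rather than a delete list: some rules legitimately fire only during upgrades or disaster recovery tests.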
When Not to Use This Approach
The advanced network security patterns described here are not for everyone. If you are running a small development cluster with a handful of pods and no sensitive data, the overhead of Azure Firewall, Private Link, and complex network policies is unnecessary. A simple AKS cluster with Kubenet and NSGs restricting inbound traffic to your office IP is sufficient. Similarly, if your team lacks the operational maturity to manage firewall rules and route tables, starting with a simpler setup and gradually adding controls is better than failing at a complex one.
Another case: if your cluster is ephemeral (e.g., for CI/CD pipelines), you might not need persistent network security. Use a managed cluster with default settings and rely on its short lifetime to limit exposure. Also, if your workloads are all internal and never access the internet, you can skip workload egress filtering; just ensure no public IPs are attached to nodes, and remember that the nodes themselves still need outbound access to pull system images and reach the control plane.
Finally, if your organization already has a centralized network security team that manages a hub-spoke topology with a central firewall, you might not need a separate Azure Firewall per cluster. Instead, peer the AKS VNet to the hub and route traffic through the central firewall. This reduces duplication and operational burden.
Open Questions and FAQ
Can I use Azure Firewall with Kubenet?
Yes, but you need to create a route table for the node subnet with 0.0.0.0/0 pointing to the firewall's private IP. Since Kubenet pods are NATed to the node IP, all pod egress follows the node's routes, and the firewall sees traffic from the node's IP, not pod IPs. This is fine for most cases. If you need per-pod egress visibility, consider Azure CNI, where pods have VNet IPs, and verify in the firewall logs which source address actually appears, since that depends on how the cluster masquerades outbound traffic.
How do I handle overlapping CIDRs between clusters?
Overlapping pod CIDRs are a common problem when multiple clusters need to communicate. The solution is to use a global address space plan: assign each cluster a unique /16 or /14 range for pods and services. Use Azure CNI with a dedicated pod subnet per cluster. If clusters are in different VNets and need to communicate, use VNet peering or Azure Virtual WAN with non-overlapping address spaces.
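An address plan like this can be validated mechanically before any cluster is created. The sketch below uses Python's standard `ipaddress` module to report every overlapping pair; the cluster names and ranges are hypothetical.

```python
import ipaddress
from itertools import combinations


def find_overlaps(plan: dict) -> list:
    """Return (name_a, name_b) pairs whose CIDR ranges overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in plan.items()}
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


if __name__ == "__main__":
    # Hypothetical global address plan: one pod range per cluster.
    plan = {
        "cluster-a-pods": "10.16.0.0/14",
        "cluster-b-pods": "10.20.0.0/14",
        "cluster-c-pods": "10.18.0.0/16",  # falls inside cluster-a's /14
    }
    for pair in find_overlaps(plan):
        print("overlap:", pair)
```

Adding service CIDRs, on-premises ranges, and hub VNet ranges to the same plan catches conflicts that only show up once peering is enabled.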
What about IPv6?
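AKS offers dual-stack (IPv4 and IPv6) networking, but availability depends on the network plugin and has moved between preview and general availability over time, so check the current documentation for your configuration. For network security, IPv6 adds complexity because not every Azure network security feature treats IPv6 the same as IPv4. Most teams should stick with IPv4 until IPv6 becomes a requirement.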
Should I use Azure Policy or custom scripts for enforcement?
Azure Policy is the preferred method because it is built-in, auditable, and can prevent non-compliant resources from being created. For example, you can create a policy that denies creation of public IPs on nodes or requires specific tags. Custom scripts are brittle and require maintenance. Use Azure Policy for guardrails and scripts only for one-time migrations.
How do I test network policies?
Deploy a test namespace with a simple nginx pod and a client pod. Apply a network policy that denies all ingress to nginx, then verify the client cannot reach it. Gradually allow specific ports and verify. Also test egress by deploying a pod that tries to reach external URLs. Use tools like `curl` or `wget` inside the pod.
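Where the client image has Python rather than `curl`, the same check can be scripted. The sketch below is a minimal TCP probe meant to run inside the client pod; it only tests whether a connection opens, not whether HTTP succeeds, and the target names in the usage note are hypothetical.

```python
import socket
from typing import Callable, Dict, List, Tuple

Target = Tuple[str, int]


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_expectations(
    expectations: Dict[Target, bool],
    probe: Callable[[str, int], bool] = can_connect,
) -> List[Target]:
    """Return the targets whose reachability differs from what was expected."""
    return [
        (host, port)
        for (host, port), expected in expectations.items()
        if probe(host, port) != expected
    ]
```

For example, after applying the deny-all policy, calling `check_expectations({("nginx.netpol-test.svc.cluster.local", 80): False})` should return an empty list; a non-empty result means the policy is not being enforced. A blocked connection typically shows up as a timeout, so keep the timeout short when probing many targets.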
After you've validated your setup, the next step is to document the network architecture and share it with your team. Create a runbook for common issues like firewall rule changes or subnet exhaustion. Finally, set up automated alerts for network security events—for example, when Azure Firewall denies a high volume of traffic. Security is a continuous practice, not a one-time checklist.