Kubernetes NodePort Access Fails: Master Cannot Reach Workers, Solved by Correct Flannel Configuration for On-Premises Environments
In a typical on-premises Kubernetes setup, a NodePort service showed intermittent connectivity: access attempts with tools like cURL or Wget from machines outside the cluster sometimes succeeded and sometimes timed out. This article walks through the investigation, which ultimately identified misconfigurations in Flannel's networking (the wrong network interface and the wrong backend type) that were preventing Kubernetes nodes from communicating reliably with each other and with external clients.
Problem Description
Accessing a NodePort service from a local machine yields inconsistent results: requests sometimes time out completely and occasionally succeed. Requests from inside the cluster, for example against an nginx test pod, were just as unreliable, even though the same setup worked in other environments such as Azure. This points to a network misconfiguration specific to how Flannel handles this cluster's node-to-node traffic.
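As a rough reproduction of the symptom (the nginx deployment, the assigned NodePort 32001, and the node IP placeholder are assumptions carried over from the examples later in this article):

```bash
# Create a test deployment and expose it as a NodePort service
kubectl create deployment nginx --image=nginx
kubectl expose deployment nginx --port=80 --type=NodePort   # Kubernetes assigns a NodePort, e.g. 32001

# Repeated requests against a node IP either hang until the timeout or succeed at random
curl -m 5 http://<NODE_IP>:32001/
```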
Troubleshooting Steps and Findings
- Verification of log files: The logs of the flanneld pods, which run on every node and provide the inter-node pod network (including IP masquerading for traffic leaving the pod network), showed no obvious errors at first glance, so further investigation was needed to understand what was happening behind the scenes.
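A quick way to pull those logs, assuming Flannel was installed from the upstream kube-flannel.yml (which labels the pods app=flannel; older deployments may live in kube-system rather than kube-flannel):

```bash
# Locate the flanneld pods (one per node) and tail their recent logs
kubectl get pods -n kube-flannel -o wide
kubectl logs -n kube-flannel -l app=flannel --tail=100
```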
- Deep dive into Flannel's implementation: By default flanneld is set up with VXLAN as its backend type. In this non-cloud environment the VXLAN overlay was being built over the wrong interface, and host-gw can be a better fit in any case: instead of encapsulating traffic, it installs direct routes between nodes, which works well when all nodes sit on the same layer-2 network, as is typical on premises.
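To confirm which backend a running cluster is actually using, Flannel's net-conf.json can be read from its ConfigMap; kube-flannel-cfg is the upstream default name, and the namespace may be kube-flannel or kube-system depending on the manifest version:

```bash
# Print Flannel's network configuration, including the backend type
kubectl get configmap kube-flannel-cfg -n kube-flannel \
  -o jsonpath='{.data.net-conf\.json}'
```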
- Configuration adjustments: Two critical changes were made to the Flannel configuration, one in the flanneld startup arguments and one in the net-conf network settings (covered in the next step). The first change is to tell flanneld which network interface to use with --iface=enp0s8; this ensures traffic is sent over the interface on which the nodes can actually reach one another and is routed back to the correct Kubernetes node:
args: --ip-masq, --kube-subnet-mgr, --iface=enp0s8   # pass the interface name so flanneld binds to the correct NIC
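If Flannel is deployed from the upstream kube-flannel.yml, the flag goes on the kube-flannel container of the DaemonSet; a minimal sketch of the relevant excerpt (the image tag is illustrative, and enp0s8 is specific to this environment):

```yaml
# Excerpt from the kube-flannel DaemonSet pod spec (other fields omitted)
containers:
  - name: kube-flannel
    image: docker.io/flannel/flannel:v0.24.2   # illustrative image/tag
    args:
      - --ip-masq
      - --kube-subnet-mgr
      - --iface=enp0s8   # bind flanneld to the interface the nodes share
```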
- Switch backend type: Switching from VXLAN to host-gw means pod traffic is routed directly between node IPs rather than being encapsulated in a VXLAN overlay. This requires that all nodes share the same layer-2 network, which is typically the case on premises but not in cloud providers such as AWS or Azure, where an overlay or a cloud routing integration is needed instead. The backend type was updated in the net-conf file together with the cluster's pod network CIDR (10.244.0.0/24 in this setup); note that the valid Flannel backend name is host-gw, not hostgw.
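Assuming Flannel reads its network settings from the standard kube-flannel-cfg ConfigMap, the change looks roughly like this; the /24 pod CIDR is the one used in this article, while the upstream manifest ships 10.244.0.0/16 by default:

```yaml
# Sketch of the kube-flannel-cfg ConfigMap (cni-conf.json and labels omitted)
kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-flannel-cfg
  namespace: kube-flannel
data:
  net-conf.json: |
    {
      "Network": "10.244.0.0/24",
      "Backend": {
        "Type": "host-gw"
      }
    }
```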
- Final network verification: After these modifications were made and kubelet was restarted so that flanneld came back up with the new arguments, checking ip route on each node showed the expected routes to the pod subnets that Flannel allocated to the other nodes, indicating that the nodes could now reach each other's pod networks directly.
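With the host-gw backend each node should hold one plain route per peer node's pod subnet, pointing at that peer's enp0s8 address; a sketch of the check (subnets and node IPs are illustrative):

```bash
# On any node: list the routes Flannel installed for the pod network
ip route | grep 10.244
# Expected shape with host-gw (addresses illustrative):
#   10.244.1.0/24 via 192.168.56.102 dev enp0s8
#   10.244.2.0/24 via 192.168.56.103 dev enp0s8
```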
- Verification of external communication: After the above changes, accessing the NodePort service with cURL or Wget from machines outside the cluster succeeded consistently, regardless of which node IP was targeted, confirming that with the interface and backend corrected (VXLAN changed to host-gw) the cluster behaves the same as in the other environments tested.
curl -s http://<NODE_IP>:32001/   # Replace <NODE_IP> with the IP of any cluster node; 32001 is the service's NodePort
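The values to plug into that command can be read straight from the cluster (the service name nginx is an assumption):

```bash
# Show the assigned NodePort (e.g. 80:32001/TCP) and the nodes' internal IPs
kubectl get service nginx
kubectl get nodes -o wide
```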
By correcting these two aspects of the Flannel configuration, pinning flanneld to the right network interface (--iface=enp0s8) so that pod traffic is routed back to the correct Kubernetes node, and choosing a backend type (host-gw) suited to an on-premises network where all nodes share a layer-2 segment, the intermittent connectivity problems were resolved.