A DevOps Mystery – The Case of OpenStack and the Missing FIN Packet

Debugging and solving timeouts after receiving data from a specific host in OpenStack, caused by missing FIN packets:

At Cress Tech, our team of experienced DevOps consultants encounters a wide range of challenges. In this case, a client reported an unusual issue in their OpenStack cluster: a PHP script making a GET request to one specific third-party service would stall for 60 seconds before completing successfully. The delay only occurred when the script ran on a VM inside the cluster, which pointed to a specific combination of factors. By tracing the problem to its root cause, we were able to eliminate the delay with a single configuration change.
To investigate, we narrowed the affected requests down to a simple reproduction case: a basic GET request to the third-party service from PHP (a file_get_contents() call). Running this snippet from the PHP CLI on a virtual machine (VM) reproduced the behavior every time, confirming that the PHP environment was involved. GET requests to many other hosts from the same PHP environment worked fine, however, showing that the problem was specific to this one host.
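
A minimal sketch of that reproduction case, run from a shell on one of the affected VMs (the URL below is a placeholder for the actual third-party service):

```
# Placeholder URL standing in for the affected third-party service.
time php -r '$body = file_get_contents("https://third-party.example.com/api/status"); var_dump($body === false ? "failed" : strlen($body));'
# On an affected VM the call succeeds, but only after the ~60-second stall.
```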

Further experimentation revealed that the same request made with the command-line cURL client returned immediately, and that the PHP request also returned immediately when run on an OpenStack compute host rather than a VM. In other words, only one specific combination of factors triggered the stall: a request to this particular host, made from PHP, on a VM in our cluster. Change any one of those factors and the request completed normally.
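
For comparison, roughly what the two "control" experiments looked like (same placeholder URL):

```
# Same request via the curl CLI on the affected VM: returns immediately.
time curl -sS -o /dev/null https://third-party.example.com/api/status

# Same PHP one-liner, but run on an OpenStack compute host instead of a VM:
# also returns immediately.
time php -r 'file_get_contents("https://third-party.example.com/api/status");'
```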

A Solitary Experience

We noticed a problem in our OpenStack cluster where requests to a third-party service would occasionally take 60 seconds to complete. That is a poor experience for users and is usually the sign of a timeout. When we tried to reproduce the problem with a simple PHP snippet, it happened 100% of the time.
After some investigation, we established that the issue only occurred when making requests to this specific third-party service using PHP on a VM within our cluster. We searched the OpenStack bug trackers for similar issues but found nothing.
This surprised us: the cluster was built with default configuration options on a supported OS, so we did not expect to run into such a unique problem. Unfortunately, troubleshooting infrastructure issues is difficult when the case is so specific that no one else seems to have experienced it.

Creative Alternative Solutions

There are a couple of potential solutions to work around the issue. One approach is to use cURL instead of file_get_contents() to make the request, since the cURL code path doesn't experience the stalled-request problem (a sketch follows below). However, this would require changes across the application's various service calls, which may not be ideal.
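
As a rough sketch of that first workaround (placeholder URL again; in the real application the change would have to be repeated wherever file_get_contents() is used for these calls), the equivalent request through PHP's cURL extension looks something like this:

```
php <<'PHP'
<?php
// Equivalent GET request through ext-curl instead of file_get_contents();
// the cURL code path did not exhibit the 60-second stall.
$ch = curl_init('https://third-party.example.com/api/status');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$body = curl_exec($ch);
curl_close($ch);
var_dump($body === false ? 'request failed' : strlen($body));
PHP
```
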
Another option is to build a patched version of PHP that handles the missing FIN packet by closing the connection itself once the response's final line feeds are received. This would require familiarity with the PHP internals, as well as maintaining the patched build and rolling it out to every relevant VM in the infrastructure.
While these solutions could be used as temporary fixes under time constraints, they would only mask the underlying problem of missing packets. In this case, it was possible to find the root cause and implement a more comprehensive solution.

Narrowing the Search

We discovered that the packets we were missing made it onto the compute hosts but not onto the virtual machines. Our investigation using tshark revealed that the packets were present on the Geneve interface, which connects the compute nodes to the software-defined network (SDN), but were missing on the TAP interface, which is a virtual interface connected directly to QEMU. This told us that the problem was somewhere between Geneve and TAP.
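
The capture comparison looked roughly like this; the interface names and the service IP below are placeholders for the actual Geneve tunnel interface, the VM's TAP device, and the third-party host:

```
# On the compute host: the connection teardown (FIN packets) from the remote
# host is visible on the Geneve overlay interface...
tshark -i genev_sys_6081 -Y 'ip.addr == 203.0.113.10 && tcp.flags.fin == 1'

# ...but the same filter on the VM's TAP interface never shows them, so the
# packets are being lost somewhere between these two interfaces.
tshark -i tap1234abcd-56 -Y 'ip.addr == 203.0.113.10 && tcp.flags.fin == 1'
```
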
We traced the problem to Open vSwitch (OVS), the software-defined switch that connects the virtual machines' TAP interfaces to the overlay network on the compute hosts. Open vSwitch has two components: a kernel module and a user-space daemon. Checking the OpenFlow statistics showed that packets sent from Open vSwitch towards the virtual machine's TAP interface were being dropped.
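
A sketch of the kind of inspection this involves, assuming the usual OpenStack integration bridge name br-int and a placeholder TAP port name:

```
# User-space view: OpenFlow rules on the integration bridge, with per-rule
# packet counters (n_packets) showing which rules traffic is hitting.
ovs-ofctl dump-flows br-int

# Kernel view: the datapath flows currently installed by the kernel module.
ovs-appctl dpctl/dump-flows

# Per-port counters for the VM's TAP port, including drops.
ovs-vsctl get Interface tap1234abcd-56 statistics
```
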
These packets could have been dropped either because they did not match an established connection, or because a flow rule explicitly rejected them. We reviewed the flow rules and then turned to connection tracking to see whether the drops were caused by checksum failures. After enabling conntrack's kernel debug messages, we confirmed that the packets were indeed being dropped because their checksums were considered invalid.
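
The exact mechanism varies by kernel, but one common way to get this visibility (assumed here) is to have conntrack log the packets it classifies as invalid and then watch the kernel log:

```
# Log every packet that conntrack classifies as invalid (255 = all protocols).
sysctl -w net.netfilter.nf_conntrack_log_invalid=255

# Per-CPU conntrack counters; the "invalid" column increments as packets are
# rejected (requires the conntrack-tools package).
conntrack -S

# Follow the kernel log for the reported drop reasons (here, checksum failures).
dmesg --follow
```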

The Fix and Root Cause

To summarize the diagnosis: packets were being dropped between the Geneve interface and the TAP interface, within Open vSwitch's OpenFlow processing, because conntrack (a kernel subsystem) judged their checksums to be invalid. We fixed the issue by setting the net.netfilter.nf_conntrack_checksum parameter to zero, which tells conntrack to skip checksum validation so that these control packets are no longer classified as invalid and dropped.
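
The runtime version of that fix is a one-liner on the affected compute hosts:

```
# Stop conntrack from validating checksums, so the short control packets are
# no longer marked invalid and dropped on their way to the VM's TAP interface.
sysctl -w net.netfilter.nf_conntrack_checksum=0
```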

Further investigation revealed a bug in the version of the OVS kernel module we were running: packets shorter than 60 bytes were checksummed incorrectly, which is why small control packets such as ACKs and FINs were being dropped. Upgrading the kernel was not an option, so we applied the sysctl parameter permanently via sysctl.d.
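
Persisting the setting across reboots is a small sysctl.d drop-in (the file name here is arbitrary), applied with sysctl --system:

```
# Persist the workaround; the file name is arbitrary.
cat > /etc/sysctl.d/99-ovs-conntrack-checksum.conf <<'EOF'
# Work around the short-packet checksum bug in the running OVS kernel module:
# do not let conntrack drop control packets over checksum failures.
net.netfilter.nf_conntrack_checksum = 0
EOF

# Apply immediately, without waiting for a reboot.
sysctl --system
```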

Final Thoughts

In conclusion, this issue was complex and challenging to solve because it originated inside the kernel and required a deep understanding of the entire technology stack. Observing and diagnosing such problems takes specialized tools and technical expertise. Without an experienced engineering team, our client would have faced another costly migration, a rewrite of their entire application, or the risk of losing users to prolonged timeouts. A comprehensive view of the system is essential to understand how the layers of a problem like this interact.
