Case study
Python SaaS app returning 502 errors behind NGINX and Gunicorn
A Python SaaS application was returning intermittent 502 errors because Gunicorn workers were silently disappearing under memory pressure. The visible symptom was NGINX upstream failure, but the real cause was worker memory growth and Linux OOM kills.
Context
A Python SaaS application was running behind NGINX, with Gunicorn serving the app as a systemd service. Users were seeing intermittent 502 Bad Gateway errors, but the failures did not look like a normal application crash.
The site could run normally for hours, then suddenly produce a burst of 502s before recovering. There were no clear Python tracebacks explaining the outage. NGINX showed upstream failures, while Gunicorn workers appeared to disappear, restart or stop responding after memory usage had climbed over time.
The problem
- NGINX was returning 502 errors because the upstream Gunicorn worker had vanished, closed the connection unexpectedly or stopped responding.
- The failures were not being caused by NGINX itself. NGINX was only reporting that the Python backend had become unavailable.
- Individual Gunicorn workers were slowly increasing in memory usage during the day.
- The memory growth appeared to come from a mix of in-process caching, request-level objects being retained too long and long-lived database/session objects.
- When memory pressure became high enough, the Linux OOM killer terminated the largest Gunicorn workers to protect the server.
- Because the workers were killed by the operating system, the application did not produce a useful Python-level exception.
Our approach
- Matched NGINX upstream 502 errors with Gunicorn worker restarts, dropped connections and rising per-worker memory usage.
- Checked journalctl, kernel logs and OOM killer events to confirm the workers were being killed outside the Python application.
- Reviewed caching, database/session lifecycle, worker count and systemd limits to reduce avoidable memory growth.
- Added controlled Gunicorn worker recycling with max-requests and max-requests-jitter, then monitored worker RSS, OOM events and upstream failures.
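The worker recycling step could look roughly like the Gunicorn configuration file below. Gunicorn reads its settings from a Python file, and max_requests, max_requests_jitter, workers, bind and graceful_timeout are real Gunicorn settings. The values themselves are assumptions chosen for illustration, not the figures used in this case study.

```python
# gunicorn.conf.py
# Illustrative values only; the right numbers depend on traffic and on how
# quickly worker memory grows.

# Address that NGINX proxies to (assumed for this sketch).
bind = "127.0.0.1:8000"

# Number of worker processes. Each worker keeps its own in-process caches,
# so the worker count multiplies the total memory footprint.
workers = 4

# Restart a worker after it has handled roughly this many requests.
max_requests = 1000

# Add a random offset between 0 and this value to max_requests for each
# worker, so that workers do not all restart at the same moment.
max_requests_jitter = 100

# Seconds a worker is given to finish in-flight requests after it has been
# told to restart.
graceful_timeout = 30
```

A file like this would be loaded with something like gunicorn -c gunicorn.conf.py app:app, where app:app is a placeholder for the real module and callable. The same two recycling settings are also available as the --max-requests and --max-requests-jitter command-line flags.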
Practical outcomes
- Gunicorn workers are now recycled in a controlled way through max-requests and max-requests-jitter, rather than by emergency kills from Linux.