Case study

Python SaaS app returning 502 errors behind NGINX and Gunicorn

A Python SaaS application was returning intermittent 502 errors because Gunicorn workers were silently disappearing under memory pressure. The visible symptom was NGINX upstream failure, but the real cause was worker memory growth and Linux OOM kills.

Context

The application ran behind NGINX, with Gunicorn managed by systemd serving the Python app. Users were seeing intermittent 502 Bad Gateway errors, but the failures did not look like a normal application crash.

The site could run normally for hours, then suddenly produce a burst of 502s before recovering. There were no clear Python tracebacks explaining the outage. NGINX showed upstream failures, while Gunicorn workers appeared to disappear, restart or stop responding after memory usage had climbed over time.

The problem

NGINX was returning 502 errors because the upstream Gunicorn worker had vanished, closed the connection unexpectedly or stopped responding.
The failures were not being caused by NGINX itself. NGINX was only reporting that the Python backend had become unavailable.
Individual Gunicorn workers were slowly increasing in memory usage during the day.
The memory growth appeared to come from a mix of in-process caching, request-level objects being retained too long and long-lived database/session objects.
When memory pressure became high enough, the Linux OOM killer terminated the largest Gunicorn workers to protect the server.
Because the workers were killed by the operating system with SIGKILL, which a process cannot catch or handle, the application did not produce a useful Python-level exception or traceback.
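
Since the evidence for an OOM kill sits in the kernel log rather than in the application log, a small filter over kernel log lines can help confirm it. The following is a minimal sketch, not the tooling used in this case: the function name find_oom_kills and the sample lines are hypothetical, and the exact wording of OOM messages varies between kernel versions. It also assumes the workers show up under the process name "gunicorn"; on some setups they may appear as "python" instead.

```python
import re

# Matches the "Killed process <pid> (<name>)" fragment that Linux kernels
# write when the OOM killer terminates a process. The text around this
# fragment differs between kernel versions, so only the fragment is matched.
OOM_KILL_RE = re.compile(r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)")


def find_oom_kills(log_lines, process_name="gunicorn"):
    """Return (pid, name) for each OOM kill whose process name contains process_name."""
    kills = []
    for line in log_lines:
        match = OOM_KILL_RE.search(line)
        if match and process_name in match.group("name"):
            kills.append((int(match.group("pid")), match.group("name")))
    return kills


# Illustrative sample lines only; they are not taken from the incident.
SAMPLE_LOG = [
    "kernel: Out of memory: Killed process 4321 (gunicorn) total-vm:2100000kB, anon-rss:1500000kB",
    "kernel: usb 1-1: new high-speed USB device number 2 using xhci_hcd",
    "kernel: Out of memory: Killed process 777 (postgres) total-vm:900000kB, anon-rss:600000kB",
]

if __name__ == "__main__":
    print(find_oom_kills(SAMPLE_LOG))
```

In practice the lines could come from the kernel journal (for example the output of journalctl -k saved to a file) instead of the sample list.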

Our approach

Matched NGINX upstream 502 errors with Gunicorn worker restarts, dropped connections and rising per-worker memory usage.
Checked journalctl, kernel logs and OOM killer events to confirm the workers were being killed outside the Python application.
Reviewed caching, database/session lifecycle, worker count and systemd limits to reduce avoidable memory growth.
Added controlled Gunicorn worker recycling with max-requests and max-requests-jitter, then monitored worker RSS, OOM events and upstream failures.
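
The worker recycling step can be sketched as a Gunicorn configuration file, which is itself plain Python. All values below are assumptions for illustration, not the settings used in this case: the bind address, worker count and request thresholds would need tuning against measured worker memory and traffic.

```python
# gunicorn.conf.py
# Illustrative values only; tune them against measured per-worker RSS.

# Assumed local address that NGINX proxies requests to.
bind = "127.0.0.1:8000"

# Hypothetical worker count. Fewer workers means less total memory in use.
workers = 3

# Restart each worker after it has handled roughly this many requests,
# so slow memory growth is released before the kernel has to step in.
max_requests = 1000

# Add a random offset of 0..max_requests_jitter to each worker's threshold
# so that the workers do not all restart at the same moment.
max_requests_jitter = 100

# Seconds a worker gets to finish in-flight requests during a restart.
graceful_timeout = 30
```

With these example numbers, each worker would restart somewhere between 1000 and 1100 requests. Recycling limits the effect of memory growth but does not remove its cause, which is why the caching and session lifecycle review was done alongside it.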

Practical outcomes

502 errors stopped: Gunicorn workers were no longer being killed unpredictably by the operating system during memory pressure.

Root cause made visible: The team could see the link between worker memory growth, OOM events and NGINX upstream failures.

Controlled worker recycling: Worker restarts became planned and graceful using max-requests and jitter, rather than emergency kills by Linux.

Better incident checks: Future 502 investigations included system logs, memory growth and worker behaviour, not just NGINX configuration.
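
As an example of the kind of memory check this implies, per-worker resident memory can be read from /proc on Linux. This is a minimal sketch under stated assumptions: the helper names (parse_vm_rss_kb, read_rss_kb, workers_over_limit) and the sample status text are hypothetical, and how worker PIDs are discovered is left out.

```python
import re
from pathlib import Path

# /proc/<pid>/status reports resident memory on a line such as
# "VmRSS:     204800 kB". Kernel threads have no VmRSS line.
VM_RSS_RE = re.compile(r"^VmRSS:\s+(\d+)\s+kB$", re.MULTILINE)


def parse_vm_rss_kb(status_text):
    """Return the VmRSS value in kB from status text, or None if absent."""
    match = VM_RSS_RE.search(status_text)
    return int(match.group(1)) if match else None


def read_rss_kb(pid):
    """Read the current RSS in kB for a live process (Linux only)."""
    return parse_vm_rss_kb(Path(f"/proc/{pid}/status").read_text())


def workers_over_limit(rss_by_pid, limit_kb):
    """Return the sorted PIDs whose RSS exceeds limit_kb."""
    return sorted(
        pid for pid, rss in rss_by_pid.items()
        if rss is not None and rss > limit_kb
    )


# Illustrative status text, not captured from the affected server.
SAMPLE_STATUS = "Name:\tgunicorn\nVmPeak:\t  500000 kB\nVmRSS:\t  204800 kB\nThreads:\t1\n"

if __name__ == "__main__":
    print(parse_vm_rss_kb(SAMPLE_STATUS))
```

Sampling these values on a schedule and recording them next to OOM events and NGINX upstream errors is one way to make the relationship between the three visible.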

Relevant technologies and keywords

These are the main technologies, services and search terms connected to this case study.

Python, Gunicorn, NGINX, systemd, 502 Bad Gateway, OOM killer, Memory leak, max-requests, Linux, SaaS support, Worker recycling, Upstream errors

Related services

Relevant services for similar infrastructure problems.

Gunicorn Support
NGINX Support
Python Development
Django Development
Emergency Server Support
Monitoring Setup

Want help with a similar issue?

Send the symptoms, affected service, recent changes and business impact. We will suggest the most appropriate route: emergency support, a fixed-scope technical fix, an infrastructure review or a wider project.