There is arguably nothing more stressful for a backend engineer than an unexpected pager alert stating that your production Postgres RDS instance is sitting at 100% CPU utilization. This inevitably leads to cascading API timeouts and a full application outage.
Why does Postgres CPU spike?
Unlike memory exhaustion (which usually results in an OOM kill), high CPU utilization in PostgreSQL is almost always a symptom of inefficient queries forcing the database engine to perform massive amounts of computational work. The three most common culprits are:
- Sequential Scans (Missing Indexes): The query planner is forced to read every single row in a massive table because a WHERE clause references an unindexed column.
- Lock Contention: Long-running transactions holding exclusive locks on frequently updated rows.
- Connection Thrashing: Too many concurrent connections opening and closing, forcing the CPU to spend all its cycles on context switching rather than executing queries.
Diagnosing the Culprit in Real-Time
When your database is locked at 100% CPU, you cannot afford to spend 20 minutes clicking through the AWS CloudWatch console. You need to immediately identify the query holding the CPU hostage.
If you have access to a SQL shell (`psql`), you can query the `pg_stat_activity` view to see exactly what Postgres is executing at this exact millisecond:
-- Find queries running longer than 30 secondsSELECT pid, age(clock_timestamp(), query_start), usename, queryFROM pg_stat_activityWHERE query != '<IDLE>' AND query_state != 'idle'ORDER BY query_start DESC;
Once you identify the rogue PID (Process ID), you can aggressively terminate it to restore application availability using `pg_terminate_backend(pid);`.
The Permanent Fix: Query Analytics
Killing the query is a band-aid. The permanent fix requires enabling the `pg_stat_statements` extension to track execution times globally. You must identify which query has the highest average execution time multiplied by the frequency of calls.
Automate your Database APM
Running raw SQL commands during a production outage is terrifying. ClockingPulse automatically monitors your database CPU and `pg_stat_activity`, alerting your Slack channel with the exact slow query before your CPU hits 100%.
- Real-time CPU and Memory tracking
- Automated Slow Query logs
- Automated Incident Playbooks
