dmai/blog

Overview
1. Query Execution Pipeline
2. EXPLAIN ANALYZE
3. Performance Techniques
4. Processes & Memory
Database

2. EXPLAIN ANALYZE

Mar 17, 2026·12 min read

Your query is slow. You know it is slow. But where is it slow? Is it reading too many rows? Is it sorting in memory or spilling to disk? Is it using the index you created, or ignoring it entirely?

EXPLAIN ANALYZE answers all of these questions. It runs your query, instruments every step of the execution plan, and tells you exactly what happened. This is the single most important tool for PostgreSQL performance work.

EXPLAIN vs EXPLAIN ANALYZE

These two commands look similar but do very different things.

EXPLAIN (without ANALYZE)

Shows the planner's estimated plan without actually running the query. Safe to run on production. The costs, row counts, and widths are all guesses based on table statistics.

EXPLAIN ANALYZE

Actually executes the query and reports real timing, real row counts, and real I/O stats. This is what you need for debugging. Be careful with INSERT/UPDATE/DELETE queries because they will modify data. Wrap them in a transaction and roll back:

sql
-- Safe way to EXPLAIN ANALYZE a write query:
BEGIN;
EXPLAIN ANALYZE DELETE FROM logs WHERE created_at < '2024-01-01';
ROLLBACK;

Reading the Output

Let's start with a simple table and query:

sql
CREATE TABLE orders (
    id         SERIAL PRIMARY KEY,
    customer   TEXT NOT NULL,
    amount     NUMERIC(10,2) NOT NULL,
    status     TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT now()
);

-- 100k rows of test data
INSERT INTO orders (customer, amount, status, created_at)
SELECT
    'customer_' || (random() * 1000)::int,
    (random() * 500)::numeric(10,2),
    (ARRAY['pending','shipped','delivered'])[floor(random()*3+1)],
    now() - (random() * interval '365 days')
FROM generate_series(1, 100000);

ANALYZE orders;

Now let's run our first EXPLAIN ANALYZE:

sql
EXPLAIN ANALYZE
SELECT * FROM orders WHERE status = 'pending';
text
Seq Scan on orders  (cost=0.00..2137.00 rows=33521 width=52)
                    (actual time=0.013..12.41 rows=33289 loops=1)
  Filter: (status = 'pending')
  Rows Removed by Filter: 66711
Planning Time: 0.085 ms
Execution Time: 14.12 ms

Every line has a lot packed into it. Let's break each part down.

Cost: The Planner's Estimate

text
cost=0.00..2137.00
- 0.00 = startup cost. How much work happens before the first row can be returned. A Seq Scan starts returning immediately (0), but a Sort has to process all of its input first.
- 2137.00 = total cost. The estimated total cost to return all rows. This is in arbitrary units (not milliseconds). It is relative, so a cost of 2000 is roughly twice as expensive as 1000.

Cost units are based on seq_page_cost (default 1.0) and cpu_tuple_cost (default 0.01). You don't usually need to think about the absolute numbers. What matters is comparing costs between different plans for the same query.
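As a sanity check, you can roughly reproduce the Seq Scan estimate yourself from the table's statistics in pg_class. This is a back-of-envelope sketch that assumes the default seq_page_cost (1.0) and cpu_tuple_cost (0.01) and ignores the smaller cpu_operator_cost charged for evaluating the filter:

```sql
-- Approximate the Seq Scan total cost from planner statistics.
-- Assumes defaults: seq_page_cost = 1.0, cpu_tuple_cost = 0.01.
SELECT relpages,                                      -- heap pages on disk
       reltuples,                                     -- estimated row count
       relpages * 1.0 + reltuples * 0.01 AS approx_seq_scan_cost
FROM pg_class
WHERE relname = 'orders';
```

With roughly 1,137 pages and 100,000 rows, that works out to about 1137 + 1000 = 2137, which lines up with the cost=2137.00 in the plan above.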

Estimated vs Actual Rows

text
rows=33521    <-- planner's estimate
rows=33289    <-- what actually happened

The planner estimated 33,521 rows but the actual result was 33,289. That's close enough. The planner got this right because we ran ANALYZE, which collected up-to-date statistics.

When estimates are way off:

- Stale statistics - If you loaded a lot of data without running ANALYZE, the planner is working with old numbers. Run ANALYZE or wait for autovacuum.
- Correlated columns - The planner assumes columns are independent. If status='pending' correlates strongly with created_at being recent, the row estimate for a combined filter can be wildly off.
- Skewed data - If 90% of rows have status='delivered' and 1% have status='pending', the planner needs the most_common_vals statistics to know this. The default statistics_target may not be enough.
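The stale-statistics and skew cases can be inspected and tuned directly. A sketch using the orders table from the setup (500 is an illustrative per-column target, not a recommendation):

```sql
-- See what the planner currently believes about the status column.
SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';

-- Collect a larger sample for this column only, then refresh stats.
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;
```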

Bad row estimates are the #1 cause of bad plans. If the planner thinks a filter returns 10 rows but it actually returns 100,000, it might choose a Nested Loop when a Hash Join would be orders of magnitude faster.
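For the correlated-columns case specifically, extended statistics (available since PostgreSQL 10) let the planner learn the dependency instead of assuming independence. A sketch using the columns from the example above; note that dependency statistics help most when both columns appear in equality filters:

```sql
-- Teach the planner that status and created_at are not independent.
CREATE STATISTICS orders_status_created (dependencies)
    ON status, created_at FROM orders;

ANALYZE orders;  -- extended statistics are only populated by ANALYZE
```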

Actual Time

text
actual time=0.013..12.41
- 0.013 ms = time to return the first row
- 12.41 ms = time to return all rows

These are real wall-clock times in milliseconds. Unlike cost, these are absolute. If a node says 500ms, it took 500ms.

Important: in a tree with multiple nodes, the time of a parent node includes the time of its children. To find how long a specific node took by itself, subtract its children's time.

Rows Removed by Filter

text
Filter: (status = 'pending')
Rows Removed by Filter: 66711

This is a red flag detector. The Seq Scan read all 100,000 rows but only 33,289 passed the filter. That means 66,711 rows were read from disk (or cache) and then thrown away.

For this query, that might be acceptable. But if you are filtering down to 10 rows out of 10 million, a Seq Scan that reads all 10 million rows is extremely wasteful. That is when you need an index.

sql
-- Add an index to avoid the full table scan
CREATE INDEX idx_orders_status ON orders(status);

EXPLAIN ANALYZE
SELECT * FROM orders WHERE status = 'pending';
text
Bitmap Heap Scan on orders  (cost=375.40..1678.94 rows=33521 width=52)
                            (actual time=2.18..8.92 rows=33289 loops=1)
  Recheck Cond: (status = 'pending')
  Heap Blocks: exact=1137
  ->  Bitmap Index Scan on idx_orders_status
                            (cost=0.00..367.02 rows=33521 width=0)
                            (actual time=1.92..1.92 rows=33289 loops=1)
        Index Cond: (status = 'pending')
Planning Time: 0.153 ms
Execution Time: 10.45 ms

Now PostgreSQL uses a Bitmap Index Scan. It reads the index to find which pages contain matching rows, then fetches only those pages from the heap. No more "Rows Removed by Filter" because every row it reads matches.

Notice the planner chose a Bitmap scan instead of a plain Index Scan. That is because ~33% of rows match. For such a large fraction, reading pages sequentially via bitmap is faster than jumping around the heap via an Index Scan.
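To see the other end of the selectivity spectrum, filter on the primary key from the setup above. With exactly one matching row, the planner should prefer a plain Index Scan:

```sql
-- Highly selective: one row via the primary key index.
EXPLAIN ANALYZE
SELECT * FROM orders WHERE id = 42;
```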

Scan Types Explained

Seq Scan

Reads every page of the table from start to finish. Filters are applied after reading. Cheapest per-page cost because it reads sequentially, but expensive overall when you only need a few rows.

Index Scan

Walks the index (B-tree) to find matching row pointers, then fetches each row from the heap. Best for highly selective queries (returning a small fraction of rows). Each heap fetch is a random I/O, so this gets expensive if many rows match.

Index Only Scan

The best case. All requested columns are in the index, so PostgreSQL never touches the heap at all. Only works when the visibility map confirms the pages are all-visible (no recent updates).
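One way to get there is a covering index. This sketch assumes a query that only needs status and amount; the INCLUDE clause requires PostgreSQL 11 or later, and the VACUUM keeps the visibility map current so heap fetches can actually be skipped:

```sql
-- Covering index: the query below can be answered from the index alone.
CREATE INDEX idx_orders_status_amount
    ON orders (status) INCLUDE (amount);

VACUUM orders;  -- refresh the visibility map

EXPLAIN ANALYZE
SELECT status, amount FROM orders WHERE status = 'pending';
```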

Bitmap Index Scan + Bitmap Heap Scan

Two-phase approach. First, scan the index and build a bitmap of matching page locations. Then, fetch those pages in physical order (sequential I/O). Best for medium selectivity where Index Scan would cause too much random I/O.

text
-- Rough selectivity guide for which scan type the planner picks:
--
-- < 5% of rows match    -->  Index Scan (few random reads)
-- 5-20% of rows match   -->  Bitmap Scan (batch random reads)
-- > 20% of rows match   -->  Seq Scan (just read everything)
--
-- These thresholds depend on table size, page layout, and cost settings.

BUFFERS: Where the Time Really Goes

Add BUFFERS to see I/O details. This is where you find out if your query is fast because everything is cached, or slow because it is hitting disk.

sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE status = 'pending';
text
Bitmap Heap Scan on orders  (cost=375.40..1678.94 rows=33521 width=52)
                            (actual time=2.31..9.14 rows=33289 loops=1)
  Recheck Cond: (status = 'pending')
  Heap Blocks: exact=1137
  Buffers: shared hit=1228
  ->  Bitmap Index Scan on idx_orders_status
                            (cost=0.00..367.02 rows=33521 width=0)
                            (actual time=1.95..1.95 rows=33289 loops=1)
        Index Cond: (status = 'pending')
        Buffers: shared hit=91
Planning Time: 0.12 ms
Execution Time: 10.72 ms

Reading buffer stats:

- shared hit=1228 - 1,228 pages (8KB each) were found in shared_buffers (memory). Fast.
- shared read=N - N pages had to be read from disk (or OS cache). Slow. Not shown here because everything was cached.
- shared dirtied=N - N pages were modified (only for writes). These will need to be written back to disk eventually.
- shared written=N - N pages were written to disk during this query (rare, usually the background writer handles this).
- temp read/written - Pages read from or written to temp files. This means the operation (sort, hash) did not fit in work_mem and spilled to disk.
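To see how much time those reads actually cost, you can enable track_io_timing (changing it typically requires superuser privileges), which adds an I/O Timings line to the BUFFERS output:

```sql
-- Adds "I/O Timings" to the plan, showing time spent waiting on block I/O.
SET track_io_timing = on;

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE status = 'pending';
```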

If you see temp read or temp written, your query is spilling sorts or hash tables to disk. Increase work_mem for that session:

sql
SET work_mem = '256MB';  -- per-operation, not global
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer, sum(amount)
FROM orders
GROUP BY customer
ORDER BY sum(amount) DESC;

Joins: Nested Loop, Hash, Merge

When your query joins tables, the plan becomes a tree with join nodes. The planner chooses between three join strategies:

sql
-- Setup: add a customers table
CREATE TABLE customers (
    id   SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    tier TEXT NOT NULL
);

INSERT INTO customers (name, tier)
SELECT 'customer_' || i, (ARRAY['free','pro','enterprise'])[floor(random()*3+1)]
FROM generate_series(1, 1000) AS i;

ANALYZE customers;

EXPLAIN ANALYZE
SELECT c.name, o.amount, o.status
FROM orders o
JOIN customers c ON c.name = o.customer
WHERE c.tier = 'enterprise';
text
Hash Join  (cost=20.50..2670.50 rows=11111 width=44)
           (actual time=0.34..25.71 rows=11063 loops=1)
  Hash Cond: (o.customer = c.name)
  ->  Seq Scan on orders o  (cost=0.00..2137.00 rows=100000 width=36)
                             (actual time=0.006..7.82 rows=100000 loops=1)
  ->  Hash  (cost=16.50..16.50 rows=320 width=16)
             (actual time=0.31..0.31 rows=331 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 24kB
        ->  Seq Scan on customers c  (cost=0.00..16.50 rows=320 width=16)
                                      (actual time=0.005..0.15 rows=331 loops=1)
              Filter: (tier = 'enterprise')
              Rows Removed by Filter: 669
Planning Time: 0.25 ms
Execution Time: 26.38 ms

Hash Join

Builds a hash table from the smaller table (customers filtered to enterprise tier = 331 rows, using only 24kB). Then scans the larger table (orders) and probes the hash table for each row. Great when one side is small enough to fit in memory.

Nested Loop

For each row in the outer table, look up matching rows in the inner table (usually via an index). Best when the outer table returns very few rows and the inner table has an index. Worst case is O(n*m) without an index.

Merge Join

Both inputs must be sorted on the join key. Then it walks through both sorted lists simultaneously, merging matches. Efficient when both sides are large and already sorted (or have an index that provides order).
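When you want to understand why the planner picked one strategy over another, you can disable a join method for the session and compare the resulting plans. This is a diagnostic trick, not a production setting:

```sql
-- Force the planner away from hash joins, then compare timings.
SET enable_hashjoin = off;

EXPLAIN ANALYZE
SELECT c.name, o.amount, o.status
FROM orders o
JOIN customers c ON c.name = o.customer
WHERE c.tier = 'enterprise';

RESET enable_hashjoin;  -- restore the default for this session
```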

Watch the Batches field on Hash nodes. If Batches: 1, the hash table fit in memory. If batches > 1, it spilled to disk and you should increase work_mem.

Planning Time vs Execution Time

text
Planning Time: 0.25 ms
Execution Time: 26.38 ms

- Planning Time - How long the planner took to evaluate strategies and produce the execution plan. Usually under 1ms for simple queries. If this is high (10ms+), the query might have too many joins, forcing the planner to evaluate many join orders.
- Execution Time - Total time to execute the plan and return rows. This is what your application actually experiences (plus network latency). This is the number you want to minimize.
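If planning time dominates a frequently repeated query, a prepared statement lets the session plan once and reuse the result (pending_orders is an illustrative name):

```sql
-- Plan once, execute many times; EXPLAIN works on EXECUTE too.
PREPARE pending_orders AS
    SELECT * FROM orders WHERE status = $1;

EXPLAIN ANALYZE EXECUTE pending_orders('pending');
```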

Note: EXPLAIN ANALYZE discards the result rows instead of sending them to the client, so network transfer time is not included in Execution Time. A query that returns millions of rows can therefore look faster under EXPLAIN ANALYZE than your application experiences it. The per-node timing instrumentation also adds some overhead, so treat the numbers as close to, but not identical to, real-world runtime.

Patterns to Watch For

1. Seq Scan on a large table with heavy filtering

Rows Removed by Filter: 9999000 out of 10M rows means you need an index. The database is reading the entire table to find a handful of rows.

Fix: Create an index on the filtered column.

2. Estimated vs actual rows diverge wildly

rows=100 (estimated) vs rows=500000 (actual). The planner made a bad choice because it thought the result was small.

Fix: Run ANALYZE on the table. Consider increasing default_statistics_target for columns with skewed distributions.

3. Sort with temp files

Sort Method: external merge Disk: 45MB means the sort did not fit in work_mem and spilled to disk.

Fix: Increase work_mem for the session, or add an index that provides the sort order so no sorting is needed.
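The second fix can be verified directly. Assuming a query that sorts on created_at, an index on that column delivers rows in order, so the Sort node disappears from the plan:

```sql
-- The index returns rows already ordered by created_at,
-- so the plan needs no Sort node (and nothing can spill to disk).
CREATE INDEX idx_orders_created_at ON orders (created_at);

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders ORDER BY created_at LIMIT 100;
```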

4. Nested Loop with high loops count

loops=50000 on an inner Index Scan means the nested loop ran the inner scan 50,000 times. The actual time shown is an average per loop, so multiply it by loops to get the true total: 0.05 ms per loop × 50,000 loops ≈ 2.5 seconds.

Fix: If loops is high, a Hash Join is usually faster. Check if the planner underestimated the outer row count.

5. Index exists but not used

You created an index but the planner still picks a Seq Scan. This is often correct. If the query returns a large fraction of the table, a Seq Scan is genuinely faster than bouncing around the index.

Check: Is the filter selective enough? Is the index on the right column? Is the data type matching (no implicit casts)?
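The cast problem is easy to demonstrate. Applying a cast (or any function) to the indexed column hides it from the index; id here is the integer primary key from the setup:

```sql
-- Fine: the literal is coerced to integer, so the index on id is usable.
EXPLAIN SELECT * FROM orders WHERE id = '42';

-- Bad: casting the column itself means the btree on id cannot be used.
EXPLAIN SELECT * FROM orders WHERE id::text = '42';
```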

The Full EXPLAIN Toolkit

EXPLAIN accepts several options that can be combined:

sql
-- The full diagnostic toolkit:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT ...;

-- Machine-readable output:
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT ...;

-- Verbose: show output columns, schema-qualified table names
EXPLAIN (ANALYZE, VERBOSE)
SELECT ...;

-- WAL: show WAL (write-ahead log) usage for writes
EXPLAIN (ANALYZE, BUFFERS, WAL)
INSERT INTO ...;

-- Settings: show non-default planner settings that affect the plan
EXPLAIN (ANALYZE, SETTINGS)
SELECT ...;

The combination I use most often is EXPLAIN (ANALYZE, BUFFERS). It gives you timing, row counts, and I/O stats, which covers 95% of debugging needs.

Quick debugging checklist:

1. Run EXPLAIN (ANALYZE, BUFFERS) on the slow query
2. Find the node with the highest actual time
3. Check if estimated rows match actual rows
4. Look for "Rows Removed by Filter" - is an index missing?
5. Check buffer stats - is it hitting disk (shared read) or cache (shared hit)?
6. Check for temp files - is work_mem too small?
7. Look at the join type - is Nested Loop running too many loops?

© 2026 dmai/blog Engineer Notes. All rights reserved.