To Save: PG limitations with numerous index scan queries

PG limitations with numerous index scan queries

Via PostgreSQL by /u/CacheMeUp

The following query takes >45 minutes on Postgresql 14, 64GB RAM, 8 cores CPU:

create table table_50M_rows_A (id varchar(16) primary key, c1 jsonb, c2 text /* 100-1025 characters */, c3 timestamp); create table table_50M_rows_B (id varchar(16) primary key, c2 text /* 100-1025 characters */, c3 timestamp); set work_mem='16384MB'; select id from table_50M_rows_A a where not exists (select from table_50M_rows_B b where a.id=b.id)

despite using index scan. The PK index is several GBs big but smaller than work_mem.

When implemented as a procedural code in Java, it is x10 faster (4 minutes), where the bottleneck is actually IO from the database:

var existing = new HashSet(); for (var id in "select id from table_50M_rows_B") existing.add(id); var diff = new ArrayList(); for (var id in "select id from table_50M_rows_A"){ if (!existing.contains(id)) diff.add   (id); }

The Java code uses also 8 cores.

It seems that the query engine fails to utilize the index as a hash-set (membership check). It seems to have challenges handling queries that involve small tasks but on numerous rows. This is despite being written in C and overall being highly optimized.

Is there a way to optimize such queries on Postgresql? I don't want to "re-invent the wheel".

Query plans:

No hints:

EXPLAIN select id from table_50M_rows_A a where not exists (select from table_50M_rows_B b where a.id=b.id); Gather (cost=3690212.14..5956357.34 rows=1 width=16) Workers Planned: 2 -> Parallel Hash Anti Join (cost=3689212.14..5955357.24 rows=1 width=16) Hash Cond: (a.id= b.id) -> Parallel Index Only Scan using x_table_a_id on table_50M_rows_A a (cost=0.56..1287666.41 rows=21855175 width=16) -> Parallel Hash (cost=3299770.03..3299770.03 rows=22403803 width=16) -> Parallel Seq Sca   n on table_50M_rows_B b (cost=0.00..3299770.03 rows=22403803 width=16)

set enable_seqscan=false

Gather (cost=1001.13..100945294.37 rows=1 width=16) Workers Planned: 2 -> Nested Loop Anti Join (cost=1.13..100944294.27 rows=1 width=16) -> Parallel Index Only Scan using x_table_a_id on table_50M_rows_A a (cost=0.56..1287666.41 rows=21855175 width=16) -> Index Only Scan using table_b_pkey on table_50M_rows_B b (cost=0.56..4.55 rows=1 width=16) Index Cond: (a.id=b.id)

submitted by /u/CacheMeUp
[link] [comments]

New article matched for in your Blogger! rule

View on the web

Inoreader. Take back control of your news feed. Follow us on Twitter and Facebook.

Testing Wide Production 2020

vrijdag 18 maart 2022

PG limitations with numerous index scan queries