vrijdag 18 maart 2022

PG limitations with numerous index scan queries

Via PostgreSQL by /u/CacheMeUp

The following query takes >45 minutes on Postgresql 14, 64GB RAM, 8 cores CPU:

create table table_50M_rows_A (id varchar(16) primary key, c1 jsonb, c2 text /* 100-1025 characters */, c3 timestamp); create table table_50M_rows_B (id varchar(16) primary key, c2 text /* 100-1025 characters */, c3 timestamp); set work_mem='16384MB'; select id from table_50M_rows_A a where not exists (select from table_50M_rows_B b where a.id=b.id) 

despite using index scan. The PK index is several GBs big but smaller than work_mem.

When implemented as a procedural code in Java, it is x10 faster (4 minutes), where the bottleneck is actually IO from the database:

var existing = new HashSet(); for (var id in "select id from table_50M_rows_B") existing.add(id); var diff = new ArrayList(); for (var id in "select id from table_50M_rows_A"){ if (!existing.contains(id)) diff.add   (id); } 

The Java code uses also 8 cores.

It seems that the query engine fails to utilize the index as a hash-set (membership check). It seems to have challenges handling queries that involve small tasks but on numerous rows. This is despite being written in C and overall being highly optimized.

Is there a way to optimize such queries on Postgresql? I don't want to "re-invent the wheel".

Query plans:

No hints:

EXPLAIN select id from table_50M_rows_A a where not exists (select from table_50M_rows_B b where a.id=b.id); Gather (cost=3690212.14..5956357.34 rows=1 width=16) Workers Planned: 2 -> Parallel Hash Anti Join (cost=3689212.14..5955357.24 rows=1 width=16) Hash Cond: (a.id= b.id) -> Parallel Index Only Scan using x_table_a_id on table_50M_rows_A a (cost=0.56..1287666.41 rows=21855175 width=16) -> Parallel Hash (cost=3299770.03..3299770.03 rows=22403803 width=16) -> Parallel Seq Sca   n on table_50M_rows_B b (cost=0.00..3299770.03 rows=22403803 width=16) 

set enable_seqscan=false

Gather (cost=1001.13..100945294.37 rows=1 width=16) Workers Planned: 2 -> Nested Loop Anti Join (cost=1.13..100944294.27 rows=1 width=16) -> Parallel Index Only Scan using x_table_a_id on table_50M_rows_A a (cost=0.56..1287666.41 rows=21855175 width=16) -> Index Only Scan using table_b_pkey on table_50M_rows_B b (cost=0.56..4.55 rows=1 width=16) Index Cond: (a.id=b.id) 

submitted by /u/CacheMeUp
[link] [comments]

New article matched for in your Blogger! rule

Testing Wide Production 2020