Thursday 12 March 2015

Working on PostgreSQL (What are join removals)

What are join removals?

Since September 2009 PostgreSQL's query planner has supported removal of LEFT JOINs. In order to properly explain what "join removals" are, we'd best look at some examples.

Let's keep it as simple as possible, say we had 2 tables like:

CREATE TABLE product (
  product_id INT NOT NULL,
  description VARCHAR(64) NOT NULL,
  PRIMARY KEY (product_id)
);

CREATE TABLE sale (
  sale_id INT NOT NULL,
  product_id INT NOT NULL,
  quantity NUMERIC(10,3) NOT NULL,
  PRIMARY KEY (sale_id)
);

If someone were to write a query such as;

SELECT s.product_id,p.description,s.quantity
FROM sale s
LEFT JOIN product p ON s.product_id = p.product_id;

The query planner would construct some sort of plan which accessed both sale and product, and joined the two tables together in order to produce the required results.

Now let's throw a crazy idea out there... What if the query joined to the product table, but the SELECT clause contained no actual columns from the product table...

SELECT s.product_id,s.quantity
FROM sale s
LEFT JOIN product p ON s.product_id = p.product_id;

Ok, perhaps it might seem strange to do this. Who would write such a query? Why bother with the join at all?

Well, if you imagine that someone had created a view which made available columns from both tables, but the consumer of the view didn't select any columns from the product table, then would the query executor really need to still make that join?

The answer is: 

It depends on:

1 We know that with LEFT JOINs, that if the product table happened not to have a matching record for a sale record that the semantic of LEFT JOIN would cause NULL values to appear in the result set for columns from the table being LEFT JOIN. 

2. Another affect that the LEFT JOIN could have would be that, if there happened to be more than 1 record matching a given product ID, then the join would match both of those rows to the sale table's row and cause the row in the sale table to be duplicated.

Item 1 would really not matter to us, since the join would only appear to be surplus to requirements none of the product table's columns were used in either the SELECT clause, WHERE clause etc. 

Item 2 really could be a problem as removing the join in if join would cause duplicate sale rows to be produced in the results would cause query results to change, which is obviously not acceptable!

If we were able to prove that these duplicate rows were not possible, then we'd be ok to remove that join. Luckily this is quite easy to prove in this case as in the example above the product_id is the primary key of the product table, therefore there can never be any duplicates product_id values in the table. So we can be certain that he two queries below will produce the same results:

SELECT s.product_id,s.quantity
FROM sale s
LEFT JOIN product p ON s.product_id = p.product_id;

and

SELECT s.product_id,s.quantity
FROM sale s;

Any although the join was explicitly asked for, the planner can simply pluck the join out of the plan and nobody should notice any sort of difference, apart from a nice gain in performance!

The work to implement this was done by Robert Haas and committed by Tom Lane back in September 2009 in commit 488d70ab46311386801c10691196ec8d755f2283