Wednesday, 16 December 2009

Correlated Delete & Update

Sometimes you want to delete or update rows in one table based on the existence of matching rows in another table. Examples include building up a list of records to be deleted in another table, or new data values being first loaded into a staging table before updating the main data set. I call this a Correlated Delete or Update, because the rows affected in the main table must correlate with matching rows in the other table. There are different ways we can write such correlated actions in Oracle. Unfortunately one way results in a very poor execution plan and incredibly bad performance – several hours estimated in one example – while the other way might take only a second or two. So it can be very useful to know which is which, and why.

This scenario is easiest seen with a Delete i.e. Delete from tableA where matching rows exist in tableB. We might naturally think of this as being a kind of join between tableA and tableB, assuming tableB has a foreign key to the primary key of tableA. It turns out that Sybase has implemented an extension to its DELETE and UPDATE statements that lets us use join syntax to specify this kind of correlated action, with an additional ‘from’ clause. In Sybase our delete would be:
delete tableA from tableA, tableB where tableA.pkey = tableB.fkey

Unfortunately this is an extension to the ANSI SQL syntax, and Oracle has no equivalent. So in Oracle we can refer to only one table in the DELETE statement itself, and must use sub-queries to refer to the other tables. One way to do this, which I came across the other day, is:
delete from tableA
where exists (select 1 from tableB where tableA.pkey = tableB.fkey)

On the face of it this is correct – we only delete rows in tableA that have a match in tableB. Unfortunately it suffers from terrible performance. In the case I came across, Oracle estimated it would take over three and a half hours to scan tableA (table names changed from the originals):
| Id | Operation                      | Name | Rows | Bytes | Cost (%CPU)| Time     |
|  0 | DELETE STATEMENT               |      |    1 |    33 |  1075K (2) | 03:35:07 |
|  1 |  DELETE                        | A    |      |       |            |          |
|* 2 |   FILTER                       |      |      |       |            |          |
|  3 |    TABLE ACCESS FULL           | A    | 320M |    9G |  1072K (2) | 03:34:30 |
|* 4 |    TABLE ACCESS BY INDEX ROWID | B    |    1 |    82 |      0 (0) | 00:00:01 |
|* 5 |     INDEX RANGE SCAN           | X1_B |    1 |       |      0 (0) | 00:00:01 |

This is because the sub-query is correlated – it refers to the outer table (tableA) and so must be executed for each row of tableA from the outer, main query. This results in a full table scan of tableA, and then a join to tableB for the correlated sub-query. If tableA is large and tableB is small with a list of rows to delete, then the performance is very bad indeed.

The solution is to rewrite the sub-query so that it is not correlated, and we can do that using IN – i.e. delete rows in tableA whose key value appears IN the set of key values from tableB. The delete then becomes:
delete from tableA where (tableA.pkey) IN (select tableB.fkey from tableB)

The meaning of this is exactly the same as the other delete, but now the sub-query is not correlated. Oracle can now choose an execution plan that scans tableB to produce a set of rows, and join to tableA using an index. This executes much faster, as you would expect.
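As a quick sanity check that the two phrasings really are logically equivalent, here is a small sketch using SQLite from Python as a stand-in for Oracle (the table and column names are just illustrative, matching those used above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# tableA is the large main table; tableB lists the keys of rows to delete.
cur.execute("CREATE TABLE tableA (pkey INTEGER PRIMARY KEY, payload TEXT)")
cur.execute("CREATE TABLE tableB (fkey INTEGER)")
cur.executemany("INSERT INTO tableA VALUES (?, ?)",
                [(i, "row%d" % i) for i in range(1, 7)])
cur.executemany("INSERT INTO tableB VALUES (?)", [(2,), (4,)])

# The uncorrelated IN form: delete rows of tableA whose key appears in tableB.
cur.execute("DELETE FROM tableA WHERE pkey IN (SELECT fkey FROM tableB)")
con.commit()

remaining = [r[0] for r in cur.execute("SELECT pkey FROM tableA ORDER BY pkey")]
print(remaining)  # rows 2 and 4 are gone: [1, 3, 5, 6]
```

Swapping in the EXISTS form (`WHERE EXISTS (SELECT 1 FROM tableB WHERE tableA.pkey = tableB.fkey)`) deletes exactly the same rows – the difference is purely in how the optimizer is able to execute it.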

| Id | Operation               | Name | Rows | Bytes | Cost (%CPU)| Time     |
|  0 | DELETE STATEMENT        |      |    1 |   113 |     6 (17) | 00:00:01 |
|  1 |  DELETE                 | A    |      |       |            |          |
|* 2 |   FILTER                |      |      |       |            |          |
|  3 |    NESTED LOOPS         |      |    1 |   113 |     6 (17) | 00:00:01 |
|  4 |     SORT UNIQUE         |      |    1 |    80 |      2 (0) | 00:00:01 |
|* 5 |      TABLE ACCESS FULL  | B    |    1 |    80 |      2 (0) | 00:00:01 |
|* 6 |     INDEX RANGE SCAN    | X1_A |    1 |    33 |      3 (0) | 00:00:01 |

As you can see, the estimated cost has come down from over 1 million to just 6, and the estimated elapsed time from three and a half hours to just one second – a major improvement. Obviously this improvement rests on the fact that tableB is at least 1,000 times smaller than tableA.

Either way, I believe that using IN is a better way of phrasing the sub-query, because it makes clearer to Oracle which way round the relationship is (A depends on B, not vice versa), and gives the optimizer more flexibility in how to execute the SQL statement. If tableB were big enough relative to tableA, there is no reason why the optimizer could not revert to the original execution plan – scanning A and joining to B.

Where it really makes a difference is when you have other conditions on tableB within the sub-query – not just the foreign key join to tableA. With EXISTS, Oracle ignores extra indexes on tableB and treats the correlated sub-query the same way – checking tableB once for each row in tableA. With IN, Oracle can take advantage of extra indexes on tableB if they help with the other conditions on its columns in the sub-query. It can then use an index for the initial access to the data in tableB, followed by the primary key index into tableA. This results in efficient execution plans for the DELETE.
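For example, suppose tableB had a status column (hypothetical, purely for illustration) marking which of its rows identify records due for deletion. With IN, the extra condition sits wholly inside the uncorrelated sub-query, where an index on tableB.status can drive it:

delete from tableA
where tableA.pkey IN (select tableB.fkey from tableB where tableB.status = 'DELETED')

Oracle can use the index on tableB.status for the initial access to tableB, and then the primary key index on tableA for the join.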

The problem also occurs with Updates, but is compounded by needing to refer to the same table twice within the update statement. Again, the Sybase syntax is much simpler and more straightforward:

update tableA
set tableA.column1 = tableB.column1, tableA.column2 = tableB.column2
from tableA, tableB
where tableA.pkey = tableB.fkey

In Oracle we need a sub-query for tableB in the WHERE clause, in the same way as for the Delete statement, and we also need a sub-query for tableB in the SET clause so that the columns are updated to the values in the matching row of tableB. This is important – without both sub-queries we would either update the wrong rows (WHERE clause problem) or update them to the wrong values (SET clause problem). The Oracle equivalent, using the same rewrite as before, is:

update tableA
set (column1, column2) = (select tableB.column1, tableB.column2
from tableB where tableA.pkey = tableB.fkey)
where pkey in (select fkey from tableB)

As already mentioned, this particular update will perform well when executed. Typically Oracle will scan tableB for the sub-query (assuming it is much smaller), join to tableA on its key columns (assuming there is an index on them), and then join back to tableB to get the values to update the columns to.
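The need for both sub-queries is easy to demonstrate. Here is a sketch of the same update in Python against SQLite (which, from version 3.15, also accepts the multi-column SET form) – again only a stand-in for Oracle, with illustrative names matching the statement above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# tableA is the main table; tableB is a staging table holding new values.
cur.execute("CREATE TABLE tableA (pkey INTEGER PRIMARY KEY, "
            "column1 TEXT, column2 TEXT)")
cur.execute("CREATE TABLE tableB (fkey INTEGER, column1 TEXT, column2 TEXT)")
cur.executemany("INSERT INTO tableA VALUES (?, ?, ?)",
                [(1, "old", "old"), (2, "old", "old"), (3, "old", "old")])
cur.execute("INSERT INTO tableB VALUES (2, 'new1', 'new2')")

# The SET sub-query supplies the new values; the WHERE sub-query restricts
# the rows. Without the WHERE clause, rows 1 and 3 would have their columns
# set to NULL, because the SET sub-query returns no row for them.
cur.execute("""
    UPDATE tableA
    SET (column1, column2) = (SELECT b.column1, b.column2
                              FROM tableB b WHERE tableA.pkey = b.fkey)
    WHERE pkey IN (SELECT fkey FROM tableB)
""")
con.commit()

rows = list(cur.execute("SELECT pkey, column1, column2 FROM tableA "
                        "ORDER BY pkey"))
print(rows)  # only row 2 is updated; rows 1 and 3 are untouched
```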