Wednesday, 16 December 2009

Correlated Delete & Update

Sometimes you want to delete or update rows in one table based on the existence of matching rows in another table. Examples include building up a list of records to be deleted in another table, or new data values being first loaded into a staging table before updating the main data set. I call this a Correlated Delete or Update, because the rows affected in the main table must correlate with matching rows in the other table. There are different ways we can write such correlated actions in Oracle. Unfortunately one way results in a very poor execution plan and incredibly bad performance – several hours estimated in one example – while the other way might take only a second or two. So it can be very useful to know which is which, and why.

This scenario is easiest seen with a Delete i.e. Delete from tableA where matching rows exist in tableB. We might naturally think of this as being a kind of join between tableA and tableB, assuming tableB has a foreign key to the primary key of tableA. It turns out that Sybase has implemented an extension to its DELETE and UPDATE statements that lets us use join syntax to specify this kind of correlated action, with an additional ‘from’ clause. In Sybase our delete would be:
delete tableA from tableA, tableB where tableA.pkey = tableB.fkey

Unfortunately this is an extension to the ANSI SQL syntax, and Oracle does not have an equivalent syntax. So in Oracle we can only refer to one table in the main table, and need to use sub-queries to refer to the other tables. One way I came across the other day to do this is:
delete from tableA
where exists (select 1 from tableB where tableA.pkey = tableB.fkey)

On the face of it this is correct - we only delete rows in tableA that have a match in tableB. Unfortunately it suffers from terrible performance. In the case I came across I saw that Oracle would take 3 hours to scan tableA (table names changed from their original ones):
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------
| 0 | DELETE STATEMENT | | 1 | 33 | 1075K (2)| 03:35:07 |
| 1 | DELETE | A | | | | |
|* 2 | FILTER | | | | | |
| 3 | TABLE ACCESS FULL | A | 320M| 9G| 1072K (2)| 03:34:30 |
|* 4 | TABLE ACCESS BY INDEX ROWID| B | 1 | 82 | 0 (0)| 00:00:01 |
|* 5 | INDEX RANGE SCAN | X1_B | 1 | | 0 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------

This is because the sub-query is correlated – it refers to the outer table (tableA) and so must be executed for each row of tableA from the outer, main query. This results in a full table scan of tableA, and then a join to tableB for the correlated sub-query. If tableA is large and tableB is small with a list of rows to delete, then the performance is very bad indeed.

The solution is to rewrite the sub-query so that it is not correlated, and we can do that using IN i.e. where a row in tableA has a matching row IN tableB. The delete then becomes:
delete from tableA where (tableA.pkey) IN (select tableB.fkey from tableB)

The meaning of this is exactly the same as the other delete, but now the sub-query is not correlated. Oracle can now choose an execution plan that scans tableB to produce a set of rows, and join to tableA using an index. This executes much faster, as you would expect.

------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------
| 0 | DELETE STATEMENT | | 1 | 113 | 6 (17)| 00:00:01 |
| 1 | DELETE | A | | | | |
|* 2 | FILTER | | | | | |
| 3 | NESTED LOOPS | | 1 | 113 | 6 (17)| 00:00:01 |
| 4 | SORT UNIQUE | | 1 | 80 | 2 (0)| 00:00:01 |
|* 5 | TABLE ACCESS FULL | B | 1 | 80 | 2 (0)| 00:00:01 |
|* 6 | INDEX RANGE SCAN | X1_A | 1 | 33 | 3 (0)| 00:00:01 |
------------------------------------------------------------------------------------------

As you can see the estimated cost has reduced from over 1 million to just 6, and the estimated elapsed time from 3 and a half hours to just one second! A major improvement. Obviously this improvement is down to the fact that tableB is at least 1,000 times smaller than tableA.

Either way, I believe that using IN is a better way of phrasing the sub-query, because it is clearer to Oracle which way round the relationship is (A depends on B, not vice versa), and gives the optimizer more flexibility of how to execute the SQL statement. If tableB were big enough with respect to tableA, then there is no reason why the optimizer could not go back to the original execution plan – scanning A and joining to B.

Where it really makes a difference is where you have other conditions on tableB within the sub-query – not just the foreign key join to tableA. When using EXISTS, Oracle will ignore extra indexes on tableB and treat the correlated sub-query in the same way – check tableB for each row in tableA. When using IN, Oracle can take advantage of extra indexes on tableB if they help with other conditions on columns of tableB in its sub-query. Thus it can use an index for the initial access to data in tableB, and then use the primary key index into tableA. This results in efficient execution plans for the DELETE.

The problem also occurs with Updates, but is compounded by needing to refer to the same table twice within the update statement. Again, the Sybase syntax is much simpler and more straightforward:

update tableA
set tableA.column1 = tableB.column1, tableA.column2 = tableB.column2
from tableA, tableB
where tableA.pkey = tableB.fkey

In Oracle we need to use a sub-query for tableB in the WHERE clause in the same way as for the Delete statement, and we also need a sub-query for tableB in the SET clause so that it is updated to the values in the matching row in tableB. This is important – without both sub-queries we would either update the wrong rows (WHERE clause problem) or update them to the wrong values (SET clause problem). The Oracle equivalent, using the same rewrite as before is:

update tableA
set (column1, column2) = (select tableB.column1, tableB.column2
from tableB where tableA.pkey = tableB.fkey)
where pkey in (select fkey from tableB)

As already mentioned this particular update will perform well when executed. Typically Oracle will scan tableB if it is much smaller for the sub-query, then join to tableA on its key columns assuming there is an index on them, and then join back to tableB to get the values to update the columns to.

Monday, 19 October 2009

Solving Database Design in Development

Previously I have looked at the challenges involved in being able to support a changing database design in an application development environment, which were:
  1. Support change at any time
  2. Ensure database designed and modelled properly
  3. Record each individual change, to enable metamorphosis of older databases
  4. Allow multiple versions or branches of the Database Design to exist, and be changed individually or collectively
Although I started off on this problem of database structure change from the context of the need for it when using Agile development methods, the actual problem is not unique or specific to Agile. It is a generic problem to the development of any application that involves the use of a relational database to store its data in.

What I now want to do is list the key features needed in a solution to this problem which, if all assembled together, would deliver a viable solution to this problem.
  • With Application Source Code we only care about the final state it ends up in. But with a Database Design we care about each individual change, and what gets changed. We need to record each of these Databse Changes individually, so that they can be applied to other instances of that database, as well as recording the final state of the database design itself in the Database Model.
  • Changes are physically made to a Database Instance by executing appropriate SQL statements. This is the only way to make a change to an existing database instance.
  • Each "Database Change" should be an atomic change to a database - it either completes successfully or not at all. Changes cannot be partially implemented on a database instance.
  • The Database Changes must be executed in the correct sequence on each database instance, for repeatability and due to any dependencies between the changes.
  • Changes to the Database Design should be formally Requested, and these requests recorded somewhere to provide an audit trail - Who, When, What Changed, Where in the Database.
  • Changes must be reviewed and approved by the Data Modeller, to ensure a good, scalable design. The Data Modeller is responsible for updating the Database Model - the record of the Database Design - and producing the SQL statement that implements that change. Some of this could be done automatically using appropriate tools, or manually. All that matters is that it does happen.
  • The Database Designer or Data Modelling role now becomes a part time role during the whole lifecycle of the application's development, instead of being a full time role only in the first phase of its development. It is now likely that all development cycles will involve the Database Designer, who becomes a central and critical member of the development team. The role itself, however, could be fulfilled by someone who has other roles too within the development cycle, or shared amongst a number of people, because it is not a full time role. This could be someone who is also a developer or a database administrator.
  • Each Database Change is given its own unique Identifier, which must be globally unique over all the other Database Changes. This identifies the particular Database Change, regardless of which Database Instances it is applied to, or which versions of the Application it appears in.
  • Changes to a particular Database Instance must be recorded within the Database Instance itself. This is the only way to avoid applying the same changes twice to a Database Instance. This implies some form of Change History table within each Database Instance.
  • This Change History table needs to record the Change Identifer for those changes applied, and the date and time when it was done. Other data could also be recorded, but these two are the minimum required.
  • The set of changes should be stored in a single file, often termed the Change Log. Each entry in this will include the Change Identifier and the SQL statement that implements that change. All necessary changes can be easily located and sequenced this way.
  • This Change Log needs to be well structured, so that a program can read and decode the Changes in it and execute the SQL statements for changes not yet applied to a given Database Instance. This is probably best done using XML.
  • There must be separate instances of the Database Model / Database Design and the Change Log file in each version / branch of the Application source code. This allows changes to be made independently to each branch of the Application.
  • In reality SQL statements are often specific to one database product, such as Oracle, Sybase, SQL*Server or MySQL. The Change Log will need to record the separate SQL statements for each supported database product for each Database Change. Thus each Change will have an Identifier, and a SQL statement per database product, clearly labelled with the database product name.
  • The whole solution is brought together in a program that is run during an Application Upgrade. It opens the Change Log XML file and reads in each Change. If that Change has not yet been applied to this particular Database Instance, then the corresponding SQL statement is executed and a record inserted into the Change History metadata table.
This is my interpretation of how to solve the problem of a changing database design during application development. But it turns out of course that others have been through exactly the same process before me and come up with exactly the same answers. And in fact, once I had thoroughly understood the nature of the problem, I could immediately see that these other people too had understood it and had arrived at pretty much the same conclusions I would. So although I did work through the problem from first principles myself, simply because that is the best way for me to fully understand a topic and appreciate whether any "solution" is right or not, I have no doubt borrowed from some of the material I have read in the way that I have described this solution here.

I still find some of the descriptions I found on the Web around Database Design, Changes and Agile Development unclear around this particular problem of implementing changes to a database design during application development. Quite a lot discuss the need to change the database design, and how to model different types of changes (refactoring is often mentioned). But almost none discuss how such changes get implemented in real deployed databases, and how you maintain a database over the long term with a series of changes to the application that uses it. None of them covered enough of it at once to leave me feeling that I had a real solution I could go out and apply. Hence my need to work through the problem from first principles for myself, to get the full picture as it were.

Again, one article that did help to clarify the nature of the problem a lot for me was Rethinking Agility in Databases: Evolution from Hexagon Software. This brought home the message that Databases must be treated differently to Applications in how you change them.

And the Change Log file of SQL statements was explicitly mentioned by Peter Schuh and Pramod Sadalage, who have put this into practice themselves on large projects. See Agility and the Database by Peter Schuh for instance.

So although I have not invented anything new here that has not already been described by other people elsewhere, in one form or another, I have tried to bring together all of the necessary ingredients for a solution in a coherent manner. As I said, this was my way of working through the database design problem to arrive at a solution that I fully understood and felt able to go out and apply in the real world.

I may revisit this again and try and outline a working solution to this i.e. what would need to be delivered to implement the kind of solution I have described here. Anybody reading this and want to know more?

Monday, 5 October 2009

The Challenge of Agile Database Design

Previously I have said that for what I call Enterprise Applications data modelling is important because the data itself has meaning and value outside of the application that manipulates it, and that scalability requires a good database design because you cannot just add scalability on afterwards. The Data Model or Database Design defines the structure of the database and the relationships between the data sets, and is part of the foundation on which the application is built. And a good database design is essential to achieve a scalable application. Which leads to the challenge - How do I go about designing a database in an Agile development project, when not all the requirements are known initially? What techniques should I be using?

Having read as much as I can find on this topic, I think I have a better understanding of the nature of this challenge. And this is what I want to explain here - the nature of the challenge of Agile Database Design.

First, we should restate the problem in a more positive way. Rather than "when not all the requirements are known", we can say "How should I be designing an Agile database that will change in the future". Change happens all the time, in one form or another, and is inevitable. We need to embrace it, assume that our database design will change over time, and find ways to support this changing database. This need for change over time is not unique to Agile development, and is in fact really a universal problem for applications and databases.

Second, we need to accept that we cannot skip the design stage for the database in any way. We must design the database properly - at least those parts of the database that we need to design now - and produce a correct model. As I argued before, a good and correct database design is essential to a well performing application, and to the integrity of the data itself.

The outcome of the database design is documentation on the structure of the database, often termed the Data Model. There are many tools that can be used to help you design your database and record the details of the model. Any such tool chosen should enable and support small and frequent changes to the data model, as this is a major requirement of Agile development. But you could also use tools as simple as a spreadsheet and a set of diagrams.

Third, a Database Design is not the same as Application Source Code. There are similarities, but they are actually different beasts. Both act as Blueprints for a thing that can be built - an Instance of that Design. And both can be changed over time as needed. The difference is that when the Source Code to an Application changes, we rebuild the Application completely, typically compiling all source code files. We have produced a new instance of the Application Program, as a next generation instance of it. This is an example of "Evolution": the Blueprint changes, and a brand new instance is created using it. Existing instances are not modified, but instead "replaced" by the newly created instance.

Databases are the opposite. Changes must be applied "in place" directly to each Database Instance (a real database on a computer system), to modify it into the latest database design. Such an in place changing in the structure of a thing is termed "Metamorphosis", and is quite different from "Evolution".

While I have appreciated for some time that there is a difference in type between an Application's Source Code and a Database's Design, I read about the explicit nature and form of this difference (Evolution versus Metamorphosis) in an article on Rethinking Agility in Databases: Evolution from Hexagon Software. All credit for this distinction between them and the terminology goes to them.

Application Source Code editing and maintenance methods will not work for a Database Design. A "replace and rebuild" methodology cannot be used for databases, which need a change to the design to be applied "in place" instead to each instance of that database.

Fourth, there may be multiple separate versions or branches of the Database Design to be maintained, as a result of the existence of separate branches of the Application Source Code. It is common for Application Source Code to be branched when major releases are done, typically producing a new branch for support of that release, and a new branch for the next major release. It is possible that the Database Design may need to change in different ways in different branches of the Application. Likewise, the same change may need to be made to the Database Design in different branches - correcting a bug for instance. We need a way to record each version of the Database Design separately to support this.

These then form the Challenges of delivering a changing Database Design:

1. Supporting change at any time to the Database Design

2. Ensuring that the Database is Designed and Modelled properly

3. Recording each individual change to the Database Design, so that older instances of that Database Design can metamorphose by having these changes applied to them

4. Allowing multiple versions or branches of the Database Design to exist, and to be changed individually or collectively

These challenges are not specific to Agile Development, and apply to any large enough application software development. Addressing these challenges will provide a solution that could be used in any database oriented application development, whether using Agile development methods or not.

In the next post I hope to start describing the outlines of what you would need in order to achieve what I call "Agile Database Design" that addresses these challenges. And then subsequently how to meet this in a minimal way.

Monday, 21 September 2009

Data Modelling & Scalability

For what I call Enterprise Class Applications, scalability is an important factor. Scalability can occur in multiple directions - increases in user count, transaction volume, database size. These can occur individually or in combination together. The application must perform well when any of these scales in size, and performance must not degrade severely. For this to be true, scalability must be designed in to the application from the beginning. You cannot simply add scalability onto an application later on, much like you cannot just add security on. There is no magical sticking plaster that you can apply to an existing application and suddenly have it scale wonderfully. If there was, then everyone would be doing it.

No, scalability must be designed into the application and the way it works. This leads to the corollary that only good designs scale well, and bad ones don't. Skipping the database design phase in any application development can only result in a poor design, with associated poor scalability. At some point as the workload increases on the application, performance will hit the dreaded knee and throughput will level off and response times will increase.

Poor performance results from things like lack of primary and foreign keys, general purpose tables with a type field indicating which data fields are relevant, use of key fields that only have a few values ('Y' or 'N' values are typical of this), outer joins between tables where the child records may not exist. Any of these can result in weak SQL when used in a query, and poor execution as a result.

So if the scalability of your application is important to you, and it may not be important for everyone, then make sure you design your database properly rather than just implementing the first set of tables that comes into someones mind. And this design requires capturing information about all of the entities that need to be stored, and modelling their relationships both at a high, conceptual level and then at a detailed low, physical level.

To ensure all entities and their relationships are captured, clearly requires knowledge about all of the entities within the application. But in an iterative development, as typified by Agile, not all of the entities are known at the beginning when the first version of the database design is needed. So how does Agile deal with this? It is the one thing lacking from the documentation I have seen on Agile so far - how do you produce a good, scalable database design when it all takes place in an iterative development environment and when the full requirements and data entities are not yet known?

I've heard of Agile projects delivering poor database designs from other database architects and administrators, and know of one specific one at the moment where no database skilled person has been involved at all.

For me this is a key sticking point with what I have seen of Agile Development, especially the aspect of delaying decisions for as long as possible. There seems to be this perception that the database design "can be finished off later on", after more of the application has been developed. But when the database design becomes an issue and needs to be changed, it is too late, as the time and effort involved in changing and rewriting the application to reflect these database changes is too great.

Given that a scalable application requires a good database design to store its data in, why is it that database design seems missing from most descriptions of Agile Development? It may get mentioned every once in a while, in general lists of development tasks, but I have seen very little on the specifics of what I would term "Agile Database Design".

My concern is that by ignoring database design within Agile Development, it will most of the time result in a poor database design, and so poor application scalability.

Either way it seems to me that to ensure you do end up with a good, scalable database design, you need a Data Modeler within the development team, responsible for the database design. All database changes should be routed through the Data Modeler, rather than a developer making them directly themselves.

Tuesday, 15 September 2009

Reported Execution Plan Anomalies

In doing some tests of the effects of different indexes on execution plans, I came across a small anomaly with the way the execution plan was being reported. I was using the Autotrace feature of SQL*Plus to report the execution plan after a SQL query was executed. I noticed that the costs reported for each step were not completely true, although the overall execution total cost was correct.

The SQL query I was playing with was:

select count (*) from (
select t1.i1, t1.i2, t1.c1, t2.i1, t2.i2, t2.c1
from t1 join t2 on t1.i3 = t2.i1
where t1.i2 = 111
and t2.i4 = 11
)

With indexes on the I1 columns in both tables, and the I3 column of table T1 (foreign key). Oracle will therefore scan T2 and join to T1 using the index on I3, as seen in this reported plan:

--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 18 | 1236 (1)| 00:00:15 |
| 1 | SORT AGGREGATE | | 1 | 18 | | |
|* 2 | TABLE ACCESS BY INDEX ROWID| T1 | 1 | 9 | 102 (0)| 00:00:02 |
| 3 | NESTED LOOPS | | 10 | 180 | 1236 (1)| 00:00:15 |
|* 4 | TABLE ACCESS FULL | T2 | 20 | 180 | 195 (1)| 00:00:03 |
|* 5 | INDEX RANGE SCAN | I2_T1 | 100 | | 2 (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("T1"."I2"=111)
4 - filter("T2"."I4"=11)
5 - access("T1"."I3"="T2"."I1")

As can be seen, the access to the data row of table T1 is performed after the join to T2 is complete i.e. outside and after the NESTED LOOP join has completed. However, the reported cost of 1236 for the Nested Loop is incorrect. It should be 236, based on the other reported values for cost and cardinality (row count). The cost of a Nested Loop join is the cost of the inner, first table access, plus the cost of the outer, second table access multiplied by the number of rows retrieved from the inner, first table (cardinality):

195 + (20 * 2) + CPU = 195 + 40 + 1 = 236.

Likewise the row count of the T1 table access is incorrectly shown as 1, when it should be 10, the same as the Nested Loop count. The cost of accessing data rows in T1 that match a single value in the I2_T1 index is about 100. (This can be verified by looking at the values reported in the 10053 trace file for Clustering Factor and Filter Factor, or deriving these manually). The cost for accessing all necessary data rows in T1 as a result of the Nested Loop is the single access cost times the cardinality of the number of times it is performed i.e. 100 * 10 = 1000.

Thus the total cost for the two table join query is 236 (NL) + 1000 (T1 data) = 1236. And indeed the total cost of the query is correctly reported as this. It is just the 2 inner steps of NESTED LOOPS and TABLE ACCESS BY INDEX ROWID that have their costs and row count incorrectly reported, causing this anomaly.

I guess this anomaly is caused by the execution plan moving the table access to outside the nested loop, rather than inside where it most often occurs. If the table access was instead inside the nested loop, between steps 4 and 5 as the parent of 5, then the reported costs would be correct. Most probably the Optimizer arrived at the original plan, then modified the execution steps when it realised it could move the table access outside of the nested loop. Unfortunately, it did not adjust the cost values for this, and so they are mis-reported.

Monday, 7 September 2009

Index Selectivity Analysis

Why does Oracle sometimes not use an index when executing a SQL query? Even when the index includes most of the columns from a table explicitly referenced in the query? The obvious answer is that using the index does not produce a lower cost than other access methods, such as a full table scan. The Optimizer has calculated that the number of disk I/Os needed to use the index (depth of index plus matching leaf blocks) and then get the data blocks (number of matching rows) would be greater than the number of disk I/Os needed to do a full table scan with multi-block reads.

In a case like this, such an index will almost never be used by Oracle even though the index includes columns directly referenced in the WHERE clause of a query. This is because the number of estimated disk I/Os to use it will always be greater than the number of disk I/Os to do a full table scan. There might be exceptions in cases where the data is very skewed and you have a histogram on the columns in the index, and you are selecting a value with a low occurrence count. But generally such an index is unlikely to be used by any queries you run.

Which raises the question, how can we find out which indexes in our database are selective enough to be used by a query before we actually run those queries? Or conversely, which indexes are not selective enough given the data in the database and will never be used? It is a bit too late to find out when we are running a query and Oracle is doing a full table scan and taking a long time to do it. Of course we could do an EXPLAIN PLAN of the query beforehand, but execution plans can change over time dependent on the volumes of data in a table, and the distribution of data values.

We would like to have a query that we can run on a complete database, and which will tell us which indexes are not selective given the actual data in the tables. This would flag weak indexes to us, regardless of which queries might be running i.e. indexes that are unlikely to be chosen by the Optimizer even if all columns in them are referenced in the query.

How would such a query work? Well, we know how the Optimizer costs the different access methods to data in a table (see previous posts on Full Table Scans and Index Scans). We can calculate each of these for a table and an index, and see if the table scan cost is lower than the index access cost. In fact, using a bit of maths we can manipulate the formula for these two costs together in a way that results in a more direct comparison. If the maths bit is too much for you then just skip ahead to the final formula followed by the SQL to run it.

The costs of each type of access are:
  • FTS Cost = (Blocks * mreadtim) / (MBRC * sreadtim)
  • Index Cost = Levels + (Leaf Blocks + Clustering Factor) * Filter Factor
And an index will be ignored when:
  • Index Cost > Full Table Scan Cost
Assuming that the table is large, then the number of blocks in the index will be far greater than the depth or number of levels in the index tree, so we can ignore this. This gives the relationship between the two that an index will be ignored when (approximately):
  • (Leaf Blocks + Clustering Factor) * Filter Factor > (Blocks * mreadtim) / (MBRC * sreadtim)
Dividing both sides gives:
  • Filter Factor > (Blocks * mreadtim) / (MBRC * sreadtim * (Leaf Blocks + Clustering Factor))
Now when there is no histogram on the columns or no data skew then the Filter Factor will be the inverse of the Number of Distinct Values or Keys i.e.
  • Filter Factor = 1 / Distinct Keys
Inverting both sides, and the relationship too gives us
  • Distinct Keys < (MBRC * sreadtim * (Leaf Blocks + Clustering Factor)) / (Blocks * mreadtim)
All of these values are stored in the Data Dictionary of the database in one place or another. So we can write a single SQL query that can calculate both sides of the comparison directly from the database itself. Note that this assumes that the statistics on your tables and indexes are up to date, and reflect the true data distributions in the tables.

To make some things easier we can group together values that are table and index independent, finally giving us that for the index to be ignored the following must be true:
  • Distinct Keys < (MBRC * sreadtim / mreadtim) * (Leaf Blocks + Clustering Factor) / Blocks
We can calculate the expression "(MBRC * sreadtim / mreadtim)" once, and then use this value with the other values on each index on each table in a database. We can then list all the indexes that fail the test i.e. indexes that will be ignored because their cost of access is greater than a full table scan.

For a system with no System Statistics gathered, which is actually the majority of cases, then the following calculations are used for the disk read times:
  • sreadtim = ioseektim + (db_block_size / iotfrspeed)
  • mreadtim = ioseektim + (db_file_multiblock_read_count * db_block_size / iotfrspeed)
By default ioseektim will be 10 milliseconds and iotfrspeed will be 4 KB / millisecond.

This means that the only non-table dependent values in the fully expanded formula are for db_block_size and db_file_multiblock_read_count. These can only be obtained directly from the Oracle database if the user has SELECT permission on the dynamic performance views (V$PARAMETER specifically). And generally this is not true for normal users. To work around this, the SQL script below simply prompts the user to enter these two values.

The SQL script then calculates the derived values, and then uses these on all indexes on all tables to list those that fail the test i.e. number of distinct values in the index is too low.

As mentioned earlier, we can only ignore the depth of the index when the table is large enough, so the SQL includes an additional constraint on the number of blocks in a table, which is hardcoded at 100 blocks. You could easily change this to another value if desired.

This is a SQL*Plus script and must be run via SQL*Plus, as it uses SQL*Plus features such as substitution variables. And remember that your statistics must be up to date on each table and index in the database.

--
-- Check which indexes on all tables are better or worse than a full table scan
-- When the number of distinct values in a column is low, a full table scan
-- is more efficient (less I/Os) than using an index.
--
set define on
set verify off
set heading off
--
prompt
prompt Report to show which indexes are selective enough and which are not.
prompt Non-selective indexes will be ignored and full table scans preferred.
prompt

define small_table_threshold = 100

accept block_size prompt 'Enter database block size in bytes (e.g. 8192) : '
accept mbrc prompt 'Enter db_file_multiblock_read_count value (e.g. 8): '

prompt Using following database wide configuration settings:
prompt : Database Block size is &&block_size (bytes)
prompt : db_file_multiblock_read_count is &&mbrc
prompt : Threshold for a small table is &&small_table_threshold (blocks)
prompt : (Smaller tables than this are ignored in this report)
prompt : [If these are incorrect, edit this script and re-run]

column sreadtim new_value sreadtim
column mreadtim new_value mreadtim
column effective_mbrc new_value effective_mbrc

set termout off
-- Hide from screen internal calculations of derived values
select 10 + (&&block_size / 4096) sreadtim from dual ;
select 10 + (&&mbrc * &&block_size / 4096) mreadtim from dual ;
select &&mbrc * &&sreadtim / &&mreadtim effective_mbrc from dual ;
set termout on

prompt
prompt Assuming that system wide statistics have these derived values:
prompt : Single Block Read Time used is &&sreadtim (milliseconds)
prompt : Multi Block Read Time used is &&mreadtim (milliseconds)

set heading on
column table_name heading 'Table' format a20
column index_name heading 'Index Name' format a20
column distinct_keys heading 'Number of|Distinct|Values' format 999,999
column min_ndv heading 'Min|NDV' format 999,999
column selectivity heading 'Selectivity'
column sel heading 'Sel?'
column blocks heading 'Blocks' format 999,999
column mb heading 'Table|MB' format 999,999

select table_name, index_name, distinct_keys, min_ndv,
case
when (distinct_keys < min_ndv)
then 'N'
else 'Y'
end sel,
(blocks * &&block_size / (1024 * 1024)) mb
from ( select t.table_name, t.blocks, t.num_rows num_trows,
i.index_name, i.num_rows num_irows, i.distinct_keys,
&&effective_mbrc *
((i.leaf_blocks + i.clustering_factor) / t.blocks) min_ndv
from user_ind_statistics i, user_tables t
where t.table_name = i.table_name
and i.num_rows != 0
)
where distinct_keys < min_ndv
and blocks >= &&small_table_threshold
order by table_name, index_name ;

prompt Selectivity (S) indicates whether index will be chosen or not in a query
prompt given the values of the database settings you have entered.
prompt

Running this should list indexes that will NOT generally be used by the Optimizer for any query you might run that references that table. There might be exceptional cases where the Optimizer might use such an index e.g. an Index Fast Full Scan, but generally such indexes would not be used as expected for direct access to the matching rows.

As ever, remember that higher values for db_file_multiblock_read_count lower the cost for a full table scan, and so lower the threshold at which a full table scan becomes cheaper than an index access.

Monday, 31 August 2009

Hash Join Costing

As already discussed in previous posts Oracle has different join methods available to it, and will choose the one with the lowest cost for any specific join. So far I have described how the Nested Loop and Sort Merge Join methods work. Now I will describe the Hash Join method.

It shares some similarities with the Sort Merge Join method - both data sets are accessed separately, and then joined together - but is obviously different in how it does the join. The cost formula for this is:
  • [Outer Access Cost] + [Inner Access Cost] + [Hash Join Cost]
As for the Sort Merge method, this uses an area of memory and so its cost is affected by whether the controlling data set is smaller than the available memory or not. If it is smaller, then the Hash Join will take place in memory, and the Hash Join Cost component is negligible being only a CPU based cost. If the data set is larger, then the join takes place in stages, with intermediate results being written to disk. Hence the Hash Join Cost component will increase significantly due to the disk I/Os involved in writing out the overflow data, and reading it back in again later on.

The memory available for the Hash Join is the same as for the Sort Merge join, being derived from the pga_aggregate_target initialization parameter.

The main difference between the Sort Merge and Hash Join methods is that the Sort Merge wants to have both data sets in memory before joining them together, whereas the Hash Join method only wants one of the data sets in memory before starting the join. This makes a significant difference to the size of the data sets that can be processed by the Hash Join before it overflows to disk, compared to the Sort Merge join method.

Also the Sort Merge Join wants the data sets sorted before they can be merged, which has an associated cost even if it is only CPU because the data can be sorted in memory. The Hash Join however does not need the data sets pre-sorted in any way, and so avoids this extra cost. This difference ensures that the Hash Join is almost always cheaper than a Sort Merge Join, except under exceptional circumstances.

The Hash Join is particularily efficient compared to the Sort Merge Join where a small data set is joined to a much larger data set. A Sort Merge Join requires that both data sets be sorted, which needs memory for each data set. Whereas a Hash Join only requires memory proportional to the smaller first data set and is unaffected by the size of the second data set. This means that a Hash Join can avoid needing to overflow to disk when one of the data sets is much larger, compared to a Sort Merge Join which would need to overflow to disk to sort the larger data set.

The Hash Join works by reading all of the rows in one of the data sets - termed the Outer data set as before - and puts all of them into a table like structure in memory, assuming that sufficient memory is available. The data rows are placed in the memory table according to the values in the columns used for the join. These values are used in a hash function - hence the name of the join method - that returns a value within the range of the size of the table in memory. The data row is then placed in this location in the table - subject to some caveats.

Then the second, Inner data set is read, the same hash function applied to the same join columns in this data set, and the corresponding location in the memory table read. Now either there is a matching row in this location or there is not. If there is a matching row, then the join takes place and a joined row is produced. If there is no matching row then the inner data row is simply discarded.

Thus the Hash Join method can be relatively fast, with little overhead apart from the hash function itself.

The only caveat, as mentioned, is that it is possible for more than one outer data row to hash to the same location in the memory table. When this happens the data row is basically stored in the next available higher location. When the inner data row is hashed, Oracle compares the next locations in the memory table too, while they have the same join values. This handles the situation of duplicate hash values from different data rows, with minimal overhead.

Clearly the size of the hash table in memory is determined by the number of rows in the first or outer data set. The Optimizer is able to swap the data sets around if one is larger than the other, so that it processes the smaller one first. If this produces a significantly lower cost, then this swapped hash join will be used.

When the size of the first data set is larger than the available memory, Oracle proceeds along the following lines. Note that I have simplified the description, and omitted some of the details, but the gist of it is correct. When hashing the first, outer data set, rows with hash values outside the size of the memory table are written out to disk. But those that do hash within the in memory range are put into the memory table. Then the second, outer data set is processed, and hashed in the same way. Those that hash into an in memory range are processed as normal - join or no join - and those that don't are written out to disk.

The range of hash values covered by the in memory table is adjusted, to cover the second range of values, and the two pending data sets on disk joined. The saved first data set is read back in, and either hashed into memory or written out to another temporary area on disk. Then the second data set is read in, hashed and either joined or not joined if it falls within the in memory range, or written out to another temporary area. This is repeated until the remaining data from the first, outer data set fits in memory, and the second, inner data set processed the one final time.

Note that Oracle has certain optimizations in the implementation of this hash mechanism to both discard rows from the second data set that will not join as early as possible, and to write the temporary pending data sets into pre-hashed sub-sets. These optimizations minimize the total number of disk I/Os that need to be done and minimize multiple passes on the same data rows. Thus the actual Hash Join mechanism differs in detail from my simplified summary of it.

As ever, the costs can be verified by looking at the trace file output from the 10053 event. As described, when the outer data set is small enough to fit in memory the cost of the Hash Join stepitself is very small, being only CPU operations. The majority of the total Hash Join cost is simply the sum of the access cost to the two data sets being joined.
  • [Outer Access Cost] + [Inner Access Cost] + [Hash Join Cost]
Where the data sets are much larger than the available memory, then the Hash Join cost step itself increases significantly. Unfortunately I have not yet been able to determine a single rule that can be applied to determine this cost. Clearly it is dependent on the size of the outer data set, where the overflow data that will not fit in memory is written out to disk, and on the size of the inner data set that is also processed and written out to disk. I have not been able to determine yet how the split between these two affects the reported cost of the Hash Join itself (a lack of time on my part unfortunately). Rather than delay a posting on this I thought I would post what I have now, and later post anything else I discover about the Hash Join.