Wednesday 16 December 2009

Correlated Delete & Update

Sometimes you want to delete or update rows in one table based on the existence of matching rows in another table. Examples include building up a list of records to be deleted in another table, or new data values being first loaded into a staging table before updating the main data set. I call this a Correlated Delete or Update, because the rows affected in the main table must correlate with matching rows in the other table. There are different ways we can write such correlated actions in Oracle. Unfortunately one way results in a very poor execution plan and incredibly bad performance – several hours estimated in one example – while the other way might take only a second or two. So it can be very useful to know which is which, and why.

This scenario is most easily seen with a Delete i.e. delete from tableA where matching rows exist in tableB. We might naturally think of this as being a kind of join between tableA and tableB, assuming tableB has a foreign key to the primary key of tableA. It turns out that Sybase has implemented an extension to its DELETE and UPDATE statements that lets us use join syntax to specify this kind of correlated action, with an additional ‘from’ clause. In Sybase our delete would be:
delete tableA from tableA, tableB where tableA.pkey = tableB.fkey

Unfortunately this is an extension to the ANSI SQL syntax, and Oracle does not have an equivalent. So in Oracle we can only name one table in the DELETE statement itself, and need to use sub-queries to refer to the other tables. One way I came across the other day to do this is:
delete from tableA
where exists (select 1 from tableB where tableA.pkey = tableB.fkey)

On the face of it this is correct - we only delete rows in tableA that have a match in tableB. Unfortunately it suffers from terrible performance. In the case I came across I saw that Oracle would take 3 hours to scan tableA (table names changed from their original ones):
----------------------------------------------------------------------------------------
| Id  | Operation                      | Name | Rows  | Bytes | Cost (%CPU) | Time     |
----------------------------------------------------------------------------------------
|   0 | DELETE STATEMENT               |      |     1 |    33 | 1075K   (2) | 03:35:07 |
|   1 |  DELETE                        | A    |       |       |             |          |
|*  2 |   FILTER                       |      |       |       |             |          |
|   3 |    TABLE ACCESS FULL           | A    |  320M |    9G | 1072K   (2) | 03:34:30 |
|*  4 |    TABLE ACCESS BY INDEX ROWID | B    |     1 |    82 |     0   (0) | 00:00:01 |
|*  5 |     INDEX RANGE SCAN           | X1_B |     1 |       |     0   (0) | 00:00:01 |
----------------------------------------------------------------------------------------

This is because the sub-query is correlated – it refers to the outer table (tableA) and so must be executed for each row of tableA from the outer, main query. This results in a full table scan of tableA, and then a join to tableB for the correlated sub-query. If tableA is large and tableB is small with a list of rows to delete, then the performance is very bad indeed.

The solution is to rewrite the sub-query so that it is not correlated, and we can do that using IN i.e. where a row in tableA has a matching row IN tableB. The delete then becomes:
delete from tableA where (tableA.pkey) IN (select tableB.fkey from tableB)

The meaning of this is exactly the same as the other delete, but now the sub-query is not correlated. Oracle can now choose an execution plan that scans tableB to produce a set of rows, and join to tableA using an index. This executes much faster, as you would expect.

--------------------------------------------------------------------------------
| Id  | Operation              | Name | Rows | Bytes | Cost (%CPU) | Time     |
--------------------------------------------------------------------------------
|   0 | DELETE STATEMENT       |      |    1 |   113 |     6  (17) | 00:00:01 |
|   1 |  DELETE                | A    |      |       |             |          |
|*  2 |   FILTER               |      |      |       |             |          |
|   3 |    NESTED LOOPS        |      |    1 |   113 |     6  (17) | 00:00:01 |
|   4 |     SORT UNIQUE        |      |    1 |    80 |     2   (0) | 00:00:01 |
|*  5 |      TABLE ACCESS FULL | B    |    1 |    80 |     2   (0) | 00:00:01 |
|*  6 |     INDEX RANGE SCAN   | X1_A |    1 |    33 |     3   (0) | 00:00:01 |
--------------------------------------------------------------------------------

As you can see the estimated cost has reduced from over 1 million to just 6, and the estimated elapsed time from 3 and a half hours to just one second! A major improvement. Obviously this improvement is down to the fact that tableB is at least 1,000 times smaller than tableA.

Either way, I believe that using IN is a better way of phrasing the sub-query, because it is clearer to Oracle which way round the relationship is (A depends on B, not vice versa), and it gives the optimizer more flexibility in how to execute the SQL statement. If tableB were big enough with respect to tableA, then there is no reason why the optimizer could not go back to the original execution plan – scanning A and joining to B.

Where it really makes a difference is when you have other conditions on tableB within the sub-query – not just the foreign key join to tableA. When using EXISTS, Oracle will ignore extra indexes on tableB and treat the correlated sub-query in the same way – check tableB for each row in tableA. When using IN, Oracle can take advantage of extra indexes on tableB if they help with the other conditions on columns of tableB in its sub-query. Thus it can use an index for the initial access to data in tableB, and then use the primary key index into tableA. This results in efficient execution plans for the DELETE.
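
As a sketch of what this might look like with an extra condition (the status column and its value here are my own illustration, not from the original system):

delete from tableA
where tableA.pkey IN (select tableB.fkey from tableB
                      where tableB.status = 'TO_BE_DELETED')

With an index on tableB.status, Oracle can use that index for the initial access to tableB and then the primary key index into tableA for the join, rather than scanning tableA and probing tableB once per row as the EXISTS form would.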

The problem also occurs with Updates, but is compounded by needing to refer to the same table twice within the update statement. Again, the Sybase syntax is much simpler and more straightforward:

update tableA
set tableA.column1 = tableB.column1, tableA.column2 = tableB.column2
from tableA, tableB
where tableA.pkey = tableB.fkey

In Oracle we need to use a sub-query for tableB in the WHERE clause in the same way as for the Delete statement, and we also need a sub-query for tableB in the SET clause so that each row is updated to the values in its matching row in tableB. This is important – without both sub-queries we would either update the wrong rows (WHERE clause problem) or update them to the wrong values (SET clause problem). The Oracle equivalent, using the same rewrite as before, is:

update tableA
set (column1, column2) = (select tableB.column1, tableB.column2
from tableB where tableA.pkey = tableB.fkey)
where pkey in (select fkey from tableB)

As already mentioned this particular update will perform well when executed. Typically Oracle will scan tableB if it is much smaller for the sub-query, then join to tableA on its key columns assuming there is an index on them, and then join back to tableB to get the values to update the columns to.

Monday 19 October 2009

Solving Database Design in Development

Previously I have looked at the challenges involved in being able to support a changing database design in an application development environment, which were:
  1. Support change at any time
  2. Ensure database designed and modelled properly
  3. Record each individual change, to enable metamorphosis of older databases
  4. Allow multiple versions or branches of the Database Design to exist, and be changed individually or collectively
Although I started off on this problem of database structure change from the context of the need for it when using Agile development methods, the actual problem is not unique or specific to Agile. It is a generic problem to the development of any application that involves the use of a relational database to store its data in.

What I now want to do is list the key features needed in a solution to this problem which, if all assembled together, would deliver a viable solution to this problem.
  • With Application Source Code we only care about the final state it ends up in. But with a Database Design we care about each individual change, and what gets changed. We need to record each of these Database Changes individually, so that they can be applied to other instances of that database, as well as recording the final state of the database design itself in the Database Model.
  • Changes are physically made to a Database Instance by executing appropriate SQL statements. This is the only way to make a change to an existing database instance.
  • Each "Database Change" should be an atomic change to a database - it either completes successfully or not at all. Changes cannot be partially implemented on a database instance.
  • The Database Changes must be executed in the correct sequence on each database instance, for repeatability and due to any dependencies between the changes.
  • Changes to the Database Design should be formally Requested, and these requests recorded somewhere to provide an audit trail - Who, When, What Changed, Where in the Database.
  • Changes must be reviewed and approved by the Data Modeller, to ensure a good, scalable design. The Data Modeller is responsible for updating the Database Model - the record of the Database Design - and producing the SQL statement that implements that change. Some of this could be done automatically using appropriate tools, or manually. All that matters is that it does happen.
  • The Database Designer or Data Modelling role now becomes a part time role during the whole lifecycle of the application's development, instead of being a full time role only in the first phase of its development. It is now likely that all development cycles will involve the Database Designer, who becomes a central and critical member of the development team. The role itself, however, could be fulfilled by someone who has other roles too within the development cycle, or shared amongst a number of people, because it is not a full time role. This could be someone who is also a developer or a database administrator.
  • Each Database Change is given its own unique Identifier, which must be globally unique over all the other Database Changes. This identifies the particular Database Change, regardless of which Database Instances it is applied to, or which versions of the Application it appears in.
  • Changes to a particular Database Instance must be recorded within the Database Instance itself. This is the only way to avoid applying the same changes twice to a Database Instance. This implies some form of Change History table within each Database Instance.
  • This Change History table needs to record the Change Identifier for those changes applied, and the date and time when it was done. Other data could also be recorded, but these two are the minimum required (a minimal sketch of such a table follows this list).
  • The set of changes should be stored in a single file, often termed the Change Log. Each entry in this will include the Change Identifier and the SQL statement that implements that change. All necessary changes can be easily located and sequenced this way.
  • This Change Log needs to be well structured, so that a program can read and decode the Changes in it and execute the SQL statements for changes not yet applied to a given Database Instance. This is probably best done using XML.
  • There must be separate instances of the Database Model / Database Design and the Change Log file in each version / branch of the Application source code. This allows changes to be made independently to each branch of the Application.
  • In reality SQL statements are often specific to one database product, such as Oracle, Sybase, SQL Server or MySQL. The Change Log will need to record the separate SQL statements for each supported database product for each Database Change. Thus each Change will have an Identifier, and a SQL statement per database product, clearly labelled with the database product name.
  • The whole solution is brought together in a program that is run during an Application Upgrade. It opens the Change Log XML file and reads in each Change. If that Change has not yet been applied to this particular Database Instance, then the corresponding SQL statement is executed and a record inserted into the Change History metadata table.
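
As a minimal sketch of the Change History table and the check made by such an upgrade program (all names here are my own illustration, not taken from any particular tool):

create table change_history
( change_id    varchar2(100) not null,   -- the globally unique Change Identifier
  applied_date date          not null,   -- when the change was applied to this instance
  constraint change_history_pk primary key (change_id)
) ;

-- The upgrade program only executes a change's SQL when this finds no row:
select 1 from change_history where change_id = :change_id ;

-- And records it once the change's SQL statement has completed successfully:
insert into change_history (change_id, applied_date) values (:change_id, sysdate) ;
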
This is my interpretation of how to solve the problem of a changing database design during application development. But it turns out of course that others have been through exactly the same process before me and come up with exactly the same answers. And in fact, once I had thoroughly understood the nature of the problem, I could immediately see that these other people too had understood it and had arrived at pretty much the same conclusions I would. So although I did work through the problem from first principles myself, simply because that is the best way for me to fully understand a topic and appreciate whether any "solution" is right or not, I have no doubt borrowed from some of the material I have read in the way that I have described this solution here.

I still find many of the descriptions I came across on the Web around Database Design, Changes and Agile Development unclear on this particular problem of implementing changes to a database design during application development. Quite a lot discuss the need to change the database design, and how to model different types of changes (refactoring is often mentioned). But almost none discuss how such changes get implemented in real deployed databases, and how you maintain a database over the long term through a series of changes to the application that uses it. None of them covered enough of it at once to leave me feeling that I had a real solution I could go out and apply. Hence my need to work through the problem from first principles for myself, to get the full picture as it were.

Again, one article that did help to clarify the nature of the problem a lot for me was Rethinking Agility in Databases: Evolution from Hexagon Software. This brought home the message that Databases must be treated differently to Applications in how you change them.

And the Change Log file of SQL statements was explicitly mentioned by Peter Schuh and Pramod Sadalage, who have put this into practice themselves on large projects. See Agility and the Database by Peter Schuh for instance.

So although I have not invented anything new here that has not already been described by other people elsewhere, in one form or another, I have tried to bring together all of the necessary ingredients for a solution in a coherent manner. As I said, this was my way of working through the database design problem to arrive at a solution that I fully understood and felt able to go out and apply in the real world.

I may revisit this again and try and outline a working solution to this i.e. what would need to be delivered to implement the kind of solution I have described here. Anybody reading this and want to know more?

Monday 5 October 2009

The Challenge of Agile Database Design

Previously I have said that for what I call Enterprise Applications data modelling is important because the data itself has meaning and value outside of the application that manipulates it, and that scalability requires a good database design because you cannot just add scalability on afterwards. The Data Model or Database Design defines the structure of the database and the relationships between the data sets, and is part of the foundation on which the application is built. And a good database design is essential to achieve a scalable application. Which leads to the challenge - How do I go about designing a database in an Agile development project, when not all the requirements are known initially? What techniques should I be using?

Having read as much as I can find on this topic, I think I have a better understanding of the nature of this challenge. And this is what I want to explain here - the nature of the challenge of Agile Database Design.

First, we should restate the problem in a more positive way. Rather than "when not all the requirements are known", we can say "How should I be designing an Agile database that will change in the future". Change happens all the time, in one form or another, and is inevitable. We need to embrace it, assume that our database design will change over time, and find ways to support this changing database. This need for change over time is not unique to Agile development, and is in fact really a universal problem for applications and databases.

Second, we need to accept that we cannot skip the design stage for the database in any way. We must design the database properly - at least those parts of the database that we need to design now - and produce a correct model. As I argued before, a good and correct database design is essential to a well performing application, and to the integrity of the data itself.

The outcome of the database design is documentation on the structure of the database, often termed the Data Model. There are many tools that can be used to help you design your database and record the details of the model. Any such tool chosen should enable and support small and frequent changes to the data model, as this is a major requirement of Agile development. But you could also use tools as simple as a spreadsheet and a set of diagrams.

Third, a Database Design is not the same as Application Source Code. There are similarities, but they are actually different beasts. Both act as Blueprints for a thing that can be built - an Instance of that Design. And both can be changed over time as needed. The difference is that when the Source Code to an Application changes, we rebuild the Application completely, typically compiling all source code files. We have produced a new instance of the Application Program, as a next generation instance of it. This is an example of "Evolution": the Blueprint changes, and a brand new instance is created using it. Existing instances are not modified, but instead "replaced" by the newly created instance.

Databases are the opposite. Changes must be applied "in place" directly to each Database Instance (a real database on a computer system), to modify it into the latest database design. Such an in-place change to the structure of a thing is termed "Metamorphosis", and is quite different from "Evolution".
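
A trivial illustration of such an in-place change (the table and column names are made up): adding a column to an existing table is done by altering each deployed database instance directly, rather than by rebuilding it from a new definition.

alter table customer add (email_address varchar2(255)) ;

The existing rows stay exactly where they are; only the structure of that particular Database Instance has been modified.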

While I have appreciated for some time that there is a difference in type between an Application's Source Code and a Database's Design, I read about the explicit nature and form of this difference (Evolution versus Metamorphosis) in an article on Rethinking Agility in Databases: Evolution from Hexagon Software. All credit for this distinction between them and the terminology goes to them.

Application Source Code editing and maintenance methods will not work for a Database Design. A "replace and rebuild" methodology cannot be used for databases; instead, a change to the design must be applied "in place" to each instance of that database.

Fourth, there may be multiple separate versions or branches of the Database Design to be maintained, as a result of the existence of separate branches of the Application Source Code. It is common for Application Source Code to be branched when major releases are done, typically producing a new branch for support of that release, and a new branch for the next major release. It is possible that the Database Design may need to change in different ways in different branches of the Application. Likewise, the same change may need to be made to the Database Design in different branches - correcting a bug for instance. We need a way to record each version of the Database Design separately to support this.

These then form the Challenges of delivering a changing Database Design:

1. Supporting change at any time to the Database Design

2. Ensuring that the Database is Designed and Modelled properly

3. Recording each individual change to the Database Design, so that older instances of that Database Design can metamorphose by having these changes applied to them

4. Allowing multiple versions or branches of the Database Design to exist, and to be changed individually or collectively

These challenges are not specific to Agile Development, and apply to any large enough application software development. Addressing these challenges will provide a solution that could be used in any database oriented application development, whether using Agile development methods or not.

In the next post I hope to start describing the outlines of what you would need in order to achieve what I call "Agile Database Design" that addresses these challenges. And then subsequently how to meet this in a minimal way.

Monday 21 September 2009

Data Modelling & Scalability

For what I call Enterprise Class Applications, scalability is an important factor. Scalability can occur in multiple directions - increases in user count, transaction volume, database size. These can occur individually or in combination together. The application must perform well when any of these scales in size, and performance must not degrade severely. For this to be true, scalability must be designed in to the application from the beginning. You cannot simply add scalability onto an application later on, much like you cannot just add security on. There is no magical sticking plaster that you can apply to an existing application and suddenly have it scale wonderfully. If there was, then everyone would be doing it.

No, scalability must be designed into the application and the way it works. This leads to the corollary that only good designs scale well, and bad ones don't. Skipping the database design phase in any application development can only result in a poor design, with associated poor scalability. At some point as the workload increases on the application, performance will hit the dreaded knee and throughput will level off and response times will increase.

Poor performance results from things like lack of primary and foreign keys, general purpose tables with a type field indicating which data fields are relevant, use of key fields that only have a few values ('Y' or 'N' values are typical of this), outer joins between tables where the child records may not exist. Any of these can result in weak SQL when used in a query, and poor execution as a result.

So if the scalability of your application is important to you, and it may not be important for everyone, then make sure you design your database properly rather than just implementing the first set of tables that comes into someone's mind. And this design requires capturing information about all of the entities that need to be stored, and modelling their relationships both at a high, conceptual level and then at a detailed, low, physical level.

Ensuring that all entities and their relationships are captured clearly requires knowledge of all of the entities within the application. But in an iterative development, as typified by Agile, not all of the entities are known at the beginning when the first version of the database design is needed. So how does Agile deal with this? It is the one thing lacking from the documentation I have seen on Agile so far - how do you produce a good, scalable database design when it all takes place in an iterative development environment and when the full requirements and data entities are not yet known?

I've heard from other database architects and administrators of Agile projects delivering poor database designs, and know of one specific project at the moment where no database-skilled person has been involved at all.

For me this is a key sticking point with what I have seen of Agile Development, especially the aspect of delaying decisions for as long as possible. There seems to be this perception that the database design "can be finished off later on", after more of the application has been developed. But when the database design becomes an issue and needs to be changed, it is too late, as the time and effort involved in changing and rewriting the application to reflect these database changes is too great.

Given that a scalable application requires a good database design to store its data in, why is it that database design seems missing from most descriptions of Agile Development? It may get mentioned every once in a while, in general lists of development tasks, but I have seen very little on the specifics of what I would term "Agile Database Design".

My concern is that by ignoring database design within Agile Development, it will most of the time result in a poor database design, and so poor application scalability.

Either way it seems to me that to ensure you do end up with a good, scalable database design, you need a Data Modeler within the development team, responsible for the database design. All database changes should be routed through the Data Modeler, rather than a developer making them directly themselves.

Tuesday 15 September 2009

Reported Execution Plan Anomalies

In doing some tests of the effects of different indexes on execution plans, I came across a small anomaly with the way the execution plan was being reported. I was using the Autotrace feature of SQL*Plus to report the execution plan after a SQL query was executed. I noticed that the costs reported for each step were not completely true, although the overall execution total cost was correct.

The SQL query I was playing with was:

select count(*) from (
    select t1.i1, t1.i2, t1.c1, t2.i1, t2.i2, t2.c1
    from   t1 join t2 on t1.i3 = t2.i1
    where  t1.i2 = 111
    and    t2.i4 = 11
)

There are indexes on the I1 columns in both tables, and on the I3 column of table T1 (the foreign key). Oracle will therefore scan T2 and join to T1 using the index on I3, as seen in this reported plan:

---------------------------------------------------------------------------------------
| Id  | Operation                     | Name  | Rows | Bytes | Cost (%CPU) | Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |       |    1 |    18 |  1236   (1) | 00:00:15 |
|   1 |  SORT AGGREGATE               |       |    1 |    18 |             |          |
|*  2 |   TABLE ACCESS BY INDEX ROWID | T1    |    1 |     9 |   102   (0) | 00:00:02 |
|   3 |    NESTED LOOPS               |       |   10 |   180 |  1236   (1) | 00:00:15 |
|*  4 |     TABLE ACCESS FULL         | T2    |   20 |   180 |   195   (1) | 00:00:03 |
|*  5 |     INDEX RANGE SCAN          | I2_T1 |  100 |       |     2   (0) | 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("T1"."I2"=111)
4 - filter("T2"."I4"=11)
5 - access("T1"."I3"="T2"."I1")

As can be seen, the access to the data row of table T1 is performed after the join to T2 is complete i.e. outside and after the NESTED LOOP join has completed. However, the reported cost of 1236 for the Nested Loop is incorrect. It should be 236, based on the other reported values for cost and cardinality (row count). The cost of a Nested Loop join is the cost of the first, outer data source, plus the cost of the second, inner data source multiplied by the number of rows (cardinality) retrieved from that first, outer data source:

195 + (20 * 2) + CPU = 195 + 40 + 1 = 236.

Likewise the row count of the T1 table access is incorrectly shown as 1, when it should be 10, the same as the Nested Loop count. The cost of accessing data rows in T1 that match a single value in the I2_T1 index is about 100. (This can be verified by looking at the values reported in the 10053 trace file for Clustering Factor and Filter Factor, or by deriving these manually.) The cost for accessing all the necessary data rows in T1 as a result of the Nested Loop is this single access cost times the number of times it is performed (the cardinality) i.e. 100 * 10 = 1000.

Thus the total cost for the two table join query is 236 (NL) + 1000 (T1 data) = 1236. And indeed the total cost of the query is correctly reported as this. It is just the 2 inner steps of NESTED LOOPS and TABLE ACCESS BY INDEX ROWID that have their costs and row count incorrectly reported, causing this anomaly.

I guess this anomaly is caused by the execution plan moving the table access to outside the nested loop, rather than inside where it most often occurs. If the table access was instead inside the nested loop, between steps 4 and 5 as the parent of 5, then the reported costs would be correct. Most probably the Optimizer arrived at the original plan, then modified the execution steps when it realised it could move the table access outside of the nested loop. Unfortunately, it did not adjust the cost values for this, and so they are mis-reported.

Monday 7 September 2009

Index Selectivity Analysis

Why does Oracle sometimes not use an index when executing a SQL query? Even when the index includes most of the columns from a table explicitly referenced in the query? The obvious answer is that using the index does not produce a lower cost than other access methods, such as a full table scan. The Optimizer has calculated that the number of disk I/Os needed to use the index (depth of index plus matching leaf blocks) and then get the data blocks (number of matching rows) would be greater than the number of disk I/Os needed to do a full table scan with multi-block reads.

In a case like this, such an index will almost never be used by Oracle even though the index includes columns directly referenced in the WHERE clause of a query. This is because the number of estimated disk I/Os to use it will always be greater than the number of disk I/Os to do a full table scan. There might be exceptions in cases where the data is very skewed and you have a histogram on the columns in the index, and you are selecting a value with a low occurrence count. But generally such an index is unlikely to be used by any queries you run.
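
For completeness, such a histogram would normally come from gathering statistics with histogram buckets on the skewed column, along these lines (the table and column names here are made-up examples, not part of the analysis below):

begin
  dbms_stats.gather_table_stats(ownname => user, tabname => 'ORDERS',
                                method_opt => 'FOR COLUMNS STATUS SIZE 254');
end;
/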

Which raises the question, how can we find out which indexes in our database are selective enough to be used by a query before we actually run those queries? Or conversely, which indexes are not selective enough given the data in the database and will never be used? It is a bit too late to find out when we are running a query and Oracle is doing a full table scan and taking a long time to do it. Of course we could do an EXPLAIN PLAN of the query beforehand, but execution plans can change over time dependent on the volumes of data in a table, and the distribution of data values.

We would like to have a query that we can run on a complete database, and which will tell us which indexes are not selective given the actual data in the tables. This would flag weak indexes to us, regardless of which queries might be running i.e. indexes that are unlikely to be chosen by the Optimizer even if all columns in them are referenced in the query.

How would such a query work? Well, we know how the Optimizer costs the different access methods to data in a table (see previous posts on Full Table Scans and Index Scans). We can calculate each of these for a table and an index, and see if the table scan cost is lower than the index access cost. In fact, using a bit of maths we can manipulate the formula for these two costs together in a way that results in a more direct comparison. If the maths bit is too much for you then just skip ahead to the final formula followed by the SQL to run it.

The costs of each type of access are:
  • FTS Cost = (Blocks * mreadtim) / (MBRC * sreadtim)
  • Index Cost = Levels + (Leaf Blocks + Clustering Factor) * Filter Factor
And an index will be ignored when:
  • Index Cost > Full Table Scan Cost
Assuming that the table is large, the number of leaf blocks in the index will be far greater than the depth (number of levels) of the index tree, so we can ignore the Levels term. This gives the relationship between the two that an index will be ignored when (approximately):
  • (Leaf Blocks + Clustering Factor) * Filter Factor > (Blocks * mreadtim) / (MBRC * sreadtim)
Dividing both sides by (Leaf Blocks + Clustering Factor) gives:
  • Filter Factor > (Blocks * mreadtim) / (MBRC * sreadtim * (Leaf Blocks + Clustering Factor))
Now when there is no histogram on the columns or no data skew then the Filter Factor will be the inverse of the Number of Distinct Values or Keys i.e.
  • Filter Factor = 1 / Distinct Keys
Inverting both sides, which also reverses the direction of the inequality, gives us
  • Distinct Keys < (MBRC * sreadtim * (Leaf Blocks + Clustering Factor)) / (Blocks * mreadtim)
All of these values are stored in the Data Dictionary of the database in one place or another. So we can write a single SQL query that can calculate both sides of the comparison directly from the database itself. Note that this assumes that the statistics on your tables and indexes are up to date, and reflect the true data distributions in the tables.

To make some things easier we can group together values that are table and index independent, finally giving us that for the index to be ignored the following must be true:
  • Distinct Keys < (MBRC * sreadtim / mreadtim) * (Leaf Blocks + Clustering Factor) / Blocks
We can calculate the expression "(MBRC * sreadtim / mreadtim)" once, and then use this value with the other values on each index on each table in a database. We can then list all the indexes that fail the test i.e. indexes that will be ignored because their cost of access is greater than a full table scan.
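
As a worked illustration with made-up statistics: suppose MBRC is 8, sreadtim is 12 and mreadtim is 26 (so MBRC * sreadtim / mreadtim is about 3.69), and an index has 1,000 leaf blocks and a clustering factor of 50,000 on a table of 20,000 blocks. The threshold is then 3.69 * (1,000 + 50,000) / 20,000, which is roughly 9.4 - so if that index has fewer than about 9 distinct keys it will be ignored in favour of a full table scan.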

For a system with no System Statistics gathered, which is actually the majority of cases, then the following calculations are used for the disk read times:
  • sreadtim = ioseektim + (db_block_size / iotfrspeed)
  • mreadtim = ioseektim + (db_file_multiblock_read_count * db_block_size / iotfrspeed)
By default ioseektim will be 10 milliseconds and iotfrspeed will be 4 KB / millisecond.
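
For example, with the common 8 KB block size and a db_file_multiblock_read_count of 8, these defaults give sreadtim = 10 + (8192 / 4096) = 12 milliseconds and mreadtim = 10 + (8 * 8192 / 4096) = 26 milliseconds.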

This means that the only non-table dependent values in the fully expanded formula are for db_block_size and db_file_multiblock_read_count. These can only be obtained directly from the Oracle database if the user has SELECT permission on the dynamic performance views (V$PARAMETER specifically). And generally this is not true for normal users. To work around this, the SQL script below simply prompts the user to enter these two values.

The SQL script then calculates the derived values, and then uses these on all indexes on all tables to list those that fail the test i.e. number of distinct values in the index is too low.

As mentioned earlier, we can only ignore the depth of the index when the table is large enough, so the SQL includes an additional constraint on the number of blocks in a table, which is hardcoded at 100 blocks. You could easily change this to another value if desired.

This is a SQL*Plus script and must be run via SQL*Plus, as it uses SQL*Plus features such as substitution variables. And remember that your statistics must be up to date on each table and index in the database.

--
-- Check which indexes on all tables are better or worse than a full table scan
-- When the number of distinct values in a column is low, a full table scan
-- is more efficient (less I/Os) than using an index.
--
set define on
set verify off
set heading off
--
prompt
prompt Report to show which indexes are selective enough and which are not.
prompt Non-selective indexes will be ignored and full table scans preferred.
prompt

define small_table_threshold = 100

accept block_size prompt 'Enter database block size in bytes (e.g. 8192) : '
accept mbrc prompt 'Enter db_file_multiblock_read_count value (e.g. 8): '

prompt Using following database wide configuration settings:
prompt : Database Block size is &&block_size (bytes)
prompt : db_file_multiblock_read_count is &&mbrc
prompt : Threshold for a small table is &&small_table_threshold (blocks)
prompt : (Smaller tables than this are ignored in this report)
prompt : [If these are incorrect, edit this script and re-run]

column sreadtim new_value sreadtim
column mreadtim new_value mreadtim
column effective_mbrc new_value effective_mbrc

set termout off
-- Hide from screen internal calculations of derived values
select 10 + (&&block_size / 4096) sreadtim from dual ;
select 10 + (&&mbrc * &&block_size / 4096) mreadtim from dual ;
select &&mbrc * &&sreadtim / &&mreadtim effective_mbrc from dual ;
set termout on

prompt
prompt Assuming that system wide statistics have these derived values:
prompt : Single Block Read Time used is &&sreadtim (milliseconds)
prompt : Multi Block Read Time used is &&mreadtim (milliseconds)

set heading on
column table_name heading 'Table' format a20
column index_name heading 'Index Name' format a20
column distinct_keys heading 'Number of|Distinct|Values' format 999,999
column min_ndv heading 'Min|NDV' format 999,999
column selectivity heading 'Selectivity'
column sel heading 'Sel?'
column blocks heading 'Blocks' format 999,999
column mb heading 'Table|MB' format 999,999

select table_name, index_name, distinct_keys, min_ndv,
       case
         when (distinct_keys < min_ndv) then 'N'
         else 'Y'
       end sel,
       (blocks * &&block_size / (1024 * 1024)) mb
  from (select t.table_name, t.blocks, t.num_rows num_trows,
               i.index_name, i.num_rows num_irows, i.distinct_keys,
               &&effective_mbrc *
                   ((i.leaf_blocks + i.clustering_factor) / t.blocks) min_ndv
          from user_ind_statistics i, user_tables t
         where t.table_name = i.table_name
           and i.num_rows != 0
       )
 where distinct_keys < min_ndv
   and blocks >= &&small_table_threshold
 order by table_name, index_name ;

prompt The Sel? column indicates whether an index will be chosen or not in a query
prompt given the values of the database settings you have entered.
prompt

Running this should list indexes that will NOT generally be used by the Optimizer for any query you might run that references that table. There might be exceptional cases where the Optimizer might use such an index e.g. an Index Fast Full Scan, but generally such indexes would not be used as expected for direct access to the matching rows.

As ever, remember that higher values for db_file_multiblock_read_count lower the cost for a full table scan, and so lower the threshold at which a full table scan becomes cheaper than an index access.

Monday 31 August 2009

Hash Join Costing

As already discussed in previous posts Oracle has different join methods available to it, and will choose the one with the lowest cost for any specific join. So far I have described how the Nested Loop and Sort Merge Join methods work. Now I will describe the Hash Join method.

It shares some similarities with the Sort Merge Join method - both data sets are accessed separately, and then joined together - but is obviously different in how it does the join. The cost formula for this is:
  • [Outer Access Cost] + [Inner Access Cost] + [Hash Join Cost]
As for the Sort Merge method, this uses an area of memory and so its cost is affected by whether the controlling data set is smaller than the available memory or not. If it is smaller, then the Hash Join will take place in memory, and the Hash Join Cost component is negligible being only a CPU based cost. If the data set is larger, then the join takes place in stages, with intermediate results being written to disk. Hence the Hash Join Cost component will increase significantly due to the disk I/Os involved in writing out the overflow data, and reading it back in again later on.
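
To put some made-up numbers on that shape: if the outer data set costs 300 to produce and the inner data set costs 800, and the outer set fits within the available memory, then the total cost will be close to 300 + 800 = 1,100, because the Hash Join Cost component is then only a small CPU figure. Only when the outer data set overflows to disk does that third component become significant.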

The memory available for the Hash Join is the same as for the Sort Merge join, being derived from the pga_aggregate_target initialization parameter.

The main difference between the Sort Merge and Hash Join methods is that the Sort Merge wants to have both data sets in memory before joining them together, whereas the Hash Join method only wants one of the data sets in memory before starting the join. This makes a significant difference to the size of the data sets that can be processed by the Hash Join before it overflows to disk, compared to the Sort Merge join method.

Also the Sort Merge Join wants the data sets sorted before they can be merged, which has an associated cost even if it is only CPU because the data can be sorted in memory. The Hash Join however does not need the data sets pre-sorted in any way, and so avoids this extra cost. This difference ensures that the Hash Join is almost always cheaper than a Sort Merge Join, except under exceptional circumstances.

The Hash Join is particularly efficient compared to the Sort Merge Join where a small data set is joined to a much larger data set. A Sort Merge Join requires that both data sets be sorted, which needs memory for each data set. A Hash Join, in contrast, only requires memory proportional to the smaller, first data set and is unaffected by the size of the second data set. This means that a Hash Join can avoid needing to overflow to disk when one of the data sets is much larger, compared to a Sort Merge Join which would need to overflow to disk to sort the larger data set.

The Hash Join works by reading all of the rows in one of the data sets - termed the Outer data set as before - and putting all of them into a table-like structure in memory, assuming that sufficient memory is available. The data rows are placed in the memory table according to the values in the columns used for the join. These values are used in a hash function - hence the name of the join method - that returns a value within the range of the size of the table in memory. The data row is then placed in this location in the table - subject to some caveats.

Then the second, Inner data set is read, the same hash function applied to the same join columns in this data set, and the corresponding location in the memory table read. Now either there is a matching row in this location or there is not. If there is a matching row, then the join takes place and a joined row is produced. If there is no matching row then the inner data row is simply discarded.

Thus the Hash Join method can be relatively fast, with little overhead apart from the hash function itself.

The only caveat, as mentioned, is that it is possible for more than one outer data row to hash to the same location in the memory table. When this happens the data row is basically stored in the next available higher location. When the inner data row is hashed, Oracle compares the next locations in the memory table too, while they have the same join values. This handles the situation of duplicate hash values from different data rows, with minimal overhead.

Clearly the size of the hash table in memory is determined by the number of rows in the first or outer data set. The Optimizer is able to swap the data sets around if one is larger than the other, so that it processes the smaller one first. If this produces a significantly lower cost, then this swapped hash join will be used.

When the size of the first data set is larger than the available memory, Oracle proceeds along the following lines. Note that I have simplified the description, and omitted some of the details, but the gist of it is correct. When hashing the first, outer data set, rows with hash values outside the size of the memory table are written out to disk. But those that do hash within the in memory range are put into the memory table. Then the second, inner data set is processed, and hashed in the same way. Those that hash into the in memory range are processed as normal - join or no join - and those that don't are written out to disk.

The range of hash values covered by the in memory table is adjusted, to cover the second range of values, and the two pending data sets on disk joined. The saved first data set is read back in, and either hashed into memory or written out to another temporary area on disk. Then the second data set is read in, hashed and either joined or not joined if it falls within the in memory range, or written out to another temporary area. This is repeated until the remaining data from the first, outer data set fits in memory, and the second, inner data set processed the one final time.

Note that Oracle has certain optimizations in the implementation of this hash mechanism to both discard rows from the second data set that will not join as early as possible, and to write the temporary pending data sets into pre-hashed sub-sets. These optimizations minimize the total number of disk I/Os that need to be done and minimize multiple passes on the same data rows. Thus the actual Hash Join mechanism differs in detail from my simplified summary of it.

As ever, the costs can be verified by looking at the trace file output from the 10053 event. As described, when the outer data set is small enough to fit in memory, the cost of the Hash Join step itself is very small, being only CPU operations. The majority of the total Hash Join cost is simply the sum of the access costs of the two data sets being joined:
  • [Outer Access Cost] + [Inner Access Cost] + [Hash Join Cost]
Where the data sets are much larger than the available memory, then the Hash Join cost step itself increases significantly. Unfortunately I have not yet been able to determine a single rule that can be applied to determine this cost. Clearly it is dependent on the size of the outer data set, where the overflow data that will not fit in memory is written out to disk, and on the size of the inner data set that is also processed and written out to disk. I have not been able to determine yet how the split between these two affects the reported cost of the Hash Join itself (a lack of time on my part unfortunately). Rather than delay a posting on this I thought I would post what I have now, and later post anything else I discover about the Hash Join.

Tuesday 4 August 2009

Data Modelling - Why it is important

A recent post by Robyn on the importance of a good data model, especially in a development environment using Agile practices, made me sit up and think "Yes, I agree with you". So here are some thoughts of my own on why a Data Model is so important for large data processing applications that store their data in a database. If your application is neither large nor stores its data in a database, then you can probably ignore these thoughts.

First, let me state that I think the most important thing for the class of Application that I am discussing here is the Data. You could go so far as to say "It's ALL about the Data", and not the processing done by the Application.

Consider applications involving financial or medical data, for instance. I am sure that Customers of these classes of Applications care more about the integrity and correctness of their Data than they do about the features offered by the Application. Their Data must never be corrupted, or deleted, or lost through broken links to other data, or be incorrect (miscalculated financial amounts or dates). It is the Data that is of value, not the processing of the Data. In fact the Data may live beyond the current application, and be migrated over to the next one and continue to live in one form or another.

Of course Data on its own is no use to anyone. Data almost always exists with an Application that manipulates it in one way or another. So the two can never be separated, or considered as completely independent problems. But it is Data first, Application second.

Given the importance then of the Data, we need a Data Model that defines the structure of this Data being manipulated by the Application. To me such a Data Model involves both a Design and full Documentation of that Design. And by "full Documentation" I simply mean "adequate" to capture all the necessary information about the Data and the Design, so that someone else coming along later could understand it. What use is a Data Model that no one can understand?

In my view the Data and the "Data Model" of it should stand in their own right. This is because the data and its meaning should be constant and consistent, regardless of which particular application is manipulating it. Remember - Data first, Application second.

By a "Data Model" I mean a high level Logical or Conceptual Model, and the main data entities and the relationships between them. This should always be the same model for a given set of data, regardless of which particular application is manipulating the Data itself.

Of course the details of the implementation of this Conceptual Model can differ. And all specific applications will require unique elements within the physical database - extra staging tables to hold intermediate results, flags to indicate the type of data or entry, status codes, etc. But these should be in addition to the core data elements identified in the Conceptual Model.

In other words, although the Physical Database will have lots of extra data columns and tables, you should still be able to map large parts of it back to the Entities in the Conceptual Model very easily.

If the "Data Model" is somehow unique and tied in to an Application and its implementation, and cannot be separated from it, then this is admitting that it does not capture the true "meaning" of the data, and that such a database could not be used by another application, or be queried directly.

The Data Model is the Foundation of an Application, just as we have a real Foundation of a House. Imagine what it would be like to build a house room by room, with only minimal foundations for each room as we build it? First a one room house with a roof. Then later you extend it for another room - extra foundations dug, change one existing wall, new walls built, roof extended. Then 2 more rooms - again dig foundations, build walls, extend the roof over them. Then you go up a storey by building some rooms on top - rip off the complete roof, build on top of the existing walls to increase their height, put in a new ceiling/floor between the two floors, and a new ceiling and a new roof. Maybe you go into the roof, instead of putting on a new floor. The roof will still be modified, and extended outward for windows to be put in. A floor will have to be laid, and stairs put in. Any water plumbing or electricity ring mains will have to be extended too on each new development. Is this an efficient way to build a house?

No. The sensible way is to dig all the foundations at once, and then build only what you needed or could afford - maybe only the 2 room model initially - on top of this. When you came to extend later it would be much easier - no digging of foundations, or realigning walls. Allowances would have been made for this in advance.

What I have seen in reality is that newer development methods (Object Orientation) and tools (Java) let developers more easily write code that manipulates data "objects" internally, and then somehow magically persist that data to a database. So the developers focus on application functionality, assuming that the "data persistence" problem will be taken care of somehow. And when it does not happen "transparently" and there are "issues", it then becomes a database administrator area problem and not a development area problem.

The developers argue that they are taking a "model led design", but unfortunately it is an "object" led design, and not a "data" led design. From my experience, the object to relational mapping is non-trivial, and many object led designs are imperfect - there are major revisions between version 1.0 and 2.0. So what happens to the database structure between version 1.0 and version 2.0? And more importantly, what happens to all of the customer's data in that database? You cannot just throw the data away and start with fresh empty tables again.

Also, in many Object Models there can be a lot of repetition between different objects. Customers, People, and Organisations for instance will all have names and addresses. If the Data Model were designed first this commonality would be recognised, and a single set of tables designed to store this data. Whether one shared table or separate identical tables does not matter - it is that the design is the same for all of them. But with separate development teams designing their own objects as they go, you run the risk of having separate Address objects for each piece of functionality developed, all differing in one way or another.
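
As a sketch of what recognising that commonality might look like in the data model (the table and column names here are purely illustrative, and assume each referencing table has an address_id column):

create table address
( address_id   number        not null,
  line_1       varchar2(100),
  line_2       varchar2(100),
  city         varchar2(50),
  postal_code  varchar2(20),
  constraint address_pk primary key (address_id)
) ;

-- Customers, People and Organisations all then reference the same address structure:
alter table customer     add constraint customer_addr_fk     foreign key (address_id) references address ;
alter table person       add constraint person_addr_fk       foreign key (address_id) references address ;
alter table organisation add constraint organisation_addr_fk foreign key (address_id) references address ;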

Data and Objects are not the same thing. They may have a lot in common, but they are not the same thing. And while developers continue to propagate this myth, we will continue to have poor data models, and poor databases, and poor performance as a result. If you believe that the data is important, then you should adopt a data led design approach. I do not see this as necessarily conflicting with other development methodologies such as Agile. As long as a data led design is incorporated into it.

Monday 20 July 2009

Sort Merge Join Costing

For a Sort-Merge Join, each data set (Outer and Inner) is read separately and sorted by the joining columns. The resultant sorted data sets are then joined together. The cost formula for this is:
  • [Outer Access Cost] + [Sorting Outer Cost] + [Inner Access Cost] + [Inner Sort Cost]
We know how to calculate the two data Access Costs from earlier posts for different tables and the indexes available on them. How does the Optimizer cost the two Sorts?

The two key points are - how big is the data set to be sorted, and how much memory is available for use as a sort area?

If both of the data sets fit into the available memory, then they can be sorted directly into memory as they are read from their sources (previous steps in the execution plan). This avoids additional disk I/O and only has a CPU cost for the Sorts. The Optimizer calculates an approximate cost for this CPU component according to its own internal formula. Although this CPU cost is included by the Optimizer in its costs, for our purposes we can ignore it. This is because it is generally so small compared to the I/O costs involved in getting the data from disk in the first place, that it is negligible and has no impact on the final choice of which execution plan to use.

If you are using Oracle 10g and its new initialization parameters, with "workarea_size_policy" set to auto, then the amount of memory used by the Optimizer as the query sort area will be a percentage of the memory specified for "pga_aggregate_target". (If these are not set correctly, then it will use the value of the older "sort_area_size" parameter - but this is now deprecated and should not be used.)

The percentage it will use is not documented anywhere, but there are several references to 5% on the web. However, in the testing that I did I saw nearer 20% being used. This could be due to the fact that nothing else was happening on this system, so all of the PGA memory was free and available. Or that I had a relatively small PGA aggregate target of 32 MB, and there is a minimum allocation size from within the PGA memory area.

Either way, if you are trying to estimate the cost yourself, it is probably best to use 5% of the PGA memory as your value.
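
Putting some numbers on that: with the relatively small pga_aggregate_target of 32 MB used in my tests, 5% would give about 1.6 MB per sort, whereas the roughly 20% I actually observed corresponds to the 6.4 MB sort area that appears in the worked example below.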

To calculate the size of the data being sorted you need to know the following:
  • The columns being retrieved from each table. Oracle only processes those data columns it needs, and not the whole data row. The columns used will include those needed for the query output itself, plus those needed for filter conditions and joins to other data sets.
  • The size of each of these columns, added together for each data set to give a data total.
  • An additional 6 bytes per data row, as a unique identifier for it, internal to the processing by the Optimizer. This is present in all data sets, regardless of the join method.
  • An additional 12 bytes per data row for sort processing, on 32 bit systems. This would be doubled to 24 bytes on 64 bit systems, as they are memory address pointers. [Jonathan Lewis, Cost Based Oracle Fundamentals].
  • The number of rows or cardinality of each data set.
Adding together the row element sizes and multiplying by the cardinality of the data set, gives the total space needed for each data set. If either of these exceeds the available memory, then the Optimizer assumes that the Sort will be done in chunks, or sets, each written out to disk, and then each sorted sub-set merged together into the final sorted result set.

During this Merge phase, Oracle needs to read in blocks of sorted data from all of the sorted sets into memory at the same time, in order to properly merge them together. For efficiency Oracle uses multi-block reads on the sorted sets, because they are already sorted sequentially on disk and are being used in that order. Multi-block reads reduce the total number of read disk I/Os needed to read in all the sorted sets. Thus the number of sorted sub-sets that can be merged simultaneously is limited by the memory available, as mentioned earlier, and the size of the multi-block reads. The Optimizer terms this the "Sort width" in some of its trace output, but it means the number of sort sets that can be merged together at once.

If there is enough memory after the Sort phase to Merge all of the sorted sets together at once, then they are indeed merged this way, and this is termed a "single pass merge" - one processing of the sorted sets to merge them.

If there is not enough memory to Merge all sorted sets at once, then Oracle will merge as many as it can, and produce a single, larger sorted set of data rows from these. It will then repeat this merging across the next group of sorted sets, and so on to iterate over all of the sets from the initial sort. This will result in a smaller number of larger sorted sets, which hopefully can now be merged together. If not, the merging of sets is repeated until the number of sets is reduced to that which can be merged together. This is termed a "multi-pass merge".

Assuming that the data to be sorted is larger than the memory available, then the cost of the initial sort phase into sorted sets is the cost of writing out the data blocks using single block writes, plus the CPU cost to do the sort. Again, as before, the CPU cost is trivial compared to the disk I/O cost, but both are calculated by the Optimizer. The cost of the Merge phase is again the cost to read in the data from disk, plus the CPU cost of the merge.

When reading the data back in to merge, it seems that Oracle wants to use multi-block reads as described, but may not be able to achieve that for all of the blocks for one reason or another. The Optimizer seems to base its calculations on achieving multi-block reads for two thirds of the sorted blocks, with single block I/O assumed for the remaining third.

Sort-Merge Example


The calculations here produce results that agree with the costs given in a 10053 event trace file. In fact, the formula presented was derived from a number of such trace files produced for different data sets.
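For reference, the 10053 trace referred to throughout can normally be switched on and off for a session as follows, with the trace file written to the user dump destination:

alter session set events '10053 trace name context forever, level 1';
-- run or EXPLAIN PLAN the query being investigated here
alter session set events '10053 trace name context off';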

The Outer data set has 7 bytes of data per row, which with the 6 bytes of overhead gives 13 bytes per row, and a cardinality of 500,000 rows. The Optimizer has calculated the cost to produce this data set as 245.84.

The Inner data set has 3 bytes of data per row, which with the 6 bytes of overhead gives 9 bytes per row, and a cardinality of 100,000 rows. The Optimizer has calculated the cost to produce this data set as 270.83.

Adding on the 12 bytes per row used by the sort operation gives 25 bytes per Outer data row and 21 bytes per Inner data row. The Outer data volume is therefore 500,000 * 25 = 12,500,000 bytes = 11.92 MB (dividing by 1024 twice). The maximum memory available for the sort is 6.4 MB in this case, so the data will be sorted to disk and then merged back together. The sorts can be done using 2 sort sets, so only a single pass merge is required.

It seems that when the Outer data set is sorted in multiple sets, so is the Inner data set too, even though it could fit in memory in this case.

To calculate the I/Os involved, we must calculate the number of blocks required to store each of these data sets of 25 and 21 bytes per data row each. With an 8 KB block size, the calculations of the number of blocks are:
  • Outer Rows per block = floor (8192 / 25) = 327
  • Outer blocks = ceiling (500,000 / 327) = 1530
  • Inner Rows per block = floor (8192 / 21) = 390
  • Inner blocks = ceiling (100,000 / 390) = 257
To each of these another block is added, presumably some kind of master control block, giving 1531 and 258 blocks per data set.

The sort phase cost is that of writing out the blocks using single block I/O i.e. 1531 and 258. The merge phase cost is two thirds multi-block I/O and one third single block I/O. Remember that the Optimizer adjusts multi-block I/O by the ratio of the single block read time to the multi-block read time and by the Multi-Block Read Count (MBRC, typically the db_file_multiblock_read_count initialization parameter), as seen previously in the formula for a Full Table Scan. Using typical values of 8 for MBRC, 12 for SREADTIM and 26 for MREADTIM gives an adjusted MBRC value of 3.69.

However, Oracle does not use the value of "db_file_multiblock_read_count" for sorts, and instead uses an undocumented parameter of "_smm_auto_min_io_size", which defaults to 56 KB. This is 7 blocks in size in this case.
  • 1530 / 3 = 510
  • 510 * 2 = 1020 (multi-block I/O)
  • Multi-Block Cost is (1020 * 26) / (7 * 12) = 26520 / 84 = 315.714 i.e. 316 I/O cost units
The Merge phase cost is the sum of both types of I/O - 510 + 316 = 826. The Sort phase cost is the single block writes of the 1530 sorted blocks. And the total Sort/Merge I/O cost is 1530 + 826 = 2356.

Likewise for the Inner data set the Sort phase cost is 258. The Merge cost is derived from:
  • 258 / 3 = 86
  • 86 * 2 = 172
  • Multi-Block Cost is (172 * 26) / (7 * 12) = 4472 / 84 = 53.23 i.e. 54 I/O cost units
  • Total Merge cost is 86 + 54 = 140.
  • Total Sort / Merge cost is 258 + 140 = 398.
These results agree very closely with those reported in the 10053 event trace file for this query. This reports the Merge cost separately as "IO Cost / pass", and the "Total IO sort cost" which is actually for the complete Sort / Merge. The Sort cost itself must be deduced by subtracting these two values from each other.

For the Outer data set the 10053 trace file reported 2349 as the total Sort/Merge cost, and 400 for the Inner data set. These differences can be explained as follows:

The Optimizer used a calculated cardinality of 497,392.56 for the Outer data set, instead of 500,000. This resulted in a lower sort block count of 1523 instead of 1530.

After dividing the block count by three into thirds, it seems to add on two more blocks, presumably to account for an uneven division. Thus 1523 / 3 gives 508 blocks rounded up, but the Optimizer uses 510. Likewise instead of 86 for the third of 258, it uses 88.

This gives the modified results of:
  • Outer Merge I/O Cost = 510 + 316 = 826
  • Outer Sort / Merge I/O Cost = 1523 + 826 = 2349 - as reported in the 10053 file
  • Inner Merge I/O Cost = 88 + 55 = 143 - 142 is reported in the 10053 file (the multi-block cost rises from 54 to 55 because 88 * 2 = 176 blocks are now read multi-block)
  • Inner Sort / Merge I/O Cost = 258 + 143 = 401 - 400 is reported in the 10053 file
Using the values reported in the 10053 file, the total cost for the Sort Merge Join is:
  • [Outer Access Cost] + [Sorting Outer Cost] + [Inner Access Cost] + [Inner Sort Cost]
  • 245.84 + 2349 + 270.83 + 400 = 3265.67
The reported value in the 10053 trace file is 3331.02, of which 3254.00 is the total I/O cost. The missing 65.35 (3331.02 - 3265.67) is presumably the CPU cost of the two sorts and their merges and the production of the Outer and Inner data sets. As predicted this is much lower than the I/O cost of the sorts and merges.

The heuristic presented of the Merge reading two thirds of the sorted blocks using multi-block reads, and single block reads for the remaining third, was arrived at by varying the values of different inputs that the Optimizer would use, and observing the impact on the costs reported. Whenever the value of _smm_auto_min_io_size was adjusted there was a corresponding change in the reported value of the Merge cost (IO Cost / pass). From this the splitting into thirds was deduced, and also that the Sort cost was simply the number of blocks to be sorted, due to using single block I/O. Once hypothesized, all subsequent tests agreed with this formula, and none disagreed. Increasing the number of rows to be sorted, the width of the rows, or the PGA Aggregate Target did change the number of sort sets produced, but the formula presented here still held and the Sort and Merge I/O costs reported agreed with those calculated.

Monday 22 June 2009

Nested Loop Join Costing

As described previously, the optimizer has 3 main join methods available when joining 2 data sets together:
  • Nested Loop Join
  • Sort Merge Join
  • Hash Join
Each method has a different associated formula for its cost of execution. Here I will look at the Nested Loop Join, which is the most straightforward in some respects. Some of this information, such as the costing formula, is given in the Oracle Performance Tuning Guide, in the chapter on the Query Optimizer.

A Nested Loop Join reads data from one data set - the Outer data set - and for each data row retrieves corresponding rows from the other data set - the Inner data set. Clearly the Inner data set is accessed separately for each data row in the Outer data set. And the number of rows in a data set is termed its Cardinality. Hence the cost formula for a Nested Loop Join is as follows, where "Outer" and "Inner" refer to their respective data sets, and square brackets are used to enclose each individual element:
  • [Cost of Outer Access] + ([Cardinality of Outer] * [Cost of Inner Access])
The Outer and Inner costs will have been determined earlier by the Optimizer, either as a single table access cost or as the cost of a previous join between data sets. Thus these component costs do not have to be recalculated by the Optimizer.

Example


Consider the following query using a table with 100,000 rows spread over 1252 blocks of 8 KB each:

select it1.i1
from insert_test_1 it1, insert_test_1 it2
where it1.i1 = it2.i4
and it1.i3 = 99

I am using the same table twice in the query, with an alias for each occurrence of it.

There are indexes on each column i.e. on i1, i2 (not used here), i3 and i4. The i1 index is unique, but the others are not.
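For completeness, something along the following lines would build a similar test case. The index names and the exact data distribution are my own guesses rather than the originals - the only details taken from the text are the table name, the column names, the 100,000 rows and the 100 rows per value of "i3":

create table insert_test_1 (i1 number, i2 number, i3 number, i4 number);

insert into insert_test_1
select rownum, rownum, mod (rownum, 1000), mod (rownum, 1000)
from dual connect by level <= 100000;

create unique index it1_i1 on insert_test_1 (i1);
create index it1_i2 on insert_test_1 (i2);
create index it1_i3 on insert_test_1 (i3);
create index it1_i4 on insert_test_1 (i4);

exec dbms_stats.gather_table_stats (user, 'INSERT_TEST_1')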

Remember that the Optimizer starts with the data set with the lowest cardinality (smallest number of data rows). In this case this is "it1", because the "i3" column has 100 rows for each possible value, giving a cardinality of 100 for "it1" as the Outer data set. The lowest cost access to this is via the index on "i3", which has a cost of 102 (costed using the index range scan method described in an earlier post).

To join to "it2" in this case, the index on "i4" can be used. In this case no other data columns are needed from the "it2" table, and thus the cost is really that of reading the index leaf block with that particular value. For this index all entries for one value fit within one leaf block (leaf blocks per key value is 1), so the access cost is just 1 to read that particular leaf block.

The Nested Loop Cost is therefore:-
  • 102 + (100 * 1) = 102 + 100 = 202
This was verified by examining a 10053 event trace file when this SQL was executed. The costs reported differ only after the decimal point, due to a combination of rounding of values within Oracle and additional minor costs such as CPU cost (in 10g).

The 10053 trace file also showed that the Optimizer also costed other variations of the Nested Loop Join. This agrees with the statement that the Optimizer costs all possible access and join plans to determine the cheapest one.

Also considered was a full table scan of it2, at a cost of 243.1. The Nested Loop cost of this should be:
  • 102.1 + (100.2 * 243.1) = 102.1 + 24358.2 = 24460.3
The 10053 trace file actually reported a total cost of 24580.26. Again this would be different due to rounding of results and additional CPU costs. Nevertheless, it is very close to our calculated value for the I/O cost only.

Monday 15 June 2009

Optimizer Costing of Joins

Previously I have described how the Oracle Cost Based Optimizer costs all possible execution plans for a SQL statement to arrive at the one with the lowest cost (cheapest), and how it costs access to single tables using access methods of Full Table Scan and Index Range Scans. There are a few other potential access methods to a single table, but these are the primary ones that occur most if not all of the time. I will now describe how the Optimizer goes about costing joins between tables.

Although we often say that a query involves processing data from tables, it is important to realise that internally the query processes a series of data sets. Logically each step in the execution plan produces a result set of data rows, that is in turn consumed by its parent step within the execution plan. So although we might talk about joining "tables" together, the Optimizer is actually always joining "data sets" together. Some of these data sets do come from tables, while others may come from other steps within a more complex execution plan e.g. from a join of other data sets.

The Optimizer normally only joins two data sets together at a time. There are some exceptions, such as using Bit Map Indexes on Star schemas in Data Warehouses. But for most transaction processing systems, the joins will be between two data sets at a time. Joining more than two tables together simply involves the Optimizer first joining two of the tables together to produce a resultant data set, and then joining this to the third table, and so on to join together all the tables in the query.

First, how does the Optimizer determine the lowest cost join execution plan? In simple terms it costs each and every possible join method between the pair of data sets being joined, and chooses the cheapest one from these. That is: it considers different join methods for joining Data Set 1 to Data Set 2, and also considers the join the other way around - Data Set 2 to Data Set 1.

When there are more than two tables to join together, the Optimizer simply iterates over all the possible join combinations between them and the different join methods for each. It picks one data set as the first one, another as the second, costs the joins between them to get the cheapest, then costs the joins of this result to the third data set. The next iteration does the first data set to the third, and then in turn to the second. Then it repeats for all the other combinations - second to first to third, second to third to first, and so on. Thus all possible join paths between the data sets are costed, and the cheapest one chosen.

Second, what join methods are available? There are 3 main join methods:
  • Nested Loop Join
  • Sort Merge Join
  • Hash Join
Each method has a different associated formula for its cost of execution. Generally this is the cost of accessing each data set, already calculated previously by the Optimizer, and the cost of the join itself.

Note that when I say "cost of accessing each data set" for a join between two data sets, the actual access method can be different for each join method, and so have a different cost. It is not required that all three join methods use the same access method to get to the data sets they need to join. The Optimizer will cost the join methods with the appropriate data access and choose the cheapest one.

Thus for 2 tables and 3 join methods the Optimizer must cost 6 separate possible join combinations. For 3 tables this produces 6 different join combinations between them, each costed for the 3 join methods, for a total of 18 different possible join costs. For 4 tables it becomes 72 different joins to cost, and so on.

Fully costing each join combination properly would be very time consuming, so the Optimizer uses a number of optimizations to reduce this work. For the first join combination it starts with the data set with the lowest cardinality - smallest estimated number of rows in it. In turn it joins this to the data set with the next lowest cardinality, and so on. This ordering is likely to produce a relatively low cost plan, because it processes fewer data rows initially. It is not guaranteed to be the lowest cost, but will generally be relatively low. This first join combination is then remembered as the currently lowest join cost.

When costing subsequent join combinations, if the partial cost so far of a join plan is greater than the currently lowest join cost, then it is abandoned (pruned). This makes sense, because costs can only increase for an execution plan as other costs are accumulated within it. Furthermore, sometimes the cost for just accessing the first data set may be greater than the current lowest join cost, and such a join can be pruned without any further consideration or costing of different join methods. Remember that single table access paths are costed first by the Optimizer for each table referenced in the query. So it already knows what these costs are, with no further calculations required.

Thus the Optimizer is able to cost each join combination and each join method between the data sets, to arrive at the cheapest join cost. The internal optimizations attempt to minimise the total effort involved in costing each of these, by picking a relatively low cost join combination as the first one to be costed, and by immediately pruning out any alternative join plan as soon as its partial cost exceeds the current lowest cost join.

Wednesday 27 May 2009

Optimizer Costing of Index Scans

Previously I gave a brief overview of the fact that the optimizer costs each potential execution plan for a SQL query and then chooses the one with the lowest cost, and described how it costs a Full Table Scan. Now I will carry on and look at the cost calculations for accessing data in a table via an Index. Again, most of this is based on the information in A Look Under The Hood Of CBO: The 10053 Event by Wolfgang Breitling, and various tests I have done to confirm the details of it.

Note that I am concentrating here on the I/O cost as that is the largest component of the cost of an execution plan. However, there is also a CPU cost, which is not shown here. Typically this is far less than the I/O cost, and is so in the cases discussed so far. It is possible for some execution plans that the CPU cost is significant, and this will increase the total cost of such an execution plan.

Just as Oracle can scan all blocks in a table, so it can scan all leaf blocks in an index. This scan can be performed in two different ways potentially - as a scan using single block reads, or a scan using multi-block reads as for a table scan. The former is termed an Index Full Scan, while the latter is termed an Index Fast Full Scan. The cost calculations of each are based on the number of Leaf Blocks in that particular index.

For a Full Scan the cost formula seems to be:
  • #Levels + #Leaf Blocks
For a Fast Full Scan the cost formula uses MBRC as for a Full Table Scan:
  • #Levels + (#Leaf Blocks * MREADTIM / (MBRC * SREADTIM))
Testing with different queries and different indexes shows this to be the case.
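As a quick illustration of the difference (my own figures, using the index from the Examples section below - 1 level and 250 leaf blocks - and the typical system statistic values of 8 for MBRC, 12 for SREADTIM and 26 for MREADTIM):
  • Index Full Scan cost = 1 + 250 = 251
  • Index Fast Full Scan cost = 1 + (250 * 26) / (8 * 12) = 1 + 6500 / 96 = 1 + 67.7 = 68.7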

Note that I have seen Oracle report both access methods as an "INDEX FAST FULL SCAN" operation in an execution plan (displayed using DBMS_XPLAN.DISPLAY_CURSOR). I wonder whether this is a bug in the labelling of operations within the Plan Table used?

Normally however indexes are used to implement a predicate condition, and restrict the number of rows being returned. To do this the index is traversed from the root block down through the levels of the index to the leaf blocks that contain the matching values. Not all leaf blocks will be visited - only those containing a matching entry for the predicate condition. The Oracle Optimizer terms this restriction on the number of matching entries the Filter Factor. It is the proportion of rows in the whole table that will match the predicate condition i.e. a value always less than 1. Thus the cost of using only the index itself is:
  • #Levels + (#Leaf Blocks * Filter Factor)
In many situations some data needed for the query is not present in the index, and so the data row needs to be read from the table itself. Thus the cost increases by the cost of reading these data rows, which is essentially the number of rows matching the predicate condition:
  • #Rows * Filter Factor
There are however several refinements which need to be made to these formulae to reflect how the Optimizer actually costs the access paths.

First, it is possible that matching data rows are stored in the same data block. In such cases the data block is only read once, even though multiple data rows are used from it. Such clustering of similar values will reduce the number of data blocks that need to be read i.e. the read of one data block from disk can satisfy multiple entries from the same leaf block. Oracle calculates this clustering as a statistic for each index, and uses this to report the number of data blocks that would need to be read – the Clustering Factor. This is visible in the CLUSTERING_FACTOR column in the USER_INDEXES and USER_IND_STATISTICS data dictionary views.

For many indexes it will be the case that the Clustering Factor is the same value as the number of Rows in the table – no co-location of data values for that particular column. But for some indexes, the Clustering Factor will be much lower, such as an index on the date the data was loaded, or an incrementing sequential account number. The Clustering Factor Statistic factors in the benefits of the reduced disk I/O from this co-location of Index and Data entries, to better estimate the real disk I/O that would actually take place.

The final full index cost formula is thus:
  • #Levels + (#Leaf Blocks * Filter Factor) + (Clustering Factor * Filter Factor)
Second, the Filter Factor is calculated in different ways for different types of predicate conditions within a query. For now I will restrict myself to equality conditions (=), as they are the most common. Please read Wolfgang's Under The Hood paper for information on the Filter Factor for other conditions.

For a column with no Histograms (this is important), the Filter Factor for an equality condition will be the density of data values within that column - a measure of the proportion of data rows that have the same value for that column. This is true whether the condition uses a literal value or a bind variable. In turn the density value is simply the inverse of the number of distinct values in that column i.e. 1 / NDV. This count is available in the DISTINCT_KEYS column of USER_IND_STATISTICS and USER_INDEXES.
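All of the statistics needed so far - index levels, leaf blocks, distinct keys and the Clustering Factor - can be pulled from the data dictionary with something like the following (the table name is just the example table used elsewhere in these posts):

select index_name, blevel, leaf_blocks, distinct_keys, clustering_factor, num_rows
from user_indexes
where table_name = 'INSERT_TEST_1';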

The Filter Factor can also be used to calculate the number of matching rows produced by the predicate condition i.e. the Cardinality of the result set. This is used by the Optimizer when multiple conditions are combined, either on the same table or joined tables.

Filter Factors on multiple conditions in a query are combined together according to the standard rules for probabilities:
  • c1 AND c2 ==> FF1 * FF2
  • c1 OR c2 ==> FF1 + FF2 – (FF1 * FF2) i.e. exclude duplicate matches
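For example, taking two hypothetical equality conditions with individual Filter Factors of 0.01 and 0.05:
  • c1 AND c2 ==> 0.01 * 0.05 = 0.0005
  • c1 OR c2 ==> 0.01 + 0.05 - (0.01 * 0.05) = 0.0595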
If the column has a Histogram on it, the Density value will be calculated from the Density values of each bucket within the Histogram that fall within the query range. This might be close to 1 / NDV for the whole table, or it might be different. That is one way in which Histograms can have such a big impact on query execution plans.

Examples


Using the index cost formula of:
  • #Levels + (#Leaf Blocks * Filter Factor) + (Clustering Factor * Filter Factor)
For a query with a condition on column C3, which has an index with a Density of 0.001002, 250 Leaf Blocks, 1 Level and 100,000 rows, with a Clustering Factor equal to that row count, the cost should be:
  • 1 + 250 * 0.001002 + 100000 * 0.001002 = 1 + 0.2505 + 100.2 = 101.4505
The reported I/O cost by Oracle was 102, which agrees very closely with our calculated value. The difference is probably due to rounding errors. The expected cardinality is 100,000 * 0.001002 = 100.2.

Friday 8 May 2009

Oracle Optimizer Plan Costing - Full Table Scans

Hopefully this is the start of a series on how the Oracle Cost Based Optimizer determines the cost of a particular execution plan for a SQL statement. Knowing how it arrives at the cost of each possible access path to a data set, we can better understand why a particular access path was chosen, and why an alternative access path had a higher cost. This information was kick started by A Look Under The Hood Of CBO: The 10053 Event by Wolfgang Breitling, and also information in Troubleshooting Oracle Performance by Christian Antognini, and refined by various tests I did. The Under the Hood paper is very good, and has a good description of how the Optimizer goes about costing different access paths to data sets. Unfortunately it seems to be based on Oracle 9i, and things have changed slightly in Oracle 10g. The principles are the same, but the trace information output by the optimizer has changed slightly.

For this post I will focus on how the optimizer works generally, and specifically how a Full Table Scan is costed. These will not be full, exhaustive explanations of the inner workings of the Oracle Cost Based Optimiser, but short and to the point descriptions of the key costings.

Plan Costing


The Optimizer itself costs all execution plans in terms of the number of disk I/Os involved, because these are generally the largest component of the execution plan cost. Other costs such as CPU costs are also included in the total plan cost, after being converted to an I/O count via a conversion value. When the lowest cost plan has been produced, this can be easily converted to an elapsed time value by simply multiplying the cost by an average I/O time value.

When producing an execution plan for a SQL query (queries are the most common and also the most complicated things that the Optimizer has to deal with), the Optimizer simply tries to produce all possible execution plans and cost each of them. From these it chooses the plan with the lowest cost. Clearly this could be quite an exhaustive way of doing things. So what the Optimizer actually does is to fully cost one execution plan, and then only partially cost the subsequent execution plans.

Each subsequent plan is costed as each step of the plan is determined by the Optimizer. If the partial cost of such a plan so far is greater than the current best plan, then it is abandoned. This early pruning avoids the optimizer wasting too much time on plans that are too costly. Likewise if a new plan is produced with a total cost less than the current best plan, then the new plan becomes the current best plan.

After considering all possible execution plans the optimizer will have the lowest cost execution plan for the SQL query.

The execution plans themselves are formed in an iterative fashion. First, for each table referenced in the SQL query the optimizer costs each potential access path to the data in that table. Indexes are considered if they can be used directly to implement a predicate - a filter condition within the SQL query restricting the rows returned. Then the optimizer considers each possible join combination between the tables in the SQL query, using each possible join method - Nested Loop, Sort Merge or Hash Join. These joins use some of the costs for accessing each of the tables determined earlier, combined according to how that join method works. Thus each possible join path (A to B to C, A to C to B, B to A to C, etc.) is costed in turn using each of the 3 possible join methods, to arrive at the lowest cost execution plan for the SQL query.

Full Table Scan


This is one access method to data in a table that is always available to the optimizer. Logically the cost is the number of disk I/Os required to read every data block in the table. The number of data blocks in a table is obtained from the BLOCKS column in USER_TABLES. It is as accurate as the last time you gathered statistics on that table. Oracle 10g has an automatic job that runs every night by default to update the statistics on tables that have changed significantly, to minimise the drift between actual table sizes and the statistics stored about them in the database.
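A quick way to see the block count and how fresh the statistics are is to query the data dictionary directly (the table name is a placeholder):

select table_name, blocks, num_rows, last_analyzed
from user_tables
where table_name = 'MY_TABLE';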

An enhancement Oracle has is the ability to issue multi-block reads that read more than one block from disk in a single read request. This reduces the total number of disk I/Os needed. Thus the number of disk I/Os needed is reduced by the multi-block read count (MBRC), according to the formula:
  • Blocks / MBRC

The Optimizer wants all costs to be in units of equal disk I/Os i.e. in units of single block reads. However, the time for a multi-block read is likely to be different – generally slightly longer given that more data is transferred. To convert a multi-block read cost into units of a single block read, the multi-block read cost is divided by the time for a multi-block read and multiplied by the time for a single block read. This adjustment means that the “cost” is now in units of cost of single block reads, as used by other access paths such as indexes.

Therefore the cost of a Full Table Scan is actually:
  • (Blocks * MREADTIM) / (MBRC * SREADTIM)

MREADTIM and SREADTIM, as well as MBRC, are part of a set of statistics Oracle has about the relative performance of the computer system it is running on. These are termed the System Statistics, and are visible in sys.aux_stats$ or via the dbms_stats.get_system_stats procedure. Out of the box following installation, Oracle only has a minimal set of default values for some of these statistics - IOSEEKTIM, IOTFRSPEED and CPUSPEEDNW. All the other system statistics have no value. They can be set in a number of ways - the intended way being to get Oracle to measure these values from the system itself while it is executing some representative workload (these are termed Workload Statistics) - but they can be set manually if desired, or left unset.
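The current System Statistics can be seen directly with a query like the following, given access to the SYS-owned table - statistics that have never been gathered show as null:

select pname, pval1
from sys.aux_stats$
where sname = 'SYSSTATS_MAIN';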

Oracle uses those System Statistics that it has values for. For those with no value Oracle uses various substitution formulae to calculate values to use instead.

If there is a value for MBRC in the System Statistics then the Optimizer uses that value. Otherwise, it uses the value of the db_file_multiblock_read_count (DBFMBRC) initialization parameter.

If MREADTIM and SREADTIM do not have any values in the System Statistics, then the following substitutions are made:
  • SREADTIM = IOSEEKTIM + (DB_BLOCK_SIZE / IOTFRSPEED)
  • MREADTIM = IOSEEKTIM + (MBRC * DB_BLOCK_SIZE / IOTFRSPEED)

This is probably the general case on most systems - no Workload System Statistics are present (MBRC, SREADTIM, MREADTIM and others) - and only default values for IOSEEKTIM (10 milliseconds) and IOTFRSPEED (4 KB / millisecond). For an 8 KB block size and DBFMBRC of 8 this gives values for SREADTIM of 12 ms and for MREADTIM of 26 ms, and the ratio of MREADTIM to SREADTIM of 2.167.
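Plugging the default values into the substitution formulae shows where these numbers come from:
  • SREADTIM = 10 + (8 KB / 4 KB per ms) = 10 + 2 = 12 ms
  • MREADTIM = 10 + (8 * 8 KB / 4 KB per ms) = 10 + 16 = 26 ms
  • MREADTIM / SREADTIM = 26 / 12 = 2.167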

A simple test shows this to be the case - a SELECT on a table with no indexes. The table I used had 1252 blocks of 8 KB each, and DBFMBRC was set to 8. No System Statistics had been collected, so only the defaults exist for IOSEEKTIM and IOTFRSPEED. Using the substituted formulae for MREADTIM and SREADTIM the calculated cost should be:
  • (1252 * 26) / (8 * 12) = 32552 / 96 = 339.08

The reported cost by Oracle was 341, which is very close. The difference is due to various factors, such as rounding within the calculation, the optimizer probably costing an extra read for the table header (+1), and the optimizer adding in a CPU cost for processing the data.

Using these formulae and the values of the System Statistics, we can calculate for ourselves the expected cost of a Full Table Scan of any table.

Sunday 8 March 2009

System Activity Data - What is the System up to?

I'm always amazed at how many people do not know about the SAR utility on UNIX and Linux systems. It is a great little utility for collecting a lot of data about what is going on on a system. No, it is not the best possible tool for all such information, but it comes as standard on most UNIX systems and will give you most of what you want for very little effort on your part.

The problem we often deal with is that when running an application and taking application level measurements we are not sure what is happening lower down on the system, under the covers as it were. When the application performance is slow it would be nice to know if the computer system itself was very heavily loaded too, or whether the cause of the slowness lies somewhere else. Such relatively simple data about activity and resource usage on the system can help us establish what class of problem we have - a saturated resource in the system itself, or a badly behaving application.

To know what the computer system is up to we need to know about all of the resources in it - CPU, Memory, and Disks. And we need several different measurements on each resource - number of requests, elapsed times, and queue lengths. We also need an accurate timestamp with each of these measurements, to know when it occurred, and the period of time it covered.

All of this can be relatively easily achieved with SAR - the System Activity Reporter. It will collect and record a wide range of measurements on system resources, and the time when they occurred.

Behind the scenes SAR uses SADC - the System Activity Data Collector. And in fact it is SADC we are more concerned with than SAR. SAR simply extracts and formats the data you want from an SADC data set. We can run SADC to collect all the data it can about the system, at the frequency we want it collected (measurement period) and for the duration we want (how long until). SADC has a very simple set of options to it:
  • /usr/lib/sa/sadc t n file
where
  • t is the time duration between measurements i.e. the frequency
  • n is the number of measurements to make in total
  • file is the name of a file to write these measurements to
To not have SADC tie up a terminal window, it is often run in the background with the '&' specifier at the end of the command line - control then returns to the terminal window. This can also be used in command files and scripts.

The file produced by SADC contains binary data, not text data, and needs SAR to read it and format it as readable text.
  • sar [-data_option] -f file
If no data option is specified it defaults to CPU (-u). The common data options are:
  • -u CPU
  • -d Disk I/O
  • -r Physical Memory
  • -p Virtual Memory Paging
Remember that the SADC file contains all the data collected, so you can analyse any sub-set of this data any time you want to, at any later point in time. That is the benefit of collecting all of the possible measurements together at the same time and then saving them to a file.

You can obviously use SADC to collect data whenever you have performance problems. But it is far better to run SADC all the time, so that you already have the data to analyse after someone reports a problem. One way to do this is to put an entry into a "crontab" file for use by the "cron" utility, which runs jobs at regular intervals. Putting an entry similar to the following into "crontab" for the system administrator for midnight would run SADC for 24 hours collecting measurements every minute. If this is too frequent just modify the time and number values by the same ratio, so that the two multiplied together give 86400 seconds (24 hours).
  • /usr/lib/sa/sadc 60 1440 /usr/adm/sa/`uname -n`_`date +%y%m%d`.sad &
This will create a file named after the system and the current date in a central directory. Make sure there is enough space in that location, or change it to somewhere else. You will need to do housekeeping once in a while to purge and delete old data sets. This could also be done via another "cron" to delete files older than one or two months.

I have found SADC and SAR invaluable in a number of circumstances where I have been able to show that it was the underlying hardware that had performance issues, and not the application itself. And I have used it in the opposite manner too, to show that a poorly performing application was not suffering from running out of system resources or their slow performance. On most UNIX systems the impact of running SADC is minimal, as the UNIX kernel is already collecting all of these measurements anyway. The overhead is simply of SADC reading them from memory and writing them to disk at the frequency requested. And generally this is a very low and negligible overhead, compared to that of the application software itself.