Monday 31 August 2009

Hash Join Costing

As already discussed in previous posts, Oracle has different join methods available to it and will choose the one with the lowest cost for any specific join. So far I have described how the Nested Loop and Sort Merge Join methods work. Now I will describe the Hash Join method.

It shares some similarities with the Sort Merge Join method - both data sets are accessed separately and then joined together - but it differs in how it performs the join. The cost formula for this is:
  • [Outer Access Cost] + [Inner Access Cost] + [Hash Join Cost]
As with the Sort Merge method, the Hash Join uses an area of memory, and so its cost is affected by whether the controlling data set is smaller than the available memory or not. If it is smaller, then the Hash Join takes place in memory, and the Hash Join Cost component is negligible, being only a CPU-based cost. If the data set is larger, then the join takes place in stages, with intermediate results being written to disk. Hence the Hash Join Cost component increases significantly due to the disk I/Os involved in writing out the overflow data and reading it back in again later on.
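As a rough illustration of the shape of this cost behaviour (and emphatically not Oracle's actual internal arithmetic), the following Python sketch separates the in-memory and overflow cases; all of the names and inputs are invented placeholders rather than real optimizer statistics.

    def hash_join_step_cost(build_set_bytes, hash_area_bytes, cpu_cost, overflow_io_cost):
        # If the build (outer) data set fits in the hash area, the step cost is
        # CPU only; otherwise extra disk I/O for writing out and re-reading the
        # overflow data is added on top.
        if build_set_bytes <= hash_area_bytes:
            return cpu_cost
        return cpu_cost + overflow_io_cost

    def total_hash_join_cost(outer_access_cost, inner_access_cost, hash_join_cost):
        # [Outer Access Cost] + [Inner Access Cost] + [Hash Join Cost]
        return outer_access_cost + inner_access_cost + hash_join_cost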

The memory available for the Hash Join is the same as for the Sort Merge join, being derived from the pga_aggregate_target initialization parameter.

The main difference between the Sort Merge and Hash Join methods is that the Sort Merge wants to have both data sets in memory before joining them together, whereas the Hash Join method only wants one of the data sets in memory before starting the join. This makes a significant difference to the size of the data sets that can be processed by the Hash Join before it overflows to disk, compared to the Sort Merge join method.

The Sort Merge Join also needs the data sets sorted before they can be merged, which has an associated cost even when it is only a CPU cost because the data can be sorted in memory. The Hash Join, however, does not need the data sets pre-sorted in any way, and so avoids this extra cost. This difference means that the Hash Join is almost always cheaper than a Sort Merge Join, except in exceptional circumstances.

The Hash Join is particularly efficient compared to the Sort Merge Join when a small data set is joined to a much larger data set. A Sort Merge Join requires that both data sets be sorted, which needs memory for each data set, whereas a Hash Join only requires memory proportional to the smaller, first data set and is unaffected by the size of the second data set. This means that a Hash Join can avoid overflowing to disk when one of the data sets is much larger, whereas a Sort Merge Join would need to overflow to disk to sort the larger data set.

The Hash Join works by reading all of the rows in one of the data sets - termed the Outer data set as before - and putting them into a table-like structure in memory, assuming that sufficient memory is available. The data rows are placed in the memory table according to the values in the columns used for the join. These values are fed into a hash function - hence the name of the join method - that returns a value within the range of the size of the table in memory. The data row is then placed in this location in the table - subject to some caveats.

Then the second, Inner data set is read, the same hash function is applied to the same join columns in this data set, and the corresponding location in the memory table is read. Either there is a matching row in this location or there is not. If there is a matching row, then the join takes place and a joined row is produced. If there is no matching row, then the inner data row is simply discarded.

Thus the Hash Join method can be relatively fast, with little overhead apart from the hash function itself.
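To make the build and probe phases concrete, here is a minimal in-memory sketch in Python. A dictionary stands in for the hash table and hash function described above, and the rows and column names are invented purely for illustration - this shows the general technique, not Oracle's internal implementation.

    def hash_join(outer_rows, inner_rows, join_key):
        # Build phase: hash every outer row on its join column value.
        hash_table = {}
        for row in outer_rows:
            hash_table.setdefault(row[join_key], []).append(row)

        # Probe phase: hash each inner row and look up the matching location.
        joined = []
        for row in inner_rows:
            for match in hash_table.get(row[join_key], []):  # no match: row discarded
                joined.append({**match, **row})
        return joined

    # Example usage with invented data.
    departments = [{"dept_id": 10, "dept_name": "Sales"},
                   {"dept_id": 20, "dept_name": "IT"}]
    employees   = [{"dept_id": 10, "emp_name": "Alice"},
                   {"dept_id": 30, "emp_name": "Bob"}]   # dept_id 30 has no match
    print(hash_join(departments, employees, "dept_id"))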

The only caveat, as mentioned, is that it is possible for more than one outer data row to hash to the same location in the memory table. When this happens the data row is stored in the next available higher location. When the inner data row is hashed, Oracle compares the subsequent locations in the memory table too, while they have the same join values. This handles the situation of duplicate hash values from different data rows with minimal overhead.
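This "next available higher location" behaviour is essentially open addressing with linear probing. The sketch below shows the general idea with an invented fixed-size table; it is a deliberate simplification, not Oracle's exact mechanism.

    TABLE_SIZE = 8
    slots = [None] * TABLE_SIZE          # each slot holds (join_value, row) or None

    def insert(join_value, row):
        i = hash(join_value) % TABLE_SIZE
        for _ in range(TABLE_SIZE):
            if slots[i] is None:
                slots[i] = (join_value, row)
                return
            i = (i + 1) % TABLE_SIZE     # collision: try the next higher location
        raise RuntimeError("table full - a real join would overflow to disk here")

    def probe(join_value):
        matches = []
        i = hash(join_value) % TABLE_SIZE
        for _ in range(TABLE_SIZE):
            if slots[i] is None:           # an empty slot ends the scan
                break
            if slots[i][0] == join_value:  # same join value: a genuine match
                matches.append(slots[i][1])
            i = (i + 1) % TABLE_SIZE
        return matches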

Clearly the size of the hash table in memory is determined by the number of rows in the first or outer data set. The Optimizer is able to swap the data sets around if one is larger than the other, so that it processes the smaller one first. If this produces a significantly lower cost, then this swapped hash join will be used.
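A sketch of that swap decision, assuming for simplicity that the build side is chosen purely on row count (the real Optimizer bases the choice on estimated cost, not just cardinality):

    def choose_build_and_probe(data_set_a, data_set_b):
        # Build the in-memory hash table on whichever input is smaller,
        # and probe it with the other one.
        if len(data_set_a) <= len(data_set_b):
            return data_set_a, data_set_b
        return data_set_b, data_set_a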

When the size of the first data set is larger than the available memory, Oracle proceeds along the following lines. Note that I have simplified the description and omitted some of the details, but the gist of it is correct. When hashing the first, outer data set, rows with hash values outside the range of the memory table are written out to disk, while those that do hash within the in-memory range are put into the memory table. Then the second, inner data set is processed and hashed in the same way. Those rows that hash into the in-memory range are processed as normal - join or no join - and those that do not are written out to disk.

The range of hash values covered by the in-memory table is then adjusted to cover the second range of values, and the two pending data sets on disk are joined. The saved first data set is read back in, and each row is either hashed into memory or written out to another temporary area on disk. Then the saved second data set is read in and hashed, and each row is either joined or not joined if it falls within the in-memory range, or written out to another temporary area. This is repeated until the remaining data from the first, outer data set fits in memory and the second, inner data set has been processed one final time.

Note that Oracle has certain optimizations in the implementation of this hash mechanism, both to discard rows from the second data set that will not join as early as possible, and to write the temporary pending data sets out as pre-hashed sub-sets. These optimizations minimize the total number of disk I/Os that need to be done and avoid multiple passes over the same data rows. Thus the actual Hash Join mechanism differs in detail from my simplified summary of it.
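For what it is worth, the simplified multi-pass scheme described above is close in spirit to a classic partitioned hash join. The sketch below partitions both inputs by hash value and then joins each pair of partitions in memory, reusing the hash_join function from the earlier sketch; holding the partitions in lists rather than temporary disk areas, and the fixed partition count, are simplifications of my own, not details of Oracle's implementation.

    NUM_PARTITIONS = 4

    def partition(rows, join_key):
        # Split the rows into partitions by hashing the join value; in a real
        # join the partitions that do not fit in memory would be written to disk.
        parts = [[] for _ in range(NUM_PARTITIONS)]
        for row in rows:
            parts[hash(row[join_key]) % NUM_PARTITIONS].append(row)
        return parts

    def partitioned_hash_join(outer_rows, inner_rows, join_key):
        outer_parts = partition(outer_rows, join_key)
        inner_parts = partition(inner_rows, join_key)
        joined = []
        # Rows can only join with rows in the matching partition, so each pair
        # of partitions is joined independently, one at a time, in memory.
        for outer_part, inner_part in zip(outer_parts, inner_parts):
            joined.extend(hash_join(outer_part, inner_part, join_key))
        return joined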

As ever, the costs can be verified by looking at the trace file output from the 10053 event. As described, when the outer data set is small enough to fit in memory the cost of the Hash Join step itself is very small, being only CPU operations. The majority of the total Hash Join cost is simply the sum of the access costs of the two data sets being joined.
  • [Outer Access Cost] + [Inner Access Cost] + [Hash Join Cost]
Where the data sets are much larger than the available memory, the Hash Join cost step itself increases significantly. Unfortunately I have not yet been able to determine a single rule that can be applied to calculate this cost. Clearly it is dependent on the size of the outer data set, where the overflow data that will not fit in memory is written out to disk, and on the size of the inner data set, which is also processed and written out to disk. I have not yet been able to determine how the split between these two affects the reported cost of the Hash Join itself (a lack of time on my part unfortunately). Rather than delay a posting on this, I thought I would post what I have now and later post anything else I discover about the Hash Join.

Tuesday 4 August 2009

Data Modelling - Why it is important

A recent post by Robyn on the importance of a good data model, especially in a development environment using Agile practices, made me sit up and think "Yes, I agree with you". So here are some thoughts of my own on why a Data Model is so important on large data processing applications that store their data in a database. If your application is neither large nor stores its data in a database, then you can probably ignore these thoughts.

First, let me state that I think the most important thing for the class of Application that I am discussing here is the Data. You could go so far as to say "It's ALL about the Data", and not the processing done by the Application.

Consider, for instance, Applications involving financial or medical data. I am sure that Customers of these classes of Applications care more about the integrity and correctness of their Data than they do about the features offered by the Application. Their Data must never be corrupted, or deleted, or lost through broken links to other data, or made incorrect (miscalculated financial amounts or dates). It is the Data that is of value, not the processing of the Data. In fact the Data may live beyond the current application, and be migrated over to the next one and continue to live in one form or another.

Of course Data on its own is no use to anyone. Data almost always exists with an Application that manipulates it in one way or another. So the two can never be separated, or considered as completely independent problems. But it is Data first, Application second.

Given the importance then of the Data, we need a Data Model that defines the structure of this Data being manipulated by the Application. To me such a Data Model involves both a Design and full Documentation of that Design. And by "full Documentation" I simply mean "adequate" to capture all the necessary information about the Data and the Design, so that someone else coming along later could understand it. What use is a Data Model that no one can understand?

In my view the Data and the "Data Model" of it should stand in their own right. This is because the data and its meaning should be constant and consistent, regardless of which particular application is manipulating it. Remember - Data first, Application second.

By a "Data Model" I mean a high level Logical or Conceptual Model, and the main data entities and the relationships between them. This should always be the same model for a given set of data, regardless of which particular application is manipulating the Data itself.

Of course the details of the implementation of this Conceptual Model can differ. And all specific applications will require unique elements within the physical database - extra staging tables to hold intermediate results, flags to indicate the type of data or entry, status codes, etc. But these should be in addition to the core data elements identified in the Conceptual Model.

In other words, although the Physical Database will have lots of extra data columns and tables, you should still be able to map large parts of it back to the Entities in the Conceptual Model very easily.

If the "Data Model" is somehow unique and tied in to an Application and its implementation, and cannot be separated from it, then this is admitting that it does not capture the true "meaning" of the data, and that such a database could not be used by another application, or be queried directly.

The Data Model is the Foundation of an Application, just as a House has a real Foundation. Imagine what it would be like to build a house room by room, with only minimal foundations for each room as it is built. First a one room house with a roof. Then later you extend it by another room - extra foundations dug, one existing wall changed, new walls built, roof extended. Then 2 more rooms - again dig foundations, build walls, extend the roof over them. Then you go up a storey by building some rooms on top - rip off the complete roof, build on top of the existing walls to increase their height, put in a new ceiling/floor between the two storeys, and add a new ceiling and a new roof. Maybe you go into the roof instead of putting on a new floor. The roof will still be modified, and extended outward for windows to be put in. A floor will have to be laid, and stairs put in. Any water plumbing or electricity ring mains will have to be extended too with each new development. Is this an efficient way to build a house?

No. The sensible way is to dig all the foundations at once, and then build only what you need or can afford - maybe only the 2 room model initially - on top of this. When you come to extend later it is much easier - no digging of foundations, or realigning of walls. Allowances will have been made for this in advance.

What I have seen in reality is that newer development methods (Object Orientation) and tools (Java) let developers more easily write code that manipulates data "objects" internally, and then somehow magically persist that data to a database. So the developers focus on application functionality, assuming that the "data persistence" problem will be taken care of somehow. And when it does not happen "transparently" and there are "issues", it becomes a problem for the database administration area and not the development area.

The developers argue that they are following a "model led design", but unfortunately it is an "object" led design and not a "data" led design. From my experience, the object to relational mapping is non-trivial, and many object led designs are imperfect - there are major revisions between version 1.0 and 2.0. So what happens to the database structure between version 1.0 and version 2.0? And more importantly, what happens to all of the customer's data in that database? You cannot just throw the data away and start again with fresh empty tables.

Also, in many Object Models there can be a lot of repetition between different objects. Customers, People, and Organisations for instance will all have names and addresses. If the Data Model were designed first this commonality would be recognised, and a single set of tables designed to store this data. Whether one shared table or separate identical tables does not matter - it is that the design is the same for all of them. But with separate development teams designing their own objects as they go, you run the risk of having separate Address objects for each piece of functionality developed, all differing in one way or another.
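As a small illustration of that shared-structure point (with invented entity and field names, not taken from any particular application), a single Address definition can be reused by every entity that needs one:

    from dataclasses import dataclass

    @dataclass
    class Address:
        line_1: str
        city: str
        postcode: str

    @dataclass
    class Customer:
        name: str
        billing_address: Address     # the same Address structure everywhere

    @dataclass
    class Organisation:
        name: str
        registered_address: Address  # no separate, slightly different Address type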

Data and Objects are not the same thing. They may have a lot in common, but they are not the same thing. And while developers continue to propagate this myth, we will continue to have poor data models, poor databases, and poor performance as a result. If you believe that the data is important, then you should adopt a data led design approach. I do not see this as necessarily conflicting with other development methodologies such as Agile, as long as a data led design is incorporated into them.