Tuesday, 11 March 2014

Speeding up Imports

There are a number of techniques you can use to speed up an Oracle Import, some of which I'll describe here. This is not any attempt at a comprehensive list, just some of the main techniques I have used that can really speed up some parts of an import. I've seen a standard import come down from 2.5 hours to about 0.75 hours using these techniques.

The first thing to realise is that an exported dump file is a "logical" dump of the database - it contains the SQL statements necessary to recreate all the database structures, and the data records to be loaded into each table. The import works by executing these SQL statements against the target database, thus recreating the contents of the original database that was exported. We can leverage this to our advantage in various ways.

The objects and their corresponding SQL are in a certain order in the exported dump file, and they are imported in a strict sequence by the "impdp" utility. This order is roughly the following for the main objects (it is not a complete or absolute list, as I have simplified some of the details a bit):
  • Sequences
  • Tables - Creation, but no data
  • Table Data
  • Indexes
  • Constraints - Check & Referential
  • Packages, Functions, Procedures
  • Views
  • Triggers
  • Functional and Bitmap Indexes
  • Materialized Views
When you run a full import, each of these sets of objects is done in turn, one after the other. This means that the import is not really parallelised. There is a PARALLEL option, but this only really affects the "Table Data" section, and allows multiple slave readers of the table data to run INSERT statements i.e. different tables can be loaded at the same time, but that is all. The PARALLEL option does not affect other options, such as index builds, which is a shame.

 

Its all about Disk Bandwidth

Before I get into specific techniques, I want to point out that an Oracle Import is fundamentally a disk I/O bound task. The techniques I'll be describing will be using parallelism of various forms to get more work done at the same time on the system, to reduce the total elapsed time. And this in turn will produce more disk I/O operations running at the same time. So you will ultimately be limited by the system's ability to get data off and on to the disks, rather than being limited by the amount of CPU or Memory it has in it.

Consider how each of the object types in turn is created within the database during the import:
  • The initial table data load is mainly disk I/O - read all the data records from the dump file and INSERT
  • Index creation is mainly disk I/O - read all the data in the table, sort it for the index, write the index to disk
  • Constraints are mainly disk I/O - read all the data in one table and verify value or existence in another table
  • Materialized Views involve executing the SELECT to populate the MV on disk
Yes, some CPU will be used (certainly for data load and index creation), but the CPU's must each be fed with data from disk to keep them busy. So CPU utilisation can only go up by increasing the disk bandwidth used.

In my experience it is the index creation and the referential integrity constraints that take the longest elapsed time during imports, due to their need to always read all the data records in a table.

 

Divide And Conquer

The main technique I use to speed up an import is to import each object type separately via its own "impdp" command, using the "INCLUDE" option to only do those type of objects. While this does not directly speed up the import itself, it sets the foundation for being able to do so. Now we have a series of individual import commands to import the whole database, we can then sub-divide these up within themselves.

So instead of loading data into all the tables in one command, which does them sequentially one after the other, we can run separate import commands at the same time for each of the largest tables, and run another import command for all the other tables (using the EXCLUDE option to omit the largest tables). Now we have a high level of parallelism within the table data load, which we can fine tune based on the relative sizes of our tables and the disk I/O bandwidth available on the system.

 

Parallelizing Indexes

We could use a similar technique on indexes of doing different tables at the same time, but this can have other issues to deal with e.g. an increase in the amount of temporary tablespace storage used. Also you will probably be limited by the time taken for all of the indexes on the largest table in the database, which would be built serially one after the other.

Instead what I do is use the "SQLFILE" option of impdp with the "INCLUDE=INDEX" option to get the "CREATE INDEX" statements into a single SQL file. Then I edit this file and add a "PARALLEL" clause to the end of each "CREATE INDEX". In fact, import actually puts a "PARALLEL 1" on the end of each "CREATE INDEX" already (I'm using 11.2.0.3.0), so I only need to do a global replace of this by the degree of parallelism I want. You probably want something like the same as the number of disks you have, but you can experiment to get the best value.

Once edited I can then create the indexes by running this SQL script from SQL*Plus, after all the tables have had their data loaded in. As each index is being created using parallelism, each will be maximising the resources on the system, and we can just let it build one index at a time. There may not be any benefit from trying to split the indexes into two separate SQL files and running them both at the same time.

Note that you can create and edit this index SQL script as soon as you have the dump file you want to import. You can do this while the table data is being loaded, so that the index SQL script is ready before the tables have finished loading. This way there is no need for a gap between the end of the table data being loaded and the index creation starting.

 

Skipping Constraints

As mentioned, constraints involve reading all of the data in a table to validate that it is correct against the specified constraint. If we are importing a complete database from another system, and we know for certain that all of the data satisfies all of the constraints in that database, then we can basically skip validating the constraints completely. The speed up from this is quite significant, as we completely avoid enormous amounts of disk I/O for every table in the database, but can still end up with all the constraints in place.

As for indexes I use the SQLFILE option to produce a file containing the SQL for the constraints. I then edit this file and add "NOVALIDATE" to the end of each constraint statement. This means that Oracle will create the constraint, which is what we want, but it will not validate the data i.e. it will assume that the data is valid and not explicitly check it, which is also what we want. In fact each constraint SQL statement ends in the word "ENABLE", so we can do a global replace of this by "ENABLE NOVALIDATE", which is easy.

Again, as for indexes, we can produce and edit this SQL script as soon as we have the dump file, and then run it at the appropriate point in the steps being used to import the database.

 

Export

I've discussed the import side of things so far, as that is where you often want the speed up. But you can apply similar principles to the export of the database, using divide and conquer to run multiple exports at the same time. Again this will increase the disk I/O operations occurring at the same time, but will reduce the total elapsed time of the export. Some of the techniques include the following:
  • Do an export of just the metadata i.e. just the SQL to create the objects, using "CONTENT=METADATA_ONLY"
  • Export the data from the biggest tables individually to their own dump files, using the "TABLES" option to list each table
  • Export the data from all the other tables to a dump file, using the "EXCLUDE=TABLE:" option
    • Beware that the "EXCLUDE" option has a slightly weird syntax where you need to embed quotes around the list of table names

 

Full Database Migration

If you are using export and import to migrate a production database from one system to another, then you can combine all of these techniques together to minimise the total elapsed time before the second database is up and available.
  • Once the metadata export has finished, transfer it to the target system over the network
    • Then create the empty tables, and produce and edit the SQL scripts for indexes and constraints
  • As each table data export finishes, start its transfer over the network to the target system
    • And as each data dump file arrives on the target system start its load
  • Once data load has completed, start creating the indexes, followed by the other steps
This overlaps execution of the export, transfer of the dump files over the network, and import on all of the systems at the same time.

5 comments:

Jean Desjardins said...

At which point do you make the constraints VALIDATE?

John Brady said...

Sorry for the late reply. You don't VALIDATE the existing data being imported, because the assumption is that it is already valid for the foreign key data values you are re-importing. The foreign key constraints are still applied to new data inserted into the table later on - from the Oracle 11gR2 SQL Reference Manual:-

"ENABLE NOVALIDATE ensures that all new DML operations on the constrained
data comply with the constraint. This clause does not ensure that existing data in
the table complies with the constraint."

So you are speeding up the import by not validating the existing data being imported, but these constraints are applied to new data that gets inserted into the tables afterwards i.e. normal application usage.

mohammad omari said...
This comment has been removed by the author.
mohammad omari said...

Great tips John especially disabling constraint validation and parallelizing index creation those sure have dramatic impact on the import performance.

Javier Perez said...

Hi there,

Just wanted to share my opinion in case it may be of any help to somebody...

In version 11.2.0.3.0, impdp already uses parallelism on index creation.

Indeed, constraint validation is still done serially(one may change parallel degree of table and validate constraints manually as it was described in the post) although I wouldnt recommend leaving the constraints enabled novalidate at all. Actually with the way you described here for the impdp, imagine some other DBA/engineer comes back to this database and wants to take an export and then import somewhere else, how will he know the data is consistent if your constraints are not validated? Or imagine you want to do some heavy DML and have to disable constraints for performance, even if you tested this procedure in a similar development enviroment, you might have the nasty surprise of validation errors in old data. In my opinon, it's not a good practice.

best regards