Monday 19 October 2009

Solving Database Design in Development

Previously I have looked at the challenges involved in being able to support a changing database design in an application development environment, which were:
  1. Support change at any time
  2. Ensure database designed and modelled properly
  3. Record each individual change, to enable metamorphosis of older databases
  4. Allow multiple versions or branches of the Database Design to exist, and be changed individually or collectively
Although I started off on this problem of database structure change from the context of the need for it when using Agile development methods, the actual problem is not unique or specific to Agile. It is a generic problem to the development of any application that involves the use of a relational database to store its data in.

What I now want to do is list the key features needed in a solution to this problem which, if all assembled together, would deliver a viable solution to this problem.
  • With Application Source Code we only care about the final state it ends up in. But with a Database Design we care about each individual change, and what gets changed. We need to record each of these Databse Changes individually, so that they can be applied to other instances of that database, as well as recording the final state of the database design itself in the Database Model.
  • Changes are physically made to a Database Instance by executing appropriate SQL statements. This is the only way to make a change to an existing database instance.
  • Each "Database Change" should be an atomic change to a database - it either completes successfully or not at all. Changes cannot be partially implemented on a database instance.
  • The Database Changes must be executed in the correct sequence on each database instance, for repeatability and due to any dependencies between the changes.
  • Changes to the Database Design should be formally Requested, and these requests recorded somewhere to provide an audit trail - Who, When, What Changed, Where in the Database.
  • Changes must be reviewed and approved by the Data Modeller, to ensure a good, scalable design. The Data Modeller is responsible for updating the Database Model - the record of the Database Design - and producing the SQL statement that implements that change. Some of this could be done automatically using appropriate tools, or manually. All that matters is that it does happen.
  • The Database Designer or Data Modelling role now becomes a part time role during the whole lifecycle of the application's development, instead of being a full time role only in the first phase of its development. It is now likely that all development cycles will involve the Database Designer, who becomes a central and critical member of the development team. The role itself, however, could be fulfilled by someone who has other roles too within the development cycle, or shared amongst a number of people, because it is not a full time role. This could be someone who is also a developer or a database administrator.
  • Each Database Change is given its own unique Identifier, which must be globally unique over all the other Database Changes. This identifies the particular Database Change, regardless of which Database Instances it is applied to, or which versions of the Application it appears in.
  • Changes to a particular Database Instance must be recorded within the Database Instance itself. This is the only way to avoid applying the same changes twice to a Database Instance. This implies some form of Change History table within each Database Instance.
  • This Change History table needs to record the Change Identifer for those changes applied, and the date and time when it was done. Other data could also be recorded, but these two are the minimum required.
  • The set of changes should be stored in a single file, often termed the Change Log. Each entry in this will include the Change Identifier and the SQL statement that implements that change. All necessary changes can be easily located and sequenced this way.
  • This Change Log needs to be well structured, so that a program can read and decode the Changes in it and execute the SQL statements for changes not yet applied to a given Database Instance. This is probably best done using XML.
  • There must be separate instances of the Database Model / Database Design and the Change Log file in each version / branch of the Application source code. This allows changes to be made independently to each branch of the Application.
  • In reality SQL statements are often specific to one database product, such as Oracle, Sybase, SQL*Server or MySQL. The Change Log will need to record the separate SQL statements for each supported database product for each Database Change. Thus each Change will have an Identifier, and a SQL statement per database product, clearly labelled with the database product name.
  • The whole solution is brought together in a program that is run during an Application Upgrade. It opens the Change Log XML file and reads in each Change. If that Change has not yet been applied to this particular Database Instance, then the corresponding SQL statement is executed and a record inserted into the Change History metadata table.
This is my interpretation of how to solve the problem of a changing database design during application development. But it turns out of course that others have been through exactly the same process before me and come up with exactly the same answers. And in fact, once I had thoroughly understood the nature of the problem, I could immediately see that these other people too had understood it and had arrived at pretty much the same conclusions I would. So although I did work through the problem from first principles myself, simply because that is the best way for me to fully understand a topic and appreciate whether any "solution" is right or not, I have no doubt borrowed from some of the material I have read in the way that I have described this solution here.

I still find some of the descriptions I found on the Web around Database Design, Changes and Agile Development unclear around this particular problem of implementing changes to a database design during application development. Quite a lot discuss the need to change the database design, and how to model different types of changes (refactoring is often mentioned). But almost none discuss how such changes get implemented in real deployed databases, and how you maintain a database over the long term with a series of changes to the application that uses it. None of them covered enough of it at once to leave me feeling that I had a real solution I could go out and apply. Hence my need to work through the problem from first principles for myself, to get the full picture as it were.

Again, one article that did help to clarify the nature of the problem a lot for me was Rethinking Agility in Databases: Evolution from Hexagon Software. This brought home the message that Databases must be treated differently to Applications in how you change them.

And the Change Log file of SQL statements was explicitly mentioned by Peter Schuh and Pramod Sadalage, who have put this into practice themselves on large projects. See Agility and the Database by Peter Schuh for instance.

So although I have not invented anything new here that has not already been described by other people elsewhere, in one form or another, I have tried to bring together all of the necessary ingredients for a solution in a coherent manner. As I said, this was my way of working through the database design problem to arrive at a solution that I fully understood and felt able to go out and apply in the real world.

I may revisit this again and try and outline a working solution to this i.e. what would need to be delivered to implement the kind of solution I have described here. Anybody reading this and want to know more?

Monday 5 October 2009

The Challenge of Agile Database Design

Previously I have said that for what I call Enterprise Applications data modelling is important because the data itself has meaning and value outside of the application that manipulates it, and that scalability requires a good database design because you cannot just add scalability on afterwards. The Data Model or Database Design defines the structure of the database and the relationships between the data sets, and is part of the foundation on which the application is built. And a good database design is essential to achieve a scalable application. Which leads to the challenge - How do I go about designing a database in an Agile development project, when not all the requirements are known initially? What techniques should I be using?

Having read as much as I can find on this topic, I think I have a better understanding of the nature of this challenge. And this is what I want to explain here - the nature of the challenge of Agile Database Design.

First, we should restate the problem in a more positive way. Rather than "when not all the requirements are known", we can say "How should I be designing an Agile database that will change in the future". Change happens all the time, in one form or another, and is inevitable. We need to embrace it, assume that our database design will change over time, and find ways to support this changing database. This need for change over time is not unique to Agile development, and is in fact really a universal problem for applications and databases.

Second, we need to accept that we cannot skip the design stage for the database in any way. We must design the database properly - at least those parts of the database that we need to design now - and produce a correct model. As I argued before, a good and correct database design is essential to a well performing application, and to the integrity of the data itself.

The outcome of the database design is documentation on the structure of the database, often termed the Data Model. There are many tools that can be used to help you design your database and record the details of the model. Any such tool chosen should enable and support small and frequent changes to the data model, as this is a major requirement of Agile development. But you could also use tools as simple as a spreadsheet and a set of diagrams.

Third, a Database Design is not the same as Application Source Code. There are similarities, but they are actually different beasts. Both act as Blueprints for a thing that can be built - an Instance of that Design. And both can be changed over time as needed. The difference is that when the Source Code to an Application changes, we rebuild the Application completely, typically compiling all source code files. We have produced a new instance of the Application Program, as a next generation instance of it. This is an example of "Evolution": the Blueprint changes, and a brand new instance is created using it. Existing instances are not modified, but instead "replaced" by the newly created instance.

Databases are the opposite. Changes must be applied "in place" directly to each Database Instance (a real database on a computer system), to modify it into the latest database design. Such an in place changing in the structure of a thing is termed "Metamorphosis", and is quite different from "Evolution".

While I have appreciated for some time that there is a difference in type between an Application's Source Code and a Database's Design, I read about the explicit nature and form of this difference (Evolution versus Metamorphosis) in an article on Rethinking Agility in Databases: Evolution from Hexagon Software. All credit for this distinction between them and the terminology goes to them.

Application Source Code editing and maintenance methods will not work for a Database Design. A "replace and rebuild" methodology cannot be used for databases, which need a change to the design to be applied "in place" instead to each instance of that database.

Fourth, there may be multiple separate versions or branches of the Database Design to be maintained, as a result of the existence of separate branches of the Application Source Code. It is common for Application Source Code to be branched when major releases are done, typically producing a new branch for support of that release, and a new branch for the next major release. It is possible that the Database Design may need to change in different ways in different branches of the Application. Likewise, the same change may need to be made to the Database Design in different branches - correcting a bug for instance. We need a way to record each version of the Database Design separately to support this.

These then form the Challenges of delivering a changing Database Design:

1. Supporting change at any time to the Database Design

2. Ensuring that the Database is Designed and Modelled properly

3. Recording each individual change to the Database Design, so that older instances of that Database Design can metamorphose by having these changes applied to them

4. Allowing multiple versions or branches of the Database Design to exist, and to be changed individually or collectively

These challenges are not specific to Agile Development, and apply to any large enough application software development. Addressing these challenges will provide a solution that could be used in any database oriented application development, whether using Agile development methods or not.

In the next post I hope to start describing the outlines of what you would need in order to achieve what I call "Agile Database Design" that addresses these challenges. And then subsequently how to meet this in a minimal way.