Monday, 21 September 2009

Data Modelling & Scalability

For what I call Enterprise Class Applications, scalability is an important factor. Scalability can occur in multiple directions - increases in user count, transaction volume, database size. These can occur individually or in combination together. The application must perform well when any of these scales in size, and performance must not degrade severely. For this to be true, scalability must be designed in to the application from the beginning. You cannot simply add scalability onto an application later on, much like you cannot just add security on. There is no magical sticking plaster that you can apply to an existing application and suddenly have it scale wonderfully. If there was, then everyone would be doing it.

No, scalability must be designed into the application and the way it works. This leads to the corollary that only good designs scale well, and bad ones don't. Skipping the database design phase in any application development can only result in a poor design, with associated poor scalability. At some point as the workload increases on the application, performance will hit the dreaded knee and throughput will level off and response times will increase.

Poor performance results from things like lack of primary and foreign keys, general purpose tables with a type field indicating which data fields are relevant, use of key fields that only have a few values ('Y' or 'N' values are typical of this), outer joins between tables where the child records may not exist. Any of these can result in weak SQL when used in a query, and poor execution as a result.

So if the scalability of your application is important to you, and it may not be important for everyone, then make sure you design your database properly rather than just implementing the first set of tables that comes into someones mind. And this design requires capturing information about all of the entities that need to be stored, and modelling their relationships both at a high, conceptual level and then at a detailed low, physical level.

To ensure all entities and their relationships are captured, clearly requires knowledge about all of the entities within the application. But in an iterative development, as typified by Agile, not all of the entities are known at the beginning when the first version of the database design is needed. So how does Agile deal with this? It is the one thing lacking from the documentation I have seen on Agile so far - how do you produce a good, scalable database design when it all takes place in an iterative development environment and when the full requirements and data entities are not yet known?

I've heard of Agile projects delivering poor database designs from other database architects and administrators, and know of one specific one at the moment where no database skilled person has been involved at all.

For me this is a key sticking point with what I have seen of Agile Development, especially the aspect of delaying decisions for as long as possible. There seems to be this perception that the database design "can be finished off later on", after more of the application has been developed. But when the database design becomes an issue and needs to be changed, it is too late, as the time and effort involved in changing and rewriting the application to reflect these database changes is too great.

Given that a scalable application requires a good database design to store its data in, why is it that database design seems missing from most descriptions of Agile Development? It may get mentioned every once in a while, in general lists of development tasks, but I have seen very little on the specifics of what I would term "Agile Database Design".

My concern is that by ignoring database design within Agile Development, it will most of the time result in a poor database design, and so poor application scalability.

Either way it seems to me that to ensure you do end up with a good, scalable database design, you need a Data Modeler within the development team, responsible for the database design. All database changes should be routed through the Data Modeler, rather than a developer making them directly themselves.

2 comments:

pom said...

John,

AgileDatabases yahoo group would be a place for you to discuss these kinds of questions. Also there are books like Database Refactoring and Agile Data Techniques that talk about database in Agile teams.

Cheers
Pramod

John Brady said...

Pramod,

I have just finished reading Agile Database Techniques by Scott Ambler, and various other white papers on this topic and Database Refactoring while researching this. I intend to summarise these for my own benefit, and then make some more blog posts about them.

I do believe that there are answers to designing and building a database in an Agile development, and I will describe these as I flesh them out over time. It is just that this topic does not seem to have been clearly answered by the Agile community.

There are pieces of the answer out there, but it does not seem to have been put together into a single working solution. I have found descriptions of the principles involved, but none yet of a concrete solution people can adopt. And that is my focus - a real solution that people can adopt.

John