Tuesday 4 August 2009

Data Modelling - Why it is important

A recent post by Robyn on the importance of a good data model, especially in a development environment using Agile practices made me sit up and think "Yes, I agree with you". So here are some thoughts of my own on why a Data Model is so important on large data processing applications that store their data in a database. If your application is neither large nor stores its data in a database, then you can probably ignore these thoughts.

First, let me state that I think the most important thing for the class of Application that I am discussing here is the Data. You could go so far as to say "Its ALL about the Data", and not the processing done by the Application.

Applications involving financial or medical data, for instance. I am sure that Customers of these classes of Applications care more about the integrity and correctness of their Data than they do about the features offered by the Application. Their Data must never be corrupted, or deleted, or lost through broken links to other data, or incorrect (miscalculated financial amounts or dates). It is the Data that is of value, not the processing of the Data. In fact the Data may live beyond the current application, and be migrated over to the next one and continue to live in one form or another.

Of course Data on its own is no use to anyone. Data almost always exists with an Application that manipulates it in one way or another. So the two can never be separated, or considered as completely independent problems. But it is Data first, Application second.

Given the importance then of the Data, we need a Data Model that defines the structure of this Data being manipulated by the Application. To me such a Data Model involves both a Design and full Documentation of that Design. And by "full Documentation" I simply mean "adequate" to capture all the necessary information about the Data and the Design, so that someone else coming along later could understand it. What use is a Data Model that no one can understand?

In my view the Data and the "Data Model" of it should stand in their own right. This is because the data and its meaning should be constant and consistent, regardless of which particular application is manipulating it. Remember - Data first, Application second.

By a "Data Model" I mean a high level Logical or Conceptual Model, and the main data entities and the relationships between them. This should always be the same model for a given set of data, regardless of which particular application is manipulating the Data itself.

Of course the details of the implementation of this Conceptual Model can differ. And all specific applications will require unique elements within the physical database - extra staging tables to hold intermediate results, flags to indicate the type of data or entry, status codes, etc. But these should be in addition to the core data elements identified in the Conceptual Model.

In other words, although the Physical Database will have lots of extra data columns and tables, you should still be able to map large parts of it back to the Entities in the Conceptual Model very easily.

If the "Data Model" is somehow unique and tied in to an Application and its implementation, and cannot be separated from it, then this is admitting that it does not capture the true "meaning" of the data, and that such a database could not be used by another application, or be queried directly.

The Data Model is the Foundation of an Application, just as we have a real Foundation of a House. Imagine what it would be like to build a house room by room, with only minimal foundations for each room as we build it? First a one room house with a roof. Then later you extend it for another room - extra foundations dug, change one existing wall, new walls built, roof extended. Then 2 more rooms - again dig foundations, build walls, extend the roof over them. Then you go up a storey by building some rooms on top - rip off the complete roof, build on top of the existing walls to increase their height, put in a new ceiling/floor between the two floors, and a new ceiling and a new roof. Maybe you go into the roof, instead of putting on a new floor. The roof will still be modified, and extended outward for windows to be put in. A floor will have to be laid, and stairs put in. Any water plumbing or electricity ring mains will have to be extended too on each new development. Is this an efficient way to build a house?

No. The sensible way is to dig all the foundations at once, and then build only what you needed or could afford - maybe only the 2 room model initially - on top of this. When you came to extend later it would be much easier - no digging of foundations, or realigning walls. Allowances would have been made for this in advance.

What I have seen in reality is that newer development methods (Object Orientation) and tools (Java) let developers more easily write code that manipulates data "objects" internally, and then somehow magically persist that data to a database. So the developers focus on application functionality, assuming that the "data persistence" problem will be taken care of somehow. And when it does not happen "transparently" and there are "issues", it then becomes a database administrator area problem and not a development area problem.

The developers argue that they are taking a "model led design", but unfortunately it is an "object" led design, and not a "data" led design. From my experience, the object to relational mapping is non-trivial, and many object led designs are imperfect - there are major revisions between version 1.0 and 2.0. So what happens to the database structure between version 1.0 and version 2.0? And more importantly, what happens to all of the customer's data in that database? You cannot just throw the data away and start with fresh empty tables again.

Also, in many Object Models there can be a lot of repetition between different objects. Customers, People, and Organisations for instance will all have names and addresses. If the Data Model were designed first this commonality would be recognised, and a single set of tables designed to store this data. Whether one shared table or separate identical tables does not matter - it is that the design is the same for all of them. But with separate development teams designing their own objects as they go, you run the risk of having separate Address objects for each piece of functionality developed, all differing in one way or another.

Data and Objects are not the same thing. They may have a lot in common, but they are not the same thing. And while developers continue to propagate this myth, we will continue to have poor data models, and poor databases, and poor performance as a result. If you believe that the data is important, then you should adopt a data led design approach. I do not see this as necessarily conflicting with other development methodologies such as Agile. As long as a data led design is incorporated into it.

4 comments:

Brian Tkatch said...

Nice post.

The question is, how to explain it to those who don't already understand it?

John Brady said...

Brian,

I think it is like dieting. You talk to people and they say that they "get" the issues, and promise to do better. But the moment you leave them alone they seem to forget everything and just go back to their old habits. They continue to buy the same old food and keep eating it.

Somehow we need to cross the chasm, and make people actively change what they do when designing and developing new applications. Unfortunately, all the new "exciting" methodologies are all around application development and not data design. Maybe we need to invent a new methodology with a new name, and see if it catches on?

John

Brian Tkatch said...

John,

Interesting association. You assume these people want to change but have a hard time, or at least pay lip service to it.

That could be, but it is not my experience. Of which i don't have much.

I prefer to chalk these things up to personality types. Specifically, the MBTI. Coders are usually Ps, DBAs are Js. Hence, coders just want to get it done, and deal with problems as they arise. DBAs want to get it right the first time.

That works out well as it forces compromise, which is good for everyone, as many times compromising is a method of sharing the losses. However, with a project, a compromise usually creates a product everyone can deal with.

The issue arises when Ps start becoming designers too. They are not designers because they do not see long term. So, they just design as the process moves along. That doesn't help anyone.

Just some thoughts.

what is sap ecc 6.0 said...

I do agree with your thoughts about data modelling finds it a very necessary portion in the whole application development. Our focus is on the data that we are processing and the data model represents an overview of the overall process and structure.