Databases and Performance: October 2015

JSON is a simple data storage format intended to allow easy data interchange between programs and systems. Data is stored in a text format as objects of "name : value" pairs. And that's about it, other than objects can be nested (a value could be another whole object) and multiple objects can occur one after another in the data set (a list or array of such objects). This makes it slightly database like because you have records (objects) made up of named fields holding data values, with nested records inside other records for any kind of relationship or hierarchy.

The "name : value" storage structure is often termed a Dictionary where a given name is associated with its value - other terms such as Hash Map are used in other programming languages. This makes a JSON data set an almost perfect match for processing in Python which has almost directly corresponding data types of dict for the "name : value" pairs and list for multiple objects of the same type. And Python offers a standard "json" module (library) for importing and exporting such JSON data sets into and out of Python objects (the corresponding methods are called "load" and "dump" for consistency with the naming in the existing "pickle" module). Generally speaking when you load in a JSON data set it is converted to a corresponding Python dict object, that may in turn contain other embedded list and dict objects according to the structure of the JSON data being loaded in.

Trello

A real example of this in use is with Trello, where you can export your Do Lists to a local file as a JSON data set. You could then open this file and read in the JSON data set in Python, and process it one way or another. Note that you can also directly access the live data in a Trello List using a suitable URL - Google the Trello API for more on this.

Trello structures its data as follows in a hierarchy where each thing has a name and an internal, normally hidden identifier:

A Board is the top level structure - each Board is separate from each other
A Board contains multiple Lists
A List contains Cards - normally these are the "actions" you want to do
A Card can contain a few subsidiary other data items, such as a Checklist, or Labels (categories)

With the data from one Trello Board now read into a Python dict object it becomes very easy to navigate through this and pull out individual members of a list for further processing. An obvious example is to do a simple reformat of the lists for printing purposes - Trello only lets you print the whole board or an individual card, and a board can run to many, many pages when you have lots of lists and cards (items) in it given the default spacing between items.

Trello does not actually store the data in a strict hierarchy, but instead more in a relational format, where child items will contain the identifier ("id") of the parent item they are part of. Thus the top level JSON data for a Trello Board contains data of:

name - name of the board
id - identifier of this board, a long hexadecimal number
desc - description of the board
lists - all the Lists in this Board
cards - all the individual Cards in all the Lists
checklists - all the Checklists used in different Cards

i.e. most things are stored directly linked off a Board, and not stored nested within each other. This is clearly to allow easy moving of Cards between Lists - only the parent identifier within the Card needs to be updated, not the Lists involved or the Board itself.

The main data sets of lists, cards and checklists come into Python as a list (array) of multiple members, each member being a dict. Each data record for these child data sets contains an "id" field for the parent i.e. "idBoard" in a List, and "idList" in a Card (Checklists are slightly different as the Card does contain a list of the Checklist identifiers).

Example 1 - Printing the names of the lists in a board

We just need to iterate over the "lists" data set of the Board, printing out the "name" of each List. We can also check if each list is open i.e. not closed. Note that I am using Python 3, hence "print" is a true function now.

import json
import sys

NAME = "name"
ID = "id"
DESC = "desc"
CLOSED = "closed"
LISTS = "lists"

json_file = open (sys.argv [0])
trello_board = json.load (json_file)
print ("Board: " + trello_board [NAME])
print ("Lists:")
# trello_board [LISTS] is a Python list, each member being a Python dict for each List
# Loop through all the Lists i.e. one List at a time
for trello_list in trello_board [LISTS] :
    if (not trello_list [CLOSED]) :
        print (trello_list [NAME])

This assumes that the JSON file name is given as the first argument on the command line to the Python program.

Example 2 - Printing out the cards in one list

Assuming that the second command line argument is the first part of the name of a List in the Board, then we can use this to find that List and then print out just its Cards.

import json
import sys

NAME = "name"
ID = "id"
DESC = "desc"
CLOSED = "closed"
LISTS = "lists"
CARDS = "cards"
CHECKLISTS = "checklists"
IDBOARD = "idBoard"
IDLIST = "idList"
IDCHECKLISTS = "idChecklists"

json_file = open (sys.argv [0])
trello_board = json.load (json_file)
print ("Board: " + trello_board [NAME])

# Loop through all Lists in the Board, comparing its name against the input name
for trello_list in trello_board [LISTS] :
    # Only do a partial match on name on leading part of name
    if (trello_list [NAME] [:len(sys.argv [1])] == sys.argv [1] and not trello_list [CLOSED]) :
        print ("List: " + trello_list [NAME])
        # Loop through all Cards in all Lists, checking the parent ID of the Card
        for trello_card in trello_board [CARDS] :
            if (trello_card [IDLIST] == trello_list [ID] and not trello_card [CLOSED]) :
                print (trello_card [NAME])
                if (trello_card [DESC]) :
                    print (trello_card [DESC])

Note that in reality you would have extra code to check that the command line arguments were present, and better error handling for things like the file not existing.

Also I actually used "textwrap" in Python to word wrap long lines properly in the output, and indent wrapped lines for better readability - I've just used "print" directly in these examples to keep it simple.

Summary

That's it for JSON and Python using Trello as an example. You can load in a JSON data set from a file into a corresponding Python data structure using just one method call (json.method), and it is very easy to traverse that data structure finding and processing the data elements you want to using the "field name" as the index into the dictionary of "name : value" pairs in the Python data structure.

I've been quiet for a long while on this blog because I had been working for a company that just did not get databases at all, and it has been a real energy sapper dealing with the consequences of all of the wrong decisions they kept making. I'll spare you the gory details, but the biggest problems of many were the wholesale adoption of Agile Development (Scrum) with no training, support or planned cutover coupled with the complete abandonment of any documentation, and the decision by the Technical Architect that databases were trivial and unimportant now so proper, up front database design was not needed any more.

As you can imagine the consequences of such changes were almost catastrophic and very repetitive:-

No one knew whether any newly developed piece of software was correct because there was no documentation at all on what it was supposed to be doing in the first place or how it should have been doing it
Doing Agile Development with no clear end goal meant that every new Sprint was always refactoring the previous Sprint's work to add in the new but unplanned additional functionality so real productivity kept decreasing while test failure rates increased
The resultant database was a complete mess with data all over the various tables, undocumented, with duplicated data fields within the same table, and no indexes whatsoever other than on the primary key fields
They used Hibernate to "generate" the SQL queries, but no one knew how to make it generate "good" SQL - one search query joined the main data table to itself 5 times in the same query, and no one in the development team could explain why.

That's progress for you!!

I've been out of there for some time now, but it left behind a distinctly bad taste about "modern software development" and databases, and put me off doing database work for some time. Luckily though I've slowly come back round to realising that databases are the right place to store important data, and they should be designed properly and not just thrown together. And there are still challenges to be met with good database design, and new technologies to learn and use.

Which means both that I should be starting to blog more frequently in the future, and that I'm now looking again for database related work (Oracle based, around Woking in the UK, if anyone needs a good Oracle person).

Databases and Performance

Monday, 26 October 2015

JSON, Python & Trello

Trello

Example 1 - Printing the names of the lists in a board

Example 2 - Printing out the cards in one list

Summary

Friday, 16 October 2015

Getting My Passion Back

Search This Blog

About Me

Blog Archive

Popular Posts

Blogroll (Alphabetical)

Labels