Monday, 14 March 2016

The current mainstream ways of storing data suck, but it's not their fault.

The current mainstream ways of storing data suck, but the problem isn't them, it's us. We've moved on, and the way we store data just hasn't caught up. Databases, document stores, they're really good at storing arbitrarily large numbers of records, but aggregating that data? Christ, pass the bottle.

For the lucky ones this isn't an issue and never will be. The real bread and butter of application development, taking data off a screen, putting it into your own data store and pulling it back out, will always be there. Even though the tools have changed, the paradigms are the same. But increasingly, in the connected age, these bastions of data, following their own schemas and never having to interface with other systems, are becoming rarer.

We live in an age where data is processed and shared, and the current ways aren't up to the job. Let's look at some of the more common approaches to databases.

Firstly, the time-honoured stalwart of data stores, the relational database. These things are great: they're highly structured and very easy to pull data out of. Perfect, right? Well, with all that good stuff come a few caveats, and they're big ones. The data must conform to a schema, and that schema has to be defined before the data is inserted into the database. To be able to pull data out of the DB you need to know that schema. This really just doesn't scale (and by scale I don't mean how much data can be stored, but how many types of data can be stored). When you start adding hundreds of disparate types of data and their relationships into a relational database, the schemas become very complicated and a nightmare to maintain.

Secondly, the plucky firebrand: document stores, particularly JSON document stores. These are the web dev's dream. They know what they're sticking into these beasts and they don't need to keep any complicated relationships between the data. These are interesting because the document itself defines the schema for that particular document (if you're using XML, JSON, CSV, etc.), which gives a huge amount of flexibility when inserting into the DB compared to relational DBs. Once again, however, there's a catch: there is very little structure describing how these documents relate to each other. You're effectively sacrificing those relationships for ease of insertion.

Both of these types of stores are designed to be the master of their own data sets. Most companies keep their data in a few of these stores and generally don't have to query across several of them. If you've ever had to pull data from multiple stores and combine it in a meaningful way, I feel sorry for you, because it's a massive pain. It always requires some sort of logic glue, which is fragile and prone to breaking when schemas or document types change.

This gets compounded when you add cloud services. You have all your sales information in Salesforce, your marketing data in Postgres and your portal data in MongoDB, and now your boss wants you to aggregate it into a dashboard? Good luck with that. There are companies which specialize in doing this because of how complicated these scenarios are.

This is a problem which will only become more evident as the way we work changes and applications interact with each other more and more. The traditional methods of storing this data will become increasingly burdensome to use.

Our approach has to change, and one of the things that companies which have already hit this issue are turning to is RDF (the Resource Description Framework). Google, Facebook and Wikipedia are increasingly formatting their data as RDF to combat this issue. Why? RDF allows users to insert data as flexibly as a document store, but still maintains the relationships between the data like a relational DB, and those relationships are incredibly valuable when you want to leverage that data.

In its simplest form, RDF consists of a list of triples (Subject, Verb / Property, Object). By decomposing data into a list of these triples, the data is easy to insert but maintains relationships by creating and updating a graph of the data, using the Subjects and Objects as the nodes and the Verbs as the edges.

For example:

Inserting these triples into an RDF DB:

(Shakespeare, authorOf, Macbeth),
(Shakespeare, authorOf, Hamlet),
(Ophelia, characterIn, Hamlet),
(Gertrude, characterIn, Hamlet)


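Creates the structure:

[Image: a graph in which Shakespeare points to Macbeth and to Hamlet via authorOf, and Ophelia and Gertrude each point to Hamlet via characterIn]
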
Inherent within the structure we can see relationships, and we are able to query across those relationships without needing to express the structure beforehand. It allows extremely complicated structures to be represented and, more importantly, queried in a way that scales.
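To make the querying idea concrete, here is a minimal sketch in Python. It is not how a real RDF store works internally; it just keeps the four triples from the example in a plain list and answers pattern queries where any field can be left as a wildcard. The names `TRIPLES` and `match` are made up for this illustration.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# The four statements from the example, stored as (subject, property, object).
TRIPLES: List[Triple] = [
    ("Shakespeare", "authorOf", "Macbeth"),
    ("Shakespeare", "authorOf", "Hamlet"),
    ("Ophelia", "characterIn", "Hamlet"),
    ("Gertrude", "characterIn", "Hamlet"),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    prop: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return every triple that agrees with the given fields.

    A field left as None acts as a wildcard (hypothetical helper,
    not part of any RDF library).
    """
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and prop in (None, p) and obj in (None, o)
    ]


# "What did Shakespeare write?"
works = [o for (_, _, o) in match(TRIPLES, subject="Shakespeare", prop="authorOf")]

# "Who is a character in Hamlet?"
characters = [s for (s, _, _) in match(TRIPLES, prop="characterIn", obj="Hamlet")]

print(works)
print(characters)
```

Notice that nothing resembling a schema was declared anywhere: the queries only rely on the shape of a triple.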

As the structure is built dynamically, it creates connections between otherwise unrelated data without us having to expressly create those connections. This means we can see how things are connected by simply walking the graph. For instance, in the structure above, I haven't explicitly linked Shakespeare with Ophelia, but by walking the graph I can see they are indeed linked (both connect to Hamlet). This provides a very powerful tool for analytics.
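Walking the graph can be sketched in a few lines of Python. This is only an illustration under some assumptions: the triples live in memory, every edge may be followed in either direction, and a breadth-first search is used to find the shortest chain of nodes. The helpers `build_graph` and `find_path` are invented for this example.

```python
from collections import defaultdict, deque

TRIPLES = [
    ("Shakespeare", "authorOf", "Macbeth"),
    ("Shakespeare", "authorOf", "Hamlet"),
    ("Ophelia", "characterIn", "Hamlet"),
    ("Gertrude", "characterIn", "Hamlet"),
]


def build_graph(triples):
    """Index the triples as an adjacency list keyed by node name."""
    graph = defaultdict(list)
    for subject, prop, obj in triples:
        graph[subject].append((prop, obj))
        # Also store the reverse direction so we can walk either way.
        graph[obj].append((prop, subject))
    return graph


def find_path(triples, start, goal):
    """Breadth-first search; returns the list of nodes visited, or None."""
    graph = build_graph(triples)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for _prop, neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


path = find_path(TRIPLES, "Shakespeare", "Ophelia")
print(path)
```

Here the walk goes from Shakespeare to Hamlet to Ophelia, even though no triple mentions Shakespeare and Ophelia together.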

A side effect of breaking the data down into a series of atomic statements is that the records within a JSON store, or within a relational DB, can be easily and automatically decomposed into these statements, which makes migrating legacy stores easy.

For instance:

ID    First Name  Last Name  Age
1001  Joe         Bloggs     100

Breaks down to:

(1001, FirstName, Joe),
(1001, LastName, Bloggs),
(1001, Age, 100)
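The decomposition is mechanical enough to sketch in Python. The snippet below assumes each record arrives as a flat dictionary (a database row, or a flat JSON object) and that one column, here "ID", identifies the record. The function name `row_to_triples` and the choice to strip spaces from column names are assumptions made for this example.

```python
import json


def row_to_triples(row, id_column="ID"):
    """Turn one flat record into (subject, property, value) triples.

    The value in `id_column` becomes the subject; every other column
    becomes a property, with spaces removed from its name.
    """
    subject = row[id_column]
    return [
        (subject, column.replace(" ", ""), value)
        for column, value in row.items()
        if column != id_column
    ]


# A relational row, represented as a dict of column name to value.
row = {"ID": 1001, "First Name": "Joe", "Last Name": "Bloggs", "Age": 100}
triples = row_to_triples(row)

# A flat JSON document has the same shape once it is parsed.
doc = json.loads('{"ID": 1001, "First Name": "Joe", "Last Name": "Bloggs", "Age": 100}')
doc_triples = row_to_triples(doc)

for triple in triples:
    print(triple)
```

Nested documents would need a bit more work (each nested object needs its own subject), which this sketch deliberately leaves out.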

I'm sure a few of you are asking, isn't this just a graph DB? Well, yes and no. RDF is a type of graph, sure, but it holds to the invariant that each node holds one piece of information and one piece only. This really is needed for the model to scale. Graph DBs generally do not hold you to that invariant.

Do I think that RDF is the future? I don't know, the future is a hard thing to see. And RDF isn't perfect by a long shot, but I do know that more and more companies are using this approach to solve the shortcomings of their data stores. This approach to storage is also being increasingly leveraged in the analytics world to find and recognize complex relationships.

The problem of normalizing data across DBs is only getting bigger, and RDF seems to be a viable solution to it.





