Monday 21 March 2016

RDF Store, Part 1: Implementing B-Trees

Data structures are awesome! Seriously, I know that's probably one of the geekiest things I have ever said, but it's really true. Think about it: how you store your data has a profound effect on how quickly you can retrieve it. And I'm not talking about a few seconds here and a few seconds there, I'm talking about the difference between an application being usable and it not being usable at all.

One of the structures used extensively in data stores is the B-Tree. While the avant-garde approach is to store all data in memory and use a transaction log to rebuild those structures when a server is restarted, B-Trees remain the workhorse of disk-based stores.

I've been reading around efficient ways of storing data for RDF stores, and a paper called RDF-3X: a RISC-style Engine for RDF expressed some interesting ideas about using multiple B-Trees to store and retrieve RDF statements. I thought this would be a great basis for starting to make an RDF store. As we're going to be mostly cooking this from the ground up, we first need a B-Tree implementation to start off with.

I've started creating this in C#, which probably isn't the best choice of language for this; we will probably move from C# to another language when we start looking at making the store multi-node and so on. But for now it's a good enough prototyping language for us to play with.

What are B-Trees and why are they important?


Before we answer that, let's have a look at binary search trees (BSTs). While BSTs are remarkably useful, their structure depends on the order in which the values are inserted into the tree, rather than on the values themselves. If the values arrive in random order this isn't a problem, as the tree becomes roughly balanced. However, if the values arrive in sorted order, we end up with a vine rather than a tree, which makes lookup costs disastrous.


Fig 1. BST where insertion was {3,4,2,1,5}

Fig 2. BST where the insertion was {1,2,3,4,5}

For example, compare Fig 1 and Fig 2. In Fig 1, inserting {6} takes 3 compares, but in Fig 2, inserting {6} takes 5 compares. These are both BSTs containing the same values, yet Fig 1 is much more efficient than Fig 2 in the number of compares needed to look up or insert a value.
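
To make the cost difference concrete, here is a tiny, hypothetical BST sketch in C# (separate from the B-Tree code later in the post) that counts how many nodes a new value is compared against on insertion; the class and method names are mine, not from the original post.

using System;

class BstNode
{
    public int Value;
    public BstNode Left, Right;

    public BstNode(int value) { Value = value; }

    // Inserts a value and returns how many nodes it was compared against.
    public int Insert(int value)
    {
        if (value < Value)
        {
            if (Left == null) { Left = new BstNode(value); return 1; }
            return 1 + Left.Insert(value);
        }
        if (Right == null) { Right = new BstNode(value); return 1; }
        return 1 + Right.Insert(value);
    }
}

class Program
{
    static void Main()
    {
        var orders = new[] { new[] { 3, 4, 2, 1, 5 }, new[] { 1, 2, 3, 4, 5 } };
        foreach (var order in orders)
        {
            var root = new BstNode(order[0]);
            for (int i = 1; i < order.Length; i++) root.Insert(order[i]);
            // Prints 3 for the Fig 1 insertion order and 5 for the Fig 2 order.
            Console.WriteLine(root.Insert(6));
        }
    }
}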

The solution to this is to have the trees become self-balancing. There are many different tree structures designed to be self-balancing; the one we will look at in this post is the B-Tree.

Structure of B-Tree

Fig 3. B-Tree with d = 2
B-Trees are a self-balancing tree structure where each node contains a variable number of keys and a number of references to other nodes. First we define the minimum number of keys a node can hold, normally denoted by d.

Each non-leaf node contains a number of keys from d to 2d, so if we have d = 2, the minimum number of keys for a node is 2 and the maximum is 4.

As well as key values, each non-leaf node keeps d + 1 to 2d + 1 pointers to other nodes.
Fig 3 shows a simple B-Tree.


Now that we have roughly defined a B-Tree in theory, let's start speccing out some simple code for one. We'll do this in C# for the moment.

So we define a simple node class with two lists: one for the keys and one for the pointers to the other nodes.
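
Something along these lines, assuming integer keys; the names (BTreeNode, Keys, Children, Parent, d) are illustrative rather than taken from the actual source.

using System.Collections.Generic;

// A single B-Tree node: a sorted list of keys plus pointers to child nodes.
public partial class BTreeNode
{
    private readonly int d; // the minimum number of keys a node may hold

    public List<int> Keys { get; } = new List<int>();
    public List<BTreeNode> Children { get; } = new List<BTreeNode>();
    public BTreeNode Parent { get; set; }

    public BTreeNode(int minimumKeys)
    {
        d = minimumKeys;
    }

    // A node with no child pointers is a leaf.
    public bool IsLeaf => Children.Count == 0;
}

The class is declared partial purely so that the insertion and lookup methods can be shown piece by piece below.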

Inserting into a B-Tree


Now let's look at adding a key into a B-Tree. There are two cases we need to consider.

  1. Insert into a non-leaf node.
    1. If passed a key, we compare the key against the node's stored keys and find which pointer we should pass this value on to.
    2. If passed a key and a node, we need to add the key to our keys and the node to our nodes, then check to see if our key list is too large.
      1. If not too large, do nothing.
      2. Else,
        1. We split the keys into 2 lists (leftSide and rightSide) where the pivot is the median key. 
        2. Create a new node with the key list from the rightSide.
        3. Check if we are the root node, if we are, create a new empty root node.
        4. Pass the new node and the median key to our parent node.
  2. Insert into a leaf node. Compare the key against the stored keys and insert the key into the list at the correct position. Check if the number of stored keys is greater than 2d.
    1. If No, Finish.
    2. If Yes,
      1. Split the keys into 2 lists (leftSide and rightSide) where the pivot is the median key. 
      2. Create a new node with the key list from the rightSide.
      3. Check if we are the root node, if we are, create a new empty root node.
      4. Pass the new node and the median key to our parent node.

That's pretty much it. It's a bit confusing as pseudo-code, so let's look at some code for it.
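
A rough sketch of that insertion logic might look like the following, written as a method on the BTreeNode class above (the names are mine, not from the actual source); it leans on an InsertKeyAndNode method, covered next, to hand a split up to the parent.

public partial class BTreeNode
{
    public void Insert(int key)
    {
        // Find the position of this key relative to our stored keys.
        int position = Keys.FindIndex(k => key < k);
        if (position < 0) position = Keys.Count;

        if (!IsLeaf)
        {
            // Not a leaf: pass the key along to the appropriate child.
            Children[position].Insert(key);
            return;
        }

        // Leaf: insert the key into our own list of keys.
        Keys.Insert(position, key);

        // If we now hold more than 2d keys, split around the median key.
        if (Keys.Count > 2 * d)
        {
            int median = Keys.Count / 2;
            int pivot = Keys[median];

            // The right-hand half of the keys moves into a new node.
            var rightSide = new BTreeNode(d);
            rightSide.Keys.AddRange(Keys.GetRange(median + 1, Keys.Count - median - 1));
            Keys.RemoveRange(median, Keys.Count - median);

            // If we are the root, create a new empty root above us (a wrapping
            // tree class would follow Parent pointers to find the new root).
            if (Parent == null)
            {
                Parent = new BTreeNode(d);
                Parent.Children.Add(this);
            }

            // Hand the pivot key and the new node to our parent to deal with.
            Parent.InsertKeyAndNode(pivot, rightSide);
        }
    }
}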

If we break this down.

First we find the position of this key relative to our stored keys, then check whether this node is a leaf.
If we are a leaf, we insert the key into our list of keys and check whether the key list is now too large.
If it is too large, we split the keys around the median and create a new node from the right-hand half, creating a new empty root first if we don't have a parent.
Finally we pass the pivot key and the new node up to our parent.
If we aren't a leaf, we simply pass the key along to the appropriate child.


So now we have to deal with insertion into non-leaf nodes.
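
As a rough sketch on the same BTreeNode class (again illustrative names, not the original code): a non-leaf node accepts the pivot key and the new right-hand node produced when one of its children splits, and may in turn split itself.

public partial class BTreeNode
{
    public void InsertKeyAndNode(int key, BTreeNode newChild)
    {
        // Find the position of this key relative to our stored keys.
        int position = Keys.FindIndex(k => key < k);
        if (position < 0) position = Keys.Count;

        // Insert the key, and the new child immediately to its right.
        Keys.Insert(position, key);
        Children.Insert(position + 1, newChild);
        newChild.Parent = this;

        // If we are now over the maximum of 2d keys, split around the median.
        if (Keys.Count > 2 * d)
        {
            int median = Keys.Count / 2;
            int pivot = Keys[median];

            // Move the right-hand keys and child pointers into a new node.
            var rightSide = new BTreeNode(d);
            rightSide.Keys.AddRange(Keys.GetRange(median + 1, Keys.Count - median - 1));
            rightSide.Children.AddRange(Children.GetRange(median + 1, Children.Count - median - 1));
            foreach (var child in rightSide.Children)
            {
                child.Parent = rightSide;
            }
            Keys.RemoveRange(median, Keys.Count - median);
            Children.RemoveRange(median + 1, Children.Count - median - 1);

            // If we are the root, create a new empty root above us.
            if (Parent == null)
            {
                Parent = new BTreeNode(d);
                Parent.Children.Add(this);
            }

            // Pass the pivot key and the new node up to our parent to deal with.
            Parent.InsertKeyAndNode(pivot, rightSide);
        }
    }
}
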
Once again let's break this down.
We find the position of this key relative to our stored keys, and insert the key and the node into our lists at the correct spot.
We then check whether we're over the maximum of 2d keys.
If we are, we split the keys and the pointers around the median and create a new node from the right-hand halves.
Finally we pass the pivot key and the new node up to our parent to deal with.

This is how we insert into a B-Tree.

Get a value from a B-Tree.

This should be fairly self-explanatory.
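
A lookup sketch in the same vein (again with my illustrative names): walk down the tree, comparing the key against each node's stored keys, until we either find it or run out of children.

public partial class BTreeNode
{
    public bool Contains(int key)
    {
        // Find the first stored key that is not smaller than the search key.
        int position = 0;
        while (position < Keys.Count && key > Keys[position])
        {
            position++;
        }

        // The key is stored in this node.
        if (position < Keys.Count && Keys[position] == key)
        {
            return true;
        }

        // A leaf with no match means the key isn't in the tree.
        if (IsLeaf)
        {
            return false;
        }

        // Otherwise descend into the only child that could contain the key.
        return Children[position].Contains(key);
    }
}

Putting the sketches together, usage looks something like this:

var root = new BTreeNode(2); // d = 2
foreach (var value in new[] { 1, 2, 3, 4, 5, 6, 7, 8, 9 })
{
    root.Insert(value);
    // A split can create a new root, so follow the Parent pointers back up.
    while (root.Parent != null) root = root.Parent;
}
System.Console.WriteLine(root.Contains(5));  // True
System.Console.WriteLine(root.Contains(42)); // False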

Conclusion

OK, so now we have a very basic in-memory C# B-Tree, and I've uploaded the source for it. It's not particularly useful yet, however it should be good enough for us to start building on. The next step will be to tidy up the code and leverage some of the observations about the structure to make improvements, then move to a B+-Tree structure and beyond.

The code can be found on GitHub.

Monday 14 March 2016

The current mainstream ways of storing data suck, but it's not their fault.

The current mainstream ways of storing data suck, but the problem isn't them, it's us. We've moved on, and the way we store data just hasn't caught up. Databases, document stores, they're really good at storing arbitrarily large numbers of records, but aggregating that data? Christ, pass the bottle.

For the lucky ones this isn't an issue and never will be; the real bread and butter of application development, taking data off a screen into your own data store and pulling it back out, will always be there. Even though the tools have changed, the paradigms are the same. But increasingly, in the connected age, these bastions of data, following their own schemas and never having to interface with other systems, are becoming rarer.

We live in an age where data is processed and shared, and the current ways aren't up to the job. Let's look at some of the more common approaches to databases.

Firstly, the time-honoured stalwart of data stores, the relational database. These things are great: they're highly structured and very easy to pull data out of. Perfect, right? Well, with all that good stuff come a few caveats, and they're big ones. The data must conform to a schema, and that schema has to be defined before the data is inserted into the database. To be able to pull data out of the DB you need to know that schema. This just doesn't scale (and by scale I don't mean how much data can be stored, but the types of data that can be stored); when you start adding hundreds of disparate types of data and their relationships into a relational database, it becomes very complicated and a nightmare to maintain the schemas needed.

Secondly, the plucky firebrand: document stores, particularly JSON document stores. These are the web dev's dream. They know what they're sticking into these beasts and they don't need to keep any complicated relationships between the data. These are interesting because the document itself defines the schema for that particular document (whether you're using XML, JSON, CSV, etc.), which gives a huge amount of flexibility on insert compared to relational DBs. Once again, however, there is very little structure describing how each of these documents relates to the others; you're effectively sacrificing those relationships for ease of insertion.

Both of these types of stores are designed to be the master of their own data sets. Most companies store their data in a few of these stores and generally they don't have to query across multiple stores. If you've ever had to pull data from multiple stores and combine it in a meaningful way, I feel sorry for you, because it's a massive pain; it always requires some sort of glue logic which is fragile and prone to breaking when schemas or document types change.

This gets compounded when you add cloud services. You have all your sales information in Salesforce, your marketing data in Postgres and your portal data in MongoDB, and now your boss wants you to aggregate it into a dashboard? Good luck with that. There are companies which specialize in doing this because of how complicated these scenarios are.

This is a problem which will only become more evident as the ways we work and the ways applications interact with each other multiply. The traditional methods of storing this data will become increasingly burdensome to use.

Our approach has to change, and one of the things that companies which have already hit this issue are turning to is RDF. Google, Facebook and Wikipedia are increasingly formatting their data into RDF form to combat this issue. Why? RDF allows users to insert data as flexibly as a document store, but still maintains relationships between the data, like a relational DB, which makes that data incredibly valuable to leverage.

In its simplest form, RDF consists of a list of triples (Subject, Verb / Property, Object). By decomposing data into a list of these triples, the data is easy to insert but maintains its relationships, by creating and updating a graph of the data using the Subjects and Objects as the nodes and the verbs as the edges in the graph.

For example:

Inserting these triples into an RDF DB:

(Shakespeare, authorOf, Macbeth),
(Shakespeare, authorOf, Hamlet),
(Ophelia, characterIn, Hamlet),
(Gertrude, characterIn, Hamlet)


Creates the structure:

[Figure: a graph with Shakespeare linked to Macbeth and Hamlet by authorOf edges, and Ophelia and Gertrude linked to Hamlet by characterIn edges.]

Inherent within the structure we can see relationships and are able to query across those relationships without needing to express the structure beforehand. It allows extremely complicated structures to be represented and, more importantly, queried in a way that scales.

As the structure is built dynamically, it creates connections between unrelated data without us having to expressly create those connections. This means we can see how things are connected by simply walking the graph. For instance, in the structure above, I haven't explicitly linked Shakespeare with Ophelia, but by walking the graph I can see they are indeed linked. This provides a very powerful tool for analytics.

A side effect of breaking the data down into a series of atomic statements is that the records within a JSON store, or within a relational DB, can be easily and automatically decomposed into these statements, allowing legacy stores to be migrated easily.

For instance:

ID     First Name   Last Name   Age
1001   Joe          Bloggs      100

Breaks down to:

(1001, FirstName, Joe),
(1001, LastName, Bloggs),
(1001, Age, 100)
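
As a rough illustration of how mechanical that decomposition is, here is a small C# sketch; the Triple and TripleDecomposer names are hypothetical rather than from any particular RDF library. Every field of a flat record becomes one (subject, predicate, object) statement.

using System.Collections.Generic;

public class Triple
{
    public string Subject { get; set; }
    public string Predicate { get; set; }
    public string Object { get; set; }
}

public static class TripleDecomposer
{
    // One triple per field: the record's ID is the subject, the column name
    // the predicate, and the value the object.
    public static List<Triple> Decompose(string id, IDictionary<string, string> fields)
    {
        var triples = new List<Triple>();
        foreach (var field in fields)
        {
            triples.Add(new Triple { Subject = id, Predicate = field.Key, Object = field.Value });
        }
        return triples;
    }
}

Feeding it the row above, with "1001" as the ID and the three columns as fields, yields exactly the three statements listed.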

I'm sure a few of you are asking: isn't this just a graph DB? Well, yes and no. RDF is a type of graph, sure, but it holds to the invariant that each node holds one piece of information and one piece only, which is really needed for the model to scale. Graph DBs generally do not hold you to that invariant.

Do I think that RDF is the future? I don't know, the future is a hard thing to see. And RDF isn't perfect by a long shot, but I do know that more and more companies are using this approach to address the shortcomings of their data stores. This approach to storage is also being increasingly leveraged in the analytics world to find and recognize complex relationships.

The problem of data normalization across DBs is only getting bigger, and RDF seems to be a viable solution to it.






Monday 7 March 2016

Erlang, is it really viable?

Lately, I've been looking for a language to write a pet project in. Realistically I will most likely fall back onto one of the languages I feel most comfortable in, but I would like to start the project with the best intentions, and as the project is an exploration of ideas, I thought it would be cool to at least try to start it off in a language I'm not familiar with.

Originally I was thinking about writing it in Go. I've written a few things in Go, have done some fast prototyping in it, and have found it edifying to use. But while mulling this over with an extremely talented senior engineer where I work, who also happens to be a functional programming and Lisp fanatic, we ended up talking about Erlang.

Erlang has always been interesting to me, but in an academic sense, mostly as it has always seemed too good to be true. You hear of these mythical uptimes on servers written in Erlang, the pure message-passing model, the process model built in, not bolted on, OTP allowing for the building of highly clusterable, manageable servers. The language from which the nirvana of concurrent, 99.99999% (yes, that's right, seven 9s) uptime servers is born.

And yet, in industry, writing something in Erlang gets eye-rolls and sighs about academia. Even I have been guilty of this. But even the system I architected and have actively developed for the last 3 years has to deal with all the problems which Erlang proposes, if not to solve, then at least to help with greatly.

But I didn't choose Erlang to write it in, and even if I had, I knew my SVP or CTO wouldn't go for it. Hell, if I think about it now, knowing a lot more about Erlang than I did a couple of weeks ago (though I would just like to be clear, I am very much at newb level), would I have chosen Erlang? No, I wouldn't.

Why? Well, the truth of the matter is that there are factors outside the technical realm which make it impractical. While the team started off small, and in an area which has a fairly deep pool of engineers, the idea of being able to find 3-4 senior engineers proficient in Erlang would have been a nightmare.

While reflecting on this, I can't help but feel déjà vu. Three years ago, I wrote the prototype of our next-generation product in Go, and I remember sitting down and having the discussion about which language the system should be written in: carry on with Go, or go with Java? Well, we went with Java. There were some technical reasons; Go at the time wasn't anywhere near as performant compared to Java as it is today. But realistically, we knew we could get the Java developers, and when we brought our other development team over, even though they didn't know Java, they could pick it up easily.

What I find ironic about this is that we went with Java, Scala and Akka. Why Akka? Because it follows the actor model, promotes immutable message passing, and allows us to develop complex relationships. It gives us concurrency, clustering, supervision trees. Sound familiar, huh? And as Java moved from 6 to 7 and then 8, we saw more and more functional programming paradigms come into the language.

The actor model has been around on this earth longer than I have, along with Lisp. Haskell came out when I was 6 years old. The ideas and paradigms they represent are entering the lexicon of the average developer, and they seem to be doing so at an increasing rate. It's really only now, with the sheer scale of processing and the amount of communication even the most trivial application needs to do, that we are turning to these "new" models, which happen to be old models developed by visionaries.

Would I ever develop a commercial product using Erlang? I honestly don't know, but I do expect every engineer in our group to be familiar with the paradigms it espouses. The more problems I try to solve, the more I realize there is very little which someone smarter than me hasn't already come across and either has an algorithm or data structure to solve, or has written a white paper on how to tackle.

It's enough to make you wonder if there really are new frontiers, or just new mappings that make new frontiers look like old ones.


Sunday 6 March 2016

Birthday Surface Book

So today is my birthday and my ever-loving wife has surprised me with a Surface Book. It's the first non-Linux / OS X laptop we have had in the house for at least 5 years.

Normally a Windows laptop wouldn't have interested me in the slightest; hell, any type of laptop these days, really. However, when the Surface Book came out, it immediately intrigued me. It was the first laptop I had seen in a long time which tried to do something different. I was very tempted, however the better angels in my head stopped me from grabbing one, mostly because I was worried it would be a toy for me; my daily work and daily development are so Linux-based that even using OS X causes me pain due to its idiosyncrasies compared to Linux. And when some of the teething issues came up, rather ironically on the software side, my temptation died down even more.

It was surprising when my wife said we should go to the Windows Store. I actually wanted to go because I was really tempted by the Dell curved monitor, which I knew they had there (I'm a total sucker for monitors). Once we were there, she said we should grab the Surface Book. It was a strange experience actually; I have brought home so many cool gadgets that my girls have become numb to new tech and don't normally show any huge interest, especially in something like a new laptop. But the girls were all over the Surface Book. All three of them loved the pen and the tablet / laptop idea; it was strange to see them interested in a laptop.

The Windows Store is an interesting place; you get the feeling that it wants to be an Apple Store so badly, but doesn't quite pull it off. The whole experience was good, but rough around the edges in many areas.

After that I was convinced to get the laptop, which I grudgingly admit didn't take much. The buying experience was definitely a mixed one. We had a great attendant and they had the Surface Book we wanted in stock, but the whole process was a bit rough: we asked to have the receipt emailed to us, however that never actually happened. They also offered the VIP service, which comprised installing all the latest firmware and OS patches and came with the added incentive of a $50 gift card, so what the heck, we went for it. We were told it would ideally take an hour or two, a maximum of three, and that they would call us when it was done. Unfortunately that phone call never came, and by 5 o'clock we decided just to go back and pick it up.

So now I'm sitting here typing on my brand new shiny Windows laptop. I've had mixed feelings about Windows; my main workstation is an Ubuntu box and my work laptop is a Mac, so using a Windows machine as a daily driver is going to be a different experience for me. I'm looking forward to setting up the development environment on it.

Do I like my new machine? Well, the answer to that has to be yes. It really is a remarkable piece of engineering. There are a few things I need to get used to: the keyboard has a bit more travel than the MacBook one, and I hate the way the cursor keys are set up. The trackpad isn't bad, but it isn't the polished experience you get from a MacBook.

Obviously it's early days, but I'm looking forward to seeing how my workflow changes using it. Back when I was making games, Visual Studio was a remarkable IDE and I'm looking forward to using it again; also, C# and F# on .NET 3.5 onwards are very cool languages, so we'll see what happens.