Thursday, 5 October 2017

The Leaky Ship of MFA.

We've just finished getting PCI compliance for our SaaS offering at work. One of the interesting things around this is the need for Multi-Factor Authentication (I'm going to use the abbreviation MFA for this going forward).  For those who don't know what MFA is, essentially it means that you need another secret to login besides your password.   Something like a code that gets SMS'd to your phone.

Now, this doesn't sound very contentious.  Obviously MFA is more secure than SFA (Single Factor Authentication, i.e. your password or, even better, passphrase). But the debate has been really how to do it correctly. 

The normal flow MFA would be:
  1. Challenge for username and password
  2. Check username and password, if either incorrect, reject login.
  3. Prompt for MFA code.
  4. Check MFA code, if valid, allow login.

Simple right? Well this is where things get a bit sticky. This design stems from three restrictions of UX when it comes to public facing websites. 
  1. Typically on a public facing webpage, the MFA is down using SMS. In order for the site to send you the code, it needs to know a number to send it to you. It also needs to know when to actually send the code.  So it makes sense to send the code and challenge for it after the login and password has been accepted.
  2. MFA is normally not mandatory. While systems will push MFA onto you (I'm looking at you Apple), there are few public pages which force you to have MFA in order to use them.
  3. As MFA causes friction in the sign-in process, a lot of sites only challenge people periodically, for instance every 30 days.

Given these 3 restrictions this pattern makes complete sense, so what's the problem?  Here's the rub, there are a couple of security implications to this pattern. 

Complication of State

Normally there is a binary state to authentication of a user in most models. She is either authenticated or not. This makes things clean for your trust model. Now we've introduced a new state, what state is the user between 2 and 3? This is where MFA bypass attacks come into play. 

The user needs to access certain end points to allow them to request a new code however they're obviously not allowed to go straight to the main page (otherwise it wouldn't actually be MFA). 

We've complicated our model by introducing this state, and we need to make sure that state is sandboxed appropriately.  In theory this should be fairly straightforward but in practice things can get messy and have given rise to the MFA bypass attacks.

Integration tests should make sure that tests are run against all the end points within the API assuming the state of a user who has completed their credentials but not entered their MFA token. It's important that these tests are complete and done automatically as part of your integration process. It's easy to start with the best intentions and then simply forget 6-8 months down the line.

Information Leak

Go to your favourite website and put in your username and an incorrect password. Look at the error message. Now put in a incorrect username and a valid password. Check out the error again, they're the same right?  That's because it's not leaking information. By that I mean the website isn't allowing an attack to know that they have guessed either the username or password correctly.  

With the pattern described above, we are leaking that the login and password are correct before they have fully authenticated. This really is an information leak but it's unavoidable with the approach defined above. 

A Better Approach?

OK so we see the issues and it begs the question, is there anyway of implementing a flow without these issues?  The problem with this flow stems from the restrictions rather than flow it's self.

Alright then, so if the problem is with the restrictions, are we sure they still apply? Excellent question! Let's explore that further with looking at some common vectors of attack like Vevo and Deloitte

Scenario 1 - Spear Phishing 

So you're cruising the web.
Feeling good. 
Then you get an email from LinkedIN about someone desperate to give you a fantastic new job paying millions of dollars.
Naturally, as you're a remarkable, and yet strangely chronically under valued, employee at your current job in EvilCorp, you eagerly hit that link to read more. 
You're prompted for your login details.
You think that's weird as normally your browser logins you in automatically but, whatever.
You put in your login and password.
And it turns out it's from a Nigerian Prince. 
You shrug your shoulders and go on with your life.

Meanwhile in some evil lair unbeknownst to yourself, an hacker has stirred from their slumber. They have caught you in their web. They now have your login and password for LinkedIN. While I am sure that your LinkedIN profile is infinitely valuable to you, it's not really what they're looking for.

They have their eyes set on EvilCorp whom has chronically undervalued and underpaid you. They know your email and as 99% of usernames are either email address or the first part of the email address, they are confident they have username. 

So that leads to the next variable, do you practice good password hygiene? Are you using different passwords for every service or is "P0w3rR4ng3rsRG0!" (great choice for a password BTW) that you used for LinkedIN your go to password for everything? 

They go to Evil Corps page and with bated breath they enter your email address followed by "P0w3rR4ng3rsRG0!", and wait......Getting tense isn't it? Bingo they're presented challenge for MFA. 

They now have validated your company login and password without needing to use the MFA. They are now armed with more (valuable) information then they had to begin with. Hence information leak. 

Scenario 2 - Brute Force

Peter is the next haxx0ring legend.
He just needs to prove it.
He notices on facebook that you're working for EvilCorp.
He takes a guess that your company email is probably <first letter of your first name><your surname>@evilcorp.com.
While Peter has determination, he isn't very sophisticated (yet).
He decides to try bruteforcing your password.
After a few hours and many forwarding boxes later, he hits pay dirt.
Whilst "sexgod" is an apt title for yourself, it might not have been the best password.
Now he is prompted for your MFA.

Now both these scenarios don't end up with you being owned, but they do show that with the established pattern that the password has validated / found. Now it's simply a case of finding a way of reseting the MFA, or looking for MFA bypass. 



Breaking the Restrictions.

OK, so now we have scenarios, let's look at those scenarios again. But instead of looking at them as a webapp catering the the wit and whimsy of the public. Let's look at them from the more structured world of corporate webapps.

  1.  Using SMS for MFA. The MFA code that gets sent when a user logs in is actually generated using a well-known and more important standard algorithm. You can checkout the RFC here. These days there are many really excellent a free Authenticators, Google Authenticator for Android, Microsoft Authenticator for iOS, 1Password has a great implementation for Windows and OS X. All these authenticators are designed for MFA and use stand QR to set up. It takes less then 10 seconds to setup. There really isn't any reason to use SMS to deliver an MFA code these days.
  2. There is no reason why MFA should not be a mandatory across your company's infrastructure. MFA is easy to setup and offers a huge pay off from a security perspective.
  3. Using something like 1Password makes this really trivial and less of a burden. It's not totally removed but considering the security pay off the pain really is minimal.
Obviously you need to have these discussions internally. But removing even one of these restrictions removes the information leak or reduces it significantly. Let's look a few new approaches based on lifting of the differing restrictions.


1. Removing All Restrictions.

OK, This is easy:

  1. Challenge for username, password and MFA
  2. Check username and password and MFA, if any are incorrect, reject login.
  3. Allow the user to enter.
As you can see this completely removes the complication of the state and mitigates both of the attack scenarios as the attacker has no idea if the password is correct or not without first knowing the MFA. We've doubled the amount of unknown variables he has to correct guess before gaining anymore information.

2. Removing SMS for MFA

This is a little more tricky as, obviously, we now have to deal with a mix of MFA and non-MFA users. So this would look like:
  1. Challenge for username and password.
  2. Check if the username needs MFA (do not check the password yet). 
    1. If the username doesn't exist, prompt for MFA
    2. If the username does exist and the user needs MFA, prompt for MFA
    3. If the username does exit and the user doesn't need MFA, Go to 5.
  3. Prompt for MFA.
  4. Check the password and the MFA, if either of them is invalid, reject login.
  5. Allow the user to enter
Here we have completely removed the complication of state. However, we still do have an information leak. Let's break it down a bit.

If the attacker doesn't know the username or password what will he learn? He learns nothing, as we prompt for the MFA token if the username doesn't exist so there is no way of knowing if the user exists and needs MFA or if the user doesn't exist.

If the attacker knows the username (which these days if very likely as it's normally email based) and has an idea of the password. Here if the user needs MFA, he learns nothing, as we check both the MFA token and the password at the same time and if either of them is wrong we reject the login. He has no way of knowing which one he got wrong. 

If the attacker knows the username and password. He can't look for MFA bypass as we do not have any half-authenticated states (i.e. logged in but not answered the MFA challenge). 

Conclusions

When we looked at the restrictions we went with the approach with the lifting of the SMS for MFA restriction. This meant we could have a security model which is stronger than the typical design pattern. 

While you should always attempt to go for common patterns whenever possible. This does show that it's important to understand why those patterns are being used the fist place. In the case of MFA pattern, it was a design which was created for the public in mind and minimum requirements needed for buy-in. In our case and a lot of cases for corporate webapps the restrictions which were placed on original pattern aren't there.  

Going down the well-traveled road is the best choice in software most of the time.  However, don't blindly follow it. I see this coming up a lot when people start talking about their stacks (MongoDB, Redis, Node.js) and when I ask them why they picked those over other approaches it's sometimes difficult to get a real answer.  The patterns we implement and the services we put into a stack have massive repercussions for years down the road. It's a lot cheaper and more effective to spent time thinking about them now, then trying to untangle them later down the path.





Monday, 18 September 2017

Lessons never learned, Equifax

It's curious looking at the fall out over the Equifax hack. I've been biting my tongue, not wanting to add to the hyperbole, but the temptation has become too great. I'm going to have to throw in my 2 cents.  

There is some disagreement to whether it was a simple as an admin/admin login and password  or if their servers where not patched, but what's really interesting in looking at the demands for more explanations and various CISOs giving their opinions on linkedin and blog posts about the hack, everyone is talking about patch management and password management. There are various discussions and excuses given to why this is hard and how people sign off knowingly on not upgrading. 

Now, I have strong opinions around these areas too and if you're running a server which is holding the crown jewels and something which will have a direct and significant impact on your customers' lives, each of the decisions on which exceptions will get signed off on have to be made at the upmost echelons of the company.  

This isn't the internal knowledge base you're talking about, this is the information which your very company is built upon. 

If your company deals in analytics and information, which Equifax does, then one of your absolute highest concerns is security. It simply has to be. People are trusting you and the things which are coming out of Equifax show that in no way did that company deserve that trust.  

But I feel like focusing on patching and password management also misses a far deeper and more troubling issue, an issue which often gets overlooked for these more superficial problems. 

That issue really is design.  The state of security today means that any CISO must understand that their network will be compromised.  I'm sorry guys to have to tell you that even as you read this, most likely, there is a compromised server, desktop, laptop, phone sitting within your infrastructure or with access to some server within your infrastructure.  So, given that premise, how on earth did any server, desktop or laptop have access to over 100 Million records? 

Any system which is designed to allow for access to that many records from a single source (or cluster of same type of sources) should never have been allowed out of the design stage.  Security doesn't stop or start at patch management and pen testing. 

Systems have to be designed with the idea that at some point they can be compromised.  This means limiting trust, making sure there are barriers at different stages, that when looking at data segmentation you're looking at it from a security model view. 

As more people are moving to a micro service architecture and going to a containerized deployment processes, there has to be a concerted effort to make sure a security model is set out with clear definitions of perimeters and trust.  With each step of the design there should be an idea of where this service sits within the security model and what the consequences of that service being compromised will have.

Thinking in terms of services and compromise arms you with an understanding of what level of security you should have around those services. With services running on docker and immutable instances within cloud means that there is very little excuse not to start looking long and hard about how you're segregating your data and limiting the scope of what can be accessed from a particular instance.

You're probably thinking that's all well and good when you're talking about a new architecture but obviously Equifax  are running a stack which predates this type of flexibility.  But really that isn't a good enough excuse.   Hackers being able to access that many records in one go shows that there is a fundamental problem at Equifax and that problem extends to the bowels of the deployment and software architecture. 

All the best password and patch management in the world will be a band-aide over the fundamental problems which need to be addressed. I am sure that Equifax will take a long hard look at this in the future, and it's probably a good time for all of us to do the same.




Wednesday, 23 August 2017

Entropy & Passwords


I've been putting off writing this for a while, mostly because I hope that this should be obvious to people in the security world. However, a few things happened lately and I thought I would finally bite the bullet and type this up. Most of this comes directly from NIST Electronic Authentication Guideline and is well worth a read.



Claude Shannon - 
By DobriZheglov (Own work) [CC BY-SA 4.0 (http://creativecommons.org/licenses/by-sa/4.0)], 
via Wikimedia Commons

In 1948 a gentleman called Claude Shannon wrote an article called "A Mathematical Theory" which gave birth to a new field of mathematics called Information Theory. It has a profound impact on how we define and use information.  

Within Prediction and Entropy of Printed English, Shannon states that:
 The entropy is a statistical parameter which measures, in a certain sense, how much information is produced on the average for each letter of a text in the language. If the language is translated into binary digits (0 or 1) in the most efficient way, the entropy  is the average number of binary digits required per letter of the original language.
In thermodynamics entropy is described as:
measure of disorder in the universe or of the availability of the energy in a system to do work. 
And similarly within Information Theory, Shannon was defining entropy as the uncertainty of symbols chosen from a given set of symbols, based on a priori probabilities.  i.e. The probability that a given variable (X) has particular value.

This type of entropy is used to define how difficult it is to guess a particular password. The higher the entropy the harder it is to guess what the password is.,

Why does this matter? Because we're currently in an arms race when it comes to passwords and customers are definitely suffering because of it. In a battle where entropy is the winning factor, humans are always going to lose. Humans aren't designed to be random, we're geared to see patterns. We're really good at patterns, so good at patterns in fact that we often see patterns which aren't there

Like most things which become a problem, it all started swimmingly...Back in the 80s and 90s, cracking password hashes was a time consuming business. Many hours I spent tuning my word lists and running those passwd and shadow files through Alec Muffett's Crack (I am sure that can be taken in many different ways, so let's just keep moving on).  If you had good password security it was possible to be secure and not be too onerous on the user. Today things have definitely changed, using GPUs and hashcat it's billions of hashes which can be computed in seconds.

What does this mean for users? Password hell. We keep pushing to have passwords have higher entropy. It's got to the point where we now have applications dedicated to generating passwords for users and keep track of them. Because for passwords to be secure they really need to be completely random.  By the way, I'm not saying you shouldn't use these, you definitely should but we still need a password to protect the other passwords so....

So what's the solution? 

So before we go further, let's do some tests, before we can do that we need to define how we're calculating entropy. So if we have a number(a) of values within a set (e.g. the letters of the alphabet) and the length of the (l), then the entropy would be a^l. For example using the alphabet and a length of 6 characters we get 26^6 = 308915776. However normally entropy is expressed in terms of bits so if we take the log2(308915776) we get 28.  So to abstract out that formula we have:


Where b is the cardinality of the set which the value can be chosen from.
Where l is the length of the password
Where H is the entropy in bits of the password.

There are 95 printable ascii characters, so let's look at entropy as password length increases:


This really shouldn't be surprising right? As the length increases uniformly the entropy increases along with it.  This describes the entropy if there is an equal chance of selecting any member of the set but humans don't select their characters randomly. Typically passwords are selected based off the language which the person speaks, if she is English then most cases she will select passwords based off the English language.  

As Shannon pointed out, the English language has a set of properties which significantly reduce it's entropy, capital letters tend to be used at the beginning of a word rather than in the middle of it, certain pairings of letters (i before e except after c, u next to q etc). These conventions and properties of a language reduce it's entropy. This is compounded by the fact that when a user is forced by password policy they tend to substitute i's for 1's, s's with $'s etc, rather than sprinkling these within the password. 

All this means that the entropy of real life password isn't anywhere near as strong as a random selection.  The excellent NIST Electronic Authentication Guideline has devised a "scoring" system to help determine a more realistic entropy based of human behavior. It's defined as:
• the entropy of the first character is taken to be 4 bits; 
• the entropy of the next 7 characters are 2 bits per character; this is roughly consistent with Shannon’s estimate that “when statistical effects extending over not more than 8 letters are considered the entropy is roughly 2.3 bits per character;” 
• for the 9th through the 20th character the entropy is taken to be 1.5 bits per character; -49- Special Publication 800-63 Electronic Authentication Guideline 
• for characters 21 and above the entropy is taken to be 1 bit per character; 
• A “bonus” of 6 bits of entropy is assigned for a composition rule that requires both upper case and non-alphabetic characters. This forces the use of these characters, but in many cases thee characters will occur only at the beginning or the end of the password, and it reduces the total search space somewhat, so the benefit is probably modest and nearly independent of the length of the password; 
• A bonus of up to 6 bits of entropy is added for an extensive dictionary check. If the attacker knows the dictionary, he can avoid testing those passwords, and will in any event, be able to guess much of the dictionary, which will, however, be the most likely selected passwords in the absence of a dictionary rule. 
Using that scheme, they determined the following results:

The results are interesting because as we see while enforcing complexity and testing passwords against dictionaries of known common passwords does indeed increase the entropy for smaller passwords, as the size of the password increases, the dominating factor again becomes the size of the password. 

So in a rather convoluted manner we're back to the original proposition, what's the best practice when it comes to passwords?

Well the solution really is bigger passwords, the problem is when people think passwords, they think exactly that; a pass word, like speak "friend" and enter, fame. But this reduces the scope for entropy significantly because 80% of words fall between 2-7 letters long.


Looking at the bell curve, after about 14 characters the increased entropy from dictionary and complexity rules starts to become redundant but that only leaves around 2% of English words we can use. On the bright side we can start looking at Germanic words which fair a lot better but it's probably not going to be a great UX experience trying to enforce that. 

So how we do get users to use more secure passwords without making them memorize gibberish? It's really simple, we start the long process of changing how people think of passwords and make them start making them look at using pass phrases.  

What's easier to remember? "Palo told us Thales fell down the well."  which 40 characters well above our sweet spot or "rb97P)eE4.Y3"? They both have similar entropy but one is more easier to remember than the other. 

This is a very long winded way of saying something very obvious but the next time you're making something to ask for a password, maybe give an example as a passphrase rather than a password.  It's going to take a while to move people away from thinking about single words. For a long time software has conditioned people to think of passwords that way. 

It's common even today to see systems which restrict the use of space or put 12-20 character limits on passwords, it's a tremendous disservice to the industry to do this. There really isn't any need for it anyway as these passwords are going to be hashed so it's not like the hashes are going to get bigger.

This was written as a request by a friend, I would just like to apologize to them how long it's taken me to actually get around to writing this. Hopefully you'll forgive the delay =)


Thursday, 3 August 2017

Trust. No. One. - Zero Trust in Networking.

It's funny how quickly things change. Especially so when it comes to technology and the internet. In some ways security reflects this world of constant change. In fact, within certain aspects of security these types of changes can happen daily. Exploits, vulnerabilities and patches happen can happen on extremely condensed time frames. 
We trust certain sets of programs and services. Hackers constantly attack those services so the realms of secure and non-secure intersect and switch constantly. As Alex Stamos discussed in his key note at Black Hat this year, this is the sexy part of security and it really is sexy. It's a game of chess, where security researchers and hackers test their skills. The days of easy exploits is mostly past us and the exploits that arise these days are staggering in their ingenuity. Understanding how these are made is fascinating and well worth exploring.
I am not going to be discussing the sexy world of 0day. Today I would like to talk about something unsexy but something which needs to be addressed, and that is networking. The assertion I'm going to make is that most networking models we use are simply hopelessly out of date, how we use the internet has changed but network models hasn't.
Let's look at conventional wisdom when it comes to networking design. If we look at the diagram above, we can see the castle like design of networks. Each circle represents a layer within the network which has been segmented from the other levels. 
This model is based on physical security, where you have a DMZ followed by areas of increasing security until some where in the middle there is the keep which represents the most secure layer. This type of design makes sense as it gives you a fall back plan where in order to get into the most secure area you must of first infiltrate all the other areas. 
The problem is that networks really don't work like that anymore, or realistically ever did. Typical companies have 3 layers, a DMZ, a middle layer where all the users and internal servers are kept and finally PCI segment. But these segments aren't so much staunch walls built for a robust defense, they tend to have gaping holes in them, things like VPNs and special firewall rules to allow traffic from different machines, satellite Offices which need access, public VPNs to allow for users to access the internal servers while working away from the office. Restricted segments tend not to have their own VPNs and normally have firewall rules which allow for select machines within the internal network access them. And within the last few years SaaS services and cloud deployments keep increasingly mission critical infrastructure and services. 
Networks aren't stable, they grow and evolve, firewall rules get added, don't always get removed and typically are run by ops rather than the security teams so there is nebulous arrangement of duties and keeping these perimeters secure. Coupled with infrastructure which has been in place for many years means that significant configuration drift could of taken place. This is compounded if companies have multiple mergers etc. 
The nature of attacks has changed, the old days of attacking directly the DMZ to try and gain access to the networks is very rare. Hackers start by attacking the users themselves with spear phishing attacks and the similar. Using the law of numbers to find an entrance, once they have a machine which has been compromised within the internal network, they leverage the trust of that network to scan for other machines within it they can attack and spread to. As they sink deeper into the internal networks they find more machines which have access to other segments of the network until eventually all areas have been breached. 
These types of hacks are common and represent the short comings of this type of network architecture. Once a hacker has pierced one of these segments he has gained that level of trust and access to all the machines within that segment. So how do we combat this? Also how can we combat this in a manner which doesn't mean a massive overhaul of networking infrastructure.
The answer appears to be simple but brings up many interesting connotations, trust no one. Stop caring about where a machine is within the network, assume every packet of communication is malicious until proven otherwise.
Sounds interesting doesn't it? While this has been discussed before, it's really Googles excellent BeyondCorp which has started to bring it into the main stream, where companies are starting to be built around it. With the continued adoptance of the cloud and SaaS, zero-trust networking is becoming more and more compelling. 
So how does this work? There is no completely standard way of implementing a Zero Trust network, but we can go over one of the ideas of how to implement such a network in theory loosely based on the BeyondCorp paper. 
The basic premise is built on 2 key assertions:
  1. Verification of the User or Process attempting to connect
  2. Verification the Machine the User or Process is using for the attempt.
  • Each service machine generates a certificate which is signed by a trust server.
  • Each service generates a certificate which is signed by the trust server.
  • Each client machine generates a certificate which is signed by a trust server. 
  • When a user enters the network he generates a certificate which is signed by a trust server.
  • When a connection is attempted to a service, the agent on the machine presents both the public key for the user and the machine to the trust server.
  • The trust server verifies that the user is allowed access to the service and the keys have been verified, a short term certificate is created and signed (typically for 10-30 mins). The public key is encrypted with the machines public key and the users public key and sent back to the agent.
  • The machine then makes an attempt to the service using the short term certificate. The service then reaches out to the trust server with the certificate, if the certificate is valid the trust server sends the key encrypted by the private keys of the service and the machine it's running on.
  • Then the connection is established.
Let's look at what this approach gives us within certain scenarios.
  1. Users credentials are compromised. If the users credentials are compromised it would not be enough to give access to any service. You need both certificate of a machine and the user to be able to access anything. When the compromise is discovered, simply revoking the users certificate is enough to stop all network access to all services within the network. 
  2. Machine has been compromised. Access to the machines certificate would not be good enough, you also need to have a compromised user certificate as well.When the compromise is discovered, simply revoking the machines certificate is also enough to isolate it from all services.
  3. The certificate for the connection is compromised. This connection certificates are short term and can only be used once, this means you need to compromise it and use it within a very shortened time frame. It reduces the risk significantly.
  4. Service has been compromised. There is a flaw with the service software and some how they have both the client certificate and the machine certificate, this would allow them access to the machine, but within the service machines certificate they can not continue to spread within the network.
  5. With this paradigm, the network connection is now not tied to the networking layer but to the security layer, meaning you can restrict the machines and users who have access to a particular service. If a user isn't allowed a service simply refuse to generate the certificate. 
Zero-Trust Networking gives a great deal of flexibility and a much needed rethink to how we manage security at the networking layer. While firewalls will always have their place, they're still static entities with hard to maintain sets of rules, the need for them to be physical boxes comes from a time when managing that amount of data needed dedicated hardware.  
But machines are now more than capable of handling the networking traffic they're going to be generating and with this approach we're effectively making each dynamic firewalls around all endpoints and services. All the while bringing the networking connections more inline with the security models based off a user and their roles.
Networking is on the edge of change with software defined networking and Zero-Trust, it's going to be an interesting time, maybe networking isn't so unsexy after all.

Monday, 21 March 2016

RDF Store, Part 1: Implementing B-Trees

Data structures are awesome! Seriously, I know that is probably one of the most geekiest things I have ever said, it's really true. Think about it, how you store your data has a profound effect on how quickly you can retrieve it. And I'm not talking about a few seconds here and a few seconds there. I'm talking about the difference between whether an application is usable and whether an application isn't.

One of the structures which is used extensively in data stores is B-Trees. While the avant-garde is to store all data in  memory and use a transaction log to rebuild those structures when a server is restarted, B-Trees are used extensively within disk based stores.

I've been reading around efficient ways of storing data for RDF stores, and while reading a paper called RDF-3X: a RISC-style Engine for RDF, it expressed some interesting ideas of using multiple B-Trees to store  and retrieve RDF statements. I thought this would be a great basis to start making a RDF store. As we're going to be mostly cooking this from the ground up, we first need a B-Tree implementation to start off with.

I've started creating this in C# which probably isn't the best choice of language for this, we will probably move from C# to another language when we start looking at making the store multi-node etc. But for now it's a good enough prototyping language for us to play with.

What are B-Trees and why are they important?


Before we answer that, let's have a look at binary search trees (BST). While BSTs are remarkably useful, their structure is dependent on the order in which the values are inserted into the tree, rather than the values of the objects being inserted. If the values are random, this isn't a problem, as the tree roughly becomes balanced. However if the values are ordered, we end up with a vine rather than a tree, this makes look up costs disastrous.


Fig 1. BST where insertion was {3,4,2,1,5}

Fig 2. BST where the insertion was {1,2,3,4,5}

For example, if we compare Fig 1 and Fig 2. Within Fig 1. inserting {6} takes 3 compares but within Fig 2. inserting {6} takes 5 compares. These are both BSTs with the same values within them, however Fig1. is much more efficient when it comes to the compares needed to look up a value or insert a value compared to Fig 2. 

The solution to this is to have the trees become self-balancing. There are many different tree structures which are designed to be self balancing. The one which we will be looking at for this post is B-Trees.

Structure of B-Tree

Fig 3. B-Tree with d = 2
B-Trees are a self-balanced tree structure where each node contains  a variable number of keys and a number of references to other nodes. First we define the minimum  number of keys a node can hold, normally denoted by d.

Each non-leaf node contains a number of keys from d to 2d, so if we have a d = 2, the minimum size for a node is 2 and the maximum is 4. 

As well as key values, each non-leaf node keeps d  to 2d + 1 pointers to other nodes.
Fig 3, shows a simple B-Tree.


Now we have roughly defined a B-Tree, in theory, let's start spec-ing out some simple code for a B-Tree. We'll do this in C# for the moment.

So we define a simple node class and define 2 lists for the keys and for the pointers to the other nodes.

To insert into a B-Tree.


Now let's look at adding into a B-Tree. There are 2 cases we need to look at first.

  1. Insert into a non-leaf node.
    1. If passed a key, we compare the key against the nodes stored keys and find which pointer we should pass this value on to.
    2. If this is a key and a node, we need to add the key to our keys and the node to our nodes. Then check to see if our key list is too large.
      1. If not too large, do nothing.
      2. Else,
        1. We split the keys into 2 lists (leftSide and rightSide) where the pivot is the median key. 
        2. Create a new node with  the key list from the rightSide.
        3. Check if we are the root node, if we are, create a new empty root node.
        4. Pass the new node and the median key to our parent node.
  2. Insert into a leaf node. Compare key against the stored keys, insert the key into the list. Check if the size of the stored keys is greater than 2d.
    1. If No, Finish.
    2. If Yes,
      1. Split the keys into 2 lists (leftSide and rightSide) where the pivot is the median key. 
      2. Create a new node with  the key list from the rightSide.
      3. Check if we are the root node, if we are, create a new empty root node.
      4. Pass the new node and the median key to our parent node.

That's pretty much it. It's a bit confusing as pseudo code, so let's look at the code for it

If we break this down.

Line 3: We find the position within where this key is compared to our keys.
Line 9: We check if there if this node is a leaf.
Line 11: If we aren't a leaf we insert the key into our list of keys.
Line 12: We check if the key list is now too large.
Line 14-29: We split the keys and create a new node.
Line 31: We check to see if we have a parent.
Line 36: Finally we insert the pivot key and the node into our parent.
Line 42: If we aren't a node we pass the key along.


So now we have to deal with insertion for non-leaf nodes.
Once again let's break this down.
Line 3: We find the position where this key is compared to our keys.
Line 8-9: We insert the key and the node into our lists at the correct spot.
Line 12: Check to see if we're over the maximum
Line 14-43: We split the keys and the pointers based on the median then create a new node.
Line 47: Passed the new key value and the new node to our parent to deal with.

This is how we insert into a B-Tree.

Get a value from a B-Tree.

This should be fairly self-explanatory.

Conclusion

OK, so now we have a very basic in-memory C# B-Tree, I've uploaded the source for it. It's not particularly useful yet, however it should be good enough for us to start building on. The next step will be to tidy up the code and leverage some of the observations about the structure to make improvements, when move to a B+-Tree structure and beyond.

The code can be found at GitHub

Monday, 14 March 2016

The current main stream ways of storing data suck but it's not their fault.

The current main stream ways of storing data suck, but the problem isn't them, it's us. We've moved on, and the way we store data just hasn't caught up. Databases, document stores, they're really good at storing arbitrary large amounts of records but aggregating that data? Christ, pass the bottle.

For the lucky ones this isn't an issue and never will be, the real bread and butter of application development of taking data off a screen into their own data store and pulling it out will always be there. Even though the tools have changed, the paradigms are the same.  But increasingly in the connected age, these bastions of data, following their own schemas never having to interface with other systems are becoming rarer.

We live in an age were data is processed and shared and the current ways aren't up for the job. Let's look at some of the more common approaches to data bases.

Firstly, the time honoured stalwart of data stores, the relational database. These things are great, they're highly structured and very easy to pull data out of. Perfect right? Well  with all that good stuff comes a few caveats, and they're big ones. The data must conform to a schema, and that schema has to be defined before the data is inserted into the database.  To be able to pull data out of the DB you need to know that schema. This really just doesn't scale (and by scale I do not mean how much data can be stored, but the types of data that can be stored), when you start adding hundreds of disparate types of data and their relationships into a relational database it becomes very complicated and a nightmare to maintain the schemas needed.

Secondly, the plucky fire brand, document stores particularly JSON document stores. These are the web devs dream, They know what they're sticking into these beasts and they don't need to keep any complicated relationships between the data. These are interesting because the document it's self defines the schema for that particular document (if you're using XML, JSON or CSV etc), this gives a huge amount of flexibility to insert into a DB compared to the relational DBs. Once again however, there is very little structure about how each of these documents related to each other, you're effectively sacrificing those relationships for ease of insertion of records.

Both of these types of stores are designed to be the master of their own data sets,  most companies store their data in a few of these stores and generally they don't have to query across multiple to the stores. If you've ever had to pull data from multiple stores and combine it in a meaningful way I feel sorry for you, because it's a massive pain, it always requires some sort of logic glue which is fragile and prone to breaking when schemas or document types change.

This gets compounded when you add cloud services. You have all your sales information in Salesforce, your marketing data in Postgres and portal data in MongoDB, now your boss wants you to aggregate it into a dashboard? Good luck with that. There are companies which specialize in doing this because of how complicated theses scenarios are.

This is a problem which will only become more self evident as the way we work and the way applications interact with each other increases. The traditional methods of storing this data will become increasingly burdensome to use.

Our approach has to change, and one of the ways that companies which already have hit this issue are turning to is RDF.  Google, Facebook, Wikipedia are increasingly formatting their data into RDF form to combat this issue. Why? RDF allows users to insert data as flexibility as a document store, but still maintains relationships  like a relationship DB, between the data which is incredibly valuable to be able to leverage that data.

In it's most simple of forms, RDF consists of a list of triples (Subject, Verb / Property, Object), by decomposing data into a list of these triples, the data is easy to insert but maintains relationships by creating and updating a graph of the data using the Subject and Objects as the nodes and the verbs as the vertices in a graph.

For example:

Inserting these Triples into a RDF DB:

(Shakespeare, authorOf, Macbeth),
(Shakespeare, authorOf, Hamlet),
(Ophelia, characterIn, Hamlet),
(Gertrude, characterIn, Hamlet)


Creates the structure:




Inherent within the structure we can see relationships and are able to query between those relationships without needing to express the structure before hand.  It allows for extremely complicated structures to be represented and more importantly queried in a way that scales.

As the structure is build dynamically,  the structure creates connections between un-related data without us having to expressly creating those connections. This means we can see how things are connected by simply walking the graph. For instance, in the structure above, I haven't explicitly linked Shakespeare with Ophelia but by walking the graph I can see they are indeed linked. This provides a very powerful tool for analytics.

The side effect of breaking down the data into a series of atomic statements, means that the records within a JSON store, or within a relationship DB can by easily and automatically decomposed into these statements. Allowing for the migration of legacy stores easily. 

For instance

ID First Name Last Name Age
1001 Joe Bloggs 100

Breaks down to :

(1001, FirstName, Joe),
(1001, LastName, Blogs),
(1001, Age, 100)

I'm sure a few guys are asking isn't this just a graph db? Well yes and no, RDF is a type of graph sure, but it holds to the invariant that each node holds one piece of information and one piece only. This really is needed for the model to scale. GraphDBs generally do not hold you to that invariant.

Do I think that RDF is the future? I don't know, the future is a hard thing to see. And RDF isn't perfect by a long shot, but I do know that more and more companies are using this approach to solve the short comings within data stores. This approach to storage is also being increasingly leveraged in the analytics world to be to find and recognize the complex relationships.

The type of problem of data normalization across DBs is only getting bigger and RDF seems to be a viable solution to it.






Monday, 7 March 2016

Erlang, is it really viable?

Lately, I've been looking for a language to write a pet project in. Realistically I will most likely fall back onto on the languages I feel most comfortable in, but I would like to start the project with the best intentions, coupled with the project being an exploration of ideas, I thought it would be cool to at least try and start it off in a language I'm not familiar with.

Originally I was thinking about writing it in Go, I've written a few things in Go and have done some fast prototyping in it and I have found it edifying to use, while mulling this over with an extremely talented  senior engineer at where I work, who also happens to be a functional programming and lisp fanatic, we ended up talking about Erlang.

Erlang has always been interesting to me, but in an academic sense, mostly as it always has seemed too good to be true. You hear these mythical up-times on servers written in Erlang, pure message passing model, process model built in, not bolted on, OTP allowing for building of highly cluster-able manageable servers. The language from which the nirvana of a concurrent, 99.99999% (Yes that's right, five 9s) up time servers is born.

And yet, in industry, writing something in Erlang gets eye-rolls and sighs about academia. Even I have been guilty of this. But even the system I architect-ed and have actively developed for the last 3 years has to deal with all the problems which Erlang proposes to if not solve, at least help with greatly.

But I didn't choose Erlang to write it in, and even if I did I knew my SVP or CTO would't go for it. Hell, if I think about it now, after I know a lot more about Erlang than I did a couple of weeks ago (though I would just like to be clear, I am very much at newb level), would I have chosen Erlang? No, I wouldn't.

Why?  Well the truth of the matter is that there are factors outside the technical realm which make it impractical, while the team started off small and in an area which fairly deep pool of engineers. The idea of being able to find 3-4 senior engineers who were proficient in Erlang would of been a nightmare.

While reflecting on this, I can't help but feel déjà vu, 3 years ago, I wrote the prototype of our next generation product in Go, and I remember sitting down and having the discussion which language the system should be written in, carry on with Go? Or go with Java? Well, we went with Java, there were some technical reasons, Go at the time wasn't anywhere near as performant compared to Java as it is today. But realistically, we knew we could get the java developers and when we brought our other development team over, even though they didn't know Java, they could pick it up easily.

What I find ironic about this is we went with the Java, Scala, Akka. Why Akka? Because it follows the actor model, promotes immutable message passing, allows us to develop complex relationships. It gives us concurrency, clustering, supervision tress, sound familiar huh? As Java moved from 6 to 7 and then 8, we see more and more functional programs paradigms come into the language.

The actor model has been around on this earth longer than I have, along with Lisp. Haskell came out when I was 6 years old, the ideas and paradigms they represent are entering the lexicon of the average developer and they seem to be doing so at an increasing rate. It's really now with the advent of sheer scale of processing and the amount of communication even the most trivial application needs to do that we are turning to the "new" models, which happen to be old models developed by visionaries.

Would I ever develop a commercial product using Erlang? I honestly don't know,  but I do expect every engineer in our group to be familiar with the paradigms it espouses.  The more problems I try to solve the more I realize there is very little which someone smarter than me hasn't already come across and has either algorithm / data structure to solve it, or has written a white paper on how to tackle it.

It's enough to make you wonder if there really are new frontiers? Or just new mappings to make new frontiers like old ones.