Why Data Normalisation is the forgotten silver bullet

Introduction

Without realising it we as human’s compartmentalise objects to reason about the world more effectively. An apple is round, hard and green. If I throw it it will probably go this far, if it’s mushy instead of hard it’s no longer a healthy apple but a rotten apple. This technique developed so when faced with the overwhelming complexity of the world we can reason quickly based on the abstractions we have built over the environment around us.

Strongly typed languages by design benefit from this exact property, by telling the system you expect this value to always be an integer or string you are telling the system about the world being created, now that you are passing an integer to this method it knows this value can be added to another integer but not a string, it knows if you try to iterate over the integer that doesn’t make sense either – it’s not a group of things it’s one thing.

That’s great you know all of that already, so you can just program it in a completely non idiotic way, but that’s not what we want. Like a small baby, you want to teach it the properties of the apple, the different ways it can be used compared to other objects in the hopes that one day it knows enough about the world to make decisions on its own. Our system will reach a complexity where we can’t make simple changes anymore without impacting the whole system and at that point we want it to do most of the work for us.

The same principles we have been talking about with programming languages are used in another area which is a bit more fashionable to debate these days – relational databases and NoSQL. The benefits of NoSQL are clear and immediate, you get faster code because the data is in one place and when you need to update the system about the environment it is in – perhaps it only ever saw healthy apples, but now you need a flag to say whether it is healthy or mouldy. You don’t need to sit in a room for a week and ask yourself if mould is the property of an apple, or if a mouldy apple is a completely different object to a healthy apple.

The benefits of relational databases are not so obvious, but I hope that going through an example from our real world system below will convince you that much like our strongly typed languages, by restricting exactly what our objects represent in the system we give the system itself knowledge about the environment and how it interacts which offloads massive cognitive load from you the programmer, preventing you from making mistakes because mistakes are simply not possible in the world you’ve created.

Data Anomalies

Our aim is to avoid the 3 demons of data anomalies, take this data:

Employee_ID Name Department Student_Group
123 J. Longfellow Accounting Beta Alpha Psi
234 B. Rech Marketing Marketing Club
234 B. Rech Marketing Management Club
456 A. Bruchs CIS Technology Org.
456 A. Bruchs CIS Beta Alpha Psi

This data is not normalised, and will thus suffer from data anomalies.

Update Anomaly: If A. Bruchs updates his department, it needs to be updated in 2 places at once or the environment no longer makes sense

Deletion Anomaly: If the Student Group Beta Alpha Psi was removed because the group no longer exists, J. Longfellow would no longer be an employee in the system

Insertion Anomaly: If I want to insert a new employee they must belong to a student_group, but what if an employee needs to be registered before they choose a group

Real-Life Consequences

Let’s use our real life requirements for an order to see how we did. Our order can have multiple order items where each order item is a SKU (unique identifier), quantity and price. All good so far.

Now we have the requirement that if a group of items are together they may form a “package” deal like the ski package represented below. The way we have represented this is to say now a real order item is that which has a ParentItemId of NULL and anything with a ParentItemId of NOT NULL is actually a child item of the package. To add to this a Package OrderItem doesn’t actually exist, so we will make up the identifier but guarantee is starts with “PKG”.

OrderItemId OrderId ParentItemId Sku Quantity Price
19189428 6106482 NULL PKG010 1 0
19189429 6106482 19189428 5052776748339 1 3
19189430 6106482 19189428 5052776345767 1 30
19189431 6106482 19189428 5052776743136 1 5
19189432 6106482 19189428 5052776747370 1 7
19189433 6106482 19189428 5052776774710 1 45

This is a subset of our actual OrderItems table in the database which is used to solve our problem of ski-packages, but to me it doesn’t sit quite right and I have some intuitions it will cause issues further down the line – let’s investigate.

Normalisation

Normalisation is a hard topic to grasp a face value, we wont discuss the intricacies here but it should be in the tool-belt of every programmer.

Is it in first normal form?

Are there any repeating groups in individual tables? No, each cell has a single atomic piece of data.

Is there a separate table for each set of related data? Uhoh, it looks like we’ve hit our first snag already, item packages are a collection of items, but they are in the same table as order items themselves.

As we mentioned, any data which is not in 3rd normal form is going to suffer from data anomalies, let’s see if we can find them.

Update Anomaly: If we update the quantity of the package item, we also need to update the linked rows

Insert Anomaly: If we insert another OrderItem under the package we will have changed the package definition for this order but none of the other orders

Deletion Anomaly: If we delete the package, we need to delete all of the order items and vice versa

What do these examples mean? It means you’re no longer passing around an object which has a single interface to the external world. An order item is an order item unless it’s a package item, you’re now doomed to repeat this logic everywhere instead of defining it once in the data layer. The same with our apple analogy, it is green in every scenario unless it is mouldy, then you should treat it as a brown apple. So once again if we now need logic to determine packages from non-packages lets go looking for it I’m sure we’ll find something!

Example 1

public IList<RefundOrderItemsModel> SetRefundPackageQuantities(IList<RefundOrderItemsModel> refundOrderItemsModel) {
    var packages = refundOrderItemsModel
            .Where(x => !string.IsNullOrEmpty(x.Sku) && 
                        x.Sku.StartsWith("PKG") &&
                        x.RefundQuantity > 0)
            .ToList();

    foreach (var package in packages) {
        var items = refundOrderItemsModel.Where(x =>
                       !string.IsNullOrEmpty(x.Sku) &&
                       !x.Sku.StartsWith("PKG") &&
                       x.PackageId == package.PackageId);
        foreach (var item in items) 
            item.RefundQuantity = package.RefundQuantity;
    }

    return refundOrderItemsModel;
}

Example 2

var listPackagesId = refundOrderItems
    .Where(x => x.RefundQuantity > 0 &&
                !string.IsNullOrEmpty(x.Sku) &&
                x.Sku.StartsWith("PKG")).Select(y => y.PackageId).ToList();

foreach (var item in refundOrderItems) {
    if (!string.IsNullOrEmpty(item.Sku) && 
            !item.Sku.StartsWith("PKG") &&
            listPackagesId.Contains(item.PackageId))
        continue;
}

Example 3

if (orderItem.IsChildItem ||
        !orderItem.Sku.StartsWith("PKG") &&
        !string.IsNullOrEmpty(orderItem.CustomFields.TryGetValue("PackageItemId"))) {
    var zeroPrice = CurrencyHelper.FormatCurrency(_currency, 0M);
    
    foreach (TableCell c in e.Row.Cells) {
        ApplyPackageFormatting(c);
        AlterPriceDisplayForPackages(c, zeroPrice);
    }
}

Example 4

internal static void BuildTaxJarItemsFromOrderShipment(TaxJarDto taxJarDto, OrderShipment orderShipment, ICollection<string> kidsItemsList) {
    foreach (var orderItem in orderShipment.OrderItems.FindAll(f => f.OrderItemType == OrderItemType.Product)) {
        if (!string.IsNullOrEmpty(orderItem.KitList) || orderItem.Sku.StartsWith("PKG"))
            continue;
    }
}

As expected we now have several areas in the code which have no idea that an OrderItem can also be a PackageItem so we need to keep reteaching this to the system, any new development will also need to be aware of this cognitive load.

Real-Life Solutions

We need to step back and ask, what is a “ski package”. It’s a collection of items, not a single item so there is our first mistake. What we are saying is we sell all these items individually but if you buy them in this group the items price is lowered. So we have two options to go down:

  • Bundle items together so each package gets a correct SKU, and say throughout the system if you have this SKU – you should bundle these items together. This means right down the warehouse level we need to be consistent, you can never split these items or combine other items to create this SKU
  • Treat a ski package as a promotion, creating a rules engine which specifies if you have a specific combination of items – you get a discount on those items

Approach #1

OrderItems
OrderId ItemId
6106482 5052776774711
OrderPackageItems
OrderId PackageId
6106482 PKG010
Items
ItemId Name Price
5052776774711 Non Package Item 5
OrderDiscount
PackageId Name Price
PKG010 Ski Package #1 89
PackageItems
PackageId ItemId
PKG010 5052776748339
PKG010 5052776345767
PKG010 5052776743136
PKG010 5052776747370
PKG010 5052776774710

This is a good start, we can reason more about how to display each piece of data in the correct way.

  • For OrderItems just show the line, for PackageItems we may want to get each ItemId in the package.
  • When adding or remove a package we only need to remove a single line – avoiding orders getting into an inconsistent state.
  • If we want to open up functionality to allow business users to define new ski packages we just need to expose the Package and PackageItems tables.

What are the downsides? We need to take into account OrderItems and PackageItems every time we want to calculate the true price or say what an order really consists of. I think we can do slightly better in option #2.

Approach #2

OrderItems
OrderId ItemId Price
6106482 5052776748339 20
6106482 5052776345767 20
6106482 5052776743136 10
6106482 5052776747370 10
6106482 5052776774710 20
Items
ItemId Name
5052776774711 Non Package Item
OrderAdjustment
OrderId DiscountId Price Reason
6106482 PKG010 -20 Ski Package Bonus
OrderAdjustmentItems
DiscountId ItemId
PKG010 5052776748339
PKG010 5052776345767
PKG010 5052776743136
PKG010 5052776747370
PKG010 5052776774710

Now we have fewer tables to handle and have nicely encapsulated OrderItems from packages, we have left price in for price history purposes but we could also take this off and say if the price changes, the SKU should also change.

  • We can add rows to OrderAdjustments that are not always about packages, we may run other promotions and simply when doing payments need to calculate the OrderItems sum and the OrderAdjustments sum.
  • If we want to display what items were associated with a specific discount we also have the OrderAdjustmentItems table.

The downside to this approach is the rules engine will determine which items are packages along with other discounts such as “if they buy 5 or more items give them 50% off” which makes sense when it gets to that level of business logic.

Now let’s see what our examples might look like with a different schema:

Example 1

public IList<RefundOrderItemsModel> SetRefundPackageQuantities(IList<RefundOrderItemsModel> refundOrderItemsModel) {
    # Completely removed, now redundant we don't need to treat packages as a special case.
}

Example 2

var listPackagesId = refundOrderItems
    .Where(x => x.RefundQuantity > 0)
    .Select(y => y.PackageId).ToList();

Example 3

# Completely removed, we don't need to format packages differently now.

Example 4

internal static void BuildTaxJarItemsFromOrderShipment(TaxJarDto taxJarDto, OrderShipment orderShipment, ICollection<string> kidsItemsList) {
    foreach (var orderItem in orderShipment.OrderItems.FindAll(f => f.OrderItemType == OrderItemType.Product)) {
        # We can just check taxes on all OrderItems now without checking for PKG.
    }
}

We’ve also reduced the cognitive load required for new features, we have a much more intuitive system which can simply add all OrderItems together and subtract any OrderDiscounts.

Summary

Can you deal with unnormalised data in code? Of course. There isn’t a problem that arises in skipping data formatting that you can’t later on fix in code, but you retain that information in your head, you haven’t passed that knowledge into the system. When you pass an apple around and you decide to refactor or create new code, or hire a new person, you always need to be aware that apple can be a mouldy apple as well, the colour is not always green but can sometimes be brown as a result. Let’s make it conceptually impossible in the data and instead let the system decide where the acceptable places to use healthy or mouldy apples should be.

The downsides of NoSQL are often less obvious and insidious, the side effects taking root all around your system in the form of extra conditionals and steps to format or cleanse the data. NoSQL is an excellent tool for prototyping and performance gains, but outside of those realms I urge you to deeply consider why it is preferable.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s