Monday, July 23, 2007

Why storing (too much) data in tables is a Really Bad Idea (tm)

The world is full of Really Bad Ideas which look like good ideas at first glance. My favorite example of this is the Schick Slim Twin disposable razor. The Slim Twin is, as the name implies, a twin-blade razor. It has the "innovative" feature of having a small plastic tab between the blades. The tab is attached to a little button that lets you push the tab towards the business end of the blades thereby (ostensibly) forcing out the razor stubble and other assorted gunk that is trapped in the space between the blades. Schick actually had a series of TV commercials that touted this feature when the Slim Twin first launched many years ago.

It seems like a good if not particularly earth-shattering idea at first glance. That is, until you actually try it. What you find is that the Slim Twin actually collects a lot more gunk than other razors. This is because the plastic tab blocks the space between the blades and actually causes gunk to build up in the first place! When the tab isn't there, any accumulated gunk just falls out of the back of the blades. So this "innovation" actually causes the problem it purports to solve. (And in fact, it makes the problem much worse, because hair can get caught between the tab and the blades, at which point it becomes all but impossible to dislodge. Don't ask me how I know all this.)

Which brings me to this blog entry from Jeff Smith where he asserts that "data belongs in your tables -- not in your code." It seems like a plausible enough assertion on its face, kind of like the idea that having a little tab in the razor to push gunk out ought to be a useful feature. But he doesn't back up this assertion with any actual arguments, only with examples. And in those examples he looks only at the benefits of storing data in tables instead of code and none of the drawbacks.

There are a lot of problems with storing data in tables the way Jeff suggests, but there is one overriding uber-problem, but I won't spoil the fun by telling you what it is just yet. Instead, consider what happens if you follow Jeff's advice, for example:


Your company defines 3 shifts per day: Shift 1 is 12:00AM-8AM, Shift 2 is 8AM-4PM, Shift 3 is 4PM-12AM.

So, when you need to evaluate a DateTime variable to see which shift it falls into, you write:

SELECT ..., CASE WHEN Hour(SomeDate) < 8 THEN 1
WHEN Hour(SomeDate) < 16 THEN 2
ELSE 3
END as ShiftNo
FROM ...

Great, except you now have data stored in your code -- not in a table! Let's store our data where it belongs by creating a “Shifts” table:

And now you simply write:

SELECT ...., Shift.ShiftNo, Shift.Description
FROM ...
INNER JOIN Shift
ON Hour(SomeDate) BETWEEN Shift.StartHour and Shift.EndHour


Now consider what would happen if the company's shift schedule were to change. Simple, you just update the SHIFTS table to reflect the new schedule and you're done, right?

Except that all your historical data is now wrong because it is based on the old shift schedule. And that old shift schedule is now gone.

So the first problem with storing data in tables is that relational databases don't have revision control. Code does. And if you have data that has the kinds of dependencies that revision control systems are good at tracking then you might well be better off having that data in your code so that you revision control system can track it.

But there is a much more fundamental problem with Jeff's advice, and that is that there is no sharp dividing line between code and data. Look at those SQL queries. They are just strings, and hence they are data. So should we store them in a QUERIES table? For that matter, look at the code itself. That is just data too. Why not store that in a table?

The fact of the matter is that the admonition to store data in tables is completely vacuous because the distinction between code and data is arbitrary. It is therefore, just like the tab in the Slim Twin, worse than useless because it seems like such a good idea but in fact it creates problems rather than solving them.

The right way to decide what to put where is to look at the properties of the data you need to store. If it's large quantites of identically structured data that doesn't change in ways that alters referential integrity then it probably belongs in the database. If it's small quantities of data whose structure defines the semantics of other data and which doesn't change at run-time, then it probably belongs somewhere else, if not actually in the code then probably in a configuration file under revision control.

But just because something "looks like data" doesn't mean it belongs in a table.

UPDATE: I do not deny that the problems with Jeff's original example can be fixed. But the point is 1) there are problems and 2) they have to be fixed and 3) the process of fixing the problems is, in this example and many others, essentially, re-invention of revision control. There is no magic that automatically accrues unvarnished benefits merely from moving "data" (whatever that means) out of code and into a database, and applying Jeff's advice uncritically is as likely to create problems as solve them.

15 comments:

Jeffrey said...
This comment has been removed by the author.
Jeffrey said...

Ron -- you might want to read the article you are criticizing a little more closely ...

your criticism is easily addressed in the very next paragraph after what you quoted!

>>This allows you to add descriptions, shift managers, calculation adjustments, anything you need that is related to a shift. In addition, you can use this Shift table to drive reports and now changing to a 4 shift day is a matter of changing data and not by changing code. Even better, add a "WeekEnd?" flag or "WeekDay" column to your shift table to allow for different shifts for different days of the week; add more columns for more variations. Even historical changes can be represented (and clearly documented) in this table by adding StartDate/EndDate columns to the primary key.

Jeffrey said...

In fact, if you wish to store shifts historically, you *must* store that data in your tables with effective dates! you can't just alter your code -- your historical reporting will not work! how does storing the start/end times in your code help you with this?

Are you honestly suggesting that a table that defines SHIFTS, in this example, is not data? That is doesn't belong in a database? That assigning shift managers or titles to shifts in a database table for reporting is NOT a good idea? There should just be endless CASE and IF-THEN statements sprinkled throughout code everywhere to handle this instead?

Yikes!

Mark said...

I tend to think a little simpler.

Any data I could potentially want to change on the fly, or by external means (management script/webpage) goes in a table.

Soumya said...

I'm with Jeffrey on this example... rather than changing the data of the existing shifts in the table, you would add new shifts which are to be used while keeping the rows with the old shifts intact for historical data.

Of course, now if the shift data table is being used to populate drop down boxes in other parts of the app, the table must indicate which shifts are 'current' and which 'historical' so the appropriate shifts may be displayed.

All this suggests to me that if the data-in-tables-only paradigm is to be used, we need a kind of template or design pattern for tables that allow for future additions.

Ron said...

Thanks for the comments. I've posted an update to clarify my position.

Jeffrey said...

Rob -- there are no problems. That was an example. If you need historical reporting or tracking, you simply model it that way. As the article clearly says.

As I asked earlier: How would you handle this in your applications? How would you handle historical reporting requirements when the definition of shifts change? How about some concrete examples.

Also, if you cannot distinguish between data such as a shift name or the the hours it is assigned to versus code like a SELECT statement or objects in your database schema, then you really should read up on databases and fundamental programming concepts.

Mike Harris said...

>Except that all your historical
>data is now wrong because
>it is based on the old shift
>schedule. And that old shift
>schedule is now gone.

You don't keep your database scripts in revision control?

Cris said...

Ron makes this point with his usual clarity. It is the dual of the usual dictum that you need to use databases to store (some/much of your) data because if you don't you end up building a database in your code.
Both are true, but easily forgotten.

As for the example, it is just a parable, so no need to debug it, fix it, or rebut it.

Aaron said...

uhh, use a history table and an update trigger to track changes...

...you need to decide what should be stored in data versus in code based on how often that data can change...sql statements do not change that often - if they do then you've got another problem altogether. Business rules, on the other hand, can and will change quite frequently. True, there is no "sharp dividing line" between code and data, especially if you decide not to set one yourself. That's why developers are usually paid quite well; they have the experience to know where to set that line.

Ron said...

Jeffrey, your comments deserve longer replies than I have time to write at the moment. I'll try to get to it tomorrow. (BTW, are you Jeff Smith, the author of the original article, or are you some other Jeffrey?)

Mike:

You don't keep your database scripts in revision control?

Of course. But keeping your scripts under revision control is not at all the same thing as keeping your database under revision control.

And Cris: Thanks!

Jeffrey said...

Ron (sorry for typing "Rob" earlier, by the way) -- Yes, I am the author of that article.

Maybe this will help clear things up?

http://weblogs.sqlteam.com/jeffs/archive/2007/07/24/60266.aspx

Honroy said...

It's important to consider the likelihood of change as well. Just because something can change does not mean that it is likely to change. In the shifts case, the shifts have probably been the same for many years and will probably stay the same for years to come, so putting them into the code keeps your application simple.

Here's my Rules of Thumb on this matter:

High likelihood of change, small amount of data: config/resource files.
Low likelihood of change, small amount of data: code/constants.
High/Low likelihood of change, large amount of data: database.

YANP said...

The whole using revision control to store historically sensitive data is completely fubar. This is stretching the purpose of revision control beyond it's intent.

Say you worked 3rd shift based on what is essentially shift rules mon-wed. On Thursday the company re-evaluates its policies and wishes to change the shift boundries starting on Thursday so that you now fall under the 2nd shift.

If such a change where to effect how you got paid then you would have no recourse of how to effectively pay out two types of time for that week if your shift rules where in revision control. You'd end up being paid as a 2nd shift worker even though you effectively worked as a 3rd shift employee for most of the week.

In summary historically sensitive data needs to be persisted in data not in code.

I'm sorry but Jeff is completely right here.

WolfDancer said...

I am with Ron based on what Cris said, in case Ron is frustrated with all these comments. I know I would.