Minimize Code, Maximize Data (database-programmer.blogspot.com)

Sunday, May 4, 2008

Minimize Code, Maximize Data

Early in my career, I was fortunate to receive some programming lessons from one of the early pioneers in computer age, A. Neil Pappalardo. While I am sure he would not remember me, I certainly remembered him, and he said one thing that I remember very much: Minimize Code, Maximize Data.

This is the Database Programmer blog, for anybody who wants practical advice on database use.

There are links to other essays at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

The Best Kept Secret in Programming

This week we are going to examine what I was told so many years ago:

Minimize Code, Maximize Data

Since then, and it was nearly 15 years ago, I have never once heard another programmer (except myself) express this very basic and simple idea. It is the best kept secret in programming. I can think of two reasons why most programmers working today have never heard the simple idea, "minimize code, maximize data."

First possibility: the guy was totally wrong. I highly doubt this however as his company is approaching its 40th year of worldwide growth, and how many companies in this field last that long, never mind prosper and grow?

The second possibility seems much more likely to me: most programmers just don't think that way. Most programmers would never stumble upon this idea on their own and if they do hear it they forget it or reject it. We programmers love to code, and are reluctant at best to accept the idea that he who codes least codes best.

Now we will see how this simple idea impacts the entire software cycle, from designer to user.

The Example: Magazine Regulation

Our example is fairly easy. Consider a magazine distributor, a company that receives magazines in bulk and distributes them to stores. His system contains a table called DEFAULTS that lists the default quantity of each individual magazine that is distributed to each store (a big fat cross-reference table).

Now for the tricky part. Let's say we run a "SELECT SUM(qty)" from the DEFAULTS table for TV GUIDE and we get 2123. That means he needs 2123 copies of TV GUIDE to give every store their default delivery. But when the magazines arrived from the supplier, they only received 1900. What does he do? Well it so happens this happens every day, the distributor in fact never receives his exact default requirements, they are always over or under. So the distributor must run a process called regulation.

Regulation is an automatic computer process that runs through the defaults, compares them to what you actually have, and increases or decreases the delivery amounts to each store until everything lines up. It is by nature iterative, you run through the stores over and over increasing amounts by one or two until you balance out. Next we will look at how to write it.

Writing a Regulation Program

Before we get to the details of the regulation program, I have to point out one detail about magazine distribution. It turns out that 50-80% of magazines sent to retail racks go back to the distributors and are shredded. This means it is not uncommon to talk about sales percentages of 50% or 60%. If a store is receiving too many magazines, they might easily have a sales percent as low as 10%.

So how do we write the regulation program? I'm going to avoid any long build-up here and just go straight to the answer. You make a table of rules that determine how the regulation process works. The owner of the distribution company will say things like, "If the store sold less than 20% of the TV GUIDES for the past 4 weeks, drop their amount by 2, but do not go below 2." So you make a table with columns like "THRESHOLD_PERCENT", "DECREMENT" and "ABSOLUTE_MINIMUM". There will also be rules that operate when the company has too many magazines. The owner says, "If his sales percent is greater than 80%, give him two more." So your table now contains percentages and increase amounts to use when raising the distribution amounts.

From here your program is reduced to a fairly simple affair, it is little more than a double-nested loop. You fetch a list of the rules, in order, and you begin to iterate through the magazines, applying the current rule to the current magazine. You keep applying each rule over and over until there are no more cases where it applies, then you move on to the next rule. As soon as the total magazines matches the amount you have, the program terminates. It is the very soul of simplicity, and a perfect example of what database applications can do so well.

The Wrong Way To Do It

It so happens that I have a customer who is a magazine distributor, which is how I learned about this process. The system he is using now, which I maintain for him, was not written as I described it above. There is no table of rules. The programmer coded up each different rule separately, and the programmer is now dead (I feel like Dave Barry saying this, but I swear I am not making this up). The owner and I are both afraid to touch the code, and so he lives with it while we plan for something better.

While I do not wish to speak ill of the deceased, it is safe to say that the original programmer did not know he was supposed to minimize code and maximize data. Let's examine the consequences of that ignorance:

The customer is not in total control of his own business, because he
cannot reliably modify one of his most basic operations.

So we can now draw the conclusion that the "minimize code, maximize data" has very direct consequences for the user experience, especially those very important users who sign the checks.

The Impact Upon Your Code

A code grinder who is not trying to minimize code usually ends up with very complicated programs. This is true even of veteran and expert programmers who supposedly know better. They set out with an idea to simplify things and then end up writing complex heavily inter-dependent class libraries. I contend that this happens because they don't know that they are supposed to minimize code by maximizing data.

On the other hand, the database programmer who knows to maximize data ends up with dramatically simpler code patterns. This happens because a lot of the complex conditionals and branching logic that is required for code-centric solutions is reduced to row scanning operations that contain much simpler algorithms in the innermost loop.

We can see this in the example above for the regulation program. The data-centric solution is basically one loop nested inside of another with the core operation occurring on the inside. The code-centric solution, which I mentioned we are afraid to touch, is full of conditionals and branches that make it dangerous to mess with for fear of causing unintended side-effects. The original programmer could probably do anything with it, everybody else is reduced to doing nothing.

The problem becomes more exaggerated as time goes on. As programs mature, their original simplicity is enhanced to handle more sophisticated edge cases and exceptions. As this process continues, many of these sophisticated edge cases and exceptions will be mutually exclusive or will interact in subtle ways. When this happens the code-centric program becomes increasingly difficult to understand and keep correct, as the layering of conditionals and branches and interdependencies makes it harder and harder to eliminate unwanted side effects. The data-centric solution on the other hand, while still becoming more difficult, is reduced to simply making sure that the tables provide the correct options for the code, and the code remains a matter of scanning the rules, picking precedence, and executing them.

The Impact on Debugging

It is much easier to debug data than code, especially when a simple operation has been matured to handle a collection of subtle edge cases and exceptions that may interact with each other.

If we continue to the example of the regulation process, the basic question arises, how do you test this thing? The entire concept of reliable testing is far more than we can cover in a single essay, but I do want to introduce the idea of testing with -- you guessed it, more data.

The regulation program as I have described it loops through rules and magazine quantities and adjusts those quantities. It is a row-by-row process, with each pass through the inner loop executing a single change. Debugging this process can be vastly simplified if a table is made that records, in order, each update that was made and what rule was used to make the update. Obvious errors could be detected if a rule was being applied out of order, or if the rule made the wrong change. Generating the test cases brings in yet again more data, creating a body of magazines, defaults and customers that deliberately stresses the system with extreme cases.

This debug-the-data approach can have a huge impact on boosting the customer's confidence and control. If he can directly view this log he has a perfect black-and-white explanation of how the process ran. If he does not like it he can change the rules. If you let your program run in "planning only" mode, where it generates these logs without making the changes, then your customer can play what-if and you will find you have truly made a new friend!

Meta-Data and the Data Dictionary

This week's essay leads naturally to the matter of meta-data and data dictionaries and how they can dramatically reduce the amount of code needed for routine table maintenance tasks. However, that is a large topic and must be reserved for one or more future essays.

When we get to the topic of data dictionaries, we will be looking at how a data dictionary can give you true zero-code generated table maintenance forms, among other things.

Conclusion

The motto "Minimize Code, Maximize Data" is not well known in popular discussions today, which I contend is a natural consequence of our basic personalities as programmers. Coding is what we do and we tend to think of solving problems as an exercise in code and more code.

Nevertheless, we have seen this week in a specific example (which could be repeated many times over) that the code-centric versus data-centric decision impacts everybody. The rule "Minimize Code, Maximize Data" has positive impacts and the coding process, the debugging process, the maintenance process, and the user experience. Since that covers all parties concerned with software development, it is safe to conclude that this is a crucial design concept.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Other philosophy essays are: