This week we will review the practical arguments
in favor of normalization. The major concern as always on this
blog is to examine database decisions in light of how they affect
the overall application. The major argument for normalization is
very simple: you end up coding less, coding easier, and coding
stronger, and you end up with fewer data errors.
There are links to related essays on normalization and denormalization at the bottom of this post.
This blog has two tables of contents, the
Topical Table of Contents and the list
of
Database Skills.
Informal Description of Normalization
When I find a programmer who is stuck trying to grasp the concepts
of normalization, the problem usually comes down to not being able
to see the big picture. The programmer may read any number of
rigorous papers on the subject (the "http://en.wikipedia.org/wiki/Database_normalization"
>Wikipedia article is a good place to start) but still be unable
to get the basic point. This leaves the programmer stumbling through
table design, second-guessing himself, and then running through a
frustrating sequence of redesigns.
In the worst case it leads him to conclude normalization
may not be worth the effort, at which point he starts writing really
crappy applications.
The goal of normalization in simple terms is just this: to store each
fact in exactly one place. When you put each fact in only one place, you
always know where to go to read it or write it. When facts are
repeated in the database, the application programmer has an increased
burden to make sure they are all consistent.
If he fails to shoulder this
burden completely, the database will have inconsistent values for the
same facts, leading to emergency phone calls and emails requesting
help. These request for help always come at 4:30pm as you are
getting ready for a date or an extended vacation.
Once the programmer understands this very simple goal, he usually
has an "aha!" moment reading the various articles on normalization.
Each rule for normalizing suddenly makes more sense, as it appears
as just another way to make sure that there is a place for
everything and everything is in its place.
The Programmer's Burden
When a programmer is dealing with a non-normalized database, he is
going to run into 4 problems over and over again. Three of these
are called "anomalies" and the fourth is "inconsistency" (a
fancy way of saying the database holds bad data). Most authors
who write on normalization take it for granted that the reader
can readily see why the anomalies are bad, but I would like to
spell it out here to make it crystal clear. First we will look
at the three kinds of anomalies, then we will go to the inconsistency
problem, and see exactly how they affect the programmer.
Consider a fledgling programmer who has ready too many Web 2.0
blogs saying that relational databases are bad, and so has not
bothered to learn anything about them. In the name of "simplicity",
he creates a single table that lists employees, their email addresses,
the customers they are assigned to, and the primary
email address of each customer. This will lead to three kinds of
anomaly, each of which leads to inconsistency.
An Update Anomaly occurs when a fact is stored in
multiple locations and a user is able to change one without changing
them all. If a user goes to this employee-customer table and
changes an employee's email on only one row, and no provision is
made to change the others, then the database now has inconsistent
values for the employee's email.
possible to record a fact. If an employee is hired but not yet
assigned to any customers, it is not possible to store
his email address!
fact and clobbers some other fact along the way. If an employee
goes on leave, so that we must remove (delete) their assignments,
then we have lost their email address!
This non-normalized database requires the programmer to write
additional application code to try to intercept and correct these
issues. This is the Programmer's Burden in a
non-normalized situation, and it gets worse and worse as the program
expands.
The Programmer's Burden also emerges as a continuing stream of
complaints from users that "the program is wrong." For every case
where the programmer fails to provide exception-handling code, a
user will stumble across inconsistent data. The customer says,
"it says 'X' on this screen but it says 'Y' on that screen,
what's going on?" As far as they are concerned it is a bug
(which of course it is) and must be fixed. You can't make money
coding new features when you are fixing garbage like that.
The Basic Argument
So the basic argument for normalization is:
we wish to avoid the Programmer's Burden as completely as possible.
We want to spend our time on cool features, not going back over and
over to fix features we thought were finished already.
Special Comment on First Normal Form
First normal form is different from the others. When a database
designer violates the higher normal forms, the result is that
a fact is recorded in more than one place. However, when you
violate first normal form it results in more than one fact in
the same place.
A basic example would be the same table of employees and customers,
where we "solve" the problems listed
above by storing only one row
for each employee, with a comma-separated list of accounts, like so:
EMPLOYEE EMAIL CUSTOMERS
------------------------------------------------------------
ARANDOLPH art@praxis.com 100, 523, 638, 724
SRUSSELL sax@overlook.edu 516, 123, 158
PBOYLE phyllis@sp-elevataor.com 713, 928, 212
The above scheme increases the Programmer's Burden because now
he must decompose the data that comes from the server.
In technical terms we say that the value CUSTOMERS is
non-atomic, it is not a single fact. Every piece of
code that touches that table must break down the list of
customers and sometimes reassemble it.
To see this, consider the basic task of adding a customer for
employee Art Randolph. If the tables were set up properly,
you would insert into a cross-reference of employees and
customers, and duplicates would be trapped by a primary key.
But here you must retrieve the list of existing customers,
split it up in application code, and check that the value is
not repeated. Then you have collapse the list back down and
send it up to the server.
All I can say is, no thanks.
By The Way, What Is The Right Way?
Now that we have beat up our fledgeling programmer's lousy
employee-customer table, it would be worthwhile to spell
out how to do it correctly.
First off, we always need one table for each kind of thing
we are keeping track of. That means we will have a table of
employees and a table customers. This solves all of the
anomalies and inconsistencies listed above because we
put facts about employees in the employees table (like email
address) and facts about customers in the customers table.
This leaves the issue of linking employees to customers.
There are three ways to do it:
- If each customer gets a team of employees assigned to
them, but an employee only ever works for one customer, then
put a "http://database-programmer.blogspot.com/2007/12/database-skills-foreign-keys-this-is.htmlhttp://database-programmer.blogspot.com/2007/12/database-skills-foreign-keys-this-is.html"
>Foreign Key on the employees table that links to the
customers table.
- If each employee works on more than one customer, but
each customer gets only one employee, then put a
foreign key on the customers table that links back to
employees.
- If an employee can work for more than one customer and
vice-versa, make a "http://database-programmer.blogspot.com/2008/01/database-skills-sane-approach-to.html#rule5"
>Cross-reference between customers and employees.
Conclusion
This week we have seen a fairly simple argument for normalization,
and one that regular readers of this blog have seen before:
normalization eliminates unnecessary coding burdens. It is hard
enough to get software projects done on time and on budget without
imposing additional labor that could be avoided entirely by
normalizing.
I do not mean to imply that normalizing takes no time
or is instantly easier than a fear-based retreat into coding
your way out of things. It does take time to learn to normalize
and it does take time to learn to code an application around
normalized tables. In my own experience I passed through the
various erroneous mindsets that I make fun of in this blog, and
each time I put effort into learning the "right way" then every
effort I made after that was forever easier, had fewer bugs, and
made my customers more happy. So I am not saying it is free, but
I am saying it is one of the best bargains in town.
Related Essays
This blog has two tables of contents, the
Topical Table of Contents and the list
of
Database Skills.
The normalization essays on this blog are:
- >Revisiting Normalization and Denormalization.
- Pay Me Now Or Pay Me Later
- The Argument for Normalization (this essay)
- First Normal Form
- Second Normal Form
- Third Normal Form and Calculated Values
- The Argument for Denormalization
- Denormalization Patterns
- Keeping Denormalized Values Correct
- Triggers, Encapsulation and Composition
- The Data Dictionary and Calculations, Part 1
- The Data Dictionary and Calculations, Part 2
Update Contact :
No Wa/Telepon (puat) : 085267792168
No Wa/Telepon (fajar) : 085369237896
Email : Fajarudinsidik@gmail.com
No Wa/Telepon (puat) : 085267792168
No Wa/Telepon (fajar) : 085369237896
Email: Fajarudinsidik@gmail.com
atau Kirimkan Private messanger melalui email dengan klik tombol order dibawah ini :