There are links to related essays on normalization and denormalization at the bottom of this post.
This blog has two tables of contents, the
Topical Table of Contents and the list
of
Database Skills.
Non-normalized, Normalized and Denormalized
A nonnormalized database is a disorganized one, where nobody
has bothered to work out where the facts should be stored. It is like
a stack of paper files that has been tossed down the stairs. We
are not interested in non-normalized databases.
A normalized database has been organized so that each fact is
stored in exactly one place (2nf and greater) and no more than one fact
is stored in each place (1nf). In a normalized database there is
a place for everything and everything is in its place.
A denormalized database is a normalized database that has
had redundancies deliberately re-introduced for some practical
gain.
Most denormalizing means adding columns to tables that
provide values you would otherwise have to calculate
as needed. Values are copied from table to table, calculations
are made within a row, and totals, averages and other aggregrations
are made between child and parent tables.
Related Essays
If you are a first-time reader of this blog, I recommend taking
a look at "http://database-programmer.blogspot.com/2008/01/database-skills-third-normal-form-and.html"
>Third Normal Form and Calculated Values and "http://database-programmer.blogspot.com/2008/04/denormalization-patterns.html"
>Denormalization Patterns, which cover issues related to
today's post.
The Practical Problems Of Normalization
There are four practical problems with a fully normalized
database, three of which I have listed before. I will list
them all here for completeness:
- No calculated values. Calculated values are a fact
of life for all applications, but a normalized database lacks
them. The burden of providing calculated values must be taken
up by somebody somehow. Denormalization is one approach to
this, though there are others.
- Non-reproducible Calculations. If you do not store
calculated values in your database, your application must
generate them on the fly as needed. If your application changes
over time, you risk not being able to reproduce prior results
when the business rules drift far enough from the original.
- Join Jungles. When each fact is stored in exactly
one place, you may find it daunting to pull together everything
needed for a certain query. A query joining 4,5, 7 or even 12
tables may be required for something the end-user considers
trivial and easy. Such queries are hard to code, hard to debug,
and dangerous to alter.
- Performance. When you face a JOIN jungle you almost
always face performance problems. A JOIN is a very expensive
operation compared to a single-table read, and the more JOINs you
have the worse it gets.
The Convenience Argument
The convenience argument addresses the first problem listed
above, no calculated values. When calculated values are
generated and added to tables, it is far easier for downstream
programmers (including members of the customer's IT department)
to generate their own reports and ad-hoc queries. It is also
much easier for members of the original team to generate
display pages and reports.
This convenience is not a result of the simple presence of
the calculated values. The convenience stems from the fact
that the downstream programmers do not have to get involved
in code that generates or calculates the values. They do not
have to know anything about the API, the language the app
was written in, or anything else, they just have to pull the
data they need.
This convenience goes beyond the programmers to semi-technical
users who may want to use their favorite 3rd party reporting
tool (like Crystal Reports) to query the database. If your
application API will not work with their favorite tool
(or if you don't have an API), then you have a dissappointed
customer. But if the data is right there in tables they
can pretty much use anything.
At this point you may be saying, sure, that's fine, but views
get all of this done without denormalizing. That is true,
but when we go on to the next 3 arguments we will see something
of why denormalizing often wins out over views.
The Stability Argument
Every healthy computer program changes and grows as new users
and customers make use of it. During this process it is
inevitable that later customers will request significant changes
to very basic functions that were coded early on and are
considered stable. When this happens the programmers have
the daunting task of providing the original functionality
unchanged for established customers, while providing the new
functionality for the newer customers.
Denormalizing can help here. When derived values are calculated
during write operations and put directly into the database, they
can basically stay there forever unchanged. When a significant
new version brings newer code to older users, there
is no need to fear that that an invoice printed last week will
suddenly come out with different numbers.
There still remains of course the fact that a bug in this whole
effort means future calculations are wrong, and the worst case
is when a bug gets
out to production and generates bad calculated values.
When this happens you
face the prospect of fixing bad data on a live system. This is
definitely my least favorite thing to do.
The Simple Queries Argument
The third problem listed above is JOIN jungles: queries that
involve so many JOINs that they become impractical to write,
difficult to debug, and dangerous to change.
When you denormalize a database by copying values around
between parent and child tables, you reduce the number of
JOINs that are required. Very obvious examples include things
like copying an items price onto an order_lines table when
a customer puts an item in their cart. Each time you copy
a fact from one table to another, you eliminate the need for
a JOIN between those two tables. Each eliminated JOIN is
a simpler query that is easier to get right the first time,
easier to debug, and easier to keep correct when changed.
This argument also goes directly back to the convenience
argument. If that huge customer you just landed is happy to
hear that they can use Crystal Reports to generate reports,
you may still face disappointment when they find the reports
involve "too many tables" from their perspective for reports
that "ought to be simple".
The Performance Argument
The final argument proceeds from our fourth problem listed
above. Normalized databases require a lot more JOINs than
denormalized databases, and JOINs are very expensive.
This means that, overall, any operation that reads
and presents data will be more expensive in a normalized
database than a denormalized one.
Once we reduce the JOINs by copying data between tables,
we end up improving performance because we need fewer JOINs
to retrieve the same number of facts.
Denormalization is not the only way to get the convenience
of copied values and calculated values. Views and materialized
views are the most often mentioned alternatives. The
choice between denormalizing and using views has a lot to do
with the "http://database-programmer.blogspot.com/2008/07/database-performance-pay-me-now-or-pay.html"
>Pay Me Now or Pay Me Later decision.
Conclusion
Denormalization is not an absolute: it is not one of those things
that all wise experienced programmers always do, and it is not
something that only fools ignore. The four arguments listed here
have guided me well in deciding when to denormalize (and when not to),
and I hope that they are of some benefit to you when you face the
same decisions.
Related Essays
This blog has two tables of contents, the
Topical Table of Contents and the list
of
Database Skills.
The normalization essays on this blog are:
- >Revisiting Normalization and Denormalization.
- Pay Me Now Or Pay Me Later
- The Argument for Normalization
- First Normal Form
- Second Normal Form
- Third Normal Form and Calculated Values
- The Argument for Denormalization (this essay)
- Denormalization Patterns
- Keeping Denormalized Values Correct
- Triggers, Encapsulation and Composition
- The Data Dictionary and Calculations, Part 1
- The Data Dictionary and Calculations, Part 2
Update Contact :
No Wa/Telepon (puat) : 085267792168
No Wa/Telepon (fajar) : 085369237896
Email : Fajarudinsidik@gmail.com
No Wa/Telepon (puat) : 085267792168
No Wa/Telepon (fajar) : 085369237896
Email: Fajarudinsidik@gmail.com
atau Kirimkan Private messanger melalui email dengan klik tombol order dibawah ini :