This is the Database Programmer blog, for anybody who wants
practical advice on database use.
There are links to other essays at the bottom of this post.
This blog has two tables of contents, the
Topical Table of Contents and the list
of
Database Skills.
Today we are going to look at performance, direct SQL
coding,
and then begin with the basics of the SQL SELECT command.
Disk Activity Determines Performance
Before we even look at the SQL SELECT command, we must
know what motivates the experienced database programmer to
pick certain kinds of queries and avoid others. The first
motivator is of course performance. We all want our
programs to go fast. If you want a fast database program,
then you have to think about the hard disk.
The slowest device in a computer is the disk drive.
The price of
disk reads is so much higher than in-memory operations that
there is nothing to gain by optimizing code unless disk
reads are optimized first. I often tell my programmers
to consider in-memory operations to be "free" when coding,
and to concentrate all optimization efforts on
reducing disk reads. This approach may
gloss over some important truths, but it is an excellent starting
point for database beginners.
Optimizing disk reads comes down to writing efficient
queries on top of well-designed tables. With that being
said, we have one more short note to cover before going
into the syntax of the queries.
A Quick Note on Inline SQL
Many authors and framework programmers discourage the use
of SQL SELECT statements directly in application code.
This essay contains no opinions on that question.
However, it is important to know
that the SELECT statement must be coded somehow and sent to the
server, whether you code it manually or some framework tool
generates it for you. This essay is all about how to code
(or generate) that SELECT statement. Knowing how it all works
is a requirement if you code your own framework or if you
suspect your chosen framework may not be your friend in all
cases.
Introducing SQL Select
The simplest query is four words (or symbols) long. If
your database has a table of countries then here is a very
simple query that will work:
SELECT * FROM COUNTRIES
This query will return all rows and columns from a table.
Depending on what language you code in, these may come back
in an associative array, an array of generic objects (or similar),
or some other special-purpose object like a ResultSet.
You have probably been told not to use the "*" in a SELECT,
for performance reasons. This is usually good advice. In the
simplest case, you save bandwidth by only retrieving the
columns that are of interest to you. If your table contains
long varchar or text (aka clob) columns there are even more
reasons to avoid "SELECT *". When you have long varchar and
text columns, they may be stored outside of the main storage
for the table, causing the server to look in two places to
retrieve each row. Therefore, avoiding "SELECT *" and always
specifying just the columns you need reduces disk reads on the
server and reduces bandwidth delivering the results.
But as this is an
introductory essay it is important to know how to retrieve
a complete table, so I have used the "SELECT *" here.
Filtering Results with WHERE
The WHERE clause limits which rows from the base table
will go into the query results. You specify a WHERE clause
as one or more boolean conditional expressions. Multiple
expressions can be separated by AND and OR, using parentheses
to group expressions. You can review your product's documentation
to see all of the comparisons and functions that are available.
A moderate WHERE clause might look like this:
SELECT country,name
FROM COUNTRIES
WHERE country like 'A%'
AND ( name like 'D%'
OR name like 'E%'
)
AND continent = 'Africa'
Filtering and Performance
The primary purpose of a WHERE clause is to obtain the
correct result. However, it is also a very important
performance tool. Here is why.
If you are completely and totally new to database programming,
you may get the idea that you will skip the WHERE clause
and do your filtering in the application. This may seem like
a good idea because you save the trouble of learning two languages.
Instead of learning SQL plus your application language, you
can concentrate on just your application language. And so you
make a reasonable decision to use as little SQL as possible and
just do everything in application code.
The drawback to this perfectly reasonable suggestion is that
it violates our first performance concept, it creates a huge
disk read burden. If you need five rows out of 50,000, then
filtering in the application requires the database server
to read all 50,000 rows off the disk. On top of that, these
have to be delivered to your application for processing.
Making use of the WHERE clause means that only 5 rows are
read off the disk (this assumes the presence of an index
which will be explained in a later essay).
In this particular example, using a
WHERE clause will perform 1000 times faster than not using
it. Of course this is only a single very vague example,
but since a database application is composed largely of queries,
it is definitely a good idea to have all of these queries start
out on solid ground.
Foreign Keys and JOIN
Next week we are going to look at JOINs in much detail,
but I want to mention them here briefly. The JOIN clause
lets you return results from more than one table, and
the JOIN determines how the rows from multiple tables
will be matched to each other.
For this week I will say only that good queries will
almost always use foreign keys as the basis of their
JOINs. We have seen in these essays more than once that the
foreign key is the fundamental and only way to
connect information in separate tables. Naturally, therefore,
the foreign key will loom large in our discussion of
JOIN, since JOIN
controls the combined retrieval of information from separate
tables.
Sorting Query Results
You can sort query results by including an ORDER BY clause
in the query. Simply name the columns:
SELECT customer,order,date
FROM ORDERS
WHERE date >= '2008-03-01'
ORDER BY date,customer
Some database servers let you put an ASC or DESC in front of
individual columns, while other servers can only apply a
DESC or ASC term to the entire sort operation.
Overall application performance comes into play with ORDER BY
clauses. It is almost universally true that you can sort faster
on the database server than you can in your code. You want
to make sure that your manual queries contain ORDER BY clauses,
and that your framework is generating them. You do not want to
be sorting in application code in most cases.
Order of Terms Matters
You have to put the various clauses into the right
order or they will not work. The order is:
SELECT ....
FROM ....
JOIN .... ON ....
WHERE ....
ORDER BY...
Conclusions
This week we began to examine queries, by looking at the
very basics of the SQL SELECT query syntax. Not surprisingly,
performance issues came up for every single part of the
query, from the column list to the ORDER BY.
For performance, we looked at the basic idea that disk reads
determine performance, which we will see more of in later
weeks. I also mentioned that table design determines query
efficiency, but we have not gotten very deep into that yet.
Related Essays
This blog has two tables of contents, the
Topical Table of Contents and the list
of
Database Skills.
Other essays relating to SQL SELECT are:
- >The Basics of SELECT (this essay)
- >GROUP BY, HAVING, SUM, AVG and COUNT
- >JOINs Part 1 (Basics)
- >JOINs Part 2 (Variations)
- >UNION