
Database Performance 1: Huge Inserts




The modern database server offers a wealth of features that
provide robust and reliable storage. Understanding
how these features work is vital if you want fast performance
for your databases. This week we begin a series on performance
by looking at "ACID" compliance and how it affects our handling
of large operations.




Welcome to the Database Programmer blog. This blog is for anybody who wants to see practical examples of how databases work and how to create lean and efficient database applications. There is a Complete Table Of Contents
(http://database-programmer.blogspot.com/2007/12/database-skills-complete-contents.html)
that is updated each week, and a Master list of table design patterns
(http://database-programmer.blogspot.com/2008/01/table-design-patterns.html)
that is updated whenever a new design pattern is presented.



What is ACID Compliance?



The modern database provides a set of features known as
ACID compliance that makes the database very robust.
To paraphrase the Wikipedia article,
ACID compliance means that:



  • Each transaction is Atomic. It is completed in its entirety
    or not at all.
  • The database is always Consistent. No user ever sees the
    intermediate and possibly invalid state of the database while your
    transaction is in progress.
  • Each transaction is Isolated. Your changes do not get mixed
    up with other people's changes, even though they are executing at the
    same time (see more in the Wikipedia article on Serializability,
    http://en.wikipedia.org/wiki/Serializability).
  • The transaction is Durable. Once the database says the job
    is complete without errors, you are assured the database has checked
    all constraints and keys and the transaction is completely valid. In most
    cases we also take this to mean the data is safely on disk and pulling
    the plug will not corrupt it.
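
To make the Atomic property concrete, here is a minimal sketch in
PostgreSQL syntax, using a hypothetical accounts table. Either both
UPDATE statements become permanent, or neither does:

-- Everything between BEGIN and COMMIT is one atomic transaction
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
-- If anything fails before the COMMIT, issuing ROLLBACK instead
-- leaves the database exactly as it was, and no other user ever
-- sees the half-finished transfer.
COMMIT;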


Maintaining ACID compliance is expensive for the database server. It must
in effect keep two versions of every row in play, and it must do so while
multiple users have multiple transactions running at the same time, even
while other users may be trying to read the rows that are being
affected.
This cost is considered more than acceptable when the reliability
requirement is high. But there is one case where the inevitable
consequence of ACID compliance is to destroy performance, and this is
on large UPDATES and INSERTS. Today we are going to look particularly
at large INSERT operations.



Consider the case where you are creating a new system that must load
1 million rows into a single table from an older system.
You go to your server's manual and
find the command to do so (for PostgreSQL it is COPY..., for SQL Server
it is BULK INSERT...). You work out the painful details of the command
with a test file of 10 rows and get the syntax down. Then you issue the
command with the real file of 1 million rows. After a minute or two you
realize, well, 1 million rows is a lot, time to get some coffee. Returning
with the coffee you see that it is still not finished, so time to check
some email. An hour later it is still not done, and when you leave it
running overnight and come back in the morning your machine is frozen.
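
For reference, the two bulk-load commands look roughly like this. The
table, column, and file names here are hypothetical and the exact
options vary by version, so check your server's manual:

-- PostgreSQL: bulk load a delimited file into a table
COPY mytable (column1, column2, column3)
    FROM '/tmp/mytable.asc' DELIMITERS ',' CSV;

-- SQL Server: the rough equivalent
BULK INSERT mytable
    FROM 'C:\load\mytable.asc'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');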



The problem here is simply one of numbers. One million rows of average
length of 100 characters (in ASCII or UTF-8) will be about 100 megabytes.
The server must do no less than maintain two completely separate states
for the database -- one state without your rows and one with your rows.
The cost of this is several times the actual size of the input data,
so the 1 million rows in this example will take several hundred
megabytes of resources, at least!
The server will be managing this process
both on disk and in RAM. This will simply die on a laptop or development
workstation, even one with a gig or two of RAM.
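
To put rough numbers on the example above:

    1,000,000 rows x 100 bytes         = about 100 MB of raw data
    x 2 complete states of the table   = at least 200 MB
    + indexes, transaction log, and
      other bookkeeping                = several hundred MB, at least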



You could go out and buy more RAM, but the purpose of this essay is
to explain how to deal with those inevitable cases where the operation
you are performing requires more resources than you have. This situation
will always come up, so it is good to know how to deal with it.



Step 1: Drop Indexes and Keys



ACID compliance extends to indexes as well. When you INSERT many thousands
or millions of rows to a single table in one shot, the server must maintain
two separate versions of each index. This burden is laid on top of
the burden of calculating the index keys for every single row one-by-one.
If we began with a burden of several hundred megabytes of resources,
just a few indexes on your table could end up more than doubling that.



This is why you will see advice on mailing lists and forums to drop
indexes before doing large insert operations.



Your table will have one index automatically for the primary key, so you
must drop the primary key. You will also want to drop foreign keys so
that you do not waste time checking them (they also have indexes).
Unique constraints also end up creating
indexes automatically behind the scenes, so you must drop those.
Other constraints must also be dropped to prevent having them checked
for every single one of your million rows.
Finally, you must drop all
indexes you created yourself. After the load is complete, you must
recreate these keys, indexes, and constraints.
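
In PostgreSQL terms the sequence looks something like the sketch below;
all of the table, column, and constraint names are hypothetical:

-- Before the load: drop constraints and indexes so the server does
-- not have to maintain them row-by-row during the bulk insert
ALTER TABLE mytable DROP CONSTRAINT mytable_customer_fkey; -- foreign key
ALTER TABLE mytable DROP CONSTRAINT mytable_code_key;      -- unique constraint
ALTER TABLE mytable DROP CONSTRAINT mytable_pkey;          -- primary key
DROP INDEX mytable_trans_date_idx;                         -- ordinary index

-- ...run the big load here...

-- After the load: recreate everything, so that each index is built
-- once over the whole table instead of being updated row-by-row
ALTER TABLE mytable ADD PRIMARY KEY (id);
ALTER TABLE mytable ADD CONSTRAINT mytable_code_key UNIQUE (code);
ALTER TABLE mytable ADD CONSTRAINT mytable_customer_fkey
    FOREIGN KEY (customer_id) REFERENCES customers (id);
CREATE INDEX mytable_trans_date_idx ON mytable (trans_date);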



In some cases, where your load is small enough, this may be enough to
get predictable load times, and you can stop here. But the larger the
operation, the more likely that this will not be enough. In those cases,
there is another step to take.



Step 2: Chunks



The basic problem described above is that database performance has
gone non-linear. When you double the number of rows, it does not
take twice as long, but four times (or 3 or 10 or whatever). When you
multiply the rows by 10, it may not take 10 times as long; you might see
it take 100 times as long, or more! (Or maybe you just killed it when
you came back in the morning and found your workstation frozen.)



We can restore linear performance if we break the input into chunks
and load them one at a time in succession. You break the input into files
that are small enough that no individual file will send the
server into non-linear hell. If we find that a
chunk of 5000 rows loads in 4 seconds, and we have 2000 of these files,
we now have a predictable load time: 2000 chunks at 4 seconds each is
8000 seconds, or a little over two hours. We have restored linear
performance because we know that twice as many rows will take twice
as long.



I am currently working on a system where I must periodically
load 3 tables of about 3 million rows each
from an old desktop Visual FoxPro system into a Postgres
database. The chunking code on the output looks something like this:




* This is FoxPro code, which is vaguely like BASIC...
mCount = 0
mIncrement = 5000                && hardcoded chunk size
mRN = RECCOUNT(p_tableName)      && Fox's function to get the row count

* These three commands select the table, turn off all indexes,
* and put the record pointer at the top of the table
SELECT (p_tableName)
SET ORDER TO 0
LOCATE

FOR rnStart = 1 TO mRN STEP mIncrement
    mCount = mCount + 1

    * Each pass outputs the next 5000 rows and leaves the
    * record pointer ready for the next pass.
    * FoxPro uses a semicolon to continue a command onto the next line.
    COPY TO (m_dir+p_tableName+"_"+PADL(mCount,6,'0')+".asc") ;
        FIELDS column1, column2, column3 ;
        WHILE RECNO() < (rnStart + mIncrement) DELIMITED

    * This is FoxPro's 'echo' command, a question mark
    ? p_tableName + " copied " + STR(_TALLY) + " records"
ENDFOR


Then on the receiving end I need a program that reads the chunks and
loads them in. The relevant portion of the code is here (the example
is in PHP and loads to PostgreSQL):




# Assume variable $fcnt holds the number of chunks to load
# and that variables like $tabname, $ddir, $afile and $collist hold
# the table name, directory location, file prefix and column list.
for ($c = 1; $c <= $fcnt; $c++) {
    $insert = "_".str_pad($c, 6, '0', STR_PAD_LEFT);
    LogEntry("  loading file # "
        .str_pad($c,    6, ' ', STR_PAD_LEFT)
        .' of '.str_pad($fcnt, 6, ' ', STR_PAD_LEFT)
    );
    $cmd = "COPY $tabname ($collist) "
          ." FROM '$ddir$afile$insert.asc' DELIMITERS ',' CSV QUOTE '\"'";
    # This is my framework's super-simple direct SQL command
    SQL($cmd);
}


Conclusion: Chunks Restore Linear Performance



Database programmers depend heavily on "ACID" features to provide
robust data storage. We depend upon these features so much that we will
not consider using systems that cannot provide them (MySQL's MyISAM engine,
for instance). The performance cost of these features is considered
part of the bargain when robustness is required, but when
you are doing a huge insert, the ACID features cause load times to
go "non-linear" and become unpredictably long. As a first step you can
drop indexes, keys, and constraints on a table to improve load times,
but if that is not enough, you can restore linear performance by breaking
the large operation into many "chunks", each of which is small enough to
stay linear.



"http://database-programmer.blogspot.com/2008/06/database-performance-web-layer.html"
>Next Essay: Performance in the Web Layer

