Critical performance problems on large databases 
 Critical performance problems on large databases

Quote:

> PS: we are seriously looking into using pgsql as the core
> of a BIG medical record system, but we already know that
> if we can't get quick online responses (< 2 s) on
> large result sets (10000 records) at least at the first
> page (~ 100 records) we are in trouble.

There are a few tricks to getting fast results for pages of data in
large tables.  I have an application in which we have a scrolling window
displaying data from a million-row table, and I have been able to make
it fairly interactively responsive (enough that it's not a problem).

We grab pages of a few screenfuls of data at a time using LIMIT /
OFFSET, enough to scroll smoothly over a short range.  For LIMIT /
OFFSET queries to be fast, I found it was necessary to CREATE INDEX,
CLUSTER and ORDER BY the key field.
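
Roughly, the setup looks like this (table and column names made up here,
and the CLUSTER syntax is the 7.x form, if I remember it right):

    CREATE INDEX bigtable_key_idx ON bigtable (keyfield);
    CLUSTER bigtable_key_idx ON bigtable;  -- physically order the heap by the index

    -- then each "page" is fetched in key order:
    SELECT * FROM bigtable ORDER BY keyfield LIMIT 300 OFFSET 1200;

The CLUSTER matters because consecutive OFFSETs then read mostly
consecutive heap pages.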

Then the biggest slowdown is count(*), which we have to do in order to
fake up the scrollbar (so we know what proportion of the data has been
scrolled through).  I have not completely fixed this yet.  I want to
keep a separate mini-table of how many records are in the big table and
update it with a trigger (the table is mostly static).  ATM, I just try
hard to minimize the times I call count(*).
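
Something like this is what I have in mind (an untested sketch with
made-up names; pre-7.3 trigger functions are declared RETURNS OPAQUE,
and plpgsql has to be installed with createlang):

    CREATE TABLE bigtable_rowcount (n bigint);
    INSERT INTO bigtable_rowcount SELECT count(*) FROM bigtable;

    CREATE FUNCTION bigtable_rowcount_trig() RETURNS OPAQUE AS '
    BEGIN
        IF TG_OP = ''INSERT'' THEN
            UPDATE bigtable_rowcount SET n = n + 1;
        ELSE
            UPDATE bigtable_rowcount SET n = n - 1;
        END IF;
        RETURN NULL;  -- return value is ignored for an AFTER trigger
    END;
    ' LANGUAGE 'plpgsql';

    CREATE TRIGGER bigtable_rowcount_trig AFTER INSERT OR DELETE ON bigtable
        FOR EACH ROW EXECUTE PROCEDURE bigtable_rowcount_trig();

The single-row count table would be a contention point on a busy table,
but since ours is mostly static that should not hurt.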

b.g.




Mon, 27 Sep 2004 20:04:20 GMT
 Critical performance problems on large databases

Quote:

> SELECT * FROM Bigtable;

> it takes a long time for it to come up with the first page
> of results using up lots of computer resources etc and after
> the first page is returned the backend basically goes into
> idle mode for all the rest of the query results retrieval.

Very simple.  PHP and Perl DBI, when doing non-cursor queries, will try
and buffer the *entire* result into memory before letting you have it.
In PHP, when you run a pg_exec, or mysql_query, or whatever, it actually
executes the query, and stores all of the results in memory.  Don't
believe me?  Try closing the database connection, and then do a fetch on
the result identifier.  Heck, fetch until you reach the end.

You'll get all of your results.  Even though the database is closed.

So when you are doing a "SELECT * FROM bigtable", you're telling PHP to
not only buffer over a million rows, but to transfer it from the
database as well.  "SELECT COUNT(*) FROM bigtable" doesn't have
this overhead.  The database just returns a single row, which you
can view basically as soon as the database is done.

There's also something to be said for the LIMIT clause.  Somehow I
doubt you need every single row in the entire table in order to do
your work on that page.  Just adding LIMIT/OFFSET will increase your
speed significantly.

Or, as someone else mentioned, use a cursor, and let the database buffer
your query.  It'll be slow too, but you can at least fetch the rows
individually and get the perception of speed in your application.
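
Something along these lines, with made-up names:

    BEGIN;
    DECLARE bigcur CURSOR FOR SELECT * FROM bigtable;
    FETCH 100 FROM bigcur;   -- first screenful
    FETCH 100 FROM bigcur;   -- next one, only when the user asks for it
    CLOSE bigcur;
    COMMIT;

For a plain seq scan like this the backend produces rows as you FETCH
them, so the first screenful comes back right away.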

Quote:
> no work needed other than to retrieve the tuples out of
> physical storage, the response should be immediate and resource
> usage low. There should not be large buffer allocations.

I see how you can get into this kind of thinking.  Since select * has no
ordering, no calculations, nothing but just returning raw results from
the table in whatever format they may be in, how could it possibly be
resource intensive?  But the logic is wrong.  If you want all the data,
it'll give it to you.  Whether you do it at the console, and it has to
throw everything at the screen buffer, or your application, which has
to put it in some temporary storage until it gets what it needs and
deallocates the memory.  That data has to go *somewhere*, it can't just
exist in limbo until you decide you want to use some of it.

You can use cursors to get around this, because the database basically
runs the query and doesn't send back squat until you actually ask it to.
But I'll tell you a well selected limit clause will work almost as well,
and reduce the need for the database to maintain a cursor.

Say you're on page 5, and you are showing 20 results per page.  Knowing
that results start at 0, you can get the offset just by doing:

(PAGE - 1) * PER_PAGE = (5-1) * 20 = 4*20 = 80.  So your query becomes:

SELECT * FROM Bigtable LIMIT 20 OFFSET 80.

And voila.  You'll get back 20 rows right where you want them.  Also,
take care of what you select from the table.  Maybe you don't actually
need all of the data, but only certain columns.  The less data that has to
be transferred from database to application, the faster you can get/show
your data.  Especially if your database is on a separate machine from
your application server.  Network transfers are *not* instantaneous.

Try:

SELECT col1,col2,col3,etc FROM Bigtable LIMIT x OFFSET y

instead.  You'll save yourself some time, save the database the effort
of loading all million rows of every column in the table into the
network interface, and save your application the need to fetch them.

Quote:
> COUNT(smallcolumn) behaves much faster than COUNT(*).

Or you can do what every database optimization book in the entire
universe says, and do COUNT(1).  Since all you're testing is the
existence of the row, not any value of anything in it.

Quote:
> Again, count should be streamed as well such as to use no
> significant memory resources other than the counter.

Actually, it's converted into: "*, eh?  I think not.  Let me rewrite
this query and put a 1 here instead."

The database tries to keep people from shooting themselves in the foot,
but even the most Herculean effort will fail when the user is bound and
determined to blow off their big toe.

Quote:
> Any enlightenments? Am I wrong? Will this be fixed soon?
> Is it hard to change pgsql to do better streaming of its
> operations.

Read a few books on database design and optimization, the basics of
application memory allocation, and the TCP/IP stack.  If you want to
understand how to make an application fast from front to back, you have
to understand the components that make it work.  Your knowledge of
application memory performance and network latency seems inherently
flawed, and until you get over the assumption that network transfers are
free and optimizing queries is a fool's errand, you'll continue to have
problems in any database you choose.

--
+-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-+
| Shaun M. Thomas                INN Database Administrator           |
| Phone: (309) 743-0812          Fax  : (309) 743-0830                |

| Web  : www.townnews.com                                             |
|                                                                     |
|     "Most of our lives are about proving something, either to       |
|      ourselves or to someone else."                                 |
|                                           -- Anonymous              |
+-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-+




Mon, 27 Sep 2004 21:51:32 GMT
 Critical performance problems on large databases
Hi,

thanks everyone who responded. I want to set this whole COUNT(...)
issue aside for a moment because I have reported hearsay instead
of my own experiences. I apologize for muddying the waters with
that COUNT issue. But I still want to respond to the SELECT *
FROM Bigtable; issue once more.

[[[I have to say I am a bit confused by some of the responses, which
basically shrugged the problem off, sending the asker anywhere from
back to the schoolbooks to the mental facility. That's not
necessary.]]]

I am delighted to hear that one respondent has no problem with a
2M-row table and snappy response, but unfortunately he didn't say much
about the details: was that really a select * from bigtable; query,
or were there where clauses and such that limit the result set
considerably? A number of respondents confirmed my observations,
so I think the problem is real.

There was one remark about Perl or PHP always loading the complete
result set before returning. Bad for them. I don't use either and
I think it's just bad design to do that on the client but I don't
care about bad clients. I care about a good server.

The constructive responses suggested that I use LIMIT/OFFSET and
CURSORs. I can see how that could be a workaround for the problem, but
I still believe that something is wrong with the PostgreSQL query
executor. Loading the entire result set into a buffer without
need just makes no sense. Good database engines try to provide
for parallel execution of the query plan as much as possible, and
that implies streaming. There's a crowd of literature testifying
to its importance.

The main reasons for this are (a) the use on multi-processor machines
where one CPU does one task and the other does another task on the
same query plan and the results from CPU 1 are streamed to CPU 2 (hey,
I have a 6-processor machine in my basement.) Perhaps more importantly,
(b) buffering (without need) is inherently bad, because it wastes
memory resources, leads to bursty demand on CPU and network, and slows
perceived response times. Buffering is a complete waste if the buffer
is being paged out to disk again, and it isn't flexible or scalable
if buffer pages are fixed into physical memory.  Streaming is
especially important if you want to do distributed joins (and though
pgsql doesn't support that yet, it would be foolish to close your eyes
to a fundamental problem and then be forced to rework this in
a hurry when the time comes for distributed PostgreSQL.)

So, while my client application might benefit from such things as
cursors and OFFSET/LIMIT, the query planning and execution may
suffer from the buffering. And of course, the point is that it makes
sense to design the server such that streaming results to the
client is transparent because it automatically relieves the strain
on all resources, CPU, storage and network! Isn't that obvious?

regards
-Gunther

--

Medical Information Scientist      Regenstrief Institute for Health Care
Adjunct Assistant Professor        Indiana University School of Medicine
tel:1(317)630-7960                         http://aurora.regenstrief.org




Tue, 28 Sep 2004 00:13:30 GMT
 Critical performance problems on large databases

Quote:
> So, while my client application might benefit from such things as
> cursors and OFFSET/LIMIT, the query planning and execution may
> suffer from the buffering. And of course, the point is that it makes
> sense to design the server such that streaming results to the
> client is transparent because it automatically relieves the strain
> on all resources, CPU, storage and network! Isn't that obvious?

Well, to me, it seems that it IS obvious to anyone whose client wants/needs to use streaming...
On the other hand, there are clients that do not need it and just always want the entire result
set fetched right away... For those, I guess, "the obvious" thing would be to make buffering
transparent to the client :-)

And anyway, I am kinda missing the point - what's 'not transparent' about 'fetch 100 from mycursor'? :-)



Tue, 28 Sep 2004 02:02:10 GMT
 Critical performance problems on large databases

Quote:

> The constructive responses suggested that I use LIMIT/OFFSET and
> CURSORs. I can see how that could be a workaround the problem, but
> I still believe that something is wrong with the PostgreSQL query
> executer. Loading the entire result set into a buffer without
> need just makes no sense.

The Postgres backend does not do that.  Most of the frontend client-side
libraries do, but feel free to write one that does not.

Offhand I think the only really serious downside to letting the
application code process the result in a streaming fashion is that the
application would have to be prepared to undo whatever it's done so far,
if it gets an error report partway through the result set.  An example:
        SELECT 1/x FROM foo;
where foo.x contains zeroes here and there.  You'll get some rows out
before the query is abandoned with a divide-by-zero error.  By
accumulating the result set, the existing libraries are able to offer a
cleaner yes-it-succeeded or no-it-didn't API.  But this is surely not a
fatal objection, just a necessary piece of a less-clean streaming API.
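
To make that concrete with a cursor (the streaming interface we already
have; same foo.x as above, cursor name made up), a session might look like:

    BEGIN;
    DECLARE c CURSOR FOR SELECT 1/x FROM foo;
    FETCH 10 FROM c;    -- may return rows just fine
    FETCH 10 FROM c;    -- may then fail with a divide-by-zero error,
                        -- after the client has already consumed rows
    ROLLBACK;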

Quote:
> [snip]
> And of course, the point is that it makes
> sense to design the server such that streaming results to the
> client is transparent because it automatically relieves the strain
> on all resources, CPU, storage and network! Isn't that obvious?

The reason you got brush-off responses before was that you went into
lecture mode when you clearly hadn't spent any effort studying Postgres
internals.  You're still doing that...

                        regards, tom lane




Tue, 28 Sep 2004 01:51:44 GMT
 Critical performance problems on large databases

Quote:
> The constructive responses suggested that I use LIMIT/OFFSET and
> CURSORs. I can see how that could be a workaround for the problem, but
> I still believe that something is wrong with the PostgreSQL query
> executor. Loading the entire result set into a buffer without
> need just makes no sense. Good database engines try to provide
> for parallel execution of the query plan as much as possible, and
> that implies streaming. There's a crowd of literature testifying
> to its importance.

> [snip]

My limited understanding of the internals is that, except for steps
that need their whole input before they can produce output (a sort, say)
and the final step of sending everything to the client, each node
asks the nodes below it for rows as it needs them, and those nodes can
provide results on an as-needed basis without necessarily buffering
their result set.




Tue, 28 Sep 2004 01:52:03 GMT
 Critical performance problems on large databases
The big issue with LIMIT,OFFSET is that it still uses all rows
for sorting. I already suggested using partial sorting to avoid
sorting all rows if one selects only the first 20 rows, for example.
It's very important for Web applications because web users usually
read only the first 1-2 pages. Our experiments have shown a 6x performance
win when using partial sorting.
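
For example (made-up names), for a typical web query

    SELECT * FROM bigtable ORDER BY somecol LIMIT 20 OFFSET 80;

    -- without a usable index on somecol the executor has to sort the
    -- whole table before the first 20 rows can come out; a partial
    -- (top-N) sort would only need to keep the best 100 rows seen so far.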

        Oleg

Quote:

> There's also something to be said for the LIMIT clause.  Somehow I
> doubt you need every single row in the entire table in order to do
> your work on that page.  Just adding LIMIT/OFFSET will increase your
> speed significantly.
> [snip]

        Regards,
                Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)

phone: +007(095)939-16-83, +007(095)939-23-83




Tue, 28 Sep 2004 01:55:40 GMT
 Critical performance problems on large databases
I have had a similar issue when doing a select on a table of 10k or 15k
records. It takes a long time to get the data back. I am using JDBC.
If I take a subset of the table it returns very quickly. I figured that the
JDBC driver is buffering all the data. I guess I will have to try the cursor
or the LIMIT,OFFSET next as well.
Quote:
-----Original Message-----


Sent: Thursday, April 11, 2002 1:51 PM
To: Shaun Thomas

Subject: Re: [GENERAL] Critical performance problems on large databases

The big issue with LIMIT,OFFSET is that it still uses all rows
for sorting.  [snip]

        Oleg



Tue, 28 Sep 2004 02:23:05 GMT
 Critical performance problems on large databases

Quote:


> > The big issue with LIMIT,OFFSET is that it still uses all rows
> > for sorting. I already suggested using partial sorting to avoid
> > sorting all rows if one selects only the first 20 rows, for example.
> > It's very important for Web applications because web users usually
> > read only the first 1-2 pages. Our experiments have shown a 6x performance
> > win when using partial sorting.

> That is interesting. I haven't even dared to ask for this.
> Did you contribute that partial sorting code to postgresql?

A year ago we had a very crude patch which implemented partial sorting.
We had hoped to work more closely on it for 7.2/7.3, but it seems we have no time.
If you're ready to work on this issue we could think about
contributing our libpsort to postgres.

Quote:

> thanks,
> -Gunther

        Regards,
                Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)

phone: +007(095)939-16-83, +007(095)939-23-83




Tue, 28 Sep 2004 02:28:45 GMT
 Critical performance problems on large databases

Quote:


> > The big issue with LIMIT,OFFSET is that it still uses all rows
> > for sorting. I already suggested using partial sorting to avoid
> > sorting all rows if one selects only the first 20 rows, for example.
> > It's very important for Web applications because web users usually
> > read only the first 1-2 pages. Our experiments have shown a 6x performance
> > win when using partial sorting.

> That is interesting. I haven't even dared to ask for this.
> Did you contribute that partial sorting code to postgresql?

We have the TODO item, but the code is not written yet.

--
  Bruce Momjian                        |  http://candle.pha.pa.us

  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026




Tue, 28 Sep 2004 02:31:10 GMT
 Critical performance problems on large databases

Quote:


> > The big issue with LIMIT,OFFSET is that it still uses all rows
> > for sorting. I already suggested using partial sorting to avoid
> > sorting all rows if one selects only the first 20 rows, for example.
> > It's very important for Web applications because web users usually
> > read only the first 1-2 pages. Our experiments have shown a 6x performance
> > win when using partial sorting.

> We do have this TODO item:

>    * Allow ORDER BY ... LIMIT to select top values without sort or index
>      using a sequential scan for highest/lowest values

looks too complex to me :-) I like partial sorting, but it's not
important.

        Regards,
                Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)

phone: +007(095)939-16-83, +007(095)939-23-83




Tue, 28 Sep 2004 02:33:06 GMT
 Critical performance problems on large databases

Quote:



> > > The big issue with LIMIT,OFFSET is that it still uses all rows
> > > for sorting. I already suggested using partial sorting to avoid
> > > sorting all rows if one selects only the first 20 rows, for example.
> > > It's very important for Web applications because web users usually
> > > read only the first 1-2 pages. Our experiments have shown a 6x performance
> > > win when using partial sorting.

> > That is interesting. I haven't even dared to ask for this.
> > Did you contribute that partial sorting code to postgresql?

> We have the TODO item, but the code is not written yet.

Actually, we have libpsort for partial sorting and a very crude patch for 7.1.
Jan and Tom were not optimistic about that patch, so we need one bright
person who could incorporate the idea of partial sorting into the current sources
and satisfy the core developers. We have no time, unfortunately.

        Regards,
                Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)

phone: +007(095)939-16-83, +007(095)939-23-83




Tue, 28 Sep 2004 02:35:19 GMT
 Critical performance problems on large databases

Quote:

> The big issue with LIMIT,OFFSET is that it still uses all rows
> for sorting. I already suggested using partial sorting to avoid
> sorting all rows if one selects only the first 20 rows, for example.
> It's very important for Web applications because web users usually
> read only the first 1-2 pages. Our experiments have shown a 6x performance
> win when using partial sorting.

Could you explain a little more what you mean by this, please (perhaps, off the list, if you prefer)?
Is it some kind of heuristic sorting algorithm you are suggesting?
Or are you talking about just fetching the first few pages at random, and then sorting them?

The former sounds very interesting...
The latter, I am afraid, won't work however - for starters, it will break things like

select name, age from people order by age desc limit 1;

(select the name of the oldest person)

and also, the majority of web applications, while they do indeed browse only the first few pages,
still rely on those pages being 'most relevant' according to some criteria...

So, if you know of a way to improve sorting performance, without breaking those two things above,
I'd greatly appreciate you telling me a little more about it!

Thanks a lot!

Dima




Tue, 28 Sep 2004 04:26:35 GMT
 Critical performance problems on large databases

Quote:



>>>The big issue with LIMIT,OFFSET is that it still uses all rows
>>>for sorting. I already suggested using partial sorting to avoid
>>>sorting all rows if one selects only the first 20 rows, for example.
>>>It's very important for Web applications because web users usually
>>>read only the first 1-2 pages. Our experiments have shown a 6x performance
>>>win when using partial sorting.

>>We do have this TODO item:

>>        * Allow ORDER BY ... LIMIT to select top values without sort or index
>>          using a sequential scan for highest/lowest values

> looks too complex to me :-) I like partial sorting, but it's not
> important.

Oleg, I might agree. I might even take some of this on. But I
think a first step would be for you to put your libpsort and
the patch "out there" so that someone could just look and try
and see if taking the time to do what needs to be done is feasible
with this. If you can put up your lib and patch and documentation
(even crude notes) onto a web site, that would be a good start.

regards
-Gunther

--

Medical Information Scientist      Regenstrief Institute for Health Care
Adjunct Assistant Professor        Indiana University School of Medicine
tel:1(317)630-7960                         http://aurora.regenstrief.org




Tue, 28 Sep 2004 04:07:55 GMT
 Critical performance problems on large databases
Well,

I've got several responses.
I'm far from the office and have a slow and unstable connection, so I
will return to this topic next week.

        Oleg

Quote:




> >>>The big issue with LIMIT,OFFSET is that it still uses all rows
> >>>for sorting. I already suggested using partial sorting to avoid
> >>>sorting all rows if one selects only the first 20 rows, for example.
> >>>It's very important for Web applications because web users usually
> >>>read only the first 1-2 pages. Our experiments have shown a 6x performance
> >>>win when using partial sorting.

> >>We do have this TODO item:

> >>   * Allow ORDER BY ... LIMIT to select top values without sort or index
> >>     using a sequential scan for highest/lowest values

> > looks too complex to me :-) I like partial sorting, but it's not
> > important.

> Oleg, I might agree. I might even take some of this on. But I
> think a first step would be for you to put your libpsort and
> the patch "out there" so that someone could just look and try
> and see if taking the time to do what needs to be done is feasible
> with this. If you can put up your lib and patch and documentation
> (even crude notes) onto a web site, that would be a good start.

> regards
> -Gunther

        Regards,
                Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)

phone: +007(095)939-16-83, +007(095)939-23-83




Tue, 28 Sep 2004 13:43:01 GMT
 