Proposal of tunable fix for scalability of 8.4

129 messages

Re: 8.4 Performance improvements: was Re: Proposal of tunable fix for scalability of 8.4

Tom Lane-2
Gregory Stark <[hidden email]> writes:
> Tom Lane <[hidden email]> writes:
>> Ugh.  So apparently, we actually need to special-case Solaris to not
>> believe that posix_fadvise works, or we'll waste cycles uselessly
>> calling a do-nothing function.  Thanks, Sun.

> Do we? Or do we just document that setting effective_cache_size on Solaris
> won't help?

I assume you meant effective_io_concurrency.  We'd still need a special
case because the default is currently hard-wired at 1, not 0, if
configure thinks the function exists.  Also there's a posix_fadvise call
in xlog.c that that parameter doesn't control anyhow.

                        regards, tom lane

--
Sent via pgsql-performance mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: 8.4 Performance improvements: was Re: Proposal of tunable fix for scalability of 8.4

Robert Haas
On Fri, Mar 13, 2009 at 10:06 PM, Tom Lane <[hidden email]> wrote:

> Gregory Stark <[hidden email]> writes:
>> Tom Lane <[hidden email]> writes:
>>> Ugh.  So apparently, we actually need to special-case Solaris to not
>>> believe that posix_fadvise works, or we'll waste cycles uselessly
>>> calling a do-nothing function.  Thanks, Sun.
>
>> Do we? Or do we just document that setting effective_cache_size on Solaris
>> won't help?
>
> I assume you meant effective_io_concurrency.  We'd still need a special
> case because the default is currently hard-wired at 1, not 0, if
> configure thinks the function exists.  Also there's a posix_fadvise call
> in xlog.c that that parameter doesn't control anyhow.

I think 1 should mean no prefetching, rather than 0.  If the number of
concurrent I/O requests was 0, that would mean you couldn't perform
any I/O at all.

...Robert


Re: 8.4 Performance improvements: was Re: Proposal of tunable fix for scalability of 8.4

Gregory Stark-2
Robert Haas <[hidden email]> writes:

> On Fri, Mar 13, 2009 at 10:06 PM, Tom Lane <[hidden email]> wrote:
>
>> I assume you meant effective_io_concurrency.  We'd still need a special
>> case because the default is currently hard-wired at 1, not 0, if
>> configure thinks the function exists.  Also there's a posix_fadvise call
>> in xlog.c that that parameter doesn't control anyhow.
>
> I think 1 should mean no prefetching, rather than 0.  If the number of
> concurrent I/O requests was 0, that would mean you couldn't perform
> any I/O at all.

That is actually how I had intended it but apparently I messed it up at some
point such that later patches were doing some prefetching at 1 and there was
no way to disable it. When Tom reviewed it he corrected the inability to
disable prefetching by making 0 disable prefetching.

I didn't think it was worth raising as an issue, but I didn't realize we were
currently doing prefetching by default. Even on a system with posix_fadvise
there's nothing much to be gained unless the data is on a RAID device, so the
original objection holds anyway. We shouldn't do any prefetching unless the
user tells us to.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!


Re: Proposal of tunable fix for scalability of 8.4

david@lang.hm
In reply to this post by Kevin Grittner
On Fri, 13 Mar 2009, Kevin Grittner wrote:

> Tom Lane <[hidden email]> wrote:
>> Robert Haas <[hidden email]> writes:
>>> I think that changing the locking behavior is attacking the problem
>>> at the wrong level anyway.
>>
>> Right.  By the time a patch here could have any effect, you've
>> already lost the game --- having to deschedule and reschedule a
>> process is a large cost compared to the typical lock hold time for
>> most LWLocks.  So it would be better to look at how to avoid
>> blocking in the first place.
>
> That's what motivated my request for a profile of the "80 clients with
> zero wait" case.  If all data access is in RAM, why can't 80 processes
> keep 64 threads (on 8 processors) busy?  Does anybody else think
> that's an interesting question, or am I off in left field here?

I don't think that anyone is arguing that it's not interesting, but I also
think that complete dismissal of the existing test case is wrong.

Last night Tom documented some reasons why the prior test may have some
issues, but even with those I think the test shows that there is room for
improvement on the locking.

Making sure that the locking change doesn't cause problems for other
workloads is a _very_ valid concern, but it's grounds for more testing, not
dismissal.

I think that the suggestion to wake up the first N waiters instead of all
of them is a good optimization (and waking N - # active back-ends would be
even better if there is an easy way to know that number), but I think that
it's worth making the result testable by more people so that we can see
which workloads are pathological for this change (if any).
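
[Editorial sketch] The "wake the first N waiters" idea can be illustrated with a plain FIFO queue. This is only a sketch under assumptions: `WaitQueue`, `wq_wake`, and `wake_budget` are hypothetical names, not PostgreSQL's LWLock code, and actual process wakeup is reduced to returning the chosen ids.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAITERS 128             /* hypothetical queue capacity */

typedef struct {
    int    ids[MAX_WAITERS];        /* ring buffer of waiting backend ids */
    size_t head;                    /* index of the oldest waiter */
    size_t len;                     /* number of queued waiters */
} WaitQueue;

static void wq_init(WaitQueue *q)
{
    q->head = 0;
    q->len = 0;
}

/* Append a waiter at the tail of the queue. */
static void wq_push(WaitQueue *q, int id)
{
    assert(q->len < MAX_WAITERS);
    q->ids[(q->head + q->len) % MAX_WAITERS] = id;
    q->len++;
}

/*
 * Dequeue at most `limit` waiters in FIFO order instead of waking
 * everybody.  The ids to wake are written to woken[]; the return value
 * is how many were taken.
 */
static size_t wq_wake(WaitQueue *q, size_t limit, int *woken)
{
    size_t n = 0;

    while (n < limit && q->len > 0) {
        woken[n++] = q->ids[q->head];
        q->head = (q->head + 1) % MAX_WAITERS;
        q->len--;
    }
    return n;
}

/* The "N - # active back-ends" variant: never returns less than zero. */
static size_t wake_budget(size_t n, size_t active)
{
    return active >= n ? 0 : n - active;
}
```

A caller would compute `wake_budget(N, active)` and pass the result as `limit`; how to obtain the active count cheaply is the open question raised above.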

David Lang


Re: Proposal of tunable fix for scalability of 8.4

Heikki Linnakangas-3
In reply to this post by Tom Lane-2
Tom Lane wrote:
> Robert Haas <[hidden email]> writes:
>> I think that changing the locking behavior is attacking the problem at
>> the wrong level anyway.
>
> Right.  By the time a patch here could have any effect, you've already
> lost the game --- having to deschedule and reschedule a process is a
> large cost compared to the typical lock hold time for most LWLocks.  So
> it would be better to look at how to avoid blocking in the first place.

I think the elephant in the room is that we have a single lock that
needs to be acquired every time a transaction commits, and every time a
backend takes a snapshot. It has worked well, and it still does for
smaller numbers of CPUs, but I'm not surprised it starts to become a
bottleneck on a test like the one Jignesh is running. To make matters
worse, the more backends there are, the longer the lock needs to be held
to take a snapshot.

It's going to require some hard thinking to bust that bottleneck. I've
sometimes thought about maintaining a pre-calculated array of
in-progress XIDs in shared memory. GetSnapshotData would simply memcpy()
that to private memory, instead of collecting the xids from ProcArray.
Or we could try to move some of the if-tests inside the for-loop to
after the ProcArrayLock is released. For example, we could easily remove
the check for "proc == MyProc", and remove our own xid from the array
afterwards. That's just linear speed up, though. I can't immediately
think of a way to completely avoid / partition away the contention.
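
[Editorial sketch] The pre-calculated array idea might look roughly like the following. Everything here is an assumption made for illustration: `SharedXidArray` and the `xa_*` functions are hypothetical, and a C11 `atomic_flag` spinlock stands in for ProcArrayLock. The point is only that the snapshot path holds the lock for a single memcpy() and filters out its own xid afterwards, without the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define MAX_BACKENDS 64             /* hypothetical limit for this sketch */

typedef uint32_t xid_t;             /* stand-in for TransactionId */

typedef struct {
    atomic_flag lock;               /* stand-in for ProcArrayLock */
    int         count;              /* number of in-progress xids */
    xid_t       xids[MAX_BACKENDS]; /* densely packed in-progress xids */
} SharedXidArray;

static void xa_lock(SharedXidArray *a)
{
    while (atomic_flag_test_and_set(&a->lock)) { /* spin */ }
}

static void xa_unlock(SharedXidArray *a)
{
    atomic_flag_clear(&a->lock);
}

static void xa_init(SharedXidArray *a)
{
    atomic_flag_clear(&a->lock);
    a->count = 0;
}

/* Transaction start: append the xid to the shared array. */
static void xa_begin(SharedXidArray *a, xid_t xid)
{
    xa_lock(a);
    assert(a->count < MAX_BACKENDS);
    a->xids[a->count++] = xid;
    xa_unlock(a);
}

/* Commit or abort: remove the xid by moving the last entry into its slot. */
static void xa_end(SharedXidArray *a, xid_t xid)
{
    xa_lock(a);
    for (int i = 0; i < a->count; i++) {
        if (a->xids[i] == xid) {
            a->xids[i] = a->xids[--a->count];
            break;
        }
    }
    xa_unlock(a);
}

/*
 * Snapshot: one memcpy() under the lock, then drop our own xid from the
 * private copy after the lock has been released.  Returns the number of
 * xids left in out[].
 */
static int xa_snapshot(SharedXidArray *a, xid_t my_xid, xid_t *out)
{
    int n, kept = 0;

    xa_lock(a);
    n = a->count;
    memcpy(out, a->xids, (size_t) n * sizeof(xid_t));
    xa_unlock(a);

    for (int i = 0; i < n; i++) {
        if (out[i] != my_xid)
            out[kept++] = out[i];
    }
    return kept;
}
```

Note that the cost moves to `xa_begin` and `xa_end`, which now have to keep the array current under the same lock.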

WALInsertLock is also quite high on Jignesh's list. That I've seen
become the bottleneck on other tests too.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: Proposal of tunable fix for scalability of 8.4

Simon Riggs
In reply to this post by Jignesh K. Shah

On Wed, 2009-03-11 at 16:53 -0400, Jignesh K. Shah wrote:

> 1200: 2000: Medium Throughput: -1781969.000 Avg Medium Resp: 0.019

I think you need to iron out bugs in your test script before we put too
much stock into the results generated. Your throughput should not be
negative.

I'd be interested in knowing the number of S and X locks requested, so
we can think about this from first principles. My understanding is that
ratio of S:X is about 10:1. Do you have more exact numbers?

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Proposal of tunable fix for scalability of 8.4

Jim Nasby
In reply to this post by Jignesh K. Shah
On Mar 11, 2009, at 10:48 PM, Jignesh K. Shah wrote:
> Fair enough. Well, I am now appealing to everyone who has fairly
> decent-sized hardware to try it out and see whether there are
> "gains", "no-changes" or "regressions" based on your workload. Also,
> it will help if you report the number of CPUs when you respond, to
> help collect feedback.


Do you have a self-contained test case? I have several boxes with 16
cores' worth of Xeon and 96GB that I could try it on (though you might not
care about having "only" 16 cores :P)
--
Decibel!, aka Jim C. Nasby, Database Architect  [hidden email]
Give your computer some brain candy! www.distributed.net Team #1828




Re: Proposal of tunable fix for scalability of 8.4

Jim Nasby
In reply to this post by Jignesh K. Shah
On Mar 12, 2009, at 2:22 PM, Jignesh K. Shah wrote:
>> Something that might be useful for him to report is the avg number  
>> of active backends for each data point ...
> Short of doing select * from pg_stat_activity and removing the IDLE
> entries, is there any other clean way to get that information?


Uh, isn't there a DTrace probe that would provide that info? It  
certainly seems like something you'd want to know...
--
Decibel!, aka Jim C. Nasby, Database Architect  [hidden email]
Give your computer some brain candy! www.distributed.net Team #1828




Re: Proposal of tunable fix for scalability of 8.4

Jim Nasby
In reply to this post by Gregory Stark-2
On Mar 13, 2009, at 8:05 AM, Gregory Stark wrote:

> "Jignesh K. Shah" <[hidden email]> writes:
>
>> Scott Carey wrote:
>>> On 3/12/09 11:37 AM, "Jignesh K. Shah" <[hidden email]> wrote:
>>>
>>> In general, I suggest that it is useful to run tests with a few different
>>> types of pacing. Zero delay pacing will not have realistic number of
>>> connections, but will expose bottlenecks that are universal, and less
>>> controversial
>>
>> I think I have done that before so I can do that again by running the users at
>> 0 think time which will represent a "Connection pool" which is highly utilized
>> and test how big the connection pool can be before the throughput tanks.. This
>> can be useful for App Servers which sets up connections pools of their own
>> talking with PostgreSQL.
>
> Keep in mind when you do this that it's not interesting to test a number of
> connections much larger than the number of processors you have. Once the
> system reaches 100% cpu usage it would be a misconfigured connection pooler
> that kept more than that number of connections open.


How certain are you of that? I believe that assertion would only be  
true if a backend could never block on *anything*, which simply isn't  
the case. Of course in most systems you'll usually be blocking on IO,  
but even in a ramdisk scenario there's other things you can end up  
blocking on. That means having more threads than cores isn't  
unreasonable.

If you want to see this in action in an easy to repeat test, try  
compiling a complex system (such as FreeBSD) with different levels of  
-j handed to make (of course you'll need to wait until everything is  
in cache, and I'm assuming you have enough memory so that everything  
would fit in cache).
--
Decibel!, aka Jim C. Nasby, Database Architect  [hidden email]
Give your computer some brain candy! www.distributed.net Team #1828




Re: Proposal of tunable fix for scalability of 8.4

Jim Nasby
In reply to this post by Jignesh K. Shah
On Mar 13, 2009, at 3:02 PM, Jignesh K. Shah wrote:

> vmstat seems similar to wakeup some
> kthr   memory            page                 disk        faults               cpu
>  r b w     swap     free re mf pi po fr de sr s0 s1 s2 sd     in     sy     cs us sy id
> 63 0 0 45535728 38689856  0 14  0  0  0  0  0  0  0  0  0 163318 334225 360179 47 17 36
> 85 0 0 45436736 38690760  0  6  0  0  0  0  0  0  0  0  0 165536 347462 365987 47 17 36
> 59 0 0 45405184 38681752  0 11  0  0  0  0  0  0  0  0  0 155153 326182 345527 47 16 37
> 53 0 0 45393816 38673344  0  6  0  0  0  0  0  0  0  0  0 152752 317851 340737 47 16 37
> 66 0 0 45378312 38651920  0 11  0  0  0  0  0  0  0  0  0 150979 304350 336915 47 16 38
> 67 0 0 45489520 38639664  0  5  0  0  0  0  0  0  0  0  0 157188 318958 351905 47 16 37
> 82 0 0 45483600 38633344  0 10  0  0  0  0  0  0  0  0  0 168797 348619 375827 47 17 36
> 68 0 0 45463008 38614432  0  9  0  0  0  0  0  0  0  0  0 173020 376594 385370 47 18 35
> 54 0 0 45451376 38603792  0 13  0  0  0  0  0  0  0  0  0 161891 342522 364286 48 17 35
> 41 0 0 45356544 38605976  0  5  0  0  0  0  0  0  0  0  0 167250 358320 372469 47 17 36
> 27 0 0 45323472 38596952  0 11  0  0  0  0  0  0  0  0  0 165099 344695 364256 48 17 35


The good news is there's now at least enough runnable procs. What I  
find *extremely* odd is the CPU usage is almost dead constant...
--
Decibel!, aka Jim C. Nasby, Database Architect  [hidden email]
Give your computer some brain candy! www.distributed.net Team #1828




Re: 8.4 Performance improvements: was Re: Proposal of tunable fix for scalability of 8.4

Tom Lane-2
In reply to this post by Robert Haas
Robert Haas <[hidden email]> writes:
> On Fri, Mar 13, 2009 at 10:06 PM, Tom Lane <[hidden email]> wrote:
>> I assume you meant effective_io_concurrency.  We'd still need a special
>> case because the default is currently hard-wired at 1, not 0, if
>> configure thinks the function exists.

> I think 1 should mean no prefetching, rather than 0.

No, 1 means "prefetch a single block ahead".  It doesn't involve I/O
concurrency in the sense of multiple I/O requests being processed at
once; what it does give you is CPU vs I/O concurrency.  0 shuts that
down and returns the system to pre-8.4 behavior.
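
[Editorial sketch] The "prefetch a single block ahead" behavior can be pictured with posix_fadvise() and POSIX_FADV_WILLNEED. The function name, the block size, and the `prefetch_distance` parameter below are assumptions made for illustration; this is not the code path PostgreSQL itself uses. A distance of 0 issues no hint at all, which corresponds to the pre-8.4 behavior described above.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600   /* exposes posix_fadvise() and pread() */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 8192     /* assumed block size for this sketch */

/*
 * Read block `blkno` into buf.  Before blocking in pread(), hint to the
 * kernel that the next `prefetch_distance` blocks will be wanted soon, so
 * the I/O for them can overlap with the CPU work done on this block.
 * Returns whatever pread() returns.
 */
static ssize_t
read_block_with_prefetch(int fd, off_t blkno, char *buf, int prefetch_distance)
{
    assert(prefetch_distance >= 0);

    if (prefetch_distance > 0) {
        /* Advisory only: the return value is ignored on purpose. */
        (void) posix_fadvise(fd, (blkno + 1) * BLOCK_SIZE,
                             (off_t) prefetch_distance * BLOCK_SIZE,
                             POSIX_FADV_WILLNEED);
    }
    return pread(fd, buf, BLOCK_SIZE, blkno * BLOCK_SIZE);
}
```

On a platform where posix_fadvise() exists but does nothing, the hint is a wasted system call, which is the Solaris concern raised earlier in the thread.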

                        regards, tom lane


Re: Proposal of tunable fix for scalability of 8.4

Tom Lane-2
In reply to this post by Heikki Linnakangas-3
Heikki Linnakangas <[hidden email]> writes:
> WALInsertLock is also quite high on Jignesh's list. That I've seen
> become the bottleneck on other tests too.

Yeah, that's been seen to be an issue before.  I had the germ of an idea
about how to fix that:

        ... with no lock, determine size of WAL record ...
        obtain WALInsertLock
        identify WAL start address of my record, advance insert pointer
                past record end
        *release* WALInsertLock
        without lock, copy record into the space just reserved

The idea here is to allow parallelization of the copying of data into
the buffers.  The hold time on WALInsertLock would be very short.  Maybe
it could even become a spinlock, though I'm not sure, because the
"advance insert pointer" bit is more complicated than it looks (you have
to allow for the extra overhead when crossing a WAL page boundary).

Now the fly in the ointment is that there would need to be some way to
ensure that we didn't write data out to disk until it was valid; in
particular how do we implement a request to flush WAL up to a particular
LSN value, when maybe some of the records before that haven't been fully
transferred into the buffers yet?  The best idea I've thought of so far
is shared/exclusive locks on the individual WAL buffer pages, with the
rather unusual behavior that writers of the page would take shared lock
and only the reader (he who has to dump to disk) would take exclusive
lock.  But maybe there's a better way.  Currently I don't believe that
dumping a WAL buffer (WALWriteLock) blocks insertion of new WAL data,
and it would be nice to preserve that property.
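
[Editorial sketch] The reserve-then-copy steps listed above can be sketched as follows. This is a toy under stated assumptions: `WalBuffer` and the `wal_*` functions are hypothetical, an `atomic_flag` spinlock stands in for WALInsertLock, the buffer never wraps, and the page-boundary overhead and the flush-ordering problem discussed above are both left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define WAL_BUF_SIZE 4096           /* hypothetical, tiny on purpose */

typedef struct {
    atomic_flag insert_lock;        /* stand-in for WALInsertLock */
    size_t      insert_pos;         /* next free byte, guarded by insert_lock */
    char        buf[WAL_BUF_SIZE];
} WalBuffer;

static void wal_init(WalBuffer *w)
{
    atomic_flag_clear(&w->insert_lock);
    w->insert_pos = 0;
}

/* Under the lock: only pick the start address and advance the pointer. */
static size_t wal_reserve(WalBuffer *w, size_t len)
{
    size_t start;

    while (atomic_flag_test_and_set(&w->insert_lock)) { /* spin */ }
    start = w->insert_pos;
    assert(len <= WAL_BUF_SIZE - start);    /* no wraparound in this sketch */
    w->insert_pos = start + len;
    atomic_flag_clear(&w->insert_lock);
    return start;
}

/* Without the lock: copy the record into the space just reserved. */
static void wal_copy(WalBuffer *w, size_t start, const void *rec, size_t len)
{
    memcpy(w->buf + start, rec, len);
}

/* The record size is known before any lock is taken. */
static size_t wal_insert(WalBuffer *w, const void *rec, size_t len)
{
    size_t start = wal_reserve(w, len);

    wal_copy(w, start, rec, len);
    return start;
}
```

Because reservations never overlap, concurrent callers can run `wal_copy` in parallel; what the sketch does not answer is how a flusher learns that every copy below a given position has finished.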

                        regards, tom lane


Re: Proposal of tunable fix for scalability of 8.4

Scott Carey
Top posting because my email client will mess up the inline:

Re: advance insert pointer.
I have no idea how complicated that advance part is, as you allude to.  But can this be done without a lock at all?
An atomic compare and exchange (or compare and set, etc.) should do it, although boundaries in buffers could make it a bit more complicated than that.  Sounds potentially lockless to me.  CompareAndSet-like atomics would prevent context switches entirely and generally work fabulously if the item that needs locking is itself an atomic value like a pointer or int.  This is similar to, but lighter weight than, a spin lock.
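
[Editorial sketch] A compare-and-exchange version of "advance insert pointer" could look like this with C11 atomics. `reserve_cas` and `WAL_BUF_SIZE` are hypothetical names, and the sketch assumes the insert position is a single offset with no page-boundary adjustment, which is exactly the complication mentioned above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define WAL_BUF_SIZE 4096           /* hypothetical buffer size */

/*
 * Reserve `len` bytes by atomically advancing *insert_pos.  Returns the
 * start offset of the reserved space, or (size_t) -1 if it does not fit.
 * No lock is taken: a caller that loses the race simply retries with the
 * freshly observed position.
 */
static size_t reserve_cas(atomic_size_t *insert_pos, size_t len)
{
    size_t old;

    assert(insert_pos != NULL);
    old = atomic_load(insert_pos);
    for (;;) {
        if (len > WAL_BUF_SIZE - old)
            return (size_t) -1;
        /* On failure, `old` is updated to the current value. */
        if (atomic_compare_exchange_weak(insert_pos, &old, old + len))
            return old;
    }
}
```

The weak form may fail spuriously, which is harmless inside a retry loop.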

________________________________________
From: Tom Lane [[hidden email]]
Sent: Saturday, March 14, 2009 9:09 AM
To: Heikki Linnakangas
Cc: Robert Haas; Scott Carey; Greg Smith; Jignesh K. Shah; Kevin Grittner; [hidden email]
Subject: Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

Yeah, that's been seen to be an issue before.  I had the germ of an idea
about how to fix that:

        ... with no lock, determine size of WAL record ...
        obtain WALInsertLock
        identify WAL start address of my record, advance insert pointer
                past record end
        *release* WALInsertLock
        without lock, copy record into the space just reserved

The idea here is to allow parallelization of the copying of data into
the buffers.  The hold time on WALInsertLock would be very short.  Maybe
it could even become a spinlock, though I'm not sure, because the
"advance insert pointer" bit is more complicated than it looks (you have
to allow for the extra overhead when crossing a WAL page boundary).

Re: Proposal of tunable fix for scalability of 8.4

Jignesh K. Shah
In reply to this post by Simon Riggs


Simon Riggs wrote:

> On Wed, 2009-03-11 at 16:53 -0400, Jignesh K. Shah wrote:
>
>  
>> 1200: 2000: Medium Throughput: -1781969.000 Avg Medium Resp: 0.019
>>    
>
> I think you need to iron out bugs in your test script before we put too
> much stock into the results generated. Your throughput should not be
> negative.
>
> I'd be interested in knowing the number of S and X locks requested, so
> we can think about this from first principles. My understanding is that
> ratio of S:X is about 10:1. Do you have more exact numbers?
>
>  
Simon, that's a known bug in the test: the first time it reaches
the max number of users, it reports a negative number. All the other
numbers are pretty much accurate.

Generally the users:transactions count depends on think time.

-Jignesh

--
Jignesh Shah           http://blogs.sun.com/jkshah 
The New Sun Microsystems,Inc   http://sun.com/postgresql



Re: Proposal of tunable fix for scalability of 8.4

Jignesh K. Shah
In reply to this post by Jim Nasby


decibel wrote:

> On Mar 11, 2009, at 10:48 PM, Jignesh K. Shah wrote:
>> Fair enough. Well, I am now appealing to everyone who has fairly
>> decent-sized hardware to try it out and see whether there are
>> "gains", "no-changes" or "regressions" based on your workload. Also,
>> it will help if you report the number of CPUs when you respond, to
>> help collect feedback.
>
>
> Do you have a self-contained test case? I have several boxes with
> 16 cores' worth of Xeon and 96GB that I could try it on (though you
> might not care about having "only" 16 cores :P)
I don't have authority over iGen, but I am pretty sure that we should be
able to recreate the test case with sysbench, or even dbt-2.
That said, the patch should be pretty easy to apply to your own workloads
(where more feedback is more appreciated). On x64, 16 cores might bring
out the problem faster too, since they typically run at 2.5X higher clock
frequency. Try it out: stock build vs. patched builds.


-Jignesh

--
Jignesh Shah           http://blogs.sun.com/jkshah 
The New Sun Microsystems,Inc   http://sun.com/postgresql



Re: Proposal of tunable fix for scalability of 8.4

Jignesh K. Shah
In reply to this post by Jim Nasby


decibel wrote:

> On Mar 13, 2009, at 3:02 PM, Jignesh K. Shah wrote:
>> vmstat seems similar to wakeup some
>> kthr   memory            page                 disk        faults               cpu
>>  r b w     swap     free re mf pi po fr de sr s0 s1 s2 sd     in     sy     cs us sy id
>> 63 0 0 45535728 38689856  0 14  0  0  0  0  0  0  0  0  0 163318 334225 360179 47 17 36
>> 85 0 0 45436736 38690760  0  6  0  0  0  0  0  0  0  0  0 165536 347462 365987 47 17 36
>> 59 0 0 45405184 38681752  0 11  0  0  0  0  0  0  0  0  0 155153 326182 345527 47 16 37
>> 53 0 0 45393816 38673344  0  6  0  0  0  0  0  0  0  0  0 152752 317851 340737 47 16 37
>> 66 0 0 45378312 38651920  0 11  0  0  0  0  0  0  0  0  0 150979 304350 336915 47 16 38
>> 67 0 0 45489520 38639664  0  5  0  0  0  0  0  0  0  0  0 157188 318958 351905 47 16 37
>> 82 0 0 45483600 38633344  0 10  0  0  0  0  0  0  0  0  0 168797 348619 375827 47 17 36
>> 68 0 0 45463008 38614432  0  9  0  0  0  0  0  0  0  0  0 173020 376594 385370 47 18 35
>> 54 0 0 45451376 38603792  0 13  0  0  0  0  0  0  0  0  0 161891 342522 364286 48 17 35
>> 41 0 0 45356544 38605976  0  5  0  0  0  0  0  0  0  0  0 167250 358320 372469 47 17 36
>> 27 0 0 45323472 38596952  0 11  0  0  0  0  0  0  0  0  0 165099 344695 364256 48 17 35
>
>
> The good news is there's now at least enough runnable procs. What I
> find *extremely* odd is the CPU usage is almost dead constant...
Generally, when it is dead constant, that is a sign of a classic bottleneck ;-)
We will be fixing one to get to another, but knocking down bottlenecks is
the name of the game, I think.

-Jignesh

--
Jignesh Shah           http://blogs.sun.com/jkshah 
The New Sun Microsystems,Inc   http://sun.com/postgresql



Re: Proposal of tunable fix for scalability of 8.4

Gregory Stark-2
"Jignesh K. Shah" <[hidden email]> writes:

> Generally, when it is dead constant, that is a sign of a classic bottleneck ;-)  We
> will be fixing one to get to another, but knocking down bottlenecks is the name of
> the game, I think.

Indeed. I think the bottleneck we're interested in addressing here is why you
say you weren't able to saturate the 64 threads with 64 processes when they're
all RAM-resident.

From what I see you still have 400+ processes? Is that right?

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!


Re: Proposal of tunable fix for scalability of 8.4

Kevin Grittner
In reply to this post by david@lang.hm
<[hidden email]> wrote:
> On Fri, 13 Mar 2009, Kevin Grittner wrote:
>> If all data access is in RAM, why can't 80 processes
>> keep 64 threads (on 8 processors) busy?  Does anybody else think
>> that's an interesting question, or am I off in left field here?
>
> I don't think that anyone is arguing that it's not interesting, but I
> also think that complete dismissal of the existing test case is
> wrong.
 
Right, I just think this point in the test might give more targeted
results.  When you've got many more times the number of processes than
processors, of course processes will be held up.  It seems to me that
this is the point where the real issues are least likely to get lost
in the noise.  It also might point out delays from the clients which
would help in interpreting the results farther down the list.
 
One more reason this point is an interesting one is that it is one
that gets *worse* with the suggested patch, if only by half a percent.
 
Without:
 
600: 80: Medium Throughput: 82632.000 Avg Medium Resp: 0.005
 
with:
 
600: 80: Medium Throughput: 82241.000 Avg Medium Resp: 0.005
 
-Kevin


Re: Proposal of tunable fix for scalability of 8.4

Matthew Wakeling-3
In reply to this post by Heikki Linnakangas-3
On Sat, 14 Mar 2009, Heikki Linnakangas wrote:
> I think the elephant in the room is that we have a single lock that needs to
> be acquired every time a transaction commits, and every time a backend takes
> a snapshot.

I like this line of thinking.

There are two valid sides to this. One is the elephant - can we remove the
need for this lock, or at least reduce its contention. The second is the
fact that these tests have shown that the locking code has potential for
improvement in the case where there are many processes waiting on the same
lock. Both could be worked on, but perhaps the greatest benefit will come
from stopping a single lock being so contended in the first place.

One possibility would be for the locks to alternate between exclusive and
shared - that is:

1. Take a snapshot of all shared waits, and grant them all - thundering
     herd style.
2. Wait until ALL of them have finished, granting no more.
3. Take a snapshot of all exclusive waits, and grant them all, one by one.
4. Wait until all of them have been finished, granting no more.
5. Back to (1).

This may also possibly improve CPU cache coherency. Or of course, it may
make everything much worse - I'm no expert. It would avoid starvation
though.
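
[Editorial sketch] The alternating grant policy in steps 1 to 5 can be simulated without any real locking. `Waiter`, `take_batch`, and `next_phase` are hypothetical names; the sketch only shows how one phase's batch is snapshotted out of the wait queue, with anything that arrives later waiting for the next cycle.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { LOCK_SHARED, LOCK_EXCLUSIVE } LockMode;

typedef struct {
    int      id;        /* hypothetical backend id */
    LockMode mode;      /* what the backend is waiting for */
} Waiter;

/*
 * Snapshot every waiter of the given phase out of the queue, keeping the
 * arrival order.  Ids to grant are written to granted[]; the waiters of
 * the other mode are compacted in place and *len is updated.  Returns the
 * size of the batch.
 */
static size_t take_batch(Waiter *queue, size_t *len, LockMode phase,
                         int *granted)
{
    size_t n = 0, kept = 0;

    assert(queue != NULL && len != NULL && granted != NULL);
    for (size_t i = 0; i < *len; i++) {
        if (queue[i].mode == phase)
            granted[n++] = queue[i].id;
        else
            queue[kept++] = queue[i];
    }
    *len = kept;
    return n;
}

/* Step 5: flip to the other mode once the current batch has finished. */
static LockMode next_phase(LockMode phase)
{
    return phase == LOCK_SHARED ? LOCK_EXCLUSIVE : LOCK_SHARED;
}
```

A shared batch would be granted all at once and an exclusive batch one id at a time; in both cases no new grants happen until the whole batch has released the lock, which is what avoids starvation.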

> It's going to require some hard thinking to bust that bottleneck. I've sometimes
> thought about maintaining a pre-calculated array of in-progress XIDs in
> shared memory. GetSnapshotData would simply memcpy() that to private memory,
> instead of collecting the xids from ProcArray.

Shifting the contention from reading that data to altering it. But that
would probably be quite a lot fewer times, so it would be a benefit.

> Or we could try to move some of the if-tests inside the for-loop to
> after the ProcArrayLock is released.

That's always a useful change.

On Sat, 14 Mar 2009, Tom Lane wrote:

> Now the fly in the ointment is that there would need to be some way to
> ensure that we didn't write data out to disk until it was valid; in
> particular how do we implement a request to flush WAL up to a particular
> LSN value, when maybe some of the records before that haven't been fully
> transferred into the buffers yet?  The best idea I've thought of so far
> is shared/exclusive locks on the individual WAL buffer pages, with the
> rather unusual behavior that writers of the page would take shared lock
> and only the reader (he who has to dump to disk) would take exclusive
> lock.  But maybe there's a better way.  Currently I don't believe that
> dumping a WAL buffer (WALWriteLock) blocks insertion of new WAL data,
> and it would be nice to preserve that property.

The writers would need to take a shared lock on the page before releasing
the lock that marshals access to the "how long is the log" data. Other
than that, your idea would work.

An alternative would be to maintain a concurrent linked list of WAL writes
in progress. An entry would be added to the tail every time a new writer
is generated, marking the end of the log. When a writer finishes, it can
remove the entry from the list very cheaply and with very little
contention. The reader (who dumps the WAL to disc) need only look at the
head of the list to find out how far the log is completed, because the
list is guaranteed to be in order of position in the log.

The linked list would probably be simpler - the writers don't need to lock
multiple things. It would also have fewer things accessing each
lock, and therefore maybe less contention. However, it may involve more
locks than the one lock per WAL page method, and I don't know what the
overhead of that would be. (It may be fewer - I don't know what the
average WAL write size is.)
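
[Editorial sketch] The list of in-progress WAL writes might be modeled like this. `WriteList` and the `wl_*` functions are hypothetical, and the sketch is single-threaded: the concurrency control a real concurrent list would need is deliberately left out. It only demonstrates the ordering argument, namely that the head of the list bounds how far the log is complete.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct InProgress {
    uint64_t           start;       /* log position where this write begins */
    struct InProgress *prev;
    struct InProgress *next;
} InProgress;

typedef struct {
    InProgress *head;               /* oldest unfinished write */
    InProgress *tail;               /* newest unfinished write */
    uint64_t    log_end;            /* end of all space reserved so far */
} WriteList;

static void wl_init(WriteList *l)
{
    l->head = l->tail = NULL;
    l->log_end = 0;
}

/* A new writer reserves `len` bytes at the end of the log and joins the tail. */
static void wl_begin(WriteList *l, InProgress *w, uint64_t len)
{
    assert(len > 0);
    w->start = l->log_end;
    l->log_end += len;
    w->next = NULL;
    w->prev = l->tail;
    if (l->tail)
        l->tail->next = w;
    else
        l->head = w;
    l->tail = w;
}

/* A finished writer unlinks itself; it touches only its neighbors. */
static void wl_finish(WriteList *l, InProgress *w)
{
    if (w->prev)
        w->prev->next = w->next;
    else
        l->head = w->next;
    if (w->next)
        w->next->prev = w->prev;
    else
        l->tail = w->prev;
    w->prev = w->next = NULL;
}

/*
 * Everything before the returned position has been fully copied.  Entries
 * are in log order, so only the head needs to be examined.
 */
static uint64_t wl_flush_limit(const WriteList *l)
{
    return l->head ? l->head->start : l->log_end;
}
```

The flusher would write out at most up to `wl_flush_limit()`, so a slow writer at the head holds back flushing of later records that are already complete.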

Matthew

--
 What goes up must come down. Ask any system administrator.


Re: Proposal of tunable fix for scalability of 8.4

Kevin Grittner
In reply to this post by Kevin Grittner
I wrote:
> One more reason this point is an interesting one is that it is one
> that gets *worse* with the suggested patch, if only by half a
> percent.
>  
> Without:
>  
> 600: 80: Medium Throughput: 82632.000 Avg Medium Resp: 0.005
>  
> with:
>  
> 600: 80: Medium Throughput: 82241.000 Avg Medium Resp: 0.005
 
Oops.  A later version:
 
> Redid the test with - waking up all waiters irrespective of shared,
> exclusive
 
> 600: 80: Medium Throughput: 82920.000 Avg Medium Resp: 0.005
 
The one that showed the decreased performance at 800 was:
 
> a modified Fix (not the original one that I proposed but something
> that works like a heart valve : Opens and shuts to minimum
> default way thus  controlling how many waiters are waked up )
 
-Kevin
