Quantcast

Posix Shared Mem patch

classic Classic list List threaded Threaded
81 messages Options
12345
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Posix Shared Mem patch

A.M.
On 06/26/2012 07:15 PM, Alvaro Herrera wrote:

>
> Excerpts from Tom Lane's message of mar jun 26 18:58:45 -0400 2012:
>
>> Even if you actively try to configure the shmem settings to exactly
>> fill shmmax (which I concede some installation scripts might do),
>> it's going to be hard to do because of the 8K granularity of the main
>> knob, shared_buffers.
>
> Actually it's very easy -- just try to start postmaster on a system with
> not enough shmmax and it will tell you how much shmem it wants.  Then
> copy that number verbatim in the config file.  This might fail on picky
> systems such as MacOSX that require some exact multiple or power of some
> other parameter, but it works fine on Linux.
>

Except that we have to account for other installers. A user can install
an application in the future which clobbers the value and then the
original application will fail to run. The options to get the first app
working is:

a) to re-install the first app (potentially preventing the second app
from running)
b) to have the first app detect the failure and readjust the value
(guessing what it should be) and potentially forcing a reboot
c) to have the the user manually adjust the value and potentially force
a reboot

The failure usually gets blamed on the first application.

That's why we had to nuke SysV shmem.

Cheers,
M



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Posix Shared Mem patch

Robert Haas
In reply to this post by Tom Lane-2
On Tue, Jun 26, 2012 at 6:20 PM, Tom Lane <[hidden email]> wrote:
> Robert Haas <[hidden email]> writes:
>> So, what about keeping a FIFO in the data directory?
>
> Hm, does that work if the data directory is on NFS?  Or some other weird
> not-really-Unix file system?

I would expect NFS to work in general.  We could test that.  Of
course, it's more than possible that there's some bizarre device out
there that purports to be NFS but doesn't actually support mkfifo.
It's difficult to prove a negative.

>> When the
>> postmaster starts up, it tries to open the file with O_NONBLOCK |
>> O_WRONLY (or O_NDELAY | O_WRONLY, if the platform has O_NDELAY rather
>> than O_NONBLOCK).  If that succeeds, it bails out.  If it fails with
>> anything other than ENXIO, it bails out.  If it fails with exactly
>> ENXIO, then it opens the pipe with O_RDONLY
>
> ... race condition here ...

Oh, if someone tries to start two postmasters at the same time?  Hmm.

>> and arranges to pass the
>> file descriptor down to all of its children, so that a subsequent open
>> will fail if it or any of its children are still alive.
>
> This might be made to work, but that doesn't sound quite right in
> detail.
>
> I remember we speculated about using an fcntl lock on some file in the
> data directory, but that fails because child processes don't inherit
> fcntl locks.
>
> In the modern world, it'd be really a step forward if the lock mechanism
> worked on shared storage, ie a data directory on NFS or similar could be
> locked against all comers not just those on the same node as the
> original postmaster.  I don't know how to do that though.

Well, I think that in theory that DOES work.  But I also think it's
often misconfigured.  Which could also be said of NFS in general.

> In the meantime, insisting that we solve this problem before we do
> anything is a good recipe for ensuring that nothing happens, just
> like it hasn't happened for the last half dozen years.  (I see Alvaro
> just made the same point.)

Agreed all around.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Posix Shared Mem patch

Tom Lane-2
In reply to this post by A.M.
"A.M." <[hidden email]> writes:
> On 06/26/2012 07:30 PM, Tom Lane wrote:
>>> I solved this via fcntl locking.

>> No, you didn't, because fcntl locks aren't inherited by child processes.
>> Too bad, because they'd be a great solution otherwise.

> You claimed this last time and I replied:
> http://archives.postgresql.org/pgsql-hackers/2011-04/msg00656.php

> "I address this race condition by ensuring that a lock-holding violator
> is the postmaster or a postmaster child. If such as condition is
> detected, the child exits immediately without touching the shared
> memory. POSIX shmem is inherited via file descriptors."

> This is possible because the locking API allows one to request which PID
> violates the lock. The child expects the lock to be held and checks that
> the PID is the parent. If the lock is not held, that means that the
> postmaster is dead, so the child exits immediately.

OK, I went back and re-read the original patch, and I now agree that
something like this is possible --- but I don't like the way you did
it. The dependence on particular PIDs seems both unnecessary and risky.

The key concept here seems to be that the postmaster first stakes a
claim on the data directory by exclusive-locking a lock file.  If
successful, it reduces that lock to shared mode (which can be done
atomically, according to the SUS fcntl specification), and then holds
the shared lock until it exits.  Spawned children will not initially
have a lock, but what they can do is attempt to acquire shared lock on
the lock file.  If fail, exit.  If successful, *check to see that the
parent postmaster is still alive* (ie, getppid() != 1).  If so, the
parent must have been continuously holding the lock, and the child has
successfully joined the pool of shared lock holders.  Otherwise bail
out without having changed anything.  It is the "parent is still alive"
check, not any test on individual PIDs, that makes this work.

There are two concrete reasons why I don't care for the
GetPIDHoldingLock() way.  Firstly, the fact that you can get a blocking
PID from F_GETLK isn't an essential part of the concept of file locking
IMO --- it's just an incidental part of this particular API.  May I
remind you that the reason we're stuck on SysV shmem in the first place
is that we decided to depend on an incidental part of that API, namely
nattch?  I would like to not require file locking to have any semantics
more specific than "a process can hold an exclusive or a shared lock on
a file, which is auto-released at process exit".  Secondly, in an NFS
world I don't believe that the returned l_pid value can be trusted for
anything.  If it's a PID from a different machine then it might
accidentally conflict with one on our machine, or not.

Reflecting on this further, it seems to me that the main remaining
failure modes are (1) file locking doesn't work, or (2) idiot DBA
manually removes the lock file.  Both of these could be ameliorated
with some refinements to the basic idea.  For (1), I suggest that
we tweak the startup process (only) to attempt to acquire exclusive lock
on the lock file.  If it succeeds, we know that file locking is broken,
and we can complain.  (This wouldn't help for cases where cross-machine
locking is broken, but I see no practical way to detect that.)
For (2), the problem really is that the proposed patch conflates the PID
file with the lock file, but people are conditioned to think that PID
files are removable.  I suggest that we create a separate, permanently
present file that serves only as the lock file and doesn't ever get
modified (it need have no content other than the string "Don't remove
this!").  It'd be created by initdb, not by individual postmaster runs;
indeed the postmaster should fail if it doesn't find the lock file
already present.  The postmaster PID file should still exist with its
current contents, but it would serve mostly as documentation and as
server-contact information for pg_ctl; it would not be part of the data
directory locking mechanism.

I wonder whether this design can be adapted to Windows?  IIRC we do
not have a bulletproof data directory lock scheme for Windows.
It seems like this makes few enough demands on the lock mechanism
that there ought to be suitable primitives available there too.

                        regards, tom lane

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Posix Shared Mem patch

Tom Lane-2
I wrote:
> Reflecting on this further, it seems to me that the main remaining
> failure modes are (1) file locking doesn't work, or (2) idiot DBA
> manually removes the lock file.

Oh, wait, I just remembered the really fatal problem here: to quote from
the SUS fcntl spec,

        All locks associated with a file for a given process are removed
        when a file descriptor for that file is closed by that process
        or the process holding that file descriptor terminates.

That carefully says "a file descriptor", not "the file descriptor
through which the lock was acquired".  Any close() referencing the lock
file will do.  That means that it is possible for perfectly innocent
code --- for example, something that scans all files in the data
directory, as say pg_basebackup might do --- to cause a backend process
to lose its lock.  When we looked at this before, it seemed like a
showstopper.  Even if we carefully taught every directory-scanning loop
in postgres not to touch the lock file, we cannot expect that for
instance a pl/perl function wouldn't accidentally break things.  And
99.999% of the time nobody would notice ... it would just be that last
0.001% of people that would be screwed.

Still, this discussion has yielded a useful advance, which is that we
now see how we might safely make use of lock mechanisms that don't
inherit across fork().  We just need something less broken than fcntl().

                        regards, tom lane

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Posix Shared Mem patch

Robert Haas
In reply to this post by Tom Lane-2
On Tue, Jun 26, 2012 at 6:25 PM, Tom Lane <[hidden email]> wrote:
> Josh Berkus <[hidden email]> writes:
>> So let's fix the 80% case with something we feel confident in, and then
>> revisit the no-sysv interlock as a separate patch.  That way if we can't
>> fix the interlock issues, we still have a reduced-shmem version of Postgres.
>
> Yes.  Insisting that we have the whole change in one patch is a good way
> to prevent any forward progress from happening.  As Alvaro noted, there
> are plenty of issues to resolve without trying to change the interlock
> mechanism at the same time.

So, here's a patch.  Instead of using POSIX shmem, I just took the
expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS
memory.  The sysv shm is still allocated, but it's just a copy of
PGShmemHeader; the "real" shared memory is the anonymous block.  This
won't work if EXEC_BACKEND is defined so it just falls back on
straight sysv shm in that case.

There are obviously some portability issues here - this is documented
not to work on Linux <= 2.4, but it's not clear whether it fails with
some suitable error code or just pretends to work and does the wrong
thing.  I tested that it does compile and work on both Linux 3.2.6 and
MacOS X 10.6.8.  And the comments probably need work and... who knows
what else is wrong.  But, thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

anonymous-shmem.patch (5K) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Posix Shared Mem patch

Tom Lane-2
Robert Haas <[hidden email]> writes:
> So, here's a patch.  Instead of using POSIX shmem, I just took the
> expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS
> memory.  The sysv shm is still allocated, but it's just a copy of
> PGShmemHeader; the "real" shared memory is the anonymous block.  This
> won't work if EXEC_BACKEND is defined so it just falls back on
> straight sysv shm in that case.

Um.  I hadn't thought about the EXEC_BACKEND interaction, but that seems
like a bit of a showstopper.  I would not like to give up the ability
to debug EXEC_BACKEND mode on Unixen.

Would Posix shmem help with that at all?  Why did you choose not to
use the Posix API, anyway?

                        regards, tom lane

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Posix Shared Mem patch

Robert Haas
On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane <[hidden email]> wrote:

> Robert Haas <[hidden email]> writes:
>> So, here's a patch.  Instead of using POSIX shmem, I just took the
>> expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS
>> memory.  The sysv shm is still allocated, but it's just a copy of
>> PGShmemHeader; the "real" shared memory is the anonymous block.  This
>> won't work if EXEC_BACKEND is defined so it just falls back on
>> straight sysv shm in that case.
>
> Um.  I hadn't thought about the EXEC_BACKEND interaction, but that seems
> like a bit of a showstopper.  I would not like to give up the ability
> to debug EXEC_BACKEND mode on Unixen.
>
> Would Posix shmem help with that at all?  Why did you choose not to
> use the Posix API, anyway?

It seemed more complicated.  If we use the POSIX API, we've got to
have code to find a non-colliding name for the shm, and we've got to
arrange to clean it up at process exit.  Anonymous shm doesn't require
a name and goes away automatically when it's no longer in use.

With respect to EXEC_BACKEND, I wasn't proposing to kill it, just to
make it continue to use a full-sized sysv shm.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Posix Shared Mem patch

Magnus Hagander-2
In reply to this post by Tom Lane-2
On Wed, Jun 27, 2012 at 3:50 AM, Tom Lane <[hidden email]> wrote:

> "A.M." <[hidden email]> writes:
>> On 06/26/2012 07:30 PM, Tom Lane wrote:
>>>> I solved this via fcntl locking.
>
>>> No, you didn't, because fcntl locks aren't inherited by child processes.
>>> Too bad, because they'd be a great solution otherwise.
>
>> You claimed this last time and I replied:
>> http://archives.postgresql.org/pgsql-hackers/2011-04/msg00656.php
>
>> "I address this race condition by ensuring that a lock-holding violator
>> is the postmaster or a postmaster child. If such as condition is
>> detected, the child exits immediately without touching the shared
>> memory. POSIX shmem is inherited via file descriptors."
>
>> This is possible because the locking API allows one to request which PID
>> violates the lock. The child expects the lock to be held and checks that
>> the PID is the parent. If the lock is not held, that means that the
>> postmaster is dead, so the child exits immediately.
>
> OK, I went back and re-read the original patch, and I now agree that
> something like this is possible --- but I don't like the way you did
> it. The dependence on particular PIDs seems both unnecessary and risky.
>
> The key concept here seems to be that the postmaster first stakes a
> claim on the data directory by exclusive-locking a lock file.  If
> successful, it reduces that lock to shared mode (which can be done
> atomically, according to the SUS fcntl specification), and then holds
> the shared lock until it exits.  Spawned children will not initially
> have a lock, but what they can do is attempt to acquire shared lock on
> the lock file.  If fail, exit.  If successful, *check to see that the
> parent postmaster is still alive* (ie, getppid() != 1).  If so, the
> parent must have been continuously holding the lock, and the child has
> successfully joined the pool of shared lock holders.  Otherwise bail
> out without having changed anything.  It is the "parent is still alive"
> check, not any test on individual PIDs, that makes this work.
>
> There are two concrete reasons why I don't care for the
> GetPIDHoldingLock() way.  Firstly, the fact that you can get a blocking
> PID from F_GETLK isn't an essential part of the concept of file locking
> IMO --- it's just an incidental part of this particular API.  May I
> remind you that the reason we're stuck on SysV shmem in the first place
> is that we decided to depend on an incidental part of that API, namely
> nattch?  I would like to not require file locking to have any semantics
> more specific than "a process can hold an exclusive or a shared lock on
> a file, which is auto-released at process exit".  Secondly, in an NFS
> world I don't believe that the returned l_pid value can be trusted for
> anything.  If it's a PID from a different machine then it might
> accidentally conflict with one on our machine, or not.
>
> Reflecting on this further, it seems to me that the main remaining
> failure modes are (1) file locking doesn't work, or (2) idiot DBA
> manually removes the lock file.  Both of these could be ameliorated
> with some refinements to the basic idea.  For (1), I suggest that
> we tweak the startup process (only) to attempt to acquire exclusive lock
> on the lock file.  If it succeeds, we know that file locking is broken,
> and we can complain.  (This wouldn't help for cases where cross-machine
> locking is broken, but I see no practical way to detect that.)
> For (2), the problem really is that the proposed patch conflates the PID
> file with the lock file, but people are conditioned to think that PID
> files are removable.  I suggest that we create a separate, permanently
> present file that serves only as the lock file and doesn't ever get
> modified (it need have no content other than the string "Don't remove
> this!").  It'd be created by initdb, not by individual postmaster runs;
> indeed the postmaster should fail if it doesn't find the lock file
> already present.  The postmaster PID file should still exist with its
> current contents, but it would serve mostly as documentation and as
> server-contact information for pg_ctl; it would not be part of the data
> directory locking mechanism.
>
> I wonder whether this design can be adapted to Windows?  IIRC we do
> not have a bulletproof data directory lock scheme for Windows.
> It seems like this makes few enough demands on the lock mechanism
> that there ought to be suitable primitives available there too.

I assume you're saying we need to make changes in the internal API,
right? Because we alreayd have a windows native implementation of
shared memory that AFAIK works, so if the new Unix stuff can be done
with the same internal APIs, it shouldn't nede to be changed. (Sorry,
haven't followed the thread in detail)

If so - can we define exactly what properties it is we *need*?

(A native API worth looking at is e.g.
http://msdn.microsoft.com/en-us/library/windows/desktop/aa365203(v=vs.85).aspx
- but there are probably others as well if that one doesn't do)

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Posix Shared Mem patch

Tom Lane-2
Magnus Hagander <[hidden email]> writes:
> On Wed, Jun 27, 2012 at 3:50 AM, Tom Lane <[hidden email]> wrote:
>> I wonder whether this design can be adapted to Windows?  IIRC we do
>> not have a bulletproof data directory lock scheme for Windows.
>> It seems like this makes few enough demands on the lock mechanism
>> that there ought to be suitable primitives available there too.

> I assume you're saying we need to make changes in the internal API,
> right? Because we alreayd have a windows native implementation of
> shared memory that AFAIK works,

Right, but does it provide honest protection against starting two
postmasters in the same data directory?  Or more to the point,
does it prevent starting a new postmaster when the old postmaster
crashed but there are still orphaned backends making changes?
AFAIR we basically punted on those problems for the Windows port,
for lack of an equivalent to nattch.

                        regards, tom lane

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Posix Shared Mem patch

Tom Lane-2
In reply to this post by Robert Haas
Robert Haas <[hidden email]> writes:
> On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane <[hidden email]> wrote:
>> Would Posix shmem help with that at all?  Why did you choose not to
>> use the Posix API, anyway?

> It seemed more complicated.  If we use the POSIX API, we've got to
> have code to find a non-colliding name for the shm, and we've got to
> arrange to clean it up at process exit.  Anonymous shm doesn't require
> a name and goes away automatically when it's no longer in use.

I see.  Those are pretty good reasons ...

> With respect to EXEC_BACKEND, I wasn't proposing to kill it, just to
> make it continue to use a full-sized sysv shm.

Well, if the ultimate objective is to get out from under the SysV APIs
entirely, we're not going to get there if we still have to have all that
code for the EXEC_BACKEND case.  Maybe it's time to decide that we don't
need to support EXEC_BACKEND on Unix.

                        regards, tom lane

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Posix Shared Mem patch

Stephen Frost
All,

* Tom Lane ([hidden email]) wrote:

> Robert Haas <[hidden email]> writes:
> > On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane <[hidden email]> wrote:
> >> Would Posix shmem help with that at all?  Why did you choose not to
> >> use the Posix API, anyway?
>
> > It seemed more complicated.  If we use the POSIX API, we've got to
> > have code to find a non-colliding name for the shm, and we've got to
> > arrange to clean it up at process exit.  Anonymous shm doesn't require
> > a name and goes away automatically when it's no longer in use.
>
> I see.  Those are pretty good reasons ...
After talking to Magnus a bit this morning regarding this, it sounds
like what we're doing on Windows is closer to Anonymous shm, except that
they use an intentionally specific name, which also allows them to
detect if any children are still alive by using a "create-if-not-exists"
approach on the shm segment and failing if it still exists.  There were
some corner cases around restarts due to it taking a few seconds for the
Windows kernel to pick up on the fact that all the children are dead and
that the shm segment should go away, but they were able to work around
that, and failure to start is surely much better than possible
corruption.

What this all boils down to is- can you have a shm segment that goes
away when no one is still attached to it, but actually give it a name
and then detect if it already exists atomically on startup on
Linux/Unixes?  If so, perhaps we could use the same mechanism on both..

        Thanks,

                Stephen

signature.asc (205 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Posix Shared Mem patch

Stephen Frost
In reply to this post by Tom Lane-2
* Tom Lane ([hidden email]) wrote:
> Right, but does it provide honest protection against starting two
> postmasters in the same data directory?  Or more to the point,
> does it prevent starting a new postmaster when the old postmaster
> crashed but there are still orphaned backends making changes?
> AFAIR we basically punted on those problems for the Windows port,
> for lack of an equivalent to nattch.

See my other mail, but, after talking to Magnus, it's my understanding
that we had that problem initially, but it was later solved by using a
named shared memory segment which the kernel will clean up when all
children are gone.  That, combined with a 'create-if-exists' call,
allows detection of lost children to be done.

        Thanks,

                Stephen

signature.asc (205 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Posix Shared Mem patch

Magnus Hagander-2
In reply to this post by Tom Lane-2
On Wed, Jun 27, 2012 at 3:40 PM, Tom Lane <[hidden email]> wrote:

> Magnus Hagander <[hidden email]> writes:
>> On Wed, Jun 27, 2012 at 3:50 AM, Tom Lane <[hidden email]> wrote:
>>> I wonder whether this design can be adapted to Windows?  IIRC we do
>>> not have a bulletproof data directory lock scheme for Windows.
>>> It seems like this makes few enough demands on the lock mechanism
>>> that there ought to be suitable primitives available there too.
>
>> I assume you're saying we need to make changes in the internal API,
>> right? Because we alreayd have a windows native implementation of
>> shared memory that AFAIK works,
>
> Right, but does it provide honest protection against starting two
> postmasters in the same data directory?  Or more to the point,
> does it prevent starting a new postmaster when the old postmaster
> crashed but there are still orphaned backends making changes?
> AFAIR we basically punted on those problems for the Windows port,
> for lack of an equivalent to nattch.

No, we spent a lot of time trying to *fix* it, and IIRC we did.

We create a shared memory segment with a fixed name based on the data
directory. This shared memory segment is inherited by all children. It
will automatically go away only when all processes that have an open
handle to it go away (in fact, it can even take a second or two more,
if they go away by crash and not by cleanup - we have a workaround in
the code for that). But as long as there is an orphaned backend
around, the shared memory segment stays around.

We don't have "nattch". But we do have "nattch>0". Or something like that.

You can work around it if you find two different paths to the same
data directory (e.g .using junctions), but you are really actively
trying to break the system if you do that...


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Posix Shared Mem patch

Tom Lane-2
Magnus Hagander <[hidden email]> writes:
> On Wed, Jun 27, 2012 at 3:40 PM, Tom Lane <[hidden email]> wrote:
>> AFAIR we basically punted on those problems for the Windows port,
>> for lack of an equivalent to nattch.

> No, we spent a lot of time trying to *fix* it, and IIRC we did.

OK, in that case this isn't as interesting as I thought.

If we do go over to a file-locking-based solution on Unix, it might be
worthwhile changing to something similar on Windows.  But it would be
more about reducing coding differences between the platforms than
plugging any real holes.

                        regards, tom lane

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Posix Shared Mem patch

Robert Haas
In reply to this post by Tom Lane-2
On Wed, Jun 27, 2012 at 9:44 AM, Tom Lane <[hidden email]> wrote:

> Robert Haas <[hidden email]> writes:
>> On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane <[hidden email]> wrote:
>>> Would Posix shmem help with that at all?  Why did you choose not to
>>> use the Posix API, anyway?
>
>> It seemed more complicated.  If we use the POSIX API, we've got to
>> have code to find a non-colliding name for the shm, and we've got to
>> arrange to clean it up at process exit.  Anonymous shm doesn't require
>> a name and goes away automatically when it's no longer in use.
>
> I see.  Those are pretty good reasons ...
>
>> With respect to EXEC_BACKEND, I wasn't proposing to kill it, just to
>> make it continue to use a full-sized sysv shm.
>
> Well, if the ultimate objective is to get out from under the SysV APIs
> entirely, we're not going to get there if we still have to have all that
> code for the EXEC_BACKEND case.  Maybe it's time to decide that we don't
> need to support EXEC_BACKEND on Unix.

I don't personally see a need to do anything that drastic at this
point.  Admittedly, I rarely compile with EXEC_BACKEND, but I don't
think it's bad to have the option available.  Adjusting shared memory
limits isn't really a big problem for PostgreSQL developers; what
we're trying to avoid is the need for PostgreSQL *users* to concern
themselves with it.  And surely anyone who is using EXEC_BACKEND on
Unix is a developer, not a user.

If and when we come up with a substitute for the nattch interlock,
then this might be worth thinking a bit harder about.  At that point,
if we still want to support EXEC_BACKEND on Unix, then we'd need the
EXEC_BACKEND case at least to use POSIX shm rather than anonymous
shared mmap.  Personally I think that would be not that hard and
probably worth doing, but there doesn't seem to be any point in
writing that code now, because for the simple case of just reducing
the amount of shm that we allocate, an anonymous mapping seems better
all around.

We shouldn't overthink this.  Our shared memory code has allocated a
bunch of crufty hacks over the years to work around various
platform-specific issues, but it's still not a lot of code, so I don't
see any reason to worry unduly about making a surgical fix without
having a master plan.  Nothing we want to do down the road will
require moving the earth.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Posix Shared Mem patch

Robert Haas
In reply to this post by Stephen Frost
On Wed, Jun 27, 2012 at 9:52 AM, Stephen Frost <[hidden email]> wrote:
> What this all boils down to is- can you have a shm segment that goes
> away when no one is still attached to it, but actually give it a name
> and then detect if it already exists atomically on startup on
> Linux/Unixes?  If so, perhaps we could use the same mechanism on both..

As I understand it, no.  You can either have anonymous shared
mappings, which go away when no longer in use but do not have a name.
Or you can have POSIX or sysv shm, which have a name but do not
automatically go away when no longer in use.  There seems to be no
method for setting up a segment that both has a name and goes away
automatically.  POSIX shm in particular tries to "look like a file",
whereas anonymous memory tries to look more like malloc (except that
you can share the mapping with child processes).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Posix Shared Mem patch

A.M.
In reply to this post by Robert Haas

On Jun 27, 2012, at 7:34 AM, Robert Haas wrote:

> On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane <[hidden email]> wrote:
>> Robert Haas <[hidden email]> writes:
>>> So, here's a patch.  Instead of using POSIX shmem, I just took the
>>> expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS
>>> memory.  The sysv shm is still allocated, but it's just a copy of
>>> PGShmemHeader; the "real" shared memory is the anonymous block.  This
>>> won't work if EXEC_BACKEND is defined so it just falls back on
>>> straight sysv shm in that case.
>>
>> Um.  I hadn't thought about the EXEC_BACKEND interaction, but that seems
>> like a bit of a showstopper.  I would not like to give up the ability
>> to debug EXEC_BACKEND mode on Unixen.
>>
>> Would Posix shmem help with that at all?  Why did you choose not to
>> use the Posix API, anyway?
>
> It seemed more complicated.  If we use the POSIX API, we've got to
> have code to find a non-colliding name for the shm, and we've got to
> arrange to clean it up at process exit.  Anonymous shm doesn't require
> a name and goes away automatically when it's no longer in use.
>
> With respect to EXEC_BACKEND, I wasn't proposing to kill it, just to
> make it continue to use a full-sized sysv shm.
>

I solved this by unlinking the posix shared memory segment immediately after creation. The file descriptor to the shared memory is inherited, so, by definition, only the postmaster children can access the memory. This ensures that shared memory cleanup is immediate after the postmaster and all children close, as well. The fcntl locking is not required to protect the posix shared memory- it can protect itself.

Cheers,
M




--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Posix Shared Mem patch

Robert Haas
In reply to this post by Tom Lane-2
On Wed, Jun 27, 2012 at 9:44 AM, Tom Lane <[hidden email]> wrote:

> Robert Haas <[hidden email]> writes:
>> On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane <[hidden email]> wrote:
>>> Would Posix shmem help with that at all?  Why did you choose not to
>>> use the Posix API, anyway?
>
>> It seemed more complicated.  If we use the POSIX API, we've got to
>> have code to find a non-colliding name for the shm, and we've got to
>> arrange to clean it up at process exit.  Anonymous shm doesn't require
>> a name and goes away automatically when it's no longer in use.
>
> I see.  Those are pretty good reasons ...

So, should we do it this way?

I did a little research and discovered that Linux 2.3.51 (released
3/11/2000) apparently returns EINVAL for MAP_SHARED|MAP_ANONYMOUS.
That combination is documented to work beginning in Linux 2.4.0.  How
worried should we be about people trying to run PostgreSQL 9.3 on
pre-2.4 kernels?  If we want to worry about it, we could try mapping a
one-page shared MAP_SHARED|MAP_ANONYMOUS segment first.  If that
works, we could assume that we have a working MAP_SHARED|MAP_ANONYMOUS
facility and try to allocate the whole segment plus a minimal sysv
shm.  If the single page allocation fails with EINVAL, we could fall
back to allocating the entire segment as sysv shm.

A related question is - if we do this - should we enable it only on
ports where we've verified that it works, or should we just turn it on
everywhere and fix breakage if/when it's reported?  I lean toward the
latter.

If we find that there are platforms where (a) mmap is not supported or
(b) MAP_SHARED|MAP_ANON works but has the wrong semantics, we could
either shut off this optimization on those platforms by fiat, or we
could test not only that the call succeeds, but that it works
properly: create a one-page mapping and fork a child process; in the
child, write to the mapping and exit; in the parent, wait for the
child to exit and then test that we can read back the correct
contents.  This would protect against a hypothetical system where the
flags are accepted but fail to produce the correct behavior.  I'm
inclined to think this is over-engineering in the absence of evidence
that there are platforms that work this way.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Posix Shared Mem patch

Magnus Hagander-2
On Thu, Jun 28, 2012 at 7:00 AM, Robert Haas <[hidden email]> wrote:

> On Wed, Jun 27, 2012 at 9:44 AM, Tom Lane <[hidden email]> wrote:
>> Robert Haas <[hidden email]> writes:
>>> On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane <[hidden email]> wrote:
>>>> Would Posix shmem help with that at all?  Why did you choose not to
>>>> use the Posix API, anyway?
>>
>>> It seemed more complicated.  If we use the POSIX API, we've got to
>>> have code to find a non-colliding name for the shm, and we've got to
>>> arrange to clean it up at process exit.  Anonymous shm doesn't require
>>> a name and goes away automatically when it's no longer in use.
>>
>> I see.  Those are pretty good reasons ...
>
> So, should we do it this way?
>
> I did a little research and discovered that Linux 2.3.51 (released
> 3/11/2000) apparently returns EINVAL for MAP_SHARED|MAP_ANONYMOUS.
> That combination is documented to work beginning in Linux 2.4.0.  How
> worried should we be about people trying to run PostgreSQL 9.3 on
> pre-2.4 kernels?  If we want to worry about it, we could try mapping a
> one-page shared MAP_SHARED|MAP_ANONYMOUS segment first.  If that
> works, we could assume that we have a working MAP_SHARED|MAP_ANONYMOUS
> facility and try to allocate the whole segment plus a minimal sysv
> shm.  If the single page allocation fails with EINVAL, we could fall
> back to allocating the entire segment as sysv shm.

Do we really need a runtime check for that? Isn't a configure check
enough? If they *do* deploy postgresql 9.3 on something that old,
they're building from source anyway...


> A related question is - if we do this - should we enable it only on
> ports where we've verified that it works, or should we just turn it on
> everywhere and fix breakage if/when it's reported?  I lean toward the
> latter.

Depends on the amount of expected breakage, but I'd lean towards teh
later as well.


> If we find that there are platforms where (a) mmap is not supported or
> (b) MAP_SHARED|MAP_ANON works but has the wrong semantics, we could
> either shut off this optimization on those platforms by fiat, or we
> could test not only that the call succeeds, but that it works
> properly: create a one-page mapping and fork a child process; in the
> child, write to the mapping and exit; in the parent, wait for the
> child to exit and then test that we can read back the correct
> contents.  This would protect against a hypothetical system where the
> flags are accepted but fail to produce the correct behavior.  I'm
> inclined to think this is over-engineering in the absence of evidence
> that there are platforms that work this way.

Could we actually turn *that* into a configure test, or will that be
too complex?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Posix Shared Mem patch

Robert Haas
On Thu, Jun 28, 2012 at 7:05 AM, Magnus Hagander <[hidden email]> wrote:
> Do we really need a runtime check for that? Isn't a configure check
> enough? If they *do* deploy postgresql 9.3 on something that old,
> they're building from source anyway...
[...]
>
> Could we actually turn *that* into a configure test, or will that be
> too complex?

I don't see why we *couldn't* make either of those things into a
configure test, but it seems more complicated than a runtime test and
less accurate, so I guess I'd be in favor of doing them at runtime or
not at all.

Actually, the try-a-one-page-mapping-and-see-if-you-get-EINVAL test is
so simple that I really can't see any reason not to insert that
defense.  The fork-and-check-whether-it-really-works test is probably
excess paranoia until we determine whether that's really a danger
anywhere.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
12345
Loading...