|
After the various recent discussions on list, I present what I believe
to be a working patch implementing 16-bit checksums on all buffer pages.

page_checksums = on | off (default)

There are no required block changes; checksums are optional and some blocks may have a checksum, others not. This means that the patch will allow pg_upgrade. That capability also limits us to 16-bit checksums.

Fletcher's 16 is used in this patch and seems rather quick, though that is easily replaceable/tuneable if desired, perhaps even as a parameter enum. This patch is a step on the way to 32-bit checksums in a future redesign of the page layout, though that is not a required future change, nor does this prevent that.

Checksum is set whenever the buffer is flushed to disk, and checked when the page is read in from disk. It is not set at other times, and for much of the time may not be accurate. This follows earlier discussions from 2010-12-22, and is discussed in detail in patch comments.

Note it works with buffer manager pages, which includes shared and local data buffers, but not SLRU pages (yet? an easy addition but needs other discussion around contention).

Note that all this does is detect bit errors on the page; it doesn't identify where the error is, how bad it is, and definitely not what caused it or when it happened.

The main body of the patch involves changes to bufpage.c/.h so this differs completely from the VMware patch, for technical reasons. Also included are facilities to LockBufferForHints() with usage in various AMs, to avoid the case where hints are set during calculation of the checksum.

In my view this is a fully working, committable patch but I'm not in a hurry to do so given the holiday season.

Hopefully it's a gift not a turkey, and therefore a challenge for some to prove that wrong.

Enjoy either way, Merry Christmas,

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
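For readers who have not met Fletcher-16 before, a minimal sketch of the textbook algorithm follows. It is not taken from the patch: the 8192-byte page size, the page contents and the variable names are assumptions made for illustration only. It shows the property described above, namely that a flipped bit is detected but not located.

```python
PAGE_SIZE = 8192  # assumed page size for the illustration


def fletcher16(data: bytes) -> int:
    """Return the Fletcher-16 checksum of data (two running sums modulo 255)."""
    sum1 = sum2 = 0
    for byte in data:
        sum1 = (sum1 + byte) % 255
        sum2 = (sum2 + sum1) % 255
    return (sum2 << 8) | sum1


# A made-up page image; the contents are arbitrary.
page = bytearray(i % 251 for i in range(PAGE_SIZE))
checksum_at_flush = fletcher16(page)

# Simulate a single bit flipped somewhere between flush and read.
damaged = bytearray(page)
damaged[4000] ^= 0x04

# The mismatch is detected on read, but the checksum alone says nothing
# about which byte changed, or why.
page_is_valid = fletcher16(damaged) == checksum_at_flush
```

Because both running sums are reduced modulo 255, a byte of 0x00 and a byte of 0xFF contribute identically, which is one known weakness of this particular checksum.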
|
Simon Riggs <[hidden email]> writes:
> After the various recent discussions on list, I present what I believe
> to be a working patch implementing 16-bit checksums on all buffer
> pages.

I think locking around hint-bit-setting is likely to be unworkable from a performance standpoint. I also wonder whether it might not result in deadlocks.

Also, as far as I can see this patch usurps the page version field, which I find unacceptably short-sighted. Do you really think this is the last page layout change we'll ever make?

regards, tom lane
|
On Saturday, December 24, 2011 03:46:16 PM Tom Lane wrote:
> Simon Riggs <[hidden email]> writes:
> > After the various recent discussions on list, I present what I believe
> > to be a working patch implementing 16-bit checksums on all buffer
> > pages.
>
> I think locking around hint-bit-setting is likely to be unworkable from
> a performance standpoint. I also wonder whether it might not result in
> deadlocks.

Why don't you use the same tricks as the former patch and copy the buffer, compute the checksum on that, and then write out that copy (you can even do both at the same time)? I have a hard time believing that the additional copy is more expensive than the locking.

Andres
|
In reply to this post by Tom Lane-2
On Sat, Dec 24, 2011 at 2:46 PM, Tom Lane <[hidden email]> wrote:
> Simon Riggs <[hidden email]> writes:
>> After the various recent discussions on list, I present what I believe
>> to be a working patch implementing 16-bit checksums on all buffer
>> pages.
>
> I think locking around hint-bit-setting is likely to be unworkable from
> a performance standpoint.

Anyone choosing page_checksums = on has already made a performance-reducing decision in favour of reliability. So they understand and accept the impact. There is no locking when the parameter is off.

A safe alternative is to use LockBuffer, which has a much greater performance impact.

I did think about optimistically checking after the write, but if we crash at that point we will then see a block that has an invalid checksum. It's faster but you may get a checksum failure if you crash - but then one important aspect of this is to spot problems in case of a crash, so that seems unacceptable.

> I also wonder whether it might not result in
> deadlocks.

If you can see how, please say. I can't see any ways for that myself.

> Also, as far as I can see this patch usurps the page version field,
> which I find unacceptably short-sighted. Do you really think this is
> the last page layout change we'll ever make?

No, I don't. I hope and expect the next page layout change to reintroduce such a field. But since we're agreed now that upgrading is important, changing page format isn't likely to be happening until we get an online upgrade process. So future changes are much less likely. If they do happen, we have some flag bits spare that can be used to indicate later versions. It's not the prettiest thing in the world, but it's a small ugliness in return for an important feature. If there was a way without that, I would have chosen it.

pg_filedump will need to be changed more than normal, but the version isn't used anywhere else in the server code.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
|
In reply to this post by Andres Freund
On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund <[hidden email]> wrote:
> On Saturday, December 24, 2011 03:46:16 PM Tom Lane wrote:
>> Simon Riggs <[hidden email]> writes:
>> > After the various recent discussions on list, I present what I believe
>> > to be a working patch implementing 16-bit checksums on all buffer
>> > pages.
>>
>> I think locking around hint-bit-setting is likely to be unworkable from
>> a performance standpoint. I also wonder whether it might not result in
>> deadlocks.
>
> Why don't you use the same tricks as the former patch and copy the buffer,
> compute the checksum on that, and then write out that copy (you can even do
> both at the same time). I have a hard time believing that the additional copy
> is more expensive than the locking.

We would copy every time we write, yet lock only every time we set hint bits.

If that option is favoured, I'll write another version after Christmas.

ISTM we can't write and copy at the same time because the checksum is not a trailer field.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
|
In reply to this post by Simon Riggs
On Sat, Dec 24, 2011 at 3:51 PM, Aidan Van Dyk <[hidden email]> wrote:
> Not an expert here, but after reading through the patch quickly, I
> don't see anything that changes the torn-page problem though, right?
>
> Hint bits aren't wal-logged, and FPW isn't forced on the hint-bit-only
> dirty, right?

Checksums merely detect a problem, whereas FPWs correct a problem if it happens, but only in crash situations.

So this does nothing to remove the need for FPWs, though checksum detection could be used for double write buffers also.

Checksums work even when there is no crash, so if your disk goes bad and corrupts data then you'll know about it as soon as it happens.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
|
In reply to this post by Simon Riggs
On Saturday, December 24, 2011 05:01:02 PM Simon Riggs wrote:
> On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund <[hidden email]> wrote:
> > On Saturday, December 24, 2011 03:46:16 PM Tom Lane wrote:
> >> Simon Riggs <[hidden email]> writes:
> >> > After the various recent discussions on list, I present what I believe
> >> > to be a working patch implementing 16-bit checksums on all buffer
> >> > pages.
> >>
> >> I think locking around hint-bit-setting is likely to be unworkable from
> >> a performance standpoint. I also wonder whether it might not result in
> >> deadlocks.
> >
> > Why don't you use the same tricks as the former patch and copy the
> > buffer, compute the checksum on that, and then write out that copy (you
> > can even do both at the same time). I have a hard time believing that
> > the additional copy is more expensive than the locking.
>
> We would copy every time we write, yet lock only every time we set hint
> bits.

[...] cached workload where most writeout happens due to checkpoints.

> If that option is favoured, I'll write another version after Christmas.

Seems less complicated (wrt deadlocking et al) to me. But I haven't read your patch, so I will shut up now ;)

Andres
|
In reply to this post by Simon Riggs
On Sat, Dec 24, 2011 at 4:06 PM, Simon Riggs <[hidden email]> wrote:
> Checksums merely detect a problem, whereas FPWs correct a problem if
> it happens, but only in crash situations.
>
> So this does nothing to remove the need for FPWs, though checksum
> detection could be used for double write buffers also.

This is missing the point. If you have a torn page on a page that is only dirty due to hint bits then the checksum will show a spurious checksum failure. It will "detect" a problem that isn't there.

The problem is that there is no WAL indicating the hint bit change. And if the torn page includes the new checksum but not the new hint bit or vice versa it will be a checksum mismatch.

The strategy discussed in the past was moving all the hint bits to a common area and skipping them in the checksum. No amount of double writing or buffering or locking will avoid this problem.

--
greg
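The torn-page scenario described above can be simulated in a few lines. This is a toy model under stated assumptions, not PostgreSQL code: the checksum is placed in the first two bytes of the page, the device is assumed to write 512-byte sectors atomically, and `HINT_OFFSET` is a hypothetical byte that holds a hint bit in a later sector.

```python
import struct

PAGE_SIZE = 8192
SECTOR_SIZE = 512   # assumed unit that the device writes atomically
HINT_OFFSET = 6000  # hypothetical byte holding a hint bit, in a later sector


def fletcher16(data: bytes) -> int:
    """Fletcher-16: two running sums modulo 255."""
    sum1 = sum2 = 0
    for byte in data:
        sum1 = (sum1 + byte) % 255
        sum2 = (sum2 + sum1) % 255
    return (sum2 << 8) | sum1


def seal(page: bytearray) -> None:
    """Store the checksum of bytes 2.. in the first two bytes (toy layout)."""
    struct.pack_into("<H", page, 0, fletcher16(page[2:]))


def is_valid(page: bytes) -> bool:
    (stored,) = struct.unpack_from("<H", page, 0)
    return stored == fletcher16(page[2:])


# The image already on disk, with a correct checksum.
old_image = bytearray(PAGE_SIZE)
seal(old_image)

# The same page after a hint-bit-only change, checksummed again at flush.
new_image = bytearray(old_image)
new_image[HINT_OFFSET] |= 0x01
seal(new_image)

# Crash mid-write: only the first sector (holding the new checksum) lands.
torn_image = new_image[:SECTOR_SIZE] + old_image[SECTOR_SIZE:]
```

Here `old_image` and `new_image` both verify, while `torn_image` does not, even though the only difference from the old image that matters to the database is a hint bit that was never WAL-logged.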
|
On Sat, Dec 24, 2011 at 8:06 PM, Greg Stark <[hidden email]> wrote:
> On Sat, Dec 24, 2011 at 4:06 PM, Simon Riggs <[hidden email]> wrote:
>> Checksums merely detect a problem, whereas FPWs correct a problem if
>> it happens, but only in crash situations.
>>
>> So this does nothing to remove the need for FPWs, though checksum
>> detection could be used for double write buffers also.
>
> This is missing the point. If you have a torn page on a page that is
> only dirty due to hint bits then the checksum will show a spurious
> checksum failure. It will "detect" a problem that isn't there.

It will detect a problem that *is* there, but one you are classifying as a non-problem because it is a correctable or acceptable bit error. Given that acceptable bit errors on hints cover no more than 1% of a block, the great likelihood is that the bit error is unacceptable in any case, so false-positive page errors are in fact very rare.

Any bit error is an indicator of problems on the external device, so many would regard any bit error as unacceptable.

> The problem is that there is no WAL indicating the hint bit change.
> And if the torn page includes the new checksum but not the new hint
> bit or vice versa it will be a checksum mismatch.
>
> The strategy discussed in the past was moving all the hint bits to a
> common area and skipping them in the checksum. No amount of double
> writing or buffering or locking will avoid this problem.

I completely agree we should do this, but we are unable to do it now, so this patch is a stop-gap and provides a much requested feature *now*.

In the future, we will be able to tell the difference between an acceptable and an unacceptable bit error. Right now, all we have is the ability to detect a bit error and, as I point out above, that is 99% of the problem solved, at least.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
|
In reply to this post by Simon Riggs
> Simon Riggs wrote:
> On Sat, Dec 24, 2011 at 8:06 PM, Greg Stark wrote:
>> The problem is that there is no WAL indicating the hint bit
>> change. And if the torn page includes the new checksum but not the
>> new hint bit or vice versa it will be a checksum mismatch.

With *just* this patch, true. An OS crash or hardware failure could sometimes create an invalid page.

>> The strategy discussed in the past was moving all the hint bits to
>> a common area and skipping them in the checksum. No amount of
>> double writing or buffering or locking will avoid this problem.

I don't believe that. Double-writing is a technique to avoid torn pages, but it requires a checksum to work. This chicken-and-egg problem requires the checksum to be implemented first.

> I completely agree we should do this, but we are unable to do it
> now, so this patch is a stop-gap and provides a much requested
> feature *now*.

Yes, for people who trust their environment to prevent torn pages, or who are willing to tolerate one bad page per OS crash in return for quick reporting of data corruption from unreliable file systems, this is a good feature even without double-writes.

> In the future, we will be able to tell the difference between an
> acceptable and an unacceptable bit error.

A double-write patch would provide that, and it sounds like VMware has a working patch for that which is being polished for submission. It would need to wait until we have some consensus on the checksum patch before it can be finalized. I'll try to review the patch from this thread today, to do what I can to move that along.

-Kevin
|
In reply to this post by Simon Riggs
On Sat, Dec 24, 2011 at 04:01:02PM +0000, Simon Riggs wrote:
> On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund <[hidden email]> wrote:
> > Why don't you use the same tricks as the former patch and copy the buffer,
> > compute the checksum on that, and then write out that copy (you can even do
> > both at the same time). I have a hard time believing that the additional copy
> > is more expensive than the locking.
>
> ISTM we can't write and copy at the same time because the checksum is
> not a trailer field.

Of course you can. If the checksum is in the trailer field you get the nice property that the whole block has a constant checksum. However, if you store the checksum elsewhere you just need to change the checking algorithm to copy the checksum out, zero those bytes, run the checksum and compare with the extracted checksum.

Not pretty, but I don't think it makes a difference in performance.

Have a nice day,
--
Martijn van Oosterhout <[hidden email]> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
   -- Arthur Schopenhauer
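The "copy the checksum out, zero those bytes, run the checksum and compare" procedure can be sketched as follows. The field position `CHECKSUM_OFFSET`, the little-endian encoding and the helper names are all assumptions for illustration; nothing here is taken from the patch or from bufpage.c.

```python
import struct

PAGE_SIZE = 8192
CHECKSUM_OFFSET = 18  # hypothetical position of a 2-byte field in the header


def fletcher16(data: bytes) -> int:
    """Fletcher-16: two running sums modulo 255."""
    sum1 = sum2 = 0
    for byte in data:
        sum1 = (sum1 + byte) % 255
        sum2 = (sum2 + sum1) % 255
    return (sum2 << 8) | sum1


def checksum_with_field_zeroed(page: bytes) -> int:
    """Checksum a scratch copy of the page whose checksum field is zeroed."""
    scratch = bytearray(page)
    scratch[CHECKSUM_OFFSET:CHECKSUM_OFFSET + 2] = b"\x00\x00"
    return fletcher16(scratch)


def set_checksum(page: bytearray) -> None:
    """Compute the checksum and store it in the (non-trailer) header field."""
    struct.pack_into("<H", page, CHECKSUM_OFFSET, checksum_with_field_zeroed(page))


def verify_checksum(page: bytes) -> bool:
    """Extract the stored value, then recompute with the field zeroed."""
    (stored,) = struct.unpack_from("<H", page, CHECKSUM_OFFSET)
    return stored == checksum_with_field_zeroed(page)
```

The sketch zeroes the field on a scratch copy for clarity; a real implementation could equally skip those two bytes while summing, which avoids the extra copy.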
|
In reply to this post by Simon Riggs
On Sun, Dec 25, 2011 at 5:08 AM, Simon Riggs <[hidden email]> wrote:
> On Sat, Dec 24, 2011 at 8:06 PM, Greg Stark <[hidden email]> wrote:
>> On Sat, Dec 24, 2011 at 4:06 PM, Simon Riggs <[hidden email]> wrote:
>>> Checksums merely detect a problem, whereas FPWs correct a problem if
>>> it happens, but only in crash situations.
>>>
>>> So this does nothing to remove the need for FPWs, though checksum
>>> detection could be used for double write buffers also.
>>
>> This is missing the point. If you have a torn page on a page that is
>> only dirty due to hint bits then the checksum will show a spurious
>> checksum failure. It will "detect" a problem that isn't there.
>
> It will detect a problem that *is* there, but one you are classifying
> it as a non-problem because it is a correctable or acceptable bit
> error.

I don't agree with this. We don't WAL-log hint bit changes precisely because it's OK if they make it to disk and it's OK if they don't. Given that, I don't see how we can say that writing out only half of a page that has had hint bit changes is a problem. It's not.

(And if it is, then we ought to WAL-log all such changes regardless of whether CRCs are in use.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
|
In reply to this post by Kevin Grittner
On 25.12.2011 15:01, Kevin Grittner wrote:
> I don't believe that. Double-writing is a technique to avoid torn
> pages, but it requires a checksum to work. This chicken-and-egg
> problem requires the checksum to be implemented first.

I don't think double-writes require checksums on the data pages themselves, just on the copies in the double-write buffers. In the double-write buffer, you'll need some extra information per-page anyway, like a relfilenode and block number that indicates which page it is in the buffer.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
|
On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas <[hidden email]> wrote:
> On 25.12.2011 15:01, Kevin Grittner wrote:
>>
>> I don't believe that. Double-writing is a technique to avoid torn
>> pages, but it requires a checksum to work. This chicken-and-egg
>> problem requires the checksum to be implemented first.
>
> I don't think double-writes require checksums on the data pages themselves,
> just on the copies in the double-write buffers. In the double-write buffer,
> you'll need some extra information per-page anyway, like a relfilenode and
> block number that indicates which page it is in the buffer.

How would you know when to look in the double write buffer?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
|
In reply to this post by Kevin Grittner
On Sun, Dec 25, 2011 at 1:01 PM, Kevin Grittner <[hidden email]> wrote:
> This chicken-and-egg
> problem requires the checksum to be implemented first.

v2 of checksum patch, using a conditional copy if checksumming is enabled, so locking is removed.

Thanks to Andres for thwacking me with the cluestick, though I have used a simple copy rather than a copy & calc.

Tested using make installcheck with parameter on/off, then restart and vacuumdb to validate all pages.

Reviews, objections, user interface tweaks all welcome.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
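To make the copy-versus-lock trade-off concrete, here is a deterministic toy model of why checksumming a private copy removes the need to lock out hint-bit setters. It is only a sketch under assumptions: the page layout (checksum in the first two bytes), `HINT_OFFSET`, and the function names are hypothetical, and the concurrent backend is simulated by a callback rather than a real thread.

```python
import struct

PAGE_SIZE = 8192
HINT_OFFSET = 6000  # hypothetical byte holding a hint bit


def fletcher16(data: bytes) -> int:
    """Fletcher-16: two running sums modulo 255."""
    sum1 = sum2 = 0
    for byte in data:
        sum1 = (sum1 + byte) % 255
        sum2 = (sum2 + sum1) % 255
    return (sum2 << 8) | sum1


def seal(page: bytearray) -> None:
    """Store the checksum of bytes 2.. in the first two bytes (toy layout)."""
    struct.pack_into("<H", page, 0, fletcher16(page[2:]))


def is_valid(page: bytes) -> bool:
    (stored,) = struct.unpack_from("<H", page, 0)
    return stored == fletcher16(page[2:])


def set_hint(shared: bytearray) -> None:
    """Stand-in for another backend setting a hint bit without a lock."""
    shared[HINT_OFFSET] |= 0x01


def flush_in_place(shared: bytearray, concurrent_change) -> bytes:
    """Checksum the shared buffer itself; a late hint bit invalidates it."""
    seal(shared)
    concurrent_change(shared)  # happens between checksum and write
    return bytes(shared)       # bytes that would reach the disk


def flush_with_copy(shared: bytearray, concurrent_change) -> bytes:
    """Checksum and write a private copy; the shared buffer may keep changing."""
    private = bytearray(shared)  # snapshot taken before checksumming
    seal(private)
    concurrent_change(shared)    # only the shared buffer is affected
    return bytes(private)


written_in_place = flush_in_place(bytearray(PAGE_SIZE), set_hint)
written_from_copy = flush_with_copy(bytearray(PAGE_SIZE), set_hint)
```

In this model `written_from_copy` verifies and `written_in_place` does not. The hint bit set during the flush is simply absent from the copied image, which is acceptable because hint bits can be set again later.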
|
In reply to this post by Simon Riggs
On 28.12.2011 01:39, Simon Riggs wrote:
> On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas
> <[hidden email]> wrote:
>> On 25.12.2011 15:01, Kevin Grittner wrote:
>>>
>>> I don't believe that. Double-writing is a technique to avoid torn
>>> pages, but it requires a checksum to work. This chicken-and-egg
>>> problem requires the checksum to be implemented first.
>>
>> I don't think double-writes require checksums on the data pages themselves,
>> just on the copies in the double-write buffers. In the double-write buffer,
>> you'll need some extra information per-page anyway, like a relfilenode and
>> block number that indicates which page it is in the buffer.
>
> How would you know when to look in the double write buffer?

You scan the double-write buffer, and every page in the double write buffer that has a valid checksum, you copy to the main storage. There's no need to check validity of pages in the main storage.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
|
On Wed, Dec 28, 2011 at 7:42 AM, Heikki Linnakangas <[hidden email]> wrote:
>> How would you know when to look in the double write buffer?
>
> You scan the double-write buffer, and every page in the double write buffer
> that has a valid checksum, you copy to the main storage. There's no need to
> check validity of pages in the main storage.

OK, then we are talking at cross purposes. Double write buffers, in the way you explain them, allow us to remove full page writes. They clearly don't do anything to check page validity on read. Torn pages are not the only fault we wish to correct against... and the double writes idea is orthogonal to the idea of checksums.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
|
On 28.12.2011 11:22, Simon Riggs wrote:
> On Wed, Dec 28, 2011 at 7:42 AM, Heikki Linnakangas
> <[hidden email]> wrote:
>
>>> How would you know when to look in the double write buffer?
>>
>> You scan the double-write buffer, and every page in the double write buffer
>> that has a valid checksum, you copy to the main storage. There's no need to
>> check validity of pages in the main storage.
>
> OK, then we are talking at cross purposes. Double write buffers, in
> the way you explain them allow us to remove full page writes. They
> clearly don't do anything to check page validity on read. Torn pages
> are not the only fault we wish to correct against... and the double
> writes idea is orthogonal to the idea of checksums.

The reason we're talking about double write buffers in this thread is that double write buffers can be used to solve the problem with hint bits and checksums.

You're right, though, that it's academic whether double write buffers can be used without checksums on data pages, if the whole point of the exercise is to make it possible to have checksums on data pages.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
|
On Wed, Dec 28, 2011 at 5:45 PM, Heikki Linnakangas <[hidden email]> wrote:
> On 28.12.2011 11:22, Simon Riggs wrote:
>>
>> On Wed, Dec 28, 2011 at 7:42 AM, Heikki Linnakangas
>> <[hidden email]> wrote:
>>
>>>> How would you know when to look in the double write buffer?
>>>
>>> You scan the double-write buffer, and every page in the double write
>>> buffer that has a valid checksum, you copy to the main storage. There's
>>> no need to check validity of pages in the main storage.
>>
>> OK, then we are talking at cross purposes. Double write buffers, in
>> the way you explain them allow us to remove full page writes. They
>> clearly don't do anything to check page validity on read. Torn pages
>> are not the only fault we wish to correct against... and the double
>> writes idea is orthogonal to the idea of checksums.
>
> The reason we're talking about double write buffers in this thread is that
> double write buffers can be used to solve the problem with hint bits and
> checksums.

Torn pages are not the only problem we need to detect.

You said "You scan the double write buffer...". When exactly would you do that? Please explain how a double write buffer detects problems that do not occur as the result of a crash.

We don't have much time, so please be clear and lucid.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
|
In reply to this post by Simon Riggs
> Heikki Linnakangas wrote:
> On 28.12.2011 01:39, Simon Riggs wrote:
>> On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas wrote:
>>> On 25.12.2011 15:01, Kevin Grittner wrote:
>>>>
>>>> I don't believe that. Double-writing is a technique to avoid
>>>> torn pages, but it requires a checksum to work. This chicken-
>>>> and-egg problem requires the checksum to be implemented first.
>>>
>>> I don't think double-writes require checksums on the data pages
>>> themselves, just on the copies in the double-write buffers. In
>>> the double-write buffer, you'll need some extra information per-
>>> page anyway, like a relfilenode and block number that indicates
>>> which page it is in the buffer.

You are clearly right -- if there is no checksum in the page itself, you can put one in the double-write metadata. I've never seen that discussed before, but I'm embarrassed that it never occurred to me.

>> How would you know when to look in the double write buffer?
>
> You scan the double-write buffer, and every page in the double
> write buffer that has a valid checksum, you copy to the main
> storage. There's no need to check validity of pages in the main
> storage.

Right. I'll recap my understanding of double-write (from memory -- if there's a material error or omission, I hope someone will correct me).

The write-ups I've seen on double-write techniques have all the writes go to the double-write buffer (a single, sequential file that stays around). This is done as sequential writing to a file which is overwritten pretty frequently, making the writes to a controller very fast, and a BBU write-back cache unlikely to actually write to disk very often. On good server-quality hardware, it should be blasting RAM-to-RAM very efficiently. The file is fsync'd (like I said, hopefully to BBU cache), then each page in the double-write buffer is written to the normal page location, and that is fsync'd. Once that is done, the database writes have no risk of being torn, and the double-write buffer is marked as empty. This all happens at the point when you would be writing the page to the database, after the WAL-logging.

On crash recovery you read through the double-write buffer from the start and write the pages which look good (including a good checksum) to the database before replaying WAL. If you find a checksum error in processing the double-write buffer, you assume that you never got as far as the fsync of the double-write buffer, which means you never started writing the buffer contents to the database, which means there can't be any torn pages there. If you get to the end and fsync, you can be sure any torn pages from a previous attempt to write to the database itself have been overwritten with the good copy in the double-write buffer. Either way, you move on to WAL processing.

You wind up with a database free of torn pages before you apply WAL. full_page_writes to the WAL are not needed as long as double-write is used for any pages which would have been written to the WAL. If checksums were written to the double-write buffer metadata instead of adding them to the page itself, this could be implemented alone. It would probably allow a modest speed improvement over using full_page_writes and would eliminate those full-page images from the WAL files, making them smaller.

If we do add a checksum to the page header, that could be used for testing for torn pages in the double-write buffer without needing a redundant calculation for double-write. With no torn pages in the actual database, checksum failures there would never be false positives. To get this right for a checksum in the page header, double-write would need to be used for all cases where full_page_writes now are used (i.e., the first write of a page after a checkpoint), and for all unlogged writes (e.g., hint-bit-only writes). There would be no correctness problem for always using double-write, but it would be unnecessary overhead for other page writes, which I think we can avoid.

-Kevin
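The crash-recovery step in the recap above can be sketched as a toy model. Everything here is an assumption for illustration: storage is a plain dictionary keyed by block number, the checksum lives in the double-write metadata (the variant suggested earlier in the thread), and the names `DoubleWriteEntry` and `recover` are hypothetical. Real code would also have to fsync and then replay WAL, which is left out.

```python
from dataclasses import dataclass
from typing import Dict, List

PAGE_SIZE = 8192


@dataclass
class DoubleWriteEntry:
    block_no: int  # identifies the page's home location
    page: bytes    # full page image
    checksum: int  # kept in the double-write metadata, not in the page


def fletcher16(data: bytes) -> int:
    """Fletcher-16: two running sums modulo 255."""
    sum1 = sum2 = 0
    for byte in data:
        sum1 = (sum1 + byte) % 255
        sum2 = (sum2 + sum1) % 255
    return (sum2 << 8) | sum1


def recover(entries: List[DoubleWriteEntry], main_storage: Dict[int, bytes]) -> int:
    """Copy every intact double-write entry over its home page; return the count."""
    restored = 0
    for entry in entries:
        if fletcher16(entry.page) != entry.checksum:
            # The entry itself is damaged, so the double-write fsync never
            # completed and the home page was not yet being overwritten.
            continue
        main_storage[entry.block_no] = entry.page
        restored += 1
    return restored


# Block 7 was torn in main storage; its double-write copy is intact.
good_page = bytes([1]) * PAGE_SIZE
main_storage = {
    7: good_page[:4096] + bytes(4096),
    8: bytes([2]) * PAGE_SIZE,
}
entries = [
    DoubleWriteEntry(7, good_page, fletcher16(good_page)),
    DoubleWriteEntry(8, bytes([3]) * PAGE_SIZE, 0xFFFF),  # damaged entry
]
restored_count = recover(entries, main_storage)
```

After `recover` runs, block 7 holds the good image again and block 8 is left untouched, matching the "either way, you move on to WAL processing" reasoning above.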
