
streaming replication does not work across datacenter with 20ms latency?

Yan Chunlu
I have been running PostgreSQL streaming replication, which worked fine
while both machines were in the same datacenter. Recently I tried to
deploy a new slave in a different datacenter; the latency between the
master and the slave is 20ms.
The relevant configuration is below. Both master and slave have these
settings:
hot_standby = on
wal_level = hot_standby
max_wal_senders = 5

checkpoint_segments = 64
wal_keep_segments = 128

I am using pgpool to automate the setup, but the method is similar to
the one described here:
http://wiki.postgresql.org/wiki/Streaming_Replication

The data directory is about 30GB. I have tried many times, but every
time, after the sync has finished and the slave has been started,
PostgreSQL just hangs with the error messages attached below, and any
attempt to connect returns the error "psql: FATAL:  the database system
is starting up".

The strange part is that, with the same configuration, the other slaves
in the same datacenter work fine.

What do "invalid record length" and "invalid magic number" normally
mean? Is the xlog corrupted?
Thanks for any further help!
The log messages at debug5 level look like this (just excerpts; I can
upload the full log file if necessary):



17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ LOG:  database
system was interrupted; last known up at 2011-07-23 07:07:57 CDT
17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:  forked
new backend, pid=17998 socket=8
17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:  forked
new backend, pid=17999 socket=8
17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@postgres [local]FATAL:  the database system is starting up
17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@postgres [local]DEBUG:  shmem_exit(1): 0 callbacks to make
17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@postgres [local]DEBUG:  proc_exit(1): 1 callbacks to make
17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@postgres [local]DEBUG:  exit(1)
17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@postgres [local]DEBUG:  shmem_exit(-1): 0 callbacks to make
17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@postgres [local]DEBUG:  proc_exit(-1): 0 callbacks to make
17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:
reaping dead processes
17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:  server
process (PID 17999) exited with exit code 1
17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@template1 10.28.53.11(33647)FATAL:  the database system is
starting up
17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@template1 10.28.53.11(33647)DEBUG:  shmem_exit(1): 0
callbacks to make
17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@template1 10.28.53.11(33647)DEBUG:  proc_exit(1): 1 callbacks
to make
17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@template1 10.28.53.11(33647)DEBUG:  exit(1)
17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@template1 10.28.53.11(33647)DEBUG:  shmem_exit(-1): 0
callbacks to make
17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@template1 10.28.53.11(33647)DEBUG:  proc_exit(-1): 0
callbacks to make
17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:
reaping dead processes
17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:  server
process (PID 17998) exited with exit code 1
17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:
standby_mode = 'on'
17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:
primary_conninfo = 'host=jefferson port=5432 user=postgres'
17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:
trigger_file = '/var/log/pgpool/trigger/trigger_file1'
17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ LOG:  entering
standby mode
17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:  could
not open file "pg_xlog/0000000300000054000000DB" (log file 84, segment
219): No such file or directory


17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ DEBUG:  record
known xact 36933672 latestObservedXid 36933674
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ CONTEXT:  xlog
redo commit: 2011-07-23 06:41:41.264405-05
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ DEBUG:  remove
KnownAssignedXid 36933672
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ CONTEXT:  xlog
redo commit: 2011-07-23 06:41:41.264405-05
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ DEBUG:  record
known xact 36933674 latestObservedXid 36933674
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ CONTEXT:  xlog
redo insert: rel 1663/16386/17404; tid 18378/37
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ LOG:  invalid
record length at 54/DDFE4010

17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ DEBUG:  remove
KnownAssignedXid 36929085
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ CONTEXT:  xlog
redo commit: 2011-07-23 06:33:29.760915-05
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ DEBUG:  record
known xact 36929100 latestObservedXid 36929102
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ CONTEXT:  xlog
redo insert: rel 1663/16386/16436; tid 88370/2
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ DEBUG:  record
known xact 36929109 latestObservedXid 36929102
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ CONTEXT:  xlog
redo insert: rel 1663/16386/16436; tid 88370/3
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ LOG:  invalid
magic number 0000 in log file 84, segment 219, offset 7733248
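
Editorial aside: the log above refers to the same WAL position in three forms: a file name (0000000300000054000000DB), a "log file N, segment M" pair, and an xlog location (54/DDFE4010). The following is a minimal sketch of how they relate, assuming the default 16 MB segment size and the file-naming scheme that the log lines themselves suggest (timeline, log file and segment, each as 8 hex digits). The function names are illustrative and not part of PostgreSQL.

```python
# Assumption: default 16 MB WAL segments, and the 24-hex-digit file name
# layout seen in the log above (timeline + log file + segment).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_wal_file_name(name):
    """Split a 24-character WAL file name into (timeline, log, segment)."""
    if len(name) != 24:
        raise ValueError("expected 24 hex digits: %r" % name)
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)


def parse_location(location):
    """Split an xlog location such as '54/DDFE4010' into
    (log, segment, byte offset inside that segment)."""
    log_hex, offset_hex = location.split("/")
    segment, offset = divmod(int(offset_hex, 16), WAL_SEGMENT_SIZE)
    return int(log_hex, 16), segment, offset


if __name__ == "__main__":
    print(parse_wal_file_name("0000000300000054000000DB"))
    print(parse_location("54/DDFE4010"))
```

Decoding the file name this way gives log file 84 and segment 219, which agrees with the "(log file 84, segment 219)" text printed next to it in the log.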


--

--
Sent via pgsql-general mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Re: streaming replication does not work across datacenter with 20ms latency?

Scott Ribe-2
On Jul 23, 2011, at 6:50 AM, Yan Chunlu wrote:

> what does invalid record length and invalid magic number  normally
> means? xlog  corrupted?
> Thanks for any further help!

It means your build settings for pg are not compatible across the 2 machines. For instance, one machine is 32-bit and the other is 64-bit, or one machine is big-endian and the other is little-endian...

--
Scott Ribe
[hidden email]
http://www.elevated-dev.com/
(303) 722-0567 voice





Re: streaming replication does not work across datacenter with 20ms latency?

Yan Chunlu
Thanks for the help!
Are there any other possible reasons?

Both systems are running Debian amd64, as uname -a shows:
Linux washington 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010
x86_64 GNU/Linux

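The following program reports that both of them are little-endian:
#include <stdio.h>
#include <stdbool.h>

/* Returns true on a big-endian machine. The lowest-addressed byte of
   the integer 1 is 1 on a little-endian machine and 0 on a big-endian
   one. */
bool isBigEndian(void)
{
    int no = 1;
    char *chk = (char *)&no;

    return chk[0] != 1;
}

int main(void)
{
    /* Prints 1 for big-endian, 0 for little-endian. */
    printf("this is %d \n", (int)isBigEndian());
    return 0;
}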


--
闫春路

Re: streaming replication does not work across datacenter with 20ms latency?

Tomas Vondra
On 23 July 2011, 18:14, Yan Chunlu wrote:
> thanks for the help!
> are there any other possible reasons?
>
> both system are using Debian amd64, as uname -a shows:
> Linux washington 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010
> x86_64 GNU/Linux

It is not just about the architecture; it means that PostgreSQL was
configured differently during the build on the two machines. For example,
a different block size or WAL block size would cause such problems.

Or one of the builds might be 32-bit for some reason (you can run a
32-bit system in a 64-bit environment). You can run this

$ less postgres | grep Class

to check this (ELF32 => 32bit, ELF64 => 64bit).

Did you use the same binary packages or have you built the server yourself?

Tomas


Re: streaming replication does not work across datacenter with 20ms latency?

Joshua D. Drake
On 07/23/2011 10:55 AM, Tomas Vondra wrote:

> Did you use the same binary packages or have you built the server yourself?

Run pg_config on both machines and compare the output.
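
Editorial aside: comparing two pg_config outputs by eye is error-prone, so here is a minimal sketch of an automated comparison. It assumes the output has been saved as text from each machine and relies only on the "NAME = value" line format that pg_config prints. The function names and the sample values in the usage note are made up for illustration.

```python
def parse_pg_config(text):
    """Turn pg_config output ("NAME = value" per line) into a dict."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if sep:  # skip lines that are not NAME = value
            settings[name.strip()] = value.strip()
    return settings


def diff_pg_config(master_text, slave_text):
    """Return {name: (master_value, slave_value)} for every setting that
    differs or is present on only one side."""
    master = parse_pg_config(master_text)
    slave = parse_pg_config(slave_text)
    return {
        name: (master.get(name), slave.get(name))
        for name in sorted(set(master) | set(slave))
        if master.get(name) != slave.get(name)
    }
```

For example, diff_pg_config("VERSION = A\nBINDIR = /x", "VERSION = B\nBINDIR = /x") reports only the VERSION entry; an empty result means the two builds were configured identically as far as pg_config can tell.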



--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
The PostgreSQL Conference - http://www.postgresqlconference.org/
@cmdpromptinc - @postgresconf - 509-416-6579

Re: streaming replication does not work across datacenter with 20ms latency?

Scott Marlowe-2
In reply to this post by Tomas Vondra
On Sat, Jul 23, 2011 at 11:55 AM, Tomas Vondra <[hidden email]> wrote:

> It is not just about the architecture, it means the PostgreSQL was
> configured somehow differently during the build. E.g. a different block
> size or WAL block size would make such problems.

Different date formats too.

Re: streaming replication does not work across datacenter with 20ms latency?

Yan Chunlu
I installed PostgreSQL with apt-get on both machines, and pg_config
shows that the builds are exactly the same:


running on master:
https://gist.github.com/1102148


running on slave:
https://gist.github.com/1102151

Re: streaming replication does not work across datacenter with 20ms latency?

Yan Chunlu
In reply to this post by Tomas Vondra
less postgres didn't show anything useful, because the file is a
binary, so I tried grep with -a:

 less postgres | grep -a ELF

 less postgres | grep -a Class

Neither printed anything related to (ELF32 => 32bit, ELF64 => 64bit).
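
Editorial aside: the 32-bit/64-bit question can also be answered by reading the first bytes of the binary directly, without relying on how less renders it. The sketch below is based on the ELF identification bytes: the file starts with the 4-byte magic 0x7F 'E' 'L' 'F', byte 4 is the class (1 = 32-bit, 2 = 64-bit) and byte 5 is the data encoding (1 = little-endian, 2 = big-endian). The function names are illustrative, and the path to the postgres binary is whatever you pass on the command line.

```python
import sys


def elf_class(header):
    """Describe an ELF file from its first 6 bytes."""
    if len(header) < 6 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    bits = {1: "ELF32", 2: "ELF64"}.get(header[4], "unknown class")
    order = {1: "little-endian", 2: "big-endian"}.get(header[5], "unknown byte order")
    return "%s, %s" % (bits, order)


def elf_class_of_file(path):
    """Read only the identification bytes of the file at path."""
    with open(path, "rb") as f:
        return elf_class(f.read(6))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python elfclass.py /path/to/postgres
    print(elf_class_of_file(sys.argv[1]))
```

Running it against the postgres binary on both machines should give the same answer if the two builds match.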





Re: streaming replication does not work across datacenter with 20ms latency?

Adrian Klaver-3
In reply to this post by Yan Chunlu
On Saturday, July 23, 2011 7:43:56 pm Yan Chunlu wrote:

> I used apt-get to install postgresql, running pg_config showing they
> are exactly the same...
>
>
> running on master:
> https://gist.github.com/1102148
>
>
> running on slave:
> https://gist.github.com/1102151
>

Are you sure there is only one instance of Postgres running on each machine?
--
Adrian Klaver
[hidden email]

Re: streaming replication does not work across datacenter with 20ms latency?

Yan Chunlu
In reply to this post by Scott Marlowe-2
Do you mean the system date formats? They look the same:

master:
#date
Sat Jul 23 21:53:34 CDT 2011

slave:
#date
Sat Jul 23 21:52:56 CDT 2011



--
闫春路

Re: streaming replication does not work across datacenter with 20ms latency?

Scott Marlowe-2
On Sat, Jul 23, 2011 at 8:53 PM, Yan Chunlu <[hidden email]> wrote:
> the system date formats? looks the same:

hehe, no, the internal formats.  There's a floating point and an
integer method.  They have to be the same and according to your output
of pg_config they are, with this config flag listed for both:

--enable-integer-datetimes

btw, integer is preferred over floating point for date types, at least
as far as I know.

Re: streaming replication does not work across datacenter with 20ms latency?

Scott Ribe-2
In reply to this post by Yan Chunlu
On Jul 23, 2011, at 8:43 PM, Yan Chunlu wrote:

> I used apt-get to install postgresql, running pg_config showing they
> are exactly the same...

BTW, forgot to mention this in my first message: I run streaming replication across the country with latency well over 100ms and no problems.

--
Scott Ribe
[hidden email]
http://www.elevated-dev.com/
(303) 722-0567 voice





Re: streaming replication does not work across datacenter with 20ms latency?

Yan Chunlu
thanks for all the help!

@Adrian:  yes, only one instance on each machine

Now the slave has finally started and accepts connections, but
replication didn't begin; it just logs the following errors:
https://gist.github.com/1102225





BTW: is it possible that rsync had finished but the data had not yet
been flushed to disk, so that PostgreSQL saw corrupted files when it
started?


Re: streaming replication does not work across datacenter with 20ms latency?

Yan Chunlu
In reply to this post by Scott Ribe-2
Is there anything special you have configured on the master and slave?

Could I see the related configuration on your master and slave, such
as wal_keep_segments, checkpoint_segments, or any other settings that
might be related? Thanks a lot!

--
闫春路

Re: streaming replication does not work across datacenter with 20ms latency?

Tomas Vondra
In reply to this post by Yan Chunlu
On 24 July 2011, 6:09, Yan Chunlu wrote:
> thanks for all the help!
>
> @Adrian:  yes, only one instance on each machine
>
> not the slave finally started and could be connect, replication didn't
> begin, just following errors:
> https://gist.github.com/1102225

These errors just mean that the master has already removed WAL segments
the slave needs, so the slave can't start replication because there would
be a gap. This usually happens when there is enough write activity
(inserts, updates) while the slave is being set up.

What is your wal_keep_segments value? Increase it or set up WAL
archiving, so that the slave can get the data.

Tomas


Re: streaming replication does not work across datacenter with 20ms latency?

Yan Chunlu
checkpoint_segments = 64
wal_keep_segments = 128

pgpool HA not working

Sanjay Rao
In reply to this post by Tomas Vondra
Hi

I am working with pgpool, PostgreSQL, and Linux for the first time; all three are new to me.

I am running pgpool-HA using pacemaker-corosync.
I am getting the following errors in my setup. Does anybody have any idea about the warnings (the WARN lines) in the logs below? Any help would be much appreciated.

Jul 24 18:40:41 squarepant attrd: [3772]: info: attrd_perform_update: Sent update 1010: probe_complete=true
Jul 24 18:40:41 squarepant cib: [3770]: WARN: cib_process_request: Operation complete: op cib_modify for section status (origin=local/attrd/1010, version=0.52.352): The object/attribute does not exist (rc=-22)
Jul 24 18:40:41 squarepant attrd: [3772]: WARN: attrd_cib_callback: Update 1010 for probe_complete=true failed: The object/attribute does not exist
Jul 24 18:40:41 squarepant pgpool[7635]: INFO: pgpoolRA: request stop, but not running.
Jul 24 18:40:41 squarepant crmd: [3896]: info: process_lrm_event: LRM operation pgpool_stop_0 (call=339, rc=0, cib-update=472, confirmed=true) ok
Jul 24 18:40:41 squarepant crmd: [3896]: info: match_graph_event: Action pgpool_stop_0 (2) confirmed on squarepant (rc=0)
Jul 24 18:40:41 squarepant crmd: [3896]: info: te_rsc_command: Initiating action 12: start pgpool_start_0 on squarepant (local)
Jul 24 18:40:41 squarepant crmd: [3896]: info: do_lrm_rsc_op: Performing key=12:112:0:6d68acf3-ab99-409f-b686-8533e4b24ca0 op=pgpool_start_0 )
Jul 24 18:40:41 squarepant lrmd: [3771]: info: rsc:pgpool:340: start
Jul 24 18:40:41 squarepant crmd: [3896]: info: te_pseudo_action: Pseudo action 5 fired and confirmed


The output of the crm_mon command is:


============
Last updated: Sun Jul 24 18:15:59 2011
Stack: Heartbeat
Current DC: squarepant (85f4f3d6-650e-4620-8cbb-edb1bc9d389c) - partition with quorum
Version: 1.1.5-1.fc14-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
1 Nodes configured, unknown expected votes
3 Resources configured.
============

Online: [ squarepant ]

 DBIP   (ocf::heartbeat:IPaddr2):       Started squarepant
 postgresql     (ocf::heartbeat:pgsql): Started squarepant
 pgpool (ocf::heartbeat:pgpool):        Started squarepant FAILED

Failed actions:
    pgpool_monitor_30000 (node=squarepant, call=1874, rc=7, status=complete): not running


Thanks & Regards
Sanjay
Re: streaming replication does not work across datacenter with 20ms latency?

Yan Chunlu
In reply to this post by Yan Chunlu
I did the SR procedure again, still no luck.

Is it normal that, after starting PostgreSQL on the slave, the first line of the log is:
database system was interrupted; last known up at 2011-07-24 10:53:38 CDT?



4760 2011-07-24 10:55:58 CDT 2011-07-24 10:55:58 CDT @ LOG:  database
system was interrupted; last known up at 2011-07-24 10:53:38 CDT
4760 2011-07-24 10:55:58 CDT 2011-07-24 10:55:58 CDT @ LOG:  entering
standby mode
4762 2011-07-24 10:55:59 CDT 2011-07-24 10:55:59 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4761 2011-07-24 10:55:59 CDT 2011-07-24 10:55:59 CDT @ LOG:  streaming
replication successfully connected to primary
4764 2011-07-24 10:55:59 CDT 2011-07-24 10:55:59 CDT postgres@postgres
10.28.53.11(53442)FATAL:  the database system is starting up
4770 2011-07-24 10:56:00 CDT 2011-07-24 10:56:00 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4802 2011-07-24 10:56:01 CDT 2011-07-24 10:56:01 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4760 2011-07-24 10:56:01 CDT 2011-07-24 10:56:01 CDT @ LOG:  redo
starts at 57/6B002028
4760 2011-07-24 10:56:01 CDT 2011-07-24 10:56:01 CDT @ LOG:  invalid
record length at 57/6B20E010
4761 2011-07-24 10:56:01 CDT 2011-07-24 10:56:01 CDT @ FATAL:
terminating walreceiver process due to administrator command
4760 2011-07-24 10:56:01 CDT 2011-07-24 10:56:01 CDT @ LOG:  invalid
magic number 0000 in log file 87, segment 107, offset 2490368
4847 2011-07-24 10:56:02 CDT 2011-07-24 10:56:02 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4850 2011-07-24 10:56:02 CDT 2011-07-24 10:56:02 CDT postgres@postgres
10.28.53.11(53443)FATAL:  the database system is starting up
4851 2011-07-24 10:56:03 CDT 2011-07-24 10:56:03 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4860 2011-07-24 10:56:04 CDT 2011-07-24 10:56:04 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4865 2011-07-24 10:56:05 CDT 2011-07-24 10:56:05 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4859 2011-07-24 10:56:05 CDT 2011-07-24 10:56:05 CDT @ LOG:  streaming
replication successfully connected to primary
4874 2011-07-24 10:56:06 CDT 2011-07-24 10:56:06 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4869 2011-07-24 10:56:06 CDT 2011-07-24 10:56:06 CDT
postgres@template1 10.28.53.11(53444)FATAL:  the database system is
starting up
4879 2011-07-24 10:56:07 CDT 2011-07-24 10:56:07 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4760 2011-07-24 10:56:07 CDT 2011-07-24 10:56:07 CDT @ LOG:  invalid
record length at 57/6B2BA010
4859 2011-07-24 10:56:07 CDT 2011-07-24 10:56:07 CDT @ FATAL:
terminating walreceiver process due to administrator command
4760 2011-07-24 10:56:07 CDT 2011-07-24 10:56:07 CDT @ LOG:  invalid
magic number 0000 in log file 87, segment 107, offset 2883584
4887 2011-07-24 10:56:08 CDT 2011-07-24 10:56:08 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4888 2011-07-24 10:56:08 CDT 2011-07-24 10:56:08 CDT @ LOG:  streaming
replication successfully connected to primary
4892 2011-07-24 10:56:09 CDT 2011-07-24 10:56:09 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4896 2011-07-24 10:56:09 CDT 2011-07-24 10:56:09 CDT
postgres@template1 10.28.53.11(53445)FATAL:  the database system is
starting up
4901 2011-07-24 10:56:10 CDT 2011-07-24 10:56:10 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4906 2011-07-24 10:56:11 CDT 2011-07-24 10:56:11 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4760 2011-07-24 10:56:11 CDT 2011-07-24 10:56:11 CDT @ LOG:  invalid
record length at 57/6B486010
4888 2011-07-24 10:56:11 CDT 2011-07-24 10:56:11 CDT @ FATAL:
terminating walreceiver process due to administrator command
4760 2011-07-24 10:56:11 CDT 2011-07-24 10:56:11 CDT @ LOG:  invalid
magic number 0000 in log file 87, segment 107, offset 4849664
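
Editorial aside: "invalid magic number 0000" says that the two-byte magic value read at the start of a WAL page was zero. A minimal sketch for locating the first such page in a copy of a segment file follows. It assumes the default 8 kB WAL page size and that the magic number occupies the first two bytes of every page header; both are assumptions about the build and should be checked before relying on the result. The function names are illustrative.

```python
WAL_PAGE_SIZE = 8192  # assumed default WAL block size


def first_zero_magic_offset(segment_bytes, page_size=WAL_PAGE_SIZE):
    """Return the offset of the first page whose first two bytes are
    zero, or None if every page starts with a non-zero value."""
    for offset in range(0, len(segment_bytes), page_size):
        if segment_bytes[offset:offset + 2] == b"\x00\x00":
            return offset
    return None


def scan_segment_file(path):
    """Apply the check to a WAL segment file on disk."""
    with open(path, "rb") as f:
        return first_zero_magic_offset(f.read())
```

The offsets reported in the logs in this thread (7733248, 2490368, 2883584 and 4849664) are all multiples of 8192, which is consistent with the check being made at page boundaries. Running the scan on the same segment on the master and on the slave would show whether the slave's copy simply ends earlier.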



Re: streaming replication does not work across datacenter with 20ms latency?

Tomas Vondra
In reply to this post by Yan Chunlu
On 24.7.2011 14:46, Yan Chunlu wrote:
> checkpoint_segments = 64
> wal_keep_segments = 128

This information alone is not sufficient: we don't know how much write
activity there is on the primary system, so we can't say whether those
numbers are sufficient. You have to tune them according to the write
activity on the primary server.

For example, let's suppose the current WAL segment on the primary is "1"
and that it's configured with wal_keep_segments = 5 (i.e. about 80MB of
data).

Before you prepare and start the slave machine, someone writes 100MB of
data to the primary database (one big insert/update or a lot of small
ones, it doesn't matter). 100MB is about 6 WAL segments, so the current
WAL segment on the primary is 7, and because of wal_keep_segments only
segments 3,4,5,6,7 are available.

But when the slave connects, it asks for segment no. 2, which is no
longer available. It's not possible to skip that segment, so the
replication fails to start.

If the primary had received only 60MB of data, it would probably have
worked (there would be enough segments kept on the primary).

Those 128 segments are about 2GB of data. How much data is written on
the primary between creating the filesystem copy and starting the slave?

You don't need to keep the files on the master; you can set up archiving
and keep them somewhere else (on a different system etc.).
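
Editorial aside: the arithmetic in this example can be written down as a small sketch. It models only the simplified picture described above (16 MB segments, the primary keeping exactly the newest wal_keep_segments segments); a real server decides what to remove at checkpoints, so treat the result as an estimate. The function names are illustrative.

```python
SEGMENT_MB = 16  # default WAL segment size


def kept_segments(start_segment, written_mb, wal_keep_segments):
    """Return (oldest, newest) segment numbers still on the primary after
    written_mb of WAL has been generated since start_segment."""
    newest = start_segment + written_mb // SEGMENT_MB
    oldest = newest - wal_keep_segments + 1
    return oldest, newest


def standby_can_start(needed_segment, start_segment, written_mb, wal_keep_segments):
    """True if the segment the slave asks for is still available."""
    oldest, newest = kept_segments(start_segment, written_mb, wal_keep_segments)
    return oldest <= needed_segment <= newest


if __name__ == "__main__":
    print(kept_segments(1, 100, 5))         # the 100MB case
    print(standby_can_start(2, 1, 100, 5))  # slave asks for segment 2
    print(standby_can_start(2, 1, 60, 5))   # the 60MB case
    print(128 * SEGMENT_MB, "MB kept with wal_keep_segments = 128")
```

With 100MB written the model keeps segments 3 to 7, so a request for segment 2 fails; with 60MB written segment 2 is still present, matching the two cases in the text.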

Tomas


Re: streaming replication does not work across datacenter with 20ms latency?

Yan Chunlu
checkpoint_segments = 64
wal_keep_segments = 128
These settings seem to allow for about 5GB, and I think there is no way
I would ever write 5GB of data while the rsync is in progress.


I think the problem is still the "invalid record length" and "invalid
magic number" errors, which start appearing right after I finish syncing
the data and start the slave. If I stop the slave and restart it later
then, yes, it may report that an xlog file is not found and fail to
catch up with the master. But why do the "invalid" errors appear in the
first place?

