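Like most relational databases, PostgreSQL relies on REDO (the write-ahead log, WAL) for reliability.
Someone in a chat group asked why PostgreSQL's REDO block size defaults to 8 KB rather than 512 bytes.
The reasoning behind the question: most block devices have 512-byte sectors, and a 512-byte write can be atomic, whereas a REDO block larger than 512 bytes may suffer a partial write.
So is an 8 KB REDO (WAL) block size in PostgreSQL safe? This article works through the question.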
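1. With a volatile cache enabled, if the block being written is larger than the disk's atomic write size (usually 512 bytes), a power failure can cause a partial write.
Take the disk cache as an example: it has no power-loss protection, and the operating system's fsync interface may not be aware of it, so even when fsync returns success the data may still be sitting in the disk cache.
On power loss the data in the disk cache is lost. If a program writes an 8 KB block and the disk's atomic write size is smaller than 8 KB, some parts of the 8 KB may have been written and others not: that is a partial write.
(PS: some enterprise SSDs can use the residual charge in their capacitors to persist the contents of the disk cache, but do not assume every disk can do this.)
2. With a volatile cache enabled, a partial write can also occur on power loss even when the block being written is no larger than the disk's atomic write size (an "atomic write").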
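To make the first case concrete, here is a minimal Python sketch of a torn 8 KB block. It assumes a 512-byte atomic unit and, as a simplification, that sectors reach stable media in order; a real device may persist them in any order. The function names are made up for this illustration.

```python
SECTOR = 512  # assumed atomic write unit of the device


def torn_write(old: bytes, new: bytes, sectors_persisted: int, sector: int = SECTOR) -> bytes:
    """On-disk image if power fails after only the first
    `sectors_persisted` sectors of `new` reached stable media."""
    if len(old) != len(new) or len(new) % sector != 0:
        raise ValueError("blocks must have equal, sector-aligned sizes")
    cut = min(sectors_persisted * sector, len(new))
    return new[:cut] + old[cut:]


def is_partial(image: bytes, old: bytes, new: bytes) -> bool:
    """A block is torn when it matches neither the old nor the new version."""
    return image != old and image != new


old_block = b"A" * 8192
new_block = b"B" * 8192
image = torn_write(old_block, new_block, sectors_persisted=5)
print(is_partial(image, old_block, new_block))  # True: 5 new sectors followed by 11 old ones
```

A block that is entirely old or entirely new is harmless; the problem is only the mixed state in between.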
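In MySQL, REDO is written in 512-byte blocks, each containing 12 bytes of header and 4 bytes of checksum.
How should this be understood, and why can a partial write occur when the write is not aligned?
1. As noted above, if the write is not aligned and a volatile cache is enabled, the atomic write does not help; a partial write can still occur.
2. A misaligned write also causes write amplification: a 512-byte write turns into 1024 bytes on disk (both sectors are read, merged with the new data, and written back as two sectors).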
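The write amplification in point 2 is simple arithmetic. The sketch below (hypothetical helper names, 512-byte sectors assumed) counts how many sectors a write overlaps and therefore how many bytes the device has to rewrite.

```python
def sectors_touched(offset: int, length: int, sector: int = 512) -> int:
    """Number of device sectors overlapped by a write of `length` bytes at `offset`."""
    if length <= 0:
        return 0
    first = offset // sector
    last = (offset + length - 1) // sector
    return last - first + 1


def bytes_rewritten(offset: int, length: int, sector: int = 512) -> int:
    """Bytes the device rewrites, since it can only write whole sectors."""
    return sectors_touched(offset, length, sector) * sector


print(bytes_rewritten(0, 512))    # 512: aligned, one sector
print(bytes_rewritten(100, 512))  # 1024: misaligned, spans two sectors
```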
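1. With a volatile cache enabled, an atomic write can still lose the data held in that cache.
2. When the write is not aligned, an atomic write is not truly atomic.
A database that relies solely on atomic REDO writes, without addressing these two factors, cannot guarantee data reliability and consistency.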
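1. Before a dirty page in shared buffers is written out, the corresponding REDO must already be persisted (that is, it has reached non-volatile storage).
2. When a page is modified for the first time after a checkpoint, the full page image must be written to REDO.
These two rules guarantee the consistency of the data files.
3. Leaving standbys aside, a transaction using synchronous commit must wait at commit time until the REDO it generated has been persisted (reached non-volatile storage) before returning.
Reference:
《PostgreSQL 9.6 同步多副本 与 remote_apply事务同步级别》 (PostgreSQL 9.6 multiple synchronous standbys and the remote_apply synchronization level)
4. A transaction using asynchronous commit does not wait at commit time for its REDO to be persisted.
Because of rule 1, even with asynchronous commit, losing the contents of the REDO buffer does not produce inconsistency (for example, a transaction half committed and half not); only the transactions whose REDO was still in the REDO buffer are lost.
Consistency is guaranteed by PostgreSQL's MVCC mechanism, so dirty data is never read.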
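Rules 1, 3 and 4 can be illustrated with a toy model. This is only a sketch: LSNs are plain integers, "persisted" just means a counter has advanced, and the class and method names are invented here rather than taken from PostgreSQL.

```python
class ToyWal:
    """Toy model of the ordering rules; no real I/O is performed."""

    def __init__(self) -> None:
        self.insert_lsn = 0   # end of REDO generated so far (still in the REDO buffer)
        self.flushed_lsn = 0  # end of REDO known to be on non-volatile storage

    def log(self, nbytes: int) -> int:
        """Append a record to the REDO buffer and return its end LSN."""
        self.insert_lsn += nbytes
        return self.insert_lsn

    def flush(self, upto: int) -> None:
        """Stand-in for write + fsync of the REDO up to `upto`."""
        self.flushed_lsn = max(self.flushed_lsn, min(upto, self.insert_lsn))

    def write_dirty_page(self, page_lsn: int) -> None:
        """Rule 1: the REDO covering a page must be durable before the page is written."""
        if page_lsn > self.flushed_lsn:
            self.flush(page_lsn)
        # ... the data page itself would be written here ...

    def commit(self, synchronous: bool) -> int:
        """Rules 3 and 4: only a synchronous commit waits for its REDO."""
        commit_lsn = self.log(46)
        if synchronous:
            self.flush(commit_lsn)
        return commit_lsn

    def survives_crash(self, commit_lsn: int) -> bool:
        return commit_lsn <= self.flushed_lsn


wal = ToyWal()
sync_lsn = wal.commit(synchronous=True)
async_lsn = wal.commit(synchronous=False)
print(wal.survives_crash(sync_lsn), wal.survives_crash(async_lsn))  # True False
```

The asynchronous commit is lost as a whole if the crash happens before its REDO is flushed, which is the "lost but not inconsistent" outcome described above.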
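1. On a copy-on-write file system (such as btrfs or zfs), full page writes (full_page_writes) can be turned off, because such file systems can guarantee that partial writes do not occur.
2. Alignment avoids the write amplification problem.
3. Do not use a volatile cache, unless it has power-loss protection.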
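PostgreSQL assumes that the fsync call provided by the system is reliable, that is, that the data has been written to persistent storage.
If even fsync cannot be trusted, nothing is reliable, atomic write or not.
The same goes for direct I/O (PostgreSQL supports direct I/O for REDO): it is not aware of the disk cache either, so use it with care.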
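For readers less familiar with the write-then-fsync pattern this section relies on, here is a small Python sketch using the standard `os` module. `durable_append` is a hypothetical helper, and as the text stresses, its guarantee is only as good as what fsync means on the underlying device.

```python
import os
import tempfile


def durable_append(path: str, data: bytes) -> None:
    """Append `data` and return only after fsync() has reported success."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        view = memoryview(data)
        while len(view) > 0:             # os.write may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                     # ask the OS to push the file to the device
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as scratch:
    log_path = os.path.join(scratch, "toy.log")  # scratch file, not a real WAL segment
    durable_append(log_path, b"commit record\n")
    print(os.path.getsize(log_path))  # 14
```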
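First, as analyzed above, atomic writes cannot protect against data loss caused by volatile storage.
1. PostgreSQL REDO is protected by checksums (each record carries a CRC), which guarantees integrity: inconsistent REDO is never applied.
2. At commit (with synchronous commit), PostgreSQL makes sure the REDO is persisted before returning to the user.
So a transaction that the user has been told is committed is persisted and cannot contain a partial write.
Only transactions that received no acknowledgment, or had not finished, may contain a partial write, which simplifies the question:
Can a partial write in the REDO of these unacknowledged or unfinished transactions cause data inconsistency?
The answer is no. As discussed earlier under how PostgreSQL guarantees database reliability, the MVCC mechanism takes care of this.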
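The idea that recovery stops at the first record that fails validation can be sketched as follows. The record layout here (length plus checksum header) is invented for the example, and `zlib.crc32` stands in for the CRC that PostgreSQL actually uses; it is not the real WAL format.

```python
import struct
import zlib

HEADER = struct.Struct("<II")  # payload length, CRC of payload (hypothetical layout)


def encode(payload: bytes) -> bytes:
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def replay(log: bytes) -> list:
    """Return the payloads of all records up to the first invalid one."""
    applied, pos = [], 0
    while pos + HEADER.size <= len(log):
        length, crc = HEADER.unpack_from(log, pos)
        start = pos + HEADER.size
        payload = log[start:start + length]
        if length == 0 or len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn or zeroed record: replay ends here
        applied.append(payload)
        pos = start + length
    return applied


log = encode(b"tx1 commit") + encode(b"tx2 commit") + encode(b"tx3 commit")
torn = log[:-4] + b"\x00" * 4  # the last record was only partially written
print(replay(torn))            # [b'tx1 commit', b'tx2 commit']
```

The torn third record is simply never applied, so its transaction looks as if it never committed.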
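Database parameter
wal_level = logical (makes observation and verification easier)
Generate test data
pgbench -i -s 100
Run a stress test
pgbench -M prepared -n -r -P 2 -c 32 -j 32 -T 1000
After about 200 seconds, once some XLOG has been generated, force-stop the database in the middle of the test; the next startup will enter recovery
pg_ctl stop -m immediate
Record the contents of the REDO files about to be tampered with, and the end of the file before them
The tail of the file preceding the tampered one is used to identify records that are already persisted
Seeing a few commit records is enough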
pg_xlogdump -b 0000000100000116000000F7 0000000100000116000000F7 | tail -n 20
rmgr: Heap2 len (rec/tot): 8/ 58, tx: 0, lsn: 116/F7FFFA80, prev 116/F7FFFA50, desc: CLEAN remxid 772066680
    blkref #0: rel 1663/13241/38254 fork main blk 90969
772208346 is persisted
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208346, lsn: 116/F7FFFAC0, prev 116/F7FFFA80, desc: COMMIT 2016-10-11 15:04:16.395000 CST
rmgr: Heap len (rec/tot): 3/ 79, tx: 772208353, lsn: 116/F7FFFAF0, prev 116/F7FFFAC0, desc: INSERT off 130
    blkref #0: rel 1663/13241/38242 fork main blk 17723
rmgr: Heap len (rec/tot): 14/ 163, tx: 772208368, lsn: 116/F7FFFB40, prev 116/F7FFFAF0, desc: HOT_UPDATE off 71 xmax 772208368 ; new off 76 xmax 0
    blkref #0: rel 1663/13241/38254 fork main blk 90969
rmgr: Heap2 len (rec/tot): 8/ 58, tx: 0, lsn: 116/F7FFFBE8, prev 116/F7FFFB40, desc: CLEAN remxid 772061924
    blkref #0: rel 1663/13241/38254 fork main blk 37123
rmgr: Heap len (rec/tot): 14/ 74, tx: 772208358, lsn: 116/F7FFFC28, prev 116/F7FFFBE8, desc: HOT_UPDATE off 22 xmax 772208358 ; new off 25 xmax 0
    blkref #0: rel 1663/13241/38251 fork main blk 34
rmgr: Heap len (rec/tot): 14/ 78, tx: 772208360, lsn: 116/F7FFFC78, prev 116/F7FFFC28, desc: HOT_UPDATE off 121 xmax 772208360 ; new off 123 xmax 0
    blkref #0: rel 1663/13241/38245 fork main blk 124
772208344 is persisted
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208344, lsn: 116/F7FFFCC8, prev 116/F7FFFC78, desc: COMMIT 2016-10-11 15:04:16.395018 CST
rmgr: Heap len (rec/tot): 14/ 163, tx: 772208369, lsn: 116/F7FFFCF8, prev 116/F7FFFCC8, desc: HOT_UPDATE off 67 xmax 772208369 ; new off 73 xmax 0
    blkref #0: rel 1663/13241/38254 fork main blk 37123
rmgr: Heap len (rec/tot): 14/ 78, tx: 772208355, lsn: 116/F7FFFDA0, prev 116/F7FFFCF8, desc: HOT_UPDATE off 97 xmax 772208355 ; new off 110 xmax 0
    blkref #0: rel 1663/13241/38245 fork main blk 988
772208351 and 772208352 are persisted
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208351, lsn: 116/F7FFFDF0, prev 116/F7FFFDA0, desc: COMMIT 2016-10-11 15:04:16.395031 CST
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208352, lsn: 116/F7FFFE20, prev 116/F7FFFDF0, desc: COMMIT 2016-10-11 15:04:16.395031 CST
rmgr: Heap len (rec/tot): 3/ 79, tx: 772208354, lsn: 116/F7FFFE50, prev 116/F7FFFE20, desc: INSERT off 117
    blkref #0: rel 1663/13241/38242 fork main blk 17727
rmgr: Heap len (rec/tot): 7/ 53, tx: 772208357, lsn: 116/F7FFFEA0, prev 116/F7FFFE50, desc: LOCK off 133: xid 772208357 LOCK_ONLY EXCL_LOCK
    blkref #0: rel 1663/13241/38251 fork main blk 42
Persisted
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208353, lsn: 116/F7FFFED8, prev 116/F7FFFEA0, desc: COMMIT 2016-10-11 15:04:16.395037 CST
rmgr: Heap len (rec/tot): 14/ 78, tx: 772208363, lsn: 116/F7FFFF08, prev 116/F7FFFED8, desc: HOT_UPDATE off 127 xmax 772208363 ; new off 186 xmax 0
    blkref #0: rel 1663/13241/38245 fork main blk 79
Persisted
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208345, lsn: 116/F7FFFF58, prev 116/F7FFFF08, desc: COMMIT 2016-10-11 15:04:16.395040 CST
rmgr: Heap len (rec/tot): 7/ 53, tx: 772208349, lsn: 116/F7FFFF88, prev 116/F7FFFF58, desc: LOCK off 154: xid 772208349 LOCK_ONLY EXCL_LOCK
    blkref #0: rel 1663/13241/38251 fork main blk 38
View the REDO of a single transaction
pg_xlogdump -x 772208351 0000000100000116000000F7 0000000100000116000000F7
rmgr: Heap len (rec/tot): 14/ 163, tx: 772208351, lsn: 116/F7FFD3B8, prev 116/F7FFD378, desc: HOT_UPDATE off 8 xmax 772208351 ; new off 73 xmax 0, blkref #0: rel 1663/13241/38254 blk 69436
rmgr: Heap len (rec/tot): 14/ 78, tx: 772208351, lsn: 116/F7FFE6A0, prev 116/F7FFE660, desc: HOT_UPDATE off 17 xmax 772208351 ; new off 40 xmax 0, blkref #0: rel 1663/13241/38245 blk 117
rmgr: Heap len (rec/tot): 14/ 74, tx: 772208351, lsn: 116/F7FFF048, prev 116/F7FFEFA0, desc: HOT_UPDATE off 165 xmax 772208351 ; new off 166 xmax 0, blkref #0: rel 1663/13241/38251 blk 35
rmgr: Heap len (rec/tot): 3/ 79, tx: 772208351, lsn: 116/F7FFF7D8, prev 116/F7FFF788, desc: INSERT off 66, blkref #0: rel 1663/13241/38242 blk 17736
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208351, lsn: 116/F7FFFDF0, prev 116/F7FFFDA0, desc: COMMIT 2016-10-11 15:04:16.395031 CST
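Picking the COMMIT records out of a long dump by eye is tedious. A small Python filter can do it; the sketch below assumes the line layout shown in the output above, and `committed_xids` is a name made up for this example.

```python
import re

# Matches lines such as:
# rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208351, lsn: 116/F7FFFDF0, prev ..., desc: COMMIT ...
COMMIT_RE = re.compile(
    r"rmgr: Transaction .*?tx:\s*(\d+), lsn: ([0-9A-F]+/[0-9A-F]+),.*desc: COMMIT"
)


def committed_xids(dump_text: str) -> dict:
    """Map xid -> LSN of its COMMIT record, in the order the records appear."""
    result = {}
    for line in dump_text.splitlines():
        match = COMMIT_RE.search(line)
        if match:
            result[int(match.group(1))] = match.group(2)
    return result


sample = """\
rmgr: Heap len (rec/tot): 3/ 79, tx: 772208351, lsn: 116/F7FFF7D8, prev 116/F7FFF788, desc: INSERT off 66, blkref #0: rel 1663/13241/38242 blk 17736
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208351, lsn: 116/F7FFFDF0, prev 116/F7FFFDA0, desc: COMMIT 2016-10-11 15:04:16.395031 CST
"""
print(committed_xids(sample))  # {772208351: '116/F7FFFDF0'}
```

The resulting xid list can be pasted directly into the `where xmin in (...)` queries used for verification.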
The head of the file that will be tampered with, used to identify records that are not persisted
Everything shown here is about to be overwritten; to PostgreSQL these are transactions that were never persisted, and they will not be visible after recovery.
pg_xlogdump -b -n 20 0000000100000116000000F8 0000000100000116000000F9
After tampering, transaction 772208342 will not be visible
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208342, lsn: 116/F8000038, prev 116/F7FFFFC0, desc: COMMIT 2016-10-11 15:04:16.395055 CST
rmgr: Heap len (rec/tot): 14/ 78, tx: 772208362, lsn: 116/F8000068, prev 116/F8000038, desc: HOT_UPDATE off 148 xmax 772208362 ; new off 154 xmax 0
    blkref #0: rel 1663/13241/38245 fork main blk 90
rmgr: Heap len (rec/tot): 14/ 78, tx: 772208365, lsn: 116/F80000B8, prev 116/F8000068, desc: HOT_UPDATE off 85 xmax 772208365 ; new off 89 xmax 0
    blkref #0: rel 1663/13241/38245 fork main blk 68
rmgr: Heap2 len (rec/tot): 8/ 58, tx: 0, lsn: 116/F8000108, prev 116/F80000B8, desc: CLEAN remxid 772208308
    blkref #0: rel 1663/13241/38254 fork main blk 146480
rmgr: Heap len (rec/tot): 14/ 74, tx: 772208349, lsn: 116/F8000148, prev 116/F8000108, desc: HOT_UPDATE off 154 xmax 772208349 ; new off 155 xmax 772208349
    blkref #0: rel 1663/13241/38251 fork main blk 38
rmgr: Heap len (rec/tot): 3/ 79, tx: 772208358, lsn: 116/F8000198, prev 116/F8000148, desc: INSERT off 101
    blkref #0: rel 1663/13241/38242 fork main blk 17730
rmgr: Heap len (rec/tot): 14/ 74, tx: 772208359, lsn: 116/F80001E8, prev 116/F8000198, desc: HOT_UPDATE off 78 xmax 772208359 ; new off 85 xmax 0
    blkref #0: rel 1663/13241/38251 fork main blk 31
rmgr: Heap len (rec/tot): 14/ 163, tx: 772208370, lsn: 116/F8000238, prev 116/F80001E8, desc: HOT_UPDATE off 25 xmax 772208370 ; new off 71 xmax 0
    blkref #0: rel 1663/13241/38254 fork main blk 146480
After tampering, transaction 772208354 will not be visible
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208354, lsn: 116/F80002E0, prev 116/F8000238, desc: COMMIT 2016-10-11 15:04:16.395071 CST
rmgr: Heap2 len (rec/tot): 8/ 58, tx: 0, lsn: 116/F8000310, prev 116/F80002E0, desc: CLEAN remxid 772112027
    blkref #0: rel 1663/13241/38254 fork main blk 121847
rmgr: Heap len (rec/tot): 14/ 74, tx: 772208355, lsn: 116/F8000350, prev 116/F8000310, desc: HOT_UPDATE off 82 xmax 772208355 ; new off 86 xmax 0
    blkref #0: rel 1663/13241/38251 fork main blk 31
rmgr: Heap len (rec/tot): 14/ 78, tx: 772208366, lsn: 116/F80003A0, prev 116/F8000350, desc: HOT_UPDATE off 73 xmax 772208366 ; new off 104 xmax 0
    blkref #0: rel 1663/13241/38245 fork main blk 86
rmgr: Heap2 len (rec/tot): 8/ 58, tx: 0, lsn: 116/F80003F0, prev 116/F80003A0, desc: CLEAN remxid 772176420
    blkref #0: rel 1663/13241/38254 fork main blk 162972
rmgr: Heap len (rec/tot): 14/ 74, tx: 772208363, lsn: 116/F8000430, prev 116/F80003F0, desc: HOT_UPDATE off 23 xmax 772208363 ; new off 26 xmax 0
    blkref #0: rel 1663/13241/38251 fork main blk 30
rmgr: Heap len (rec/tot): 14/ 74, tx: 772208360, lsn: 116/F8000480, prev 116/F8000430, desc: HOT_UPDATE off 164 xmax 772208360 ; new off 167 xmax 0
    blkref #0: rel 1663/13241/38251 fork main blk 35
rmgr: Heap len (rec/tot): 14/ 163, tx: 772208371, lsn: 116/F80004D0, prev 116/F8000480, desc: HOT_UPDATE off 2 xmax 772208371 ; new off 72 xmax 0
    blkref #0: rel 1663/13241/38254 fork main blk 121847
After tampering, transaction 772208358 will not be visible
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208358, lsn: 116/F8000578, prev 116/F80004D0, desc: COMMIT 2016-10-11 15:04:16.395090 CST
rmgr: Heap2 len (rec/tot): 8/ 58, tx: 0, lsn: 116/F80005A8, prev 116/F8000578, desc: CLEAN remxid 772172802
    blkref #0: rel 1663/13241/38254 fork main blk 120028
rmgr: Heap len (rec/tot): 14/ 163, tx: 772208372, lsn: 116/F80005E8, prev 116/F80005A8, desc: HOT_UPDATE off 57 xmax 772208372 ; new off 71 xmax 0
    blkref #0: rel 1663/13241/38254 fork main blk 162972
After tampering, transaction 772208350 will not be visible
rmgr: Transaction len (rec/tot): 20/ 46, tx: 772208350, lsn: 116/F8000690, prev 116/F80005E8, desc: COMMIT 2016-10-11 15:04:16.395095 CST
...
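Tamper with the REDO
cd $PGDATA/pg_xlog
Take the last two xlog files and overwrite part of each with zeros to simulate a partial write. The intent was to start 100 bytes into the file. Note, however, that dd's skip= applies to the input file, so the commands below write from offset 0 of the segment, and without conv=notrunc they also truncate it to 10000 bytes (seek=100 conv=notrunc would start at byte 100 and keep the rest of the file). This is why the recovery log below reports an invalid magic number at offset 0.
dd if=/dev/zero of=./0000000100000116000000F8 bs=1 count=10000 skip=100
dd if=/dev/zero of=./0000000100000116000000F9 bs=1 count=10000 skip=100
Start the database. It enters recovery, and when it reads REDO that fails validation it stops moving forward; recovery ends at that point.
Changes made by transactions that were not recovered are not visible to users.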
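The stated intent of the tampering step, zeroing a byte range that starts 100 bytes into a file while leaving everything else untouched, can also be sketched in Python. This is an illustration on a scratch file only; `zero_range` is a hypothetical helper, and it should never be pointed at the WAL of a database you care about.

```python
import os
import tempfile


def zero_range(path: str, offset: int, count: int) -> None:
    """Overwrite `count` bytes at `offset` with zeros, leaving the rest of the file intact."""
    with open(path, "r+b") as f:  # "r+b": update in place, no truncation
        f.seek(offset)
        f.write(b"\x00" * count)


with tempfile.TemporaryDirectory() as scratch:
    seg = os.path.join(scratch, "segment.copy")  # scratch file, not a real WAL segment
    with open(seg, "wb") as f:
        f.write(b"\xff" * 16384)
    zero_range(seg, offset=100, count=10000)
    with open(seg, "rb") as f:
        data = f.read()
    print(len(data), data[:100] == b"\xff" * 100, data[100:10100] == b"\x00" * 10000)
    # 16384 True True
```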
2016-10-11 15:10:49.909 CST,,,15039,,57fc9076.3abf,1,,2016-10-11 15:10:46 CST,,0,LOG,00000,"ending log output to stderr",,"Future log output will go to log destination ""csvlog"".",,,,,,"PostmasterMain, postmaster.c:1223",""
Because of the forced stop, the database enters recovery
2016-10-11 15:10:49.910 CST,,,15042,,57fc9079.3ac2,1,,2016-10-11 15:10:49 CST,,0,LOG,00000,"database system was interrupted; last known up at 2016-10-11 15:03:14 CST",,,,,,,,"StartupXLOG, xlog.c:5934",""
2016-10-11 15:10:49.991 CST,,,15042,,57fc9079.3ac2,2,,2016-10-11 15:10:49 CST,,0,LOG,00000,"database system was not properly shut down; automatic recovery in progress",,,,,,,,"StartupXLOG, xlog.c:6414",""
2016-10-11 15:10:49.992 CST,,,15042,,57fc9079.3ac2,3,,2016-10-11 15:10:49 CST,,0,LOG,00000,"redo starts at 116/9D8E4600",,,,,,,,"StartupXLOG, xlog.c:6669",""
When the tampered REDO is read, recovery stops
2016-10-11 15:11:21.215 CST,,,15042,,57fc9079.3ac2,4,,2016-10-11 15:10:49 CST,,0,LOG,00000,"invalid magic number 0000 in log segment 0000000100000116000000F8, offset 0",,,,,,,,"ReadRecord, xlog.c:3942",""
2016-10-11 15:11:21.215 CST,,,15042,,57fc9079.3ac2,5,,2016-10-11 15:10:49 CST,,0,LOG,00000,"redo done at 116/F7FFFF88",,,,,,,,"StartupXLOG, xlog.c:6921",""
2016-10-11 15:11:21.215 CST,,,15042,,57fc9079.3ac2,6,,2016-10-11 15:10:49 CST,,0,LOG,00000,"last completed transaction was at log time 2016-10-11 15:04:16.39504+08",,,,,,,,"StartupXLOG, xlog.c:6926",""
2016-10-11 15:11:21.216 CST,,,15042,,57fc9079.3ac2,7,,2016-10-11 15:10:49 CST,,0,LOG,00000,"checkpoint starting: end-of-recovery immediate",,,,,,,,"LogCheckpointStart, xlog.c:7949",""
2016-10-11 15:11:23.223 CST,,,15042,,57fc9079.3ac2,8,,2016-10-11 15:10:49 CST,,0,LOG,00000,"checkpoint complete: wrote 215999 buffers (1.3%); 0 transaction log file(s) added, 1 removed, 0 recycled; write=1.598 s, sync=0.405 s, total=2.006 s; sync files=20, longest=0.207 s, average=0.020 s; distance=1481838 kB, estimate=1481838 kB",,,,,,,,"LogCheckpointEnd, xlog.c:8031",""
2016-10-11 15:11:23.223 CST,,,15042,,57fc9079.3ac2,9,,2016-10-11 15:10:49 CST,,0,LOG,00000,"MultiXact member wraparound protections are now enabled",,,,,,,,"SetOffsetVacuumLimit, multixact.c:2628",""
2016-10-11 15:11:23.405 CST,,,15039,,57fc9076.3abf,2,,2016-10-11 15:10:46 CST,,0,LOG,00000,"database system is ready to accept connections",,,,,,,,"reaper, postmaster.c:2792",""
2016-10-11 15:11:23.405 CST,,,15083,,57fc909b.3aeb,1,,2016-10-11 15:11:23 CST,,0,LOG,00000,"autovacuum launcher started",,,,,,,,"AutoVacLauncherMain, autovacuum.c:416",""
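The LSNs in the log ("redo done at 116/F7FFFF88") can be mapped back to a segment file name and a byte offset, which makes it easy to see that recovery stopped just before segment F8. The sketch below assumes the default 16 MB segment size and timeline 1; the helper names are made up.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default segment size; an assumption about this build


def lsn_to_int(lsn: str) -> int:
    """Convert the textual 'hi/lo' form (hexadecimal) into a 64-bit position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def locate(lsn: str, timeline: int = 1, seg_size: int = WAL_SEGMENT_SIZE):
    """Return (segment file name, byte offset inside that segment) for an LSN."""
    pos = lsn_to_int(lsn)
    segno = pos // seg_size
    segs_per_id = 0x100000000 // seg_size
    name = "%08X%08X%08X" % (timeline, segno // segs_per_id, segno % segs_per_id)
    return name, pos % seg_size


print(locate("116/F7FFFF88"))  # ('0000000100000116000000F7', 16777096)
print(locate("116/F8000038"))  # ('0000000100000116000000F8', 56)
```

The last replayed record sits 120 bytes before the end of segment F7, and the first lost COMMIT record starts 56 bytes into segment F8.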
Verification
Transactions shown as committed in the REDO ahead of the tampered part are confirmed to be committed.
postgres=# select xmin,* from pgbench_history where xmin in (772208346,772208344,772208351,772208352,772208353,772208345);
   xmin    | tid | bid |   aid   | delta |           mtime            | filler
-----------+-----+-----+---------+-------+----------------------------+--------
 772208345 | 109 |  76 |   96685 |  4792 | 2016-10-11 15:04:16.394519 |
 772208353 | 657 |   1 | 7473886 |  1540 | 2016-10-11 15:04:16.394708 |
 772208344 | 146 |  58 | 2671263 | -2297 | 2016-10-11 15:04:16.394504 |
 772208352 |  55 |  57 | 9608997 |  2862 | 2016-10-11 15:04:16.39463  |
 772208351 | 531 |   8 | 4235604 |  1582 | 2016-10-11 15:04:16.394601 |
 772208346 | 105 |  83 | 5770382 |   590 | 2016-10-11 15:04:16.394542 |
(6 rows)
Transactions that the REDO showed as committed inside the tampered part are treated as not committed, so the partial write did not affect the consistency of the database.
postgres=# select * from pgbench_history where xmin in (772208342,772208354,772208358,772208350);
 tid | bid | aid | delta | mtime | filler
-----+-----+-----+-------+-------+--------
(0 rows)
The check passes.
./configure --with-wal-blocksize=?
Allowed values are 1, 2, 4, 8, 16, 32, 64 (in kB; the default is 8).
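Issuing an fsync for every REDO record would hurt performance, so fsync is in fact scheduled.
The purpose of the REDO buffer is to reduce the number of fsync calls.
1. When the wal writer has slept longer than the configured interval (wal_writer_delay, often set to 10 ms; the default is 200 ms), it triggers an fsync that persists the completely filled blocks in the REDO buffer to the REDO file.
2. When the amount written (asynchronously) by the wal writer exceeds the configured threshold (wal_writer_flush_after), it triggers an fsync that persists the completely filled blocks in the REDO buffer to the REDO file.
3. When a transaction ends, it checks the shared WAL write state to see whether its LSN has already been flushed; if it has not reached disk, an fsync is triggered.
4. In case 3, if group commit is enabled, multiple transactions that are committing at the same time request only a single fsync.
5. When the REDO log switches to a new file, an fsync is triggered to make sure the file is persisted.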
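Conditions 1 and 2 boil down to a small decision function. The following Python sketch mirrors that decision under stated assumptions: 8 KB blocks, a 200 ms delay, a threshold of 128 blocks, and a `None` timestamp standing for "never flushed yet". The function name and signature are invented for illustration and are not PostgreSQL's API.

```python
def flush_target(write_lsn, flushed_lsn, now_ms, last_flush_ms,
                 delay_ms=200, flush_after_blocks=128, block_size=8192):
    """Return (lsn_to_flush_or_None, new_last_flush_ms) for one wal writer cycle."""
    unflushed_blocks = write_lsn // block_size - flushed_lsn // block_size
    if flush_after_blocks == 0 or last_flush_ms is None:
        return write_lsn, now_ms          # first call, or block-based limit disabled
    if now_ms - last_flush_ms >= delay_ms:
        return write_lsn, now_ms          # condition 1: the delay has elapsed
    if unflushed_blocks >= flush_after_blocks:
        return write_lsn, now_ms          # condition 2: enough unflushed blocks
    return None, last_flush_ms            # write only, no fsync this round


print(flush_target(10 * 8192, 0, now_ms=1050, last_flush_ms=1000))   # (None, 1000)
print(flush_target(10 * 8192, 0, now_ms=1200, last_flush_ms=1000))   # (81920, 1200)
print(flush_target(128 * 8192, 0, now_ms=1050, last_flush_ms=1000))  # (1048576, 1050)
```

Batching this way trades a bounded delay for asynchronous commits against far fewer fsync calls.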
src/backend/postmaster/walwriter.c
 * The WAL writer background process is new as of Postgres 8.3.  It attempts
 * to keep regular backends from having to write out (and fsync) WAL pages.
 * Also, it guarantees that transaction commit records that weren't synced
 * to disk immediately upon commit (ie, were "asynchronously committed")
 * will reach disk within a knowable time --- which, as it happens, is at
 * most three times the wal_writer_delay cycle time.
 *
 * Note that as with the bgwriter for shared buffers, regular backends are
 * still empowered to issue WAL writes and fsyncs when the walwriter doesn't
 * keep up. This means that the WALWriter is not an essential process and
 * can shutdown quickly when requested.
 *
 * Because the walwriter's cycle is directly linked to the maximum delay
 * before async-commit transactions are guaranteed committed, it's probably
 * unwise to load additional functionality onto it.  For instance, if you've
 * got a yen to create xlog segments further in advance, that'd be better done
 * in bgwriter than in walwriter.
 *
 * The walwriter is started by the postmaster as soon as the startup subprocess
 * finishes.  It remains alive until the postmaster commands it to terminate.
 * Normal termination is by SIGTERM, which instructs the walwriter to exit(0).
 * Emergency termination is by SIGQUIT; like any backend, the walwriter will
 * simply abort and exit on SIGQUIT.
 *
 * If the walwriter exits unexpectedly, the postmaster treats that the same
 * as a backend crash: shared memory may be corrupted, so remaining backends
 * should be killed by SIGQUIT and then a recovery cycle started.
......
    /*
     * Loop forever
     */
    for (;;)
    {
......
        /*
         * Do what we're here for; then, if XLogBackgroundFlush() found useful
         * work to do, reset hibernation counter.
         */
        if (XLogBackgroundFlush())
            left_till_hibernate = LOOPS_UNTIL_HIBERNATE;
        else if (left_till_hibernate > 0)
            left_till_hibernate--;
......
src/backend/access/transam/xlog.c
/*
 * Write & flush xlog, but without specifying exactly where to.
 *
 * We normally write only completed blocks; but if there is nothing to do on
 * that basis, we check for unwritten async commits in the current incomplete
 * block, and write through the latest one of those.  Thus, if async commits
 * are not being used, we will write complete blocks only.
 *
 * If, based on the above, there's anything to write we do so immediately. But
 * to avoid calling fsync, fdatasync et. al. at a rate that'd impact
 * concurrent IO, we only flush WAL every wal_writer_delay ms, or if there's
 * more than wal_writer_flush_after unflushed blocks.
 *
 * We can guarantee that async commits reach disk after at most three
 * wal_writer_delay cycles. (When flushing complete blocks, we allow XLogWrite
 * to write "flexibly", meaning it can stop at the end of the buffer ring;
 * this makes a difference only with very high load or long wal_writer_delay,
 * but imposes one extra cycle for the worst case for async commits.)
 *
 * This routine is invoked periodically by the background walwriter process.
 *
 * Returns TRUE if there was any work to do, even if we skipped flushing due
 * to wal_writer_delay/wal_flush_after.
 */
bool
XLogBackgroundFlush(void)
{
    XLogwrtRqst WriteRqst;
    bool        flexible = true;
    static TimestampTz lastflush;
    TimestampTz now;
    int         flushbytes;

    /* XLOG doesn't need flushing during recovery */
    if (RecoveryInProgress())
        return false;

    /* read LogwrtResult and update local state */
    SpinLockAcquire(&XLogCtl->info_lck);
    LogwrtResult = XLogCtl->LogwrtResult;
    WriteRqst = XLogCtl->LogwrtRqst;
    SpinLockRelease(&XLogCtl->info_lck);

    /* back off to last completed page boundary */
    WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;

    /* if we have already flushed that far, consider async commit records */
    if (WriteRqst.Write <= LogwrtResult.Flush)
    {
        SpinLockAcquire(&XLogCtl->info_lck);
        WriteRqst.Write = XLogCtl->asyncXactLSN;
        SpinLockRelease(&XLogCtl->info_lck);
        flexible = false;       /* ensure it all gets written */
    }

    /*
     * If already known flushed, we're done. Just need to check if we are
     * holding an open file handle to a logfile that's no longer in use,
     * preventing the file from being deleted.
     */
    if (WriteRqst.Write <= LogwrtResult.Flush)
    {
        if (openLogFile >= 0)
        {
            if (!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo))
            {
                XLogFileClose();
            }
        }
        return false;
    }

    /*
     * Determine how far to flush WAL, based on the wal_writer_delay and
     * wal_writer_flush_after GUCs.
     */
    now = GetCurrentTimestamp();
    flushbytes =
        WriteRqst.Write / XLOG_BLCKSZ - LogwrtResult.Flush / XLOG_BLCKSZ;

    if (WalWriterFlushAfter == 0 || lastflush == 0)
    {
        /* first call, or block based limits disabled */
        WriteRqst.Flush = WriteRqst.Write;
        lastflush = now;
    }
    // scheduled by sleep time: update the flush position
    else if (TimestampDifferenceExceeds(lastflush, now, WalWriterDelay))
    {
        /*
         * Flush the writes at least every WalWriteDelay ms. This is important
         * to bound the amount of time it takes for an asynchronous commit to
         * hit disk.
         */
        WriteRqst.Flush = WriteRqst.Write;
        lastflush = now;
    }
    // scheduled by the amount the wal writer has written (asynchronously): update the flush position
    else if (flushbytes >= WalWriterFlushAfter)
    {
        /* exceeded wal_writer_flush_after blocks, flush */
        WriteRqst.Flush = WriteRqst.Write;
        lastflush = now;
    }
    // otherwise, do not fsync
    else
    {
        /* no flushing, this time round */
        WriteRqst.Flush = 0;
    }

#ifdef WAL_DEBUG
    if (XLOG_DEBUG)
        elog(LOG, "xlog bg flush request write %X/%X; flush: %X/%X, current is write %X/%X; flush %X/%X",
             (uint32) (WriteRqst.Write >> 32), (uint32) WriteRqst.Write,
             (uint32) (WriteRqst.Flush >> 32), (uint32) WriteRqst.Flush,
             (uint32) (LogwrtResult.Write >> 32), (uint32) LogwrtResult.Write,
             (uint32) (LogwrtResult.Flush >> 32), (uint32) LogwrtResult.Flush);
#endif

    START_CRIT_SECTION();

    /* now wait for any in-progress insertions to finish and get write lock */
    WaitXLogInsertionsToFinish(WriteRqst.Write);
    LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
    LogwrtResult = XLogCtl->LogwrtResult;
    if (WriteRqst.Write > LogwrtResult.Write ||
        WriteRqst.Flush > LogwrtResult.Flush)
    {
        XLogWrite(WriteRqst, flexible);
    }
    LWLockRelease(WALWriteLock);

    END_CRIT_SECTION();

    /* wake up walsenders now that we've released heavily contended locks */
    WalSndWakeupProcessRequests();

    /*
     * Great, done. To take some work off the critical path, try to initialize
     * as many of the no-longer-needed WAL buffers for future use as we can.
     */
    AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);

    /*
     * If we determined that we need to write data, but somebody else
     * wrote/flushed already, it should be considered as being active, to
     * avoid hibernating too early.
     */
    return true;
}

/*
 * Write and/or fsync the log at least as far as WriteRqst indicates.
 *
 * If flexible == TRUE, we don't have to write as far as WriteRqst, but
 * may stop at any convenient boundary (such as a cache or logfile boundary).
 * This option allows us to avoid uselessly issuing multiple writes when a
 * single one would do.
 *
 * Must be called with WALWriteLock held. WaitXLogInsertionsToFinish(WriteRqst)
 * must be called before grabbing the lock, to make sure the data is ready to
 * write.
 */
static void
XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
{
......
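On a log file switch, fsync is triggered
            if (finishing_seg)
            {
                issue_xlog_fsync(openLogFile, openLogSegNo);
......
Whether fsync needs to be called is decided by comparing the LogwrtResult.Flush position with the requested flush position,
that is, the scheduling described above
    /*
     * If asked to flush, do so
     */
    if (LogwrtResult.Flush
......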
1. https://www.pgcon.org/2012/schedule/attachments/258_212_Internals%20Of%20PostgreSQL%20Wal.pdf
For a deeper look at the internals of PostgreSQL REDO, see the document above and the source code.