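Deduplication is a common requirement in databases; typical cases are single-column deduplication, multi-column deduplication, and row deduplication. PostgreSQL offers a different method for each of these cases.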
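1. Single-column deduplication
Single-column deduplication is probably the most common case: rows that repeat a value in one column are removed, keeping either the newest or the oldest record as required.
-- Create test data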
bill=# create table test1(id int primary key, c1 int, c2 timestamp);
CREATE TABLE
bill=# insert into test1 select generate_series(1,1000000), random()*1000, clock_timestamp();
INSERT 0 1000000
bill=# create index idx_test1 on test1(c1,id);
CREATE INDEX
-- Method 1: aggregate with NOT IN
bill=# explain delete from test1 where id not in (select max(id) from test1 group by c1);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Delete on test1 (cost=30609.23..48515.23 rows=500000 width=6)
-> Seq Scan on test1 (cost=30609.23..48515.23 rows=500000 width=6)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> GroupAggregate (cost=0.42..30606.73 rows=1001 width=8)
Group Key: test1_1.c1
-> Index Only Scan using idx_test1 on test1 test1_1 (cost=0.42..25596.72 rows=1000000 width=8)
(7 rows)
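Method 1 keeps the row with the largest id in each c1 group, that is, the newest record. As a minimal sketch, assuming the oldest record should be kept instead, the aggregate can be switched to min(id). This variant is not part of the original test run and was not timed.

```sql
-- Keep the oldest row per c1 value: delete every row whose id is not
-- the smallest id of its group. id is the primary key, so the subquery
-- never returns NULL and NOT IN behaves as expected.
delete from test1
where id not in (select min(id) from test1 group by c1);
```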
-- Method 2: window function with IN
bill=# explain select id from (select row_number() over(partition by c1 order by id) as rn, id from test1) t where t.rn<>1;
QUERY PLAN
--------------------------------------------------------------------------------------------------
Subquery Scan on t (cost=0.42..55596.72 rows=995000 width=4)
Filter: (t.rn <> 1)
-> WindowAgg (cost=0.42..43096.72 rows=1000000 width=16)
-> Index Only Scan using idx_test1 on test1 (cost=0.42..25596.72 rows=1000000 width=8)
(4 rows)
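The plan above only covers the SELECT that finds the duplicate ids. A sketch of the full delete, following the IN form that this post uses for the multi-column case, might look like this:

```sql
-- Number the rows inside each c1 group by id; rn = 1 is the oldest row.
-- Every row with rn <> 1 is a duplicate and gets deleted.
delete from test1
where id in (
    select id
    from (select row_number() over (partition by c1 order by id) as rn, id
          from test1) t
    where t.rn <> 1
);
```

As written this keeps the oldest row of each group; ordering the window by id desc would keep the newest one instead.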
-- Method 3: walk the rows with a cursor, comparing each record with the previous one.
bill=# do language plpgsql $$
bill$# declare
bill$# v_rec record;
bill$# v_c1 int;
bill$# cur1 cursor for select c1,id from test1 order by c1,id for update;
bill$# begin
bill$# for v_rec in cur1 loop
bill$# if v_rec.c1 = v_c1 then
bill$# delete from test1 where current of cur1;
bill$# end if;
bill$# v_c1 := v_rec.c1;
bill$# end loop;
bill$# end;
bill$# $$;
DO
Of the three methods above, method 2 (the window function) is the most efficient, followed by method 3 (the cursor).
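Whichever method is used, the result can be checked afterwards. The queries below are a small sketch for the single-column case; they assume c1 contains no NULLs, which holds for the generated test data.

```sql
-- Any c1 value still listed here has more than one row left.
select c1, count(*) as cnt
from test1
group by c1
having count(*) > 1;

-- After a successful dedup on c1 both numbers should be equal.
select count(*) as total_rows, count(distinct c1) as distinct_c1
from test1;
```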
2. Multi-column deduplication
This is similar to the single-column case, except that duplicates are identified by a combination of several columns.
-- Create test data
bill=# create table test1(id int primary key, c1 int, c2 int, c3 timestamp);
CREATE TABLE
bill=# insert into test1 select generate_series(1,1000000), random()*1000, random()*1000, clock_timestamp();
INSERT 0 1000000
bill=# create index idx_test1 on test1(c1,c2,id);
CREATE INDEX
-- Method 1:
bill=# explain (analyze,verbose,timing,costs,buffers) delete from test1 where id not in (select max(id) from test1 group by c1,c2);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on public.test1 (cost=37036.38..55906.38 rows=500000 width=6) (actual time=1924.854..1924.854 rows=0 loops=1)
Buffers: shared hit=1373911 read=3834
-> Seq Scan on public.test1 (cost=37036.38..55906.38 rows=500000 width=6) (actual time=1255.586..1672.129 rows=367700 loops=1)
Output: test1.ctid
Filter: (NOT (hashed SubPlan 1))
Rows Removed by Filter: 632300
Buffers: shared hit=1006211 read=3834
SubPlan 1
-> GroupAggregate (cost=0.42..36786.38 rows=100000 width=12) (actual time=0.061..1001.212 rows=632300 loops=1)
Output: max(test1_1.id), test1_1.c1, test1_1.c2
Group Key: test1_1.c1, test1_1.c2
Buffers: shared hit=999841 read=3834
-> Index Only Scan using idx_test1 on public.test1 test1_1 (cost=0.42..28286.38 rows=1000000 width=12) (actual time=0.052..708.625 rows=1000000 loops=1)
Output: test1_1.c1, test1_1.c2, test1_1.id
Heap Fetches: 1000000
Buffers: shared hit=999841 read=3834
Planning Time: 0.345 ms
Execution Time: 1931.117 ms
(18 rows)
-- Method 2:
bill=# explain (analyze,verbose,timing,costs,buffers) delete from test1 where id in (select id from (select row_number() over(partition by c1,c2 order by id) as rn, id from test1) t where t.rn<>1);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on public.test1 (cost=47204.90..79033.85 rows=629138 width=34) (actual time=625.967..625.968 rows=0 loops=1)
Buffers: shared hit=3836
-> Hash Semi Join (cost=47204.90..79033.85 rows=629138 width=34) (actual time=625.966..625.967 rows=0 loops=1)
Output: test1.ctid, t.*
Hash Cond: (test1.id = t.id)
Buffers: shared hit=3836
-> Seq Scan on public.test1 (cost=0.00..12693.00 rows=632300 width=10) (actual time=0.007..0.007 rows=1 loops=1)
Output: test1.ctid, test1.id
Buffers: shared hit=1
-> Hash (cost=35039.68..35039.68 rows=629138 width=32) (actual time=625.801..625.801 rows=0 loops=1)
Output: t.*, t.id
Buckets: 131072 Batches: 8 Memory Usage: 1024kB
Buffers: shared hit=3835
-> Subquery Scan on t (cost=0.42..35039.68 rows=629138 width=32) (actual time=625.800..625.800 rows=0 loops=1)
Output: t.*, t.id
Filter: (t.rn <> 1)
Rows Removed by Filter: 632300
Buffers: shared hit=3835
-> WindowAgg (cost=0.42..27135.92 rows=632300 width=20) (actual time=0.041..574.119 rows=632300 loops=1)
Output: row_number() OVER (?), test1_1.id, test1_1.c1, test1_1.c2
Buffers: shared hit=3835
-> Index Only Scan using idx_test1 on public.test1 test1_1 (cost=0.42..14489.92 rows=632300 width=12) (actual time=0.024..89.633 rows=632300 loops=1)
Output: test1_1.c1, test1_1.c2, test1_1.id
Heap Fetches: 0
Buffers: shared hit=3835
Planning Time: 0.505 ms
Execution Time: 626.029 ms
(27 rows)
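EXPLAIN ANALYZE actually executes the statement, so a DELETE measured this way really removes the rows. In the output above, method 2 finds no duplicates at all (Rows Removed by Filter: 632300, zero rows deleted), which suggests it ran on a table that method 1 had already cleaned. One way to time each method against the same data is to wrap the run in a transaction and roll it back. This is a sketch, not the procedure used in the original test.

```sql
begin;

-- The delete is executed for real inside the transaction...
explain (analyze, buffers)
delete from test1
where id in (
    select id
    from (select row_number() over (partition by c1, c2 order by id) as rn, id
          from test1) t
    where t.rn <> 1
);

-- ...and undone here, so the next method sees the same duplicates.
rollback;
```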
-- Method 3:
bill=# do language plpgsql $$
bill$# declare
bill$# v_rec record;
bill$# v_c1 int;
bill$# v_c2 int;
bill$# cur1 cursor for select c1,c2 from test1 order by c1,c2,id for update;
bill$# begin
bill$# for v_rec in cur1 loop
bill$# if v_rec.c1 = v_c1 and v_rec.c2=v_c2 then
bill$# delete from test1 where current of cur1;
bill$# end if;
bill$# v_c1 := v_rec.c1;
bill$# v_c2 := v_rec.c2;
bill$# end loop;
bill$# end;
bill$# $$;
DO
3. Row deduplication
When entire rows are duplicated, ctid can generally be used to tell them apart.
-- Create test data:
bill=# create table test1(c1 int, c2 int);
CREATE TABLE
bill=# insert into test1 select random()*1000, random()*1000 from generate_series(1,1000000);
INSERT 0 1000000
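ctid is a system column that holds the physical location of a row version as (block number, tuple index), which is why it can distinguish rows whose user columns are identical. A quick way to look at it on the test table is sketched below; the actual values depend on how the rows were stored.

```sql
-- ctid is not returned by "select *"; it has to be named explicitly.
select ctid, c1, c2
from test1
order by c1, c2, ctid
limit 5;
```

Because a row's ctid changes when the row is updated or the table is rewritten (for example by VACUUM FULL), it is suited to a one-off cleanup like this rather than to long-term row identification.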
-- Method 1:
bill=# explain (analyze,verbose,timing,costs,buffers) delete from test1 where ctid not in (select max(ctid) from test1 group by c1,c2);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on public.test1 (cost=135831.29..152756.29 rows=500000 width=6) (actual time=2290.808..2290.808 rows=0 loops=1)
Buffers: shared hit=376170, temp read=2944 written=2954
-> Seq Scan on public.test1 (cost=135831.29..152756.29 rows=500000 width=6) (actual time=1643.262..2040.646 rows=367320 loops=1)
Output: test1.ctid
Filter: (NOT (hashed SubPlan 1))
Rows Removed by Filter: 632680
Buffers: shared hit=8850, temp read=2944 written=2954
SubPlan 1
-> GroupAggregate (cost=124581.29..135581.29 rows=100000 width=14) (actual time=732.049..1390.277 rows=632680 loops=1)
Output: max(test1_1.ctid), test1_1.c1, test1_1.c2
Group Key: test1_1.c1, test1_1.c2
Buffers: shared hit=4425, temp read=2944 written=2954
-> Sort (cost=124581.29..127081.29 rows=1000000 width=14) (actual time=732.035..1015.066 rows=1000000 loops=1)
Output: test1_1.c1, test1_1.c2, test1_1.ctid
Sort Key: test1_1.c1, test1_1.c2
Sort Method: external merge Disk: 23552kB
Buffers: shared hit=4425, temp read=2944 written=2954
-> Seq Scan on public.test1 test1_1 (cost=0.00..14425.00 rows=1000000 width=14) (actual time=0.010..138.017 rows=1000000 loops=1)
Output: test1_1.c1, test1_1.c2, test1_1.ctid
Buffers: shared hit=4425
Planning Time: 0.176 ms
Execution Time: 2304.495 ms
(22 rows)
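The Sort node above reports "Sort Method: external merge Disk: 23552kB", meaning the sort did not fit in memory and spilled to temporary files. As a sketch, assuming the session can afford more memory, raising work_mem may let the sort run in memory or let the planner pick a different plan. The 128MB value is an illustrative guess, not a figure from the original test.

```sql
-- Raise the per-operation sort/hash memory for this session only.
set work_mem = '128MB';

-- Re-check only the grouping subquery; nothing is deleted here.
explain (analyze, buffers)
select max(ctid) from test1 group by c1, c2;

-- Return to the configured default.
reset work_mem;
```

If the Sort Method line no longer mentions Disk, the spill is gone; whether that shortens the whole delete would still need to be measured.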
-- Method 2:
bill=# explain (analyze,verbose,timing,costs,buffers) delete from test1 where ctid = any(array( select ctid from (select row_number() over(partition by c1,c2 order by ctid) as rn, ctid from test1) t where t.rn<>1));
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on public.test1 (cost=100501.36..100514.46 rows=10 width=6) (actual time=1092.431..1092.431 rows=0 loops=1)
Buffers: shared hit=4430, temp read=2013 written=2019
InitPlan 1 (returns $0)
-> Subquery Scan on t (cost=78357.55..100501.35 rows=629517 width=6) (actual time=1092.420..1092.420 rows=0 loops=1)
Output: t.ctid
Filter: (t.rn <> 1)
Rows Removed by Filter: 632680
Buffers: shared hit=4430, temp read=2013 written=2019
-> WindowAgg (cost=78357.55..92592.85 rows=632680 width=22) (actual time=459.611..1042.708 rows=632680 loops=1)
Output: row_number() OVER (?), test1_1.ctid, test1_1.c1, test1_1.c2
Buffers: shared hit=4430, temp read=2013 written=2019
-> Sort (cost=78357.55..79939.25 rows=632680 width=14) (actual time=459.598..616.859 rows=632680 loops=1)
Output: test1_1.ctid, test1_1.c1, test1_1.c2
Sort Key: test1_1.c1, test1_1.c2, test1_1.ctid
Sort Method: external merge Disk: 16104kB
Buffers: shared hit=4430, temp read=2013 written=2019
-> Seq Scan on public.test1 test1_1 (cost=0.00..10751.80 rows=632680 width=14) (actual time=0.006..83.917 rows=632680 loops=1)
Output: test1_1.ctid, test1_1.c1, test1_1.c2
Buffers: shared hit=4425
-> Tid Scan on public.test1 (cost=0.01..13.11 rows=10 width=6) (actual time=1092.429..1092.429 rows=0 loops=1)
Output: test1.ctid
TID Cond: (test1.ctid = ANY ($0))
Buffers: shared hit=4430, temp read=2013 written=2019
Planning Time: 0.204 ms
Execution Time: 1096.153 ms
(25 rows)
-- Method 3:
bill=# do language plpgsql $$
bill$# declare
bill$# v_rec record;
bill$# v_c1 int;
bill$# v_c2 int;
bill$# cur1 cursor for select c1,c2 from test1 order by c1,c2,ctid for update;
bill$# begin
bill$# for v_rec in cur1 loop
bill$# if v_rec.c1 = v_c1 and v_rec.c2=v_c2 then
bill$# delete from test1 where current of cur1;
bill$# end if;
bill$# v_c1 := v_rec.c1;
bill$# v_c2 := v_rec.c2;
bill$# end loop;
bill$# end;
bill$# $$;
DO
Time: 2320.653 ms (00:02.321)