PostgreSQL数据去重

作者：Less哒哒_288 | 来源：互联网 | 2023-08-07 05:13

数据库中去重的需求比较常见，比较常见的如单列去重、多列去重、行去重等。pg中针对这些不同的去重要求，我们也可以使用不同的去重方法。1、单列去重

数据库中去重的需求比较常见&＃xff0c;比较常见的如单列去重、多列去重、行去重等。pg中针对这些不同的去重要求&＃xff0c;我们也可以使用不同的去重方法。

1、单列去重

单列去重应该是最常见的了&＃xff0c;就是将某一列中重复的记录去除掉&＃xff0c;我们可以根据要求保留最新或者最旧的记录。

–创建测试数据

bill&＃61;# create table test1(id int primary key, c1 int, c2 timestamp); CREATE TABLE bill&＃61;# insert into test1 select generate_series(1,1000000), random()*1000, clock_timestamp(); INSERT 0 1000000 bill&＃61;# create index idx_test1 on test1(c1,id); CREATE INDEX

–方法1&＃xff1a;
聚合&＃xff0c;not in

bill&＃61;# explain delete from test1 where id not in (select max(id) from test1 group by c1); QUERY PLAN ------------------------------------------------------------------------------------------------------------------ Delete on test1 (cost&＃61;30609.23..48515.23 rows&＃61;500000 width&＃61;6) -> Seq Scan on test1 (cost&＃61;30609.23..48515.23 rows&＃61;500000 width&＃61;6) Filter: (NOT (hashed SubPlan 1)) SubPlan 1 -> GroupAggregate (cost&＃61;0.42..30606.73 rows&＃61;1001 width&＃61;8) Group Key: test1_1.c1 -> Index Only Scan using idx_test1 on test1 test1_1 (cost&＃61;0.42..25596.72 rows&＃61;1000000 width&＃61;8) (7 rows)

–方法2:
使用窗口查询&＃xff0c;in

bill&＃61;# explain select id from (select row_number() over(partition by c1 order by id) as rn, id from test1) t where t.rn<>1; QUERY PLAN -------------------------------------------------------------------------------------------------- Subquery Scan on t (cost&＃61;0.42..55596.72 rows&＃61;995000 width&＃61;4) Filter: (t.rn <> 1) -> WindowAgg (cost&＃61;0.42..43096.72 rows&＃61;1000000 width&＃61;16) -> Index Only Scan using idx_test1 on test1 (cost&＃61;0.42..25596.72 rows&＃61;1000000 width&＃61;8) (4 rows)

–方法3:
使用游标的方式去遍历&＃xff0c;每一条记录比较一次。

bill&＃61;# do language plpgsql $$ bill$# declare bill$# v_rec record; bill$# v_c1 int; bill$# cur1 cursor for select c1,id from test1 order by c1,id for update; bill$# begin bill$# for v_rec in cur1 loop bill$# if v_rec.c1 &＃61; v_c1 then bill$# delete from test1 where current of cur1; bill$# end if; bill$# v_c1 :&＃61; v_rec.c1; bill$# end loop; bill$# end; bill$# $$; DO

上面三种方式&＃xff0c;方法二效率最高&＃xff0c;其次是方法三。

2、多列去重

和单列类似&＃xff0c;只是变成了去除多个列的重复记录。

–创建测试数据

bill&＃61;# create table test1(id int primary key, c1 int, c2 int, c3 timestamp); CREATE TABLE bill&＃61;# insert into test1 select generate_series(1,1000000), random()*1000, random()*1000, clock_timestamp(); INSERT 0 1000000 bill&＃61;# create index idx_test1 on test1(c1,c2,id); CREATE INDEX

–方法1:

bill&＃61;# explain (analyze,verbose,timing,costs,buffers) delete from test1 where id not in (select max(id) from test1 group by c1,c2); QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Delete on public.test1 (cost&＃61;37036.38..55906.38 rows&＃61;500000 width&＃61;6) (actual time&＃61;1924.854..1924.854 rows&＃61;0 loops&＃61;1) Buffers: shared hit&＃61;1373911 read&＃61;3834 -> Seq Scan on public.test1 (cost&＃61;37036.38..55906.38 rows&＃61;500000 width&＃61;6) (actual time&＃61;1255.586..1672.129 rows&＃61;367700 loops&＃61;1) Output: test1.ctid Filter: (NOT (hashed SubPlan 1)) Rows Removed by Filter: 632300 Buffers: shared hit&＃61;1006211 read&＃61;3834 SubPlan 1 -> GroupAggregate (cost&＃61;0.42..36786.38 rows&＃61;100000 width&＃61;12) (actual time&＃61;0.061..1001.212 rows&＃61;632300 loops&＃61;1) Output: max(test1_1.id), test1_1.c1, test1_1.c2 Group Key: test1_1.c1, test1_1.c2 Buffers: shared hit&＃61;999841 read&＃61;3834 -> Index Only Scan using idx_test1 on public.test1 test1_1 (cost&＃61;0.42..28286.38 rows&＃61;1000000 width&＃61;12) (actual time&＃61;0.052..708.625 rows&＃61;1000000 loops&＃61;1) Output: test1_1.c1, test1_1.c2, test1_1.id Heap Fetches: 1000000 Buffers: shared hit&＃61;999841 read&＃61;3834 Planning Time: 0.345 ms Execution Time: 1931.117 ms (18 rows)

–方法2:

bill&＃61;# explain (analyze,verbose,timing,costs,buffers) delete from test1 where id in (select id from (select row_number() over(partition by c1,c2 order by id) as rn, id from test1) t where t.rn<>1); QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Delete on public.test1 (cost&＃61;47204.90..79033.85 rows&＃61;629138 width&＃61;34) (actual time&＃61;625.967..625.968 rows&＃61;0 loops&＃61;1) Buffers: shared hit&＃61;3836 -> Hash Semi Join (cost&＃61;47204.90..79033.85 rows&＃61;629138 width&＃61;34) (actual time&＃61;625.966..625.967 rows&＃61;0 loops&＃61;1) Output: test1.ctid, t.* Hash Cond: (test1.id &＃61; t.id) Buffers: shared hit&＃61;3836 -> Seq Scan on public.test1 (cost&＃61;0.00..12693.00 rows&＃61;632300 width&＃61;10) (actual time&＃61;0.007..0.007 rows&＃61;1 loops&＃61;1) Output: test1.ctid, test1.id Buffers: shared hit&＃61;1 -> Hash (cost&＃61;35039.68..35039.68 rows&＃61;629138 width&＃61;32) (actual time&＃61;625.801..625.801 rows&＃61;0 loops&＃61;1) Output: t.*, t.id Buckets: 131072 Batches: 8 Memory Usage: 1024kB Buffers: shared hit&＃61;3835 -> Subquery Scan on t (cost&＃61;0.42..35039.68 rows&＃61;629138 width&＃61;32) (actual time&＃61;625.800..625.800 rows&＃61;0 loops&＃61;1) Output: t.*, t.id Filter: (t.rn <> 1) Rows Removed by Filter: 632300 Buffers: shared hit&＃61;3835 -> WindowAgg (cost&＃61;0.42..27135.92 rows&＃61;632300 width&＃61;20) (actual time&＃61;0.041..574.119 rows&＃61;632300 loops&＃61;1) Output: row_number() OVER (?), test1_1.id, test1_1.c1, test1_1.c2 Buffers: shared hit&＃61;3835 -> Index Only Scan using idx_test1 on public.test1 test1_1 (cost&＃61;0.42..14489.92 rows&＃61;632300 width&＃61;12) (actual time&＃61;0.024..89.633 rows&＃61;632300 loops&＃61;1) Output: test1_1.c1, test1_1.c2, test1_1.id Heap Fetches: 0 Buffers: shared hit&＃61;3835 Planning Time: 0.505 ms Execution Time: 626.029 ms (27 rows)

–方法3:

bill&＃61;# do language plpgsql $$ bill$# declare bill$# v_rec record; bill$# v_c1 int; bill$# v_c2 int; bill$# cur1 cursor for select c1,c2 from test1 order by c1,c2,id for update; bill$# begin bill$# for v_rec in cur1 loop bill$# if v_rec.c1 &＃61; v_c1 and v_rec.c2&＃61;v_c2 then bill$# delete from test1 where current of cur1; bill$# end if; bill$# v_c1 :&＃61; v_rec.c1; bill$# v_c2 :&＃61; v_rec.c2; bill$# end loop; bill$# end; bill$# $$; DO

3、行去重
行去重一般可以使用ctid。

–创建测试数据&＃xff1a;

bill&＃61;# create table test1(c1 int, c2 int); CREATE TABLE bill&＃61;# insert into test1 select random()*1000, random()*1000 from generate_series(1,1000000); INSERT 0 1000000

–方法1:

bill&＃61;# explain (analyze,verbose,timing,costs,buffers) delete from test1 where ctid not in (select max(ctid) from test1 group by c1,c2); QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------- Delete on public.test1 (cost&＃61;135831.29..152756.29 rows&＃61;500000 width&＃61;6) (actual time&＃61;2290.808..2290.808 rows&＃61;0 loops&＃61;1) Buffers: shared hit&＃61;376170, temp read&＃61;2944 written&＃61;2954 -> Seq Scan on public.test1 (cost&＃61;135831.29..152756.29 rows&＃61;500000 width&＃61;6) (actual time&＃61;1643.262..2040.646 rows&＃61;367320 loops&＃61;1) Output: test1.ctid Filter: (NOT (hashed SubPlan 1)) Rows Removed by Filter: 632680 Buffers: shared hit&＃61;8850, temp read&＃61;2944 written&＃61;2954 SubPlan 1 -> GroupAggregate (cost&＃61;124581.29..135581.29 rows&＃61;100000 width&＃61;14) (actual time&＃61;732.049..1390.277 rows&＃61;632680 loops&＃61;1) Output: max(test1_1.ctid), test1_1.c1, test1_1.c2 Group Key: test1_1.c1, test1_1.c2 Buffers: shared hit&＃61;4425, temp read&＃61;2944 written&＃61;2954 -> Sort (cost&＃61;124581.29..127081.29 rows&＃61;1000000 width&＃61;14) (actual time&＃61;732.035..1015.066 rows&＃61;1000000 loops&＃61;1) Output: test1_1.c1, test1_1.c2, test1_1.ctid Sort Key: test1_1.c1, test1_1.c2 Sort Method: external merge Disk: 23552kB Buffers: shared hit&＃61;4425, temp read&＃61;2944 written&＃61;2954 -> Seq Scan on public.test1 test1_1 (cost&＃61;0.00..14425.00 rows&＃61;1000000 width&＃61;14) (actual time&＃61;0.010..138.017 rows&＃61;1000000 loops&＃61;1) Output: test1_1.c1, test1_1.c2, test1_1.ctid Buffers: shared hit&＃61;4425 Planning Time: 0.176 ms Execution Time: 2304.495 ms (22 rows)

–方法2:

bill&＃61;# explain (analyze,verbose,timing,costs,buffers) delete from test1 where ctid &＃61; any(array( select ctid from (select row_number() over(partition by c1,c2 order by ctid) as rn, ctid from test1) t where t.rn<>1)); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------- Delete on public.test1 (cost&＃61;100501.36..100514.46 rows&＃61;10 width&＃61;6) (actual time&＃61;1092.431..1092.431 rows&＃61;0 loops&＃61;1) Buffers: shared hit&＃61;4430, temp read&＃61;2013 written&＃61;2019 InitPlan 1 (returns $0) -> Subquery Scan on t (cost&＃61;78357.55..100501.35 rows&＃61;629517 width&＃61;6) (actual time&＃61;1092.420..1092.420 rows&＃61;0 loops&＃61;1) Output: t.ctid Filter: (t.rn <> 1) Rows Removed by Filter: 632680 Buffers: shared hit&＃61;4430, temp read&＃61;2013 written&＃61;2019 -> WindowAgg (cost&＃61;78357.55..92592.85 rows&＃61;632680 width&＃61;22) (actual time&＃61;459.611..1042.708 rows&＃61;632680 loops&＃61;1) Output: row_number() OVER (?), test1_1.ctid, test1_1.c1, test1_1.c2 Buffers: shared hit&＃61;4430, temp read&＃61;2013 written&＃61;2019 -> Sort (cost&＃61;78357.55..79939.25 rows&＃61;632680 width&＃61;14) (actual time&＃61;459.598..616.859 rows&＃61;632680 loops&＃61;1) Output: test1_1.ctid, test1_1.c1, test1_1.c2 Sort Key: test1_1.c1, test1_1.c2, test1_1.ctid Sort Method: external merge Disk: 16104kB Buffers: shared hit&＃61;4430, temp read&＃61;2013 written&＃61;2019 -> Seq Scan on public.test1 test1_1 (cost&＃61;0.00..10751.80 rows&＃61;632680 width&＃61;14) (actual time&＃61;0.006..83.917 rows&＃61;632680 loops&＃61;1) Output: test1_1.ctid, test1_1.c1, test1_1.c2 Buffers: shared hit&＃61;4425 -> Tid Scan on public.test1 (cost&＃61;0.01..13.11 rows&＃61;10 width&＃61;6) (actual time&＃61;1092.429..1092.429 rows&＃61;0 loops&＃61;1) Output: test1.ctid TID Cond: (test1.ctid &＃61; ANY ($0)) Buffers: shared hit&＃61;4430, temp read&＃61;2013 written&＃61;2019 Planning Time: 0.204 ms Execution Time: 1096.153 ms (25 rows)

–方法3:

bill&＃61;# do language plpgsql $$ bill$# declare bill$# v_rec record; bill$# v_c1 int; bill$# v_c2 int; bill$# cur1 cursor for select c1,c2 from test1 order by c1,c2,ctid for update; bill$# begin bill$# for v_rec in cur1 loop bill$# if v_rec.c1 &＃61; v_c1 and v_rec.c2&＃61;v_c2 then bill$# delete from test1 where current of cur1; bill$# end if; bill$# v_c1 :&＃61; v_rec.c1; bill$# v_c2 :&＃61; v_rec.c2; bill$# end loop; bill$# end; bill$# $$; DO Time: 2320.653 ms (00:02.321) bill&＃61;# DO

推荐阅读

fetch
PHP操作MySql数据库_PHP教程

PHP操作MySql数据库_PHP教程:链接数据库$conn@mysql_connect(localhost,root,88888888)ordie(链接错误);解决中文乱码mys ... [详细]

蜡笔小新 2023-10-17 04:43:28
fetch
如何在php中将mysql查询结果赋值给变量

本文介绍了在php中将mysql查询结果赋值给变量的方法，包括从mysql表中查询count(学号)并赋值给一个变量，以及如何将sql中查询单条结果赋值给php页面的一个变量。同时还讨论了php调用mysql查询结果到变量的方法，并提供了示例代码。 ... [详细]

蜡笔小新 2023-12-12 18:22:57
fetch
php连接mysql显示数据,php连接mysql数据库的算法思想

本文目录一览：1、怎么用php显示mysql数据表数据 ... [详细]

蜡笔小新 2023-10-16 13:24:54
instance
Oracle seg,V$TEMPSEG_USAGE与Oracle排序的关系及使用方法

本文介绍了Oracle seg,V$TEMPSEG_USAGE与Oracle排序之间的关系，V$TEMPSEG_USAGE是V_$SORT_USAGE的同义词，通过查询dba_objects和dba_synonyms视图可以了解到它们的详细信息。同时，还探讨了V$TEMPSEG_USAGE的使用方法。 ... [详细]

蜡笔小新 2023-12-12 17:57:15
fetch
Python操作MySQL（pymysql模块）详解及示例代码

本文介绍了使用Python操作MySQL数据库的方法，详细讲解了pymysql模块的安装和连接MySQL数据库的步骤，并提供了示例代码。内容涵盖了创建表、插入数据、查询数据等操作，帮助读者快速掌握Python操作MySQL的技巧。 ... [详细]

蜡笔小新 2023-12-10 17:50:06
fetch
Oracle的PL/SQL编程基础技能实战一

一基础代码检查检查以bm_开头的系统初始化编码表是否有空值。与业务系统相关的编码项不能存在空值，会导致系统业务无法办理。为初始化数据表、在做测试数据和正式上线前检查。上线 ... [详细]

蜡笔小新 2023-10-17 09:38:23
fetch
python3连接外部Mysql

前提条件，已经安装过MySQL（比如说以前web开发安装过MySQL）1.安装PyMySQLpipinstallPyMySQL2.测试1i ... [详细]

蜡笔小新 2023-10-16 19:06:51
fetch
mysql 行转列（存储过程）

直接sql(自己抽离代码，无法和orderbyrand()复用)：selectshitfrom(select要累加的字段,@tt:CONCAT(@tt,,,a.要累加的 ... [详细]

蜡笔小新 2023-10-16 12:24:15
require
PostgreSQL的数据存储(十二)数据存储

4.3.2Tuple是否可以跨页面PostgreSQLusesafixedpagesize(commonly8kB),anddoesnotallowtuplestospanmult ... [详细]

蜡笔小新 2023-10-16 10:57:34
require
java 模拟get post请求_Java后台模拟发送http的get和post请求，并测试

个人学习使用：谨慎参考1Client类importcom.thoughtworks.gauge.Step;importcom.thoughtworks.gauge.T ... [详细]

蜡笔小新 2023-12-13 14:20:23
random
【openwrt】设备mt7628关于wan侧eth0.1 mac地址固定的问题

本文讨论了在openwrt-17.01版本中，mt7628设备上初始化启动时eth0的mac地址总是随机生成的问题。每次随机生成的eth0的mac地址都会写到/sys/class/net/eth0/address目录下，而openwrt-17.01原版的SDK会根据随机生成的eth0的mac地址再生成eth0.1、eth0.2等，生成后的mac地址会保存在/etc/config/network下。 ... [详细]

蜡笔小新 2023-12-12 17:47:48
random
模板引擎StringTemplate的使用方法和特点

本文介绍了模板引擎StringTemplate的使用方法和特点，包括强制Model和View的分离、Lazy-Evaluation、Recursive enable等。同时，还介绍了StringTemplate语法中的属性和普通字符的使用方法，并提供了向模板填充属性的示例代码。 ... [详细]

蜡笔小新 2023-12-11 21:45:03
fetch
无法使用fetch在服务器端读取/获取发布的数据

本文介绍了一个React Native新手在尝试将数据发布到服务器时遇到的问题，以及他的React Native代码和服务器端代码。他使用fetch方法将数据发送到服务器，但无法在服务器端读取/获取发布的数据。 ... [详细]

蜡笔小新 2023-12-11 11:26:28
random
OpenMap教程4 – 图层概述

本文介绍了OpenMap教程4中关于地图图层的内容，包括将ShapeLayer添加到MapBean中的方法，OpenMap支持的图层类型以及使用BufferedLayer创建图像的MapBean。此外，还介绍了Layer背景标志的作用和OMGraphicHandlerLayer的基础层类。 ... [详细]

蜡笔小新 2023-12-09 19:26:56
fetch
Vue模拟数据,实现路由进入商品详情页面的示例

一、路由首先需要配置路由，就是点击good组件进入goodDetail组件配置路由如下{path:goodDetail,component:goodDetail}同时在good组件中写入如下点击事件，路由中加入 ... [详细]

蜡笔小新 2023-10-17 20:25:16

Less哒哒_288

这个家伙很懒，什么也没留下！

Tags | 热门标签

RankList | 热门文章