当前位置: 开发笔记 > 编程语言 > 正文

Hive_Hive中TopN的实现》利用row_number()函数实现分组TopN

作者：小寒风 | 来源：互联网 | 2023-08-21 12:39

参考文章：1.Hiverow_number()等用法https:www.cnblogs.comAllen-rgp9268627.html2.Hive中分组取前N个

参考文章&＃xff1a;

1. Hive row_number() 等用法

https://www.cnblogs.com/Allen-rg/p/9268627.html

2.Hive中分组取前N个值

https://www.cnblogs.com/1130136248wlxk/articles/5352145.html

TopN 功能一直是一个热门的话题&＃xff0c;下面我们看在Hive 中实现分组 TopN .

Hive 在0.11 之后引入了一个函数 ROW_NUMBER() 可以非常方便的解决此类问题

0.11 前相近功能的实现

在Hive 0.11 之前的版本是没有 topN 函数的。那么我们在0.11 之前的版本该如何实现 topN 呢&＃xff1f; 这里有一篇不错的文章是通过

Hive 的udf 去做实现的&＃xff0c;我们这里做一个参数。

https://www.cnblogs.com/1130136248wlxk/articles/5352145.html

函数使用

下面是函数的原型“

ROW_NUMBER() OVER (partition BY COLUMN_A ORDER BY COLUMN_B ASC/DESC)

首先我们构造一个场景&＃xff0c;假设提供一个成绩表&＃xff0c;有学生姓名&＃xff0c;所选科目 &＃xff0c;分数 3列需要统计每个科目的前三名。

表结构如下&＃xff1a;

表有如下的数据&＃xff1a;

&＃43;----------------&＃43;-----------------&＃43;-------------------&＃43;--&＃43; | student2.name | student2.score | student2.subject | &＃43;----------------&＃43;-----------------&＃43;-------------------&＃43;--&＃43; | a | 22.2 | english | | a | 90.2 | chinese | | a | 33.0 | english | | b | 72.2 | english | | b | 80.2 | chinese | | b | 63.0 | math | | c | 64.2 | english | | c | 85.2 | chinese | | c | 73.0 | math | | d | 24.2 | english | | d | 75.2 | chinese | | d | 43.0 | math | | e | 74.2 | english | | e | 55.2 | chinese | | e | 93.0 | math | | f | 76.2 | english | | f | 20.2 | chinese | | f | 84.0 | math | | f | 63.0 | math | &＃43;----------------&＃43;-----------------&＃43;-------------------&＃43;--&＃43;

0: jdbc:hive2://10.180.0.26:10000> select * from (select name, subject, score, row_number() over(partition by subject order by score desc) rank from student2) tmp where tmp.rank <&＃61; 3;

select * from (

select name, subject, score, row_number() over(partition by subject order by score desc) rank

from student2) tmp

where tmp.rank <&＃61; 3;

对语句简单讲解一下 &＃xff1a;内嵌的子查询根据科目分组求该分组下的排名&＃xff0c;如果有相同分数&＃xff0c;排名&＃43;1

结果&＃xff1a;

&＃43;-----------&＃43;--------------&＃43;------------&＃43;-----------&＃43;--&＃43; | tmp.name | tmp.subject | tmp.score | tmp.rank | &＃43;-----------&＃43;--------------&＃43;------------&＃43;-----------&＃43;--&＃43; | a | chinese | 90.2 | 1 | | c | chinese | 85.2 | 2 | | b | chinese | 80.2 | 3 | | f | english | 76.2 | 1 | | e | english | 74.2 | 2 | | b | english | 72.2 | 3 | | e | math | 93.0 | 1 | | f | math | 84.0 | 2 | | c | math | 73.0 | 3 | &＃43;-----------&＃43;--------------&＃43;------------&＃43;-----------&＃43;--&＃43;

执行流程

除了会使用函数之外&＃xff0c;我们应该也了解下函数的执行顺序。

在使用 row_number() over()函数时候&＃xff0c;over()里头的分组以及排序的执行晚于 where group by order by 的执行。

相近的函数

除了 Hive 的 row_number() &＃xff0c; Hive 有没有提供功能相近的函数呢。答案是有的。

有两个函数&＃xff0c; 分别是

rank() over()

dense_rank() over()

rank() over()

rank() over() 跟 row_number() over() 的功能基本相同&＃xff1a;

不同点在于&＃xff0c;分组中存在相同值的处理流程。

rank() over() 更接近于一般的排名逻辑&＃xff0c;比如有两个并列第一&＃xff0c;那么就会显示

88分第1

80分第3

参考例子&＃xff1a;

0: jdbc:hive2://10.180.0.26:10000> select * from (select name, subject, score, rank() over(partition by subject order by score desc) rank from student2) tmp where tmp.rank <&＃61; 5;

select * from (

select name, subject, score, rank() over(partition by subject order by score desc) rank

from student2) tmp

where tmp.rank <&＃61; 5;

&＃43;-----------&＃43;--------------&＃43;------------&＃43;-----------&＃43;--&＃43; | tmp.name | tmp.subject | tmp.score | tmp.rank | &＃43;-----------&＃43;--------------&＃43;------------&＃43;-----------&＃43;--&＃43; | a | chinese | 90.2 | 1 | | c | chinese | 85.2 | 2 | | b | chinese | 80.2 | 3 | | d | chinese | 75.2 | 4 | | e | chinese | 55.2 | 5 | | f | english | 76.2 | 1 | | e | english | 74.2 | 2 | | b | english | 72.2 | 3 | | c | english | 64.2 | 4 | | a | english | 33.0 | 5 | | e | math | 93.0 | 1 | | f | math | 84.0 | 2 | | c | math | 73.0 | 3 | | f | math | 63.0 | 4 | | b | math | 63.0 | 4 | &＃43;-----------&＃43;--------------&＃43;------------&＃43;-----------&＃43;--&＃43;

可以看到 f 与 b 都是 63分&＃xff0c;显示都是第 4名

dense_rank() over()

该函数与 row_number() over() 的区别也在于相同值的处理过程上。

不同于 rank() over() 的跳跃排序&＃xff0c; 即两个第2名&＃xff0c;之后是第4名&＃xff0c;

该函数是连续排序&＃xff0c;即两个相同第2名&＃xff0c;之后是第3名&＃xff0c;示例如下&＃xff1a;

select name, subject, score, dense_rank() over(partition by subject order by score desc) rank

from student2;

0: jdbc:hive2://10.180.0.26:10000> select name, subject, score, dense_rank() over(partition by subject order by score desc) rank from student2;

结果&＃xff1a;

&＃43;-------&＃43;----------&＃43;--------&＃43;-------&＃43;--&＃43; | name | subject | score | rank | &＃43;-------&＃43;----------&＃43;--------&＃43;-------&＃43;--&＃43; | a | chinese | 90.2 | 1 | | c | chinese | 85.2 | 2 | | b | chinese | 80.2 | 3 | | d | chinese | 75.2 | 4 | | e | chinese | 55.2 | 5 | | f | chinese | 20.2 | 6 | | f | english | 76.2 | 1 | | e | english | 74.2 | 2 | | b | english | 72.2 | 3 | | c | english | 64.2 | 4 | | a | english | 33.0 | 5 | | d | english | 24.2 | 6 | | a | english | 22.2 | 7 | | e | math | 93.0 | 1 | | f | math | 84.0 | 2 | | c | math | 73.0 | 3 | | f | math | 63.0 | 4 | | b | math | 63.0 | 4 | | d | math | 43.0 | 5 | &＃43;-------&＃43;----------&＃43;--------&＃43;-------&＃43;--&＃43;

好了。通过这篇文章你应该对 Hive 的分组排序功能有个非常细致的理解。

祝大家每日进步&＃xff0c;加油&＃xff01;

实际场景

我们有一张表 &＃xff0c;记录了用户的姓名&＃xff0c;性别 &＃xff0c;分数&＃xff0c;我们想找到男生与女生的前3名&＃xff0c;

表中数据如下&＃xff1a;

0: jdbc:hive2://cdh-manager:10000> select * from test_sex_rank . . . . . . . . . . . . . . . . .> ; INFO : Compiling command(queryId&＃61;hive_20190409043814_2dcc9da2-b928-4de4-bda7-87d123c116c8): select * from test_sex_rank INFO : Semantic Analysis Completed INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:test_sex_rank.name, type:string, comment:null), FieldSchema(name:test_sex_rank.sex, type:boolean, comment:null), FieldSchema(name:test_sex_rank.score, type:double, comment:null)], properties:null) INFO : Completed compiling command(queryId&＃61;hive_20190409043814_2dcc9da2-b928-4de4-bda7-87d123c116c8); Time taken: 0.117 seconds INFO : Concurrency mode is disabled, not creating a lock manager INFO : Executing command(queryId&＃61;hive_20190409043814_2dcc9da2-b928-4de4-bda7-87d123c116c8): select * from test_sex_rank INFO : Completed executing command(queryId&＃61;hive_20190409043814_2dcc9da2-b928-4de4-bda7-87d123c116c8); Time taken: 0.0 seconds INFO : OK &＃43;---------------------&＃43;--------------------&＃43;----------------------&＃43; | test_sex_rank.name | test_sex_rank.sex | test_sex_rank.score | &＃43;---------------------&＃43;--------------------&＃43;----------------------&＃43; | a1 | false | 82.0 | | a2 | false | 98.0 | | a3 | false | 67.4 | | a4 | false | 87.0 | | b1 | true | 42.0 | | b2 | true | 98.0 | | b3 | true | 77.4 | | b4 | true | 87.0 | &＃43;---------------------&＃43;--------------------&＃43;----------------------&＃43; 8 rows selected (0.179 seconds)

我们编写的SQL 如下&＃xff1a;

0: jdbc:hive2://cdh-manager:10000> select * from (select name, score, sex, rank() over(partition by sex order by score) as rank from test_sex_rank) tmp where rank <&＃61;3;

select * from

(

select name, score, sex, rank() over(partition by sex order by score) as rank

from test_sex_rank

) tmp

where rank <&＃61;3;

结果如下&＃xff1a;

&＃43;-----------&＃43;------------&＃43;----------&＃43;-----------&＃43; | tmp.name | tmp.score | tmp.sex | tmp.rank | &＃43;-----------&＃43;------------&＃43;----------&＃43;-----------&＃43; | a3 | 67.4 | false | 1 | | a1 | 82.0 | false | 2 | | a4 | 87.0 | false | 3 | | b1 | 42.0 | true | 1 | | b3 | 77.4 | true | 2 | | b4 | 87.0 | true | 3 | &＃43;-----------&＃43;------------&＃43;----------&＃43;-----------&＃43;

推荐阅读

input
技术分享：从动态网站提取站点密钥的解决方案

本文探讨了如何从动态网站中提取站点密钥，特别是针对验证码（reCAPTCHA）的处理方法。通过结合Selenium和requests库，提供了详细的代码示例和优化建议。 ... [详细]

蜡笔小新 2024-12-28 04:11:47
input
HDFS与Hive中的数据存储和管理机制

本文探讨了Hive中内部表和外部表的区别及其在HDFS上的路径映射，详细解释了两者的创建、加载及删除操作，并提供了查看表详细信息的方法。通过对比这两种表类型，帮助读者理解如何更好地管理和保护数据。 ... [详细]

蜡笔小新 2024-12-27 20:21:48
js
使用 NSTimer 实现倒计时功能

本文介绍如何使用 NSTimer 实现倒计时功能，详细讲解了初始化方法、参数配置以及具体实现步骤。通过示例代码展示如何创建和管理定时器，确保在指定时间间隔内执行特定任务。 ... [详细]

蜡笔小新 2024-12-26 19:08:19
default
Handling Null Object Encoding in OAuth 1.0a API Implementation

Explore a common issue encountered when implementing an OAuth 1.0a API, specifically the inability to encode null objects and how to resolve it. ... [详细]

蜡笔小新 2024-12-28 08:54:34
get
新浪笔试题

1:有如下一段程序：packagea.b.c;publicclassTest{privatestaticinti0;publicintgetNext(){return ... [详细]

蜡笔小新 2024-12-27 19:32:17
email
MyBatis 动态 SQL 详解与应用

本文深入探讨 MyBatis 中动态 SQL 的使用方法，包括 if/where、trim 自定义字符串截取规则、choose 分支选择、封装查询和修改条件的 where/set 标签、批量处理的 foreach 标签以及内置参数和 bind 的用法。 ... [详细]

蜡笔小新 2024-12-27 16:20:10
function
Akka BackoffSupervisor的深入解析与实践

本文详细介绍了Akka中的BackoffSupervisor机制，探讨其在处理持久化失败和Actor重启时的应用。通过具体示例，展示了如何配置和使用BackoffSupervisor以实现更细粒度的异常处理。 ... [详细]

蜡笔小新 2024-12-27 15:04:09
settings
DNN Community 和 Professional 版本的主要差异

本文详细解析了 DotNetNuke (DNN) 的两种主要版本：Community 和 Professional。通过对比两者的功能和附加组件，帮助用户选择最适合其需求的版本。 ... [详细]

蜡笔小新 2024-12-27 13:14:08
js
MySQL中枚举类型的所有可能值获取方法

本文介绍了一种在MySQL数据库中查询枚举（ENUM）类型字段所有可能取值的方法，帮助开发者更好地理解和利用这一数据类型。 ... [详细]

蜡笔小新 2024-12-27 10:36:44
io
macOS系统及其关键功能解析

本文详细介绍了macOS系统的核心组件，包括如何管理其安全特性——系统完整性保护（SIP），并探讨了不同版本的更新亮点。对于使用macOS系统的用户来说，了解这些信息有助于更好地管理和优化系统性能。 ... [详细]

蜡笔小新 2024-12-26 18:05:04
io
使用 SQLiteJDBC 和 HikariCP 实现 Java 程序连接 SQLite 数据库

本文介绍了如何通过 Maven 依赖引入 SQLiteJDBC 和 HikariCP 包，从而在 Java 应用中高效地连接和操作 SQLite 数据库。文章提供了详细的代码示例，并解释了每个步骤的实现细节。 ... [详细]

蜡笔小新 2024-12-26 17:34:42
input
Python学习笔记：使用pydoc工具查询文档

本文介绍了在Windows环境下使用pydoc工具的方法，并详细解释了如何通过命令行和浏览器查看Python内置函数的文档。此外，还提供了关于raw_input和open函数的具体用法和功能说明。 ... [详细]

蜡笔小新 2024-12-26 17:05:56
utf-8
解析JSON格式文本并处理数据

本文介绍如何使用阿里云的fastjson库解析包含时间戳、IP地址和参数等信息的JSON格式文本，并进行数据处理和保存。 ... [详细]

蜡笔小新 2024-12-26 16:06:09
default
ASP.NET MVC中Area机制的实现与优化

本文探讨了在ASP.NET MVC框架中，如何通过Area机制有效地组织和管理大规模应用程序的不同功能模块。通过合理的文件夹结构和命名规则，开发人员可以更高效地管理和扩展项目。 ... [详细]

蜡笔小新 2024-12-25 22:53:48
input
CUGB图论专题：排水系统中的最大流问题 - EK与Dinic算法解析

本题探讨如何通过最大流算法解决农场排水系统的设计问题。题目要求计算从水源点到汇合点的最大水流速率，使用经典的EK（Edmonds-Karp）和Dinic算法进行求解。 ... [详细]

蜡笔小新 2024-12-25 17:47:23

小寒风

这个家伙很懒，什么也没留下！

Tags | 热门标签

RankList | 热门文章