热门标签 | HotTags
当前位置:  开发笔记 > 编程语言 > 正文

分析函数hive计算均值_clickhouse之连续消费用户和漏斗函数案例实践

目录本篇主要内容一、clickhouse连续消费用户的sql分析二、使用漏斗函数windowFunnel,分析用户操作行为,计算转化率clickhous

目录

本篇主要内容

一、clickhouse连续消费用户的sql分析

二、使用漏斗函数 windowFunnel,分析用户操作行为,计算转化率

clickhouse连续消费用户的sql分析

使用漏斗函数 windowFunnel,分析用户操作行为,计算转化率


本篇主要内容

一、clickhouse连续消费用户的sql分析
二、使用漏斗函数 windowFunnel,分析用户操作行为,计算转化率

clickhouse连续消费用户的sql分析

1)准备好数据 在本地 linux01 /data/user.csv
a,2017-02-05,200
a,2017-02-06,300
a,2017-02-07,200
a,2017-02-08,400
a,2017-02-10,600
b,2017-02-05,200
b,2017-02-06,300
b,2017-02-08,200
b,2017-02-09,400
b,2017-02-10,600
c,2017-01-31,200
c,2017-02-01,300
c,2017-02-02,200
c,2017-02-03,400
c,2017-02-10,600
a,2017-03-01,200
a,2017-03-02,300
a,2017-03-03,200
a,2017-03-04,400
a,2017-03-05,600
2)clickhouse 中创建表 【表引擎为MergeTree ,指定主键为name,时间,排序字段为name,时间】
create table tb_shop
(name String,
ctime Date,
money Float64
)engine=MergeTree
primary key(name,ctime)
order by(name,ctime);
3)导入数据
clickhouse-client -q "INSERT INTO default.tb_shop FORMAT CSV" 也可以使用 cat user.csv | clickhouse-client -q "INSERT INTO default.tb_shop FORMAT CSV"; 导入数据
注:需要在本地linux01 /data/ 下 执行该命令
e6a69d9147d1d65549854c38e5707fbe.png


4)分组,查询每个人 对应的消费时间(数组)、编号数组
select name,
groupArray(ctime) arr,
arrayEnumerate(arr) arr2
from tb_shop
group by name;
┌─name─┬─arr─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─arr2───────────────────┐
│ b │ ['2017-02-05','2017-02-06','2017-02-08','2017-02-09','2017-02-10'] │ [1,2,3,4,5] │
│ c │ ['2017-01-31','2017-02-01','2017-02-02','2017-02-03','2017-02-10'] │ [1,2,3,4,5] │
│ a │ ['2017-02-05','2017-02-06','2017-02-07','2017-02-08','2017-02-10','2017-03-01','2017-03-02','2017-03-03','2017-03-04','2017-03-05'] │ [1,2,3,4,5,6,7,8,9,10] │
└──────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────────┘
5)使用array join 连续连接 arr和arr2 进行多行拼接
select
name,
t,
diff
from(
select
name,
arr,
arr2
from
(
select name,
groupArray(ctime) arr,
arrayEnumerate(arr) arr2
from tb_shop
group by name))
array join arr as t,
arr2 as diff;

3e2e60a69e2635b33a23285b81e9ac3b.png


6)求出ctime和编号之差,
并通过用户名和差dif 分组,聚合求出次数
select
name,
addDays(t, -diff) as dif,
count(1) as cnt
from
(
select
name,
t,
diff
from(
select
name,
arr,
arr2
from
(
select name,
groupArray(ctime) arr,
arrayEnumerate(arr) arr2
from tb_shop
group by name))
array join arr as t,
arr2 as diff) group by name,dif;

fff176ffff7da4073cc8da6a32fa53ed.png


7)再分组聚合,求出最大的连续消费天数
select
name,
max(cnt)
from(
select
name,
addDays(t, -diff) as dif,
count(1) as cnt
from
(
select
name,
t,
diff
from(
select
name,
arr,
arr2
from
(
select name,
groupArray(ctime) arr,
arrayEnumerate(arr) arr2
from tb_shop
group by name))
array join arr as t,
arr2 as diff) group by name,dif) group by name;

cb6326d844d7dc4196526ad4b5705746.png

本案例要点:

  • clickhouse中创建表一定要指定引擎。
  • MergeTree表引擎,需要指定主键,本案例中,指定主键为name和ctime,排序字段为name和ctime,方便后续数据处理。记住加上小括号,否则报错
  • 在hive中编号函数为row_number() 加上窗口函数 over(partition by name order by ctime) 就可以对每个用户组的每行数据进行编号;而在clickhouse中,arrayEnumerate(arr)可以求出为数组中的每个元素编上号,但结果也是个数组。groupArray(ctime) arr,求出的也是时间的数组,我们使用array join使其两个数组进行炸裂拼接,得到多行对应数据。
  • 编号不是目的,目的是为了得到时间和编号之间的差,方便依据其进行分组得到连续性数据。

使用漏斗函数 windowFunnel,分析用户操作行为,计算转化率

漏斗函数
现有用户行为采集日志数据
ca8874daf4f0bd1358e771dae3a88a1f.png


现需要对其进行初步采集整合到ODS原始数据层,并且需要求出uv(单一用户访问量)和 用户 事件相关的一些数据统计和计算,比如转化率(完成整个流程【最终交易】的人数占总人数的比例)
1)创建hive表,导入数据
由于所采集静态数据为json格式,不方便操作,我们将其读取,存入到hive中,再将其导出为CSV格式
准备数据

5e96c4576019d9e3599fe63c29fd0d26.png


启动hive元数据服务
hive --service metastore &

64da408fb2418406f3239c6068496251.png


启动hive客户端
hive

f9dfc58851cd14a51f02d9b98e3fe891.png


开启hive2服务
hiveserver2

b6d623a7fe6e89c2208b72c8eca01e7b.png


启动连接beeline客户端

1f0bab5dabcf169ffee455170c03df32.png


创建hive表
create table tb_log(
log string
)
partitioned by(dt string);
导入数据到hive表中
load data local inpath "/data/log.log" into table tb_log partition(dt='20201007');
+----------------------------------------------------+------------+
| tb_log.log | tb_log.dt |
+----------------------------------------------------+------------+
| {"account":"l4d9TM","appId":"cn.doitedu.app1","appVersion":"2.0","carrier":"小米移动","deviceId":"KVlABkbMqqwn","deviceType":"REDMI-5","eventId":"adClick","ip":"180.62.91.180","latitude":25.213352855705107,"longitude":100.58718180057615,"netType":"WIFI","osName":"android","osVersion":"6.5","properties":{"adCampain":"7","adId":"6","adLocation":"3","pageId":"41"},"releaseChannel":"豌豆荚","resolution":"1024*768","sessionId":"rGeXt8N0rD1","timeStamp":1602063058368} | 20201007 |
| {"account":"DxL36Dom","appId":"cn.doitedu.app1","appVersion":"2.2","carrier":"中国联通","deviceId":"0HtfcnPofgoR","deviceType":"REDMI-5","eventId":"fetchCoupon","ip":"160.171.79.244","latitude":34.21346724565028,"longitude":117.60034950493103,"netType":"4G","osName":"android","osVersion":"6.5","properties":{"couponId":"2","pageId":"393"},"releaseChannel":"百度手机助手","resolution":"1024*768","sessionId":"meO4lXHYmx1","timeStamp":1602063059899} | 20201007 |
+----------------------------------------------------+------------+
2)初步处理数据、创建原始数据层表 tb_ods_log
create table tb_ods_log as
select
if(account='' ,deviceId , account) as guid,
*
from
(select
json_tuple(
log ,
'account' ,'appId' ,'appVersion','carrier','deviceId','deviceType','eventId','ip','latitude','longitude','netType','osName','osVersion','properties','releaseChannel','resolution','sessionId' ,'timeStamp')
as
(account ,appId ,appVersion,carrier,deviceId,deviceType,eventId,ip,latitude,longitude,netType,osName,osVersion,properties,releaseChannel,resolution,sessionId ,`timeStamp`)
from
tb_log)t
where account != '' or deviceId !='' ;
select * from tb_ods_log limit 2;

5ad9ce8c0e80d943c6e25534c3f33e88.png


3)导出数据到本地 tsv 格式属性之间以tab键符号隔开
insert overwrite local directory '/data/'
row format delimited fields terminated by 't'
select * from tb_ods_log;

cbc055808f2da7b18b3a00f9b5ea3032.png


4)创建clickhouse ods表
原始数据已经初步处理了一下,将json串解析成为了字段,并添加了账号标识guid
clickhouse-client 打开客户端
建表
create table tb_ods_log(
guid String,
account String,
appId String,
appVersion String ,
carrier String,
deviceId String,
deviceType String,
eventId String,
ip String,
latitude String,
longitude String,
netType String,
osName String,
osVersion String,
properties String,
releaseChannel String,
resolution String,
sessionId String,
`timeStamp` String
)engine=MergeTree
order by guid;
导入数据
cat 000000_0 | clickhouse-client -q "insert into tb_ods_log FORMAT TSV";
select * from tb_ods_log limit 4;

3554bc644a69e0dce9fae0e5779d491d.png


5)创建需求主题表
CREATE TABLE funnel_test ( uid String, eventId String, eventTime UInt64)
ENGINE = MergeTree
PARTITION BY (uid, eventTime)
ORDER BY (uid, eventTime);
导入数据
clickhouse-client -q "insert into funnel_test FORMAT TSV" event.dada 数据如下:
uid1 event1 1551398404
uid1 event2 1551398406
uid1 event3 1551398408
uid2 event2 1551398412
uid2 event3 1551398415
uid3 event3 1551398417
uid3 event4 1551398419
6)链条漏斗sql实践
select uid,
windowFunnel(4)(toDateTime(eventTime),eventId='event1',eventId='event2',eventId='event3') as funnel
from
funnel_test
group by uid;
┌─uid──┬─funnel─┐
│ uid3 │ 0 │
│ uid1 │ 3 │
│ uid2 │ 0 │
└──────┴────────┘
解析:windowFunnel(n)表示在n毫秒内 执行了步骤事件1、事件2、事件3 的步数 ,如上述查询显示,uid1用户在4毫秒内执行到了第四步,其他人一步也没有执行,这也叫做批处理、流处理过程。
可以用来分析商城注册用户从浏览广告、查看详情、加入购物车、提交订单、支付成功步骤的一个用户执行情况,可以计算转换率,从而为后期的运营进行一个指导。

5f1e8b741b8e787512ef71cff9d1aaa3.png


7)创建clickhouse log2
create table tb_ods_log2 engine=MergeTree order by (guid,`timeStamp`)
as select * from tb_ods_log

0c8ff159162371bb39259ad54d0247c5.png


8)执行漏斗函数
select `all`,
four,
four+three,
four+three+two,
four+three+two+one
from
(
select
count(1) as `all`,
sum(if(funnel=4,1,0)) as four,
sum(if(funnel=3,1,0)) as three,
sum(if(funnel=2,1,0)) as two,
sum(if(funnel=1,1,0)) as one
from(
select guid,
windowFunnel(1000000)(toDateTime(cast(`timeStamp` as Int64)),
eventId='adShow' ,
eventId='adClick',
eventId='adCart',
eventId='submitOrder'
) as funnel
from tb_ods_log2
group by guid));

6956e027c99d2ce8742277ce6b2e45bc.png

更多学习、面试资料尽在微信公众号:Hadoop大数据开发



推荐阅读
author-avatar
sweet佳楠名人博客
这个家伙很懒,什么也没留下!
PHP1.CN | 中国最专业的PHP中文社区 | DevBox开发工具箱 | json解析格式化 |PHP资讯 | PHP教程 | 数据库技术 | 服务器技术 | 前端开发技术 | PHP框架 | 开发工具 | 在线工具
Copyright © 1998 - 2020 PHP1.CN. All Rights Reserved | 京公网安备 11010802041100号 | 京ICP备19059560号-4 | PHP1.CN 第一PHP社区 版权所有