Author: 火立者 | Source: Internet | 2024-12-26 18:37
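This is my first post on Stack Overflow, so please bear with me if it is not detailed enough.
I have a data table with two columns, a date and a group ID. For each row, I want to count how many times the same group ID appeared in the x days before that row's date. I use 30 days as the example.
Here is some sample data: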
library(data.table)
date <- c("2014-04-01", "2014-04-12", "2014-04-07", "2014-05-03", "2014-04-14", "2014-05-04", "2014-03-31", "2014-04-18", "2014-04-23", "2014-04-01")
group <- c("G", "G", "F", "G", "E", "E", "H", "H", "H", "A")
dt <- data.table(group, date)  # build the columns directly; cbind() is not needed
The data table looks like this:
group date
1: G 2014-04-01
2: G 2014-04-12
3: F 2014-04-07
4: G 2014-05-03
5: E 2014-04-14
6: E 2014-05-04
7: H 2014-03-31
8: H 2014-04-18
9: H 2014-04-23
10: A 2014-04-01
I would like to add a column showing how many times each group ID appeared in the preceding 30 days:
group date count
1: G 2014-04-01 0
2: G 2014-04-12 1
3: F 2014-04-07 0
4: G 2014-05-03 1 (the first G is not counted because it falls outside the 30-day window)
5: E 2014-04-14 0
6: E 2014-05-04 1
7: H 2014-03-31 0
8: H 2014-04-18 1
9: H 2014-04-23 2
10: A 2014-04-01 0
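I can already compute the non-windowed count with dplyr, but I am struggling to restrict it to a 30-day window. Here is the code for the non-windowed count:
library(dplyr)
dt <- dt %>%
  group_by(group) %>%
  mutate(count = row_number() - 1)  # number of earlier rows in the same group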
This gives:
group date count
1: G 2014-04-01 0
2: G 2014-04-12 1
3: F 2014-04-07 0
4: G 2014-05-03 2
5: E 2014-04-14 0
6: E 2014-05-04 1
7: H 2014-03-31 0
8: H 2014-04-18 1
9: H 2014-04-23 2
10: A 2014-04-01 0
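This is a small sample. The real data set has several million rows, so the solution needs to be efficient. Any suggestions or pointers would be much appreciated!
Solution
data.table option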
dt[, date := as.Date(date)][, count := cumsum(date <= first(date) + 30) - 1, by = group]
Result:
> dt
group date count
1: G 2014-04-01 0
2: G 2014-04-12 1
3: F 2014-04-07 0
4: G 2014-05-03 1
5: E 2014-04-14 0
6: E 2014-05-04 1
7: H 2014-03-31 0
8: H 2014-04-18 1
9: H 2014-04-23 2
10: A 2014-04-01 0
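To see how the expression above produces these numbers, it may help to evaluate it by hand for a single group. The sketch below uses group G's three dates from the sample. The variable names (`d`, `cutoff`, `within`) are only illustrative, and `d[1]` stands in for `first(date)`.

```r
# Group G's dates, in the order they appear in the sample
d <- as.Date(c("2014-04-01", "2014-04-12", "2014-05-03"))

cutoff <- d[1] + 30      # first date of the group plus 30 days: 2014-05-01
within <- d <= cutoff    # TRUE TRUE FALSE
count  <- cumsum(within) - 1
count                    # 0 1 1
```

Note that the cutoff is computed once per group from its first date, so every row of the group is compared against the same fixed date.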
dplyr option
dt %>%
  mutate(date = as.Date(date)) %>%
  group_by(group) %>%
  mutate(count = cumsum(date <= first(date) + 30) - 1) %>%
  ungroup()
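Both options above compare every row against the first date of its group plus 30 days. That reproduces the expected output for this sample, but it is not a window that moves with each row. For a group dated 2014-01-01, 2014-03-01 and 2014-03-10, for instance, it would give 0, 0, 0, whereas a per-row 30-day window gives 0, 0, 1. If a per-row window is what is needed, one possible approach is a data.table non-equi self-join. The following is a minimal sketch, not part of the original answer. The helper name `rolling_count` and the columns `lo` and `hi` are hypothetical, and the window is assumed to be `[date - 30, date)`, so rows sharing the same date are not counted.

```r
library(data.table)

# Hypothetical helper: for each row, count rows of the same group whose
# date falls in [date - window, date). Assumes columns `group` and `date`.
rolling_count <- function(dt, window = 30) {
  x <- copy(dt)[, date := as.Date(date)]           # leave the input untouched
  x[, `:=`(lo = date - window, hi = date)]         # per-row window bounds
  # Non-equi self-join: match rows of x to each row's window, then count
  # the matches per row of i (.N is 0 when nothing matches).
  x[x, on = .(group, date >= lo, date < hi), .N, by = .EACHI]$N
}

dt <- data.table(
  group = c("G", "G", "F", "G", "E", "E", "H", "H", "H", "A"),
  date  = c("2014-04-01", "2014-04-12", "2014-04-07", "2014-05-03",
            "2014-04-14", "2014-05-04", "2014-03-31", "2014-04-18",
            "2014-04-23", "2014-04-01")
)
dt[, count := rolling_count(dt)]
dt$count  # 0 1 0 1 0 1 0 1 2 0
```

Because the count depends only on dates, not on row order, the input does not need to be sorted. Whether the boundary day (exactly 30 days earlier) should be included is a choice the question leaves open; adjust `lo` if it should not be.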