Author: cut1089289 | Source: Internet | 2023-10-12 16:07
I'm new to R. I have a data frame of 500000 entries with patient IDs, dates, and other variables.
I want to remove any repeated patient ID (PtID) that occurs within one year of that patient's first appearance. For example:
    PtID  date
1.  1     01/01/2006
2.  2     01/01/2006
3.  1     24/02/2006
4.  4     26/03/2006
5.  1     04/05/2006
6.  1     05/05/2007
In this case I want to remove the 3rd and 5th rows and keep the 1st and 6th rows.
Can somebody help me with this, please? This is the str() output of my data, which is called final1:
str(final1)
'data.frame': 605870 obs. of 70 variables:
...
$ Date : Date, format: "2006-03-12" "2006-04-01" ...
$ PtID : int 11251 11251 11251 11251 11251 11251 11251 30938 30938 11245 ...
...
1 Answer
Here's one solution that uses plyr and lubridate. First load the packages:
require(plyr)
require(lubridate)
Next create some sample data (notice that this is a bit more straightforward than your example!)
num = 1:6
PtID = c(1,2,1,4,1,1)
date = c("01/01/2006", "01/01/2006","24/02/2006", "26/03/2006", "04/05/2006",
"05/05/2007")
dd = data.frame(PtID, date)
Now we make the date column an R date object:
# Parse the column stored in the data frame rather than the global `date` vector
dd$date = dmy(as.character(dd$date))
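If lubridate is not available, base R's as.Date can do the same conversion given an explicit format string. A minimal sketch, using the sample dates from above (the names date_chr and date_parsed are only for illustration):

```r
# Sample dates in day/month/year form, as in the example above
date_chr <- c("01/01/2006", "01/01/2006", "24/02/2006",
              "26/03/2006", "04/05/2006", "05/05/2007")

# %d = day of month, %m = month number, %Y = four-digit year
date_parsed <- as.Date(date_chr, format = "%d/%m/%Y")

class(date_parsed)          # "Date"
diff(range(date_parsed))    # span between the earliest and latest date, in days
```

Subtracting two Date objects gives a difference measured in days, which matters when the result is later compared against a one-year threshold.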
Then define a function that encodes the rule for whether a row should be kept:
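keepId = function(dates) {
  # Measure elapsed time in days explicitly, so the rule works whether
  # `dates` holds Date or POSIXct values
  days = as.numeric(difftime(dates, min(dates), units = "days"))
  # Keep the first appearance and anything more than a year (365 days) after it
  keep = (days > 365) | (dates == min(dates))
  return(keep)
}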
All that remains is to use ddply to partition the data frame by PtID and apply the rule within each group:
dd_sub = ddply(dd, c("PtID"), transform, keep = keepId(date))
dd_sub[dd_sub$keep,]
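For a data frame with several hundred thousand rows, the same rule can also be written in base R with ave, which avoids the plyr dependency and leaves the rows in their original order. This is only a sketch under a few assumptions: a "year" is taken to be 365 days, the helper names first_seen, days_since_first and dd_keep are made up for illustration, and for the real data the columns would be final1$PtID and final1$Date.

```r
# Sample data mirroring the question's example
dd <- data.frame(
  PtID = c(1, 2, 1, 4, 1, 1),
  date = as.Date(c("01/01/2006", "01/01/2006", "24/02/2006",
                   "26/03/2006", "04/05/2006", "05/05/2007"),
                 format = "%d/%m/%Y")
)

# First appearance of each patient (as a day count), repeated on every
# row belonging to that patient
first_seen <- ave(as.numeric(dd$date), dd$PtID, FUN = min)

# Days elapsed since that patient's first appearance
days_since_first <- as.numeric(dd$date) - first_seen

# Keep the first appearance and anything more than 365 days after it
dd_keep <- dd[days_since_first == 0 | days_since_first > 365, ]
dd_keep
```

On the sample data this keeps rows 1, 2, 4 and 6, which matches the outcome asked for in the question. Note that the rule compares every visit with the patient's first visit only; it does not restart the one-year window after each kept row.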