Author: 优美rosner_704 | Source: Internet | 2023-08-22 16:14
I have a question concerning data.table. I love it, but I think I was (and sometimes still am) misusing .SD, and I would appreciate some clarification about when it is worthwhile to use it in data.table.
Here are two examples where I came to think that I was misusing .SD:
The first one is as discussed here (thanks to Henry for the comment):
library(microbenchmark)
library(data.table)
DTlength <- 2000
DT <- data.table(
  id = rep(sapply(combn(LETTERS, 6, simplify = FALSE), function(x) {
    paste(x, collapse = "")
  }), each = 4)[1:DTlength],
  replicate(10, sample(1001, DTlength, replace = TRUE)),
  Answer = sample(c("Yes", "No"), DTlength, TRUE)
)
microbenchmark(
  "without SD" = {
    b <- DT[, Answer[1], by = id][, V1]
  },
  "without SD alternative" = {
    b <- DT[DT[, .I[1], by = id][, V1], Answer]
  },
  "with SD" = {
    b <- DT[, .SD[1, Answer], by = id][, V1]
  }
)
Unit: microseconds
                   expr        min         lq        mean     median         uq        max neval
             without SD    455.795    493.949    569.4979    529.847    558.564   2323.283   100
 without Sd alternative    961.231   1010.667   1160.9114   1060.513   1113.641   7783.798   100
                with SD 121217.691 123557.590 131071.5699 127495.437 130340.977 240317.227   100
.SD operations are quite slow compared to the alternatives in grouped operations. Even when you want every column of the data.table for each group, the alternative is slightly faster (although the time difference here may not be worth the loss of syntactic clarity):
microbenchmark(
  "with SD" = {b <- DT[, .SD[1], by = id]},
  "Without SD" = {b <- DT[DT[, .I[1], by = id][, V1]]}
)
Unit: milliseconds
       expr      min       lq     mean   median       uq      max neval
    with SD 1.058872 1.361436 1.560866 1.643078 1.741540 1.960206   100
 Without SD 1.067898 1.169642 1.279443 1.233437 1.348719 1.781334   100
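To make the `.I` idiom used in these benchmarks easier to follow, here is a minimal sketch on a small, hypothetical table (the `toy` data and the `row` column name are assumptions for illustration, not part of the original benchmark). It shows that `.I[1]` by group returns row numbers of the full table, and that subsetting with those row numbers gives the same first-row-per-group result as `.SD[1]`. `unique(..., by = )` is included as one more way to get the same rows.

```r
library(data.table)

# Hypothetical toy table: three groups, id is the first column
toy <- data.table(
  id     = c("a", "a", "b", "b", "b", "c"),
  V1     = c(10, 20, 30, 40, 50, 60),
  Answer = c("Yes", "No", "No", "Yes", "No", "Yes")
)

# .I holds row numbers in the full table, so .I[1] is each group's first row
first_rows <- toy[, .(row = .I[1]), by = id]$row

via_SD     <- toy[, .SD[1], by = id]   # first row per group through .SD
via_I      <- toy[first_rows]          # same rows by direct row subsetting
via_unique <- unique(toy, by = "id")   # keeps the first row of each id
```

All three results contain one row per `id`. Which one is fastest on real data would still have to be measured, as in the benchmarks above.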
The second example illustrates that you can't really use .SD to assign a new variable by reference to the rows that meet a condition within groups (or at least I didn't find a way):
DT[, .SD[V1 - V1[1] > 100][, plouf2 := Answer], by = id] # doesn't assign plouf2
DT[DT[, .I[V1 - V1[1] > 100], by = id][, V1], plouf2 := Answer] # this does
There are two situations where I found .SD useful: the very convenient DT[, lapply(.SD, fun), .SDcols = ] pattern, and the case where one wants to assign to every row of a group a value taken from a row that meets a particular condition within that group:
DT[, plouf3 := .SD[V1 - V1[1] > 100, Answer][1], by = id]
# all values are assigned, which is actually different from
DT[DT[, .I[V1 - V1[1] > 100][1], by = id][, V1], plouf2 := Answer]
# where only the values that match the condition V1-V1[1]>100 are assigned
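The `lapply(.SD, fun)` pattern mentioned above can be sketched as follows. The `toy` table, the choice of `mean` as the function, and the `num_cols` vector are assumptions made for illustration only.

```r
library(data.table)

# Hypothetical toy table with two numeric columns and one character column
toy <- data.table(
  id     = c("a", "a", "b", "b"),
  V1     = c(1, 3, 10, 20),
  V2     = c(2, 4, 30, 50),
  Answer = c("Yes", "No", "No", "Yes")
)

# .SDcols restricts .SD to the listed columns, so mean() is only applied
# to the numeric ones and Answer is left out
num_cols    <- c("V1", "V2")
group_means <- toy[, lapply(.SD, mean), by = id, .SDcols = num_cols]
```

The result has one row per `id` and one column per entry of `num_cols`, which is the kind of multi-column operation that is awkward to write without .SD.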
So my question is: are there other situations where it is necessary or worthwhile to use .SD?
Thank you in advance for the help.
1 Answer
Regarding your first question
The benchmark would only be fair if all three methods generated the same output. The "without SD alternative" method generates a different result, so let's set that one aside.
The "with SD" and "without SD" methods generate the same output, but the latter is more efficient. Here is why: when you write ... .SD[1, Answer] ..., you are basically subsetting all of the columns for the matching rows, and then performing the next operation (fetching the first value of the vector Answer) on this subset. In the "without SD" method, by contrast, you only subset one vector (not all of them) and then fetch the first value of that one vector. The unnecessary subsetting of the additional, unused columns in the "with SD" method is what makes it slow.
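As a small illustration of the point about unused columns, the sketch below checks on a hypothetical `toy` table that the two methods return the same values, and adds a third variant that keeps the .SD syntax but uses `.SDcols` so that .SD only contains the `Answer` column. The table and variable names are assumptions for illustration, and no timing claim is made here. Whether the `.SDcols` variant closes the gap would have to be benchmarked.

```r
library(data.table)

set.seed(1)
# Hypothetical toy table: three groups of four rows, two unused numeric columns
toy <- data.table(
  id     = rep(c("a", "b", "c"), each = 4),
  V1     = sample(100, 12),
  V2     = sample(100, 12),
  Answer = sample(c("Yes", "No"), 12, TRUE)
)

# Only the Answer vector is touched per group
without_SD  <- toy[, Answer[1], by = id]$V1

# .SD carries every non-grouping column (V1, V2, Answer) for each group
with_SD     <- toy[, .SD[1, Answer], by = id]$V1

# .SD is limited to Answer, so V1 and V2 are not part of it
with_SDcols <- toy[, .SD[1, Answer], by = id, .SDcols = "Answer"]$V1
```

All three vectors hold the first `Answer` of each group; they differ only in how much data is handled per group.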
Regarding your second question
This command does not assign the values to DT:
DT[, .SD[V1 - V1[1] > 100][, plouf2 := Answer], by = id]
The reason is that the .SD operator is a one-way operator: if you change something in the subset that .SD gives you, the change is not applied back to the larger data.table but only to the in-memory copy of the subset. Strictly speaking, it is not accurate to call it an in-memory copy, because .SD does not actually copy the data (it just points to the relevant portion of the memory that holds the subset of interest), but the point is that assignments to it will only be applied to this in-memory pointer and not to the original underlying data.
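The behaviour described above can be checked with a minimal sketch. The `toy` table below is hypothetical and chosen so that every group has at least one row meeting the condition. The same expression as in the question is run on it, and the column names are inspected afterwards.

```r
library(data.table)

# Hypothetical toy table: each group has at least one row with V1 - V1[1] > 100
toy <- data.table(
  id     = c("a", "a", "a", "b", "b"),
  V1     = c(1, 150, 300, 5, 200),
  Answer = c("Yes", "No", "Yes", "No", "Yes")
)

# := acts on the per-group subset produced by .SD[...], not on toy itself;
# the per-group results are combined and returned as a new table
res <- toy[, .SD[V1 - V1[1] > 100][, plouf2 := Answer], by = id]

in_original <- "plouf2" %in% names(toy)  # is the column in the original table?
in_result   <- "plouf2" %in% names(res)  # is the column in the returned table?
```

Here `in_original` is `FALSE` and `in_result` is `TRUE`: the new column only exists in the table returned by the grouped expression.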
Note: You could argue, then, that it should not support assignments whatsoever. I don't know what Matt Dowle thinks, but in my humble opinion, the assignment is actually a useful feature! For instance:
DT.2 <- DT[, .SD[V1 - V1[1] > 100][, plouf2 := Answer], by = id]
This way I have a very short, highly readable piece of code that generates the output I want and stores it in a new data.table without modifying the original data.table! Any other way I can think of to generate this exact output without using .SD and without touching the original data.table involves much longer code.
Regarding your last question
.SD is useful when you want to deal with many or all columns of a data.table, rather than just one or a few. (This is why the "with SD" method you used in the first part is not an appropriate way to do what you want to do.) The examples provided in "What does .SD stand for in data.table in R" are very useful for demonstrating when .SD can be handy. In my opinion, the main advantage of .SD is not the efficiency with which the code runs, but rather the efficiency with which you can turn a concept into R code, and the readability of that piece of code.
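As one illustration of the "many or all columns" case, the sketch below keeps, for each group, the entire row where `V1` is largest. The `toy` table and the choice of `which.max(V1)` are assumptions for illustration. The point is only that a single short expression returns every column of the selected row.

```r
library(data.table)

# Hypothetical toy table with several columns per row
toy <- data.table(
  id     = c("a", "a", "b", "b", "b"),
  V1     = c(3, 9, 7, 2, 5),
  V2     = c(10, 20, 30, 40, 50),
  Answer = c("Yes", "No", "Yes", "Yes", "No")
)

# For each id, keep the whole row (all columns) where V1 is largest
top_rows <- toy[, .SD[which.max(V1)], by = id]
```

Because .SD stands for all non-grouping columns at once, the expression stays the same no matter how many columns the table has.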