I have a question about data.table. I love the package, but I think I have been misusing .SD at times, and I would appreciate some clarification about when it is actually worth using.


Here are two examples where I came to suspect that I was misusing .SD:


The first one is as discussed here (thanks to Henry's comment):



library(data.table)
library(microbenchmark)

DTlength <- 2000
DT <- data.table(
  # ids are 6-letter combinations, each repeated 4 times
  id = rep(sapply(combn(LETTERS, 6, simplify = FALSE), function(x) {
    paste(x, collapse = "")
  }), each = 4)[1:DTlength],
  # ten unnamed numeric columns, auto-named V1..V10
  replicate(10, sample(1001, DTlength, replace = TRUE)),
  Answer = sample(c("Yes", "No"), DTlength, TRUE)
)

microbenchmark(
  "without SD" = {
    b <- DT[, Answer[1], by = id][, V1]
  },
  "without SD alternative" = {
    b <- DT[DT[, .I[1], by = id][, V1], Answer]
  },
  "with SD" = {
    b <- DT[, .SD[1, Answer], by = id][, V1]
  }
)

Unit: microseconds
                   expr        min         lq        mean     median         uq        max neval
             without SD    455.795    493.949    569.4979    529.847    558.564   2323.283   100
 without SD alternative    961.231   1010.667   1160.9114   1060.513   1113.641   7783.798   100
                with SD 121217.691 123557.590 131071.5699 127495.437 130340.977 240317.227   100

.SD operations are quite slow compared to the alternatives in grouping operations. Even if you want the whole row for each group rather than a single column, the alternative is slightly faster (although the time difference here may not be worth the loss of clarity in the syntax):


microbenchmark(
  "with SD" = {b <- DT[, .SD[1], by = id]},
  "Without SD" = {b <- DT[DT[, .I[1], by = id][, V1]]}
)

Unit: milliseconds
       expr      min       lq     mean   median       uq      max neval
    with SD 1.058872 1.361436 1.560866 1.643078 1.741540 1.960206   100
 Without SD 1.067898 1.169642 1.279443 1.233437 1.348719 1.781334   100

The second example illustrates that you can't really use .SD to assign a new variable conditionally within groups (or at least I didn't find a way):


DT[, .SD[V1 - V1[1] > 100][, plouf2 := Answer], by = id] # doesn't assign plouf2
DT[DT[, .I[V1 - V1[1] > 100], by = id][, V1], plouf2 := Answer] # this does
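
To make the .I pattern above easier to follow, here is a minimal sketch on a made-up six-row table (the data and the intermediate name `rows` are assumptions for illustration, not part of the original benchmark). It collects the row numbers that satisfy the within-group condition and then assigns by reference on those rows only:

```r
library(data.table)

# Toy data: two groups of three rows each (illustrative values only)
DT <- data.table(
  id     = rep(c("a", "b"), each = 3),
  V1     = c(1, 50, 200, 10, 20, 30),
  Answer = c("Yes", "No", "Yes", "No", "No", "Yes")
)

# .I holds row numbers of DT; keep those where V1 exceeds the
# group's first V1 by more than 100
rows <- DT[, .I[V1 - V1[1] > 100], by = id][, V1]

# Assign by reference on the matching rows; other rows get NA
DT[rows, plouf2 := Answer]
print(DT)
```

Only the third row of group "a" meets the condition here, so it is the only row where plouf2 is filled in.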

There are two situations where I found .SD useful: the DT[, lapply(.SD, fun), .SDcols = ] idiom, which is very convenient, and the case where one wants to assign to every row of a group a single value taken from the rows that meet a particular condition within that group:

DT[, plouf3 := .SD[V1 - V1[1] > 100, Answer][1], by = id] 
# all values are assigned, which is actually different from 
DT[DT[, .I[V1 - V1[1] > 100][1], by = id][, V1], plouf2 := Answer] 
# where only the values that match the condition V1-V1[1]>100 are assigned
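
For the first of those two situations, the lapply(.SD, fun) idiom, a minimal sketch might look like the following. The toy table, the choice of mean() and the name `means` are assumptions for illustration:

```r
library(data.table)

# Toy data: two groups, two numeric columns and one character column
DT <- data.table(
  id     = c("a", "a", "b", "b"),
  V1     = c(1, 3, 10, 20),
  V2     = c(2, 4, 6, 8),
  Answer = c("Yes", "No", "No", "Yes")
)

# .SDcols restricts .SD to the named columns, so mean() is applied
# to V1 and V2 only, once per group
means <- DT[, lapply(.SD, mean), by = id, .SDcols = c("V1", "V2")]
print(means)
```

The same call scales to any number of columns by changing .SDcols, which is where the idiom saves the most typing.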

So my question is: are there other situations where .SD is necessary or worthwhile?


Thank you in advance for the help.


1 Answer



Regarding your first question


The benchmark is only fair if all three methods generate the same output. The "without SD alternative" method generates a different result, so let's set that one aside.

The "with SD" and "without SD" methods generate the same output but the latter is more efficient. Here is why: when you do ... .SD[1, Answer] ... you are basically subsetting all of the columns for the matching rows, then you are performing the next operation (which is to fetch the first value of the vector Answer) on this subset. However, in the "without SD" method, you are only subsetting one vector (not all vectors) and then fetching the first value of that one vector. The unnecessary subsetting of the additional, unused columns in the "with SD" method is what makes it slow.
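
As a rough illustration of this point, the sketch below checks on a small made-up table that both forms return the same vector, and adds a third variant that uses .SDcols so that .SD contains only the Answer column. The toy data and the variable names are assumptions, and whether the .SDcols variant closes the timing gap on your data is something to benchmark rather than take for granted:

```r
library(data.table)

set.seed(1)
# Toy data: three groups of four rows (illustrative only)
DT <- data.table(
  id     = rep(c("a", "b", "c"), each = 4),
  V1     = sample(1001, 12, replace = TRUE),
  V2     = sample(1001, 12, replace = TRUE),
  Answer = sample(c("Yes", "No"), 12, TRUE)
)

# Subsets a single vector per group
without_sd <- DT[, Answer[1], by = id][, V1]

# Subsets every non-grouping column per group, then picks Answer
with_sd <- DT[, .SD[1, Answer], by = id][, V1]

# Same idea, but .SD is limited to the one column that is needed
with_sdcols <- DT[, .SD[1, Answer], by = id, .SDcols = "Answer"][, V1]

print(identical(without_sd, with_sd))
print(identical(without_sd, with_sdcols))
```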

Regarding your second question


This command does not assign the values to DT:


DT[, .SD[V1 - V1[1] > 100][, plouf2 := Answer], by = id]

The reason is that .SD is one-way: if you change something in the subset that .SD gives you, the change is not applied back to the larger data.table, only to the subset you are holding. Strictly speaking, .SD itself is not a copy of the data (it just points at the portion of memory that holds the rows of the current group), but the point is that an assignment made on the subset stays with the subset and is not applied to the original underlying data.
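
A minimal sketch of the same one-way behaviour, on a made-up six-row table (the data and the name `res` are assumptions for illustration). It splits the chained expression into two steps, so the assignment happens on the grouped result rather than inside each group, which is a slight variation on the command above:

```r
library(data.table)

# Toy data: two groups of three rows each (illustrative values only)
DT <- data.table(
  id     = rep(c("a", "b"), each = 3),
  V1     = c(1, 50, 200, 10, 20, 30),
  Answer = c("Yes", "No", "Yes", "No", "No", "Yes")
)

# Subsetting .SD inside each group builds a new table from the matching rows
res <- DT[, .SD[V1 - V1[1] > 100], by = id]

# := modifies res by reference; DT is a different object and is untouched
res[, plouf2 := Answer]

print(names(res))
print(names(DT))
```

The new column shows up in res only, which is why the original command appears to do nothing unless its result is stored.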

Note: You could argue, then, that it should not support assignments whatsoever. I don't know what Matt Dowle thinks, but in my humble opinion, the assignment is actually a useful feature! For instance:

DT.2 <- DT[, .SD[V1 - V1[1] > 100][, plouf2 := Answer], by = id]

This way I have a very short, highly readable piece of code that generates the output I want and stores it in a new data.table without modifying the original data.table! Every other way I can think of to generate this exact output without using .SD and without touching the original data.table involves much longer code.


Regarding your last question


.SD is useful when you want to deal with many or all columns of a data.table, not just one or a few. (This is why the "with SD" method you used in the first part is not an appropriate way to do what you want to do.) The examples provided in "What does .SD stand for in data.table in R" are very useful for showing when .SD comes in handy. In my opinion, the main advantage of .SD is not the efficiency with which the code runs, but rather the efficiency with which you can turn a concept into R code, and the readability of that code.

