Interest of .SD within data.table operations

Author: 优美rosner_704 | Source: Internet | 2023-08-22 16:14

I have a question concerning data.table. I love it, but I think I have been (and sometimes still am) misusing .SD, and I would appreciate some clarification about when it is worth using in data.table.

Here are two examples where I came to think that I was misusing .SD:

The first one is as discussed here (thanks to Henry for the comment):

library(microbenchmark)
library(data.table)

DTlength <- 2000
DT <-
  data.table(
    id = rep(sapply(combn(LETTERS, 6, simplify = FALSE), function(x) {
      paste(x, collapse = "")
    }), each = 4)[1:DTlength],
    replicate(10, sample(1001, DTlength, replace = TRUE)),
    Answer = sample(c("Yes", "No"), DTlength, TRUE)
  )

microbenchmark(
  "without SD" = {
    b <- DT[, Answer[1], by = id][, V1]
  },
  "without SD alternative" = {
    b <- DT[DT[, .I[1], by = id][, V1], Answer]
  },
  "with SD" = {
    b <- DT[, .SD[1, Answer], by = id][, V1]
  }
)

Unit: microseconds
                   expr        min         lq        mean     median         uq        max neval
             without SD    455.795    493.949    569.4979    529.847    558.564   2323.283   100
 without Sd alternative    961.231   1010.667   1160.9114   1060.513   1113.641   7783.798   100
                with SD 121217.691 123557.590 131071.5699 127495.437 130340.977 240317.227   100

Operations on .SD are quite slow compared to the alternatives in grouped operations. Even when you want the first row of every group with all of its columns, the alternative is slightly faster (although the time difference here may not be worth the loss of clarity in the syntax):

microbenchmark(
  "with SD" = {b <-DT[,.SD[1], by = id]},
  "Without SD" = {b <- DT[DT[,.I[1],by = id][,V1]]}
)

Unit: milliseconds
       expr      min       lq     mean   median       uq      max neval
    with SD 1.058872 1.361436 1.560866 1.643078 1.741540 1.960206   100
 Without SD 1.067898 1.169642 1.279443 1.233437 1.348719 1.781334   100

The second example illustrates that you can't really use .SD to assign a new variable conditionally within groups (or I didn't find the way):

DT[, .SD[V1 - V1[1] > 100][, plouf2 := Answer], by = id] # doesn't assign plouf2
DT[DT[, .I[V1 - V1[1] > 100], by = id][, V1], plouf2 := Answer] # this does
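To see why the second line does assign, it can help to look at what the inner call returns. The sketch below uses a made-up five-row table (`toy` is hypothetical, not the benchmark `DT`) so the row numbers are easy to follow:

```r
library(data.table)

# Hypothetical toy data, only for illustration
toy <- data.table(
  id     = c("a", "a", "a", "b", "b"),
  V1     = c(1, 200, 300, 5, 10),
  Answer = c("Yes", "No", "Yes", "No", "Yes")
)

# .I holds the row numbers of the current group within the full table,
# so this collects the global positions of the rows meeting the condition.
# Group "a" contributes rows 2 and 3; group "b" has no matching row.
idx <- toy[, .I[V1 - V1[1] > 100], by = id][, V1]
print(idx)

# Assign by reference on those rows only; all other rows are left as NA
toy[idx, plouf2 := Answer]
print(toy)
```

Because `idx` refers to rows of `toy` itself, the `:=` acts on the original table rather than on a per-group subset.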

There are two situations where I found .SD useful: the DT[, lapply(.SD, fun), .SDcols = ] idiom, which is very convenient, and the case where one wants to assign to every row of a group a single value taken from the rows that meet a particular condition within the group:

DT[, plouf3 := .SD[V1 - V1[1] > 100, Answer][1], by = id] 
# all values are assigned, which is actually different from 
DT[DT[, .I[V1 - V1[1] > 100][1], by = id][, V1], plouf2 := Answer] 
# where only the values that match the condition V1-V1[1]>100 are assigned
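For the first of these two uses, here is a minimal sketch of the lapply(.SD, fun) idiom with .SDcols. The table `toy` and its values are invented for illustration and are not part of the benchmark above:

```r
library(data.table)

# Hypothetical toy data, only for illustration
toy <- data.table(
  id     = c("a", "a", "b", "b"),
  V1     = c(1, 3, 10, 20),
  V2     = c(2, 4, 30, 50),
  Answer = c("Yes", "No", "No", "Yes")
)

# .SDcols limits .SD to the named columns, so mean() is applied to V1 and V2
# only and never sees the character column Answer.
means <- toy[, lapply(.SD, mean), by = id, .SDcols = c("V1", "V2")]
print(means)
```

The same call scales to any number of columns by extending the .SDcols vector, which is where the idiom saves the most typing.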

So my question: are there other situations where it is necessary or worthwhile to use .SD?

Thank you in advance for the help.

1 answer

#1

Regarding your first question

The benchmark would only be fair if all three methods generated the same output. The "without SD alternative" method generates a different result, so let's set that one aside.

The "with SD" and "without SD" methods generate the same output, but the latter is more efficient. Here is why: when you do ... .SD[1, Answer] ... you are basically subsetting all of the columns for the matching rows, and then performing the next operation (fetching the first value of the vector Answer) on this subset. In the "without SD" method, however, you subset only one vector (not all of them) and then fetch the first value of that one vector. The unnecessary subsetting of the additional, unused columns in the "with SD" method is what makes it slow.
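If .SD is still wanted for readability, one way to avoid carrying unused columns is to restrict it with .SDcols. The sketch below only checks that the two forms return the same values on a made-up table (`toy` is hypothetical); it has not been benchmarked here, so treat any speed benefit as an assumption to verify on your own data:

```r
library(data.table)

# Hypothetical toy data, only for illustration
toy <- data.table(
  id     = c("a", "a", "b", "b"),
  V1     = c(1, 2, 3, 4),
  V2     = c(5, 6, 7, 8),
  Answer = c("Yes", "No", "No", "Yes")
)

# Direct column access: only Answer is touched.
# The unnamed j result is auto-named V1 in the grouped output.
direct <- toy[, Answer[1], by = id][, V1]

# .SD restricted to Answer via .SDcols: .SD no longer carries V1 and V2
restricted <- toy[, .SD[1], by = id, .SDcols = "Answer"][, Answer]

print(identical(direct, restricted))
```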

Regarding your second question

This command does not assign the values to DT:

DT[, .SD[V1 - V1[1] > 100][, plouf2 := Answer], by = id]

The reason is that .SD is one-way: if you change something in the subset that .SD gives you, the change is not applied back to the larger data.table, only to the in-memory copy of the subset. Strictly speaking it is not fair to call it an in-memory copy, because .SD does not actually copy the data (it just points to the relevant portion of the memory that holds the subset of interest), but the point is that assignments to it are applied only to this in-memory pointer and not to the original underlying data.
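A small check of this behavior, as a sketch on a made-up table (`toy` and `res` are hypothetical names; the exact behavior of := on subsets of .SD may differ between data.table versions):

```r
library(data.table)

# Hypothetical toy data, only for illustration
toy <- data.table(
  id     = c("a", "a", "b", "b"),
  V1     = c(1, 200, 5, 300),
  Answer = c("Yes", "No", "No", "Yes")
)

# := runs on the per-group subset built from .SD, not on toy itself
res <- toy[, .SD[V1 - V1[1] > 100][, plouf2 := Answer], by = id]

print(names(toy))  # toy is left without a plouf2 column
print(names(res))  # the returned table carries plouf2
```

The new column exists only in the value returned by the call, which is why the result has to be captured if it is needed.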

Note: You could argue, then, that it should not support assignment at all. I don't know what Matt Dowle thinks, but in my humble opinion, assignment here is actually a useful feature! For instance:

DT.2 <- DT[, .SD[V1 - V1[1] > 100][, plouf2 := Answer], by = id]

This way I have a very short, highly readable piece of code that generates the output I want and stores it in a new data.table without modifying the original data.table! Any other way I can think of to generate this exact output without using .SD and without touching the original data.table involves much longer code.

Regarding your last question

.SD is useful when you want to deal with many or all columns of a data.table, rather than just one or a few. (This is why the "with SD" method you used in the first part is not an appropriate way to do what you want to do.) The examples provided in "What does .SD stand for in data.table in R" are very useful for demonstrating when .SD can be handy. In my opinion, the main advantage of .SD is not the efficiency with which the code runs, but rather the efficiency with which you can turn a concept into R code, and the readability of that piece of code.
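As one illustration of the "many or all columns" case, .SD makes it short to keep an entire row per group. This is a sketch on a made-up table (`toy` is hypothetical), not a claim about the fastest way to do it:

```r
library(data.table)

# Hypothetical toy data, only for illustration
toy <- data.table(
  id     = c("a", "a", "b", "b"),
  V1     = c(1, 200, 300, 5),
  V2     = c(10, 20, 30, 40),
  Answer = c("Yes", "No", "No", "Yes")
)

# For each id, keep the whole row (every column) where V1 is largest
top <- toy[, .SD[which.max(V1)], by = id]
print(top)
```

Writing the same thing without .SD means either listing every column by hand or going through the .I idiom shown in the question, which is faster on large data but less direct to read.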
