
data.table roll="nearest" returns multiple results


I'm attempting to use data.table to match the nearest decimal value in a vector but am running into a situation where more than one result is returned. The simplified example below returns two values, 0.1818182 0.2727273, but using a less precise value for x (e.g. 0.0275) returns a single match (0.1818182).


library(data.table)

x = 0.0275016249293408
dt = data.table(rnk = c(0, 0.0909090909090909, 
                        0.181818181818182, 0.272727272727273),
                val = c(0.0233775088495975, 0.0270831481152598, 
                        0.0275016216267234, 0.0275016249293408),
                key="val")
dt[J(x), roll="nearest"][, ifelse(is.na(val), NA_real_, rnk)]

I'm assuming the problem is related to the precision of the numeric values I'm using for this comparison. Is there a limitation to the decimal precision that can be used for a nearest match (i.e. do I need to round the data points)? Is there a better way to accomplish this nearest match?


2 Answers

#1


5  

Referring to Matt's answer, there is an easy way to use all 15 significant digits a double offers in order to select the closest matching row properly. Instead of working on the original values, one can scale them up so that all 15 significant digits lie above the 10^(-8) level, i.e. above the join tolerance. This could be done as follows:


orig_vals <- dt[,val]
scale_fact <- max(10^(trunc(log10(abs(orig_vals)))+8))
scaled_vals <- orig_vals * scale_fact
dt[,scaled_val:=scaled_vals]
setkey(dt,scaled_val)

Now, performing the rolling join


scaled_x <- x*scale_fact
dt[J(scaled_x), roll="nearest"][, ifelse(is.na(val), NA_real_, rnk)]

# [1] 0.2727273

yields - as desired - a single value.

If only one row should be selected even when two key values are identical, the mult="first" argument can be added to the above data.table call.
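Why the scaling helps can be checked in base R without calling data.table. The sketch below assumes the tolerance is sqrt(.Machine$double.eps) and that it is compared against the absolute difference between key values; the variable names n_before and n_after are only for illustration.

```r
# Assumed join tolerance (the value quoted for data.table v1.8.10)
tol <- sqrt(.Machine$double.eps)

# Key values and lookup value from the question
val <- c(0.0233775088495975, 0.0270831481152598,
         0.0275016216267234, 0.0275016249293408)
x <- 0.0275016249293408

# Same scale factor as in the answer above
scale_fact <- max(10^(trunc(log10(abs(val))) + 8))

# Number of keys closer to x than the tolerance, before and after scaling
n_before <- sum(abs(val - x) < tol)
n_after  <- sum(abs(val * scale_fact - x * scale_fact) < tol)

print(c(before = n_before, after = n_after))
```

Before scaling, two keys sit within the tolerance of x; after scaling by 1e7 only the exact match remains, which is why the rolling join then returns a single row.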


#2


8  

Yes, data.table automatically applies a tolerance when joining and grouping numeric columns. The tolerance in v1.8.10 is sqrt(.Machine$double.eps) == 1.490116e-08. This comes directly from ?base::all.equal.
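For the data in the question, the size of that tolerance can be compared with the gaps between key values in base R. This is only a sketch: it assumes the tolerance is applied to the absolute difference between values, and it does not call data.table itself.

```r
# Tolerance quoted above for data.table v1.8.10
tol <- sqrt(.Machine$double.eps)

# The val column and the lookup value from the question
val <- c(0.0233775088495975, 0.0270831481152598,
         0.0275016216267234, 0.0275016249293408)
x <- 0.0275016249293408

# Absolute distance from x to each key value
gap <- abs(val - x)

# Assumption: a key "matches" x when the absolute gap is below tol
within_tol <- gap < tol

print(format(tol, digits = 7))
print(signif(gap, 3))
print(which(within_tol))
```

Rows 3 and 4 are about 3.3e-09 apart, which is smaller than the tolerance, so both count as matching x under this assumption.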


To illustrate, consider grouping :


> dt
          rnk        val
1: 0.00000000 0.02337751
2: 0.09090909 0.02708315
3: 0.18181818 0.02750162
4: 0.27272727 0.02750162

> dt[,.N,by=val]
          val N
1: 0.02337751 1
2: 0.02708315 1
3: 0.02750162 2    # one group, size two
>

When you joined using dt[J(x), roll="nearest"], that x value matched to within tolerance and you got the group it matched to, as usual when a matching value occurs in a rolling join. roll="nearest" only applies to the values that don't match, outside tolerance.
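The behaviour described above can be modelled in a few lines of base R. The helper nearest_with_tolerance is hypothetical and only mimics the description (match within tolerance first, fall back to nearest otherwise); it is not data.table's actual implementation, and the absolute-difference comparison is an assumption.

```r
# Hypothetical helper: a rough model of the described join behaviour
nearest_with_tolerance <- function(keys, x, tol = sqrt(.Machine$double.eps)) {
  gap <- abs(keys - x)
  matched <- which(gap < tol)
  if (length(matched) > 0) {
    # x matches within tolerance: return the whole matching group
    return(matched)
  }
  # No match within tolerance: fall back to the single nearest key
  which.min(gap)
}

val <- c(0.0233775088495975, 0.0270831481152598,
         0.0275016216267234, 0.0275016249293408)

precise_rows <- nearest_with_tolerance(val, 0.0275016249293408)
coarse_rows  <- nearest_with_tolerance(val, 0.0275)

print(precise_rows)
print(coarse_rows)
```

With the precise x, both rows 3 and 4 come back because they form one group within tolerance. With x = 0.0275 nothing matches, so the nearest rule picks row 3 alone, which mirrors the two results reported in the question.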


data.table considers the values in rows 3 and 4 of val to be equal. The thinking behind this is for convenience, since most of the time key values are really a fixed precision such as prices ($1.23) or recorded measurements to a specified precision (1.234567). We'd like to join and group such numerics even after multiplying them for example, without needing to code for machine accuracy ourselves. And we'd like to avoid confusion when numeric data displays as though it's equal in a table, but isn't due to very tiny differences in the bit representation.
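A small base R illustration of the kind of tiny bit-level difference meant here, using the familiar 0.1 + 0.2 case rather than anything specific to data.table. The tolerance comparison is a sketch of the idea, assuming an absolute-difference test.

```r
# Two numbers that print the same at default precision but differ in their bits
a <- 0.1 + 0.2
b <- 0.3

# Strict floating point equality fails
exact <- a == b

# Comparison within the tolerance discussed above succeeds
within <- abs(a - b) < sqrt(.Machine$double.eps)

print(c(a = a, b = b))
print(c(exact = exact, within = within))
```

Both values display as 0.3, yet only the tolerance-based comparison treats them as equal, which is the convenience the answer describes for joining and grouping.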


See ?unique.data.table for this example :


DT = data.table(a=tan(pi*(1/4 + 1:10)), b=rep(1,10))   # example from ?all.equal
length(unique(DT$a))         # 10 strictly unique floating point values
all.equal(DT$a,rep(1,10))    # TRUE, all within tolerance of 1.0
DT[,which.min(a)]            # row 10, the strictly smallest floating point value
identical(unique(DT),DT[1])  # TRUE, stable within tolerance
identical(unique(DT),DT[10]) # FALSE

data.table is also stable within tolerance; i.e., when you group by a numeric column, the original order of the items within each group is maintained as usual.


> dt$val[3] < dt$val[4]
[1] TRUE
> dt[, row:=1:4]  # add a row number to illustrate
> dt[, list(.N, list(row)), by=val]
          val N  V2
1: 0.02337751 1   1
2: 0.02708315 1   2
3: 0.02750162 2 3,4
> dt[3:4, val:=rev(val)]   # swap the two values around
> dt$val[3] > dt$val[4]
[1] TRUE
> dt[, list(.N, list(row)), by=val]
          val N  V2
1: 0.02337751 1   1
2: 0.02708315 1   2
3: 0.02750162 2 3,4    # same result, consistent. stable within tolerance
