I've been trying to make an experiment to see if the local variables in functions are stored on a stack.
我一直在尝试做一个实验,看看函数中的局部变量是否存储在堆栈中。
So I wrote a little performance test
我写了一个性能测试
function test(fn, times){
var i = times;
var t = Date.now()
while(i--){
fn()
}
return Date.now() - t;
}
ene
function straight(){
var a = 1
var b = 2
var c = 3
var d = 4
var e = 5
a = a * 5
b = Math.pow(b, 10)
c = Math.pow(c, 11)
d = Math.pow(d, 12)
e = Math.pow(e, 25)
}
function inversed(){
var a = 1
var b = 2
var c = 3
var d = 4
var e = 5
e = Math.pow(e, 25)
d = Math.pow(d, 12)
c = Math.pow(c, 11)
b = Math.pow(b, 10)
a = a * 5
}
I expected to get inversed function work much faster. Instead an amazing result came out.
我期望逆函数能更快地工作。相反,一个惊人的结果出来了。
Untill I test one of the functions it runs 10 times faster than after testing the second one.
直到我测试其中一个函数,它运行速度比测试第二个函数快10倍。
Example:
例子:
> test(straight, 10000000)
30
> test(straight, 10000000)
32
> test(inversed, 10000000)
390
> test(straight, 10000000)
392
> test(inversed, 10000000)
390
Same behaviour when tested in alternative order.
同样的行为,当以不同的顺序测试时。
> test(inversed, 10000000)
25
> test(straight, 10000000)
392
> test(inversed, 10000000)
394
I've tested it both in the Chrome browser and in Node.js and I've got absolutely no clue why would it happen. The effect lasts till I refresh the current page or restart Node REPL.
我在Chrome浏览器和Node上都测试过。我完全不知道为什么会这样。效果持续到刷新当前页面或重新启动节点REPL。
What could be a source of such significant (~12 times worse) performance?
是什么原因导致了如此显著的(甚至是糟糕的12倍)性能呢?
PS. Since it seems to work only in some environemnts please write the environment You're using to test it.
由于它似乎只适用于某些环境,请编写用于测试的环境。
Mine were:
我的是:
OS: Ubuntu 14.04
Node v0.10.37
Chrome 43.0.2357.134 (Official Build) (64-bit)
操作系统:Ubuntu 14.04 Node v0.10.37 Chrome 43.0.2357.134(官方版本)(64位)
/Edit
On Firefox 39 it takes ~5500 ms for each test regardless of the order. It seems to occur only on specific engines.
/在firefox39上进行编辑,不管顺序如何,每次测试大约需要5500毫秒。它似乎只发生在特定的引擎上。
/Edit2
Inlining the function to the test function makes it run always the same time.
Is it possible that there is an optimization that inlines the function parameter if it's always the same function?
/Edit2将函数嵌入到测试函数中,使它始终运行在相同的时间。如果函数参数总是相同的,是否可能有一个优化来嵌入函数参数?
99
Once you call test
with two different functions fn()
callsite inside it becomes megamorphic and V8 is unable to inline at it.
一旦你调用了两个不同功能的测试,fn()的callsite就变成了megamorphic,而V8则无法内联。
Function calls (as opposed to method calls o.m(...)
) in V8 are accompanied by one element inline cache instead of a true polymorphic inline cache.
在V8中,函数调用(与方法调用o.m(…)相反)伴随着一个元素内联缓存,而不是真正的多态内联缓存。
Because V8 is unable to inline at fn()
callsite it is unable to apply a variety of optimizations to your code. If you look at your code in IRHydra (I uploaded compilation artifacts to gist for your convinience) you will notice that first optimized version of test
(when it was specialized for fn = straight
) has a completely empty main loop.
因为V8无法在fn() callsite中内联,所以它无法对代码应用各种优化。如果您在IRHydra中查看您的代码(我将编译工件上传到了gist,以满足您的方便),您将注意到第一个优化版本的测试(当它专门用于fn = straight时)有一个完全空的主循环。
V8 just inlined straight
and removed all the code your hoped to benchmark with Dead Code Elimination optimization. On an older version of V8 instead of DCE V8 would just hoist the code out of the loop via LICM - because the code is completely loop invariant.
V8只是内联,并删除了你希望用死代码消除优化作为基准的所有代码。在旧版本的V8而不是DCE V8中,只需通过LICM将代码从循环中移除——因为代码是完全循环不变的。
When straight
is not inlined V8 can't apply these optimizations - hence the performance difference. Newer version of V8 would still apply DCE to straight
and inversed
themselves turning them into empty functions
当直线不是内联的时候,V8无法应用这些优化——因此性能不同。新版本的V8仍然会将DCE应用到直上并将其自身转化为空函数
so the performance difference is not that big (around 2-3x). Older V8 was not aggressive enough with DCE - and that would manifest in bigger difference between inlined and not-inlined cases, because peak performance of inlined case was solely result of aggressive loop-invariant code motion (LICM).
所以性能差异不是很大(大约2-3x)。老版本的V8在DCE上不够激进——这将在内联和非内联情况之间表现出更大的差异,因为内联情况的峰值性能仅仅是主动循环不变量代码运动(LICM)的结果。
On related note this shows why benchmarks should never be written like this - as their results are not of any use as you end up measuring an empty loop.
在相关的说明上,这说明了为什么基准不应该像这样写——因为它们的结果在您最终测量一个空循环时没有任何用处。
If you are interested in polymorphism and its implications in V8 check out my post "What's up with monomorphism" (section "Not all caches are the same" talks about the caches associated with function calls). I also recommend reading through one of my talks about dangers of microbenchmarking, e.g. most recent "Benchmarking JS" talk from GOTO Chicago 2015 (video) - it might help you to avoid common pitfalls.
如果您对V8中的多态及其含义感兴趣,请查看我的文章“单态是怎么回事”(有关与函数调用相关的缓存的章节“不是所有缓存都是相同的”)。我还建议大家通读我的一篇关于微基准测试的危险的演讲,例如最近的“标杆JS”演讲(视频)——它可能会帮助你避免常见的陷阱。
17
You're misunderstanding the stack.
你误解了堆栈。
While the "real" stack indeed only has the Push
and Pop
operations, this doesn't really apply for the kind of stack used for execution. Apart from Push
and Pop
, you can also access any variable at random, as long as you have its address. This means that the order of locals doesn't matter, even if the compiler doesn't reorder it for you. In pseudo-assembly, you seem to think that
虽然“真实”堆栈实际上只有Push和Pop操作,但这并不适用于用于执行的堆栈类型。除了Push和Pop,您还可以随机访问任何变量,只要您有它的地址。这意味着,即使编译器没有为您重新排序,但局部变量的顺序并不重要。在伪汇编中,您似乎这样认为
var x = 1;
var y = 2;
x = x + 1;
y = y + 1;
translates to something like
转化为类似
push 1 ; x
push 2 ; y
; get y and save it
pop tmp
; get x and put it in the accumulator
pop a
; add 1 to the accumulator
add a, 1
; store the accumulator back in x
push a
; restore y
push tmp
; ... and add 1 to y
In truth, the real code is more like this:
事实上,真正的代码是这样的:
push 1 ; x
push 2 ; y
add [bp], 1
add [bp+4], 1
If the thread stack really was a real, strict stack, this would be impossible, true. In that case, the order of operations and locals would matter much more than it does now. Instead, by allowing random access to values on the stack, you save a lot of work for both the compilers, and the CPU.
如果线程堆栈真的是一个真实的、严格的堆栈,那么这将是不可能的,是真的。在这种情况下,经营秩序和当地居民将比现在重要得多。相反,通过允许对堆栈上的值进行随机访问,可以为编译器和CPU节省大量工作。
To answer your actual question, I'm suspecting neither of the functions actually does anything. You're only ever modifying locals, and your functions aren't returning anything - it's perfectly legal for the compiler to completely drop the function bodies, and possibly even the function calls. If that's indeed so, whatever performance difference you're observing is probably just a measurement artifact, or something related to the inherent costs of calling a function / iterating.
为了回答你的实际问题,我怀疑这两个函数都没有做任何事情。您只修改了局部变量,而您的函数没有返回任何内容——编译器完全删除函数体是完全合法的,甚至可能删除函数调用。如果确实如此,那么您所观察到的性能差异可能只是一个度量工件,或者与调用函数/迭代的固有成本相关的东西。
3
Inlining the function to the test function makes it run always the same time.
Is it possible that there is an optimization that inlines the function parameter if it's always the same function?将函数内联到测试函数,使其始终运行。如果函数参数总是相同的,是否可能有一个优化来嵌入函数参数?
Yes, this seems to be exactly what you are observing. As already mentioned by @Luaan, the compiler likely drops the bodies of your straight
and inverse
functions anyway because they are not having any side effects, but only manipulating some local variables.
是的,这似乎正是你所观察到的。正如@Luaan已经提到的,无论如何,编译器可能会丢弃直接函数和逆函数的主体,因为它们没有任何副作用,而只是操作一些局部变量。
When you are calling test(…, 100000)
for the first time, the optimising compiler realises after some iterations that the fn()
being called is always the same, and does inline it, avoiding the costly function call. All that it does now is 10 million times decrementing a variable and testing it against 0
.
当您第一次调用test(…,100000)时,经过一些迭代之后,优化编译器意识到被调用的fn()总是相同的,并且进行内联,避免了代价高昂的函数调用。它现在所做的就是1000万次对一个变量进行递减并对其进行0检验。
But when you are calling test
with a different fn
then, it has to de-optimise. It may later do some other optimisations again, but now knowing that there are two different functions to be called it cannot inline them any more.
但是当您使用不同的fn调用test时,它必须去优化。它可能稍后会再做一些其他的优化,但是现在知道有两个不同的函数可以被调用,它不能再内联它们了。
Since the only thing you're really measuring is the function call, that leads to the grave differences in your results.
因为您真正要度量的是函数调用,这将导致结果的严重差异。
An experiment to see if the local variables in functions are stored on a stack
一个实验,看看函数中的局部变量是否存储在堆栈中
Regarding your actual question, no, single variables are not stored on a stack (stack machine), but in registers (register machine). It doesn't matter in which order they are declared or used in your function.
对于您的实际问题,不,单个变量不是存储在堆栈(堆栈机器)中,而是存储在寄存器(寄存器机器)中。在函数中声明或使用它们的顺序并不重要。
Yet, they are stored on the stack, as part of so called "stack frames". You'll have one frame per function call, storing the variables of its execution context. In your case, the stack might look like this:
但是,它们作为所谓的“堆栈帧”的一部分存储在堆栈中。每个函数调用都有一个框架,存储执行上下文的变量。在您的例子中,堆栈可能是这样的:
[straight: a, b, c, d, e]
[test: fn, times, i, t]
…