I've been chasing down the cause of an intermittent crash in one of our .NET services due to an internal error in the .NET Runtime (exit code 0x80131506). The service in question doesn't perform any of the kinds of operation that are usually to blame for such errors (unsafe code, PInvoke, etc.). I've tried disabling concurrent GC as described in KB2679415, as well as switching to server GC, but the intermittent crashes persist. The issue manifests on .NET 4.7.2 and earlier versions, when compiled in debug mode.
由于.NET运行时出现内部错误(退出代码0x80131506),我一直在追逐.NET服务中间歇性崩溃的原因。有问题的服务不执行任何通常应归咎于此类错误的操作(不安全代码,PInvoke等)。我已尝试禁用并发GC,如KB2679415中所述,以及切换到服务器GC,但间歇性崩溃仍然存在。在调试模式下编译时,问题在.NET 4.7.2及更早版本中出现。
The service makes extensive use of an old version of NHibernate (2.0.1), and when I've examined crash dumps in the debugger, there is always NHibernate code in the callstack when the error occurs, though NHibernate itself is all managed code, so should not be capable of causing this kind of crash.
该服务广泛使用旧版本的NHibernate(2.0.1),当我在调试器中检查崩溃转储时,在发生错误时,callstack中总会有NHibernate代码,尽管NHibernate本身都是托管代码,所以不应该导致这种崩溃。
I have managed to reproduce the crash under a debugger and with the GC Stress Log and Heap Verification enabled, and whilst it seems to point to an issue within the JIT/GC, I'm not sure I'm interpreting the output correctly.
我已设法在调试器下重现崩溃并启用了GC压力日志和堆验证,虽然它似乎指向JIT / GC中的问题,但我不确定我是否正确解释输出。
Looking at the thread on which the crash occurs, on this occasion it's occurring at clr!JIT_Stelem_Ref
:
查看发生崩溃的线程,此时它发生在clr!JIT_Stelem_Ref:
clr!JIT_Stelem_Ref+0x18: cmp r9,qword ptr [r8] ds:aaaaaaaa`aaaaaaaa=????????????????
In this case the string of 0xaa
s appears to be the result of having HeapVerify enabled, which causes the GC fill collected memory regions, presumably for easier identification, and suggests that somehow we still have a reference to the old location of a collected/relocated object.
在这种情况下,0xaas的字符串似乎是启用HeapVerify的结果,这导致GC填充收集的内存区域,可能是为了更容易识别,并建议我们仍然有一个对收集/重新定位的对象的旧位置的引用。
Tracking back in the stack, there are plenty of 0xaaaaaaaaaaaaaaaa
entries, however these cease appearing at the method that was at the top of the call stack when the most recent GC occurred, which in this case was NHibernate.Loader.Loader.GetRow()
according to the GC stress log for the most recent GC on this thread:
追溯到堆栈中,有大量的0xaaaaaaaaaaaaaaa条目,但是当最近的GC发生时,这些停止出现在调用堆栈顶部的方法中,在这种情况下是NHibernate.Loader.Loader.GetRow()根据此线程上最新GC的GC压力日志:
(Note: I've reversed the order of the logged lines from SOS' !dumplog
output for easier reading):
(注意:我已经从SOS'!dumplog输出中颠倒了记录行的顺序以便于阅读):
2404 12445.672380360 : `GC`GCROOTS` Starting scan of Thread 000000001EF4DED0 ID = 20 {
2404 12445.672380963 : `GCROOTS` Scanning ExplicitFrame 000000001E6ED3B8 AssocMethod = 0000000000000000 frameVTable = 000007FEF365B640 (clr!RedirectedThreadFrame::`vftable')
2404 12445.672386397 : `GCROOTS` Scanning Frameless method 000007FE93F43460 (NHibernate.Loader.Loader.GetRow(System.Data.IDataReader, NHibernate.Persister.Entity.ILoadable[], NHibernate.Engine.EntityKey[], System.Object, NHibernate.Engine.EntityKey, NHibernate.LockMode[], System.Collections.IList, NHibernate.Engine.ISessionImplementor)) COntrolPC= 000007FE945E3095
2404 12445.672388208 : `GC`GCROOTS` GC Root 000000001E6ED4C0 RELOCATED 000000003B1A7708 -> 000000003AC89F08 MT = 000007FE93DDF5C8 (...)
2404 12445.672388510 : `GC`GCROOTS` GC Root 000000001E6ED4D8 RELOCATED 000000003B1A73A0 -> 000000003AC89D00 MT = 000007FEF1FD6EA8 (System.Object[])
2404 12445.672388510 : `GC`GCROOTS` GC Root 000000001E6ED4E8 RELOCATED 000000003B1A7358 -> 000000003AC89CB8 MT = 000007FE9491D7C8 (NHibernate.Engine.EntityKey)
2404 12445.672388510 : `GC`GCROOTS` GC Root 000000001E6ED4F8 RELOCATED 000000003B1A73A0 -> 000000003AC89D00 MT = 000007FEF1FD6EA8 (System.Object[])
The stack area for this method is as follows:
此方法的堆栈区域如下:
00000000`1e6ed470 000000003b1a7358 ✕
00000000`1e6ed478 000000000291e3d0
00000000`1e6ed480 0000000000000000
00000000`1e6ed488 0000000000000000
00000000`1e6ed490 000000000662a900
00000000`1e6ed498 0000000006523c80
00000000`1e6ed4a0 0000000000000000
00000000`1e6ed4a8 0000000000000000
00000000`1e6ed4b0 0000000000000000
00000000`1e6ed4b8 0000000000000000
00000000`1e6ed4c0 000000003ac89f08 ✔
00000000`1e6ed4c8 0000000000000000
00000000`1e6ed4d0 0000000006524248
00000000`1e6ed4d8 000000003ac89d00 ✔
00000000`1e6ed4e0 0000000000000000
00000000`1e6ed4e8 000000003ac89cb8 ✔
00000000`1e6ed4f0 0000000000000000
00000000`1e6ed4f8 000000003ac89d00 ✔
00000000`1e6ed500 0000000100000000
00000000`1e6ed508 0000000c0000000b
00000000`1e6ed510 0000000006621660
00000000`1e6ed518 000000001e6ed690
00000000`1e6ed520 000000001e6ed6a0
I have indicated the 4 entries mentioned in the GC stress log as relocated, which have been correctly updated with their new addresses, however the first stack entry (000000003b1a7358
- an NHibernate.Engine.EntityKey
), whilst it is one of the relocated objects has not been updated with the new address. This would of course be completely normal if this was no longer going to be used, however it is in fact about to be passed as a parameter to a call to NHibernate.Loader.Loader.InstanceNotYetLoaded()
.
我已经将GC压力日志中提到的4个条目表示为已重新定位,这些条目已使用新地址正确更新,但是第一个堆栈条目(000000003b1a7358 - 一个NHibernate.Engine.EntityKey),而它是已重新定位的对象之一没有使用新地址更新。如果不再使用它,这当然是完全正常的,但实际上它将作为参数传递给对NHibernate.Loader.Loader.InstanceNotYetLoaded()的调用。
InstanceNotYetLoaded()
takes 9 parameters (plus this
), and I have marked where each of these is loaded into the stack/register on the following assembly listing. I have also included the relevant output from SOS' !gcinfo
as it relates to each of the parameters on the stack:
InstanceNotYetLoaded()接受9个参数(加上这个),并且我已经在下面的汇编列表中标记了每个参数加载到堆栈/寄存器中的位置。我还包含了SOS'!gcinfo的相关输出,因为它与堆栈中的每个参数有关:
Param Address Instruction GC Info
000007fe`945e3071 mov r9,qword ptr [rbp-38h]
P4> 000007fe`945e3075 mov qword ptr [rsp+20h],r9
000007fe`945e307a mov r9d,dword ptr [rbp-18h] +sp+20
000007fe`945e307e mov rcx,qword ptr [rbp+40h]
000007fe`945e3082 cmp r9,qword ptr [rcx+8]
000007fe`945e3086 jb 000007fe`945e308d
000007fe`945e3088 call clr!JIT_RngChkFail
000007fe`945e308d lea rcx,[rcx+r9*8+10h] -sp+20
000007fe`945e3092 mov r9,qword ptr [rcx]
-- GC Occurred Here --
P5> 000007fe`945e3095 mov qword ptr [rsp+28h],r9
000007fe`945e309a mov r9,qword ptr [rbp+38h] +sp+28
P6> 000007fe`945e309e mov qword ptr [rsp+30h],r9
000007fe`945e30a3 mov r9,qword ptr [rbp+30h] +sp+30
P7> 000007fe`945e30a7 mov qword ptr [rsp+38h],r9
000007fe`945e30ac mov r9,qword ptr [rbp+48h] +sp+38
P8> 000007fe`945e30b0 mov qword ptr [rsp+40h],r9
000007fe`945e30b5 mov r9,qword ptr [rbp+50h] +sp+40
P9> 000007fe`945e30b9 mov qword ptr [rsp+48h],r9
000007fe`945e30be mov r9d,dword ptr [rbp-18h] +sp+48
000007fe`945e30c2 mov rcx,qword ptr [rbp+20h]
000007fe`945e30c6 cmp r9,qword ptr [rcx+8]
000007fe`945e30ca jb 000007fe`945e30d1
000007fe`945e30cc call clr!JIT_RngChkFail
000007fe`945e30d1 lea rcx,[rcx+r9*8+10h] -sp+48 -sp+40 -sp+38 -sp+30 -sp+28
P3> 000007fe`945e30d6 mov r9,qword ptr [rcx]
this> 000007fe`945e30d9 mov rcx,qword ptr [rbp+10h]
P1> 000007fe`945e30dd mov rdx,qword ptr [rbp+18h]
P2> 000007fe`945e30e1 mov r8d,dword ptr [rbp-18h]
000007fe`945e30e5 call InstanceNotYetLoaded(...)
The GC just prior to the crash occurred at 000007fe945e3095
, which is after parameter 4 is loaded onto the stack (at 000007fe945e3075
), but also after this stack entry has become dead (at 000007fe945e308d
) according to the GC Info, which would explain why the GC relocate phase didn't update this reference.
崩溃之前的GC发生在000007fe945e3095,这是在参数4加载到堆栈之后(在000007fe945e3075),但也是在此堆栈条目已经死亡(在000007fe945e308d)之后根据GC信息,这将解释为什么GC重定位阶段未更新此引用。
It also looks like the GC Info for parameters 5-9 is also incorrectly marking them as dead too early, and perhaps tellingly in both cases they are being marked dead immediately after what looks like an array index range check.
它看起来像参数5-9的GC Info也错误地将它们标记为过早死亡,并且可能在两种情况下都表明它们在看起来像数组索引范围检查之后被立即标记为死。
This to me looks like a JIT bug with the lifetimes of these stack parameters being incorrectly tracked. Is this analysis correct, if so where best to report if. If it's not a JIT bug, what am I missing that could explain these unexpected crashes on purely managed code?
这对我来说看起来像一个JIT错误,这些堆栈参数的生命周期被错误地跟踪。这种分析是否正确,如果是这样,最好报告是否。如果它不是JIT错误,我错过了什么可以解释纯托管代码上的这些意外崩溃?
Edit:I believe the following snippet will reproduce the issue, at least as far as generating the bad GC info in debug mode.
我相信以下代码片段将重现此问题,至少就调试模式中生成错误的GC信息而言。
public void Repro(int p1, object p2, object p3, object p4, object[] p5)
{
// Incorrect GC Info generated for this call
ReproHelper(p1, p2, p3, p4, p5[p1]);
}
public void ReproHelper(int p1, object p2, object p3, object p4, object p5)
{
Console.WriteLine(p1);
Console.WriteLine(p2);
Console.WriteLine(p3);
Console.WriteLine(p4);
Console.WriteLine(p5);
}
In essence, there must be a method call to a method that:
本质上,必须有一个方法调用方法:
需要在堆栈上传递至少2个参数(即,实例方法至少有5个参数)。
堆栈上传递的第二个参数(参数5)必须是数组访问的结果。
When these criteria are met, the 4th parameter is loaded onto the stack for the call, and the stack entry is correctly marked as containing a reference. However in determining the value for parameter 5, an array index range check is made, and after this occurs the stack entry for parameter 4 is marked as dead.
满足这些条件时,第4个参数将加载到堆栈中以进行调用,并且堆栈条目被正确标记为包含引用。但是,在确定参数5的值时,将进行数组索引范围检查,在此之后,参数4的堆栈条目将标记为dead。
If a GC occurs after the range check but before the actual call takes place, and the GC results in the object that was being passed as parameter 4 being relocated, when the method resumes, the call will pass the old (invalid) address to parameter 4, rather than the new one.
如果在范围检查之后但在实际调用发生之前发生GC,并且GC导致在重新定位参数4时传递的对象,则在方法恢复时,调用将旧的(无效)地址传递给参数4,而不是新的。
3
While it is not an answer to the problem, I treat it as such because I confirm it is a bug and should be fixed by the .NET team.
虽然它不是问题的答案,但我认为它是因为我确认它是一个错误,应该由.NET团队修复。
When running your snippet on .NET Framework 4.7.1 (clrjit.dll version 4.7.2xxx), proper GCInfo was generated (and +sp+20
is in fact written to only just before ReproHelper
call):
在.NET Framework 4.7.1(clrjit.dll版本4.7.2xxx)上运行代码段时,生成了正确的GCInfo(并且+ sp + 20实际上只是在ReproHelper调用之前写入):
00007ffb`99450630 55 push rbp
00007ffb`99450631 4883ec40 sub rsp,40h
00000003 is a safepoint:
00007ffb`99450635 488d6c2440 lea rbp,[rsp+40h]
00007ffb`9945063a 33c0 xor eax,eax
00007ffb`9945063c 488945f8 mov qword ptr [rbp-8],rax
00007ffb`99450640 48894d10 mov qword ptr [rbp+10h],rcx
00007ffb`99450644 895518 mov dword ptr [rbp+18h],edx
00007ffb`99450647 4c894520 mov qword ptr [rbp+20h],r8
00007ffb`9945064b 4c894d28 mov qword ptr [rbp+28h],r9
interruptible
+rbp+28 +rbp+20 +rbp+10 +rbp-8
00007ffb`9945064f 833d3a3fefff00 cmp dword ptr [00007ffb`99344590],0
00007ffb`99450656 7405 je 00007ffb`9945065d
00007ffb`99450658 e863eaab5f call clr!JIT_DbgIsJustMyCode (00007ffb`f8f0f0c0)
00007ffb`9945065d 90 nop
00007ffb`9945065e 8b5518 mov edx,dword ptr [rbp+18h]
00007ffb`99450661 4c8b4538 mov r8,qword ptr [rbp+38h]
+r8
00007ffb`99450665 413b5008 cmp edx,dword ptr [r8+8]
00007ffb`99450669 7205 jb 00007ffb`99450670
-rbp-8
00007ffb`9945066b e8f015ac5f call clr!JIT_RngChkFail (00007ffb`f8f11c60)
-r8
00007ffb`99450670 488b5538 mov rdx,qword ptr [rbp+38h]
+rdx
00007ffb`99450674 448b4518 mov r8d,dword ptr [rbp+18h]
00007ffb`99450678 4d63c0 movsxd r8,r8d
00007ffb`9945067b 4a8b54c210 mov rdx,qword ptr [rdx+r8*8+10h]
00007ffb`99450680 488955f8 mov qword ptr [rbp-8],rdx
+rbp-8
00007ffb`99450684 488b55f8 mov rdx,qword ptr [rbp-8]
00007ffb`99450688 4889542428 mov qword ptr [rsp+28h],rdx
+sp+28
00007ffb`9945068d 8b5518 mov edx,dword ptr [rbp+18h]
-rdx
00007ffb`99450690 4c8b4520 mov r8,qword ptr [rbp+20h]
+r8
00007ffb`99450694 4c8b4d28 mov r9,qword ptr [rbp+28h]
+r9
00007ffb`99450698 488b4d30 mov rcx,qword ptr [rbp+30h]
+rcx
00007ffb`9945069c 48894c2420 mov qword ptr [rsp+20h],rcx
+sp+20
00007ffb`994506a1 488b4d10 mov rcx,qword ptr [rbp+10h]
-rbp-8
But after upgrading to .NET Framework 4.7.2 (clrjit.dll version 4.7.3062), it is no longer correct (+sp+20
is written to before array index range check, properly set but out of a sudden unset afterwards, while still used in the ReproHelper
call):
但升级到.NET Framework 4.7.2(clrjit.dll版本4.7.3062)后,它不再正确(+ sp + 20写入数组索引范围检查之前,正确设置但之后突然未设置,而仍然在ReproHelper调用中使用):
00007ffe`62290630 55 push rbp
00007ffe`62290631 4883ec30 sub rsp,30h
00007ffe`62290635 488d6c2430 lea rbp,[rsp+30h]
00000007 is a safepoint:
00007ffe`6229063a 48894d10 mov qword ptr [rbp+10h],rcx
00007ffe`6229063e 895518 mov dword ptr [rbp+18h],edx
00007ffe`62290641 4c894520 mov qword ptr [rbp+20h],r8
00007ffe`62290645 4c894d28 mov qword ptr [rbp+28h],r9
interruptible
+rbp+28 +rbp+20 +rbp+10
00007ffe`62290649 833d483fefff00 cmp dword ptr [00007ffe`62184598],0
00007ffe`62290650 7405 je 00007ffe`62290657
00007ffe`62290652 e869f7aa5f call clr!TranslateSecurityAttributes+0x857b0 (00007ffe`c1d3fdc0) (JitHelp: CORINFO_HELP_DBG_IS_JUST_MY_CODE)
00007ffe`62290657 90 nop
00007ffe`62290658 488b4d30 mov rcx,qword ptr [rbp+30h]
+rcx
00007ffe`6229065c 48894c2420 mov qword ptr [rsp+20h],rcx
+sp+20
00007ffe`62290661 8b4d18 mov ecx,dword ptr [rbp+18h]
-rcx
00007ffe`62290664 488b5538 mov rdx,qword ptr [rbp+38h]
+rdx
00007ffe`62290668 483b4a08 cmp rcx,qword ptr [rdx+8]
00007ffe`6229066c 7205 jb 00007ffe`62290673
00007ffe`6229066e e8ed22ab5f call clr!TranslateSecurityAttributes+0x88350 (00007ffe`c1d42960) (JitHelp: CORINFO_HELP_RNGCHKFAIL)
-sp+20
00007ffe`62290673 488d54ca10 lea rdx,[rdx+rcx*8+10h]
-rdx +rdx(interior)
00007ffe`62290678 488b0a mov rcx,qword ptr [rdx]
+rcx
00007ffe`6229067b 48894c2428 mov qword ptr [rsp+28h],rcx
+sp+28
00007ffe`62290680 488b4d10 mov rcx,qword ptr [rbp+10h]
00007ffe`62290684 8b5518 mov edx,dword ptr [rbp+18h]
-rdx(interior)
00007ffe`62290687 4c8b4520 mov r8,qword ptr [rbp+20h]
+r8
00007ffe`6229068b 4c8b4d28 mov r9,qword ptr [rbp+28h]
+r9
00007ffe`6229068f e804faffff call 00007ffe`62290098 (GCInfoBug.Bug.ReproHelper(Int32, System.Object, System.Object, System.Object, System.Object), mdToken: 0000000006000004)
-sp+28 -r9 -r8 -rcx
00007ffe`62290694 90 nop
00007ffe`62290695 90 nop
not interruptible
-rbp+28 -rbp+20 -rbp+10
00007ffe`62290696 488d6500 lea rsp,[rbp]
00007ffe`6229069a 5d pop rbp
00007ffe`6229069b c3 ret