
Troubleshooting a Production CPU Saturation Issue

Since the problem occurred in production, all of the information below has been reproduced with my own local code for data-masking reasons. It only roughly recreates the situation at the time, but the troubleshooting approach is exactly the same.

Background

The top command shows the CPU is fully loaded:

Mem:  16333644k total,  9472968k used,  6860676k free,   165616k buffers
Swap:        0k total,        0k used,        0k free,  6665292k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND     
40433 root      20   0 7588m 112m  11m S 900.7  0.7  47:53.80 java       
 1552 root      20   0  121m  13m 8524 S  0.7  0.1  14:37.75 AliYunDun   
 3581 root      20   0 9750m 2.0g  13m S  0.7 12.9 298:30.20 java        
    1 root      20   0 19360 1612 1308 S  0.0  0.0   0:00.81 init        
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd    
    3 root      RT   0     0    0    0 S  0.0  0.0   0:00.14 migration/0 

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
32022 root  20   0   24.5g   4.6g  36716 R 99.9  1.2  18:58.02 java
32028 root  20   0   24.5g   4.6g  36716 R 99.9  1.2  18:57.85 java
32030 root  20   0   24.5g   4.6g  36716 R 99.9  1.2  18:57.76 java
32033 root  20   0   24.5g   4.6g  36716 R 99.9  1.2  18:57.84 java
32037 root  20   0   24.5g   4.6g  36716 R 99.9  1.2  18:57.73 java
32020 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:58.59 java
32021 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:57.88 java
32024 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:57.69 java
32025 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:57.62 java
32026 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:58.32 java
32027 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:57.84 java
32029 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:57.79 java
32031 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:57.60 java
32032 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:58.02 java
32036 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:57.62 java
32038 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:58.01 java
32039 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:57.55 java
32044 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:57.88 java
Analysis

What causes CPU saturation

  • Full GC
    • The old generation is out of space (or was sized far too small)
    • The method area is out of space
    • Explicit calls to System.gc()
    • The average size of objects promoted to the old generation by a Minor GC exceeds the old generation's available memory
    • When copying from Eden / Survivor S0 (from) to S1 (to), an object larger than the available To space is moved straight into the old generation; if the old generation's free memory is smaller than that object, a Full GC is triggered (this is how objects enter the old generation)
  • CPU-heavy work in the application layer
    • The business layer creates and consumes a large number of threads
    • An infinite loop in business code causes frequent Full GC (see the sketch after this list)
    • The business layer runs expensive computations
    • Other causes
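
Since everything in this post was reproduced locally, the following is a minimal sketch of that kind of simulation (the class name and sizes are my own invention, not the original code): run with a deliberately small heap, e.g. the -Xms20m -Xmx20m used in the startup command shown later, it keeps the old generation full so that the Full GC count and the GC threads' CPU time climb quickly.

import java.util.ArrayList;
import java.util.List;

// Hypothetical load generator: keeps retaining byte[] blocks so the old generation
// stays nearly full and the collector runs Full GC almost continuously.
// Run with e.g.:  java -Xms20m -Xmx20m -XX:+PrintGCDetails FullGcSimulator
public class FullGcSimulator {
    private static final List<byte[]> RETAINED = new ArrayList<>();

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            try {
                RETAINED.add(new byte[512 * 1024]);               // retain 512 KB per iteration
            } catch (OutOfMemoryError e) {
                RETAINED.subList(0, RETAINED.size() / 2).clear(); // free half so the loop never exits
            }
            Thread.sleep(1);                                      // keep the loop observable in top/jstack
        }
    }
}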
Troubleshooting process

Issue 1: Incorrect startup parameters

Step 1: Find the process with high CPU usage

top

Mem:  16333644k total,  9472968k used,  6860676k free,   165616k buffers
Swap:        0k total,        0k used,        0k free,  6665292k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND     
 7622 root      20   0 7588m 112m  11m S 300.7  0.7  47:53.80 java       
 1552 root      20   0  121m  13m 8524 S  0.7  0.1  14:37.75 AliYunDun   
 3581 root      20   0 9750m 2.0g  13m S  0.7 12.9 298:30.20 java        
    1 root      20   0 19360 1612 1308 S  0.0  0.0   0:00.81 init        
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd    
    3 root      RT   0     0    0    0 S  0.0  0.0   0:00.14 migration/0 

Step 2: Find the high-CPU threads inside that process

top -Hp 7622

top - 17:43:15 up 5 days,  7:31,  1 user,  load average: 0.99, 0.97, 0.91
Tasks:  32 total,   1 running,  31 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.7%us,  8.9%sy,  0.0%ni, 87.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16333644k total,  9592504k used,  6741140k free,   165700k buffers
Swap:        0k total,        0k used,        0k free,  6781620k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
7452 root      20   0 7588m 112m  11m R 99.9  0.7  50:47.43 java
7623 root      20   0 7588m 112m  11m S  0.3  0.7   0:02.08 java
7638 root      20   0 7588m 112m  11m S  0.0  0.7   0:00.00 java
7639 root      20   0 7588m 112m  11m S  0.0  0.7   0:00.23 java
7627 root      20   0 7588m 112m  11m S  0.0  0.7   0:02.09 java
7625 root      20   0 7588m 112m  11m S  0.0  0.7   0:02.12 java
7632 root      20   0 7588m 112m  11m S  0.0  0.7   0:02.07 java


root@server:/usr/local/application/jdk/jdk1.8.0_144# ps H -eo user,pid,ppid,tid,time,%cpu,cmd | grep java | sort -k6 -nr |head -10
root        7622    4578    7452 00:00:02 90.9 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
root        7622    4578    7623 00:00:02 90.8 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
root        7622    4578    7638 00:00:02 80.0 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
root        7622    4578    7636 00:00:02 17.8 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
root        7622    4578    7639 00:00:01  8.5 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
root        7622    4578    7627 00:00:00  3.5 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
root        7622    4578    7628 00:00:00  3.4 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
root        7622    4578    7625 00:00:00  3.4 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
root        7622    4578    7632 00:00:00  3.2 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
root        7622    4578    7630 00:00:00  3.2 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar

Step 3: Convert the thread ID to hexadecimal

printf "%xn" 7452          
1d1c

Step 4: Locate the CPU-hogging thread

[hh_mpp_cd@sjsy-hh305-hx13w ~]$ jps
40433 DBService-1.2.7.jar
20584 Server
37417 QuorumPeerMain
13705 jar
30892 Jps
[hh_mpp_cd@sjsy-hh305-hx13w ~]$ jstack 7622|grep -10 1d1c
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:490)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:139)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343)
        at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:408)
        at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
        at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:834)
        at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1415)
        at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
        - locked  (a org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.lang.Thread.run(Thread.java:748)

"http-nio-9372-exec-31" #5185 daemon prio=5 os_prio=0 tid=0x000000079df35438 nid=0x1d1c runnable [0x00007f86f9ad7000]
   java.lang.Thread.State: RUNNABLE                   // high CPU usage -- a CPU-hogging thread should be in the RUNNABLE state
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
--
"Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007f879c526000 nid=0x9e15 in Object.wait() [0x00007f8732dec000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
        - locked  (a java.lang.ref.Reference$Lock)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)

"VM Thread" os_prio=0 tid=0x00007f879c01f000 nid=0x1d1c runnable   //VM Thread表示垃圾回收的线程

"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f879c01f000 nid=0x1d1c runnable   //1d1c 这个占用cpu高的线程为GC线程

"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f879c021000 nid=0x9df4 runnable

"GC task thread#2 (ParallelGC)" os_prio=0 tid=0x00007f879c022800 nid=0x9df5 runnable

"GC task thread#3 (ParallelGC)" os_prio=0 tid=0x00007f879c024800 nid=0x9df6 runnable

"GC task thread#4 (ParallelGC)" os_prio=0 tid=0x00007f879c026800 nid=0x9df7 runnable

"GC task thread#5 (ParallelGC)" os_prio=0 tid=0x00007f879c028000 nid=0x9df8 runnable

Checking several more of the high-CPU thread IDs shows that they are all GC threads, so at this point we can conclude that garbage collection is what is burning the CPU. Next, gather GC statistics with jstat:

[hh_mpp_cd@sjsy-hh305-hx13w ~]$ jstat -gcutil 7622 1000 10
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00   0.00   0.00  75.07  59.09  59.60   3259    0.919  6517    7.715    8.635
  0.00   0.00   0.00   0.08  59.09  59.60   3306    0.930  6611    7.822    8.752
  0.00   0.00   0.00   0.08  59.09  59.60   3351    0.943  6701    7.924    8.867
  0.00   0.00   0.00   0.08  59.09  59.60   3397    0.955  6793    8.029    8.984

The FGC count is climbing very fast (roughly 90 additional Full GCs per second in the sample above). Whatever the underlying cause, frequent Full GC means the old generation keeps hitting its threshold, and the normal reaction at this point would be to:

  • find where large objects are being created
  • find where a memory leak is keeping objects from being collected
  • but before that, it is worth going back and re-reading the startup parameters; when the CPU is saturated, the things to check are:
    • how large the heap is
    • which GC algorithm is in use (CMS concurrent collection, for example, is very CPU-intensive)

Check the startup parameters:
jinfo -flags 16458
Non-default VM flags: -XX:CICompilerCount=15 -XX:InitialHeapSize=2147483648 -XX:MaxHeapSize=4294967296 
    -XX:MaxMetaspaceSize=4294967296 -XX:MaxNewSize=1431306240 -XX:MetaspaceSize=536870912 -XX:MinHeapDeltaBytes=524288 
        -XX:NewSize=715653120 -XX:OldSize=1431830528 -XX:+UseCompressedClassPointers
            -XX:+UseCompressedOops -XX:+UseFastUnorderedTimeStamps -XX:+UseParallelGC
Command line:  -Xmn4096m -Xmx4096m -XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=4096

We find that the young generation has been allocated the entire heap (-Xmn4096m with -Xmx4096m).

The old generation is left with only about 0.5 MB, which is what drives the constant Full GC. So next time, before digging any deeper, first check how the startup parameters are configured and make sure they are sane (for instance, do not set -Xmn equal to -Xmx; leave the old generation a reasonable share of the heap).

Issue 2

After fixing the startup parameters and running for a while, the CPU is saturated again.

Step 1: Identify why the CPU is saturated

The GC statistics again show frequent Full GC:

root@server:/usr/local/application/jvm# jstat -gc 7998 1000 10
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
1024.0 1024.0  0.0    0.0    1024.0   1024.0    7168.0     7050.2   31104.0 28789.7 4224.0 3726.6    165    0.567  336     7.753    8.320
1024.0 1024.0  0.0    0.0    1024.0   1024.0    7168.0     7048.1   31104.0 28838.2 4224.0 3730.8    165    0.567  384     8.713    9.280
1024.0 1024.0  0.0    0.0    1024.0   1011.2    7168.0     6972.9   31104.0 28869.7 4224.0 3740.3    165    0.567  429     9.678   10.246

The GC log shows the cause of collection: Ergonomics, i.e. GC triggered by the JVM's adaptive sizing:


2021-06-20 18:58:31.491  WARN 8153 --- [           main] ConfigServletWebServerApplicationContext : Exception encountered during context initialization - cancelling refresh attempt: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'management.info-org.springframework.boot.actuate.autoconfigure.info.InfoContributorProperties': Initialization of bean failed; nested exception is java.lang.OutOfMemoryError: GC overhead limit exceeded
[Full GC (Ergonomics) [PSYoungGen: 2048K->1979K(2560K)] [ParOldGen: 7120K->7120K(7168K)] 9168K->9099K(9728K), [Metaspace: 31869K->31869K(1079296K)], 0.0242829 secs] [Times: user=0.09 sys=0.01, real=0.02 secs]
[Full GC (Ergonomics) [PSYoungGen: 2048K->1964K(2560K)] [ParOldGen: 7120K->7120K(7168K)] 9168K->9084K(9728K), [Metaspace: 31869K->31869K(1079296K)], 0.0217299 secs] [Times: user=0.08 sys=0.00, real=0.02 secs]
[Full GC (Ergonomics) [PSYoungGen: 2048K->1969K(2560K)] [ParOldGen: 7120K->7120K(7168K)] 9168K->9089K(9728K), [Metaspace: 31869K->31869K(1079296K)], 0.0204122 secs] [Times: user=0.06 sys=0.01, real=0.02 secs]
[Full GC (Ergonomics) [PSYoungGen: 2048K->1955K(2560K)] [ParOldGen: 7120K->7119K(7168K)] 9168K->9075K(9728K), [Metaspace: 31869K->31869K(1079296K)], 0.0211435 secs] [Times: user=0.05 sys=0.01, real=0.02 secs]
2021-06-20 18:58:31.636  INFO 8153 --- [           main] o.s.s.concurrent.ThreadPoolTaskExecutor  : Shutting down ExecutorService 'applicationTaskExecutor'
[Full GC (Ergonomics) [PSYoungGen: 2048K->1887K(2560K)] [ParOldGen: 7119K->7119K(7168K)] 9167K->9006K(9728K), [Metaspace: 31871K->31871K(1079296K)], 0.0228763 secs] [Times: user=0.09 sys=0.00, real=0.03 secs]
[Full GC (Ergonomics) [PSYoungGen: 2048K->1784K(2560K)] [ParOldGen: 7119K->7042K(7168K)] 9167K->8827K(9728K), [Metaspace: 31894K->31894K(1079296K)], 0.0564103 secs] [Times: user=0.27 sys=0.05, real=0.06 secs]
Heap
 PSYoungGen      total 2560K, used 2027K [0x00000000ffd00000, 0x0000000100000000, 0x0000000100000000)
  eden space 2048K, 99% used [0x00000000ffd00000,0x00000000ffefaec8,0x00000000fff00000)
  from space 512K, 0% used [0x00000000fff00000,0x00000000fff00000,0x00000000fff80000)
  to   space 512K, 0% used [0x00000000fff80000,0x00000000fff80000,0x0000000100000000)
 ParOldGen       total 7168K, used 7042K [0x00000000ff600000, 0x00000000ffd00000, 0x00000000ffd00000)
  object space 7168K, 98% used [0x00000000ff600000,0x00000000ffce0ab0,0x00000000ffd00000)
 Metaspace       used 31912K, capacity 34276K, committed 34560K, reserved 1079296K
  class space    used 4152K, capacity 4584K, committed 4608K, reserved 1048576K
[Full GC (Ergonomics) [PSYoungGen: 2048K->1785K(2560K)] [ParOldGen: 7042K->7027K(7168K)] 9090K->8812K(9728K), [Metaspace: 31912K->31912K(1079296K)], 0.0434741 secs] [Times: user=0.20 sys=0.00, real=0.04 secs]

At this point the cause of the high CPU usage is confirmed: frequent Full GC.

Step 2: Analyze thread behavior and memory usage

Use jstack to take thread snapshots and look at what each thread is actually doing. Be sure to sample several of the high-CPU threads and see which behavior costs the most CPU; the usual suspects such as deadlocks and infinite loops are easy to recognize in a snapshot. Sometimes, though, the thread snapshots alone do not reveal the problem, and you then need heap information to see which classes are taking up the most memory:

jmap -histo:live 7823

Because the CPU is saturated, jmap is slow to enumerate live objects, so trim the output:

jmap -histo pid | head -n 10          # show only the top 10 entries
jmap -histo pid | sort -k 2 -g -r     # sort by instance count, descending
jmap -histo pid | sort -k 3 -g -r     # sort by bytes used, descending

root@server:/usr/local/application/jdk/jdk1.8.0_144# jmap -histo 7823|head -n 10

 num     #instances         #bytes  class name
----------------------------------------------
   1:          2154        9671400  [B
   2:         30968        2989952  [C
   3:          7113         782216  java.lang.Class
   4:         30680         736320  java.lang.String
   5:         21712         694784  java.util.concurrent.ConcurrentHashMap$Node
   6:          3360         584440  [I
   7:          7129         341112  [Ljava.lang.Object;

If the histogram above lets you locate the large objects directly, go straight to them. If it does not, as in this case, you need an analysis tool.

Step 3: Analyze the dump file with an analysis tool

Dump the heap (for example with jmap -dump:live,format=b,file=heap.hprof <pid>) and open the file in a tool such as JProfiler to see what actually holds references to the suspect objects.

[Screenshots: JProfiler analysis of the heap dump, biggest-objects / class memory view]

From these views you can see which class holds the most memory: TestDto accounts for 49%, which could indeed be what is causing the Full GCs, so the next step is to track down where those objects are referenced. But this by itself does not settle the question. The dump file only tells you which classes occupy the most memory, and diagnosing from it alone can lead you astray; always confirm the conclusion against the GC statistics and thread snapshots taken at the same time.
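
For reference, a plausible minimal sketch of the kind of code that produces this picture (everything except the TestDto name is invented for illustration): a static collection keeps accumulating TestDto instances and never evicts them, so in JProfiler the static field shows up as the GC root retaining close to half of the heap.

import java.util.ArrayList;
import java.util.List;

// Hypothetical leak: every request adds a TestDto to a static list that is never cleared,
// so the instances survive every collection and a single class ends up dominating the heap.
public class TestDtoLeak {

    static class TestDto {
        final long id;
        final byte[] payload = new byte[4 * 1024];   // made-up payload size
        TestDto(long id) { this.id = id; }
    }

    private static final List<TestDto> CACHE = new ArrayList<>();   // the GC root a profiler points at

    static void handleRequest(long id) {
        CACHE.add(new TestDto(id));                  // added on every call, never removed
    }

    public static void main(String[] args) {
        for (long i = 0; ; i++) {
            handleRequest(i);                        // left running, this ends in constant Full GC
        }
    }
}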

Additional note

One thing I noticed both during the real troubleshooting and while simulating it locally: once a Spring Boot application has started, even an OutOfMemoryError does not necessarily terminate the process. So the fact that the process is still alive does not mean it has not already suffered a memory overflow or a memory leak.
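
A minimal sketch of why that is (not the application's actual code): an OutOfMemoryError thrown in a worker thread unwinds only that thread, while the JVM and its remaining non-daemon threads keep running. If you want the process to exit instead, HotSpot provides flags such as -XX:+ExitOnOutOfMemoryError (available from JDK 8u92).

import java.util.ArrayList;
import java.util.List;

// Demo: the worker thread dies from an OutOfMemoryError, but the JVM keeps running.
// Run with a small heap, e.g.:  java -Xmx16m OomDoesNotKillJvm
public class OomDoesNotKillJvm {

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            List<byte[]> retained = new ArrayList<>();
            while (true) {
                retained.add(new byte[1024 * 1024]);   // eventually throws OutOfMemoryError
            }
        }, "leaky-worker");
        worker.start();
        worker.join();                                 // returns once the worker has died

        // Still here: the process did not exit even though an OOM occurred,
        // just like the "still running" Spring Boot process described above.
        System.out.println("worker is dead, but the JVM is still alive");
        Thread.sleep(60_000);
    }
}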

