The device is currently printing the log above on its console; after that neither the serial port nor ssh can log in, the device is hung, and only a power cycle brings it back.
The original design was that, when memory ran out, the system would first kill the most memory-hungry processes so the device itself would not hang. Something different is clearly happening now, so let's look at the kernel's OOM code and work out how this should be handled.
When an application needs memory it calls malloc, and a non-NULL return is taken to mean the request succeeded and usable memory was obtained. In fact this is exactly where things can go wrong. When we allocate in user space, usually with malloc, does malloc only return NULL once there is truly no memory left to hand out? The answer is no. The malloc man page describes the allocation policy as follows:
By default, Linux follows an optimistic memory allocation strategy.
This means that when malloc() returns non-NULL there is no guarantee
that the memory really is available. This is a really bad bug. In
case it turns out that the system is out of memory, one or more processes
will be killed by the infamous OOM killer. In case Linux is employed
under circumstances where it would be less desirable to suddenly lose
some randomly picked processes, and moreover the kernel version is
sufficiently recent, one can switch off this overcommitting behavior
using a command like:
# echo 2 > /proc/sys/vm/overcommit_memory
See also the kernel Documentation directory, files vm/overcommit-accounting
and sysctl/vm.txt.
The description says that in Linux a non-NULL return from malloc does not mean the memory is really available. Linux lets programs request more memory than the system can actually back; this feature is called overcommit. It is an optimization: not every program touches its memory right after allocating it, and by the time it does, the system may have reclaimed enough elsewhere. But when there is no memory left at the moment the pages are actually used, the OOM killer steps in and forces some processes to exit.
Linux has three overcommit policies (see the kernel documentation: Documentation/vm/overcommit-accounting). They are selected through /proc/sys/vm/overcommit_memory, which takes the values 0, 1 and 2 (default 0).
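As a quick reference, the three policies can be inspected and switched from the shell; a minimal sketch (the /proc names are real, the comments paraphrase Documentation/vm/overcommit-accounting):

# Show the current policy (0 by default).
cat /proc/sys/vm/overcommit_memory
# 0: heuristic overcommit - obvious overcommits are refused, the rest are allowed
# 1: always overcommit - never refuse an allocation
# 2: strict accounting - refuse commits beyond swap + overcommit_ratio% of RAM
echo 2 > /proc/sys/vm/overcommit_memory
# Under policy 2, compare what is committed against the commit limit.
grep -E 'CommitLimit|Committed_AS' /proc/meminfo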
We can modify /proc/<pid>/oom_adj. The default value is 0; if we set it to -17, the OOM mechanism is never triggered for that process and it will not be killed:
echo -17 > /proc/$(pidof debugbin)/oom_adj
Why -17? This comes from the Linux implementation. In the kernel's oom.h you can find the following definitions:
/* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
#define OOM_DISABLE (-17)
/* inclusive */
#define OOM_ADJUST_MIN (-16)
#define OOM_ADJUST_MAX 15
So the oom_adj variable ranges from -16 to 15 (with -17 as the special disable value): the larger it is, the more likely the process is to be killed. oom_score is the score computed from it, and the kernel picks which processes to kill based on that value.
In short, the analysis above tells us when the OOM mechanism is engaged: as long as overcommit is in effect, the OOM killer may be triggered.
The kernel's selection policy has kept evolving, and we can influence the OOM killer's decision by tuning a few values. Every Linux process has an OOM weight in /proc/<pid>/oom_adj, ranging from -17 to +15; the higher the value, the more likely the process is to be killed.
The OOM killer ultimately decides which process to kill by /proc/<pid>/oom_score. This value is computed from the process's memory consumption, CPU time (utime + stime), lifetime (uptime - start time) and oom_adj: the more memory consumed, the higher the score; the longer the process has lived, the lower the score.
In short, the overall policy is: lose as little work as possible, reclaim as much memory as possible, avoid harming innocent processes that legitimately use a lot of memory, and kill as few processes as possible. Note also that when Linux totals up a process's memory consumption, half of each child's memory is charged to the parent.
/proc/<pid>/oom_score_adj
The value of /proc/<pid>/oom_score_adj is added to the badness score before it
is used to determine which task to kill. Acceptable values range from -1000
(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX). This allows userspace to
polarize the preference for oom killing either by always preferring a certain
task or completely disabling it. The lowest possible value, -1000, is
equivalent to disabling oom killing entirely for that task since it will always
report a badness score of 0.
When the final badness score is computed, oom_score_adj is added to the result, so user space can use this value either to shield a process from being killed or to make it the preferred victim every time. Its range is -1000 to 1000.
If the value is set to -1000, the process will never be killed, because the badness score then always returns 0.
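For example, a minimal sketch (sshd stands in for whatever daemon you want to protect; "mycache" is a hypothetical expendable process):

# Never kill sshd (lowering the value needs root); its badness now always reads 0.
echo -1000 > /proc/$(pidof -s sshd)/oom_score_adj
# Make a hypothetical expendable worker the preferred victim instead.
echo 1000 > /proc/$(pidof -s mycache)/oom_score_adj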
/proc/<pid>/oom_adj
For backwards compatibility with previous kernels, /proc/<pid>/oom_adj may also
be used to tune the badness score. Its acceptable values range from -16
(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17
(OOM_DISABLE) to disable oom killing entirely for that task. Its value is
scaled linearly with /proc/<pid>/oom_score_adj.
This parameter exists for compatibility with older kernels. Its range is -17 to 15.
Note: the kernel keeps these two interfaces in sync; change one, and the other is updated automatically.
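The coupling is easy to see from a shell; a minimal sketch (run as root, since lowering the score needs CAP_SYS_RESOURCE; the endpoint mappings -17/-1000 and +15/+1000 are exact, intermediate values are scaled and rounded):

# Writing the legacy knob updates the new one for this shell ($$)...
echo -17 > /proc/$$/oom_adj
cat /proc/$$/oom_score_adj      # prints -1000
# ...and vice versa.
echo 1000 > /proc/$$/oom_score_adj
cat /proc/$$/oom_adj            # prints 15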
The kernel implements this as follows:
- Writing oom_score_adj stores the value directly in task->signal->oom_score_adj.
- Reading oom_score_adj reads it straight back from task->signal->oom_score_adj.
- Writing oom_adj scales the value proportionally into the oom_score_adj range and stores the result in task->signal->oom_score_adj.
- Reading oom_adj also reads task->signal->oom_score_adj, only scaling it back to the oom_adj range for display.
/proc/<pid>/oom_score
This file can be used to check the current score used by the oom-killer is for
any given <pid>. Use it together with /proc/<pid>/oom_score_adj to tune which
process should be killed in an out-of-memory situation.
The OOM killer decides which process to kill mainly from this value together with /proc/<pid>/oom_score_adj.
In other words, oom_score_adj is a weight that dynamically adjusts oom_score.
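The weighting can be observed directly; a sketch (absolute scores depend on kernel version and on how much memory the shell itself uses):

pid=$$
cat /proc/$pid/oom_score           # baseline badness
echo 500 > /proc/$pid/oom_score_adj
cat /proc/$pid/oom_score           # roughly 500 higher on recent kernels
echo 0 > /proc/$pid/oom_score_adj  # restore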
3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer score
These file can be used to adjust the badness heuristic used to select which process gets killed in out of memory conditions.
3.2 /proc/<pid>/oom_score - Display current oom-killer score
This file can be used to check the current score used by the oom-killer is for any given <pid>. Use it together with /proc/<pid>/oom_score_adj to tune which process should be killed in an out-of-memory situation.
Other knobs that control OOM killer behavior:
- oom_dump_tasks: 0 or non-zero (default 1); whether to dump each task's memory information to the log when an OOM occurs.
- oom_kill_allocating_task: 0 or non-zero (default 0); 0 means that on OOM the kernel walks the task list and picks a victim, non-zero means it kills the task that triggered the OOM outright, without walking the task list.
- panic_on_oom: allows or forbids a kernel panic on out-of-memory (default 0).
- panic_on_oom=2 together with kdump lets you capture a crash dump and analyze why the OOM happened.
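Taken together, and assuming the standard sysctl names (vm.oom_dump_tasks, vm.oom_kill_allocating_task, vm.panic_on_oom), these can be set like any other sysctl; a sketch:

# Dump per-task memory info to the kernel log on every OOM (the default).
sysctl -w vm.oom_dump_tasks=1
# Kill the allocating task instead of scanning the whole task list.
sysctl -w vm.oom_kill_allocating_task=1
# Always panic on OOM; with kdump configured, the vmcore can then be
# analyzed to find out why the OOM happened.
sysctl -w vm.panic_on_oom=2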
If necessary, the OOM killer can be switched off completely (not recommended for production):
# sysctl -w vm.overcommit_memory=2
# echo "vm.overcommit_memory=2" >> /etc/sysctl.conf
Reference: linux-out-of-memory-killer
#!/bin/bash
# Print the ten processes with the highest oom_score.
for proc in $(find /proc -maxdepth 1 -regex '/proc/[0-9]+'); do
    printf "%2d %5d %s\n" \
        "$(cat $proc/oom_score)" \
        "$(basename $proc)" \
        "$(cat $proc/cmdline | tr '\0' ' ' | head -c 50)"
done 2>/dev/null | sort -nr | head -n 10
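Run it as root so every process's oom_score and cmdline are readable; each output line shows the score, the pid, and the first 50 bytes of the command line, highest scorers first.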
Display OOM scores:
#!/bin/bash
# Displays running processes in descending order of OOM score
# (skipping those with both score and adjust of zero).
# https://dev.to/rrampage/surviving-the-linux-oom-killer-2ki9
contents-or-0 () { if [ -r "$1" ] ; then cat "$1" ; else echo 0 ; fi ; }
{
header='# %8s %7s %9s %5s %5s %5s %s\n'
format="$(echo "$header" | sed 's/^./ /')"
declare -a lines output
IFS=$'\r\n' command eval 'lines=($(ps -e -o user,pid,rss))'
shown=0 ; omits=0
for n in $(eval echo "{1..$(expr ${#lines[@]} - 1)}") ; do # 1..skip header
line="${lines[$n]}"
case "$line" in *[0-9]*)
set $line ; user=$1 ; pid=$2 ; rss=$3 ; shift 3
oom_score=$( contents-or-0 /proc/$pid/oom_score)
oom_adj=$( contents-or-0 /proc/$pid/oom_adj)
oom_score_adj=$(contents-or-0 /proc/$pid/oom_score_adj)
if [ -f /proc/$pid/oom_score ] && \
[ 0 -ne $oom_score -o 0 -ne $oom_score_adj -o 0 -ne $oom_adj ]
then
output[${#output[@]}]="$( \
printf "$format" \
"$user" \
"$pid" \
"$rss" \
"$oom_score" \
"$oom_score_adj" \
"$oom_adj" \
"$(cat /proc/$pid/cmdline | tr '\0' ' ' )" \
)"
(( ++shown ))
else
(( ++omits ))
fi
;;
esac
done
printf "$header" '' '' '' OOM OOM OOM ''
printf "$header" User PID RSS Score ScAdj Adj \
"Command (shown $shown, omits $omits)"
for n in $(eval echo "{0..$(expr ${#output[@]} - 1)}") ; do
echo "${output[$n]}"
done | sort -k 4nr -k 5rn
}
out_of_memory() proceeds roughly as follows:
1. blocking_notifier_call_chain: walk the OOM notifier chain registered by other kernel code; if a callback was able to free memory, leave the OOM killer immediately.
2. If sysctl_oom_kill_allocating_task is set, current->mm is non-NULL, current's oom_score_adj != OOM_SCORE_ADJ_MIN, and current is killable, kill current directly to release memory.
3. select_bad_process: pick the most suitable process p to kill.
4. oom_kill_process: kill the chosen process p.
select_bad_process iterates with for_each_process_thread(g, p), and oom_scan_process_thread checks each thread's class, excluding some special threads, before the remaining candidates are scored.
/**
* out_of_memory - kill the "best" process when we run out of memory
* @oc: pointer to struct oom_control
*
* If we run out of memory, we have the choice between either
* killing a random task (bad), letting the system crash (worse)
* OR try to be smart about which process to kill. Note that we
* don't have to be perfect here, we just have to be good.
*/
bool out_of_memory(struct oom_control *oc)
{
struct task_struct *p;
unsigned long totalpages;
unsigned long freed = 0;
unsigned int uninitialized_var(points);
enum oom_constraint constraint = CONSTRAINT_NONE;
if (oom_killer_disabled)
return false;
blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
if (freed > 0)
/* Got some memory back in the last second. */
return true;
/*
* If current has a pending SIGKILL or is exiting, then automatically
* select it. The goal is to allow it to allocate so that it may
* quickly exit and free its memory.
*
* But don't select if current has already released its mm and cleared
* TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
* If the current task has a pending SIGKILL(9) or is exiting, it is chosen
* as the victim, which frees memory the fastest.
*/
if (current->mm &&
(fatal_signal_pending(current) || task_will_free_mem(current))) {
mark_oom_victim(current);
try_oom_reaper(current);
return true;
}
/*
* The OOM killer does not compensate for IO-less reclaim.
* pagefault_out_of_memory lost its gfp context so we have to
* make sure exclude 0 mask - all other users should have at least
* ___GFP_DIRECT_RECLAIM to get here.
*/
if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
return true;
/*
* Check if there were limitations on the allocation (only relevant for
* NUMA) that may require different handling.
* Several different constraint policies exist; they only matter on NUMA.
*/
constraint = constrained_alloc(oc, &totalpages);
if (constraint != CONSTRAINT_MEMORY_POLICY)
oc->nodemask = NULL;
// If /proc/sys/vm/panic_on_oom is configured, trigger a panic right away.
check_panic_on_oom(oc, constraint, NULL);
/*
* If oom_kill_allocating_task is configured, i.e. the current (allocating)
* task should be killed to reclaim memory, and current is killable, kill it.
*/
if (sysctl_oom_kill_allocating_task && current->mm &&
!oom_unkillable_task(current, NULL, oc->nodemask) &&
current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
get_task_struct(current);
oom_kill_process(oc, current, 0, totalpages, NULL,
"Out of memory (oom_kill_allocating_task)");
return true;
}
// Select the process to kill according to the established policy.
p = select_bad_process(oc, &points, totalpages);
/* Found nothing?!?! Either we hang forever, or we panic. */
if (!p && !is_sysrq_oom(oc)) {
dump_header(oc, NULL, NULL);
panic("Out of memory and no killable processes...\n");
/*
* If nothing was selected, i.e. there is no killable process, panic outright.
* Normally we never get here, but there are exceptions, e.g. when the chosen
* process is in D state or is already being killed.
*/
}
// Kill the selected process to release its memory.
if (p && p != (void *)-1UL) {
oom_kill_process(oc, p, points, totalpages, NULL,
"Out of memory");
/*
* Give the killed process a good chance to exit before trying
* to allocate memory again.
* Before allocating again, give the killed process a brief window (one
* jiffy here) to finish its exit handling; usually that is enough.
*/
schedule_timeout_killable(1);
}
return true;
}
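To watch this path fire without hanging the whole device, one option is to provoke an OOM inside a memory cgroup. A sketch, assuming cgroup v1 with the memory controller mounted at /sys/fs/cgroup/memory (paths differ under cgroup v2):

# Create a cgroup with a 50 MB hard limit.
mkdir /sys/fs/cgroup/memory/oomtest
echo $((50*1024*1024)) > /sys/fs/cgroup/memory/oomtest/memory.limit_in_bytes
# Move this shell (and thus its children) into the group, then run a hog:
# tail buffers all of stdin, so it is killed once the limit is hit.
echo $$ > /sys/fs/cgroup/memory/oomtest/tasks
head -c 100M /dev/zero | tail
# The kernel logs the usual oom-killer report, visible via dmesg.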
out_of_memory
->select_bad_process
->oom_badness
oom_badness computes each process's "points"; the process with the most points is selected.
/**
* oom_badness - heuristic function to determine which candidate task to kill
* @p: task struct of which task we should calculate
* @totalpages: total present RAM allowed for page allocation
*
* The heuristic for determining which task to kill is made to be as simple and
* predictable as possible. The goal is to return the highest value for the
* task consuming the most memory to avoid subsequent oom failures.
*/
/*
* Computes a process's "points" (how likely it is to be selected). The points
* are based on the physical memory the process occupies: the more physical
* memory, the more likely it is to be chosen. Root processes get a 3% discount.
*/
unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
const nodemask_t *nodemask, unsigned long totalpages)
{
long points;
long adj;
if (oom_unkillable_task(p, memcg, nodemask))
return 0;
p = find_lock_task_mm(p); // make sure the task still has an address space
if (!p)
return 0;
/*
* Do not even consider tasks which are explicitly marked oom
* unkillable or have been already oom reaped.
*/
adj = (long)p->signal->oom_score_adj;
// If oom_score_adj is -1000, the task is never killed: badness always returns 0.
if (adj == OOM_SCORE_ADJ_MIN ||
test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
task_unlock(p);
return 0;
}
/*
* The baseline for the badness score is the proportion of RAM that each
* task's rss, pagetable and swap space use.
*/ // points = rss (resident/physical memory) + page-table pages + swap usage
points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
atomic_long_read(&p->mm->nr_ptes) + mm_nr_pmds(p->mm);
task_unlock(p);
/*
* Root processes get 3% bonus, just like the __vm_enough_memory()
* implementation used by LSMs.
*//*
* Processes started by root get a 3% discount on their points
* (points -= points * 3 / 100), i.e. they can use roughly 3% more memory
* than other processes before scoring the same.
*/
if (has_capability_noaudit(p, CAP_SYS_ADMIN))
points -= (points * 3) / 100;
/* Normalize to oom_score_adj units */
adj *= totalpages / 1000;
points += adj;
/*
* Never return 0 for an eligible task regardless of the root bonus and
* oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here).
*/
return points > 0 ? points : 1;
}
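The same arithmetic can be approximated from user space. A sketch (VmRSS, VmSwap and VmPTE in /proc/<pid>/status are reported in kB, so dividing by the 4 kB page size approximates page counts; the PMD term and the root 3% discount are ignored):

pid=$$
eval $(awk '/^VmRSS|^VmSwap|^VmPTE/ {gsub(":",""); print $1"="$2}' /proc/$pid/status)
adj=$(cat /proc/$pid/oom_score_adj)
total_pages=$(awk '/^MemTotal/ {print int($2/4)}' /proc/meminfo)
# points = rss + swap entries + page-table pages + adj * totalpages / 1000
points=$(( (VmRSS + VmSwap + VmPTE) / 4 + adj * total_pages / 1000 ))
echo "approximate badness points for $pid: $points"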
Incidentally, a word about the memblock mechanism:
The memblock management algorithm
Available (allocatable) memory is tracked in memblock.memory, while allocated memory is tracked in memblock.reserved: as soon as a block is added to memblock.reserved it counts as claimed. One key point to note: when memory is allocated, the block is merely added to memblock.reserved; nothing is deleted from or changed in memblock.memory. That is why both the allocate and free operations are concentrated on memblock.reserved.
The algorithm is not very efficient, but that is reasonable: during early boot there are no complicated allocation scenarios, and in many places memory is allocated once for permanent use.
(We used to allocate a block via memblock to detect whether the device had been restarted by a power cut.)
Under memblock, the allocation and free interfaces are memblock_alloc() and memblock_free().
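If the kernel keeps memblock data around (CONFIG_DEBUG_FS enabled, and on older kernels CONFIG_ARCH_DISCARD_MEMBLOCK not set), both region arrays can be inspected after boot:

mount -t debugfs none /sys/kernel/debug 2>/dev/null
cat /sys/kernel/debug/memblock/memory     # all usable RAM regions
cat /sys/kernel/debug/memblock/reserved   # regions claimed via memblock_alloc()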
Just do good deeds; ask not about what lies ahead.
-- a fatty who is 180 in both height and weight