Energyawarescheduling(LinuxKernelSummit2013)(待续)

作者：金色阳光CC | 来源：互联网 | 2023-08-30 18:12

LinuxKernelSummit2013Energy-awareschedulingMortenRasmussen与会者:MortenRasmussen,IngoMolnar

Linux Kernel Summit 2013

Energy-aware scheduling

Morten Rasmussen与会者: Morten Rasmussen,Ingo Molnar , Peter Zijlstra, Paul Turner, Vincent Guittot, Juri Lelli, Alex Shi, Kevin Hilman, Paul Walmsley, Rafael Wysocki, Artem Bityusky, Srivatsa Bhat, Jonathan Corbet, Michal Hocko, Lai Jiangshan, Fengguang Wu, Mel Gorman, John Linville, Steven Rostedt, Frank Rowand, Ted Ts&＃39;o, Thomas Gleixner, Paul E. McKenney, and at least one other

成果:

Ingo will accept first benchmark into tools/* for scheduler testing
Power awareness moves into scheduler to get unified scheduling decisions:

a. Most of cpufreq dies (aside from backwards compatibilty).b. Governor part of cpuidle framework moves to scheduler.(Drivers doing actual transitions remain in platform-specific codeinitially unchanged.)
Voltage more important than frequency, so start with voltage (or more generally, powr cost).
This comes after we are reasonably happy with #2 above.
Express requirements and desires as user stories.
Add per-task wakeup latency-tolerance input to the scheduler.

主要话题

统一电源政策
- 让调度程序影响电源管理决策并从硬件获取反馈
Task packing
- Only use what you need, but not always
电源驱动
- 电源管理的硬件抽象

Unifying power policies

cpufreq, cpuidle, 调度策略很不协调
用其他东西替换cpufreq和cpuidle&＃xff1f;
这些代码应该在sched / {core&＃xff0c;fair&＃xff0c;rt} .c中存在多少&＃xff1f;
…

Rafael

cpufreq有多个已知多年的问题
一些清理正在进行中
在英特尔硬件上&＃xff0c;我们不需要它使用的所有锁
它在进程上下文中运行的事实并不是很有帮助
cpufreq应考虑在空闲状态下花费多少时间
cpufreq应该替换为其他东西

PeterZ:

cpuidle有一些不正确的启发式方法
在我看的时候&＃xff0c;由于不明显的原因&＃xff0c;这很复杂
intel-cpu依赖

Ingo:

事情没有得到适当的衡量
基于负载的测量
没有一致的方法来枚举现场的所有工作负载
需要“测试载体”

PeterZ: 但基准问题是一个NP(非确定性多项式)难题

MarkG: 对于CPUIdle&＃xff0c;不需要基准测试&＃xff0c;可以只测量错误

Artem: 但也需要测量延迟&＃xff0c;而不仅仅是功率

Ingo:

缺少一种评估补丁的连贯方法
称他们为基准是错误的

Morten:

包装Android用户空间并分发我们正在使用的内容可能不是正确的方法
MarkG:
在印度拥有一支由50名工程师组成的团队&＃xff0c;测量 power & perf
can never settle on anything
大量的测量噪音

PaulW:

它不仅仅是吞吐量 - 也是延迟

MarkG:

衡量代码正确性的指标

PeterZ:

然而英特尔设法制造出功耗更低的芯片

MelG:

负载的集合永远不够全面
每秒唤醒次数是KPI

PaulMck:

在减少唤醒方面付出了很多努力 - 没有节省太多 power

MelG: 什么是关键指标?

随着时间推移的power ?
总时间?
从最低功率状态唤醒的次数&＃xff1f;

pjt: 唤醒的变化到预定的时间

PeterZ: 市场power数字来自哪里&＃xff1f;

MarkG:

CPU-bound, work per watt - active power
standby power

Ingo: 看电影是一项相当不错的负载

PeterZ: 整个软件堆栈也有帮助

能够缓冲大量数据以避免唤醒

Rafael: 电子邮件阅读

pjt: 许多媒体解码由专门的硬件或指令完成

Rafael: 多处理器使测量变得困难

Ingo: 复杂的系统应该不是什么大问题

现在我们没有正确解决这些简单的问题
不能使用隐藏的测量
建议把它放到工具/
tools/power 或者 tools/perfbench

Artem: 我们可以提出一系列简单的事情

Ingo: 视频播放 - 可以渲染到内存

MarkG: 图形power耗费大量精力

PeterZ: 有小设备&＃xff0c;还有服务器
- Arjan从来没有能够直接回答服务器应该做什么
- 英特尔希望内核为节省多节点系统的power做些什么&＃xff1f;

Rafael: 希望硬件应该照顾它

MarkG: 数据中心功率上限

Artem: CPU可以消耗数百瓦特

Rafael: 它从一代CPU变为另一代

Vincent: 它取决于系统上运行的进程

PeterZ: 应该专注于小系统&＃xff0c;并将大系统留在另一天

pjt: 两个优化轨道&＃xff1a;

latency
power

PaulW: 优化延迟可以抵抗吞吐量

Ingo: 尝试测量硬件有什么计数器

试着测量energy

MarkG: 我们如何衡量代码的正确性而不是测量能量&＃xff1f;

PaulMcK: 试图重申

目前的系统非常复杂
一个观点是&＃xff0c;如果你关注总能量&＃xff0c;你就会产生噪音

MarkG: C-state governors

它多久选择一次错误的状态&＃xff1f;

PaulMcK: 再次尝试重新构建
- 有一些决定&＃xff08;例如C状态&＃xff09;&＃xff0c;其中有总能量的代理

PaulW: 并不真正代表总能量

MarkG: 这是我们可以衡量的东西

MelG: 预测并强制CPU进入C状态

或使用处于特定P状态的CPU数量

Srivatsa: PowerPC硬件可能会进入与软件所需的不同的C状态

Ingo: 添加了很多硬件&＃xff0c;因为软件不能正确使用

Artem: 有120个CPU时&＃xff0c;PM QoS存在问题

Kevin: 设备PM QoS旨在解决此问题

PaulMcK: 可以很好地利用专注于特定的子系统

PaulW: 你们现在如何评估调度补丁&＃xff1f;

Peter/pjt: &＃xff08;基本上猜测和经验&＃xff09;

Rafael: 工作负载因服务器而异

MarkG: TPC-C不适用于小型系统

pjt: 从调度程序的角度来看&＃xff0c;我们只关心CPU时间

pjt: 一旦你有一个以上的核心活跃&＃xff0c;那就是一个支点

Ingo: 真正的大系统并不是一个如此巨大的问题

Ingo: 很少有EAS调度补丁经过测试&＃xff0c;并且适用于大型和小型系统

Ingo: 断言大型系统工作负载也适用于小型系统

Artem: 我们有办法重现工作量吗&＃xff1f;

Ingo: 是的 - ‘perf sched’

Vincent: 许多用户正在使用硬件中断

PeterZ: 可以编程定时器来生成这些

PeterZ: linsched具有评估功能

pjt: 因为它是一个模拟&＃xff0c;你可以整合方差

获得一个收敛但不是分歧的分数
更容易优化

Ingo: 我认为服务器和低端工作负载正在融合

PeterZ: 我们可以为linsched生成能源使用模型吗&＃xff1f;

Morten: 我们可以创建自给自足的测试吗&＃xff1f;

PeterZ: 英特尔power目标是什么样的&＃xff1f;

MarkG: post-silicon power characterization(后硅功率表征)

Srivatsa: 使用m5sim来模拟内存功耗

Ingo: 在我们知道补丁改进之前我们无法移动

PeterZ:

模型一致
必须是理智的
不要在整个架构上乱七八糟
没有每个架构的调度程序

Ingo: “通过连贯&＃xff0c;我的意思是英特尔和AMD”

Rafael: 调度程序必须保持与平台无关

Rafael: 如何将平台信息传递给调度程序

PeterZ: 我觉得别搞能量感知的调度程序

PeterZ:你们都需要什么&＃xff1f; 你什么时候节省电力&＃xff1f;

Vincent: 根据C状态&＃xff0c;不同的CPU可以具有不同的唤醒延迟

PeterZ: 是的&＃xff0c;这就是我所说的 - 我们目前没有关于选择正确的CPU唤醒的任何情报 - 可以轻松选择处于深C状态的CPU

pjt: 调度程序需要什么输入&＃xff1f;

Ingo: no regressions policy

Ingo: lots of low hanging fruit

PeterZ: 特异性在哪里&＃xff1f;

假设我们有一个双核系统 - 一个cpu正在做一些小事
我们唤醒另一个核心吗&＃xff1f; 我们安排在其他CPU上吗&＃xff1f;

pjt: 有时我们关心延迟&＃xff0c;有时我们不关心延迟

MarkG: 用于多个CPU的power方向

PeterZ: 分裂 vs. 包装

Rafael: 为什么我们不提出一般的事情清单&＃xff1f;

PeterZ: 硬件的人需要告诉我们什么是重要的

pjt: 如何确定可在多种不同类型的计算机上测试的一组标准工作负载的最佳行为

PeterZ: 我们需要参数化拓扑结构

Vincent: 试图在我的补丁中添加这个

Rafael: 为什么不首先列出我们对当前方法不满意的内容&＃xff1f;

PaulW: Vincent’s 一直试图解决这个问题

Vincent: 使用sched_domain和一个标志&＃xff0c;SD_SHARE_POWERDOMAIN来打包

pjt: 宁可不要使用flag

可以改变紧急行为&＃xff0c;我们对超线程具有相同的splitting vs. packing问题
使用拓扑来描述这一点&＃xff0c;而不是通过标志

PeterZ: 可以根据使用情况构建不同的拓扑树

MelG: 我们仍然没有一个很好的方法来衡量这一点

Ingo: 可调节控制是否选择功率或性能

PeterZ: 我们不想要一个细粒度的旋钮

之前的补丁集有27个状态
不可配置

John Linville: 听起来&＃xff0c;根据board&＃xff0c;我们不知道C状态的影响

在无线领域&＃xff0c;我们有速率控制算法
优化给定错误的吞吐量
e.g. Minstrel

PaulW: 我们应该用C状态做到这一点

PeterZ: 一些大型英特尔系统采用ACPI方法可以得知功耗

Srivatsa: PowerPC 也有

Artem: 在芬兰我有一个功率计

可以衡量负载
可以衡量完成时间
“我只有大服务器”

Rafael: 我想看看为什么人们对目前的情况不满意的一些解释

PeterZ: 典型投诉

我们不断在不同的CPU上唤醒任务
宁愿看到它在同一个CPU上醒来

Srivatsa: 唤醒延迟怎么样&＃xff1f;

pjt: 进程应该向调度程序传达所需的延迟

PaulW: 什么样的延迟&＃xff1f;

pjt: 任务完成的时间

(break)

Rafael: 有一个每任务延迟设置会很好

(general agreement)

PeterZ: 不会太多地影响调度延迟 - 旨在影响从空闲状态唤醒任务

如果你都忙&＃xff0c;唤醒行为由策略指定
FIFO, deadline

pjt: 反馈是次要目标

宁可先尝试解决当前的问题

PeterZ: 以时间表示

默认是无限

Ingo: “100毫秒是一个很好的默认”

Rafael: 有些CPU需要几百毫秒才能唤醒

这实际上是- 300毫秒

MarkG: Razor i - 200 - 300 ms 唤醒

PeterZ: 最初&＃xff0c;这只会影响空闲唤醒
- intended to affect Artem’s usecase
- if one task, e.g., media player, sets a wakeup latency,
we should avoid having all “120&＃43;” CPUs busy-wait

Ingo: 也应该由子任务继承

pjt: 我想看看问题的例子

Rafael: 对于工作负载&＃xff0c;从哪里开始&＃xff1f;

Ingo: the first patch for tools/ to add workload examples will get applied

…

KevinH: coupled CPUidle
- 一个问题是CPUFreq和CPUidle都不是为SMP设计的

Ingo: 忘记CPUidle / CPUfreq的遗产

KevinH: CPUidle - 各种C状态定义我们关心的事物
- 目标盈亏平衡点

pjt: 我们甚至需要打扰遗留代码吗&＃xff1f;
- 大概不需要

…

Rafael: 如何在一致的视图中表示硬件电源拓扑信息&＃xff1f;

Rafael: better to race to idle vs. DVFS?

PaulW: 看处理器的。

Vincent: 当您在软件中选择C状态时&＃xff0c;硬件可能不会进入

PeterZ: 有反馈机制吗&＃xff1f;

Rafael: 我担心这种方法

硬件将为我们做出决定
for both C-states and P-states

pjt: 如果操作系统开始做正确的事情&＃xff0c;硬件将移动

Ingo: 调度程序&＃xff0c;CPUidle&＃xff0c;CPUfreq都进行交互

如果它们不作为连贯单元进行交互&＃xff0c;则输出仍然是随机的

Rafael: intel_pstate driver

Rafael: 我们想要给调度程序提供什么级别的信息&＃xff1f; 硬件可以覆盖

Ingo: 必须尽最大努力

pjt: 仍然可以有微码更新&＃xff0c;以获得更好的&＃xff1f;

Morten: 我们可以获得正确的代表吗&＃xff1f; 这是很多信息

Ingo: PC缓存拓扑与sched_domains非常匹配

Morten: thermal限制怎么样&＃xff1f; GPU限制&＃xff1f;

pjt: 我们今天没有很好地处理这些问题

Morten: 我们应该在调度程序中有通用位

Rafael: 什么是通用的&＃xff1f;

Ingo: 很多CPUidle都是通用的

Rafael: 唯一的通用代码是governor&＃xff0c;它可能没有做正确的事情

Rafael: C state 的要求是什么&＃xff1f;?

MarkG/PeterZ: - 退出延迟&＃xff08;从空闲约束唤醒&＃xff09;&＃xff0c;盈亏平衡点&＃xff08;在时间上指定&＃xff09;&＃xff0c;拓扑限制&＃xff08;其他核心必须为要输入的C状态而空闲&＃xff09;

Ingo: 我们可以做得更好&＃xff0c;不破坏现有的C states

Artem: 我有一个4 package服务器&＃xff0c;大量的核心
- 我看到不同的核心一直在醒来
- Lameter is doing isolation

PeterZ: 但是Lameter正在从用户空间控制它

Ingo: Full NO_HZ?

PeterZ: 不适用&＃xff0c;因为它是手动区分

Rafael: 性能状态怎么样&＃xff1f;

Ingo: easy

Rafael: 需要什么信息

PeterZ: 需要有关CPU实际实现的内容的反馈
- APERF / MPERF比率有用吗&＃xff1f;

Rafael: 这是一个平均指标

PaulW: 究竟是什么APERF / MPERF

PeterZ: 它是固定频率计时器滴答与CPU在非空闲时花费的周期数之比

PeterZ: 我们需要以某种方式过滤掉空闲时间

Rafael: 对于性能状态&＃xff0c;有两种情况&＃xff1a;

负载可能相同
负载可能不同

PeterZ: Arjan建议&＃xff0c;对于任何迁移任务的更改&＃xff0c;将其提升到最大值并让它再次稳定下来

PeterZ: 对于英特尔来说&＃xff0c;这更容易&＃xff0c;因为P状态并不意味着那么多

MarkG: 使用任务迁移P状态非常重要

PeterZ: 以较慢的CPU速度运行是否有意义&＃xff1f;

Vincent: MP3 playback - race to idle

取决于是否需要保持DDR开启
如果开启&＃xff0c;那么由于DDR是主要的电力消费者

Paul: 经常少量工作的情况 - 在低P状态下运行

Vincent: 电压对功耗的影响比频率更重要

PeterZ: PMU / P单元怎么样&＃xff1f;

pjt: 涡轮模式状态真的很大

我们想要了解turbo模式驻留的内容吗&＃xff1f;

Rafael: 你需要了解什么任务&＃xff1f;

PeterZ: Power7&＃xff1a;SMP有一种turbo 的东西

如果你运行一个线程&＃xff0c;它会很快
如果你运行两个线程&＃xff0c;它会变慢
-如果你运行四个线程&＃xff0c;它会更慢
SMT核心编号也不对称
最左边的包装朝向CPU0

pjt: 成本&＃xff08;瓦特&＃xff09;和能力

PeterZ: capacity &＃61;&＃xff08;每个时间单位可以执行的循环数&＃xff09;

pjt: capacity &＃61;“cpu power”在调度程序中

Ingo/pjt: cost用于确定是否打包任务并增加P状态&＃xff0c;还是分布在多个CPU上

pjt: 从sched_domain重新计算开始

sched_domains只是最后一级缓存

PeterZ: start with the text story, to try to parameterize that

尝试确定特定细节的收益水平

pjt: 描述您关心的拓扑并尝试将它们分开

PaulW: Morten提出的钩子怎么样&＃xff1f;

PeterZ: 除了钩子之外别无他物

pjt: 摆脱CPUidle&＃xff0c;CPUfreq

钩子反转
如何要求一个 p-state
如何询问我们所处的p-state

pjt: 将CPUidle放入调度程序将是一个非常重要的第一步

PaulW: CPUidle调控器的性能倍增器怎么样&＃xff1f;

Rafael: CPUidle: ladder vs. menu governor

Vincent: Daniel Lezcano’s patches

Ingo: 我们不需要担心没有NO_HZ的平台

Rafael: 首先将CPUidle调控器移动到调度程序中

Rafael: 单一的governor - 不可插拔

PeterZ: CPUFreq just needs to die

Ingo: 从你满意的事情开始

然后将它与包装结合起来

Artem: 当1个cpu在线且50&＃xff05;时&＃xff0c;在哪里安排任务的情况怎么样&＃xff1f;

PeterZ: compute cost functions for packing, compute cost functions for spreading

用那个

pjt: 使用任务历史

…

Ingo: 不要担心缓存交互&＃xff0c;任务通信

PeterZ: 如果你使负载平衡器立方&＃xff0c;我会反对

pjt: 摆脱点对点负载平衡会很好

Morten Rasmussen

Topics:

Unifying Power PoliciesTask PackingPower Drivers
Unifying Power Policies

更好地协调cpufreq&＃xff0c;cpuidle&＃xff0c;调度。

Rafael: 取代cpufreq&＃xff1f; 是! Stephane&＃xff08;&＃xff1f;&＃xff09;想升级。

Shortcomings: 锁&＃xff0c;空闲计数。但是cpuidle只是改变CPU状态的驱动程序。

Peter Z: 有些cpuidle启发式方法不太正确。

Rafael: 是的&＃xff0c;情节显示出相当大的随机性。

[Peter: Tell Ted! He needs randomness!!!]

Peter: 不清楚为什么状态转变看起来过于复杂。想解释一下。

Rafael: 历史原因。

How to measure?

Ingo: 需要更好的测量方法。当前的方法是临时的&＃xff0c;并且不同的工作负载用于评估不同的补丁。需要良好的负载集。

Mark Gross&＃xff1a;保持一个指标。

Ingo&＃xff1a;希望类似的“语言”能够权衡功率和性能测量。因此&＃xff0c;一个补丁在一个方向上有所改进

Mark&＃xff1a;许多不同的工作负载&＃xff0c;还需要除吞吐量之外的其他措施&＃xff0c;例如用户体验。 “通往地狱的道路。”

Peter&＃xff1a;但英特尔确实提高了能效。

Mel&＃xff1a;但可以使用代理&＃xff0c;例如唤醒。此外&＃xff0c;许多唤醒都可能表明交互问题。

Ingo&＃xff1a;不同意&＃xff0c;大多数减少唤醒的方法 - 增加延迟。

Mel&＃xff1a;仍然相信唤醒的数量与用户体验有关。

Ingo: Dangerous to extrapolate from current broken energyscheduling to any general statement.

Paul Turner: 也许唤醒延迟变化是一个很好的衡量标准。

Ingo: Most new systems have very cheap wakeups.Paul Walmsley: Not all of the lowest-power hardware.

Mark&＃xff1a;每瓦工作和待机功耗是英特尔使用的两个指标。

Ingo&＃xff1a;观看电影不是一个糟糕的基准&＃xff0c;应该是测量套件的一部分。另一个组件可能具有完全CPU利用率。

Peter&＃xff1a;一次倾倒更多音频对提高能效有好处&＃xff0c;但需要与视频同步。如果缓冲区大小不同&＃xff0c;则内核无法合并音频和视频唤醒

Rafael&＃xff1a;可以轻松测量单CPU工作&＃xff0c;但多处理器工作的许多复杂功能 - 仅用于测量。

Ingo&＃xff1a;复杂性是SMP固有的 - 但我们甚至没有正确地做简单的事情&＃xff0c;我们也不能接受基于信仰的补丁。我们需要一些方法来衡量它。在tools目录中将一些基准测试放入内核本身。

Artem&＃xff1a;什么基准&＃xff1f;视频不太适合服务器&＃xff0c;因为他们有糟糕的视频硬件。

Ingo&＃xff1a;但从CPU的角度来看&＃xff0c;没关系。

Rafael: 但视频播放在服务器上并不是什么大问题。

Ingo&＃xff1a;仍然是一个很好的能效负载&＃xff0c;可以测量不同硬件上的东西。

Peter&＃xff1a;很难得到答案“为了节省多节点硬件的功耗&＃xff0c;英特尔希望内核做些什么&＃xff1f;”

Mark&＃xff1a;我们将根据总数据中心功率限制进行调整。

Peter&＃xff1a;大系统的问题在于内存消耗了大量的能量&＃xff0c;而CPU电源管理对此没有帮助。

Artem&＃xff1a;CPU在我的盒子上是一个大问题。我测量进入服务器箱的实际电量。

Peter&＃xff1a;应该首先关注电池供电的系统。

Artem: 不。

Rafael: Just have different set of benchmarks for differenttypes of systems.Paul Turner: Big systems often care more about latency thanabout power.Ingo: Linearly ramp up workload on more and more CPUs, plotpower consumption and other properties.Rafael: Regardless of the workload, we should measure totalpower.Mark Gross: Measure correctness of code rather than total energy.Look at decisions, see whether they were good decisions.For exmaple, did the C-state governor correctly predicthow long the system stay idle.Mel: Accurate prediction might not help in saving energy.For example, making C-state decisions on a per-CPUbasis might give sub-optimal results. A globaldecision might work better from the user&＃39;s viewpoint.Srivatsa: Hardware does not necessarily act on softwaresuggestions.Ingo: Yes, but in part because the OS does not get it right.Paul Walmsley: C-state metric is good, but it is only a smallpart of the whole story.Mal: It is not reasonable to expect the user to tune thescheduler.Peter: The video player needs to be able to tell the systemwhat its wakeup requirements are correct.Artem: System-wide PMQOS keeps all my 128 CPUs at the highestpossible C-state. Not what I want.Kevin Hilman: There is now per-device QOS, so can handle whatyou are doing. [And Artem was in fact using the oldsystem-wide API.]Artem: Measuring the full system with a power meter is stilluseful.Peter: SCHED_DEADLINE!Juri: SCHED_DEADLINE useful, but you need to know the constraints.Rafael: Running a workload useful for servers on a battery-poweredsystem is not useful. We need to run workloads onappropriate hardware.Ingo: Give me an example.Rafael: Database workload not good on small systems.Ingo: So have a high-end game as the workload for the smallsystems.Paul Turner: Question for big system: Does it use more thanone socket, yes or no?Mel: Number of cores active to complete a given piece of workin a given time.Peter: Really big systems sometimes have node-local I/O, whichchanges things.Ingo: But the really big systems are carefully managed byhand, and are relatively easy to optimize. Desktopsand smartphones must automatically optimize, whichis harder.Rafael: Need different workloads for different classes ofsystems.Ingo: Unitary workloads matter on all systems. Ramp-up ofsimple workloads should work everywhere for taskgrouping and C-state measurement.Artem: Is there a way to replay workloads?Ingo: perf-sched, but should be replaced by Paul Turner&＃39;swork. But perf-sched is there now.Artem: Capture workload, put trace in the tools directory,replay it on whatever system you have.Vincent Guittot: Difficult to emulate external interrupts.Ingo: We can simulate the interrupts with hrtimers. Notperfect, but it does work.Peter: linsched has a evaluation feature.Paul Turner: Nice to have a workload metric that diverges whenthe scheduler is behaving badly. Easier to optimize.Ingo: Can create standard workloads.Peter: But can we measure energy consumption.Ingo: https://lkml.org/lkml/2015/1/15/136Morten Rasmussen: The reasonable proxies are not likely to becorrelated with any real hardware.Ingo: But an idle CPU should consume less energy than a busy one.Paul Turner: But it is better than not having any metric.Mark Gross: People predict energy consumption, give me target.We beat the code up until we meet the target.Ingo: If you cannot measure power directly, need a proxy.Srivatsa: We have a model that estimates power savings frommemory power management.Juri Lelli: Power measurement integration into linsched?Paul Turner: Never got complete information...Ingo; Peter needs something to determine whether or not toaccept patches.Peter: I will accept patches that save power for the submitter,but I need a consistent story on how this happensand the patch must be sane. Per-architecturescheduler is not acceptable. Happy to not haveenergy awareness, but...Paul Turner: Move to get things into one place, how do weoptimize this? But if we have some notion of theenergy consumption we could get some sane input.Rafael: What do scheduler maintainers want?Peter: We need something that helps -your- energy efficiency.Scheduler currently just picks a CPU regardless ofwhat the energy consequences might be.Paul Turner: C-state transitions take microseconds. Videoplayback doesn&＃39;t care about microseconds, but otherworkloads might care.Ingo: Having a set of workloads would allow a no-regressionspolicy.Mark: But the energy consumption depends strongly on theplatform. The goal is often to get an extra tenminutes out of the battery for video playback.Ingo: But there is low-hanging fruit from simple workloadsand strategies.Mark Gross: Energy strategy depends on whether you are runningon battery or not. And some hardware has power domaincovering all CPUs, so that idling some but not allCPUs gives little benefit. Other hardware has separatepower gating, so optimal strategy differs.Peter: But the gating can be exposed to the scheduler, hopefullyin simple form.Ingo: Per-core policy?Artem: Per-package for sure!Paul Turner: Story approach. Smartphone playing video, for anexample story.Ingo: Need some topology input.Paul McKenney: Hardware complications affect SMP performance aswell. Not trying to squeeze out the last 1%, gettingto 80-90% automatically will be good. Hand-tuning willbe needed for last 1%.

o Task packing.

Vincent: Have pack-or-spread flag to control task spread.Paul Turner: Would rather have sched-domain relationshipinstead of flag. The hyperthreading experience wasnot so good, don&＃39;t want to constrain based on singleflag. Difficult to work out how it should changeall of the many decisions that the scheduler must make.Ingo: Base much of the decision on battery-powered or not.Have additional override input -- might have AC-poweredsystem that cared more about energy than performance.On battery power, main goal is small energy per unitwork. Off of battery power, main goal is performance(e.g., low latency or high throughput).Mel: Before the preceding discussion, we had no benchmarksand metrics. After the discussion, we still haveno benchmarks or metrics!Alex Shi: Two power models. (1) best Performance per unit power.(2) lowest power consumption.Paul Walmsley: In the context of splitting vs. packing?Peter: No, in general. Earlier approach had something like27 different knobs, not something that administratorscan be expected to deal with optimally.John Linville: Similar to rate-control algorithms in wireless.Choosing a higher transmit rate does not necessarilyresult in higher throughput -- it might increase theerror rate so as to more than offset the higher speed.Ingo: But wireless has the luxury of having a good metric!Peter: There are power measurements, but I have no idea whetherI can trust them!Artem: I have a power meter in my lab. Give me a workloadand I can run them.Peter: Send all your workloads to Artem and he will run them! ;-)Rafael: What exactly is the complaint with the current scheduler.Peter: One complaint is that periodic tasks get bounced amongCPUs.Paul Turner: But if the workload is heavy, migration is neededto reduce idle time.Srivatsa: Do we also need a low-latency knob?Peter: Need a way for tasks to express their latency constraints.Inherited per task. Default setting?

BREAK

o softirq and tasklets. All uses of tasklets unmaintained…

o What should be default latency setting?

Artem: Default should be minimal latency to speed up boot.Peter: Who cares about boot? Won&＃39;t make much difference.For FIFO, constrained by rules, for OTHER we just dosomething. Other rules for DEADLINE. If you need tomeet certain constraints, you need to ensure that yourwakeup times are consistent with those constraints.Ingo: Relationship between latency and throughput.Battery-powered device would prefer power efficiencyover either latency or throughput.Paul Turner: Yes, but that is normally a secondary consideration.Peter: Why the concern over boot? My phone takes forever to boot.Only happens when it runs out of battery.Artem: We care about boot speed.Peter: Wakeup time limit?Mark: Should allow up to a second.Rafael: 300 milliseconds in some cases.Paul Walmsley: Task wakeup from idle the main effect?Peter: Also affects latency requirements. If you have a realtimetask, do you idle-spin all CPUs? Hopefully not, whichis why latency should be a per-task property. Mightidle-spin all CPUs anyway for some time, but it at leastallows an opportunity for improvement.Ingo: Need also topology information on energy propertiesof devices and systems.Rafael: How to get benchmarks started?Ingo: The first guy to submit a benchmark gets it automaticallyaccepted.Paul Walmsley: Off to the races!!!Kevin Hilman: Coupled states. Coupled CPUs run at highest C-state,and when all CPUs have gone to a lower state then andonly then is the actual C-state reduced. Define thetarget residencies and breakeven times.Paul Turner: How does this affect current code?Kevin Hilman: The actual numbers in the kernel are often lies,and testing and tuning are required. There is nodocumentation.Paul Walmsley: The hardware guys provide numbers from EDA tools,then the software guys try to meet those targets.Rafael: Which of these things does the scheduler need to know about?Ingo: All of them. The scheduler is capable of messing uppretty much anything.Paul Walmsley: Haven&＃39;t investigated the relationship betweencpuidle and cpufreq.Ingo: Mostly need to run flat out -- some exceptions, of course.Vincent: When you choose a C-state, you have no guaranteethat the hardware has paid attention.Peter: We do need some feedback -- if we told the hardware todo something, we need to know what it actually did.Rafael: Some hardware will make decisions for us.Ingo: cpuidle, cpufreq, and scheduler interact. If they arenot unified, the result will be random.Rafael: At what level should we do the integration? Couldprovide the scheduler with everything we know, butwe might not be able to really integrate control.Ingo: Best effort.Paul Turner: And tuning one blob of code should be better thantuning three blobs of code.Morten: Lots of platform with different properties, how tohandle?Ingo: Need to represent via topoology information.Morten: Lots of properties -- coupling, clock and powerdomains, GPUs, ...Paul Turner: But we don&＃39;t handle it well today, so we wouldnot be losing anything.Rafael: Where does the architecture-specific code go?Ingo: cpuidle could be considered the common framework,with arch-specific code underneath.Peter: Three major things: exit latency, breakeven point, topology.Mark: If scheduler was aware of the topology constraints ofthe deepest C-states, that would help a lot.Peter: Idle duration estimate and breakeven point governsC-state depth, which determines wakeup latency.Package topology important for large system becausebest gains come from powering off entire package.Paul Turner: Push and pull operation in scheduler.Artem: 4-package server. Can see idle time. Would like toconcentrate system noise on one core.Ingo: Artem needs to talk to Frederic Weisbecker to useNO_HZ_FULL.Rafael: So we know what we need to do with the C-states.What about power setting? Do you need to knowpower states and acceptable performance degradation?Peter: Would like feedback on the real work that a given CPUis getting done. Does aperf/mperf work?Rafael: Get average values from aperf/mperf. One countseverything, one non-idle.Peter: One clock is always running, the other counts cycles.Rafael: divide them, work with ratio.Peter: Yes, but need to filter out the idle time. Need tobe careful with feedback loops: Not doing much, sooffload my tasks, doing even less, offload more.Rafael: All of this is sensitive to the workload. New workloadsmight need different policies.Peter: Arjan suggested ramp up to max if the task is a realtimetask, then see what happens. Easier for Intel becausethe P-states don&＃39;t mean that much any more.If the GPUs are running flat out, the thermal throttlingwill throttle the CPUs as well as the GPU.Mark: Possible issue: migrate from 1GHz CPU to 600MHz CPU,miss latencies.Peter: Does it ever make sense to run at non-maximum speed?Many: In some specific cases, yes.Ingo: Is there a typical case for running at a non-maximalfrequency other than the high-cache-miss case?Kevin: Thermal constraints.Ingo: Yes, but that is a separate constraint.Rafael: Periodic workloads can sometimes work betterat lower frequencies.Vincent: Some cases require the CPUs stay on, and running fullfrequency doesn&＃39;t make sense.Paul Turner: Handle that as a secondary special case.Can possibly use SCHED_DEADLINE for this.Ingo, Vincent: Use highest frequency in a given voltage range.Paul Walmsley: There is a power benefit for running a lowfrequency within a given voltage band. But voltageis more important.Peter: Let&＃39;s start with voltage.Mark Gross: P-unit will opportunistically do burst mode,getting higher frequency, can also throttle frequency.One chip runs at 1.5GHz, but will throttle back to1.1GHz if there is heavy graphics workload.Paul Turner: Need feedback on how much time was spent inturbo mode. Otherwise, we can get oscillationas turbo mode goes on and off.Peter: Power 7 has SMT fun. If you run one SMT threadon the first thread, get high performance. Ifyou run the first two, lower performance per thread.If running four threads, yet slower per-threadperformance.Mark: On Intel, turbo-boosting happens if one thread activeper core, but not if more than one thread per core.Rafael: What granularity required?Peter: Need amount of work actually done during a given periodof time.Ingo: Also need topology information.Kevin: Most hardware might not do what you told it to do, butit will tell you what it actually did.Vincent: "Power down when possible", and find out later ifyou actually powered down?Paul Turner: "Power" in the scheduler is not all that wellcorrelated with power, just normalized to the lowestbin.Ingo: One open question is handling of asymmetry, as in big.LITTLE.Paul Turner: Start with specific stories and refine/generalize.Paul Walmsley: Patches as story? Hardware as story?Peter: Start with English text. What does everyone want, wheredo we draw the line on how specific hardware is handled.Need numeric tradeoffs, for example, voltage stepsbeing more important than frequency steps.Paul Turner: Describe different cases separately, don&＃39;t tryto mix.Paul Walmsley: Morten&＃39;s patches?Peter: I have no overall framework in which to judge Morten&＃39;spatches, it goes no further.Paul Turner: How to ask for lower C-state, what C-state are we in?Ingo: Maybe start by moving code rather than by just cuttingit out.Rafael: Start by moving the coherent part of cpuidle intothe scheduler.Ingo: Perhaps as a separate file, at least to start with.Paul Turner: Can move things around later.Paul Walmsley: Governor?Rafael: Manual governor better for systems with CONFIG_NO_HZ.Automatic better for periodic scheduling clock.Thomas Gleixner: Still have systems that do not supportCONFIG_NO_HZ, but they don&＃39;t support clockeventseither.Rafael: cpuidle governor that is part of the scheduler.Kevin Hilman: Pluggable governors?Ingo, Rafael: No. Good for experimentation, not good forend users.Rafael: Use low-level part of the current cpuidle governor.Ingo: Don&＃39;t need immediate optimality. Start somewhere, improvefrom there. Forget about cpufreq. Need mostly thevoltage switches.Peter: "cpuidle needs to die, just leave it there and stopusing it."Artem: One task, 50% CPU consumed, where to schedule next task?Peter: I don&＃39;t know, but it probably depends on the voltagesteps. Compare wake-up cost and voltage-increaseincremental cost.Paul Turner: Increase frequency for x watts, increase voltageand frequency for y watts. Choose the best of the two.Paul Walmsley: Need accurate cost functions for differentpossible wake-up locations.Artem: Start on different core in same package or on differentpackage?Rafael: Depends on power-gating topology.Artem: Same package might be better for power, but differentpackage would provide more cache.Ingo: We have a quadratic algorithm for NUMA, which is uncool.Peter: NUMA cost is so high that we can justify costly schedulingcalculations.Ingo: Higher cost is probably acceptable.Peter: Don&＃39;t want cubic complexity!Ingo: OK, acceptable within broad limitsPaul Turner: Would be good to get rid of point-to-pointwakeup, but that starts getting into bin-packingproblem.Artem: Want all system noise on one CPU.Ingo, Peter: NO_HZ_FULL, but complex.