I have developed a framework that is used by several teams in our organisation. The "modules" built on top of this framework can behave quite differently, but they are all fairly resource-intensive, some more than others. They all receive data as input, analyse and/or transform it, and send it on.
We plan to buy new hardware, and my boss asked me to define and implement a benchmark based on the modules in order to compare the different offers we have received.
My idea is simply to run each module sequentially with a carefully chosen set of data as input.
Do you have any advice? Any remarks on this simple procedure?
Your question is pretty broad, so unfortunately my answer will not be very specific either.
First, benchmarking is hard. Do not underestimate the effort necessary to produce meaningful, repeatable, high-confidence results.
Second, what is your performance goal? Is it throughput (transactions or operations per second)? Is it latency (the time it takes to execute a transaction)? Do you care about average performance? About worst-case performance? Do you care about the absolute worst case, or that the 90th, 95th, or some other percentile gets adequate performance?
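To make those goals concrete, here is a minimal Python sketch of how the different metrics fall out of the same set of latency samples (the `summarize_latencies` helper and its nearest-rank percentile are my own illustration, not part of any framework):

```python
import statistics

def summarize_latencies(samples_ms):
    """Summarize a list of per-transaction latencies (milliseconds)."""
    ordered = sorted(samples_ms)

    def percentile(p):
        # Nearest-rank percentile: smallest value covering p% of samples.
        index = max(0, int(round(p / 100 * len(ordered))) - 1)
        return ordered[index]

    return {
        "mean": statistics.mean(ordered),   # average performance
        "p90": percentile(90),              # 90th percentile
        "p95": percentile(95),              # 95th percentile
        "worst": ordered[-1],               # absolute worst case
    }

# Example: 100 samples, mostly fast, with a slow tail.
samples = [10.0] * 90 + [50.0] * 9 + [400.0]
summary = summarize_latencies(samples)
```

Note how different the answers are: the mean (17.5 ms) and the worst case (400 ms) of the same run tell very different stories, which is exactly why the goal must be chosen first.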
Whichever goal you have, design your benchmark to measure against it. So, if you are interested in throughput, you probably want to send messages / transactions / input into your system at a prescribed rate and see whether the system keeps up.
If you are interested in latency, you would send messages / transactions / input and measure how long it takes to process each one.
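A rough sketch of such a paced harness, assuming a hypothetical `process(item)` callable standing in for one of the modules:

```python
import time

def run_paced(process, items, rate_per_sec):
    """Feed items to `process` at a fixed rate; return per-item latencies
    and the wall-clock overrun versus the ideal schedule."""
    interval = 1.0 / rate_per_sec
    latencies = []
    start = time.perf_counter()
    for i, item in enumerate(items):
        # Wait until this item's scheduled send time.
        scheduled = start + i * interval
        now = time.perf_counter()
        if now < scheduled:
            time.sleep(scheduled - now)
        t0 = time.perf_counter()
        process(item)                     # the module under test
        latencies.append(time.perf_counter() - t0)
    ideal = len(items) * interval
    overrun = (time.perf_counter() - start) - ideal
    return latencies, overrun  # overrun > 0: the system is not keeping up

# Usage sketch with a trivial stand-in for a module:
lats, over = run_paced(lambda x: x * 2, range(50), rate_per_sec=1000)
```

The same run yields both views: `latencies` answers the latency question, `overrun` answers the throughput question.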
If you are interested in worst-case performance, you will add load to the system up to whatever you consider "realistic" (or whatever the system design says it should support).
Third, you do not say whether these modules are going to be CPU-bound or I/O-bound, whether they can take advantage of multiple CPUs/cores, etc. As you are evaluating different hardware solutions, you may find that your application benefits more from a great I/O subsystem than from a huge number of CPUs.
Fourth, the best benchmark (and the hardest) is to put realistic load on the system. That means recording data from a production environment and putting the new hardware solution through that data. Getting this done is harder than it sounds: often it means adding all kinds of measurement points to the system to see how it behaves (if you do not have them already), modifying the existing system to add record/playback capabilities, modifying the playback to run at different rates, and getting a realistic (i.e., similar to production) environment for testing.
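The record/playback idea can be sketched as follows; this assumes, purely for illustration, that messages are logged as JSON lines with a timestamp, and `handler` is a hypothetical entry point of a module:

```python
import json
import time

def record(message, log_file):
    """Append one production message with its arrival time
    (this would be called from the production system)."""
    log_file.write(json.dumps({"t": time.time(), "msg": message}) + "\n")

def replay(log_path, handler, speedup=1.0):
    """Re-send recorded messages to `handler`, preserving the original
    inter-arrival times divided by `speedup` (speedup=2.0 replays
    twice as fast, letting you run the playback at different rates)."""
    with open(log_path) as f:
        entries = [json.loads(line) for line in f]
    if not entries:
        return
    base = entries[0]["t"]
    start = time.perf_counter()
    for e in entries:
        offset = (e["t"] - base) / speedup
        now = time.perf_counter() - start
        if now < offset:
            time.sleep(offset - now)
        handler(e["msg"])
```

The hard part in practice is not this loop but capturing a log that is actually representative of production.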
The most meaningful benchmark is to measure how your code performs under everyday usage. That will obviously provide you with the most realistic numbers.
Choose several real-life data sets and put them through the same processes your org uses every day. For extra credit, talk with the people that use your framework and ask them to provide some "best-case", "normal", and "worst-case" data. Anonymize the data if there are privacy concerns, but try not to change anything that could affect performance.
Remember that you are benchmarking and comparing two sets of hardware, not your framework. Treat all of the software as a black box and simply measure the hardware performance.
Lastly, consider saving the data sets and using them to similarly evaluate any later changes you make to the software.
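A minimal sketch of that approach, assuming a hypothetical `run_module` callable treated as a black box and a set of named data sets from the teams:

```python
import json
import time

def benchmark_datasets(run_module, datasets, results_path):
    """Time `run_module` (treated as a black box) on each named data set
    and save the wall-clock results, so that later hardware or software
    changes can be compared against exactly the same inputs."""
    results = {}
    for name, data in datasets.items():
        t0 = time.perf_counter()
        run_module(data)
        results[name] = time.perf_counter() - t0
    with open(results_path, "w") as f:
        json.dump(results, f, indent=2)
    return results

# Usage sketch: "best-case" and "normal" sets from the teams,
# with a trivial stand-in for a real module.
results = benchmark_datasets(
    run_module=lambda data: sorted(data),
    datasets={"best-case": [1], "normal": list(range(1000))},
    results_path="baseline.json",
)
```

Re-running the same script on the new hardware (or after a software change) gives a directly comparable set of numbers.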
If your system is supposed to handle multiple clients all calling at the same time, then your benchmark should reflect this. Note that some calls will not play well together. For example, having 25 threads post the same bit of information at the same time could lead to locks on the server end, skewing your results.
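A sketch of that scenario in Python (the `call` target and the server-side lock are stand-ins for illustration):

```python
import threading
import time

def hammer(call, n_clients=25, payload="same-bit-of-information"):
    """Have `n_clients` threads issue the same call simultaneously and
    record each client's latency, exposing contention/lock effects."""
    latencies = [0.0] * n_clients
    barrier = threading.Barrier(n_clients)  # release all clients at once

    def client(i):
        barrier.wait()
        t0 = time.perf_counter()
        call(payload)
        latencies[i] = time.perf_counter() - t0

    threads = [threading.Thread(target=client, args=(i,))
               for i in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return latencies

lock = threading.Lock()
def contended_post(data):
    with lock:               # stand-in for a server-side lock
        time.sleep(0.001)

lats = hammer(contended_post)
```

With the lock in place, the spread between the fastest and slowest client shows exactly the skew described above.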
From a nuts-and-bolts point of view, I've used Perl and its Benchmark module to gather the information I care about.
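The same idea in Python rather than Perl, using the standard-library timeit module (the snippet being timed here is just a placeholder):

```python
import timeit

# Time a snippet the way Perl's Benchmark module would: repeat it many
# times and report total / per-iteration cost.
total = timeit.timeit(
    "sorted(data)",
    setup="data = list(range(1000))[::-1]",
    number=1000,
)
print(f"{total:.3f}s total, {total / 1000 * 1e6:.1f}us per call")

# repeat() runs several independent trials; the minimum is usually
# the least noise-contaminated estimate.
best = min(timeit.repeat(
    "sorted(data)",
    setup="data = list(range(1000))[::-1]",
    repeat=5,
    number=1000,
))
```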
If you're comparing differing hardware, then measuring the cost per transaction will give you a good comparison of the hardware/performance trade-offs. One configuration may give you the best performance but cost too much; a less expensive configuration may give you adequate performance.
It's important to emulate the "worst case" or "peak hour" of load, and also to test with "typical" volumes. It's a balancing act: good server utilization that doesn't cost too much while still delivering the required performance.
Testing across hardware configurations quickly becomes expensive. Another viable option is to first measure on the configuration you have, then simulate that behavior across virtual systems using a model.
If you can, try to record some of the operations that users (or processes) perform with your framework, ideally using a clone of the real system. That gives you the most realistic data. Things to consider:
1. What are the most-used features?
2. How much data is transferred?
3. Do not assume anything. If you think "that will be fast/slow", do not bet on it. In 9 out of 10 cases you will be wrong.
Create a top-ten list for points 1 and 2 and work from that.
That said: If you replace old hardware with new hardware, you can expect roughly 10% faster execution for each year that has passed since you bought the first set (if the systems are otherwise pretty equal).
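Compounded over the life of the old hardware, that rough rule of thumb looks like this (the 10% figure is the estimate from above, not a measurement):

```python
def expected_speedup(years, yearly_gain=0.10):
    """Rough rule of thumb: ~10% faster execution per year, compounded."""
    return (1 + yearly_gain) ** years

# Hardware bought five years ago: roughly 1.6x faster execution expected.
print(f"{expected_speedup(5):.2f}x")
```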
If you have a specialized system the numbers may be completely different, but usually new hardware doesn't change much. For example, adding a useful index to a database can reduce the runtime of a query from two hours to two seconds. Hardware will never give you that.
As I see it, there are two kinds of benchmarks when it comes to benchmarking software. First, microbenchmarks, where you try to evaluate a piece of code in isolation, or how a system deals with a narrowly defined workload: compare two sorting algorithms written in Java, or compare how fast two web browsers can each perform some DOM manipulation operation. Second, there are system benchmarks (I just made the name up), where you try to evaluate a software system under a realistic workload: compare my Python-based backend running on Google Compute Engine and on Amazon AWS.
When dealing with Java and the like, keep in mind that the VM needs to warm up before it can give you realistic performance. If you measure time with the time command, the JVM startup time will be included. You almost always want to either ignore start-up time or keep track of it separately.
Microbenchmarking
During the first run, the CPU caches get filled with the necessary data; the same goes for disk caches. During a few subsequent runs the VM continues to warm up, meaning the JIT compiles whatever it deems helpful to compile. You want to ignore these runs and start measuring afterwards.
Make a lot of measurements and compute some statistics: mean, median, standard deviation; plot a chart. Look at it and see how much it varies. Things that can influence the result include GC pauses in the VM, frequency scaling on the CPU, another process starting a background task (like a virus scan), or the OS deciding to move the process to a different CPU core; if you have a NUMA architecture, the effects will be even more marked.
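The discard-warm-up-then-measure loop can be sketched like this (in Python rather than Java, but the structure is the same; `fn` stands in for the code under test):

```python
import statistics
import time

def measure(fn, warmup=10, runs=50):
    """Run `fn` repeatedly, discard warm-up iterations (caches filling,
    JIT compiling), then report statistics over the remaining runs."""
    for _ in range(warmup):
        fn()                              # ignored: warms caches / JIT
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples),
    }

stats = measure(lambda: sorted(range(10_000), reverse=True))
```

A large stdev relative to the mean is the signal to go hunting for the disturbances listed above before trusting the numbers.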
In the case of microbenchmarks, all of this is a problem. Kill whatever processes you can before you begin. Use a benchmarking library that can do some of this for you, such as https://github.com/google/caliper.
System benchmarking
When benchmarking a system under a realistic workload, these details do not really interest you; your problem is "only" to know what a realistic workload is, how to generate it, and what data to collect. It is always best if you can instrument a production system and collect data there. You usually can, because you are measuring end-user characteristics (how long did a web page take to render), and these are I/O-bound, so the code gathering data does not slow the system down. (The page needs to be shipped to the user over the network; it does not matter if we also log a few numbers in the process.)
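That kind of lightweight production instrumentation can be as small as a decorator (sketched in Python; `render_page` is a made-up stand-in for a real handler):

```python
import functools
import logging
import time

def timed(fn):
    """Log the wall-clock time of each call. For an I/O-bound request
    handler, logging a few numbers like this is negligible overhead."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            logging.info("%s took %.1f ms", fn.__name__,
                         (time.perf_counter() - t0) * 1000)
    return wrapper

@timed
def render_page(page_id):
    time.sleep(0.005)        # stand-in for real work
    return f"<html>{page_id}</html>"
```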
Be mindful of the difference between profiling and benchmarking. Benchmarking gives you the absolute time spent doing something; profiling gives you the time spent doing something relative to everything else that needed doing. This is because profilers run heavily instrumented programs (a common technique is to stop the world every few hundred milliseconds and save a stack trace), and the instrumentation slows everything down significantly.