
What advice can you give me for writing a meaningful benchmark?

I have developed a framework that is used by several teams in our organisation. The "modules" developed on top of this framework can behave quite differently, but they are all fairly resource-intensive, some more than others. They all receive data as input, analyse and/or transform it, and send it on.

We planned to buy new hardware and my boss asked me to define and implement a benchmark based on the modules in order to compare the different offers we have got.

My idea is simply to start each module in sequence, with a well-chosen set of data as input.

Do you have any advice? Any remarks on this simple procedure?

6 answers

#1


9  

Your question is pretty broad, so unfortunately my answer will not be very specific either.

First, benchmarking is hard. Do not underestimate the effort necessary to produce meaningful, repeatable, high-confidence results.

Second, what is your performance goal? Is it throughput (transactions or operations per second)? Is it latency (the time it takes to execute a transaction)? Do you care about average performance? Do you care about worst-case performance? If so, do you care about the absolute worst case, or that 90%, 95% or some other percentile of transactions get adequate performance?

Depending on which goal you have, then you should design your benchmark to measure against that goal. So, if you are interested in throughput, you probably want to send messages / transactions / input into your system at a prescribed rate and see if the system is keeping up.
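
As a rough illustration of the throughput case, here is a minimal Python sketch that feeds inputs to a module at a prescribed rate and reports whether processing kept up. The names `run_at_rate` and `process` are hypothetical stand-ins for whatever entry point your modules expose; they are not part of any real framework.

```python
import time

def run_at_rate(process, items, rate_per_s):
    """Feed items to process() at a target rate and report how well it kept up.

    `process` is a hypothetical stand-in for one module's entry point.
    """
    interval = 1.0 / rate_per_s
    start = time.perf_counter()
    max_lag = 0.0
    for i, item in enumerate(items):
        scheduled = start + i * interval
        now = time.perf_counter()
        if now < scheduled:
            # Ahead of schedule: wait until this item is due.
            time.sleep(scheduled - now)
        else:
            # Behind schedule: record how far the module has fallen behind.
            max_lag = max(max_lag, now - scheduled)
        process(item)
    elapsed = time.perf_counter() - start
    achieved = len(items) / elapsed if elapsed > 0 else 0.0
    return {"target_rate": rate_per_s, "achieved_rate": achieved, "max_lag_s": max_lag}
```

If `max_lag_s` keeps growing as you add items, the system is not keeping up at that rate on that hardware.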

If you are interested in latency, you would send messages / transactions / input and measure how long it takes to process each one.
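
A minimal sketch of the latency case, again in Python and again assuming a hypothetical `process` callable that handles one input: time each call individually, then look at percentiles rather than only the average. The percentile here uses the simple nearest-rank definition, which is one of several common conventions.

```python
import math
import time

def measure_latencies(process, items):
    """Return the wall-clock time, in seconds, taken to process each item."""
    latencies = []
    for item in items:
        t0 = time.perf_counter()
        process(item)
        latencies.append(time.perf_counter() - t0)
    return latencies

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

For example, `percentile(latencies, 95)` answers "how slow are the slowest 5% of transactions?", which an average would hide.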

If you are interested in worst-case performance, you will add load to the system up to whatever you consider "realistic" (or whatever the system design says it should support).

Also, you do not say whether these modules are going to be CPU bound or I/O bound, whether they can take advantage of multiple CPUs/cores, etc. As you are trying to evaluate different hardware solutions, you may find that your application benefits more from a great I/O subsystem than from a huge number of CPUs.

Third, the best benchmark (and the hardest) is to put realistic load into the system. That means recording data from a production environment and running the new hardware solution against this data. Getting this done is harder than it sounds. Often it means adding all kinds of measurement points to the system to see how it behaves (if you do not have them already), modifying the existing system to add record/playback capabilities, modifying the playback to run at different rates, and getting a realistic (i.e., similar to production) environment for testing.
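
To make the playback idea concrete, here is a small Python sketch that replays recorded events with their original spacing, optionally sped up. The `(timestamp, payload)` record format, the `replay` name and the `handler` callable are all assumptions made for illustration; a real recording would use whatever format your framework can capture.

```python
import time

def replay(events, handler, speedup=1.0, sleep=time.sleep):
    """Replay recorded events, preserving their relative timing.

    events:  list of (timestamp_s, payload), sorted by timestamp (assumed format)
    handler: hypothetical callable that feeds one payload into the system
    speedup: 2.0 replays twice as fast as recorded, 0.5 at half speed
    """
    if not events:
        return 0
    first_ts = events[0][0]
    start = time.perf_counter()
    for ts, payload in events:
        due = (ts - first_ts) / speedup            # when this event should fire
        delay = due - (time.perf_counter() - start)
        if delay > 0:
            sleep(delay)
        handler(payload)
    return len(events)
```

Running the same recording at `speedup=1.0`, `2.0`, `4.0` and so on gives a crude way to find the point at which a given machine stops keeping up.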

#2


2  

The most meaningful benchmark is to measure how your code performs under everyday usage. That will obviously provide you with the most realistic numbers.

Choose several real-life data sets and put them through the same processes your org uses every day. For extra credit, talk with the people that use your framework and ask them to provide some "best-case", "normal", and "worst-case" data. Anonymize the data if there are privacy concerns, but try not to change anything that could affect performance.

Remember that you are benchmarking and comparing two sets of hardware, not your framework. Treat all of the software as a black box and simply measure the hardware performance.

Lastly, consider saving the data sets and using them to similarly evaluate any later changes you make to the software.

#3


1  

If your system is supposed to be able to handle multiple clients all calling at the same time, then your benchmark should reflect this. Note that some calls will not play well together. For example, having 25 threads post the same bit of information at the same time could lead to locks on the server end, thus skewing your results.
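
A minimal Python sketch of a concurrent driver, assuming a hypothetical `call` function that sends one request to the system under test. Passing a different payload to each call, rather than the same one 25 times, is one way to avoid the artificial lock contention described above.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_concurrent(call, payloads, clients):
    """Issue call(payload) from `clients` worker threads.

    Returns (total_wall_time_s, per_call_latencies_s).
    `call` is a hypothetical stand-in for one client request.
    """
    def timed(payload):
        t0 = time.perf_counter()
        call(payload)
        return time.perf_counter() - t0

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=clients) as pool:
        latencies = list(pool.map(timed, payloads))
    return time.perf_counter() - start, latencies
```

Threads are a reasonable fit when each call mostly waits on I/O; for CPU-heavy work inside the Python process itself, separate processes would be the more faithful model.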

From a nuts-and-bolts point of view, I've used Perl and its Benchmark module to gather the information I care about.

#4


1  

If you're comparing differing hardware, then measuring the cost per transaction will give you a good comparison of the trade offs of hardware for performance. One configuration may give you the best performance, but costs too much. A less expensive configuration may give you adequate performance.
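
One simple way to put numbers on this, sketched in Python: divide each offer's price by its measured sustained throughput and rank the offers that meet your requirement. The offer names, prices and throughput figures in the usage note are made up, and price per unit of throughput is only a proxy for cost per transaction, since it ignores lifetime, power and maintenance.

```python
def rank_offers(offers, required_tps):
    """Rank hardware offers by price per unit of sustained throughput.

    offers: dict mapping offer name -> (price, measured_tps)
    Offers below required_tps are dropped; the cheapest per tps comes first.
    """
    eligible = {
        name: price / tps
        for name, (price, tps) in offers.items()
        if tps >= required_tps
    }
    return sorted(eligible.items(), key=lambda kv: kv[1])
```

With hypothetical offers `{"A": (10000.0, 500.0), "B": (30000.0, 1000.0), "C": (4000.0, 100.0)}` and a requirement of 400 transactions per second, C is excluded as too slow and A ranks ahead of B even though B is faster in absolute terms.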

It's important to emulate the "worst case" or "peak hour" load. It's also important to test with "typical" volumes. It's a balancing act: you want good server utilization and the required performance without paying too much.

Testing across hardware configurations quickly becomes expensive. Another viable option is to first measure on the configuration you have, then simulate that behavior across virtual systems using a model.

#5


0  

If you can, try to record some of the operations that users (or processes) perform with your framework, ideally using a clone of the real system. That gives you the most realistic data. Things to consider:

  1. Which functions are most often used?
  2. How much data is transferred?
  3. Do not assume anything. If you think "that is going to be fast/slow", don't bet on it. In 9 out of 10 cases, you're wrong.

Create a top ten for 1+2 and work from that.
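
Building such a top ten from recorded operations can be sketched in a few lines of Python. The log format, one `(operation_name, bytes_transferred)` pair per recorded call, is an assumption; adapt it to whatever your recording produces.

```python
from collections import Counter

def top_operations(log, n=10):
    """Return the n most frequently called operations and the n operations
    that moved the most data.

    log: iterable of (operation_name, bytes_transferred) pairs (assumed format)
    """
    calls = Counter()
    volume = Counter()
    for name, nbytes in log:
        calls[name] += 1
        volume[name] += nbytes
    return calls.most_common(n), volume.most_common(n)
```

The two rankings often differ: an operation that is rarely called can still dominate the data volume, so it is worth looking at both.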

That said: If you replace old hardware with new hardware, you can expect roughly 10% faster execution for each year that has passed since you bought the first set (if the systems are otherwise pretty equal).

If you have a specialized system, the numbers may be completely different, but usually new hardware doesn't change much. For example, adding a useful index to a database can reduce the runtime of a query from two hours to two seconds. Hardware will never give you that.

#6


0  

As I see it, there are two kinds of benchmarks when it comes to benchmarking software. First, there are microbenchmarks, where you try to evaluate a piece of code in isolation, or how a system deals with a narrowly defined workload. Examples: compare two sorting algorithms written in Java, or compare how fast two web browsers can each perform some DOM manipulation operation. Second, there are system benchmarks (I just made the name up), where you try to evaluate a software system under a realistic workload. Example: compare my Python-based backend running on Google Compute Engine and on Amazon AWS.

When dealing with Java and similar VM-based languages, keep in mind that the VM needs to warm up before it can give you realistic performance. If you measure time with the time command, the JVM startup time will be included. You almost always want to either ignore start-up time or keep track of it separately.

Microbenchmarking

During the first run, CPU caches are being filled with the necessary data. The same goes for disk caches. During the next few runs the VM continues to warm up, meaning the JIT compiles whatever it deems worth compiling. You want to ignore these runs and start measuring afterwards.
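
The pattern of discarding warm-up runs looks roughly like this. The sketch is in Python for brevity, where there is no JIT in the standard interpreter but cache warm-up still applies; the `timed_runs` name and the default counts are arbitrary choices, not recommendations.

```python
import time

def timed_runs(func, warmup=5, runs=30):
    """Call func() `warmup` times without measuring, then time `runs` calls.

    Returns the list of measured durations in seconds.
    """
    for _ in range(warmup):
        func()          # results discarded: caches (and any JIT) warm up here
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        func()
        samples.append(time.perf_counter() - t0)
    return samples
```

How many warm-up iterations are enough depends on the runtime; plotting the samples in order and checking where they level off is a practical way to find out.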

Make a lot of measurements and compute some statistics: mean, median, standard deviation. Plot a chart, look at it and see how much the results vary. Things that can influence the result include GC pauses in the VM, frequency scaling on the CPU, some other process starting a background task (like a virus scan), and the OS deciding to move the process to a different CPU core. If you have a NUMA architecture, the effects are even more marked.
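
Summarising a set of timing samples needs nothing beyond Python's standard `statistics` module. The `summarise` helper below is a hypothetical convenience wrapper; the choice of statistics follows the ones named above.

```python
import statistics

def summarise(samples):
    """Return basic descriptive statistics for a list of timing samples."""
    return {
        "n": len(samples),
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        # Sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "min": min(samples),
        "max": max(samples),
    }
```

A mean well above the median, or a large gap between median and max, usually points to occasional outliers such as a GC pause or a background task rather than to uniformly slow runs.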

In the case of microbenchmarks, all of this is a problem. Kill whatever processes you can before you begin. Use a benchmarking library that can handle some of this for you, such as https://github.com/google/caliper or similar.

System benchmarking

When benchmarking a system under a realistic workload, these details do not really interest you, and your problem is "only" to know what a realistic workload is, how to generate it and what data to collect. It is always best if you can instrument a production system and collect data there. You can usually do that, because you are measuring end-user characteristics (how long a web page took to render) and these are I/O bound, so the code gathering data does not slow down the system. (The page needs to be shipped to the user over the network; it does not matter if we also log a few numbers in the process.)

Be mindful of the difference between profiling and benchmarking. Benchmarking can give you the absolute time spent doing something; profiling gives you the relative time spent doing something compared to everything else that needed doing. This is because profilers run heavily instrumented programs (a common technique is to stop the world every few hundred ms and save a stack trace), and the instrumentation slows everything down significantly.

