Testing disk IOPS on Linux with FIO

FIO is a very good tool for measuring IOPS. It is used to stress-test and verify hardware, and it supports 13 different I/O engines,

including sync, mmap, libaio, posixaio, SG v3, splice, null, network, syslet, guasi, solarisaio, and others.

fio website: http://freshmeat.net/projects/fio/

I. Installing FIO

wget http://brick.kernel.dk/snaps/fio-2.0.7.tar.gz

yum install libaio-devel

tar -zxvf fio-2.0.7.tar.gz

cd fio-2.0.7

make

make install

Alternatively, with the EPEL repository enabled: yum install fio -y

II. Random read test:

Random read:

fio -filename=/dev/sdb1 -direct=1 -iodepth 1 -thread -rw=randread -ioengine=psync -bs=16k -size=200G -numjobs=10 -runtime=1000 -group_reporting -name=mytest

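Notes:

filename=/dev/sdb1 The file or device to test, usually a file in the data directory of the disk under test. Write tests against a raw device such as /dev/sdb1 overwrite its contents.

direct=1 Bypass the operating system's buffer cache during the test, so that the results reflect the disk itself.

rw=randwrite Test random write I/O.

rw=randrw Test mixed random read and write I/O.

bs=16k The block size of a single I/O is 16k.

bsrange=512-2048 As above, but specifies a range of block sizes.

size=5g Each job transfers 5g in total, in I/Os of bs bytes (4k is fio's default when bs is not set).

numjobs=30 Run the test with 30 threads.

runtime=1000 Run for 1000 seconds. If this is omitted, the job runs until the full size has been transferred.

ioengine=psync Use the psync I/O engine (pread/pwrite).

rwmixwrite=30 In mixed read/write mode, writes make up 30%.

group_reporting Controls the output: report one summary for the whole group instead of one per job.

In addition:

lockmem=1g Pin 1g of memory with mlock, which simulates a machine with less memory available.

zero_buffers Initialize the I/O buffers with zeros (by default fio fills them with random data).

nrfiles=8 The number of files each job generates.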
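The options described above can also be kept in a fio job file, which is easier to rerun than a long command line. The following is a minimal sketch that writes a job file equivalent to the random-read command in this section. The function name write_randread_job and the file name randread.fio are made up for illustration, and the fio invocation is left as a comment because it needs a real device.

```shell
#!/bin/sh
# Write a fio job file equivalent to the random-read command line above.
# write_randread_job and the default file name are hypothetical, for illustration only.
write_randread_job() {
    out="${1:-randread.fio}"
    cat > "$out" <<'EOF'
[global]
filename=/dev/sdb1
direct=1
iodepth=1
thread
ioengine=psync
bs=16k
size=200G
numjobs=10
runtime=1000
group_reporting

[mytest]
rw=randread
EOF
}

# Usage (requires fio and the device under test):
#   write_randread_job randread.fio
#   fio randread.fio
```

Settings shared by all jobs go in the [global] section, and each further section such as [mytest] defines one job, so other workloads can be added by appending sections with a different rw value.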

Sequential read:

fio -filename=/dev/sdb1 -direct=1 -iodepth 1 -thread -rw=read -ioengine=psync -bs=16k -size=200G -numjobs=30 -runtime=1000 -group_reporting -name=mytest

Random write:

fio -filename=/dev/sdb1 -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=psync -bs=16k -size=200G -numjobs=30 -runtime=1000 -group_reporting -name=mytest

Sequential write:

fio -filename=/dev/sdb1 -direct=1 -iodepth 1 -thread -rw=write -ioengine=psync -bs=16k -size=200G -numjobs=30 -runtime=1000 -group_reporting -name=mytest

Mixed random read and write:

fio -filename=/dev/sdb1 -direct=1 -iodepth 1 -thread -rw=randrw -rwmixread=70 -ioengine=psync -bs=16k -size=200G -numjobs=30 -runtime=100 -group_reporting -name=mytest -ioscheduler=noop
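The commands in this section differ mainly in the -rw mode, so they can be generated from one template. Below is a minimal sketch, assuming a hypothetical helper named fio_cmd that only prints the command line and does not run fio. For simplicity it uses -runtime=1000 for every mode and leaves out the -ioscheduler option of the mixed example.

```shell
#!/bin/sh
# Print the fio command line used in this article for a given -rw mode.
# fio_cmd is a hypothetical helper; it prints the command and does not run fio.
fio_cmd() {
    mode="$1"                 # read, write, randread, randwrite or randrw
    dev="${2:-/dev/sdb1}"     # device or file under test
    jobs="${3:-30}"           # number of threads
    extra=""
    # Mixed mode needs a read percentage; 70 matches the example in the text.
    if [ "$mode" = "randrw" ]; then
        extra=" -rwmixread=70"
    fi
    printf 'fio -filename=%s -direct=1 -iodepth 1 -thread -rw=%s%s -ioengine=psync -bs=16k -size=200G -numjobs=%s -runtime=1000 -group_reporting -name=mytest\n' \
        "$dev" "$mode" "$extra" "$jobs"
}

# Print the commands for all five workloads.
for mode in randread read randwrite write randrw; do
    fio_cmd "$mode"
done
```

Printing the command first makes it easy to check the target device before anything is run, which matters for the write modes.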

III. A sample test run:

[root@localhost ~]# fio -filename=/dev/sdb1 -direct=1 -iodepth 1 -thread -rw=randrw -rwmixread=70 -ioengine=psync -bs=16k -size=200G -numjobs=30 -runtime=100 -group_reporting -name=mytest1

mytest1: (g=0): rw=randrw, bs=16K-16K/16K-16K, ioengine=psync, iodepth=1

...

mytest1: (g=0): rw=randrw, bs=16K-16K/16K-16K, ioengine=psync, iodepth=1

fio 2.0.7

Starting 30 threads

Jobs: 1 (f=1): [________________m_____________] [3.5% done] [6935K/3116K /s] [423 /190 iops] [eta 48m:20s] s]

mytest1: (groupid=0, jobs=30): err= 0: pid=23802

read : io=1853.4MB, bw=18967KB/s, iops=1185 , runt=100058msec

clat (usec): min=60 , max=871116 , avg=25227.91, stdev=31653.46

lat (usec): min=60 , max=871117 , avg=25228.08, stdev=31653.46

clat percentiles (msec):

| 1.00th=[ 3], 5.00th=[ 5], 10.00th=[ 6], 20.00th=[ 8],

| 30.00th=[ 10], 40.00th=[ 12], 50.00th=[ 15], 60.00th=[ 19],

| 70.00th=[ 26], 80.00th=[ 37], 90.00th=[ 57], 95.00th=[ 79],

| 99.00th=[ 151], 99.50th=[ 202], 99.90th=[ 338], 99.95th=[ 383],

| 99.99th=[ 523]

bw (KB/s) : min= 26, max= 1944, per=3.36%, avg=636.84, stdev=189.15

write: io=803600KB, bw=8031.4KB/s, iops=501 , runt=100058msec

clat (usec): min=52 , max=9302 , avg=146.25, stdev=299.17

lat (usec): min=52 , max=9303 , avg=147.19, stdev=299.17

clat percentiles (usec):

| 1.00th=[ 62], 5.00th=[ 65], 10.00th=[ 68], 20.00th=[ 74],

| 30.00th=[ 84], 40.00th=[ 87], 50.00th=[ 89], 60.00th=[ 90],

| 70.00th=[ 92], 80.00th=[ 97], 90.00th=[ 120], 95.00th=[ 370],

| 99.00th=[ 1688], 99.50th=[ 2128], 99.90th=[ 3088], 99.95th=[ 3696],

| 99.99th=[ 5216]

bw (KB/s) : min= 20, max= 1117, per=3.37%, avg=270.27, stdev=133.27

lat (usec) : 100=24.32%, 250=3.83%, 500=0.33%, 750=0.28%, 1000=0.27%

lat (msec) : 2=0.64%, 4=3.08%, 10=20.67%, 20=19.90%, 50=17.91%

lat (msec) : 100=6.87%, 250=1.70%, 500=0.19%, 750=0.01%, 1000=0.01%

cpu : usr=1.70%, sys=2.41%, ctx=5237835, majf=0, minf=6344162

IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%

submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%

complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%

issued : total=r=118612/w=50225/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):

READ: io=1853.4MB, aggrb=18966KB/s, minb=18966KB/s, maxb=18966KB/s, mint=100058msec, maxt=100058msec

WRITE: io=803600KB, aggrb=8031KB/s, minb=8031KB/s, maxb=8031KB/s, mint=100058msec, maxt=100058msec

Disk stats (read/write):

sdb: ios=118610/50224, merge=0/0, ticks=2991317/6860, in_queue=2998169, util=99.77%

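The figures to look at are the iops values on the read and write lines above (iops=1185 for reads and iops=501 for writes in this run).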
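When many runs have to be compared, the two iops values can be pulled out of the report with a short filter. The sketch below is one way to do it, assuming the fio 2.0.x output format shown in this sample; extract_iops is a hypothetical helper name, and newer fio releases print these lines in a different format, so the pattern would need adjusting there.

```shell
#!/bin/sh
# Print "read <iops>" and "write <iops>" from fio 2.0.x style output on stdin.
# extract_iops is a hypothetical helper written for the output format of this sample.
extract_iops() {
    sed -nE 's/^[[:space:]]*(read|write)[[:space:]]*:.*iops=([0-9]+).*/\1 \2/p'
}

# Two lines copied from the sample run in this article.
sample='read : io=1853.4MB, bw=18967KB/s, iops=1185 , runt=100058msec
write: io=803600KB, bw=8031.4KB/s, iops=501 , runt=100058msec'

printf '%s\n' "$sample" | extract_iops

# Cross-check: IOPS is bandwidth divided by block size (KB/s divided by 16 KB).
echo "read iops from bandwidth: $((18967 / 16))"
echo "write iops from bandwidth: $((8031 / 16))"
```

The cross-check at the end shows that the reported iops are simply the bandwidth divided by the 16k block size, which is a quick way to confirm that a run used the block size you intended.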

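**Analysis of the two main disk array bottlenecks: throughput and IOPS**

1. Throughput

Throughput depends mainly on the array architecture, the Fibre Channel bandwidth (arrays today are generally Fibre Channel arrays; SCSI and SSA arrays are not discussed here), and the number of disks. The architecture differs from array to array. Every array also has an internal bandwidth (comparable to the system bus of a PC), but it is normally designed with ample headroom and is not where the bottleneck lies.

Fibre Channel has a considerable effect. A data warehouse, for example, needs very high data throughput, yet a single 2Gb Fibre Channel card can carry at most 2 Gb/s / 8 = 250 MB/s of real traffic (lowercase b is bits, uppercase B is bytes). It takes 4 such cards to reach 1 GB/s, so a data warehouse environment may want to move to 4Gb cards.

Finally, the disks themselves, which is the most important limit. Once the bottlenecks above are out of the way, it comes down to the number of disks. The throughput that different disk types can sustain is listed below: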

10K rpm    15K rpm    ATA
-------    -------    ------
10 MB/s    13 MB/s    8 MB/s

So for an array with 120 15K rpm Fibre Channel disks, the disks can sustain at most 120 * 13 = 1560 MB/s. With 2Gb cards at 250 MB/s each, that takes about 6 to 7 cards (1560 / 250 is roughly 6.2); with 4Gb cards, 3 to 4 are enough.
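The throughput arithmetic above can be written as a small script. This is only a sketch under the same simplification the text uses (link speed divided by 8 bits per byte, with no protocol overhead); the function names are made up for illustration. Rounding up to whole cards, as the sketch does, gives 7 and 4; the counts in the text are rough estimates.

```shell
#!/bin/sh
# Throughput sizing with the article's simplified figures.
# All function names are hypothetical, for illustration only.

# Usable MB/s of one Fibre Channel card: Gb/s * 1000 / 8, ignoring protocol overhead.
fc_card_mbps() {
    echo $(( $1 * 1000 / 8 ))
}

# Aggregate MB/s the disks can sustain: disk count times per-disk MB/s.
array_disk_mbps() {
    echo $(( $1 * $2 ))
}

# Cards needed to carry a given MB/s, rounded up to whole cards.
cards_needed() {
    total="$1"
    per_card="$2"
    echo $(( (total + per_card - 1) / per_card ))
}

disk_total=$(array_disk_mbps 120 13)
echo "disks: ${disk_total} MB/s"
echo "2Gb cards: $(cards_needed "$disk_total" "$(fc_card_mbps 2)")"
echo "4Gb cards: $(cards_needed "$disk_total" "$(fc_card_mbps 4)")"
```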

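2. IOPS

IOPS is determined mainly by the array's algorithms, the cache hit rate, and the number of disks. The algorithms differ from array to array. On an HDS USP we worked with recently, for example, the IOPS of a single LDEV (LUN) would not go up, possibly because of queue or resource limits on the LDEV. Before using a storage system, it is therefore worth understanding its algorithms and limits.

The cache hit rate depends on the data distribution, the cache size, the data access pattern, and the cache algorithm. A full discussion would get complicated and could easily fill a day. Here I stress only one point: the higher an array's read cache hit rate, the better, because it generally means the array can support more IOPS. Why? That ties in with the per-disk IOPS limit discussed next.

Disk limits: the IOPS that each physical disk can handle is limited, for example: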

10K rpm    15K rpm    ATA
-------    -------    ---
100        150        50

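Likewise, an array with 120 15K rpm Fibre Channel disks can sustain at most 120 * 150 = 18000 IOPS. This is the theoretical limit imposed by the hardware. Beyond it, disk response may become so slow that the array can no longer serve the workload properly.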

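On raid5 and raid10 the read IOPS are the same, but for the same application write IOPS, the IOPS that finally land on the disks differ. What we are evaluating is precisely the disk IOPS: once the disk limit is reached, performance cannot go any higher.

Let us take a case: the application IOPS is 10000, the read cache hit rate is 30%, reads are 60% and writes 40% of the IOPS, and there are 120 disks. We then calculate the IOPS per disk for raid5 and for raid10.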

raid5:

per-disk iops = (10000*(1-0.3)*0.6 + 4 * (10000*0.4))/120

= (4200 + 16000)/120

= 168

Here 10000*(1-0.3)*0.6 is the read IOPS: reads are 60% of the load and, once cache hits are removed, only 4200 IOPS reach the disks.

4 * (10000*0.4) is the write IOPS: every write on raid5 actually causes 4 I/Os (read the old data, read the old parity, write the new data, write the new parity), so the writes amount to 16000 IOPS.

Since the 2 reads that are part of a raid5 write may also hit the cache, a more accurate calculation is:

per-disk iops = (10000*(1-0.3)*0.6 + 2 * (10000*0.4)*(1-0.3) + 2 * (10000*0.4))/120

= (4200 + 5600 + 8000)/120

= 148

The result is 148 IOPS per disk, which is essentially at the limit of a 15K rpm disk.

raid10:

per-disk iops = (10000*(1-0.3)*0.6 + 2 * (10000*0.4))/120

= (4200 + 8000)/120

= 102

As you can see, because a write on raid10 causes only 2 I/Os, under the same load and with the same disks each disk sees only 102 IOPS, well below the disk's IOPS limit.
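The three calculations above can be reproduced with a short script, which also makes it easy to try other hit rates or disk counts. This is a sketch that follows the formulas in the text; the function names are made up for illustration, percentages are passed as whole numbers, and results are rounded to the nearest integer as in the text.

```shell
#!/bin/sh
# Per-disk IOPS estimates following the formulas in the text.
# Arguments: total IOPS, read cache hit %, read share %, number of disks.
# All function names are hypothetical, for illustration only.

round_div() {    # integer division rounded to the nearest integer
    echo $(( ($1 + $2 / 2) / $2 ))
}

disk_reads() {   # application reads that miss the cache and reach the disks
    echo $(( $1 * (100 - $2) * $3 / 10000 ))
}

app_writes() {   # application writes
    echo $(( $1 * (100 - $3) / 100 ))
}

raid5_simple() { # every write costs 4 disk I/Os
    reads=$(disk_reads "$1" "$2" "$3")
    writes=$(app_writes "$1" "$2" "$3")
    round_div $(( reads + 4 * writes )) "$4"
}

raid5_refined() {
    reads=$(disk_reads "$1" "$2" "$3")
    writes=$(app_writes "$1" "$2" "$3")
    # The 2 reads per write may hit the cache; the 2 writes always reach the disks.
    write_reads=$(( 2 * writes * (100 - $2) / 100 ))
    round_div $(( reads + write_reads + 2 * writes )) "$4"
}

raid10_simple() { # every write costs 2 disk I/Os
    reads=$(disk_reads "$1" "$2" "$3")
    writes=$(app_writes "$1" "$2" "$3")
    round_div $(( reads + 2 * writes )) "$4"
}

echo "raid5, 4 I/Os per write: $(raid5_simple 10000 30 60 120)"
echo "raid5, cached write reads: $(raid5_refined 10000 30 60 120)"
echo "raid10: $(raid10_simple 10000 30 60 120)"
```

Comparing the results with the per-disk limits listed earlier (150 IOPS for a 15K rpm disk) shows at a glance whether a given layout leaves any headroom.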

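In one real case, a standby database under heavy recovery load (mostly writes, and small-I/O writes at that) was built on raid5 and performed very poorly. Analysis showed that the IOPS per disk approached 200 at peak times, which made responses extremely slow. After it was converted to raid10, the performance problem went away and the IOPS per disk dropped to about 100.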