Analysis of the IO Pattern of Ceph RBD Clone Volumes
RBD Clone Volume IO Performance Analysis
Gu Zhongyan (谷忠言)

Agenda
- What is a clone write (op_rw)?
- Breakdown of write latency
- Case study on the production cluster cloud.shhp
- How to optimize clone writes / op_rw

Clone writes and op_rw
For any RBD image cloned from a snapshot, every write is tagged inside the cluster as an op_rw. An op_rw is a request that first performs a read (to check whether the child object exists) and then modifies the data and writes it back. For an image cloned from a snapshot, the first write to an object is defined as a copyup: the original snapshot object is first copied from the parent, and the new write is then applied on top of it. The first clone write therefore produces 1 op_r + 1 op_rw.
Note: if the clone volume has been resized, for example grown by 100 GB, writes to the added 100 GB are tagged as op_w.

op_w vs. op_rw
- Non-clone write: op_w; the image has no parent; librbd takes the LIBRBD_AIO_WRITE_FLAT path.
- Clone write: op_rw; the image has a parent; librbd takes the LIBRBD_AIO_WRITE_GUARD path.
The write flow (originally a flow chart) splits into three cases; a toy sketch of this case split is given at the end of this note:
- Case 1: the image has no parent, so the object is written directly (FLAT path).
- Case 2 (no copyup): the image has a parent but the child object already exists, so the object is written directly (GUARD path).
- Case 3 (copyup): the child object does not exist, so the whole object is copied up from the parent, the write is merged into it, and the full object is written to the child image.

op_rw experiment
On an offline cluster, build a copyup workload so that almost every write takes the copyup path: take a snapshot of a large image and clone a child from it, test_iamge_1T_clone. A minimal setup sketch is given at the end of this note.

op_rw experiment data
- fio latency = 177 ms
- RADOS back-end op statistics

Inside op_rw
1. The client builds the write op [stat, set-alloc-hint, write 4KB] and sends it to primary osd.0.
2. osd.0 receives the message; the stat fails.
3. osd.0 replies to the client that the object does not exist.
4. The client locates the parent object on osd.5, builds the read op [sparse-read 4MB], and sends it to osd.5.
5. osd.5 receives the message and reads the 4 MB object.
6. osd.5 returns the data to the client.
7. The client rebuilds the write op as [call rbd.copyup, set-alloc-hint, write 4KB] and sends it to primary osd.0.
8. Primary osd.0 writes its journal.
9. Replica osd.19 writes its journal.
10. Replica osd.27 writes its journal.
11/12. The replica OSDs acknowledge their writes.
13. The primary OSD replies to the client.

Client IO latency breakdown
Roughly 2.5 ms + 44 ms (op_r) + 134 ms (op_rw), out of a total of about 181.5 ms.

op_r latency breakdown
The OSD takes 29 ms to complete the op_r; the actual object size read and sent is 3.9 MB:
2016-04-15 11:35:20.421053 7ffa0a43d700 2 op_tracker client.6528.0:2684 rbd_data.13c26b8b4567.0000000000005957 osd_op [sparse-read 0~3891200] 2.b008a075 duration: msg_throttle 0.018 msg_read 0.029 msg_dispatch 0.055 osd_dispatch 0.099 op_wq_dispatch 0.056 op_prepare.find_object_context 1.052 op_prepare.total 1.645 prepare_transaction 27.25 do_op 29.152
Reading the data from the local file system took 26.7 ms, and there is still a gap of about 15 ms between the 29 ms measured on the OSD and the 44 ms seen by the client.

op_rw latency breakdown
The actual object size received by the OSD is 4194304 B:
2016-04-15 11:35:26.124428 7ff 2 op_tracker client.6528.0:2778 rbd_data.181e6b8b4567.00000000000263a7 osd_op [call rbd.copyup,set-alloc-hint object_size 4194304 write_size 4194304,write 2838528~4096] 2.c98265ff duration: msg_throttle 0.018 msg_read 38.536 msg_dispatch 0.074 osd_dispatch 0.174 op_wq_dispatch 0.095 op_prepare.find_object_context 1.438 op_prepare.total 1.966 prep
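Toy sketch of the write-path case split. This is not librbd code: the image object and its helpers (has_parent, object_exists, read_from_parent, write_full) are hypothetical and exist only to illustrate the three cases above. The real client discovers Case 3 by having its guarded write fail with "no such object" (steps 1-3 of the walkthrough) rather than by an explicit existence check.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # RBD objects in this setup are 4 MB

def clone_aware_write(image, obj_name, offset, data):
    """Toy model of the clone write-path case split; all helpers are hypothetical."""
    if not image.has_parent:
        # Case 1 (LIBRBD_AIO_WRITE_FLAT): plain op_w, write the object directly.
        return image.write(obj_name, offset, data)

    if image.object_exists(obj_name):
        # Case 2 (GUARD, no copyup): the child object was already copied up,
        # so this is again a direct write.
        return image.write(obj_name, offset, data)

    # Case 3 (GUARD + copyup): read the whole object from the parent snapshot,
    # merge the new data into it, and write the full object to the child image.
    # This is the op_r + op_rw pair measured in the experiment.
    buf = bytearray(image.read_from_parent(obj_name, 0, OBJECT_SIZE))
    buf[offset:offset + len(data)] = data
    return image.write_full(obj_name, bytes(buf))
```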
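Minimal sketch of the copyup experiment setup using the Ceph Python bindings (rados/rbd). The pool name 'rbd', parent image name 'test_image_1T', and snapshot name 'snap1' are assumptions for illustration; only the clone name test_iamge_1T_clone comes from the slides, and the large parent image is assumed to already exist and contain data.

```python
import rados
import rbd

# Connect to the cluster and open the pool (pool name is an assumption).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

# Snapshot and protect the pre-existing large parent image.
parent = rbd.Image(ioctx, 'test_image_1T')
parent.create_snap('snap1')
parent.protect_snap('snap1')
parent.close()

# Clone a child image from the protected snapshot.
rbd.RBD().clone(ioctx, 'test_image_1T', 'snap1',
                ioctx, 'test_iamge_1T_clone',
                features=rbd.RBD_FEATURE_LAYERING)

# A small write into the clone: the first write to each 4 MB object of the
# child has no backing child object yet, so it triggers the
# op_r (parent sparse-read) + op_rw (rbd.copyup + write) pair described above.
child = rbd.Image(ioctx, 'test_iamge_1T_clone')
child.write(b'\x00' * 4096, 0)
child.close()

ioctx.close()
cluster.shutdown()
```

In the experiment, fio would issue many such small writes at offsets spread across the clone so that nearly every write lands in an object that has not been copied up yet.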