Learning the Linux Operating System Step by Step: 27 File Systems: File Cache and Reading and Writing Files with read/write

Author: 不如藏拙_487 | Source: Internet | 2023-07-20 15:23

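System call layer and virtual file system layer

Reading and writing a file comes down to calling the read and write system calls, and much of the logic for the two is similar.

Definitions of the read and write system calls in the kernel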

SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
    struct fd f = fdget_pos(fd);
    ......
    loff_t pos = file_pos_read(f.file);
    ret = vfs_read(f.file, buf, count, &pos);
    ......
}

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
        size_t, count)
{
    struct fd f = fdget_pos(fd);
    ......
    loff_t pos = file_pos_read(f.file);
    ret = vfs_write(f.file, buf, count, &pos);
    ......
}

read calls vfs_read->__vfs_read
write calls vfs_write->__vfs_write

The __vfs_read and __vfs_write functions

ssize_t __vfs_read(struct file *file, char __user *buf, size_t count,
        loff_t *pos)
{
    if (file->f_op->read)
        return file->f_op->read(file, buf, count, pos);
    else if (file->f_op->read_iter)
        return new_sync_read(file, buf, count, pos);
    else
        return -EINVAL;
}

ssize_t __vfs_write(struct file *file, const char __user *p, size_t count,
        loff_t *pos)
{
    if (file->f_op->write)
        return file->f_op->write(file, p, count, pos);
    else if (file->f_op->write_iter)
        return new_sync_write(file, p, count, pos);
    else
        return -EINVAL;
}

\linux-4.13.16\fs\read_write.c

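Every open file has a struct file. It contains a pointer f_op to a struct file_operations, which defines the operations that can be performed on the file. __vfs_read calls the read operation in the file_operations of the underlying file system, and __vfs_write calls the corresponding write operation.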
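The dispatch through f_op can be tried out in user space. The following is only a sketch: mock_file, mock_file_operations, mem_read_iter and mock_vfs_read are hypothetical names invented for illustration, and the real kernel path wraps the arguments in a kiocb and an iov_iter (via new_sync_read) before calling read_iter, which this sketch skips.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical, simplified stand-ins for struct file_operations and
 * struct file; the real structures have many more members. */
struct mock_file;

struct mock_file_operations {
    ssize_t (*read)(struct mock_file *f, char *buf, size_t count, off_t *pos);
    ssize_t (*read_iter)(struct mock_file *f, char *buf, size_t count, off_t *pos);
};

struct mock_file {
    const struct mock_file_operations *f_op;
    const char *data;   /* backing bytes of this mock file */
    size_t size;        /* number of valid bytes in data */
};

/* A read_iter-style handler that copies from the in-memory backing data. */
ssize_t mem_read_iter(struct mock_file *f, char *buf, size_t count, off_t *pos)
{
    if (*pos < 0 || (size_t)*pos >= f->size)
        return 0;                           /* end of file */
    size_t left = f->size - (size_t)*pos;
    size_t n = count < left ? count : left;
    memcpy(buf, f->data + *pos, n);
    *pos += (off_t)n;
    return (ssize_t)n;
}

/* Same decision order as __vfs_read: read first, then read_iter,
 * otherwise -EINVAL. */
ssize_t mock_vfs_read(struct mock_file *f, char *buf, size_t count, off_t *pos)
{
    assert(f != NULL && f->f_op != NULL);
    if (f->f_op->read)
        return f->f_op->read(f, buf, count, pos);
    else if (f->f_op->read_iter)
        return f->f_op->read_iter(f, buf, count, pos);
    else
        return -EINVAL;
}
```

A file system "registers" itself here simply by filling in the function pointers; a mock_file whose operations table leaves both pointers NULL makes mock_vfs_read return -EINVAL, mirroring the last branch of __vfs_read.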

ext4 file system layer

Kernel definition of ext4_file_operations

\linux-4.13.16\fs\ext4\file.c

const struct file_operations ext4_file_operations = {
    ......
    .read_iter  = ext4_file_read_iter,
    .write_iter = ext4_file_write_iter,
    ......
};

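Because ext4 sets only read_iter and write_iter, __vfs_read and __vfs_write take the new_sync_read and new_sync_write branch, which ends up in ext4_file_read_iter and ext4_file_write_iter.

ext4_file_read_iter calls generic_file_read_iter, and ext4_file_write_iter calls __generic_file_write_iter.

The generic_file_read_iter and __generic_file_write_iter functions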

\linux-4.13.16\mm\filemap.c

ssize_t generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
{
    ......
    if (iocb->ki_flags & IOCB_DIRECT) {
        ......
        struct address_space *mapping = file->f_mapping;
        ......
        retval = mapping->a_ops->direct_IO(iocb, iter);
    }
    ......
    retval = generic_file_buffered_read(iocb, iter, retval);
}

ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
    ......
    if (iocb->ki_flags & IOCB_DIRECT) {
        ......
        written = generic_file_direct_write(iocb, from);
        ......
    } else {
        ......
        written = generic_perform_write(file, from, iocb->ki_pos);
        ......
    }
}

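generic_file_read_iter and __generic_file_write_iter share the same logic: each has to decide whether to go through the cache. The cache is simply a region of memory.

Buffered I/O and direct I/O

File I/O falls into two types, depending on whether memory is used as a cache.

The first type is buffered I/O

This is the default mode of most file systems.

Read
A read first checks whether the data is already in the cache; if not, it is read from the file system and then cached.
Write
A write copies the data from user space straight into the kernel cache (at which point the write is complete as far as the user program is concerned); the data reaches disk later, either when the operating system decides to write it or when the user calls sync.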
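From user space, the buffered write behaviour described above looks roughly like the sketch below. It uses fsync, the per-file relative of sync, to force the cached data out; the function name buffered_write_then_sync and the /tmp path template are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Writes msg to a fresh temporary file through the page cache, then forces
 * it to storage with fsync(). Returns 0 on success, -1 on any failure.
 * The /tmp path template is an assumption for illustration. */
int buffered_write_then_sync(const char *msg)
{
    assert(msg != NULL);

    char path[] = "/tmp/bufio-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    size_t len = strlen(msg);

    /* write() returns once the data has been copied into the kernel cache;
     * the page is dirty but not necessarily on disk yet. */
    ssize_t n = write(fd, msg, len);
    int ok = (n == (ssize_t)len);

    /* fsync() blocks until the modified data of this file has been
     * transferred to the storage device. */
    if (ok && fsync(fd) != 0)
        ok = 0;

    close(fd);
    unlink(path);   /* remove the temporary file again */
    return ok ? 0 : -1;
}
```

Without the fsync call the function would still report success, because a buffered write counts as finished as soon as the copy into the cache is done.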

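The second type is direct I/O

The application accesses the data on disk directly, bypassing the cache.

Read
If IOCB_DIRECT is set, the direct_IO function of the address_space is called to read straight from disk. address_space is the structure that associates a file with its pages in memory, so the buffered path relies on it as well.
Write
If IOCB_DIRECT is set, generic_file_direct_write is called, which likewise calls the direct_IO function of the address_space and writes the data straight to disk.

For ext4, the address_space operations are defined in ext4_aops, and direct_IO maps to ext4_direct_IO.

ext4_direct_IO eventually calls __blockdev_direct_IO->do_blockdev_direct_IO, which skips the cache layer, passes through the generic block layer and finally reaches the device driver layer.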

Buffered write

The generic_perform_write function

This is the function that performs a buffered write.
\linux-4.13.16\mm\filemap.c

ssize_t generic_perform_write(struct file *file,
        struct iov_iter *i, loff_t pos)
{
    struct address_space *mapping = file->f_mapping;
    const struct address_space_operations *a_ops = mapping->a_ops;

    do {
        struct page *page;
        unsigned long offset;   /* Offset into pagecache page */
        unsigned long bytes;    /* Bytes to write to page */

        status = a_ops->write_begin(file, mapping, pos, bytes, flags,
                &page, &fsdata);
        copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
        flush_dcache_page(page);
        status = a_ops->write_end(file, mapping, pos, bytes, copied,
                page, fsdata);
        pos += copied;
        written += copied;
        balance_dirty_pages_ratelimited(mapping);
    } while (iov_iter_count(i));
}

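The do-while loop finds each page affected by the write and writes to them in turn, performing the following four steps for every page:

- call write_begin of the address_space to make preparations;
- call iov_iter_copy_from_user_atomic to copy the data being written from user space into the kernel page;
- call write_end of the address_space to complete the write;
- call balance_dirty_pages_ratelimited to check whether there are too many dirty pages and some need to be written back to disk. A dirty page is one that has been written to the cache but not yet to disk.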
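The per-page loop can be imitated with a toy page cache in user space. This is a sketch under stated assumptions, not kernel code: demo_page, demo_mapping and demo_perform_write are hypothetical names, the page size is fixed at 4096 bytes, and the journal and dirty-page balancing steps are reduced to comments.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096u    /* assumed page size for this sketch */
#define DEMO_NR_PAGES  8u       /* number of pages in the toy cache */

struct demo_page {
    unsigned char data[DEMO_PAGE_SIZE];
    int dirty;                  /* set once the page holds unwritten data */
};

struct demo_mapping {
    struct demo_page pages[DEMO_NR_PAGES];
};

/* Copies len bytes from src into the toy cache starting at file offset pos,
 * one page at a time. Returns the number of bytes written. */
size_t demo_perform_write(struct demo_mapping *m, const void *src,
                          size_t len, size_t pos)
{
    assert(m != NULL && src != NULL);

    const unsigned char *p = src;
    size_t written = 0;

    while (len > 0) {
        size_t index = pos / DEMO_PAGE_SIZE;     /* which page is affected */
        size_t offset = pos % DEMO_PAGE_SIZE;    /* offset inside that page */
        size_t bytes = DEMO_PAGE_SIZE - offset;  /* room left in the page */

        if (bytes > len)
            bytes = len;
        if (index >= DEMO_NR_PAGES)
            break;                               /* beyond the toy cache */

        /* write_begin would locate or allocate the page here. */
        memcpy(m->pages[index].data + offset, p, bytes);  /* the copy step */
        m->pages[index].dirty = 1;               /* write_end marks it dirty */
        /* balance_dirty_pages_ratelimited would be called here. */

        p += bytes;
        pos += bytes;
        written += bytes;
        len -= bytes;
    }
    return written;
}
```

For example, writing 5000 bytes at offset 4000 touches three pages: 96 bytes at the end of page 0, all of page 1 and 808 bytes at the start of page 2, and all three end up marked dirty.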

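The four steps

Step 1: for ext4, the function called is ext4_write_begin.

ext4 is a journaling file system; the journal was introduced to prevent data loss on sudden power failure.

A file consists of metadata and data, and the journal for each is maintained separately.

There are three modes: journal, ordered and writeback.
- Journal mode: the journal entries for both metadata and data must reach disk before the data itself is written; safe but slow.
- Ordered mode: only metadata is journaled, but the data must reach disk before the metadata journal is written; a compromise.
- Writeback mode: only metadata is journaled, and the data does not have to reach disk first.

In ext4_write_begin, ext4_journal_start does the journal-related work.

ext4_write_begin also does another important thing: it calls grab_cache_page_write_begin to obtain the cache page that should be written to.

The kernel caches in units of pages. The struct address_space reachable from the struct file of an open file ties the file to memory; it holds a radix tree of all cache pages belonging to that file.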

struct address_space {
    struct inode *host;                 /* owner: inode, block_device */
    struct radix_tree_root page_tree;   /* radix tree of all pages */
    spinlock_t tree_lock;               /* and lock protecting it */
    ......
};

Step 2: call iov_iter_copy_from_user_atomic.

\linux-4.13.16\lib\iov_iter.c

size_t iov_iter_copy_from_user_atomic(struct page *page,
        struct iov_iter *i, unsigned long offset, size_t bytes)
{
    char *kaddr = kmap_atomic(page), *p = kaddr + offset;

    iterate_all_kinds(i, bytes, v,
        copyin((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len),
        memcpy_from_page((p += v.bv_len) - v.bv_len, v.bv_page,
                 v.bv_offset, v.bv_len),
        memcpy((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
    )
    kunmap_atomic(kaddr);
    return bytes;
}

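This function does three things:
- it maps the allocated page to a kernel virtual address with kmap_atomic;
- it copies the user-space data to that virtual address of the kernel page;
- it removes the mapping with kunmap_atomic.

Step 3: call ext4_write_end to complete the write.

- ext4_journal_stop finishes writing the journal.
- The modified cache is marked dirty through the call chain block_write_end->__block_commit_write->mark_buffer_dirty.
- Nothing is actually written to disk at this point; the data is only written to the cache and marked dirty.
- Writing the pages out to disk for real is called write back.

Step 4: call balance_dirty_pages_ratelimited to write back dirty pages.

The balance_dirty_pages_ratelimited function
\linux-4.13.16\mm\page-writeback.c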

/**
 * balance_dirty_pages_ratelimited - balance dirty memory state
 * @mapping: address_space which was dirtied
 *
 * Processes which are dirtying memory should call in here once for each page
 * which was newly dirtied. The function will periodically check the system's
 * dirty state and will initiate writeback if needed.
 */
void balance_dirty_pages_ratelimited(struct address_space *mapping)
{
    struct inode *inode = mapping->host;
    struct backing_dev_info *bdi = inode_to_bdi(inode);
    struct bdi_writeback *wb = NULL;
    int ratelimit;
    ......
    if (unlikely(current->nr_dirtied >= ratelimit))
        balance_dirty_pages(mapping, wb, current->nr_dirtied);
    ......
}

If the dirty pages are found to exceed the limit, balance_dirty_pages->wb_start_background_writeback is called to wake up the flusher thread, which performs the writeback.

\linux-4.13.16\fs\fs-writeback.c

void wb_start_background_writeback(struct bdi_writeback *wb)
{
    /*
     * We just wake up the flusher thread. It will perform background
     * writeback as soon as there is no other work to do.
     */
    wb_wakeup(wb);
}

static void wb_wakeup(struct bdi_writeback *wb)
{
    spin_lock_bh(&wb->work_lock);
    if (test_bit(WB_registered, &wb->state))
        mod_delayed_work(bdi_wq, &wb->dwork, 0);
    spin_unlock_bh(&wb->work_lock);
}

......
            (_tflags) | TIMER_IRQSAFE);     \
    } while (0)

/* bdi_wq serves all asynchronous writeback tasks */
struct workqueue_struct *bdi_wq;

/**
 * mod_delayed_work - modify delay of or queue a delayed work
 * @wq: workqueue to use
 * @dwork: work to queue
 * @delay: number of jiffies to wait before queueing
 *
 * mod_delayed_work_on() on local CPU.
 */
static inline bool mod_delayed_work(struct workqueue_struct *wq,
        struct delayed_work *dwork,
        unsigned long delay)
{
    ....
}

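The writeback task is a delayed_work queued on bdi_wq; with delay set to 0 it runs immediately.

bdi stands for backing device info and describes the block device. When the block device is initialized a timer is set up as well, and when it expires the writeback function runs.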
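The rate limiting in balance_dirty_pages_ratelimited (compare current->nr_dirtied against ratelimit, and only then do the expensive balancing) can be sketched in user space as follows. demo_task and demo_balance_dirty_pages_ratelimited are hypothetical names; as a simplification the sketch increments the counter and resets it inside the same function, whereas the kernel does this accounting in other places.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-task dirty accounting. */
struct demo_task {
    int nr_dirtied;       /* pages dirtied since the last balancing run */
    int balance_calls;    /* how many times the expensive check ran */
};

/* Called once per newly dirtied page. Runs the expensive balancing step
 * only when the task has dirtied at least ratelimit pages.
 * Returns 1 if balancing ran on this call, 0 otherwise. */
int demo_balance_dirty_pages_ratelimited(struct demo_task *t, int ratelimit)
{
    assert(t != NULL && ratelimit > 0);

    t->nr_dirtied++;                 /* simplification: count here */
    if (t->nr_dirtied >= ratelimit) {
        t->balance_calls++;          /* stands in for balance_dirty_pages() */
        t->nr_dirtied = 0;           /* start counting afresh */
        return 1;
    }
    return 0;
}
```

With a ratelimit of 32, dirtying 100 pages runs the balancing step three times (at pages 32, 64 and 96) and leaves a count of 4, so most calls cost only a counter comparison.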

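Other situations that trigger writeback:

- the user calls sync, which eventually calls wakeup_flusher_threads to flush the dirty pages;
- memory is so tight that no page can be allocated; free_more_memory is called, which eventually calls wakeup_flusher_threads to release dirty pages;
- a dirty page has been dirty for longer than the timer interval and is written back in time.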

Buffered read

The corresponding function is generic_file_buffered_read.

static ssize_t generic_file_buffered_read(struct kiocb *iocb,
        struct iov_iter *iter, ssize_t written)
{
    struct file *filp = iocb->ki_filp;
    struct address_space *mapping = filp->f_mapping;
    struct inode *inode = mapping->host;

    for (;;) {
        struct page *page;
        pgoff_t end_index;
        loff_t isize;

        page = find_get_page(mapping, index);
        if (!page) {
            if (iocb->ki_flags & IOCB_NOWAIT)
                goto would_block;
            page_cache_sync_readahead(mapping,
                    ra, filp,
                    index, last_index - index);
            page = find_get_page(mapping, index);
            if (unlikely(page == NULL))
                goto no_cached_page;
        }
        if (PageReadahead(page)) {
            page_cache_async_readahead(mapping,
                    ra, filp, page,
                    index, last_index - index);
        }
        /*
         * Ok, we have the page, and it's up-to-date, so
         * now we can copy it to user space...
         */
        ret = copy_page_to_iter(page, offset, nr, iter);
    }
}

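generic_file_buffered_read checks the page cache for a cached page:

- if there is none, it reads from the file system with readahead, caches the result, and looks the page up again;
- if there is one, it still checks whether readahead is needed and, if so, calls page_cache_async_readahead;
- finally it calls copy_page_to_iter to copy from the kernel to user space.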
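The lookup, fill on miss and copy sequence above can be imitated with a toy cache in user space. This is only a sketch: rd_cache and rd_buffered_read are hypothetical names, the page size is assumed to be 4096 bytes, the "disk" is an in-memory array, and readahead is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RD_PAGE_SIZE 4096u    /* assumed page size */
#define RD_NR_PAGES  4u       /* number of pages in the toy file */

struct rd_cache {
    const unsigned char *disk;                      /* simulated backing store */
    unsigned char pages[RD_NR_PAGES][RD_PAGE_SIZE]; /* cached copies */
    int present[RD_NR_PAGES];                       /* 1 if page is cached */
    int disk_reads;                                 /* pages fetched from disk */
};

/* Reads len bytes at file offset pos into dst, filling the cache on a miss.
 * Returns the number of bytes copied. */
size_t rd_buffered_read(struct rd_cache *c, void *dst, size_t len, size_t pos)
{
    assert(c != NULL && dst != NULL);

    unsigned char *out = dst;
    size_t done = 0;

    while (len > 0 && pos < RD_NR_PAGES * RD_PAGE_SIZE) {
        size_t index = pos / RD_PAGE_SIZE;     /* page holding this offset */
        size_t offset = pos % RD_PAGE_SIZE;    /* offset inside the page */
        size_t nr = RD_PAGE_SIZE - offset;     /* bytes available in page */

        if (nr > len)
            nr = len;

        if (!c->present[index]) {              /* cache miss: go to "disk" */
            memcpy(c->pages[index], c->disk + index * RD_PAGE_SIZE,
                   RD_PAGE_SIZE);
            c->present[index] = 1;
            c->disk_reads++;
        }
        memcpy(out, c->pages[index] + offset, nr);  /* copy to the caller */

        out += nr;
        pos += nr;
        done += nr;
        len -= nr;
    }
    return done;
}
```

Reading the same range twice shows the effect of the cache: the first call increments disk_reads once per page touched, while the second call is served from the cached pages and leaves the counter unchanged.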

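Summary

Figure from 趣谈Linux操作系统 (Geek Time)

System call layer
read and write
VFS layer
vfs_read and vfs_write, which call into file_operations
ext4 layer
ext4_file_read_iter and ext4_file_write_iter are called.
Buffered I/O and direct I/O
- Direct I/O
  Reads and writes follow the same path: ext4_direct_IO is called, which then goes down to the block device layer
- Buffered I/O
  Reads and writes follow different paths
  - Read
    Data is read from the block device into the cache, then copied from the cache to user space
  - Write
    Data is copied from user space into the cache and the cache page is marked dirty; a background thread is later woken to write it to the block device

Some function names may differ between kernel versions.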

References:

趣谈Linux操作系统 (Geek Time), link:
http://gk.link/a/10iXZ
Everyone is welcome to join the discussion and learn together.
