
Non-inclusive cache method using pipelined snoop bus



A non-inclusive cache system includes an external cache and a plurality of on-chip caches each having a set of tags associated therewith, with at least one of the on-chip caches including data which is absent from the external cache. A pipelined snoop bus is ported to each of the set of tags of the plurality of on-chip caches and transmits a snoop address to the plurality of on-chip caches. A system interface unit is responsive to a received snoop request to scan the external cache and to apply the snoop address of the snoop request to the pipelined snoop bus. A plurality of response signal lines respectively extend from the plurality of on-chip caches to the system interface unit, each of the signal lines for transmitting a snoop response from a corresponding one of the on-board caches to the system interface unit. The set of tags can be implemented by dual-porting the cache tags, or by providing a duplicate and dedicated set of snoop tags.


BACKGROUND OF THE INVENTION


1. Field of the Invention


The present invention generally relates to microprocessor architectures, and more particularly, the present invention relates to a method using a pipelined snoop bus for maintaining coherence among caches in a multiprocessor configuration.


2. Description of the Related Art


In multiprocessor systems, processor cache memories often maintain multiple copies of a same data object. When one processor alters one copy of the data object, it is necessary to somehow update or invalidate all other copies of the object which may appear elsewhere in the multiprocessor system. Thus, to insure coherence among multiple copies, every valid write to one copy of an object must update or invalidate every other copy of the object.


Consider, for example, the conventional multiprocessor configuration shown in FIG. 1. To the left of the vertical dashed line sits the CPU chip 102, and located to the right of the dashed line are external (EXT.) components 104. Reference numeral 106 denotes an external cache (e-cache) which is visible to all processors and which interfaces with a main memory (not shown). Access to and from the main memory can only occur through the e-cache 106.




The CPU chip 102 contains multiple processors which share the main memory via the common memory bus (not shown) and the e-cache 106. When one processor is granted exclusive use of a data object, the object is placed in the external cache 106 and used on the CPU chip 102 until it is taken away or evicted from the e-cache 106. Illustrated within the CPU chip 102 are the on-board caches 108 and 110 associated with one processor. Cache 108 is a data cache (d-cache) for storing data as it is passed back and forth from the execution units of the processor, and cache 110 is an instruction cache (i-cache) holding instructions prior to execution by the processor's execution units.


Reference numeral 112 denotes an interface unit. When a processor desires exclusive use of an object from main memory, the corresponding interface unit 112 issues a snoop request. Snooping protocols are generally designed so that all memory access requests are observed by each cache. In the event of a coherent write, each cache is responsive to the snoop request to scan its directory to identify any copies of the object which may require invalidation or updating. However, to avoid searching every cache directory upon the occurrence of every coherent write, conventional systems adopt an "inclusive" approach to cache coherency.


The basic principle underlying cache coherency schemes is that when one processor is granted exclusive use of a data object, all other processors invalidate that data in their own memories. In the conventional inclusive cache coherency structure, the e-cache includes the data existing in all the other caches on the chip. That is to say, any data that exists in the on-board caches of the chip must exist in the e-cache as well. If a data object is evicted from the e-cache or snooped out of the e-cache, it is removed from all the on-chip caches.


As such, referring to the flowchart of FIG. 2, when a snoop comes in from some other processor (step 202), the system interface unit 112 looks to the e-cache first to scan its contents (step 204), and if the data object is not there (NO at step 206), snooping is complete since the data object cannot exist on the on-board caches of the chip. Again, this is because every time something is evicted from the e-cache, it is invalidated on each of the on-board caches. If the data is found in the e-cache (YES at step 206), then the interface unit 112 sends out a signal to invalidate the data as it exists on the on-board caches.
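
This filtering flow can be summarized in a short software sketch. The following is a minimal C model of steps 202-206, not the patented hardware; the helper names (ecache_lookup, invalidate_on_chip) and the toy lookup predicate are hypothetical stand-ins for the e-cache tag scan and the on-chip invalidate broadcast.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t addr_t;

    /* Hypothetical stand-ins for the hardware of FIG. 1; a real design
     * would index a tag RAM rather than use this toy predicate.        */
    static bool ecache_lookup(addr_t a) { return (a & 1) != 0; }
    static void invalidate_on_chip(addr_t a)
    {
        printf("invalidate %#llx in on-board caches\n",
               (unsigned long long)a);
    }

    /* Conventional inclusive snooping (FIG. 2): the e-cache filters
     * the snoops applied to the on-board caches.                     */
    static void handle_snoop_inclusive(addr_t snoop_addr)  /* step 202 */
    {
        if (!ecache_lookup(snoop_addr)) {   /* step 204; NO at step 206 */
            /* Inclusion guarantees the object cannot be in any
             * on-board cache, so snoop processing is complete.       */
            return;
        }
        invalidate_on_chip(snoop_addr);     /* YES at step 206 */
    }

    int main(void)
    {
        handle_snoop_inclusive(0x1000);  /* e-cache miss: filtered out   */
        handle_snoop_inclusive(0x1001);  /* e-cache hit: invalidate chip */
        return 0;
    }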




Since snoop processing is complete when the data is not found in the e-cache, the conventional technique of looking first to the e-cache for the data has the effect of filtering the snoop requests applied to the on-board cache memories of the processors. This in turn reduces the average bandwidth of the on-board snoop processing.


However, the conventional scheme does suffer drawbacks. For example, each time a data object is evicted from the e-cache, it must be invalidated in each of the on-board memories to preserve the inclusiveness of the configuration. If the e-cache is a large direct-mapped cache and something is evicted, it must be evicted (invalidated) in all the lower-level caches as well, even when this is not necessary. This often results in inefficiencies, since the e-cache might have collisions which are not present in the on-board caches. This ultimately results in a reduction in the cache hit rate.


Further, it is always possible for a series of consecutive snoop requests to hit the e-cache, each requiring invalidations in the on-board memories, and thus the chip must support this "peak" bandwidth. The filtering is therefore of limited value, since over any given stretch of time it may be necessary to carry out on-board snoop processing at full bandwidth.


SUMMARY OF THE INVENTION


It is an object of the present invention to overcome or at least minimize the drawbacks associated with the conventional snooping scheme described above.


It is a further object of the present invention to provide a snoop process for ensuring cache coherency without use of an inclusive e-cache arrangement in which all data found in the on-chip caches must be present in the e-cache as well.


According to one aspect of the invention, a non-inclusive cache method is provided for a processor system having an external cache and a plurality of on-chip caches, including: including data in at least one of the on-chip caches which is absent from the external cache; scanning the external cache and applying a snoop address of a snoop request to a pipelined snoop bus in response to receipt of the snoop request; and transmitting the snoop address to the plurality of on-chip caches via the pipelined snoop bus which is ported to each of a set of tags associated with the plurality of on-chip caches.


According to another aspect of the invention, the method further includes transmitting a snoop response from a corresponding one of the on-chip caches to a system interface unit via a plurality of response signal lines respectively extending from the plurality of on-chip caches to the system interface unit.


According to yet another aspect of the invention, the same number of clock cycles is expended between transmission of the snoop address on the pipelined snoop bus and receipt of the snoop response from each of the on-chip caches.


According to still another aspect of the invention, the snoop request is either one of two types, a first type being a request to invalidate a data object contained in any of the on-chip caches, and a second type being a request to check for the presence of the data object in any of the on-chip caches.


According to another aspect of the invention, the set of tags includes a set of cache tags and a dedicated set of snoop tags duplicating the set of cache tags, and the pipelined snoop bus is ported to each of the dedicated set of snoop tags.


According to still another aspect of the invention, the set of tags includes a set of dual-ported cache tags.


BRIEF DESCRIPTION OF THE DRAWINGS


The above and other objects and advantages of the present invention will become readily apparent to those skilled in the art from the description that follows, with reference to the accompanying drawings, in which:


FIG. 1 is a block diagram for explaining the conventional inclusive-type cache coherency configuration;


FIG. 2 is a simplified flowchart for explaining the snooping protocol of the configuration shown in FIG. 1;


FIG. 3 is a block diagram for explaining the non-inclusive cache coherency configuration of the present invention; and,


FIG. 4 is a simplified flowchart for explaining the snooping protocol of the configuration shown in FIG. 3.


DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS


The present invention, an exemplary embodiment of which is shown in FIG. 3, presents a non-inclusive implementation of a cache coherency scheme. That is, there is no requirement that all data contained in the on-chip caches also be present in the off-chip e-cache.


Referring to FIG. 3, to the left of the vertical dashed line sits the CPU chip 302, and located to the right of the dashed line are external (EXT.) components 304. Reference numeral 306 denotes an external cache (e-cache) which is visible to all processors and which interfaces with a main memory (not shown). Access to and from the main memory can only occur through the e-cache 306.




The CPU chip 302 contains multiple processors which share the main memory via the common memory bus (not shown) and the e-cache 306. When one processor is granted exclusive use of a data object, the object is placed in the external cache 306 and used on the CPU chip 302 until it is taken away or evicted from the e-cache 306. Illustrated within the CPU chip 302 are the on-board caches 308 and 310 associated with one processor. Cache 308 is a data cache (d-cache) for storing data as it is passed back and forth from the execution units of the processor, and cache 310 is an instruction cache (i-cache) holding instructions prior to execution by the processor's execution units. Each processor may have other types of caches as well.


Each cache includes special snoop tags 308a and 310a which effectively duplicate the corresponding cache tags. Alternatively, the cache tags themselves may be dual-ported to provide a dedicated set of tag ports for snooping. In either case, the arrangement should support full snoop bandwidth.
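
As a rough illustration of the duplicated-tag arrangement, consider the following C sketch. The layout and names (onchip_cache, fill_line, NUM_SETS) are hypothetical, since the patent does not specify a data organization; the point is that the snoop pipeline reads its own copy of the tags and never contends with the processor for a tag port.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_SETS 256   /* illustrative size; not given in the patent */

    typedef uint64_t tag_t;

    /* A direct-mapped on-chip cache with duplicated snoop tags:
     * snoop_tags mirrors cache_tags so the pipelined snoop bus has a
     * dedicated read path into each cache.                            */
    struct onchip_cache {
        tag_t cache_tags[NUM_SETS];  /* read/written by the processor   */
        bool  valid[NUM_SETS];
        tag_t snoop_tags[NUM_SETS];  /* read by the snoop pipeline only */
    };

    /* Every fill updates both copies so the duplicates stay consistent;
     * an eviction would likewise clear the valid bit seen by both.    */
    static void fill_line(struct onchip_cache *c, unsigned set, tag_t tag)
    {
        c->cache_tags[set] = tag;
        c->snoop_tags[set] = tag;
        c->valid[set]      = true;
    }

Dual-porting the cache tags would achieve the same effect with a single tag array and two physical read ports instead of two arrays.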


Reference numeral 312 denotes a system interface unit (SIU). When a processor desires exclusive use of an object from main memory, the corresponding interface unit 312 issues a snoop request. Reference numeral 314 is a dedicated pipelined snoop bus which transmits the snoops in a pipeline, and reference numerals 316 and 318 are dedicated snoop response lines.


The operation of the invention will now be described with reference to the flowchart of FIG. 4, as well as the block diagram of FIG. 3.




A snoop arrives at the system interface unit 312 (step 402), and, in the normal manner, the tags of the e-cache 306 are checked right away (step 404). A response is returned from the e-cache 306 indicating whether one of the e-cache tags matches the snoop address. Even in the event that the object is not found in the e-cache 306, it is still necessary to examine the on-chip caches. Again, the cache coherency system of the invention is non-inclusive, meaning that data may exist in an on-chip cache and not be present in the e-cache 306.


Thus, in addition to checking the e-cache in steps 404 and 406, the SIU must check the on-chip caches as well. This is illustrated by steps 408-422, which run in parallel with steps 404 and 406, as sketched below.
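
A sketch of this parallel flow, building on the FIG. 2 model above (again a hypothetical software model of steps 402-422, not the actual circuit):

    /* Software model of FIG. 4; reuses addr_t and ecache_lookup() from
     * the FIG. 2 sketch above.  The bus-driver helper is hypothetical. */

    static void snoop_bus_send(addr_t a)
    {
        (void)a;  /* models driving the address onto pipelined bus 314 */
    }

    /* Unlike FIG. 2, an e-cache miss does NOT terminate the snoop: the
     * address goes out on the snoop bus regardless of the e-cache
     * result, because the caches are non-inclusive.                   */
    static void handle_snoop_noninclusive(addr_t snoop_addr) /* step 402 */
    {
        bool in_ecache = ecache_lookup(snoop_addr);   /* steps 404-406 */

        /* Steps 408-422, logically in parallel with the e-cache check:
         * every on-chip cache compares the address to its snoop tags. */
        snoop_bus_send(snoop_addr);

        (void)in_ecache;  /* both results feed the SIU's overall handling */
    }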


Initially, the SIU 312 applies the snoop to the pipelined snoop bus 314 (step 408), and the snoop address is checked against the special dedicated snoop tags of each cache (step 410). It is noted that there is no filtering of snoops in the present invention, and thus the snoop tags are designed to operate at full bandwidth.


In one embodiment of the invention, there are two types of snoop requests. The first simply invalidates the object, if it is present, and requires no response back. The second (called a "shared response" herein) checks for the presence of an object and thus requires a response back. In the case of the former (YES at step 412), the object in the cache is invalidated (step 416) if it exists, i.e., if there is a match between the snoop address and a cache tag (YES at step 414). In the case of the latter (NO at step 412), responses are sent back to the SIU 312 on lines 316 and 318 indicating that the data object is located in the on-chip caches (step 422) or is not located in the on-chip caches (step 420). The lines 316 and 318 are preferably one bit wide, indicating the presence or absence of the snooped data object in each cache.
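
The per-cache handling of the two request types might be modeled as follows, reusing the hypothetical onchip_cache structure from the earlier sketch; the enum and function names are illustrative, not from the patent.

    /* Reuses tag_t and struct onchip_cache from the earlier sketch. */

    enum snoop_type {
        SNOOP_INVALIDATE,  /* first type: invalidate if present, no reply  */
        SNOOP_SHARED       /* second type: "shared response", reply needed */
    };

    /* Per-cache snoop handling (steps 410-422).  The return value
     * models the one-bit response driven back to the SIU on lines
     * 316/318; it is meaningful only for shared-response snoops.     */
    static bool cache_snoop(struct onchip_cache *c, unsigned set,
                            tag_t tag, enum snoop_type type)
    {
        bool hit = c->valid[set] && c->snoop_tags[set] == tag; /* step 410 */

        if (type == SNOOP_INVALIDATE) {      /* YES at step 412 */
            if (hit)                         /* YES at step 414 */
                c->valid[set] = false;       /* step 416: drop the line */
            return false;                    /* no response required    */
        }
        /* NO at step 412: shared-response snoop; report presence
         * (step 422) or absence (step 420) of the object.          */
        return hit;
    }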


According to the invention, the snoop bus 314 is pipelined and goes to its own dedicated port of tags. Moreover, the snoops do not arrive at the tags in one cycle. Rather, in the described embodiment, it takes two cycles to reach the on-chip tags, up to three cycles to go through them, and another two cycles to return to the SIU 312. Reference numeral 320 is a delay which is representative of the lengthening of the loops. Since it is generally not possible to get all the way across the chip and back in one cycle, the loops are designed to have the same operational length, i.e., a fixed number of cycles. In this manner, the SIU 312 knows the timing (number of cycles) of the response back from the caches. Moreover, the pipelined bus is part of the implementation that allows the system to operate at its clock frequency.
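
The fixed round trip reduces to simple arithmetic. A sketch, assuming the cycle counts above and that delay 320 pads each cache's loop to the worst-case length:

    /* Cycle counts from the described embodiment: 2 cycles to reach the
     * on-chip tags, up to 3 cycles through them, 2 cycles back to the
     * SIU.  Delay 320 pads shorter loops so every round trip equals the
     * worst case, giving the SIU a fixed response schedule.            */
    enum {
        CYCLES_TO_TAGS      = 2,
        CYCLES_THROUGH_TAGS = 3,   /* worst case */
        CYCLES_BACK_TO_SIU  = 2,
        SNOOP_LATENCY       = CYCLES_TO_TAGS
                            + CYCLES_THROUGH_TAGS
                            + CYCLES_BACK_TO_SIU   /* = 7 cycles, fixed */
    };

    /* Because the bus is pipelined, a new snoop can issue every cycle:
     * a snoop sent on cycle t is answered on cycle t + SNOOP_LATENCY. */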


Thus, according to the invention, the snoop bus and duplicated snoop tag RAMs are fully pipelined. All snoops (invalidate and shared) are handled by all the on-chip caches since such caches may contain data not found in the e-cache. Also, shared responses are of a fixed latency to the snoop originator or system interface unit.

Source: https://www.google.com.hk/patents/US6061766





