What is the fastest (parallel?) way to find a substring in a very long string using bitwise operators?
e.g. find all positions of the "GCAGCTGAAAACA" sequence in the human genome http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/hg18.2bit (770MB)
*the alphabet consists of 4 symbols ('G','C','T','A') represented using 2 bits: 'G':00, 'A':01, 'T':10, 'C':11
*you can assume the query string (the shorter one) is fixed in length, e.g. 127 characters
*by fastest I mean not including any pre-processing/indexing time
*the file is going to be loaded into memory after pre-processing; basically there will be billions of short strings to be searched for in one larger string, all in-memory.
*bitwise because I'm looking for the simplest, fastest way to search for a bit pattern in a large bit array and stay as close as possible to the silicon.
*KMP wouldn't work well as the alphabet is small
*C code or x86 machine code would both be interesting.
Input format description (.2bit): http://jcomeau.freeshell.org/www/genome/2bitformat.html
Related:
Fastest way to scan for bit pattern in a stream of bits
Algorithm help! Fast algorithm in searching for a string with its partner
http://www.arstdesign.com/articles/fastsearch.html
http://en.wikipedia.org/wiki/Bitap_algorithm
5
If you're just looking through a file, you're pretty much guaranteed to be I/O-bound. A large buffer (~16K) and strstr() should be all you need. If the file is encoded in ASCII, search just for "gcagctgaaaaca". If it actually is encoded in bits, just permute the possible accepted strings (there should be ~8; lop off the first byte) and use memmem() plus a tiny overlapping-bit check.
I'll note here that glibc strstr and memmem already use Knuth-Morris-Pratt to search in linear time, so test that performance. It may surprise you.
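As a rough sketch of the bit-packed variant, assuming the question's 2-bit packing (G=00, A=01, T=10, C=11) with four bases per byte, most significant bits first (the bit order is an assumption, and pack_shifted() is an illustrative helper, not a real API), one byte-aligned pattern can be built per 2-bit offset and fed to memmem():

```c
#define _GNU_SOURCE   /* for memmem() on glibc */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

static uint8_t code(char c) {               /* 2-bit code per base */
    switch (c) { case 'G': return 0; case 'A': return 1;
                 case 'T': return 2; default:  return 3; } /* 'C' */
}

/* Pack the pattern starting at bit offset `shift` (0, 2, 4 or 6) and
   keep only the *full* bytes ("lop off the first byte"), so the
   result can be fed straight to memmem(). Returns the byte count. */
static size_t pack_shifted(const char *pat, int shift, uint8_t *out) {
    uint8_t buf[64] = {0};
    size_t nbits = strlen(pat) * 2, bit = shift;
    for (size_t i = 0; pat[i]; i++, bit += 2)
        buf[bit / 8] |= code(pat[i]) << (6 - bit % 8);
    size_t first = (shift == 0) ? 0 : 1;    /* drop partial head byte */
    size_t last = (shift + nbits) / 8;      /* full bytes only        */
    memcpy(out, buf + first, last - first);
    return last - first;
}

int main(void) {
    const char *pat = "GCAGCTGAAAACA";
    uint8_t genome[] = { /* ... packed target bytes ... */ 0 };
    for (int shift = 0; shift < 8; shift += 2) {
        uint8_t key[64];
        size_t klen = pack_shifted(pat, shift, key);
        uint8_t *hit = memmem(genome, sizeof genome, key, klen);
        /* a hit is only a candidate: the lopped-off head/tail bits
           still need the "tiny overlapping bit check" by hand */
        if (hit)
            printf("candidate at byte %td (shift %d)\n",
                   hit - genome, shift);
    }
    return 0;
}
```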
3
If you first encode/compress the DNA string with a lossless coding method (e.g. Huffman, exponential Golomb, etc.), then you get a ranked probability table ("coding tree") for DNA tokens made of various combinations of nucleotides (e.g., A, AA, CA, etc.).
What this means is that, once you compress your DNA, the encoded form of GCAGCTGAAAACA (and other subsequences) will likely occupy fewer bits than the "unencoded" approach of always using two bits per base. As for a parallelized approach, split the encoded target string up into N chunks and run the search algorithm on each chunk, using the shortened, encoded search string. By keeping track of the bit offsets of each chunk, you should be able to generate match positions.
Overall, this compression approach would be useful if you plan on doing millions of searches on sequence data that won't change. You'd be searching fewer bits — potentially many fewer, in aggregate.
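For concreteness, here is a minimal sketch of that chunked scan using pthreads, with plain byte strings for simplicity; NTHREADS and the globals are illustrative, and the same begin/end bookkeeping applies to an encoded bit string:

```c
#define _GNU_SOURCE   /* for memmem() on glibc */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define NTHREADS 4

static const char *g_target;   /* the big in-memory string    */
static size_t g_tlen;
static const char *g_pat;      /* the (encoded) search string */
static size_t g_plen;

static void *scan_chunk(void *arg) {
    size_t id = (size_t)arg, chunk = g_tlen / NTHREADS;
    size_t begin = id * chunk;
    /* overlap each chunk by plen-1 bytes so matches straddling a
       boundary are not missed; the last thread takes the remainder */
    size_t end = (id == NTHREADS - 1) ? g_tlen
                                      : begin + chunk + g_plen - 1;
    for (const char *p = g_target + begin;
         (p = memmem(p, end - (p - g_target), g_pat, g_plen)); p++)
        printf("match at offset %zu\n", (size_t)(p - g_target));
    return NULL;
}

int main(void) {
    g_target = "...load the target here...";
    g_tlen = strlen(g_target);
    g_pat = "GCAGCTGAAAACA";
    g_plen = strlen(g_pat);
    pthread_t tid[NTHREADS];
    for (size_t i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, scan_chunk, (void *)i);
    for (size_t i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```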
2
Boyer-Moore is a technique used to search for substrings in plain strings. The basic idea is that if your substring is, say, 10 characters long, you can look at the character at position 9 in the string being searched. If that character is not part of your search string, you can simply start the search after that character. (If that character is, indeed, in your string, the Boyer-Moore algorithm uses a look-up table to skip the optimal number of characters forward.)
It might be possible to reuse this idea for your packed representation of the genome string. After all, there are only 256 different bytes, so you could safely pre-calculate the skip-table.
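A minimal sketch of that idea, using the Horspool simplification of Boyer-Moore over the packed bytes (the pattern here would be one byte-aligned encoding of the query, as in the first answer; the function name is illustrative):

```c
#include <stddef.h>
#include <string.h>
#include <stdint.h>

/* Horspool search: the 256-entry bad-byte table is precalculated
   from the pattern, then the window skips ahead by the table entry
   for the last byte of the current window. */
static const uint8_t *horspool(const uint8_t *hay, size_t hlen,
                               const uint8_t *pat, size_t plen) {
    size_t skip[256];
    for (size_t i = 0; i < 256; i++)
        skip[i] = plen;                     /* byte not in pattern  */
    for (size_t i = 0; i + 1 < plen; i++)
        skip[pat[i]] = plen - 1 - i;        /* distance to last use */
    for (size_t pos = 0; pos + plen <= hlen;
         pos += skip[hay[pos + plen - 1]])
        if (memcmp(hay + pos, pat, plen) == 0)
            return hay + pos;
    return NULL;
}
```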
1
The benefit of encoding the alphabet into bit fields is compactness: one byte holds the equivalent of four characters. This is similar to some of the optimizations Google performs when searching for words.
This suggests four parallel executions, each with the (transformed) search string offset by one character (two bits). A quick-and-dirty approach might be to just look for the first or second byte of the search string and then check extra bytes before and after matching the rest of the string, masking off the ends if necessary. The first search is handily done by the x86 instruction scasb. Subsequent byte matches can build upon the register values with cmpb.
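A hedged C sketch of that quick-and-dirty approach, with memchr() standing in for scasb; the shifted key bytes and the head/tail mask values are assumed to be precomputed for each 2-bit offset (they are not derived here):

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>

/* Scan for the first full byte of one shifted encoding, then verify
   the remaining full bytes plus the masked partial bytes on either
   side. Returns the candidate's byte offset, or -1. */
static ptrdiff_t scan_shift(const uint8_t *hay, size_t hlen,
                            const uint8_t *key, size_t klen,
                            uint8_t head, uint8_t head_mask,
                            uint8_t tail, uint8_t tail_mask) {
    if (hlen < klen + 2)
        return -1;                       /* no room for head/tail */
    const uint8_t *p = hay + 1;          /* leave room for head   */
    while ((p = memchr(p, key[0], hlen - (p - hay) - klen))) {
        if (memcmp(p, key, klen) == 0 &&
            (p[-1]   & head_mask) == head &&  /* masked head byte */
            (p[klen] & tail_mask) == tail)    /* masked tail byte */
            return p - hay;
        p++;
    }
    return -1;
}
```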
1
You could create a state machine. In this topic, Fast algorithm to extract thousands of simple patterns out of large amounts of text, I used [f]lex to create the state machine for me. It would require some hackery to use the 4-letter (:= two-bit) alphabet, but it can be done using the same tables as generated by [f]lex. (You could even create your own fgetc()-like function which extracts two bits at a time from the input stream and keeps the other six bits for consecutive calls. Pushback will be a bit harder, but not undoable.)
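A sketch of that fgetc()-like reader (the name and the most-significant-pair-first order are assumptions; the static state also means it handles one stream at a time):

```c
#include <stdio.h>

/* Returns the next 2-bit symbol (0..3) from the stream, or EOF.
   One byte is pulled at a time; the leftover bits are kept for the
   next three calls. */
static int fget2bits(FILE *f) {
    static int buf, have;
    if (have == 0) {
        if ((buf = fgetc(f)) == EOF)
            return EOF;
        have = 8;
    }
    have -= 2;
    return (buf >> have) & 3;   /* most significant pair first */
}
```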
BTW: I seriously doubt there is any gain in compressing the data to two bits per nucleotide, but that is a different matter.
1
Okay, given your parameters, the problem isn't that hard, just not one you'd approach like a traditional string search problem. It more resembles a database table-join problem, where the tables are much larger than RAM.
1. Select a good rolling hash function, aka buzhash. If you have billions of strings, you're looking for a hash with 64-bit values (see the sketch after these steps).
2. Create a hash table based on each 127-element search string. The table in memory only needs to store (hash, string-id) pairs, not the whole strings.
3. Scan your very large target string, computing the rolling hash and looking up each value of the hash in your table. Whenever there's a match, write the (string-id, target-offset) pair to a stream, possibly a file.
4. Reread your target string and the pair stream, loading search strings as needed to compare them against the target at each offset.
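A sketch of steps 1 and 3 with a buzhash-style rolling hash over 64-bit values; WINDOW, the byte table T, and lookup() are illustrative placeholders for the real table built in step 2:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define WINDOW 32   /* bytes per window; 127 bases pack into ~32 */

static uint64_t T[256];   /* fill once with fixed random 64-bit values */

static uint64_t rotl(uint64_t v, unsigned n) {
    return (v << n) | (v >> (64 - n));
}

/* illustrative stub: a real version probes the (hash, string-id)
   table built from the search strings */
static int lookup(uint64_t h) { (void)h; return 0; }

static void scan(const uint8_t *s, size_t n) {
    if (n < WINDOW) return;
    uint64_t h = 0;
    size_t i;
    for (i = 0; i < WINDOW; i++)          /* hash the first window */
        h = rotl(h, 1) ^ T[s[i]];
    for (i = WINDOW; ; i++) {
        if (lookup(h))                    /* emit (string-id, offset) */
            printf("candidate at offset %zu\n", i - WINDOW);
        if (i == n) break;
        /* roll one byte: age every rotation by 1, drop the byte
           leaving the window, add the byte entering it */
        h = rotl(h, 1) ^ rotl(T[s[i - WINDOW]], WINDOW % 64) ^ T[s[i]];
    }
}
```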
I am assuming that loading all pattern strings into memory at once is prohibitive. There are ways to segment the hash table into something that is larger than RAM but not a traditional random-access hash file; if you're interested, search for "hybrid hash" and "grace hash", which are more common in the database world.
I don't know if it's worth your while, but your pair stream gives you the perfect predictive input to manage a cache of pattern strings -- Belady's classic VM page replacement algorithm.