
Let's build a Full-Text Search engine


Full-Text Search is one of those tools people use every day without realizing it. If you ever googled "golang coverage report" or tried to find "indoor wireless camera" on an e-commerce website, you used some kind of full-text search.

Full-Text Search (FTS) is a technique for searching text in a collection of documents. A document can refer to a web page, a newspaper article, an email message, or any structured text.

Today we are going to build our own FTS engine. By the end of this post, we'll be able to search across millions of documents in less than a millisecond. We'll start with simple search queries like "give me all documents that contain the word cat" and extend the engine to support more sophisticated boolean queries.

Note

The most well-known FTS engine is Lucene (as well as Elasticsearch and Solr, which are built on top of it).

Before we start writing code, you may ask "can't we just use grep or have a loop that checks if every document contains the word I'm looking for?". Yes, we can. However, it's not always the best idea.

We are going to search a part of the abstract dump of English Wikipedia. The latest dump is available at dumps.wikimedia.org. As of today, the file size after decompression is 913 MB. The XML file contains over 600K documents.

Document example:

https://en.wikipedia.org/wiki/Kit-Cat_Klock
The Kit-Cat Klock is an art deco novelty wall clock shaped like a grinning cat with cartoon eyes that swivel in time with its pendulum tail.

Loading documents

First, we need to load all the documents from the dump. The built-in encoding/xml package comes in very handy:

import (
    "encoding/xml"
    "os"
)

type document struct {
    Title string `xml:"title"`
    URL   string `xml:"url"`
    Text  string `xml:"abstract"`
    ID    int
}

func loadDocuments(path string) ([]document, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    dec := xml.NewDecoder(f)
    dump := struct {
        Documents []document `xml:"doc"`
    }{}
    if err := dec.Decode(&dump); err != nil {
        return nil, err
    }

    docs := dump.Documents
    for i := range docs {
        docs[i].ID = i
    }
    return docs, nil
}
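
A minimal sketch of how loadDocuments might be called (the filename here is an assumption - substitute whatever path you saved the dump to):

import (
    "log"
    "time"
)

func main() {
    start := time.Now()
    docs, err := loadDocuments("enwiki-latest-abstract1.xml") // hypothetical path
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("loaded %d documents in %v", len(docs), time.Since(start))
}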

Every loaded document gets assigned a unique identifier. To keep things simple, the first loaded document gets assigned ID=0, the second ID=1 and so on.

Searching the content

Now that we have all documents loaded into memory, we can try to find the ones about cats. At first, let's loop through all documents and check if they contain the substring cat:

func search(docs []document, term string) []document {
    var r []document
    for _, doc := range docs {
        if strings.Contains(doc.Text, term) {
            r = append(r, doc)
        }
    }
    return r
}

On my laptop, the search phase takes 103ms - not too bad. If you spot-check a few documents from the output, you may notice that the function matches caterpillar and category, but doesn't match Cat with a capital C. That's not quite what I was looking for.

We need to fix two things before moving forward:

  • Make the search case-insensitive (so Cat matches as well).

  • Match on a word boundary rather than on a substring (so caterpillar and communication don't match).

Searching with regular expressions

One solution that quickly comes to mind and satisfies both requirements is regular expressions.

Here it is - (?i)\bcat\b:

  • (?i) makes the regex case-insensitive

  • \b matches a word boundary (a position where one side is a word character and the other side is not)

func search(docs []document, term string) []document {
    re := regexp.MustCompile(`(?i)\b` + term + `\b`) // Don't do this in production, it's a security risk. term needs to be sanitized.
    var r []document
    for _, doc := range docs {
        if re.MatchString(doc.Text) {
            r = append(r, doc)
        }
    }
    return r
}
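
If the search term comes from user input, one way to address the risk noted in the comment is to escape regex metacharacters with the standard library's regexp.QuoteMeta. A minimal sketch:

func search(docs []document, term string) []document {
    // regexp.QuoteMeta escapes metacharacters in term, so user input
    // can't alter the structure of the pattern.
    re := regexp.MustCompile(`(?i)\b` + regexp.QuoteMeta(term) + `\b`)
    var r []document
    for _, doc := range docs {
        if re.MatchString(doc.Text) {
            r = append(r, doc)
        }
    }
    return r
}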

Ugh, the search took more than 2 seconds. As you can see, things started getting slow even with 600K documents. While the approach is easy to implement, it doesn't scale well. As the dataset grows, we have to scan more and more documents. The time complexity of this algorithm is linear: the number of documents we have to scan equals the total number of documents. If we had 6M documents instead of 600K, the search would take 20 seconds. We need to do better than that.

To make search queries faster, we'll preprocess the text and build an index in advance.

The core of FTS is a data structure called the Inverted Index. An inverted index associates every word in the documents with the documents that contain that word.

Example:

documents = {
    1: "a donut on a glass plate",
    2: "only the donut",
    3: "listen to the drum machine",
}

index = {
    "a": [1],
    "donut": [1, 2],
    "on": [1],
    "glass": [1],
    "plate": [1],
    "only": [2],
    "the": [2, 3],
    "listen": [3],
    "to": [3],
    "drum": [3],
    "machine": [3],
}

Below is a real-world example of an inverted index: an index in a book, where a term references the page numbers it appears on:

[Figure: a book index mapping terms to page numbers]

Before we start building the index, we need to break the raw text down into a list of words (tokens) suitable for indexing and searching.

The text analyzer consists of a tokenizer and multiple filters.

[Figure: the text analysis pipeline - a tokenizer followed by a chain of filters]

Tokenizer

The tokenizer is the first step of text analysis. Its job is to convert text into a list of tokens. Our implementation splits the text on word boundaries and removes punctuation marks:

func tokenize(text string) []string {
    return strings.FieldsFunc(text, func(r rune) bool {
        // Split on any character that is not a letter or a number.
        return !unicode.IsLetter(r) && !unicode.IsNumber(r)
    })
}
> tokenize("A donut on a glass plate. Only the donuts.")

["A", "donut", "on", "a", "glass", "plate", "Only", "the", "donuts"]

In most cases, just converting text into a list of tokens is not enough. To make the text easier to index and search, we'll need to do additional normalization.

In order to make the search case-insensitive, the lowercase filter converts tokens to lower case. cAt, Cat and caT are normalized to cat. Later, when we query the index, we'll lowercase the search terms as well. This will make the search term cAt match the text Cat.

func lowercaseFilter(tokens []string) []string {
    r := make([]string, len(tokens))
    for i, token := range tokens {
        r[i] = strings.ToLower(token)
    }
    return r
}
> lowercaseFilter([]string{"A", "donut", "on", "a", "glass", "plate", "Only", "the", "donuts"})

["a", "donut", "on", "a", "glass", "plate", "only", "the", "donuts"]

Dropping common words

Almost any English text contains commonly used words like a, I, the or be. Such words are called stop words. We are going to remove them, since almost any document would match them.

There is no "official" list of stop words. Let's exclude the top 10 by OEC rank. Feel free to add more:

var stopwords = map[string]struct{}{ // I wish Go had built-in sets.
    "a": {}, "and": {}, "be": {}, "have": {}, "i": {},
    "in": {}, "of": {}, "that": {}, "the": {}, "to": {},
}

func stopwordFilter(tokens []string) []string {
    r := make([]string, 0, len(tokens))
    for _, token := range tokens {
        if _, ok := stopwords[token]; !ok {
            r = append(r, token)
        }
    }
    return r
}
> stopwordFilter([]string{"a", "donut", "on", "a", "glass", "plate", "only", "the", "donuts"})

["donut", "on", "glass", "plate", "only", "donuts"]

Stemming

Because of grammar rules, documents may include different forms of the same word. Stemming reduces words to their base form. For example, fishing, fished and fisher may all be reduced to the base form (stem) fish.

Implementing a stemmer is a non-trivial task and isn't covered in this post. We'll use one of the existing modules:

import snowballeng "github.com/kljensen/snowball/english"

func stemmerFilter(tokens []string) []string {
    r := make([]string, len(tokens))
    for i, token := range tokens {
        r[i] = snowballeng.Stem(token, false)
    }
    return r
}
> stemmerFilter([]string{"donut", "on", "glass", "plate", "only", "donuts"})

["donut", "on", "glass", "plate", "only", "donut"]

Note

A stem is not always a valid word. For example, some stemmers may reduce airline to airlin.

Putting the analyzer together

func analyze(text string) []string {
    tokens := tokenize(text)
    tokens = lowercaseFilter(tokens)
    tokens = stopwordFilter(tokens)
    tokens = stemmerFilter(tokens)
    return tokens
}

The tokenizer and filters convert sentences into a list of tokens:

> analyze("A donut on a glass plate. Only the donuts.")

["donut", "on", "glass", "plate", "only", "donut"]

The tokens are ready for indexing.

Building the index

Back to the inverted index. It maps every word in documents to document IDs. The built-in map is a good candidate for storing the mapping. The key in the map is a token (string) and the value is a list of document IDs:

type index map[string][]int

Building the index consists of analyzing the documents and adding their IDs to the map:

func (idx index) add(docs []document) {
    for _, doc := range docs {
        for _, token := range analyze(doc.Text) {
            ids := idx[token]
            if ids != nil && ids[len(ids)-1] == doc.ID {
                // Don't add same ID twice.
                continue
            }
            idx[token] = append(ids, doc.ID)
        }
    }
}

func main() {
    idx := make(index)
    idx.add([]document{{ID: 1, Text: "A donut on a glass plate. Only the donuts."}})
    idx.add([]document{{ID: 2, Text: "donut is a donut"}})
    fmt.Println(idx)
}

It works! Each token in the map refers to IDs of the documents that contain the token:

map[donut:[1 2] glass:[1] is:[2] on:[1] only:[1] plate:[1]]

To query the index, we are going to apply the same tokenizer and filters we used for indexing:

func (idx index) search(text string) [][]int {
    var r [][]int
    for _, token := range analyze(text) {
        if ids, ok := idx[token]; ok {
            r = append(r, ids)
        }
    }
    return r
}
> idx.search("Small wild cat")

[[24, 173, 303, ...], [98, 173, 765, ...], [24, 51, 173, ...]]

And finally, we can find all documents that mention cats. Searching 600K documents took less than a millisecond (18µs)!

With the inverted index, the time complexity of a search query is linear in the number of search tokens. In the example query above, other than analyzing the input text, search had to perform only three map lookups.
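
If you want to reproduce the timing on your machine, a minimal sketch using Go's built-in benchmarking could look like the following (it lives in a _test.go file and assumes a package-level idx that was populated beforehand):

import "testing"

func BenchmarkSearch(b *testing.B) {
    for i := 0; i < b.N; i++ {
        idx.search("Small wild cat")
    }
}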

The query from the previous section returned a separate list of documents for each token. But what we normally expect to find when we type small wild cat in a search box is a list of results that contain small, wild and cat at the same time. The next step is to compute the set intersection between the lists. This way we'll get the list of documents matching all tokens.

[Figure: the intersection of the document ID lists for small, wild and cat]

Luckily, IDs in our inverted index are inserted in ascending order. Since the IDs are sorted, it's possible to compute the intersection of two lists in linear time. The intersection function walks both lists simultaneously and collects the IDs that exist in both:

func intersection(a []int, b []int) []int {
    maxLen := len(a)
    if len(b) > maxLen {
        maxLen = len(b)
    }
    r := make([]int, 0, maxLen)
    var i, j int
    for i < len(a) && j < len(b) {
        if a[i] < b[j] {
            i++
        } else if a[i] > b[j] {
            j++
        } else {
            r = append(r, a[i])
            i++
            j++
        }
    }
    return r
}
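
To make the behavior concrete, here is what intersection returns for two small sorted lists:

> intersection([]int{1, 2, 3, 5}, []int{2, 3, 6})

[2, 3]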

The updated search analyzes the given query text, looks up the tokens and computes the set intersection between the lists of IDs:

func (idx index) search(text string) []int {
    var r []int
    for _, token := range analyze(text) {
        if ids, ok := idx[token]; ok {
            if r == nil {
                r = ids
            } else {
                r = intersection(r, ids)
            }
        } else {
            // Token doesn't exist.
            return nil
        }
    }
    return r
}

The Wikipedia dump contains only two documents that match small, wild and cat at the same time:

> idx.search("Small wild cat")

130764  The wildcat is a species complex comprising two small wild cat species, the European wildcat (Felis silvestris) and the African wildcat (F. lybica).
131692  Catopuma is a genus containing two Asian small wild cat species, the Asian golden cat (C. temminckii) and the bay cat.

The search is working as expected!

By the way, this is the first time I've heard of the catopuma; here is one of them:

[Photo: a catopuma]

We just built a Full-Text Search engine. Despite its simplicity, it can be a solid foundation for more advanced projects.

I didn't touch on a lot of things that can significantly improve performance and make the engine more user-friendly. Here are some ideas for further improvements:

  • Persist the index to disk. Rebuilding it on every application restart may take a while (see the sketch after this list).

  • Extend boolean queries to support OR and NOT.

  • Experiment with memory- and CPU-efficient data formats for storing sets of document IDs. Take a look at Roaring Bitmaps.

  • Support indexing multiple document fields.

  • Sort results by relevance.
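
For the first idea, here is a minimal sketch of persisting the index with the standard library's encoding/gob (the save and load helpers are hypothetical names, not part of the engine above):

import (
    "encoding/gob"
    "os"
)

// save writes the index to disk so it doesn't have to be rebuilt on restart.
func (idx index) save(path string) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()
    return gob.NewEncoder(f).Encode(idx)
}

// load reads a previously saved index back into memory.
func load(path string) (index, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()
    idx := make(index)
    err = gob.NewDecoder(f).Decode(&idx)
    return idx, err
}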

The full source code is available on GitHub.

