
Let's build a Full-Text Search engine


Full-Text Search is one of those tools people use every day without realizing it. If you ever googled "golang coverage report" or tried to find "indoor wireless camera" on an e-commerce website, you used some kind of full-text search.

Full-Text Search (FTS) is a technique for searching text in a collection of documents. A document can refer to a web page, a newspaper article, an email message, or any structured text.

Today we are going to build our own FTS engine. By the end of this post, we'll be able to search across millions of documents in less than a millisecond. We'll start with simple search queries like "give me all documents that contain the word cat" and extend the engine to support more sophisticated boolean queries.

Note

The most well-known FTS engine is Lucene (as well as Elasticsearch and Solr, which are built on top of it).

Before we start writing code, you may ask "can't we just use grep or have a loop that checks if every document contains the word I'm looking for?". Yes, we can. However, it's not always the best idea.

We are going to search a part of the abstract dump of English Wikipedia. The latest dump is available at dumps.wikimedia.org. As of today, the file size after decompression is 913 MB. The XML file contains over 600K documents.

Document example:

https://en.wikipedia.org/wiki/Kit-Cat_Klock
The Kit-Cat Klock is an art deco novelty wall clock shaped like a grinning cat with cartoon eyes that swivel in time with its pendulum tail.

Loading documents

First, we need to load all the documents from the dump. The built-in encoding/xml package comes in very handy:

import (
    "encoding/xml"
    "os"
)

type document struct {
    Title string `xml:"title"`
    URL   string `xml:"url"`
    Text  string `xml:"abstract"`
    ID    int
}

func loadDocuments(path string) ([]document, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    dec := xml.NewDecoder(f)
    dump := struct {
        Documents []document `xml:"doc"`
    }{}
    if err := dec.Decode(&dump); err != nil {
        return nil, err
    }

    docs := dump.Documents
    for i := range docs {
        docs[i].ID = i
    }
    return docs, nil
}
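
A minimal sketch of how loadDocuments might be called (the filename here is an assumption - substitute whatever path you saved the dump to):

import (
    "log"
    "time"
)

func main() {
    start := time.Now()
    docs, err := loadDocuments("enwiki-latest-abstract1.xml") // hypothetical path
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("loaded %d documents in %v", len(docs), time.Since(start))
}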

Every loaded document gets assigned a unique identifier. To keep things simple, the first loaded document gets assigned ID=0, the second ID=1 and so on.

Searching the content

Now that we have all documents loaded into memory, we can try to find the ones about cats. At first, let's loop through all documents and check if they contain the substring cat:

func search(docs []document, term string) []document {
    var r []document
    for _, doc := range docs {
        if strings.Contains(doc.Text, term) {
            r = append(r, doc)
        }
    }
    return r
}

On my laptop, the search phase takes 103ms - not too bad. If you spot-check a few documents from the output, you may notice that the function matches caterpillar and category, but doesn't match Cat with a capital C. That's not quite what I was looking for.

We need to fix two things before moving forward:

  • Make the search case-insensitive (so Cat matches as well).

  • Match on a word boundary rather than on a substring (so caterpillar and communication don't match).

Searching with regular expressions

One solution that quickly comes to mind and satisfies both requirements is regular expressions.

Here it is - (?i)\bcat\b:

  • (?i) makes the regex case-insensitive

  • \b matches a word boundary (a position where one side is a word character and the other side is not)

func search(docs []document, term string) []document {
    re := regexp.MustCompile(`(?i)\b` + term + `\b`) // Don't do this in production, it's a security risk. term needs to be sanitized.
    var r []document
    for _, doc := range docs {
        if re.MatchString(doc.Text) {
            r = append(r, doc)
        }
    }
    return r
}
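
If the search term comes from user input, one way to address the risk noted in the comment is to escape regex metacharacters with the standard library's regexp.QuoteMeta. A minimal sketch:

func search(docs []document, term string) []document {
    // regexp.QuoteMeta escapes metacharacters in term, so user input
    // can't alter the structure of the pattern.
    re := regexp.MustCompile(`(?i)\b` + regexp.QuoteMeta(term) + `\b`)
    var r []document
    for _, doc := range docs {
        if re.MatchString(doc.Text) {
            r = append(r, doc)
        }
    }
    return r
}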

Ugh, the search took more than 2 seconds. As you can see, things started getting slow even with 600K documents. While the approach is easy to implement, it doesn't scale well. As the dataset grows, we have to scan more and more documents. The time complexity of this algorithm is linear: the number of documents we have to scan equals the total number of documents. If we had 6M documents instead of 600K, the search would take 20 seconds. We need to do better than that.

To make search queries faster, we'll preprocess the text and build an index in advance.

The core of FTS is a data structure called the Inverted Index. An inverted index associates every word in the documents with the documents that contain that word.

Example:

documents = {
    1: "a donut on a glass plate",
    2: "only the donut",
    3: "listen to the drum machine",
}

index = {
    "a": [1],
    "donut": [1, 2],
    "on": [1],
    "glass": [1],
    "plate": [1],
    "only": [2],
    "the": [2, 3],
    "listen": [3],
    "to": [3],
    "drum": [3],
    "machine": [3],
}

Below is a real-world example of an inverted index: an index in a book, where a term references the page numbers it appears on:

[Figure: a book index mapping terms to page numbers]

Before we start building the index, we need to break the raw text down into a list of words (tokens) suitable for indexing and searching.

The text analyzer consists of a tokenizer and multiple filters.

[Figure: the text analysis pipeline - a tokenizer followed by a chain of filters]

Tokenizer

The tokenizer is the first step of text analysis. Its job is to convert text into a list of tokens. Our implementation splits the text on word boundaries and removes punctuation marks:

func tokenize(text string) []string {
    return strings.FieldsFunc(text, func(r rune) bool {
        // Split on any character that is not a letter or a number.
        return !unicode.IsLetter(r) && !unicode.IsNumber(r)
    })
}
> tokenize("A donut on a glass plate. Only the donuts.")

["A", "donut", "on", "a", "glass", "plate", "Only", "the", "donuts"]

In most cases, just converting text into a list of tokens is not enough. To make the text easier to index and search, we'll need to do additional normalization.

In order to make the search case-insensitive, the lowercase filter converts tokens to lower case. cAt, Cat and caT are normalized to cat. Later, when we query the index, we'll lowercase the search terms as well. This will make the search term cAt match the text Cat.

func lowercaseFilter(tokens []string) []string {
    r := make([]string, len(tokens))
    for i, token := range tokens {
        r[i] = strings.ToLower(token)
    }
    return r
}
> lowercaseFilter([]string{"A", "donut", "on", "a", "glass", "plate", "Only", "the", "donuts"})

["a", "donut", "on", "a", "glass", "plate", "only", "the", "donuts"]

Dropping common words

Almost any English text contains commonly used words like a, I, the or be. Such words are called stop words. We are going to remove them, since almost any document would match them.

There is no "official" list of stop words. Let's exclude the top 10 by OEC rank. Feel free to add more:

var stopwords = map[string]struct{}{ // I wish Go had built-in sets.
    "a": {}, "and": {}, "be": {}, "have": {}, "i": {},
    "in": {}, "of": {}, "that": {}, "the": {}, "to": {},
}

func stopwordFilter(tokens []string) []string {
    r := make([]string, 0, len(tokens))
    for _, token := range tokens {
        if _, ok := stopwords[token]; !ok {
            r = append(r, token)
        }
    }
    return r
}
> stopwordFilter([]string{"a", "donut", "on", "a", "glass", "plate", "only", "the", "donuts"})

["donut", "on", "glass", "plate", "only", "donuts"]

Stemming

Because of grammar rules, documents may include different forms of the same word. Stemming reduces words to their base form. For example, fishing, fished and fisher may all be reduced to the base form (stem) fish.

Implementing a stemmer is a non-trivial task and isn't covered in this post. We'll use one of the existing modules:

import snowballeng "github.com/kljensen/snowball/english"

func stemmerFilter(tokens []string) []string {
    r := make([]string, len(tokens))
    for i, token := range tokens {
        r[i] = snowballeng.Stem(token, false)
    }
    return r
}
> stemmerFilter([]string{"donut", "on", "glass", "plate", "only", "donuts"})

["donut", "on", "glass", "plate", "only", "donut"]

Note

A stem is not always a valid word. For example, some stemmers may reduce airline to airlin.

Putting the analyzer together

func analyze(text string) []string {
    tokens := tokenize(text)
    tokens = lowercaseFilter(tokens)
    tokens = stopwordFilter(tokens)
    tokens = stemmerFilter(tokens)
    return tokens
}

The tokenizer and filters convert sentences into a list of tokens:

> analyze("A donut on a glass plate. Only the donuts.")

["donut", "on", "glass", "plate", "only", "donut"]

The tokens are ready for indexing.

Building the index

Back to the inverted index. It maps every word in documents to document IDs. The built-in map is a good candidate for storing the mapping. The key in the map is a token (string) and the value is a list of document IDs:

type index map[string][]int

Building the index consists of analyzing the documents and adding their IDs to the map:

func (idx index) add(docs []document) {
    for _, doc := range docs {
        for _, token := range analyze(doc.Text) {
            ids := idx[token]
            if ids != nil && ids[len(ids)-1] == doc.ID {
                // Don't add same ID twice.
                continue
            }
            idx[token] = append(ids, doc.ID)
        }
    }
}

func main() {
    idx := make(index)
    idx.add([]document{{ID: 1, Text: "A donut on a glass plate. Only the donuts."}})
    idx.add([]document{{ID: 2, Text: "donut is a donut"}})
    fmt.Println(idx)
}

It works! Each token in the map refers to IDs of the documents that contain the token:

map[donut:[1 2] glass:[1] is:[2] on:[1] only:[1] plate:[1]]

To query the index, we are going to apply the same tokenizer and filters we used for indexing:

func (idx index) search(text string) [][]int {
    var r [][]int
    for _, token := range analyze(text) {
        if ids, ok := idx[token]; ok {
            r = append(r, ids)
        }
    }
    return r
}
> idx.search("Small wild cat")

[[24, 173, 303, ...], [98, 173, 765, ...], [24, 51, 173, ...]]

And finally, we can find all documents that mention cats. Searching 600K documents took less than a millisecond (18µs)!

With the inverted index, the time complexity of a search query is linear in the number of search tokens. In the example query above, other than analyzing the input text, search had to perform only three map lookups.
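
If you want to reproduce the timing on your machine, a minimal sketch using Go's built-in benchmarking could look like the following (it lives in a _test.go file and assumes a package-level idx that was populated beforehand):

import "testing"

func BenchmarkSearch(b *testing.B) {
    for i := 0; i < b.N; i++ {
        idx.search("Small wild cat")
    }
}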

The query from the previous section returned a separate list of documents for each token. But what we normally expect to find when we type small wild cat in a search box is a list of results that contain small, wild and cat at the same time. The next step is to compute the set intersection between the lists. This way we'll get the list of documents matching all tokens.

[Figure: the intersection of the document ID lists for small, wild and cat]

Luckily, IDs in our inverted index are inserted in ascending order. Since the IDs are sorted, it's possible to compute the intersection of two lists in linear time. The intersection function walks both lists simultaneously and collects the IDs that exist in both:

func intersection(a []int, b []int) []int {
    maxLen := len(a)
    if len(b) > maxLen {
        maxLen = len(b)
    }
    r := make([]int, 0, maxLen)
    var i, j int
    for i < len(a) && j < len(b) {
        if a[i] < b[j] {
            i++
        } else if a[i] > b[j] {
            j++
        } else {
            r = append(r, a[i])
            i++
            j++
        }
    }
    return r
}
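
To make the behavior concrete, here is what intersection returns for two small sorted lists:

> intersection([]int{1, 2, 3, 5}, []int{2, 3, 6})

[2, 3]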

The updated search analyzes the given query text, looks up the tokens and computes the set intersection between the lists of IDs:

func (idx index) search(text string) []int {
    var r []int
    for _, token := range analyze(text) {
        if ids, ok := idx[token]; ok {
            if r == nil {
                r = ids
            } else {
                r = intersection(r, ids)
            }
        } else {
            // Token doesn't exist.
            return nil
        }
    }
    return r
}

The Wikipedia dump contains only two documents that match small, wild and cat at the same time:

> idx.search("Small wild cat")

130764  The wildcat is a species complex comprising two small wild cat species, the European wildcat (Felis silvestris) and the African wildcat (F. lybica).
131692  Catopuma is a genus containing two Asian small wild cat species, the Asian golden cat (C. temminckii) and the bay cat.

The search is working as expected!

By the way, this is the first time I've heard of the catopuma; here is one of them:

[Photo: a catopuma]

We just built a Full-Text Search engine. Despite its simplicity, it can be a solid foundation for more advanced projects.

I didn't touch on a lot of things that can significantly improve performance and make the engine more user-friendly. Here are some ideas for further improvements:

  • Persist the index to disk. Rebuilding it on every application restart may take a while (see the sketch after this list).

  • Extend boolean queries to support OR and NOT.

  • Experiment with memory- and CPU-efficient data formats for storing sets of document IDs. Take a look at Roaring Bitmaps.

  • Support indexing multiple document fields.

  • Sort results by relevance.
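
For the first idea, here is a minimal sketch of persisting the index with the standard library's encoding/gob (the save and load helpers are hypothetical names, not part of the engine above):

import (
    "encoding/gob"
    "os"
)

// save writes the index to disk so it doesn't have to be rebuilt on restart.
func (idx index) save(path string) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()
    return gob.NewEncoder(f).Encode(idx)
}

// load reads a previously saved index back into memory.
func load(path string) (index, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()
    idx := make(index)
    err = gob.NewDecoder(f).Decode(&idx)
    return idx, err
}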

The full source code is available on GitHub.

