热门标签 | HotTags
当前位置:  开发笔记 > 编程语言 > 正文

SQLite全文搜索相关性排名-SQLitefull-textsearchrelevanceranking

Iamusingthefts4extensionofsqlite3toenablefull-textindexingandsearchingoftextdata.Th

I am using the fts4 extension of sqlite3 to enable full-text indexing and searching of text data. This it working great, but I've noticed that the results are not relevance-ranked at all. I guess I am too used to Lucene. I've seen some brief suggestions to write a custom rank method using the matchinfo() results, but it's not clear to me how this is done, or whether there are any sophisticated examples out there. How have others dealt with this?

我正在使用sqlite3的fts4扩展来启用全文索引和文本数据搜索。它工作得很好,但我注意到结果根本没有相关性。我想我已经习惯了Lucene。我已经看到了一些使用matchinfo()结果编写自定义排名方法的简短建议,但我不清楚这是如何完成的,或者是否有任何复杂的例子。别人怎么处理这个?

3 个解决方案

#1


7  

There's a complete example in the documentation, look at the end of appendix a. You'll need to do slightly more work to get a good relevance ranking as the function provided is good only for getting started. For example, with matchinfo(table,'pcnalx') there's enough information to implement Okapi BM25.

文档中有一个完整的例子,请看附录a的结尾。您需要做更多的工作才能获得良好的相关性排名,因为所提供的功能仅适用于入门。例如,使用matchinfo(table,'pcnalx'),有足够的信息来实现Okapi BM25。

#2


5  

There seems to be a distinct lack of documentation on how to implement Okapi BM25 in C and it seems it is an unspoken thing that the implementation is left as an exercise for the user.

似乎缺乏关于如何在C中实现Okapi BM25的文档,似乎这是一个未说出口的事情,实现留给用户练习。

Well I found the bro of a programmer "Radford 'rads' Smith" who chucked this up on GitHub

好吧,我找到了程序员“Radford'rads'史密斯”的兄弟,他把这个放在了GitHub上

https://github.com/rads/sqlite-okapi-bm25

https://github.com/rads/sqlite-okapi-bm25

It only implements BM25 although I'm looking into BM25F tweaks now....

它只实现BM25,虽然我现在正在调查BM25F调整....

....and here it is.

......而且在这里。

https://github.com/neozenith/sqlite-okapi-bm25

https://github.com/neozenith/sqlite-okapi-bm25

#3


1  

Here is an implementation of Okapi BM25. Using this in combination with the suggestions at SQLite.org will help you generate a relevance-ranked MATCH query. This was written all in VB.Net and the query was called using System.Data.SQLite functions. The custom SQLiteFunction at the end can be called from the SQL code without issue, as long as the SQL code is called using System.Data.SQLite functions.

这是Okapi BM25的一个实现。将此与SQLite.org中的建议结合使用将帮助您生成相关性排序的MATCH查询。这是用VB.Net编写的,查询是使用System.Data.SQLite函数调用的。只要使用System.Data.SQLite函数调用SQL代码,就可以从SQL代码中调用最终的自定义SQLiteFunction。

Public Class MatchInfo
    Property matchablePhrases As Integer
    Property userDefinedColumns As Integer
    Property totalDocuments As Integer
    Private _int32HitData As List(Of Integer)
    Private _longestSubsequencePhraseMatches As New List(Of Integer)
    Private _tokensInDocument As New List(Of Integer)
    Private _averageTokensInDocument As New List(Of Integer)

    Private _max_hits_this_row As Integer?
    Public ReadOnly Property max_hits_this_row As Integer
        Get
            If _max_hits_this_row Is Nothing Then
                _max_hits_this_row = 0
                For p = 0 To matchablePhrases - 1
                    For c = 0 To userDefinedColumns - 1
                        Dim myHitsThisRow As Integer = hits_this_row(p, c)
                        If myHitsThisRow > _max_hits_this_row Then
                            _max_hits_this_row = myHitsThisRow
                        End If
                    Next
                Next
            End If

            Return _max_hits_this_row
        End Get
    End Property

    Private _max_hits_all_rows As Integer?
    Public ReadOnly Property max_hits_all_rows As Integer
        Get
            If _max_hits_all_rows Is Nothing Then
                _max_hits_all_rows = 0
                For p = 0 To matchablePhrases - 1
                    For c = 0 To userDefinedColumns - 1
                        Dim myHitsAllRows As Integer = hits_all_rows(p, c)
                        If myHitsAllRows > _max_hits_all_rows Then
                            _max_hits_all_rows = myHitsAllRows
                        End If
                    Next
                Next
            End If

            Return _max_hits_all_rows
        End Get
    End Property

    Private _max_docs_with_hits As Integer?
    Public ReadOnly Property max_docs_with_hits As Integer
        Get
            If _max_docs_with_hits Is Nothing Then
                _max_docs_with_hits = 0
                For p = 0 To matchablePhrases - 1
                    For c = 0 To userDefinedColumns - 1
                        Dim myDocsWithHits As Integer = docs_with_hits(p, c)
                        If myDocsWithHits > _max_docs_with_hits Then
                            _max_docs_with_hits = myDocsWithHits
                        End If
                    Next
                Next
            End If

            Return _max_docs_with_hits
        End Get
    End Property

    Private _BM25Rank As Double?
    Public ReadOnly Property BM25Rank As Double
        Get
            If _BM25Rank Is Nothing Then
                _BM25Rank = 0
                'calculate BM25 Rank
                'http://en.wikipedia.org/wiki/Okapi_BM25

                'k1, calibrates the document term frequency scaling. Having k1 as 0 corresponds to a binary model – no term frequency. Increasing k1 will give rare words more boost.
                'b, calibrates the scaling by document length, and can take values from 0 to 1, where having 0 means no length normalization and having 1 corresponds to fully scaling the term weight by the document length.

                Dim k1 As Double = 1.2
                Dim b As Double = 0.75

                For column = 0 To userDefinedColumns - 1
                    For phrase = 0 To matchablePhrases - 1
                        Dim IDF As Double = Math.Log((totalDocuments - hits_all_rows(phrase, column) + 0.5) / (hits_all_rows(phrase, column) + 0.5))
                        Dim score As Double = (IDF * ((hits_this_row(phrase, column) * (k1 + 1)) / (hits_this_row(phrase, column) + k1 * (1 - b + b * _tokensInDocument(column) / _averageTokensInDocument(column)))))
                        If score <0 Then
                            score = 0
                        End If
                        _BM25Rank += score
                    Next
                Next

            End If

            Return _BM25Rank
        End Get
    End Property

    Public Sub New(raw_pcnalsx_MatchInfo As Byte())
        Dim int32_pcsx_MatchInfo As New List(Of Integer)
        For i = 0 To raw_pcnalsx_MatchInfo.Length - 1 Step 4
            int32_pcsx_MatchInfo.Add(BitConverter.ToUInt32(raw_pcnalsx_MatchInfo, i))
        Next

        'take the raw data and parse it out
        Me.matchablePhrases = int32_pcsx_MatchInfo(0)
        int32_pcsx_MatchInfo.RemoveAt(0)

        Me.userDefinedColumns = int32_pcsx_MatchInfo(0)
        int32_pcsx_MatchInfo.RemoveAt(0)

        Me.totalDocuments = int32_pcsx_MatchInfo(0)
        int32_pcsx_MatchInfo.RemoveAt(0)

        'remember that the columns are 0-based
        For i = 0 To userDefinedColumns - 1
            _averageTokensInDocument.Add(int32_pcsx_MatchInfo(0))
            int32_pcsx_MatchInfo.RemoveAt(0)
        Next

        For i = 0 To userDefinedColumns - 1
            _tokensInDocument.Add(int32_pcsx_MatchInfo(0))
            int32_pcsx_MatchInfo.RemoveAt(0)
        Next

        For i = 0 To userDefinedColumns - 1
            _longestSubsequencePhraseMatches.Add(int32_pcsx_MatchInfo(0))
            int32_pcsx_MatchInfo.RemoveAt(0)
        Next

        _int32HitData = New List(Of Integer)(int32_pcsx_MatchInfo)

    End Sub

    Public Function hits_this_row(phrase As Integer, column As Integer) As Integer
        Return _int32HitData(3 * (column + phrase * userDefinedColumns) + 0)
    End Function

    Public Function hits_all_rows(phrase As Integer, column As Integer) As Integer
        Return _int32HitData(3 * (column + phrase * userDefinedColumns) + 1)
    End Function

    Public Function docs_with_hits(phrase As Integer, column As Integer) As Integer
        Return _int32HitData(3 * (column + phrase * userDefinedColumns) + 2)
    End Function
End Class


Public Class Rank
    Inherits SQLiteFunction

    Public Overrides Function Invoke(args() As Object) As Object
        Return New MatchInfo(args(0)).BM25Rank
    End Function

End Class

推荐阅读
author-avatar
若_时光倒影
这个家伙很懒,什么也没留下!
PHP1.CN | 中国最专业的PHP中文社区 | DevBox开发工具箱 | json解析格式化 |PHP资讯 | PHP教程 | 数据库技术 | 服务器技术 | 前端开发技术 | PHP框架 | 开发工具 | 在线工具
Copyright © 1998 - 2020 PHP1.CN. All Rights Reserved | 京公网安备 11010802041100号 | 京ICP备19059560号-4 | PHP1.CN 第一PHP社区 版权所有