
Solr Getting Started: Reading Notes on the Official 6.0 Documentation, Part 9 — Section 4: Indexing and Data Operations






Indexing and Basic Data Operations
Introduction to Solr Indexing: An overview of Solr's indexing process.
Post Tool: Information about using post.jar to quickly upload some content to your system.
Uploading Data with Index Handlers: Information about using Solr's Index Handlers to upload XML/XSLT, JSON and CSV data.
Uploading Data with Solr Cell using Apache Tika: Information about using the Solr Cell framework to upload data for indexing.
Uploading Structured Data Store Data with the Data Import Handler: Information about uploading and indexing data from a structured data store.
Updating Parts of Documents: Information about how to use atomic updates and optimistic concurrency with Solr.
Detecting Languages During Indexing: Information about using language identification during the indexing process.
De-Duplication: Information about configuring Solr to mark duplicate documents as they are indexed.
Content Streams: Information about streaming content to Solr Request Handlers.
UIMA Integration: Information about integrating Solr with Apache's Unstructured Information Management Architecture (UIMA). UIMA lets you define custom pipelines of Analysis Engines that incrementally add metadata to your documents as annotations.


Introduction to Solr Indexing

Data sources: CSV, XML, JSON, databases, and rich documents such as PDF, Word and Excel. Ways to get them in: Solr Cell, HTTP requests, and SolrJ.
Post Tool
 
The post script provides a simple way to submit updates; it invokes post.jar under the hood.
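A hedged sketch of using the post tool, assuming the techproducts example collection shipped with the Solr distribution; adjust collection name and paths to your setup:

# index all of the example documents into the techproducts collection
bin/post -c techproducts example/exampledocs/*.xml

# equivalent direct invocation of post.jar
java -Dc=techproducts -jar example/exampledocs/post.jar example/exampledocs/*.xml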
Uploading Data with Index Handlers
Topics covered in this section:
UpdateRequestHandler Configuration
XML Formatted Index Updates
Adding Documents
XML Update Commands
Using curl to Perform Updates
Using XSLT to Transform XML Index Updates
JSON Formatted Index Updates
Solr-Style JSON 
JSON Update Convenience Paths
Transforming and Indexing Custom JSON
CSV Formatted Index Updates
CSV Update Parameters
Indexing Tab-Delimited files
CSV Update Convenience Paths
Nested Child Documents


UpdateRequestHandler Configuration



XML Formatted Index Updates
The <add> command supports these optional attributes:
commitWithin=number — Add the document within the specified number of milliseconds.
overwrite=boolean — Default is true. Indicates if the unique key constraints should be checked to overwrite previous versions of the same document (see below).
boost=float (on <doc>) — Default is 1.0. Sets a boost value for the document. To learn more about boosting, see Searching.
boost=float (on <field>) — Default is 1.0. Sets a boost value for the field.


Simple XML-formatted document submission for indexing — I have used all of these before.
XML Update Commands
Commit and Optimize Operations
The <commit> and <optimize> elements accept these optional attributes (the optimize command merges small index segments, improving query efficiency):
waitSearcher — Default is true. Blocks until a new searcher is opened and registered as the main query searcher, making the changes visible.
expungeDeletes — (commit only) Default is false. Merges segments that have more than 10% deleted docs, expunging them in the process.
maxSegments — (optimize only) Default is 1. Merges the segments down to no more than this number of segments.
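For illustration, a hedged sketch of issuing an explicit commit and optimize with curl, reusing the my_collection name from the curl example further below:

curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" --data-binary '<commit waitSearcher="false"/>'
curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" --data-binary '<optimize maxSegments="2"/>'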

Delete Operations — two kinds of delete are supported, "delete by ID" and "delete by query":
<delete>
  <id>0002166313</id>
  <id>0031745983</id>
  <query>subject:sport</query>
  <query>publisher:penguin</query>
</delete>
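A hedged sketch of posting such a delete with curl (reusing the my_collection name from the curl example below):

curl 'http://localhost:8983/solr/my_collection/update?commit=true' -H "Content-Type: text/xml" --data-binary '<delete><id>0002166313</id><query>publisher:penguin</query></delete>'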



Rollback Operations — roll back all uncommitted changes made since the last hard commit. The rollback command accepts no arguments: <rollback/>

Using curl to Perform Updates
curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" --data-binary '
<add>
  <doc>
    <field name="authors">Patrick Eagar</field>
    <field name="subject">Sports</field>
    <field name="dd">796.35</field>
    <field name="isbn">0002166313</field>
    <field name="yearpub">1982</field>
    <field name="publisher">Collins</field>
  </doc>
</add>'

A response like the following indicates success:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">127</int>
  </lst>
</response>



The status field will be non-zero in case of failure.

A non-zero status in the response indicates failure — does that hold for every type of request?
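It does for the standard handlers: every response carries a responseHeader with a status code, so the same check works for queries and updates alike. A hedged shell sketch of checking it (the collection name and document id here are hypothetical):

# post an update, parse responseHeader.status from the JSON response (0 = success)
status=$(curl -s 'http://localhost:8983/solr/my_collection/update?commit=true&wt=json' \
  -H 'Content-Type: application/json' --data-binary '[{"id":"status-check-1"}]' \
  | python -c 'import json,sys; print(json.load(sys.stdin)["responseHeader"]["status"])')
[ "$status" -eq 0 ] || echo "update failed with status $status"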
JSON Formatted Index Updates

Solr-Style JSON 
JSON formatted update requests may be sent to Solr's /update handler using Content-Type: application/json or Content-Type: text/json.
JSON formatted updates can take 3 basic forms, described in depth below:
A single document to add, expressed as a top level JSON Object. To differentiate this from a set of commands, the json.command=false request parameter is required.
A list of documents to add, expressed as a top level JSON Array containing a JSON Object per document.
A sequence of update commands, expressed as a top level JSON Object (aka: Map).


Adding a Single JSON Document
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/my_collection/update/json/docs' --data-binary '
{
  "id": "1",
  "title": "Doc 1"
}'


Adding Multiple JSON Documents
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/my_collection/update' --data-binary '
[
  {
    "id": "1",
    "title": "Doc 1"
  },
  {
    "id": "2",
    "title": "Doc 2"
  }
]'


Sending JSON Update Commands
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/my_collection/update' --data-binary '
{
  "add": {
    "doc": {
      "id": "DOC1",
      "my_boosted_field": {      /* use a map with boost/value for a boosted field */
        "boost": 2.3,
        "value": "test"
      },
      "my_multivalued_field": [ "aaa", "bbb" ]   /* can use an array for a multi-valued field */
    }
  },
  "add": {
    "commitWithin": 5000,        /* commit this document within 5 seconds */
    "overwrite": false,          /* do not check for existing documents with the same uniqueKey */
    "boost": 3.45,               /* a document boost */
    "doc": {
      "f1": "v1",                /* can use repeated keys for a multi-valued field */
      "f1": "v2"
    }
  },
  "commit": {},
  "optimize": { "waitSearcher": false },
  "delete": { "id": "ID" },        /* delete by ID */
  "delete": { "query": "QUERY" }   /* delete by query */
}'


The example above demonstrates the syntax; real requests usually do not combine all of these commands at once.
Delete by ID:

{ "delete": "myid" }

{ "delete": ["id1","id2"] }

A delete may also carry the expected document version:

{
  "delete": { "id": 50, "_version_": 12345 }
}


JSON Update Convenience Paths
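A hedged sketch of the convenience paths (Solr 6 defaults assumed): /update/json behaves like /update with a JSON content type implied, and /update/json/docs accepts bare documents rather than command objects:

curl 'http://localhost:8983/solr/my_collection/update/json/docs?commit=true' \
  -H 'Content-Type: application/json' \
  --data-binary '{"id":"1","title":"Doc 1"}'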

Transforming and Indexing Custom JSON
Custom, arbitrarily structured JSON can also be transformed and indexed.
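A hedged sketch of the idea, using the split and f (field mapping) parameters of /update/json/docs; the field names and sample data are illustrative, not from the original notes:

curl 'http://localhost:8983/solr/my_collection/update/json/docs?split=/exams&f=first:/first&f=last:/last&f=subject:/exams/subject&f=marks:/exams/marks&commit=true' \
  -H 'Content-Type: application/json' --data-binary '{
  "first": "John",
  "last": "Doe",
  "exams": [
    { "subject": "Maths",   "marks": 90 },
    { "subject": "Biology", "marks": 86 }
  ]
}'

Each element of the exams array becomes its own Solr document, carrying the parent fields mapped by the f parameters.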

Nested Child Documents

Documents can contain nested child documents; for documents with parent/child relationships, block indexing improves performance.

<add>
  <doc>
    <field name="id">1</field>
    <field name="title">Solr adds block join support</field>
    <field name="content_type">parentDocument</field>
    <doc>
      <field name="id">2</field>
      <field name="comments">SolrCloud supports it too!</field>
    </doc>
  </doc>
  <doc>
    <field name="id">3</field>
    <field name="title">New Lucene and Solr release is out</field>
    <field name="content_type">parentDocument</field>
    <doc>
      <field name="id">4</field>
      <field name="comments">Lots of new features</field>
    </doc>
  </doc>
</add>

Note the special _childDocuments_ key needed to indicate the nested documents in JSON:
[
{
"id": "1",
"title": "Solr adds block join support",
"content_type": "parentDocument",
"_childDocuments_": [
{
"id": "2",
"comments": "SolrCloud supports it too!"
}
]
},
{
"id": "3",
"title": "New Lucene and Solr release is out",
"content_type": "parentDocument",
"_childDocuments_": [
{
"id": "4",
"comments": "Lots of new features"
}
]
}
]
Uploading Data with Solr Cell using Apache Tika

The ExtractingRequestHandler (Solr Cell) supports building index documents from many rich file formats; a hedged curl sketch follows the topic list below.
Topics covered in this section:
Key Concepts
Trying out Tika with the Solr techproducts Example
Input Parameters
Order of Operations
Configuring the Solr ExtractingRequestHandler
Indexing Encrypted Documents with the ExtractingUpdateRequestHandler
Examples
Sending Documents to Solr with a POST
Sending Documents to Solr with Solr Cell and SolrJ
Related Topics
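As a hedged example, posting a rich document to the extract handler — this assumes the techproducts example configset (which defines /update/extract) and its bundled sample PDF:

curl 'http://localhost:8983/solr/techproducts/update/extract?literal.id=doc1&commit=true' \
  -F "myfile=@example/exampledocs/solr-word.pdf"

literal.id supplies the uniqueKey for the extracted document; the file is sent as a multipart upload.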



Uploading Structured Data Store Data with the Data Import Handler

Example: example/example-DIH
For more information about the Data Import Handler, see https://wiki.apache.org/solr/DataImportHandler. Topics covered in this section:
Concepts and Terminology
Configuration
Data Import Handler Commands
Property Writer
Data Sources
Entity Processors
Transformers
Special Commands for the Data Import Handler



Concepts and Terminology

Four key concepts around data import: the data source, the entity, the processor (roughly, how data is fetched), and the transformer (how values are converted).
Configuration

Configuring solrconfig.xml

class="org.apache.solr.handler.dataimport.DataImportHandler">

/path/to/my/DIHconfigfile.xml



可以配置多个
Configuring the DIH Configuration File
The configuration file is a template of nested entities; SQL entities can be embedded inside one another, as in the example below.


url="jdbc:hsqldb:./example-DIH/hsqldb/ex" user="sa" password="secret"/>





deltaQuery="select id from item where last_modified >
'${dataimporter.last_index_time}'">


query="select DESCRIPTION from FEATURE where ITEM_ID='${item.ID}'"
deltaQuery="select ITEM_ID from FEATURE where last_modified >
'${dataimporter.last_index_time}'"
parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}">


query="select CATEGORY_ID from item_category where
ITEM_ID='${item.ID}'"
deltaQuery="select ITEM_ID, CATEGORY_ID from item_category where
last_modified > '${dataimporter.last_index_time}'"
parentDeltaQuery="select ID from item where
ID=${item_category.ITEM_ID}">
query="select DESCRIPTION from category where ID =
'${item_category.CATEGORY_ID}'"
deltaQuery="select ID from category where last_modified >
'${dataimporter.last_index_time}'"
parentDeltaQuery="select ITEM_ID, CATEGORY_ID from item_category
where CATEGORY_ID=${category.ID}">







A reload-config command is also supported, which reloads the configuration file directly; if the configuration contains errors, an XML error message is returned.

Request Parameters

Parameters can be set dynamically via the HTTP request and referenced in the configuration in the form ${dataimporter.request.paramname}. Example:
<dataSource driver="org.hsqldb.jdbcDriver"
    url="${dataimporter.request.jdbcurl}"
    user="${dataimporter.request.jdbcuser}"
    password="${dataimporter.request.jdbcpassword}" />



There are two ways to pass these parameters: with the full-import command, or defined in the defaults section in solrconfig.xml. This example shows the parameters with the full-import command:
dataimport?command=full-import&jdbcurl=jdbc:hsqldb:./example-DIH/hsqldb/ex&jdbcuser=sa&jdbcpassword=secret

That is, either on the HTTP request or in the solrconfig.xml configuration.

Data Import Handler Commands

The command parameter accepts the following values:
abort — aborts an ongoing operation
delta-import — incremental import
full-import — full import; the start time of the operation is recorded in conf/dataimport.properties
reload-config — reloads the configuration file; status — reports the current status; show-config — shows the configuration
Request path: http://<host>:<port>/solr/<collection_name>/dataimport?command=xxxx
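A hedged sketch of invoking these commands over HTTP (collection name assumed):

# full import: clear the index first, commit when finished
curl 'http://localhost:8983/solr/my_collection/dataimport?command=full-import&clean=true&commit=true'

# incremental import based on the timestamp stored in conf/dataimport.properties
curl 'http://localhost:8983/solr/my_collection/dataimport?command=delta-import'

# check the progress of a running import
curl 'http://localhost:8983/solr/my_collection/dataimport?command=status'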

Parameters for the full-import Command

Parameters relevant to the full-import command:

clean
commit
debug
entity
optimize
synchronous 

Property Writer — addresses date formatting for delta imports.
The propertyWriter element defines the date format and locale for use with delta queries. It is an optional configuration. Add the element to the DIH configuration file, directly under the dataConfig element.

Example:
<propertyWriter dateFormat="yyyy-MM-dd HH:mm:ss" type="SimplePropertiesWriter"
    directory="data" filename="my_dih.properties" locale="en_US" />

Available parameters:
dateFormat — the date format to use
type — the implementation class used to write the properties
directory — the directory for the properties file
filename — the name of the properties file
locale — the locale


Data Sources
The following kinds of data sources can be used:
ContentStreamDataSource


FieldReaderDataSource


FileDataSource

JdbcDataSource

URLDataSource

Entity Processors
The parameters available to entity processors are listed in the table in the reference guide (page 213).

The SQL Entity Processor

The XPathEntityProcessor


The MailEntityProcessor


The TikaEntityProcessor

The FileListEntityProcessor


LineEntityProcessor


PlainTextEntityProcessor


Transformers





Updating Parts of Documents
There are two approaches to updating documents:
The first is atomic updates.
This approach changes individual field values; under the hood the document is re-indexed to apply the update.
The second approach is known as optimistic concurrency or optimistic locking.
This is an optimistic lock: updates are accepted or rejected based on the document's version number.

The two approaches can be used in combination.

Atomic Updates

set — set or replace a field value
add — add a value to a multi-valued field
remove — remove one or more values from a multi-valued field
removeregex — remove all values matching the given regular expression
inc — increment a numeric field by the given amount


例子:{"id":"mydoc",
"price":10,
"popularity":42,
"categories":["kids"],
"promo_ids":["a123x"],
"tags":["free_to_try","buy_now","clearance","on_sale"]
}


An atomic update request modifying that document:
{"id":"mydoc",
"price":{"set":99},
"popularity":{"inc":20},
"categories":{"add":["toys","games"]},
"promo_ids":{"remove":"a123x"},
"tags":{"remove":["free_to_try","on_sale"]}
}

The resulting document:
{"id":"mydoc",
"price":99,
"popularity":62,
"categories":["kids","toys","games"],
"tags":["buy_now","clearance"]
}
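To actually send the atomic update request shown above, a hedged curl sketch (collection name assumed):

curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/my_collection/update?commit=true' --data-binary '[
  {"id":"mydoc",
   "price":{"set":99},
   "popularity":{"inc":20},
   "categories":{"add":["toys","games"]},
   "promo_ids":{"remove":"a123x"},
   "tags":{"remove":["free_to_try","on_sale"]}}
]'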

Optimistic Concurrency

How the _version_ value supplied with an update decides whether the document is updated or a conflict is reported: If the content in the _version_ field is greater than '1' (i.e., '12345'), then the _version_ in the document must match the _version_ in the index.
If the content in the _version_ field is equal to '1', then the document must simply exist. In this case, no version matching occurs, but if the document does not exist, the updates will be rejected.
If the content in the _version_ field is less than '0' (i.e., '-1'), then the document must not exist. In this case, no version matching occurs, but if the document exists, the updates will be rejected.
If the content in the _version_ field is equal to '0', then it doesn't matter if the versions match or if the document exists or not. If it exists, it will be overwritten; if it does not exist, it will be added.


Example:
$ curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/techproducts/update?versions=true' --data-binary '
[ { "id" : "aaa" },
  { "id" : "bbb" } ]'
{"responseHeader":{"status":0,"QTime":6},
 "adds":["aaa",1498562471222312960,
         "bbb",1498562471225458688]}

$ curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/techproducts/update?_version_=999999&versions=true' --data-binary '
[{ "id" : "aaa",
   "foo_s" : "update attempt with wrong existing version" }]'
{"responseHeader":{"status":409,"QTime":3},
 "error":{"msg":"version conflict for aaa expected=999999 actual=1498562471222312960",
          "code":409}}

$ curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/techproducts/update?_version_=1498562471222312960&versions=true&commit=true' --data-binary '
[{ "id" : "aaa",
   "foo_s" : "update attempt with correct existing version" }]'
{"responseHeader":{"status":0,"QTime":5},
 "adds":["aaa",1498562624496861184]}

$ curl 'http://localhost:8983/solr/techproducts/query?q=*:*&fl=id,_version_'
{
  "responseHeader":{
    "status":0,
    "QTime":5,
    "params":{
      "fl":"id,_version_",
      "q":"*:*"}},
  "response":{"numFound":2,"start":0,"docs":[
    {
      "id":"bbb",
      "_version_":1498562471225458688},
    {
      "id":"aaa",
      "_version_":1498562624496861184}]
  }}



Document Centric Versioning Constraints

In distributed (SolrCloud) scenarios you must use Solr's own _version_ field for version control; in other cases you can also define your own version field, for example:
my_version_l



De-Duplication prevents duplicate copies of documents from being added to the index by computing a signature for each document as it is indexed. Three signature implementations are available:
MD5Signature
Lookup3Signature
TextProfileSignature


Configuration Options
There are two places in Solr to configure de-duplication: in solrconfig.xml and in schema.xml.
In solrconfig.xml

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
In schema.xml
If you are using a separate field for storing the signature, you must have it indexed:
<field name="signatureField" type="string" stored="true" indexed="true" multiValued="false" />
Be sure to change your update handlers to use the defined chain, as below:
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">dedupe</str>
  </lst>
  ...
</requestHandler>
See the table on page 232 of the reference guide for the remaining parameters.
Content Streams
When Solr RequestHandlers are accessed using path based URLs, the SolrQueryRequest object containing the parameters of the request may also contain a list of ContentStreams containing bulk data for the request. (The name SolrQueryRequest is a bit misleading: it is involved in all requests, regardless of whether it is a query request or an update request.)

Request parameters you send can themselves become content streams for the handler.
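A hedged curl sketch (collection name assumed): stream.body works out of the box, while stream.url and stream.file additionally require remote streaming to be enabled in solrconfig.xml:

# send a <commit/> command through the stream.body parameter (%3C/%3E are the encoded angle brackets)
curl 'http://localhost:8983/solr/my_collection/update?stream.body=%3Ccommit/%3E'

# pull a remote XML file into the update handler (only works if remote streaming is enabled)
curl 'http://localhost:8983/solr/my_collection/update?stream.url=http://example.com/docs.xml&commit=true'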
Stream Sources

RemoteStreaming

Debugging Requests


UIMA Integration
Integration with the Unstructured Information Management Architecture (UIMA).
UIMA lets you define custom pipelines of Analysis Engines that incrementally add metadata to your documents as annotations.

Configuring UIMA
The full configuration is too long to copy here; in summary:
1. Add the required UIMA JARs to the classpath.
2. Define the document fields to hold the extracted metadata.
3. Configure the UIMA update request processor chain in solrconfig.xml.
4. Enable the configured processor chain on the update handler in solrconfig.xml.

Summary of Part 4:
Introduction to Solr Indexing: an introduction to indexing.
Post Tool: uploading content with the post tool.
Uploading Data with Index Handlers: updating the index through index handlers, in several formats.
Uploading Data with Solr Cell using Apache Tika: indexing rich documents with Solr Cell.
Uploading Structured Data Store Data with the Data Import Handler: an introduction to the Data Import Handler, the supported data sources, and their import parameters.
Updating Parts of Documents: the two ways of updating parts of documents.
Detecting Languages During Indexing: not covered in these notes.
De-Duplication: marking documents with signatures to prevent duplicate indexing.
Content Streams: turning request parameters into content streams.
UIMA Integration: integrating UIMA analysis pipelines into the update process.






















