
Getting Started with Solr: Reading Notes on the Official 6.0 Documentation (Part 9), Section 4: Indexing and Basic Data Operations

Indexing and Basic Data Operations
Introduction to Solr Indexing: An overview of Solr's indexing process.
Post Tool: Information about using post.jar to quickly upload some content to your system.
Uploading Data with Index Handlers: Information about using Solr's Index Handlers to upload XML/XSLT, JSON and CSV data.
Uploading Data with Solr Cell using Apache Tika: Information about using the Solr Cell framework to upload data for indexing.
Uploading Structured Data Store Data with the Data Import Handler: Information about uploading and indexing data from a structured data store.
Updating Parts of Documents: Information about how to use atomic updates and optimistic concurrency with Solr.
Detecting Languages During Indexing: Information about using language identification during the indexing process.
De-Duplication: Information about configuring Solr to mark duplicate documents as they are indexed.
Content Streams: Information about streaming content to Solr Request Handlers.
UIMA Integration: Information about integrating Solr with Apache's Unstructured Information Management Architecture (UIMA). UIMA lets you define custom pipelines of Analysis Engines that incrementally add metadata to your documents as annotations.


Introduction to Solr Indexing

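Data sources: CSV, XML, JSON, databases, and rich documents such as PDF, Word, and Excel. Ways to get the data in: Solr Cell, HTTP requests, and SolrJ.
Post Tool

The post script covers simple updates; it calls post.jar.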
Uploading Data with Index Handlers
Topics covered in this section:
UpdateRequestHandler Configuration
XML Formatted Index Updates
Adding Documents
XML Update Commands
Using curl to Perform Updates
Using XSLT to Transform XML Index Updates
JSON Formatted Index Updates
Solr-Style JSON 
JSON Update Convenience Paths
Transforming and Indexing Custom JSON
CSV Formatted Index Updates
CSV Update Parameters
Indexing Tab-Delimited files
CSV Update Convenience Paths
Nested Child Documents


UpdateRequestHandler Configuration



XML Formatted Index Updates
Optional attributes of <add>:
commitWithin=number: Add the document within the specified number of milliseconds.
overwrite=boolean: Default is true. Indicates if the unique key constraints should be checked to overwrite previous versions of the same document (see below).
Optional attribute of <doc>:
boost=float: Default is 1.0. Sets a boost value for the document. To learn more about boosting, see Searching.
Optional attribute of <field>:
boost=float: Default is 1.0. Sets a boost value for the field.
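As a rough illustration of how these attributes fit into an XML update message, here is a minimal Python sketch that builds an add message with the standard library. The helper name build_add_xml and the sample field names are assumptions made for this example; it only constructs the XML string and does not talk to Solr.

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields, commit_within_ms=None, overwrite=True, doc_boost=None):
    """Build an <add> message for a single document and return it as a string.

    build_add_xml is a hypothetical helper written for these notes.
    """
    add = ET.Element("add")
    if commit_within_ms is not None:
        add.set("commitWithin", str(commit_within_ms))
    if not overwrite:
        # overwrite defaults to true, so it is only written when disabled
        add.set("overwrite", "false")
    doc = ET.SubElement(add, "doc")
    if doc_boost is not None:
        doc.set("boost", str(doc_boost))
    for name, value in fields.items():
        # a list value becomes repeated <field> elements (multi-valued field)
        values = value if isinstance(value, list) else [value]
        for v in values:
            ET.SubElement(doc, "field", name=name).text = str(v)
    return ET.tostring(add, encoding="unicode")


# Sample field names are illustrative only.
xml_message = build_add_xml(
    {"id": "0002166313", "subject": "Sports"},
    commit_within_ms=5000,
)
print(xml_message)
```

The resulting string can be used as the body of a POST to the /update handler with Content-Type: text/xml.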


Submitting documents for indexing as simple XML. I have already used all of this.
XML Update Commands
Commit and Optimize Operations
The <commit> and <optimize> elements accept these optional attributes:
waitSearcher: Default is true. Blocks until a new searcher is opened and registered as the main query searcher, making the changes visible.
expungeDeletes: (commit only) Default is false. Merges segments that have more than 10% deleted docs, expunging them in the process.
maxSegments: (optimize only) Default is 1. Merges the segments down to no more than this number of segments.
The optimize operation merges small index segments, which improves query efficiency.

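Delete Operations: there are two kinds of delete, by ID and by query.

<delete>
<id>0002166313</id>
<id>0031745983</id>
<query>subject:sport</query>
<query>publisher:penguin</query>
</delete>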
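A delete message that mixes deletes by ID and deletes by query could be assembled like this. This is only a sketch: build_delete_xml is a hypothetical helper, and it produces the XML body without sending it anywhere.

```python
import xml.etree.ElementTree as ET


def build_delete_xml(ids=(), queries=()):
    """Build a <delete> message from document ids and delete queries.

    build_delete_xml is a hypothetical helper written for these notes.
    """
    delete = ET.Element("delete")
    for doc_id in ids:
        ET.SubElement(delete, "id").text = str(doc_id)
    for query in queries:
        ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


delete_message = build_delete_xml(
    ids=["0002166313", "0031745983"],
    queries=["subject:sport", "publisher:penguin"],
)
print(delete_message)
```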



Rollback Operations: roll back to the state of the last commit.
<rollback/>

Using curl to Perform Updates
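curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" \
--data-binary '
<add>
<doc>
<field name="authors">Patrick Eagar</field>
<field name="subject">Sports</field>
<field name="dd">796.35</field>
<field name="isbn">0002166313</field>
<field name="yearpub">1982</field>
<field name="publisher">Collins</field>
</doc>
</add>'

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">127</int>
</lst>
</response>
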

The status field will be non-zero in case of failure.

Any response whose status is not 0 indicates a failure. Does this hold for every kind of request?
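To check the status programmatically, the XML response can be parsed and the status value read from the responseHeader. The sketch below assumes the standard XML response layout (a responseHeader list containing status and QTime integers); response_status is a hypothetical helper, and the sample response is written out by hand rather than fetched from a server.

```python
import xml.etree.ElementTree as ET


def response_status(xml_text):
    """Return the integer status from a Solr XML response body."""
    root = ET.fromstring(xml_text)
    status = root.find("./lst[@name='responseHeader']/int[@name='status']")
    if status is None:
        raise ValueError("response has no responseHeader/status element")
    return int(status.text)


# Hand-written sample with the assumed standard response layout.
sample_response = """<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">127</int>
</lst>
</response>"""

if response_status(sample_response) != 0:
    raise RuntimeError("update failed")
print("update accepted")
```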
JSON Formatted Index Updates

Solr-Style JSON 
JSON formatted update requests may be sent to Solr's /update handler using Content-Type: application/json or Content-Type: text/json.
JSON formatted updates can take 3 basic forms, described in depth below:
A single document to add, expressed as a top level JSON Object. To differentiate this from a set of
commands, the json.command=false request parameter is required.
A list of documents to add, expressed as a top level JSON Array containing a JSON Object per document.
A sequence of update commands, expressed as a top level JSON Object (aka: Map).
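The three forms differ only in the shape of the top-level JSON value and, for a single document, in the extra json.command=false parameter. A small sketch of how a client might prepare each request follows; build_json_update is a hypothetical helper, the base URL assumes a local Solr with a collection named my_collection, and nothing is actually sent.

```python
import json
from urllib.parse import urlencode

# Assumed location of the update handler; adjust host and collection as needed.
BASE_URL = "http://localhost:8983/solr/my_collection/update"


def build_json_update(payload, single_document=False):
    """Return (url, headers, body) for a JSON update request.

    single_document=True marks a top-level object as one document
    rather than a set of update commands.
    """
    params = {}
    if single_document:
        if not isinstance(payload, dict):
            raise TypeError("a single document must be a JSON object")
        params["json.command"] = "false"
    url = BASE_URL + ("?" + urlencode(params) if params else "")
    headers = {"Content-Type": "application/json"}
    return url, headers, json.dumps(payload).encode("utf-8")


# 1. a single document
single = build_json_update({"id": "1", "title": "Doc 1"}, single_document=True)
# 2. a list of documents
many = build_json_update([{"id": "1", "title": "Doc 1"}, {"id": "2", "title": "Doc 2"}])
# 3. a set of update commands
commands = build_json_update({"delete": {"id": "1"}, "commit": {}})
print(single[0])
```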


Adding a Single JSON Document
{
"id": "1",
"title": "Doc 1"
}


Adding Multiple JSON Documents
[
{
"id": "1",
"title": "Doc 1"
},
{
"id": "2",
"title": "Doc 2"
}
]


Sending JSON Update Commands
{
"add": {
"doc": {
"id": "DOC1",
"my_boosted_field": { /* use a map with boost/value for a boosted field */
"boost": 2.3,
"value": "test"
},
"my_multivalued_field": [ "aaa", "bbb" ] /* Can use an array for a multi-valued field */
}
},
"add": {
"commitWithin": 5000, /* commit this document within 5 seconds */
"overwrite": false, /* don't check for existing documents with the same uniqueKey */
"boost": 3.45, /* a document boost */
"doc": {
"f1": "v1", /* Can use repeated keys for a multi-valued field */
"f1": "v2"
}
},
"commit": {},
"optimize": { "waitSearcher":false },
"delete": { "id":"ID" }, /* delete by ID */
"delete": { "query":"QUERY" } /* delete by query */
}


The above is only a demonstration; in practice the commands would probably not be combined like this.
delete-by-id. 

{ "delete":"myid" }

{ "delete":["id1","id2"] }

{
"delete": {
"id":50,
"_version_":12345
}
}
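The delete-by-id shapes can be generated from Python values so that the nesting is always valid JSON. This is a minimal sketch; delete_by_id and delete_by_ids are hypothetical helper names, and the functions only produce the request body.

```python
import json


def delete_by_id(doc_id, version=None):
    """Return the JSON body that deletes one document by id.

    When version is given, the id and _version_ are wrapped in an object
    so the delete only applies to that version of the document.
    """
    if version is None:
        return json.dumps({"delete": doc_id})
    return json.dumps({"delete": {"id": doc_id, "_version_": version}})


def delete_by_ids(doc_ids):
    """Return the JSON body that deletes several documents by id."""
    return json.dumps({"delete": list(doc_ids)})


print(delete_by_id("myid"))
print(delete_by_ids(["id1", "id2"]))
print(delete_by_id(50, version=12345))
```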


JSON Update Convenience Paths

Transforming and Indexing Custom JSON
Custom JSON can also be transformed and indexed.

Nested Child Documents

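Documents can contain nested child documents, which improves performance for documents that are related to each other.

<add>
<doc>
<field name="id">1</field>
<field name="title">Solr adds block join support</field>
<field name="content_type">parentDocument</field>
<doc>
<field name="id">2</field>
<field name="comments">SolrCloud supports it too!</field>
</doc>
</doc>
<doc>
<field name="id">3</field>
<field name="title">New Lucene and Solr release is out</field>
<field name="content_type">parentDocument</field>
<doc>
<field name="id">4</field>
<field name="comments">Lots of new features</field>
</doc>
</doc>
</add>

Note the special _childDocuments_ key needed to indicate the nested documents in JSON.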
[
{
"id": "1",
"title": "Solr adds block join support",
"content_type": "parentDocument",
"_childDocuments_": [
{
"id": "2",
"comments": "SolrCloud supports it too!"
}
]
},
{
"id": "3",
"title": "New Lucene and Solr release is out",
"content_type": "parentDocument",
"_childDocuments_": [
{
"id": "4",
"comments": "Lots of new features"
}
]
}
]
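When the parent and child records come from application code, the nested JSON can be assembled with a small helper. The sketch below is only an illustration: with_children is a hypothetical name, and the sample data is the same as in the example above.

```python
import json


def with_children(parent, children):
    """Return a copy of parent with its children attached under _childDocuments_."""
    doc = dict(parent)
    if children:
        doc["_childDocuments_"] = [dict(child) for child in children]
    return doc


docs = [
    with_children(
        {"id": "1", "title": "Solr adds block join support", "content_type": "parentDocument"},
        [{"id": "2", "comments": "SolrCloud supports it too!"}],
    ),
    with_children(
        {"id": "3", "title": "New Lucene and Solr release is out", "content_type": "parentDocument"},
        [{"id": "4", "comments": "Lots of new features"}],
    ),
]

# Body for a JSON update request that adds a list of documents.
body = json.dumps(docs, indent=2)
print(body)
```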
Uploading Data with Solr Cell using Apache Tika

The ExtractingRequestHandler (Solr Cell) supports indexing many file formats.
Topics covered in this section:
Key Concepts
Trying out Tika with the Solr techproducts Example
Input Parameters
Order of Operations
Configuring the Solr ExtractingRequestHandler
Indexing Encrypted Documents with the ExtractingUpdateRequestHandler
Examples
Sending Documents to Solr with a POST
Sending Documents to Solr with Solr Cell and SolrJ
Related Topics



Uploading Structured Data Store Data with the Data Import Handler

Example: example/example-DIH
For more information about the Data Import Handler, see https://wiki.apache.org/solr/DataImportHandler
Topics covered in this section:
Concepts and Terminology
Configuration
Data Import Handler Commands
Property Writer
Data Sources
Entity Processors
Transformers
Special Commands for the Data Import Handler



Concepts and Terminology

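Four concepts matter for data import: the data source, the entity, the processor (roughly how the data is fetched), and the transformer (how it is converted).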
Configuration

Configuring solrconfig.xml

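<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">/path/to/my/DIHconfigfile.xml</str>
</lst>
</requestHandler>

More than one DIH handler can be configured.
Configuring the DIH Configuration File: the import configuration itself.
The file below is a template; SQL entities can be nested inside one another.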


<dataConfig>
<dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:./example-DIH/hsqldb/ex" user="sa" password="secret"/>
<document>
<entity name="item" query="select * from item"
deltaQuery="select id from item where last_modified > '${dataimporter.last_index_time}'">
<field column="NAME" name="name" />
<entity name="feature"
query="select DESCRIPTION from FEATURE where ITEM_ID='${item.ID}'"
deltaQuery="select ITEM_ID from FEATURE where last_modified > '${dataimporter.last_index_time}'"
parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}">
<field name="features" column="DESCRIPTION" />
</entity>
<entity name="item_category"
query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'"
deltaQuery="select ITEM_ID, CATEGORY_ID from item_category where last_modified > '${dataimporter.last_index_time}'"
parentDeltaQuery="select ID from item where ID=${item_category.ITEM_ID}">
<entity name="category"
query="select DESCRIPTION from category where ID = '${item_category.CATEGORY_ID}'"
deltaQuery="select ID from category where last_modified > '${dataimporter.last_index_time}'"
parentDeltaQuery="select ITEM_ID, CATEGORY_ID from item_category where CATEGORY_ID=${category.ID}">
<field column="description" name="cat" />
</entity>
</entity>
</entity>
</document>
</dataConfig>







The configuration file can be reloaded directly; if it contains a mistake, the XML error is reported. A reload-config command is also supported.

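Request Parameters

Parameters can be set dynamically from the HTTP request, using the form ${dataimporter.request.paramname}.
Example:
<dataSource driver="org.hsqldb.jdbcDriver" url="${dataimporter.request.jdbcurl}" user="${dataimporter.request.jdbcuser}" password="${dataimporter.request.jdbcpassword}" />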



Parameters can be supplied in two ways. These parameters can be passed to the full-import command or defined in the <defaults> section in solrconfig.xml. This example shows the parameters with the full-import command:
dataimport?command=full-import&jdbcurl=jdbc:hsqldb:./example-DIH/hsqldb/ex&jdbcuser=sa&jdbcpassword=secret

In other words, either in the HTTP request or in the solrconfig.xml configuration.
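Building such a request by hand is error-prone because the JDBC URL contains characters that should be percent-encoded in a query string. The following sketch uses the standard library to do the encoding; dih_url is a hypothetical helper, and the host and collection name (my_collection) are assumptions for the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def dih_url(core_url, command, **request_params):
    """Build a Data Import Handler URL with extra request parameters.

    Each keyword argument becomes available in the DIH configuration as
    ${dataimporter.request.<name>}.
    """
    query = urlencode({"command": command, **request_params})
    return f"{core_url}/dataimport?{query}"


url = dih_url(
    "http://localhost:8983/solr/my_collection",  # assumed host and collection
    "full-import",
    jdbcurl="jdbc:hsqldb:./example-DIH/hsqldb/ex",
    jdbcuser="sa",
    jdbcpassword="secret",
)
print(url)

# Decoding the query string gives back the original values.
print(parse_qs(urlsplit(url).query)["jdbcurl"][0])
```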

Data Import Handler Commands

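The available commands:
abort: abort an operation that is in progress
delta-import: incremental import
full-import: full import; the timestamp of the operation is written to conf/dataimport.properties
reload-config: reload the configuration file
status: report the current status
show-config: show the configuration file
Request path: http://<host>:<port>/solr/<collection_name>/dataimport?command=xxxx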

Parameters for the full-import Command

Parameters that apply to a full import:

clean
commit
debug
entity
optimize
synchronous 
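A client can guard against misspelled parameter names by validating them before the request is built. The sketch below only knows the six parameter names listed above; full_import_query is a hypothetical helper, and it deliberately sends only the parameters that are passed in, leaving all defaults to Solr.

```python
from urllib.parse import urlencode

# The parameter names listed in these notes for the full-import command.
FULL_IMPORT_PARAMS = {"clean", "commit", "debug", "entity", "optimize", "synchronous"}


def full_import_query(**params):
    """Return the query string for a full-import with the given parameters."""
    unknown = set(params) - FULL_IMPORT_PARAMS
    if unknown:
        raise ValueError(f"unsupported full-import parameter(s): {sorted(unknown)}")
    pairs = [("command", "full-import")]
    for name, value in params.items():
        if isinstance(value, bool):
            # Solr expects lowercase true/false rather than Python's True/False
            value = "true" if value else "false"
        pairs.append((name, value))
    return urlencode(pairs)


query = full_import_query(clean=False, commit=True, entity="item")
print(query)  # command=full-import&clean=false&commit=true&entity=item
```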

Property Writer: deals with date format issues
The propertyWriter element defines the date format and locale for use with delta queries. It is an optional configuration. Add the element to the DIH configuration file, directly under the dataConfig element.

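Example:
<propertyWriter dateFormat="yyyy-MM-dd HH:mm:ss" type="SimplePropertiesWriter" directory="data" filename="my_dih.properties" locale="en_US" />

Available parameters:
dateFormat: the date format
type: the writer implementation to use
directory: the directory
filename: the file name
locale: the locale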


Data Sources
The following data sources are available:
ContentStreamDataSource


FieldReaderDataSource


FileDataSource

JdbcDataSource

URLDataSource

Entity Processors
The attributes available to entity processors are listed in the table on page 213 of the reference guide.

The SQL Entity Processor

The XPathEntityProcessor


The MailEntityProcessor


The TikaEntityProcessor

The FileListEntityProcessor


LineEntityProcessor


PlainTextEntityProcessor


Transformers





Updating Parts of Documents
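There are two ways to update part of a document:
The first is atomic updates.
This changes only the specified fields; the update is completed by re-indexing the document.
The second approach is known as optimistic concurrency or optimistic locking.
With optimistic locking, an update is applied conditionally, based on a version number.

The two approaches can be combined.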

Atomic Updates

set
add
remove
removeregex
inc


Example:
{"id":"mydoc",
"price":10,
"popularity":42,
"categories":["kids"],
"promo_ids":["a123x"],
"tags":["free_to_try","buy_now","clearance","on_sale"]
}

Update request:
{"id":"mydoc",
"price":{"set":99},
"popularity":{"inc":20},
"categories":{"add":["toys","games"]},
"promo_ids":{"remove":"a123x"},
"tags":{"remove":["free_to_try","on_sale"]}
}

Result:
{"id":"mydoc",
"price":99,
"popularity":62,
"categories":["kids","toys","games"],
"tags":["buy_now","clearance"]
}
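To make the effect of each modifier concrete, the example can be replayed locally. The sketch below is a plain-Python simulation written for these notes, not Solr's implementation: apply_atomic_update is a hypothetical function, fields that end up empty are dropped to mirror the result shown above, and treating removeregex as a full-match test is an assumption.

```python
import re


def _as_list(value):
    """Wrap a single value in a list; None becomes an empty list."""
    if value is None:
        return []
    return list(value) if isinstance(value, list) else [value]


def apply_atomic_update(doc, update):
    """Simulate set/add/remove/removeregex/inc on an in-memory document."""
    result = {k: (list(v) if isinstance(v, list) else v) for k, v in doc.items()}
    for field, change in update.items():
        if not isinstance(change, dict):
            result[field] = change  # plain value, such as the id
            continue
        for op, arg in change.items():
            current = result.get(field)
            if op == "set":
                result[field] = arg
            elif op == "inc":
                result[field] = (current if current is not None else 0) + arg
            elif op == "add":
                result[field] = _as_list(current) + _as_list(arg)
            elif op == "remove":
                result[field] = [v for v in _as_list(current) if v not in _as_list(arg)]
            elif op == "removeregex":
                # assumption: a value is removed when a pattern matches it fully
                patterns = [re.compile(p) for p in _as_list(arg)]
                result[field] = [
                    v for v in _as_list(current)
                    if not any(p.fullmatch(str(v)) for p in patterns)
                ]
            else:
                raise ValueError(f"unknown modifier: {op}")
            if result.get(field) in ([], None):
                result.pop(field, None)  # drop fields that end up empty
    return result


original = {
    "id": "mydoc", "price": 10, "popularity": 42, "categories": ["kids"],
    "promo_ids": ["a123x"],
    "tags": ["free_to_try", "buy_now", "clearance", "on_sale"],
}
request = {
    "id": "mydoc", "price": {"set": 99}, "popularity": {"inc": 20},
    "categories": {"add": ["toys", "games"]}, "promo_ids": {"remove": "a123x"},
    "tags": {"remove": ["free_to_try", "on_sale"]},
}
updated = apply_atomic_update(original, request)
print(updated)
```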

Optimistic Concurrency

How the version number is interpreted: the _version_ value sent with an update determines whether the document is updated or a conflict is reported.
If the content in the _version_ field is greater than '1' (e.g., '12345'), then the _version_ in the document must match the _version_ in the index.
If the content in the _version_ field is equal to '1', then the document must simply exist. In this case, no version matching occurs, but if the document does not exist, the updates will be rejected.
If the content in the _version_ field is less than '0' (e.g., '-1'), then the document must not exist. In this case, no version matching occurs, but if the document exists, the updates will be rejected.
If the content in the _version_ field is equal to '0', then it doesn't matter if the versions match or if the document exists or not. If it exists, it will be overwritten; if it does not exist, it will be added.
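The four rules can be written down as one small decision function. This is only a local restatement of the rules above for illustration, not Solr code; update_allowed is a hypothetical name, and existing_version is None when no document with that id is in the index.

```python
def update_allowed(requested_version, existing_version):
    """Apply the _version_ rules and report whether an update would be accepted.

    requested_version: the _version_ value sent with the update.
    existing_version: the _version_ of the indexed document, or None if absent.
    """
    if requested_version > 1:
        # the versions must match exactly
        return existing_version is not None and existing_version == requested_version
    if requested_version == 1:
        # the document must simply exist
        return existing_version is not None
    if requested_version < 0:
        # the document must not exist
        return existing_version is None
    # requested_version == 0: no constraint at all
    return True


indexed = 1498562471222312960
print(update_allowed(999999, indexed))   # False: version conflict
print(update_allowed(indexed, indexed))  # True: versions match
print(update_allowed(-1, None))          # True: document does not exist yet
```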


Example:
$ curl -X POST -H 'Content-Type: application/json' \
'http://localhost:8983/solr/techproducts/update?versions=true' --data-binary '
[ { "id" : "aaa" },
{ "id" : "bbb" } ]'
{"responseHeader":{"status":0,"QTime":6},
"adds":["aaa",1498562471222312960,
"bbb",1498562471225458688]}
$ curl -X POST -H 'Content-Type: application/json' \
'http://localhost:8983/solr/techproducts/update?_version_=999999&versions=true' \
--data-binary '
[{ "id" : "aaa",
"foo_s" : "update attempt with wrong existing version" }]'
{"responseHeader":{"status":409,"QTime":3},
"error":{"msg":"version conflict for aaa expected=999999 actual=1498562471222312960",
"code":409}}
$ curl -X POST -H 'Content-Type: application/json' \
'http://localhost:8983/solr/techproducts/update?_version_=1498562471222312960&versions=true&commit=true' \
--data-binary '
[{ "id" : "aaa",
"foo_s" : "update attempt with correct existing version" }]'
{"responseHeader":{"status":0,"QTime":5},
"adds":["aaa",1498562624496861184]}
$ curl 'http://localhost:8983/solr/techproducts/query?q=*:*&fl=id,_version_'
{
"responseHeader":{
"status":0,
"QTime":5,
"params":{
"fl":"id,_version_",
"q":"*:*"}},
"response":{"numFound":2,"start":0,"docs":[
{
"id":"bbb",
"_version_":1498562471225458688},
{
"id":"aaa",
"_version_":1498562624496861184}]
}}



Document Centric Versioning Constraints

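In a distributed setup you must use Solr's own version field, _version_; in other cases you can also define a version field of your own, for example: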
my_version_l



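De-Duplication
Signatures are used to keep duplicate copies of a document from being added to the index. Three signature implementations are available:
MD5Signature
Lookup3Signature
TextProfileSignature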
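The idea behind signature-based de-duplication can be shown in a few lines: hash the values of the chosen fields and treat documents with equal hashes as duplicates. The sketch below is only an illustration of that idea using Python's hashlib; it is not the algorithm used by Solr's MD5Signature class and will not produce the same values, and md5_signature is a hypothetical helper.

```python
import hashlib


def md5_signature(doc, fields):
    """Hash the values of the selected fields into one hex signature.

    Illustration only: Solr's own signature classes compute their
    values differently.
    """
    digest = hashlib.md5()
    for name in fields:
        value = doc.get(name)
        if value is None:
            continue
        values = value if isinstance(value, list) else [value]
        for v in values:
            # separators keep ("ab", "c") distinct from ("a", "bc")
            digest.update(name.encode("utf-8") + b"\x00")
            digest.update(str(v).encode("utf-8") + b"\x00")
    return digest.hexdigest()


SIGNATURE_FIELDS = ["name", "features", "cat"]

doc_a = {"id": "1", "name": "Widget", "features": ["small", "blue"], "cat": "tools"}
doc_b = {"id": "2", "name": "Widget", "features": ["small", "blue"], "cat": "tools"}
doc_c = {"id": "3", "name": "Widget", "features": ["small", "blue"], "cat": "toys"}

# Same signature fields, different id: treated as duplicates.
print(md5_signature(doc_a, SIGNATURE_FIELDS) == md5_signature(doc_b, SIGNATURE_FIELDS))
# One signature field differs: not a duplicate.
print(md5_signature(doc_a, SIGNATURE_FIELDS) == md5_signature(doc_c, SIGNATURE_FIELDS))
```

Because the id is not among the signature fields, two documents with different ids but the same name, features and cat values get the same signature.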


Configuration Options
There are two places in Solr to configure de-duplication: in solrconfig.xml and in schema.xml.
In solrconfig.xml


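<updateRequestProcessorChain name="dedupe">
<processor class="solr.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<str name="signatureField">id</str>
<bool name="overwriteDupes">false</bool>
<str name="fields">name,features,cat</str>
<str name="signatureClass">solr.processor.Lookup3Signature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>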




In schema.xml
If you are using a separate field for storing the signature you must have it indexed:
<field name="signature" type="string" stored="true" indexed="true" multiValued="false" />
Be sure to change your update handlers to use the defined chain, as below:


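<requestHandler name="/update" class="solr.UpdateRequestHandler">
<lst name="defaults">
<str name="update.chain">dedupe</str>
</lst>
...
</requestHandler>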

For the individual parameters, see page 232 of the reference guide.
Content Streams
When Solr RequestHandlers are accessed using path based URLs, the SolrQueryRequest object containing the parameters of the request may also contain a list of ContentStreams containing bulk data for the request. (The name SolrQueryRequest is a bit misleading: it is involved in all requests, regardless of whether it is a query request or an update request.)

In other words, the data you send with a request ends up in content streams.
Stream Sources

RemoteStreaming

Debugging Requests


UIMA Integration
Integration with the Unstructured Information Management Architecture.
UIMA lets you define custom pipelines of Analysis Engines that incrementally add metadata to your documents as annotations.

Configuring UIMA
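This part is too long to copy, so here is a summary:
1. Add the required jars.
2. Define the document fields.
3. Configure the corresponding update processor chain in solrconfig.xml.
4. Enable the configured processor chain in solrconfig.xml.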

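Summary of Section 4:
Introduction to Solr Indexing: an introduction to indexing
Post Tool: uploading content to the index with the post tool
Uploading Data with Index Handlers: updating the index through handlers, in several formats
Uploading Data with Solr Cell using Apache Tika: indexing with Solr Cell
Uploading Structured Data Store Data with the Data Import Handler: an introduction to the Data Import Handler, the data sources it can import from, and its parameters
Updating Parts of Documents: the two ways to update a document
Detecting Languages During Indexing: not read yet
De-Duplication: signatures that mark documents to prevent duplicates in the index
Content Streams: how the data sent with a request becomes content streams
UIMA Integration: custom pipelines for unstructured information management, applied while documents are updated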






















