
Solr Getting Started: Reading Notes on the Official 6.0 Documentation, Part 9 — Section 4: Indexing and Data Operations






Indexing and Basic Data Operations
Introduction to Solr Indexing: An overview of Solr's indexing process.
Post Tool: Information about using post.jar to quickly upload some content to your system.
Uploading Data with Index Handlers: Information about using Solr's Index Handlers to upload XML/XSLT, JSON and CSV data.
Uploading Data with Solr Cell using Apache Tika: Information about using the Solr Cell framework to upload data for indexing.
Uploading Structured Data Store Data with the Data Import Handler: Information about uploading and indexing data from a structured data store.
Updating Parts of Documents: Information about how to use atomic updates and optimistic concurrency with Solr.
Detecting Languages During Indexing: Information about using language identification during the indexing process.
De-Duplication: Information about configuring Solr to mark duplicate documents as they are indexed.
Content Streams: Information about streaming content to Solr Request Handlers.
UIMA Integration: Information about integrating Solr with Apache's Unstructured Information Management Architecture (UIMA). UIMA lets you define custom pipelines of Analysis Engines that incrementally add metadata to your documents as annotations.


Introduction to Solr Indexing

Data sources: CSV, XML, JSON, databases, and rich documents such as PDF, Word and Excel. Ways to get them in: Solr Cell, HTTP requests, and SolrJ.
Post Tool
 
The post script provides a simple way to submit updates; it invokes post.jar under the hood.
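A hedged sketch of using the post tool, assuming the techproducts example collection shipped with the Solr distribution; adjust collection name and paths to your setup:

# index all of the example documents into the techproducts collection
bin/post -c techproducts example/exampledocs/*.xml

# equivalent direct invocation of post.jar
java -Dc=techproducts -jar example/exampledocs/post.jar example/exampledocs/*.xml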
Uploading Data with Index Handlers
Topics covered in this section:
UpdateRequestHandler Configuration
XML Formatted Index Updates
Adding Documents
XML Update Commands
Using curl to Perform Updates
Using XSLT to Transform XML Index Updates
JSON Formatted Index Updates
Solr-Style JSON 
JSON Update Convenience Paths
Transforming and Indexing Custom JSON
CSV Formatted Index Updates
CSV Update Parameters
Indexing Tab-Delimited files
CSV Update Convenience Paths
Nested Child Documents


UpdateRequestHandler Configuration



XML Formatted Index Updates
The <add> command supports these optional attributes:
commitWithin=number — Add the document within the specified number of milliseconds.
overwrite=boolean — Default is true. Indicates if the unique key constraints should be checked to overwrite previous versions of the same document (see below).
boost=float (on <doc>) — Default is 1.0. Sets a boost value for the document. To learn more about boosting, see Searching.
boost=float (on <field>) — Default is 1.0. Sets a boost value for the field.


Simple XML-formatted document submission for indexing — I have used all of these before.
XML Update Commands
Commit and Optimize Operations
The <commit> and <optimize> elements accept these optional attributes (the optimize command merges small index segments, improving query efficiency):
waitSearcher — Default is true. Blocks until a new searcher is opened and registered as the main query searcher, making the changes visible.
expungeDeletes — (commit only) Default is false. Merges segments that have more than 10% deleted docs, expunging them in the process.
maxSegments — (optimize only) Default is 1. Merges the segments down to no more than this number of segments.
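For illustration, a hedged sketch of issuing an explicit commit and optimize with curl, reusing the my_collection name from the curl example further below:

curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" --data-binary '<commit waitSearcher="false"/>'
curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" --data-binary '<optimize maxSegments="2"/>'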

Delete Operations — two kinds of delete are supported, "delete by ID" and "delete by query":
<delete>
  <id>0002166313</id>
  <id>0031745983</id>
  <query>subject:sport</query>
  <query>publisher:penguin</query>
</delete>
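A hedged sketch of posting such a delete with curl (reusing the my_collection name from the curl example below):

curl 'http://localhost:8983/solr/my_collection/update?commit=true' -H "Content-Type: text/xml" --data-binary '<delete><id>0002166313</id><query>publisher:penguin</query></delete>'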



Rollback Operations — roll back all uncommitted changes made since the last hard commit. The rollback command accepts no arguments: <rollback/>

Using curl to Perform Updates
curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" --data-binary '
<add>
  <doc>
    <field name="authors">Patrick Eagar</field>
    <field name="subject">Sports</field>
    <field name="dd">796.35</field>
    <field name="isbn">0002166313</field>
    <field name="yearpub">1982</field>
    <field name="publisher">Collins</field>
  </doc>
</add>'

A response like the following indicates success:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">127</int>
  </lst>
</response>



The status field will be non-zero in case of failure.

A non-zero status in the response indicates failure — does that hold for every type of request?
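It does for the standard handlers: every response carries a responseHeader with a status code, so the same check works for queries and updates alike. A hedged shell sketch of checking it (the collection name and document id here are hypothetical):

# post an update, parse responseHeader.status from the JSON response (0 = success)
status=$(curl -s 'http://localhost:8983/solr/my_collection/update?commit=true&wt=json' \
  -H 'Content-Type: application/json' --data-binary '[{"id":"status-check-1"}]' \
  | python -c 'import json,sys; print(json.load(sys.stdin)["responseHeader"]["status"])')
[ "$status" -eq 0 ] || echo "update failed with status $status"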
JSON Formatted Index Updates

Solr-Style JSON 
JSON formatted update requests may be sent to Solr's /update handler using Content-Type: application/json or Content-Type: text/json.
JSON formatted updates can take 3 basic forms, described in depth below:
A single document to add, expressed as a top level JSON Object. To differentiate this from a set of commands, the json.command=false request parameter is required.
A list of documents to add, expressed as a top level JSON Array containing a JSON Object per document.
A sequence of update commands, expressed as a top level JSON Object (aka: Map).


Adding a Single JSON Document
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/my_collection/update/json/docs' --data-binary '
{
  "id": "1",
  "title": "Doc 1"
}'


Adding Multiple JSON Documents
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/my_collection/update' --data-binary '
[
  {
    "id": "1",
    "title": "Doc 1"
  },
  {
    "id": "2",
    "title": "Doc 2"
  }
]'


Sending JSON Update Commands
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/my_collection/update' --data-binary '
{
  "add": {
    "doc": {
      "id": "DOC1",
      "my_boosted_field": {      /* use a map with boost/value for a boosted field */
        "boost": 2.3,
        "value": "test"
      },
      "my_multivalued_field": [ "aaa", "bbb" ]   /* can use an array for a multi-valued field */
    }
  },
  "add": {
    "commitWithin": 5000,        /* commit this document within 5 seconds */
    "overwrite": false,          /* do not check for existing documents with the same uniqueKey */
    "boost": 3.45,               /* a document boost */
    "doc": {
      "f1": "v1",                /* can use repeated keys for a multi-valued field */
      "f1": "v2"
    }
  },
  "commit": {},
  "optimize": { "waitSearcher": false },
  "delete": { "id": "ID" },        /* delete by ID */
  "delete": { "query": "QUERY" }   /* delete by query */
}'


The example above demonstrates the syntax; real requests usually do not combine all of these commands at once.
Delete by ID:

{ "delete": "myid" }

{ "delete": ["id1","id2"] }

A delete may also carry the expected document version:

{
  "delete": { "id": 50, "_version_": 12345 }
}


JSON Update Convenience Paths
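A hedged sketch of the convenience paths (Solr 6 defaults assumed): /update/json behaves like /update with a JSON content type implied, and /update/json/docs accepts bare documents rather than command objects:

curl 'http://localhost:8983/solr/my_collection/update/json/docs?commit=true' \
  -H 'Content-Type: application/json' \
  --data-binary '{"id":"1","title":"Doc 1"}'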

Transforming and Indexing Custom JSON
Custom, arbitrarily structured JSON can also be transformed and indexed.
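A hedged sketch of the idea, using the split and f (field mapping) parameters of /update/json/docs; the field names and sample data are illustrative, not from the original notes:

curl 'http://localhost:8983/solr/my_collection/update/json/docs?split=/exams&f=first:/first&f=last:/last&f=subject:/exams/subject&f=marks:/exams/marks&commit=true' \
  -H 'Content-Type: application/json' --data-binary '{
  "first": "John",
  "last": "Doe",
  "exams": [
    { "subject": "Maths",   "marks": 90 },
    { "subject": "Biology", "marks": 86 }
  ]
}'

Each element of the exams array becomes its own Solr document, carrying the parent fields mapped by the f parameters.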

Nested Child Documents

Documents can contain nested child documents; for documents with parent/child relationships, block indexing improves performance.

<add>
  <doc>
    <field name="id">1</field>
    <field name="title">Solr adds block join support</field>
    <field name="content_type">parentDocument</field>
    <doc>
      <field name="id">2</field>
      <field name="comments">SolrCloud supports it too!</field>
    </doc>
  </doc>
  <doc>
    <field name="id">3</field>
    <field name="title">New Lucene and Solr release is out</field>
    <field name="content_type">parentDocument</field>
    <doc>
      <field name="id">4</field>
      <field name="comments">Lots of new features</field>
    </doc>
  </doc>
</add>

Note the special _childDocuments_ key needed to indicate the nested documents in JSON:
[
{
"id": "1",
"title": "Solr adds block join support",
"content_type": "parentDocument",
"_childDocuments_": [
{
"id": "2",
"comments": "SolrCloud supports it too!"
}
]
},
{
"id": "3",
"title": "New Lucene and Solr release is out",
"content_type": "parentDocument",
"_childDocuments_": [
{
"id": "4",
"comments": "Lots of new features"
}
]
}
]
Uploading Data with Solr Cell using Apache Tika

The ExtractingRequestHandler (Solr Cell) supports building index documents from many rich file formats; a hedged curl sketch follows the topic list below.
Topics covered in this section:
Key Concepts
Trying out Tika with the Solr techproducts Example
Input Parameters
Order of Operations
Configuring the Solr ExtractingRequestHandler
Indexing Encrypted Documents with the ExtractingUpdateRequestHandler
Examples
Sending Documents to Solr with a POST
Sending Documents to Solr with Solr Cell and SolrJ
Related Topics
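As a hedged example, posting a rich document to the extract handler — this assumes the techproducts example configset (which defines /update/extract) and its bundled sample PDF:

curl 'http://localhost:8983/solr/techproducts/update/extract?literal.id=doc1&commit=true' \
  -F "myfile=@example/exampledocs/solr-word.pdf"

literal.id supplies the uniqueKey for the extracted document; the file is sent as a multipart upload.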



Uploading Structured Data Store Data with the Data Import Handler

Example: example/example-DIH
For more information about the Data Import Handler, see https://wiki.apache.org/solr/DataImportHandler. Topics covered in this section:
Concepts and Terminology
Configuration
Data Import Handler Commands
Property Writer
Data Sources
Entity Processors
Transformers
Special Commands for the Data Import Handler



Concepts and Terminology

Four key concepts around data import: the data source, the entity, the processor (roughly, how data is fetched), and the transformer (how values are converted).
Configuration

Configuring solrconfig.xml

class="org.apache.solr.handler.dataimport.DataImportHandler">

/path/to/my/DIHconfigfile.xml



可以配置多个
Configuring the DIH Configuration File
The configuration file is a template of nested entities; SQL entities can be embedded inside one another, as in the example below.


url="jdbc:hsqldb:./example-DIH/hsqldb/ex" user="sa" password="secret"/>





deltaQuery="select id from item where last_modified >
'${dataimporter.last_index_time}'">


query="select DESCRIPTION from FEATURE where ITEM_ID='${item.ID}'"
deltaQuery="select ITEM_ID from FEATURE where last_modified >
'${dataimporter.last_index_time}'"
parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}">


query="select CATEGORY_ID from item_category where
ITEM_ID='${item.ID}'"
deltaQuery="select ITEM_ID, CATEGORY_ID from item_category where
last_modified > '${dataimporter.last_index_time}'"
parentDeltaQuery="select ID from item where
ID=${item_category.ITEM_ID}">
query="select DESCRIPTION from category where ID =
'${item_category.CATEGORY_ID}'"
deltaQuery="select ID from category where last_modified >
'${dataimporter.last_index_time}'"
parentDeltaQuery="select ITEM_ID, CATEGORY_ID from item_category
where CATEGORY_ID=${category.ID}">







A reload-config command is also supported, which reloads the configuration file directly; if the configuration contains errors, an XML error message is returned.

Request Parameters

Parameters can be set dynamically via the HTTP request and referenced in the configuration in the form ${dataimporter.request.paramname}. Example:
<dataSource driver="org.hsqldb.jdbcDriver"
    url="${dataimporter.request.jdbcurl}"
    user="${dataimporter.request.jdbcuser}"
    password="${dataimporter.request.jdbcpassword}" />



There are two ways to pass these parameters: with the full-import command, or defined in the defaults section in solrconfig.xml. This example shows the parameters with the full-import command:
dataimport?command=full-import&jdbcurl=jdbc:hsqldb:./example-DIH/hsqldb/ex&jdbcuser=sa&jdbcpassword=secret

That is, either on the HTTP request or in the solrconfig.xml configuration.

Data Import Handler Commands

The command parameter accepts the following values:
abort — aborts an ongoing operation
delta-import — incremental import
full-import — full import; the start time of the operation is recorded in conf/dataimport.properties
reload-config — reloads the configuration file; status — reports the current status; show-config — shows the configuration
Request path: http://<host>:<port>/solr/<collection_name>/dataimport?command=xxxx
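A hedged sketch of invoking these commands over HTTP (collection name assumed):

# full import: clear the index first, commit when finished
curl 'http://localhost:8983/solr/my_collection/dataimport?command=full-import&clean=true&commit=true'

# incremental import based on the timestamp stored in conf/dataimport.properties
curl 'http://localhost:8983/solr/my_collection/dataimport?command=delta-import'

# check the progress of a running import
curl 'http://localhost:8983/solr/my_collection/dataimport?command=status'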

Parameters for the full-import Command

Parameters relevant to the full-import command:

clean
commit
debug
entity
optimize
synchronous 

Property Writer — addresses date formatting for delta imports.
The propertyWriter element defines the date format and locale for use with delta queries. It is an optional configuration. Add the element to the DIH configuration file, directly under the dataConfig element.

Example:
<propertyWriter dateFormat="yyyy-MM-dd HH:mm:ss" type="SimplePropertiesWriter"
    directory="data" filename="my_dih.properties" locale="en_US" />

Available parameters:
dateFormat — the date format to use
type — the implementation class used to write the properties
directory — the directory for the properties file
filename — the name of the properties file
locale — the locale


Data Sources
The following kinds of data sources can be used:
ContentStreamDataSource


FieldReaderDataSource


FileDataSource

JdbcDataSource

URLDataSource

Entity Processors
The parameters available to entity processors are listed in the table in the reference guide (page 213).

The SQL Entity Processor

The XPathEntityProcessor


The MailEntityProcessor


The TikaEntityProcessor

The FileListEntityProcessor


LineEntityProcessor


PlainTextEntityProcessor


Transformers





Updating Parts of Documents
There are two approaches to updating documents:
The first is atomic updates.
This approach changes individual field values; under the hood the document is re-indexed to apply the update.
The second approach is known as optimistic concurrency or optimistic locking.
This is an optimistic lock: updates are accepted or rejected based on the document's version number.

The two approaches can be used in combination.

Atomic Updates

set — set or replace a field value
add — add a value to a multi-valued field
remove — remove one or more values from a multi-valued field
removeregex — remove all values matching the given regular expression
inc — increment a numeric field by the given amount


例子:{"id":"mydoc",
"price":10,
"popularity":42,
"categories":["kids"],
"promo_ids":["a123x"],
"tags":["free_to_try","buy_now","clearance","on_sale"]
}


An atomic update request modifying that document:
{"id":"mydoc",
"price":{"set":99},
"popularity":{"inc":20},
"categories":{"add":["toys","games"]},
"promo_ids":{"remove":"a123x"},
"tags":{"remove":["free_to_try","on_sale"]}
}

The resulting document:
{"id":"mydoc",
"price":99,
"popularity":62,
"categories":["kids","toys","games"],
"tags":["buy_now","clearance"]
}
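To actually send the atomic update request shown above, a hedged curl sketch (collection name assumed):

curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/my_collection/update?commit=true' --data-binary '[
  {"id":"mydoc",
   "price":{"set":99},
   "popularity":{"inc":20},
   "categories":{"add":["toys","games"]},
   "promo_ids":{"remove":"a123x"},
   "tags":{"remove":["free_to_try","on_sale"]}}
]'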

Optimistic Concurrency

How the _version_ value supplied with an update decides whether the document is updated or a conflict is reported: If the content in the _version_ field is greater than '1' (i.e., '12345'), then the _version_ in the document must match the _version_ in the index.
If the content in the _version_ field is equal to '1', then the document must simply exist. In this case, no version matching occurs, but if the document does not exist, the updates will be rejected.
If the content in the _version_ field is less than '0' (i.e., '-1'), then the document must not exist. In this case, no version matching occurs, but if the document exists, the updates will be rejected.
If the content in the _version_ field is equal to '0', then it doesn't matter if the versions match or if the document exists or not. If it exists, it will be overwritten; if it does not exist, it will be added.


Example:
$ curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/techproducts/update?versions=true' --data-binary '
[ { "id" : "aaa" },
  { "id" : "bbb" } ]'
{"responseHeader":{"status":0,"QTime":6},
 "adds":["aaa",1498562471222312960,
         "bbb",1498562471225458688]}

$ curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/techproducts/update?_version_=999999&versions=true' --data-binary '
[{ "id" : "aaa",
   "foo_s" : "update attempt with wrong existing version" }]'
{"responseHeader":{"status":409,"QTime":3},
 "error":{"msg":"version conflict for aaa expected=999999 actual=1498562471222312960",
          "code":409}}

$ curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/techproducts/update?_version_=1498562471222312960&versions=true&commit=true' --data-binary '
[{ "id" : "aaa",
   "foo_s" : "update attempt with correct existing version" }]'
{"responseHeader":{"status":0,"QTime":5},
 "adds":["aaa",1498562624496861184]}

$ curl 'http://localhost:8983/solr/techproducts/query?q=*:*&fl=id,_version_'
{
  "responseHeader":{
    "status":0,
    "QTime":5,
    "params":{
      "fl":"id,_version_",
      "q":"*:*"}},
  "response":{"numFound":2,"start":0,"docs":[
    {
      "id":"bbb",
      "_version_":1498562471225458688},
    {
      "id":"aaa",
      "_version_":1498562624496861184}]
  }}



Document Centric Versioning Constraints

In distributed (SolrCloud) scenarios you must use Solr's own _version_ field for version control; in other cases you can also define your own version field, for example:
my_version_l



De-Duplication prevents duplicate copies of documents from being added to the index by computing a signature for each document as it is indexed. Three signature implementations are available:
MD5Signature
Lookup3Signature
TextProfileSignature


Configuration Options
There are two places in Solr to configure de-duplication: in solrconfig.xml and in schema.xml.
In solrconfig.xml

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
In schema.xml
If you are using a separate field for storing the signature, you must have it indexed:
<field name="signatureField" type="string" stored="true" indexed="true" multiValued="false" />
Be sure to change your update handlers to use the defined chain, as below:
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">dedupe</str>
  </lst>
  ...
</requestHandler>
See the table on page 232 of the reference guide for the remaining parameters.
Content Streams
When Solr RequestHandlers are accessed using path based URLs, the SolrQueryRequest object containing the parameters of the request may also contain a list of ContentStreams containing bulk data for the request. (The name SolrQueryRequest is a bit misleading: it is involved in all requests, regardless of whether it is a query request or an update request.)

Request parameters you send can themselves become content streams for the handler.
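A hedged curl sketch (collection name assumed): stream.body works out of the box, while stream.url and stream.file additionally require remote streaming to be enabled in solrconfig.xml:

# send a <commit/> command through the stream.body parameter (%3C/%3E are the encoded angle brackets)
curl 'http://localhost:8983/solr/my_collection/update?stream.body=%3Ccommit/%3E'

# pull a remote XML file into the update handler (only works if remote streaming is enabled)
curl 'http://localhost:8983/solr/my_collection/update?stream.url=http://example.com/docs.xml&commit=true'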
Stream Sources

RemoteStreaming

Debugging Requests


UIMA Integration
Integration with the Unstructured Information Management Architecture (UIMA).
UIMA lets you define custom pipelines of Analysis Engines that incrementally add metadata to your documents as annotations.

Configuring UIMA
The full configuration is too long to copy here; in summary:
1. Add the required UIMA JARs to the classpath.
2. Define the document fields to hold the extracted metadata.
3. Configure the UIMA update request processor chain in solrconfig.xml.
4. Enable the configured processor chain on the update handler in solrconfig.xml.

Summary of Part 4:
Introduction to Solr Indexing: an introduction to indexing.
Post Tool: uploading content with the post tool.
Uploading Data with Index Handlers: updating the index through index handlers, in several formats.
Uploading Data with Solr Cell using Apache Tika: indexing rich documents with Solr Cell.
Uploading Structured Data Store Data with the Data Import Handler: an introduction to the Data Import Handler, the supported data sources, and their import parameters.
Updating Parts of Documents: the two ways of updating parts of documents.
Detecting Languages During Indexing: not covered in these notes.
De-Duplication: marking documents with signatures to prevent duplicate indexing.
Content Streams: turning request parameters into content streams.
UIMA Integration: integrating UIMA analysis pipelines into the update process.






















