Indexing and Basic Data Operations
Introduction to Solr Indexing: An overview of Solr's indexing process.
Post Tool: Information about using post.jar to quickly upload some content to your system.
Uploading Data with Index Handlers: Information about using Solr's Index Handlers to upload XML/XSLT, JSON and CSV data.
Uploading Data with Solr Cell using Apache Tika: Information about using the Solr Cell framework to upload data for indexing.
Uploading Structured Data Store Data with the Data Import Handler: Information about uploading and indexing data from a structured data store.
Updating Parts of Documents: Information about how to use atomic updates and optimistic concurrency with Solr.
Detecting Languages During Indexing: Information about using language identification during the indexing process.
De-Duplication: Information about configuring Solr to mark duplicate documents as they are indexed.
Content Streams: Information about streaming content to Solr Request Handlers.
UIMA Integration: Information about integrating Solr with Apache's Unstructured Information Management Architecture (UIMA). UIMA lets you define custom pipelines of Analysis Engines that incrementally add metadata to your documents as annotations.
Introduction to Solr Indexing
Data sources: CSV, XML, JSON, databases, and rich documents such as PDF, Word, and Excel (handled via Solr Cell). Content can be sent to Solr via HTTP requests or through SolrJ.
Post Tool
The post script provides a simple way to upload content; under the hood it invokes post.jar.
Uploading Data with Index Handlers
Topics covered in this section:
UpdateRequestHandler Configuration
XML Formatted Index Updates: Adding Documents, XML Update Commands, Using curl to Perform Updates, Using XSLT to Transform XML Index Updates
JSON Formatted Index Updates: Solr-Style JSON, JSON Update Convenience Paths, Transforming and Indexing Custom JSON
CSV Formatted Index Updates: CSV Update Parameters, Indexing Tab-Delimited files, CSV Update Convenience Paths
Nested Child Documents
UpdateRequestHandler Configuration
XML Formatted Index Updates

Adding Documents
Optional attributes of the <add> command:
commitWithin (number): add the document within the specified number of milliseconds.
overwrite (boolean): default is true. Indicates if the unique key constraints should be checked to overwrite previous versions of the same document (see below).
For index-time boosting, see Searching.
Simple XML index-document submission; all of this has been used before.

XML Update Commands
Commit and Optimize Operations
waitSearcher: default is true. Blocks until a new searcher is opened and registered as the main query searcher, making the changes visible.
expungeDeletes (commit only): default is false. Merges segments that have more than 10% deleted docs, expunging them in the process.
maxSegments (optimize only): default is 1. Merges the segments down to no more than this number of segments.

Delete Operations
Two kinds of delete: by ID and by query.

Rollback Operations
Rolls back all changes made since the last commit.

Using curl to Perform Updates
curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" --data-binary '…'
The status field will be non-zero in case of failure. Note: a non-zero status always indicates failure; does this apply to all request types?
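Since the curl payload above is elided in these notes, here is a sketch of assembling an XML <add> payload programmatically. The helper name is illustrative; the XML shape (add/doc/field elements, commitWithin and overwrite attributes) follows the standard Solr XML update format.

```python
import xml.etree.ElementTree as ET

def build_add_xml(docs, commit_within=None, overwrite=True):
    """Build a Solr XML update payload: <add><doc><field .../></doc></add>."""
    add = ET.Element("add")
    if commit_within is not None:
        add.set("commitWithin", str(commit_within))
    if not overwrite:
        add.set("overwrite", "false")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

payload = build_add_xml([{"id": "1", "title": "Doc 1"}], commit_within=5000)
# payload can then be POSTed to /update with Content-Type: text/xml
```

A payload built this way would be the --data-binary body of the curl command above.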
JSON Formatted Index Updates

Solr-Style JSON
JSON formatted update requests may be sent to Solr's /update handler using Content-Type: application/json or Content-Type: text/json. JSON formatted updates can take three basic forms, described in depth below:
1. A single document to add, expressed as a top-level JSON object. To differentiate this from a set of commands, the json.command=false request parameter is required.
2. A list of documents to add, expressed as a top-level JSON array containing a JSON object per document.
3. A sequence of update commands, expressed as a top-level JSON object (i.e., a map).
Adding a Single JSON Document
{ "id": "1", "title": "Doc 1" }
Adding Multiple JSON Documents
[ { "id": "1", "title": "Doc 1" },
  { "id": "2", "title": "Doc 2" } ]
Sending JSON Update Commands
{
  "add": {
    "doc": {
      "id": "DOC1",
      "my_boosted_field": {        /* use a map with boost/value for a boosted field */
        "boost": 2.3,
        "value": "test"
      },
      "my_multivalued_field": [ "aaa", "bbb" ]   /* can use an array for a multi-valued field */
    }
  },
  "add": {
    "commitWithin": 5000,          /* commit this document within 5 seconds */
    "overwrite": false,            /* don't check for existing documents with the same uniqueKey */
    "boost": 3.45,                 /* a document boost */
    "doc": {
      "f1": "v1",                  /* can use repeated keys for a multi-valued field */
      "f1": "v2"
    }
  },
  "commit": {},
  "optimize": { "waitSearcher": false },
  "delete": { "id": "ID" },        /* delete by ID */
  "delete": { "query": "QUERY" }   /* delete by query */
}
The above is a demonstration; real-world usage may not look exactly like this.

delete-by-id variants:
{ "delete": "myid" }
{ "delete": ["id1", "id2"] }
{ "delete": { "id": 50, "_version_": 12345 } }

JSON Update Convenience Paths
Transforming and Indexing Custom JSON: custom JSON can be transformed and indexed.
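The delete forms above can be generated client-side; a small sketch (the helper name is illustrative, the JSON shapes are the ones shown in the notes):

```python
import json

def delete_by_id(doc_id, version=None):
    """Build a JSON delete-by-id command body for /update, optionally with
    optimistic concurrency via the _version_ field."""
    if version is None:
        return json.dumps({"delete": doc_id})
    return json.dumps({"delete": {"id": doc_id, "_version_": version}})

simple = delete_by_id("myid")               # {"delete": "myid"}
guarded = delete_by_id(50, version=12345)   # delete only if versions match
```

Note that the command map above uses repeated "add" and "delete" keys, which most JSON libraries (including Python's json module) cannot represent in a plain dict; Solr tolerates the repetition, but a client generating such a body must serialize it manually.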
Nested Child Documents
A document can contain child documents; for documents with parent/child relationships this improves performance (block join support). Note the special _childDocuments_ key needed to indicate the nested documents in JSON.
[
  { "id": "1",
    "title": "Solr adds block join support",
    "content_type": "parentDocument",
    "_childDocuments_": [
      { "id": "2",
        "comments": "SolrCloud supports it too!" }
    ]
  },
  { "id": "3",
    "title": "New Lucene and Solr release is out",
    "content_type": "parentDocument",
    "_childDocuments_": [
      { "id": "4",
        "comments": "Lots of new features" }
    ]
  }
]
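The _childDocuments_ structure can be assembled like this (a sketch; the helper name is illustrative and the sample documents are the ones above):

```python
import json

def with_children(parent, children):
    """Attach nested child documents using Solr's special _childDocuments_ key."""
    doc = dict(parent)
    doc["_childDocuments_"] = list(children)
    return doc

docs = [
    with_children(
        {"id": "1", "title": "Solr adds block join support",
         "content_type": "parentDocument"},
        [{"id": "2", "comments": "SolrCloud supports it too!"}],
    ),
    with_children(
        {"id": "3", "title": "New Lucene and Solr release is out",
         "content_type": "parentDocument"},
        [{"id": "4", "comments": "Lots of new features"}],
    ),
]
payload = json.dumps(docs)   # a top-level JSON array, ready for /update
```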
Uploading Data with Solr Cell using Apache Tika
The ExtractingRequestHandler (Solr Cell) supports creating indexes from many file formats.
Topics covered in this section:
Key Concepts
Trying out Tika with the Solr techproducts Example
Input Parameters
Order of Operations
Configuring the Solr ExtractingRequestHandler
Indexing Encrypted Documents with the ExtractingUpdateRequestHandler
Examples
Sending Documents to Solr with a POST
Sending Documents to Solr with Solr Cell and SolrJ
Related Topics
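As a sketch of the Input Parameters topic: a request to /update/extract typically carries literal.<field>, uprefix, and fmap.<source> parameters (these parameter names are standard Solr Cell parameters; the target field names text and ignored_ are assumptions from the techproducts-style schema). Building such a request URL without contacting a server:

```python
from urllib.parse import urlencode

def extract_request_url(base, doc_id, extract_only=False):
    """Build the query string for a POST to /update/extract (Solr Cell)."""
    params = {
        "literal.id": doc_id,      # set the uniqueKey for the extracted document
        "uprefix": "ignored_",     # prefix for extracted fields unknown to the schema
        "fmap.content": "text",    # map Tika's content field to the text field
        "commit": "true",
    }
    if extract_only:
        params["extractOnly"] = "true"   # return extracted content, don't index it
    return base + "/update/extract?" + urlencode(params)

url = extract_request_url("http://localhost:8983/solr/techproducts", "doc1")
```

The file itself would then be sent as the POST body (e.g. with curl -F or --data-binary).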
Uploading Structured Data Store Data with the Data Import Handler
Example: example/example-DIH. For more information about the Data Import Handler, see https://wiki.apache.org/solr/DataImportHandler.
Topics covered in this section:
Concepts and Terminology
Configuration
Data Import Handler Commands
Property Writer
Data Sources
Entity Processors
Transformers
Special Commands for the Data Import Handler
Concepts and Terminology
Four core concepts for data import: the data source, the entity, the processor, and the transformer.
Configuration

Configuring solrconfig.xml
Multiple DIH request handlers can be configured.

Configuring the DIH Configuration File
The configuration file acts as a template and can embed SQL entities. Nested entities carry query, deltaQuery, and parentDeltaQuery attributes, for example:
deltaQuery="select ITEM_ID from FEATURE where last_modified > '${dataimporter.last_index_time}'"
parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}"
deltaQuery="select ITEM_ID, CATEGORY_ID from item_category where last_modified > '${dataimporter.last_index_time}'"
parentDeltaQuery="select ID from item where ID=${item_category.ITEM_ID}"
deltaQuery="select ID from category where last_modified > '${dataimporter.last_index_time}'"
parentDeltaQuery="select ITEM_ID, CATEGORY_ID from item_category where CATEGORY_ID=${category.ID}"
A reload-config command is also supported: the configuration file can be reloaded directly, and if it is invalid the XML error message is reported.

Request Parameters
Parameters can be set dynamically through the HTTP request, using the format ${dataimporter.request.paramname}.
Example: password=${dataimporter.request.jdbcpassword}
These parameters can be passed with the full-import command or defined in solrconfig.xml. This example shows the parameters with the full-import command:
dataimport?command=full-import&jdbcurl=jdbc:hsqldb:./example-DIH/hsqldb/ex&jdbcuser=sa&jdbcpassword=secret

Data Import Handler Commands
abort: aborts an ongoing operation.
delta-import: incremental import.
full-import: full import; records an operation timestamp in conf/dataimport.properties.
reload-config: reloads the configuration file.
status: shows the current status.
show-config: shows the configuration file.
Commands are issued via the request path: http://…

Parameters for the full-import Command
clean, commit, debug, entity, optimize, synchronous.

Property Writer
Addresses date-format issues. The propertyWriter element defines the date format and locale for use with delta queries. It is an optional configuration. Add the element to the DIH configuration file, directly under the dataConfig element.
Available attributes: dateFormat (date format), type (writer type), directory, filename, locale.
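The deltaQuery/parentDeltaQuery fragments above come from the item/feature/category example in the reference guide; a sketch of how they fit together in a DIH configuration file (the select column lists and the dataSource settings here are assumptions, not copied from the source):

```xml
<dataConfig>
  <dataSource driver="org.hsqldb.jdbcDriver"
              url="jdbc:hsqldb:./example-DIH/hsqldb/ex" user="sa"/>
  <document>
    <entity name="item" query="select * from item"
            deltaQuery="select ID from item where last_modified > '${dataimporter.last_index_time}'">
      <entity name="feature"
              query="select DESCRIPTION from FEATURE where ITEM_ID='${item.ID}'"
              deltaQuery="select ITEM_ID from FEATURE where last_modified > '${dataimporter.last_index_time}'"
              parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}"/>
      <entity name="item_category"
              query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'"
              deltaQuery="select ITEM_ID, CATEGORY_ID from item_category where last_modified > '${dataimporter.last_index_time}'"
              parentDeltaQuery="select ID from item where ID=${item_category.ITEM_ID}">
        <entity name="category"
                query="select DESCRIPTION from category where ID='${item_category.CATEGORY_ID}'"
                deltaQuery="select ID from category where last_modified > '${dataimporter.last_index_time}'"
                parentDeltaQuery="select ITEM_ID, CATEGORY_ID from item_category where CATEGORY_ID=${category.ID}"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The parentDeltaQuery on each child entity tells DIH which parent rows to re-index when a child row changes.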
Data Sources
Any of the following data sources may be used: ContentStreamDataSource, FieldReaderDataSource, FileDataSource, JdbcDataSource, URLDataSource.

Entity Processors
The parameters available to entity processors are listed in the table on page 213 of the reference guide. Processors:
The SQL Entity Processor, The XPathEntityProcessor, The MailEntityProcessor, The TikaEntityProcessor, The FileListEntityProcessor, LineEntityProcessor, PlainTextEntityProcessor.

Transformers
Updating Parts of Documents
There are two approaches to updating parts of documents:
The first is atomic updates: individual fields are changed, and Solr completes the update by re-indexing the document internally.
The second approach is known as optimistic concurrency or optimistic locking: updates are controlled through a version number.
The two approaches can be used together.
Atomic Updates
Supported operations: set, add, remove, removeregex, inc.
Existing document:
{ "id":"mydoc", "price":10, "popularity":42, "categories":["kids"], "promo_ids":["a123x"], "tags":["free_to_try","buy_now","clearance","on_sale"] }
Update request:
{ "id":"mydoc", "price":{"set":99}, "popularity":{"inc":20}, "categories":{"add":["toys","games"]}, "promo_ids":{"remove":"a123x"}, "tags":{"remove":["free_to_try","on_sale"]} }
Result:
{ "id":"mydoc", "price":99, "popularity":62, "categories":["kids","toys","games"], "tags":["buy_now","clearance"] }
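The before/after example above can be checked mechanically. The following is a client-side sketch of the atomic-update semantics (Solr itself applies these operations server-side while re-indexing; the helper name is illustrative):

```python
import re

def apply_atomic_update(doc, update):
    """Simulate Solr atomic-update semantics (set/add/remove/removeregex/inc)
    on a plain dict."""
    result = dict(doc)
    for field, op in update.items():
        if not isinstance(op, dict):            # uniqueKey and plain values pass through
            result[field] = op
            continue
        for name, value in op.items():
            current = result.get(field)
            if name == "set":
                result[field] = value
            elif name == "inc":
                result[field] = current + value
            elif name == "add":
                vals = current if isinstance(current, list) else ([] if current is None else [current])
                result[field] = vals + (value if isinstance(value, list) else [value])
            elif name == "remove":
                rm = value if isinstance(value, list) else [value]
                kept = [v for v in current if v not in rm]
                if kept:
                    result[field] = kept
                else:                            # field disappears when emptied
                    result.pop(field, None)
            elif name == "removeregex":
                pats = value if isinstance(value, list) else [value]
                result[field] = [v for v in current
                                 if not any(re.fullmatch(p, v) for p in pats)]
    return result

doc = {"id": "mydoc", "price": 10, "popularity": 42, "categories": ["kids"],
       "promo_ids": ["a123x"],
       "tags": ["free_to_try", "buy_now", "clearance", "on_sale"]}
update = {"id": "mydoc", "price": {"set": 99}, "popularity": {"inc": 20},
          "categories": {"add": ["toys", "games"]},
          "promo_ids": {"remove": "a123x"},
          "tags": {"remove": ["free_to_try", "on_sale"]}}
result = apply_atomic_update(doc, update)
```

Running this reproduces the Result document shown above.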
Optimistic Concurrency
How the _version_ field determines when a document is updated and when a conflict is reported:
If the content in the _version_ field is greater than '1' (i.e., '12345'), then the _version_ in the document must match the _version_ in the index.
If the content in the _version_ field is equal to '1', then the document must simply exist. In this case, no version matching occurs, but if the document does not exist, the update will be rejected.
If the content in the _version_ field is less than '0' (i.e., '-1'), then the document must not exist. In this case, no version matching occurs, but if the document exists, the update will be rejected.
If the content in the _version_ field is equal to '0', then it doesn't matter if the versions match or if the document exists or not. If it exists, it will be overwritten; if it does not exist, it will be added.

Examples:
$ curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/techproducts/update?versions=true' --data-binary '
[ { "id" : "aaa" },
  { "id" : "bbb" } ]'
{"responseHeader":{"status":0,"QTime":6},
 "adds":["aaa",1498562471222312960,
         "bbb",1498562471225458688]}

$ curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/techproducts/update?_version_=999999&versions=true' --data-binary '
[{ "id" : "aaa",
   "foo_s" : "update attempt with wrong existing version" }]'
{"responseHeader":{"status":409,"QTime":3},
 "error":{"msg":"version conflict for aaa expected=999999 actual=1498562471222312960",
          "code":409}}

$ curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/techproducts/update?_version_=1498562471222312960&versions=true&commit=true' --data-binary '
[{ "id" : "aaa",
   "foo_s" : "update attempt with correct existing version" }]'
{"responseHeader":{"status":0,"QTime":5},
 "adds":["aaa",1498562624496861184]}

$ curl 'http://localhost:8983/solr/techproducts/query?q=*:*&fl=id,_version_'
{
  "responseHeader":{
    "status":0,
    "QTime":5,
    "params":{
      "fl":"id,_version_",
      "q":"*:*"}},
  "response":{"numFound":2,"start":0,"docs":[
      { "id":"bbb",
        "_version_":1498562471225458688},
      { "id":"aaa",
        "_version_":1498562624496861184}]
  }}
Document Centric Versioning Constraints
In distributed setups, Solr's own _version_ field must be used for the version control described above; in other cases an application can also define its own version field.
De-Duplication
Signature (fingerprint) techniques let Solr mark or prevent duplicate copies of documents as they are indexed. Three signature implementations are available: MD5Signature, Lookup3Signature, TextProfileSignature.
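The idea behind MD5Signature can be sketched as hashing the concatenated values of the configured fields; two documents with the same values in those fields get the same signature. (Solr's exact byte-level concatenation may differ; this only illustrates the concept, and the field names are assumptions.)

```python
import hashlib

def md5_signature(doc, fields):
    """Compute a duplicate-detection key from the chosen fields of a document."""
    joined = "".join(str(doc.get(f, "")) for f in sorted(fields))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

a = {"id": "1", "name": "Widget", "features": "eight colors"}
b = {"id": "2", "name": "Widget", "features": "eight colors"}  # same content, new id
c = {"id": "3", "name": "Gadget", "features": "one color"}
```

Because the id is excluded from the signed fields, a and b are detected as duplicates even though their unique keys differ.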
Configuration Options
There are two places in Solr to configure de-duplication: in solrconfig.xml and in schema.xml.
In solrconfig.xml, define an update request processor chain that includes the SignatureUpdateProcessorFactory.
In schema.xml, if you are using a separate field for storing the signature, you must have it indexed.
Be sure to change your update handlers to use the defined chain.
For the full parameter list, see page 232 of the reference guide.
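A sketch of the solrconfig.xml side, following the shape of the standard de-duplication example (the chain name, signature field, and fields list here are assumptions):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<!-- point the update handler at the defined chain -->
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">dedupe</str>
  </lst>
</requestHandler>
```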
Content Streams
When Solr RequestHandlers are accessed using path-based URLs, the SolrQueryRequest object containing the parameters of the request may also contain a list of ContentStreams containing bulk data for the request. (The name SolrQueryRequest is a bit misleading: it is involved in all requests, regardless of whether it is a query request or an update request.) Request parameters you send can themselves be turned into content streams.
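Bulk data can reach a handler through the stream.body parameter (stream.url and stream.file are the other content-stream sources, and require remote streaming to be enabled). A sketch of building such a request URL without contacting a server (the helper name is illustrative):

```python
from urllib.parse import urlencode

def build_stream_body_url(base_url, payload):
    """Pass bulk request data inline via the stream.body parameter."""
    return base_url + "?" + urlencode({"stream.body": payload, "commit": "true"})

url = build_stream_body_url(
    "http://localhost:8983/solr/my_collection/update",
    "<delete><query>*:*</query></delete>",   # the bulk body, URL-encoded
)
```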
Stream Sources

RemoteStreaming

Debugging Requests
UIMA Integration
Integration with Apache's Unstructured Information Management Architecture (UIMA). UIMA lets you define custom pipelines of Analysis Engines that incrementally add metadata to your documents as annotations.
Configuring UIMA
Too long to copy in full; in summary:
1. Add the required UIMA jar files.
2. Define the document fields that will hold the extracted metadata.
3. Configure the UIMA update request processor chain in solrconfig.xml.
4. Enable the configured processor chain in solrconfig.xml.
Summary of Part 4:
Introduction to Solr Indexing: introduction to indexing.
Post Tool: uploading content with the Post tool.
Uploading Data with Index Handlers: updating the index through index handlers, in several formats.
Uploading Data with Solr Cell using Apache Tika: index processing with Solr Cell.
Uploading Structured Data Store Data with the Data Import Handler: introduction to the Data Import Handler, the supported data sources, and their parameters.
Updating Parts of Documents: the two approaches to updating documents.
Detecting Languages During Indexing: not reviewed.
De-Duplication: signature marking to prevent duplicate index entries.
Content Streams: turning request parameters into content streams.
UIMA Integration: custom metadata-extraction pipelines (UIMA) applied during updates.