Traversing the document data in ES with Lucene

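ES 1.7.5 uses Lucene 4.10. Studying its data structures makes the layout of nested documents clear. The data of one shard (one directory) is a Lucene index, so that directory can be read directly with the Lucene API. In fact, a Lucene index contains not only inverted lists but also a sequential structure, so there is a way to retrieve every document under the directory and complete a full traversal.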

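In Lucene 4.10 the documents in the sequential structure are organized in a regular way: doc ids start at 0 and increase, with the child documents of a main document placed first, followed by the main document they belong to. If a child document's field is indexed with store set to true, its value can be read at the child's own doc id; otherwise it has to be taken from the source field. How to parse the source field is not covered in this article.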
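The ordering described above (children first, then their main document) can be illustrated without Lucene at all. The following is only a minimal sketch under assumptions: the class `NestedBlockLayout`, the method `groupByParent`, and the `isParent` flag array are hypothetical names introduced for illustration, and how one decides whether a given doc id holds a main document is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene or ES.
class NestedBlockLayout {

    // isParent[docId] is true when that doc id holds a main (parent) document.
    // Returns parent doc id -> doc ids of the children stored just before it.
    static Map<Integer, List<Integer>> groupByParent(boolean[] isParent) {
        Map<Integer, List<Integer>> blocks = new LinkedHashMap<>();
        List<Integer> pending = new ArrayList<>();
        for (int docId = 0; docId < isParent.length; docId++) {
            if (isParent[docId]) {
                // The parent closes the block of children collected so far.
                blocks.put(docId, pending);
                pending = new ArrayList<>();
            } else {
                pending.add(docId);
            }
        }
        // Children with no following parent are ignored in this sketch.
        return blocks;
    }

    public static void main(String[] args) {
        // Doc ids 0 and 1 are children of 2; 3 has no children; 4 is a child of 5.
        boolean[] isParent = {false, false, true, true, false, true};
        System.out.println(groupByParent(isParent)); // prints {2=[0, 1], 3=[], 5=[4]}
    }
}
```

Walking the doc ids once in increasing order is enough, because a main document always comes after all of its children.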

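First get the fields, then use one term field (_uid) to reach all documents (every main document has a unique uid). With the ordering described above, the information for every document can then be retrieved and processed. If documents have been deleted, the set of deletions has to be loaded as well, and the doc ids filtered against it so that deleted records are dropped.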
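The deletion filter can be sketched on its own. This is an illustration under assumptions rather than the article's implementation: `java.util.BitSet` stands in for Lucene's `Bits`, and `LiveDocsFilter` and `liveDocIds` are hypothetical names. It follows Lucene's convention that a set bit means the document is still live and that a missing (null) set means nothing was deleted. In Lucene the live-docs bits are kept per segment, so the doc ids here should be read as segment-local.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical helper; BitSet stands in for Lucene's Bits.
class LiveDocsFilter {

    // Returns the doc ids in [0, maxDoc) that have not been deleted.
    // liveDocs == null means the segment has no deletions.
    static List<Integer> liveDocIds(int maxDoc, BitSet liveDocs) {
        List<Integer> ids = new ArrayList<>();
        for (int docId = 0; docId < maxDoc; docId++) {
            if (liveDocs == null || liveDocs.get(docId)) {
                ids.add(docId);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        BitSet live = new BitSet(5);
        live.set(0, 5);   // mark doc ids 0..4 as live
        live.clear(2);    // doc id 2 was deleted
        System.out.println(liveDocIds(5, live)); // prints [0, 1, 3, 4]
        System.out.println(liveDocIds(3, null)); // prints [0, 1, 2]
    }
}
```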

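import java.io.File;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.codecs.lucene40.Lucene40LiveDocsFormat;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.util.Bits;

// Open the shard directory as a Lucene index.
Directory directory = FSDirectory.open(new File(path));
Lucene40LiveDocsFormat lldf = new Lucene40LiveDocsFormat();
IOContext context = IOContext.READ;

// Read the segment infos and load the live-docs bits of every segment that has deletions.
SegmentInfos sifs = new SegmentInfos();
sifs.read(directory);
Iterator<SegmentCommitInfo> its = sifs.iterator();
List<Bits> bitss = new ArrayList<Bits>();
while (its.hasNext()) {
    SegmentCommitInfo info = its.next();
    if (info.hasDeletions())
        bitss.add(lldf.readLiveDocs(directory, info, context));
}

// directory.
IndexReader r = IndexReader.open(directory);
IndexSearcher is = new IndexSearcher(r);

// List every field name in the index.
Fields fields = MultiFields.getFields(r);
System.out.println(fields.size());
Iterator<String> it = fields.iterator();
while (it.hasNext()) {
    System.out.println(it.next());
}
System.out.println(fields.terms("commus.interactionIdx").getDocCount());

// Number of documents that carry a _uid, i.e. the main documents.
int count = fields.terms("_uid").getDocCount();
System.out.println(count);
for (int i = 0; i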