
Usage of the org.apache.lucene.analysis.TokenStream Class, with Code Examples


This article collects code examples for the Java class org.apache.lucene.analysis.TokenStream and shows how the class is used in practice. The examples are drawn from selected projects on GitHub, Stack Overflow, Maven, and similar platforms, so they are reasonably representative and should be useful as a reference. Details of the TokenStream class follow:
Package: org.apache.lucene.analysis.TokenStream
Class: TokenStream

TokenStream Introduction

A TokenStream enumerates the sequence of tokens, either from Fields of a Document or from query text.

This is an abstract class; concrete subclasses are:

  • Tokenizer, a TokenStream whose input is a Reader; and
  • TokenFilter, a TokenStream whose input is another TokenStream.

A new TokenStream API has been introduced with Lucene 2.9. This API has moved from being Token-based to Attribute-based. While Token still exists in 2.9 as a convenience class, the preferred way to store the information of a Token is to use AttributeImpls.

TokenStream now extends AttributeSource, which provides access to all of the token Attributes for the TokenStream. Note that only one instance per AttributeImpl is created and reused for every token. This approach reduces object creation and allows local caching of references to the AttributeImpls. See #incrementToken() for further details.
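To see the single-instance behaviour concretely, here is a minimal sketch (the analyzer variable is assumed to be any Analyzer you already have, e.g. a StandardAnalyzer; it is not part of the Javadoc above):

TokenStream ts = analyzer.tokenStream("field", "one two three");
CharTermAttribute first = ts.addAttribute(CharTermAttribute.class);
CharTermAttribute second = ts.addAttribute(CharTermAttribute.class);
System.out.println(first == second); // true: a single shared AttributeImpl, updated in place for every token
ts.close();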

The workflow of the new TokenStream API is as follows:

  1. Instantiation of TokenStream/TokenFilters which add/get attributes to/from the AttributeSource.
  2. The consumer calls TokenStream#reset().
  3. The consumer retrieves attributes from the stream and stores local references to all attributes it wants to access.
  4. The consumer calls #incrementToken() until it returns false, consuming the attributes after each call.
  5. The consumer calls #end() so that any end-of-stream operations can be performed.
  6. The consumer calls #close() to release any resource when finished using the TokenStream.

To make sure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in #incrementToken(). A minimal end-to-end sketch of these steps follows below.
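For illustration, here is a minimal sketch of the six steps above. It assumes a StandardAnalyzer from the lucene-analyzers-common module and the Lucene 5+ style of construction (older versions need a Version argument); any other Analyzer works the same way.

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class TokenStreamWorkflowDemo {
    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = new StandardAnalyzer();
                // step 1: instantiation; the tokenizer and filters add their attributes here
                TokenStream ts = analyzer.tokenStream("body", "The quick brown fox")) {
            // step 3: store local references to the attributes we want to read
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
            ts.reset();                                   // step 2
            while (ts.incrementToken()) {                 // step 4
                System.out.println(term + " [" + offset.startOffset() + "-" + offset.endOffset() + "]");
            }
            ts.end();                                     // step 5
        }                                                 // step 6: close() via try-with-resources
    }
}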

You can find some example code for the new API in the analysis package level Javadoc.

Sometimes it is desirable to capture the current state of a TokenStream, e.g., for buffering purposes (see CachingTokenFilter, TeeSinkTokenFilter). For this use case AttributeSource#captureState and AttributeSource#restoreState can be used.
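As a hedged sketch of how captureState/restoreState are typically used, the following hypothetical filter emits every token twice, capturing the attribute state and replaying it on the next call. The class name and behaviour are illustrative only, not taken from Lucene itself.

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class RepeatTokenFilter extends TokenFilter {

    private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
    private State saved; // snapshot of all attributes, pending replay

    public RepeatTokenFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (saved != null) {
            restoreState(saved);                // replay the captured token
            posIncrAtt.setPositionIncrement(0); // the duplicate occupies the same position
            saved = null;
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        saved = captureState();                 // remember this token for the next call
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        saved = null;
    }
}

Note that the class is declared final, which matches the requirement described in the next paragraph.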

The TokenStream API in Lucene is based on the decorator pattern. Therefore all non-abstract subclasses must be final or at least have a final implementation of #incrementToken()! This is checked when Java assertions are enabled.
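To illustrate the decorator idea, here is a small hedged sketch that wires a Tokenizer and a TokenFilter into a chain and consumes it. The package locations assume a recent Lucene release (roughly 7.x or later); older versions keep some of these classes in other packages or require a Version argument.

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class DecoratorChainDemo {
    public static void main(String[] args) throws IOException {
        Tokenizer source = new WhitespaceTokenizer();            // a TokenStream reading from a Reader
        source.setReader(new StringReader("The QUICK brown Fox"));
        try (TokenStream chain = new LowerCaseFilter(source)) {  // a TokenFilter decorating another TokenStream
            CharTermAttribute term = chain.addAttribute(CharTermAttribute.class);
            chain.reset();
            while (chain.incrementToken()) {
                System.out.println(term.toString());
            }
            chain.end();
        }
    }
}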

Code Examples

Code example source: stackoverflow.com

public final class LuceneUtil {

    private LuceneUtil() {}

    public static List<String> tokenizeString(Analyzer analyzer, String string) {
        List<String> result = new ArrayList<>();
        try (TokenStream stream = analyzer.tokenStream(null, new StringReader(string))) {
            CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                result.add(termAtt.toString());
            }
            stream.end();
        } catch (IOException e) {
            // not thrown because we're reading from a StringReader
            throw new RuntimeException(e);
        }
        return result;
    }
}

Code example source: stackoverflow.com

TokenStream stream = analyzer.tokenStream(null, new StringReader(text));
CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    System.out.println(cattr.toString());
}
stream.end();
stream.close();

Code example source: stackoverflow.com

TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
tokenStream.reset();
while (tokenStream.incrementToken()) {
    int startOffset = offsetAttribute.startOffset();
    int endOffset = offsetAttribute.endOffset();
    String term = charTermAttribute.toString();
}

Code example source: stackoverflow.com

// Pre-Lucene 4.0 API: TermAttribute was later replaced by CharTermAttribute.
TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);
while (tokenStream.incrementToken()) {
    int startOffset = offsetAttribute.startOffset();
    int endOffset = offsetAttribute.endOffset();
    String term = termAttribute.term();
}

Code example source: sanluan/PublicCMS

/**
 * @param text
 * @return
 */
public Set<String> getToken(String text) {
    Set<String> list = new LinkedHashSet<>();
    if (CommonUtils.notEmpty(text)) {
        try (StringReader stringReader = new StringReader(text);
                TokenStream tokenStream = dao.getAnalyzer().tokenStream(CommonConstants.BLANK, stringReader)) {
            CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                list.add(charTermAttribute.toString());
            }
            tokenStream.end();
            return list;
        } catch (IOException e) {
            return list;
        }
    }
    return list;
}

Code example source: org.apache.lucene/lucene-core

try (Reader reader = new StringReader(text)) {
    Reader filterReader = initReaderForNormalization(fieldName, reader);
    char[] buffer = new char[64];
    StringBuilder builder = new StringBuilder();
    for (;;) {
        final int read = filterReader.read(buffer, 0, buffer.length);
        if (read == -1) {
            break;
        }
        builder.append(buffer, 0, read);
    }
    final String filteredText = builder.toString();
    final AttributeFactory attributeFactory = attributeFactory(fieldName);
    try (TokenStream ts = normalize(fieldName,
            new StringTokenStream(attributeFactory, filteredText, text.length()))) {
        final TermToBytesRefAttribute termAtt = ts.addAttribute(TermToBytesRefAttribute.class);
        ts.reset();
        if (ts.incrementToken() == false) {
            throw new IllegalStateException("The normalization token stream is "
                + "expected to produce exactly 1 token, but got 0 for analyzer "
                + this + " and input \"" + text + "\"");
        }
        final BytesRef term = BytesRef.deepCopyOf(termAtt.getBytesRef());
        if (ts.incrementToken()) {
            throw new IllegalStateException("The normalization token stream is "
                + "expected to produce exactly 1 token, but got 2+ for analyzer "
                + this + " and input \"" + text + "\"");
        }
        ts.end();
        return term;
    }
}

Code example source: tjake/Solandra

tokReader = new StringReader(field.stringValue());
tokens = analyzer.reusableTokenStream(field.name(), tokReader);
if (position > 0)
    position += analyzer.getPositionIncrementGap(field.name());
tokens.reset(); // reset the TokenStream to the first token
offsetAttribute = (OffsetAttribute) tokens.addAttribute(OffsetAttribute.class);
posIncrAttribute = (PositionIncrementAttribute) tokens.addAttribute(PositionIncrementAttribute.class);
CharTermAttribute termAttribute = (CharTermAttribute) tokens.addAttribute(CharTermAttribute.class);
while (tokens.incrementToken()) {
    position += (posIncrAttribute.getPositionIncrement() - 1);
    offsetVector.add(lastOffset + offsetAttribute.startOffset());
    offsetVector.add(lastOffset + offsetAttribute.endOffset());
}

Code example source: stackoverflow.com

Reader reader = new StringReader("This is a test string");
TokenStream tokenizer = new StandardTokenizer(Version.LUCENE_36, reader);
tokenizer = new ShingleFilter(tokenizer, 1, 3);
CharTermAttribute charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);
while (tokenizer.incrementToken()) {
    String token = charTermAttribute.toString();
    // Do something
}

Code example source: linkedin/indextank-engine

public Iterator parseDocumentField(String fieldName, String content) {
    final TokenStream tkstream = analyzer.tokenStream(fieldName, new StringReader(content));
    final TermAttribute termAtt = tkstream.addAttribute(TermAttribute.class);
    final PositionIncrementAttribute posIncrAttribute = tkstream.addAttribute(PositionIncrementAttribute.class);
    final OffsetAttribute offsetAtt = tkstream.addAttribute(OffsetAttribute.class);
    // ... (iteration over the stream elided in this excerpt)
}

Code example source: org.apache.lucene/lucene-core

try (TokenStream stream = tokenStream = field.tokenStream(docState.analyzer, tokenStream)) {
    stream.reset();
    invertState.setAttributeSource(stream);
    termsHashPerField.start(field, first);
    while (stream.incrementToken()) {
        int posIncr = invertState.posIncrAttribute.getPositionIncrement();
        invertState.position += posIncr;
        if (invertState.position < invertState.lastPosition) {
            // ... (error handling elided: positions must not go backwards)
        }
        int startOffset = invertState.offset + invertState.offsetAttribute.startOffset();
        int endOffset = invertState.offset + invertState.offsetAttribute.endOffset();
        if (startOffset < invertState.lastStartOffset || endOffset < startOffset) {
            throw new IllegalArgumentException("startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards ");
        }
        // ... (per-term indexing elided in this excerpt)
    }
    stream.end();
    invertState.position += invertState.posIncrAttribute.getPositionIncrement();
    invertState.offset += invertState.offsetAttribute.endOffset();
}
invertState.position += docState.analyzer.getPositionIncrementGap(fieldInfo.name);
invertState.offset += docState.analyzer.getOffsetGap(fieldInfo.name);

Code example source: org.apache.lucene/lucene-analyzers-common

@Override
public final boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
        if (!keywordAttr.isKeyword() && termAttribute.length() > length) {
            termAttribute.setLength(length);
        }
        return true;
    } else {
        return false;
    }
}

Code example source: org.apache.lucene/lucene-analyzers

private Token getNextInputToken(Token token) throws IOException {
    if (!input.incrementToken()) return null;
    token.copyBuffer(in_termAtt.buffer(), 0, in_termAtt.length());
    token.setPositionIncrement(in_posIncrAtt.getPositionIncrement());
    token.setFlags(in_flagsAtt.getFlags());
    token.setOffset(in_offsetAtt.startOffset(), in_offsetAtt.endOffset());
    token.setType(in_typeAtt.type());
    token.setPayload(in_payloadAtt.getPayload());
    return token;
}

Code example source: oracle/opengrok

private SToken[] getTokens(String text) throws IOException {
    //FIXME somehow integrate below cycle to getSummary to save the cloning and memory,
    //also creating Tokens is suboptimal with 3.0.0 , this whole class could be replaced by highlighter
    ArrayList<SToken> result = new ArrayList<>();
    try (TokenStream ts = analyzer.tokenStream("full", text)) {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            SToken t = new SToken(term.buffer(), 0, term.length(), offset.startOffset(), offset.endOffset());
            result.add(t);
        }
        ts.end();
    }
    return result.toArray(new SToken[result.size()]);
}

Code example source: org.apache.lucene/lucene-analyzers-common

uniqueTerms = new CharArraySet(8, false);
int outputTokenSize = 0;
while (input.incrementToken()) {
    if (outputTokenSize > maxOutputTokenSize) {
        continue;
    }
    final char term[] = termAttribute.buffer();
    final int length = termAttribute.length();
    // ... (accumulation of unique terms elided in this excerpt)
}
input.end();
inputEnded = true;
// ... (construction of the single "fingerprint" token elided)
offsetAtt.setOffset(0, offsetAtt.endOffset());
posLenAtt.setPositionLength(1);
posIncrAtt.setPositionIncrement(1);
typeAtt.setType("fingerprint");
termAttribute.setEmpty();
return false;

Code example source: jeremylong/DependencyCheck

String[] parts;
skipCounter = 0;
while (input.incrementToken()) {
    final String text = new String(termAtt.buffer(), 0, termAtt.length());
    if (text.isEmpty()) {
        return true;
    }
    // ... (splitting of the term into parts elided in this excerpt)
    if (parts.length == 0) { // condition reconstructed; the original excerpt elides it
        skipCounter += posIncrAttribute.getPositionIncrement();
    } else {
        if (skipCounter != 0) {
            posIncrAttribute.setPositionIncrement(posIncrAttribute.getPositionIncrement() + skipCounter);
        }
        // ...
    }
}

Code example source: org.apache.lucene/lucene-core

final TermToBytesRefAttribute termBytesAtt = in.addAttribute(TermToBytesRefAttribute.class);
final PositionIncrementAttribute posIncAtt = in.addAttribute(PositionIncrementAttribute.class);
final PositionLengthAttribute posLengthAtt = in.addAttribute(PositionLengthAttribute.class);
in.reset();
while (in.incrementToken()) {
    int currentIncr = posIncAtt.getPositionIncrement();
    if (pos == -1 && currentIncr < 1) {
        throw new IllegalStateException("Malformed TokenStream, start token can't have increment less than 1");
    }
    // ... (automaton construction elided in this excerpt)
}
in.end();
if (state != -1) {
    builder.setAccept(state, true);
}

Code example source: org.apache.lucene/lucene-analyzers-common

try (TokenStream ts = analyzer.tokenStream("", text)) {
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posIncAtt = ts.addAttribute(PositionIncrementAttribute.class);
    ts.reset();
    reuse.clear();
    while (ts.incrementToken()) {
        int length = termAtt.length();
        if (length == 0) {
            throw new IllegalArgumentException("term: " + text + " analyzed to a zero-length token");
        }
        if (posIncAtt.getPositionIncrement() != 1) {
            throw new IllegalArgumentException("term: " + text + " analyzed to a token (" + termAtt +
                ") with position increment != 1 (got: " + posIncAtt.getPositionIncrement() + ")");
        }
        reuse.grow(reuse.length() + length + 1); // current + word + separator
        int end = reuse.length();
        if (reuse.length() > 0) {
            reuse.setCharAt(end++, SynonymMap.WORD_SEPARATOR);
            reuse.setLength(reuse.length() + 1);
        }
        System.arraycopy(termAtt.buffer(), 0, reuse.chars(), end, length);
        reuse.setLength(reuse.length() + length);
    }
    ts.end();
}

Code example source: synhershko/HebMorph

private ArrayList<Data> analyze(Analyzer analyzer1) throws IOException {
    ArrayList<Data> results = new ArrayList<>(50);
    TokenStream ts = analyzer1.tokenStream("foo", text);
    ts.reset();
    while (ts.incrementToken()) {
        Data data = new Data();
        OffsetAttribute offsetAttribute = ts.getAttribute(OffsetAttribute.class);
        data.startOffset = offsetAttribute.startOffset();
        data.endOffset = offsetAttribute.endOffset();
        data.positionLength = ts.getAttribute(PositionLengthAttribute.class).getPositionLength();
        data.positionIncGap = ts.getAttribute(PositionIncrementAttribute.class).getPositionIncrement();
        data.tokenType = ts.getAttribute(HebrewTokenTypeAttribute.class).getType().toString();
        data.term = ts.getAttribute(CharTermAttribute.class).toString();
        if (ts.getAttribute(KeywordAttribute.class) != null)
            data.isKeyword = ts.getAttribute(KeywordAttribute.class).isKeyword();
        // System.out.println(data.term + " " + data.tokenType);
        results.add(data);
    }
    ts.close();
    return results;
}

Code example source: org.apache.lucene/lucene-analyzers-common

@Override
public boolean incrementToken() throws IOException {
    while (!exhausted && input.incrementToken()) {
        char[] term = termAttribute.buffer();
        int termLength = termAttribute.length();
        lastEndOffset = offsetAttribute.endOffset();
        // ... (remainder of the filter elided in this excerpt)

Code example source: org.apache.lucene/lucene-core

/**
 * Creates complex boolean query from the cached tokenstream contents
 */
protected Query analyzeMultiBoolean(String field, TokenStream stream, BooleanClause.Occur operator) throws IOException {
    BooleanQuery.Builder q = newBooleanQuery();
    List<Term> currentQuery = new ArrayList<>();

    TermToBytesRefAttribute termAtt = stream.getAttribute(TermToBytesRefAttribute.class);
    PositionIncrementAttribute posIncrAtt = stream.getAttribute(PositionIncrementAttribute.class);

    stream.reset();
    while (stream.incrementToken()) {
        if (posIncrAtt.getPositionIncrement() != 0) {
            add(q, currentQuery, operator);
            currentQuery.clear();
        }
        currentQuery.add(new Term(field, termAtt.getBytesRef()));
    }
    add(q, currentQuery, operator);

    return q.build();
}
