This article collects code examples for the Java class org.apache.lucene.analysis.TokenStream and shows how the class is used in practice. The examples were extracted from selected projects on platforms such as GitHub, Stack Overflow, and Maven, so they should serve as useful references. Details of the TokenStream class:
Package path: org.apache.lucene.analysis.TokenStream
Class name: TokenStream
A TokenStream enumerates the sequence of tokens, either from Fields of a Document or from query text.
This is an abstract class; concrete subclasses are:
* Tokenizer, a TokenStream whose input is a Reader; and
* TokenFilter, a TokenStream whose input is another TokenStream.
A new TokenStream API was introduced with Lucene 2.9. This API has moved from being Token-based to Attribute-based. While Token still exists in 2.9 as a convenience class, the preferred way to store the information of a token is to use AttributeImpls. TokenStream now extends AttributeSource, which provides access to all of the token Attributes for the TokenStream. Note that only one instance per AttributeImpl is created and reused for every token. This approach reduces object creation and allows local caching of references to the AttributeImpls. See #incrementToken() for further details.
The workflow of the new TokenStream API is as follows:
1. Instantiation of TokenStream/TokenFilters which add/get attributes to/from the AttributeSource.
2. The consumer calls TokenStream#reset().
3. The consumer retrieves attributes from the stream and stores local references to all attributes it wants to access.
4. The consumer calls #incrementToken() until it returns false, consuming the attributes after each call.
5. The consumer calls #end() so that any end-of-stream operations can be performed.
6. The consumer calls #close() to release any resources when finished using the TokenStream.
To make sure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in #incrementToken().
You can find some example code for the new API in the analysis package-level Javadoc; a minimal consumer is also sketched below.
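The following is a minimal sketch of the six-step workflow above, assuming a StandardAnalyzer; the field name "body" and the sample text are illustrative choices, not taken from any of the sources quoted below.

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ConsumerWorkflow {
  public static void main(String[] args) throws IOException {
    try (Analyzer analyzer = new StandardAnalyzer();
        // Step 1: instantiation; the analyzer builds the tokenizer/filter chain.
        TokenStream ts = analyzer.tokenStream("body", "The quick brown fox")) {
      // Step 3: store a local reference to each attribute we want to access.
      CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
      ts.reset();                    // Step 2: reset before consuming.
      while (ts.incrementToken()) {  // Step 4: consume until it returns false.
        System.out.println(termAtt.toString());
      }
      ts.end();                      // Step 5: end-of-stream operations.
    }                                // Step 6: close(), via try-with-resources.
  }
}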
Sometimes it is desirable to capture the current state of a TokenStream, e.g., for buffering purposes (see CachingTokenFilter, TeeSinkTokenFilter). For this use case AttributeSource#captureState and AttributeSource#restoreState can be used, as in the sketch below.
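A minimal sketch of this use case; the filter name RepeatFilter and its emit-every-token-twice behavior are illustrative assumptions, not part of Lucene.

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public final class RepeatFilter extends TokenFilter {
  private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
  private AttributeSource.State buffered; // captured state of the last token

  public RepeatFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (buffered != null) {
      restoreState(buffered);             // replay the captured token...
      posIncrAtt.setPositionIncrement(0); // ...at the same position
      buffered = null;
      return true;
    }
    if (input.incrementToken()) {
      buffered = captureState();          // remember the full attribute state
      return true;
    }
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    buffered = null;
  }
}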
The TokenStream API in Lucene is based on the decorator pattern. Therefore all non-abstract subclasses must be final or have at least a final implementation of #incrementToken! This is checked when Java assertions are enabled. The chain below shows the decorator structure.
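A minimal sketch of a decorator chain, assuming the Lucene 7.x package layout; the choice of WhitespaceTokenizer and LowerCaseFilter is arbitrary:

import java.io.StringReader;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public class DecoratorChain {
  public static TokenStream buildChain(String text) {
    Tokenizer source = new WhitespaceTokenizer();  // a TokenStream whose input is a Reader
    source.setReader(new StringReader(text));
    return new LowerCaseFilter(source);            // a TokenStream whose input is another TokenStream
  }
}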
Code example source: stackoverflow.com

// The generic types and the method signature were truncated in the original
// extract and are reconstructed from the method body.
public final class LuceneUtil {

  private LuceneUtil() {}

  public static List<String> tokenizeString(Analyzer analyzer, String string) {
    List<String> result = new ArrayList<>();
    try {
      TokenStream stream = analyzer.tokenStream(null, new StringReader(string));
      stream.reset();
      while (stream.incrementToken()) {
        result.add(stream.getAttribute(CharTermAttribute.class).toString());
      }
    } catch (IOException e) {
      // not thrown b/c we're using a string reader...
      throw new RuntimeException(e);
    }
    return result;
  }
}
Code example source: stackoverflow.com

TokenStream stream = analyzer.tokenStream(null, new StringReader(text));
CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
  System.out.println(cattr.toString());
}
stream.end();
stream.close();
Code example source: stackoverflow.com

TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
tokenStream.reset();
while (tokenStream.incrementToken()) {
  int startOffset = offsetAttribute.startOffset();
  int endOffset = offsetAttribute.endOffset();
  String term = charTermAttribute.toString();
}
Code example source: stackoverflow.com

// Note: TermAttribute and term() are from the pre-4.0 Lucene API; on modern
// versions use CharTermAttribute instead, and call reset() before the first
// incrementToken().
TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);
while (tokenStream.incrementToken()) {
  int startOffset = offsetAttribute.startOffset();
  int endOffset = offsetAttribute.endOffset();
  String term = termAttribute.term();
}
Code example source: sanluan/PublicCMS

/**
 * @param text
 * @return
 */
public Set<String> getToken(String text) { // generics reconstructed from the method body
  Set<String> list = new LinkedHashSet<>(); // concrete Set type elided in the original; LinkedHashSet assumed
  if (CommonUtils.notEmpty(text)) {
    try (StringReader stringReader = new StringReader(text);
        TokenStream tokenStream = dao.getAnalyzer().tokenStream(CommonConstants.BLANK, stringReader)) {
      CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
      tokenStream.reset();
      while (tokenStream.incrementToken()) {
        list.add(charTermAttribute.toString());
      }
      tokenStream.end();
      return list;
    } catch (IOException e) {
      return list;
    }
  }
  return list;
}
Code example source: org.apache.lucene/lucene-core

// Excerpt from Analyzer#normalize(String, String); the read loop, braces, and
// the BytesRef assignment were truncated in the original extract and are
// restored here from the surrounding logic.
try (Reader reader = new StringReader(text)) {
  Reader filterReader = initReaderForNormalization(fieldName, reader);
  char[] buffer = new char[64];
  StringBuilder builder = new StringBuilder();
  for (;;) {
    final int read = filterReader.read(buffer, 0, buffer.length);
    if (read == -1) {
      break;
    }
    builder.append(buffer, 0, read);
  }
  final String filteredText = builder.toString();
  final AttributeFactory attributeFactory = attributeFactory(fieldName);
  try (TokenStream ts = normalize(fieldName,
      new StringTokenStream(attributeFactory, filteredText, text.length()))) {
    final TermToBytesRefAttribute termAtt = ts.addAttribute(TermToBytesRefAttribute.class);
    ts.reset();
    if (ts.incrementToken() == false) {
      throw new IllegalStateException("The normalization token stream is "
          + "expected to produce exactly 1 token, but got 0 for analyzer "
          + this + " and input \"" + text + "\"");
    }
    final BytesRef term = BytesRef.deepCopyOf(termAtt.getBytesRef());
    if (ts.incrementToken()) {
      throw new IllegalStateException("The normalization token stream is "
          + "expected to produce exactly 1 token, but got 2+ for analyzer "
          + this + " and input \"" + text + "\"");
    }
    ts.end();
    return term;
  }
}
Code example source: tjake/Solandra

// Excerpt; surrounding declarations (position, lastOffset, offsetVector, ...)
// are elided in the original extract. reusableTokenStream indicates a
// pre-4.0 Lucene codebase.
Reader tokReader = new StringReader(field.stringValue());
TokenStream tokens = analyzer.reusableTokenStream(field.name(), tokReader);
if (position > 0) {
  position += analyzer.getPositionIncrementGap(field.name());
}
tokens.reset(); // reset the TokenStream to the first token
OffsetAttribute offsetAttribute = (OffsetAttribute) tokens.addAttribute(OffsetAttribute.class);
PositionIncrementAttribute posIncrAttribute =
    (PositionIncrementAttribute) tokens.addAttribute(PositionIncrementAttribute.class);
CharTermAttribute termAttribute = (CharTermAttribute) tokens.addAttribute(CharTermAttribute.class);
while (tokens.incrementToken()) {
  position += (posIncrAttribute.getPositionIncrement() - 1);
  offsetVector.add(lastOffset + offsetAttribute.startOffset());
  offsetVector.add(lastOffset + offsetAttribute.endOffset());
  // ...
}
Code example source: stackoverflow.com

// Lucene 3.x-era API (Version.LUCENE_36).
Reader reader = new StringReader("This is a test string");
TokenStream tokenizer = new StandardTokenizer(Version.LUCENE_36, reader);
tokenizer = new ShingleFilter(tokenizer, 2, 3); // minShingleSize must be >= 2
CharTermAttribute charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);
while (tokenizer.incrementToken()) {
  String token = charTermAttribute.toString();
  // Do something
}
Code example source: linkedin/indextank-engine

// Excerpt; the enclosing method signature ("public Iterator<...>") is
// truncated in the original extract. TermAttribute indicates a pre-4.0
// Lucene codebase.
final TokenStream tkstream = analyzer.tokenStream(fieldName, new StringReader(content));
final TermAttribute termAtt = tkstream.addAttribute(TermAttribute.class);
final PositionIncrementAttribute posIncrAttribute = tkstream.addAttribute(PositionIncrementAttribute.class);
final OffsetAttribute offsetAtt = tkstream.addAttribute(OffsetAttribute.class);
Code example source: org.apache.lucene/lucene-core

// Excerpt from Lucene's field-inversion loop; the position/offset sanity
// checks inside the loop are truncated in the original extract and are
// marked with ellipsis comments.
try (TokenStream stream = tokenStream = field.tokenStream(docState.analyzer, tokenStream)) {
  stream.reset();
  invertState.setAttributeSource(stream);
  termsHashPerField.start(field, first);
  while (stream.incrementToken()) {
    int posIncr = invertState.posIncrAttribute.getPositionIncrement();
    invertState.position += posIncr;
    // if (invertState.position ...) { ... }  // truncated check
    int endOffset = invertState.offset + invertState.offsetAttribute.endOffset();
    // if (startOffset ...) { ... }           // truncated check
    // ...
  }
  stream.end();
  invertState.position += invertState.posIncrAttribute.getPositionIncrement();
  invertState.offset += invertState.offsetAttribute.endOffset();
  invertState.position += docState.analyzer.getPositionIncrementGap(fieldInfo.name);
  invertState.offset += docState.analyzer.getOffsetGap(fieldInfo.name);
}
Code example source: org.apache.lucene/lucene-analyzers-common

// incrementToken() of a token-truncating filter: shortens each non-keyword
// token to at most `length` characters.
@Override
public final boolean incrementToken() throws IOException {
  if (input.incrementToken()) {
    if (!keywordAttr.isKeyword() && termAttribute.length() > length) {
      termAttribute.setLength(length);
    }
    return true;
  } else {
    return false;
  }
}
Code example source: org.apache.lucene/lucene-analyzers

private Token getNextInputToken(Token token) throws IOException {
  if (!input.incrementToken()) return null;
  token.copyBuffer(in_termAtt.buffer(), 0, in_termAtt.length());
  token.setPositionIncrement(in_posIncrAtt.getPositionIncrement());
  token.setFlags(in_flagsAtt.getFlags());
  token.setOffset(in_offsetAtt.startOffset(), in_offsetAtt.endOffset());
  token.setType(in_typeAtt.type());
  token.setPayload(in_payloadAtt.getPayload());
  return token;
}
Code example source: oracle/opengrok

private SToken[] getTokens(String text) throws IOException {
  //FIXME somehow integrate below cycle to getSummary to save the cloning and memory,
  //also creating Tokens is suboptimal with 3.0.0 , this whole class could be replaced by highlighter
  ArrayList<SToken> result = new ArrayList<>();
  try (TokenStream ts = analyzer.tokenStream("full", text)) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      SToken t = new SToken(term.buffer(), 0, term.length(), offset.startOffset(), offset.endOffset());
      result.add(t);
    }
    ts.end();
  }
  return result.toArray(new SToken[result.size()]);
}
Code example source: org.apache.lucene/lucene-analyzers-common

// Excerpt from a fingerprinting filter: it collects unique terms from the
// input stream, then configures the attributes of a single "fingerprint"
// output token. Several statements are truncated in the original extract.
uniqueTerms = new CharArraySet(8, false);
int outputTokenSize = 0;
while (input.incrementToken()) {
  if (outputTokenSize > maxOutputTokenSize) {
    continue;
  }
  final char term[] = termAttribute.buffer();
  final int length = termAttribute.length();
  // ... (deduplication and size bookkeeping elided) ...
}
input.end();
inputEnded = true;
// ... (fingerprint construction elided) ...
offsetAtt.setOffset(0, offsetAtt.endOffset());
posLenAtt.setPositionLength(1);
posIncrAtt.setPositionIncrement(1);
typeAtt.setType("fingerprint");
termAttribute.setEmpty();
return false;
Code example source: jeremylong/DependencyCheck

// Excerpt; lines between these statements are truncated in the original
// extract, so the branch bodies are incomplete. The lone `return true` in the
// extract has been folded into the non-empty branch (placement assumed).
String[] parts;
skipCounter = 0;
while (input.incrementToken()) {
  final String text = new String(termAtt.buffer(), 0, termAtt.length());
  if (text.isEmpty()) {
    skipCounter += posIncrAttribute.getPositionIncrement();
  } else {
    if (skipCounter != 0) {
      posIncrAttribute.setPositionIncrement(posIncrAttribute.getPositionIncrement() + skipCounter);
    }
    // ...
    return true;
  }
}
Code example source: org.apache.lucene/lucene-core

// Excerpt from a TokenStream-to-automaton conversion; the loop body and the
// state handling are truncated in the original extract.
final TermToBytesRefAttribute termBytesAtt = in.addAttribute(TermToBytesRefAttribute.class);
final PositionIncrementAttribute posIncAtt = in.addAttribute(PositionIncrementAttribute.class);
final PositionLengthAttribute posLengthAtt = in.addAttribute(PositionLengthAttribute.class);
in.reset();
while (in.incrementToken()) {
  int currentIncr = posIncAtt.getPositionIncrement();
  if (pos == -1 && currentIncr < 1) {
    throw new IllegalStateException("Malformed TokenStream, start token can't have increment less than 1");
  }
  // ...
}
in.end();
if (state != -1) {
  builder.setAccept(state, true);
}
Code example source: org.apache.lucene/lucene-analyzers-common

// Excerpt, apparently from SynonymMap.Parser#analyze; the braces and the
// separator/grow bookkeeping were truncated in the original extract and are
// restored here based on the surrounding logic.
try (TokenStream ts = analyzer.tokenStream("", text)) {
  CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
  PositionIncrementAttribute posIncAtt = ts.addAttribute(PositionIncrementAttribute.class);
  ts.reset();
  reuse.clear();
  while (ts.incrementToken()) {
    int length = termAtt.length();
    if (length == 0) {
      throw new IllegalArgumentException("term: " + text + " analyzed to a zero-length token");
    }
    if (posIncAtt.getPositionIncrement() != 1) {
      throw new IllegalArgumentException("term: " + text + " analyzed to a token (" + termAtt +
          ") with position increment != 1 (got: " + posIncAtt.getPositionIncrement() + ")");
    }
    reuse.grow(reuse.length() + length + 1); // current + word + separator
    int end = reuse.length();
    if (reuse.length() > 0) {
      reuse.setCharAt(end++, SynonymMap.WORD_SEPARATOR);
      reuse.setLength(reuse.length() + 1);
    }
    System.arraycopy(termAtt.buffer(), 0, reuse.chars(), end, length);
    reuse.setLength(reuse.length() + length);
  }
  ts.end();
}
Code example source: synhershko/HebMorph

// Generic types and the garbled field names (positionLength, positionIncGap)
// are repaired from the surrounding code.
private ArrayList<Data> analyze(Analyzer analyzer1) throws IOException {
  ArrayList<Data> results = new ArrayList<>(50);
  TokenStream ts = analyzer1.tokenStream("foo", text);
  ts.reset();
  while (ts.incrementToken()) {
    Data data = new Data();
    OffsetAttribute offsetAttribute = ts.getAttribute(OffsetAttribute.class);
    data.startOffset = offsetAttribute.startOffset();
    data.endOffset = offsetAttribute.endOffset();
    data.positionLength = ts.getAttribute(PositionLengthAttribute.class).getPositionLength();
    data.positionIncGap = ts.getAttribute(PositionIncrementAttribute.class).getPositionIncrement();
    data.tokenType = ts.getAttribute(HebrewTokenTypeAttribute.class).getType().toString();
    data.term = ts.getAttribute(CharTermAttribute.class).toString();
    if (ts.getAttribute(KeywordAttribute.class) != null) {
      data.isKeyword = ts.getAttribute(KeywordAttribute.class).isKeyword();
    }
    // System.out.println(data.term + " " + data.tokenType);
    results.add(data);
  }
  ts.close();
  return results;
}
Code example source: org.apache.lucene/lucene-analyzers-common

// Excerpt; the remainder of the method is truncated in the original extract.
@Override
public boolean incrementToken() throws IOException {
  while (!exhausted && input.incrementToken()) {
    char[] term = termAttribute.buffer();
    int termLength = termAttribute.length();
    lastEndOffset = offsetAttribute.endOffset();
    // ...
  }
  // ...
}
Code example source: org.apache.lucene/lucene-core

/**
 * Creates complex boolean query from the cached tokenstream contents
 */
protected Query analyzeMultiBoolean(String field, TokenStream stream, BooleanClause.Occur operator) throws IOException {
  BooleanQuery.Builder q = newBooleanQuery();
  List<Term> currentQuery = new ArrayList<>(); // generic type reconstructed from usage
  TermToBytesRefAttribute termAtt = stream.getAttribute(TermToBytesRefAttribute.class);
  PositionIncrementAttribute posIncrAtt = stream.getAttribute(PositionIncrementAttribute.class);
  stream.reset();
  while (stream.incrementToken()) {
    if (posIncrAtt.getPositionIncrement() != 0) {
      add(q, currentQuery, operator);
      currentQuery.clear();
    }
    currentQuery.add(new Term(field, termAtt.getBytesRef()));
  }
  add(q, currentQuery, operator);
  return q.build();
}