Author: 上官邱老 | Source: Internet | 2023-09-23 20:15
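The abstract class Scoper from the previous post holds another member variable, DecideRule scope. I have to pause the analysis of the processor classes here (I will return to them later) and take a detour to describe the DecideRule scope object. As I said before, the DecideRule scope member controls whether a CrawlURI caUri object falls within the crawl scope.
As usual, let's first look at the class diagram of the DecideRule-related classes.
DecideRule is an abstract class that decides whether a CrawlURI caUri object is accepted or rejected.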
    public DecideResult decisionFor(CrawlURI uri) {
        if (!getEnabled()) {
            return DecideResult.NONE;
        }
        DecideResult result = innerDecide(uri);
        if (result == DecideResult.NONE) {
            return result;
        }
        return result;
    }

    protected abstract DecideResult innerDecide(CrawlURI uri);

    public DecideResult onlyDecision(CrawlURI uri) {
        return null;
    }

    public boolean accepts(CrawlURI uri) {
        return DecideResult.ACCEPT == decisionFor(uri);
    }
The abstract method DecideResult innerDecide(CrawlURI uri) above is implemented by subclasses.
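To make the template-method relationship between decisionFor and innerDecide concrete, here is a minimal sketch. It does not use the Heritrix classes: DecideResult, CrawlURI and DecideRule below are simplified stand-ins, the enabled flag is a plain field (an assumption made for brevity), and AcceptHttpsRule is a hypothetical rule invented for illustration.

```java
class DecisionForSketch {
    /** Stand-in for Heritrix's DecideResult enum. */
    enum DecideResult { ACCEPT, NONE, REJECT }

    /** Simplified stand-in for CrawlURI: it only carries the URI string. */
    static final class CrawlURI {
        private final String uri;
        CrawlURI(String uri) { this.uri = uri; }
        String getURI() { return uri; }
    }

    /** Reduced DecideRule with the same template-method shape as the excerpt. */
    static abstract class DecideRule {
        private boolean enabled = true; // assumption: a plain field instead of Heritrix's settings

        boolean getEnabled() { return enabled; }
        void setEnabled(boolean enabled) { this.enabled = enabled; }

        /** A disabled rule abstains (NONE); otherwise the subclass decides. */
        final DecideResult decisionFor(CrawlURI uri) {
            if (!getEnabled()) {
                return DecideResult.NONE;
            }
            return innerDecide(uri);
        }

        protected abstract DecideResult innerDecide(CrawlURI uri);

        boolean accepts(CrawlURI uri) {
            return DecideResult.ACCEPT == decisionFor(uri);
        }
    }

    /** Hypothetical rule: accepts https URIs and abstains on everything else. */
    static final class AcceptHttpsRule extends DecideRule {
        @Override
        protected DecideResult innerDecide(CrawlURI uri) {
            return uri.getURI().startsWith("https://")
                    ? DecideResult.ACCEPT
                    : DecideResult.NONE;
        }
    }

    public static void main(String[] args) {
        DecideRule rule = new AcceptHttpsRule();
        System.out.println(rule.decisionFor(new CrawlURI("https://example.org/"))); // ACCEPT
        System.out.println(rule.accepts(new CrawlURI("http://example.org/")));      // false
        rule.setEnabled(false);
        System.out.println(rule.decisionFor(new CrawlURI("https://example.org/"))); // NONE
    }
}
```

Note that accepts returns false both for REJECT and for NONE, so a caller that needs to tell "rejected" from "no opinion" has to call decisionFor directly.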
DecideResult is an enum with three values.
    /**
     * The decision of a DecideRule.
     *
     * @author pjack
     */
    public enum DecideResult {
        /** Indicates the URI was accepted. */
        ACCEPT,
        /** Indicates the URI was neither accepted nor rejected. */
        NONE,
        /** Indicates the URI was rejected. */
        REJECT;

        public static DecideResult invert(DecideResult result) {
            switch (result) {
                case ACCEPT:
                    return REJECT;
                case REJECT:
                    return ACCEPT;
                default:
                    return result;
            }
        }
    }
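As a quick illustration of invert, the sketch below copies the enum into a wrapper class (InvertSketch is a name chosen here only so that the example is self-contained) and prints the mapping for every value.

```java
class InvertSketch {
    /** Local copy of the enum shown above, so the example compiles on its own. */
    enum DecideResult {
        ACCEPT, NONE, REJECT;

        static DecideResult invert(DecideResult result) {
            switch (result) {
                case ACCEPT:
                    return REJECT;
                case REJECT:
                    return ACCEPT;
                default:
                    return result; // NONE stays NONE
            }
        }
    }

    public static void main(String[] args) {
        // Prints: ACCEPT -> REJECT, NONE -> NONE, REJECT -> ACCEPT (one per line)
        for (DecideResult r : DecideResult.values()) {
            System.out.println(r + " -> " + DecideResult.invert(r));
        }
    }
}
```

Only a definite decision is flipped; a rule that abstained still abstains after inversion.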
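Next, let's look at its important subclass DecideRuleSequence. This class holds a collection of DecideRule objects, and its DecideResult innerDecide(CrawlURI uri) method iterates over that collection, calling DecideResult decisionFor(CrawlURI uri) on each element (a combination of the Composite and Iterator patterns).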
    @SuppressWarnings("unchecked")
    public List<DecideRule> getRules() {
        return (List<DecideRule>) kp.get("rules");
    }

    public void setRules(List<DecideRule> rules) {
        kp.put("rules", rules);
    }

    public DecideResult innerDecide(CrawlURI uri) {
        DecideRule decisiveRule = null;
        int decisiveRuleNumber = -1;
        DecideResult result = DecideResult.NONE;
        List<DecideRule> rules = getRules();
        int max = rules.size();
        for (int i = 0; i < max; i++) {
            DecideRule rule = rules.get(i);
            // A rule that could only repeat the current result is skipped.
            if (rule.onlyDecision(uri) != result) {
                DecideResult r = rule.decisionFor(uri);
                if (LOGGER.isLoggable(Level.FINEST)) {
                    LOGGER.finest("DecideRule #" + i + " " +
                            rule.getClass().getName() + " returned " + r + " for url: " + uri);
                }
                // The last rule with an opinion (non-NONE) determines the result.
                if (r != DecideResult.NONE) {
                    result = r;
                    decisiveRule = rule;
                    decisiveRuleNumber = i;
                }
            }
        }
        if (fileLogger != null) {
            fileLogger.info(decisiveRuleNumber + " " + decisiveRule.getClass().getSimpleName() + " " + result + " " + uri);
        }
        return result;
    }
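The loop above boils down to "the last rule with an opinion wins". The following sketch shows that behaviour in isolation. It is not the Heritrix implementation: the Rule interface takes a plain String instead of a CrawlURI, and the three rules in exampleRules are hypothetical ones made up to loosely resemble a reject-all, accept-by-prefix, reject-deep-paths sequence.

```java
import java.util.Arrays;
import java.util.List;

class SequenceSketch {
    enum DecideResult { ACCEPT, NONE, REJECT }

    /** Reduced rule interface; the URI is a plain String in this sketch. */
    interface Rule {
        DecideResult decisionFor(String uri);

        /** The single decision this rule can produce, or null if it varies. */
        default DecideResult onlyDecision(String uri) { return null; }
    }

    /** Walks the rules in order; the last non-NONE decision wins. */
    static DecideResult decide(List<Rule> rules, String uri) {
        DecideResult result = DecideResult.NONE;
        for (Rule rule : rules) {
            // Skip rules that could only repeat the current result.
            if (rule.onlyDecision(uri) != result) {
                DecideResult r = rule.decisionFor(uri);
                if (r != DecideResult.NONE) {
                    result = r;
                }
            }
        }
        return result;
    }

    /** Hypothetical rules, invented for this example. */
    static List<Rule> exampleRules() {
        Rule rejectAll = new Rule() {
            @Override public DecideResult decisionFor(String uri) { return DecideResult.REJECT; }
            @Override public DecideResult onlyDecision(String uri) { return DecideResult.REJECT; }
        };
        Rule acceptExampleOrg = uri -> uri.startsWith("http://example.org/")
                ? DecideResult.ACCEPT : DecideResult.NONE;
        // Rejects URIs that split into more than 8 pieces on "/".
        Rule rejectDeepPaths = uri -> uri.split("/").length > 8
                ? DecideResult.REJECT : DecideResult.NONE;
        return Arrays.asList(rejectAll, acceptExampleOrg, rejectDeepPaths);
    }

    public static void main(String[] args) {
        List<Rule> rules = exampleRules();
        System.out.println(decide(rules, "http://example.org/a/b"));           // ACCEPT
        System.out.println(decide(rules, "http://other.net/"));                // REJECT
        System.out.println(decide(rules, "http://example.org/1/2/3/4/5/6/7")); // REJECT
    }
}
```

Because later rules override earlier ones, the order of the list matters: putting the reject-all rule last would reject every URI.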
In a running environment, the elements of this collection can be seen in the crawler-beans.cxml configuration file.
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <bean class="org.archive.modules.deciderules.RejectDecideRule">
      </bean>
      <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
      </bean>
      <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
      </bean>
      <bean class="org.archive.modules.deciderules.TransclusionDecideRule">
      </bean>
      <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
        <property name="decision" value="REJECT"/>
        <property name="seedsAsSurtPrefixes" value="false"/>
        <property name="surtsDumpFile" value="${launchId}/negative-surts.dump" />
      </bean>
      <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
        <property name="decision" value="REJECT"/>
      </bean>
      <bean class="org.archive.modules.deciderules.PathologicalPathDecideRule">
      </bean>
      <bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
      </bean>
      <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule">
      </bean>
      <bean class="org.archive.modules.deciderules.SchemeNotInSetDecideRule">
      </bean>
    </list>
  </property>
</bean>
The abstract class PredicatedDecideRule extends DecideRule.
    @Override
    protected DecideResult innerDecide(CrawlURI uri) {
        if (evaluate(uri)) {
            return getDecision();
        }
        return DecideResult.NONE;
    }

    protected abstract boolean evaluate(CrawlURI object);
The boolean evaluate(CrawlURI object) method is implemented by subclasses.
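A subclass therefore only supplies a yes/no test, and the configured decision is returned when the test matches. The sketch below illustrates that split with stand-in types. PredicatedRule is a reduced imitation of PredicatedDecideRule, the URI is a plain String, and LongUriRule is a hypothetical subclass that does not exist in Heritrix.

```java
class PredicatedSketch {
    enum DecideResult { ACCEPT, NONE, REJECT }

    /** Reduced imitation of PredicatedDecideRule: a boolean test plus a fixed decision. */
    static abstract class PredicatedRule {
        private final DecideResult decision;

        PredicatedRule(DecideResult decision) { this.decision = decision; }

        DecideResult getDecision() { return decision; }

        /** Returns the configured decision when the predicate matches, else NONE. */
        DecideResult innerDecide(String uri) {
            if (evaluate(uri)) {
                return getDecision();
            }
            return DecideResult.NONE;
        }

        protected abstract boolean evaluate(String uri);
    }

    /** Hypothetical subclass: matches URIs longer than a given number of characters. */
    static final class LongUriRule extends PredicatedRule {
        private final int maxLength;

        LongUriRule(int maxLength, DecideResult decision) {
            super(decision);
            this.maxLength = maxLength;
        }

        @Override
        protected boolean evaluate(String uri) {
            return uri.length() > maxLength;
        }
    }

    public static void main(String[] args) {
        PredicatedRule rule = new LongUriRule(20, DecideResult.REJECT);
        System.out.println(rule.innerDecide("http://example.org/"));                  // NONE
        System.out.println(rule.innerDecide("http://example.org/a/very/long/path"));  // REJECT
    }
}
```

This matches the decision property in the crawler-beans.cxml listing: the same predicate class can be configured to ACCEPT or to REJECT without changing its evaluate logic.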
I won't go through the other implementation classes one by one.
---------------------------------------------------------------------------
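This Heritrix 3.1.0 source code analysis series is my own original work.
When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯
Link to this post: http://www.cnblogs.com/chenying99/archive/2013/04/23/3037547.html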