This article collects typical usage examples of the Java class org.apache.tika.parser.html.HtmlMapper. If you are wondering how HtmlMapper is used in practice, the selected examples below may help.
HtmlMapper belongs to the org.apache.tika.parser.html package. Two code examples are shown below, sorted by popularity by default.
Example 1: prepare
import org.apache.tika.parser.html.HtmlMapper; // import the required package/class
@SuppressWarnings({ "rawtypes", "unchecked" })
@Override
public void prepare(Map conf, TopologyContext context,
        OutputCollector collector) {
    // Read the parser options from the topology configuration.
    emitOutlinks = ConfUtils.getBoolean(conf, "parser.emitOutlinks", true);
    urlFilters = URLFilters.fromConf(conf);
    parseFilters = ParseFilters.fromConf(conf);
    upperCaseElementNames = ConfUtils.getBoolean(conf,
            "parser.uppercase.element.names", true);
    extractEmbedded = ConfUtils.getBoolean(conf, "parser.extract.embedded",
            false);
    String htmlmapperClassName = ConfUtils.getString(conf,
            "parser.htmlmapper.classname",
            "org.apache.tika.parser.html.IdentityHtmlMapper");
    // Resolve the configured mapper class and check that it implements HtmlMapper.
    try {
        HTMLMapperClass = Class.forName(htmlmapperClassName);
        boolean interfaceOK = HtmlMapper.class
                .isAssignableFrom(HTMLMapperClass);
        if (!interfaceOK) {
            throw new RuntimeException("Class " + htmlmapperClassName
                    + " does not implement HtmlMapper");
        }
    } catch (ClassNotFoundException e) {
        // Log and rethrow with the cause so the original stack trace is kept.
        LOG.error("Can't load class {}", htmlmapperClassName, e);
        throw new RuntimeException("Can't load class "
                + htmlmapperClassName, e);
    }
    // instantiate Tika
    long start = System.currentTimeMillis();
    tika = new Tika();
    long end = System.currentTimeMillis();
    LOG.debug("Tika loaded in {} msec", end - start);
    this.collector = collector;
    this.eventCounter = context.registerMetric(this.getClass()
            .getSimpleName(), new MultiCountMetric(), 10);
    this.metadataTransfer = MetadataTransfer.getInstance(conf);
}
Developer: eorliac, Project: patent-crawler, Lines of code: 50, Source: ParserBolt.java
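The core pattern in Example 1 is: read a class name from configuration, load it with Class.forName, and reject it unless it implements the expected interface. Here is a minimal, standard-library-only sketch of that pattern. TagMapper, IdentityTagMapper and load are hypothetical stand-ins so the code runs without Tika or Storm on the classpath; they are not part of either library. Unlike the original, the sketch also instantiates the class, which is an assumption about how the loaded class would be used later.

```java
import java.util.Map;

/** Sketch: resolve an implementation class by name and check it against an interface. */
class MapperLoaderDemo {

    /** Hypothetical stand-in for org.apache.tika.parser.html.HtmlMapper. */
    interface TagMapper {
        String mapSafeElement(String name);
    }

    /** Hypothetical default implementation: returns every element name unchanged. */
    public static class IdentityTagMapper implements TagMapper {
        @Override
        public String mapSafeElement(String name) {
            return name;
        }
    }

    /**
     * Loads the class named under {@code key} (or {@code defaultName} if the key
     * is absent) and fails fast if it does not implement TagMapper.
     */
    static TagMapper load(Map<String, String> conf, String key, String defaultName) {
        String className = conf.getOrDefault(key, defaultName);
        try {
            Class<?> candidate = Class.forName(className);
            if (!TagMapper.class.isAssignableFrom(candidate)) {
                throw new IllegalArgumentException(
                        "Class " + className + " does not implement TagMapper");
            }
            // asSubclass cannot fail here because of the check above.
            return candidate.asSubclass(TagMapper.class)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // Keep the cause so the original stack trace is not lost.
            throw new IllegalStateException("Can't load class " + className, e);
        }
    }

    public static void main(String[] args) {
        String fallback = IdentityTagMapper.class.getName();
        TagMapper mapper = load(Map.of(), "parser.htmlmapper.classname", fallback);
        System.out.println(mapper.mapSafeElement("TABLE"));
    }
}
```

Checking the interface at startup, as prepare does, surfaces a misconfigured class name when the component is initialised rather than when the first document is parsed.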
Example 2: extract
import org.apache.tika.parser.html.HtmlMapper; // import the required package/class
/**
 * Create a pull-parser from the given {@link TikaInputStream}.
 *
 * @param document file that is being extracted from
 * @param input the stream to extract from
 * @return A pull-parsing reader.
 */
protected Reader extract(final Document document, final TikaInputStream input) throws IOException {
    final Metadata metadata = document.getMetadata();
    final ParseContext context = new ParseContext();
    final AutoDetectParser autoDetectParser = new AutoDetectParser(defaultParser);
    final Parser parser;

    if (null != digester) {
        parser = new DigestingParser(autoDetectParser, digester);
    } else {
        parser = autoDetectParser;
    }

    if (!ocrDisabled) {
        context.set(TesseractOCRConfig.class, ocrConfig);
    }

    context.set(PDFParserConfig.class, pdfConfig);

    // Set a fallback parser that outputs an empty document for empty files,
    // otherwise throws an exception.
    autoDetectParser.setFallback(FallbackParser.INSTANCE);

    // Only include "safe" tags in the HTML output from Tika's HTML parser.
    // This excludes script tags and objects.
    context.set(HtmlMapper.class, DefaultHtmlMapper.INSTANCE);

    final Reader reader;
    final Function<Writer, ContentHandler> handler;

    if (OutputFormat.HTML == outputFormat) {
        handler = (writer) -> new ExpandedTitleContentHandler(new HTML5Serializer(writer));
    } else {
        // The default BodyContentHandler is used when constructing the ParsingReader for text output, but
        // because only the body of embeds is pushed to the content handler further down the line, we can't
        // expect a body tag.
        handler = WriteOutContentHandler::new;
    }

    if (EmbedHandling.SPAWN == embedHandling) {
        context.set(Parser.class, parser);
        context.set(EmbeddedDocumentExtractor.class, new EmbedSpawner(document, context, embedOutput, handler));
    } else if (EmbedHandling.CONCATENATE == embedHandling) {
        context.set(Parser.class, parser);
        context.set(EmbeddedDocumentExtractor.class, new EmbedParser(document, context));
    } else {
        context.set(Parser.class, EmptyParser.INSTANCE);
        context.set(EmbeddedDocumentExtractor.class, new EmbedBlocker());
    }

    if (OutputFormat.HTML == outputFormat) {
        reader = new ParsingReader(parser, input, metadata, context, handler);
    } else {
        reader = new ParsingReader(parser, input, metadata, context);
    }

    return reader;
}
Developer: ICIJ, Project: extract, Lines of code: 66, Source: Extractor.java
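Example 2 registers a mapper so that only "safe" tags reach the HTML output. To show what such a mapper decides, here is a minimal sketch using a local stand-in interface with the same three method names as Tika's HtmlMapper (mapSafeElement, isDiscardElement, mapSafeAttribute). TagMapper and WhitelistMapper are hypothetical, and the tag lists are illustrative only; they are not the lists used by DefaultHtmlMapper. The sketch assumes the usual contract that a null return means "drop this element or attribute".

```java
import java.util.Locale;
import java.util.Set;

/** Sketch of a whitelist-style tag mapper; runs without Tika on the classpath. */
class SafeTagMapperDemo {

    /** Local stand-in mirroring the method names of Tika's HtmlMapper. */
    interface TagMapper {
        String mapSafeElement(String name);
        boolean isDiscardElement(String name);
        String mapSafeAttribute(String elementName, String attributeName);
    }

    /** Hypothetical mapper: keeps a small set of tags, discards script and style content. */
    static final class WhitelistMapper implements TagMapper {
        // Illustrative lists only (assumption), not Tika's actual defaults.
        private static final Set<String> SAFE = Set.of("p", "a", "h1", "ul", "li", "table");
        private static final Set<String> DISCARD = Set.of("script", "style");

        /** Returns the normalised lower-case name, or null if the element is not whitelisted. */
        @Override
        public String mapSafeElement(String name) {
            String lower = name.toLowerCase(Locale.ROOT);
            return SAFE.contains(lower) ? lower : null;
        }

        /** True when everything inside the element should be dropped. */
        @Override
        public boolean isDiscardElement(String name) {
            return DISCARD.contains(name.toLowerCase(Locale.ROOT));
        }

        /** Keeps only href on anchors; every other attribute is dropped. */
        @Override
        public String mapSafeAttribute(String elementName, String attributeName) {
            boolean isLink = "a".equalsIgnoreCase(elementName)
                    && "href".equalsIgnoreCase(attributeName);
            return isLink ? "href" : null;
        }
    }

    public static void main(String[] args) {
        TagMapper mapper = new WhitelistMapper();
        for (String tag : new String[] { "TABLE", "object", "script" }) {
            System.out.println(tag + " -> mapped=" + mapper.mapSafeElement(tag)
                    + ", discard=" + mapper.isDiscardElement(tag));
        }
    }
}
```

With a real Tika dependency, a class like this would implement org.apache.tika.parser.html.HtmlMapper and be registered through context.set(HtmlMapper.class, ...), as the example above does with DefaultHtmlMapper.INSTANCE.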
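Note: The org.apache.tika.parser.html.HtmlMapper examples in this article were collected from source code and documentation hosting platforms such as GitHub and MSDocs. The snippets were selected from open-source projects, and copyright in the source code belongs to the original authors. For distribution and use, refer to the license of the corresponding project. Do not reproduce without permission.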