This article collects typical usage examples of the Java class de.l3s.boilerpipe.BoilerpipeExtractor. If you are wondering how the BoilerpipeExtractor class is used in practice, the selected examples below may help.
The BoilerpipeExtractor class belongs to the de.l3s.boilerpipe package. Nine code examples are shown below, sorted by popularity by default.
Example 1: process
import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
/**
 * Returns the article from a document, keeping its basic HTML structure.
 *
 * @param htmlDoc   the HTML document to process
 * @param docUri    the URI of the document, used to resolve relative anchors in the document to absolute anchors
 * @param extractor the extractor to apply
 * @return the extracted article HTML, or null if processing fails
 */
public String process(HTMLDocument htmlDoc, URI docUri, final BoilerpipeExtractor extractor) {
    final HTMLHighlighter hh = HTMLHighlighter.newExtractingInstance();
    hh.setOutputHighlightOnly(true);
    TextDocument doc;
    String text = "";
    try {
        doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
        extractor.process(doc);
        // A fresh InputSource is needed for the second pass over the original HTML.
        final InputSource is = htmlDoc.toInputSource();
        text = hh.process(doc, is);
    } catch (Exception ex) {
        return null;
    }
    return removeNotAllowedTags(text, docUri);
}
Developer ID: BartoszJarocki, project: android-boilerpipe, lines of code: 28, source: HtmlArticleExtractor.java
Example 2: process
import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
/**
 * Parses the media (pictures, videos) out of a document.
 *
 * @param doc       the document to parse the media out of
 * @param extractor the extractor to use
 * @return list of extracted media (size 0 if no media is found), or null if parsing fails
 */
public List<Media> process(String doc, final BoilerpipeExtractor extractor) {
    final HTMLDocument htmlDoc = new HTMLDocument(doc);
    List<Media> media = new ArrayList<Media>();
    TextDocument tdoc;
    try {
        tdoc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
        extractor.process(tdoc);
        final InputSource is = htmlDoc.toInputSource();
        media = process(tdoc, is);
    } catch (Exception e) {
        return null;
    }
    return media;
}
Developer ID: BartoszJarocki, project: android-boilerpipe, lines of code: 23, source: MediaExtractor.java
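Example 2 returns null when parsing throws, so callers have to check for null as well as for an empty list. One possible alternative, shown here only as a minimal sketch, is to return an empty list on failure. The class EmptyOnFailureSketch and its parse method are hypothetical stand-ins that use only the JDK; they are not part of boilerpipe or of the project above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class EmptyOnFailureSketch {
    /** Hypothetical stand-in for the parsing step: collects "img:" tokens, throws on blank input. */
    static List<String> parse(String doc) {
        if (doc == null || doc.trim().isEmpty()) {
            throw new IllegalArgumentException("empty document");
        }
        List<String> found = new ArrayList<>();
        for (String token : doc.trim().split("\\s+")) {
            if (token.startsWith("img:")) {
                found.add(token.substring(4));
            }
        }
        return found;
    }

    /** Returns the parsed items, or an empty list (never null) if parsing fails. */
    static List<String> process(String doc) {
        try {
            return parse(doc);
        } catch (RuntimeException e) {
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        System.out.println(process("text img:a.png more img:b.png"));
        System.out.println(process(""));
    }
}
```

With this shape, a caller can iterate over the result directly. The trade-off is that a failure and a document without media become indistinguishable unless the error is logged or reported separately.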
Example 3: process
import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
/**
 * Fetches the given {@link URL} using {@link HTMLFetcher} and processes the
 * retrieved HTML using the specified {@link BoilerpipeExtractor}.
 *
 * @param url       the URL of the HTML document to fetch
 * @param extractor the extractor to apply
 * @return A List of enclosed links
 * @throws BoilerpipeProcessingException
 */
public List<String> process(final URL url, final BoilerpipeExtractor extractor)
        throws IOException, BoilerpipeProcessingException, SAXException {
    final HTMLDocument htmlDoc = HTMLFetcher.fetch(url);
    final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource())
            .getTextDocument();
    extractor.process(doc);
    final InputSource is = htmlDoc.toInputSource();
    return process(doc, is);
}
Developer ID: asimihsan, project: handytrowel, lines of code: 22, source: LinkExtractor.java
Example 4: process
import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
public String process(final URL url, final BoilerpipeExtractor extractor)
        throws IOException, BoilerpipeProcessingException, SAXException {
    final HTMLDocument htmlDoc = HTMLFetcher.fetch(url);
    // Added to fix a bug with Unicode characters not being recognized by the SAX parser
    // on AppEngine (bug while appending chars to StringBuffer by offset)
    htmlDoc.encodeEscapedCharsAsText();
    // Added to support including images in the extracted HTML output
    if (includeImages) {
        htmlDoc.encodeImageTagsAsText();
    }
    final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource())
            .getTextDocument();
    extractor.process(doc);
    final InputSource is = htmlDoc.toInputSource();
    String finalHtml = process(doc, is);
    // Restore the escaped characters that were encoded as text before parsing (see above)
    finalHtml = HTMLDocument.restoreTextEncodedEscapedChars(finalHtml, htmlDoc.getCharset().name());
    // Restore the image tags that were encoded as text before parsing (see above)
    if (includeImages) {
        finalHtml = HTMLDocument.restoreTextEncodedImageTags(finalHtml, htmlDoc.getCharset().name());
    }
    return finalHtml;
}
Developer ID: BartoszJarocki, project: android-boilerpipe, lines of code: 29, source: HTMLHighlighter.java
Example 5: process
import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
/**
 * Fetches the given {@link java.net.URL} using {@link de.l3s.boilerpipe.sax.HTMLFetcher} and processes the
 * retrieved HTML using the specified {@link BoilerpipeExtractor}.
 *
 * @param url       the URL of the HTML document to fetch
 * @param extractor the extractor to apply
 * @return A List of enclosed {@link Image}s
 * @throws BoilerpipeProcessingException
 */
public List<Image> process(final URL url, final BoilerpipeExtractor extractor)
        throws IOException, BoilerpipeProcessingException, SAXException {
    final HTMLDocument htmlDoc = HTMLFetcher.fetch(url);
    final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource())
            .getTextDocument();
    extractor.process(doc);
    final InputSource is = htmlDoc.toInputSource();
    return process(doc, is);
}
Developer ID: BartoszJarocki, project: android-boilerpipe, lines of code: 22, source: ImageExtractor.java
Example 6: process
import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
/**
 * Fetches the given {@link URL} using {@link HTMLFetcher} and processes the
 * retrieved HTML using the specified {@link BoilerpipeExtractor}.
 *
 * @param url
 *            The URL of the HTML document to fetch.
 * @param extractor
 *            The extractor to apply.
 * @return A List of enclosed {@link Image}s
 * @throws BoilerpipeProcessingException
 */
public List<Image> process(final URL url, final BoilerpipeExtractor extractor)
        throws IOException, BoilerpipeProcessingException, SAXException {
    final HTMLDocument htmlDoc = HTMLFetcher.fetch(url);
    final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource())
            .getTextDocument();
    extractor.process(doc);
    final InputSource is = htmlDoc.toInputSource();
    return process(doc, is);
}
Developer ID: socialsensor, project: storm-focused-crawler, lines of code: 24, source: ImageExtractor.java
Example 7: main
import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
public static void main(String[] args) throws InterruptedException, IOException {
    List<String> lines = Files.readLines(new File("data/query.tsv"), Charsets.UTF_8);
    Set<String> ids = new HashSet<>();
    for (String line : lines) {
        ids.add(line.split("\t")[0]);
    }
    BoilerpipeExtractor extractor = CommonExtractors.ARTICLE_EXTRACTOR;
    ExecutorService es = Executors.newFixedThreadPool(10);
    System.out.println(ids.size());
    DecimalFormat df = new DecimalFormat("00");
    for (String id : ids) {
        String googleHtml = Files.toString(new File("data/googlerp", id + ".html"), Charsets.UTF_8);
        // NOTE: `pattern` is not defined in this excerpt; it is presumably declared
        // elsewhere in the enclosing class.
        Matcher matcher = pattern.matcher(googleHtml);
        int count = 0;
        while (matcher.find()) {
            count++;
            // check existence
            File docHtmlFile = new File("data/context", id + "-" + df.format(count) + ".html");
            File docTextFile = new File("data/context", id + "-" + df.format(count) + ".txt");
            if (docHtmlFile.exists() && docTextFile.exists()) {
                continue;
            }
            // get url
            String url = matcher.group(1);
            if (url.contains("wikihow") || url.contains("google")) {
                continue;
            }
            es.execute(() -> {
                System.out.println(id + " " + url);
                // download url
                try {
                    String docHtml = Request.Get(url).connectTimeout(2000).socketTimeout(2000).execute()
                            .returnContent().asString();
                    Files.write(docHtml, docHtmlFile, Charsets.UTF_8);
                    String docText = extractor.getText(docHtml);
                    Files.write(docText, docTextFile, Charsets.UTF_8);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
    }
    es.shutdown();
    if (!es.awaitTermination(5, TimeUnit.MINUTES)) {
        System.out.println("Timeout occurs for one or some concept retrieval service.");
    }
}
Developer ID: ziy, project: pkb, lines of code: 48, source: ContextExtractor.java
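The concurrency pattern in Example 7 (submit one task per item to a fixed thread pool, call shutdown, then wait with a timeout) can be sketched with the JDK alone. The class PoolSketch and its runAll method are hypothetical names, and the task body is a placeholder for the download and extraction work. Calling shutdownNow on timeout is an addition of this sketch; the example above only prints a message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class PoolSketch {
    /** Runs one placeholder task per id on a fixed pool and returns the collected results. */
    static List<String> runAll(List<String> ids, int threads, long timeoutSeconds) {
        // Thread-safe collection, because tasks add results from several worker threads.
        ConcurrentLinkedQueue<String> done = new ConcurrentLinkedQueue<>();
        ExecutorService es = Executors.newFixedThreadPool(threads);
        for (String id : ids) {
            es.execute(() -> {
                try {
                    done.add("processed:" + id); // placeholder for download + extraction
                } catch (Exception e) {
                    e.printStackTrace(); // one failing task must not stop the others
                }
            });
        }
        es.shutdown(); // accept no new tasks; already submitted tasks still run
        try {
            if (!es.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)) {
                es.shutdownNow(); // sketch-only addition: interrupt tasks that are still running
            }
        } catch (InterruptedException e) {
            es.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt status for the caller
        }
        return new ArrayList<>(done);
    }

    public static void main(String[] args) {
        System.out.println(runAll(Arrays.asList("q1", "q2", "q3"), 2, 5));
    }
}
```

The order of the returned results depends on thread scheduling, so callers that need a stable order should sort them.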
Example 8: downloadSearchResult
import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
public static void downloadSearchResult() throws IOException, BoilerpipeProcessingException,
        SAXException, URISyntaxException, InterruptedException {
    List<String> lines = Files.readLines(new File("data/e2e-apkbc-suggested-query.tsv"),
            Charsets.UTF_8);
    Set<String> ids = new HashSet<>();
    for (String line : lines) {
        ids.add(line.split("\t")[0]);
    }
    Pattern pattern = Pattern.compile("<a href=\"([^>\"]*)\" onmousedown=\"");
    BoilerpipeExtractor extractor = CommonExtractors.ARTICLE_EXTRACTOR;
    ExecutorService es = Executors.newFixedThreadPool(10);
    System.out.println(ids.size());
    DecimalFormat df = new DecimalFormat("00");
    for (String id : ids) {
        String googleHtml = Files.toString(new File("data/e2e-googlerp", id + ".html"),
                Charsets.UTF_8);
        Matcher matcher = pattern.matcher(googleHtml);
        int count = 0;
        while (matcher.find()) {
            count++;
            // check existence
            File docHtmlFile = new File("data/e2e-context", id + "-" + df.format(count) + ".html");
            File docTextFile = new File("data/e2e-context", id + "-" + df.format(count) + ".txt");
            if (docHtmlFile.exists() && docTextFile.exists()) {
                continue;
            }
            // get url
            String url = matcher.group(1);
            if (url.contains("wikihow") || url.contains("google")) {
                continue;
            }
            es.execute(() -> {
                System.out.println(id + " " + url);
                // download url
                try {
                    String docHtml = Request.Get(url).connectTimeout(2000).socketTimeout(2000).execute()
                            .returnContent().asString();
                    Files.write(docHtml, docHtmlFile, Charsets.UTF_8);
                    String docText = extractor.getText(docHtml);
                    Files.write(docText, docTextFile, Charsets.UTF_8);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
    }
    es.shutdown();
    if (!es.awaitTermination(5, TimeUnit.SECONDS)) {
        System.out.println("Timeout occurs for one or some concept retrieval service.");
    }
}
Developer ID: ziy, project: pkb, lines of code: 52, source: AutomaticProceduralKnowledgeBaseConstructor.java
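The link matching and file naming in Example 8 can be tried in isolation. The sketch below reuses the regular expression and the "00" number format from the example; the class LinkNamingSketch, the plan method, and the sample URLs are hypothetical. Pinning the digit symbols to Locale.ROOT is an addition of this sketch so that the output does not depend on the default locale.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkNamingSketch {
    // Same expression as in the example: captures the href of anchors that carry an onmousedown attribute.
    private static final Pattern LINK =
            Pattern.compile("<a href=\"([^>\"]*)\" onmousedown=\"");

    /** Returns "fileName <- url" entries for every matched link that is not filtered out. */
    static List<String> plan(String id, String html) {
        DecimalFormat df = new DecimalFormat("00", DecimalFormatSymbols.getInstance(Locale.ROOT));
        List<String> out = new ArrayList<>();
        Matcher matcher = LINK.matcher(html);
        int count = 0;
        while (matcher.find()) {
            count++; // incremented before filtering, so skipped links leave gaps in the numbering
            String url = matcher.group(1);
            if (url.contains("wikihow") || url.contains("google")) {
                continue;
            }
            out.add(id + "-" + df.format(count) + ".html <- " + url);
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://a.example/x\" onmousedown=\"f()\">A</a>"
                + "<a href=\"http://www.google.com/y\" onmousedown=\"f()\">G</a>"
                + "<a href=\"http://b.example/z\" onmousedown=\"f()\">B</a>";
        for (String entry : plan("q1", html)) {
            System.out.println(entry);
        }
    }
}
```

Because the counter advances for every match, the file number reflects the link's position in the result page rather than the number of files written.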
Example 9: BoilerpipeContentHandler
import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
/**
 * Creates a new boilerpipe-based content extractor, using the given
 * extraction rules. The extracted main content will be passed to the
 * {@code delegate} content handler.
 *
 * @param delegate
 *            The {@link ContentHandler} object
 * @param extractor
 *            Extraction rules to use, e.g. {@link ArticleExtractor}
 */
public BoilerpipeContentHandler(ContentHandler delegate, BoilerpipeExtractor extractor) {
    this.td = null;
    this.delegate = delegate;
    this.extractor = extractor;
}
Developer ID: kolbasa, project: OCRaptor, lines of code: 16, source: BoilerpipeContentHandler.java
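Note: The de.l3s.boilerpipe.BoilerpipeExtractor examples in this article were collected from source code and documentation hosting platforms such as GitHub and MSDocs. The snippets were selected from open-source projects, and copyright belongs to the original authors. For distribution and use, refer to the license of the corresponding project; do not reproduce without permission.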