
Java BoilerpipeExtractor Class Code Examples


This article collects typical usage examples of de.l3s.boilerpipe.BoilerpipeExtractor in Java. If you are wondering how the BoilerpipeExtractor class is used in practice, the selected examples below may help.



The BoilerpipeExtractor class belongs to the de.l3s.boilerpipe package. Nine code examples of the class are shown below, sorted by popularity by default.

Example 1: process

import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
/**
 * Returns the article from a document, keeping its basic HTML structure.
 *
 * @param htmlDoc   the HTML document to process
 * @param docUri    the URI of the document, used to resolve relative anchors to absolute anchors
 * @param extractor the extractor to apply
 * @return the extracted article as HTML, or null if processing fails
 */
public String process(HTMLDocument htmlDoc, URI docUri, final BoilerpipeExtractor extractor) {

    final HTMLHighlighter hh = HTMLHighlighter.newExtractingInstance();
    hh.setOutputHighlightOnly(true);

    TextDocument doc;

    String text = "";
    try {
        doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
        extractor.process(doc);
        final InputSource is = htmlDoc.toInputSource();
        text = hh.process(doc, is);
    } catch (Exception ex) {
        return null;
    }


    return removeNotAllowedTags(text, docUri);
}
 
Developer: BartoszJarocki, Project: android-boilerpipe, Lines of code: 28, Source file: HtmlArticleExtractor.java


Example 2: process

import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
/**
 * Parses the media (pictures, videos) out of a document.
 *
 * @param doc document to parse the media out of
 * @param extractor extractor to use
 * @return list of extracted media (empty if no media is found), or null if parsing fails
 */
public List<Media> process(String doc, final BoilerpipeExtractor extractor) {
	final HTMLDocument htmlDoc = new HTMLDocument(doc);
	List<Media> media = new ArrayList<Media>();
	TextDocument tdoc;

	try {
		tdoc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
		extractor.process(tdoc);
		final InputSource is = htmlDoc.toInputSource();
		media = process(tdoc, is);
	} catch (Exception e) {
		return null;
	}
	return media;
}
 
Developer: BartoszJarocki, Project: android-boilerpipe, Lines of code: 23, Source file: MediaExtractor.java


Example 3: process

import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
/**
 * Fetches the given {@link URL} using {@link HTMLFetcher} and processes the
 * retrieved HTML using the specified {@link BoilerpipeExtractor}.
 *
 * @param url       the URL to fetch
 * @param extractor the extractor to apply to the fetched document
 * @return A List of enclosed links
 * @throws BoilerpipeProcessingException
 */
public List<String> process(final URL url, final BoilerpipeExtractor extractor)
        throws IOException, BoilerpipeProcessingException, SAXException {
    final HTMLDocument htmlDoc = HTMLFetcher.fetch(url);

    final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource())
            .getTextDocument();
    extractor.process(doc);

    final InputSource is = htmlDoc.toInputSource();

    return process(doc, is);
}
 
Developer: asimihsan, Project: handytrowel, Lines of code: 22, Source file: LinkExtractor.java


Example 4: process

import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
public String process(final URL url, final BoilerpipeExtractor extractor)
        throws IOException, BoilerpipeProcessingException, SAXException {
    final HTMLDocument htmlDoc = HTMLFetcher.fetch(url);

    // Added to fix bug with unicode characters not being recognized by SAX parser on AppEngine (bug while appending chars to StringBuffer by offset)
    htmlDoc.encodeEscapedCharsAsText();

    // Added to support including images in extracted HTML output
    if (includeImages)
        htmlDoc.encodeImageTagsAsText();

    final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource())
                                     .getTextDocument();
    extractor.process(doc);

    final InputSource is = htmlDoc.toInputSource();

    String finalHtml = process(doc, is);

    // Added to fix bug with unicode characters not being recognized by SAX parser on AppEngine (bug while appending chars to StringBuffer by offset)
    finalHtml = HTMLDocument.restoreTextEncodedEscapedChars(finalHtml, htmlDoc.getCharset().name());

    // Added to support including images in extracted HTML output
    if (includeImages)
        finalHtml = HTMLDocument.restoreTextEncodedImageTags(finalHtml, htmlDoc.getCharset().name());

    return finalHtml;
}
 
Developer: BartoszJarocki, Project: android-boilerpipe, Lines of code: 29, Source file: HTMLHighlighter.java


Example 5: process

import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
/**
 * Fetches the given {@link java.net.URL} using {@link de.l3s.boilerpipe.sax.HTMLFetcher} and processes the
 * retrieved HTML using the specified {@link BoilerpipeExtractor}.
 *
 * @param url       the URL to fetch
 * @param extractor the extractor to apply to the fetched document
 * @return A List of enclosed {@link Image}s
 * @throws BoilerpipeProcessingException
 */
public List<Image> process(final URL url, final BoilerpipeExtractor extractor)
		throws IOException, BoilerpipeProcessingException, SAXException {
	final HTMLDocument htmlDoc = HTMLFetcher.fetch(url);

	final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource())
			.getTextDocument();
	extractor.process(doc);

	final InputSource is = htmlDoc.toInputSource();

	return process(doc, is);
}
 
Developer: BartoszJarocki, Project: android-boilerpipe, Lines of code: 22, Source file: ImageExtractor.java


Example 6: process

import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
/**
 * Fetches the given {@link URL} using {@link HTMLFetcher} and processes the
 * retrieved HTML using the specified {@link BoilerpipeExtractor}.
 *
 * @param url
 *            The URL to fetch.
 * @param extractor
 *            The extractor to apply to the fetched document.
 * @return A List of enclosed {@link Image}s
 * @throws BoilerpipeProcessingException
 */
public List<Image> process(final URL url, final BoilerpipeExtractor extractor)
		throws IOException, BoilerpipeProcessingException, SAXException {
	final HTMLDocument htmlDoc = HTMLFetcher.fetch(url);

	final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource())
			.getTextDocument();
	extractor.process(doc);

	final InputSource is = htmlDoc.toInputSource();

	return process(doc, is);
}
 
Developer: socialsensor, Project: storm-focused-crawler, Lines of code: 24, Source file: ImageExtractor.java


Example 7: main

import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
public static void main(String[] args) throws InterruptedException, IOException {
  List<String> lines = Files.readLines(new File("data/query.tsv"), Charsets.UTF_8);
  Set<String> ids = new HashSet<>();
  for (String line : lines) {
    ids.add(line.split("\t")[0]);
  }
  BoilerpipeExtractor extractor = CommonExtractors.ARTICLE_EXTRACTOR;
  ExecutorService es = Executors.newFixedThreadPool(10);
  System.out.println(ids.size());
  DecimalFormat df = new DecimalFormat("00");
  for (String id : ids) {
    String googleHtml = Files.toString(new File("data/googlerp", id + ".html"), Charsets.UTF_8);
    // `pattern` is not declared in this excerpt; presumably it is a field of the enclosing class
    Matcher matcher = pattern.matcher(googleHtml);
    int count = 0;
    while (matcher.find()) {
      count++;
      // check existence
      File docHtmlFile = new File("data/context", id + "-" + df.format(count) + ".html");
      File docTextFile = new File("data/context", id + "-" + df.format(count) + ".txt");
      if (docHtmlFile.exists() && docTextFile.exists()) {
        continue;
      }
      // get url
      String url = matcher.group(1);
      if (url.contains("wikihow") || url.contains("google")) {
        continue;
      }
      es.execute(() -> {
        System.out.println(id + " " + url);
        // download url
        try {
          String docHtml = Request.Get(url).connectTimeout(2000).socketTimeout(2000).execute()
                  .returnContent().asString();
          Files.write(docHtml, docHtmlFile, Charsets.UTF_8);
          String docText = extractor.getText(docHtml);
          Files.write(docText, docTextFile, Charsets.UTF_8);
        } catch (Exception e) {
          e.printStackTrace();
        }
      });
    }
  }
  es.shutdown();
  if (!es.awaitTermination(5, TimeUnit.MINUTES)) {
    System.out.println("Timeout occurs for one or some concept retrieval service.");
  }
}
 
Developer: ziy, Project: pkb, Lines of code: 48, Source file: ContextExtractor.java
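
The example above submits one download-and-extract task per URL to a fixed pool of 10 threads, catches failures inside each task, then calls shutdown() and awaitTermination(...). The following is a minimal sketch of that pattern using only the JDK. The class BoundedPoolSketch and the method runAll are hypothetical names, not part of pkb or boilerpipe, and the Function argument stands in for the fetch-plus-getText step. Unlike the example, the sketch also calls shutdownNow() on timeout, which is an added assumption about the desired behavior.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical helper for illustration; not part of pkb or boilerpipe.
final class BoundedPoolSketch {

    /**
     * Applies task to every input on a fixed-size pool and waits up to
     * timeoutSeconds for the queue to drain. A failing task is logged and
     * skipped, mirroring the try/catch inside the lambda in the example.
     */
    static List<String> runAll(List<String> inputs, Function<String, String> task,
                               int threads, long timeoutSeconds) {
        List<String> results = new CopyOnWriteArrayList<>(); // safe for concurrent adds
        ExecutorService es = Executors.newFixedThreadPool(threads);
        for (String input : inputs) {
            es.execute(() -> {
                try {
                    results.add(task.apply(input));
                } catch (RuntimeException e) {
                    System.err.println("failed: " + input + " (" + e + ")");
                }
            });
        }
        es.shutdown(); // stop accepting new tasks; already queued tasks still run
        try {
            if (!es.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)) {
                es.shutdownNow(); // assumption: interrupt stragglers on timeout
            }
        } catch (InterruptedException e) {
            es.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }
        return results;
    }
}
```

Because tasks finish in no particular order, the returned list is unordered. The timeout must cover the whole queue, not a single task, so it should be sized against the number of URLs and the per-request timeouts.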


Example 8: downloadSearchResult

import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
public static void downloadSearchResult() throws IOException, BoilerpipeProcessingException,
        SAXException, URISyntaxException, InterruptedException {
  List<String> lines = Files.readLines(new File("data/e2e-apkbc-suggested-query.tsv"),
          Charsets.UTF_8);
  Set<String> ids = new HashSet<>();
  for (String line : lines) {
    ids.add(line.split("\t")[0]);
  }
  Pattern pattern = Pattern.compile("<a href=\"([^>\"]*)\" onmousedown=\"");
  BoilerpipeExtractor extractor = CommonExtractors.ARTICLE_EXTRACTOR;
  ExecutorService es = Executors.newFixedThreadPool(10);
  System.out.println(ids.size());
  DecimalFormat df = new DecimalFormat("00");
  for (String id : ids) {
    String googleHtml = Files.toString(new File("data/e2e-googlerp", id + ".html"),
            Charsets.UTF_8);
    Matcher matcher = pattern.matcher(googleHtml);
    int count = 0;
    while (matcher.find()) {
      count++;
      // check existence
      File docHtmlFile = new File("data/e2e-context", id + "-" + df.format(count) + ".html");
      File docTextFile = new File("data/e2e-context", id + "-" + df.format(count) + ".txt");
      if (docHtmlFile.exists() && docTextFile.exists()) {
        continue;
      }
      // get url
      String url = matcher.group(1);
      if (url.contains("wikihow") || url.contains("google")) {
        continue;
      }
      es.execute(() -> {
        System.out.println(id + " " + url);
        // download url
        try {
          String docHtml = Request.Get(url).connectTimeout(2000).socketTimeout(2000).execute()
                  .returnContent().asString();
          Files.write(docHtml, docHtmlFile, Charsets.UTF_8);
          String docText = extractor.getText(docHtml);
          Files.write(docText, docTextFile, Charsets.UTF_8);
        } catch (Exception e) {
          e.printStackTrace();
        }
      });
    }
  }
  es.shutdown();
  if (!es.awaitTermination(5, TimeUnit.SECONDS)) {
    System.out.println("Timeout occurs for one or some concept retrieval service.");
  }
}
 
Developer: ziy, Project: pkb, Lines of code: 52, Source file: AutomaticProceduralKnowledgeBaseConstructor.java
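
The link-selection logic in the example above can be tried in isolation. The sketch below reuses the same regular expression and the same "00" counter format to show which URLs would be downloaded and under which file names. ResultLinkSketch and plan are hypothetical names introduced here, and the sample markup in the usage note is invented for illustration. Passing Locale.ROOT symbols to DecimalFormat is an added assumption so the digits do not depend on the default locale.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of pkb or boilerpipe.
final class ResultLinkSketch {

    // Same expression as in the example: captures the href of anchors
    // that are immediately followed by an onmousedown attribute.
    static final Pattern LINK = Pattern.compile("<a href=\"([^>\"]*)\" onmousedown=\"");

    /** Returns one "<id>-NN.html <url>" entry per link that would be downloaded. */
    static List<String> plan(String id, String html) {
        DecimalFormat df = new DecimalFormat("00", DecimalFormatSymbols.getInstance(Locale.ROOT));
        List<String> out = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        int count = 0;
        while (m.find()) {
            count++; // advances for skipped links too, so file numbers can have gaps
            String url = m.group(1);
            if (url.contains("wikihow") || url.contains("google")) {
                continue; // same exclusions as the example
            }
            out.add(id + "-" + df.format(count) + ".html " + url);
        }
        return out;
    }
}
```

For a page whose three matching anchors point to example.com, wikihow.com, and example.org in that order, plan("q1", html) yields entries numbered 01 and 03, because the counter is incremented before the exclusion check.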


Example 9: BoilerpipeContentHandler

import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
/**
 * Creates a new boilerpipe-based content extractor, using the given
 * extraction rules. The extracted main content will be passed to the
 * <delegate> content handler.
 *
 * @param delegate
 *            The {@link ContentHandler} object
 * @param extractor
 *            Extraction rules to use, e.g. {@link ArticleExtractor}
 */
public BoilerpipeContentHandler(ContentHandler delegate, BoilerpipeExtractor extractor) {
    this.td = null;
    this.delegate = delegate;
    this.extractor = extractor;
}
 
Developer: kolbasa, Project: OCRaptor, Lines of code: 16, Source file: BoilerpipeContentHandler.java



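Note: The de.l3s.boilerpipe.BoilerpipeExtractor examples in this article were collected from GitHub, MSDocs, and other source-code and documentation hosting platforms. The snippets were selected from open-source projects, and copyright remains with the original authors. Consult each project's License before distributing or using the code; do not reproduce without permission.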

