
Java ParseData Class Code Examples


This article collects typical usage examples of the Java class com.digitalpebble.stormcrawler.parse.ParseData. If you are wondering what ParseData is for and how it is used in practice, the curated examples below should help.



The ParseData class belongs to the com.digitalpebble.stormcrawler.parse package. Six code examples of the class are shown below, sorted by popularity by default.

Example 1: filter

import com.digitalpebble.stormcrawler.parse.ParseData; // import the required package/class
@Override
public void filter(String URL, byte[] content, DocumentFragment doc,
        ParseResult parse) {
    ParseData parseData = parse.get(URL);
    Metadata metadata = parseData.getMetadata();
    if (copyKeyName != null) {
        String signature = metadata.getFirstValue(key_name);
        if (signature != null) {
            metadata.setValue(copyKeyName, signature);
        }
    }
    byte[] data = null;
    if (useText) {
        String text = parseData.getText();
        if (StringUtils.isNotBlank(text)) {
            data = text.getBytes(StandardCharsets.UTF_8);
        }
    } else {
        data = content;
    }
    if (data == null) {
        data = URL.getBytes(StandardCharsets.UTF_8);
    }
    String hex = DigestUtils.md5Hex(data);
    metadata.setValue(key_name, hex);
}
 
Author: DigitalPebble | Project: storm-crawler | Lines: 27 | Source: MD5SignatureParseFilter.java
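Note that DigestUtils.md5Hex in the example above comes from Apache Commons Codec; the same hex digest can be produced with the JDK alone via java.security.MessageDigest. A minimal standalone sketch (the class name Md5Hex is illustrative, not part of storm-crawler):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Hex {
    // JDK-only equivalent of Commons Codec's DigestUtils.md5Hex(byte[])
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b)); // two lowercase hex chars per byte
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is guaranteed by the Java spec, so this should never happen
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(md5Hex("abc".getBytes(StandardCharsets.UTF_8)));
        // prints 900150983cd24fb0d6963f7d28e17f72
    }
}
```

The filter then stores this hex string in the page metadata under the configured key, so downstream components can deduplicate documents by signature.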


Example 2: filter

import com.digitalpebble.stormcrawler.parse.ParseData; // import the required package/class
@Override
public void filter(String URL, byte[] content, DocumentFragment doc,
        ParseResult parse) {
    if (doc == null) {
        return;
    }
    try {
        JsonNode json = filterJson(doc);
        if (json == null) {
            return;
        }

        ParseData parseData = parse.get(URL);
        Metadata metadata = parseData.getMetadata();

        // extract patterns and store as metadata
        for (LabelledJsonPointer expression : expressions) {
            JsonNode match = json.at(expression.pointer);
            if (match.isMissingNode()) {
                continue;
            }
            metadata.addValue(expression.label, match.asText());
        }

    } catch (Exception e) {
        LOG.error("Exception caught when extracting json", e);
    }

}
 
Author: DigitalPebble | Project: storm-crawler | Lines: 30 | Source: LDJsonParseFilter.java


Example 3: filter

import com.digitalpebble.stormcrawler.parse.ParseData; // import the required package/class
@Override
public void filter(String URL, byte[] content, DocumentFragment doc,
        ParseResult parse) {

    InputStream stream = new ByteArrayInputStream(content);

    try {
        DocumentBuilderFactory factory = DocumentBuilderFactory
                .newInstance();
        Document document = factory.newDocumentBuilder().parse(stream);
        Element root = document.getDocumentElement();

        XPath xPath = XPathFactory.newInstance().newXPath();
        XPathExpression expression = xPath.compile("//url");

        NodeList nodes = (NodeList) expression.evaluate(root,
                XPathConstants.NODESET);

        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);

            expression = xPath.compile("loc");
            Node child = (Node) expression.evaluate(node,
                    XPathConstants.NODE);

            // create a subdocument for each url found in the sitemap
            ParseData parseData = parse.get(child.getTextContent());

            NodeList childs = node.getChildNodes();
            for (int j = 0; j < childs.getLength(); j++) {
                Node n = childs.item(j);
                parseData.put(n.getNodeName(), n.getTextContent());
            }
        }
    } catch (Exception e) {
        LOG.error("Error processing sitemap from {}: {}", URL, e);
    }
}
 
Author: DigitalPebble | Project: storm-crawler | Lines: 39 | Source: SubDocumentsParseFilter.java
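The XPath machinery in Example 3 is plain JDK (javax.xml.xpath over a DOM document), so the extraction step can be tried in isolation. A self-contained sketch, assuming a simplified sitemap without namespaces (the class name SitemapLocExtractor and the sample URLs are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class SitemapLocExtractor {
    // Evaluate //url/loc over a sitemap document and collect the URL strings,
    // mirroring the XPath pattern used in SubDocumentsParseFilter.
    static List<String> extractLocs(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList locs = (NodeList) XPathFactory.newInstance().newXPath()
                .compile("//url/loc")
                .evaluate(doc.getDocumentElement(), XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < locs.getLength(); i++) {
            result.add(locs.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String sitemap = "<urlset>"
                + "<url><loc>http://example.com/a</loc><lastmod>2020-01-01</lastmod></url>"
                + "<url><loc>http://example.com/b</loc></url>"
                + "</urlset>";
        System.out.println(extractLocs(sitemap)); // both loc values, in document order
    }
}
```

In the real filter, each extracted loc becomes the key of a new ParseData subdocument, and the sibling elements of that url node are copied into it.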


Example 4: execute

import com.digitalpebble.stormcrawler.parse.ParseData; // import the required package/class
@Override
public void execute(Tuple tuple) {
    Metadata metadata = (Metadata) tuple.getValueByField("metadata");

    byte[] content = tuple.getBinaryByField("content");
    String url = tuple.getStringByField("url");

    boolean isFeed = Boolean.valueOf(metadata.getFirstValue(isFeedKey));

    if (!isFeed) {
        String ct = metadata.getFirstValue(HttpHeaders.CONTENT_TYPE);
        if (ct != null) {
            for (String clue : mimeTypeClues) {
                if (ct.contains(clue)) {
                    isFeed = true;
                    metadata.setValue(isFeedKey, "true");
                    LOG.info("Feed detected from content type <{}> for {}",
                            ct, url);
                    break;
                }
            }
        }
    }

    if (!isFeed) {
        if (contentDetector.matches(content)) {
            isFeed = true;
            metadata.setValue(isFeedKey, "true");
            LOG.info("Feed detected from content: {}", url);
        }
    }

    if (isFeed) {
        // do not parse but run parse filters
        ParseResult parse = new ParseResult();
        ParseData parseData = parse.get(url);
        parseData.setMetadata(metadata);
        parseFilters.filter(url, content, null, parse);
        // emit status
        collector.emit(Constants.StatusStreamName, tuple,
                new Values(url, metadata, Status.FETCHED));
    } else {
        // pass on
        collector.emit(tuple, tuple.getValues());
    }
    collector.ack(tuple);
}
 
Author: eorliac | Project: patent-crawler | Lines: 48 | Source: FeedDetectorBolt.java


Example 5: execute

import com.digitalpebble.stormcrawler.parse.ParseData; // import the required package/class
@Override
public void execute(Tuple tuple) {
    Metadata metadata = (Metadata) tuple.getValueByField("metadata");

    byte[] content = tuple.getBinaryByField("content");
    String url = tuple.getStringByField("url");

    boolean isSitemap = Boolean.valueOf(
            metadata.getFirstValue(SiteMapParserBolt.isSitemapKey));
    boolean isNewsSitemap = Boolean.valueOf(
            metadata.getFirstValue(NewsSiteMapParserBolt.isSitemapNewsKey));

    if (!isNewsSitemap || !isSitemap) {
        int match = contentDetector.getFirstMatch(content);
        if (match >= 0) {
            // a sitemap, not necessarily a news sitemap
            isSitemap = true;
            metadata.setValue(SiteMapParserBolt.isSitemapKey, "true");
            if (match <= NewsSiteMapParserBolt.contentCluesSitemapNewsMatchUpTo) {
                isNewsSitemap = true;
                LOG.info("{} detected as news sitemap based on content",
                        url);
                metadata.setValue(NewsSiteMapParserBolt.isSitemapNewsKey,
                        "true");
            }
        }
    }

    if (isSitemap) {
        // do not parse but run parse filters
        ParseResult parse = new ParseResult();
        ParseData parseData = parse.get(url);
        parseData.setMetadata(metadata);
        parseFilters.filter(url, content, null, parse);
        // emit status
        collector.emit(Constants.StatusStreamName, tuple,
                new Values(url, metadata, Status.FETCHED));
    } else {
        // pass on
        collector.emit(tuple, tuple.getValues());
    }
    collector.ack(tuple);
}
 
Author: commoncrawl | Project: news-crawl | Lines: 44 | Source: NewsSiteMapDetectorBolt.java


Example 6: filter

import com.digitalpebble.stormcrawler.parse.ParseData; // import the required package/class
@Override
public void filter(String URL, byte[] content, DocumentFragment doc,
        ParseResult parse) {

    ParseData pd = parse.get(URL);

    // TODO determine how to restrict the expressions e.g. regexp on URL
    // or value in metadata

    // iterates on the expressions - stops at the first that matches
    for (LabelledExpression expression : expressions) {
        try {
            NodeList evalResults = (NodeList) expression.evaluate(doc,
                    XPathConstants.NODESET);
            if (evalResults.getLength() == 0) {
                continue;
            }
            StringBuilder newText = new StringBuilder();
            for (int i = 0; i < evalResults.getLength(); i++) {
                Node node = evalResults.item(i);
                newText.append(node.getTextContent()).append("\n");
            }

            // ignore if no text captured
            if (StringUtils.isBlank(newText.toString())) {
                LOG.debug(
                        "Found match for doc {} but empty text extracted - skipping",
                        URL);
                continue;
            }

            // give the doc its new text value
            LOG.debug(
                    "Restricted text for doc {}. Text size was {} and is now {}",
                    URL, pd.getText().length(), newText.length());

            pd.setText(newText.toString());

            pd.getMetadata().setValue(MATCH_KEY, expression.getLabel());

            return;
        } catch (XPathExpressionException e) {
            LOG.error("Caught XPath expression exception", e);
        }
    }

}
 
Author: DigitalPebble | Project: storm-crawler | Lines: 48 | Source: ContentFilter.java



Note: the com.digitalpebble.stormcrawler.parse.ParseData examples in this article were collected from open-source projects hosted on GitHub and similar code/documentation platforms. Copyright of the source code remains with the original authors; consult each project's license before redistributing or reusing it.

