This article collects typical usage examples of the Java class crawlercommons.robots.BaseRobotRules. If you have been wondering what BaseRobotRules is for, how to use it, or what real-world code that uses it looks like, the curated examples below should help.
The BaseRobotRules class belongs to the crawlercommons.robots package. A total of 20 code examples of the class are shown below, sorted by popularity by default.
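Before diving into the project-specific examples, here is a minimal, self-contained sketch of the typical workflow around BaseRobotRules: parse a robots.txt payload with crawler-commons' SimpleRobotRulesParser and query the resulting rules. The class and method names come from crawler-commons itself (the String-based parseContent signature is the same one used in Example 3 below; newer releases accept a collection of agent names instead), while the class name RobotRulesSketch, the URL, the agent name, and the robots.txt content are placeholder assumptions for illustration only.

import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotRulesSketch {
    public static void main(String[] args) {
        // Placeholder robots.txt content; a real crawler would fetch this from the target site.
        byte[] robotsTxt = ("User-agent: *\n"
                + "Disallow: /private/\n"
                + "Crawl-delay: 5\n"
                + "Sitemap: http://www.example.com/sitemap.xml\n")
                .getBytes(StandardCharsets.UTF_8);

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
                "http://www.example.com/robots.txt", // URL the content was fetched from
                robotsTxt,
                "text/plain",                        // content type of the fetched file
                "my-crawler");                       // agent name matched against User-agent lines

        System.out.println(rules.isAllowed("http://www.example.com/index.html")); // true
        System.out.println(rules.isAllowed("http://www.example.com/private/a"));  // false
        System.out.println(rules.getCrawlDelay());  // crawl delay as parsed from robots.txt
        System.out.println(rules.getSitemaps());    // [http://www.example.com/sitemap.xml]
    }
}

The examples that follow show the same pattern embedded in real crawlers: obtaining a BaseRobotRules instance (from a parser or a cache) and calling isAllowed, getCrawlDelay, or getSitemaps before fetching.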
Example 1: main
import crawlercommons.robots.BaseRobotRules; // import the required package/class
public static void main(String args[]) throws Exception {
    HttpProtocol protocol = new HttpProtocol();
    String url = args[0];
    Config conf = ConfUtils.loadConf(args[1]);
    protocol.configure(conf);
    if (!protocol.skipRobots) {
        BaseRobotRules rules = protocol.getRobotRules(url);
        System.out.println("is allowed : " + rules.isAllowed(url));
    }
    Metadata md = new Metadata();
    ProtocolResponse response = protocol.getProtocolOutput(url, md);
    System.out.println(url);
    System.out.println(response.getMetadata());
    System.out.println(response.getStatusCode());
    System.out.println(response.getContent().length);
}
Developer: zaizi, Project: alfresco-apache-storm-demo, Lines: 21, Source file: HttpProtocol.java
Example 2: filter
import crawlercommons.robots.BaseRobotRules; // import the required package/class
@Override
public String filter(URL sourceUrl, Metadata sourceMetadata,
        String urlToFilter) {
    URL target;
    try {
        target = new URL(urlToFilter);
    } catch (MalformedURLException e) {
        return null;
    }
    // check whether the source and target have the same hostname;
    // if not, skip the robots.txt check and let the URL through
    if (limitToSameHost) {
        if (!target.getHost().equalsIgnoreCase(sourceUrl.getHost())) {
            return urlToFilter;
        }
    }
    BaseRobotRules rules = robots.getRobotRulesSet(
            factory.getProtocol(target), urlToFilter);
    if (!rules.isAllowed(urlToFilter)) {
        return null;
    }
    return urlToFilter;
}
Developer: DigitalPebble, Project: storm-crawler, Lines: 25, Source file: RobotsFilter.java
Example 3: parseRobotsTxt
import crawlercommons.robots.BaseRobotRules; // import the required package/class
public static void parseRobotsTxt(String userAgent, String robotsUrl, String robotsTxt, HtmlAnalysisResult result) {
    result.setRobotsTxt(robotsTxt);
    SimpleRobotRulesParser robotsParser = new SimpleRobotRulesParser();
    BaseRobotRules robotRules = robotsParser.parseContent(robotsUrl, robotsTxt.getBytes(), null, userAgent);
    result.setRobotsAllowedAll(robotRules.isAllowAll());
    result.setRobotsAllowedNone(robotRules.isAllowNone());
    result.setRobotsAllowedHome(robotRules.isAllowed("/"));
    result.setRobotsSitemaps(robotRules.getSitemaps());
    result.setRobotsCrawlDelay(robotRules.getCrawlDelay());
}
Developer: tokenmill, Project: crawling-framework, Lines: 11, Source file: PageAnalyzer.java
Example 4: getRules
import crawlercommons.robots.BaseRobotRules; // import the required package/class
protected BaseRobotRules getRules(URI uri) {
    try {
        return RobotUtils.getRobotRules(fetcher, parser,
                new URL(uri.getScheme(), uri.getHost(), uri.getPort(), ROBOTS_FILE_NAME));
    } catch (MalformedURLException e) {
        LOGGER.error("URL of robots.txt file is malformed. Returning rules for HTTP 400.");
        return parser.failedFetch(400);
    }
}
Developer: dice-group, Project: Squirrel, Lines: 10, Source file: RobotsManagerImpl.java
Example 5: getMinWaitingTime
import crawlercommons.robots.BaseRobotRules; // import the required package/class
@Override
public long getMinWaitingTime(URI uri) {
    BaseRobotRules rules = getRules(uri);
    // getCrawlDelay() returns a negative value when robots.txt defines no Crawl-delay,
    // so a non-positive delay is treated as "no waiting time required"
    long delay = rules.getCrawlDelay();
    if (delay <= 0) {
        return 0;
    } else {
        return delay;
    }
}
Developer: dice-group, Project: Squirrel, Lines: 11, Source file: RobotsManagerImpl.java
Example 6: getRobotRulesSet
import crawlercommons.robots.BaseRobotRules; // import the required package/class
public BaseRobotRules getRobotRulesSet(Protocol protocol, Text url) {
    URL u = null;
    try {
        u = new URL(url.toString());
    } catch (Exception e) {
        return EMPTY_RULES;
    }
    return getRobotRulesSet(protocol, u);
}
Developer: jorcox, Project: GeoCrawler, Lines: 10, Source file: RobotRulesParser.java
Example 7: getRobotRulesSet
import crawlercommons.robots.BaseRobotRules; // import the required package/class
public BaseRobotRules getRobotRulesSet(Protocol protocol, String url) {
    URL u = null;
    try {
        u = new URL(url);
    } catch (Exception e) {
        return EMPTY_RULES;
    }
    return getRobotRulesSet(protocol, u);
}
Developer: zaizi, Project: alfresco-apache-storm-demo, Lines: 10, Source file: RobotRulesParser.java
Example 8: isDisallowedByRobots
import crawlercommons.robots.BaseRobotRules; // import the required package/class
public boolean isDisallowedByRobots(LinkRelevance link) {
    String hostname = link.getURL().getHost();
    BaseRobotRules rules = robotRulesMap.get(hostname);
    return rules != null && !rules.isAllowed(link.getURL().toString());
}
Developer: ViDA-NYU, Project: ache, Lines: 6, Source file: Frontier.java
Example 9: processRobot
import crawlercommons.robots.BaseRobotRules; // import the required package/class
private void processRobot(LinkRelevance link, FetchedResult response, boolean fetchFailed) {
    BaseRobotRules robotRules;
    if (fetchFailed || response == null) {
        robotRules = parser.failedFetch(HttpStatus.SC_GONE);
    } else {
        String contentType = response.getContentType();
        boolean isPlainText = (contentType != null) && (contentType.startsWith("text/plain"));
        if ((response.getNumRedirects() > 0) && !isPlainText) {
            robotRules = parser.failedFetch(HttpStatus.SC_GONE);
        } else {
            robotRules = parser.parseContent(
                response.getFetchedUrl(),
                response.getContent(),
                response.getContentType(),
                userAgentName
            );
        }
    }
    try {
        RobotsData robotsData = new RobotsData(link, robotRules);
        linkStorage.insert(robotsData);
    } catch (Exception e) {
        logger.error("Failed to insert robots.txt data into link storage.", e);
    }
}
Developer: ViDA-NYU, Project: ache, Lines: 30, Source file: RobotsTxtHandler.java
Example 10: failedFetch
import crawlercommons.robots.BaseRobotRules; // import the required package/class
@Override
public BaseRobotRules failedFetch(int httpStatusCode) {
    ExtendedRobotRules result;
    if ((httpStatusCode >= 200) && (httpStatusCode < 300)) {
        throw new IllegalStateException("Can't use status code constructor with 2xx response");
    } else if ((httpStatusCode >= 300) && (httpStatusCode < 400)) {
        // Should only happen if we're getting endless redirects (more than our
        // follow limit), so treat it as a temporary failure.
        result = new ExtendedRobotRules(RobotRulesMode.ALLOW_NONE);
        result.setDeferVisits(true);
    } else if ((httpStatusCode >= 400) && (httpStatusCode < 500)) {
        // Some sites return 410 (gone) instead of 404 (not found), so treat them the same.
        // In fact, treat all 4xx responses (including forbidden) as "no robots.txt",
        // as that's what Google and other search engines do.
        result = new ExtendedRobotRules(RobotRulesMode.ALLOW_ALL);
    } else {
        // Treat all other status codes as a temporary failure.
        result = new ExtendedRobotRules(RobotRulesMode.ALLOW_NONE);
        result.setDeferVisits(true);
    }
    return result;
}
Developer: Treydone, Project: mandrel, Lines: 28, Source file: ExtendedRobotRulesParser.java
Example 11: getRobotRulesSet
import crawlercommons.robots.BaseRobotRules; // import the required package/class
public BaseRobotRules getRobotRulesSet(Protocol protocol, String url) {
    URL u;
    try {
        u = new URL(url);
    } catch (Exception e) {
        return EMPTY_RULES;
    }
    return getRobotRulesSet(protocol, u);
}
Developer: DigitalPebble, Project: storm-crawler, Lines: 10, Source file: RobotRulesParser.java
Example 12: isUriCrawlable
import crawlercommons.robots.BaseRobotRules; // import the required package/class
@Override
public boolean isUriCrawlable(URI uri) {
    BaseRobotRules rules = getRules(uri);
    return rules.isAllowed(uri.toString());
}
Developer: dice-group, Project: Squirrel, Lines: 6, Source file: RobotsManagerImpl.java
Example 13: getRobotRules
import crawlercommons.robots.BaseRobotRules; // import the required package/class
/**
 * Get the robots rules for a given URL.
 */
public BaseRobotRules getRobotRules(Text url, CrawlDatum datum) {
    return robots.getRobotRulesSet(this, url);
}
Developer: jorcox, Project: GeoCrawler, Lines: 7, Source file: Ftp.java
Example 14: getRobotRulesSet
import crawlercommons.robots.BaseRobotRules; // import the required package/class
/**
 * For hosts whose robots rules have not been cached yet, this sends an FTP
 * request to the host of the given {@link URL}, fetches the robots.txt file,
 * parses the rules, and caches the resulting rules object to avoid repeating
 * the work in the future.
 *
 * @param ftp
 *          The {@link Protocol} object
 * @param url
 *          URL
 *
 * @return robotRules A {@link BaseRobotRules} object for the rules
 */
public BaseRobotRules getRobotRulesSet(Protocol ftp, URL url) {
    String protocol = url.getProtocol().toLowerCase(); // normalize to lower case
    String host = url.getHost().toLowerCase(); // normalize to lower case

    if (LOG.isTraceEnabled() && isWhiteListed(url)) {
        LOG.trace("Ignoring robots.txt (host is whitelisted) for URL: {}", url);
    }

    BaseRobotRules robotRules = CACHE.get(protocol + ":" + host);
    if (robotRules != null) {
        return robotRules; // cached rule
    } else if (LOG.isTraceEnabled()) {
        LOG.trace("cache miss " + url);
    }

    boolean cacheRule = true;

    if (isWhiteListed(url)) {
        // check in advance whether a host is whitelisted
        // (we do not need to fetch robots.txt)
        robotRules = EMPTY_RULES;
        LOG.info("Whitelisted host found for: {}", url);
        LOG.info("Ignoring robots.txt for all URLs from whitelisted host: {}", host);
    } else {
        try {
            Text robotsUrl = new Text(new URL(url, "/robots.txt").toString());
            ProtocolOutput output = ((Ftp) ftp).getProtocolOutput(robotsUrl,
                    new CrawlDatum());
            ProtocolStatus status = output.getStatus();

            if (status.getCode() == ProtocolStatus.SUCCESS) {
                robotRules = parseRules(url.toString(), output.getContent()
                        .getContent(), CONTENT_TYPE, agentNames);
            } else {
                robotRules = EMPTY_RULES; // use default rules
            }
        } catch (Throwable t) {
            if (LOG.isInfoEnabled()) {
                LOG.info("Couldn't get robots.txt for " + url + ": " + t.toString());
            }
            cacheRule = false; // try again later to fetch robots.txt
            robotRules = EMPTY_RULES;
        }
    }

    if (cacheRule)
        CACHE.put(protocol + ":" + host, robotRules); // cache rules for host

    return robotRules;
}
Developer: jorcox, Project: GeoCrawler, Lines: 69, Source file: FtpRobotRulesParser.java
Example 15: getRobotRules
import crawlercommons.robots.BaseRobotRules; // import the required package/class
public BaseRobotRules getRobotRules(Text url, CrawlDatum datum) {
    return robots.getRobotRulesSet(this, url);
}
Developer: jorcox, Project: GeoCrawler, Lines: 4, Source file: HttpBase.java
Example 16: getRobotRules
import crawlercommons.robots.BaseRobotRules; // import the required package/class
@Override
public BaseRobotRules getRobotRules(String url) {
    if (this.skipRobots)
        return RobotRulesParser.EMPTY_RULES;
    return robots.getRobotRulesSet(this, url);
}
Developer: zaizi, Project: alfresco-apache-storm-demo, Lines: 7, Source file: HttpProtocol.java
Example 17: getRobotRulesSet
import crawlercommons.robots.BaseRobotRules; // import the required package/class
/**
 * Get the rules from robots.txt which apply to the given {@code url}.
 * Robot rules are cached for a unique combination of host, protocol, and
 * port. If no rules are found in the cache, an HTTP request is sent to fetch
 * {@code protocol://host:port/robots.txt}. The robots.txt is then parsed and
 * the rules are cached to avoid re-fetching and re-parsing it again.
 *
 * @param http
 *          The {@link Protocol} object
 * @param url
 *          URL robots.txt applies to
 *
 * @return {@link BaseRobotRules} holding the rules from robots.txt
 */
@Override
public BaseRobotRules getRobotRulesSet(Protocol http, URL url) {
    String cacheKey = getCacheKey(url);
    BaseRobotRules robotRules = CACHE.get(cacheKey);

    boolean cacheRule = true;

    if (robotRules == null) { // cache miss
        URL redir = null;
        LOG.trace("cache miss {}", url);
        try {
            ProtocolResponse response = http.getProtocolOutput(new URL(url,
                    "/robots.txt").toString(), Metadata.empty);
            // try one level of redirection ?
            if (response.getStatusCode() == 301
                    || response.getStatusCode() == 302) {
                String redirection = response.getMetadata().getFirstValue(
                        HttpHeaders.LOCATION);
                if (StringUtils.isNotBlank(redirection)) {
                    if (!redirection.startsWith("http")) {
                        // RFC says it should be absolute, but apparently it
                        // isn't
                        redir = new URL(url, redirection);
                    } else {
                        redir = new URL(redirection);
                    }
                    response = http.getProtocolOutput(redir.toString(),
                            Metadata.empty);
                }
            }

            if (response.getStatusCode() == 200) { // found rules: parse them
                String ct = response.getMetadata().getFirstValue(
                        HttpHeaders.CONTENT_TYPE);
                robotRules = parseRules(url.toString(),
                        response.getContent(), ct, agentNames);
            } else if ((response.getStatusCode() == 403)
                    && (!allowForbidden))
                robotRules = FORBID_ALL_RULES; // use forbid all
            else if (response.getStatusCode() >= 500) {
                cacheRule = false;
                robotRules = EMPTY_RULES;
            } else
                robotRules = EMPTY_RULES; // use default rules
        } catch (Throwable t) {
            LOG.info("Couldn't get robots.txt for {} : {}", url,
                    t.toString());
            cacheRule = false;
            robotRules = EMPTY_RULES;
        }

        if (cacheRule) {
            CACHE.put(cacheKey, robotRules); // cache rules for host
            if (redir != null
                    && !redir.getHost().equalsIgnoreCase(url.getHost())) {
                // cache also for the redirected host
                CACHE.put(getCacheKey(redir), robotRules);
            }
        }
    }

    return robotRules;
}
Developer: zaizi, Project: alfresco-apache-storm-demo, Lines: 80, Source file: HttpRobotRulesParser.java
Example 18: Frontier
import crawlercommons.robots.BaseRobotRules; // import the required package/class
public Frontier(String directory, int maxCacheUrlsSize, DB persistentHashtableBackend) {
    this.urlRelevance = new PersistentHashtable<>(directory, maxCacheUrlsSize,
            LinkRelevance.class, persistentHashtableBackend);
    this.robotRulesMap = new PersistentHashtable<>(directory + "_robots", maxCacheUrlsSize,
            BaseRobotRules.class, persistentHashtableBackend);
}
Developer: ViDA-NYU, Project: ache, Lines: 7, Source file: Frontier.java
Example 19: RobotsData
import crawlercommons.robots.BaseRobotRules; // import the required package/class
public RobotsData(LinkRelevance link, BaseRobotRules robotRules) {
    this.link = link;
    this.robotRules = robotRules;
}
Developer: ViDA-NYU, Project: ache, Lines: 5, Source file: RobotsTxtHandler.java
Example 20: getRobotRules
import crawlercommons.robots.BaseRobotRules; // import the required package/class
@Override
public BaseRobotRules getRobotRules(String url) {
    return RobotRulesParser.EMPTY_RULES;
}
Developer: DigitalPebble, Project: storm-crawler, Lines: 5, Source file: FileProtocol.java
Note: The crawlercommons.robots.BaseRobotRules examples in this article were collected from GitHub, MSDocs, and other source-code and documentation hosting platforms. The snippets come from open-source projects contributed by their respective developers; copyright remains with the original authors, and redistribution or use should follow the corresponding project's license. Please do not repost without permission.