This article collects typical usage examples of the Java class org.htmlparser.util.SimpleNodeIterator. If you are wondering how the SimpleNodeIterator class is used in practice, the selected examples below may help.
The SimpleNodeIterator class belongs to the org.htmlparser.util package. 17 code examples of the class are shown below, sorted by popularity by default.
Example 1: processNodeList
import org.htmlparser.util.SimpleNodeIterator; // import the required package/class
private static void processNodeList(NodeList list, String keyword) {
    // Start iterating
    SimpleNodeIterator iterator = list.elements();
    while (iterator.hasMoreNodes()) {
        Node node = iterator.nextNode();
        // Get this node's list of child nodes
        NodeList childList = node.getChildren();
        // No children means this is a leaf (value) node
        if (null == childList) {
            // Get the leaf node's text
            String result = node.toPlainTextString();
            // If it contains the keyword, simply print the text
            if (result.indexOf(keyword) != -1)
                System.out.println(result);
        } // end if
        // Children exist, so recurse into them
        else {
            processNodeList(childList, keyword);
        } // end else
    } // end while
}
Developer ID: YufangWoo, Project: news-crawler, Lines of code: 22, Source: HtmlParserTest.java
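The recursion in Example 1 can be tried without the parser library. The following is a minimal sketch using only the JDK: `SketchNode` is a hypothetical stand-in for a parsed node (it is not part of org.htmlparser), where a null child list marks a leaf, mirroring the `null == childList` test above. It collects matches into a list instead of printing them, which is an assumption made here so the result can be inspected.

```java
import java.util.ArrayList;
import java.util.List;

class LeafTextSearch {
    /** Hypothetical stand-in for a parsed node: children == null marks a leaf. */
    static final class SketchNode {
        final String text;
        final List<SketchNode> children; // null for leaf nodes

        SketchNode(String text, List<SketchNode> children) {
            this.text = text;
            this.children = children;
        }
    }

    /** Depth-first walk that collects the text of every leaf containing keyword. */
    static List<String> findLeafText(List<SketchNode> nodes, String keyword) {
        List<String> hits = new ArrayList<>();
        collect(nodes, keyword, hits);
        return hits;
    }

    private static void collect(List<SketchNode> nodes, String keyword, List<String> hits) {
        for (SketchNode node : nodes) {
            if (node.children == null) {
                // Leaf: test its text against the keyword
                if (node.text.contains(keyword)) {
                    hits.add(node.text);
                }
            } else {
                // Inner node: recurse into the children
                collect(node.children, keyword, hits);
            }
        }
    }

    public static void main(String[] args) {
        List<SketchNode> page = List.of(
                new SketchNode(null, List.of(
                        new SketchNode("breaking news", null),
                        new SketchNode("weather", null))),
                new SketchNode("news archive", null));
        System.out.println(findLeafText(page, "news"));
    }
}
```

Returning the matches keeps the traversal separate from whatever the caller does with them (printing, storing, counting).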
Example 2: getGangliaAttribute
import org.htmlparser.util.SimpleNodeIterator; // import the required package/class
public List<String> getGangliaAttribute(String clusterName)
        throws ParserException, MalformedURLException, IOException {
    String url = gangliaMetricUrl.replaceAll(clusterPattern, clusterName);
    Parser parser = new Parser(new URL(url).openConnection());
    NodeFilter nodeFilter = new AndFilter(new TagNameFilter("select"),
            new HasAttributeFilter("id", "metrics-picker"));
    NodeList nodeList = parser.extractAllNodesThatMatch(nodeFilter);
    SimpleNodeIterator iterator = nodeList.elements();
    List<String> metricList = new ArrayList<String>();
    while (iterator.hasMoreNodes()) {
        Node node = iterator.nextNode();
        SimpleNodeIterator childIterator = node.getChildren().elements();
        while (childIterator.hasMoreNodes()) {
            OptionTag children = (OptionTag) childIterator.nextNode();
            metricList.add(children.getOptionText());
        }
    }
    return metricList;
}
Developer ID: Ctrip-DI, Project: Hue-Ctrip-DI, Lines of code: 23, Source: GangliaHttpParser.java
Example 3: main
import org.htmlparser.util.SimpleNodeIterator; // import the required package/class
public static void main(String[] args) throws Exception {
    Parser parser = new Parser(new URL("http://10.8.75.3/ganglia/?r=hour&cs=&ce=&s=by+name&c=Zookeeper_Cluster&tab=m&vn=&hide-hf=false").openConnection());
    NodeFilter nodeFilter = new AndFilter(new TagNameFilter("select"),
            new HasAttributeFilter("id", "metrics-picker"));
    NodeList nodeList = parser.extractAllNodesThatMatch(nodeFilter);
    SimpleNodeIterator iterator = nodeList.elements();
    while (iterator.hasMoreNodes()) {
        Node node = iterator.nextNode();
        SimpleNodeIterator childIterator = node.getChildren().elements();
        while (childIterator.hasMoreNodes()) {
            OptionTag children = (OptionTag) childIterator.nextNode();
            System.out.println(children.getOptionText());
        }
    }
}
Developer ID: Ctrip-DI, Project: Hue-Ctrip-DI, Lines of code: 18, Source: TestGangliaHttpParser.java
Example 4: run
import org.htmlparser.util.SimpleNodeIterator; // import the required package/class
@Override
public void run() {
    try {
        parser = new Parser(content);
        logger.info(currentThread().getName() + " started parsing the HTML of the POST response and storing it in HBase!");
        NodeIterator rootList = parser.elements();
        rootList.nextNode();
        NodeList nodeList = rootList.nextNode().getChildren();
        // System.out.println("===================="+nodeList.size());
        /*
         * Check whether the HTML response has real content; this triggers on error or once all data has been read.
         * If it triggers, set endFlag so that no new threads are started and the current task ends.
         */
        if (nodeList.size() <= 4) {
            program.endFlag = true;
        }
        /*
         * Find the matching tag records, then parse them
         */
        nodeList.remove(0);
        nodeList.remove(0);
        SimpleNodeIterator childList = nodeList.elements();
        while (childList.hasMoreNodes()) {
            Node node = childList.nextNode();
            if (node.getChildren() != null) {
                toObject(node);
            }
        }
    } catch (Exception e) {
        logger.error(currentThread().getName() + " exception while parsing the HTML file!\n"+e.getMessage()+"\n");
    } finally {
        logger.info(currentThread().getName() + " finished parsing the HTML file!");
        store.close();
    }
}
Developer ID: husky00, Project: worm, Lines of code: 36, Source: PostRequestHtmlParser.java
Example 5: listarCidades
import org.htmlparser.util.SimpleNodeIterator; // import the required package/class
@ApiMethod(name = "listarCidades")
public ListaEstadosCidades listarCidades(@Named("state") String state) throws Exception {
    inicializaMapaEstados();
    if (mapaCidades == null) {
        mapaCidades = new HashMap<String,Map<String,String>>();
    }
    if (!mapaCidades.containsKey(state)) {
        Map<String,String> mapa = new HashMap<String, String>();
        mapaCidades.put(state, mapa);
        String responseBody = recuperarDados(mapaEstados.get(state), null);
        NodeList nodeList = filterSelectNode(responseBody);
        Node cidadeNode = nodeList.elementAt(2);
        SimpleNodeIterator iteratorEstado = cidadeNode.getChildren().elements();
        while (iteratorEstado.hasMoreNodes()) {
            OptionTag node = (OptionTag) iteratorEstado.nextNode();
            String cidadeId = node.getValue();
            String cidadeNome = node.getChildren().elements().nextNode().getText();
            // Skip the placeholder option
            if (!cidadeNome.contains("Selecione")) {
                //System.out.println(cidadeId+","+cidadeNome+","+mapaEstados.get(state));
                mapa.put(cidadeNome, cidadeId);
            }
        }
    }
    ListaEstadosCidades listaEstados = new ListaEstadosCidades();
    listaEstados.setLista(new ArrayList<String>(mapaCidades.get(state).keySet()));
    return listaEstados;
}
Developer ID: emivaljr, Project: hojenaoapp, Lines of code: 29, Source: MyEndpoint.java
Example 6: preencheMapaEstados
import org.htmlparser.util.SimpleNodeIterator; // import the required package/class
private void preencheMapaEstados() throws IOException, ParserException {
    String responseBody = recuperarDados(null, null);
    NodeList nodeList = filterSelectNode(responseBody);
    Node estadoNode = nodeList.elementAt(1);
    SimpleNodeIterator iteratorEstado = estadoNode.getChildren().elements();
    while (iteratorEstado.hasMoreNodes()) {
        OptionTag node = (OptionTag) iteratorEstado.nextNode();
        String estadoId = node.getValue();
        String estadoNome = node.getChildren().elements().nextNode().getText();
        //System.out.println(estadoId+","+estadoNome);
        mapaEstados.put(estadoNome, estadoId);
    }
}
Developer ID: emivaljr, Project: hojenaoapp, Lines of code: 15, Source: MyEndpoint.java
Example 7: main
import org.htmlparser.util.SimpleNodeIterator; // import the required package/class
public static void main(String[] args) {
    try {
        URL url = new URL(pro.getProperty("mlink"));
        SocketAddress address = new InetSocketAddress(pro.getProperty("host"), Integer.parseInt(pro.getProperty("port")));
        Proxy proxy = new Proxy(Proxy.Type.HTTP, address);
        URLConnection conn = url.openConnection(proxy);
        Authenticator.setDefault(new MyAuthenticator(pro.getProperty("username"), pro.getProperty("password")));
        conn.setConnectTimeout(Integer.parseInt(pro.getProperty("timeout")));
        Parser parser = new Parser(conn);
        NodeList nodeList = parser.parse(new TagNameFilter("A"));
        System.out.println(nodeList.size());
        for (SimpleNodeIterator it = nodeList.elements(); it.hasMoreNodes(); ) {
            TagNode node = (TagNode) it.nextNode();
            String href = node.getAttribute("href");
            // Anchors without an href (e.g. named anchors) would make decode() throw
            if (href == null) {
                continue;
            }
            String dhref = URLDecoder.decode(href, "UTF-8");
            if (CommonHelper.checkIsAlink(dhref)) {
                System.out.println(dhref);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Developer ID: toulezu, Project: play, Lines of code: 29, Source: TestParser.java
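Example 7 percent-decodes every href it finds. Two inputs can make that step fail: a missing attribute (null) and a malformed escape sequence. The sketch below shows one possible way to isolate the decoding so that a single bad link does not abort the loop. `HrefDecoder` and `decodeHref` are hypothetical names introduced here, and the `URLDecoder.decode(String, Charset)` overload assumes Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

class HrefDecoder {
    /** Decodes a percent-encoded href as UTF-8; empty when the attribute is absent or malformed. */
    static Optional<String> decodeHref(String href) {
        if (href == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(URLDecoder.decode(href, StandardCharsets.UTF_8));
        } catch (IllegalArgumentException e) {
            // Thrown for malformed escapes such as a trailing "%"
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeHref("/files/a%20b.txt").orElse("(skipped)"));
        System.out.println(decodeHref(null).orElse("(skipped)"));
    }
}
```

Inside the loop, a caller could then write `decodeHref(href).ifPresent(...)` and continue with the next node when the result is empty.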
Example 8: processResponse
import org.htmlparser.util.SimpleNodeIterator; // import the required package/class
private boolean processResponse(HttpResponse resp, Document doc, Element root) {
    if(resp.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
        System.out.println("[INFO] HTTP Status OK.");
        System.out.println("[INFO] Extracting html page...");
        String html = extractHtml(resp);
        if(html == null) return false;
        System.out.println("[INFO] " + html.length() + "B html page extracted.");
        if(html.length() < 500) {
            System.out.println("[INFO] EOF reached, task completed.");
            return false;
        } else {
            System.out.println("[INFO] Parsing html page...");
            try {
                Parser parser = new Parser(html);
                NodeList weibo_list = parser.extractAllNodesThatMatch(
                        new HasAttributeFilter("action-type", "feed_list_item"));
                System.out.println("[INFO] " + weibo_list.size() + " entries detected.");
                SimpleNodeIterator iter = weibo_list.elements();
                while(iter.hasMoreNodes()) {
                    System.out.println("[INFO] processing entry #" + (++total) + "...");
                    Element elem = extractContent(iter.nextNode(), doc);
                    if(elem == null) {
                        System.out.println("[ERROR] Data extraction failed.");
                        return false;
                    }
                    root.appendChild(elem);
                }
                if(weibo_list.size() != 15) return false;
            } catch (ParserException e) {
                System.out.println("[ERROR] Parser failed.");
                e.printStackTrace();
                return false;
            }
        }
    } else {
        return false;
    }
    return true;
}
Developer ID: w1ndy, Project: weibo-fetcher, Lines of code: 40, Source: Spider.java
Example 9: preencheMapaFeriadosEstaduais
import org.htmlparser.util.SimpleNodeIterator; // import the required package/class
private void preencheMapaFeriadosEstaduais() throws IOException, ParserException,ParseException {
    String estadosPage = recuperarDadosEstado();
    StringBuilder stringBuilder = new StringBuilder(estadosPage);
    stringBuilder.delete(0,estadosPage.indexOf("<h3"));
    NodeList nodeEstadoList = filterTable(stringBuilder.toString());
    String todosMeses[] = {"janeiro", "fevereiro", "março", "abril", "maio", "junho", "julho", "agosto", "setembro", "outubro", "novembro", "dezembro"};
    Map<String,String> mapaMeses = new HashMap<String,String>();
    int i = 1;
    for (String mes:todosMeses){
        String valor = String.valueOf(i++);
        if(valor.length()< 2){
            valor ="0"+valor;
        }
        mapaMeses.put(mes,valor);
    }
    String estado = null;
    for (Node node:nodeEstadoList.toNodeArray()){
        if(node instanceof TableTag){
            NodeList lista = ((TableTag) node).searchFor(TableColumn.class, true);
            SimpleNodeIterator iterator = lista.elements();
            while (iterator.hasMoreNodes()){
                Feriado feriado = new Feriado();
                Node data = iterator.nextNode();
                String[] dataExtenso = data.toPlainTextString().split(" de ");
                feriado.setData(dataExtenso[0] + "/" + mapaMeses.get(dataExtenso[1]) + "/2015");
                Node nome = iterator.nextNode();
                feriado.setNome(nome.toPlainTextString());
                Node lei = iterator.nextNode();
                if(dataExtenso[0].length()==1){
                    dataExtenso[0] = "0"+dataExtenso[0];
                }
                System.out.println(dataExtenso[0] + "/" + mapaMeses.get(dataExtenso[1]) + "/2015,"+nome.toPlainTextString()+","+mapaEstados.get(estado));
                mapaFeriadosEstado.get(estado).add(feriado);
            }
        }
        if(node instanceof HeadingTag){
            estado = node.getChildren().toHtml().trim();
            if(node.getChildren().elementAt(0).getChildren() != null){
                estado = node.getChildren().elementAt(0).getChildren().toHtml().trim();
            }
            mapaFeriadosEstado.put(estado,new ArrayList<Feriado>());
        }
    }
}
Developer ID: emivaljr, Project: hojenaoapp, Lines of code: 48, Source: MyEndpoint.java
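Example 9 turns a Portuguese date such as "1 de janeiro" into a numeric date by looking the month up in a map and zero-padding by hand. Note that in the listing the day is padded only after `setData` has been called, so the stored and printed dates can differ for single-digit days. The sketch below is one possible way to do the conversion in a single step with `String.format`. `PortugueseDateFormatter` and `toNumericDate` are hypothetical names, and passing the year as a parameter is an assumption (the listing hard-codes 2015).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

class PortugueseDateFormatter {
    // "mar\u00e7o" is "março", written with an escape to stay encoding-independent
    private static final String[] MONTHS = {
        "janeiro", "fevereiro", "mar\u00e7o", "abril", "maio", "junho",
        "julho", "agosto", "setembro", "outubro", "novembro", "dezembro"
    };
    private static final Map<String, Integer> MONTH_NUMBERS = new HashMap<>();

    static {
        for (int i = 0; i < MONTHS.length; i++) {
            MONTH_NUMBERS.put(MONTHS[i], i + 1);
        }
    }

    /** Converts text like "1 de janeiro" to "01/01/<year>"; empty if the text does not match. */
    static Optional<String> toNumericDate(String text, int year) {
        String[] parts = text.trim().split(" de ");
        if (parts.length != 2) {
            return Optional.empty();
        }
        Integer month = MONTH_NUMBERS.get(parts[1].trim().toLowerCase(Locale.ROOT));
        if (month == null) {
            return Optional.empty();
        }
        try {
            int day = Integer.parseInt(parts[0].trim());
            // %02d pads both day and month to two digits
            return Optional.of(String.format(Locale.ROOT, "%02d/%02d/%d", day, month, year));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(toNumericDate("1 de janeiro", 2015).orElse("(unparsed)"));
    }
}
```

Returning an empty `Optional` for unexpected cell text avoids the `ArrayIndexOutOfBoundsException` that indexing `split(" de ")` directly would raise on a cell without " de ".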
Example 10: list
import org.htmlparser.util.SimpleNodeIterator; // import the required package/class
@SuppressWarnings({ "rawtypes", "unchecked" })
@Action(value = "sdlist", results = { @Result(type = "json", params = {
        "root", "list" }) })
public String list() {
    Cache c = CacheManager.getInstance().getCache("News");
    String ckey =domain+listid + page;
    Element ele = c.get(ckey);
    if (!CommonUtil.isEmpty(ele)) {
        list = (List) ele.getObjectValue();
    } else {
        StringBuffer retstr = fetch(domain+"/"+listid+"/list"
                + page+".htm");
        Parser p = Parser.createParser(retstr.toString(), "utf-8");
        list = new ArrayList<News>();
        try {
            NodeList ls = p
                    .extractAllNodesThatMatch(new AttributeRegexFilter(
                            "href", ".*/page\\.htm"));
            SimpleNodeIterator i = ls.elements();
            while (i.hasMoreNodes()) {
                Node n = i.nextNode();
                if (n instanceof TagNode) {
                    TagNode tn = (TagNode) n;
                    News news = new News();
                    String href = tn.getAttribute("href");
                    news.setId(href);
                    news.setTitle(tn.getAttribute("alt"));
                    Node tmp=tn.getParent().getNextSibling();
                    while(tmp!=null &&!(tmp instanceof TableColumn))
                        tmp=tmp.getNextSibling();
                    if(tmp!=null)
                        news.setPubdate(tmp.toPlainTextString());
                    list.add(news);
                }
            }
            c.put(new Element(ckey, list));
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
    jsonp(list);
    return NONE;
}
Developer ID: BaixiangLiu, Project: fudanweixin, Lines of code: 46, Source: SudyPageAction.java
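Example 10 and the fudanweixin examples that follow share one shape: look the page up in a cache, and only fetch and parse when it is missing. The sketch below models that shape with a `ConcurrentHashMap` from the JDK as a stand-in for the cache library used in the listings. `PageCache`, `key`, and `getOrLoad` are hypothetical names. The `key` helper joins the parts with a separator; whether plain concatenation such as `domain+listid+page` can actually collide depends on value ranges that the listing does not show.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class PageCache<V> {
    private final Map<String, V> entries = new ConcurrentHashMap<>();

    /** Joins key parts with a separator so ("1", "12") and ("11", "2") stay distinct. */
    static String key(String... parts) {
        return String.join(":", parts);
    }

    /** Returns the cached value, or runs loader once and stores its result. */
    V getOrLoad(String key, Supplier<V> loader) {
        return entries.computeIfAbsent(key, k -> loader.get());
    }

    public static void main(String[] args) {
        PageCache<List<String>> cache = new PageCache<>();
        AtomicInteger fetches = new AtomicInteger();
        Supplier<List<String>> fetchPage = () -> {
            fetches.incrementAndGet(); // stands in for the HTTP fetch and parse
            return List.of("headline A", "headline B");
        };
        cache.getOrLoad(key("newslist", "7", "1"), fetchPage);
        cache.getOrLoad(key("newslist", "7", "1"), fetchPage);
        System.out.println("fetches: " + fetches.get());
    }
}
```

Unlike the listings, this stand-in never expires entries; a real cache would add eviction or a time-to-live.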
Example 11: list
import org.htmlparser.util.SimpleNodeIterator; // import the required package/class
@SuppressWarnings({ "rawtypes", "unchecked" })
@Action(value = "newslist", results = { @Result(type = "json", params = {
        "root", "list" }) })
public String list() {
    Cache c = CacheManager.getInstance().getCache("News");
    String ckey = "newslist"+listid + page;
    Element ele = c.get(ckey);
    if (!CommonUtil.isEmpty(ele)) {
        list = (List) ele.getObjectValue();
    } else {
        StringBuffer retstr = fetch(RD+"/news/"+listid+"/"+page+".html");
        Parser p = Parser.createParser(retstr.toString(), "utf-8");
        list = new ArrayList<News>();
        try {
            NodeList ls = p
                    .extractAllNodesThatMatch(new HasAttributeFilter("class","date"));
            SimpleNodeIterator i = ls.elements();
            while (i.hasMoreNodes()) {
                Node n = i.nextNode();
                if (n instanceof TagNode) {
                    TagNode tn = (TagNode) n;
                    News news = new News();
                    news.setPubdate(tn.toPlainTextString());
                    Node tmp=tn.getNextSibling();
                    while(tmp!=null &&!(tmp instanceof LinkTag))
                        tmp=tmp.getNextSibling();
                    if(tmp!=null)
                    {
                        LinkTag link=(LinkTag)tmp;
                        news.setId(link.getAttribute("href"));
                        news.setTitle(link.getAttribute("title"));
                    }
                    list.add(news);
                }
            }
            c.put(new Element(ckey, list));
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
    return SUCCESS;
}
Developer ID: BaixiangLiu, Project: fudanweixin, Lines of code: 46, Source: CampusNewsAction.java
Example 12: list
import org.htmlparser.util.SimpleNodeIterator; // import the required package/class
@SuppressWarnings("rawtypes")
@Action(value = "eventlist")
public String list() throws IOException {
    Cache c = CacheManager.getInstance().getCache("News");
    String ckey = "eventlist"+page ;
    Element ele = c.get(ckey);
    if (!CommonUtil.isEmpty(ele)) {
        list = (List) ele.getObjectValue();
    } else {
        StringBuffer retstr = fetch(RD+"/calendar/?a=list&&m=recent&range=30&_="+System.currentTimeMillis()+"&type=0&place=0&type="+page );
        Parser p = Parser.createParser(retstr.toString(), "utf-8");
        list = new ArrayList<News>();
        try {
            NodeList ls = p
                    .extractAllNodesThatMatch(new HasAttributeFilter("class","clear"));
            if(ls.size()==2)
            {
                int tk1=ls.elementAt(0).getEndPosition();
                int tk2=ls.elementAt(1).getStartPosition();
                ServletActionContext.getResponse().setCharacterEncoding("utf-8");
                p=Parser.createParser(retstr.substring(tk1+6, tk2), "utf-8");
                NodeList nl=p.parse(null);
                NodeList links=nl.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class),true);
                SimpleNodeIterator i=links.elements();
                while(i.hasMoreNodes())
                {
                    LinkTag lt=(LinkTag)i.nextNode();
                    NodeList ll=new NodeList();
                    ll.add(new TextNode(lt.getAttribute("title")));
                    lt.setChildren(ll);
                    lt.removeAttribute("title");
                }
                ServletActionContext.getResponse().getWriter().print(nl.toHtml());
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
    return NONE;
}
Developer ID: BaixiangLiu, Project: fudanweixin, Lines of code: 45, Source: CampusEventAction.java
Example 13: content
import org.htmlparser.util.SimpleNodeIterator; // import the required package/class
@Action(value = "eventcontent", results = { @Result(type = "json", params = {
        "root", "en" }) })
public String content() {
    Cache c = CacheManager.getInstance().getCache("News");
    String ckey = "eventcontent" + newsid;
    Element ele = c.get(ckey);
    if (!CommonUtil.isEmpty(ele)) {
        en = (News) ele.getObjectValue();
    } else {
        StringBuffer retstr = fetch(RD+"/calendar/?a=one&evid="
                + newsid+"&_="+System.currentTimeMillis());
        Parser p = Parser.createParser(retstr.toString(), "utf-8");
        try {
            NodeList nl = p.extractAllNodesThatMatch(new OrFilter(
                    new TagNameFilter("h1"), new TagNameFilter("table")));
            SimpleNodeIterator i = nl.elements();
            en = new News();
            en.setId(newsid);
            while (i.hasMoreNodes()) {
                Node n = i.nextNode();
                if (n instanceof TagNode) {
                    TagNode tn = (TagNode) n;
                    if (tn.getTagName().equalsIgnoreCase("h1"))
                        en.setTitle(tn.toPlainTextString());
                    if (tn.getTagName().equalsIgnoreCase("table")) {
                        en.setContent(tn.toHtml());
                    }
                }
            }
            String str=retstr.toString().trim();
            // Search the same trimmed string that substring() is applied to,
            // so the offsets stay valid when the response has leading whitespace
            int tk=str.indexOf("imageurl");
            if(tk>0)
            {
                tk=str.indexOf("'",tk);
                int tk1=str.indexOf("'", tk+1);
                String imgurl=RD+str.substring(tk+1,tk1);
                String imgid = EncodeHelper.digest(
                        imgurl, "MD5");
                BasicDBObject obj = new BasicDBObject("id",
                        imgid);
                DBCollection col = MongoUtil.getInstance().getDB()
                        .getCollection("CrawlerImages");
                DBObject dbo = col.findOne(obj);
                if (dbo == null)
                    col.save(obj.append("url",imgurl));
                en.setPubdate(imgid);
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
        if (!CommonUtil.isEmpty(en) && !CommonUtil.isEmpty(en.getContent()))
            c.put(new Element(ckey, en));
    }
    return SUCCESS;
}
Developer ID: BaixiangLiu, Project: fudanweixin, Lines of code: 61, Source: CampusEventAction.java
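Example 13 pulls the image path out of the raw response by finding the `imageurl` marker and taking the text between the next pair of single quotes. The sketch below shows that lookup as a small JDK-only helper with an explicit guard for each step, so a missing marker or an unterminated quote yields an empty result instead of an exception. `QuotedValueExtractor` and `extractQuotedAfter` are hypothetical names introduced for illustration.

```java
import java.util.Optional;

class QuotedValueExtractor {
    /** Returns the text between the first pair of single quotes that follows marker. */
    static Optional<String> extractQuotedAfter(String text, String marker) {
        int markerAt = text.indexOf(marker);
        if (markerAt < 0) {
            return Optional.empty(); // marker not present
        }
        int open = text.indexOf('\'', markerAt + marker.length());
        if (open < 0) {
            return Optional.empty(); // no opening quote
        }
        int close = text.indexOf('\'', open + 1);
        if (close < 0) {
            return Optional.empty(); // no closing quote
        }
        return Optional.of(text.substring(open + 1, close));
    }

    public static void main(String[] args) {
        String page = "var imageurl = '/upload/event.png';";
        System.out.println(extractQuotedAfter(page, "imageurl").orElse("(none)"));
    }
}
```

Because all three offsets come from the same string, the result does not depend on whether the input was trimmed beforehand.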
Example 14: list
import org.htmlparser.util.SimpleNodeIterator; // import the required package/class
@SuppressWarnings({ "rawtypes", "unchecked" })
@Action(value = "calist", results = { @Result(type = "json", params = {
        "root", "list" }) })
public String list() {
    Cache c = CacheManager.getInstance().getCache("News");
    String ckey = "calist" + page;
    Element ele = c.get(ckey);
    if (!CommonUtil.isEmpty(ele)) {
        list = (List) ele.getObjectValue();
    } else {
        StringBuffer retstr = fetch(RD+"/announce/announce_list.php?page="
                + page);
        Parser p = Parser.createParser(retstr.toString(), "utf-8");
        list = new ArrayList<News>();
        try {
            NodeList ls = p
                    .extractAllNodesThatMatch(new AttributeRegexFilter(
                            "href", "announce/\\?announceid=\\d+"));
            SimpleNodeIterator i = ls.elements();
            while (i.hasMoreNodes()) {
                Node n = i.nextNode();
                if (n instanceof TagNode) {
                    TagNode tn = (TagNode) n;
                    News news = new News();
                    String href = tn.getAttribute("href");
                    int tk = href.indexOf("=");
                    if (tk > 0)
                        news.setId(href.substring(tk + 1));
                    news.setTitle(tn.toPlainTextString());
                    list.add(news);
                }
            }
            c.put(new Element(ckey, list));
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
    return SUCCESS;
}
Developer ID: BaixiangLiu, Project: fudanweixin, Lines of code: 43, Source: CampusAnnouceAction.java
Example 15: list
import org.htmlparser.util.SimpleNodeIterator; // import the required package/class
@SuppressWarnings({ "rawtypes", "unchecked" })
@Action(value = "dstlist", results = { @Result(type = "json", params = {
        "root", "list" }) })
public String list() {
    Cache c = CacheManager.getInstance().getCache("News");
    String ckey = "dstlist" + page;
    Element ele = c.get(ckey);
    if (!CommonUtil.isEmpty(ele)) {
        list = (List) ele.getObjectValue();
    } else {
        try {
            StringBuffer retstr = CommonUtil.postWebRequest(RD+"/news.aspx?info_lb=822", ("__EVENTTARGET=_ctl0$ContentPlaceHolder1$Pager22&__EVENTARGUMENT="+page).getBytes("utf-8"), "application/x-www-form-urlencoded");
            Parser p = Parser.createParser(retstr.toString(), "utf-8");
            list = new ArrayList<News>();
            NodeList ls = p
                    .extractAllNodesThatMatch(new AttributeRegexFilter(
                            "href", "show\\.aspx\\?.+"));
            SimpleNodeIterator i = ls.elements();
            while (i.hasMoreNodes()) {
                Node n = i.nextNode();
                if (n instanceof TagNode) {
                    TagNode tn = (TagNode) n;
                    News news = new News();
                    String href = tn.getAttribute("href");
                    news.setId(href);
                    news.setTitle(tn.toPlainTextString().trim());
                    Node tmp=tn.getParent().getNextSibling();
                    while(tmp!=null &&!(tmp instanceof Span))
                        tmp=tmp.getNextSibling();
                    if(tmp!=null){
                        String dtstr=tmp.toPlainTextString();
                        if(dtstr!=null &&dtstr.length()>2)
                            news.setPubdate(dtstr.substring(1,dtstr.length()-1));
                    }
                    list.add(news);
                }
            }
            c.put(new Element(ckey, list));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    return SUCCESS;
}
Developer ID: BaixiangLiu, Project: fudanweixin, Lines of code: 49, Source: DstAnnouceAction.java
Example 16: list
import org.htmlparser.util.SimpleNodeIterator; // import the required package/class
@SuppressWarnings({ "unchecked", "rawtypes" })
@Action(value = "liblist", results = { @Result(type = "json", params = {
        "root", "list" }) })
public String list() {
    Cache c = CacheManager.getInstance().getCache("News");
    String ckey = "liblist" + page;
    Element ele = c.get(ckey);
    if (!CommonUtil.isEmpty(ele)) {
        list = (List) ele.getObjectValue();
    } else {
        StringBuffer retstr = fetch(RD
                + "/ddlib/getPublishInfoList.shtml?tid=1012&k=&p="
                + (page - 1));
        Parser p = Parser.createParser(retstr.toString(), "utf-8");
        list = new ArrayList<News>();
        try {
            NodeList ls = p
                    .extractAllNodesThatMatch(new AttributeRegexFilter(
                            "href", "publishInfo\\.shtml\\?.+"));
            SimpleNodeIterator i = ls.elements();
            while (i.hasMoreNodes()) {
                Node n = i.nextNode();
                if (n instanceof TagNode) {
                    TagNode tn = (TagNode) n;
                    News news = new News();
                    String href = tn.getAttribute("href");
                    news.setId(href);
                    news.setTitle(tn.toPlainTextString());
                    Node tmp = tn.getNextSibling();
                    // instanceof is already false for null, so no separate null check is needed
                    if (tmp instanceof TextNode) {
                        if (tmp.getText() != null)
                            news.setPubdate(tmp.getText().replaceAll(
                                    " ", ""));
                    }
                    list.add(news);
                }
            }
            c.put(new Element(ckey, list));
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
    return SUCCESS;
}
Developer ID: BaixiangLiu, Project: fudanweixin, Lines of code: 48, Source: LibAnnouceAction.java
Example 17: fetchComment
import org.htmlparser.util.SimpleNodeIterator; // import the required package/class
private void fetchComment(String mid, Document doc, Element parent) {
    int page = 0;
    while(++page > 0) {
        System.out.println("[INFO] Fetching comment of W" + mid + " page " + page + "...");
        String url = String.format(CommentUrl, mid, page);
        HttpResponse resp = connect(url);
        if(resp == null) return ;
        BufferedReader reader;
        try {
            reader = new BufferedReader(new InputStreamReader(
                    resp.getEntity().getContent()));
            String raw = "", line;
            while((line = reader.readLine()) != null) {
                raw += line;
            }
            JSONParser parser = new JSONParser();
            JSONObject json = (JSONObject)parser.parse(raw);
            Parser htmlparser = new Parser((String)((JSONObject)json.get("data")).get("html"));
            NodeList list = htmlparser.extractAllNodesThatMatch(new HasAttributeFilter("class", "S_txt2"));
            SimpleNodeIterator iter = list.elements();
            while(iter.hasMoreNodes()) {
                Node n = iter.nextNode();
                Node p = n.getPreviousSibling(), s = n;
                while(p != null && !s.toPlainTextString().startsWith("��")) {
                    s = p;
                    p = p.getPreviousSibling();
                }
                String comment = "";
                while(s != n) {
                    comment += s.toPlainTextString();
                    s = s.getNextSibling();
                }
                Node name = n.getParent().getFirstChild().getNextSibling();
                Element cmt = doc.createElement("comment");
                cmt.setAttribute("by", name.getChildren().asString());
                cmt.setAttribute("on", n.getChildren().asString());
                cmt.setTextContent(comment.substring(1));
                parent.appendChild(cmt);
            }
            if(list.size() < 20) return ;
        } catch (IllegalStateException | IOException | ParseException | ParserException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
Developer ID: w1ndy, Project: weibo-fetcher, Lines of code: 51, Source: Spider.java
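Example 17 builds the response body by appending each line to a `String` and never closes the reader. A possible alternative, sketched below with JDK classes only, accumulates into a `StringBuilder` and closes the reader with try-with-resources. `BodyReader` and `readAllLines` are hypothetical names. Like the listing, the sketch drops line terminators when joining lines, and it rethrows `IOException` as `UncheckedIOException`, which is a design choice made here rather than something the listing does.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class BodyReader {
    /** Reads every line from source, joins them without separators, and closes the reader. */
    static String readAllLines(Reader source) {
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line); // line terminators are dropped, as in the listing
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Reader response = new StringReader("{\"data\":\n{\"html\":\"<p>hi</p>\"}}");
        System.out.println(readAllLines(response));
    }
}
```

When wrapping a network stream, passing an explicit charset to `InputStreamReader` (for example `StandardCharsets.UTF_8`) avoids depending on the platform default.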
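Note: The org.htmlparser.util.SimpleNodeIterator examples in this article were compiled from GitHub, MSDocs, and other source-code and documentation hosting platforms. The snippets were selected from open-source projects, and copyright in the source code belongs to the original authors. Refer to each project's license before distributing or using the code; do not reproduce without permission.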