
gopa: A lightweight spider for Elasticsearch.


Project name:

gopa

Project URL:

https://gitee.com/medcl/gopa

Project description:

What a Spider!

GOPA, A Spider Written in Go.

Join the chat at https://gitter.im/infinitbyte/gopa

Goal

  • Lightweight, with a low footprint: memory usage should stay below 100 MB
  • Easy to deploy: no runtime or dependencies required
  • Easy to use: no programming or scripting skills needed, features work out of the box

Screenshot

[Image: What a Spider! GOPA Spider!]

How to use

Requirements

  • Elasticsearch v5.3+
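To check this requirement before starting, one option is to read the version that Elasticsearch reports at its root endpoint (GET /), which returns JSON with a version.number field. The following is a minimal sketch in Go, not part of Gopa; the names rootInfo, atLeast, and supported are hypothetical, and the sample body stands in for a real HTTP response.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// rootInfo models only the part of the Elasticsearch root response
// ("GET /") that this sketch needs. Hypothetical helper type.
type rootInfo struct {
	Version struct {
		Number string `json:"number"`
	} `json:"version"`
}

// atLeast reports whether a dotted version such as "5.3.0" is at least
// major.minor. Only the first two components are compared.
func atLeast(version string, major, minor int) (bool, error) {
	parts := strings.SplitN(version, ".", 3)
	if len(parts) < 2 {
		return false, fmt.Errorf("unexpected version %q", version)
	}
	gotMajor, err := strconv.Atoi(parts[0])
	if err != nil {
		return false, err
	}
	gotMinor, err := strconv.Atoi(parts[1])
	if err != nil {
		return false, err
	}
	return gotMajor > major || (gotMajor == major && gotMinor >= minor), nil
}

// supported parses a root response body and checks it against v5.3.
func supported(body []byte) (bool, error) {
	var info rootInfo
	if err := json.Unmarshal(body, &info); err != nil {
		return false, err
	}
	return atLeast(info.Version.Number, 5, 3)
}

func main() {
	// Sample body (assumption); in practice it would be fetched from the
	// Elasticsearch endpoint, for example http://localhost:9200/.
	sample := []byte(`{"version":{"number":"5.3.0"}}`)
	ok, err := supported(sample)
	fmt.Println(ok, err)
}
```

Comparing numeric components rather than raw strings avoids ranking "5.10" below "5.3".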

Setup

First, get Gopa. There are two options: download the pre-built package or compile it yourself.

Download Pre Built Package

Go to the Release or Snapshot page and download the right package for your platform.

Note: the Darwin package is for macOS.

Compile The Package Manually

At this point, you have:

  • gopa: the main program, a single binary.
  • config/: Elasticsearch-related scripts and other files.
  • gopa.yml: the main configuration file for Gopa.

Optional Config

By default, everything in Gopa works except indexing. To use Elasticsearch for indexing, follow these steps:

  • Create an index in Elasticsearch with the script config/elasticsearch/gopa-index-mapping.sh (important: do not skip these settings!)

Example
curl -XPUT "http://localhost:9200/gopa-index" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "doc": {
      "properties": {
        "host": { "type": "keyword", "ignore_above": 256 },
        "snapshot": {
          "properties": {
            "bold": { "type": "text" },
            "url": { "type": "keyword", "ignore_above": 256 },
            "content_type": { "type": "keyword", "ignore_above": 256 },
            "file": { "type": "keyword", "ignore_above": 256 },
            "ext": { "type": "keyword", "ignore_above": 256 },
            "h1": { "type": "text" },
            "h2": { "type": "text" },
            "h3": { "type": "text" },
            "h4": { "type": "text" },
            "hash": { "type": "keyword", "ignore_above": 256 },
            "id": { "type": "keyword", "ignore_above": 256 },
            "images": {
              "properties": {
                "external": {
                  "properties": {
                    "label": { "type": "text" },
                    "url": { "type": "keyword", "ignore_above": 256 }
                  }
                },
                "internal": {
                  "properties": {
                    "label": { "type": "text" },
                    "url": { "type": "keyword", "ignore_above": 256 }
                  }
                }
              }
            },
            "italic": { "type": "text" },
            "links": {
              "properties": {
                "external": {
                  "properties": {
                    "label": { "type": "text" },
                    "url": { "type": "keyword", "ignore_above": 256 }
                  }
                },
                "internal": {
                  "properties": {
                    "label": { "type": "text" },
                    "url": { "type": "keyword", "ignore_above": 256 }
                  }
                }
              }
            },
            "path": { "type": "keyword", "ignore_above": 256 },
            "sim_hash": { "type": "keyword", "ignore_above": 256 },
            "lang": { "type": "keyword", "ignore_above": 256 },
            "screenshot_id": { "type": "keyword", "ignore_above": 256 },
            "size": { "type": "long" },
            "text": { "type": "text" },
            "title": {
              "type": "text",
              "fields": {
                "keyword": { "type": "keyword" }
              }
            },
            "version": { "type": "long" }
          }
        },
        "task": {
          "properties": {
            "breadth": { "type": "long" },
            "created": { "type": "date" },
            "depth": { "type": "long" },
            "id": { "type": "keyword", "ignore_above": 256 },
            "original_url": { "type": "keyword", "ignore_above": 256 },
            "reference_url": { "type": "keyword", "ignore_above": 256 },
            "schema": { "type": "keyword", "ignore_above": 256 },
            "status": { "type": "integer" },
            "updated": { "type": "date" },
            "url": { "type": "keyword", "ignore_above": 256 },
            "last_screenshot_id": { "type": "keyword", "ignore_above": 256 }
          }
        }
      }
    }
  }
}'

Note: the Elasticsearch version must be v5.3 or later.

  • Enable the index module in gopa.yml and update the Elasticsearch settings:
  - module: index
    enabled: true
    ui:
      enabled: true
    elasticsearch:
      endpoint: http://localhost:9200
      index_prefix: gopa-
      username: elastic
      password: changeme

Start

Gopa has no dependencies; simply run ./gopa to start the program.

Gopa can also run as a daemon (note: only available on Linux and macOS):

Example
➜  gopa git:(master) ✗ ./bin/gopa --daemon
  ________ ________ __________  _____
 /  _____/ \_____  \\______   \/  _  \
/   \  ___  /   |   \|     ___/  /_\  \
\    \_\  \/    |    \    |  /    |    \
 \______  /\_______  /____|  \____|__  /
        \/         \/                \/
[gopa] 0.10.0_SNAPSHOT
///last commit: 99616a2, Fri Oct 20 14:04:54 2017 +0200, medcl, update version to 0.10.0 ///

[10-21 16:01:09] [INF] [instance.go:23] workspace: data/gopa/nodes/0
[gopa] started.

Run ./gopa -h to see the full list of command-line options.

Example
➜  gopa git:(master) ✗ ./bin/gopa -h
  ________ ________ __________  _____
 /  _____/ \_____  \\______   \/  _  \
/   \  ___  /   |   \|     ___/  /_\  \
\    \_\  \/    |    \    |  /    |    \
 \______  /\_______  /____|  \____|__  /
        \/         \/                \/
[gopa] 0.10.0_SNAPSHOT
///last commit: 99616a2, Fri Oct 20 14:04:54 2017 +0200, medcl, update version to 0.10.0 ///

Usage of ./bin/gopa:
  -config string
        the location of config file (default "gopa.yml")
  -cpuprofile string
        write cpu profile to this file
  -daemon
        run in background as daemon
  -debug
        run in debug mode, gopa will quit with panic error
  -log string
        the log level,options:trace,debug,info,warn,error (default "info")
  -log_path string
        the log path (default "log")
  -memprofile string
        write memory profile to this file
  -pidfile string
        pidfile path (only for daemon)
  -pprof string
        enable and setup pprof/expvar service, eg: localhost:6060 , the endpoint will be: http://localhost:6060/debug/pprof/ and http://localhost:6060/debug/vars
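The help text above has the shape that Go's standard flag package prints. As a rough illustration of how such options can be declared and parsed, here is a minimal sketch covering four of the listed flags. It is not Gopa's source code; the options type and the parseOptions function are hypothetical names.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options holds a subset of the flags shown in the help output.
// Hypothetical type for illustration only.
type options struct {
	config   string
	daemon   bool
	logLevel string
	pidfile  string
}

// parseOptions parses args (without the program name) into options.
// Flag names, defaults, and descriptions are copied from the help text.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("gopa", flag.ContinueOnError)
	fs.StringVar(&o.config, "config", "gopa.yml", "the location of config file")
	fs.BoolVar(&o.daemon, "daemon", false, "run in background as daemon")
	fs.StringVar(&o.logLevel, "log", "info", "the log level,options:trace,debug,info,warn,error")
	fs.StringVar(&o.pidfile, "pidfile", "", "pidfile path (only for daemon)")
	err := fs.Parse(args)
	return o, err
}

func main() {
	o, err := parseOptions(os.Args[1:])
	if err != nil {
		// The flag package has already reported the problem on stderr.
		os.Exit(2)
	}
	fmt.Printf("%+v\n", o)
}
```

Called with -daemon -log debug, this sketch would set daemon to true and logLevel to "debug" while leaving config at its default of gopa.yml.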

Stop

It is safe to press Ctrl+C to stop a running Gopa. Gopa handles the rest and saves a checkpoint, so you can restore the job later. The world is still in your hands.

If you are running Gopa as a daemon, you can stop it like this:

 kill -QUIT `pgrep gopa`
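The stop behaviour described above (finish cleanly on Ctrl+C or SIGQUIT and keep a checkpoint to resume from) follows a common Go pattern. The sketch below illustrates that pattern only and is not Gopa's actual implementation. The names stopSignals, newSignalChan, and run are hypothetical, the URLs are placeholders, and the "checkpoint" is simply the index of the next unprocessed task.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// stopSignals are the signals this sketch treats as a stop request:
// SIGINT (Ctrl+C) and SIGQUIT (kill -QUIT).
var stopSignals = []os.Signal{os.Interrupt, syscall.SIGQUIT}

// newSignalChan returns a buffered channel suitable for signal.Notify.
func newSignalChan() chan os.Signal {
	return make(chan os.Signal, 1)
}

// run processes tasks in order and checks for a stop request before each
// one. It returns the index of the next unprocessed task (the checkpoint)
// and whether the run was stopped by a signal.
func run(tasks []string, sigs <-chan os.Signal, process func(string)) (next int, stopped bool) {
	for i, task := range tasks {
		select {
		case <-sigs:
			return i, true // stop before starting task i
		default:
		}
		process(task)
	}
	return len(tasks), false
}

func main() {
	sigs := newSignalChan()
	signal.Notify(sigs, stopSignals...)

	// Placeholder task list (assumption); a real crawler would read a queue.
	tasks := []string{"http://example.com/a", "http://example.com/b"}
	next, stopped := run(tasks, sigs, func(u string) {
		fmt.Println("fetching", u)
	})
	if stopped {
		// A real program would persist this value to disk here.
		fmt.Println("stopped; resume from task", next)
		return
	}
	fmt.Println("all tasks done")
}
```

Checking for the signal between tasks, rather than interrupting one mid-flight, keeps the saved position consistent: every task before the checkpoint has fully completed.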

Configuration

UI

  • Search Console http://127.0.0.1:9001/
  • Admin Console http://127.0.0.1:9001/admin/

API

  • TBD

Architecture

[Image: What a Spider! GOPA Spider!]

Contributing

You are sincerely and warmly welcome to play with this project. Contributions of any kind are welcome, from UI styling to core features or just a piece of documentation. Let's make it better.

License

Released under the Apache License, Version 2.0.

