Python - Elasticsearch入门教程

原作者: [db:作者] 来自: [db:来源] 收藏邀请

图片来源：tryolabs.com

在本文中，我将讨论什么是Elasticsearch以及如何在Python中使用Elasticsearch(简称ES)。

什么是ElasticSearch？

ElasticSearch(ES)是基于Apache Lucene构建的分布式高可用开源搜索引擎。ES使用JAVA开发，可用于很多平台。ES以JSON格式存储非结构化数据，可以作为NoSQL数据库。ES不但具有NoSQL数据库特性，还提供了搜索及很多其他相关功能。

ElasticSearch用例

可以将ES用于多种用途，下面提供了其中的几个例子：

您正在运行的网站提供许多动态内容。无论是电子商务网站还是博客，通过部署ES，不仅可以为Web应用程序提供强大的搜索能力，还可以在应用程序中提供本机自动补全功能。
您可以收集不同种类的日志数据，然后借助ES查找趋势和统计数据。

设置和运行

安装ElasticSearch的最简单方法是下载它并运行可执行文件。必须确保使用的是Java 7或更高版本。

下载后，解压缩并运行其二进制文件。

elasticsearch-6.2.4 bin/elasticsearch

滚动窗口中将有很多文本，如果您看到类似下面的内容，则表明启动成功。

[2018-05-27T17:36:11,744][INFO ][o.e.h.n.Netty4HttpServerTransport] [c6hEGv4] publish_address {127.0.0.1:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}

为了进一步验证一切正常，请访问该URL：http://localhost:9200，可以在浏览器中打开或通过cURL，ES会返回以下内容。

{
  "name" : "c6hEGv4",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "HkRyTYXvSkGvkvHX2Q1-oQ",
  "version" : {
    "number" : "6.2.4",
    "build_hash" : "ccec39f",
    "build_date" : "2018-04-12T20:37:28.497551Z",
    "build_snapshot" : false,
    "lucene_version" : "7.2.1",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

ES提供了REST API，我们尝试使用它来执行不同的任务。

基本范例

要做的第一件事就是创建一个索引(Index)。一切都存储在索引中。 RDBMS(关系数据库管理系统)跟这里所说的索引(Index)相对应的是一个数据库。因此，请勿将这里的Index与您在RDBMS中学习的典型索引概念相混淆。这里用用PostMan运行REST API。

如果运行成功，您将看到如下类似的响应。

{
    "acknowledged": true,
    "shards_acknowledged": true,
    "index": "company"
}

即，我们创建了一个数据库，名称为company。换句话说，我们创建了一个Index叫company。如果从浏览器访问http://localhost:9200/company，您将看到类似以下内容：

{
  "company": {
    "aliases": {
      
    },
    "mappings": {
      
    },
    "settings": {
      "index": {
        "creation_date": "1527638692850",
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "RnT-gXISSxKchyowgjZOkQ",
        "version": {
          "created": "6020499"
        },
        "provided_name": "company"
      }
    }
  }
}

暂时忽略其中的mappings，这个我们稍后再讨论。creation_date是创建时间戳。number_of_shards告诉将保留此Index数据的分区数。如果您正在运行包含多个Elastic节点的集群，则整个数据将在节点之间拆分。简而言之，如果有5个分片，则整个数据可在5个分片上使用，并且ElasticSearch集群可以从它的任何节点处理请求。

number_of_replicas是指数据的镜像。如果您熟悉数据库主从（master-slave）的概念，那么这对您来说并不新鲜。可以从这里了解有关基本ES概念的更多信息。

使用cURL创建索引可以一行代码搞定：

➜  elasticsearch-6.2.4 curl -X PUT localhost:9200/company
{"acknowledged":true,"shards_acknowledged":true,"index":"company"}%

您还可以一次执行索引创建和记录插入任务，要做的就是以JSON格式传递数据记录。对应的PostMan用法如下：

确保设置Content-Type为application/json

它将创建一个名为company的索引（如果不存在），然后创建一个新的Type(类型)叫employees。ES中的Type(类型)实际上对应在RDBMS中的Table。

上面的请求将输出以下JSON结构:

{
    "_index": "company",
    "_type": "employees",
    "_id": "1",
    "_version": 1,
    "result": "created",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    },
    "_seq_no": 0,
    "_primary_term": 1
}

然后，您以JSON格式传入数据，该数据最终将作为新记录或文档插入。如果您从浏览器访问http://localhost:9200/company/employees/1，将看到以下内容。

{"_index":"company","_type":"employees","_id":"1","_version":1,"found":true,"_source":{
    "name": "Adnan Siddiqi",
    "occupation": "Consultant"
}}

可以看到实际记录以及元数据。如果您愿意，可以将请求更改为http://localhost:9200/company/employees/1/_source则只会输出记录的JSON结构。

cURL版本为：

➜  elasticsearch-6.2.4 curl -X POST \
>   http://localhost:9200/company/employees/1 \
>   -H 'content-type: application/json' \
>   -d '{
quote>     "name": "Adnan Siddiqi",
quote>     "occupation": "Consultant"
quote> }'
{"_index":"company","_type":"employees","_id":"1","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1}%

如果您想更新该记录怎么办？也很简单。您要做的就是更改JSON记录。如下所示：

它将生成以下输出：

{
    "_index": "company",
    "_type": "employees",
    "_id": "1",
    "_version": 2,
    "result": "updated",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    },
    "_seq_no": 1,
    "_primary_term": 1
}

注意_result现在设置为updated（替代created）。

当然，您也可以删除某些记录。

如果要删除所有数据，可以用命令：curl -XDELETE localhost:9200/_all【谨慎操作！】。

接下来尝试一些基本的检索。如果你运行：http://localhost:9200/company/employees/_search?q=adnan，它将搜索employees类型下的所有字段并返回相关记录。

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "company",
        "_type": "employees",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "name": "Adnan Siddiqi",
          "occupation": "Software Consultant"
        }
      }
    ]
  }
}

max_score字段表明记录的相关性，即记录的最高分数。

您还可以通过传递字段名称来将搜索条件限制为某个字段。例如，http://localhost:9200/company/employees/_search?q=name:Adnan，将仅搜索name文档的字段。它实际上等效于SQL：SELECT * from table where name='Adnan'

ES还可以做很多事情，可以另行参考文档。这里我们切换到介绍如何使用Python访问ES。

在Python中访问ElasticSearch

老实说，ES的REST API足够好，您可以使用requests库来执行所有任务。不过，您也可以使用Python库处理ElasticSearch从而专注于您的主要任务，而不必担心如何创建请求。

通过pip安装ES，然后可以在Python程序中访问它。

pip install elasticsearch

为确保已正确安装，请从命令行运行以下基本代码段：

➜  elasticsearch-6.2.4 python
Python 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 12:04:33) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
>>> es
<Elasticsearch([{'host': 'localhost', 'port': 9200}])>

网页抓取和Elasticsearch

让我们讨论一下使用Elasticsearch的一些实际用例。目的是访问在线食谱并将其存储在Elasticsearch中用于搜索和分析目的。我们将首先从Allrecipes抓取数据并将其存储在ES中。要注意的是，我们需要创建一个严格的Schema或Mapping，以确保以正确的格式和类型对数据进行索引。具体步骤如下：

抓取数据

import json
from time import sleepimport requests
from bs4 import BeautifulSoupdef parse(u):
    title = '-'
    submit_by = '-'
    description = '-'
    calories = 0
    ingredients = []
    rec = {}try:
        r = requests.get(u, headers=headers)if r.status_code == 200:
            html = r.text
            soup = BeautifulSoup(html, 'lxml')
            # title
            title_section = soup.select('.recipe-summary__h1')
            # submitter
            submitter_section = soup.select('.submitter__name')
            # description
            description_section = soup.select('.submitter__description')
            # ingredients
            ingredients_section = soup.select('.recipe-ingred_txt')# calories
            calories_section = soup.select('.calorie-count')
            if calories_section:
                calories = calories_section[0].text.replace('cals', '').strip()if ingredients_section:
                for ingredient in ingredients_section:
                    ingredient_text = ingredient.text.strip()
                    if 'Add all ingredients to list' not in ingredient_text and ingredient_text != '':
                        ingredients.append({'step': ingredient.text.strip()})if description_section:
                description = description_section[0].text.strip().replace('"', '')if submitter_section:
                submit_by = submitter_section[0].text.strip()if title_section:
                title = title_section[0].textrec = {'title': title, 'submitter': submit_by, 'description': description, 'calories': calories,
                   'ingredients': ingredients}
    except Exception as ex:
        print('Exception while parsing')
        print(str(ex))
    finally:
        return json.dumps(rec)if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
        'Pragma': 'no-cache'
    }
    url = 'https://www.allrecipes.com/recipes/96/salad/'
    r = requests.get(url, headers=headers)
    if r.status_code == 200:
        html = r.text
        soup = BeautifulSoup(html, 'lxml')
        links = soup.select('.fixed-recipe-card__h3 a')
        for link in links:
            sleep(2)
            result = parse(link['href'])
            print(result)
            print('=================================')

以上是提取数据的基本程序。由于需要JSON格式的数据，因此我进行了相应的转换。

创建索引

有了所需的数据之后，接下来考虑如何存储。我们要做的第一件事就是创建Index(索引)。将索引命名为recipes，Type为salads。要做的另一件事是给文档结构创建一个 mapping（映射）。
在创建索引之前，我们必须连接ElasticSearch服务器。

import logging
def connect_elasticsearch():
    _es = None
    _es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
    if _es.ping():
        print('Yay Connect')
    else:
        print('Awww it could not connect!')
    return _esif __name__ == '__main__':
  logging.basicConfig(level=logging.ERROR)

其中_es.ping()对服务器发送ping请求，如果连接上的话返回True。

def create_index(es_object, index_name='recipes'):
    created = False
    # index settings
    settings = {
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 0
        },
        "mappings": {
            "members": {
                "dynamic": "strict",
                "properties": {
                    "title": {
                        "type": "text"
                    },
                    "submitter": {
                        "type": "text"
                    },
                    "description": {
                        "type": "text"
                    },
                    "calories": {
                        "type": "integer"
                    },
                }
            }
        }
    }
    try:
        if not es_object.indices.exists(index_name):
            # Ignore 400 means to ignore "Index Already Exist" error.
            es_object.indices.create(index=index_name, ignore=400, body=settings)
            print('Created Index')
        created = True
    except Exception as ex:
        print(str(ex))
    finally:
        return created

对上面的代码做个简要说明：
首先，我们传递了一个config变量，其中包含整个文档结构的映射。Mapping是Elastic的Schema术语。就像我们在数据库Table表中设置某些字段数据类型一样，我们在这里做类似的事情。除了calories是Integer之外，所有其他字段均为text类型。
然后，我要确保索引根本不存在，然后再创建它。参数ignore=400在检查之后不再需要它，但是如果您不检查是否存在，则可以抑制该错误并覆盖现有索引。不过这有风险，就像覆盖数据库一样。

如果索引创建成功，则可以通过访问http://localhost:9200/recipes/_mappings 进行验证，它将打印如下内容：

{
  "recipes": {
    "mappings": {
      "salads": {
        "dynamic": "strict",
        "properties": {
          "calories": {
            "type": "integer"
          },
          "description": {
            "type": "text"
          },
          "submitter": {
            "type": "text"
          },
          "title": {
            "type": "text"
          }
        }
      }
    }
  }
}

通过传入dynamic:strict，我们强迫Elasticsearch对所有传入文档进行严格检查。这里，salads实际上是文档类型。

存入索引

下一步是存储实际数据或文档。

def store_record(elastic_object, index_name, record):
    try:
        outcome = elastic_object.index(index=index_name, doc_type='salads', body=record)
    except Exception as ex:
        print('Error in indexing data')
        print(str(ex))

运行它，您将看到以下输出：

Error in indexing data
TransportError(400, 'strict_dynamic_mapping_exception', 'mapping set to strict, dynamic introduction of [ingredients] within [salads] is not allowed')

你能猜出为什么会这样吗？由于在我们的映射中没有设置ingredients，ES不允许我们存储包含ingredients字段的文档。现在，我们认识到了Mapping的好处，它可以避免破坏数据。现在，让我们对映射进行一些更改，如下所示：

"mappings": {
            "salads": {
                "dynamic": "strict",
                "properties": {
                    "title": {
                        "type": "text"
                    },
                    "submitter": {
                        "type": "text"
                    },
                    "description": {
                        "type": "text"
                    },
                    "calories": {
                        "type": "integer"
                    },
                    "ingredients": {
                        "type": "nested",
                        "properties": {
                            "step": {"type": "text"}
                        }
                    },
                }
            }
        }

我们在Mapping中增加了ingrdients，其类型为nested。然后设置了其内部字段的数据类型，也就是text

nested数据类型让您设置嵌套JSON对象的类型。再次运行它，将会看到以下输出：

{
  '_index': 'recipes',
  '_type': 'salads',
  '_id': 'OvL7s2MBaBpTDjqIPY4m',
  '_version': 1,
  'result': 'created',
  '_shards': {
    'total': 1,
    'successful': 1,
    'failed': 0
  },
  '_seq_no': 0,
  '_primary_term': 1
}

由于您没有通过_id设置文档ID，ES为存储的文档分配了动态ID。我在Chrome浏览器中使用ES数据查看器查看数据，它的工具名为ElasticSearch工具箱。

在继续之前，我们先存储一个字符串给calories字段，看看情况如何。（记住我们已经将calories设置为integer类型）。结果会出现以下错误：

TransportError(400, 'mapper_parsing_exception', 'failed to parse [calories]')

现在我们进一步了解到为文档设置Mapping的好处。当然，如果您不这样做也可以，因为Elasticsearch将在运行时自动进行基本的默认映射。

查询记录

建好索引之后，可以根据需要进行查询。我要创建一个名为search()的函数，它将显示查询结果。

def search(es_object, index_name, search):
    res = es_object.search(index=index_name, body=search)

这让我们尝试一些查询。

if __name__ == '__main__':
  es = connect_elasticsearch()
    if es is not None:
        search_object = {'query': {'match': {'calories': '102'}}}
        search(es, 'recipes', json.dumps(search_object))

上面的查询将返回其中calories等于102的所有记录。在我们的例子中，输出为：

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': 'YkTAuGMBzBKRviZYEDdu',
                    '_index': 'recipes',
                    '_score': 1.0,
                    '_source': {'calories': '102',
                                'description': "I've been making variations of "
                                               'this salad for years. I '
                                               'recently learned how to '
                                               'massage the kale and it makes '
                                               'a huge difference. I had a '
                                               'friend ask for my recipe and I '
                                               "realized I don't have one. "
                                               'This is my first attempt at '
                                               'writing a recipe, so please '
                                               'let me know how it works out! '
                                               'I like to change up the '
                                               'ingredients: sometimes a pear '
                                               'instead of an apple, '
                                               'cranberries instead of '
                                               'currants, Parmesan instead of '
                                               'feta, etc. Great as a side '
                                               'dish or by itself the next day '
                                               'for lunch!',
                                'ingredients': [{'step': '1 bunch kale, large '
                                                         'stems discarded, '
                                                         'leaves finely '
                                                         'chopped'},
                                                {'step': '1/2 teaspoon salt'},
                                                {'step': '1 tablespoon apple '
                                                         'cider vinegar'},
                                                {'step': '1 apple, diced'},
                                                {'step': '1/3 cup feta cheese'},
                                                {'step': '1/4 cup currants'},
                                                {'step': '1/4 cup toasted pine '
                                                         'nuts'}],
                                'submitter': 'Leslie',
                                'title': 'Kale and Feta Salad'},
                    '_type': 'salads'}],
          'max_score': 1.0,
          'total': 1},
 'timed_out': False,
 'took': 2}

如果您想在其中获取calories大于20的记录怎么办？

search_object = {'_source': ['title'], 'query': {'range': {'calories': {'gte': 20}}}}

您还可以指定要返回的列或字段。上面的查询将返回所有calories大于20的记录。

结论

Elasticsearch是一个功能强大的工具，能够为各种应用提供丰富准确的查询功能。我们刚刚介绍了要点，可以进一步阅读文档来熟悉这个工具。特别是模糊搜索功能非常出色。如果有机会，我将在以后的文章中介绍Query DSL。

注：本文代码可以在Github找到。

本文最初发表于这里。

鲜花

握手

雷人

路过

鸡蛋

专题导读

More+

10-27 六六分期app的软件客服如何联系？(六六分期

11-06 可心卡盟:win10系统火狐flash插件崩溃怎么

11-06 亲亲特价:怎么删除回收站图标

11-06 济南大学虚拟社区:鲁大师节能降温的具体办

11-06 xlueops.exe:无线网络安装向导

11-06 女斗合众国:win7系统cf与主机连接不稳定怎

11-06 0xc000022-[cf烟雾头]cf怎么调烟雾头

11-06 qizideyouhuo:应用程序无法正常启动0xc0000

11-06 ipz-185:win7系统vcf文件怎么打开

11-06 傻哥蹦迪:win10系统s4怎么打开usb调试

11-06 八神浩树gtaste:回收站清空了怎么恢复

11-06 妖尾之黑色守护:win10系统电脑没有1440x900

11-06 校园至尊魔王小说:win7系统浏览网页时字体

11-06 女斗合众国:win10系统访问共享文件夹提示请

11-06 tokyo hot n0654:恢复win7系统默认字体一招

11-06 雨酷仙境:设置win7系统转移临时文件夹腾出

11-06 阿穆纳伊之杖:win7系统开始菜单在右边还原

11-06 tunespotting:win10系统火狐flash插件总是

11-06 甘尔葛分析师：计谋网站seo关键词暴涨有什

11-06 蔡贵霖: 计谋网站seo关键词暴涨有什么秘密

11-06 博益网首页:ao3网页版进入不了解决方法

11-06 漏斗子专栏: 网站数据分析小白易懂精华篇

11-06 见证双虹怎么做:win7系统开启telnet命令的

11-06 颾狐蝶蜋:系统资源不足无法完成请求的服务

11-06 国光中学校歌:提交网站到alexa查询详细步骤

11-06 西安有情天:静态网页和动态网页的区别

11-06 红木雅尚斋:外部链接构造对网站的好处

11-06 前官礼遇：防止域名劫持–增强域安全性的10

11-06 密传二转答案: 中文分词算法有哪些

11-06 金泉家园邮编:百度快照劫持的表现及应对方

sklearn例程:RBF核的显式特征映射近似发布时间：2022-05-14

sklearn例程:NMF和LDA主题提取发布时间：2022-05-14

剪的笔顺,诠释剪的笔画,认识剪的部首

1 六六分期app的软件客服如何联系？(六六分期

六六分期app的软件客服如何联系？不知道吗？加qq群【895510560】即可！标题：六六分期

阅读：18962|2023-10-27

2 可心卡盟:win10系统火狐flash插件崩溃怎么

今天小编告诉大家如何处理win10系统火狐flash插件总是崩溃的问题，可能很多用户都不知

阅读：9916|2022-11-06

3 亲亲特价:怎么删除回收站图标

今天小编告诉大家如何对win10系统删除桌面回收站图标进行设置，可能很多用户都不知道

阅读：8301|2022-11-06

4 济南大学虚拟社区:鲁大师节能降温的具体办

今天小编告诉大家如何对win10系统电脑设置节能降温的设置方法，想必大家都遇到过需要

阅读：8664|2022-11-06

5 xlueops.exe:无线网络安装向导

我们在使用xp系统的过程中,经常需要对xp系统无线网络安装向导设置进行设置，可能很多

阅读：8598|2022-11-06

6 女斗合众国:win7系统cf与主机连接不稳定怎

今天小编告诉大家如何处理win7系统玩cf老是与主机连接不稳定的问题，可能很多用户都不

阅读：9606|2022-11-06

7 0xc000022-[cf烟雾头]cf怎么调烟雾头

电脑对日常生活的重要性小编就不多说了，可是一旦碰到win7系统设置cf烟雾头的问题，很

阅读：8588|2022-11-06

8 qizideyouhuo:应用程序无法正常启动0xc0000

我们在日常使用电脑的时候，有的小伙伴们可能在打开应用的时候会遇见提示应用程序无法

阅读：7973|2022-11-06

9 ipz-185:win7系统vcf文件怎么打开

今天小编告诉大家如何对win7系统打开vcf文件进行设置，可能很多用户都不知道怎么对win

阅读：8600|2022-11-06

10 傻哥蹦迪:win10系统s4怎么打开usb调试

今天小编告诉大家如何对win10系统s4开启USB调试模式进行设置，可能很多用户都不知道怎

阅读：7515|2022-11-06

客服电话

电子邮件

Python - Elasticsearch入门教程

什么是ElasticSearch？

ElasticSearch用例

设置和运行

基本范例

在Python中访问ElasticSearch

网页抓取和Elasticsearch

抓取数据

创建索引

存入索引

查询记录

结论

上一篇：

下一篇：

宽字符std::wstring的长度和大小问题？size

CVE-2022-34286

krishnaik06/Machine-Learning-in-90-days

armancodv/building-energy-model-matlab:

美元符号为什么是“$”

剪的笔顺,诠释剪的笔画,认识剪的部首

六六分期app的软件客服如何联系？(六六分期

florent37/ViewAnimator: A fluent Android

florent37/Shrine-MaterialDesign2: implem

CVE-2020-36276

SimpleSoftwareIO/simple-sms: Send and re

关于我们

产品与服务

解决方案

139-2527-9053