Scroll is the way to go if you want to retrieve a large number of documents, i.e. well beyond the 10,000-hit limit that regular from/size pagination allows by default (the index.max_result_window setting, which can be raised).
The first request needs to specify the query you want to run and the scroll
parameter, whose value is how long the search context is kept alive before it times out (1 minute in the example below):
POST /index/type/_search?scroll=1m
{
    "size": 1000,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}
The response to that first call contains a _scroll_id
that you need to pass in the second call:
POST /_search/scroll
{
    "scroll" : "1m",
    "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}
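Each subsequent response also contains a _scroll_id (it may differ from the previous one, so always use the most recent)
that you need to pass to the next call, until you've retrieved the number of documents you need or a response comes back with no hits.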
So in pseudo code it looks somewhat like this:
# first request
response = request('POST /index/type/_search?scroll=1m', query)
docs = response.hits.hits
scroll_id = response._scroll_id
# subsequent requests, until a page comes back empty
while (true) {
    response = request('POST /_search/scroll', { scroll: '1m', scroll_id: scroll_id })
    if (response.hits.hits is empty) break
    docs.push(...response.hits.hits)
    scroll_id = response._scroll_id
}
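For something closer to runnable code, here is a minimal Python sketch of the same loop. It is only an illustration under a few assumptions: `scroll_hits` and the `send` callable are hypothetical names (not part of any Elasticsearch client), `send(method, path, body)` is assumed to perform the HTTP call and return the parsed JSON response, and the search path is written without a type (`/index/_search`), which you may need to adjust for your version. It also clears the scroll context when done rather than waiting for the timeout.

```python
from typing import Any, Callable, Dict, Iterator, Optional

# Hypothetical transport: (method, path, json_body) -> parsed JSON response.
Send = Callable[[str, str, Dict[str, Any]], Dict[str, Any]]


def scroll_hits(send: Send, index: str, query: Dict[str, Any],
                page_size: int = 1000, keep_alive: str = "1m",
                limit: Optional[int] = None) -> Iterator[Dict[str, Any]]:
    """Yield hits page by page until a page is empty or `limit` is reached."""
    # First request: the query plus the scroll keep-alive.
    response = send("POST", f"/{index}/_search?scroll={keep_alive}",
                    {"size": page_size, "query": query})
    scroll_id = response.get("_scroll_id")
    yielded = 0
    try:
        while True:
            hits = response["hits"]["hits"]
            if not hits:
                return  # an empty page means the scroll is exhausted
            for hit in hits:
                yield hit
                yielded += 1
                if limit is not None and yielded >= limit:
                    return
            # Subsequent requests: only the keep-alive and the scroll id.
            response = send("POST", "/_search/scroll",
                            {"scroll": keep_alive, "scroll_id": scroll_id})
            # Always carry forward the most recent scroll id.
            scroll_id = response.get("_scroll_id", scroll_id)
    finally:
        if scroll_id is not None:
            # Release the search context instead of waiting for the timeout.
            send("DELETE", "/_search/scroll", {"scroll_id": scroll_id})
```

With the requests library, `send` could be something like `lambda m, p, b: requests.request(m, base_url + p, json=b).json()`, where `base_url` is your cluster address; `list(scroll_hits(send, "index", {"match": {"title": "elasticsearch"}}))` would then collect every matching document.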
UPDATE:
Please refer to the following answer, which is more accurate regarding the best solution for deep pagination: Elastic Search - Scroll behavior