Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
739 views
in Technique[技术] by (71.8m points)

postgresql - How to provide an API client with 1,000,000 database results?

Following up from my previous question:

Using "Cursors" for paging in PostgreSQL

What is a good way to provide an API client with 1,000,000 database results?

We are currently using PostgreSQL. A few suggested methods:

  • Paging using Cursors
  • Paging using random numbers ( Add "GREATER THAN ORDER BY " to each query )
  • Paging using LIMIT and OFFSET ( breaks down for very large data sets )
  • Save information to a file and let the client download it
  • Iterate through results, then POST the data to the client server
  • Return only keys to the client, then let the client request the objects from Cloud files like Amazon S3 (still may require paging just to get the file names ).

What haven't I thought of that is stupidly simple and way better than any of these options?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

The table has a primary key. Make use of it.

Instead of LIMIT and OFFSET, do your paging with a filter on the primary key. You hinted at this with your comment:

Paging using random numbers ( Add "GREATER THAN ORDER BY " to each query )

but there's nothing random about how you should do it.

SELECT * FROM big_table WHERE id > $1 ORDER BY id ASC LIMIT $2

Allow the client to specify both parameters, the last ID it saw and the number of records to fetch. Your API will have to either have a placeholder, extra parameter, or alternate call for "fetch the first n IDs" where it omits the WHERE clause from the query, but that's trivial.

This approach will use a fairly efficient index scan to get the records in order, generally avoiding a sort or the need to iterate through all the skipped records. The client can decide how many rows it wants at once.

This approach differs from the LIMIT and OFFSET approach in one key way: concurrent modification. If you INSERT into the table with a key lower than a key some client has already seen, this approach will not change its results at all, whereas the OFFSET approach will repeat a row. Similarly, if you DELETE a row with a lower-than-already-seen ID the results of this approach will not change, whereas OFFSET will skip an unseen row. There is no difference for append-only tables with generated keys, though.

If you know in advance that the client will want the whole result set, the most efficient thing to do is just send them the whole result set with none of this paging business. That's where I would use a cursor. Read the rows from the DB and send them to the client as fast as the client will accept them. This API would need to set limits on how slow the client was allowed to be to avoid excessive backend load; for a slow client I'd probably switch to paging (as described above) or spool the whole cursor result out to a temporary file and close the DB connection.

Important caveats:

  • Requires a UNIQUE constraint / UNIQUE index or PRIMARY KEY to be reliable
  • Different concurrent modification behaviour to limit/offset, see above

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...