I'm trying to reduce the execution time of an AppEngine query by running multiple sub-queries asynchronously, using query.fetch_async(). However, it seems that the gain is minimal compared to running the queries serially.
Below is some minimal sample code (in Python) illustrating the problem - first a function for the asynchronous run:
def run_parallel(self, repeats):
    # assumes: from datetime import datetime
    #          from google.appengine.ext import ndb
    start = datetime.utcnow()
    futures = []
    for i in xrange(0, repeats):
        q = User.query()
        f = q.fetch_async(300, keys_only=True)
        futures.append(f)
    while futures:
        f = ndb.Future.wait_any(futures)
        futures.remove(f)
        results = f.get_result()
        delta_secs = (datetime.utcnow() - start).total_seconds()
        self.response.out.write("Got %d results, delta_sec: %f<br>" % (len(results), delta_secs))
Then a function for the corresponding serial run:
def run_serial(self, repeats):
    start = datetime.utcnow()
    for i in xrange(0, repeats):
        q = User.query()
        results = q.fetch(300, keys_only=True)
        delta_secs = (datetime.utcnow() - start).total_seconds()
        self.response.out.write("Got %d results, delta_sec: %f<br>" % (len(results), delta_secs))
The output of running each of these functions with 10 repeats (not on the dev server), i.e. of the following calls:
run_parallel(10)
run_serial(10)
is as follows:
Running parallel queries...
Got 300 results, delta_sec: 0.401090
Got 300 results, delta_sec: 0.501700
Got 300 results, delta_sec: 0.596110
Got 300 results, delta_sec: 0.686120
Got 300 results, delta_sec: 0.709220
Got 300 results, delta_sec: 0.792070
Got 300 results, delta_sec: 0.816500
Got 300 results, delta_sec: 0.904360
Got 300 results, delta_sec: 0.993600
Got 300 results, delta_sec: 1.017320
Running serial queries...
Got 300 results, delta_sec: 0.114950
Got 300 results, delta_sec: 0.269010
Got 300 results, delta_sec: 0.370590
Got 300 results, delta_sec: 0.472090
Got 300 results, delta_sec: 0.575130
Got 300 results, delta_sec: 0.678900
Got 300 results, delta_sec: 0.782540
Got 300 results, delta_sec: 0.883960
Got 300 results, delta_sec: 0.986370
Got 300 results, delta_sec: 1.086500
Hence the parallel and serial versions take roughly the same time, around 1 second. The Appstats are as follows, where the first 10 queries are the parallel ones and the following 10 are the serial ones:
From these stats it looks like the first 10 queries are indeed running in parallel, but that each of them takes a disproportionate amount of time compared to the individual serial queries. It looks like they may be blocking somehow, waiting for each other to complete.
So my question: Is there anything wrong with my code for running asynchronous queries? Or is there an inherent limitation in the efficiency of asynchronous queries on AppEngine?
I wondered if the behaviour could be caused by one of the following:
- Running asynchronous queries on the same entity type. However, a similar example using multiple different entity types shows similar results.
- Running identical queries, somehow locking sections of the index. However, a similar example in which each query is different (returning disjoint result sets) yields similar results; a sketch of this variant follows the list.
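For reference, the disjoint-query variant looked roughly like this (a sketch, not the exact code; the indexed group property is a hypothetical stand-in for the property I actually filtered on):

def run_parallel_disjoint(self, repeats):
    start = datetime.utcnow()
    futures = []
    for i in xrange(0, repeats):
        # each query targets a different slice of the data, so the
        # result sets are disjoint
        q = User.query(User.group == i)
        futures.append(q.fetch_async(300, keys_only=True))
    while futures:
        f = ndb.Future.wait_any(futures)
        futures.remove(f)
        results = f.get_result()
        delta_secs = (datetime.utcnow() - start).total_seconds()
        self.response.out.write("Got %d results, delta_sec: %f<br>" % (len(results), delta_secs))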
So, I'm at a bit of a loss. Any suggestions would be greatly appreciated.
Update 1
Following Bruyere's suggestion I've tried using db rather than ndb, and I've tried swapping the order of the parallel and serial versions. The results are the same.
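Roughly, the db version of the serial loop looked like this (a sketch; UserDb is a hypothetical db counterpart of my ndb User model):

from google.appengine.ext import db

class UserDb(db.Model):
    name = db.StringProperty()

def run_serial_db(self, repeats):
    start = datetime.utcnow()
    for i in xrange(0, repeats):
        # keys-only fetch, mirroring the ndb version
        results = UserDb.all(keys_only=True).fetch(300)
        delta_secs = (datetime.utcnow() - start).total_seconds()
        self.response.out.write("Got %d results, delta_sec: %f<br>" % (len(results), delta_secs))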
Update 2
Here's a related post concerned with the same issue; still no answer as to why parallel queries are so inefficient:
Best practice to query large number of ndb entities from datastore
Update 3
The corresponding code using the Java SDK is parallelised very neatly. Here are the Java appstats:
To be precise, this Java implementation is explicitly multi-threaded, running queries in separate threads; this is necessary because, contrary to what the AppEngine documentation claims, using query iterators does not actually result in queries being executed in parallel.
I've also tried using explicit multi-threading with synchronous query calls in the Python version (a sketch of what I tried is below), but with the same poor results as the original asynchronous version.
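The threaded variant was along these lines (a sketch; it assumes threadsafe: true in app.yaml so the runtime allows threads spawned within the request):

import threading

def run_threaded(self, repeats):
    start = datetime.utcnow()
    results_per_thread = [None] * repeats

    def worker(i):
        # each thread issues a synchronous keys-only fetch
        results_per_thread[i] = User.query().fetch(300, keys_only=True)

    threads = [threading.Thread(target=worker, args=(i,)) for i in xrange(0, repeats)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    delta_secs = (datetime.utcnow() - start).total_seconds()
    self.response.out.write("Ran %d threaded queries, delta_sec: %f<br>" % (repeats, delta_secs))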
The fact that the Java version performs as expected implies that the poor Python async performance is not caused by an AppEngine CPU bottleneck.
The only alternative explanation that I can think of is that Python's Global Interpreter Lock (GIL) is causing thrashing. This is supported by the fact that decreasing the GIL check interval (using sys.setcheckinterval) exacerbates the poor async performance.
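By "decreasing the GIL check interval" I mean something along these lines (the value 10 is just an example; the default is 100):

import sys

# Make the interpreter consider switching threads (releasing the GIL) more
# often than the default of every 100 bytecode instructions.
sys.setcheckinterval(10)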
This is surprising though: the GIL shouldn't have such a severe impact given that the queries are I/O-bound. My speculation is that the RPC input buffers are small enough that the async calls resume frequently while results are being retrieved, which could cause GIL thrashing. I've had a look at the Python AppEngine library code, but the low-level RPC calls are made by _apphosting_runtime___python__apiproxy.MakeCall(), which appears to be closed source.
Alas, my conclusion is that the Python AppEngine runtime is not suited for the kind of parallel querying that I require, leaving me with no other option than moving to the Java runtime. I would really like to avoid this, and so I really hope that I'm wrong and have missed something obvious. Any suggestions or pointers would be greatly appreciated.
Thanks!