> I know lists use C arrays under the hood, which made me conclude that lookup in a list with just a few items would be better than in a dictionary (accessing a few elements in an array is faster than computing a hash).
Accessing a few array elements is cheap, sure, but computing `==` is surprisingly heavyweight in Python. See that spike in your second graph? That's the cost of computing `==` for two ints right there. Your list lookups need to compute `==` a lot more often than your dict lookups do.
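If you want to see the crossover yourself, a rough benchmark along these lines should show it (the sizes, variable names, and iteration counts here are illustrative, not the ones from your graphs):

```python
# Rough benchmark: membership test in a small list vs. a small dict.
# Probing for the last element is the worst case for the list scan.
from timeit import timeit

for n in (4, 16, 64, 256):
    seq = list(range(n))
    d = dict.fromkeys(seq)
    probe = n - 1
    t_list = timeit('probe in seq', globals={'probe': probe, 'seq': seq}, number=200_000)
    t_dict = timeit('probe in d', globals={'probe': probe, 'd': d}, number=200_000)
    print(f'n={n:4d}  list: {t_list:.4f}s  dict: {t_dict:.4f}s')
```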
Meanwhile, computing hashes might be a pretty heavyweight operation for a lot of objects, but all the ints involved here just hash to themselves. (-1 would hash to -2, and large integers (technically `long`s) would hash to smaller integers, but that doesn't apply here.)
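You can check that directly; the exception for -1 exists because -1 is reserved as an error return value in CPython's C-level hashing API (the exact wrap-around for huge ints depends on the build, so treat the last line as illustrative):

```python
print(hash(5))      # 5  -- small ints hash to themselves
print(hash(0))      # 0
print(hash(-1))     # -2 -- -1 is reserved as an error code internally
print(hash(2**70))  # a reduced value (512 on a typical 64-bit CPython,
                    # since hashing there works modulo 2**61 - 1)
```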
Dict lookup isn't really that bad in Python, especially when your keys are just a consecutive range of ints. All the ints here hash to themselves, and Python uses a custom open addressing scheme instead of chaining, so all your keys end up nearly as contiguous in memory as if you'd used a list (which is to say, the pointers to the keys end up in a contiguous range of `PyDictEntry`s). The lookup procedure is fast, and in your test cases it always hits the right key on the first probe.
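For intuition only, here is a toy open-addressing table; it is a simplification (real CPython dicts perturb the hash for later probes and keep entries in a separate compact array), but it shows why consecutive small int keys, which hash to themselves, land in their slot on the very first probe:

```python
# Toy open-addressing dict (illustrative; NOT CPython's actual algorithm).
class ToyDict:
    def __init__(self, size=8):
        self.size = size               # power of two, like CPython's tables
        self.slots = [None] * size     # each slot holds (key, value) or None

    def _probe(self, key):
        i = hash(key) & (self.size - 1)    # first probe: hash masked to table size
        while True:
            yield i
            i = (i + 1) & (self.size - 1)  # linear probing stand-in for CPython's perturbation

    def put(self, key, value):
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return

    def get(self, key):
        for i in self._probe(key):
            entry = self.slots[i]
            if entry is None:
                raise KeyError(key)
            if entry[0] is key or entry[0] == key:  # identity shortcut first, like CPython
                return entry[1]

d = ToyDict()
for k in range(4):
    d.put(k, k * k)
print(d.get(3))  # 9 -- key 3 hashes to 3 and is found on the first probe
```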
Okay, back to the spike in graph 2. Lookup times jump at 1024 entries because for all smaller sizes, the integers you were looking up were all <= 256, so they fell within the range of CPython's small integer cache. The reference implementation of Python keeps canonical integer objects for every integer from -5 to 256, inclusive. For those integers, Python can use a quick pointer comparison and skip the (surprisingly heavyweight) process of computing `==`. For larger integers, the argument to `in` was no longer the same object as the matching integer in the dict, and Python had to go through the whole `==` process.
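Here's a quick way to observe the cache boundary (this is a CPython implementation detail, and constant folding can mask it for literals, hence the `int(...)` calls):

```python
# CPython keeps one canonical object for each int from -5 to 256.
# int('...') is used so the compiler can't fold equal literals together.
a, b = int('256'), int('256')
print(a is b)   # True: both names point at the cached 256 object
c, d = int('257'), int('257')
print(c is d)   # False: two distinct objects, so == has to compare values
```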