Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
361 views
in Technique[技术] by (71.8m points)

python - how to avoid using _siftup or _siftdown in heapq

I have no idea how to solve following problem efficiently without using _siftup or _siftdown:

How to restore the heap invariant, when one element is out-of-order?

In other words, update old_value in heap to new_value, and keep heap working. you can assume there is only one old_value in heap. The fucntion definition is like:

def update_value_in_heap(heap, old_value, new_value):

Here is my real scenario, read it if you are interested in.

  • You can imagine it is a small autocomplete system. I need to count the frequency of words, and maintain the top k max-count words, which prepare to output at any moment. So I use heap here. When one word count++, I need update it if it is in heap.

  • All the words and counts are stored in trie-tree's leaf, and heaps
    are stored in trie-tree's middle nodes. If you care about the word
    out of heap, don't worry, I can get it from trie-tree's leaf node.

  • when user type a word, it will first read from heap and then update
    it. For better performance, we can consider decrease update frequency by updated in batch.

So how to update the heap, when one particular word count increase?

Here is _siftup or _siftdown version simple example(not my scenario):

>>> from heapq import _siftup, _siftdown, heapify, heappop

>>> data = [10, 5, 18, 2, 37, 3, 8, 7, 19, 1]
>>> heapify(data)
>>> old, new = 8, 22              # increase the 8 to 22
>>> i = data.index(old)
>>> data[i] = new
>>> _siftup(data, i)
>>> [heappop(data) for i in range(len(data))]
[1, 2, 3, 5, 7, 10, 18, 19, 22, 37]

>>> data = [10, 5, 18, 2, 37, 3, 8, 7, 19, 1]
>>> heapify(data)
>>> old, new = 8, 4              # decrease the 8 to 4
>>> i = data.index(old)
>>> data[i] = new
>>> _siftdown(data, 0, i)
>>> [heappop(data) for i in range(len(data))]
[1, 2, 3, 4, 5, 7, 10, 18, 19, 37]

it costs O(n) to index and O(logn) to update. heapify is another solution, but less efficient than _siftup or _siftdown.

But _siftup and _siftdown are protected member in heapq, so they are not recommended to access from outside.

So is there a better and more efficient way to solve this problem? Best practice for this situation?

Thanks for reading, I really appreciate it to help me out. : )

already refer to heapq python - how to modify values for which heap is sorted, but no answer to my problem

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

TL;DR Use heapify.

One important thing that you have to keep in mind is that theoretical complexity and performances are two different things (even though they are related). In other words, implementation does matter too. Asymptotic complexities give you some lower bounds that you can see as guarantees, for example an algorithm in O(n) ensure that in the worst case scenario, you will execute a number of instructions that is linear in the input size. There are two important things here:

  1. constants are ignored, but constants matter in real life;
  2. the worst case scenario is dependent on the algorithm you consider, not only on the input.

Depending on the topic/problem you consider, the first point can be very important. In some domains, constants hidden in asymptotic complexities are so big that you can't even build inputs that are bigger than the constants (or that input wouldn't be realistic to consider). That's not the case here, but that's something you always have to keep in mind.

Giving these two observations, you can't really say: implementation B is faster than A because A is derived from a O(n) algorithm and B is derived from a O(log n) algorithm. Even if that's a good argument to start with in general, it's not always sufficient. Theoretical complexities are especially good for comparing algorithms when all inputs are equally likely to happen. In other words, when you algorithms are very generic.

In the case where you know what your use cases and inputs will be you can just test for performances directly. Using both the tests and the asymptotic complexity will give you a good idea on how your algorithm will perform (in both extreme cases and arbitrary practical cases).

That being said, lets run some performance tests on the following class that will implement three different strategies (there are actually four strategies here, but Invalidate and Reinsert doesn't seem right in your case as you'll invalidate each item as many time as you see a given word). I'll include most of my code so you can double check that I haven't messed up (you can even check the complete notebook):

from heapq import _siftup, _siftdown, heapify, heappop

class Heap(list):
  def __init__(self, values, sort=False, heap=False):
    super().__init__(values)
    heapify(self)
    self._broken = False
    self.sort = sort
    self.heap = heap or not sort

  # Solution 1) repair using the knowledge we have after every update:        
  def update(self, key, value):
    old, self[key] = self[key], value
    if value > old:
        _siftup(self, key)
    else:
        _siftdown(self, 0, key)
    
  # Solution 2 and 3) repair using sort/heapify in a lazzy way:
  def __setitem__(self, key, value):
    super().__setitem__(key, value)
    self._broken = True
    
  def __getitem__(self, key):
    if self._broken:
        self._repair()
        self._broken = False
    return super().__getitem__(key)

  def _repair(self):  
    if self.sort:
        self.sort()
    elif self.heap:
        heapify(self)

  # … you'll also need to delegate all other heap functions, for example:
  def pop(self):
    self._repair()
    return heappop(self)

We can first check that all three methods work:

data = [10, 5, 18, 2, 37, 3, 8, 7, 19, 1]

heap = Heap(data[:])
heap.update(8, 22)
heap.update(7, 4)
print(heap)

heap = Heap(data[:], sort_fix=True)
heap[8] = 22
heap[7] = 4
print(heap)

heap = Heap(data[:], heap_fix=True)
heap[8] = 22
heap[7] = 4
print(heap)

Then we can run some performance tests using the following functions:

import time
import random

def rand_update(heap, lazzy_fix=False, **kwargs):
    index = random.randint(0, len(heap)-1)
    new_value = random.randint(max_int+1, max_int*2)
    if lazzy_fix:
        heap[index] = new_value
    else:
        heap.update(index, new_value)
    
def rand_updates(n, heap, lazzy_fix=False, **kwargs):
    for _ in range(n):
        rand_update(heap, lazzy_fix)
        
def run_perf_test(n, data, **kwargs):
    test_heap = Heap(data[:], **kwargs)
    t0 = time.time()
    rand_updates(n, test_heap, **kwargs)
    test_heap[0]
    return (time.time() - t0)*1e3

results = []
max_int = 500
nb_updates = 1

for i in range(3, 7):
    test_size = 10**i
    test_data = [random.randint(0, max_int) for _ in range(test_size)]

    perf = run_perf_test(nb_updates, test_data)
    results.append((test_size, "update", perf))
    
    perf = run_perf_test(nb_updates, test_data, lazzy_fix=True, heap_fix=True)
    results.append((test_size, "heapify", perf))

    perf = run_perf_test(nb_updates, test_data, lazzy_fix=True, sort_fix=True)
    results.append((test_size, "sort", perf))

The results are the following:

import pandas as pd
import seaborn as sns

dtf = pd.DataFrame(results, columns=["heap size", "method", "duration (ms)"])
print(dtf)

sns.lineplot(
    data=dtf, 
    x="heap size", 
    y="duration (ms)", 
    hue="method",
)
image

From these tests we can see that heapify seems like the most reasonable choice, it has a decent complexity in the worst case: O(n) and perform better in practice. On the other hand, it's probably a good idea to investigate other options (like having a data structure dedicated to that particular problem, for example using bins to drop words into, then moving them from a bin to the next look like a possible track to investigate).

Important remark: this scenario (updating vs. reading ratio of 1:1) is unfavorable to both the heapify and sort solutions. So if you manage to have a k:1 ratio, this conclusion will be even clearer (you can replace nb_updates = 1 with nb_updates = k in the above code).

Dataframe details:

    heap size   method  duration in ms
0        1000   update        0.435114
1        1000  heapify        0.073195
2        1000     sort        0.101089
3       10000   update        1.668930
4       10000  heapify        0.480175
5       10000     sort        1.151085
6      100000   update       13.194084
7      100000  heapify        4.875898
8      100000     sort       11.922121
9     1000000   update      153.587103
10    1000000  heapify       51.237106
11    1000000     sort      145.306110

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...