Is there an algorithm to estimate the median, mode, skewness, and/or kurtosis of set of values, but that does NOT require storing all the values in memory at once?
I'd like to calculate the basic statistics:
- mean: arithmetic average
- variance: average of squared deviations from the mean
- standard deviation: square root of the variance
- median: value that separates larger half of the numbers from the smaller half
- mode: most frequent value found in the set
- skewness: tl; dr
- kurtosis: tl; dr
The basic formulas for calculating any of these is grade-school arithmetic, and I do know them. There are many stats libraries that implement them, as well.
My problem is the large number (billions) of values in the sets I'm handling: Working in Python, I can't just make a list or hash with billions of elements. Even if I wrote this in C, billion-element arrays aren't too practical.
The data is not sorted. It's produced randomly, on-the-fly, by other processes. The size of each set is highly variable, and the sizes will not be known in advance.
I've already figured out how to handle the mean and variance pretty well, iterating through each value in the set in any order. (Actually, in my case, I take them in the order in which they're generated.) Here's the algorithm I'm using, courtesy http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm:
- Initialize three variables: count, sum, and sum_of_squares
- For each value:
- Increment count.
- Add the value to sum.
- Add the square of the value to sum_of_squares.
- Divide sum by count, storing as the variable mean.
- Divide sum_of_squares by count, storing as the variable mean_of_squares.
- Square mean, storing as square_of_mean.
- Subtract square_of_mean from mean_of_squares, storing as variance.
- Output mean and variance.
This "on-line" algorithm has weaknesses (e.g., accuracy problems as sum_of_squares quickly grows larger than integer range or float precision), but it basically gives me what I need, without having to store every value in each set.
But I don't know whether similar techniques exist for estimating the additional statistics (median, mode, skewness, kurtosis). I could live with a biased estimator, or even a method that compromises accuracy to a certain degree, as long as the memory required to process N values is substantially less than O(N).
Pointing me to an existing stats library will help, too, if the library has functions to calculate one or more of these operations "on-line".
See Question&Answers more detail:
os