Sunday, April 18, 2010

Find median

The problem: Given N computers, each has M numbers. Find median of all N*M numbers.
Sometimes a master computer is used in the setting which controls the rest slave computers.

Can do this by:

Method 1: let the N computers each partition their numbers into X buckets, then count the sum of numbers in each bucket. This step can be done recursively to decrease the range. This will come up with a bucket that contains the median, then do a sort in this bucket to find it. E.g., let's have 4 buckets, [0, 2^30], [2^30, 2^31], [2^31, 2^31 + 2^30], [2^31 + 2^30, 2^32]. Sum shows the numbers fall in each bucket are 5000, 6000, 5000, 5000. Then the median must be in the second bucket, and is the 5500-th number in it. Extend this, if we have maximal 2^32 numbers in total, we can set bucket size to be 2^16, then after one round we can figure out which bucket contains the median, and it contains maximal only 2^16 numbers. Say we do k rounds, This is O(kN) or just O(N) in time, plus the time to do final round sorting on a small range.

Method 2: binary search. Pick a random number X, on each computer, count the numbers bigger than X and smaller than X. Add counts on all computers together, we know the median is bigger or smaller than X. Say median is smaller than X, then next round we pick (0 + X)/2 = X/2 as the guess. Each round it takes O(N) time, there are log(N) rounds. Total time cost is O(N log(N)).

See the discussion here for reference.

No comments:

Blog Archive

Followers