Thursday, February 11, 2010

Reservoir Sampling

Problem: Given a stream of n elements (n is unknown), sample an element with the same probability for each element in ONE pass. (e.g. given a text file with unknown number of lines, randomly print one line in the same probability for each line, using ONE pass only.)

If two passes are allowed, then it's easy since the first pass can get the number of lines, then just randomly pick one in second pass. For one pass, this does not work since total number of lines is unknown.

Solution: Reservoir Sampling.

References:
http://blogs.msdn.com/spt/archive/2008/02/05/reservoir-sampling.aspx
http://gregable.com/2007/10/reservoir-sampling.html

[Added 3/6/2010]
A straightforward way of thinking about this is:

At the n-th element, we already have a sampled element x for the first n-1 elements, now choose between x and n with the probability of (n-1)/n for x, and 1/n for the n-th element. This is is someway similar to the card shuffling algorithm.

The next extension is to sample m (instead of 1) elements out of n. Then just choose the n-th element with probability m/n, and if chosen, replace the n-th element with a random element in the previously chosen m elements.

The next extension is weighted sampling. There are discussion in the first link above.

No comments:

Blog Archive

Followers