Sunday, April 18, 2010

Find median

The problem: Given N computers, each has M numbers. Find median of all N*M numbers.
Sometimes a master computer is used in the setting which controls the rest slave computers.

Can do this by:

Method 1: let the N computers each partition their numbers into X buckets, then count the sum of numbers in each bucket. This step can be done recursively to decrease the range. This will come up with a bucket that contains the median, then do a sort in this bucket to find it. E.g., let's have 4 buckets, [0, 2^30], [2^30, 2^31], [2^31, 2^31 + 2^30], [2^31 + 2^30, 2^32]. Sum shows the numbers fall in each bucket are 5000, 6000, 5000, 5000. Then the median must be in the second bucket, and is the 5500-th number in it. Extend this, if we have maximal 2^32 numbers in total, we can set bucket size to be 2^16, then after one round we can figure out which bucket contains the median, and it contains maximal only 2^16 numbers. Say we do k rounds, This is O(kN) or just O(N) in time, plus the time to do final round sorting on a small range.

Method 2: binary search. Pick a random number X, on each computer, count the numbers bigger than X and smaller than X. Add counts on all computers together, we know the median is bigger or smaller than X. Say median is smaller than X, then next round we pick (0 + X)/2 = X/2 as the guess. Each round it takes O(N) time, there are log(N) rounds. Total time cost is O(N log(N)).

See the discussion here for reference.

Thursday, April 15, 2010

Find Median of Two Sorted Arrays

See here for the following 3 methods:
Method 1: similar to merge sort, the (m+n)/2-th element is median. O(m+n).
Method 2: divide and conquer. O(log(n)) time, O(1) space. This works when m = n.
Method 3: binary search. O(log(n)) time, O(1) space. This works when m = n.

When m != n, it's more complicated. Can method 3 be extended to this case? Can have a try. Another possible way is to append the shorter array by INT_MIN (or value less than min of both arrays) and INT_MAX (or value greater than max of both arrays), then apply previous method.

If n >> m, can use binary search possibly.

A similar problem is finding the k-th largest element in a union of two arrays. See idea here. It's O(log(n+m)). This can be specialized to the situation of finding median, where k = (m+n)/2.

Tuesday, April 13, 2010

Max submatrix problem

Given a binary matrix M of 0s and 1s.
1) find the maximal square sub-matrix.
2) find the maximal sub-matrix.

1) is easy. The key is to setup a helper matrix H, whose first top row and first left column are the same as the original matrix, and other cells are obtained by:
if (M[i][j] == 1) H[i][j] = min(M[i-1][j], M[i][j-1], M[i-1][j-1]) + 1;
else H[i][j] = 0;
The H[i][j] is the size of the maximal square sub-matrix using H[i][j] as right bottom cell. See here for details. It's dynamic programming, time cost is O(n*m).

2) is harder. But it can be solved using the Max rectangle problem. First we pre-process matrix M, so each cell's value is the number of consecutive ones to the right on the same row. Now we can start scanning the matrix M from the left most column to the right most column, treating each column as the input of a max rectangle problem. Time complexity is O(mn), space complexity is O(n) for the stack used, where n is number of rows, m is number of columns. See here for reference.

There are variation problems of this type. For example:
3) Given binary square matrix, find the max square sub-matrix such that its borders are all 0s. See here for the question. As for solution, we may try to apply 1) to find the max sub-matrix of all 1s, and then corresponding max square sub-matrix with 0s on borders, but then it's possible the corner is 1. Therefore applying 1) directly is not a complete answer.

4) Given a matrix of positive and negative numbers, find the sub-matrix with the maximal sum. (Kadane's 2D problem). Solution is: for each cell of each row, get the sum of each prefix (m*(m-1)/2 possibilities). Then for each column, apply the Kadane 1D solution. Get the max of all kadane 1D solutions. This is O(m^2 * n), or O(n^3) if m == n.

See here for a reference discussion of these problems.

Monday, April 12, 2010

Check Equality of Two Number Sets

Say we have two lists of numbers. The lists are equal in length but order of elements are not the same. How to find out if the two lists contain the same set of elements?

Intuitive solution is first sort, then compare one by one. Sorting will take O(nlog(n)) time, and can be in place.

Another solution is to use a hashtable. Put every element of the first set on hashtable, and check if every element of the second set is in it. This is O(n) time, but O(n) space.

Is there an solution that is O(n) in time and O(1) in space?

The solution is given by the information theory. This is a mathematically proved solution. Basic idea is to calculate the entropy of the two sets to see if they are equal. I got this solution from heyes.

bool check_set_equality(int array1[], int array2[]) {
int sum1 := sum of array1 elements;
int sum2 := sum of array2 elements;
int min1 := array1 min element;
int min2 := array2 min element;
int N := array length;

if (sum1 != sum2 || min1 != min2) return false;

// normalization, result is the same for array2.
norm = (sum1 - N * min1) / N;

double s1 = 0.0, s2 = 0.0, p1, p2;

for (i = 0; i < N; i ++) {
p1 = (array1[i] - min1) / norm;
s1 += p1 * log(p1);
p2 = (array2[i] - min2) / norm;
s2 += p2 * log(p2);
}

if (s1 == s2) return true;
else return false;
}

Largest rectangle in histogram problem

The problem is to find the largest rectangle under a histogram.

This question is the last question from 2003/2004 ACM International Collegiate Programming Contest, Univ. of Ulm Local Contest. It's one of the 2 hard ones in this question set. See the problem description and the solution.

The naive algorithm is for each column, scan all the other columns to find max rectangle. This is O(n^2). O(n*log(n)) solution exists, for example, using order statistic trees. O(n) solution also exists, for example, using a stack. See above link for the solutions.

The O(n) solution using a stack is not difficult to code. About 50 lines of C++ code will do. See below.
#include <iostream>
#include <stack>
using namespace std;

void getMax(int hist[], stack<int> * s, int right, int & max) {
  int i = s->top();
  int height = hist[i];
  s->pop();
  int left = (s->size() > 0) ? s->top() : -1;
  int area = height * (right - left);
  if (area > max) max = area;
}

void doHist(int hist[], int len) {
  stack<int> * s = new stack<int>;
  int i, max, top_v, right;

  max = 0;
  for (i = 0; i < len; i ++) {
    if (s->size() == 0) {
      s->push(i);
      continue;
    }

    top_v = hist[s->top()];
    if (hist[i] > top_v) {
      s->push(i);
    } else if (hist[i] < top_v) {
      while (s->size() > 0 && hist[i] < hist[s->top()]) {
        getMax(hist, s, i - 1, max);
      }
      s->push(i);
    }
  }

  while (s->size() > 0) getMax(hist, s, i - 1 , max);

  cout << "max = " << max << endl;
}

int main() {
  int hist[] = {3, 5, 4, 7, 6, 5, 2}; // answer: 20
  doHist(hist, sizeof(hist) / sizeof(int));
  return 0;
}

[5/28/2010]
The above implementation contains a bug. The following is updated version on 5/9/2010.
#include <iostream>
#include <sstream>
#include <stack>
using namespace std;

int DEBUG = 0;

//
// Note that if s->size() becomes 0, then left = start.
// This is useful when entry in hist[] can be zero.
// If all entries in hist[] are positive, then no need to use start,
// In that case, use -1 in place of start in this function.
//
void getMax(int hist[], stack<int> * s, 
            int newHeight, int right, int & max, int & start) {
  int height, left = 0, area;
  while (s->size() > 0 && hist[s->top()] > newHeight) {
    height = hist[s->top()];
    s->pop();
    left = (s->size() > 0) ? s->top() : start; 
    while (s->size() > 0 && hist[s->top()] == height) {
      s->pop();
      left = (s->size() > 0) ? s->top() : start; 
    }

    area = height * (right - left);
    if (DEBUG) printf("area: %d * (%d - %d) = %d\n", height, right, left, area);
    if (area > max) max = area;
  }
}

//
// Note that when hist[i] == top_v, we push i.
// In the acm judge site, it says skip i if equal. 
// But I feel somehow it can't keep track of the left value
// when multiple columns have the same height.
//
int doHist(int hist[], int len) {
  stack<int> * s = new stack<int>;
  int i, max, top_v;
  int start = -1; // the position before the last 0, used by left.

  max = 0;
  for (i = 0; i < len; i ++) {
    if (s->size() == 0) {
      s->push(i);
      continue;
    }

    top_v = hist[s->top()];
    if (hist[i] >= top_v) {
      s->push(i);
    } else if (hist[i] < top_v) {
      getMax(hist, s, hist[i], i - 1, max, start);
      s->push(i); 
      if (hist[i] == 0) start = i - 1;
    }
  }

  getMax(hist, s, 0, i - 1 , max, start);

  cout << "max = " << max << endl;
  return max;
}

int main() {
  int hist[] = {3, 5, 4, 7, 6, 5, 2}; // answer: 20
  doHist(hist, sizeof(hist) / sizeof(int));
  return 0;
}


[1/22/2013]
Found simpler and working solutions:
// Use stack. O(n). 
    int getArea(vector &h) {
        if (h.size() == 0) return 0;
        h.push_back(0);
        
        stack s;
        int i = 0, area = 0;
        
        while (i < h.size()) {
            if (s.empty() || h[s.top()] <= h[i]) {
                s.push(i ++); // if bar goes higher, push to stack.
            }
            else { // bar lower, pop and update area.
                int t = s.top();
                s.pop();
                area = max(area, h[t] * (s.empty() ? i : i - s.top() - 1));
            }
        }
        
        return area;
    }

// Binary search. O(n*log(n)). Call with getArea(height, 0, height.size() - 1).
    int getArea(int h[], int L, int R)
    {
        if (L > R) return 0;
    
        int minIndex = L;
        for (int i = L+1; i <= R; ++ i) {
            if (h[minIndex] > h[i]) minIndex = i;
        }
       
        int width = R - L + 1;
        int areaT = width * h[minIndex];
        int areaL = getArea(h, L, minIndex - 1);
        int areaR = getArea(h, minIndex + 1, R);
    
        return max(areaT, max(areaL, areaR));
    } 

Sunday, April 11, 2010

IR Evaluation

How do you evaluate a search engine such as google or bing? There is a field called IR (Information Retrieval) Evaluation that addresses this topic.

There can be different levels of evaluation. For example, 1) performance, 2) result effectiveness, 3) user satisfaction. IR evaluation discusses result effectiveness.

For a given search query, an IR system returns a ranked list of ordered candidates. We need to evaluate this ranked list of candidates. The criteria for relevance of the ranked list can be measured by two factors:

1) Precision: fraction of retrieved results that are relevant
2) Recall: fraction of relevant results that are retrieved.

There is a trade-off between precision and recall. Van Rijsbergen's F-measure expresses this relationship for a single query:

F(j) = (1 + b^2) / ( b^2/recall(j) + 1/precision(j) )

Other measures include those such as reciprocal measure, error, utility etc. Stability of measure can be an issue since some measures change w.r.t. time.

Reference:
- IR evaluation lecture notes
- Evaluation in IR

Friday, April 9, 2010

Garbage collector implementation

Perl does garbage collection by reference counting. The drawback is circular reference.

Java does this by mark-and-sweep.

C# does this by mark-and-compaction. Mark is similar to Java. Compaction is to copy used blocks together to avoid memory fragmentation.

A usual way of doing garbage collection is:
1) establish directed graph on objects.
2) do reachability test and mark reachable objects.
3) remove unreachable objects.

See here: A Simple Garbage Collector for C++.


Find prime numbers

There are many interesting results on this problem.

The naive method is testing odd factors up to n/2. It's O(n^2) though.
By eliminating multiples of 3, 5, 7 first you can eliminate about 70% of the numbers.
In testing, you can use square instead of square root to speed up calculation.

There are probabilistic algorithms, such as Rabin test, and Lehman test.

The Sieve of Atkin algorithm is described here. Cost is O((N/log) * log N) in time, and N^(1/2 + o(1)) in memory.

The sieve of Eratosthenes algorithm uses O(N) time and O(N^(1/2)(log (log N))/log N) memory.

BTW, the density of prime numbers around N is 1/ln(N). This translates to that the average gap between prime numbers around N is ln(N).

Newton Ralphson's method to calculate square root

x^2 = n, need to find x.
f(x) = x^2 - n.
x_{n+1} = x_{n} - f(x_{n}) / f ' (x_{n}).

See the following graph for intuition.
Note that the slope of the tangent line is f ' (x_{n}) = f(x_{n}) / {x_{n+1} - x_{n}}.


The Frobenius Number

This is a set of problems that ask for this: given a set of numbers S, find the minimal number that cannot be represented using linear combination of numbers in S.

Some results are:
1) if |S| = 1, then {1} is the only solution.
2) if |S| = 2, then for S = {a, b}, solution exists iff gcd(a, b) = 1, and the solution is a*b - a - b (James Joseph Sylvester, 1884). And the number of non-representable numbers in this case is (a-1)(b-1)/2.
3) if |S| = 3, algorithm exists to calculate the solution. One example is the McNugget number: you can buy McDonald's Chicken McNuggets in the pack of 6, 9 or 20, so S = {6, 9, 20}, the Frobenius number is 43:
44 = 6 + 9 + 9 + 20
45 = 9 + 9 + 9 + 9 + 9
46 = 6 + 20 + 20
47 = 9 + 9 + 9 + 20
48 = 6 + 6 + 9 + 9 + 9 + 9
49 = 9 + 20 + 20
Any number bigger can be obtained from these by adding a multiple of 6.
4) for general cases, it is proved to be NP-hard.

See here for an introduction to the Frobenius Number problem. Here is an extension of this problem into a set of strings and solution (2007).

Thursday, April 8, 2010

Search in a circular sorted array

Given an sorted array, say {1,2,3,4,5}, it's easy to use binary search.

If the array is circular, so we don't know where is the start, say {3,4,5,1,2} . Obviously linear search still works. But then how to do binary search? See here for an explanation. The following recursive idea is from this reference.

// find n in array.
function bsearch (array, n) {
If (array.size() == 1) {
if (n == array[0]) return FOUND.
else return NOT_FOUND.
} Else {
// Find middle point of the array.
// Now either left or right half of the array must be correctly sorted.
// Find which half it is *.
// (If both halves are sorted, then entire array is sorted)

// Check if n is in the correctly sorted half.
// If it is, run regular binary search on that half.
// Otherwise, recursively call this function on the improperly sorted half
}
}

* The idea is:

If the first element is less than the midpoint, the first half of the array is correctly sorted.
If the midpoint is less than the last element, the second half of the array is correctly sorted.
This follows from the fact that every element in the array is sorted with respect to the next element, except at the pivot.

The Maximal Rectangle Problem

Given a n*m matrix of 1 and 0, find 1) the max square submatrix of all 1s, 2) the max rectangle matrix of all 1s.

1) is easy, see here. The code here needs a little correction, see here.

2) is hard, see here.

Saturday, April 3, 2010

Young Tableau

A Young Tableau is a 2-D matrix in which each row is sorted left to right and each column is sorted top down. It has many nice properties. Young Tableau is discussed in the CLRS book.

The problem of searching for an item in the Young Tableau has 3 solutions here. The following are taken from this weblog.

In summary, to search a number x, the 3 strategies are:

1) A grid search: It is easy to see that element A[i][i] divides the matrix to 4 quadrants. if x < A[i][i], then we can remove the lower right sub-matrix, and search in the other 3 quadrants recursively. Cost is T(N)=3*T(N/4)+O(1), which is O(N^(log3/log4)), less than O(N).

2) Move along the matrix diagonal to find i, s.t. A[i][i] < k and A[i+1][i+1] > k. Now we have only 2 search intervals to search for. Cost is T(N)=2*T(N/4)+O(1), O(N). (? check again on this)

3) The most simple to implement strategy: "Start from the point in the last row first column. Every point to its right is greater than this point and every point on its top is smaller than this. So, if the point is greater than this point then move right otherwise move top. So, you will traverse at most 2*n points." Cost is O(N).

bool novice_search(int **grid,int size, int key,int &x,int &y)
{
int i=size-1,j=0;
while(0<=i && i<size && 0<=j && j<size) {
if(grid[i][j]==key) {
x=i;
y=j;
return true;
} else if(grid[i][j] <key) {
j++;
} else {
i--;
}
}
return false;
}


[4/8/2010]

Major properties of a Young Tableau:

- Has properties similar to both Heap and BST.
- Search: O(m+n)
- Insert: O(m+n)
- Find min: O(1)
- Remove min: O(m+n), this is to shift elements to become a Young Tableau again.
- Find the k-th smallest: O(k(m+n)), since k is in [1, n*m], this is O(n^3).
A research paper (2006) on this is here, the result was O((sqrt(m)*lg(m))*(n*lg(n))+ m*lg(n)).

Blog Archive

Followers