Tuesday, March 30, 2010

awk, uniq, tr and grep

Some links:

Remove duplicate lines with uniq
A user's guide for GNU AWK
AWK one-liners

AWK can process each line of a file and gives relevant output.

One problem is to find the number of occurrences of a word in a file. It's easy to output the lines that contain such a word, or count the number of such lines. But if a line contains multiple occurrences of a word, the solution is not so easy. One solution is to use the "tr" command to break the fields in a line to one word each line, and then employ common approach.

grep can be used to find a word in files. The following recursively find in current directory and sub-directories for files that contain "word":
grep -R "word" *

Saturday, March 13, 2010

Pi Day

Quite interesting:

http://en.wikipedia.org/wiki/Pi_Day

探索π值大事记

China Digital Science and Technology Museum.

Saturday, March 6, 2010

Wildcard character pattern matching

The problem is how to do string matching on a pattern including '*' (0-any characters) and '?' (0-1 character) wildcard characters.

Obviously when come into '*', you can ignore it, or eat the current input char, then recursively repeat this process. When come into '?', you can ignore it or eat the current input char. Therefore comes this backtracking recursive solution:

int match(char * s, char * p) {
printf("string:%s \t pattern:%s\n", s, p);

if (*s == 0) return *p == 0;
if (*p == 0) return 1; // or "return *s == 0;" if you don't want extra char in s.
if ('?' == *p) return match(s+1, p+1) || match(s, p+1);
if ('*' == *p) return match(s+1, p) || match(s, p+1);
if (*s == *p) return match(s+1, p+1);
return 0;
}

int DoMatch(char * s, char * p) {
if (s == NULL) return NULL == p;
if (p == NULL) return 1; // or NULL == s if don't want extra char in s.
return match(s, p);
}

Call DoMatch(string, pattern) to start the process. Note in the implementation you must use a+1 instead of ++a since the latter would change the value of a and causes problem for the next invocation of match call. Use of a++ is even worse since a won't be incremented at all until the match call finishes. These are simple issues but easy to ignore until you stumble into problem.

Since backtracking is used, this solution can be exponential in nature. This solution can be used when the input size is small.

This problem obviously is a simplification of regular expression string matching, such as used in lex, in Perl, Python, Java, C# etc. In a general regular expression we start with several basic building blocks (single symbol, concatenation of symbols, kleene closure (*, extensions are + and ?), alternation (|)) to build up the NFA for a regular expression. The next key step is to convert a NFA (Non-deterministic Finite Automata) to a DFA (Deterministic Finite Automata). Using DFA, the complexity is O(n) or linear, and therefore is the ideal solution. Check any computation theory textbook for the algorithm.

Card shuffling problem

An intuitive way of doing card shuffling:
starting with a deck of 52 cards,
randomly choose one from the 52 cards as the first card,
randomly choose one from the left 51 cards as the second card,
...
choose the last one left as the last card.

We can implement this using two arrays or two linked lists. But using array the delete operation is not in constant time, using linked list the access operation (choose the selected node) is not in constant time.

A neat solution is to use one array like this:
for (i := 1; i < 52; i ++) {
randomly choose a card c in the range [i, 52];
swap card i and card c;
}

Each card has the same probability (1/52) of being chosen for a position. This is uniform distribution. To see this, besides correlating this algorithm to the intuitive solution at the beginning of this article, you can also think this way: the first card is chosen with the probability 1/52, then second card is chosen with the probability 51/52 * 1/51 = 1/52, ... the i-th card is chosen with the probability 51/52 * 50/51 * 49/50 * ... 1/(52-i+1) = 1/52.

It's easy to see that the number of cards 52 can be replaced with a random n.

Blog Archive

Followers