Tuesday, March 30, 2010

awk, uniq, tr and grep

Some links:

Remove duplicate lines with uniq
A user's guide for GNU AWK
AWK one-liners

AWK can process each line of a file and gives relevant output.

One problem is to find the number of occurrences of a word in a file. It's easy to output the lines that contain such a word, or count the number of such lines. But if a line contains multiple occurrences of a word, the solution is not so easy. One solution is to use the "tr" command to break the fields in a line to one word each line, and then employ common approach.

grep can be used to find a word in files. The following recursively find in current directory and sub-directories for files that contain "word":
grep -R "word" *

Saturday, March 13, 2010

Pi Day

Quite interesting:

http://en.wikipedia.org/wiki/Pi_Day

探索π值大事记

China Digital Science and Technology Museum.

Saturday, March 6, 2010

Wildcard character pattern matching

The problem is how to do string matching on a pattern including '*' (0-any characters) and '?' (0-1 character) wildcard characters.

Obviously when come into '*', you can ignore it, or eat the current input char, then recursively repeat this process. When come into '?', you can ignore it or eat the current input char. Therefore comes this backtracking recursive solution:

int match(char * s, char * p) {
printf("string:%s \t pattern:%s\n", s, p);

if (*s == 0) return *p == 0;
if (*p == 0) return 1; // or "return *s == 0;" if you don't want extra char in s.
if ('?' == *p) return match(s+1, p+1) || match(s, p+1);
if ('*' == *p) return match(s+1, p) || match(s, p+1);
if (*s == *p) return match(s+1, p+1);
return 0;
}

int DoMatch(char * s, char * p) {
if (s == NULL) return NULL == p;
if (p == NULL) return 1; // or NULL == s if don't want extra char in s.
return match(s, p);
}

Call DoMatch(string, pattern) to start the process. Note in the implementation you must use a+1 instead of ++a since the latter would change the value of a and causes problem for the next invocation of match call. Use of a++ is even worse since a won't be incremented at all until the match call finishes. These are simple issues but easy to ignore until you stumble into problem.

Since backtracking is used, this solution can be exponential in nature. This solution can be used when the input size is small.

This problem obviously is a simplification of regular expression string matching, such as used in lex, in Perl, Python, Java, C# etc. In a general regular expression we start with several basic building blocks (single symbol, concatenation of symbols, kleene closure (*, extensions are + and ?), alternation (|)) to build up the NFA for a regular expression. The next key step is to convert a NFA (Non-deterministic Finite Automata) to a DFA (Deterministic Finite Automata). Using DFA, the complexity is O(n) or linear, and therefore is the ideal solution. Check any computation theory textbook for the algorithm.

Blog Archive

Followers