Some links:
Remove duplicate lines with uniq
A user's guide for GNU AWK
AWK one-liners
AWK can process each line of a file and gives relevant output.
One problem is to find the number of occurrences of a word in a file. It's easy to output the lines that contain such a word, or count the number of such lines. But if a line contains multiple occurrences of a word, the solution is not so easy. One solution is to use the "tr" command to break the fields in a line to one word each line, and then employ common approach.
grep can be used to find a word in files. The following recursively find in current directory and sub-directories for files that contain "word":
grep -R "word" *
Tuesday, March 30, 2010
Saturday, March 13, 2010
Pi Day
Quite interesting:
http://en.wikipedia.org/wiki/Pi_Day
探索π值大事记
China Digital Science and Technology Museum.
http://en.wikipedia.org/wiki/Pi_Day
探索π值大事记
China Digital Science and Technology Museum.
Tuesday, March 9, 2010
Saturday, March 6, 2010
Wildcard character pattern matching
The problem is how to do string matching on a pattern including '*' (0-any characters) and '?' (0-1 character) wildcard characters.
Obviously when come into '*', you can ignore it, or eat the current input char, then recursively repeat this process. When come into '?', you can ignore it or eat the current input char. Therefore comes this backtracking recursive solution:
Call DoMatch(string, pattern) to start the process. Note in the implementation you must use a+1 instead of ++a since the latter would change the value of a and causes problem for the next invocation of match call. Use of a++ is even worse since a won't be incremented at all until the match call finishes. These are simple issues but easy to ignore until you stumble into problem.
Since backtracking is used, this solution can be exponential in nature. This solution can be used when the input size is small.
This problem obviously is a simplification of regular expression string matching, such as used in lex, in Perl, Python, Java, C# etc. In a general regular expression we start with several basic building blocks (single symbol, concatenation of symbols, kleene closure (*, extensions are + and ?), alternation (|)) to build up the NFA for a regular expression. The next key step is to convert a NFA (Non-deterministic Finite Automata) to a DFA (Deterministic Finite Automata). Using DFA, the complexity is O(n) or linear, and therefore is the ideal solution. Check any computation theory textbook for the algorithm.
Obviously when come into '*', you can ignore it, or eat the current input char, then recursively repeat this process. When come into '?', you can ignore it or eat the current input char. Therefore comes this backtracking recursive solution:
int match(char * s, char * p) {
printf("string:%s \t pattern:%s\n", s, p);
if (*s == 0) return *p == 0;
if (*p == 0) return 1; // or "return *s == 0;" if you don't want extra char in s.
if ('?' == *p) return match(s+1, p+1) || match(s, p+1);
if ('*' == *p) return match(s+1, p) || match(s, p+1);
if (*s == *p) return match(s+1, p+1);
return 0;
}
int DoMatch(char * s, char * p) {
if (s == NULL) return NULL == p;
if (p == NULL) return 1; // or NULL == s if don't want extra char in s.
return match(s, p);
}Call DoMatch(string, pattern) to start the process. Note in the implementation you must use a+1 instead of ++a since the latter would change the value of a and causes problem for the next invocation of match call. Use of a++ is even worse since a won't be incremented at all until the match call finishes. These are simple issues but easy to ignore until you stumble into problem.
Since backtracking is used, this solution can be exponential in nature. This solution can be used when the input size is small.
This problem obviously is a simplification of regular expression string matching, such as used in lex, in Perl, Python, Java, C# etc. In a general regular expression we start with several basic building blocks (single symbol, concatenation of symbols, kleene closure (*, extensions are + and ?), alternation (|)) to build up the NFA for a regular expression. The next key step is to convert a NFA (Non-deterministic Finite Automata) to a DFA (Deterministic Finite Automata). Using DFA, the complexity is O(n) or linear, and therefore is the ideal solution. Check any computation theory textbook for the algorithm.
Subscribe to:
Posts (Atom)