Tuesday, September 4, 2012

A Chinese poem processor

A lot of language games can be done. I am also thinking of doing some Natural Language Processing stuff later.
The code below can be improved significantly in the data processing part. For example, on things such as 1) length of poem sentence, 2) position of target character, 3) tone (平仄), 4) semantic analysis so the mood matches.
More poem source files can be added.
The only tricky thing here so far is handling UTF8 characters.
#
# This script reads Chinese poems and store the sentences into a repository,
# then find poem sentences that start with letters in the given target sentence.
#
# This can be used for some language games, 
# such as forming a poem for somebody's birthday in the form of a "藏头诗".
#
# This script should be saved in utf8 format. 
#
# http://ahinea.com/en/tech/perl-unicode-struggle.html
# http://stackoverflow.com/questions/519309/how-do-i-read-utf-8-with-diamond-operator
# http://stackoverflow.com/questions/9574198/comparing-two-unicode-strings-with-perl
# 
# Chinese poems:
# http://www.shuku.net/novels/mulu/shici.html
#
# 藏头诗 generator: http://www.zhelizhao.com/cangtoushi/
#
# By: HomeTom
# Created on: 2012/09/04
#

require Encode;
use utf8;
use strict;

##################
# Change setting here.
##################

# Data source
my @files = ("tang300.txt", "song100.txt");
# Target sentence
my $target = ("小明生日快乐");

##################

my $cnt = 1;
my $len;
my @chars;
my @first = (); # first char.
#my $char;
my $DEBUG1 = 0;
my $DEBUG2 = 0;
#print "hi\n";

#binmode STDIN, ":utf8";
binmode STDOUT, ":utf8";

##################
# Read data.
##################

my @lines = ();
my @lines2;

foreach my $file (@files) {
  open FILE, $file or die $!;
  @lines2 = <FILE>;
  close FILE;
  push(@lines, @lines2);
}

##################
# Process and analyze data.
##################

foreach my $line (@lines) {
  chomp($line);
  $line = trim($line);
  $line = Encode::decode_utf8($line);
  @chars = split //, $line;

  if ($DEBUG1) { 
    print "$cnt: $line\n"; 
    print "$cnt: "; 
    print "$_." foreach (@chars); 
    print "\n"; 
  }

  #if ($chars[0] ne "") 
  {
    push (@first, $chars[0]);
  }

  $cnt ++;
}

if ($DEBUG2) {
  print "$_\n" foreach (@first);
}

$len = @first;


##################
# Search poem sentences for each word in the target sentence.
##################

my @words = split //, $target;

foreach my $w (@words) {
  print "==$w==\n";
  getLetterLines($w);
}


1;


##################
# Subroutines
##################

sub getLetterLines() {
  my $i;
  my $c;
  my $letter = shift;
  for ($i = 0; $i < $len; $i ++) {
    $c = $first[$i];
    if ($c eq $letter) { print "$i: $lines[$i]\n"; }
  }
  #print "\n";
}


#
# Trim functions, from: http://www.somacon.com/p114.php
#

# Perl trim function to remove whitespace from the start and end of the string
sub trim($)
{
  my $string = shift;
  $string =~ s/^\s+//;
  $string =~ s/\s+$//;
  return $string;
}
# Left trim function to remove leading whitespace
sub ltrim($)
{
  my $string = shift;
  $string =~ s/^\s+//;
  return $string;
}
# Right trim function to remove trailing whitespace
sub rtrim($)
{
  my $string = shift;
  $string =~ s/\s+$//;
  return $string;
}

No comments:

Blog Archive

Followers