Saturday, July 19, 2014

Write a crawler in Perl

I am now improving a web crawler in Perl, based on a script I wrote several years ago. It can be useful when you need to download a bundle of files from a website.

The crawl and storage parts are implemented now. Some general considerations are listed below. These considerations, of course, are independent of the implementation language.

- Crawl
  - keep track of crawled links, in a queue or hash
  - get headers: status code (200, 404, etc.), Content-Length, Content-Type, modification time (Last-Modified)
  - broken links
  - dynamic page
  - special characters in the URL, if local folders are created from the URL path
  - relative URLs: handled by the relevant library
  - wait interval between requests
  - robots.txt [11]
  - mime types
  - referrer
  - html parsing
  - header 302 redirect, and redirect level
  - javascript submit/redirect
  - non-standard tags
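
The first item, keeping track of crawled links in a queue and a hash, can be sketched as below. This is only a minimal illustration, not my actual script: `crawl`, `%fake_web` and the `$get_links` callback are names made up for this example, and the in-memory table stands in for real fetching and HTML parsing, so the sketch runs without any network access.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the network: page URL => links found on that page.
# A real crawler would fetch and parse each page instead.
my %fake_web = (
    'http://example.com/'  => ['http://example.com/a', 'http://example.com/b'],
    'http://example.com/a' => ['http://example.com/b', 'http://example.com/'],
    'http://example.com/b' => ['http://example.com/c.pdf'],
);

# Breadth-first crawl: the queue holds URLs still to visit, the hash
# records every URL already queued so that no link is visited twice.
sub crawl {
    my ($start, $get_links, $max_pages) = @_;
    my @queue = ($start);
    my %seen  = ($start => 1);
    my @visited;

    while (@queue and @visited < $max_pages) {
        my $url = shift @queue;
        push @visited, $url;
        for my $link (@{ $get_links->($url) }) {
            next if $seen{$link}++;    # already queued or visited
            push @queue, $link;
        }
        # A real crawler would sleep here to wait between requests.
    }
    return @visited;
}

my @order = crawl('http://example.com/', sub { $fake_web{ $_[0] } || [] }, 100);
print "$_\n" for @order;
```

The `$max_pages` limit is one simple way to keep a crawl bounded. Replacing the callback with real fetching code leaves the queue and hash logic unchanged.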

- Storage
  - mirror the website's directory structure, or store flat (need to resolve file name conflicts)
  - store progress in a file or database: link_queue, non_link_queue, current pointer in link_queue.
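
Storing progress in a file could look roughly like the sketch below, which uses the core Storable module. The hash keys follow the names in the list above (link_queue, non_link_queue, current pointer). The sub names `save_progress` and `load_progress` and the file name are assumptions made for this example.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);

# Write the crawl state to disk. Storing to a temporary file and renaming it
# means an interrupted write cannot leave a half-written progress file.
sub save_progress {
    my ($file, $state) = @_;
    store($state, "$file.tmp") or die "Cannot write $file.tmp: $!";
    rename "$file.tmp", $file or die "Cannot rename $file.tmp: $!";
    return;
}

# Load a saved state, or start with empty queues on the first run.
sub load_progress {
    my ($file) = @_;
    return retrieve($file) if -e $file;
    return { link_queue => [], non_link_queue => [], current => 0 };
}

# Demo in a temporary directory; a real crawler would use a fixed path.
my $file  = tempdir(CLEANUP => 1) . '/progress.sto';
my $state = load_progress($file);
push @{ $state->{link_queue} }, 'http://example.com/', 'http://example.com/a';
push @{ $state->{non_link_queue} }, 'http://example.com/c.pdf';
$state->{current} = 1;    # index of the next link_queue entry to crawl
save_progress($file, $state);

my $resumed = load_progress($file);
print "resume at: ", $resumed->{link_queue}[ $resumed->{current} ], "\n";
```

A database would do the same job for larger crawls. A single Storable file is simply the smallest thing that survives a restart.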

- Data Analysis and mining
  - text mining
  - reverse index
  - NLP etc.
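
As an illustration of the reverse index item (more commonly called an inverted index), here is a minimal sketch that maps each word to the documents containing it. The tokenizer is deliberately crude, and `build_index`, `lookup` and the sample page names are invented for this example.

```perl
use strict;
use warnings;

# Build an inverted index: word => { document id => number of occurrences }.
sub build_index {
    my (%docs) = @_;
    my %index;
    for my $id (keys %docs) {
        # Lowercase the text and split on non-word characters.
        for my $word (grep { length } split /\W+/, lc $docs{$id}) {
            $index{$word}{$id}++;
        }
    }
    return \%index;
}

# Return the sorted ids of all documents that contain the word.
sub lookup {
    my ($index, $word) = @_;
    my @ids = sort keys %{ $index->{ lc $word } || {} };
    return @ids;
}

my $index = build_index(
    'page1.html' => 'Perl web crawler',
    'page2.html' => 'A crawler written in Perl/Tk',
);
print join(', ', lookup($index, 'crawler')), "\n";
```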

- Rank Analysis
  - web link graph
  - page rank
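
For the rank analysis part, page rank can be computed from the web link graph by power iteration. The sketch below is a simplified textbook version, not a tuned implementation. It assumes every link target also appears as a key of the graph, and the sub name `page_rank`, the damping factor 0.85 and the tiny sample graph are choices made for this example.

```perl
use strict;
use warnings;

# PageRank by power iteration on a link graph given as page => [outgoing links].
# Assumes every link target also appears as a key of the graph.
sub page_rank {
    my ($graph, $damping, $iterations) = @_;
    my @pages = sort keys %$graph;
    my $n     = @pages;
    my %rank  = map { $_ => 1 / $n } @pages;

    for (1 .. $iterations) {
        # Rank held by pages without outgoing links is spread over all pages.
        my $dangling = 0;
        $dangling += $rank{$_} for grep { !@{ $graph->{$_} } } @pages;

        my %next = map { $_ => (1 - $damping) / $n + $damping * $dangling / $n } @pages;
        for my $page (@pages) {
            my @out = @{ $graph->{$page} };
            next unless @out;
            # Each page shares its rank equally among its outgoing links.
            $next{$_} += $damping * $rank{$page} / scalar(@out) for @out;
        }
        %rank = %next;
    }
    return \%rank;
}

my $rank = page_rank({ a => ['b'], b => ['a'], c => ['a'] }, 0.85, 50);
printf "%s: %.4f\n", $_, $rank->{$_} for sort keys %$rank;
```

In this sample graph page a is linked from both b and c, so it ends up with the highest rank, while c, which nothing links to, gets the lowest.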


== Compile Perl to executable ==

PAR seems to be a good choice [1][2].


== Other Perl Crawlers ==

[5] is a good introduction to the modules used to write a crawler in Perl.
[4] is a simple one. [6] seems more involved.

[10] is a good introduction to the general principles of web crawlers.


== GUI with Perl/Tk ==

With Tk it's easy to make an event-driven GUI [7][8][9].

I tried to install Tk from the CPAN shell:

sudo perl -MCPAN -e shell
> install Tk

But a test failure prevents the installation from finishing:

t/wm-tcl.t ................... 119/315 
#   Failed test 'attempting to resize a gridded toplevel to a value bigger'
#   at t/wm-tcl.t line 1153.
#          got: '4'
#     expected: '6'

#   Failed test at t/wm-tcl.t line 1155.
#          got: '4'
#     expected: '5'
t/wm-tcl.t ................... 312/315 # Looks like you failed 2 tests of 315.
t/wm-tcl.t ................... Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/315 subtests 
(less 43 skipped subtests: 270 okay)
(31 TODO tests unexpectedly succeeded)

Test Summary Report
-------------------
t/listbox.t                (Wstat: 0 Tests: 537 Failed: 0)
  TODO passed:   320, 322, 328, 502
t/text.t                   (Wstat: 0 Tests: 415 Failed: 0)
  TODO passed:   121
t/wm-tcl.t                 (Wstat: 512 Tests: 315 Failed: 2)
  Failed tests:  160-161
  TODO passed:   64, 86-87, 154-157, 164-165, 171-176, 221-224
                237-239, 264-265, 275-276, 280-283, 300
  Non-zero exit status: 2
t/zzScrolled.t             (Wstat: 0 Tests: 94 Failed: 0)
  TODO passed:   52, 66, 80, 94
Files=74, Tests=4348, 55 wallclock secs ( 0.84 usr  0.25 sys + 14.66 cusr  1.57 csys = 17.32 CPU)
Result: FAIL
Failed 1/74 test programs. 2/4348 subtests failed.
make: *** [test_dynamic] Error 255
  SREZIC/Tk-804.032.tar.gz
  /usr/bin/make test -- NOT OK
//hint// to see the cpan-testers results for installing this module, try:
  reports SREZIC/Tk-804.032.tar.gz
Running make install
  make test had returned bad status, won't install without force
Failed during this command:
 SREZIC/Tk-804.032.tar.gz                     : make_test NO

Only 2 of the 4348 subtests failed, both on a resize issue in t/wm-tcl.t, so it shouldn't be serious. I used the force option to install, and it worked:

sudo cpan -fi Tk


References:

[1] Create self-contained Perl executables, Part II
[2] PAR: Perl Archiving Toolkit
[3] Compiling or packaging an executable from perl code on windows

[4] Web scraping with modern perl
[5] Web crawling with Perl
[6] spider.pl - Example Perl program to spider web servers

[7] Tk::UserGuide
[8] Learning Perl/Tk: Graphical User Interfaces with Perl
[9] Book: Mastering Perl/Tk

[10] Wiki: web crawler
[11] Robots Exclusion Standard

