The crawl and storage parts are in place now. Some general considerations are listed below. They are, of course, independent of the implementation language.
- Crawl
  - keep track of crawled links, in a queue or hash
  - get headers: status code (200, 404 etc.), content size (Content-Length), content type, modification time
  - broken links
  - dynamic pages
  - special chars in the URL, if local folders are created from the URL path
  - relative URLs - handled by the relevant lib
  - wait interval between requests
  - robots.txt [11]
  - MIME types
  - referrer
  - HTML parsing
  - header 302 redirect, and redirect level
  - JavaScript submit/redirect
  - non-standard tags
- Storage
  - use the web structure, or flat (need to resolve file name conflicts)
  - store progress in a file or database: link_queue, non_link_queue, current pointer in link_queue
- Data analysis and mining
  - text mining
  - reverse (inverted) index
  - NLP etc.
- Rank analysis
  - web link graph
  - page rank
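Several of the crawl items above (the queue of links, the hash of links already seen, relative URL resolution, the wait interval, broken links, MIME types) fit in one small loop. Since these considerations are language-independent, the sketch below uses Python with only the standard library; it is an illustration under stated assumptions, not the crawler described in this post. The names `crawl`, `normalize`, `fetch`, `fake_fetch` and `SITE`, as well as the example.com URLs, are hypothetical. A real `fetch` would issue HTTP requests and should also honor robots.txt and redirects, which are left out here.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
import time


class LinkParser(HTMLParser):
    """Collect the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def normalize(base, href):
    """Resolve a relative link against its page and drop the #fragment."""
    absolute, _fragment = urldefrag(urljoin(base, href))
    return absolute


def crawl(start, fetch, delay=0.0, max_pages=100):
    """Breadth-first crawl restricted to the start URL's host.

    fetch(url) is a hypothetical callable returning
    (status, content_type, body). Returns (visited, broken).
    """
    host = urlparse(start).netloc
    queue = deque([start])   # links waiting to be fetched
    seen = {start}           # every link ever queued, so nothing is fetched twice
    visited, broken = [], []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        status, content_type, body = fetch(url)
        time.sleep(delay)    # wait interval between requests
        visited.append(url)
        if status != 200:
            broken.append(url)   # record broken links
            continue
        if not content_type.startswith("text/html"):
            continue             # only HTML pages are parsed for links
        parser = LinkParser()
        parser.feed(body)
        for href in parser.links:
            link = normalize(url, href)
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited, broken


# A fake three-page site stands in for the network in this sketch.
SITE = {
    "http://example.com/": (
        200, "text/html", '<a href="a.html">A</a> <a href="/b.html#top">B</a>'),
    "http://example.com/a.html": (
        200, "text/html", '<a href="/">home</a> <a href="missing.html">x</a>'),
    "http://example.com/b.html": (200, "text/plain", "no links here"),
}


def fake_fetch(url):
    """Return the canned response for url, or a 404 if it is unknown."""
    return SITE.get(url, (404, "text/html", ""))


if __name__ == "__main__":
    visited, broken = crawl("http://example.com/", fake_fetch)
    print(visited)
    print(broken)
```

Passing `fetch` in as a parameter keeps the queue and seen-set logic separate from the HTTP details, so the same loop can be exercised without network access. To store progress as described under Storage, the `queue` and `seen` structures are what would need to be written to a file or database.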
== Compile Perl to executable ==
PAR seems to be a good choice [1][2].
== Other Perl Crawlers ==
[5] is a good introduction to the modules used to write a crawler in Perl.
[4] is a simple example. [6] seems more involved.
[10] is a good introduction to the general principles of web crawlers.
== GUI with Perl/Tk ==
With Tk it's easy to build an event-driven GUI [7][8][9].
I tried to install Tk from the CPAN shell:
sudo perl -MCPAN -e shell
> install Tk
But an error prevents the installation from finishing:
t/wm-tcl.t ................... 119/315
# Failed test 'attempting to resize a gridded toplevel to a value bigger'
# at t/wm-tcl.t line 1153.
# got: '4'
# expected: '6'
# Failed test at t/wm-tcl.t line 1155.
# got: '4'
# expected: '5'
t/wm-tcl.t ................... 312/315 # Looks like you failed 2 tests of 315.
t/wm-tcl.t ................... Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/315 subtests
(less 43 skipped subtests: 270 okay)
(31 TODO tests unexpectedly succeeded)
Test Summary Report
-------------------
t/listbox.t (Wstat: 0 Tests: 537 Failed: 0)
TODO passed: 320, 322, 328, 502
t/text.t (Wstat: 0 Tests: 415 Failed: 0)
TODO passed: 121
t/wm-tcl.t (Wstat: 512 Tests: 315 Failed: 2)
Failed tests: 160-161
TODO passed: 64, 86-87, 154-157, 164-165, 171-176, 221-224
237-239, 264-265, 275-276, 280-283, 300
Non-zero exit status: 2
t/zzScrolled.t (Wstat: 0 Tests: 94 Failed: 0)
TODO passed: 52, 66, 80, 94
Files=74, Tests=4348, 55 wallclock secs ( 0.84 usr 0.25 sys + 14.66 cusr 1.57 csys = 17.32 CPU)
Result: FAIL
Failed 1/74 test programs. 2/4348 subtests failed.
make: *** [test_dynamic] Error 255
SREZIC/Tk-804.032.tar.gz
/usr/bin/make test -- NOT OK
//hint// to see the cpan-testers results for installing this module, try:
reports SREZIC/Tk-804.032.tar.gz
Running make install
make test had returned bad status, won't install without force
Failed during this command:
SREZIC/Tk-804.032.tar.gz : make_test NO
Only 2 of the 4348 subtests failed, both on a window resize issue, so it shouldn't be serious. Forcing the install worked:
sudo cpan -fi Tk
References:
[1] Create self-contained Perl executables, Part II
[2] PAR: Perl Archiving Toolkit
[3] Compiling or packaging an executable from perl code on windows
[4] Web scraping with modern perl
[5] Web crawling with Perl
[6] spider.pl - Example Perl program to spider web servers
[7] Tk::UserGuide
[8] Learning Perl/Tk: Graphical User Interfaces with Perl
[9] Book: Mastering Perl/Tk
[10] Wiki: web crawler
[11] Robots Exclusion Standard