2015/06/17

Head-to-Head Web Scraping with Perl: Mojo::DOM vs Web::Query

At the last Purdue Perl Mongers meeting, Joe Kline mentioned Sawyer X's YAPC::NA talk on Modern Web Scraping, which covered Web::Query and its CSS selectors, as compared to the XPath selectors he uses for his own web scraping.

I had just written and posted code where I used Mojo::DOM to scrape YouTube, so I decided to do a head-to-head parse of the same corpus.

I found that, except for wq($file) versus Mojo::DOM->new($file), the code is identical.

Seriously, the only other difference is a small string in the output that says whether it's using Web::Query or Mojo::DOM.

When run, Mojo::DOM is a little bit faster, though.



#!/usr/bin/env perl
use feature qw{ say state unicode_eval unicode_strings } ;
use strict ;
use warnings ;
use utf8 ;
use Data::Dumper ;
use Mojo::DOM ;
use Web::Query ;

my $base = 'https://www.youtube.com' ;
my $file = join '', (<DATA>) ;    # the saved page's HTML, not a filename
$file =~ s/\p{FORMAT}//g ; # strip Unicode format characters - http://www.perlmonks.org/?node_id=1020973

# Web::Query: print a numbered "title : link" line for each video item
wq($file)->find('.channels-content-item')->each(
    sub {
        state $c = 1 ;
        my $e = $_ ;
        my $content = $e->find('.yt-lockup-content')->first ;
        my $anchor = $content->find('a')->first ;
        my $title = $anchor->text ;
        my $link = $base . $anchor->attr('href') ;
        say join ' : ', ( sprintf '%02d', $c++ ), 'wq', $title, $link ;
    }
) ;

# Mojo::DOM: same selectors and same method calls, only the label differs
Mojo::DOM->new($file)->find('.channels-content-item')->each(
    sub {
        state $c = 1 ;
        my $e = $_ ;
        my $content = $e->find('.yt-lockup-content')->first ;
        my $anchor = $content->find('a')->first ;
        my $title = $anchor->text ;
        my $link = $base . $anchor->attr('href') ;
        say join ' : ', ( sprintf '%02d', $c++ ), 'md', $title, $link ;
    }
) ;
exit ;
__DATA__
... not appropriate to include several thousands of lines of HTML here...
01 : wq : Lightning Talks, Phil Windley, and YAPC 2015 Closing : https://www.youtube.com/watch?v=3xLMG9ELcPI
02 : wq : John McDonald HPCI manage cluster cloud computing : https://www.youtube.com/watch?v=fb7XZj__Pqg
03 : wq : bulk 88 Writing XS in plain C : https://www.youtube.com/watch?v=Iu6RV2wKQwo
04 : wq : Brian Gottreau If you can't remember history rewrite it so you can : https://www.youtube.com/watch?v=6ByzqrG2Nsc
05 : wq : Brad Lhotsky Lessons from High Velocity Logging : https://www.youtube.com/watch?v=6gXxBgGEv_I
06 : wq : Andrew Grangaard Effective Git : https://www.youtube.com/watch?v=oS-mMKnAAL0
07 : wq : Ivan Kohler How Perl helped us make a million dollars : https://www.youtube.com/watch?v=D9fzN18F8iQ
08 : wq : Walt Mankowski Making movies for fun and science : https://www.youtube.com/watch?v=xf2UHZu9NJA
09 : wq : Shawn Moore Lifting Moose : https://www.youtube.com/watch?v=w9HHHNVrmOs
10 : wq : Jason McIntosh The True Story of Plerd : https://www.youtube.com/watch?v=5X4VaeoCSe8
11 : wq : Dana Jacobsen BigNums When 64 bits just is not enough : https://www.youtube.com/watch?v=Dhl4_Chvm_g
12 : wq : Joseph Hall and A Series of Unfortunate Requests : https://www.youtube.com/watch?v=wbaH_jxcA7g
13 : wq : Neil Mansilla Building Smarter Microservices with Scale Oriented Architecture : https://www.youtube.com/watch?v=USXSnfilG4g
14 : wq : Jonathan Taylor Moose in Production A Two year Retrospective : https://www.youtube.com/watch?v=tD1oRoaVn2M
15 : wq : David Golden Juggling Chainsaws Perl and MongoDB : https://www.youtube.com/watch?v=Nf3e6cPU9B0
16 : wq : Michael Conrad DeLorean Digital Dashboard : https://www.youtube.com/watch?v=SERH3_gZOTo
17 : wq : Graham Ollis Practical FFI with Platypus : https://www.youtube.com/watch?v=XjvpxfVJLNg
18 : wq : Ricardo Signes (rjbs) - Perl 5.22 and You : https://www.youtube.com/watch?v=I8VVtqVh9y0
19 : wq : Rafael Almeria - Live Perl : https://www.youtube.com/watch?v=nZHWVAPm9IA
20 : wq : Daisuke Maki -YAPC::Asian Tokyo Behind The Scenes : How We Organize A Conference for 2000 Attendees : https://www.youtube.com/watch?v=VcwsR1yVuII
21 : wq : John Whitney - Perl via Paper Ink Metal and Oil : https://www.youtube.com/watch?v=INSn6cYK19U
22 : wq : Stevan Little (stevan) - Perl's Syntactic Legacy: Using the future to improve the past : https://www.youtube.com/watch?v=sJC725e8ysM
23 : wq : Joe Kline (gizmo) - My Ordnung : https://www.youtube.com/watch?v=vBiKxw1JMZM
24 : wq : Tim Bunce - Life: Enhancing your frame of reference : https://www.youtube.com/watch?v=Y24QnadqqJ4
25 : wq : VM Brasseur (vmbrasseur) - Failure: Why it happens & How to benefit from it : https://www.youtube.com/watch?v=DLn4fZsZsKM
26 : wq : Nick Patch (patch) - Hello, my name is _______. : https://www.youtube.com/watch?v=SKbqCB2NPXw
27 : wq : Andrew Hewus Fresh (AFresh1) - Perl in OpenBSD : https://www.youtube.com/watch?v=GwrnOpYXimE
28 : wq : D Ruth Bavousett (druthb) - Scrum for One : https://www.youtube.com/watch?v=Zh7dXvQY-hg
29 : wq : Q&A With Larry Wall : https://www.youtube.com/watch?v=PK9UnAmrxsA
30 : wq : Seth Johnson - Keynote: Seth Johnson - What Perl Taught Me About Life : https://www.youtube.com/watch?v=afaKtWp0JKM
31 : wq : Curtis Poe (Ovid) - Perl 6 for Mere Mortals : https://www.youtube.com/watch?v=S0OGsFmPW2M
32 : wq : Florian Ragwitz (rafl) - Ansible for Programmers : https://www.youtube.com/watch?v=x3ZbYQSGkBY
33 : wq : Bruce Gray (Util) - Stop Panicking! Perl 6 is just like Perl 5 (where it counts). : https://www.youtube.com/watch?v=KSWp9B-s-Sg
34 : wq : Steven Lembark - Mongering in a Box: Building Perl application containers with Dockers : https://www.youtube.com/watch?v=NuRClr-xREc
35 : wq : DrForr - Everything Old is New Again: Quaternion in Perl6 : https://www.youtube.com/watch?v=fKksZBUDMEo
36 : wq : Jordan Adler (jmadler) Mobile Apps... in Perl?! : https://www.youtube.com/watch?v=7mRHapWZ-AI
37 : wq : Logan Bell - Give Catalyst Some Swag : https://www.youtube.com/watch?v=mHmdrgnMCps
38 : wq : Logan Bell - Perl to Go : https://www.youtube.com/watch?v=y573MDoLraY
39 : wq : Henry Van Styn (vanstyn) - RapidApp by example - database web apps on steroids : https://www.youtube.com/watch?v=9HMHD1u9uc4
40 : wq : James E Keenan (kid51) - A Simple Development Tool for Refactoring & Benchmarking : https://www.youtube.com/watch?v=vSNdp1QkCyE
41 : wq : WHATEVER YOU DO DON'T VIEW THIS : https://www.youtube.com/watch?v=-AJo_RVDoF0
42 : wq : Mark Prather (Trg404) - From bartending to nerdtending : https://www.youtube.com/watch?v=uvETUUMZo9E
43 : wq : William Stevenson (wds) - Dude, where's my data analyst? A quick guide to machine learning : https://www.youtube.com/watch?v=p53qpU78LxI
44 : wq : Chad Granum (Exodist) - Perl Testing, whats new with Test:: More and beyond : https://www.youtube.com/watch?v=uFzr6wu5Pq4
45 : wq : Sawyer X - Modern web scraping : https://www.youtube.com/watch?v=wcXmCMGwZQo
46 : wq : Joel Berger (jberger) - Test Your App's Javascript using Test:: Mojo::Role::Phantom : https://www.youtube.com/watch?v=CKbzBNz4Ksg
47 : wq : Sean Quinlan (spq_easy) - Leave the system alone! : https://www.youtube.com/watch?v=mph-9hqJQ98
48 : wq : Upasana Shukla (upsasana) How to Bring Newbies to Perl : https://www.youtube.com/watch?v=yewFM9XEmlQ
49 : wq : Matt S. Trout (mst) Build management with a dash of prolog : https://www.youtube.com/watch?v=C2RJfykfVcM
50 : wq : Prairie Nyx - CoderDojo and Perl Evangelism : https://www.youtube.com/watch?v=kkD4pCRvwK4
51 : wq : Karen Pauley - Working with Volunteers: Learning from My Mistakes : https://www.youtube.com/watch?v=ek4fmzyXGwM
52 : wq : Stephen Scaffidi (hercynium) - In the desert without a camel : https://www.youtube.com/watch?v=OK1ZY_bw660
53 : wq : R Geoffrey Avery (eGeoffrey) Lightning Talks Day 1 : https://www.youtube.com/watch?v=mQVUvAz3zhQ
54 : wq : Welcome to YAPC & States of the Velociraptors : The Perl5 community lightning talks : https://www.youtube.com/watch?v=88K1h1XhEeo
55 : wq : YAPC::NA::2014 Highlights : https://www.youtube.com/watch?v=GLqtHab06dM
56 : wq : Matt S Trout (mst) - Devops Logique : https://www.youtube.com/watch?v=RQwY28DItLI
57 : wq : John Anderson (genehack) - Yet Another Keynote Speech : https://www.youtube.com/watch?v=MU6IFUZZBuQ
58 : wq : Sawyer X - The Joy in What We Do : https://www.youtube.com/watch?v=CjOQZf0Ad74
59 : wq : R Geoffrey Avery (rGeoffrey) - Lightning Talks Day 3 : https://www.youtube.com/watch?v=m-6o2dBc1qE
60 : wq : Peter Martini - Sub Signatures: Next Steps : https://www.youtube.com/watch?v=ot5yOrMJogA
01 : md : Lightning Talks, Phil Windley, and YAPC 2015 Closing : https://www.youtube.com/watch?v=3xLMG9ELcPI
02 : md : John McDonald HPCI manage cluster cloud computing : https://www.youtube.com/watch?v=fb7XZj__Pqg
03 : md : bulk 88 Writing XS in plain C : https://www.youtube.com/watch?v=Iu6RV2wKQwo
04 : md : Brian Gottreau If you can't remember history rewrite it so you can : https://www.youtube.com/watch?v=6ByzqrG2Nsc
05 : md : Brad Lhotsky Lessons from High Velocity Logging : https://www.youtube.com/watch?v=6gXxBgGEv_I
06 : md : Andrew Grangaard Effective Git : https://www.youtube.com/watch?v=oS-mMKnAAL0
07 : md : Ivan Kohler How Perl helped us make a million dollars : https://www.youtube.com/watch?v=D9fzN18F8iQ
08 : md : Walt Mankowski Making movies for fun and science : https://www.youtube.com/watch?v=xf2UHZu9NJA
09 : md : Shawn Moore Lifting Moose : https://www.youtube.com/watch?v=w9HHHNVrmOs
10 : md : Jason McIntosh The True Story of Plerd : https://www.youtube.com/watch?v=5X4VaeoCSe8
11 : md : Dana Jacobsen BigNums When 64 bits just is not enough : https://www.youtube.com/watch?v=Dhl4_Chvm_g
12 : md : Joseph Hall and A Series of Unfortunate Requests : https://www.youtube.com/watch?v=wbaH_jxcA7g
13 : md : Neil Mansilla Building Smarter Microservices with Scale Oriented Architecture : https://www.youtube.com/watch?v=USXSnfilG4g
14 : md : Jonathan Taylor Moose in Production A Two year Retrospective : https://www.youtube.com/watch?v=tD1oRoaVn2M
15 : md : David Golden Juggling Chainsaws Perl and MongoDB : https://www.youtube.com/watch?v=Nf3e6cPU9B0
16 : md : Michael Conrad DeLorean Digital Dashboard : https://www.youtube.com/watch?v=SERH3_gZOTo
17 : md : Graham Ollis Practical FFI with Platypus : https://www.youtube.com/watch?v=XjvpxfVJLNg
18 : md : Ricardo Signes (rjbs) - Perl 5.22 and You : https://www.youtube.com/watch?v=I8VVtqVh9y0
19 : md : Rafael Almeria - Live Perl : https://www.youtube.com/watch?v=nZHWVAPm9IA
20 : md : Daisuke Maki -YAPC::Asian Tokyo Behind The Scenes : How We Organize A Conference for 2000 Attendees : https://www.youtube.com/watch?v=VcwsR1yVuII
21 : md : John Whitney - Perl via Paper Ink Metal and Oil : https://www.youtube.com/watch?v=INSn6cYK19U
22 : md : Stevan Little (stevan) - Perl's Syntactic Legacy: Using the future to improve the past : https://www.youtube.com/watch?v=sJC725e8ysM
23 : md : Joe Kline (gizmo) - My Ordnung : https://www.youtube.com/watch?v=vBiKxw1JMZM
24 : md : Tim Bunce - Life: Enhancing your frame of reference : https://www.youtube.com/watch?v=Y24QnadqqJ4
25 : md : VM Brasseur (vmbrasseur) - Failure: Why it happens & How to benefit from it : https://www.youtube.com/watch?v=DLn4fZsZsKM
26 : md : Nick Patch (patch) - Hello, my name is _______. : https://www.youtube.com/watch?v=SKbqCB2NPXw
27 : md : Andrew Hewus Fresh (AFresh1) - Perl in OpenBSD : https://www.youtube.com/watch?v=GwrnOpYXimE
28 : md : D Ruth Bavousett (druthb) - Scrum for One : https://www.youtube.com/watch?v=Zh7dXvQY-hg
29 : md : Q&A With Larry Wall : https://www.youtube.com/watch?v=PK9UnAmrxsA
30 : md : Seth Johnson - Keynote: Seth Johnson - What Perl Taught Me About Life : https://www.youtube.com/watch?v=afaKtWp0JKM
31 : md : Curtis Poe (Ovid) - Perl 6 for Mere Mortals : https://www.youtube.com/watch?v=S0OGsFmPW2M
32 : md : Florian Ragwitz (rafl) - Ansible for Programmers : https://www.youtube.com/watch?v=x3ZbYQSGkBY
33 : md : Bruce Gray (Util) - Stop Panicking! Perl 6 is just like Perl 5 (where it counts). : https://www.youtube.com/watch?v=KSWp9B-s-Sg
34 : md : Steven Lembark - Mongering in a Box: Building Perl application containers with Dockers : https://www.youtube.com/watch?v=NuRClr-xREc
35 : md : DrForr - Everything Old is New Again: Quaternion in Perl6 : https://www.youtube.com/watch?v=fKksZBUDMEo
36 : md : Jordan Adler (jmadler) Mobile Apps... in Perl?! : https://www.youtube.com/watch?v=7mRHapWZ-AI
37 : md : Logan Bell - Give Catalyst Some Swag : https://www.youtube.com/watch?v=mHmdrgnMCps
38 : md : Logan Bell - Perl to Go : https://www.youtube.com/watch?v=y573MDoLraY
39 : md : Henry Van Styn (vanstyn) - RapidApp by example - database web apps on steroids : https://www.youtube.com/watch?v=9HMHD1u9uc4
40 : md : James E Keenan (kid51) - A Simple Development Tool for Refactoring & Benchmarking : https://www.youtube.com/watch?v=vSNdp1QkCyE
41 : md : WHATEVER YOU DO DON'T VIEW THIS : https://www.youtube.com/watch?v=-AJo_RVDoF0
42 : md : Mark Prather (Trg404) - From bartending to nerdtending : https://www.youtube.com/watch?v=uvETUUMZo9E
43 : md : William Stevenson (wds) - Dude, where's my data analyst? A quick guide to machine learning : https://www.youtube.com/watch?v=p53qpU78LxI
44 : md : Chad Granum (Exodist) - Perl Testing, whats new with Test:: More and beyond : https://www.youtube.com/watch?v=uFzr6wu5Pq4
45 : md : Sawyer X - Modern web scraping : https://www.youtube.com/watch?v=wcXmCMGwZQo
46 : md : Joel Berger (jberger) - Test Your App's Javascript using Test:: Mojo::Role::Phantom : https://www.youtube.com/watch?v=CKbzBNz4Ksg
47 : md : Sean Quinlan (spq_easy) - Leave the system alone! : https://www.youtube.com/watch?v=mph-9hqJQ98
48 : md : Upasana Shukla (upsasana) How to Bring Newbies to Perl : https://www.youtube.com/watch?v=yewFM9XEmlQ
49 : md : Matt S. Trout (mst) Build management with a dash of prolog : https://www.youtube.com/watch?v=C2RJfykfVcM
50 : md : Prairie Nyx - CoderDojo and Perl Evangelism : https://www.youtube.com/watch?v=kkD4pCRvwK4
51 : md : Karen Pauley - Working with Volunteers: Learning from My Mistakes : https://www.youtube.com/watch?v=ek4fmzyXGwM
52 : md : Stephen Scaffidi (hercynium) - In the desert without a camel : https://www.youtube.com/watch?v=OK1ZY_bw660
53 : md : R Geoffrey Avery (eGeoffrey) Lightning Talks Day 1 : https://www.youtube.com/watch?v=mQVUvAz3zhQ
54 : md : Welcome to YAPC & States of the Velociraptors : The Perl5 community lightning talks : https://www.youtube.com/watch?v=88K1h1XhEeo
55 : md : YAPC::NA::2014 Highlights : https://www.youtube.com/watch?v=GLqtHab06dM
56 : md : Matt S Trout (mst) - Devops Logique : https://www.youtube.com/watch?v=RQwY28DItLI
57 : md : John Anderson (genehack) - Yet Another Keynote Speech : https://www.youtube.com/watch?v=MU6IFUZZBuQ
58 : md : Sawyer X - The Joy in What We Do : https://www.youtube.com/watch?v=CjOQZf0Ad74
59 : md : R Geoffrey Avery (rGeoffrey) - Lightning Talks Day 3 : https://www.youtube.com/watch?v=m-6o2dBc1qE
60 : md : Peter Martini - Sub Signatures: Next Steps : https://www.youtube.com/watch?v=ot5yOrMJogA
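
Both the "identical code" point and the "a little bit faster" point can be checked with something like the following. This is only a minimal sketch, not the script used for the run above: the HTML is a synthetic 60-item page with made-up titles and links (the real page is thousands of lines), talks() is a helper name invented here, and the timing comes from the core Benchmark module's cmpthese. The same talks() sub is handed a Web::Query object and then a Mojo::DOM object, the two result lists are compared, and then each parse-and-extract pass is timed.

```perl
#!/usr/bin/env perl
use strict ;
use warnings ;
use feature qw{ say } ;
use Benchmark qw{ cmpthese } ;
use Mojo::DOM ;
use Web::Query ;

my $base = 'https://www.youtube.com' ;

# Synthetic stand-in for the saved page (assumption): 60 items, made-up data
my $html = '<!DOCTYPE html><html><body><ul>' ;
for my $n ( 1 .. 60 ) {
    $html .= qq{<li class="channels-content-item"><div class="yt-lockup-content">}
        . qq{<a href="/watch?v=demo$n">Talk $n</a></div></li>} ;
}
$html .= '</ul></body></html>' ;

# Hypothetical helper: takes either a Web::Query or a Mojo::DOM object and
# returns "title : link" strings, using the same calls as the script above
sub talks {
    my ($dom) = @_ ;
    my @talks ;
    $dom->find('.channels-content-item')->each(
        sub {
            my $content = $_->find('.yt-lockup-content')->first ;
            my $anchor  = $content->find('a')->first ;
            my $title   = $anchor->text ;
            my $link    = $base . $anchor->attr('href') ;
            push @talks, "$title : $link" ;
        }
    ) ;
    return @talks ;
}

# Same sub, two parsers: the results should be the same list
my @wq = talks( wq($html) ) ;
my @md = talks( Mojo::DOM->new($html) ) ;
say scalar(@wq), ' talks from Web::Query, ', scalar(@md), ' from Mojo::DOM' ;
say 'results ', ( "@wq" eq "@md" ? 'match' : 'differ' ) ;

# Time parse-plus-extract for each module, at least 1 CPU second apiece
cmpthese(
    -1,
    {
        'Web::Query' => sub { my @t = talks( wq($html) ) },
        'Mojo::DOM'  => sub { my @t = talks( Mojo::DOM->new($html) ) },
    }
) ;
```

The numbers will vary by machine and module version, and a small synthetic page may not show the same gap as the real one, so pasting the actual saved HTML into $html would be the fairer test.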