    descendant-or-self::*[contains(concat(' ', normalize-space(@class), ' '), ' class ')]    Elements of class "class"
    descendant-or-self::div[contains(concat(' ', normalize-space(@class), ' '), ' class ')]  <div> of class "class"
    descendant-or-self::*[@id = 'id']                                                        Element with id "id"
    descendant-or-self::div[@id = 'id']                                                      <div> with id "id"
    descendant-or-self::a[@attr]                                                             <a> with attribute "attr"

In the following example we extract the C<title> of the page.

    use HTML::TreeBuilder::XPath;

    my $html = <<'EOF';
    <html>
    <head>
        <title>A sample webpage!</title>
    </head>
    <body>
        <h1>Perltuts.com rocks!</h1>
    </body>
    </html>
    EOF

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->ignore_unknown(0);
    $tree->parse($html);
    $tree->eof;

    my @nodes = $tree->findnodes('//title');
    say $nodes[0]->as_text;

=head3 Exercise

Extract and print the C<h1> tag content.

    use HTML::TreeBuilder::XPath;

    my $html = <<'EOF';
    <html>
    <head>
        <title>A sample webpage!</title>
    </head>
    <body>
        <h1>Perltuts.com rocks!</h1>
    </body>
    </html>
    EOF

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->ignore_unknown(0);
    $tree->parse($html);
    $tree->eof;

    my @nodes = $tree->findnodes(...);
    say $nodes[0]->as_text;
    __TEST__
    like($stdout, qr/Perltuts.com rocks!/, 'Should print correct h1 content');

=head3 CSS selectors scraping

Some developers find C<CSS> selectors easier to understand than C<XPath>. If
you're not familiar with C<CSS> selectors, here is a quick cheatsheet:

    # no-run
    *                   all elements
    h1                  <h1> element
    h1 span             <span> within <h1>
    h1, span            <h1> and <span>
    h1 > span           <span> with parent <h1>
    div + span          <span> preceded by <div>
    .class              Elements of class "class"
    div.class           <div> of class "class"
    #id                 Element with id "id"
    div#id              <div> with id "id"
    a[attr]             <a> with attribute "attr"

Fortunately, with L<HTML::Selector::XPath> we can teach
L<HTML::TreeBuilder::XPath> to understand C<CSS> selectors too.

In the following example we extract the C<h1> of the page by using a C<CSS>
selector.

    use HTML::TreeBuilder::XPath;
    use HTML::Selector::XPath;

    my $html = <<'EOF';
    <html>
    <head>
        <title>A sample webpage!</title>
    </head>
    <body>
        <h1>Perltuts.com rocks!</h1>
    </body>
    </html>
    EOF

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->ignore_unknown(0);
    $tree->parse($html);
    $tree->eof;

    my $xpath = HTML::Selector::XPath::selector_to_xpath('h1');
    my @nodes = $tree->findnodes($xpath);
    say $nodes[0]->as_text;

=head3 Exercise

Put everything together (including fetching a page), then extract and print the
C<h1> tag content by using a C<CSS> selector.

    use LWP::UserAgent;
    use HTML::TreeBuilder::XPath;
    use HTML::Selector::XPath;

    my $ua = LWP::UserAgent->new;

    my $response = ...

    my $html = ...

    my $tree = ...

    my $xpath = HTML::Selector::XPath::selector_to_xpath(...);
    my @nodes = $tree->findnodes($xpath);
    say $nodes[0]->as_text;
    __TEST__
    like($stdout, qr/Perltuts.com rocks!/, 'Should print correct h1 content');

=head2 Following redirects and links

=head3 Redirects

Websites often redirect from one page to another; fortunately,
L<LWP::UserAgent> supports redirects out of the box. Using C<max_redirect> you
can control how many redirects C<LWP::UserAgent> will follow.

There is a special page C</redirect> that redirects to the index page.

    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new(agent => 'MyWebScraper/1.0 ');

    my $response = $ua->get('http://example:3000/redirect');

    say $response->decoded_content;

If we set C<max_redirect> to C<0>, the redirect is not followed and we don't
get to the index page.

    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new(
        agent        => 'MyWebScraper/1.0 ',
        max_redirect => 0
    );

    my $response = $ua->get('http://example:3000/redirect');

    say $response->decoded_content;

=head3 Links

A scraper often also needs to follow the links that are available on a web
page. We can use C<CSS> selectors to get all the C<a> tags. Let's try it again
on a simple HTML example:

    use HTML::TreeBuilder::XPath;
    use HTML::Selector::XPath;

    my $html = <<'EOF';
    <html>
    <head>
        <title>A sample webpage!</title>
    </head>
    <body>
        <h1>Perltuts.com rocks!</h1>
        <a href="http://perltuts.com">perltuts.com</a>
    </body>
    </html>
    EOF

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->ignore_unknown(0);
    $tree->parse($html);
    $tree->eof;

    my $xpath = HTML::Selector::XPath::selector_to_xpath('a');
    my @nodes = $tree->findnodes($xpath);

    # Print the value of the first attribute of the first link
    my @attrs = $nodes[0]->getAttributes();
    say $attrs[0]->getValue();

=head2 See also

See the following modules for other scraping tools:

=over

=item * L

=item * L

=back

=head1 AUTHOR

Viacheslav Tykhanovskyi, C

=head1 LICENSE

L
