Scraping

Now that we know how to fetch pages, let's extract some data from them. The code examples below omit error handling for simplicity and brevity, but real applications should always check for errors.

The default page, as you already know, looks like this:

<html>
    <head>
        <title>A sample webpage!</title>
    </head>
    <body>
        <h1></h1>
    </body>
</html>

This page is HTML, so we need an HTML parser and a selection mechanism, such as XPath or CSS selectors, to extract data from it.

XPath scraping

First, we will scrape the HTML with XPath alone, using HTML::TreeBuilder::XPath. XPath is a query language for XML documents. If you are not familiar with XPath, here is a quick cheatsheet:

descendant-or-self::*
all elements

//h1
<h1> element

descendant-or-self::h1/descendant::span
<span> within <h1>

descendant-or-self::h1 | descendant-or-self::span
<h1> and <span>

descendant-or-self::h1/span
<span> with parent <h1>

descendant-or-self::div/following-sibling::*[name() = 'span' and (position() = 1)]
<span> immediately preceded by <div>

descendant-or-self::*[contains(concat(' ', normalize-space(@class), ' '), ' class ')]
Elements of class "class"

descendant-or-self::div[contains(concat(' ', normalize-space(@class), ' '), ' class ')]
<div> of class "class"

descendant-or-self::*[@id = 'id']
Element with id "id"

descendant-or-self::div[@id = 'id']
<div> with id "id"

descendant-or-self::a[@attr]
<a> with attribute "attr"

In the following example we extract the title of the page.

use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder::XPath;

my $html = <<'EOF';
<html>
    <head>
        <title>A sample webpage!</title>
    </head>
    <body>
        <h1>Perltuts.com rocks!</h1>
    </body>
</html>
EOF

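# Build the tree; ignore_unknown(0) keeps tags the parser does not recognize.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->ignore_unknown(0);
$tree->parse($html);
$tree->eof;

# findnodes returns every node that matches the XPath expression.
my @nodes = $tree->findnodes('//title');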
say $nodes[0]->as_text;
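
The cheatsheet entries for child and descendant matches are easy to confuse, so here is a minimal sketch that runs both against a small document. The markup, the class name "box" and the id "main" are made up for illustration. It also shows reading an attribute with attr() instead of the text with as_text.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder::XPath;

# Hypothetical markup: one <span> directly inside <h1>, one nested deeper.
my $html = <<'EOF';
<html>
    <body>
        <h1><span>direct</span> <em><span>nested</span></em></h1>
        <div id="main" class="box wide"><a href="/about">About</a></div>
    </body>
</html>
EOF

my $tree = HTML::TreeBuilder::XPath->new;
$tree->ignore_unknown(0);
$tree->parse($html);
$tree->eof;

# Child step: only <span> elements whose parent is <h1>.
my @children = $tree->findnodes('//h1/span');

# Descendant step: every <span> anywhere inside <h1>.
my @descendants = $tree->findnodes('//h1//span');

# Class test from the cheatsheet: matches one class among several.
my @boxes = $tree->findnodes(
    q{//div[contains(concat(' ', normalize-space(@class), ' '), ' box ')]}
);

# Elements that carry a given attribute.
my @links = $tree->findnodes('//a[@href]');

say scalar @children;         # number of direct <span> children of <h1>
say scalar @descendants;      # number of <span> elements anywhere in <h1>
say $boxes[0]->attr('id');    # id of the first matching <div>
say $links[0]->attr('href');  # href of the first matching <a>
```

With this markup the child query should find one <span> and the descendant query two, since the second <span> sits inside <em>.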

Exercise

Extract and print the h1 tag content.

use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder::XPath;

my $html = <<'EOF';
<html>
    <head>
        <title>A sample webpage!</title>
    </head>
    <body>
        <h1>Perltuts.com rocks!</h1>
    </body>
</html>
EOF

my $tree = HTML::TreeBuilder::XPath->new;
$tree->ignore_unknown(0);
$tree->parse($html);
$tree->eof;

my @nodes = $tree->findnodes(...);
say $nodes[0]->as_text;

CSS selectors scraping

Some developers find CSS selectors easier to read than XPath. If you are not familiar with CSS selectors, here is a quick cheatsheet:

*
all elements

h1
<h1> element

h1 span
<span> within <h1>

h1, span
<h1> and <span>

h1 > span
<span> with parent <h1>

div + span
<span> immediately preceded by <div>

.class
Elements of class "class"

div.class
<div> of class "class"

#id
Element with id "id"

div#id
<div> with id "id"

a[attr]
<a> with attribute "attr"

Conveniently, HTML::Selector::XPath translates CSS selectors into XPath expressions, so HTML::TreeBuilder::XPath can be used with CSS selectors too.
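
To see what the translation produces, the sketch below prints the XPath generated for a few selectors from the cheatsheet. The exact expressions may differ between module versions and from the XPath cheatsheet earlier, so no particular output is assumed here.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::Selector::XPath;

# A few selectors taken from the cheatsheet.
my @selectors = ('h1', 'h1 span', 'h1 > span', 'div.class', '#id', 'a[attr]');

# Translate each CSS selector into its XPath equivalent.
my %xpath_for;
$xpath_for{$_} = HTML::Selector::XPath::selector_to_xpath($_) for @selectors;

# Print the selector next to the expression generated for it.
say "$_  =>  $xpath_for{$_}" for @selectors;
```

The same translation is available through the object interface as HTML::Selector::XPath->new($selector)->to_xpath.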

In the following example we extract the <h1> heading of the page by using a CSS selector.

use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath;

my $html = <<'EOF';
<html>
    <head>
        <title>A sample webpage!</title>
    </head>
    <body>
        <h1>Perltuts.com rocks!</h1>
    </body>
</html>
EOF

my $tree = HTML::TreeBuilder::XPath->new;
$tree->ignore_unknown(0);
$tree->parse($html);
$tree->eof;

# Translate the CSS selector into XPath, then query the tree as before.
my $xpath = HTML::Selector::XPath::selector_to_xpath('h1');
my @nodes = $tree->findnodes($xpath);
say $nodes[0]->as_text;
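
The examples so far assume that the selector always matches something. In a real application $nodes[0] may be undefined, and calling as_text on it would die. A minimal sketch of a more defensive lookup follows. first_text is a hypothetical helper written for this illustration, not part of either module, and the eval assumes that selector_to_xpath dies on a selector it cannot parse.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath;

# Hypothetical helper: text of the first element matching a CSS selector,
# or undef when the selector is rejected or nothing matches.
sub first_text {
    my ($tree, $selector) = @_;

    # Assumption: selector_to_xpath dies on a selector it cannot parse,
    # so the call is wrapped in eval.
    my $xpath = eval { HTML::Selector::XPath::selector_to_xpath($selector) };
    return unless defined $xpath;

    my @nodes = $tree->findnodes($xpath);
    return @nodes ? $nodes[0]->as_text : undef;
}

my $tree = HTML::TreeBuilder::XPath->new;
$tree->ignore_unknown(0);
$tree->parse('<html><head><title>A sample webpage!</title></head>'
           . '<body><h1>Perltuts.com rocks!</h1></body></html>');
$tree->eof;

# 'h1' exists in the page, 'h2' does not.
for my $selector ('h1', 'h2') {
    my $text = first_text($tree, $selector);
    say defined $text ? "$selector: $text" : "$selector: no match";
}
```

Returning undef keeps the decision about a missing element with the caller, which can skip the page, log a warning or stop.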

Exercise

Put everything together: fetch a page, then extract and print the h1 tag content by using a CSS selector.

use strict;
use warnings;
use feature 'say';
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath;

my $ua = LWP::UserAgent->new;

my $response = ...
my $html = ...

my $tree = ...

my $xpath = HTML::Selector::XPath::selector_to_xpath(...);
my @nodes = $tree->findnodes($xpath);
say $nodes[0]->as_text;