Following redirects and links


Websites often redirect requests. Fortunately, LWP::UserAgent follows redirects out of the box, and the max_redirect option controls how many consecutive redirects it will follow (the default is 7).

There is a special page /redirect that will redirect to the index page.

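use strict;
use warnings;
use feature 'say';    # say is not enabled by default
use LWP::UserAgent;

my $ua =
  LWP::UserAgent->new(agent => 'MyWebScraper/1.0 <>');

# The redirect is followed automatically, so this is the index page
my $response = $ua->get('http://example:3000/redirect');

say $response->decoded_content;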

If we set max_redirect to 0, LWP does not follow the redirect, so we get the redirect response itself instead of the index page.

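use strict;
use warnings;
use feature 'say';
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(
    agent        => 'MyWebScraper/1.0 <>',
    max_redirect => 0,    # do not follow any redirects
);

my $response = $ua->get('http://example:3000/redirect');

# The status line shows that this is the redirect, not the index page
say $response->status_line;
say $response->decoded_content;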

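When redirects are not followed automatically, it can help to inspect the redirect response itself. The following is a minimal sketch rather than a definitive recipe: to run without a server, it builds two HTTP::Response objects by hand (the status codes, the Location value and the body are assumptions made up for illustration) and then uses the is_redirect, header, previous and redirects methods of HTTP::Response.

```perl
use strict;
use warnings;
use feature 'say';
use HTTP::Response;

# Hand-built responses standing in for what LWP would return;
# the Location value and the body are assumptions for this sketch.
my $redirect = HTTP::Response->new(302, 'Found', [ Location => '/' ]);
my $final    = HTTP::Response->new(200, 'OK', undef, 'index page');

# LWP links each response to the one that led to it via previous().
$final->previous($redirect);

# With max_redirect => 0 you would receive something like $redirect.
my $target = $redirect->is_redirect ? $redirect->header('Location') : undef;
say "redirects to: $target" if defined $target;

# With redirects enabled you receive $final; redirects() lists the hops.
my @hops = $final->redirects;
say 'hops followed: ', scalar @hops;
say $_->code for @hops;
```

With a real request, the same calls work on the object returned by $ua->get, so you can log where a page redirected to before deciding whether to fetch the target.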

Another common task is following the links found on a web page.

We can use CSS selectors to get all the <a> tags. Let's try it again on a simple HTML example:

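use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath;

my $html = <<'EOF';
        <title>A sample webpage!</title>
        <h1> rocks!</h1>
        <a href=""></a>
EOF

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);    # the tree is empty until the HTML is parsed

# Translate the CSS selector into XPath and collect every <a> element
my $xpath = HTML::Selector::XPath::selector_to_xpath('a');
my @nodes = $tree->findnodes($xpath);

# Print the value of the first attribute (here href) of the first link
my @attrs = $nodes[0]->getAttributes();
say $attrs[0]->getValue();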
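
To actually follow the links, relative href values usually have to be turned into absolute URLs first. The sketch below shows one way to do that with URI->new_abs. The page content and the base URL are hypothetical values invented for this example, and the a[href] selector is used so that anchors without an href are skipped.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath;
use URI;

# Hypothetical page and base URL, made up for this sketch.
my $base = 'http://example:3000/';
my $html = <<'EOF';
<html><body>
<a href="/about">About</a>
<a href="posts/1">First post</a>
<a>No href here</a>
</body></html>
EOF

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# Select only the anchors that carry an href attribute.
my $xpath = HTML::Selector::XPath::selector_to_xpath('a[href]');

# Resolve each href against the base URL so it could be passed to $ua->get.
my @links = map { URI->new_abs($_->attr('href'), $base)->as_string }
            $tree->findnodes($xpath);

say for @links;

# Free the parse tree once we are done with it.
$tree->delete;
```

In a real scraper the base would normally be the URL of the page that was just fetched, and each entry in @links could then be requested in turn.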