Parsing

Now that we know how to download and fetch pages, let's parse them and extract some information. The following examples skip error handling for brevity, but real-world code should handle every possible error case.

The sample page looks like this:

<html>
    <head>
        <title>A sample webpage!</title>
    </head>
    <body>
        <h1>Perltuts.com rocks!</h1>
    </body>
</html>

This is HTML, so to work with it we need an HTML parser plus an XPath or CSS selector engine, depending on the task.

Using XPath

We'll start by parsing the page with HTML::TreeBuilder::XPath. XPath is a query language for XML documents. If you haven't used XPath before, here's a quick summary:

descendant-or-self::*
all elements

//h1
<h1> element

descendant-or-self::h1/descendant::span
<span> within <h1>

descendant-or-self::h1 | descendant-or-self::span
<h1> and <span>

descendant-or-self::h1/span
<span> with parent <h1>

descendant-or-self::div/following-sibling::*[name() = 'span' and (position() = 1)]
<span> preceded by <div>

descendant-or-self::*[contains(concat(' ', normalize-space(@class), ' '), ' class ')]
Elements of class "class"

descendant-or-self::div[contains(concat(' ', normalize-space(@class), ' '), ' class ')]
<div> of class "class"

descendant-or-self::*[@id = 'id']
Element with id "id"

descendant-or-self::div[@id = 'id']
<div> with id "id"

descendant-or-self::a[@attr]
<a> with attribute "attr"
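
To make the summary more concrete, here is a minimal sketch that runs a few of these expressions against a small document. The markup, the id "main" and the class "wide" are made up for this illustration; only HTML::TreeBuilder::XPath and its findnodes method come from this tutorial.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup, invented for this illustration.
my $html = <<'EOF';
<html>
    <body>
        <div id="main" class="box wide">
            <h1>Title <span>inner</span></h1>
            <a href="/about">About</a>
        </div>
    </body>
</html>
EOF

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# A few expressions in the style of the summary above.
my @queries = (
    q{//h1/span},            # <span> with parent <h1>
    q{//*[@id = 'main']},    # element with id "main"
    q{//div[contains(concat(' ', normalize-space(@class), ' '), ' wide ')]},
    q{//a[@href]},           # <a> with attribute "href"
);

for my $xpath (@queries) {
    # In list context findnodes returns the matching nodes.
    my @nodes = $tree->findnodes($xpath);
    say scalar(@nodes), " match(es) for $xpath";
    say "    <", $_->tag, "> ", $_->as_text for @nodes;
}
```

The class test looks verbose because @class holds a space-separated list, so the expression pads it with spaces and searches for the padded class name.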

In this example we extract the page title:

use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder::XPath;

my $html = <<'EOF';
<html>
    <head>
        <title>A sample webpage!</title>
    </head>
    <body>
        <h1>Perltuts.com rocks!</h1>
    </body>
</html>
EOF

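# Build the tree; ignore_unknown(0) keeps tags the parser doesn't recognise
my $tree = HTML::TreeBuilder::XPath->new;
$tree->ignore_unknown(0);
$tree->parse($html);
$tree->eof;    # signal end of input so the tree is finalised

# findnodes returns every node matching the XPath expression
my @nodes = $tree->findnodes('//title');
say $nodes[0]->as_text;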
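
The example above assumes that at least one node matches. As a rough sketch of the error handling that the introduction says real code needs, the version below checks for an empty result before calling as_text. The helper first_text is a hypothetical name introduced here, not part of HTML::TreeBuilder::XPath.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder::XPath;

# Hypothetical helper: returns the text of the first node matching
# $xpath, or undef when nothing matches.
sub first_text {
    my ($tree, $xpath) = @_;
    my @nodes = $tree->findnodes($xpath);
    return @nodes ? $nodes[0]->as_text : undef;
}

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse('<html><head><title>A sample webpage!</title></head><body></body></html>');
$tree->eof;

# '//h1' matches nothing in this document, so it takes the warning branch.
for my $xpath ('//title', '//h1') {
    my $text = first_text($tree, $xpath);
    if (defined $text) {
        say "$xpath: $text";
    }
    else {
        warn "$xpath: no matching element\n";
    }
}
```

Without the check, calling as_text on a missing $nodes[0] would die with a "Can't call method on an undefined value" error.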

Exercise

Extract and print the contents of the h1 element.

use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder::XPath;

my $html = <<'EOF';
<html>
    <head>
        <title>A sample webpage!</title>
    </head>
    <body>
        <h1>Perltuts.com rocks!</h1>
    </body>
</html>
EOF

my $tree = HTML::TreeBuilder::XPath->new;
$tree->ignore_unknown(0);
$tree->parse($html);
$tree->eof;

my @nodes = $tree->findnodes(...);
say $nodes[0]->as_text;

Using CSS selectors

Some programmers find CSS selectors easier than XPath. If you don't know CSS selectors, here's a quick summary:

*
all elements

h1
<h1> element

h1 span
<span> within <h1>

h1, span
<h1> and <span>

h1 > span
<span> with parent <h1>

div + span
<span> preceded by <div>

.class
Elements of class "class"

div.class
<div> of class "class"

#id
Element with id "id"

div#id
<div> with id "id"

a[attr]
<a> with attribute "attr"

With HTML::Selector::XPath, which translates CSS selectors into XPath expressions, we can use the previous library as a CSS selector engine:

use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath;

my $html = <<'EOF';
<html>
    <head>
        <title>A sample webpage!</title>
    </head>
    <body>
        <h1>Perltuts.com rocks!</h1>
    </body>
</html>
EOF

my $tree = HTML::TreeBuilder::XPath->new;
$tree->ignore_unknown(0);
$tree->parse($html);
$tree->eof;

my $xpath = HTML::Selector::XPath::selector_to_xpath('h1');
my @nodes = $tree->findnodes($xpath);
say $nodes[0]->as_text;
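
To see what the translation step produces, this short sketch prints the XPath expression generated for several selectors from the summary above. It assumes selector_to_xpath is imported by name from HTML::Selector::XPath. The exact output strings may differ between module versions, so none are shown here.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::Selector::XPath 'selector_to_xpath';

# Selectors taken from the CSS summary above.
my @selectors = ('h1', 'h1 span', 'h1 > span', 'div + span',
                 'div.class', '#id', 'a[attr]');

for my $selector (@selectors) {
    # selector_to_xpath returns an XPath string that findnodes accepts.
    say sprintf '%-12s => %s', $selector, selector_to_xpath($selector);
}
```

Comparing the printed expressions with the XPath summary is a quick way to check that the two notations select the same elements.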

Exercise

Complete the code below so that it fetches the page, then extracts and prints the contents of h1 using a CSS selector:

use strict;
use warnings;
use feature 'say';
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath;

my $ua = LWP::UserAgent->new;

my $response = ...
my $html = ...

my $tree = ...

my $xpath = HTML::Selector::XPath::selector_to_xpath(...);
my @nodes = $tree->findnodes($xpath);
say $nodes[0]->as_text;