HTML Scraper in PHP

by Rekha 2009-11-07 14:35:37

Sometimes we want to extract the HTML content of a remote web page. This technique is called HTML scraping. This article discusses how to extract the HTML content of a remote web page with PHP.

HTML scraping can be done in two steps:

* Call the remote web page and fetch its HTML content.
* Match the HTML tags using a regular expression.

Call to the remote web page using PHP:
We will use cURL for this step.

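$ch = curl_init();
$timeout = 5; // connection timeout in seconds; set to zero for no timeout
curl_setopt($ch, CURLOPT_URL, $url);                 // page to fetch
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);         // return the response instead of printing it
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$file_contents = curl_exec($ch);                     // HTML as a string, or false on failure
curl_close($ch);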


$url holds the remote URL you want to connect to, and $file_contents holds the HTML content of the remote web page that was requested. If the request fails, curl_exec returns false instead of a string.
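The fetch step can also be wrapped in a small function with basic error handling. This is only a minimal sketch: the function name fetch_html and its default timeout are assumptions made for illustration, not part of the original code.

```php
<?php
// Hypothetical helper (name and signature are illustrative assumptions):
// fetch a URL with cURL and return the response body, or null on failure.
function fetch_html(string $url, int $timeout = 5): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);      // return the body instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);  // seconds to wait for the connection

    $body = curl_exec($ch);
    if ($body === false) {
        // curl_error() describes what went wrong (DNS failure, timeout, ...)
        error_log('cURL error: ' . curl_error($ch));
        return null;
    }

    // The handle is released automatically when $ch goes out of scope.
    return $body;
}
```

Calling fetch_html($url) then yields the same kind of value as $file_contents above, except that a failed request shows up as null, which is easy to check before running any regular expression on the result.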


Match the HTML tags using regular expressions in PHP:
Here we will use preg_match/preg_match_all to read the HTML tags from the HTML source. Below I am posting a regular expression that extracts the content inside HTML tags.
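The difference between the two functions can be shown on a small sample string. This is a sketch only: the sample markup and the variable names $titleMatch, $title and $spanCount are made up for illustration.

```php
<?php
// Sample markup standing in for $file_contents (illustrative only).
$file_contents = '<html><head><title>Demo page</title></head>'
    . '<body><span>one</span><span>two</span></body></html>';

// preg_match() stops at the first match and returns 1 (found) or 0 (not found).
$found = preg_match('/<title>(.*?)<\/title>/is', $file_contents, $titleMatch);
$title = ($found === 1) ? $titleMatch[1] : null;

// preg_match_all() scans the whole string and returns the number of matches.
$spanCount = preg_match_all('/<span>/i', $file_contents);
```

preg_match is enough when a tag is expected only once, such as the page title, while preg_match_all suits tags that repeat.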

Extracting data from HTML tags

preg_match_all('/<span>[\/()-:<>\w\s]+<\/span>/', $file_contents, $htmlContent);


Assume that $file_contents holds the HTML source code. After preg_match_all runs, $htmlContent[0] holds every span element that the pattern matched.
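When only the text between the tags is needed, a capture group can be added. The following is a hedged sketch rather than the pattern used above: the sample string and the names $wholeTags and $innerText are assumptions, and regular expressions in general only cope with simple, non-nested markup.

```php
<?php
// Sample markup standing in for $file_contents (illustrative only).
$file_contents = '<p><span class="date">2009-11-07</span> and <span>14:35</span></p>';

// <span\b[^>]*> allows attributes on the opening tag;
// (.*?) lazily captures the text up to the nearest closing tag.
preg_match_all('/<span\b[^>]*>(.*?)<\/span>/is', $file_contents, $htmlContent);

$wholeTags = $htmlContent[0]; // full matches, tags included
$innerText = $htmlContent[1]; // only the captured text between the tags
```

With this sample input, $innerText contains the two strings 2009-11-07 and 14:35, while $wholeTags contains the same values wrapped in their span tags.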

