HTML Scraper in PHP
by Rekha 2009-11-07 14:35:37
Sometimes we want to extract the HTML content of a remote web page; this technique is called HTML scraping. This article discusses how to extract the HTML content of a remote web page with PHP.
HTML scraping can be done in two steps:
* Call the remote web page and retrieve its HTML content.
* Match the HTML tags using regular expressions.
Calling the Remote Web Page using PHP:
We will use cURL to fetch the page.
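$ch = curl_init();
$timeout = 5; // connect timeout in seconds; set to zero for no timeout
curl_setopt($ch, CURLOPT_URL, $url); // the page to fetch
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response as a string instead of printing it
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$file_contents = curl_exec($ch); // false if the request failed
curl_close($ch);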
$url holds the remote URL you want to connect to, and $file_contents holds the HTML content of the remote web page that was fetched. If the request fails, curl_exec returns false instead of the content.
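The fetch step can also be wrapped in a small reusable function that checks for failure. What follows is only a minimal sketch: the helper name fetchHtml, the overall transfer timeout, and the choice to follow redirects are assumptions added for illustration and are not part of the original snippet.

```php
<?php
/**
 * Fetch the HTML of a remote page, or return null on failure.
 * fetchHtml is a hypothetical helper name, not a cURL function.
 */
function fetchHtml(string $url, int $timeout = 5): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the body instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // seconds allowed for connecting
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout * 2);    // assumed cap on the whole transfer
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);     // assumed: follow HTTP redirects

    $body = curl_exec($ch);
    if ($body === false) {
        // curl_error() describes what went wrong (DNS failure, timeout, ...).
        error_log('cURL error: ' . curl_error($ch));
        return null;
    }

    // The handle is released when $ch goes out of scope.
    return $body;
}
```

A call such as $file_contents = fetchHtml($url); would then yield either the page source or null, so the matching step can be skipped when nothing was fetched.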
Matching the HTML tags using Regular Expressions in PHP:
Here we will use preg_match/preg_match_all to read the HTML tags from the HTML source. Below is a regular expression that extracts the content inside HTML tags.
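The difference between the two functions can be illustrated with a short sketch. The sample markup and the variable names are hypothetical and stand in for a fetched page.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = '<html><head><title>Demo page</title></head>'
      . '<body><h2>One</h2><h2>Two</h2></body></html>';

// preg_match stops at the first match and returns 1 (found) or 0 (not found).
// $title[1] receives the text captured by the parentheses.
$found = preg_match('/<title>(.*?)<\/title>/is', $html, $title);

// preg_match_all keeps scanning and returns the number of full matches.
// $headings[1] receives the captured text of every match.
$count = preg_match_all('/<h2>(.*?)<\/h2>/is', $html, $headings);

echo $title[1], "\n";                    // prints: Demo page
echo implode(', ', $headings[1]), "\n";  // prints: One, Two
```

preg_match suits tags that appear once, such as the page title, while preg_match_all suits tags that repeat.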
Extracting data from HTML tags
preg_match_all('/<span\b[^>]*>(.*?)<\/span>/is', $file_contents, $htmlContent);
Assume that $file_contents holds the HTML source code. After executing the preg_match_all call above, $htmlContent[0] holds every span tag found in the HTML source code and $htmlContent[1] holds the text inside each tag.
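To show what span extraction returns, here is a minimal sketch run against a hard-coded string. The sample markup and the pattern used are assumptions chosen for illustration: a simplified pattern that allows attributes on the opening tag and captures the text up to the nearest closing tag.

```php
<?php
// Hypothetical sample standing in for the fetched $file_contents.
$file_contents = '<p><span>First</span> and <span class="note">Second item</span></p>';

// \b keeps "<span" from matching longer tag names, [^>]* allows attributes,
// and (.*?) lazily captures the text up to the nearest closing tag.
$pattern = '/<span\b[^>]*>(.*?)<\/span>/is';

$count = preg_match_all($pattern, $file_contents, $htmlContent);

// $htmlContent[0] lists the full matches, $htmlContent[1] the captured text.
foreach ($htmlContent[1] as $text) {
    echo trim($text), "\n";
}
// prints:
// First
// Second item
```

A pattern like this does not handle span tags nested inside other span tags, so it is best suited to simple, predictable markup.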