Extracting Text from MS Word

by Mohan 2012-09-20 14:01:00




If PHP is running on Unix/Linux rather than Windows, you don't have access to PHP's COM abilities. This makes it difficult to extract information from Microsoft Word documents.

Being able to get at the text from a Word document can be useful, especially for building indexers for search engines.

The solutions that are currently available usually involve using binaries such as catdoc or antiword. Good as these products are, they can be complicated to install and configure (sometimes impossible if using a shared hosting account).

Here's a simple attempt at a solution using just PHP. I don't pretend that it successfully extracts the text from every Word document, but I've found it very reliable for the vast majority of the several thousand docs I've used it with. The function returns the text from the Word document as a string, with all the formatting removed. Please note that some parts of the Word document (headers, footers, etc.) are not parsed.

<?php
/*****************************************************************
This approach uses detection of NUL (chr(0)) and carriage return
(chr(13)) to decide where the text is:
- divide the file contents up by chr(13)
- reject any slices containing a NUL
- stitch the rest together again
- clean up with a regular expression
*****************************************************************/

function parseWord($userDoc)
{
    // Read the whole file as raw bytes; give back an empty string on failure.
    $contents = @file_get_contents($userDoc);
    if ($contents === false) {
        return "";
    }

    $outtext = "";
    foreach (explode(chr(0x0D), $contents) as $thisline) {
        // Skip empty slices and slices containing a NUL byte (binary data).
        if ($thisline === "" || strpos($thisline, chr(0x00)) !== false) {
            continue;
        }
        $outtext .= $thisline . " ";
    }

    // Keep only letters, digits, whitespace and a little basic punctuation.
    return preg_replace("/[^a-zA-Z0-9\s,.\-@\/_()]/", "", $outtext);
}
?>
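
To see what the split-and-reject steps do, it can help to run them on a small hand-made byte string instead of a real file. The sketch below assumes a hypothetical helper, extractTextFromBytes(), which is not part of the function above; it applies the same steps to a string already in memory. The sample bytes are made up for illustration and are not a real Word document.

```php
<?php
// Hypothetical helper (an assumption for this example): the same steps as
// parseWord(), applied to an in-memory string rather than a file.
function extractTextFromBytes($bytes)
{
    $outtext = "";
    foreach (explode(chr(0x0D), $bytes) as $slice) {
        // Keep only non-empty slices that contain no NUL byte.
        if ($slice !== "" && strpos($slice, chr(0x00)) === false) {
            $outtext .= $slice . " ";
        }
    }
    return preg_replace("/[^a-zA-Z0-9\s,.\-@\/_()]/", "", $outtext);
}

// Made-up sample: two plain-text paragraphs surrounded by binary-looking slices.
$sample = "\x00\x01header\x0D"
        . "Hello, world!\x0D"
        . "Second paragraph.\x0D"
        . "\x00\x00trailer";

echo extractTextFromBytes($sample);
```

The first and last slices contain NUL bytes and are rejected, so only the two paragraphs survive. The exclamation mark is also dropped, because the clean-up expression removes every character outside its whitelist; accented letters and other punctuation would be lost in the same way.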

Using the function is as easy as:

$text = parseWord($userDoc);

The recovered text can then be processed as required, e.g. added to a search index or stored in a MySQL table with a FULLTEXT index.
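
As one rough illustration of putting the text into an index, the sketch below builds a very small in-memory word index that maps each word to the documents containing it. The function name buildWordIndex(), the file names and the sample strings are all assumptions made for this example; a real search engine or a MySQL FULLTEXT index does considerably more (stop words, ranking and so on).

```php
<?php
// Hypothetical helper: map each lower-cased word to the list of document
// names whose extracted text contains it.
function buildWordIndex(array $texts)
{
    $index = array();
    foreach ($texts as $docName => $text) {
        // Split on anything that is not a lower-case letter or a digit.
        $words = preg_split("/[^a-z0-9]+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        // Record each word once per document.
        foreach (array_unique($words) as $word) {
            $index[$word][] = $docName;
        }
    }
    return $index;
}

// In practice each string would come from parseWord(); these are made up.
$texts = array(
    "report.doc" => "Quarterly sales report",
    "memo.doc"   => "Sales memo",
);

$index = buildWordIndex($texts);
print_r($index["sales"]);
```

Looking up a word is then a plain array access, and a word that appears in no document is simply absent from the array.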
