PDF searching with ht://dig
------------------------------------------

Super-simplified instructions compiled by:
Colin Viebrock (cmv@privateworld.com)


First you need to install Adobe Acrobat Reader on your server. Get the
latest version from:

    http://www.adobe.com

Second, you need to apply the patch that's included in htdig-pdf.tgz.
Don't compile yet, because ...

Third, make the following changes, as pointed out by Sylvain Wallez:

    The first one is a bug in PDF.cc (doesn't seem to happen on the PDF
    files on my Intranet, but we only use Acrobat to produce them).
    Here's the diff he sent me:

diff -c htdig/PDF.cc.old htdig/PDF.cc
*** htdig/PDF.cc.old    Wed Jul 15 10:46:03 1998
--- htdig/PDF.cc        Tue Jul 14 10:21:38 1998
***************
*** 280,286 ****
          }
      }
!     else if (line == "BT")
      {
          // Beginning of text block
          if (debug > 3)
--- 280,286 ----
          }
      }
!     else if ( mystrncasecmp( line.get(), "BT", 2 ) == 0 )
      {
          // Beginning of text block
          if (debug > 3)

    The second problem is that the default value for the
    "bad_extensions" attribute contains .pdf, which causes all PDF files
    to be ignored by htdig, even if a parser is available. To correct
    this, you can either put a "bad_extensions" list without ".pdf" in
    your config file (this is what I did), or apply the following patch
    to htcommon/defaults.cc:

diff -c htcommon/defaults.cc.old htcommon/defaults.cc
*** htcommon/defaults.cc.old    Fri Aug 15 01:59:25 1997
--- htcommon/defaults.cc        Mon Jul 13 19:37:33 1998
***************
*** 37,43 ****
      {"add_anchors_to_excerpt", "true"},
      {"allow_numbers", "false"},
      {"allow_virtual_hosts", "true"},
!     {"bad_extensions", ".wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif .jpg .jpeg .aiff .pdf .class .map .ram"},
      {"bad_word_list", "${common_dir}/bad_words"},
      {"create_image_list", "false"},
      {"create_url_list", "false"},
--- 37,43 ----
      {"add_anchors_to_excerpt", "true"},
      {"allow_numbers", "false"},
      {"allow_virtual_hosts", "true"},
!     {"bad_extensions", ".wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif .jpg .jpeg .aiff .class .map .ram"},
      {"bad_word_list", "${common_dir}/bad_words"},
      {"create_image_list", "false"},
      {"create_url_list", "false"},

Thanks to M.J. Long for bug hunting.

Now you can run configure, make clean, make and make install.

Voila, PDF parsing!
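

If you would rather not patch htcommon/defaults.cc, the config-file
route mentioned above might look roughly like the fragment below. This
is only a sketch: it assumes your configuration file is the usual
htdig.conf, and it simply repeats the default list from defaults.cc
with ".pdf" left out. Adjust the list to match whatever your site
already excludes.

    # htdig.conf (assumed file name)
    # Default bad_extensions list from defaults.cc, minus .pdf
    bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif .jpg .jpeg .aiff .class .map .ram

A value set in the config file overrides the compiled-in default, so
with this in place the defaults.cc patch should not be needed.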