
From rick@digi.com Sun Nov  7 10:35:36 1999
Date: Sat, 6 Nov 1999 23:15:12 -0600
From: Rick Richardson <rick@digi.com>
To: htdig3-dev@htdig.org
Cc: Rick Richardson <rick@digi.com>, htdig3-dev@htdig.org
Subject: Re: [htdig3-dev] Portable indexing


Thanks to all who helped with suggestions.

Here is version 1 of my solution.  The short shell script
that does all the work is an attachment.

-Rick

For a long time I've wanted to turn my vast email archives into HTML
and index *every single word* in those archives.  Then, I wanted to
blast those archives onto CD-ROMs for permanent archival storage.

I have been eyeing a very nice, free indexing package called "htdig",
as the search engine of choice for this purpose (http://www.htdig.org).
It will index every word in a collection of text and HTML files.

The problem with htdig is that the conventional usage is to index an
entire web site on a single machine into a single database.

I wanted to index several collections independently, and wanted to be
able to easily move those collections and their indexes between
machines and onto CD-ROM without having to do a lot of work to
"install" the database onto each machine.

I worked out a shell script to do what I wanted to do.  I have
attached said shell script "digdir".

As a proof of concept, I decided to use the latest copy of the
Internet RFC collection as a test.

I started with the 2700 or so RFCs in text form.  I stored these in the
directory /home/httpd/html/rfc.

I then ran the "digdir" shell script like this:

	$ cd /home/httpd/html
	$ digdir rfc

After about 5 minutes, the shell script finishes the indexing process.
It adds a number of new files under /home/httpd/html/rfc, but does not
add or modify any other files on the computer.  These new files
include a "search.html" search form used for submitting queries, and
the indexed database generated by htdig.
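
One way to confirm what digdir left behind is to check for the files it
creates.  The following is a minimal sketch, assuming the layout written
by version 1 of the attached script (search.html at the top level, the
rest under .htdig).  The function name check_digdir_layout is made up for
illustration and is not part of htdig or digdir.

```shell
#!/bin/sh
# Hypothetical helper: report whether a directory looks like digdir output.
# The file and directory names are taken from the attached script (version 1).
check_digdir_layout() {
	dir="$1"
	status=0
	# Files written by digdir
	for f in search.html .htdig/htdig.conf .htdig/htsearch.conf
	do
		[ -f "$dir/$f" ] || { echo "missing file: $f"; status=1; }
	done
	# Directories created by digdir
	for d in .htdig/db .htdig/common
	do
		[ -d "$dir/$d" ] || { echo "missing directory: $d"; status=1; }
	done
	return $status
}
```

Typical use would be something like
"check_digdir_layout rfc && echo indexed" from /home/httpd/html.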

It is now possible to blast this entire directory onto CD-R.  You could
then mount this CD-ROM on another machine of like architecture under
/home/httpd/html and it would work (assuming you have previously
installed the stock htdig RPM package).
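
The mount point matters because the attached script writes absolute paths
(database_dir, common_dir) into the config files.  The sketch below is one
way to check, on the receiving machine, that the recorded database_dir
still matches where the directory actually sits.  The function name
check_digdir_mount is hypothetical, and the sketch assumes the
.htdig/htsearch.conf layout produced by version 1 of the script.

```shell
#!/bin/sh
# Hypothetical helper: compare the database_dir recorded in
# .htdig/htsearch.conf with the directory's current location.
check_digdir_mount() {
	dir="$1"
	conf="$dir/.htdig/htsearch.conf"
	[ -f "$conf" ] || { echo "no $conf"; return 2; }
	# digdir appends its settings to a copy of the stock config,
	# so the last database_dir line is the one it wrote.
	recorded=$(sed -n 's/^[[:space:]]*database_dir:[[:space:]]*//p' "$conf" | tail -1)
	actual=$(cd "$dir" && pwd)/.htdig/db
	if [ "$recorded" = "$actual" ]
	then
		echo "ok: $actual"
	else
		echo "mismatch: config says $recorded, directory is at $actual"
		return 1
	fi
}
```

A mismatch means the directory was mounted or copied somewhere other than
where it was indexed; rerunning digdir in the new location rebuilds the
config files and the database.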

To see the results, open this URL (will work only on the Digi intranet):

	http://digifax.digi.com/rfc/search.html

In the search form, type "url" or anything else you'd like to search for.

Enjoy.

-Rick

-- 
Rick Richardson  rick@digi.com   http://RickRichardson.freeservers.com/

My current CI is 28.  I'm 41.  I need 14 more cylinders by my next
birthday.  Two PWC's and an SUV ought to do it.  That's my new goal.

  [ Part 2: "Attached Text" ]

#!/bin/sh

#
# digdir
#
#	Creates a standalone HTDIG searchable index of a directory of web or
#	text documents that can be put onto CD-ROM or traded with other
#	machines.  If put on a CD-ROM, the CD-ROM can be mounted directly
#	under /home/httpd/html with no further configuration required.
#
# Usage:
#
#	$ cd /home/httpd/html
#	$ mkdir documents
#	$ [fill "documents" directory with text or HTML files]
#	$ digdir documents
#
#	At this point, the directory "documents" contains the
#	original documents themselves, as well as a search.html
#	search form, and all "htdig" config and database files
#	(hidden under documents/.htdig).  This directory can
#	be moved to any other machine.
#
#	NOTE: as of htdig 3.1.x, the HTDIG database is architecture
#	dependent, so the database can only be moved to machines
#	of like architecture.  This is a design flaw in htdig.
#
# Requirements:
#	On the machine which is used to initially generate the index,
#	HTDIG 3.1.x must be installed.  Get it from http://www.htdig.org/
#
#	On the machine which is used to search the database, only
#	/home/httpd/cgi-bin/htsearch (from HTDIG 3.1.x) must exist.
#
# Author:
#	Rick Richardson, rick@digi.com, November 1999.
#
#	This software is donated to the PUBLIC DOMAIN and may be used for
#	any purpose without restriction.  No warranties expressed or
#	implied.  Your mileage may vary.
#

error() {
	echo "digdir error: $*" >&2
	exit 1
}

DIR="$1"
PATH=$PATH:/usr/sbin

[ -d "$DIR" ] || error "directory name missing or nonexistent"

case "$DIR" in
/*) error "directory name must be relative to current directory";;
esac

cd "$DIR" || error "can't chdir to $DIR"

HERE=`pwd`

#
#	Remove the old database and copy in a fresh set of HTDIG
#	distribution files.
#
rm -f search.html
rm -rf .htdig
mkdir .htdig || error "can't make directory $HERE/.htdig"

(
	cd /var/lib/htdig && find common ! -name 'db.*' ! -name '*.db' |
		cpio -pudm "$HERE/.htdig"
)

mkdir .htdig/db || error "can't make directory $HERE/.htdig/db"

#
#	Make a copy of the matching htsearch binary, so that
#	somebody who gets a copy of this index and doesn't
#	have a matching htsearch binary handy can just grab
#	it from here and stash it in cgi-bin.  Also copy this
#	shell script in case somebody wants to regen the index.
#
cp -a /home/httpd/cgi-bin/htsearch "$0" .htdig/ ||
	error "can't find htsearch binary"

#
#	Create two config files, one for htdig and one for htsearch
#
#	Using two config files allows us to eliminate any appearance
#	of an absolute URL (one with a domain name, even localhost)
#	in the results, thus making the database portable.
#
#	We convert the output URL to ../$DIR because the browser's
#	idea of the current directory will be cgi-bin.
#
DCONF=$HERE/.htdig/htdig.conf
SCONF=$HERE/.htdig/htsearch.conf
cp /etc/htdig/htdig.conf "$DCONF"
cp /etc/htdig/htdig.conf "$SCONF"

cat <<-EOF >> "$DCONF"
	database_dir:		$HERE/.htdig/db
	common_dir:		$HERE/.htdig/common
	start_url:		http://localhost/$DIR/
	local_urls:		http://localhost/$DIR/=/home/httpd/html/$DIR/
	local_user_urls:	http:/=/home/,/public_html/
	url_part_aliases:       http://localhost/$DIR *$DIR
EOF

cat <<-EOF >> "$SCONF"
	database_dir:		$HERE/.htdig/db
	common_dir:		$HERE/.htdig/common
	start_url:		http://localhost/$DIR/
	local_urls:		http://localhost/$DIR/=/home/httpd/html/$DIR/
	local_user_urls:	http:/=/home/,/public_html/
	url_part_aliases:       http:../$DIR *$DIR
EOF

#
#	Generate the database using HTDIG
#
htdig -v -c "$DCONF" -i
htmerge -c "$DCONF"
htnotify -c "$DCONF"
htfuzzy -c "$DCONF" endings
htfuzzy -c "$DCONF" synonyms

#
#	Create the initial search page
#
CGI="http:/cgi-bin/htsearch?-c$SCONF"

cat <<-EOF > search.html
	<html>
	<head>
	<title>ht://Dig WWW Search of $DIR</title>
	</head>
	<body bgcolor="#eef7ff">
	<h1>
	<a href="http://www.htdig.org">
	<IMG SRC="/htdig/htdig.gif" align=bottom alt="ht://Dig" border=0></a>
	WWW Site Search</H1>
	<hr noshade size=4>
	This search will allow you to search the contents of
	all documents under this directory.
	<br>
	<p>
	<form method="post" action="$CGI">
	<font size=-1>
	Match: <select name=method>
	<option value=and>All
	<option value=or>Any
	</select>
	Format: <select name=format>
	<option value=builtin-long>Long
	<option value=builtin-short>Short
	</select>
	Sort by: <select name=sort>
	<option value=score>Score
	<option value=time>Time
	<option value=title>Title
	<option value=revscore>Reverse Score
	<option value=revtime>Reverse Time
	<option value=revtitle>Reverse Title
	</select>
	</font>
	<input type=hidden name=config value=htdig>
	<input type=hidden name=restrict value="">
	<input type=hidden name=exclude value="">
	<br>
	Search:
	<input type="text" size="30" name="words" value="">
	<input type="submit" value="Search">
	</form>
	<hr noshade size=4>
	</body>
	</html>
EOF

#
#	Fix up the templates that create the refine page, etc.
#
#	Change these from method GET to POST so that the
#	-c$SCONF option will work.
#
for i in header nomatch syntax wrapper
do
	ex .htdig/common/$i.html <<-EOF
		g#.(CGI)#s##$CGI#
		g#method=.get.#s##method="post"#
		w
		q
	EOF
done
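
The two-config trick described in the script's comments can be tried
outside htdig.  The sketch below mimics it with sed, assuming
url_part_aliases behaves as the htdig documentation describes: the first
string of the pair is rewritten to the second when a URL is stored, and
the second back to the first when it is read.  The functions encode_url
and decode_url and the sample file name rfc822.txt are illustrative only;
htdig and htsearch do the real work internally and do not anchor the
match the way this simplified version does.

```shell
#!/bin/sh
# Illustration only: mimic the url_part_aliases round trip with sed.
DIR=rfc

# Mirrors the pair in htdig.conf:  http://localhost/$DIR  *$DIR
encode_url() {
	echo "$1" | sed "s#^http://localhost/$DIR#*$DIR#"
}

# Mirrors the pair in htsearch.conf:  http:../$DIR  *$DIR
decode_url() {
	echo "$1" | sed "s#^\\*$DIR#http:../$DIR#"
}

# What htdig would store, then what htsearch would show
stored=$(encode_url "http://localhost/rfc/rfc822.txt")
shown=$(decode_url "$stored")
echo "$stored"
echo "$shown"
```

Run as written, this prints "*rfc/rfc822.txt" followed by
"http:../rfc/rfc822.txt".  No host name survives into the displayed URL,
which is why the results do not tie the index to one host name.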

