From grdetil@scrc.umanitoba.ca Mon Apr 9 18:04:56 2001
Date: Mon, 9 Apr 2001 17:29:35 -0500 (CDT)
From: Gilles Detillieux
To: "ht://Dig mailing list"
Subject: [htdig] PATCH: htdump/htload for 3.1.5

OK, I hinted at it last week, and worked on it a bit Friday and quite a
bit more today. The following patch introduces htdump/htload utilities
to the 3.1.5 version of htdig. To keep it easier to install (i.e. to
avoid messing with autoconf or the makefiles), I set it up as an
extension to the htdig program, selected by symbolic links to htdig (or
copies of it) with the names htdump and htload.

htdump will dump out an ASCII version of db.docdb into db.docs, and
htload will load in an ASCII version of the database from db.docs into
db.docdb. They don't do anything about the wordlist, because db.wordlist
is already in ASCII form, and they don't do anything about db.docs.index
and db.words.db because htmerge can regenerate these from db.docdb and
db.wordlist.

In the process, I also fixed the problem with META descriptions
containing newlines, returns or tabs (bug #405771), because fields in
the ASCII version of the database shouldn't contain any of these
characters. They are now replaced with spaces.

I also changed the output of htdig -t to be the same format as htdump,
as it is in 3.2.0b3, to get all the DocumentRef fields out. I also don't
sort the file because this is most likely unnecessary and could
potentially cause problems (this too is consistent with the changes in
3.2).

I added a -m option to htdig for compatibility with 3.2.0b3, because it
meshed nicely with the other changes I made to htdig.cc and String.cc.
Finally, I added a readLine() method to String.cc, and also fixed what
was reported to be a problem with the String '=' operator while I was in
there.

Please note: this doesn't mean you can now htdump a 3.1.5 database and
htload it into 3.2.0b3 format, nor vice-versa.
The reason is that the format and content of db.wordlist are very
different from the db.worddump file that htdump 3.2.0b3 produces. 3.2's
worddump has much more information about the words, including positions
of all words, including repeated ones. It wouldn't be possible to
convert a 3.1.5 db.wordlist into a db.worddump file for 3.2.0b3 and have
phrase searching work, because of the missing information, so you really
need to redig. However, it should be possible to write a filter that
would convert a db.worddump into a db.wordlist, converting the format
and mapping flags to the appropriate weight, so you can dig with 3.2 and
carry the db back to 3.1.5. I haven't written this filter, though, and I
don't plan to.

As always, you can apply this patch in the htdig-3.1.5 main source
directory using the command "patch -p0 < this-message-file".

--- htcommon/DocumentDB.cc.noload  Thu Feb 24 20:29:10 2000
+++ htcommon/DocumentDB.cc  Mon Apr 9 15:20:18 2001
@@ -3,7 +3,13 @@
 //
 // Implementation of DocumentDB
 //
+// $Id: DocumentDB.cc,v 1.11 1999/02/17 05:03:52 ghutchis Exp $
 //
+// Part of the ht://Dig package
+// Copyright (c) 1995-2001 The ht://Dig Group
+// For copyright details, see the file COPYING in your distribution
+// or the GNU Public License version 2 or later
+//
 //

 #include "DocumentDB.h"
@@ -183,35 +189,25 @@ int DocumentDB::Delete(char *u)


 //*****************************************************************************
-// int DocumentDB::CreateSearchDB(char *filename)
-//   Create an extract from our database which can be used by the
-//   search engine. The extract will consist of lines with fields
-//   separated by tabs. The fields are:
-//      docID
-//      docURL
-//      docTime
-//      docHead
-//      docMetaDsc
-//      descriptions (separated by tabs)
+// int DocumentDB::DumpDB(char *filename, int verbose)
+//   Create an extract from our database which can be used by an
+//   external application. The extract will consist of lines with fields
+//   separated by tabs.
 //
-//   The extract will be sorted by docID.
+//   The extract will likely not be sorted by anything in particular
 //
-int DocumentDB::CreateSearchDB(char *filename)
+int DocumentDB::DumpDB(char *filename, int verbose)
 {
     DocumentRef *ref;
     List *descriptions, *anchors;
     char *key;
     String data;
     FILE *fl;
-    String command = SORT_PROG;
-    String tmpdir = getenv("TMPDIR");

-    command << " -n -o" << filename;
-    if (tmpdir.length())
-    {
-        command << " -T " << tmpdir;
+    if((fl = fopen(filename, "w")) == 0) {
+        perror(form("DocumentDB::DumpDB: opening %s for writing", filename));
+        return NOTOK;
     }
-    fl = popen(command, "w");

     dbf->Start_Get();
     while ((key = dbf->Get_Next()))
@@ -227,11 +223,16 @@ int DocumentDB::CreateSearchDB(char *fil
             fprintf(fl, "\ta:%d", ref->DocState());
             fprintf(fl, "\tm:%d", (int) ref->DocTime());
             fprintf(fl, "\ts:%d", ref->DocSize());
-            fprintf(fl, "\th:%s", ref->DocHead());
+            fprintf(fl, "\tH:%s", ref->DocHead());
             fprintf(fl, "\th:%s", ref->DocMetaDsc());
             fprintf(fl, "\tl:%d", (int) ref->DocAccessed());
             fprintf(fl, "\tL:%d", ref->DocLinks());
-            fprintf(fl, "\tI:%d", ref->DocImageSize());
+            fprintf(fl, "\tb:%d", ref->DocBackLinks());
+            fprintf(fl, "\tc:%d", ref->DocHopCount());
+            fprintf(fl, "\tg:%d", ref->DocSig());
+            fprintf(fl, "\te:%s", ref->DocEmail());
+            fprintf(fl, "\tn:%s", ref->DocNotification());
+            fprintf(fl, "\tS:%s", ref->DocSubject());
             fprintf(fl, "\td:");
             descriptions = ref->Descriptions();
             String *description;
@@ -261,13 +262,129 @@ int DocumentDB::CreateSearchDB(char *fil
         }
     }

-    int sortRC = pclose(fl);
-    if (sortRC)
+    fclose(fl);
+
+    return OK;
+}
+
+//*****************************************************************************
+// int DocumentDB::LoadDB(char *filename, int verbose)
+//   Load an extract to our database from an ASCII file
+//   The extract will consist of lines with fields separated by tabs.
+//   The lines need not be sorted in any fashion.
+//
+int DocumentDB::LoadDB(char *filename, int verbose)
+{
+    FILE *input;
+    DocumentRef ref;
+    StringList descriptions, anchors;
+    char *token, field;
+    String data;
+
+    if((input = fopen(filename, "r")) == 0) {
+        perror(form("DocumentDB::LoadDB: opening %s for reading", filename));
+        return NOTOK;
+    }
+
+    while (data.readLine(input))
     {
-        cerr << "Document sort failed\n\n";
-        exit(1);
+        token = strtok(data, "\t");
+        if (token == NULL)
+            continue;
+
+        ref.DocID(atoi(token));
+
+        if (verbose)
+            cout << "\t loading document ID: " << ref.DocID() << endl;
+
+        while ( (token = strtok(0, "\t")) )
+        {
+            field = *token;
+            token += 2;
+
+            if (verbose > 2)
+                cout << "\t field: " << field;
+
+            switch(field)
+            {
+                case 'u': // URL
+                    ref.DocURL(token);
+                    break;
+                case 't': // Title
+                    ref.DocTitle(token);
+                    break;
+                case 'a': // State
+                    ref.DocState((ReferenceState)atoi(token));
+                    break;
+                case 'm': // Modified
+                    ref.DocTime(atoi(token));
+                    break;
+                case 's': // Size
+                    ref.DocSize(atoi(token));
+                    break;
+                case 'H': // Head
+                    ref.DocHead(token);
+                    break;
+                case 'h': // Meta Description
+                    ref.DocMetaDsc(token);
+                    break;
+                case 'l': // Accessed
+                    ref.DocAccessed(atoi(token));
+                    break;
+                case 'L': // Links
+                    ref.DocLinks(atoi(token));
+                    break;
+                case 'b': // BackLinks
+                    ref.DocBackLinks(atoi(token));
+                    break;
+                case 'c': // HopCount
+                    ref.DocHopCount(atoi(token));
+                    break;
+                case 'g': // Signature
+                    ref.DocSig(atoi(token));
+                    break;
+                case 'e': // E-mail
+                    ref.DocEmail(token);
+                    break;
+                case 'n': // Notification
+                    ref.DocNotification(token);
+                    break;
+                case 'S': // Subject
+                    ref.DocSubject(token);
+                    break;
+                case 'd': // Descriptions
+                    descriptions.Create(token, '\001');
+                    ref.Descriptions(descriptions);
+                    break;
+                case 'A': // Anchors
+                    anchors.Create(token, '\001');
+                    ref.DocAnchors(anchors);
+                    break;
+                default:
+                    break;
+            }
+
+        }
+
+
+        // We must be careful if the document already exists
+        // So we'll delete the old document and add the new one
+        if (Exists(ref.DocURL()))
+        {
+            Delete(ref.DocURL());
+        }
+        Add(ref);
+
+        // If we add a record with an ID past nextDocID, update it
+        if (ref.DocID() > nextDocID)
+            nextDocID = ref.DocID() + 1;
+
+        descriptions.Destroy();
+        anchors.Destroy();
     }
-    return 0;
+
+    fclose(input);
+    return OK;
 }

--- htcommon/DocumentDB.h.noload  Thu Feb 24 20:29:10 2000
+++ htcommon/DocumentDB.h  Mon Apr 9 14:00:15 2001
@@ -8,31 +8,18 @@
 //
 // $Id: DocumentDB.h,v 1.5 1999/01/25 01:53:42 hp Exp $
 //
-// $Log: DocumentDB.h,v $
-// Revision 1.5 1999/01/25 01:53:42 hp
-// Provide a clean upgrade from old databses without "url_part_aliases" and
-// "common_url_parts" through the new option "uncoded_db_compatible".
-//
-// Revision 1.4 1999/01/14 01:09:11 ghutchis
-// Small speed improvements based on gprof.
-//
-// Revision 1.3 1999/01/14 00:30:10 ghutchis
-// Added IncNextDocID to allow big changes in NextDocID, such as when merging
-// databases.
-//
-// Revision 1.2 1998/01/05 00:47:27 turtle
-// reformatting
-//
-// Revision 1.1.1.1 1997/02/03 17:11:07 turtle
-// Initial CVS
-//
+// Part of the ht://Dig package
+// Copyright (c) 1995-2001 The ht://Dig Group
+// For copyright details, see the file COPYING in your distribution
+// or the GNU Public License version 2 or later
+//
 //
 #ifndef _DocumentDB_h_
 #define _DocumentDB_h_

 #include "DocumentRef.h"
-#include
-#include
+#include "List.h"
+#include "Database.h"


 class DocumentDB
@@ -45,11 +32,6 @@ public:
     ~DocumentDB();

     //
-    // The database used for searching is generated from our internal database:
-    //
-    int CreateSearchDB(char *filename);
-
-    //
     // Standard database operations
     //
     int Open(char *filename);
@@ -75,6 +57,13 @@ public:
     // We will need to be able to iterate over the complete database.
     //
     List *URLs(); // This returns a list of all the URLs
+
+    // Dump the database out to an ASCII text file
+    int DumpDB(char *filename, int verbose = 0);
+
+    // Read in the database from an ASCII text file
+    // (created by DumpDB)
+    int LoadDB(char *filename, int verbose = 0);

     //
     // Set compatibility mode (try to support when database
--- htdig/htdig.cc.noload  Thu Feb 24 20:29:10 2000
+++ htdig/htdig.cc  Mon Apr 9 15:53:03 2001
@@ -4,7 +4,13 @@
 // Indexes the web sites specified in the config file
 // generating several databases to be used by htmerge
 //
+// $Id: htdig.cc,v 1.3.2.6 1999/12/06 21:06:01 grdetil Exp $
 //
+// Part of the ht://Dig package
+// Copyright (c) 1995-2001 The ht://Dig Group
+// For copyright details, see the file COPYING in your distribution
+// or the GNU Public License version 2 or later
+//
 //

 #include "Document.h"
@@ -33,6 +39,7 @@ StringMatch badquerystr;
 FILE *urls_seen = NULL;
 FILE *images_seen = NULL;
 String configFile = DEFAULT_CONFIG_FILE;
+String minimalFile = 0;

 void usage();
 void reportError(char *msg);
@@ -49,13 +56,32 @@ main(int ac, char **av)
     int initial = 0;
     int alt_work_area = 0;
     int create_text_database = 0;
+    int create_text_database_only = 0;
+    int load_text_database = 0;
+    char *arg0, *s;
     char *max_hops = 0;
     RetrieverLog flag = Retriever_noLog;
+
+    // Find argument 0 basename, to see who we're called as
+    arg0 = av[0];
+    s = strrchr(arg0, '/');
+    if (s != NULL)
+        arg0 = s+1;
+    // For Cygwin on Win32 systems...
+    s = strrchr(arg0, '\\');
+    if (s != NULL)
+        arg0 = s+1;
+
+    // Select function based on argument 0
+    if (mystrncasecmp(arg0, "htdump", 6) == 0)
+        create_text_database_only = create_text_database = 1;
+    else if (mystrncasecmp(arg0, "htload", 6) == 0)
+        load_text_database = 1;

     //
     // Parse command line arguments
     //
-    while ((c = getopt(ac, av, "lsc:vith:u:a")) != -1)
+    while ((c = getopt(ac, av, "lsm:c:vith:u:a")) != -1)
     {
         int pos;
         switch (c)
@@ -67,6 +93,11 @@ main(int ac, char **av)
                 debug++;
                 break;
             case 'i':
+                if (create_text_database_only)
+                {
+                    cerr << "htdump: -i option not allowed for dumping\n";
+                    break;
+                }
                 initial++;
                 break;
             case 't':
@@ -86,6 +117,10 @@ main(int ac, char **av)
             case 'a':
                 alt_work_area++;
                 break;
+            case 'm':
+                minimalFile = optarg;
+                max_hops = "0";
+                break;
             case 'l':
                 flag = Retriever_logUrl;
                 break;
@@ -219,13 +254,21 @@ main(int ac, char **av)
     String filename = config["doc_db"];
     if (initial)
         unlink(filename);
-    if (docs.Open(filename) < 0)
+    if (create_text_database_only)
+    {
+        if (docs.Read(filename) < 0)
+        {
+            reportError(form("Unable to open document database '%s'",
+                             filename.get()));
+        }
+    }
+    else if (docs.Open(filename) < 0)
     {
         reportError(form("Unable to open/create document database '%s'",
                          filename.get()));
     }

-    if (initial)
+    if (initial && !load_text_database)
     {
         filename = config["word_list"];
         unlink(filename);
@@ -238,20 +281,54 @@ main(int ac, char **av)
     // URLs?
     //
     Retriever retriever(flag);
-    List *list = docs.URLs();
-    retriever.Initial(*list);
-    delete list;
-
-    // Add start_url to the initial list of the retriever.
-    // Don't check a URL twice!
-    // Beware order is important, if this bugs you could change
-    // previous line retriever.Initial(*list, 0) to Initial(*list,1)
-    retriever.Initial(config["start_url"], 1);
+    if (minimalFile.length() == 0)
+    {
+        List *list = docs.URLs();
+        retriever.Initial(*list);
+        delete list;
+
+        // Add start_url to the initial list of the retriever.
+        // Don't check a URL twice!
+        // Beware order is important, if this bugs you could change
+        // previous line retriever.Initial(*list, 0) to Initial(*list,1)
+        retriever.Initial(config["start_url"], 1);
+    }
+
+    // Handle list of URLs given as minimal file (-m file), or on given
+    // given file name (stdin, if optional "-" argument given).
+    if (minimalFile.length() != 0 || optind < ac)
+    {
+        FILE *input;
+        String str;
+        if (minimalFile.length() != 0)
+        {
+            if (strcmp(minimalFile.get(), "-") == 0)
+                input = stdin;
+            else
+                input = fopen(minimalFile.get(), "r");
+        }
+        else if (strcmp(av[optind], "-") == 0)
+            input = stdin;
+        else
+            input = fopen(av[optind], "r");
+        if (input)
+        {
+            while (str.readLine(input))
+            {
+                str.chop("\r\n\t ");
+                if (str.length() > 0)
+                    retriever.Initial(str, 1);
+            }
+            if (input != stdin)
+                fclose(input);
+        }
+    }

     //
     // Go do it!
     //
-    retriever.Start();
+    if (!create_text_database_only && !load_text_database)
+        retriever.Start();

     //
     // All done with parsing.
@@ -265,7 +342,16 @@ main(int ac, char **av)
         filename = config["doc_list"];
         if (initial)
             unlink(filename);
-        docs.CreateSearchDB(filename);
+        docs.DumpDB(filename, debug);
+    }
+
+    //
+    // For htload, read in a text version of the document database.
+    //
+    if (load_text_database)
+    {
+        filename = config["doc_list"];
+        docs.LoadDB(filename, debug);
     }

     //
@@ -291,7 +377,8 @@ main(int ac, char **av)
 //
 void usage()
 {
-    cout << "usage: htdig [-l][-v][-i][-c configfile][-t]\n";
+    cout << "usage: htdig [-v][-i][-c configfile][-t][-h hopcount][-s] \\\n";
+    cout << " [-u username:password][-a][-l][-m minimalfile][file]\n";
     cout << "This program is part of ht://Dig " << VERSION << "\n\n";
     cout << "Options:\n";

@@ -334,6 +421,17 @@ void usage()
     cout << "\t\tReads in the progress of any previous interrupted digs\n";
     cout << "\t\tfrom the log file and write the progress out if\n";
     cout << "\t\tinterrupted by a signal.\n\n";
+
+    cout << "\t-m minimalfile (or just a file name at end of arguments)\n";
+    cout << "\t\tTells htdig to read URLs from the supplied file and index\n";
+    cout << "\t\tthem in place of (or in addition to) the existing URLs in\n";
+    cout << "\t\tthe database and the start_url. With the -m, only the\n";
+    cout << "\t\tURLs specified are added to the database. A file name of\n";
+    cout << "\t\t'-' indicates the standard input.\n\n";
+
+    cout << "or usage: htdump [-v][-c configfile][-a]\n";
+    cout << "or usage: htload [-v][-i][-c configfile][-a]\n";
+    cout << "\t\tto dump/load docdb to/from ASCII text database.\n\n";
     exit(0);
 }

--- htdig/HTML.cc.noload  Sat May 13 21:40:10 2000
+++ htdig/HTML.cc  Mon Apr 9 16:17:09 2001
@@ -849,9 +849,13 @@ HTML::do_tag(Retriever &retriever, Strin
             {
                 //
                 // We need to do two things. First grab the description
+                // and clean it up
                 //
                 meta_dsc = transSGML(conf["content"]);
-                if (meta_dsc.length() > max_meta_description_length)
+                meta_dsc.replace('\n', ' ');
+                meta_dsc.replace('\r', ' ');
+                meta_dsc.replace('\t', ' ');
+                if (meta_dsc.length() > max_meta_description_length)
                     meta_dsc = meta_dsc.sub(0, max_meta_description_length).get();
                 if (debug > 1)
                     cout << "META Description: " << conf["content"] << endl;
--- htdoc/htdig.html.noload  Thu Feb 24 20:29:10 2000
+++ htdoc/htdig.html  Mon Apr 9 17:09:43 2001
@@ -10,7 +10,7 @@
 htdig

- ht://Dig Copyright © 1995-2000 The ht://Dig Group
+ ht://Dig Copyright © 1995-2001 The ht://Dig Group
Please see the file COPYING for license information.

@@ -89,6 +89,14 @@
 progress out if interrupted by a signal.
+ -m [url_file]
+
+
+ Minimal. Only index the URLs in the file provided and
+ no others. The url_file can be a "-", causing htdig
+ to read the URLs from the STDIN.
+
+
 -s
@@ -103,6 +111,42 @@
 information can be extracted from it for purposes
 other than searching. One could gather some
 interesting statistics from this database.
+
+ Each line in the file starts with the document id
+ followed by a list of
+ \tfieldname:value.
+ The fields always appear in the order listed below:
+
+ fieldname  value
+ u          URL
+ t          Title
+ a          State (0 = normal, 1 = not found, 2
+            = not indexed, 3 = obsolete)
+ m          Last modification time as reported
+            by the server
+ s          Size in bytes
+ H          Excerpt
+ h          Meta description
+ l          Time of last retrieval
+ L          Count of the links in the document
+            (outgoing links)
+ b          Count of the links to the document
+            (incoming links or backlinks)
+ c          HopCount of this document
+ g          Signature of the document used for
+            duplicate-detection
+ e          E-mail address to use for a
+            notification message from htnotify
+ n          Date to send out a notification
+            e-mail message
+ S          Subject for a notification e-mail
+            message
+ d          The text of links pointing to this
+            document. (e.g. <a
+            href="docURL">description</a>)
+ A          Anchors in the document (i.e. <A
+            NAME=...)
 -u username:password
@@ -122,7 +166,35 @@
 program. Using more than 2 is probably only useful
 for debugging purposes. The default verbose mode
 (using only one -v) gives a nice progress report while
- digging.
+ digging. This progress report can be a bit
+ cryptic, so here is a brief explanation. A line
+ is shown for each URL, with 3 numbers before the
+ URL and some symbols after the URL. The first
+ number is the number of documents parsed so
+ far, the second is the DocID for this document,
+ and the third is the hop count of the document
+ (number of hops from one of the start_url
+ documents). After the URL, it shows a "*" for
+ a link in the document that it already visited,
+ a "+" for a new link it just queued, and a "-"
+ for a link it rejected for any of a number of
+ reasons. To find out what those reasons are,
+ you need to run htdig with at least 3 -v options,
+ i.e. -vvv. If there are no "*", "+" or "-" symbols
+ after the URL, it doesn't mean the document was
+ not parsed or was empty, but only that no links
+ to other documents were found within it. With
+ more verbose output, these symbols will get
+ interspersed in several lines of debugging output.
+
+
+ url_file (at end of arguments, after options)
+
+
+ Get the list URLs to start indexing from the file
+ provided. This will override the default start_url.
+ The url_file can be a "-", causing htdig to read
+ the URLs from the STDIN.
@@ -159,11 +231,8 @@
-
- Andrew Scherpbier <andrew@contigo.com> -
-Last modified: $Date: 2000/02/17 22:05:21 $
+ Last modified: $Date: 2001/04/09 17:09:37 $
--- htlib/String.cc.noload  Thu Feb 24 20:29:11 2000
+++ htlib/String.cc  Mon Apr 9 14:05:07 2001
@@ -3,6 +3,12 @@
 //
 // $Id: String.cc,v 1.16.2.3 1999/11/26 21:59:26 grdetil Exp $
 //
+// Part of the ht://Dig package
+// Copyright (c) 1995-2001 The ht://Dig Group
+// For copyright details, see the file COPYING in your distribution
+// or the GNU Public License version 2 or later
+//
+//
 #if RELEASE
 static char RCSid[] = "$Id: String.cc,v 1.16.2.3 1999/11/26 21:59:26 grdetil Exp $";
 #endif
@@ -91,9 +97,16 @@ String::~String()

 void String::operator = (const String &s)
 {
-    allocate_space(s.length());
-    Length = s.length();
-    copy_data_from(s.Data, Length);
+    if (s.length() > 0)
+    {
+        allocate_space(s.length());
+        Length = s.length();
+        copy_data_from(s.Data, Length);
+    }
+    else
+    {
+        Length = 0;
+    }
 }

 void String::operator = (char *s)
@@ -622,3 +635,38 @@ void String::debug(ostream &o)
 }


+int String::readLine(FILE *in)
+{
+    Length = 0;
+    allocate_fix_space(2048);
+
+    while (fgets(Data + Length, Allocated - Length, in))
+    {
+        Length += strlen(Data + Length);
+        if (Length == 0)
+            continue;
+        if (Data[Length - 1] == '\n')
+        {
+            //
+            // A full line has been read. Return it.
+            //
+            chop('\n');
+            return 1;
+        }
+        if (Allocated > Length + 1)
+        {
+            //
+            // Not all available space filled. Probably EOF?
+            //
+            continue;
+        }
+        //
+        // Only a partial line was read. Increase available space in
+        // string and read some more.
+        //
+        reallocate_space(Allocated << 1);
+    }
+    chop('\n');
+
+    return Length > 0;
+}
--- htlib/htString.h.noload  Thu Feb 24 20:29:11 2000
+++ htlib/htString.h  Mon Apr 9 15:14:48 2001
@@ -3,11 +3,18 @@
 //
 // $Id: htString.h,v 1.5 1999/02/01 04:02:25 hp Exp $
 //
+// Part of the ht://Dig package
+// Copyright (c) 1995-2001 The ht://Dig Group
+// For copyright details, see the file COPYING in your distribution
+// or the GNU Public License version 2 or later
+//
+//
 #ifndef __String_h
 #define __String_h

 #include "Object.h"
 #include
+#include

 class ostream;

@@ -138,6 +145,8 @@ public:
     friend int operator >= (String &a, String &b);

     friend ostream &operator << (ostream &o, String &s);
+
+    int readLine(FILE *in);

     void lowercase();
     void uppercase();

--
Gilles R. Detillieux              E-mail:
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list
To unsubscribe, send a message to with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
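
For anyone who wants to post-process db.docs outside of htload, the line
format described in the patch (a document ID followed by tab-separated
letter:value fields) could be read with something like the following.
This is only a rough sketch in standard C++, not code from the patch:
DumpRecord, splitOn and parseDumpLine are made-up names, and unlike
LoadDB's strtok() loop it keeps empty fields and skips any column that
does not have a colon in the second position.

```cpp
#include <cstdlib>
#include <map>
#include <string>
#include <vector>

// Hypothetical record type for illustration; not part of ht://Dig.
struct DumpRecord {
    int docID = 0;
    std::map<char, std::string> fields;  // key: one-letter field name
};

// Split `s` on the single character `sep`, keeping empty pieces.
static std::vector<std::string> splitOn(const std::string &s, char sep)
{
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    for (;;) {
        std::string::size_type pos = s.find(sep, start);
        if (pos == std::string::npos) {
            parts.push_back(s.substr(start));
            return parts;
        }
        parts.push_back(s.substr(start, pos - start));
        start = pos + 1;
    }
}

// Parse one line of the form "docID\tx:value\ty:value...".
// Returns false if the line has no leading document ID column.
static bool parseDumpLine(const std::string &line, DumpRecord &rec)
{
    std::vector<std::string> cols = splitOn(line, '\t');
    if (cols.empty() || cols[0].empty())
        return false;

    rec.docID = std::atoi(cols[0].c_str());
    rec.fields.clear();
    for (std::size_t i = 1; i < cols.size(); ++i) {
        const std::string &col = cols[i];
        // Expect a one-letter name, a colon, then the value.
        if (col.size() < 2 || col[1] != ':')
            continue;
        rec.fields[col[0]] = col.substr(2);
    }
    return true;
}
```

The d: and A: values can then be split again with splitOn(value, '\001'),
since '\001' is the separator LoadDB passes to StringList::Create for
descriptions and anchors.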