
From wjones@tc.fluke.com Wed Jan 12 15:47:49 2000
Date: Wed, 12 Jan 2000 14:53:07 -0800
From: Warren Jones <wjones@tc.fluke.com>
To: htdig3-dev@htdig.org
Subject: [htdig3-dev] Patches for conv_doc.pl and parse_doc.pl

Here are patches for conv_doc.pl and parse_doc.pl.

In both scripts, I've changed "break" to "last".  Perl doesn't
have a break statement.  (The -w flag would have caught this.)
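
As a minimal sketch of what "last" does in these loops (the input
lines and variable names below are made up for illustration and are
not taken from either script):

```perl
use strict;
use warnings;

# Hypothetical lines standing in for what the scripts read from INFO.
my @lines = (
    "Author: someone\n",
    "Title: A <sample> title\n",
    "Subject: never examined\n",
);

my $title = "";
my $seen  = 0;
for (@lines) {
    $seen++;
    if (/^Title:\s*(.*)/) {
        $title = $1;
        $title =~ s/&/&amp;/g;    # escape HTML metacharacters
        $title =~ s/</&lt;/g;
        $title =~ s/>/&gt;/g;
        last;                     # leave the loop once the title is found
    }
}
print "$title ($seen lines read)\n";
```

The loop stops at the second line, so the third is never read.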

In parse_doc.pl, I replaced the code for parsing a line into
a list of words with a much simpler expression using "split".
The intent of the original code was hard to grasp, but it was
spitting out "words" that included multiple punctuation
characters, and filling up my word database with gibberish.
I also streamlined parse_doc.pl in a few other places,
without (I hope) changing the output.  I made the changes
to parse_doc.pl before I realized that it had been mostly
superseded by conv_doc.pl, but I'm including the patch anyway,
for whatever it's worth.
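
A rough sketch of the "split" approach on a single line of text. The
sample sentence and the minimum length of 3 are assumptions for
illustration only; the real script applies this to each line it reads:

```perl
use strict;
use warnings;

my $minimum_word_length = 3;    # assumed value, for illustration only
my @allwords;

my $line = "Hello, world!  (It's a well-known test...)\n";

# Delete the default valid_punctuation characters so that words
# such as "well-known" are joined rather than split apart.
$line =~ tr{-\255._/!#$%^&'}{}d;

# Split on runs of non-word characters and keep words that are long enough.
push @allwords, grep { length($_) >= $minimum_word_length } split /\W+/, $line;

print join(" ", @allwords), "\n";
```

The remaining punctuation acts only as a separator, so no "word" can
contain it, and the one-letter "a" is dropped by the length filter.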

-- 
Warren Jones

---------------------------- snip snip ---------------------------- 

Index: conv_doc.pl
===================================================================
RCS file: /home/wjones/src/CVS.repo/htdig/contrib/conv_doc.pl,v
retrieving revision 1.1.1.1
diff -c -r1.1.1.1 conv_doc.pl
*** conv_doc.pl	1999/12/15 22:03:07	1.1.1.1
--- conv_doc.pl	2000/01/12 22:02:43
***************
*** 131,137 ****
                  s/</\&lt\;/g;
                  s/>/\&gt\;/g;
                  $title = $_;
!                 break;
              }
          }
          close INFO;
--- 131,137 ----
                  s/</\&lt\;/g;
                  s/>/\&gt\;/g;
                  $title = $_;
!                 last;
              }
          }
          close INFO;
***************
*** 190,196 ****
  open(CAT, "$cvtcmd |") || die "$cvtr doesn't want to be opened using pipe.\n";
  while (<CAT>) {
      while (/[A-Za-z\300-\377]-\s*$/ && $dehyphenate) {
!         $_ .= <CAT> || break;
          s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/
      }
      s/[\255]/-/g;                       # replace dashes with hyphens
--- 190,196 ----
  open(CAT, "$cvtcmd |") || die "$cvtr doesn't want to be opened using pipe.\n";
  while (<CAT>) {
      while (/[A-Za-z\300-\377]-\s*$/ && $dehyphenate) {
!         $_ .= <CAT> || last;
          s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/
      }
      s/[\255]/-/g;                       # replace dashes with hyphens


Index: parse_doc.pl
===================================================================
RCS file: /home/wjones/src/CVS.repo/htdig/contrib/parse_doc.pl,v
retrieving revision 1.1.1.1
diff -c -r1.1.1.1 parse_doc.pl
*** parse_doc.pl	1999/11/29 20:02:20	1.1.1.1
--- parse_doc.pl	2000/01/12 21:48:18
***************
*** 70,76 ****
  @allwords = ();
  @temp = ();
  $x = 0;
- @fields = ();
  $calc = 0;
  $dehyphenate = 0;
  $title = "";
--- 70,75 ----
***************
*** 122,128 ****
                                  $title =~ s/&/\&amp\;/g;
                                  $title =~ s/</\&lt\;/g;
                                  $title =~ s/>/\&gt\;/g;
!                                 break;
                          }
                  }
                  close INFO;
--- 121,127 ----
                                  $title =~ s/&/\&amp\;/g;
                                  $title =~ s/</\&lt\;/g;
                                  $title =~ s/>/\&gt\;/g;
!                                 last;
                          }
                  }
                  close INFO;
***************
*** 153,174 ****
  open(CAT, "$parsecmd") || die "Hmmm. $parser doesn't want to be opened using pipe.\n";
  while (<CAT>) {
          while (/[A-Za-z\300-\377]-\s*$/ && $dehyphenate) {
!                 $_ .= <CAT> || break;
                  s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/
          }
          $head .= " " . $_;
!         s/\s+[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+\s+|^[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+$/ /g;    # replace reading-chars with space (only at end or begin of word, but allow multiple characters)
! #       s/\s[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]\s|^[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]$/ /g;    # replace reading-chars with space (only at end or begin of word)
! #       s/[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]/ /g;      # rigorously replace all by <carl@dpiwe.tas.gov.au>
! #       s/[\-\255]/ /g;                                 # replace hyphens with space
!         s/[\255]/-/g;                                   # replace dashes with hyphens
!         @fields = split;                                # split up line
!         next if (@fields == 0);                         # skip if no fields (does it speed up?)
!         for ($x=0; $x<@fields; $x++) {                  # check each field if string length >= 3
!                 if (length($fields[$x]) >= $minimum_word_length) {
!                         push @allwords, $fields[$x];    # add to list
!                 }
!         }
  }
  
  close CAT;
--- 152,166 ----
  open(CAT, "$parsecmd") || die "Hmmm. $parser doesn't want to be opened using pipe.\n";
  while (<CAT>) {
          while (/[A-Za-z\300-\377]-\s*$/ && $dehyphenate) {
!                 $_ .= <CAT> || last;
                  s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/
          }
          $head .= " " . $_;
! 	# Delete valid punctuation.  These are the default values
! 	# for valid_punctuation, and should be changed if other values
! 	# are specified in the config file.
! 	tr{-\255._/!#$%^&'}{}d;
! 	push @allwords, grep { length >= $minimum_word_length } split /\W+/;
  }
  
  close CAT;
***************
*** 207,215 ****
  
  #############################################
  # now the words
! for ($x=0; $x<@allwords; $x++) {
!         $calc=int(1000*$x/@allwords);           # calculate rel. position (0-1000)
!         print "w\t$allwords[$x]\t$calc\t0\n";   # print out word, rel. pos. and text type (0)
  }
  
  $calc=@allwords;
--- 199,208 ----
  
  #############################################
  # now the words
! $x = 0;
! for ( @allwords ) {
!     # print out word, rel. pos. and text type (0)
!     printf "w\t%s\t%d\t0\n", $_, 1000*$x++/@allwords;
  }
  
  $calc=@allwords;


