
                     Directories
                     -----------

The minisat and sat-solver directories contain code for the Boolean-SAT
parser.

The corpus directory contains code to read word-sense disambiguation
data from an SQL file.

The dict-file directory contains code to read dictionaries from files.
The dict-sql directory contains code to read dictionaries from an SQL DB.
   (unfinished, under development!)


HOWTO use the new regex tokenizer/splitter
------------------------------------------
Its new, experimental code. 

To compile: ../configure --enable-regex-tokenizer



- At the linkparser> prompt, enter:
!/REGEX/,tokentosplit

Currently, if tokentosplit contains white space, command-line.c discards
it.
Also, case is currently observed.

The regex syntax is designed so the regex is a valid one (although
meaningless) as written, so compiling it would reveal syntax errors in
it (the result of this initial compilation is not used).

- All the /regexes/ are anchored at their start and end, as if ^ and $
  were used.

- Mentioning a word class (x is an optional constraint, defaults to
  ".*"):

(?<CLASS>x)

CLASS can be:
 * DICTWORD, to match a word from 4.0.dict.
 * An affix class name (takes priority if there is a regex with the same
   name).
 * A regex name from 4.0.regex (prefix it with "r" if there is such an
   affix class).

For regexes from 4.0.regex, the code combine the ones with the same
name, taking care to omit the ^ and $ from each, if exist (constraints
are said to be supported  (not tested) and can be added if needed, but I
could not find an example of anything useful).

DICTWORD can be optionally followed by a word mark, which is taken from
the affix file:

 * DICTWORDaM append M to DICTWORD before looking it up.
 * DICTWORDpM prepend  M to DICTWORD before looking it up.

If M contains more than one word (in the affix file), only the first one
is currently used.


Examples:
 * (?<SUF>) match a suffix from the affix file
 * (?<NUMBER>) match the regex NUMBER
 * (?<UNITS>) match UNITS from the affix file.
 * (?<rUNITS>) match UNITS from the regex file.
 * (?<DICTWORD>) match a dictionary word.
 * <?<DICTWORDaSTEMSUBSCR>) match word.= (if STEMSUBSCR is ^C=).
 * <?<DICTWORDpINFIXMARK) match =word (if...)

- Using word constrains (x in (?<CLASS>x) ):
Matching single letters by DISTWORD (because they are in the dict) may
note be desired.
In such a case x can be constrained to include 2 letters at least, plus
the desired 1-letter words.
E.g.: (?<DICTWORD>.{2,}|a) , which matches words of 2 letters and more,
plus the word "a".

- Currently the outer part of the regex should not contain alternations.
  This is because I was too lazy to add code for adding (?:...) over it
in such cases. So in order to contain alternations the (?:...) should
currently be added by hand, as in:

/(?:(?<UNITS>)|(?<RPUNC>))*/,dfs,dsfadsdsa,.?!sfads

- Holes are not supported. For example, this is not fine (and not
  tested):
/(?<DICTWORD>)-(?<DICTWORD>)/,khasdkflhdsfa
because the "-" character would create va hole in the result.
But this is fine (and also not tested...):
/(?<DICTWORD>)(-)(?<DICTWORD>)/,asdfkjajfahlad

Currently, since named capturing groups are used for classes, if the same
class name is used more than once, there may be a need to start the regex
by (?J). This will be fixed later.

- The regex cannot include nested capture groups, so inner groups, if
  needed, should be non-capturing ones.

This is because currently the matching groups create a linear string,
without holes.
If you will find a use for internal capture groups, I can use them.
Because of that, backreferences in regexes from the regex file are not
supported (but there are currently none...).

So this is not valid (a DICTWORD which matches a NUMBER)::
/(?<DICTWORD(?<NUMBER>))/,qazwsx
and this too (a nonsense constraint for demo):
/(?<DICTWORD>([A-Z][0-9])*)/,qazwsx
but this should be fine:
/(?<DICTWORD>(?:[A-Z][0-9])*)/,qazwsx

Some fun examples:

!/(.*)*/,test
Modified pattern: (?:(.*)(?C))*$(?C1)
Alternative 1:
 0 (1):  test (0,4)
 1 (1):   (4,4)
Alternative 2:
 0 (1):  test (0,4)
Alternative 3:
 0 (1):  tes (0,3)
 1 (1):  t (3,4)
 2 (1):   (4,4)
Alternative 4:
 0 (1):  tes (0,3)
 1 (1):  t (3,4)
Alternative 5:
 0 (1):  te (0,2)
 1 (1):  st (2,4)
 2 (1):   (4,4)
[...]
Alternative 14:
 0 (1):  t (0,1)
 1 (1):  e (1,2)
 2 (1):  st (2,4)
Alternative 15:
 0 (1):  t (0,1)
 1 (1):  e (1,2)
 2 (1):  s (2,3)
 3 (1):  t (3,4)
 4 (1):   (4,4)
Alternative 16:
 0 (1):  t (0,1)
 1 (1):  e (1,2)
 2 (1):  s (2,3)
 3 (1):  t (3,4)

(Some appear "twice" due to the terminating null match. I think I will
discard such matches.).

With splits to 2 parts only:
linkparser> !/(.*){2}/,test
Modified pattern: (?:(.*)(?C)){2}$(?C1)
Alternative 1:
 0 (1):  test (0,4)
 1 (1):   (4,4)
Alternative 2:
 0 (1):  tes (0,3)
 1 (1):  t (3,4)
Alternative 3:
 0 (1):  te (0,2)
 1 (1):  st (2,4)
Alternative 4:
 0 (1):  t (0,1)
 1 (1):  est (1,4)
Alternative 5:
 0 (1):  test (0,4)
linkparser> 

!/(?:(?<DICTWORD>.{2,}|a)(?<RPUNC>)?)+/,theriver,dangeroustonavigatebutimportantforcommerce,hasmanyshoals.
(This is one long line, just test it...)


!/(?<NUMBERS>)(?<rUNITS>)*(?<RPUNC>)*/,123.2milligram/terag/?!
(test it...)

!/(?<DICTWORD>)(?<SUF>)/,there's
Modified pattern: (?:(?<DICTWORD>.*)(?C))(?:(?<SUF>.*)(?C))$(?C1)
Alternative 1:
 0 (1):  there (0,5) [DICTWORD]
 1 (2):  's (5,7) [SUF]
linkparser> 


In the next example, we get only whole word and double-dash because
it can only match wpwp (when w is DICTWORD and p is --).

!/(?:(?<DICTWORD>)(?<LPUNC>))+/,this--is--
Modified pattern: (?:(?:(?<DICTWORD>.*)(?C))(?:(?<LPUNC>.*)(?C)))+$(?C1)
Alternative 1:
 0 (1):  this (0,4) [DICTWORD]
 1 (2):  -- (4,6) [LPUNC]
 2 (1):  is (6,8) [DICTWORD]
 3 (2):  -- (8,10) [LPUNC]
 4 (1):   (10,10) [DICTWORD]
linkparser> 

However, this breaks to single characters, as expected:
!/(?:(?<DICTWORD>)(?:(?<LPUNC>))*)+/,this--is--
...
Alternative 360:
 0 (1):  t (0,1) [DICTWORD]
 1 (1):  h (1,2) [DICTWORD]
 2 (1):  i (2,3) [DICTWORD]
 3 (1):  s (3,4) [DICTWORD]
 4 (1):  - (4,5) [DICTWORD]
 5 (1):  - (5,6) [DICTWORD]
 6 (1):  i (6,7) [DICTWORD]
 7 (1):  s (7,8) [DICTWORD]
 8 (1):  - (8,9) [DICTWORD]
 9 (1):  - (9,10) [DICTWORD]
10 (1):   (10,10) [DICTWORD]
linkparser> 

But this stops after the first match:
!/(?:(?<DICTWORD>)(?:(?<LPUNC>)(*COMMIT))*)+/,this--is--
Alternative 1:
 0 (1):  this (0,4) [DICTWORD]
 1 (2):  -- (4,6) [LPUNC]
 2 (1):  is (6,8) [DICTWORD]
 3 (2):  -- (8,10) [LPUNC]
 4 (1):   (10,10) [DICTWORD]
linkparser> 

And this is even more interesting:
!/(?:(?<DICTWORD>)(*COMMIT)(?:(?<LPUNC>))*)+/,this--is--
Alternative 1:
 0 (1):  this (0,4) [DICTWORD]
 1 (2):  -- (4,6) [LPUNC]
 2 (1):  is (6,8) [DICTWORD]
 3 (2):  -- (8,10) [LPUNC]
 4 (1):   (10,10) [DICTWORD]
Alternative 2:
 0 (1):  this (0,4) [DICTWORD]
 1 (2):  -- (4,6) [LPUNC]
 2 (1):  is (6,8) [DICTWORD]
 3 (2):  - (8,9) [LPUNC]
 4 (2):  - (9,10) [LPUNC]
 5 (1):   (10,10) [DICTWORD]
Alternative 3:
 0 (1):  this (0,4) [DICTWORD]
 1 (2):  -- (4,6) [LPUNC]
 2 (1):  is (6,8) [DICTWORD]
 3 (2):  - (8,9) [LPUNC]
 4 (1):  - (9,10) [DICTWORD]
 5 (1):   (10,10) [DICTWORD]
linkparser> 

It seems as if conditional matching using (?(condition)yes-pattern|no-pattern)
or (*THEN) can do some fun things, but I don't have useful examples yet.

The question is how to use this code for tokenization. I have some
ideas, more on that later.
