--------------------------------------------------------------------------------
README for directory mklut  
--------------------------------------------------------------------------------

Overview
--------
The source code in this directory builds the 'lut' executable. To build 'lut'
on a Unix/Linux platform, simply type 'make'. In an MS-DOS environment, use
the 'pcmake' command.

NOTE: before building 'lut', the code in directory ../compute_qv must be
      compiled. That build generates the static library file libtt.a,
      which is needed for building 'lut'.

Purpose
-------
Executable 'lut'
- takes as input a set of base calls stored in the alignment file produced
  by executable 'train',
- generates a "customized" quality value calibration file (a lookup table)
  which can be used by executable 'ttuner' with option -t <table> as an
  external calibration file when evaluating quality values of called bases.

Usage
-----
If you invoke 'lut' without arguments, you get a brief usage message:

% lut

Version: TT_3.01
usage: lut
     [ -Q ] [ -V ]
     [ -o <output_file> ]
     <num_thresholds>  <  <alignment_file>   >   <lookup_table_file>

where

    -Q (quiet) Turns off lut's status messages,

    -V (verbose) Specifies that lut produce additional process status messages,
    
    -o <output_file> Specifies that lut output the resulting lookup table to 
       file <output_file>. By default, stdout is used,

    <num_thresholds> is the number of thresholds used for binning predictor
       (trace parameter) values. Release 3.0.1 supports exactly four
       predictors for calibrating quality values. During calibration, the
       values of each predictor are partitioned into the same number of
       bins.

    <alignment_file> is the file holding the standard output of executable
       'train'.
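
The binning controlled by <num_thresholds> can be pictured with a short C
sketch. It assumes that the thresholds of one predictor are kept in an array
sorted in ascending order, and that a value belongs to the first threshold
that is not smaller than it. The function name bin_index is hypothetical and
is not taken from the 'lut' sources; how 'lut' chooses the threshold values
themselves is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the first threshold that is >= value, or n when the
 * value is larger than every threshold.
 * Assumption: thresholds[] is sorted in ascending order. */
size_t bin_index(const double *thresholds, size_t n, double value)
{
    size_t lo = 0, hi = n;              /* search window is [lo, hi) */

    assert(thresholds != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (thresholds[mid] < value)
            lo = mid + 1;               /* value lies above this threshold */
        else
            hi = mid;                   /* this threshold still covers value */
    }
    return lo;
}
```

A binary search is used because the threshold arrays are small and sorted; a
linear scan would give the same result.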

Example
-------
To produce a lookup table from <alignment_file> using 50 thresholds,
use the command:

lut 50   <    <alignment_file>   >    lookup.tbl


Algorithm
---------
The algorithm that 'lut' uses to produce a lookup table is essentially the
one described by Ewing and Green [1]. In particular, as in [1], the current
version of 'lut' assumes that the quality value of each base call is
characterized by the values of four trace parameters, or predictors.
However, we made several improvements to the original algorithm [2]. In
particular, we developed a dynamic programming algorithm to speed up the
generation of lookup table entries. Our procedure is linear, rather than
quadratic, in time. For a typical calibration dataset of 20 to 40 million
called bases aligned against the reference sequence, it takes only several
hours to complete, which makes customized calibration of quality values on
user-supplied data affordable.

Input/output
------------
The input for the 'lut' executable is an alignment file produced by the
'train' executable. The format of this file is described in the README for
directory 'mktrain'.

The output file stores the parameter thresholds generated by the program,
some auxiliary information and, finally, the actual lookup table, which
consists of five columns of data: the first column is the quality value
and the other four columns are the indexes of the thresholds corresponding
to this quality value. The tracetuner code reads this table from top to
bottom. A given base call is assigned the quality value from the first line
of the table for which none of the four parameter values characterizing the
base call exceeds the threshold selected by the corresponding index.
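
The top-to-bottom scan can be sketched in C as follows. This is a minimal
illustration, not the tracetuner implementation: the type LutEntry, the
function lookup_quality and the fallback_qv argument are hypothetical names.
It assumes, as in Ewing and Green [1], that a line matches when each of the
four parameter values is at most the threshold selected by that line's index
for the same predictor.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PARAMS 4    /* the README states exactly four predictors */

/* One line of the lookup table: a quality value plus one threshold index
 * per predictor (hypothetical layout). */
typedef struct {
    int    qv;
    size_t index[NUM_PARAMS];
} LutEntry;

/* Scan the table from top to bottom and return the quality value of the
 * first line whose selected thresholds all cover the parameter values.
 * thresholds[p] points to the threshold array of predictor p.
 * Returns fallback_qv when no line matches. */
int lookup_quality(const LutEntry *table, size_t num_entries,
                   const double *const *thresholds,
                   const double *params, int fallback_qv)
{
    assert(table != NULL || num_entries == 0);
    for (size_t i = 0; i < num_entries; i++) {
        int match = 1;
        for (size_t p = 0; p < NUM_PARAMS; p++) {
            if (params[p] > thresholds[p][table[i].index[p]]) {
                match = 0;      /* this parameter exceeds its threshold */
                break;
            }
        }
        if (match)
            return table[i].qv;
    }
    return fallback_qv;
}
```

Because the first matching line wins, the order of the lines in the file
matters: the same set of lines in a different order can assign different
quality values.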

References
----------
[1] Ewing B, Green P. (1998) Basecalling of automated sequencer traces using
    phred. II. Error probabilities. Genome Research 8:186-194.
[2] Denisov GA, Arehart AB, Curtin MD. (2004) System and method for
    improving the accuracy of DNA sequencing and error probability estimation
    through application of a mathematical model to the analysis of
    electropherograms. US Patent 6,681,186.
