-*-Text-*-

This file is unfortunately very outdated. /arve 090428


			  * Modelling issues

** Varying B/D rates

We want to try to have different birth/death rates in different parts
of the species tree.
   Note: We could probably reuse the models for substitution rate
   variation over edges for this.


** Seq evol rates

First of all, the rate of evolution must become a parameter included
in the MCMC. Second, we would like to allow this rate to change over
the gene or species tree. What is possible and what is reasonable?

  Note: The first point is now implemented. The average substitution
        rate is now a parameter.
        The second point is also implemented and under testing. There
        are two different models for edge variation and three variants
        of each. What is missing is the connection between G and S
        rates.


** Sequence evolution

We want to be able to have rates change across sites too.
   Note: The first step is now implemented. We now model rate variation
   across sites with a gamma distribution and integrate over discrete
   rate classes for each data column n. However, we do not include a
   probability for a site being invariant (pInv in the parlance of
   many phylogeny programs). I will implement this, but I need to
   check how this is usually done - it will be an additional rate
   class to integrate over, but how are non-invariant sites treated -
   is Pr{n|inv} = 0, then? /bens
   Note: I have run into a problem with pInv. I made a trial
   implementation, but it does not yield the same result as PAUP
   with corresponding settings. Does PAUP use another model? /bens
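
As a reference point for the pInv question above, the way programs
usually combine a proportion of invariant sites with K equally
weighted discrete gamma classes can be sketched as follows. This is a
hedged sketch: the function name and the assumption that per-class
column likelihoods are computed elsewhere are illustrative, not BEEP's
actual interface.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch: mix a proportion pInv of invariant sites with K discrete
// gamma rate classes. siteL holds the column likelihood under each
// gamma class; invL is the column likelihood under rate 0, which is
// 0 for any column that is variable among the leaves, i.e.
// Pr{n | invariant} = 0 for such columns.
double siteLikelihood(const std::vector<double>& siteL, // one entry per gamma class
                      double invL,  // likelihood given rate 0 (0 if column varies)
                      double pInv)
{
    double gammaPart = 0.0;
    for (double L : siteL)
        gammaPart += L;
    gammaPart /= siteL.size();      // equal weight 1/K per class
    return pInv * invL + (1.0 - pInv) * gammaPart;
}
```

So a variable column contributes only through the gamma part, scaled
by (1 - pInv).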

We want to implement analytical solution classes for those substitution
   matrices where this is possible, e.g., JC69, Kimura 2P/3P, F81/F85,
   HKY85, UniformAA and UniformCodon.
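
For the simplest of these the closed forms are well known; a minimal
sketch of the JC69 case, with t measured in expected substitutions per
site (the function names are illustrative):

```cpp
#include <cassert>
#include <cmath>

// Closed-form JC69 transition probabilities:
//   P(same base)      = 1/4 + 3/4 * exp(-4t/3)
//   P(specific other) = 1/4 - 1/4 * exp(-4t/3)
// Each row of P(t) sums to one: same + 3 * diff = 1.
double jc69Same(double t) { return 0.25 + 0.75 * std::exp(-4.0 * t / 3.0); }
double jc69Diff(double t) { return 0.25 - 0.25 * std::exp(-4.0 * t / 3.0); }
```

The other listed models have similar, if longer, closed forms.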

We also want to implement Lewis's class of substitution models for
   non-randomly sampled data, more specifically when only data
   variable among the leaves have been sampled, e.g., for
   morphological data (but also other types of data). This expresses
   transition probabilities that are conditional on the character
   being variable. See Lewis, P.O. A likelihood approach to estimating
   phylogeny from discrete morphological character data. Syst Biol,
   2001, 50, 913-25. I have a copy. /bens
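
The conditioning Lewis describes amounts to dividing each column
likelihood by the probability of observing a variable character; a
hedged sketch, assuming the per-state constant-column likelihoods are
computed elsewhere:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Ascertainment correction for variable-only sampling:
//   Pr(D | variable) = Pr(D) / (1 - Pr(constant)),
// where Pr(constant) sums the likelihoods of the unobserved constant
// columns, one per state. Names are illustrative, not BEEP's API.
double conditionOnVariable(double columnL,
                           const std::vector<double>& constantL) // one per state
{
    double pConst = 0.0;
    for (double L : constantL)
        pConst += L;
    return columnL / (1.0 - pConst);
}
```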

We want to implement MCMC-classes perturbing the parameters of
   substitution	matrices.


** Ancestral sequences

We can always get a handle on the root of the gene tree. And there is
always, implicitly, an ancestral sequence reconstructed there. Hence,
we could always compute a posterior distribution for the ancestral
sequence.
	  Note: Ninoa's thesis has a mechanism for reconstructing 
	  marginal likely ancestral sequences for non-root nodes that 
	  might be useful. /bens

          Note: There is a class MarginalAncestralDistribution in
	  Joerg's cvs that is based on BEEP-0-2::SubstitutionModel. /bens

** Locking down the tree

In some cases, some edges in a gene tree can be considered fixed. It
would be good to be able to say that on the input and run with the
constraint. 


* MCMC issues

** Automatic start tree

It should not be necessary to supply a starting tree for the
MCMC. There are two alternatives. 

  1. An arbitrary or random tree is created as a starting point.
  2. A good tree is computed, for example using neighbour joining.


** Updates

Investigate different MCMC update strategies.


** Multi-chain MCMC

Consider running several chains at once, perhaps with different
"temperatures".

	Lotta implemented multi-chain MCMC on the PVM /arve


** Time estimates

One could first estimate how many iterations are needed for one single
tree, then skip the seq evol part to see how many iterations would be
needed for the rest. The time necessary for the whole problem would
then be the product of the two.


* Coding

** Thinning parameter with good defaults

Currently, the user has to suggest a thinning parameter if the default
of 10 is not appropriate. However, in most cases, the thinning really
depends on the number of iterations. For example, with up to 1000
iterations, a user probably wants every sample, and certainly not just
every tenth. With 10^4 iterations, it might be reasonable to choose 10
for thinning. I'd like to see sensible values by default.
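
One possible default, sketched here as a suggestion rather than a
decided policy: aim for roughly 1000 retained samples, which
reproduces the behaviour described above.

```cpp
#include <cassert>

// Hypothetical default: keep about 1000 samples regardless of chain
// length. Up to 1000 iterations every sample is kept; at 10^4
// iterations thinning becomes 10, and so on.
unsigned defaultThinning(unsigned iterations)
{
    return iterations <= 1000 ? 1 : iterations / 1000;
}
```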



** Branch time and node time

Class Node has three different time fields. One of them is
branchLength - is it really needed?


** Refactoring of important classes.

Tree.hh has code for generating random trees. GammaMap has
randomization, at least in the form of
GammaMap::perturberation(). Break this up into separate classes. Tree
and GammaMap should not contain code for randomization.

    Note: The Tree issue has been fixed! I haven't done anything about
    GammaMap, is the function used at all? /bens

       
** Speeding things up

*** Specialization

Investigate whether we can be clever at the leaves.


*** Avoid recomputing the same thing over and over

Jens observed that many parts of the tree are not affected by time
sampling simply because they don't have duplications, and suggested
avoiding recomputation of those parts.

Bengt suggested using gammastar to determine where to avoid.


** New MCMC parameters

As mentioned above, the seq evol rate has to be modelled. Also, the
root time in S must be included.

  Note: Now implemented.


** Ambiguity symbols

Support for ambiguity symbols is prepared but not finished and tested.


** Wilder swaps than NNI

Currently, we support rerooting and NNI. We should add more ways of
perturbing the tree. Also make sure that the swapping works on small
trees etc.


** ML reconciliation

Lasse should implement the ML alg for reconciliations.



** Profiling

We should profile the code!


** Beta

How do we choose beta, the parameter for the prior on the top slice
time, i.e., the time between the species tree's root node and the node
above it? Right now, we are mostly using beta = 1.0, but I don't think
we have any justification for that.
It should depend on the scale of the times on the tree! In BEEP-0-1,
we use T.rootToLeafTime().
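
A sketch of that scaling idea; the helper name defaultBeta and the
scaling constant c are assumptions, not a worked-out choice.

```cpp
#include <cassert>

// Hypothetical default: tie beta to the time scale of the species
// tree (the value of T.rootToLeafTime()) instead of a fixed 1.0.
double defaultBeta(double rootToLeafTime, double c = 1.0)
{
    return c * rootToLeafTime;
}
```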


** TreeIO

It would be good to review the TreeIO interface. It is getting bloated
and confusing.

Some problems: 
  - Reading a gene tree with branch lengths, when not caring about node
    times, may cause a lot of warnings from Node::updateNodeTime().
  - Too many pointers used. That is not modern. :-)
  - It would be good to be able to have node and arc attributes for
    extensibility. 
  - Would it be possible to make a generic TreeIO as a template class
    where parameterization decides how the tree is written?
  - Verify that branchLength and nodeTime are used correctly. I don't
    think that is the case!
  - DONE: Would be good to be able to read from a string rather than a file.

We need a way of assuring that a reconciliation is consistent with
species (and gene) tree.

Discussion suggestion from bens:

Maybe we should have a master function 'readTree(bool EWisET, bool
ET, bool NT, bool BL, bool S, ...)', where
ET=true indicates that the ET variable (if it exists in the treefile)
should be used for the corresponding Node attribute. EWisET=true
indicates that whatever is written after the colon (:) in the treefile
should be used as the (edge)time of the corresponding Node (as in
readSpeciesTree(...)), while EWisET=false indicates that it should be
interpreted as a branchLength.

We could then have specialized functions, e.g., readSpeciesTree(...),
that call readTree with pre-specified values for EWisET (e.g., true
for readSpeciesTree), ET, etc., and that could serve as the standard
tree-reading functions. readTree(...) should remain available for
those users that need specialized tree-reading functions.

Problems with this approach could be:
	 - If new markups are added to NHX and we update readTree
	   accordingly, e.g., readTree(<asbefore>, <NEWMARKUP>), we might
	   break backward compatibility with older programs. This could
	   be remedied by overloading readTree() with a version
	   corresponding to the old one, which then calls the new
	   version as readTree(<asbefore>, <NEWMARKUP>=false);
	 - readTree(...) might become very heavy to use, especially if
	   we mainly want to read trees and disregard most of the markups.
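
The flag set the proposed readTree(...) would take, and the defaults a
readSpeciesTree(...) wrapper would fix, can be sketched like this. The
struct and function are illustrative only, not the actual TreeIO
interface; the flag names follow the discussion above.

```cpp
#include <cassert>

// Sketch of the flags the proposed master reader would take.
struct ReadFlags {
    bool EWisET; // value after ':' is an edge time, not a branch length
    bool ET;     // honour ET markup if present
    bool NT;     // honour NT markup if present
    bool BL;     // honour BL markup if present
};

// Defaults a readSpeciesTree(...) wrapper could pass to readTree(...):
// the value after ':' is interpreted as an edge time.
ReadFlags speciesTreeFlags()
{
    return ReadFlags{ true, true, true, false };
}
```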

*** NHX

NHX should be able to read edge time markups (e.g., 'ET') in the
treefile, e.g., '...[&&BEEP ET=0.123]...', indicating that the
(edge)time of this Node is 0.123.

NHX should be able to read explicit branch length markups (e.g., 'BL')
in the treefile, e.g., '...[&&BEEP BL=0.123]...', indicating that the
branch length of this Node is 0.123.

NHX should have a separate variable to store what it reads after the
colon (':') in the treefile, e.g., '...(Leaf_1:0.123, Leaf_2:
0.321):0.231,...'. This variable could be called, e.g., EW (for edge
weight)(?). TreeIO is then responsible for assigning this variable to
the correct Node attribute, see above.

NOTE! Currently we cannot read BOTH edge times and branch lengths from
a treefile - this is a problem!
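
Pulling a single BEEP markup value out of a comment could look roughly
like this. A minimal sketch only: the function name is hypothetical
and the real NHX parser must of course handle the full grammar.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <string>

// Extract one BEEP markup value (e.g. key "ET" or "BL") from an
// NHX-style comment such as "[&&BEEP ET=0.123]". Returns true and
// stores the value if the key is present, false otherwise.
bool getBeepValue(const std::string& comment, const std::string& key,
                  double& value)
{
    const std::string needle = key + "=";
    std::string::size_type pos = comment.find(needle);
    if (pos == std::string::npos)
        return false;
    value = std::strtod(comment.c_str() + pos + needle.size(), nullptr);
    return true;
}
```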

** SeqIO

More sequence formats would be good. An overhaul of the sequence
parsing is also a good idea.

It would be convenient to be able to read Pfam files directly. They
sometimes contain annotations of which species a sequence is from.

What other popular file formats could _easily_ be added? MrBayes and
Nexus? There is no way I (Lasse) will sit down and try to parse
everything from those files, though!


* Extensions

** Please add codon and RNA support into DataType.
   Codons are now handled! /arve

======================================================================================

Completed tasks

** Implemented: gtl

   The non-seqevol DP-only program gtl (bad name: Gene Tree Likelihood),
   could easily be extended to work with several gene trees and have the
   species tree change in MCMC.


** Removed: Bad old diagnostic code

   In SimpleMCMC, there is support for analysing what goes wrong when a
   chain is stuck: grep for 'diagFile'! The use of this feature should
   be restricted to debug mode.

   Also, it is triggered after 20 * thinning iterations without a
   commitNewState. That is silly when thinning = 1, and should be
   changed.

** Implemented: Orthology prediction

   The current implementation lacks code for extracting orthology
   probabilities. Lasse has started to code this.



*** Lapack implemented: Linear algebra

    There are some different options:

    1. Write our own.
    2. Our own interface on top of clapack.
    3. Modernizing Lapack++. (I tried compiling it: did not work.)
    4. Incorporating code from other sources.

       Note: we are now using fortran lapack and have our own C++
           interface. However, we are still dependent on MTL in
           ReconSeqapproximator::OrthologyMatrix. This dependence
           should be removed!!! /bens

	   Fixed /arve


** Implemented using PVM: Parallelization

   It should be fairly easy to parallelize using PVM. Anyone using MPI?


*** Implemented: Precomputing P(t) for some set of t

Lasse will look at how P(t) changes with different values of t. Maybe
we can pre-compute a set of P(t) and then approximate when we want to
use a new exciting t.

  Note: A cache of P(t) matrices is now implemented with a good
        speedup. We should still look into the approximation
        idea. First tests by Lasse were not convincing in either
        direction.
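
The cache idea can be sketched as a memoized lookup keyed on t. Here
Matrix and computePt() are toy stand-ins; the real code exponentiates
the substitution rate matrix.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <vector>

using Matrix = std::vector<double>;

// Placeholder for the expensive expm(Q*t) computation.
Matrix computePt(double t)
{
    return Matrix{ std::exp(-t) };  // toy 1x1 "matrix"
}

// Memoized P(t): compute once per distinct t, then reuse. std::map
// guarantees stored values keep their addresses, so returning a
// reference into the cache is safe.
const Matrix& cachedPt(double t)
{
    static std::map<double, Matrix> cache;
    auto it = cache.find(t);
    if (it == cache.end())
        it = cache.emplace(t, computePt(t)).first;
    return it->second;
}
```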




********************* post-analysis *******************

mcmc_analysis:
(1) I saw somewhere (in a recent Nylander, Huelsenbeck, Ronquist paper,
I think) that harmonic means were recommended instead of standard
(arithmetic?) means when using posterior Bayes factors. I have
therefore included the calculation of harmonic means in mcmc_analysis'
analyze_logfloat. It seems to work and I will put it on cvs soon.
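
For reference, the estimator in its plain form, HM = n / sum_i(1/L_i).
In practice one works in log space for numerical stability; this
sketch ignores that, and the function name is illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Harmonic mean of a set of (positive) values:
//   HM = n / (1/x_1 + ... + 1/x_n)
double harmonicMean(const std::vector<double>& values)
{
    double invSum = 0.0;
    for (double v : values)
        invSum += 1.0 / v;
    return values.size() / invSum;
}
```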

(2) mcmc_analysis should be able to handle multiple chains. I suggest
that we add an option, e.g., -m, which indicates that when multiple
files are given as arguments, these are to be interpreted as containing
results from independent chains. Thus,
(a) burnin should be removed from each chain.
(b) We could give means, cred. intervals, etc. for each chain, but
definitely for the combined chain.
(c) I am reading up on convergence measures and found a rather simple nonparametric convergence test for independent chains that we could implement.
The reference is
Brooks, S.P. & Gelman, A. General methods for monitoring convergence
of iterative simulations. Journal of Computational and Graphical
Statistics, 1998, 7, 434-455 (section 3).
The statistic is a simple ratio between the length of the X% cred.
interval over all chains combined and the average length of the X%
cred. interval for the individual chains. It is suggested as a
non-parametric generalization of the much-used Gelman & Rubin test
(which assumes an underlying normal distribution), though it seems to
have some trouble in special circumstances. Brooks and Gelman used
80% as X%.
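
The statistic can be sketched as follows; the crude empirical
quantiles used here are a simplification of what Brooks & Gelman
actually propose, and the function names are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Length of the central `frac` empirical credible interval of a sample.
double intervalLength(std::vector<double> x, double frac = 0.8)
{
    std::sort(x.begin(), x.end());
    std::size_t lo = static_cast<std::size_t>((1.0 - frac) / 2.0 * (x.size() - 1));
    std::size_t hi = static_cast<std::size_t>((1.0 + frac) / 2.0 * (x.size() - 1));
    return x[hi] - x[lo];
}

// Interval-based diagnostic (Brooks & Gelman 1998, sec. 3): pooled
// interval length divided by the mean within-chain interval length.
// Values near 1 suggest the chains have converged to the same target.
double intervalRatio(const std::vector<std::vector<double>>& chains)
{
    std::vector<double> pooled;
    double within = 0.0;
    for (const auto& c : chains) {
        pooled.insert(pooled.end(), c.begin(), c.end());
        within += intervalLength(c);
    }
    within /= chains.size();
    return intervalLength(pooled) / within;
}
```

Two identical chains give a ratio of exactly 1, as expected.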

(3) Mean, variance, cred. interval is probably best computed on the combined chains, but we could give these for the individual chains as well -- it would give a feel for convergence I guess.

(4) It would be good if we could include output of kernel densities in
the graphical output from mcmc_analysis. Coda is a bit cumbersome to
work with in the long run.

