Bastien Boussau
bastien.boussau@univ-lyon1.fr
Laboratoire de Biométrie et Biologie Evolutive, UMR CNRS 5558
INTRODUCTION
DOWNLOAD
INSTALLATION
USAGE
OTHER OPTIONS
OUTPUT
ADVICE
CITATION
REFERENCES
INTRODUCTION
nhPhyML is a program built to compute phylogenetic trees under the non-stationary, non-homogeneous
model of DNA sequence evolution of Galtier and Gouy (1998).
As such, it provides estimates of the G+C contents of ancient sequences.
It uses the algorithmic structure of PhyML (Guindon and Gascuel, 2003)
adapted to the rooted and irreversible case (many thanks to Stéphane Guindon for providing PhyML source code).
The program can also be downloaded from the PhyML homepage, under "PHYML
unofficial versions".
It is provided as is and should be used with appropriate care.
Please report any bug to my address: bastien.boussau@univ-lyon1.fr.
It has been tested on Unix and Linux systems.
For more information about my work, see my lab page.
If you are interested not in reconstructing a phylogenetic tree but in studying the pattern of substitutions along a fixed topology and reconstructing ancestral sequences, I suggest you take a look at Bio++ and the BppSuite suite of software.
DOWNLOAD
You can download either an executable file for Linux or an archive containing all the source code. Example files can also be downloaded.
INSTALLATION
Installation on Linux:
- If you have downloaded the executable file:
- go to its directory, make sure the file is executable (otherwise type chmod +x nhPhyml) and type:
./nhPhyml
- If you have downloaded the source code:
- To install it, you first need to extract it:
tar zxvf nhPhyml.tgz
- Then, simply go to the directory named "nhPhyml" and type:
make
- There should be an executable entitled "nhPhyml".
Installation on Mac OS X (thanks to Cedric Simillion for finding out how to install nhPhyML on Mac OS X)
- cd into the nhPhyml directory after extracting the archive
- remove the nhPhyml binary and all .o files that came with the archive
- open the Makefile in a text editor and remove the -static option from the CFLAGS line
- typing "make" now produces a working binary for OS X.
USAGE
nhPhyml needs a rooted tree (the root will never be moved throughout the tree space search) and a sequence file in phylip format. Note that if the tree is not rooted, you will get a "Seg Fault".
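Since an unrooted starting tree ends in a "Seg Fault", it can be worth checking the tree file before launching a run. The following is a minimal sketch, assuming the usual Newick convention that a rooted tree has exactly two subtrees at its outermost level (an unrooted one typically has three). The function names are hypothetical and not part of nhPhyML, and the parser is deliberately naive: it ignores quoted labels and bracketed comments.

```python
def count_root_children(newick: str) -> int:
    """Count the subtrees attached directly to the root of a Newick string."""
    depth = 0
    children = 0
    for char in newick.strip():
        if char == "(":
            depth += 1
            if depth == 1:
                # Opening the outermost parenthesis: at least one child follows.
                children = 1
        elif char == ")":
            depth -= 1
        elif char == "," and depth == 1:
            # A comma at depth 1 separates two children of the root.
            children += 1
    return children


def looks_rooted(newick: str) -> bool:
    """Assumption: 'rooted' means the outermost node is a bifurcation."""
    return count_root_children(newick) == 2


rooted_tree = "((A:0.1,B:0.2):0.05,(C:0.3,D:0.1):0.2);"
unrooted_tree = "(A:0.1,B:0.2,(C:0.3,D:0.1):0.2);"
print(looks_rooted(rooted_tree), looks_rooted(unrooted_tree))
```

This only inspects the shape of the string; it does not guarantee that nhPhyml will accept the tree.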
- Phylip-like interface
Go to its installation directory and type:
./nhPhyml
Then you face a phylip-like (and PhyML-like) interface which asks for self-explanatory information such as the number of rate categories for the gamma law, whether or not the transition/transversion rate should be optimized...
- Command line
You can also use nhPhyml directly from the command line using:
./nhPhyml -sequences=SequenceFile -tree=TreeFile -format=i -positions=123 -tstv=e -rates=8 -alpha=e -topology=e -outseqs=y -eqfreq=lim -numeqfreq=5 -treefile=Treefile
Where:
- SequenceFile is the sequence file in phylip format,
- TreeFile is the starting tree file in bracketed (newick) format,
- -format=i specifies the phylip interleaved format (can also be s for phylip sequential format),
- -positions=123 means that the user wants to use all the positions in the codons (could also be 1, 12, 13, 2, 23, or 3)
- -tstv=e is to tell the program that the transition/transversion rate needs to be evaluated (put a value otherwise),
- -rates=8 is the number of rate categories for the discretized gamma distribution,
- -alpha=e means that the gamma distribution parameter alpha is to be evaluated (put a value otherwise),
- -topology=e means that we want to optimize the topology and the branch lengths; putting k (for keep) would only optimize branch lengths while keeping the topology,
- -eqfreq=lim means we want to use the nhPhyML-Discrete version of nhPhyML, in which each branch has the "choice" between a limited set of G+C equilibrium frequencies. In the default version, specified by -eqfreq=unlim, the G+C equilibrium frequency is optimized for each branch. This results in some convergence problems (the true topology is found less often than with -eqfreq=lim).
- -numeqfreq=5: if you use a limited set of equilibrium frequencies (-eqfreq=lim), you need to specify the number of equilibrium frequencies you want to use. This number is important: if it is too small, the process of evolution might not be modelled correctly; if it is too large, the tree space exploration becomes inefficient. Please refer to the article for more details.
Recent addition (20/07/2012): you can now specify lower and upper values for the limited set of equilibrium frequencies, for instance with -eqfreqlow=0.2 -eqfrequpp=0.8.
Only the sequence file and the tree file are mandatory. Default values for the other parameters
are:
-format=i -positions=123 -tstv=e -rates=1 -topology=e -outseqs=n -eqfreq=unlim
When there is only one rate of evolution, no alpha is used.
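If you launch nhPhyml from a script, the command line above can be assembled programmatically. Here is a small sketch in Python; the helper name build_nhphyml_command is hypothetical, the option names are the ones listed above, and the check on -numeqfreq simply encodes the requirement stated for -eqfreq=lim. It only builds the argument list and does not run the program.

```python
import shlex


def build_nhphyml_command(sequence_file, tree_file, executable="./nhPhyml", **options):
    """Return the argument list for one nhPhyml run (hypothetical helper)."""
    # The discrete version needs the number of equilibrium frequencies.
    if options.get("eqfreq") == "lim" and "numeqfreq" not in options:
        raise ValueError("-eqfreq=lim requires -numeqfreq")
    # Only the sequence file and the tree file are mandatory.
    command = [executable, f"-sequences={sequence_file}", f"-tree={tree_file}"]
    # Every other option is passed through as -name=value, in the given order.
    command += [f"-{name}={value}" for name, value in options.items()]
    return command


command = build_nhphyml_command(
    "SequenceFile",
    "TreeFile",
    format="i",
    positions=123,
    tstv="e",
    rates=8,
    alpha="e",
    topology="e",
    eqfreq="lim",
    numeqfreq=5,
)
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run(command, check=True) once the executable path is correct for your installation.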
OTHER OPTIONS
Those options are less central but might be useful to some users.
- -precision=0.0001: sets the precision to 0.0001. When optimizing parameters, if the likelihood difference between the former value of the parameter and the new value of the parameter is below the precision value, the maximum is considered to be found, and the optimization stops. By default this value is 0.000001, fairly low. Increasing this value decreases the computational time.
- -quick=y: the program does not perform a final optimization of the parameters. Hence the topology obtained is the most likely one found by the program, but parameters such as branch lengths are not fully optimized.
- -gcvar=y: the root G+C content is not optimized; instead a range of values is tried, and for each of these values the likelihood is maximized by optimizing the free parameters. This gives an idea of the variance of the root G+C content estimate. Moreover, the resulting LNF file can be used as input to CONSEL (Shimodaira and Hasegawa, 2001) to define a confidence interval with e.g. the AU test.
- -gclow=0.50: sets the lower limit of the root G+C contents to be tried to .50. Values tried are .51, .52, .53... until the upper limit is met.
- -gcupp=0.70: sets the upper limit of the root G+C contents to be tried to .70.
- -outseqs=y means that we want to get the ancestral sequence at the root node (put n otherwise). This feature has not been tested and should not be used without extreme caution. When the user does not want to use all the positions in codons, the ancestral sequence cannot be reconstructed.
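To make the effect of -precision more concrete, here is a toy illustration of the stopping rule described above: the optimization of a parameter stops as soon as the likelihood changes by less than the precision between two successive estimates. This is only a sketch on a made-up one-parameter function using a plain ternary search; it is not the optimizer used in nhPhyML, and all names are hypothetical.

```python
def optimize_parameter(log_lik, low, high, precision=1e-6, max_iter=200):
    """Maximize a unimodal function on [low, high] by ternary search.

    Stops when the likelihood difference between the former and the new
    estimate of the parameter falls below `precision`.
    """
    old_value = (low + high) / 2
    old_lk = log_lik(old_value)
    new_value, new_lk, iterations = old_value, old_lk, 0
    for iterations in range(1, max_iter + 1):
        third = (high - low) / 3
        left, right = low + third, high - third
        # Discard the third of the interval that cannot contain the maximum.
        if log_lik(left) < log_lik(right):
            low = left
        else:
            high = right
        new_value = (low + high) / 2
        new_lk = log_lik(new_value)
        if abs(new_lk - old_lk) < precision:
            break
        old_value, old_lk = new_value, new_lk
    return new_value, new_lk, iterations


def toy_log_lik(x):
    """Made-up likelihood surface with its maximum at x = 0.3."""
    return -(x - 0.3) ** 2


loose = optimize_parameter(toy_log_lik, 0.0, 1.0, precision=1e-4)
strict = optimize_parameter(toy_log_lik, 0.0, 1.0, precision=1e-8)
print("iterations with loose precision:", loose[2])
print("iterations with strict precision:", strict[2])
```

A larger precision value stops the search earlier, which is why increasing it decreases the computational time, at the cost of a less finely tuned estimate.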
OUTPUT
5 or 6 files are produced in the directory containing the "SequenceFile" :
- -SequenceFile_nhPhyml.lk possesses 2 lines: the first one displays the final likelihood of the output tree, the second one gives the final estimate of the root G+C content.
- -SequenceFile_nhPhyml.out provides general information concerning the phylogenetic reconstruction, such as the input files, the options, the number of rate categories used, the final likelihood and how long the run took.
- -SequenceFile_nhPhymlEq.tree is the final tree on which are displayed as bootstrap values the equilibrium G+C contents in each branch.
- -SequenceFile_nhPhymlGC.tree is the same final tree on which are displayed as bootstrap values the G+C contents at each node in the tree, except at the root. For the G+C content of the root, please check SequenceFile_nhPhyml.lk, second line.
- -SequenceFile_nhPhyml.seq contains a text representation of the tree displaying the labels of all the nodes (numbers or sequence names). Then the present sequences are displayed, in a fasta-like format, together with the root sequence.
- -SequenceFile_nhPhyml.lnf is a simplified PAML-like (Yang, 1997) file displaying the site likelihoods. It can be used as input to CONSEL (Shimodaira and Hasegawa, 2001) to compare various trees.
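The files listed above can be located and read from a script. The sketch below derives the expected output paths from the sequence file name and reads the two values stored in the .lk file. The exact layout of each line is an assumption: the code takes the last number found on the first line as the likelihood and the last number on the second line as the root G+C content. Function names are hypothetical, and the demonstration uses made-up values written to a temporary file.

```python
import re
import tempfile
from pathlib import Path

# Matches plain or scientific-notation numbers such as -1234.56, .45 or 1e-5.
NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")

# File name endings taken from the OUTPUT section.
OUTPUT_SUFFIXES = {
    "lk": "_nhPhyml.lk",
    "out": "_nhPhyml.out",
    "eq_tree": "_nhPhymlEq.tree",
    "gc_tree": "_nhPhymlGC.tree",
    "seq": "_nhPhyml.seq",
    "lnf": "_nhPhyml.lnf",
}


def output_paths(sequence_file):
    """Map each output kind to its expected path next to the sequence file."""
    base = str(sequence_file)
    return {kind: Path(base + suffix) for kind, suffix in OUTPUT_SUFFIXES.items()}


def last_number(line):
    """Return the last number on a line (assumed to be the value of interest)."""
    matches = NUMBER.findall(line)
    if not matches:
        raise ValueError(f"no number found in line: {line!r}")
    return float(matches[-1])


def read_lk(lk_path):
    """Return (likelihood, root G+C content) from a .lk file."""
    lines = [line for line in Path(lk_path).read_text().splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("expected two non-empty lines in the .lk file")
    return last_number(lines[0]), last_number(lines[1])


# Demonstration on a made-up .lk file (placeholder values, not real results).
with tempfile.TemporaryDirectory() as tmp:
    paths = output_paths(Path(tmp) / "SequenceFile")
    paths["lk"].write_text("-1234.5\n0.45\n")
    likelihood, root_gc = read_lk(paths["lk"])
print(likelihood, root_gc)
```

Depending on the run, not all six files are necessarily present, so check that a path exists before reading it.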
ADVICE
The tree space exploration is done as in PhyML v.2.2, by Nearest Neighbor Interchanges (NNIs). These topological rearrangements are local, and do not permit testing topologies distant from the input one, especially when the number of sequences is large. Therefore, it is recommended that you run the program using many different input trees. The resulting trees can be compared using CONSEL with the help of the LNF files.
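Running the program from many starting trees is easy to script. The sketch below is one possible way to do it, under several assumptions: each run gets its own copy of the sequence file (since output files are written next to it and would otherwise overwrite each other), the first line of the .lk file starts with a bare number, and higher likelihood is better. The function name and the runner argument are hypothetical. The demonstration replaces the real program with a stand-in that writes made-up likelihoods, so it runs without nhPhyml installed.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_from_start_trees(sequence_file, tree_files, work_dir,
                         executable="./nhPhyml", runner=subprocess.run):
    """Run nhPhyml once per starting tree and rank runs by final likelihood."""
    results = []
    for index, tree_file in enumerate(tree_files):
        run_dir = Path(work_dir) / f"run_{index}"
        run_dir.mkdir(parents=True, exist_ok=True)
        # Private copy of the sequences so each run has its own output files.
        local_sequences = run_dir / Path(sequence_file).name
        shutil.copy(sequence_file, local_sequences)
        command = [executable, f"-sequences={local_sequences}", f"-tree={tree_file}"]
        runner(command, check=True)
        lk_file = Path(str(local_sequences) + "_nhPhyml.lk")
        # Assumption: the first token of the file is the final likelihood.
        likelihood = float(lk_file.read_text().split()[0])
        results.append((likelihood, str(tree_file), run_dir))
    # Best (highest) likelihood first.
    return sorted(results, key=lambda result: result[0], reverse=True)


def fake_runner(command, check=True):
    """Stand-in for nhPhyml: writes a made-up .lk file (placeholder values)."""
    sequences = command[1].split("=", 1)[1]
    tree_name = Path(command[2].split("=", 1)[1]).name
    made_up = {"start_a.tree": -1050.0, "start_b.tree": -1000.0}
    Path(sequences + "_nhPhyml.lk").write_text(f"{made_up[tree_name]}\n0.5\n")


with tempfile.TemporaryDirectory() as tmp:
    sequence_path = Path(tmp) / "SequenceFile"
    sequence_path.write_text("placeholder alignment\n")
    ranking = run_from_start_trees(
        sequence_path, ["start_a.tree", "start_b.tree"], tmp, runner=fake_runner
    )
print("best starting tree:", ranking[0][1], "with likelihood", ranking[0][0])
```

With the default runner the real executable is called instead. The ranking by likelihood is only a first look; the .lnf files in each run directory are what CONSEL needs for a proper comparison.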
CITATION
Please cite the following article when using nhPhyML:
- Boussau B, Gouy M (2006). "Efficient Likelihood Computations with Non-Reversible Models of Evolution.", Syst Biol. 2006, 55(5):756-68.
I believe it would also be good to cite the articles by Galtier and Gouy (1998), as the model used in nhPhyML comes from this work, and by Guindon and Gascuel (2003), as much of the nhPhyML code comes from PhyML.
REFERENCES
- Galtier N, Gouy M (1998). "Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis", Mol Biol Evol. 1998 Jul;15(7):871-9.
- Guindon S, Gascuel O (2003). "A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood", Syst Biol. 2003 Oct;52(5):696-704.
- Shimodaira H, Hasegawa M (2001). "CONSEL: for assessing the confidence of phylogenetic tree selection", Bioinformatics. 2001 Dec;17(12):1246-7.
- Yang Z (1997). "PAML: a program package for phylogenetic analysis by maximum likelihood", Computer Applications in BioSciences 13:555-556.