[Biopython] introducing PAML for Biopython

Brandon Invergo b.invergo at gmail.com
Tue Jun 14 13:15:52 UTC 2011


Hi everyone,

I have written a Python interface to the PAML (Phylogenetic Analysis
by Maximum Likelihood) package of programs by Ziheng Yang which is in
the process of being added to Biopython. In advance of adding it, it
would be great if some people would be willing to take it for a
test-drive to make sure things are working the way you would like them
to, that it's useful, that it's not buggy, etc. For the moment, it can
be had as a branch of Biopython here:
https://github.com/brandoninvergo/biopython/tree/paml-branch
(If you click Downloads you can download the source code as a .tar.gz file)

The PAML interface is located in Bio.Phylo.PAML and includes the
following modules: codeml, baseml, yn00 and chi2. I have not written
implementations of the evolver or mcmctree programs from the package.
Evolver requires menu interaction from the user, so it can't be
scripted easily. As for mcmctree, I will profess complete ignorance in
how people use the program, so I was not sure of the best way to
implement it (particularly in parsing the results). If you use
mcmctree and you would like me to implement an interface to it, feel
free to contact me and we can discuss it.


If you're interested in using the library, I have not yet written
documentation, however usage would be something like this (codeml,
baseml, and yn00 all function similarly):

> from Bio.Phylo.PAML import codeml
> cml = codeml.Codeml()
> cml.alignment = "path/to/alignment"  #can use either relative or absolute paths; they all get converted to relative paths to avoid the path-length limits imposed in PAML
> cml.tree = "path/to/tree"
> cml.working_dir = "path/to/working_directory"
> cml.out_file = "path/to/output_file"
#or, alternatively:
> cml = codeml.Codeml(alignment="path/to/alignment", tree="path/to/tree", working_dir="path/to/working_dir", out_file="path/to/out_file")

# view all options
> cml.print_options()

# read in an existing control file
> cml.read_ctl_file("path/to/control_file")

# set an option
> cml.set_option("clock", 1)
> cml.set_option("NSsites", [0,1,2])
> cml.set_option("aaRatefile", None)

# get an option value
> cml.get_option("clock")

# write all options to a control file (this is done automatically when
you do the run() method, so you probably won't have to do this)
> cml.ctl_file = "path/to/ctl_file"
> cml.write_ctl_file()


# run the program, which returns the results in dict format
> results = cml.run()

# or, to see all of codeml's output to the screen
> results = cml.run(verbose=True)

# or, to specify the location of the executable (ie if it's not in
your path or if you use multiple versions of it)
> results = cml.run(command = "path/to/codeml")

# or, to skip parsing the results
> cml.run(parse = False)

# parse an existing results file
> results = codeml.read("path/to/results_file")



The results are stored in a giant dictionary. I will have to describe
all the contents in the documentation but for now I would recommend
just exploring it to see what's there. For each program, I tried to
parse out as much as possible, on the assumption that I don't know
what *you* need to know from the output file. So, the results dict
probably contains far more than anyone needs. Still, if you find that
you need something that has not been parsed, please let me know and
I'll try to implement it  (I have not parsed Naive Emperical Bayes or
Bayes Empirical Bayes results, for example, or the various codon usage
statistics in the beginning of the file).
nssites = results.get("NSsites")
m0 = nssites.get(0)
m0_maxlnl = m0.get("max lnL")
etc...

I do recommend using the results.get(key) method rather than
results[key] because if codeml encounters an error, the keys won't be
added to the results dictionary and results[key] will raise an
exception.

At the moment, the chi2 program in PAML doesn't properly take
command-line arguments so it's not easily scriptable. Since using PAML
programs calls for a lot of likelihood ratio testing, I went ahead and
reimplemented it in pure Python from the original C code (with
permission). In most cases, it works fine however I have found that if
you have many degrees of freedom, such as in the case of the FMutSel
model testing (41 df), it takes an unacceptably long time to compute.
I've been told that the next version of PAML will include a chi2 which
takes both the test statistic and the d.f. as command line arguments,
so I'll be able to just write an interface to it.

Ok, I think that sums it up for now. I hope that you find this to be
useful! Please let me know if you have any problems, suggestions,
bugs, etc., especially in the parsing!

Thanks!
Brandon Invergo



More information about the Biopython mailing list