[Biopython-dev] Parsing PAML supplementary output

Tue Oct 11 05:33:13 EDT 2011

> Something like that as a new format variant, yes.
> 
> > ...
> 
> Practical.
> 
Ok, I'll start working on that then.

> Do we need to write the "I" flag in our PHYLIP output?

It took me a while to hunt down information on PHYLIP flags. I found
this link which mentions them:
http://www.no.embnet.org/phylipdoc/
They're only used by the program that takes the alignment as input,
and they correspond to the PHYLIP programs' menu options. In general,
they have no effect on the format of the alignment (aside from the
'S'/sequential vs 'I'/interleaved flags). However, some of them might
require extra information immediately below the header line, before the
alignment starts. This complicates things (see below for some PAML
examples).

However, since there's no real standardization in the use of the PHYLIP
format, not all programs pay attention to these flags. In my own work,
I've used TCoffee to generate interleaved alignments and then had to
add in the 'I' after the fact. As another example, the current Biopython
PhylipIO would not recognize a header line with options as a valid
header line, since there would be more than 2 "parts".
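
To make the header problem concrete, here is a rough sketch of a header
reader that tolerates trailing flags. It is not the current PhylipIO
code; the function name parse_phylip_header is made up for illustration,
and it assumes that anything after the two integers is an option flag.

```python
def parse_phylip_header(line):
    """Split a PHYLIP header line into (n_seqs, seq_length, options).

    Hypothetical helper: everything after the two integers is assumed
    to be option flags (e.g. 'I', 'G', 'GC') and is returned as one
    string without interpreting it.
    """
    parts = line.split()
    if len(parts) < 2:
        raise ValueError("Header needs a sequence count and a length")
    n_seqs = int(parts[0])
    seq_length = int(parts[1])
    # Join any remaining tokens so that 'G C' and 'GC' look the same.
    options = "".join(parts[2:])
    return n_seqs, seq_length, options
```

With this, a plain header such as "5 300" simply yields an empty option
string, so files without flags are handled the same way as before.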

So, if some programs can take option flags (at least the PHYLIP and PAML
programs) while other programs may not like their inclusion, the flags
would need to be treated specially. I would suggest that the
PhylipIterator classes be modified to recognize the existence of
options, but not necessarily do anything with them, and that the
PhylipWriter classes be modified to optionally take a string containing
option flags to append to the header line, i.e. 'I', 'GC', etc.
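
On the writing side, the change could be as small as the sketch below.
This is only an illustration of the idea: format_phylip_header and its
options argument are hypothetical names, not part of the existing
PhylipWriter, and the exact header spacing is an assumption.

```python
def format_phylip_header(n_seqs, seq_length, options=""):
    """Build a PHYLIP header line, optionally followed by option flags.

    Hypothetical helper: 'options' is a caller-supplied string such as
    'I' or 'GC'. It is written as given and not validated.
    """
    header = "%d %d" % (n_seqs, seq_length)
    if options:
        header += " " + options
    return header + "\n"
```

Leaving the default as an empty string means the output is unchanged for
programs that do not like the flags.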

As for the supplementary information for the options, I'm not sure if
those complicate matters beyond the scope of Biopython's intended
functionality, or whether there should be yet another optional string
argument to the writer. The PhylipIterators would then need to be
modified to handle the possible existence of these supplementary lines
as well.

Anyway, I don't think this is an immediate concern and I personally
wouldn't approach it until I start working on the idea of better
integrating the PAML module with the rest of Biopython.

-brandon

Here are some examples:
5 895 G
G 4 3
123123123123123123123123123123123123123123123123123123123123
123123123123123123123123123123123123123123123123123123123123
123123123123123123123123123123123123123123123123123123123123
123123123123123123123123123123123123123123123123123123123123
123123123123123123123123123123123123123123123123123123123123
123123123123123123123123123123123123123123123123123123123123
123123123123123123123123123123123123123123123123123123123123
1231231231231231231231231231231231231
444444444444444444444444444444444444444444444444444444444444
444444444444444444444444444444444444444444444444444444444444
444444444444444444444444444444444444444444444444444444444444
444444444444444444
123123123123123123123123123123123123123123123123123123123123
123123123123123123123123123123123123123123123123123123123123
123123123123123123123123123123123123123123123123123123123123
12312312312312312312312312312312312312312312312312312312312 
Human
AAGCTTCACCGGCGCAGTCATTCTCATAATCGCCCACGGACTTACATCCTCATTACTATT
CTGCCTAGCAAACTCAAACTACGAACGCACTCACAGTCGCATCATAATC........
Chimpanzee .........

"The first line of the file contains the option character G. The second
line begins with a G at the first column, followed by the number of site
classes. The following lines contain the site marks, one for each site
in the sequence (or each codon in the case of codonml). The site mark
specifies which class each site is from. If there are g classes, the
marks should be 1, 2, ..., g, and if g > 9, the marks need to be
separated by spaces. The total number of marks must be equal to the
total number of sites in each sequence."
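
A reader that wanted to handle these site marks could follow the quoted
rules roughly as in this sketch. The function name and arguments are
hypothetical; the caller is assumed to have already read the number of
classes from the "G" line and to pass the number of sites (or codons,
for codon data) from the header.

```python
def parse_site_marks(mark_lines, n_classes, n_sites):
    """Collect site marks from the lines following the 'G' line.

    Hypothetical helper. With more than 9 classes the marks are
    space-separated integers; otherwise each non-blank character is
    one mark.
    """
    marks = []
    for line in mark_lines:
        if n_classes > 9:
            marks.extend(int(tok) for tok in line.split())
        else:
            marks.extend(int(ch) for ch in line if not ch.isspace())
    # The total number of marks must equal the number of sites.
    if len(marks) != n_sites:
        raise ValueError(
            "Expected %d site marks, found %d" % (n_sites, len(marks))
        )
    # Every mark must name one of the classes 1..n_classes.
    for mark in marks:
        if not 1 <= mark <= n_classes:
            raise ValueError("Site mark %d is out of range" % mark)
    return marks
```

The count check is the part that matters most in practice, since a
single missing mark would otherwise shift every later site into the
wrong class.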

********

5 1000 G
G 4 100 200 300 400 
Sequence 1
TCGATAGATAGGTTTTAGGGGGGGGGGTAAAAAAAAA.......

"This [alignment has 5 sequences of] 1000 nucleotides from 4 genes,
obtained from concatenating four genes with 100, 200, 300, and 400
nucleotides from genes 1, 2, 3, and 4, respectively. The"

********

5 855 GC 
human          GTG CTG TCT CCT ...

5 sequences, 855 nucleotides; the length must be a multiple of three.

********

5 300 G 
G2 40 60

sequence1
.....

"This data set has 5 sequences, each of 300 nucleotides (100 codons),
which are partitioned into two genes, with the first gene having 40
codons and the second gene 60 codons."
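
The gene-length form of the "G" line (as in "G 4 100 200 300 400" and
"G2 40 60" above) could be checked along these lines. This is a sketch
only: parse_gene_lengths and its unit argument are hypothetical, with
unit=3 assumed when the lengths are given in codons. It does not cover
the site-mark form from the first example.

```python
import re


def parse_gene_lengths(line, seq_length, unit=1):
    """Parse a 'G <n> <len1> ... <lenN>' line and check the totals.

    Hypothetical helper. 'unit' is the number of nucleotides per
    counted item: 1 for nucleotides, 3 when lengths are in codons.
    """
    match = re.match(r"G\s*(\d+)((?:\s+\d+)*)\s*$", line)
    if match is None:
        raise ValueError("Not a gene-length option line: %r" % line)
    n_genes = int(match.group(1))
    lengths = [int(tok) for tok in match.group(2).split()]
    if len(lengths) != n_genes:
        raise ValueError(
            "Expected %d gene lengths, found %d" % (n_genes, len(lengths))
        )
    # The gene lengths must add up to the length given in the header.
    if sum(lengths) * unit != seq_length:
        raise ValueError("Gene lengths do not add up to the sequence length")
    return lengths
```

For the last example this would be called with seq_length=300 and
unit=3, since the 40 and 60 are counted in codons.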