[Bioperl-l] PAML/Codeml parsing

Stefan Kirov stefan.kirov at bms.com
Wed Dec 5 20:33:47 UTC 2007


Done.

Jason Stajich wrote:
> sounds good - can you
> - make it as a bug with the patch and sample files in bugzilla
> - commit changes and I'll test as well
>
> thanks,
> -j
>
> On Dec 5, 2007, at 11:56 AM, Stefan Kirov wrote:
>
>   
>> Here is a patch that seems to be working and does not break the  
>> existing
>> tests:
>>
>> --- /home/kirovs/bioperl-live/Bio/Tools/Phylo/PAML.pm   2007-12-05
>> 10:16:53.120720000 -0500
>> +++ /home/kirovs/bioperl/bioperl-live/Bio/Tools/Phylo/PAML.pm
>> 2007-12-05 14:46:31.436278000 -0500
>> @@ -419,7 +419,10 @@
>>      # CODONML (in paml 3.12 February 2002)  <<-- what we want to see!
>>
>>      my $SEQTYPES = qr( (?: (?: CODON | AA | BASE | CODON2AA ) ML ) |
>> YN00 )x;
>> +    my $line;
>> +    $self->{'_already_parsed_seqs'}=$self-> 
>> {'_already_parsed_seqs'}?1:0;
>>      while ($_ = $self->_readline) {
>> +           $line++;
>>         if ( m/^($SEQTYPES) \s+                      # seqtype:  
>> CODONML,
>> AAML, BASEML, CODON2AAML, YN00, etc
>>                (?: \(in \s+ ([^\)]+?) \s* \) \s* )?  # version: "paml
>> 3.12 February 2002"; not present < 3.1 or YN00
>>                (\S+) \s*                             # tree filename
>> @@ -436,8 +439,11 @@
>>         } elsif (m/^Data set \d$/) {
>>             $self->{'_summary'} = {};
>>             $self->{'_summary'}->{'multidata'}++;
>> -       } elsif( m/^Before\s+deleting\s+alignment\s+gaps/ ) {
>> -           my ($phylip_header) = $self->_readline;
>> +       }
>> +       elsif( m/^Before\s+deleting\s+alignment\s+gaps/ ) {#Gap
>> +               my ($phylip_header) = $self->_readline;
>> +               $self->_parse_seqs;
>> +       } elsif (($line>3)&&($self->{'_already_parsed_seqs'}!=1))  
>> {#No gap
>>             $self->_parse_seqs;
>>         }
>>      }
>> @@ -681,7 +687,6 @@
>>  }
>>
>>  sub _parse_seqs {
>> -
>>      # this should in fact be packed into a Bio::SimpleAlign object
>> instead of
>>      # an array but we'll stay with this for now
>>      my ($self) = @_;
>>
>>
>> What this does is trigger sequence parsing if the /Before.../  
>> pattern is
>> not seen until line 4. Since phylip_header seems to be doing  
>> nothing one
>> could completely eliminate the first seq parse elsif (even though
>> counting lines is not a good thing).
>>  Since I am not aware of all consequences of changing the sequence
>> parsing and I have no idea how extensive the tests are, I am not
>> committing anything, but feel free to use that if you wish.
>> Stefan
>>
>> Stefan Kirov wrote:
>>     
>>> Jason,
>>> When there is a gapless alignment we have a differently formatted  
>>> output
>>> from codeml:
>>> kirovs at horta:~/AESIG> head -n 10 feJRfxQl8D/mlc
>>>
>>> seed used = 492211105
>>>       3    141
>>>
>>> ENSRNOE00000058637               GCG AGC AAG TGT GAC AGC CAT GGC  
>>> ACC CAC
>>> CTA GCA GGT GTG GTC AGC GGC CGG GAT GCT GGT GTG GCC AAG GGC ACC  
>>> AGT CTG
>>> CAC AGT CTG CGT GTG CTC AAC TGT CAA GGG AAG GGC ACA GTC AGC GGC  
>>> ACC CTC ATA
>>> ENSMUSE00000366347               GCG AGC AAG TGT GAC AGC CAC GGC  
>>> ACC CAC
>>> CTG GCA GGT GTG GTC AGC GGC CGG GAT GCT GGT GTG GCC AAG GGC ACC  
>>> AGC CTG
>>> CAC AGC CTG CGT GTG CTC AAC TGT CAA GGG AAG GGC ACA GTC AGC GGC  
>>> ACC CTC ATA
>>> ENSE00001279150                  GCC AGC AAG TGT GAC AGT CAT GGC  
>>> ACC CAC
>>> CTG GCA GGG GTG GTC AGC GGC CGG GAT GCC GGC GTG GCC AAG GGT GCC  
>>> AGC ATG
>>> CGC AGC CTG CGC GTG CTC AAC TGC CAA GGG AAG GGC ACG GTT AGC GGC  
>>> ACC CTC ATA
>>>
>>> And parsing this fails...
>>> The next one has gaps and works fine:
>>>
>>> kirovs at horta:~/AESIG> head -n 10 4z6ZX7s1B6/mlc
>>>
>>> seed used = 492252697
>>>
>>> Before deleting alignment gaps
>>>       2    162
>>>
>>> ENSMUSE00000460297               AAT ATC GAT ACA TTT TAC AAG GAG  
>>> GCA GAA
>>> AAG AAG CTT ATA CAC GTG CTT GAG GGA GAC AGT CCC AAG TGG TCC ACA  
>>> CCG AAC
>>> AAA GAC CCC ACC CGA GAG CCC CAT GCA GCC TCC ACT TGC TGT GCT TCA  
>>> GAT CTC
>>> CTT GGT TCA GGA GGT CAG TTC CTG
>>> ENSE00000939192                  AAT ATT GAC ATA CTT TGC AAT GAA  
>>> GCA GAA
>>> AAC AAG CTT ATG CAT ATA CTG CAT GCA AAT GAT CCC AAG TGG TCC ACC  
>>> CCA ACT
>>> AAA GAC TGT ACT TCA GGG CCG TAC ACT GCT CAA ATC --- --- --- ---  
>>> --- ATT
>>> CCT GGT ACA GGA AAC AAG CTT CTG
>>>
>>> I will send both whole files as an attachment with another mail (I do
>>> not know if these are going to pass through).
>>> My guess is that the whole _parse_summary method has to be re- 
>>> worked as
>>> there is no tag to look for before the sequences start. Ugly.
>>> I am not sure what else could become broken if I try to fix it, so I
>>> will leave it to you.
>>> Stefan
>>>
>>>       
>>>> should be fixed.
>>>>
>>>> $ cvs log -r HEAD Bio/Tools/Phylo/PAML.pm
>>>> revision 1.56
>>>> date: 2007/11/01 14:52:56;  author: jason;  state: Exp;  lines:  
>>>> +21 -14
>>>> Parsing PAML4 and PAML3.15 should work now.  Dealing with variable
>>>> order for the sequences and summary results in
>>>> the top of the MLC files
>>>>
>>>> On Dec 4, 2007, at 11:25 AM, Stefan Kirov wrote:
>>>>
>>>>
>>>>         
>>>>> Jason Stajich wrote:
>>>>>
>>>>>           
>>>>>> PAML4 breaks our PAML parser right now because the order of  
>>>>>> things in
>>>>>> the result file has changed.  Now sequences precede the  
>>>>>> information
>>>>>> about the version or the program run.  This means that $result-
>>>>>>
>>>>>>             
>>>>>>> get_seqs() fails because we don't parse the sequences.
>>>>>>>
>>>>>>>               
>>>>>> We'll see what we can do, but as usual with supporting 3rd party
>>>>>> programs it is brittle when file formats change.  Th
>>>>>>
>>>>>> -jason
>>>>>>
>>>>>> -- 
>>>>>> Jason Stajich
>>>>>> jason at bioperl.org
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioperl-l mailing list
>>>>>> Bioperl-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>> Jason,
>>>>> I saw a commit after this post on codeml, but not on PAML.pm- I  
>>>>> assume
>>>>> this is not fixed, am I correct?
>>>>> Thanks!
>>>>> Stefan
>>>>>
>>>>>           
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>>>       
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>   




More information about the Bioperl-l mailing list