[Bioperl-l] Odd problem with get_tag_values

Francisco J. Ossandón fossandonc at hotmail.com
Sun Feb 26 19:08:55 UTC 2012


I have been parsing Genbank files with Bioperl for a long time now, so for
many types of data I wrote a code that always checks  if the data exists
before asking for it, to avoid scripts crashes/warnings (I remember I made
customs GBKs deleting specific lines to manually check for crashes). I use
that in all my programs and they work fine. I use Perl ternaries for most
one-liners:

my $version    = $seq_obj->version || '';
my $definition = $seq_obj->desc    || '';
my $dna_shape  = $seq_obj->is_circular ? 'circular' : 'linear';

my $prot_id = $feat->has_tag('protein_id') ?
($feat->get_tag_values('protein_id'))[0] : '';
my $product = $feat->has_tag('product')    ?
($feat->get_tag_values('product'))[0]    : '';

I usually try to code defensively when parsing Genbanks, expecting every
step to go wrong. For example, try to parse “Escherichia coli str. K-12
substr. MG1655 chromosome” (NC_000913.gbk), and you will see many things go
wrong (like CDS without /protein_id or /product tags). ;)

Francisco J. Ossandon


-----Mensaje original-----
De: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] En nombre de Fields,
Christopher J
Enviado el: viernes, 24 de febrero de 2012 20:29
Para: Adlai Burman
CC: <bioperl-l at lists.open-bio.org>; Jason Stajich
Asunto: Re: [Bioperl-l] Odd problem with get_tag_values

No problem. This highlights the #1 problem with genbank files (and the
reasons we keep things as simple as possible), namely lack of consistency.
Not a problem with NCBI per se, but a problem nonetheless. 

Chris

On Feb 24, 2012, at 5:07 PM, "Adlai Burman" <adlai at refenestration.com>
wrote:

> I apologize, Chris. That was a terrible example I sent. I accidentally
sent the only record for which there actually are no regular 'gene' tags.
Other gb records that failed before Jason's tag check include NC_005086 and
they all have regular 'gene' tags. I should have referenced that one.
> I appreciate your checking into that and sorry about the red herring. Boy
is my face blushed.
> 
> A.
> 
> 
> On Feb 24, 2012, at 11:38 PM, Fields, Christopher J wrote:
> 
>> There is possibly a slight disconnect here.  The primary tag you are
looking for is 'CDS'.  Here is an example of both the primary tag for 'CDS'
and 'gene' from NC_012927:
>> 
>>    gene            complement(join(91716..92516,69640..69753))
>>                    /locus_tag="BaolC_p001"
>>                    /trans_splicing
>>                    /db_xref="GeneID:8223103"
>>    CDS
complement(join(91716..91744,92285..92516,69640..69753))
>>                    /locus_tag="BaolC_p001"
>>                    /trans_splicing
>>                    /codon_start=1
>>                    /transl_table=11
>>                    /product="ribosomal protein S12"
>>                    /protein_id="YP_003029720.1"
>>                    /db_xref="GI:253729537"
>>                    /db_xref="GeneID:8223103"
>>
/translation="MPTVKQLIRNARQPIRNARKSAALKGCPQRRGTCARVYTINPKK
>>
PNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYRIIRGTL
>>                    DAVAVKNRQQGRSKYGVKKPKK"
>> 
>> This 'CDS' example has no 'gene' regular tag, in fact none that I looked
at did.  But there is another feature (above it) that does have the
*primary* tag 'gene' and same locus_tag and dbxref GeneID (it does have tags
such as 'locus_tag', 'trans_splicing', etc).  Is that what you mean?
>> 
>> The way that BioPerl deals with this is to return two different
independent features, one with the 'CDS' primary tag and one with the 'gene'
primary tag.  Past attempts to somehow combine these (and then disambiguate
them again later if needed for output) can be problematic, or at least they
once were.
>> 
>> chris
>> 
>> 
>> On Feb 24, 2012, at 3:55 PM, Adlai Burman wrote:
>> 
>>> Jason,
>>> Your first solution, indeed, did the trick (though I'm not sure why).
There was no need to for checking "else." I'm not sure why some records with
a full set of "gene" tags would not parse without the check, but everything
parsed with it.
>>> 
>>> Brian, you were right.
>>> 
>>> Thanks again,
>>> 
>>> Adlai
>>> On Feb 24, 2012, at 10:21 PM, Jason Stajich wrote:
>>> 
>>>> not all CDS will be annotated with a 'gene' tag, this is due to
variation in how annotation is done and that there is not a requirement that
there be a gene tag for all CDS features.
>>>> 
>>>> You can protect your query - we often do this when dealing with data
from the wild by testing for has_tag first.
>>>> 
>>>> my %strands;
>>>> for my $cds ( grep {$_->primary_tag eq 'CDS' } 
>>>> Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures ) { if(
$cds->has_tag('gene') ) {
>>>>    my ($gene) = $cds->get_tag_values('gene'); # get the 1st one, this
returns a list
>>>>    $strands{$gene} = $cds->strand; } else { # look in alternative 
>>>> places for a name, e.g. locus, ...
>>>> }
>>>> }
>>>> 
>>>> An alternative is to loop through your list of tags in order of 
>>>> preference
>>>> 
>>>> my %strands;
>>>> for my $cds ( grep {$_->primary_tag eq 'CDS' } 
>>>> Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures ) { for 
>>>> my $tag ( qw(gene locus name product accession note) ) { if(
$cds->has_tag($tag) ) {
>>>>    my ($name) = $cds->get_tag_values($tag); # get the 1st one, this
returns a list
>>>>    $strands{$name} = $cds->strand;
>>>>     $seen = 1;
>>>>     last;
>>>> }
>>>> if( ! $seen ) { 
>>>>    warn("not tag found for feature at ", 
>>>> $cds->location->to_FTstring, "\n"); } }
>>>> 
>>>> On Feb 24, 2012, at 12:43 PM, Adlai Burman wrote:
>>>> 
>>>>> I have come across a perplexing problem with trying to parse sequence
features into hashes from gb records. This is the minimal code which shows
my problem:
>>>>> 
>>>>> #!/usr/bin/perl

>>>>> use strict;
>>>>> use warnings;
>>>>> use IO::String;
>>>>> use Bio::Perl;
>>>>> use Bio::SeqIO;
>>>>> use IO::String;
>>>>> 
>>>>> my @files = </Users/adlai/Dropbox/atrsh/*>; foreach my 
>>>>> $file(@files){
>>>>> 
>>>>> 
>>>>> my @cds_features = grep {$_->primary_tag eq 'CDS' } 
>>>>> Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures;
>>>>> my %strands = map {$_->get_tag_values('gene'), $_->strand}
@cds_features; ##This Is The Culprit. 
>>>>> .
>>>>> .
>>>>> .
>>>>> #do nifty stuff
>>>>> }
>>>>> 
>>>>> For some files this approach works just fine.
>>>>> For others the script dies immediately with the error message:
>>>>> 
>>>>> ------------- EXCEPTION -------------
>>>>> MSG: asking for tag value that does not exist gene STACK 
>>>>> Bio::SeqFeature::Generic::get_tag_values 
>>>>> /Users/adlai/Downloads/BioPerl-1.6.1/Bio/SeqFeature/Generic.pm:517
>>>>> STACK toplevel tosend.pl:16
>>>>> -------------------------------------
>>>>> 
>>>>> The difference in the files that parse and those that don't seems to
be that the files that crash have "intron" and "exon" tags. They ALL have
"gene" tags.
>>>>> Does anyone know why this is a problem and what can be done to
circumvent it?
>>>>> 
>>>>> Thanks,
>>>>> Adlai
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>> 
>>>> Jason Stajich
>>>> jason.stajich at gmail.com
>>>> jason at bioperl.org
>>>> 
>>>> 
>>> 
>>> 
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> 
>> 
> 

_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l




More information about the Bioperl-l mailing list