[Bioperl-l] How does bioperl handle Genbank sequence records? [buggy behavior?]

Hilmar Lapp hlapp@gnf.org
Fri, 2 Aug 2002 13:56:49 -0600


Danny,

many people from the list will currently be attending BOSC as am I ...

join() locations are split locations in Bioperl speak and should be 
handled OK. Specifically, the $feat->location() will return a 
Bio::Location::SplitLocation instance. There is a strategy object 
handling how start() and end() are computed for a split location 
(the default takes the widest range). So, you will get a sequence 
spanning the entire region from the start of the first to the end of 
the last location. At last night's BoF Ewan proposed an additional 
method on features concatenating the sub-sequences for split 
locations ... if you feel up to this, you're more than welcome to 
make a stab at it.

	-hilmar

BTW as for the error message, the problem lies elsewhere: the 
indicated error is indeed there in the printed section: there's an 
unmatched double-quote. However, it shouldn't affect the feature 
you're interested in, as it gently skips over the affected one only.

On Friday, August 2, 2002, at 01:27  PM, Danny Yoo wrote:

>
> Hi eveyrone,
>
> I sent a message about this a week ago, but haven't gotten any 
> responses
> yet.  Can anyone check to see if the Genbank parser is supposed to 
> handle
> 'join(a...b, c..d)' Locations?
>
> Sorry about my impatience; I just want to know if I'm using the module
> correctly.
>
>
>
> ---------- Forwarded message ----------
> Date: Thu, 25 Jul 2002 15:37:02 -0700 (PDT)
> From: Danny Yoo <dyoo@acoma.Stanford.EDU>
> To: bioperl-l@bioperl.org
> Cc: all@acoma.Stanford.EDU
> Subject: How does bioperl handle Genbank sequence records?
>
> Hi everyone,
>
> I'm using Bioperl 1.0 to parse out some sequences from GenBank.  In
> particular, I've been parsing:
>
>
> http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=nucleotide&
> list_uids=13449290&dopt=GenBank
>
> As I was going through the 'RPL2' gene and cds feature, I noticed a
> discrepency in sequence lengths.  Here's the relevant coordinate 
> records
> in Genbank:
>
>      gene            154744..157345
>                      /gene="rpl2"
>      mRNA            join(154744..155660,157213..157345)
>                      /gene="rpl2"
>      CDS             join(154744..155660,157213..157345)
>                      /gene="rpl2"
>
> How does Bioperl handle the 'join(154744..15660,157213..157345)'
> description when extracting sequences?
>
>
>
> Here's a test program I wrote to double check things:
>
> ######
> use strict;
> use Bio::DB::GenBank;
>
> sub getRPL2 {
>     my $gb = Bio::DB::GenBank->new();
>     my $seqobj = $gb->get_Seq_by_acc('NC_001284');
>     for my $feat ($seqobj->top_SeqFeatures()) {
> 	if ($feat->primary_tag eq 'CDS'
> 	    && $feat->gff_string =~ /gene "?rpl2"?/) {
> 	    return $feat;
> 	}
>     }
> }
>
> my $rpl2 = getRPL2();
> my $rpl2_seq = $rpl2->seq->seq;
> print length($rpl2_seq), "\n";
> ######
>
>
> When I run this program:
>
> ###
> $ perl test_genbank.pl
> -------------------- WARNING ---------------------
> MSG: [NC_001284] is not a normal sequence database but a RefSeq entry.
> Redirecting the request.
>
> ---------------------------------------------------
> -------------------- WARNING ---------------------
> MSG: Unbalanced quote in:
> /gene="orf153a"
> /codon_start=1
> /protein_id="NP_085474.1"
> /db_xref="GI:13449291"
> /db_xref="SPTR
> /translation="MSLLFQQTVPLSHLHRSLDPPLCFRTHILLILLLLSRHLPGFTG
> SDCESADPSIVSAIAPGTATTSERDCPVRTAGSDPVPIGDSGTFFDVGTAAPELLSPN
> RHHMITRAKDGIRKPNPRYNLFTQKYTPSEPKTITSASQDGDKLCKKRCRH"
> No further qualifiers will be added for this feature
> ---------------------------------------------------
> 2602
> ###
>
> I get a sequence of length 2602.  Am I using this correctly?
>
>
> It appears that Bioperl is grabbing the sequence from '154744..157345'
> (whose length is 157345 - 154744 + 1), rather than what I had expected
> from 'join(154744..155660,157213..157345)' of length 1051.
>
>
> Thank you for any help you can give about this.
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>
--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------