[Bioperl-l] Bio::DB::SeqFeature sequences with no identifier?

Mark Wilkinson markw at illuminae.com
Thu May 22 08:50:39 EDT 2014


Ok, I can accept that... but it "feels wrong" ;-)

I would have expected the following "order of canon" as far as the 
naming of a sequence:

1)  the FASTA header of the sequence passed to the loading script, 
following the established rules of parsing seqIDs from FASTA headers

2)  the sequence name in the first column of a GFF file (since that's 
what that column contains, by the GFF spec)

...but in the loader, neither of those two well-established 
specifications are used, and rather, a (AFAIK non-standard) directive 
has to be inserted into an (optional) GFF header.

Now that I've solved my problem, I am happily moving on... but the 
process has left me a bit perplexed.  Loading fasta sequences into a DB, 
and then not being able to access them by name, isn't the most intuitive 
behaviour ;-)  ;-)

Anyhooooo... problem solved!  Thanks!

M



On 05/15/2014 04:21 PM, Scott Cain wrote:
> HI Mark,
>
> The sequence has to be identified in some way, either with an explicit 
> GFF line like in my example, with a ##sequence-region directive (which 
> I don't care for but should work for you), or as part of a Target 
> attribute (as part of a match/similarity search result).  I'd say this 
> isn't terribly explicit in the spec but is in the examples.
>
> Scott
>
>
>
> On Thu, May 15, 2014 at 4:59 AM, Mark Wilkinson <markw at illuminae.com 
> <mailto:markw at illuminae.com>> wrote:
>
>     It doesn't.  Thanks! I'll try that when I get in tomorrow.
>
>     (is that a part of the GFF3 spec, or is it a BioPerl "thing"?  is
>     it documented?)
>
>     M
>
>
>
>
>     On 14/05/2014 4:19 PM, Scott Cain wrote:
>>     Hi Mark,
>>
>>     Does you GFF have a line that identifies the reference sequence,
>>     like this:
>>
>>     SEQ1   .    contig    1    123456    .    .  .    Name=SEQ1
>>
>>     If not, that could be the problem.
>>
>>     Scott
>>
>>
>>
>>     On Wed, May 14, 2014 at 5:22 AM, Mark Wilkinson
>>     <markw at illuminae.com <mailto:markw at illuminae.com>> wrote:
>>
>>         Hi all BioPerlers!
>>
>>         I'm confused by something.  In the scenario below I have a
>>         Fasta file and a GFF file:
>>
>>         =========
>>         File:  a.fas
>>
>>         >SEQ1
>>         AAAATTTTCCCCGGGG
>>
>>         =========
>>         File:  b.gff
>>
>>         SEQ1    hit1    match_part    1    5    .    .    .    .
>>         SEQ1    hit2    match_part    6    10    .  .    .    .
>>         =========
>>
>>         I load them into a seqfeature DB:
>>
>>         bp_seqfeature_load.pl <http://bp_seqfeature_load.pl> -d
>>         dbi:mysql:seqdb -c -u root -p pass  a.fas b.gff
>>
>>         I then explore the data as follows:
>>
>>         use Bio::DB::SeqFeature::Store;
>>
>>         my $db = Bio::DB::SeqFeature::Store->new(
>>             -adaptor => 'DBI::mysql',
>>             -dsn     => 'dbi:mysql:seqdb',
>>             -user => 'root',
>>             -password => 'pass');
>>
>>         my $iterator = $db->get_seq_stream();
>>         while (my $feature = $iterator->next_seq){
>>             print $feature->seq->seq;
>>             # THE SEQUENCE IS PRINTED
>>             print " comes from sequence named ";
>>             print $feature->seq->id;
>>             #  THE METHOD ABOVE RETURNS UNDEF
>>         }
>>
>>         my $seq = $db->segment('SEQ1');
>>              # $seq is undef, NOTHING IS RETURNED!?!?
>>
>>         ============
>>
>>         This is all very confusing.  It seems that the feature knows
>>         what sequence it is attached to, because it gives me the
>>         correct string of letters, but it doesn't know what the name
>>         of that sequence is... and in fact, calling the sequence by
>>         name returns undef.
>>
>>         Is this a bug, or is there a reason for this "disconnect"
>>         between a sequence and its name?
>>
>>         Help appreciated!
>>
>>         Mark
>>
>>
>>         _______________________________________________
>>         Bioperl-l mailing list
>>         Bioperl-l at lists.open-bio.org
>>         <mailto:Bioperl-l at lists.open-bio.org>
>>         http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>>
>>
>>     -- 
>>     ------------------------------------------------------------------------
>>     Scott Cain, Ph. D.    scott at scottcain dot net
>>     GMOD Coordinator (http://gmod.org/) 216-392-3087 <tel:216-392-3087>
>>     Ontario Institute for Cancer Research
>
>
>
>     ------------------------------------------------------------------------
>     <http://www.avast.com/> 	
>
>     This email is free from viruses and malware because avast!
>     Antivirus <http://www.avast.com/> protection is active.
>
>
>
>
>
> -- 
> ------------------------------------------------------------------------
> Scott Cain, Ph. D.                                   scott at 
> scottcain dot net
> GMOD Coordinator (http://gmod.org/)    216-392-3087
> Ontario Institute for Cancer Research

-- 
--
Mark Wilkinson
Madrid, Spain



More information about the Bioperl-l mailing list