[Bioperl-l] Bio::DB::SeqFeature sequences with no identifier?
Mark Wilkinson
markw at illuminae.com
Thu May 22 08:50:39 EDT 2014
Ok, I can accept that... but it "feels wrong" ;-)
I would have expected the following "order of canon" as far as the
naming of a sequence:
1) the FASTA header of the sequence passed to the loading script,
following the established rules of parsing seqIDs from FASTA headers
2) the sequence name in the first column of a GFF file (since that's
what that column contains, by the GFF spec)
...but in the loader, neither of those two well-established
specifications are used, and rather, a (AFAIK non-standard) directive
has to be inserted into an (optional) GFF header.
Now that I've solved my problem, I am happily moving on... but the
process has left me a bit perplexed. Loading fasta sequences into a DB,
and then not being able to access them by name, isn't the most intuitive
behaviour ;-) ;-)
Anyhooooo... problem solved! Thanks!
M
On 05/15/2014 04:21 PM, Scott Cain wrote:
> HI Mark,
>
> The sequence has to be identified in some way, either with an explicit
> GFF line like in my example, with a ##sequence-region directive (which
> I don't care for but should work for you), or as part of a Target
> attribute (as part of a match/similarity search result). I'd say this
> isn't terribly explicit in the spec but is in the examples.
>
> Scott
>
>
>
> On Thu, May 15, 2014 at 4:59 AM, Mark Wilkinson <markw at illuminae.com
> <mailto:markw at illuminae.com>> wrote:
>
> It doesn't. Thanks! I'll try that when I get in tomorrow.
>
> (is that a part of the GFF3 spec, or is it a BioPerl "thing"? is
> it documented?)
>
> M
>
>
>
>
> On 14/05/2014 4:19 PM, Scott Cain wrote:
>> Hi Mark,
>>
>> Does you GFF have a line that identifies the reference sequence,
>> like this:
>>
>> SEQ1 . contig 1 123456 . . . Name=SEQ1
>>
>> If not, that could be the problem.
>>
>> Scott
>>
>>
>>
>> On Wed, May 14, 2014 at 5:22 AM, Mark Wilkinson
>> <markw at illuminae.com <mailto:markw at illuminae.com>> wrote:
>>
>> Hi all BioPerlers!
>>
>> I'm confused by something. In the scenario below I have a
>> Fasta file and a GFF file:
>>
>> =========
>> File: a.fas
>>
>> >SEQ1
>> AAAATTTTCCCCGGGG
>>
>> =========
>> File: b.gff
>>
>> SEQ1 hit1 match_part 1 5 . . . .
>> SEQ1 hit2 match_part 6 10 . . . .
>> =========
>>
>> I load them into a seqfeature DB:
>>
>> bp_seqfeature_load.pl <http://bp_seqfeature_load.pl> -d
>> dbi:mysql:seqdb -c -u root -p pass a.fas b.gff
>>
>> I then explore the data as follows:
>>
>> use Bio::DB::SeqFeature::Store;
>>
>> my $db = Bio::DB::SeqFeature::Store->new(
>> -adaptor => 'DBI::mysql',
>> -dsn => 'dbi:mysql:seqdb',
>> -user => 'root',
>> -password => 'pass');
>>
>> my $iterator = $db->get_seq_stream();
>> while (my $feature = $iterator->next_seq){
>> print $feature->seq->seq;
>> # THE SEQUENCE IS PRINTED
>> print " comes from sequence named ";
>> print $feature->seq->id;
>> # THE METHOD ABOVE RETURNS UNDEF
>> }
>>
>> my $seq = $db->segment('SEQ1');
>> # $seq is undef, NOTHING IS RETURNED!?!?
>>
>> ============
>>
>> This is all very confusing. It seems that the feature knows
>> what sequence it is attached to, because it gives me the
>> correct string of letters, but it doesn't know what the name
>> of that sequence is... and in fact, calling the sequence by
>> name returns undef.
>>
>> Is this a bug, or is there a reason for this "disconnect"
>> between a sequence and its name?
>>
>> Help appreciated!
>>
>> Mark
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> <mailto:Bioperl-l at lists.open-bio.org>
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>>
>>
>> --
>> ------------------------------------------------------------------------
>> Scott Cain, Ph. D. scott at scottcain dot net
>> GMOD Coordinator (http://gmod.org/) 216-392-3087 <tel:216-392-3087>
>> Ontario Institute for Cancer Research
>
>
>
> ------------------------------------------------------------------------
> <http://www.avast.com/>
>
> This email is free from viruses and malware because avast!
> Antivirus <http://www.avast.com/> protection is active.
>
>
>
>
>
> --
> ------------------------------------------------------------------------
> Scott Cain, Ph. D. scott at
> scottcain dot net
> GMOD Coordinator (http://gmod.org/) 216-392-3087
> Ontario Institute for Cancer Research
--
--
Mark Wilkinson
Madrid, Spain
More information about the Bioperl-l
mailing list