[Bioperl-l] Bio::DB::SeqFeature sequences with no identifier?

Thu May 22 13:05:16 EDT 2014

The specification *allows* for FASTA to be stored with the features at the end, but it does not *require* it.  I personally never store them together, but my preferences differ from others, which is probably why this is made to be flexible.

I suppose if it’s present one could generate these automatically, but FASTA are only allowed at the end of the file. So, if your parser stores features in any efficient manner (RTree, bins) or runs bounds checking then it would have to do so in a two-pass fashion (first to get the region lengths, then again to check bounds, create genomic bins for features, etc).

If it isn’t present then something (the directive or a ‘region’ feature) must be present to indicate the length of the region, even if it’s approximate.  Without it, basing the length off the coordinate positions of the 3’-most feature (for the seqID in the first column) won’t always guarantee finding the actual end of the region of interest.  Something has to be there as a point of reference, and from a parsing/validation point-of-view it should be early in the file, prior to features.  

(note this also allows creating features for regions w/o having any sequence present, such as genetic maps)

My problem is that allowing both ‘sequence-region’ and a ‘region’ feature is ambiguous; it should really be one or the other.  But that’s another argument for another time...

Re: getting the actual sequence from the database for a region by name, does the below work?  I think this also allows you to get a Bio::PrimarySeq, which can be fed to a Bio::SeqIO to generate output (see the docs for fetch_sequence here: https://metacpan.org/pod/Bio::DB::SeqFeature::Store#fetch_sequence)

# get sequence IDs
my @ids = $db->seq_ids;

for my $seqid (@ids) {
    my $raw_seq = $db->fetch_sequence(-seq_id=>$seqid);
    ...
}

chris

On May 22, 2014, at 7:50 AM, Mark Wilkinson <markw at illuminae.com> wrote:

> Ok, I can accept that... but it "feels wrong" ;-)
> 
> I would have expected the following "order of canon" as far as the naming of a sequence:
> 
> 1)  the FASTA header of the sequence passed to the loading script, following the established rules of parsing seqIDs from FASTA headers
> 
> 2)  the sequence name in the first column of a GFF file (since that's what that column contains, by the GFF spec)
> 
> ...but in the loader, neither of those two well-established specifications are used, and rather, a (AFAIK non-standard) directive has to be inserted into an (optional) GFF header.
> 
> Now that I've solved my problem, I am happily moving on... but the process has left me a bit perplexed.  Loading fasta sequences into a DB, and then not being able to access them by name, isn't the most intuitive behaviour ;-)  ;-)
> 
> Anyhooooo... problem solved!  Thanks!
> 
> M
> 
> 
> 
> On 05/15/2014 04:21 PM, Scott Cain wrote:
>> HI Mark,
>> 
>> The sequence has to be identified in some way, either with an explicit GFF line like in my example, with a ##sequence-region directive (which I don't care for but should work for you), or as part of a Target attribute (as part of a match/similarity search result).  I'd say this isn't terribly explicit in the spec but is in the examples.
>> 
>> Scott
>> 
>> 
>> 
>> On Thu, May 15, 2014 at 4:59 AM, Mark Wilkinson <markw at illuminae.com <mailto:markw at illuminae.com>> wrote:
>> 
>>    It doesn't.  Thanks! I'll try that when I get in tomorrow.
>> 
>>    (is that a part of the GFF3 spec, or is it a BioPerl "thing"?  is
>>    it documented?)
>> 
>>    M
>> 
>> 
>> 
>> 
>>    On 14/05/2014 4:19 PM, Scott Cain wrote:
>>>    Hi Mark,
>>> 
>>>    Does you GFF have a line that identifies the reference sequence,
>>>    like this:
>>> 
>>>    SEQ1   .    contig    1    123456    .    .  .    Name=SEQ1
>>> 
>>>    If not, that could be the problem.
>>> 
>>>    Scott
>>> 
>>> 
>>> 
>>>    On Wed, May 14, 2014 at 5:22 AM, Mark Wilkinson
>>>    <markw at illuminae.com <mailto:markw at illuminae.com>> wrote:
>>> 
>>>        Hi all BioPerlers!
>>> 
>>>        I'm confused by something.  In the scenario below I have a
>>>        Fasta file and a GFF file:
>>> 
>>>        =========
>>>        File:  a.fas
>>> 
>>>        >SEQ1
>>>        AAAATTTTCCCCGGGG
>>> 
>>>        =========
>>>        File:  b.gff
>>> 
>>>        SEQ1    hit1    match_part    1    5    .    .    .    .
>>>        SEQ1    hit2    match_part    6    10    .  .    .    .
>>>        =========
>>> 
>>>        I load them into a seqfeature DB:
>>> 
>>>        bp_seqfeature_load.pl <http://bp_seqfeature_load.pl> -d
>>>        dbi:mysql:seqdb -c -u root -p pass  a.fas b.gff
>>> 
>>>        I then explore the data as follows:
>>> 
>>>        use Bio::DB::SeqFeature::Store;
>>> 
>>>        my $db = Bio::DB::SeqFeature::Store->new(
>>>            -adaptor => 'DBI::mysql',
>>>            -dsn     => 'dbi:mysql:seqdb',
>>>            -user => 'root',
>>>            -password => 'pass');
>>> 
>>>        my $iterator = $db->get_seq_stream();
>>>        while (my $feature = $iterator->next_seq){
>>>            print $feature->seq->seq;
>>>            # THE SEQUENCE IS PRINTED
>>>            print " comes from sequence named ";
>>>            print $feature->seq->id;
>>>            #  THE METHOD ABOVE RETURNS UNDEF
>>>        }
>>> 
>>>        my $seq = $db->segment('SEQ1');
>>>             # $seq is undef, NOTHING IS RETURNED!?!?
>>> 
>>>        ============
>>> 
>>>        This is all very confusing.  It seems that the feature knows
>>>        what sequence it is attached to, because it gives me the
>>>        correct string of letters, but it doesn't know what the name
>>>        of that sequence is... and in fact, calling the sequence by
>>>        name returns undef.
>>> 
>>>        Is this a bug, or is there a reason for this "disconnect"
>>>        between a sequence and its name?
>>> 
>>>        Help appreciated!
>>> 
>>>        Mark
>>> 
>>> 
>>>        _______________________________________________
>>>        Bioperl-l mailing list
>>>        Bioperl-l at lists.open-bio.org
>>>        <mailto:Bioperl-l at lists.open-bio.org>
>>>        http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>> 
>>> 
>>> 
>>> 
>>>    --     ------------------------------------------------------------------------
>>>    Scott Cain, Ph. D.    scott at scottcain dot net
>>>    GMOD Coordinator (http://gmod.org/) 216-392-3087 <tel:216-392-3087>
>>>    Ontario Institute for Cancer Research
>> 
>> 
>> 
>>    ------------------------------------------------------------------------
>>    <http://www.avast.com/> 	
>> 
>>    This email is free from viruses and malware because avast!
>>    Antivirus <http://www.avast.com/> protection is active.
>> 
>> 
>> 
>> 
>> 
>> -- 
>> ------------------------------------------------------------------------
>> Scott Cain, Ph. D.                                   scott at scottcain dot net
>> GMOD Coordinator (http://gmod.org/)    216-392-3087
>> Ontario Institute for Cancer Research
> 
> -- 
> --
> Mark Wilkinson
> Madrid, Spain
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l