[BioSQL-l] loading fasta records with load_seqdatabase.pl - correct fasta headers

Hilmar Lapp hlapp at gnf.org
Mon Aug 22 16:48:14 EDT 2005


Yes, this is correct. The purpose of a SeqProcessor is exactly to 
massage your data so that they are in the form you want when they 
enter the database.
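
To see what such massaging does to an identifier before anything touches
the database, the ID-cleaning step from Marc's processor quoted below can
be tried on its own. This is only a minimal sketch, assuming plain Perl
with no BioPerl installed; the sub name id_parser and the second and third
sample IDs are made up for illustration, and only the first ID comes from
this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same three-step fallback as _id_parser in the quoted Accession.pm,
# copied into a standalone sub (hypothetical name) so it runs without BioPerl.
sub id_parser {
    my ($id) = @_;
    return $id =~ /gb\|([^|]+)/      ? $1    # field right after "gb|"
         : $id =~ /^\s*\S+\|([^|]+)/ ? $1    # last non-empty pipe-delimited field
         : $id =~ /^\s*>*(\S+)/      ? $1    # plain token, optional leading ">"
         :                             $id;  # nothing matched: leave unchanged
}

# First ID is the RefSeq header token from the thread; the other two are
# invented examples of a GenBank-style ID and an in-house lab name.
for my $id ('gi|6912649|ref|NM_012431.1|', 'gi|1|gb|AB000001.1|', 'my_lab_seq_01') {
    printf "%-30s -> %s\n", $id, id_parser($id);
}
```

For the RefSeq ID from the thread this should yield NM_012431.1, which is
the value the processor then hands to accession_number().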

	-hilmar

On Aug 22, 2005, at 1:28 PM, Amit Indap wrote:

> Marc and Hilmar,
>
> Thanks for your responses. From my understanding, I can write my own
> SeqProcessor and override process_seq to munge my data so that it is
> acceptable when loading my sequences into biosql. I have a whole
> bunch of other sequences from the lab which don't have accessions, etc.,
> but I can write another pipeline to deal with them and give them
> appropriate names and accessions. (If I am misunderstanding what
> SeqProcessor is doing, please correct me.)
>
> Thanks,
> Amit
>
>
>
>
>
> On 8/22/05, Marc Logghe <MarcL at devgen.com> wrote:
>>> I think my fasta headers are incorrect since it says it
>>> cannot store unknown. The first fasta record in my refseq.fa is this:
>>>
>>> >gi|6912649|ref|NM_012431.1| Homo sapiens sema domain, immunoglobulin
>>> domain (Ig), short basic domain, secreted, (semaphorin) 3E
>>> (SEMA3E), mRNA
>>>
>>> Do I need to reformat that header? I downloaded the NM series
>>> of Refseqs in fasta form from NCBI's ftp site and wanted to
>>> load them into the biosql schema.
>>
>> You'd definitely better change the display_name to NM_012431.1.
>> You could first run the sequences through EMBOSS's seqret to clean
>> the identifier, or you could handle this in a seq processor. I'd opt
>> for the latter, because you have to set your accession_number anyway.
>> The thing is that a sequence object parsed from fasta has no
>> accession_number (it is set to the default, the well-known
>> 'unknown' ;-), only a display_name.
>> In the processor you can do it all: clean up the display_name and pass
>> that value to the accession_number() call.
>> The processor looks like this (save it as Accession.pm and put it
>> somewhere where perl can find it):
>>
>>
>> # $Id: Accession.pm,v 1.2 2004/03/02 08:15:48 marcl Exp $
>> package Accession;
>> use vars qw(@ISA);
>> use strict;
>>
>> use Bio::Seq::BaseSeqProcessor;
>>
>> @ISA = qw(Bio::Seq::BaseSeqProcessor);
>>
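>> # Extract a bare accession from an NCBI-style identifier. Tries, in
>> # order: the field after "gb|", the last non-empty pipe-delimited
>> # field, and the first whitespace-free token (minus any leading ">").
>> # Falls back to the input unchanged.
>> sub _id_parser
>> {
>>   return $_[0] =~ /gb\|([^|]+)/       ? $1 :
>>          $_[0] =~ /^\s*\S+\|([^|]+)/  ? $1 :
>>          $_[0] =~ /^\s*>*(\S+)/       ? $1 : $_[0];
>> }
>>
>>
>> # Called for each sequence in the pipeline: sets accession_number to
>> # the cleaned-up display_id and returns the sequence.
>> sub process_seq{
>>     my ($self,$seq) = @_;
>>     my $display_id = _id_parser($seq->display_id);
>>     $seq->accession_number($display_id);
>>     return ($seq);
>> }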
>>
>> 1;
>>
>>
>> Then you can add to your load_seqdatabase.pl command the option:
>> --pipeline "Accession"
>>
>> HTH,
>>
>> Marc
>>
>>
>
>
> -- 
> Real patriots ask questions.
>                                          Carl Sagan
> http://aindap.blogspot.com/
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------


