[BioSQL-l] loading fasta records with load_seqdatabase.pl - correctfasta headers

Amit Indap indapa at gmail.com
Mon Aug 22 16:28:04 EDT 2005


Marc and Hilmar,

Thanks for your responses. From my understanding I can write my own
SequenceProcessor and override the process_seq to munge my data so that is
is acceptable when loading my sequences in to biosql. I have a whole
bunch of other sequences from the lab which don't have accessions, etc
but I can write another pipeline to deal with and give them
appropriate names and accessions. (If am mis-understanding what
SeqProcessor is doing, please correct)

Thanks,
Amit





On 8/22/05, Marc Logghe <MarcL at devgen.com> wrote:
> > I think my fasta headers are incorrect since it says it
> > cannot store unknown. The first fasta record in my refseq.fa is this:
> >
> > >gi|6912649|ref|NM_012431.1| Homo sapiens sema domain, immunoglobulin
> > domain (Ig), short basic domain, secreted, (semaphorin) 3E
> > (SEMA3E), mRNA
> >
> > Do I need to reformat that header? I downloaded the NM series
> > of Refseqs in fasta form from NCBI's ftp site and wanted to
> > load them into the biosql schema.
> 
> You'd definitely better change the display_name to NM_012431.1
> You could first run the sequences through EMBOSS's seqret cleaning the
> identifier.
> Or you handle this in a seq processor. I'd opt for the latter.
> Because you have to set your accession_number anyway. Thing is that a
> sequence object from parsed fasta has no accession_number (set to the
> default the well known 'unknown' ;-), only a display_name.
> In the processor you can do all: clean up the display_name and pass that
> value to the accession_number() call.
> The processor looks like this (save it as Accession.pm and put it
> somewhere where perl can find it):
> 
> 
> # $Id: Accession.pm,v 1.2 2004/03/02 08:15:48 marcl Exp $
> package Accession;
> use vars qw(@ISA);
> use strict;
> 
> use Bio::Seq::BaseSeqProcessor;
> 
> @ISA = qw(Bio::Seq::BaseSeqProcessor);
> 
> sub _id_parser
> {
>   return $_[0] =~ /gb\|([^|]+)/       ? $1 :
>          $_[0] =~ /^\s*\S+\|([^|]+)/  ? $1 :
>          $_[0] =~ /^\s*>*(\S+)/       ? $1 : $_[0];
> }
> 
> 
> sub process_seq{
>     my ($self,$seq) = @_;
>     my $display_id = _id_parser($seq->display_id);
>     $seq->accession_number($display_id);
>     return ($seq);
> }
> 
> 1;
> 
> 
> Then you can add to your load_seqdatabase.pl command the option:
> --pipeline "Accession"
> 
> HTH,
> 
> Marc
> 
> 


-- 
Real patriots ask questions.
                                         Carl Sagan
http://aindap.blogspot.com/



More information about the BioSQL-l mailing list