[Bioperl-l] Re: [BioSQL-l] loading fasta records with load_seqdatabase.pl - correct fasta headers

Tue Aug 23 15:43:56 EDT 2005

It may be worth depositing a suitable SeqProcessor for this type of ID
in the repository, as many people would probably find it useful.
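
To make the idea concrete, here is a rough sketch of the ID handling such a
processor would need. It is written in Python purely for illustration and is
not a Bio::Seq::BaseSeqProcessor subclass; a real SeqProcessor would do the
equivalent in Perl. The names FastaId, parse_ncbi_id and unique_key are made
up for this example, and the fallback values mirror the defaults described in
the quoted message below ("unknown" accession, version 0).

```python
import re
from typing import NamedTuple, Optional, Tuple


class FastaId(NamedTuple):
    """Fields of an NCBI-style FASTA identifier (hypothetical helper type)."""
    gi: str
    db: str
    accession: str
    version: int


# Matches IDs such as "gi|6912649|ref|NM_012431.1|" (without the leading ">").
_NCBI_ID = re.compile(r"^gi\|(\d+)\|([^|]+)\|([^|.]+)\.(\d+)\|")


def parse_ncbi_id(display_id: str) -> Optional[FastaId]:
    """Split a gi|...|db|accession.version| identifier into its parts."""
    match = _NCBI_ID.match(display_id)
    if match is None:
        return None
    gi, db, accession, version = match.groups()
    return FastaId(gi, db, accession, int(version))


def unique_key(display_id: str, namespace: str) -> Tuple[str, int, str]:
    """Return the (accession, version, namespace) triple for one record.

    If the ID cannot be parsed, fall back to ("unknown", 0, namespace),
    which is the situation that makes every record collide.
    """
    parsed = parse_ncbi_id(display_id)
    if parsed is None:
        return ("unknown", 0, namespace)
    return (parsed.accession, parsed.version, namespace)
```

With the accession and version pulled out of the ID, the two RefSeq records
mentioned in the thread get distinct keys, whereas without parsing they would
both end up with the same fallback key.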

On Aug 23, 2005, at 1:53 AM, mark.schreiber at novartis.com wrote:

> The NCBI 'standard' is to format the header like this:
>
> >gi|{identifier}|{namespace}|{accession}.{version}|{accession} description
>
> eg
>
> >gi|123456|gb|AE657483.3|AE657483.3 Gene of interest from Flying Spaghetti Monster.
>
> Biojava is going to be adopting this approach when the appropriate
> information is available.
>
> - Mark
>
> Mark Schreiber
> Principal Scientist (Bioinformatics)
>
> Novartis Institute for Tropical Diseases (NITD)
> 10 Biopolis Road
> #05-01 Chromos
> Singapore 138670
> www.nitd.novartis.com
>
> phone +65 6722 2973
> fax  +65 6722 2910
>
>
>
>
>
> Hilmar Lapp <hlapp at gnf.org>
> Sent by: biosql-l-bounces at portal.open-bio.org
> 08/23/2005 02:18 AM
>
>
>         To:     Amit Indap <indapa at gmail.com>
>         cc:     Bioperl <bioperl-l at bioperl.org>, Biosql <biosql-l at open-bio.org>, (bcc: Mark Schreiber/GP/Novartis)
>         Subject:        Re: [BioSQL-l] loading fasta records with load_seqdatabase.pl - correct fasta headers
>
>
> Amit,
>
> this is a problem inherent in the fasta format, as there is no precise
> definition of what to put as identifier and/or accession. The Bioperl
> fasta parser doesn't set the accession, so it defaults to "unknown"
> (it cannot be undef). Since fasta format also doesn't have the version
> in a defined place, the version will be undef (i.e., zero for biosql)
> for every entry. All your sequences will therefore have the same unique
> key of (accession,version,namespace), which violates the constraint
> as soon as a second sequence is stored.
>
> The easiest way to deal with this is to write your own
> SequenceProcessor (see Bio::Factory::SequenceProcessorI and
> Bio::Seq::BaseSeqProcessor) and then pipeline it using the --pipeline
> argument to load_seqdatabase.pl.
>
> Simple examples of how to write your own SeqProcessor have been posted
> before, e.g., by Marc Logghe:
>
> http://portal.open-bio.org/pipermail/bioperl-l/2005-February/018158.html
>
> and by myself
>
> http://portal.open-bio.org/pipermail/bioperl-l/2003-June/012369.html
>
>                  -hilmar
>
> On Aug 22, 2005, at 7:57 AM, Amit Indap wrote:
>
>> Hi,
>>
>> I am new to using the biosql. I am trying to load fasta formatted
>> RefSeq records into the biosql schema. When I try to use the
>> load_seqdatabase.pl script I get the following error
>>
>> load_seqdatabase.pl --host 127.0.0.1 --port 2022 --dbname testbiosql --namespace refseq --format fasta refseq.fa
>>
>> -------------------- WARNING ---------------------
>> MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were
>> ("gi|51459331|ref|XM_498785.1|","gi|51459331|ref|XM_498785.1|","unknown","PREDICTED: Homo sapiens LOC440641 (LOC440641), mRNA","0","") FKs (1,<NULL>)
>> Duplicate entry 'unknown-1-0' for key 2
>> ---------------------------------------------------
>> Could not store unknown:
>> ------------- EXCEPTION  -------------
>> MSG: You're trying to lie about the length: is 1316 but you say 6474
>> STACK Bio::PrimarySeq::length /usr/lib/perl5/site_perl/5.8.5/Bio/PrimarySeq.pm:418
>> STACK Bio::DB::Persistent::PersistentObject::AUTOLOAD /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:553
>> STACK Bio::Seq::length /usr/lib/perl5/site_perl/5.8.5/Bio/Seq.pm:612
>> STACK Bio::DB::Persistent::PersistentObject::AUTOLOAD /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:553
>> STACK Bio::DB::BioSQL::BiosequenceAdaptor::populate_from_row /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BiosequenceAdaptor.pm:236
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:1310
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:976
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:855
>> STACK Bio::DB::BioSQL::PrimarySeqAdaptor::attach_children /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/PrimarySeqAdaptor.pm:284
>> STACK Bio::DB::BioSQL::SeqAdaptor::attach_children /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/SeqAdaptor.pm:279
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:1341
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:976
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:855
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:205
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:254
>> STACK Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:272
>> STACK (eval) ./load_seqdatabase.pl:542
>> STACK toplevel ./load_seqdatabase.pl:525
>>
>> --------------------------------------
>>  at ./load_seqdatabase.pl line 555
>>
>> I think my fasta headers are incorrect since it says it cannot store
>> unknown. The first fasta record in my refseq.fa is this:
>>
>> >gi|6912649|ref|NM_012431.1| Homo sapiens sema domain, immunoglobulin domain (Ig), short basic domain, secreted, (semaphorin) 3E (SEMA3E), mRNA
>>
>> Do I need to reformat that header? I downloaded the NM series of
>> Refseqs in fasta form from NCBI's ftp site and wanted to load them
>> into the biosql schema.
>>
>> Thanks,
>>
>> Amit Indap
>> Dept. of Biological Statistics and Computational Biology
>> Cornell University
>>
>>
>> (error message)
>> Loading refseq.fa ...
>>
>> _______________________________________________
>> BioSQL-l mailing list
>> BioSQL-l at open-bio.org
>> http://open-bio.org/mailman/listinfo/biosql-l
>>
> -- 
> -------------------------------------------------------------
> Hilmar Lapp                            email: lapp at gnf.org
> GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
> -------------------------------------------------------------
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------