[Bioperl-l] loading data into bioperl-db

Michael Thon mrthon at unity.ncsu.edu
Thu Jun 5 11:02:59 EDT 2003


I realize now that in the load_seqdatabases.pl example I posted, I
specified --format fasta when I meant to specify --format genbank.
This may explain some of the unusual error messages that I reported
yesterday.

I converted file formats using SeqIO.  My fasta files were formatted
just as you say:

>gnl|NCSU_FGL|NCU10032.1  NCU10032.1 hypothetical protein (301 - 1378)
MTRQSIQSYRNRGLGGTRKMFLYYFFNYLG*


If I convert sequences formatted like this:

>NCU10032.1  NCU10032.1 hypothetical protein (301 - 1378)
MTRQSIQSYRNRGLGGTRKMFLYYFFNYLG*

into genbank format using SeqIO, the file looks like:

LOCUS       NCU10032.1                31 aa            linear   UNK
DEFINITION  NCU10032.1 hypothetical protein (301 - 1378)
ACCESSION   unknown
FEATURES             Location/Qualifiers
ORIGIN
        1 mtrqsiqsyr nrglggtrkm flyyffnylg *
//

and when I try to load into the database I get errors like:

-- WARNING --
MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were
("NCU09800.1","","unknown","NCU09800.1 hypothetical protein (5834 -
4583)","0","linear") FKs (1,<NULL>)
Duplicate entry 'unknown-1-0' for key 2
--
DBD::mysql::st execute failed: Duplicate entry 'unknown-1-0' for key 2
at /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BaseDriver.pm line 922,
<GEN0> line 14741.

If, during conversion of this file to genbank format, I specify an
accession number, then the sequences will load.  It looks like an
accession number is required by the database and/or the loading script.
It also looks to me like when sequences are read by Bio::SeqIO::fasta
the accession number is set to 'unknown', so every record ends up with
the same accession, which would explain the 'Duplicate entry' error
above.  Is there ever a case where Bio::SeqIO::fasta will parse a
sequence header like:

>gi|30419336|gb|CD037498.1|CD037498 mgsu014xP21f.b Magnaporthe grisea 

and read the namespace, accession, version etc from it?
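
To make the question concrete, here is a rough sketch of the kind of
splitting I have in mind.  It is plain Python rather than BioPerl, it
says nothing about what Bio::SeqIO::fasta actually does, and the name
parse_ncbi_id is hypothetical.  It assumes the ID has the form
gi|number|db|accession.version|locus and returns None for anything
else.

```python
# Hypothetical helper (not part of BioPerl): split an NCBI-style FASTA ID
# of the assumed form gi|number|db|accession.version|locus.
def parse_ncbi_id(header):
    """Return gi, namespace, accession and version, or None if the ID does not fit."""
    tokens = header.lstrip(">").split(None, 1)
    if not tokens:
        return None  # empty header line
    fields = tokens[0].split("|")  # the ID is the first whitespace-delimited token
    if len(fields) < 4 or fields[0] != "gi":
        return None  # e.g. a bare ID or a gnl|...|... ID
    accession, _, version = fields[3].partition(".")
    return {
        "gi": fields[1],
        "namespace": fields[2],
        "accession": accession,
        "version": int(version) if version.isdigit() else None,
    }
```

With the header above this would give namespace 'gb', accession
'CD037498' and version 1; my own headers (NCU10032.1 or
gnl|NCSU_FGL|NCU10032.1) would come back as None.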

So, I've been able to load my sequences by making sure they have an
accession number.  Eventually I'll write a BaseSeqProcessor module to
error-check my sequences at loading time.  

The next things for me to figure out are the query system and
updating/changing sequences that are already in the database.
Thanks for your help.
...I'll be back!
Mike