[Bioperl-l] load_seqdatabase error with a specific locus from genbank

Thu Apr 9 03:35:12 UTC 2009

On Apr 8, 2009, at 11:29 AM, Johann PELLET wrote:

> [...]
> and finally EU608407 and EU608559  made a crash:
>
> [...]
> --------------------- WARNING ---------------------
> MSG: Unexpected error in feature table for  Skipping feature,  
> attempting to recover
> ---------------------------------------------------
> #######...14 times ...############

I assume you have determined that these warnings were triggered by,
or at least affected, EU608407? Would you mind sharing how?

> --------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed,  
> values were ("Bonhoeffer,S., Chappey,C., Parkin,N.T.,  
> Whitcomb,LOCUS       EU608407
>      1212 bp    DNA     linear   VRL 20-APR-2008","","","CRC- 
> D35248959C54B9F2","1","1212","") FKs (<NULL>)
> ERROR:  null value in column "location" violates not-null constraint

Is this really a verbatim copy of the error message you saw on the
screen? What's really puzzling about this is how the genbank SeqIO
parser could mess up parsing the reference entry so badly. Here's the
reference from the version online at NCBI:

REFERENCE   1  (bases 1 to 1212)
   AUTHORS   Bonhoeffer,S., Chappey,C., Parkin,N.T., Whitcomb,J.M. and
             Petropoulos,C.J.
   TITLE     Evidence for positive epistasis in HIV-1
   JOURNAL   Science 306 (5701), 1547-1550 (2004)
    PUBMED   15567861

How the first author line got chopped off at the end, and how the
LOCUS line got inserted there, is a mystery to me.

The location is "Science 306 (5701), 1547-1550 (2004)", and judging by
the empty values in the error message, the parser failed to extract
both the location and the TITLE.

Could you confirm that the file you are parsing is not corrupted in  
any way, specifically for this record?
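
One way to look for this kind of corruption is to scan the flat file
for LOCUS keywords that show up where they should not. The following
is only a minimal sketch in Python, not part of BioPerl; the function
name is hypothetical, and it assumes a standard GenBank flat file in
which every record starts with a LOCUS line and ends with "//". It is
a heuristic and may also flag a legitimate annotation that happens to
contain the word LOCUS.

```python
def find_misplaced_locus(lines):
    """Return (line_number, reason) pairs for suspicious LOCUS keywords.

    Hypothetical helper: flags a LOCUS line that starts before the
    previous record was closed with "//", and a LOCUS keyword that
    appears in the middle of another line.
    """
    problems = []
    in_record = False
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if text.startswith("LOCUS"):
            if in_record:
                problems.append((number, "LOCUS before previous record was terminated"))
            in_record = True
        elif "LOCUS " in text:
            problems.append((number, "LOCUS in the middle of a line"))
        elif text.strip() == "//":
            in_record = False
    return problems


# Usage (the file name is a placeholder):
# with open("sequences.gb") as handle:
#     for number, reason in find_misplaced_locus(handle):
#         print(number, reason)
```

Any hit near EU608407 would point to a damaged input file rather than
a parser bug.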

> ---------------------------------------------------
> Could not store EU608559:
> ------------- EXCEPTION: Bio::Root::Exception -------------
> [...]
>
> If I check in the biosql database if some part of this records are  
> inserted:

So are there other sequences associated with that PubMed ID? Can you  
do a grep on the PubMed ID and see whether it occurs already before  
the one that trips up the load?
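
If a plain grep does not make the order obvious, the same check can be
scripted. This is a rough sketch in Python under the assumption that
the input is a GenBank flat file with one "PUBMED <id>" line per cited
article; the function name is made up for illustration.

```python
def records_citing_pubmed(lines, pubmed_id):
    """Return LOCUS names of records citing pubmed_id, in file order."""
    hits = []
    current = None  # LOCUS name of the record being read
    for line in lines:
        fields = line.split()
        if line.startswith("LOCUS") and len(fields) > 1:
            current = fields[1]
        elif len(fields) == 2 and fields[0] == "PUBMED" and fields[1] == pubmed_id:
            if current not in hits:
                hits.append(current)
    return hits


# Usage (the file name is a placeholder):
# with open("sequences.gb") as handle:
#     print(records_citing_pubmed(handle, "15567861"))
```

If other records appear in the result ahead of the one that fails,
the reference was already seen earlier in the load.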

> [...]
> select * from dbxref where dbxref_id=4179;
> dbxref_id | dbname | accession | version
> -----------+--------+-----------+---------
>      4179 | PUBMED | 15567861  |       0
>
> select * from bioentry where accession=15567861;

Note that 15567861 is the accession (PubMed ID) of the referenced
article, not of the sequence. The bioentries associated with a
reference are recorded in the bioentry_reference table.
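
To illustrate the join, here is a sketch in Python that runs against a
toy in-memory SQLite database, not against your PostgreSQL instance.
The table and column names follow my recollection of the BioSQL
schema (bioentry, bioentry_reference, reference, dbxref) and are cut
down to the columns the query needs, so treat them as assumptions and
check them against your installed schema. The rows are made-up demo
data, and the function names are hypothetical.

```python
import sqlite3

# Assumed join path: dbxref -> reference -> bioentry_reference -> bioentry
QUERY = """
SELECT b.accession
FROM bioentry b
JOIN bioentry_reference br ON br.bioentry_id = b.bioentry_id
JOIN reference r ON r.reference_id = br.reference_id
JOIN dbxref d ON d.dbxref_id = r.dbxref_id
WHERE d.dbname = 'PUBMED' AND d.accession = ?
ORDER BY b.accession
"""


def make_toy_db():
    """Build a tiny stand-in database with demo rows only."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, accession TEXT);
        CREATE TABLE dbxref (dbxref_id INTEGER PRIMARY KEY, dbname TEXT,
                             accession TEXT, version INTEGER);
        CREATE TABLE reference (reference_id INTEGER PRIMARY KEY,
                                dbxref_id INTEGER, location TEXT NOT NULL);
        CREATE TABLE bioentry_reference (bioentry_id INTEGER, reference_id INTEGER);
        INSERT INTO bioentry VALUES (1, 'EU608407');
        INSERT INTO bioentry VALUES (2, 'XX000001');
        INSERT INTO dbxref VALUES (4179, 'PUBMED', '15567861', 0);
        INSERT INTO reference VALUES (10, 4179, 'Science 306 (5701), 1547-1550 (2004)');
        INSERT INTO bioentry_reference VALUES (1, 10);
    """)
    return conn


def bioentries_for_pubmed(conn, pubmed_id):
    """Return accessions of bioentries linked to the given PubMed ID."""
    return [row[0] for row in conn.execute(QUERY, (pubmed_id,))]
```

The SELECT statement itself should carry over to psql once the
placeholder is replaced with the quoted PubMed ID.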

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================