[Biopython-dev] [Bug 2425] New: Fasta ID parsing error

Fri Dec 28 16:18:54 UTC 2007

http://bugzilla.open-bio.org/show_bug.cgi?id=2425

           Summary: Fasta ID parsing error
           Product: Biopython
           Version: 1.44
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: BioSQL
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: dtomso at athenixcorp.com

Loader.py raises the following error when presented with an unusual FASTA
header line:

>region1.fasta.screen.Contig1
ACAGGATAGGCGGGAGCCATTGAAACCGGAGCGCTAGCTTCGGTGGAGGC
GCTGGTGGGATACCGCCCTGACTGTATTGAAATTCTAACCTACGGGTCTT

Traceback (most recent call last):
  File "biosql_driver.py", line 28, in <module>
    db.load(SeqIO.parse(sfile, 'fasta'))
  File "/home/dtomso/repository/biopython/build/lib.linux-i686-2.5/BioSQL/BioSeqDatabase.py", line 412, in load
    db_loader.load_seqrecord(cur_record)
  File "/usr/lib/python2.5/site-packages/BioSQL/Loader.py", line 30, in load_seqrecord
    bioentry_id = self._load_bioentry_table(record)
  File "/usr/lib/python2.5/site-packages/BioSQL/Loader.py", line 214, in _load_bioentry_table
    accession, version = record.id.split('.')
ValueError: too many values to unpack

The code appears to split the record ID on every '.', assuming that a single
'.' separates the accession from a version number.  However, this only works
for NCBI-style header lines.  Files that deviate from this convention (e.g.
those produced by phrap, which generated the file above) trigger the error.
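
The failure can be reproduced without BioSQL or a database.  The snippet below
is only a sketch of the unpacking step shown in the traceback; the variable
names are illustrative and are not taken from Loader.py.

```python
# The ID from the FASTA header above (everything after '>').
record_id = "region1.fasta.screen.Contig1"

# Splitting on '.' yields four fields, not the two that
# "accession, version = ..." expects.
parts = record_id.split(".")
print(parts)

try:
    accession, version = record_id.split(".")
except ValueError as exc:
    # Same class of error as in the traceback; the exact wording of the
    # message depends on the Python version.
    error_message = str(exc)
    print("ValueError:", error_message)
```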

I bolted on an inelegant fix by having the code check for multiple '.'
characters, in which case the version defaults to zero.  Other solutions may be
preferable.
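
A minimal sketch of the kind of workaround described above follows.  It is not
the patch I applied, and split_accession_version is a hypothetical helper, not
a BioSQL function.  It assumes that an ID with exactly one '.' followed by
digits is an NCBI-style accession.version, and that anything else should be
kept whole as the accession with the version set to zero.  The digits check is
an extra assumption beyond simply counting '.' characters.

```python
def split_accession_version(record_id):
    """Hypothetical helper: return (accession, version) for a record ID.

    Assumption: only IDs of the form "<accession>.<digits>" carry a
    version; every other ID is used unchanged with version 0.
    """
    if record_id.count(".") == 1:
        accession, version = record_id.split(".")
        # Treat the suffix as a version only if it is purely numeric.
        if version.isdecimal():
            return accession, int(version)
    # No '.', several '.', or a non-numeric suffix: default to version 0.
    return record_id, 0


print(split_accession_version("region1.fasta.screen.Contig1"))
print(split_accession_version("AB123456.2"))
```

With this approach a phrap-style ID such as the one above is stored as-is,
while an ID like "AB123456.2" (a made-up example) is still split into an
accession and a version.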
