[Biopython] Error parsing EMBL file

Nick Semenkovich semenko at alum.mit.edu
Mon Sep 17 17:22:26 UTC 2012


Looks like the parser is dying on an RP location string that wraps across two lines:


RN   [16]
RP   1-5181,6229-11775,13275-15420,18210-23250,29410-32271,34850-38580,
RP   41454-41724
RX   DOI; 10.1128/JB.185.4.1475-1477.2003.
RX   PUBMED; 12562822.
RA   Pedulla M.L., Ford M.E., Karthikeyan T., Houtz J.M., Hendrix R.W.,
RA   Hatfull G.F., Poteete A.R., Gilcrease E.B., Winn-Stapley D.A.,
RA   Casjens S.R.;
RT   "Corrected sequence of the bacteriophage p22 genome";
RL   J. Bacteriol. 185(4):1475-1477(2003).


This works if RP is just one line:
RP   1-5181,6229-11775,13275-15420,18210-23250,29410-32271,34850-38580,41454-41724
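
Until the parser handles this, one possible workaround is to join the wrapped RP lines before parsing. This is only a rough sketch: merge_wrapped_rp_lines is a made-up helper name, not part of Biopython, and it assumes that consecutive RP lines always belong to the same reference and that the wrap falls right after a comma (as in the record above), so the pieces can be concatenated directly. Written for Python 3.

```python
def merge_wrapped_rp_lines(lines):
    """Yield EMBL lines with runs of consecutive RP lines joined into one.

    Hypothetical helper (not a Biopython function). Assumes the wrap
    falls after a comma, so continuation text is appended as-is.
    """
    pending = None  # RP line being accumulated, without its newline
    for line in lines:
        if line.startswith("RP   "):
            content = line[5:].strip()
            if pending is None:
                pending = "RP   " + content
            else:
                pending += content
            continue
        if pending is not None:
            # First non-RP line after a run of RP lines: flush the merged line.
            yield pending + "\n"
            pending = None
        yield line
    if pending is not None:
        # Input ended while still inside a run of RP lines.
        yield pending + "\n"


sample = [
    "RN   [16]\n",
    "RP   1-5181,6229-11775,\n",
    "RP   41454-41724\n",
    "RX   PUBMED; 12562822.\n",
]
merged = list(merge_wrapped_rp_lines(sample))
print("".join(merged), end="")
```

The merged lines could then be wrapped in io.StringIO("".join(...)) and passed to SeqIO.parse(handle, "embl") in place of the original file handle. I have only tried the line-joining part, not the full round trip on the large concatenated files.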


On Mon, Sep 17, 2012 at 12:01 PM, Nick Semenkovich <semenko at alum.mit.edu> wrote:
> I'm trying to extract the peptide sequences from a large collection of
> EMBL-formatted files (all phage & virus data from EBI).
>
> EBI provides these as large, concatenated EMBL files, so I've been
> using SeqIO.parse to read & then write the 'translation' key from
> seq_feature.qualifiers.
>
>
> Unfortunately, it looks like the parser dies on one input file:
>
> http://www.ebi.ac.uk/ena/data/view/BK000583&display=txt&expanded=true
>
> Traceback (most recent call last):
>   File "gbk_to_faa.py", line 7, in <module>
>     for seq_record in SeqIO.parse(input_handle, "embl") :
>   File "/usr/lib/pymodules/python2.7/Bio/SeqIO/__init__.py", line 541, in parse
>     for r in i:
>   File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 440, in parse_records
>     record = self.parse(handle, do_features)
>   File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 423, in parse
>     if self.feed(handle, consumer, do_features):
>   File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 391, in feed
>     self._feed_header_lines(consumer, self.parse_header())
>   File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 692, in _feed_header_lines
>     consumer.reference_bases("(bases %s)" % "; ".join(parts))
>   File "/usr/lib/pymodules/python2.7/Bio/GenBank/__init__.py", line 740, in reference_bases
>     locations = self._split_reference_locations(ref_base_info)
>   File "/usr/lib/pymodules/python2.7/Bio/GenBank/__init__.py", line 777, in _split_reference_locations
>     start, end = base_info.split('to')
> ValueError: need more than 1 value to unpack
>
>
> * I might dig into this a bit more to patch, but does anyone more
> familiar with EMBL files know what's going on?
>
> * Also, is there a more straightforward (or even non-Biopython) way
> to go from EMBL->FAA?
>
>
> Best,
> Nick
>
> --
> Nick Semenkovich
> Laboratory of Dr. Jeffrey I. Gordon
> Medical Scientist Training Program
> School of Medicine
> Washington University in St. Louis
> 314.362.3963 (Lab)
> http://web.mit.edu/semenko/





