[Biopython] Error parsing EMBL file
Nick Semenkovich
semenko at alum.mit.edu
Mon Sep 17 17:22:26 UTC 2012
Looks like it's dying at a line-wrapped location string (the RP line of reference 16 continues onto a second RP line):
RN [16]
RP 1-5181,6229-11775,13275-15420,18210-23250,29410-32271,34850-38580,
RP 41454-41724
RX DOI; 10.1128/JB.185.4.1475-1477.2003.
RX PUBMED; 12562822.
RA Pedulla M.L., Ford M.E., Karthikeyan T., Houtz J.M., Hendrix R.W.,
RA Hatfull G.F., Poteete A.R., Gilcrease E.B., Winn-Stapley D.A.,
RA Casjens S.R.;
RT "Corrected sequence of the bacteriophage p22 genome";
RL J. Bacteriol. 185(4):1475-1477(2003).
The file parses fine if RP is on a single line:
RP 1-5181,6229-11775,13275-15420,18210-23250,29410-32271,34850-38580,41454-41724
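Until the parser handles wrapped RP lines, one possible workaround is to merge consecutive RP lines before handing the text to SeqIO.parse. A minimal sketch, assuming that consecutive RP lines always belong to the same reference and that the continuation can simply be appended to the first line; join_wrapped_rp is a hypothetical helper name, not part of Biopython:

```python
def join_wrapped_rp(lines):
    """Merge consecutive EMBL 'RP' lines into one 'RP' line.

    Hypothetical helper (not a Biopython function). Assumes that
    back-to-back RP lines are one wrapped location string.
    """
    def is_rp(text):
        # An RP line starts with the two-letter tag followed by whitespace.
        return text.startswith("RP") and text[2:3].isspace()

    merged = []
    for line in lines:
        line = line.rstrip("\n")
        if is_rp(line) and merged and is_rp(merged[-1]):
            # Drop the repeated 'RP' tag and glue the continuation on.
            merged[-1] = merged[-1].rstrip() + line[2:].strip()
        else:
            merged.append(line)
    return merged


if __name__ == "__main__":
    fragment = [
        "RN   [16]",
        "RP   1-5181,6229-11775,13275-15420,18210-23250,29410-32271,34850-38580,",
        "RP   41454-41724",
        "RX   PUBMED; 12562822.",
    ]
    for out_line in join_wrapped_rp(fragment):
        print(out_line)
```

The merged lines can then be joined with newlines, wrapped in io.StringIO, and passed to SeqIO.parse(handle, "embl") in place of the original file handle. This only sidesteps the problem on the input side; it does not fix the parser itself.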
On Mon, Sep 17, 2012 at 12:01 PM, Nick Semenkovich <semenko at alum.mit.edu> wrote:
> I'm trying to extract the peptide sequences from a large collection of
> EMBL-formatted files (all phage & virus data from EBI).
>
> EBI provides these as large, concatenated EMBL files, so I've been
> using SeqIO.parse to read & then write the 'translation' key from
> seq_feature.qualifiers.
>
>
> Unfortunately, it looks like the parser dies on one input file:
>
> http://www.ebi.ac.uk/ena/data/view/BK000583&display=txt&expanded=true
>
> Traceback (most recent call last):
> File "gbk_to_faa.py", line 7, in <module>
> for seq_record in SeqIO.parse(input_handle, "embl") :
> File "/usr/lib/pymodules/python2.7/Bio/SeqIO/__init__.py", line 541, in parse
> for r in i:
> File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 440, in parse_records
> record = self.parse(handle, do_features)
> File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 423, in parse
> if self.feed(handle, consumer, do_features):
> File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 391, in feed
> self._feed_header_lines(consumer, self.parse_header())
> File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 692, in _feed_header_lines
> consumer.reference_bases("(bases %s)" % "; ".join(parts))
> File "/usr/lib/pymodules/python2.7/Bio/GenBank/__init__.py", line 740, in reference_bases
> locations = self._split_reference_locations(ref_base_info)
> File "/usr/lib/pymodules/python2.7/Bio/GenBank/__init__.py", line 777, in _split_reference_locations
> start, end = base_info.split('to')
> ValueError: need more than 1 value to unpack
>
>
> * I might dig into this a bit more to patch, but does anyone more
> familiar with EMBL files know what's going on?
>
> * Also, is there a more straightforward (or even non-Biopython) way
> to go from EMBL->FAA?
>
>
> Best,
> Nick
>
> --
> Nick Semenkovich
> Laboratory of Dr. Jeffrey I. Gordon
> Medical Scientist Training Program
> School of Medicine
> Washington University in St. Louis
> 314.362.3963 (Lab)
> http://web.mit.edu/semenko/