[BioPython] Uniprot Parser
Ruchira Datta
ruchira.datta at gmail.com
Sun Feb 24 16:28:33 UTC 2008
On Sun, Feb 24, 2008 at 5:06 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Sat, Feb 23, 2008 at 10:44 PM, Ruchira Datta <ruchira.datta at gmail.com> wrote:
> > I've been using Bio.SwissProt.SProt to parse this file. The only glitch
> > that came up so far is that when some fields span multiple lines (e.g., OS,
> > the species field), SProt puts a newline in the field. This is not
> > correct--it should be just a blank space. However, this can easily be
> > corrected within SProt itself without requiring a forked parser.
>
> I'm guessing you are using the parser to return Record objects, which
> are a fairly simple direct mapping of the raw file format - and I can
> understand why the newlines were included. If you use the parser to
> get SeqRecord objects (which are generic and not tied to the
> SwissProt/UniProt format), then the newlines are removed.
>
Hi Peter,
I had tried SeqRecord first, but it didn't include the references, which I
absolutely need.
While inclusion of newlines may be understandable, it's a bug. The newline is
stripped from several other fields by _RecordConsumer, e.g.,

    def reference_number(self, line):
        rn = line[5:].rstrip()
        ...

and it needs to be stripped from this one, instead of

    def organism_species(self, line):
        self.data.organism += line[5:]

The newlines are never significant in any field.
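
A minimal sketch of that change, assuming continuation lines should be joined
with a single space. The classes below are hypothetical stand-ins, not the
actual Bio.SwissProt.SProt code, and the two OS lines are made-up illustrative
input:

```python
class Record:
    """Hypothetical stand-in for the SProt Record; only the field used here."""

    def __init__(self):
        self.organism = ""


class RecordConsumerSketch:
    """Hypothetical stand-in for _RecordConsumer (not the Biopython class)."""

    def __init__(self):
        self.data = Record()

    def organism_species(self, line):
        # Drop the 5-character line code ("OS   ") and the trailing newline.
        text = line[5:].rstrip()
        # Join continuation lines with one space instead of keeping "\n".
        if self.data.organism:
            self.data.organism += " " + text
        else:
            self.data.organism = text


# Illustrative input: one species description split across two OS lines.
consumer = RecordConsumerSketch()
for line in ["OS   Homo sapiens\n", "OS   (Human).\n"]:
    consumer.organism_species(line)

organism = consumer.data.organism
print(repr(organism))  # 'Homo sapiens (Human).'
```

The same rstrip-and-join pattern would presumably apply to any other field
whose handler appends line[5:] unchanged.
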
In a couple of weeks I might be able to check out the cvs
version and provide a patch.
--Ruchira
>
> > At least two other parsers for this file have been written by people in my
> > group, but I have pushed and implemented standardization on the BioPython
> > one. Part of the point of BioPython is to have one central repository for
> > development and maintenance of things like this, so that hundreds of people
> > don't have to spend their time reinventing the wheel. It is much preferable
> > that people contribute changes rather than creating a forked version.
> >
> > --Ruchira
>