[BioPython] Uniprot Parser

Ruchira Datta ruchira.datta at gmail.com
Sun Feb 24 17:36:56 UTC 2008


I just found another bug, which would be a bit trickier to fix properly.

This code:

    def database_cross_reference(self, line):
        # From CLD1_HUMAN, Release 39:
        # DR   EMBL; [snip]; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
        # DR   PRODOM [Domain structure / List of seq. sharing at least 1 domai
        # DR   SWISS-2DPAGE; GET REGION ON 2D PAGE.
        line = line[5:]
        # Remove the comments at the end of the line
        i = line.find('[')
        if i >= 0:
            line = line[:i]
        cols = line.rstrip(_CHOMP).split(';')
        cols = [col.lstrip() for col in cols]
        self.data.cross_references.append(tuple(cols))

applied to this line of the TrEMBL record for A2RB21_ASPNG:

DR   GO; GO:0016277; F:[myelin basic protein]-arginine N-methyltra...; IEA:EC.

got me this tuple:

('GO', 'GO:0016277', 'F:')

The bracketed term was interpreted as the start of a comment, so everything
from the '[' to the end of the line was stripped.
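
One possible workaround, as a rough sketch only: treat a '[' as the start of
a comment only when it follows whitespace. That is a guess based on the
example lines above, not something checked against the full format. The
helper name split_cross_reference and the TRAILING constant are made up for
illustration; TRAILING stands in for _CHOMP, whose value is not shown here.

```python
import re

# Assumption: characters trimmed from the end of a DR line. This is a
# stand-in for the parser's _CHOMP constant.
TRAILING = " \n\r\t.,;"

# Heuristic: a trailing comment starts at a '[' that follows whitespace.
# A '[' directly after ':' (as in the GO term) is left alone.
_COMMENT_START = re.compile(r"\s\[")


def split_cross_reference(line):
    """Hypothetical helper: split one DR line into a tuple of fields."""
    line = line[5:]  # drop the "DR   " prefix
    match = _COMMENT_START.search(line)
    if match:
        line = line[:match.start()]  # cut off the trailing comment
    cols = line.rstrip(TRAILING).split(";")
    return tuple(col.strip() for col in cols)


go_line = ("DR   GO; GO:0016277; "
           "F:[myelin basic protein]-arginine N-methyltra...; IEA:EC.\n")
print(split_cross_reference(go_line))
```

With this the GO line keeps all four fields, and the PRODOM example still
loses its comment. A term containing " [" in the middle would still be
truncated, so this is not a proper fix.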

Thanks,

--Ruchira

On Sun, Feb 24, 2008 at 8:47 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Sun, Feb 24, 2008 at 4:28 PM, Ruchira Datta <ruchira.datta at gmail.com> wrote:
> >
> > Hi Peter,
> >
> >  I had tried SeqRecord first, but it didn't include the references,
> > which I absolutely need.
>
> The good news is I think the references are included now (in Biopython
> CVS), see enhancement Bug 2235:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2235
>
> > While inclusion of newlines may be understandable, it's a bug.  The
> > newline is stripped from several other fields by _RecordConsumer, e.g.,
> > ...
>
> Off the top of my head, I would say that example is a little different
> - reference number lines do not span multiple lines.
>
> > The newlines are never significant in any field.
>
> You are probably right - although perhaps they could be important in
> long text fields where a line break has been inserted mid word and a
> hyphenation added.
>
> The newlines are also important if using the Record object to recreate
> the raw file (e.g. to save to disk).  However I doubt anyone is doing
> this.  Having a __str__ method defined, like the one in the
> Bio.GenBank.Record.Record object, would make this easier.
>
> > In a couple of weeks I might be able to check out the cvs
> > version and provide a patch.
>
> Please do.
>
> Peter
>


