[Biojava-dev] Parsing protein EMBL files with RichSequence.IOTools.readEMBLProtein

Deniz Koellhofer deniz.koellhofer at cambia.org
Wed Sep 22 23:10:58 UTC 2010


Hi George,

This entry is from the EMBL patent protein database:
ftp://ftp.ebi.ac.uk/pub/databases/embl/patent/epo_prt.dat.gz

Have you used RichSequence.IOTools successfully for parsing EMBL protein
files before? I assume it would always fail on them because of the "BP" in
the regex?
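
To make the suspected failure concrete, here is a minimal sketch using only
java.util.regex. It assumes the pattern quoted below is applied to the ID line
after the leading "ID" tag has been stripped (the pattern itself starts at the
accession, so this seems likely, but it is an assumption). The class name, the
helper method, and the nucleotide example line are hypothetical and are not
part of BioJava. The non-capturing group (?:BP|AA) is a small variation on the
proposal below, chosen so no extra capture group is added.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IdLinePatternCheck {
    // Common prefix of both patterns, copied from the pattern quoted in this thread.
    static final String PREFIX =
        "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);"
        + "\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+";

    // Pattern as quoted from EMBLFormat: the length unit must be BP.
    static final Pattern BP_ONLY = Pattern.compile(PREFIX + "BP\\.$");

    // Variant of the proposed fix: the length unit may be BP or AA.
    static final Pattern BP_OR_AA = Pattern.compile(PREFIX + "(?:BP|AA)\\.$");

    // ID line value from the patent protein entry mentioned in this thread.
    static final String PROTEIN_ID =
        "A00197; SV 1; linear; protein; PRT; SYN; 602 AA.";

    // Made-up nucleotide ID line value, for comparison only.
    static final String NUCLEOTIDE_ID =
        "X00001; SV 1; linear; genomic DNA; STD; PRO; 1000 BP.";

    // Returns the sequence length field (group 7), or null if the line does not match.
    static String lengthField(Pattern pattern, String idValue) {
        Matcher m = pattern.matcher(idValue);
        return m.matches() ? m.group(7) : null;
    }

    public static void main(String[] args) {
        System.out.println("BP only,  protein:    " + lengthField(BP_ONLY, PROTEIN_ID));
        System.out.println("BP or AA, protein:    " + lengthField(BP_OR_AA, PROTEIN_ID));
        System.out.println("BP only,  nucleotide: " + lengthField(BP_ONLY, NUCLEOTIDE_ID));
        System.out.println("BP or AA, nucleotide: " + lengthField(BP_OR_AA, NUCLEOTIDE_ID));
    }
}
```

With these inputs the BP-only pattern rejects the protein line, while the
BP-or-AA variant accepts both lines and leaves groups 1 to 7 unchanged.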

Deniz

On Thu, Sep 23, 2010 at 7:18 AM, George Waldon <gwaldon at geneinfinity.org> wrote:

> Hi Deniz:
>
> I have a quick question that may be obvious, but which database do you get
> those protein files from?
>
> Thank you,
>
> George
>
> On Tue, Sep 21, 2010 at 3:59 PM, Deniz Koellhofer <deniz.koellhofer at cambia.org> wrote:
>
>    Hi,
>
>    I'm trying to parse EMBL-formatted files with
>    RichSequence.IOTools.readEMBLProtein(), but the pattern for the ID
>    line doesn't match.
>
>    Looks like the parser utilises the EMBLFormat class with the following
>    ID pattern:
>
>    protected static final Pattern lp = Pattern.compile(
>        "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$"
>    );
>
>    The ID lines in my files (retrieved from EMBL-EBI) look like:
>
>    ID   A00197; SV 1; linear; protein; PRT; SYN; 602 AA.
>
>    Looks like the pattern is written specifically for DNA/RNA and should
>    look more like:
>
>    protected static final Pattern lp = Pattern.compile(
>        "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+(BP|AA)\\.$"
>    );
>
>    Or am I using the wrong RichSequence.IOTools function?
>
>    Cheers,
>
>    Deniz
>    --
>    Deniz Koellhofer
>    Cambia
>    Initiative for Open Innovation (IOI)
>    Cambia at QUT, G301, 2 George Street, Brisbane Qld 4000, Australia


-- 
Deniz Koellhofer
Cambia
Initiative for Open Innovation (IOI)
Cambia at QUT, G301, 2 George Street, Brisbane Qld 4000, Australia


