[Biojava-dev] [Bug 3137] New: RichSequence.IOTools.readEMBLProtein() fails on EMBL patent protein entries.
bugzilla-daemon at portal.open-bio.org
Thu Sep 23 04:35:54 UTC 2010
http://bugzilla.open-bio.org/show_bug.cgi?id=3137
Summary: RichSequence.IOTools.readEMBLProtein() fails on EMBL
patent protein entries.
Product: BioJava
Version: unspecified
Platform: Macintosh
OS/Version: Mac OS
Status: NEW
Severity: normal
Priority: P2
Component: seq.io
AssignedTo: biojava-dev at biojava.org
ReportedBy: dkoellhofer at gmail.com
Hi,
I'm trying to parse EMBL-formatted files with
RichSequence.IOTools.readEMBLProtein(), but the pattern for the ID lines
doesn't match.
Looks like the parser utilises the EMBLFormat class with the following ID
pattern:
protected static final Pattern lp =
Pattern.compile("^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$");
The ID lines in my files (retrieved from EMBL-EBI) look like this:
ID A00197; SV 1; linear; protein; PRT; SYN; 602 AA.
The pattern appears to be written specifically for DNA/RNA (it only accepts
a length given in BP) and should look more like:
protected static final Pattern lp =
Pattern.compile("^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+(BP|AA)\\.$");
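To make the difference concrete, here is a minimal standalone sketch that runs both patterns against the ID line from the report. The class name EmblIdLineDemo, the describe() helper, and the nucleotide sample line are illustrative and not part of BioJava; the sketch also assumes the pattern is applied to the text after the "ID" tag, which the report does not confirm.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmblIdLineDemo {
    // Pattern quoted in the report as the one EMBLFormat currently uses.
    static final Pattern ORIGINAL = Pattern.compile(
        "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$");

    // Pattern proposed in the report: the length unit may be BP or AA.
    static final Pattern PROPOSED = Pattern.compile(
        "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+(BP|AA)\\.$");

    // Hypothetical helper: summarises a match, or reports that there is none.
    static String describe(Pattern p, String idLineBody) {
        Matcher m = p.matcher(idLineBody);
        if (!m.matches()) {
            return "no match";
        }
        // Only the proposed pattern has an eighth group (the unit).
        String unit = m.groupCount() >= 8 ? m.group(8) : "BP";
        return m.group(1) + " " + m.group(4) + " " + m.group(7) + " " + unit;
    }

    public static void main(String[] args) {
        // ID line from the report, without the leading "ID" tag (assumption).
        String protein = "A00197; SV 1; linear; protein; PRT; SYN; 602 AA.";
        // Illustrative nucleotide ID line, to check nothing regresses.
        String nucleotide = "X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.";

        System.out.println("original/protein:    " + describe(ORIGINAL, protein));
        System.out.println("proposed/protein:    " + describe(PROPOSED, protein));
        System.out.println("original/nucleotide: " + describe(ORIGINAL, nucleotide));
        System.out.println("proposed/nucleotide: " + describe(PROPOSED, nucleotide));
    }
}
```

Because the new (BP|AA) group is appended at the end, the numbering of groups 1 to 7 is unchanged, so any code that reads those groups should keep working if it does not depend on the total group count.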
The failing protein sequences come from the EMBL patent protein database:
ftp://ftp.ebi.ac.uk/pub/databases/embl/patent/epo_prt.dat.gz
Cheers,
Deniz