[BioPython] Cannot parse/convert embl formatted files
Chris Fields
cjfields at uiuc.edu
Sun Aug 13 00:23:41 UTC 2006
Martin,
I think the Bioperl EMBL and GenBank parsers run all features through
a loop that uses a regex to look specifically for the '/' qualifier
tags and the quotes. So if there isn't a closing quote, the parser
chokes (it spits back something about unclosed or unpaired quotes).
That may not be too easy to work around. It shouldn't die, though, so
if a quote is unbalanced, the missing one could be added back in
Bioperl SeqIO.
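To make the "add the quote back" idea concrete, here is a minimal Python
sketch. It is not how Bioperl or Biopython actually do it. The names
ft() and close_unterminated_quotes() are hypothetical, and it assumes
the standard EMBL column layout (feature key starting in column 6,
qualifiers and continuation text starting in column 22). A quoted
continuation line that happens to begin with '/' would fool it.

```python
def ft(key, text):
    """Build one feature-table line in the assumed EMBL column layout."""
    return "FT   " + key.ljust(16) + text


def close_unterminated_quotes(ft_lines):
    """Return a copy of ft_lines in which a quoted qualifier value that is
    cut off by the next qualifier or feature line gets a closing quote."""
    repaired = []
    in_quote = False
    for line in ft_lines:
        # A non-blank key field (columns 6-20) means a new feature starts.
        starts_feature = line[5:20].strip() != ""
        body = line[21:]
        starts_qualifier = not starts_feature and body.startswith("/")
        if in_quote and (starts_feature or starts_qualifier):
            # The previous value never closed: terminate it here.
            repaired[-1] = repaired[-1].rstrip() + '"'
            in_quote = False
        # An odd number of quotes on a line toggles the open/closed state
        # (escaped quotes are doubled, so they do not change the parity).
        if not starts_feature and body.count('"') % 2 == 1:
            in_quote = not in_quote
        repaired.append(line)
    if in_quote:
        repaired[-1] = repaired[-1].rstrip() + '"'
    return repaired


sample = [
    ft("5'UTR", "1..213"),
    ft("", '/source="REFSEQ::XM_479174:1..213"'),
    ft("", '/gene="B1056G08.147"'),
    ft("", '/product="putative dihydropterin pyrophosphokinase'),
    ft("repeat_region", "61..87"),
]
repaired = close_unterminated_quotes(sample)
```

Running the repaired lines through the normal parser afterwards keeps
the workaround separate from the parsing loop itself.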
I have been thinking about rewriting this, as there is some redundancy
in the way the features are handled. I just have my hands tied a bit
right now (can't get to it yet).
Anyway, I think checking for balanced quotes is done from a
validation point-of-view.
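As a rough illustration of that kind of validation check (again a
Python sketch, not the actual Bioperl code), the hypothetical function
below groups each qualifier with its continuation lines and reports
those with an odd number of double quotes. It assumes the standard EMBL
columns: key in columns 6-20, qualifier text from column 22.

```python
def unbalanced_qualifiers(ft_lines):
    """Return (line_number, text) for each qualifier whose quotes are unbalanced."""
    qualifiers = []  # each entry is [first_line_number, joined_text] or None
    for number, line in enumerate(ft_lines, start=1):
        body = line[21:]
        if line[5:20].strip():
            # A new feature key ends whatever qualifier came before it.
            qualifiers.append(None)
        elif body.startswith("/"):
            qualifiers.append([number, body])
        elif qualifiers and qualifiers[-1] is not None:
            # Continuation line: append it to the current qualifier.
            qualifiers[-1][1] += " " + body
    return [
        (q[0], q[1])
        for q in qualifiers
        if q is not None and q[1].count('"') % 2 == 1
    ]


lines = [
    "FT   " + "5'UTR".ljust(16) + "1..213",
    "FT   " + "".ljust(16) + '/gene="B1056G08.147"',
    "FT   " + "".ljust(16) + '/product="putative dihydropterin pyrophosphokinase',
    "FT   " + "repeat_region".ljust(16) + "61..87",
]
problems = unbalanced_qualifiers(lines)
```

A strict parser could raise on a non-empty result, while a lenient one
could just warn and carry on.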
Chris
On Aug 12, 2006, at 7:16 PM, Martin MOKREJŠ wrote:
> Hi Chris,
>
> Chris Fields wrote:
>> Just so everybody knows, EMBL recently made a few major revisions to
>> their sequence format. These are now corrected in Bioperl CVS and
>> will be available for the next dev release (hopefully out within a
>> few months).
>
> I will test that later. Thanks.
>
>>
>> Odd about the unbalanced quotes; is that on the Bioperl end? I
>> missed that bit...
>
> No, the input EMBL files are broken:
>
> And the relevant EMBL file was:
>
> ID 5OSAR003520 standard; RNA; PLN; 213 BP.
> ...
> FT 5'UTR 1..213
> FT /source="REFSEQ::XM_479174:1..213"
> FT /gene="B1056G08.147"
> FT /product="putative dihydropterin pyrophosphokinase
> FT repeat_region 61..87
> ...
> //
>
> Still, I believe the parser could ignore this minor error and
> terminate the string (or treat it as terminated) when it is actually
> terminated by a following feature line.
>
> M.
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign