[BioPython] Cannot parse/convert embl formatted files
Martin MOKREJŠ
mmokrejs at ribosome.natur.cuni.cz
Thu Aug 17 11:19:29 UTC 2006
Hi Chris,
thanks for your comments. I have filed a bug report at http://bugzilla.open-bio.org/show_bug.cgi?id=2077
Martin
Chris Fields wrote:
> Martin,
>
> I think the Bioperl EMBL and GenBank parsers run all features through a
> loop using regex to specifically look for the '/' tags and the quotes.
> So if there isn't a closing quote the parser chokes (spits back
> something about lack of closed or paired quotes). That may not be too
> easy to work around. It shouldn't die, though, so if there isn't a
> balanced quote it could be added back in bioperl SeqIO.
>
> I have been thinking about rewriting this as there is some redundancy
> in the way the features are handled. Just have my hands tied a bit now
> (can't get to it yet).
>
> Anyway, I think checking for balanced quotes is done from a validation
> point-of-view.
>
> Chris
>
> On Aug 12, 2006, at 7:16 PM, Martin MOKREJŠ wrote:
>
>> Hi Chris,
>>
>> Chris Fields wrote:
>>
>>> Just so everybody knows, EMBL recently made a few major revisions to
>>> their sequence format. These are now corrected in Bioperl CVS and
>>> will be available for the next dev release (hopefully out within a
>>> few months).
>>
>>
>> I will test that later. Thanks.
>>
>>>
>>> Odd about the unbalanced quotes; is that on the Bioperl end? I
>>> missed that bit...
>>
>>
>> No, the input EMBL files are broken:
>>
>> And the relevant EMBL file was:
>>
>> ID   5OSAR003520 standard; RNA; PLN; 213 BP.
>> ...
>> FT   5'UTR           1..213
>> FT                   /source="REFSEQ::XM_479174:1..213"
>> FT                   /gene="B1056G08.147"
>> FT                   /product="putative dihydropterin pyrophosphokinase
>> FT   repeat_region   61..87
>> ...
>> //
>>
>> Still, I believe the parser could ignore this minor error and terminate
>> the string (or treat it as terminated) when it is actually terminated
>> by a following feature line.
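
The recovery suggested above (treat an open quoted value as terminated
when the next feature-table line starts something new) could be sketched
as a pre-processing step run before the file reaches a parser. This is
only an illustration, not what Biopython or Bioperl do: the function
name close_unterminated_quotes and the FEATURE_RE pattern are made up
for this example, and the "new feature" test is a heuristic (a key token
followed by a single location token containing a digit), so a genuine
continuation line such as "subunit 2" would be misread. With the
fixed-column EMBL layout one could test column 6 instead.

```python
import re

# Heuristic (an assumption, not part of any library): a feature line body is
# "<key> <location>", e.g. "repeat_region 61..87".
FEATURE_RE = re.compile(r"^[\w'-]+\s+\S*\d\S*$")


def close_unterminated_quotes(lines):
    """Return a copy of `lines` in which a closing quote is appended to any
    FT qualifier value still open when a new qualifier, a new feature, or a
    non-FT line begins."""
    fixed = []
    in_quote = False
    for line in lines:
        is_ft = line.startswith("FT")
        body = line[2:].strip() if is_ft else ""
        starts_new_item = (
            not is_ft or body.startswith("/") or bool(FEATURE_RE.match(body))
        )
        if in_quote and starts_new_item:
            # The previous line left a value open: terminate it there.
            fixed[-1] = fixed[-1].rstrip() + '"'
            in_quote = False
        # Embedded quotes are doubled in EMBL, so an odd count of '"' on a
        # line means a quoted value was opened or closed on that line.
        if is_ft and body.count('"') % 2 == 1:
            in_quote = not in_quote
        fixed.append(line)
    return fixed


# The broken record quoted above, with the elided lines left out.
sample = [
    "ID 5OSAR003520 standard; RNA; PLN; 213 BP.",
    "FT 5'UTR 1..213",
    'FT /source="REFSEQ::XM_479174:1..213"',
    'FT /gene="B1056G08.147"',
    'FT /product="putative dihydropterin pyrophosphokinase',
    "FT repeat_region 61..87",
    "//",
]
repaired = close_unterminated_quotes(sample)
print(repaired[4])
```

Only the /product line is changed; properly quoted values, including
ones that legitimately continue over several FT lines, pass through
untouched.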