[BioPython] Cannot parse/convert embl formatted files

Chris Fields cjfields at uiuc.edu
Sun Aug 13 00:23:41 UTC 2006


Martin,

I think the Bioperl EMBL and GenBank parsers run all features through
a loop that uses a regex to look specifically for the '/' qualifier
tags and the quotes.  So if there isn't a closing quote the parser
chokes (it complains about unclosed or unpaired quotes).  That may not
be too easy to work around.  It shouldn't die, though, so if a quote
isn't balanced the missing one could be added back in Bioperl SeqIO.
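
To make the idea concrete, here is a rough Python sketch of a tolerant
feature-table reader.  It is not the Bioperl or Biopython code; the
function name parse_ft_lines and the simplifications (one-line
locations, no escaped quotes) are assumptions made for illustration.
An open quoted value is simply treated as closed when the next feature
key line arrives.

```python
import re

# Hypothetical helper for illustration only; not Bioperl or Biopython code.
QUALIFIER = re.compile(r'/(\w+)(?:=(.*))?$')


def parse_ft_lines(lines):
    """Return [(feature_key, location, qualifiers)] from EMBL FT lines.

    Simplifications: locations fit on one line, and escaped quotes
    inside values are not handled.
    """
    features = []
    quals = None        # qualifier dict of the feature being read
    open_name = None    # qualifier whose quoted value is still open

    for line in lines:
        if not line.startswith("FT"):
            continue
        # A feature key starts in column 6 (index 5) of an FT line.
        if len(line) > 5 and not line[5].isspace():
            key, _, location = line[5:].strip().partition(" ")
            quals = {}
            features.append((key, location.strip(), quals))
            # Tolerant step: an unterminated value ends here.
            open_name = None
            continue
        body = line[5:].strip()
        if quals is None or not body:
            continue
        if open_name is not None:
            # Continuation of a quoted value that spans several lines.
            closed = body.endswith('"')
            quals[open_name] += " " + (body[:-1] if closed else body)
            if closed:
                open_name = None
            continue
        match = QUALIFIER.match(body)
        if not match:
            continue
        name, value = match.group(1), match.group(2)
        if value is None:
            quals[name] = True              # flag qualifier without a value
        elif value.startswith('"'):
            closed = len(value) > 1 and value.endswith('"')
            quals[name] = value[1:-1] if closed else value[1:]
            if not closed:
                open_name = name
        else:
            quals[name] = value             # unquoted value
    return features


sample = [
    "ID   5OSAR003520 standard; RNA; PLN; 213 BP.",
    "FT   5'UTR           1..213",
    'FT                   /source="REFSEQ::XM_479174:1..213"',
    'FT                   /gene="B1056G08.147"',
    'FT                   /product="putative dihydropterin pyrophosphokinase',
    "FT   repeat_region   61..87",
    "//",
]
features = parse_ft_lines(sample)
```

With the broken record quoted below, the /product value is kept as
"putative dihydropterin pyrophosphokinase" and repeat_region is still
read as a separate feature.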

I have been thinking about rewriting this, as there is some redundancy
in the way the features are handled.  I just have my hands tied a bit
right now (can't get to it yet).

Anyway, I think checking for balanced quotes is done from a  
validation point-of-view.
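
As a sketch of what such a validation pass might look like on its own
(again hypothetical Python, not the actual Bioperl check), the function
below groups each qualifier with its continuation lines and reports the
ones with an odd number of double quotes.  It assumes that no
continuation line of a quoted value starts with '/'.

```python
def find_unbalanced_quotes(lines):
    """Return [(line_number, qualifier_text)] for qualifiers whose
    double quotes do not pair up.  Hypothetical helper for illustration."""
    problems = []
    current = None  # [start line number, first line text, quote count]

    # The extra "//" acts as a sentinel so the last group gets checked.
    for number, line in enumerate(list(lines) + ["//"], start=1):
        is_ft = line.startswith("FT")
        body = line[5:].strip() if is_ft else ""
        new_feature = is_ft and len(line) > 5 and not line[5].isspace()
        new_qualifier = is_ft and not new_feature and body.startswith("/")

        # A new feature, a new qualifier or a non-FT line ends the group.
        if current is not None and (not is_ft or new_feature or new_qualifier):
            if current[2] % 2:
                problems.append((current[0], current[1]))
            current = None

        if new_qualifier:
            current = [number, body, body.count('"')]
        elif current is not None:
            current[2] += body.count('"')   # continuation line
    return problems


record = [
    "ID   5OSAR003520 standard; RNA; PLN; 213 BP.",
    "FT   5'UTR           1..213",
    'FT                   /source="REFSEQ::XM_479174:1..213"',
    'FT                   /gene="B1056G08.147"',
    'FT                   /product="putative dihydropterin pyrophosphokinase',
    "FT   repeat_region   61..87",
    "//",
]
problems = find_unbalanced_quotes(record)
```

A caller could warn about each entry in problems instead of dying, and
then hand the record to a more forgiving parser.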

Chris

On Aug 12, 2006, at 7:16 PM, Martin MOKREJŠ wrote:

> Hi Chris,
>
> Chris Fields wrote:
>> Just so everybody knows, EMBL recently made a few major revisions to
>> their sequence format. These are now corrected in Bioperl CVS and
>> will be available for the next dev release (hopefully out within a
>> few months).
>
> I will test that later. Thanks.
>
>>
>> Odd about the unbalanced quotes; is that on the Bioperl end?  I
>> missed that bit...
>
> No, the input EMBL files are broken:
>
> And the relevant EMBL file was:
>
> ID   5OSAR003520 standard; RNA; PLN; 213 BP.
> ...
> FT   5'UTR           1..213
> FT                   /source="REFSEQ::XM_479174:1..213"
> FT                   /gene="B1056G08.147"
> FT                   /product="putative dihydropterin pyrophosphokinase
> FT   repeat_region   61..87
> ...
> //
>
> Still, I believe the parser could ignore this minor error and
> terminate the string (or treat it as terminated) when it is actually
> terminated by a following feature line.
>
> M.

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign






