[BioPython] Cannot parse GenBank file

Chris Fields cjfields at uiuc.edu
Thu Jun 7 15:31:45 UTC 2007


On Jun 7, 2007, at 9:26 AM, Martin MOKREJŠ wrote:

> Hi,
>
> Chris Fields wrote:
>> One thing I missed which explains the biopython error: the LOCUS  
>> line is missing the locus identifier (see the NCBI example record  
>> link).  This doesn't choke the bioperl parser but it appears to  
>> stop the biopython parser in its tracks (maybe a feature instead  
>> of a bug!).
>> You should try adding a unique identifier (maybe the name of the  
>> file or record) to the LOCUS line to see if it works:
>> LOCUS  testfile           6499 bp ds-DNA     linear       02-AUG-2006
>> The bioperl parser in CVS writes out the correct alphabet when  
>> this is added:
>> LOCUS       testfile                6499 bp    ds-DNA  linear   02-AUG-2006
>> I'll try adding a warning to the bioperl parser for this.
>
> I have updated http://bugzilla.open-bio.org/show_bug.cgi?id=2305
> but let me emphasize the LOCUS line now contains
> LOCUS                      pRL        5428 bp ds-DNA   linear        07-JUN-2007
>
>
> which still does not match the line you proposed, yet it can be
> parsed by bioperl-live from CVS. Is it still wrong? The test case is
> attached as pRL.gb-new to Bugzilla bug #2305.
>
> Martin

That should work.  There is no strict uniqueness test (that would
require caching and isn't worth the trouble, IMHO), but you do need to
give each record a unique accession/locus name if you plan on indexing
them in the future.
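
As a rough illustration of the missing-name problem, here is a minimal
Python sketch that pulls the locus name out of a LOCUS line.  It is not
how the biopython or bioperl parsers work; locus_name is a hypothetical
helper, and it assumes the length unit is always "bp" and that the name
is a single whitespace-free token.

```python
def locus_name(line):
    """Return the locus name from a GenBank LOCUS line, or None if missing.

    Hypothetical helper: assumes a nucleotide record whose length is
    followed by the unit "bp", and ignores column positions entirely.
    """
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line: %r" % line)
    if "bp" not in tokens:
        raise ValueError("no 'bp' length unit found: %r" % line)
    unit_index = tokens.index("bp")
    # With a name:    LOCUS <name> <length> bp ...  -> "bp" is token 3
    # Without a name: LOCUS <length> bp ...         -> "bp" is token 2
    if unit_index == 3:
        return tokens[1]
    if unit_index == 2:
        return None
    raise ValueError("unexpected LOCUS layout: %r" % line)
```

Both LOCUS lines quoted above yield a name ("testfile" and "pRL"); a
line with the name left out returns None, which a caller could turn into
a warning or an error.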

Parsing GenBank data produced by third-party software is
problematic at best; some programs seem to follow no fixed rule for
their GenBank output, even though the specification is plainly
stated in the NCBI release notes.  My take on that is to have a
stricter (read: follows the release notes) GenBank parser which passes
the data in each record to default handler methods.  A user could then
override the defined handlers with their own, either by subclassing the
default handler class and overriding its methods or by adding their own
code references directly.
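
To make the idea concrete, here is a minimal Python sketch of that
design.  It is only an illustration under my own assumptions, not
existing bioperl or biopython code: DefaultHandler, LenientHandler,
parse and the overrides argument are all hypothetical names, only two
keywords are handled, and continuation lines are ignored.

```python
class DefaultHandler:
    """Hypothetical strict handlers, one method per GenBank keyword."""

    def __init__(self):
        self.record = {}

    def handle_locus(self, text):
        tokens = text.split()
        # Strict rule assumed here: <name> <length> bp ...
        if len(tokens) < 3 or tokens[2] != "bp":
            raise ValueError("LOCUS line has no locus name: %r" % text)
        self.record["name"] = tokens[0]
        self.record["length"] = int(tokens[1])

    def handle_definition(self, text):
        self.record["definition"] = text


class LenientHandler(DefaultHandler):
    """User subclass that tolerates a LOCUS line without a name."""

    def handle_locus(self, text):
        tokens = text.split()
        if len(tokens) >= 2 and tokens[1] == "bp":
            text = "unnamed " + text  # placeholder name, an arbitrary choice
        super().handle_locus(text)


def parse(lines, handler, overrides=None):
    """Dispatch each keyword line to an override callable or a handler method."""
    overrides = overrides or {}
    for line in lines:
        keyword, _, text = line.partition(" ")
        if not keyword:
            continue  # blank or continuation line; ignored in this sketch
        callback = overrides.get(keyword)
        if callback is None:
            callback = getattr(handler, "handle_" + keyword.lower(), None)
        if callback is not None:
            callback(text.strip())
    return handler
```

With this layout, parse(lines, DefaultHandler()) rejects a LOCUS line
that lacks a name, parse(lines, LenientHandler()) accepts it with a
placeholder, and a plain callable passed as
overrides={"DEFINITION": my_function} replaces a single handler without
any subclassing.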

chris

...




