[Bioperl-l] using default string values for undef/empty, was Re: parsing GenBank file

Chris Fields cjfields at illinois.edu
Wed May 5 12:30:30 UTC 2010


On May 5, 2010, at 2:48 AM, Torsten Seemann wrote:

>> I have a huge GenBank file (downloaded from RDP, containing all
>> bacterial 16S). I just want to parse the RDP ID (in LOCUS) and the organism's lineage (in ORGANISM).
>> I am getting output like:
>> S000107505 uncultured Acidobacteria bacterium Geothrix Holophagaceae
>> Holophagales Holophagae "Acidobacteria" Bacteria Root
>> This is exactly the output I want, but I am missing a lot of records (they are
>> in the GenBank file but not in my output).
>> I also got a warning during parsing:
>> --------------------- WARNING ---------------------
>> MSG: Unbalanced quote in:
>> /db_xref="taxon:35783" /germline"
>> /mol_type="genomic DNA"
>> /organism="Enterococcus sp."
>> /strain="LMG12316"No further qualifiers will be added for this feature
>> ---------------------------------------------------
>> So I was wondering: is this warning message causing the problem, or
>> am I doing something wrong?
> 
> "Unbalanced quote" means there is not an even number (multiple of 2) of
> double-quote (") symbols around the tag's value. I can see that the
> first line (below) looks problematic:
> 
> YOU HAVE:
> 
> /db_xref="taxon:35783" /germline"
> 
> SHOULD BE:
> 
> /db_xref="taxon:35783"
> /germline
> 
> I suspect there is a problem either with RDP's GenBank producer, or
> Bioperl is having a problem with the "germline" qualifier, which is a
> 'null valued' qualifier like /pseudo: it takes no ="value" string. (I
> think in Bioperl this is handled by setting the value to "_no_value"
> ?)
> ...
> --Torsten Seemann
> --Victorian Bioinformatics Consortium, Dept. Microbiology, Monash
> University, AUSTRALIA
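
To make the "even number of double quotes" rule above concrete, here is a minimal sketch in Python. It is only an illustration of the counting rule Torsten describes, not the check Bioperl actually performs, and the function name has_balanced_quotes is made up for this example.

```python
def has_balanced_quotes(qualifier_text: str) -> bool:
    """Return True if the text contains an even number of double quotes.

    Hypothetical helper: it only counts '"' characters and does not
    reproduce Bioperl's parsing logic.
    """
    return qualifier_text.count('"') % 2 == 0


# The qualifier line from the warning: three double quotes, so unbalanced.
problem_line = '/db_xref="taxon:35783" /germline"'

# The corrected form: /germline is value-less, so it carries no quotes at all.
fixed_lines = ['/db_xref="taxon:35783"', '/germline']

print(has_balanced_quotes(problem_line))
print([has_balanced_quotes(line) for line in fixed_lines])
```

The stray quote after /germline is what tips the count to an odd number.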

Ugh, didn't notice the '_no_value' bit.  This may just be my opinion, but I don't like placeholder stubs like that; they tend to be brittle and run into issues (this one, for instance).  I would prefer we leave the value as undef and quote only defined values (with the exceptions in %FTQUAL_NO_QUOTE).
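
A rough sketch of what I mean, written in Python only to keep it self-contained (this is not Bioperl code). None stands in for Perl's undef, format_qualifier is a hypothetical name, and NO_QUOTE_TAGS is an assumed two-entry stand-in for %FTQUAL_NO_QUOTE rather than its real contents.

```python
from typing import Optional

# Assumption: a small illustrative stand-in for %FTQUAL_NO_QUOTE.
# The real hash's contents are not reproduced here.
NO_QUOTE_TAGS = {"codon_start", "transl_table"}


def format_qualifier(tag: str, value: Optional[str] = None) -> str:
    """Format one feature qualifier line (hypothetical helper).

    A value of None (standing in for undef) means the qualifier is
    value-less, so only the tag is written and no sentinel string such
    as '_no_value' is ever needed.
    """
    if value is None:
        return f"/{tag}"
    if tag in NO_QUOTE_TAGS:
        # Tags in the no-quote set are written without surrounding quotes.
        return f"/{tag}={value}"
    # Every other defined value is quoted.
    return f'/{tag}="{value}"'


print(format_qualifier("germline"))
print(format_qualifier("organism", "Enterococcus sp."))
print(format_qualifier("codon_start", "1"))
```

With this shape the writer never has to recognize a magic string, so a value-less qualifier cannot be quoted by accident.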

Is there a reason for this behavior (is it tied to ORM-type code such as bioperl-db)?  Can we change it to something a bit more realistic?

chris




