[Bioperl-l] EMBL/genbank organism parsing

Thu Mar 9 18:54:36 UTC 2006

Yeah the species parsing has bothered us for a long time.

My thoughts on this: I don't think tweaking individual parsers until
they behave as desired on a then-current set of examples is going to
put an end to this. Either species parsing will have to be moved into
its own set of 'drivers' with a fronting factory, like Bio::SpeciesIO
or Bio::TaxonIO, or alternatively into factory classes like
Bio::Factory::TaxonFactoryI and Bio::Factory::EMBLTaxonFactory etc.
(similar in concept to Bio::Factory::LocationFactoryI and
Bio::Factory::FTLocationFactory).
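
To make the 'drivers with a fronting factory' idea a bit more concrete,
here is a rough sketch, written in Python rather than Perl purely for
illustration. Every name in it (Species, parse_embl, parse_genbank,
species_parser) is hypothetical and not existing BioPerl API, and the
regular expression only covers the simple 'Latin name (common name)'
case, not the esoteric examples.

```python
import re
from typing import Callable, Dict, NamedTuple, Optional


class Species(NamedTuple):
    """Format-independent result that every driver returns (hypothetical)."""
    scientific_name: str
    common_name: Optional[str]


# Simplified assumption: 'Latin name' optionally followed by '(common name)'.
_NAME_RE = re.compile(r"^(?P<sci>[^(]+?)\s*(?:\((?P<common>[^)]*)\))?\s*$")


def _split_name(text: str) -> Species:
    match = _NAME_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unparseable organism string: {text!r}")
    return Species(match.group("sci"), match.group("common"))


def parse_embl(line: str) -> Species:
    # Assumes an EMBL 'OS' line such as 'OS   Homo sapiens (human)'.
    return _split_name(re.sub(r"^OS\s+", "", line))


def parse_genbank(line: str) -> Species:
    # Assumes a GenBank 'SOURCE' line such as 'SOURCE      Homo sapiens (human)'.
    return _split_name(re.sub(r"^SOURCE\s+", "", line))


# The fronting factory: one registry, one driver per flat-file format.
DRIVERS: Dict[str, Callable[[str], Species]] = {
    "embl": parse_embl,
    "genbank": parse_genbank,
}


def species_parser(fmt: str) -> Callable[[str], Species]:
    """Return the driver registered for the given format name."""
    try:
        return DRIVERS[fmt.lower()]
    except KeyError:
        raise ValueError(f"no species driver for format {fmt!r}") from None
```

The point of the design is that format-specific quirks live in one
driver each, while callers only ever see the shared Species result, so
the two formats cannot silently drift apart.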

Or, as a more radical approach, we could require the NCBI taxonomy
database (or any other implementation of Bio::DB::Taxonomy, e.g. one
backed by BioSQL) and otherwise disclaim responsibility for correctly
parsing the species.

Even though a TaxonIO or TaxonFactory approach looks like the 'right'
way to do it in terms of software design principles, I can't help but
wonder why we should spend much time writing species line parsers
when NCBI has already done the job for us by putting all species into
a compact (file-)database. If people really want to be 100% sure the
parser gets the species right, why not download the NCBI taxonomy
database, index it locally, and simply look up by taxonID (which is in
the Organism line in EMBL and the feature table in GenBank)? There
could be a speed issue due to the recursive lookup, though, so one
would probably want to cache each successful species resolution.
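
The lookup-plus-cache idea could look roughly like the sketch below,
again in Python for illustration only. The NODES table is a tiny
hand-made stand-in for a locally indexed taxonomy dump; the taxon IDs
are real NCBI IDs, but the parent links skip intermediate ranks for
brevity, and the lineage() helper is hypothetical, not part of
Bio::DB::Taxonomy.

```python
from functools import lru_cache

# Hypothetical stand-in for an indexed taxonomy dump:
# taxon ID -> (parent taxon ID, rank, scientific name).
# Intermediate ranks are omitted, so the parent links are simplified.
NODES = {
    1: (1, "no rank", "root"),
    9604: (1, "family", "Hominidae"),
    9605: (9604, "genus", "Homo"),
    9606: (9605, "species", "Homo sapiens"),
}


@lru_cache(maxsize=None)
def lineage(taxon_id):
    """Return the names from the given taxon up to the root, as a tuple.

    Each resolved taxon is cached, so later lookups that share
    ancestors do not walk the parent chain again.
    """
    parent_id, _rank, name = NODES[taxon_id]
    if parent_id == taxon_id:  # the root node is its own parent
        return (name,)
    return (name,) + lineage(parent_id)
```

Because the recursion is memoized per taxon, resolving a second
species from the same genus only costs one uncached step; everything
from the genus upward comes out of the cache.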

Sorry for not giving precise direction - ideally someone (you?) can 
take charge and spearhead overhauling this.

	-hilmar

On Mar 9, 2006, at 6:16 AM, James Abbott wrote:

> Hi Folks,
>
> The current parsing of OS lines by Bio::SeqIO::embl.pm fails with many
> of the organisms currently found in the database, since the OS lines
> differ considerably from the specification in the EMBL User Manual,
> which appears to have been used as the basis for the current parser. In
> an attempt to improve matters, I have collected a set of examples which
> hopefully cover the majority of the different ways of writing an
> organism name, and managed to get embl.pm to 'correctly' parse these
> (correctly being open to debate with some of the more esoteric
> examples). I'm sure there are plenty of entries which still don't parse
> correctly, but it's a start. I'll post the patches to bugzilla once I
> get a few loose ends tidied up.
>
> In the interests of consistency, I have also obtained the same set of
> sequences from Genbank, and am trying to make both parsers behave the
> same way, however they currently behave in different ways with respect
> to parsing the common name. According to the EMBL spec, the common name
> is the English name for the organism given in brackets after the latin
> name, consequently calling the common_name method on an embl.pm parsed
> Bio::Species object returns 'human' for a Homo sapiens (human). The
> genbank parser, however, currently takes the entire SOURCE line,
> including the latin name, consequently calling the common_name method
> on a genbank.pm parsed species object returns 'Homo sapiens (human)'.
> This would appear to be the intended behavior, since this is
> considered the correct response by the tests.
>
> Is it considered better to maintain consistency between the EMBL and
> Genbank parsers and risk breaking any code which relies upon the
> current behavior of genbank->species->common_name(), or to have the
> two parsers behaving differently, but consistently with their
> existing behavior?
>
> Cheers,
> James
>
> -- 
> Dr. James Abbott <j.abbott at imperial.ac.uk>
> Bioinformatics Software Developer, Bioinformatics Support Service
> Imperial College, London
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------