[Bioperl-l] Bio::SeqIO::genbank and Bio::Species

Chris Fields cjfields at uiuc.edu
Tue Jul 18 21:44:29 UTC 2006


For a given GenBank file, you'll have the following (this is from NCBI's
current flatfile format,
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html):

LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
...

The SOURCE line above, according to NCBI, contains an abbreviated name and
an optional common name; it can apparently also contain additional
information, such as the organelle.  The ORGANISM line contains NCBI's
definition of the formal scientific name (see the related thread on proposed
Taxonomy changes), followed by the lineage information.
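
To make the ORGANISM layout concrete, here is a minimal, language-neutral
sketch in Python (not BioPerl code).  The function name parse_organism is
hypothetical, and the sketch assumes the scientific name fits on the first
line and that the lineage is semicolon-separated and ends with a period, as
in the sample record above.

```python
def parse_organism(lines):
    """Split a GenBank ORGANISM block into (scientific_name, lineage).

    Assumption: the scientific name does not wrap onto a second line.
    """
    first = lines[0].strip()
    if not first.startswith("ORGANISM"):
        raise ValueError("expected an ORGANISM line")
    # Everything after the keyword on the first line is the formal name.
    scientific_name = first[len("ORGANISM"):].strip()

    # Continuation lines hold the lineage; join them, drop the final period,
    # and split on semicolons.
    lineage_text = " ".join(line.strip() for line in lines[1:]).rstrip(".")
    lineage = [taxon.strip() for taxon in lineage_text.split(";") if taxon.strip()]
    return scientific_name, lineage


sample = [
    "  ORGANISM  Saccharomyces cerevisiae",
    "            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;",
    "            Saccharomycetales; Saccharomycetaceae; Saccharomyces.",
]
name, lineage = parse_organism(sample)
```

With the sample record, name holds the full string from the ORGANISM line and
lineage holds the eight taxa in order, from Eukaryota down to Saccharomyces.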

Currently, Bio::SeqIO::genbank and Bio::Species are very inconsistent with
bacterial names, so when I process everything through SeqIO I get:

SOURCE      Mycobacterium tuberculosis H37Rv H37Rv
  ORGANISM  Mycobacterium tuberculosis

SOURCE      Mycobacterium tuberculosis CDC1551 CDC1551
  ORGANISM  Mycobacterium tuberculosis

SOURCE      Mycobacterium avium subsp. paratuberculosis K-10 paratuberculosis K-10
  ORGANISM  Mycobacterium avium subsp.

SOURCE      Bacillus sp. NRRL B-14911 NRRL B-14911
  ORGANISM  Bacillus sp.

I have added a scientific_name() method to Bio::Species that holds the
string from the ORGANISM line and writes it back unchanged, which seems to
work well (it doesn't chop the name down).  The bigger issue is the mess on
the SOURCE line.  It stems from adding back the information from
sub_species(), which I don't think needs to be done, since the SOURCE line is
supposed to hold an abbreviated name.

Anybody mind if I try splitting up the original SOURCE line data into
organelle(), abbreviated_name(), and common_name()?  This will change
common_name a bit (so, instead of 'Saccharomyces cerevisiae' it will give
'baker's yeast') but will also conform more to the NCBI definition of
'common name.'  Also, organelle info isn't handled yet; I could toy with
adding support for it.  Any objections?  
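
The proposed split could look roughly like the following sketch, written in
Python for illustration only (it is not the Bio::Species implementation).
The function split_source and the ORGANELLE_PREFIXES list are hypothetical;
the sketch assumes an organelle, when present, appears as a leading word and
that the common name, when present, is the trailing parenthesised text.

```python
import re

# Hypothetical, incomplete list of organelle prefixes (an assumption, not
# taken from the NCBI documentation).
ORGANELLE_PREFIXES = ("mitochondrion", "chloroplast", "plastid")


def split_source(source_line):
    """Split a SOURCE line into organelle, abbreviated name and common name."""
    text = source_line.strip()
    if text.startswith("SOURCE"):
        text = text[len("SOURCE"):].strip()

    # Peel off a leading organelle word, if there is one.
    organelle = None
    for prefix in ORGANELLE_PREFIXES:
        if text.startswith(prefix + " "):
            organelle = prefix
            text = text[len(prefix):].strip()
            break

    # A trailing "(...)" is treated as the optional common name.
    common_name = None
    match = re.match(r"^(.*?)\s*\(([^()]*)\)\.?$", text)
    if match:
        text, common_name = match.group(1), match.group(2)

    return {
        "organelle": organelle,
        "abbreviated_name": text,
        "common_name": common_name,
    }


yeast = split_source("SOURCE      Saccharomyces cerevisiae (baker's yeast)")
bacillus = split_source("SOURCE      Bacillus sp. NRRL B-14911")
```

Here yeast carries 'Saccharomyces cerevisiae' as the abbreviated name and
"baker's yeast" as the common name, while bacillus keeps the whole strain
string as the abbreviated name and has no common name, so nothing gets
appended twice.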

I may proceed to do the same with EMBL, SwissProt, and others that use
Bio::Species if this works out.

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign 




