[Bioperl-l] Bio::SeqIO::genbank and Bio::Species
Chris Fields
cjfields at uiuc.edu
Tue Jul 18 17:44:29 EDT 2006
For a given GenBank file, you'll have the following (this is from NCBI's
current flatfile format,
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html):
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
...
The SOURCE line above, according to NCBI, contains an abbreviated name and an
optional common name; apparently it can also contain additional information,
such as an organelle. The ORGANISM line contains NCBI's definition of the
formal scientific name (see the related thread on proposed Taxonomy changes),
followed by the lineage information.
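To make the ORGANISM layout concrete, here is a rough sketch in Python (not BioPerl code) of pulling the scientific name and lineage out of that block. The function name parse_organism is made up for illustration, and it assumes the scientific name fits on the first line, with the remaining lines holding a semicolon-separated lineage that ends in a period.

```python
def parse_organism(block_lines):
    """Split an ORGANISM block into (scientific_name, lineage).

    Assumption: the first line carries the scientific name and every
    following line is part of a semicolon-separated lineage that ends
    with a period. Names that wrap onto a second line are not handled.
    """
    name = block_lines[0].strip()
    if name.startswith("ORGANISM"):
        name = name[len("ORGANISM"):].strip()

    # Rejoin the wrapped lineage lines, then drop the trailing period.
    lineage_text = " ".join(line.strip() for line in block_lines[1:])
    lineage_text = lineage_text.rstrip(".")

    lineage = [t.strip() for t in lineage_text.split(";") if t.strip()]
    return name, lineage


organism_block = [
    "  ORGANISM  Saccharomyces cerevisiae",
    "            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;",
    "            Saccharomycetales; Saccharomycetaceae; Saccharomyces.",
]
scientific_name, lineage = parse_organism(organism_block)
```

The point is only that the name on the ORGANISM line can be kept whole, separately from the lineage, rather than being rebuilt from genus, species, and sub-species pieces.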
Currently, Bio::SeqIO::genbank and Bio::Species are very inconsistent with
bacterial names, so when I process everything through SeqIO I get:
SOURCE      Mycobacterium tuberculosis H37Rv H37Rv
  ORGANISM  Mycobacterium tuberculosis
SOURCE      Mycobacterium tuberculosis CDC1551 CDC1551
  ORGANISM  Mycobacterium tuberculosis
SOURCE      Mycobacterium avium subsp. paratuberculosis K-10 paratuberculosis K-10
  ORGANISM  Mycobacterium avium subsp.
SOURCE      Bacillus sp. NRRL B-14911 NRRL B-14911
  ORGANISM  Bacillus sp.
I have added a scientific_name() method to Bio::Species that holds the
string from the ORGANISM line and writes it back unchanged, which seems to
work well (it doesn't chop the name down). The bigger issue is the mess with
the SOURCE line. This stems from adding back information from sub_species(),
which I don't think needs to be done, since SOURCE is supposed to be an
abbreviated name.
Anybody mind if I try splitting up the original SOURCE line data into
organelle(), abbreviated_name(), and common_name()? This will change
common_name a bit (so, instead of 'Saccharomyces cerevisiae' it will give
'baker's yeast') but will also conform more to the NCBI definition of
'common name.' Also, organelle info isn't handled yet; I could toy with
adding support for it. Any objections?
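As a rough illustration of the proposed split, here is a minimal Python sketch (not the BioPerl implementation). The function split_source and the ORGANELLE_PREFIXES list are hypothetical; it assumes an organelle, when present, appears as a leading word, and that the common name is a single trailing parenthesized phrase.

```python
import re

# Hypothetical, incomplete list of organelle prefixes; the real set should
# be taken from NCBI's documentation.
ORGANELLE_PREFIXES = ("mitochondrion", "chloroplast", "plastid")


def split_source(source):
    """Split a SOURCE string into organelle, abbreviated name, common name."""
    text = source.strip()

    # Peel off a leading organelle word, if one is recognized.
    organelle = None
    for prefix in ORGANELLE_PREFIXES:
        if text.lower().startswith(prefix + " "):
            organelle = text[:len(prefix)]
            text = text[len(prefix):].strip()
            break

    # Treat a trailing "(...)" as the optional common name.
    common_name = None
    match = re.match(r"^(.*?)\s*\(([^()]*)\)\s*$", text)
    if match:
        text, common_name = match.group(1), match.group(2)

    return {
        "organelle": organelle,
        "abbreviated_name": text,
        "common_name": common_name,
    }


yeast = split_source("Saccharomyces cerevisiae (baker's yeast)")
h37rv = split_source("Mycobacterium tuberculosis H37Rv")
```

With this split, a bacterial SOURCE line without parentheses keeps its strain designation inside the abbreviated name and simply has no common name, so nothing needs to be appended from sub_species().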
I may proceed to do the same with EMBL, SwissProt, and others that use
Bio::Species if this works out.
Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign