[Bioperl-l] RE: [Bioperl-guts-l] [Bug 1600] New: $gb->species->ncbi_taxid

Brian Osborne brian_osborne at cognia.com
Fri Mar 12 14:49:17 EST 2004


James,

Yes, I could read your patch but I'm lazy. You said:

>> create a Bio::Species object, but the genus=unknown species=marine
>> subspecies=gamma.

Shouldn't the values be the same for all these "species" for which the genus
is not known? Like:

Genus=unknown, species=unknown, subspecies=unknown

That way you can check, since one can no longer use "unless defined
$species_object" to see if real species information is lacking or not. Have
I missed something here?
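
For illustration, a minimal sketch of the kind of check I mean, assuming the placeholder value is standardised to 'unknown'. The helper name is hypothetical, and plain strings stand in for whatever the Bio::Species accessors would return:

```perl
use strict;
use warnings;

# Hypothetical helper: true when every name part is missing, empty, or the
# literal placeholder 'unknown' (compared case-insensitively).
sub is_placeholder_species {
    my @parts = @_;
    for my $part (@parts) {
        next unless defined($part) && length($part);
        return 0 if lc($part) ne 'unknown';
    }
    return 1;
}

my $label = is_placeholder_species('unknown', 'unknown', 'unknown')
    ? 'no real species information'
    : 'real species information';
print "$label\n";
```

With genus=unknown species=marine subspecies=gamma the same call returns false, which is why mixed values make the check unreliable.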

Brian O.


-----Original Message-----
From: James Wasmuth [mailto:james.wasmuth at ed.ac.uk]
Sent: Thursday, March 11, 2004 9:40 AM
To: Brian Osborne
Cc: bioperl-guts-l at bioperl.org
Subject: Re: [Bioperl-guts-l] [Bug 1600] New: $gb->species->ncbi_taxid

Brian and all at bioperl-guts,


below is the comment I've added to the bug[1600].  I think it may need
some discussion, but the patch I've added works to the extent that it
allows creation of a Bio::Species object but the subsequent genus,
species, subspecies calls will be 'wrong'.  Personally I'm more
concerned with the taxid, which I think will be sufficient.

If you want to see the size of this problem go to NCBI taxonomy and
enter the term identified as a token set!  I think that maintaining the
taxid is enough; otherwise the artificial split of terms such as
unidentified diatom endosymbiont of Peridinium foliaceum
<http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=42247&lvl=3&lin=f&keep=1&srchmode=3&unlock>
may be a problem, though some of them are intuitive.

One last question: I've never tried to fix a bug before, so I've
committed a patch as an attachment to Bugzilla for the bug.  Do others
check this and, if it's okay, place it in the code?
Apologies for the newbie bit...

-james



genbank.pm

line 1123: return unless $genus and  $genus !~ /^(Unknown|None)$/oi;

a number of species are described as Unknown blah blah blah.

The NCBI taxid assigned to unknown taxa is 32644 and has a number of
synonyms, none of which are 'unknown'.

The list includes: other, unknown organism, not specified, not shown,
unspecified, Unknown, None, unclassified , unidentified organism
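
A minimal sketch of a case-insensitive lookup over that list. The helper name is hypothetical, and the list is only the synonyms quoted above, not a complete copy of what NCBI holds:

```perl
use strict;
use warnings;

# Synonyms quoted above for the "unknown" taxon; keys are lower-cased so the
# lookup is case-insensitive.
my %is_unknown_name = map { ( lc($_) => 1 ) } (
    'other', 'unknown organism', 'not specified', 'not shown',
    'unspecified', 'Unknown', 'None', 'unclassified',
    'unidentified organism',
);

# Hypothetical helper: true only when the whole organism name is a synonym.
sub is_unknown_organism {
    my ($name) = @_;
    return 0 unless defined $name;
    $name =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    return $is_unknown_name{ lc $name } ? 1 : 0;
}

for my $name ('Unknown', 'unknown marine gamma proteobacterium NOR5') {
    my $verdict = is_unknown_organism($name) ? 'unknown taxon' : 'named organism';
    print "$name: $verdict\n";
}
```

Matching the whole name, rather than just the first word, is what lets 'unknown marine gamma proteobacterium NOR5' through.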

I've changed the _read_GenBank_Species subroutine to allow organism
names such as 'unknown marine gamma proteobacterium NOR5'.  This will
create a Bio::Species object, but the genus=unknown species=marine
subspecies=gamma.
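
To show where those three values come from, here is a small sketch that mimics only the whitespace split done on the ORGANISM line. The function name is hypothetical and the organelle handling is left out:

```perl
use strict;
use warnings;

# Mimic the field handling applied to one ORGANISM line: split on
# whitespace, drop the keyword, then take the first three words in order.
sub split_organism_line {
    my ($line) = @_;
    my @fields = split ' ', $line;    # ' ' also skips leading whitespace
    shift @fields;                    # drop the ORGANISM keyword
    my ($genus, $species, $sub_species) = @fields;
    return ($genus, $species, $sub_species);
}

my ($genus, $species, $sub_species) =
    split_organism_line('  ORGANISM  unknown marine gamma proteobacterium NOR5');
print "genus=$genus species=$species subspecies=$sub_species\n";
# prints: genus=unknown species=marine subspecies=gamma
```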

There is a whole host of species names that ignore the nice rules in
_read_GenBank_Species!  However, this fix will allow the correct taxid to
be provided, which I think matters more than the name!



sub _read_GenBank_Species {
    my( $self,$buffer) = @_;
    my @organell_names = ("chloroplast", "mitochondr");
     # only those carrying DNA, apart from the nucleus

    #CHANGE
     my @unkn_names = ('other', 'unknown organism', 'not specified',
                       'not shown', 'Unspecified', 'Unknown', 'None',
                       'unclassified', 'unidentified organism');

    $_ = $$buffer;

    my( $sub_species, $species, $genus, $common, $organelle, @class,
$ns_name );
    # upon first entering the loop, we must not read a new line -- the
    # SOURCE line is already in the buffer (HL 05/10/2000)
    while (defined($_) || defined($_ = $self->_readline())) {
    # de-HTMLify (links that may be encountered here don't contain
    # escaped '>', so a simple-minded approach suffices)
        s/<[^>]+>//g;
    if (/^SOURCE\s+(.*)/o) {
        # FIXME this is probably mostly wrong (e.g., it yields things like
        # Homo sapiens adult placenta cDNA to mRNA
        # which is certainly not what you want)
        $common = $1;
        $common =~ s/\.$//; # remove trailing dot
    } elsif (/^\s{2}ORGANISM/o) {
        my @spflds = split(' ', $_);
            ($ns_name) = $_ =~ /\w+\s+(.*)/o;
        shift(@spflds); # ORGANISM

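         # test the first field against each organelle prefix, so that
         # e.g. "Mitochondrion" matches the prefix "mitochondr"
         if(grep { $spflds[0] =~ /^\Q$_\E/i; } @organell_names) {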
        $organelle = shift(@spflds);
        }
            $genus = shift(@spflds);
        if(@spflds) {
        $species = shift(@spflds);
        } elsif ( grep { lc($_) eq lc($genus) } @unkn_names ) {
        # a single word that is one of the "unknown" synonyms
        $species = '';
        } else { $species = 'sp.'; }  # there's no species name but it isn't unclassified
        $sub_species = shift(@spflds) if(@spflds);
        } elsif (/^\s+(.+)/o) {
        # only split on ';' or '.' so that
        # classification that is 2 words will
        # still get matched
        # use map to remove trailing/leading spaces
            push(@class, map { s/^\s+//; s/\s+$//; $_; } split /[;\.]+/,
$1);
        } else {
            last;
        }

        $_ = undef; # Empty $_ to trigger read of next line
    }

     $$buffer = $_;

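     # Don't make a species object if it's empty or "Unknown" or "None";
     # compare the whole "genus species" string against the synonyms
    my $name = join ' ', grep { defined($_) && length($_) } $genus, $species;
    my $unkn = grep { lc($_) eq lc($name) } @unkn_names;

     return unless $genus and  $unkn==0;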

     # Bio::Species array needs array in Species -> Kingdom direction
    if ($class[0] eq 'Viruses') {
        push( @class, $ns_name );
    }
    elsif ($class[$#class] eq $genus) {
        push( @class, $species );
    } else {
        push( @class, $genus, $species );
    }
    @class = reverse @class;

    my $make = Bio::Species->new();
    $make->classification( \@class, "FORCE" ); # no name validation please
    $make->common_name( $common      ) if $common;
    unless ($class[-1] eq 'Viruses') {
        $make->sub_species( $sub_species ) if $sub_species;
    }
    $make->organelle($organelle) if $organelle;
    return $make;
}




Brian Osborne wrote:

>James,
>
>Your guess is right, no Species is made because of the name. That's because
>genbank.pm normally looks at:
>
>ORGANISM Bos taurus
>
>And makes "Bos" the genus, and so on.
>
>If it sees:
>
>ORGANISM Unknown
>
>It refuses to make a Species object, and it's interpreting your ORGANISM
>line in the same way because it can't make a valid genus, that's the current
>rule. Personally I'd say that I agree with its principle - how can we make a
>Species object without genus and species?
>
>You can get the taxid from a SeqFeature object, you already knew that.
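
As a rough sketch of that route: the helper below is hypothetical and works on plain db_xref strings such as the taxon:145658 value in the record quoted further down. With a real feature object the values would presumably come from its db_xref tag, which is an assumption here and not checked against the BioPerl API:

```perl
use strict;
use warnings;

# Hypothetical helper: return the NCBI taxon id from a list of db_xref
# values, or nothing if no taxon cross-reference is present.
sub taxid_from_db_xrefs {
    my @xrefs = @_;
    for my $xref (@xrefs) {
        return $1 if $xref =~ /^taxon:(\d+)$/;
    }
    return;
}

my $taxid = taxid_from_db_xrefs('taxon:145658');
print "taxid=$taxid\n";    # prints: taxid=145658
```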
>
>Brian O.
>
>
>-----Original Message-----
>From: bioperl-guts-l-bounces at portal.open-bio.org
>[mailto:bioperl-guts-l-bounces at portal.open-bio.org]On Behalf Of
>bugzilla-daemon at portal.open-bio.org
>Sent: Thursday, March 11, 2004 4:21 AM
>To: bioperl-guts-l at bioperl.org
>Subject: [Bioperl-guts-l] [Bug 1600] New: $gb->species->ncbi_taxid
>
>http://bugzilla.bioperl.org/show_bug.cgi?id=1600
>
>           Summary: $gb->species->ncbi_taxid
>           Product: Bioperl
>           Version: unspecified
>          Platform: PC
>        OS/Version: Linux
>            Status: NEW
>          Severity: normal
>          Priority: P2
>         Component: Bio::SeqIO
>        AssignedTo: bioperl-guts-l at bioperl.org
>        ReportedBy: james.wasmuth at ed.ac.uk
>
>
>I've included a genbank file for which I have been unable to extract the
>ncbi_taxid for using
>
>$gb->species->ncbi_taxid
>
>the error is:
>Can't call method "ncbi_taxid" on an undefined value
>
>In fact I don't get a Bio::Species object.  I'm sure it's because of the name,
>which is correct.
>
>I've tried looking into it, but could not find which Seq object creates the
>Bio::Species object.
>
>
>
>LOCUS       AY007676                1389 bp    DNA     linear   BCT 29-OCT-2001
>DEFINITION  Unknown marine gamma proteobacterium NOR5 16S ribosomal RNA,
>            partial sequence.
>ACCESSION   AY007676
>VERSION     AY007676.1  GI:12000362
>KEYWORDS    .
>SOURCE      unknown marine gamma proteobacterium NOR5
>  ORGANISM  unknown marine gamma proteobacterium NOR5
>            Bacteria; Proteobacteria; Gammaproteobacteria.
>REFERENCE   1  (bases 1 to 1389)
>  AUTHORS   Eilers,H., Pernthaler,J., Peplies,J., Glockner,F.O., Gerdts,G. and
>            Amann,R.
>  TITLE     Isolation of novel pelagic bacteria from the German bight and their
>            seasonal contributions to surface picoplankton
>  JOURNAL   Appl. Environ. Microbiol. 67 (11), 5134-5142 (2001)
>  MEDLINE   21536174
>   PUBMED   11679337
>REFERENCE   2  (bases 1 to 1389)
>  AUTHORS   Eilers,H., Pernthaler,J., Peplies,J., Gloeckner,F.O., Gerdts,G.,
>            Schuett,C. and Amann,R.
>  TITLE     Identification and seasonal dominance of culturable marine bacteria
>  JOURNAL   Unpublished
>REFERENCE   3  (bases 1 to 1389)
>  AUTHORS   Eilers,H., Pernthaler,J., Peplies,J., Gloeckner,F.O., Gerdts,G.,
>            Schuett,C. and Amann,R.
>  TITLE     Direct Submission
>  JOURNAL   Submitted (29-AUG-2000) Molecular Ecology, Max-Planck-Institute,
>            Celsiusstrasse 1, Bremen 28359, Germany
>FEATURES             Location/Qualifiers
>     source          1..1389
>                     /organism="unknown marine gamma proteobacterium NOR5"
>                     /mol_type="genomic DNA"
>                     /db_xref="taxon:145658"
>     rRNA            <1..>1389
>                     /product="16S ribosomal RNA"
>BASE COUNT      343 a    319 c    453 g    274 t
>ORIGIN
>        1 cgcgaaagta cttcggtatg agtagagcgg cggacgggtg agtaacgcgt aggaatctat
>       61 ccagtagtgg gggacaactc ggggaaactc gagctaatac cgcatacgtc ctaagggaga
>      121 aagcggggga tcttcggacc tcgcgctatt ggaggagcct gcgttggatt agctagttgg
>      181 tggggtaaag gcctaccaag gcgacgatcc atagctggtc tgagaggatg atcagccaca
>      241 ccgggactga gacacggccc ggactcctac gggaggcagc agtggggaat attgcgcaat
>      301 gggcgaaagc ctgacgcagc catgccgcgt gtgtgaagaa ggccttcggg ttgtaaagca
>      361 ctttcaattg ggaagaaagg ttagtagtta ataactgcta gctgtgacat tacctttaga
>      421 agaagcaccg gctaactccg tgccagcagc cgcggtaata cggaggtgcg agcgttaatc
>      481 ggaattactg ggcgtaaagc gcgcgtaggc ggtctgttaa gtcggatgtg aaagccccgg
>      541 gctcaacctg ggaattgcac ccgatactgg ccgactggag tgcgagagag ggaggtagaa
>      601 ttccacgtgt agcggtgaaa tgcgtagata tgtggaggaa taccggtggc gaaggcggcc
>      661 tcctggctcg acactgacgc tgaggtgcga aagcgtgggg agcaaacagg attagatacc
>      721 ctggtagtcc acgccgtaaa cgatgtctac tagccgttgg gagacttgat ttcttggtgg
>      781 cgaagttaac gcgataagta gaccgcctgg ggagtacggc cgcaaggtta aaactcaaat
>      841 gaattgacgg gggcccgcac aagcggtgga gcatgtggtt taattcgatg caacgcgaag
>      901 aaccttacca ggccttgaca tcctaggaat cctgtagaga tacgggagtg ccttcgggaa
>      961 tctagtgaca ggtgctgcat ggctgtcgtc agctcgtgtc gtgagatgtt gggttaagtc
>     1021 ccgtaacgag cgcaaccctt gtccttagtt gccagcgcgt aatggcggga actctaagga
>     1081 gactgccggt gacaaaccgg aggaaggtgg ggacgacgtc aagtcatcat ggcccttacg
>     1141 gcctgggcta cacacgtgct acaatggaac gcacagaggg cagcaaaccc gcgaggggga
>     1201 gcgaatccca caaaacgttt cgtagtccgg atcggagtct gcaactcgac tccgtgaagt
>     1261 cggaatcgct agtaatcgtg aatcagaatg tcacggtgaa tacgttcccg ggccttgtac
>     1321 acaccgcccg tcacaccatg ggagtgggtt gctccagaag tggttagcct aaccttcggg
>     1381 agggcgatc
>//
>
>
>
>
>
>
>

--
"I have not failed. I've just found 10,000 ways that don't work."
               --- Thomas Edison

Nematode Bioinformatics           ||
Blaxter Nematode Genomics Group   ||
School of Biological Sciences     ||
Ashworth Laboratories             ||
King's Buildings                  ||    tel: +44 131 650 7403
University of Edinburgh           ||    web: www.nematodes.org
Edinburgh                         ||
EH9 3JT                           ||
UK                                ||



