[Bioperl-l] COG software?

Jason Stajich jason@cgt.mc.duke.edu
Mon, 21 Jan 2002 14:23:04 -0500 (EST)


Just my first impression if I understand the COG data structure well
enough and what you are trying to get out of the data.

You can download the COG proteins from NCBI in the dir /pub/COG/COGs.  So
you can get all the protein sequences that make up each COG - just
blastx or fastx/fasty your unfinished genomic sequence against these.  Looks
like there are about 3700 COGs so one would need to combine these into a
single db to search against - if you want to retain which COG a protein is
from you should probably append/prepend the name of the COG to its ID.

Something like the following bioperl code would let you build a single
database with all the COGs in it with unique names (assuming no sequence
is in 2 COGs).

(assume that you have downloaded all the files)

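use strict;
use warnings;
use Bio::SeqIO;

# directory holding the downloaded COG files, given on the command line
my $dirtocogs = shift @ARGV or die "usage: $0 <dir with COG files>\n";
opendir(DIR, $dirtocogs) or die "cannot open $dirtocogs: $!";
my $out = Bio::SeqIO->new(-format => 'fasta', -file => ">COGDB.fa");
foreach my $file ( readdir(DIR) ) {
  next if( $file !~ /(COG\d+)/ );
  my $cog = $1;   # COG name taken from the file name
  my $in = Bio::SeqIO->new(-format => 'fasta', -file => "$dirtocogs/$file");

  # prefix every sequence ID with its COG so hits can be traced back
  while( my $seq = $in->next_seq ) {
    $seq->display_id( $cog . "_" . $seq->display_id);
    $out->write_seq($seq);
  }
}
closedir(DIR);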


Then fastx/fasty or blastx your partial nucleotide sequences against this
new db, or blastp them if you have predicted ORFs/genes.

Theoretically your best hits against this database give you your best
guess for which COG your query is in.  If you are trying to automate parts
of this and need blast parsers I would recommend (no surprise) the
Bio::SearchIO or Bio::Tools::BPlite parsers in bioperl.
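
A rough sketch of that last step, in case it helps: it walks a BLAST report
with Bio::SearchIO and prints the COG of the top hit for each query.  It
assumes the hit IDs carry a COG prefix as built above and that the first hit
listed for a query is the best one; the sub names and the report file passed
on the command line are made up for illustration.

```perl
use strict;
use warnings;

# Pull the COG name back out of a hit ID such as "COG0001_gi|12345".
# Returns undef when the ID carries no COG name.
sub cog_from_hit_name {
    my ($name) = @_;
    return $name =~ /(COG\d+)/ ? $1 : undef;
}

# Print "query <tab> COG <tab> significance" for the top hit of each query.
# Bio::SearchIO is loaded here so the helper above works without bioperl.
sub print_best_cogs {
    my ($report) = @_;
    require Bio::SearchIO;
    my $in = Bio::SearchIO->new(-format => 'blast', -file => $report);
    while ( my $result = $in->next_result ) {
        my $hit = $result->next_hit;   # hits come in report order
        my $cog = $hit ? cog_from_hit_name($hit->name) : undef;
        my @cols = defined $cog ? ($cog, $hit->significance) : ('none', '-');
        print join("\t", $result->query_name, @cols), "\n";
    }
}

# Run only when a report file is given on the command line.
print_best_cogs($ARGV[0]) if @ARGV;
```

A query whose top hit has a poor e-value probably should not be assigned to
any COG, so you would want to add a cutoff on the significance before
trusting the output.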

-jason

On Mon, 21 Jan 2002, Rick Westerman wrote:

>       I asked about COG searching a week or so ago in bionet.software but did
> not receive a good reply.  Since it looks like I will end up writing the
> search unless I find something pre-done, I will repeat my question in this
> forum.
>
>      Does anyone know of a way to send a bunch of potentially incomplete
> sequences (i.e., those from a partially completed genome) through NCBI's
> COG database?   The NCBI 'COGnitor' page seems to allow only single
> sequences to be analyzed at a time.  There are standalone programs
> (dignitor and xugnitor) available via FTP in /pub/tatusov/dignitor.   These
> may do what I want, although I am afraid that the readme is only a cryptic
> 26-line document and the 1500+ lines of 'C' code do not contain
> a single comment section.  Not even a "written by" or "copyright"
> section.   Makes me shake my head in
> disbelief.  :-(
>
>      Any further help would be appreciated.
>
> Thanks,
>
> -- Rick
>
> Rick Westerman
> westerman@purdue.edu
>
> Phone: (765) 494-0505                         FAX: (765) 496-7255
> S049 WSLR bldg. Purdue Univ. W. Lafayette, IN 47907-1153
>
> Bioinformatics specialist at the Genomics Initiative.
> Part time system manager of Biochemistry department.
>
> http://www.biochem.purdue.edu/~westerm
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>

-- 
Jason Stajich
Duke University
jason@cgt.mc.duke.edu