[Bioperl-l] Annotation Pipeline

Steven Lembark lembark@wrkhors.com
Fri, 01 Nov 2002 10:55:18 -0600


-- Charles Hauser <chauser@duke.edu>

> Hi all,
>
> Working on an EST project and I would like to improve our annotation
> pipeline.
>
> Currently, all I am doing is running blastx and parsing top hits with
> significant evals into a postgres DB which is searchable by users.
>
> I need to improve on this.  Targets are to:
> 	- automate
> 	- incorporate GO-terms
>
> Conceptually I know how to do this, but implementing is another matter.
> I also know that annotation pipelines have been developed and refined by
> others, so I see no need to reinvent where reuse is possible.

Excellent timing... the last few weeks I've been working
with some people who have designed a caching database for
bioinformatics. The basic design is a star with dimensions
for gene, source, and type of data (whatever you can get
from a $seq object from bioperl or manufacture yourself);
each fact row combines the gene, source, data type, a
sequence number (e.g., comments 1..n), and the fact value.

The underlying database can be just about anything, but
postgres would handle the varchars used for fact values
nicely.
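
For concreteness, here is a rough DBI sketch of that star
schema against postgres. All table and column names are
invented for illustration, not the actual schema:

	use strict;
	use warnings;

	use DBI;

	my $dbh = DBI->connect( 'dbi:Pg:dbname=annot', '', '',
				{ RaiseError => 1 } );

	$dbh->do( $_ ) for
	(
		'CREATE TABLE gene
		 ( gene_id serial PRIMARY KEY,
		   name varchar NOT NULL )',

		# module names the perl code used to query
		# this source (see below).

		'CREATE TABLE source
		 ( source_id serial PRIMARY KEY,
		   name varchar NOT NULL,
		   module varchar NOT NULL )',

		'CREATE TABLE dtype
		 ( dtype_id serial PRIMARY KEY,
		   name varchar NOT NULL )',

		# one fact per ( gene, source, data type, seq no ).

		'CREATE TABLE fact
		 ( gene_id   integer NOT NULL REFERENCES gene,
		   source_id integer NOT NULL REFERENCES source,
		   dtype_id  integer NOT NULL REFERENCES dtype,
		   seq       integer NOT NULL,
		   value     varchar NOT NULL,
		   PRIMARY KEY( gene_id, source_id, dtype_id, seq )
		 )',
	);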

The source dimension has a field which stores the name
of a perl module which will be used to extract the data
from that source (e.g., GenBank, SwissProt). The code
then essentially uses

	# can() hands back a code ref for the module's query
	# method; call it through the package as invocant.
	if( my $sub = $package->can( 'query' ))
	{
		if( my $data = $package->$sub( $gene ))
		{
			$localdb->insert( $data );
		}
	}

to get the contents back from that source into the
local data store. By filtering the returned data in
the insert method you can store a slice of the
fields or munge them to your heart's content.
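
As an example, a minimal source module along those lines
might look like the sketch below. MyPipeline::Source::GenBank
and the accession() call on $gene are names I've made up for
illustration; the Bio::DB::GenBank and annotation calls are
standard bioperl, though check them against your version:

	package MyPipeline::Source::GenBank;

	use strict;
	use warnings;

	use Bio::DB::GenBank;

	# query( $gene ) returns a hashref of facts for the
	# gene, or nothing if the fetch fails.

	sub query
	{
		my ( $class, $gene ) = @_;

		my $gb = Bio::DB::GenBank->new;

		my $seq
			= eval { $gb->get_Seq_by_acc( $gene->accession ) }
			or return;

		# slice out only the fields worth caching.

		return
		{
			description => $seq->desc,
			length      => $seq->length,
			comments    =>
			[
				map { $_->text }
				$seq->annotation->get_Annotations( 'comment' )
			],
		};
	}

	1;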

The other nice thing is that the design is quite cron-able.
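
For instance, a driver along these lines could run straight
out of crontab. MyPipeline::DB and its sources/genes methods
are placeholders for whatever wraps your local store:

	#!/usr/bin/perl

	use strict;
	use warnings;

	use MyPipeline::DB;	# hypothetical local-store wrapper

	my $localdb = MyPipeline::DB->new;

	# each row of the source dimension names a loader module.

	for my $source ( $localdb->sources )
	{
		my $package = $source->{ module };

		unless( eval "require $package" )
		{
			warn "Cannot load $package: $@";
			next;
		}

		my $sub = $package->can( 'query' )
			or next;

		for my $gene ( $localdb->genes )
		{
			my $data = $package->$sub( $gene )
				or next;

			$localdb->insert( $data );
		}
	}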

--
Steven Lembark                               2930 W. Palmer
Workhorse Computing                       Chicago, IL 60647
                                            +1 800 762 1582