[Bioperl-pipeline] things to do

Shawn Hoon shawnh at fugu-sg.org
Mon Jan 27 11:09:02 EST 2003


> I'm not sure how you see this, but do you reckon data extraction and
> sanitization would come under input creates?

Data extraction, I feel, should go deeper than just pulling out raw data like sequences.
We should have modules that pull out, say, particular features from a GenBank format file (like
introns/polyA signals etc.), or extract splice site sequences from Ensembl genes, or check that sequences
translate correctly, are full-length CDS, etc. These functions may
not be present in packages like Bioperl or Ensembl, either because they are too specific to fit in those packages
or because the user doesn't want to wait for the blessing of the core developers and already has these
scripts lying around that he or she wants to plug in. Also, sometimes the way the code is written (e.g. an iterator like ->next_seq) may not
be amenable to how biopipe needs things. Rather, if well designed, we can place this logic into input creates and get it to work.
Then, if it is eventually shown that it would be better placed in the Bioperl or Ensembl code base, we can retire these
filters/input creates.
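
Concretely, here is the sort of check I mean, as a rough sketch using plain Bioperl calls (illustrative only, not biopipe code; how it would be wrapped as a filter/input create is left out):

use strict;
use Bio::SeqIO;

# Pull the CDS features out of a genbank file and warn about any that
# don't translate to a full-length protein.
my $in = Bio::SeqIO->new(-file => shift, -format => 'genbank');

while (my $seq = $in->next_seq) {
    foreach my $feat ($seq->get_SeqFeatures) {
        next unless $feat->primary_tag eq 'CDS';
        my $cds  = $feat->spliced_seq;    # joins exons across split locations
        my $prot = $cds->translate->seq;
        # full length: starts with Met, exactly one stop, at the end
        unless ($prot =~ /^M[^*]*\*$/) {
            warn $seq->display_id, ": CDS does not look full length\n";
        }
    }
}

The same pattern covers the intron/polyA extraction: iterate over get_SeqFeatures, select by primary_tag, and hand the feature sequences on.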


InputCreates are for automatic job creation and for handling logic which the external IOHandlers do not
support. Sequences returned by an IOHandler often cannot simply be passed directly to the runnables. This is more important
when we handle different input types in downstream analyses, say going from genomic seq->protein_seq or protein_seq->tree etc.
So you would only need to use an InputCreate for the initial analysis (unless you write a script to populate the job table)
or where you have a transition of input types in the pipeline.
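
As a purely hypothetical sketch (the argument names and the job hash here are made up; this is not the real InputCreate interface), the role an input create plays at such a transition is roughly:

# Hypothetical only. The point is that an input create decides how the
# outputs of one analysis become the jobs of the next, including which
# iohandler the new inputs need.
sub create_jobs_for_next_analysis {
    my ($output_ids, $next_analysis_id, $protein_iohandler_id) = @_;
    my @jobs;
    foreach my $protein_id (@$output_ids) {
        # one job per protein, tagged with the iohandler that fetches
        # proteins rather than the genomic fetcher used upstream
        push @jobs, {
            analysis_id  => $next_analysis_id,
            input_id     => $protein_id,
            iohandler_id => $protein_iohandler_id,
        };
    }
    return \@jobs;
}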

In this way, an InputCreate needs to know how to create inputs, how many inputs per job, which iohandlers
are to be used for these inputs, etc. I don't foresee having one for each runnable, but at least having some classes of them.
For example, we have
	a setup_initial  which just takes in an array of input ids and iohandler ids and creates jobs
	a setup_file     which takes in a file of sequences, splits it up into smaller files, and creates one job for each input (sketched below)
	a setup_genewise which does some of the conversion from contigs and passes chromosomal coords to the input_id (hacky for now)
	                 so that they can be mapped later etc.
	                 This can be generalized for other similar analyses that need mapping of coordinates.
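
To make the setup_file idea concrete, here is a rough standalone sketch of what it does conceptually (the chunk file naming and the job hashes are invented for illustration; the real module records jobs through the pipeline tables rather than printing them):

use strict;
use Bio::SeqIO;

# Split a fasta file into chunks of $chunk_size sequences and note one
# job per chunk file -- roughly what setup_file does, minus the database side.
my ($infile, $chunk_size) = @ARGV;
$chunk_size ||= 100;

my $in = Bio::SeqIO->new(-file => $infile, -format => 'fasta');

my $count = 0;
my $chunk = 0;
my ($out, @jobs);

while (my $seq = $in->next_seq) {
    if ($count % $chunk_size == 0) {
        $chunk++;
        $out = Bio::SeqIO->new(-file   => ">$infile.chunk$chunk",
                               -format => 'fasta');
        push @jobs, { input_id => "$infile.chunk$chunk" };
    }
    $out->write_seq($seq);
    $count++;
}

print scalar(@jobs), " jobs to create\n";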

Hopefully they can eventually be parameterized more and made reusable.

As for sanitization, right now the design is to have a DataMonger object into which one plugs InputCreate and Filter objects,
so they are both optional. What the pipeline sees is just this DataMonger object, which is treated as a runnable. By sanitization, I mean
logic for checking that the stuff you are running is correct. This logic may be very specific to your protocol.
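
For example (purely hypothetical; the package name and the run() calling convention are made up, and the real Filter interface may differ), a protocol-specific filter plugged into the DataMonger might be as simple as:

package MyMinLengthFilter;    # hypothetical name
use strict;

sub new {
    my ($class, %args) = @_;
    return bless { min_length => $args{-min_length} || 50 }, $class;
}

# Take a list of Bio::Seq objects and return only those that pass the
# protocol-specific sanity checks (here: long enough, not too many Ns).
sub run {
    my ($self, @seqs) = @_;
    return grep {
        my $str = $_->seq;
        my $n   = ($str =~ tr/Nn//);
        $_->length >= $self->{min_length} && $n / $_->length < 0.1
    } @seqs;
}

1;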


> 
> Yea, merging pipelines. It's really strange if you have different pipelines,
> but you can mix and match components of each pipeline. I'm going to poke
> around with the code a bit more to get a better idea of how the runnables
> are doing this....

One use case for this was, say, to have two ways of generating protein families (two pipelines) written by different people.
We also have a pipeline for protein annotation and a pipeline for tree building.

We want to be able to merge the two family builders with the tree building and annotation ones...
so that was the idea of doing this.


> Let's see if we can come up with some targets for the hackathon. I'll say
> I'm more interested in tackling the data inputs as well as the filtering.

Great, I was hoping to focus on this too.


shawn





