[Bioperl-pipeline] Changes to come (long long mail)
Shawn Hoon
shawnh at fugu-sg.org
Sat Aug 16 01:06:06 EDT 2003
I'm halfway through adding more functionality to biopipe. I've been
mulling over the idea of allowing analyses to be chained in memory,
and I hope this doesn't go against any biopipe philosophy ha..if there
are any. These changes will require modifications to the xml and
schema.
Motivation
---------------
During the execution of a series of analyses, the system requires that
each analysis has some place to store (in a db) or dump (to file) its
results in order to pass them between analyses.
This means that one
1) will store all intermediate results, so that if an analysis
fails, you can rerun from the last failed analysis.
2) will need to design a dumper/schema in which to hold the
intermediate results.
1) saves compute time, while 2) requires the programmer to do work:
design temporary databases, dbadaptors, etc.
An alternative to this is to write a Combo-Runnable, for example
BlastEst2Genome, but that is not very modular or extensible.
Sometimes the cost of doing 2) outweighs the benefit of 1), especially
if the analyses are mini jobs that run quickly.
So for the scenario where we have a series of analyses that run fast,
and we are only interested in storing the result of the last analysis,
it makes sense to allow chaining of jobs in memory.
My current use case:
Running a targeted est2genome/genewise to map cdna/proteins to a
genome.
The strategy is to blast the sequence with high cutoffs against the
genome to map the approximate location, then run a sensitive
est2genome or genewise against the smaller region.
In my case, I only want to run the alignments on the top 2 blast hits
(2 haplotypes).
So rather than doing the following:

est ->
  Analysis: Run Blast against genome -> Output (store blast hit)
  Analysis: setup_est2genome -> Input (fetch top 2 blast hits)
  Analysis: Est2Genome -> Output (store gene)

I now do the following:

est ->
  Analysis: Run Blast against genome
    -> Chain_Output (with filter attached) && Output (store blast hit) {optional}
    -> Analysis: setup_est2genome
  Analysis: Est2Genome -> Output (store gene)
We do not need a temporary blast hit database, but we can still store
the hits if we want to by attaching an additional output iohandler.
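To make the idea concrete, here is a minimal sketch in Python (not
biopipe's actual Perl API; all function names here are invented for
illustration) of chaining a blast analysis to est2genome in memory,
with an optional store hook playing the role of the extra output
iohandler:

```python
# Hypothetical sketch, not biopipe code: chain two analyses in memory,
# with an optional store hook for the intermediate blast hits.

def blast(est):
    # stand-in for a real blast run: return (region, score) hits
    return [("chr1", 98), ("chr1_hap", 95), ("chr2", 40)]

def top_hits_filter(hits, n=2):
    # the filter attached to Chain_Output: keep the best n hits
    return sorted(hits, key=lambda h: -h[1])[:n]

def est2genome(est, hit):
    # stand-in for a sensitive alignment against the smaller region
    return {"est": est, "region": hit[0], "gene": "gene_model"}

def run_chained(est, store_hits=None):
    hits = blast(est)
    if store_hits:                 # optional Output iohandler
        store_hits(hits)
    # Chain_Output: pass filtered hits straight to the next analysis,
    # no temporary blast hit database needed
    return [est2genome(est, h) for h in top_hits_filter(hits)]

genes = run_chained("est_001")
print(len(genes))   # alignments run only on the top 2 hits
```

The point of the sketch is only the data flow: the intermediate hits
live in memory for the duration of the job, and persisting them is an
optional side effect rather than a required step.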
The Guts
---------------
What I'm proposing is to have a grouping of rules.
A rule group means that a group of analyses will be chained within a
single job.
Sample rule table:

+---------+---------------+---------+------+---------+
| rule_id | rule_group_id | current | next | action  |
+---------+---------------+---------+------+---------+
|       1 |             1 |       1 |    2 | NOTHING |
|       2 |             2 |       2 |    3 | CHAIN   |
|       3 |             3 |       3 |    4 | NOTHING |
+---------+---------------+---------+------+---------+
Analysis1: InputCreate
Analysis2: Blast
Analysis3: SetupEst2Genome
Analysis4: Est2Genome
So here we have 3 rule groups, and each job will have its own rule
group. For a single est input, 3 jobs will be created over the course
of the pipeline execution:

Job 1: InputCreate (fetch all ests and create blast jobs)
Job 2: Blast (blast est against database)
       Output is chained to Analysis 3 (setup_est2genome) using an
       IOHandler of type chain with a blast filter attached
Job 3: Run Analysis 4 (est2genome) on the jobs created by Analysis 3

Chaining occurs only between Analyses 2 and 3.
If Job 2 fails, both the blast and setup_est2genome analyses will have
to be rerun.
You could imagine having multiple analyses chained within a rule_group.
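As a rough illustration of how the CHAIN action could group analyses
into a single job, here is a small Python sketch (not the actual
biopipe code; the rule tuples mirror the sample table above) that
follows CHAIN links from a starting analysis:

```python
# Hypothetical sketch: derive which analyses run inside one job by
# following CHAIN links in the rule table. Rules with action NOTHING
# end the chain and hand off to a new job.

RULES = [  # (current, next, action) -- mirrors the sample rule table
    (1, 2, "NOTHING"),
    (2, 3, "CHAIN"),
    (3, 4, "NOTHING"),
]

def job_chain(start):
    """Return the list of analyses executed within the job that
    starts at analysis `start`."""
    chain, current = [start], start
    while True:
        nxt = [n for (c, n, a) in RULES if c == current and a == "CHAIN"]
        if not nxt:
            return chain
        current = nxt[0]
        chain.append(current)

# Job 2 starts at analysis 2 (Blast) and chains SetupEst2Genome:
print(job_chain(2))   # [2, 3]
print(job_chain(1))   # [1]
print(job_chain(4))   # [4]
```

A failure anywhere in `job_chain(2)` would rerun the whole chain
[2, 3], which matches the trade-off described above: chained analyses
lose per-step restartability in exchange for skipping intermediate
storage.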
I have working code for this. The next thing I'm still thinking about
is a stronger form of datatype definition between the runnables, which
is currently not strongly enforced. It will probably be based on
Martin's (or Pise's or EMBOSS's) Analysis data definition interface.
We can have this information at the runnable layer, at the bioperl-run
wrappers layer, or both.
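For a feel of what such a datatype check between chained runnables
might look like, here is a hedged Python sketch (the actual interface
is an open question in the mail; the generic type names and the
RUNNABLE_TYPES registry are invented for illustration):

```python
# Hypothetical sketch: each runnable declares an (input, output)
# datatype pair, and a chain is validated before any job is created.

RUNNABLE_TYPES = {
    # runnable:        (input type, output type) -- invented names
    "Blast":           ("sequence",  "blast_hit"),
    "SetupEst2Genome": ("blast_hit", "sequence"),
    "Est2Genome":      ("sequence",  "gene"),
}

def check_chain(runnables):
    """Verify each runnable's output type matches the next one's
    input type; raise TypeError on the first mismatch."""
    for a, b in zip(runnables, runnables[1:]):
        out_t, in_t = RUNNABLE_TYPES[a][1], RUNNABLE_TYPES[b][0]
        if out_t != in_t:
            raise TypeError(f"{a} emits {out_t} but {b} expects {in_t}")
    return True

print(check_chain(["Blast", "SetupEst2Genome", "Est2Genome"]))  # True
```

The same declarations would also support the hierarchical chaining
below, since matching a whole pipeline's input to another's output is
just this check applied to the first and last runnables.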
Once this is done, we can have a hierarchical organization of the
pipelines:
- chaining analyses within rule groups
- chaining rule groups (add a rule_group relationship table, defined
  within 1 xml)
- chaining pipelines (add a meta_pipeline table), which means re-using
  different xmls as long as the inputs and outputs of the first and
  last analyses of the pipelines match.
I would like some help with this application definition interface if
people are interested or have comments...
sorry for the long mail..if you got to reading this point.
shawn