[Bioperl-pipeline] Plant-Biotech pipeline
Shawn Hoon
shawnh at stanford.edu
Wed Sep 17 11:44:21 EDT 2003
On Wednesday, September 17, 2003, at 7:28 AM, Joachim H. Strach wrote:
> Hello,
>
> first of all, thanks for your previous answers; they helped a lot with
> my understanding of the Biopipe workflow.
>
> Some more questions arose. I would be glad if you could either point
> me to the suitable documentation or give me some more answers.
>
> I took a closer look at genome_annotation_pipeline.xml:
> - What are the tags <transformer>, <input_iohandler_mapping> good for?
A transformer defines modules that may be used to operate on data
before and after it passes through iohandlers. This includes filtering
and other data transformation operations. Input transformers are
applied after fetching from iohandlers, while output transformers are
applied before storing to the database. For example, in a two-stage
blast that writes the result to the database both times, applying
filters when fetching the input and when writing the output, the flow
is:
   Input -> filter transformer -> blast 1 (analysis 1)
         -> filter transformer -> store to db
         -> filter transformer -> blast 2 (analysis 2)
         -> filter transformer -> store to db
Previously, biopipe required that all results be written to a database
before being fetched again for the next analysis. This is so that if,
say, analysis 2 fails, one would not need to rerun the first. Now,
sometimes the first analysis is some simple operation that runs fast,
and we don't want to bother with storing its results. So I have
recently committed some relatively new code for doing iohandler
chaining. The flow for the same analysis is slightly different:
   Input -> filter transformer -> blast 1
         -> filter transformer -> blast 2
         -> filter transformer -> store to db
So in this case, if blast 2 fails, we have to go back and rerun
blast 1. I don't think I have committed the xml for this. Will do so
when I get back from a dept retreat this week.
see mail:
http://bioperl.org/pipermail//bioperl-pipeline/2003-August/000387.html
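To make this a bit more concrete, a transformer declaration in the
pipeline xml looks roughly like the sketch below. The tag names
<transformer> and <input_iohandler_mapping> are the ones you asked
about; the module name and the nested elements are from memory and may
differ from what genome_annotation_pipeline.xml actually uses, so
treat this as a sketch rather than a reference:

```xml
<!-- sketch only: a transformer that filters results before they are
     passed on. Element names other than <transformer> and
     <input_iohandler_mapping> are from memory and may not match the
     current schema exactly. -->
<transformer id="1">
  <module>Bio::Pipeline::Utils::Filter</module>
  <method>
    <name>run</name>
    <rank>1</rank>
  </method>
</transformer>

<!-- attach the transformer when mapping an analysis's input
     iohandler to the one used by the next analysis -->
<input_iohandler_mapping>
  <prev_iohandler_id>1</prev_iohandler_id>
  <map_iohandler_id>2</map_iohandler_id>
  <transformer_id>1</transformer_id>
</input_iohandler_mapping>
```

The committed xml files under bioperl-pipeline/xml are the place to
check for the exact element names.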
> - What is the function of the <data_monger> ?
This is a 'special' runnable that is used to set up analyses.
Say you want to align cDNAs to genomes. What you may want to do is run
est2genome of the cDNA on the section of the genome where it hits,
found via blast. You wouldn't want to pass the entire chromosome, for
example, to est2genome. Instead you may need to figure out the region
of the blast hit, do some padding, and pass the slice of the genome
together with the cDNA to the next analysis. So you would figure out
the hit region and pass the start, end, and strand coordinates to the
est2genome input iohandler. To do this, we plug into the DataMonger a
Bio::Pipeline::InputCreate, which contains various 'hacky' modules
that set up jobs very specifically according to how your analysis
requires its inputs. This reconciles the fact that, a lot of the time,
the database adaptor modules do not return what you want to feed
directly into an analysis.
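For illustration, the DataMonger is declared as an analysis of its own
in the xml, wrapping an input_create module. The module name
"setup_cdna_hits" and the argument tags below are hypothetical names I
am using for the example, not the ones in any committed xml:

```xml
<!-- sketch only: a DataMonger analysis that creates inputs (padded
     genome slices plus the cDNA) for the next analysis.
     "setup_cdna_hits" and the argument tags are hypothetical. -->
<analysis id="1">
  <data_monger>
    <input_create>
      <module>setup_cdna_hits</module>
      <rank>1</rank>
      <argument>
        <tag>padding</tag>         <!-- bases to pad around the blast hit -->
        <value>1000</value>
      </argument>
    </input_create>
  </data_monger>
</analysis>
```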
> - At the rule section, in <action>: what does e.g. "COPY_ID" relate
> to?
>
Once a job is finished, the PipelineManager will look up what it
should do next with regard to this job. For COPY_ID, it will reuse the
same input id for the next analysis but may map the input iohandler to
a new one. For example, in RepeatMasker->Blast, both analyses use the
same input, say sequence_1, but the fetching of the sequence for blast
(via ensembl) would use fetch_repeatmasked_seq, while RepeatMasker
would fetch the unmasked seq as its input. So there is a reuse of the
input id and a change of the input iohandler.
see bioperl-pipeline/xml/README
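For reference, a rule carrying the input id over from RepeatMasker
(analysis 1) to blast (analysis 2) would look something like the
sketch below; the element names follow my recollection of the
xml/README, so check there for the exact form:

```xml
<!-- sketch only: after analysis 1 (RepeatMasker) finishes, run
     analysis 2 (blast) on the same input id; the input iohandler
     mapping then swaps in fetch_repeatmasked_seq for the fetch -->
<rule>
  <current_analysis_id>1</current_analysis_id>
  <next_analysis_id>2</next_analysis_id>
  <action>COPY_ID</action>
</rule>
```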
> - Shawn, why did you say "... return mostly bioperl objects". Which
> runnables do not and what do they return?
Uhm, okay, you got me. All committed runnables return bioperl objects.
However, sometimes we do write specific runnables that may return
ensembl objects (in genome annotation) or other objects that we use
for our own data schemas... not things we are proud of, so we do not
commit them yet... :)
> - My pipeline should perform two blast queries, where the second one
> gets as input the filtered output of the first one.
> How can I filter on the bioperl objects directly without using
> IO-handling? Or more general: How can I pass on the bioperl objects
> returned from a runnable to the runnable of the next analysis?
>
Ah, the iohandler chaining example described previously would be the
way. I will commit some examples this weekend.
cheers,
shawn
> Thanks for your advice.
>
> Joachim
> _______________________________________________
> bioperl-pipeline mailing list
> bioperl-pipeline at bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-pipeline