[Bioperl-pipeline] things to do

Jer-Ming giscjm at nus.edu.sg
Mon Jan 27 10:00:51 EST 2003


Hey Shawn,

Thanks for the rundown. I've looked at the generic pipelines, and I'm
catching on.


> I think the runnable ('running programs on sequences') part of the
> pipeline is quite mature, and new pipelines really can get implemented
> without major design issues. The main thing that is now quite apparent
> is data preparation. I think a major part of analysis involves data
> extraction and sanitization.


I'm not sure how you see this, but do you reckon data extraction and
sanitization would come under InputCreates? I can see what you mean about
InputCreates growing unwieldy, though. Do you think a generic one could
be written instead, with the details specified by definitions in the XML
template? Otherwise, to set up a new pipeline, one writes a template, an
InputCreate, a filter... Perhaps a single InputCreate would do, since
Bio::DB was written to standardize fetching from various sources; it
looks really weird having one InputCreate per template.
Yes, it's going to be tricky pulling out specific biological objects.
Again, data extraction and sanitization. What ideas do you have?
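
To make concrete what I mean by a single generic InputCreate, here's a
rough sketch. The fetch_inputs() helper is hypothetical; only the
Bio::DB::* calls are real. The point is just that anything implementing
the Bio::DB interface can be driven by the same code, with the template
supplying the module name and constructor arguments:

    use Bio::DB::GenBank;

    # Hypothetical generic fetch: the XML template would supply the
    # Bio::DB module name and its constructor arguments, so one
    # InputCreate could cover many data sources. (A real version
    # would 'require' the module dynamically.)
    sub fetch_inputs {
        my ($dbmodule, $args, $ids) = @_;
        my $db = $dbmodule->new(%$args);
        # get_Seq_by_id is common to the Bio::DB-style databases,
        # so no source-specific code is needed here.
        return map { $db->get_Seq_by_id($_) } @$ids;
    }

    # e.g. fetch_inputs('Bio::DB::GenBank', {}, ['J00522']);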

>
> InputCreates:
> I think a lot of the conceptualization for designing pipelines is made
> a whole lot easier if one doesn't have to worry too much about how to
> translate input data to jobs. This is where the InputCreates may or
> may not be doing such a great job. InputCreates have become our
> designated box for containing 'hacky' code that sets up analysis. I
> can see how these may grow unwieldy and too ad hoc to be used by
> others, even if they are meant for hacky code. In any case, it works
> currently. We have modules for setting up file-based analysis
> (basically, given a set of files: split them up, convert them to a
> certain format, create jobs, etc.) and db analysis (gets dbIDs and
> creates jobs).
> I have been toying around with a module that, given keywords, fetches
> sequences remotely from NCBI using Bio::DB::GenBank and runs them
> through the pipeline.
> I think raw data is easy to handle with the Bio::DB::*, Bio::Index::*
> and Bio::SeqIO::* modules. The challenge will come from computed data
> like features, genes and other biological objects. If we have more
> reusable ways of extracting these data, that would be great.
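
The keyword-fetching module sounds handy. Is it something along these
lines? (A sketch; the Bio::DB::GenBank and Bio::DB::Query::GenBank calls
are the real bioperl API, but create_job() is a made-up stand-in for the
pipeline's job setup.)

    use Bio::DB::GenBank;
    use Bio::DB::Query::GenBank;

    # Fetch sequences from NCBI by keyword and feed each one to the
    # pipeline.
    my $query = Bio::DB::Query::GenBank->new(
        -db    => 'nucleotide',
        -query => 'fugu[ORGN] AND kinase',
    );
    my $stream = Bio::DB::GenBank->new->get_Stream_by_query($query);
    while (my $seq = $stream->next_seq) {
        create_job($seq);    # stand-in for biopipe's job creation
    }
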
>
> Filters:
> This is majorly underdeveloped. I think a lot of the logic in filter
> scripts out there is wasted by not being reusable. We should have a
> better framework/interface into which we can plug different filters
> for different uses. It could be object-centric (features, sequences,
> trees, etc.). Currently, filters are attached to IOHandlers. In some
> sense, they could also become their own runnables, so that an entire
> pipeline is just filters. Because biopipe is flexible, both are valid
> solutions. I think we should think about how to develop filters.
> People want filters that allow human eyeballing and input, and also
> backup of filtered data. Hilmar, I think, has started some neat
> filtering code in SeqFeatures which we could use and extend.
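
On the filter framework: maybe the minimal thing is a common interface
that every filter implements, so filters can be chained no matter
whether they hang off an IOHandler or run as their own runnable. A
hypothetical sketch (none of these package names exist yet):

    package Bio::Pipeline::FilterI;    # hypothetical interface
    use strict;

    # A filter takes a list of objects (features, sequences, trees...)
    # and returns the subset that passes accept().
    sub filter {
        my ($self, @objects) = @_;
        return grep { $self->accept($_) } @objects;
    }
    sub accept { die "accept() must be implemented by a subclass" }

    package MyScoreFilter;             # hypothetical concrete filter
    our @ISA = ('Bio::Pipeline::FilterI');
    sub new    { my ($class, %args) = @_; bless {%args}, $class }
    sub accept { my ($self, $feat) = @_; $feat->score >= $self->{-cutoff} }

Chaining is then just composition, and the 'backup of filtered data'
you mention could be a wrapper filter that writes rejects to a file
before dropping them.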


>
> Merging Pipelines:
> We have XML that allows people to share pipelines. At some point, Elia
> pointed out that it would be neat if it were easier for people to
> merge pipelines together without much fuss. More explicit datatype
> definitions (like EMBOSS's acd) between runnables would be the way to
> go for this, I think... something worth exploring.
>

Yeah, merging pipelines. It sounds strange at first to merge different
pipelines, but really you can mix and match components of each. I'm going
to poke around with the code a bit more to get a better idea of how the
runnables are doing this....
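
For what it's worth, explicit datatype declarations on runnables might
look something like this. The method names here are invented, just to
sketch the acd-like idea:

    # Each runnable declares what it consumes and produces, so a merge
    # tool can check that pipeline A's output can feed pipeline B.
    sub input_datatype  { return 'Bio::PrimarySeqI' }
    sub output_datatype { return 'Bio::Search::Result::ResultI' }

Merging would then just be matching the output_datatype of one analysis
to the input_datatype of the next.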

Let's see if we can come up with some targets for the hackathon. I'd say
I'm most interested in tackling the data inputs as well as the filtering.

It also looks like there's quite a bit of interest in getting biopipe to
sync better with GBrowse and BioSQL. Great!

Jerm





> Pipeline Optimization:
> We started developing biopipe with the goal of flexibility, but I know
> there is plenty of room for optimization in biopipe. At this point, we
> are still encountering leaky db connections resulting in too many
> db-connection errors. We should also look at benchmarking the
> pipelines, which Frans has been doing of late with the number of jobs
> he is sending to the poor 60+ node farm ;).
>
> Job Tracking Interface:
> We should have better interfaces for job tracking and management in
> Biopipe. This would be something really good to develop. Right now we
> use SQL to count jobs, delete jobs, etc. A better API, a separate
> application, or even a shell that lets one query jobs and
> stop/start/delete/pause jobs would be cool. We should also flesh out
> better wrappers to the underlying BatchSubmission systems to utilize
> their more sophisticated functions. For example, one shortfall in
> Biopipe is that if a job fails due to too many db-connection errors,
> it may not be able to update the job table to say that it has failed.
> As such, its state gets stuck in 'submitted' and it never gets
> resubmitted. We should have a smarter way of noticing that a
> particular job is taking too long, querying its state through LSF/PBS
> (e.g. bjobs <jobid>), figuring out that it is no longer running,
> setting it to failed, and rerunning it...
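
Checking a stuck job against LSF could be as simple as something like
this (a sketch: the parsing assumes the standard bjobs JOBID/USER/STAT
column layout, and the $job accessors are pseudo-code for the real
job-table update):

    # Ask LSF what it thinks a job stuck in 'submitted' is doing.
    sub lsf_status {
        my ($queue_id) = @_;
        my @out = `bjobs $queue_id 2>/dev/null`;
        return 'UNKNOWN' unless @out > 1;
        # bjobs prints a header, then: JOBID USER STAT QUEUE ...
        return (split /\s+/, $out[1])[2];    # the STAT column
    }

    my $stat = lsf_status($job->queue_id);
    if ($stat eq 'EXIT' or $stat eq 'UNKNOWN') {
        # The job died without updating the db; mark it failed so
        # the pipeline resubmits it.
        $job->set_status('FAILED');
    }
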
>
> Result Viewing:
> Not really part of biopipe, since I/O sources are abstracted out of
> biopipe. But practically, I think the particular pipelines and the
> data one generates will drive the development of the data
> visualization software. We have been using GBrowse closely with the
> protein annotation pipeline for really quick viewing of features
> through BioSQL (and I mean quick: one config file and we are up!). I
> think the plans that others have for viewing trees and alignments
> would be great too. Integration with BioSQL to store richer objects
> is important. Right now the failsafe solution is to dump things out
> to files, since bioperl has the richest set of modules for file I/O.
> Not great for high-throughput stuff, even though Bio::DB::Fasta
> scales quite well.
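
Agreed that Bio::DB::Fasta holds up surprisingly well for the file-dump
fallback. For anyone on the list who hasn't tried it, the API is tiny
(real calls, made-up path and id):

    use Bio::DB::Fasta;

    # Index a directory of FASTA dumps once, then get random access.
    my $db  = Bio::DB::Fasta->new('/data/pipeline_dumps');
    my $seq = $db->get_Seq_by_id('scaffold_1042');      # a sequence object
    my $sub = $db->seq('scaffold_1042', 100 => 200);    # substring fetch
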
>
> That's all I can think of right now ;) Sorry for the long mail; this
> is also for the entire list, for people who have ideas and want to
> pick things up and code, or to see what biopipe is up to....
>
> Very interested to hear what you think.
>
>
>
> shawn
>
>
> On Fri, 24 Jan 2003, Jerm wrote:
>
> > Hey guys.
> >
> > It's been a while since I've done anything on the biopipe. I thought
> > I would get a picture of the TO DO LIST, so that I'll be able to get
> > back into the development circle.
> >
> > Can someone please give me a feel for what you guys are tackling at
> > the moment?
> >
> > thanks.
> > Jerm
> >
> > _______________________________________________
> > bioperl-pipeline mailing list
> > bioperl-pipeline at bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-pipeline
> >
>
> --
> ********************************
> * Shawn Hoon
> * http://www.fugu-sg.org/~shawnh
> ********************************
>
>


