[BioPython] biopython integration with make-like tools (e.g. waf, paver)

Peter biopython at maubp.freeserve.co.uk
Mon Nov 17 17:27:12 UTC 2008


>> Personally in this situation I tend to just write a wrapper python
>> script (or sometimes a shell script or batch file) to call the sub
>> scripts.  i.e. the KISS principle.
>
> wrapper scripts are often not the optimal solution.
> - Over time, they tend to become very complex and full of
> commented-out statements.

That certainly can happen - but it can happen with any tool, even Makefiles.

> When you complete a part of your experiment (e.g. you download your
> input sequences from NCBI) you will likely comment out the
> statement that you used to download it.

Personally, to avoid this kind of thing, I make the download (or
running BLAST, or whatever) conditional on a check to see if the
output file exists (rather than just commenting out the call).  You
could also do date checking in code.
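
Something like this minimal sketch - the helper and the file names
are made up, just to illustrate the idea:

import os

def needs_rerun(output, *inputs):
    """True if the output file is missing or older than any input."""
    if not os.path.exists(output):
        return True
    out_time = os.path.getmtime(output)
    return any(os.path.getmtime(f) > out_time for f in inputs)

# Only (re)run BLAST if the query changed since the last run:
if needs_rerun("my_blast.txt", "my_query.fasta"):
    print("Need to (re)run BLAST...")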

> If you then discover that the sequences you downloaded were
> wrong, you have to un-comment the same statement, and here you can
> make mistakes

In my case, I can delete the old input sequences (or the BLAST output)
and re-run the script.  I would agree that for more complicated
multi-step analyses this requires some thought - but you can at least
handle error conditions any way you like (i.e. helpful messages
instead of whatever the build tool does).
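
For instance, using the subprocess module you can give a helpful,
analysis-specific message when a step fails (the BLAST command line
below is just an invented example):

import subprocess
import sys

command = "blastall -p blastn -d nt -i my_query.fasta -o my_blast.txt"
return_code = subprocess.call(command, shell=True)
if return_code != 0:
    # Your own informative message, not the build tool's generic one:
    sys.exit("BLAST failed (return code %i) - check my_query.fasta"
             % return_code)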

> It is very difficult to remember which statements you commented out
> because they were wrong and when, and the wrapper script becomes messy
> very quickly, while always taking you a lot of time to maintain.
> I used wrapper scripts for a year during my master's project and I
> think that's not really KISS. It seems very difficult to reproduce an
> analysis done without a pipeline.

I guess it depends on what you mean by a pipeline - you can have a
robust pipeline which is essentially one master python script.  I
agree there is a danger that the script will evolve over time into a
horrible mess.
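
For example, the master script could be little more than an ordered
list of steps, each skipped when its output file already exists (all
the names here are hypothetical):

import os

def download_sequences():
    print("Downloading sequences from NCBI...")

def run_blast():
    print("Running standalone BLAST...")

# Each step declares the output file it is responsible for:
PIPELINE = [
    ("sequences.fasta", download_sequences),
    ("my_blast.txt", run_blast),
]

for output_file, step in PIPELINE:
    if os.path.exists(output_file):
        print("Skipping %s, already done" % output_file)
    else:
        step()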

> - make can have a nasty syntax, but it is a standard. If you type
> 'make help' you get help, and if you type 'make all' usually you will
> carry out the whole analysis, without having to worry about which
> particular scripts are run.

I would agree that make has a nasty syntax.  Note that make isn't a
completely cross-platform standard (although you can get it on Windows
via Cygwin, for example).

> - there are other build systems than make, some of them written in
> python and/or for python.
> That means you won't necessarily have to learn a new programming
> syntax. Have a look at rake, all the examples I've seen are very
> clean. I'll let you know when I have learnt waf or paver.

These (and Make) all seem to be designed to solve a different problem,
handling the compilation and/or installation of software with multiple
dependencies.  That doesn't mean you can't use them for a pipeline,
but it may not be ideal.

> - makefile-like tools usually already support multi-threading. If I
> want to run a program on a cluster, the easiest thing for me is to
> write a makefile, and it works already.

For trivial multi-threading, yes, make can help.
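
Although for comparison, a wrapper script can get trivial parallelism
too, e.g. with the multiprocessing module (new in Python 2.6).  This
sketch fans a hypothetical BLAST command out over three made up query
files, much like make -j3 would:

import subprocess
from multiprocessing import Pool

def run_blast(query_file):
    command = "blastall -p blastn -d nt -i %s -o %s.out" \
              % (query_file, query_file)
    return query_file, subprocess.call(command, shell=True)

if __name__ == "__main__":
    queries = ["chunk1.fasta", "chunk2.fasta", "chunk3.fasta"]
    pool = Pool(processes=3)  # roughly equivalent to make -j3
    for name, code in pool.map(run_blast, queries):
        print("%s finished with return code %i" % (name, code))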

> - makefiles allow you to re-execute parts of your analysis easily when
> your input files or your scripts change.
> This is very useful, I don't want to write a wrapper script that
> checks if a file has been modified since the last time I used it
> to calculate some results - because make-like tools already do that.

If you already know how to work with makefiles, then this does have
some advantages, i.e. instead of writing a python wrapper script, you
write a simple Makefile.

I think we agree that Make is pretty complex, a language in its own
right.  This means if you want someone else to use your pipeline, then
they have to learn how to use make too (if anything goes wrong or they
want to change it).

> Wouldn't you prefer something like:
> - if the blast output doesn't exist, OR it exists but it is older than
> the script used to launch it, or older than the input sequence, then
> run it again?

That sounds potentially useful for a complicated analysis pipeline.
But suppose you also wanted to check the current version of BLAST
installed against the version of BLAST used in the existing output file?
This would probably be possible within a Makefile using some embedded
shell scripts calling grep, but it wouldn't be very nice at all.
Although it would still be a non-trivial bit of code, I would prefer
to do this in python (maybe put the code into a library function for
reuse).
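
As a very rough sketch of that version check in python - this assumes
legacy plain-text BLAST output, where an early line looks something
like "BLASTN 2.2.18 [Mar-02-2008]", and the file name is made up:

import re

def blast_output_version(filename):
    """Pull the version string out of a plain-text BLAST report.

    Returns None if no program/version line is found.
    """
    for line in open(filename):
        match = re.match(r"T?BLAST[NPX]\s+(\S+)", line.strip())
        if match:
            return match.group(1)
    return None

# Redo the search if the old output used a different version of
# BLAST to the one we want to use now:
wanted = "2.2.18"
found = blast_output_version("my_blast.txt")
if found != wanted:
    print("Re-running BLAST (old output was from version %s)" % found)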

My point is, using some other tool like Make could make certain
operations easier, but with a python script you can do this sort of
thing and more.  You have full control, without adding another
dependency to the project.

> that's the kind of thing that makefile tools can do for you already,
> without having to write complicated python conditions.

True - but as I have tried to illustrate above, even Make has its limitations.

> The best thing would be to learn how to write workflows, like the ones
> from Taverna and similar tools.
> But it takes time, and I think it is better if you know both.
> As I was saying before, make has the worst syntax, but maybe there are
> other build tools which are better.

I certainly wouldn't be keen on make itself, but there might be a
python library out there that would be a good compromise (making the
common file existence/date based tasks easy, but allowing arbitrary
extension - e.g. my BLAST version check requirement).

Peter


