[Biopython] Alternatives to Bio.Application for invoking command line tools?
Peter Cock
p.j.a.cock at googlemail.com
Sat May 9 16:08:04 UTC 2020
Dear Biopythoneers,
Biopython has a lot of command line tool wrappers, based around the objects
in Bio/Application/__init__.py, for building a command line string and
running it. Some time ago I started to think that we might actually be
better off dropping our in-house command line wrappers, and recommending a
standard or third party library approach for defining and executing command
line strings instead.
Taking an example in our tutorial, running the blastx tool from NCBI
BLAST+. Currently Biopython provides a specific object for the blastx
command, which knows all the expected command arguments, can do some
validation, and even has some basic help text included for each of them:
>>> from Bio.Blast.Applications import NcbiblastxCommandline
>>> help(NcbiblastxCommandline)
...
>>> blastx_cline = NcbiblastxCommandline(query="opuntia.fasta", db="nr",
evalue=0.001, outfmt=5, out="opuntia.xml")
>>> blastx_cline
NcbiblastxCommandline(cmd='blastx', out='opuntia.xml', outfmt=5,
query='opuntia.fasta',
db='nr', evalue=0.001)
>>> print(blastx_cline)
blastx -out opuntia.xml -outfmt 5 -query opuntia.fasta -db nr -evalue 0.001
>>> stdout, stderr = blastx_cline()
This works quite nicely, but writing a unique class for each command line
tool we wish to support is a lot of quite tedious work, especially if
including minimal documentation for the arguments or argument validation.
This is also an on-going maintenance problem - one of the issues I think we
should fix before the next Biopython release is updating the NCBI BLAST+
wrappers as new arguments have been added.
Some tools have a rather cryptic command line API, and in those cases
perhaps our efforts are sensible. However, with tools like NCBI BLAST+
where there is a clear command line API, I don't see that our efforts
actually add a great deal over constructing the string in code and calling
subprocess:
>>> import subprocess
>>> cmd = "blastx -query opuntia.fasta -db nr -out opuntia.xml -evalue 0.001 -outfmt 5"
>>> subprocess.check_call(cmd, shell=True)
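As a rough sketch of the "construct it in code and call subprocess" idea, the arguments can also be built as a list and passed without shell=True, which avoids quoting problems with filenames. The helper names build_cmd and run_tool below are hypothetical (they are not part of Biopython or the standard library), and the single-dash prefix is assumed to match the BLAST+ convention:

```python
import shlex
import subprocess


def build_cmd(program, prefix="-", **options):
    """Turn keyword arguments into an argument list (hypothetical helper)."""
    args = [program]
    for name, value in options.items():
        # Each option becomes two list items, e.g. "-evalue" and "0.001"
        args.extend([f"{prefix}{name}", str(value)])
    return args


def run_tool(program, prefix="-", **options):
    """Run the tool and return the CompletedProcess; raises on non-zero exit."""
    return subprocess.run(
        build_cmd(program, prefix, **options),
        check=True,
        capture_output=True,
        text=True,
    )


cmd = build_cmd("blastx", query="opuntia.fasta", db="nr",
                out="opuntia.xml", evalue=0.001, outfmt=5)
# Show the equivalent shell command line (shlex.join needs Python 3.8+)
print(shlex.join(cmd))
# To actually run it (needs blastx on the PATH and the input files):
# result = run_tool("blastx", query="opuntia.fasta", db="nr",
#                   out="opuntia.xml", evalue=0.001, outfmt=5)
```

This does no argument validation and offers no help text, so it only covers the string-building part of what the current wrappers do.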
There are third party libraries which might be easier? For example, the sh
library supports our current style with keyword arguments:
>>> from sh import blastx
>>> blastx(query="opuntia.fasta", db="nr", out="opuntia.xml",
evalue="0.001", outfmt="5", _long_prefix="-")
You can avoid repeating the extra argument due to the NCBI not following
the minus-minus prefix convention, e.g.:
>>> import sh
>>> blastx = sh.blastx.bake(_long_prefix="-")
>>> blastx(query="opuntia.fasta", db="nr", out="opuntia.xml",
evalue="0.001", outfmt="5")
See https://github.com/amoffat/sh
This is close to the same usability our wrapper offers, but with no ongoing
maintenance burden. It would need more investigation (especially commands
where the order is critical, often seen on macOS but not Linux), but
Windows support aside it seems attractive.
If there was a cross-platform system which offered this Python-like syntax
for specifying the command line arguments, that would be a tempting
alternative. I don't think plumbum (Latin for lead, as used for pipes in
the past) does, and I find this form heavy:
>>> from plumbum import local
>>> cmd = local["blastx"]["-query", "opuntia.fasta", "-db", "nr", "-out",
"opuntia.xml", "-evalue", "0.001", "-outfmt", "5"]
>>> cmd()
''
See https://github.com/tomerfiliba/plumbum
What do people think? Do you have a favourite third party library for this
kind of thing?
Peter