[Biopython] A possibility for speeding up FASTA/FASTQ reading in BioPython

Mon Nov 24 16:27:19 EST 2025

On this topic, using an index or an alternative file format would be my
first thought for speed. Any decent benchmarks for different access
patterns / file / index formats out there?
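
To make the index idea concrete, here is a minimal sketch of a byte-offset
index over a FASTA file. It is an illustration only, not Biopython's
implementation or a benchmark: the names build_fasta_index and fetch_record
are hypothetical, and it assumes an uncompressed, ASCII FASTA file with
unique record IDs. Biopython's own Bio.SeqIO.index provides dictionary-like
random access along similar lines.

```python
def build_fasta_index(path):
    """Map each record ID to the byte offset of its '>' header line."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()  # position before reading the line
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                # The ID is the first whitespace-delimited word of the header.
                record_id = fields[0].decode("ascii") if fields else ""
                index[record_id] = offset
    return index


def fetch_record(path, index, record_id):
    """Seek straight to one record and return (header, sequence) as str."""
    with open(path, "rb") as handle:
        handle.seek(index[record_id])
        header = handle.readline()[1:].decode("ascii").rstrip("\r\n")
        chunks = []
        for line in handle:
            if line.startswith(b">"):  # next record starts here
                break
            chunks.append(line.strip())
    return header, b"".join(chunks).decode("ascii")
```

The index is built in one pass over the file. After that, each lookup is a
seek plus a read of a single record, which helps random access but does
nothing for a full sequential scan.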

On Mon, Nov 24, 2025, 9:41 AM Peter Cock <p.j.a.cock at googlemail.com> wrote:

> Hello Terry,
>
> I just posted a blog about my thoughts on receiving generative AI
> contributions as an Open Source project maintainer:
>
>
> https://blastedbio.blogspot.com/2025/11/thoughts-on-generative-ai-contributions.html
>
> I am sceptical, and in this case adding a Rust dependency to Biopython
> seems too much to ask. I think you could get similar performance gains
> with C (which we do use) where at least the maintainers have some
> experience. However, even there, gains may not make the additional
> complexity and maintenance burden worthwhile.
>
> Thank you for writing and asking, rather than surprising everyone with
> a large pull request.
>
> Peter
>
> P.S. Cross reference https://github.com/biopython/biopython/pull/5085
>
> On Tue, Nov 11, 2025 at 10:00 PM Jones Kelly, Terence Carleton
> <terence.jones at charite.de> wrote:
> >
> > Hi all
> >
> > I regularly process reasonably large FASTQ (hundreds of billions of
> sequencing reads) and FASTA files using BioPython. For some years I've been
> meaning to implement a FASTQ/FASTA reader in a compiled language and add
> Python bindings to improve the speed. I could've done this in C but I spent
> some decades writing C and I wanted to learn something new, so I considered
> a few languages. Because Rust makes it very easy to create Python bindings,
> I decided to give it a try. I thought I'd get going by asking the Claude
> CLI to write me some Rust. That turned out to be a much, much better
> experience than I had anticipated. With Claude I played with several
> implementations, keeping track of timing. Claude also wrote some tests. To
> compare what I was seeing I got Claude to write a pure Python version, a
> pure C version, Python bindings to the C, and to create a benchmark suite.
> From what I can tell, the Rust/Python (and the C/Python) FASTA reading is
> twice as fast as BioPython and FASTQ reading is four times as fast. I
> didn't write a single line of code. I just did some minimal cleaning up
> when things were already far along. I've been using the code for the last
> month or two with no problems.
> >
> > The repo is at https://github.com/VirologyCharite/prseq  (prseq =
> Python/Rust for sequences). You'll find the benchmark results on that
> page.  There are still some small things I would adjust in the API.  BTW,
> Claude also wrote the README (which should definitely be improved).
> >
> > I am wondering if there might be interest in incorporating this into
> BioPython. I don't know if there are any Rust dependencies in BioPython but
> I know that there are some C extensions. We could use either, as their
> speeds are comparable. If there's interest, I'd be happy to help (or to do
> it all, after some discussion and maybe with some guidance).
> >
> > Thanks very much for all the work on BioPython. It's really been a
> pleasure to use the code over the last dozen years or so.
> >
> > Terry Jones
> >
> >
> > _______________________________________________
> > Biopython mailing list  -  Biopython at biopython.org
> > https://mailman.open-bio.org/mailman/listinfo/biopython