[BioPython] Fwd: NCBI Abuse Activity with BioPython

Peter biopython at maubp.freeserve.co.uk
Thu Jun 26 11:21:57 UTC 2008


On Thu, Jun 26, 2008 at 2:15 AM, Andrew Dalke <dalke at dalkescientific.com> wrote:
> Hi Chris,
>
>  I'm no longer part of the Biopython dev team, but I read at least the
> subject line on the mailing list.
>
>  I wrote the Biopython EUtils package around December 2002 and according to
> the CVS logs it was added to Biopython in June 2003, so more than 5 years
> ago.  Looking at the commit logs there haven't been any changes to the
> relevant code since 2004, and that was a minor patch.
>
>  I thought I put a rate limiter into the code, but looking at it now I see I
> didn't.  The documentation clearly states that users must follow NCBI's
> recommendations, but who actually reads documentation?
>
> There's a couple of points here.  The quickest and most direct way to
> force/fix the code is to change the "def _get()" in ThinClient.py .  ...

I've updated Bio/EUtils/ThinClient.py in CVS based on your suggested
change, and checked the unit tests test_EUtils.py and
test_SeqIO_online.py (which calls Bio.EUtils via Bio.GenBank).

Looking over the code, should this wait also be applied to the
ThinClient's epost() method?
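
For illustration, here is a minimal sketch of the kind of wait being
discussed. It is not the patch committed to CVS: the names RateLimiter and
throttled are hypothetical, and the 3 second default is taken from the
limit mentioned later in this thread. One shared limiter is used so that
any request method it wraps (an epost-style call as well as a plain GET)
counts towards the same limit.

```python
import functools
import time


class RateLimiter:
    """Ensure at least `min_interval` seconds pass between consecutive calls."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the logic can be tested
        self._sleep = sleep      # without actually waiting
        self._last_call = None   # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# Hypothetical module-level limiter shared by every wrapped function.
_limiter = RateLimiter(min_interval=3.0)


def throttled(func):
    """Decorator: wait on the shared limiter before each call to `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        _limiter.wait()
        return func(*args, **kwargs)
    return wrapper
```

Decorating each function that opens a connection with @throttled would
then cover every request path in one place, rather than relying on each
method to remember the delay.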

> When I wrote this module I think I assumed that whoever would use the
> library would use the code correctly.  Using it correctly means a few
> things:
>  - obey the restrictions set by NCBI
>  - change the 'tool' and 'email' settings, so NCBI complains to the right
> person.
>     (The default is to say 'EUtils_Python_client' and
> 'biopython-dev at biopython.org')
>
> This isn't happening.  The patch above force-fixes the first.  Should
> Biopython do a better job of the second?  It's not easy to figure out the
> correct email.  I couldn't then and can't now think of a better solution.
>  Perhaps use the result of getpass.getuser()?  But that doesn't get the rest
> of the domain for a proper email.  Though NCBI should be able to guess the
> site from the IP address.

Figuring out the user's email address is tricky, especially
cross-platform.  Perhaps we should update the Bio.EUtils and Bio.Entrez
documentation to recommend that users set their email address here
and, if they are wrapping Biopython in a larger tool (e.g. a web
service), set the tool name too.
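
As a rough sketch of what that recommendation amounts to on the wire: the
E-utilities accept 'tool' and 'email' as ordinary query parameters, so a
caller can be made to supply them explicitly. The helper build_eutils_url
and the tool name "my_pipeline" below are hypothetical and not part of
Biopython; the base URL is the current NCBI one and the code is written
for Python 3.

```python
from urllib.parse import urlencode

# Current NCBI E-utilities base URL (an assumption about the deployment
# you are targeting; adjust if NCBI documents a different endpoint).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_eutils_url(program, email, tool="my_pipeline", **params):
    """Build an E-utilities URL that always carries 'tool' and 'email'.

    `program` is the script name, e.g. "efetch.fcgi"; `params` are the
    remaining query parameters such as db and id.
    """
    # Refuse to fall back on a shared default address: the point is that
    # NCBI can contact the person actually running the queries.
    if not email or "@" not in email:
        raise ValueError("set a real contact email address")
    query = dict(params, tool=tool, email=email)
    # Sort the parameters so the generated URL is deterministic.
    return EUTILS_BASE + program + "?" + urlencode(sorted(query.items()))
```

For example, build_eutils_url("efetch.fcgi", "someone@example.org",
db="nucleotide", id="12345") yields a URL whose query string names both
the calling tool and a contact address.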

> If I had to guess, likely more people find the ThinClient code easier to
> understand, because the NCBI interface has a simple way to get the result
> for a single record, without using the history interface.  The NCBI
> interface doesn't guide people to the right way to use it effectively.

I would agree with you.  I would go further, and say for a new user
even the ThinClient is a bit scary, and that the wrapper functions in
Bio.GenBank are nicer to use.

> I started working on an update to EUtils which improved the API to include a
> few helper functions, like "EUtils.search()" instead of having to create a
> HistoryClient.  That might help guide people to using it better.  I wrote up
> something about it a few years ago:
>  http://www.dalkescientific.com/writings/diary/archive/2005/09/30/using_eutils.html
>
> But a problem in completing that is that I never got any sort of funding or
> user feedback on how people were using the software, and as I moved over to
> chemistry it became lower and lower on my list.  That's still the problem
> with me working on this again.

This complexity is also daunting for anyone else considering taking
over the Bio.EUtils code base.

> I don't know about this next point, but there might also be a lack of
> documentation on how to use the Biopython interface effectively?  The NCBI
> documentation isn't meant for non-programmers (it's more of a
> bytes-on-the-wire document) so perhaps people are pattern matching on what
> looks right and going with what works, vs. what works well.  Then because
> there was no 3 second limit, they had no incentive to find a better/faster
> solution.

That would explain how the unnamed user ended up making over 18
requests per second!  I confess I had assumed that things like the
Bio.GenBank wrappers would be respecting the 3 second rule (at least
they should do now).

Peter


