[Biopython-dev] Two issues on Bio.Entrez (DTD download fallback, recent change in id lists)

Thu Mar 1 14:59:42 UTC 2012

On 01/03/2012, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> 2012/3/1 Daniël van Adrichem <daniel at treparel.com>:
>> Hello list,
>>
>> Firstly I want to report a bug plus suggested fix.
>>
>> Today I noticed a bug which got triggered by missing local DTDs. I was
>> still using 1.58 which does not have the new DTDs.
>>
>> Missing the DTDs locally should be handled by downloading them. This
>> worked for the first DTD, but then on the second one (which is a
>> dependency of the first one) I got a HTTP 404.
>>
>> After investigating I found that the module was making a request for
>> "http://www.ncbi.nlm.nih.gov/corehtml/query/DTD\nlmmedlinecitationset_120101.dtd"
>> Note the backslash right after DTD. It gets turned into a %5C and
>> causes the 404.
>
> That DTD should be in Biopython 1.59 - and hopefully the other
> DTD you mentioned but did not name. Please let us know if there
> are any more we've missed.
>
> https://github.com/biopython/biopython/commit/5f08ccdfe0706f9073bce441609aa86b1ea9d0f4
>

I haven't encountered any missing DTDs since I updated to 1.59

>> The cause of this is usage of os.path.join to concatenate the URL. I
>> am running this on windows, on a platform where the file system uses a
>> forward slash this would work just fine.
>>
>> please find attached a patch to fix this issue.
>
> That makes perfect sense, although as written your patch could
> result in too many slashes being used - thus:

Preventing double slashes is a good thing, nice.

https://github.com/biopython/biopython/commit/c93b32bab5526a830e2cb14f0db782ee1b687715
>
> Would you like to be thanked in the NEWS file and listed as a contributor
> (in the CONTRIB file)?

It is only a single line patch, but if you insist I am fine with it :)

>> Secondly I want to comment on the recent change in Bio.Entrez.efetch
>> (commit 01b091cd4679b58d7e478734324528dd9d52f3ed). While this change
>> did fix the problem, I think this might be achieved in a cleaner way.
>>
>> Please see the code that is used to format the options on the url (in
>> Bio.Entrez._open):
>>
>> options = urllib.urlencode(params, doseq=True)
>>
>> the doseq argument specifically. Its documentation states:
>> "If any values in the query arg are sequences and doseq is true, each
>> sequence element is converted to a separate parameter."
>>
>> So this was the reason for the "id=1&id=2&id=3" formatting. Without
>> doseq set this would turn into: "id=1,2,3"
>>
>> If this doseq functionality is not needed for other params (I am
>> unsure of this), I suggest to revert the change in efetch() and use
>> doseq=False (which is default argument)
>
> Very good question - Michiel?

Ok, what I wrote here isn't really accurate. Using
urllib.urlencode({'id': range(3))})
returns 'id=%5B0%2C+1%2C+2%5D'
note the %5B (square bracket open) and %5D (square bracket close).
Apparently urlencode takes str(range(3)), which is '[0, 1, 2]'

Weirdly enough the URL with the [ and ] surrounding the id list seems
to be accepted, which is why I thought my suggestion worked.

So looking at it again I suggest to keep the code as it is right now.

Maybe only make sure the iterable consists of strings only, since
','.join does not accept anything else.

something like this would do I think:
keywords["id"] = ",".join(map(str, keywds["id"]))

>
> Thanks,

Thank you

-- 
Daniël van Adrichem

Treparel Information Solutions b.v.
Delftechpark 26
2628XH Delft
The Netherlands