[BioRuby] Bio::PubMed efetch xml support and other options
Toshiaki Katayama
ktym at hgc.jp
Tue Nov 20 15:38:32 UTC 2007
Hi Kaustubh,
I've just committed a change so that Bio::PubMed.efetch and esearch wait for 3 seconds between consecutive queries.
I also renamed efetch2 and esearch2 (the newer versions, which accept E-Utils options as a hash) to efetch and esearch, replacing the old versions. The new efetch method breaks backward compatibility with the old one, which accepted a list of ids as variable-length arguments.
>>>> 1. efetch("123") --> OK
>>>> 2. efetch("123", "456") --> NG
>>>> 3. efetch(["123", "456"]) --> OK
Here, the PubMed IDs can be given as strings or numbers (or an array of them).
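Internally, that normalization could be done with something like the following (normalize_ids is a hypothetical helper for illustration, not the actual BioRuby code):

```ruby
# Hypothetical helper: accepts a single String/Integer or an Array of them
# and returns an Array of id strings, matching the accepted forms above.
def normalize_ids(ids)
  Array(ids).map(&:to_s)
end
```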
By the way, the efetch method currently returns the following error:
% ruby lib/bio/io/pubmed.rb
:
--- Retrieve PubMed entry by E-Utils ---
Wed Nov 21 00:23:20 +0900 2007
1: id: 16381885 Error occurred: PubMed article server is not avaliable
Wed Nov 21 00:23:23 +0900 2007
1: id: 16381885 Error occurred: PubMed article server is not avaliable
Is this a temporary problem?
I believe efetch2 was working when I implemented it.
Regards,
Toshiaki Katayama
On 2007/11/20, at 23:27, Kaustubh Patil wrote:
> Hi Toshiaki,
>
> Thanks for your email. Please find my answers embedded below;
>
> Thanks,
> kaustubh
>
> Toshiaki Katayama wrote:
>
>> Hi Kaustubh,
>>
>> On 2007/11/15, at 18:53, Kaustubh Patil wrote:
>>
>>
>>> Hi Toshiaki,
>>>
>>> Thank you very much for the improvements. There are some other desirable improvements;
>>>
>>> 1. PubMed has some timing restrictions on two consecutive queries. So it would be very nice if this could be implemented inside the functions themselves, like esearch/efetch.
>>>
>>
>> How about having the following method and calling it within the efetch and esearch methods, before Bio::Command.post_form?
>>
>> --------------------------------------------------
>> # Make no more than one request every 3 seconds.
>> @@ncbi_interval = 3
>> @@last_accessed = nil
>>
>> def wait_access
>>   if @@last_accessed
>>     duration = Time.now - @@last_accessed
>>     if duration < @@ncbi_interval
>>       sleep(@@ncbi_interval - duration)
>>     end
>>   end
>>   # remember the time of this access for the next call
>>   @@last_accessed = Time.now
>> end
>> --------------------------------------------------
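[A standalone version of this throttle can be exercised as below; the Throttle class name is illustrative, not BioRuby internals, and the interval is configurable so a demonstration need not wait the full 3 seconds.]

```ruby
# Standalone throttle sketch mirroring the logic above, with a configurable
# interval (3 seconds in the real NCBI case).
class Throttle
  def initialize(interval)
    @interval = interval
    @last_accessed = nil
  end

  # Block until at least @interval seconds have passed since the last call.
  def wait
    if @last_accessed
      duration = Time.now - @last_accessed
      sleep(@interval - duration) if duration < @interval
    end
    @last_accessed = Time.now
  end
end
```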
>>
> This could be a very good and quick implementation. In fact, I currently use something similar myself.
>
>> By the way, NCBI also have another restriction:
>>
>> http://www.ncbi.nlm.nih.gov/entrez/utils/utils_index.html
>>
>>> Run retrieval scripts on weekends or between 9 pm and 5 am Eastern Time weekdays for any series of more than 100 requests.
>>>
>>> Do you think this should also be taken care of automatically?
>>>
>>>
> I am aware of those restrictions. It would be very nice if this could be taken care of automatically. There is a very good library for accessing/using Medline through R, called MedlineR (by the way, it is currently not downloadable, as their server is down). MedlineR handles this automatically.
>
> There is another improvement I am thinking about. It is not possible to fetch a large number of documents in one go. I suppose this is mainly because of practical restrictions on URL length, e.g. IE supports a maximum of 2,048 characters (although I am not aware of PubMed imposing any limits of its own). It would be useful (under some conditions) to split the fetch into a number of parts and then return the combined result. What do you think?
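[The splitting idea could be sketched like this; fetch_in_chunks is a hypothetical helper, and the block stands in for one efetch round-trip per batch.]

```ruby
# Sketch: split a long id list into fixed-size batches, yield each batch to
# the caller (one efetch request per batch), and concatenate the results.
def fetch_in_chunks(ids, chunk_size = 100)
  ids.each_slice(chunk_size).flat_map { |batch| yield(batch) }
end
```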
>
>>> 2. Mapping terms to MeSH (I couldn't find this!).
>>>
>>
>>
>> I'm not sure how to accomplish this.
>>
> I will do a bit more research on this and then get back to you.
>
>>
>>
>>> I will post other comments as I recollect them. I have another question (though this is perhaps not the most appropriate place for it):
>>>
>>> Is there any Ruby library which can do some basic text mining tasks, like tokenization, sentence boundary detection, etc.?
>>>
>>
>> I think so, but I'm not doing any text mining at the moment, sorry ;-)
>>
> I haven't found a Ruby library for that yet. I will keep searching.
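[For very simple needs, a naive regex-based sketch can go some way; this is purely an illustration, not a library recommendation, and real text mining needs proper handling of abbreviations, Unicode, etc.]

```ruby
# Naive, illustration-only tokenizer: pull out runs of word characters.
def tokenize(text)
  text.scan(/[A-Za-z0-9']+/)
end

# Naive sentence splitter: break after ., ! or ? followed by whitespace.
def split_sentences(text)
  text.split(/(?<=[.!?])\s+/)
end
```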
>
> Cheers,
> Kaustubh
>
>> Thanks,
>> Toshiaki
>>
>>
>>
>>> Cheers,
>>> Kaustubh
>>>
>>> Toshiaki Katayama wrote:
>>>
>>>
>>>> Hi Patil,
>>>>
>>>> On 2007/11/12, at 20:09, Kaustubh Patil wrote:
>>>>
>>>>> XML is very nice for searching etc. PubMed documents can be fetched in various formats, including XML. I have changed the efetch method in the Bio::PubMed class in order to implement this. Here is the modified method:
>>>>>
>>>> An enhancement to accept retmode=xml sounds like a good idea, so I have just committed efetch2 and esearch2 methods, which can be better replacements for the efetch and esearch methods.
>>>>
>>>> Both methods are able to accept any E-Utils options as a hash.
>>>>
>>>> I will remove the suffix "2" from these methods if the following incompatibility is acceptable.
>>>>
>>>> * changing efetch(*ids) to efetch(ids, hash = {}) breaks compatibility
>>>> currently all of
>>>> 1. efetch("123")
>>>> 2. efetch("123", "456")
>>>> 3. efetch(["123", "456"])
>>>> are accepted but 2. will be unavailable.
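[One way all three calling forms could have been kept while still allowing an options hash is sketched below; efetch_compat is illustrative only, not the committed BioRuby code.]

```ruby
# Illustrative only: accept efetch("123"), efetch("123", "456"),
# efetch(["123", "456"]), each with an optional trailing options hash.
def efetch_compat(*args)
  hash = args.last.is_a?(Hash) ? args.pop : {}
  ids  = args.flatten.map(&:to_s)
  [ids, hash] # the real method would pass these on to E-Utils
end
```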
>>>>
>>>> Other notes:
>>>>
>>>> * the default value for the retmode option remains "text" for backward compatibility
>>>> * both methods are rewritten to use Bio::Command.post_form to make the code clear
>>>> * Bio::FlatFile is updated to accept recent MEDLINE entry format (UI -> PMID)
>>>>
>>>>
>>>> puts Bio::PubMed.esearch2("(genome AND analysis) OR bioinformatics")
>>>> puts Bio::PubMed.esearch2("(genome AND analysis) OR bioinformatics", {"retmax" => "500"})
>>>> puts Bio::PubMed.esearch2("(genome AND analysis) OR bioinformatics", {"rettype" => "count"})
>>>>
>>>> puts Bio::PubMed.efetch2("10592173")
>>>> puts Bio::PubMed.efetch2(["10592173", "14693808"], {"retmode" => "xml"})
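[Once retmode=xml output is in hand, it can be parsed with REXML from Ruby's standard library; the XML below is a minimal hand-made fragment for illustration, not real E-Utils output.]

```ruby
require 'rexml/document'

# Sketch: extract PMIDs from an efetch retmode=xml style result.
xml = <<XML
<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>10592173</PMID></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID>14693808</PMID></MedlineCitation></PubmedArticle>
</PubmedArticleSet>
XML

doc = REXML::Document.new(xml)
pmids = REXML::XPath.match(doc, "//PMID").map(&:text)
```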
>>>>
>>>>
>>>> Thanks,
>>>> Toshiaki Katayama
>>>>
>>>> _______________________________________________
>>>> BioRuby mailing list
>>>> BioRuby at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioruby
>>>>
>>
>>