From wbsmith at gmail.com  Fri May 25 18:31:38 2007
From: wbsmith at gmail.com (W. Bryan Smith)
Date: Fri, 25 May 2007 15:31:38 -0700
Subject: [Biopython-announce] is this supposed to be really slow?
Message-ID:

hi there,

i just started using biopython today and was going through the example on
pages 31 and 32 of the tutorial: "Sending a query to Pubmed" and
"Retrieving a PubMed record", and i think i am confused about how i am
supposed to be doing something. as an additional bonus, i am new to
python, so i may just be making a stupid python mistake. anyway, what i am
basically trying to do is to get an array containing the year of
publication for every publication that matches some keyword. from the
tutorial, i am doing something like this:

#begin code snippet
from Bio import PubMed, Medline
import numpy
import string

searchTerm = 'mySearchTerm'
termIds = PubMed.search_for( searchTerm )
recParser = Medline.RecordParser()
medlineDict = PubMed.Dictionary( parser = recParser )

pubDates = numpy.zeros( len( termIds ), numpy.uint16 )
for idx in range( len( termIds ) ):
    pubDates[idx] = string.atoi( medlineDict[ termIds[idx] ].publication_date[ 0:4 ] )
#end code snippet

so this seems to be working, but it seems to be very slow. well, either
it's slow, or i don't understand the complexity of what it is doing. i
have timed this process, and it is taking about 7 seconds per record to
retrieve the date and drop it into my numpy array. is this because the
code is fetching something from the internet, and that is what is taking
so long? or is there some other explanation for why this is slow (i.e. my
terrible, non-pythonic code, what it is doing is actually very complex and
i just don't get it, etc.)?

any insight into this would be much appreciated.
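(the year-extraction step itself can be checked offline, independent of any
network delay. a minimal sketch; the date strings below are fabricated
examples, and the stdlib array module stands in for numpy so the snippet
has no dependencies:)

```python
# offline sketch of the year-extraction step from the snippet above;
# the date strings here are made-up examples, not real PubMed data.
from array import array

dates = ["2005 Jan 10", "2006 Mar 3", "2007 May 25"]

# 'H' holds unsigned 16-bit integers, analogous to numpy.uint16;
# int() on the first four characters replaces string.atoi()
pubYears = array('H', (int(d[:4]) for d in dates))

print(list(pubYears))  # -> [2005, 2006, 2007]
```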
thanks,
bryan

From titus at caltech.edu  Fri May 25 19:31:51 2007
From: titus at caltech.edu (Titus Brown)
Date: Fri, 25 May 2007 16:31:51 -0700
Subject: [Biopython-announce] is this supposed to be really slow?
In-Reply-To:
References:
Message-ID: <20070525233151.GA4507@caltech.edu>

-> so this seems to be working, but it seems to be very slow. well, either
-> it's slow, or i don't understand the complexity of what it is doing. i
-> have attempted to time this process, and it is taking about 7 seconds
-> per record to retrieve the date and drop it into my numpy array. is
-> this because this code is fetching something from the internet and that
-> is what is taking such a long time? or is there some other explanation
-> for why this is slow (i.e. my terrible, non-pythonic code writing, what
-> it is doing is actually very complex and i just don't get it, etc)?!?
->
-> any insight into this would be much appreciated.

Hi, Bryan,

I'm not too familiar with the underlying code, but I believe that
BioPython enforces a three second wait between record retrieval attempts
from NCBI. This is by request of NCBI; see

http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html

Since you're using the one-record-at-a-time retrieval interface, you have
a 3 second delay between retrievals.

I personally tend to just use the NCBI retrieval URLs directly, but
that's kind of ugly. There may be a higher volume retrieval system built
directly into BioPython, too.

cheers,
--titus

From wbsmith at gmail.com  Fri May 25 22:08:42 2007
From: wbsmith at gmail.com (W. Bryan Smith)
Date: Fri, 25 May 2007 19:08:42 -0700
Subject: [Biopython-announce] is this supposed to be really slow?
In-Reply-To: <20070525233151.GA4507@caltech.edu>
References: <20070525233151.GA4507@caltech.edu>
Message-ID:

On 5/25/07, Titus Brown wrote:
>
> Hi, Bryan,
>
> I'm not too familiar with the underlying code, but I believe that
> BioPython enforces a three second wait between record retrieval attempts
> from NCBI. This is by request of NCBI; see
>
> http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html

i did see this constraint of one request per 3 seconds, but did not
realize that each pass through my loop was a separate request. you're
probably correct that this is (at least partly) why my code is slow. i
guess i didn't really understand how this piece of code works... i thought
the text data were pulled into memory when i called the PubMed.Dictionary
function, so i assumed that was the one request per 3 seconds i had to
worry about. i'm sure traffic for these sorts of things can get pretty
high, but it does seem a bit ridiculous that retrieving 50 records takes a
minimum of 2.5 minutes. each record is probably only about 10 KB (in xml
format), so it seems a little ridiculous that i can only pull ~3 KB/s from
the ncbi servers. can anyone verify that this is the case? is there
anything to do about this constraint?

> I personally tend to just use the NCBI retrieval URLs directly, but
> that's kind of ugly.

you mean you just use the pubmed ids and then pull down the text of the
corresponding url to process separately? not sure i understand if that is
what you mean or not, but i don't really know how to parse and process
text in python. maybe this is a good opportunity to learn. :) all i really
want is a way to count publications per year for some keyword... at least
that is all i am trying to accomplish right now. seems like there should
be an easy and relatively fast way to do this.

> There may be a higher volume retrieval system
> built directly into BioPython, too.
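(for what it's worth, a sketch of the direct-URL approach Titus describes:
EFetch accepts a comma-separated id list, so one request can cover many
records, meaning one polite delay instead of one per record. the PMIDs and
the Medline-format sample below are placeholders, not real data, and the
actual network fetch is deliberately left out:)

```python
from collections import Counter
from urllib.parse import urlencode

def efetch_url(pmids, db="pubmed", rettype="medline", retmode="text"):
    """Build one EFetch URL for a whole batch of PubMed ids.

    EFetch takes a comma-separated id list, so a single request (and a
    single polite wait) can replace one request per record.
    """
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    query = urlencode({"db": db, "id": ",".join(pmids),
                       "rettype": rettype, "retmode": retmode})
    return base + "?" + query

def count_years(medline_text):
    """Count publications per year from Medline-format text.

    Each record carries a 'DP  - ' (date of publication) line whose
    first four characters are the year.
    """
    counts = Counter()
    for line in medline_text.splitlines():
        if line.startswith("DP  - "):
            counts[line[6:10]] += 1
    return counts

# placeholder ids, just to show the shape of the URL
print(efetch_url(["11748933", "11700088"]))

# fabricated Medline-format sample, standing in for the fetched response
sample = """PMID- 1
DP  - 2006 Mar 3
PMID- 2
DP  - 2007 May 25
PMID- 3
DP  - 2007 Jan 1
"""
print(count_years(sample))  # Counter({'2007': 2, '2006': 1})
```

(fetching that URL and feeding the response body through count_years would
give the per-year tallies in one round trip; a time.sleep(3) between
batches keeps within NCBI's request limits.)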
any experts out there care to weigh in on this?

thanks so much for the input,
bryan