From markbudde at gmail.com Wed Dec 5 19:53:01 2012
From: markbudde at gmail.com (Mark Budde)
Date: Wed, 5 Dec 2012 16:53:01 -0800
Subject: [Biopython] get more than 20 IDs in IdList
Message-ID:

Can someone help me understand how to get more than 20 records in the
IdList from esearch?

>>> from Bio import Entrez
>>> Entrez.email = "markbudde at gmail.com"
>>> handle = Entrez.esearch(db="protein",term="txid9606[Orgn] AND cftr")
>>> record = Entrez.read(handle)
>>> len(record["IdList"])
20
>>> record["Count"]
'354'

It seems that I can only ever get 20 IDs in the "IdList", even though the
"Count" can be much higher.
Thanks,
Mark

From axfelix at gmail.com Wed Dec 5 19:57:33 2012
From: axfelix at gmail.com (Alex Garnett)
Date: Wed, 5 Dec 2012 16:57:33 -0800
Subject: [Biopython] get more than 20 IDs in IdList
In-Reply-To:
References:
Message-ID:

This is a fairly naive solution based on just having joined and harassed
this list with a similar issue a week ago, but is there a reason you can't
use efetch()?

-alex

From markbudde at gmail.com Wed Dec 5 20:09:07 2012
From: markbudde at gmail.com (Mark Budde)
Date: Wed, 5 Dec 2012 17:09:07 -0800
Subject: [Biopython] get more than 20 IDs in IdList
In-Reply-To:
References:
Message-ID:

Thanks. My understanding was that efetch was for getting the records after
you know the IDs. That is what I am trying to do, but I can only get the
first 20 IDs to run through efetch. Is there a way to search through efetch?
-Mark

From markbudde at gmail.com Wed Dec 5 21:17:15 2012
From: markbudde at gmail.com (Mark Budde)
Date: Wed, 5 Dec 2012 18:17:15 -0800
Subject: [Biopython] get more than 20 IDs in IdList
In-Reply-To: <4089E344A49CB1498D43F563331B1CF82CF1A0ED@SINPRD0310MB355.apcprd03.prod.outlook.com>
References: <4089E344A49CB1498D43F563331B1CF82CF1A0ED@SINPRD0310MB355.apcprd03.prod.outlook.com>
Message-ID: <7635510330127239492@unknownmsgid>

Perfect, thanks.
-Mark

Sent from my phone

From mjldehoon at yahoo.com Wed Dec 5 21:15:57 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 5 Dec 2012 18:15:57 -0800 (PST)
Subject: [Biopython] get more than 20 IDs in IdList
In-Reply-To:
Message-ID: <1354760157.15566.YahooMailClassic@web164002.mail.gq1.yahoo.com>

See http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch under retmax.
Also see section 8.15 "Using the history and WebEnv" in the Biopython
documentation.

Best,
-Michiel.
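
A minimal sketch of the retmax suggestion, for readers who want every ID rather than a fixed number. It assumes that paging with the E-utilities `retstart` parameter is acceptable for the query; `batch_starts` and `fetch_all_ids` are hypothetical helper names, and the email address and batch size are placeholders, not values from this thread.

```python
def batch_starts(count, batch_size):
    """Return the retstart offsets needed to cover `count` hits."""
    return list(range(0, count, batch_size))


def fetch_all_ids(term, db="protein", batch_size=200, email="you@example.org"):
    """Collect the full IdList for a search term, one page at a time.

    Hypothetical helper: the names and defaults are illustrative only.
    """
    # Imported here so batch_starts() can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # placeholder address, replace with your own

    # First ask only for the hit count (retmax=0 returns no IDs).
    handle = Entrez.esearch(db=db, term=term, retmax=0)
    count = int(Entrez.read(handle)["Count"])  # "Count" comes back as a string
    handle.close()

    ids = []
    for start in batch_starts(count, batch_size):
        handle = Entrez.esearch(db=db, term=term,
                                retstart=start, retmax=batch_size)
        ids.extend(Entrez.read(handle)["IdList"])
        handle.close()
    return ids
```

With the count of 354 reported above and a batch size of 200, this would issue two searches, starting at offsets 0 and 200. For very large result sets the history/WebEnv route mentioned by Michiel is probably the better fit.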

From winda002 at student.otago.ac.nz Wed Dec 5 21:07:30 2012
From: winda002 at student.otago.ac.nz (David Winter)
Date: Thu, 6 Dec 2012 02:07:30 +0000
Subject: [Biopython] get more than 20 IDs in IdList
In-Reply-To:
References:
Message-ID: <4089E344A49CB1498D43F563331B1CF82CF1A0ED@SINPRD0310MB355.apcprd03.prod.outlook.com>

Hi Mark,

This is the default behaviour for the EUtils API
http://www.ncbi.nlm.nih.gov/books/NBK25499/

To get more than 20 records you need to set the "retmax" parameter to some
other number:

>>> handle = Entrez.esearch(db="protein",term="txid9606[Orgn] AND cftr", retmax=200)

Hope that helps,
David Winter

From fpiston at gmail.com Thu Dec 6 06:09:53 2012
From: fpiston at gmail.com (Fernando)
Date: Thu, 06 Dec 2012 12:09:53 +0100
Subject: [Biopython] blast to go annotation
Message-ID: <877govwi66.fsf@gmail.com>

Hello everybody,

I am a beginner in Python programming and I do not know if I did this well.
I wrote a script to do the following tasks:
- BLAST my sequences against uniprot_sprot (UniProtKB/Swiss-Prot)
- Take the Swiss-Prot accession of the best match
- Take the GO terms associated with that Swiss-Prot accession
- Write a file with my sequence ID, the best-match Swiss-Prot accession and
  the associated GO terms. I am making this file to use with topGO in
  Bioconductor.

I have some questions:
- The 'NCBIXML.parse' step has a problem. The function does not take the
  first accession of the .xml file. I need to insert a fake FASTA sequence
  at the beginning of the multi-FASTA file to get the BLAST results for all
  of my sequences.
- In general, is the script correct? And can I improve it?

Here is the code:

from Bio.Blast.Applications import NcbiblastxCommandline
blastx_cline = NcbiblastxCommandline(query='/home/fpiston/Desktop/test/test2.fasta',
                                     db='uniprot_sprot',
                                     out='/home/fpiston/Desktop/test/test.xml',
                                     evalue='0.001', outfmt='5',
                                     best_hit_overhang='0.1',
                                     best_hit_score_edge='0.05',
                                     max_target_seqs='1')
stdout, stderr = blastx_cline()
result_handle = open("/home/fpiston/Desktop/test/test.xml")
from Bio.Blast import NCBIXML
from Bio import SeqIO
import re
from Bio import SwissProt
q_dict = SeqIO.to_dict(SeqIO.parse(open("/home/fpiston/Desktop/test/test2.fasta"), "fasta"))
blast_records = NCBIXML.parse(result_handle)
save_file = open("/home/fpiston/Desktop/test/test.out", 'w')
blast_record = blast_records.next()
hits = []
for blast_record in blast_records:
    if blast_record.alignments:
        list = (blast_record.query).split()
        if re.match('ENA|\w*|\w*', list[0]) != None:
            list2 = list[0].split("|")
            save_file.write('\n%s\t' % list2[1])
        else:
            save_file.write('\n%s\t' % list[0])
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                list = (alignment.hit_def).split()
                list2 = list[0].split("|")
                save_file.write('%s\t' % list2[2])
                for record in SwissProt.parse(open('/home/db/uniprot_sprot.dat')):
                    if record.entry_name in list2[2]:
                        for cross_reference in record.cross_references:
                            for item in cross_reference:
                                if 'GO:' in item:
                                    save_file.write('%s\t' % item)
        hits.append(blast_record.query.split()[0])
misses = set(q_dict.keys()) - set(hits)
for item in misses:
    save_file.write('\n%s\t' % item)
    save_file.write('%s' % 'no_match')
save_file.close()

Fernando
--

From p.j.a.cock at googlemail.com Thu Dec 6 06:17:54 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Dec 2012 11:17:54 +0000
Subject: [Biopython] blast to go annotation
In-Reply-To: <877govwi66.fsf@gmail.com>
References: <877govwi66.fsf@gmail.com>
Message-ID:

On Thu, Dec 6, 2012 at 11:09 AM, Fernando wrote:
> I have some questions:
> - The 'NCBIXML.parse' step has a problem. The function does not take the
> first accession of the .xml file. I need to insert a fake FASTA sequence
> at the beginning of the multi-FASTA file to get the BLAST results for all
> of my sequences.

Do you mean it is ignoring the first (1st) set of results in the XML file?
That is because you skipped the first BLAST results - try removing this
line before your for loop:

blast_record = blast_records.next()

> - In general, is the script correct? And can I improve it?

It would be worth reading the Blast2GO paper for some of the technical
issues and how to weight evidence in assigning GO terms based on
BLAST matches. Note Blast2GO has a command line variant called
"Blast2GO for pipelines" (b2g4pipe).
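
A small illustration of why the first result set went missing, using a stand-in generator instead of real BLAST output (`fake_blast_records` and its query names are made up for this sketch). `NCBIXML.parse` returns an iterator, and anything taken from an iterator before the loop is no longer there when the loop starts:

```python
def fake_blast_records():
    """Stand-in for NCBIXML.parse(): yields one item per query."""
    for name in ["query1", "query2", "query3"]:
        yield name


records = fake_blast_records()
first = next(records)        # same effect as blast_records.next() in Python 2
remaining = list(records)    # a for loop would only see these

print(first)       # the record consumed before the loop
print(remaining)   # what is left for the loop
```

Dropping the `next()` call lets the loop start from the first record, so no fake leading sequence is needed.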

Peter

From fpiston at gmail.com Thu Dec 6 06:59:33 2012
From: fpiston at gmail.com (Fernando)
Date: Thu, 06 Dec 2012 12:59:33 +0100
Subject: [Biopython] blast to go annotation
In-Reply-To: (Peter Cock's message of "Thu, 6 Dec 2012 11:17:54 +0000")
References: <877govwi66.fsf@gmail.com>
Message-ID: <87wqwvv1ay.fsf@gmail.com>

Peter Cock writes:

> Do you mean it is ignoring the first (1st) set of results in the XML file?
> That is because you skipped the first BLAST results - try removing this
> line before your for loop:
>
> blast_record = blast_records.next()

Yes, I do. The script ignored the first set of results in the XML file.
I removed the line

blast_record = blast_records.next()

and it works OK.

> It would be worth reading the Blast2GO paper for some of the technical
> issues and how to weight evidence in assigning GO terms based on
> BLAST matches. Note Blast2GO has a command line variant called
> "Blast2GO for pipelines" (b2g4pipe).
>
> Peter

I know Blast2GO. In fact, I started the GO annotation with that software,
but I had many problems because it is very slow and crashes often. These
problems make annotating many sequences with Blast2GO impossible.
Furthermore, I also tried to use b2g4pipe on a cluster, but the
administrator told me it also gives many problems. The administrator told
me that b2g4pipe has not been updated since it first appeared and that it
also requires an Internet connection. I think the free version of Blast2GO
has not been improved since they released the paid version.

For this reason I decided to do it with my own code.

Thanks very much
Fernando
--

From p.j.a.cock at googlemail.com Thu Dec 6 13:54:10 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Dec 2012 18:54:10 +0000
Subject: [Biopython] blast to go annotation
In-Reply-To: <87wqwvv1ay.fsf@gmail.com>
References: <877govwi66.fsf@gmail.com> <87wqwvv1ay.fsf@gmail.com>
Message-ID:

On Thu, Dec 6, 2012 at 11:59 AM, Fernando wrote:
> Yes, I do. The script ignored the first set of results in the XML file.
> I removed the line
> blast_record = blast_records.next()
>
> and it works OK.

Good.

> I know Blast2GO. In fact, I started the GO annotation with that software,
> but I had many problems because it is very slow and crashes often. These
> problems make annotating many sequences with Blast2GO impossible.
> Furthermore, I also tried to use b2g4pipe on a cluster, but the
> administrator told me it also gives many problems. The administrator told
> me that b2g4pipe has not been updated since it first appeared and that it
> also requires an Internet connection. I think the free version of Blast2GO
> has not been improved since they released the paid version.
>
> For this reason I decided to do it with my own code.

The future of Blast2GO with their paid version is troubling - they
never gave a clear answer on how the (money) free version is
licensed either.

However, b2g4pipe can be used without going online if you have
set up a local Blast2GO database - which is *much* faster than
connecting to their public database in Spain.

We're doing this locally as part of running b2g4pipe within Galaxy.
This includes a workaround for Blast2GO being unable to cope
with modern BLAST XML files (explained on their website with
links to other conversion scripts). You can find my code
in the Galaxy Tool Shed http://toolshed.g2.bx.psu.edu or
https://bitbucket.org/peterjc/galaxy-central/src/tools/tools/ncbi_blast_plus

Regards,

Peter

From fpiston at gmail.com Thu Dec 6 17:17:55 2012
From: fpiston at gmail.com (Fernando)
Date: Thu, 06 Dec 2012 23:17:55 +0100
Subject: [Biopython] blast to go annotation
In-Reply-To: (Peter Cock's message of "Thu, 6 Dec 2012 18:54:10 +0000")
References: <877govwi66.fsf@gmail.com> <87wqwvv1ay.fsf@gmail.com>
Message-ID: <87obi6kep8.fsf@gmail.com>

Thanks, I am going to try it.

Fernando
--

From melissacurran530 at gmail.com Sat Dec 8 15:39:57 2012
From: melissacurran530 at gmail.com (Melissa Curran)
Date: Sat, 8 Dec 2012 15:39:57 -0500
Subject: [Biopython] Printing non-ASCII characters
Message-ID:

Hello,

I'm trying to print out the titles of articles, but often get an error
about it encountering non-ASCII characters. What do I need to do in order
to be able to print the article titles? Below is a snippet of code from my
program, which is where I'm getting the error.

handle = Entrez.efetch(db="pubmed", id=record, retmode="xml")
articles = Entrez.parse(handle)
for article in articles:
    print article['MedlineCitation']['Article']['ArticleTitle']

Thanks much,

Melissa

From w.arindrarto at gmail.com Sat Dec 8 15:53:22 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Sat, 8 Dec 2012 21:53:22 +0100
Subject: [Biopython] Printing non-ASCII characters
In-Reply-To:
References:
Message-ID:

Hi Melissa,

Does the error you're getting look like this:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u015f' in
position 2: ordinal not in range(128)

If so, it's because there are non-ASCII characters in the text you're
trying to print. To try to fix that, change the last line to this:

print article['MedlineCitation']['Article']['ArticleTitle'].encode('utf-8')

This forces Python to encode the characters using UTF-8, which can
handle non-ASCII characters. Note that for the characters to be shown
properly (e.g. a with umlaut), your terminal must also support UTF-8.
Otherwise, you'll simply get a weird replacement character for all the
non-ASCII characters.

Hope this helps :),
Bow

From phyrexian.kavu at gmail.com Sat Dec 8 22:53:47 2012
From: phyrexian.kavu at gmail.com (Miguel Romero)
Date: Sat, 8 Dec 2012 21:53:47 -0600
Subject: [Biopython] qblast against RefSeq
Message-ID:

Hi!
I'm learning how to run BLAST over the internet and I'm interested in
blasting against the RefSeq database. For now I have managed to blast
against the nr database with:

result_handle = NCBIWWW.qblast("blastp", "nr", sequence_string)

but I don't know how to search against RefSeq. If I enter:

result_handle = NCBIWWW.qblast("blastp", "refseq", sequence_string)

the blast record is empty! Is it possible to blast against the RefSeq
database? What should I type in the second argument?

Thank you!

Miguel Romero

--
[Theropoda is my profession]

From w.arindrarto at gmail.com Sat Dec 8 23:59:07 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Sun, 9 Dec 2012 05:59:07 +0100
Subject: [Biopython] qblast against RefSeq
In-Reply-To:
References:
Message-ID:

Hi Miguel,

If I recall correctly, you need to specify which 'refseq' database
you're using (there are 'refseq_rna' and 'refseq_protein'). Try
running the search again but using 'refseq_protein' as the database
name.

Hope that helps :),
Bow

From winda002 at student.otago.ac.nz Sun Dec 9 00:06:48 2012
From: winda002 at student.otago.ac.nz (David Winter)
Date: Sun, 9 Dec 2012 05:06:48 +0000
Subject: [Biopython] qblast against RefSeq
In-Reply-To:
References:
Message-ID: <4089E344A49CB1498D43F563331B1CF82CF1C542@SINPRD0310MB355.apcprd03.prod.outlook.com>

Hi Miguel,

I think you want 'refseq_protein' (check out the drop-down on the web blast
page, which shows you the available databases). The XML file returned by
qblast will probably have a useful error message too.

Cheers,
David

From fpiston at gmail.com Mon Dec 10 15:07:44 2012
From: fpiston at gmail.com (Fernando)
Date: Mon, 10 Dec 2012 21:07:44 +0100
Subject: [Biopython] problem to find a accesion number in tab delimited file
Message-ID: <8738zdbrhr.fsf@gmail.com>

Hello everybody,

I'm trying to perform a GO annotation using the SIMAP database, which is
Blast2GO annotated. Everything is fine, but I have problems when I try
to find the accession number in the file where entry numbers are
associated with their GO terms. The problem is that the script does not
find the number in the input file even though it is really there. I tried
several things without good results (re.match, inserting into a list and
then extracting the element, etc.).
The file where the GO terms are associated with entry numbers has this
structure (accession number, GO term, Blast2GO score):
1f0ba1d119f52ff28e907d2b5ea450db        GO:0007154      79
1f0ba1d119f52ff28e907d2b5ea450db        GO:0005605      99

The Python code:
#!/usr/bin/env python
import re
from Bio.Blast import NCBIXML
from Bio import SeqIO

input_file = open('/home/fpiston/Desktop/test_go/test2.fasta', 'rU')
result_handle = open('/home/fpiston/Desktop/test_go/test2.xml', 'rU')
save_file = open('/home/fpiston/Desktop/test_go/test2.out', 'w')

fh = open('/home/fpiston/Desktop/test_go/Os_Bd_Ta_blat2go_fake', 'rU')
q_dict = SeqIO.to_dict(SeqIO.parse(input_file, "fasta"))
blast_records = NCBIXML.parse(result_handle)

hits = []

for blast_record in blast_records:
    if blast_record.alignments:
        list = (blast_record.query).split()
        if re.match('ENA|\w*|\w*', list[0]) != None:
            list2 = list[0].split("|")
            save_file.write('%s\t' % list2[1])
        else:
            save_file.write('%s\t' % list[0])
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                h = alignment.hit_def    #at this point all right
                for l in fh:             #here, 'l' is not found in 'fh'
                    ls = l.split()
                    if h in ls:
                        print h
                        print 'ok'
                        save_file.write('%s\t' % ls[1])
                save_file.write('\n')
        hits.append(blast_record.query.split()[0])
misses =set(q_dict.keys()) - set(hits)

for i in misses:
    list = i.split("|")
    if len(list) > 1:
        save_file.write('%s\t' % list[1])
    else:
        save_file.write('%s\t' % list)
    save_file.write('%s\n' % 'no_match')

save_file.close()

Fernando
--

From nicolas.joannin at gmail.com Mon Dec 10 20:15:48 2012
From: nicolas.joannin at gmail.com (Nicolas Joannin)
Date: Tue, 11 Dec 2012 10:15:48 +0900
Subject: [Biopython] Problem using Entrez with Python 3.2
Message-ID:

Hello,

I'm having problems when trying to use Bio.Entrez with Python 3.2.
I get the following error message:

>>> from Bio import Entrez
>>> Entrez.email='my at email.address'
>>> handle=Entrez.esearch(db='nuccore',term='36329')
>>> record=Entrez.read(handle)
Traceback (most recent call last):
  File "", line 1, in
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/Bio/Entrez/__init__.py", line 351, in read
    record = handler.read(handle)
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/Bio/Entrez/Parser.py", line 169, in read
    self.parser.ParseFile(handle)
TypeError: read() did not return a bytes object (type=str)

Any comment, suggestion or help would be greatly appreciated!

Best regards,
Nicolas

From arklenna at gmail.com Mon Dec 10 22:14:38 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Mon, 10 Dec 2012 22:14:38 -0500
Subject: [Biopython] problem to find a accesion number in tab delimited file
In-Reply-To: <8738zdbrhr.fsf@gmail.com>
References: <8738zdbrhr.fsf@gmail.com>
Message-ID:

Hi Fernando,

A filehandle only reads through a file once, so the second time through
the loop, `fh` is, as you found, empty.

I would suggest reading the entirety of that file into a list-like
object, which will be persistent.
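
A minimal sketch of that behaviour, assuming a small in-memory table in place of the real annotation file. The first two rows reuse the example lines from the question; the third row, the `made_up_accession` key and both function names are invented for illustration.

```python
import io

# Two rows from the question plus one invented row for a second accession.
GO_TABLE = (
    "1f0ba1d119f52ff28e907d2b5ea450db\tGO:0007154\t79\n"
    "1f0ba1d119f52ff28e907d2b5ea450db\tGO:0005605\t99\n"
    "made_up_accession\tGO:0003677\t88\n"
)


def lookup_with_handle(hits):
    """Re-scan the open handle for every hit: only the first scan sees data."""
    fh = io.StringIO(GO_TABLE)          # behaves like an open text file
    found = []
    for h in hits:
        for line in fh:                 # exhausted after the first hit
            fields = line.split()
            if h == fields[0]:
                found.append((h, fields[1]))
    return found


def lookup_with_list(hits):
    """Read the rows once into a list, which can be scanned repeatedly."""
    fh = io.StringIO(GO_TABLE)
    rows = [line.split() for line in fh]
    found = []
    for h in hits:
        for fields in rows:
            if h == fields[0]:
                found.append((h, fields[1]))
    return found


hits = ["1f0ba1d119f52ff28e907d2b5ea450db", "made_up_accession"]
print(lookup_with_handle(hits))   # second accession is silently missed
print(lookup_with_list(hits))     # both accessions are found
```

Calling `fh.seek(0)` before each inner loop would also work with a real file, at the cost of re-reading the file for every hit.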
You might also consider using the accession number as a dictionary key and appending the GO numbers to a list value. Cheers, Lenna On Mon, Dec 10, 2012 at 3:07 PM, Fernando wrote: > Hello everybody, > > I'm trying to perform a GOs annotation using the SIMAP database which is > Blast2GO annotated. Everything is fine, but I have problems when I try > to find the accession number in the file where entry numbers are > associated with their GOs. The problem is that the script does not find > the number in the input file when really there is. I tried several things > without good results (re.match, insert in a list and then extract the > element, etc) > File where the GOs are associated with entry numbers has this structure > (accession number, GO term, blats2go score): > 1f0ba1d119f52ff28e907d2b5ea450db GO:0007154 79 > 1f0ba1d119f52ff28e907d2b5ea450db GO:0005605 99 > > The python code: > #!/usr/bin/env python > import re > from Bio.Blast import NCBIXML > from Bio import SeqIO > > input_file = open('/home/fpiston/Desktop/test_go/test2.fasta', 'rU') > result_handle = open('/home/fpiston/Desktop/test_go/test2.xml', 'rU') > save_file = open('/home/fpiston/Desktop/test_go/test2.out', 'w') > > fh = open('/home/fpiston/Desktop/test_go/Os_Bd_Ta_blat2go_fake', 'rU') > q_dict = SeqIO.to_dict(SeqIO.parse(input_file, "fasta")) > blast_records = NCBIXML.parse(result_handle) > > hits = [] > > for blast_record in blast_records: > if blast_record.alignments: > list = (blast_record.query).split() > if re.match('ENA|\w*|\w*', list[0]) != None: > list2 = list[0].split("|") > save_file.write('%s\t' % list2[1]) > else: > save_file.write('%s\t' % list[0]) > for alignment in blast_record.alignments: > for hsp in alignment.hsps: > h = alignment.hit_def #at this point all right > for l in fh: #here, 'l' in not found in 'fh' > ls = l.split() > if h in ls: > print h > print 'ok' > save_file.write('%s\t' % ls[1]) > save_file.write('\n') > hits.append(blast_record.query.split()[0]) > misses 
=set(q_dict.keys()) - set(hits) > > for i in misses: > list = i.split("|") > if len(list) > 1: > save_file.write('%s\t' % list[1]) > else: > save_file.write('%s\t' % list) > save_file.write('%s\n' % 'no_match') > > save_file.close() > > > > Fernando > -- > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From fpiston at gmail.com Tue Dec 11 02:06:30 2012 From: fpiston at gmail.com (Fernando) Date: Tue, 11 Dec 2012 08:06:30 +0100 Subject: [Biopython] problem to find a accesion number in tab delimited file In-Reply-To: (Lenna Peterson's message of "Mon, 10 Dec 2012 22:14:38 -0500") References: <8738zdbrhr.fsf@gmail.com> Message-ID: <87vcc99ifd.fsf@gmail.com> Hi, I'm sorry, the code has an error. Really the problem appear in the following lines: ? ? ? ? ? ? ? ? ? ? if h in ls: ? ? ? ? ? ? ? ? ? ? ? ? print h ? ? ? ? ? ? ? ? ? ? ? ? print 'ok' I can not to find the hit ('h') in the file 'fh' by lines. I am going to try with a dictionary as indicated by Lenna. Thanks for your answers. Lenna Peterson writes: > Hi Fernando,? > > > A filehandle only reads through a file once, so the second time > through the loop, `fh` is, as you found, empty.? > > I would suggest reading the entirety of that file into a list-like > object, which will be persistent. You might also consider using the > accession number as a dictionary key and appending the GO numbers to a > list value.? > > Cheers,? > > Lenna > > > On Mon, Dec 10, 2012 at 3:07 PM, Fernando wrote: > > Hello everybody, > > I'm trying to perform a GOs annotation using the SIMAP database > which is > Blast2GO annotated. Everything is fine, but I have problems when I > try > to find the accession number in the file where entry numbers are > associated with their GOs. The problem is that the script does not > find > the number in the input file when really there is. 
> I tried several things without good results (re.match, insert in a list
> and then extract the element, etc)
> File where the GOs are associated with entry numbers has this
> structure (accession number, GO term, blast2go score):
> 1f0ba1d119f52ff28e907d2b5ea450db    GO:0007154    79
> 1f0ba1d119f52ff28e907d2b5ea450db    GO:0005605    99
>
> The python code:
> #!/usr/bin/env python
> import re
> from Bio.Blast import NCBIXML
> from Bio import SeqIO
>
> input_file = open('/home/fpiston/Desktop/test_go/test2.fasta', 'rU')
> result_handle = open('/home/fpiston/Desktop/test_go/test2.xml', 'rU')
> save_file = open('/home/fpiston/Desktop/test_go/test2.out', 'w')
>
> fh = open('/home/fpiston/Desktop/test_go/Os_Bd_Ta_blat2go_fake', 'rU')
> q_dict = SeqIO.to_dict(SeqIO.parse(input_file, "fasta"))
> blast_records = NCBIXML.parse(result_handle)
>
> hits = []
>
> for blast_record in blast_records:
>     if blast_record.alignments:
>         list = (blast_record.query).split()
>         if re.match('ENA|\w*|\w*', list[0]) != None:
>             list2 = list[0].split("|")
>             save_file.write('%s\t' % list2[1])
>         else:
>             save_file.write('%s\t' % list[0])
>         for alignment in blast_record.alignments:
>             for hsp in alignment.hsps:
>                 h = alignment.hit_def    #at this point all right
>                 for l in fh:             #here, 'l' is not found in 'fh'
>                     ls = l.split()
>                     if h in ls:
>                         print h
>                         print 'ok'
>                         save_file.write('%s\t' % ls[1])
>                 save_file.write('\n')
>         hits.append(blast_record.query.split()[0])
> misses =set(q_dict.keys()) - set(hits)
>
> for i in misses:
>     list = i.split("|")
>     if len(list) > 1:
>         save_file.write('%s\t' % list[1])
>     else:
>         save_file.write('%s\t' % list)
>     save_file.write('%s\n' % 'no_match')
>
> save_file.close()
>
>
>
> Fernando
> --
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>

Fernando
--

From w.arindrarto at gmail.com Sat Dec 15 00:58:44 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Sat, 15 Dec 2012 06:58:44 +0100
Subject: [Biopython] Problem using Entrez with Python 3.2
In-Reply-To:
References:
Message-ID:

Hi Nicolas,

AFAIK, the Entrez XML parser requires bytes instead of strings. The
distinction matters in Python 3, so what you need to do is to convert
your string stream into a byte stream. To do so, you need to add a new
import:

from io import BytesIO

and replace the last line with these:

bytehandle = BytesIO(bytes(handle.read(), 'utf-8')) # utf-8 or any other encoding you want
record = Entrez.read(bytehandle)

We're basically reading the string, converting it to bytes, and creating a
buffered bytes I/O that Entrez.read expects.

Alternatively, you can also read the handle into a file first, and open
it in binary mode:

with open('outfile.xml', 'w') as outfile:
    outfile.write(handle.read())

with open('outfile.xml', 'rb') as sourcefile:
    record = Entrez.read(sourcefile)

Hope that helps :),
Bow

On Tue, Dec 11, 2012 at 2:15 AM, Nicolas Joannin wrote:
> Hello,
>
> I'm having problems when trying to use Bio.Entrez with Python 3.2.
> I get the following error message:
>
>>>> from Bio import Entrez
>>>> Entrez.email='my at email.address'
>>>> handle=Entrez.esearch(db='nuccore',term='36329')
>>>> record=Entrez.read(handle)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/Bio/Entrez/__init__.py", line 351, in read
>     record = handler.read(handle)
>   File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/Bio/Entrez/Parser.py", line 169, in read
>     self.parser.ParseFile(handle)
> TypeError: read() did not return a bytes object (type=str)
>
> Any comment, suggestion or help would be greatly appreciated!
> Best regards,
>
> Nicolas
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From nicolas.joannin at gmail.com Wed Dec 19 01:13:12 2012
From: nicolas.joannin at gmail.com (Nicolas Joannin)
Date: Wed, 19 Dec 2012 15:13:12 +0900
Subject: [Biopython] Problem using Entrez with Python 3.2
In-Reply-To:
References:
Message-ID:

Hi Bow,

Thanks a million for your answer!
It is very clear and pedagogical; I appreciate that a lot!
Works like a charm ;)

Best regards,
Nicolas

Nicolas Joannin, Ph.D.
Bioinformatics Center
Kyoto University, Uji campus, Japan

On Sat, Dec 15, 2012 at 2:58 PM, Wibowo Arindrarto wrote:
> Hi Nicolas,
>
> AFAIK, the Entrez XML parser requires bytes instead of strings. The
> distinction matters in Python 3, so what you need to do is to convert
> your string stream into a byte stream. To do so, you need to add a new
> import:
>
> from io import BytesIO
>
> and replace the last line with these:
>
> bytehandle = BytesIO(bytes(handle.read(), 'utf-8')) # utf-8 or any other encoding you want
> record = Entrez.read(bytehandle)
>
> We're basically reading the string, converting it to bytes, and creating a
> buffered bytes I/O that Entrez.read expects.
>
> Alternatively, you can also read the handle into a file first, and open
> it in binary mode:
>
> with open('outfile.xml', 'w') as outfile:
>     outfile.write(handle.read())
>
> with open('outfile.xml', 'rb') as sourcefile:
>     record = Entrez.read(sourcefile)
>
> Hope that helps :),
> Bow
>
> On Tue, Dec 11, 2012 at 2:15 AM, Nicolas Joannin wrote:
> > Hello,
> >
> > I'm having problems when trying to use Bio.Entrez with Python 3.2.
> > I get the following error message:
> >
> >>>> from Bio import Entrez
> >>>> Entrez.email='my at email.address'
> >>>> handle=Entrez.esearch(db='nuccore',term='36329')
> >>>> record=Entrez.read(handle)
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in <module>
> >   File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/Bio/Entrez/__init__.py", line 351, in read
> >     record = handler.read(handle)
> >   File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/Bio/Entrez/Parser.py", line 169, in read
> >     self.parser.ParseFile(handle)
> > TypeError: read() did not return a bytes object (type=str)
> >
> > Any comment, suggestion or help would be greatly appreciated!
> > Best regards,
> >
> > Nicolas
> > _______________________________________________
> > Biopython mailing list - Biopython at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython
>

From w.arindrarto at gmail.com Wed Dec 19 09:25:46 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Wed, 19 Dec 2012 15:25:46 +0100
Subject: [Biopython] Problem using Entrez with Python 3.2
In-Reply-To:
References:
Message-ID:

Hi Nicolas,

Glad to hear that :).

I have to note that what I proposed was a workaround for a bug, as your
initial code was meant to work without any errors. We've recently
applied an initial fix and your prior code should be working fine in
the next Biopython release.

If you really need the fix now (and are feeling a bit adventurous) you
can try installing our development version from GitHub:
http://github.com/biopython/biopython. We have a quick draft guide on
working with git here: http://biopython.org/wiki/GitUsage

cheers,
Bow

On Wed, Dec 19, 2012 at 7:13 AM, Nicolas Joannin wrote:
> Hi Bow,
>
> Thanks a million for your answer!
> It is very clear and pedagogical; I appreciate that a lot!
> Works like a charm ;)
>
> Best regards,
> Nicolas
>
>
>
> Nicolas Joannin, Ph.D.
> Bioinformatics Center
> Kyoto University, Uji campus, Japan
>
>
>
>
> On Sat, Dec 15, 2012 at 2:58 PM, Wibowo Arindrarto wrote:
>>
>> Hi Nicolas,
>>
>> AFAIK, the Entrez XML parser requires bytes instead of strings. The
>> distinction matters in Python 3, so what you need to do is to convert
>> your string stream into a byte stream. To do so, you need to add a new
>> import:
>>
>> from io import BytesIO
>>
>> and replace the last line with these:
>>
>> bytehandle = BytesIO(bytes(handle.read(), 'utf-8')) # utf-8 or any other encoding you want
>> record = Entrez.read(bytehandle)
>>
>> We're basically reading the string, converting it to bytes, and creating a
>> buffered bytes I/O that Entrez.read expects.
>>
>> Alternatively, you can also read the handle into a file first, and open
>> it in binary mode:
>>
>> with open('outfile.xml', 'w') as outfile:
>>     outfile.write(handle.read())
>>
>> with open('outfile.xml', 'rb') as sourcefile:
>>     record = Entrez.read(sourcefile)
>>
>> Hope that helps :),
>> Bow
>>
>> On Tue, Dec 11, 2012 at 2:15 AM, Nicolas Joannin wrote:
>> > Hello,
>> >
>> > I'm having problems when trying to use Bio.Entrez with Python 3.2.
>> > I get the following error message:
>> >
>> >>>> from Bio import Entrez
>> >>>> Entrez.email='my at email.address'
>> >>>> handle=Entrez.esearch(db='nuccore',term='36329')
>> >>>> record=Entrez.read(handle)
>> > Traceback (most recent call last):
>> >   File "<stdin>", line 1, in <module>
>> >   File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/Bio/Entrez/__init__.py", line 351, in read
>> >     record = handler.read(handle)
>> >   File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/Bio/Entrez/Parser.py", line 169, in read
>> >     self.parser.ParseFile(handle)
>> > TypeError: read() did not return a bytes object (type=str)
>> >
>> > Any comment, suggestion or help would be greatly appreciated!
>> > Best regards,
>> >
>> > Nicolas
>> > _______________________________________________
>> > Biopython mailing list - Biopython at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/biopython
>
>

From aljosa.mohorovic at gmail.com Mon Dec 31 04:43:41 2012
From: aljosa.mohorovic at gmail.com (Aljoša Mohorović)
Date: Mon, 31 Dec 2012 10:43:41 +0100
Subject: [Biopython] eutils/entrez doesn't work anymore
In-Reply-To:
References:
Message-ID:

search/fetch stopped working, does anybody know what happened?

also, http://eutils.ncbi.nlm.nih.gov/entrez/eutils/ is temporarily unavailable.
any info appreciated.

Aljosa

From p.j.a.cock at googlemail.com Mon Dec 31 07:50:56 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 31 Dec 2012 12:50:56 +0000
Subject: [Biopython] eutils/entrez doesn't work anymore
In-Reply-To:
References:
Message-ID:

On Mon, Dec 31, 2012 at 9:43 AM, Aljoša Mohorović wrote:
> search/fetch stopped working, does anybody know what happened?
>
> also, http://eutils.ncbi.nlm.nih.gov/entrez/eutils/ is temporarily unavailable.
> any info appreciated.
> > Aljosa Looks like a transient glitch at the NCBI, another victim: http://stackoverflow.com/questions/14097157/httperror-with-example-biopython-code-querying-pubmed It seems to have been resolved now (Entrez is working for me). Peter
From markbudde at gmail.com Thu Dec 6 02:17:15 2012 From: markbudde at gmail.com (Mark Budde) Date: Wed, 5 Dec 2012 18:17:15 -0800 Subject: [Biopython] get more than 20 IDs in IdList In-Reply-To: <4089E344A49CB1498D43F563331B1CF82CF1A0ED@SINPRD0310MB355.apcprd03.prod.outlook.com> References: <4089E344A49CB1498D43F563331B1CF82CF1A0ED@SINPRD0310MB355.apcprd03.prod.outlook.com> Message-ID: <7635510330127239492@unknownmsgid> Perfect, thanks. 
-Mark Sent from my phone On Dec 5, 2012, at 6:07 PM, David Winter wrote: > Hi Mark, > > This is the default behaviour for the EUtils API > http://www.ncbi.nlm.nih.gov/books/NBK25499/ > > To get more than 20 records you need to set the "retmax" parameter to some other number: > >>>> handle = Entrez.esearch(db="protein",term="txid9606[Orgn] AND cftr", retmax=200) > > Hope that helps, > David Winter > > ________________________________________ > From: biopython-bounces at lists.open-bio.org [biopython-bounces at lists.open-bio.org] on behalf of Mark Budde [markbudde at gmail.com] > Sent: Thursday, 6 December 2012 1:53 p.m. > To: biopython at lists.open-bio.org > Subject: [Biopython] get more than 20 IDs in IdList > > Can someone help me understand how to get more than 20 records in the > IdList from esearch? > >>>> from Bio import Entrez >>>> Entrez.email = "markbudde at gmail.com" >>>> handle = Entrez.esearch(db="protein",term="txid9606[Orgn] AND cftr") >>>> record = Entrez.read(handle) >>>> len(record["IdList"]) > 20 >>>> record["Count"] > '354' > > > It seems that I can only ever get 20 IDs in the "IdList", even though the > "Count" can be much higher. > Thanks, > Mark > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > From mjldehoon at yahoo.com Thu Dec 6 02:15:57 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 5 Dec 2012 18:15:57 -0800 (PST) Subject: [Biopython] get more than 20 IDs in IdList In-Reply-To: Message-ID: <1354760157.15566.YahooMailClassic@web164002.mail.gq1.yahoo.com> See http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch under retmax. Also see section 8.15 "Using the history and WebEnv" in the Biopython documentation. Best, -Michiel. 
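
[Editorial sketch, not part of the original thread.] The retmax advice above can be combined with ESearch's retstart parameter to page through a large result set instead of requesting everything at once. The following is a minimal sketch under stated assumptions: `batch_offsets` and `fetch_all_ids` are hypothetical helper names, the batch size of 100 is arbitrary, and `Entrez.email` is assumed to have been set by the caller. It has not been run against the live NCBI service.

```python
def batch_offsets(count, batch_size=100):
    """Yield (retstart, retmax) pairs that together cover `count` hits."""
    for start in range(0, count, batch_size):
        yield start, min(batch_size, count - start)


def fetch_all_ids(db, term, batch_size=100):
    """Collect every ID for a search by paging through esearch results.

    Hypothetical helper: assumes Biopython is installed and that
    Entrez.email has already been set.
    """
    from Bio import Entrez  # imported lazily so batch_offsets has no dependency

    # A first search (default retmax) is used only to learn the total hit count.
    handle = Entrez.esearch(db=db, term=term)
    count = int(Entrez.read(handle)["Count"])
    handle.close()

    ids = []
    for retstart, retmax in batch_offsets(count, batch_size):
        # retstart/retmax are passed through to ESearch as query parameters.
        handle = Entrez.esearch(db=db, term=term,
                                retstart=retstart, retmax=retmax)
        ids.extend(Entrez.read(handle)["IdList"])
        handle.close()
    return ids
```

For the search in this thread, usage would look like `fetch_all_ids("protein", "txid9606[Orgn] AND cftr")`. For very large result sets, the history/WebEnv approach mentioned by Michiel is likely the better fit.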
--- On Wed, 12/5/12, Mark Budde wrote:

> From: Mark Budde
> Subject: Re: [Biopython] get more than 20 IDs in IdList
> To: "Alex Garnett"
> Cc: "biopython"
> Date: Wednesday, December 5, 2012, 8:09 PM
> Thanks. My understanding was that efetch was for getting the records after
> you know the IDs. That is what I am trying to do, but I can only get the
> first 20 IDs to run through efetch. Is there a way to search through efetch?
> -Mark
>
>
> On Wed, Dec 5, 2012 at 4:57 PM, Alex Garnett wrote:
>
> > This is a fairly naive solution based on just having joined and harassed
> > this list with a similar issue a week ago, but is there a reason you can't
> > use efetch()?
> >
> > -alex
> > On Dec 5, 2012 4:54 PM, "Mark Budde" wrote:
> >
> >> Can someone help me understand how to get more than 20 records in the
> >> IdList from esearch?
> >>
> >> >>> from Bio import Entrez
> >> >>> Entrez.email = "markbudde at gmail.com"
> >> >>> handle = Entrez.esearch(db="protein",term="txid9606[Orgn] AND cftr")
> >> >>> record = Entrez.read(handle)
> >> >>> len(record["IdList"])
> >> 20
> >> >>> record["Count"]
> >> '354'
> >>
> >>
> >> It seems that I can only ever get 20 IDs in the "IdList", even though the
> >> "Count" can be much higher.
> >> Thanks,
> >> Mark
> >> _______________________________________________
> >> Biopython mailing list - Biopython at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/biopython
> >>
> >
> _______________________________________________
> Biopython mailing list -
Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From winda002 at student.otago.ac.nz Thu Dec 6 02:07:30 2012 From: winda002 at student.otago.ac.nz (David Winter) Date: Thu, 6 Dec 2012 02:07:30 +0000 Subject: [Biopython] get more than 20 IDs in IdList In-Reply-To: References: Message-ID: <4089E344A49CB1498D43F563331B1CF82CF1A0ED@SINPRD0310MB355.apcprd03.prod.outlook.com> Hi Mark, This is the default behaviour for the EUtils API http://www.ncbi.nlm.nih.gov/books/NBK25499/ To get more than 20 records you need to set the "retmax" parameter to some other number: >>> handle = Entrez.esearch(db="protein",term="txid9606[Orgn] AND cftr", retmax=200) Hope that helps, David Winter ________________________________________ From: biopython-bounces at lists.open-bio.org [biopython-bounces at lists.open-bio.org] on behalf of Mark Budde [markbudde at gmail.com] Sent: Thursday, 6 December 2012 1:53 p.m. To: biopython at lists.open-bio.org Subject: [Biopython] get more than 20 IDs in IdList Can someone help me understand how to get more than 20 records in the IdList from esearch? >>> from Bio import Entrez >>> Entrez.email = "markbudde at gmail.com" >>> handle = Entrez.esearch(db="protein",term="txid9606[Orgn] AND cftr") >>> record = Entrez.read(handle) >>> len(record["IdList"]) 20 >>> record["Count"] '354' It seems that I can only ever get 20 IDs in the "IdList", even though the "Count" can be much higher. Thanks, Mark _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From fpiston at gmail.com Thu Dec 6 11:09:53 2012 From: fpiston at gmail.com (Fernando) Date: Thu, 06 Dec 2012 12:09:53 +0100 Subject: [Biopython] blast to go annotation Message-ID: <877govwi66.fsf@gmail.com> Hello everybody, I am a beginner in python programming and I do not know if did well. 
I wrote a script to do the following task:
- BLAST my sequences against the uniprot_sprot (UniProtKB/Swiss-Prot)
- Take the best match swiss-prot accession
- Take the GOs associated to the swiss-prot accession
- Make a file with my sequence id, best match swiss-prot accession,
  GOs associated. I am making this file to use with topGO in bioconductor.

I have some questions:
- The 'NCBIXML.parse' step has a problem. The function does not take the
  first accession of the .xml file. I need to insert a fake fasta sequence
  at the beginning of the multifasta file to have all blast results of my
  sequences.
- In general, is the script correct? And can I improve it?

Here is the code:

from Bio.Blast.Applications import NcbiblastxCommandline
blastx_cline = NcbiblastxCommandline(query='/home/fpiston/Desktop/test/test2.fasta',
                                     db='uniprot_sprot',
                                     out='/home/fpiston/Desktop/test/test.xml',
                                     evalue='0.001', outfmt='5',
                                     best_hit_overhang='0.1',
                                     best_hit_score_edge='0.05',
                                     max_target_seqs='1')
stdout, stderr = blastx_cline()
result_handle = open("/home/fpiston/Desktop/test/test.xml")

from Bio.Blast import NCBIXML
from Bio import SeqIO
import re
from Bio import SwissProt

q_dict = SeqIO.to_dict(SeqIO.parse(open("/home/fpiston/Desktop/test/test2.fasta"), "fasta"))
blast_records = NCBIXML.parse(result_handle)
save_file = open("/home/fpiston/Desktop/test/test.out", 'w')
blast_record = blast_records.next()
hits = []
for blast_record in blast_records:
    if blast_record.alignments:
        list = (blast_record.query).split()
        if re.match('ENA|\w*|\w*', list[0]) != None:
            list2 = list[0].split("|")
            save_file.write('\n%s\t' % list2[1])
        else:
            save_file.write('\n%s\t' % list[0])
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                list = (alignment.hit_def).split()
                list2 = list[0].split("|")
                save_file.write('%s\t' % list2[2])
                for record in SwissProt.parse(open('/home/db/uniprot_sprot.dat')):
                    if record.entry_name in list2[2]:
                        for cross_reference in record.cross_references:
                            for item in cross_reference:
                                if 'GO:' in item:
                                    save_file.write('%s\t' % item)
        hits.append(blast_record.query.split()[0])
misses = set(q_dict.keys()) - set(hits)
for item in misses:
    save_file.write('\n%s\t' % item)
    save_file.write('%s' % 'no_match')
save_file.close()

Fernando
--

From p.j.a.cock at googlemail.com Thu Dec 6 11:17:54 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Dec 2012 11:17:54 +0000
Subject: [Biopython] blast to go annotation
In-Reply-To: <877govwi66.fsf@gmail.com>
References: <877govwi66.fsf@gmail.com>
Message-ID:

On Thu, Dec 6, 2012 at 11:09 AM, Fernando wrote:
> Hello everybody,
> I am a beginner in python programming and I do not know if did well.
> I wrote a script to do the following task:
> - BLAST my sequences against the uniprot_sprot (UniProtKB/Swiss-Prot)
> - Take the best match swiss-prot accession
> - Take the GOs associated to the swiss-prot accession
> - Make a file with my sequence id, best match swiss-prot accession,
>   GOs associated. I am making this file to use with topGO in bioconductor.
>
> I have some questions:
> - The 'NCBIXML.parse' step has a problem. The function does not take the
>   first accession of the .xml file. I need to insert a fake fasta sequence
>   at the beginning of the multifasta file to have all blast results of my
>   sequences.

Do you mean it is ignoring the first (1st) set of results in the XML file?
That is because you skipped the first BLAST results - try removing this
line before your for loop:

blast_record = blast_records.next()

> - In general, is the script correct? And can I improve it?
>

It would be worth reading the Blast2GO paper for some of the technical
issues and how to weight evidence in assigning GO terms based on
BLAST matches. Note Blast2GO has a command line variant called
"Blast2GO for pipelines" (b2g4pipe).

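
[Editorial sketch, not part of the original thread.] To illustrate why the extra next() call drops the first result: NCBIXML.parse returns a one-pass iterator, so any record pulled off it before the loop is never seen by the loop. The stand-alone example below uses `fake_parse`, a hypothetical stand-in for NCBIXML.parse that assumes nothing beyond that one-pass behaviour, and the Python 3 spelling `next(records)` in place of the script's `records.next()`.

```python
def fake_parse():
    """Hypothetical stand-in for NCBIXML.parse(): yields one record per query."""
    for name in ("query1", "query2", "query3"):
        yield name


# With the extra next() call, the first record is consumed before the loop.
records = fake_parse()
skipped = next(records)
seen_after_next = [record for record in records]

# Without it, the loop sees every record.
seen_direct = [record for record in fake_parse()]

print("skipped:", skipped)
print("loop after next():", seen_after_next)
print("loop on fresh iterator:", seen_direct)
```
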
Peter From fpiston at gmail.com Thu Dec 6 11:59:33 2012 From: fpiston at gmail.com (Fernando) Date: Thu, 06 Dec 2012 12:59:33 +0100 Subject: [Biopython] blast to go annotation In-Reply-To: (Peter Cock's message of "Thu, 6 Dec 2012 11:17:54 +0000") References: <877govwi66.fsf@gmail.com> Message-ID: <87wqwvv1ay.fsf@gmail.com> Peter Cock writes: > On Thu, Dec 6, 2012 at 11:09 AM, Fernando wrote: >> Hello everybody, >> I am a beginner in python programming and I do not know if did well. >> I had wrote a script to do the following task: >> - BLAST my sequences against the uniprot_sprot (UniProtKB/Swiss-Prot) >> - Take the best match swiss-prot accession >> - Take the GOs associated to the swiss-prot accession >> - Make a file with the my sequence id, best match swiss-prot accession, >> GOs associated.I am doing this file to use with topGO in bioconductor. >> >> I have some question: >> - The 'NCBIXML.parse' step has a problem. The function does not take the >> firth accession of the .xml file. I need to insert a fake fasta sequence >> at the beginning of the multifasta file to have all blast result of my >> sequences. > > Do you mean it is ignoring the first (1st) set of results in the XML file? > That is because you skipped the first BLAST results - try removing this > line before your for loop: > > blast_record = blast_records.next() > Yes,I'm. The script ignored the first set of results in the XML file. I had removed the line blast_record = blast_records.next() And it work Ok. >> - En general. It is correct the script? and, can I improve it? >> > > It would be worth reading the Blast2GO paper for some of the technical > issues and how to weight evidence in assigning GO terms based on > BLAST matches. Note Blast2GO has a command line variant called > "Blast2GO for pipelines" (b2g4pipe). > > Peter I know the Blast2GO. 
In fact, I started the GO annotation with that software, but I had many problems because it is very slow and crashes often.These problems make the annotation of many sequences with Blast2GO impossible. Furthermore, I also tried to use b2g4pipe in a cluster but the administrator told me also gives many problems. The administrator told me that the b2g4pipe has not been updated since their appearance and also requires an Internet connection. I think the free version of Blast2GO not be improved since they released the paid version. For this reason I decided to do it with my own code. Thanks very much Fernando -- From p.j.a.cock at googlemail.com Thu Dec 6 18:54:10 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Dec 2012 18:54:10 +0000 Subject: [Biopython] blast to go annotation In-Reply-To: <87wqwvv1ay.fsf@gmail.com> References: <877govwi66.fsf@gmail.com> <87wqwvv1ay.fsf@gmail.com> Message-ID: On Thu, Dec 6, 2012 at 11:59 AM, Fernando wrote: > Yes,I'm. The script ignored the first set of results in the XML file. > I had removed the line > blast_record = blast_records.next() > > And it work Ok. > Good. >> It would be worth reading the Blast2GO paper for some of the technical >> issues and how to weight evidence in assigning GO terms based on >> BLAST matches. Note Blast2GO has a command line variant called >> "Blast2GO for pipelines" (b2g4pipe). >> >> Peter > > I know the Blast2GO. In fact, I started the GO annotation with that > software, but I had many problems because it is very slow and crashes > often.These problems make the annotation of many sequences with > Blast2GO impossible. > Furthermore, I also tried to use b2g4pipe in a cluster but the > administrator told me also gives many problems. The administrator told > me that the b2g4pipe has not been updated since their appearance and > also requires an Internet connection. I think the free version of > Blast2GO not be improved since they released the paid version. 
> > For this reason I decided to do it with my own code. The future of Blast2GO with their paid version is troubling - they never gave a clear answer on how the (money) free version is licensed either. However, b2g4pipe can be used without going online if you have setup a local Blast2GO database - which is *much* faster than connecting to their public database in Spain. We're doing this locally as part of running b2g4pipe within Galaxy. This includes a workaround for Blast2GO being unable to cope with modern BLAST XML files (explained on their website with links to other conversion scripts). You can find my code in the Galaxy Tool Shed http://toolshed.g2.bx.psu.edu or https://bitbucket.org/peterjc/galaxy-central/src/tools/tools/ncbi_blast_plus Regards, Peter From fpiston at gmail.com Thu Dec 6 22:17:55 2012 From: fpiston at gmail.com (Fernando) Date: Thu, 06 Dec 2012 23:17:55 +0100 Subject: [Biopython] blast to go annotation In-Reply-To: (Peter Cock's message of "Thu, 6 Dec 2012 18:54:10 +0000") References: <877govwi66.fsf@gmail.com> <87wqwvv1ay.fsf@gmail.com> Message-ID: <87obi6kep8.fsf@gmail.com> Thanks, I am going to try it. Peter Cock writes: > On Thu, Dec 6, 2012 at 11:59 AM, Fernando wrote: > >> Yes,I'm. The script ignored the first set of results in the XML file. >> I had removed the line >> blast_record = blast_records.next() >> >> And it work Ok. >> > > Good. > >>> It would be worth reading the Blast2GO paper for some of the technical >>> issues and how to weight evidence in assigning GO terms based on >>> BLAST matches. Note Blast2GO has a command line variant called >>> "Blast2GO for pipelines" (b2g4pipe). >>> >>> Peter >> >> I know the Blast2GO. In fact, I started the GO annotation with that >> software, but I had many problems because it is very slow and crashes >> often.These problems make the annotation of many sequences with >> Blast2GO impossible. 
>> Furthermore, I also tried to use b2g4pipe in a cluster but the >> administrator told me also gives many problems. The administrator told >> me that the b2g4pipe has not been updated since their appearance and >> also requires an Internet connection. I think the free version of >> Blast2GO not be improved since they released the paid version. >> >> For this reason I decided to do it with my own code. > > The future of Blast2GO with their paid version is troubling - they > never gave a clear answer on how the (money) free version is > licensed either. > > However, b2g4pipe can be used without going online if you have > setup a local Blast2GO database - which is *much* faster than > connecting to their public database in Spain. > > We're doing this locally as part of running b2g4pipe within Galaxy. > This includes a workaround for Blast2GO being unable to cope > with modern BLAST XML files (explained on their website with > links to other conversion scripts). You can find my code > in the Galaxy Tool Shed http://toolshed.g2.bx.psu.edu or > https://bitbucket.org/peterjc/galaxy-central/src/tools/tools/ncbi_blast_plus > > Regards, > > Peter Fernando -- From melissacurran530 at gmail.com Sat Dec 8 20:39:57 2012 From: melissacurran530 at gmail.com (Melissa Curran) Date: Sat, 8 Dec 2012 15:39:57 -0500 Subject: [Biopython] Printing non-ASCII characters Message-ID: Hello, I'm trying to print out the titles of articles, but often get an error about it encountering non-ASCII characters. What do I need to do in order to be able to print the article titles? Below is a snippet of code from my program, which is where I'm getting the error. 
handle = Entrez.efetch(db="pubmed", id=record, retmode="xml")
articles = Entrez.parse(handle)
for article in articles:
    print article['MedlineCitation']['Article']['ArticleTitle']

Thanks much,

Melissa

From w.arindrarto at gmail.com Sat Dec 8 20:53:22 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Sat, 8 Dec 2012 21:53:22 +0100
Subject: [Biopython] Printing non-ASCII characters
In-Reply-To:
References:
Message-ID:

Hi Melissa,

Does the error you're getting look like this:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u015f' in
position 2: ordinal not in range(128)

If so, it's because there are non-ascii characters in the text you're
trying to print. To try to fix that, change the last line to this:

print article['MedlineCitation']['Article']['ArticleTitle'].encode('utf-8')

This forces python to encode the characters using utf-8, which can
handle non-ascii characters. Note that for the characters to be shown
properly (e.g. a with umlaut), your terminal must also support utf-8.
Otherwise, you'll simply get a weird replacement character for all the
non-ascii characters.

Hope this helps :),
Bow

On Sat, Dec 8, 2012 at 9:39 PM, Melissa Curran wrote:
> Hello,
>
> I'm trying to print out the titles of articles, but often get an error
> about it encountering non-ASCII characters. What do I need to do in order
> to be able to print the article titles? Below is a snippet of code from my
> program, which is where I'm getting the error.
> > handle = Entrez.efetch(db="pubmed", id=record, retmode="xml") > articles = Entrez.parse(handle) > for article in articles: > print article['MedlineCitation']['Article']['ArticleTitle'] > > Thanks much, > > Melissa > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From phyrexian.kavu at gmail.com Sun Dec 9 03:53:47 2012 From: phyrexian.kavu at gmail.com (Miguel Romero) Date: Sat, 8 Dec 2012 21:53:47 -0600 Subject: [Biopython] qblast against RefSeq Message-ID: Hi! Im learning how to run BLAST over the internet and Im interested in blasting against the RefSeq database. For now I have managed to blast against the nr database with: result_handle = NCBIWWW.qblast("blastp", "nr", sequence_string) but I dont know how to search against RefSeq, if I enter: result_handle = NCBIWWW.qblast("blastp", "refseq", sequence_string) the blast record is empty! Is it possible to blast against the RefSeq database? What should I type in the second argument? Thank you! Miguel Romero -- [Theropoda is my profession] From w.arindrarto at gmail.com Sun Dec 9 04:59:07 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Sun, 9 Dec 2012 05:59:07 +0100 Subject: [Biopython] qblast against RefSeq In-Reply-To: References: Message-ID: Hi Miguel, If I recall correctly, you need to specify which 'refseq' database you're using (there are 'refseq_rna' and 'refseq_protein'). Try running the search again but using 'refseq_protein' as the database name. Hope that helps :), Bow On Sun, Dec 9, 2012 at 4:53 AM, Miguel Romero wrote: > Hi! > Im learning how to run BLAST over the internet and Im interested in > blasting against the RefSeq database. 
For now I have managed to blast > against the nr database with: > > result_handle = NCBIWWW.qblast("blastp", "nr", sequence_string) > > but I dont know how to search against RefSeq, if I enter: > > result_handle = NCBIWWW.qblast("blastp", "refseq", sequence_string) > > the blast record is empty! Is it possible to blast against the RefSeq > database? What should I type in the second argument? > > Thank you! > > Miguel Romero > > -- > [Theropoda is my profession] > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From winda002 at student.otago.ac.nz Sun Dec 9 05:06:48 2012 From: winda002 at student.otago.ac.nz (David Winter) Date: Sun, 9 Dec 2012 05:06:48 +0000 Subject: [Biopython] qblast against RefSeq In-Reply-To: References: Message-ID: <4089E344A49CB1498D43F563331B1CF82CF1C542@SINPRD0310MB355.apcprd03.prod.outlook.com> Hi Miguel, I think you want 'refseq_protein' (check out the drop-down on the web blast page, which shows you the available databases). The xml file returned by qblast will probably have a useful error message too. Cheers, David ________________________________________ From: biopython-bounces at lists.open-bio.org [biopython-bounces at lists.open-bio.org] on behalf of Miguel Romero [phyrexian.kavu at gmail.com] Sent: Sunday, 9 December 2012 4:53 p.m. To: biopython at lists.open-bio.org Subject: [Biopython] qblast against RefSeq Hi! Im learning how to run BLAST over the internet and Im interested in blasting against the RefSeq database. For now I have managed to blast against the nr database with: result_handle = NCBIWWW.qblast("blastp", "nr", sequence_string) but I dont know how to search against RefSeq, if I enter: result_handle = NCBIWWW.qblast("blastp", "refseq", sequence_string) the blast record is empty! Is it possible to blast against the RefSeq database? What should I type in the second argument? Thank you! 
Miguel Romero

--
[Theropoda is my profession]
_______________________________________________
Biopython mailing list  -  Biopython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython
From fpiston at gmail.com  Mon Dec 10 20:07:44 2012
From: fpiston at gmail.com (Fernando)
Date: Mon, 10 Dec 2012 21:07:44 +0100
Subject: [Biopython] problem to find a accesion number in tab delimited file
Message-ID: <8738zdbrhr.fsf@gmail.com>

Hello everybody,

I'm trying to perform a GO annotation using the SIMAP database, which is
Blast2GO annotated. Everything is fine, but I have problems when I try
to find the accession number in the file where entry numbers are
associated with their GOs. The problem is that the script does not find
the number in the input file even though it is there. I tried several
things without good results (re.match, inserting into a list and then
extracting the element, etc.).
The file where the GOs are associated with entry numbers has this
structure (accession number, GO term, Blast2GO score):
1f0ba1d119f52ff28e907d2b5ea450db        GO:0007154      79
1f0ba1d119f52ff28e907d2b5ea450db        GO:0005605      99

The python code:
#!/usr/bin/env python
import re
from Bio.Blast import NCBIXML
from Bio import SeqIO

input_file = open('/home/fpiston/Desktop/test_go/test2.fasta', 'rU')
result_handle = open('/home/fpiston/Desktop/test_go/test2.xml', 'rU')
save_file = open('/home/fpiston/Desktop/test_go/test2.out', 'w')

fh = open('/home/fpiston/Desktop/test_go/Os_Bd_Ta_blat2go_fake', 'rU')
q_dict = SeqIO.to_dict(SeqIO.parse(input_file, "fasta"))
blast_records = NCBIXML.parse(result_handle)

hits = []

for blast_record in blast_records:
    if blast_record.alignments:
        list = (blast_record.query).split()
        if re.match('ENA|\w*|\w*', list[0]) != None:
            list2 = list[0].split("|")
            save_file.write('%s\t' % list2[1])
        else:
            save_file.write('%s\t' % list[0])
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                h = alignment.hit_def    #at this point all right
                for l in fh:             #here, 'l' is not found in 'fh'
                    ls = l.split()
                    if h in ls:
                        print h
                        print 'ok'
                        save_file.write('%s\t' % ls[1])
                save_file.write('\n')
        hits.append(blast_record.query.split()[0])
misses =set(q_dict.keys()) - set(hits)

for i in misses:
    list = i.split("|")
    if len(list) > 1:
        save_file.write('%s\t' % list[1])
    else:
        save_file.write('%s\t' % list)
    save_file.write('%s\n' % 'no_match')

save_file.close()



Fernando
--
From nicolas.joannin at gmail.com  Tue Dec 11 01:15:48 2012
From: nicolas.joannin at gmail.com (Nicolas Joannin)
Date: Tue, 11 Dec 2012 10:15:48 +0900
Subject: [Biopython] Problem using Entrez with Python 3.2
Message-ID:

Hello,

I'm having problems when trying to use Bio.Entrez with Python 3.2.
I get the following error message:

>>> from Bio import Entrez
>>> Entrez.email='my at email.address'
>>> handle=Entrez.esearch(db='nuccore',term='36329')
>>> record=Entrez.read(handle)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/Bio/Entrez/__init__.py", line 351, in read
    record = handler.read(handle)
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/Bio/Entrez/Parser.py", line 169, in read
    self.parser.ParseFile(handle)
TypeError: read() did not return a bytes object (type=str)

Any comment, suggestion or help would be greatly appreciated!
Best regards,

Nicolas
From arklenna at gmail.com  Tue Dec 11 03:14:38 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Mon, 10 Dec 2012 22:14:38 -0500
Subject: [Biopython] problem to find a accesion number in tab delimited file
In-Reply-To: <8738zdbrhr.fsf@gmail.com>
References: <8738zdbrhr.fsf@gmail.com>
Message-ID:

Hi Fernando,

A filehandle only reads through a file once, so the second time through the
loop, `fh` is, as you found, empty.

I would suggest reading the entirety of that file into a list-like object,
which will be persistent.
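A minimal sketch of that idea, assuming the annotation file is
whitespace-delimited with the accession in the first column and the GO
term in the second, as in the two sample rows quoted in the question.
The names load_rows and go_terms_for are made up for illustration, and
the sample rows are written to a temporary file only so that the snippet
runs on its own:

```python
import os
import tempfile


def load_rows(path):
    """Read the whole annotation file once and return a list of split rows."""
    with open(path) as handle:
        return [line.split() for line in handle if line.strip()]


def go_terms_for(rows, accession):
    """Return the GO term of every row whose first column equals accession."""
    return [row[1] for row in rows if row[0] == accession]


# Hypothetical sample data in the three-column format from the question
sample = (
    "1f0ba1d119f52ff28e907d2b5ea450db\tGO:0007154\t79\n"
    "1f0ba1d119f52ff28e907d2b5ea450db\tGO:0005605\t99\n"
)
tmp = tempfile.NamedTemporaryFile("w", suffix=".tab", delete=False)
tmp.write(sample)
tmp.close()

rows = load_rows(tmp.name)  # read once, before looping over the BLAST hits
os.remove(tmp.name)

# Unlike the file handle, the list can be searched as often as needed
first = go_terms_for(rows, "1f0ba1d119f52ff28e907d2b5ea450db")
second = go_terms_for(rows, "1f0ba1d119f52ff28e907d2b5ea450db")
print(first)
print(second)
```

In the original script this would mean building the list once, above the
loop over blast_records, and searching it for each hit_def. Note that the
sketch compares against the first column only, whereas `if h in ls`
matches any column; that is an assumption about what is wanted.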
You might also consider using the accession number as a dictionary key and
appending the GO numbers to a list value.

Cheers,

Lenna


On Mon, Dec 10, 2012 at 3:07 PM, Fernando wrote:

> Hello everybody,
>
> I'm trying to perform a GO annotation using the SIMAP database, which is
> Blast2GO annotated. Everything is fine, but I have problems when I try
> to find the accession number in the file where entry numbers are
> associated with their GOs. The problem is that the script does not find
> the number in the input file even though it is there. I tried several
> things without good results (re.match, inserting into a list and then
> extracting the element, etc.).
> The file where the GOs are associated with entry numbers has this
> structure (accession number, GO term, Blast2GO score):
> 1f0ba1d119f52ff28e907d2b5ea450db        GO:0007154      79
> 1f0ba1d119f52ff28e907d2b5ea450db        GO:0005605      99
>
> The python code:
> #!/usr/bin/env python
> import re
> from Bio.Blast import NCBIXML
> from Bio import SeqIO
>
> input_file = open('/home/fpiston/Desktop/test_go/test2.fasta', 'rU')
> result_handle = open('/home/fpiston/Desktop/test_go/test2.xml', 'rU')
> save_file = open('/home/fpiston/Desktop/test_go/test2.out', 'w')
>
> fh = open('/home/fpiston/Desktop/test_go/Os_Bd_Ta_blat2go_fake', 'rU')
> q_dict = SeqIO.to_dict(SeqIO.parse(input_file, "fasta"))
> blast_records = NCBIXML.parse(result_handle)
>
> hits = []
>
> for blast_record in blast_records:
>     if blast_record.alignments:
>         list = (blast_record.query).split()
>         if re.match('ENA|\w*|\w*', list[0]) != None:
>             list2 = list[0].split("|")
>             save_file.write('%s\t' % list2[1])
>         else:
>             save_file.write('%s\t' % list[0])
>         for alignment in blast_record.alignments:
>             for hsp in alignment.hsps:
>                 h = alignment.hit_def    #at this point all right
>                 for l in fh:             #here, 'l' is not found in 'fh'
>                     ls = l.split()
>                     if h in ls:
>                         print h
>                         print 'ok'
>                         save_file.write('%s\t' % ls[1])
>                 save_file.write('\n')
>         hits.append(blast_record.query.split()[0])
> misses =set(q_dict.keys()) - set(hits)
>
> for i in misses:
>     list = i.split("|")
>     if len(list) > 1:
>         save_file.write('%s\t' % list[1])
>     else:
>         save_file.write('%s\t' % list)
>     save_file.write('%s\n' % 'no_match')
>
> save_file.close()
>
>
>
> Fernando
> --
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
From fpiston at gmail.com  Tue Dec 11 07:06:30 2012
From: fpiston at gmail.com (Fernando)
Date: Tue, 11 Dec 2012 08:06:30 +0100
Subject: [Biopython] problem to find a accesion number in tab delimited file
In-Reply-To: (Lenna Peterson's message of "Mon, 10 Dec 2012 22:14:38 -0500")
References: <8738zdbrhr.fsf@gmail.com>
Message-ID: <87vcc99ifd.fsf@gmail.com>

Hi,

I'm sorry, the code has an error. The problem actually appears in the
following lines:

                    if h in ls:
                        print h
                        print 'ok'

I cannot find the hit ('h') in the file 'fh' line by line.

I am going to try with a dictionary, as suggested by Lenna.

Thanks for your answers.


Lenna Peterson writes:

> Hi Fernando,
>
> A filehandle only reads through a file once, so the second time
> through the loop, `fh` is, as you found, empty.
>
> I would suggest reading the entirety of that file into a list-like
> object, which will be persistent. You might also consider using the
> accession number as a dictionary key and appending the GO numbers to a
> list value.
>
> Cheers,
>
> Lenna
>
>
> On Mon, Dec 10, 2012 at 3:07 PM, Fernando wrote:
>
> Hello everybody,
>
> I'm trying to perform a GO annotation using the SIMAP database, which is
> Blast2GO annotated. Everything is fine, but I have problems when I try
> to find the accession number in the file where entry numbers are
> associated with their GOs. The problem is that the script does not find
> the number in the input file even though it is there. I tried several
> things without good results (re.match, inserting into a list and then
> extracting the element, etc.).
> The file where the GOs are associated with entry numbers has this
> structure (accession number, GO term, Blast2GO score):
> 1f0ba1d119f52ff28e907d2b5ea450db        GO:0007154      79
> 1f0ba1d119f52ff28e907d2b5ea450db        GO:0005605      99
>
> The python code:
> #!/usr/bin/env python
> import re
> from Bio.Blast import NCBIXML
> from Bio import SeqIO
>
> input_file = open('/home/fpiston/Desktop/test_go/test2.fasta', 'rU')
> result_handle = open('/home/fpiston/Desktop/test_go/test2.xml', 'rU')
> save_file = open('/home/fpiston/Desktop/test_go/test2.out', 'w')
>
> fh = open('/home/fpiston/Desktop/test_go/Os_Bd_Ta_blat2go_fake', 'rU')
> q_dict = SeqIO.to_dict(SeqIO.parse(input_file, "fasta"))
> blast_records = NCBIXML.parse(result_handle)
>
> hits = []
>
> for blast_record in blast_records:
>     if blast_record.alignments:
>         list = (blast_record.query).split()
>         if re.match('ENA|\w*|\w*', list[0]) != None:
>             list2 = list[0].split("|")
>             save_file.write('%s\t' % list2[1])
>         else:
>             save_file.write('%s\t' % list[0])
>         for alignment in blast_record.alignments:
>             for hsp in alignment.hsps:
>                 h = alignment.hit_def    #at this point all right
>                 for l in fh:             #here, 'l' is not found in 'fh'
>                     ls = l.split()
>                     if h in ls:
>                         print h
>                         print 'ok'
>                         save_file.write('%s\t' % ls[1])
>                 save_file.write('\n')
>         hits.append(blast_record.query.split()[0])
> misses =set(q_dict.keys()) - set(hits)
>
> for i in misses:
>     list = i.split("|")
>     if len(list) > 1:
>         save_file.write('%s\t' % list[1])
>     else:
>         save_file.write('%s\t' % list)
>     save_file.write('%s\n' % 'no_match')
>
> save_file.close()
>
>
>
> Fernando
> --
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>

Fernando
--
From w.arindrarto at gmail.com  Sat Dec 15 05:58:44 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Sat, 15 Dec 2012 06:58:44 +0100
Subject: [Biopython] Problem using Entrez with Python 3.2
In-Reply-To:
References:
Message-ID:

Hi Nicolas,

AFAIK, the Entrez XML parser requires bytes instead of strings. The
distinction matters in Python 3, so what you need to do is to convert
your string stream into a byte stream. To do so, you need to add a new
import:

from io import BytesIO

and replace the last line with these:

bytehandle = BytesIO(bytes(handle.read(), 'utf-8'))  # utf-8 or any other encoding you want
record = Entrez.read(bytehandle)

We're basically reading the string, converting it to bytes, and creating
a buffered bytes I/O object that Entrez.read expects.

Alternatively, you can also read the handle into a file first, and open
it in binary mode:

with open('outfile.xml', 'w') as outfile:
    outfile.write(handle.read())

with open('outfile.xml', 'rb') as sourcefile:
    record = Entrez.read(sourcefile)

Hope that helps :),
Bow

On Tue, Dec 11, 2012 at 2:15 AM, Nicolas Joannin wrote:
> Hello,
>
> I'm having problems when trying to use Bio.Entrez with Python 3.2.
> I get the following error message: > >>>> from Bio import Entrez >>>> Entrez.email='my at email.address' >>>> handle=Entrez.esearch(db='nuccore',term='36329') >>>> record=Entrez.read(handle) > Traceback (most recent call last): > File "", line 1, in > File > "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/Bio/Entrez/__init__.py", > line 351, in read > record = handler.read(handle) > File > "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/Bio/Entrez/Parser.py", > line 169, in read > self.parser.ParseFile(handle) > TypeError: read() did not return a bytes object (type=str) > > Any comment, suggestion or help would be greatly appreciated! > Best regards, > > Nicolas > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From nicolas.joannin at gmail.com Wed Dec 19 06:13:12 2012 From: nicolas.joannin at gmail.com (Nicolas Joannin) Date: Wed, 19 Dec 2012 15:13:12 +0900 Subject: [Biopython] Problem using Entrez with Python 3.2 In-Reply-To: References: Message-ID: Hi Bow, Thanks a million for your answer! It is very clear and pedagogical; I appreciate that a lot! Works like a charm ;) Best regards, Nicolas Nicolas Joannin, Ph.D. Bioinformatics Center Kyoto University, Uji campus, Japan On Sat, Dec 15, 2012 at 2:58 PM, Wibowo Arindrarto wrote: > Hi Nicolas, > > AFAIK, the Entrez XML parser requires bytes instead of strings. The > distinction matters in Python 3, so what you need to do is to convert > your string stream into a byte stream. To do so, you need to add a new > import: > > from io import BytesIO > > and replace the last line with these: > > bytehandle = BytesIO(bytes(handle.read(), 'utf-8')) # utf-8 or any > other encoding you want > record = Entrez.read(bytehandle) > > We're basically reading the string, convert it to bytes, and create a > buffered bytes I/O that Entrez.read expects. 
> > Alternativey, you can also read the handle into a file first, and open > it in binary mode: > > with open('outfile.xml', 'w') as outfile: > outfile.write(handle.read()) > > with open('outfile.xml', 'rb') as sourcefile: > record = Entrez.read(sourcefile) > > Hope that helps :), > Bow > > On Tue, Dec 11, 2012 at 2:15 AM, Nicolas Joannin > wrote: > > Hello, > > > > I'm having problems when trying to use Bio.Entrez with Python 3.2. > > I get the following error message: > > > >>>> from Bio import Entrez > >>>> Entrez.email='my at email.address' > >>>> handle=Entrez.esearch(db='nuccore',term='36329') > >>>> record=Entrez.read(handle) > > Traceback (most recent call last): > > File "", line 1, in > > File > > > "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/Bio/Entrez/__init__.py", > > line 351, in read > > record = handler.read(handle) > > File > > > "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/Bio/Entrez/Parser.py", > > line 169, in read > > self.parser.ParseFile(handle) > > TypeError: read() did not return a bytes object (type=str) > > > > Any comment, suggestion or help would be greatly appreciated! > > Best regards, > > > > Nicolas > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > From w.arindrarto at gmail.com Wed Dec 19 14:25:46 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 19 Dec 2012 15:25:46 +0100 Subject: [Biopython] Problem using Entrez with Python 3.2 In-Reply-To: References: