From biopyte at yahoo.de Mon Jan 2 10:34:18 2006
From: biopyte at yahoo.de (Hans Meier)
Date: Mon Jan 2 10:38:46 2006
Subject: [BioPython] answer to mail from sebastian bassi on 2005Dec31
Message-ID: <20060102153418.92730.qmail@web26303.mail.ukl.yahoo.com>

Dear Sebastian,

I'm answering your following E-Mail from 2005 Dec 31:

> Your computer is not underpowered and the file is not so large, so it
> should not hangup. Could you provide code for us to check it? (and the
> datafile, you should upload it to a ftp/web server if the data is
> public).

Sorry, but I couldn't find out how to answer it within the thread. Could you tell me how to do that, please?

I don't believe that the problem is specific to the file. Anyway, the data file is

ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K12/NC_000913.gbk

But watch out. You have to substitute 5" by 5' in the field shown below. Otherwise the file will not parse at all.

/note="2'-(5"-phosphoribosyl)-3'-dephospho-CoA transferase;
holo-citrate lyase synthase; CitG forms the prosthetic group
precursor 2'-(5"-triphosphoribosyl)-3'-dephospho-CoA which is
then transferred to apo-ACP by CitX to produce holo-ACP and
pyrophosphate; go_process: protein modification [goid 0006464]"

Best regards, Harald

From biopyte at yahoo.de Mon Jan 2 11:33:49 2006
From: biopyte at yahoo.de (Hans Meier)
Date: Mon Jan 2 11:37:57 2006
Subject: [BioPython] Sorry, one more time: extract data from a large .gbk file
Message-ID: <20060102163349.57644.qmail@web26305.mail.ukl.yahoo.com>

Dear friends,

I apologize for bothering you once more with this. But maybe we can make it now clear. All I want to do is extract data from a whole genome .gbk file on my disk. The file has about 5000(!) entries like the one shown below.
All I want to do is:

Give me the protein sequence (="/translation") (or whatever) of gene (="/gene") soandso.

Speed matters.

Though I believe I'm not a total dummy in programming and I tried several approaches taken from the web, I could not program this so that the request finishes within a reasonable time or without crashing my box (P3, 700MHz, 256MB).

Since this is an important question for me but I don't want to bother you with this any further, maybe someone could just post a code snippet showing how to accomplish this trivial(?) task?

Thanks a lot for all your work and your help, Harald

###### a typical .gbk entry ###########
     gene            94650..96008
                     /gene="murF"
                     /locus_tag="b0086"
                     /note="synonyms: mra, EG10622, b0086"
                     /db_xref="GeneID:944813"
     CDS             94650..96008
                     /gene="murF"
                     /locus_tag="b0086"
                     /EC_number="6.3.2.15"
                     /function="enzyme; Murein sacculus, peptidoglycan"
                     /note="go_component: cytoplasm [goid 0005737];
                     go_process: peptidoglycan biosynthesis [goid 0009252];
                     go_process: peptidoglycan metabolism [goid 0000270]"
                     /codon_start=1
                     /transl_table=11
                     /product="D-alanine:D-alanine-adding enzyme"
                     /protein_id="NP_414628.1"
                     /db_xref="ASAP:313"
                     /db_xref="GI:16128079"
                     /db_xref="GeneID:944813"
                     /translation="MISVTLSQLTDILNGELQGADITLDAVTTDTRKLTPGCLFVALK
                     GERFDAHDFADQAKAGGAGALLVSRPLDIDLPQLIVKDTRLAFGELAAWVRQQVPARV
                     VALTGSSGKTSVKEMTAAILSQCGNTLYTAGNLNNDIGVPMTLLRLTPEYDYAVIELG
                     ANHQGEIAWTVSLTRPEAALVNNLAAAHLEGFGSLAGVAKAKGEIFSGLPENGIAIMN
                     ADNNDWLNWQSVIGSRKVWRFSPNAANSDFTATNIHVTSHGTEFTLQTPTGSVDVLLP
                     LPGRHNIANALAAAALSMSVGATLDAIKAGLANLKAVPGRLFPIQLAENQLLLDDSYN
                     ANVGSMTAAVQVLAEMPGYRVLVVGDMAELGAESEACHVQVGEAAKAAGIDRVLSVGK
                     QSHAISTASGVGEHFADKTALITRLKLLIAEQQVITILVKGSRSAAMEEVVRALQENG
                     TC"
########end of the example#####################
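[Archive note] The request above (the /translation of the CDS carrying a given /gene) can be sketched with nothing but the standard library, reading the file one line at a time so memory use stays small. This is only a rough sketch under stated assumptions, not Biopython's parser: the function name find_translation is made up for illustration, the input is assumed to use the usual GenBank feature-table layout (feature keys indented by 5 spaces, qualifiers indented further), /gene is assumed to appear before /translation within a CDS, and the second CDS ("abcD") in the demo data is invented.

```python
import io

KEY_INDENT = " " * 5    # assumed indentation of feature keys such as "CDS"
QUAL_INDENT = " " * 21  # assumed indentation of qualifier lines


def find_translation(handle, gene_name):
    """Return the /translation of the first CDS tagged /gene="gene_name", or None."""
    wanted = '/gene="%s"' % gene_name
    in_cds = False    # currently inside a CDS feature
    matched = False   # this CDS carries the wanted /gene qualifier
    chunks = None     # pieces of a multi-line /translation being collected

    for raw in handle:
        line = raw.rstrip("\r\n")
        is_key = (line.startswith(KEY_INDENT) and len(line) > 5
                  and not line[5].isspace())
        if is_key or not line.startswith(" "):
            # A new feature or a new top-level section starts: reset the state.
            in_cds = is_key and line.split()[0] == "CDS"
            matched = False
            chunks = None
            continue
        if not in_cds:
            continue

        text = line.strip()
        if chunks is not None:
            # Continuation line of a multi-line /translation.
            chunks.append(text.rstrip('"'))
            if text.endswith('"'):
                if matched:
                    return "".join(chunks)
                chunks = None
        elif text == wanted:
            matched = True
        elif text.startswith('/translation="'):
            body = text[len('/translation="'):]
            if not body.endswith('"'):
                chunks = [body]
            elif matched:
                return body[:-1]
    return None


# Demo data: a shortened murF entry plus one invented CDS.
SAMPLE = "\n".join([
    KEY_INDENT + "gene            94650..96008",
    QUAL_INDENT + '/gene="murF"',
    KEY_INDENT + "CDS             94650..96008",
    QUAL_INDENT + '/gene="murF"',
    QUAL_INDENT + '/product="D-alanine:D-alanine-adding enzyme"',
    QUAL_INDENT + '/translation="MISVTLSQLTDILNGELQGADITLDAVTTDTRKLTPGCLFVALK',
    QUAL_INDENT + "GERFDAHDFADQAKAGGAGALLVSRPLDIDLPQLIVKDTRLAFGELAAWVRQQVPARV",
    QUAL_INDENT + 'TC"',
    KEY_INDENT + "CDS             100..200",
    QUAL_INDENT + '/gene="abcD"',
    QUAL_INDENT + '/translation="MKV"',
]) + "\n"

print(find_translation(io.StringIO(SAMPLE), "murF"))
```

For a file on disk, an open file handle can be passed in place of the io.StringIO object. The scan stops at the first match, so a lookup never reads more of the file than it has to.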
From cariaso at yahoo.com Mon Jan 2 17:05:43 2006
From: cariaso at yahoo.com (Michael Cariaso)
Date: Mon Jan 2 17:09:47 2006
Subject: [BioPython] Sorry, one more time: extract data from a large .gbk file
In-Reply-To: <20060102163349.57644.qmail@web26305.mail.ukl.yahoo.com>
References: <20060102163349.57644.qmail@web26305.mail.ukl.yahoo.com>
Message-ID: <43B9A3B7.9030203@yahoo.com>

Until we see some code, I can't be sure. But what you are doing seems well within BioPython's abilities. I wonder if perhaps your code looks like this:

    alist = readWholeFile(filename)
    for record in alist:
        process(record)

which is what is causing the performance problems. If your code does resemble the above, try to change it so that it looks more like:

    recordIterator = createIterator(filename)
    for record in recordIterator:
        process(record)

or

    fileobj = open(filename)
    done = False
    while not done:
        record = readNextRecord(fileobj)
        if record:
            process(record)
        else:
            done = True

The first form reads the whole GenBank file into memory, and might crash your machine. The second form reads in one record at a time, and processes it. This requires far less memory.

Hans Meier wrote:
> Dear friends,
>
> I apologize for bothering you once more with this.
> But maybe we can make it now clear.
> All I want to do is extract data from a whole genome .gbk file on my disk.
> The file has about 5000(!) entries like the one shown below.
> All I want to do is:
>
> Give me the protein sequence (="/translation) (or whatever)
> of gene (="/gene") soandso.
>
> Speed matters.
>
> Though I believe I'm not a total dummy in programming
> and I tried several approaches taken from the web
> I could not program this so that the request is finished
> within a reasonable time or without crushing my box
> (P3,700MHz,256MB)
>
> Since this is an important question for me but
> I don't want to bother you with this any further,
> maybe someone could just post a code snippet
> how to accomplish this trivial(?) task?
> > > Thanks a lot for all your work and your help, Harald > > > ###### a typical .gbk entry ########### > gene 94650..96008 > /gene="murF" > /locus_tag="b0086" > /note="synonyms: mra, EG10622, b0086" > /db_xref="GeneID:944813" > CDS 94650..96008 > /gene="murF" > /locus_tag="b0086" > /EC_number="6.3.2.15" > /function="enzyme; Murein sacculus, peptidoglycan" > /note="go_component: cytoplasm [goid 0005737]; > go_process: peptidoglycan biosynthesis [goid 0009252]; > go_process: peptidoglycan metabolism [goid 0000270]" > /codon_start=1 > /transl_table=11 > /product="D-alanine:D-alanine-adding enzyme" > /protein_id="NP_414628.1" > /db_xref="ASAP:313" > /db_xref="GI:16128079" > /db_xref="GeneID:944813" > /translation="MISVTLSQLTDILNGELQGADITLDAVTTDTRKLTPGCLFVALK > GERFDAHDFADQAKAGGAGALLVSRPLDIDLPQLIVKDTRLAFGELAAWVRQQVPARV > VALTGSSGKTSVKEMTAAILSQCGNTLYTAGNLNNDIGVPMTLLRLTPEYDYAVIELG ANHQGEIAWTVSLTRPEAALVNNLAAAHLEGFGSLAGVAKAKGEIFSGLPENGIAIMN ADNNDWLNWQSVIGSRKVWRFSPNAANSDFTATNIHVTSHGTEFTLQTPTGSVDVLLP LPGRHNIANALAAAALSMSVGATLDAIKAGLANLKAVPGRLFPIQLAENQLLLDDSYN > ANVGSMTAAVQVLAEMPGYRVLVVGDMAELGAESEACHVQVGEAAKAAGIDRVLSVGK QSHAISTASGVGEHFADKTALITRLKLLIAEQQVITILVKGSRSAAMEEVVRALQENG > TC" > ########end of the example##################### > > > > --------------------------------- > Telefonieren Sie ohne weitere Kosten mit Ihren Freunden von PC zu PC! > Jetzt Yahoo! Messenger installieren! > _______________________________________________ > BioPython mailing list - BioPython@biopython.org > http://biopython.org/mailman/listinfo/biopython > From srini_iyyer_bio at yahoo.com Tue Jan 3 11:47:08 2006 From: srini_iyyer_bio at yahoo.com (Srinivas Iyyer) Date: Tue Jan 3 11:50:30 2006 Subject: [BioPython] E-utils on NCBI site In-Reply-To: <20060102163349.57644.qmail@web26305.mail.ukl.yahoo.com> Message-ID: <20060103164708.3912.qmail@web31614.mail.mud.yahoo.com> Dear group, there are 2 questions that are important to my research. 
I hope group members who have tried would be willing to help me. 1. How do I use E-utils using python, bio-python modules. Are there any examples? 2. here is the specific that I want to do. I am interested in downloading all the human affymetrix data submitted to GEO site. for example:GSE2152_RAW_tar is the raw CEL file package submitted to GEO by the authors. It is in the directory: ftp://ftp.ncbi.nih.gov/pub/geo/data/geo/raw_data/series/GSE2152/GSE2152_RAW.tar However, checking manually for every dataset that is submitted to ncbi is pains taking procedure. So I want to be able to write a script that would check for every dataset submitted to GEO. After that I want to filter human tar files. Has any one did this before. could you please help me. Thanks Srini __________________________________ Yahoo! for Good - Make a difference this year. http://brand.yahoo.com/cybergivingweek2005/ From saccenti at cerm.unifi.it Tue Jan 3 12:24:43 2006 From: saccenti at cerm.unifi.it (saccenti@cerm.unifi.it) Date: Tue Jan 3 12:34:17 2006 Subject: [BioPython] E-utils on NCBI site In-Reply-To: <20060103164708.3912.qmail@web31614.mail.mud.yahoo.com> References: <20060102163349.57644.qmail@web26305.mail.ukl.yahoo.com> <20060103164708.3912.qmail@web31614.mail.mud.yahoo.com> Message-ID: <1345.155.52.120.87.1136309083.squirrel@alpha.cerm.unifi.it> > 1. How do I use E-utils using python, bio-python > modules. Are there any examples? I do not know if Biopython has a E-utils module. I had to deal with E-utils to jump from database code to another in NCBI databsae. I read instructions in NCBI E-utils link then a I wrote my own consumer to open different web pages and get different codes parsing the simple html code > > 2. here is the ...... My be you are able to get a complete list of all GEO files in NCBI databes. When you have a list you can use standard commands of ftp module to connect to the ncbi server and download what you want. 
To filter human data maybe you must find a regularity in files name if possible to discriminate human files. Maybe inside the file there will be a flag. You can download all tar files and then read them one after the other deleting non human file. Python has has an util to read into zipped files without have to open them before, but I do not remeber if it works also with tar files. Hope It can helps edoardo Maybe this is not elgant but should be fast to write From biopython at maubp.freeserve.co.uk Tue Jan 3 13:57:01 2006 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue Jan 3 13:59:22 2006 Subject: [BioPython] Large GenBank files: impossible to handle? In-Reply-To: <20051231025437.5979.qmail@web26304.mail.ukl.yahoo.com> References: <20051231025437.5979.qmail@web26304.mail.ukl.yahoo.com> Message-ID: <43BAC8FD.8030703@maubp.freeserve.co.uk> Hans Meier wrote: > Dear friends, > > I tried to handle a .gbk file of 4,7MB in size > with a "700MHz, Pentium III, 256 MB RAM"-box. This might work with the current release (BioPython 1.41) but will use a lot of memory - I would guess about 250MB, which is all your machine has. This is a limitation of the old Martel based parser. You should be able to install the new GenBank parser (from CVS) which I wrote specifically due to problems with large GenBank files. Ask if you need help with this - and are you on Windows or Linux? See also bug 1747, http://bugzilla.open-bio.org/show_bug.cgi?id=1747 > Parsing with "RecordParser" and indexing with "index_file" > crushed the machine in both cases, I had to reboot > (what happens not so often with Debian). I would avoid using index_file on large GenBank files - this still uses Martel and can be rather slow. Also, I strongly suspect your files have a single record each (i.e. only one LOCUS line) in which case there is no need to index them. > My final goal is to access the .gbk file somehow like a database. 
Have you tried using the FeatureParser and then accessing the .features list property of the record? > The alternative would be to use .fna,.faa and .fnn files > and write my own methods. Or stuff all the data in a SQL-database. > But I still hope that Biopython could help. > > Before I spend more time on this, I'd like to ask you: > > With the Biopython tools, is it possible to handle > .gbk files of about 5MB in a reasonable time with > a low- to middle-class desktop computer? If so, how? Using the latest BioPython code it should be easy (see above). Also, these two recent examples might be handy: http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/genbank/ http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/genbank2fasta/ Peter From idoerg at gmail.com Wed Jan 4 00:46:53 2006 From: idoerg at gmail.com (Iddo Friedberg) Date: Wed Jan 4 01:10:58 2006 Subject: [BioPython] E-utils on NCBI site In-Reply-To: <20060103164708.3912.qmail@web31614.mail.mud.yahoo.com> References: <20060103164708.3912.qmail@web31614.mail.mud.yahoo.com> Message-ID: <43BB614D.1000808@burnham.org> Bio.EUtils is the module you are looking for. Look in the tests subdirectory for some examples. Cheers, Iddo Srinivas Iyyer wrote: >Dear group, > there are 2 questions that are important to my >research. I hope group members who have tried would >be willing to help me. > >1. How do I use E-utils using python, bio-python >modules. Are there any examples? > >2. here is the specific that I want to do. I am >interested in downloading all the human affymetrix >data submitted to GEO site. > >for example:GSE2152_RAW_tar is the raw CEL file >package submitted to GEO by the authors. It is in the >directory: >ftp://ftp.ncbi.nih.gov/pub/geo/data/geo/raw_data/series/GSE2152/GSE2152_RAW.tar > >However, checking manually for every dataset that is >submitted to ncbi is pains taking procedure. 
So I want >to be able to write a script that would check for >every dataset submitted to GEO. After that I want to >filter human tar files. > >Has any one did this before. could you please help >me. > >Thanks >Srini > > > > >__________________________________ >Yahoo! for Good - Make a difference this year. >http://brand.yahoo.com/cybergivingweek2005/ >_______________________________________________ >BioPython mailing list - BioPython@biopython.org >http://biopython.org/mailman/listinfo/biopython > > > > -- Iddo Friedberg, Ph.D. Burnham Institute for Medical Research 10901 N. Torrey Pines Rd. La Jolla, CA 92037 Tel: (858) 646 3100 x3516 Fax: (858) 713 9949 http://ffas.ljcrf.edu/~iddo From biopython at maubp.freeserve.co.uk Fri Jan 6 17:44:34 2006 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri Jan 6 17:41:11 2006 Subject: [BioPython] parsing error with GenBank.RecordParser In-Reply-To: <20051230204756.25343.qmail@web26311.mail.ukl.yahoo.com> References: <20051230204756.25343.qmail@web26311.mail.ukl.yahoo.com> Message-ID: <43BEF2D2.1090600@maubp.freeserve.co.uk> Hans Meier wrote: > Hi, > > parsing of NC_000913.gbk does not work. > > Greets, Harald Sorry I didn't reply earlier, I was away for the New Year... From the trackback you provided, I would guess that the old GenBank parser (included with BioPython 1.41) didn't like the double quotes in that note: /note="2'-(5"-phosphoribosyl)-3'-dephospho-CoA... Interestingly enough, in the most recent version of NC_000913.gbk dated Dec 2005 (check the first line, starting LOCUS), the NCBI have switched the double quotes to single quotes in the note (gene citX): /note="2'-(5'-phosphoribosyl)-3'-dephospho-CoA... If you download this revised NC_000913.gbk the problem should go away (but note that as Escherichia coli genbank file is 11 MB you might be better off updating the GenBank parser). 
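[Archive note] For anyone stuck with the older copy of the file, the manual workaround mentioned earlier in this thread (replacing 5" with 5' in the /note field) could be scripted roughly as follows. This is a minimal sketch, not part of Biopython: the function name fix_stray_quotes is made up, and it assumes that the literal text (5"- is the only pattern that needs fixing.

```python
import io


def fix_stray_quotes(lines):
    # Yield each line with the stray double quote after "(5" turned into a
    # single quote. Works line by line, so the whole file never has to be
    # held in memory. Assumption: no other pattern in the file needs fixing.
    for line in lines:
        yield line.replace('(5"-', "(5'-")


# Small in-memory demonstration using part of the note quoted in this thread.
broken = io.StringIO(
    "/note=\"2'-(5\"-phosphoribosyl)-3'-dephospho-CoA transferase;\n"
    "holo-citrate lyase synthase\"\n"
)
fixed = "".join(fix_stray_quotes(broken))
print(fixed)
```

For a real file, an open input handle can be passed in and the yielded lines written to a new file, keeping the original as a backup.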
The new GenBank parser (available in CVS now) should cope with either version of the file (and should use less memory, and be a lot faster too).

To try this, you just need to replace the file /usr/lib/python2.3/site-packages/Bio/GenBank/__init__.py with the latest version (but make a backup of the old one just in case).

Peter

From ziemys.1 at osu.edu Mon Jan 9 09:30:26 2006
From: ziemys.1 at osu.edu (ARTURAS ZIEMYS)
Date: Mon Jan 9 09:45:15 2006
Subject: [BioPython] NeighborSearch (No module named _CKDTree)
Message-ID: <214350b2144a7a.2144a7a214350b@osu.edu>

Hi,

I've got a problem with 'NeighborSearch' (python 2.4.2, Biopython 1.41, Windows XP-SP2). I need to find the nearest atoms for my structure, but I cannot use 'NeighborSearch'. Is there something wrong in the distribution? For example, when I try to import 'NeighborSearch' in a shell:

IDLE 1.1.2      ==== No Subprocess ====
>>> from Bio.PDB.NeighborSearch import NeighborSearch
Traceback (most recent call last):
  File "", line 1, in ?
    from Bio.PDB.NeighborSearch import NeighborSearch
  File "C:\Python24\Lib\site-packages\Bio\PDB\NeighborSearch.py", line 3, in ?
    from Bio.KDTree import *
  File "C:\Python24\Lib\site-packages\Bio\KDTree\__init__.py", line 10, in ?
    from KDTree import KDTree
  File "C:\Python24\Lib\site-packages\Bio\KDTree\KDTree.py", line 17, in ?
    import CKDTree
  File "C:\Python24\Lib\site-packages\Bio\KDTree\CKDTree.py", line 4, in ?
    import _CKDTree
ImportError: No module named _CKDTree
>>>

With best,
Arturas Z.

From idoerg at gmail.com Wed Jan 11 00:13:36 2006
From: idoerg at gmail.com (Iddo Friedberg)
Date: Wed Jan 11 00:17:27 2006
Subject: [BioPython] BLAST XML problem?
Message-ID: <43C49400.1070802@burnham.org>

Not sure what we're doing wrong here...

Using the cookbook example, biopython 1.41, python 2.2 (our Zope needs that Python version, sorry):

    from Bio.Blast import NCBIXML

    b_parser = NCBIXML.BlastParser()
    b_record = b_parser.parse(blast_out)

Breaks on "Alejandro Sch?efer", in the XML tag. The ?
seems to cause the error. Replace it with a regular "a" everything is hunky-dory Huh? [idoerg@hotdog:~/results/jafa]> ./try_blast.py star_human.fasta /home/idoerg/biopy_cvs/biopython/Bio/Blast/NCBIWWW.py:1070: UserWarning: qblast works only with blastn and blastp for now. warnings.warn("qblast works only with blastn and blastp for now.") Traceback (most recent call last): File "./try_blast.py", line 19, in ? b_record = b_parser.parse(open('my_blast.out')) File "/home/idoerg/biopy_cvs/biopython/Bio/Blast/NCBIXML.py", line 112, in parse self._parser.parse(handler) File "/usr/lib/python2.3/xml/sax/expatreader.py", line 107, in parse xmlreader.IncrementalParser.parse(self, source) File "/usr/lib/python2.3/xml/sax/xmlreader.py", line 123, in parse self.feed(buffer) File "/usr/lib/python2.3/xml/sax/expatreader.py", line 211, in feed self._err_handler.fatalError(exc) File "/usr/lib/python2.3/xml/sax/handler.py", line 38, in fatalError raise exception xml.sax._exceptions.SAXParseException: my_blast.out:6:81: not well-formed (invalid token) Iddo -- Iddo Friedberg, Ph.D. Burnham Institute for Medical Research 10901 N. Torrey Pines Rd. La Jolla, CA 92037 Tel: (858) 646 3100 x3516 Fax: (858) 713 9949 http://iddo-friedberg.org From idoerg at gmail.com Wed Jan 11 00:27:16 2006 From: idoerg at gmail.com (Iddo Friedberg) Date: Wed Jan 11 01:21:53 2006 Subject: [BioPython] BLAST XML problem? In-Reply-To: <43C49400.1070802@burnham.org> References: <43C49400.1070802@burnham.org> Message-ID: <43C49734.7050409@burnham.org> Slight correction to my previous email: using biopython from CVS, and python 2.3 as you can see from the stack dump Iddo Friedberg wrote: > Not sure what we're doing wrong here... > > Using the cookbook example, biopython 1.41, python 2.2 (our Zope needs > that Python version, sorry): > > from Bio.Blast import NCBIXML > > b_parser = NCBIXML.BlastParser() > b_record = b_parser.parse(blast_out) > > > Breaks on "Alejandro Sch?efer", in the XML > tag. The ? 
seems to cause the error. Replace it with a regular "a" > everything is hunky-dory > > Huh? > > [idoerg@hotdog:~/results/jafa]> ./try_blast.py star_human.fasta > /home/idoerg/biopy_cvs/biopython/Bio/Blast/NCBIWWW.py:1070: > UserWarning: qblast works only with blastn and blastp for now. > warnings.warn("qblast works only with blastn and blastp for now.") > Traceback (most recent call last): > File "./try_blast.py", line 19, in ? > b_record = b_parser.parse(open('my_blast.out')) > File "/home/idoerg/biopy_cvs/biopython/Bio/Blast/NCBIXML.py", line > 112, in parse > self._parser.parse(handler) > File "/usr/lib/python2.3/xml/sax/expatreader.py", line 107, in parse > xmlreader.IncrementalParser.parse(self, source) > File "/usr/lib/python2.3/xml/sax/xmlreader.py", line 123, in parse > self.feed(buffer) > File "/usr/lib/python2.3/xml/sax/expatreader.py", line 211, in feed > self._err_handler.fatalError(exc) > File "/usr/lib/python2.3/xml/sax/handler.py", line 38, in fatalError > raise exception > xml.sax._exceptions.SAXParseException: my_blast.out:6:81: not > well-formed (invalid token) > > > > Iddo > -- Iddo Friedberg, Ph.D. Burnham Institute for Medical Research 10901 N. Torrey Pines Rd. La Jolla, CA 92037 Tel: (858) 646 3100 x3516 Fax: (858) 713 9949 http://iddo-friedberg.org From biopython at maubp.freeserve.co.uk Wed Jan 11 06:48:33 2006 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed Jan 11 07:22:44 2006 Subject: [BioPython] BLAST XML problem? In-Reply-To: <43C49734.7050409@burnham.org> References: <43C49400.1070802@burnham.org> <43C49734.7050409@burnham.org> Message-ID: <43C4F091.2020408@maubp.freeserve.co.uk> Iddo Friedberg wrote: > Slight correction to my previous email: using biopython from CVS, and > python 2.3 as you can see from the stack dump > > Iddo Friedberg wrote: > >> Not sure what we're doing wrong here... 
>> >> Using the cookbook example, biopython 1.41, python 2.2 (our Zope needs >> that Python version, sorry): >> >> from Bio.Blast import NCBIXML >> >> b_parser = NCBIXML.BlastParser() >> b_record = b_parser.parse(blast_out) >> >> >> Breaks on "Alejandro Sch?efer", in the XML >> tag. The ? seems to cause the error. Replace it with a regular "a" >> everything is hunky-dory Is the lower-case a with umlaut in the XML file as ?, or using an encoding like ä or ä instead? (ampersand characters, aka character entities) Also, what character set does the blast_out XML file claim to be in? And does that fit with the inclusion of an a-umlaut as a character? It may be the NCBI's fault for producing a bad XML file... Peter From idoerg at burnham.org Wed Jan 11 12:08:01 2006 From: idoerg at burnham.org (Iddo Friedberg) Date: Wed Jan 11 12:17:13 2006 Subject: [BioPython] BLAST XML problem? In-Reply-To: <43C4F091.2020408@maubp.freeserve.co.uk> References: <43C49400.1070802@burnham.org> <43C49734.7050409@burnham.org> <43C4F091.2020408@maubp.freeserve.co.uk> Message-ID: <43C53B71.5080705@burnham.org> Peter wrote: > Iddo Friedberg wrote: > >> Slight correction to my previous email: using biopython from CVS, and >> python 2.3 as you can see from the stack dump >> >> Iddo Friedberg wrote: >> >>> Not sure what we're doing wrong here... >>> >>> Using the cookbook example, biopython 1.41, python 2.2 (our Zope >>> needs that Python version, sorry): >>> >>> from Bio.Blast import NCBIXML >>> >>> b_parser = NCBIXML.BlastParser() >>> b_record = b_parser.parse(blast_out) >>> >>> >>> Breaks on "Alejandro Sch?ffer", in the XML >>> tag. The ? seems to cause the error. Replace it with a regular "a" >>> everything is hunky-dory >> > > Is the lower-case a with umlaut in the XML file as ?, or using an > encoding like ä or ä instead? (ampersand characters, aka > character entities) It's an ? not a character entity. > > Also, what character set does the blast_out XML file claim to be in? 
> And does that fit with the inclusion of an a-umlaut as a character? I haven't the foggiest... :) > > It may be the NCBI's fault for producing a bad XML file... > Yeah, well, I still have to deal with it :( In any case, why is this cropping up now? Sch?ffer has been in NCBI for years... The file is available at http://iddo-friedberg.org/biopy_bad_blast.xml in case anyone wants to have a look-see. Thanks, Iddo -- Iddo Friedberg, Ph.D. Burnham Institute for Medical Research 10901 N. Torrey Pines Rd. La Jolla, CA 92037 USA Tel: +1 (858) 646 3100 x3516 Fax: +1 (858) 713 9949 http://iddo-friedberg.org From biopython at maubp.freeserve.co.uk Wed Jan 11 13:39:46 2006 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed Jan 11 13:43:45 2006 Subject: [BioPython] BLAST XML problem? In-Reply-To: <43C53B71.5080705@burnham.org> References: <43C49400.1070802@burnham.org> <43C49734.7050409@burnham.org> <43C4F091.2020408@maubp.freeserve.co.uk> <43C53B71.5080705@burnham.org> Message-ID: <43C550F2.5030806@maubp.freeserve.co.uk> >> It may be the NCBI's fault for producing a bad XML file... > > Yeah, well, I still have to deal with it :( In any case, why is this > cropping up now? Sch?ffer has been in NCBI for years... I would guess because BioPython users would have parsed the plain text output from blast, rather than XML. > The file is available at http://iddo-friedberg.org/biopy_bad_blast.xml > > in case anyone wants to have a look-see. The first line of the XML file could (should?) define an encoding, e.g. or: Instead its just: Short term solutions which I have just tried and got to work: (1) Edit the offending character by hand (as you did) (2) Specify encoding="ISO-8859-1" by editing the first line by hand (2) Covert the file to unicode (doubles the size) BTW - Are you getting the file from standalone blast, or the NCBI website? Unless a local XML expert steps up, would you like to contact the NCBI on this issue? 
Peter

From sbassi at gmail.com Wed Jan 11 14:01:07 2006
From: sbassi at gmail.com (Sebastian Bassi)
Date: Wed Jan 11 14:25:40 2006
Subject: [BioPython] BLAST XML problem?
In-Reply-To: <43C550F2.5030806@maubp.freeserve.co.uk>
References: <43C49400.1070802@burnham.org> <43C49734.7050409@burnham.org> <43C4F091.2020408@maubp.freeserve.co.uk> <43C53B71.5080705@burnham.org> <43C550F2.5030806@maubp.freeserve.co.uk>
Message-ID:

On 1/11/06, Peter wrote:
>
> Instead its just:
> <?xml version="1.0"?>
>
> Short term solutions which I have just tried and got to work:
> (1) Edit the offending character by hand (as you did)
> (2) Specify encoding="ISO-8859-1" by editing the first line by hand
> (2) Covert the file to unicode (doubles the size)

I have a 4th solution that doesn't involve XML editing, so it will "fix" the problem for other users:

4) Change Biopython or the XML parser to assume encoding = ISO-8859-1 when there is no encoding information.

I wonder if <?xml version="1.0"?> is a W3C-valid first line for an XML file. If this is OK (from the point of view of the XML standard), then the parser should be corrected; if not, according to the standard, the file should be rejected for non-compliance (this is not HTML, where the browser client can accept and correct invalid code; the specification states that XML should validate before being used).

--
The web without popups or spyware: use Firefox instead of Internet Explorer

From tharder at burnham.org Wed Jan 11 15:07:30 2006
From: tharder at burnham.org (Tim Harder)
Date: Wed Jan 11 15:07:08 2006
Subject: [Fwd: Re: [BioPython] BLAST XML problem?]
Message-ID: <43C56582.5050809@burnham.org>

XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'

(source http://www.w3.org/TR/2004/REC-xml-20040204/#NT-XMLDecl)

As far as I understand that definition, the encoding attribute is optional, so the NCBI file should be OK from the XML point of view.

Anyway, how can I tell SAX which encoding table to use, besides editing the XML file itself?
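[Archive note] One possible answer to the question above, sketched with the standard library's xml.sax: an InputSource can carry an explicit encoding, which the expat-based reader hands to the underlying parser so that it overrides the missing declaration. This is a hedged sketch, not what Bio.Blast.NCBIXML does; the helper name parse_with_encoding, the TextCollector class and the one-line sample document are made up for illustration.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import InputSource


class TextCollector(ContentHandler):
    """Collects all character data so the decoded text can be inspected."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def characters(self, content):
        self.parts.append(content)


def parse_with_encoding(xml_bytes, encoding):
    # Wrap the raw bytes in an InputSource and state the encoding explicitly,
    # instead of editing the first line of the XML file.
    source = InputSource()
    source.setByteStream(io.BytesIO(xml_bytes))
    source.setEncoding(encoding)

    handler = TextCollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.parse(source)
    return "".join(handler.parts)


# A Latin-1 encoded document whose declaration names no encoding.
RAW = b'<?xml version="1.0"?><author>Alejandro Sch\xe4ffer</author>'
print(parse_with_encoding(RAW, "ISO-8859-1"))
```

The same InputSource could be passed to any SAX-based parser that accepts one. Whether Biopython's BLAST XML parser exposes a hook for this is not established in this thread.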
Tim Sebastian Bassi wrote: >On 1/11/06, Peter wrote: > > >> >>Instead its just: >> >>Short term solutions which I have just tried and got to work: >>(1) Edit the offending character by hand (as you did) >>(2) Specify encoding="ISO-8859-1" by editing the first line by hand >>(2) Covert the file to unicode (doubles the size) >> >> > >I have a 4th solution, that doesn't involve XML editing, so it will >"fix" the problem for other users: >4) Change Biopython or XML parser to assume encoding = ISO-8859-1 when >there is no encoding information. >I wonder if is a W3C valid first line for a XML >file. If this is OK (from the point of view of the XML standard), then >the parser should be corrected, if not, according to the standard, the >file should be rejected for non compliance (this is not HTML where the >browser client can accept and correct invalid code, the specifications >states that XML should validate before being used). > >-- >La >web sin popups ni spyware: Usa Firefox en lugar de Internet >Explorer > >_______________________________________________ >BioPython mailing list - BioPython@biopython.org >http://biopython.org/mailman/listinfo/biopython > > > > From biopython at maubp.freeserve.co.uk Wed Jan 11 15:13:42 2006 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed Jan 11 15:10:30 2006 Subject: [BioPython] BLAST XML problem? 
In-Reply-To: References: <43C49400.1070802@burnham.org> <43C49734.7050409@burnham.org> <43C4F091.2020408@maubp.freeserve.co.uk> <43C53B71.5080705@burnham.org> <43C550F2.5030806@maubp.freeserve.co.uk> Message-ID: <43C566F6.2010203@maubp.freeserve.co.uk> Sebastian Bassi wrote: > On 1/11/06, Peter wrote: > >> >>Instead its just: >> >>Short term solutions which I have just tried and got to work: >>(1) Edit the offending character by hand (as you did) >>(2) Specify encoding="ISO-8859-1" by editing the first line by hand >>(2) Covert the file to unicode (doubles the size) > > > I have a 4th solution, that doesn't involve XML editing, so it will > "fix" the problem for other users: > 4) Change Biopython or XML parser to assume encoding = ISO-8859-1 when > there is no encoding information. Well yes, that did cross my mind. I even went off to try and find out how to do this, but failed. Any ideas? > I wonder if is a W3C valid first line for a XML > file. If this is OK (from the point of view of the XML standard), then > the parser should be corrected, if not, according to the standard, the > file should be rejected for non compliance You sound like you know a lot more about XML than I do, would you be able to find out one way or the other? This would be useful information for trying to get the NCBI to make a change. Iddo's bad file is fine, according to www.xmlvalidation.com (cut and pasting). The NCBI DTD files are here: http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd http://www.ncbi.nlm.nih.gov/dtd/NCBI_Entity.mod.dtd http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.mod.dtd I think this does mean declaring the encoding may be optional, but this validation program could identify the encoding on its own. > (this is not HTML where the > browser client can accept and correct invalid code, the specifications > states that XML should validate before being used). Which is good, unless you are trying to deal with bad XML produced by a third party. 
I'm sure the NCBI will fix this, if it is their problem. It just might take a while. Peter From idoerg at gmail.com Wed Jan 11 15:36:31 2006 From: idoerg at gmail.com (Iddo Friedberg) Date: Wed Jan 11 17:23:42 2006 Subject: [BioPython] BLAST XML problem? In-Reply-To: References: <43C49400.1070802@burnham.org> <43C49734.7050409@burnham.org> <43C4F091.2020408@maubp.freeserve.co.uk> <43C53B71.5080705@burnham.org> <43C550F2.5030806@maubp.freeserve.co.uk> Message-ID: <43C56C4F.3080003@burnham.org> Sebastian Bassi wrote: >On 1/11/06, Peter wrote: > > >> >>Instead its just: >> >>Short term solutions which I have just tried and got to work: >>(1) Edit the offending character by hand (as you did) >>(2) Specify encoding="ISO-8859-1" by editing the first line by hand >>(2) Covert the file to unicode (doubles the size) >> >> > >I have a 4th solution, that doesn't involve XML editing, so it will >"fix" the problem for other users: >4) Change Biopython or XML parser to assume encoding = ISO-8859-1 when >there is no encoding information. > > OK, I was actually going to do this. I found a bit of code that will detect file encoding from the first two bytes. I was planning to put the return value into the BLAST XML parser. http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/363841 If this would not have worked, I would have force-plugged the ISO-8859-1 But... When I generated a new XML file from NCBI to test the encoding-detection module, the code used for the ? actually changed! Everything works now. So... there are there biopython fans with a (very) quick response time in NCBI? Spooky... > I wonder if is a W3C valid first line for a XML > file. If this is OK (from the point of view of the XML standard), then > the parser should be corrected, if not, according to the standard, the > file should be rejected for non compliance (this is not HTML where the > browser client can accept and correct invalid code, the specifications > states that XML should validate before being used). 
I believe that the default is UTF-8, and that is valid. ./I -- Iddo Friedberg, Ph.D. Burnham Institute for Medical Research 10901 N. Torrey Pines Rd. La Jolla, CA 92037 Tel: (858) 646 3100 x3516 Fax: (858) 713 9949 http://iddo-friedberg.org From karbak at gmail.com Wed Jan 11 02:41:05 2006 From: karbak at gmail.com (K. Arun) Date: Thu Jan 12 05:18:10 2006 Subject: [BioPython] NeighborSearch (No module named _CKDTree) In-Reply-To: <214350b2144a7a.2144a7a214350b@osu.edu> References: <214350b2144a7a.2144a7a214350b@osu.edu> Message-ID: <162452a10601102341j4d64e729g333a97279d6243@mail.gmail.com> On 1/9/06, ARTURAS ZIEMYS wrote: > I've got a problem with 'NeighborSearch' (python 2.4.2, Biopython 1.41, Windows XP-SP2). I [...] > import _CKDTree > ImportError: No module named _CKDTree If you examine the setup.py used to compile Biopython, you'll find two places where comments indicate that the KDTree extension's compilation is turned off by default to avoid C++ errors. If I remember correctly, all I had to do was uncomment those two sections and run 'python setup.py install' again to get the module working. -arun From sbassi at gmail.com Thu Jan 12 08:17:43 2006 From: sbassi at gmail.com (Sebastian Bassi) Date: Thu Jan 12 08:40:20 2006 Subject: [BioPython] NeighborSearch (No module named _CKDTree) In-Reply-To: <162452a10601102341j4d64e729g333a97279d6243@mail.gmail.com> References: <214350b2144a7a.2144a7a214350b@osu.edu> <162452a10601102341j4d64e729g333a97279d6243@mail.gmail.com> Message-ID: On 1/11/06, K. Arun wrote: > If you examine the setup.py used to compile Biopython, you'll find two > places where comments > indicate that the KDTree extension's compilation is turned off by > default to avoid C++ errors.
If I remember correctly, all I had to do > was uncomment those two sections and run 'python setup.py install' > again to get the module working. Another thing to be aware of: I had problems compiling Biopython on an AMD64 machine because of this KDTree extension; I guess it was not turned off at that time (two years ago). I haven't tried again with KDTree turned off since I no longer have that computer, so this is just a warning if you have an x86-64 computer. -- The web without popups or spyware: use Firefox instead of Internet Explorer From manuel at pinguinkiste.de Thu Jan 12 14:50:07 2006 From: manuel at pinguinkiste.de (Manuel Prinz) Date: Thu Jan 12 15:12:24 2006 Subject: [Fwd: Re: [BioPython] BLAST XML problem?] In-Reply-To: <43C56582.5050809@burnham.org> References: <43C56582.5050809@burnham.org> Message-ID: <1137095407.4672.37.camel@woodstock> > XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>' > > (source http://www.w3.org/TR/2004/REC-xml-20040204/#NT-XMLDecl) > > As far as I understand that definition, the encoding attribute is > optional, so the NCBI File should be ok from the XML point of view. This is not totally right. The encoding declaration is optional only if the content is proper UTF-8 (or UTF-16), or if the encoding can be obtained from a higher-level source such as a MIME type, which does not apply to a plain file. The standard says this (in "4.3.3 Character Encoding in Entities"): "In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration." That section also says that processors MUST be able to read UTF-8 and UTF-16 and MAY support others.
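The rule quoted above can be checked mechanically before a file is handed to a parser. What follows is only a minimal sketch in current Python 3, not anything taken from Biopython or from the recipe mentioned earlier in the thread; the function name undeclared_non_utf8 is made up for illustration, and the check looks only at the start of the data, which is assumed to hold the whole file as bytes.

```python
import codecs
import re

# Byte order marks that let a processor recognise UTF-8 or UTF-16 input.
_BOMS = (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# An encoding="..." pseudo-attribute inside an XML declaration at the very start.
_DECL_ENCODING = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def undeclared_non_utf8(data):
    """Return True if data has no BOM, no encoding declaration, and is not valid UTF-8.

    This is the case section 4.3.3 calls a fatal error for a plain file.
    (Hypothetical helper, for illustration only.)
    """
    if data.startswith(_BOMS):
        return False  # encoding is signalled by the byte order mark
    if _DECL_ENCODING.match(data):
        return False  # encoding is declared; the parser will use it
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True   # undeclared and not UTF-8
    return False


# A Latin-1 a-umlaut (0xE4) in a file that declares no encoding.
sample = b'<?xml version="1.0"?><name>Sch\xe4ffer</name>'
print(undeclared_non_utf8(sample))
```

When the function returns True, the file can be repaired either by declaring its real encoding in the first line or by transcoding it, for example with data.decode("iso-8859-1").encode("utf-8") if the content is known to be ISO-8859-1.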
The standard further states the following: "It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process. It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding. Specifically, it is a fatal error if an entity encoded in UTF-8 contains any irregular code unit sequences, as defined in Unicode 3.1 [Unicode3]. Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16." So the BioPython parser has to reject the XML file (since it is/was not proper UTF-8 or UTF-16) to meet the standard. Auto-detecting encodings is a nice feature, but from the processor's point of view it is only useful for checking whether the declared encoding matches the real one, as far as the standard is concerned. > Anyway, how can I tell SAX which encoding table to use, beside editing > the XML file itself? Since SAX is standard compliant AFAIK, there probably isn't a way. Either convert your files to UTF-8 or declare the character encoding. (iconv is a great tool to convert between different encodings.) With kind regards, Manuel From weisman at lydon.com Mon Jan 16 10:50:26 2006 From: weisman at lydon.com (David Weisman) Date: Mon Jan 16 12:44:27 2006 Subject: [BioPython] NCBIXML for multiple queries Message-ID: <43CBC0C2.10800@lydon.com> Hello, I tried using NCBIXML parsing on a local blast run, in which the input had multiple query sequences. Blastall writes multiple XML documents to the output file, and the SAX parser threw a SAXParseException on the second declaration, complaining of junk after the document element. I couldn't find an obvious workaround, so I wrote a Python generator function that returns a new file handle (based on a cStringIO) for each XML document in the stream.
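David's module itself is not included in the thread. The idea he describes can be sketched roughly as below, in current Python 3 rather than the Python 2.4 of the time. The function name split_xml_documents is hypothetical, and the sketch assumes that every document in the stream begins with an XML declaration at the start of a line.

```python
import io

XML_DECL = "<?xml"  # assumed to open every document, at the start of a line


def split_xml_documents(handle):
    """Yield one in-memory text handle per XML document found in handle.

    Hypothetical helper: lines are collected until the next declaration
    is seen, then handed out as a separate io.StringIO object.
    """
    chunk = []
    for line in handle:
        # A declaration seen after some content marks the next document.
        if line.startswith(XML_DECL) and chunk:
            yield io.StringIO("".join(chunk))
            chunk = []
        chunk.append(line)
    if chunk:
        yield io.StringIO("".join(chunk))


# Two tiny documents written back to back, as in a multi-query run.
stream = io.StringIO(
    '<?xml version="1.0"?>\n<a/>\n'
    '<?xml version="1.0"?>\n<b/>\n'
)
for number, doc in enumerate(split_xml_documents(stream), start=1):
    print(number, repr(doc.read()))
```

Because each yielded handle holds exactly one document, it can be passed to a parser that expects a single well-formed XML file. Holding one document at a time in memory is a deliberate trade-off: it keeps the code simple at the cost of memory for very large individual results.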
The usage model is:

    import xmlStreamSeparator  # new

    blastInFile = open(blastInPath, "r")  # composite blast output
    x_gen = xmlStreamSeparator.getXmlDoc(blastInFile)
    x_doc = x_gen.next()
    while not xmlStreamSeparator.xmlStreamEOF(x_doc):
        iter = NCBIStandalone.Iterator(x_doc, NCBIXML.BlastParser())
        for b_rec in iter:
            process blast record...
        x_doc = x_gen.next()  # get next xml doc from stream

Any pointers to a better model? Many thanks for any tips. Regards, David From OmenkeukwuG at americanimaging.net Mon Jan 16 13:05:27 2006 From: OmenkeukwuG at americanimaging.net (Omenkeukwu, Gregory) Date: Mon Jan 16 13:01:01 2006 Subject: [BioPython] Blast Error Message-ID: <2B8B6630ACA40940AABCF63158986A311CB140@XCHSRV02.DEERFIELD.AIM.local> > I get the following error when I run the BLAST code in the tutorial. I will appreciate any help I can get on this issue. Thanks > > > > Warning (from warnings module): > File "C:\Python24\lib\site-packages\Bio\Blast\NCBIWWW.py", line 1070 > warnings.warn("qblast works only with blastn and blastp for now.") > UserWarning: qblast works only with blastn and blastp for now.
> > Traceback (most recent call last): > File "C:\Python24\biopython examples\blast_example.py", line 10, in -toplevel- > result_handle = NCBIWWW.qblast('blastn', 'nr', f_record) > File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIWWW.py", line 1092, in qblast > rid, rtoe = _parse_qblast_ref_page(handle) > File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIWWW.py", line 1177, in _parse_qblast_ref_page > return rid, int(rtoe) > ValueError: invalid literal for int(): 1.0 400 We don't support 0.9 > > > > > > > just for reference, the code I am running is listed below > > from Bio import Fasta > > file_for_blast = open('m_cold.fasta', 'r') > f_iterator = Fasta.Iterator(file_for_blast) > > f_record = f_iterator.next() > > > from Bio.Blast import NCBIWWW > result_handle = NCBIWWW.qblast('blastn', 'nr', f_record) > > > # save the results for later, in case we want to look at it > save_file = open('my_blast.out', 'w') > blast_results = result_handle.read() > save_file.write(blast_results) > save_file.close() > > import cStringIO > blast_out = cStringIO.StringIO(blast_results) > > > blast_out = open('my_blast.out', 'r') > > > from Bio.Blast import NCBIXML > > b_parser = NCBIXML.BlastParser() > b_record = b_parser.parse(blast_out) > > E_VALUE_THRESH = 0.04 > > for alignment in b_record.alignments: > for hsp in alignment.hsps: > if hsp.expect < E_VALUE_THRESH: > print '****Alignment****' > print 'sequence:', alignment.title > print 'length:', alignment.length > print 'e value:', hsp.expect > print hsp.query[0:75] + '...' > print hsp.match[0:75] + '...' > print hsp.sbjct[0:75] + '...' > > > Gregory Omenkeukwu > Provider Information Management > ====================================================================== The material in this transmission contains confidential information intended for the addressee. If you are not the addressee, any disclosure or use of this information by you is strictly prohibited. 
If you have received this transmission in error, please delete it and destroy all copies. Notify American Imaging Management at 847 564-8500. Thank You. ====================================================================== From as_nascimento at yahoo.com.br Mon Jan 16 20:29:02 2006 From: as_nascimento at yahoo.com.br (Alessandro S. Nascimento) Date: Mon Jan 16 20:34:36 2006 Subject: [BioPython] problems when parsing blast output Message-ID: <43CC485E.7050702@yahoo.com.br> Hi all, I am trying to write something very simple to parse a very large blast output file. But when I try something as described in the web cookbook I get the following error message:

    asn@frodo:~/fool/programming/python$ ./teste_asn.py
    Traceback (most recent call last):
      File "./teste_asn.py", line 10, in ?
        b_record = b_iterator.next()
      File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIStandalone.py", line 1342, in next
        return self._parser.parse(File.StringHandle(data))
      File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIStandalone.py", line 567, in parse
        self._scanner.feed(handle, self._consumer)
      File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIStandalone.py", line 95, in feed
        self._scan_header(uhandle, consumer)
      File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIStandalone.py", line 127, in _scan_header
        read_and_call(uhandle, consumer.query_info, start='Query=')
      File "/usr/lib/python2.4/site-packages/Bio/ParserSupport.py", line 300, in read_and_call
        raise SyntaxError, errmsg
    SyntaxError: Line does not start with 'Query=':
    Reference for composition-based statistics:

Does anyone have any idea? Thanks so much Alessandro From biopython at maubp.freeserve.co.uk Tue Jan 17 05:28:36 2006 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue Jan 17 05:52:36 2006 Subject: [BioPython] problems when parsing blast output In-Reply-To: <43CC485E.7050702@yahoo.com.br> References: <43CC485E.7050702@yahoo.com.br> Message-ID: <43CCC6D4.4020307@maubp.freeserve.co.uk> Alessandro S.
Nascimento wrote: > Hi all, > > I am trying to write something very simple to parse a very large blast > output file. But when I try something as described in the web cookbook I > get the following error message: ... > SyntaxError: Line does not start with 'Query=': > Reference for composition-based statistics: > > Does anyone have any idea? I don't remember seeing a reference line for "composition-based statistics" before. Could you send us the command line you are using (i.e. what options did you give to BLASTALL)? We would probably also need to see the output file. If it is very large, could you create a smaller one (e.g. from a different input) which shows the same problem? If you like, you could submit a bug report, and then attach the blast output file to it (this saves emailing a large file to everyone on the list). It looks like you are using Linux. We would also like to know which version of BioPython you are using (1.41 maybe?). Thank you Peter From biopython at maubp.freeserve.co.uk Tue Jan 17 06:25:42 2006 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue Jan 17 06:42:00 2006 Subject: [BioPython] problems when parsing blast output In-Reply-To: <43CCCF56.40803@yahoo.com.br> References: <43CC485E.7050702@yahoo.com.br> <43CCC6D4.4020307@maubp.freeserve.co.uk> <43CCCF56.40803@yahoo.com.br> Message-ID: <43CCD436.7020704@maubp.freeserve.co.uk> OK, thanks for the extra information Alessandro. It looks like the current BLAST parser doesn't like the current blastpgp output. A quick Google suggests that it used to work; my guess is the NCBI recently changed the format to add this extra reference: Reference for composition-based statistics: Schaffer, Alejandro A., L. Aravaind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001), "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements", Nucleic Acids Res. 29:2994-3005.
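The reference block quoted above is plain text, so it can also be filtered out of the BLAST output before the file reaches the parser. A minimal sketch in current Python 3 follows; the helper name drop_composition_reference is hypothetical, and the sketch assumes that the block starts with the marker line and is closed by a blank line, which is how the quoted output appears to be laid out.

```python
MARKER = "Reference for composition-based statistics:"


def drop_composition_reference(lines):
    """Yield lines, skipping the marker line, the citation, and its closing blank line.

    Hypothetical helper; it assumes a blank line terminates the block.
    """
    skipping = False
    for line in lines:
        if line.startswith(MARKER):
            skipping = True
            continue
        if skipping:
            if not line.strip():
                skipping = False  # the blank line closes the citation; drop it too
            continue
        yield line


# Made-up miniature of the layout, only to show what gets removed.
example = [
    "header line\n",
    "\n",
    "Reference for composition-based statistics:\n",
    "Schaffer, Alejandro A., ...\n",
    "\n",
    "Query= example\n",
]
print("".join(drop_composition_reference(example)))
```

The generator works on any iterable of lines, so an open file can be passed in directly and the filtered lines written to a new file for parsing.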
If I delete this reference from the blast.output file you sent me, then your example code works fine. If you are running the BLAST search separately and then parsing the output in Python, this short-term fix should get you up and running. You could also try getting BLAST to produce XML output. There was a recent post on the list where someone was having problems with that and multiple inputs, along with a suggested way to cope. I have logged a bug for this issue (and attached your test file to it): http://bugzilla.open-bio.org/show_bug.cgi?id=1929 Hopefully someone will tackle this soon - I'm off sick today, and should really be resting. Peter Alessandro S. Nascimento wrote: > Hi Peter, > > as you will see in the attached script file, I tried to parse my blast > output in the two ways described in the biopython cookbook. > I'm using linux Kubuntu, python 2.4.2. I'm not completely sure about my > biopython version, because it was installed from the debian repositories > through apt-get, but it seems to be version 1.30. > > I also ran the blastpgp search separately using the parameters "blastpgp -i > seqinput -o blast.output -j 50 -v 10000 -b 10000 -d ../db/nr -h 0.001". > A smaller blast result which gives me the same error from my python > script is also attached. > > My aim is to get a large number of sequences using blastpgp, filter > them by length and identities (e.g. > 30 and < 90), compare the > results with one another using blast2seq and align them using clustalw for > statistical analysis. I have tried to do it using bioperl, but hit some > bugs when working with a large number of sequences. So I am trying > python now. This should be something quite simple. (I guess) > > Any help will be much appreciated!
> > Thank you so much, > > > Alessandro From OmenkeukwuG at americanimaging.net Wed Jan 18 14:54:47 2006 From: OmenkeukwuG at americanimaging.net (Omenkeukwu, Gregory) Date: Wed Jan 18 14:50:24 2006 Subject: [BioPython] Qblast problem Message-ID: <2B8B6630ACA40940AABCF63158986A311CB162@XCHSRV02.DEERFIELD.AIM.local> I am new to Biopython and I am experiencing a little problem. I am running the Blast over the internet example and I keep getting stuck when I invoke the qblast function in NCBIWWW.py. Has anyone ever dealt with the problem below? Every time I run this code I get Invalid literal error for int(). I will appreciate any response thanks. >>> result_handle = NCBIWWW.qblast('blastn', 'nr', f_record) Traceback (most recent call last): File "", line 1, in -toplevel- result_handle = NCBIWWW.qblast('blastn', 'nr', f_record) File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIWWW.py", line 1092, in qblast rid, rtoe = _parse_qblast_ref_page(handle) File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIWWW.py", line 1177, in _parse_qblast_ref_page return rid, int(rtoe) ValueError: invalid literal for int(): 1.0 400 We don't support 0.9 Gregory Omenkeukwu Provider Information Management ====================================================================== The material in this transmission contains confidential information intended for the addressee. If you are not the addressee, any disclosure or use of this information by you is strictly prohibited. If you have received this transmission in error, please delete it and destroy all copies. Notify American Imaging Management at 847 564-8500. Thank You. 
====================================================================== From biopython at maubp.freeserve.co.uk Thu Jan 19 05:12:50 2006 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu Jan 19 05:09:01 2006 Subject: [BioPython] Qblast problem In-Reply-To: <2B8B6630ACA40940AABCF63158986A311CB162@XCHSRV02.DEERFIELD.AIM.local> References: <2B8B6630ACA40940AABCF63158986A311CB162@XCHSRV02.DEERFIELD.AIM.local> Message-ID: <43CF6622.6090607@maubp.freeserve.co.uk> Omenkeukwu, Gregory wrote: > I am new to Biopython and I am experiencing a little problem. I am > running the Blast over the internet example and I keep getting stuck > when I invoke the qblast function in NCBIWWW.py. Has anyone ever > dealt with the problem below? Every time I run this code I get > Invalid literal error for int(). I will appreciate any response > thanks. Do you have an example XML blast output file that we could use to recreate the problem? Ideally could log a bug for us, and then attach the XML file to the bug? 
Peter From OmenkeukwuG at americanimaging.net Thu Jan 19 10:07:52 2006 From: OmenkeukwuG at americanimaging.net (Omenkeukwu, Gregory) Date: Thu Jan 19 10:04:24 2006 Subject: [BioPython] Qblast problem Message-ID: <2B8B6630ACA40940AABCF63158986A311CB169@XCHSRV02.DEERFIELD.AIM.local> Here is the FASTA file (m_cold.fasta) and below it is the code I am running >gi|8332116|gb|BE037100.1|BE037100 MP14H09 MP Mesembryanthemum crystallinum cDNA 5' similar to cold acclimation protein, mRNA sequence CACTAGTACTCGAGCGTNCTGCACCAATTCGGCACGAGCAAGTGACTACGTTNTGTGAACAGAAAATGGG GAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATATC AATGAGCTTAAAATGGCAACAATGAGGCTCATCAATGATGCTAGTATGCTCGGTCATTACGGGTTTGGCA CTCATTTCCTCAAATGGCTCGCCTGCCTTGCGGCTATTTACTTGTTGATATTGGATCGAACAAACTGGAG AACCAACATGCTCACGTCACTTTTAGTCCCTTACATATTCCTCAGTCTTCCATCCGGGCCATTTCATCTG TTCAGAGGCGAGGTCGGGAAATGGATTGCCATCATTGCAGTCGTGTTAAGGCTGTTCTTCAACCGGCATT TCCCAGTTTGGCTGGAAATGCCTGGATCGTTGATACTCCTCCTGGTGGTGGCACCAGACTTCTTTACACA CAAAGTGAAGGAGAGCTGGATCGGAATTGCAATTATGATAGCGATAGGGTGTCACCTGATGCAAGAACAT ATCAGAGCCACTGGTGGCTTTTGGAATTCCTTCACACAGAGCCACGGAACTTTTAACACAATTGGGCTTA TCCTTCTACTGGCTTACCCTGTCTGTTTATGGTCATCTTCATGATGTAGTAGCTTAGTCTTGATCCTAAT CCTCAAATNTACTTTTCCAGCTCTTTCGACGCTCTTGCTAAAGCCCATTCAATTCGCCCCATATTTCGCA CACATTCATTTCACCACCCAATACGTGCTCTCCTTCTCCCTCTCTCCCTCTCCTCCCTCTTTTCTTCCTC TCACTTCTCTTCTCTTCTCTTCTTCAATACTCCCCTGGAGCGCCCTCTTCACCTCCCTACTCTCTACTCC TCTCTCTCACTCTCTCTTCCTCTCTTATCTCTCTCCTCCTCTCCTTCTCATCCCTCCTCCTTCTCTTCCT TTTCTTCTTTCTATCCACGCGCCATCCTCCCTCTTCCCTCTTCCCTTCTCTCTCCTCTCTTTCTCTCTCC TCTCTTCCTCATCTCACCACCTCCTCCTCTCTTTCTTCCGTCCTCCTTCCCTTCCTTCTTC from Bio import Fasta file_for_blast = open('m_cold.fasta', 'r') f_iterator = Fasta.Iterator(file_for_blast) f_record = f_iterator.next() from Bio.Blast import NCBIWWW result_handle = NCBIWWW.qblast('blastn', 'nr', f_record) -----Original Message----- From: Peter [mailto:biopython@maubp.freeserve.co.uk] Sent: Thursday, January 19, 2006 4:13 AM To: Omenkeukwu, Gregory; 
BioPython@biopython.org Subject: Re: [BioPython] Qblast problem Omenkeukwu, Gregory wrote: > I am new to Biopython and I am experiencing a little problem. I am > running the Blast over the internet example and I keep getting stuck > when I invoke the qblast function in NCBIWWW.py. Has anyone ever > dealt with the problem below? Every time I run this code I get > Invalid literal error for int(). I will appreciate any response > thanks. Do you have an example XML blast output file that we could use to recreate the problem? Ideally could log a bug for us, and then attach the XML file to the bug? Peter From mike at maibaum.org Thu Jan 19 06:59:07 2006 From: mike at maibaum.org (Michael Anthony Maibaum) Date: Thu Jan 19 10:06:53 2006 Subject: [BioPython] NCBIXML for multiple queries In-Reply-To: <16BDA615-72FC-43EC-8E68-B9739284A33B@maibaum.org> References: <43CBC0C2.10800@lydon.com> <16BDA615-72FC-43EC-8E68-B9739284A33B@maibaum.org> Message-ID: On 16 Jan 2006, at 21:08, Michael Anthony Maibaum wrote: > > On 16 Jan 2006, at 15:50, David Weisman wrote: > >> Hello, >> >> I tried using NCBIXML parsing on a local blast run, in which the >> input had multiple >> query sequences. Blastall writes multiple xml documents to the >> output file, and the >> SAX parser threw a SAXParseException on the second >> declaration, complaining >> of junk after the document element. --snip-- > I've been meaning to check if this fixed in cvs and file a bug if > not but haven't got around to it yet.
FWIW, I tried to file a bug with a patch, but bugzilla appears to have taken a dislike to me. Hopefully someone with cvs access can have a look at the patch I sent to biopython-dev, but in the meantime, if anyone else actually wants a patch, I've included it with this message. NCBIStandalone chunks multiple searches based on the string 'BLAST', which works fine for text output but doesn't work for xml. The patch attached adds ' References: <43CBC0C2.10800@lydon.com> <16BDA615-72FC-43EC-8E68-B9739284A33B@maibaum.org> Message-ID: <43CFC2B3.1070001@maubp.freeserve.co.uk> Michael Anthony Maibaum wrote: > FWIW, I tried to file a bug with a patch, but bugzilla appears to have > taken a dislike to me. Hopefully someone with cvs access can have a > look at the patch I sent to biopython-dev but in the meantime if anyone > else actually wants a patch I've included it with this message. Bug filed on your behalf, which should stop the patch getting lost: http://bugzilla.open-bio.org/show_bug.cgi?id=1933 The fix looks fine to me, but I don't really have time to test it out today... Peter From biopython at maubp.freeserve.co.uk Thu Jan 19 11:52:31 2006 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu Jan 19 12:09:54 2006 Subject: [BioPython] BLAST XML problem? (missing encoding) In-Reply-To: <43C550F2.5030806@maubp.freeserve.co.uk> References: <43C49400.1070802@burnham.org> <43C49734.7050409@burnham.org> <43C4F091.2020408@maubp.freeserve.co.uk> <43C53B71.5080705@burnham.org> <43C550F2.5030806@maubp.freeserve.co.uk> Message-ID: <43CFC3CF.9050103@maubp.freeserve.co.uk> Hi all, As discussed earlier, we have a problem with some Blast XML output files where characters like a-umlaut appear in names, e.g. "Alejandro Schäffer", without the XML file specifying an encoding: rather than say: Did anyone get in touch with the NCBI over this issue? Any reply?
Peter From idoerg at burnham.org Thu Jan 19 12:27:01 2006 From: idoerg at burnham.org (Iddo Friedberg) Date: Thu Jan 19 12:31:03 2006 Subject: [BioPython] BLAST XML problem? (missing encoding) In-Reply-To: <43CFC3CF.9050103@maubp.freeserve.co.uk> References: <43C49400.1070802@burnham.org> <43C49734.7050409@burnham.org> <43C4F091.2020408@maubp.freeserve.co.uk> <43C53B71.5080705@burnham.org> <43C550F2.5030806@maubp.freeserve.co.uk> <43CFC3CF.9050103@maubp.freeserve.co.uk> Message-ID: <43CFCBE5.5000804@burnham.org> They seem to have fixed the encoding. We currently have a production-level system which seems to work fine with this. Iddo Peter wrote: > Hi all, > > As discussed earlier, we have a problem on some Blast XML output files > where entities like a-umlaut appear in names e.g. "Alejandro Schäffer", > without the XML file specifying an encoding: > > > > rather than say: > > > > Did anyone get in touch with the NCBI over this issue? Any reply? > > Peter > > _______________________________________________ > BioPython mailing list - BioPython@biopython.org > http://biopython.org/mailman/listinfo/biopython > > -- Iddo Friedberg, Ph.D. Burnham Institute for Medical Research 10901 N. Torrey Pines Rd. La Jolla, CA 92037 USA Tel: +1 (858) 646 3100 x3516 Fax: +1 (858) 713 9949 http://iddo-friedberg.org From ziemys.1 at osu.edu Fri Jan 20 16:55:38 2006 From: ziemys.1 at osu.edu (ARTURAS ZIEMYS) Date: Fri Jan 20 17:08:20 2006 Subject: [BioPython] DSSP / output Message-ID: <1fa3a151fa29a7.1fa29a71fa3a15@osu.edu> Hi, After 'dssp=DSSP(structure[0], 'test.pdb')' my dssp contains records with a value marked with * (see below). I did not find a description of it on the DSSP homepage. Is it accessibility expressed as a fraction or in %? I just want to be sure.
* (, '-', 46, 0.23711340206185566) (, 'S', 28, 0.19718309859154928) (, 'S', 121, 0.59024390243902436) With best Arturas From bill at barnard-engineering.com Thu Jan 26 14:18:31 2006 From: bill at barnard-engineering.com (Bill Barnard) Date: Thu Jan 26 14:43:05 2006 Subject: [BioPython] Patches to enable Doc building for source and rpm distributions In-Reply-To: <1127857738.16589.94.camel@tioga.barnard-engineering.com> References: <1127779458.16589.60.camel@tioga.barnard-engineering.com> <1127802694.16589.78.camel@tioga.barnard-engineering.com> <1127857738.16589.94.camel@tioga.barnard-engineering.com> Message-ID: <1138303111.12796.29.camel@tioga.barnard-engineering.com> At the end of September, when I last had an opportunity to work on this, I rewrote the Makefile for the Doc directory subtree. The gist of the work was to properly call pdflatex, hevea, & hacha to build the pdfs, htmls, & txt files from their .tex input files. I made a common.mk file to abstract the common parts from the subdirectory makefiles. I created a patch for the Doc/biopdb_faq.tex file, generated from Doc/biopdb_faq.lyx, which contained an error from the perspective of the doc-generating utilities. I modified MANIFEST.in to include the new files, and to exclude the files which will be subsequently generated by the make/build and hence included in the distribution. I added a one line mod to setup.py to call the Doc make [ os.system('make -C Doc') ]. I sent these changes to the mailing list http://www.biopython.org/pipermail/biopython/2005-September/002777.html Recently I retrieved updates from CVS and discovered a small change I needed to make. Doc/Makefile did not correctly clean the dirs and subdirs; I fixed that with a command line env target, e.g. "make TARGET=clean". I also modified the top level Makefile which called make clean for the Doc directory so it uses the new calling convention. (This is probably irrelevant to the purpose of that makefile however.) 
I also note that my patch attachments to my September emails were cleaned by the mail list server. I will attach the patches and new common.mk in a tarball to this email. These patches and files could be applied and added to the current CVS tree as of 26-Jan-2006. Please feel free to use any portion that seems useful. Best, Bill -- Bill Barnard p.s. In order that you see exactly which files I've touched I'm including some details from my log files below (The Updated files below are the ones currently checked into CVS and are generated from the make; they could be removed from CVS.) cvs-update.2006-01-26.log ######################### ? Doc/biopdb_faq.tex.hevea-html-fix.patch ? Doc/common.mk M MANIFEST.in M Makefile M setup.py M Doc/Makefile U Doc/Tutorial.txt U Doc/cookbook/LogisticRegression/LogisticRegression.html U Doc/cookbook/LogisticRegression/LogisticRegression.pdf U Doc/cookbook/LogisticRegression/LogisticRegression.txt M Doc/cookbook/LogisticRegression/Makefile M Doc/cookbook/biopython_test/Makefile U Doc/cookbook/biopython_test/biopython_test.html U Doc/cookbook/biopython_test/biopython_test.pdf U Doc/cookbook/biopython_test/biopython_test.txt M Doc/cookbook/genbank_to_fasta/Makefile U Doc/cookbook/genbank_to_fasta/genbank_to_fasta.html U Doc/cookbook/genbank_to_fasta/genbank_to_fasta.pdf U Doc/cookbook/genbank_to_fasta/genbank_to_fasta.txt U Doc/install/Installation.html U Doc/install/Installation.pdf U Doc/install/Installation.txt M Doc/install/Makefile -------------- next part -------------- A non-text attachment was scrubbed... 
Name: Doc_Makefile_fix.tgz Type: application/x-compressed-tar Size: 2985 bytes Desc: not available Url : http://portal.open-bio.org/pipermail/biopython/attachments/20060126/ae259533/Doc_Makefile_fix.bin From omid9dr18 at hotmail.com Thu Jan 26 16:30:51 2006 From: omid9dr18 at hotmail.com (Omid Khalouei) Date: Thu Jan 26 16:44:05 2006 Subject: [BioPython] Structural superpostion script Message-ID: Hello, I was wondering if there are any open source protein structural alignment (superposition) programs, specifically in Python. Thanks a lot, Sam K. From idoerg at burnham.org Thu Jan 26 18:27:37 2006 From: idoerg at burnham.org (Iddo Friedberg) Date: Thu Jan 26 18:25:32 2006 Subject: [BioPython] pbwiki -- bad idea Message-ID: <43D95AE9.9020308@burnham.org> Sorry, the pbwiki thing was a bad idea... not very easy. Just email me the information regarding use of biopython. I'll sort it out. Thanks, Iddo -- Iddo Friedberg, Ph.D. Burnham Institute for Medical Research 10901 N. Torrey Pines Rd. La Jolla, CA 92037 Tel: (858) 646 3100 x3516 Fax: (858) 713 9949 http://iddo-friedberg.org From idoerg at burnham.org Thu Jan 26 18:23:53 2006 From: idoerg at burnham.org (Iddo Friedberg) Date: Thu Jan 26 18:25:51 2006 Subject: [BioPython] Tools in biopython Message-ID: <43D95A09.7000508@burnham.org> Hi all, I thought it would be cool to see which bioinformatics tools have, to any extent, a Biopython module under the hood. Those can be web servers, proprietary and open-source tools, and anything else. I would like to know by Saturday, so I can write this up in the OBF newsletter (sorry about the last-minute announcement). Can you log in to http://biopython.pbwiki.com/BioPythonTools username: biopython password: biopython Add the tools into which biopython modules have been incorporated, and a URL, if relevant. Thanks, Iddo -- Iddo Friedberg, Ph.D. Burnham Institute for Medical Research 10901 N. Torrey Pines Rd.
La Jolla, CA 92037
Tel: (858) 646 3100 x3516
Fax: (858) 713 9949
http://iddo-friedberg.org

From j.pansanel at pansanel.net Fri Jan 27 03:19:27 2006
From: j.pansanel at pansanel.net (Jerome PANSANEL)
Date: Fri Jan 27 03:51:08 2006
Subject: [BioPython] Structural superpostion script
In-Reply-To:
References:
Message-ID: <200601270919.27826.j.pansanel@pansanel.net>

Hi!

PyMOL can do this job:
http://www.rubor.de/bioinf/tips_modeling.html#superpos
http://pymol.sourceforge.net/newman/ref/S1000comref.html#2_110

Jerome Pansanel

On Thursday 26 January 2006 22:30, Omid Khalouei wrote:
> Hello,
>
> I was wondering if there are any open source protein structural alignment
> (superposition) programs, specifically in Python.
>
> Thanks a lot,
> Sam K.
>
>
> _______________________________________________
> BioPython mailing list - BioPython@biopython.org
> http://biopython.org/mailman/listinfo/biopython

From idoerg at burnham.org Fri Jan 27 12:49:34 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Fri Jan 27 12:45:48 2006
Subject: [BioPython] Tools in biopython
In-Reply-To: <43D95A09.7000508@burnham.org>
References: <43D95A09.7000508@burnham.org>
Message-ID: <43DA5D2E.7020400@burnham.org>

Thanks to all who have written. A draft of the Biopython section of the OBF newsletter is viewable at:

http://www.open-bio.org/wiki/Newsletter:2006_Winter#BioPython

Let me know if I screwed up, if there is something I should put in which I have not, or if there is something I should take out. There is still time to fix things (by Sunday).

Best,
Iddo

Iddo Friedberg wrote:
> Hi all,
>
> I thought it would be cool to see which bioinformatics tools have,
> to any extent, a Biopython module under the hood. Those can be web
> servers, proprietary and open-source tools, and anything else. I would
> like to know by Saturday, so I can write this up in the OBF newsletter
> (sorry about the last-minute announcement).

-- 
Iddo Friedberg, Ph.D.
Burnham Institute for Medical Research
10901 N.
Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 713 9949
http://iddo-friedberg.org
http://BioFunctionPrediction.com

From ajoyner at UCSD.Edu Tue Jan 31 11:50:06 2006
From: ajoyner at UCSD.Edu (ajoyner@UCSD.Edu)
Date: Tue Jan 31 11:46:25 2006
Subject: [BioPython] Pairwise BLASTS
Message-ID: <200601311650.k0VGo6KW006353@smtp.ucsd.edu>

Hi,
Does anyone know of a program that I can use to run pairwise BLASTs in a batch fashion? This would be as opposed to an online GUI.
Thanks!

From idoerg at burnham.org Tue Jan 31 12:08:38 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Tue Jan 31 12:09:34 2006
Subject: [BioPython] Pairwise BLASTS
In-Reply-To: <200601311650.k0VGo6KW006353@smtp.ucsd.edu>
References: <200601311650.k0VGo6KW006353@smtp.ucsd.edu>
Message-ID: <43DF9996.3030309@burnham.org>

Attached is a little script I wrote a while ago.

Usage example:

./all_bl2seq *.fasta

HTH,
Iddo

ajoyner@ucsd.edu wrote:
>Hi,
>Does anyone know of a program that I can use to run pairwise BLASTs in a batch
>fashion? This would be as opposed to an online GUI.
>Thanks!

-- 
Iddo Friedberg, Ph.D.
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 713 9949
http://iddo-friedberg.org
http://BioFunctionPrediction.org
-------------- next part --------------
#!/usr/bin/python
import subprocess
import sys

TMP_BLAST_OUT = 'tmp_blast_out'
BLAST_OUT = 'blast_out.blast'

def all_bl2seq(file_list):
    # Accepts a list of FASTA files and does an all-vs-all pairwise BLAST.
    # Must have the NCBI tools (bl2seq) installed.
    for i, seq1 in enumerate(file_list):
        # Progress indicator: one dot per query file.
        sys.stdout.write('.')
        sys.stdout.flush()
        # Compare seq1 only against the files after it, so each pair runs once.
        for seq2 in file_list[i + 1:]:
            # An argument list avoids shell quoting problems with file names.
            subprocess.call(['bl2seq', '-p', 'blastp', '-i', seq1, '-D', '1',
                             '-j', seq2, '-o', TMP_BLAST_OUT])
            # Append this pair's report to the combined output file.
            with open(TMP_BLAST_OUT) as tmp, open(BLAST_OUT, 'a') as out:
                out.write(tmp.read())

if __name__ == "__main__":
    # A pairwise comparison needs at least two input files.
    if len(sys.argv) < 3:
        sys.exit("usage: all_bl2seq FASTA_FILE FASTA_FILE [FASTA_FILE ...]")
    all_bl2seq(sys.argv[1:])

From hubin.keio at gmail.com Tue Jan 31 22:53:17 2006
From: hubin.keio at gmail.com (Bin Hu)
Date: Wed Feb 1 01:35:06 2006
Subject: [BioPython] protein net charge and PSI blast
Message-ID: <71dea9850601311953s58ec5c30p67de4b751f543035@mail.gmail.com>

Hi,

Does anyone know of an existing package to calculate protein net charge? And could anyone tell me how to run a PSI-BLAST instead of a regular BLAST using Biopython?

Thank you.
Bin