From arareko at campus.iztacala.unam.mx Sat Mar 3 17:32:46 2007 From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra) Date: Sat, 03 Mar 2007 16:32:46 -0600 Subject: [BioPython] [Bioperl-l] New Article on Approaches to Web Development for Bioinformatics In-Reply-To: <1ad8057e0703021842y683853f5k1c97dbf362f20dda@mail.gmail.com> References: <1ad8057e0703021842y683853f5k1c97dbf362f20dda@mail.gmail.com> Message-ID: <45E9F78E.8040406@campus.iztacala.unam.mx> Hi Alex, I think you've put together a very nice & concise introductory article. I'd like to comment a little on some sections I've read: * Introduction > "Given that you have an idea for analyzing or presenting data in a > particular was, a complete bioinformatics web application depends of > these basic pieces, which is what this article is all about: > > 1. A source of data... > 2. An application programming language... > 3. A web application platform... > 4. Optionally, a data store... > 5. Optionally, you would reuse software tools..." Even though you make a small mention of Web Services at the very end of the article (under Application Integration -> Programmatic Integration), I believe that Web Services can be another optional (or even basic) piece of a web application. In fact, many web applications consist only of Web Services without HTML user interfaces. * Application Development Languages > "There are many different programming platforms and tools available to > solve bioinformatics problems. It can be bewildering at first, but it > makes more sense to build on top of some of these tools rather than > build from scratch. Some the problems with using these tools for a > bioinformatics portal are > > 1. Many tools are written... > 2. Some tools have particular prerequisites... > 3. Many may not be in a form... > 4. The context that gives meaning...
> > Standardization on a particular platform can help manageability but > for most organizations a compromise between standardization and > adoption of several different platforms will allow many people to > develop software in platforms that they are already comfortable with > and allow the reuse of a large amount of freely available software..." I would add to the problems list the fact that building web (or other kinds of) applications on top of a platform whose codebase is constantly evolving can make them very difficult to maintain. The case of EnsEMBL comes to my mind here: they opted to stick with BioPerl 1.2.3 as a core library and haven't moved to a newer version of it because the EnsEMBL code is so vast that a simple upgrade of BioPerl would break a lot of their code. AFAIK, it's because of this and the slowness of some parts of BioPerl that EnsEMBL is gradually saying goodbye to BioPerl. Also, I think that depending on the amount of available code you plan to import into your application, sometimes having a whole platform at the very bottom can add unnecessary extra weight to your application. More weight can mean less speed, which is critical in web development. * Application Integration -> Navigation > "The basic way that users will navigate into and around your > application should be using HTTP GET and POST requests with specific > URL's. Users bookmark these URL's and other applications will link to > them. Most applications developers did not realize it at first, but > these URL's are, in fact, an interface into your application that you > must maintain in a consistent way as you change and evolve your > software. Otherwise, they will find dead links..." Just as I clicked the bookmark button for your article :) The same principle could apply to its filenames. A URL of the form: http://medicalcomputing.net/tools_dna17.php is less indicative of the real content of the article and can mislead potential readers.
Optimising the URL's will make them easier for search engines to index, something like: http://medicalcomputing.net/web-development-bioinformatics17.php would do the trick. To conclude my comments, I was surprised to see a section about BioPHP and not about other better-known toolkits like BioPython or BioRuby. What about their role in web development? Python is also a common language for web programming and with all the recent *hot* stuff like Ruby on Rails, it's very likely that both Bio* toolkits are more than ready for deploying web applications. I'm Cc'ing this to their respective mailing lists to see if someone wants to give you some feedback about them in order to complement your article. Other than that, I really liked your work :) Cheers, Mauricio. Alex Amies wrote: > I have written an article on Approaches to Web Development for > Bioinformatics at > > http://medicalcomputing.net/tools_dna1.php > > There is a fairly large section on BioPerl at > > http://medicalcomputing.net/tools_dna13.php > > I hope that someone gets something useful out of it. I also looking for > feedback on it and, in particular, please let me know about any mistakes in > it. > > The intent of the article is to give an overview of various approaches to > developing web based tools for bioinformatics. It describes the alternatives > at each layer of the system, including the data layer and sources of data, > the application programming layer, the web layer, and bioinformatics tools > and software libraries.
> > Alex > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- MAURICIO HERRERA CUADRA arareko at campus.iztacala.unam.mx Laboratorio de Genética Unidad de Morfofisiología y Función Facultad de Estudios Superiores Iztacala, UNAM From alexamies at gmail.com Sat Mar 3 22:09:51 2007 From: alexamies at gmail.com (Alex Amies) Date: Sat, 3 Mar 2007 19:09:51 -0800 Subject: [BioPython] [Bioperl-l] New Article on Approaches to Web Development for Bioinformatics In-Reply-To: <45E9F78E.8040406@campus.iztacala.unam.mx> References: <1ad8057e0703021842y683853f5k1c97dbf362f20dda@mail.gmail.com> <45E9F78E.8040406@campus.iztacala.unam.mx> Message-ID: <1ad8057e0703031909v4880f5f1t3c4159b75c36bcca@mail.gmail.com> Mauricio, Thanks for your comments. You are right that I could have said a lot more about web services. I plan on doing that but I haven't got there yet. Actually, with all the hype about web services I have been surprised to find the programming model so complicated. As you mention, I certainly could have thought out my own URL's better. I have been surprised not to find more PHP activity in bioinformatics. To me, besides being a lightweight and pleasant language to program in, it is incredibly economical for hosting Internet applications and there is a huge open source community around PHP in general. The same can be said of Perl. It is because of my own ignorance and lack of time that I have not investigated Python and Ruby. I may do so in the future and write about them. Alex On 3/3/07, Mauricio Herrera Cuadra wrote: > Hi Alex, > > I think you've put a very nice & concise introductory article.
I'd like > to comment a little on some sections I've read: > > * Introduction > > > "Given that you have an idea for analyzing or presenting data in a > > particular was, a complete bioinformatics web application depends of > > these basic pieces, which is what this article is all about: > > > > 1. A source of data... > > 2. An application programming language... > > 3. A web application platform... > > 4. Optionally, a data store... > > 5. Optionally, you would reuse software tools..." > > Even though you do a small mention about Web Services at the very end of > the article (under Application Integration -> Programmatic Integration), > I believe that Web Services can be another optional (or even basic) > piece of a web application. In fact, many web applications consist only > of Web Services without HTML user interfaces. > > * Application Development Languages > > > "There are many different programming platforms and tools available to > > solve bioinformatics problems. It can be bewildering at first, but it > > makes more sense to build on top of some of these tools rather than > > build from scratch. Some the problems with using these tools for a > > bioinformatics portal are > > > > 1. Many tools are written... > > 2. Some tools have particular prerequisites... > > 3. Many may not be in a form... > > 4. The context that gives meaning... > > > > Standardization on a particular platform can help manageability but > > for most organizations a compromise between standardization and > > adoption of several different platforms will allow many people to > > develop software in platforms that they are already comfortable with > > and allow the reuse of a large amount of freely available software..." > > I would add to the problems list the fact that building web (or other > kind of) applications on top of a platform whose codebase is evolving > constantly, can make them very difficult to maintain. 
The case of > EnsEMBL comes to my mind here: they opted to stick with BioPerl 1.2.3 as > a core library and haven't moved onto a higher version of it because the > EnsEMBL code is so vast, that a simple upgrade of BioPerl would break a > lot of their code. AFAIK, it's because of this and the slowness at some > parts of BioPerl that EnsEMBL is gradually saying goodbye to BioPerl. > > Also, I think that depending on the amount of available code you plan to > import into your application, sometimes having a whole platform at the > very bottom can add unnecessary extra weight to your application. More > weight could be equal to less speed, this is critical in web development. > > * Application Integration -> Navigation > > > "The basic way that users will navigate into and around your > > application should be using HTTP GET and POST requests with specific > > URL's. Users bookmark these URL's and other applications will link to > > them. Most applications developers did not realize it at first, but > > these URL's are, in fact, an interface into your application that you > > must maintain in a consistent way as you change and evolve your > > software. Otherwise, they will find dead links..." > > Just as I clicked the bookmark button for your article :) The same > principle could apply to its filenames. A URL of the form: > http://medicalcomputing.net/tools_dna17.php is less indicative of the > real content of the article and can mislead potential readers. > Optimising the URL's will make them better to be indexed by search > engines, something like: > http://medicalcomputing.net/web-development-bioinformatics17.php would > do the trick. > > To conclude my comments, I was surprised to see a section about BioPHP > and not about other more-known toolkits like BioPython or BioRuby. What > about their role in web development? 
Python is also a common language > for web programming and with all the recent *hot* stuff like Ruby On > Rails, it's very likely that both Bio* toolkits are more than ready for > deploying web applications. I'm Cc'ing this to their respective mailing > lists to see if someone wants to give you some feedback about them in > order to complement your article. Other than that, I really liked your > work :) > > Cheers, > Mauricio. > > Alex Amies wrote: > > I have written an article on Approaches to Web Development for > > Bioinformatics at > > > > http://medicalcomputing.net/tools_dna1.php > > > > There is a fairly large section on BioPerl at > > > > http://medicalcomputing.net/tools_dna13.php > > > > I hope that someone gets something useful out of it. I also looking for > > feedback on it and, in particular, please let me know about any mistakes in > > it. > > > > The intent of the article is to give an overview of various approaches to > > developing web based tools for bioinformatics. It describes the alternatives > > at each layer of the system, including the data layer and sources of data, > > the application programming layer, the web layer, and bioinformatics tools > > and software libraries. 
> > > > Alex > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > MAURICIO HERRERA CUADRA > arareko at campus.iztacala.unam.mx > Laboratorio de Genética > Unidad de Morfofisiología y Función > Facultad de Estudios Superiores Iztacala, UNAM > > > > From shahs at MIT.EDU Sat Mar 3 23:14:59 2007 From: shahs at MIT.EDU (Hossein Shahsavari) Date: Sat, 03 Mar 2007 23:14:59 -0500 Subject: [BioPython] IOError: [Errno 2] No such file or directory: Message-ID: <20070303231459.s0pe4qpb1128o0gw@webmail.mit.edu> Hello, I receive the following error when I am trying to access a file called HISTORY from another file with this command template = '~/CSH/HISTORY' and I get this error. IOError: [Errno 2] No such file or directory: '~/CSH/HISTORY' I use python in a Linux environment. I appreciate any suggestions/comments. Hossein Shahsavari From biopython at maubp.freeserve.co.uk Sun Mar 4 06:58:07 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 04 Mar 2007 11:58:07 +0000 Subject: [BioPython] IOError: [Errno 2] No such file or directory: In-Reply-To: <20070303231459.s0pe4qpb1128o0gw@webmail.mit.edu> References: <20070303231459.s0pe4qpb1128o0gw@webmail.mit.edu> Message-ID: <45EAB44F.1080909@maubp.freeserve.co.uk> Hossein Shahsavari wrote: > Hello, > > I receive the following error when I am trying to access a file called HISTORY > from another file with this command > > template = '~/CSH/HISTORY' > > and I get this error. > > IOError: [Errno 2] No such file or directory: '~/CSH/HISTORY' > > I use python in a Linux environment. I appreciate any suggestions/comments. If you had posted the python code it would be easier to guess what is going wrong. What does this do? import os template = '~/CSH/HISTORY' print os.path.isfile(template) That should print either True or False.
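Another option is to let Python expand the tilde itself with os.path.expanduser, which does what the shell normally does for you. A small sketch using only the standard library (the CSH/HISTORY path is just the one from your message):

```python
import os

# Expand the leading '~' into the real home directory, the way the shell would.
template = os.path.expanduser('~/CSH/HISTORY')
# template is now an absolute path such as '/home/username/CSH/HISTORY'
```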
You might also try replacing the tilde ('~') with the actual path of your home folder, something like this typically: template = '/home/username/CSH/HISTORY' P.S. Have you checked the case? Linux and Unix are case sensitive. Peter From shahs at MIT.EDU Sun Mar 4 11:57:34 2007 From: shahs at MIT.EDU (Hossein Shahsavari) Date: Sun, 04 Mar 2007 11:57:34 -0500 Subject: [BioPython] IOError: [Errno 2] No such file or directory: In-Reply-To: <45EAB44F.1080909@maubp.freeserve.co.uk> References: <20070303231459.s0pe4qpb1128o0gw@webmail.mit.edu> <45EAB44F.1080909@maubp.freeserve.co.uk> Message-ID: <20070304115734.xe5ms6c7vkv8k4wk@webmail.mit.edu> Hi Thanks for your guidance. The problem was the tilde ('~') which I replaced with the correct path and now it works. I have another maybe simple question: I have 26 files, namely output1, output2,...,output26. I can read them one by one but how can I read them all in an easier way, like a loop? I put a "for loop" by setting i=1 for i in range(1,27) template='outputi' however, I got the same error as above IOError: [Errno 2] No such file or directory: 'outputi'. It seems "i" can't be attached to the output. Thanks a lot Hossein From biopython at maubp.freeserve.co.uk Sun Mar 4 12:34:26 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 04 Mar 2007 17:34:26 +0000 Subject: [BioPython] IOError: [Errno 2] No such file or directory: In-Reply-To: <20070304115734.xe5ms6c7vkv8k4wk@webmail.mit.edu> References: <20070303231459.s0pe4qpb1128o0gw@webmail.mit.edu> <45EAB44F.1080909@maubp.freeserve.co.uk> <20070304115734.xe5ms6c7vkv8k4wk@webmail.mit.edu> Message-ID: <45EB0322.8050809@maubp.freeserve.co.uk> Hossein Shahsavari wrote: > I have another maybe simple question: > > I have 26 files, namely output1, output2,...,output26. I can read them > one by one but how can I read them all in an easier way, like a loop?
> I put a "for loop" by setting > > i=1 > > for i in range(1,27) > template='outputi' > > however, I got the same error as above IOError: [Errno 2] No such file or > directory: 'outputi'. It seems "i" can't be attached to the output. > > Thanks alot > > Hossein You should really try a basic introduction to python. There are lots of tutorials online, and great books too. Your questions so far are not really related to BioPython at all. Note that indentation is very important in python. You were also missing the colon at the end for line. More importantly the following line just sets the variable template to the string 'outputi', and doesn't do anything with the variable i. template='outputi' You want to do something like this: for i in range(1,27) : template = 'output' + str(i) print template Good luck. Peter From lucks at fas.harvard.edu Mon Mar 5 09:13:02 2007 From: lucks at fas.harvard.edu (Julius Lucks) Date: Mon, 5 Mar 2007 09:13:02 -0500 Subject: [BioPython] blast parsing errors Message-ID: <1E0F24AF-A4F4-4818-B7AA-5D35BF7EA260@fas.harvard.edu> Hi all, I am trying to parse a bunch of blast results that I gather via NCBIWWW.qblast(). 
I have the following code snippet: ----------- from Bio import Fasta from Bio.Blast import NCBIWWW from Bio.Blast import NCBIXML import StringIO import re #BLAST cutoff cutoff = 1e-4 #Create a fasta record: title and seq are given title = 'test' seq = 'ATCG' fasta_rec = Fasta.Record() #Sanitize title - blast does not like single quotes or \n in titles title = re.sub("'","prime",title) title = re.sub("\n","",title) fasta_rec.title = title fasta_rec.sequence = seq b_parser = NCBIXML.BlastParser() result_handle = NCBIWWW.qblast('blastp','nr',fasta_rec,ncbi_gi=1,expect=cutoff,format_type="XML",entrez_query="Viruses [ORGN]") blast_results = result_handle.read() blast_handle = StringIO.StringIO(blast_results) b_record = b_parser.parse(blast_handle) for alignment in b_record.alignments: titles = alignment.title.split('>') print titles ------------- The issue is that sometimes the blast parser chokes, with tracebacks like: File "./src/create_annotations.py", line 96, in get_blast_annotations b_record = b_parser.parse(blast_handle) File "/sw/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 112, in parse self._parser.parse(handler) File "/sw/lib/python2.5/xml/sax/expatreader.py", line 107, in parse xmlreader.IncrementalParser.parse(self, source) File "/sw/lib/python2.5/xml/sax/xmlreader.py", line 123, in parse self.feed(buffer) File "/sw/lib/python2.5/xml/sax/expatreader.py", line 211, in feed self._err_handler.fatalError(exc) File "/sw/lib/python2.5/xml/sax/handler.py", line 38, in fatalError raise exception xml.sax._exceptions.SAXParseException: <unknown>:7:70: not well-formed (invalid token) I am not sure which alignment it choked on, but I would like to rescue it with a try/except block if possible. But it seems to me that if I did something like try: b_record = b_parser.parse(blast_handle) except: ... Then I would not get anything in b_record if an error is raised during parsing.
Rather, I would like to have whatever has been successful up to the point of the error stored in b_record. Is there any way to do this via the BioPython API, or do I have to dig into the python xml parsing code? Also, if anyone has a better idea of how to structure this code, I would be very appreciative. Cheers, Julius ----------------------------------------------------- http://openwetware.org/wiki/User:Lucks ----------------------------------------------------- From biopython at maubp.freeserve.co.uk Mon Mar 5 09:55:38 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 05 Mar 2007 14:55:38 +0000 Subject: [BioPython] blast parsing errors In-Reply-To: <1E0F24AF-A4F4-4818-B7AA-5D35BF7EA260@fas.harvard.edu> References: <1E0F24AF-A4F4-4818-B7AA-5D35BF7EA260@fas.harvard.edu> Message-ID: <45EC2F6A.6090200@maubp.freeserve.co.uk> Julius Lucks wrote: > Hi all, > > I am trying to parse a bunch of blast results that I gather via > NCBIWWW.qblast(). I have the following code snippet: You didn't say which version of BioPython you are using, I would guess 1.42 - there have been some Bio.Blast changes since then. Your example sequence was "ATCG", but you ran a "blastp" search. Did you really mean the peptide Ala-Thr-Cys-Gly here? If you meant to do a nucleotide search, try using "blastn" and "nr" instead. That should work better. However, there is still something funny going on. I tried your example as is using the CVS code, and it fails before it even gets the blast results back... Could you save the XML output to a file and email it to me; or even better, file a bug and attach the XML file to it.
Thanks Peter From biopython at maubp.freeserve.co.uk Mon Mar 5 10:12:25 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 05 Mar 2007 15:12:25 +0000 Subject: [BioPython] blast parsing errors In-Reply-To: <1E0F24AF-A4F4-4818-B7AA-5D35BF7EA260@fas.harvard.edu> References: <1E0F24AF-A4F4-4818-B7AA-5D35BF7EA260@fas.harvard.edu> Message-ID: <45EC3359.1030802@maubp.freeserve.co.uk> Julius Lucks wrote: > Hi all, > > I am trying to parse a bunch of blast results that I gather via > NCBIWWW.qblast(). I have the following code snippet: I am wondering if your trivial example triggered some "unusual" error page from the NCBI... I would suggest you update to CVS, as we have made a lot of changes to the Blast XML support. You would probably be safe just updating the following Bio.Blast files, located here on your machine: /sw/lib/python2.5/site-packages/Bio/Blast/NCBIStandalone.py /sw/lib/python2.5/site-packages/Bio/Blast/NCBIWWW.py /sw/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py /sw/lib/python2.5/site-packages/Bio/Blast/Record.py If you don't know how to use CVS, then just back up the originals, and replace them with the new files, downloaded one by one from here: http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/?cvsroot=biopython ---------------------------------------------------------------------- This works for me using the CVS version of BioPython.
I have just made a string, rather than messing about with a fasta record object, to keep the code short: #Protein example, BLASTP from Bio.Blast import NCBIWWW from Bio.Blast import NCBIXML #BLAST cutoff cutoff = 1e-4 fasta_rec = ">GI:121308427\nrslgmevmhernahnfpldlaavevpsing" b_parser = NCBIXML.BlastParser() result_handle = NCBIWWW.qblast('blastp', 'nr', fasta_rec, ncbi_gi=1, expect=cutoff, format_type="XML", entrez_query="Viruses [ORGN]") #This returns a record iterator, changed after release of BioPython 1.42 b_records = b_parser.parse(result_handle) for b_record in b_records : print "%s found %i results" % (b_record.query, len(b_record.alignments)) for alignment in b_record.alignments: titles = alignment.title.split('>') print titles Or, if you wanted to do a nucleotide BLASTN search, try: fasta_rec = '>GI:121308427\nttagccatttatagatggaacttcaacagcagctaagtc' \ + 'tagagggaaattgtgagcattacgctcgtgcatgacctccataccaagagatct' and replace 'blastp' with 'blastn' in the call to qblast(). Peter From mdehoon at c2b2.columbia.edu Mon Mar 5 10:36:43 2007 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Mon, 05 Mar 2007 10:36:43 -0500 Subject: [BioPython] blast parsing errors In-Reply-To: <1E0F24AF-A4F4-4818-B7AA-5D35BF7EA260@fas.harvard.edu> References: <1E0F24AF-A4F4-4818-B7AA-5D35BF7EA260@fas.harvard.edu> Message-ID: <45EC390B.8020400@c2b2.columbia.edu> Julius Lucks wrote: > seq = 'ATCG' > ... > fasta_rec.sequence = seq > ... > result_handle = NCBIWWW.qblast > ('blastp','nr',fasta_rec,ncbi_gi=1,expect=cutoff,format_type="XML",entre You have a nucleotide sequence but are running a protein-protein blast with blastp. If you run this exact search with Blast through a browser, it will show you an error message. The function _parse_qblast_ref_page(handle), which is called from NCBIWWW.qblast, chokes on this error message.
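One defensive pattern, sketched below, is to peek at the reply before handing it to the XML parser. The looks_like_blast_xml helper is hypothetical, not part of Biopython, and a real reply would come from result_handle.read():

```python
# Hypothetical guard: NCBI's XML output starts with an XML declaration,
# while the server's error pages are HTML.
def looks_like_blast_xml(reply):
    return reply.lstrip().startswith('<?xml')

# Stand-ins for two kinds of server replies.
xml_reply = '<?xml version="1.0"?><BlastOutput></BlastOutput>'
html_error = '<html><head><title>NCBI Blast</title></head><body>Error</body></html>'
```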
If you want to make this more robust, one solution might be to check for error messages returned by the Blast server in _parse_qblast_ref_page. By the way, the code can be simplified as follows: from Bio.Blast import NCBIWWW from Bio.Blast import NCBIXML #BLAST cutoff cutoff = 1e-4 #Create a fasta record: title and seq are given seq = 'ATCG' b_parser = NCBIXML.BlastParser() result_handle = NCBIWWW.qblast('blastn', 'nr', seq, ncbi_gi=1, expect=cutoff, format_type="XML", entrez_query="Viruses [ORGN]") b_records = b_parser.parse(result_handle) b_record = b_records[0] for alignment in b_record.alignments: titles = alignment.title.split('>') print titles -------------------------------------------- Note: the BlastParser currently in CVS returns a list of Blast records instead of a single Blast record, hence the b_records[0] above. Btw, with NCBIXML currently in CVS, you don't need to create b_parser first: result_handle = NCBIWWW.qblast('blastn', 'nr', seq, ncbi_gi=1, expect=cutoff, format_type="XML", entrez_query="Viruses [ORGN]") b_records = NCBIXML.parse(result_handle) b_record = b_records.next() ----------------------------------------------- --Michiel. -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From winter at biotec.tu-dresden.de Mon Mar 5 10:07:00 2007 From: winter at biotec.tu-dresden.de (Christof Winter) Date: Mon, 05 Mar 2007 16:07:00 +0100 Subject: [BioPython] blast parsing errors In-Reply-To: <1E0F24AF-A4F4-4818-B7AA-5D35BF7EA260@fas.harvard.edu> References: <1E0F24AF-A4F4-4818-B7AA-5D35BF7EA260@fas.harvard.edu> Message-ID: <45EC3214.5050100@biotec.tu-dresden.de> Running your example, I get: >>> ## working on region in file /tmp/python-18415Uda.py... Traceback (most recent call last): File "<stdin>", line 1, in ? File "/tmp/python-18415Uda.py", line 25, in ?
result_handle = NCBIWWW.qblast('blastp','nr',fasta_rec,ncbi_gi=1,expect=cutoff,format_type="XML",entrez_query="Viruses [ORGN]") File "/var/lib/python-support/python2.4/Bio/Blast/NCBIWWW.py", line 1091, in qblast rid, rtoe = _parse_qblast_ref_page(handle) File "/var/lib/python-support/python2.4/Bio/Blast/NCBIWWW.py", line 1133, in _parse_qblast_ref_page return rid, int(rtoe) ValueError: invalid literal for int(): > NCBI Blast title = re.sub("\n","",title) > fasta_rec.title = title > fasta_rec.sequence = seq > > > b_parser = NCBIXML.BlastParser() > > result_handle = NCBIWWW.qblast > ('blastp','nr',fasta_rec,ncbi_gi=1,expect=cutoff,format_type="XML",entre > z_query="Viruses [ORGN]") > blast_results = result_handle.read() > > blast_handle = StringIO.StringIO(blast_results) > b_record = b_parser.parse(blast_handle) > > for alignment in b_record.alignments: > titles = alignment.title.split('>') > print titles > > ------------- > > > The issue is sometimes the blast parser chokes with tracebacks like: > > File "./src/create_annotations.py", line 96, in get_blast_annotations > b_record = b_parser.parse(blast_handle) File "/sw/lib/python2.5/ > site-packages/Bio/Blast/NCBIXML.py", line 112, in parse > self._parser.parse(handler) File "/sw/lib/python2.5/xml/sax/ > expatreader.py", line 107, in parse > xmlreader.IncrementalParser.parse(self, source) File "/sw/lib/ > python2.5/xml/sax/xmlreader.py", line 123, in parse > self.feed(buffer) > File "/sw/lib/python2.5/xml/sax/expatreader.py", line 211, in feed > self._err_handler.fatalError(exc) > File "/sw/lib/python2.5/xml/sax/handler.py", line 38, in > fatalError raise exception > xml.sax._exceptions.SAXParseException: :7:70: not well- > formed (invalid token) > > I am not sure which alignment it choked on, but I would like to > rescue it with a try/except block if possible. But it seems to me > that if I did something like > > try: > b_record = b_parser.parse(blast_handle) > except: > ... 
> > Then I would not get anything in b_record if an error raised in the > parsing. Rather, I would like to have whatever has been successful > up to the point of the error stored in b_record. > > Is there any way to do this via the BioPython API, or do I have to > dig into the python xml parsing code? > > Also, if anyone has a better idea of how to structure this code, I > would be very appreciative. > > Cheers, > > Julius > > ----------------------------------------------------- > http://openwetware.org/wiki/User:Lucks > ----------------------------------------------------- > > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From lucks at fas.harvard.edu Mon Mar 5 11:24:58 2007 From: lucks at fas.harvard.edu (Julius Lucks) Date: Mon, 5 Mar 2007 11:24:58 -0500 Subject: [BioPython] blast parsing errors In-Reply-To: <45EC2F6A.6090200@maubp.freeserve.co.uk> References: <1E0F24AF-A4F4-4818-B7AA-5D35BF7EA260@fas.harvard.edu> <45EC2F6A.6090200@maubp.freeserve.co.uk> Message-ID: <27D93A40-C5AC-4708-BF4C-0ADCFD413B46@fas.harvard.edu> Thanks guys, You are right - I am using BioPython 1.42, and python2.5 installed via fink on Mac OS X. I meant to use an amino acid sequence for the seq variable, and I have included the revised code snippet which uses the protein sequence that gave me trouble in the first place. However, there is no problem when using the current CVS code. Thanks for all of your help! I have 3 questions: 1.) Is the documentation for the new NCBIXML and NBCIWWW up to date? 2.) Why is NCBIXML.parse returning an iterator in this case since there is only one result? Or in other words, what are the use cases where an iterator is necessary? 3.) How are the fink packages of Biopython maintained? I am using the fink unstable tree, which means that I am getting the most current version that fink has. 
If Biopython 1.44 is substantially different from 1.42 (current fink), can we update the fink version faster than we currently do? Cheers, Julius ---------- code that works in Biopython 1.44 -------- from Bio import Fasta from Bio.Blast import NCBIWWW from Bio.Blast import NCBIXML import StringIO import re #BLAST cutoff cutoff = 1e-4 #Create a fasta record: title and seq are given title = 'test' seq = '\ MESFSVQAYLKATDNNFVSTFKDAAKQVQNF\ EKNTNSTMSTVGKVATSTGKTLTKAVTVPII\ GIGVAAAKIGGDFESQMSRVKAISGATGSSF\ EELRQQAIDLGAKTAFSAKESASGMENLASA\ GFNAKEIMEAMPGLLDLAAVSGGDVALASEN\ AATALRGFNLDASQSGHVANVFAKAAADTNA\ EVGDMGEAMKYIAPVANSMGLSIEEVSAAIG\ IMSDAGIKGSQAGTSLRGALSRLADPTDAMQ\ AKMDELGLSFYDSEGKMKPLKDQIGMLKDAF\ KGLTPEQQQNALVTLYGQESLSGMMALIDKG\ PDKLGKLTESLKNSDGAADKMAKTMQDNMNS\ SLEQMMGAFESAAIVVQKILAPAVRKVADSI\ SGLVDKFVSAPEPVQKMIVTIGLIVAAIGPL\ LVIFGQAVVTLQRVKVGFLALRSGLALIGGS\ FTAISLPVLGIIAAIAAVIAIGILVYKNWDK\ ISKFGKEVWANVKKFASDAAEVIKEKWGDIT\ QWFSDTWNNIKNGAKGLWDGTVQGAKNAVDS\ VKNAWNGIKEWFTNLWKGTTSGLSSAWDSVT\ TTLAPFVETIKTIFQPILDFFSGLWGQVQTI\ FGSAWEIIKTVVMGPVLLLIDLITGDFNQFK\ KDFAMLWQTLFTNIQTLVTTYVQIVVGFFTA\ WGQTVSNIWTTVVNTIQSLWGAFTTWVINMA\ KSIVDGIVNGWNSFKQGTVDLWNATVQWVKD\ TWASFKQWVVDSANAIVNGVKQGWENLKQGT\ IDLWNGMINGLKGIWDGLKQSVRNLIDNVKT\ TFNNLKNINLLDIGKAIIDGLVKGLKKKWED\ GMKFISGIGDWIRKHKGPIRKDRKLLIPAGK\ AIMTGLNSGLTGGFRNVQSNVSGMGDMIANA\ INSDYSVDIGANVAAANRSISSQVSHDVNLN\ QGKQPASFTVKLGNQIFKAFVDDISNAQGQA\ INLNMGF*' fasta_rec = Fasta.Record() #Sanitize title - blast does not like single quotes or \n in titles title = re.sub("'","prime",title) title = re.sub("\n","",title) fasta_rec.title = title fasta_rec.sequence = seq result_handle = NCBIWWW.qblast('blastp','nr',fasta_rec,ncbi_gi=1,expect=cutoff,format_type="XML",entrez_query="Viruses [ORGN]") b_records = NCBIXML.parse(result_handle) for b_record in b_records: print "%s found %i results" % (b_record.query, len(b_record.alignments)) for alignment in b_record.alignments: titles = alignment.title.split('>') print titles ----------
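As a side note, the split('>') in the loop above is there because titles of hits from the nr database can concatenate several definition lines into one string, separated by '>'. A tiny illustration on a made-up title (the string below is invented, not real BLAST output):

```python
# Invented nr-style title: two definition lines joined with '>'.
title = 'gi|123|ref|NP_000001.1| coat protein [Some virus] >gi|456|gb|AAA00001.1| coat protein'
titles = title.split('>')
# titles now holds one entry per definition line
```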
----------------------------------------------------- http://openwetware.org/wiki/User:Lucks ----------------------------------------------------- On Mar 5, 2007, at 9:55 AM, Peter wrote: > Julius Lucks wrote: >> Hi all, >> I am trying to parse a bunch of blast results that I gather via >> NCBIWWW.qblast(). I have the following code snipit: > > You didn't say which version of BioPython you are using, I would > guess 1.42 - there have been some Bio.Blast changes since than. > > Your example sequence was "ATCG", but you ran a "blastp" search. > Did you really mean the peptide Ala-Thr-Cys-Gly here? > > If you meant to do a nucleotide search, try using "blastn" and "nr" > instead. That should work better. > > However, there is still something funny going on. I tried your > example as is using the CVS code, and it fails before it even gets > the blast results back... > > Could you save the XML output to a file and email it to me; or even > better file a bug an attach the XML file to the bug. > > Thanks > > Peter From mdehoon at c2b2.columbia.edu Mon Mar 5 11:49:53 2007 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Mon, 05 Mar 2007 11:49:53 -0500 Subject: [BioPython] blast parsing errors In-Reply-To: <27D93A40-C5AC-4708-BF4C-0ADCFD413B46@fas.harvard.edu> References: <1E0F24AF-A4F4-4818-B7AA-5D35BF7EA260@fas.harvard.edu> <45EC2F6A.6090200@maubp.freeserve.co.uk> <27D93A40-C5AC-4708-BF4C-0ADCFD413B46@fas.harvard.edu> Message-ID: <45EC4A31.9010207@c2b2.columbia.edu> Julius Lucks wrote: > 1.) Is the documentation for the new NCBIXML and NBCIWWW up to date? No it is not. To ensure that the documentation on the website agrees with the current Biopython release, the idea was to update the documentation when the next Biopython release comes out. Originally we were planning to make a new Biopython release as soon as the new Bio.SeqIO code is done. 
However, I'd be happy to make a release in the immediate future without the new Bio.SeqIO, and make another one once Bio.SeqIO is ready. > 2.) Why is NCBIXML.parse returning an iterator in this case since there > is only one result? Or in other words, what are the use cases where an > iterator is necessary? If you're parsing multiple Blast search results at the same time. In other words, if the fasta file for the blast search looked like > gene1 ATAGCTACG... > gene2 ATCGATCGATGGCA... > gene3 .... Such a file can be very large, which is why we are using an iterator instead of a list. Now, one may argue that NCBIXML.parse should return a single record instead of an iterator if there's only one result. Others may argue that for consistency, it should always return an iterator. Either way is fine with me. Anybody have a strong opinion about this? > 3.) How are the fink packages of Biopython maintained? I don't know. But, it's not too difficult to install Biopython from the source distribution or from CVS. So if you want to be sure you have the latest version, you might want to try installing from CVS. --Michiel. -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From biopython at maubp.freeserve.co.uk Tue Mar 6 11:27:58 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 06 Mar 2007 16:27:58 +0000 Subject: [BioPython] Bio.Kabat and the Kabat Database Message-ID: <45ED968E.1010102@maubp.freeserve.co.uk> I've been looking though the modules in BioPython, and had a closer look at Bio.Kabat written by Katharine Lindner in 2001 to parse files from the Kabat database of proteins of immunological interest: http://www.kabatdatabase.com/ Quoting the website, > 01 September 2006 > > Interested parties may purchase the Database, in ASCII text > structured flat files, as well as an SQL relationship database (not > previously available), for $2250 US. 
> > This one-time license fee is unrestricted, except for distribution. > > Analysis Tools > > The searching and analysis tools are additionally available. > Included are generalized lookup, aligned sequence searching light > chain alignment, length distribution, positional correlation, > variability, and much more. Please contact for quote. Does anyone use the Bio.Kabat code? Could we (or should we) mark it as deprecated for the next release of BioPython? Peter From snakepit.rattlesnakes at gmail.com Mon Mar 12 05:59:25 2007 From: snakepit.rattlesnakes at gmail.com (Joydeep Mitra) Date: Mon, 12 Mar 2007 15:29:25 +0530 Subject: [BioPython] Retrieving the raw sequence from sequence object... Message-ID: <972566ff0703120259k3979c223r2172f631d48fa6fd@mail.gmail.com> Hi, I'm a student of bioinformatics (coming from a biological background). I've just started using biopython for parsing biological file formats. The Bio.Fasta module contains the fasta iterator object, which spits out sequence objects...of the form: Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACATAATAAT ...', IUPACAmbiguousDNA()) I want to retrieve the sequence in it's entirety and in raw format....how does one do that using an instance object? I've tried a few things without success...will be glad if some1 could show me how... Thanking in advance, Joy From sdavis2 at mail.nih.gov Mon Mar 12 06:20:11 2007 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Mon, 12 Mar 2007 06:20:11 -0400 Subject: [BioPython] Retrieving the raw sequence from sequence object... In-Reply-To: <972566ff0703120259k3979c223r2172f631d48fa6fd@mail.gmail.com> References: <972566ff0703120259k3979c223r2172f631d48fa6fd@mail.gmail.com> Message-ID: <200703120620.11871.sdavis2@mail.nih.gov> On Monday 12 March 2007 05:59, Joydeep Mitra wrote: > Hi, > I'm a student of bioinformatics (coming from a biological background). > > I've just started using biopython for parsing biological file formats. 
> The Bio.Fasta module contains the fasta iterator object, which spits out > sequence objects...of the form: > > Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACATAATAAT ...', > IUPACAmbiguousDNA()) > > I want to retrieve the sequence in it's entirety and in raw format....how > does one do that using an instance object? > I've tried a few things without success...will be glad if some1 could show > me how... If you have a sequence object, "myseq": myseq.tostring() See here for more details: http://biopython.org/DIST/docs/tutorial/Tutorial.html Section 2.2. Hope that helps. Sean From lucks at fas.harvard.edu Wed Mar 14 18:16:32 2007 From: lucks at fas.harvard.edu (Julius Lucks) Date: Wed, 14 Mar 2007 18:16:32 -0400 Subject: [BioPython] Biopython hackathon Message-ID: <21C70692-33D8-457E-AB5B-D4701E2704FB@fas.harvard.edu> Hi all, I was just chatting with Jason Stajich about the openbio hackathon that took place a few years ago. What biopython projects do people think would be appropriate for another hackathon in the near future? Are there things that have been on the TODO list for a while? New functionality that would benefit from a bunch of us getting together in one place (possibly with other openbio projects)? 
Cheers, Julius ----------------------------------------------------- http://openwetware.org/wiki/User:Lucks ----------------------------------------------------- From aloraine at gmail.com Wed Mar 14 19:43:23 2007 From: aloraine at gmail.com (Ann Loraine) Date: Wed, 14 Mar 2007 17:43:23 -0600 Subject: [BioPython] Biopython hackathon In-Reply-To: <21C70692-33D8-457E-AB5B-D4701E2704FB@fas.harvard.edu> References: <21C70692-33D8-457E-AB5B-D4701E2704FB@fas.harvard.edu> Message-ID: <83722dde0703141643q52c04a03l4d2b28926e8aff3a@mail.gmail.com> I hope you will consider the following two requests as possible hackathon activities: (1) If it does not already do this, it would be nice if the blast "plain text" (non-XML) parser would report the length of the target ("hit") sequence as well as the query. If I recall correctly, the last time I used the plain text blast parser, I had to measure the length of the targets by opening up the fasta copy of the blastable database and reading the lengths one-by-one. My database wasn't very big, so it wasn't a hassle to do this, but I can foresee situations where this kludge would fail. (2) Another request is for a Python interface to the Bio::Db::SQL database schema described in: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=12368253. The part that seems particularly valuable would be code that to construct the location-based queries. My apologies if this already exists -- if yes, please let me know where I can find it!! All the best, Ann Loraine On 3/14/07, Julius Lucks wrote: > Hi all, > > I was just chatting with Jason Stajich about the openbio hackathon > that took place a few years ago. What biopython projects do people > think would be appropriate for another hackathon in the near future? > Are there things that have been on the TODO list for a while? New > functionality that would benefit from a bunch of us getting together > in one place (possibly with other openbio projects)? 
> > Cheers, > > Julius > > ----------------------------------------------------- > http://openwetware.org/wiki/User:Lucks > ----------------------------------------------------- > > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From cjfields at uiuc.edu Wed Mar 14 20:29:43 2007 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 14 Mar 2007 19:29:43 -0500 Subject: [BioPython] Biopython hackathon In-Reply-To: <83722dde0703141643q52c04a03l4d2b28926e8aff3a@mail.gmail.com> References: <21C70692-33D8-457E-AB5B-D4701E2704FB@fas.harvard.edu> <83722dde0703141643q52c04a03l4d2b28926e8aff3a@mail.gmail.com> Message-ID: <8FC21D46-FEC1-43FF-B770-BC4D05E569D5@uiuc.edu> On Mar 14, 2007, at 6:43 PM, Ann Loraine wrote: > ... > (2) Another request is for a Python interface to the Bio::Db::SQL > database schema described in: > > http://www.ncbi.nlm.nih.gov/entrez/query.fcgi? > cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=12368253. > > The part that seems particularly valuable would be code that to > construct the location-based queries. Did you mean Bio::DB::GFF? AFAIK Lincoln is moving stuff over to a newer system that better facilitates GFF3, namely Bio::DB::SeqFeature. I'm not sure how well Bio::DB::GFF is supported currently. > My apologies if this already exists -- if yes, please let me know > where I can find it!! You can find out about the current state of GBrowse affairs by emailing the GBrowse mail list. Here's the sourceforge link: http://sourceforge.net/mailarchive/forum.php?forum_id=31947 Scott and Lincoln can indicate where their current focus is re: GFF3 and sequence feature database development. chris > All the best, > > Ann Loraine Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From aloraine at gmail.com Thu Mar 15 02:52:13 2007 From: aloraine at gmail.com (Ann Loraine) Date: Thu, 15 Mar 2007 00:52:13 -0600 Subject: [BioPython] Biopython hackathon In-Reply-To: <8FC21D46-FEC1-43FF-B770-BC4D05E569D5@uiuc.edu> References: <21C70692-33D8-457E-AB5B-D4701E2704FB@fas.harvard.edu> <83722dde0703141643q52c04a03l4d2b28926e8aff3a@mail.gmail.com> <8FC21D46-FEC1-43FF-B770-BC4D05E569D5@uiuc.edu> Message-ID: <83722dde0703142352y325698bag5d43c4a771bb1e5@mail.gmail.com> Thanks, I meant Bio::DB::GFF -- the schema shown in the paper. A simple schema that represents features on genomic sequence and easily supports fast region-based queries is what I'm after. The indexing scheme in the paper looked good, so I was hoping to find python code that would hide the details of formulating the SQL. My main goal is speed -- running the queries and then outputting data in GFF, bed, or DAS XML, as the need arises. -Ann On 3/14/07, Chris Fields wrote: > > On Mar 14, 2007, at 6:43 PM, Ann Loraine wrote: > > > ... > > (2) Another request is for a Python interface to the Bio::Db::SQL > > database schema described in: > > > > http://www.ncbi.nlm.nih.gov/entrez/query.fcgi? > > cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=12368253. > > > > The part that seems particularly valuable would be code that to > > construct the location-based queries. > > Did you mean Bio::DB::GFF? AFAIK Lincoln is moving stuff over to a > newer system that better facilitates GFF3, namely > Bio::DB::SeqFeature. I'm not sure how well Bio::DB::GFF is supported > currently. > > > My apologies if this already exists -- if yes, please let me know > > where I can find it!! > > You can find out about the current state of GBrowse affairs by > emailing the GBrowse mail list. 
Here's the sourceforge link: > > http://sourceforge.net/mailarchive/forum.php?forum_id=31947 > > Scott and Lincoln can indicate where their current focus is re: GFF3 > and sequence feature database development. > > chris > > > All the best, > > > > Ann Loraine > > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > -- Ann Loraine, Assistant Professor Departments of Genetics, Biostatistics, Computer and Information Sciences Associate Scientist, Comprehensive Cancer Center University of Alabama at Birmingham http://www.transvar.org 205-996-4155 From mdehoon at c2b2.columbia.edu Fri Mar 16 15:09:10 2007 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Fri, 16 Mar 2007 15:09:10 -0400 Subject: [BioPython] Bio.Kabat and the Kabat Database In-Reply-To: <45ED968E.1010102@maubp.freeserve.co.uk> References: <45ED968E.1010102@maubp.freeserve.co.uk> Message-ID: <45FAEB56.9080509@c2b2.columbia.edu> Peter wrote: > Does anyone use the Bio.Kabat code? Could we (or should we) mark it as > depreciated for the next release of BioPython? Since no users of Bio.Kabat came forward, I've marked it as deprecated for the upcoming release. This only means that importing Bio.Kabat will show a warning message, so the code is still usable. If still no users come forward, we can remove Bio.Kabat in a later release. --Michiel. -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From mdehoon at c2b2.columbia.edu Sat Mar 17 19:26:50 2007 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Sat, 17 Mar 2007 19:26:50 -0400 Subject: [BioPython] Biopython release 1.43 Message-ID: <45FC793A.2010106@c2b2.columbia.edu> Dear Biopythoneers, We are pleased to announce the release of Biopython 1.43. 
This release includes a brand-new set of parsers in Bio.SeqIO by Peter Cock for reading biological sequence files in various formats, an updated Blast XML parser in Bio.Blast.NCBIXML, a new UniGene flat-file parser by Sean Davis, and numerous improvements and bug fixes in Bio.PDB, Bio.SwissProt, Bio.Nexus, BioSQL, and others. Believe it or not, even the documentation was updated. Source distributions and Windows installers are available from the Biopython website at http://biopython.org. My thanks to all code contributers who made this new release possible. --Michiel on behalf of the Biopython developers -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From rohini.damle at gmail.com Mon Mar 19 17:08:36 2007 From: rohini.damle at gmail.com (Rohini Damle) Date: Mon, 19 Mar 2007 13:08:36 -0800 Subject: [BioPython] pdf articles from medline Message-ID: Hi, Does anyone know if there is any provision in Biopython to download PDF articles from Medline, if we have a list of pubmed ids? Thank you for your help. -Rohini. From sdavis2 at mail.nih.gov Mon Mar 19 18:37:23 2007 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Mon, 19 Mar 2007 18:37:23 -0400 Subject: [BioPython] pdf articles from medline In-Reply-To: References: Message-ID: <45FF10A3.7090708@mail.nih.gov> Rohini Damle wrote: > Hi, > Does anyone know if there is any provision in Biopython to download > PDF articles from Medline, if we have a list of pubmed ids? > Medline does not store the PDFs, in general, so I don't think that is possible. You could certainly scrape the HTML for links and then follow them, looking for a PDF link, but there isn't a general solution for all journals, etc. Sean From jingzhou2005 at gmail.com Wed Mar 21 11:49:20 2007 From: jingzhou2005 at gmail.com (Jing Zhou) Date: Wed, 21 Mar 2007 11:49:20 -0400 Subject: [BioPython] can PDBParser work on a pdb file without residue sequence information? 
Message-ID: <4b036ac00703210849n43b41ff8hd1f52784452f1d3c@mail.gmail.com> I want to parse a pseudo pdb file I generated for a bunch of points in space. There is no physical meaning to assign residue number or chain number to it. Here is the example of one line: ATOM 1 C -7.083 -6.182 50.181 1.00 (the atom type has no physical meaning either) I just want to be able to display the positions of these points in viewer. What I want to do is to use biopython to parse this pseudo pdb file and display it in vtk. (I am already able to see the points in pymol using this pseudo pdb file) Here is the function I tried to use to parse this pseudo pdb file: from Bio.PDB import PDBParser parser = PDBParser() structure = parser.get_structure('mypdb',mypseudopdbfile) Here is the error message: structure = parser.get_structure('mypdb',mypseudopdbfile) File "C:\Python24\lib\site-packages\Bio\PDB\PDBParser.py", line 66, in get_structure self._parse(file.readlines()) File "C:\Python24\lib\site-packages\Bio\PDB\PDBParser.py", line 87, in _parse self.trailer=self._parse_coordinates(coords_trailer) File "C:\Python24\lib\site-packages\Bio\PDB\PDBParser.py", line 144, in _parse_coordinates resseq=int(split(line[22:26])[0]) # sequence identifier IndexError: list index out of range It seems that the error happens to seek residue sequence information at column 22-26. Sure, I can try to add fake resseq info. but my question is whether there is another way around to neglect the index of residue sequence number? I just want to directly get the array of coordinates and display them. Thanks Jing From zsun at fas.harvard.edu Wed Mar 21 23:41:22 2007 From: zsun at fas.harvard.edu (Zachary Zhipeng Sun) Date: Wed, 21 Mar 2007 23:41:22 -0400 Subject: [BioPython] BLAST SNP access? Message-ID: <000c01c76c33$f98c93a0$eca5bae0$@harvard.edu> Hello, Thank you for the Bioython tools! They are proving increasingly useful in my research. 
I had a question regarding a tool extension, however - is there any link in biopython to query the BLAST SNP database, or does anyone know of this being under development? If not, I am not too familiar with the backend of biopython but I was looking to be able to automate BLAST SNP searches; does anyone have advice on how to start coding this into the biopython environment, or to a version of NCBIWWW.py ? Thanks for your help! Best, Zachary Sun From mdehoon at c2b2.columbia.edu Thu Mar 22 00:21:40 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Thu, 22 Mar 2007 00:21:40 -0400 Subject: [BioPython] BLAST SNP access? In-Reply-To: <000c01c76c33$f98c93a0$eca5bae0$@harvard.edu> References: <000c01c76c33$f98c93a0$eca5bae0$@harvard.edu> Message-ID: <46020454.2060100@c2b2.columbia.edu> Hi Zach, Can you give us an example of how you currently query the BLAST SNP database (without using Biopython)? Just to get an idea of how different it is from current BLAST searches with Biopython. --Michiel. Zachary Zhipeng Sun wrote: > Hello, > > > > Thank you for the Bioython tools! They are proving increasingly useful in my > research. I had a question regarding a tool extension, however - is there > any link in biopython to query the BLAST SNP database, or does anyone know > of this being under development? If not, I am not too familiar with the > backend of biopython but I was looking to be able to automate BLAST SNP > searches; does anyone have advice on how to start coding this into the > biopython environment, or to a version of NCBIWWW.py ? Thanks for your help! > > > > Best, > > Zachary Sun > > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From cjfields at uiuc.edu Thu Mar 22 00:38:11 2007 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 21 Mar 2007 23:38:11 -0500 Subject: [BioPython] BLAST SNP access? 
In-Reply-To: <46020454.2060100@c2b2.columbia.edu> References: <000c01c76c33$f98c93a0$eca5bae0$@harvard.edu> <46020454.2060100@c2b2.columbia.edu> Message-ID: <1CB2E066-3A73-429E-80B6-639F26FAD144@uiuc.edu> If you are using the NCBI URLAPI interface you can set the databases to anything on the following page, just follow the instructions: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/ remote_accessible_blastdblist.html This includes SNP data. I find this works for BioPerl's RemoteBlast (URLAPI-based). chris On Mar 21, 2007, at 11:21 PM, Michiel de Hoon wrote: > Hi Zach, > > Can you give us an example of how you currently query the BLAST SNP > database (without using Biopython)? Just to get an idea of how > different > it is from current BLAST searches with Biopython. > > --Michiel. > > Zachary Zhipeng Sun wrote: >> Hello, >> >> >> >> Thank you for the Bioython tools! They are proving increasingly >> useful in my >> research. I had a question regarding a tool extension, however - >> is there >> any link in biopython to query the BLAST SNP database, or does >> anyone know >> of this being under development? If not, I am not too familiar >> with the >> backend of biopython but I was looking to be able to automate >> BLAST SNP >> searches; does anyone have advice on how to start coding this into >> the >> biopython environment, or to a version of NCBIWWW.py ? Thanks for >> your help! >> >> >> >> Best, >> >> Zachary Sun >> >> >> >> _______________________________________________ >> BioPython mailing list - BioPython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From zsun at fas.harvard.edu Thu Mar 22 01:01:44 2007 From: zsun at fas.harvard.edu (Zachary Zhipeng Sun) Date: Thu, 22 Mar 2007 01:01:44 -0400 Subject: [BioPython] BLAST SNP access? In-Reply-To: <1CB2E066-3A73-429E-80B6-639F26FAD144@uiuc.edu> References: <000c01c76c33$f98c93a0$eca5bae0$@harvard.edu> <46020454.2060100@c2b2.columbia.edu> <1CB2E066-3A73-429E-80B6-639F26FAD144@uiuc.edu> Message-ID: <000601c76c3f$2fab1870$8f014950$@harvard.edu> Thanks for the replies! Regarding a BLAST SNP search, it uses the blastn or tblastn interface; put in query at http://www.ncbi.nlm.nih.gov/SNP/snp_blastByOrg.cgi, choose db, and result interface is similar to other BLAST components. (sample RID 1174538798-721-146214187572.BLASTQ2 for sample output), except that in the title of hits it shows an rs# which links to dbSNP. I'd imagine the search options would be different, as well as a little bit of parsing the output, but it uses the same engine as blastn. Regarding the URLAPI then: sorry, I'm pretty new to biopython, but is the qblast search feature in the current biopython 1.43 build based on commands to the NCBI URLAPI? If so, then (from someone with moderate experience in coding but little in python or perl) would I be able to painlessly modify the biopython NCBIWWW.py code for my own use? -Zach -----Original Message----- From: Chris Fields [mailto:cjfields at uiuc.edu] Sent: Thursday, March 22, 2007 12:38 AM To: Michiel de Hoon Cc: Zachary Zhipeng Sun; biopython at lists.open-bio.org Subject: Re: [BioPython] BLAST SNP access? If you are using the NCBI URLAPI interface you can set the databases to anything on the following page, just follow the instructions: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/ remote_accessible_blastdblist.html This includes SNP data. I find this works for BioPerl's RemoteBlast (URLAPI-based). 
chris On Mar 21, 2007, at 11:21 PM, Michiel de Hoon wrote: > Hi Zach, > > Can you give us an example of how you currently query the BLAST SNP > database (without using Biopython)? Just to get an idea of how > different > it is from current BLAST searches with Biopython. > > --Michiel. > > Zachary Zhipeng Sun wrote: >> Hello, >> >> >> >> Thank you for the Bioython tools! They are proving increasingly >> useful in my >> research. I had a question regarding a tool extension, however - >> is there >> any link in biopython to query the BLAST SNP database, or does >> anyone know >> of this being under development? If not, I am not too familiar >> with the >> backend of biopython but I was looking to be able to automate >> BLAST SNP >> searches; does anyone have advice on how to start coding this into >> the >> biopython environment, or to a version of NCBIWWW.py ? Thanks for >> your help! >> >> >> >> Best, >> >> Zachary Sun >> >> >> >> _______________________________________________ >> BioPython mailing list - BioPython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From mdehoon at c2b2.columbia.edu Thu Mar 22 14:54:59 2007 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Thu, 22 Mar 2007 14:54:59 -0400 Subject: [BioPython] BLAST SNP access? In-Reply-To: <000601c76c3f$2fab1870$8f014950$@harvard.edu> References: <000c01c76c33$f98c93a0$eca5bae0$@harvard.edu> <46020454.2060100@c2b2.columbia.edu> <1CB2E066-3A73-429E-80B6-639F26FAD144@uiuc.edu> <000601c76c3f$2fab1870$8f014950$@harvard.edu> Message-ID: <4602D103.20106@c2b2.columbia.edu> Zachary Zhipeng Sun wrote: > Thanks for the replies! 
Regarding a BLAST SNP search, it uses the blastn or > tblastn interface; put in query at > http://www.ncbi.nlm.nih.gov/SNP/snp_blastByOrg.cgi, choose db, and result > interface is similar to other BLAST components. (sample RID > 1174538798-721-146214187572.BLASTQ2 for sample output), except that in the > title of hits it shows an rs# which links to dbSNP. I'd imagine the search > options would be different, as well as a little bit of parsing the output, > but it uses the same engine as blastn. It looks like that if you know the name of the database (here "snp/human_9606/human_9606"), then you can run for example from Bio.Blast import NCBIWWW result_handle = NCBIWWW.qblast("blastn", "snp/human_9606/human_9606", seq) and then parse the results as usual (see section 3.4 in the Biopython tutorial). Check on the page that Chris sent you: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_accessible_blastdblist.html to find the correct name for the database. The result_handle should give you the same information as the web page, and Biopython's parser should parse all information from result_handle correctly. If you find that some information seems to be missing, please let us know. Hope this helps, --Michiel. -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From mauriceling at gmail.com Thu Mar 22 19:47:14 2007 From: mauriceling at gmail.com (Maurice Ling) Date: Fri, 23 Mar 2007 10:47:14 +1100 Subject: [BioPython] How to read SOFT files from GEO? Message-ID: <46031582.7060301@acm.org> Hi, Are there any examples to read SOFT file from GEO? I've looked in the cookbook and there isn't any mention about GEO. Thanks in advance, maurice From sdavis2 at mail.nih.gov Thu Mar 22 20:53:01 2007 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 22 Mar 2007 20:53:01 -0400 Subject: [BioPython] How to read SOFT files from GEO? 
In-Reply-To: <46031582.7060301@acm.org> References: <46031582.7060301@acm.org> Message-ID: <460324ED.8040504@mail.nih.gov> Maurice Ling wrote: > Hi, > > Are there any examples to read SOFT file from GEO? I've looked in the > cookbook and there isn't any mention about GEO. > I don't know of biopython code to do this, but there is a package for the R statistical language that will do this. The nice thing about doing this from R is that there are hundreds of tools that can then be applied to whatever data you like. The package is part of Bioconductor (http://www.bioconductor.org) and is called GEOquery. And of course, you can use R from python using Rpy. Sean P.S. Legal disclaimer--I am the author of said package, so take my words with the appropriate grain of salt. From mauriceling at gmail.com Thu Mar 22 21:06:07 2007 From: mauriceling at gmail.com (Maurice Ling) Date: Fri, 23 Mar 2007 12:06:07 +1100 Subject: [BioPython] How to read SOFT files from GEO? In-Reply-To: <460324ED.8040504@mail.nih.gov> References: <46031582.7060301@acm.org> <460324ED.8040504@mail.nih.gov> Message-ID: <460327FF.5070909@acm.org> Sean Davis wrote: > Maurice Ling wrote: > >> Hi, >> >> Are there any examples to read SOFT file from GEO? I've looked in the >> cookbook and there isn't any mention about GEO. >> > > I don't know of biopython code to do this, but there is a package for > the R statistical language that will do this. The nice thing about > doing this from R is that there are hundreds of tools that can then be > applied to whatever data you like. The package is part of > Bioconductor (http://www.bioconductor.org) and is called GEOquery. > And of course, you can use R from python using Rpy. > > Sean > > P.S. Legal disclaimer--I am the author of said package, so take my > words with the appropriate grain of salt. > In biopython's CVS, there is a subdirectory called Geo. So I thought that might be for SOFT files... ML P.S. So I know who to ask when I have questions about GEOquery. 
From sdavis2 at mail.nih.gov Thu Mar 22 21:44:42 2007 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 22 Mar 2007 21:44:42 -0400 Subject: [BioPython] How to read SOFT files from GEO? In-Reply-To: <460327FF.5070909@acm.org> References: <46031582.7060301@acm.org> <460324ED.8040504@mail.nih.gov> <460327FF.5070909@acm.org> Message-ID: <4603310A.20804@mail.nih.gov> Maurice Ling wrote: > Sean Davis wrote: > >> Maurice Ling wrote: >> >>> Hi, >>> >>> Are there any examples to read SOFT file from GEO? I've looked in >>> the cookbook and there isn't any mention about GEO. >>> >> >> I don't know of biopython code to do this, but there is a package for >> the R statistical language that will do this. The nice thing about >> doing this from R is that there are hundreds of tools that can then >> be applied to whatever data you like. The package is part of >> Bioconductor (http://www.bioconductor.org) and is called GEOquery. >> And of course, you can use R from python using Rpy. >> >> Sean >> >> P.S. Legal disclaimer--I am the author of said package, so take my >> words with the appropriate grain of salt. >> > In biopython's CVS, there is a subdirectory called Geo. So I thought > that might be for SOFT files... > I haven't tried it lately, but last I looked, it was not in-sync with GEO (which does infrequently change tags/formats). Also, I think it handles GDS only (if I remember correctly). Only a subset of GEO is available as GDS, since they require hand-curation by GEO staff. http://portal.open-bio.org/pipermail/biopython-dev/2006-May/002352.html > P.S. So I know who to ask when I have questions about GEOquery. Comments and questions are welcome and appreciated. 
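[Editorial note for readers of this thread: the SOFT format discussed above is line-oriented, so a rough reading of it fits in a few lines. The sketch below illustrates the `^` (entity), `!` (attribute), and `#` (column description) line conventions only; it is not the Bio.Geo or GEOquery API, and real SOFT files have more structure than this handles:]

```python
def parse_soft(lines):
    # Minimal SOFT reader: ^ starts a new entity, ! is an attribute,
    # # is a column description (ignored here), anything else is a
    # tab-separated data-table row belonging to the current entity.
    entities = []
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("^"):
            kind, _, name = line[1:].partition(" = ")
            current = {"type": kind, "name": name,
                       "attributes": {}, "table": []}
            entities.append(current)
        elif line.startswith("!") and current is not None:
            key, _, value = line[1:].partition(" = ")
            current["attributes"][key] = value
        elif line.startswith("#"):
            continue  # column descriptions, ignored in this sketch
        elif current is not None:
            current["table"].append(line.split("\t"))
    return entities

sample = [
    "^SAMPLE = GSM1000",
    "!Sample_title = test hybridization",
    "#ID_REF = probe identifier",
    "AFFX-1\t5.2",
]
records = parse_soft(sample)
```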
Sean From chris.lasher at gmail.com Sat Mar 24 12:35:04 2007 From: chris.lasher at gmail.com (Chris Lasher) Date: Sat, 24 Mar 2007 12:35:04 -0400 Subject: [BioPython] Biopython to migrate to Subversion Message-ID: <128a885f0703240935p5c139736yfe3142bdbc9d6ac4@mail.gmail.com> Hello Biopythonistas, The Biopython developers are currently planning a migration from CVS to Subversion as our revision control system. The target for the migration is the evening of Sunday May 20, 2007. This change will mostly impact the developers; however, this may also affect some users. If you're a user of Biopython and you... A) install Biopython from the Windows installer, from packages for your Linux distribution, or from Fink on OS X... you will not be affected. You may stop here or read on at your own leisure. ----- B) download and install from the source in the form of the Tarball or Zip file... you will not be affected. You may stop here or read on at your own leisure. ----- C) retrieve and install Biopython from the CVS repository... the Biopython devs would really like to hear from you! For those in category C, the change could mean that you will need a Subversion client installed on your computer. Clients exist for all major platforms, including Windows, OS X, and Linux. Subversion operates through HTTP/HTTPS (ports 80 and 443, respectively), and specifically uses WebDAV, an extended HTTP protocol. Though highly unusual, some organizations' networks may block WebDAV traffic. One way to check whether your organization does this is to attempt to checkout an existing Subversion repository from, for instance, a Google Code Project (all of which use Subversion repositories). For example, you can attempt to check out Kongulo . If you can checkout an existing repository, you will be ready to migrate to Biopython's Subversion repository once in place. If Subversion installation will not be possible, or your network indeed blocks WebDAV traffic, the Biopython devs need to know. 
We can support the CVS repository with read-only access and inject updates from the Subversion repository into the CVS repository. This involves a bit more work on the part of devs in setting up and supporting, so we will take this on only if the necessity exists. For clarity, Biopython will make a clean move to Subversion and drop CVS support unless we hear requests for legacy CVS support in advance of the migration. ----- If you're a developer of Biopython... you should be on the Biopython-dev list and have been following this thread: ----- We will document all things related to Biopython's migration to Subversion on the Biopython wiki at for interested parties. The developers look forward to a smooth transition and having Subversion in place to assist us in continually improving Biopython. Thank you for your time and feedback, Your friendly neighborhood Biopython developers From aloraine at gmail.com Tue Mar 27 01:31:05 2007 From: aloraine at gmail.com (Ann Loraine) Date: Mon, 26 Mar 2007 23:31:05 -0600 Subject: [BioPython] question regarding writing Seq objects in Fasta format Message-ID: <83722dde0703262231u15042538q479c2fb0b81a9590@mail.gmail.com> Hello, I have a question about how to write out Bio.Seq.Seq objects to a fasta format file. I've generated a lot of these by translating segments of genomic sequence -- see below. What objects or code should I use? It looks like Bio.SeqIO.FASTA.FastaWriter might be the right thing, but it doesn't appear to accept Bio.Seq.Seq objects. Do I need to create a new type of Seq-like object before I can use Bio.SeqIO.FASTA.FastaWriter? Thank you for your time! 
Yours,
Ann

xxx my code for generating the Seq objects I want to write

def feat2aaseq(feat,seq):
    """
    Function: translate the given feature
    Returns : a Bio.Seq.Seq [biopython]
    Args    : feat - feature.DNASeqFeature [not biopython]
              seq - a Bio.Seq.Seq, e.g., a chromosome
    """
    start = feat.start()
    end = feat.start()+feat.length()
    fullseq = seq.sequence
    subseq_ob = Seq(fullseq[start:end],IUPAC.unambiguous_dna)
    if feat.strand() == -1:
        subseq_ob = subseq_ob.reverse_complement()
    translator = Translate.unambiguous_dna_by_id[4]
    aaseq = translator.translate(subseq_ob)
    return aaseq

--
Ann Loraine, Assistant Professor
Departments of Genetics, Biostatistics, Computer and Information Sciences
Associate Scientist, Comprehensive Cancer Center
University of Alabama at Birmingham
http://www.transvar.org
205-996-4155

From biopython at maubp.freeserve.co.uk Tue Mar 27 07:35:54 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 27 Mar 2007 20:35:54 +0900
Subject: [BioPython] question regarding writing Seq objects in Fasta format
In-Reply-To: <83722dde0703262231u15042538q479c2fb0b81a9590@mail.gmail.com>
References: <83722dde0703262231u15042538q479c2fb0b81a9590@mail.gmail.com>
Message-ID: <320fb6e00703270435k34390ce2y46937ab07578d7a7@mail.gmail.com>

On 3/27/07, Ann Loraine wrote:
> Hello,
>
> I have a question about how to write out Bio.Seq.Seq objects to a
> fasta format file.
>
> I've generated a lot of these by translating segments of genomic
> sequence -- see below.
>
> What objects or code should I use?

I would suggest you try the new Bio.SeqIO code described here; you will need BioPython 1.43 or later:

http://biopython.org/wiki/SeqIO

You'll need to "upgrade" your Seq objects into SeqRecord objects and give them an identifier before calling Bio.SeqIO.write() on them. We'd welcome feedback on this, for example if there are any errors in the newly updated documentation.
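[Editor's note: Peter's suggestion above — wrap each plain Seq in a SeqRecord that carries an identifier, then hand the records to Bio.SeqIO.write() — might look roughly like the sketch below. The ids and sequences are invented for illustration, and the exact SeqRecord signature has shifted between Biopython releases.]

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# "Upgrade" plain Seq objects into SeqRecord objects so each one
# carries an identifier (ids and sequences here are made up).
records = [
    SeqRecord(Seq("MKVLA"), id="feature_1", description=""),
    SeqRecord(Seq("MSTNP"), id="feature_2", description=""),
]

# Bio.SeqIO.write() accepts a handle; a StringIO stands in for a file.
handle = StringIO()
SeqIO.write(records, handle, "fasta")
fasta_text = handle.getvalue()
print(fasta_text)
```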
> It looks like Bio.SeqIO.FASTA.FastaWriter might be the right thing,
> but it doesn't appear to accept Bio.Seq.Seq objects. Do I need to
> create a new type of Seq-like object before I can use
> Bio.SeqIO.FASTA.FastaWriter?

We plan to mark that particular (undocumented) bit of BioPython as deprecated in the next release - so I can't really recommend using it.

Personally, before working on Bio.SeqIO I used to write fasta files "by hand" using something like this:

handle = open("example.faa", "w")
for identifier, seq in some_list_of_tuples :
    handle.write(">%s\n%s\n" % (identifier, seq.tostring()))
handle.close()

where identifier is a string, and seq is a BioPython Seq object. There is a lot to be said for doing it "by hand" if you want full control over the description, for example.

Peter

From chris.lasher at gmail.com Wed Mar 28 13:19:12 2007
From: chris.lasher at gmail.com (Chris Lasher)
Date: Wed, 28 Mar 2007 13:19:12 -0400
Subject: [BioPython] Biopython to migrate to Subversion
In-Reply-To: <128a885f0703240935p5c139736yfe3142bdbc9d6ac4@mail.gmail.com>
References: <128a885f0703240935p5c139736yfe3142bdbc9d6ac4@mail.gmail.com>
Message-ID: <128a885f0703281019k6c64807dpa4e5d54621c944cc@mail.gmail.com>

I have a correction to make. Again, this only affects those of you who obtain your Biopython code via CVS.

On 3/24/07, Chris Lasher wrote:
> Subversion operates through HTTP/HTTPS (ports 80 and 443,
> respectively), and specifically uses WebDAV, an extended HTTP
> protocol. Though highly unusual, some organizations networks' may
> block WebDAV traffic. One way to check whether your organization does
> this is to attempt to checkout an existing Subversion repository from,
> for instance, a Google Code Project (all of which use Subversion
> repositories). For example, you can attempt to check out Kongulo
> . If you can checkout
> an existing repository, you will be ready to migrate to Biopython's
> Subversion repository once in place.
The Subversion server runs through SSH, *not* through WebDAV, and so is accessed in the same way that the CVS repository is now. If you can access the CVS repository now, you can access the Subversion repository once we implement it. Therefore, the only case in which we would need to provide legacy support is if you cannot get Subversion installed. If this is the case, please notify me as soon as possible.

Thanks,
Chris

From arareko at campus.iztacala.unam.mx Sat Mar 3 22:32:46 2007
From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra)
Date: Sat, 03 Mar 2007 16:32:46 -0600
Subject: [BioPython] [Bioperl-l] New Article on Approaches to Web Development for Bioinformatics
In-Reply-To: <1ad8057e0703021842y683853f5k1c97dbf362f20dda@mail.gmail.com>
References: <1ad8057e0703021842y683853f5k1c97dbf362f20dda@mail.gmail.com>
Message-ID: <45E9F78E.8040406@campus.iztacala.unam.mx>

Hi Alex,

I think you've put a very nice & concise introductory article. I'd like to comment a little on some sections I've read:

* Introduction

> "Given that you have an idea for analyzing or presenting data in a
> particular was, a complete bioinformatics web application depends of
> these basic pieces, which is what this article is all about:
>
> 1. A source of data...
> 2. An application programming language...
> 3. A web application platform...
> 4. Optionally, a data store...
> 5. Optionally, you would reuse software tools..."

Even though you do a small mention about Web Services at the very end of the article (under Application Integration -> Programmatic Integration), I believe that Web Services can be another optional (or even basic) piece of a web application. In fact, many web applications consist only of Web Services without HTML user interfaces.

* Application Development Languages

> "There are many different programming platforms and tools available to
> solve bioinformatics problems.
> It can be bewildering at first, but it
> makes more sense to build on top of some of these tools rather than
> build from scratch. Some the problems with using these tools for a
> bioinformatics portal are
>
> 1. Many tools are written...
> 2. Some tools have particular prerequisites...
> 3. Many may not be in a form...
> 4. The context that gives meaning...
>
> Standardization on a particular platform can help manageability but
> for most organizations a compromise between standardization and
> adoption of several different platforms will allow many people to
> develop software in platforms that they are already comfortable with
> and allow the reuse of a large amount of freely available software..."

I would add to the problems list the fact that building web (or other kinds of) applications on top of a platform whose codebase is constantly evolving can make them very difficult to maintain. The case of EnsEMBL comes to my mind here: they opted to stick with BioPerl 1.2.3 as a core library and haven't moved onto a higher version of it because the EnsEMBL code is so vast that a simple upgrade of BioPerl would break a lot of their code. AFAIK, it's because of this and the slowness of some parts of BioPerl that EnsEMBL is gradually saying goodbye to BioPerl.

Also, I think that depending on the amount of available code you plan to import into your application, sometimes having a whole platform at the very bottom can add unnecessary extra weight to your application. More weight can mean less speed, and this is critical in web development.

* Application Integration -> Navigation

> "The basic way that users will navigate into and around your
> application should be using HTTP GET and POST requests with specific
> URL's. Users bookmark these URL's and other applications will link to
> them.
> Most applications developers did not realize it at first, but
> these URL's are, in fact, an interface into your application that you
> must maintain in a consistent way as you change and evolve your
> software. Otherwise, they will find dead links..."

Just as I clicked the bookmark button for your article :) The same principle could apply to its filenames. A URL of the form http://medicalcomputing.net/tools_dna17.php is less indicative of the real content of the article and can mislead potential readers. Optimising the URL's will also make them easier for search engines to index; something like http://medicalcomputing.net/web-development-bioinformatics17.php would do the trick.

To conclude my comments, I was surprised to see a section about BioPHP and not about other better-known toolkits like BioPython or BioRuby. What about their role in web development? Python is also a common language for web programming, and with all the recent *hot* stuff like Ruby On Rails, it's very likely that both Bio* toolkits are more than ready for deploying web applications. I'm Cc'ing this to their respective mailing lists to see if someone wants to give you some feedback about them in order to complement your article. Other than that, I really liked your work :)

Cheers,
Mauricio.

Alex Amies wrote:
> I have written an article on Approaches to Web Development for
> Bioinformatics at
>
> http://medicalcomputing.net/tools_dna1.php
>
> There is a fairly large section on BioPerl at
>
> http://medicalcomputing.net/tools_dna13.php
>
> I hope that someone gets something useful out of it. I also looking for
> feedback on it and, in particular, please let me know about any mistakes in
> it.
>
> The intent of the article is to give an overview of various approaches to
> developing web based tools for bioinformatics.
> It describes the alternatives
> at each layer of the system, including the data layer and sources of data,
> the application programming layer, the web layer, and bioinformatics tools
> and software libraries.
>
> Alex
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

--
MAURICIO HERRERA CUADRA
arareko at campus.iztacala.unam.mx
Laboratorio de Genética
Unidad de Morfofisiología y Función
Facultad de Estudios Superiores Iztacala, UNAM

From alexamies at gmail.com Sun Mar 4 03:09:51 2007
From: alexamies at gmail.com (Alex Amies)
Date: Sat, 3 Mar 2007 19:09:51 -0800
Subject: [BioPython] [Bioperl-l] New Article on Approaches to Web Development for Bioinformatics
In-Reply-To: <45E9F78E.8040406@campus.iztacala.unam.mx>
References: <1ad8057e0703021842y683853f5k1c97dbf362f20dda@mail.gmail.com> <45E9F78E.8040406@campus.iztacala.unam.mx>
Message-ID: <1ad8057e0703031909v4880f5f1t3c4159b75c36bcca@mail.gmail.com>

Mauricio,

Thanks for your comments. You are right that I could have said a lot more about web services. I plan on doing that but I haven't got there yet. Actually, with all the hype about web services I have been surprised to find the programming model so complicated. As you mention, I certainly could have thought out my own URL's better.

I have been surprised not to find more PHP activity in bioinformatics. To me, besides being a lightweight and pleasant language to program in, it is incredibly economical for hosting Internet applications, and there is a huge open source community around PHP in general. The same can be said of Perl. It is because of my own ignorance and lack of time that I have not investigated Python and Ruby. I may do so in the future and write about them.

Alex

On 3/3/07, Mauricio Herrera Cuadra wrote:
> Hi Alex,
>
> I think you've put a very nice & concise introductory article.
I'd like > to comment a little on some sections I've read: > > * Introduction > > > "Given that you have an idea for analyzing or presenting data in a > > particular was, a complete bioinformatics web application depends of > > these basic pieces, which is what this article is all about: > > > > 1. A source of data... > > 2. An application programming language... > > 3. A web application platform... > > 4. Optionally, a data store... > > 5. Optionally, you would reuse software tools..." > > Even though you do a small mention about Web Services at the very end of > the article (under Application Integration -> Programmatic Integration), > I believe that Web Services can be another optional (or even basic) > piece of a web application. In fact, many web applications consist only > of Web Services without HTML user interfaces. > > * Application Development Languages > > > "There are many different programming platforms and tools available to > > solve bioinformatics problems. It can be bewildering at first, but it > > makes more sense to build on top of some of these tools rather than > > build from scratch. Some the problems with using these tools for a > > bioinformatics portal are > > > > 1. Many tools are written... > > 2. Some tools have particular prerequisites... > > 3. Many may not be in a form... > > 4. The context that gives meaning... > > > > Standardization on a particular platform can help manageability but > > for most organizations a compromise between standardization and > > adoption of several different platforms will allow many people to > > develop software in platforms that they are already comfortable with > > and allow the reuse of a large amount of freely available software..." > > I would add to the problems list the fact that building web (or other > kind of) applications on top of a platform whose codebase is evolving > constantly, can make them very difficult to maintain. 
The case of > EnsEMBL comes to my mind here: they opted to stick with BioPerl 1.2.3 as > a core library and haven't moved onto a higher version of it because the > EnsEMBL code is so vast, that a simple upgrade of BioPerl would break a > lot of their code. AFAIK, it's because of this and the slowness at some > parts of BioPerl that EnsEMBL is gradually saying goodbye to BioPerl. > > Also, I think that depending on the amount of available code you plan to > import into your application, sometimes having a whole platform at the > very bottom can add unnecessary extra weight to your application. More > weight could be equal to less speed, this is critical in web development. > > * Application Integration -> Navigation > > > "The basic way that users will navigate into and around your > > application should be using HTTP GET and POST requests with specific > > URL's. Users bookmark these URL's and other applications will link to > > them. Most applications developers did not realize it at first, but > > these URL's are, in fact, an interface into your application that you > > must maintain in a consistent way as you change and evolve your > > software. Otherwise, they will find dead links..." > > Just as I clicked the bookmark button for your article :) The same > principle could apply to its filenames. A URL of the form: > http://medicalcomputing.net/tools_dna17.php is less indicative of the > real content of the article and can mislead potential readers. > Optimising the URL's will make them better to be indexed by search > engines, something like: > http://medicalcomputing.net/web-development-bioinformatics17.php would > do the trick. > > To conclude my comments, I was surprised to see a section about BioPHP > and not about other more-known toolkits like BioPython or BioRuby. What > about their role in web development? 
Python is also a common language > for web programming and with all the recent *hot* stuff like Ruby On > Rails, it's very likely that both Bio* toolkits are more than ready for > deploying web applications. I'm Cc'ing this to their respective mailing > lists to see if someone wants to give you some feedback about them in > order to complement your article. Other than that, I really liked your > work :) > > Cheers, > Mauricio. > > Alex Amies wrote: > > I have written an article on Approaches to Web Development for > > Bioinformatics at > > > > http://medicalcomputing.net/tools_dna1.php > > > > There is a fairly large section on BioPerl at > > > > http://medicalcomputing.net/tools_dna13.php > > > > I hope that someone gets something useful out of it. I also looking for > > feedback on it and, in particular, please let me know about any mistakes in > > it. > > > > The intent of the article is to give an overview of various approaches to > > developing web based tools for bioinformatics. It describes the alternatives > > at each layer of the system, including the data layer and sources of data, > > the application programming layer, the web layer, and bioinformatics tools > > and software libraries. 
> >
> > Alex
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
>
> --
> MAURICIO HERRERA CUADRA
> arareko at campus.iztacala.unam.mx
> Laboratorio de Genética
> Unidad de Morfofisiología y Función
> Facultad de Estudios Superiores Iztacala, UNAM
>

From shahs at MIT.EDU Sun Mar 4 04:14:59 2007
From: shahs at MIT.EDU (Hossein Shahsavari)
Date: Sat, 03 Mar 2007 23:14:59 -0500
Subject: [BioPython] IOError: [Errno 2] No such file or directory:
Message-ID: <20070303231459.s0pe4qpb1128o0gw@webmail.mit.edu>

Hello,

I receive the following error when I am trying to access a file called HISTORY in another file with this command

template = '~/CSH/HISTORY'

and I get this error:

IOError: [Errno 2] No such file or directory: '~/CSH/HISTORY'

I use Python in a Linux environment. I appreciate any suggestions/comments.

Hossein Shahsavari

From biopython at maubp.freeserve.co.uk Sun Mar 4 11:58:07 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 04 Mar 2007 11:58:07 +0000
Subject: [BioPython] IOError: [Errno 2] No such file or directory:
In-Reply-To: <20070303231459.s0pe4qpb1128o0gw@webmail.mit.edu>
References: <20070303231459.s0pe4qpb1128o0gw@webmail.mit.edu>
Message-ID: <45EAB44F.1080909@maubp.freeserve.co.uk>

Hossein Shahsavari wrote:
> Hello,
>
> I receive the following error when I am trying to access a file called HISTORY
> in another file with this command
>
> template = '~/CSH/HISTORY'
>
> and I get this error:
>
> IOError: [Errno 2] No such file or directory: '~/CSH/HISTORY'
>
> I use Python in a Linux environment. I appreciate any suggestions/comments.

If you had posted the Python code, it would be easier to guess what is going wrong. What does this do?

import os
template = '~/CSH/HISTORY'
print os.path.isfile(template)

That should print either True or False.
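[Editor's note: the underlying issue is that Python's open() and os.path functions treat the shell's '~' literally; the standard library's os.path.expanduser() performs the expansion explicitly. A minimal sketch, using the path from Hossein's question:]

```python
import os

# open() and os.path.isfile() do not expand '~' themselves;
# expanduser() replaces a leading '~' with the user's home directory.
template = os.path.expanduser('~/CSH/HISTORY')
print(template)
print(os.path.isfile(template))
```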
You might also try replacing the tilde ('~') with the actual path of your home folder, something like this typically:

template = '/home/username/CSH/HISTORY'

P.S. Have you checked the case? Linux and Unix are case sensitive.

Peter

From shahs at MIT.EDU Sun Mar 4 16:57:34 2007
From: shahs at MIT.EDU (Hossein Shahsavari)
Date: Sun, 04 Mar 2007 11:57:34 -0500
Subject: [BioPython] IOError: [Errno 2] No such file or directory:
In-Reply-To: <45EAB44F.1080909@maubp.freeserve.co.uk>
References: <20070303231459.s0pe4qpb1128o0gw@webmail.mit.edu> <45EAB44F.1080909@maubp.freeserve.co.uk>
Message-ID: <20070304115734.xe5ms6c7vkv8k4wk@webmail.mit.edu>

Hi,

Thanks for your guidance. The problem was the tilde ('~'), which I replaced with the correct path, and now it works.

I have another, maybe simple, question: I have 26 files, namely output1, output2,...,output26. I can read them one by one, but how can I read them all in an easier way, like a loop? I put a "for loop" by setting

i=1

for i in range(1,27)
    template='outputi'

however, I got the same error as above, IOError: [Errno 2] No such file or directory: 'outputi'. It seems "i" can't be attached to the output.

Thanks a lot

Hossein

From biopython at maubp.freeserve.co.uk Sun Mar 4 17:34:26 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 04 Mar 2007 17:34:26 +0000
Subject: [BioPython] IOError: [Errno 2] No such file or directory:
In-Reply-To: <20070304115734.xe5ms6c7vkv8k4wk@webmail.mit.edu>
References: <20070303231459.s0pe4qpb1128o0gw@webmail.mit.edu> <45EAB44F.1080909@maubp.freeserve.co.uk> <20070304115734.xe5ms6c7vkv8k4wk@webmail.mit.edu>
Message-ID: <45EB0322.8050809@maubp.freeserve.co.uk>

Hossein Shahsavari wrote:
> I have another maybe simple question:
>
> I have 26 files namely output1, output2,...,output26. I can read them
> one by one but how can read them all by an easier way like a loop ?
> I put a "for loop" by setting
>
> i=1
>
> for i in range(1,27)
> template='outputi'
>
> however, I got the same error as above IOError: [Errno 2] No such file or
> directory: 'outputi'. It seems "i" can't be attached to the output.
>
> Thanks alot
>
> Hossein

You should really try a basic introduction to Python. There are lots of tutorials online, and great books too. Your questions so far are not really related to BioPython at all.

Note that indentation is very important in Python. You were also missing the colon at the end of the for line. More importantly, the following line just sets the variable template to the string 'outputi', and doesn't do anything with the variable i:

template='outputi'

You want to do something like this:

for i in range(1,27) :
    template = 'output' + str(i)
    print template

Good luck.

Peter

From lucks at fas.harvard.edu Mon Mar 5 14:13:02 2007
From: lucks at fas.harvard.edu (Julius Lucks)
Date: Mon, 5 Mar 2007 09:13:02 -0500
Subject: [BioPython] blast parsing errors
Message-ID: <1E0F24AF-A4F4-4818-B7AA-5D35BF7EA260@fas.harvard.edu>

Hi all,

I am trying to parse a bunch of blast results that I gather via NCBIWWW.qblast().
I have the following code snippet:

-----------

from Bio import Fasta
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
import StringIO
import re

#BLAST cutoff
cutoff = 1e-4

#Create a fasta record: title and seq are given
title = 'test'
seq = 'ATCG'
fasta_rec = Fasta.Record()

#Sanitize title - blast does not like single quotes or \n in titles
title = re.sub("'","prime",title)
title = re.sub("\n","",title)
fasta_rec.title = title
fasta_rec.sequence = seq

b_parser = NCBIXML.BlastParser()

result_handle = NCBIWWW.qblast('blastp','nr',fasta_rec,ncbi_gi=1,expect=cutoff,format_type="XML",entrez_query="Viruses [ORGN]")
blast_results = result_handle.read()

blast_handle = StringIO.StringIO(blast_results)
b_record = b_parser.parse(blast_handle)

for alignment in b_record.alignments:
    titles = alignment.title.split('>')
    print titles

-------------

The issue is that sometimes the blast parser chokes, with tracebacks like:

  File "./src/create_annotations.py", line 96, in get_blast_annotations
    b_record = b_parser.parse(blast_handle)
  File "/sw/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 112, in parse
    self._parser.parse(handler)
  File "/sw/lib/python2.5/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/sw/lib/python2.5/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/sw/lib/python2.5/xml/sax/expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "/sw/lib/python2.5/xml/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: :7:70: not well-formed (invalid token)

I am not sure which alignment it choked on, but I would like to rescue it with a try/except block if possible. But it seems to me that if I did something like

try:
    b_record = b_parser.parse(blast_handle)
except:
    ...

Then I would not get anything in b_record if an error is raised during parsing.
Rather, I would like to have whatever has been successful up to the point of the error stored in b_record.

Is there any way to do this via the BioPython API, or do I have to dig into the Python XML parsing code?

Also, if anyone has a better idea of how to structure this code, I would be very appreciative.

Cheers,

Julius

-----------------------------------------------------
http://openwetware.org/wiki/User:Lucks
-----------------------------------------------------

From biopython at maubp.freeserve.co.uk Mon Mar 5 14:55:38 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 05 Mar 2007 14:55:38 +0000
Subject: [BioPython] blast parsing errors
In-Reply-To: <1E0F24AF-A4F4-4818-B7AA-5D35BF7EA260@fas.harvard.edu>
References: <1E0F24AF-A4F4-4818-B7AA-5D35BF7EA260@fas.harvard.edu>
Message-ID: <45EC2F6A.6090200@maubp.freeserve.co.uk>

Julius Lucks wrote:
> Hi all,
>
> I am trying to parse a bunch of blast results that I gather via
> NCBIWWW.qblast(). I have the following code snipit:

You didn't say which version of BioPython you are using; I would guess 1.42 - there have been some Bio.Blast changes since then.

Your example sequence was "ATCG", but you ran a "blastp" search. Did you really mean the peptide Ala-Thr-Cys-Gly here?

If you meant to do a nucleotide search, try using "blastn" and "nr" instead. That should work better.

However, there is still something funny going on. I tried your example as is using the CVS code, and it fails before it even gets the blast results back...

Could you save the XML output to a file and email it to me; or even better, file a bug and attach the XML file to it.
Thanks

Peter

From biopython at maubp.freeserve.co.uk Mon Mar 5 15:12:25 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 05 Mar 2007 15:12:25 +0000
Subject: [BioPython] blast parsing errors
In-Reply-To: <1E0F24AF-A4F4-4818-B7AA-5D35BF7EA260@fas.harvard.edu>
References: <1E0F24AF-A4F4-4818-B7AA-5D35BF7EA260@fas.harvard.edu>
Message-ID: <45EC3359.1030802@maubp.freeserve.co.uk>

Julius Lucks wrote:
> Hi all,
>
> I am trying to parse a bunch of blast results that I gather via
> NCBIWWW.qblast(). I have the following code snipit:

I am wondering if your trivial example triggered some "unusual" error page from the NCBI...

I would suggest you update to CVS, as we have made a lot of changes to the Blast XML support. You would probably be safe just updating the following Bio.Blast files, located here on your machine:

/sw/lib/python2.5/site-packages/Bio/Blast/NCBIStandalone.py
/sw/lib/python2.5/site-packages/Bio/Blast/NCBIWWW.py
/sw/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py
/sw/lib/python2.5/site-packages/Bio/Blast/Record.py

If you don't know how to use CVS, then just back up the originals and replace them with the new files, downloaded one by one from here:

http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/?cvsroot=biopython

----------------------------------------------------------------------

This works for me using the CVS version of BioPython.
I have just made a string, rather than messing about with a fasta record object, to keep the code short:

#Protein example, BLASTP
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML

#BLAST cutoff
cutoff = 1e-4

fasta_rec = ">GI:121308427\nrslgmevmhernahnfpldlaavevpsing"

b_parser = NCBIXML.BlastParser()

result_handle = NCBIWWW.qblast('blastp', 'nr', fasta_rec, ncbi_gi=1, expect=cutoff, format_type="XML", entrez_query="Viruses [ORGN]")

#This returns a record iterator, changed after release of BioPython 1.42
b_records = b_parser.parse(result_handle)

for b_record in b_records :
    print "%s found %i results" % (b_record.query, len(b_record.alignments))
    for alignment in b_record.alignments:
        titles = alignment.title.split('>')
        print titles

Or, if you wanted to do a nucleotide BLASTN search, try:

fasta_rec = '>GI:121308427\nttagccatttatagatggaacttcaacagcagctaagtc' \
          + 'tagagggaaattgtgagcattacgctcgtgcatgacctccataccaagagatct'

and replace 'blastp' with 'blastn' in the call to qblast().

Peter

From mdehoon at c2b2.columbia.edu Mon Mar 5 15:36:43 2007
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Mon, 05 Mar 2007 10:36:43 -0500
Subject: [BioPython] blast parsing errors
In-Reply-To: <1E0F24AF-A4F4-4818-B7AA-5D35BF7EA260@fas.harvard.edu>
References: <1E0F24AF-A4F4-4818-B7AA-5D35BF7EA260@fas.harvard.edu>
Message-ID: <45EC390B.8020400@c2b2.columbia.edu>

Julius Lucks wrote:
> seq = 'ATCG'
> ...
> fasta_rec.sequence = seq
> ...
> result_handle = NCBIWWW.qblast
> ('blastp','nr',fasta_rec,ncbi_gi=1,expect=cutoff,format_type="XML",entre

You have a nucleotide sequence but are running a protein-protein blast with blastp. If you run this exact search with Blast through a browser, it will show you an error message. The function _parse_qblast_ref_page(handle), which is called from NCBIWWW.qblast, chokes on this error message.
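[Editor's note: one crude way to detect such an error page before handing the response to the XML parser is to peek at the start of the text: a successful BLAST XML report begins with an XML declaration, while NCBI error responses come back as HTML. The guard function below is an illustrative sketch under that assumption, not part of Biopython's API.]

```python
def looks_like_blast_xml(text):
    """Heuristic guard: True if text plausibly starts a BLAST XML report.

    A real XML report opens with an XML declaration; NCBI error
    responses are HTML pages. This check is an assumption made for
    illustration, not an official NCBI or Biopython contract.
    """
    head = text.lstrip()[:200].lower()
    return head.startswith("<?xml") and "<html" not in head

# An HTML error page would be rejected before reaching the XML parser:
print(looks_like_blast_xml('<?xml version="1.0"?>\n<!DOCTYPE BlastOutput>'))
print(looks_like_blast_xml('<html><head><title>NCBI Blast</title></head></html>'))
```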
If you want to make this more robust, one solution might be to check for error messages returned by the Blast server in _parse_qblast_ref_page.

By the way, the code can be simplified as follows:

from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML

#BLAST cutoff
cutoff = 1e-4

#Create a fasta record: title and seq are given
seq = 'ATCG'

b_parser = NCBIXML.BlastParser()
result_handle = NCBIWWW.qblast('blastn', 'nr', seq, ncbi_gi=1, expect=cutoff, format_type="XML", entrez_query="Viruses [ORGN]")
b_records = b_parser.parse(result_handle)
b_record = b_records[0]

for alignment in b_record.alignments:
    titles = alignment.title.split('>')
    print titles

--------------------------------------------

Note: the BlastParser currently in CVS returns a list of Blast records instead of a single Blast record, hence the b_records[0] above.

Btw, with NCBIXML currently in CVS, you don't need to create b_parser first:

result_handle = NCBIWWW.qblast('blastn', 'nr', seq, ncbi_gi=1, expect=cutoff, format_type="XML", entrez_query="Viruses [ORGN]")
b_records = NCBIXML.parse(result_handle)
b_record = b_records.next()

-----------------------------------------------

--Michiel.

--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From winter at biotec.tu-dresden.de Mon Mar 5 15:07:00 2007
From: winter at biotec.tu-dresden.de (Christof Winter)
Date: Mon, 05 Mar 2007 16:07:00 +0100
Subject: [BioPython] blast parsing errors
In-Reply-To: <1E0F24AF-A4F4-4818-B7AA-5D35BF7EA260@fas.harvard.edu>
References: <1E0F24AF-A4F4-4818-B7AA-5D35BF7EA260@fas.harvard.edu>
Message-ID: <45EC3214.5050100@biotec.tu-dresden.de>

Running your example, I get:

>>> ## working on region in file /tmp/python-18415Uda.py...
Traceback (most recent call last):
  File "", line 1, in ?
  File "/tmp/python-18415Uda.py", line 25, in ?
    result_handle = NCBIWWW.qblast('blastp','nr',fasta_rec,ncbi_gi=1,expect=cutoff,format_type="XML",entrez_query="Viruses [ORGN]")
  File "/var/lib/python-support/python2.4/Bio/Blast/NCBIWWW.py", line 1091, in qblast
    rid, rtoe = _parse_qblast_ref_page(handle)
  File "/var/lib/python-support/python2.4/Bio/Blast/NCBIWWW.py", line 1133, in _parse_qblast_ref_page
    return rid, int(rtoe)
ValueError: invalid literal for int(): NCBI Blast

> title = re.sub("\n","",title)
> fasta_rec.title = title
> fasta_rec.sequence = seq
>
> b_parser = NCBIXML.BlastParser()
>
> result_handle = NCBIWWW.qblast
> ('blastp','nr',fasta_rec,ncbi_gi=1,expect=cutoff,format_type="XML",entre
> z_query="Viruses [ORGN]")
> blast_results = result_handle.read()
>
> blast_handle = StringIO.StringIO(blast_results)
> b_record = b_parser.parse(blast_handle)
>
> for alignment in b_record.alignments:
> titles = alignment.title.split('>')
> print titles
>
> -------------
>
> The issue is sometimes the blast parser chokes with tracebacks like:
>
> File "./src/create_annotations.py", line 96, in get_blast_annotations
> b_record = b_parser.parse(blast_handle)
> File "/sw/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 112, in parse
> self._parser.parse(handler)
> File "/sw/lib/python2.5/xml/sax/expatreader.py", line 107, in parse
> xmlreader.IncrementalParser.parse(self, source)
> File "/sw/lib/python2.5/xml/sax/xmlreader.py", line 123, in parse
> self.feed(buffer)
> File "/sw/lib/python2.5/xml/sax/expatreader.py", line 211, in feed
> self._err_handler.fatalError(exc)
> File "/sw/lib/python2.5/xml/sax/handler.py", line 38, in fatalError
> raise exception
> xml.sax._exceptions.SAXParseException: :7:70: not well-formed (invalid token)
>
> I am not sure which alignment it choked on, but I would like to
> rescue it with a try/except block if possible. But it seems to me
> that if I did something like
>
> try:
> b_record = b_parser.parse(blast_handle)
> except:
> ...
>
> Then I would not get anything in b_record if an error raised in the
> parsing. Rather, I would like to have whatever has been successful
> up to the point of the error stored in b_record.
>
> Is there any way to do this via the BioPython API, or do I have to
> dig into the python xml parsing code?
>
> Also, if anyone has a better idea of how to structure this code, I
> would be very appreciative.
>
> Cheers,
>
> Julius
>
> -----------------------------------------------------
> http://openwetware.org/wiki/User:Lucks
> -----------------------------------------------------
>
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From lucks at fas.harvard.edu Mon Mar 5 16:24:58 2007
From: lucks at fas.harvard.edu (Julius Lucks)
Date: Mon, 5 Mar 2007 11:24:58 -0500
Subject: [BioPython] blast parsing errors
In-Reply-To: <45EC2F6A.6090200@maubp.freeserve.co.uk>
References: <1E0F24AF-A4F4-4818-B7AA-5D35BF7EA260@fas.harvard.edu> <45EC2F6A.6090200@maubp.freeserve.co.uk>
Message-ID: <27D93A40-C5AC-4708-BF4C-0ADCFD413B46@fas.harvard.edu>

Thanks guys,

You are right - I am using BioPython 1.42, and python2.5 installed via fink on Mac OS X. I meant to use an amino acid sequence for the seq variable, and I have included the revised code snippet, which uses the protein sequence that gave me trouble in the first place. However, there is no problem when using the current CVS code. Thanks for all of your help!

I have 3 questions:

1.) Is the documentation for the new NCBIXML and NCBIWWW up to date?

2.) Why is NCBIXML.parse returning an iterator in this case, since there is only one result? Or in other words, what are the use cases where an iterator is necessary?

3.) How are the fink packages of Biopython maintained? I am using the fink unstable tree, which means that I am getting the most current version that fink has.
If Biopython 1.44 is substantially different from 1.42 (current fink), can we update the fink version faster than we currently are?

Cheers,

Julius

---------- code that works in Biopython 1.44 --------

from Bio import Fasta
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
import StringIO
import re

# BLAST cutoff
cutoff = 1e-4

# Create a fasta record: title and seq are given
title = 'test'
seq = '\
MESFSVQAYLKATDNNFVSTFKDAAKQVQNF\
EKNTNSTMSTVGKVATSTGKTLTKAVTVPII\
GIGVAAAKIGGDFESQMSRVKAISGATGSSF\
EELRQQAIDLGAKTAFSAKESASGMENLASA\
GFNAKEIMEAMPGLLDLAAVSGGDVALASEN\
AATALRGFNLDASQSGHVANVFAKAAADTNA\
EVGDMGEAMKYIAPVANSMGLSIEEVSAAIG\
IMSDAGIKGSQAGTSLRGALSRLADPTDAMQ\
AKMDELGLSFYDSEGKMKPLKDQIGMLKDAF\
KGLTPEQQQNALVTLYGQESLSGMMALIDKG\
PDKLGKLTESLKNSDGAADKMAKTMQDNMNS\
SLEQMMGAFESAAIVVQKILAPAVRKVADSI\
SGLVDKFVSAPEPVQKMIVTIGLIVAAIGPL\
LVIFGQAVVTLQRVKVGFLALRSGLALIGGS\
FTAISLPVLGIIAAIAAVIAIGILVYKNWDK\
ISKFGKEVWANVKKFASDAAEVIKEKWGDIT\
QWFSDTWNNIKNGAKGLWDGTVQGAKNAVDS\
VKNAWNGIKEWFTNLWKGTTSGLSSAWDSVT\
TTLAPFVETIKTIFQPILDFFSGLWGQVQTI\
FGSAWEIIKTVVMGPVLLLIDLITGDFNQFK\
KDFAMLWQTLFTNIQTLVTTYVQIVVGFFTA\
WGQTVSNIWTTVVNTIQSLWGAFTTWVINMA\
KSIVDGIVNGWNSFKQGTVDLWNATVQWVKD\
TWASFKQWVVDSANAIVNGVKQGWENLKQGT\
IDLWNGMINGLKGIWDGLKQSVRNLIDNVKT\
TFNNLKNINLLDIGKAIIDGLVKGLKKKWED\
GMKFISGIGDWIRKHKGPIRKDRKLLIPAGK\
AIMTGLNSGLTGGFRNVQSNVSGMGDMIANA\
INSDYSVDIGANVAAANRSISSQVSHDVNLN\
QGKQPASFTVKLGNQIFKAFVDDISNAQGQA\
INLNMGF*'

fasta_rec = Fasta.Record()

# Sanitize title - blast does not like single quotes or \n in titles
title = re.sub("'","prime",title)
title = re.sub("\n","",title)
fasta_rec.title = title
fasta_rec.sequence = seq

result_handle = NCBIWWW.qblast('blastp','nr',fasta_rec,ncbi_gi=1,expect=cutoff,format_type="XML",entrez_query="Viruses [ORGN]")

b_records = NCBIXML.parse(result_handle)

for b_record in b_records:
    print "%s found %i results" % (b_record.query, len(b_record.alignments))
    for alignment in b_record.alignments:
        titles = alignment.title.split('>')
        print titles

----------
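[Editor's sketch] Earlier in this thread the open question was how to rescue the records that parsed successfully before NCBIXML hits a not-well-formed token. Biopython does not expose a switch for this, but the behaviour of the underlying SAX layer can be demonstrated with the standard library alone: handler events delivered before the SAXParseException is raised are kept. The `HitCollector` class and the toy `<hit>` markup below are invented purely for illustration (modern Python 3 syntax, not Biopython's actual parser):

```python
import xml.sax

class HitCollector(xml.sax.ContentHandler):
    """Collect the text of every complete <hit> element seen so far."""
    def __init__(self):
        self.hits = []
        self._buf = None
    def startElement(self, name, attrs):
        if name == "hit":
            self._buf = []
    def characters(self, content):
        if self._buf is not None:
            self._buf.append(content)
    def endElement(self, name):
        if name == "hit":
            self.hits.append("".join(self._buf))
            self._buf = None

# Two well-formed hits, then a bare '&' that triggers the same
# "not well-formed (invalid token)" fatal error as in the traceback.
broken_xml = b"<results><hit>A</hit><hit>B</hit><hit>C&</results>"

collector = HitCollector()
try:
    xml.sax.parseString(broken_xml, collector)
except xml.sax.SAXParseException:
    pass  # keep whatever was delivered before the bad token

print(collector.hits)  # hits completed before the error survive
```

The same wrap-and-keep pattern applies to any handler-based parser: the exception destroys nothing that was already handed to your callbacks, only the records still in flight.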
----------------------------------------------------- http://openwetware.org/wiki/User:Lucks ----------------------------------------------------- On Mar 5, 2007, at 9:55 AM, Peter wrote: > Julius Lucks wrote: >> Hi all, >> I am trying to parse a bunch of blast results that I gather via >> NCBIWWW.qblast(). I have the following code snipit: > > You didn't say which version of BioPython you are using, I would > guess 1.42 - there have been some Bio.Blast changes since than. > > Your example sequence was "ATCG", but you ran a "blastp" search. > Did you really mean the peptide Ala-Thr-Cys-Gly here? > > If you meant to do a nucleotide search, try using "blastn" and "nr" > instead. That should work better. > > However, there is still something funny going on. I tried your > example as is using the CVS code, and it fails before it even gets > the blast results back... > > Could you save the XML output to a file and email it to me; or even > better file a bug an attach the XML file to the bug. > > Thanks > > Peter From mdehoon at c2b2.columbia.edu Mon Mar 5 16:49:53 2007 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Mon, 05 Mar 2007 11:49:53 -0500 Subject: [BioPython] blast parsing errors In-Reply-To: <27D93A40-C5AC-4708-BF4C-0ADCFD413B46@fas.harvard.edu> References: <1E0F24AF-A4F4-4818-B7AA-5D35BF7EA260@fas.harvard.edu> <45EC2F6A.6090200@maubp.freeserve.co.uk> <27D93A40-C5AC-4708-BF4C-0ADCFD413B46@fas.harvard.edu> Message-ID: <45EC4A31.9010207@c2b2.columbia.edu> Julius Lucks wrote: > 1.) Is the documentation for the new NCBIXML and NBCIWWW up to date? No it is not. To ensure that the documentation on the website agrees with the current Biopython release, the idea was to update the documentation when the next Biopython release comes out. Originally we were planning to make a new Biopython release as soon as the new Bio.SeqIO code is done. 
However, I'd be happy to make a release in the immediate future without the new Bio.SeqIO, and make another one once Bio.SeqIO is ready. > 2.) Why is NCBIXML.parse returning an iterator in this case since there > is only one result? Or in other words, what are the use cases where an > iterator is necessary? If you're parsing multiple Blast search results at the same time. In other words, if the fasta file for the blast search looked like > gene1 ATAGCTACG... > gene2 ATCGATCGATGGCA... > gene3 .... Such a file can be very large, which is why we are using an iterator instead of a list. Now, one may argue that NCBIXML.parse should return a single record instead of an iterator if there's only one result. Others may argue that for consistency, it should always return an iterator. Either way is fine with me. Anybody have a strong opinion about this? > 3.) How are the fink packages of Biopython maintained? I don't know. But, it's not too difficult to install Biopython from the source distribution or from CVS. So if you want to be sure you have the latest version, you might want to try installing from CVS. --Michiel. -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From biopython at maubp.freeserve.co.uk Tue Mar 6 16:27:58 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 06 Mar 2007 16:27:58 +0000 Subject: [BioPython] Bio.Kabat and the Kabat Database Message-ID: <45ED968E.1010102@maubp.freeserve.co.uk> I've been looking though the modules in BioPython, and had a closer look at Bio.Kabat written by Katharine Lindner in 2001 to parse files from the Kabat database of proteins of immunological interest: http://www.kabatdatabase.com/ Quoting the website, > 01 September 2006 > > Interested parties may purchase the Database, in ASCII text > structured flat files, as well as an SQL relationship database (not > previously available), for $2250 US. 
> > This one-time license fee is unrestricted, except for distribution. > > Analysis Tools > > The searching and analysis tools are additionally available. > Included are generalized lookup, aligned sequence searching light > chain alignment, length distribution, positional correlation, > variability, and much more. Please contact for quote. Does anyone use the Bio.Kabat code? Could we (or should we) mark it as depreciated for the next release of BioPython? Peter From snakepit.rattlesnakes at gmail.com Mon Mar 12 09:59:25 2007 From: snakepit.rattlesnakes at gmail.com (Joydeep Mitra) Date: Mon, 12 Mar 2007 15:29:25 +0530 Subject: [BioPython] Retrieving the raw sequence from sequence object... Message-ID: <972566ff0703120259k3979c223r2172f631d48fa6fd@mail.gmail.com> Hi, I'm a student of bioinformatics (coming from a biological background). I've just started using biopython for parsing biological file formats. The Bio.Fasta module contains the fasta iterator object, which spits out sequence objects...of the form: Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACATAATAAT ...', IUPACAmbiguousDNA()) I want to retrieve the sequence in it's entirety and in raw format....how does one do that using an instance object? I've tried a few things without success...will be glad if some1 could show me how... Thanking in advance, Joy From sdavis2 at mail.nih.gov Mon Mar 12 10:20:11 2007 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Mon, 12 Mar 2007 06:20:11 -0400 Subject: [BioPython] Retrieving the raw sequence from sequence object... In-Reply-To: <972566ff0703120259k3979c223r2172f631d48fa6fd@mail.gmail.com> References: <972566ff0703120259k3979c223r2172f631d48fa6fd@mail.gmail.com> Message-ID: <200703120620.11871.sdavis2@mail.nih.gov> On Monday 12 March 2007 05:59, Joydeep Mitra wrote: > Hi, > I'm a student of bioinformatics (coming from a biological background). > > I've just started using biopython for parsing biological file formats. 
> The Bio.Fasta module contains the fasta iterator object, which spits out > sequence objects...of the form: > > Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACATAATAAT ...', > IUPACAmbiguousDNA()) > > I want to retrieve the sequence in it's entirety and in raw format....how > does one do that using an instance object? > I've tried a few things without success...will be glad if some1 could show > me how... If you have a sequence object, "myseq": myseq.tostring() See here for more details: http://biopython.org/DIST/docs/tutorial/Tutorial.html Section 2.2. Hope that helps. Sean From lucks at fas.harvard.edu Wed Mar 14 22:16:32 2007 From: lucks at fas.harvard.edu (Julius Lucks) Date: Wed, 14 Mar 2007 18:16:32 -0400 Subject: [BioPython] Biopython hackathon Message-ID: <21C70692-33D8-457E-AB5B-D4701E2704FB@fas.harvard.edu> Hi all, I was just chatting with Jason Stajich about the openbio hackathon that took place a few years ago. What biopython projects do people think would be appropriate for another hackathon in the near future? Are there things that have been on the TODO list for a while? New functionality that would benefit from a bunch of us getting together in one place (possibly with other openbio projects)? 
Cheers, Julius ----------------------------------------------------- http://openwetware.org/wiki/User:Lucks ----------------------------------------------------- From aloraine at gmail.com Wed Mar 14 23:43:23 2007 From: aloraine at gmail.com (Ann Loraine) Date: Wed, 14 Mar 2007 17:43:23 -0600 Subject: [BioPython] Biopython hackathon In-Reply-To: <21C70692-33D8-457E-AB5B-D4701E2704FB@fas.harvard.edu> References: <21C70692-33D8-457E-AB5B-D4701E2704FB@fas.harvard.edu> Message-ID: <83722dde0703141643q52c04a03l4d2b28926e8aff3a@mail.gmail.com> I hope you will consider the following two requests as possible hackathon activities: (1) If it does not already do this, it would be nice if the blast "plain text" (non-XML) parser would report the length of the target ("hit") sequence as well as the query. If I recall correctly, the last time I used the plain text blast parser, I had to measure the length of the targets by opening up the fasta copy of the blastable database and reading the lengths one-by-one. My database wasn't very big, so it wasn't a hassle to do this, but I can foresee situations where this kludge would fail. (2) Another request is for a Python interface to the Bio::Db::SQL database schema described in: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=12368253. The part that seems particularly valuable would be code that to construct the location-based queries. My apologies if this already exists -- if yes, please let me know where I can find it!! All the best, Ann Loraine On 3/14/07, Julius Lucks wrote: > Hi all, > > I was just chatting with Jason Stajich about the openbio hackathon > that took place a few years ago. What biopython projects do people > think would be appropriate for another hackathon in the near future? > Are there things that have been on the TODO list for a while? New > functionality that would benefit from a bunch of us getting together > in one place (possibly with other openbio projects)? 
> > Cheers, > > Julius > > ----------------------------------------------------- > http://openwetware.org/wiki/User:Lucks > ----------------------------------------------------- > > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From cjfields at uiuc.edu Thu Mar 15 00:29:43 2007 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 14 Mar 2007 19:29:43 -0500 Subject: [BioPython] Biopython hackathon In-Reply-To: <83722dde0703141643q52c04a03l4d2b28926e8aff3a@mail.gmail.com> References: <21C70692-33D8-457E-AB5B-D4701E2704FB@fas.harvard.edu> <83722dde0703141643q52c04a03l4d2b28926e8aff3a@mail.gmail.com> Message-ID: <8FC21D46-FEC1-43FF-B770-BC4D05E569D5@uiuc.edu> On Mar 14, 2007, at 6:43 PM, Ann Loraine wrote: > ... > (2) Another request is for a Python interface to the Bio::Db::SQL > database schema described in: > > http://www.ncbi.nlm.nih.gov/entrez/query.fcgi? > cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=12368253. > > The part that seems particularly valuable would be code that to > construct the location-based queries. Did you mean Bio::DB::GFF? AFAIK Lincoln is moving stuff over to a newer system that better facilitates GFF3, namely Bio::DB::SeqFeature. I'm not sure how well Bio::DB::GFF is supported currently. > My apologies if this already exists -- if yes, please let me know > where I can find it!! You can find out about the current state of GBrowse affairs by emailing the GBrowse mail list. Here's the sourceforge link: http://sourceforge.net/mailarchive/forum.php?forum_id=31947 Scott and Lincoln can indicate where their current focus is re: GFF3 and sequence feature database development. chris > All the best, > > Ann Loraine Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From aloraine at gmail.com Thu Mar 15 06:52:13 2007 From: aloraine at gmail.com (Ann Loraine) Date: Thu, 15 Mar 2007 00:52:13 -0600 Subject: [BioPython] Biopython hackathon In-Reply-To: <8FC21D46-FEC1-43FF-B770-BC4D05E569D5@uiuc.edu> References: <21C70692-33D8-457E-AB5B-D4701E2704FB@fas.harvard.edu> <83722dde0703141643q52c04a03l4d2b28926e8aff3a@mail.gmail.com> <8FC21D46-FEC1-43FF-B770-BC4D05E569D5@uiuc.edu> Message-ID: <83722dde0703142352y325698bag5d43c4a771bb1e5@mail.gmail.com> Thanks, I meant Bio::DB::GFF -- the schema shown in the paper. A simple schema that represents features on genomic sequence and easily supports fast region-based queries is what I'm after. The indexing scheme in the paper looked good, so I was hoping to find python code that would hide the details of formulating the SQL. My main goal is speed -- running the queries and then outputting data in GFF, bed, or DAS XML, as the need arises. -Ann On 3/14/07, Chris Fields wrote: > > On Mar 14, 2007, at 6:43 PM, Ann Loraine wrote: > > > ... > > (2) Another request is for a Python interface to the Bio::Db::SQL > > database schema described in: > > > > http://www.ncbi.nlm.nih.gov/entrez/query.fcgi? > > cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=12368253. > > > > The part that seems particularly valuable would be code that to > > construct the location-based queries. > > Did you mean Bio::DB::GFF? AFAIK Lincoln is moving stuff over to a > newer system that better facilitates GFF3, namely > Bio::DB::SeqFeature. I'm not sure how well Bio::DB::GFF is supported > currently. > > > My apologies if this already exists -- if yes, please let me know > > where I can find it!! > > You can find out about the current state of GBrowse affairs by > emailing the GBrowse mail list. 
Here's the sourceforge link: > > http://sourceforge.net/mailarchive/forum.php?forum_id=31947 > > Scott and Lincoln can indicate where their current focus is re: GFF3 > and sequence feature database development. > > chris > > > All the best, > > > > Ann Loraine > > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > -- Ann Loraine, Assistant Professor Departments of Genetics, Biostatistics, Computer and Information Sciences Associate Scientist, Comprehensive Cancer Center University of Alabama at Birmingham http://www.transvar.org 205-996-4155 From mdehoon at c2b2.columbia.edu Fri Mar 16 19:09:10 2007 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Fri, 16 Mar 2007 15:09:10 -0400 Subject: [BioPython] Bio.Kabat and the Kabat Database In-Reply-To: <45ED968E.1010102@maubp.freeserve.co.uk> References: <45ED968E.1010102@maubp.freeserve.co.uk> Message-ID: <45FAEB56.9080509@c2b2.columbia.edu> Peter wrote: > Does anyone use the Bio.Kabat code? Could we (or should we) mark it as > depreciated for the next release of BioPython? Since no users of Bio.Kabat came forward, I've marked it as deprecated for the upcoming release. This only means that importing Bio.Kabat will show a warning message, so the code is still usable. If still no users come forward, we can remove Bio.Kabat in a later release. --Michiel. -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From mdehoon at c2b2.columbia.edu Sat Mar 17 23:26:50 2007 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Sat, 17 Mar 2007 19:26:50 -0400 Subject: [BioPython] Biopython release 1.43 Message-ID: <45FC793A.2010106@c2b2.columbia.edu> Dear Biopythoneers, We are pleased to announce the release of Biopython 1.43. 
This release includes a brand-new set of parsers in Bio.SeqIO by Peter Cock for reading biological sequence files in various formats, an updated Blast XML parser in Bio.Blast.NCBIXML, a new UniGene flat-file parser by Sean Davis, and numerous improvements and bug fixes in Bio.PDB, Bio.SwissProt, Bio.Nexus, BioSQL, and others. Believe it or not, even the documentation was updated. Source distributions and Windows installers are available from the Biopython website at http://biopython.org. My thanks to all code contributers who made this new release possible. --Michiel on behalf of the Biopython developers -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From rohini.damle at gmail.com Mon Mar 19 21:08:36 2007 From: rohini.damle at gmail.com (Rohini Damle) Date: Mon, 19 Mar 2007 13:08:36 -0800 Subject: [BioPython] pdf articles from medline Message-ID: Hi, Does anyone know if there is any provision in Biopython to download PDF articles from Medline, if we have a list of pubmed ids? Thank you for your help. -Rohini. From sdavis2 at mail.nih.gov Mon Mar 19 22:37:23 2007 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Mon, 19 Mar 2007 18:37:23 -0400 Subject: [BioPython] pdf articles from medline In-Reply-To: References: Message-ID: <45FF10A3.7090708@mail.nih.gov> Rohini Damle wrote: > Hi, > Does anyone know if there is any provision in Biopython to download > PDF articles from Medline, if we have a list of pubmed ids? > Medline does not store the PDFs, in general, so I don't think that is possible. You could certainly scrape the HTML for links and then follow them, looking for a PDF link, but there isn't a general solution for all journals, etc. Sean From jingzhou2005 at gmail.com Wed Mar 21 15:49:20 2007 From: jingzhou2005 at gmail.com (Jing Zhou) Date: Wed, 21 Mar 2007 11:49:20 -0400 Subject: [BioPython] can PDBParser work on a pdb file without residue sequence information? 
Message-ID: <4b036ac00703210849n43b41ff8hd1f52784452f1d3c@mail.gmail.com> I want to parse a pseudo pdb file I generated for a bunch of points in space. There is no physical meaning to assign residue number or chain number to it. Here is the example of one line: ATOM 1 C -7.083 -6.182 50.181 1.00 (the atom type has no physical meaning either) I just want to be able to display the positions of these points in viewer. What I want to do is to use biopython to parse this pseudo pdb file and display it in vtk. (I am already able to see the points in pymol using this pseudo pdb file) Here is the function I tried to use to parse this pseudo pdb file: from Bio.PDB import PDBParser parser = PDBParser() structure = parser.get_structure('mypdb',mypseudopdbfile) Here is the error message: structure = parser.get_structure('mypdb',mypseudopdbfile) File "C:\Python24\lib\site-packages\Bio\PDB\PDBParser.py", line 66, in get_structure self._parse(file.readlines()) File "C:\Python24\lib\site-packages\Bio\PDB\PDBParser.py", line 87, in _parse self.trailer=self._parse_coordinates(coords_trailer) File "C:\Python24\lib\site-packages\Bio\PDB\PDBParser.py", line 144, in _parse_coordinates resseq=int(split(line[22:26])[0]) # sequence identifier IndexError: list index out of range It seems that the error happens to seek residue sequence information at column 22-26. Sure, I can try to add fake resseq info. but my question is whether there is another way around to neglect the index of residue sequence number? I just want to directly get the array of coordinates and display them. Thanks Jing From zsun at fas.harvard.edu Thu Mar 22 03:41:22 2007 From: zsun at fas.harvard.edu (Zachary Zhipeng Sun) Date: Wed, 21 Mar 2007 23:41:22 -0400 Subject: [BioPython] BLAST SNP access? Message-ID: <000c01c76c33$f98c93a0$eca5bae0$@harvard.edu> Hello, Thank you for the Bioython tools! They are proving increasingly useful in my research. 
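[Editor's sketch] For the pseudo-PDB question above: if the file exists only to carry point coordinates, one workaround is to skip Bio.PDB entirely and pull the numbers out of the ATOM lines by hand, which sidesteps the residue-sequence requirement altogether. A rough sketch, assuming whitespace-separated fields in the order shown in the example line (the helper name is hypothetical):

```python
def read_point_coords(lines):
    """Extract (x, y, z) from ATOM records of a pseudo PDB file,
    ignoring residue/chain bookkeeping entirely.

    Assumes whitespace-separated fields laid out like the example:
    ATOM 1 C -7.083 -6.182 50.181 1.00
    """
    coords = []
    for line in lines:
        fields = line.split()
        if fields and fields[0] == "ATOM":
            # fields[1] is the serial number, fields[2] the atom name;
            # the next three fields are the coordinates.
            x, y, z = (float(v) for v in fields[3:6])
            coords.append((x, y, z))
    return coords

points = read_point_coords(["ATOM 1 C -7.083 -6.182 50.181 1.00"])
print(points)  # [(-7.083, -6.182, 50.181)]
```

Note this relies on the file being whitespace-delimited as generated; real PDB files use fixed columns, which is exactly why PDBParser insists on the residue-number field.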
I had a question regarding a tool extension, however - is there any link in biopython to query the BLAST SNP database, or does anyone know of this being under development? If not, I am not too familiar with the backend of biopython but I was looking to be able to automate BLAST SNP searches; does anyone have advice on how to start coding this into the biopython environment, or to a version of NCBIWWW.py ? Thanks for your help! Best, Zachary Sun From mdehoon at c2b2.columbia.edu Thu Mar 22 04:21:40 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Thu, 22 Mar 2007 00:21:40 -0400 Subject: [BioPython] BLAST SNP access? In-Reply-To: <000c01c76c33$f98c93a0$eca5bae0$@harvard.edu> References: <000c01c76c33$f98c93a0$eca5bae0$@harvard.edu> Message-ID: <46020454.2060100@c2b2.columbia.edu> Hi Zach, Can you give us an example of how you currently query the BLAST SNP database (without using Biopython)? Just to get an idea of how different it is from current BLAST searches with Biopython. --Michiel. Zachary Zhipeng Sun wrote: > Hello, > > > > Thank you for the Bioython tools! They are proving increasingly useful in my > research. I had a question regarding a tool extension, however - is there > any link in biopython to query the BLAST SNP database, or does anyone know > of this being under development? If not, I am not too familiar with the > backend of biopython but I was looking to be able to automate BLAST SNP > searches; does anyone have advice on how to start coding this into the > biopython environment, or to a version of NCBIWWW.py ? Thanks for your help! > > > > Best, > > Zachary Sun > > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From cjfields at uiuc.edu Thu Mar 22 04:38:11 2007 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 21 Mar 2007 23:38:11 -0500 Subject: [BioPython] BLAST SNP access? 
In-Reply-To: <46020454.2060100@c2b2.columbia.edu> References: <000c01c76c33$f98c93a0$eca5bae0$@harvard.edu> <46020454.2060100@c2b2.columbia.edu> Message-ID: <1CB2E066-3A73-429E-80B6-639F26FAD144@uiuc.edu> If you are using the NCBI URLAPI interface you can set the databases to anything on the following page, just follow the instructions: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/ remote_accessible_blastdblist.html This includes SNP data. I find this works for BioPerl's RemoteBlast (URLAPI-based). chris On Mar 21, 2007, at 11:21 PM, Michiel de Hoon wrote: > Hi Zach, > > Can you give us an example of how you currently query the BLAST SNP > database (without using Biopython)? Just to get an idea of how > different > it is from current BLAST searches with Biopython. > > --Michiel. > > Zachary Zhipeng Sun wrote: >> Hello, >> >> >> >> Thank you for the Bioython tools! They are proving increasingly >> useful in my >> research. I had a question regarding a tool extension, however - >> is there >> any link in biopython to query the BLAST SNP database, or does >> anyone know >> of this being under development? If not, I am not too familiar >> with the >> backend of biopython but I was looking to be able to automate >> BLAST SNP >> searches; does anyone have advice on how to start coding this into >> the >> biopython environment, or to a version of NCBIWWW.py ? Thanks for >> your help! >> >> >> >> Best, >> >> Zachary Sun >> >> >> >> _______________________________________________ >> BioPython mailing list - BioPython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From zsun at fas.harvard.edu Thu Mar 22 05:01:44 2007 From: zsun at fas.harvard.edu (Zachary Zhipeng Sun) Date: Thu, 22 Mar 2007 01:01:44 -0400 Subject: [BioPython] BLAST SNP access? In-Reply-To: <1CB2E066-3A73-429E-80B6-639F26FAD144@uiuc.edu> References: <000c01c76c33$f98c93a0$eca5bae0$@harvard.edu> <46020454.2060100@c2b2.columbia.edu> <1CB2E066-3A73-429E-80B6-639F26FAD144@uiuc.edu> Message-ID: <000601c76c3f$2fab1870$8f014950$@harvard.edu> Thanks for the replies! Regarding a BLAST SNP search, it uses the blastn or tblastn interface; put in query at http://www.ncbi.nlm.nih.gov/SNP/snp_blastByOrg.cgi, choose db, and result interface is similar to other BLAST components. (sample RID 1174538798-721-146214187572.BLASTQ2 for sample output), except that in the title of hits it shows an rs# which links to dbSNP. I'd imagine the search options would be different, as well as a little bit of parsing the output, but it uses the same engine as blastn. Regarding the URLAPI then: sorry, I'm pretty new to biopython, but is the qblast search feature in the current biopython 1.43 build based on commands to the NCBI URLAPI? If so, then (from someone with moderate experience in coding but little in python or perl) would I be able to painlessly modify the biopython NCBIWWW.py code for my own use? -Zach -----Original Message----- From: Chris Fields [mailto:cjfields at uiuc.edu] Sent: Thursday, March 22, 2007 12:38 AM To: Michiel de Hoon Cc: Zachary Zhipeng Sun; biopython at lists.open-bio.org Subject: Re: [BioPython] BLAST SNP access? If you are using the NCBI URLAPI interface you can set the databases to anything on the following page, just follow the instructions: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/ remote_accessible_blastdblist.html This includes SNP data. I find this works for BioPerl's RemoteBlast (URLAPI-based). 
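[Editor's sketch] The URLAPI that Chris points to is plain HTTP underneath: a qblast-style client submits a search with a "Put" command and later fetches results with a "Get" plus the returned RID. Building the encoded parameter string is ordinary standard-library work; the parameter set below is illustrative (database name taken from this thread, query sequence a placeholder), so consult the URLAPI documentation for the full list:

```python
from urllib.parse import urlencode

# "Put" submits a search; a later "Get" with the returned RID
# retrieves the results.
params = {
    "CMD": "Put",
    "PROGRAM": "blastn",
    "DATABASE": "snp/human_9606/human_9606",  # from the remote db list
    "QUERY": "ACGTACGT",  # placeholder sequence
}
request_body = urlencode(params)
print(request_body)
```

Because the database is just another parameter, pointing an existing qblast-style client at the SNP data should need no changes to the request machinery itself, only a different DATABASE value.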
chris On Mar 21, 2007, at 11:21 PM, Michiel de Hoon wrote: > Hi Zach, > > Can you give us an example of how you currently query the BLAST SNP > database (without using Biopython)? Just to get an idea of how > different > it is from current BLAST searches with Biopython. > > --Michiel. > > Zachary Zhipeng Sun wrote: >> Hello, >> >> >> >> Thank you for the Bioython tools! They are proving increasingly >> useful in my >> research. I had a question regarding a tool extension, however - >> is there >> any link in biopython to query the BLAST SNP database, or does >> anyone know >> of this being under development? If not, I am not too familiar >> with the >> backend of biopython but I was looking to be able to automate >> BLAST SNP >> searches; does anyone have advice on how to start coding this into >> the >> biopython environment, or to a version of NCBIWWW.py ? Thanks for >> your help! >> >> >> >> Best, >> >> Zachary Sun >> >> >> >> _______________________________________________ >> BioPython mailing list - BioPython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From mdehoon at c2b2.columbia.edu Thu Mar 22 18:54:59 2007 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Thu, 22 Mar 2007 14:54:59 -0400 Subject: [BioPython] BLAST SNP access? In-Reply-To: <000601c76c3f$2fab1870$8f014950$@harvard.edu> References: <000c01c76c33$f98c93a0$eca5bae0$@harvard.edu> <46020454.2060100@c2b2.columbia.edu> <1CB2E066-3A73-429E-80B6-639F26FAD144@uiuc.edu> <000601c76c3f$2fab1870$8f014950$@harvard.edu> Message-ID: <4602D103.20106@c2b2.columbia.edu> Zachary Zhipeng Sun wrote: > Thanks for the replies! 
Regarding a BLAST SNP search, it uses the blastn or > tblastn interface; put in query at > http://www.ncbi.nlm.nih.gov/SNP/snp_blastByOrg.cgi, choose db, and result > interface is similar to other BLAST components. (sample RID > 1174538798-721-146214187572.BLASTQ2 for sample output), except that in the > title of hits it shows an rs# which links to dbSNP. I'd imagine the search > options would be different, as well as a little bit of parsing the output, > but it uses the same engine as blastn. It looks like that if you know the name of the database (here "snp/human_9606/human_9606"), then you can run for example from Bio.Blast import NCBIWWW result_handle = NCBIWWW.qblast("blastn", "snp/human_9606/human_9606", seq) and then parse the results as usual (see section 3.4 in the Biopython tutorial). Check on the page that Chris sent you: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_accessible_blastdblist.html to find the correct name for the database. The result_handle should give you the same information as the web page, and Biopython's parser should parse all information from result_handle correctly. If you find that some information seems to be missing, please let us know. Hope this helps, --Michiel. -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From mauriceling at gmail.com Thu Mar 22 23:47:14 2007 From: mauriceling at gmail.com (Maurice Ling) Date: Fri, 23 Mar 2007 10:47:14 +1100 Subject: [BioPython] How to read SOFT files from GEO? Message-ID: <46031582.7060301@acm.org> Hi, Are there any examples to read SOFT file from GEO? I've looked in the cookbook and there isn't any mention about GEO. Thanks in advance, maurice From sdavis2 at mail.nih.gov Fri Mar 23 00:53:01 2007 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 22 Mar 2007 20:53:01 -0400 Subject: [BioPython] How to read SOFT files from GEO? 
In-Reply-To: <46031582.7060301@acm.org> References: <46031582.7060301@acm.org> Message-ID: <460324ED.8040504@mail.nih.gov> Maurice Ling wrote: > Hi, > > Are there any examples to read SOFT file from GEO? I've looked in the > cookbook and there isn't any mention about GEO. > I don't know of biopython code to do this, but there is a package for the R statistical language that will do this. The nice thing about doing this from R is that there are hundreds of tools that can then be applied to whatever data you like. The package is part of Bioconductor (http://www.bioconductor.org) and is called GEOquery. And of course, you can use R from python using Rpy. Sean P.S. Legal disclaimer--I am the author of said package, so take my words with the appropriate grain of salt. From mauriceling at gmail.com Fri Mar 23 01:06:07 2007 From: mauriceling at gmail.com (Maurice Ling) Date: Fri, 23 Mar 2007 12:06:07 +1100 Subject: [BioPython] How to read SOFT files from GEO? In-Reply-To: <460324ED.8040504@mail.nih.gov> References: <46031582.7060301@acm.org> <460324ED.8040504@mail.nih.gov> Message-ID: <460327FF.5070909@acm.org> Sean Davis wrote: > Maurice Ling wrote: > >> Hi, >> >> Are there any examples to read SOFT file from GEO? I've looked in the >> cookbook and there isn't any mention about GEO. >> > > I don't know of biopython code to do this, but there is a package for > the R statistical language that will do this. The nice thing about > doing this from R is that there are hundreds of tools that can then be > applied to whatever data you like. The package is part of > Bioconductor (http://www.bioconductor.org) and is called GEOquery. > And of course, you can use R from python using Rpy. > > Sean > > P.S. Legal disclaimer--I am the author of said package, so take my > words with the appropriate grain of salt. > In biopython's CVS, there is a subdirectory called Geo. So I thought that might be for SOFT files... ML P.S. So I know who to ask when I have questions about GEOquery. 
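[Editor's sketch] Absent a maintained parser, SOFT is a line-oriented format that can be split by hand: "^" lines open an entity, "!" lines carry "key = value" attributes, "#" lines describe table columns, and remaining lines are tab-separated table rows. The sketch below is a simplification based on that layout (check the GEO SOFT documentation for corner cases such as repeated attribute keys):

```python
from io import StringIO

def parse_soft(handle):
    """Split a GEO SOFT file into entities.

    '^' lines start an entity, '!' lines are 'key = value' attributes,
    '#' lines describe table columns (skipped here), and anything else
    is treated as a tab-separated table row.
    """
    entities = []
    current = None
    for raw in handle:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("^"):
            etype, _, name = line[1:].partition(" = ")
            current = {"type": etype, "name": name,
                       "attributes": {}, "table": []}
            entities.append(current)
        elif current is None:
            continue  # ignore anything before the first entity
        elif line.startswith("!"):
            key, _, value = line[1:].partition(" = ")
            current["attributes"][key] = value
        elif not line.startswith("#"):
            current["table"].append(line.split("\t"))
    return entities

sample = StringIO("^SAMPLE = GSM1\n"
                  "!Sample_title = test sample\n"
                  "#VALUE = log ratio\n"
                  "ID_REF\tVALUE\n"
                  "A\t1.0\n")
for entity in parse_soft(sample):
    print(entity["type"], entity["name"], entity["attributes"])
```

For anything beyond quick scripting, the GEOquery route Sean describes is the better-tested option; this is only meant to show that the format itself is not a barrier.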
From sdavis2 at mail.nih.gov Fri Mar 23 01:44:42 2007
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 22 Mar 2007 21:44:42 -0400
Subject: [BioPython] How to read SOFT files from GEO?
In-Reply-To: <460327FF.5070909@acm.org>
References: <46031582.7060301@acm.org> <460324ED.8040504@mail.nih.gov> <460327FF.5070909@acm.org>
Message-ID: <4603310A.20804@mail.nih.gov>

Maurice Ling wrote:
> Sean Davis wrote:
>
>> Maurice Ling wrote:
>>
>>> Hi,
>>>
>>> Are there any examples to read SOFT files from GEO? I've looked in
>>> the cookbook and there isn't any mention of GEO.
>>>
>>
>> I don't know of Biopython code to do this, but there is a package for
>> the R statistical language that will do this. The nice thing about
>> doing this from R is that there are hundreds of tools that can then
>> be applied to whatever data you like. The package is part of
>> Bioconductor (http://www.bioconductor.org) and is called GEOquery.
>> And of course, you can use R from Python using Rpy.
>>
>> Sean
>>
>> P.S. Legal disclaimer--I am the author of said package, so take my
>> words with the appropriate grain of salt.
>>
> In Biopython's CVS, there is a subdirectory called Geo. So I thought
> that might be for SOFT files...
>

I haven't tried it lately, but last I looked, it was not in sync with
GEO (which does infrequently change tags/formats). Also, I think it
handles GDS only (if I remember correctly). Only a subset of GEO is
available as GDS, since they require hand-curation by GEO staff.

http://portal.open-bio.org/pipermail/biopython-dev/2006-May/002352.html

> P.S. So I know who to ask when I have questions about GEOquery.

Comments and questions are welcome and appreciated.
Sean

From chris.lasher at gmail.com Sat Mar 24 16:35:04 2007
From: chris.lasher at gmail.com (Chris Lasher)
Date: Sat, 24 Mar 2007 12:35:04 -0400
Subject: [BioPython] Biopython to migrate to Subversion
Message-ID: <128a885f0703240935p5c139736yfe3142bdbc9d6ac4@mail.gmail.com>

Hello Biopythonistas,

The Biopython developers are currently planning a migration from CVS
to Subversion as our revision control system. The target for the
migration is the evening of Sunday, May 20, 2007. This change will
mostly impact the developers; however, it may also affect some users.

If you're a user of Biopython and you...

A) install Biopython from the Windows installer, from packages for
your Linux distribution, or from Fink on OS X... you will not be
affected. You may stop here or read on at your own leisure.

-----

B) download and install from the source in the form of the tarball or
zip file... you will not be affected. You may stop here or read on at
your own leisure.

-----

C) retrieve and install Biopython from the CVS repository... the
Biopython devs would really like to hear from you!

For those in category C, the change could mean that you will need a
Subversion client installed on your computer. Clients exist for all
major platforms, including Windows, OS X, and Linux.

Subversion operates through HTTP/HTTPS (ports 80 and 443,
respectively), and specifically uses WebDAV, an extended HTTP
protocol. Though highly unusual, some organizations' networks may
block WebDAV traffic. One way to check whether your organization does
this is to attempt to check out an existing Subversion repository
from, for instance, a Google Code project (all of which use Subversion
repositories). For example, you can attempt to check out Kongulo. If
you can check out an existing repository, you will be ready to migrate
to Biopython's Subversion repository once it is in place.

If Subversion installation will not be possible, or your network
indeed blocks WebDAV traffic, the Biopython devs need to know.
We can support the CVS repository with read-only access and inject
updates from the Subversion repository into the CVS repository. This
involves a bit more work on the part of the devs to set up and
support, so we will take this on only if the necessity exists. For
clarity, Biopython will make a clean move to Subversion and drop CVS
support unless we hear requests for legacy CVS support in advance of
the migration.

-----

If you're a developer of Biopython... you should be on the
Biopython-dev list and have been following this thread:

-----

We will document all things related to Biopython's migration to
Subversion on the Biopython wiki at for interested parties.

The developers look forward to a smooth transition and to having
Subversion in place to assist us in continually improving Biopython.

Thank you for your time and feedback,

Your friendly neighborhood Biopython developers

From aloraine at gmail.com Tue Mar 27 05:31:05 2007
From: aloraine at gmail.com (Ann Loraine)
Date: Mon, 26 Mar 2007 23:31:05 -0600
Subject: [BioPython] question regarding writing Seq objects in Fasta format
Message-ID: <83722dde0703262231u15042538q479c2fb0b81a9590@mail.gmail.com>

Hello,

I have a question about how to write out Bio.Seq.Seq objects to a
fasta format file. I've generated a lot of these by translating
segments of genomic sequence -- see below.

What objects or code should I use? It looks like
Bio.SeqIO.FASTA.FastaWriter might be the right thing, but it doesn't
appear to accept Bio.Seq.Seq objects. Do I need to create a new type
of Seq-like object before I can use Bio.SeqIO.FASTA.FastaWriter?

Thank you for your time!
Yours,
Ann

xxx my code for generating the Seq objects I want to write

def feat2aaseq(feat, seq):
    """
    Function: translate the given feature
    Returns : a Bio.Seq.Seq [biopython]
    Args    : feat - feature.DNASeqFeature [not biopython]
              seq - a Bio.Seq.Seq, e.g., a chromosome
    """
    start = feat.start()
    end = feat.start() + feat.length()
    fullseq = seq.sequence
    subseq_ob = Seq(fullseq[start:end], IUPAC.unambiguous_dna)
    if feat.strand() == -1:
        subseq_ob = subseq_ob.reverse_complement()
    # NCBI translation table 4
    translator = Translate.unambiguous_dna_by_id[4]
    aaseq = translator.translate(subseq_ob)
    return aaseq

--
Ann Loraine, Assistant Professor
Departments of Genetics, Biostatistics, Computer and Information Sciences
Associate Scientist, Comprehensive Cancer Center
University of Alabama at Birmingham
http://www.transvar.org
205-996-4155

From biopython at maubp.freeserve.co.uk Tue Mar 27 11:35:54 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 27 Mar 2007 20:35:54 +0900
Subject: [BioPython] question regarding writing Seq objects in Fasta format
In-Reply-To: <83722dde0703262231u15042538q479c2fb0b81a9590@mail.gmail.com>
References: <83722dde0703262231u15042538q479c2fb0b81a9590@mail.gmail.com>
Message-ID: <320fb6e00703270435k34390ce2y46937ab07578d7a7@mail.gmail.com>

On 3/27/07, Ann Loraine wrote:
> Hello,
>
> I have a question about how to write out Bio.Seq.Seq objects to a
> fasta format file.
>
> I've generated a lot of these by translating segments of genomic
> sequence -- see below.
>
> What objects or code should I use?

I would suggest you try the new Bio.SeqIO code described here; you
will need Biopython 1.43 or later:

http://biopython.org/wiki/SeqIO

You'll need to "upgrade" your Seq objects into SeqRecord objects and
give them an identifier before calling Bio.SeqIO.write() on them.

We'd welcome feedback on this, for example if there are any errors in
the newly written/updated documentation.
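[For readers of the archive: the "upgrade" Peter describes might look
something like the sketch below. It assumes a modern Biopython (1.78
or later, where Seq objects no longer take an alphabet argument, unlike
the 1.43-era API in this thread); the sequence strings and the feat1/
feat2 identifiers are made up, standing in for feat2aaseq() output.]

```python
import io
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

# Made-up translated sequences, keyed by a made-up feature name.
translations = {"feat1": Seq("MKVLT"), "feat2": Seq("MTEYKLVV")}

# Wrap each plain Seq in a SeqRecord so it carries an id for the
# FASTA header line; an empty description keeps the header minimal.
records = [SeqRecord(seq, id=name, description="")
           for name, seq in translations.items()]

# Bio.SeqIO.write accepts a filename or an open handle; a StringIO
# handle makes the output easy to inspect here.
handle = io.StringIO()
SeqIO.write(records, handle, "fasta")
fasta_text = handle.getvalue()
print(fasta_text)
```

In a real script you would pass a filename (e.g. "out.faa") instead of
the StringIO handle.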
> It looks like Bio.SeqIO.FASTA.FastaWriter might be the right thing,
> but it doesn't appear to accept Bio.Seq.Seq objects. Do I need to
> create a new type of Seq-like object before I can use
> Bio.SeqIO.FASTA.FastaWriter?

We plan to mark that particular (undocumented) bit of Biopython as
deprecated in the next release - so I can't really recommend using it.

Personally, before working on Bio.SeqIO I used to write fasta files
"by hand" using something like this:

handle = open("example.faa", "w")
for identifier, seq in some_list_of_tuples:
    handle.write(">%s\n%s\n" % (identifier, seq.tostring()))
handle.close()

where identifier is a string, and seq is a Biopython Seq object. There
is a lot to be said for doing it "by hand" if you want full control
over the description, for example.

Peter

From chris.lasher at gmail.com Wed Mar 28 17:19:12 2007
From: chris.lasher at gmail.com (Chris Lasher)
Date: Wed, 28 Mar 2007 13:19:12 -0400
Subject: [BioPython] Biopython to migrate to Subversion
In-Reply-To: <128a885f0703240935p5c139736yfe3142bdbc9d6ac4@mail.gmail.com>
References: <128a885f0703240935p5c139736yfe3142bdbc9d6ac4@mail.gmail.com>
Message-ID: <128a885f0703281019k6c64807dpa4e5d54621c944cc@mail.gmail.com>

I have a correction to make. Again, this only affects those of you who
obtain your Biopython code via CVS.

On 3/24/07, Chris Lasher wrote:
> Subversion operates through HTTP/HTTPS (ports 80 and 443,
> respectively), and specifically uses WebDAV, an extended HTTP
> protocol. Though highly unusual, some organizations' networks may
> block WebDAV traffic. One way to check whether your organization does
> this is to attempt to check out an existing Subversion repository
> from, for instance, a Google Code project (all of which use Subversion
> repositories). For example, you can attempt to check out Kongulo. If
> you can check out an existing repository, you will be ready to migrate
> to Biopython's Subversion repository once it is in place.
The Subversion server runs through SSH, *not* through WebDAV, and so
is accessed in the same way that the CVS repository is now. If you can
access the CVS repository now, you will be able to access the
Subversion repository once we implement it. Therefore, the only case
in which legacy CVS support will be needed is if you cannot get
Subversion installed. If this is the case, please notify me as soon as
possible.

Thanks,
Chris