From dag23 at duke.edu Fri Dec 1 13:52:01 2006 From: dag23 at duke.edu (dag23 at duke.edu) Date: Fri, 1 Dec 2006 13:52:01 -0500 Subject: [BioPython] mxTextTools for new Macs Message-ID: <20061201135201.ewz1p2uz3ko4gss4@webmail.duke.edu> Hi List of helpful people, I've a new MacBookPro and in attempting to install mxTools, I've hit major snags. While this isn't a BioPython issue per se, it might impact those trying to install the BioPython required files on the new architecture. Has anyone had similar issues? Cheers, David From mdehoon at c2b2.columbia.edu Sun Dec 3 09:56:18 2006 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sun, 03 Dec 2006 09:56:18 -0500 Subject: [BioPython] mxTextTools for new Macs In-Reply-To: <20061201135201.ewz1p2uz3ko4gss4@webmail.duke.edu> References: <20061201135201.ewz1p2uz3ko4gss4@webmail.duke.edu> Message-ID: <4572E592.1090004@c2b2.columbia.edu> I didn't have any problems installing mxTools on a MacBook Pro. What snags did you run into? --Michiel. dag23 at duke.edu wrote: > Hi List of helpful people, > > I've a new MacBookPro and in attempting to install mxTools, I've hit major > snags. While this isn't a BioPython issue per se, it might impact those trying > to install the BioPython required files on the new architecture. Has anyone had > similar issues? > > Cheers, > > David > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From aloraine at gmail.com Fri Dec 8 03:37:13 2006 From: aloraine at gmail.com (Ann Loraine) Date: Fri, 8 Dec 2006 02:37:13 -0600 Subject: [BioPython] question regarding unicode, biopython Seq object, DAS Message-ID: <83722dde0612080037i33884c20jafcb4b24ac7b6815@mail.gmail.com> Hello, I'm attempting to get sequence data from a DAS server (UCSC, DAS1) and am having what appears to be a unicode-related problem - if you have any insights or advice, I'd be grateful for the help. 
I'm running biopython v. 1.42 on Mac OS X 10.3.9.

My sax parser delivers character (sequence) data as unicode, but when
I make a Seq object from the unicode string and then try to reverse
complement the sequence, I get an exception:

TypeError: character mapping must return integer, None or unicode

So I tried this:

>>> from Bio.Alphabet import IUPAC
>>> from Bio.Seq import Seq
>>> s = Seq(u'atcg',IUPAC.unambiguous_dna)
>>> s.reverse_complement()
Traceback (most recent call last):
  File "", line 1, in ?
  File "/usr/local/lib/python2.4/site-packages/Bio/Seq.py", line 117, in reverse_complement
    s = self.data[-1::-1].translate(ttable)
TypeError: character mapping must return integer, None or unicode
>>> s = Seq('atcg',IUPAC.unambiguous_dna) # note: no longer unicode
>>> s.reverse_complement()
Seq('cgat', IUPACUnambiguousDNA())

An example access of the UCSC DAS1 site follows. In my code I'm using a
SAX parser to get the data, but this demonstrates a bit of how the DAS
aspect works:

>>> u = 'http://genome.cse.ucsc.edu/cgi-bin/das/hg17/dna?segment=1:158288275,158302415'
>>> import urllib
>>> fh = urllib.urlopen(u)
>>> fh.readline()
'\n'
>>> fh.readline()
'\n'
>>> fh.readline()
'\n'
>>> fh.readline()
'\n'
>>> fh.readline()
'gtctcttaaaacccactggacgttggcacagtgctgggatgactatggag\n'

...and etc.

Yours,

Ann

--
Ann Loraine
Assistant Professor
Departments of Genetics, Biostatistics, and
Section on Statistical Genetics
University of Alabama at Birmingham
http://www.ssg.uab.edu
http://www.transvar.org

From telliott at hsc.wvu.edu Fri Dec 8 13:58:40 2006
From: telliott at hsc.wvu.edu (Thomas Elliott)
Date: Fri, 8 Dec 2006 13:58:40 -0500
Subject: [BioPython] Medline records
Message-ID: <8B121229-3B35-4246-9222-A8E4CC8E91EA@hsc.wvu.edu>

Hi,

I started playing with Biopython again. Glad to see it is still active.
I love Python.

Some issues came up in parsing Medline records according to the
tutorial. It wasn't obvious which variables exist to be queried on a
record.
The tutorial example gives title, authors, and source only. I tried
looking at the code. There are terms in NLMMedlineXML.py that look like
they should work, but which raise AttributeError (e.g. journal and
date_created). I finally looked in the keys to __dict__ for a record.
I found 'year' there, but record.year seems to be always empty, or
rather it is a blank string.

- record.title_abbreviation is actually the abbreviated journal name.
- record.volume_issue gives only the volume, not the issue.
- record.journal_title_code is always a blank string.

Not sure what the right way is to do this. I guess it would be helpful
to know which file of the source I should be looking at for variable
names.

Tom Elliott

From mdehoon at c2b2.columbia.edu Sat Dec 9 15:01:04 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sat, 09 Dec 2006 15:01:04 -0500
Subject: [BioPython] question regarding unicode, biopython Seq object, DAS
In-Reply-To: <83722dde0612080037i33884c20jafcb4b24ac7b6815@mail.gmail.com>
References: <83722dde0612080037i33884c20jafcb4b24ac7b6815@mail.gmail.com>
Message-ID: <457B1600.4000807@c2b2.columbia.edu>

Ann Loraine wrote:
> My sax parser delivers character (sequence) data as unicode, but when
> I make a Seq object from the unicode string and then try to reverse
> complement the sequence, I get an exception:

Can you convert the unicode string to a regular string before creating
the Seq object? As in

>>> from Bio.Alphabet import IUPAC
>>> from Bio.Seq import Seq
>>> s = u'atcg'
>>> s = str(s)
>>> s = Seq(s, IUPAC.unambiguous_dna)
>>> s.reverse_complement()
Seq('cgat', IUPACUnambiguousDNA())
>>>

By the way, you can also use reverse_complement on a string directly:

>>> from Bio.Seq import reverse_complement
>>> s = 'atcg'
>>> reverse_complement(s)
'cgat'
>>>

--Michiel.
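
As a rough illustration of the conversion suggested above, here is a minimal, self-contained sketch in modern Python 3 that does not use Biopython at all. The helper names to_ascii and reverse_complement and the translation table are assumptions made for this example, not Biopython's own code. In Python 3 every str is already Unicode, so the step that matters is rejecting non-ASCII input early rather than calling str() on it.

```python
# Minimal sketch, NOT Biopython's implementation. The helper names and the
# complement table are hypothetical and cover unambiguous DNA only.

_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def to_ascii(text):
    """Return text unchanged if it is pure ASCII.

    Raises UnicodeEncodeError for any non-ASCII character, so bad parser
    output is caught before it reaches the sequence code.
    """
    return text.encode("ascii").decode("ascii")


def reverse_complement(sequence):
    """Reverse-complement an unambiguous DNA string."""
    clean = to_ascii(sequence)
    # Complement each base, then reverse the whole string.
    return clean.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement(u"atcg"))
```

Encoding to ASCII first is a design choice for the sketch: sequence data from a SAX parser should contain only plain letters, so failing loudly on anything else is usually preferable to silently passing it through.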
From aloraine at gmail.com Sat Dec 9 17:38:59 2006
From: aloraine at gmail.com (Ann Loraine)
Date: Sat, 9 Dec 2006 16:38:59 -0600
Subject: [BioPython] question regarding unicode, biopython Seq object, DAS
In-Reply-To: <457B1600.4000807@c2b2.columbia.edu>
References: <83722dde0612080037i33884c20jafcb4b24ac7b6815@mail.gmail.com> <457B1600.4000807@c2b2.columbia.edu>
Message-ID: <83722dde0612091438k22bf4a6fyb18c6f9276ac2f69@mail.gmail.com>

Thanks very much!

Also, I found out that strings have an encode method I never noticed
before:

>>> foo = u'foo'
>>> foo
u'foo'
>>> foo.encode('ascii')
'foo'

Yours,

Ann

On 12/9/06, Michiel de Hoon wrote:
> Ann Loraine wrote:
> > My sax parser delivers character (sequence) data as unicode, but when
> > I make a Seq object from the unicode string and then try to reverse
> > complement the sequence, I get an exception:
>
> Can you convert the unicode string to a regular string before creating
> the Seq object? As in
>
> >>> from Bio.Alphabet import IUPAC
> >>> from Bio.Seq import Seq
> >>> s = u'atcg'
> >>> s = str(s)
> >>> s = Seq(s, IUPAC.unambiguous_dna)
> >>> s.reverse_complement()
> Seq('cgat', IUPACUnambiguousDNA())
> >>>
>
> By the way, you can also use reverse_complement on a string directly:
>
> >>> from Bio.Seq import reverse_complement
> >>> s = 'atcg'
> >>> reverse_complement(s)
> 'cgat'
> >>>
>
>
> --Michiel.
>

--
Ann Loraine
Assistant Professor
Departments of Genetics, Biostatistics, and
Section on Statistical Genetics
University of Alabama at Birmingham
http://www.ssg.uab.edu
http://www.transvar.org

From elventear at gmail.com Sun Dec 10 23:33:03 2006
From: elventear at gmail.com (Pepe Barbe)
Date: Sun, 10 Dec 2006 22:33:03 -0600
Subject: [BioPython] Markov Model module in BioPython
Message-ID: <3e73596b0612102033s38ce972dl70dbd8b09fc3114d@mail.gmail.com>

Hello Everyone,

I am curious whether the Markov Model module works, or whether it is
complete. I've looked at the tests for the module and they don't cover
the functionality of the entire module, and I've run into some issues
that make me wonder if it is complete.

E.g. the allow_transitions function in the Markov Builder class is
missing self._state_alphabet.letters in line 182. After fixing that,
another bug showed up. Thus I wonder if it has been tested or if its
functionality is complete.

I am running BioPython 1.42.

Thanks,
Pepe

From alpersoyler at yahoo.com Mon Dec 11 09:44:12 2006
From: alpersoyler at yahoo.com (alper soyler)
Date: Mon, 11 Dec 2006 06:44:12 -0800 (PST)
Subject: [BioPython] XML Parser problem
Message-ID: <20061211144412.6627.qmail@web56505.mail.re3.yahoo.com>

Dear all,

I ran blastall with option -m7 to save the resulting file as XML.
However, when I opened the XML file with Firefox, it gave the following
error message.

XML Parsing Error: junk after document element
Location: file:///home/alper/Desktop/genes/combinedblastfile.xml
Line Number 38, Column 1:

^

But it can be opened with a text editor. When I tried to parse the
results with biopython, it also gave the errors below. I do not
understand the reason. I would be very glad of any help. Thank you in
advance.

Traceback (most recent call last):
  File "XMLBlastParser.py", line 13, in ?
b_record = b_parser.parse(blast_out) File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 112, in parse self._parser.parse(handler) File "/usr/lib/python2.4/site-packages/_xmlplus/sax/expatreader.py", line 109, in parse xmlreader.IncrementalParser.parse(self, source) File "/usr/lib/python2.4/site-packages/_xmlplus/sax/xmlreader.py", line 123, in parse self.feed(buffer) File "/usr/lib/python2.4/site-packages/_xmlplus/sax/expatreader.py", line 220, in feed self._err_handler.fatalError(exc) File "/usr/lib/python2.4/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError raise exception xml.sax._exceptions.SAXParseException:/home/alper/Desktop/genes/combinedblastfile.xml:38:0: junk after document element Alper Soyler Dept. of Food Engineering Middle East Technical University,Turkey Tel:+90312 2105625 Fax:+90312 2102767 http://www.metu.edu.tr/~soyler ----- Original Message ---- From: "biopython-request at lists.open-bio.org" To: biopython at lists.open-bio.org Sent: Sunday, October 8, 2006 8:03:28 AM Subject: BioPython Digest, Vol 46, Issue 2 Send BioPython mailing list submissions to biopython at lists.open-bio.org To subscribe or unsubscribe via the World Wide Web, visit http://lists.open-bio.org/mailman/listinfo/biopython or, via email, send a message with subject or body 'help' to biopython-request at lists.open-bio.org You can reach the person managing the list at biopython-owner at lists.open-bio.org When replying, please edit your Subject line so it is more specific than "Re: Contents of BioPython digest..." Today's Topics: 1. Re: Genbank parsing problem and fix (Gemma Atkinson) 2. Re: Genbank parsing problem and fix (Peter) 3. BioPython for TRANSFAC (Wijaya Edward) 4. Creating fusion protein like constructs with BioPython (Mitchell Stanton-Cook) 5. Re: Creating fusion protein like constructs with BioPython (Peter) 6. Re: Creating fusion protein like constructs with BioPython (Thomas Hamelryck) 7. 
Problem parsing Blast XML output from different sources (Steffi Gebauer-Jung) 8. Re: Problem parsing Blast XML output from different sources (Michiel Jan Laurens de Hoon) 9. Join kirby white on Yahoo! Messenger! (kirbywhite at sbcglobal.net) 10. Re: Problem parsing Blast XML output from different sources (Michiel de Hoon) ---------------------------------------------------------------------- Message: 1 Date: Tue, 3 Oct 2006 12:36:58 +0100 From: Gemma Atkinson Subject: Re: [BioPython] Genbank parsing problem and fix To: biopython at lists.open-bio.org Message-ID: Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Hi Peter, I was using the Bio.Genbank module. This is the code I've been using: from Bio import GenBank parser = GenBank.RecordParser(debug_level=2) record = parser.parse(open("test4.txt")) It was the expressions/genbank.py file, imported from within the Genbank module that I've been changing. I haven't touched the formatdefs/genbank.py file (should have made that clear before - sorry). This was the error I was getting before I changed expressions/ genbank.py: File "testgbparser.py", line 3, in ? 
record = parser.parse(open("test4.txt")) File "/Library/Frameworks/Python.framework/Versions/2.4/lib/ python2.4/Bio/GenBank/__init__.py", line 240, in parse self._scanner.feed(handle, self._consumer) File "/Library/Frameworks/Python.framework/Versions/2.4/lib/ python2.4/Bio/GenBank/__init__.py", line 1259, in feed self._parser.parseFile(handle) File "/Library/Frameworks/Python.framework/Versions/2.4/lib/ python2.4/Martel/Parser.py", line 328, in parseFile self.parseString(fileobj.read()) File "/Library/Frameworks/Python.framework/Versions/2.4/lib/ python2.4/Martel/Parser.py", line 356, in parseString self._err_handler.fatalError(result) File "/Library/Frameworks/Python.framework/Versions/2.4//lib/ python2.4/xml/sax/handler.py", line 38, in fatalError raise exception Martel.Parser.ParserPositionException: error parsing at or beyond character 1153 Gemma On 3 Oct 2006, at 10:54, Peter wrote: > gca500 at york.ac.uk wrote: >> Hi All, >> Been having a problem using the Genbank RecordParser with some >> Genbank files that have recently been added to NCBI. After a bit >> of trial and error, I realised the problem only occurs if a >> REFERENCE field isn't followed by an AUTHOR field (for example in >> reference 2 of this record: http://www.ncbi.nlm.nih.gov/entrez/ >> viewer.fcgi?db=protein&val=88602864). >> There's a very easy fix on line 289 of Genbank.py. Decided to post >> this to the list to save any one else who stumbles across this >> problem tearing their hair out like I've been doing this afternoon! >> Change ... and it works! >> Hope this is useful, >> Gemma > > Hi Gemma, > > I have made your suggested change to biopython/Bio/formatdefs/ > genbank.py as CVS revision 1.10, which should be viewable online soon: > > http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/ > expressions/genbank.py?cvsroot=biopython > > I am curious as to why you are using this code (part of the > FormatIO system), rather than the Bio.GenBank module. 
> > Thank you, > > Peter > ------------------------------ Message: 2 Date: Tue, 03 Oct 2006 14:33:48 +0100 From: Peter Subject: Re: [BioPython] Genbank parsing problem and fix To: Gemma Atkinson Cc: biopython at lists.open-bio.org Message-ID: <452266BC.9060809 at maubp.freeserve.co.uk> Content-Type: text/plain; charset=ISO-8859-1; format=flowed >> Hi Gemma, >> >> I have made your suggested change to biopython/Bio/formatdefs/ >> genbank.py as CVS revision 1.10, which should be viewable online soon: >> >> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/ >> expressions/genbank.py?cvsroot=biopython I got the URL right, but I mean to say Bio/expressions/genbank.py (which actually has the Martel definition in it) not Bio/formatdefs/genbank.py Peter wrote: >> I am curious as to why you are using this code ... Gemma replied: > I was using the Bio.Genbank module. This is the code I've been using: > > from Bio import GenBank > parser = GenBank.RecordParser(debug_level=2) > record = parser.parse(open("test4.txt")) I would guess you are using BioPython 1.41 (or older) then, as your stack trace was indeed using Martel internally. Recent versions of BioPython (1.42 and later) use a pure python parser in Bio.GenBank as the old Martel code didn't scale well with large input files (to the point of being almost useless on large genomes). If you do update your installation, and run into any problems with the GenBank parser, please do let us know. Peter ------------------------------ Message: 3 Date: Tue, 03 Oct 2006 22:16:27 +0800 From: Wijaya Edward Subject: [BioPython] BioPython for TRANSFAC To: biopython at lists.open-bio.org Message-ID: <3ACF03E372996C4EACD542EA8A05E66A061584 at mailbe01.teak.local.net> Content-Type: text/plain; charset=iso-8859-1 Hi there, Is there a method in BioPython that allow me to pass the query "fruitfly" or "drosophila" and then returning the: 1. already characterized TF and their binding sites (BS), 2. 
their respective coregulated genes, and 3. the location of TFBS location/position in the genes. all from TRANSFAC database. -- Regards, Edward WIJAYA ------------ Institute For Infocomm Research - Disclaimer ------------- This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you. -------------------------------------------------------- ------------------------------ Message: 4 Date: Wed, 4 Oct 2006 23:38:03 +1000 From: "Mitchell Stanton-Cook" Subject: [BioPython] Creating fusion protein like constructs with BioPython To: BioPython at lists.open-bio.org Message-ID: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Hello all. I am trying to create fusion protein-like model from two separate pdb files. I introduce a CYS mutant in the target protein, and then wish to form a disulphide bound between it and a small peptide. This is pure computational work. I am using Bio.PDB. As the two structures are in arbitrary frames of reference I need to rotate and translate to form the "construct". I wish to have TargetProtein-CB-SY-SY-CB-SmallPeptide (the peptide is not really added to the N/C term) I have tried many different approaches but have failed miserable to get SmallPeptide rotated relative to TargetProtein at the correct dihedral angle +/-90deg and bond lengths. My current approach is (omitting the correct bond length at this time): TP-CB-SY SY-CB-SP 1 2 3 4 Translate 2 onto 3 Calculate the angle between 1-(23)-4 Calculate the cross product of 1-23 x 23-4 Generate the rotation matrix given the angle and vector Rotate all SP (SmallPeptide) atoms by this rotation matrix. This has not worked. I have had some other ideas and have written code for them. 
Ideally, I wish to calculate the rotations about X,Y,Z to place the SP at the correct dihedral angle followed by translation, but I have no idea how to do this. 1) Can I use Bio.PDB to do this above task or do I need to look at something else? 2) Does anyone have any ideas on how to complete this goal? Thanking you for your time. Mitch ------------------------------ Message: 5 Date: Thu, 05 Oct 2006 10:47:30 +0100 From: Peter Subject: Re: [BioPython] Creating fusion protein like constructs with BioPython To: Mitchell Stanton-Cook Cc: BioPython at lists.open-bio.org Message-ID: <4524D4B2.8030600 at maubp.freeserve.co.uk> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Mitchell Stanton-Cook wrote: > Hello all. > > I am trying to create fusion protein-like model from two separate pdb files. > I introduce a CYS mutant in the target protein, and then wish to form a > disulphide bound between it and a small peptide. > > This is pure computational work. > > ... > > 1) Can I use Bio.PDB to do this above task or do I need to look at something > else? My gut instinct is that yes, you probably can - but you will have to do a lot of the work with your own code. Its not something I have ever tried though. > 2) Does anyone have any ideas on how to complete this goal? You might want to have a look at MMTK, which on the face of it would be better suited. Assuming MMTK will read both PDB files you might have better luck - this proviso is because I have found MMTK will choke on "odd" PDB files, and its support for non-standard residues could be better. 
http://starship.python.net/crew/hinsen/MMTK/index.html Peter ------------------------------ Message: 6 Date: Thu, 5 Oct 2006 11:52:56 +0200 From: "Thomas Hamelryck" Subject: Re: [BioPython] Creating fusion protein like constructs with BioPython To: "Mitchell Stanton-Cook" Cc: BioPython at lists.open-bio.org Message-ID: <2d7c25310610050252j2f889242h84411e0927fb4502 at mail.gmail.com> Content-Type: text/plain; charset=UTF-8; format=flowed Hi, > I am trying to create fusion protein-like model from two separate pdb files. > I introduce a CYS mutant in the target protein, and then wish to form a > disulphide bound between it and a small peptide. ... > 1) Can I use Bio.PDB to do this above task or do I need to look at something > else? Bio.PDB has functionality to do vector/rotation calculations. Take a look at the Vector.py module. Best, ---- Thomas Hamelryck, Post-doctoral researcher Bioinformatics center Institute of Molecular Biology and Physiology University of Copenhagen Universitetsparken 15 - Bygning 10 DK-2100 Copenhagen ? Denmark Homepage: http://www.binf.ku.dk/Protein_structure ------------------------------ Message: 7 Date: Thu, 05 Oct 2006 12:30:36 +0200 From: Steffi Gebauer-Jung Subject: [BioPython] Problem parsing Blast XML output from different sources To: biopython at lists.open-bio.org Message-ID: <4524DECC.3030307 at ice.mpg.de> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Hello, because of blastall 2.2.14 output was not parsed from the Bio.Blast.NCBIStandalone parser, I tried to switch to the recommended Bio.Blast.NCBIXML parser. Thereby I found, that the xml output of the locally installed standalone blastall (2.2.14) differs from the web xml output. For BlastN hsps on Plus/Minus strands, the xml gives query_frame/hit_frame 1 / -1 as usual. But query and frame positions and sequences are switched in direction (would match frames -1/1). 
As the Bio.Blast.Record returned by the NCBIXML parser only gives frames, sequences and start positions it is not possible (without knowing the source of the xml file) to be sure to find the right data. This is clearly a problem of Blast. But because of the missing end positions in the returned record object it becomes a problem for users of the parser too. Could somebody try to confirm the different behaviour of the xml blast output with his/her own examples/installation? Thanks, Steffi ------------------------------ Message: 8 Date: Thu, 05 Oct 2006 12:01:04 -0400 From: Michiel Jan Laurens de Hoon Subject: Re: [BioPython] Problem parsing Blast XML output from different sources To: Steffi Gebauer-Jung Cc: biopython at lists.open-bio.org Message-ID: <45252C40.8040806 at c2b2.columbia.edu> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Which sequence are you running blast on? I'd like to try this on our local blast installation. --Michiel. Steffi Gebauer-Jung wrote: > Hello, > > because of blastall 2.2.14 output was not parsed from the > Bio.Blast.NCBIStandalone parser, > I tried to switch to the recommended Bio.Blast.NCBIXML parser. > > Thereby I found, that the xml output of the locally installed standalone > blastall (2.2.14) > differs from the web xml output. > > For BlastN hsps on Plus/Minus strands, the xml gives > query_frame/hit_frame 1 / -1 as usual. > But query and frame positions and sequences are switched in direction > (would match frames -1/1). > > As the Bio.Blast.Record returned by the NCBIXML parser only gives > frames, sequences > and start positions it is not possible (without knowing the source of > the xml file) > to be sure to find the right data. > > This is clearly a problem of Blast. > But because of the missing end positions in the returned record object > it becomes a problem for users of the parser too. 
> > Could somebody try to confirm the different behaviour of the xml blast > output > with his/her own examples/installation? > > Thanks, Steffi > > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 ------------------------------
Message: 10 Date: Sun, 08 Oct 2006 00:51:09 -0400 From: Michiel de Hoon Subject: Re: [BioPython] Problem parsing Blast XML output from different sources To: Steffi Gebauer-Jung , biopython at biopython.org Message-ID: <452883BD.7050907 at c2b2.columbia.edu> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Hi Steffi, I am trying to replicate this problem with Blast. Where did you get the pat database? I searched for it with google, but there seems to be more than one blast database called pat. --Michiel. Steffi Gebauer-Jung wrote: > Hello, > > I don't know what local databases you have available for testing. > The discrepancy between xml and 'pairwise text' output should be seen > for every Plus/Minus Hsp created by local Blastn (local server or > standalone blastall from command line, I use version 2.2.14) > > I tried several combinations, one is M38240 vs. pat database, > the hsp hit was BD298385.
> Here are the interesting output snippets: > >> dbj|BD298385.1| >> >> CLEAN SYNTHETIC VECTORS, PLASMIDS, TRANSGENIC PLANTS AND PLANT PARTS > CONTAINING THEM, AND METHODS FOR OBTAINING THEM > Length = 14108 > > Score = 125 bits (63), Expect = 1e-25 > Identities = 63/63 (100%) > Strand = Plus / Minus > > > Query: 727 aatgaagactaatctttttctctttctcatcttttcacttctcctatcattatcctcggc > 786 > |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| > Sbjct: 8332 aatgaagactaatctttttctctttctcatcttttcacttctcctatcattatcctcggc > 8273 > > Query: 787 cga 789 > ||| > Sbjct: 8272 cga 8270 > > ===================================================== > > 15 > gi|92136243|dbj|BD298385.1| > CLEAN SYNTHETIC VECTORS, PLASMIDS, TRANSGENIC PLANTS > AND PLANT PARTS CONTAINING THEM, AND METHODS FOR OBTAINING THEM > BD298385 > 14108 > > > 1 > 125.381 > 63 > 9.63859e-26 > 789 > 727 > 8270 > 8332 > 1 > -1 > 63 > 63 > 63 > > TCGGCCGAGGATAATGATAGGAGAAGTGAAAAGATGAGAAAGAGAAAAAGATTAGTCTTCATT > > > TCGGCCGAGGATAATGATAGGAGAAGTGAAAAGATGAGAAAGAGAAAAAGATTAGTCTTCATT > > > ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| > > > > > > Thanks, Steffi > > > > > > > Michiel Jan Laurens de Hoon wrote: > >> Which sequence are you running blast on? >> I'd like to try this on our local blast installation. >> >> --Michiel. >> >> Steffi Gebauer-Jung wrote: >> >>> Hello, >>> >>> because of blastall 2.2.14 output was not parsed from the >>> Bio.Blast.NCBIStandalone parser, >>> I tried to switch to the recommended Bio.Blast.NCBIXML parser. >>> >>> Thereby I found, that the xml output of the locally installed >>> standalone blastall (2.2.14) >>> differs from the web xml output. >>> >>> For BlastN hsps on Plus/Minus strands, the xml gives >>> query_frame/hit_frame 1 / -1 as usual. >>> But query and frame positions and sequences are switched in direction >>> (would match frames -1/1). 
>>> >>> As the Bio.Blast.Record returned by the NCBIXML parser only gives >>> frames, sequences >>> and start positions it is not possible (without knowing the source of >>> the xml file) >>> to be sure to find the right data. >>> >>> This is clearly a problem of Blast. >>> But because of the missing end positions in the returned record object >>> it becomes a problem for users of the parser too. >>> >>> Could somebody try to confirm the different behaviour of the xml >>> blast output >>> with his/her own examples/installation? >>> >>> Thanks, Steffi >>> >>> >>> >>> _______________________________________________ >>> BioPython mailing list - BioPython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >> >> >> > ------------------------------ _______________________________________________ BioPython mailing list - BioPython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython End of BioPython Digest, Vol 46, Issue 2 **************************************** __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From gebauer-jung at ice.mpg.de Mon Dec 11 10:54:54 2006 From: gebauer-jung at ice.mpg.de (Steffi Gebauer-Jung) Date: Mon, 11 Dec 2006 16:54:54 +0100 Subject: [BioPython] XML Parser problem Message-ID: <457D7F4E.60804@ice.mpg.de> Hello, there are more then 1 concatenated xml documents in the file - which in fact is no valid xml. Some days ago I had the same problem. It does not occur with the NCBI blast server but with our local blast server installation (version 2.2.15): If there is more than only 1 query (regardless of given as sequences or a multiple fasta file) the resulting xml output is a concatenation of several xml files. Interestingly for the command line blastall (version 2.2.15 too) the output is ok and contains a valid xml document with several iterations. 
Best, Steffi >Message: 9 >Date: Mon, 11 Dec 2006 06:44:12 -0800 (PST) >From: alper soyler >Subject: [BioPython] XML Parser problem >To: biopython at lists.open-bio.org >Message-ID: <20061211144412.6627.qmail at web56505.mail.re3.yahoo.com> >Content-Type: text/plain; charset=ascii > >Dear all, > >I run blastall with option -m7 to save the resulting file as xml. However, when I open the xml file with firefox, it gave the following error message. > >XML Parsing Error: junk after document element >Location: file:///home/alper/Desktop/genes/combinedblastfile.xml >Line Number 38, Column 1: > >^ >But it can be opened with the text editor. When I tried to parse the results with biopython it also gives the below errors. I did not understand the reason. If you help me, I will be very glad. Thank you in advance. > Traceback (most recent call last): > File "XMLBlastParser.py", line 13, in ? > b_record = b_parser.parse(blast_out) > File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 112, in parse > self._parser.parse(handler) > File "/usr/lib/python2.4/site-packages/_xmlplus/sax/expatreader.py", line 109, in parse > xmlreader.IncrementalParser.parse(self, source) > File "/usr/lib/python2.4/site-packages/_xmlplus/sax/xmlreader.py", line 123, in parse > self.feed(buffer) > File "/usr/lib/python2.4/site-packages/_xmlplus/sax/expatreader.py", line 220, in feed > self._err_handler.fatalError(exc) > File "/usr/lib/python2.4/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError > raise exception >xml.sax._exceptions.SAXParseException:/home/alper/Desktop/genes/combinedblastfile.xml:38:0: junk after document element > > >Alper Soyler >Dept. 
of Food Engineering
>Middle East Technical University,Turkey
>Tel:+90312 2105625
>Fax:+90312 2102767
>http://www.metu.edu.tr/~soyler
>

From mdehoon at c2b2.columbia.edu Mon Dec 11 11:24:22 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Mon, 11 Dec 2006 11:24:22 -0500
Subject: [BioPython] XML Parser problem
In-Reply-To: <20061211144412.6627.qmail@web56505.mail.re3.yahoo.com>
References: <20061211144412.6627.qmail@web56505.mail.re3.yahoo.com>
Message-ID: <457D8636.9010301@c2b2.columbia.edu>

It looks like you are running an old version of Blast. When blasting
several sequences at the same time, old blast makes an output file
consisting of several XML files concatenated together. According to the
XML specification, this is not a well-formed XML file.

There are two things you can do:
1) Continue using the old blast, and use NCBIStandalone.Iterator to
parse the output file. See section 3.1.6 in the tutorial.
2) Update to the new blast, which creates a well-formed XML file, and
see my next mail about the XML parser.

--Michiel.

alper soyler wrote:
> Dear all,
>
> I run blastall with option -m7 to save the resulting file as xml. However, when I open the xml file with firefox, it gave the following error message.
>
> XML Parsing Error: junk after document element
> Location: file:///home/alper/Desktop/genes/combinedblastfile.xml
> Line Number 38, Column 1:
>
> ^

--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From mdehoon at c2b2.columbia.edu Mon Dec 11 12:11:45 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Mon, 11 Dec 2006 12:11:45 -0500
Subject: [BioPython] Blast XML parser
Message-ID: <457D9151.3020401@c2b2.columbia.edu>

The file format of Blast XML output changed with recent (>= 2.2.14 I
believe) versions of blast if multiple sequences are blasted at the
same time.
Older versions of blast return an output file consisting of several XML
files concatenated together. Newer blast versions return one XML file
containing the blast results for all blasted sequences. While the advantage
is that this is a valid XML file, it breaks NCBIStandalone.Iterator, which
looks for the start of a new XML file when iterating over the blast results.

There are several bug reports now related to parsing multiple blast records
(bugs 1970, 2051, 2090).

I have written a patch to Bio/Blast/NCBIXML to fix this problem. As it
changes the way NCBIXML is used, I was wondering if anybody has objections
to this approach.


Current usage of NCBIXML (single Blast record):

from Bio.Blast import NCBIXML
blast_out = open("myblastoutput.xml")
parser = NCBIXML.BlastParser()
b_record = parser.parse(blast_out)


New usage of NCBIXML (single Blast record):

from Bio.Blast import NCBIXML
blast_out = open("myblastoutput.xml")
b_records = NCBIXML.parse(blast_out)
b_record = b_records.next()


Current usage of NCBIXML (multiple Blast records):

from Bio.Blast import NCBIStandalone, NCBIXML
parser = NCBIXML.BlastParser()
blast_out = open("myblastoutput.xml")
for b_record in NCBIStandalone.Iterator(blast_out, parser):
    #Do something with the record


New usage of NCBIXML (multiple Blast records):

from Bio.Blast import NCBIXML
blast_out = open("myblastoutput.xml")
b_records = NCBIXML.parse(blast_out)
for b_record in b_records:
    #Do something with the record


Objections, anybody?
In case you want to try this, you can download the patch from Bugzilla
bug #1970.

--Michiel.
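
To make the new-style layout concrete, here is a minimal sketch that walks a single XML file holding several query results. It uses only the Python 3 standard library, not the patched Bio.Blast.NCBIXML, and it is not the code from bug #1970. The element names (Iteration, Iteration_query-ID, Iteration_query-def) are assumed from NCBI's new-style output, the helper name iter_query_definitions is hypothetical, and the sample document is invented and heavily trimmed.

```python
import io
import xml.etree.ElementTree as ET


def iter_query_definitions(handle):
    """Yield (query_id, query_def) for each <Iteration> in a new-style file."""
    for _event, element in ET.iterparse(handle, events=("end",)):
        if element.tag == "Iteration":
            yield (element.findtext("Iteration_query-ID"),
                   element.findtext("Iteration_query-def"))
            # Drop the finished record so large files do not pile up in memory.
            element.clear()


# Invented, heavily trimmed sample: one document, two queries.
sample = io.BytesIO(
    b'<?xml version="1.0"?>\n'
    b"<BlastOutput>\n"
    b"  <BlastOutput_iterations>\n"
    b"    <Iteration>\n"
    b"      <Iteration_query-ID>lcl|1_0</Iteration_query-ID>\n"
    b"      <Iteration_query-def>first query</Iteration_query-def>\n"
    b"    </Iteration>\n"
    b"    <Iteration>\n"
    b"      <Iteration_query-ID>lcl|2_0</Iteration_query-ID>\n"
    b"      <Iteration_query-def>second query</Iteration_query-def>\n"
    b"    </Iteration>\n"
    b"  </BlastOutput_iterations>\n"
    b"</BlastOutput>\n"
)

records = list(iter_query_definitions(sample))
for query_id, query_def in records:
    print(query_id, query_def)
```

Because the whole file is one well-formed document, an incremental parser can hand back one record at a time, which is the same shape as the iterator proposed above.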
--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From winter at biotec.tu-dresden.de Mon Dec 11 10:44:24 2006
From: winter at biotec.tu-dresden.de (Christof Winter)
Date: Mon, 11 Dec 2006 16:44:24 +0100
Subject: [BioPython] XML Parser problem
In-Reply-To: <20061211144412.6627.qmail@web56505.mail.re3.yahoo.com>
References: <20061211144412.6627.qmail@web56505.mail.re3.yahoo.com>
Message-ID: <457D7CD8.70700@biotec.tu-dresden.de>

Dear Alper:

The error you get is probably due to an XML document that is not
well-formed, as produced by older versions of BLAST. On my Debian Linux
system, blastall 2.2.10 produces such XML files, whereas blastall 2.2.13
does not anymore.

A workaround was included in the NCBIStandalone Iterator class by Michael
Anthony Maibaum:
http://portal.open-bio.org/pipermail/biopython/2006-January/002889.html

The following code should work:

from Bio.Blast import NCBIXML, NCBIStandalone
blast_results = open(filename)
iterator = NCBIStandalone.Iterator(blast_results, NCBIXML.BlastParser())
for record in iterator:
    # do something
blast_results.close()

The tutorial at http://www.biopython.org/DIST/docs/tutorial/Tutorial.pdf
still lists, on page 21, code that uses
b_record = b_parser.parse(blast_out), which gives the error when parsing a
file that consists of several XML documents.

Hope that helps,
cheers,
Christof

alper soyler wrote:
> Dear all,
>
> I run blastall with option -m7 to save the resulting file as xml. However, when I open the xml file with firefox, it gave the following error message.
>
> XML Parsing Error: junk after document element
> Location: file:///home/alper/Desktop/genes/combinedblastfile.xml
> Line Number 38, Column 1:
>
> ^
> But it can be opened with the text editor. When I tried to parse the results with biopython it also gives the below errors. I did not understand the reason. If you help me, I will be very glad. Thank you in advance.
> Traceback (most recent call last): > File "XMLBlastParser.py", line 13, in ? > b_record = b_parser.parse(blast_out) > File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 112, in parse > self._parser.parse(handler) > File "/usr/lib/python2.4/site-packages/_xmlplus/sax/expatreader.py", line 109, in parse > xmlreader.IncrementalParser.parse(self, source) > File "/usr/lib/python2.4/site-packages/_xmlplus/sax/xmlreader.py", line 123, in parse > self.feed(buffer) > File "/usr/lib/python2.4/site-packages/_xmlplus/sax/expatreader.py", line 220, in feed > self._err_handler.fatalError(exc) > File "/usr/lib/python2.4/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError > raise exception > xml.sax._exceptions.SAXParseException:/home/alper/Desktop/genes/combinedblastfile.xml:38:0: junk after document element > > > Alper Soyler > Dept. of Food Engineering > Middle East Technical University,Turkey > Tel:+90312 2105625 > Fax:+90312 2102767 > http://www.metu.edu.tr/~soyler -- Christof Winter Bioinformatics Group TU Dresden Tatzberg 47-51 01307 Dresden, Germany From tiagoantao at gmail.com Mon Dec 11 11:40:09 2006 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 11 Dec 2006 16:40:09 +0000 Subject: [BioPython] Extracting gene position information from whole chromosome information Message-ID: <6d941f120612110840j246b0ebane46a74605caf1898@mail.gmail.com> Hi all, I am trying to understand what would be the best practice in BioPython to extract gene position information from genomic information. I am currently using genomic information (as opposed to querying GenBank for a gene and looking at metadata) mainly because I am doing genome-wide studies (in fact crossing information with the HapMap project). My current strategy (which is quite poor, IMO) is as this: 1. I get all ASN files for human chromosomes (e.g. ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_02/hs_ref_chr2.asn.gz ) 2. 
Search (textual 'dump' search, not using any kind of parser) for
the locus of interest (say, lactase).
3. Find the positions in the genome where the gene is coded
4. Use it (in my case to find relevant SNPs in HapMap)

For me (being a BioPython newbie), I end up with some doubts:
1. Are there any mechanisms (BioPython wise) to parse genome
(chromosome) wide ASN files? I could not find any in the cookbook...
2. Would this be the best strategy (searching for annotations in
single gene files would be another strategy...)?

Thanks a lot,
Tiago

--
For every expert, there is an equal and opposite expert. - Arthur C. Clarke

From sdavis2 at mail.nih.gov Tue Dec 12 06:41:27 2006
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Tue, 12 Dec 2006 06:41:27 -0500
Subject: [BioPython] Extracting gene position information from whole chromosome information
In-Reply-To: <6d941f120612110840j246b0ebane46a74605caf1898@mail.gmail.com>
References: <6d941f120612110840j246b0ebane46a74605caf1898@mail.gmail.com>
Message-ID: <200612120641.27156.sdavis2@mail.nih.gov>

On Monday 11 December 2006 11:40, Tiago Antão wrote:
> Hi all,
>
> I am trying to understand what would be the best practice in BioPython
> to extract gene position information from genomic information.
> I am currently using genomic information (as opposed to querying
> GenBank for a gene and looking at metadata) mainly because I am doing
> genome-wide studies (in fact crossing information with the HapMap
> project).
>
> My current strategy (which is quite poor, IMO) is as this:
> 1. I get all ASN files for human chromosomes (e.g.
> ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_02/hs_ref_chr2.asn.gz )
> 2. Search (textual 'dump' search, not using any kind of parser) for
> the locus of interest (say, lactase).
> 3. Find the positions in the genome where the gene is coded
> 4. Use it (in my case to find relevant SNPs in HapMap)
>
> For me (being a BioPython newbie), I end up with some doubts:
> 1.
Are there any mechanisms (BioPython wise) to parse genome
> (chromosome) wide ASN files? I could not find none in the cookbook...
> 2. Would this be the best strategy (Searching for annotations in
> single gene files would be another strategy...)?

I would simply use the UCSC table browser (or their tab-delimited text
files) to do this.

Sean

From winter at biotec.tu-dresden.de Tue Dec 12 11:42:39 2006
From: winter at biotec.tu-dresden.de (Christof Winter)
Date: Tue, 12 Dec 2006 17:42:39 +0100
Subject: [BioPython] Blast XML parser
In-Reply-To: <457D9151.3020401@c2b2.columbia.edu>
References: <457D9151.3020401@c2b2.columbia.edu>
Message-ID: <457EDBFF.2080605@biotec.tu-dresden.de>

Dear Michiel:

I just tested the patched NCBIXML.py with the XML format output of multiple
sequences blasted online at the NCBI BLAST website.

The result looks fine; however, it seems that query IDs and definitions
are not parsed. This is probably because of NCBI's change of tag names from
<BlastOutput_query-ID> and <BlastOutput_query-def> in the concatenated, old
XML format to <Iteration_query-ID> and <Iteration_query-def> in the new,
valid XML format.

Concerning the new syntax, I would prefer a unified syntax for parsing
both XML formats, and I would like to vote for Peter's "nice idea" in
his comment #6 in http://bugzilla.open-bio.org/show_bug.cgi?id=1970.
Running the same code on different machines with different local BLAST
versions constantly gives me a headache when parsing the results. As
long as these different BLAST versions are out there, people will run
into problems, and fill the BioPython discussion lists.

Cheers,
Christof

Michiel Jan Laurens de Hoon wrote:
> The file format of Blast XML output changed with recent (>= 2.2.14 I
> believe) versions of blast if multiple sequences are blasted at the same
> time. Older versions of blast return an output file consisting of
> several XML files concatenated together. Newer blast versions return one
> XML file containing the blast results for all blasted sequences.
Whereas > the advantage is that this is a valid XML file, it breaks > NCBIStandalone.Iterator, which looks for the start of a new XML file > when iterating over the blast results. > > There are several bug reports now related to parsing multiple blast > records (bugs 1970, 2051, 2090). > > I have written a patch to Bio/Blast/NCBIXML to fix this problem. As it > changes the way NCBIXML is used, I was wondering if anybody has > objections to this approach. > > > Current usage of NCBIXML (single Blast record): > > from Bio.Blast import NCBIXML > blast_out = open("myblastoutput.xml") > parser = NCBIXML.BlastParser() > b_record = parser.parse(blast_out) > > > New usage of NCBIXML (single Blast records): > > from Bio.Blast import NCBIXML > blast_out = open("myblastoutput.xml") > b_records = NCBIXML.parse(blast_out) > b_record = b_records.next() > > > > Current usage of NCBIXML (multiple Blast records): > > from Bio.Blast import NCBIStandalone, NCBIXML > parser = NCBIXML.BlastParser() > blast_out = open("myblastoutput.xml") > for b_record in NCBIStandalone.Iterator(blast_out, parser): > #Do something with the record > > > New usage of NCBIXML (multiple Blast records): > > from Bio.Blast import NCBIXML > blast_out = open("myblastoutput.xml") > b_records = NCBIXML.parse(blast_out) > for b_record in b_records: > #Do something with the record > > > Objections, anybody? > In case you want to try this, you can download the patch from Bugzilla > bug #1970. > > --Michiel. 
> > -- Christof Winter Bioinformatics Group TU Dresden Tatzberg 47-51 01307 Dresden, Germany Phone: +49 351 463 40065 EMail: winter at biotec.tu-dresden.de From mmokrejs at ribosome.natur.cuni.cz Tue Dec 12 17:09:20 2006 From: mmokrejs at ribosome.natur.cuni.cz (=?windows-1252?Q?Martin_MOKREJ=8A?=) Date: Tue, 12 Dec 2006 23:09:20 +0100 Subject: [BioPython] Medline records In-Reply-To: <8B121229-3B35-4246-9222-A8E4CC8E91EA@hsc.wvu.edu> References: <8B121229-3B35-4246-9222-A8E4CC8E91EA@hsc.wvu.edu> Message-ID: <457F2890.8020307@ribosome.natur.cuni.cz> Hi Thomas, I use the Medline/Pubmed modules from biopython 1.42 at least and they do work, although some fields contains weird data. Well, see my comments in the code below. It seems PubMed people should cleanup their database contents a bit. Thomas Elliott wrote: > Hi, > > I started playing with Biopython again. Glad to see it is still > active. I love Python. > > Some issues came up in parsing Medline records according to the > tutorial. > > It wasn't obvious which variables exist to be queried on a record. > The tutorial example gives title, authors, and source only. > > I tried looking at the code. There are terms in NLMMedlineXML.py > that look like they should work, but which raise AttributeError (e.g. > journal and date_created). > > I finally looked in the keys to __dict__ for a record. I found > 'year' there, but record.year seems to be always empty, or rather it > is a blank string. > > ? record.title_abbreviation is actually the abbreviated journal name. > ? record.volume_issue gives only the volume, not the issue. > ? record.journal_title_code is always a blank string. > > Not sure what the right way is to do this. I guess it would be > helpful to know which file of the source I should be looking at for > variable names. from Bio import PubMed, Medline, GenBank def get_citation_by_pmid(pmid): """Fetches citation data from NCBI Pubmed using pmid as a key. 
    It returns a dictionary with the following structure:
    {'title': 'Molecular characterization of the murine Hif-1 alpha locus.',
     'journal': 'Gene. Expr.',
     'author': 'Luo G., Gu Y. Z., Jain S., Chan W. K., Carr K. M., Hogenesch J. B., Bradfield C. A.',
     'volume': '6', 'year': '1997', 'issue': '5', 'pages': '287-299'}

    Test with PMID: 15703059, 10851087, 1111, 123456, 15703059, Y00664, 12509242, 10713153
    """
    rec_parser = Medline.RecordParser()
    medline_dict = PubMed.Dictionary(parser = rec_parser)
    cur_record = medline_dict[pmid]

    _authors = cur_record.authors
    # ['Luo G', 'Gu YZ', 'Jain S', 'Chan WK', 'Carr KM', 'Hogenesch JB', 'Bradfield CA']
    _new_authors = []
    for _author in _authors:
        # keep the surname, put a dot after every initial
        _author = ' '.join(_author.split(' ')[:-1]+['. '.join(tuple(_author.split(' ')[-1:][0]))+'.'])
        # 'van Carr-Schmidt K.M.'
        _new_authors.append(_author)

    _title = cur_record.title
    # '[The laboratory in programs for enteric infection control]'
    # 'Cap-independent translation of maize Hsp101.'
    # "The chicken c-Jun 5' untranslated region directs translation by internal\ninitiation."
    if _title.startswith('[') and _title.endswith(']'):
        _title = cur_record.title[1:-1]
    if '\r\n' in _title:
        _title = _title.replace('\r\n', ' ')
    if '\n' in _title:
        _title = _title.replace('\n', ' ')

    # _volume_issue = cur_record.volume_issue # '6' but also '1-2' and also 'Pt 6'

    # pages
    _pages = cur_record.pagination # '287-99'
    if '-' in _pages:
        _start_page, _last_page = _pages.split('-')
        _start_page, _last_page = int(_start_page), int(_last_page)
        if _last_page < _start_page:
            # expand the abbreviated last page: '287-99' becomes '287-299'
            _fixed_last_page = str(_start_page)[:-len(str(_last_page))] + str(_last_page)
            _pages = str(_start_page) + "-" + str(_fixed_last_page)

    # year
    _year = cur_record.publication_date
    if not _year:
        _year = cur_record.year
    try:
        _year = int(_year)
    except ValueError:
        # '1998 Oct' or '1975 May-Jun': keep only the leading year
        _year = int(_year.split(' ')[0])

    # journal
    _source = cur_record.source
    # 'Gene Expr 1997;6(5):287-99.'
    # 'J Comp Physiol [A] 2000 Jun;186(6):567-74'
    # 'Biokhimiia 1975 May-Jun;40(3):645-51.'
    # BUG: we should not blindly append dots to the end of the string,
    # for example this would be wrong in case of journals:
    # RNA, Oncogene, Nature ... where the correct citation is `Nature 33: 22-33', etc.
    _index = _source.find(cur_record.publication_date)
    _journal = _source[:_index - 1].strip() # strip the trailing space
    _journal = _journal.replace(' ','. ')
    if _journal[-1] != '.' and _journal[-1] != ')':
        # str.join() takes a single iterable, so append the dot directly
        _journal = _journal + '.'

    # volume and issue
    if ';' not in _source:
        raise ValueError("cannot find semicolon as the delimiter of the title from volume and issue")
    else:
        # print "_source is " + _source
        _index = _source.index(';')
    if '(' not in _source or ')' not in _source:
        raise ValueError("cannot find round brackets around issue number")
    else:
        # use rindex so we do not match the first bracket, as with 'DNA Repair (Amst) 2002 May 30;1(5):379-90.'
        _i1 = _source.rindex('(')
        _i2 = _source.rindex(')')
        _volume = _source[_index + 1:_i1]
        # print "volume is " + str(_volume)
        # get the issue from here, although we might already have it right
        _issue = _source[_i1 + 1:_i2]
        # print "issue is " + str(_issue)

    if str(pmid) != str(cur_record.pubmed_id):
        raise RuntimeError("we asked PubMed for pmid=" + str(pmid) + " but received record with pmid=" + str(cur_record.pubmed_id))

    _dict = {}
    _dict['year'] = _year
    _dict['author'] = ', '.join(_new_authors)
    _dict['title'] = _title
    _dict['journal'] = _journal
    _dict['volume'] = _volume
    _dict['issue'] = _issue
    _dict['pages'] = _pages
    return _dict

Hope this helps.

--
Dr. Martin Mokrejs
Dept. of Genetics and Microbiology
Faculty of Science, Charles University
Vinicna 5, 128 43 Prague, Czech Republic
http://www.iresite.org
http://www.iresite.org/~mmokrejs

From mdehoon at c2b2.columbia.edu Wed Dec 13 00:13:56 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Wed, 13 Dec 2006 00:13:56 -0500
Subject: [BioPython] Blast XML parser
In-Reply-To: <457EDBFF.2080605@biotec.tu-dresden.de>
References: <457D9151.3020401@c2b2.columbia.edu> <457EDBFF.2080605@biotec.tu-dresden.de>
Message-ID: <457F8C14.9050403@c2b2.columbia.edu>

Thanks, Christof.

Christof Winter wrote:
> The result looks fine, however, it seems that query IDs and definitions
> are not parsed. This is probably because of NCBI's change of tag names from
> ...

You're right. I've fixed this in an updated patch (at bug #1970 on
Bugzilla).

> Concerning the new syntax, I would prefer a unified syntax for parsing
> of both XML formats, and I would like to vote for Peter's "nice idea" in
> his comment #6 in http://bugzilla.open-bio.org/show_bug.cgi?id=1970).
> Running the same code on different machines with different local BLAST
> versions constantly gives me a headache when parsing the results. As
> long as these different BLAST versions are out there, people will run
> into problems, and fill the BioPython discussion lists.
I'm not sure if I understand correctly. Are you saying that we should have one function to handle both old- and new-style XML output? Or are you referring to how the functions should be named? --Michiel. From biopython at maubp.freeserve.co.uk Wed Dec 13 04:59:10 2006 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 13 Dec 2006 09:59:10 +0000 Subject: [BioPython] Blast XML parser In-Reply-To: <457F8C14.9050403@c2b2.columbia.edu> References: <457D9151.3020401@c2b2.columbia.edu> <457EDBFF.2080605@biotec.tu-dresden.de> <457F8C14.9050403@c2b2.columbia.edu> Message-ID: <457FCEEE.8010203@maubp.freeserve.co.uk> Michiel de Hoon wrote: > Thanks, Christof. > >> Concerning the new syntax, I would prefer a unified syntax for parsing >> of both XML formats, and I would like to vote for Peter's "nice idea" in >> his comment #6 in http://bugzilla.open-bio.org/show_bug.cgi?id=1970). >> Running the same code on different machines with different local BLAST >> versions constantly gives me a headache when parsing the results. As >> long as these different BLAST versions are out there, people will run >> into problems, and fill the BioPython discussion lists. > > I'm not sure if I understand correctly. Are you saying that we should > have one function to handle both old- and new-style XML output? Or are > you referring to how the functions should be named? I think Christof was agreeing that supporting both the "old" and "new" Blast XML formats in a single function would be nice. I think we could wrap Michiel's new XML code inside a loop that would split up any concatenated (old style) XML files. It shouldn't be too ugly - provided the XML parser can cope with a single record old style, as well as one-or-more records new style. I'll try and make a modified patch to show what I have in mind later today... 
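
The splitting step could look roughly like the sketch below. This is an illustration only, written for Python 3 with the standard library, and it is not the patch attached to bug 1970. It assumes every concatenated document starts with an XML declaration at the beginning of a line; the helper name split_concatenated_xml and the two-document sample are invented for the example.

```python
import io
import xml.etree.ElementTree as ET

XML_DECLARATION = "<?xml"


def split_concatenated_xml(handle):
    """Yield each XML document found in a stream of concatenated documents."""
    chunk = []
    for line in handle:
        # A new declaration marks the start of the next document.
        if line.lstrip().startswith(XML_DECLARATION) and chunk:
            yield "".join(chunk)
            chunk = []
        chunk.append(line)
    if chunk:
        yield "".join(chunk)


# Invented old-style sample: two complete documents back to back.
old_style = io.StringIO(
    '<?xml version="1.0"?>\n'
    "<BlastOutput><BlastOutput_query-def>query one</BlastOutput_query-def></BlastOutput>\n"
    '<?xml version="1.0"?>\n'
    "<BlastOutput><BlastOutput_query-def>query two</BlastOutput_query-def></BlastOutput>\n"
)

documents = list(split_concatenated_xml(old_style))
# Each piece is now well-formed on its own and can go to an ordinary XML parser.
queries = [ET.fromstring(doc).findtext("BlastOutput_query-def") for doc in documents]
print(queries)
```

A new-style file has a single declaration, so it comes back as one piece and the same loop serves both formats.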
Peter From winter at biotec.tu-dresden.de Wed Dec 13 05:40:26 2006 From: winter at biotec.tu-dresden.de (Christof Winter) Date: Wed, 13 Dec 2006 11:40:26 +0100 Subject: [BioPython] Blast XML parser In-Reply-To: <457F8C14.9050403@c2b2.columbia.edu> References: <457D9151.3020401@c2b2.columbia.edu> <457EDBFF.2080605@biotec.tu-dresden.de> <457F8C14.9050403@c2b2.columbia.edu> Message-ID: <457FD89A.6080504@biotec.tu-dresden.de> Dear Michiel: Michiel de Hoon wrote: > Thanks, Christof. > > Christof Winter wrote: >> The result looks fine, however, it seems that query IDs and >> definitions are not parsed. This is probably because of NCBI's change >> of tag names from > > ... > You're right. I've fixed this in an updated patch (at bug #1970 on > Bugzilla). Works fine! Thanks a lot for this, Michiel. >> Concerning the new syntax, I would prefer a unified syntax for parsing >> of both XML formats, and I would like to vote for Peter's "nice idea" >> in his comment #6 in >> http://bugzilla.open-bio.org/show_bug.cgi?id=1970). Running the same >> code on different machines with different local BLAST versions >> constantly gives me a headache when parsing the results. As long as >> these different BLAST versions are out there, people will run into >> problems, and fill the BioPython discussion lists. > > I'm not sure if I understand correctly. Are you saying that we should > have one function to handle both old- and new-style XML output? Or are > you referring to how the functions should be named? What I meant is: I would prefer that the same BioPython code is able to parse any NCBI Blast XML output format. That is, we should have one function for both old- and new-style XML output. The actual naming of functions is not so important for me, I also wouldn't mind having a new naming or syntax, as long as it can be used for either format. Personally, I favour iterators, since they are simple and elegant. Cheers, Christof > > --Michiel. 
From davidc at hgu.mrc.ac.uk Fri Dec 15 04:29:58 2006
From: davidc at hgu.mrc.ac.uk (Dave Clements)
Date: Fri, 15 Dec 2006 09:29:58 +0000
Subject: [BioPython] OBO File Format Parser?
Message-ID: <45826B16.2000504@hgu.mrc.ac.uk>

Hello,

I have searched through Biopython and have not been able to find an OBO
file format parser (see http://www.geneontology.org/GO.format.shtml).
Does anyone know if such a parser exists?

I apologise if this issue has been raised previously. The Biopython
mail list search facility is currently down.

Thanks,

Dave C.

From cjfields at uiuc.edu Fri Dec 15 12:28:51 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Fri, 15 Dec 2006 11:28:51 -0600
Subject: [BioPython] OBO File Format Parser?
In-Reply-To: <45826B16.2000504@hgu.mrc.ac.uk>
References: <45826B16.2000504@hgu.mrc.ac.uk>
Message-ID: <6ED9CE8F-0629-40A0-B14C-FB1925947E9F@uiuc.edu>

On Dec 15, 2006, at 3:29 AM, Dave Clements wrote:
> Hello,
>
> I have searched through Biopython and have not been able to find an
> OBO
> file format parser (see http://www.geneontology.org/GO.format.shtml).
> Does anyone know if such a parser exists.
>
> I apologise if this issues has been previously raised. The Biopython
> mail list search facility is currently down.
>
> Thanks,
>
> Dave C.
BioPerl has a relatively new one:

=head1 NAME

Bio::OntologyIO::obo - a parser for OBO flat-file format from Gene Ontology Consortium

=head1 SYNOPSIS

  use Bio::OntologyIO;

  # do not use directly -- use via Bio::OntologyIO
  my $parser = Bio::OntologyIO->new
      ( -format => "obo",
        -file   => "gene_ontology.obo");

  while(my $ont = $parser->next_ontology()) {
      print "read ontology ",$ont->name()," with ",
            scalar($ont->get_root_terms)," root terms, and ",
            scalar($ont->get_all_terms)," total terms, and ",
            scalar($ont->get_leaf_terms)," leaf terms\n";
  }

chris

From asmund.skjaveland at usit.uio.no Fri Dec 15 14:31:21 2006
From: asmund.skjaveland at usit.uio.no (=?ISO-8859-1?Q?=C5smund_Skj=E6veland?=)
Date: Fri, 15 Dec 2006 20:31:21 +0100
Subject: [BioPython] Converting between BLAST XML formats?
Message-ID: <4582F809.6080308@ulrik.uio.no>

Since the updated XML format from BLAST is causing so many headaches:
Has anybody written code to convert XML from the new format to the old
one? I'm trying to use a program that expects old-style XML, but I'd
rather not downgrade my BLAST.

--
Åsmund Skjæveland

From biopython at maubp.freeserve.co.uk Sat Dec 16 06:20:32 2006
From: biopython at maubp.freeserve.co.uk (Peter (BioPython List))
Date: Sat, 16 Dec 2006 11:20:32 +0000
Subject: [BioPython] Converting between BLAST XML formats?
In-Reply-To: <4582F809.6080308@ulrik.uio.no>
References: <4582F809.6080308@ulrik.uio.no>
Message-ID: <4583D680.5060503@maubp.freeserve.co.uk>

Åsmund Skjæveland wrote:
> Since the updated XML format from BLAST is causing so many headaches:
> Has anybody written code to convert XML from the new format to the old
> one?

Not that I'm aware of.

> I'm trying to use a program that expects old-style XML, but I'd
> rather not downgrade my BLAST.

Are you talking about some "program" that expects old style XML Blast
output which doesn't use BioPython? Your only practical option is to
stick to single query sequences, or downgrade your version of Blast.

If you are using BioPython in this program, then please try out the new
code recently committed to CVS as described on bug 1970.

http://bugzilla.open-bio.org/show_bug.cgi?id=1970

Fingers crossed this will work nicely with both old and new Blast XML
output.

Peter

From aloraine at gmail.com Sun Dec 17 22:08:01 2006
From: aloraine at gmail.com (Ann Loraine)
Date: Sun, 17 Dec 2006 21:08:01 -0600
Subject: [BioPython] interbase vs one-based?
Message-ID: <83722dde0612171908t1f72199aveaed4e514730576b@mail.gmail.com>

Hello,

I'm not sure if this is a bug or not...my apologies if this has
already been discussed.

I parsed a Genbank file using BioPython and got this:

>>> f = rec.features[1]
>>> print f
type: gene
location: [221:1607]
ref: None:None
strand: 1
qualifiers:
        Key: db_xref, Value: ['GeneID:911526']
        Key: locus_tag, Value: ['MYPU_0010']
        Key: note, Value: ['dnaA']

Note the coordinates - the feature's start position is 221 and end
position is 1607, seemingly.

However, the text of the Genbank file for this feature says this:

     gene            222..1607
                     /locus_tag="MYPU_0010"
                     /note="dnaA"
                     /db_xref="GeneID:911526"

It looks like the Genbank parser converts coordinates from 1-based to
interbase coordinates.

Is this correct?

Yours,

Ann

--
Ann Loraine
Assistant Professor
Departments of Genetics, Biostatistics, and Section on Statistical Genetics
University of Alabama at Birmingham
http://www.ssg.uab.edu
http://www.transvar.org

From biopython at maubp.freeserve.co.uk Mon Dec 18 06:37:02 2006
From: biopython at maubp.freeserve.co.uk (Peter (BioPython List))
Date: Mon, 18 Dec 2006 11:37:02 +0000
Subject: [BioPython] interbase vs one-based?
In-Reply-To: <83722dde0612171908t1f72199aveaed4e514730576b@mail.gmail.com>
References: <83722dde0612171908t1f72199aveaed4e514730576b@mail.gmail.com>
Message-ID: <45867D5E.1010009@maubp.freeserve.co.uk>

Ann Loraine wrote:
> Hello,
>
> I'm not sure if this is a bug or not...my apologies if this has
> already been discussed.

Yes, a raw location of 222..1607 is intentionally converted into a
location [221:1607] in BioPython. This has been discussed before on the
mailing list... but never mind.

The rationale is to follow the Python string slicing conventions, thus
rec.seq[221:1607] should give you the nucleotides for this feature.

Try:

help(Bio.SeqFeature.FeatureLocation)

or:

print rec.features[1].location.__doc__

You should get something like this:

> Specify the location of a feature along a sequence.
>
> This attempts to deal with fuzziness of position ends, but also make
> it easy to get the start and end in the 'normal' case (no fuzziness).
>
> You should access the start and end attributes with
> your_location.start and your_location.end. If the start and end are
> exact, this will return the positions, if not, we'll return the
> approriate Fuzzy class with info about the position and fuzziness.
>
> Note that the start and end location numbering follow Python's
> scheme, thus a GenBank entry of 123..150 (one based counting) becomes
> a location of [122:150] (zero based counting).

Peter

From aloraine at gmail.com Mon Dec 18 08:53:09 2006
From: aloraine at gmail.com (Ann Loraine)
Date: Mon, 18 Dec 2006 07:53:09 -0600
Subject: [BioPython] interbase vs one-based?
In-Reply-To: <45867D5E.1010009@maubp.freeserve.co.uk>
References: <83722dde0612171908t1f72199aveaed4e514730576b@mail.gmail.com> <45867D5E.1010009@maubp.freeserve.co.uk>
Message-ID: <83722dde0612180553h4599a7fdoecabd987fda5a77@mail.gmail.com>

Thanks!

-Ann

On 12/18/06, Peter (BioPython List) wrote:
> Ann Loraine wrote:
> > Hello,
> >
> > I'm not sure if this is a bug or not...my apologies if this has
> > already been discussed.
>
> Yes, a raw location of 222..1607 is intentionally converted into a
> location [221:1607] in BioPython. This has been discussed before on the
> mailing list... but never mind.
> > The rational is to follow the python string splicing conventions, thus > req.seq[221:1607] should give you the nucleotides for this feature. > > Try: > > help(Bio.SeqFeature.FeatureLocation) > > or: > > print rec.features[1].location.__doc__ > > You should get something like this: > > Specify the location of a feature along a sequence. > > > > This attempts to deal with fuzziness of position ends, but also make > > it easy to get the start and end in the 'normal' case (no fuzziness). > > > > You should access the start and end attributes with > > your_location.start and your_location.end. If the start and end are > > exact, this will return the positions, if not, we'll return the > > approriate Fuzzy class with info about the position and fuzziness. > > > > Note that the start and end location numbering follow Python's > > scheme, thus a GenBank entry of 123..150 (one based counting) becomes > > a location of [122:150] (zero based counting). > > Peter > > -- Ann Loraine Assistant Professor Departments of Genetics, Biostatistics, and Section on Statistical Genetics University of Alabama at Birmingham http://www.ssg.uab.edu http://www.transvar.org From biopython at wardroper.org Thu Dec 21 18:21:40 2006 From: biopython at wardroper.org (Alan Wardroper) Date: Thu, 21 Dec 2006 15:21:40 -0800 Subject: [BioPython] GO term annotation from fasta? Message-ID: <458B1704.6000704@wardroper.org> I have a large db of est clones and associated assemblies I'd like to (roughly) annotate using GO terms to let the wetlab people concentrate on potentially interesting clones. Looking for some advice on where to start to do this with biopython. My feeling is something like generate fasta files from my mysql db, blast the sequences against genbank, parse out the top hit, and use those IDs to grab GO terms, but I'm not sure how best to proceed. Is there a better way to do this in biopython? 
I can't see any way to do blastx or tblastx from bp, with qblast only supporting blastn and blastp. Thanks for any pointers. -- -------------------------- Alan Wardroper alan at wardroper.org From sdavis2 at mail.nih.gov Thu Dec 21 20:38:21 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 21 Dec 2006 20:38:21 -0500 Subject: [BioPython] GO term annotation from fasta? In-Reply-To: <458B1704.6000704@wardroper.org> References: <458B1704.6000704@wardroper.org> Message-ID: <458B370D.9030300@mail.nih.gov> Alan Wardroper wrote: > I have a large db of est clones and associated assemblies I'd like to > (roughly) annotate using GO terms to let the wetlab people concentrate > on potentially interesting clones. Looking for some advice on where to > start to do this with biopython. My feeling is something like generate > fasta files from my mysql db, blast the sequences against genbank, parse > out the top hit, and use those IDs to grab GO terms, but I'm not sure > how best to proceed. Is there a better way to do this in biopython? I > can't see any way to do blastx or tblastx from bp, with qblast only > supporting blastn and blastp. > Thanks for any pointers. > In what species are you working? Sean From biopython at wardroper.org Fri Dec 22 22:58:32 2006 From: biopython at wardroper.org (Alan Wardroper) Date: Fri, 22 Dec 2006 19:58:32 -0800 Subject: [BioPython] GO term annotation from fasta? Message-ID: <458CA968.6020708@wardroper.org> >Alan Wardroper wrote: >> I have a large db of est clones and associated assemblies I'd like to >> (roughly) annotate using GO terms ... with biopython. >In what species are you working? Salmonids-- at this point mostly Salmo salar and Oncorhynchus mykiss, but also others down the line. 
-- -------------------------- Alan Wardroper alan at wardroper.org From ULNJUJERYDIX at spammotel.com Sun Dec 24 11:54:52 2006 From: ULNJUJERYDIX at spammotel.com (Kevin Lam) Date: Mon, 25 Dec 2006 00:54:52 +0800 Subject: [BioPython] howto generate TFBS images on sequences In-Reply-To: <5b6410e0612232148ra526235tdae1a5fbcc2ea8b6@mail.gmail.com> References: <5b6410e0612232148ra526235tdae1a5fbcc2ea8b6@mail.gmail.com> Message-ID: <5b6410e0612240854q5e1054f0waf18015a5336ee1e@mail.gmail.com> Hi, I am looking for a solution to draw boxes/arrows on a scale to represent TFBS on a promoter sequence with numbers and scale. Is there anything in biopython that can do this? I can't find any info in the docs. Thanks, Kevin From sdavis2 at mail.nih.gov Mon Dec 25 11:37:18 2006 From: sdavis2 at mail.nih.gov (Davis, Sean (NIH/NCI) [E]) Date: Mon, 25 Dec 2006 11:37:18 -0500 Subject: [BioPython] GO term annotation from fasta? References: <458CA968.6020708@wardroper.org> Message-ID: <014DBF86B19310419F0DF8910FC56457240D08@nihcesmlbx10.nih.gov> -----Original Message----- From: Alan Wardroper [mailto:biopython at wardroper.org] Sent: Fri 12/22/2006 10:58 PM To: biopython at lists.open-bio.org Subject: Re: [BioPython] GO term annotation from fasta? >Alan Wardroper wrote: >> I have a large db of est clones and associated assemblies I'd like to >> (roughly) annotate using GO terms ... with biopython. >In what species are you working? Salmonids-- at this point mostly Salmo salar and Oncorhynchus mykiss, but also others down the line. ------------------------- Your original solution sounds reasonable, but you might want to think about some sort of translated blast, as I don't suppose your organism(s) are very similar to any of the well-annotated genomes. Also, blasting against genbank won't get you much, as most of the genbank accessions are not annotated in GO.
You might consider blasting against the sequences in GO (which are proteins, I believe, and are available in the GO database), so you could map directly from your sequences to best blast hit to GO. Sean From sdavis2 at mail.nih.gov Sat Dec 30 15:31:46 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Sat, 30 Dec 2006 15:31:46 -0500 Subject: [BioPython] MAGE-OM classes in python Message-ID: <4596CCB2.7000006@mail.nih.gov> I have looked around for someone using the MAGE-OM in python, but it appears that only java, perl, and c++ versions exist. Does anyone know of something in python? Any interest in developing a set? Sean
From telliott at hsc.wvu.edu Fri Dec 8 18:58:40 2006 From: telliott at hsc.wvu.edu (Thomas Elliott) Date: Fri, 8 Dec 2006 13:58:40 -0500 Subject: [BioPython] Medline records Message-ID: <8B121229-3B35-4246-9222-A8E4CC8E91EA@hsc.wvu.edu> Hi, I started playing with Biopython again. Glad to see it is still active. I love Python. Some issues came up in parsing Medline records according to the tutorial. It wasn't obvious which variables exist to be queried on a record. The tutorial example gives title, authors, and source only. I tried looking at the code. There are terms in NLMMedlineXML.py that look like they should work, but which raise AttributeError (e.g. journal and date_created). I finally looked in the keys to __dict__ for a record. I found 'year' there, but record.year seems to be always empty, or rather it is a blank string. ? record.title_abbreviation is actually the abbreviated journal name. ? record.volume_issue gives only the volume, not the issue. ? record.journal_title_code is always a blank string. Not sure what the right way is to do this. I guess it would be helpful to know which file of the source I should be looking at for variable names.
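The approach described above, looking at the keys of __dict__ to find out which fields a record really carries, can be wrapped in a small helper. This is only a minimal sketch: the Record class below is a hypothetical stand-in with made-up values, not Biopython's Medline record class, and the helper names are illustrative.

```python
class Record(object):
    """Hypothetical stand-in for a parsed record (NOT the Biopython class)."""

    def __init__(self):
        # Made-up values, chosen to mimic the symptoms described above.
        self.title = "An example title"
        self.authors = ["Doe J"]
        self.source = "J Example 2006;1:1-2"
        self.year = ""            # present, but blank
        self.volume_issue = "1"   # volume only


def describe(record):
    """Return (name, value) pairs for every instance attribute, sorted by name."""
    return sorted(vars(record).items())


def empty_fields(record):
    """Return the names of attributes that exist but hold no data."""
    return [name for name, value in describe(record) if value in ("", None, [])]


rec = Record()
names = [name for name, _ in describe(rec)]
blank = empty_fields(rec)
```

Printing `names` and `blank` for a real record would show which attributes are worth querying and which are always empty, without having to read the parser source first.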
Tom Elliott From mdehoon at c2b2.columbia.edu Sat Dec 9 20:01:04 2006 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sat, 09 Dec 2006 15:01:04 -0500 Subject: [BioPython] question regarding unicode, biopython Seq object, DAS In-Reply-To: <83722dde0612080037i33884c20jafcb4b24ac7b6815@mail.gmail.com> References: <83722dde0612080037i33884c20jafcb4b24ac7b6815@mail.gmail.com> Message-ID: <457B1600.4000807@c2b2.columbia.edu> Ann Loraine wrote: > My sax parser delivers character (sequence) data as unicode, but when > I make a Seq object from the unicode string and then try to reverse > complement the sequence, I get an exception: Can you convert the unicode string to a regular string before creating the Seq object? As in >>> from Bio.Alphabet import IUPAC >>> from Bio.Seq import Seq >>> s = u'atcg' >>> s = str(s) >>> s = Seq(s, IUPAC.unambiguous_dna) >>> s.reverse_complement() Seq('cgat', IUPACUnambiguousDNA()) >>> By the way, you can also use reverse_complement on a string directly: >>> from Bio.Seq import reverse_complement >>> s = 'atcg' >>> reverse_complement(s) 'cgat' >>> --Michiel. From aloraine at gmail.com Sat Dec 9 22:38:59 2006 From: aloraine at gmail.com (Ann Loraine) Date: Sat, 9 Dec 2006 16:38:59 -0600 Subject: [BioPython] question regarding unicode, biopython Seq object, DAS In-Reply-To: <457B1600.4000807@c2b2.columbia.edu> References: <83722dde0612080037i33884c20jafcb4b24ac7b6815@mail.gmail.com> <457B1600.4000807@c2b2.columbia.edu> Message-ID: <83722dde0612091438k22bf4a6fyb18c6f9276ac2f69@mail.gmail.com> Thanks very much! 
Also, I found out that strings have an encode method I never noticed before: >>> foo = u'foo' >>> foo u'foo' >>> foo.encode('ascii') 'foo' Yours, Ann On 12/9/06, Michiel de Hoon wrote: > Ann Loraine wrote: > > My sax parser delivers character (sequence) data as unicode, but when > > I make a Seq object from the unicode string and then try to reverse > > complement the sequence, I get an exception: > > Can you convert the unicode string to a regular string before creating > the Seq object? As in > > >>> from Bio.Alphabet import IUPAC > >>> from Bio.Seq import Seq > >>> s = u'atcg' > >>> s = str(s) > >>> s = Seq(s, IUPAC.unambiguous_dna) > >>> s.reverse_complement() > Seq('cgat', IUPACUnambiguousDNA()) > >>> > > By the way, you can also use reverse_complement on a string directly: > > >>> from Bio.Seq import reverse_complement > >>> s = 'atcg' > >>> reverse_complement(s) > 'cgat' > >>> > > > --Michiel. > -- Ann Loraine Assistant Professor Departments of Genetics, Biostatistics, and Section on Statistical Genetics University of Alabama at Birmingham http://www.ssg.uab.edu http://www.transvar.org From elventear at gmail.com Mon Dec 11 04:33:03 2006 From: elventear at gmail.com (Pepe Barbe) Date: Sun, 10 Dec 2006 22:33:03 -0600 Subject: [BioPython] Markov Model module in BioPython Message-ID: <3e73596b0612102033s38ce972dl70dbd8b09fc3114d@mail.gmail.com> Hello Everyone, I am curious whether the Markov Model module works, or whether it is complete. I've seen the tests for the module and they don't test the functionality of the entire module, and I've run into some issues that make me wonder if it is complete. E.g. the allow_transitions function in the Markov Builder class is missing self._state_alphabet.letters in line 182. After fixing that, another bug showed up. Thus I wonder if it has been tested or if its functionality is complete. I am running BioPython 1.42.
Thanks, Pepe From alpersoyler at yahoo.com Mon Dec 11 14:44:12 2006 From: alpersoyler at yahoo.com (alper soyler) Date: Mon, 11 Dec 2006 06:44:12 -0800 (PST) Subject: [BioPython] XML Parser problem Message-ID: <20061211144412.6627.qmail@web56505.mail.re3.yahoo.com> Dear all, I run blastall with option -m7 to save the resulting file as xml. However, when I open the xml file with firefox, it gave the following error message. XML Parsing Error: junk after document element Location: file:///home/alper/Desktop/genes/combinedblastfile.xml Line Number 38, Column 1: ^ But it can be opened with the text editor. When I tried to parse the results with biopython it also gives the below errors. I did not understand the reason. If you help me, I will be very glad. Thank you in advance. Traceback (most recent call last): File "XMLBlastParser.py", line 13, in ? b_record = b_parser.parse(blast_out) File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 112, in parse self._parser.parse(handler) File "/usr/lib/python2.4/site-packages/_xmlplus/sax/expatreader.py", line 109, in parse xmlreader.IncrementalParser.parse(self, source) File "/usr/lib/python2.4/site-packages/_xmlplus/sax/xmlreader.py", line 123, in parse self.feed(buffer) File "/usr/lib/python2.4/site-packages/_xmlplus/sax/expatreader.py", line 220, in feed self._err_handler.fatalError(exc) File "/usr/lib/python2.4/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError raise exception xml.sax._exceptions.SAXParseException:/home/alper/Desktop/genes/combinedblastfile.xml:38:0: junk after document element Alper Soyler Dept. 
of Food Engineering Middle East Technical University,Turkey Tel:+90312 2105625 Fax:+90312 2102767 http://www.metu.edu.tr/~soyler ----- Original Message ---- From: "biopython-request at lists.open-bio.org" To: biopython at lists.open-bio.org Sent: Sunday, October 8, 2006 8:03:28 AM Subject: BioPython Digest, Vol 46, Issue 2 Send BioPython mailing list submissions to biopython at lists.open-bio.org To subscribe or unsubscribe via the World Wide Web, visit http://lists.open-bio.org/mailman/listinfo/biopython or, via email, send a message with subject or body 'help' to biopython-request at lists.open-bio.org You can reach the person managing the list at biopython-owner at lists.open-bio.org When replying, please edit your Subject line so it is more specific than "Re: Contents of BioPython digest..." Today's Topics: 1. Re: Genbank parsing problem and fix (Gemma Atkinson) 2. Re: Genbank parsing problem and fix (Peter) 3. BioPython for TRANSFAC (Wijaya Edward) 4. Creating fusion protein like constructs with BioPython (Mitchell Stanton-Cook) 5. Re: Creating fusion protein like constructs with BioPython (Peter) 6. Re: Creating fusion protein like constructs with BioPython (Thomas Hamelryck) 7. Problem parsing Blast XML output from different sources (Steffi Gebauer-Jung) 8. Re: Problem parsing Blast XML output from different sources (Michiel Jan Laurens de Hoon) 9. Join kirby white on Yahoo! Messenger! (kirbywhite at sbcglobal.net) 10. Re: Problem parsing Blast XML output from different sources (Michiel de Hoon) ---------------------------------------------------------------------- Message: 1 Date: Tue, 3 Oct 2006 12:36:58 +0100 From: Gemma Atkinson Subject: Re: [BioPython] Genbank parsing problem and fix To: biopython at lists.open-bio.org Message-ID: Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Hi Peter, I was using the Bio.Genbank module. 
This is the code I've been using: from Bio import GenBank parser = GenBank.RecordParser(debug_level=2) record = parser.parse(open("test4.txt")) It was the expressions/genbank.py file, imported from within the Genbank module that I've been changing. I haven't touched the formatdefs/genbank.py file (should have made that clear before - sorry). This was the error I was getting before I changed expressions/ genbank.py: File "testgbparser.py", line 3, in ? record = parser.parse(open("test4.txt")) File "/Library/Frameworks/Python.framework/Versions/2.4/lib/ python2.4/Bio/GenBank/__init__.py", line 240, in parse self._scanner.feed(handle, self._consumer) File "/Library/Frameworks/Python.framework/Versions/2.4/lib/ python2.4/Bio/GenBank/__init__.py", line 1259, in feed self._parser.parseFile(handle) File "/Library/Frameworks/Python.framework/Versions/2.4/lib/ python2.4/Martel/Parser.py", line 328, in parseFile self.parseString(fileobj.read()) File "/Library/Frameworks/Python.framework/Versions/2.4/lib/ python2.4/Martel/Parser.py", line 356, in parseString self._err_handler.fatalError(result) File "/Library/Frameworks/Python.framework/Versions/2.4//lib/ python2.4/xml/sax/handler.py", line 38, in fatalError raise exception Martel.Parser.ParserPositionException: error parsing at or beyond character 1153 Gemma On 3 Oct 2006, at 10:54, Peter wrote: > gca500 at york.ac.uk wrote: >> Hi All, >> Been having a problem using the Genbank RecordParser with some >> Genbank files that have recently been added to NCBI. After a bit >> of trial and error, I realised the problem only occurs if a >> REFERENCE field isn't followed by an AUTHOR field (for example in >> reference 2 of this record: http://www.ncbi.nlm.nih.gov/entrez/ >> viewer.fcgi?db=protein&val=88602864). >> There's a very easy fix on line 289 of Genbank.py. Decided to post >> this to the list to save any one else who stumbles across this >> problem tearing their hair out like I've been doing this afternoon! >> Change ... 
and it works! >> Hope this is useful, >> Gemma > > Hi Gemma, > > I have made your suggested change to biopython/Bio/formatdefs/ > genbank.py as CVS revision 1.10, which should be viewable online soon: > > http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/ > expressions/genbank.py?cvsroot=biopython > > I am curious as to why you are using this code (part of the > FormatIO system), rather than the Bio.GenBank module. > > Thank you, > > Peter > ------------------------------ Message: 2 Date: Tue, 03 Oct 2006 14:33:48 +0100 From: Peter Subject: Re: [BioPython] Genbank parsing problem and fix To: Gemma Atkinson Cc: biopython at lists.open-bio.org Message-ID: <452266BC.9060809 at maubp.freeserve.co.uk> Content-Type: text/plain; charset=ISO-8859-1; format=flowed >> Hi Gemma, >> >> I have made your suggested change to biopython/Bio/formatdefs/ >> genbank.py as CVS revision 1.10, which should be viewable online soon: >> >> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/ >> expressions/genbank.py?cvsroot=biopython I got the URL right, but I mean to say Bio/expressions/genbank.py (which actually has the Martel definition in it) not Bio/formatdefs/genbank.py Peter wrote: >> I am curious as to why you are using this code ... Gemma replied: > I was using the Bio.Genbank module. This is the code I've been using: > > from Bio import GenBank > parser = GenBank.RecordParser(debug_level=2) > record = parser.parse(open("test4.txt")) I would guess you are using BioPython 1.41 (or older) then, as your stack trace was indeed using Martel internally. Recent versions of BioPython (1.42 and later) use a pure python parser in Bio.GenBank as the old Martel code didn't scale well with large input files (to the point of being almost useless on large genomes). If you do update your installation, and run into any problems with the GenBank parser, please do let us know. 
Peter ------------------------------ Message: 3 Date: Tue, 03 Oct 2006 22:16:27 +0800 From: Wijaya Edward Subject: [BioPython] BioPython for TRANSFAC To: biopython at lists.open-bio.org Message-ID: <3ACF03E372996C4EACD542EA8A05E66A061584 at mailbe01.teak.local.net> Content-Type: text/plain; charset=iso-8859-1 Hi there, Is there a method in BioPython that allows me to pass the query "fruitfly" or "drosophila" and get back: 1. already characterized TFs and their binding sites (BS), 2. their respective coregulated genes, and 3. the position of the TFBS in the genes, all from the TRANSFAC database? -- Regards, Edward WIJAYA ------------------------------ Message: 4 Date: Wed, 4 Oct 2006 23:38:03 +1000 From: "Mitchell Stanton-Cook" Subject: [BioPython] Creating fusion protein like constructs with BioPython To: BioPython at lists.open-bio.org Message-ID: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Hello all. I am trying to create a fusion-protein-like model from two separate PDB files. I introduce a CYS mutant in the target protein, and then wish to form a disulphide bond between it and a small peptide. This is purely computational work. I am using Bio.PDB. As the two structures are in arbitrary frames of reference I need to rotate and translate to form the "construct". I wish to have TargetProtein-CB-SY-SY-CB-SmallPeptide (the peptide is not really added to the N/C terminus). I have tried many different approaches but have failed miserably to get SmallPeptide rotated relative to TargetProtein at the correct dihedral angle (+/-90 deg) and bond lengths.
My current approach is (omitting the correct bond length at this time):

TP-CB-SY  SY-CB-SP
   1  2   3  4

Translate 2 onto 3
Calculate the angle between 1-(23)-4
Calculate the cross product of 1-23 x 23-4
Generate the rotation matrix given the angle and vector
Rotate all SP (SmallPeptide) atoms by this rotation matrix.

This has not worked. I have had some other ideas and have written code for them. Ideally, I wish to calculate the rotations about X,Y,Z to place the SP at the correct dihedral angle followed by translation, but I have no idea how to do this. 1) Can I use Bio.PDB to do this above task or do I need to look at something else? 2) Does anyone have any ideas on how to complete this goal? Thanking you for your time. Mitch ------------------------------ Message: 5 Date: Thu, 05 Oct 2006 10:47:30 +0100 From: Peter Subject: Re: [BioPython] Creating fusion protein like constructs with BioPython To: Mitchell Stanton-Cook Cc: BioPython at lists.open-bio.org Message-ID: <4524D4B2.8030600 at maubp.freeserve.co.uk> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Mitchell Stanton-Cook wrote: > Hello all. > > I am trying to create fusion protein-like model from two separate pdb files. > I introduce a CYS mutant in the target protein, and then wish to form a > disulphide bond between it and a small peptide. > > This is pure computational work. > > ... > > 1) Can I use Bio.PDB to do this above task or do I need to look at something > else? My gut instinct is that yes, you probably can - but you will have to do a lot of the work with your own code. It's not something I have ever tried though. > 2) Does anyone have any ideas on how to complete this goal? You might want to have a look at MMTK, which on the face of it would be better suited. Assuming MMTK will read both PDB files you might have better luck - this proviso is because I have found MMTK will choke on "odd" PDB files, and its support for non-standard residues could be better.
http://starship.python.net/crew/hinsen/MMTK/index.html Peter ------------------------------ Message: 6 Date: Thu, 5 Oct 2006 11:52:56 +0200 From: "Thomas Hamelryck" Subject: Re: [BioPython] Creating fusion protein like constructs with BioPython To: "Mitchell Stanton-Cook" Cc: BioPython at lists.open-bio.org Message-ID: <2d7c25310610050252j2f889242h84411e0927fb4502 at mail.gmail.com> Content-Type: text/plain; charset=UTF-8; format=flowed Hi, > I am trying to create fusion protein-like model from two separate pdb files. > I introduce a CYS mutant in the target protein, and then wish to form a > disulphide bound between it and a small peptide. ... > 1) Can I use Bio.PDB to do this above task or do I need to look at something > else? Bio.PDB has functionality to do vector/rotation calculations. Take a look at the Vector.py module. Best, ---- Thomas Hamelryck, Post-doctoral researcher Bioinformatics center Institute of Molecular Biology and Physiology University of Copenhagen Universitetsparken 15 - Bygning 10 DK-2100 Copenhagen ? Denmark Homepage: http://www.binf.ku.dk/Protein_structure ------------------------------ Message: 7 Date: Thu, 05 Oct 2006 12:30:36 +0200 From: Steffi Gebauer-Jung Subject: [BioPython] Problem parsing Blast XML output from different sources To: biopython at lists.open-bio.org Message-ID: <4524DECC.3030307 at ice.mpg.de> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Hello, because of blastall 2.2.14 output was not parsed from the Bio.Blast.NCBIStandalone parser, I tried to switch to the recommended Bio.Blast.NCBIXML parser. Thereby I found, that the xml output of the locally installed standalone blastall (2.2.14) differs from the web xml output. For BlastN hsps on Plus/Minus strands, the xml gives query_frame/hit_frame 1 / -1 as usual. But query and frame positions and sequences are switched in direction (would match frames -1/1). 
As the Bio.Blast.Record returned by the NCBIXML parser only gives frames, sequences and start positions it is not possible (without knowing the source of the xml file) to be sure to find the right data. This is clearly a problem of Blast. But because of the missing end positions in the returned record object it becomes a problem for users of the parser too. Could somebody try to confirm the different behaviour of the xml blast output with his/her own examples/installation? Thanks, Steffi ------------------------------ Message: 8 Date: Thu, 05 Oct 2006 12:01:04 -0400 From: Michiel Jan Laurens de Hoon Subject: Re: [BioPython] Problem parsing Blast XML output from different sources To: Steffi Gebauer-Jung Cc: biopython at lists.open-bio.org Message-ID: <45252C40.8040806 at c2b2.columbia.edu> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Which sequence are you running blast on? I'd like to try this on our local blast installation. --Michiel. Steffi Gebauer-Jung wrote: > Hello, > > because of blastall 2.2.14 output was not parsed from the > Bio.Blast.NCBIStandalone parser, > I tried to switch to the recommended Bio.Blast.NCBIXML parser. > > Thereby I found, that the xml output of the locally installed standalone > blastall (2.2.14) > differs from the web xml output. > > For BlastN hsps on Plus/Minus strands, the xml gives > query_frame/hit_frame 1 / -1 as usual. > But query and frame positions and sequences are switched in direction > (would match frames -1/1). > > As the Bio.Blast.Record returned by the NCBIXML parser only gives > frames, sequences > and start positions it is not possible (without knowing the source of > the xml file) > to be sure to find the right data. > > This is clearly a problem of Blast. > But because of the missing end positions in the returned record object > it becomes a problem for users of the parser too. 
> > Could somebody try to confirm the different behaviour of the xml blast > output > with his/her own examples/installation? > > Thanks, Steffi > > > > -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 ------------------------------ Message: 10 Date: Sun, 08 Oct 2006 00:51:09 -0400 From: Michiel de Hoon Subject: Re: [BioPython] Problem parsing Blast XML output from different sources To: Steffi Gebauer-Jung , biopython at biopython.org Message-ID: <452883BD.7050907 at c2b2.columbia.edu> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Hi Steffi, I am trying to replicate this problem with Blast. Where did you get the pat database? I searched for it with google, but there seems to be more than one blast database called pat. --Michiel. Steffi Gebauer-Jung wrote: > Hello, > > I don't know what local databases you have available for testing. > The discrepancy between xml and 'pairwise text' output should be seen > for every Plus/Minus Hsp created by local Blastn (local server or > standalone blastall from command line, I use version 2.2.14) > > I tried several combinations, one is M38240 vs. pat database, > the hsp hit was BD298385.
> Here are the interesting output snippets: > >> dbj|BD298385.1| >> >> CLEAN SYNTHETIC VECTORS, PLASMIDS, TRANSGENIC PLANTS AND PLANT PARTS > CONTAINING THEM, AND METHODS FOR OBTAINING THEM > Length = 14108 > > Score = 125 bits (63), Expect = 1e-25 > Identities = 63/63 (100%) > Strand = Plus / Minus > > > Query: 727 aatgaagactaatctttttctctttctcatcttttcacttctcctatcattatcctcggc > 786 > |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| > Sbjct: 8332 aatgaagactaatctttttctctttctcatcttttcacttctcctatcattatcctcggc > 8273 > > Query: 787 cga 789 > ||| > Sbjct: 8272 cga 8270 > > ===================================================== > > 15 > gi|92136243|dbj|BD298385.1| > CLEAN SYNTHETIC VECTORS, PLASMIDS, TRANSGENIC PLANTS > AND PLANT PARTS CONTAINING THEM, AND METHODS FOR OBTAINING THEM > BD298385 > 14108 > > > 1 > 125.381 > 63 > 9.63859e-26 > 789 > 727 > 8270 > 8332 > 1 > -1 > 63 > 63 > 63 > > TCGGCCGAGGATAATGATAGGAGAAGTGAAAAGATGAGAAAGAGAAAAAGATTAGTCTTCATT > > > TCGGCCGAGGATAATGATAGGAGAAGTGAAAAGATGAGAAAGAGAAAAAGATTAGTCTTCATT > > > ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| > > > > > > Thanks, Steffi > > > > > > > Michiel Jan Laurens de Hoon wrote: > >> Which sequence are you running blast on? >> I'd like to try this on our local blast installation. >> >> --Michiel. >> >> Steffi Gebauer-Jung wrote: >> >>> Hello, >>> >>> because of blastall 2.2.14 output was not parsed from the >>> Bio.Blast.NCBIStandalone parser, >>> I tried to switch to the recommended Bio.Blast.NCBIXML parser. >>> >>> Thereby I found, that the xml output of the locally installed >>> standalone blastall (2.2.14) >>> differs from the web xml output. >>> >>> For BlastN hsps on Plus/Minus strands, the xml gives >>> query_frame/hit_frame 1 / -1 as usual. >>> But query and frame positions and sequences are switched in direction >>> (would match frames -1/1). 
>>> >>> As the Bio.Blast.Record returned by the NCBIXML parser only gives >>> frames, sequences >>> and start positions it is not possible (without knowing the source of >>> the xml file) >>> to be sure to find the right data. >>> >>> This is clearly a problem of Blast. >>> But because of the missing end positions in the returned record object >>> it becomes a problem for users of the parser too. >>> >>> Could somebody try to confirm the different behaviour of the xml >>> blast output >>> with his/her own examples/installation? >>> >>> Thanks, Steffi >>> >>> >>> >> >> >> > ------------------------------ End of BioPython Digest, Vol 46, Issue 2 **************************************** From gebauer-jung at ice.mpg.de Mon Dec 11 15:54:54 2006 From: gebauer-jung at ice.mpg.de (Steffi Gebauer-Jung) Date: Mon, 11 Dec 2006 16:54:54 +0100 Subject: [BioPython] XML Parser problem Message-ID: <457D7F4E.60804@ice.mpg.de> Hello, there is more than one XML document concatenated in the file, which in fact is not valid XML. Some days ago I had the same problem. It does not occur with the NCBI blast server but with our local blast server installation (version 2.2.15): if there is more than one query (regardless of whether they are given as sequences or as a multiple-FASTA file), the resulting XML output is a concatenation of several XML files. Interestingly, for the command-line blastall (also version 2.2.15) the output is OK and contains a valid XML document with several iterations.
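One way to cope with a file of this kind is to cut it back into its individual documents before handing each one to an XML parser. The following is a minimal sketch, assuming every concatenated document starts with its own `<?xml` declaration. The function name split_concatenated_xml and the toy BlastOutput/query elements are illustrative only; they are not the real BLAST schema and not part of Biopython.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: each concatenated document begins with an XML declaration.
XML_DECL = re.compile(r"<\?xml\s")


def split_concatenated_xml(text):
    """Return the individual XML documents found in `text`, in order."""
    starts = [m.start() for m in XML_DECL.finditer(text)]
    if not starts:
        # No declaration at all: treat the input as a single document.
        return [text.strip()] if text.strip() else []
    ends = starts[1:] + [len(text)]
    # Anything before the first declaration is ignored.
    return [text[s:e].strip() for s, e in zip(starts, ends)]


# Toy stand-in for two BLAST reports written back to back into one file.
sample = (
    '<?xml version="1.0"?>\n'
    "<BlastOutput><query>seq1</query></BlastOutput>\n"
    '<?xml version="1.0"?>\n'
    "<BlastOutput><query>seq2</query></BlastOutput>\n"
)

docs = split_concatenated_xml(sample)
# Each piece is now well-formed on its own and parses without error.
queries = [ET.fromstring(doc).findtext("query") for doc in docs]
```

Parsing `sample` in one go fails because a well-formed XML document has exactly one root element; parsing the pieces one at a time avoids that. For large result files it would be better to stream the input rather than read it all into memory, which this sketch does not attempt.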
Best, Steffi >Message: 9 >Date: Mon, 11 Dec 2006 06:44:12 -0800 (PST) >From: alper soyler >Subject: [BioPython] XML Parser problem >To: biopython at lists.open-bio.org >Message-ID: <20061211144412.6627.qmail at web56505.mail.re3.yahoo.com> >Content-Type: text/plain; charset=ascii > >Dear all, > >I run blastall with option -m7 to save the resulting file as xml. However, when I open the xml file with firefox, it gave the following error message. > >XML Parsing Error: junk after document element >Location: file:///home/alper/Desktop/genes/combinedblastfile.xml >Line Number 38, Column 1: > >^ >But it can be opened with the text editor. When I tried to parse the results with biopython it also gives the below errors. I did not understand the reason. If you help me, I will be very glad. Thank you in advance. > Traceback (most recent call last): > File "XMLBlastParser.py", line 13, in ? > b_record = b_parser.parse(blast_out) > File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 112, in parse > self._parser.parse(handler) > File "/usr/lib/python2.4/site-packages/_xmlplus/sax/expatreader.py", line 109, in parse > xmlreader.IncrementalParser.parse(self, source) > File "/usr/lib/python2.4/site-packages/_xmlplus/sax/xmlreader.py", line 123, in parse > self.feed(buffer) > File "/usr/lib/python2.4/site-packages/_xmlplus/sax/expatreader.py", line 220, in feed > self._err_handler.fatalError(exc) > File "/usr/lib/python2.4/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError > raise exception >xml.sax._exceptions.SAXParseException:/home/alper/Desktop/genes/combinedblastfile.xml:38:0: junk after document element > > >Alper Soyler >Dept. 
of Food Engineering >Middle East Technical University,Turkey >Tel:+90312 2105625 >Fax:+90312 2102767 >http://www.metu.edu.tr/~soyler > From mdehoon at c2b2.columbia.edu Mon Dec 11 16:24:22 2006 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Mon, 11 Dec 2006 11:24:22 -0500 Subject: [BioPython] XML Parser problem In-Reply-To: <20061211144412.6627.qmail@web56505.mail.re3.yahoo.com> References: <20061211144412.6627.qmail@web56505.mail.re3.yahoo.com> Message-ID: <457D8636.9010301@c2b2.columbia.edu> It looks like you are running an old version of Blast. When blasting several sequences at the same time, old blast makes an output file consisting of several XML files concatenated together. According to the XML specification, this is not a well-formed XML file. There are two things you can do: 1) Continue using the old blast, and use NCBIStandalone.Iterator to parse the output file. See section 3.1.6 in the tutorial. 2) Update to the new blast, which creates a well-formed XML file, and see my next mail about the XML parser. --Michiel. alper soyler wrote: > Dear all, > > I run blastall with option -m7 to save the resulting file as xml. However, when I open the xml file with firefox, it gave the following error message. > > XML Parsing Error: junk after document element > Location: file:///home/alper/Desktop/genes/combinedblastfile.xml > Line Number 38, Column 1: > > ^ -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From mdehoon at c2b2.columbia.edu Mon Dec 11 17:11:45 2006 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Mon, 11 Dec 2006 12:11:45 -0500 Subject: [BioPython] Blast XML parser Message-ID: <457D9151.3020401@c2b2.columbia.edu> The file format of Blast XML output changed with recent (>= 2.2.14 I believe) versions of blast if multiple sequences are blasted at the same time. 
Older versions of blast return an output file consisting of several XML files concatenated together. Newer blast versions return one XML file containing the blast results for all blasted sequences. While this has the advantage of being a valid XML file, it breaks NCBIStandalone.Iterator, which looks for the start of a new XML file when iterating over the blast results.

There are several bug reports now related to parsing multiple blast records (bugs 1970, 2051, 2090).

I have written a patch to Bio/Blast/NCBIXML to fix this problem. As it changes the way NCBIXML is used, I was wondering if anybody has objections to this approach.


Current usage of NCBIXML (single Blast record):

from Bio.Blast import NCBIXML
blast_out = open("myblastoutput.xml")
parser = NCBIXML.BlastParser()
b_record = parser.parse(blast_out)


New usage of NCBIXML (single Blast record):

from Bio.Blast import NCBIXML
blast_out = open("myblastoutput.xml")
b_records = NCBIXML.parse(blast_out)
b_record = b_records.next()


Current usage of NCBIXML (multiple Blast records):

from Bio.Blast import NCBIStandalone, NCBIXML
parser = NCBIXML.BlastParser()
blast_out = open("myblastoutput.xml")
for b_record in NCBIStandalone.Iterator(blast_out, parser):
    # Do something with the record
    pass


New usage of NCBIXML (multiple Blast records):

from Bio.Blast import NCBIXML
blast_out = open("myblastoutput.xml")
b_records = NCBIXML.parse(blast_out)
for b_record in b_records:
    # Do something with the record
    pass


Objections, anybody?
In case you want to try this, you can download the patch from Bugzilla bug #1970.

--Michiel.
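
For readers without the patch, a rough sketch of the iterator idea using only the Python standard library. It assumes new-style output in which each query's results sit in an `<Iteration>` element with an `<Iteration_query-def>` child. `iter_queries` is a hypothetical helper, not the patched `NCBIXML.parse`, and it yields only the query definitions rather than full Blast records.

```python
import io
import xml.etree.ElementTree as ET


def iter_queries(handle):
    """Yield the query definition of each <Iteration> in new-style BLAST XML.

    Hypothetical helper for illustration; it is not the Biopython parser.
    """
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Iteration":
            yield elem.findtext("Iteration_query-def")
            elem.clear()  # release the children of the finished iteration


# Tiny hand-written stand-in for a real "blastall -m7" output file.
sample = io.BytesIO(
    b"<BlastOutput><BlastOutput_iterations>"
    b"<Iteration><Iteration_query-def>seq1</Iteration_query-def></Iteration>"
    b"<Iteration><Iteration_query-def>seq2</Iteration_query-def></Iteration>"
    b"</BlastOutput_iterations></BlastOutput>"
)

queries = list(iter_queries(sample))
```

Because the function is a generator, the calling code is a plain `for` loop whether the file holds one query or many.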
-- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From winter at biotec.tu-dresden.de Mon Dec 11 15:44:24 2006 From: winter at biotec.tu-dresden.de (Christof Winter) Date: Mon, 11 Dec 2006 16:44:24 +0100 Subject: [BioPython] XML Parser problem In-Reply-To: <20061211144412.6627.qmail@web56505.mail.re3.yahoo.com> References: <20061211144412.6627.qmail@web56505.mail.re3.yahoo.com> Message-ID: <457D7CD8.70700@biotec.tu-dresden.de> Dear Alper: The error you get is probably due to a not well-formed XML document produced by older versions of BLAST. On my Debian Linux system, blastall 2.2.10 produces such XML files, whereas blastall 2.2.13 does not anymore. A workaround was included in the NCBIStandalone Iterator class by Michael Anthony Maibaum: http://portal.open-bio.org/pipermail/biopython/2006-January/002889.html The following code should work: from Bio.Blast import NCBIXML, NCBIStandalone blast_results = open(filename) iterator = NCBIStandalone.Iterator(blast_results, NCBIXML.BlastParser()) for record in iterator: # do something blast_results.close() http://www.biopython.org/DIST/docs/tutorial/Tutorial.pdf on page 21 still lists code that uses b_record = b_parser.parse(blast_out), which gives the error when parsing a file that consists of several XML documents. Hope that helps, cheers, Christof alper soyler wrote: > Dear all, > > I run blastall with option -m7 to save the resulting file as xml. However, when I open the xml file with firefox, it gave the following error message. > > XML Parsing Error: junk after document element > Location: file:///home/alper/Desktop/genes/combinedblastfile.xml > Line Number 38, Column 1: > > ^ > But it can be opened with the text editor. When I tried to parse the results with biopython it also gives the below errors. I did not understand the reason. If you help me, I will be very glad. Thank you in advance. 
> Traceback (most recent call last): > File "XMLBlastParser.py", line 13, in ? > b_record = b_parser.parse(blast_out) > File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 112, in parse > self._parser.parse(handler) > File "/usr/lib/python2.4/site-packages/_xmlplus/sax/expatreader.py", line 109, in parse > xmlreader.IncrementalParser.parse(self, source) > File "/usr/lib/python2.4/site-packages/_xmlplus/sax/xmlreader.py", line 123, in parse > self.feed(buffer) > File "/usr/lib/python2.4/site-packages/_xmlplus/sax/expatreader.py", line 220, in feed > self._err_handler.fatalError(exc) > File "/usr/lib/python2.4/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError > raise exception > xml.sax._exceptions.SAXParseException:/home/alper/Desktop/genes/combinedblastfile.xml:38:0: junk after document element > > > Alper Soyler > Dept. of Food Engineering > Middle East Technical University,Turkey > Tel:+90312 2105625 > Fax:+90312 2102767 > http://www.metu.edu.tr/~soyler -- Christof Winter Bioinformatics Group TU Dresden Tatzberg 47-51 01307 Dresden, Germany From tiagoantao at gmail.com Mon Dec 11 16:40:09 2006 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 11 Dec 2006 16:40:09 +0000 Subject: [BioPython] Extracting gene position information from whole chromosome information Message-ID: <6d941f120612110840j246b0ebane46a74605caf1898@mail.gmail.com> Hi all, I am trying to understand what would be the best practice in BioPython to extract gene position information from genomic information. I am currently using genomic information (as opposed to querying GenBank for a gene and looking at metadata) mainly because I am doing genome-wide studies (in fact crossing information with the HapMap project). My current strategy (which is quite poor, IMO) is as this: 1. I get all ASN files for human chromosomes (e.g. ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_02/hs_ref_chr2.asn.gz ) 2. 
Search (textual 'dump' search, not using any kind of parser) for the locus of interest (say, lactase). 3. Find the positions in the genome where the gene is coded 4. Use it (in my case to find relevant SNPs in HapMap) For me (being a BioPython newbie), I end up with some doubts: 1. Are there any mechanisms (BioPython wise) to parse genome (chromosome) wide ASN files? I could not find none in the cookbook... 2. Would this be the best strategy (Searching for annotations in single gene files would be another strategy...)? Thanks a lot, Tiago -- For every expert, there is an equal and opposite expert. - Arthur C. Clarke From sdavis2 at mail.nih.gov Tue Dec 12 11:41:27 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Tue, 12 Dec 2006 06:41:27 -0500 Subject: [BioPython] Extracting gene position information from whole chromosome information In-Reply-To: <6d941f120612110840j246b0ebane46a74605caf1898@mail.gmail.com> References: <6d941f120612110840j246b0ebane46a74605caf1898@mail.gmail.com> Message-ID: <200612120641.27156.sdavis2@mail.nih.gov> On Monday 11 December 2006 11:40, Tiago Ant?o wrote: > Hi all, > > I am trying to understand what would be the best practice in BioPython > to extract gene position information from genomic information. > I am currently using genomic information (as opposed to querying > GenBank for a gene and looking at metadata) mainly because I am doing > genome-wide studies (in fact crossing information with the HapMap > project). > > My current strategy (which is quite poor, IMO) is as this: > 1. I get all ASN files for human chromosomes (e.g. > ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_02/hs_ref_chr2.asn.gz ) > 2. Search (textual 'dump' search, not using any kind of parser) for > the locus of interest (say, lactase). > 3. Find the positions in the genome where the gene is coded > 4. Use it (in my case to find relevant SNPs in HapMap) > > For me (being a BioPython newbie), I end up with some doubts: > 1. 
Are there any mechanisms (BioPython wise) to parse genome > (chromosome) wide ASN files? I could not find none in the cookbook... > 2. Would this be the best strategy (Searching for annotations in > single gene files would be another strategy...)? I would simply use the UCSC table browser (or their tab-delimited text files) to do this. Sean From winter at biotec.tu-dresden.de Tue Dec 12 16:42:39 2006 From: winter at biotec.tu-dresden.de (Christof Winter) Date: Tue, 12 Dec 2006 17:42:39 +0100 Subject: [BioPython] Blast XML parser In-Reply-To: <457D9151.3020401@c2b2.columbia.edu> References: <457D9151.3020401@c2b2.columbia.edu> Message-ID: <457EDBFF.2080605@biotec.tu-dresden.de> Dear Michiel: I just tested the patched NCBIXML.py with the XML format output of multiple sequences blasted online at the NCBI BLAST website. The result looks fine, however, it seems that query IDs and definitions are not parsed. This is probably because of NCBI's change of tag names from and in the concatenated, old XML format to and in the new, valid XML format. Concerning the new syntax, I would prefer a unified syntax for parsing of both XML formats, and I would like to vote for Peter's "nice idea" in his comment #6 in http://bugzilla.open-bio.org/show_bug.cgi?id=1970). Running the same code on different machines with different local BLAST versions constantly gives me a headache when parsing the results. As long as these different BLAST versions are out there, people will run into problems, and fill the BioPython discussion lists. Cheers, Christof Michiel Jan Laurens de Hoon wrote: > The file format of Blast XML output changed with recent (>= 2.2.14 I > believe) versions of blast if multiple sequences are blasted at the same > time. Older versions of blast return an output file consisting of > several XML files concatenated together. Newer blast versions return one > XML file containing the blast results for all blasted sequences. 
Whereas > the advantage is that this is a valid XML file, it breaks > NCBIStandalone.Iterator, which looks for the start of a new XML file > when iterating over the blast results. > > There are several bug reports now related to parsing multiple blast > records (bugs 1970, 2051, 2090). > > I have written a patch to Bio/Blast/NCBIXML to fix this problem. As it > changes the way NCBIXML is used, I was wondering if anybody has > objections to this approach. > > > Current usage of NCBIXML (single Blast record): > > from Bio.Blast import NCBIXML > blast_out = open("myblastoutput.xml") > parser = NCBIXML.BlastParser() > b_record = parser.parse(blast_out) > > > New usage of NCBIXML (single Blast records): > > from Bio.Blast import NCBIXML > blast_out = open("myblastoutput.xml") > b_records = NCBIXML.parse(blast_out) > b_record = b_records.next() > > > > Current usage of NCBIXML (multiple Blast records): > > from Bio.Blast import NCBIStandalone, NCBIXML > parser = NCBIXML.BlastParser() > blast_out = open("myblastoutput.xml") > for b_record in NCBIStandalone.Iterator(blast_out, parser): > #Do something with the record > > > New usage of NCBIXML (multiple Blast records): > > from Bio.Blast import NCBIXML > blast_out = open("myblastoutput.xml") > b_records = NCBIXML.parse(blast_out) > for b_record in b_records: > #Do something with the record > > > Objections, anybody? > In case you want to try this, you can download the patch from Bugzilla > bug #1970. > > --Michiel. 
>
>

--
Christof Winter
Bioinformatics Group
TU Dresden
Tatzberg 47-51
01307 Dresden, Germany
Phone: +49 351 463 40065
EMail: winter at biotec.tu-dresden.de

From mmokrejs at ribosome.natur.cuni.cz Tue Dec 12 22:09:20 2006
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Tue, 12 Dec 2006 23:09:20 +0100
Subject: [BioPython] Medline records
In-Reply-To: <8B121229-3B35-4246-9222-A8E4CC8E91EA@hsc.wvu.edu>
References: <8B121229-3B35-4246-9222-A8E4CC8E91EA@hsc.wvu.edu>
Message-ID: <457F2890.8020307@ribosome.natur.cuni.cz>

Hi Thomas,
I have been using the Medline/Pubmed modules since at least biopython 1.42 and they do work, although some fields contain weird data. Well, see my comments in the code below. It seems the PubMed people should clean up their database contents a bit.

Thomas Elliott wrote:
> Hi,
>
> I started playing with Biopython again. Glad to see it is still
> active. I love Python.
>
> Some issues came up in parsing Medline records according to the
> tutorial.
>
> It wasn't obvious which variables exist to be queried on a record.
> The tutorial example gives title, authors, and source only.
>
> I tried looking at the code. There are terms in NLMMedlineXML.py
> that look like they should work, but which raise AttributeError (e.g.
> journal and date_created).
>
> I finally looked in the keys to __dict__ for a record. I found
> 'year' there, but record.year seems to be always empty, or rather it
> is a blank string.
>
> - record.title_abbreviation is actually the abbreviated journal name.
> - record.volume_issue gives only the volume, not the issue.
> - record.journal_title_code is always a blank string.
>
> Not sure what the right way is to do this. I guess it would be
> helpful to know which file of the source I should be looking at for
> variable names.


from Bio import PubMed, Medline, GenBank

def get_citation_by_pmid(pmid):
    """Fetches citation data from NCBI Pubmed using pmid as a key.

    It returns a dictionary with the following structure:
    {'title': 'Molecular characterization of the murine Hif-1 alpha locus.',
     'journal': 'Gene. Expr.',
     'author': 'Luo G., Gu Y. Z., Jain S., Chan W. K., Carr K. M., Hogenesch J. B., Bradfield C. A.',
     'volume': '6', 'year': '1997', 'issue': '5', 'pages': '287-299'}

    Test with PMID: 15703059, 10851087, 1111, 123456, 15703059, Y00664, 12509242, 10713153
    """

    rec_parser = Medline.RecordParser()
    medline_dict = PubMed.Dictionary(parser = rec_parser)
    cur_record = medline_dict[pmid]

    # authors: put a dot after every initial
    _authors = cur_record.authors
    # ['Luo G', 'Gu YZ', 'Jain S', 'Chan WK', 'Carr KM', 'Hogenesch JB', 'Bradfield CA']
    _new_authors = []
    for _author in _authors:
        _author = ' '.join(_author.split(' ')[:-1]+['. '.join(tuple(_author.split(' ')[-1:][0]))+'.'])
        # 'van Carr-Schmidt K.M.'
        _new_authors.append(_author)

    # title: strip the square brackets and any embedded line breaks
    _title = cur_record.title
    # '[The laboratory in programs for enteric infection control]'
    # 'Cap-independent translation of maize Hsp101.'
    # "The chicken c-Jun 5' untranslated region directs translation by internal\ninitiation."
    if _title.startswith('[') and _title.endswith(']'):
        _title = cur_record.title[1:-1]
    if '\r\n' in _title:
        _title = _title.replace('\r\n', ' ')
    if '\n' in _title:
        _title = _title.replace('\n', ' ')

    # _volume_issue = cur_record.volume_issue
    # '6' but also '1-2' and also 'Pt 6'

    # pages: expand an abbreviated last page, '287-99' becomes '287-299'
    _pages = cur_record.pagination
    # '287-99'
    _start_page, _last_page = _pages.split('-')
    _start_page, _last_page = int(_start_page), int(_last_page)
    if _last_page < _start_page:
        _fixed_last_page = str(_start_page)[:-len(str(_last_page))] + str(_last_page)
        _pages = str(_start_page) + "-" + str(_fixed_last_page)

    # year
    _year = cur_record.publication_date
    if not _year:
        _year = cur_record.year
    try:
        _year = int(_year)
    except TypeError:
        # without raise
        # 1998 Oct
        _space_position = _year.find(' ')
        _year = _year[:_space_position]
        _year = int(_year)
    except ValueError:
        # 1975 May-Jun
        _space_position = _year.find(' ')
        _year = _year[:_space_position]
        _year = int(_year)

    # journal
    _source = cur_record.source
    # 'Gene Expr 1997;6(5):287-99.'
    # 'J Comp Physiol [A] 2000 Jun;186(6):567-74'
    # 'Biokhimiia 1975 May-Jun;40(3):645-51.'
    # BUG: we should not blindly append dots to the end of the string,
    #      for example this would be wrong in case of journals:
    #      RNA, Oncogene, Nature ... where the correct citation is `Nature 33: 22-33', etc.
    _index = _source.find(cur_record.publication_date)
    _journal = _source[:_index - 1].strip() # strip the trailing space
    _journal = _journal.replace(' ','. ')
    if _journal[-1] != '.' and _journal[-1] != ')':
        # str.join() takes a single iterable, so append the dot directly
        _journal = _journal + '.'

    # volume and issue
    if ';' not in _source:
        raise ValueError("cannot find semicolon as the delimiter of the title from volume and issue")
    else:
        # print "_source is " + _source
        _index = _source.index(';')
    if '(' not in _source or ')' not in _source:
        raise ValueError("cannot find round brackets around issue number")
    else:
        # use rindex so we do not match the first bracket, as with 'DNA Repair (Amst) 2002 May 30;1(5):379-90.'
        _i1 = _source.rindex('(')
        _i2 = _source.rindex(')')
        _volume = _source[_index + 1:_i1]
        # print "volume is " + str(_volume)
        # get the issue from here, although we might already have it right
        _issue = _source[_i1 + 1:_i2]
        # print "issue is " + str(_issue)

    if str(pmid) != str(cur_record.pubmed_id):
        raise RuntimeError("we asked PubMed for pmid=" + str(pmid) + " but received record with pmid=" + str(cur_record.pubmed_id))

    _dict = {}
    _dict['year'] = _year
    _dict['author'] = ', '.join(_new_authors)
    _dict['title'] = _title
    _dict['journal'] = _journal
    _dict['volume'] = _volume
    _dict['issue'] = _issue
    _dict['pages'] = _pages
    return _dict


Hope this helps.
--
Dr. Martin Mokrejs
Dept. of Genetics and Microbiology
Faculty of Science, Charles University
Vinicna 5, 128 43 Prague, Czech Republic
http://www.iresite.org
http://www.iresite.org/~mmokrejs

From mdehoon at c2b2.columbia.edu Wed Dec 13 05:13:56 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Wed, 13 Dec 2006 00:13:56 -0500
Subject: [BioPython] Blast XML parser
In-Reply-To: <457EDBFF.2080605@biotec.tu-dresden.de>
References: <457D9151.3020401@c2b2.columbia.edu> <457EDBFF.2080605@biotec.tu-dresden.de>
Message-ID: <457F8C14.9050403@c2b2.columbia.edu>

Thanks, Christof.

Christof Winter wrote:
> The result looks fine, however, it seems that query IDs and definitions
> are not parsed. This is probably because of NCBI's change of tag names from
> ...
You're right. I've fixed this in an updated patch (at bug #1970 on Bugzilla).

> Concerning the new syntax, I would prefer a unified syntax for parsing
> of both XML formats, and I would like to vote for Peter's "nice idea" in
> his comment #6 in http://bugzilla.open-bio.org/show_bug.cgi?id=1970).
> Running the same code on different machines with different local BLAST
> versions constantly gives me a headache when parsing the results. As
> long as these different BLAST versions are out there, people will run
> into problems, and fill the BioPython discussion lists.
I'm not sure if I understand correctly. Are you saying that we should have one function to handle both old- and new-style XML output? Or are you referring to how the functions should be named? --Michiel. From biopython at maubp.freeserve.co.uk Wed Dec 13 09:59:10 2006 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 13 Dec 2006 09:59:10 +0000 Subject: [BioPython] Blast XML parser In-Reply-To: <457F8C14.9050403@c2b2.columbia.edu> References: <457D9151.3020401@c2b2.columbia.edu> <457EDBFF.2080605@biotec.tu-dresden.de> <457F8C14.9050403@c2b2.columbia.edu> Message-ID: <457FCEEE.8010203@maubp.freeserve.co.uk> Michiel de Hoon wrote: > Thanks, Christof. > >> Concerning the new syntax, I would prefer a unified syntax for parsing >> of both XML formats, and I would like to vote for Peter's "nice idea" in >> his comment #6 in http://bugzilla.open-bio.org/show_bug.cgi?id=1970). >> Running the same code on different machines with different local BLAST >> versions constantly gives me a headache when parsing the results. As >> long as these different BLAST versions are out there, people will run >> into problems, and fill the BioPython discussion lists. > > I'm not sure if I understand correctly. Are you saying that we should > have one function to handle both old- and new-style XML output? Or are > you referring to how the functions should be named? I think Christof was agreeing that supporting both the "old" and "new" Blast XML formats in a single function would be nice. I think we could wrap Michiel's new XML code inside a loop that would split up any concatenated (old style) XML files. It shouldn't be too ugly - provided the XML parser can cope with a single record old style, as well as one-or-more records new style. I'll try and make a modified patch to show what I have in mind later today... 
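
One way the splitting loop described above might look, as a minimal sketch. It assumes every concatenated document starts with an `<?xml` declaration at the beginning of a line. `split_concatenated_xml` is a hypothetical name, not code from the patch, and the sample input is a hand-written stand-in for old-style output.

```python
import io


def split_concatenated_xml(handle):
    """Yield each XML document found in a text handle as one string.

    Assumes every document begins with a line starting with '<?xml'.
    """
    current = []
    for line in handle:
        # A new declaration closes the document collected so far.
        if line.startswith("<?xml") and current:
            yield "".join(current)
            current = []
        current.append(line)
    if current:
        yield "".join(current)


old_style = io.StringIO(
    '<?xml version="1.0"?>\n<BlastOutput>first</BlastOutput>\n'
    '<?xml version="1.0"?>\n<BlastOutput>second</BlastOutput>\n'
)
documents = list(split_concatenated_xml(old_style))
```

Each yielded string could then be handed to the XML parser in turn. A new-style file holds a single declaration, so it comes back as one document and the same calling code covers both formats.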
Peter From winter at biotec.tu-dresden.de Wed Dec 13 10:40:26 2006 From: winter at biotec.tu-dresden.de (Christof Winter) Date: Wed, 13 Dec 2006 11:40:26 +0100 Subject: [BioPython] Blast XML parser In-Reply-To: <457F8C14.9050403@c2b2.columbia.edu> References: <457D9151.3020401@c2b2.columbia.edu> <457EDBFF.2080605@biotec.tu-dresden.de> <457F8C14.9050403@c2b2.columbia.edu> Message-ID: <457FD89A.6080504@biotec.tu-dresden.de> Dear Michiel: Michiel de Hoon wrote: > Thanks, Christof. > > Christof Winter wrote: >> The result looks fine, however, it seems that query IDs and >> definitions are not parsed. This is probably because of NCBI's change >> of tag names from > > ... > You're right. I've fixed this in an updated patch (at bug #1970 on > Bugzilla). Works fine! Thanks a lot for this, Michiel. >> Concerning the new syntax, I would prefer a unified syntax for parsing >> of both XML formats, and I would like to vote for Peter's "nice idea" >> in his comment #6 in >> http://bugzilla.open-bio.org/show_bug.cgi?id=1970). Running the same >> code on different machines with different local BLAST versions >> constantly gives me a headache when parsing the results. As long as >> these different BLAST versions are out there, people will run into >> problems, and fill the BioPython discussion lists. > > I'm not sure if I understand correctly. Are you saying that we should > have one function to handle both old- and new-style XML output? Or are > you referring to how the functions should be named? What I meant is: I would prefer that the same BioPython code is able to parse any NCBI Blast XML output format. That is, we should have one function for both old- and new-style XML output. The actual naming of functions is not so important for me, I also wouldn't mind having a new naming or syntax, as long as it can be used for either format. Personally, I favour iterators, since they are simple and elegant. Cheers, Christof > > --Michiel. 
From davidc at hgu.mrc.ac.uk Fri Dec 15 09:29:58 2006 From: davidc at hgu.mrc.ac.uk (Dave Clements) Date: Fri, 15 Dec 2006 09:29:58 +0000 Subject: [BioPython] OBO File Format Parser? Message-ID: <45826B16.2000504@hgu.mrc.ac.uk> Hello, I have searched through Biopython and have not been able to find an OBO file format parser (see http://www.geneontology.org/GO.format.shtml). Does anyone know if such a parser exists. I apologise if this issues has been previously raised. The Biopython mail list search facility is currently down. Thanks, Dave C. From cjfields at uiuc.edu Fri Dec 15 17:28:51 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 15 Dec 2006 11:28:51 -0600 Subject: [BioPython] OBO File Format Parser? In-Reply-To: <45826B16.2000504@hgu.mrc.ac.uk> References: <45826B16.2000504@hgu.mrc.ac.uk> Message-ID: <6ED9CE8F-0629-40A0-B14C-FB1925947E9F@uiuc.edu> On Dec 15, 2006, at 3:29 AM, Dave Clements wrote: > Hello, > > I have searched through Biopython and have not been able to find an > OBO > file format parser (see http://www.geneontology.org/GO.format.shtml). > Does anyone know if such a parser exists. > > I apologise if this issues has been previously raised. The Biopython > mail list search facility is currently down. > > Thanks, > > Dave C. 
BioPerl has a relatively new one:

=head1 NAME

Bio::OntologyIO::obo - a parser for OBO flat-file format from Gene Ontology Consortium

=head1 SYNOPSIS

   use Bio::OntologyIO;

   # do not use directly -- use via Bio::OntologyIO
   my $parser = Bio::OntologyIO->new
         ( -format => "obo",
           -file   => "gene_ontology.obo");

   while(my $ont = $parser->next_ontology()) {
       print "read ontology ",$ont->name()," with ",
             scalar($ont->get_root_terms)," root terms, and ",
             scalar($ont->get_all_terms)," total terms, and ",
             scalar($ont->get_leaf_terms)," leaf terms\n";
   }

chris

From asmund.skjaveland at usit.uio.no Fri Dec 15 19:31:21 2006
From: asmund.skjaveland at usit.uio.no (Åsmund Skjæveland)
Date: Fri, 15 Dec 2006 20:31:21 +0100
Subject: [BioPython] Converting between BLAST XML formats?
Message-ID: <4582F809.6080308@ulrik.uio.no>

Since the updated XML format from BLAST is causing so many headaches: Has anybody written code to convert XML from the new format to the old one? I'm trying to use a program that expects old-style XML, but I'd rather not downgrade my BLAST.

--
Åsmund Skjæveland

From biopython at maubp.freeserve.co.uk Sat Dec 16 11:20:32 2006
From: biopython at maubp.freeserve.co.uk (Peter (BioPython List))
Date: Sat, 16 Dec 2006 11:20:32 +0000
Subject: [BioPython] Converting between BLAST XML formats?
In-Reply-To: <4582F809.6080308@ulrik.uio.no>
References: <4582F809.6080308@ulrik.uio.no>
Message-ID: <4583D680.5060503@maubp.freeserve.co.uk>

Åsmund Skjæveland wrote:
> Since the updated XML format from BLAST is causing so many headaches:
> Has anybody written code to convert XML from the new format to the old
> one?

Not that I'm aware of.

> I'm trying to use a program that expects old-style XML, but I'd
> rather not downgrade my BLAST.

Are you talking about some "program" that expects old style XML Blast output which doesn't use BioPython? Your only practical option is to stick to single query sequences, or downgrade your version of Blast.

If you are using BioPython in this program, then please try out the new code recently committed to CVS as described on bug 1970.

http://bugzilla.open-bio.org/show_bug.cgi?id=1970

Fingers crossed this will work nicely with both old and new Blast XML output.

Peter

From aloraine at gmail.com Mon Dec 18 03:08:01 2006
From: aloraine at gmail.com (Ann Loraine)
Date: Sun, 17 Dec 2006 21:08:01 -0600
Subject: [BioPython] interbase vs one-based?
Message-ID: <83722dde0612171908t1f72199aveaed4e514730576b@mail.gmail.com>

Hello,

I'm not sure if this is a bug or not...my apologies if this has already been discussed.

I parsed a Genbank file using BioPython and got this:

>>> f = rec.features[1]
>>> print f
type: gene
location: [221:1607]
ref: None:None
strand: 1
qualifiers:
        Key: db_xref, Value: ['GeneID:911526']
        Key: locus_tag, Value: ['MYPU_0010']
        Key: note, Value: ['dnaA']

Note the coordinates - the feature's start position is 221 and end position is 1607, seemingly.

However, the text of the Genbank file for this feature says this:

     gene            222..1607
                     /locus_tag="MYPU_0010"
                     /note="dnaA"
                     /db_xref="GeneID:911526"

It looks like the Genbank parser converts coordinates from 1-based to interbase coordinates.

Is this correct?

Yours,

Ann

--
Ann Loraine
Assistant Professor
Departments of Genetics, Biostatistics, and Section on Statistical Genetics
University of Alabama at Birmingham
http://www.ssg.uab.edu
http://www.transvar.org

From biopython at maubp.freeserve.co.uk Mon Dec 18 11:37:02 2006
From: biopython at maubp.freeserve.co.uk (Peter (BioPython List))
Date: Mon, 18 Dec 2006 11:37:02 +0000
Subject: [BioPython] interbase vs one-based?
In-Reply-To: <83722dde0612171908t1f72199aveaed4e514730576b@mail.gmail.com>
References: <83722dde0612171908t1f72199aveaed4e514730576b@mail.gmail.com>
Message-ID: <45867D5E.1010009@maubp.freeserve.co.uk>

Ann Loraine wrote:
> Hello,
>
> I'm not sure if this is a bug or not...my apologies if this has
> already been discussed.

Yes, a raw location of 222..1607 is intentionally converted into a location [221:1607] in BioPython. This has been discussed before on the mailing list... but never mind.

The rationale is to follow the Python string slicing conventions, thus rec.seq[221:1607] should give you the nucleotides for this feature.

Try:

help(Bio.SeqFeature.FeatureLocation)

or:

print rec.features[1].location.__doc__

You should get something like this:

> Specify the location of a feature along a sequence.
>
> This attempts to deal with fuzziness of position ends, but also make
> it easy to get the start and end in the 'normal' case (no fuzziness).
>
> You should access the start and end attributes with
> your_location.start and your_location.end. If the start and end are
> exact, this will return the positions, if not, we'll return the
> approriate Fuzzy class with info about the position and fuzziness.
>
> Note that the start and end location numbering follow Python's
> scheme, thus a GenBank entry of 123..150 (one based counting) becomes
> a location of [122:150] (zero based counting).

Peter

From aloraine at gmail.com Mon Dec 18 13:53:09 2006
From: aloraine at gmail.com (Ann Loraine)
Date: Mon, 18 Dec 2006 07:53:09 -0600
Subject: [BioPython] interbase vs one-based?
In-Reply-To: <45867D5E.1010009@maubp.freeserve.co.uk>
References: <83722dde0612171908t1f72199aveaed4e514730576b@mail.gmail.com> <45867D5E.1010009@maubp.freeserve.co.uk>
Message-ID: <83722dde0612180553h4599a7fdoecabd987fda5a77@mail.gmail.com>

Thanks!

-Ann

On 12/18/06, Peter (BioPython List) wrote:
> Ann Loraine wrote:
> > Hello,
> >
> > I'm not sure if this is a bug or not...my apologies if this has
> > already been discussed.
>
> Yes, a raw location of 222..1607 is intentionally converted into a
> location [221:1607] in BioPython. This has been discussed before on the
> mailing list... but never mind.
> > The rationale is to follow Python's string slicing conventions, thus > rec.seq[221:1607] should give you the nucleotides for this feature. > > Try: > > help(Bio.SeqFeature.FeatureLocation) > > or: > > print rec.features[1].location.__doc__ > > You should get something like this: > > Specify the location of a feature along a sequence. > > > > This attempts to deal with fuzziness of position ends, but also make > > it easy to get the start and end in the 'normal' case (no fuzziness). > > > > You should access the start and end attributes with > > your_location.start and your_location.end. If the start and end are > > exact, this will return the positions, if not, we'll return the > > approriate Fuzzy class with info about the position and fuzziness. > > > > Note that the start and end location numbering follow Python's > > scheme, thus a GenBank entry of 123..150 (one based counting) becomes > > a location of [122:150] (zero based counting). > > Peter > > -- Ann Loraine Assistant Professor Departments of Genetics, Biostatistics, and Section on Statistical Genetics University of Alabama at Birmingham http://www.ssg.uab.edu http://www.transvar.org From biopython at wardroper.org Thu Dec 21 23:21:40 2006 From: biopython at wardroper.org (Alan Wardroper) Date: Thu, 21 Dec 2006 15:21:40 -0800 Subject: [BioPython] GO term annotation from fasta? Message-ID: <458B1704.6000704@wardroper.org> I have a large db of est clones and associated assemblies I'd like to (roughly) annotate using GO terms to let the wetlab people concentrate on potentially interesting clones. Looking for some advice on where to start to do this with biopython. My feeling is something like generate fasta files from my mysql db, blast the sequences against genbank, parse out the top hit, and use those IDs to grab GO terms, but I'm not sure how best to proceed. Is there a better way to do this in biopython?
I can't see any way to do blastx or tblastx from bp, with qblast only supporting blastn and blastp. Thanks for any pointers. -- -------------------------- Alan Wardroper alan at wardroper.org From sdavis2 at mail.nih.gov Fri Dec 22 01:38:21 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 21 Dec 2006 20:38:21 -0500 Subject: [BioPython] GO term annotation from fasta? In-Reply-To: <458B1704.6000704@wardroper.org> References: <458B1704.6000704@wardroper.org> Message-ID: <458B370D.9030300@mail.nih.gov> Alan Wardroper wrote: > I have a large db of est clones and associated assemblies I'd like to > (roughly) annotate using GO terms to let the wetlab people concentrate > on potentially interesting clones. Looking for some advice on where to > start to do this with biopython. My feeling is something like generate > fasta files from my mysql db, blast the sequences against genbank, parse > out the top hit, and use those IDs to grab GO terms, but I'm not sure > how best to proceed. Is there a better way to do this in biopython? I > can't see any way to do blastx or tblastx from bp, with qblast only > supporting blastn and blastp. > Thanks for any pointers. > In what species are you working? Sean From biopython at wardroper.org Sat Dec 23 03:58:32 2006 From: biopython at wardroper.org (Alan Wardroper) Date: Fri, 22 Dec 2006 19:58:32 -0800 Subject: [BioPython] GO term annotation from fasta? Message-ID: <458CA968.6020708@wardroper.org> >Alan Wardroper wrote: >> I have a large db of est clones and associated assemblies I'd like to >> (roughly) annotate using GO terms ... with biopython. >In what species are you working? Salmonids-- at this point mostly Salmo salar and Oncorhynchus mykiss, but also others down the line. 
-- -------------------------- Alan Wardroper alan at wardroper.org From ULNJUJERYDIX at spammotel.com Sun Dec 24 16:54:52 2006 From: ULNJUJERYDIX at spammotel.com (Kevin Lam) Date: Mon, 25 Dec 2006 00:54:52 +0800 Subject: [BioPython] howto generate TFBS images on sequences In-Reply-To: <5b6410e0612232148ra526235tdae1a5fbcc2ea8b6@mail.gmail.com> References: <5b6410e0612232148ra526235tdae1a5fbcc2ea8b6@mail.gmail.com> Message-ID: <5b6410e0612240854q5e1054f0waf18015a5336ee1e@mail.gmail.com> Hi I am looking for a solution to draw boxes/arrows on a scale to represent TFBS on a promoter sequence with numbers and scale. Is there anything in biopython that can do this? I can't find any info in the doc thanks kevin From sdavis2 at mail.nih.gov Mon Dec 25 16:37:18 2006 From: sdavis2 at mail.nih.gov (Davis, Sean (NIH/NCI) [E]) Date: Mon, 25 Dec 2006 11:37:18 -0500 Subject: [BioPython] GO term annotation from fasta? References: <458CA968.6020708@wardroper.org> Message-ID: <014DBF86B19310419F0DF8910FC56457240D08@nihcesmlbx10.nih.gov> -----Original Message----- From: Alan Wardroper [mailto:biopython at wardroper.org] Sent: Fri 12/22/2006 10:58 PM To: biopython at lists.open-bio.org Subject: Re: [BioPython] GO term annotation from fasta? >Alan Wardroper wrote: >> I have a large db of est clones and associated assemblies I'd like to >> (roughly) annotate using GO terms ... with biopython. >In what species are you working? Salmonids-- at this point mostly Salmo salar and Oncorhynchus mykiss, but also others down the line. ------------------------- Your original solution sounds reasonable, but you might want to think about some sort of translated blast, as I don't suppose your organism(s) are very similar to any of the well-annotated genomes. Also, blasting against genbank won't get you much, as most of the genbank accessions are not annotated in GO.
You might consider blasting against the sequences in GO (which are proteins, I believe, and are available in the GO database), so you could map directly from your sequences to best blast hit to GO. Sean _______________________________________________ BioPython mailing list - BioPython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From sdavis2 at mail.nih.gov Sat Dec 30 20:31:46 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Sat, 30 Dec 2006 15:31:46 -0500 Subject: [BioPython] MAGE-OM classes in python Message-ID: <4596CCB2.7000006@mail.nih.gov> I have looked around for someone using the MAGE-OM in python, but it appears that only java, perl, and c++ versions exist. Does anyone know of something in python? Any interest in developing a set? Sean