From tiagoantao at gmail.com  Mon Oct  3 18:12:18 2011
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 3 Oct 2011 23:12:18 +0100
Subject: [Biopython] VCF parser
Message-ID:

Hi,

I wonder if there is a VCF parser in either Python or Java? Either I am
being dumb at searching (probably) or nothing exists?

Thanks,
Tiago

-- 
"If you want to get laid, go to college. If you want an education, go
to the library." - Frank Zappa

From bala.biophysics at gmail.com  Tue Oct  4 04:05:36 2011
From: bala.biophysics at gmail.com (Bala subramanian)
Date: Tue, 4 Oct 2011 10:05:36 +0200
Subject: [Biopython] changing record attributes while iterating
Message-ID:

Friends,
I have a fasta file. I need to modify the record id by adding a suffix to
it. So i used SeqRecord (the code attached below). It is working fine but i
would like to know if there is any simple way to do that. ie. if i can
change the record attributes while iterating through the fasta with
SeqIO.parse itself. I tried something like following but i couldnt get what
i wanted.

new_list=[]
for record in SeqIO.parse(open(argv[1], "rU"), "fasta"):
    record.id=record.id + '_suffix'
    new_list.append(record)

Hence i used SeqRecord to do the modification
----------------------------------------------------------------------------------------------------
#!/usr/bin/env python
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
from sys import argv

new_list=[]
for record in SeqIO.parse(open(argv[1], "rU"), "fasta"):
    seq=str(record.seq)
    newrec=SeqRecord(Seq(seq),id=record.id+"_suffix",name='',description='')
    new_list.append(newrec)

output_handle = open(raw_input('Enter the output file:'), 'w')
SeqIO.write(new_list, output_handle, "fasta")
output_handle.close()

From p.j.a.cock at googlemail.com  Tue Oct  4 04:24:08 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 4 Oct 2011 09:24:08 +0100
Subject: [Biopython] changing record attributes while iterating
In-Reply-To:
References:
Message-ID:

On Tue, Oct 4, 2011 at 9:05 AM, Bala subramanian wrote:
> Friends,
> I have a fasta file. I need to modify the record id by adding a suffix to
> it. So i used SeqRecord (the code attached below). It is working fine but i
> would like to know if there is any simple way to do that. ie. if i can
> change the record attributes while iterating through the fasta with
> SeqIO.parse itself. I tried something like following but i couldnt get what
> i wanted.
>
> new_list=[]
> for record in SeqIO.parse(open(argv[1], "rU"), "fasta"):
>     record.id=record.id + '_suffix'
>     new_list.append(record)

The above looks fine, although depending on the rest of your script a
big list might be a bad idea (too much memory) and an iterator based
approach may be preferable. If as in the rest of your example you just
need to do this for output, perhaps:

#!/usr/bin/env python
from Bio import SeqIO
from sys import argv

def rename(record):
    """Modified record in place AND returns it."""
    record.id += '_suffix'
    return record

#This is a generator expression:
records = (rename(r) for r in SeqIO.parse(argv[1], "fasta"))

output_filename = raw_input('Enter the output file:')
SeqIO.write(records, output_filename, "fasta")

The alternative you showed was wasteful, creating lots of new objects
to no benefit.
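
One detail worth checking with either version: when Bio.SeqIO parses a
FASTA file, record.description keeps the whole original title line (old
id included), and on output the FASTA writer combines record.id with
record.description. So after renaming only the id, the new header can
end up as ">seq1_suffix seq1 original description", with the old id
repeated. A small variant of rename() that keeps the two in sync (just
a sketch, reusing the '_suffix' example above):

def rename(record):
    """Add a suffix to the id and keep the description consistent."""
    old_id = record.id
    record.id = old_id + '_suffix'
    # If the description starts with the old id, strip it so the
    # FASTA writer does not repeat it after the new id:
    if record.description.startswith(old_id):
        record.description = record.description[len(old_id):].lstrip()
    return record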
Peter

From nanatrapnest at hotmail.it  Wed Oct  5 11:07:44 2011
From: nanatrapnest at hotmail.it (Nana Trapnest)
Date: Wed, 5 Oct 2011 15:07:44 +0000
Subject: [Biopython] StructureBuilder
Message-ID:

Hello,
is it possible with structure builder copy all a protein and change atoms
coord??? How can I do this??
Thanks to all of you!
Stefania

From anaryin at gmail.com  Wed Oct  5 12:02:30 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Wed, 5 Oct 2011 18:02:30 +0200
Subject: [Biopython] StructureBuilder
In-Reply-To:
References:
Message-ID:

Hello Stefania,

It should be possible to copy the entire protein yes, but I would rather
use deepcopy to create a fully new Structure object and manipulate that
one. Something along the lines of:

import copy

[ ... Parse your structure to s...]

s_copy = copy.deepcopy(s)
for atom in s_copy.get_atoms():
    *here use either atom.transform or just modify atom.coord*

Cheers,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao


2011/10/5 Nana Trapnest

>
> Hello,
> is it possible with structure builder copy all a protein and change atoms
> coord??? How can I do this??
> Thanks to all of you!
> Stefania
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From dilara.ally at gmail.com  Wed Oct  5 19:21:29 2011
From: dilara.ally at gmail.com (Dilara Ally)
Date: Wed, 05 Oct 2011 16:21:29 -0700
Subject: [Biopython] error with entrez id code
Message-ID: <4E8CE679.5050107@gmail.com>

Hi All

I've written a program to identify Entrez gene ids from a blastall that I
performed. The code is as follows:

from Bio import SeqIO
from Bio import Entrez
import os
import os.path
import re
import csv

dirname1="/Users/dally/Desktop/BlastFiles/annotate_me/"
dirname2="/Users/dally/Desktop/BlastFiles/annotated/"
allfiles=os.listdir(dirname1)
fanddir=[os.path.join(dirname1,fname) for fname in allfiles]

OutFileName="Contig_annotation.csv"
c=csv.writer(open(os.path.join(dirname2,OutFileName),"wb"))

for f in fanddir:
    print f
    InFile=open(f,'rU')
    LineNumber=0
    for Line in InFile:
        print LineNumber#, ':', Line
        ElementList=Line.split('\t')
        geneid=ElementList[1]
        #print geneid
        Sections=geneid.split('|')
        NewID=Sections[3]
        from Bio import Entrez
        from Bio import SeqFeature
        Entrez.email = "dally at projects.sdsu.edu"
        handle=Entrez.efetch(db="nucleotide", id=NewID,rettype="gb") # rettype="gb" is GenBank format or XML format retmode="xml"
        record=SeqIO.read(handle,"genbank")
        handle.close()
        #print record.id
        lineage=record.annotations["taxonomy"]
        c.writerow([ElementList[0],ElementList[1],ElementList[2],ElementList[3],ElementList[4],ElementList[5],ElementList[6],ElementList[7],ElementList[8],
                    ElementList[9],ElementList[10], NewID, record.id, record.description,
                    record.annotations["source"], lineage[0], lineage[1],lineage[2],
                    record.annotations["keywords"], ])
        LineNumber=LineNumber+1
    InFile.close()

The gene identifier looks like this: gi|2252639|gb|AC002292.1|AC002292.
But I'm only interested in the fourth component (AC002292.1). It runs
through a file with approximately 8000-10000 identifiers and then extracts
information from the associated genbank file.
The code seemed to run fine on my first file for the first 1287 lines but then I got this error > raceback (most recent call last): > File "Ally_EntrezID_Search_Final_Script.py", line 38, in > record=SeqIO.read(handle,"genbank") > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", > line 604, in read > first = iterator.next() > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", > line 532, in parse > for r in i: > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 440, in parse_records > record = self.parse(handle, do_features) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 423, in parse > if self.feed(handle, consumer, do_features): > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 400, in feed > misc_lines, sequence_string = self.parse_footer() > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 921, in parse_footer > line = self.handle.readline() > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", > line 447, in readline > data = self._sock.recv(self._rbufsize) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", > line 533, in read > return self._read_chunked(amt) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", > line 586, in _read_chunked > value.append(self._safe_read(amt)) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", > line 637, in _safe_read > raise IncompleteRead(''.join(s), amt) > httplib.IncompleteRead: IncompleteRead(707 bytes read, 3147 more expected) I'm new to python and biopython programming. So any advice would be extremely appreciated. Thanks. Dilara From p.j.a.cock at googlemail.com Thu Oct 6 03:43:49 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Oct 2011 08:43:49 +0100 Subject: [Biopython] error with entrez id code In-Reply-To: <4E8CE679.5050107@gmail.com> References: <4E8CE679.5050107@gmail.com> Message-ID: On Thursday, October 6, 2011, Dilara Ally wrote: > Hi All > > I've written a program to identify Entrez gene ids from a blastall that I performed. The code is as follows: > > from Bio import SeqIO > from Bio import Entrez > ... 
> > The code seemed to run fine on my first file for the first 1287 lines but then I got this error > >> raceback (most recent call last): >> File "Ally_EntrezID_Search_Final_Script.py", line 38, in >> record=SeqIO.read(handle,"genbank") >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 604, in read >> first = iterator.next() >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 532, in parse >> for r in i: >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 440, in parse_records >> record = self.parse(handle, do_features) >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 423, in parse >> if self.feed(handle, consumer, do_features): >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 400, in feed >> misc_lines, sequence_string = self.parse_footer() >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 921, in parse_footer >> line = self.handle.readline() >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 447, in readline >> data = self._sock.recv(self._rbufsize) >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 533, in read >> return self._read_chunked(amt) >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 586, in _read_chunked >> value.append(self._safe_read(amt)) >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 637, in _safe_read >> raise IncompleteRead(''.join(s), amt) >> httplib.IncompleteRead: IncompleteRead(707 bytes read, 3147 more expected) > > I'm new to python and biopython programming. So any advice would be extremely appreciated. Is it always the same record that breaks? If so, what is the ID so we can try it out. If not, then it looks like a random network error, maybe you can stick a try/except in to refetch the data? Peter From animesh.agrawal at anu.edu.au Thu Oct 6 06:25:08 2011 From: animesh.agrawal at anu.edu.au (Animesh Agrawal) Date: Thu, 06 Oct 2011 21:25:08 +1100 Subject: [Biopython] Using bioPython and bioSQL In-Reply-To: <7770fe573faa2.4e8d81ae@anu.edu.au> References: <76b0a56038ee5.4e8d8031@anu.edu.au> <7660b9533ee66.4e8d806f@anu.edu.au> <75a084db398e1.4e8d80ac@anu.edu.au> <7660d30c39ced.4e8d80ea@anu.edu.au> <77109413397e8.4e8d8127@anu.edu.au> <7660dda43f476.4e8d8165@anu.edu.au> <77c0dae63f99c.4e8d81a2@anu.edu.au> <7770fe573faa2.4e8d81ae@anu.edu.au> Message-ID: <7710edf23d45a.4e8e1cb4@anu.edu.au> Hi All,I am trying to develop a interface for a local sequence depository in my lab. Using biopython cookbook examples I have been able to populate the database. But to query the database I want to create an interface so all other members in my lab can access it. I have no experience in doing this kind of development. I need some advice on best way of doing it and if there are already developed modules in biopython which can help me in attaining my objectives. 
Cheers Animesh Animesh Agrawal PhD Scholar The John Curtin School of Medical Research Australian National University Canberra, Australia From p.j.a.cock at googlemail.com Thu Oct 6 06:39:57 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Oct 2011 11:39:57 +0100 Subject: [Biopython] Using bioPython and bioSQL In-Reply-To: <7710edf23d45a.4e8e1cb4@anu.edu.au> References: <76b0a56038ee5.4e8d8031@anu.edu.au> <7660b9533ee66.4e8d806f@anu.edu.au> <75a084db398e1.4e8d80ac@anu.edu.au> <7660d30c39ced.4e8d80ea@anu.edu.au> <77109413397e8.4e8d8127@anu.edu.au> <7660dda43f476.4e8d8165@anu.edu.au> <77c0dae63f99c.4e8d81a2@anu.edu.au> <7770fe573faa2.4e8d81ae@anu.edu.au> <7710edf23d45a.4e8e1cb4@anu.edu.au> Message-ID: On Thu, Oct 6, 2011 at 11:25 AM, Animesh Agrawal wrote: > Hi All,I am trying to develop a interface for a local sequence depository > in my lab. Using biopython cookbook examples I have been able to > populate the database. But to query the database I want to create an > interface so all other members in my lab can access it. I have no > experience in doing this kind of development. I need some advice > on best way of doing it and if there are already developed modules > in biopython which can help me in attaining my objectives. > Cheers > Animesh Hi Animesh, Do you mean some kind of web interface? Would you just need this to be read only? You can use GBrowse with BioSQL, but I believe CHADO is better supported as the schema. CHADO is also a better choice if you want users to be able to edit the annotation. http://gmod.org/wiki/Chado_-_Getting_Started Peter From sdavis2 at mail.nih.gov Thu Oct 6 06:51:20 2011 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 6 Oct 2011 06:51:20 -0400 Subject: [Biopython] Using bioPython and bioSQL In-Reply-To: <7710edf23d45a.4e8e1cb4@anu.edu.au> References: <76b0a56038ee5.4e8d8031@anu.edu.au> <7660b9533ee66.4e8d806f@anu.edu.au> <75a084db398e1.4e8d80ac@anu.edu.au> <7660d30c39ced.4e8d80ea@anu.edu.au> <77109413397e8.4e8d8127@anu.edu.au> <7660dda43f476.4e8d8165@anu.edu.au> <77c0dae63f99c.4e8d81a2@anu.edu.au> <7770fe573faa2.4e8d81ae@anu.edu.au> <7710edf23d45a.4e8e1cb4@anu.edu.au> Message-ID: Hi, Animesh. How do you want folks to query the database? Web? Command-line? Are the queries limited in scope or do you want to provide something fully general? Sean On Thu, Oct 6, 2011 at 6:25 AM, Animesh Agrawal wrote: > Hi All,I am trying to develop a interface for a local sequence depository in my lab. Using biopython cookbook examples I have been able to populate the database. But to query the database I want to create an interface so all other members in my lab can access it. I have no experience in doing this kind of development. I need some advice on best way of doing it and if there are already developed modules in biopython which can help me in attaining my objectives. > Cheers > Animesh > Animesh Agrawal > PhD Scholar > The John Curtin School of Medical Research > Australian National University > Canberra, Australia > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From elisa.sechi85 at hotmail.it Thu Oct 6 06:43:25 2011 From: elisa.sechi85 at hotmail.it (Elisa sechi) Date: Thu, 6 Oct 2011 12:43:25 +0200 Subject: [Biopython] help for overwrite a pdb file In-Reply-To: References: Message-ID: Hi! All ! I'm contacting you in order to ask help about Biopython. 
I'm using python, I have extract the atoms coordinates of a protein from a
pdb file and I have used quaternion in order to rotate the coordinates.
I have put its in a new matrix but now the problem is: how do I save the
cartesian coordinates in a pdb file??? Do I have to create a new structure
with the use of builder structure Class??
I ask you if there is a way to overwrite the new cartesian coordinates in
the old pdb file that i have used.
Please help me!!!
Thank you very much!
Elisa
bye

From anaryin at gmail.com  Thu Oct  6 07:01:28 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 6 Oct 2011 13:01:28 +0200
Subject: [Biopython] help for overwrite a pdb file
In-Reply-To:
References:
Message-ID:

Hello Elisa,

You should use PDBIO to generate a new structure file. If you have already
transformed the coordinates, it's pretty simple:

from Bio.PDB import PDBIO

io = PDBIO()
io.set_structure(your_structure)
io.save('new_structure.pdb')

Cheers,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao


2011/10/6 Elisa sechi

>
>
> Hi! All !
> I'm contacting you in order to ask help about Biopython.
> I'm using python, I have extract the atoms coordinates of a protein from a
> pdb file and I have used quaternion in order to rotate the coordinates.
> I have put its in a new matrix but now the problem is: how do I save the
> cartesian coordinates in a pdb file??? Do I have to create a new structure
> with the use of builder structure Class??
> I ask you if there is a way to overwrite the new cartesian coordinates in
> the old pdb file that i have used.
> Please help me!!!
> Thank you very much!
> Elisa
> bye
>
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From p.j.a.cock at googlemail.com  Thu Oct  6 07:02:57 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Oct 2011 12:02:57 +0100
Subject: [Biopython] help for overwrite a pdb file
In-Reply-To:
References:
Message-ID:

On Thu, Oct 6, 2011 at 11:43 AM, Elisa sechi wrote:
>
> Hi! All !
> I'm contacting you in order to ask help about Biopython.
> I'm using python, I have extract the atoms coordinates of a protein from a pdb file and I have used quaternion in order to rotate the coordinates.
> I have put its in a new matrix but now the problem is: how do I save the cartesian coordinates in a pdb file??? Do I have to create a new structure with the use of builder structure Class??
> I ask you if there is a way to overwrite the new cartesian coordinates in the old pdb file that i have used.
> Please help me!!!
> Thank you very much!
> Elisa
> bye

There's an example here which rotates models in a PDB file and saves
the output:
http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/

It is not using quaternions for the rotation, but otherwise it should
be helpful.
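
For the rotation case discussed above, a minimal sketch of the whole
round trip with Bio.PDB (parse, rotate every atom, write a new file with
PDBIO). The 3x3 matrix below is only an illustrative z-axis rotation
built with numpy; with a quaternion you would first convert it to a
rotation matrix and use that instead. File names are placeholders, and
note that Atom.transform() applies the matrix with the coordinates as a
row vector (coord . rot + tran):

import numpy
from Bio.PDB import PDBParser, PDBIO

structure = PDBParser().get_structure("prot", "old.pdb")

angle = numpy.pi / 2.0  # example: 90 degrees about the z axis
rotation = numpy.array([[numpy.cos(angle), -numpy.sin(angle), 0.0],
                        [numpy.sin(angle),  numpy.cos(angle), 0.0],
                        [0.0,               0.0,              1.0]])
translation = numpy.array([0.0, 0.0, 0.0], "f")

# Apply the transformation to every atom in the structure:
for atom in structure.get_atoms():
    atom.transform(rotation, translation)

io = PDBIO()
io.set_structure(structure)
io.save("rotated.pdb")  # writing back over old.pdb would also work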
Peter From animesh.agrawal at anu.edu.au Thu Oct 6 07:23:39 2011 From: animesh.agrawal at anu.edu.au (Animesh Agrawal) Date: Thu, 06 Oct 2011 22:23:39 +1100 Subject: [Biopython] Using bioPython and bioSQL In-Reply-To: <77109ef23fc49.4e8d8f9e@anu.edu.au> References: <76b0a56038ee5.4e8d8031@anu.edu.au> <7660b9533ee66.4e8d806f@anu.edu.au> <75a084db398e1.4e8d80ac@anu.edu.au> <7660d30c39ced.4e8d80ea@anu.edu.au> <77109413397e8.4e8d8127@anu.edu.au> <7660dda43f476.4e8d8165@anu.edu.au> <77c0dae63f99c.4e8d81a2@anu.edu.au> <7770fe573faa2.4e8d81ae@anu.edu.au> <7710edf23d45a.4e8e1cb4@anu.edu.au> <77c0838039ccd.4e8d8edc@anu.edu.au> <7660c1093accc.4e8d8f1a@anu.edu.au> <7710e8403ab11.4e8d8f58@anu.edu.au> <77b0ce493f67b.4e8d8f61@anu.edu.au> <77109ef23fc49.4e8d8f9e@anu.edu.au> Message-ID: <7710e50538fb7.4e8e2a6b@anu.edu.au> Hi Peter,Thanks a lot for your reply.Yes I want web interface and I need it to be read only. I'll check out GBrowse and CHADO. Cheers, Animesh On 10/06/11, Peter Cock wrote: > On Thu, Oct 6, 2011 at 11:25 AM, Animesh Agrawal > wrote: > > Hi All,I am trying to develop a interface for a local sequence depository > > in my lab. Using biopython cookbook examples I have been able to > > populate the database. But to query the database I want to create an > > interface so all other members in my lab can access it. I have no > > experience in doing this kind of development. I need some advice > > on best way of doing it and if there are already developed modules > > in biopython which can help me in attaining my objectives. > > Cheers > > Animesh > > Hi Animesh, > > Do you mean some kind of web interface? Would you just need > this to be read only? > > You can use GBrowse with BioSQL, but I believe CHADO is better > supported as the schema. CHADO is also a better choice if you > want users to be able to edit the annotation. > http://gmod.org/wiki/Chado_-_Getting_Started > > Peter > > From animesh.agrawal at anu.edu.au Thu Oct 6 07:27:51 2011 From: animesh.agrawal at anu.edu.au (Animesh Agrawal) Date: Thu, 06 Oct 2011 22:27:51 +1100 Subject: [Biopython] Using bioPython and bioSQL In-Reply-To: <7680a8613e5c9.4e8d9094@anu.edu.au> References: <76b0a56038ee5.4e8d8031@anu.edu.au> <7660b9533ee66.4e8d806f@anu.edu.au> <75a084db398e1.4e8d80ac@anu.edu.au> <7660d30c39ced.4e8d80ea@anu.edu.au> <77109413397e8.4e8d8127@anu.edu.au> <7660dda43f476.4e8d8165@anu.edu.au> <77c0dae63f99c.4e8d81a2@anu.edu.au> <7770fe573faa2.4e8d81ae@anu.edu.au> <7710edf23d45a.4e8e1cb4@anu.edu.au> <77b080583927a.4e8d9019@anu.edu.au> <76e0a1b23d252.4e8d9057@anu.edu.au> <7680a8613e5c9.4e8d9094@anu.edu.au> Message-ID: <7660a5e03929b.4e8e2b67@anu.edu.au> Hi Sean,I definitely want a web interface. Queries should be limited in scope. Cheers, Animesh On 10/06/11, Sean Davis wrote: > Hi, Animesh. > > How do you want folks to query the database?? Web?? Command-line?? Are > the queries limited in scope or do you want to provide something fully > general? > > Sean > > On Thu, Oct 6, 2011 at 6:25 AM, Animesh Agrawal > wrote: > > Hi All,I am trying to develop a interface for a local sequence depository in my lab. Using biopython cookbook examples I have been able to populate the database. But to query the database I want to create an interface so all other members in my lab can access it. I have no experience in doing this kind of development. I need some advice on best way of doing it and if there are already developed modules in biopython which can help me in attaining my objectives. 
> > Cheers
> > Animesh
> > Animesh Agrawal
> > PhD Scholar
> > The John Curtin School of Medical Research
> > Australian National University
> > Canberra, Australia
> > _______________________________________________
> > Biopython mailing list - Biopython at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython
> >
> >

From sdavis2 at mail.nih.gov  Thu Oct  6 07:50:07 2011
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 6 Oct 2011 07:50:07 -0400
Subject: [Biopython] Using bioPython and bioSQL
In-Reply-To: <7660a5e03929b.4e8e2b67@anu.edu.au>
References: <76b0a56038ee5.4e8d8031@anu.edu.au> <7660b9533ee66.4e8d806f@anu.edu.au> <75a084db398e1.4e8d80ac@anu.edu.au> <7660d30c39ced.4e8d80ea@anu.edu.au> <77109413397e8.4e8d8127@anu.edu.au> <7660dda43f476.4e8d8165@anu.edu.au> <77c0dae63f99c.4e8d81a2@anu.edu.au> <7770fe573faa2.4e8d81ae@anu.edu.au> <7710edf23d45a.4e8e1cb4@anu.edu.au> <77b080583927a.4e8d9019@anu.edu.au> <76e0a1b23d252.4e8d9057@anu.edu.au> <7680a8613e5c9.4e8d9094@anu.edu.au> <7660a5e03929b.4e8e2b67@anu.edu.au>
Message-ID:

Hi, Animesh.

Depending on the types of queries, building small CGI scripts or even a
small web application can be quite useful. Most recently, I have been
using the flask micro-framework ( http://flask.pocoo.org/ ) for building
such small applications. If you can figure out how to do the queries that
you want with biopython or SQL, then it isn't too hard to translate that
to a couple of web pages, one for gathering input from the user and a
second for delivering results.

Sean

On Thu, Oct 6, 2011 at 7:27 AM, Animesh Agrawal wrote:
> Hi Sean, I definitely want a web interface. Queries should be limited in scope.
> Cheers,
> Animesh
>
> On 10/06/11, Sean Davis wrote:
>> Hi, Animesh.
>>
>> How do you want folks to query the database? Web? Command-line? Are
>> the queries limited in scope or do you want to provide something fully
>> general?
>>
>> Sean
>>
>> On Thu, Oct 6, 2011 at 6:25 AM, Animesh Agrawal
>> wrote:
>> > Hi All, I am trying to develop an interface for a local sequence depository in my lab. Using biopython cookbook examples I have been able to populate the database. But to query the database I want to create an interface so all other members in my lab can access it. I have no experience in doing this kind of development. I need some advice on best way of doing it and if there are already developed modules in biopython which can help me in attaining my objectives.
>> > Cheers
>> > Animesh
>> > Animesh Agrawal
>> > PhD Scholar
>> > The John Curtin School of Medical Research
>> > Australian National University
>> > Canberra, Australia
>> > _______________________________________________
>> > Biopython mailing list - Biopython at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/biopython
>> >
>>
>>
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
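
As a concrete starting point for the two-page approach Sean describes,
here is a very small sketch using Flask together with Biopython's
BioSeqDatabase module. The driver, connection details and the
sub-database name ("my_seqs") are placeholders for whatever was used
when the sequences were loaded into BioSQL:

from flask import Flask, request
from BioSQL import BioSeqDatabase

app = Flask(__name__)

def lookup(accession):
    # Connection settings below are placeholders for your own setup.
    server = BioSeqDatabase.open_database(driver="MySQLdb", user="reader",
                                          passwd="secret", host="localhost",
                                          db="bioseqdb")
    db = server["my_seqs"]  # name of the sub-database used at load time
    return db.lookup(accession=accession)

@app.route("/")
def search_form():
    # Page one: gather input from the user.
    return ('<form action="/record"><input name="acc" />'
            '<input type="submit" value="Fetch" /></form>')

@app.route("/record")
def show_record():
    # Page two: deliver the result (read only).
    record = lookup(request.args["acc"])
    return "<pre>%s %s\n%s</pre>" % (record.id, record.description,
                                     str(record.seq))

if __name__ == "__main__":
    app.run()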
From tiagoantao at gmail.com  Thu Oct  6 16:14:56 2011
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 6 Oct 2011 21:14:56 +0100
Subject: [Biopython] UniprotXML dbReference parser
Message-ID:

Hi,

Do I understand wrongly or the UniprotXML parser for

simply ignores the "property type" information?
If so, is there any way to get access to the XML raw data
(so that I can grep it)?

Thanks a lot,
Tiago

-- 
"If you want to get laid, go to college. If you want an education, go
to the library."
- Frank Zappa

From p.j.a.cock at googlemail.com  Thu Oct  6 18:26:19 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Oct 2011 23:26:19 +0100
Subject: [Biopython] UniprotXML dbReference parser
In-Reply-To:
References:
Message-ID:

2011/10/6 Tiago Antão :
> Hi,
>
> Do I understand wrongly or the UniprotXML parser for
>
>
>
> simply ignores the "property type" information?

Probably... I think it emulates the very simple list of
db:acc strings produced by the GenBank parser etc,
but try dir(...) on it. Although PDB references look
to get part of their information dumped in the
record's annotations dictionary.

I guess we could return a list of DB reference objects
which happen to act like the old style string for back
compatibility.

> If so, is there any way to get access to the XML raw data
> (so that I can grep it)?

Are you asking for XML parsing library recommendations?
Or you could hack the SeqIO parser instead... i've CC'd
Andrea who wrote it in case he can add something
more practical.

Peter

From tiagoantao at gmail.com  Thu Oct  6 18:43:01 2011
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 6 Oct 2011 23:43:01 +0100
Subject: [Biopython] UniprotXML dbReference parser
In-Reply-To:
References:
Message-ID:

Hi,

2011/10/6 Peter Cock :
> Probably... I think it emulates the very simple list of
> db:acc strings produced by the GenBank parser etc,
> but try dir(...) on it. Although PDB references look
> to get part of their information dumped in the
> record's annotations dictionary.

The problem is that the Gene ID is inside (thus it never gets
returned). We get the protein ID only.

> Are you asking for XML parsing library recommendations?
> Or you could hack the SeqIO parser instead... i've CC'd
> Andrea who wrote it in case he can add something
> more practical.

I just used xml.parsers.expat. Not a problem for myself, but the fact
is that the uniprot xml parser does not return the whole information
that it is there.

-- 
"If you want to get laid, go to college. If you want an education, go
to the library."
- Frank Zappa

From p.j.a.cock at googlemail.com  Fri Oct  7 03:22:49 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 7 Oct 2011 08:22:49 +0100
Subject: [Biopython] changing record attributes while iterating
In-Reply-To:
References:
Message-ID:

On Friday, October 7, 2011, Michal wrote:
> Hello,
> Does your code with generator save the whole file in the
> memory or does it read each entry and save it immediately?
> Thank you in advance.

Using a generator expression like that only one SeqRecord is in
memory at a time. It goes through the input FASTA one record at a
time, renames it, saves it immediately.

Peter

P.S. list CC'd

From dilara.ally at gmail.com  Fri Oct  7 13:34:24 2011
From: dilara.ally at gmail.com (Dilara Ally)
Date: Fri, 07 Oct 2011 10:34:24 -0700
Subject: [Biopython] error with entrez id code
In-Reply-To:
References: <4E8CE679.5050107@gmail.com>
Message-ID: <4E8F3820.1030002@gmail.com>

> Is it always the same record that breaks? If so, what is the ID so we
> can try it out.
>
> If not, then it looks like a random network error, maybe you can stick
> a try/except in to refetch the data?

Hi Peter

Individually the identifier has no problem calling up the record, but the
problem seems to be in the loop. As a newbie, what is a try/except?

Thanks.
Dilara On 10/6/11 12:43 AM, Peter Cock wrote: > > > On Thursday, October 6, 2011, Dilara Ally > wrote: > > Hi All > > > > I've written a program to identify Entrez gene ids from a blastall > that I performed. The code is as follows: > > > > from Bio import SeqIO > > from Bio import Entrez > > ... > > > > The code seemed to run fine on my first file for the first 1287 > lines but then I got this error > > > >> raceback (most recent call last): > >> File "Ally_EntrezID_Search_Final_Script.py", line 38, in > >> record=SeqIO.read(handle,"genbank") > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", > line 604, in read > >> first = iterator.next() > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", > line 532, in parse > >> for r in i: > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 440, in parse_records > >> record = self.parse(handle, do_features) > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 423, in parse > >> if self.feed(handle, consumer, do_features): > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 400, in feed > >> misc_lines, sequence_string = self.parse_footer() > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 921, in parse_footer > >> line = self.handle.readline() > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", > line 447, in readline > >> data = self._sock.recv(self._rbufsize) > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", > line 533, in read > >> return self._read_chunked(amt) > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", > line 586, in _read_chunked > >> value.append(self._safe_read(amt)) > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", > line 637, in _safe_read > >> raise IncompleteRead(''.join(s), amt) > >> httplib.IncompleteRead: IncompleteRead(707 bytes read, 3147 more > expected) > > > > I'm new to python and biopython programming. So any advice would be > extremely appreciated. > > > Peter From p.j.a.cock at googlemail.com Sat Oct 8 10:10:12 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 8 Oct 2011 15:10:12 +0100 Subject: [Biopython] error with entrez id code In-Reply-To: <4E8F3820.1030002@gmail.com> References: <4E8CE679.5050107@gmail.com> <4E8F3820.1030002@gmail.com> Message-ID: On Fri, Oct 7, 2011 at 6:34 PM, Dilara Ally wrote: > Is it always the same record that breaks? If so, what is the ID so we can > try it out. > > If not, then it looks like a random network error, maybe you can stick a > try/except in to refetch the data? > > Hi Peter > > Individually the identifier has no problem calling up the record, but the > problem seems to be in the loop.? As a newbie, what is a try/except? > > Thanks. By try/except I mean use Python's error handling mechanism to spot when there is a network error. See: http://docs.python.org/tutorial/errors.html e.g. Something like this would give you a second chance. 
Note that exception httplib.IncompleteRead is a subclass of the more
general HTTPException, see:
http://docs.python.org/library/httplib.html

from httplib import HTTPException
try:
    handle=Entrez.efetch(db="nucleotide", id=NewID,rettype="gb") # rettype="gb" is GenBank format or XML format retmode="xml"
    record=SeqIO.read(handle,"genbank")
    handle.close()
except HTTPException, e:
    print "Network problem: %s" % e
    print "Second (and final) attempt..."
    handle=Entrez.efetch(db="nucleotide", id=NewID,rettype="gb") # rettype="gb" is GenBank format or XML format retmode="xml"
    record=SeqIO.read(handle,"genbank")
    handle.close()

If the second attempt fails, you'll get an exception like before.
There are more elegant ways to write that (with less repetition, and
making multiple retries easy), but I'm trying to keep this simple as
an introductory example.

Peter

From chaouki.amir at gmail.com  Sun Oct  9 15:37:42 2011
From: chaouki.amir at gmail.com (amir chaouki)
Date: Sun, 9 Oct 2011 20:37:42 +0100
Subject: [Biopython] clustal header
Message-ID:

Hi,
i want to to do a multiple sequence alignment with the clustalw method but i
keep getting this error:

", ".join(known_headers)))
ValueError: a is not a known CLUSTAL header: CLUSTAL, PROBCONS, MUSCLE

my sequence file contains this > as headers for every sequence name, so
what are the compatible headers?

-- 
*Amir Chaouki*

From p.j.a.cock at googlemail.com  Sun Oct  9 16:09:00 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 9 Oct 2011 21:09:00 +0100
Subject: [Biopython] clustal header
In-Reply-To:
References:
Message-ID:

On Sunday, October 9, 2011, amir chaouki wrote:
> Hi,
> i want to to do a multiple sequence alignment with the clustalw method but i
> keep getting this error: ", ".join(known_headers)))
> ValueError: a is not a known CLUSTAL header: CLUSTAL, PROBCONS, MUSCLE
>
> my sequence file contains this > as headers for every sequence name, so
> what are the compatible headers?

Hi Amir,

That error message can come from trying to parse a non-clustal file as
if it were a clustal file. Perhaps you tried to parse a fasta file?

If you showed the code that caused this message, it would be easier to
help you,

Peter

From sdavis2 at mail.nih.gov  Wed Oct 12 14:54:13 2011
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Wed, 12 Oct 2011 14:54:13 -0400
Subject: [Biopython] [OT][Job] Functional genomic analysis of cancer/RNAi screening
Message-ID:

Functional genomic analysis of cancer/RNAi screening
NATIONAL CANCER INSTITUTE, BETHESDA, MD

The laboratory of Dr. Natasha Caplen, within the Genetics Branch, CCR, NCI,
is seeking postdoctoral candidates for a project focused on functional
genomic analysis using RNAi screening approaches. We are looking for a
highly motivated candidate who has received their PhD within the last year
to contribute to our on-going studies applying RNAi based loss-of-function
approaches to probe cancer gene function. The successful candidate will be
expected to perform both bench and computational-based studies and will be
involved in projects requiring the development and analysis of large-scale
RNAi screening data focused on the biology of oncogenic transcription
factors. The candidate will be involved in the design and employment of
RNAi screens (up to genome-wide scale) and analysis of the data generated
through application of state of the art computational methodologies.
This large-scale RNAi screening data will also be assessed in the context of other relevant datasets such as next generation sequencing, epigenetic, gene expression and drug sensitivity datasets. The computational analyses will ultimately be used to systematically build hypotheses to identify key pathways and networks underlying the specifics of the cancer biology and the candidate will then be expected to experimentally test these hypotheses. Dr. Caplen?s laboratory conducts both independent and collaborative studies and the successful candidate will have the opportunity to interact with NCI and NIH investigators studying many different cancer biology questions using RNAi based technologies. Currently we are involved in RNAi studies relevant to the biology and treatment of several pediatric cancers, colorectal, breast and prostate cancer. For further information please see Dr. Caplen?s website at http://ccr.cancer.gov/staff/staff.asp?profileid=9035. Requirements: The candidate must have a Ph.D in biological sciences with additional training in computational biology or bioinformatics. Previous experience in molecular biology including mammalian cell culture and assessment of gene expression is required, as, too, is experience in programming skills in languages such as perl, python, R, java, or c++. As the position involves the need to discuss scientific data and strategy with members of the existing team and with collaborators, oral and written fluency in the English language is required. Applicants should email a cover letter describing research experience and interests, curriculum vitae, bibliography, and contact information for three references (including the current supervisor) to Dr. Natasha Caplen at ncaplen at mail.nih.gov. Please include ?PD2011? in the email subject line. From paul at tonair.de Thu Oct 13 06:26:54 2011 From: paul at tonair.de (paul at tonair.de) Date: Thu, 13 Oct 2011 12:26:54 +0200 Subject: [Biopython] Bio.PDB question - from PQR to an "extended" PDB Message-ID: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> dear biopython users, i'm trying to read in a pqr file with the Bio.PDB module. In a PQR file, the atom charge and atom radius are stored instead of the occupancy & B-factor. Apparently, the negative charge values make trouble while reading in. (1) Is there a way to tweak Bio.PDB module to read in a PQR file? More to the background of this task: I would like to keep the charge and the radius in order to output a PDB file with more than 80 lines. The pdb-like output looks like this: ATOM 1 C1 UNL _0001_000 9.643 1.777 18.433 1.700 0.000 BK____M000 The text "BK____M000" refers to a conformer of a side chain and is needed by a PoissonBoltzmann named mcce (multi-conformation continuum electrostatics). (2) Can Bio.PDB generate such an output file? Cheers & Thanks, Paul From p.j.a.cock at googlemail.com Thu Oct 13 06:40:14 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Oct 2011 11:40:14 +0100 Subject: [Biopython] Bio.PDB question - from PQR to an "extended" PDB In-Reply-To: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> References: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> Message-ID: On Thu, Oct 13, 2011 at 11:26 AM, wrote: > > dear biopython users, > > i'm trying to read in a pqr file with the > Bio.PDB module. In a PQR file, the atom charge and atom radius are > stored instead of the occupancy & B-factor. > Apparently, the negative > charge values make trouble while reading in. 
> > (1) Is there a way to > tweak Bio.PDB module to read in a PQR file? If a negative B-factor was the only issue, probably yes. > More to the background of > this task: I would like to keep the charge and the radius in order to > output a PDB file with more than 80 lines. You mean more than 80 columns? i.e. Longer than PDB norms? > The pdb-like output looks > like this: > ATOM 1 C1 UNL _0001_000 9.643 1.777 18.433 1.700 0.000 > BK____M000 > The text "BK____M000" refers to a conformer of a side chain > and is needed by a PoissonBoltzmann named mcce (multi-conformation > continuum electrostatics). > > (2) Can Bio.PDB generate such an output > file? Not yet ;) > Cheers & Thanks, > Paul It would help if you could share some sample data (URLs) and links to this PDB-like PQR file format's specification (assuming it has one). Regards, Peter From anaryin at gmail.com Thu Oct 13 06:43:06 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 13 Oct 2011 12:43:06 +0200 Subject: [Biopython] Bio.PDB question - from PQR to an "extended" PDB In-Reply-To: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> References: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> Message-ID: Hello Paul, Straight from Pymol :) Bio.PDB cannot read PQR files as is, but since the format is quite similar to the PDB it should be easy to convert. The first step is to know if you want to develop a converter too (you will need the forcefield atomic charges and radius for that) or just a "parser". Parsing is easy, it's a matter of adapting the current SMCRA objects and PDBParser. Converting requires much more and is probably superfluous given the PDB2PQR software. Some important information on the format: http://www.poissonboltzmann.org/file-formats/biomolecular-structurw/pqr I think the best course of action is to add a PQRParser class that has different residue properties than the regular PDB. For example, occupancy and bfactor are not used at all.. Let me know what you think, Cheers, Jo?o From paul at tonair.de Thu Oct 13 07:51:42 2011 From: paul at tonair.de (paul at tonair.de) Date: Thu, 13 Oct 2011 13:51:42 +0200 Subject: [Biopython] Bio.PDB question - from PQR to an "extended" PDB In-Reply-To: References: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> Message-ID: Dear all, a PQR functionality within biopython would be great! Regarding the output of extended PDB files I would like to write: There is no detailed description on such files: http://www.sci.ccny.cuny.edu/~mcce/doc/running_mcce2.php [1] see chapter 3.2.4: step2_out.pdb: input structure file of step 3 in mcce extended pdb format extended means: the conformer is added beyond the element located somewhere around column 80. Is there any workaround with the currect biopython release to read in PQR and dump out such an extended PDB file? Cheers & thanks, Paul On Thu, 13 Oct 2011 12:48:22 +0200, Mikael Trellet wrote: This PQRParser class would be a nice add to Bio.PDB indeed, and shouldn't take a very long time to develop. Could work on it with you Joao, if the need exists obviously. Regards, Mikael On Thu, Oct 13, 2011 at 12:43 PM, Jo?o Rodrigues wrote: Hello Paul, Straight from Pymol :) Bio.PDB cannot read PQR files as is, but since the format is quite similar to the PDB it should be easy to convert. The first step is to know if you want to develop a converter too (you will need the forcefield atomic charges and radius for that) or just a "parser". Parsing is easy, it's a matter of adapting the current SMCRA objects and PDBParser. 
Converting requires much more and is probably superfluous given the PDB2PQR software. Some important information on the format: http://www.poissonboltzmann.org/file-formats/biomolecular-structurw/pqr [3] I think the best course of action is to add a PQRParser class that has different residue properties than the regular PDB. For example, occupancy and bfactor are not used at all.. Let me know what you think, Cheers, Jo?o _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org [4] http://lists.open-bio.org/mailman/listinfo/biopython [5] -- Mikael TRELLET, Computational structural biology group, Utrecht University Bijvoet Center, The Netherlands Links: ------ [1] http://www.sci.ccny.cuny.edu/~mcce/doc/running_mcce2.php [2] mailto:anaryin at gmail.com [3] http://www.poissonboltzmann.org/file-formats/biomolecular-structurw/pqr [4] mailto:Biopython at lists.open-bio.org [5] http://lists.open-bio.org/mailman/listinfo/biopython From anaryin at gmail.com Thu Oct 13 08:27:54 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 13 Oct 2011 14:27:54 +0200 Subject: [Biopython] Bio.PDB question - from PQR to an "extended" PDB In-Reply-To: References: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> Message-ID: Dear Paul, You would have to do two things: 1. First, modify PDBParser so that it reads more characters in the occupancy and bfactor fields 2. Modify PDBIO so that it is able to output a field beyond the element OR just create your own function to print information of a residue and use it instead of PDBIO. How do you get the conformer information? From paul at tonair.de Fri Oct 14 08:00:04 2011 From: paul at tonair.de (paul at tonair.de) Date: Fri, 14 Oct 2011 14:00:04 +0200 Subject: [Biopython] ligand PDB files Message-ID: Dear all, I'm having trouble to read in the attached PDB file - this is my code: " from Bio.PDB import * parser=PDBParser() structure=parser.get_structure("PHA-L","./2w26_lig.pdb") for model in structure: for chain in model: for residue in chain: for atom in residue: print atom " which gives this error: " File "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/PDBParser.py", line 66, in get_structure self._parse(file.readlines()) File "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/PDBParser.py", line 89, in _parse self.trailer=self._parse_coordinates(coords_trailer) File "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/PDBParser.py", line 205, in _parse_coordinates fullname, serial_number, element) File "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/StructureBuilder.py", line 197, in init_atom fullname, serial_number, element) File "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/Atom.py", line 68, in __init__ assert not element or element == element.upper(), element AssertionError: Cl " Does this mean that the PDB parser only recognizes "amino acid-atoms", i.e. a chlorine does not work? 
Cheers & Thanks, Paul -------------- next part -------------- COMPND 2w26_LIG.pdb_0 AUTHOR GENERATED BY OPEN BABEL 2.3.0 ATOM 1 C1 RIV A 1 9.643 1.777 18.433 1.00 0.00 C ATOM 2 N1 RIV A 1 8.303 2.377 18.109 1.00 0.00 N ATOM 3 C2 RIV A 1 10.053 0.667 17.441 1.00 0.00 C ATOM 4 C3 RIV A 1 7.671 2.122 16.881 1.00 0.00 C ATOM 5 O1 RIV A 1 9.768 1.124 16.111 1.00 0.00 O ATOM 6 C4 RIV A 1 8.355 1.223 15.853 1.00 0.00 C ATOM 7 C5 RIV A 1 6.487 4.959 20.981 1.00 0.00 C ATOM 8 C6 RIV A 1 7.333 5.468 19.984 1.00 0.00 C ATOM 9 C7 RIV A 1 6.237 3.551 21.013 1.00 0.00 C ATOM 10 C8 RIV A 1 7.918 4.619 19.048 1.00 0.00 C ATOM 11 C9 RIV A 1 6.837 2.690 20.070 1.00 0.00 C ATOM 12 C10 RIV A 1 7.682 3.222 19.078 1.00 0.00 C ATOM 13 O2 RIV A 1 6.583 2.613 16.630 1.00 0.00 O ATOM 14 N2 RIV A 1 5.906 5.863 21.947 1.00 0.00 N ATOM 15 C11 RIV A 1 5.040 5.543 22.995 1.00 0.00 C ATOM 16 C12 RIV A 1 6.146 7.326 22.000 1.00 0.00 C ATOM 17 O3 RIV A 1 4.690 6.614 23.757 1.00 0.00 O ATOM 18 C13 RIV A 1 5.213 7.787 23.134 1.00 0.00 C ATOM 19 O4 RIV A 1 4.634 4.419 23.228 1.00 0.00 O ATOM 20 C14 RIV A 1 5.924 8.721 24.155 1.00 0.00 C ATOM 21 N3 RIV A 1 7.078 8.136 24.932 1.00 0.00 N ATOM 22 C15 RIV A 1 8.402 8.558 24.672 1.00 0.00 C ATOM 23 S1 RIV A 1 11.131 8.264 25.063 1.00 0.00 S ATOM 24 C16 RIV A 1 11.805 7.503 26.288 1.00 0.00 C ATOM 25 C17 RIV A 1 9.567 8.044 25.466 1.00 0.00 C ATOM 26 C18 RIV A 1 10.794 7.011 27.130 1.00 0.00 C ATOM 27 C19 RIV A 1 9.509 7.324 26.659 1.00 0.00 C ATOM 28 O5 RIV A 1 8.611 9.379 23.797 1.00 0.00 O ATOM 29 Cl1 RIV A 1 13.544 7.302 26.531 1.00 0.00 Cl ATOM 30 H RIV A 1 9.643 1.777 18.433 1.00 0.00 H ATOM 31 H RIV A 1 9.643 1.777 18.433 1.00 0.00 H ATOM 32 H RIV A 1 10.053 0.667 17.441 1.00 0.00 H ATOM 33 H RIV A 1 10.053 0.667 17.441 1.00 0.00 H ATOM 34 H RIV A 1 8.355 1.223 15.853 1.00 0.00 H ATOM 35 H RIV A 1 8.355 1.223 15.853 1.00 0.00 H ATOM 36 H RIV A 1 7.333 5.468 19.984 1.00 0.00 H ATOM 37 H RIV A 1 6.237 3.551 21.013 1.00 0.00 H ATOM 38 H RIV A 1 7.918 4.619 19.048 1.00 0.00 H ATOM 39 H RIV A 1 6.837 2.690 20.070 1.00 0.00 H ATOM 40 H RIV A 1 6.146 7.326 22.000 1.00 0.00 H ATOM 41 H RIV A 1 6.146 7.326 22.000 1.00 0.00 H ATOM 42 H RIV A 1 5.213 7.787 23.134 1.00 0.00 H ATOM 43 H RIV A 1 5.924 8.721 24.155 1.00 0.00 H ATOM 44 H RIV A 1 5.924 8.721 24.155 1.00 0.00 H ATOM 45 H RIV A 1 7.078 8.136 24.932 1.00 0.00 H ATOM 46 H RIV A 1 10.794 7.011 27.130 1.00 0.00 H ATOM 47 H RIV A 1 9.509 7.324 26.659 1.00 0.00 H CONECT 1 3 2 30 31 CONECT 1 CONECT 2 4 1 12 CONECT 3 5 1 32 33 CONECT 3 CONECT 4 6 13 2 CONECT 5 6 3 CONECT 6 5 4 34 35 CONECT 6 CONECT 7 8 9 14 CONECT 8 10 7 36 CONECT 9 11 7 37 CONECT 10 12 8 38 CONECT 11 12 9 39 CONECT 12 2 10 11 CONECT 13 4 CONECT 14 7 16 15 CONECT 15 14 19 17 CONECT 16 14 18 40 41 CONECT 16 CONECT 17 15 18 CONECT 18 16 17 20 42 CONECT 18 CONECT 19 15 CONECT 20 18 21 43 44 CONECT 20 CONECT 21 20 22 45 CONECT 22 28 21 25 CONECT 23 25 24 CONECT 24 23 29 26 CONECT 25 22 23 27 CONECT 26 24 27 46 CONECT 27 25 26 47 CONECT 28 22 CONECT 29 24 CONECT 30 1 CONECT 31 1 CONECT 32 3 CONECT 33 3 CONECT 34 6 CONECT 35 6 CONECT 36 8 CONECT 37 9 CONECT 38 10 CONECT 39 11 CONECT 40 16 CONECT 41 16 CONECT 42 18 CONECT 43 20 CONECT 44 20 CONECT 45 21 CONECT 46 26 CONECT 47 27 MASTER 0 0 0 0 0 0 0 0 47 0 47 0 END From robert.campbell at queensu.ca Fri Oct 14 09:04:22 2011 From: robert.campbell at queensu.ca (Robert Campbell) Date: Fri, 14 Oct 2011 09:04:22 -0400 Subject: [Biopython] ligand PDB files In-Reply-To: References: Message-ID: 
<20111014090422.639e9284@adelie.biochem.queensu.ca>

Dear Paul,

On Fri, 2011-10-14 14:00 EDT, paul at tonair.de wrote:

> Dear all,
> I'm having trouble to read in the attached PDB file - this
> is my code:

Your code is okay. The problem is in your PDB file:

> File
> "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/Atom.py", line
> 68, in __init__
> assert not element or element == element.upper(),
> element
> AssertionError: Cl
> "
> Does this mean that the PDB parser only
> recognizes "amino acid-atoms", i.e. a chlorine does not work?

The chlorine atoms should be "CL" not "Cl" in a proper PDB file.

Cheers,
Rob

-- 
Robert L. Campbell, Ph.D.
Senior Research Associate/Adjunct Assistant Professor
Dept. of Biomedical & Molecular Sciences, Botterell Hall Rm 644
Queen's University, Kingston, ON K7L 3N6  Canada
Tel: 613-533-6821
http://pldserver1.biochem.queensu.ca/~rlc

From paul at tonair.de  Fri Oct 14 09:51:47 2011
From: paul at tonair.de (paul at tonair.de)
Date: Fri, 14 Oct 2011 15:51:47 +0200
Subject: [Biopython] ligand PDB files
In-Reply-To: <20111014090422.639e9284@adelie.biochem.queensu.ca>
References: <20111014090422.639e9284@adelie.biochem.queensu.ca>
Message-ID: <751ac2c9e7bf1a3659f31849565d1122@mail.canobus.com>

Dear Rob,

thank you very much for your help, this fixed the error!!

Cheers,
Paul

>
> Your code is okay. The problem is in your PDB file:
>
>
>> File
>> "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/Atom.py", line
>> 68, in __init__
>> assert not element or element == element.upper(),
>> element
>> AssertionError: Cl
>> "
>> Does this mean that the PDB parser only
>> recognizes "amino acid-atoms", i.e. a chlorine does not work?
>
> The chlorine atoms should be "CL" not "Cl" in a proper PDB file.
>
> Cheers,
> Rob

From jordan.r.willis at Vanderbilt.Edu  Sat Oct 15 16:59:58 2011
From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R)
Date: Sat, 15 Oct 2011 15:59:58 -0500
Subject: [Biopython] Blast DB keeps crashing nodes
Message-ID: <66965B9E-2AD6-4E02-BB8E-2F11A820DCDF@Vanderbilt.Edu>

Hello Biopython,

I was wondering if anyone has worked extensively with the Blast Database
locally.

I am blasting millions of sequences using Biopython as my backend
framework. I am using a high throughput computer cluster to blast each
sequence. Rather than submit two million jobs, I have divided the fasta
files up into 50 or so.

The problem I am facing is a memory issue. I'm not sure, but I think that
the Database is cacheing itself and not clearing before the next sequence
is queried. In that regard, the next job calls upon the database again,
and so on...

The memory builds up until it finally crashes the node.
Has anyone dealt with this issue before?

Thanks,
Jordan

From dilara.ally at gmail.com  Sat Oct 15 17:55:21 2011
From: dilara.ally at gmail.com (Dilara Ally)
Date: Sat, 15 Oct 2011 14:55:21 -0700
Subject: [Biopython] Blast DB keeps crashing nodes
In-Reply-To: <66965B9E-2AD6-4E02-BB8E-2F11A820DCDF@Vanderbilt.Edu>
References: <66965B9E-2AD6-4E02-BB8E-2F11A820DCDF@Vanderbilt.Edu>
Message-ID: <4E9A0149.1000504@gmail.com>

How many hits per sequence have you requested to get back - the default on
the blastall is 250?

I did a blast search on ~600,000 contigs but I set up simultaneous jobs
across 34 nodes. I used only the top 20 hits. Each file had 1000 fasta
formatted sequences and each node was given ~12 files. But I still had to
do it in two parts to get all sequences blasted. I waited until the first
set finished to set up the second blast job. The job finished in 2 days.
Before I ran it on the cluster I tested a single file to see how long and
how much memory it took. The cluster I used had 34 computing nodes, with
16-48 cores and 16-64GB of memory.

Hope that helps.

On 10/15/11 1:59 PM, Willis, Jordan R wrote:
> Hello Biopython,
>
> I was wondering if anyone has worked extensively with the Blast Database locally.
>
> I am blasting millions of sequences using Biopython as my backend framework. I am using a high throughput computer cluster to blast each sequence. Rather than submit two million jobs, I have divided the fasta files up into 50 or so.
>
> The problem I am facing is a memory issue. I'm not sure, but I think that the Database is cacheing itself and not clearing before the next sequence is queried. In that regard, the next job calls upon the database again, and so on...
>
> The memory builds up until it finally crashes the node. Has anyone dealt with this issue before?
>
> Thanks,
> Jordan
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From mictadlo at gmail.com  Mon Oct 17 08:11:12 2011
From: mictadlo at gmail.com (Mic)
Date: Mon, 17 Oct 2011 22:11:12 +1000
Subject: [Biopython] SAM to BAM
Message-ID:

Hello,
Is there a way to convert SAM file to sorted BAM file and generate also BAI
file with pysam?
Thank you in advance.

From p.j.a.cock at googlemail.com  Mon Oct 17 09:06:58 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 17 Oct 2011 14:06:58 +0100
Subject: [Biopython] [Samtools-help] SAM to BAM
In-Reply-To:
References:
Message-ID:

On Mon, Oct 17, 2011 at 1:11 PM, Mic wrote:
> Hello,
> Is there a way to convert SAM file to sorted BAM file and generate also BAI
> file with pysam?
> Thank you in advance.

With samtools at the command line,

samtools view -b -S example.sam | samtools sort - example
samtools index example.bam

I know you can easily call samtools from pysam, not sure if you can do
the pipe trick to avoid extra steps:

samtools view -b -S example.sam > example_unsorted
samtools sort example_unsorted.bam example
rm example_unsorted.bam
samtools index example.bam

Peter
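
For doing the same thing from within pysam (so the intermediate steps
happen in Python rather than the shell), something along these lines
should work -- a sketch only, with placeholder file names, using pysam's
wrappers for the samtools sort and index commands:

import os
import pysam

# SAM -> BAM: copy the records across, reusing the SAM header.
samfile = pysam.Samfile("example.sam", "r")
bamfile = pysam.Samfile("example_unsorted.bam", "wb", template=samfile)
for read in samfile:
    bamfile.write(read)
bamfile.close()
samfile.close()

pysam.sort("example_unsorted.bam", "example")  # writes example.bam
pysam.index("example.bam")                     # writes example.bam.bai
os.remove("example_unsorted.bam")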
From jgrant at smith.edu  Mon Oct 17 09:47:38 2011
From: jgrant at smith.edu (Jessica Grant)
Date: Mon, 17 Oct 2011 09:47:38 -0400
Subject: [Biopython] pdb file question
Message-ID: <541079A3-3C7D-45FF-8717-B1C64C85735F@smith.edu>

Hello,

I am trying to write a script that reproduces the crystal structure of a
protein based on the information in the pdb file. I have gotten kind of
stuck using the SMTRY lines in remark 290. It doesn't seem to contain all
the information I need, at least the results I am getting don't look the
same as when I produce symmetry mates in pymol, for example. Has anyone any
experience with this? Thanks,

Jessica

From anaryin at gmail.com  Mon Oct 17 10:08:54 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Mon, 17 Oct 2011 16:08:54 +0200
Subject: [Biopython] pdb file question
In-Reply-To: <541079A3-3C7D-45FF-8717-B1C64C85735F@smith.edu>
References: <541079A3-3C7D-45FF-8717-B1C64C85735F@smith.edu>
Message-ID:

Hello Jessica,

Are you extracting the symmetry information with Biopython? If so, how are
you using it to generate the other symmetry "members"? Using
atom.transform?

Cheers,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao


2011/10/17 Jessica Grant

> Hello,
>
> I am trying to write a script that reproduces the crystal structure of a
> protein based on the information in the pdb file. I have gotten kind of
> stuck using the SMTRY lines in remark 290. It doesn't seem to contain all
> the information I need, at least the results I am getting don't look the
> same as when I produce symmetry mates in pymol, for example. Has anyone any
> experience with this? Thanks,
>
> Jessica
>
>
> ______________________________**_________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/**mailman/listinfo/biopython
>

From hahj87 at gmail.com  Mon Oct 17 11:03:10 2011
From: hahj87 at gmail.com (=?ISO-8859-1?Q?Joshua_Ismael_Haase_Hern=E1ndez?=)
Date: Mon, 17 Oct 2011 10:03:10 -0500
Subject: [Biopython] is IRC channel at freenode active?
Message-ID:

Hi there, I was around in the IRC channel and the only one there is
Chanserv. I was wondering if the channel has some use.

From mictadlo at gmail.com  Mon Oct 17 23:44:14 2011
From: mictadlo at gmail.com (Mic)
Date: Tue, 18 Oct 2011 13:44:14 +1000
Subject: [Biopython] Segmentation fault
Message-ID:

Hello,
I have tried to generate a subset BAM, but I get a 'Segmentation fault' with
the following code:

from Bio import SeqIO
import pysam
from optparse import OptionParser
import subprocess, os, sys
from multiprocessing import Pool
import functools
import argparse

def GetReferenceInfo(referenceFastaPath):
    referencenames = []
    referencelengths = []

    referenceFastaFile = open(referenceFastaPath)
    for record in SeqIO.parse(referenceFastaFile, "fasta"):
        referencenames.append(record.name)
        referencelengths.append(len(record.seq))
    referenceFastaFile.close()

    return (referencenames, referencelengths)

def GenerateSubsetBAM(bam_filename, ref_name):
    reads = []
    bam_fh = pysam.Samfile(bam_filename, "rb")
    for read in bam_fh.fetch(ref_name):
        reads.append(read)

    print ref_name + ' Done ' + str(len(reads))
    return (ref_name, reads)

def writeBAM(reads, ref_names, ref_lengths, output_BAM):
    #print ref_names
    #print ref_lengths
    #print output_BAM
    #with pysam.Samfile(output_BAM, "wb", referencenames = ref_names, referencelengths = ref_lengths) as bh:
    bh = pysam.Samfile(output_BAM, "wb", referencenames = ref_names, referencelengths = ref_lengths)

    print reads.keys()
    for ref_name in ref_names:
        print ref_name
        for read in reads[ref_name]:
            print read
            #bh.write(read)
        print ref_name + 'Done'

if __name__ == '__main__':
    parser = OptionParser("Usage: %prog -b BAMfile -f new_reference_fasta -o outputBAM")
    parser.add_option("-b", "--BAM", type="string", dest="inputBAMFilepath", help="Specify a BAM file")
    parser.add_option("-f", "--fasta", type="string", dest="fastaFilepath", help="Specify a reference fasta file.")
parser.add_option("-o", "--output", type="string", dest="outputBAMFilepath", help="Specify an output BAM file.") (opts, args) = parser.parse_args() if (opts.inputBAMFilepath is None): print ("\nSpecify a BAM file. eg. -b large.bam\n") parser.print_help() elif not(os.path.exists(opts.inputBAMFilepath)): print ("\nReference BAM file does not exists: " + opts.inputBAMFilepath +"\n") elif (opts.fastaFilepath is None): print ("\nSpecify a reference fasta file. eg. -f Subset.fasta\n") parser.print_help() elif not(os.path.exists(opts.fastaFilepath)): print ("\nReference fasta file does not exists: " + opts.fastaFilepath +"\n") elif os.path.exists(opts.outputBAMFilepath) and not(raw_input(opts.outputBAMFilepath + " exists. Overwrite? (Y/n): ")=='Y'): print ("\nOutput BAM exists. Please specify alternative output file. eg. -o Subset.bam\n") else: print "Read fasta ..." (ref_names, ref_lengths) = GetReferenceInfo(opts.fastaFilepath) print 'Done!' print "creating subset...." pool = Pool() GenerateSubsetBAM_wrapper = functools.partial(GenerateSubsetBAM, opts.inputBAMFilepath) reads = dict(pool.imap_unordered(GenerateSubsetBAM_wrapper, ref_names)) pool.close() print "Done!" print "Writting results to subset BAM file..." writeBAM(reads, ref_names, ref_lengths, opts.outputBAMFilepath) print "Done!" I run the code in the following way: python SubsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bamRead fasta ... Done! creating subset.... chr1 Done 1464 chr2 Done 1806 Done! Writting results to subset BAM file... ['chr2', 'chr1'] chr1 Segmentation fault Thank you in advance. From p.j.a.cock at googlemail.com Tue Oct 18 05:00:47 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 18 Oct 2011 10:00:47 +0100 Subject: [Biopython] [Samtools-help] Segmentation fault In-Reply-To: References: Message-ID: On Tue, Oct 18, 2011 at 4:44 AM, Mic wrote: > Hello, > I have tried to generate a subset BAM, but I get a 'Segmentation fault' with > the following code: > from Bio import SeqIO > import pysam > from optparse import OptionParser > import subprocess, os, sys > from multiprocessing import Pool > import functools > ... I tried this and it seemed to get stuck much earlier. Could you cut down the example a bit by removing the multiprocessing? Peter P.S. Also you can remove the unused "import argparse" line. From mictadlo at gmail.com Tue Oct 18 06:26:06 2011 From: mictadlo at gmail.com (Mic) Date: Tue, 18 Oct 2011 20:26:06 +1000 Subject: [Biopython] [Samtools-help] Segmentation fault In-Reply-To: References: Message-ID: Hello, Thank you for your email. 
I updated the code and find out that print reads['chr1'] #works fine but print reads['chr1'][0] #caused Segmentation fault Please find below the updated code: from Bio import SeqIO import pysam from optparse import OptionParser import subprocess, os, sys from multiprocessing import Pool import functools def GetReferenceInfo(referenceFastaPath): referencenames = [] referencelengths = [] referenceFastaFile = open(referenceFastaPath) for record in SeqIO.parse(referenceFastaFile, "fasta"): referencenames.append(record.name) referencelengths.append(len(record.seq)) referenceFastaFile.close() return (referencenames, referencelengths) def GenerateSubsetBAM(bam_filename, ref_name): reads = [] bam_fh = pysam.Samfile(bam_filename, "rb") for read in bam_fh.fetch(ref_name): reads.append(read) print ref_name + ' Done ' + str(len(reads)) return (ref_name, reads) if __name__ == '__main__': parser = OptionParser("Usage: %prog -b BAMfile -f new_reference_fasta -o outputBAM") parser.add_option("-b", "--BAM", type="string", dest="inputBAMFilepath", help="Specify a BAM file") parser.add_option("-f", "--fasta", type="string", dest="fastaFilepath", help="Specify a reference fasta file.") parser.add_option("-o", "--output", type="string", dest="outputBAMFilepath", help="Specify an output BAM file.") (opts, args) = parser.parse_args() if (opts.inputBAMFilepath is None): print ("\nSpecify a BAM file. eg. -b large.bam\n") parser.print_help() elif not(os.path.exists(opts.inputBAMFilepath)): print ("\nReference BAM file does not exists: " + opts.inputBAMFilepath +"\n") elif (opts.fastaFilepath is None): print ("\nSpecify a reference fasta file. eg. -f Subset.fasta\n") parser.print_help() elif not(os.path.exists(opts.fastaFilepath)): print ("\nReference fasta file does not exists: " + opts.fastaFilepath +"\n") elif os.path.exists(opts.outputBAMFilepath) and not(raw_input(opts.outputBAMFilepath + " exists. Overwrite? (Y/n): ")=='Y'): print ("\nOutput BAM exists. Please specify alternative output file. eg. -o Subset.bam\n") else: print "Read fasta ..." (ref_names, ref_lengths) = GetReferenceInfo(opts.fastaFilepath) print 'Done!' print "creating subset...." pool = Pool() GenerateSubsetBAM_wrapper = functools.partial(GenerateSubsetBAM, opts.inputBAMFilepath) reads = dict(pool.imap_unordered(GenerateSubsetBAM_wrapper, ref_names)) pool.close() print "Done!" print reads['chr1'] #works fine print "xxxxx" print reads['chr1'][0] #caused Segmentation fault I run the code with the pysam-0.5 examples (pysam-0.5/tests) in the following way: python SubsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam Read fasta ... Done! creating subset.... chr1 Done 1464 chr2 Done 1806 Done! [, ..., ] xxxxx Segmentation fault Thank you in advance. On Tue, Oct 18, 2011 at 7:00 PM, Peter Cock wrote: > On Tue, Oct 18, 2011 at 4:44 AM, Mic wrote: > > Hello, > > I have tried to generate a subset BAM, but I get a 'Segmentation fault' > with > > the following code: > > from Bio import SeqIO > > import pysam > > from optparse import OptionParser > > import subprocess, os, sys > > from multiprocessing import Pool > > import functools > > ... > > I tried this and it seemed to get stuck much earlier. Could you > cut down the example a bit by removing the multiprocessing? > > Peter > > P.S. Also you can remove the unused "import argparse" line. 
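A possible workaround (just a sketch, and only a guess that the crash comes from the pysam AlignedRead objects being pickled between the pool workers and the parent process) would be to have each worker return plain strings instead of the pysam objects, e.g.:

import functools
from multiprocessing import Pool
import pysam

def GenerateSubsetSAM(bam_filename, ref_name):
    # Return SAM-style text lines rather than AlignedRead objects;
    # plain strings survive pickling between processes.
    bam_fh = pysam.Samfile(bam_filename, "rb")
    reads = [str(read) for read in bam_fh.fetch(ref_name)]
    bam_fh.close()
    return (ref_name, reads)

if __name__ == '__main__':
    pool = Pool()
    worker = functools.partial(GenerateSubsetSAM, "ex1.bam")
    reads = dict(pool.imap_unordered(worker, ["chr1", "chr2"]))
    pool.close()
    print reads['chr1'][0]  # now just a string

The obvious cost is that the strings would then have to be written back out as SAM text (or the BAM writing moved into the workers) rather than handed to Samfile.write().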
> From mmokrejs at fold.natur.cuni.cz Tue Oct 18 07:44:54 2011 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Tue, 18 Oct 2011 13:44:54 +0200 Subject: [Biopython] [Samtools-help] Segmentation fault In-Reply-To: References: Message-ID: <4E9D66B6.70904@fold.natur.cuni.cz> Before running your python code, do (under bash): $ ulimit -c unlimited $ python mypython.py $ file core $ gdb /usr/bin/python ./core gdb> where gdb> bt full gdb> quit $ Martin Mic wrote: > Hello, > Thank you for your email. I updated the code and find out that > print reads['chr1'] #works fine > but > print reads['chr1'][0] #caused Segmentation fault > > Please find below the updated code: > > from Bio import SeqIO > import pysam > from optparse import OptionParser > import subprocess, os, sys > from multiprocessing import Pool > import functools > > > def GetReferenceInfo(referenceFastaPath): > referencenames = [] > referencelengths = [] > referenceFastaFile = open(referenceFastaPath) > for record in SeqIO.parse(referenceFastaFile, "fasta"): > referencenames.append(record.name) > referencelengths.append(len(record.seq)) > referenceFastaFile.close() > return (referencenames, referencelengths) > > > def GenerateSubsetBAM(bam_filename, ref_name): > reads = [] > bam_fh = pysam.Samfile(bam_filename, "rb") > > for read in bam_fh.fetch(ref_name): > reads.append(read) > > print ref_name + ' Done ' + str(len(reads)) > return (ref_name, reads) > > > if __name__ == '__main__': > parser = OptionParser("Usage: %prog -b BAMfile -f new_reference_fasta -o > outputBAM") > parser.add_option("-b", "--BAM", type="string", dest="inputBAMFilepath", > help="Specify a BAM file") > parser.add_option("-f", "--fasta", type="string", dest="fastaFilepath", > help="Specify a reference fasta file.") > parser.add_option("-o", "--output", type="string", > dest="outputBAMFilepath", help="Specify an output BAM file.") > > (opts, args) = parser.parse_args() > > if (opts.inputBAMFilepath is None): > print ("\nSpecify a BAM file. eg. -b large.bam\n") > parser.print_help() > elif not(os.path.exists(opts.inputBAMFilepath)): > print ("\nReference BAM file does not exists: " + opts.inputBAMFilepath > +"\n") > elif (opts.fastaFilepath is None): > print ("\nSpecify a reference fasta file. eg. -f Subset.fasta\n") > parser.print_help() > elif not(os.path.exists(opts.fastaFilepath)): > print ("\nReference fasta file does not exists: " + opts.fastaFilepath > +"\n") > elif os.path.exists(opts.outputBAMFilepath) and > not(raw_input(opts.outputBAMFilepath + " exists. Overwrite? (Y/n): ")=='Y'): > print ("\nOutput BAM exists. Please specify alternative output file. > eg. -o Subset.bam\n") > else: > print "Read fasta ..." > (ref_names, ref_lengths) = GetReferenceInfo(opts.fastaFilepath) > print 'Done!' > > print "creating subset...." > pool = Pool() > GenerateSubsetBAM_wrapper = functools.partial(GenerateSubsetBAM, > opts.inputBAMFilepath) > reads = dict(pool.imap_unordered(GenerateSubsetBAM_wrapper, ref_names)) > pool.close() > print "Done!" > > print reads['chr1'] #works fine > print "xxxxx" > > print reads['chr1'][0] #caused Segmentation fault > > I run the code with the pysam-0.5 examples (pysam-0.5/tests) in the > following way: > > python SubsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam > > Read fasta ... > Done! > creating subset.... > chr1 Done 1464 > chr2 Done 1806 > Done! > [, ..., > ] > xxxxx > Segmentation fault > > Thank you in advance. 
> > > On Tue, Oct 18, 2011 at 7:00 PM, Peter Cock wrote: > >> On Tue, Oct 18, 2011 at 4:44 AM, Mic wrote: >>> Hello, >>> I have tried to generate a subset BAM, but I get a 'Segmentation fault' >> with >>> the following code: >>> from Bio import SeqIO >>> import pysam >>> from optparse import OptionParser >>> import subprocess, os, sys >>> from multiprocessing import Pool >>> import functools >>> ... >> >> I tried this and it seemed to get stuck much earlier. Could you >> cut down the example a bit by removing the multiprocessing? >> >> Peter >> >> P.S. Also you can remove the unused "import argparse" line. >> > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > From mictadlo at gmail.com Tue Oct 18 08:05:01 2011 From: mictadlo at gmail.com (Mic) Date: Tue, 18 Oct 2011 22:05:01 +1000 Subject: [Biopython] [Samtools-help] Segmentation fault In-Reply-To: <4E9D66B6.70904@fold.natur.cuni.cz> References: <4E9D66B6.70904@fold.natur.cuni.cz> Message-ID: Thank you for your tip, but I got an error: $ulimit -c unlimited $SubsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam Read fasta ... Done! creating subset.... chr1 Done 1464 EAS56_57:6:190:289:82 69 0 99 0 None 0 99 35 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<; [('MF', 192)] chr2 Done 1806 B7_591:8:4:841:340 73 1 0 99 [(0, 36)] -1 -1 36 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA <<<<<<<<;<<<<<<<<;<<<<<;<;:<<<<<<<;; [('MF', 18), ('Aq', 77), ('NM', 0), ('UQ', 0), ('H0', 1), ('H1', 0)] Done! xxxxx Segmentation fault (core dumped) $file core core: ERROR: cannot open `core' (No such file or directory) I also inserted "print reads[0]" in the method GenerateSubsetBAM: def GenerateSubsetBAM(bam_filename, ref_name): reads = [] bam_fh = pysam.Samfile(bam_filename, "rb") for read in bam_fh.fetch(ref_name): reads.append(read) print ref_name + ' Done ' + str(len(reads)) print reads[0] # works fine! return (ref_name, reads) and as output I got: python SubsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam Read fasta ... Done! creating subset.... chr1 Done 1464 EAS56_57:6:190:289:82 69 0 99 0 None 0 99 35 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<; [('MF', 192)] chr2 Done 1806 B7_591:8:4:841:340 73 1 0 99 [(0, 36)] -1 -1 36 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA <<<<<<<<;<<<<<<<<;<<<<<;<;:<<<<<<<;; [('MF', 18), ('Aq', 77), ('NM', 0), ('UQ', 0), ('H0', 1), ('H1', 0)] Done! xxxxx Segmentation fault Why does reads['chr1'][0] caused the Segmentation fault? Thank you in advance. On Tue, Oct 18, 2011 at 9:44 PM, Martin Mokrejs wrote: > Before running your python code, do (under bash): > $ ulimit -c unlimited > $ python mypython.py > $ file core > $ gdb /usr/bin/python ./core > gdb> where > gdb> bt full > gdb> quit > $ > > Martin > > Mic wrote: > > Hello, > > Thank you for your email. 
I updated the code and find out that > > print reads['chr1'] #works fine > > but > > print reads['chr1'][0] #caused Segmentation fault > > > > Please find below the updated code: > > > > from Bio import SeqIO > > import pysam > > from optparse import OptionParser > > import subprocess, os, sys > > from multiprocessing import Pool > > import functools > > > > > > def GetReferenceInfo(referenceFastaPath): > > referencenames = [] > > referencelengths = [] > > referenceFastaFile = open(referenceFastaPath) > > for record in SeqIO.parse(referenceFastaFile, "fasta"): > > referencenames.append(record.name) > > referencelengths.append(len(record.seq)) > > referenceFastaFile.close() > > return (referencenames, referencelengths) > > > > > > def GenerateSubsetBAM(bam_filename, ref_name): > > reads = [] > > bam_fh = pysam.Samfile(bam_filename, "rb") > > > > for read in bam_fh.fetch(ref_name): > > reads.append(read) > > > > print ref_name + ' Done ' + str(len(reads)) > > return (ref_name, reads) > > > > > > if __name__ == '__main__': > > parser = OptionParser("Usage: %prog -b BAMfile -f new_reference_fasta > -o > > outputBAM") > > parser.add_option("-b", "--BAM", type="string", > dest="inputBAMFilepath", > > help="Specify a BAM file") > > parser.add_option("-f", "--fasta", type="string", dest="fastaFilepath", > > help="Specify a reference fasta file.") > > parser.add_option("-o", "--output", type="string", > > dest="outputBAMFilepath", help="Specify an output BAM file.") > > > > (opts, args) = parser.parse_args() > > > > if (opts.inputBAMFilepath is None): > > print ("\nSpecify a BAM file. eg. -b large.bam\n") > > parser.print_help() > > elif not(os.path.exists(opts.inputBAMFilepath)): > > print ("\nReference BAM file does not exists: " + > opts.inputBAMFilepath > > +"\n") > > elif (opts.fastaFilepath is None): > > print ("\nSpecify a reference fasta file. eg. -f Subset.fasta\n") > > parser.print_help() > > elif not(os.path.exists(opts.fastaFilepath)): > > print ("\nReference fasta file does not exists: " + > opts.fastaFilepath > > +"\n") > > elif os.path.exists(opts.outputBAMFilepath) and > > not(raw_input(opts.outputBAMFilepath + " exists. Overwrite? (Y/n): > ")=='Y'): > > print ("\nOutput BAM exists. Please specify alternative output file. > > eg. -o Subset.bam\n") > > else: > > print "Read fasta ..." > > (ref_names, ref_lengths) = GetReferenceInfo(opts.fastaFilepath) > > print 'Done!' > > > > print "creating subset...." > > pool = Pool() > > GenerateSubsetBAM_wrapper = functools.partial(GenerateSubsetBAM, > > opts.inputBAMFilepath) > > reads = dict(pool.imap_unordered(GenerateSubsetBAM_wrapper, > ref_names)) > > pool.close() > > print "Done!" > > > > print reads['chr1'] #works fine > > print "xxxxx" > > > > print reads['chr1'][0] #caused Segmentation fault > > > > I run the code with the pysam-0.5 examples (pysam-0.5/tests) in the > > following way: > > > > python SubsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam > > > > Read fasta ... > > Done! > > creating subset.... > > chr1 Done 1464 > > chr2 Done 1806 > > Done! > > [, ..., > > ] > > xxxxx > > Segmentation fault > > > > Thank you in advance. 
> > > > > > On Tue, Oct 18, 2011 at 7:00 PM, Peter Cock >wrote: > > > >> On Tue, Oct 18, 2011 at 4:44 AM, Mic wrote: > >>> Hello, > >>> I have tried to generate a subset BAM, but I get a 'Segmentation fault' > >> with > >>> the following code: > >>> from Bio import SeqIO > >>> import pysam > >>> from optparse import OptionParser > >>> import subprocess, os, sys > >>> from multiprocessing import Pool > >>> import functools > >>> ... > >> > >> I tried this and it seemed to get stuck much earlier. Could you > >> cut down the example a bit by removing the multiprocessing? > >> > >> Peter > >> > >> P.S. Also you can remove the unused "import argparse" line. > >> > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > From p.j.a.cock at googlemail.com Tue Oct 18 08:58:47 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 18 Oct 2011 13:58:47 +0100 Subject: [Biopython] [Samtools-help] Segmentation fault In-Reply-To: References: Message-ID: On Tue, Oct 18, 2011 at 11:26 AM, Mic wrote: > Hello, > Thank you for your email. I updated the code and find out that > ? ? print reads['chr1'] ? ? #works fine > but > ? ? print reads['chr1'][0] ?#caused Segmentation fault > Please find below the updated code: > ... Your pool version doesn't run on my machine, something unhappy in multiprocessing gives: TypeError: type 'partial' takes at least one argument Here's a version using a single thread, which works fine for me. What does it do on your machines? Either way this should help in determining the segmentation fault. from Bio import SeqIO import pysam import subprocess, os, sys def GetReferenceInfo(referenceFastaPath): referencenames = [] referencelengths = [] referenceFastaFile = open(referenceFastaPath) for record in SeqIO.parse(referenceFastaFile, "fasta"): referencenames.append(record.name) referencelengths.append(len(record.seq)) referenceFastaFile.close() return (referencenames, referencelengths) def GenerateSubsetBAM(bam_filename, ref_name): reads = [] bam_fh = pysam.Samfile(bam_filename, "rb") for read in bam_fh.fetch(ref_name): reads.append(read) print ref_name + ' Done ' + str(len(reads)) return (ref_name, reads) bam_filename = "ex1.bam" fasta_filename = "ex1.fa" print "Read fasta ..." (ref_names, ref_lengths) = GetReferenceInfo(fasta_filename) print 'Done!' print "creating subset...." reads = dict() for ref in ref_names: reads[ref] = GenerateSubsetBAM(bam_filename, ref) print "Done!" print reads['chr1'] #works fine print "xxxxx" print reads['chr1'][0] #also fine -- Peter From nathaniel.echols at gmail.com Tue Oct 18 14:08:03 2011 From: nathaniel.echols at gmail.com (Nat Echols) Date: Tue, 18 Oct 2011 11:08:03 -0700 Subject: [Biopython] newbie question: sequence parsing Message-ID: Greetings-- We have started using BioPython in our (non-bioinformatics) application and are investigating the possibility of replacing our existing (custom-made) sequence parsers. Two quick questions: 1) Is there a sequence parser that works with just a simple string, without any header or additional metadata? If not, how could we write one that results in the same basic object as those in Bio.SeqIO? (The parsing is of course easy, I just want to have the API be consistent regardless of format.) 2) Is there a single function that will take a file (and/or string) of unknown format and try the different parsers until it finds one that works? 
We currently use several different formats (raw string, FASTA, PIR, and possibly others), and we try not to rely on the file extension alone to determine the type. We already have something that does this using our parsers, which could be refactored to use Bio.SeqIO instead, but if BioPython has something similar I'd rather use that. thanks, Nat From p.j.a.cock at googlemail.com Tue Oct 18 15:04:14 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 18 Oct 2011 20:04:14 +0100 Subject: [Biopython] newbie question: sequence parsing In-Reply-To: References: Message-ID: On Tue, Oct 18, 2011 at 7:08 PM, Nat Echols wrote: > Greetings-- > > We have started using BioPython in our (non-bioinformatics) application and > are investigating the possibility of replacing our existing (custom-made) > sequence parsers. ?Two quick questions: > > 1) Is there a sequence parser that works with just a simple string, without > any header or additional metadata? ?If not, how could we write one that > results in the same basic object as those in Bio.SeqIO? ?(The parsing is of > course easy, I just want to have the API be consistent regardless of > format.) Sounds like the "raw" format in EMBOSS, although there are two interpretations: one sequence per line, or one sequence for the whole file. Have a look at the FASTA parser in Bio/SeqIO/FastaIO.py as the most simple case. Essentially you create a SeqRecord object (which is covered in the Tutorial). > 2) Is there a single function that will take a file (and/or string) of > unknown format and try the different parsers until it finds one that works? > ?We currently use several different formats (raw string, FASTA, PIR, and > possibly others), and we try not to rely on the file extension alone to > determine the type. ?We already have something that does this using our > parsers, which could be refactored to use Bio.SeqIO instead, but if > BioPython has something similar I'd rather use that. No, we don't have such a function. There are many difficulties with format guessing - both from the file contents and even the filename. I usually cite the Zen of Python, Explicit is Better Than Implicit. Peter From cjfields at illinois.edu Tue Oct 18 15:11:56 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Tue, 18 Oct 2011 19:11:56 +0000 Subject: [Biopython] newbie question: sequence parsing In-Reply-To: References: Message-ID: On Oct 18, 2011, at 2:04 PM, Peter Cock wrote: > On Tue, Oct 18, 2011 at 7:08 PM, Nat Echols wrote: >> ... >> 2) Is there a single function that will take a file (and/or string) of >> unknown format and try the different parsers until it finds one that works? >> We currently use several different formats (raw string, FASTA, PIR, and >> possibly others), and we try not to rely on the file extension alone to >> determine the type. We already have something that does this using our >> parsers, which could be refactored to use Bio.SeqIO instead, but if >> BioPython has something similar I'd rather use that. > > No, we don't have such a function. There are many difficulties > with format guessing - both from the file contents and even the > filename. I usually cite the Zen of Python, Explicit is Better Than > Implicit. > > Peter Some implicitness is fine, but speaking from experience (BioPerl's GuessSeqFormat) trying to guess the format from the dozens that litter the bioinformatics landscape is a nest of hornets no one wants to maintain. 
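For the first question, by the way, wrapping a plain string in a SeqRecord is only a couple of lines -- a rough sketch, with a made-up id and filename:

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def raw_to_record(handle):
    # Treat the whole file as one sequence with no header or metadata.
    data = "".join(line.strip() for line in handle)
    return SeqRecord(Seq(data), id="raw_seq", name="", description="")

record = raw_to_record(open("example.raw"))
print record.id, len(record)

That gives back the same kind of object Bio.SeqIO hands out, so downstream code does not need to care where it came from.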
chris From p.j.a.cock at googlemail.com Tue Oct 18 15:31:06 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 18 Oct 2011 20:31:06 +0100 Subject: [Biopython] newbie question: sequence parsing In-Reply-To: References: Message-ID: On Tue, Oct 18, 2011 at 8:11 PM, Fields, Christopher J wrote: > On Oct 18, 2011, at 2:04 PM, Peter Cock wrote: > >> On Tue, Oct 18, 2011 at 7:08 PM, Nat Echols wrote: >>> ... >>> 2) Is there a single function that will take a file (and/or string) of >>> unknown format and try the different parsers until it finds one that works? >>> ?We currently use several different formats (raw string, FASTA, PIR, and >>> possibly others), and we try not to rely on the file extension alone to >>> determine the type. ?We already have something that does this using our >>> parsers, which could be refactored to use Bio.SeqIO instead, but if >>> BioPython has something similar I'd rather use that. >> >> No, we don't have such a function. There are many difficulties >> with format guessing - both from the file contents and even the >> filename. I usually cite the Zen of Python, Explicit is Better Than >> Implicit. >> >> Peter > > Some implicitness is fine, but speaking from experience > (BioPerl's GuessSeqFormat) trying to guess the format > from the dozens that litter the bioinformatics landscape > is a nest of hornets no one wants to maintain. > > chris I think "nest of hornets" is a much more beautiful phrase than my dead pan "many difficulties". The practical reality is that while some file formats are easy (binary files with 4 byte "magic" identifiers), others are horrible, and the definitions shift over time, as new formats of variants are added. I really don't want to go there. Peter From nathaniel.echols at gmail.com Tue Oct 18 17:47:03 2011 From: nathaniel.echols at gmail.com (Nat Echols) Date: Tue, 18 Oct 2011 14:47:03 -0700 Subject: [Biopython] issues with NCBIXML Message-ID: Hi again, I'm puzzled by the behavior of the Blast XML parser. It appears to be picking up all of the alignments correctly, but the top-level Bio.Blast.Record.Blast object that it returns appears to be incompletely populated. Specifically, the attributes num_hits and num_sequences are set to None - but I have several dozen alignments. Am I missing the point of these attributes, or doing something wrong? It's not a huge issue (I can just count the alignments, I guess), but I'm a bit concerned that there's something wrong with my code. thanks, Nat From p.j.a.cock at googlemail.com Tue Oct 18 18:07:24 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 18 Oct 2011 23:07:24 +0100 Subject: [Biopython] issues with NCBIXML In-Reply-To: References: Message-ID: On Tue, Oct 18, 2011 at 10:47 PM, Nat Echols wrote: > Hi again, > > I'm puzzled by the behavior of the Blast XML parser. ?It appears to be > picking up all of the alignments correctly, but the > top-level Bio.Blast.Record.Blast object that it returns appears to be > incompletely populated. ?Specifically, the attributes num_hits and > num_sequences are set to None - but I have several dozen alignments. ?Am I > missing the point of these attributes, or doing something wrong? ?It's not a > huge issue (I can just count the alignments, I guess), but I'm a bit > concerned that there's something wrong with my code. > > thanks, > Nat The number of alignments and descriptions only really apply to the plain text (or HTML) BLAST output, but I guess we could set them to the number of hits in the XML output. 
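In the meantime, counting the parsed alignments yourself is an easy stand-in -- a quick sketch (filename made up):

from Bio.Blast import NCBIXML

handle = open("my_blast.xml")
for record in NCBIXML.parse(handle):
    # record.alignments is populated from the XML, so its length
    # gives the per-query hit count.
    print record.query, len(record.alignments)
handle.close()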
Peter From mictadlo at gmail.com Tue Oct 18 19:12:03 2011 From: mictadlo at gmail.com (Mic) Date: Wed, 19 Oct 2011 09:12:03 +1000 Subject: [Biopython] [Samtools-help] Segmentation fault In-Reply-To: <4E9D6FAB.70308@fold.natur.cuni.cz> References: <4E9D66B6.70904@fold.natur.cuni.cz> <4E9D6FAB.70308@fold.natur.cuni.cz> Message-ID: I run it now on my Laptop (Ubuntu 11.04 x64) and now I can see the core file: $ ulimit -c unlimited $ python subsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam Segmentation fault (core dumped) $ file core core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'python subsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam' $ gdb /usr/bin/python ./core GNU gdb (Ubuntu/Linaro 7.2-1ubuntu11) 7.2 Copyright (C) 2010 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". For bug reporting instructions, please see: ... Reading symbols from /usr/bin/python...(no debugging symbols found)...done. [New Thread 2748] warning: Can't read pathname for load map: Input/output error. Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libpthread.so.0 Reading symbols from /lib/x86_64-linux-gnu/libdl.so.2...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libdl.so.2 Reading symbols from /lib/x86_64-linux-gnu/libutil.so.1...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libutil.so.1 Reading symbols from /lib/libssl.so.0.9.8...(no debugging symbols found)...done. Loaded symbols for /lib/libssl.so.0.9.8 Reading symbols from /lib/libcrypto.so.0.9.8...(no debugging symbols found)...done. Loaded symbols for /lib/libcrypto.so.0.9.8 Reading symbols from /lib/x86_64-linux-gnu/libz.so.1...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libz.so.1 Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libm.so.6 Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libc.so.6 Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/ld-linux-x86-64.so.2 Reading symbols from /usr/lib/python2.7/lib-dynload/_heapq.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/python2.7/lib-dynload/_heapq.so Reading symbols from /usr/lib/python2.7/lib-dynload/_elementtree.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/python2.7/lib-dynload/_elementtree.so Reading symbols from /lib/x86_64-linux-gnu/libexpat.so.1...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libexpat.so.1 Reading symbols from /usr/lib/python2.7/lib-dynload/pyexpat.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/python2.7/lib-dynload/pyexpat.so Reading symbols from /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/csamtools.so...done. Loaded symbols for /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/csamtools.so Reading symbols from /usr/lib/python2.7/lib-dynload/_ctypes.so...(no debugging symbols found)...done. 
Loaded symbols for /usr/lib/python2.7/lib-dynload/_ctypes.so Reading symbols from /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/ctabix.so...done. Loaded symbols for /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/ctabix.so Reading symbols from /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/TabProxies.so...done. Loaded symbols for /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/TabProxies.so Reading symbols from /usr/lib/python2.7/lib-dynload/_io.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/python2.7/lib-dynload/_io.so Reading symbols from /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/cvcf.so...done. Loaded symbols for /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/cvcf.so Reading symbols from /usr/lib/python2.7/lib-dynload/_multiprocessing.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/python2.7/lib-dynload/_multiprocessing.so Reading symbols from /usr/lib/pymodules/python2.7/Bio/Nexus/cnexus.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/pymodules/python2.7/Bio/Nexus/cnexus.so Core was generated by `python subsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam'. Program terminated with signal 11, Segmentation fault. #0 __pyx_pf_9csamtools_11AlignedRead_5qname___get__ (o=0x164e138, x=) at pysam/csamtools.c:18123 18123 if (__pyx_t_1) { (gdb) (gdb) where #0 __pyx_pf_9csamtools_11AlignedRead_5qname___get__ (o=0x164e138, x=) at pysam/csamtools.c:18123 #1 __pyx_getprop_9csamtools_11AlignedRead_qname (o=0x164e138, x=) at pysam/csamtools.c:30806 #2 0x0000000000479804 in ?? () #3 0x00007f187dbabc65 in __pyx_pf_9csamtools_11AlignedRead___str__ ( __pyx_v_self=0x164e138) at pysam/csamtools.c:17687 #4 0x0000000000479eac in _PyObject_Str () #5 0x0000000000479f8a in PyObject_Str () #6 0x00000000004d390c in ?? () #7 0x00000000004cd2d1 in PyFile_WriteObject () #8 0x000000000049909d in PyEval_EvalFrameEx () #9 0x000000000049d325 in PyEval_EvalCodeEx () #10 0x00000000004ecb02 in PyEval_EvalCode () #11 0x00000000004fdc74 in ?? () #12 0x000000000042c182 in PyRun_FileExFlags () #13 0x000000000042cb4a in PyRun_SimpleFileExFlags () #14 0x0000000000418c9e in Py_Main () #15 0x00007f187ed7aeff in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6 #16 0x00000000004c62b1 in _start () (gdb) (gdb) bt full #0 __pyx_pf_9csamtools_11AlignedRead_5qname___get__ (o=0x164e138, x=) at pysam/csamtools.c:18123 __pyx_v_src = 0x0 __pyx_t_2 = 0x0 __pyx_frame = 0x0 __pyx_r = 0x0 __pyx_t_1 = __Pyx_use_tracing = 0 __pyx_frame_code = 0x0 #1 __pyx_getprop_9csamtools_11AlignedRead_qname (o=0x164e138, x=) at pysam/csamtools.c:30806 No locals. #2 0x0000000000479804 in ?? () No symbol table info available. #3 0x00007f187dbabc65 in __pyx_pf_9csamtools_11AlignedRead___str__ ( __pyx_v_self=0x164e138) at pysam/csamtools.c:17687 __pyx_r = 0x0 __pyx_t_1 = 0x1e4fb90 __pyx_t_2 = 0x0 __pyx_t_3 = 0x0 __pyx_t_4 = 0x0 ---Type to continue, or q to quit--- __pyx_t_5 = 0x0 __pyx_t_6 = 0x0 __pyx_t_7 = 0x0 __pyx_t_8 = 0x0 __pyx_t_9 = 0x0 __pyx_t_10 = 0x0 __pyx_t_11 = 0x0 __pyx_t_12 = 0x0 __pyx_t_13 = 0x0 __pyx_t_14 = 0x0 __pyx_frame_code = 0x0 __pyx_frame = 0x0 __Pyx_use_tracing = 0 #4 0x0000000000479eac in _PyObject_Str () No symbol table info available. #5 0x0000000000479f8a in PyObject_Str () No symbol table info available. #6 0x00000000004d390c in ?? 
() No symbol table info available. #7 0x00000000004cd2d1 in PyFile_WriteObject () No symbol table info available. ---Type to continue, or q to quit--- #8 0x000000000049909d in PyEval_EvalFrameEx () No symbol table info available. #9 0x000000000049d325 in PyEval_EvalCodeEx () No symbol table info available. #10 0x00000000004ecb02 in PyEval_EvalCode () No symbol table info available. #11 0x00000000004fdc74 in ?? () No symbol table info available. #12 0x000000000042c182 in PyRun_FileExFlags () No symbol table info available. #13 0x000000000042cb4a in PyRun_SimpleFileExFlags () No symbol table info available. #14 0x0000000000418c9e in Py_Main () No symbol table info available. #15 0x00007f187ed7aeff in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. #16 0x00000000004c62b1 in _start () No symbol table info available. (gdb) quit $ ulimit -a core file size (blocks, -c) unlimited data seg size (kbytes, -d) unlimited scheduling priority (-e) 20 file size (blocks, -f) unlimited pending signals (-i) 16382 max locked memory (kbytes, -l) 64 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) unlimited virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Thank you in advance. From nathaniel.echols at gmail.com Wed Oct 19 14:48:11 2011 From: nathaniel.echols at gmail.com (Nat Echols) Date: Wed, 19 Oct 2011 11:48:11 -0700 Subject: [Biopython] issues with NCBIXML In-Reply-To: References: Message-ID: On Tue, Oct 18, 2011 at 3:07 PM, Peter Cock wrote: > The number of alignments and descriptions only really apply > to the plain text (or HTML) BLAST output, but I guess we > could set them to the number of hits in the XML output. This would be useful, for consistency's sake if nothing else. I'm happy to contribute a patch if that streamlines the process. -Nat From p.j.a.cock at googlemail.com Wed Oct 19 15:06:30 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 19 Oct 2011 20:06:30 +0100 Subject: [Biopython] issues with NCBIXML In-Reply-To: References: Message-ID: On Wed, Oct 19, 2011 at 7:48 PM, Nat Echols wrote: > On Tue, Oct 18, 2011 at 3:07 PM, Peter Cock > wrote: >> >> The number of alignments and descriptions only really apply >> to the plain text (or HTML) BLAST output, but I guess we >> could set them to the number of hits in the XML output. > > This would be useful, for consistency's sake if nothing else. ?I'm happy to > contribute a patch if that streamlines the process. > -Nat Sure. If you can include unit tests for it even better. You should just be able to add some assertEqual lines to the existing XML parser tests for the newly populated properties. Thanks, Peter From mictadlo at gmail.com Thu Oct 20 05:38:56 2011 From: mictadlo at gmail.com (Mic) Date: Thu, 20 Oct 2011 19:38:56 +1000 Subject: [Biopython] changing record attributes while iterating In-Reply-To: References: Message-ID: Hello, would it be possible to using a generator expression for the following code? from Bio import SeqIO fa_parser = SeqIO.parse(open("../test_files/test.fasta", "rU"), "fasta") sequence = fa_parser.next().seq for record in fa_parser: sequence += 3*'N' + record.seq print sequence Input: >1 1111111 >2 2222222 >3 3333333 >4 4444444 Output: 1111111NNN2222222NNN3333333NNN4444444 Thank you advance. 
On Fri, Oct 7, 2011 at 5:22 PM, Peter Cock wrote: > > > On Friday, October 7, 2011, Michal wrote: > > > Hello, > > Does your code with generator save the whole file in the > > memory or does it read each entry and save it immediately? > > Thank you in advance. > > Using a generator expression like that only one SeqRecord is in memory at a > time. It goes through the input FASTA one record at a time, renames it, > saves it immediately. > > Peter > > P.S. list CC'd From p.j.a.cock at googlemail.com Thu Oct 20 05:58:05 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 20 Oct 2011 10:58:05 +0100 Subject: [Biopython] changing record attributes while iterating In-Reply-To: References: Message-ID: Hi Mic, You should have started a new thread with a new title... On Thu, Oct 20, 2011 at 10:38 AM, Mic wrote: > Hello, > would it be possible to using a generator expression for the following code? > from Bio import SeqIO > fa_parser = SeqIO.parse(open("../test_files/test.fasta", "rU"), "fasta") > sequence = fa_parser.next().seq > for record in fa_parser: > sequence += 3*'N' + record.seq > > print sequence > Input: >>1 > 1111111 >>2 > 2222222 >>3 > 3333333 >>4 > 4444444 > Output: > 1111111NNN2222222NNN3333333NNN4444444 > Thank you advance. Sure, how about this: from Bio import SeqIO fa_parser = SeqIO.parse("../test_files/test.fasta", "fasta") print ('N' * 3).join(str(rec.seq) for rec in fa_parser) Peter From andreas.wilm at gmail.com Tue Oct 25 02:26:59 2011 From: andreas.wilm at gmail.com (Andreas Wilm) Date: Tue, 25 Oct 2011 14:26:59 +0800 Subject: [Biopython] VCF parser In-Reply-To: References: Message-ID: Hi Tiago, I'm not aware of a Biopython VCF parser, but pysam seems to have one (haven't used it though). Try >>> from pysam import cvcf You also might want to check an implementation which was posted on seqanswers: http://seqanswers.com/forums/archive/index.php/t-9266.html Andreas PS: For the sake of completeness: your question was asked before here (no replies). See http://www.biopython.org/pipermail/biopython/2011-March/007131.html 2011/10/4 Tiago Antão : > Hi, > > I wonder if there is a VCF parser in either Python or Java? Either I > am being dumb at searching (probably) or nothing exists? > > Thanks, > Tiago > > -- > "If you want to get laid, go to college. If you want an education, go > to the library." - Frank Zappa > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Andreas Wilm andreas.wilm at gmail.com | mail at andreas-wilm.com | 0x7C68FBCC From pawan.mani2 at gmail.com Tue Oct 25 11:50:51 2011 From: pawan.mani2 at gmail.com (kakchingtabam pawankumar sharma) Date: Tue, 25 Oct 2011 21:20:51 +0530 Subject: [Biopython] installation of pyfasta In-Reply-To: <2DDF09AFEB46E54894A3843CEF9CB3A4B55076114C@EXCHMB.ocimumbio.com> References: <2DDF09AFEB46E54894A3843CEF9CB3A4B55076114C@EXCHMB.ocimumbio.com> Message-ID: Dear I would like to know how to install pyfasta in Linux. I have downloaded pyfasta-0.4.4.tar.gz and unpacked it using the command: *tar -xzvf pyfasta-0.4.4.tar.gz*. But I could not use the command line: *pyfasta split -n 6 sample.fasta* So kindly help me out to solve this problem. With Regards, Pawan
From p.j.a.cock at googlemail.com Tue Oct 25 12:13:00 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 25 Oct 2011 17:13:00 +0100 Subject: [Biopython] installation of pyfasta In-Reply-To: References: <2DDF09AFEB46E54894A3843CEF9CB3A4B55076114C@EXCHMB.ocimumbio.com> Message-ID: On Tue, Oct 25, 2011 at 4:50 PM, kakchingtabam pawankumar sharma wrote: > Dear > > I would like to know how to install pyfasta in Linux. I have > downloaded pyfasta-0.4.4.tar.gz and unpacked it using the command: *tar -xzvf > pyfasta-0.4.4.tar.gz*. > > But I could not use the command line: > > *pyfasta split -n 6 sample.fasta* > > So kindly help me out to solve this problem. > > With Regards, > > Pawan > Hi Pawan, Note pyfasta is not part of Biopython, but is a separate tool by Brent Pedersen (CC'd). http://pypi.python.org/pypi/pyfasta/ https://github.com/brentp/pyfasta/ However, uncompressing the tar ball is only the first step in installing it. You probably need to run "python setup.py install" for that. Peter From bpederse at gmail.com Tue Oct 25 12:23:31 2011 From: bpederse at gmail.com (Brent Pedersen) Date: Tue, 25 Oct 2011 10:23:31 -0600 Subject: [Biopython] VCF parser In-Reply-To: References: Message-ID: On Mon, Oct 3, 2011 at 4:12 PM, Tiago Antão wrote: > Hi, > > I wonder if there is a VCF parser in either Python or Java? Either I > am being dumb at searching (probably) or nothing exists? > > Thanks, > Tiago > > -- > "If you want to get laid, go to college. If you want an education, go > to the library." - Frank Zappa > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > I have found this one: https://github.com/jdoughertyii/PyVCF to be quite good and easy to use. From anaryin at gmail.com Wed Oct 26 06:30:12 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 26 Oct 2011 12:30:12 +0200 Subject: [Biopython] Pairwise alignment - is it a generic function? Message-ID: Hello all, A friend of mine was interested in a small simple alignment script for amino acids, to which I recommended to have a look at Biopython. We found the pairwise2 module but we're a bit puzzled. Does it align "any" sequence, aa or nucleotides? I don't see any scoring matrix referenced there... Related to this, can you suggest any implementation of an amino acid pairwise alignment algorithm, in Python, that is self-contained (i.e. doesn't depend on some other program). Best, João [...]
Rodrigues http://nmr.chem.uu.nl/~joao From p.j.a.cock at googlemail.com Wed Oct 26 06:58:09 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 26 Oct 2011 11:58:09 +0100 Subject: [Biopython] Pairwise alignment - is it a generic function? In-Reply-To: References: Message-ID: On Wed, Oct 26, 2011 at 11:30 AM, Jo?o Rodrigues wrote: > Hello all, > > A friend of mine was interested in a small simple alignment script for > aminoacids, to which I recommended to have a look at Biopython. We found the > pairwise2 module but we're a bit puzzled. Does it align "any" sequence, aa > or nucleotides? I don't see any scoring matrix referenced there.. It should work on proteins, just pass in the appropriate scoring matrix. > Related to this, can you suggest any implementation of an aminoacid > pairwise alignment algorithm, in Python, that does is self contained > (ie. doesn't depend on some other program). Well, Bio.pairwise2 has a faster C implementation and fall back slower pure Python implementation (used under Jython/PyPy/etc), which might answer your needs. Peter From from.d.putto at gmail.com Wed Oct 26 11:11:05 2011 From: from.d.putto at gmail.com (Sheila the angel) Date: Wed, 26 Oct 2011 17:11:05 +0200 Subject: [Biopython] downloading gnome Protein table Message-ID: Hi All, I an facing some problem to downloading the gnome and other information. For an example I did a query on ncbi gnome for NC_008390 On clicking results you can get following link http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=ShowDetailView&TermToSearch=19840 On my web-browser I can save this page as File> Save as >out.html Furthermore I want to download the Protein table also http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=Retrieve&dopt=Protein+Table&list_uids=19840 I want to do this for many Ids. Is there any simple way in Bio-Python??? Thanks in Advance -- Cheers Sheila From p.j.a.cock at googlemail.com Wed Oct 26 11:27:37 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 26 Oct 2011 16:27:37 +0100 Subject: [Biopython] downloading gnome Protein table In-Reply-To: References: Message-ID: On Wed, Oct 26, 2011 at 4:11 PM, Sheila the angel wrote: > Hi All, > > I an facing some problem to downloading the gnome and other information. > For an example I did a query on ncbi gnome for ?NC_008390 > On clicking results you can get following link > > http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=ShowDetailView&TermToSearch=19840 > On my web-browser I can save this page ?as File> Save as >out.html > > Furthermore I want to download the Protein table also > http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=Retrieve&dopt=Protein+Table&list_uids=19840 > > I want to do this for many Ids. Is there any simple way in Bio-Python??? > > Thanks in Advance Hmm, some of that might be available by Bio.Entrez, not sure though. For the protein table I would personally work with the *.ptt files from the NCBI FTP site, e.g. ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid13490/CP000441.ptt or: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid58303/NC_008391.ptt The FTP links are on the page of the first URL you gave. 
You can download all the "bacteria" *.ptt files as a tar ball, ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.ptt.tar.gz Typically I work from the GenBank file files instead (*.gbk rather than *.ptt) Peter From mictadlo at gmail.com Wed Oct 26 21:14:16 2011 From: mictadlo at gmail.com (Mic) Date: Thu, 27 Oct 2011 11:14:16 +1000 Subject: [Biopython] changing record attributes while iterating In-Reply-To: References: Message-ID: Thank you it is working. I would like to to put sequences id in a list in the following way: >>> c = (i.id for i in b) SyntaxError: invalid syntax >>> c[0] Traceback (most recent call last): File "", line 1, in TypeError: 'generator' object is not subscriptable How is it possible to generate a list of sequence ids? Thank you in advance. On Thu, Oct 20, 2011 at 7:58 PM, Peter Cock wrote: > Hi Mic, > > You should have started a new thread with a new title... > > On Thu, Oct 20, 2011 at 10:38 AM, Mic wrote: > > Hello, > > would it be possible to using a generator expression for the following > code? > > from Bio import SeqIO > > fa_parser = SeqIO.parse(open("../test_files/test.fasta", "rU"), "fasta") > > sequence = fa_parser.next().seq > > for record in fa_parser: > > sequence += 3*'N' + record.seq > > > > print sequence > > Input: > >>1 > > 1111111 > >>2 > > 2222222 > >>3 > > 3333333 > >>4 > > 4444444 > > Output: > > 1111111NNN2222222NNN3333333NNN4444444 > > Thank you advance. > > Sure, how about this: > > from Bio import SeqIO > fa_parser = SeqIO.parse("../test_files/test.fasta", "fasta") > print ('N' * 3).join(str(rec.seq) for rec in fa_parser) > > Peter > From p.j.a.cock at googlemail.com Thu Oct 27 04:35:24 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Oct 2011 09:35:24 +0100 Subject: [Biopython] changing record attributes while iterating In-Reply-To: References: Message-ID: On Thu, Oct 27, 2011 at 2:14 AM, Mic wrote: > Thank you it is working. > I would like to to put sequences id in a list in the following way: >>>> c = (i.id for i in b) > SyntaxError: invalid syntax The above would be a generator expression, and requires Python 2.4. It shouldn't cause a SyntaxError unless there is some mistake I'm not seeing (or you missed something in the copy & paste). >>>> c[0] > Traceback (most recent call last): > ? File "", line 1, in > TypeError: 'generator' object is not subscriptable > How is it possible to?generate?a list of sequence ids? You need to create a list (e.g using a list comprehension) rather than a generator, probably: c = [i.id for i in b] c[0] = "Fred" Peter From from.d.putto at gmail.com Thu Oct 27 06:47:04 2011 From: from.d.putto at gmail.com (Sheila the angel) Date: Thu, 27 Oct 2011 12:47:04 +0200 Subject: [Biopython] downloading gnome Protein table In-Reply-To: References: Message-ID: The problem is I have only the Refseq ID like NC_008390 and I don't have Protein table ID (in this case CP000441.ptt) so I can't download the .ptt file (as in ftp url ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid13490/CP000441.ptt ) Also not all Refseq IDs I have belongs to 'Bacteria'. 
So for ID NC_004314 (just an example) I have to change the ftp url as ftp://ftp.ncbi.nih.gov/genomes/Protozoa/Plasmodium_falciparum/NC_004314.ptt Downloading the *.gbk file may be an option (but later I need to convert them into protein table) so I tried this from Bio import Entrez Entrez.email = "from.d.putto at gmail.com" handle = Entrez.efetch(db="genome", id="NC_008390", rettype="gbk") print handle.read() The output shows me 'Nothing has been found' I am not sure in which database I should look for id like NC_008390. Moreover later-on I need to convert 'gbk' file to .ptt (or extract protein information) On Wed, Oct 26, 2011 at 5:27 PM, Peter Cock wrote: > On Wed, Oct 26, 2011 at 4:11 PM, Sheila the angel > wrote: > > Hi All, > > > > I an facing some problem to downloading the gnome and other information. > > For an example I did a query on ncbi gnome for NC_008390 > > On clicking results you can get following link > > > > > http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=ShowDetailView&TermToSearch=19840 > > On my web-browser I can save this page as File> Save as >out.html > > > > Furthermore I want to download the Protein table also > > > http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=Retrieve&dopt=Protein+Table&list_uids=19840 > > > > I want to do this for many Ids. Is there any simple way in Bio-Python??? > > > > Thanks in Advance > > Hmm, some of that might be available by Bio.Entrez, not sure though. > > For the protein table I would personally work with the *.ptt files from > the NCBI FTP site, e.g. > > > ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid13490/CP000441.ptt > > or: > > > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid58303/NC_008391.ptt > > The FTP links are on the page of the first URL you gave. You can download > all the "bacteria" *.ptt files as a tar ball, > > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.ptt.tar.gz > > Typically I work from the GenBank file files instead (*.gbk rather than > *.ptt) > > Peter > From p.j.a.cock at googlemail.com Thu Oct 27 09:14:10 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Oct 2011 14:14:10 +0100 Subject: [Biopython] downloading gnome Protein table In-Reply-To: References: Message-ID: On Thu, Oct 27, 2011 at 11:47 AM, Sheila the angel wrote: > The problem is I have only the Refseq ID like?NC_008390?and I don't have > Protein table ID (in this case CP000441.ptt) so I can't download the .ptt > file (as in ftp url > ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid13490/CP000441.ptt > ? ) Given your identifiers, use ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ rather than ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/ - in this case, ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid58303/NC_008390.ptt > > Also not all??Refseq IDs I have belongs to 'Bacteria'. > Then the NCBI won't have them on the Bacterial FTP sites, and I don't think they will provide *.ptt files for them. > So for ID > NC_004314?(just an example) I have to change the ftp url as > ftp://ftp.ncbi.nih.gov/genomes/Protozoa/Plasmodium_falciparum/NC_004314.ptt > > Downloading the *.gbk file may be an option (but later I need to convert > them into protein table) Just download *all* the bacterial protein tables as the tar ball, its only 120MB compressed: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.ptt.tar.gz Then you can just search locally for a file by name etc. 
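For example, something along these lines should find the right table once the tar ball is unpacked (just a sketch -- adjust the directory name to wherever you extracted it):

import os

def find_ptt(root, accession):
    # Walk the unpacked directory tree looking for e.g. NC_008390.ptt
    target = accession + ".ptt"
    for dirpath, dirnames, filenames in os.walk(root):
        if target in filenames:
            return os.path.join(dirpath, target)
    return None

print find_ptt("Bacteria", "NC_008390")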
> so I tried this > from Bio import Entrez > Entrez.email = "from.d.putto at gmail.com" > handle = Entrez.efetch(db="genome", id="NC_008390", rettype="gbk") > print handle.read() > The output shows me 'Nothing has been found' > I am not sure in which database I should look for id like NC_008390. Try it on the NCBI website for all databases, http://www.ncbi.nlm.nih.gov/sites/gquery?term=NC_008390 You'll see it does match the genome database, but also the nucleotide database. In this case you want the sequence as a GenBank file so use the nucleotide database. > Moreover later-on I need to convert 'gbk' file to .ptt (or extract protein > information) The Biopython GenBank parser can do that - life is easier with bacterial genomes as there are (almost) no nasty join(...) locations to deal with. Peter From devaniranjan at gmail.com Thu Oct 27 15:16:07 2011 From: devaniranjan at gmail.com (George Devaniranjan) Date: Thu, 27 Oct 2011 15:16:07 -0400 Subject: [Biopython] weighted sampling of a dictionary Message-ID: Hi, I am not sure if this question is more suitable for biopython or a python forum. I have the following dictionary. dict ={'YLE': 6, 'QYL': 36, 'PTD': 32, 'AGG': 145, 'QYG': 34, 'QYD': 34, 'AGD': 188, 'QYS': 35, 'AGS': 177, 'AGA': 154, 'QYA': 23, 'AGL': 16, 'LAU': 1, 'PTA': 7, ' AGY': 7, 'QYY': 19, 'QYE': 6, 'PAT': 57, 'QYT': 28, 'AGT': 10, 'QYQ': 34, 'AGQ': 140, 'QYP': 32, 'AGP': 167, 'TAT': 31, 'SGS': 174, 'TAP': 18, 'YLP': 49, 'TA Q': 23, 'UQE': 5, 'UAQ': 9, 'UAT': 8, 'UAE': 7, 'TAD': 1, 'TAG': 15, 'TAA': 20, 'TAS': 1, 'YUP': 1, 'TAL': 45, 'ALU': 20, 'PEP': 14, 'UAG': 6, 'EAL': 16, 'SY Y': 36, 'EAS': 35, 'SYT': 29, 'EAA': 16, 'SYQ': 13, 'EAG': 28} The keys are the different amino acid triplets (all possible triplets extracted from a culled list of PDB), the numbers next to them are the frequency that they occour in. I was wondering if there is a way in biopython/python to sample them at the frequecy indicated by the no's next to the key. I have only given a snippet of the triplet dictionary, the entire dictionary has about 1400 key entries. I would appreciate any help in this matter --thank you very much. George From bpederse at gmail.com Thu Oct 27 16:29:43 2011 From: bpederse at gmail.com (Brent Pedersen) Date: Thu, 27 Oct 2011 14:29:43 -0600 Subject: [Biopython] weighted sampling of a dictionary In-Reply-To: References: Message-ID: On Thu, Oct 27, 2011 at 1:16 PM, George Devaniranjan wrote: > Hi, > > I am not sure if this question is more suitable for biopython or a python > forum. > > > I have the following dictionary. > > dict ={'YLE': 6, 'QYL': 36, 'PTD': 32, 'AGG': 145, 'QYG': 34, 'QYD': 34, > 'AGD': 188, 'QYS': 35, 'AGS': 177, 'AGA': 154, 'QYA': 23, 'AGL': 16, 'LAU': > 1, 'PTA': 7, ' > AGY': 7, 'QYY': 19, 'QYE': 6, 'PAT': 57, 'QYT': 28, 'AGT': 10, 'QYQ': 34, > 'AGQ': 140, 'QYP': 32, 'AGP': 167, 'TAT': 31, 'SGS': 174, 'TAP': 18, 'YLP': > 49, 'TA > Q': 23, 'UQE': 5, 'UAQ': 9, 'UAT': 8, 'UAE': 7, 'TAD': 1, 'TAG': 15, 'TAA': > 20, 'TAS': 1, 'YUP': 1, 'TAL': 45, 'ALU': 20, 'PEP': 14, 'UAG': 6, 'EAL': > 16, 'SY > Y': 36, 'EAS': 35, 'SYT': 29, 'EAA': 16, 'SYQ': 13, 'EAG': 28} > > The keys are the different amino acid triplets (all possible triplets > extracted from a culled list of PDB), the numbers next to them are the > frequency that they occour in. > > I was wondering if there is a way in biopython/python to sample them at the > frequecy indicated by the no's next to the key. 
> > I have only given a snippet of the triplet dictionary, the entire dictionary > has about 1400 key entries. > > I would appreciate any help in this matter --thank you very much. > > George > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > you could try the one of these (presumably the class king) http://eli.thegreenplace.net/2010/01/22/weighted-random-generation-in-python/ you'll have something like: import operator aminos, weights = zip(*sorted(adict.items(), key=operator.itemgetter(1))) amino_gen = WeightedRandomGenerator(weights) for i in xrange(nsims): idx = amino_gen.next() rand_aa = aminos[idx] From jmtc21 at bath.ac.uk Thu Oct 27 16:33:18 2011 From: jmtc21 at bath.ac.uk (Jaime Tovar) Date: Thu, 27 Oct 2011 21:33:18 +0100 Subject: [Biopython] expat and biopython 1.58 problem on linux x64 Message-ID: <4EA9C00E.5080509@bath.ac.uk> Hello all, I'm having troubles while updating my biopython to 1.58. I'm having exactly the same problem with the xml parser as described in this old post: http://www.biopython.org/pipermail/biopython/2011-May/007263.html Sadly I may have to use the entrez module so it will make me happy to have the thing running if possible. I'm installing in a opensuse 11.3 x64 box Did a rpm install of biopython from the opensuse science repo. So I have 1.58-1.2 installed. Python 1.6.5-3.5.1 for x64 expat 2.0.1-98.1 x64 Tried to install both by hand from the tar.gz and using an rpm but the problem persists. Any help will be greatly appreciated. Thanks!!! Jaime. From winda002 at student.otago.ac.nz Thu Oct 27 16:52:00 2011 From: winda002 at student.otago.ac.nz (David Winter) Date: Fri, 28 Oct 2011 09:52:00 +1300 Subject: [Biopython] weighted sampling of a dictionary In-Reply-To: References: Message-ID: <20111028095200.20435ub1z2jexy0g@www.studentmail.otago.ac.nz> Hi George, I was actually doing this yesterday :) The function I came up with takes two lists: import random def weighted_sample(population, weights): """ Sample from a population, given provided weights """ if len(population) != len(weights): raise ValueError('Lengths of population and weights do not match') normal_weights = [ float(w)/sum(weights) for w in weights ] val = random.random() running_total = 0 for index, weight in enumerate(normal_weights): running_total += weight if val < running_total: return population[index] Which seems to do the trick: population = ['AAU' ,'AAC', 'AAG'] weights = [2,5,3] sample = [weighted_sample(population, weights) for _ in range(1000)] sample.count('AAC') #should be about 500 If that's too slow, check out numpy's random.multinomial() function. I haven't tested this, but this should get you the number of times you get each codon from 1000 "draws": import numpy as np codons, weights = codon_dict.items() denom = sum(weights) normalised_weights = [float(w)/denom for w in weights] np.random.multinomial(codons, weights, 1000) Cheers, David Quoting George Devaniranjan : > Hi, > > I am not sure if this question is more suitable for biopython or a python > forum. > > > I have the following dictionary. 
> > dict ={'YLE': 6, 'QYL': 36, 'PTD': 32, 'AGG': 145, 'QYG': 34, 'QYD': 34, > 'AGD': 188, 'QYS': 35, 'AGS': 177, 'AGA': 154, 'QYA': 23, 'AGL': 16, 'LAU': > 1, 'PTA': 7, ' > AGY': 7, 'QYY': 19, 'QYE': 6, 'PAT': 57, 'QYT': 28, 'AGT': 10, 'QYQ': 34, > 'AGQ': 140, 'QYP': 32, 'AGP': 167, 'TAT': 31, 'SGS': 174, 'TAP': 18, 'YLP': > 49, 'TA > Q': 23, 'UQE': 5, 'UAQ': 9, 'UAT': 8, 'UAE': 7, 'TAD': 1, 'TAG': 15, 'TAA': > 20, 'TAS': 1, 'YUP': 1, 'TAL': 45, 'ALU': 20, 'PEP': 14, 'UAG': 6, 'EAL': > 16, 'SY > Y': 36, 'EAS': 35, 'SYT': 29, 'EAA': 16, 'SYQ': 13, 'EAG': 28} > > The keys are the different amino acid triplets (all possible triplets > extracted from a culled list of PDB), the numbers next to them are the > frequency that they occour in. > > I was wondering if there is a way in biopython/python to sample them at the > frequecy indicated by the no's next to the key. > > I have only given a snippet of the triplet dictionary, the entire dictionary > has about 1400 key entries. > > I would appreciate any help in this matter --thank you very much. > > George > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From p.j.a.cock at googlemail.com Fri Oct 28 05:54:09 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 28 Oct 2011 10:54:09 +0100 Subject: [Biopython] expat and biopython 1.58 problem on linux x64 In-Reply-To: <4EA9C00E.5080509@bath.ac.uk> References: <4EA9C00E.5080509@bath.ac.uk> Message-ID: On Thu, Oct 27, 2011 at 9:33 PM, Jaime Tovar wrote: > Hello all, > > I'm having troubles while updating my biopython to 1.58. > > I'm having exactly the same problem with the xml parser as described in this > old post: > > http://www.biopython.org/pipermail/biopython/2011-May/007263.html > > Sadly I may have to use the entrez module so it will make me happy to have > the thing running if possible. > > I'm installing in a opensuse 11.3 x64 box > Did a rpm install of biopython from the opensuse science repo. So I have > 1.58-1.2 installed. > Python 1.6.5-3.5.1 for x64 > expat 2.0.1-98.1 x64 > > Tried to install both by hand from the tar.gz and using an rpm but the > problem persists. > > Any help will be greatly appreciated. > > Thanks!!! > > Jaime. Hmm. Can you try installing the latest code from git please? You can grab it via the git command line tool, or use github to download the latest code as a tar ball: http://biopython.org/wiki/SourceCode Specifically I'm hoping this change will fix the segmentation fault (assuming http://bugs.python.org/issue4877 is to blame): https://github.com/biopython/biopython/commit/59f9cbd2ad14ebd05d5864033ff0c7ef7a8f0daa Previously: $ python Python 2.6.6 (r266:84292, Aug 31 2010, 16:21:14) [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import Entrez >>> handle = open("NEWS") >>> handle.close() >>> Entrez.read(handle) Segmentation fault With the fix: $ python Python 2.6.6 (r266:84292, Aug 31 2010, 16:21:14) [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2 Type "help", "copyright", "credits" or "license" for more information. 
>>> from Bio import Entrez >>> handle = open("NEWS") >>> handle.close() >>> Entrez.read(handle) Traceback (most recent call last): File "", line 1, in File "Bio/Entrez/__init__.py", line 270, in read record = handler.read(handle) File "Bio/Entrez/Parser.py", line 167, in read raise IOError("Can't parse a closed handle") IOError: Can't parse a closed handle Assuming you start seeing the IOError instead, the question would shift to what is going on with your network settings (e.g. look at proxies). If the segmentation fault doesn't go away we'll need to think again. Peter From bioinformaticsing at gmail.com Fri Oct 28 07:46:07 2011 From: bioinformaticsing at gmail.com (ning luwen) Date: Fri, 28 Oct 2011 19:46:07 +0800 Subject: [Biopython] Memory leak while parse gbk file? Message-ID: Hi, I have tried to parse about 2000+ gbk file using SeqIO.parse to parse gbk file, but the memory up quickly. ( in my desktop 4g memory, out memory after a number of iterates, and then try one work station, memory used as high as 100g+, and continue increasing) for temp_name in file_names:#file_names: list of path of gbk files. f=open(temp_name) for x in SeqIO.parse(f,'genbank'): print x.name,len(x.features) f.close() I guess there may be memory leak while parse gbk flle. -- regards, luwen ning From p.j.a.cock at googlemail.com Fri Oct 28 07:52:33 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 28 Oct 2011 12:52:33 +0100 Subject: [Biopython] Memory leak while parse gbk file? In-Reply-To: References: Message-ID: On Fri, Oct 28, 2011 at 12:46 PM, ning luwen wrote: > Hi, > ? ?I have tried to parse about 2000+ gbk file using SeqIO.parse to > parse gbk file, but the memory up quickly. ( in my desktop 4g memory, > out memory after a number of iterates, and then try one work station, > memory used as high as 100g+, and continue increasing) > > for temp_name in file_names:#file_names: list of path of gbk files. > ? ?f=open(temp_name) > ? ?for x in SeqIO.parse(f,'genbank'): > ? ? ? ?print x.name,len(x.features) > ? ?f.close() > > ? I guess there may be memory leak while parse gbk flle. > -- > regards, > luwen ning Which version of Python are you using? Try calling garbage collection, import gc from Bio import SeqIO for temp_name in file_names:#file_names: list of path of gbk files. f=open(temp_name) for x in SeqIO.parse(f,'genbank'): print x.name,len(x.features) f.close() gc.collect() I expect that to fix the increasing memory usage. If it does, then it isn't a memory leak. Peter From p.j.a.cock at googlemail.com Fri Oct 28 09:21:42 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 28 Oct 2011 14:21:42 +0100 Subject: [Biopython] expat and biopython 1.58 problem on linux x64 In-Reply-To: <4EAAA9A0.3010906@bath.ac.uk> References: <4EA9C00E.5080509@bath.ac.uk> <4EAAA9A0.3010906@bath.ac.uk> Message-ID: On Fri, Oct 28, 2011 at 2:09 PM, Jaime Tovar wrote: > Got the tarball for latest, > > but: > > ... > ~/tmp/biop/biopython-biopython-59f9cbd/Tests> python test_Entrez.py > Test error handling when presented with Fasta non-XML data ... ok > Test error handling when presented with GenBank non-XML data ... ok > Test parsing XML returned by EFetch, Nucleotide database (first test) ... > ERROR > Test parsing XML returned by EFetch, Protein database ... ERROR > Test parsing XML returned by EFetch, OMIM database ... ERROR > Test parsing XML returned by EFetch, PubMed database (first test) ... > Segmentation fault > > Can we try to find where exactly is the problem? > > Thanks for the help. 
> J OK, so it doesn't look like the problem with closed handles, http://bugs.python.org/issue4877 Although to be sure please try the example in my last email, from Bio import Entrez handle = open("NEWS") handle.close() Entrez.read(handle) (You can use any file that exists). Beyond that I only have questions rather than answers for now. My guess is something is broken on your system with conflicting versions of expat, see for example: http://www.dscpl.com.au/wiki/ModPython/Articles/ExpatCausingApacheCrash What does this give you, and does it match expat 2.0.1 which you said earlier was installed? import pyexpat print pyexpat.version_info Can you try to get a strack trace? Alternatively, you could disable individual tests which trigger the segmentation fault one by one and then we can attempt to spot any commonalities. e.g. The segmentation fault is from: "Test parsing XML returned by EFetch, PubMed database (first test)" which is method test_pubmed1, rename it to xtest_test_pubmed1 (or anything that doesn't start test_*) and it will be skipped. Peter From devaniranjan at gmail.com Fri Oct 28 09:23:22 2011 From: devaniranjan at gmail.com (George Devaniranjan) Date: Fri, 28 Oct 2011 09:23:22 -0400 Subject: [Biopython] weighted sampling of a dictionary In-Reply-To: <20111028095200.20435ub1z2jexy0g@www.studentmail.otago.ac.nz> References: <20111028095200.20435ub1z2jexy0g@www.studentmail.otago.ac.nz> Message-ID: Thanks guys for all your suggestions -I am going to try these out. Best, George On Thu, Oct 27, 2011 at 4:52 PM, David Winter wrote: > Hi George, > > I was actually doing this yesterday :) > > The function I came up with takes two lists: > > import random > > def weighted_sample(population, weights): > """ Sample from a population, given provided weights """ > if len(population) != len(weights): > raise ValueError('Lengths of population and weights do not match') > normal_weights = [ float(w)/sum(weights) for w in weights ] > val = random.random() > running_total = 0 > for index, weight in enumerate(normal_weights): > running_total += weight > if val < running_total: > return population[index] > > Which seems to do the trick: > > population = ['AAU' ,'AAC', 'AAG'] > weights = [2,5,3] > sample = [weighted_sample(population, weights) for _ in range(1000)] > sample.count('AAC') #should be about 500 > > If that's too slow, check out numpy's random.multinomial() function. > > I haven't tested this, but this should get you the number of times you get > each codon from 1000 "draws": > > import numpy as np > > codons, weights = codon_dict.items() > denom = sum(weights) > normalised_weights = [float(w)/denom for w in weights] > np.random.multinomial(codons, weights, 1000) > > Cheers, > David > > > > Quoting George Devaniranjan : > > Hi, >> >> I am not sure if this question is more suitable for biopython or a python >> forum. >> >> >> I have the following dictionary. 
>> >> dict ={'YLE': 6, 'QYL': 36, 'PTD': 32, 'AGG': 145, 'QYG': 34, 'QYD': 34, >> 'AGD': 188, 'QYS': 35, 'AGS': 177, 'AGA': 154, 'QYA': 23, 'AGL': 16, >> 'LAU': >> 1, 'PTA': 7, ' >> AGY': 7, 'QYY': 19, 'QYE': 6, 'PAT': 57, 'QYT': 28, 'AGT': 10, 'QYQ': 34, >> 'AGQ': 140, 'QYP': 32, 'AGP': 167, 'TAT': 31, 'SGS': 174, 'TAP': 18, >> 'YLP': >> 49, 'TA >> Q': 23, 'UQE': 5, 'UAQ': 9, 'UAT': 8, 'UAE': 7, 'TAD': 1, 'TAG': 15, >> 'TAA': >> 20, 'TAS': 1, 'YUP': 1, 'TAL': 45, 'ALU': 20, 'PEP': 14, 'UAG': 6, 'EAL': >> 16, 'SY >> Y': 36, 'EAS': 35, 'SYT': 29, 'EAA': 16, 'SYQ': 13, 'EAG': 28} >> >> The keys are the different amino acid triplets (all possible triplets >> extracted from a culled list of PDB), the numbers next to them are the >> frequency that they occour in. >> >> I was wondering if there is a way in biopython/python to sample them at >> the >> frequecy indicated by the no's next to the key. >> >> I have only given a snippet of the triplet dictionary, the entire >> dictionary >> has about 1400 key entries. >> >> I would appreciate any help in this matter --thank you very much. >> >> George >> ______________________________**_________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/**mailman/listinfo/biopython >> >> > > > From p.j.a.cock at googlemail.com Mon Oct 31 07:27:31 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 31 Oct 2011 11:27:31 +0000 Subject: [Biopython] expat and biopython 1.58 problem on linux x64 In-Reply-To: References: <4EA9C00E.5080509@bath.ac.uk> <4EAAA9A0.3010906@bath.ac.uk> Message-ID: On Fri, Oct 28, 2011 at 2:21 PM, Peter Cock wrote: > > OK, so it doesn't look like the problem with closed handles, > http://bugs.python.org/issue4877 > Hi Jaime, Was there any sign of an expat version mismatch? That does seem like the most likely problem (Python expecting one thing, the library providing another). Another guess was we could be reusing the parser object (which apparently is not allowed), although the unit tests don't seem to do this: http://bugs.python.org/issue6676 http://bugs.python.org/issue12829 Peter From tiagoantao at gmail.com Mon Oct 3 22:12:18 2011 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 3 Oct 2011 23:12:18 +0100 Subject: [Biopython] VCF parser Message-ID: Hi, I wonder if there is a VCF parser in either Python or Java? Either I am being dumb at searching (probably) or nothing exists? Thanks, Tiago -- "If you want to get laid, go to college.? If you want an education, go to the library." - Frank Zappa From bala.biophysics at gmail.com Tue Oct 4 08:05:36 2011 From: bala.biophysics at gmail.com (Bala subramanian) Date: Tue, 4 Oct 2011 10:05:36 +0200 Subject: [Biopython] changing record attributes while iterating Message-ID: Friends, I have a fasta file. I need to modify the record id by adding a suffix to it. So i used SeqRecord (the code attached below). It is working fine but i would like to know if there is any simple way to do that. ie. if i can change the record attributes while iterating through the fasta with SeqIO.parse itself. I tried something like following but i couldnt get what i wanted. new_list=[] for record in SeqIO.parse(open(argv[1], "rU"), "fasta"): record.id=record.id + '_suffix' new_list.append(record) Hence i used SeqRecord to do the modification ? 
---------------------------------------------------------------------------------------------------- #!/usr/bin/env python from Bio import SeqIO from Bio.SeqRecord import SeqRecord from Bio.Seq import Seq from sys import argv new_list=[] for record in SeqIO.parse(open(argv[1], "rU"), "fasta"): seq=str(record.seq) newrec=SeqRecord(Seq(seq),id=record.id+"_suffix",name='',description='') new_list.append(newrec) output_handle = open(raw_input('Enter the output file:'), 'w') SeqIO.write(new_list, output_handle, "fasta") output_handle.close() From p.j.a.cock at googlemail.com Tue Oct 4 08:24:08 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 4 Oct 2011 09:24:08 +0100 Subject: [Biopython] changing record attributes while iterating In-Reply-To: References: Message-ID: On Tue, Oct 4, 2011 at 9:05 AM, Bala subramanian wrote: > Friends, > I have a fasta file. I need to modify the record id by adding a suffix to > it. So i used SeqRecord (the code attached below). It is working fine but i > would like to know if there is any simple way to do that. ie. if i can > change the record attributes while iterating through the fasta with > SeqIO.parse itself. I tried something like following but i couldnt get what > i wanted. > > new_list=[] > for record in SeqIO.parse(open(argv[1], "rU"), "fasta"): > ? ? ? ? ? ? ? ? ? ?record.id=record.id + '_suffix' > ? ? ? ? ? ? ? ? ? ?new_list.append(record) The above looks fine, although depending on the rest of your script a big list might be a bad idea (too much memory) and an iterator based approach may be preferable. If as in the rest of your example you just need to do this for output, perhaps: #!/usr/bin/env python from Bio import SeqIO from sys import argv def rename(record): """Modified record in place AND returns it.""" record.id += '_suffix' return record #This is a generator expression: records = (rename(r) for r in SeqIO.parse(argv[1], "fasta")) output_filename = raw_input('Enter the output file:') SeqIO.write(records, output_filename, "fasta") The alternative you showed was wasteful, creating lots of new objects to no benefit. Peter From nanatrapnest at hotmail.it Wed Oct 5 15:07:44 2011 From: nanatrapnest at hotmail.it (Nana Trapnest) Date: Wed, 5 Oct 2011 15:07:44 +0000 Subject: [Biopython] StructureBuilder Message-ID: Hello, is it possible with structure builder copy all a protein and change atoms coord??? How can I do this?? Thanks to all of you! Stefania From anaryin at gmail.com Wed Oct 5 16:02:30 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 5 Oct 2011 18:02:30 +0200 Subject: [Biopython] StructureBuilder In-Reply-To: References: Message-ID: Hello Stefania, It should be possible to copy the entire protein yes, but I would rather use deepcopy to create a fully new Structure object and manipulate that one. Something along the lines of: import copy [ ... Parse your structure to s...] s_copy = copy.deepcopy(s) for atom in s_copy.get_atoms(): *here use either atom.transform or just modify atom.coord* Cheers, Jo?o [...] Rodrigues http://nmr.chem.uu.nl/~joao 2011/10/5 Nana Trapnest > > Hello, > is it possible with structure builder copy all a protein and change atoms > coord??? How can I do this?? > Thanks to all of you! 
> Stefania > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From dilara.ally at gmail.com Wed Oct 5 23:21:29 2011 From: dilara.ally at gmail.com (Dilara Ally) Date: Wed, 05 Oct 2011 16:21:29 -0700 Subject: [Biopython] error with entrez id code Message-ID: <4E8CE679.5050107@gmail.com> Hi All I've written a program to identify Entrez gene ids from a blastall that I performed. The code is as follows: from Bio import SeqIO from Bio import Entrez import os import os.path import re import csv dirname1="/Users/dally/Desktop/BlastFiles/annotate_me/" dirname2="/Users/dally/Desktop/BlastFiles/annotated/" allfiles=os.listdir(dirname1) fanddir=[os.path.join(dirname1,fname) for fname in allfiles] OutFileName="Contig_annotation.csv" c=csv.writer(open(os.path.join(dirname2,OutFileName),"wb")) for f in fanddir: print f InFile=open(f,'rU') LineNumber=0 for Line in InFile: print LineNumber#, ':', Line ElementList=Line.split('\t') geneid=ElementList[1] #print geneid Sections=geneid.split('|') NewID=Sections[3] from Bio import Entrez from Bio import SeqFeature Entrez.email = "dally at projects.sdsu.edu" handle=Entrez.efetch(db="nucleotide", id=NewID,rettype="gb") # rettype="gb" is GenBank format or XML format retmode="xml" record=SeqIO.read(handle,"genbank") handle.close() #print record.id lineage=record.annotations["taxonomy"] c.writerow([ElementList[0],ElementList[1],ElementList[2],ElementList[3],ElementList[4],ElementList[5],ElementList[6],ElementList[7],ElementList[8], ElementList[9],ElementList[10], NewID, record.id, record.description, record.annotations["source"], lineage[0], lineage[1],lineage[2], record.annotations["keywords"], ]) LineNumber=LineNumber+1 InFile.close() The gene identifier looks like this: gi|2252639|gb|AC002292.1|AC002292. But I"m only interested in the fourth component (AC002292.1)It runs through a file with approximately 8000-10000 identifiers and then extracts information from the associated genbank file. 
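Just to show what that split gives, a quick check at the interpreter using the
example identifier above (this is only a worked example of the split('|') call
in the script):

>>> "gi|2252639|gb|AC002292.1|AC002292".split('|')
['gi', '2252639', 'gb', 'AC002292.1', 'AC002292']

so Sections[3] is the accession 'AC002292.1' that gets passed to Entrez.efetch.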
The code seemed to run fine on my first file for the first 1287 lines but then I got this error > raceback (most recent call last): > File "Ally_EntrezID_Search_Final_Script.py", line 38, in > record=SeqIO.read(handle,"genbank") > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", > line 604, in read > first = iterator.next() > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", > line 532, in parse > for r in i: > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 440, in parse_records > record = self.parse(handle, do_features) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 423, in parse > if self.feed(handle, consumer, do_features): > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 400, in feed > misc_lines, sequence_string = self.parse_footer() > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 921, in parse_footer > line = self.handle.readline() > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", > line 447, in readline > data = self._sock.recv(self._rbufsize) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", > line 533, in read > return self._read_chunked(amt) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", > line 586, in _read_chunked > value.append(self._safe_read(amt)) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", > line 637, in _safe_read > raise IncompleteRead(''.join(s), amt) > httplib.IncompleteRead: IncompleteRead(707 bytes read, 3147 more expected) I'm new to python and biopython programming. So any advice would be extremely appreciated. Thanks. Dilara From p.j.a.cock at googlemail.com Thu Oct 6 07:43:49 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Oct 2011 08:43:49 +0100 Subject: [Biopython] error with entrez id code In-Reply-To: <4E8CE679.5050107@gmail.com> References: <4E8CE679.5050107@gmail.com> Message-ID: On Thursday, October 6, 2011, Dilara Ally wrote: > Hi All > > I've written a program to identify Entrez gene ids from a blastall that I performed. The code is as follows: > > from Bio import SeqIO > from Bio import Entrez > ... 
> > The code seemed to run fine on my first file for the first 1287 lines but then I got this error > >> raceback (most recent call last): >> File "Ally_EntrezID_Search_Final_Script.py", line 38, in >> record=SeqIO.read(handle,"genbank") >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 604, in read >> first = iterator.next() >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 532, in parse >> for r in i: >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 440, in parse_records >> record = self.parse(handle, do_features) >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 423, in parse >> if self.feed(handle, consumer, do_features): >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 400, in feed >> misc_lines, sequence_string = self.parse_footer() >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 921, in parse_footer >> line = self.handle.readline() >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 447, in readline >> data = self._sock.recv(self._rbufsize) >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 533, in read >> return self._read_chunked(amt) >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 586, in _read_chunked >> value.append(self._safe_read(amt)) >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 637, in _safe_read >> raise IncompleteRead(''.join(s), amt) >> httplib.IncompleteRead: IncompleteRead(707 bytes read, 3147 more expected) > > I'm new to python and biopython programming. So any advice would be extremely appreciated. Is it always the same record that breaks? If so, what is the ID so we can try it out. If not, then it looks like a random network error, maybe you can stick a try/except in to refetch the data? Peter From animesh.agrawal at anu.edu.au Thu Oct 6 10:25:08 2011 From: animesh.agrawal at anu.edu.au (Animesh Agrawal) Date: Thu, 06 Oct 2011 21:25:08 +1100 Subject: [Biopython] Using bioPython and bioSQL In-Reply-To: <7770fe573faa2.4e8d81ae@anu.edu.au> References: <76b0a56038ee5.4e8d8031@anu.edu.au> <7660b9533ee66.4e8d806f@anu.edu.au> <75a084db398e1.4e8d80ac@anu.edu.au> <7660d30c39ced.4e8d80ea@anu.edu.au> <77109413397e8.4e8d8127@anu.edu.au> <7660dda43f476.4e8d8165@anu.edu.au> <77c0dae63f99c.4e8d81a2@anu.edu.au> <7770fe573faa2.4e8d81ae@anu.edu.au> Message-ID: <7710edf23d45a.4e8e1cb4@anu.edu.au> Hi All,I am trying to develop a interface for a local sequence depository in my lab. Using biopython cookbook examples I have been able to populate the database. But to query the database I want to create an interface so all other members in my lab can access it. I have no experience in doing this kind of development. I need some advice on best way of doing it and if there are already developed modules in biopython which can help me in attaining my objectives. 
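To give an idea of the kind of lookup I would like to put behind a simple web
page for the lab, here is a rough sketch only - the driver, login details, the
"lab_sequences" namespace and the accession below are just placeholders for
whatever our local install would use:

from BioSQL import BioSeqDatabase

# Placeholder connection details - adjust for the local BioSQL/MySQL install.
server = BioSeqDatabase.open_database(driver="MySQLdb", user="guest",
                                      passwd="", host="localhost",
                                      db="bioseqdb")
db = server["lab_sequences"]            # hypothetical namespace used when loading
record = db.lookup(accession="X12345")  # made-up accession, returns a SeqRecord
print record.id, record.description

Ideally each lab member would type an accession into a web form and get this
kind of information back, without having to touch Python at all.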
Cheers Animesh Animesh Agrawal PhD Scholar The John Curtin School of Medical Research Australian National University Canberra, Australia From p.j.a.cock at googlemail.com Thu Oct 6 10:39:57 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Oct 2011 11:39:57 +0100 Subject: [Biopython] Using bioPython and bioSQL In-Reply-To: <7710edf23d45a.4e8e1cb4@anu.edu.au> References: <76b0a56038ee5.4e8d8031@anu.edu.au> <7660b9533ee66.4e8d806f@anu.edu.au> <75a084db398e1.4e8d80ac@anu.edu.au> <7660d30c39ced.4e8d80ea@anu.edu.au> <77109413397e8.4e8d8127@anu.edu.au> <7660dda43f476.4e8d8165@anu.edu.au> <77c0dae63f99c.4e8d81a2@anu.edu.au> <7770fe573faa2.4e8d81ae@anu.edu.au> <7710edf23d45a.4e8e1cb4@anu.edu.au> Message-ID: On Thu, Oct 6, 2011 at 11:25 AM, Animesh Agrawal wrote: > Hi All,I am trying to develop a interface for a local sequence depository > in my lab. Using biopython cookbook examples I have been able to > populate the database. But to query the database I want to create an > interface so all other members in my lab can access it. I have no > experience in doing this kind of development. I need some advice > on best way of doing it and if there are already developed modules > in biopython which can help me in attaining my objectives. > Cheers > Animesh Hi Animesh, Do you mean some kind of web interface? Would you just need this to be read only? You can use GBrowse with BioSQL, but I believe CHADO is better supported as the schema. CHADO is also a better choice if you want users to be able to edit the annotation. http://gmod.org/wiki/Chado_-_Getting_Started Peter From sdavis2 at mail.nih.gov Thu Oct 6 10:51:20 2011 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 6 Oct 2011 06:51:20 -0400 Subject: [Biopython] Using bioPython and bioSQL In-Reply-To: <7710edf23d45a.4e8e1cb4@anu.edu.au> References: <76b0a56038ee5.4e8d8031@anu.edu.au> <7660b9533ee66.4e8d806f@anu.edu.au> <75a084db398e1.4e8d80ac@anu.edu.au> <7660d30c39ced.4e8d80ea@anu.edu.au> <77109413397e8.4e8d8127@anu.edu.au> <7660dda43f476.4e8d8165@anu.edu.au> <77c0dae63f99c.4e8d81a2@anu.edu.au> <7770fe573faa2.4e8d81ae@anu.edu.au> <7710edf23d45a.4e8e1cb4@anu.edu.au> Message-ID: Hi, Animesh. How do you want folks to query the database? Web? Command-line? Are the queries limited in scope or do you want to provide something fully general? Sean On Thu, Oct 6, 2011 at 6:25 AM, Animesh Agrawal wrote: > Hi All,I am trying to develop a interface for a local sequence depository in my lab. Using biopython cookbook examples I have been able to populate the database. But to query the database I want to create an interface so all other members in my lab can access it. I have no experience in doing this kind of development. I need some advice on best way of doing it and if there are already developed modules in biopython which can help me in attaining my objectives. > Cheers > Animesh > Animesh Agrawal > PhD Scholar > The John Curtin School of Medical Research > Australian National University > Canberra, Australia > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From elisa.sechi85 at hotmail.it Thu Oct 6 10:43:25 2011 From: elisa.sechi85 at hotmail.it (Elisa sechi) Date: Thu, 6 Oct 2011 12:43:25 +0200 Subject: [Biopython] help for overwrite a pdb file In-Reply-To: References: Message-ID: Hi! All ! I'm contacting you in order to ask help about Biopython. 
I'm using Python, I have extracted the atom coordinates of a protein from a
PDB file and I have used a quaternion in order to rotate the coordinates.
I have put them in a new matrix but now the problem is: how do I save the
Cartesian coordinates in a PDB file? Do I have to create a new structure
with the StructureBuilder class?
I ask you if there is a way to overwrite the new Cartesian coordinates in
the old PDB file that I have used.
Please help me!!!
Thank you very much!
Elisa
bye

From anaryin at gmail.com  Thu Oct 6 11:01:28 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 6 Oct 2011 13:01:28 +0200
Subject: [Biopython] help for overwrite a pdb file
In-Reply-To:
References:
Message-ID:

Hello Elisa,

You should use PDBIO to generate a new structure file. If you have already
transformed the coordinates, it's pretty simple:

from Bio.PDB import PDBIO

io = PDBIO()
io.set_structure(your_structure)
io.save('new_structure.pdb')

Cheers,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao

2011/10/6 Elisa sechi

>
>
>
>
>
>
>
>
>
>
>
> Hi! All !
> I'm contacting you in order to ask help about Biopython.
> I'm using Python, I have extracted the atom coordinates of a protein from a
> PDB file and I have used a quaternion in order to rotate the coordinates.
> I have put them in a new matrix but now the problem is: how do I save the
> Cartesian coordinates in a PDB file? Do I have to create a new structure
> with the StructureBuilder class?
> I ask you if there is a way to overwrite the new Cartesian coordinates in
> the old PDB file that I have used.
> Please help me!!!
> Thank you very much!
> Elisa
> bye
>
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From p.j.a.cock at googlemail.com  Thu Oct 6 11:02:57 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Oct 2011 12:02:57 +0100
Subject: [Biopython] help for overwrite a pdb file
In-Reply-To:
References:
Message-ID:

On Thu, Oct 6, 2011 at 11:43 AM, Elisa sechi wrote:
>
> Hi! All !
> I'm contacting you in order to ask help about Biopython.
> I'm using Python, I have extracted the atom coordinates of a protein from a
> PDB file and I have used a quaternion in order to rotate the coordinates.
> I have put them in a new matrix but now the problem is: how do I save the
> Cartesian coordinates in a PDB file? Do I have to create a new structure
> with the StructureBuilder class?
> I ask you if there is a way to overwrite the new Cartesian coordinates in
> the old PDB file that I have used.
> Please help me!!!
> Thank you very much!
> Elisa
> bye

There's an example here which rotates models in a PDB file and saves the
output:

http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/

It is not using quaternions for the rotation, but otherwise it should be
helpful.
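If you already have the rotation as a 3x3 numpy array (e.g. converted from
your quaternion), an untested sketch along these lines should do the
transform-and-save part - the file names and the identity matrix below are
just placeholders:

import numpy
from Bio.PDB import PDBParser, PDBIO

parser = PDBParser()
structure = parser.get_structure("prot", "input.pdb")  # placeholder file name

rotation = numpy.identity(3)   # put your quaternion-derived 3x3 matrix here
translation = numpy.zeros(3)   # no shift

for atom in structure.get_atoms():
    # Atom.transform does coord = dot(coord, rotation) + translation, so
    # depending on your convention you may need the transpose of the matrix.
    atom.transform(rotation, translation)

io = PDBIO()
io.set_structure(structure)
io.save("rotated.pdb")  # placeholder output name

If you have already computed the rotated coordinates yourself, you can
instead just assign atom.coord for each atom and then save with PDBIO in
the same way.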
Peter From animesh.agrawal at anu.edu.au Thu Oct 6 11:23:39 2011 From: animesh.agrawal at anu.edu.au (Animesh Agrawal) Date: Thu, 06 Oct 2011 22:23:39 +1100 Subject: [Biopython] Using bioPython and bioSQL In-Reply-To: <77109ef23fc49.4e8d8f9e@anu.edu.au> References: <76b0a56038ee5.4e8d8031@anu.edu.au> <7660b9533ee66.4e8d806f@anu.edu.au> <75a084db398e1.4e8d80ac@anu.edu.au> <7660d30c39ced.4e8d80ea@anu.edu.au> <77109413397e8.4e8d8127@anu.edu.au> <7660dda43f476.4e8d8165@anu.edu.au> <77c0dae63f99c.4e8d81a2@anu.edu.au> <7770fe573faa2.4e8d81ae@anu.edu.au> <7710edf23d45a.4e8e1cb4@anu.edu.au> <77c0838039ccd.4e8d8edc@anu.edu.au> <7660c1093accc.4e8d8f1a@anu.edu.au> <7710e8403ab11.4e8d8f58@anu.edu.au> <77b0ce493f67b.4e8d8f61@anu.edu.au> <77109ef23fc49.4e8d8f9e@anu.edu.au> Message-ID: <7710e50538fb7.4e8e2a6b@anu.edu.au> Hi Peter,Thanks a lot for your reply.Yes I want web interface and I need it to be read only. I'll check out GBrowse and CHADO. Cheers, Animesh On 10/06/11, Peter Cock wrote: > On Thu, Oct 6, 2011 at 11:25 AM, Animesh Agrawal > wrote: > > Hi All,I am trying to develop a interface for a local sequence depository > > in my lab. Using biopython cookbook examples I have been able to > > populate the database. But to query the database I want to create an > > interface so all other members in my lab can access it. I have no > > experience in doing this kind of development. I need some advice > > on best way of doing it and if there are already developed modules > > in biopython which can help me in attaining my objectives. > > Cheers > > Animesh > > Hi Animesh, > > Do you mean some kind of web interface? Would you just need > this to be read only? > > You can use GBrowse with BioSQL, but I believe CHADO is better > supported as the schema. CHADO is also a better choice if you > want users to be able to edit the annotation. > http://gmod.org/wiki/Chado_-_Getting_Started > > Peter > > From animesh.agrawal at anu.edu.au Thu Oct 6 11:27:51 2011 From: animesh.agrawal at anu.edu.au (Animesh Agrawal) Date: Thu, 06 Oct 2011 22:27:51 +1100 Subject: [Biopython] Using bioPython and bioSQL In-Reply-To: <7680a8613e5c9.4e8d9094@anu.edu.au> References: <76b0a56038ee5.4e8d8031@anu.edu.au> <7660b9533ee66.4e8d806f@anu.edu.au> <75a084db398e1.4e8d80ac@anu.edu.au> <7660d30c39ced.4e8d80ea@anu.edu.au> <77109413397e8.4e8d8127@anu.edu.au> <7660dda43f476.4e8d8165@anu.edu.au> <77c0dae63f99c.4e8d81a2@anu.edu.au> <7770fe573faa2.4e8d81ae@anu.edu.au> <7710edf23d45a.4e8e1cb4@anu.edu.au> <77b080583927a.4e8d9019@anu.edu.au> <76e0a1b23d252.4e8d9057@anu.edu.au> <7680a8613e5c9.4e8d9094@anu.edu.au> Message-ID: <7660a5e03929b.4e8e2b67@anu.edu.au> Hi Sean,I definitely want a web interface. Queries should be limited in scope. Cheers, Animesh On 10/06/11, Sean Davis wrote: > Hi, Animesh. > > How do you want folks to query the database?? Web?? Command-line?? Are > the queries limited in scope or do you want to provide something fully > general? > > Sean > > On Thu, Oct 6, 2011 at 6:25 AM, Animesh Agrawal > wrote: > > Hi All,I am trying to develop a interface for a local sequence depository in my lab. Using biopython cookbook examples I have been able to populate the database. But to query the database I want to create an interface so all other members in my lab can access it. I have no experience in doing this kind of development. I need some advice on best way of doing it and if there are already developed modules in biopython which can help me in attaining my objectives. 
> > Cheers > > Animesh > > Animesh Agrawal > > PhD Scholar > > The John Curtin School of Medical Research > > Australian National University > > Canberra, Australia > > _______________________________________________ > > Biopython mailing list ?- ?Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > From sdavis2 at mail.nih.gov Thu Oct 6 11:50:07 2011 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 6 Oct 2011 07:50:07 -0400 Subject: [Biopython] Using bioPython and bioSQL In-Reply-To: <7660a5e03929b.4e8e2b67@anu.edu.au> References: <76b0a56038ee5.4e8d8031@anu.edu.au> <7660b9533ee66.4e8d806f@anu.edu.au> <75a084db398e1.4e8d80ac@anu.edu.au> <7660d30c39ced.4e8d80ea@anu.edu.au> <77109413397e8.4e8d8127@anu.edu.au> <7660dda43f476.4e8d8165@anu.edu.au> <77c0dae63f99c.4e8d81a2@anu.edu.au> <7770fe573faa2.4e8d81ae@anu.edu.au> <7710edf23d45a.4e8e1cb4@anu.edu.au> <77b080583927a.4e8d9019@anu.edu.au> <76e0a1b23d252.4e8d9057@anu.edu.au> <7680a8613e5c9.4e8d9094@anu.edu.au> <7660a5e03929b.4e8e2b67@anu.edu.au> Message-ID: Hi, Animesh. Depending on the types of queries, building small CGI scripts or even a small web application can be quite useful. Most recently, I have been using the flask micro-framework ( http://flask.pocoo.org/ ) for building such small applications. If you can figure out how to do the queries that you want with biopython or SQL, then it isn't too hard to translate that to a couple of web pages, one for gathering input from the user and a second for delivering results. Sean On Thu, Oct 6, 2011 at 7:27 AM, Animesh Agrawal wrote: > Hi Sean,I definitely want a web interface. Queries should be limited in scope. > Cheers, > Animesh > > On 10/06/11, Sean Davis ? wrote: >> Hi, Animesh. >> >> How do you want folks to query the database?? Web?? Command-line?? Are >> the queries limited in scope or do you want to provide something fully >> general? >> >> Sean >> >> On Thu, Oct 6, 2011 at 6:25 AM, Animesh Agrawal >> wrote: >> > Hi All,I am trying to develop a interface for a local sequence depository in my lab. Using biopython cookbook examples I have been able to populate the database. But to query the database I want to create an interface so all other members in my lab can access it. I have no experience in doing this kind of development. I need some advice on best way of doing it and if there are already developed modules in biopython which can help me in attaining my objectives. >> > Cheers >> > Animesh >> > Animesh Agrawal >> > PhD Scholar >> > The John Curtin School of Medical Research >> > Australian National University >> > Canberra, Australia >> > _______________________________________________ >> > Biopython mailing list ?- ?Biopython at lists.open-bio.org >> > http://lists.open-bio.org/mailman/listinfo/biopython >> > >> >> > > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From tiagoantao at gmail.com Thu Oct 6 20:14:56 2011 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 6 Oct 2011 21:14:56 +0100 Subject: [Biopython] UniprotXML dbReference parser Message-ID: Hi, Do I understand wrongly or the UniprotXML parser for simply ignores the "property type" information? If so, is there any way to get access to the XML raw data (so that I can grep it)? Thanks a lot, Tiago -- "If you want to get laid, go to college.? If you want an education, go to the library." 
- Frank Zappa

From p.j.a.cock at googlemail.com  Thu Oct 6 22:26:19 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Oct 2011 23:26:19 +0100
Subject: [Biopython] UniprotXML dbReference parser
In-Reply-To:
References:
Message-ID:

2011/10/6 Tiago Antão :
> Hi,
>
> Do I understand wrongly or the UniprotXML parser for
>
>
>
>
>
> simply ignores the "property type" information?

Probably... I think it emulates the very simple list of
db:acc strings produced by the GenBank parser etc,
but try dir(...) on it. Although PDB references look
to get part of their information dumped in the
record's annotations dictionary.

I guess we could return a list of DB reference objects
which happen to act like the old style string for back
compatibility.

> If so, is there any way to get access to the XML raw data
> (so that I can grep it)?

Are you asking for XML parsing library recommendations?
Or you could hack the SeqIO parser instead... I've CC'd
Andrea who wrote it in case he can add something
more practical.

Peter

From tiagoantao at gmail.com  Thu Oct 6 22:43:01 2011
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 6 Oct 2011 23:43:01 +0100
Subject: [Biopython] UniprotXML dbReference parser
In-Reply-To:
References:
Message-ID:

Hi,

2011/10/6 Peter Cock :
> Probably... I think it emulates the very simple list of
> db:acc strings produced by the GenBank parser etc,
> but try dir(...) on it. Although PDB references look
> to get part of their information dumped in the
> record's annotations dictionary.

The problem is that the Gene ID is inside (thus it never
gets returned). We get the protein ID only.

> Are you asking for XML parsing library recommendations?
> Or you could hack the SeqIO parser instead... I've CC'd
> Andrea who wrote it in case he can add something
> more practical.

I just used xml.parsers.expat. Not a problem for myself, but the fact
is that the UniProt XML parser does not return the whole information
that is there.

--
"If you want to get laid, go to college. If you want an education, go
to the library."
- Frank Zappa

From p.j.a.cock at googlemail.com  Fri Oct 7 07:22:49 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 7 Oct 2011 08:22:49 +0100
Subject: [Biopython] changing record attributes while iterating
In-Reply-To:
References:
Message-ID:

On Friday, October 7, 2011, Michal wrote:
> Hello,
> Does your code with generator save the whole file in the
> memory or does it read each entry and save it immediately?
> Thank you in advance.

Using a generator expression like that, only one SeqRecord is
in memory at a time. It goes through the input FASTA one
record at a time, renames it, saves it immediately.

Peter

P.S. list CC'd

From dilara.ally at gmail.com  Fri Oct 7 17:34:24 2011
From: dilara.ally at gmail.com (Dilara Ally)
Date: Fri, 07 Oct 2011 10:34:24 -0700
Subject: [Biopython] error with entrez id code
In-Reply-To:
References: <4E8CE679.5050107@gmail.com>
Message-ID: <4E8F3820.1030002@gmail.com>

> Is it always the same record that breaks? If so, what is the ID so we
> can try it out.
>
> If not, then it looks like a random network error, maybe you can stick
> a try/except in to refetch the data?

Hi Peter

Individually the identifier has no problem calling up the record, but the
problem seems to be in the loop. As a newbie, what is a try/except?

Thanks.
Dilara On 10/6/11 12:43 AM, Peter Cock wrote: > > > On Thursday, October 6, 2011, Dilara Ally > wrote: > > Hi All > > > > I've written a program to identify Entrez gene ids from a blastall > that I performed. The code is as follows: > > > > from Bio import SeqIO > > from Bio import Entrez > > ... > > > > The code seemed to run fine on my first file for the first 1287 > lines but then I got this error > > > >> raceback (most recent call last): > >> File "Ally_EntrezID_Search_Final_Script.py", line 38, in > >> record=SeqIO.read(handle,"genbank") > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", > line 604, in read > >> first = iterator.next() > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", > line 532, in parse > >> for r in i: > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 440, in parse_records > >> record = self.parse(handle, do_features) > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 423, in parse > >> if self.feed(handle, consumer, do_features): > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 400, in feed > >> misc_lines, sequence_string = self.parse_footer() > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 921, in parse_footer > >> line = self.handle.readline() > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", > line 447, in readline > >> data = self._sock.recv(self._rbufsize) > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", > line 533, in read > >> return self._read_chunked(amt) > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", > line 586, in _read_chunked > >> value.append(self._safe_read(amt)) > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", > line 637, in _safe_read > >> raise IncompleteRead(''.join(s), amt) > >> httplib.IncompleteRead: IncompleteRead(707 bytes read, 3147 more > expected) > > > > I'm new to python and biopython programming. So any advice would be > extremely appreciated. > > > Peter From p.j.a.cock at googlemail.com Sat Oct 8 14:10:12 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 8 Oct 2011 15:10:12 +0100 Subject: [Biopython] error with entrez id code In-Reply-To: <4E8F3820.1030002@gmail.com> References: <4E8CE679.5050107@gmail.com> <4E8F3820.1030002@gmail.com> Message-ID: On Fri, Oct 7, 2011 at 6:34 PM, Dilara Ally wrote: > Is it always the same record that breaks? If so, what is the ID so we can > try it out. > > If not, then it looks like a random network error, maybe you can stick a > try/except in to refetch the data? > > Hi Peter > > Individually the identifier has no problem calling up the record, but the > problem seems to be in the loop.? As a newbie, what is a try/except? > > Thanks. By try/except I mean use Python's error handling mechanism to spot when there is a network error. See: http://docs.python.org/tutorial/errors.html e.g. Something like this would give you a second chance. 
Note that exception httplib.IncompleteRead is a subclass of the more general HTTPException, see: http://docs.python.org/library/httplib.html from httplib import HTTPException try: handle=Entrez.efetch(db="nucleotide", id=NewID,rettype="gb") # rettype="gb" is GenBank format or XML format retmode="xml" record=SeqIO.read(handle,"genbank") handle.close() except HTTPException, e: print "Network problem: %s" % e print "Second (and final) attempt..." handle=Entrez.efetch(db="nucleotide", id=NewID,rettype="gb") # rettype="gb" is GenBank format or XML format retmode="xml" record=SeqIO.read(handle,"genbank") handle.close() If the second attempt fails, you'll get an exception like before. There are more elegant ways to write that (with less repetition, and making multiple retries easy), but I'm trying to keep this simple as an introductory example. Peter From chaouki.amir at gmail.com Sun Oct 9 19:37:42 2011 From: chaouki.amir at gmail.com (amir chaouki) Date: Sun, 9 Oct 2011 20:37:42 +0100 Subject: [Biopython] clustal header Message-ID: Hi, i want to to do a multiple sequence alignment with the clustalw method but i keep getting this error: ", ".join(known_headers))) ValueError: a is not a known CLUSTAL header: CLUSTAL, PROBCONS, MUSCLE my sequence file contains this > as headers for every sequence name, so what are the compatible headers? -- *Amir Chaouki* From p.j.a.cock at googlemail.com Sun Oct 9 20:09:00 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 9 Oct 2011 21:09:00 +0100 Subject: [Biopython] clustal header In-Reply-To: References: Message-ID: On Sunday, October 9, 2011, amir chaouki wrote: > Hi, > i want to to do a multiple sequence alignment with the clustalw method but i > keep getting this error: ", ".join(known_headers))) > ValueError: a is not a known CLUSTAL header: CLUSTAL, PROBCONS, MUSCLE > > my sequence file contains this > as headers for every sequence name, so > what are the compatible headers? Hi Amir, That error message can come from trying to parse a non-clustal file as if it were a clustal file. Perhaps you tried to parse a fasta file? If you showed the code that caused this message, it would be easier to help you, Peter From sdavis2 at mail.nih.gov Wed Oct 12 18:54:13 2011 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Wed, 12 Oct 2011 14:54:13 -0400 Subject: [Biopython] [OT][Job] Functional genomic analysis of cancer/RNAi screening Message-ID: Functional genomic analysis of cancer/RNAi screening NATIONAL CANCER INSTITUTE, BETHESDA, MD The laboratory of Dr. Natasha Caplen, within the Genetics Branch, CCR, NCI, is seeking postdoctoral candidates for a project focused on functional genomic analysis using RNAi screening approaches. We are looking for a highly motivated candidate who has received their PhD within the last year to contribute to our on-going studies applying RNAi based loss-of-function approaches to probe cancer gene function. The successful candidate will be expected to perform both bench and computational-based studies and will be involved in projects requiring the development and analysis of large-scale RNAi screening data focused on the biology of oncogenic transcription factors. The candidate will be involved in the design and employment of RNAi screens (up to genome-wide scale) and analysis of the data generated through application of state of the art computational methodologies. 
This large-scale RNAi screening data will also be assessed in the context of other relevant datasets such as next generation sequencing, epigenetic, gene expression and drug sensitivity datasets. The computational analyses will ultimately be used to systematically build hypotheses to identify key pathways and networks underlying the specifics of the cancer biology and the candidate will then be expected to experimentally test these hypotheses. Dr. Caplen?s laboratory conducts both independent and collaborative studies and the successful candidate will have the opportunity to interact with NCI and NIH investigators studying many different cancer biology questions using RNAi based technologies. Currently we are involved in RNAi studies relevant to the biology and treatment of several pediatric cancers, colorectal, breast and prostate cancer. For further information please see Dr. Caplen?s website at http://ccr.cancer.gov/staff/staff.asp?profileid=9035. Requirements: The candidate must have a Ph.D in biological sciences with additional training in computational biology or bioinformatics. Previous experience in molecular biology including mammalian cell culture and assessment of gene expression is required, as, too, is experience in programming skills in languages such as perl, python, R, java, or c++. As the position involves the need to discuss scientific data and strategy with members of the existing team and with collaborators, oral and written fluency in the English language is required. Applicants should email a cover letter describing research experience and interests, curriculum vitae, bibliography, and contact information for three references (including the current supervisor) to Dr. Natasha Caplen at ncaplen at mail.nih.gov. Please include ?PD2011? in the email subject line. From paul at tonair.de Thu Oct 13 10:26:54 2011 From: paul at tonair.de (paul at tonair.de) Date: Thu, 13 Oct 2011 12:26:54 +0200 Subject: [Biopython] Bio.PDB question - from PQR to an "extended" PDB Message-ID: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> dear biopython users, i'm trying to read in a pqr file with the Bio.PDB module. In a PQR file, the atom charge and atom radius are stored instead of the occupancy & B-factor. Apparently, the negative charge values make trouble while reading in. (1) Is there a way to tweak Bio.PDB module to read in a PQR file? More to the background of this task: I would like to keep the charge and the radius in order to output a PDB file with more than 80 lines. The pdb-like output looks like this: ATOM 1 C1 UNL _0001_000 9.643 1.777 18.433 1.700 0.000 BK____M000 The text "BK____M000" refers to a conformer of a side chain and is needed by a PoissonBoltzmann named mcce (multi-conformation continuum electrostatics). (2) Can Bio.PDB generate such an output file? Cheers & Thanks, Paul From p.j.a.cock at googlemail.com Thu Oct 13 10:40:14 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Oct 2011 11:40:14 +0100 Subject: [Biopython] Bio.PDB question - from PQR to an "extended" PDB In-Reply-To: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> References: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> Message-ID: On Thu, Oct 13, 2011 at 11:26 AM, wrote: > > dear biopython users, > > i'm trying to read in a pqr file with the > Bio.PDB module. In a PQR file, the atom charge and atom radius are > stored instead of the occupancy & B-factor. > Apparently, the negative > charge values make trouble while reading in. 
> > (1) Is there a way to > tweak Bio.PDB module to read in a PQR file? If a negative B-factor was the only issue, probably yes. > More to the background of > this task: I would like to keep the charge and the radius in order to > output a PDB file with more than 80 lines. You mean more than 80 columns? i.e. Longer than PDB norms? > The pdb-like output looks > like this: > ATOM 1 C1 UNL _0001_000 9.643 1.777 18.433 1.700 0.000 > BK____M000 > The text "BK____M000" refers to a conformer of a side chain > and is needed by a PoissonBoltzmann named mcce (multi-conformation > continuum electrostatics). > > (2) Can Bio.PDB generate such an output > file? Not yet ;) > Cheers & Thanks, > Paul It would help if you could share some sample data (URLs) and links to this PDB-like PQR file format's specification (assuming it has one). Regards, Peter From anaryin at gmail.com Thu Oct 13 10:43:06 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 13 Oct 2011 12:43:06 +0200 Subject: [Biopython] Bio.PDB question - from PQR to an "extended" PDB In-Reply-To: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> References: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> Message-ID: Hello Paul, Straight from Pymol :) Bio.PDB cannot read PQR files as is, but since the format is quite similar to the PDB it should be easy to convert. The first step is to know if you want to develop a converter too (you will need the forcefield atomic charges and radius for that) or just a "parser". Parsing is easy, it's a matter of adapting the current SMCRA objects and PDBParser. Converting requires much more and is probably superfluous given the PDB2PQR software. Some important information on the format: http://www.poissonboltzmann.org/file-formats/biomolecular-structurw/pqr I think the best course of action is to add a PQRParser class that has different residue properties than the regular PDB. For example, occupancy and bfactor are not used at all.. Let me know what you think, Cheers, Jo?o From paul at tonair.de Thu Oct 13 11:51:42 2011 From: paul at tonair.de (paul at tonair.de) Date: Thu, 13 Oct 2011 13:51:42 +0200 Subject: [Biopython] Bio.PDB question - from PQR to an "extended" PDB In-Reply-To: References: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> Message-ID: Dear all, a PQR functionality within biopython would be great! Regarding the output of extended PDB files I would like to write: There is no detailed description on such files: http://www.sci.ccny.cuny.edu/~mcce/doc/running_mcce2.php [1] see chapter 3.2.4: step2_out.pdb: input structure file of step 3 in mcce extended pdb format extended means: the conformer is added beyond the element located somewhere around column 80. Is there any workaround with the currect biopython release to read in PQR and dump out such an extended PDB file? Cheers & thanks, Paul On Thu, 13 Oct 2011 12:48:22 +0200, Mikael Trellet wrote: This PQRParser class would be a nice add to Bio.PDB indeed, and shouldn't take a very long time to develop. Could work on it with you Joao, if the need exists obviously. Regards, Mikael On Thu, Oct 13, 2011 at 12:43 PM, Jo?o Rodrigues wrote: Hello Paul, Straight from Pymol :) Bio.PDB cannot read PQR files as is, but since the format is quite similar to the PDB it should be easy to convert. The first step is to know if you want to develop a converter too (you will need the forcefield atomic charges and radius for that) or just a "parser". Parsing is easy, it's a matter of adapting the current SMCRA objects and PDBParser. 
Converting requires much more and is probably superfluous given the PDB2PQR software. Some important information on the format: http://www.poissonboltzmann.org/file-formats/biomolecular-structurw/pqr [3] I think the best course of action is to add a PQRParser class that has different residue properties than the regular PDB. For example, occupancy and bfactor are not used at all.. Let me know what you think, Cheers, Jo?o _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org [4] http://lists.open-bio.org/mailman/listinfo/biopython [5] -- Mikael TRELLET, Computational structural biology group, Utrecht University Bijvoet Center, The Netherlands Links: ------ [1] http://www.sci.ccny.cuny.edu/~mcce/doc/running_mcce2.php [2] mailto:anaryin at gmail.com [3] http://www.poissonboltzmann.org/file-formats/biomolecular-structurw/pqr [4] mailto:Biopython at lists.open-bio.org [5] http://lists.open-bio.org/mailman/listinfo/biopython From anaryin at gmail.com Thu Oct 13 12:27:54 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 13 Oct 2011 14:27:54 +0200 Subject: [Biopython] Bio.PDB question - from PQR to an "extended" PDB In-Reply-To: References: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> Message-ID: Dear Paul, You would have to do two things: 1. First, modify PDBParser so that it reads more characters in the occupancy and bfactor fields 2. Modify PDBIO so that it is able to output a field beyond the element OR just create your own function to print information of a residue and use it instead of PDBIO. How do you get the conformer information? From paul at tonair.de Fri Oct 14 12:00:04 2011 From: paul at tonair.de (paul at tonair.de) Date: Fri, 14 Oct 2011 14:00:04 +0200 Subject: [Biopython] ligand PDB files Message-ID: Dear all, I'm having trouble to read in the attached PDB file - this is my code: " from Bio.PDB import * parser=PDBParser() structure=parser.get_structure("PHA-L","./2w26_lig.pdb") for model in structure: for chain in model: for residue in chain: for atom in residue: print atom " which gives this error: " File "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/PDBParser.py", line 66, in get_structure self._parse(file.readlines()) File "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/PDBParser.py", line 89, in _parse self.trailer=self._parse_coordinates(coords_trailer) File "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/PDBParser.py", line 205, in _parse_coordinates fullname, serial_number, element) File "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/StructureBuilder.py", line 197, in init_atom fullname, serial_number, element) File "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/Atom.py", line 68, in __init__ assert not element or element == element.upper(), element AssertionError: Cl " Does this mean that the PDB parser only recognizes "amino acid-atoms", i.e. a chlorine does not work? 
Cheers & Thanks, Paul -------------- next part -------------- COMPND 2w26_LIG.pdb_0 AUTHOR GENERATED BY OPEN BABEL 2.3.0 ATOM 1 C1 RIV A 1 9.643 1.777 18.433 1.00 0.00 C ATOM 2 N1 RIV A 1 8.303 2.377 18.109 1.00 0.00 N ATOM 3 C2 RIV A 1 10.053 0.667 17.441 1.00 0.00 C ATOM 4 C3 RIV A 1 7.671 2.122 16.881 1.00 0.00 C ATOM 5 O1 RIV A 1 9.768 1.124 16.111 1.00 0.00 O ATOM 6 C4 RIV A 1 8.355 1.223 15.853 1.00 0.00 C ATOM 7 C5 RIV A 1 6.487 4.959 20.981 1.00 0.00 C ATOM 8 C6 RIV A 1 7.333 5.468 19.984 1.00 0.00 C ATOM 9 C7 RIV A 1 6.237 3.551 21.013 1.00 0.00 C ATOM 10 C8 RIV A 1 7.918 4.619 19.048 1.00 0.00 C ATOM 11 C9 RIV A 1 6.837 2.690 20.070 1.00 0.00 C ATOM 12 C10 RIV A 1 7.682 3.222 19.078 1.00 0.00 C ATOM 13 O2 RIV A 1 6.583 2.613 16.630 1.00 0.00 O ATOM 14 N2 RIV A 1 5.906 5.863 21.947 1.00 0.00 N ATOM 15 C11 RIV A 1 5.040 5.543 22.995 1.00 0.00 C ATOM 16 C12 RIV A 1 6.146 7.326 22.000 1.00 0.00 C ATOM 17 O3 RIV A 1 4.690 6.614 23.757 1.00 0.00 O ATOM 18 C13 RIV A 1 5.213 7.787 23.134 1.00 0.00 C ATOM 19 O4 RIV A 1 4.634 4.419 23.228 1.00 0.00 O ATOM 20 C14 RIV A 1 5.924 8.721 24.155 1.00 0.00 C ATOM 21 N3 RIV A 1 7.078 8.136 24.932 1.00 0.00 N ATOM 22 C15 RIV A 1 8.402 8.558 24.672 1.00 0.00 C ATOM 23 S1 RIV A 1 11.131 8.264 25.063 1.00 0.00 S ATOM 24 C16 RIV A 1 11.805 7.503 26.288 1.00 0.00 C ATOM 25 C17 RIV A 1 9.567 8.044 25.466 1.00 0.00 C ATOM 26 C18 RIV A 1 10.794 7.011 27.130 1.00 0.00 C ATOM 27 C19 RIV A 1 9.509 7.324 26.659 1.00 0.00 C ATOM 28 O5 RIV A 1 8.611 9.379 23.797 1.00 0.00 O ATOM 29 Cl1 RIV A 1 13.544 7.302 26.531 1.00 0.00 Cl ATOM 30 H RIV A 1 9.643 1.777 18.433 1.00 0.00 H ATOM 31 H RIV A 1 9.643 1.777 18.433 1.00 0.00 H ATOM 32 H RIV A 1 10.053 0.667 17.441 1.00 0.00 H ATOM 33 H RIV A 1 10.053 0.667 17.441 1.00 0.00 H ATOM 34 H RIV A 1 8.355 1.223 15.853 1.00 0.00 H ATOM 35 H RIV A 1 8.355 1.223 15.853 1.00 0.00 H ATOM 36 H RIV A 1 7.333 5.468 19.984 1.00 0.00 H ATOM 37 H RIV A 1 6.237 3.551 21.013 1.00 0.00 H ATOM 38 H RIV A 1 7.918 4.619 19.048 1.00 0.00 H ATOM 39 H RIV A 1 6.837 2.690 20.070 1.00 0.00 H ATOM 40 H RIV A 1 6.146 7.326 22.000 1.00 0.00 H ATOM 41 H RIV A 1 6.146 7.326 22.000 1.00 0.00 H ATOM 42 H RIV A 1 5.213 7.787 23.134 1.00 0.00 H ATOM 43 H RIV A 1 5.924 8.721 24.155 1.00 0.00 H ATOM 44 H RIV A 1 5.924 8.721 24.155 1.00 0.00 H ATOM 45 H RIV A 1 7.078 8.136 24.932 1.00 0.00 H ATOM 46 H RIV A 1 10.794 7.011 27.130 1.00 0.00 H ATOM 47 H RIV A 1 9.509 7.324 26.659 1.00 0.00 H CONECT 1 3 2 30 31 CONECT 1 CONECT 2 4 1 12 CONECT 3 5 1 32 33 CONECT 3 CONECT 4 6 13 2 CONECT 5 6 3 CONECT 6 5 4 34 35 CONECT 6 CONECT 7 8 9 14 CONECT 8 10 7 36 CONECT 9 11 7 37 CONECT 10 12 8 38 CONECT 11 12 9 39 CONECT 12 2 10 11 CONECT 13 4 CONECT 14 7 16 15 CONECT 15 14 19 17 CONECT 16 14 18 40 41 CONECT 16 CONECT 17 15 18 CONECT 18 16 17 20 42 CONECT 18 CONECT 19 15 CONECT 20 18 21 43 44 CONECT 20 CONECT 21 20 22 45 CONECT 22 28 21 25 CONECT 23 25 24 CONECT 24 23 29 26 CONECT 25 22 23 27 CONECT 26 24 27 46 CONECT 27 25 26 47 CONECT 28 22 CONECT 29 24 CONECT 30 1 CONECT 31 1 CONECT 32 3 CONECT 33 3 CONECT 34 6 CONECT 35 6 CONECT 36 8 CONECT 37 9 CONECT 38 10 CONECT 39 11 CONECT 40 16 CONECT 41 16 CONECT 42 18 CONECT 43 20 CONECT 44 20 CONECT 45 21 CONECT 46 26 CONECT 47 27 MASTER 0 0 0 0 0 0 0 0 47 0 47 0 END From robert.campbell at queensu.ca Fri Oct 14 13:04:22 2011 From: robert.campbell at queensu.ca (Robert Campbell) Date: Fri, 14 Oct 2011 09:04:22 -0400 Subject: [Biopython] ligand PDB files In-Reply-To: References: Message-ID: 
<20111014090422.639e9284@adelie.biochem.queensu.ca> Dear Paul, On Fri, 2011-10-14 14:00 EDT, paul at tonair.de wrote: > Dear all, > I'm having trouble to read in the attached PDB file - this > is my code: Your code is okay. The problem is in your PDB file: > File > "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/Atom.py", line > 68, in __init__ > assert not element or element == element.upper(), > element > AssertionError: Cl > " > Does this mean that the PDB parser only > recognizes "amino acid-atoms", i.e. a chlorine does not work? The chlorine atoms should be "CL" not "Cl" in a proper PDB file. Cheers, Rob -- Robert L. Campbell, Ph.D. Senior Research Associate/Adjunct Assistant Professor Dept. of Biomedical & Molecular Sciences, Botterell Hall Rm 644 Queen's University, Kingston, ON K7L 3N6 Canada Tel: 613-533-6821 http://pldserver1.biochem.queensu.ca/~rlc From paul at tonair.de Fri Oct 14 13:51:47 2011 From: paul at tonair.de (paul at tonair.de) Date: Fri, 14 Oct 2011 15:51:47 +0200 Subject: [Biopython] ligand PDB files In-Reply-To: <20111014090422.639e9284@adelie.biochem.queensu.ca> References: <20111014090422.639e9284@adelie.biochem.queensu.ca> Message-ID: <751ac2c9e7bf1a3659f31849565d1122@mail.canobus.com> Dear Rob, thank you very much for your help, this fixed the error!! Cheers, Paul > > Your code is okay. The problem is in your PDB file: > > >> File >> "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/Atom.py", line >> 68, in __init__ >> assert not element or element == element.upper(), >> element >> AssertionError: Cl >> " >> Does this mean that the PDB parser only >> recognizes "amino acid-atoms", i.e. a chlorine does not work? > > The chlorine atoms should be "CL" not "Cl" in a proper PDB file. > > Cheers, > Rob From jordan.r.willis at Vanderbilt.Edu Sat Oct 15 20:59:58 2011 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Sat, 15 Oct 2011 15:59:58 -0500 Subject: [Biopython] Blast DB keeps crashing nodes Message-ID: <66965B9E-2AD6-4E02-BB8E-2F11A820DCDF@Vanderbilt.Edu> Hello Biopython, I was wondering if anyone has worked extensively with the Blast Database locally. I am blasting millions of sequences using Biopython as my backend framework. I am using a high-throughput computer cluster to blast each sequence. Rather than submit two million jobs, I have divided the fasta files up into 50 or so. The problem I am facing is a memory issue. I'm not sure, but I think that the Database is caching itself and not clearing before the next sequence is queried. In that regard, the next job calls upon the database again, and so on. The memory builds up until it finally crashes the node.
Has anyone dealt with this issue before? Thanks, Jordan From dilara.ally at gmail.com Sat Oct 15 21:55:21 2011 From: dilara.ally at gmail.com (Dilara Ally) Date: Sat, 15 Oct 2011 14:55:21 -0700 Subject: [Biopython] Blast DB keeps crashing nodes In-Reply-To: <66965B9E-2AD6-4E02-BB8E-2F11A820DCDF@Vanderbilt.Edu> References: <66965B9E-2AD6-4E02-BB8E-2F11A820DCDF@Vanderbilt.Edu> Message-ID: <4E9A0149.1000504@gmail.com> How many hits per sequence have you requested to get back - the default on the blastall is 250? I did blast search on ~600,000 contigs but I set up simultaneous jobs across 34 nodes. I used only the top 20 hits. Each file had 1000 fasta formatted sequences and each node was given ~12 files. But I still had to do it in two parts to get all sequences blasted. I waited until the first set finished to set up the second blast job. The job finished in 2 days. Before I ran it on the cluster I tested a single file to see how long and how much memory it took. The cluster I used had 34 computing nodes, with 16-48 cores and 16-64GB of memory. Hope that helps. On 10/15/11 1:59 PM, Willis, Jordan R wrote: > Hello Biopython, > > I was wondering if anyone has worked extensively with the Blast Database locally. > > I am blasting millions of sequences using Biopython as my backend framework. I am using a high throughput computer cluster to blast each sequence. Rather than submit two million jobs, I have divided the fast files up into 50 or so. > > The problem I am facing is a memory issue. I'm not sure, but I think that the Database is cacheing itself and not clearing before the next sequence is queried. In that regard, the next job calls upon the database again, and so on?. > > The memory builds up until it finally crashes the node. Has anyone dealt with this issue before? > > Thanks, > Jordan > > > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From mictadlo at gmail.com Mon Oct 17 12:11:12 2011 From: mictadlo at gmail.com (Mic) Date: Mon, 17 Oct 2011 22:11:12 +1000 Subject: [Biopython] SAM to BAM Message-ID: Hello, Is there a way to convert SAM file to sorted BAM file and generate also BAI file with pysam? Thank you in advance. From p.j.a.cock at googlemail.com Mon Oct 17 13:06:58 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 17 Oct 2011 14:06:58 +0100 Subject: [Biopython] [Samtools-help] SAM to BAM In-Reply-To: References: Message-ID: On Mon, Oct 17, 2011 at 1:11 PM, Mic wrote: > Hello, > Is there a way to convert SAM file to sorted BAM file and generate also BAI > file with pysam? > Thank you in advance. With samtools at the command line, samtools view -b -S example.sam | samtools sort - example samtools index example.bam I know you can easy call samtools from pysam, not sure if you can do the pipe trick to avoid extra steps: samtools view -b -S example.sam > example_unsorted samtools sort example_unsorted.bam example rm example_unsorted.bam samtools index example.bam Peter From jgrant at smith.edu Mon Oct 17 13:47:38 2011 From: jgrant at smith.edu (Jessica Grant) Date: Mon, 17 Oct 2011 09:47:38 -0400 Subject: [Biopython] pdb file question Message-ID: <541079A3-3C7D-45FF-8717-B1C64C85735F@smith.edu> Hello, I am trying to write a script that reproduces the crystal structure of a protein based on the information in the pdb file. I have gotten kind of stuck using the SMTRY lines in remark 290. 
It doesn't seem to contain all the information I need, at least the results I am getting don't look the same as when I produce symmetry mates in pymol, for example. Has anyone any experience with this? Thanks, Jessica From anaryin at gmail.com Mon Oct 17 14:08:54 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Mon, 17 Oct 2011 16:08:54 +0200 Subject: [Biopython] pdb file question In-Reply-To: <541079A3-3C7D-45FF-8717-B1C64C85735F@smith.edu> References: <541079A3-3C7D-45FF-8717-B1C64C85735F@smith.edu> Message-ID: Hello Jessica, Are you extracting the symmetry information with Biopython? If so, how are you using it to generate the other symmetry "members"? Using atom.transform? Cheers, Jo?o [...] Rodrigues http://nmr.chem.uu.nl/~joao 2011/10/17 Jessica Grant > Hello, > > I am trying to write a script that reproduces the crystal structure of a > protein based on the information in the pdb file. I have gotten kind of > stuck using the SMTRY lines in remark 290. It doesn't seem to contain all > the information I need, at least the results I am getting don't look the > same as when I produce symmetry mates in pymol, for example. Has anyone any > experience with this? Thanks, > > Jessica > > > > ______________________________**_________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/**mailman/listinfo/biopython > From hahj87 at gmail.com Mon Oct 17 15:03:10 2011 From: hahj87 at gmail.com (=?ISO-8859-1?Q?Joshua_Ismael_Haase_Hern=E1ndez?=) Date: Mon, 17 Oct 2011 10:03:10 -0500 Subject: [Biopython] is IRC channel at freenode active? Message-ID: Hi there, I was arround in the IRC channel and the only one there is Chanserv. I was wondering if the channel has some use. From mictadlo at gmail.com Tue Oct 18 03:44:14 2011 From: mictadlo at gmail.com (Mic) Date: Tue, 18 Oct 2011 13:44:14 +1000 Subject: [Biopython] Segmentation fault Message-ID: Hello, I have tried to generate a subset BAM, but I get a 'Segmentation fault' with the following code: from Bio import SeqIO import pysam from optparse import OptionParser import subprocess, os, sys from multiprocessing import Pool import functools import argparse def GetReferenceInfo(referenceFastaPath): referencenames = [] referencelengths = [] referenceFastaFile = open(referenceFastaPath) for record in SeqIO.parse(referenceFastaFile, "fasta"): referencenames.append(record.name) referencelengths.append(len(record.seq)) referenceFastaFile.close() return (referencenames, referencelengths) def GenerateSubsetBAM(bam_filename, ref_name): reads = [] bam_fh = pysam.Samfile(bam_filename, "rb") for read in bam_fh.fetch(ref_name): reads.append(read) print ref_name + ' Done ' + str(len(reads)) return (ref_name, reads) def writeBAM(reads, ref_names, ref_lengths, output_BAM): #print ref_names #print ref_lengths #print output_BAM #with pysam.Samfile(output_BAM, "wb", referencenames = ref_names, referencelengths = ref_lengths) as bh: bh = pysam.Samfile(output_BAM, "wb", referencenames = ref_names, referencelengths = ref_lengths) print reads.keys() for ref_name in ref_names: print ref_name for read in reads[ref_name]: print read #bh.write(read) print ref_name + 'Done' if __name__ == '__main__': parser = OptionParser("Usage: %prog -b BAMfile -f new_reference_fasta -o outputBAM") parser.add_option("-b", "--BAM", type="string", dest="inputBAMFilepath", help="Specify a BAM file") parser.add_option("-f", "--fasta", type="string", dest="fastaFilepath", help="Specify a reference fasta file.") 
parser.add_option("-o", "--output", type="string", dest="outputBAMFilepath", help="Specify an output BAM file.") (opts, args) = parser.parse_args() if (opts.inputBAMFilepath is None): print ("\nSpecify a BAM file. eg. -b large.bam\n") parser.print_help() elif not(os.path.exists(opts.inputBAMFilepath)): print ("\nReference BAM file does not exists: " + opts.inputBAMFilepath +"\n") elif (opts.fastaFilepath is None): print ("\nSpecify a reference fasta file. eg. -f Subset.fasta\n") parser.print_help() elif not(os.path.exists(opts.fastaFilepath)): print ("\nReference fasta file does not exists: " + opts.fastaFilepath +"\n") elif os.path.exists(opts.outputBAMFilepath) and not(raw_input(opts.outputBAMFilepath + " exists. Overwrite? (Y/n): ")=='Y'): print ("\nOutput BAM exists. Please specify alternative output file. eg. -o Subset.bam\n") else: print "Read fasta ..." (ref_names, ref_lengths) = GetReferenceInfo(opts.fastaFilepath) print 'Done!' print "creating subset...." pool = Pool() GenerateSubsetBAM_wrapper = functools.partial(GenerateSubsetBAM, opts.inputBAMFilepath) reads = dict(pool.imap_unordered(GenerateSubsetBAM_wrapper, ref_names)) pool.close() print "Done!" print "Writting results to subset BAM file..." writeBAM(reads, ref_names, ref_lengths, opts.outputBAMFilepath) print "Done!" I run the code in the following way: python SubsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bamRead fasta ... Done! creating subset.... chr1 Done 1464 chr2 Done 1806 Done! Writting results to subset BAM file... ['chr2', 'chr1'] chr1 Segmentation fault Thank you in advance. From p.j.a.cock at googlemail.com Tue Oct 18 09:00:47 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 18 Oct 2011 10:00:47 +0100 Subject: [Biopython] [Samtools-help] Segmentation fault In-Reply-To: References: Message-ID: On Tue, Oct 18, 2011 at 4:44 AM, Mic wrote: > Hello, > I have tried to generate a subset BAM, but I get a 'Segmentation fault' with > the following code: > from Bio import SeqIO > import pysam > from optparse import OptionParser > import subprocess, os, sys > from multiprocessing import Pool > import functools > ... I tried this and it seemed to get stuck much earlier. Could you cut down the example a bit by removing the multiprocessing? Peter P.S. Also you can remove the unused "import argparse" line. From mictadlo at gmail.com Tue Oct 18 10:26:06 2011 From: mictadlo at gmail.com (Mic) Date: Tue, 18 Oct 2011 20:26:06 +1000 Subject: [Biopython] [Samtools-help] Segmentation fault In-Reply-To: References: Message-ID: Hello, Thank you for your email. 
I updated the code and find out that print reads['chr1'] #works fine but print reads['chr1'][0] #caused Segmentation fault Please find below the updated code: from Bio import SeqIO import pysam from optparse import OptionParser import subprocess, os, sys from multiprocessing import Pool import functools def GetReferenceInfo(referenceFastaPath): referencenames = [] referencelengths = [] referenceFastaFile = open(referenceFastaPath) for record in SeqIO.parse(referenceFastaFile, "fasta"): referencenames.append(record.name) referencelengths.append(len(record.seq)) referenceFastaFile.close() return (referencenames, referencelengths) def GenerateSubsetBAM(bam_filename, ref_name): reads = [] bam_fh = pysam.Samfile(bam_filename, "rb") for read in bam_fh.fetch(ref_name): reads.append(read) print ref_name + ' Done ' + str(len(reads)) return (ref_name, reads) if __name__ == '__main__': parser = OptionParser("Usage: %prog -b BAMfile -f new_reference_fasta -o outputBAM") parser.add_option("-b", "--BAM", type="string", dest="inputBAMFilepath", help="Specify a BAM file") parser.add_option("-f", "--fasta", type="string", dest="fastaFilepath", help="Specify a reference fasta file.") parser.add_option("-o", "--output", type="string", dest="outputBAMFilepath", help="Specify an output BAM file.") (opts, args) = parser.parse_args() if (opts.inputBAMFilepath is None): print ("\nSpecify a BAM file. eg. -b large.bam\n") parser.print_help() elif not(os.path.exists(opts.inputBAMFilepath)): print ("\nReference BAM file does not exists: " + opts.inputBAMFilepath +"\n") elif (opts.fastaFilepath is None): print ("\nSpecify a reference fasta file. eg. -f Subset.fasta\n") parser.print_help() elif not(os.path.exists(opts.fastaFilepath)): print ("\nReference fasta file does not exists: " + opts.fastaFilepath +"\n") elif os.path.exists(opts.outputBAMFilepath) and not(raw_input(opts.outputBAMFilepath + " exists. Overwrite? (Y/n): ")=='Y'): print ("\nOutput BAM exists. Please specify alternative output file. eg. -o Subset.bam\n") else: print "Read fasta ..." (ref_names, ref_lengths) = GetReferenceInfo(opts.fastaFilepath) print 'Done!' print "creating subset...." pool = Pool() GenerateSubsetBAM_wrapper = functools.partial(GenerateSubsetBAM, opts.inputBAMFilepath) reads = dict(pool.imap_unordered(GenerateSubsetBAM_wrapper, ref_names)) pool.close() print "Done!" print reads['chr1'] #works fine print "xxxxx" print reads['chr1'][0] #caused Segmentation fault I run the code with the pysam-0.5 examples (pysam-0.5/tests) in the following way: python SubsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam Read fasta ... Done! creating subset.... chr1 Done 1464 chr2 Done 1806 Done! [, ..., ] xxxxx Segmentation fault Thank you in advance. On Tue, Oct 18, 2011 at 7:00 PM, Peter Cock wrote: > On Tue, Oct 18, 2011 at 4:44 AM, Mic wrote: > > Hello, > > I have tried to generate a subset BAM, but I get a 'Segmentation fault' > with > > the following code: > > from Bio import SeqIO > > import pysam > > from optparse import OptionParser > > import subprocess, os, sys > > from multiprocessing import Pool > > import functools > > ... > > I tried this and it seemed to get stuck much earlier. Could you > cut down the example a bit by removing the multiprocessing? > > Peter > > P.S. Also you can remove the unused "import argparse" line. 
> From mmokrejs at fold.natur.cuni.cz Tue Oct 18 11:44:54 2011 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Tue, 18 Oct 2011 13:44:54 +0200 Subject: [Biopython] [Samtools-help] Segmentation fault In-Reply-To: References: Message-ID: <4E9D66B6.70904@fold.natur.cuni.cz> Before running your python code, do (under bash): $ ulimit -c unlimited $ python mypython.py $ file core $ gdb /usr/bin/python ./core gdb> where gdb> bt full gdb> quit $ Martin Mic wrote: > Hello, > Thank you for your email. I updated the code and find out that > print reads['chr1'] #works fine > but > print reads['chr1'][0] #caused Segmentation fault > > Please find below the updated code: > > from Bio import SeqIO > import pysam > from optparse import OptionParser > import subprocess, os, sys > from multiprocessing import Pool > import functools > > > def GetReferenceInfo(referenceFastaPath): > referencenames = [] > referencelengths = [] > referenceFastaFile = open(referenceFastaPath) > for record in SeqIO.parse(referenceFastaFile, "fasta"): > referencenames.append(record.name) > referencelengths.append(len(record.seq)) > referenceFastaFile.close() > return (referencenames, referencelengths) > > > def GenerateSubsetBAM(bam_filename, ref_name): > reads = [] > bam_fh = pysam.Samfile(bam_filename, "rb") > > for read in bam_fh.fetch(ref_name): > reads.append(read) > > print ref_name + ' Done ' + str(len(reads)) > return (ref_name, reads) > > > if __name__ == '__main__': > parser = OptionParser("Usage: %prog -b BAMfile -f new_reference_fasta -o > outputBAM") > parser.add_option("-b", "--BAM", type="string", dest="inputBAMFilepath", > help="Specify a BAM file") > parser.add_option("-f", "--fasta", type="string", dest="fastaFilepath", > help="Specify a reference fasta file.") > parser.add_option("-o", "--output", type="string", > dest="outputBAMFilepath", help="Specify an output BAM file.") > > (opts, args) = parser.parse_args() > > if (opts.inputBAMFilepath is None): > print ("\nSpecify a BAM file. eg. -b large.bam\n") > parser.print_help() > elif not(os.path.exists(opts.inputBAMFilepath)): > print ("\nReference BAM file does not exists: " + opts.inputBAMFilepath > +"\n") > elif (opts.fastaFilepath is None): > print ("\nSpecify a reference fasta file. eg. -f Subset.fasta\n") > parser.print_help() > elif not(os.path.exists(opts.fastaFilepath)): > print ("\nReference fasta file does not exists: " + opts.fastaFilepath > +"\n") > elif os.path.exists(opts.outputBAMFilepath) and > not(raw_input(opts.outputBAMFilepath + " exists. Overwrite? (Y/n): ")=='Y'): > print ("\nOutput BAM exists. Please specify alternative output file. > eg. -o Subset.bam\n") > else: > print "Read fasta ..." > (ref_names, ref_lengths) = GetReferenceInfo(opts.fastaFilepath) > print 'Done!' > > print "creating subset...." > pool = Pool() > GenerateSubsetBAM_wrapper = functools.partial(GenerateSubsetBAM, > opts.inputBAMFilepath) > reads = dict(pool.imap_unordered(GenerateSubsetBAM_wrapper, ref_names)) > pool.close() > print "Done!" > > print reads['chr1'] #works fine > print "xxxxx" > > print reads['chr1'][0] #caused Segmentation fault > > I run the code with the pysam-0.5 examples (pysam-0.5/tests) in the > following way: > > python SubsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam > > Read fasta ... > Done! > creating subset.... > chr1 Done 1464 > chr2 Done 1806 > Done! > [, ..., > ] > xxxxx > Segmentation fault > > Thank you in advance. 
> > > On Tue, Oct 18, 2011 at 7:00 PM, Peter Cock wrote: > >> On Tue, Oct 18, 2011 at 4:44 AM, Mic wrote: >>> Hello, >>> I have tried to generate a subset BAM, but I get a 'Segmentation fault' >> with >>> the following code: >>> from Bio import SeqIO >>> import pysam >>> from optparse import OptionParser >>> import subprocess, os, sys >>> from multiprocessing import Pool >>> import functools >>> ... >> >> I tried this and it seemed to get stuck much earlier. Could you >> cut down the example a bit by removing the multiprocessing? >> >> Peter >> >> P.S. Also you can remove the unused "import argparse" line. >> > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > From mictadlo at gmail.com Tue Oct 18 12:05:01 2011 From: mictadlo at gmail.com (Mic) Date: Tue, 18 Oct 2011 22:05:01 +1000 Subject: [Biopython] [Samtools-help] Segmentation fault In-Reply-To: <4E9D66B6.70904@fold.natur.cuni.cz> References: <4E9D66B6.70904@fold.natur.cuni.cz> Message-ID: Thank you for your tip, but I got an error: $ulimit -c unlimited $SubsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam Read fasta ... Done! creating subset.... chr1 Done 1464 EAS56_57:6:190:289:82 69 0 99 0 None 0 99 35 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<; [('MF', 192)] chr2 Done 1806 B7_591:8:4:841:340 73 1 0 99 [(0, 36)] -1 -1 36 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA <<<<<<<<;<<<<<<<<;<<<<<;<;:<<<<<<<;; [('MF', 18), ('Aq', 77), ('NM', 0), ('UQ', 0), ('H0', 1), ('H1', 0)] Done! xxxxx Segmentation fault (core dumped) $file core core: ERROR: cannot open `core' (No such file or directory) I also inserted "print reads[0]" in the method GenerateSubsetBAM: def GenerateSubsetBAM(bam_filename, ref_name): reads = [] bam_fh = pysam.Samfile(bam_filename, "rb") for read in bam_fh.fetch(ref_name): reads.append(read) print ref_name + ' Done ' + str(len(reads)) print reads[0] # works fine! return (ref_name, reads) and as output I got: python SubsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam Read fasta ... Done! creating subset.... chr1 Done 1464 EAS56_57:6:190:289:82 69 0 99 0 None 0 99 35 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<; [('MF', 192)] chr2 Done 1806 B7_591:8:4:841:340 73 1 0 99 [(0, 36)] -1 -1 36 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA <<<<<<<<;<<<<<<<<;<<<<<;<;:<<<<<<<;; [('MF', 18), ('Aq', 77), ('NM', 0), ('UQ', 0), ('H0', 1), ('H1', 0)] Done! xxxxx Segmentation fault Why does reads['chr1'][0] caused the Segmentation fault? Thank you in advance. On Tue, Oct 18, 2011 at 9:44 PM, Martin Mokrejs wrote: > Before running your python code, do (under bash): > $ ulimit -c unlimited > $ python mypython.py > $ file core > $ gdb /usr/bin/python ./core > gdb> where > gdb> bt full > gdb> quit > $ > > Martin > > Mic wrote: > > Hello, > > Thank you for your email. 
I updated the code and find out that > > print reads['chr1'] #works fine > > but > > print reads['chr1'][0] #caused Segmentation fault > > > > Please find below the updated code: > > > > from Bio import SeqIO > > import pysam > > from optparse import OptionParser > > import subprocess, os, sys > > from multiprocessing import Pool > > import functools > > > > > > def GetReferenceInfo(referenceFastaPath): > > referencenames = [] > > referencelengths = [] > > referenceFastaFile = open(referenceFastaPath) > > for record in SeqIO.parse(referenceFastaFile, "fasta"): > > referencenames.append(record.name) > > referencelengths.append(len(record.seq)) > > referenceFastaFile.close() > > return (referencenames, referencelengths) > > > > > > def GenerateSubsetBAM(bam_filename, ref_name): > > reads = [] > > bam_fh = pysam.Samfile(bam_filename, "rb") > > > > for read in bam_fh.fetch(ref_name): > > reads.append(read) > > > > print ref_name + ' Done ' + str(len(reads)) > > return (ref_name, reads) > > > > > > if __name__ == '__main__': > > parser = OptionParser("Usage: %prog -b BAMfile -f new_reference_fasta > -o > > outputBAM") > > parser.add_option("-b", "--BAM", type="string", > dest="inputBAMFilepath", > > help="Specify a BAM file") > > parser.add_option("-f", "--fasta", type="string", dest="fastaFilepath", > > help="Specify a reference fasta file.") > > parser.add_option("-o", "--output", type="string", > > dest="outputBAMFilepath", help="Specify an output BAM file.") > > > > (opts, args) = parser.parse_args() > > > > if (opts.inputBAMFilepath is None): > > print ("\nSpecify a BAM file. eg. -b large.bam\n") > > parser.print_help() > > elif not(os.path.exists(opts.inputBAMFilepath)): > > print ("\nReference BAM file does not exists: " + > opts.inputBAMFilepath > > +"\n") > > elif (opts.fastaFilepath is None): > > print ("\nSpecify a reference fasta file. eg. -f Subset.fasta\n") > > parser.print_help() > > elif not(os.path.exists(opts.fastaFilepath)): > > print ("\nReference fasta file does not exists: " + > opts.fastaFilepath > > +"\n") > > elif os.path.exists(opts.outputBAMFilepath) and > > not(raw_input(opts.outputBAMFilepath + " exists. Overwrite? (Y/n): > ")=='Y'): > > print ("\nOutput BAM exists. Please specify alternative output file. > > eg. -o Subset.bam\n") > > else: > > print "Read fasta ..." > > (ref_names, ref_lengths) = GetReferenceInfo(opts.fastaFilepath) > > print 'Done!' > > > > print "creating subset...." > > pool = Pool() > > GenerateSubsetBAM_wrapper = functools.partial(GenerateSubsetBAM, > > opts.inputBAMFilepath) > > reads = dict(pool.imap_unordered(GenerateSubsetBAM_wrapper, > ref_names)) > > pool.close() > > print "Done!" > > > > print reads['chr1'] #works fine > > print "xxxxx" > > > > print reads['chr1'][0] #caused Segmentation fault > > > > I run the code with the pysam-0.5 examples (pysam-0.5/tests) in the > > following way: > > > > python SubsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam > > > > Read fasta ... > > Done! > > creating subset.... > > chr1 Done 1464 > > chr2 Done 1806 > > Done! > > [, ..., > > ] > > xxxxx > > Segmentation fault > > > > Thank you in advance. 
> > > > > > On Tue, Oct 18, 2011 at 7:00 PM, Peter Cock >wrote: > > > >> On Tue, Oct 18, 2011 at 4:44 AM, Mic wrote: > >>> Hello, > >>> I have tried to generate a subset BAM, but I get a 'Segmentation fault' > >> with > >>> the following code: > >>> from Bio import SeqIO > >>> import pysam > >>> from optparse import OptionParser > >>> import subprocess, os, sys > >>> from multiprocessing import Pool > >>> import functools > >>> ... > >> > >> I tried this and it seemed to get stuck much earlier. Could you > >> cut down the example a bit by removing the multiprocessing? > >> > >> Peter > >> > >> P.S. Also you can remove the unused "import argparse" line. > >> > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > From p.j.a.cock at googlemail.com Tue Oct 18 12:58:47 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 18 Oct 2011 13:58:47 +0100 Subject: [Biopython] [Samtools-help] Segmentation fault In-Reply-To: References: Message-ID: On Tue, Oct 18, 2011 at 11:26 AM, Mic wrote: > Hello, > Thank you for your email. I updated the code and find out that > ? ? print reads['chr1'] ? ? #works fine > but > ? ? print reads['chr1'][0] ?#caused Segmentation fault > Please find below the updated code: > ... Your pool version doesn't run on my machine, something unhappy in multiprocessing gives: TypeError: type 'partial' takes at least one argument Here's a version using a single thread, which works fine for me. What does it do on your machines? Either way this should help in determining the segmentation fault. from Bio import SeqIO import pysam import subprocess, os, sys def GetReferenceInfo(referenceFastaPath): referencenames = [] referencelengths = [] referenceFastaFile = open(referenceFastaPath) for record in SeqIO.parse(referenceFastaFile, "fasta"): referencenames.append(record.name) referencelengths.append(len(record.seq)) referenceFastaFile.close() return (referencenames, referencelengths) def GenerateSubsetBAM(bam_filename, ref_name): reads = [] bam_fh = pysam.Samfile(bam_filename, "rb") for read in bam_fh.fetch(ref_name): reads.append(read) print ref_name + ' Done ' + str(len(reads)) return (ref_name, reads) bam_filename = "ex1.bam" fasta_filename = "ex1.fa" print "Read fasta ..." (ref_names, ref_lengths) = GetReferenceInfo(fasta_filename) print 'Done!' print "creating subset...." reads = dict() for ref in ref_names: reads[ref] = GenerateSubsetBAM(bam_filename, ref) print "Done!" print reads['chr1'] #works fine print "xxxxx" print reads['chr1'][0] #also fine -- Peter From nathaniel.echols at gmail.com Tue Oct 18 18:08:03 2011 From: nathaniel.echols at gmail.com (Nat Echols) Date: Tue, 18 Oct 2011 11:08:03 -0700 Subject: [Biopython] newbie question: sequence parsing Message-ID: Greetings-- We have started using BioPython in our (non-bioinformatics) application and are investigating the possibility of replacing our existing (custom-made) sequence parsers. Two quick questions: 1) Is there a sequence parser that works with just a simple string, without any header or additional metadata? If not, how could we write one that results in the same basic object as those in Bio.SeqIO? (The parsing is of course easy, I just want to have the API be consistent regardless of format.) 2) Is there a single function that will take a file (and/or string) of unknown format and try the different parsers until it finds one that works? 
We currently use several different formats (raw string, FASTA, PIR, and possibly others), and we try not to rely on the file extension alone to determine the type. We already have something that does this using our parsers, which could be refactored to use Bio.SeqIO instead, but if BioPython has something similar I'd rather use that. thanks, Nat From p.j.a.cock at googlemail.com Tue Oct 18 19:04:14 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 18 Oct 2011 20:04:14 +0100 Subject: [Biopython] newbie question: sequence parsing In-Reply-To: References: Message-ID: On Tue, Oct 18, 2011 at 7:08 PM, Nat Echols wrote: > Greetings-- > > We have started using BioPython in our (non-bioinformatics) application and > are investigating the possibility of replacing our existing (custom-made) > sequence parsers. ?Two quick questions: > > 1) Is there a sequence parser that works with just a simple string, without > any header or additional metadata? ?If not, how could we write one that > results in the same basic object as those in Bio.SeqIO? ?(The parsing is of > course easy, I just want to have the API be consistent regardless of > format.) Sounds like the "raw" format in EMBOSS, although there are two interpretations: one sequence per line, or one sequence for the whole file. Have a look at the FASTA parser in Bio/SeqIO/FastaIO.py as the most simple case. Essentially you create a SeqRecord object (which is covered in the Tutorial). > 2) Is there a single function that will take a file (and/or string) of > unknown format and try the different parsers until it finds one that works? > ?We currently use several different formats (raw string, FASTA, PIR, and > possibly others), and we try not to rely on the file extension alone to > determine the type. ?We already have something that does this using our > parsers, which could be refactored to use Bio.SeqIO instead, but if > BioPython has something similar I'd rather use that. No, we don't have such a function. There are many difficulties with format guessing - both from the file contents and even the filename. I usually cite the Zen of Python, Explicit is Better Than Implicit. Peter From cjfields at illinois.edu Tue Oct 18 19:11:56 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Tue, 18 Oct 2011 19:11:56 +0000 Subject: [Biopython] newbie question: sequence parsing In-Reply-To: References: Message-ID: On Oct 18, 2011, at 2:04 PM, Peter Cock wrote: > On Tue, Oct 18, 2011 at 7:08 PM, Nat Echols wrote: >> ... >> 2) Is there a single function that will take a file (and/or string) of >> unknown format and try the different parsers until it finds one that works? >> We currently use several different formats (raw string, FASTA, PIR, and >> possibly others), and we try not to rely on the file extension alone to >> determine the type. We already have something that does this using our >> parsers, which could be refactored to use Bio.SeqIO instead, but if >> BioPython has something similar I'd rather use that. > > No, we don't have such a function. There are many difficulties > with format guessing - both from the file contents and even the > filename. I usually cite the Zen of Python, Explicit is Better Than > Implicit. > > Peter Some implicitness is fine, but speaking from experience (BioPerl's GuessSeqFormat) trying to guess the format from the dozens that litter the bioinformatics landscape is a nest of hornets no one wants to maintain. 
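For a flavour of why, even a naive sniffer covering just the formats Nat mentioned turns into a pile of arbitrary judgement calls - here is a throwaway sketch in plain Python (this is not anything in Biopython, and the handful of header prefixes it checks are only the common cases):

def guess_format(text):
    # deliberately naive - real files violate all of these assumptions
    text = text.lstrip()
    if text[0:4] in (">P1;", ">F1;", ">DL;", ">DC;"):
        return "pir"    # PIR/NBRF style header, e.g. >P1;CRAB_ANAPL
    elif text.startswith(">"):
        return "fasta"
    else:
        return "raw"    # fall back to treating it as a bare sequence

and that is before you start worrying about leading blank lines, alignment files, quality files, or deflines that happen to look like another format.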
chris From p.j.a.cock at googlemail.com Tue Oct 18 19:31:06 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 18 Oct 2011 20:31:06 +0100 Subject: [Biopython] newbie question: sequence parsing In-Reply-To: References: Message-ID: On Tue, Oct 18, 2011 at 8:11 PM, Fields, Christopher J wrote: > On Oct 18, 2011, at 2:04 PM, Peter Cock wrote: > >> On Tue, Oct 18, 2011 at 7:08 PM, Nat Echols wrote: >>> ... >>> 2) Is there a single function that will take a file (and/or string) of >>> unknown format and try the different parsers until it finds one that works? >>> ?We currently use several different formats (raw string, FASTA, PIR, and >>> possibly others), and we try not to rely on the file extension alone to >>> determine the type. ?We already have something that does this using our >>> parsers, which could be refactored to use Bio.SeqIO instead, but if >>> BioPython has something similar I'd rather use that. >> >> No, we don't have such a function. There are many difficulties >> with format guessing - both from the file contents and even the >> filename. I usually cite the Zen of Python, Explicit is Better Than >> Implicit. >> >> Peter > > Some implicitness is fine, but speaking from experience > (BioPerl's GuessSeqFormat) trying to guess the format > from the dozens that litter the bioinformatics landscape > is a nest of hornets no one wants to maintain. > > chris I think "nest of hornets" is a much more beautiful phrase than my dead pan "many difficulties". The practical reality is that while some file formats are easy (binary files with 4 byte "magic" identifiers), others are horrible, and the definitions shift over time, as new formats of variants are added. I really don't want to go there. Peter From nathaniel.echols at gmail.com Tue Oct 18 21:47:03 2011 From: nathaniel.echols at gmail.com (Nat Echols) Date: Tue, 18 Oct 2011 14:47:03 -0700 Subject: [Biopython] issues with NCBIXML Message-ID: Hi again, I'm puzzled by the behavior of the Blast XML parser. It appears to be picking up all of the alignments correctly, but the top-level Bio.Blast.Record.Blast object that it returns appears to be incompletely populated. Specifically, the attributes num_hits and num_sequences are set to None - but I have several dozen alignments. Am I missing the point of these attributes, or doing something wrong? It's not a huge issue (I can just count the alignments, I guess), but I'm a bit concerned that there's something wrong with my code. thanks, Nat From p.j.a.cock at googlemail.com Tue Oct 18 22:07:24 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 18 Oct 2011 23:07:24 +0100 Subject: [Biopython] issues with NCBIXML In-Reply-To: References: Message-ID: On Tue, Oct 18, 2011 at 10:47 PM, Nat Echols wrote: > Hi again, > > I'm puzzled by the behavior of the Blast XML parser. ?It appears to be > picking up all of the alignments correctly, but the > top-level Bio.Blast.Record.Blast object that it returns appears to be > incompletely populated. ?Specifically, the attributes num_hits and > num_sequences are set to None - but I have several dozen alignments. ?Am I > missing the point of these attributes, or doing something wrong? ?It's not a > huge issue (I can just count the alignments, I guess), but I'm a bit > concerned that there's something wrong with my code. > > thanks, > Nat The number of alignments and descriptions only really apply to the plain text (or HTML) BLAST output, but I guess we could set them to the number of hits in the XML output. 
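In the meantime counting the hits yourself is trivial, since the XML parser does populate the alignments list - something like this rough sketch (untested, and the filename is just an example):

from Bio.Blast import NCBIXML
handle = open("my_blast_output.xml")
for record in NCBIXML.parse(handle):
    # each alignment is one database hit, each holding its HSPs
    print record.query, len(record.alignments)
handle.close()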
Peter From mictadlo at gmail.com Tue Oct 18 23:12:03 2011 From: mictadlo at gmail.com (Mic) Date: Wed, 19 Oct 2011 09:12:03 +1000 Subject: [Biopython] [Samtools-help] Segmentation fault In-Reply-To: <4E9D6FAB.70308@fold.natur.cuni.cz> References: <4E9D66B6.70904@fold.natur.cuni.cz> <4E9D6FAB.70308@fold.natur.cuni.cz> Message-ID: I run it now on my Laptop (Ubuntu 11.04 x64) and now I can see the core file: $ ulimit -c unlimited $ python subsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam Segmentation fault (core dumped) $ file core core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'python subsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam' $ gdb /usr/bin/python ./core GNU gdb (Ubuntu/Linaro 7.2-1ubuntu11) 7.2 Copyright (C) 2010 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". For bug reporting instructions, please see: ... Reading symbols from /usr/bin/python...(no debugging symbols found)...done. [New Thread 2748] warning: Can't read pathname for load map: Input/output error. Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libpthread.so.0 Reading symbols from /lib/x86_64-linux-gnu/libdl.so.2...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libdl.so.2 Reading symbols from /lib/x86_64-linux-gnu/libutil.so.1...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libutil.so.1 Reading symbols from /lib/libssl.so.0.9.8...(no debugging symbols found)...done. Loaded symbols for /lib/libssl.so.0.9.8 Reading symbols from /lib/libcrypto.so.0.9.8...(no debugging symbols found)...done. Loaded symbols for /lib/libcrypto.so.0.9.8 Reading symbols from /lib/x86_64-linux-gnu/libz.so.1...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libz.so.1 Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libm.so.6 Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libc.so.6 Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/ld-linux-x86-64.so.2 Reading symbols from /usr/lib/python2.7/lib-dynload/_heapq.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/python2.7/lib-dynload/_heapq.so Reading symbols from /usr/lib/python2.7/lib-dynload/_elementtree.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/python2.7/lib-dynload/_elementtree.so Reading symbols from /lib/x86_64-linux-gnu/libexpat.so.1...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libexpat.so.1 Reading symbols from /usr/lib/python2.7/lib-dynload/pyexpat.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/python2.7/lib-dynload/pyexpat.so Reading symbols from /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/csamtools.so...done. Loaded symbols for /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/csamtools.so Reading symbols from /usr/lib/python2.7/lib-dynload/_ctypes.so...(no debugging symbols found)...done. 
Loaded symbols for /usr/lib/python2.7/lib-dynload/_ctypes.so Reading symbols from /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/ctabix.so...done. Loaded symbols for /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/ctabix.so Reading symbols from /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/TabProxies.so...done. Loaded symbols for /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/TabProxies.so Reading symbols from /usr/lib/python2.7/lib-dynload/_io.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/python2.7/lib-dynload/_io.so Reading symbols from /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/cvcf.so...done. Loaded symbols for /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/cvcf.so Reading symbols from /usr/lib/python2.7/lib-dynload/_multiprocessing.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/python2.7/lib-dynload/_multiprocessing.so Reading symbols from /usr/lib/pymodules/python2.7/Bio/Nexus/cnexus.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/pymodules/python2.7/Bio/Nexus/cnexus.so Core was generated by `python subsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam'. Program terminated with signal 11, Segmentation fault. #0 __pyx_pf_9csamtools_11AlignedRead_5qname___get__ (o=0x164e138, x=) at pysam/csamtools.c:18123 18123 if (__pyx_t_1) { (gdb) (gdb) where #0 __pyx_pf_9csamtools_11AlignedRead_5qname___get__ (o=0x164e138, x=) at pysam/csamtools.c:18123 #1 __pyx_getprop_9csamtools_11AlignedRead_qname (o=0x164e138, x=) at pysam/csamtools.c:30806 #2 0x0000000000479804 in ?? () #3 0x00007f187dbabc65 in __pyx_pf_9csamtools_11AlignedRead___str__ ( __pyx_v_self=0x164e138) at pysam/csamtools.c:17687 #4 0x0000000000479eac in _PyObject_Str () #5 0x0000000000479f8a in PyObject_Str () #6 0x00000000004d390c in ?? () #7 0x00000000004cd2d1 in PyFile_WriteObject () #8 0x000000000049909d in PyEval_EvalFrameEx () #9 0x000000000049d325 in PyEval_EvalCodeEx () #10 0x00000000004ecb02 in PyEval_EvalCode () #11 0x00000000004fdc74 in ?? () #12 0x000000000042c182 in PyRun_FileExFlags () #13 0x000000000042cb4a in PyRun_SimpleFileExFlags () #14 0x0000000000418c9e in Py_Main () #15 0x00007f187ed7aeff in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6 #16 0x00000000004c62b1 in _start () (gdb) (gdb) bt full #0 __pyx_pf_9csamtools_11AlignedRead_5qname___get__ (o=0x164e138, x=) at pysam/csamtools.c:18123 __pyx_v_src = 0x0 __pyx_t_2 = 0x0 __pyx_frame = 0x0 __pyx_r = 0x0 __pyx_t_1 = __Pyx_use_tracing = 0 __pyx_frame_code = 0x0 #1 __pyx_getprop_9csamtools_11AlignedRead_qname (o=0x164e138, x=) at pysam/csamtools.c:30806 No locals. #2 0x0000000000479804 in ?? () No symbol table info available. #3 0x00007f187dbabc65 in __pyx_pf_9csamtools_11AlignedRead___str__ ( __pyx_v_self=0x164e138) at pysam/csamtools.c:17687 __pyx_r = 0x0 __pyx_t_1 = 0x1e4fb90 __pyx_t_2 = 0x0 __pyx_t_3 = 0x0 __pyx_t_4 = 0x0 ---Type to continue, or q to quit--- __pyx_t_5 = 0x0 __pyx_t_6 = 0x0 __pyx_t_7 = 0x0 __pyx_t_8 = 0x0 __pyx_t_9 = 0x0 __pyx_t_10 = 0x0 __pyx_t_11 = 0x0 __pyx_t_12 = 0x0 __pyx_t_13 = 0x0 __pyx_t_14 = 0x0 __pyx_frame_code = 0x0 __pyx_frame = 0x0 __Pyx_use_tracing = 0 #4 0x0000000000479eac in _PyObject_Str () No symbol table info available. #5 0x0000000000479f8a in PyObject_Str () No symbol table info available. #6 0x00000000004d390c in ?? 
() No symbol table info available. #7 0x00000000004cd2d1 in PyFile_WriteObject () No symbol table info available. ---Type to continue, or q to quit--- #8 0x000000000049909d in PyEval_EvalFrameEx () No symbol table info available. #9 0x000000000049d325 in PyEval_EvalCodeEx () No symbol table info available. #10 0x00000000004ecb02 in PyEval_EvalCode () No symbol table info available. #11 0x00000000004fdc74 in ?? () No symbol table info available. #12 0x000000000042c182 in PyRun_FileExFlags () No symbol table info available. #13 0x000000000042cb4a in PyRun_SimpleFileExFlags () No symbol table info available. #14 0x0000000000418c9e in Py_Main () No symbol table info available. #15 0x00007f187ed7aeff in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. #16 0x00000000004c62b1 in _start () No symbol table info available. (gdb) quit $ ulimit -a core file size (blocks, -c) unlimited data seg size (kbytes, -d) unlimited scheduling priority (-e) 20 file size (blocks, -f) unlimited pending signals (-i) 16382 max locked memory (kbytes, -l) 64 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) unlimited virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Thank you in advance. From nathaniel.echols at gmail.com Wed Oct 19 18:48:11 2011 From: nathaniel.echols at gmail.com (Nat Echols) Date: Wed, 19 Oct 2011 11:48:11 -0700 Subject: [Biopython] issues with NCBIXML In-Reply-To: References: Message-ID: On Tue, Oct 18, 2011 at 3:07 PM, Peter Cock wrote: > The number of alignments and descriptions only really apply > to the plain text (or HTML) BLAST output, but I guess we > could set them to the number of hits in the XML output. This would be useful, for consistency's sake if nothing else. I'm happy to contribute a patch if that streamlines the process. -Nat From p.j.a.cock at googlemail.com Wed Oct 19 19:06:30 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 19 Oct 2011 20:06:30 +0100 Subject: [Biopython] issues with NCBIXML In-Reply-To: References: Message-ID: On Wed, Oct 19, 2011 at 7:48 PM, Nat Echols wrote: > On Tue, Oct 18, 2011 at 3:07 PM, Peter Cock > wrote: >> >> The number of alignments and descriptions only really apply >> to the plain text (or HTML) BLAST output, but I guess we >> could set them to the number of hits in the XML output. > > This would be useful, for consistency's sake if nothing else. ?I'm happy to > contribute a patch if that streamlines the process. > -Nat Sure. If you can include unit tests for it even better. You should just be able to add some assertEqual lines to the existing XML parser tests for the newly populated properties. Thanks, Peter From mictadlo at gmail.com Thu Oct 20 09:38:56 2011 From: mictadlo at gmail.com (Mic) Date: Thu, 20 Oct 2011 19:38:56 +1000 Subject: [Biopython] changing record attributes while iterating In-Reply-To: References: Message-ID: Hello, would it be possible to using a generator expression for the following code? from Bio import SeqIO fa_parser = SeqIO.parse(open("../test_files/test.fasta", "rU"), "fasta") sequence = fa_parser.next().seq for record in fa_parser: sequence += 3*'N' + record.seq print sequence Input: >1 1111111 >2 2222222 >3 3333333 >4 4444444 Output: 1111111NNN2222222NNN3333333NNN4444444 Thank you advance. 
On Fri, Oct 7, 2011 at 5:22 PM, Peter Cock wrote: > > > On Friday, October 7, 2011, Michal wrote: > > > Hello, > > Does your code with generator save the whole file in the > > memory or does it read each entry and save it immediately? > > Thank you in advance. > > Using a generator expression like that only one SeqRecord is in memory at a > time. It goes through the input FASTA one record at a time, renames it, > saves it immediately. > > Peter > > P.S. list CC'd From p.j.a.cock at googlemail.com Thu Oct 20 09:58:05 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 20 Oct 2011 10:58:05 +0100 Subject: [Biopython] changing record attributes while iterating In-Reply-To: References: Message-ID: Hi Mic, You should have started a new thread with a new title... On Thu, Oct 20, 2011 at 10:38 AM, Mic wrote: > Hello, > would it be possible to using a generator expression for the following code? > from Bio import SeqIO > fa_parser = SeqIO.parse(open("../test_files/test.fasta", "rU"), "fasta") > sequence = fa_parser.next().seq > for record in fa_parser: > sequence += 3*'N' + record.seq > > print sequence > Input: >>1 > 1111111 >>2 > 2222222 >>3 > 3333333 >>4 > 4444444 > Output: > 1111111NNN2222222NNN3333333NNN4444444 > Thank you advance. Sure, how about this: from Bio import SeqIO fa_parser = SeqIO.parse("../test_files/test.fasta", "fasta") print ('N' * 3).join(str(rec.seq) for rec in fa_parser) Peter From andreas.wilm at gmail.com Tue Oct 25 06:26:59 2011 From: andreas.wilm at gmail.com (Andreas Wilm) Date: Tue, 25 Oct 2011 14:26:59 +0800 Subject: [Biopython] VCF parser In-Reply-To: References: Message-ID: Hi Tiago, I'm not aware of a Biopython VCF parser, but pysam seems to have one (haven't used it though). Try >>> from pysam import cvcf You also might want to check an implementation which was posted on seqanswers: http://seqanswers.com/forums/archive/index.php/t-9266.html Andreas PS: For the sake of completeness: your question was asked before here (no replies). See http://www.biopython.org/pipermail/biopython/2011-March/007131.html 2011/10/4 Tiago Antão : > Hi, > > I wonder if there is a VCF parser in either Python or Java? Either I > am being dumb at searching (probably) or nothing exists? > > Thanks, > Tiago > > -- > "If you want to get laid, go to college. If you want an education, go > to the library." - Frank Zappa > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Andreas Wilm andreas.wilm at gmail.com | mail at andreas-wilm.com | 0x7C68FBCC From pawan.mani2 at gmail.com Tue Oct 25 15:50:51 2011 From: pawan.mani2 at gmail.com (kakchingtabam pawankumar sharma) Date: Tue, 25 Oct 2011 21:20:51 +0530 Subject: [Biopython] installation of pyfatsa In-Reply-To: <2DDF09AFEB46E54894A3843CEF9CB3A4B55076114C@EXCHMB.ocimumbio.com> References: <2DDF09AFEB46E54894A3843CEF9CB3A4B55076114C@EXCHMB.ocimumbio.com> Message-ID: Dear, I would like to know how to install pyfasta in linux. I have downloaded pyfasta-0.4.4.tar.gz and installed it using the command: tar -xzvf pyfasta-0.4.4.tar.gz. But I could not use the command line: pyfasta split -n 6 sample.fasta So kindly help me out to solve this problem. With Regards, Pawan
From p.j.a.cock at googlemail.com Tue Oct 25 16:13:00 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 25 Oct 2011 17:13:00 +0100 Subject: [Biopython] installation of pyfatsa In-Reply-To: References: <2DDF09AFEB46E54894A3843CEF9CB3A4B55076114C@EXCHMB.ocimumbio.com> Message-ID: On Tue, Oct 25, 2011 at 4:50 PM, kakchingtabam pawankumar sharma wrote: > Dear, > > I would like to know how to install pyfasta in linux. I have > downloaded pyfasta-0.4.4.tar.gz and installed it using the command: tar -xzvf > pyfasta-0.4.4.tar.gz. > > But I could not use the command line: > > pyfasta split -n 6 sample.fasta > > So kindly help me out to solve this problem. > > With Regards, > > Pawan > Hi Pawan, Note pyfasta is not part of Biopython, but is a separate tool by Brent Pedersen (CC'd). http://pypi.python.org/pypi/pyfasta/ https://github.com/brentp/pyfasta/ However, uncompressing the tar ball is only the first step in installing it. You probably need to run "python setup.py install" for that. Peter From bpederse at gmail.com Tue Oct 25 16:23:31 2011 From: bpederse at gmail.com (Brent Pedersen) Date: Tue, 25 Oct 2011 10:23:31 -0600 Subject: [Biopython] VCF parser In-Reply-To: References: Message-ID: On Mon, Oct 3, 2011 at 4:12 PM, Tiago Antão wrote: > Hi, > > I wonder if there is a VCF parser in either Python or Java? Either I > am being dumb at searching (probably) or nothing exists? > > Thanks, > Tiago > > -- > "If you want to get laid, go to college. If you want an education, go > to the library." - Frank Zappa > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > I have found this one: https://github.com/jdoughertyii/PyVCF to be quite good and easy to use. From anaryin at gmail.com Wed Oct 26 10:30:12 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 26 Oct 2011 12:30:12 +0200 Subject: [Biopython] Pairwise alignment - is it a generic function? Message-ID: Hello all, A friend of mine was interested in a small simple alignment script for amino acids, to which I recommended having a look at Biopython. We found the pairwise2 module but we're a bit puzzled. Does it align "any" sequence, aa or nucleotides? I don't see any scoring matrix referenced there... Related to this, can you suggest any implementation of an amino acid pairwise alignment algorithm, in Python, that is self-contained (i.e. doesn't depend on some other program)? Best, João [...]
Rodrigues http://nmr.chem.uu.nl/~joao From p.j.a.cock at googlemail.com Wed Oct 26 10:58:09 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 26 Oct 2011 11:58:09 +0100 Subject: [Biopython] Pairwise alignment - is it a generic function? In-Reply-To: References: Message-ID: On Wed, Oct 26, 2011 at 11:30 AM, João Rodrigues wrote: > Hello all, > > A friend of mine was interested in a small simple alignment script for > amino acids, to which I recommended having a look at Biopython. We found the > pairwise2 module but we're a bit puzzled. Does it align "any" sequence, aa > or nucleotides? I don't see any scoring matrix referenced there... It should work on proteins, just pass in the appropriate scoring matrix. > Related to this, can you suggest any implementation of an amino acid > pairwise alignment algorithm, in Python, that is self-contained > (i.e. doesn't depend on some other program)? Well, Bio.pairwise2 has a faster C implementation and a fall-back slower pure Python implementation (used under Jython/PyPy/etc), which might answer your needs. Peter
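P.S. In case a concrete example helps, this is roughly what passing a protein scoring matrix to pairwise2 looks like. Untested as written, and the BLOSUM62 matrix plus the gap penalties below are just illustrative choices, not a recommendation:

from Bio import pairwise2
from Bio.pairwise2 import format_alignment
from Bio.SubsMat.MatrixInfo import blosum62

# Global alignment of two short protein sequences using BLOSUM62,
# with a gap open penalty of -10 and a gap extend penalty of -0.5.
alignments = pairwise2.align.globalds("KEVLA", "EVLHA", blosum62, -10, -0.5)
print(format_alignment(*alignments[0]))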
From from.d.putto at gmail.com Wed Oct 26 15:11:05 2011 From: from.d.putto at gmail.com (Sheila the angel) Date: Wed, 26 Oct 2011 17:11:05 +0200 Subject: [Biopython] downloading genome Protein table Message-ID: Hi All, I am facing some problems downloading the genome and other information. For example, I did a query on NCBI genome for NC_008390 On clicking the results you can get the following link http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=ShowDetailView&TermToSearch=19840 On my web-browser I can save this page as File> Save as >out.html Furthermore I want to download the Protein table also http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=Retrieve&dopt=Protein+Table&list_uids=19840 I want to do this for many IDs. Is there any simple way in Biopython? Thanks in Advance -- Cheers Sheila From p.j.a.cock at googlemail.com Wed Oct 26 15:27:37 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 26 Oct 2011 16:27:37 +0100 Subject: [Biopython] downloading genome Protein table In-Reply-To: References: Message-ID: On Wed, Oct 26, 2011 at 4:11 PM, Sheila the angel wrote: > Hi All, > > I am facing some problems downloading the genome and other information. > For example, I did a query on NCBI genome for NC_008390 > On clicking the results you can get the following link > > http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=ShowDetailView&TermToSearch=19840 > On my web-browser I can save this page as File> Save as >out.html > > Furthermore I want to download the Protein table also > http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=Retrieve&dopt=Protein+Table&list_uids=19840 > > I want to do this for many IDs. Is there any simple way in Biopython? > > Thanks in Advance Hmm, some of that might be available by Bio.Entrez, not sure though. For the protein table I would personally work with the *.ptt files from the NCBI FTP site, e.g. ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid13490/CP000441.ptt or: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid58303/NC_008391.ptt The FTP links are on the page of the first URL you gave.
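Each *.ptt file is just a plain text, tab separated table - a title line, a "NNNN proteins" line, a header row, then one row per CDS - so once you have one on disk a few lines of Python will read it. A rough, untested sketch (the exact column names are from memory, so check them against a real file first):

def parse_ptt(filename):
    # Skip the title line and the protein count line, read the header row,
    # then yield one dictionary per CDS row keyed by the column names
    # (roughly: Location, Strand, Length, PID, Gene, Synonym, Code, COG, Product).
    with open(filename) as handle:
        lines = handle.read().splitlines()
    header = lines[2].split("\t")
    for line in lines[3:]:
        if line.strip():
            yield dict(zip(header, line.split("\t")))

for row in parse_ptt("NC_008390.ptt"):
    print("%s\t%s\t%s" % (row["PID"], row["Gene"], row["Product"]))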
You can download all the "bacteria" *.ptt files as a tar ball, ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.ptt.tar.gz Typically I work from the GenBank files instead (*.gbk rather than *.ptt) Peter From mictadlo at gmail.com Thu Oct 27 01:14:16 2011 From: mictadlo at gmail.com (Mic) Date: Thu, 27 Oct 2011 11:14:16 +1000 Subject: [Biopython] changing record attributes while iterating In-Reply-To: References: Message-ID: Thank you, it is working. I would like to put the sequence ids in a list in the following way: >>> c = (i.id for i in b) SyntaxError: invalid syntax >>> c[0] Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: 'generator' object is not subscriptable How is it possible to generate a list of sequence ids? Thank you in advance. On Thu, Oct 20, 2011 at 7:58 PM, Peter Cock wrote: > Hi Mic, > > You should have started a new thread with a new title... > > On Thu, Oct 20, 2011 at 10:38 AM, Mic wrote: > > Hello, > > would it be possible to use a generator expression for the following > code? > > from Bio import SeqIO > > fa_parser = SeqIO.parse(open("../test_files/test.fasta", "rU"), "fasta") > > sequence = fa_parser.next().seq > > for record in fa_parser: > > sequence += 3*'N' + record.seq > > > > print sequence > > Input: > >>1 > > 1111111 > >>2 > > 2222222 > >>3 > > 3333333 > >>4 > > 4444444 > > Output: > > 1111111NNN2222222NNN3333333NNN4444444 > > Thank you in advance. > > Sure, how about this: > > from Bio import SeqIO > fa_parser = SeqIO.parse("../test_files/test.fasta", "fasta") > print ('N' * 3).join(str(rec.seq) for rec in fa_parser) > > Peter > From p.j.a.cock at googlemail.com Thu Oct 27 08:35:24 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Oct 2011 09:35:24 +0100 Subject: [Biopython] changing record attributes while iterating In-Reply-To: References: Message-ID: On Thu, Oct 27, 2011 at 2:14 AM, Mic wrote: > Thank you, it is working. > I would like to put the sequence ids in a list in the following way: >>>> c = (i.id for i in b) > SyntaxError: invalid syntax The above would be a generator expression, and requires Python 2.4. It shouldn't cause a SyntaxError unless there is some mistake I'm not seeing (or you missed something in the copy & paste). >>>> c[0] > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > TypeError: 'generator' object is not subscriptable > How is it possible to generate a list of sequence ids? You need to create a list (e.g. using a list comprehension) rather than a generator, probably: c = [i.id for i in b] c[0] = "Fred" Peter From from.d.putto at gmail.com Thu Oct 27 10:47:04 2011 From: from.d.putto at gmail.com (Sheila the angel) Date: Thu, 27 Oct 2011 12:47:04 +0200 Subject: [Biopython] downloading genome Protein table In-Reply-To: References: Message-ID: The problem is I have only the RefSeq ID like NC_008390 and I don't have the Protein table ID (in this case CP000441.ptt) so I can't download the .ptt file (as in the ftp url ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid13490/CP000441.ptt ) Also, not all the RefSeq IDs I have belong to 'Bacteria'. So for ID NC_004314 (just an example) I have to change the ftp url to ftp://ftp.ncbi.nih.gov/genomes/Protozoa/Plasmodium_falciparum/NC_004314.ptt Downloading the *.gbk file may be an option (but later I need to convert them into a protein table), so I tried this: from Bio import Entrez Entrez.email = "from.d.putto at gmail.com" handle = Entrez.efetch(db="genome", id="NC_008390", rettype="gbk") print handle.read() The output shows me 'Nothing has been found' I am not sure in which database I should look for an id like NC_008390. Moreover, later on I need to convert the 'gbk' file to .ptt (or extract the protein information) On Wed, Oct 26, 2011 at 5:27 PM, Peter Cock wrote: > On Wed, Oct 26, 2011 at 4:11 PM, Sheila the angel > wrote: > > Hi All, > > > > I am facing some problems downloading the genome and other information. > > For example, I did a query on NCBI genome for NC_008390 > > On clicking the results you can get the following link > > > > http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=ShowDetailView&TermToSearch=19840 > > On my web-browser I can save this page as File> Save as >out.html > > > > Furthermore I want to download the Protein table also > > http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=Retrieve&dopt=Protein+Table&list_uids=19840 > > > > I want to do this for many IDs. Is there any simple way in Biopython? > > > > Thanks in Advance > > Hmm, some of that might be available by Bio.Entrez, not sure though. > > For the protein table I would personally work with the *.ptt files from > the NCBI FTP site, e.g. > > ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid13490/CP000441.ptt > > or: > > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid58303/NC_008391.ptt > > The FTP links are on the page of the first URL you gave. You can download > all the "bacteria" *.ptt files as a tar ball, > > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.ptt.tar.gz > > Typically I work from the GenBank files instead (*.gbk rather than > *.ptt) > > Peter > From p.j.a.cock at googlemail.com Thu Oct 27 13:14:10 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Oct 2011 14:14:10 +0100 Subject: [Biopython] downloading genome Protein table In-Reply-To: References: Message-ID: On Thu, Oct 27, 2011 at 11:47 AM, Sheila the angel wrote: > The problem is I have only the RefSeq ID like NC_008390 and I don't have > the Protein table ID (in this case CP000441.ptt) so I can't download the .ptt > file (as in the ftp url > ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid13490/CP000441.ptt > ) Given your identifiers, use ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ rather than ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/ - in this case, ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid58303/NC_008390.ptt > > Also, not all the RefSeq IDs I have belong to 'Bacteria'. > Then the NCBI won't have them on the Bacterial FTP sites, and I don't think they will provide *.ptt files for them. > So for ID > NC_004314 (just an example) I have to change the ftp url to > ftp://ftp.ncbi.nih.gov/genomes/Protozoa/Plasmodium_falciparum/NC_004314.ptt > > Downloading the *.gbk file may be an option (but later I need to convert > them into a protein table) Just download *all* the bacterial protein tables as the tar ball, it's only 120MB compressed: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.ptt.tar.gz Then you can just search locally for a file by name etc.
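For example, something like this (an untested sketch - the member names inside all.ptt.tar.gz are a guess, so print a few of them first to see how the archive is really laid out) would pull out a single table by its RefSeq accession without unpacking everything:

import tarfile

# Untested sketch: find one *.ptt protein table inside the NCBI tar ball.
accession = "NC_008390"
tar = tarfile.open("all.ptt.tar.gz", "r:gz")
members = [name for name in tar.getnames() if name.endswith(accession + ".ptt")]
for name in members:
    handle = tar.extractfile(name)
    print(name)
    print(handle.read()[:300])  # peek at the start of the table as a sanity check
tar.close()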
> so I tried this: > from Bio import Entrez > Entrez.email = "from.d.putto at gmail.com" > handle = Entrez.efetch(db="genome", id="NC_008390", rettype="gbk") > print handle.read() > The output shows me 'Nothing has been found' > I am not sure in which database I should look for an id like NC_008390. Try it on the NCBI website for all databases, http://www.ncbi.nlm.nih.gov/sites/gquery?term=NC_008390 You'll see it does match the genome database, but also the nucleotide database. In this case you want the sequence as a GenBank file so use the nucleotide database. > Moreover, later on I need to convert the 'gbk' file to .ptt (or extract the protein > information) The Biopython GenBank parser can do that - life is easier with bacterial genomes as there are (almost) no nasty join(...) locations to deal with. Peter From devaniranjan at gmail.com Thu Oct 27 19:16:07 2011 From: devaniranjan at gmail.com (George Devaniranjan) Date: Thu, 27 Oct 2011 15:16:07 -0400 Subject: [Biopython] weighted sampling of a dictionary Message-ID: Hi, I am not sure if this question is more suitable for biopython or a python forum. I have the following dictionary. dict ={'YLE': 6, 'QYL': 36, 'PTD': 32, 'AGG': 145, 'QYG': 34, 'QYD': 34, 'AGD': 188, 'QYS': 35, 'AGS': 177, 'AGA': 154, 'QYA': 23, 'AGL': 16, 'LAU': 1, 'PTA': 7, 'AGY': 7, 'QYY': 19, 'QYE': 6, 'PAT': 57, 'QYT': 28, 'AGT': 10, 'QYQ': 34, 'AGQ': 140, 'QYP': 32, 'AGP': 167, 'TAT': 31, 'SGS': 174, 'TAP': 18, 'YLP': 49, 'TAQ': 23, 'UQE': 5, 'UAQ': 9, 'UAT': 8, 'UAE': 7, 'TAD': 1, 'TAG': 15, 'TAA': 20, 'TAS': 1, 'YUP': 1, 'TAL': 45, 'ALU': 20, 'PEP': 14, 'UAG': 6, 'EAL': 16, 'SYY': 36, 'EAS': 35, 'SYT': 29, 'EAA': 16, 'SYQ': 13, 'EAG': 28} The keys are the different amino acid triplets (all possible triplets extracted from a culled list of PDB), the numbers next to them are the frequencies with which they occur. I was wondering if there is a way in biopython/python to sample them at the frequency indicated by the number next to the key.
> > I have only given a snippet of the triplet dictionary, the entire dictionary > has about 1400 key entries. > > I would appreciate any help in this matter --thank you very much. > > George > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > you could try one of these (presumably the class-based one is the king) http://eli.thegreenplace.net/2010/01/22/weighted-random-generation-in-python/ you'll have something like: import operator aminos, weights = zip(*sorted(adict.items(), key=operator.itemgetter(1))) amino_gen = WeightedRandomGenerator(weights) for i in xrange(nsims): idx = amino_gen.next() rand_aa = aminos[idx] From jmtc21 at bath.ac.uk Thu Oct 27 20:33:18 2011 From: jmtc21 at bath.ac.uk (Jaime Tovar) Date: Thu, 27 Oct 2011 21:33:18 +0100 Subject: [Biopython] expat and biopython 1.58 problem on linux x64 Message-ID: <4EA9C00E.5080509@bath.ac.uk> Hello all, I'm having trouble while updating my biopython to 1.58. I'm having exactly the same problem with the xml parser as described in this old post: http://www.biopython.org/pipermail/biopython/2011-May/007263.html Sadly I may have to use the Entrez module, so it would make me happy to have the thing running if possible. I'm installing on an openSUSE 11.3 x64 box. I did an rpm install of biopython from the openSUSE science repo. So I have 1.58-1.2 installed. Python 1.6.5-3.5.1 for x64 expat 2.0.1-98.1 x64 Tried to install both by hand from the tar.gz and using an rpm but the problem persists. Any help will be greatly appreciated. Thanks!!! Jaime. From winda002 at student.otago.ac.nz Thu Oct 27 20:52:00 2011 From: winda002 at student.otago.ac.nz (David Winter) Date: Fri, 28 Oct 2011 09:52:00 +1300 Subject: [Biopython] weighted sampling of a dictionary In-Reply-To: References: Message-ID: <20111028095200.20435ub1z2jexy0g@www.studentmail.otago.ac.nz> Hi George, I was actually doing this yesterday :) The function I came up with takes two lists: import random def weighted_sample(population, weights): """ Sample from a population, given provided weights """ if len(population) != len(weights): raise ValueError('Lengths of population and weights do not match') normal_weights = [ float(w)/sum(weights) for w in weights ] val = random.random() running_total = 0 for index, weight in enumerate(normal_weights): running_total += weight if val < running_total: return population[index] Which seems to do the trick: population = ['AAU' ,'AAC', 'AAG'] weights = [2,5,3] sample = [weighted_sample(population, weights) for _ in range(1000)] sample.count('AAC') #should be about 500 If that's too slow, check out numpy's random.multinomial() function. I haven't tested this, but this should get you the number of times you get each codon from 1000 "draws": import numpy as np codons, weights = zip(*codon_dict.items()) denom = sum(weights) normalised_weights = [float(w)/denom for w in weights] counts = np.random.multinomial(1000, normalised_weights) Cheers, David Quoting George Devaniranjan : > Hi, > > I am not sure if this question is more suitable for biopython or a python > forum. > > > I have the following dictionary.
> > dict ={'YLE': 6, 'QYL': 36, 'PTD': 32, 'AGG': 145, 'QYG': 34, 'QYD': 34, > 'AGD': 188, 'QYS': 35, 'AGS': 177, 'AGA': 154, 'QYA': 23, 'AGL': 16, 'LAU': > 1, 'PTA': 7, ' > AGY': 7, 'QYY': 19, 'QYE': 6, 'PAT': 57, 'QYT': 28, 'AGT': 10, 'QYQ': 34, > 'AGQ': 140, 'QYP': 32, 'AGP': 167, 'TAT': 31, 'SGS': 174, 'TAP': 18, 'YLP': > 49, 'TA > Q': 23, 'UQE': 5, 'UAQ': 9, 'UAT': 8, 'UAE': 7, 'TAD': 1, 'TAG': 15, 'TAA': > 20, 'TAS': 1, 'YUP': 1, 'TAL': 45, 'ALU': 20, 'PEP': 14, 'UAG': 6, 'EAL': > 16, 'SY > Y': 36, 'EAS': 35, 'SYT': 29, 'EAA': 16, 'SYQ': 13, 'EAG': 28} > > The keys are the different amino acid triplets (all possible triplets > extracted from a culled list of PDB), the numbers next to them are the > frequency that they occour in. > > I was wondering if there is a way in biopython/python to sample them at the > frequecy indicated by the no's next to the key. > > I have only given a snippet of the triplet dictionary, the entire dictionary > has about 1400 key entries. > > I would appreciate any help in this matter --thank you very much. > > George > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From p.j.a.cock at googlemail.com Fri Oct 28 09:54:09 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 28 Oct 2011 10:54:09 +0100 Subject: [Biopython] expat and biopython 1.58 problem on linux x64 In-Reply-To: <4EA9C00E.5080509@bath.ac.uk> References: <4EA9C00E.5080509@bath.ac.uk> Message-ID: On Thu, Oct 27, 2011 at 9:33 PM, Jaime Tovar wrote: > Hello all, > > I'm having troubles while updating my biopython to 1.58. > > I'm having exactly the same problem with the xml parser as described in this > old post: > > http://www.biopython.org/pipermail/biopython/2011-May/007263.html > > Sadly I may have to use the entrez module so it will make me happy to have > the thing running if possible. > > I'm installing in a opensuse 11.3 x64 box > Did a rpm install of biopython from the opensuse science repo. So I have > 1.58-1.2 installed. > Python 1.6.5-3.5.1 for x64 > expat 2.0.1-98.1 x64 > > Tried to install both by hand from the tar.gz and using an rpm but the > problem persists. > > Any help will be greatly appreciated. > > Thanks!!! > > Jaime. Hmm. Can you try installing the latest code from git please? You can grab it via the git command line tool, or use github to download the latest code as a tar ball: http://biopython.org/wiki/SourceCode Specifically I'm hoping this change will fix the segmentation fault (assuming http://bugs.python.org/issue4877 is to blame): https://github.com/biopython/biopython/commit/59f9cbd2ad14ebd05d5864033ff0c7ef7a8f0daa Previously: $ python Python 2.6.6 (r266:84292, Aug 31 2010, 16:21:14) [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import Entrez >>> handle = open("NEWS") >>> handle.close() >>> Entrez.read(handle) Segmentation fault With the fix: $ python Python 2.6.6 (r266:84292, Aug 31 2010, 16:21:14) [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2 Type "help", "copyright", "credits" or "license" for more information. 
>>> from Bio import Entrez >>> handle = open("NEWS") >>> handle.close() >>> Entrez.read(handle) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "Bio/Entrez/__init__.py", line 270, in read record = handler.read(handle) File "Bio/Entrez/Parser.py", line 167, in read raise IOError("Can't parse a closed handle") IOError: Can't parse a closed handle Assuming you start seeing the IOError instead, the question would shift to what is going on with your network settings (e.g. look at proxies). If the segmentation fault doesn't go away we'll need to think again. Peter From bioinformaticsing at gmail.com Fri Oct 28 11:46:07 2011 From: bioinformaticsing at gmail.com (ning luwen) Date: Fri, 28 Oct 2011 19:46:07 +0800 Subject: [Biopython] Memory leak while parsing gbk files? Message-ID: Hi, I have tried to parse about 2000+ gbk files using SeqIO.parse, but the memory usage goes up quickly. (On my desktop with 4 GB of memory it runs out of memory after a number of iterations, and on a workstation the memory used went as high as 100 GB+ and kept increasing.) for temp_name in file_names: # file_names: list of paths of gbk files f=open(temp_name) for x in SeqIO.parse(f,'genbank'): print x.name,len(x.features) f.close() I guess there may be a memory leak while parsing gbk files. -- regards, luwen ning From p.j.a.cock at googlemail.com Fri Oct 28 11:52:33 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 28 Oct 2011 12:52:33 +0100 Subject: [Biopython] Memory leak while parsing gbk files? In-Reply-To: References: Message-ID: On Fri, Oct 28, 2011 at 12:46 PM, ning luwen wrote: > Hi, > I have tried to parse about 2000+ gbk files using SeqIO.parse, > but the memory usage goes up quickly. (On my desktop with 4 GB of memory > it runs out of memory after a number of iterations, and on a workstation > the memory used went as high as 100 GB+ and kept increasing.) > > for temp_name in file_names: # file_names: list of paths of gbk files > f=open(temp_name) > for x in SeqIO.parse(f,'genbank'): > print x.name,len(x.features) > f.close() > > I guess there may be a memory leak while parsing gbk files. Which version of Python are you using? Try calling garbage collection explicitly: import gc from Bio import SeqIO for temp_name in file_names: # file_names: list of paths of gbk files f=open(temp_name) for x in SeqIO.parse(f,'genbank'): print x.name,len(x.features) f.close() gc.collect() I expect that to fix the increasing memory usage. If it does, then it isn't a memory leak. Peter From p.j.a.cock at googlemail.com Fri Oct 28 13:21:42 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 28 Oct 2011 14:21:42 +0100 Subject: [Biopython] expat and biopython 1.58 problem on linux x64 In-Reply-To: <4EAAA9A0.3010906@bath.ac.uk> References: <4EA9C00E.5080509@bath.ac.uk> <4EAAA9A0.3010906@bath.ac.uk> Message-ID: On Fri, Oct 28, 2011 at 2:09 PM, Jaime Tovar wrote: > Got the tarball for latest, > > but: > > ... > ~/tmp/biop/biopython-biopython-59f9cbd/Tests> python test_Entrez.py > Test error handling when presented with Fasta non-XML data ... ok > Test error handling when presented with GenBank non-XML data ... ok > Test parsing XML returned by EFetch, Nucleotide database (first test) ... > ERROR > Test parsing XML returned by EFetch, Protein database ... ERROR > Test parsing XML returned by EFetch, OMIM database ... ERROR > Test parsing XML returned by EFetch, PubMed database (first test) ... > Segmentation fault > > Can we try to find where exactly is the problem? > > Thanks for the help.
> J OK, so it doesn't look like the problem with closed handles, http://bugs.python.org/issue4877 Although to be sure please try the example in my last email, from Bio import Entrez handle = open("NEWS") handle.close() Entrez.read(handle) (You can use any file that exists). Beyond that I only have questions rather than answers for now. My guess is something is broken on your system with conflicting versions of expat, see for example: http://www.dscpl.com.au/wiki/ModPython/Articles/ExpatCausingApacheCrash What does this give you, and does it match expat 2.0.1 which you said earlier was installed? import pyexpat print pyexpat.version_info Can you try to get a strack trace? Alternatively, you could disable individual tests which trigger the segmentation fault one by one and then we can attempt to spot any commonalities. e.g. The segmentation fault is from: "Test parsing XML returned by EFetch, PubMed database (first test)" which is method test_pubmed1, rename it to xtest_test_pubmed1 (or anything that doesn't start test_*) and it will be skipped. Peter From devaniranjan at gmail.com Fri Oct 28 13:23:22 2011 From: devaniranjan at gmail.com (George Devaniranjan) Date: Fri, 28 Oct 2011 09:23:22 -0400 Subject: [Biopython] weighted sampling of a dictionary In-Reply-To: <20111028095200.20435ub1z2jexy0g@www.studentmail.otago.ac.nz> References: <20111028095200.20435ub1z2jexy0g@www.studentmail.otago.ac.nz> Message-ID: Thanks guys for all your suggestions -I am going to try these out. Best, George On Thu, Oct 27, 2011 at 4:52 PM, David Winter wrote: > Hi George, > > I was actually doing this yesterday :) > > The function I came up with takes two lists: > > import random > > def weighted_sample(population, weights): > """ Sample from a population, given provided weights """ > if len(population) != len(weights): > raise ValueError('Lengths of population and weights do not match') > normal_weights = [ float(w)/sum(weights) for w in weights ] > val = random.random() > running_total = 0 > for index, weight in enumerate(normal_weights): > running_total += weight > if val < running_total: > return population[index] > > Which seems to do the trick: > > population = ['AAU' ,'AAC', 'AAG'] > weights = [2,5,3] > sample = [weighted_sample(population, weights) for _ in range(1000)] > sample.count('AAC') #should be about 500 > > If that's too slow, check out numpy's random.multinomial() function. > > I haven't tested this, but this should get you the number of times you get > each codon from 1000 "draws": > > import numpy as np > > codons, weights = codon_dict.items() > denom = sum(weights) > normalised_weights = [float(w)/denom for w in weights] > np.random.multinomial(codons, weights, 1000) > > Cheers, > David > > > > Quoting George Devaniranjan : > > Hi, >> >> I am not sure if this question is more suitable for biopython or a python >> forum. >> >> >> I have the following dictionary. 
>> >> dict ={'YLE': 6, 'QYL': 36, 'PTD': 32, 'AGG': 145, 'QYG': 34, 'QYD': 34, >> 'AGD': 188, 'QYS': 35, 'AGS': 177, 'AGA': 154, 'QYA': 23, 'AGL': 16, >> 'LAU': >> 1, 'PTA': 7, ' >> AGY': 7, 'QYY': 19, 'QYE': 6, 'PAT': 57, 'QYT': 28, 'AGT': 10, 'QYQ': 34, >> 'AGQ': 140, 'QYP': 32, 'AGP': 167, 'TAT': 31, 'SGS': 174, 'TAP': 18, >> 'YLP': >> 49, 'TA >> Q': 23, 'UQE': 5, 'UAQ': 9, 'UAT': 8, 'UAE': 7, 'TAD': 1, 'TAG': 15, >> 'TAA': >> 20, 'TAS': 1, 'YUP': 1, 'TAL': 45, 'ALU': 20, 'PEP': 14, 'UAG': 6, 'EAL': >> 16, 'SY >> Y': 36, 'EAS': 35, 'SYT': 29, 'EAA': 16, 'SYQ': 13, 'EAG': 28} >> >> The keys are the different amino acid triplets (all possible triplets >> extracted from a culled list of PDB), the numbers next to them are the >> frequency that they occour in. >> >> I was wondering if there is a way in biopython/python to sample them at >> the >> frequecy indicated by the no's next to the key. >> >> I have only given a snippet of the triplet dictionary, the entire >> dictionary >> has about 1400 key entries. >> >> I would appreciate any help in this matter --thank you very much. >> >> George >> ______________________________**_________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/**mailman/listinfo/biopython >> >> > > > From p.j.a.cock at googlemail.com Mon Oct 31 11:27:31 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 31 Oct 2011 11:27:31 +0000 Subject: [Biopython] expat and biopython 1.58 problem on linux x64 In-Reply-To: References: <4EA9C00E.5080509@bath.ac.uk> <4EAAA9A0.3010906@bath.ac.uk> Message-ID: On Fri, Oct 28, 2011 at 2:21 PM, Peter Cock wrote: > > OK, so it doesn't look like the problem with closed handles, > http://bugs.python.org/issue4877 > Hi Jaime, Was there any sign of an expat version mismatch? That does seem like the most likely problem (Python expecting one thing, the library providing another). Another guess was we could be reusing the parser object (which apparently is not allowed), although the unit tests don't seem to do this: http://bugs.python.org/issue6676 http://bugs.python.org/issue12829 Peter
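P.S. If it helps pin down a version mismatch, a quick check along these lines (just a rough diagnostic sketch), run with the same Python that segfaults, will print which expat library the interpreter is actually bound to - compare that against the 2.0.1 system package - and will show whether a trivial parse already crashes without Biopython involved:

import sys
from xml.parsers import expat

# Rough diagnostic sketch: report the expat library Python is linked against,
# then try a trivial parse to see if the crash happens even without Biopython.
print("Python: %s" % sys.version)
print("EXPAT_VERSION: %s" % expat.EXPAT_VERSION)
print("version_info: %s" % (expat.version_info,))

parser = expat.ParserCreate()
result = parser.Parse("<root><child/></root>", True)
print("Trivial parse returned: %s" % result)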