From tiagoantao at gmail.com  Mon Oct  3 18:12:18 2011
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 3 Oct 2011 23:12:18 +0100
Subject: [Biopython] VCF parser
Message-ID:

Hi,

I wonder if there is a VCF parser in either Python or Java? Either I am
being dumb at searching (probably) or nothing exists?

Thanks,
Tiago

-- 
"If you want to get laid, go to college. If you want an education, go
to the library." - Frank Zappa

From bala.biophysics at gmail.com  Tue Oct  4 04:05:36 2011
From: bala.biophysics at gmail.com (Bala subramanian)
Date: Tue, 4 Oct 2011 10:05:36 +0200
Subject: [Biopython] changing record attributes while iterating
Message-ID:

Friends,
I have a fasta file. I need to modify the record id by adding a suffix to
it. So i used SeqRecord (the code attached below). It is working fine but i
would like to know if there is any simple way to do that. ie. if i can
change the record attributes while iterating through the fasta with
SeqIO.parse itself. I tried something like following but i couldnt get what
i wanted.

new_list=[]
for record in SeqIO.parse(open(argv[1], "rU"), "fasta"):
    record.id=record.id + '_suffix'
    new_list.append(record)

Hence i used SeqRecord to do the modification
----------------------------------------------------------------------------------------------------
#!/usr/bin/env python
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
from sys import argv

new_list=[]
for record in SeqIO.parse(open(argv[1], "rU"), "fasta"):
    seq=str(record.seq)
    newrec=SeqRecord(Seq(seq),id=record.id+"_suffix",name='',description='')
    new_list.append(newrec)

output_handle = open(raw_input('Enter the output file:'), 'w')
SeqIO.write(new_list, output_handle, "fasta")
output_handle.close()

From p.j.a.cock at googlemail.com  Tue Oct  4 04:24:08 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 4 Oct 2011 09:24:08 +0100
Subject: [Biopython] changing record attributes while iterating
In-Reply-To:
References:
Message-ID:

On Tue, Oct 4, 2011 at 9:05 AM, Bala subramanian wrote:
> Friends,
> I have a fasta file. I need to modify the record id by adding a suffix to
> it. So i used SeqRecord (the code attached below). It is working fine but i
> would like to know if there is any simple way to do that. ie. if i can
> change the record attributes while iterating through the fasta with
> SeqIO.parse itself. I tried something like following but i couldnt get what
> i wanted.
>
> new_list=[]
> for record in SeqIO.parse(open(argv[1], "rU"), "fasta"):
>     record.id=record.id + '_suffix'
>     new_list.append(record)

The above looks fine, although depending on the rest of your script a
big list might be a bad idea (too much memory) and an iterator based
approach may be preferable. If as in the rest of your example you just
need to do this for output, perhaps:

#!/usr/bin/env python
from Bio import SeqIO
from sys import argv

def rename(record):
    """Modified record in place AND returns it."""
    record.id += '_suffix'
    return record

#This is a generator expression:
records = (rename(r) for r in SeqIO.parse(argv[1], "fasta"))

output_filename = raw_input('Enter the output file:')
SeqIO.write(records, output_filename, "fasta")

The alternative you showed was wasteful, creating lots of new objects
to no benefit.
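
One detail worth checking with either version: when Bio.SeqIO parses a
FASTA file, record.description keeps the whole original title line (old
id included), and on output the FASTA writer combines record.id with
record.description. So after renaming only the id, the new header can
end up as ">seq1_suffix seq1 original description", with the old id
repeated. A small variant of rename() that keeps the two in sync (just
a sketch, reusing the '_suffix' example above):

def rename(record):
    """Add a suffix to the id and keep the description consistent."""
    old_id = record.id
    record.id = old_id + '_suffix'
    # If the description starts with the old id, strip it so the
    # FASTA writer does not repeat it after the new id:
    if record.description.startswith(old_id):
        record.description = record.description[len(old_id):].lstrip()
    return record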
Peter

From nanatrapnest at hotmail.it  Wed Oct  5 11:07:44 2011
From: nanatrapnest at hotmail.it (Nana Trapnest)
Date: Wed, 5 Oct 2011 15:07:44 +0000
Subject: [Biopython] StructureBuilder
Message-ID:

Hello,
is it possible with structure builder copy all a protein and change atoms
coord??? How can I do this??
Thanks to all of you!
Stefania

From anaryin at gmail.com  Wed Oct  5 12:02:30 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Wed, 5 Oct 2011 18:02:30 +0200
Subject: [Biopython] StructureBuilder
In-Reply-To:
References:
Message-ID:

Hello Stefania,

It should be possible to copy the entire protein yes, but I would rather
use deepcopy to create a fully new Structure object and manipulate that
one. Something along the lines of:

import copy

[ ... Parse your structure to s...]

s_copy = copy.deepcopy(s)
for atom in s_copy.get_atoms():
    *here use either atom.transform or just modify atom.coord*

Cheers,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao


2011/10/5 Nana Trapnest

>
> Hello,
> is it possible with structure builder copy all a protein and change atoms
> coord??? How can I do this??
> Thanks to all of you!
> Stefania
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From dilara.ally at gmail.com  Wed Oct  5 19:21:29 2011
From: dilara.ally at gmail.com (Dilara Ally)
Date: Wed, 05 Oct 2011 16:21:29 -0700
Subject: [Biopython] error with entrez id code
Message-ID: <4E8CE679.5050107@gmail.com>

Hi All

I've written a program to identify Entrez gene ids from a blastall that I
performed. The code is as follows:

from Bio import SeqIO
from Bio import Entrez
import os
import os.path
import re
import csv

dirname1="/Users/dally/Desktop/BlastFiles/annotate_me/"
dirname2="/Users/dally/Desktop/BlastFiles/annotated/"
allfiles=os.listdir(dirname1)
fanddir=[os.path.join(dirname1,fname) for fname in allfiles]

OutFileName="Contig_annotation.csv"
c=csv.writer(open(os.path.join(dirname2,OutFileName),"wb"))

for f in fanddir:
    print f
    InFile=open(f,'rU')
    LineNumber=0
    for Line in InFile:
        print LineNumber#, ':', Line
        ElementList=Line.split('\t')
        geneid=ElementList[1]
        #print geneid
        Sections=geneid.split('|')
        NewID=Sections[3]
        from Bio import Entrez
        from Bio import SeqFeature
        Entrez.email = "dally at projects.sdsu.edu"
        handle=Entrez.efetch(db="nucleotide", id=NewID,rettype="gb") # rettype="gb" is GenBank format or XML format retmode="xml"
        record=SeqIO.read(handle,"genbank")
        handle.close()
        #print record.id
        lineage=record.annotations["taxonomy"]
        c.writerow([ElementList[0],ElementList[1],ElementList[2],ElementList[3],ElementList[4],ElementList[5],ElementList[6],ElementList[7],ElementList[8],
                    ElementList[9],ElementList[10], NewID, record.id, record.description,
                    record.annotations["source"], lineage[0], lineage[1],lineage[2],
                    record.annotations["keywords"], ])
        LineNumber=LineNumber+1
    InFile.close()

The gene identifier looks like this: gi|2252639|gb|AC002292.1|AC002292.
But I'm only interested in the fourth component (AC002292.1). It runs
through a file with approximately 8000-10000 identifiers and then extracts
information from the associated genbank file.
The code seemed to run fine on my first file for the first 1287 lines but then I got this error > raceback (most recent call last): > File "Ally_EntrezID_Search_Final_Script.py", line 38, in > record=SeqIO.read(handle,"genbank") > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", > line 604, in read > first = iterator.next() > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", > line 532, in parse > for r in i: > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 440, in parse_records > record = self.parse(handle, do_features) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 423, in parse > if self.feed(handle, consumer, do_features): > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 400, in feed > misc_lines, sequence_string = self.parse_footer() > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 921, in parse_footer > line = self.handle.readline() > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", > line 447, in readline > data = self._sock.recv(self._rbufsize) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", > line 533, in read > return self._read_chunked(amt) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", > line 586, in _read_chunked > value.append(self._safe_read(amt)) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", > line 637, in _safe_read > raise IncompleteRead(''.join(s), amt) > httplib.IncompleteRead: IncompleteRead(707 bytes read, 3147 more expected) I'm new to python and biopython programming. So any advice would be extremely appreciated. Thanks. Dilara From p.j.a.cock at googlemail.com Thu Oct 6 03:43:49 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Oct 2011 08:43:49 +0100 Subject: [Biopython] error with entrez id code In-Reply-To: <4E8CE679.5050107@gmail.com> References: <4E8CE679.5050107@gmail.com> Message-ID: On Thursday, October 6, 2011, Dilara Ally wrote: > Hi All > > I've written a program to identify Entrez gene ids from a blastall that I performed. The code is as follows: > > from Bio import SeqIO > from Bio import Entrez > ... 
> > The code seemed to run fine on my first file for the first 1287 lines but then I got this error > >> raceback (most recent call last): >> File "Ally_EntrezID_Search_Final_Script.py", line 38, in >> record=SeqIO.read(handle,"genbank") >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 604, in read >> first = iterator.next() >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 532, in parse >> for r in i: >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 440, in parse_records >> record = self.parse(handle, do_features) >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 423, in parse >> if self.feed(handle, consumer, do_features): >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 400, in feed >> misc_lines, sequence_string = self.parse_footer() >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 921, in parse_footer >> line = self.handle.readline() >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 447, in readline >> data = self._sock.recv(self._rbufsize) >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 533, in read >> return self._read_chunked(amt) >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 586, in _read_chunked >> value.append(self._safe_read(amt)) >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 637, in _safe_read >> raise IncompleteRead(''.join(s), amt) >> httplib.IncompleteRead: IncompleteRead(707 bytes read, 3147 more expected) > > I'm new to python and biopython programming. So any advice would be extremely appreciated. Is it always the same record that breaks? If so, what is the ID so we can try it out. If not, then it looks like a random network error, maybe you can stick a try/except in to refetch the data? Peter From animesh.agrawal at anu.edu.au Thu Oct 6 06:25:08 2011 From: animesh.agrawal at anu.edu.au (Animesh Agrawal) Date: Thu, 06 Oct 2011 21:25:08 +1100 Subject: [Biopython] Using bioPython and bioSQL In-Reply-To: <7770fe573faa2.4e8d81ae@anu.edu.au> References: <76b0a56038ee5.4e8d8031@anu.edu.au> <7660b9533ee66.4e8d806f@anu.edu.au> <75a084db398e1.4e8d80ac@anu.edu.au> <7660d30c39ced.4e8d80ea@anu.edu.au> <77109413397e8.4e8d8127@anu.edu.au> <7660dda43f476.4e8d8165@anu.edu.au> <77c0dae63f99c.4e8d81a2@anu.edu.au> <7770fe573faa2.4e8d81ae@anu.edu.au> Message-ID: <7710edf23d45a.4e8e1cb4@anu.edu.au> Hi All,I am trying to develop a interface for a local sequence depository in my lab. Using biopython cookbook examples I have been able to populate the database. But to query the database I want to create an interface so all other members in my lab can access it. I have no experience in doing this kind of development. I need some advice on best way of doing it and if there are already developed modules in biopython which can help me in attaining my objectives. 
Cheers Animesh Animesh Agrawal PhD Scholar The John Curtin School of Medical Research Australian National University Canberra, Australia From p.j.a.cock at googlemail.com Thu Oct 6 06:39:57 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Oct 2011 11:39:57 +0100 Subject: [Biopython] Using bioPython and bioSQL In-Reply-To: <7710edf23d45a.4e8e1cb4@anu.edu.au> References: <76b0a56038ee5.4e8d8031@anu.edu.au> <7660b9533ee66.4e8d806f@anu.edu.au> <75a084db398e1.4e8d80ac@anu.edu.au> <7660d30c39ced.4e8d80ea@anu.edu.au> <77109413397e8.4e8d8127@anu.edu.au> <7660dda43f476.4e8d8165@anu.edu.au> <77c0dae63f99c.4e8d81a2@anu.edu.au> <7770fe573faa2.4e8d81ae@anu.edu.au> <7710edf23d45a.4e8e1cb4@anu.edu.au> Message-ID: On Thu, Oct 6, 2011 at 11:25 AM, Animesh Agrawal wrote: > Hi All,I am trying to develop a interface for a local sequence depository > in my lab. Using biopython cookbook examples I have been able to > populate the database. But to query the database I want to create an > interface so all other members in my lab can access it. I have no > experience in doing this kind of development. I need some advice > on best way of doing it and if there are already developed modules > in biopython which can help me in attaining my objectives. > Cheers > Animesh Hi Animesh, Do you mean some kind of web interface? Would you just need this to be read only? You can use GBrowse with BioSQL, but I believe CHADO is better supported as the schema. CHADO is also a better choice if you want users to be able to edit the annotation. http://gmod.org/wiki/Chado_-_Getting_Started Peter From sdavis2 at mail.nih.gov Thu Oct 6 06:51:20 2011 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 6 Oct 2011 06:51:20 -0400 Subject: [Biopython] Using bioPython and bioSQL In-Reply-To: <7710edf23d45a.4e8e1cb4@anu.edu.au> References: <76b0a56038ee5.4e8d8031@anu.edu.au> <7660b9533ee66.4e8d806f@anu.edu.au> <75a084db398e1.4e8d80ac@anu.edu.au> <7660d30c39ced.4e8d80ea@anu.edu.au> <77109413397e8.4e8d8127@anu.edu.au> <7660dda43f476.4e8d8165@anu.edu.au> <77c0dae63f99c.4e8d81a2@anu.edu.au> <7770fe573faa2.4e8d81ae@anu.edu.au> <7710edf23d45a.4e8e1cb4@anu.edu.au> Message-ID: Hi, Animesh. How do you want folks to query the database? Web? Command-line? Are the queries limited in scope or do you want to provide something fully general? Sean On Thu, Oct 6, 2011 at 6:25 AM, Animesh Agrawal wrote: > Hi All,I am trying to develop a interface for a local sequence depository in my lab. Using biopython cookbook examples I have been able to populate the database. But to query the database I want to create an interface so all other members in my lab can access it. I have no experience in doing this kind of development. I need some advice on best way of doing it and if there are already developed modules in biopython which can help me in attaining my objectives. > Cheers > Animesh > Animesh Agrawal > PhD Scholar > The John Curtin School of Medical Research > Australian National University > Canberra, Australia > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From elisa.sechi85 at hotmail.it Thu Oct 6 06:43:25 2011 From: elisa.sechi85 at hotmail.it (Elisa sechi) Date: Thu, 6 Oct 2011 12:43:25 +0200 Subject: [Biopython] help for overwrite a pdb file In-Reply-To: References: Message-ID: Hi! All ! I'm contacting you in order to ask help about Biopython. 
I'm using python, I have extract the atoms coordinates of a protein from a
pdb file and I have used quaternion in order to rotate the coordinates.
I have put its in a new matrix but now the problem is: how do I save the
cartesian coordinates in a pdb file??? Do I have to create a new structure
with the use of builder structure Class??
I ask you if there is a way to overwrite the new cartesian coordinates in
the old pdb file that i have used.
Please help me!!!
Thank you very much!
Elisa
bye

From anaryin at gmail.com  Thu Oct  6 07:01:28 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 6 Oct 2011 13:01:28 +0200
Subject: [Biopython] help for overwrite a pdb file
In-Reply-To:
References:
Message-ID:

Hello Elisa,

You should use PDBIO to generate a new structure file. If you have already
transformed the coordinates, it's pretty simple:

from Bio.PDB import PDBIO

io = PDBIO()
io.set_structure(your_structure)
io.save('new_structure.pdb')

Cheers,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao


2011/10/6 Elisa sechi

>
>
> Hi! All !
> I'm contacting you in order to ask help about Biopython.
> I'm using python, I have extract the atoms coordinates of a protein from a
> pdb file and I have used quaternion in order to rotate the coordinates.
> I have put its in a new matrix but now the problem is: how do I save the
> cartesian coordinates in a pdb file??? Do I have to create a new structure
> with the use of builder structure Class??
> I ask you if there is a way to overwrite the new cartesian coordinates in
> the old pdb file that i have used.
> Please help me!!!
> Thank you very much!
> Elisa
> bye
>
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From p.j.a.cock at googlemail.com  Thu Oct  6 07:02:57 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Oct 2011 12:02:57 +0100
Subject: [Biopython] help for overwrite a pdb file
In-Reply-To:
References:
Message-ID:

On Thu, Oct 6, 2011 at 11:43 AM, Elisa sechi wrote:
>
> Hi! All !
> I'm contacting you in order to ask help about Biopython.
> I'm using python, I have extract the atoms coordinates of a protein from a pdb file and I have used quaternion in order to rotate the coordinates.
> I have put its in a new matrix but now the problem is: how do I save the cartesian coordinates in a pdb file??? Do I have to create a new structure with the use of builder structure Class??
> I ask you if there is a way to overwrite the new cartesian coordinates in the old pdb file that i have used.
> Please help me!!!
> Thank you very much!
> Elisa
> bye

There's an example here which rotates models in a PDB file and saves
the output:
http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/

It is not using quaternions for the rotation, but otherwise it should
be helpful.
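
For the rotation case discussed above, a minimal sketch of the whole
round trip with Bio.PDB (parse, rotate every atom, write a new file with
PDBIO). The 3x3 matrix below is only an illustrative z-axis rotation
built with numpy; with a quaternion you would first convert it to a
rotation matrix and use that instead. File names are placeholders, and
note that Atom.transform() applies the matrix with the coordinates as a
row vector (coord . rot + tran):

import numpy
from Bio.PDB import PDBParser, PDBIO

structure = PDBParser().get_structure("prot", "old.pdb")

angle = numpy.pi / 2.0  # example: 90 degrees about the z axis
rotation = numpy.array([[numpy.cos(angle), -numpy.sin(angle), 0.0],
                        [numpy.sin(angle),  numpy.cos(angle), 0.0],
                        [0.0,               0.0,              1.0]])
translation = numpy.array([0.0, 0.0, 0.0], "f")

# Apply the transformation to every atom in the structure:
for atom in structure.get_atoms():
    atom.transform(rotation, translation)

io = PDBIO()
io.set_structure(structure)
io.save("rotated.pdb")  # writing back over old.pdb would also work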
Peter From animesh.agrawal at anu.edu.au Thu Oct 6 07:23:39 2011 From: animesh.agrawal at anu.edu.au (Animesh Agrawal) Date: Thu, 06 Oct 2011 22:23:39 +1100 Subject: [Biopython] Using bioPython and bioSQL In-Reply-To: <77109ef23fc49.4e8d8f9e@anu.edu.au> References: <76b0a56038ee5.4e8d8031@anu.edu.au> <7660b9533ee66.4e8d806f@anu.edu.au> <75a084db398e1.4e8d80ac@anu.edu.au> <7660d30c39ced.4e8d80ea@anu.edu.au> <77109413397e8.4e8d8127@anu.edu.au> <7660dda43f476.4e8d8165@anu.edu.au> <77c0dae63f99c.4e8d81a2@anu.edu.au> <7770fe573faa2.4e8d81ae@anu.edu.au> <7710edf23d45a.4e8e1cb4@anu.edu.au> <77c0838039ccd.4e8d8edc@anu.edu.au> <7660c1093accc.4e8d8f1a@anu.edu.au> <7710e8403ab11.4e8d8f58@anu.edu.au> <77b0ce493f67b.4e8d8f61@anu.edu.au> <77109ef23fc49.4e8d8f9e@anu.edu.au> Message-ID: <7710e50538fb7.4e8e2a6b@anu.edu.au> Hi Peter,Thanks a lot for your reply.Yes I want web interface and I need it to be read only. I'll check out GBrowse and CHADO. Cheers, Animesh On 10/06/11, Peter Cock wrote: > On Thu, Oct 6, 2011 at 11:25 AM, Animesh Agrawal > wrote: > > Hi All,I am trying to develop a interface for a local sequence depository > > in my lab. Using biopython cookbook examples I have been able to > > populate the database. But to query the database I want to create an > > interface so all other members in my lab can access it. I have no > > experience in doing this kind of development. I need some advice > > on best way of doing it and if there are already developed modules > > in biopython which can help me in attaining my objectives. > > Cheers > > Animesh > > Hi Animesh, > > Do you mean some kind of web interface? Would you just need > this to be read only? > > You can use GBrowse with BioSQL, but I believe CHADO is better > supported as the schema. CHADO is also a better choice if you > want users to be able to edit the annotation. > http://gmod.org/wiki/Chado_-_Getting_Started > > Peter > > From animesh.agrawal at anu.edu.au Thu Oct 6 07:27:51 2011 From: animesh.agrawal at anu.edu.au (Animesh Agrawal) Date: Thu, 06 Oct 2011 22:27:51 +1100 Subject: [Biopython] Using bioPython and bioSQL In-Reply-To: <7680a8613e5c9.4e8d9094@anu.edu.au> References: <76b0a56038ee5.4e8d8031@anu.edu.au> <7660b9533ee66.4e8d806f@anu.edu.au> <75a084db398e1.4e8d80ac@anu.edu.au> <7660d30c39ced.4e8d80ea@anu.edu.au> <77109413397e8.4e8d8127@anu.edu.au> <7660dda43f476.4e8d8165@anu.edu.au> <77c0dae63f99c.4e8d81a2@anu.edu.au> <7770fe573faa2.4e8d81ae@anu.edu.au> <7710edf23d45a.4e8e1cb4@anu.edu.au> <77b080583927a.4e8d9019@anu.edu.au> <76e0a1b23d252.4e8d9057@anu.edu.au> <7680a8613e5c9.4e8d9094@anu.edu.au> Message-ID: <7660a5e03929b.4e8e2b67@anu.edu.au> Hi Sean,I definitely want a web interface. Queries should be limited in scope. Cheers, Animesh On 10/06/11, Sean Davis wrote: > Hi, Animesh. > > How do you want folks to query the database?? Web?? Command-line?? Are > the queries limited in scope or do you want to provide something fully > general? > > Sean > > On Thu, Oct 6, 2011 at 6:25 AM, Animesh Agrawal > wrote: > > Hi All,I am trying to develop a interface for a local sequence depository in my lab. Using biopython cookbook examples I have been able to populate the database. But to query the database I want to create an interface so all other members in my lab can access it. I have no experience in doing this kind of development. I need some advice on best way of doing it and if there are already developed modules in biopython which can help me in attaining my objectives. 
> > Cheers
> > Animesh
> > Animesh Agrawal
> > PhD Scholar
> > The John Curtin School of Medical Research
> > Australian National University
> > Canberra, Australia
> > _______________________________________________
> > Biopython mailing list - Biopython at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython
> >
> >

From sdavis2 at mail.nih.gov  Thu Oct  6 07:50:07 2011
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 6 Oct 2011 07:50:07 -0400
Subject: [Biopython] Using bioPython and bioSQL
In-Reply-To: <7660a5e03929b.4e8e2b67@anu.edu.au>
References: <76b0a56038ee5.4e8d8031@anu.edu.au> <7660b9533ee66.4e8d806f@anu.edu.au> <75a084db398e1.4e8d80ac@anu.edu.au> <7660d30c39ced.4e8d80ea@anu.edu.au> <77109413397e8.4e8d8127@anu.edu.au> <7660dda43f476.4e8d8165@anu.edu.au> <77c0dae63f99c.4e8d81a2@anu.edu.au> <7770fe573faa2.4e8d81ae@anu.edu.au> <7710edf23d45a.4e8e1cb4@anu.edu.au> <77b080583927a.4e8d9019@anu.edu.au> <76e0a1b23d252.4e8d9057@anu.edu.au> <7680a8613e5c9.4e8d9094@anu.edu.au> <7660a5e03929b.4e8e2b67@anu.edu.au>
Message-ID:

Hi, Animesh.

Depending on the types of queries, building small CGI scripts or even a
small web application can be quite useful. Most recently, I have been
using the flask micro-framework ( http://flask.pocoo.org/ ) for building
such small applications. If you can figure out how to do the queries that
you want with biopython or SQL, then it isn't too hard to translate that
to a couple of web pages, one for gathering input from the user and a
second for delivering results.

Sean

On Thu, Oct 6, 2011 at 7:27 AM, Animesh Agrawal wrote:
> Hi Sean, I definitely want a web interface. Queries should be limited in scope.
> Cheers,
> Animesh
>
> On 10/06/11, Sean Davis wrote:
>> Hi, Animesh.
>>
>> How do you want folks to query the database? Web? Command-line? Are
>> the queries limited in scope or do you want to provide something fully
>> general?
>>
>> Sean
>>
>> On Thu, Oct 6, 2011 at 6:25 AM, Animesh Agrawal
>> wrote:
>> > Hi All, I am trying to develop an interface for a local sequence depository in my lab. Using biopython cookbook examples I have been able to populate the database. But to query the database I want to create an interface so all other members in my lab can access it. I have no experience in doing this kind of development. I need some advice on best way of doing it and if there are already developed modules in biopython which can help me in attaining my objectives.
>> > Cheers
>> > Animesh
>> > Animesh Agrawal
>> > PhD Scholar
>> > The John Curtin School of Medical Research
>> > Australian National University
>> > Canberra, Australia
>> > _______________________________________________
>> > Biopython mailing list - Biopython at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/biopython
>> >
>>
>>
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
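
As a concrete starting point for the two-page approach Sean describes,
here is a very small sketch using Flask together with Biopython's
BioSeqDatabase module. The driver, connection details and the
sub-database name ("my_seqs") are placeholders for whatever was used
when the sequences were loaded into BioSQL:

from flask import Flask, request
from BioSQL import BioSeqDatabase

app = Flask(__name__)

def lookup(accession):
    # Connection settings below are placeholders for your own setup.
    server = BioSeqDatabase.open_database(driver="MySQLdb", user="reader",
                                          passwd="secret", host="localhost",
                                          db="bioseqdb")
    db = server["my_seqs"]  # name of the sub-database used at load time
    return db.lookup(accession=accession)

@app.route("/")
def search_form():
    # Page one: gather input from the user.
    return ('<form action="/record"><input name="acc" />'
            '<input type="submit" value="Fetch" /></form>')

@app.route("/record")
def show_record():
    # Page two: deliver the result (read only).
    record = lookup(request.args["acc"])
    return "<pre>%s %s\n%s</pre>" % (record.id, record.description,
                                     str(record.seq))

if __name__ == "__main__":
    app.run()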
From tiagoantao at gmail.com  Thu Oct  6 16:14:56 2011
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 6 Oct 2011 21:14:56 +0100
Subject: [Biopython] UniprotXML dbReference parser
Message-ID:

Hi,

Do I understand wrongly or the UniprotXML parser for

simply ignores the "property type" information?
If so, is there any way to get access to the XML raw data
(so that I can grep it)?

Thanks a lot,
Tiago

-- 
"If you want to get laid, go to college. If you want an education, go
to the library."
- Frank Zappa

From p.j.a.cock at googlemail.com  Thu Oct  6 18:26:19 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Oct 2011 23:26:19 +0100
Subject: [Biopython] UniprotXML dbReference parser
In-Reply-To:
References:
Message-ID:

2011/10/6 Tiago Antão :
> Hi,
>
> Do I understand wrongly or the UniprotXML parser for
>
>
>
> simply ignores the "property type" information?

Probably... I think it emulates the very simple list of
db:acc strings produced by the GenBank parser etc,
but try dir(...) on it. Although PDB references look
to get part of their information dumped in the
record's annotations dictionary.

I guess we could return a list of DB reference objects
which happen to act like the old style string for back
compatibility.

> If so, is there any way to get access to the XML raw data
> (so that I can grep it)?

Are you asking for XML parsing library recommendations?
Or you could hack the SeqIO parser instead... i've CC'd
Andrea who wrote it in case he can add something
more practical.

Peter

From tiagoantao at gmail.com  Thu Oct  6 18:43:01 2011
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 6 Oct 2011 23:43:01 +0100
Subject: [Biopython] UniprotXML dbReference parser
In-Reply-To:
References:
Message-ID:

Hi,

2011/10/6 Peter Cock :
> Probably... I think it emulates the very simple list of
> db:acc strings produced by the GenBank parser etc,
> but try dir(...) on it. Although PDB references look
> to get part of their information dumped in the
> record's annotations dictionary.

The problem is that the Gene ID is inside (thus it never gets
returned). We get the protein ID only.

> Are you asking for XML parsing library recommendations?
> Or you could hack the SeqIO parser instead... i've CC'd
> Andrea who wrote it in case he can add something
> more practical.

I just used xml.parsers.expat. Not a problem for myself, but the fact
is that the uniprot xml parser does not return the whole information
that it is there.

-- 
"If you want to get laid, go to college. If you want an education, go
to the library."
- Frank Zappa

From p.j.a.cock at googlemail.com  Fri Oct  7 03:22:49 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 7 Oct 2011 08:22:49 +0100
Subject: [Biopython] changing record attributes while iterating
In-Reply-To:
References:
Message-ID:

On Friday, October 7, 2011, Michal wrote:
> Hello,
> Does your code with generator save the whole file in the
> memory or does it read each entry and save it immediately?
> Thank you in advance.

Using a generator expression like that only one SeqRecord is in
memory at a time. It goes through the input FASTA one record at a
time, renames it, saves it immediately.

Peter

P.S. list CC'd

From dilara.ally at gmail.com  Fri Oct  7 13:34:24 2011
From: dilara.ally at gmail.com (Dilara Ally)
Date: Fri, 07 Oct 2011 10:34:24 -0700
Subject: [Biopython] error with entrez id code
In-Reply-To:
References: <4E8CE679.5050107@gmail.com>
Message-ID: <4E8F3820.1030002@gmail.com>

> Is it always the same record that breaks? If so, what is the ID so we
> can try it out.
>
> If not, then it looks like a random network error, maybe you can stick
> a try/except in to refetch the data?

Hi Peter

Individually the identifier has no problem calling up the record, but the
problem seems to be in the loop. As a newbie, what is a try/except?

Thanks.
Dilara On 10/6/11 12:43 AM, Peter Cock wrote: > > > On Thursday, October 6, 2011, Dilara Ally > wrote: > > Hi All > > > > I've written a program to identify Entrez gene ids from a blastall > that I performed. The code is as follows: > > > > from Bio import SeqIO > > from Bio import Entrez > > ... > > > > The code seemed to run fine on my first file for the first 1287 > lines but then I got this error > > > >> raceback (most recent call last): > >> File "Ally_EntrezID_Search_Final_Script.py", line 38, in > >> record=SeqIO.read(handle,"genbank") > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", > line 604, in read > >> first = iterator.next() > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", > line 532, in parse > >> for r in i: > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 440, in parse_records > >> record = self.parse(handle, do_features) > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 423, in parse > >> if self.feed(handle, consumer, do_features): > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 400, in feed > >> misc_lines, sequence_string = self.parse_footer() > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 921, in parse_footer > >> line = self.handle.readline() > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", > line 447, in readline > >> data = self._sock.recv(self._rbufsize) > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", > line 533, in read > >> return self._read_chunked(amt) > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", > line 586, in _read_chunked > >> value.append(self._safe_read(amt)) > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", > line 637, in _safe_read > >> raise IncompleteRead(''.join(s), amt) > >> httplib.IncompleteRead: IncompleteRead(707 bytes read, 3147 more > expected) > > > > I'm new to python and biopython programming. So any advice would be > extremely appreciated. > > > Peter From p.j.a.cock at googlemail.com Sat Oct 8 10:10:12 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 8 Oct 2011 15:10:12 +0100 Subject: [Biopython] error with entrez id code In-Reply-To: <4E8F3820.1030002@gmail.com> References: <4E8CE679.5050107@gmail.com> <4E8F3820.1030002@gmail.com> Message-ID: On Fri, Oct 7, 2011 at 6:34 PM, Dilara Ally wrote: > Is it always the same record that breaks? If so, what is the ID so we can > try it out. > > If not, then it looks like a random network error, maybe you can stick a > try/except in to refetch the data? > > Hi Peter > > Individually the identifier has no problem calling up the record, but the > problem seems to be in the loop.? As a newbie, what is a try/except? > > Thanks. By try/except I mean use Python's error handling mechanism to spot when there is a network error. See: http://docs.python.org/tutorial/errors.html e.g. Something like this would give you a second chance. 
Note that exception httplib.IncompleteRead is a subclass of the more
general HTTPException, see:
http://docs.python.org/library/httplib.html

from httplib import HTTPException
try:
    handle=Entrez.efetch(db="nucleotide", id=NewID,rettype="gb") # rettype="gb" is GenBank format or XML format retmode="xml"
    record=SeqIO.read(handle,"genbank")
    handle.close()
except HTTPException, e:
    print "Network problem: %s" % e
    print "Second (and final) attempt..."
    handle=Entrez.efetch(db="nucleotide", id=NewID,rettype="gb") # rettype="gb" is GenBank format or XML format retmode="xml"
    record=SeqIO.read(handle,"genbank")
    handle.close()

If the second attempt fails, you'll get an exception like before.
There are more elegant ways to write that (with less repetition, and
making multiple retries easy), but I'm trying to keep this simple as
an introductory example.

Peter

From chaouki.amir at gmail.com  Sun Oct  9 15:37:42 2011
From: chaouki.amir at gmail.com (amir chaouki)
Date: Sun, 9 Oct 2011 20:37:42 +0100
Subject: [Biopython] clustal header
Message-ID:

Hi,
i want to to do a multiple sequence alignment with the clustalw method but i
keep getting this error:

", ".join(known_headers)))
ValueError: a is not a known CLUSTAL header: CLUSTAL, PROBCONS, MUSCLE

my sequence file contains this > as headers for every sequence name, so
what are the compatible headers?

-- 
*Amir Chaouki*

From p.j.a.cock at googlemail.com  Sun Oct  9 16:09:00 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 9 Oct 2011 21:09:00 +0100
Subject: [Biopython] clustal header
In-Reply-To:
References:
Message-ID:

On Sunday, October 9, 2011, amir chaouki wrote:
> Hi,
> i want to to do a multiple sequence alignment with the clustalw method but i
> keep getting this error: ", ".join(known_headers)))
> ValueError: a is not a known CLUSTAL header: CLUSTAL, PROBCONS, MUSCLE
>
> my sequence file contains this > as headers for every sequence name, so
> what are the compatible headers?

Hi Amir,

That error message can come from trying to parse a non-clustal file as
if it were a clustal file. Perhaps you tried to parse a fasta file?

If you showed the code that caused this message, it would be easier to
help you,

Peter

From sdavis2 at mail.nih.gov  Wed Oct 12 14:54:13 2011
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Wed, 12 Oct 2011 14:54:13 -0400
Subject: [Biopython] [OT][Job] Functional genomic analysis of cancer/RNAi screening
Message-ID:

Functional genomic analysis of cancer/RNAi screening
NATIONAL CANCER INSTITUTE, BETHESDA, MD

The laboratory of Dr. Natasha Caplen, within the Genetics Branch, CCR, NCI,
is seeking postdoctoral candidates for a project focused on functional
genomic analysis using RNAi screening approaches. We are looking for a
highly motivated candidate who has received their PhD within the last year
to contribute to our on-going studies applying RNAi based loss-of-function
approaches to probe cancer gene function. The successful candidate will be
expected to perform both bench and computational-based studies and will be
involved in projects requiring the development and analysis of large-scale
RNAi screening data focused on the biology of oncogenic transcription
factors. The candidate will be involved in the design and employment of
RNAi screens (up to genome-wide scale) and analysis of the data generated
through application of state of the art computational methodologies.
This large-scale RNAi screening data will also be assessed in the context of other relevant datasets such as next generation sequencing, epigenetic, gene expression and drug sensitivity datasets. The computational analyses will ultimately be used to systematically build hypotheses to identify key pathways and networks underlying the specifics of the cancer biology and the candidate will then be expected to experimentally test these hypotheses. Dr. Caplen?s laboratory conducts both independent and collaborative studies and the successful candidate will have the opportunity to interact with NCI and NIH investigators studying many different cancer biology questions using RNAi based technologies. Currently we are involved in RNAi studies relevant to the biology and treatment of several pediatric cancers, colorectal, breast and prostate cancer. For further information please see Dr. Caplen?s website at http://ccr.cancer.gov/staff/staff.asp?profileid=9035. Requirements: The candidate must have a Ph.D in biological sciences with additional training in computational biology or bioinformatics. Previous experience in molecular biology including mammalian cell culture and assessment of gene expression is required, as, too, is experience in programming skills in languages such as perl, python, R, java, or c++. As the position involves the need to discuss scientific data and strategy with members of the existing team and with collaborators, oral and written fluency in the English language is required. Applicants should email a cover letter describing research experience and interests, curriculum vitae, bibliography, and contact information for three references (including the current supervisor) to Dr. Natasha Caplen at ncaplen at mail.nih.gov. Please include ?PD2011? in the email subject line. From paul at tonair.de Thu Oct 13 06:26:54 2011 From: paul at tonair.de (paul at tonair.de) Date: Thu, 13 Oct 2011 12:26:54 +0200 Subject: [Biopython] Bio.PDB question - from PQR to an "extended" PDB Message-ID: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> dear biopython users, i'm trying to read in a pqr file with the Bio.PDB module. In a PQR file, the atom charge and atom radius are stored instead of the occupancy & B-factor. Apparently, the negative charge values make trouble while reading in. (1) Is there a way to tweak Bio.PDB module to read in a PQR file? More to the background of this task: I would like to keep the charge and the radius in order to output a PDB file with more than 80 lines. The pdb-like output looks like this: ATOM 1 C1 UNL _0001_000 9.643 1.777 18.433 1.700 0.000 BK____M000 The text "BK____M000" refers to a conformer of a side chain and is needed by a PoissonBoltzmann named mcce (multi-conformation continuum electrostatics). (2) Can Bio.PDB generate such an output file? Cheers & Thanks, Paul From p.j.a.cock at googlemail.com Thu Oct 13 06:40:14 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Oct 2011 11:40:14 +0100 Subject: [Biopython] Bio.PDB question - from PQR to an "extended" PDB In-Reply-To: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> References: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> Message-ID: On Thu, Oct 13, 2011 at 11:26 AM, wrote: > > dear biopython users, > > i'm trying to read in a pqr file with the > Bio.PDB module. In a PQR file, the atom charge and atom radius are > stored instead of the occupancy & B-factor. > Apparently, the negative > charge values make trouble while reading in. 
> > (1) Is there a way to > tweak Bio.PDB module to read in a PQR file? If a negative B-factor was the only issue, probably yes. > More to the background of > this task: I would like to keep the charge and the radius in order to > output a PDB file with more than 80 lines. You mean more than 80 columns? i.e. Longer than PDB norms? > The pdb-like output looks > like this: > ATOM 1 C1 UNL _0001_000 9.643 1.777 18.433 1.700 0.000 > BK____M000 > The text "BK____M000" refers to a conformer of a side chain > and is needed by a PoissonBoltzmann named mcce (multi-conformation > continuum electrostatics). > > (2) Can Bio.PDB generate such an output > file? Not yet ;) > Cheers & Thanks, > Paul It would help if you could share some sample data (URLs) and links to this PDB-like PQR file format's specification (assuming it has one). Regards, Peter From anaryin at gmail.com Thu Oct 13 06:43:06 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 13 Oct 2011 12:43:06 +0200 Subject: [Biopython] Bio.PDB question - from PQR to an "extended" PDB In-Reply-To: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> References: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> Message-ID: Hello Paul, Straight from Pymol :) Bio.PDB cannot read PQR files as is, but since the format is quite similar to the PDB it should be easy to convert. The first step is to know if you want to develop a converter too (you will need the forcefield atomic charges and radius for that) or just a "parser". Parsing is easy, it's a matter of adapting the current SMCRA objects and PDBParser. Converting requires much more and is probably superfluous given the PDB2PQR software. Some important information on the format: http://www.poissonboltzmann.org/file-formats/biomolecular-structurw/pqr I think the best course of action is to add a PQRParser class that has different residue properties than the regular PDB. For example, occupancy and bfactor are not used at all.. Let me know what you think, Cheers, Jo?o From paul at tonair.de Thu Oct 13 07:51:42 2011 From: paul at tonair.de (paul at tonair.de) Date: Thu, 13 Oct 2011 13:51:42 +0200 Subject: [Biopython] Bio.PDB question - from PQR to an "extended" PDB In-Reply-To: References: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> Message-ID: Dear all, a PQR functionality within biopython would be great! Regarding the output of extended PDB files I would like to write: There is no detailed description on such files: http://www.sci.ccny.cuny.edu/~mcce/doc/running_mcce2.php [1] see chapter 3.2.4: step2_out.pdb: input structure file of step 3 in mcce extended pdb format extended means: the conformer is added beyond the element located somewhere around column 80. Is there any workaround with the currect biopython release to read in PQR and dump out such an extended PDB file? Cheers & thanks, Paul On Thu, 13 Oct 2011 12:48:22 +0200, Mikael Trellet wrote: This PQRParser class would be a nice add to Bio.PDB indeed, and shouldn't take a very long time to develop. Could work on it with you Joao, if the need exists obviously. Regards, Mikael On Thu, Oct 13, 2011 at 12:43 PM, Jo?o Rodrigues wrote: Hello Paul, Straight from Pymol :) Bio.PDB cannot read PQR files as is, but since the format is quite similar to the PDB it should be easy to convert. The first step is to know if you want to develop a converter too (you will need the forcefield atomic charges and radius for that) or just a "parser". Parsing is easy, it's a matter of adapting the current SMCRA objects and PDBParser. 
Converting requires much more and is probably superfluous given the PDB2PQR software. Some important information on the format: http://www.poissonboltzmann.org/file-formats/biomolecular-structurw/pqr [3] I think the best course of action is to add a PQRParser class that has different residue properties than the regular PDB. For example, occupancy and bfactor are not used at all.. Let me know what you think, Cheers, Jo?o _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org [4] http://lists.open-bio.org/mailman/listinfo/biopython [5] -- Mikael TRELLET, Computational structural biology group, Utrecht University Bijvoet Center, The Netherlands Links: ------ [1] http://www.sci.ccny.cuny.edu/~mcce/doc/running_mcce2.php [2] mailto:anaryin at gmail.com [3] http://www.poissonboltzmann.org/file-formats/biomolecular-structurw/pqr [4] mailto:Biopython at lists.open-bio.org [5] http://lists.open-bio.org/mailman/listinfo/biopython From anaryin at gmail.com Thu Oct 13 08:27:54 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 13 Oct 2011 14:27:54 +0200 Subject: [Biopython] Bio.PDB question - from PQR to an "extended" PDB In-Reply-To: References: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> Message-ID: Dear Paul, You would have to do two things: 1. First, modify PDBParser so that it reads more characters in the occupancy and bfactor fields 2. Modify PDBIO so that it is able to output a field beyond the element OR just create your own function to print information of a residue and use it instead of PDBIO. How do you get the conformer information? From paul at tonair.de Fri Oct 14 08:00:04 2011 From: paul at tonair.de (paul at tonair.de) Date: Fri, 14 Oct 2011 14:00:04 +0200 Subject: [Biopython] ligand PDB files Message-ID: Dear all, I'm having trouble to read in the attached PDB file - this is my code: " from Bio.PDB import * parser=PDBParser() structure=parser.get_structure("PHA-L","./2w26_lig.pdb") for model in structure: for chain in model: for residue in chain: for atom in residue: print atom " which gives this error: " File "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/PDBParser.py", line 66, in get_structure self._parse(file.readlines()) File "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/PDBParser.py", line 89, in _parse self.trailer=self._parse_coordinates(coords_trailer) File "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/PDBParser.py", line 205, in _parse_coordinates fullname, serial_number, element) File "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/StructureBuilder.py", line 197, in init_atom fullname, serial_number, element) File "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/Atom.py", line 68, in __init__ assert not element or element == element.upper(), element AssertionError: Cl " Does this mean that the PDB parser only recognizes "amino acid-atoms", i.e. a chlorine does not work? 
Cheers & Thanks, Paul -------------- next part -------------- COMPND 2w26_LIG.pdb_0 AUTHOR GENERATED BY OPEN BABEL 2.3.0 ATOM 1 C1 RIV A 1 9.643 1.777 18.433 1.00 0.00 C ATOM 2 N1 RIV A 1 8.303 2.377 18.109 1.00 0.00 N ATOM 3 C2 RIV A 1 10.053 0.667 17.441 1.00 0.00 C ATOM 4 C3 RIV A 1 7.671 2.122 16.881 1.00 0.00 C ATOM 5 O1 RIV A 1 9.768 1.124 16.111 1.00 0.00 O ATOM 6 C4 RIV A 1 8.355 1.223 15.853 1.00 0.00 C ATOM 7 C5 RIV A 1 6.487 4.959 20.981 1.00 0.00 C ATOM 8 C6 RIV A 1 7.333 5.468 19.984 1.00 0.00 C ATOM 9 C7 RIV A 1 6.237 3.551 21.013 1.00 0.00 C ATOM 10 C8 RIV A 1 7.918 4.619 19.048 1.00 0.00 C ATOM 11 C9 RIV A 1 6.837 2.690 20.070 1.00 0.00 C ATOM 12 C10 RIV A 1 7.682 3.222 19.078 1.00 0.00 C ATOM 13 O2 RIV A 1 6.583 2.613 16.630 1.00 0.00 O ATOM 14 N2 RIV A 1 5.906 5.863 21.947 1.00 0.00 N ATOM 15 C11 RIV A 1 5.040 5.543 22.995 1.00 0.00 C ATOM 16 C12 RIV A 1 6.146 7.326 22.000 1.00 0.00 C ATOM 17 O3 RIV A 1 4.690 6.614 23.757 1.00 0.00 O ATOM 18 C13 RIV A 1 5.213 7.787 23.134 1.00 0.00 C ATOM 19 O4 RIV A 1 4.634 4.419 23.228 1.00 0.00 O ATOM 20 C14 RIV A 1 5.924 8.721 24.155 1.00 0.00 C ATOM 21 N3 RIV A 1 7.078 8.136 24.932 1.00 0.00 N ATOM 22 C15 RIV A 1 8.402 8.558 24.672 1.00 0.00 C ATOM 23 S1 RIV A 1 11.131 8.264 25.063 1.00 0.00 S ATOM 24 C16 RIV A 1 11.805 7.503 26.288 1.00 0.00 C ATOM 25 C17 RIV A 1 9.567 8.044 25.466 1.00 0.00 C ATOM 26 C18 RIV A 1 10.794 7.011 27.130 1.00 0.00 C ATOM 27 C19 RIV A 1 9.509 7.324 26.659 1.00 0.00 C ATOM 28 O5 RIV A 1 8.611 9.379 23.797 1.00 0.00 O ATOM 29 Cl1 RIV A 1 13.544 7.302 26.531 1.00 0.00 Cl ATOM 30 H RIV A 1 9.643 1.777 18.433 1.00 0.00 H ATOM 31 H RIV A 1 9.643 1.777 18.433 1.00 0.00 H ATOM 32 H RIV A 1 10.053 0.667 17.441 1.00 0.00 H ATOM 33 H RIV A 1 10.053 0.667 17.441 1.00 0.00 H ATOM 34 H RIV A 1 8.355 1.223 15.853 1.00 0.00 H ATOM 35 H RIV A 1 8.355 1.223 15.853 1.00 0.00 H ATOM 36 H RIV A 1 7.333 5.468 19.984 1.00 0.00 H ATOM 37 H RIV A 1 6.237 3.551 21.013 1.00 0.00 H ATOM 38 H RIV A 1 7.918 4.619 19.048 1.00 0.00 H ATOM 39 H RIV A 1 6.837 2.690 20.070 1.00 0.00 H ATOM 40 H RIV A 1 6.146 7.326 22.000 1.00 0.00 H ATOM 41 H RIV A 1 6.146 7.326 22.000 1.00 0.00 H ATOM 42 H RIV A 1 5.213 7.787 23.134 1.00 0.00 H ATOM 43 H RIV A 1 5.924 8.721 24.155 1.00 0.00 H ATOM 44 H RIV A 1 5.924 8.721 24.155 1.00 0.00 H ATOM 45 H RIV A 1 7.078 8.136 24.932 1.00 0.00 H ATOM 46 H RIV A 1 10.794 7.011 27.130 1.00 0.00 H ATOM 47 H RIV A 1 9.509 7.324 26.659 1.00 0.00 H CONECT 1 3 2 30 31 CONECT 1 CONECT 2 4 1 12 CONECT 3 5 1 32 33 CONECT 3 CONECT 4 6 13 2 CONECT 5 6 3 CONECT 6 5 4 34 35 CONECT 6 CONECT 7 8 9 14 CONECT 8 10 7 36 CONECT 9 11 7 37 CONECT 10 12 8 38 CONECT 11 12 9 39 CONECT 12 2 10 11 CONECT 13 4 CONECT 14 7 16 15 CONECT 15 14 19 17 CONECT 16 14 18 40 41 CONECT 16 CONECT 17 15 18 CONECT 18 16 17 20 42 CONECT 18 CONECT 19 15 CONECT 20 18 21 43 44 CONECT 20 CONECT 21 20 22 45 CONECT 22 28 21 25 CONECT 23 25 24 CONECT 24 23 29 26 CONECT 25 22 23 27 CONECT 26 24 27 46 CONECT 27 25 26 47 CONECT 28 22 CONECT 29 24 CONECT 30 1 CONECT 31 1 CONECT 32 3 CONECT 33 3 CONECT 34 6 CONECT 35 6 CONECT 36 8 CONECT 37 9 CONECT 38 10 CONECT 39 11 CONECT 40 16 CONECT 41 16 CONECT 42 18 CONECT 43 20 CONECT 44 20 CONECT 45 21 CONECT 46 26 CONECT 47 27 MASTER 0 0 0 0 0 0 0 0 47 0 47 0 END From robert.campbell at queensu.ca Fri Oct 14 09:04:22 2011 From: robert.campbell at queensu.ca (Robert Campbell) Date: Fri, 14 Oct 2011 09:04:22 -0400 Subject: [Biopython] ligand PDB files In-Reply-To: References: Message-ID: 
<20111014090422.639e9284@adelie.biochem.queensu.ca>

Dear Paul,

On Fri, 2011-10-14 14:00 EDT, paul at tonair.de wrote:

> Dear all,
> I'm having trouble to read in the attached PDB file - this
> is my code:

Your code is okay. The problem is in your PDB file:

> File
> "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/Atom.py", line
> 68, in __init__
> assert not element or element == element.upper(),
> element
> AssertionError: Cl
> "
> Does this mean that the PDB parser only
> recognizes "amino acid-atoms", i.e. a chlorine does not work?

The chlorine atoms should be "CL" not "Cl" in a proper PDB file.

Cheers,
Rob

-- 
Robert L. Campbell, Ph.D.
Senior Research Associate/Adjunct Assistant Professor
Dept. of Biomedical & Molecular Sciences, Botterell Hall Rm 644
Queen's University, Kingston, ON K7L 3N6  Canada
Tel: 613-533-6821
http://pldserver1.biochem.queensu.ca/~rlc

From paul at tonair.de  Fri Oct 14 09:51:47 2011
From: paul at tonair.de (paul at tonair.de)
Date: Fri, 14 Oct 2011 15:51:47 +0200
Subject: [Biopython] ligand PDB files
In-Reply-To: <20111014090422.639e9284@adelie.biochem.queensu.ca>
References: <20111014090422.639e9284@adelie.biochem.queensu.ca>
Message-ID: <751ac2c9e7bf1a3659f31849565d1122@mail.canobus.com>

Dear Rob,

thank you very much for your help, this fixed the error!!

Cheers,
Paul

>
> Your code is okay. The problem is in your PDB file:
>
>
>> File
>> "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/Atom.py", line
>> 68, in __init__
>> assert not element or element == element.upper(),
>> element
>> AssertionError: Cl
>> "
>> Does this mean that the PDB parser only
>> recognizes "amino acid-atoms", i.e. a chlorine does not work?
>
> The chlorine atoms should be "CL" not "Cl" in a proper PDB file.
>
> Cheers,
> Rob

From jordan.r.willis at Vanderbilt.Edu  Sat Oct 15 16:59:58 2011
From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R)
Date: Sat, 15 Oct 2011 15:59:58 -0500
Subject: [Biopython] Blast DB keeps crashing nodes
Message-ID: <66965B9E-2AD6-4E02-BB8E-2F11A820DCDF@Vanderbilt.Edu>

Hello Biopython,

I was wondering if anyone has worked extensively with the Blast Database
locally.

I am blasting millions of sequences using Biopython as my backend
framework. I am using a high throughput computer cluster to blast each
sequence. Rather than submit two million jobs, I have divided the fasta
files up into 50 or so.

The problem I am facing is a memory issue. I'm not sure, but I think that
the Database is cacheing itself and not clearing before the next sequence
is queried. In that regard, the next job calls upon the database again,
and so on...

The memory builds up until it finally crashes the node.
Has anyone dealt with this issue before?

Thanks,
Jordan

From dilara.ally at gmail.com  Sat Oct 15 17:55:21 2011
From: dilara.ally at gmail.com (Dilara Ally)
Date: Sat, 15 Oct 2011 14:55:21 -0700
Subject: [Biopython] Blast DB keeps crashing nodes
In-Reply-To: <66965B9E-2AD6-4E02-BB8E-2F11A820DCDF@Vanderbilt.Edu>
References: <66965B9E-2AD6-4E02-BB8E-2F11A820DCDF@Vanderbilt.Edu>
Message-ID: <4E9A0149.1000504@gmail.com>

How many hits per sequence have you requested to get back - the default on
the blastall is 250?

I did a blast search on ~600,000 contigs but I set up simultaneous jobs
across 34 nodes. I used only the top 20 hits. Each file had 1000 fasta
formatted sequences and each node was given ~12 files. But I still had to
do it in two parts to get all sequences blasted. I waited until the first
set finished to set up the second blast job. The job finished in 2 days.
Before I ran it on the cluster I tested a single file to see how long and
how much memory it took. The cluster I used had 34 computing nodes, with
16-48 cores and 16-64GB of memory.

Hope that helps.

On 10/15/11 1:59 PM, Willis, Jordan R wrote:
> Hello Biopython,
>
> I was wondering if anyone has worked extensively with the Blast Database locally.
>
> I am blasting millions of sequences using Biopython as my backend framework. I am using a high throughput computer cluster to blast each sequence. Rather than submit two million jobs, I have divided the fasta files up into 50 or so.
>
> The problem I am facing is a memory issue. I'm not sure, but I think that the Database is cacheing itself and not clearing before the next sequence is queried. In that regard, the next job calls upon the database again, and so on...
>
> The memory builds up until it finally crashes the node. Has anyone dealt with this issue before?
>
> Thanks,
> Jordan
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From mictadlo at gmail.com  Mon Oct 17 08:11:12 2011
From: mictadlo at gmail.com (Mic)
Date: Mon, 17 Oct 2011 22:11:12 +1000
Subject: [Biopython] SAM to BAM
Message-ID:

Hello,
Is there a way to convert SAM file to sorted BAM file and generate also BAI
file with pysam?
Thank you in advance.

From p.j.a.cock at googlemail.com  Mon Oct 17 09:06:58 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 17 Oct 2011 14:06:58 +0100
Subject: [Biopython] [Samtools-help] SAM to BAM
In-Reply-To:
References:
Message-ID:

On Mon, Oct 17, 2011 at 1:11 PM, Mic wrote:
> Hello,
> Is there a way to convert SAM file to sorted BAM file and generate also BAI
> file with pysam?
> Thank you in advance.

With samtools at the command line,

samtools view -b -S example.sam | samtools sort - example
samtools index example.bam

I know you can easily call samtools from pysam, not sure if you can do
the pipe trick to avoid extra steps:

samtools view -b -S example.sam > example_unsorted
samtools sort example_unsorted.bam example
rm example_unsorted.bam
samtools index example.bam

Peter
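
For doing the same thing from within pysam (so the intermediate steps
happen in Python rather than the shell), something along these lines
should work -- a sketch only, with placeholder file names, using pysam's
wrappers for the samtools sort and index commands:

import os
import pysam

# SAM -> BAM: copy the records across, reusing the SAM header.
samfile = pysam.Samfile("example.sam", "r")
bamfile = pysam.Samfile("example_unsorted.bam", "wb", template=samfile)
for read in samfile:
    bamfile.write(read)
bamfile.close()
samfile.close()

pysam.sort("example_unsorted.bam", "example")  # writes example.bam
pysam.index("example.bam")                     # writes example.bam.bai
os.remove("example_unsorted.bam")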
From jgrant at smith.edu  Mon Oct 17 09:47:38 2011
From: jgrant at smith.edu (Jessica Grant)
Date: Mon, 17 Oct 2011 09:47:38 -0400
Subject: [Biopython] pdb file question
Message-ID: <541079A3-3C7D-45FF-8717-B1C64C85735F@smith.edu>

Hello,

I am trying to write a script that reproduces the crystal structure of a
protein based on the information in the pdb file. I have gotten kind of
stuck using the SMTRY lines in remark 290. It doesn't seem to contain all
the information I need, at least the results I am getting don't look the
same as when I produce symmetry mates in pymol, for example. Has anyone any
experience with this? Thanks,

Jessica

From anaryin at gmail.com  Mon Oct 17 10:08:54 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Mon, 17 Oct 2011 16:08:54 +0200
Subject: [Biopython] pdb file question
In-Reply-To: <541079A3-3C7D-45FF-8717-B1C64C85735F@smith.edu>
References: <541079A3-3C7D-45FF-8717-B1C64C85735F@smith.edu>
Message-ID:

Hello Jessica,

Are you extracting the symmetry information with Biopython? If so, how are
you using it to generate the other symmetry "members"? Using
atom.transform?

Cheers,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao


2011/10/17 Jessica Grant

> Hello,
>
> I am trying to write a script that reproduces the crystal structure of a
> protein based on the information in the pdb file. I have gotten kind of
> stuck using the SMTRY lines in remark 290. It doesn't seem to contain all
> the information I need, at least the results I am getting don't look the
> same as when I produce symmetry mates in pymol, for example. Has anyone any
> experience with this? Thanks,
>
> Jessica
>
>
> ______________________________**_________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/**mailman/listinfo/biopython
>

From hahj87 at gmail.com  Mon Oct 17 11:03:10 2011
From: hahj87 at gmail.com (=?ISO-8859-1?Q?Joshua_Ismael_Haase_Hern=E1ndez?=)
Date: Mon, 17 Oct 2011 10:03:10 -0500
Subject: [Biopython] is IRC channel at freenode active?
Message-ID:

Hi there, I was around in the IRC channel and the only one there is
Chanserv. I was wondering if the channel has some use.

From mictadlo at gmail.com  Mon Oct 17 23:44:14 2011
From: mictadlo at gmail.com (Mic)
Date: Tue, 18 Oct 2011 13:44:14 +1000
Subject: [Biopython] Segmentation fault
Message-ID:

Hello,
I have tried to generate a subset BAM, but I get a 'Segmentation fault' with
the following code:

from Bio import SeqIO
import pysam
from optparse import OptionParser
import subprocess, os, sys
from multiprocessing import Pool
import functools
import argparse

def GetReferenceInfo(referenceFastaPath):
    referencenames = []
    referencelengths = []

    referenceFastaFile = open(referenceFastaPath)
    for record in SeqIO.parse(referenceFastaFile, "fasta"):
        referencenames.append(record.name)
        referencelengths.append(len(record.seq))
    referenceFastaFile.close()

    return (referencenames, referencelengths)

def GenerateSubsetBAM(bam_filename, ref_name):
    reads = []
    bam_fh = pysam.Samfile(bam_filename, "rb")
    for read in bam_fh.fetch(ref_name):
        reads.append(read)

    print ref_name + ' Done ' + str(len(reads))
    return (ref_name, reads)

def writeBAM(reads, ref_names, ref_lengths, output_BAM):
    #print ref_names
    #print ref_lengths
    #print output_BAM
    #with pysam.Samfile(output_BAM, "wb", referencenames = ref_names, referencelengths = ref_lengths) as bh:
    bh = pysam.Samfile(output_BAM, "wb", referencenames = ref_names, referencelengths = ref_lengths)

    print reads.keys()
    for ref_name in ref_names:
        print ref_name
        for read in reads[ref_name]:
            print read
            #bh.write(read)
        print ref_name + 'Done'

if __name__ == '__main__':
    parser = OptionParser("Usage: %prog -b BAMfile -f new_reference_fasta -o outputBAM")
    parser.add_option("-b", "--BAM", type="string", dest="inputBAMFilepath", help="Specify a BAM file")
    parser.add_option("-f", "--fasta", type="string", dest="fastaFilepath", help="Specify a reference fasta file.")
parser.add_option("-o", "--output", type="string", dest="outputBAMFilepath", help="Specify an output BAM file.") (opts, args) = parser.parse_args() if (opts.inputBAMFilepath is None): print ("\nSpecify a BAM file. eg. -b large.bam\n") parser.print_help() elif not(os.path.exists(opts.inputBAMFilepath)): print ("\nReference BAM file does not exists: " + opts.inputBAMFilepath +"\n") elif (opts.fastaFilepath is None): print ("\nSpecify a reference fasta file. eg. -f Subset.fasta\n") parser.print_help() elif not(os.path.exists(opts.fastaFilepath)): print ("\nReference fasta file does not exists: " + opts.fastaFilepath +"\n") elif os.path.exists(opts.outputBAMFilepath) and not(raw_input(opts.outputBAMFilepath + " exists. Overwrite? (Y/n): ")=='Y'): print ("\nOutput BAM exists. Please specify alternative output file. eg. -o Subset.bam\n") else: print "Read fasta ..." (ref_names, ref_lengths) = GetReferenceInfo(opts.fastaFilepath) print 'Done!' print "creating subset...." pool = Pool() GenerateSubsetBAM_wrapper = functools.partial(GenerateSubsetBAM, opts.inputBAMFilepath) reads = dict(pool.imap_unordered(GenerateSubsetBAM_wrapper, ref_names)) pool.close() print "Done!" print "Writting results to subset BAM file..." writeBAM(reads, ref_names, ref_lengths, opts.outputBAMFilepath) print "Done!" I run the code in the following way: python SubsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bamRead fasta ... Done! creating subset.... chr1 Done 1464 chr2 Done 1806 Done! Writting results to subset BAM file... ['chr2', 'chr1'] chr1 Segmentation fault Thank you in advance. From p.j.a.cock at googlemail.com Tue Oct 18 05:00:47 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 18 Oct 2011 10:00:47 +0100 Subject: [Biopython] [Samtools-help] Segmentation fault In-Reply-To: References: Message-ID: On Tue, Oct 18, 2011 at 4:44 AM, Mic wrote: > Hello, > I have tried to generate a subset BAM, but I get a 'Segmentation fault' with > the following code: > from Bio import SeqIO > import pysam > from optparse import OptionParser > import subprocess, os, sys > from multiprocessing import Pool > import functools > ... I tried this and it seemed to get stuck much earlier. Could you cut down the example a bit by removing the multiprocessing? Peter P.S. Also you can remove the unused "import argparse" line. From mictadlo at gmail.com Tue Oct 18 06:26:06 2011 From: mictadlo at gmail.com (Mic) Date: Tue, 18 Oct 2011 20:26:06 +1000 Subject: [Biopython] [Samtools-help] Segmentation fault In-Reply-To: References: Message-ID: Hello, Thank you for your email. 
I updated the code and find out that print reads['chr1'] #works fine but print reads['chr1'][0] #caused Segmentation fault Please find below the updated code: from Bio import SeqIO import pysam from optparse import OptionParser import subprocess, os, sys from multiprocessing import Pool import functools def GetReferenceInfo(referenceFastaPath): referencenames = [] referencelengths = [] referenceFastaFile = open(referenceFastaPath) for record in SeqIO.parse(referenceFastaFile, "fasta"): referencenames.append(record.name) referencelengths.append(len(record.seq)) referenceFastaFile.close() return (referencenames, referencelengths) def GenerateSubsetBAM(bam_filename, ref_name): reads = [] bam_fh = pysam.Samfile(bam_filename, "rb") for read in bam_fh.fetch(ref_name): reads.append(read) print ref_name + ' Done ' + str(len(reads)) return (ref_name, reads) if __name__ == '__main__': parser = OptionParser("Usage: %prog -b BAMfile -f new_reference_fasta -o outputBAM") parser.add_option("-b", "--BAM", type="string", dest="inputBAMFilepath", help="Specify a BAM file") parser.add_option("-f", "--fasta", type="string", dest="fastaFilepath", help="Specify a reference fasta file.") parser.add_option("-o", "--output", type="string", dest="outputBAMFilepath", help="Specify an output BAM file.") (opts, args) = parser.parse_args() if (opts.inputBAMFilepath is None): print ("\nSpecify a BAM file. eg. -b large.bam\n") parser.print_help() elif not(os.path.exists(opts.inputBAMFilepath)): print ("\nReference BAM file does not exists: " + opts.inputBAMFilepath +"\n") elif (opts.fastaFilepath is None): print ("\nSpecify a reference fasta file. eg. -f Subset.fasta\n") parser.print_help() elif not(os.path.exists(opts.fastaFilepath)): print ("\nReference fasta file does not exists: " + opts.fastaFilepath +"\n") elif os.path.exists(opts.outputBAMFilepath) and not(raw_input(opts.outputBAMFilepath + " exists. Overwrite? (Y/n): ")=='Y'): print ("\nOutput BAM exists. Please specify alternative output file. eg. -o Subset.bam\n") else: print "Read fasta ..." (ref_names, ref_lengths) = GetReferenceInfo(opts.fastaFilepath) print 'Done!' print "creating subset...." pool = Pool() GenerateSubsetBAM_wrapper = functools.partial(GenerateSubsetBAM, opts.inputBAMFilepath) reads = dict(pool.imap_unordered(GenerateSubsetBAM_wrapper, ref_names)) pool.close() print "Done!" print reads['chr1'] #works fine print "xxxxx" print reads['chr1'][0] #caused Segmentation fault I run the code with the pysam-0.5 examples (pysam-0.5/tests) in the following way: python SubsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam Read fasta ... Done! creating subset.... chr1 Done 1464 chr2 Done 1806 Done! [, ..., ] xxxxx Segmentation fault Thank you in advance. On Tue, Oct 18, 2011 at 7:00 PM, Peter Cock wrote: > On Tue, Oct 18, 2011 at 4:44 AM, Mic wrote: > > Hello, > > I have tried to generate a subset BAM, but I get a 'Segmentation fault' > with > > the following code: > > from Bio import SeqIO > > import pysam > > from optparse import OptionParser > > import subprocess, os, sys > > from multiprocessing import Pool > > import functools > > ... > > I tried this and it seemed to get stuck much earlier. Could you > cut down the example a bit by removing the multiprocessing? > > Peter > > P.S. Also you can remove the unused "import argparse" line. 
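A possible workaround (just a sketch, and only a guess that the crash comes from the pysam AlignedRead objects being pickled between the pool workers and the parent process) would be to have each worker return plain strings instead of the pysam objects, e.g.:

import functools
from multiprocessing import Pool
import pysam

def GenerateSubsetSAM(bam_filename, ref_name):
    # Return SAM-style text lines rather than AlignedRead objects;
    # plain strings survive pickling between processes.
    bam_fh = pysam.Samfile(bam_filename, "rb")
    reads = [str(read) for read in bam_fh.fetch(ref_name)]
    bam_fh.close()
    return (ref_name, reads)

if __name__ == '__main__':
    pool = Pool()
    worker = functools.partial(GenerateSubsetSAM, "ex1.bam")
    reads = dict(pool.imap_unordered(worker, ["chr1", "chr2"]))
    pool.close()
    print reads['chr1'][0]  # now just a string

The obvious cost is that the strings would then have to be written back out as SAM text (or the BAM writing moved into the workers) rather than handed to Samfile.write().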
> From mmokrejs at fold.natur.cuni.cz Tue Oct 18 07:44:54 2011 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Tue, 18 Oct 2011 13:44:54 +0200 Subject: [Biopython] [Samtools-help] Segmentation fault In-Reply-To: References: Message-ID: <4E9D66B6.70904@fold.natur.cuni.cz> Before running your python code, do (under bash): $ ulimit -c unlimited $ python mypython.py $ file core $ gdb /usr/bin/python ./core gdb> where gdb> bt full gdb> quit $ Martin Mic wrote: > Hello, > Thank you for your email. I updated the code and find out that > print reads['chr1'] #works fine > but > print reads['chr1'][0] #caused Segmentation fault > > Please find below the updated code: > > from Bio import SeqIO > import pysam > from optparse import OptionParser > import subprocess, os, sys > from multiprocessing import Pool > import functools > > > def GetReferenceInfo(referenceFastaPath): > referencenames = [] > referencelengths = [] > referenceFastaFile = open(referenceFastaPath) > for record in SeqIO.parse(referenceFastaFile, "fasta"): > referencenames.append(record.name) > referencelengths.append(len(record.seq)) > referenceFastaFile.close() > return (referencenames, referencelengths) > > > def GenerateSubsetBAM(bam_filename, ref_name): > reads = [] > bam_fh = pysam.Samfile(bam_filename, "rb") > > for read in bam_fh.fetch(ref_name): > reads.append(read) > > print ref_name + ' Done ' + str(len(reads)) > return (ref_name, reads) > > > if __name__ == '__main__': > parser = OptionParser("Usage: %prog -b BAMfile -f new_reference_fasta -o > outputBAM") > parser.add_option("-b", "--BAM", type="string", dest="inputBAMFilepath", > help="Specify a BAM file") > parser.add_option("-f", "--fasta", type="string", dest="fastaFilepath", > help="Specify a reference fasta file.") > parser.add_option("-o", "--output", type="string", > dest="outputBAMFilepath", help="Specify an output BAM file.") > > (opts, args) = parser.parse_args() > > if (opts.inputBAMFilepath is None): > print ("\nSpecify a BAM file. eg. -b large.bam\n") > parser.print_help() > elif not(os.path.exists(opts.inputBAMFilepath)): > print ("\nReference BAM file does not exists: " + opts.inputBAMFilepath > +"\n") > elif (opts.fastaFilepath is None): > print ("\nSpecify a reference fasta file. eg. -f Subset.fasta\n") > parser.print_help() > elif not(os.path.exists(opts.fastaFilepath)): > print ("\nReference fasta file does not exists: " + opts.fastaFilepath > +"\n") > elif os.path.exists(opts.outputBAMFilepath) and > not(raw_input(opts.outputBAMFilepath + " exists. Overwrite? (Y/n): ")=='Y'): > print ("\nOutput BAM exists. Please specify alternative output file. > eg. -o Subset.bam\n") > else: > print "Read fasta ..." > (ref_names, ref_lengths) = GetReferenceInfo(opts.fastaFilepath) > print 'Done!' > > print "creating subset...." > pool = Pool() > GenerateSubsetBAM_wrapper = functools.partial(GenerateSubsetBAM, > opts.inputBAMFilepath) > reads = dict(pool.imap_unordered(GenerateSubsetBAM_wrapper, ref_names)) > pool.close() > print "Done!" > > print reads['chr1'] #works fine > print "xxxxx" > > print reads['chr1'][0] #caused Segmentation fault > > I run the code with the pysam-0.5 examples (pysam-0.5/tests) in the > following way: > > python SubsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam > > Read fasta ... > Done! > creating subset.... > chr1 Done 1464 > chr2 Done 1806 > Done! > [, ..., > ] > xxxxx > Segmentation fault > > Thank you in advance. 
> > > On Tue, Oct 18, 2011 at 7:00 PM, Peter Cock wrote: > >> On Tue, Oct 18, 2011 at 4:44 AM, Mic wrote: >>> Hello, >>> I have tried to generate a subset BAM, but I get a 'Segmentation fault' >> with >>> the following code: >>> from Bio import SeqIO >>> import pysam >>> from optparse import OptionParser >>> import subprocess, os, sys >>> from multiprocessing import Pool >>> import functools >>> ... >> >> I tried this and it seemed to get stuck much earlier. Could you >> cut down the example a bit by removing the multiprocessing? >> >> Peter >> >> P.S. Also you can remove the unused "import argparse" line. >> > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > From mictadlo at gmail.com Tue Oct 18 08:05:01 2011 From: mictadlo at gmail.com (Mic) Date: Tue, 18 Oct 2011 22:05:01 +1000 Subject: [Biopython] [Samtools-help] Segmentation fault In-Reply-To: <4E9D66B6.70904@fold.natur.cuni.cz> References: <4E9D66B6.70904@fold.natur.cuni.cz> Message-ID: Thank you for your tip, but I got an error: $ulimit -c unlimited $SubsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam Read fasta ... Done! creating subset.... chr1 Done 1464 EAS56_57:6:190:289:82 69 0 99 0 None 0 99 35 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<; [('MF', 192)] chr2 Done 1806 B7_591:8:4:841:340 73 1 0 99 [(0, 36)] -1 -1 36 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA <<<<<<<<;<<<<<<<<;<<<<<;<;:<<<<<<<;; [('MF', 18), ('Aq', 77), ('NM', 0), ('UQ', 0), ('H0', 1), ('H1', 0)] Done! xxxxx Segmentation fault (core dumped) $file core core: ERROR: cannot open `core' (No such file or directory) I also inserted "print reads[0]" in the method GenerateSubsetBAM: def GenerateSubsetBAM(bam_filename, ref_name): reads = [] bam_fh = pysam.Samfile(bam_filename, "rb") for read in bam_fh.fetch(ref_name): reads.append(read) print ref_name + ' Done ' + str(len(reads)) print reads[0] # works fine! return (ref_name, reads) and as output I got: python SubsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam Read fasta ... Done! creating subset.... chr1 Done 1464 EAS56_57:6:190:289:82 69 0 99 0 None 0 99 35 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<; [('MF', 192)] chr2 Done 1806 B7_591:8:4:841:340 73 1 0 99 [(0, 36)] -1 -1 36 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA <<<<<<<<;<<<<<<<<;<<<<<;<;:<<<<<<<;; [('MF', 18), ('Aq', 77), ('NM', 0), ('UQ', 0), ('H0', 1), ('H1', 0)] Done! xxxxx Segmentation fault Why does reads['chr1'][0] caused the Segmentation fault? Thank you in advance. On Tue, Oct 18, 2011 at 9:44 PM, Martin Mokrejs wrote: > Before running your python code, do (under bash): > $ ulimit -c unlimited > $ python mypython.py > $ file core > $ gdb /usr/bin/python ./core > gdb> where > gdb> bt full > gdb> quit > $ > > Martin > > Mic wrote: > > Hello, > > Thank you for your email. 
I updated the code and find out that > > print reads['chr1'] #works fine > > but > > print reads['chr1'][0] #caused Segmentation fault > > > > Please find below the updated code: > > > > from Bio import SeqIO > > import pysam > > from optparse import OptionParser > > import subprocess, os, sys > > from multiprocessing import Pool > > import functools > > > > > > def GetReferenceInfo(referenceFastaPath): > > referencenames = [] > > referencelengths = [] > > referenceFastaFile = open(referenceFastaPath) > > for record in SeqIO.parse(referenceFastaFile, "fasta"): > > referencenames.append(record.name) > > referencelengths.append(len(record.seq)) > > referenceFastaFile.close() > > return (referencenames, referencelengths) > > > > > > def GenerateSubsetBAM(bam_filename, ref_name): > > reads = [] > > bam_fh = pysam.Samfile(bam_filename, "rb") > > > > for read in bam_fh.fetch(ref_name): > > reads.append(read) > > > > print ref_name + ' Done ' + str(len(reads)) > > return (ref_name, reads) > > > > > > if __name__ == '__main__': > > parser = OptionParser("Usage: %prog -b BAMfile -f new_reference_fasta > -o > > outputBAM") > > parser.add_option("-b", "--BAM", type="string", > dest="inputBAMFilepath", > > help="Specify a BAM file") > > parser.add_option("-f", "--fasta", type="string", dest="fastaFilepath", > > help="Specify a reference fasta file.") > > parser.add_option("-o", "--output", type="string", > > dest="outputBAMFilepath", help="Specify an output BAM file.") > > > > (opts, args) = parser.parse_args() > > > > if (opts.inputBAMFilepath is None): > > print ("\nSpecify a BAM file. eg. -b large.bam\n") > > parser.print_help() > > elif not(os.path.exists(opts.inputBAMFilepath)): > > print ("\nReference BAM file does not exists: " + > opts.inputBAMFilepath > > +"\n") > > elif (opts.fastaFilepath is None): > > print ("\nSpecify a reference fasta file. eg. -f Subset.fasta\n") > > parser.print_help() > > elif not(os.path.exists(opts.fastaFilepath)): > > print ("\nReference fasta file does not exists: " + > opts.fastaFilepath > > +"\n") > > elif os.path.exists(opts.outputBAMFilepath) and > > not(raw_input(opts.outputBAMFilepath + " exists. Overwrite? (Y/n): > ")=='Y'): > > print ("\nOutput BAM exists. Please specify alternative output file. > > eg. -o Subset.bam\n") > > else: > > print "Read fasta ..." > > (ref_names, ref_lengths) = GetReferenceInfo(opts.fastaFilepath) > > print 'Done!' > > > > print "creating subset...." > > pool = Pool() > > GenerateSubsetBAM_wrapper = functools.partial(GenerateSubsetBAM, > > opts.inputBAMFilepath) > > reads = dict(pool.imap_unordered(GenerateSubsetBAM_wrapper, > ref_names)) > > pool.close() > > print "Done!" > > > > print reads['chr1'] #works fine > > print "xxxxx" > > > > print reads['chr1'][0] #caused Segmentation fault > > > > I run the code with the pysam-0.5 examples (pysam-0.5/tests) in the > > following way: > > > > python SubsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam > > > > Read fasta ... > > Done! > > creating subset.... > > chr1 Done 1464 > > chr2 Done 1806 > > Done! > > [, ..., > > ] > > xxxxx > > Segmentation fault > > > > Thank you in advance. 
> > > > > > On Tue, Oct 18, 2011 at 7:00 PM, Peter Cock >wrote: > > > >> On Tue, Oct 18, 2011 at 4:44 AM, Mic wrote: > >>> Hello, > >>> I have tried to generate a subset BAM, but I get a 'Segmentation fault' > >> with > >>> the following code: > >>> from Bio import SeqIO > >>> import pysam > >>> from optparse import OptionParser > >>> import subprocess, os, sys > >>> from multiprocessing import Pool > >>> import functools > >>> ... > >> > >> I tried this and it seemed to get stuck much earlier. Could you > >> cut down the example a bit by removing the multiprocessing? > >> > >> Peter > >> > >> P.S. Also you can remove the unused "import argparse" line. > >> > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > From p.j.a.cock at googlemail.com Tue Oct 18 08:58:47 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 18 Oct 2011 13:58:47 +0100 Subject: [Biopython] [Samtools-help] Segmentation fault In-Reply-To: References: Message-ID: On Tue, Oct 18, 2011 at 11:26 AM, Mic wrote: > Hello, > Thank you for your email. I updated the code and find out that > ? ? print reads['chr1'] ? ? #works fine > but > ? ? print reads['chr1'][0] ?#caused Segmentation fault > Please find below the updated code: > ... Your pool version doesn't run on my machine, something unhappy in multiprocessing gives: TypeError: type 'partial' takes at least one argument Here's a version using a single thread, which works fine for me. What does it do on your machines? Either way this should help in determining the segmentation fault. from Bio import SeqIO import pysam import subprocess, os, sys def GetReferenceInfo(referenceFastaPath): referencenames = [] referencelengths = [] referenceFastaFile = open(referenceFastaPath) for record in SeqIO.parse(referenceFastaFile, "fasta"): referencenames.append(record.name) referencelengths.append(len(record.seq)) referenceFastaFile.close() return (referencenames, referencelengths) def GenerateSubsetBAM(bam_filename, ref_name): reads = [] bam_fh = pysam.Samfile(bam_filename, "rb") for read in bam_fh.fetch(ref_name): reads.append(read) print ref_name + ' Done ' + str(len(reads)) return (ref_name, reads) bam_filename = "ex1.bam" fasta_filename = "ex1.fa" print "Read fasta ..." (ref_names, ref_lengths) = GetReferenceInfo(fasta_filename) print 'Done!' print "creating subset...." reads = dict() for ref in ref_names: reads[ref] = GenerateSubsetBAM(bam_filename, ref) print "Done!" print reads['chr1'] #works fine print "xxxxx" print reads['chr1'][0] #also fine -- Peter From nathaniel.echols at gmail.com Tue Oct 18 14:08:03 2011 From: nathaniel.echols at gmail.com (Nat Echols) Date: Tue, 18 Oct 2011 11:08:03 -0700 Subject: [Biopython] newbie question: sequence parsing Message-ID: Greetings-- We have started using BioPython in our (non-bioinformatics) application and are investigating the possibility of replacing our existing (custom-made) sequence parsers. Two quick questions: 1) Is there a sequence parser that works with just a simple string, without any header or additional metadata? If not, how could we write one that results in the same basic object as those in Bio.SeqIO? (The parsing is of course easy, I just want to have the API be consistent regardless of format.) 2) Is there a single function that will take a file (and/or string) of unknown format and try the different parsers until it finds one that works? 
We currently use several different formats (raw string, FASTA, PIR, and possibly others), and we try not to rely on the file extension alone to determine the type. We already have something that does this using our parsers, which could be refactored to use Bio.SeqIO instead, but if BioPython has something similar I'd rather use that. thanks, Nat From p.j.a.cock at googlemail.com Tue Oct 18 15:04:14 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 18 Oct 2011 20:04:14 +0100 Subject: [Biopython] newbie question: sequence parsing In-Reply-To: References: Message-ID: On Tue, Oct 18, 2011 at 7:08 PM, Nat Echols wrote: > Greetings-- > > We have started using BioPython in our (non-bioinformatics) application and > are investigating the possibility of replacing our existing (custom-made) > sequence parsers. ?Two quick questions: > > 1) Is there a sequence parser that works with just a simple string, without > any header or additional metadata? ?If not, how could we write one that > results in the same basic object as those in Bio.SeqIO? ?(The parsing is of > course easy, I just want to have the API be consistent regardless of > format.) Sounds like the "raw" format in EMBOSS, although there are two interpretations: one sequence per line, or one sequence for the whole file. Have a look at the FASTA parser in Bio/SeqIO/FastaIO.py as the most simple case. Essentially you create a SeqRecord object (which is covered in the Tutorial). > 2) Is there a single function that will take a file (and/or string) of > unknown format and try the different parsers until it finds one that works? > ?We currently use several different formats (raw string, FASTA, PIR, and > possibly others), and we try not to rely on the file extension alone to > determine the type. ?We already have something that does this using our > parsers, which could be refactored to use Bio.SeqIO instead, but if > BioPython has something similar I'd rather use that. No, we don't have such a function. There are many difficulties with format guessing - both from the file contents and even the filename. I usually cite the Zen of Python, Explicit is Better Than Implicit. Peter From cjfields at illinois.edu Tue Oct 18 15:11:56 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Tue, 18 Oct 2011 19:11:56 +0000 Subject: [Biopython] newbie question: sequence parsing In-Reply-To: References: Message-ID: On Oct 18, 2011, at 2:04 PM, Peter Cock wrote: > On Tue, Oct 18, 2011 at 7:08 PM, Nat Echols wrote: >> ... >> 2) Is there a single function that will take a file (and/or string) of >> unknown format and try the different parsers until it finds one that works? >> We currently use several different formats (raw string, FASTA, PIR, and >> possibly others), and we try not to rely on the file extension alone to >> determine the type. We already have something that does this using our >> parsers, which could be refactored to use Bio.SeqIO instead, but if >> BioPython has something similar I'd rather use that. > > No, we don't have such a function. There are many difficulties > with format guessing - both from the file contents and even the > filename. I usually cite the Zen of Python, Explicit is Better Than > Implicit. > > Peter Some implicitness is fine, but speaking from experience (BioPerl's GuessSeqFormat) trying to guess the format from the dozens that litter the bioinformatics landscape is a nest of hornets no one wants to maintain. 
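For the first question, by the way, wrapping a plain string in a SeqRecord is only a couple of lines -- a rough sketch, with a made-up id and filename:

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def raw_to_record(handle):
    # Treat the whole file as one sequence with no header or metadata.
    data = "".join(line.strip() for line in handle)
    return SeqRecord(Seq(data), id="raw_seq", name="", description="")

record = raw_to_record(open("example.raw"))
print record.id, len(record)

That gives back the same kind of object Bio.SeqIO hands out, so downstream code does not need to care where it came from.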
chris From p.j.a.cock at googlemail.com Tue Oct 18 15:31:06 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 18 Oct 2011 20:31:06 +0100 Subject: [Biopython] newbie question: sequence parsing In-Reply-To: References: Message-ID: On Tue, Oct 18, 2011 at 8:11 PM, Fields, Christopher J wrote: > On Oct 18, 2011, at 2:04 PM, Peter Cock wrote: > >> On Tue, Oct 18, 2011 at 7:08 PM, Nat Echols wrote: >>> ... >>> 2) Is there a single function that will take a file (and/or string) of >>> unknown format and try the different parsers until it finds one that works? >>> ?We currently use several different formats (raw string, FASTA, PIR, and >>> possibly others), and we try not to rely on the file extension alone to >>> determine the type. ?We already have something that does this using our >>> parsers, which could be refactored to use Bio.SeqIO instead, but if >>> BioPython has something similar I'd rather use that. >> >> No, we don't have such a function. There are many difficulties >> with format guessing - both from the file contents and even the >> filename. I usually cite the Zen of Python, Explicit is Better Than >> Implicit. >> >> Peter > > Some implicitness is fine, but speaking from experience > (BioPerl's GuessSeqFormat) trying to guess the format > from the dozens that litter the bioinformatics landscape > is a nest of hornets no one wants to maintain. > > chris I think "nest of hornets" is a much more beautiful phrase than my dead pan "many difficulties". The practical reality is that while some file formats are easy (binary files with 4 byte "magic" identifiers), others are horrible, and the definitions shift over time, as new formats of variants are added. I really don't want to go there. Peter From nathaniel.echols at gmail.com Tue Oct 18 17:47:03 2011 From: nathaniel.echols at gmail.com (Nat Echols) Date: Tue, 18 Oct 2011 14:47:03 -0700 Subject: [Biopython] issues with NCBIXML Message-ID: Hi again, I'm puzzled by the behavior of the Blast XML parser. It appears to be picking up all of the alignments correctly, but the top-level Bio.Blast.Record.Blast object that it returns appears to be incompletely populated. Specifically, the attributes num_hits and num_sequences are set to None - but I have several dozen alignments. Am I missing the point of these attributes, or doing something wrong? It's not a huge issue (I can just count the alignments, I guess), but I'm a bit concerned that there's something wrong with my code. thanks, Nat From p.j.a.cock at googlemail.com Tue Oct 18 18:07:24 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 18 Oct 2011 23:07:24 +0100 Subject: [Biopython] issues with NCBIXML In-Reply-To: References: Message-ID: On Tue, Oct 18, 2011 at 10:47 PM, Nat Echols wrote: > Hi again, > > I'm puzzled by the behavior of the Blast XML parser. ?It appears to be > picking up all of the alignments correctly, but the > top-level Bio.Blast.Record.Blast object that it returns appears to be > incompletely populated. ?Specifically, the attributes num_hits and > num_sequences are set to None - but I have several dozen alignments. ?Am I > missing the point of these attributes, or doing something wrong? ?It's not a > huge issue (I can just count the alignments, I guess), but I'm a bit > concerned that there's something wrong with my code. > > thanks, > Nat The number of alignments and descriptions only really apply to the plain text (or HTML) BLAST output, but I guess we could set them to the number of hits in the XML output. 
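In the meantime, counting the parsed alignments yourself is an easy stand-in -- a quick sketch (filename made up):

from Bio.Blast import NCBIXML

handle = open("my_blast.xml")
for record in NCBIXML.parse(handle):
    # record.alignments is populated from the XML, so its length
    # gives the per-query hit count.
    print record.query, len(record.alignments)
handle.close()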
Peter From mictadlo at gmail.com Tue Oct 18 19:12:03 2011 From: mictadlo at gmail.com (Mic) Date: Wed, 19 Oct 2011 09:12:03 +1000 Subject: [Biopython] [Samtools-help] Segmentation fault In-Reply-To: <4E9D6FAB.70308@fold.natur.cuni.cz> References: <4E9D66B6.70904@fold.natur.cuni.cz> <4E9D6FAB.70308@fold.natur.cuni.cz> Message-ID: I run it now on my Laptop (Ubuntu 11.04 x64) and now I can see the core file: $ ulimit -c unlimited $ python subsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam Segmentation fault (core dumped) $ file core core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'python subsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam' $ gdb /usr/bin/python ./core GNU gdb (Ubuntu/Linaro 7.2-1ubuntu11) 7.2 Copyright (C) 2010 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". For bug reporting instructions, please see: ... Reading symbols from /usr/bin/python...(no debugging symbols found)...done. [New Thread 2748] warning: Can't read pathname for load map: Input/output error. Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libpthread.so.0 Reading symbols from /lib/x86_64-linux-gnu/libdl.so.2...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libdl.so.2 Reading symbols from /lib/x86_64-linux-gnu/libutil.so.1...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libutil.so.1 Reading symbols from /lib/libssl.so.0.9.8...(no debugging symbols found)...done. Loaded symbols for /lib/libssl.so.0.9.8 Reading symbols from /lib/libcrypto.so.0.9.8...(no debugging symbols found)...done. Loaded symbols for /lib/libcrypto.so.0.9.8 Reading symbols from /lib/x86_64-linux-gnu/libz.so.1...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libz.so.1 Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libm.so.6 Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libc.so.6 Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/ld-linux-x86-64.so.2 Reading symbols from /usr/lib/python2.7/lib-dynload/_heapq.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/python2.7/lib-dynload/_heapq.so Reading symbols from /usr/lib/python2.7/lib-dynload/_elementtree.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/python2.7/lib-dynload/_elementtree.so Reading symbols from /lib/x86_64-linux-gnu/libexpat.so.1...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libexpat.so.1 Reading symbols from /usr/lib/python2.7/lib-dynload/pyexpat.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/python2.7/lib-dynload/pyexpat.so Reading symbols from /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/csamtools.so...done. Loaded symbols for /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/csamtools.so Reading symbols from /usr/lib/python2.7/lib-dynload/_ctypes.so...(no debugging symbols found)...done. 
Loaded symbols for /usr/lib/python2.7/lib-dynload/_ctypes.so Reading symbols from /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/ctabix.so...done. Loaded symbols for /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/ctabix.so Reading symbols from /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/TabProxies.so...done. Loaded symbols for /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/TabProxies.so Reading symbols from /usr/lib/python2.7/lib-dynload/_io.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/python2.7/lib-dynload/_io.so Reading symbols from /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/cvcf.so...done. Loaded symbols for /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/cvcf.so Reading symbols from /usr/lib/python2.7/lib-dynload/_multiprocessing.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/python2.7/lib-dynload/_multiprocessing.so Reading symbols from /usr/lib/pymodules/python2.7/Bio/Nexus/cnexus.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/pymodules/python2.7/Bio/Nexus/cnexus.so Core was generated by `python subsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam'. Program terminated with signal 11, Segmentation fault. #0 __pyx_pf_9csamtools_11AlignedRead_5qname___get__ (o=0x164e138, x=) at pysam/csamtools.c:18123 18123 if (__pyx_t_1) { (gdb) (gdb) where #0 __pyx_pf_9csamtools_11AlignedRead_5qname___get__ (o=0x164e138, x=) at pysam/csamtools.c:18123 #1 __pyx_getprop_9csamtools_11AlignedRead_qname (o=0x164e138, x=) at pysam/csamtools.c:30806 #2 0x0000000000479804 in ?? () #3 0x00007f187dbabc65 in __pyx_pf_9csamtools_11AlignedRead___str__ ( __pyx_v_self=0x164e138) at pysam/csamtools.c:17687 #4 0x0000000000479eac in _PyObject_Str () #5 0x0000000000479f8a in PyObject_Str () #6 0x00000000004d390c in ?? () #7 0x00000000004cd2d1 in PyFile_WriteObject () #8 0x000000000049909d in PyEval_EvalFrameEx () #9 0x000000000049d325 in PyEval_EvalCodeEx () #10 0x00000000004ecb02 in PyEval_EvalCode () #11 0x00000000004fdc74 in ?? () #12 0x000000000042c182 in PyRun_FileExFlags () #13 0x000000000042cb4a in PyRun_SimpleFileExFlags () #14 0x0000000000418c9e in Py_Main () #15 0x00007f187ed7aeff in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6 #16 0x00000000004c62b1 in _start () (gdb) (gdb) bt full #0 __pyx_pf_9csamtools_11AlignedRead_5qname___get__ (o=0x164e138, x=) at pysam/csamtools.c:18123 __pyx_v_src = 0x0 __pyx_t_2 = 0x0 __pyx_frame = 0x0 __pyx_r = 0x0 __pyx_t_1 = __Pyx_use_tracing = 0 __pyx_frame_code = 0x0 #1 __pyx_getprop_9csamtools_11AlignedRead_qname (o=0x164e138, x=) at pysam/csamtools.c:30806 No locals. #2 0x0000000000479804 in ?? () No symbol table info available. #3 0x00007f187dbabc65 in __pyx_pf_9csamtools_11AlignedRead___str__ ( __pyx_v_self=0x164e138) at pysam/csamtools.c:17687 __pyx_r = 0x0 __pyx_t_1 = 0x1e4fb90 __pyx_t_2 = 0x0 __pyx_t_3 = 0x0 __pyx_t_4 = 0x0 ---Type to continue, or q to quit--- __pyx_t_5 = 0x0 __pyx_t_6 = 0x0 __pyx_t_7 = 0x0 __pyx_t_8 = 0x0 __pyx_t_9 = 0x0 __pyx_t_10 = 0x0 __pyx_t_11 = 0x0 __pyx_t_12 = 0x0 __pyx_t_13 = 0x0 __pyx_t_14 = 0x0 __pyx_frame_code = 0x0 __pyx_frame = 0x0 __Pyx_use_tracing = 0 #4 0x0000000000479eac in _PyObject_Str () No symbol table info available. #5 0x0000000000479f8a in PyObject_Str () No symbol table info available. #6 0x00000000004d390c in ?? 
() No symbol table info available. #7 0x00000000004cd2d1 in PyFile_WriteObject () No symbol table info available. ---Type to continue, or q to quit--- #8 0x000000000049909d in PyEval_EvalFrameEx () No symbol table info available. #9 0x000000000049d325 in PyEval_EvalCodeEx () No symbol table info available. #10 0x00000000004ecb02 in PyEval_EvalCode () No symbol table info available. #11 0x00000000004fdc74 in ?? () No symbol table info available. #12 0x000000000042c182 in PyRun_FileExFlags () No symbol table info available. #13 0x000000000042cb4a in PyRun_SimpleFileExFlags () No symbol table info available. #14 0x0000000000418c9e in Py_Main () No symbol table info available. #15 0x00007f187ed7aeff in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. #16 0x00000000004c62b1 in _start () No symbol table info available. (gdb) quit $ ulimit -a core file size (blocks, -c) unlimited data seg size (kbytes, -d) unlimited scheduling priority (-e) 20 file size (blocks, -f) unlimited pending signals (-i) 16382 max locked memory (kbytes, -l) 64 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) unlimited virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Thank you in advance. From nathaniel.echols at gmail.com Wed Oct 19 14:48:11 2011 From: nathaniel.echols at gmail.com (Nat Echols) Date: Wed, 19 Oct 2011 11:48:11 -0700 Subject: [Biopython] issues with NCBIXML In-Reply-To: References: Message-ID: On Tue, Oct 18, 2011 at 3:07 PM, Peter Cock wrote: > The number of alignments and descriptions only really apply > to the plain text (or HTML) BLAST output, but I guess we > could set them to the number of hits in the XML output. This would be useful, for consistency's sake if nothing else. I'm happy to contribute a patch if that streamlines the process. -Nat From p.j.a.cock at googlemail.com Wed Oct 19 15:06:30 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 19 Oct 2011 20:06:30 +0100 Subject: [Biopython] issues with NCBIXML In-Reply-To: References: Message-ID: On Wed, Oct 19, 2011 at 7:48 PM, Nat Echols wrote: > On Tue, Oct 18, 2011 at 3:07 PM, Peter Cock > wrote: >> >> The number of alignments and descriptions only really apply >> to the plain text (or HTML) BLAST output, but I guess we >> could set them to the number of hits in the XML output. > > This would be useful, for consistency's sake if nothing else. ?I'm happy to > contribute a patch if that streamlines the process. > -Nat Sure. If you can include unit tests for it even better. You should just be able to add some assertEqual lines to the existing XML parser tests for the newly populated properties. Thanks, Peter From mictadlo at gmail.com Thu Oct 20 05:38:56 2011 From: mictadlo at gmail.com (Mic) Date: Thu, 20 Oct 2011 19:38:56 +1000 Subject: [Biopython] changing record attributes while iterating In-Reply-To: References: Message-ID: Hello, would it be possible to using a generator expression for the following code? from Bio import SeqIO fa_parser = SeqIO.parse(open("../test_files/test.fasta", "rU"), "fasta") sequence = fa_parser.next().seq for record in fa_parser: sequence += 3*'N' + record.seq print sequence Input: >1 1111111 >2 2222222 >3 3333333 >4 4444444 Output: 1111111NNN2222222NNN3333333NNN4444444 Thank you advance. 
On Fri, Oct 7, 2011 at 5:22 PM, Peter Cock wrote: > > > On Friday, October 7, 2011, Michal wrote: > > > Hello, > > Does your code with generator save the whole file in the > > memory or does it read each entry and save it immediately? > > Thank you in advance. > > Using a generator expression like that only one SeqRecord is in memory at a > time. It goes through the input FASTA one record at a time, renames it, > saves it immediately. > > Peter > > P.S. list CC'd From p.j.a.cock at googlemail.com Thu Oct 20 05:58:05 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 20 Oct 2011 10:58:05 +0100 Subject: [Biopython] changing record attributes while iterating In-Reply-To: References: Message-ID: Hi Mic, You should have started a new thread with a new title... On Thu, Oct 20, 2011 at 10:38 AM, Mic wrote: > Hello, > would it be possible to using a generator expression for the following code? > from Bio import SeqIO > fa_parser = SeqIO.parse(open("../test_files/test.fasta", "rU"), "fasta") > sequence = fa_parser.next().seq > for record in fa_parser: > sequence += 3*'N' + record.seq > > print sequence > Input: >>1 > 1111111 >>2 > 2222222 >>3 > 3333333 >>4 > 4444444 > Output: > 1111111NNN2222222NNN3333333NNN4444444 > Thank you advance. Sure, how about this: from Bio import SeqIO fa_parser = SeqIO.parse("../test_files/test.fasta", "fasta") print ('N' * 3).join(str(rec.seq) for rec in fa_parser) Peter From andreas.wilm at gmail.com Tue Oct 25 02:26:59 2011 From: andreas.wilm at gmail.com (Andreas Wilm) Date: Tue, 25 Oct 2011 14:26:59 +0800 Subject: [Biopython] VCF parser In-Reply-To: References: Message-ID: Hi Tiago, I'm not aware of a Biopython VCF parser, but pysam seems to have one (haven't used it though). Try >>> from pysam import cvcf You also might want to check an implementation which was posted on seqanswers: http://seqanswers.com/forums/archive/index.php/t-9266.html Andreas PS: For the sake of completeness: your question was asked before here (no replies). See http://www.biopython.org/pipermail/biopython/2011-March/007131.html 2011/10/4 Tiago Antão : > Hi, > > I wonder if there is a VCF parser in either Python or Java? Either I > am being dumb at searching (probably) or nothing exists? > > Thanks, > Tiago > > -- > "If you want to get laid, go to college. If you want an education, go > to the library." - Frank Zappa > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Andreas Wilm andreas.wilm at gmail.com | mail at andreas-wilm.com | 0x7C68FBCC From pawan.mani2 at gmail.com Tue Oct 25 11:50:51 2011 From: pawan.mani2 at gmail.com (kakchingtabam pawankumar sharma) Date: Tue, 25 Oct 2011 21:20:51 +0530 Subject: [Biopython] installation of pyfasta In-Reply-To: <2DDF09AFEB46E54894A3843CEF9CB3A4B55076114C@EXCHMB.ocimumbio.com> References: <2DDF09AFEB46E54894A3843CEF9CB3A4B55076114C@EXCHMB.ocimumbio.com> Message-ID: Dear I would like to know how to install pyfasta in Linux. I have downloaded pyfasta-0.4.4.tar.gz and unpacked it using the command: *tar -xzvf pyfasta-0.4.4.tar.gz*. But I could not use the command line: *pyfasta split -n 6 sample.fasta* So kindly help me out to solve this problem. With Regards, Pawan
From p.j.a.cock at googlemail.com Tue Oct 25 12:13:00 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 25 Oct 2011 17:13:00 +0100 Subject: [Biopython] installation of pyfasta In-Reply-To: References: <2DDF09AFEB46E54894A3843CEF9CB3A4B55076114C@EXCHMB.ocimumbio.com> Message-ID: On Tue, Oct 25, 2011 at 4:50 PM, kakchingtabam pawankumar sharma wrote: > Dear > > I would like to know how to install pyfasta in Linux. I have > downloaded pyfasta-0.4.4.tar.gz and unpacked it using the command: *tar -xzvf > pyfasta-0.4.4.tar.gz*. > > But I could not use the command line: > > *pyfasta split -n 6 sample.fasta* > > So kindly help me out to solve this problem. > > With Regards, > > Pawan > Hi Pawan, Note pyfasta is not part of Biopython, but is a separate tool by Brent Pedersen (CC'd). http://pypi.python.org/pypi/pyfasta/ https://github.com/brentp/pyfasta/ However, uncompressing the tar ball is only the first step in installing it. You probably need to run "python setup.py install" for that. Peter From bpederse at gmail.com Tue Oct 25 12:23:31 2011 From: bpederse at gmail.com (Brent Pedersen) Date: Tue, 25 Oct 2011 10:23:31 -0600 Subject: [Biopython] VCF parser In-Reply-To: References: Message-ID: On Mon, Oct 3, 2011 at 4:12 PM, Tiago Antão wrote: > Hi, > > I wonder if there is a VCF parser in either Python or Java? Either I > am being dumb at searching (probably) or nothing exists? > > Thanks, > Tiago > > -- > "If you want to get laid, go to college. If you want an education, go > to the library." - Frank Zappa > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > I have found this one: https://github.com/jdoughertyii/PyVCF to be quite good and easy to use. From anaryin at gmail.com Wed Oct 26 06:30:12 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 26 Oct 2011 12:30:12 +0200 Subject: [Biopython] Pairwise alignment - is it a generic function? Message-ID: Hello all, A friend of mine was interested in a small simple alignment script for amino acids, to which I recommended to have a look at Biopython. We found the pairwise2 module but we're a bit puzzled. Does it align "any" sequence, aa or nucleotides? I don't see any scoring matrix referenced there... Related to this, can you suggest any implementation of an amino acid pairwise alignment algorithm, in Python, that is self-contained (i.e. doesn't depend on some other program). Best, João [...]
Rodrigues http://nmr.chem.uu.nl/~joao From p.j.a.cock at googlemail.com Wed Oct 26 06:58:09 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 26 Oct 2011 11:58:09 +0100 Subject: [Biopython] Pairwise alignment - is it a generic function? In-Reply-To: References: Message-ID: On Wed, Oct 26, 2011 at 11:30 AM, Jo?o Rodrigues wrote: > Hello all, > > A friend of mine was interested in a small simple alignment script for > aminoacids, to which I recommended to have a look at Biopython. We found the > pairwise2 module but we're a bit puzzled. Does it align "any" sequence, aa > or nucleotides? I don't see any scoring matrix referenced there.. It should work on proteins, just pass in the appropriate scoring matrix. > Related to this, can you suggest any implementation of an aminoacid > pairwise alignment algorithm, in Python, that does is self contained > (ie. doesn't depend on some other program). Well, Bio.pairwise2 has a faster C implementation and fall back slower pure Python implementation (used under Jython/PyPy/etc), which might answer your needs. Peter From from.d.putto at gmail.com Wed Oct 26 11:11:05 2011 From: from.d.putto at gmail.com (Sheila the angel) Date: Wed, 26 Oct 2011 17:11:05 +0200 Subject: [Biopython] downloading gnome Protein table Message-ID: Hi All, I an facing some problem to downloading the gnome and other information. For an example I did a query on ncbi gnome for NC_008390 On clicking results you can get following link http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=ShowDetailView&TermToSearch=19840 On my web-browser I can save this page as File> Save as >out.html Furthermore I want to download the Protein table also http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=Retrieve&dopt=Protein+Table&list_uids=19840 I want to do this for many Ids. Is there any simple way in Bio-Python??? Thanks in Advance -- Cheers Sheila From p.j.a.cock at googlemail.com Wed Oct 26 11:27:37 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 26 Oct 2011 16:27:37 +0100 Subject: [Biopython] downloading gnome Protein table In-Reply-To: References: Message-ID: On Wed, Oct 26, 2011 at 4:11 PM, Sheila the angel wrote: > Hi All, > > I an facing some problem to downloading the gnome and other information. > For an example I did a query on ncbi gnome for ?NC_008390 > On clicking results you can get following link > > http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=ShowDetailView&TermToSearch=19840 > On my web-browser I can save this page ?as File> Save as >out.html > > Furthermore I want to download the Protein table also > http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=Retrieve&dopt=Protein+Table&list_uids=19840 > > I want to do this for many Ids. Is there any simple way in Bio-Python??? > > Thanks in Advance Hmm, some of that might be available by Bio.Entrez, not sure though. For the protein table I would personally work with the *.ptt files from the NCBI FTP site, e.g. ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid13490/CP000441.ptt or: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid58303/NC_008391.ptt The FTP links are on the page of the first URL you gave. 
You can download all the "bacteria" *.ptt files as a tar ball, ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.ptt.tar.gz Typically I work from the GenBank file files instead (*.gbk rather than *.ptt) Peter From mictadlo at gmail.com Wed Oct 26 21:14:16 2011 From: mictadlo at gmail.com (Mic) Date: Thu, 27 Oct 2011 11:14:16 +1000 Subject: [Biopython] changing record attributes while iterating In-Reply-To: References: Message-ID: Thank you it is working. I would like to to put sequences id in a list in the following way: >>> c = (i.id for i in b) SyntaxError: invalid syntax >>> c[0] Traceback (most recent call last): File "", line 1, in TypeError: 'generator' object is not subscriptable How is it possible to generate a list of sequence ids? Thank you in advance. On Thu, Oct 20, 2011 at 7:58 PM, Peter Cock wrote: > Hi Mic, > > You should have started a new thread with a new title... > > On Thu, Oct 20, 2011 at 10:38 AM, Mic wrote: > > Hello, > > would it be possible to using a generator expression for the following > code? > > from Bio import SeqIO > > fa_parser = SeqIO.parse(open("../test_files/test.fasta", "rU"), "fasta") > > sequence = fa_parser.next().seq > > for record in fa_parser: > > sequence += 3*'N' + record.seq > > > > print sequence > > Input: > >>1 > > 1111111 > >>2 > > 2222222 > >>3 > > 3333333 > >>4 > > 4444444 > > Output: > > 1111111NNN2222222NNN3333333NNN4444444 > > Thank you advance. > > Sure, how about this: > > from Bio import SeqIO > fa_parser = SeqIO.parse("../test_files/test.fasta", "fasta") > print ('N' * 3).join(str(rec.seq) for rec in fa_parser) > > Peter > From p.j.a.cock at googlemail.com Thu Oct 27 04:35:24 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Oct 2011 09:35:24 +0100 Subject: [Biopython] changing record attributes while iterating In-Reply-To: References: Message-ID: On Thu, Oct 27, 2011 at 2:14 AM, Mic wrote: > Thank you it is working. > I would like to to put sequences id in a list in the following way: >>>> c = (i.id for i in b) > SyntaxError: invalid syntax The above would be a generator expression, and requires Python 2.4. It shouldn't cause a SyntaxError unless there is some mistake I'm not seeing (or you missed something in the copy & paste). >>>> c[0] > Traceback (most recent call last): > ? File "", line 1, in > TypeError: 'generator' object is not subscriptable > How is it possible to?generate?a list of sequence ids? You need to create a list (e.g using a list comprehension) rather than a generator, probably: c = [i.id for i in b] c[0] = "Fred" Peter From from.d.putto at gmail.com Thu Oct 27 06:47:04 2011 From: from.d.putto at gmail.com (Sheila the angel) Date: Thu, 27 Oct 2011 12:47:04 +0200 Subject: [Biopython] downloading gnome Protein table In-Reply-To: References: Message-ID: The problem is I have only the Refseq ID like NC_008390 and I don't have Protein table ID (in this case CP000441.ptt) so I can't download the .ptt file (as in ftp url ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid13490/CP000441.ptt ) Also not all Refseq IDs I have belongs to 'Bacteria'. 
So for ID NC_004314 (just an example) I have to change the ftp url as ftp://ftp.ncbi.nih.gov/genomes/Protozoa/Plasmodium_falciparum/NC_004314.ptt Downloading the *.gbk file may be an option (but later I need to convert them into protein table) so I tried this from Bio import Entrez Entrez.email = "from.d.putto at gmail.com" handle = Entrez.efetch(db="genome", id="NC_008390", rettype="gbk") print handle.read() The output shows me 'Nothing has been found' I am not sure in which database I should look for id like NC_008390. Moreover later-on I need to convert 'gbk' file to .ptt (or extract protein information) On Wed, Oct 26, 2011 at 5:27 PM, Peter Cock wrote: > On Wed, Oct 26, 2011 at 4:11 PM, Sheila the angel > wrote: > > Hi All, > > > > I an facing some problem to downloading the gnome and other information. > > For an example I did a query on ncbi gnome for NC_008390 > > On clicking results you can get following link > > > > > http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=ShowDetailView&TermToSearch=19840 > > On my web-browser I can save this page as File> Save as >out.html > > > > Furthermore I want to download the Protein table also > > > http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=Retrieve&dopt=Protein+Table&list_uids=19840 > > > > I want to do this for many Ids. Is there any simple way in Bio-Python??? > > > > Thanks in Advance > > Hmm, some of that might be available by Bio.Entrez, not sure though. > > For the protein table I would personally work with the *.ptt files from > the NCBI FTP site, e.g. > > > ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid13490/CP000441.ptt > > or: > > > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid58303/NC_008391.ptt > > The FTP links are on the page of the first URL you gave. You can download > all the "bacteria" *.ptt files as a tar ball, > > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.ptt.tar.gz > > Typically I work from the GenBank file files instead (*.gbk rather than > *.ptt) > > Peter > From p.j.a.cock at googlemail.com Thu Oct 27 09:14:10 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Oct 2011 14:14:10 +0100 Subject: [Biopython] downloading gnome Protein table In-Reply-To: References: Message-ID: On Thu, Oct 27, 2011 at 11:47 AM, Sheila the angel wrote: > The problem is I have only the Refseq ID like?NC_008390?and I don't have > Protein table ID (in this case CP000441.ptt) so I can't download the .ptt > file (as in ftp url > ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid13490/CP000441.ptt > ? ) Given your identifiers, use ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ rather than ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/ - in this case, ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid58303/NC_008390.ptt > > Also not all??Refseq IDs I have belongs to 'Bacteria'. > Then the NCBI won't have them on the Bacterial FTP sites, and I don't think they will provide *.ptt files for them. > So for ID > NC_004314?(just an example) I have to change the ftp url as > ftp://ftp.ncbi.nih.gov/genomes/Protozoa/Plasmodium_falciparum/NC_004314.ptt > > Downloading the *.gbk file may be an option (but later I need to convert > them into protein table) Just download *all* the bacterial protein tables as the tar ball, its only 120MB compressed: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.ptt.tar.gz Then you can just search locally for a file by name etc. 
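For example, something along these lines should find the right table once the tar ball is unpacked (just a sketch -- adjust the directory name to wherever you extracted it):

import os

def find_ptt(root, accession):
    # Walk the unpacked directory tree looking for e.g. NC_008390.ptt
    target = accession + ".ptt"
    for dirpath, dirnames, filenames in os.walk(root):
        if target in filenames:
            return os.path.join(dirpath, target)
    return None

print find_ptt("Bacteria", "NC_008390")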
> so I tried this > from Bio import Entrez > Entrez.email = "from.d.putto at gmail.com" > handle = Entrez.efetch(db="genome", id="NC_008390", rettype="gbk") > print handle.read() > The output shows me 'Nothing has been found' > I am not sure in which database I should look for id like NC_008390. Try it on the NCBI website for all databases, http://www.ncbi.nlm.nih.gov/sites/gquery?term=NC_008390 You'll see it does match the genome database, but also the nucleotide database. In this case you want the sequence as a GenBank file so use the nucleotide database. > Moreover later-on I need to convert 'gbk' file to .ptt (or extract protein > information) The Biopython GenBank parser can do that - life is easier with bacterial genomes as there are (almost) no nasty join(...) locations to deal with. Peter From devaniranjan at gmail.com Thu Oct 27 15:16:07 2011 From: devaniranjan at gmail.com (George Devaniranjan) Date: Thu, 27 Oct 2011 15:16:07 -0400 Subject: [Biopython] weighted sampling of a dictionary Message-ID: Hi, I am not sure if this question is more suitable for biopython or a python forum. I have the following dictionary. dict ={'YLE': 6, 'QYL': 36, 'PTD': 32, 'AGG': 145, 'QYG': 34, 'QYD': 34, 'AGD': 188, 'QYS': 35, 'AGS': 177, 'AGA': 154, 'QYA': 23, 'AGL': 16, 'LAU': 1, 'PTA': 7, ' AGY': 7, 'QYY': 19, 'QYE': 6, 'PAT': 57, 'QYT': 28, 'AGT': 10, 'QYQ': 34, 'AGQ': 140, 'QYP': 32, 'AGP': 167, 'TAT': 31, 'SGS': 174, 'TAP': 18, 'YLP': 49, 'TA Q': 23, 'UQE': 5, 'UAQ': 9, 'UAT': 8, 'UAE': 7, 'TAD': 1, 'TAG': 15, 'TAA': 20, 'TAS': 1, 'YUP': 1, 'TAL': 45, 'ALU': 20, 'PEP': 14, 'UAG': 6, 'EAL': 16, 'SY Y': 36, 'EAS': 35, 'SYT': 29, 'EAA': 16, 'SYQ': 13, 'EAG': 28} The keys are the different amino acid triplets (all possible triplets extracted from a culled list of PDB), the numbers next to them are the frequency that they occour in. I was wondering if there is a way in biopython/python to sample them at the frequecy indicated by the no's next to the key. I have only given a snippet of the triplet dictionary, the entire dictionary has about 1400 key entries. I would appreciate any help in this matter --thank you very much. George From bpederse at gmail.com Thu Oct 27 16:29:43 2011 From: bpederse at gmail.com (Brent Pedersen) Date: Thu, 27 Oct 2011 14:29:43 -0600 Subject: [Biopython] weighted sampling of a dictionary In-Reply-To: References: Message-ID: On Thu, Oct 27, 2011 at 1:16 PM, George Devaniranjan wrote: > Hi, > > I am not sure if this question is more suitable for biopython or a python > forum. > > > I have the following dictionary. > > dict ={'YLE': 6, 'QYL': 36, 'PTD': 32, 'AGG': 145, 'QYG': 34, 'QYD': 34, > 'AGD': 188, 'QYS': 35, 'AGS': 177, 'AGA': 154, 'QYA': 23, 'AGL': 16, 'LAU': > 1, 'PTA': 7, ' > AGY': 7, 'QYY': 19, 'QYE': 6, 'PAT': 57, 'QYT': 28, 'AGT': 10, 'QYQ': 34, > 'AGQ': 140, 'QYP': 32, 'AGP': 167, 'TAT': 31, 'SGS': 174, 'TAP': 18, 'YLP': > 49, 'TA > Q': 23, 'UQE': 5, 'UAQ': 9, 'UAT': 8, 'UAE': 7, 'TAD': 1, 'TAG': 15, 'TAA': > 20, 'TAS': 1, 'YUP': 1, 'TAL': 45, 'ALU': 20, 'PEP': 14, 'UAG': 6, 'EAL': > 16, 'SY > Y': 36, 'EAS': 35, 'SYT': 29, 'EAA': 16, 'SYQ': 13, 'EAG': 28} > > The keys are the different amino acid triplets (all possible triplets > extracted from a culled list of PDB), the numbers next to them are the > frequency that they occour in. > > I was wondering if there is a way in biopython/python to sample them at the > frequecy indicated by the no's next to the key. 
> > I have only given a snippet of the triplet dictionary, the entire dictionary > has about 1400 key entries. > > I would appreciate any help in this matter --thank you very much. > > George > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > you could try the one of these (presumably the class king) http://eli.thegreenplace.net/2010/01/22/weighted-random-generation-in-python/ you'll have something like: import operator aminos, weights = zip(*sorted(adict.items(), key=operator.itemgetter(1))) amino_gen = WeightedRandomGenerator(weights) for i in xrange(nsims): idx = amino_gen.next() rand_aa = aminos[idx] From jmtc21 at bath.ac.uk Thu Oct 27 16:33:18 2011 From: jmtc21 at bath.ac.uk (Jaime Tovar) Date: Thu, 27 Oct 2011 21:33:18 +0100 Subject: [Biopython] expat and biopython 1.58 problem on linux x64 Message-ID: <4EA9C00E.5080509@bath.ac.uk> Hello all, I'm having troubles while updating my biopython to 1.58. I'm having exactly the same problem with the xml parser as described in this old post: http://www.biopython.org/pipermail/biopython/2011-May/007263.html Sadly I may have to use the entrez module so it will make me happy to have the thing running if possible. I'm installing in a opensuse 11.3 x64 box Did a rpm install of biopython from the opensuse science repo. So I have 1.58-1.2 installed. Python 1.6.5-3.5.1 for x64 expat 2.0.1-98.1 x64 Tried to install both by hand from the tar.gz and using an rpm but the problem persists. Any help will be greatly appreciated. Thanks!!! Jaime. From winda002 at student.otago.ac.nz Thu Oct 27 16:52:00 2011 From: winda002 at student.otago.ac.nz (David Winter) Date: Fri, 28 Oct 2011 09:52:00 +1300 Subject: [Biopython] weighted sampling of a dictionary In-Reply-To: References: Message-ID: <20111028095200.20435ub1z2jexy0g@www.studentmail.otago.ac.nz> Hi George, I was actually doing this yesterday :) The function I came up with takes two lists: import random def weighted_sample(population, weights): """ Sample from a population, given provided weights """ if len(population) != len(weights): raise ValueError('Lengths of population and weights do not match') normal_weights = [ float(w)/sum(weights) for w in weights ] val = random.random() running_total = 0 for index, weight in enumerate(normal_weights): running_total += weight if val < running_total: return population[index] Which seems to do the trick: population = ['AAU' ,'AAC', 'AAG'] weights = [2,5,3] sample = [weighted_sample(population, weights) for _ in range(1000)] sample.count('AAC') #should be about 500 If that's too slow, check out numpy's random.multinomial() function. I haven't tested this, but this should get you the number of times you get each codon from 1000 "draws": import numpy as np codons, weights = codon_dict.items() denom = sum(weights) normalised_weights = [float(w)/denom for w in weights] np.random.multinomial(codons, weights, 1000) Cheers, David Quoting George Devaniranjan : > Hi, > > I am not sure if this question is more suitable for biopython or a python > forum. > > > I have the following dictionary. 
> > dict ={'YLE': 6, 'QYL': 36, 'PTD': 32, 'AGG': 145, 'QYG': 34, 'QYD': 34, > 'AGD': 188, 'QYS': 35, 'AGS': 177, 'AGA': 154, 'QYA': 23, 'AGL': 16, 'LAU': > 1, 'PTA': 7, ' > AGY': 7, 'QYY': 19, 'QYE': 6, 'PAT': 57, 'QYT': 28, 'AGT': 10, 'QYQ': 34, > 'AGQ': 140, 'QYP': 32, 'AGP': 167, 'TAT': 31, 'SGS': 174, 'TAP': 18, 'YLP': > 49, 'TA > Q': 23, 'UQE': 5, 'UAQ': 9, 'UAT': 8, 'UAE': 7, 'TAD': 1, 'TAG': 15, 'TAA': > 20, 'TAS': 1, 'YUP': 1, 'TAL': 45, 'ALU': 20, 'PEP': 14, 'UAG': 6, 'EAL': > 16, 'SY > Y': 36, 'EAS': 35, 'SYT': 29, 'EAA': 16, 'SYQ': 13, 'EAG': 28} > > The keys are the different amino acid triplets (all possible triplets > extracted from a culled list of PDB), the numbers next to them are the > frequency that they occour in. > > I was wondering if there is a way in biopython/python to sample them at the > frequecy indicated by the no's next to the key. > > I have only given a snippet of the triplet dictionary, the entire dictionary > has about 1400 key entries. > > I would appreciate any help in this matter --thank you very much. > > George > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From p.j.a.cock at googlemail.com Fri Oct 28 05:54:09 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 28 Oct 2011 10:54:09 +0100 Subject: [Biopython] expat and biopython 1.58 problem on linux x64 In-Reply-To: <4EA9C00E.5080509@bath.ac.uk> References: <4EA9C00E.5080509@bath.ac.uk> Message-ID: On Thu, Oct 27, 2011 at 9:33 PM, Jaime Tovar wrote: > Hello all, > > I'm having troubles while updating my biopython to 1.58. > > I'm having exactly the same problem with the xml parser as described in this > old post: > > http://www.biopython.org/pipermail/biopython/2011-May/007263.html > > Sadly I may have to use the entrez module so it will make me happy to have > the thing running if possible. > > I'm installing in a opensuse 11.3 x64 box > Did a rpm install of biopython from the opensuse science repo. So I have > 1.58-1.2 installed. > Python 1.6.5-3.5.1 for x64 > expat 2.0.1-98.1 x64 > > Tried to install both by hand from the tar.gz and using an rpm but the > problem persists. > > Any help will be greatly appreciated. > > Thanks!!! > > Jaime. Hmm. Can you try installing the latest code from git please? You can grab it via the git command line tool, or use github to download the latest code as a tar ball: http://biopython.org/wiki/SourceCode Specifically I'm hoping this change will fix the segmentation fault (assuming http://bugs.python.org/issue4877 is to blame): https://github.com/biopython/biopython/commit/59f9cbd2ad14ebd05d5864033ff0c7ef7a8f0daa Previously: $ python Python 2.6.6 (r266:84292, Aug 31 2010, 16:21:14) [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import Entrez >>> handle = open("NEWS") >>> handle.close() >>> Entrez.read(handle) Segmentation fault With the fix: $ python Python 2.6.6 (r266:84292, Aug 31 2010, 16:21:14) [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2 Type "help", "copyright", "credits" or "license" for more information. 
>>> from Bio import Entrez >>> handle = open("NEWS") >>> handle.close() >>> Entrez.read(handle) Traceback (most recent call last): File "", line 1, in File "Bio/Entrez/__init__.py", line 270, in read record = handler.read(handle) File "Bio/Entrez/Parser.py", line 167, in read raise IOError("Can't parse a closed handle") IOError: Can't parse a closed handle Assuming you start seeing the IOError instead, the question would shift to what is going on with your network settings (e.g. look at proxies). If the segmentation fault doesn't go away we'll need to think again. Peter From bioinformaticsing at gmail.com Fri Oct 28 07:46:07 2011 From: bioinformaticsing at gmail.com (ning luwen) Date: Fri, 28 Oct 2011 19:46:07 +0800 Subject: [Biopython] Memory leak while parse gbk file? Message-ID: Hi, I have tried to parse about 2000+ gbk file using SeqIO.parse to parse gbk file, but the memory up quickly. ( in my desktop 4g memory, out memory after a number of iterates, and then try one work station, memory used as high as 100g+, and continue increasing) for temp_name in file_names:#file_names: list of path of gbk files. f=open(temp_name) for x in SeqIO.parse(f,'genbank'): print x.name,len(x.features) f.close() I guess there may be memory leak while parse gbk flle. -- regards, luwen ning From p.j.a.cock at googlemail.com Fri Oct 28 07:52:33 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 28 Oct 2011 12:52:33 +0100 Subject: [Biopython] Memory leak while parse gbk file? In-Reply-To: References: Message-ID: On Fri, Oct 28, 2011 at 12:46 PM, ning luwen wrote: > Hi, > ? ?I have tried to parse about 2000+ gbk file using SeqIO.parse to > parse gbk file, but the memory up quickly. ( in my desktop 4g memory, > out memory after a number of iterates, and then try one work station, > memory used as high as 100g+, and continue increasing) > > for temp_name in file_names:#file_names: list of path of gbk files. > ? ?f=open(temp_name) > ? ?for x in SeqIO.parse(f,'genbank'): > ? ? ? ?print x.name,len(x.features) > ? ?f.close() > > ? I guess there may be memory leak while parse gbk flle. > -- > regards, > luwen ning Which version of Python are you using? Try calling garbage collection, import gc from Bio import SeqIO for temp_name in file_names:#file_names: list of path of gbk files. f=open(temp_name) for x in SeqIO.parse(f,'genbank'): print x.name,len(x.features) f.close() gc.collect() I expect that to fix the increasing memory usage. If it does, then it isn't a memory leak. Peter From p.j.a.cock at googlemail.com Fri Oct 28 09:21:42 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 28 Oct 2011 14:21:42 +0100 Subject: [Biopython] expat and biopython 1.58 problem on linux x64 In-Reply-To: <4EAAA9A0.3010906@bath.ac.uk> References: <4EA9C00E.5080509@bath.ac.uk> <4EAAA9A0.3010906@bath.ac.uk> Message-ID: On Fri, Oct 28, 2011 at 2:09 PM, Jaime Tovar wrote: > Got the tarball for latest, > > but: > > ... > ~/tmp/biop/biopython-biopython-59f9cbd/Tests> python test_Entrez.py > Test error handling when presented with Fasta non-XML data ... ok > Test error handling when presented with GenBank non-XML data ... ok > Test parsing XML returned by EFetch, Nucleotide database (first test) ... > ERROR > Test parsing XML returned by EFetch, Protein database ... ERROR > Test parsing XML returned by EFetch, OMIM database ... ERROR > Test parsing XML returned by EFetch, PubMed database (first test) ... > Segmentation fault > > Can we try to find where exactly is the problem? > > Thanks for the help. 
> J OK, so it doesn't look like the problem with closed handles, http://bugs.python.org/issue4877 Although to be sure please try the example in my last email, from Bio import Entrez handle = open("NEWS") handle.close() Entrez.read(handle) (You can use any file that exists). Beyond that I only have questions rather than answers for now. My guess is something is broken on your system with conflicting versions of expat, see for example: http://www.dscpl.com.au/wiki/ModPython/Articles/ExpatCausingApacheCrash What does this give you, and does it match expat 2.0.1 which you said earlier was installed? import pyexpat print pyexpat.version_info Can you try to get a strack trace? Alternatively, you could disable individual tests which trigger the segmentation fault one by one and then we can attempt to spot any commonalities. e.g. The segmentation fault is from: "Test parsing XML returned by EFetch, PubMed database (first test)" which is method test_pubmed1, rename it to xtest_test_pubmed1 (or anything that doesn't start test_*) and it will be skipped. Peter From devaniranjan at gmail.com Fri Oct 28 09:23:22 2011 From: devaniranjan at gmail.com (George Devaniranjan) Date: Fri, 28 Oct 2011 09:23:22 -0400 Subject: [Biopython] weighted sampling of a dictionary In-Reply-To: <20111028095200.20435ub1z2jexy0g@www.studentmail.otago.ac.nz> References: <20111028095200.20435ub1z2jexy0g@www.studentmail.otago.ac.nz> Message-ID: Thanks guys for all your suggestions -I am going to try these out. Best, George On Thu, Oct 27, 2011 at 4:52 PM, David Winter wrote: > Hi George, > > I was actually doing this yesterday :) > > The function I came up with takes two lists: > > import random > > def weighted_sample(population, weights): > """ Sample from a population, given provided weights """ > if len(population) != len(weights): > raise ValueError('Lengths of population and weights do not match') > normal_weights = [ float(w)/sum(weights) for w in weights ] > val = random.random() > running_total = 0 > for index, weight in enumerate(normal_weights): > running_total += weight > if val < running_total: > return population[index] > > Which seems to do the trick: > > population = ['AAU' ,'AAC', 'AAG'] > weights = [2,5,3] > sample = [weighted_sample(population, weights) for _ in range(1000)] > sample.count('AAC') #should be about 500 > > If that's too slow, check out numpy's random.multinomial() function. > > I haven't tested this, but this should get you the number of times you get > each codon from 1000 "draws": > > import numpy as np > > codons, weights = codon_dict.items() > denom = sum(weights) > normalised_weights = [float(w)/denom for w in weights] > np.random.multinomial(codons, weights, 1000) > > Cheers, > David > > > > Quoting George Devaniranjan : > > Hi, >> >> I am not sure if this question is more suitable for biopython or a python >> forum. >> >> >> I have the following dictionary. 
>> >> dict ={'YLE': 6, 'QYL': 36, 'PTD': 32, 'AGG': 145, 'QYG': 34, 'QYD': 34, >> 'AGD': 188, 'QYS': 35, 'AGS': 177, 'AGA': 154, 'QYA': 23, 'AGL': 16, >> 'LAU': >> 1, 'PTA': 7, ' >> AGY': 7, 'QYY': 19, 'QYE': 6, 'PAT': 57, 'QYT': 28, 'AGT': 10, 'QYQ': 34, >> 'AGQ': 140, 'QYP': 32, 'AGP': 167, 'TAT': 31, 'SGS': 174, 'TAP': 18, >> 'YLP': >> 49, 'TA >> Q': 23, 'UQE': 5, 'UAQ': 9, 'UAT': 8, 'UAE': 7, 'TAD': 1, 'TAG': 15, >> 'TAA': >> 20, 'TAS': 1, 'YUP': 1, 'TAL': 45, 'ALU': 20, 'PEP': 14, 'UAG': 6, 'EAL': >> 16, 'SY >> Y': 36, 'EAS': 35, 'SYT': 29, 'EAA': 16, 'SYQ': 13, 'EAG': 28} >> >> The keys are the different amino acid triplets (all possible triplets >> extracted from a culled list of PDB), the numbers next to them are the >> frequency that they occour in. >> >> I was wondering if there is a way in biopython/python to sample them at >> the >> frequecy indicated by the no's next to the key. >> >> I have only given a snippet of the triplet dictionary, the entire >> dictionary >> has about 1400 key entries. >> >> I would appreciate any help in this matter --thank you very much. >> >> George >> ______________________________**_________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/**mailman/listinfo/biopython >> >> > > > From p.j.a.cock at googlemail.com Mon Oct 31 07:27:31 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 31 Oct 2011 11:27:31 +0000 Subject: [Biopython] expat and biopython 1.58 problem on linux x64 In-Reply-To: References: <4EA9C00E.5080509@bath.ac.uk> <4EAAA9A0.3010906@bath.ac.uk> Message-ID: On Fri, Oct 28, 2011 at 2:21 PM, Peter Cock wrote: > > OK, so it doesn't look like the problem with closed handles, > http://bugs.python.org/issue4877 > Hi Jaime, Was there any sign of an expat version mismatch? That does seem like the most likely problem (Python expecting one thing, the library providing another). Another guess was we could be reusing the parser object (which apparently is not allowed), although the unit tests don't seem to do this: http://bugs.python.org/issue6676 http://bugs.python.org/issue12829 Peter From tiagoantao at gmail.com Mon Oct 3 22:12:18 2011 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 3 Oct 2011 23:12:18 +0100 Subject: [Biopython] VCF parser Message-ID: Hi, I wonder if there is a VCF parser in either Python or Java? Either I am being dumb at searching (probably) or nothing exists? Thanks, Tiago -- "If you want to get laid, go to college.? If you want an education, go to the library." - Frank Zappa From bala.biophysics at gmail.com Tue Oct 4 08:05:36 2011 From: bala.biophysics at gmail.com (Bala subramanian) Date: Tue, 4 Oct 2011 10:05:36 +0200 Subject: [Biopython] changing record attributes while iterating Message-ID: Friends, I have a fasta file. I need to modify the record id by adding a suffix to it. So i used SeqRecord (the code attached below). It is working fine but i would like to know if there is any simple way to do that. ie. if i can change the record attributes while iterating through the fasta with SeqIO.parse itself. I tried something like following but i couldnt get what i wanted. new_list=[] for record in SeqIO.parse(open(argv[1], "rU"), "fasta"): record.id=record.id + '_suffix' new_list.append(record) Hence i used SeqRecord to do the modification ? 
---------------------------------------------------------------------------------------------------- #!/usr/bin/env python from Bio import SeqIO from Bio.SeqRecord import SeqRecord from Bio.Seq import Seq from sys import argv new_list=[] for record in SeqIO.parse(open(argv[1], "rU"), "fasta"): seq=str(record.seq) newrec=SeqRecord(Seq(seq),id=record.id+"_suffix",name='',description='') new_list.append(newrec) output_handle = open(raw_input('Enter the output file:'), 'w') SeqIO.write(new_list, output_handle, "fasta") output_handle.close() From p.j.a.cock at googlemail.com Tue Oct 4 08:24:08 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 4 Oct 2011 09:24:08 +0100 Subject: [Biopython] changing record attributes while iterating In-Reply-To: References: Message-ID: On Tue, Oct 4, 2011 at 9:05 AM, Bala subramanian wrote: > Friends, > I have a fasta file. I need to modify the record id by adding a suffix to > it. So i used SeqRecord (the code attached below). It is working fine but i > would like to know if there is any simple way to do that. ie. if i can > change the record attributes while iterating through the fasta with > SeqIO.parse itself. I tried something like following but i couldnt get what > i wanted. > > new_list=[] > for record in SeqIO.parse(open(argv[1], "rU"), "fasta"): > ? ? ? ? ? ? ? ? ? ?record.id=record.id + '_suffix' > ? ? ? ? ? ? ? ? ? ?new_list.append(record) The above looks fine, although depending on the rest of your script a big list might be a bad idea (too much memory) and an iterator based approach may be preferable. If as in the rest of your example you just need to do this for output, perhaps: #!/usr/bin/env python from Bio import SeqIO from sys import argv def rename(record): """Modified record in place AND returns it.""" record.id += '_suffix' return record #This is a generator expression: records = (rename(r) for r in SeqIO.parse(argv[1], "fasta")) output_filename = raw_input('Enter the output file:') SeqIO.write(records, output_filename, "fasta") The alternative you showed was wasteful, creating lots of new objects to no benefit. Peter From nanatrapnest at hotmail.it Wed Oct 5 15:07:44 2011 From: nanatrapnest at hotmail.it (Nana Trapnest) Date: Wed, 5 Oct 2011 15:07:44 +0000 Subject: [Biopython] StructureBuilder Message-ID: Hello, is it possible with structure builder copy all a protein and change atoms coord??? How can I do this?? Thanks to all of you! Stefania From anaryin at gmail.com Wed Oct 5 16:02:30 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 5 Oct 2011 18:02:30 +0200 Subject: [Biopython] StructureBuilder In-Reply-To: References: Message-ID: Hello Stefania, It should be possible to copy the entire protein yes, but I would rather use deepcopy to create a fully new Structure object and manipulate that one. Something along the lines of: import copy [ ... Parse your structure to s...] s_copy = copy.deepcopy(s) for atom in s_copy.get_atoms(): *here use either atom.transform or just modify atom.coord* Cheers, Jo?o [...] Rodrigues http://nmr.chem.uu.nl/~joao 2011/10/5 Nana Trapnest > > Hello, > is it possible with structure builder copy all a protein and change atoms > coord??? How can I do this?? > Thanks to all of you! 
> Stefania > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From dilara.ally at gmail.com Wed Oct 5 23:21:29 2011 From: dilara.ally at gmail.com (Dilara Ally) Date: Wed, 05 Oct 2011 16:21:29 -0700 Subject: [Biopython] error with entrez id code Message-ID: <4E8CE679.5050107@gmail.com> Hi All I've written a program to identify Entrez gene ids from a blastall that I performed. The code is as follows: from Bio import SeqIO from Bio import Entrez import os import os.path import re import csv dirname1="/Users/dally/Desktop/BlastFiles/annotate_me/" dirname2="/Users/dally/Desktop/BlastFiles/annotated/" allfiles=os.listdir(dirname1) fanddir=[os.path.join(dirname1,fname) for fname in allfiles] OutFileName="Contig_annotation.csv" c=csv.writer(open(os.path.join(dirname2,OutFileName),"wb")) for f in fanddir: print f InFile=open(f,'rU') LineNumber=0 for Line in InFile: print LineNumber#, ':', Line ElementList=Line.split('\t') geneid=ElementList[1] #print geneid Sections=geneid.split('|') NewID=Sections[3] from Bio import Entrez from Bio import SeqFeature Entrez.email = "dally at projects.sdsu.edu" handle=Entrez.efetch(db="nucleotide", id=NewID,rettype="gb") # rettype="gb" is GenBank format or XML format retmode="xml" record=SeqIO.read(handle,"genbank") handle.close() #print record.id lineage=record.annotations["taxonomy"] c.writerow([ElementList[0],ElementList[1],ElementList[2],ElementList[3],ElementList[4],ElementList[5],ElementList[6],ElementList[7],ElementList[8], ElementList[9],ElementList[10], NewID, record.id, record.description, record.annotations["source"], lineage[0], lineage[1],lineage[2], record.annotations["keywords"], ]) LineNumber=LineNumber+1 InFile.close() The gene identifier looks like this: gi|2252639|gb|AC002292.1|AC002292. But I"m only interested in the fourth component (AC002292.1)It runs through a file with approximately 8000-10000 identifiers and then extracts information from the associated genbank file. 
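Just to show what that split gives, a quick check at the interpreter using the
example identifier above (this is only a worked example of the split('|') call
in the script):

>>> "gi|2252639|gb|AC002292.1|AC002292".split('|')
['gi', '2252639', 'gb', 'AC002292.1', 'AC002292']

so Sections[3] is the accession 'AC002292.1' that gets passed to Entrez.efetch.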
The code seemed to run fine on my first file for the first 1287 lines but then I got this error > raceback (most recent call last): > File "Ally_EntrezID_Search_Final_Script.py", line 38, in > record=SeqIO.read(handle,"genbank") > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", > line 604, in read > first = iterator.next() > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", > line 532, in parse > for r in i: > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 440, in parse_records > record = self.parse(handle, do_features) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 423, in parse > if self.feed(handle, consumer, do_features): > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 400, in feed > misc_lines, sequence_string = self.parse_footer() > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 921, in parse_footer > line = self.handle.readline() > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", > line 447, in readline > data = self._sock.recv(self._rbufsize) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", > line 533, in read > return self._read_chunked(amt) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", > line 586, in _read_chunked > value.append(self._safe_read(amt)) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", > line 637, in _safe_read > raise IncompleteRead(''.join(s), amt) > httplib.IncompleteRead: IncompleteRead(707 bytes read, 3147 more expected) I'm new to python and biopython programming. So any advice would be extremely appreciated. Thanks. Dilara From p.j.a.cock at googlemail.com Thu Oct 6 07:43:49 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Oct 2011 08:43:49 +0100 Subject: [Biopython] error with entrez id code In-Reply-To: <4E8CE679.5050107@gmail.com> References: <4E8CE679.5050107@gmail.com> Message-ID: On Thursday, October 6, 2011, Dilara Ally wrote: > Hi All > > I've written a program to identify Entrez gene ids from a blastall that I performed. The code is as follows: > > from Bio import SeqIO > from Bio import Entrez > ... 
> > The code seemed to run fine on my first file for the first 1287 lines but then I got this error > >> raceback (most recent call last): >> File "Ally_EntrezID_Search_Final_Script.py", line 38, in >> record=SeqIO.read(handle,"genbank") >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 604, in read >> first = iterator.next() >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 532, in parse >> for r in i: >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 440, in parse_records >> record = self.parse(handle, do_features) >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 423, in parse >> if self.feed(handle, consumer, do_features): >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 400, in feed >> misc_lines, sequence_string = self.parse_footer() >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 921, in parse_footer >> line = self.handle.readline() >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 447, in readline >> data = self._sock.recv(self._rbufsize) >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 533, in read >> return self._read_chunked(amt) >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 586, in _read_chunked >> value.append(self._safe_read(amt)) >> File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 637, in _safe_read >> raise IncompleteRead(''.join(s), amt) >> httplib.IncompleteRead: IncompleteRead(707 bytes read, 3147 more expected) > > I'm new to python and biopython programming. So any advice would be extremely appreciated. Is it always the same record that breaks? If so, what is the ID so we can try it out. If not, then it looks like a random network error, maybe you can stick a try/except in to refetch the data? Peter From animesh.agrawal at anu.edu.au Thu Oct 6 10:25:08 2011 From: animesh.agrawal at anu.edu.au (Animesh Agrawal) Date: Thu, 06 Oct 2011 21:25:08 +1100 Subject: [Biopython] Using bioPython and bioSQL In-Reply-To: <7770fe573faa2.4e8d81ae@anu.edu.au> References: <76b0a56038ee5.4e8d8031@anu.edu.au> <7660b9533ee66.4e8d806f@anu.edu.au> <75a084db398e1.4e8d80ac@anu.edu.au> <7660d30c39ced.4e8d80ea@anu.edu.au> <77109413397e8.4e8d8127@anu.edu.au> <7660dda43f476.4e8d8165@anu.edu.au> <77c0dae63f99c.4e8d81a2@anu.edu.au> <7770fe573faa2.4e8d81ae@anu.edu.au> Message-ID: <7710edf23d45a.4e8e1cb4@anu.edu.au> Hi All,I am trying to develop a interface for a local sequence depository in my lab. Using biopython cookbook examples I have been able to populate the database. But to query the database I want to create an interface so all other members in my lab can access it. I have no experience in doing this kind of development. I need some advice on best way of doing it and if there are already developed modules in biopython which can help me in attaining my objectives. 
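To give an idea of the kind of lookup I would like to put behind a simple web
page for the lab, here is a rough sketch only - the driver, login details, the
"lab_sequences" namespace and the accession below are just placeholders for
whatever our local install would use:

from BioSQL import BioSeqDatabase

# Placeholder connection details - adjust for the local BioSQL/MySQL install.
server = BioSeqDatabase.open_database(driver="MySQLdb", user="guest",
                                      passwd="", host="localhost",
                                      db="bioseqdb")
db = server["lab_sequences"]            # hypothetical namespace used when loading
record = db.lookup(accession="X12345")  # made-up accession, returns a SeqRecord
print record.id, record.description

Ideally each lab member would type an accession into a web form and get this
kind of information back, without having to touch Python at all.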
Cheers Animesh Animesh Agrawal PhD Scholar The John Curtin School of Medical Research Australian National University Canberra, Australia From p.j.a.cock at googlemail.com Thu Oct 6 10:39:57 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Oct 2011 11:39:57 +0100 Subject: [Biopython] Using bioPython and bioSQL In-Reply-To: <7710edf23d45a.4e8e1cb4@anu.edu.au> References: <76b0a56038ee5.4e8d8031@anu.edu.au> <7660b9533ee66.4e8d806f@anu.edu.au> <75a084db398e1.4e8d80ac@anu.edu.au> <7660d30c39ced.4e8d80ea@anu.edu.au> <77109413397e8.4e8d8127@anu.edu.au> <7660dda43f476.4e8d8165@anu.edu.au> <77c0dae63f99c.4e8d81a2@anu.edu.au> <7770fe573faa2.4e8d81ae@anu.edu.au> <7710edf23d45a.4e8e1cb4@anu.edu.au> Message-ID: On Thu, Oct 6, 2011 at 11:25 AM, Animesh Agrawal wrote: > Hi All,I am trying to develop a interface for a local sequence depository > in my lab. Using biopython cookbook examples I have been able to > populate the database. But to query the database I want to create an > interface so all other members in my lab can access it. I have no > experience in doing this kind of development. I need some advice > on best way of doing it and if there are already developed modules > in biopython which can help me in attaining my objectives. > Cheers > Animesh Hi Animesh, Do you mean some kind of web interface? Would you just need this to be read only? You can use GBrowse with BioSQL, but I believe CHADO is better supported as the schema. CHADO is also a better choice if you want users to be able to edit the annotation. http://gmod.org/wiki/Chado_-_Getting_Started Peter From sdavis2 at mail.nih.gov Thu Oct 6 10:51:20 2011 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 6 Oct 2011 06:51:20 -0400 Subject: [Biopython] Using bioPython and bioSQL In-Reply-To: <7710edf23d45a.4e8e1cb4@anu.edu.au> References: <76b0a56038ee5.4e8d8031@anu.edu.au> <7660b9533ee66.4e8d806f@anu.edu.au> <75a084db398e1.4e8d80ac@anu.edu.au> <7660d30c39ced.4e8d80ea@anu.edu.au> <77109413397e8.4e8d8127@anu.edu.au> <7660dda43f476.4e8d8165@anu.edu.au> <77c0dae63f99c.4e8d81a2@anu.edu.au> <7770fe573faa2.4e8d81ae@anu.edu.au> <7710edf23d45a.4e8e1cb4@anu.edu.au> Message-ID: Hi, Animesh. How do you want folks to query the database? Web? Command-line? Are the queries limited in scope or do you want to provide something fully general? Sean On Thu, Oct 6, 2011 at 6:25 AM, Animesh Agrawal wrote: > Hi All,I am trying to develop a interface for a local sequence depository in my lab. Using biopython cookbook examples I have been able to populate the database. But to query the database I want to create an interface so all other members in my lab can access it. I have no experience in doing this kind of development. I need some advice on best way of doing it and if there are already developed modules in biopython which can help me in attaining my objectives. > Cheers > Animesh > Animesh Agrawal > PhD Scholar > The John Curtin School of Medical Research > Australian National University > Canberra, Australia > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From elisa.sechi85 at hotmail.it Thu Oct 6 10:43:25 2011 From: elisa.sechi85 at hotmail.it (Elisa sechi) Date: Thu, 6 Oct 2011 12:43:25 +0200 Subject: [Biopython] help for overwrite a pdb file In-Reply-To: References: Message-ID: Hi! All ! I'm contacting you in order to ask help about Biopython. 
I'm using Python, I have extracted the atom coordinates of a protein from a
PDB file and I have used a quaternion in order to rotate the coordinates.
I have put them in a new matrix but now the problem is: how do I save the
Cartesian coordinates in a PDB file? Do I have to create a new structure
with the StructureBuilder class?
I ask you if there is a way to overwrite the new Cartesian coordinates in
the old PDB file that I have used.
Please help me!!!
Thank you very much!
Elisa
bye

From anaryin at gmail.com  Thu Oct 6 11:01:28 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 6 Oct 2011 13:01:28 +0200
Subject: [Biopython] help for overwrite a pdb file
In-Reply-To:
References:
Message-ID:

Hello Elisa,

You should use PDBIO to generate a new structure file. If you have already
transformed the coordinates, it's pretty simple:

from Bio.PDB import PDBIO

io = PDBIO()
io.set_structure(your_structure)
io.save('new_structure.pdb')

Cheers,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao

2011/10/6 Elisa sechi

>
>
>
>
>
>
>
>
>
>
>
> Hi! All !
> I'm contacting you in order to ask help about Biopython.
> I'm using Python, I have extracted the atom coordinates of a protein from a
> PDB file and I have used a quaternion in order to rotate the coordinates.
> I have put them in a new matrix but now the problem is: how do I save the
> Cartesian coordinates in a PDB file? Do I have to create a new structure
> with the StructureBuilder class?
> I ask you if there is a way to overwrite the new Cartesian coordinates in
> the old PDB file that I have used.
> Please help me!!!
> Thank you very much!
> Elisa
> bye
>
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From p.j.a.cock at googlemail.com  Thu Oct 6 11:02:57 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Oct 2011 12:02:57 +0100
Subject: [Biopython] help for overwrite a pdb file
In-Reply-To:
References:
Message-ID:

On Thu, Oct 6, 2011 at 11:43 AM, Elisa sechi wrote:
>
> Hi! All !
> I'm contacting you in order to ask help about Biopython.
> I'm using Python, I have extracted the atom coordinates of a protein from a
> PDB file and I have used a quaternion in order to rotate the coordinates.
> I have put them in a new matrix but now the problem is: how do I save the
> Cartesian coordinates in a PDB file? Do I have to create a new structure
> with the StructureBuilder class?
> I ask you if there is a way to overwrite the new Cartesian coordinates in
> the old PDB file that I have used.
> Please help me!!!
> Thank you very much!
> Elisa
> bye

There's an example here which rotates models in a PDB file and saves the
output:

http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/

It is not using quaternions for the rotation, but otherwise it should be
helpful.
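If you already have the rotation as a 3x3 numpy array (e.g. converted from
your quaternion), an untested sketch along these lines should do the
transform-and-save part - the file names and the identity matrix below are
just placeholders:

import numpy
from Bio.PDB import PDBParser, PDBIO

parser = PDBParser()
structure = parser.get_structure("prot", "input.pdb")  # placeholder file name

rotation = numpy.identity(3)   # put your quaternion-derived 3x3 matrix here
translation = numpy.zeros(3)   # no shift

for atom in structure.get_atoms():
    # Atom.transform does coord = dot(coord, rotation) + translation, so
    # depending on your convention you may need the transpose of the matrix.
    atom.transform(rotation, translation)

io = PDBIO()
io.set_structure(structure)
io.save("rotated.pdb")  # placeholder output name

If you have already computed the rotated coordinates yourself, you can
instead just assign atom.coord for each atom and then save with PDBIO in
the same way.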
Peter From animesh.agrawal at anu.edu.au Thu Oct 6 11:23:39 2011 From: animesh.agrawal at anu.edu.au (Animesh Agrawal) Date: Thu, 06 Oct 2011 22:23:39 +1100 Subject: [Biopython] Using bioPython and bioSQL In-Reply-To: <77109ef23fc49.4e8d8f9e@anu.edu.au> References: <76b0a56038ee5.4e8d8031@anu.edu.au> <7660b9533ee66.4e8d806f@anu.edu.au> <75a084db398e1.4e8d80ac@anu.edu.au> <7660d30c39ced.4e8d80ea@anu.edu.au> <77109413397e8.4e8d8127@anu.edu.au> <7660dda43f476.4e8d8165@anu.edu.au> <77c0dae63f99c.4e8d81a2@anu.edu.au> <7770fe573faa2.4e8d81ae@anu.edu.au> <7710edf23d45a.4e8e1cb4@anu.edu.au> <77c0838039ccd.4e8d8edc@anu.edu.au> <7660c1093accc.4e8d8f1a@anu.edu.au> <7710e8403ab11.4e8d8f58@anu.edu.au> <77b0ce493f67b.4e8d8f61@anu.edu.au> <77109ef23fc49.4e8d8f9e@anu.edu.au> Message-ID: <7710e50538fb7.4e8e2a6b@anu.edu.au> Hi Peter,Thanks a lot for your reply.Yes I want web interface and I need it to be read only. I'll check out GBrowse and CHADO. Cheers, Animesh On 10/06/11, Peter Cock wrote: > On Thu, Oct 6, 2011 at 11:25 AM, Animesh Agrawal > wrote: > > Hi All,I am trying to develop a interface for a local sequence depository > > in my lab. Using biopython cookbook examples I have been able to > > populate the database. But to query the database I want to create an > > interface so all other members in my lab can access it. I have no > > experience in doing this kind of development. I need some advice > > on best way of doing it and if there are already developed modules > > in biopython which can help me in attaining my objectives. > > Cheers > > Animesh > > Hi Animesh, > > Do you mean some kind of web interface? Would you just need > this to be read only? > > You can use GBrowse with BioSQL, but I believe CHADO is better > supported as the schema. CHADO is also a better choice if you > want users to be able to edit the annotation. > http://gmod.org/wiki/Chado_-_Getting_Started > > Peter > > From animesh.agrawal at anu.edu.au Thu Oct 6 11:27:51 2011 From: animesh.agrawal at anu.edu.au (Animesh Agrawal) Date: Thu, 06 Oct 2011 22:27:51 +1100 Subject: [Biopython] Using bioPython and bioSQL In-Reply-To: <7680a8613e5c9.4e8d9094@anu.edu.au> References: <76b0a56038ee5.4e8d8031@anu.edu.au> <7660b9533ee66.4e8d806f@anu.edu.au> <75a084db398e1.4e8d80ac@anu.edu.au> <7660d30c39ced.4e8d80ea@anu.edu.au> <77109413397e8.4e8d8127@anu.edu.au> <7660dda43f476.4e8d8165@anu.edu.au> <77c0dae63f99c.4e8d81a2@anu.edu.au> <7770fe573faa2.4e8d81ae@anu.edu.au> <7710edf23d45a.4e8e1cb4@anu.edu.au> <77b080583927a.4e8d9019@anu.edu.au> <76e0a1b23d252.4e8d9057@anu.edu.au> <7680a8613e5c9.4e8d9094@anu.edu.au> Message-ID: <7660a5e03929b.4e8e2b67@anu.edu.au> Hi Sean,I definitely want a web interface. Queries should be limited in scope. Cheers, Animesh On 10/06/11, Sean Davis wrote: > Hi, Animesh. > > How do you want folks to query the database?? Web?? Command-line?? Are > the queries limited in scope or do you want to provide something fully > general? > > Sean > > On Thu, Oct 6, 2011 at 6:25 AM, Animesh Agrawal > wrote: > > Hi All,I am trying to develop a interface for a local sequence depository in my lab. Using biopython cookbook examples I have been able to populate the database. But to query the database I want to create an interface so all other members in my lab can access it. I have no experience in doing this kind of development. I need some advice on best way of doing it and if there are already developed modules in biopython which can help me in attaining my objectives. 
> > Cheers > > Animesh > > Animesh Agrawal > > PhD Scholar > > The John Curtin School of Medical Research > > Australian National University > > Canberra, Australia > > _______________________________________________ > > Biopython mailing list ?- ?Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > From sdavis2 at mail.nih.gov Thu Oct 6 11:50:07 2011 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 6 Oct 2011 07:50:07 -0400 Subject: [Biopython] Using bioPython and bioSQL In-Reply-To: <7660a5e03929b.4e8e2b67@anu.edu.au> References: <76b0a56038ee5.4e8d8031@anu.edu.au> <7660b9533ee66.4e8d806f@anu.edu.au> <75a084db398e1.4e8d80ac@anu.edu.au> <7660d30c39ced.4e8d80ea@anu.edu.au> <77109413397e8.4e8d8127@anu.edu.au> <7660dda43f476.4e8d8165@anu.edu.au> <77c0dae63f99c.4e8d81a2@anu.edu.au> <7770fe573faa2.4e8d81ae@anu.edu.au> <7710edf23d45a.4e8e1cb4@anu.edu.au> <77b080583927a.4e8d9019@anu.edu.au> <76e0a1b23d252.4e8d9057@anu.edu.au> <7680a8613e5c9.4e8d9094@anu.edu.au> <7660a5e03929b.4e8e2b67@anu.edu.au> Message-ID: Hi, Animesh. Depending on the types of queries, building small CGI scripts or even a small web application can be quite useful. Most recently, I have been using the flask micro-framework ( http://flask.pocoo.org/ ) for building such small applications. If you can figure out how to do the queries that you want with biopython or SQL, then it isn't too hard to translate that to a couple of web pages, one for gathering input from the user and a second for delivering results. Sean On Thu, Oct 6, 2011 at 7:27 AM, Animesh Agrawal wrote: > Hi Sean,I definitely want a web interface. Queries should be limited in scope. > Cheers, > Animesh > > On 10/06/11, Sean Davis ? wrote: >> Hi, Animesh. >> >> How do you want folks to query the database?? Web?? Command-line?? Are >> the queries limited in scope or do you want to provide something fully >> general? >> >> Sean >> >> On Thu, Oct 6, 2011 at 6:25 AM, Animesh Agrawal >> wrote: >> > Hi All,I am trying to develop a interface for a local sequence depository in my lab. Using biopython cookbook examples I have been able to populate the database. But to query the database I want to create an interface so all other members in my lab can access it. I have no experience in doing this kind of development. I need some advice on best way of doing it and if there are already developed modules in biopython which can help me in attaining my objectives. >> > Cheers >> > Animesh >> > Animesh Agrawal >> > PhD Scholar >> > The John Curtin School of Medical Research >> > Australian National University >> > Canberra, Australia >> > _______________________________________________ >> > Biopython mailing list ?- ?Biopython at lists.open-bio.org >> > http://lists.open-bio.org/mailman/listinfo/biopython >> > >> >> > > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From tiagoantao at gmail.com Thu Oct 6 20:14:56 2011 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 6 Oct 2011 21:14:56 +0100 Subject: [Biopython] UniprotXML dbReference parser Message-ID: Hi, Do I understand wrongly or the UniprotXML parser for simply ignores the "property type" information? If so, is there any way to get access to the XML raw data (so that I can grep it)? Thanks a lot, Tiago -- "If you want to get laid, go to college.? If you want an education, go to the library." 
- Frank Zappa

From p.j.a.cock at googlemail.com  Thu Oct 6 22:26:19 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Oct 2011 23:26:19 +0100
Subject: [Biopython] UniprotXML dbReference parser
In-Reply-To:
References:
Message-ID:

2011/10/6 Tiago Antão :
> Hi,
>
> Do I understand wrongly or the UniprotXML parser for
>
>
>
>
>
> simply ignores the "property type" information?

Probably... I think it emulates the very simple list of
db:acc strings produced by the GenBank parser etc,
but try dir(...) on it. Although PDB references look
to get part of their information dumped in the
record's annotations dictionary.

I guess we could return a list of DB reference objects
which happen to act like the old style string for back
compatibility.

> If so, is there any way to get access to the XML raw data
> (so that I can grep it)?

Are you asking for XML parsing library recommendations?
Or you could hack the SeqIO parser instead... I've CC'd
Andrea who wrote it in case he can add something
more practical.

Peter

From tiagoantao at gmail.com  Thu Oct 6 22:43:01 2011
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 6 Oct 2011 23:43:01 +0100
Subject: [Biopython] UniprotXML dbReference parser
In-Reply-To:
References:
Message-ID:

Hi,

2011/10/6 Peter Cock :
> Probably... I think it emulates the very simple list of
> db:acc strings produced by the GenBank parser etc,
> but try dir(...) on it. Although PDB references look
> to get part of their information dumped in the
> record's annotations dictionary.

The problem is that the Gene ID is inside (thus it never
gets returned). We get the protein ID only.

> Are you asking for XML parsing library recommendations?
> Or you could hack the SeqIO parser instead... I've CC'd
> Andrea who wrote it in case he can add something
> more practical.

I just used xml.parsers.expat. Not a problem for myself, but the fact
is that the UniProt XML parser does not return the whole information
that is there.

--
"If you want to get laid, go to college. If you want an education, go
to the library."
- Frank Zappa

From p.j.a.cock at googlemail.com  Fri Oct 7 07:22:49 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 7 Oct 2011 08:22:49 +0100
Subject: [Biopython] changing record attributes while iterating
In-Reply-To:
References:
Message-ID:

On Friday, October 7, 2011, Michal wrote:
> Hello,
> Does your code with generator save the whole file in the
> memory or does it read each entry and save it immediately?
> Thank you in advance.

Using a generator expression like that, only one SeqRecord is
in memory at a time. It goes through the input FASTA one
record at a time, renames it, saves it immediately.

Peter

P.S. list CC'd

From dilara.ally at gmail.com  Fri Oct 7 17:34:24 2011
From: dilara.ally at gmail.com (Dilara Ally)
Date: Fri, 07 Oct 2011 10:34:24 -0700
Subject: [Biopython] error with entrez id code
In-Reply-To:
References: <4E8CE679.5050107@gmail.com>
Message-ID: <4E8F3820.1030002@gmail.com>

> Is it always the same record that breaks? If so, what is the ID so we
> can try it out.
>
> If not, then it looks like a random network error, maybe you can stick
> a try/except in to refetch the data?

Hi Peter

Individually the identifier has no problem calling up the record, but the
problem seems to be in the loop. As a newbie, what is a try/except?

Thanks.
Dilara On 10/6/11 12:43 AM, Peter Cock wrote: > > > On Thursday, October 6, 2011, Dilara Ally > wrote: > > Hi All > > > > I've written a program to identify Entrez gene ids from a blastall > that I performed. The code is as follows: > > > > from Bio import SeqIO > > from Bio import Entrez > > ... > > > > The code seemed to run fine on my first file for the first 1287 > lines but then I got this error > > > >> raceback (most recent call last): > >> File "Ally_EntrezID_Search_Final_Script.py", line 38, in > >> record=SeqIO.read(handle,"genbank") > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", > line 604, in read > >> first = iterator.next() > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", > line 532, in parse > >> for r in i: > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 440, in parse_records > >> record = self.parse(handle, do_features) > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 423, in parse > >> if self.feed(handle, consumer, do_features): > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 400, in feed > >> misc_lines, sequence_string = self.parse_footer() > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 921, in parse_footer > >> line = self.handle.readline() > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", > line 447, in readline > >> data = self._sock.recv(self._rbufsize) > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", > line 533, in read > >> return self._read_chunked(amt) > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", > line 586, in _read_chunked > >> value.append(self._safe_read(amt)) > >> File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", > line 637, in _safe_read > >> raise IncompleteRead(''.join(s), amt) > >> httplib.IncompleteRead: IncompleteRead(707 bytes read, 3147 more > expected) > > > > I'm new to python and biopython programming. So any advice would be > extremely appreciated. > > > Peter From p.j.a.cock at googlemail.com Sat Oct 8 14:10:12 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 8 Oct 2011 15:10:12 +0100 Subject: [Biopython] error with entrez id code In-Reply-To: <4E8F3820.1030002@gmail.com> References: <4E8CE679.5050107@gmail.com> <4E8F3820.1030002@gmail.com> Message-ID: On Fri, Oct 7, 2011 at 6:34 PM, Dilara Ally wrote: > Is it always the same record that breaks? If so, what is the ID so we can > try it out. > > If not, then it looks like a random network error, maybe you can stick a > try/except in to refetch the data? > > Hi Peter > > Individually the identifier has no problem calling up the record, but the > problem seems to be in the loop.? As a newbie, what is a try/except? > > Thanks. By try/except I mean use Python's error handling mechanism to spot when there is a network error. See: http://docs.python.org/tutorial/errors.html e.g. Something like this would give you a second chance. 
Note that exception httplib.IncompleteRead is a subclass of the more general HTTPException, see: http://docs.python.org/library/httplib.html from httplib import HTTPException try: handle=Entrez.efetch(db="nucleotide", id=NewID,rettype="gb") # rettype="gb" is GenBank format or XML format retmode="xml" record=SeqIO.read(handle,"genbank") handle.close() except HTTPException, e: print "Network problem: %s" % e print "Second (and final) attempt..." handle=Entrez.efetch(db="nucleotide", id=NewID,rettype="gb") # rettype="gb" is GenBank format or XML format retmode="xml" record=SeqIO.read(handle,"genbank") handle.close() If the second attempt fails, you'll get an exception like before. There are more elegant ways to write that (with less repetition, and making multiple retries easy), but I'm trying to keep this simple as an introductory example. Peter From chaouki.amir at gmail.com Sun Oct 9 19:37:42 2011 From: chaouki.amir at gmail.com (amir chaouki) Date: Sun, 9 Oct 2011 20:37:42 +0100 Subject: [Biopython] clustal header Message-ID: Hi, i want to to do a multiple sequence alignment with the clustalw method but i keep getting this error: ", ".join(known_headers))) ValueError: a is not a known CLUSTAL header: CLUSTAL, PROBCONS, MUSCLE my sequence file contains this > as headers for every sequence name, so what are the compatible headers? -- *Amir Chaouki* From p.j.a.cock at googlemail.com Sun Oct 9 20:09:00 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 9 Oct 2011 21:09:00 +0100 Subject: [Biopython] clustal header In-Reply-To: References: Message-ID: On Sunday, October 9, 2011, amir chaouki wrote: > Hi, > i want to to do a multiple sequence alignment with the clustalw method but i > keep getting this error: ", ".join(known_headers))) > ValueError: a is not a known CLUSTAL header: CLUSTAL, PROBCONS, MUSCLE > > my sequence file contains this > as headers for every sequence name, so > what are the compatible headers? Hi Amir, That error message can come from trying to parse a non-clustal file as if it were a clustal file. Perhaps you tried to parse a fasta file? If you showed the code that caused this message, it would be easier to help you, Peter From sdavis2 at mail.nih.gov Wed Oct 12 18:54:13 2011 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Wed, 12 Oct 2011 14:54:13 -0400 Subject: [Biopython] [OT][Job] Functional genomic analysis of cancer/RNAi screening Message-ID: Functional genomic analysis of cancer/RNAi screening NATIONAL CANCER INSTITUTE, BETHESDA, MD The laboratory of Dr. Natasha Caplen, within the Genetics Branch, CCR, NCI, is seeking postdoctoral candidates for a project focused on functional genomic analysis using RNAi screening approaches. We are looking for a highly motivated candidate who has received their PhD within the last year to contribute to our on-going studies applying RNAi based loss-of-function approaches to probe cancer gene function. The successful candidate will be expected to perform both bench and computational-based studies and will be involved in projects requiring the development and analysis of large-scale RNAi screening data focused on the biology of oncogenic transcription factors. The candidate will be involved in the design and employment of RNAi screens (up to genome-wide scale) and analysis of the data generated through application of state of the art computational methodologies. 
This large-scale RNAi screening data will also be assessed in the context of other relevant datasets such as next generation sequencing, epigenetic, gene expression and drug sensitivity datasets. The computational analyses will ultimately be used to systematically build hypotheses to identify key pathways and networks underlying the specifics of the cancer biology and the candidate will then be expected to experimentally test these hypotheses. Dr. Caplen?s laboratory conducts both independent and collaborative studies and the successful candidate will have the opportunity to interact with NCI and NIH investigators studying many different cancer biology questions using RNAi based technologies. Currently we are involved in RNAi studies relevant to the biology and treatment of several pediatric cancers, colorectal, breast and prostate cancer. For further information please see Dr. Caplen?s website at http://ccr.cancer.gov/staff/staff.asp?profileid=9035. Requirements: The candidate must have a Ph.D in biological sciences with additional training in computational biology or bioinformatics. Previous experience in molecular biology including mammalian cell culture and assessment of gene expression is required, as, too, is experience in programming skills in languages such as perl, python, R, java, or c++. As the position involves the need to discuss scientific data and strategy with members of the existing team and with collaborators, oral and written fluency in the English language is required. Applicants should email a cover letter describing research experience and interests, curriculum vitae, bibliography, and contact information for three references (including the current supervisor) to Dr. Natasha Caplen at ncaplen at mail.nih.gov. Please include ?PD2011? in the email subject line. From paul at tonair.de Thu Oct 13 10:26:54 2011 From: paul at tonair.de (paul at tonair.de) Date: Thu, 13 Oct 2011 12:26:54 +0200 Subject: [Biopython] Bio.PDB question - from PQR to an "extended" PDB Message-ID: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> dear biopython users, i'm trying to read in a pqr file with the Bio.PDB module. In a PQR file, the atom charge and atom radius are stored instead of the occupancy & B-factor. Apparently, the negative charge values make trouble while reading in. (1) Is there a way to tweak Bio.PDB module to read in a PQR file? More to the background of this task: I would like to keep the charge and the radius in order to output a PDB file with more than 80 lines. The pdb-like output looks like this: ATOM 1 C1 UNL _0001_000 9.643 1.777 18.433 1.700 0.000 BK____M000 The text "BK____M000" refers to a conformer of a side chain and is needed by a PoissonBoltzmann named mcce (multi-conformation continuum electrostatics). (2) Can Bio.PDB generate such an output file? Cheers & Thanks, Paul From p.j.a.cock at googlemail.com Thu Oct 13 10:40:14 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Oct 2011 11:40:14 +0100 Subject: [Biopython] Bio.PDB question - from PQR to an "extended" PDB In-Reply-To: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> References: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> Message-ID: On Thu, Oct 13, 2011 at 11:26 AM, wrote: > > dear biopython users, > > i'm trying to read in a pqr file with the > Bio.PDB module. In a PQR file, the atom charge and atom radius are > stored instead of the occupancy & B-factor. > Apparently, the negative > charge values make trouble while reading in. 
> > (1) Is there a way to > tweak Bio.PDB module to read in a PQR file? If a negative B-factor was the only issue, probably yes. > More to the background of > this task: I would like to keep the charge and the radius in order to > output a PDB file with more than 80 lines. You mean more than 80 columns? i.e. Longer than PDB norms? > The pdb-like output looks > like this: > ATOM 1 C1 UNL _0001_000 9.643 1.777 18.433 1.700 0.000 > BK____M000 > The text "BK____M000" refers to a conformer of a side chain > and is needed by a PoissonBoltzmann named mcce (multi-conformation > continuum electrostatics). > > (2) Can Bio.PDB generate such an output > file? Not yet ;) > Cheers & Thanks, > Paul It would help if you could share some sample data (URLs) and links to this PDB-like PQR file format's specification (assuming it has one). Regards, Peter From anaryin at gmail.com Thu Oct 13 10:43:06 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 13 Oct 2011 12:43:06 +0200 Subject: [Biopython] Bio.PDB question - from PQR to an "extended" PDB In-Reply-To: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> References: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> Message-ID: Hello Paul, Straight from Pymol :) Bio.PDB cannot read PQR files as is, but since the format is quite similar to the PDB it should be easy to convert. The first step is to know if you want to develop a converter too (you will need the forcefield atomic charges and radius for that) or just a "parser". Parsing is easy, it's a matter of adapting the current SMCRA objects and PDBParser. Converting requires much more and is probably superfluous given the PDB2PQR software. Some important information on the format: http://www.poissonboltzmann.org/file-formats/biomolecular-structurw/pqr I think the best course of action is to add a PQRParser class that has different residue properties than the regular PDB. For example, occupancy and bfactor are not used at all.. Let me know what you think, Cheers, Jo?o From paul at tonair.de Thu Oct 13 11:51:42 2011 From: paul at tonair.de (paul at tonair.de) Date: Thu, 13 Oct 2011 13:51:42 +0200 Subject: [Biopython] Bio.PDB question - from PQR to an "extended" PDB In-Reply-To: References: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> Message-ID: Dear all, a PQR functionality within biopython would be great! Regarding the output of extended PDB files I would like to write: There is no detailed description on such files: http://www.sci.ccny.cuny.edu/~mcce/doc/running_mcce2.php [1] see chapter 3.2.4: step2_out.pdb: input structure file of step 3 in mcce extended pdb format extended means: the conformer is added beyond the element located somewhere around column 80. Is there any workaround with the currect biopython release to read in PQR and dump out such an extended PDB file? Cheers & thanks, Paul On Thu, 13 Oct 2011 12:48:22 +0200, Mikael Trellet wrote: This PQRParser class would be a nice add to Bio.PDB indeed, and shouldn't take a very long time to develop. Could work on it with you Joao, if the need exists obviously. Regards, Mikael On Thu, Oct 13, 2011 at 12:43 PM, Jo?o Rodrigues wrote: Hello Paul, Straight from Pymol :) Bio.PDB cannot read PQR files as is, but since the format is quite similar to the PDB it should be easy to convert. The first step is to know if you want to develop a converter too (you will need the forcefield atomic charges and radius for that) or just a "parser". Parsing is easy, it's a matter of adapting the current SMCRA objects and PDBParser. 
Converting requires much more and is probably superfluous given the PDB2PQR software. Some important information on the format: http://www.poissonboltzmann.org/file-formats/biomolecular-structurw/pqr [3] I think the best course of action is to add a PQRParser class that has different residue properties than the regular PDB. For example, occupancy and bfactor are not used at all.. Let me know what you think, Cheers, Jo?o _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org [4] http://lists.open-bio.org/mailman/listinfo/biopython [5] -- Mikael TRELLET, Computational structural biology group, Utrecht University Bijvoet Center, The Netherlands Links: ------ [1] http://www.sci.ccny.cuny.edu/~mcce/doc/running_mcce2.php [2] mailto:anaryin at gmail.com [3] http://www.poissonboltzmann.org/file-formats/biomolecular-structurw/pqr [4] mailto:Biopython at lists.open-bio.org [5] http://lists.open-bio.org/mailman/listinfo/biopython From anaryin at gmail.com Thu Oct 13 12:27:54 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 13 Oct 2011 14:27:54 +0200 Subject: [Biopython] Bio.PDB question - from PQR to an "extended" PDB In-Reply-To: References: <20a268651765b6b043bb4dd5ef3b3057@mail.canobus.com> Message-ID: Dear Paul, You would have to do two things: 1. First, modify PDBParser so that it reads more characters in the occupancy and bfactor fields 2. Modify PDBIO so that it is able to output a field beyond the element OR just create your own function to print information of a residue and use it instead of PDBIO. How do you get the conformer information? From paul at tonair.de Fri Oct 14 12:00:04 2011 From: paul at tonair.de (paul at tonair.de) Date: Fri, 14 Oct 2011 14:00:04 +0200 Subject: [Biopython] ligand PDB files Message-ID: Dear all, I'm having trouble to read in the attached PDB file - this is my code: " from Bio.PDB import * parser=PDBParser() structure=parser.get_structure("PHA-L","./2w26_lig.pdb") for model in structure: for chain in model: for residue in chain: for atom in residue: print atom " which gives this error: " File "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/PDBParser.py", line 66, in get_structure self._parse(file.readlines()) File "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/PDBParser.py", line 89, in _parse self.trailer=self._parse_coordinates(coords_trailer) File "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/PDBParser.py", line 205, in _parse_coordinates fullname, serial_number, element) File "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/StructureBuilder.py", line 197, in init_atom fullname, serial_number, element) File "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/Atom.py", line 68, in __init__ assert not element or element == element.upper(), element AssertionError: Cl " Does this mean that the PDB parser only recognizes "amino acid-atoms", i.e. a chlorine does not work? 
Cheers & Thanks, Paul -------------- next part -------------- COMPND 2w26_LIG.pdb_0 AUTHOR GENERATED BY OPEN BABEL 2.3.0 ATOM 1 C1 RIV A 1 9.643 1.777 18.433 1.00 0.00 C ATOM 2 N1 RIV A 1 8.303 2.377 18.109 1.00 0.00 N ATOM 3 C2 RIV A 1 10.053 0.667 17.441 1.00 0.00 C ATOM 4 C3 RIV A 1 7.671 2.122 16.881 1.00 0.00 C ATOM 5 O1 RIV A 1 9.768 1.124 16.111 1.00 0.00 O ATOM 6 C4 RIV A 1 8.355 1.223 15.853 1.00 0.00 C ATOM 7 C5 RIV A 1 6.487 4.959 20.981 1.00 0.00 C ATOM 8 C6 RIV A 1 7.333 5.468 19.984 1.00 0.00 C ATOM 9 C7 RIV A 1 6.237 3.551 21.013 1.00 0.00 C ATOM 10 C8 RIV A 1 7.918 4.619 19.048 1.00 0.00 C ATOM 11 C9 RIV A 1 6.837 2.690 20.070 1.00 0.00 C ATOM 12 C10 RIV A 1 7.682 3.222 19.078 1.00 0.00 C ATOM 13 O2 RIV A 1 6.583 2.613 16.630 1.00 0.00 O ATOM 14 N2 RIV A 1 5.906 5.863 21.947 1.00 0.00 N ATOM 15 C11 RIV A 1 5.040 5.543 22.995 1.00 0.00 C ATOM 16 C12 RIV A 1 6.146 7.326 22.000 1.00 0.00 C ATOM 17 O3 RIV A 1 4.690 6.614 23.757 1.00 0.00 O ATOM 18 C13 RIV A 1 5.213 7.787 23.134 1.00 0.00 C ATOM 19 O4 RIV A 1 4.634 4.419 23.228 1.00 0.00 O ATOM 20 C14 RIV A 1 5.924 8.721 24.155 1.00 0.00 C ATOM 21 N3 RIV A 1 7.078 8.136 24.932 1.00 0.00 N ATOM 22 C15 RIV A 1 8.402 8.558 24.672 1.00 0.00 C ATOM 23 S1 RIV A 1 11.131 8.264 25.063 1.00 0.00 S ATOM 24 C16 RIV A 1 11.805 7.503 26.288 1.00 0.00 C ATOM 25 C17 RIV A 1 9.567 8.044 25.466 1.00 0.00 C ATOM 26 C18 RIV A 1 10.794 7.011 27.130 1.00 0.00 C ATOM 27 C19 RIV A 1 9.509 7.324 26.659 1.00 0.00 C ATOM 28 O5 RIV A 1 8.611 9.379 23.797 1.00 0.00 O ATOM 29 Cl1 RIV A 1 13.544 7.302 26.531 1.00 0.00 Cl ATOM 30 H RIV A 1 9.643 1.777 18.433 1.00 0.00 H ATOM 31 H RIV A 1 9.643 1.777 18.433 1.00 0.00 H ATOM 32 H RIV A 1 10.053 0.667 17.441 1.00 0.00 H ATOM 33 H RIV A 1 10.053 0.667 17.441 1.00 0.00 H ATOM 34 H RIV A 1 8.355 1.223 15.853 1.00 0.00 H ATOM 35 H RIV A 1 8.355 1.223 15.853 1.00 0.00 H ATOM 36 H RIV A 1 7.333 5.468 19.984 1.00 0.00 H ATOM 37 H RIV A 1 6.237 3.551 21.013 1.00 0.00 H ATOM 38 H RIV A 1 7.918 4.619 19.048 1.00 0.00 H ATOM 39 H RIV A 1 6.837 2.690 20.070 1.00 0.00 H ATOM 40 H RIV A 1 6.146 7.326 22.000 1.00 0.00 H ATOM 41 H RIV A 1 6.146 7.326 22.000 1.00 0.00 H ATOM 42 H RIV A 1 5.213 7.787 23.134 1.00 0.00 H ATOM 43 H RIV A 1 5.924 8.721 24.155 1.00 0.00 H ATOM 44 H RIV A 1 5.924 8.721 24.155 1.00 0.00 H ATOM 45 H RIV A 1 7.078 8.136 24.932 1.00 0.00 H ATOM 46 H RIV A 1 10.794 7.011 27.130 1.00 0.00 H ATOM 47 H RIV A 1 9.509 7.324 26.659 1.00 0.00 H CONECT 1 3 2 30 31 CONECT 1 CONECT 2 4 1 12 CONECT 3 5 1 32 33 CONECT 3 CONECT 4 6 13 2 CONECT 5 6 3 CONECT 6 5 4 34 35 CONECT 6 CONECT 7 8 9 14 CONECT 8 10 7 36 CONECT 9 11 7 37 CONECT 10 12 8 38 CONECT 11 12 9 39 CONECT 12 2 10 11 CONECT 13 4 CONECT 14 7 16 15 CONECT 15 14 19 17 CONECT 16 14 18 40 41 CONECT 16 CONECT 17 15 18 CONECT 18 16 17 20 42 CONECT 18 CONECT 19 15 CONECT 20 18 21 43 44 CONECT 20 CONECT 21 20 22 45 CONECT 22 28 21 25 CONECT 23 25 24 CONECT 24 23 29 26 CONECT 25 22 23 27 CONECT 26 24 27 46 CONECT 27 25 26 47 CONECT 28 22 CONECT 29 24 CONECT 30 1 CONECT 31 1 CONECT 32 3 CONECT 33 3 CONECT 34 6 CONECT 35 6 CONECT 36 8 CONECT 37 9 CONECT 38 10 CONECT 39 11 CONECT 40 16 CONECT 41 16 CONECT 42 18 CONECT 43 20 CONECT 44 20 CONECT 45 21 CONECT 46 26 CONECT 47 27 MASTER 0 0 0 0 0 0 0 0 47 0 47 0 END From robert.campbell at queensu.ca Fri Oct 14 13:04:22 2011 From: robert.campbell at queensu.ca (Robert Campbell) Date: Fri, 14 Oct 2011 09:04:22 -0400 Subject: [Biopython] ligand PDB files In-Reply-To: References: Message-ID: 
<20111014090422.639e9284@adelie.biochem.queensu.ca> Dear Paul, On Fri, 2011-10-14 14:00 EDT, paul at tonair.de wrote: > Dear all, > I'm having trouble to read in the attached PDB file - this > is my code: Your code is okay. The problem is in your PDB file: > File > "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/Atom.py", line > 68, in __init__ > assert not element or element == element.upper(), > element > AssertionError: Cl > " > Does this mean that the PDB parser only > recognizes "amino acid-atoms", i.e. a chlorine does not work? The chlorine atoms should be "CL" not "Cl" in a proper PDB file. Cheers, Rob -- Robert L. Campbell, Ph.D. Senior Research Associate/Adjunct Assistant Professor Dept. of Biomedical & Molecular Sciences, Botterell Hall Rm 644 Queen's University, Kingston, ON K7L 3N6 Canada Tel: 613-533-6821 http://pldserver1.biochem.queensu.ca/~rlc From paul at tonair.de Fri Oct 14 13:51:47 2011 From: paul at tonair.de (paul at tonair.de) Date: Fri, 14 Oct 2011 15:51:47 +0200 Subject: [Biopython] ligand PDB files In-Reply-To: <20111014090422.639e9284@adelie.biochem.queensu.ca> References: <20111014090422.639e9284@adelie.biochem.queensu.ca> Message-ID: <751ac2c9e7bf1a3659f31849565d1122@mail.canobus.com> Dear Rob, thank you very much for your help, this fixed the error!! Cheers, Paul > > Your code is okay. The problem is in your PDB file: > > >> File >> "/SW/python/lib/python2.6/site-packages/biopython/Bio/PDB/Atom.py", line >> 68, in __init__ >> assert not element or element == element.upper(), >> element >> AssertionError: Cl >> " >> Does this mean that the PDB parser only >> recognizes "amino acid-atoms", i.e. a chlorine does not work? > > The chlorine atoms should be "CL" not "Cl" in a proper PDB file. > > Cheers, > Rob From jordan.r.willis at Vanderbilt.Edu Sat Oct 15 20:59:58 2011 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Sat, 15 Oct 2011 15:59:58 -0500 Subject: [Biopython] Blast DB keeps crashing nodes Message-ID: <66965B9E-2AD6-4E02-BB8E-2F11A820DCDF@Vanderbilt.Edu> Hello Biopython, I was wondering if anyone has worked extensively with the Blast Database locally. I am blasting millions of sequences using Biopython as my backend framework. I am using a high-throughput computer cluster to blast each sequence. Rather than submit two million jobs, I have divided the fasta files up into 50 or so. The problem I am facing is a memory issue. I'm not sure, but I think that the Database is caching itself and not clearing before the next sequence is queried. In that regard, the next job calls upon the database again, and so on. The memory builds up until it finally crashes the node.
Has anyone dealt with this issue before? Thanks, Jordan From dilara.ally at gmail.com Sat Oct 15 21:55:21 2011 From: dilara.ally at gmail.com (Dilara Ally) Date: Sat, 15 Oct 2011 14:55:21 -0700 Subject: [Biopython] Blast DB keeps crashing nodes In-Reply-To: <66965B9E-2AD6-4E02-BB8E-2F11A820DCDF@Vanderbilt.Edu> References: <66965B9E-2AD6-4E02-BB8E-2F11A820DCDF@Vanderbilt.Edu> Message-ID: <4E9A0149.1000504@gmail.com> How many hits per sequence have you requested to get back - the default on the blastall is 250? I did blast search on ~600,000 contigs but I set up simultaneous jobs across 34 nodes. I used only the top 20 hits. Each file had 1000 fasta formatted sequences and each node was given ~12 files. But I still had to do it in two parts to get all sequences blasted. I waited until the first set finished to set up the second blast job. The job finished in 2 days. Before I ran it on the cluster I tested a single file to see how long and how much memory it took. The cluster I used had 34 computing nodes, with 16-48 cores and 16-64GB of memory. Hope that helps. On 10/15/11 1:59 PM, Willis, Jordan R wrote: > Hello Biopython, > > I was wondering if anyone has worked extensively with the Blast Database locally. > > I am blasting millions of sequences using Biopython as my backend framework. I am using a high throughput computer cluster to blast each sequence. Rather than submit two million jobs, I have divided the fast files up into 50 or so. > > The problem I am facing is a memory issue. I'm not sure, but I think that the Database is cacheing itself and not clearing before the next sequence is queried. In that regard, the next job calls upon the database again, and so on?. > > The memory builds up until it finally crashes the node. Has anyone dealt with this issue before? > > Thanks, > Jordan > > > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From mictadlo at gmail.com Mon Oct 17 12:11:12 2011 From: mictadlo at gmail.com (Mic) Date: Mon, 17 Oct 2011 22:11:12 +1000 Subject: [Biopython] SAM to BAM Message-ID: Hello, Is there a way to convert SAM file to sorted BAM file and generate also BAI file with pysam? Thank you in advance. From p.j.a.cock at googlemail.com Mon Oct 17 13:06:58 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 17 Oct 2011 14:06:58 +0100 Subject: [Biopython] [Samtools-help] SAM to BAM In-Reply-To: References: Message-ID: On Mon, Oct 17, 2011 at 1:11 PM, Mic wrote: > Hello, > Is there a way to convert SAM file to sorted BAM file and generate also BAI > file with pysam? > Thank you in advance. With samtools at the command line, samtools view -b -S example.sam | samtools sort - example samtools index example.bam I know you can easy call samtools from pysam, not sure if you can do the pipe trick to avoid extra steps: samtools view -b -S example.sam > example_unsorted samtools sort example_unsorted.bam example rm example_unsorted.bam samtools index example.bam Peter From jgrant at smith.edu Mon Oct 17 13:47:38 2011 From: jgrant at smith.edu (Jessica Grant) Date: Mon, 17 Oct 2011 09:47:38 -0400 Subject: [Biopython] pdb file question Message-ID: <541079A3-3C7D-45FF-8717-B1C64C85735F@smith.edu> Hello, I am trying to write a script that reproduces the crystal structure of a protein based on the information in the pdb file. I have gotten kind of stuck using the SMTRY lines in remark 290. 
It doesn't seem to contain all the information I need, at least the results I am getting don't look the same as when I produce symmetry mates in pymol, for example. Has anyone any experience with this? Thanks, Jessica From anaryin at gmail.com Mon Oct 17 14:08:54 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Mon, 17 Oct 2011 16:08:54 +0200 Subject: [Biopython] pdb file question In-Reply-To: <541079A3-3C7D-45FF-8717-B1C64C85735F@smith.edu> References: <541079A3-3C7D-45FF-8717-B1C64C85735F@smith.edu> Message-ID: Hello Jessica, Are you extracting the symmetry information with Biopython? If so, how are you using it to generate the other symmetry "members"? Using atom.transform? Cheers, Jo?o [...] Rodrigues http://nmr.chem.uu.nl/~joao 2011/10/17 Jessica Grant > Hello, > > I am trying to write a script that reproduces the crystal structure of a > protein based on the information in the pdb file. I have gotten kind of > stuck using the SMTRY lines in remark 290. It doesn't seem to contain all > the information I need, at least the results I am getting don't look the > same as when I produce symmetry mates in pymol, for example. Has anyone any > experience with this? Thanks, > > Jessica > > > > ______________________________**_________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/**mailman/listinfo/biopython > From hahj87 at gmail.com Mon Oct 17 15:03:10 2011 From: hahj87 at gmail.com (=?ISO-8859-1?Q?Joshua_Ismael_Haase_Hern=E1ndez?=) Date: Mon, 17 Oct 2011 10:03:10 -0500 Subject: [Biopython] is IRC channel at freenode active? Message-ID: Hi there, I was arround in the IRC channel and the only one there is Chanserv. I was wondering if the channel has some use. From mictadlo at gmail.com Tue Oct 18 03:44:14 2011 From: mictadlo at gmail.com (Mic) Date: Tue, 18 Oct 2011 13:44:14 +1000 Subject: [Biopython] Segmentation fault Message-ID: Hello, I have tried to generate a subset BAM, but I get a 'Segmentation fault' with the following code: from Bio import SeqIO import pysam from optparse import OptionParser import subprocess, os, sys from multiprocessing import Pool import functools import argparse def GetReferenceInfo(referenceFastaPath): referencenames = [] referencelengths = [] referenceFastaFile = open(referenceFastaPath) for record in SeqIO.parse(referenceFastaFile, "fasta"): referencenames.append(record.name) referencelengths.append(len(record.seq)) referenceFastaFile.close() return (referencenames, referencelengths) def GenerateSubsetBAM(bam_filename, ref_name): reads = [] bam_fh = pysam.Samfile(bam_filename, "rb") for read in bam_fh.fetch(ref_name): reads.append(read) print ref_name + ' Done ' + str(len(reads)) return (ref_name, reads) def writeBAM(reads, ref_names, ref_lengths, output_BAM): #print ref_names #print ref_lengths #print output_BAM #with pysam.Samfile(output_BAM, "wb", referencenames = ref_names, referencelengths = ref_lengths) as bh: bh = pysam.Samfile(output_BAM, "wb", referencenames = ref_names, referencelengths = ref_lengths) print reads.keys() for ref_name in ref_names: print ref_name for read in reads[ref_name]: print read #bh.write(read) print ref_name + 'Done' if __name__ == '__main__': parser = OptionParser("Usage: %prog -b BAMfile -f new_reference_fasta -o outputBAM") parser.add_option("-b", "--BAM", type="string", dest="inputBAMFilepath", help="Specify a BAM file") parser.add_option("-f", "--fasta", type="string", dest="fastaFilepath", help="Specify a reference fasta file.") 
parser.add_option("-o", "--output", type="string", dest="outputBAMFilepath", help="Specify an output BAM file.") (opts, args) = parser.parse_args() if (opts.inputBAMFilepath is None): print ("\nSpecify a BAM file. eg. -b large.bam\n") parser.print_help() elif not(os.path.exists(opts.inputBAMFilepath)): print ("\nReference BAM file does not exists: " + opts.inputBAMFilepath +"\n") elif (opts.fastaFilepath is None): print ("\nSpecify a reference fasta file. eg. -f Subset.fasta\n") parser.print_help() elif not(os.path.exists(opts.fastaFilepath)): print ("\nReference fasta file does not exists: " + opts.fastaFilepath +"\n") elif os.path.exists(opts.outputBAMFilepath) and not(raw_input(opts.outputBAMFilepath + " exists. Overwrite? (Y/n): ")=='Y'): print ("\nOutput BAM exists. Please specify alternative output file. eg. -o Subset.bam\n") else: print "Read fasta ..." (ref_names, ref_lengths) = GetReferenceInfo(opts.fastaFilepath) print 'Done!' print "creating subset...." pool = Pool() GenerateSubsetBAM_wrapper = functools.partial(GenerateSubsetBAM, opts.inputBAMFilepath) reads = dict(pool.imap_unordered(GenerateSubsetBAM_wrapper, ref_names)) pool.close() print "Done!" print "Writting results to subset BAM file..." writeBAM(reads, ref_names, ref_lengths, opts.outputBAMFilepath) print "Done!" I run the code in the following way: python SubsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bamRead fasta ... Done! creating subset.... chr1 Done 1464 chr2 Done 1806 Done! Writting results to subset BAM file... ['chr2', 'chr1'] chr1 Segmentation fault Thank you in advance. From p.j.a.cock at googlemail.com Tue Oct 18 09:00:47 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 18 Oct 2011 10:00:47 +0100 Subject: [Biopython] [Samtools-help] Segmentation fault In-Reply-To: References: Message-ID: On Tue, Oct 18, 2011 at 4:44 AM, Mic wrote: > Hello, > I have tried to generate a subset BAM, but I get a 'Segmentation fault' with > the following code: > from Bio import SeqIO > import pysam > from optparse import OptionParser > import subprocess, os, sys > from multiprocessing import Pool > import functools > ... I tried this and it seemed to get stuck much earlier. Could you cut down the example a bit by removing the multiprocessing? Peter P.S. Also you can remove the unused "import argparse" line. From mictadlo at gmail.com Tue Oct 18 10:26:06 2011 From: mictadlo at gmail.com (Mic) Date: Tue, 18 Oct 2011 20:26:06 +1000 Subject: [Biopython] [Samtools-help] Segmentation fault In-Reply-To: References: Message-ID: Hello, Thank you for your email. 
I updated the code and find out that print reads['chr1'] #works fine but print reads['chr1'][0] #caused Segmentation fault Please find below the updated code: from Bio import SeqIO import pysam from optparse import OptionParser import subprocess, os, sys from multiprocessing import Pool import functools def GetReferenceInfo(referenceFastaPath): referencenames = [] referencelengths = [] referenceFastaFile = open(referenceFastaPath) for record in SeqIO.parse(referenceFastaFile, "fasta"): referencenames.append(record.name) referencelengths.append(len(record.seq)) referenceFastaFile.close() return (referencenames, referencelengths) def GenerateSubsetBAM(bam_filename, ref_name): reads = [] bam_fh = pysam.Samfile(bam_filename, "rb") for read in bam_fh.fetch(ref_name): reads.append(read) print ref_name + ' Done ' + str(len(reads)) return (ref_name, reads) if __name__ == '__main__': parser = OptionParser("Usage: %prog -b BAMfile -f new_reference_fasta -o outputBAM") parser.add_option("-b", "--BAM", type="string", dest="inputBAMFilepath", help="Specify a BAM file") parser.add_option("-f", "--fasta", type="string", dest="fastaFilepath", help="Specify a reference fasta file.") parser.add_option("-o", "--output", type="string", dest="outputBAMFilepath", help="Specify an output BAM file.") (opts, args) = parser.parse_args() if (opts.inputBAMFilepath is None): print ("\nSpecify a BAM file. eg. -b large.bam\n") parser.print_help() elif not(os.path.exists(opts.inputBAMFilepath)): print ("\nReference BAM file does not exists: " + opts.inputBAMFilepath +"\n") elif (opts.fastaFilepath is None): print ("\nSpecify a reference fasta file. eg. -f Subset.fasta\n") parser.print_help() elif not(os.path.exists(opts.fastaFilepath)): print ("\nReference fasta file does not exists: " + opts.fastaFilepath +"\n") elif os.path.exists(opts.outputBAMFilepath) and not(raw_input(opts.outputBAMFilepath + " exists. Overwrite? (Y/n): ")=='Y'): print ("\nOutput BAM exists. Please specify alternative output file. eg. -o Subset.bam\n") else: print "Read fasta ..." (ref_names, ref_lengths) = GetReferenceInfo(opts.fastaFilepath) print 'Done!' print "creating subset...." pool = Pool() GenerateSubsetBAM_wrapper = functools.partial(GenerateSubsetBAM, opts.inputBAMFilepath) reads = dict(pool.imap_unordered(GenerateSubsetBAM_wrapper, ref_names)) pool.close() print "Done!" print reads['chr1'] #works fine print "xxxxx" print reads['chr1'][0] #caused Segmentation fault I run the code with the pysam-0.5 examples (pysam-0.5/tests) in the following way: python SubsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam Read fasta ... Done! creating subset.... chr1 Done 1464 chr2 Done 1806 Done! [, ..., ] xxxxx Segmentation fault Thank you in advance. On Tue, Oct 18, 2011 at 7:00 PM, Peter Cock wrote: > On Tue, Oct 18, 2011 at 4:44 AM, Mic wrote: > > Hello, > > I have tried to generate a subset BAM, but I get a 'Segmentation fault' > with > > the following code: > > from Bio import SeqIO > > import pysam > > from optparse import OptionParser > > import subprocess, os, sys > > from multiprocessing import Pool > > import functools > > ... > > I tried this and it seemed to get stuck much earlier. Could you > cut down the example a bit by removing the multiprocessing? > > Peter > > P.S. Also you can remove the unused "import argparse" line. 
> From mmokrejs at fold.natur.cuni.cz Tue Oct 18 11:44:54 2011 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Tue, 18 Oct 2011 13:44:54 +0200 Subject: [Biopython] [Samtools-help] Segmentation fault In-Reply-To: References: Message-ID: <4E9D66B6.70904@fold.natur.cuni.cz> Before running your python code, do (under bash): $ ulimit -c unlimited $ python mypython.py $ file core $ gdb /usr/bin/python ./core gdb> where gdb> bt full gdb> quit $ Martin Mic wrote: > Hello, > Thank you for your email. I updated the code and find out that > print reads['chr1'] #works fine > but > print reads['chr1'][0] #caused Segmentation fault > > Please find below the updated code: > > from Bio import SeqIO > import pysam > from optparse import OptionParser > import subprocess, os, sys > from multiprocessing import Pool > import functools > > > def GetReferenceInfo(referenceFastaPath): > referencenames = [] > referencelengths = [] > referenceFastaFile = open(referenceFastaPath) > for record in SeqIO.parse(referenceFastaFile, "fasta"): > referencenames.append(record.name) > referencelengths.append(len(record.seq)) > referenceFastaFile.close() > return (referencenames, referencelengths) > > > def GenerateSubsetBAM(bam_filename, ref_name): > reads = [] > bam_fh = pysam.Samfile(bam_filename, "rb") > > for read in bam_fh.fetch(ref_name): > reads.append(read) > > print ref_name + ' Done ' + str(len(reads)) > return (ref_name, reads) > > > if __name__ == '__main__': > parser = OptionParser("Usage: %prog -b BAMfile -f new_reference_fasta -o > outputBAM") > parser.add_option("-b", "--BAM", type="string", dest="inputBAMFilepath", > help="Specify a BAM file") > parser.add_option("-f", "--fasta", type="string", dest="fastaFilepath", > help="Specify a reference fasta file.") > parser.add_option("-o", "--output", type="string", > dest="outputBAMFilepath", help="Specify an output BAM file.") > > (opts, args) = parser.parse_args() > > if (opts.inputBAMFilepath is None): > print ("\nSpecify a BAM file. eg. -b large.bam\n") > parser.print_help() > elif not(os.path.exists(opts.inputBAMFilepath)): > print ("\nReference BAM file does not exists: " + opts.inputBAMFilepath > +"\n") > elif (opts.fastaFilepath is None): > print ("\nSpecify a reference fasta file. eg. -f Subset.fasta\n") > parser.print_help() > elif not(os.path.exists(opts.fastaFilepath)): > print ("\nReference fasta file does not exists: " + opts.fastaFilepath > +"\n") > elif os.path.exists(opts.outputBAMFilepath) and > not(raw_input(opts.outputBAMFilepath + " exists. Overwrite? (Y/n): ")=='Y'): > print ("\nOutput BAM exists. Please specify alternative output file. > eg. -o Subset.bam\n") > else: > print "Read fasta ..." > (ref_names, ref_lengths) = GetReferenceInfo(opts.fastaFilepath) > print 'Done!' > > print "creating subset...." > pool = Pool() > GenerateSubsetBAM_wrapper = functools.partial(GenerateSubsetBAM, > opts.inputBAMFilepath) > reads = dict(pool.imap_unordered(GenerateSubsetBAM_wrapper, ref_names)) > pool.close() > print "Done!" > > print reads['chr1'] #works fine > print "xxxxx" > > print reads['chr1'][0] #caused Segmentation fault > > I run the code with the pysam-0.5 examples (pysam-0.5/tests) in the > following way: > > python SubsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam > > Read fasta ... > Done! > creating subset.... > chr1 Done 1464 > chr2 Done 1806 > Done! > [, ..., > ] > xxxxx > Segmentation fault > > Thank you in advance. 
> > > On Tue, Oct 18, 2011 at 7:00 PM, Peter Cock wrote: > >> On Tue, Oct 18, 2011 at 4:44 AM, Mic wrote: >>> Hello, >>> I have tried to generate a subset BAM, but I get a 'Segmentation fault' >> with >>> the following code: >>> from Bio import SeqIO >>> import pysam >>> from optparse import OptionParser >>> import subprocess, os, sys >>> from multiprocessing import Pool >>> import functools >>> ... >> >> I tried this and it seemed to get stuck much earlier. Could you >> cut down the example a bit by removing the multiprocessing? >> >> Peter >> >> P.S. Also you can remove the unused "import argparse" line. >> > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > From mictadlo at gmail.com Tue Oct 18 12:05:01 2011 From: mictadlo at gmail.com (Mic) Date: Tue, 18 Oct 2011 22:05:01 +1000 Subject: [Biopython] [Samtools-help] Segmentation fault In-Reply-To: <4E9D66B6.70904@fold.natur.cuni.cz> References: <4E9D66B6.70904@fold.natur.cuni.cz> Message-ID: Thank you for your tip, but I got an error: $ulimit -c unlimited $SubsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam Read fasta ... Done! creating subset.... chr1 Done 1464 EAS56_57:6:190:289:82 69 0 99 0 None 0 99 35 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<; [('MF', 192)] chr2 Done 1806 B7_591:8:4:841:340 73 1 0 99 [(0, 36)] -1 -1 36 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA <<<<<<<<;<<<<<<<<;<<<<<;<;:<<<<<<<;; [('MF', 18), ('Aq', 77), ('NM', 0), ('UQ', 0), ('H0', 1), ('H1', 0)] Done! xxxxx Segmentation fault (core dumped) $file core core: ERROR: cannot open `core' (No such file or directory) I also inserted "print reads[0]" in the method GenerateSubsetBAM: def GenerateSubsetBAM(bam_filename, ref_name): reads = [] bam_fh = pysam.Samfile(bam_filename, "rb") for read in bam_fh.fetch(ref_name): reads.append(read) print ref_name + ' Done ' + str(len(reads)) print reads[0] # works fine! return (ref_name, reads) and as output I got: python SubsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam Read fasta ... Done! creating subset.... chr1 Done 1464 EAS56_57:6:190:289:82 69 0 99 0 None 0 99 35 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<; [('MF', 192)] chr2 Done 1806 B7_591:8:4:841:340 73 1 0 99 [(0, 36)] -1 -1 36 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA <<<<<<<<;<<<<<<<<;<<<<<;<;:<<<<<<<;; [('MF', 18), ('Aq', 77), ('NM', 0), ('UQ', 0), ('H0', 1), ('H1', 0)] Done! xxxxx Segmentation fault Why does reads['chr1'][0] caused the Segmentation fault? Thank you in advance. On Tue, Oct 18, 2011 at 9:44 PM, Martin Mokrejs wrote: > Before running your python code, do (under bash): > $ ulimit -c unlimited > $ python mypython.py > $ file core > $ gdb /usr/bin/python ./core > gdb> where > gdb> bt full > gdb> quit > $ > > Martin > > Mic wrote: > > Hello, > > Thank you for your email. 
I updated the code and find out that > > print reads['chr1'] #works fine > > but > > print reads['chr1'][0] #caused Segmentation fault > > > > Please find below the updated code: > > > > from Bio import SeqIO > > import pysam > > from optparse import OptionParser > > import subprocess, os, sys > > from multiprocessing import Pool > > import functools > > > > > > def GetReferenceInfo(referenceFastaPath): > > referencenames = [] > > referencelengths = [] > > referenceFastaFile = open(referenceFastaPath) > > for record in SeqIO.parse(referenceFastaFile, "fasta"): > > referencenames.append(record.name) > > referencelengths.append(len(record.seq)) > > referenceFastaFile.close() > > return (referencenames, referencelengths) > > > > > > def GenerateSubsetBAM(bam_filename, ref_name): > > reads = [] > > bam_fh = pysam.Samfile(bam_filename, "rb") > > > > for read in bam_fh.fetch(ref_name): > > reads.append(read) > > > > print ref_name + ' Done ' + str(len(reads)) > > return (ref_name, reads) > > > > > > if __name__ == '__main__': > > parser = OptionParser("Usage: %prog -b BAMfile -f new_reference_fasta > -o > > outputBAM") > > parser.add_option("-b", "--BAM", type="string", > dest="inputBAMFilepath", > > help="Specify a BAM file") > > parser.add_option("-f", "--fasta", type="string", dest="fastaFilepath", > > help="Specify a reference fasta file.") > > parser.add_option("-o", "--output", type="string", > > dest="outputBAMFilepath", help="Specify an output BAM file.") > > > > (opts, args) = parser.parse_args() > > > > if (opts.inputBAMFilepath is None): > > print ("\nSpecify a BAM file. eg. -b large.bam\n") > > parser.print_help() > > elif not(os.path.exists(opts.inputBAMFilepath)): > > print ("\nReference BAM file does not exists: " + > opts.inputBAMFilepath > > +"\n") > > elif (opts.fastaFilepath is None): > > print ("\nSpecify a reference fasta file. eg. -f Subset.fasta\n") > > parser.print_help() > > elif not(os.path.exists(opts.fastaFilepath)): > > print ("\nReference fasta file does not exists: " + > opts.fastaFilepath > > +"\n") > > elif os.path.exists(opts.outputBAMFilepath) and > > not(raw_input(opts.outputBAMFilepath + " exists. Overwrite? (Y/n): > ")=='Y'): > > print ("\nOutput BAM exists. Please specify alternative output file. > > eg. -o Subset.bam\n") > > else: > > print "Read fasta ..." > > (ref_names, ref_lengths) = GetReferenceInfo(opts.fastaFilepath) > > print 'Done!' > > > > print "creating subset...." > > pool = Pool() > > GenerateSubsetBAM_wrapper = functools.partial(GenerateSubsetBAM, > > opts.inputBAMFilepath) > > reads = dict(pool.imap_unordered(GenerateSubsetBAM_wrapper, > ref_names)) > > pool.close() > > print "Done!" > > > > print reads['chr1'] #works fine > > print "xxxxx" > > > > print reads['chr1'][0] #caused Segmentation fault > > > > I run the code with the pysam-0.5 examples (pysam-0.5/tests) in the > > following way: > > > > python SubsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam > > > > Read fasta ... > > Done! > > creating subset.... > > chr1 Done 1464 > > chr2 Done 1806 > > Done! > > [, ..., > > ] > > xxxxx > > Segmentation fault > > > > Thank you in advance. 
> > > > > > On Tue, Oct 18, 2011 at 7:00 PM, Peter Cock >wrote: > > > >> On Tue, Oct 18, 2011 at 4:44 AM, Mic wrote: > >>> Hello, > >>> I have tried to generate a subset BAM, but I get a 'Segmentation fault' > >> with > >>> the following code: > >>> from Bio import SeqIO > >>> import pysam > >>> from optparse import OptionParser > >>> import subprocess, os, sys > >>> from multiprocessing import Pool > >>> import functools > >>> ... > >> > >> I tried this and it seemed to get stuck much earlier. Could you > >> cut down the example a bit by removing the multiprocessing? > >> > >> Peter > >> > >> P.S. Also you can remove the unused "import argparse" line. > >> > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > From p.j.a.cock at googlemail.com Tue Oct 18 12:58:47 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 18 Oct 2011 13:58:47 +0100 Subject: [Biopython] [Samtools-help] Segmentation fault In-Reply-To: References: Message-ID: On Tue, Oct 18, 2011 at 11:26 AM, Mic wrote: > Hello, > Thank you for your email. I updated the code and find out that > ? ? print reads['chr1'] ? ? #works fine > but > ? ? print reads['chr1'][0] ?#caused Segmentation fault > Please find below the updated code: > ... Your pool version doesn't run on my machine, something unhappy in multiprocessing gives: TypeError: type 'partial' takes at least one argument Here's a version using a single thread, which works fine for me. What does it do on your machines? Either way this should help in determining the segmentation fault. from Bio import SeqIO import pysam import subprocess, os, sys def GetReferenceInfo(referenceFastaPath): referencenames = [] referencelengths = [] referenceFastaFile = open(referenceFastaPath) for record in SeqIO.parse(referenceFastaFile, "fasta"): referencenames.append(record.name) referencelengths.append(len(record.seq)) referenceFastaFile.close() return (referencenames, referencelengths) def GenerateSubsetBAM(bam_filename, ref_name): reads = [] bam_fh = pysam.Samfile(bam_filename, "rb") for read in bam_fh.fetch(ref_name): reads.append(read) print ref_name + ' Done ' + str(len(reads)) return (ref_name, reads) bam_filename = "ex1.bam" fasta_filename = "ex1.fa" print "Read fasta ..." (ref_names, ref_lengths) = GetReferenceInfo(fasta_filename) print 'Done!' print "creating subset...." reads = dict() for ref in ref_names: reads[ref] = GenerateSubsetBAM(bam_filename, ref) print "Done!" print reads['chr1'] #works fine print "xxxxx" print reads['chr1'][0] #also fine -- Peter From nathaniel.echols at gmail.com Tue Oct 18 18:08:03 2011 From: nathaniel.echols at gmail.com (Nat Echols) Date: Tue, 18 Oct 2011 11:08:03 -0700 Subject: [Biopython] newbie question: sequence parsing Message-ID: Greetings-- We have started using BioPython in our (non-bioinformatics) application and are investigating the possibility of replacing our existing (custom-made) sequence parsers. Two quick questions: 1) Is there a sequence parser that works with just a simple string, without any header or additional metadata? If not, how could we write one that results in the same basic object as those in Bio.SeqIO? (The parsing is of course easy, I just want to have the API be consistent regardless of format.) 2) Is there a single function that will take a file (and/or string) of unknown format and try the different parsers until it finds one that works? 
We currently use several different formats (raw string, FASTA, PIR, and possibly others), and we try not to rely on the file extension alone to determine the type. We already have something that does this using our parsers, which could be refactored to use Bio.SeqIO instead, but if BioPython has something similar I'd rather use that. thanks, Nat From p.j.a.cock at googlemail.com Tue Oct 18 19:04:14 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 18 Oct 2011 20:04:14 +0100 Subject: [Biopython] newbie question: sequence parsing In-Reply-To: References: Message-ID: On Tue, Oct 18, 2011 at 7:08 PM, Nat Echols wrote: > Greetings-- > > We have started using BioPython in our (non-bioinformatics) application and > are investigating the possibility of replacing our existing (custom-made) > sequence parsers. ?Two quick questions: > > 1) Is there a sequence parser that works with just a simple string, without > any header or additional metadata? ?If not, how could we write one that > results in the same basic object as those in Bio.SeqIO? ?(The parsing is of > course easy, I just want to have the API be consistent regardless of > format.) Sounds like the "raw" format in EMBOSS, although there are two interpretations: one sequence per line, or one sequence for the whole file. Have a look at the FASTA parser in Bio/SeqIO/FastaIO.py as the most simple case. Essentially you create a SeqRecord object (which is covered in the Tutorial). > 2) Is there a single function that will take a file (and/or string) of > unknown format and try the different parsers until it finds one that works? > ?We currently use several different formats (raw string, FASTA, PIR, and > possibly others), and we try not to rely on the file extension alone to > determine the type. ?We already have something that does this using our > parsers, which could be refactored to use Bio.SeqIO instead, but if > BioPython has something similar I'd rather use that. No, we don't have such a function. There are many difficulties with format guessing - both from the file contents and even the filename. I usually cite the Zen of Python, Explicit is Better Than Implicit. Peter From cjfields at illinois.edu Tue Oct 18 19:11:56 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Tue, 18 Oct 2011 19:11:56 +0000 Subject: [Biopython] newbie question: sequence parsing In-Reply-To: References: Message-ID: On Oct 18, 2011, at 2:04 PM, Peter Cock wrote: > On Tue, Oct 18, 2011 at 7:08 PM, Nat Echols wrote: >> ... >> 2) Is there a single function that will take a file (and/or string) of >> unknown format and try the different parsers until it finds one that works? >> We currently use several different formats (raw string, FASTA, PIR, and >> possibly others), and we try not to rely on the file extension alone to >> determine the type. We already have something that does this using our >> parsers, which could be refactored to use Bio.SeqIO instead, but if >> BioPython has something similar I'd rather use that. > > No, we don't have such a function. There are many difficulties > with format guessing - both from the file contents and even the > filename. I usually cite the Zen of Python, Explicit is Better Than > Implicit. > > Peter Some implicitness is fine, but speaking from experience (BioPerl's GuessSeqFormat) trying to guess the format from the dozens that litter the bioinformatics landscape is a nest of hornets no one wants to maintain. 
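For a flavour of why, even a naive sniffer covering just the formats Nat mentioned turns into a pile of arbitrary judgement calls - here is a throwaway sketch in plain Python (this is not anything in Biopython, and the handful of header prefixes it checks are only the common cases):

def guess_format(text):
    # deliberately naive - real files violate all of these assumptions
    text = text.lstrip()
    if text[0:4] in (">P1;", ">F1;", ">DL;", ">DC;"):
        return "pir"    # PIR/NBRF style header, e.g. >P1;CRAB_ANAPL
    elif text.startswith(">"):
        return "fasta"
    else:
        return "raw"    # fall back to treating it as a bare sequence

and that is before you start worrying about leading blank lines, alignment files, quality files, or deflines that happen to look like another format.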
chris From p.j.a.cock at googlemail.com Tue Oct 18 19:31:06 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 18 Oct 2011 20:31:06 +0100 Subject: [Biopython] newbie question: sequence parsing In-Reply-To: References: Message-ID: On Tue, Oct 18, 2011 at 8:11 PM, Fields, Christopher J wrote: > On Oct 18, 2011, at 2:04 PM, Peter Cock wrote: > >> On Tue, Oct 18, 2011 at 7:08 PM, Nat Echols wrote: >>> ... >>> 2) Is there a single function that will take a file (and/or string) of >>> unknown format and try the different parsers until it finds one that works? >>> ?We currently use several different formats (raw string, FASTA, PIR, and >>> possibly others), and we try not to rely on the file extension alone to >>> determine the type. ?We already have something that does this using our >>> parsers, which could be refactored to use Bio.SeqIO instead, but if >>> BioPython has something similar I'd rather use that. >> >> No, we don't have such a function. There are many difficulties >> with format guessing - both from the file contents and even the >> filename. I usually cite the Zen of Python, Explicit is Better Than >> Implicit. >> >> Peter > > Some implicitness is fine, but speaking from experience > (BioPerl's GuessSeqFormat) trying to guess the format > from the dozens that litter the bioinformatics landscape > is a nest of hornets no one wants to maintain. > > chris I think "nest of hornets" is a much more beautiful phrase than my dead pan "many difficulties". The practical reality is that while some file formats are easy (binary files with 4 byte "magic" identifiers), others are horrible, and the definitions shift over time, as new formats of variants are added. I really don't want to go there. Peter From nathaniel.echols at gmail.com Tue Oct 18 21:47:03 2011 From: nathaniel.echols at gmail.com (Nat Echols) Date: Tue, 18 Oct 2011 14:47:03 -0700 Subject: [Biopython] issues with NCBIXML Message-ID: Hi again, I'm puzzled by the behavior of the Blast XML parser. It appears to be picking up all of the alignments correctly, but the top-level Bio.Blast.Record.Blast object that it returns appears to be incompletely populated. Specifically, the attributes num_hits and num_sequences are set to None - but I have several dozen alignments. Am I missing the point of these attributes, or doing something wrong? It's not a huge issue (I can just count the alignments, I guess), but I'm a bit concerned that there's something wrong with my code. thanks, Nat From p.j.a.cock at googlemail.com Tue Oct 18 22:07:24 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 18 Oct 2011 23:07:24 +0100 Subject: [Biopython] issues with NCBIXML In-Reply-To: References: Message-ID: On Tue, Oct 18, 2011 at 10:47 PM, Nat Echols wrote: > Hi again, > > I'm puzzled by the behavior of the Blast XML parser. ?It appears to be > picking up all of the alignments correctly, but the > top-level Bio.Blast.Record.Blast object that it returns appears to be > incompletely populated. ?Specifically, the attributes num_hits and > num_sequences are set to None - but I have several dozen alignments. ?Am I > missing the point of these attributes, or doing something wrong? ?It's not a > huge issue (I can just count the alignments, I guess), but I'm a bit > concerned that there's something wrong with my code. > > thanks, > Nat The number of alignments and descriptions only really apply to the plain text (or HTML) BLAST output, but I guess we could set them to the number of hits in the XML output. 
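In the meantime counting the hits yourself is trivial, since the XML parser does populate the alignments list - something like this rough sketch (untested, and the filename is just an example):

from Bio.Blast import NCBIXML
handle = open("my_blast_output.xml")
for record in NCBIXML.parse(handle):
    # each alignment is one database hit, each holding its HSPs
    print record.query, len(record.alignments)
handle.close()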
Peter From mictadlo at gmail.com Tue Oct 18 23:12:03 2011 From: mictadlo at gmail.com (Mic) Date: Wed, 19 Oct 2011 09:12:03 +1000 Subject: [Biopython] [Samtools-help] Segmentation fault In-Reply-To: <4E9D6FAB.70308@fold.natur.cuni.cz> References: <4E9D66B6.70904@fold.natur.cuni.cz> <4E9D6FAB.70308@fold.natur.cuni.cz> Message-ID: I run it now on my Laptop (Ubuntu 11.04 x64) and now I can see the core file: $ ulimit -c unlimited $ python subsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam Segmentation fault (core dumped) $ file core core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'python subsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam' $ gdb /usr/bin/python ./core GNU gdb (Ubuntu/Linaro 7.2-1ubuntu11) 7.2 Copyright (C) 2010 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". For bug reporting instructions, please see: ... Reading symbols from /usr/bin/python...(no debugging symbols found)...done. [New Thread 2748] warning: Can't read pathname for load map: Input/output error. Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libpthread.so.0 Reading symbols from /lib/x86_64-linux-gnu/libdl.so.2...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libdl.so.2 Reading symbols from /lib/x86_64-linux-gnu/libutil.so.1...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libutil.so.1 Reading symbols from /lib/libssl.so.0.9.8...(no debugging symbols found)...done. Loaded symbols for /lib/libssl.so.0.9.8 Reading symbols from /lib/libcrypto.so.0.9.8...(no debugging symbols found)...done. Loaded symbols for /lib/libcrypto.so.0.9.8 Reading symbols from /lib/x86_64-linux-gnu/libz.so.1...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libz.so.1 Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libm.so.6 Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libc.so.6 Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/ld-linux-x86-64.so.2 Reading symbols from /usr/lib/python2.7/lib-dynload/_heapq.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/python2.7/lib-dynload/_heapq.so Reading symbols from /usr/lib/python2.7/lib-dynload/_elementtree.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/python2.7/lib-dynload/_elementtree.so Reading symbols from /lib/x86_64-linux-gnu/libexpat.so.1...(no debugging symbols found)...done. Loaded symbols for /lib/x86_64-linux-gnu/libexpat.so.1 Reading symbols from /usr/lib/python2.7/lib-dynload/pyexpat.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/python2.7/lib-dynload/pyexpat.so Reading symbols from /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/csamtools.so...done. Loaded symbols for /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/csamtools.so Reading symbols from /usr/lib/python2.7/lib-dynload/_ctypes.so...(no debugging symbols found)...done. 
Loaded symbols for /usr/lib/python2.7/lib-dynload/_ctypes.so Reading symbols from /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/ctabix.so...done. Loaded symbols for /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/ctabix.so Reading symbols from /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/TabProxies.so...done. Loaded symbols for /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/TabProxies.so Reading symbols from /usr/lib/python2.7/lib-dynload/_io.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/python2.7/lib-dynload/_io.so Reading symbols from /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/cvcf.so...done. Loaded symbols for /home/mic/apps/pymodules/lib/python2.7/site-packages/pysam-0.5-py2.7-linux-x86_64.egg/cvcf.so Reading symbols from /usr/lib/python2.7/lib-dynload/_multiprocessing.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/python2.7/lib-dynload/_multiprocessing.so Reading symbols from /usr/lib/pymodules/python2.7/Bio/Nexus/cnexus.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/pymodules/python2.7/Bio/Nexus/cnexus.so Core was generated by `python subsetBAM-P.py --BAM ex1.bam -f ex1.fa -o new.bam'. Program terminated with signal 11, Segmentation fault. #0 __pyx_pf_9csamtools_11AlignedRead_5qname___get__ (o=0x164e138, x=) at pysam/csamtools.c:18123 18123 if (__pyx_t_1) { (gdb) (gdb) where #0 __pyx_pf_9csamtools_11AlignedRead_5qname___get__ (o=0x164e138, x=) at pysam/csamtools.c:18123 #1 __pyx_getprop_9csamtools_11AlignedRead_qname (o=0x164e138, x=) at pysam/csamtools.c:30806 #2 0x0000000000479804 in ?? () #3 0x00007f187dbabc65 in __pyx_pf_9csamtools_11AlignedRead___str__ ( __pyx_v_self=0x164e138) at pysam/csamtools.c:17687 #4 0x0000000000479eac in _PyObject_Str () #5 0x0000000000479f8a in PyObject_Str () #6 0x00000000004d390c in ?? () #7 0x00000000004cd2d1 in PyFile_WriteObject () #8 0x000000000049909d in PyEval_EvalFrameEx () #9 0x000000000049d325 in PyEval_EvalCodeEx () #10 0x00000000004ecb02 in PyEval_EvalCode () #11 0x00000000004fdc74 in ?? () #12 0x000000000042c182 in PyRun_FileExFlags () #13 0x000000000042cb4a in PyRun_SimpleFileExFlags () #14 0x0000000000418c9e in Py_Main () #15 0x00007f187ed7aeff in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6 #16 0x00000000004c62b1 in _start () (gdb) (gdb) bt full #0 __pyx_pf_9csamtools_11AlignedRead_5qname___get__ (o=0x164e138, x=) at pysam/csamtools.c:18123 __pyx_v_src = 0x0 __pyx_t_2 = 0x0 __pyx_frame = 0x0 __pyx_r = 0x0 __pyx_t_1 = __Pyx_use_tracing = 0 __pyx_frame_code = 0x0 #1 __pyx_getprop_9csamtools_11AlignedRead_qname (o=0x164e138, x=) at pysam/csamtools.c:30806 No locals. #2 0x0000000000479804 in ?? () No symbol table info available. #3 0x00007f187dbabc65 in __pyx_pf_9csamtools_11AlignedRead___str__ ( __pyx_v_self=0x164e138) at pysam/csamtools.c:17687 __pyx_r = 0x0 __pyx_t_1 = 0x1e4fb90 __pyx_t_2 = 0x0 __pyx_t_3 = 0x0 __pyx_t_4 = 0x0 ---Type to continue, or q to quit--- __pyx_t_5 = 0x0 __pyx_t_6 = 0x0 __pyx_t_7 = 0x0 __pyx_t_8 = 0x0 __pyx_t_9 = 0x0 __pyx_t_10 = 0x0 __pyx_t_11 = 0x0 __pyx_t_12 = 0x0 __pyx_t_13 = 0x0 __pyx_t_14 = 0x0 __pyx_frame_code = 0x0 __pyx_frame = 0x0 __Pyx_use_tracing = 0 #4 0x0000000000479eac in _PyObject_Str () No symbol table info available. #5 0x0000000000479f8a in PyObject_Str () No symbol table info available. #6 0x00000000004d390c in ?? 
() No symbol table info available. #7 0x00000000004cd2d1 in PyFile_WriteObject () No symbol table info available. ---Type to continue, or q to quit--- #8 0x000000000049909d in PyEval_EvalFrameEx () No symbol table info available. #9 0x000000000049d325 in PyEval_EvalCodeEx () No symbol table info available. #10 0x00000000004ecb02 in PyEval_EvalCode () No symbol table info available. #11 0x00000000004fdc74 in ?? () No symbol table info available. #12 0x000000000042c182 in PyRun_FileExFlags () No symbol table info available. #13 0x000000000042cb4a in PyRun_SimpleFileExFlags () No symbol table info available. #14 0x0000000000418c9e in Py_Main () No symbol table info available. #15 0x00007f187ed7aeff in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. #16 0x00000000004c62b1 in _start () No symbol table info available. (gdb) quit $ ulimit -a core file size (blocks, -c) unlimited data seg size (kbytes, -d) unlimited scheduling priority (-e) 20 file size (blocks, -f) unlimited pending signals (-i) 16382 max locked memory (kbytes, -l) 64 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) unlimited virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Thank you in advance. From nathaniel.echols at gmail.com Wed Oct 19 18:48:11 2011 From: nathaniel.echols at gmail.com (Nat Echols) Date: Wed, 19 Oct 2011 11:48:11 -0700 Subject: [Biopython] issues with NCBIXML In-Reply-To: References: Message-ID: On Tue, Oct 18, 2011 at 3:07 PM, Peter Cock wrote: > The number of alignments and descriptions only really apply > to the plain text (or HTML) BLAST output, but I guess we > could set them to the number of hits in the XML output. This would be useful, for consistency's sake if nothing else. I'm happy to contribute a patch if that streamlines the process. -Nat From p.j.a.cock at googlemail.com Wed Oct 19 19:06:30 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 19 Oct 2011 20:06:30 +0100 Subject: [Biopython] issues with NCBIXML In-Reply-To: References: Message-ID: On Wed, Oct 19, 2011 at 7:48 PM, Nat Echols wrote: > On Tue, Oct 18, 2011 at 3:07 PM, Peter Cock > wrote: >> >> The number of alignments and descriptions only really apply >> to the plain text (or HTML) BLAST output, but I guess we >> could set them to the number of hits in the XML output. > > This would be useful, for consistency's sake if nothing else. ?I'm happy to > contribute a patch if that streamlines the process. > -Nat Sure. If you can include unit tests for it even better. You should just be able to add some assertEqual lines to the existing XML parser tests for the newly populated properties. Thanks, Peter From mictadlo at gmail.com Thu Oct 20 09:38:56 2011 From: mictadlo at gmail.com (Mic) Date: Thu, 20 Oct 2011 19:38:56 +1000 Subject: [Biopython] changing record attributes while iterating In-Reply-To: References: Message-ID: Hello, would it be possible to using a generator expression for the following code? from Bio import SeqIO fa_parser = SeqIO.parse(open("../test_files/test.fasta", "rU"), "fasta") sequence = fa_parser.next().seq for record in fa_parser: sequence += 3*'N' + record.seq print sequence Input: >1 1111111 >2 2222222 >3 3333333 >4 4444444 Output: 1111111NNN2222222NNN3333333NNN4444444 Thank you advance. 
On Fri, Oct 7, 2011 at 5:22 PM, Peter Cock wrote: > > > On Friday, October 7, 2011, Michal wrote: > > > Hello, > > Does your code with generator save the whole file in the > > memory or does it read each entry and save it immediately? > > Thank you in advance. > > Using a generator expression like that only one SeqRecord is in memory at a > time. It goes through the input FASTA one record at a time, renames it, > saves it immediately. > > Peter > > P.S. list CC'd From p.j.a.cock at googlemail.com Thu Oct 20 09:58:05 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 20 Oct 2011 10:58:05 +0100 Subject: [Biopython] changing record attributes while iterating In-Reply-To: References: Message-ID: Hi Mic, You should have started a new thread with a new title... On Thu, Oct 20, 2011 at 10:38 AM, Mic wrote: > Hello, > would it be possible to using a generator expression for the following code? > from Bio import SeqIO > fa_parser = SeqIO.parse(open("../test_files/test.fasta", "rU"), "fasta") > sequence = fa_parser.next().seq > for record in fa_parser: > sequence += 3*'N' + record.seq > > print sequence > Input: >>1 > 1111111 >>2 > 2222222 >>3 > 3333333 >>4 > 4444444 > Output: > 1111111NNN2222222NNN3333333NNN4444444 > Thank you advance. Sure, how about this: from Bio import SeqIO fa_parser = SeqIO.parse("../test_files/test.fasta", "fasta") print ('N' * 3).join(str(rec.seq) for rec in fa_parser) Peter From andreas.wilm at gmail.com Tue Oct 25 06:26:59 2011 From: andreas.wilm at gmail.com (Andreas Wilm) Date: Tue, 25 Oct 2011 14:26:59 +0800 Subject: [Biopython] VCF parser In-Reply-To: References: Message-ID: Hi Tiago, I'm not aware of a Biopython VCF parser, but pysam seems to have one (haven't used it though). Try >>> from pysam import cvcf You also might want to check an implementation which was posted on seqanswers: http://seqanswers.com/forums/archive/index.php/t-9266.html Andreas PS: For the sake of completeness: your question was asked before here (no replies). See http://www.biopython.org/pipermail/biopython/2011-March/007131.html 2011/10/4 Tiago Antão : > Hi, > > I wonder if there is a VCF parser in either Python or Java? Either I > am being dumb at searching (probably) or nothing exists? > > Thanks, > Tiago > > -- > "If you want to get laid, go to college. If you want an education, go > to the library." - Frank Zappa > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Andreas Wilm andreas.wilm at gmail.com | mail at andreas-wilm.com | 0x7C68FBCC From pawan.mani2 at gmail.com Tue Oct 25 15:50:51 2011 From: pawan.mani2 at gmail.com (kakchingtabam pawankumar sharma) Date: Tue, 25 Oct 2011 21:20:51 +0530 Subject: [Biopython] installation of pyfatsa In-Reply-To: <2DDF09AFEB46E54894A3843CEF9CB3A4B55076114C@EXCHMB.ocimumbio.com> References: <2DDF09AFEB46E54894A3843CEF9CB3A4B55076114C@EXCHMB.ocimumbio.com> Message-ID: Dear, I would like to know how to install pyfasta in linux. I have downloaded pyfasta-0.4.4.tar.gz and installed it using the command: tar -xzvf pyfasta-0.4.4.tar.gz. But I could not use the command line: pyfasta split -n 6 sample.fasta So kindly help me out to solve this problem. With Regards, Pawan
From p.j.a.cock at googlemail.com Tue Oct 25 16:13:00 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 25 Oct 2011 17:13:00 +0100 Subject: [Biopython] installation of pyfatsa In-Reply-To: References: <2DDF09AFEB46E54894A3843CEF9CB3A4B55076114C@EXCHMB.ocimumbio.com> Message-ID: On Tue, Oct 25, 2011 at 4:50 PM, kakchingtabam pawankumar sharma wrote: > Dear, > > I would like to know how to install pyfasta in linux. I have > downloaded pyfasta-0.4.4.tar.gz and installed it using the command: tar -xzvf > pyfasta-0.4.4.tar.gz. > > But I could not use the command line: > > pyfasta split -n 6 sample.fasta > > So kindly help me out to solve this problem. > > With Regards, > > Pawan > Hi Pawan, Note pyfasta is not part of Biopython, but is a separate tool by Brent Pedersen (CC'd). http://pypi.python.org/pypi/pyfasta/ https://github.com/brentp/pyfasta/ However, uncompressing the tar ball is only the first step in installing it. You probably need to run "python setup.py install" for that. Peter From bpederse at gmail.com Tue Oct 25 16:23:31 2011 From: bpederse at gmail.com (Brent Pedersen) Date: Tue, 25 Oct 2011 10:23:31 -0600 Subject: [Biopython] VCF parser In-Reply-To: References: Message-ID: On Mon, Oct 3, 2011 at 4:12 PM, Tiago Antão wrote: > Hi, > > I wonder if there is a VCF parser in either Python or Java? Either I > am being dumb at searching (probably) or nothing exists? > > Thanks, > Tiago > > -- > "If you want to get laid, go to college. If you want an education, go > to the library." - Frank Zappa > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > I have found this one: https://github.com/jdoughertyii/PyVCF to be quite good and easy to use. From anaryin at gmail.com Wed Oct 26 10:30:12 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 26 Oct 2011 12:30:12 +0200 Subject: [Biopython] Pairwise alignment - is it a generic function? Message-ID: Hello all, A friend of mine was interested in a small simple alignment script for amino acids, to which I recommended having a look at Biopython. We found the pairwise2 module but we're a bit puzzled. Does it align "any" sequence, aa or nucleotides? I don't see any scoring matrix referenced there... Related to this, can you suggest any implementation of an amino acid pairwise alignment algorithm, in Python, that is self-contained (i.e. doesn't depend on some other program)? Best, João [...]
Rodrigues http://nmr.chem.uu.nl/~joao From p.j.a.cock at googlemail.com Wed Oct 26 10:58:09 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 26 Oct 2011 11:58:09 +0100 Subject: [Biopython] Pairwise alignment - is it a generic function? In-Reply-To: References: Message-ID: On Wed, Oct 26, 2011 at 11:30 AM, João Rodrigues wrote: > Hello all, > > A friend of mine was interested in a small simple alignment script for > amino acids, to which I recommended having a look at Biopython. We found the > pairwise2 module but we're a bit puzzled. Does it align "any" sequence, aa > or nucleotides? I don't see any scoring matrix referenced there... It should work on proteins, just pass in the appropriate scoring matrix. > Related to this, can you suggest any implementation of an amino acid > pairwise alignment algorithm, in Python, that is self-contained > (i.e. doesn't depend on some other program)? Well, Bio.pairwise2 has a faster C implementation and a fall-back slower pure Python implementation (used under Jython/PyPy/etc), which might answer your needs. Peter
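P.S. In case a concrete example helps, this is roughly what passing a protein scoring matrix to pairwise2 looks like. Untested as written, and the BLOSUM62 matrix plus the gap penalties below are just illustrative choices, not a recommendation:

from Bio import pairwise2
from Bio.pairwise2 import format_alignment
from Bio.SubsMat.MatrixInfo import blosum62

# Global alignment of two short protein sequences using BLOSUM62,
# with a gap open penalty of -10 and a gap extend penalty of -0.5.
alignments = pairwise2.align.globalds("KEVLA", "EVLHA", blosum62, -10, -0.5)
print(format_alignment(*alignments[0]))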
From from.d.putto at gmail.com Wed Oct 26 15:11:05 2011 From: from.d.putto at gmail.com (Sheila the angel) Date: Wed, 26 Oct 2011 17:11:05 +0200 Subject: [Biopython] downloading genome Protein table Message-ID: Hi All, I am facing some problems downloading the genome and other information. For example, I did a query on NCBI genome for NC_008390 On clicking the results you can get the following link http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=ShowDetailView&TermToSearch=19840 On my web-browser I can save this page as File> Save as >out.html Furthermore I want to download the Protein table also http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=Retrieve&dopt=Protein+Table&list_uids=19840 I want to do this for many IDs. Is there any simple way in Biopython? Thanks in Advance -- Cheers Sheila From p.j.a.cock at googlemail.com Wed Oct 26 15:27:37 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 26 Oct 2011 16:27:37 +0100 Subject: [Biopython] downloading genome Protein table In-Reply-To: References: Message-ID: On Wed, Oct 26, 2011 at 4:11 PM, Sheila the angel wrote: > Hi All, > > I am facing some problems downloading the genome and other information. > For example, I did a query on NCBI genome for NC_008390 > On clicking the results you can get the following link > > http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=ShowDetailView&TermToSearch=19840 > On my web-browser I can save this page as File> Save as >out.html > > Furthermore I want to download the Protein table also > http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=Retrieve&dopt=Protein+Table&list_uids=19840 > > I want to do this for many IDs. Is there any simple way in Biopython? > > Thanks in Advance Hmm, some of that might be available by Bio.Entrez, not sure though. For the protein table I would personally work with the *.ptt files from the NCBI FTP site, e.g. ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid13490/CP000441.ptt or: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid58303/NC_008391.ptt The FTP links are on the page of the first URL you gave.
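Each *.ptt file is just a plain text, tab separated table - a title line, a "NNNN proteins" line, a header row, then one row per CDS - so once you have one on disk a few lines of Python will read it. A rough, untested sketch (the exact column names are from memory, so check them against a real file first):

def parse_ptt(filename):
    # Skip the title line and the protein count line, read the header row,
    # then yield one dictionary per CDS row keyed by the column names
    # (roughly: Location, Strand, Length, PID, Gene, Synonym, Code, COG, Product).
    with open(filename) as handle:
        lines = handle.read().splitlines()
    header = lines[2].split("\t")
    for line in lines[3:]:
        if line.strip():
            yield dict(zip(header, line.split("\t")))

for row in parse_ptt("NC_008390.ptt"):
    print("%s\t%s\t%s" % (row["PID"], row["Gene"], row["Product"]))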
You can download all the "bacteria" *.ptt files as a tar ball, ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.ptt.tar.gz Typically I work from the GenBank files instead (*.gbk rather than *.ptt) Peter From mictadlo at gmail.com Thu Oct 27 01:14:16 2011 From: mictadlo at gmail.com (Mic) Date: Thu, 27 Oct 2011 11:14:16 +1000 Subject: [Biopython] changing record attributes while iterating In-Reply-To: References: Message-ID: Thank you, it is working. I would like to put the sequence ids in a list in the following way: >>> c = (i.id for i in b) SyntaxError: invalid syntax >>> c[0] Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: 'generator' object is not subscriptable How is it possible to generate a list of sequence ids? Thank you in advance. On Thu, Oct 20, 2011 at 7:58 PM, Peter Cock wrote: > Hi Mic, > > You should have started a new thread with a new title... > > On Thu, Oct 20, 2011 at 10:38 AM, Mic wrote: > > Hello, > > would it be possible to use a generator expression for the following > code? > > from Bio import SeqIO > > fa_parser = SeqIO.parse(open("../test_files/test.fasta", "rU"), "fasta") > > sequence = fa_parser.next().seq > > for record in fa_parser: > > sequence += 3*'N' + record.seq > > > > print sequence > > Input: > >>1 > > 1111111 > >>2 > > 2222222 > >>3 > > 3333333 > >>4 > > 4444444 > > Output: > > 1111111NNN2222222NNN3333333NNN4444444 > > Thank you in advance. > > Sure, how about this: > > from Bio import SeqIO > fa_parser = SeqIO.parse("../test_files/test.fasta", "fasta") > print ('N' * 3).join(str(rec.seq) for rec in fa_parser) > > Peter > From p.j.a.cock at googlemail.com Thu Oct 27 08:35:24 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Oct 2011 09:35:24 +0100 Subject: [Biopython] changing record attributes while iterating In-Reply-To: References: Message-ID: On Thu, Oct 27, 2011 at 2:14 AM, Mic wrote: > Thank you, it is working. > I would like to put the sequence ids in a list in the following way: >>>> c = (i.id for i in b) > SyntaxError: invalid syntax The above would be a generator expression, and requires Python 2.4. It shouldn't cause a SyntaxError unless there is some mistake I'm not seeing (or you missed something in the copy & paste). >>>> c[0] > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > TypeError: 'generator' object is not subscriptable > How is it possible to generate a list of sequence ids? You need to create a list (e.g. using a list comprehension) rather than a generator, probably: c = [i.id for i in b] c[0] = "Fred" Peter From from.d.putto at gmail.com Thu Oct 27 10:47:04 2011 From: from.d.putto at gmail.com (Sheila the angel) Date: Thu, 27 Oct 2011 12:47:04 +0200 Subject: [Biopython] downloading genome Protein table In-Reply-To: References: Message-ID: The problem is I have only the RefSeq ID like NC_008390 and I don't have the Protein table ID (in this case CP000441.ptt) so I can't download the .ptt file (as in the ftp url ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid13490/CP000441.ptt ) Also, not all the RefSeq IDs I have belong to 'Bacteria'. So for ID NC_004314 (just an example) I have to change the ftp url to ftp://ftp.ncbi.nih.gov/genomes/Protozoa/Plasmodium_falciparum/NC_004314.ptt Downloading the *.gbk file may be an option (but later I need to convert them into a protein table), so I tried this: from Bio import Entrez Entrez.email = "from.d.putto at gmail.com" handle = Entrez.efetch(db="genome", id="NC_008390", rettype="gbk") print handle.read() The output shows me 'Nothing has been found' I am not sure in which database I should look for an id like NC_008390. Moreover, later on I need to convert the 'gbk' file to .ptt (or extract the protein information) On Wed, Oct 26, 2011 at 5:27 PM, Peter Cock wrote: > On Wed, Oct 26, 2011 at 4:11 PM, Sheila the angel > wrote: > > Hi All, > > > > I am facing some problems downloading the genome and other information. > > For example, I did a query on NCBI genome for NC_008390 > > On clicking the results you can get the following link > > > > http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=ShowDetailView&TermToSearch=19840 > > On my web-browser I can save this page as File> Save as >out.html > > > > Furthermore I want to download the Protein table also > > http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=Retrieve&dopt=Protein+Table&list_uids=19840 > > > > I want to do this for many IDs. Is there any simple way in Biopython? > > > > Thanks in Advance > > Hmm, some of that might be available by Bio.Entrez, not sure though. > > For the protein table I would personally work with the *.ptt files from > the NCBI FTP site, e.g. > > ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid13490/CP000441.ptt > > or: > > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid58303/NC_008391.ptt > > The FTP links are on the page of the first URL you gave. You can download > all the "bacteria" *.ptt files as a tar ball, > > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.ptt.tar.gz > > Typically I work from the GenBank files instead (*.gbk rather than > *.ptt) > > Peter > From p.j.a.cock at googlemail.com Thu Oct 27 13:14:10 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Oct 2011 14:14:10 +0100 Subject: [Biopython] downloading genome Protein table In-Reply-To: References: Message-ID: On Thu, Oct 27, 2011 at 11:47 AM, Sheila the angel wrote: > The problem is I have only the RefSeq ID like NC_008390 and I don't have > the Protein table ID (in this case CP000441.ptt) so I can't download the .ptt > file (as in the ftp url > ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid13490/CP000441.ptt > ) Given your identifiers, use ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ rather than ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/ - in this case, ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid58303/NC_008390.ptt > > Also, not all the RefSeq IDs I have belong to 'Bacteria'. > Then the NCBI won't have them on the Bacterial FTP sites, and I don't think they will provide *.ptt files for them. > So for ID > NC_004314 (just an example) I have to change the ftp url to > ftp://ftp.ncbi.nih.gov/genomes/Protozoa/Plasmodium_falciparum/NC_004314.ptt > > Downloading the *.gbk file may be an option (but later I need to convert > them into a protein table) Just download *all* the bacterial protein tables as the tar ball, it's only 120MB compressed: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.ptt.tar.gz Then you can just search locally for a file by name etc.
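For example, something like this (an untested sketch - the member names inside all.ptt.tar.gz are a guess, so print a few of them first to see how the archive is really laid out) would pull out a single table by its RefSeq accession without unpacking everything:

import tarfile

# Untested sketch: find one *.ptt protein table inside the NCBI tar ball.
accession = "NC_008390"
tar = tarfile.open("all.ptt.tar.gz", "r:gz")
members = [name for name in tar.getnames() if name.endswith(accession + ".ptt")]
for name in members:
    handle = tar.extractfile(name)
    print(name)
    print(handle.read()[:300])  # peek at the start of the table as a sanity check
tar.close()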
> so I tried this: > from Bio import Entrez > Entrez.email = "from.d.putto at gmail.com" > handle = Entrez.efetch(db="genome", id="NC_008390", rettype="gbk") > print handle.read() > The output shows me 'Nothing has been found' > I am not sure in which database I should look for an id like NC_008390. Try it on the NCBI website for all databases, http://www.ncbi.nlm.nih.gov/sites/gquery?term=NC_008390 You'll see it does match the genome database, but also the nucleotide database. In this case you want the sequence as a GenBank file so use the nucleotide database. > Moreover, later on I need to convert the 'gbk' file to .ptt (or extract the protein > information) The Biopython GenBank parser can do that - life is easier with bacterial genomes as there are (almost) no nasty join(...) locations to deal with. Peter From devaniranjan at gmail.com Thu Oct 27 19:16:07 2011 From: devaniranjan at gmail.com (George Devaniranjan) Date: Thu, 27 Oct 2011 15:16:07 -0400 Subject: [Biopython] weighted sampling of a dictionary Message-ID: Hi, I am not sure if this question is more suitable for biopython or a python forum. I have the following dictionary. dict ={'YLE': 6, 'QYL': 36, 'PTD': 32, 'AGG': 145, 'QYG': 34, 'QYD': 34, 'AGD': 188, 'QYS': 35, 'AGS': 177, 'AGA': 154, 'QYA': 23, 'AGL': 16, 'LAU': 1, 'PTA': 7, 'AGY': 7, 'QYY': 19, 'QYE': 6, 'PAT': 57, 'QYT': 28, 'AGT': 10, 'QYQ': 34, 'AGQ': 140, 'QYP': 32, 'AGP': 167, 'TAT': 31, 'SGS': 174, 'TAP': 18, 'YLP': 49, 'TAQ': 23, 'UQE': 5, 'UAQ': 9, 'UAT': 8, 'UAE': 7, 'TAD': 1, 'TAG': 15, 'TAA': 20, 'TAS': 1, 'YUP': 1, 'TAL': 45, 'ALU': 20, 'PEP': 14, 'UAG': 6, 'EAL': 16, 'SYY': 36, 'EAS': 35, 'SYT': 29, 'EAA': 16, 'SYQ': 13, 'EAG': 28} The keys are the different amino acid triplets (all possible triplets extracted from a culled list of PDB), the numbers next to them are the frequencies with which they occur. I was wondering if there is a way in biopython/python to sample them at the frequency indicated by the number next to the key.
> > I have only given a snippet of the triplet dictionary, the entire dictionary > has about 1400 key entries. > > I would appreciate any help in this matter --thank you very much. > > George > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > you could try one of these (presumably the class-based one is the king) http://eli.thegreenplace.net/2010/01/22/weighted-random-generation-in-python/ you'll have something like: import operator aminos, weights = zip(*sorted(adict.items(), key=operator.itemgetter(1))) amino_gen = WeightedRandomGenerator(weights) for i in xrange(nsims): idx = amino_gen.next() rand_aa = aminos[idx] From jmtc21 at bath.ac.uk Thu Oct 27 20:33:18 2011 From: jmtc21 at bath.ac.uk (Jaime Tovar) Date: Thu, 27 Oct 2011 21:33:18 +0100 Subject: [Biopython] expat and biopython 1.58 problem on linux x64 Message-ID: <4EA9C00E.5080509@bath.ac.uk> Hello all, I'm having trouble while updating my biopython to 1.58. I'm having exactly the same problem with the xml parser as described in this old post: http://www.biopython.org/pipermail/biopython/2011-May/007263.html Sadly I may have to use the Entrez module, so it would make me happy to have the thing running if possible. I'm installing on an openSUSE 11.3 x64 box. I did an rpm install of biopython from the openSUSE science repo. So I have 1.58-1.2 installed. Python 1.6.5-3.5.1 for x64 expat 2.0.1-98.1 x64 Tried to install both by hand from the tar.gz and using an rpm but the problem persists. Any help will be greatly appreciated. Thanks!!! Jaime. From winda002 at student.otago.ac.nz Thu Oct 27 20:52:00 2011 From: winda002 at student.otago.ac.nz (David Winter) Date: Fri, 28 Oct 2011 09:52:00 +1300 Subject: [Biopython] weighted sampling of a dictionary In-Reply-To: References: Message-ID: <20111028095200.20435ub1z2jexy0g@www.studentmail.otago.ac.nz> Hi George, I was actually doing this yesterday :) The function I came up with takes two lists: import random def weighted_sample(population, weights): """ Sample from a population, given provided weights """ if len(population) != len(weights): raise ValueError('Lengths of population and weights do not match') normal_weights = [ float(w)/sum(weights) for w in weights ] val = random.random() running_total = 0 for index, weight in enumerate(normal_weights): running_total += weight if val < running_total: return population[index] Which seems to do the trick: population = ['AAU' ,'AAC', 'AAG'] weights = [2,5,3] sample = [weighted_sample(population, weights) for _ in range(1000)] sample.count('AAC') #should be about 500 If that's too slow, check out numpy's random.multinomial() function. I haven't tested this, but this should get you the number of times you get each codon from 1000 "draws": import numpy as np codons, weights = zip(*codon_dict.items()) denom = sum(weights) normalised_weights = [float(w)/denom for w in weights] counts = np.random.multinomial(1000, normalised_weights) Cheers, David Quoting George Devaniranjan : > Hi, > > I am not sure if this question is more suitable for biopython or a python > forum. > > > I have the following dictionary.
> > dict ={'YLE': 6, 'QYL': 36, 'PTD': 32, 'AGG': 145, 'QYG': 34, 'QYD': 34, > 'AGD': 188, 'QYS': 35, 'AGS': 177, 'AGA': 154, 'QYA': 23, 'AGL': 16, 'LAU': > 1, 'PTA': 7, ' > AGY': 7, 'QYY': 19, 'QYE': 6, 'PAT': 57, 'QYT': 28, 'AGT': 10, 'QYQ': 34, > 'AGQ': 140, 'QYP': 32, 'AGP': 167, 'TAT': 31, 'SGS': 174, 'TAP': 18, 'YLP': > 49, 'TA > Q': 23, 'UQE': 5, 'UAQ': 9, 'UAT': 8, 'UAE': 7, 'TAD': 1, 'TAG': 15, 'TAA': > 20, 'TAS': 1, 'YUP': 1, 'TAL': 45, 'ALU': 20, 'PEP': 14, 'UAG': 6, 'EAL': > 16, 'SY > Y': 36, 'EAS': 35, 'SYT': 29, 'EAA': 16, 'SYQ': 13, 'EAG': 28} > > The keys are the different amino acid triplets (all possible triplets > extracted from a culled list of PDB), the numbers next to them are the > frequency that they occour in. > > I was wondering if there is a way in biopython/python to sample them at the > frequecy indicated by the no's next to the key. > > I have only given a snippet of the triplet dictionary, the entire dictionary > has about 1400 key entries. > > I would appreciate any help in this matter --thank you very much. > > George > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From p.j.a.cock at googlemail.com Fri Oct 28 09:54:09 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 28 Oct 2011 10:54:09 +0100 Subject: [Biopython] expat and biopython 1.58 problem on linux x64 In-Reply-To: <4EA9C00E.5080509@bath.ac.uk> References: <4EA9C00E.5080509@bath.ac.uk> Message-ID: On Thu, Oct 27, 2011 at 9:33 PM, Jaime Tovar wrote: > Hello all, > > I'm having troubles while updating my biopython to 1.58. > > I'm having exactly the same problem with the xml parser as described in this > old post: > > http://www.biopython.org/pipermail/biopython/2011-May/007263.html > > Sadly I may have to use the entrez module so it will make me happy to have > the thing running if possible. > > I'm installing in a opensuse 11.3 x64 box > Did a rpm install of biopython from the opensuse science repo. So I have > 1.58-1.2 installed. > Python 1.6.5-3.5.1 for x64 > expat 2.0.1-98.1 x64 > > Tried to install both by hand from the tar.gz and using an rpm but the > problem persists. > > Any help will be greatly appreciated. > > Thanks!!! > > Jaime. Hmm. Can you try installing the latest code from git please? You can grab it via the git command line tool, or use github to download the latest code as a tar ball: http://biopython.org/wiki/SourceCode Specifically I'm hoping this change will fix the segmentation fault (assuming http://bugs.python.org/issue4877 is to blame): https://github.com/biopython/biopython/commit/59f9cbd2ad14ebd05d5864033ff0c7ef7a8f0daa Previously: $ python Python 2.6.6 (r266:84292, Aug 31 2010, 16:21:14) [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import Entrez >>> handle = open("NEWS") >>> handle.close() >>> Entrez.read(handle) Segmentation fault With the fix: $ python Python 2.6.6 (r266:84292, Aug 31 2010, 16:21:14) [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2 Type "help", "copyright", "credits" or "license" for more information. 
>>> from Bio import Entrez >>> handle = open("NEWS") >>> handle.close() >>> Entrez.read(handle) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "Bio/Entrez/__init__.py", line 270, in read record = handler.read(handle) File "Bio/Entrez/Parser.py", line 167, in read raise IOError("Can't parse a closed handle") IOError: Can't parse a closed handle Assuming you start seeing the IOError instead, the question would shift to what is going on with your network settings (e.g. look at proxies). If the segmentation fault doesn't go away we'll need to think again. Peter From bioinformaticsing at gmail.com Fri Oct 28 11:46:07 2011 From: bioinformaticsing at gmail.com (ning luwen) Date: Fri, 28 Oct 2011 19:46:07 +0800 Subject: [Biopython] Memory leak while parsing gbk files? Message-ID: Hi, I have tried to parse about 2000+ gbk files using SeqIO.parse, but the memory usage goes up quickly. (On my desktop with 4 GB of memory it runs out of memory after a number of iterations, and on a workstation the memory used went as high as 100 GB+ and kept increasing.) for temp_name in file_names: # file_names: list of paths of gbk files f=open(temp_name) for x in SeqIO.parse(f,'genbank'): print x.name,len(x.features) f.close() I guess there may be a memory leak while parsing gbk files. -- regards, luwen ning From p.j.a.cock at googlemail.com Fri Oct 28 11:52:33 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 28 Oct 2011 12:52:33 +0100 Subject: [Biopython] Memory leak while parsing gbk files? In-Reply-To: References: Message-ID: On Fri, Oct 28, 2011 at 12:46 PM, ning luwen wrote: > Hi, > I have tried to parse about 2000+ gbk files using SeqIO.parse, > but the memory usage goes up quickly. (On my desktop with 4 GB of memory > it runs out of memory after a number of iterations, and on a workstation > the memory used went as high as 100 GB+ and kept increasing.) > > for temp_name in file_names: # file_names: list of paths of gbk files > f=open(temp_name) > for x in SeqIO.parse(f,'genbank'): > print x.name,len(x.features) > f.close() > > I guess there may be a memory leak while parsing gbk files. Which version of Python are you using? Try calling garbage collection explicitly: import gc from Bio import SeqIO for temp_name in file_names: # file_names: list of paths of gbk files f=open(temp_name) for x in SeqIO.parse(f,'genbank'): print x.name,len(x.features) f.close() gc.collect() I expect that to fix the increasing memory usage. If it does, then it isn't a memory leak. Peter From p.j.a.cock at googlemail.com Fri Oct 28 13:21:42 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 28 Oct 2011 14:21:42 +0100 Subject: [Biopython] expat and biopython 1.58 problem on linux x64 In-Reply-To: <4EAAA9A0.3010906@bath.ac.uk> References: <4EA9C00E.5080509@bath.ac.uk> <4EAAA9A0.3010906@bath.ac.uk> Message-ID: On Fri, Oct 28, 2011 at 2:09 PM, Jaime Tovar wrote: > Got the tarball for latest, > > but: > > ... > ~/tmp/biop/biopython-biopython-59f9cbd/Tests> python test_Entrez.py > Test error handling when presented with Fasta non-XML data ... ok > Test error handling when presented with GenBank non-XML data ... ok > Test parsing XML returned by EFetch, Nucleotide database (first test) ... > ERROR > Test parsing XML returned by EFetch, Protein database ... ERROR > Test parsing XML returned by EFetch, OMIM database ... ERROR > Test parsing XML returned by EFetch, PubMed database (first test) ... > Segmentation fault > > Can we try to find where exactly is the problem? > > Thanks for the help.
> J OK, so it doesn't look like the problem with closed handles, http://bugs.python.org/issue4877 Although to be sure please try the example in my last email, from Bio import Entrez handle = open("NEWS") handle.close() Entrez.read(handle) (You can use any file that exists). Beyond that I only have questions rather than answers for now. My guess is something is broken on your system with conflicting versions of expat, see for example: http://www.dscpl.com.au/wiki/ModPython/Articles/ExpatCausingApacheCrash What does this give you, and does it match expat 2.0.1 which you said earlier was installed? import pyexpat print pyexpat.version_info Can you try to get a strack trace? Alternatively, you could disable individual tests which trigger the segmentation fault one by one and then we can attempt to spot any commonalities. e.g. The segmentation fault is from: "Test parsing XML returned by EFetch, PubMed database (first test)" which is method test_pubmed1, rename it to xtest_test_pubmed1 (or anything that doesn't start test_*) and it will be skipped. Peter From devaniranjan at gmail.com Fri Oct 28 13:23:22 2011 From: devaniranjan at gmail.com (George Devaniranjan) Date: Fri, 28 Oct 2011 09:23:22 -0400 Subject: [Biopython] weighted sampling of a dictionary In-Reply-To: <20111028095200.20435ub1z2jexy0g@www.studentmail.otago.ac.nz> References: <20111028095200.20435ub1z2jexy0g@www.studentmail.otago.ac.nz> Message-ID: Thanks guys for all your suggestions -I am going to try these out. Best, George On Thu, Oct 27, 2011 at 4:52 PM, David Winter wrote: > Hi George, > > I was actually doing this yesterday :) > > The function I came up with takes two lists: > > import random > > def weighted_sample(population, weights): > """ Sample from a population, given provided weights """ > if len(population) != len(weights): > raise ValueError('Lengths of population and weights do not match') > normal_weights = [ float(w)/sum(weights) for w in weights ] > val = random.random() > running_total = 0 > for index, weight in enumerate(normal_weights): > running_total += weight > if val < running_total: > return population[index] > > Which seems to do the trick: > > population = ['AAU' ,'AAC', 'AAG'] > weights = [2,5,3] > sample = [weighted_sample(population, weights) for _ in range(1000)] > sample.count('AAC') #should be about 500 > > If that's too slow, check out numpy's random.multinomial() function. > > I haven't tested this, but this should get you the number of times you get > each codon from 1000 "draws": > > import numpy as np > > codons, weights = codon_dict.items() > denom = sum(weights) > normalised_weights = [float(w)/denom for w in weights] > np.random.multinomial(codons, weights, 1000) > > Cheers, > David > > > > Quoting George Devaniranjan : > > Hi, >> >> I am not sure if this question is more suitable for biopython or a python >> forum. >> >> >> I have the following dictionary. 
>> >> dict ={'YLE': 6, 'QYL': 36, 'PTD': 32, 'AGG': 145, 'QYG': 34, 'QYD': 34, >> 'AGD': 188, 'QYS': 35, 'AGS': 177, 'AGA': 154, 'QYA': 23, 'AGL': 16, >> 'LAU': >> 1, 'PTA': 7, ' >> AGY': 7, 'QYY': 19, 'QYE': 6, 'PAT': 57, 'QYT': 28, 'AGT': 10, 'QYQ': 34, >> 'AGQ': 140, 'QYP': 32, 'AGP': 167, 'TAT': 31, 'SGS': 174, 'TAP': 18, >> 'YLP': >> 49, 'TA >> Q': 23, 'UQE': 5, 'UAQ': 9, 'UAT': 8, 'UAE': 7, 'TAD': 1, 'TAG': 15, >> 'TAA': >> 20, 'TAS': 1, 'YUP': 1, 'TAL': 45, 'ALU': 20, 'PEP': 14, 'UAG': 6, 'EAL': >> 16, 'SY >> Y': 36, 'EAS': 35, 'SYT': 29, 'EAA': 16, 'SYQ': 13, 'EAG': 28} >> >> The keys are the different amino acid triplets (all possible triplets >> extracted from a culled list of PDB), the numbers next to them are the >> frequency that they occour in. >> >> I was wondering if there is a way in biopython/python to sample them at >> the >> frequecy indicated by the no's next to the key. >> >> I have only given a snippet of the triplet dictionary, the entire >> dictionary >> has about 1400 key entries. >> >> I would appreciate any help in this matter --thank you very much. >> >> George >> ______________________________**_________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/**mailman/listinfo/biopython >> >> > > > From p.j.a.cock at googlemail.com Mon Oct 31 11:27:31 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 31 Oct 2011 11:27:31 +0000 Subject: [Biopython] expat and biopython 1.58 problem on linux x64 In-Reply-To: References: <4EA9C00E.5080509@bath.ac.uk> <4EAAA9A0.3010906@bath.ac.uk> Message-ID: On Fri, Oct 28, 2011 at 2:21 PM, Peter Cock wrote: > > OK, so it doesn't look like the problem with closed handles, > http://bugs.python.org/issue4877 > Hi Jaime, Was there any sign of an expat version mismatch? That does seem like the most likely problem (Python expecting one thing, the library providing another). Another guess was we could be reusing the parser object (which apparently is not allowed), although the unit tests don't seem to do this: http://bugs.python.org/issue6676 http://bugs.python.org/issue12829 Peter
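P.S. If it helps pin down a version mismatch, a quick check along these lines (just a rough diagnostic sketch), run with the same Python that segfaults, will print which expat library the interpreter is actually bound to - compare that against the 2.0.1 system package - and will show whether a trivial parse already crashes without Biopython involved:

import sys
from xml.parsers import expat

# Rough diagnostic sketch: report the expat library Python is linked against,
# then try a trivial parse to see if the crash happens even without Biopython.
print("Python: %s" % sys.version)
print("EXPAT_VERSION: %s" % expat.EXPAT_VERSION)
print("version_info: %s" % (expat.version_info,))

parser = expat.ParserCreate()
result = parser.Parse("<root><child/></root>", True)
print("Trivial parse returned: %s" % result)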