From p.j.a.cock at googlemail.com Mon Jul 2 07:27:08 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 2 Jul 2012 12:27:08 +0100
Subject: [Biopython] Back translation support in Biopython
In-Reply-To:
References:
Message-ID:

On Wed, Apr 4, 2012 at 4:02 PM, Peter Cock wrote:
> On Wed, Apr 4, 2012 at 2:49 AM, Eric Talevich wrote:
>> Hi Igor,
>>
>> It sounds like you're referring to aligning amino acid sequences to codon
>> sequences, as PAL2NAL does. This is different from what most people mean by
>> back translation, but as you point out, certainly useful.
>>
>> If you write a function that can match a protein sequence alignment to a set
>> of raw CDS sequences, returning a nucleotide alignment based on the
>> codon-to-amino-acid mapping, that would be useful. However, PAL2NAL does
>> exactly that, plus a bit more, and is a fairly well-known and easily
>> obtained program. Personally, I would prefer to write a wrapper for PAL2NAL
>> under Bio.Align.Applications, using the existing Bio.Applications framework.
>
> As per the old thread, a simple function in Python taking the gapped protein
> sequence, original nucleotide coding sequence, and the translation table
> does sound useful. Then using that, you could go from a protein alignment
> plus the original nucleotide coding sequences to a codon alignment, or
> other tasks. Given this is all relatively straightforward string manipulation
> and we already have the required genetic code tables in Biopython, I'm not
> convinced that wrapping PAL2NAL would be the best solution (for this sub
> task).

Hi Igor,

Did you do any work on back-translation (alignment threading) in Biopython?
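
For readers new to the idea, the single-sequence step mentioned in the quoted thread (gapped protein plus ungapped coding sequence in, gapped codon sequence out) can be sketched as plain string handling. This is only an illustration, not the code from any Biopython branch: the name `thread_codons` is hypothetical, it assumes a single-character gap and a CDS without a trailing stop codon, and it does not check that the codons actually translate to the given residues.

```python
def thread_codons(gapped_protein, cds, gap="-"):
    """Thread an ungapped coding sequence onto a gapped protein string.

    Each residue consumes the next three nucleotides; each protein gap
    becomes three nucleotide gaps. No translation check is done here.
    """
    residues = len(gapped_protein) - gapped_protein.count(gap)
    if len(cds) != 3 * residues:
        raise ValueError("CDS length %i does not match %i residues"
                         % (len(cds), residues))
    pieces = []
    pos = 0  # position of the next unused codon in the CDS
    for letter in gapped_protein:
        if letter == gap:
            pieces.append(gap * 3)
        else:
            pieces.append(cds[pos:pos + 3])
            pos += 3
    return "".join(pieces)


print(thread_codons("MK-L", "ATGAAACTG"))  # ATGAAA---CTG
```

Applying such a function to every row of a protein alignment would give a codon alignment, provided the protein and nucleotide identifiers can be matched up.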
We needed to do this locally, and for some reason (yet to be determined)
T-COFFEE wasn't working on our dataset, so I made a start at a Biopython
implementation:

https://github.com/peterjc/biopython/tree/back_trans
https://github.com/peterjc/biopython/commit/7d14cdb59bb9d41c727c923c8aa7e3dda7779c80

Currently just one commit adding a Bio.Align.alignment_back_translate(...)
function which takes a protein alignment and dictionary of nucleotide
records - easy to get with Bio.SeqIO and Bio.AlignIO - with a stand alone
example included in the doctest. There is also a new (currently private)
function to do this for one sequence pair - perhaps useful on its own?

There are potential complications with ID mapping between the proteins
and nucleotides, thus the option of a key function, and the gap characters
(would you ever want to use different gap characters in the protein and
nucleotide alignments?).

We could discuss implementation details over on the biopython-dev list,
but the general API discussion might as well be here. e.g. Where to put
the function and what to call it.

Regards,

Peter

From from.d.putto at gmail.com Mon Jul 2 08:21:46 2012
From: from.d.putto at gmail.com (Sheila the angel)
Date: Mon, 2 Jul 2012 14:21:46 +0200
Subject: [Biopython] searching homologene database
Message-ID:

To search tp53 homolog in homologene database -

handle = Entrez.esearch(db="homologene", term="tp53[gene name] AND Homo sapiens[orgn]")
record = Entrez.read(handle)
handle = Entrez.efetch(db="homologene", id=record['IdList'])
record = handle.read()
print record

I think record is asn.1 format !!
how can I read or convert it in the genes protein table (as we see in the
web result) http://www.ncbi.nlm.nih.gov/homologene/460

Thanks

--
Sheila

From w.arindrarto at gmail.com Mon Jul 2 08:39:31 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Mon, 2 Jul 2012 14:39:31 +0200
Subject: [Biopython] searching homologene database
In-Reply-To:
References:
Message-ID:

Hi Sheila,

You can set the 'retmode' parameter in order to specify your preferred
format. I'm not sure if NCBI provides an output format exactly like the
one you see on their site, but instead of ASN.1 you can specify a more
common format like XML.

In your case, the call would be this (for XML, let's say):

handle = Entrez.efetch(db="homologene", id=record['IdList'], retmode="xml")

For a list of possible retmode values, you can look them up here:
http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch (see the
explanation about 'retmode').

If you want to format the output further, you can use modules like the
built-in elementtree or 3rd party modules like lxml to extract the tag
values and feed them to your script / program.

Hope that helps,
Bow

On Mon, Jul 2, 2012 at 2:21 PM, Sheila the angel wrote:
> To search tp53 homolog in homologene database -
>
> handle = Entrez.esearch(db="homologene", term="tp53[gene name] AND Homo
> sapiens[orgn]")
> record = Entrez.read(handle)
> handle = Entrez.efetch(db="homologene", id=record['IdList'])
> record = handle.read()
> print record
>
> I think record is asn.1 format !!
> how can I read or convert it in the genes
> protein table (as we see in the web result)
> http://www.ncbi.nlm.nih.gov/homologene/460
>
> Thanks
>
> --
> Sheila
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From from.d.putto at gmail.com Tue Jul 10 09:16:25 2012
From: from.d.putto at gmail.com (Sheila the angel)
Date: Tue, 10 Jul 2012 15:16:25 +0200
Subject: [Biopython] access Uniprot record by different ids
Message-ID:

I have a Uniprot AC list in which some AC are primary and some are
secondary. The function

my_dict = SeqIO.index("uniprot_sprot.dat", "swiss")

makes dictionary of uniprot data but I can access a record only by
primary AC.

my_dict['P04637'] # gives the record
my_dict['Q15086'] # KeyError
my_dict['P53_HUMAN'] # KeyError

Is it possible to access same record by both primary and secondary ACs
(and by uniprot ID) ?

From p.j.a.cock at googlemail.com Tue Jul 10 09:43:31 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 10 Jul 2012 14:43:31 +0100
Subject: [Biopython] access Uniprot record by different ids
In-Reply-To:
References:
Message-ID:

On Tue, Jul 10, 2012 at 2:16 PM, Sheila the angel wrote:
> I have a Uniprot AC list in which some AC are primary and some are
> secondary. The function
> my_dict = SeqIO.index("uniprot_sprot.dat", "swiss")
> makes dictionary of uniprot data but I can access a record only by primary
> AC.
> my_dict['P04637'] # gives the record
> my_dict['Q15086'] # KeyError
> my_dict['P53_HUMAN'] # KeyError
>
> Is it possible to access same record by both primary and secondary ACs
> (and by uniprot ID) ?

Not directly with Bio.SeqIO.index() or Bio.SeqIO.index_db(), no.
You could perhaps use a second dictionary mapping aliases to
the primary ID?
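
A rough sketch of that two-dictionary idea, using plain Python objects so it runs without the UniProt file. The `records` dict stands in for the `SeqIO.index(...)` result, and the `aliases` table and `lookup` helper are hypothetical names made up for this example. In practice the alias table would have to be filled in a separate pass over the file; the "swiss" parser should expose the accession list and entry name on each record, but check that against your Biopython version.

```python
# Stand-in for the SeqIO.index(...) result: primary accession -> record.
# A plain string is used as the "record" so the sketch is self-contained.
records = {"P04637": "<record for P53_HUMAN>"}

# Hypothetical alias table: secondary accessions and entry names,
# each pointing at the primary accession used as the index key.
aliases = {"Q15086": "P04637", "P53_HUMAN": "P04637"}


def lookup(key):
    """Return the record for a primary AC, secondary AC or entry name."""
    # Fall back to the key itself when it is not a known alias.
    return records[aliases.get(key, key)]


print(lookup("P04637") == lookup("Q15086") == lookup("P53_HUMAN"))
```

The alias table is small (strings only), so it can be built once and saved, while the large record index stays on disk.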

Peter

From n.j.loman at bham.ac.uk Wed Jul 11 11:02:00 2012
From: n.j.loman at bham.ac.uk (Nick Loman)
Date: Wed, 11 Jul 2012 16:02:00 +0100
Subject: [Biopython] SeqRecord substring should return SeqRecord or character?
Message-ID:

Hi there

I wanted to add the last character of a SeqRecord s1 to another
SeqRecord s2. However s1[-1] + s2 fails because s1[-1] returns a
string rather than a SeqRecord just containing a single base and
associated annotations. I have to do s1[-1:] to get a sliced
SeqRecord.

Is this behaviour intentional? I kind of assumed I would always get a
SeqRecord from any given slice, and it's seems weird to get just a
string back instead, although no doubt there's a good reason for this.

Cheers

Nick

From p.j.a.cock at googlemail.com Wed Jul 11 11:21:00 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 11 Jul 2012 16:21:00 +0100
Subject: [Biopython] SeqRecord substring should return SeqRecord or character?
In-Reply-To:
References:
Message-ID:

On Wed, Jul 11, 2012 at 4:02 PM, Nick Loman wrote:
> Hi there
>
> I wanted to add the last character of a SeqRecord s1 to another
> SeqRecord s2. However s1[-1] + s2 fails because s1[-1] returns a
> string rather than a SeqRecord just containing a single base and
> associated annotations. I have to do s1[-1:] to get a sliced
> SeqRecord.

You should be able to do SeqRecord+string, and string+SeqRecord,
both of which are specifically tested in the docstring. Have you got
any more details? e.g. Version? Mini-example?

> Is this behaviour intentional? I kind of assumed I would always get a
> SeqRecord from any given slice, and it's seems weird to get just a
> string back instead, although no doubt there's a good reason for this.

For a single base/residue, the whole SeqRecord overhead does
seem unnecessary. As to why you get a single letter string, not
a single letter Seq, IIRC it was mimicking the Seq object.

Peter

From p.j.a.cock at googlemail.com Wed Jul 11 11:52:41 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 11 Jul 2012 16:52:41 +0100
Subject: [Biopython] SeqRecord substring should return SeqRecord or character?
In-Reply-To:
References: <620A45B10433AE4C81D3F931A02812F93BC80453CF@LESMBX1.adf.bham.ac.uk>
Message-ID:

On Wed, Jul 11, 2012 at 4:24 PM, Nick Loman wrote:
> On Wed, Jul 11, 2012 at 4:21 PM, Peter Cock wrote:
>> On Wed, Jul 11, 2012 at 4:02 PM, Nick Loman wrote:
>>> Hi there
>>>
>>> I wanted to add the last character of a SeqRecord s1 to another
>>> SeqRecord s2. However s1[-1] + s2 fails because s1[-1] returns a
>>> string rather than a SeqRecord just containing a single base and
>>> associated annotations. I have to do s1[-1:] to get a sliced
>>> SeqRecord.
>>
>> You should be able to do SeqRecord+string, and string+SeqRecord,
>> both of which are specifically tested in the docstring. Have you got
>> any more details? e.g. Version? Mini-example?
>
> Hi Peter,
>
> It was doing this on a FASTQ record so it's the missing quality
> annotation that cause the problem when trying to do this.

Ah - so the addition should have worked, but you'd lose the partial
quality string. You're stuck with ensuring you have two SeqRecords,
so as you suggested rather than s1[-1]+s2 please use s1[-1:]+s2
instead. Slightly less clear, but only one character more.

This actually reminds me of similar behaviour with the bytes string
in Python 3, where the same trick is required to get a single letter
bytes string.

>>> Is this behaviour intentional? I kind of assumed I would always get a
>>> SeqRecord from any given slice, and it's seems weird to get just a
>>> string back instead, although no doubt there's a good reason for this.
>>
>> For a single base/residue, the whole SeqRecord overhead does
>> seem unnecessary. As to why you get a single letter string, not
>> a single letter Seq, IIRC it was mimicking the Seq object.
>
> Yes, I guessed the overhead was likely to be the reason .. not sure
> if there's a satisfactory solution?

Returning a single letter SeqRecord might have been a better choice,
and going back much further in Biopython's history the Seq object
should probably have returned a single letter Seq (not a single
letter string). There is a similar issue with the columns of an
alignment.

Peter

From wheatontrue at gmail.com Thu Jul 12 04:53:26 2012
From: wheatontrue at gmail.com (Wheaton Little)
Date: Thu, 12 Jul 2012 16:53:26 +0800
Subject: [Biopython] can I use the xml parser in biopython on other xml files? how?
Message-ID:

I would like to use the Biopython xml parser, if possible, on google
patent xmls:

http://www.google.com/googlebooks/uspto-patents-applications-text.html

unfortunately, this is what I get:

>>> t=open('ipa111229.xml','r').read()
>>> import Bio
>>> ttt=Bio.Entrez.read(t[:30000])

Traceback (most recent call last):
  File "", line 1, in
    ttt=Bio.Entrez.read(t[:30000])
  File "/Library/Python/2.7/site-packages/Bio/Entrez/__init__.py", line 351, in read
    record = handler.read(handle)
  File "/Library/Python/2.7/site-packages/Bio/Entrez/Parser.py", line 169, in read
    self.parser.ParseFile(handle)
TypeError: argument must have 'read' attribute

What would I have to do to use the parser on this xml?

From b.invergo at gmail.com Thu Jul 12 05:25:27 2012
From: b.invergo at gmail.com (Brandon Invergo)
Date: Thu, 12 Jul 2012 11:25:27 +0200
Subject: [Biopython] can I use the xml parser in biopython on other xml files? how?
In-Reply-To:
References:
Message-ID: <1342085127.614.10.camel@localhost.localdomain>

With regards to the error that you receive, it's because you're trying
to `read()` a string, when that method requires a file-like object.
This would fix that:

>>> ttt=Bio.Entrez.read(open('ipa111229.xml', 'r'))

However, that wouldn't work because it requires a DTD from NCBI to read
the file.
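
For XML that does not come from NCBI, a standard-library parser is one option. Below is a minimal sketch with xml.etree.ElementTree reading from a file-like handle. The tiny document and its tag names are invented for illustration and are not the real patent schema; a real file would be opened with `open(...)` instead of wrapped in `io.StringIO`.

```python
import io
import xml.etree.ElementTree as ET

# Made-up XML standing in for a real file; tag names are illustrative only.
xml_text = """<applications>
  <application id="A1"><title>First example</title></application>
  <application id="A2"><title>Second example</title></application>
</applications>"""

# ET.parse() wants a filename or a file-like handle, not the text itself.
handle = io.StringIO(xml_text)
tree = ET.parse(handle)

# Collect the title of each application, keyed by its id attribute.
titles = {}
for app in tree.getroot().iter("application"):
    titles[app.get("id")] = app.findtext("title")

print(titles)
```

When the XML is already in a string, `ET.fromstring(xml_text)` avoids the need for a handle altogether.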

Why not use one of Python's standard xml libraries (xml.sax or xml.dom
(or xml.dom.minidom))?

-brandon

On Thu, 2012-07-12 at 16:53 +0800, Wheaton Little wrote:
> I would like to use the Biopython xml parser, if possible, on google
> patent xmls:
>
> http://www.google.com/googlebooks/uspto-patents-applications-text.html
>
> unfortunately, this is what I get:
>
> >>> t=open('ipa111229.xml','r').read()
> >>> import Bio
> >>> ttt=Bio.Entrez.read(t[:30000])
>
> Traceback (most recent call last):
>   File "", line 1, in
>     ttt=Bio.Entrez.read(t[:30000])
>   File "/Library/Python/2.7/site-packages/Bio/Entrez/__init__.py",
> line 351, in read
>     record = handler.read(handle)
>   File "/Library/Python/2.7/site-packages/Bio/Entrez/Parser.py", line
> 169, in read
>     self.parser.ParseFile(handle)
> TypeError: argument must have 'read' attribute
>
> What would I have to do to use the parser on this xml?
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From p.j.a.cock at googlemail.com Thu Jul 12 05:35:09 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 12 Jul 2012 10:35:09 +0100
Subject: [Biopython] can I use the xml parser in biopython on other xml files? how?
In-Reply-To:
References:
Message-ID:

On Thu, Jul 12, 2012 at 9:53 AM, Wheaton Little wrote:
> I would like to use the Biopython xml parser, if possible, on google
> patent xmls:
>
> http://www.google.com/googlebooks/uspto-patents-applications-text.html
>
> unfortunately, this is what I get:
>
>>>> t=open('ipa111229.xml','r').read()
>>>> import Bio
>>>> ttt=Bio.Entrez.read(t[:30000])
>
> Traceback (most recent call last):
> ...
> TypeError: argument must have 'read' attribute
>
> What would I have to do to use the parser on this xml?

In your example, you opened the file and read all the data into a
string (variable t).

The parser is not expecting a string, but a handle.
String objects don't have a 'read' method, thus this error message.

You could 'fix' this particular error by doing:

handle=open('ipa111229.xml','r')
from Bio import Entrez
ttt=Entrez.read(handle)

However, I doubt this will work as the Entrez parser is intended to be
used with the NCBI XML files only.

Python comes with several XML libraries in the standard library.
ElementTree (or cElementTree) is quite popular, but as Brandon points
out there are also DOM and SAX style parsers.

Peter

From from.d.putto at gmail.com Thu Jul 12 07:06:58 2012
From: from.d.putto at gmail.com (Sheila the angel)
Date: Thu, 12 Jul 2012 13:06:58 +0200
Subject: [Biopython] access Uniprot record by different ids
In-Reply-To:
References:
Message-ID:

Thanks for reply.
Now I made two dictionary one for uniprot_sprot.dat and another for
secondary ids to primary ids.
However it take too long to do this and I can't do Pickle for my_dict.
I would like to know is it possible to dump my_dict (the uniprot.dat data)
to MySql database.
I looked at biopython-BioSQL page but didn't understand much (I am new to
SQL)
Thanks

--
Sheila

On Tue, Jul 10, 2012 at 3:43 PM, Peter Cock wrote:

> On Tue, Jul 10, 2012 at 2:16 PM, Sheila the angel
> wrote:
> > I have a Uniprot AC list in which some AC are primary and some are
> > secondary. The function
> > my_dict = SeqIO.index("uniprot_sprot.dat", "swiss")
> > makes dictionary of uniprot data but I can access a record only by
> primary
> > AC.
> > my_dict['P04637'] # gives the record
> > my_dict['Q15086'] # KeyError
> > my_dict['P53_HUMAN'] # KeyError
> >
> > Is it possible to access same record by both primary and secondary ACs
> > (and by uniprot ID) ?
>
> Not directly with Bio.SeqIO.index() or Bio.SeqIO.index_db(), no.
> You could perhaps use a second dictionary mapping aliases to
> the primary ID?
>
> Peter
>

From wheatontrue at gmail.com Thu Jul 12 07:57:51 2012
From: wheatontrue at gmail.com (Wheaton Little)
Date: Thu, 12 Jul 2012 19:57:51 +0800
Subject: [Biopython] can I use the xml parser in biopython on other xml files? how?
In-Reply-To:
References:
Message-ID:

Indeed, it didn't like that. Using BeautifulSoup seems to work but not
sure how well...

Thanks for the advice, all!

On Thu, Jul 12, 2012 at 5:35 PM, Peter Cock wrote:
> On Thu, Jul 12, 2012 at 9:53 AM, Wheaton Little wrote:
>> I would like to use the Biopython xml parser, if possible, on google
>> patent xmls:
>>
>> http://www.google.com/googlebooks/uspto-patents-applications-text.html
>>
>> unfortunately, this is what I get:
>>
>>>>> t=open('ipa111229.xml','r').read()
>>>>> import Bio
>>>>> ttt=Bio.Entrez.read(t[:30000])
>>
>> Traceback (most recent call last):
>> ...
>> TypeError: argument must have 'read' attribute
>>
>> What would I have to do to use the parser on this xml?
>
> In your example, you opened the file and read all the data into a
> string (variable t).
>
> The parser is not expecting a string, but a handle. String objects
> don't have a 'read' method, thus this error message.
>
> You could 'fix' this particular error by doing:
>
> handle=open('ipa111229.xml','r')
> from Bio import Entrez
> ttt=Entrez.read(handle)
>
> However, I doubt this will work as the Entrez parser is intended to be
> used with the NCBI XML files only.
>
> Python comes with several XML libraries in the standard library.
> ElementTree (or cElementTree) is quite popular, but as Brandon points
> out there are also DOM and SAX style parsers.
>
> Peter

From chaudhrynabeelahmed at gmail.com Thu Jul 12 08:58:58 2012
From: chaudhrynabeelahmed at gmail.com (Nabeel Ahmed)
Date: Thu, 12 Jul 2012 17:58:58 +0500
Subject: [Biopython] Bioinformatics EMBOSS users
Message-ID:

I have recently installed EMBOSS-6.4.0 (Ubuntu 11.10).
I am unable to make it work directly with live databases (embl, uniprot) ,
working totally fine with local sequence files.
e.g

% *plotorf *
Plot potential open reading frames in a nucleotide sequence
Input nucleotide sequence: *embl:x13776*

*Error:* Failed to open filename 'embl'**

Used 'showdb' , displayed table with zero rows.

Is there any configuration, i am missing??

Ahmed

From p.j.a.cock at googlemail.com Thu Jul 12 14:30:33 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 12 Jul 2012 19:30:33 +0100
Subject: [Biopython] Bioinformatics EMBOSS users
In-Reply-To:
References:
Message-ID:

On Thu, Jul 12, 2012 at 1:58 PM, Nabeel Ahmed wrote:
> I have recently installed EMBOSS-6.4.0 (Ubuntu 11.10).
> I am unable to make it work directly with live databases (embl, uniprot) ,
> working totally fine with local sequence files.
> e.g
>
> % *plotorf *
> Plot potential open reading frames in a nucleotide sequence
> Input nucleotide sequence: *embl:x13776*
>
> *Error:* Failed to open filename 'embl'**
>
> Used 'showdb' , displayed table with zero rows.
>
> Is there any configuration, i am missing??
>
> Ahmed

I'm not sure - but the EMBOSS mailing list would be the place to ask:
http://lists.open-bio.org/mailman/listinfo/emboss

Peter

From p.j.a.cock at googlemail.com Thu Jul 12 14:37:11 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 12 Jul 2012 19:37:11 +0100
Subject: [Biopython] access Uniprot record by different ids
In-Reply-To:
References:
Message-ID:

On Thu, Jul 12, 2012 at 12:06 PM, Sheila the angel wrote:
> Thanks for reply.
> Now I made two dictionary one for uniprot_sprot.dat and another for
> secondary ids to primary ids.
> However it take too long to do this and I can't do Pickle for my_dict.
> I would like to know is it possible to dump my_dict (the uniprot.dat data)
> to MySql database.

Have you tried the Bio.SeqIO.index_db(...) function? This builds
an SQLite database to hold the lookup table of offsets (i.e.
the primary accession only). Creating the index is a little slow,
but reuse is very fast.

For your second dictionary mapping secondary accessions to the
primary accession, you should be able to use pickle.

> I looked at biopython-BioSQL page but didn't understand much
> (I am new to SQL)
> Thanks

BioSQL is a bit complicated to get started with (although using
SQLite is a lot simpler than MySQL or PostgreSQL).

Peter

From livingstonemark at gmail.com Mon Jul 16 21:49:37 2012
From: livingstonemark at gmail.com (Mark Livingstone)
Date: Tue, 17 Jul 2012 11:49:37 +1000
Subject: [Biopython] The PDBParser Permissive setting
Message-ID:

Hi Guys,

In my code I am experimenting with different ways of doing RMSD
calculations. I have code which in addition to normal CA based RMSD
can do (CA & CB) RMSD and also sidechain RMSD. On a perfect PDB file
this works well. Unfortunately, the curation I have is fairly average
/ poor in quality :-( and I only find out when one of the liberal
number of Try/Except blocks falls over.

I need a better way to find out sooner if a PDB file is missing data.

I am wondering therefore is for PDBParser I set Permissive=0, and
after setting the relevant models and chains etc, I did

wt_atoms = Bio.PDB.Selection.unfold_entities(wtc, 'A')

If this successfully works without throwing an Exception, can I assume
that this unfolded chain is perfect, or are there ways that I could
still be tripped up?

Alternatively, can anyone suggest code that I can employ in my
curation process that will give me a decent sanity check of PDB
quality, so I can get on writing experimental code - and not
Try/Except blocks :-(

Thanks in advance,

MarkL

From anaryin at gmail.com Tue Jul 17 02:42:09 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 17 Jul 2012 02:42:09 -0400
Subject: [Biopython] The PDBParser Permissive setting
In-Reply-To:
References:
Message-ID:

Hey Mark,

What kind of validation do you want?

Cheers,

João

On 17 Jul 2012 02:52, "Mark Livingstone" <livingstonemark at gmail.com> wrote:

> Hi Guys,
>
> In my code I am experimenting with different ways of doing RMSD
> calculations. I have code which in addition to normal CA based RMSD
> can do (CA & CB) RMSD and also sidechain RMSD. On a perfect PDB file
> this works well. Unfortunately, the curation I have is fairly average
> / poor in quality :-( and I only find out when one of the liberal
> number of Try/Except blocks falls over.
>
> I need a better way to find out sooner if a PDB file is missing data.
>
> I am wondering therefore is for PDBParser I set Permissive=0, and
> after setting the relevant models and chains etc, I did
>
>
> wt_atoms = Bio.PDB.Selection.unfold_entities(wtc, 'A')
>
> If this successfully works without throwing an Exception, can I assume
> that this unfolded chain is perfect, or are there ways that I could
> still be tripped up?
>
> Alternatively, can anyone suggest code that I can employ in my
> curation process that will give me a decent sanity check of PDB
> quality, so I can get on writing experimental code - and not
> Try/Except blocks :-(
>
> Thanks in advance,
>
> MarkL
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From livingstonemark at gmail.com Tue Jul 17 04:54:50 2012
From: livingstonemark at gmail.com (Mark Livingstone)
Date: Tue, 17 Jul 2012 18:54:50 +1000
Subject: [Biopython] The PDBParser Permissive setting
In-Reply-To:
References:
Message-ID:

Hi João,

I guess it would be good if I could get a data structure that had no
discontinuities, no missing data points or unknowns. I would be able
to tell it to ignore HOH or other irrelevancies.

My use case as I mentioned is RMSD and similar algorithms, so one
continuous structure with all the data attached that I can iterate
through, selecting atoms / residues as needed, and get the names and
coordinates as I go.

So I guess I want a PDB Diagnostic type program to allow me to find
exemplary PDB files to use during initial stages of development while
I do proof of concept, since I know that finding edge case PDBs for
later work is not as hard it seems as finding good ones ;-)

Maybe the simplest way to think of the sort of PDBs is you can run
your software and you don't need any try / except blocks for Biopython
to work well :-D

Cheers,

MarkL

On 17 July 2012 16:42, João Rodrigues wrote:
> Hey Mark,
>
> What kind of validation do you want?
>
> Cheers,
>
> João
>
> On 17 Jul 2012 02:52, "Mark Livingstone" wrote:
>>
>> Hi Guys,
>>
>> In my code I am experimenting with different ways of doing RMSD
>> calculations. I have code which in addition to normal CA based RMSD
>> can do (CA & CB) RMSD and also sidechain RMSD. On a perfect PDB file
>> this works well. Unfortunately, the curation I have is fairly average
>> / poor in quality :-( and I only find out when one of the liberal
>> number of Try/Except blocks falls over.
>>
>> I need a better way to find out sooner if a PDB file is missing data.
>>
>> I am wondering therefore is for PDBParser I set Permissive=0, and
>> after setting the relevant models and chains etc, I did
>>
>>
>> wt_atoms = Bio.PDB.Selection.unfold_entities(wtc, 'A')
>>
>> If this successfully works without throwing an Exception, can I assume
>> that this unfolded chain is perfect, or are there ways that I could
>> still be tripped up?
>>
>> Alternatively, can anyone suggest code that I can employ in my
>> curation process that will give me a decent sanity check of PDB
>> quality, so I can get on writing experimental code - and not
>> Try/Except blocks :-(
>>
>> Thanks in advance,
>>
>> MarkL
>> _______________________________________________
>> Biopython mailing list - Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython

From anaryin at gmail.com Tue Jul 17 05:35:56 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 17 Jul 2012 10:35:56 +0100
Subject: [Biopython] The PDBParser Permissive setting
In-Reply-To:
References:
Message-ID:

You mean for example, no chain breaks? And no missing atoms in residues?

You can check the first one with a warning catcher (I think I answered
something like this a few time ago here in the mailing list). The second
one is trickier, you'll need a sort of topology to know which atoms
belong to each residue. I have something like that in my GSOC branch but
it's very very very experimental..

Is this what you mean? Which others would you be looking for? I think
that for RMSD alone you need only to make sure that you match equivalent
atoms. That should be easy enough without major modifications or endless
try/excepts :)

From dilara.ally at gmail.com Tue Jul 17 18:24:07 2012
From: dilara.ally at gmail.com (Dilara Ally)
Date: Tue, 17 Jul 2012 15:24:07 -0700
Subject: [Biopython] When is a SeqRecord a SeqRecord
Message-ID:

Hi

I've modified my code but why does the inclusion of return None and the
subsequent code if filtered_rec is not None solve the problem?

Thanks!

Dilara

q_threshold=20

def check_meanQ(rec, q_threshold):
    seqlen=len(rec)
    quality_scores=array(rec.letter_annotations["phred_quality"])
    if round(quality_scores.mean()) <= q_threshold:
        print "Discarded ", rec.id, "because mean Q was", round(quality_scores.mean())
        return None
    if round(quality_scores.mean()) > q_threshold:
        return rec

from Bio import SeqIO
for rec in SeqIO.parse("test.fastq", "fastq"):
    #print rec.id
    filtered_rec= check_meanQ(rec, q_threshold)
    if filtered_rec is not None:
        print filtered_rec.id
        print filtered_rec.letter_annotations

From dilara.ally at gmail.com Tue Jul 17 15:11:12 2012
From: dilara.ally at gmail.com (Dilara Ally)
Date: Tue, 17 Jul 2012 12:11:12 -0700
Subject: [Biopython] when is a SeqRecord not a SeqRecord
Message-ID:

Hi

I'm trying to understand what is why when I print filtered_rec I get a
SeqRecord but if I try to access any particular attribute of a SeqRecord
such as letter_annotations I sometimes get an attribute error --
AttributeError: 'NoneType' object has no attribute
'letter_annotations.'

q_threshold=20

def check_meanQ(record, q_threshold):
    seqlen=len(record)
    quality_scores=array(record.letter_annotations["phred_quality"])
    if round(quality_scores.mean()) <= q_threshold:
        print "Discarded ", record.id, "because mean Q was", round(quality_scores.mean())
    elif round(quality_scores.mean()) > q_threshold:
        return record

from Bio import SeqIO
for rec in SeqIO.parse("test.fastq", "fastq"):
    print rec.id
    filtered_rec= check_meanQ(rec, q_threshold)
    #print filtered_rec
    print filtered_rec.letter_annotations

I've attached two fastq files that I've used with this code one is called
test.fastq and the other is hiseq_pe_test.fastq

Any help would be greatly appreciated.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.fastq
Type: application/octet-stream
Size: 39217 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hiseq_pe_test.fastq
Type: application/octet-stream
Size: 1541 bytes
Desc: not available
URL:

From chapmanb at 50mail.com Wed Jul 18 09:23:27 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 18 Jul 2012 09:23:27 -0400
Subject: [Biopython] when is a SeqRecord not a SeqRecord
In-Reply-To:
References:
Message-ID: <87y5mhkxtc.fsf@fastmail.fm>

Dilara;

> I'm trying to understand what is why when I print filtered_rec I get a
> SeqRecord but if I try to access any particular attribute of a SeqRecord
> such as letter_annotations I sometimes get an attribute error --
> AttributeError: 'NoneType' object has no attribute
> 'letter_annotations.'

> def check_meanQ(record, q_threshold):
>     seqlen=len(record)
>     quality_scores=array(record.letter_annotations["phred_quality"])
>     if round(quality_scores.mean()) <= q_threshold:
>         print "Discarded ", record.id, "because mean Q was",
> round(quality_scores.mean())
>     elif round(quality_scores.mean()) > q_threshold:
>         return record

This function returns different results based on the comparison of mean
quality scores to your threshold:

- When it is below the threshold, it returns None (since you do not
  define an explicit return value)
- When it is above the threshold, it returns a SeqRecord.

> from Bio import SeqIO
> for rec in SeqIO.parse("test.fastq", "fastq"):
>     print rec.id
>     filtered_rec= check_meanQ(rec, q_threshold)
>     #print filtered_rec
>     print filtered_rec.letter_annotations

You are seeing the error since in the filtered cases the function
returns None.
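
The implicit None is easy to reproduce without any sequence data. A minimal sketch in plain Python follows; the function name and the numbers are made up for illustration and only mirror the shape of the filter above.

```python
def keep_if_above(value, threshold):
    """Return value when it exceeds threshold; otherwise fall off the end."""
    if value > threshold:
        return value
    # No return statement on this path, so Python returns None.


results = [keep_if_above(v, 20) for v in (35, 12, 20)]
print(results)  # [35, None, None]

# Guarding against None before touching attributes avoids the AttributeError.
kept = [r for r in results if r is not None]
print(kept)  # [35]
```

An explicit `return None` behaves identically; it only makes the intent visible to the reader.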

You probably want:

filtered_rec= check_meanQ(rec, q_threshold)
if filtered_rec is not None:
    print filtered_rec.letter_annotations

Brad

From bioinformaticsing at gmail.com Wed Jul 18 23:36:19 2012
From: bioinformaticsing at gmail.com (ning luwen)
Date: Thu, 19 Jul 2012 11:36:19 +0800
Subject: [Biopython] Error while parsing bgk file
Message-ID:

Hi everyone,

A error encountered when i parse a gbk file.
the error message as follow: Traceback (most recent call last): File "stat_refseq_gbs.py", line 10, in for seq in f: File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 537, in parse for r in i: File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 445, in parse_records record = self.parse(handle, do_features) File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 428, in parse if self.feed(handle, consumer, do_features): File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 400, in feed self._feed_feature_table(consumer, self.parse_features(skip=False)) File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 350, in _feed_feature_table consumer.location(location_string) File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/__init__.py", line 970, in location int(e), ValueError: invalid literal for int() with base 10: '68452073^68452074' the file parsed is ref_GRCh37.p5, the biopython version is 1.60, the lines cause the error may be: V_segment complement(68451760..68452073^68452074) CDS complement(<68451760..68452072^68452073) -- regards, luwen ning From w.arindrarto at gmail.com Thu Jul 19 04:50:33 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Thu, 19 Jul 2012 10:50:33 +0200 Subject: [Biopython] Error while parsing bgk file In-Reply-To: References: Message-ID: Hi Ning, Thanks for reporting the error. A similar issue has been reported in the bug tracker here: https://redmine.open-bio.org/issues/3175 (it also looks like it's the same coordinate). It seems that this could be an invalid GenBank coordinate made by NCBI, though. >From which chromosome is this coordinate coming from? Is it the latest draft? cheers, Bow On Thu, Jul 19, 2012 at 5:36 AM, ning luwen wrote: > Hi everyone, > > A error encountered when i parse a gbk file. 
> > the error message as follow: > > Traceback (most recent call last): > File "stat_refseq_gbs.py", line 10, in > for seq in f: > File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", > line 537, in parse > for r in i: > File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 445, in parse_records > record = self.parse(handle, do_features) > File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 428, in parse > if self.feed(handle, consumer, do_features): > File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 400, in feed > self._feed_feature_table(consumer, self.parse_features(skip=False)) > File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 350, in _feed_feature_table > consumer.location(location_string) > File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/__init__.py", > line 970, in location > int(e), > ValueError: invalid literal for int() with base 10: '68452073^68452074' > > the file parsed is ref_GRCh37.p5, the biopython version is 1.60, the > lines cause the error may be: > > V_segment complement(68451760..68452073^68452074) > CDS complement(<68451760..68452072^68452073) > > -- > regards, > luwen ning > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From dilara.ally at gmail.com Thu Jul 19 11:51:35 2012 From: dilara.ally at gmail.com (Dilara Ally) Date: Thu, 19 Jul 2012 08:51:35 -0700 Subject: [Biopython] slice a record in two and writing both records Message-ID: If I have a function (modify_record) that slices up a SeqRecord into sub records and then returns the sliced record if it has a certain length (for e.g. the sliced record needs to be greater than 40bp), sometimes the original record when sliced will have two different records both greater than 40bp. 
I want to keep both sliced reads and rewrite them as separate records into a single fastq file. Here is my code:

    def modify_record(frec, win, len_threshold):
        quality_scores = array(frec.letter_annotations["phred_quality"])
        all_window_qc = slidingWindow(quality_scores, win,1)
        track_qc = windowQ(all_window_qc)
        myzeros = boolean_array(track_qc, q_threshold,win)
        Nrec = slice_points(myzeros,win)[0][1]-1
        where_to_slice = slice_points(myzeros,win)[1]
        where_to_slice.append(len(frec)+win)
        sub_record = slice_record(frec,Nrec,where_to_slice,win,len_threshold)
        return sub_record

    q_threshold = 20
    win = 5
    len_threshold = 30

    from Bio import SeqIO
    from numpy import *
    good_reads = (rec for rec in SeqIO.parse("hiseq_pe_test.fastq", "fastq") if array(rec.letter_annotations["phred_quality"]).mean() >= q_threshold)
    count = SeqIO.write(good_reads, "temp.fastq", "fastq")
    print "Saved %i reads" % count

    newly_filtered=[]
    for rec in SeqIO.parse("temp.fastq", "fastq"):
        s = modify_record(rec, win, len_threshold)
        newly_filtered.append(s)
        SeqIO.write(newly_filtered, "filtered_temp.fastq", "fastq")

This writes only the first sub_record even when there is more than one with a length >40bp. I've tried this as a generator expression and I'm still getting just the first sub_record. I'd also prefer not to use append, as it was previously suggested that this can lead to problems if you run the script more than once. Instead, I want to employ a generator expression - but I'm still getting used to the idea of generator expressions.

My second question is more general. Generator expressions are more memory efficient than a list comprehension, but how are they better than just a simple loop that pulls in a single record, does something and then writes that record? Is it just a time issue?

Many thanks for the help!
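
To make the goal above concrete, here is a minimal sketch of "one read in, zero or more fragments out", written without Biopython or numpy so it can be run on its own. Everything in it is an assumption made for illustration: reads are plain `(id, sequence, qualities)` tuples rather than SeqRecord objects, the names `high_quality_runs` and `split_reads` are hypothetical, and a simple per-base threshold stands in for the sliding-window logic in the code above.

```python
def high_quality_runs(qualities, q_threshold, min_len):
    """Yield (start, end) for each run of scores >= q_threshold of length >= min_len."""
    start = None
    for i, q in enumerate(qualities):
        if q >= q_threshold:
            if start is None:
                start = i          # a new high-quality run begins here
        else:
            if start is not None and i - start >= min_len:
                yield (start, i)   # run just ended and is long enough
            start = None
    # A run may extend to the very end of the read.
    if start is not None and len(qualities) - start >= min_len:
        yield (start, len(qualities))


def split_reads(reads, q_threshold, min_len):
    """Generator: each input read may produce zero, one, or several fragments."""
    for read_id, seq, quals in reads:
        runs = high_quality_runs(quals, q_threshold, min_len)
        for n, (start, end) in enumerate(runs, 1):
            # Give every fragment its own identifier.
            yield ("%s_frag%i" % (read_id, n), seq[start:end], quals[start:end])


# Toy data (made up): two low-quality bases split the read into pieces.
reads = [("read1", "ACGTACGTAC", [30, 30, 30, 2, 30, 30, 30, 30, 2, 30])]
fragments = list(split_reads(reads, q_threshold=20, min_len=3))
for frag in fragments:
    print(frag)
```

Because `split_reads` is a generator, only one read is held in memory at a time, which is the same property a plain loop that reads, processes, and writes one record has.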
From w.arindrarto at gmail.com Thu Jul 19 13:21:42 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Thu, 19 Jul 2012 19:21:42 +0200
Subject: [Biopython] slice a record in two and writing both records
In-Reply-To: 
References: 
Message-ID: 

Hi Dilara,

For your first question, it seems that the `modify_record` function always returns only one SeqRecord object. This is a bit of guesswork on my end, as I don't know how most of the functions in `modify_record` work, but since you still see an output sequence at the end, I think you may want to re-check how `sub_record` returns its values / how it returns more than one SeqRecord object.

Also, you might want to try changing the last two lines:

    newly_filtered.append(s)
    SeqIO.write(newly_filtered, "filtered_temp.fastq", "fastq")

Here, you're doing `SeqIO.write` for each iteration of the loop. Although the end result is the same (a file containing all the sequences you want), the code may be made more efficient by putting the SeqIO.write line outside of the loop, after all sequences are pooled in the `newly_filtered` list.

For your second question, I personally find generator expressions to be more compact and easier to read. This is important for future code maintenance ~ having more readable lines of code means it's easier to understand your code and to debug it in case something goes wrong.

Note that generator expressions aren't silver bullets. In some cases, for loops may still be better (e.g. if you're doing complex operations on the objects you're iterating over).

I found these two sites helpful when I first grappled with generators and generator expressions.
I hope they are the same to you too: * http://stackoverflow.com/questions/1995418/python-generator-expression-vs-yield * http://www.dabeaz.com/generators/Generators.pdf (PDF) Hope that helps :), Bow On Thu, Jul 19, 2012 at 5:51 PM, Dilara Ally wrote: > If I have a function (modify_record) that slices up a SeqRecord into sub > records and then returns the sliced record if it has a certain length (for > e.g. the sliced record needs to be greater than 40bp), sometimes the > original record when sliced will have two different records both greater > than 40bp. I want to keep both sliced reads and rewrite them as separate > records into a single fastq file. Here is my code: > > def modify_record(frec, win, len_threshold): > quality_scores = array(frec.letter_annotations["phred_quality"]) > all_window_qc = slidingWindow(quality_scores, win,1) > track_qc = windowQ(all_window_qc) > myzeros = boolean_array(track_qc, q_threshold,win) > Nrec = slice_points(myzeros,win)[0][1]-1 > where_to_slice = slice_points(myzeros,win)[1] > where_to_slice.append(len(frec)+win) > sub_record = slice_record(frec,Nrec,where_to_slice,win,len_threshold) > return sub_record > > q_threshold = 20 > win = 5 > len_threshold = 30 > > from Bio import SeqIO > from numpy import * > good_reads = (rec for rec in SeqIO.parse("hiseq_pe_test.fastq", "fastq") if > array(rec.letter_annotations["phred_quality"]).mean() >= q_threshold) > count = SeqIO.write(good_reads, "temp.fastq", "fastq") > print "Saved %i reads" % count > > newly_filtered=[] > for rec in SeqIO.parse("temp.fastq", "fastq"): > s = modify_record(rec, win, len_threshold) > newly_filtered.append(s) > SeqIO.write(newly_filtered, "filtered_temp.fastq", "fastq") > > This writes only the first sub_record even when there are more than 1 that > have a len >40bp. I've tried this as a generator expression and I'm still > getting just the first sub_record. 
I'd also prefer to not to use append > as it was previously suggested that this can lead to problems if you run > the script more than once. Instead, I want to employ a generator > expression - but I'm still getting used to the idea of generator > expressions. > > My second question is more general. Generator expressions are more memory > efficient than a list comprehension, but how are they better than just a > simple loop that pulls in a single record, does something and then writes > that record? Is it just a time issue? > > Many thanks for the help! > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From bioinformaticsing at gmail.com Thu Jul 19 23:41:20 2012 From: bioinformaticsing at gmail.com (ning luwen) Date: Fri, 20 Jul 2012 11:41:20 +0800 Subject: [Biopython] Fwd: Error while parsing bgk file In-Reply-To: References: Message-ID: ---------- Forwarded message ---------- From: Lenna Peterson Date: Thu, Jul 19, 2012 at 12:51 PM Subject: Re: [Biopython] Error while parsing bgk file To: ning luwen On Wed, Jul 18, 2012 at 11:36 PM, ning luwen wrote: > Hi everyone, > > A error encountered when i parse a gbk file. 
> > the error message as follow: > > Traceback (most recent call last): > File "stat_refseq_gbs.py", line 10, in > for seq in f: > File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", > line 537, in parse > for r in i: > File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 445, in parse_records > record = self.parse(handle, do_features) > File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 428, in parse > if self.feed(handle, consumer, do_features): > File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 400, in feed > self._feed_feature_table(consumer, self.parse_features(skip=False)) > File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 350, in _feed_feature_table > consumer.location(location_string) > File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/__init__.py", > line 970, in location > int(e), > ValueError: invalid literal for int() with base 10: '68452073^68452074' > > the file parsed is ref_GRCh37.p5, the biopython version is 1.60, the > lines cause the error may be: > > V_segment complement(68451760..68452073^68452074) > CDS complement(<68451760..68452072^68452073) > > -- > regards, > luwen ning > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython Hi Luwen, Thanks for reporting this problem. I've submitted a patch that should fix it. https://github.com/biopython/biopython/pull/54 Lenna -- regards, luwen ning From bioinformaticsing at gmail.com Thu Jul 19 23:56:51 2012 From: bioinformaticsing at gmail.com (ning luwen) Date: Fri, 20 Jul 2012 11:56:51 +0800 Subject: [Biopython] Error while parsing bgk file In-Reply-To: References: Message-ID: Hi Bow, Thank you for your reply, and a patch by lenna can solve the interruption of the parse. 
ps: these gbk file was recently downloaded from ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/ (with extension of gbs.gz), and the file contained "invalid GenBank annotation" is ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/CHR_02/hs_ref_GRCh37.p5_chr2.gbs.gz On Thu, Jul 19, 2012 at 4:50 PM, Wibowo Arindrarto wrote: > Hi Ning, > > Thanks for reporting the error. A similar issue has been reported in > the bug tracker here: https://redmine.open-bio.org/issues/3175 (it > also looks like it's the same coordinate). It seems that this could be > an invalid GenBank coordinate made by NCBI, though. > > From which chromosome is this coordinate coming from? Is it the latest draft? > > cheers, > Bow > > > On Thu, Jul 19, 2012 at 5:36 AM, ning luwen wrote: >> Hi everyone, >> >> A error encountered when i parse a gbk file. >> >> the error message as follow: >> >> Traceback (most recent call last): >> File "stat_refseq_gbs.py", line 10, in >> for seq in f: >> File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", >> line 537, in parse >> for r in i: >> File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", >> line 445, in parse_records >> record = self.parse(handle, do_features) >> File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", >> line 428, in parse >> if self.feed(handle, consumer, do_features): >> File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", >> line 400, in feed >> self._feed_feature_table(consumer, self.parse_features(skip=False)) >> File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", >> line 350, in _feed_feature_table >> consumer.location(location_string) >> File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/__init__.py", >> line 970, in location >> int(e), >> ValueError: invalid literal for int() with base 10: '68452073^68452074' >> >> the file parsed is ref_GRCh37.p5, the biopython version is 1.60, the >> 
lines cause the error may be: >> >> V_segment complement(68451760..68452073^68452074) >> CDS complement(<68451760..68452072^68452073) >> >> -- >> regards, >> luwen ning >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython -- regards, luwen ning From p.j.a.cock at googlemail.com Fri Jul 20 06:07:04 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 20 Jul 2012 11:07:04 +0100 Subject: [Biopython] slice a record in two and writing both records In-Reply-To: References: Message-ID: On Thu, Jul 19, 2012 at 4:51 PM, Dilara Ally wrote: > If I have a function (modify_record) that slices up a SeqRecord into sub > records and then returns the sliced record if it has a certain length (for > e.g. the sliced record needs to be greater than 40bp), sometimes the > original record when sliced will have two different records both greater > than 40bp. I want to keep both sliced reads and rewrite them as separate > records into a single fastq file. Here is my code: > > def modify_record(frec, win, len_threshold): > quality_scores = array(frec.letter_annotations["phred_quality"]) > all_window_qc = slidingWindow(quality_scores, win,1) > track_qc = windowQ(all_window_qc) > myzeros = boolean_array(track_qc, q_threshold,win) > Nrec = slice_points(myzeros,win)[0][1]-1 > where_to_slice = slice_points(myzeros,win)[1] > where_to_slice.append(len(frec)+win) > sub_record = slice_record(frec,Nrec,where_to_slice,win,len_threshold) > return sub_record > ... The key point is that for each input record you may want to produce several output records. A single function turning one input SeqRecord into one output SeqRecord won't work. I would suggest either, 1. Modify your function to return a list of SeqRecord objects, which could be zero, one (as now), or several - depending on the slice points. 
Then use itertools.chain.from_iterable to combine them, something like this:

    from itertools import chain
    # from_iterable flattens the per-read lists into one stream of records
    good_reads = chain.from_iterable(modify_record(r, win, len_threshold) for r in SeqIO.parse(...))
    count = SeqIO.write(good_reads, "filtered.fastq", "fastq")
    print "Saved %i read fragments" % count

2. Use a generator function to process the SeqRecord objects,

    def select_fragments(records, win, len_threshold):
        for record in records:
            where_to_slice = ...
            for slice_point in where_to_slice:
                yield record[slice_point]

    good_reads = select_fragments(SeqIO.parse(...), win, len_threshold)
    count = SeqIO.write(good_reads, "filtered.fastq", "fastq")
    print "Saved %i read fragments" % count

Both these approaches are generator/iteration based and will be memory efficient.

Note you may also want to alter the record identifiers so that different fragments from a single read get different IDs.

Peter

From p.j.a.cock at googlemail.com Fri Jul 20 06:29:33 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 20 Jul 2012 11:29:33 +0100
Subject: [Biopython] Error while parsing bgk file
In-Reply-To: 
References: 
Message-ID: 

On Fri, Jul 20, 2012 at 4:56 AM, ning luwen wrote:
> Hi Bow,
>
> Thank you for your reply, and a patch by lenna can solve the
> interruption of the parse.
> > ps: these gbk file was recently downloaded from > ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/ (with extension of > gbs.gz), and the file contained "invalid GenBank annotation" is > ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/CHR_02/hs_ref_GRCh37.p5_chr2.gbs.gz Note the original bug report referred to a slightly different part/revision of this chromosome, but it is the same issue reported earlier: https://redmine.open-bio.org/issues/3175 ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_02/hs_ref_GRCh37.p2_chr2.gbk.gz I have now committed Lenna's fix, which means this file now parses with a warning about the problem features (which get None as their location): https://github.com/biopython/biopython/commit/bc733da09051ca53ad4515ac2d971ff0839a71ba https://github.com/biopython/biopython/commit/4bf78f72682f0500e93c410f8108891dade88ff8 Ning, if you would like to test this fix the simplest way is to get the latest source code from github, and reinstall Biopython. You can either use the git tool at the command line, or the github URL for a tarball: https://github.com/biopython/biopython/tarball/master (Please ask if you need more guidance with this) Regards, Peter From igorrcosta at hotmail.com Sat Jul 21 17:44:40 2012 From: igorrcosta at hotmail.com (Igor Rodrigues da Costa) Date: Sat, 21 Jul 2012 21:44:40 +0000 Subject: [Biopython] Back translation support in Biopython Message-ID: Hi Peter, I would eliminate the problem of ID mapping (or at least pass it to the user) by using only the function that uses one sequence pair. The other option is to check if the codon and the amino acid are equivalent at run time, using a given genetic code. I did this in my program that back translated using only the aligned protein sequence and the Uniprot/GI accession numbers (I did the search using Bio.Entrez), but in my case the nucleotide dictionary was only some different ways the nucleotide sequence could be imported from NCBI, each of them returning a different sequence. 
I can't see any need for different gap characters between both alignments, and I feel there can be both a Bio.SeqIO (using a pair of sequences only) and a Bio.AlignIO (using multiple sequences, probably slower if checking at run time) versions of this function.

Att,
Igor

> Date: Mon, 2 Jul 2012 12:27:08 +0100
> Subject: Re: [Biopython] Back translation support in Biopython
> From: p.j.a.cock at googlemail.com
> To: igorrcosta at hotmail.com; eric.talevich at gmail.com
> CC: biopython at lists.open-bio.org
>
> On Wed, Apr 4, 2012 at 4:02 PM, Peter Cock wrote:
> > On Wed, Apr 4, 2012 at 2:49 AM, Eric Talevich wrote:
> >> Hi Igor,
> >>
> >> It sounds like you're referring to aligning amino acid sequences to codon
> >> sequences, as PAL2NAL does. This is different from what most people mean by
> >> back translation, but as you point out, certainly useful.
> >>
> >> If you write a function that can match a protein sequence alignment to a set
> >> of raw CDS sequences, returning a nucleotide alignment based on the
> >> codon-to-amino-acid mapping, that would be useful. However, PAL2NAL does
> >> exactly that, plus a bit more, and is a fairly well-known and easily
> >> obtained program. Personally, I would prefer to write a wrapper for PAL2NAL
> >> under Bio.Align.Applications, using the existing Bio.Applications framework.
> >
> > As per the old thread, a simple function in Python taking the gapped protein
> > sequence, original nucleotide coding sequence, and the translation table
> > does sound useful. Then using that, you could go from a protein alignment
> > plus the original nucleotide coding sequences to a codon alignment, or
> > other tasks. Given this is all relatively straightforward string manipulation
> > and we already have the required genetic code tables in Biopython, I'm not
> > convinced that wrapping PAL2NAL would be the best solution (for this sub
> > task).
>
> Hi Igor,
>
> Did you do any work on back-translation (alignment threading) in Biopython?
> > We needed to do this locally, and for some reason (yet to be determined) > T-COFFEE wasn't working on our dataset, so I made a start at a Biopython > implementation: > > https://github.com/peterjc/biopython/tree/back_trans > https://github.com/peterjc/biopython/commit/7d14cdb59bb9d41c727c923c8aa7e3dda7779c80 > > Currently just one commit adding a Bio.Align.alignment_back_translate(...) > function which takes a protein alignment and dictionary of nucleotide > records - easy to get with Bio.SeqIO and Bio.AlignIO - with a stand alone > example included in the doctest. There is also a new (currently private) > function to do this for one sequence pair - perhaps useful on its own? > > There are potential complications with ID mapping between the proteins > and nucleotides, thus the option of a key function, and the gap characters > (would you ever want to use different gap characters in the protein and > nucleotide alignments?). We could discuss implementation details over > on the biopython-dev list, but the general API discussion might as well > be here. e.g. Where to put the function and what to call it. > > Regards, > > Peter From p.j.a.cock at googlemail.com Sun Jul 22 08:51:12 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 22 Jul 2012 13:51:12 +0100 Subject: [Biopython] Back translation support in Biopython In-Reply-To: References: Message-ID: On Sat, Jul 21, 2012 at 10:44 PM, Igor Rodrigues da Costa wrote: > > > Hi Peter, > I would eliminate the problem of ID mapping (or at least > pass it to the user) by using only the function that uses > one sequence pair. Making the function for doing one sequence pair part of the public API seems sensible then. > The other option is to check if the codon and the amino > acid are equivalent at run time, using a given genetic > code. 
I did this in my program that back translated
> using only the aligned protein sequence and the
> Uniprot/GI accession numbers (I did the search using
> Bio.Entrez), but in my case the nucleotide dictionary
> was only some different ways the nucleotide sequence
> could be imported from NCBI, each of them returning
> a different sequence.

Certainly optionally checking the translation seems wise. There are potential complications with things like ambiguous bases, but in general this is useful.

> I can't see any need for different gap characters
> between both alignments, and I feel there can be both
> a Bio.SeqIO (using a pair of sequences only) and a
> Bio.AlignIO (using multiple sequences, probably slower
> if checking at run time) versions of this function.

I agree that an alignment based function, and a single sequence based function make sense - but probably under Bio.Align rather than Bio.SeqIO and Bio.AlignIO which are specifically for input/output functionality.

Thanks for your thoughts,

Peter

From dilara.ally at gmail.com Mon Jul 23 17:48:30 2012
From: dilara.ally at gmail.com (Dilara Ally)
Date: Mon, 23 Jul 2012 14:48:30 -0700
Subject: [Biopython] slice a record in two and writing both records
In-Reply-To: 
References: 
Message-ID: <9085DA29-9159-44EE-BED7-56E3306B8EA3@gmail.com>

Thanks. Itertools is a fantastic module!

Dilara

On Jul 20, 2012, at 3:07 AM, Peter Cock wrote:

> On Thu, Jul 19, 2012 at 4:51 PM, Dilara Ally wrote:
>> If I have a function (modify_record) that slices up a SeqRecord into sub
>> records and then returns the sliced record if it has a certain length (for
>> e.g. the sliced record needs to be greater than 40bp), sometimes the
>> original record when sliced will have two different records both greater
>> than 40bp. I want to keep both sliced reads and rewrite them as separate
>> records into a single fastq file.
Here is my code: >> >> def modify_record(frec, win, len_threshold): >> quality_scores = array(frec.letter_annotations["phred_quality"]) >> all_window_qc = slidingWindow(quality_scores, win,1) >> track_qc = windowQ(all_window_qc) >> myzeros = boolean_array(track_qc, q_threshold,win) >> Nrec = slice_points(myzeros,win)[0][1]-1 >> where_to_slice = slice_points(myzeros,win)[1] >> where_to_slice.append(len(frec)+win) >> sub_record = slice_record(frec,Nrec,where_to_slice,win,len_threshold) >> return sub_record >> ... > > The key point is that for each input record you may want to > produce several output records. A single function turning > one input SeqRecord into one output SeqRecord won't work. > I would suggest either, > > 1. Modify your function to return a list of SeqRecord objects, > which could be zero, one (as now), or several - depending on > the slice points. Then use itertools.chain to combine them, > something like this: > > from itertools import chain > good_reads = chain(modify_record(r) for r in SeqIO.parse(...)) > count = SeqIO.write(good_reads, "filtered.fastq", "fastq") > print "Saved %i read fragments" % count > > 2. Use a generator function to process the SeqRecord objects, > > def select_fragments(records, win, len_threshold): > for record in records: > where_to_slice = ... > for slice_point in where_to_slice: > yield record[slice_point] > > good_reads = select_fragments(SeqIO.parse(...)) > count = SeqIO.write(good_reads, "filtered.fastq", "fastq") > print "Saved %i read fragments" % count > > Both these approaches are generator/iteration based and will > be memory efficient. > > Note you may also want to alter the record identifiers so that > different fragments from a single read get different IDs. 
> > Peter From llewelr at gmail.com Mon Jul 23 22:24:06 2012 From: llewelr at gmail.com (Richard Llewellyn) Date: Mon, 23 Jul 2012 20:24:06 -0600 Subject: [Biopython] Entrez.read(handle) for Bio._py3k.EvilHandleHack throws error with python 3.2 Message-ID: With python 3.2 and biopython 1.60 after getting a handle using Entrez.esummary (and esearch, others?) I get a TypeError: >>> from Bio import Entrez >>> Entrez.email = "Your.Name.Here at example.org" >>> handle = Entrez.esummary(db="journals", id="30367") >>> record = Entrez.read(handle) Traceback (most recent call last): File "", line 1, in File "/usr/local/lib/python3.2/site-packages/Bio/Entrez/__init__.py", line 351, in read record = handler.read(handle) File "/usr/local/lib/python3.2/site-packages/Bio/Entrez/Parser.py", line 169, in read self.parser.ParseFile(handle) TypeError: read() did not return a bytes object (type=str) >>> handle Ah, it is evil! I realize py3k not yet officially supported. Thanks for the great work. From llewelr at gmail.com Mon Jul 23 23:20:00 2012 From: llewelr at gmail.com (Richard Llewellyn) Date: Mon, 23 Jul 2012 21:20:00 -0600 Subject: [Biopython] Entrez.read(handle) for Bio._py3k.EvilHandleHack throws error with python 3.2 In-Reply-To: References: Message-ID: Follow up for Entrez.read error on EvilHandleHack object: (this is python 3.2.3) If I change last line of Entrez.__init__.py _open function from return _binary_to_string_handle(handle) to return handle this error does not occur in example given below. On Mon, Jul 23, 2012 at 8:24 PM, Richard Llewellyn wrote: > With python 3.2 and biopython 1.60 after getting a handle using > Entrez.esummary (and esearch, others?) 
I get a TypeError:
>
>
>>>> from Bio import Entrez
>>>> Entrez.email = "Your.Name.Here at example.org"
>>>> handle = Entrez.esummary(db="journals", id="30367")
>>>> record = Entrez.read(handle)
>
> Traceback (most recent call last):
>   File "", line 1, in
>   File "/usr/local/lib/python3.2/site-packages/Bio/Entrez/__init__.py",
> line 351, in read
>     record = handler.read(handle)
>   File "/usr/local/lib/python3.2/site-packages/Bio/Entrez/Parser.py",
> line 169, in read
>     self.parser.ParseFile(handle)
> TypeError: read() did not return a bytes object (type=str)
>
>>>> handle
>
>
> Ah, it is evil!
>
> I realize py3k not yet officially supported.
>
> Thanks for the great work.

From markd at soe.ucsc.edu Tue Jul 24 02:47:51 2012
From: markd at soe.ucsc.edu (Mark Diekhans)
Date: Mon, 23 Jul 2012 23:47:51 -0700
Subject: [Biopython] accessing PDB IDcode when using PDBParser
Message-ID: <20494.17687.678370.458937@hgwdev.cse.ucsc.edu>

How does one access the idCode in the PDB HEADER when using the PDBParser? I can't find this in the documentation or the code.

Also, what is the function of the `id' argument for PDBParser.get_structure? The documentation is just self-referential:

    o id - string, the id that will be used for the structure

Seems no obvious way via MMCIFParser either.

Thanks!

From anaryin at gmail.com Tue Jul 24 04:37:44 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 24 Jul 2012 10:37:44 +0200
Subject: [Biopython] accessing PDB IDcode when using PDBParser
In-Reply-To: <20494.17687.678370.458937@hgwdev.cse.ucsc.edu>
References: <20494.17687.678370.458937@hgwdev.cse.ucsc.edu>
Message-ID: 

Hey Mark,

Indeed there is no specific ID extraction from the HEADER. However, it comes as part of the "head" key in the header dictionary. If you split by whitespace and get the last field, you get the PDB ID. Example:

    HEADER HYDROLASE(ASPARTYL PROTEINASE) 17-OCT-89 2RSP

The id argument of get_structure is simply the label you pass in as the first argument; it is used as the id of the returned structure.
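
Going back to the HEADER line, the idea of taking the last field can be sketched on the raw record itself, independent of what any parser stores in its header dictionary. This is only an illustration: the helper name `pdb_id_from_header` is made up, and the column positions assume the fixed-width HEADER layout of the PDB file format (idCode in columns 63-66), with whitespace splitting as a fallback for lines whose spacing has been collapsed.

```python
def pdb_id_from_header(line):
    """Return the 4-character ID code from a PDB HEADER line, or None."""
    if not line.startswith("HEADER"):
        return None
    # Fixed-width layout: columns 63-66 (1-based) hold the idCode.
    idcode = line[62:66].strip()
    if len(idcode) == 4:
        return idcode
    # Fallback for whitespace-collapsed lines: take the last field.
    fields = line.split()
    return fields[-1] if len(fields) > 1 else None


# Build a properly spaced HEADER record for the example quoted above:
# "HEADER" + 4 blanks, 40-column classification, 9-column date, 3 blanks, ID.
fixed = "HEADER" + " " * 4 + "HYDROLASE(ASPARTYL PROTEINASE)".ljust(40) + "17-OCT-89" + " " * 3 + "2RSP"
collapsed = "HEADER HYDROLASE(ASPARTYL PROTEINASE) 17-OCT-89 2RSP"

print(pdb_id_from_header(fixed))
print(pdb_id_from_header(collapsed))
```

In practice the first line of the PDB file could be read and passed to such a helper before or after calling the parser.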
Cheers,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao

2012/7/24 Mark Diekhans

>
> How does one access the idCode in the PDB HEADER when using the PDBParser?
> I can't find this in the documentation or the code.
>
> Also, what is function of the `id' argument for PDBParser.get_structure:
> The documentation is just self-referential:
> o id - string, the id that will be used for the structure
>
> Seems no obvious way via MMCIFParser either.
>
> Thanks!
>
>
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From p.j.a.cock at googlemail.com Tue Jul 24 05:41:35 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 24 Jul 2012 10:41:35 +0100
Subject: [Biopython] Entrez.read(handle) for Bio._py3k.EvilHandleHack throws error with python 3.2
In-Reply-To: 
References: 
Message-ID: 

Hi Richard,

It's great to have some feedback on Python 3 support :)

On Tue, Jul 24, 2012 at 4:20 AM, Richard Llewellyn wrote:
> Follow up for Entrez.read error on EvilHandleHack object:
>
> (this is python 3.2.3)
>
> If I change last line of Entrez.__init__.py _open function from
>
> return _binary_to_string_handle(handle)
> to
> return handle
>
> this error does not occur in example given below.

Hmm. That call to _binary_to_string_handle converts from the bytes (binary) network handle to a string (unicode) handle which is required for most of the parsers in Biopython under Python 3 (e.g. FASTA, Genbank).

Surprisingly the Entrez parser seems to be wanting a binary handle? That seems curious... I presume that means we don't have this particular case covered in the unit tests :(

How familiar are you with the Python 3 split of bytes vs strings (unicode), and binary versus text handles?
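
As a rough illustration of the binary versus text handle distinction, here is a standard-library sketch (not Biopython's own code). It assumes the Entrez parser ultimately hands the handle to expat's `ParseFile`, as the traceback above suggests; the XML snippet and the helper `parse_ids` are made up for the example.

```python
import io
import xml.parsers.expat

# Made-up XML payload standing in for an Entrez reply.
XML_BYTES = b"<?xml version='1.0' encoding='UTF-8'?><eSummaryResult><Id>30367</Id></eSummaryResult>"


def parse_ids(handle):
    """Return the text of every <Id> element read from handle via ParseFile."""
    ids = []
    buffer = []        # text chunks of the <Id> element currently being read
    inside = [False]   # mutable flag shared by the callbacks

    def start(name, attrs):
        if name == "Id":
            inside[0] = True
            del buffer[:]

    def end(name):
        if name == "Id":
            ids.append("".join(buffer))
            inside[0] = False

    def chars(data):
        if inside[0]:
            buffer.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.ParseFile(handle)   # calls handle.read() and expects bytes
    return ids


# A binary handle works: expat decodes the bytes itself.
print(parse_ids(io.BytesIO(XML_BYTES)))

# The same data behind a text handle is rejected under Python 3,
# because read() now returns str instead of bytes.
text_handle = io.TextIOWrapper(io.BytesIO(XML_BYTES), encoding="utf-8")
try:
    parse_ids(text_handle)
    text_handle_failed = False
except TypeError as err:
    text_handle_failed = True
    print("text handle rejected:", err)
```

So a line-oriented text parser and an expat-based XML parser want opposite kinds of handle under Python 3, which would be consistent with the error reported here.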
Peter From bjorn_johansson at bio.uminho.pt Fri Jul 27 04:03:41 2012 From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=) Date: Fri, 27 Jul 2012 09:03:41 +0100 Subject: [Biopython] Restriction cutting SeqRecord objects Message-ID: Hi, Restriction with Bio.Restriction only works for seq or mutable seq objects? I would like to digest SeqRecord objects and still keep the relevant features of the sequences. Did anyone perhaps implement something like this? One way would be to subclass the restriction enzymes, but they are created dynamically so I am not sure if this is a good idea. btw is the biopython site down? thanks, bjorn -- ______O_________oO________oO______o_______oO__ Björn Johansson Assistant Professor Departament of Biology University of Minho Campus de Gualtar 4710-057 Braga PORTUGAL www.bio.uminho.pt Google profile metabolicengineeringgroup Work (direct) +351-253 601517 | mob. +351-967 147 704 | mob. (SWE) 0739 792 968 Dept of Biology (secr) +351-253 60 4310 | fax +351-253 678980 From dilara.ally at gmail.com Thu Jul 26 13:48:44 2012 From: dilara.ally at gmail.com (Dilara Ally) Date: Thu, 26 Jul 2012 10:48:44 -0700 Subject: [Biopython] matching headers and then writing the seq record Message-ID: Hi Everyone, I'm interested in finding headers that match (in other words paired reads) in two different fastq files. Once the common headers are found, I then go back to the original fastq file and write those matched reads to a different fastq file. Right now, the part of the code that runs really slow is the headers_read1 and headers_read2 lines. And I was wondering if there was a more elegant way and time efficient manner than what I have done. It seems as if set undoes the elegance of using a generator. Any advice is greatly appreciated!
Here is the code:

def get_header(seq_record):
    fields = seq_record.id.split(':')
    lastfield = fields[6].split('_')[0]
    return lastfield

def get_full_header(seq_record):
    fields = seq_record.id.split(':')
    headerInfo2 = fields[6].split('_')[0]
    headerInfo = str(fields[0]) + ":" + str(fields[1]) + ":" + str(fields[2]) + ":" + str(fields[3]) + ":" + str(fields[4]) + ":" + str(fields[5]) + ":" + str(headerInfo2)
    return headerInfo

def replace_header(seq_record,pairType):
    if pairType == 1:
        ending = "/1"
    elif pairType == 2:
        ending = "/2"
    seq_record.id=seq_record.id+ending
    seq_record.name = ""
    seq_record.description = ""
    return seq_record

def matched_records(records, pairType, header_matches):
    for rec in records:
        id = get_header(rec)
        result = id in header_matches
        #print result
        if (result == True):
            newrec = replace_header(rec,pairType)
            yield newrec

import sys
from Bio import SeqIO

headers_read1 = set(get_header(seq_record) for seq_record in SeqIO.parse(sys.argv[1], "fastq"))
headers_read2 = set(get_header(seq_record) for seq_record in SeqIO.parse(sys.argv[2], "fastq"))
header_matches = [x for x in headers_read1 if x in headers_read2]

records = SeqIO.parse(sys.argv[1], "fastq")
pairType = 1
count = SeqIO.write(matched_records(records,pairType,header_matches), sys.argv[3], "fastq")
print "Saved %i matched reads." %count

records = SeqIO.parse(sys.argv[2], "fastq")
pairType = 2
count = SeqIO.write(matched_records(records,pairType,header_matches), sys.argv[4], "fastq")
print "Saved %i matched reads." %count

From cartealy at yahoo.co.id Fri Jul 27 02:30:58 2012 From: cartealy at yahoo.co.id (Imam Cartealy) Date: Fri, 27 Jul 2012 14:30:58 +0800 (SGT) Subject: [Biopython] Is biopython.org down ? Message-ID: <1343370658.77420.YahooMailNeo@web190503.mail.sg3.yahoo.com> Hi everyone, I am having trouble accessing biopython.org for the last 2 days. Is biopython.org down ? Cheers ic
Imam Cartealy Center for Biotechnology - BPPT Indonesia From idoerg at gmail.com Sat Jul 28 16:19:01 2012 From: idoerg at gmail.com (Iddo Friedberg) Date: Sat, 28 Jul 2012 16:19:01 -0400 Subject: [Biopython] Is biopython.org down ? In-Reply-To: <1343370658.77420.YahooMailNeo@web190503.mail.sg3.yahoo.com> References: <1343370658.77420.YahooMailNeo@web190503.mail.sg3.yahoo.com> Message-ID: Has been down for the past couple of days, but it is up now. On Fri, Jul 27, 2012 at 2:30 AM, Imam Cartealy wrote: > Hi everyone, > > I am having trouble accessing biopython.org for the last 2 days. Is > biopython.org down ? > > Cheers > > ic > > > Imam Cartealy > Center for Biotechnology - BPPT > Indonesia > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From p.j.a.cock at googlemail.com Sat Jul 28 16:48:32 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 28 Jul 2012 21:48:32 +0100 Subject: [Biopython] matching headers and then writing the seq record In-Reply-To: References: Message-ID: On Thu, Jul 26, 2012 at 6:48 PM, Dilara Ally wrote: > ... It seems as if set undoes the elegance of using a generator. > Any advice is greatly appreciated! ... > > headers_read1 = set(...) > headers_read2 = set(...) > header_matches = [x for x in headers_read1 if x in headers_read2] I would expect that using the built in set's intersection operation would be faster than this list comprehension solution to create header_matches. 
Also, you should use a set not a list for header_matches because testing membership with a set is much faster than a list. i.e. Try: header_matches = headers_read1.intersection(headers_read2) This might be a tiny change, but I expect it to be noticeably faster. Also, here: > def matched_records(records, pairType, header_matches): > for rec in records: > id = get_header(rec) > result = id in header_matches > if (result == True): > newrec = replace_header(rec,pairType) > yield newrec If you don't mind my style comments, you don't really need to create the variables 'id' and 'result', and 'newrec' - I would just do: def matched_records(records, pairType, header_matches): for rec in records: if get_header(rec) in header_matches: yield replace_header(rec,pairType) And at that point you could write the whole thing as a generator expression, which you may or may not find more pleasing (I'm not sure if it makes any significant difference to the speed). i.e. records = SeqIO.parse(sys.argv[1], "fastq") pairType = 1 wanted = (replace_header(rec,pairType) \ for rec in records \ if get_header(rec) in header_matches) count = SeqIO.write(wanted, sys.argv[3], "fastq") I hope that helps, Peter From aclark at aclark.net Sat Jul 28 19:45:10 2012 From: aclark at aclark.net (Alex Clark) Date: Sat, 28 Jul 2012 19:45:10 -0400 Subject: [Biopython] ANN: pythonpackages.com beta Message-ID: Hi biological computation folks, I am reaching out to various Python-related programming communities in order to offer new help packaging your software. If you have ever struggled with packaging and releasing Python software (e.g. to PyPI), please check out this service: - http://pythonpackages.com The basic idea is to automate packaging by checking out code, testing, and uploading (e.g. 
to PyPI) all through the web, as explained in this introduction: - http://docs.pythonpackages.com/en/latest/introduction.html Also, I will be available to answer your Python packaging questions most days/nights in #pythonpackages on irc.freenode.net. Hope to meet/talk with all of you soon. Alex -- Alex Clark ? http://pythonpackages.com/ONE_CLICK From dilara.ally at gmail.com Tue Jul 31 14:53:27 2012 From: dilara.ally at gmail.com (Dilara Ally) Date: Tue, 31 Jul 2012 11:53:27 -0700 Subject: [Biopython] matching headers and then writing the seq record In-Reply-To: References: Message-ID: Thanks Peter it sped it up considerably! I appreciate the fast replies on this listserv. On Jul 28, 2012, at 1:48 PM, Peter Cock wrote: > On Thu, Jul 26, 2012 at 6:48 PM, Dilara Ally wrote: >> ... It seems as if set undoes the elegance of using a generator. >> Any advice is greatly appreciated! ... >> >> headers_read1 = set(...) >> headers_read2 = set(...) >> header_matches = [x for x in headers_read1 if x in headers_read2] > > I would expect that using the built in set's intersection operation would > be faster than this list comprehension solution to create header_matches. > > Also, you should use a set not a list for header_matches because testing > membership with a set is much faster than a list. i.e. Try: > > header_matches = headers_read1.intersection(headers_read2) > > This might be a tiny change, but I expect it to be noticeably faster. 
> > Also, here: > >> def matched_records(records, pairType, header_matches): >> for rec in records: >> id = get_header(rec) >> result = id in header_matches >> if (result == True): >> newrec = replace_header(rec,pairType) >> yield newrec > > If you don't mind my style comments, you don't really need > to create the variables 'id' and 'result', and 'newrec' - I would > just do: > > def matched_records(records, pairType, header_matches): > for rec in records: > if get_header(rec) in header_matches: > yield replace_header(rec,pairType) > > And at that point you could write the whole thing as a > generator expression, which you may or may not find > more pleasing (I'm not sure if it makes any significant > difference to the speed). i.e. > > records = SeqIO.parse(sys.argv[1], "fastq") > pairType = 1 > wanted = (replace_header(rec,pairType) \ > for rec in records \ > if get_header(rec) in header_matches) > count = SeqIO.write(wanted, sys.argv[3], "fastq") > > I hope that helps, > > Peter From devaniranjan at gmail.com Tue Jul 31 15:24:34 2012 From: devaniranjan at gmail.com (George Devaniranjan) Date: Tue, 31 Jul 2012 15:24:34 -0400 Subject: [Biopython] Mocapy Message-ID: I was wondering if Mocapy is part of Biopython. I thought it was but I cannot find it in my biopython PDB folder. Thank you, George From eric.talevich at gmail.com Tue Jul 31 17:55:21 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 31 Jul 2012 17:55:21 -0400 Subject: [Biopython] Mocapy In-Reply-To: References: Message-ID: On Tue, Jul 31, 2012 at 3:24 PM, George Devaniranjan wrote: > I was wondering if Mocapy is part of Biopython. > > I thought it was but I cannot find it in my biopython PDB folder. 
> > Hi George, No, Mocapy++ is a separate project: http://sourceforge.net/projects/mocapy/ There is a branch to add some integration with Mocapy++ to Biopython, but we're waiting for the next stable release of Mocapy++ before merging it: https://github.com/mchelem/biopython -Eric From p.j.a.cock at googlemail.com Mon Jul 2 11:27:08 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 2 Jul 2012 12:27:08 +0100 Subject: [Biopython] Back translation support in Biopython In-Reply-To: References: Message-ID: On Wed, Apr 4, 2012 at 4:02 PM, Peter Cock wrote: > On Wed, Apr 4, 2012 at 2:49 AM, Eric Talevich wrote: >> Hi Igor, >> >> It sounds like you're referring to aligning amino acid sequences to codon >> sequences, as PAL2NAL does. This is different from what most people mean by >> back translation, but as you point out, certainly useful. >> >> If you write a function that can match a protein sequence alignment to a set >> of raw CDS sequences, returning a nucleotide alignment based on the >> codon-to-amino-acid mapping, that would be useful. However, PAL2NAL does >> exactly that, plus a bit more, and is a fairly well-known and easily >> obtained program. Personally, I would prefer to write a wrapper for PAL2NAL >> under Bio.Align.Applications, using the existing Bio.Applications framework. > > As per the old thread, a simple function in Python taking the gapped protein > sequence, original nucleotide coding sequence, and the translation table > does sound useful. Then using that, you could go from a protein alignment > plus the original nucleotide coding sequences to a codon alignment, or > other tasks. Given this is all relatively straightforward string manipulation > and we already have the required genetic code tables in Biopython, I'm not > convinced that wrapping PAL2NAL would be the best solution (for this sub > task). Hi Igor, Did you do any work on back-translation (alignment threading) in Biopython? 
We needed to do this locally, and for some reason (yet to be determined) T-COFFEE wasn't working on our dataset, so I made a start at a Biopython implementation: https://github.com/peterjc/biopython/tree/back_trans https://github.com/peterjc/biopython/commit/7d14cdb59bb9d41c727c923c8aa7e3dda7779c80 Currently just one commit adding a Bio.Align.alignment_back_translate(...) function which takes a protein alignment and dictionary of nucleotide records - easy to get with Bio.SeqIO and Bio.AlignIO - with a stand alone example included in the doctest. There is also a new (currently private) function to do this for one sequence pair - perhaps useful on its own? There are potential complications with ID mapping between the proteins and nucleotides, thus the option of a key function, and the gap characters (would you ever want to use different gap characters in the protein and nucleotide alignments?). We could discuss implementation details over on the biopython-dev list, but the general API discussion might as well be here. e.g. Where to put the function and what to call it. Regards, Peter From from.d.putto at gmail.com Mon Jul 2 12:21:46 2012 From: from.d.putto at gmail.com (Sheila the angel) Date: Mon, 2 Jul 2012 14:21:46 +0200 Subject: [Biopython] searching homologene database Message-ID: To search tp53 homolog in homologene database - handle = Entrez.esearch(db="homologene", term="tp53[gene name] AND Homo sapiens[orgn]") record = Entrez.read(handle) handle = Entrez.efetch(db="homologene", id=record['IdList']) record = handle.read() print record I think record is asn.1 format !! 
how can I read or convert it in the genes protein table (as we see in the web result) http://www.ncbi.nlm.nih.gov/homologene/460 Thanks -- Sheila From w.arindrarto at gmail.com Mon Jul 2 12:39:31 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Mon, 2 Jul 2012 14:39:31 +0200 Subject: [Biopython] searching homologene database In-Reply-To: References: Message-ID: Hi Sheila, You can set the 'retmode' parameter in order to specify your preferred format. I'm not sure if NCBI provides an output format exactly like the one you see on their site, but instead of ASN.1 you can specify a more common format like XML. In your case, the call would be this (for XML, let's say): handle = Entrez.efetch(db="homologene", id=record['IdList'], retmode="xml") For a list of possible retmode values, you can look them up here: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch (see the explanation about 'retmode'). If you want to format the output further, you can use modules like the built-in elementtree or 3rd party modules like lxml to extract the tag values and feed them to your script / program. Hope that helps, Bow On Mon, Jul 2, 2012 at 2:21 PM, Sheila the angel wrote: > To search tp53 homolog in homologene database - > > handle = Entrez.esearch(db="homologene", term="tp53[gene name] AND Homo > sapiens[orgn]") > record = Entrez.read(handle) > handle = Entrez.efetch(db="homologene", id=record['IdList']) > record = handle.read() > print record > > I think record is asn.1 format !! 
how can I read or convert it in the genes > protein table (as we see in the web result) > http://www.ncbi.nlm.nih.gov/homologene/460 > > Thanks > > -- > Sheila > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From from.d.putto at gmail.com Tue Jul 10 13:16:25 2012 From: from.d.putto at gmail.com (Sheila the angel) Date: Tue, 10 Jul 2012 15:16:25 +0200 Subject: [Biopython] access Uniprot record by different ids Message-ID: I have a Uniprot AC list in which some AC are primary and some are secondary. The function my_dict = SeqIO.index("uniprot_sprot.dat", "swiss") makes dictionary of uniprot data but I can access a record only by primary AC. my_dict['P04637'] # gives the record my_dict['Q15086'] # KeyError my_dict['P53_HUMAN'] # KeyError Is it possible to access same record by both primary and secondary ACs (and by uniprot ID) ? From p.j.a.cock at googlemail.com Tue Jul 10 13:43:31 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 10 Jul 2012 14:43:31 +0100 Subject: [Biopython] access Uniprot record by different ids In-Reply-To: References: Message-ID: On Tue, Jul 10, 2012 at 2:16 PM, Sheila the angel wrote: > I have a Uniprot AC list in which some AC are primary and some are > secondary. The function > my_dict = SeqIO.index("uniprot_sprot.dat", "swiss") > makes dictionary of uniprot data but I can access a record only by primary > AC. > my_dict['P04637'] # gives the record > my_dict['Q15086'] # KeyError > my_dict['P53_HUMAN'] # KeyError > > Is it possible to access same record by both primary and secondary ACs > (and by uniprot ID) ? Not directly with Bio.SeqIO.index() or Bio.SeqIO.index_db(), no. You could perhaps use a second dictionary mapping aliases to the primary ID? 
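A minimal sketch of the second-dictionary idea, using plain dicts only. The index here is a stand-in for whatever SeqIO.index() returns, the helper names are hypothetical, and the alias list simply reuses the accessions mentioned in the question (it is assumed, not checked, that they all refer to the same entry):

```python
def build_alias_map(aliases_by_primary):
    """Map every alias (and each primary ID itself) to its primary ID."""
    alias_to_primary = {}
    for primary, aliases in aliases_by_primary.items():
        alias_to_primary[primary] = primary
        for alias in aliases:
            # Keep the first mapping seen if an alias turns up twice.
            alias_to_primary.setdefault(alias, primary)
    return alias_to_primary


def lookup(index, alias_to_primary, key):
    """Fetch a record from the index via a primary ID or any known alias."""
    return index[alias_to_primary[key]]


# Stand-in for the real index, which would be keyed on the primary accession.
index = {"P04637": "record for P04637"}
aliases = build_alias_map({"P04637": ["Q15086", "P53_HUMAN"]})

print(lookup(index, aliases, "Q15086"))     # record for P04637
print(lookup(index, aliases, "P53_HUMAN"))  # record for P04637
```

The alias lists themselves would still have to be collected from the records in a first pass over the file, which is the slow part.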
Peter From n.j.loman at bham.ac.uk Wed Jul 11 15:02:00 2012 From: n.j.loman at bham.ac.uk (Nick Loman) Date: Wed, 11 Jul 2012 16:02:00 +0100 Subject: [Biopython] SeqRecord substring should return SeqRecord or character? Message-ID: Hi there I wanted to add the last character of a SeqRecord s1 to another SeqRecord s2. However s1[-1] + s2 fails because s1[-1] returns a string rather than a SeqRecord just containing a single base and associated annotations. I have to do s1[-1:] to get a sliced SeqRecord. Is this behaviour intentional? I kind of assumed I would always get a SeqRecord from any given slice, and it's seems weird to get just a string back instead, although no doubt there's a good reason for this. Cheers Nick From p.j.a.cock at googlemail.com Wed Jul 11 15:21:00 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 11 Jul 2012 16:21:00 +0100 Subject: [Biopython] SeqRecord substring should return SeqRecord or character? In-Reply-To: References: Message-ID: On Wed, Jul 11, 2012 at 4:02 PM, Nick Loman wrote: > Hi there > > I wanted to add the last character of a SeqRecord s1 to another > SeqRecord s2. However s1[-1] + s2 fails because s1[-1] returns a > string rather than a SeqRecord just containing a single base and > associated annotations. I have to do s1[-1:] to get a sliced > SeqRecord. You should be able to do SeqRecord+string, and string+SeqRecord, both of which are specifically tested in the docstring. Have you got any more details? e.g. Version? Mini-example? > Is this behaviour intentional? I kind of assumed I would always get a > SeqRecord from any given slice, and it's seems weird to get just a > string back instead, although no doubt there's a good reason for this. For a single base/residue, the whole SeqRecord overhead does seem unnecessary. As to why you get a single letter string, not a single letter Seq, IIRC it was mimicking the Seq object. 
Peter From p.j.a.cock at googlemail.com Wed Jul 11 15:52:41 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 11 Jul 2012 16:52:41 +0100 Subject: [Biopython] SeqRecord substring should return SeqRecord or character? In-Reply-To: References: <620A45B10433AE4C81D3F931A02812F93BC80453CF@LESMBX1.adf.bham.ac.uk> Message-ID: On Wed, Jul 11, 2012 at 4:24 PM, Nick Loman wrote: > On Wed, Jul 11, 2012 at 4:21 PM, Peter Cock wrote: >> On Wed, Jul 11, 2012 at 4:02 PM, Nick Loman wrote: >>> Hi there >>> >>> I wanted to add the last character of a SeqRecord s1 to another >>> SeqRecord s2. However s1[-1] + s2 fails because s1[-1] returns a >>> string rather than a SeqRecord just containing a single base and >>> associated annotations. I have to do s1[-1:] to get a sliced >>> SeqRecord. >> >> You should be able to do SeqRecord+string, and string+SeqRecord, >> both of which are specifically tested in the docstring. Have you got >> any more details? e.g. Version? Mini-example? > > Hi Peter, > > It was doing this on a FASTQ record so it's the missing quality > annotation that cause the problem when trying to do this. Ah - so the addition should have worked, but you'd lose the partial quality string. You're stuck with ensuring you have two SeqRecords, so as you suggested rather than s1[-1]+s2 please use s1[-1:]+s2 instead. Slightly less clear, but only character more. This actually reminds me of similar behaviour with the bytes string in Python 3, where the same trick is required to get a single letter bytes string. >>> Is this behaviour intentional? I kind of assumed I would always get a >>> SeqRecord from any given slice, and it's seems weird to get just a >>> string back instead, although no doubt there's a good reason for this. >> >> For a single base/residue, the whole SeqRecord overhead does >> seem unnecessary. As to why you get a single letter string, not >> a single letter Seq, IIRC it was mimicking the Seq object. 
> > Yes, I guessed the overhead was likely to be the reason .. not sure > if there's a satisfactory solution? Returning a single letter SeqRecord have might been a better choice, and going back much further in Biopython's history the Seq object should probably have returned a single letter Seq (not a single letter string). There is a similar issue with the columns of an alignment. Peter From wheatontrue at gmail.com Thu Jul 12 08:53:26 2012 From: wheatontrue at gmail.com (Wheaton Little) Date: Thu, 12 Jul 2012 16:53:26 +0800 Subject: [Biopython] can I use the xml parser in biopython on other xml files? how? Message-ID: I would like to use the Biopython xml parser, if possible, on google patent xmls: http://www.google.com/googlebooks/uspto-patents-applications-text.html unfortunately, this is what I get: >>> t=open('ipa111229.xml','r').read() >>> import Bio >>> ttt=Bio.Entrez.read(t[:30000]) Traceback (most recent call last): File "", line 1, in ttt=Bio.Entrez.read(t[:30000]) File "/Library/Python/2.7/site-packages/Bio/Entrez/__init__.py", line 351, in read record = handler.read(handle) File "/Library/Python/2.7/site-packages/Bio/Entrez/Parser.py", line 169, in read self.parser.ParseFile(handle) TypeError: argument must have 'read' attribute What would I have to do to use the parser on this xml? From b.invergo at gmail.com Thu Jul 12 09:25:27 2012 From: b.invergo at gmail.com (Brandon Invergo) Date: Thu, 12 Jul 2012 11:25:27 +0200 Subject: [Biopython] can I use the xml parser in biopython on other xml files? how? In-Reply-To: References: Message-ID: <1342085127.614.10.camel@localhost.localdomain> With regards to the error that you receive, it's because you're trying to `read()` a list, when that method requires a file-like object. This would fix that: >>> ttt=Bio.Entrez.read(open('ipa111229.xml', 'r')) However, that wouldn't work because it requires a DTD from NCBI to read the file. 
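To illustrate both points with the standard library only: a string has no read() method, while a file-like wrapper does, and a generic XML parser such as ElementTree does not need an NCBI DTD. The XML snippet and its tag names below are invented for illustration and do not reflect the real patent file layout:

```python
import io
import xml.etree.ElementTree as ET

# Invented sample document; the real patent XML will look different.
xml_text = "<patents><patent id='1'><title>Example</title></patent></patents>"

print(hasattr(xml_text, "read"))  # False: a str is not a handle

handle = io.StringIO(xml_text)
print(hasattr(handle, "read"))    # True: file-like object

# ElementTree accepts a file-like object and needs no DTD.
root = ET.parse(handle).getroot()
titles = [element.text for element in root.iter("title")]
print(titles)  # ['Example']
```

Whether this is enough for the real files depends on how they are structured, so treat it as a starting point only.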
Why not use one of Python's standard xml libraries (xml.sax or xml.dom (or xml.minidom))? -brandon On Thu, 2012-07-12 at 16:53 +0800, Wheaton Little wrote: > I would like to use the Biopython xml parser, if possible, on google > patent xmls: > > http://www.google.com/googlebooks/uspto-patents-applications-text.html > > unfortunately, this is what I get: > > >>> t=open('ipa111229.xml','r').read() > >>> import Bio > >>> ttt=Bio.Entrez.read(t[:30000]) > > Traceback (most recent call last): > File "", line 1, in > ttt=Bio.Entrez.read(t[:30000]) > File "/Library/Python/2.7/site-packages/Bio/Entrez/__init__.py", > line 351, in read > record = handler.read(handle) > File "/Library/Python/2.7/site-packages/Bio/Entrez/Parser.py", line > 169, in read > self.parser.ParseFile(handle) > TypeError: argument must have 'read' attribute > > What would I have to do to use the parser on this xml? > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Thu Jul 12 09:35:09 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 12 Jul 2012 10:35:09 +0100 Subject: [Biopython] can I use the xml parser in biopython on other xml files? how? In-Reply-To: References: Message-ID: On Thu, Jul 12, 2012 at 9:53 AM, Wheaton Little wrote: > I would like to use the Biopython xml parser, if possible, on google > patent xmls: > > http://www.google.com/googlebooks/uspto-patents-applications-text.html > > unfortunately, this is what I get: > >>>> t=open('ipa111229.xml','r').read() >>>> import Bio >>>> ttt=Bio.Entrez.read(t[:30000]) > > Traceback (most recent call last): > ... > TypeError: argument must have 'read' attribute > > What would I have to do to use the parser on this xml? In your example, you opened the file and read all the data into a string (variable t). The parser is not expecting a string, but a handle. 
String objects don't have a 'read' method, thus this error message. You could 'fix' this particular error by doing: handle=open('ipa111229.xml','r') from Bio import Entrez ttt=Entrez.read(handle) However, I doubt this will work as the Entrez parser is intended to be used with the NCBI XML files only. Python comes with several XML libraries in the standard library. ElementTree (or cElementTree) is quite popular, but as Brandon points out there are also DOM and SAX style parsers. Peter From from.d.putto at gmail.com Thu Jul 12 11:06:58 2012 From: from.d.putto at gmail.com (Sheila the angel) Date: Thu, 12 Jul 2012 13:06:58 +0200 Subject: [Biopython] access Uniprot record by different ids In-Reply-To: References: Message-ID: Thanks for the reply. I have now made two dictionaries, one for uniprot_sprot.dat and another mapping secondary ids to primary ids. However, it takes too long to do this and I can't pickle my_dict. I would like to know whether it is possible to dump my_dict (the uniprot.dat data) to a MySQL database. I looked at the biopython-BioSQL page but didn't understand much (I am new to SQL). Thanks -- Sheila On Tue, Jul 10, 2012 at 3:43 PM, Peter Cock wrote: > On Tue, Jul 10, 2012 at 2:16 PM, Sheila the angel > wrote: > > I have a Uniprot AC list in which some AC are primary and some are > > secondary. The function > > my_dict = SeqIO.index("uniprot_sprot.dat", "swiss") > > makes dictionary of uniprot data but I can access a record only by > primary > > AC. > > my_dict['P04637'] # gives the record > > my_dict['Q15086'] # KeyError > > my_dict['P53_HUMAN'] # KeyError > > > > Is it possible to access same record by both primary and secondary ACs > > (and by uniprot ID) ? > > Not directly with Bio.SeqIO.index() or Bio.SeqIO.index_db(), no. > You could perhaps use a second dictionary mapping aliases to > the primary ID?
> > Peter > From wheatontrue at gmail.com Thu Jul 12 11:57:51 2012 From: wheatontrue at gmail.com (Wheaton Little) Date: Thu, 12 Jul 2012 19:57:51 +0800 Subject: [Biopython] can I use the xml parser in biopython on other xml files? how? In-Reply-To: References: Message-ID: Indeed, it didn't like that. Using BeautifulSoup seems to work but not sure how well... Thanks for the advice, all! On Thu, Jul 12, 2012 at 5:35 PM, Peter Cock wrote: > On Thu, Jul 12, 2012 at 9:53 AM, Wheaton Little wrote: >> I would like to use the Biopython xml parser, if possible, on google >> patent xmls: >> >> http://www.google.com/googlebooks/uspto-patents-applications-text.html >> >> unfortunately, this is what I get: >> >>>>> t=open('ipa111229.xml','r').read() >>>>> import Bio >>>>> ttt=Bio.Entrez.read(t[:30000]) >> >> Traceback (most recent call last): >> ... >> TypeError: argument must have 'read' attribute >> >> What would I have to do to use the parser on this xml? > > In your example, you opened the file and read all the data into a > string (variable t). > > The parser is not expecting a string, but a handle. String objects > don't have a 'read' method, thus this error message. > > You could 'fix' this particular error by doing: > > handle=open('ipa111229.xml','r') > from Bio import Entrez > ttt=Entrez.read(handle) > > However, I doubt this will work as the Entrez parser is intended to be > used with the NCBI XML files only. > > Python comes with several XML libraries in the standard library. > ElementTree (or cElementTree) is quite popular, but as Brandon points > out there are also DOM and SAX style parsers. > > Peter From chaudhrynabeelahmed at gmail.com Thu Jul 12 12:58:58 2012 From: chaudhrynabeelahmed at gmail.com (Nabeel Ahmed) Date: Thu, 12 Jul 2012 17:58:58 +0500 Subject: [Biopython] Bioinformatics EMBOSS users Message-ID: I have recently installed EMBOSS-6.4.0 (Ubuntu 11.10).
I am unable to make it work directly with live databases (embl, uniprot) , working totally fine with local sequence files. e.g % *plotorf * Plot potential open reading frames in a nucleotide sequence Input nucleotide sequence: *embl:x13776* *Error:* Failed to open filename 'embl'** Used 'showdb' , displayed table with zero rows. Is there any configuration, i am missing?? Ahmed From p.j.a.cock at googlemail.com Thu Jul 12 18:30:33 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 12 Jul 2012 19:30:33 +0100 Subject: [Biopython] Bioinformatics EMBOSS users In-Reply-To: References: Message-ID: On Thu, Jul 12, 2012 at 1:58 PM, Nabeel Ahmed wrote: > I have recently installed EMBOSS-6.4.0 (Ubuntu 11.10). > I am unable to make it work directly with live databases (embl, uniprot) , > working totally fine with local sequence files. > e.g > > % *plotorf * > Plot potential open reading frames in a nucleotide sequence > Input nucleotide sequence: *embl:x13776* > > *Error:* Failed to open filename 'embl'** > > Used 'showdb' , displayed table with zero rows. > > Is there any configuration, i am missing?? > > Ahmed I'm not sure - but the EMBOSS mailing list would be the place to ask: http://lists.open-bio.org/mailman/listinfo/emboss Peter From p.j.a.cock at googlemail.com Thu Jul 12 18:37:11 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 12 Jul 2012 19:37:11 +0100 Subject: [Biopython] access Uniprot record by different ids In-Reply-To: References: Message-ID: On Thu, Jul 12, 2012 at 12:06 PM, Sheila the angel wrote: > Thanks for reply. > Now I made two dictionary one for uniprot_sprot.dat and another for > secondary ids to primary ids. > However it take too long to do this and I can't do Pickle for my_dict. > I would like to know is it possible to dump my_dict (the uniprot.dat data) > to MySql database. Have you tried the Bio.SeqIO.index_db(...) function? This builds an SQLite database to hold the lookup table of offsets (i.e. 
the primary accession only). Creating the index is a little slow, but reuse is very fast. For your second dictionary mapping secondary accessions to the primary accession, you should be able to use pickle. > I looked at biopython-BioSQL page but didn't understand much > (I am new to SQL) > Thanks BioSQL is a bit complicated to get started with (although using SQLite is a lot simpler than MySQL or PostgreSQL). Peter From livingstonemark at gmail.com Tue Jul 17 01:49:37 2012 From: livingstonemark at gmail.com (Mark Livingstone) Date: Tue, 17 Jul 2012 11:49:37 +1000 Subject: [Biopython] The PDBParser Permissive setting Message-ID: Hi Guys, In my code I am experimenting with different ways of doing RMSD calculations. I have code which in addition to normal CA based RMSD can do (CA & CB) RMSD and also sidechain RMSD. On a perfect PDB file this works well. Unfortunately, the curation I have is fairly average / poor in quality :-( and I only find out when one of the liberal number of Try/Except blocks falls over. I need a better way to find out sooner if a PDB file is missing data. I am wondering therefore is for PDBParser I set Permissive=0, and after setting the relevant models and chains etc, I did wt_atoms = Bio.PDB.Selection.unfold_entities(wtc, 'A') If this successfully works without throwing an Exception, can I assume that this unfolded chain is perfect, or are there ways that I could still be tripped up? Alternatively, can anyone suggest code that I can employ in my curation process that will give me a decent sanity check of PDB quality, so I can get on writing experimental code - and not Try/Except blocks :-( Thanks in advance, MarkL From anaryin at gmail.com Tue Jul 17 06:42:09 2012 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 17 Jul 2012 02:42:09 -0400 Subject: [Biopython] The PDBParser Permissive setting In-Reply-To: References: Message-ID: Hey Mark, What kind of validation do you want? 
Cheers,

João

On 17 Jul 2012 at 02:52, "Mark Livingstone" <livingstonemark at gmail.com> wrote:

> Hi Guys,
>
> In my code I am experimenting with different ways of doing RMSD
> calculations. I have code which in addition to normal CA based RMSD
> can do (CA & CB) RMSD and also sidechain RMSD. On a perfect PDB file
> this works well. Unfortunately, the curation I have is fairly average
> / poor in quality :-( and I only find out when one of the liberal
> number of Try/Except blocks falls over.
>
> I need a better way to find out sooner if a PDB file is missing data.
>
> I am wondering therefore is for PDBParser I set Permissive=0, and
> after setting the relevant models and chains etc, I did
>
>
> wt_atoms = Bio.PDB.Selection.unfold_entities(wtc, 'A')
>
> If this successfully works without throwing an Exception, can I assume
> that this unfolded chain is perfect, or are there ways that I could
> still be tripped up?
>
> Alternatively, can anyone suggest code that I can employ in my
> curation process that will give me a decent sanity check of PDB
> quality, so I can get on writing experimental code - and not
> Try/Except blocks :-(
>
> Thanks in advance,
>
> MarkL
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From livingstonemark at gmail.com Tue Jul 17 08:54:50 2012
From: livingstonemark at gmail.com (Mark Livingstone)
Date: Tue, 17 Jul 2012 18:54:50 +1000
Subject: [Biopython] The PDBParser Permissive setting
In-Reply-To:
References:
Message-ID:

Hi João,

I guess it would be good if I could get a data structure that had no
discontinuities, no missing data points or unknowns. I would be able
to tell it to ignore HOH or other irrelevancies.
My use case as I mentioned is RMSD and similar algorithms, so one
continuous structure with all the data attached that I can iterate
through, selecting atoms / residues as needed, and get the names and
coordinates as I go.

So I guess I want a PDB Diagnostic type program to allow me to find
exemplary PDB files to use during initial stages of development while
I do proof of concept, since I know that finding edge case PDBs for
later work is not as hard, it seems, as finding good ones ;-)

Maybe the simplest way to think of the sort of PDBs is you can run
your software and you don't need any try / except blocks for
Biopython to work well :-D

Cheers,

MarkL

On 17 July 2012 16:42, João Rodrigues wrote:
> Hey Mark,
>
> What kind of validation do you want?
>
> Cheers,
>
> João
>
> On 17 Jul 2012 at 02:52, "Mark Livingstone" wrote:
>>
>> Hi Guys,
>>
>> In my code I am experimenting with different ways of doing RMSD
>> calculations. I have code which in addition to normal CA based RMSD
>> can do (CA & CB) RMSD and also sidechain RMSD. On a perfect PDB file
>> this works well. Unfortunately, the curation I have is fairly average
>> / poor in quality :-( and I only find out when one of the liberal
>> number of Try/Except blocks falls over.
>>
>> I need a better way to find out sooner if a PDB file is missing data.
>>
>> I am wondering therefore is for PDBParser I set Permissive=0, and
>> after setting the relevant models and chains etc, I did
>>
>>
>> wt_atoms = Bio.PDB.Selection.unfold_entities(wtc, 'A')
>>
>> If this successfully works without throwing an Exception, can I assume
>> that this unfolded chain is perfect, or are there ways that I could
>> still be tripped up?
>>
>> Alternatively, can anyone suggest code that I can employ in my
>> curation process that will give me a decent sanity check of PDB
>> quality, so I can get on writing experimental code - and not
>> Try/Except blocks :-(
>>
>> Thanks in advance,
>>
>> MarkL
>> _______________________________________________
>> Biopython mailing list - Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython

From anaryin at gmail.com Tue Jul 17 09:35:56 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 17 Jul 2012 10:35:56 +0100
Subject: [Biopython] The PDBParser Permissive setting
In-Reply-To:
References:
Message-ID:

You mean, for example, no chain breaks? And no missing atoms in
residues?

You can check the first one with a warning catcher (I think I
answered something like this a while ago here on the mailing list).
The second one is trickier: you'll need a sort of topology to know
which atoms belong to each residue. I have something like that in my
GSOC branch but it's very, very experimental.

Is this what you mean? Which others would you be looking for? I think
that for RMSD alone you need only to make sure that you match
equivalent atoms. That should be easy enough without major
modifications or endless try/excepts :)

From dilara.ally at gmail.com Tue Jul 17 22:24:07 2012
From: dilara.ally at gmail.com (Dilara Ally)
Date: Tue, 17 Jul 2012 15:24:07 -0700
Subject: [Biopython] When is a SeqRecord a SeqRecord
Message-ID:

Hi,

I've modified my code, but why does the inclusion of "return None"
and the subsequent check "if filtered_rec is not None" solve the
problem?

Thanks!
Dilara

from numpy import array  # needed for array(...) below

q_threshold=20

def check_meanQ(rec, q_threshold):
    seqlen=len(rec)
    quality_scores=array(rec.letter_annotations["phred_quality"])
    if round(quality_scores.mean()) <= q_threshold:
        print "Discarded ", rec.id, "because mean Q was", round(quality_scores.mean())
        return None
    if round(quality_scores.mean()) > q_threshold:
        return rec

from Bio import SeqIO
for rec in SeqIO.parse("test.fastq", "fastq"):
    #print rec.id
    filtered_rec= check_meanQ(rec, q_threshold)
    if filtered_rec is not None:
        print filtered_rec.id
        print filtered_rec.letter_annotations

From dilara.ally at gmail.com Tue Jul 17 19:11:12 2012
From: dilara.ally at gmail.com (Dilara Ally)
Date: Tue, 17 Jul 2012 12:11:12 -0700
Subject: [Biopython] when is a SeqRecord not a SeqRecord
Message-ID:

Hi,

I'm trying to understand why, when I print filtered_rec, I get a
SeqRecord, but if I try to access any particular attribute of a
SeqRecord such as letter_annotations I sometimes get an attribute
error -- AttributeError: 'NoneType' object has no attribute
'letter_annotations'.

from numpy import array  # needed for array(...) below

q_threshold=20

def check_meanQ(record, q_threshold):
    seqlen=len(record)
    quality_scores=array(record.letter_annotations["phred_quality"])
    if round(quality_scores.mean()) <= q_threshold:
        print "Discarded ", record.id, "because mean Q was", round(quality_scores.mean())
    elif round(quality_scores.mean()) > q_threshold:
        return record

from Bio import SeqIO
for rec in SeqIO.parse("test.fastq", "fastq"):
    print rec.id
    filtered_rec= check_meanQ(rec, q_threshold)
    #print filtered_rec
    print filtered_rec.letter_annotations

I've attached two fastq files that I've used with this code: one is
called test.fastq and the other is hiseq_pe_test.fastq.

Any help would be greatly appreciated.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.fastq
Type: application/octet-stream
Size: 39217 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hiseq_pe_test.fastq Type: application/octet-stream Size: 1541 bytes Desc: not available URL: From chapmanb at 50mail.com Wed Jul 18 13:23:27 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 18 Jul 2012 09:23:27 -0400 Subject: [Biopython] when is a SeqRecord not a SeqRecord In-Reply-To: References: Message-ID: <87y5mhkxtc.fsf@fastmail.fm> Dilara; > I'm trying to understand what is why when I print filtered_rec I get a > SeqRecord but if I try to access any particular attribute of a SeqRecord > such as letter_annotations I sometimes get an attribute error -- > AttributeError: 'NoneType' object has no attribute > 'letter_annotations.' > def check_meanQ(record, q_threshold): > seqlen=len(record) > quality_scores=array(record.letter_annotations["phred_quality"]) > if round(quality_scores.mean()) <= q_threshold: > print "Discarded ", record.id, "because mean Q was", > round(quality_scores.mean()) > elif round(quality_scores.mean()) > q_threshold: > return record This function returns different results based on the comparison of mean quality scores to your threshold: - When it is below the threshold, it returns None (since you do not define an explicit return value) - When it is above the threshold, it returns a SeqRecord. > from Bio import SeqIO > for rec in SeqIO.parse("test.fastq", "fastq"): > print rec.id > filtered_rec= check_meanQ(rec, q_threshold) > #print filtered_rec > print filtered_rec.letter_annotations You are seeing the error since in the filtered cases the function returns None. 
You probably want: filtered_rec= check_meanQ(rec, q_threshold) if filtered_rec is not None: print filtered_rec.letter_annotations Brad From bioinformaticsing at gmail.com Thu Jul 19 03:36:19 2012 From: bioinformaticsing at gmail.com (ning luwen) Date: Thu, 19 Jul 2012 11:36:19 +0800 Subject: [Biopython] Error while parsing bgk file Message-ID: Hi everyone, A error encountered when i parse a gbk file.
The error message is as follows:

Traceback (most recent call last):
  File "stat_refseq_gbs.py", line 10, in <module>
    for seq in f:
  File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 537, in parse
    for r in i:
  File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 445, in parse_records
    record = self.parse(handle, do_features)
  File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 428, in parse
    if self.feed(handle, consumer, do_features):
  File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 400, in feed
    self._feed_feature_table(consumer, self.parse_features(skip=False))
  File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 350, in _feed_feature_table
    consumer.location(location_string)
  File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/__init__.py", line 970, in location
    int(e),
ValueError: invalid literal for int() with base 10: '68452073^68452074'

The file parsed is ref_GRCh37.p5, the Biopython version is 1.60, and
the lines causing the error may be:

V_segment complement(68451760..68452073^68452074)
CDS complement(<68451760..68452072^68452073)

--
regards,
luwen ning

From w.arindrarto at gmail.com Thu Jul 19 08:50:33 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Thu, 19 Jul 2012 10:50:33 +0200
Subject: [Biopython] Error while parsing bgk file
In-Reply-To:
References:
Message-ID:

Hi Ning,

Thanks for reporting the error. A similar issue has been reported in
the bug tracker here: https://redmine.open-bio.org/issues/3175 (it
also looks like it's the same coordinate). It seems that this could be
an invalid GenBank coordinate made by NCBI, though.

Which chromosome is this coordinate coming from? Is it the latest draft?

cheers,
Bow

On Thu, Jul 19, 2012 at 5:36 AM, ning luwen wrote:
> Hi everyone,
>
> A error encountered when i parse a gbk file.
> > the error message as follow: > > Traceback (most recent call last): > File "stat_refseq_gbs.py", line 10, in > for seq in f: > File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", > line 537, in parse > for r in i: > File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 445, in parse_records > record = self.parse(handle, do_features) > File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 428, in parse > if self.feed(handle, consumer, do_features): > File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 400, in feed > self._feed_feature_table(consumer, self.parse_features(skip=False)) > File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 350, in _feed_feature_table > consumer.location(location_string) > File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/__init__.py", > line 970, in location > int(e), > ValueError: invalid literal for int() with base 10: '68452073^68452074' > > the file parsed is ref_GRCh37.p5, the biopython version is 1.60, the > lines cause the error may be: > > V_segment complement(68451760..68452073^68452074) > CDS complement(<68451760..68452072^68452073) > > -- > regards, > luwen ning > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From dilara.ally at gmail.com Thu Jul 19 15:51:35 2012 From: dilara.ally at gmail.com (Dilara Ally) Date: Thu, 19 Jul 2012 08:51:35 -0700 Subject: [Biopython] slice a record in two and writing both records Message-ID: If I have a function (modify_record) that slices up a SeqRecord into sub records and then returns the sliced record if it has a certain length (for e.g. the sliced record needs to be greater than 40bp), sometimes the original record when sliced will have two different records both greater than 40bp. 
I want to keep both sliced reads and rewrite them as separate records into a single fastq file. Here is my code: def modify_record(frec, win, len_threshold): quality_scores = array(frec.letter_annotations["phred_quality"]) all_window_qc = slidingWindow(quality_scores, win,1) track_qc = windowQ(all_window_qc) myzeros = boolean_array(track_qc, q_threshold,win) Nrec = slice_points(myzeros,win)[0][1]-1 where_to_slice = slice_points(myzeros,win)[1] where_to_slice.append(len(frec)+win) sub_record = slice_record(frec,Nrec,where_to_slice,win,len_threshold) return sub_record q_threshold = 20 win = 5 len_threshold = 30 from Bio import SeqIO from numpy import * good_reads = (rec for rec in SeqIO.parse("hiseq_pe_test.fastq", "fastq") if array(rec.letter_annotations["phred_quality"]).mean() >= q_threshold) count = SeqIO.write(good_reads, "temp.fastq", "fastq") print "Saved %i reads" % count newly_filtered=[] for rec in SeqIO.parse("temp.fastq", "fastq"): s = modify_record(rec, win, len_threshold) newly_filtered.append(s) SeqIO.write(newly_filtered, "filtered_temp.fastq", "fastq") This writes only the first sub_record even when there are more than 1 that have a len >40bp. I've tried this as a generator expression and I'm still getting just the first sub_record. I'd also prefer to not to use append as it was previously suggested that this can lead to problems if you run the script more than once. Instead, I want to employ a generator expression - but I'm still getting used to the idea of generator expressions. My second question is more general. Generator expressions are more memory efficient than a list comprehension, but how are they better than just a simple loop that pulls in a single record, does something and then writes that record? Is it just a time issue? Many thanks for the help! 
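
A minimal, Biopython-free sketch of the one-read-to-many-fragments
pattern asked about above. Everything here is an assumption made for
illustration: plain (id, sequence, qualities) tuples stand in for
SeqRecord objects, a simple per-base quality threshold replaces the
sliding-window logic of modify_record, and the helper names
split_on_low_quality and all_fragments are hypothetical.

```python
from itertools import chain


def split_on_low_quality(read_id, seq, quals, q_threshold=20, len_threshold=30):
    """Yield (fragment_id, fragment_seq, fragment_quals) for every run of
    bases with quality >= q_threshold that is at least len_threshold long.

    Hypothetical helper: tuples stand in for SeqRecord objects.
    """
    start = None
    fragment_number = 0
    # A sentinel value just below the threshold closes a run that
    # reaches the end of the read.
    for pos, q in enumerate(list(quals) + [q_threshold - 1]):
        if q >= q_threshold:
            if start is None:
                start = pos  # a new high-quality run begins here
        elif start is not None:
            if pos - start >= len_threshold:
                fragment_number += 1
                # Give each fragment its own ID so the output stays unique.
                yield ("%s_frag%d" % (read_id, fragment_number),
                       seq[start:pos], quals[start:pos])
            start = None


def all_fragments(reads, **kwargs):
    """Flatten the per-read fragment generators into one lazy stream."""
    return chain.from_iterable(
        split_on_low_quality(read_id, seq, quals, **kwargs)
        for read_id, seq, quals in reads
    )
```

Because both functions are lazy, a read can contribute zero, one, or
several fragments without any intermediate list being built, and the
flattened stream can be consumed one fragment at a time.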
From w.arindrarto at gmail.com Thu Jul 19 17:21:42 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Thu, 19 Jul 2012 19:21:42 +0200
Subject: [Biopython] slice a record in two and writing both records
In-Reply-To:
References:
Message-ID:

Hi Dilara,

For your first question, it seems that the `modify_record` function
always returns only one SeqRecord object. This is a bit of guesswork
from my end as I don't know how most of the functions in
`modify_record` work, but since you still see an output sequence at
the end, I think you may want to re-check how `sub_record` returns its
values / how it returns more than one SeqRecord object.

Also, you might want to try changing the last two lines:

newly_filtered.append(s)
SeqIO.write(newly_filtered, "filtered_temp.fastq", "fastq")

Here, you're doing `SeqIO.write` for each iteration of the loop.
Although the end result is the same (a file containing all the
sequences you want), the code may be made more efficient by putting
the SeqIO.write line outside of the loop, after all sequences are
pooled in the `newly_filtered` list.

For your second question, I personally find generator expressions to
be more compact and easier to read. This is important for future code
maintenance ~ having more readable lines of code means it's easier to
understand your code and to debug it in case something goes wrong.

Note that generator expressions aren't silver bullets. In some cases,
for loops may still be better (e.g. if you're doing complex operations
on the objects you're iterating over).

I found these two sites helpful when I first grappled with generators
and generator expressions.
I hope they are the same to you too: * http://stackoverflow.com/questions/1995418/python-generator-expression-vs-yield * http://www.dabeaz.com/generators/Generators.pdf (PDF) Hope that helps :), Bow On Thu, Jul 19, 2012 at 5:51 PM, Dilara Ally wrote: > If I have a function (modify_record) that slices up a SeqRecord into sub > records and then returns the sliced record if it has a certain length (for > e.g. the sliced record needs to be greater than 40bp), sometimes the > original record when sliced will have two different records both greater > than 40bp. I want to keep both sliced reads and rewrite them as separate > records into a single fastq file. Here is my code: > > def modify_record(frec, win, len_threshold): > quality_scores = array(frec.letter_annotations["phred_quality"]) > all_window_qc = slidingWindow(quality_scores, win,1) > track_qc = windowQ(all_window_qc) > myzeros = boolean_array(track_qc, q_threshold,win) > Nrec = slice_points(myzeros,win)[0][1]-1 > where_to_slice = slice_points(myzeros,win)[1] > where_to_slice.append(len(frec)+win) > sub_record = slice_record(frec,Nrec,where_to_slice,win,len_threshold) > return sub_record > > q_threshold = 20 > win = 5 > len_threshold = 30 > > from Bio import SeqIO > from numpy import * > good_reads = (rec for rec in SeqIO.parse("hiseq_pe_test.fastq", "fastq") if > array(rec.letter_annotations["phred_quality"]).mean() >= q_threshold) > count = SeqIO.write(good_reads, "temp.fastq", "fastq") > print "Saved %i reads" % count > > newly_filtered=[] > for rec in SeqIO.parse("temp.fastq", "fastq"): > s = modify_record(rec, win, len_threshold) > newly_filtered.append(s) > SeqIO.write(newly_filtered, "filtered_temp.fastq", "fastq") > > This writes only the first sub_record even when there are more than 1 that > have a len >40bp. I've tried this as a generator expression and I'm still > getting just the first sub_record. 
I'd also prefer to not to use append > as it was previously suggested that this can lead to problems if you run > the script more than once. Instead, I want to employ a generator > expression - but I'm still getting used to the idea of generator > expressions. > > My second question is more general. Generator expressions are more memory > efficient than a list comprehension, but how are they better than just a > simple loop that pulls in a single record, does something and then writes > that record? Is it just a time issue? > > Many thanks for the help! > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From bioinformaticsing at gmail.com Fri Jul 20 03:41:20 2012 From: bioinformaticsing at gmail.com (ning luwen) Date: Fri, 20 Jul 2012 11:41:20 +0800 Subject: [Biopython] Fwd: Error while parsing bgk file In-Reply-To: References: Message-ID: ---------- Forwarded message ---------- From: Lenna Peterson Date: Thu, Jul 19, 2012 at 12:51 PM Subject: Re: [Biopython] Error while parsing bgk file To: ning luwen On Wed, Jul 18, 2012 at 11:36 PM, ning luwen wrote: > Hi everyone, > > A error encountered when i parse a gbk file. 
> > the error message as follow: > > Traceback (most recent call last): > File "stat_refseq_gbs.py", line 10, in > for seq in f: > File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", > line 537, in parse > for r in i: > File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 445, in parse_records > record = self.parse(handle, do_features) > File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 428, in parse > if self.feed(handle, consumer, do_features): > File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 400, in feed > self._feed_feature_table(consumer, self.parse_features(skip=False)) > File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", > line 350, in _feed_feature_table > consumer.location(location_string) > File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/__init__.py", > line 970, in location > int(e), > ValueError: invalid literal for int() with base 10: '68452073^68452074' > > the file parsed is ref_GRCh37.p5, the biopython version is 1.60, the > lines cause the error may be: > > V_segment complement(68451760..68452073^68452074) > CDS complement(<68451760..68452072^68452073) > > -- > regards, > luwen ning > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython Hi Luwen, Thanks for reporting this problem. I've submitted a patch that should fix it. https://github.com/biopython/biopython/pull/54 Lenna -- regards, luwen ning From bioinformaticsing at gmail.com Fri Jul 20 03:56:51 2012 From: bioinformaticsing at gmail.com (ning luwen) Date: Fri, 20 Jul 2012 11:56:51 +0800 Subject: [Biopython] Error while parsing bgk file In-Reply-To: References: Message-ID: Hi Bow, Thank you for your reply, and a patch by lenna can solve the interruption of the parse. 
ps: these gbk file was recently downloaded from ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/ (with extension of gbs.gz), and the file contained "invalid GenBank annotation" is ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/CHR_02/hs_ref_GRCh37.p5_chr2.gbs.gz On Thu, Jul 19, 2012 at 4:50 PM, Wibowo Arindrarto wrote: > Hi Ning, > > Thanks for reporting the error. A similar issue has been reported in > the bug tracker here: https://redmine.open-bio.org/issues/3175 (it > also looks like it's the same coordinate). It seems that this could be > an invalid GenBank coordinate made by NCBI, though. > > From which chromosome is this coordinate coming from? Is it the latest draft? > > cheers, > Bow > > > On Thu, Jul 19, 2012 at 5:36 AM, ning luwen wrote: >> Hi everyone, >> >> A error encountered when i parse a gbk file. >> >> the error message as follow: >> >> Traceback (most recent call last): >> File "stat_refseq_gbs.py", line 10, in >> for seq in f: >> File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", >> line 537, in parse >> for r in i: >> File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", >> line 445, in parse_records >> record = self.parse(handle, do_features) >> File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", >> line 428, in parse >> if self.feed(handle, consumer, do_features): >> File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", >> line 400, in feed >> self._feed_feature_table(consumer, self.parse_features(skip=False)) >> File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", >> line 350, in _feed_feature_table >> consumer.location(location_string) >> File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/__init__.py", >> line 970, in location >> int(e), >> ValueError: invalid literal for int() with base 10: '68452073^68452074' >> >> the file parsed is ref_GRCh37.p5, the biopython version is 1.60, the >> 
lines cause the error may be: >> >> V_segment complement(68451760..68452073^68452074) >> CDS complement(<68451760..68452072^68452073) >> >> -- >> regards, >> luwen ning >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython -- regards, luwen ning From p.j.a.cock at googlemail.com Fri Jul 20 10:07:04 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 20 Jul 2012 11:07:04 +0100 Subject: [Biopython] slice a record in two and writing both records In-Reply-To: References: Message-ID: On Thu, Jul 19, 2012 at 4:51 PM, Dilara Ally wrote: > If I have a function (modify_record) that slices up a SeqRecord into sub > records and then returns the sliced record if it has a certain length (for > e.g. the sliced record needs to be greater than 40bp), sometimes the > original record when sliced will have two different records both greater > than 40bp. I want to keep both sliced reads and rewrite them as separate > records into a single fastq file. Here is my code: > > def modify_record(frec, win, len_threshold): > quality_scores = array(frec.letter_annotations["phred_quality"]) > all_window_qc = slidingWindow(quality_scores, win,1) > track_qc = windowQ(all_window_qc) > myzeros = boolean_array(track_qc, q_threshold,win) > Nrec = slice_points(myzeros,win)[0][1]-1 > where_to_slice = slice_points(myzeros,win)[1] > where_to_slice.append(len(frec)+win) > sub_record = slice_record(frec,Nrec,where_to_slice,win,len_threshold) > return sub_record > ... The key point is that for each input record you may want to produce several output records. A single function turning one input SeqRecord into one output SeqRecord won't work. I would suggest either, 1. Modify your function to return a list of SeqRecord objects, which could be zero, one (as now), or several - depending on the slice points. 
Then use itertools.chain.from_iterable to combine them (plain
chain(...) would yield the lists themselves rather than the records
inside them), something like this:

    from itertools import chain
    good_reads = chain.from_iterable(modify_record(r, win, len_threshold)
                                     for r in SeqIO.parse(...))
    count = SeqIO.write(good_reads, "filtered.fastq", "fastq")
    print "Saved %i read fragments" % count

2. Use a generator function to process the SeqRecord objects,

    def select_fragments(records, win, len_threshold):
        for record in records:
            where_to_slice = ...
            for slice_point in where_to_slice:
                yield record[slice_point]

    good_reads = select_fragments(SeqIO.parse(...), win, len_threshold)
    count = SeqIO.write(good_reads, "filtered.fastq", "fastq")
    print "Saved %i read fragments" % count

Both these approaches are generator/iteration based and will be
memory efficient.

Note you may also want to alter the record identifiers so that
different fragments from a single read get different IDs.

Peter

From p.j.a.cock at googlemail.com Fri Jul 20 10:29:33 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 20 Jul 2012 11:29:33 +0100
Subject: [Biopython] Error while parsing bgk file
In-Reply-To:
References:
Message-ID:

On Fri, Jul 20, 2012 at 4:56 AM, ning luwen wrote:
> Hi Bow,
>
> Thank you for your reply, and a patch by lenna can solve the
> interruption of the parse.
> > ps: these gbk file was recently downloaded from > ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/ (with extension of > gbs.gz), and the file contained "invalid GenBank annotation" is > ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/CHR_02/hs_ref_GRCh37.p5_chr2.gbs.gz Note the original bug report referred to a slightly different part/revision of this chromosome, but it is the same issue reported earlier: https://redmine.open-bio.org/issues/3175 ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_02/hs_ref_GRCh37.p2_chr2.gbk.gz I have now committed Lenna's fix, which means this file now parses with a warning about the problem features (which get None as their location): https://github.com/biopython/biopython/commit/bc733da09051ca53ad4515ac2d971ff0839a71ba https://github.com/biopython/biopython/commit/4bf78f72682f0500e93c410f8108891dade88ff8 Ning, if you would like to test this fix the simplest way is to get the latest source code from github, and reinstall Biopython. You can either use the git tool at the command line, or the github URL for a tarball: https://github.com/biopython/biopython/tarball/master (Please ask if you need more guidance with this) Regards, Peter From igorrcosta at hotmail.com Sat Jul 21 21:44:40 2012 From: igorrcosta at hotmail.com (Igor Rodrigues da Costa) Date: Sat, 21 Jul 2012 21:44:40 +0000 Subject: [Biopython] Back translation support in Biopython Message-ID: Hi Peter, I would eliminate the problem of ID mapping (or at least pass it to the user) by using only the function that uses one sequence pair. The other option is to check if the codon and the amino acid are equivalent at run time, using a given genetic code. I did this in my program that back translated using only the aligned protein sequence and the Uniprot/GI accession numbers (I did the search using Bio.Entrez), but in my case the nucleotide dictionary was only some different ways the nucleotide sequence could be imported from NCBI, each of them returning a different sequence. 
I can't see any need for different gap characters between both alignments, and I feel there can be both a Bio.SeqIO (using a pair of sequences only) and a Bio.AlignIO (using multiple sequences, probably slower if checking at run time) versions of this function. Att,Igor> Date: Mon, 2 Jul 2012 12:27:08 +0100 > Subject: Re: [Biopython] Back translation support in Biopython > From: p.j.a.cock at googlemail.com > To: igorrcosta at hotmail.com; eric.talevich at gmail.com > CC: biopython at lists.open-bio.org > > On Wed, Apr 4, 2012 at 4:02 PM, Peter Cock wrote: > > On Wed, Apr 4, 2012 at 2:49 AM, Eric Talevich wrote: > >> Hi Igor, > >> > >> It sounds like you're referring to aligning amino acid sequences to codon > >> sequences, as PAL2NAL does. This is different from what most people mean by > >> back translation, but as you point out, certainly useful. > >> > >> If you write a function that can match a protein sequence alignment to a set > >> of raw CDS sequences, returning a nucleotide alignment based on the > >> codon-to-amino-acid mapping, that would be useful. However, PAL2NAL does > >> exactly that, plus a bit more, and is a fairly well-known and easily > >> obtained program. Personally, I would prefer to write a wrapper for PAL2NAL > >> under Bio.Align.Applications, using the existing Bio.Applications framework. > > > > As per the old thread, a simple function in Python taking the gapped protein > > sequence, original nucleotide coding sequence, and the translation table > > does sound useful. Then using that, you could go from a protein alignment > > plus the original nucleotide coding sequences to a codon alignment, or > > other tasks. Given this is all relatively straightforward string manipulation > > and we already have the required genetic code tables in Biopython, I'm not > > convinced that wrapping PAL2NAL would be the best solution (for this sub > > task). > > Hi Igor, > > Did you do any work on back-translation (alignment threading) in Biopython? 
> > We needed to do this locally, and for some reason (yet to be determined) > T-COFFEE wasn't working on our dataset, so I made a start at a Biopython > implementation: > > https://github.com/peterjc/biopython/tree/back_trans > https://github.com/peterjc/biopython/commit/7d14cdb59bb9d41c727c923c8aa7e3dda7779c80 > > Currently just one commit adding a Bio.Align.alignment_back_translate(...) > function which takes a protein alignment and dictionary of nucleotide > records - easy to get with Bio.SeqIO and Bio.AlignIO - with a stand alone > example included in the doctest. There is also a new (currently private) > function to do this for one sequence pair - perhaps useful on its own? > > There are potential complications with ID mapping between the proteins > and nucleotides, thus the option of a key function, and the gap characters > (would you ever want to use different gap characters in the protein and > nucleotide alignments?). We could discuss implementation details over > on the biopython-dev list, but the general API discussion might as well > be here. e.g. Where to put the function and what to call it. > > Regards, > > Peter From p.j.a.cock at googlemail.com Sun Jul 22 12:51:12 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 22 Jul 2012 13:51:12 +0100 Subject: [Biopython] Back translation support in Biopython In-Reply-To: References: Message-ID: On Sat, Jul 21, 2012 at 10:44 PM, Igor Rodrigues da Costa wrote: > > > Hi Peter, > I would eliminate the problem of ID mapping (or at least > pass it to the user) by using only the function that uses > one sequence pair. Making the function for doing one sequence pair part of the public API seems sensible then. > The other option is to check if the codon and the amino > acid are equivalent at run time, using a given genetic > code. 
> I did this in my program that back translated
> using only the aligned protein sequence and the
> Uniprot/GI accession numbers (I did the search using
> Bio.Entrez), but in my case the nucleotide dictionary
> was only some different ways the nucleotide sequence
> could be imported from NCBI, each of them returning
> a different sequence.

Certainly optionally checking the translation seems wise.
There are potential complications with things like ambiguous
bases, but in general this is useful.

> I can't see any need for different gap characters
> between both alignments, and I feel there can be both
> a Bio.SeqIO (using a pair of sequences only) and a
> Bio.AlignIO (using multiple sequences, probably slower
> if checking at run time) versions of this function.

I agree that an alignment based function, and a single
sequence based function make sense - but probably under
Bio.Align rather than Bio.SeqIO and Bio.AlignIO which are
specifically for input/output functionality.

Thanks for your thoughts,

Peter

From dilara.ally at gmail.com Mon Jul 23 21:48:30 2012
From: dilara.ally at gmail.com (Dilara Ally)
Date: Mon, 23 Jul 2012 14:48:30 -0700
Subject: [Biopython] slice a record in two and writing both records
In-Reply-To:
References:
Message-ID: <9085DA29-9159-44EE-BED7-56E3306B8EA3@gmail.com>

Thanks. Itertools is a fantastic module!

Dilara

On Jul 20, 2012, at 3:07 AM, Peter Cock wrote:

> On Thu, Jul 19, 2012 at 4:51 PM, Dilara Ally wrote:
>> If I have a function (modify_record) that slices up a SeqRecord into sub
>> records and then returns the sliced record if it has a certain length (for
>> e.g. the sliced record needs to be greater than 40bp), sometimes the
>> original record when sliced will have two different records both greater
>> than 40bp. I want to keep both sliced reads and rewrite them as separate
>> records into a single fastq file.
>> Here is my code:
>>
>> def modify_record(frec, win, len_threshold):
>>     quality_scores = array(frec.letter_annotations["phred_quality"])
>>     all_window_qc = slidingWindow(quality_scores, win,1)
>>     track_qc = windowQ(all_window_qc)
>>     myzeros = boolean_array(track_qc, q_threshold,win)
>>     Nrec = slice_points(myzeros,win)[0][1]-1
>>     where_to_slice = slice_points(myzeros,win)[1]
>>     where_to_slice.append(len(frec)+win)
>>     sub_record = slice_record(frec,Nrec,where_to_slice,win,len_threshold)
>>     return sub_record
>> ...
>
> The key point is that for each input record you may want to
> produce several output records. A single function turning
> one input SeqRecord into one output SeqRecord won't work.
> I would suggest either,
>
> 1. Modify your function to return a list of SeqRecord objects,
> which could be zero, one (as now), or several - depending on
> the slice points. Then use itertools.chain to combine them,
> something like this:
>
> from itertools import chain
> good_reads = chain.from_iterable(modify_record(r) for r in SeqIO.parse(...))
> count = SeqIO.write(good_reads, "filtered.fastq", "fastq")
> print "Saved %i read fragments" % count
>
> 2. Use a generator function to process the SeqRecord objects,
>
> def select_fragments(records, win, len_threshold):
>     for record in records:
>         where_to_slice = ...
>         for slice_point in where_to_slice:
>             yield record[slice_point]
>
> good_reads = select_fragments(SeqIO.parse(...))
> count = SeqIO.write(good_reads, "filtered.fastq", "fastq")
> print "Saved %i read fragments" % count
>
> Both these approaches are generator/iteration based and will
> be memory efficient.
>
> Note you may also want to alter the record identifiers so that
> different fragments from a single read get different IDs.
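Both suggestions can be tried without Biopython. The following is a minimal Python 3 sketch that uses plain strings as stand-ins for SeqRecord objects and a made-up slicing rule (cut at "N", keep pieces of a minimum length); split_read and the sample reads are hypothetical. Flattening a stream of per-read lists is done here with itertools.chain.from_iterable.

```python
from itertools import chain


def split_read(read, min_len):
    """Option 1: return a list of zero, one or several fragments."""
    # Stand-in slicing rule: cut at "N", keep pieces of at least min_len.
    return [piece for piece in read.split("N") if len(piece) >= min_len]


def select_fragments(reads, min_len):
    """Option 2: a generator yielding each acceptable fragment in turn."""
    for read in reads:
        for piece in read.split("N"):
            if len(piece) >= min_len:
                yield piece


reads = ["ACGTACGTNACGT", "ACNGT", "ACGTACGTAC"]

# chain.from_iterable flattens the lists lazily, one read at a time.
flattened = list(chain.from_iterable(split_read(r, 4) for r in reads))
generated = list(select_fragments(reads, 4))

print(flattened)
# prints ['ACGTACGT', 'ACGT', 'ACGTACGTAC']
```

With real records, either iterator could be handed straight to SeqIO.write, which accepts any iterable of SeqRecord objects.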
>
> Peter

From llewelr at gmail.com Tue Jul 24 02:24:06 2012
From: llewelr at gmail.com (Richard Llewellyn)
Date: Mon, 23 Jul 2012 20:24:06 -0600
Subject: [Biopython] Entrez.read(handle) for Bio._py3k.EvilHandleHack throws error with python 3.2
Message-ID:

With python 3.2 and biopython 1.60 after getting a handle using
Entrez.esummary (and esearch, others?) I get a TypeError:

>>> from Bio import Entrez
>>> Entrez.email = "Your.Name.Here@example.org"
>>> handle = Entrez.esummary(db="journals", id="30367")
>>> record = Entrez.read(handle)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.2/site-packages/Bio/Entrez/__init__.py", line 351, in read
    record = handler.read(handle)
  File "/usr/local/lib/python3.2/site-packages/Bio/Entrez/Parser.py", line 169, in read
    self.parser.ParseFile(handle)
TypeError: read() did not return a bytes object (type=str)
>>> handle

Ah, it is evil!

I realize py3k not yet officially supported.

Thanks for the great work.

From llewelr at gmail.com Tue Jul 24 03:20:00 2012
From: llewelr at gmail.com (Richard Llewellyn)
Date: Mon, 23 Jul 2012 21:20:00 -0600
Subject: [Biopython] Entrez.read(handle) for Bio._py3k.EvilHandleHack throws error with python 3.2
In-Reply-To:
References:
Message-ID:

Follow up for Entrez.read error on EvilHandleHack object:

(this is python 3.2.3)

If I change last line of Entrez.__init__.py _open function from

    return _binary_to_string_handle(handle)
to
    return handle

this error does not occur in example given below.

On Mon, Jul 23, 2012 at 8:24 PM, Richard Llewellyn wrote:

> With python 3.2 and biopython 1.60 after getting a handle using
> Entrez.esummary (and esearch, others?)
I get a TypeError: > > >>>> from Bio import Entrez >>>> Entrez.email = "Your.Name.Here at example.org" >>>> handle = Entrez.esummary(db="journals", id="30367") >>>> record = Entrez.read(handle) > > Traceback (most recent call last): > File "", line 1, in > File "/usr/local/lib/python3.2/site-packages/Bio/Entrez/__init__.py", > line 351, in read > record = handler.read(handle) > File "/usr/local/lib/python3.2/site-packages/Bio/Entrez/Parser.py", > line 169, in read > self.parser.ParseFile(handle) > TypeError: read() did not return a bytes object (type=str) > >>>> handle > > > Ah, it is evil! > > I realize py3k not yet officially supported. > > Thanks for the great work. From markd at soe.ucsc.edu Tue Jul 24 06:47:51 2012 From: markd at soe.ucsc.edu (Mark Diekhans) Date: Mon, 23 Jul 2012 23:47:51 -0700 Subject: [Biopython] accessing PDB IDcode when using PDBParser Message-ID: <20494.17687.678370.458937@hgwdev.cse.ucsc.edu> How does one access the idCode in the PDB HEADER when using the PDBParser? I can't find this in the documentation or the code. Also, what is function of the `id' argument for PDBParser.get_structure: The documentation is just self-referential: o id - string, the id that will be used for the structure Seems no obvious way via MMCIFParser either. Thanks! From anaryin at gmail.com Tue Jul 24 08:37:44 2012 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 24 Jul 2012 10:37:44 +0200 Subject: [Biopython] accessing PDB IDcode when using PDBParser In-Reply-To: <20494.17687.678370.458937@hgwdev.cse.ucsc.edu> References: <20494.17687.678370.458937@hgwdev.cse.ucsc.edu> Message-ID: Hey Mark, Indeed there is no specific ID extraction from the HEADER. However, it comes as part of the "head" key in the header dictionary. If you split by whitespace and get the last field, you get the PDB ID. Example: HEADER HYDROLASE(ASPARTYL PROTEINASE) 17-OCT-89 2RSP The id you have in the get_structure function retrieves the first argument you pass to it. 
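A minimal sketch of the split-by-whitespace approach, applied to the raw HEADER line of a PDB file rather than going through Bio.PDB. The function name is hypothetical and not part of Biopython; the four-character check relies on the PDB format defining idCode as a four-character field at the end of the HEADER record.

```python
def pdb_id_from_header_line(line):
    """Return the PDB ID from a raw HEADER record, or None if not found."""
    if not line.startswith("HEADER"):
        return None
    fields = line.split()
    # Need at least the record name plus one more field.
    if len(fields) < 2:
        return None
    candidate = fields[-1]
    # PDB IDs are four characters; anything else is treated as missing.
    return candidate if len(candidate) == 4 else None


example = "HEADER    HYDROLASE(ASPARTYL PROTEINASE)          17-OCT-89   2RSP"
print(pdb_id_from_header_line(example))
# prints 2RSP
```

In practice the line would come from the first record of the PDB file, e.g. by reading the file and stopping at the first line that starts with HEADER.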
Cheers,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao

2012/7/24 Mark Diekhans

>
> How does one access the idCode in the PDB HEADER when using the PDBParser?
> I can't find this in the documentation or the code.
>
> Also, what is function of the `id' argument for PDBParser.get_structure:
> The documentation is just self-referential:
> o id - string, the id that will be used for the structure
>
> Seems no obvious way via MMCIFParser either.
>
> Thanks!
>
>
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From p.j.a.cock at googlemail.com Tue Jul 24 09:41:35 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 24 Jul 2012 10:41:35 +0100
Subject: [Biopython] Entrez.read(handle) for Bio._py3k.EvilHandleHack throws error with python 3.2
In-Reply-To:
References:
Message-ID:

Hi Richard,

It's great to have some feedback on Python 3 support :)

On Tue, Jul 24, 2012 at 4:20 AM, Richard Llewellyn wrote:
> Follow up for Entrez.read error on EvilHandleHack object:
>
> (this is python 3.2.3)
>
> If I change last line of Entrez.__init__.py _open function from
>
>     return _binary_to_string_handle(handle)
> to
>     return handle
>
> this error does not occur in example given below.

Hmm. That call to _binary_to_string_handle converts from the
bytes (binary) network handle to a string (unicode) handle which
is required for most of the parsers in Biopython under Python 3
(e.g. FASTA, GenBank). Surprisingly the Entrez parser seems
to be wanting a binary handle? That seems curious... I presume
that means we don't have this particular case covered in the
unit tests :(

How familiar are you with the Python 3 split of bytes vs strings
(unicode), and binary versus text handles?
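For readers less familiar with that split, here is a minimal Python 3 sketch using only the standard library. It shows that the expat parser's ParseFile method accepts a binary handle and raises the same kind of TypeError when given a text handle. The XML snippet and the helper name parse_names are made up for illustration; this is not the Bio.Entrez code.

```python
import io
import xml.parsers.expat

XML_BYTES = b"<root><item>demo</item></root>"


def parse_names(handle):
    """Parse XML from a file-like handle, returning the element names seen."""
    names = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    # ParseFile calls handle.read(), which must return bytes under Python 3.
    parser.ParseFile(handle)
    return names


# A binary handle: read() returns bytes, so expat is happy.
print(parse_names(io.BytesIO(XML_BYTES)))
# prints ['root', 'item']

# A text handle: read() returns str, so expat raises TypeError.
try:
    parse_names(io.StringIO(XML_BYTES.decode("ascii")))
    text_ok = True
except TypeError as err:
    text_ok = False
    print("text handle rejected:", err)
```

Line-oriented text parsers are the opposite case: they generally want str from read(), which is why a bytes-to-text wrapper is useful for them but not for an expat-based parser.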
Peter

From bjorn_johansson at bio.uminho.pt Fri Jul 27 08:03:41 2012
From: bjorn_johansson at bio.uminho.pt (Björn Johansson)
Date: Fri, 27 Jul 2012 09:03:41 +0100
Subject: [Biopython] Restriction cutting SeqRecord objects
Message-ID:

Hi,
Does restriction with Bio.Restriction only work for Seq or MutableSeq
objects? I would like to digest SeqRecord objects and still keep the
relevant features of the sequences. Did anyone perhaps implement
something like this?

One way would be to subclass the restriction enzymes, but they are
created dynamically so I am not sure if this is a good idea.

btw is the biopython site down?

thanks,
bjorn

--
______O_________oO________oO______o_______oO__
Björn Johansson
Assistant Professor
Department of Biology
University of Minho
Campus de Gualtar
4710-057 Braga
PORTUGAL
www.bio.uminho.pt
Google profile
metabolicengineeringgroup
Work (direct) +351-253 601517 | mob. +351-967 147 704 | mob. (SWE) 0739 792 968
Dept of Biology (secr) +351-253 60 4310 | fax +351-253 678980

From dilara.ally at gmail.com Thu Jul 26 17:48:44 2012
From: dilara.ally at gmail.com (Dilara Ally)
Date: Thu, 26 Jul 2012 10:48:44 -0700
Subject: [Biopython] matching headers and then writing the seq record
Message-ID:

Hi Everyone,

I'm interested in finding headers that match (in other words paired reads)
in two different fastq files. Once the common headers are found, I then go
back to the original fastq file and write those matched reads to a different
fastq file. Right now, the part of the code that runs really slowly is the
headers_read1 and headers_read2 lines. I was wondering if there is a more
elegant and time-efficient way than what I have done. It seems as if set
undoes the elegance of using a generator. Any advice is greatly appreciated!
Here is the code:

def get_header(seq_record):
    fields = seq_record.id.split(':')
    lastfield = fields[6].split('_')[0]
    return lastfield

def get_full_header(seq_record):
    fields = seq_record.id.split(':')
    headerInfo2 = fields[6].split('_')[0]
    headerInfo = str(fields[0]) + ":" + str(fields[1]) + ":" + str(fields[2]) + ":" + str(fields[3]) + ":" + str(fields[4]) + ":" + str(fields[5]) + ":" + str(headerInfo2)
    return headerInfo

def replace_header(seq_record,pairType):
    if pairType == 1:
        ending = "/1"
    elif pairType == 2:
        ending = "/2"
    seq_record.id=seq_record.id+ending
    seq_record.name = ""
    seq_record.description = ""
    return seq_record

def matched_records(records, pairType, header_matches):
    for rec in records:
        id = get_header(rec)
        result = id in header_matches
        #print result
        if (result == True):
            newrec = replace_header(rec,pairType)
            yield newrec

import sys
from Bio import SeqIO

headers_read1 = set(get_header(seq_record) for seq_record in SeqIO.parse(sys.argv[1], "fastq"))
headers_read2 = set(get_header(seq_record) for seq_record in SeqIO.parse(sys.argv[2], "fastq"))
header_matches = [x for x in headers_read1 if x in headers_read2]

records = SeqIO.parse(sys.argv[1], "fastq")
pairType = 1
count = SeqIO.write(matched_records(records,pairType,header_matches), sys.argv[3], "fastq")
print "Saved %i matched reads." %count

records = SeqIO.parse(sys.argv[2], "fastq")
pairType = 2
count = SeqIO.write(matched_records(records,pairType,header_matches), sys.argv[4], "fastq")
print "Saved %i matched reads." %count

From cartealy at yahoo.co.id Fri Jul 27 06:30:58 2012
From: cartealy at yahoo.co.id (Imam Cartealy)
Date: Fri, 27 Jul 2012 14:30:58 +0800 (SGT)
Subject: [Biopython] Is biopython.org down ?
Message-ID: <1343370658.77420.YahooMailNeo@web190503.mail.sg3.yahoo.com>

Hi everyone,

I am having trouble accessing biopython.org for the last 2 days. Is biopython.org down ?

Cheers

ic
Imam Cartealy Center for Biotechnology - BPPT Indonesia From idoerg at gmail.com Sat Jul 28 20:19:01 2012 From: idoerg at gmail.com (Iddo Friedberg) Date: Sat, 28 Jul 2012 16:19:01 -0400 Subject: [Biopython] Is biopython.org down ? In-Reply-To: <1343370658.77420.YahooMailNeo@web190503.mail.sg3.yahoo.com> References: <1343370658.77420.YahooMailNeo@web190503.mail.sg3.yahoo.com> Message-ID: Has been down for the past couple of days, but it is up now. On Fri, Jul 27, 2012 at 2:30 AM, Imam Cartealy wrote: > Hi everyone, > > I am having trouble accessing biopython.org for the last 2 days. Is > biopython.org down ? > > Cheers > > ic > > > Imam Cartealy > Center for Biotechnology - BPPT > Indonesia > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From p.j.a.cock at googlemail.com Sat Jul 28 20:48:32 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 28 Jul 2012 21:48:32 +0100 Subject: [Biopython] matching headers and then writing the seq record In-Reply-To: References: Message-ID: On Thu, Jul 26, 2012 at 6:48 PM, Dilara Ally wrote: > ... It seems as if set undoes the elegance of using a generator. > Any advice is greatly appreciated! ... > > headers_read1 = set(...) > headers_read2 = set(...) > header_matches = [x for x in headers_read1 if x in headers_read2] I would expect that using the built in set's intersection operation would be faster than this list comprehension solution to create header_matches. 
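A minimal Python 3 sketch of the difference, using made-up header strings in place of values parsed from FASTQ files. Both forms find the same shared headers; the set form also leaves a set behind, which makes the later membership tests cheap.

```python
# Hypothetical read identifiers standing in for get_header(...) results.
headers_read1 = {"1101", "1102", "1103", "1104"}
headers_read2 = {"1102", "1104", "1105"}

# List comprehension: correct, but the result is a list, so every later
# "x in matches_list" test scans the list from the start.
matches_list = [x for x in headers_read1 if x in headers_read2]

# Set intersection: same elements, and the result is itself a set, so
# later membership tests are hash lookups.
header_matches = headers_read1.intersection(headers_read2)

print(sorted(header_matches))
# prints ['1102', '1104']
```

The same result can be written with the & operator, as in headers_read1 & headers_read2.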
Also, you should use a set not a list for header_matches because testing
membership with a set is much faster than a list. i.e. Try:

header_matches = headers_read1.intersection(headers_read2)

This might be a tiny change, but I expect it to be noticeably faster.

Also, here:

> def matched_records(records, pairType, header_matches):
>     for rec in records:
>         id = get_header(rec)
>         result = id in header_matches
>         if (result == True):
>             newrec = replace_header(rec,pairType)
>             yield newrec

If you don't mind my style comments, you don't really need
to create the variables 'id' and 'result', and 'newrec' - I would
just do:

def matched_records(records, pairType, header_matches):
    for rec in records:
        if get_header(rec) in header_matches:
            yield replace_header(rec,pairType)

And at that point you could write the whole thing as a
generator expression, which you may or may not find
more pleasing (I'm not sure if it makes any significant
difference to the speed). i.e.

records = SeqIO.parse(sys.argv[1], "fastq")
pairType = 1
wanted = (replace_header(rec,pairType) \
          for rec in records \
          if get_header(rec) in header_matches)
count = SeqIO.write(wanted, sys.argv[3], "fastq")

I hope that helps,

Peter

From aclark at aclark.net Sat Jul 28 23:45:10 2012
From: aclark at aclark.net (Alex Clark)
Date: Sat, 28 Jul 2012 19:45:10 -0400
Subject: [Biopython] ANN: pythonpackages.com beta
Message-ID:

Hi biological computation folks,

I am reaching out to various Python-related programming communities in
order to offer new help packaging your software.

If you have ever struggled with packaging and releasing Python software
(e.g. to PyPI), please check out this service:

- http://pythonpackages.com

The basic idea is to automate packaging by checking out code, testing, and uploading (e.g.
to PyPI) all through the web, as explained in this introduction: - http://docs.pythonpackages.com/en/latest/introduction.html Also, I will be available to answer your Python packaging questions most days/nights in #pythonpackages on irc.freenode.net. Hope to meet/talk with all of you soon. Alex -- Alex Clark ? http://pythonpackages.com/ONE_CLICK From dilara.ally at gmail.com Tue Jul 31 18:53:27 2012 From: dilara.ally at gmail.com (Dilara Ally) Date: Tue, 31 Jul 2012 11:53:27 -0700 Subject: [Biopython] matching headers and then writing the seq record In-Reply-To: References: Message-ID: Thanks Peter it sped it up considerably! I appreciate the fast replies on this listserv. On Jul 28, 2012, at 1:48 PM, Peter Cock wrote: > On Thu, Jul 26, 2012 at 6:48 PM, Dilara Ally wrote: >> ... It seems as if set undoes the elegance of using a generator. >> Any advice is greatly appreciated! ... >> >> headers_read1 = set(...) >> headers_read2 = set(...) >> header_matches = [x for x in headers_read1 if x in headers_read2] > > I would expect that using the built in set's intersection operation would > be faster than this list comprehension solution to create header_matches. > > Also, you should use a set not a list for header_matches because testing > membership with a set is much faster than a list. i.e. Try: > > header_matches = headers_read1.intersection(headers_read2) > > This might be a tiny change, but I expect it to be noticeably faster. 
> > Also, here: > >> def matched_records(records, pairType, header_matches): >> for rec in records: >> id = get_header(rec) >> result = id in header_matches >> if (result == True): >> newrec = replace_header(rec,pairType) >> yield newrec > > If you don't mind my style comments, you don't really need > to create the variables 'id' and 'result', and 'newrec' - I would > just do: > > def matched_records(records, pairType, header_matches): > for rec in records: > if get_header(rec) in header_matches: > yield replace_header(rec,pairType) > > And at that point you could write the whole thing as a > generator expression, which you may or may not find > more pleasing (I'm not sure if it makes any significant > difference to the speed). i.e. > > records = SeqIO.parse(sys.argv[1], "fastq") > pairType = 1 > wanted = (replace_header(rec,pairType) \ > for rec in records \ > if get_header(rec) in header_matches) > count = SeqIO.write(wanted, sys.argv[3], "fastq") > > I hope that helps, > > Peter From devaniranjan at gmail.com Tue Jul 31 19:24:34 2012 From: devaniranjan at gmail.com (George Devaniranjan) Date: Tue, 31 Jul 2012 15:24:34 -0400 Subject: [Biopython] Mocapy Message-ID: I was wondering if Mocapy is part of Biopython. I thought it was but I cannot find it in my biopython PDB folder. Thank you, George From eric.talevich at gmail.com Tue Jul 31 21:55:21 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 31 Jul 2012 17:55:21 -0400 Subject: [Biopython] Mocapy In-Reply-To: References: Message-ID: On Tue, Jul 31, 2012 at 3:24 PM, George Devaniranjan wrote: > I was wondering if Mocapy is part of Biopython. > > I thought it was but I cannot find it in my biopython PDB folder. > > Hi George, No, Mocapy++ is a separate project: http://sourceforge.net/projects/mocapy/ There is a branch to add some integration with Mocapy++ to Biopython, but we're waiting for the next stable release of Mocapy++ before merging it: https://github.com/mchelem/biopython -Eric