From p.j.a.cock at googlemail.com  Mon Jul  2 07:27:08 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 2 Jul 2012 12:27:08 +0100
Subject: [Biopython] Back translation support in Biopython
In-Reply-To: <CAKVJ-_4RrR-QnTJYJEGzeJwdNnYx_n7Rt202C18ij2UVQJrxBg@mail.gmail.com>
References: <SNT122-W6253A9D0A58B2040858C45C44F0@phx.gbl>
	<CAKVJ-_4A=7ptD+6NNVLzQs_XW64FjYypyYw=0Xb4LT6uXW+tJQ@mail.gmail.com>
	<SNT122-W175393282884586982B4ACC44D0@phx.gbl>
	<CAMC681nDmW4__aFV=2OkdEXqiWFhWGTGdo2M4DsEYPzppuwF7g@mail.gmail.com>
	<CAKVJ-_4RrR-QnTJYJEGzeJwdNnYx_n7Rt202C18ij2UVQJrxBg@mail.gmail.com>
Message-ID: <CAKVJ-_4WsEOyy=6jBMhZYPZ8X9G_+abYnHi5zDjh1EYJhYpAJw@mail.gmail.com>

On Wed, Apr 4, 2012 at 4:02 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Wed, Apr 4, 2012 at 2:49 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
>> Hi Igor,
>>
>> It sounds like you're referring to aligning amino acid sequences to codon
>> sequences, as PAL2NAL does. This is different from what most people mean by
>> back translation, but as you point out, certainly useful.
>>
>> If you write a function that can match a protein sequence alignment to a set
>> of raw CDS sequences, returning a nucleotide alignment based on the
>> codon-to-amino-acid mapping, that would be useful. However, PAL2NAL does
>> exactly that, plus a bit more, and is a fairly well-known and easily
>> obtained program. Personally, I would prefer to write a wrapper for PAL2NAL
>> under Bio.Align.Applications, using the existing Bio.Applications framework.
>
> As per the old thread, a simple function in Python taking the gapped protein
> sequence, original nucleotide coding sequence, and the translation table
> does sound useful. Then using that, you could go from a protein alignment
> plus the original nucleotide coding sequences to a codon alignment, or
> other tasks. Given this is all relatively straightforward string manipulation
> and we already have the required genetic code tables in Biopython, I'm not
> convinced that wrapping PAL2NAL would be the best solution (for this sub
> task).

Hi Igor,

Did you do any work on back-translation (alignment threading) in Biopython?

We needed to do this locally, and for some reason (yet to be determined)
T-COFFEE wasn't working on our dataset, so I made a start at a Biopython
implementation:

https://github.com/peterjc/biopython/tree/back_trans
https://github.com/peterjc/biopython/commit/7d14cdb59bb9d41c727c923c8aa7e3dda7779c80

Currently just one commit adding a Bio.Align.alignment_back_translate(...)
function which takes a protein alignment and dictionary of nucleotide
records - easy to get with Bio.SeqIO and Bio.AlignIO - with a stand alone
example included in the doctest. There is also a new (currently private)
function to do this for one sequence pair - perhaps useful on its own?

There are potential complications with ID mapping between the proteins
and nucleotides, thus the option of a key function, and the gap characters
(would you ever want to use different gap characters in the protein and
nucleotide alignments?). We could discuss implementation details over
on the biopython-dev list, but the general API discussion might as well
be here. e.g. Where to put the function and what to call it.
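
A minimal sketch of the single-pair step described above, for illustration only. The function name thread_codons and its signature are hypothetical and are not taken from the branch linked above. It uses one gap character for both sequences and, unlike a full implementation, does not check the codons against a translation table.

```python
def thread_codons(gapped_protein, cds, gap="-"):
    """Thread an ungapped coding sequence onto a gapped protein sequence.

    Each protein gap becomes a three-letter nucleotide gap; each residue
    consumes the next codon. Hypothetical helper, not Biopython API.
    """
    codons = []
    pos = 0  # offset into the ungapped nucleotide sequence
    for residue in gapped_protein:
        if residue == gap:
            codons.append(gap * 3)
        else:
            codon = cds[pos:pos + 3]
            if len(codon) != 3:
                raise ValueError("Nucleotide sequence is too short for the protein")
            codons.append(codon)
            pos += 3
    # Tolerate one untranslated trailing codon (e.g. a stop), nothing more.
    if len(cds) - pos not in (0, 3):
        raise ValueError("Nucleotide sequence is too long for the protein")
    return "".join(codons)


print(thread_codons("M-K", "ATGAAA"))
```

Applying this to every row of a protein alignment, looking up each nucleotide record by ID, would give the codon alignment.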

Regards,

Peter

From from.d.putto at gmail.com  Mon Jul  2 08:21:46 2012
From: from.d.putto at gmail.com (Sheila the angel)
Date: Mon, 2 Jul 2012 14:21:46 +0200
Subject: [Biopython] searching homologene database
Message-ID: <CAFinXcQrh4y4=yWxzhp3jQGuqqjQTs9dxddP1sPLdHy5sMV3Vg@mail.gmail.com>

To search tp53 homolog in homologene database -

from Bio import Entrez

handle = Entrez.esearch(db="homologene", term="tp53[gene name] AND Homo sapiens[orgn]")
record = Entrez.read(handle)
handle = Entrez.efetch(db="homologene", id=record['IdList'])
record = handle.read()
print(record)

I think the record is in ASN.1 format. How can I read it, or convert it to
the genes/proteins table shown in the web result?
http://www.ncbi.nlm.nih.gov/homologene/460

Thanks

--
Sheila

From w.arindrarto at gmail.com  Mon Jul  2 08:39:31 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Mon, 2 Jul 2012 14:39:31 +0200
Subject: [Biopython] searching homologene database
In-Reply-To: <CAFinXcQrh4y4=yWxzhp3jQGuqqjQTs9dxddP1sPLdHy5sMV3Vg@mail.gmail.com>
References: <CAFinXcQrh4y4=yWxzhp3jQGuqqjQTs9dxddP1sPLdHy5sMV3Vg@mail.gmail.com>
Message-ID: <CADEGkF7Cn++8Zk8c-GXrtgGNfU=WMHvwUY-m0M58e5BwfA=rQg@mail.gmail.com>

Hi Sheila,

You can set the 'retmode' parameter in order to specify your preferred
format. I'm not sure if NCBI provides an output format exactly like
the one you see on their site, but instead of ASN.1 you can specify a
more common format like XML.

In your case, the call would be this (for XML, let's say):

handle = Entrez.efetch(db="homologene", id=record['IdList'], retmode="xml")

For a list of possible retmode values, you can look them up here:
http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch (see the
explanation about 'retmode').

If you want to format the output further, you can use modules like the
built-in elementtree or 3rd party modules like lxml to extract the tag
values and feed them to your script / program.
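
As a rough sketch of that last step, this is how tag values can be pulled out with ElementTree. The XML and the tag names (Entries, Entry, Gene, Taxon) are made up for illustration; the real HomoloGene output has its own schema, so the names would need adjusting.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML standing in for the efetch output; the tag names
# are invented and do not match the real HomoloGene schema.
xml_text = """<Entries>
  <Entry><Gene>TP53</Gene><Taxon>Homo sapiens</Taxon></Entry>
  <Entry><Gene>Trp53</Gene><Taxon>Mus musculus</Taxon></Entry>
</Entries>"""

root = ET.fromstring(xml_text)

# Collect one (gene, taxon) row per Entry element.
rows = [(entry.findtext("Gene"), entry.findtext("Taxon"))
        for entry in root.iter("Entry")]

for gene, taxon in rows:
    print("%s\t%s" % (gene, taxon))
```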

Hope that helps,
Bow


On Mon, Jul 2, 2012 at 2:21 PM, Sheila the angel <from.d.putto at gmail.com> wrote:
> To search tp53 homolog in homologene database -
>
> handle = Entrez.esearch(db="homologene", term="tp53[gene name] AND Homo sapiens[orgn]")
> record = Entrez.read(handle)
> handle = Entrez.efetch(db="homologene", id=record['IdList'])
> record = handle.read()
> print record
>
> I think record is asn.1 format !! how can I read or convert it in the genes
> protein table (as we see in the web result)
> http://www.ncbi.nlm.nih.gov/homologene/460
>
> Thanks
>
> --
> Sheila
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From from.d.putto at gmail.com  Tue Jul 10 09:16:25 2012
From: from.d.putto at gmail.com (Sheila the angel)
Date: Tue, 10 Jul 2012 15:16:25 +0200
Subject: [Biopython] access Uniprot record by different ids
Message-ID: <CAFinXcTHhTH8ra0NPD_30DkTUCafWj8efqcMMH-QtUX9dEUrbA@mail.gmail.com>

I have a list of UniProt accessions (ACs), some primary and some
secondary. The function
my_dict = SeqIO.index("uniprot_sprot.dat", "swiss")
builds a dictionary of the UniProt data, but I can only access a record by
its primary AC.
my_dict['P04637']  # gives the record
my_dict['Q15086'] # KeyError
my_dict['P53_HUMAN'] # KeyError

Is it possible to access the same record by both primary and secondary ACs
(and by UniProt ID)?

From p.j.a.cock at googlemail.com  Tue Jul 10 09:43:31 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 10 Jul 2012 14:43:31 +0100
Subject: [Biopython] access Uniprot record by different ids
In-Reply-To: <CAFinXcTHhTH8ra0NPD_30DkTUCafWj8efqcMMH-QtUX9dEUrbA@mail.gmail.com>
References: <CAFinXcTHhTH8ra0NPD_30DkTUCafWj8efqcMMH-QtUX9dEUrbA@mail.gmail.com>
Message-ID: <CAKVJ-_4J8xrFgyz_2bw__sRH3Okzo2ubstuqpRSnpy8tE4YQzg@mail.gmail.com>

On Tue, Jul 10, 2012 at 2:16 PM, Sheila the angel
<from.d.putto at gmail.com> wrote:
> I have a Uniprot AC list in which some AC are primary and some are
> secondary. The function
> my_dict = SeqIO.index("uniprot_sprot.dat", "swiss")
> makes dictionary of  uniprot data but I can access a record only by primary
> AC.
> my_dict['P04637']  # gives the record
> my_dict['Q15086'] # KeyError
> my_dict['P53_HUMAN'] # KeyError
>
> Is it possible to access same record by both primary and secondary ACs
> (and by uniprot ID) ?

Not directly with Bio.SeqIO.index() or Bio.SeqIO.index_db(), no.
You could perhaps use a second dictionary mapping aliases to
the primary ID?
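
One way such a second dictionary could be built is sketched below. It scans the ID and AC lines of a Swiss-Prot flat file directly. The helper name build_alias_map is hypothetical, and the sample entry is cut down to the lines the function reads. If a secondary accession occurs in more than one entry, the later entry simply overwrites the earlier one here.

```python
import io


def build_alias_map(handle):
    """Map each accession and entry name to the entry's primary accession."""
    aliases = {}
    entry_name = None
    accessions = []
    for line in handle:
        if line.startswith("ID   "):
            entry_name = line.split()[1]
        elif line.startswith("AC   "):
            # AC lines hold semicolon-separated accessions, primary first.
            accessions.extend(ac.strip() for ac in line[5:].split(";") if ac.strip())
        elif line.startswith("//"):
            # End of entry: register every alias against the primary accession.
            if accessions:
                primary = accessions[0]
                for alias in accessions + [entry_name]:
                    if alias is not None:
                        aliases[alias] = primary
            entry_name, accessions = None, []
    return aliases


# Cut-down sample entry containing only the lines used above.
sample = io.StringIO(
    "ID   P53_HUMAN               Reviewed;         393 AA.\n"
    "AC   P04637; Q15086;\n"
    "//\n"
)
aliases = build_alias_map(sample)
print(aliases["Q15086"])
```

A lookup would then be my_dict[aliases[some_id]], where my_dict is the SeqIO.index() dictionary.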

Peter

From n.j.loman at bham.ac.uk  Wed Jul 11 11:02:00 2012
From: n.j.loman at bham.ac.uk (Nick Loman)
Date: Wed, 11 Jul 2012 16:02:00 +0100
Subject: [Biopython] SeqRecord substring should return SeqRecord or
	character?
Message-ID: <CAFMxBqGRgBkVHZcDSkfCFkiOC85BqeddbhZ+dWahELAqnkJzpA@mail.gmail.com>

Hi there

I wanted to add the last character of a SeqRecord s1 to another
SeqRecord s2. However s1[-1] + s2 fails because s1[-1] returns a
string rather than a SeqRecord just containing a single base and
associated annotations. I have to do s1[-1:] to get a sliced
SeqRecord.

Is this behaviour intentional? I kind of assumed I would always get a
SeqRecord from any given slice, and it seems weird to get just a
string back instead, although no doubt there's a good reason for this.

Cheers

Nick

From p.j.a.cock at googlemail.com  Wed Jul 11 11:21:00 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 11 Jul 2012 16:21:00 +0100
Subject: [Biopython] SeqRecord substring should return SeqRecord or
	character?
In-Reply-To: <CAFMxBqGRgBkVHZcDSkfCFkiOC85BqeddbhZ+dWahELAqnkJzpA@mail.gmail.com>
References: <CAFMxBqGRgBkVHZcDSkfCFkiOC85BqeddbhZ+dWahELAqnkJzpA@mail.gmail.com>
Message-ID: <CAKVJ-_7ZeBzMANFHfJX9KfO=7XqWsqjTutDnEcudokqWvyfk0Q@mail.gmail.com>

On Wed, Jul 11, 2012 at 4:02 PM, Nick Loman <n.j.loman at bham.ac.uk> wrote:
> Hi there
>
> I wanted to add the last character of a SeqRecord s1 to another
> SeqRecord s2. However s1[-1] + s2 fails because s1[-1] returns a
> string rather than a SeqRecord just containing a single base and
> associated annotations. I have to do s1[-1:] to get a sliced
> SeqRecord.

You should be able to do SeqRecord+string, and string+SeqRecord,
both of which are specifically tested in the docstring. Have you got
any more details? e.g. Version? Mini-example?

> Is this behaviour intentional? I kind of assumed I would always get a
> SeqRecord from any given slice, and it seems weird to get just a
> string back instead, although no doubt there's a good reason for this.

For a single base/residue, the whole SeqRecord overhead does
seem unnecessary. As to why you get a single letter string, not
a single letter Seq, IIRC it was mimicking the Seq object.

Peter

From p.j.a.cock at googlemail.com  Wed Jul 11 11:52:41 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 11 Jul 2012 16:52:41 +0100
Subject: [Biopython] SeqRecord substring should return SeqRecord or
	character?
In-Reply-To: <CAFMxBqEDXCacQEG4jwYbG-b8pODfTX2LtBYoaZACJ=HMv5xqTg@mail.gmail.com>
References: <CAFMxBqGRgBkVHZcDSkfCFkiOC85BqeddbhZ+dWahELAqnkJzpA@mail.gmail.com>
	<620A45B10433AE4C81D3F931A02812F93BC80453CF@LESMBX1.adf.bham.ac.uk>
	<CAFMxBqEDXCacQEG4jwYbG-b8pODfTX2LtBYoaZACJ=HMv5xqTg@mail.gmail.com>
Message-ID: <CAKVJ-_4PnMSAsPee1Vc28Q_gHDXcPpSKDJ0X1WYc0YaVd1V8xg@mail.gmail.com>

On Wed, Jul 11, 2012 at 4:24 PM, Nick Loman <n.j.loman at bham.ac.uk> wrote:
> On Wed, Jul 11, 2012 at 4:21 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> On Wed, Jul 11, 2012 at 4:02 PM, Nick Loman <n.j.loman at bham.ac.uk> wrote:
>>> Hi there
>>>
>>> I wanted to add the last character of a SeqRecord s1 to another
>>> SeqRecord s2. However s1[-1] + s2 fails because s1[-1] returns a
>>> string rather than a SeqRecord just containing a single base and
>>> associated annotations. I have to do s1[-1:] to get a sliced
>>> SeqRecord.
>>
>> You should be able to do SeqRecord+string, and string+SeqRecord,
>> both of which are specifically tested in the docstring. Have you got
>> any more details? e.g. Version? Mini-example?
>
> Hi Peter,
>
> It was doing this on a FASTQ record so it's the missing quality
> annotation that cause the problem when trying to do this.

Ah - so the addition should have worked, but you'd lose the
partial quality string. You're stuck with ensuring you have two
SeqRecords, so as you suggested rather than s1[-1]+s2 please
use s1[-1:]+s2 instead. Slightly less clear, but only one character more.

This actually reminds me of similar behaviour with the bytes
string in Python 3, where the same trick is required to get a
single letter bytes string.
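
For reference, the Python 3 bytes behaviour mentioned above looks like this; the variable names are only for illustration.

```python
seq = b"ACGT"

last_as_int = seq[-1]      # indexing a bytes string gives an integer
last_as_bytes = seq[-1:]   # slicing gives a bytes string of length one

print(type(last_as_int).__name__, type(last_as_bytes).__name__)
```

The SeqRecord case is analogous: s1[-1] gives a plain single-letter string, while s1[-1:] gives a one-letter SeqRecord that keeps its per-letter annotation.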

>>> Is this behaviour intentional? I kind of assumed I would always get a
>>> SeqRecord from any given slice, and it seems weird to get just a
>>> string back instead, although no doubt there's a good reason for this.
>>
>> For a single base/residue, the whole SeqRecord overhead does
>> seem unnecessary. As to why you get a single letter string, not
>> a single letter Seq, IIRC it was mimicking the Seq object.
>
> Yes, I guessed the overhead was likely to be the reason ..  not sure
> if there's a satisfactory solution?

Returning a single letter SeqRecord might have been a better
choice, and going back much further in Biopython's history the
Seq object should probably have returned a single letter Seq
(not a single letter string). There is a similar issue with the
columns of an alignment.

Peter

From wheatontrue at gmail.com  Thu Jul 12 04:53:26 2012
From: wheatontrue at gmail.com (Wheaton Little)
Date: Thu, 12 Jul 2012 16:53:26 +0800
Subject: [Biopython] can I use the xml parser in biopython on other xml
	files? how?
Message-ID: <CALntoc0YV+fO0ua1_E=8F84Q6bfU5t7L84e6M_zkJJvw4f5=RA@mail.gmail.com>

I would like to use the Biopython xml parser, if possible, on google
patent xmls:

http://www.google.com/googlebooks/uspto-patents-applications-text.html

unfortunately, this is what I get:

>>> t=open('ipa111229.xml','r').read()
>>> import Bio
>>> ttt=Bio.Entrez.read(t[:30000])

Traceback (most recent call last):
  File "<pyshell#20>", line 1, in <module>
    ttt=Bio.Entrez.read(t[:30000])
  File "/Library/Python/2.7/site-packages/Bio/Entrez/__init__.py",
line 351, in read
    record = handler.read(handle)
  File "/Library/Python/2.7/site-packages/Bio/Entrez/Parser.py", line
169, in read
    self.parser.ParseFile(handle)
TypeError: argument must have 'read' attribute

What would I have to do to use the parser on this xml?

From b.invergo at gmail.com  Thu Jul 12 05:25:27 2012
From: b.invergo at gmail.com (Brandon Invergo)
Date: Thu, 12 Jul 2012 11:25:27 +0200
Subject: [Biopython] can I use the xml parser in biopython on other xml
 files? how?
In-Reply-To: <CALntoc0YV+fO0ua1_E=8F84Q6bfU5t7L84e6M_zkJJvw4f5=RA@mail.gmail.com>
References: <CALntoc0YV+fO0ua1_E=8F84Q6bfU5t7L84e6M_zkJJvw4f5=RA@mail.gmail.com>
Message-ID: <1342085127.614.10.camel@localhost.localdomain>

With regards to the error that you receive, it's because you're trying
to `read()` a string, when that method requires a file-like object. This
would fix that:
>>> ttt=Bio.Entrez.read(open('ipa111229.xml', 'r'))

However, that wouldn't work because it requires a DTD from NCBI to read
the file.

Why not use one of Python's standard XML libraries (xml.sax, xml.dom
or xml.dom.minidom)?

-brandon

On Thu, 2012-07-12 at 16:53 +0800, Wheaton Little wrote:
> I would like to use the Biopython xml parser, if possible, on google
> patent xmls:
> 
> http://www.google.com/googlebooks/uspto-patents-applications-text.html
> 
> unfortunately, this is what I get:
> 
> >>> t=open('ipa111229.xml','r').read()
> >>> import Bio
> >>> ttt=Bio.Entrez.read(t[:30000])
> 
> Traceback (most recent call last):
>   File "<pyshell#20>", line 1, in <module>
>     ttt=Bio.Entrez.read(t[:30000])
>   File "/Library/Python/2.7/site-packages/Bio/Entrez/__init__.py",
> line 351, in read
>     record = handler.read(handle)
>   File "/Library/Python/2.7/site-packages/Bio/Entrez/Parser.py", line
> 169, in read
>     self.parser.ParseFile(handle)
> TypeError: argument must have 'read' attribute
> 
> What would I have to do to use the parser on this xml?


From p.j.a.cock at googlemail.com  Thu Jul 12 05:35:09 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 12 Jul 2012 10:35:09 +0100
Subject: [Biopython] can I use the xml parser in biopython on other xml
 files? how?
In-Reply-To: <CALntoc0YV+fO0ua1_E=8F84Q6bfU5t7L84e6M_zkJJvw4f5=RA@mail.gmail.com>
References: <CALntoc0YV+fO0ua1_E=8F84Q6bfU5t7L84e6M_zkJJvw4f5=RA@mail.gmail.com>
Message-ID: <CAKVJ-_4ykiFE1QUWKoWX3xTKgXQ5+9KJ1Hkj7zHh7DWcXpdppw@mail.gmail.com>

On Thu, Jul 12, 2012 at 9:53 AM, Wheaton Little <wheatontrue at gmail.com> wrote:
> I would like to use the Biopython xml parser, if possible, on google
> patent xmls:
>
> http://www.google.com/googlebooks/uspto-patents-applications-text.html
>
> unfortunately, this is what I get:
>
>>>> t=open('ipa111229.xml','r').read()
>>>> import Bio
>>>> ttt=Bio.Entrez.read(t[:30000])
>
> Traceback (most recent call last):
>   ...
> TypeError: argument must have 'read' attribute
>
> What would I have to do to use the parser on this xml?

In your example, you opened the file and read all the data into a
string (variable t).

The parser is not expecting a string, but a handle. String objects
don't have a 'read' method, thus this error message.

You could 'fix' this particular error by doing:

handle=open('ipa111229.xml','r')
from Bio import Entrez
ttt=Entrez.read(handle)

However, I doubt this will work as the Entrez parser is intended to be
used with the NCBI XML files only.

Python comes with several XML libraries in the standard library.
ElementTree (or cElementTree) is quite popular, but as Brandon points
out there are also DOM and SAX style parsers.

Peter
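
To illustrate the string versus handle distinction and the ElementTree suggestion, here is a small sketch. The XML is a made-up stand-in, not the layout of the patent files, and the tag names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Made-up stand-in for the file contents; tag names are hypothetical.
text = "<patent><title>Example</title></patent>"

# A plain string has no read() method, which is what the TypeError reports.
string_has_read = hasattr(text, "read")

# Wrapping the string in StringIO gives a handle (a file-like object).
handle = io.StringIO(text)
handle_has_read = hasattr(handle, "read")

# ElementTree accepts a handle (or a filename) and needs no NCBI DTD.
tree = ET.parse(handle)
title = tree.getroot().findtext("title")
print(string_has_read, handle_has_read, title)
```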

From from.d.putto at gmail.com  Thu Jul 12 07:06:58 2012
From: from.d.putto at gmail.com (Sheila the angel)
Date: Thu, 12 Jul 2012 13:06:58 +0200
Subject: [Biopython] access Uniprot record by different ids
In-Reply-To: <CAKVJ-_4J8xrFgyz_2bw__sRH3Okzo2ubstuqpRSnpy8tE4YQzg@mail.gmail.com>
References: <CAFinXcTHhTH8ra0NPD_30DkTUCafWj8efqcMMH-QtUX9dEUrbA@mail.gmail.com>
	<CAKVJ-_4J8xrFgyz_2bw__sRH3Okzo2ubstuqpRSnpy8tE4YQzg@mail.gmail.com>
Message-ID: <CAFinXcShsVE8FDFMNX05KDiYjuyq7_mumy37wKXnzy-BwfNFiQ@mail.gmail.com>

Thanks for the reply.
I have now made two dictionaries, one for uniprot_sprot.dat and another
mapping secondary IDs to primary IDs.
However, it takes too long to build them and I can't pickle my_dict.
I would like to know whether it is possible to dump my_dict (the
uniprot.dat data) to a MySQL database.
I looked at the Biopython BioSQL page but didn't understand much (I am new
to SQL).
Thanks

--
Sheila


On Tue, Jul 10, 2012 at 3:43 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Tue, Jul 10, 2012 at 2:16 PM, Sheila the angel
> <from.d.putto at gmail.com> wrote:
> > I have a Uniprot AC list in which some AC are primary and some are
> > secondary. The function
> > my_dict = SeqIO.index("uniprot_sprot.dat", "swiss")
> > makes dictionary of  uniprot data but I can access a record only by primary
> > AC.
> > my_dict['P04637']  # gives the record
> > my_dict['Q15086'] # KeyError
> > my_dict['P53_HUMAN'] # KeyError
> >
> > Is it possible to access same record by both primary and secondary ACs
> > (and by uniprot ID) ?
>
> Not directly with Bio.SeqIO.index() or Bio.SeqIO.index_db(), no.
> You could perhaps use a second dictionary mapping aliases to
> the primary ID?
>
> Peter
>

From wheatontrue at gmail.com  Thu Jul 12 07:57:51 2012
From: wheatontrue at gmail.com (Wheaton Little)
Date: Thu, 12 Jul 2012 19:57:51 +0800
Subject: [Biopython] can I use the xml parser in biopython on other xml
 files? how?
In-Reply-To: <CAKVJ-_4ykiFE1QUWKoWX3xTKgXQ5+9KJ1Hkj7zHh7DWcXpdppw@mail.gmail.com>
References: <CALntoc0YV+fO0ua1_E=8F84Q6bfU5t7L84e6M_zkJJvw4f5=RA@mail.gmail.com>
	<CAKVJ-_4ykiFE1QUWKoWX3xTKgXQ5+9KJ1Hkj7zHh7DWcXpdppw@mail.gmail.com>
Message-ID: <CALntoc1R5_oCYCHNZAJ6g26G3e4EujSg+hy0GZ+xA_Sqi+28Yg@mail.gmail.com>

Indeed, it didn't like that.  Using BeautifulSoup seems to work but
not sure how well...

Thanks for the advice, all!

On Thu, Jul 12, 2012 at 5:35 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Thu, Jul 12, 2012 at 9:53 AM, Wheaton Little <wheatontrue at gmail.com> wrote:
>> I would like to use the Biopython xml parser, if possible, on google
>> patent xmls:
>>
>> http://www.google.com/googlebooks/uspto-patents-applications-text.html
>>
>> unfortunately, this is what I get:
>>
>>>>> t=open('ipa111229.xml','r').read()
>>>>> import Bio
>>>>> ttt=Bio.Entrez.read(t[:30000])
>>
>> Traceback (most recent call last):
>>   ...
>> TypeError: argument must have 'read' attribute
>>
>> What would I have to do to use the parser on this xml?
>
> In your example, you opened the file and read all the data into a
> string (variable t).
>
> The parser is not expecting a string, but a handle. String objects
> don't have a 'read' method, thus this error message.
>
> You could 'fix' this particular error by doing:
>
> handle=open('ipa111229.xml','r')
> from Bio import Entrez
> ttt=Entrez.read(handle)
>
> However, I doubt this will work as the Entrez parser is intended to be
> used with the NCBI XML files only.
>
> Python comes with several XML libraries in the standard library.
> ElementTree (or cElementTree) is quite popular, but as Brandon points
> out there are also DOM and SAX style parsers.
>
> Peter

From chaudhrynabeelahmed at gmail.com  Thu Jul 12 08:58:58 2012
From: chaudhrynabeelahmed at gmail.com (Nabeel Ahmed)
Date: Thu, 12 Jul 2012 17:58:58 +0500
Subject: [Biopython] Bioinformatics EMBOSS users
Message-ID: <CAAmMzEsyjvF+B5KWH1nh+-wdTqEJ6P75omoMjFfR27CZoPUJSQ@mail.gmail.com>

I have recently installed EMBOSS-6.4.0 (Ubuntu 11.10).
I am unable to make it work directly with live databases (embl, uniprot) ,
working totally fine with local sequence files.
e.g

% plotorf
Plot potential open reading frames in a nucleotide sequence
Input nucleotide sequence: embl:x13776

Error: Failed to open filename 'embl'

Used 'showdb' , displayed table with zero rows.

Is there any configuration I am missing?

Ahmed

From p.j.a.cock at googlemail.com  Thu Jul 12 14:30:33 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 12 Jul 2012 19:30:33 +0100
Subject: [Biopython] Bioinformatics EMBOSS users
In-Reply-To: <CAAmMzEsyjvF+B5KWH1nh+-wdTqEJ6P75omoMjFfR27CZoPUJSQ@mail.gmail.com>
References: <CAAmMzEsyjvF+B5KWH1nh+-wdTqEJ6P75omoMjFfR27CZoPUJSQ@mail.gmail.com>
Message-ID: <CAKVJ-_6YpUYWPR4FjuXbXfAvEzGzKMmrpvshCU6HjNbVpKaAHA@mail.gmail.com>

On Thu, Jul 12, 2012 at 1:58 PM, Nabeel Ahmed
<chaudhrynabeelahmed at gmail.com> wrote:
> I have recently installed EMBOSS-6.4.0 (Ubuntu 11.10).
> I am unable to make it work directly with live databases (embl, uniprot) ,
> working totally fine with local sequence files.
> e.g
>
> % plotorf
> Plot potential open reading frames in a nucleotide sequence
> Input nucleotide sequence: embl:x13776
>
> Error: Failed to open filename 'embl'
>
> Used 'showdb' , displayed table with zero rows.
>
> Is there any configuration, i am missing??
>
> Ahmed

I'm not sure - but the EMBOSS mailing list would be the place to ask:
http://lists.open-bio.org/mailman/listinfo/emboss

Peter

From p.j.a.cock at googlemail.com  Thu Jul 12 14:37:11 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 12 Jul 2012 19:37:11 +0100
Subject: [Biopython] access Uniprot record by different ids
In-Reply-To: <CAFinXcShsVE8FDFMNX05KDiYjuyq7_mumy37wKXnzy-BwfNFiQ@mail.gmail.com>
References: <CAFinXcTHhTH8ra0NPD_30DkTUCafWj8efqcMMH-QtUX9dEUrbA@mail.gmail.com>
	<CAKVJ-_4J8xrFgyz_2bw__sRH3Okzo2ubstuqpRSnpy8tE4YQzg@mail.gmail.com>
	<CAFinXcShsVE8FDFMNX05KDiYjuyq7_mumy37wKXnzy-BwfNFiQ@mail.gmail.com>
Message-ID: <CAKVJ-_6y5YU-Neu_HHh_bB_PykwskxNNr_2cvVvaBkg26J=g7A@mail.gmail.com>

On Thu, Jul 12, 2012 at 12:06 PM, Sheila the angel
<from.d.putto at gmail.com> wrote:
> Thanks for reply.
> Now I made two dictionary one for uniprot_sprot.dat and another for
> secondary ids to primary ids.
> However it take too long to do this and I can't do Pickle for my_dict.
> I would like to know is it possible to dump my_dict (the uniprot.dat data)
> to MySql database.

Have you tried the Bio.SeqIO.index_db(...) function? This builds
an SQLite database to hold the lookup table of offsets (i.e. the
primary accession only). Creating the index is a little slow, but
reuse is very fast.

For your second dictionary mapping secondary accessions to
the primary accession, you should be able to use pickle.
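
A minimal sketch of pickling the alias dictionary follows. The dictionary contents and the file name aliases.pickle are placeholders. The SeqIO.index_db() call is shown only as a comment because it needs Biopython and the real data file, and the index file name in it is likewise a placeholder.

```python
import os
import pickle
import tempfile

# Placeholder alias dictionary: secondary accession or entry name -> primary.
aliases = {"Q15086": "P04637", "P53_HUMAN": "P04637"}

# Save once (a temporary directory is used here to keep the example tidy).
path = os.path.join(tempfile.mkdtemp(), "aliases.pickle")
with open(path, "wb") as out:
    pickle.dump(aliases, out, protocol=pickle.HIGHEST_PROTOCOL)

# Later runs reload the dictionary instead of rebuilding it.
with open(path, "rb") as inp:
    restored = pickle.load(inp)

# The records themselves would come from the SQLite-backed index, e.g.:
# my_dict = SeqIO.index_db("uniprot_sprot.idx", "uniprot_sprot.dat", "swiss")
# record = my_dict[restored["Q15086"]]
print(restored["Q15086"])
```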

> I looked at biopython-BioSQL page  but didn't understand much
> (I am new to SQL)
> Thanks

BioSQL is a bit complicated to get started with (although
using SQLite is a lot simpler than MySQL or PostgreSQL).

Peter

From livingstonemark at gmail.com  Mon Jul 16 21:49:37 2012
From: livingstonemark at gmail.com (Mark Livingstone)
Date: Tue, 17 Jul 2012 11:49:37 +1000
Subject: [Biopython] The PDBParser Permissive setting
Message-ID: <CABGYGEzR62gNjt8tCX2q0N8EfwU7ruhebLdkjM934JnWsSZ=tg@mail.gmail.com>

Hi Guys,

In my code I am experimenting with different ways of doing RMSD
calculations. I have code which in addition to normal CA based RMSD
can do (CA & CB) RMSD and also sidechain RMSD. On a perfect PDB file
this works well. Unfortunately, the curation I have is fairly average
/ poor in quality :-( and I only find out when one of the liberal
number of Try/Except blocks falls over.

I need a better way to find out sooner if a PDB file is missing data.

I am therefore wondering: if for PDBParser I set PERMISSIVE=0, and
after setting the relevant models and chains etc., I did


wt_atoms = Bio.PDB.Selection.unfold_entities(wtc, 'A')

If this successfully works without throwing an Exception, can I assume
that this unfolded chain is perfect, or are there ways that I could
still be tripped up?

Alternatively, can anyone suggest code that I can employ in my
curation process that will give me a decent sanity check of PDB
quality, so I can get on writing experimental code - and not
Try/Except blocks :-(

Thanks in advance,

MarkL

From anaryin at gmail.com  Tue Jul 17 02:42:09 2012
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Tue, 17 Jul 2012 02:42:09 -0400
Subject: [Biopython] The PDBParser Permissive setting
In-Reply-To: <CABGYGEzR62gNjt8tCX2q0N8EfwU7ruhebLdkjM934JnWsSZ=tg@mail.gmail.com>
References: <CABGYGEzR62gNjt8tCX2q0N8EfwU7ruhebLdkjM934JnWsSZ=tg@mail.gmail.com>
Message-ID: <CAJ9sUYMpVP+sxLvw5MssvY6TGhy-RJzoPFZ0qNOKMue72D9o6w@mail.gmail.com>

Hey Mark,

What kind of validation do you want?

Cheers,

João
No dia 17 de Jul de 2012 02:52, "Mark Livingstone" <
livingstonemark at gmail.com> escreveu:

> Hi Guys,
>
> In my code I am experimenting with different ways of doing RMSD
> calculations. I have code which in addition to normal CA based RMSD
> can do (CA & CB) RMSD and also sidechain RMSD. On a perfect PDB file
> this works well. Unfortunately, the curation I have is fairly average
> / poor in quality :-( and I only find out when one of the liberal
> number of Try/Except blocks falls over.
>
> I need a better way to find out sooner if a PDB file is missing data.
>
> I am wondering therefore is for PDBParser I set Permissive=0, and
> after setting the relevant models and chains etc, I did
>
>
> wt_atoms = Bio.PDB.Selection.unfold_entities(wtc, 'A')
>
> If this successfully works without throwing an Exception, can I assume
> that this unfolded chain is perfect, or are there ways that I could
> still be tripped up?
>
> Alternatively, can anyone suggest code that I can employ in my
> curation process that will give me a decent sanity check of PDB
> quality, so I can get on writing experimental code - and not
> Try/Except blocks :-(
>
> Thanks in advance,
>
> MarkL


From livingstonemark at gmail.com  Tue Jul 17 04:54:50 2012
From: livingstonemark at gmail.com (Mark Livingstone)
Date: Tue, 17 Jul 2012 18:54:50 +1000
Subject: [Biopython] The PDBParser Permissive setting
In-Reply-To: <CAJ9sUYMpVP+sxLvw5MssvY6TGhy-RJzoPFZ0qNOKMue72D9o6w@mail.gmail.com>
References: <CABGYGEzR62gNjt8tCX2q0N8EfwU7ruhebLdkjM934JnWsSZ=tg@mail.gmail.com>
	<CAJ9sUYMpVP+sxLvw5MssvY6TGhy-RJzoPFZ0qNOKMue72D9o6w@mail.gmail.com>
Message-ID: <CABGYGEzRbB1FLH4TSU1_DQu_V6zUFikBe948uO1Zug+eLA0cHw@mail.gmail.com>

Hi João,

I guess it would be good if I could get a data structure that had no
discontinuities, no missing data points or unknowns. I would be able
to tell it to ignore HOH or other irrelevancies.

My use case as I mentioned is RMSD and similar algorithms, so one
continuous structure with all the data attached that I can iterate
through, selecting atoms / residues as needed, and get the names and
coordinates as I go.

So I guess I want a PDB Diagnostic type program to allow me to find
exemplary PDB files to use during initial stages of development while
I do proof of concept, since I know that finding edge case PDBs for
later work is not as hard, it seems, as finding good ones ;-) Maybe the
simplest way to think of the sort of PDBs is you can run your software
and you don't need any try / except blocks for Biopython to work well
:-D

Cheers,

MarkL

On 17 July 2012 16:42, João Rodrigues <anaryin at gmail.com> wrote:
> Hey Mark,
>
> What kind of validation do you want?
>
> Cheers,
>
> João
>
> No dia 17 de Jul de 2012 02:52, "Mark Livingstone"
> <livingstonemark at gmail.com> escreveu:
>>
>> Hi Guys,
>>
>> In my code I am experimenting with different ways of doing RMSD
>> calculations. I have code which in addition to normal CA based RMSD
>> can do (CA & CB) RMSD and also sidechain RMSD. On a perfect PDB file
>> this works well. Unfortunately, the curation I have is fairly average
>> / poor in quality :-( and I only find out when one of the liberal
>> number of Try/Except blocks falls over.
>>
>> I need a better way to find out sooner if a PDB file is missing data.
>>
>> I am wondering therefore is for PDBParser I set Permissive=0, and
>> after setting the relevant models and chains etc, I did
>>
>>
>> wt_atoms = Bio.PDB.Selection.unfold_entities(wtc, 'A')
>>
>> If this successfully works without throwing an Exception, can I assume
>> that this unfolded chain is perfect, or are there ways that I could
>> still be tripped up?
>>
>> Alternatively, can anyone suggest code that I can employ in my
>> curation process that will give me a decent sanity check of PDB
>> quality, so I can get on writing experimental code - and not
>> Try/Except blocks :-(
>>
>> Thanks in advance,
>>
>> MarkL


From anaryin at gmail.com  Tue Jul 17 05:35:56 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 17 Jul 2012 10:35:56 +0100
Subject: [Biopython] The PDBParser Permissive setting
In-Reply-To: <CABGYGEzRbB1FLH4TSU1_DQu_V6zUFikBe948uO1Zug+eLA0cHw@mail.gmail.com>
References: <CABGYGEzR62gNjt8tCX2q0N8EfwU7ruhebLdkjM934JnWsSZ=tg@mail.gmail.com>
	<CAJ9sUYMpVP+sxLvw5MssvY6TGhy-RJzoPFZ0qNOKMue72D9o6w@mail.gmail.com>
	<CABGYGEzRbB1FLH4TSU1_DQu_V6zUFikBe948uO1Zug+eLA0cHw@mail.gmail.com>
Message-ID: <CAJ9sUYNAEADas8r6PQ+uoy-_WJTPBRAwbwOOFaqRsei+0_vZhg@mail.gmail.com>

You mean for example, no chain breaks? And no missing atoms in residues?
You can check the first one with a warning catcher (I think I answered
something like this a few time ago here in the mailing list). The second
one is trickier, you'll need a sort of topology to know which atoms belong
to each residue. I have something like that in my GSOC branch but it's very
very very experimental..
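
A rough sketch of the warning catcher idea, using only the standard library. Both parse_structure and ConstructionWarning are hypothetical stand-ins so that the example runs on its own. With Biopython, the call would be the permissive parser's get_structure(), and the category to look for would, as far as I know, be Bio.PDB.PDBExceptions.PDBConstructionWarning.

```python
import warnings


class ConstructionWarning(UserWarning):
    """Hypothetical stand-in for the parser's construction warning."""


def parse_structure(name):
    # Hypothetical parser: it warns instead of failing on a problem.
    if name == "broken":
        warnings.warn("Chain A is discontinuous", ConstructionWarning)
    return name


def problems_found(name):
    """Return the construction warnings raised while parsing one file."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is suppressed
        parse_structure(name)
    return [str(w.message) for w in caught
            if issubclass(w.category, ConstructionWarning)]


for name in ("clean", "broken"):
    print(name, problems_found(name))
```

Files that yield an empty list could go into the "clean" set used for early development.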

Is this what you mean? Which others would you be looking for? I think that
for RMSD alone you need only to make sure that you match equivalent atoms.
That should be easy enough without major modifications or endless
try/excepts :)
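
To make the "match equivalent atoms" point concrete, here is a sketch that computes RMSD only over atoms present in both structures. The function name and the (residue number, atom name) keys are assumptions for illustration. It performs no superposition, so the coordinates are assumed to be in the same frame already.

```python
import math


def rmsd_matched(coords_a, coords_b):
    """RMSD over atoms present in both dictionaries (no superposition).

    Each dictionary maps a (residue_number, atom_name) key to an (x, y, z)
    tuple. Returns the RMSD and the sorted list of shared keys.
    """
    shared = sorted(set(coords_a) & set(coords_b))
    if not shared:
        raise ValueError("No equivalent atoms to compare")
    total = 0.0
    for key in shared:
        total += sum((p - q) ** 2 for p, q in zip(coords_a[key], coords_b[key]))
    return math.sqrt(total / len(shared)), shared


wild_type = {(1, "CA"): (0.0, 0.0, 0.0), (2, "CA"): (1.0, 0.0, 0.0), (3, "CA"): (5.0, 5.0, 5.0)}
mutant = {(1, "CA"): (0.0, 0.0, 2.0), (2, "CA"): (1.0, 0.0, 0.0)}

value, used = rmsd_matched(wild_type, mutant)
print(round(value, 3), used)
```

Atoms missing from either structure are skipped rather than raising, and the returned key list shows what was actually compared.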

From dilara.ally at gmail.com  Tue Jul 17 18:24:07 2012
From: dilara.ally at gmail.com (Dilara Ally)
Date: Tue, 17 Jul 2012 15:24:07 -0700
Subject: [Biopython] When is a SeqRecord a SeqRecord
Message-ID: <CAEfb3scVvMQ6pVH+mUUZrAb_5E5gCKh59cipvK6-tTh9iMV1BQ@mail.gmail.com>

Hi

I've modified my code, but why does the inclusion of return None and the
subsequent check, if filtered_rec is not None, solve the problem? Thanks!

Dilara

q_threshold=20

def check_meanQ(rec, q_threshold):
    seqlen=len(rec)
    quality_scores=array(rec.letter_annotations["phred_quality"])
    if round(quality_scores.mean()) <= q_threshold:
        print "Discarded ", rec.id, "because mean Q was", round(quality_scores.mean())
        return None
    if round(quality_scores.mean()) > q_threshold:
        return rec


from Bio import SeqIO
for rec in SeqIO.parse("test.fastq", "fastq"):
    #print rec.id
    filtered_rec= check_meanQ(rec, q_threshold)
    if filtered_rec is not None:
        print filtered_rec.id
        print filtered_rec.letter_annotations

From dilara.ally at gmail.com  Tue Jul 17 15:11:12 2012
From: dilara.ally at gmail.com (Dilara Ally)
Date: Tue, 17 Jul 2012 12:11:12 -0700
Subject: [Biopython] when is a SeqRecord not a SeqRecord
Message-ID: <CAEfb3sd_Ft_jyPxNGriRwhy0CGetgS0o0DcqLD61zw3P+rkWhQ@mail.gmail.com>

Hi

I'm trying to understand why, when I print filtered_rec, I get a
SeqRecord, but if I try to access a particular attribute of a SeqRecord
such as letter_annotations I sometimes get an attribute error:
AttributeError: 'NoneType' object has no attribute 'letter_annotations'


q_threshold=20

def check_meanQ(record, q_threshold):
    seqlen=len(record)
    quality_scores=array(record.letter_annotations["phred_quality"])
    if round(quality_scores.mean()) <= q_threshold:
        print "Discarded ", record.id, "because mean Q was", round(quality_scores.mean())
    elif round(quality_scores.mean()) > q_threshold:
        return record

from Bio import SeqIO
for rec in SeqIO.parse("test.fastq", "fastq"):
    print rec.id
    filtered_rec= check_meanQ(rec, q_threshold)
    #print filtered_rec
    print filtered_rec.letter_annotations

I've attached two fastq files that I've used with this code one is called
test.fastq and the other is hiseq_pe_test.fastq

Any help would be greatly appreciated.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.fastq
Type: application/octet-stream
Size: 39217 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20120717/365cf2a5/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hiseq_pe_test.fastq
Type: application/octet-stream
Size: 1541 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20120717/365cf2a5/attachment-0003.obj>

From chapmanb at 50mail.com  Wed Jul 18 09:23:27 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 18 Jul 2012 09:23:27 -0400
Subject: [Biopython] when is a SeqRecord not a SeqRecord
In-Reply-To: <CAEfb3sd_Ft_jyPxNGriRwhy0CGetgS0o0DcqLD61zw3P+rkWhQ@mail.gmail.com>
References: <CAEfb3sd_Ft_jyPxNGriRwhy0CGetgS0o0DcqLD61zw3P+rkWhQ@mail.gmail.com>
Message-ID: <87y5mhkxtc.fsf@fastmail.fm>


Dilara;

> I'm trying to understand what is why when I print filtered_rec I get a
> SeqRecord but if I try to access any particular attribute of a SeqRecord
> such as letter_annotations I sometimes get an attribute error --
> AttributeError: 'NoneType' object has no attribute
> 'letter_annotations.'

> def check_meanQ(record, q_threshold):
>     seqlen=len(record)
>     quality_scores=array(record.letter_annotations["phred_quality"])
>     if round(quality_scores.mean()) <= q_threshold:
>         print "Discarded ", record.id, "because mean Q was",
> round(quality_scores.mean())
>     elif round(quality_scores.mean()) > q_threshold:
>         return record

This function returns different results based on the comparison of
mean quality scores to your threshold:

- When the rounded mean is at or below the threshold, it returns None
  (since that branch has no explicit return statement)
- When it is above the threshold, it returns the SeqRecord.

> from Bio import SeqIO
> for rec in SeqIO.parse("test.fastq", "fastq"):
>     print rec.id
>     filtered_rec= check_meanQ(rec, q_threshold)
>     #print filtered_rec
>     print filtered_rec.letter_annotations

You are seeing the error since in the filtered cases the function
returns None. You probably want:

filtered_rec= check_meanQ(rec, q_threshold)
if filtered_rec is not None:
   print filtered_rec.letter_annotations
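
A self-contained sketch of the same pattern, written for Python 3 and using a
stand-in Record tuple instead of a real SeqRecord (both the tuple and the name
check_mean_q are illustrative, and the rounding from the original function is
left out for brevity):

```python
from collections import namedtuple

# Stand-in for a SeqRecord; only the two attributes used here.
Record = namedtuple("Record", ["id", "letter_annotations"])


def check_mean_q(record, q_threshold):
    """Return the record if its mean quality exceeds the threshold, else None."""
    scores = record.letter_annotations["phred_quality"]
    mean_q = sum(scores) / len(scores)
    if mean_q > q_threshold:
        return record
    # Falling off the end of a function also returns None; writing it out
    # makes the two possible outcomes obvious to the caller.
    return None


records = [
    Record("good", {"phred_quality": [30, 32, 28]}),
    Record("bad", {"phred_quality": [10, 12, 8]}),
]
kept = [r.id for r in records if check_mean_q(r, 20) is not None]
print(kept)
```

The explicit return None does not change behaviour; it is the "is not None"
check in the caller that avoids the AttributeError.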

Brad

From bioinformaticsing at gmail.com  Wed Jul 18 23:36:19 2012
From: bioinformaticsing at gmail.com (ning luwen)
Date: Thu, 19 Jul 2012 11:36:19 +0800
Subject: [Biopython] Error while parsing bgk file
Message-ID: <CAO51=Z4SG=0whhreVzHVrDAn2e1MPnQDjSW2Do_sF_c6EJhX0Q@mail.gmail.com>

Hi everyone,

I encountered an error when parsing a gbk file.

The error message is as follows:

Traceback (most recent call last):
  File "stat_refseq_gbs.py", line 10, in <module>
    for seq in f:
  File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 537, in parse
    for r in i:
  File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 445, in parse_records
    record = self.parse(handle, do_features)
  File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 428, in parse
    if self.feed(handle, consumer, do_features):
  File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 400, in feed
    self._feed_feature_table(consumer, self.parse_features(skip=False))
  File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 350, in _feed_feature_table
    consumer.location(location_string)
  File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/__init__.py", line 970, in location
    int(e),
ValueError: invalid literal for int() with base 10: '68452073^68452074'

The file parsed is ref_GRCh37.p5 and the Biopython version is 1.60. The
lines causing the error may be:

     V_segment       complement(68451760..68452073^68452074)
     CDS             complement(<68451760..68452072^68452073)
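
For illustration, the failure can be reproduced without Biopython: the end
position contains a caret (which the feature table format uses for a site
between two adjacent bases), and int() cannot convert such a token. The helper
below is a hypothetical sketch of a tolerant approach (warn and return None);
it is not the code in Bio.GenBank:

```python
import warnings


def parse_position(token):
    """Hypothetical helper: return a position as int, or None if it is malformed."""
    try:
        # Strip the fuzzy-end markers "<" and ">" before converting.
        return int(token.strip("<>"))
    except ValueError:
        warnings.warn("Ignoring unparsable position %r" % token)
        return None


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    start = parse_position("<68451760")
    end = parse_position("68452073^68452074")

print(start, end, len(caught))
```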

-- 
regards,
luwen ning

From w.arindrarto at gmail.com  Thu Jul 19 04:50:33 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Thu, 19 Jul 2012 10:50:33 +0200
Subject: [Biopython] Error while parsing bgk file
In-Reply-To: <CAO51=Z4SG=0whhreVzHVrDAn2e1MPnQDjSW2Do_sF_c6EJhX0Q@mail.gmail.com>
References: <CAO51=Z4SG=0whhreVzHVrDAn2e1MPnQDjSW2Do_sF_c6EJhX0Q@mail.gmail.com>
Message-ID: <CADEGkF5tTA=tDVLTmh0nGcyN-W-ig5bL8kCVwTPk+qq8DkK-WA@mail.gmail.com>

Hi Ning,

Thanks for reporting the error. A similar issue has been reported in
the bug tracker here: https://redmine.open-bio.org/issues/3175 (it
also looks like it's the same coordinate). It seems that this could be
an invalid GenBank coordinate made by NCBI, though.

Which chromosome is this coordinate from? Is it the latest draft?

cheers,
Bow


On Thu, Jul 19, 2012 at 5:36 AM, ning luwen <bioinformaticsing at gmail.com> wrote:
> Hi everyone,
>
> A error encountered when i parse a gbk file.
>
> the error message as follow:
>
> Traceback (most recent call last):
>   File "stat_refseq_gbs.py", line 10, in <module>
>     for seq in f:
>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/SeqIO/__init__.py",
> line 537, in parse
>     for r in i:
>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py",
> line 445, in parse_records
>     record = self.parse(handle, do_features)
>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py",
> line 428, in parse
>     if self.feed(handle, consumer, do_features):
>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py",
> line 400, in feed
>     self._feed_feature_table(consumer, self.parse_features(skip=False))
>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py",
> line 350, in _feed_feature_table
>     consumer.location(location_string)
>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/__init__.py",
> line 970, in location
>     int(e),
> ValueError: invalid literal for int() with base 10: '68452073^68452074'
>
> the file parsed is ref_GRCh37.p5, the biopython version is 1.60, the
> lines cause the error may be:
>
>      V_segment       complement(68451760..68452073^68452074)
>      CDS             complement(<68451760..68452072^68452073)
>
> --
> regards,
> luwen ning
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From dilara.ally at gmail.com  Thu Jul 19 11:51:35 2012
From: dilara.ally at gmail.com (Dilara Ally)
Date: Thu, 19 Jul 2012 08:51:35 -0700
Subject: [Biopython] slice a record in two and writing both records
Message-ID: <CAEfb3sfpsCg4MgTPxXEh-oyPFe0OjCA9yw5d45cGZd_R6_9bvQ@mail.gmail.com>

If I have a function (modify_record) that slices up a SeqRecord into sub
records and then returns the sliced record if it has a certain length (e.g.
the sliced record needs to be longer than 40 bp), sometimes the original
record, when sliced, will yield two different records that are both longer
than 40 bp.  I want to keep both sliced reads and write them as separate
records into a single fastq file.  Here is my code:

def modify_record(frec, win, len_threshold):
    quality_scores = array(frec.letter_annotations["phred_quality"])
    all_window_qc = slidingWindow(quality_scores, win,1)
    track_qc = windowQ(all_window_qc)
    myzeros = boolean_array(track_qc, q_threshold,win)
    Nrec = slice_points(myzeros,win)[0][1]-1
    where_to_slice = slice_points(myzeros,win)[1]
    where_to_slice.append(len(frec)+win)
    sub_record = slice_record(frec,Nrec,where_to_slice,win,len_threshold)
    return sub_record

q_threshold = 20
win = 5
len_threshold = 30

from Bio import SeqIO
from numpy import *
good_reads = (rec for rec in SeqIO.parse("hiseq_pe_test.fastq", "fastq") if
array(rec.letter_annotations["phred_quality"]).mean() >= q_threshold)
count = SeqIO.write(good_reads, "temp.fastq", "fastq")
print "Saved %i reads" % count

newly_filtered=[]
for rec in SeqIO.parse("temp.fastq", "fastq"):
    s = modify_record(rec, win, len_threshold)
    newly_filtered.append(s)
    SeqIO.write(newly_filtered, "filtered_temp.fastq", "fastq")

This writes only the first sub_record even when more than one has a
length >40 bp. I've tried this as a generator expression and I'm still
getting just the first sub_record.   I'd also prefer not to use append,
as it was previously suggested that this can lead to problems if you run
the script more than once.  Instead, I want to employ a generator
expression - but I'm still getting used to the idea of generator
expressions.

My second question is more general.  Generator expressions are more memory
efficient than a list comprehension, but how are they better than just a
simple loop that pulls in a single record, does something and then writes
that record? Is it just a time issue?

Many thanks for the help!

From w.arindrarto at gmail.com  Thu Jul 19 13:21:42 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Thu, 19 Jul 2012 19:21:42 +0200
Subject: [Biopython] slice a record in two and writing both records
In-Reply-To: <CAEfb3sfpsCg4MgTPxXEh-oyPFe0OjCA9yw5d45cGZd_R6_9bvQ@mail.gmail.com>
References: <CAEfb3sfpsCg4MgTPxXEh-oyPFe0OjCA9yw5d45cGZd_R6_9bvQ@mail.gmail.com>
Message-ID: <CADEGkF6w2eB6stTd-oraRxoD2X5UMvjGgWAvWoVtvR6W0seCQQ@mail.gmail.com>

Hi Dilara,

For your first question, it seems that the `modify_record` function
always returns only one SeqRecord object. This is a bit of guesswork
on my end, as I don't know how most of the functions in `modify_record` work,
but since you still see an output sequence at the end, I
think you may want to re-check how `slice_record` builds `sub_record`
and whether it can hold more than one SeqRecord object.

Also, you might want to try changing the last two lines:

    newly_filtered.append(s)
    SeqIO.write(newly_filtered, "filtered_temp.fastq", "fastq")

Here, you're calling `SeqIO.write` on every iteration of the loop, which
rewrites the whole output file each time. Although the end result is the same
(a file containing all the sequences you want), the code can be made more
efficient by moving the SeqIO.write line outside the loop, after all
sequences have been collected in the `newly_filtered` list.

For your second question, I personally find generator expressions to
be more compact and easier to read. This is important for future code
maintenance ~ having more readable lines of code means it's easier to
understand your code and to debug them in case something goes wrong.
Note that generator expressions aren't silver bullets. In some cases,
for loops may still be better (e.g. if you're doing complex operations
on the objects you're iterating over).
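
On the memory question, a small Python 3 sketch with made-up data: a generator
expression and a plain loop that handles one record at a time are equally
lazy, and only the list comprehension holds every passing record at once.
fake_parse is a stand-in for SeqIO.parse, and the tuples stand in for records:

```python
def fake_parse(n):
    """Stand-in for SeqIO.parse: yields (read_id, mean_quality) one at a time."""
    for i in range(n):
        yield ("read%i" % i, 15 + i)


q_threshold = 20

# List comprehension: all passing records are held in memory together.
as_list = [rec for rec in fake_parse(10) if rec[1] >= q_threshold]

# Generator expression: nothing is read until something iterates over it.
as_gen = (rec for rec in fake_parse(10) if rec[1] >= q_threshold)


def as_loop(records, threshold):
    """The same filter as an explicit loop; just as lazy as the expression."""
    for rec in records:
        if rec[1] >= threshold:
            yield rec


first_from_gen = next(as_gen)
first_from_loop = next(as_loop(fake_parse(10), q_threshold))
print(len(as_list), first_from_gen, first_from_loop)
```

So the choice between the last two is mostly about readability, not speed or
memory.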

I found these two resources helpful when I first grappled with generators
and generator expressions. I hope they are useful to you too:

* http://stackoverflow.com/questions/1995418/python-generator-expression-vs-yield
* http://www.dabeaz.com/generators/Generators.pdf (PDF)

Hope that helps :),
Bow


On Thu, Jul 19, 2012 at 5:51 PM, Dilara Ally <dilara.ally at gmail.com> wrote:
> If I have a function (modify_record) that slices up a SeqRecord into sub
> records and then returns the sliced record if it has a certain length (for
> e.g. the sliced record needs to be greater than 40bp), sometimes the
> original record when sliced will have two different records both greater
> than 40bp.  I want to keep both sliced reads and rewrite them as separate
> records into a single fastq file.  Here is my code:
>
> def modify_record(frec, win, len_threshold):
>     quality_scores = array(frec.letter_annotations["phred_quality"])
>     all_window_qc = slidingWindow(quality_scores, win,1)
>     track_qc = windowQ(all_window_qc)
>     myzeros = boolean_array(track_qc, q_threshold,win)
>     Nrec = slice_points(myzeros,win)[0][1]-1
>     where_to_slice = slice_points(myzeros,win)[1]
>     where_to_slice.append(len(frec)+win)
>     sub_record = slice_record(frec,Nrec,where_to_slice,win,len_threshold)
>     return sub_record
>
> q_threshold = 20
> win = 5
> len_threshold = 30
>
> from Bio import SeqIO
> from numpy import *
> good_reads = (rec for rec in SeqIO.parse("hiseq_pe_test.fastq", "fastq") if
> array(rec.letter_annotations["phred_quality"]).mean() >= q_threshold)
> count = SeqIO.write(good_reads, "temp.fastq", "fastq")
> print "Saved %i reads" % count
>
> newly_filtered=[]
> for rec in SeqIO.parse("temp.fastq", "fastq"):
>     s = modify_record(rec, win, len_threshold)
>     newly_filtered.append(s)
>     SeqIO.write(newly_filtered, "filtered_temp.fastq", "fastq")
>
> This writes only the first sub_record even when there are more than 1 that
> have a len >40bp. I've tried this as a generator expression and I'm still
> getting just the first sub_record.   I'd also prefer to not to use append
> as it was previously suggested that this can lead to problems if you run
> the script more than once.  Instead, I want to employ a generator
> expression - but I'm still getting used to the idea of generator
> expressions.
>
> My second question is more general.  Generator expressions are more memory
> efficient than a list comprehension, but how are they better than just a
> simple loop that pulls in a single record, does something and then writes
> that record? Is it just a time issue?
>
> Many thanks for the help!
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From bioinformaticsing at gmail.com  Thu Jul 19 23:41:20 2012
From: bioinformaticsing at gmail.com (ning luwen)
Date: Fri, 20 Jul 2012 11:41:20 +0800
Subject: [Biopython] Fwd:  Error while parsing bgk file
In-Reply-To: <CALfq9t+_aq6iX0zcEA=wRL3j4=28-dvJQs-Lyjj94adZZCS2BQ@mail.gmail.com>
References: <CAO51=Z4SG=0whhreVzHVrDAn2e1MPnQDjSW2Do_sF_c6EJhX0Q@mail.gmail.com>
	<CALfq9t+_aq6iX0zcEA=wRL3j4=28-dvJQs-Lyjj94adZZCS2BQ@mail.gmail.com>
Message-ID: <CAO51=Z66NPZEGh-P2CoHBn-66D49HQgKwHqjeL=FGx1tMPxtPA@mail.gmail.com>

---------- Forwarded message ----------
From: Lenna Peterson <lennalenna at gmail.com>
Date: Thu, Jul 19, 2012 at 12:51 PM
Subject: Re: [Biopython] Error while parsing bgk file
To: ning luwen <bioinformaticsing at gmail.com>


On Wed, Jul 18, 2012 at 11:36 PM, ning luwen
<bioinformaticsing at gmail.com> wrote:
> Hi everyone,
>
> A error encountered when i parse a gbk file.
>
> the error message as follow:
>
> Traceback (most recent call last):
>   File "stat_refseq_gbs.py", line 10, in <module>
>     for seq in f:
>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/SeqIO/__init__.py",
> line 537, in parse
>     for r in i:
>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py",
> line 445, in parse_records
>     record = self.parse(handle, do_features)
>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py",
> line 428, in parse
>     if self.feed(handle, consumer, do_features):
>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py",
> line 400, in feed
>     self._feed_feature_table(consumer, self.parse_features(skip=False))
>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py",
> line 350, in _feed_feature_table
>     consumer.location(location_string)
>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/__init__.py",
> line 970, in location
>     int(e),
> ValueError: invalid literal for int() with base 10: '68452073^68452074'
>
> the file parsed is ref_GRCh37.p5, the biopython version is 1.60, the
> lines cause the error may be:
>
>      V_segment       complement(68451760..68452073^68452074)
>      CDS             complement(<68451760..68452072^68452073)
>
> --
> regards,
> luwen ning
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython


Hi Luwen,

Thanks for reporting this problem. I've submitted a patch that should fix it.

https://github.com/biopython/biopython/pull/54

Lenna


-- 
regards,
luwen ning

From bioinformaticsing at gmail.com  Thu Jul 19 23:56:51 2012
From: bioinformaticsing at gmail.com (ning luwen)
Date: Fri, 20 Jul 2012 11:56:51 +0800
Subject: [Biopython] Error while parsing bgk file
In-Reply-To: <CADEGkF5tTA=tDVLTmh0nGcyN-W-ig5bL8kCVwTPk+qq8DkK-WA@mail.gmail.com>
References: <CAO51=Z4SG=0whhreVzHVrDAn2e1MPnQDjSW2Do_sF_c6EJhX0Q@mail.gmail.com>
	<CADEGkF5tTA=tDVLTmh0nGcyN-W-ig5bL8kCVwTPk+qq8DkK-WA@mail.gmail.com>
Message-ID: <CAO51=Z4e-C8VaWm=j3OYrJsn4-_W-YQ0oDSzVHj+BJ0OmWhgHw@mail.gmail.com>

Hi Bow,

      Thank you for your reply. The patch by Lenna fixes the
interruption of the parse.

      PS: these gbk files were recently downloaded from
ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/ (with the extension
gbs.gz), and the file containing the "invalid GenBank annotation" is
ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/CHR_02/hs_ref_GRCh37.p5_chr2.gbs.gz

On Thu, Jul 19, 2012 at 4:50 PM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:
> Hi Ning,
>
> Thanks for reporting the error. A similar issue has been reported in
> the bug tracker here: https://redmine.open-bio.org/issues/3175 (it
> also looks like it's the same coordinate). It seems that this could be
> an invalid GenBank coordinate made by NCBI, though.
>
> From which chromosome is this coordinate coming from? Is it the latest draft?
>
> cheers,
> Bow
>
>
> On Thu, Jul 19, 2012 at 5:36 AM, ning luwen <bioinformaticsing at gmail.com> wrote:
>> Hi everyone,
>>
>> A error encountered when i parse a gbk file.
>>
>> the error message as follow:
>>
>> Traceback (most recent call last):
>>   File "stat_refseq_gbs.py", line 10, in <module>
>>     for seq in f:
>>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/SeqIO/__init__.py",
>> line 537, in parse
>>     for r in i:
>>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py",
>> line 445, in parse_records
>>     record = self.parse(handle, do_features)
>>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py",
>> line 428, in parse
>>     if self.feed(handle, consumer, do_features):
>>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py",
>> line 400, in feed
>>     self._feed_feature_table(consumer, self.parse_features(skip=False))
>>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py",
>> line 350, in _feed_feature_table
>>     consumer.location(location_string)
>>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/__init__.py",
>> line 970, in location
>>     int(e),
>> ValueError: invalid literal for int() with base 10: '68452073^68452074'
>>
>> the file parsed is ref_GRCh37.p5, the biopython version is 1.60, the
>> lines cause the error may be:
>>
>>      V_segment       complement(68451760..68452073^68452074)
>>      CDS             complement(<68451760..68452072^68452073)
>>
>> --
>> regards,
>> luwen ning
>> _______________________________________________
>> Biopython mailing list  -  Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython


-- 
regards,
luwen ning

From p.j.a.cock at googlemail.com  Fri Jul 20 06:07:04 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 20 Jul 2012 11:07:04 +0100
Subject: [Biopython] slice a record in two and writing both records
In-Reply-To: <CAEfb3sfpsCg4MgTPxXEh-oyPFe0OjCA9yw5d45cGZd_R6_9bvQ@mail.gmail.com>
References: <CAEfb3sfpsCg4MgTPxXEh-oyPFe0OjCA9yw5d45cGZd_R6_9bvQ@mail.gmail.com>
Message-ID: <CAKVJ-_6mqoWuJUr1NSqVT37f-H6VgadVmjJxTHQ2LqfPxGpFNQ@mail.gmail.com>

On Thu, Jul 19, 2012 at 4:51 PM, Dilara Ally <dilara.ally at gmail.com> wrote:
> If I have a function (modify_record) that slices up a SeqRecord into sub
> records and then returns the sliced record if it has a certain length (for
> e.g. the sliced record needs to be greater than 40bp), sometimes the
> original record when sliced will have two different records both greater
> than 40bp.  I want to keep both sliced reads and rewrite them as separate
> records into a single fastq file.  Here is my code:
>
> def modify_record(frec, win, len_threshold):
>     quality_scores = array(frec.letter_annotations["phred_quality"])
>     all_window_qc = slidingWindow(quality_scores, win,1)
>     track_qc = windowQ(all_window_qc)
>     myzeros = boolean_array(track_qc, q_threshold,win)
>     Nrec = slice_points(myzeros,win)[0][1]-1
>     where_to_slice = slice_points(myzeros,win)[1]
>     where_to_slice.append(len(frec)+win)
>     sub_record = slice_record(frec,Nrec,where_to_slice,win,len_threshold)
>     return sub_record
> ...

The key point is that for each input record you may want to
produce several output records. A single function turning
one input SeqRecord into one output SeqRecord won't work.
I would suggest either,

1. Modify your function to return a list of SeqRecord objects,
which could be zero, one (as now), or several - depending on
the slice points. Then use itertools.chain to combine them,
something like this:

from itertools import chain
# chain.from_iterable flattens the per-read lists into one stream of records
good_reads = chain.from_iterable(modify_record(r) for r in SeqIO.parse(...))
count = SeqIO.write(good_reads, "filtered.fastq", "fastq")
print "Saved %i read fragments" % count

2. Use a generator function to process the SeqRecord objects,

def select_fragments(records, win, len_threshold):
    for record in records:
        where_to_slice = ...
        for slice_point in where_to_slice:
            # assumes each slice_point is a slice object, e.g. slice(0, 50)
            yield record[slice_point]

good_reads = select_fragments(SeqIO.parse(...), win, len_threshold)
count = SeqIO.write(good_reads, "filtered.fastq", "fastq")
print "Saved %i read fragments" % count

Both these approaches are generator/iteration based and will
be memory efficient.

Note you may also want to alter the record identifiers so that
different fragments from a single read get different IDs.
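
Both approaches, including the unique fragment IDs, can be sketched in
Python 3 without Biopython. Everything here is illustrative: reads are plain
(id, sequence, slice_points) tuples rather than SeqRecord objects, and
split_read is a hypothetical helper, not the poster's slice_record:

```python
from itertools import chain


def split_read(read_id, seq, slice_points, len_threshold):
    """Return a list of (fragment_id, fragment) pairs longer than len_threshold."""
    fragments = []
    for n, (start, end) in enumerate(slice_points, 1):
        piece = seq[start:end]
        if len(piece) > len_threshold:
            # Give each fragment its own ID so they stay distinguishable.
            fragments.append(("%s_frag%i" % (read_id, n), piece))
    return fragments


def select_fragments(reads, len_threshold):
    """Generator version: yield fragments one at a time across all reads."""
    for read_id, seq, slice_points in reads:
        for fragment in split_read(read_id, seq, slice_points, len_threshold):
            yield fragment


reads = [
    ("r1", "ACGTACGTAC", [(0, 4), (4, 10)]),  # two fragments pass
    ("r2", "ACG", [(0, 3)]),                  # too short, yields nothing
]

# Approach 1: a list per read, flattened lazily.
via_chain = list(chain.from_iterable(
    split_read(rid, seq, pts, 3) for rid, seq, pts in reads))

# Approach 2: a generator function.
via_generator = list(select_fragments(reads, 3))
print(via_chain)
```

With real records, either iterator could be handed directly to SeqIO.write
instead of being turned into a list.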

Peter

From p.j.a.cock at googlemail.com  Fri Jul 20 06:29:33 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 20 Jul 2012 11:29:33 +0100
Subject: [Biopython] Error while parsing bgk file
In-Reply-To: <CAO51=Z4e-C8VaWm=j3OYrJsn4-_W-YQ0oDSzVHj+BJ0OmWhgHw@mail.gmail.com>
References: <CAO51=Z4SG=0whhreVzHVrDAn2e1MPnQDjSW2Do_sF_c6EJhX0Q@mail.gmail.com>
	<CADEGkF5tTA=tDVLTmh0nGcyN-W-ig5bL8kCVwTPk+qq8DkK-WA@mail.gmail.com>
	<CAO51=Z4e-C8VaWm=j3OYrJsn4-_W-YQ0oDSzVHj+BJ0OmWhgHw@mail.gmail.com>
Message-ID: <CAKVJ-_7S3EMn-xPHLAqWM4zJzuVj=9gQ2V2s88u8cPQw2=XA6w@mail.gmail.com>

On Fri, Jul 20, 2012 at 4:56 AM, ning luwen <bioinformaticsing at gmail.com> wrote:
> Hi Bow,
>
>       Thank you for your reply,  and a patch by lenna can solve the
> interruption of the parse.
>
>       ps: these gbk file was recently downloaded from
> ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/ (with extension of
> gbs.gz), and the file contained "invalid GenBank annotation" is
> ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/CHR_02/hs_ref_GRCh37.p5_chr2.gbs.gz

Note the original bug report referred to a slightly different part/revision
of this chromosome, but it is the same issue reported earlier:
https://redmine.open-bio.org/issues/3175
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_02/hs_ref_GRCh37.p2_chr2.gbk.gz

I have now committed Lenna's fix, which means this file now parses
with a warning about the problem features (which get None as their
location):

https://github.com/biopython/biopython/commit/bc733da09051ca53ad4515ac2d971ff0839a71ba
https://github.com/biopython/biopython/commit/4bf78f72682f0500e93c410f8108891dade88ff8
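
Downstream code then needs to tolerate features without a location. A minimal
sketch with a stand-in Feature tuple (with Biopython this would be a loop over
record.features that checks feature.location; the coordinates are made up):

```python
from collections import namedtuple

# Stand-in for a SeqFeature; only the two attributes used here.
Feature = namedtuple("Feature", ["type", "location"])

features = [
    Feature("gene", (100, 900)),
    Feature("V_segment", None),  # location could not be parsed
    Feature("CDS", None),
]

# Keep features that have a location; remember which ones were skipped.
usable = [f for f in features if f.location is not None]
skipped = [f.type for f in features if f.location is None]
print(len(usable), skipped)
```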

Ning, if you would like to test this fix the simplest way is to get the
latest source code from github, and reinstall Biopython. You can
either use the git tool at the command line, or the github URL for
a tarball: https://github.com/biopython/biopython/tarball/master

(Please ask if you need more guidance with this)

Regards,

Peter

From igorrcosta at hotmail.com  Sat Jul 21 17:44:40 2012
From: igorrcosta at hotmail.com (Igor Rodrigues da Costa)
Date: Sat, 21 Jul 2012 21:44:40 +0000
Subject: [Biopython] Back translation support in Biopython
Message-ID: <SNT122-W17331BE590DC0C5B46EC21C4DF0@phx.gbl>


Hi Peter,
I would eliminate the problem of ID mapping (or at least pass it to the
user) by using only the function that works on one sequence pair. The other
option is to check at run time, using a given genetic code, that each codon
and amino acid are equivalent. I did this in my program, which back
translated using only the aligned protein sequence and the Uniprot/GI
accession numbers (I did the search using Bio.Entrez), but in my case the
nucleotide dictionary only held the different ways the nucleotide sequence
could be imported from NCBI, each of them returning a different sequence.

I can't see any need for different gap characters between the two
alignments, and I feel there can be both a Bio.SeqIO version (using a single
pair of sequences) and a Bio.AlignIO version (using multiple sequences,
probably slower if checking at run time) of this function.

Regards,
Igor

> Date: Mon, 2 Jul 2012 12:27:08 +0100
> Subject: Re: [Biopython] Back translation support in Biopython
> From: p.j.a.cock at googlemail.com
> To: igorrcosta at hotmail.com; eric.talevich at gmail.com
> CC: biopython at lists.open-bio.org
> 
> On Wed, Apr 4, 2012 at 4:02 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> > On Wed, Apr 4, 2012 at 2:49 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
> >> Hi Igor,
> >>
> >> It sounds like you're referring to aligning amino acid sequences to codon
> >> sequences, as PAL2NAL does. This is different from what most people mean by
> >> back translation, but as you point out, certainly useful.
> >>
> >> If you write a function that can match a protein sequence alignment to a set
> >> of raw CDS sequences, returning a nucleotide alignment based on the
> >> codon-to-amino-acid mapping, that would be useful. However, PAL2NAL does
> >> exactly that, plus a bit more, and is a fairly well-known and easily
> >> obtained program. Personally, I would prefer to write a wrapper for PAL2NAL
> >> under Bio.Align.Applications, using the existing Bio.Applications framework.
> >
> > As per the old thread, a simple function in Python taking the gapped protein
> > sequence, original nucleotide coding sequence, and the translation table
> > does sound useful. Then using that, you could go from a protein alignment
> > plus the original nucleotide coding sequences to a codon alignment, or
> > other tasks. Given this is all relatively straightforward string manipulation
> > and we already have the required genetic code tables in Biopython, I'm not
> > convinced that wrapping PAL2NAL would be the best solution (for this sub
> > task).
> 
> Hi Igor,
> 
> Did you do any work on back-translation (alignment threading) in Biopython?
> 
> We needed to do this locally, and for some reason (yet to be determined)
> T-COFFEE wasn't working on our dataset, so I made a start at a Biopython
> implementation:
> 
> https://github.com/peterjc/biopython/tree/back_trans
> https://github.com/peterjc/biopython/commit/7d14cdb59bb9d41c727c923c8aa7e3dda7779c80
> 
> Currently just one commit adding a Bio.Align.alignment_back_translate(...)
> function which takes a protein alignment and dictionary of nucleotide
> records - easy to get with Bio.SeqIO and Bio.AlignIO - with a stand alone
> example included in the doctest. There is also a new (currently private)
> function to do this for one sequence pair - perhaps useful on its own?
> 
> There are potential complications with ID mapping between the proteins
> and nucleotides, thus the option of a key function, and the gap characters
> (would you ever want to use different gap characters in the protein and
> nucleotide alignments?). We could discuss implementation details over
> on the biopython-dev list, but the general API discussion might as well
> be here. e.g. Where to put the function and what to call it.
> 
> Regards,
> 
> Peter

From p.j.a.cock at googlemail.com  Sun Jul 22 08:51:12 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 22 Jul 2012 13:51:12 +0100
Subject: [Biopython] Back translation support in Biopython
In-Reply-To: <SNT122-W17331BE590DC0C5B46EC21C4DF0@phx.gbl>
References: <SNT122-W17331BE590DC0C5B46EC21C4DF0@phx.gbl>
Message-ID: <CAKVJ-_5vLKDBj90-KFCb_ruDudT=-BZB9gaV==V1uAUhSJ=TZA@mail.gmail.com>

On Sat, Jul 21, 2012 at 10:44 PM, Igor Rodrigues da Costa
<igorrcosta at hotmail.com> wrote:
>
>
> Hi Peter,
> I would eliminate the problem of ID mapping (or at least
> pass it to the user) by using only the function that uses
> one sequence pair.

Making the function for doing one sequence pair part
of the public API seems sensible then.

> The other option is to check if the codon and the amino
> acid are equivalent at run time, using a given genetic
> code. I did this in my program that back translated
> using only the aligned protein sequence and the
> Uniprot/GI accession numbers (I did the search using
> Bio.Entrez), but in my case the nucleotide dictionary
> was only some different ways the nucleotide sequence
> could be imported from NCBI, each of them returning
> a different sequence.

Certainly optionally checking the translation seems wise.
There are potential complications with things like
ambiguous bases, but in general this is useful.
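
For illustration, the single-pair function with an optional translation
check could be sketched roughly as below. This is not the code on the
back_trans branch: the name thread_gaps, its arguments, and the toy codon
lookup are assumptions made for this example, and a real version would take
the lookup from one of Biopython's genetic code tables.

```python
def thread_gaps(gapped_protein, nucleotides, gap="-", translate=None):
    """Thread an ungapped coding sequence onto a gapped protein sequence.

    translate is an optional callable mapping a codon to an amino acid;
    when given, every codon is checked against the aligned residue.
    """
    codons = []
    pos = 0
    for residue in gapped_protein:
        if residue == gap:
            # One protein gap becomes three nucleotide gap characters.
            codons.append(gap * 3)
            continue
        codon = nucleotides[pos:pos + 3]
        if len(codon) != 3:
            raise ValueError("nucleotide sequence is too short")
        if translate is not None and translate(codon) != residue:
            raise ValueError("codon %s does not encode %s" % (codon, residue))
        codons.append(codon)
        pos += 3
    # Assumption: tolerate at most one unused trailing codon (e.g. a stop).
    if len(nucleotides) - pos not in (0, 3):
        raise ValueError("nucleotide sequence is too long")
    return "".join(codons)


# Toy two-codon lookup, standing in for a full genetic code table.
toy_table = {"ATG": "M", "AAA": "K"}
aligned = thread_gaps("M-K", "ATGAAA", translate=toy_table.get)
```

Leaving translate as None skips the check, which is one way to sidestep
ambiguous bases that a simple lookup cannot handle.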

> I can't see any need for different gap characters
> between both alignments, and I feel there can be both
> a Bio.SeqIO (using a pair of sequences only) and a
> Bio.AlignIO (using multiple sequences, probably slower
> if checking at run time) versions of this function.

I agree that an alignment based function, and a
single sequence based function make sense - but
probably under Bio.Align rather than Bio.SeqIO and
Bio.AlignIO which are specifically for input/output
functionality.

Thanks for your thoughts,

Peter

From dilara.ally at gmail.com  Mon Jul 23 17:48:30 2012
From: dilara.ally at gmail.com (Dilara Ally)
Date: Mon, 23 Jul 2012 14:48:30 -0700
Subject: [Biopython] slice a record in two and writing both records
In-Reply-To: <CAKVJ-_6mqoWuJUr1NSqVT37f-H6VgadVmjJxTHQ2LqfPxGpFNQ@mail.gmail.com>
References: <CAEfb3sfpsCg4MgTPxXEh-oyPFe0OjCA9yw5d45cGZd_R6_9bvQ@mail.gmail.com>
	<CAKVJ-_6mqoWuJUr1NSqVT37f-H6VgadVmjJxTHQ2LqfPxGpFNQ@mail.gmail.com>
Message-ID: <9085DA29-9159-44EE-BED7-56E3306B8EA3@gmail.com>

Thanks.  Itertools is a fantastic module!
Dilara

On Jul 20, 2012, at 3:07 AM, Peter Cock wrote:

> On Thu, Jul 19, 2012 at 4:51 PM, Dilara Ally <dilara.ally at gmail.com> wrote:
>> If I have a function (modify_record) that slices up a SeqRecord into sub
>> records and then returns the sliced record if it has a certain length (for
>> e.g. the sliced record needs to be greater than 40bp), sometimes the
>> original record when sliced will have two different records both greater
>> than 40bp.  I want to keep both sliced reads and rewrite them as separate
>> records into a single fastq file.  Here is my code:
>> 
>> def modify_record(frec, win, len_threshold):
>>    quality_scores = array(frec.letter_annotations["phred_quality"])
>>    all_window_qc = slidingWindow(quality_scores, win,1)
>>    track_qc = windowQ(all_window_qc)
>>    myzeros = boolean_array(track_qc, q_threshold,win)
>>    Nrec = slice_points(myzeros,win)[0][1]-1
>>    where_to_slice = slice_points(myzeros,win)[1]
>>    where_to_slice.append(len(frec)+win)
>>    sub_record = slice_record(frec,Nrec,where_to_slice,win,len_threshold)
>>    return sub_record
>> ...
> 
> The key point is that for each input record you may want to
> produce several output records. A single function turning
> one input SeqRecord into one output SeqRecord won't work.
> I would suggest either,
> 
> 1. Modify your function to return a list of SeqRecord objects,
> which could be zero, one (as now), or several - depending on
> the slice points. Then use itertools.chain to combine them,
> something like this:
> 
> from itertools import chain
> good_reads = chain.from_iterable(modify_record(r, win, len_threshold)
>                                  for r in SeqIO.parse(...))
> count = SeqIO.write(good_reads, "filtered.fastq", "fastq")
> print "Saved %i read fragments" % count
> 
> 2. Use a generator function to process the SeqRecord objects,
> 
> def select_fragments(records, win, len_threshold):
>    for record in records:
>         where_to_slice = ...
>         for slice_point in where_to_slice:
>             yield record[slice_point]
> 
> good_reads = select_fragments(SeqIO.parse(...), win, len_threshold)
> count = SeqIO.write(good_reads, "filtered.fastq", "fastq")
> print "Saved %i read fragments" % count
> 
> Both these approaches are generator/iteration based and will
> be memory efficient.
> 
> Note you may also want to alter the record identifiers so that
> different fragments from a single read get different IDs.
> 
> Peter
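
A minimal, self-contained sketch of option 1 follows. It uses plain
(id, sequence) tuples as stand-ins for SeqRecord objects so that it runs
without any FASTQ file; split_read and its cut_points argument are made up
for this example. Because each call returns a list, the lists are flattened
with chain.from_iterable, and each fragment gets its own ID.

```python
from itertools import chain


def split_read(read_id, seq, cut_points, min_len):
    """Return (new_id, fragment) tuples for each piece of at least min_len."""
    bounds = [0] + sorted(cut_points) + [len(seq)]
    pieces = []
    for n, (start, end) in enumerate(zip(bounds, bounds[1:]), 1):
        if end - start >= min_len:
            # Give every fragment a distinct ID derived from the read ID.
            pieces.append(("%s_frag%i" % (read_id, n), seq[start:end]))
    return pieces


reads = [("r1", "AAAATTTTGG", [4, 8]), ("r2", "CCCCCC", [])]

# Each call returns zero, one, or several fragments; from_iterable
# flattens those lists into a single stream of fragments.
fragments = list(chain.from_iterable(
    split_read(read_id, seq, cuts, 4) for read_id, seq, cuts in reads))
```

With real data the final list() call would be dropped and the flattened
iterator passed straight to SeqIO.write, so only one read is held in memory
at a time.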


From llewelr at gmail.com  Mon Jul 23 22:24:06 2012
From: llewelr at gmail.com (Richard Llewellyn)
Date: Mon, 23 Jul 2012 20:24:06 -0600
Subject: [Biopython] Entrez.read(handle) for Bio._py3k.EvilHandleHack throws
 error with python 3.2
Message-ID: <CAJQgwq68pH4TAcCsY99o70fdWc8ziMOHxR6zkBVpUFeTUnUp1A@mail.gmail.com>

With python 3.2 and biopython 1.60 after getting a handle using
Entrez.esummary (and esearch, others?) I get a TypeError:


>>> from Bio import Entrez
>>> Entrez.email = "Your.Name.Here at example.org"
>>> handle = Entrez.esummary(db="journals", id="30367")
>>> record = Entrez.read(handle)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.2/site-packages/Bio/Entrez/__init__.py",
line 351, in read
    record = handler.read(handle)
  File "/usr/local/lib/python3.2/site-packages/Bio/Entrez/Parser.py",
line 169, in read
    self.parser.ParseFile(handle)
TypeError: read() did not return a bytes object (type=str)

>>> handle
<Bio._py3k.EvilHandleHack object at 0xb737a34c>

Ah, it is evil!

I realize py3k not yet officially supported.

Thanks for the great work.

From llewelr at gmail.com  Mon Jul 23 23:20:00 2012
From: llewelr at gmail.com (Richard Llewellyn)
Date: Mon, 23 Jul 2012 21:20:00 -0600
Subject: [Biopython] Entrez.read(handle) for Bio._py3k.EvilHandleHack
 throws error with python 3.2
In-Reply-To: <CAJQgwq68pH4TAcCsY99o70fdWc8ziMOHxR6zkBVpUFeTUnUp1A@mail.gmail.com>
References: <CAJQgwq68pH4TAcCsY99o70fdWc8ziMOHxR6zkBVpUFeTUnUp1A@mail.gmail.com>
Message-ID: <CAJQgwq7jgK-onmMz=hfHkTjUTtqB-EekHxhgh9PF-KGW1QA=Cw@mail.gmail.com>

Follow up for Entrez.read error on EvilHandleHack object:

(this is python 3.2.3)

If I change last line of Entrez.__init__.py _open function from

return _binary_to_string_handle(handle)
to
return handle

this error does not occur in example given below.


On Mon, Jul 23, 2012 at 8:24 PM, Richard Llewellyn <llewelr at gmail.com> wrote:
> With python 3.2 and biopython 1.60 after getting a handle using
> Entrez.esummary (and esearch, others?) I get a TypeError:
>
>
>>>> from Bio import Entrez
>>>> Entrez.email = "Your.Name.Here at example.org"
>>>> handle = Entrez.esummary(db="journals", id="30367")
>>>> record = Entrez.read(handle)
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/local/lib/python3.2/site-packages/Bio/Entrez/__init__.py",
> line 351, in read
>     record = handler.read(handle)
>   File "/usr/local/lib/python3.2/site-packages/Bio/Entrez/Parser.py",
> line 169, in read
>     self.parser.ParseFile(handle)
> TypeError: read() did not return a bytes object (type=str)
>
>>>> handle
> <Bio._py3k.EvilHandleHack object at 0xb737a34c>
>
> Ah, it is evil!
>
> I realize py3k not yet officially supported.
>
> Thanks for the great work.

From markd at soe.ucsc.edu  Tue Jul 24 02:47:51 2012
From: markd at soe.ucsc.edu (Mark Diekhans)
Date: Mon, 23 Jul 2012 23:47:51 -0700
Subject: [Biopython] accessing PDB IDcode when using PDBParser
Message-ID: <20494.17687.678370.458937@hgwdev.cse.ucsc.edu>


How does one access the idCode in the PDB HEADER when using the PDBParser?
I can't find this in the documentation or the code.

Also, what is the function of the `id' argument for PDBParser.get_structure?
The documentation is just self-referential:
       o id - string, the id that will be used for the structure

Seems no obvious way via MMCIFParser either.

Thanks!


From anaryin at gmail.com  Tue Jul 24 04:37:44 2012
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Tue, 24 Jul 2012 10:37:44 +0200
Subject: [Biopython] accessing PDB IDcode when using PDBParser
In-Reply-To: <20494.17687.678370.458937@hgwdev.cse.ucsc.edu>
References: <20494.17687.678370.458937@hgwdev.cse.ucsc.edu>
Message-ID: <CAJ9sUYOuEXc5DX-AkFfZGRV5xukaCe+VE4c-xgaOySDQpfFCpQ@mail.gmail.com>

Hey Mark,

Indeed there is no specific ID extraction from the HEADER. However, it
comes as part of the "head" key in the header dictionary. If you split by
whitespace and get the last field, you get the PDB ID.

Example:

HEADER    HYDROLASE(ASPARTYL PROTEINASE)          17-OCT-89   2RSP


The id argument of get_structure is just a label: the structure's id is set
to whatever string you pass as the first argument.
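
As a rough sketch of the split-by-whitespace idea, the snippet below reads
the HEADER record directly from the lines of a PDB file instead of going
through the parsed header dictionary, since what that dictionary keeps may
depend on the Biopython version. The function name pdb_id_from_header is
made up for this example.

```python
def pdb_id_from_header(lines):
    """Return the last whitespace-separated field of the first HEADER record."""
    for line in lines:
        if line.startswith("HEADER"):
            fields = line.split()
            # fields[0] is the record name itself, so require at least one more.
            return fields[-1] if len(fields) > 1 else None
    return None


example = ["HEADER    HYDROLASE(ASPARTYL PROTEINASE)          17-OCT-89   2RSP"]
pdb_id = pdb_id_from_header(example)
```

An open file handle can be passed in place of the list, and the resulting
string could then be given to get_structure as its first argument.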


Cheers,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao


2012/7/24 Mark Diekhans <markd at soe.ucsc.edu>

>
> How does one access the idCode in the PDB HEADER when using the PDBParser?
> I can't find this in the documentation or the code.
>
> Also, what is function of the `id' argument for PDBParser.get_structure:
> The documentation is just self-referential:
>        o id - string, the id that will be used for the structure
>
> Seems no obvious way via MMCIFParser either.
>
> Thanks!
>
>
>
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>


From p.j.a.cock at googlemail.com  Tue Jul 24 05:41:35 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 24 Jul 2012 10:41:35 +0100
Subject: [Biopython] Entrez.read(handle) for Bio._py3k.EvilHandleHack
 throws error with python 3.2
In-Reply-To: <CAJQgwq7jgK-onmMz=hfHkTjUTtqB-EekHxhgh9PF-KGW1QA=Cw@mail.gmail.com>
References: <CAJQgwq68pH4TAcCsY99o70fdWc8ziMOHxR6zkBVpUFeTUnUp1A@mail.gmail.com>
	<CAJQgwq7jgK-onmMz=hfHkTjUTtqB-EekHxhgh9PF-KGW1QA=Cw@mail.gmail.com>
Message-ID: <CAKVJ-_4dg5=-PWLGr7OrT0riV0dm8cKvamcLCm1odDWn3=4QuA@mail.gmail.com>

Hi Richard,

It's great to have some feedback on Python 3 support :)

On Tue, Jul 24, 2012 at 4:20 AM, Richard Llewellyn <llewelr at gmail.com> wrote:
> Follow up for Entrez.read error on EvilHandleHack object:
>
> (this is python 3.2.3)
>
> If I change last line of Entrez.__init__.py _open function from
>
> return _binary_to_string_handle(handle)
> to
> return handle
>
> this error does not occur in example given below.

Hmm. That call to _binary_to_string_handle converts from the
bytes (binary) network handle to a string (unicode) handle which
is required for most of the parsers in Biopython under Python 3
(e.g. FASTA, Genbank).

Surprisingly the Entrez parser seems to be wanting a binary
handle? That seems curious... I presume that means we
don't have this particular case covered in the unit tests :(
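
The mismatch can be reproduced without Biopython or the network. The sketch
below assumes Python 3 and uses only the standard library's expat parser,
which is where the ParseFile call in the traceback ends up; the XML snippet
and the parse_tags helper are made up for this example.

```python
import io
from xml.parsers import expat


def parse_tags(handle):
    """Parse XML from a file-like handle and return the element names seen."""
    tags = []
    parser = expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: tags.append(name)
    parser.ParseFile(handle)
    return tags


xml_bytes = b"<eSummaryResult><DocSum/></eSummaryResult>"

# A binary handle is what expat's ParseFile expects under Python 3.
binary_tags = parse_tags(io.BytesIO(xml_bytes))

# A text handle returns str from read(), which ParseFile rejects.
try:
    parse_tags(io.StringIO(xml_bytes.decode("ascii")))
    text_handle_ok = True
except TypeError:
    text_handle_ok = False
```

This suggests the XML parser needs the undecoded network handle, while the
text-based parsers need the decoded one.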

How familiar are you with the Python 3 split of bytes vs strings
(unicode), and binary versus text handles?

Peter

From bjorn_johansson at bio.uminho.pt  Fri Jul 27 04:03:41 2012
From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=)
Date: Fri, 27 Jul 2012 09:03:41 +0100
Subject: [Biopython] Restriction cutting SeqRecord objects
Message-ID: <CAG_4V=aWTTrCyF7ZA-CsZ0xw157YoyBO-Vyh2REtzk5LPd34Hw@mail.gmail.com>

Hi,
Restriction with Bio.Restriction only works for seq or mutable seq objects?
I would like to digest SeqRecord objects and still keep the relevant
features of the sequences.

Did anyone perhaps implement something like this?
One way would be to subclass the restriction enzymes,
but they are created dynamically so I am not sure if this is a good idea.


btw is the biopython site down?

thanks,
bjorn

-- 
______O_________oO________oO______o_______oO__
Björn Johansson
Assistant Professor
Departament of Biology
University of Minho
Campus de Gualtar
4710-057 Braga
PORTUGAL
www.bio.uminho.pt
Google profile
metabolicengineeringgroup
Work (direct) +351-253 601517 | mob.  +351-967 147 704 | mob. (SWE) 0739 792 968
Dept of Biology (secr) +351-253 60 4310  | fax +351-253 678980


From dilara.ally at gmail.com  Thu Jul 26 13:48:44 2012
From: dilara.ally at gmail.com (Dilara Ally)
Date: Thu, 26 Jul 2012 10:48:44 -0700
Subject: [Biopython] matching headers and then writing the seq record
Message-ID: <DDBE81B2-6FEB-4ABD-BC40-1E1E313BB7D5@gmail.com>

Hi Everyone, 

I'm interested in finding headers that match (in other words paired reads) in two different fastq files.  Once the common headers are found, I then go back to the original fastq file and write those matched reads to a different fastq file.  Right now, the part of the code that runs really slow is the headers_read1 and headers_read2 lines.  And I was wondering if there was a more elegant way and time efficient manner than what I have done.  It seems as if set undoes the elegance of using a generator.  Any advice is greatly appreciated!   Here is the code:

def get_header(seq_record):
    fields = seq_record.id.split(':')
    lastfield = fields[6].split('_')[0]
    return lastfield

def get_full_header(seq_record):
    fields = seq_record.id.split(':')
    headerInfo2 = fields[6].split('_')[0]
    headerInfo =  str(fields[0]) + ":" + str(fields[1]) + ":" + str(fields[2]) + ":" + str(fields[3]) + ":" + str(fields[4]) + ":" + str(fields[5]) + ":" + str(headerInfo2)
    return headerInfo 

def replace_header(seq_record,pairType):
    if pairType == 1:
        ending = "/1"
    elif pairType == 2:
        ending = "/2"
    seq_record.id=seq_record.id+ending
    seq_record.name = ""
    seq_record.description = ""
    return seq_record

def matched_records(records, pairType, header_matches):
    for rec in records:
        id = get_header(rec)
        result = id in header_matches
        #print result
        if (result == True):
            newrec = replace_header(rec,pairType)
            yield newrec

import sys
from Bio import SeqIO

headers_read1 = set(get_header(seq_record) for seq_record in SeqIO.parse(sys.argv[1], "fastq"))
headers_read2 = set(get_header(seq_record) for seq_record in SeqIO.parse(sys.argv[2], "fastq"))
header_matches = [x for x in headers_read1 if x in headers_read2]

records = SeqIO.parse(sys.argv[1], "fastq") 
pairType = 1
count = SeqIO.write(matched_records(records,pairType,header_matches), sys.argv[3], "fastq")
print "Saved %i matched reads." %count

records = SeqIO.parse(sys.argv[2], "fastq") 
pairType = 2
count = SeqIO.write(matched_records(records,pairType,header_matches), sys.argv[4], "fastq")
print "Saved %i matched reads." %count

From cartealy at yahoo.co.id  Fri Jul 27 02:30:58 2012
From: cartealy at yahoo.co.id (Imam Cartealy)
Date: Fri, 27 Jul 2012 14:30:58 +0800 (SGT)
Subject: [Biopython] Is biopython.org down ?
Message-ID: <1343370658.77420.YahooMailNeo@web190503.mail.sg3.yahoo.com>

Hi everyone,

I am having trouble accessing biopython.org for the last 2 days. Is biopython.org down ?

Cheers

ic

Imam Cartealy
Center for Biotechnology - BPPT
Indonesia


From idoerg at gmail.com  Sat Jul 28 16:19:01 2012
From: idoerg at gmail.com (Iddo Friedberg)
Date: Sat, 28 Jul 2012 16:19:01 -0400
Subject: [Biopython] Is biopython.org down ?
In-Reply-To: <1343370658.77420.YahooMailNeo@web190503.mail.sg3.yahoo.com>
References: <1343370658.77420.YahooMailNeo@web190503.mail.sg3.yahoo.com>
Message-ID: <CABm4-MRRY=ykF+ZoYF+KHZ3rDXO=gA_3-+B+FC4gb8CR6Yk7oQ@mail.gmail.com>

Has been down for the past couple of days, but it is up now.

On Fri, Jul 27, 2012 at 2:30 AM, Imam Cartealy <cartealy at yahoo.co.id> wrote:

> Hi everyone,
>
> I am having trouble accessing biopython.org for the last 2 days. Is
> biopython.org down ?
>
> Cheers
>
> ic
>
>
> Imam Cartealy
> Center for Biotechnology - BPPT
> Indonesia
>


-- 
Iddo Friedberg
http://iddo-friedberg.net/contact.html
++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.>
++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----.
.>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>>
>>----.<--.>++++++.<<<<------------------------------------.

From p.j.a.cock at googlemail.com  Sat Jul 28 16:48:32 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 28 Jul 2012 21:48:32 +0100
Subject: [Biopython] matching headers and then writing the seq record
In-Reply-To: <DDBE81B2-6FEB-4ABD-BC40-1E1E313BB7D5@gmail.com>
References: <DDBE81B2-6FEB-4ABD-BC40-1E1E313BB7D5@gmail.com>
Message-ID: <CAKVJ-_6vQisoFTG5h6XHjgqO5RK3usDf1rtri-A6PdcxAVq9YA@mail.gmail.com>

On Thu, Jul 26, 2012 at 6:48 PM, Dilara Ally <dilara.ally at gmail.com> wrote:
> ... It seems as if set undoes the elegance of using a generator.
> Any advice is greatly appreciated! ...
>
> headers_read1 = set(...)
> headers_read2 = set(...)
> header_matches = [x for x in headers_read1 if x in headers_read2]

I would expect that using the built in set's intersection operation would
be faster than this list comprehension solution to create header_matches.

Also, you should use a set not a list for header_matches because testing
membership with a set is much faster than a list. i.e. Try:

header_matches = headers_read1.intersection(headers_read2)

This might be a tiny change, but I expect it to be noticeably faster.
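
A small stand-alone sketch of the difference, using made-up header strings
in place of real FASTQ identifiers:

```python
# Hypothetical read headers from the two FASTQ files.
headers_read1 = {"1101:2:3", "1101:2:4", "1101:2:5"}
headers_read2 = {"1101:2:4", "1101:2:5", "1101:2:9"}

# The list comprehension gives the same headers, but as a list, so every
# later "x in header_matches" test has to scan it from the start.
matches_as_list = [x for x in headers_read1 if x in headers_read2]

# The set intersection gives a set, so later membership tests are hash
# lookups whose cost does not grow with the number of matches.
header_matches = headers_read1.intersection(headers_read2)
```

The membership test inside matched_records runs once per read, so the type
of header_matches matters far more than how it was built.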

Also, here:

> def matched_records(records, pairType, header_matches):
>    for rec in records:
>        id = get_header(rec)
>        result = id in header_matches
>        if (result == True):
>            newrec = replace_header(rec,pairType)
>            yield newrec

If you don't mind my style comments, you don't really need
to create the variables 'id' and  'result', and 'newrec' - I would
just do:

def matched_records(records, pairType, header_matches):
    for rec in records:
        if get_header(rec) in header_matches:
            yield replace_header(rec,pairType)

And at that point you could write the whole thing as a
generator expression, which you may or may not find
more pleasing (I'm not sure if it makes any significant
difference to the speed). i.e.

records = SeqIO.parse(sys.argv[1], "fastq")
pairType = 1
wanted = (replace_header(rec,pairType) \
                 for rec in records \
                 if get_header(rec) in header_matches)
count = SeqIO.write(wanted, sys.argv[3], "fastq")

I hope that helps,

Peter

From aclark at aclark.net  Sat Jul 28 19:45:10 2012
From: aclark at aclark.net (Alex Clark)
Date: Sat, 28 Jul 2012 19:45:10 -0400
Subject: [Biopython] ANN: pythonpackages.com beta
Message-ID: <jv1ti6$bck$1@dough.gmane.org>

Hi biological computation folks,


I am reaching out to various Python-related programming communities in 
order to offer new help packaging your software.

If you have ever struggled with packaging and releasing Python software 
(e.g. to PyPI), please check out this service:


- http://pythonpackages.com


The basic idea is to automate packaging by checking out code, testing, 
and uploading (e.g. to PyPI) all through the web, as explained in this 
introduction:


- http://docs.pythonpackages.com/en/latest/introduction.html


Also, I will be available to answer your Python packaging questions most 
days/nights in #pythonpackages on irc.freenode.net. Hope to meet/talk 
with all of you soon.


Alex


-- 
Alex Clark · http://pythonpackages.com/ONE_CLICK


From dilara.ally at gmail.com  Tue Jul 31 14:53:27 2012
From: dilara.ally at gmail.com (Dilara Ally)
Date: Tue, 31 Jul 2012 11:53:27 -0700
Subject: [Biopython] matching headers and then writing the seq record
In-Reply-To: <CAKVJ-_6vQisoFTG5h6XHjgqO5RK3usDf1rtri-A6PdcxAVq9YA@mail.gmail.com>
References: <DDBE81B2-6FEB-4ABD-BC40-1E1E313BB7D5@gmail.com>
	<CAKVJ-_6vQisoFTG5h6XHjgqO5RK3usDf1rtri-A6PdcxAVq9YA@mail.gmail.com>
Message-ID: <BAB3D952-032A-4EC1-B381-22268D3E95D9@gmail.com>

Thanks Peter it sped it up considerably!  I appreciate the fast replies on this listserv.


On Jul 28, 2012, at 1:48 PM, Peter Cock wrote:

> On Thu, Jul 26, 2012 at 6:48 PM, Dilara Ally <dilara.ally at gmail.com> wrote:
>> ... It seems as if set undoes the elegance of using a generator.
>> Any advice is greatly appreciated! ...
>> 
>> headers_read1 = set(...)
>> headers_read2 = set(...)
>> header_matches = [x for x in headers_read1 if x in headers_read2]
> 
> I would expect that using the built in set's intersection operation would
> be faster than this list comprehension solution to create header_matches.
> 
> Also, you should use a set not a list for header_matches because testing
> membership with a set is much faster than a list. i.e. Try:
> 
> header_matches = headers_read1.intersection(headers_read2)
> 
> This might be a tiny change, but I expect it to be noticeably faster.
> 
> Also, here:
> 
>> def matched_records(records, pairType, header_matches):
>>   for rec in records:
>>       id = get_header(rec)
>>       result = id in header_matches
>>       if (result == True):
>>           newrec = replace_header(rec,pairType)
>>           yield newrec
> 
> If you don't mind my style comments, you don't really need
> to create the variables 'id' and  'result', and 'newrec' - I would
> just do:
> 
> def matched_records(records, pairType, header_matches):
>    for rec in records:
>        if get_header(rec) in header_matches:
>            yield replace_header(rec,pairType)
> 
> And at that point you could write the whole thing as a
> generator expression, which you may or may not find
> more pleasing (I'm not sure if it makes any significant
> difference to the speed). i.e.
> 
> records = SeqIO.parse(sys.argv[1], "fastq")
> pairType = 1
> wanted = (replace_header(rec,pairType) \
>                 for rec in records \
>                 if get_header(rec) in header_matches)
> count = SeqIO.write(wanted, sys.argv[3], "fastq")
> 
> I hope that helps,
> 
> Peter


From devaniranjan at gmail.com  Tue Jul 31 15:24:34 2012
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Tue, 31 Jul 2012 15:24:34 -0400
Subject: [Biopython] Mocapy
Message-ID: <CAFU65PcikwfcKhnXGPo6r1UA6-5-QuvcTF7zbKFbK5Sdr1qi+g@mail.gmail.com>

I was wondering if Mocapy is part of Biopython.

I thought it was but I cannot find it in my biopython PDB folder.

Thank you,
George

From eric.talevich at gmail.com  Tue Jul 31 17:55:21 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 31 Jul 2012 17:55:21 -0400
Subject: [Biopython] Mocapy
In-Reply-To: <CAFU65PcikwfcKhnXGPo6r1UA6-5-QuvcTF7zbKFbK5Sdr1qi+g@mail.gmail.com>
References: <CAFU65PcikwfcKhnXGPo6r1UA6-5-QuvcTF7zbKFbK5Sdr1qi+g@mail.gmail.com>
Message-ID: <CAMC681mb5z0e2=y6P54FUOiXb5xOo=au_qhMaANwW3-r4g+PHA@mail.gmail.com>

On Tue, Jul 31, 2012 at 3:24 PM, George Devaniranjan <devaniranjan at gmail.com
> wrote:

> I was wondering if Mocapy is part of Biopython.
>
> I thought it was but I cannot find it in my biopython PDB folder.
>
>
Hi George,

No, Mocapy++ is a separate project:
http://sourceforge.net/projects/mocapy/

There is a branch to add some integration with Mocapy++ to Biopython, but
we're waiting for the next stable release of Mocapy++ before merging it:
https://github.com/mchelem/biopython

-Eric

From from.d.putto at gmail.com  Mon Jul  2 12:21:46 2012
From: from.d.putto at gmail.com (Sheila the angel)
Date: Mon, 2 Jul 2012 14:21:46 +0200
Subject: [Biopython] searching homologene database
Message-ID: <CAFinXcQrh4y4=yWxzhp3jQGuqqjQTs9dxddP1sPLdHy5sMV3Vg@mail.gmail.com>

To search tp53 homolog in homologene database -

handle = Entrez.esearch(db="homologene", term="tp53[gene name] AND Homo sapiens[orgn]")
record = Entrez.read(handle)
handle = Entrez.efetch(db="homologene", id=record['IdList'])
record = handle.read()
print record

I think record is asn.1 format !! how can I read or convert it in the genes
protein table (as we see in the web result)
http://www.ncbi.nlm.nih.gov/homologene/460

Thanks

--
Sheila


From w.arindrarto at gmail.com  Mon Jul  2 12:39:31 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Mon, 2 Jul 2012 14:39:31 +0200
Subject: [Biopython] searching homologene database
In-Reply-To: <CAFinXcQrh4y4=yWxzhp3jQGuqqjQTs9dxddP1sPLdHy5sMV3Vg@mail.gmail.com>
References: <CAFinXcQrh4y4=yWxzhp3jQGuqqjQTs9dxddP1sPLdHy5sMV3Vg@mail.gmail.com>
Message-ID: <CADEGkF7Cn++8Zk8c-GXrtgGNfU=WMHvwUY-m0M58e5BwfA=rQg@mail.gmail.com>

Hi Sheila,

You can set the 'retmode' parameter in order to specify your preferred
format. I'm not sure if NCBI provides an output format exactly like
the one you see on their site, but instead of ASN.1 you can specify a
more common format like XML.

In your case, the call would be this (for XML, let's say):

handle = Entrez.efetch(db="homologene", id=record['IdList'], retmode="xml")

For a list of possible retmode values, you can look them up here:
http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch (see the
explanation about 'retmode').

If you want to format the output further, you can use modules like the
built-in xml.etree.ElementTree or 3rd party modules like lxml to extract the
tag values and feed them to your script / program.
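
As a rough sketch of that last step, the snippet below pulls tag values out
of an XML string with the standard library. The XML here is a toy document
invented for this example; its tag names (Entries, Entry, Symbol, Taxon) are
not the real HomoloGene schema, so the names would need to be replaced with
whatever the efetch output actually contains.

```python
import xml.etree.ElementTree as ET

# Toy XML standing in for the text returned by handle.read().
xml_text = """<Entries>
  <Entry><Symbol>TP53</Symbol><Taxon>Homo sapiens</Taxon></Entry>
  <Entry><Symbol>Trp53</Symbol><Taxon>Mus musculus</Taxon></Entry>
</Entries>"""

root = ET.fromstring(xml_text)

# Collect one (gene symbol, organism) row per Entry element.
rows = [(entry.findtext("Symbol"), entry.findtext("Taxon"))
        for entry in root.iter("Entry")]
```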

Hope that helps,
Bow


On Mon, Jul 2, 2012 at 2:21 PM, Sheila the angel <from.d.putto at gmail.com> wrote:
> To search tp53 homolog in homologene database -
>
> handle = Entrez.esearch(db="homologene", term="tp53[gene name] AND Homo sapiens[orgn]")
> record = Entrez.read(handle)
> handle = Entrez.efetch(db="homologene", id=record['IdList'])
> record = handle.read()
> print record
>
> I think record is asn.1 format !! how can I read or convert it in the genes
> protein table (as we see in the web result)
> http://www.ncbi.nlm.nih.gov/homologene/460
>
> Thanks
>
> --
> Sheila


From from.d.putto at gmail.com  Tue Jul 10 13:16:25 2012
From: from.d.putto at gmail.com (Sheila the angel)
Date: Tue, 10 Jul 2012 15:16:25 +0200
Subject: [Biopython] access Uniprot record by different ids
Message-ID: <CAFinXcTHhTH8ra0NPD_30DkTUCafWj8efqcMMH-QtUX9dEUrbA@mail.gmail.com>

I have a Uniprot AC list in which some AC are primary and some are
secondary. The function
my_dict = SeqIO.index("uniprot_sprot.dat", "swiss")
makes dictionary of  uniprot data but I can access a record only by primary
AC.
my_dict['P04637']  # gives the record
my_dict['Q15086'] # KeyError
my_dict['P53_HUMAN'] # KeyError

Is it possible to access same record by both primary and secondary ACs (and
by uniprot ID) ?


From p.j.a.cock at googlemail.com  Tue Jul 10 13:43:31 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 10 Jul 2012 14:43:31 +0100
Subject: [Biopython] access Uniprot record by different ids
In-Reply-To: <CAFinXcTHhTH8ra0NPD_30DkTUCafWj8efqcMMH-QtUX9dEUrbA@mail.gmail.com>
References: <CAFinXcTHhTH8ra0NPD_30DkTUCafWj8efqcMMH-QtUX9dEUrbA@mail.gmail.com>
Message-ID: <CAKVJ-_4J8xrFgyz_2bw__sRH3Okzo2ubstuqpRSnpy8tE4YQzg@mail.gmail.com>

On Tue, Jul 10, 2012 at 2:16 PM, Sheila the angel
<from.d.putto at gmail.com> wrote:
> I have a Uniprot AC list in which some AC are primary and some are
> secondary. The function
> my_dict = SeqIO.index("uniprot_sprot.dat", "swiss")
> makes dictionary of  uniprot data but I can access a record only by primary
> AC.
> my_dict['P04637']  # gives the record
> my_dict['Q15086'] # KeyError
> my_dict['P53_HUMAN'] # KeyError
>
> Is it possible to access same record by both primary and secondary ACs
> (and by uniprot ID) ?

Not directly with Bio.SeqIO.index() or Bio.SeqIO.index_db(), no.
You could perhaps use a second dictionary mapping aliases to
the primary ID?
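
A minimal sketch of such an alias dictionary is below. It works on plain
(primary AC, accession list, entry name) tuples so that it runs without the
UniProt file; with Biopython these values would presumably come from each
record's id, annotations["accessions"] and name, which is an assumption
worth checking against your own records. build_alias_map is a made-up name.

```python
def build_alias_map(entries):
    """Map every accession and entry name to the record's primary accession."""
    aliases = {}
    for primary, accessions, entry_name in entries:
        for key in [primary, entry_name] + list(accessions):
            aliases[key] = primary
    return aliases


# One entry, using the identifiers from the question above.
entries = [("P04637", ["P04637", "Q15086"], "P53_HUMAN")]
aliases = build_alias_map(entries)

# Any alias now resolves to the key that SeqIO.index() understands,
# e.g. my_dict[aliases["Q15086"]].
primary = aliases["Q15086"]
```

Building the map needs one pass over the file, but after that the lookups
cost no more than a normal dictionary access.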

Peter


From n.j.loman at bham.ac.uk  Wed Jul 11 15:02:00 2012
From: n.j.loman at bham.ac.uk (Nick Loman)
Date: Wed, 11 Jul 2012 16:02:00 +0100
Subject: [Biopython] SeqRecord substring should return SeqRecord or
	character?
Message-ID: <CAFMxBqGRgBkVHZcDSkfCFkiOC85BqeddbhZ+dWahELAqnkJzpA@mail.gmail.com>

Hi there

I wanted to add the last character of a SeqRecord s1 to another
SeqRecord s2. However s1[-1] + s2 fails because s1[-1] returns a
string rather than a SeqRecord just containing a single base and
associated annotations. I have to do s1[-1:] to get a sliced
SeqRecord.

Is this behaviour intentional? I kind of assumed I would always get a
SeqRecord from any given slice, and it seems weird to get just a
string back instead, although no doubt there's a good reason for this.

Cheers

Nick


From p.j.a.cock at googlemail.com  Wed Jul 11 15:21:00 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 11 Jul 2012 16:21:00 +0100
Subject: [Biopython] SeqRecord substring should return SeqRecord or
	character?
In-Reply-To: <CAFMxBqGRgBkVHZcDSkfCFkiOC85BqeddbhZ+dWahELAqnkJzpA@mail.gmail.com>
References: <CAFMxBqGRgBkVHZcDSkfCFkiOC85BqeddbhZ+dWahELAqnkJzpA@mail.gmail.com>
Message-ID: <CAKVJ-_7ZeBzMANFHfJX9KfO=7XqWsqjTutDnEcudokqWvyfk0Q@mail.gmail.com>

On Wed, Jul 11, 2012 at 4:02 PM, Nick Loman <n.j.loman at bham.ac.uk> wrote:
> Hi there
>
> I wanted to add the last character of a SeqRecord s1 to another
> SeqRecord s2. However s1[-1] + s2 fails because s1[-1] returns a
> string rather than a SeqRecord just containing a single base and
> associated annotations. I have to do s1[-1:] to get a sliced
> SeqRecord.

You should be able to do SeqRecord+string, and string+SeqRecord,
both of which are specifically tested in the docstring. Have you got
any more details? e.g. Version? Mini-example?

> Is this behaviour intentional? I kind of assumed I would always get a
> SeqRecord from any given slice, and it's seems weird to get just a
> string back instead, although no doubt there's a good reason for this.

For a single base/residue, the whole SeqRecord overhead does
seem unnecessary. As to why you get a single letter string, not
a single letter Seq, IIRC it was mimicking the Seq object.

Peter


From p.j.a.cock at googlemail.com  Wed Jul 11 15:52:41 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 11 Jul 2012 16:52:41 +0100
Subject: [Biopython] SeqRecord substring should return SeqRecord or
	character?
In-Reply-To: <CAFMxBqEDXCacQEG4jwYbG-b8pODfTX2LtBYoaZACJ=HMv5xqTg@mail.gmail.com>
References: <CAFMxBqGRgBkVHZcDSkfCFkiOC85BqeddbhZ+dWahELAqnkJzpA@mail.gmail.com>
	<620A45B10433AE4C81D3F931A02812F93BC80453CF@LESMBX1.adf.bham.ac.uk>
	<CAFMxBqEDXCacQEG4jwYbG-b8pODfTX2LtBYoaZACJ=HMv5xqTg@mail.gmail.com>
Message-ID: <CAKVJ-_4PnMSAsPee1Vc28Q_gHDXcPpSKDJ0X1WYc0YaVd1V8xg@mail.gmail.com>

On Wed, Jul 11, 2012 at 4:24 PM, Nick Loman <n.j.loman at bham.ac.uk> wrote:
> On Wed, Jul 11, 2012 at 4:21 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> On Wed, Jul 11, 2012 at 4:02 PM, Nick Loman <n.j.loman at bham.ac.uk> wrote:
>>> Hi there
>>>
>>> I wanted to add the last character of a SeqRecord s1 to another
>>> SeqRecord s2. However s1[-1] + s2 fails because s1[-1] returns a
>>> string rather than a SeqRecord just containing a single base and
>>> associated annotations. I have to do s1[-1:] to get a sliced
>>> SeqRecord.
>>
>> You should be able to do SeqRecord+string, and string+SeqRecord,
>> both of which are specifically tested in the docstring. Have you got
>> any more details? e.g. Version? Mini-example?
>
> Hi Peter,
>
> It was doing this on a FASTQ record so it's the missing quality
> annotation that cause the problem when trying to do this.

Ah - so the addition should have worked, but you'd lose the
partial quality string. You're stuck with ensuring you have two
SeqRecords, so as you suggested rather than s1[-1]+s2 please
use s1[-1:]+s2 instead. Slightly less clear, but only one character more.

This actually reminds me of similar behaviour with the bytes
string in Python 3, where the same trick is required to get a
single letter bytes string.
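
To illustrate the Python 3 bytes analogy (plain Python only, no
Biopython involved; the variable names are just for this sketch):

```python
data = b"ACGT"

last_index = data[-1]    # indexing gives an int (the byte value), not bytes
last_slice = data[-1:]   # slicing gives a length-one bytes object

print(type(last_index).__name__, last_index)
print(type(last_slice).__name__, last_slice)

# Concatenation only works with the sliced form.
joined = last_slice + b"GG"
print(joined)
```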

>>> Is this behaviour intentional? I kind of assumed I would always get a
>>> SeqRecord from any given slice, and it's seems weird to get just a
>>> string back instead, although no doubt there's a good reason for this.
>>
>> For a single base/residue, the whole SeqRecord overhead does
>> seem unnecessary. As to why you get a single letter string, not
>> a single letter Seq, IIRC it was mimicking the Seq object.
>
> Yes, I guessed the overhead was likely to be the reason ..  not sure
> if there's a satisfactory solution?

Returning a single letter SeqRecord might have been a better
choice, and going back much further in Biopython's history the
Seq object should probably have returned a single letter Seq
(not a single letter string). There is a similar issue with the
columns of an alignment.

Peter


From wheatontrue at gmail.com  Thu Jul 12 08:53:26 2012
From: wheatontrue at gmail.com (Wheaton Little)
Date: Thu, 12 Jul 2012 16:53:26 +0800
Subject: [Biopython] can I use the xml parser in biopython on other xml
	files? how?
Message-ID: <CALntoc0YV+fO0ua1_E=8F84Q6bfU5t7L84e6M_zkJJvw4f5=RA@mail.gmail.com>

I would like to use the Biopython XML parser, if possible, on Google
patent XML files:

http://www.google.com/googlebooks/uspto-patents-applications-text.html

Unfortunately, this is what I get:

>>> t=open('ipa111229.xml','r').read()
>>> import Bio
>>> ttt=Bio.Entrez.read(t[:30000])

Traceback (most recent call last):
  File "<pyshell#20>", line 1, in <module>
    ttt=Bio.Entrez.read(t[:30000])
  File "/Library/Python/2.7/site-packages/Bio/Entrez/__init__.py", line 351, in read
    record = handler.read(handle)
  File "/Library/Python/2.7/site-packages/Bio/Entrez/Parser.py", line 169, in read
    self.parser.ParseFile(handle)
TypeError: argument must have 'read' attribute

What would I have to do to use the parser on this xml?


From b.invergo at gmail.com  Thu Jul 12 09:25:27 2012
From: b.invergo at gmail.com (Brandon Invergo)
Date: Thu, 12 Jul 2012 11:25:27 +0200
Subject: [Biopython] can I use the xml parser in biopython on other xml
 files? how?
In-Reply-To: <CALntoc0YV+fO0ua1_E=8F84Q6bfU5t7L84e6M_zkJJvw4f5=RA@mail.gmail.com>
References: <CALntoc0YV+fO0ua1_E=8F84Q6bfU5t7L84e6M_zkJJvw4f5=RA@mail.gmail.com>
Message-ID: <1342085127.614.10.camel@localhost.localdomain>

With regards to the error that you receive, it's because you're passing
a string to `read()`, when that method requires a file-like object. This
would fix that:
>>> ttt=Bio.Entrez.read(open('ipa111229.xml', 'r'))

However, that wouldn't work because it requires a DTD from NCBI to read
the file.

Why not use one of Python's standard XML libraries (xml.sax, xml.dom
or xml.dom.minidom)?

-brandon

On Thu, 2012-07-12 at 16:53 +0800, Wheaton Little wrote:
> I would like to use the Biopython xml parser, if possible, on google
> patent xmls:
> 
> http://www.google.com/googlebooks/uspto-patents-applications-text.html
> 
> unfortunately, this is what I get:
> 
> >>> t=open('ipa111229.xml','r').read()
> >>> import Bio
> >>> ttt=Bio.Entrez.read(t[:30000])
> 
> Traceback (most recent call last):
>   File "<pyshell#20>", line 1, in <module>
>     ttt=Bio.Entrez.read(t[:30000])
>   File "/Library/Python/2.7/site-packages/Bio/Entrez/__init__.py",
> line 351, in read
>     record = handler.read(handle)
>   File "/Library/Python/2.7/site-packages/Bio/Entrez/Parser.py", line
> 169, in read
>     self.parser.ParseFile(handle)
> TypeError: argument must have 'read' attribute
> 
> What would I have to do to use the parser on this xml?
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython


From p.j.a.cock at googlemail.com  Thu Jul 12 09:35:09 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 12 Jul 2012 10:35:09 +0100
Subject: [Biopython] can I use the xml parser in biopython on other xml
 files? how?
In-Reply-To: <CALntoc0YV+fO0ua1_E=8F84Q6bfU5t7L84e6M_zkJJvw4f5=RA@mail.gmail.com>
References: <CALntoc0YV+fO0ua1_E=8F84Q6bfU5t7L84e6M_zkJJvw4f5=RA@mail.gmail.com>
Message-ID: <CAKVJ-_4ykiFE1QUWKoWX3xTKgXQ5+9KJ1Hkj7zHh7DWcXpdppw@mail.gmail.com>

On Thu, Jul 12, 2012 at 9:53 AM, Wheaton Little <wheatontrue at gmail.com> wrote:
> I would like to use the Biopython xml parser, if possible, on google
> patent xmls:
>
> http://www.google.com/googlebooks/uspto-patents-applications-text.html
>
> unfortunately, this is what I get:
>
>>>> t=open('ipa111229.xml','r').read()
>>>> import Bio
>>>> ttt=Bio.Entrez.read(t[:30000])
>
> Traceback (most recent call last):
>   ...
> TypeError: argument must have 'read' attribute
>
> What would I have to do to use the parser on this xml?

In your example, you opened the file and read all the data into a
string (variable t).

The parser is not expecting a string, but a handle. String objects
don't have a 'read' method, thus this error message.
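
A small sketch of the string-versus-handle distinction, using only the
standard library. The XML text here is a made-up placeholder, and this
shows the general idea only; it does not make the Entrez parser accept
non-NCBI XML:

```python
import io

xml_text = "<root><item>example</item></root>"  # hypothetical content

# A plain string has no read() method, which is what the parser complained about.
print(hasattr(xml_text, "read"))

# Wrapping the string in StringIO gives an in-memory handle with read().
handle = io.StringIO(xml_text)
print(hasattr(handle, "read"))

first_chunk = handle.read(6)
print(first_chunk)
```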

You could 'fix' this particular error by doing:

handle=open('ipa111229.xml','r')
from Bio import Entrez
ttt=Entrez.read(handle)

However, I doubt this will work as the Entrez parser is intended to be
used with the NCBI XML files only.

Python comes with several XML libraries in the standard library.
ElementTree (or cElementTree) is quite popular, but as Brandon points
out there are also DOM and SAX style parsers.
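
As a rough sketch of the ElementTree route: the element and attribute
names below (applications, application, title, id) are invented for
illustration, so the real patent files will need their own tag names.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a small XML document; real files will differ.
xml_text = """<applications>
  <application id="1"><title>First example</title></application>
  <application id="2"><title>Second example</title></application>
</applications>"""

# ET.parse() takes a filename or a handle; ET.fromstring() takes a string.
tree = ET.parse(io.StringIO(xml_text))
root = tree.getroot()

titles = [app.findtext("title") for app in root.iter("application")]
ids = [app.get("id") for app in root.iter("application")]
print(titles, ids)
```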

Peter


From from.d.putto at gmail.com  Thu Jul 12 11:06:58 2012
From: from.d.putto at gmail.com (Sheila the angel)
Date: Thu, 12 Jul 2012 13:06:58 +0200
Subject: [Biopython] access Uniprot record by different ids
In-Reply-To: <CAKVJ-_4J8xrFgyz_2bw__sRH3Okzo2ubstuqpRSnpy8tE4YQzg@mail.gmail.com>
References: <CAFinXcTHhTH8ra0NPD_30DkTUCafWj8efqcMMH-QtUX9dEUrbA@mail.gmail.com>
	<CAKVJ-_4J8xrFgyz_2bw__sRH3Okzo2ubstuqpRSnpy8tE4YQzg@mail.gmail.com>
Message-ID: <CAFinXcShsVE8FDFMNX05KDiYjuyq7_mumy37wKXnzy-BwfNFiQ@mail.gmail.com>

Thanks for the reply.
I have now made two dictionaries, one for uniprot_sprot.dat and another
mapping secondary IDs to primary IDs.
However, it takes too long to do this and I can't pickle my_dict.
I would like to know whether it is possible to dump my_dict (the
uniprot.dat data) to a MySQL database.
I looked at the Biopython BioSQL page but didn't understand much (I am
new to SQL).
Thanks

--
Sheila


On Tue, Jul 10, 2012 at 3:43 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Tue, Jul 10, 2012 at 2:16 PM, Sheila the angel
> <from.d.putto at gmail.com> wrote:
> > I have a Uniprot AC list in which some AC are primary and some are
> > secondary. The function
> > my_dict = SeqIO.index("uniprot_sprot.dat", "swiss")
> > makes dictionary of  uniprot data but I can access a record only by primary
> > AC.
> > my_dict['P04637']  # gives the record
> > my_dict['Q15086'] # KeyError
> > my_dict['P53_HUMAN'] # KeyError
> >
> > Is it possible to access same record by both primary and secondary ACs
> > (and by uniprot ID) ?
>
> Not directly with Bio.SeqIO.index() or Bio.SeqIO.index_db(), no.
> You could perhaps use a second dictionary mapping aliases to
> the primary ID?
>
> Peter
>


From wheatontrue at gmail.com  Thu Jul 12 11:57:51 2012
From: wheatontrue at gmail.com (Wheaton Little)
Date: Thu, 12 Jul 2012 19:57:51 +0800
Subject: [Biopython] can I use the xml parser in biopython on other xml
 files? how?
In-Reply-To: <CAKVJ-_4ykiFE1QUWKoWX3xTKgXQ5+9KJ1Hkj7zHh7DWcXpdppw@mail.gmail.com>
References: <CALntoc0YV+fO0ua1_E=8F84Q6bfU5t7L84e6M_zkJJvw4f5=RA@mail.gmail.com>
	<CAKVJ-_4ykiFE1QUWKoWX3xTKgXQ5+9KJ1Hkj7zHh7DWcXpdppw@mail.gmail.com>
Message-ID: <CALntoc1R5_oCYCHNZAJ6g26G3e4EujSg+hy0GZ+xA_Sqi+28Yg@mail.gmail.com>

Indeed, it didn't like that.  Using BeautifulSoup seems to work, but
I'm not sure how well...

Thanks for the advice, all!

On Thu, Jul 12, 2012 at 5:35 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Thu, Jul 12, 2012 at 9:53 AM, Wheaton Little <wheatontrue at gmail.com> wrote:
>> I would like to use the Biopython xml parser, if possible, on google
>> patent xmls:
>>
>> http://www.google.com/googlebooks/uspto-patents-applications-text.html
>>
>> unfortunately, this is what I get:
>>
>>>>> t=open('ipa111229.xml','r').read()
>>>>> import Bio
>>>>> ttt=Bio.Entrez.read(t[:30000])
>>
>> Traceback (most recent call last):
>>   ...
>> TypeError: argument must have 'read' attribute
>>
>> What would I have to do to use the parser on this xml?
>
> In your example, you opened the file and read all the data into a
> string (variable t).
>
> The parser is not expecting a string, but a handle. String objects
> don't have a 'read' method, thus this error message.
>
> You could 'fix' this particular error by doing:
>
> handle=open('ipa111229.xml','r')
> from Bio import Entrez
> ttt=Entrez.read(handle)
>
> However, I doubt this will work as the Entrez parser is intended to be
> used with the NCBI XML files only.
>
> Python comes with several XML libraries in the standard library.
> ElementTree (or cElementTree) is quite popular, but as Brandon points
> out there are also DOM and SAX style parsers.
>
> Peter


From chaudhrynabeelahmed at gmail.com  Thu Jul 12 12:58:58 2012
From: chaudhrynabeelahmed at gmail.com (Nabeel Ahmed)
Date: Thu, 12 Jul 2012 17:58:58 +0500
Subject: [Biopython] Bioinformatics EMBOSS users
Message-ID: <CAAmMzEsyjvF+B5KWH1nh+-wdTqEJ6P75omoMjFfR27CZoPUJSQ@mail.gmail.com>

I have recently installed EMBOSS-6.4.0 (Ubuntu 11.10).
I am unable to make it work directly with live databases (embl, uniprot),
although it works fine with local sequence files, e.g.:

% plotorf
Plot potential open reading frames in a nucleotide sequence
Input nucleotide sequence: embl:x13776

Error: Failed to open filename 'embl'

I ran 'showdb', which displayed a table with zero rows.

Is there any configuration I am missing?

Ahmed


From p.j.a.cock at googlemail.com  Thu Jul 12 18:30:33 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 12 Jul 2012 19:30:33 +0100
Subject: [Biopython] Bioinformatics EMBOSS users
In-Reply-To: <CAAmMzEsyjvF+B5KWH1nh+-wdTqEJ6P75omoMjFfR27CZoPUJSQ@mail.gmail.com>
References: <CAAmMzEsyjvF+B5KWH1nh+-wdTqEJ6P75omoMjFfR27CZoPUJSQ@mail.gmail.com>
Message-ID: <CAKVJ-_6YpUYWPR4FjuXbXfAvEzGzKMmrpvshCU6HjNbVpKaAHA@mail.gmail.com>

On Thu, Jul 12, 2012 at 1:58 PM, Nabeel Ahmed
<chaudhrynabeelahmed at gmail.com> wrote:
> I have recently installed EMBOSS-6.4.0 (Ubuntu 11.10).
> I am unable to make it work directly with live databases (embl, uniprot) ,
> working totally fine with local sequence files.
> e.g
>
> % *plotorf  *
> Plot potential open reading frames in a nucleotide sequence
> Input nucleotide sequence: *embl:x13776*
>
> *Error:* Failed to open filename 'embl'**
>
> Used 'showdb' , displayed table with zero rows.
>
> Is there any configuration, i am missing??
>
> Ahmed

I'm not sure - but the EMBOSS mailing list would be the place to ask:
http://lists.open-bio.org/mailman/listinfo/emboss

Peter


From p.j.a.cock at googlemail.com  Thu Jul 12 18:37:11 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 12 Jul 2012 19:37:11 +0100
Subject: [Biopython] access Uniprot record by different ids
In-Reply-To: <CAFinXcShsVE8FDFMNX05KDiYjuyq7_mumy37wKXnzy-BwfNFiQ@mail.gmail.com>
References: <CAFinXcTHhTH8ra0NPD_30DkTUCafWj8efqcMMH-QtUX9dEUrbA@mail.gmail.com>
	<CAKVJ-_4J8xrFgyz_2bw__sRH3Okzo2ubstuqpRSnpy8tE4YQzg@mail.gmail.com>
	<CAFinXcShsVE8FDFMNX05KDiYjuyq7_mumy37wKXnzy-BwfNFiQ@mail.gmail.com>
Message-ID: <CAKVJ-_6y5YU-Neu_HHh_bB_PykwskxNNr_2cvVvaBkg26J=g7A@mail.gmail.com>

On Thu, Jul 12, 2012 at 12:06 PM, Sheila the angel
<from.d.putto at gmail.com> wrote:
> Thanks for reply.
> Now I made two dictionary one for uniprot_sprot.dat and another for
> secondary ids to primary ids.
> However it take too long to do this and I can't do Pickle for my_dict.
> I would like to know is it possible to dump my_dict (the uniprot.dat data)
> to MySql database.

Have you tried the Bio.SeqIO.index_db(...) function? This builds
an SQLite database to hold the lookup table of offsets (i.e. the
primary accession only). Creating the index is a little slow, but
reuse is very fast.

For your second dictionary mapping secondary accessions to
the primary accession, you should be able to use pickle.
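
For the pickle part, something along these lines should do. This is a
sketch with a hypothetical file name and a toy alias table, not tested
against the full Swiss-Prot data:

```python
import os
import pickle
import tempfile

# Hypothetical alias table: secondary accession or entry name -> primary AC.
aliases = {"Q15086": "P04637", "P53_HUMAN": "P04637"}

path = os.path.join(tempfile.mkdtemp(), "aliases.pickle")

# Save once after building the mapping...
with open(path, "wb") as handle:
    pickle.dump(aliases, handle, protocol=pickle.HIGHEST_PROTOCOL)

# ...and reload it quickly in later sessions.
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored["P53_HUMAN"])
```

Only the small alias dictionary is pickled here; the record lookup itself
stays with the index.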

> I looked at biopython-BioSQL page  but didn't understand much
> (I am new to SQL)
> Thanks

BioSQL is a bit complicated to get started with (although
using SQLite is a lot simpler than MySQL or PostgreSQL).

Peter


From livingstonemark at gmail.com  Tue Jul 17 01:49:37 2012
From: livingstonemark at gmail.com (Mark Livingstone)
Date: Tue, 17 Jul 2012 11:49:37 +1000
Subject: [Biopython] The PDBParser Permissive setting
Message-ID: <CABGYGEzR62gNjt8tCX2q0N8EfwU7ruhebLdkjM934JnWsSZ=tg@mail.gmail.com>

Hi Guys,

In my code I am experimenting with different ways of doing RMSD
calculations. I have code which in addition to normal CA based RMSD
can do (CA & CB) RMSD and also sidechain RMSD. On a perfect PDB file
this works well. Unfortunately, the curation I have is fairly average
/ poor in quality :-( and I only find out when one of the liberal
number of Try/Except blocks falls over.

I need a better way to find out sooner if a PDB file is missing data.

I am wondering, therefore: if I set PERMISSIVE=0 for PDBParser and,
after selecting the relevant models and chains etc., I did


wt_atoms = Bio.PDB.Selection.unfold_entities(wtc, 'A')

If this successfully works without throwing an Exception, can I assume
that this unfolded chain is perfect, or are there ways that I could
still be tripped up?

Alternatively, can anyone suggest code that I can employ in my
curation process that will give me a decent sanity check of PDB
quality, so I can get on writing experimental code - and not
Try/Except blocks :-(

Thanks in advance,

MarkL


From anaryin at gmail.com  Tue Jul 17 06:42:09 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 17 Jul 2012 02:42:09 -0400
Subject: [Biopython] The PDBParser Permissive setting
In-Reply-To: <CABGYGEzR62gNjt8tCX2q0N8EfwU7ruhebLdkjM934JnWsSZ=tg@mail.gmail.com>
References: <CABGYGEzR62gNjt8tCX2q0N8EfwU7ruhebLdkjM934JnWsSZ=tg@mail.gmail.com>
Message-ID: <CAJ9sUYMpVP+sxLvw5MssvY6TGhy-RJzoPFZ0qNOKMue72D9o6w@mail.gmail.com>

Hey Mark,

What kind of validation do you want?

Cheers,

João
On 17 Jul 2012 02:52, "Mark Livingstone" <livingstonemark at gmail.com> wrote:

> Hi Guys,
>
> In my code I am experimenting with different ways of doing RMSD
> calculations. I have code which in addition to normal CA based RMSD
> can do (CA & CB) RMSD and also sidechain RMSD. On a perfect PDB file
> this works well. Unfortunately, the curation I have is fairly average
> / poor in quality :-( and I only find out when one of the liberal
> number of Try/Except blocks falls over.
>
> I need a better way to find out sooner if a PDB file is missing data.
>
> I am wondering therefore is for PDBParser I set Permissive=0, and
> after setting the relevant models and chains etc, I did
>
>
> wt_atoms = Bio.PDB.Selection.unfold_entities(wtc, 'A')
>
> If this successfully works without throwing an Exception, can I assume
> that this unfolded chain is perfect, or are there ways that I could
> still be tripped up?
>
> Alternatively, can anyone suggest code that I can employ in my
> curation process that will give me a decent sanity check of PDB
> quality, so I can get on writing experimental code - and not
> Try/Except blocks :-(
>
> Thanks in advance,
>
> MarkL
>


From livingstonemark at gmail.com  Tue Jul 17 08:54:50 2012
From: livingstonemark at gmail.com (Mark Livingstone)
Date: Tue, 17 Jul 2012 18:54:50 +1000
Subject: [Biopython] The PDBParser Permissive setting
In-Reply-To: <CAJ9sUYMpVP+sxLvw5MssvY6TGhy-RJzoPFZ0qNOKMue72D9o6w@mail.gmail.com>
References: <CABGYGEzR62gNjt8tCX2q0N8EfwU7ruhebLdkjM934JnWsSZ=tg@mail.gmail.com>
	<CAJ9sUYMpVP+sxLvw5MssvY6TGhy-RJzoPFZ0qNOKMue72D9o6w@mail.gmail.com>
Message-ID: <CABGYGEzRbB1FLH4TSU1_DQu_V6zUFikBe948uO1Zug+eLA0cHw@mail.gmail.com>

Hi João,

I guess it would be good if I could get a data structure that had no
discontinuities, no missing data points or unknowns. I would be able
to tell it to ignore HOH or other irrelevancies.

My use case as I mentioned is RMSD and similar algorithms, so one
continuous structure with all the data attached that I can iterate
through, selecting atoms / residues as needed, and get the names and
coordinates as I go.

So I guess I want a PDB Diagnostic type program to allow me to find
exemplary PDB files to use during initial stages of development while
I do proof of concept, since finding edge-case PDBs for later work
is, it seems, not as hard as finding good ones ;-) Maybe the simplest
way to describe the sort of PDB files I mean is that you can run your
software on them without needing any try / except blocks for Biopython
to work well

Cheers,

MarkL

On 17 July 2012 16:42, João Rodrigues <anaryin at gmail.com> wrote:
> Hey Mark,
>
> What kind of validation do you want?
>
> Cheers,
>
> João
>
> On 17 Jul 2012 02:52, "Mark Livingstone"
> <livingstonemark at gmail.com> wrote:
>>
>> Hi Guys,
>>
>> In my code I am experimenting with different ways of doing RMSD
>> calculations. I have code which in addition to normal CA based RMSD
>> can do (CA & CB) RMSD and also sidechain RMSD. On a perfect PDB file
>> this works well. Unfortunately, the curation I have is fairly average
>> / poor in quality :-( and I only find out when one of the liberal
>> number of Try/Except blocks falls over.
>>
>> I need a better way to find out sooner if a PDB file is missing data.
>>
>> I am wondering therefore is for PDBParser I set Permissive=0, and
>> after setting the relevant models and chains etc, I did
>>
>>
>> wt_atoms = Bio.PDB.Selection.unfold_entities(wtc, 'A')
>>
>> If this successfully works without throwing an Exception, can I assume
>> that this unfolded chain is perfect, or are there ways that I could
>> still be tripped up?
>>
>> Alternatively, can anyone suggest code that I can employ in my
>> curation process that will give me a decent sanity check of PDB
>> quality, so I can get on writing experimental code - and not
>> Try/Except blocks :-(
>>
>> Thanks in advance,
>>
>> MarkL


From anaryin at gmail.com  Tue Jul 17 09:35:56 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 17 Jul 2012 10:35:56 +0100
Subject: [Biopython] The PDBParser Permissive setting
In-Reply-To: <CABGYGEzRbB1FLH4TSU1_DQu_V6zUFikBe948uO1Zug+eLA0cHw@mail.gmail.com>
References: <CABGYGEzR62gNjt8tCX2q0N8EfwU7ruhebLdkjM934JnWsSZ=tg@mail.gmail.com>
	<CAJ9sUYMpVP+sxLvw5MssvY6TGhy-RJzoPFZ0qNOKMue72D9o6w@mail.gmail.com>
	<CABGYGEzRbB1FLH4TSU1_DQu_V6zUFikBe948uO1Zug+eLA0cHw@mail.gmail.com>
Message-ID: <CAJ9sUYNAEADas8r6PQ+uoy-_WJTPBRAwbwOOFaqRsei+0_vZhg@mail.gmail.com>

You mean for example, no chain breaks? And no missing atoms in residues?
You can check the first one with a warning catcher (I think I answered
something like this a while ago here on the mailing list). The second
one is trickier, you'll need a sort of topology to know which atoms belong
to each residue. I have something like that in my GSOC branch but it's very
very very experimental..
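
Just to sketch both checks in plain Python (this is not the code from my
branch): the "topology" below is a hypothetical backbone-only set, the
chain is a toy dict, and gaps in residue numbering are only a rough hint
of a chain break, not a proper distance-based test.

```python
# Hypothetical minimal topology: backbone atoms expected in every residue.
BACKBONE = {"N", "CA", "C", "O"}

# Toy chain: residue number -> set of atom names present.
chain = {
    1: {"N", "CA", "C", "O", "CB"},
    2: {"N", "CA", "C"},          # missing O
    4: {"N", "CA", "C", "O"},     # residue 3 is absent
}


def missing_backbone(chain):
    """Map residue number -> sorted list of missing backbone atoms."""
    return {num: sorted(BACKBONE - atoms)
            for num, atoms in chain.items() if BACKBONE - atoms}


def numbering_gaps(chain):
    """Return (previous, next) pairs where residue numbering is not consecutive."""
    nums = sorted(chain)
    return [(a, b) for a, b in zip(nums, nums[1:]) if b - a != 1]


print(missing_backbone(chain))
print(numbering_gaps(chain))
```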

Is this what you mean? Which others would you be looking for? I think that
for RMSD alone you need only to make sure that you match equivalent atoms.
That should be easy enough without major modifications or endless
try/excepts :)
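
A rough sketch of what I mean by matching equivalent atoms, in plain
Python. It assumes the coordinates have already been pulled out into
dicts keyed by (residue number, atom name) and that the structures are
already superimposed; the function name and the toy values are made up.

```python
import math


def rmsd_common_atoms(coords_a, coords_b):
    """RMSD over atoms present in both mappings, without superposition.

    Each mapping goes from (residue number, atom name) to an (x, y, z) tuple.
    """
    shared = sorted(set(coords_a) & set(coords_b))
    if not shared:
        raise ValueError("no equivalent atoms to compare")
    total = 0.0
    for key in shared:
        total += sum((p - q) ** 2 for p, q in zip(coords_a[key], coords_b[key]))
    return math.sqrt(total / len(shared)), shared


# Toy coordinates (hypothetical values).
wild_type = {
    (1, "CA"): (0.0, 0.0, 0.0),
    (2, "CA"): (3.0, 0.0, 0.0),
    (2, "CB"): (3.0, 1.5, 0.0),   # no counterpart in the mutant, so skipped
}
mutant = {
    (1, "CA"): (0.0, 0.0, 1.0),
    (2, "CA"): (3.0, 0.0, 1.0),
}

value, used = rmsd_common_atoms(wild_type, mutant)
print(value, used)
```

Atoms missing from either side are simply left out, so no try/except is
needed for incomplete residues.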


From dilara.ally at gmail.com  Tue Jul 17 22:24:07 2012
From: dilara.ally at gmail.com (Dilara Ally)
Date: Tue, 17 Jul 2012 15:24:07 -0700
Subject: [Biopython] When is a SeqRecord a SeqRecord
Message-ID: <CAEfb3scVvMQ6pVH+mUUZrAb_5E5gCKh59cipvK6-tTh9iMV1BQ@mail.gmail.com>

Hi

I've modified my code, but why does the inclusion of `return None` and the
subsequent check `if filtered_rec is not None` solve the problem? Thanks!

Dilara

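# Assumption: array is NumPy's array (the original import was not shown)
from numpy import array

q_threshold=20

def check_meanQ(rec, q_threshold):
    # Return rec if its rounded mean quality exceeds the threshold, else None
    seqlen=len(rec)
    quality_scores=array(rec.letter_annotations["phred_quality"])
    if round(quality_scores.mean()) <= q_threshold:
        print "Discarded ", rec.id, "because mean Q was", round(quality_scores.mean())
        return None
    if round(quality_scores.mean()) > q_threshold:
        return rec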


from Bio import SeqIO
for rec in SeqIO.parse("test.fastq", "fastq"):
    #print rec.id
    filtered_rec= check_meanQ(rec, q_threshold)
    if filtered_rec is not None:
        print filtered_rec.id
        print filtered_rec.letter_annotations


From dilara.ally at gmail.com  Tue Jul 17 19:11:12 2012
From: dilara.ally at gmail.com (Dilara Ally)
Date: Tue, 17 Jul 2012 12:11:12 -0700
Subject: [Biopython] when is a SeqRecord not a SeqRecord
Message-ID: <CAEfb3sd_Ft_jyPxNGriRwhy0CGetgS0o0DcqLD61zw3P+rkWhQ@mail.gmail.com>

Hi

I'm trying to understand why, when I print filtered_rec, I get a
SeqRecord, but if I try to access any particular attribute of a SeqRecord
such as letter_annotations I sometimes get an attribute error:
AttributeError: 'NoneType' object has no attribute 'letter_annotations'


# Assumption: array is NumPy's array (the original import was not shown)
from numpy import array

q_threshold=20

def check_meanQ(record, q_threshold):
    seqlen=len(record)
    quality_scores=array(record.letter_annotations["phred_quality"])
    if round(quality_scores.mean()) <= q_threshold:
        print "Discarded ", record.id, "because mean Q was", round(quality_scores.mean())
    elif round(quality_scores.mean()) > q_threshold:
        return record

from Bio import SeqIO
for rec in SeqIO.parse("test.fastq", "fastq"):
    print rec.id
    filtered_rec= check_meanQ(rec, q_threshold)
    #print filtered_rec
    print filtered_rec.letter_annotations

I've attached two FASTQ files that I've used with this code; one is called
test.fastq and the other is hiseq_pe_test.fastq

Any help would be greatly appreciated.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.fastq
Type: application/octet-stream
Size: 39217 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20120717/365cf2a5/attachment-0004.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hiseq_pe_test.fastq
Type: application/octet-stream
Size: 1541 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20120717/365cf2a5/attachment-0005.obj>

From chapmanb at 50mail.com  Wed Jul 18 13:23:27 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 18 Jul 2012 09:23:27 -0400
Subject: [Biopython] when is a SeqRecord not a SeqRecord
In-Reply-To: <CAEfb3sd_Ft_jyPxNGriRwhy0CGetgS0o0DcqLD61zw3P+rkWhQ@mail.gmail.com>
References: <CAEfb3sd_Ft_jyPxNGriRwhy0CGetgS0o0DcqLD61zw3P+rkWhQ@mail.gmail.com>
Message-ID: <87y5mhkxtc.fsf@fastmail.fm>


Dilara;

> I'm trying to understand what is why when I print filtered_rec I get a
> SeqRecord but if I try to access any particular attribute of a SeqRecord
> such as letter_annotations I sometimes get an attribute error --
> AttributeError: 'NoneType' object has no attribute
> 'letter_annotations.'

> def check_meanQ(record, q_threshold):
>     seqlen=len(record)
>     quality_scores=array(record.letter_annotations["phred_quality"])
>     if round(quality_scores.mean()) <= q_threshold:
>         print "Discarded ", record.id, "because mean Q was", round(quality_scores.mean())
>     elif round(quality_scores.mean()) > q_threshold:
>         return record

This function returns different results based on the comparison of
mean quality scores to your threshold:

- When it is at or below the threshold, it returns None (since you do not
  define an explicit return value)
- When it is above the threshold, it returns a SeqRecord.

> from Bio import SeqIO
> for rec in SeqIO.parse("test.fastq", "fastq"):
>     print rec.id
>     filtered_rec= check_meanQ(rec, q_threshold)
>     #print filtered_rec
>     print filtered_rec.letter_annotations

You are seeing the error since in the filtered cases the function
returns None. You probably want:

filtered_rec= check_meanQ(rec, q_threshold)
if filtered_rec is not None:
   print filtered_rec.letter_annotations

Brad


From bioinformaticsing at gmail.com  Thu Jul 19 03:36:19 2012
From: bioinformaticsing at gmail.com (ning luwen)
Date: Thu, 19 Jul 2012 11:36:19 +0800
Subject: [Biopython] Error while parsing bgk file
Message-ID: <CAO51=Z4SG=0whhreVzHVrDAn2e1MPnQDjSW2Do_sF_c6EJhX0Q@mail.gmail.com>

Hi everyone,

I encountered an error when parsing a GBK file.

The error message is as follows:

Traceback (most recent call last):
  File "stat_refseq_gbs.py", line 10, in <module>
    for seq in f:
  File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 537, in parse
    for r in i:
  File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 445, in parse_records
    record = self.parse(handle, do_features)
  File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 428, in parse
    if self.feed(handle, consumer, do_features):
  File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 400, in feed
    self._feed_feature_table(consumer, self.parse_features(skip=False))
  File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 350, in _feed_feature_table
    consumer.location(location_string)
  File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/__init__.py", line 970, in location
    int(e),
ValueError: invalid literal for int() with base 10: '68452073^68452074'

The file parsed is ref_GRCh37.p5 and the Biopython version is 1.60. The
lines causing the error may be:

     V_segment       complement(68451760..68452073^68452074)
     CDS             complement(<68451760..68452072^68452073)

-- 
regards,
luwen ning


From w.arindrarto at gmail.com  Thu Jul 19 08:50:33 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Thu, 19 Jul 2012 10:50:33 +0200
Subject: [Biopython] Error while parsing bgk file
In-Reply-To: <CAO51=Z4SG=0whhreVzHVrDAn2e1MPnQDjSW2Do_sF_c6EJhX0Q@mail.gmail.com>
References: <CAO51=Z4SG=0whhreVzHVrDAn2e1MPnQDjSW2Do_sF_c6EJhX0Q@mail.gmail.com>
Message-ID: <CADEGkF5tTA=tDVLTmh0nGcyN-W-ig5bL8kCVwTPk+qq8DkK-WA@mail.gmail.com>

Hi Ning,

Thanks for reporting the error. A similar issue has been reported in
the bug tracker here: https://redmine.open-bio.org/issues/3175 (it
also looks like it's the same coordinate). It seems that this could be
an invalid GenBank coordinate made by NCBI, though.
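
To show where the ValueError comes from, here is a plain-Python sketch.
The helper split_between is hypothetical and is not how Biopython's
parser handles (or should handle) such locations; it only shows that
int() cannot digest the 'a^b' form directly.

```python
location_end = "68452073^68452074"  # the value from the traceback

try:
    int(location_end)
    parsed = True
except ValueError:
    parsed = False  # int() rejects the '^' character


def split_between(position):
    """Split 'a^b' into two ints; a plain 'a' gives a one-element tuple."""
    return tuple(int(part) for part in position.split("^"))


pair = split_between(location_end)
print(parsed, pair)
```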

From which chromosome is this coordinate coming? Is it the latest draft?

cheers,
Bow


On Thu, Jul 19, 2012 at 5:36 AM, ning luwen <bioinformaticsing at gmail.com> wrote:
> Hi everyone,
>
> A error encountered when i parse a gbk file.
>
> the error message as follow:
>
> Traceback (most recent call last):
>   File "stat_refseq_gbs.py", line 10, in <module>
>     for seq in f:
>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/SeqIO/__init__.py",
> line 537, in parse
>     for r in i:
>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py",
> line 445, in parse_records
>     record = self.parse(handle, do_features)
>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py",
> line 428, in parse
>     if self.feed(handle, consumer, do_features):
>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py",
> line 400, in feed
>     self._feed_feature_table(consumer, self.parse_features(skip=False))
>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py",
> line 350, in _feed_feature_table
>     consumer.location(location_string)
>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/__init__.py",
> line 970, in location
>     int(e),
> ValueError: invalid literal for int() with base 10: '68452073^68452074'
>
> the file parsed is ref_GRCh37.p5, the biopython version is 1.60, the
> lines cause the error may be:
>
>      V_segment       complement(68451760..68452073^68452074)
>      CDS             complement(<68451760..68452072^68452073)
>
> --
> regards,
> luwen ning
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython


From dilara.ally at gmail.com  Thu Jul 19 15:51:35 2012
From: dilara.ally at gmail.com (Dilara Ally)
Date: Thu, 19 Jul 2012 08:51:35 -0700
Subject: [Biopython] slice a record in two and writing both records
Message-ID: <CAEfb3sfpsCg4MgTPxXEh-oyPFe0OjCA9yw5d45cGZd_R6_9bvQ@mail.gmail.com>

If I have a function (modify_record) that slices up a SeqRecord into sub
records and then returns the sliced record if it has a certain length (for
e.g. the sliced record needs to be greater than 40bp), sometimes the
original record when sliced will have two different records both greater
than 40bp.  I want to keep both sliced reads and rewrite them as separate
records into a single fastq file.  Here is my code:

def modify_record(frec, win, len_threshold):
    quality_scores = array(frec.letter_annotations["phred_quality"])
    all_window_qc = slidingWindow(quality_scores, win,1)
    track_qc = windowQ(all_window_qc)
    myzeros = boolean_array(track_qc, q_threshold,win)
    Nrec = slice_points(myzeros,win)[0][1]-1
    where_to_slice = slice_points(myzeros,win)[1]
    where_to_slice.append(len(frec)+win)
    sub_record = slice_record(frec,Nrec,where_to_slice,win,len_threshold)
    return sub_record

q_threshold = 20
win = 5
len_threshold = 30

from Bio import SeqIO
from numpy import *
good_reads = (rec for rec in SeqIO.parse("hiseq_pe_test.fastq", "fastq") if
array(rec.letter_annotations["phred_quality"]).mean() >= q_threshold)
count = SeqIO.write(good_reads, "temp.fastq", "fastq")
print "Saved %i reads" % count

newly_filtered=[]
for rec in SeqIO.parse("temp.fastq", "fastq"):
    s = modify_record(rec, win, len_threshold)
    newly_filtered.append(s)
    SeqIO.write(newly_filtered, "filtered_temp.fastq", "fastq")

This writes only the first sub_record even when there are more than 1 that
have a len >40bp. I've tried this as a generator expression and I'm still
getting just the first sub_record.   I'd also prefer to not to use append
as it was previously suggested that this can lead to problems if you run
the script more than once.  Instead, I want to employ a generator
expression - but I'm still getting used to the idea of generator
expressions.

My second question is more general.  Generator expressions are more memory
efficient than a list comprehension, but how are they better than just a
simple loop that pulls in a single record, does something and then writes
that record? Is it just a time issue?

Many thanks for the help!


From w.arindrarto at gmail.com  Thu Jul 19 17:21:42 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Thu, 19 Jul 2012 19:21:42 +0200
Subject: [Biopython] slice a record in two and writing both records
In-Reply-To: <CAEfb3sfpsCg4MgTPxXEh-oyPFe0OjCA9yw5d45cGZd_R6_9bvQ@mail.gmail.com>
References: <CAEfb3sfpsCg4MgTPxXEh-oyPFe0OjCA9yw5d45cGZd_R6_9bvQ@mail.gmail.com>
Message-ID: <CADEGkF6w2eB6stTd-oraRxoD2X5UMvjGgWAvWoVtvR6W0seCQQ@mail.gmail.com>

Hi Dilara,

For your first question, it seems that the `modify_record` function
always returns only one SeqRecord object. This is a bit of guesswork
on my end, as I don't know how most of the functions in `modify_record`
work, but since you still see an output sequence at the end, I think
you may want to re-check how `slice_record` builds `sub_record` and
whether it can return more than one SeqRecord object.

Also, you might want to try changing the last two lines:

    newly_filtered.append(s)
    SeqIO.write(newly_filtered, "filtered_temp.fastq", "fastq")

Here, you're calling `SeqIO.write` on every iteration of the loop, so
the whole file is rewritten each time. Although the end result is the
same (a file containing all the sequences you want), the code can be
made more efficient by putting the SeqIO.write line outside of the
loop, after all sequences are pooled in the `newly_filtered` list.

For your second question, I personally find generator expressions to
be more compact and easier to read. This is important for future code
maintenance: having more readable lines of code means it's easier to
understand your code and to debug it in case something goes wrong.
Note that generator expressions aren't silver bullets. In some cases,
for loops may still be better (e.g. if you're doing complex operations
on the objects you're iterating over).
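
To make the comparison concrete, here is a small sketch with made-up data (tuples of read name and quality scores stand in for SeqRecord objects). It shows the same filter written as a generator expression and as a generator function built on a for loop. Both are lazy, so memory use is similar, and the choice is mostly about readability.

```python
def mean(values):
    """Arithmetic mean, written so it behaves the same on Python 2 and 3."""
    return float(sum(values)) / len(values)


# Stand-in data: (read name, quality scores); real code would use SeqRecords.
reads = [("r1", [30, 32, 31]), ("r2", [10, 12, 11]), ("r3", [25, 20, 22])]
q_threshold = 20

# Generator expression: nothing is evaluated until something iterates over it.
good_expr = (name for name, quals in reads if mean(quals) >= q_threshold)


# The same filter as a generator function, with room for more complex logic.
def good_loop(reads, threshold):
    for name, quals in reads:
        if mean(quals) >= threshold:
            yield name


names_from_expr = list(good_expr)
names_from_loop = list(good_loop(reads, q_threshold))
```

Either object can be handed straight to something that consumes an iterator, which is how `SeqIO.write` is used in the original script.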

I found these two sites helpful when I first grappled with generators
and generator expressions. I hope they are helpful to you too:

* http://stackoverflow.com/questions/1995418/python-generator-expression-vs-yield
* http://www.dabeaz.com/generators/Generators.pdf (PDF)

Hope that helps :),
Bow


On Thu, Jul 19, 2012 at 5:51 PM, Dilara Ally <dilara.ally at gmail.com> wrote:
> If I have a function (modify_record) that slices up a SeqRecord into sub
> records and then returns the sliced record if it has a certain length (for
> e.g. the sliced record needs to be greater than 40bp), sometimes the
> original record when sliced will have two different records both greater
> than 40bp.  I want to keep both sliced reads and rewrite them as separate
> records into a single fastq file.  Here is my code:
>
> def modify_record(frec, win, len_threshold):
>     quality_scores = array(frec.letter_annotations["phred_quality"])
>     all_window_qc = slidingWindow(quality_scores, win,1)
>     track_qc = windowQ(all_window_qc)
>     myzeros = boolean_array(track_qc, q_threshold,win)
>     Nrec = slice_points(myzeros,win)[0][1]-1
>     where_to_slice = slice_points(myzeros,win)[1]
>     where_to_slice.append(len(frec)+win)
>     sub_record = slice_record(frec,Nrec,where_to_slice,win,len_threshold)
>     return sub_record
>
> q_threshold = 20
> win = 5
> len_threshold = 30
>
> from Bio import SeqIO
> from numpy import *
> good_reads = (rec for rec in SeqIO.parse("hiseq_pe_test.fastq", "fastq") if
> array(rec.letter_annotations["phred_quality"]).mean() >= q_threshold)
> count = SeqIO.write(good_reads, "temp.fastq", "fastq")
> print "Saved %i reads" % count
>
> newly_filtered=[]
> for rec in SeqIO.parse("temp.fastq", "fastq"):
>     s = modify_record(rec, win, len_threshold)
>     newly_filtered.append(s)
>     SeqIO.write(newly_filtered, "filtered_temp.fastq", "fastq")
>
> This writes only the first sub_record even when there are more than 1 that
> have a len >40bp. I've tried this as a generator expression and I'm still
> getting just the first sub_record.   I'd also prefer to not to use append
> as it was previously suggested that this can lead to problems if you run
> the script more than once.  Instead, I want to employ a generator
> expression - but I'm still getting used to the idea of generator
> expressions.
>
> My second question is more general.  Generator expressions are more memory
> efficient than a list comprehension, but how are they better than just a
> simple loop that pulls in a single record, does something and then writes
> that record? Is it just a time issue?
>
> Many thanks for the help!


From bioinformaticsing at gmail.com  Fri Jul 20 03:41:20 2012
From: bioinformaticsing at gmail.com (ning luwen)
Date: Fri, 20 Jul 2012 11:41:20 +0800
Subject: [Biopython] Fwd:  Error while parsing bgk file
In-Reply-To: <CALfq9t+_aq6iX0zcEA=wRL3j4=28-dvJQs-Lyjj94adZZCS2BQ@mail.gmail.com>
References: <CAO51=Z4SG=0whhreVzHVrDAn2e1MPnQDjSW2Do_sF_c6EJhX0Q@mail.gmail.com>
	<CALfq9t+_aq6iX0zcEA=wRL3j4=28-dvJQs-Lyjj94adZZCS2BQ@mail.gmail.com>
Message-ID: <CAO51=Z66NPZEGh-P2CoHBn-66D49HQgKwHqjeL=FGx1tMPxtPA@mail.gmail.com>

---------- Forwarded message ----------
From: Lenna Peterson <lennalenna at gmail.com>
Date: Thu, Jul 19, 2012 at 12:51 PM
Subject: Re: [Biopython] Error while parsing bgk file
To: ning luwen <bioinformaticsing at gmail.com>


On Wed, Jul 18, 2012 at 11:36 PM, ning luwen
<bioinformaticsing at gmail.com> wrote:
> Hi everyone,
>
> A error encountered when i parse a gbk file.
>
> the error message as follow:
>
> Traceback (most recent call last):
>   File "stat_refseq_gbs.py", line 10, in <module>
>     for seq in f:
>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/SeqIO/__init__.py",
> line 537, in parse
>     for r in i:
>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py",
> line 445, in parse_records
>     record = self.parse(handle, do_features)
>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py",
> line 428, in parse
>     if self.feed(handle, consumer, do_features):
>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py",
> line 400, in feed
>     self._feed_feature_table(consumer, self.parse_features(skip=False))
>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py",
> line 350, in _feed_feature_table
>     consumer.location(location_string)
>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/__init__.py",
> line 970, in location
>     int(e),
> ValueError: invalid literal for int() with base 10: '68452073^68452074'
>
> the file parsed is ref_GRCh37.p5, the biopython version is 1.60, the
> lines cause the error may be:
>
>      V_segment       complement(68451760..68452073^68452074)
>      CDS             complement(<68451760..68452072^68452073)
>
> --
> regards,
> luwen ning


Hi Luwen,

Thanks for reporting this problem. I've submitted a patch that should fix it.

https://github.com/biopython/biopython/pull/54

Lenna


-- 
regards,
luwen ning


From bioinformaticsing at gmail.com  Fri Jul 20 03:56:51 2012
From: bioinformaticsing at gmail.com (ning luwen)
Date: Fri, 20 Jul 2012 11:56:51 +0800
Subject: [Biopython] Error while parsing bgk file
In-Reply-To: <CADEGkF5tTA=tDVLTmh0nGcyN-W-ig5bL8kCVwTPk+qq8DkK-WA@mail.gmail.com>
References: <CAO51=Z4SG=0whhreVzHVrDAn2e1MPnQDjSW2Do_sF_c6EJhX0Q@mail.gmail.com>
	<CADEGkF5tTA=tDVLTmh0nGcyN-W-ig5bL8kCVwTPk+qq8DkK-WA@mail.gmail.com>
Message-ID: <CAO51=Z4e-C8VaWm=j3OYrJsn4-_W-YQ0oDSzVHj+BJ0OmWhgHw@mail.gmail.com>

Hi Bow,

      Thank you for your reply. The patch by Lenna fixes the
interruption of the parse.

      PS: these gbk files were recently downloaded from
ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/ (with the extension
gbs.gz), and the file containing the "invalid GenBank annotation" is
ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/CHR_02/hs_ref_GRCh37.p5_chr2.gbs.gz

On Thu, Jul 19, 2012 at 4:50 PM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:
> Hi Ning,
>
> Thanks for reporting the error. A similar issue has been reported in
> the bug tracker here: https://redmine.open-bio.org/issues/3175 (it
> also looks like it's the same coordinate). It seems that this could be
> an invalid GenBank coordinate made by NCBI, though.
>
> From which chromosome is this coordinate coming from? Is it the latest draft?
>
> cheers,
> Bow
>
>
> On Thu, Jul 19, 2012 at 5:36 AM, ning luwen <bioinformaticsing at gmail.com> wrote:
>> Hi everyone,
>>
>> A error encountered when i parse a gbk file.
>>
>> the error message as follow:
>>
>> Traceback (most recent call last):
>>   File "stat_refseq_gbs.py", line 10, in <module>
>>     for seq in f:
>>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/SeqIO/__init__.py",
>> line 537, in parse
>>     for r in i:
>>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py",
>> line 445, in parse_records
>>     record = self.parse(handle, do_features)
>>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py",
>> line 428, in parse
>>     if self.feed(handle, consumer, do_features):
>>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py",
>> line 400, in feed
>>     self._feed_feature_table(consumer, self.parse_features(skip=False))
>>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/Scanner.py",
>> line 350, in _feed_feature_table
>>     consumer.location(location_string)
>>   File "/media/disk2/bio/bin/lib/python2.7/site-packages/Bio/GenBank/__init__.py",
>> line 970, in location
>>     int(e),
>> ValueError: invalid literal for int() with base 10: '68452073^68452074'
>>
>> the file parsed is ref_GRCh37.p5, the biopython version is 1.60, the
>> lines cause the error may be:
>>
>>      V_segment       complement(68451760..68452073^68452074)
>>      CDS             complement(<68451760..68452072^68452073)
>>
>> --
>> regards,
>> luwen ning


-- 
regards,
luwen ning


From p.j.a.cock at googlemail.com  Fri Jul 20 10:07:04 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 20 Jul 2012 11:07:04 +0100
Subject: [Biopython] slice a record in two and writing both records
In-Reply-To: <CAEfb3sfpsCg4MgTPxXEh-oyPFe0OjCA9yw5d45cGZd_R6_9bvQ@mail.gmail.com>
References: <CAEfb3sfpsCg4MgTPxXEh-oyPFe0OjCA9yw5d45cGZd_R6_9bvQ@mail.gmail.com>
Message-ID: <CAKVJ-_6mqoWuJUr1NSqVT37f-H6VgadVmjJxTHQ2LqfPxGpFNQ@mail.gmail.com>

On Thu, Jul 19, 2012 at 4:51 PM, Dilara Ally <dilara.ally at gmail.com> wrote:
> If I have a function (modify_record) that slices up a SeqRecord into sub
> records and then returns the sliced record if it has a certain length (for
> e.g. the sliced record needs to be greater than 40bp), sometimes the
> original record when sliced will have two different records both greater
> than 40bp.  I want to keep both sliced reads and rewrite them as separate
> records into a single fastq file.  Here is my code:
>
> def modify_record(frec, win, len_threshold):
>     quality_scores = array(frec.letter_annotations["phred_quality"])
>     all_window_qc = slidingWindow(quality_scores, win,1)
>     track_qc = windowQ(all_window_qc)
>     myzeros = boolean_array(track_qc, q_threshold,win)
>     Nrec = slice_points(myzeros,win)[0][1]-1
>     where_to_slice = slice_points(myzeros,win)[1]
>     where_to_slice.append(len(frec)+win)
>     sub_record = slice_record(frec,Nrec,where_to_slice,win,len_threshold)
>     return sub_record
> ...

The key point is that for each input record you may want to
produce several output records. A single function turning
one input SeqRecord into one output SeqRecord won't work.
I would suggest either,

1. Modify your function to return a list of SeqRecord objects,
which could be zero, one (as now), or several - depending on
the slice points. Then use itertools.chain to combine them,
something like this:

from itertools import chain
# chain.from_iterable flattens the per-read lists into one stream of records
good_reads = chain.from_iterable(modify_record(r) for r in SeqIO.parse(...))
count = SeqIO.write(good_reads, "filtered.fastq", "fastq")
print "Saved %i read fragments" % count

2. Use a generator function to process the SeqRecord objects,

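def select_fragments(records, win, len_threshold):
    for record in records:
        where_to_slice = ...  # e.g. a list of slice objects for this read
        for slice_point in where_to_slice:
            # slicing a SeqRecord keeps the matching quality scores
            yield record[slice_point]

good_reads = select_fragments(SeqIO.parse(...), win, len_threshold)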
count = SeqIO.write(good_reads, "filtered.fastq", "fastq")
print "Saved %i read fragments" % count

Both these approaches are generator/iteration based and will
be memory efficient.

Note you may also want to alter the record identifiers so that
different fragments from a single read get different IDs.
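
Both options can be fleshed out roughly as follows. This is only a sketch under stated assumptions: a simple per-base quality cut-off stands in for the sliding-window logic of the original script (which was not posted in full), and the names `good_runs`, `split_record` and the `_1`, `_2` ID suffixes are invented for illustration. The record objects only need to support slicing plus the `id` and `letter_annotations` attributes, as SeqRecord does.

```python
from itertools import groupby


def good_runs(qualities, q_threshold, len_threshold):
    """Yield (start, end) for each run of scores >= q_threshold that is
    at least len_threshold long. A per-base cut-off is an assumption that
    stands in for the sliding-window test in the original script."""
    pos = 0
    for is_good, run in groupby(qualities, key=lambda q: q >= q_threshold):
        length = len(list(run))
        if is_good and length >= len_threshold:
            yield pos, pos + length
        pos += length


def split_record(record, q_threshold, len_threshold):
    """Option 1: return a list of zero, one or several fragments."""
    quals = record.letter_annotations["phred_quality"]
    fragments = []
    runs = good_runs(quals, q_threshold, len_threshold)
    for n, (start, end) in enumerate(runs, 1):
        fragment = record[start:end]  # slicing keeps the matching qualities
        fragment.id = "%s_%i" % (record.id, n)  # distinct ID per fragment
        fragments.append(fragment)
    return fragments


def select_fragments(records, q_threshold, len_threshold):
    """Option 2: a generator over all fragments of all records."""
    for record in records:
        for fragment in split_record(record, q_threshold, len_threshold):
            yield fragment
```

With Biopython, the result of `select_fragments(SeqIO.parse("temp.fastq", "fastq"), 20, 30)` could be passed directly to `SeqIO.write(..., "filtered.fastq", "fastq")`. For option 1, `itertools.chain.from_iterable` over the lists returned by `split_record` gives the same stream.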

Peter


From p.j.a.cock at googlemail.com  Fri Jul 20 10:29:33 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 20 Jul 2012 11:29:33 +0100
Subject: [Biopython] Error while parsing bgk file
In-Reply-To: <CAO51=Z4e-C8VaWm=j3OYrJsn4-_W-YQ0oDSzVHj+BJ0OmWhgHw@mail.gmail.com>
References: <CAO51=Z4SG=0whhreVzHVrDAn2e1MPnQDjSW2Do_sF_c6EJhX0Q@mail.gmail.com>
	<CADEGkF5tTA=tDVLTmh0nGcyN-W-ig5bL8kCVwTPk+qq8DkK-WA@mail.gmail.com>
	<CAO51=Z4e-C8VaWm=j3OYrJsn4-_W-YQ0oDSzVHj+BJ0OmWhgHw@mail.gmail.com>
Message-ID: <CAKVJ-_7S3EMn-xPHLAqWM4zJzuVj=9gQ2V2s88u8cPQw2=XA6w@mail.gmail.com>

On Fri, Jul 20, 2012 at 4:56 AM, ning luwen <bioinformaticsing at gmail.com> wrote:
> Hi Bow,
>
>       Thank you for your reply,  and a patch by lenna can solve the
> interruption of the parse.
>
>       ps: these gbk file was recently downloaded from
> ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/ (with extension of
> gbs.gz), and the file contained "invalid GenBank annotation" is
> ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/CHR_02/hs_ref_GRCh37.p5_chr2.gbs.gz

Note the original bug report referred to a slightly different part/revision
of this chromosome, but it is the same issue reported earlier:
https://redmine.open-bio.org/issues/3175
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_02/hs_ref_GRCh37.p2_chr2.gbk.gz

I have now committed Lenna's fix, which means this file now parses
with a warning about the problem features (which get None as their
location):

https://github.com/biopython/biopython/commit/bc733da09051ca53ad4515ac2d971ff0839a71ba
https://github.com/biopython/biopython/commit/4bf78f72682f0500e93c410f8108891dade88ff8
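
With that change, code that walks the features has to allow for a missing location. A minimal sketch, assuming only that each feature object has a `location` attribute which may be None as described above (the helper name `split_by_location` is made up for illustration):

```python
def split_by_location(features):
    """Separate features that have a location from those where it is None."""
    features = list(features)  # allow any iterable, including a generator
    located = [f for f in features if f.location is not None]
    unlocated = [f for f in features if f.location is None]
    return located, unlocated
```

Called on `record.features`, this would let a script carry on with the usable features and report the problem ones separately.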

Ning, if you would like to test this fix the simplest way is to get the
latest source code from github, and reinstall Biopython. You can
either use the git tool at the command line, or the github URL for
a tarball: https://github.com/biopython/biopython/tarball/master

(Please ask if you need more guidance with this)

Regards,

Peter


From igorrcosta at hotmail.com  Sat Jul 21 21:44:40 2012
From: igorrcosta at hotmail.com (Igor Rodrigues da Costa)
Date: Sat, 21 Jul 2012 21:44:40 +0000
Subject: [Biopython] Back translation support in Biopython
Message-ID: <SNT122-W17331BE590DC0C5B46EC21C4DF0@phx.gbl>


Hi Peter,
I would eliminate the problem of ID mapping (or at least pass it to the user) by using only the function that uses one sequence pair. The other option is to check if the codon and the amino acid are equivalent at run time, using a given genetic code. I did this in my program that back translated using only the aligned protein sequence and the Uniprot/GI accession numbers (I did the search using Bio.Entrez), but in my case the nucleotide dictionary was only some different ways the nucleotide sequence could be imported from NCBI, each of them returning a different sequence.
I can't see any need for different gap characters between both alignments, and I feel there can be both a Bio.SeqIO (using a pair of sequences only) and a Bio.AlignIO (using multiple sequences, probably slower if checking at run time) versions of this function. 
Att,
Igor

> Date: Mon, 2 Jul 2012 12:27:08 +0100
> Subject: Re: [Biopython] Back translation support in Biopython
> From: p.j.a.cock at googlemail.com
> To: igorrcosta at hotmail.com; eric.talevich at gmail.com
> CC: biopython at lists.open-bio.org
> 
> On Wed, Apr 4, 2012 at 4:02 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> > On Wed, Apr 4, 2012 at 2:49 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
> >> Hi Igor,
> >>
> >> It sounds like you're referring to aligning amino acid sequences to codon
> >> sequences, as PAL2NAL does. This is different from what most people mean by
> >> back translation, but as you point out, certainly useful.
> >>
> >> If you write a function that can match a protein sequence alignment to a set
> >> of raw CDS sequences, returning a nucleotide alignment based on the
> >> codon-to-amino-acid mapping, that would be useful. However, PAL2NAL does
> >> exactly that, plus a bit more, and is a fairly well-known and easily
> >> obtained program. Personally, I would prefer to write a wrapper for PAL2NAL
> >> under Bio.Align.Applications, using the existing Bio.Applications framework.
> >
> > As per the old thread, a simple function in Python taking the gapped protein
> > sequence, original nucleotide coding sequence, and the translation table
> > does sound useful. Then using that, you could go from a protein alignment
> > plus the original nucleotide coding sequences to a codon alignment, or
> > other tasks. Given this is all relatively straightforward string manipulation
> > and we already have the required genetic code tables in Biopython, I'm not
> > convinced that wrapping PAL2NAL would be the best solution (for this sub
> > task).
> 
> Hi Igor,
> 
> Did you do any work on back-translation (alignment threading) in Biopython?
> 
> We needed to do this locally, and for some reason (yet to be determined)
> T-COFFEE wasn't working on our dataset, so I made a start at a Biopython
> implementation:
> 
> https://github.com/peterjc/biopython/tree/back_trans
> https://github.com/peterjc/biopython/commit/7d14cdb59bb9d41c727c923c8aa7e3dda7779c80
> 
> Currently just one commit adding a Bio.Align.alignment_back_translate(...)
> function which takes a protein alignment and dictionary of nucleotide
> records - easy to get with Bio.SeqIO and Bio.AlignIO - with a stand alone
> example included in the doctest. There is also a new (currently private)
> function to do this for one sequence pair - perhaps useful on its own?
> 
> There are potential complications with ID mapping between the proteins
> and nucleotides, thus the option of a key function, and the gap characters
> (would you ever want to use different gap characters in the protein and
> nucleotide alignments?). We could discuss implementation details over
> on the biopython-dev list, but the general API discussion might as well
> be here. e.g. Where to put the function and what to call it.
> 
> Regards,
> 
> Peter
 		 	   		  

From p.j.a.cock at googlemail.com  Sun Jul 22 12:51:12 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 22 Jul 2012 13:51:12 +0100
Subject: [Biopython] Back translation support in Biopython
In-Reply-To: <SNT122-W17331BE590DC0C5B46EC21C4DF0@phx.gbl>
References: <SNT122-W17331BE590DC0C5B46EC21C4DF0@phx.gbl>
Message-ID: <CAKVJ-_5vLKDBj90-KFCb_ruDudT=-BZB9gaV==V1uAUhSJ=TZA@mail.gmail.com>

On Sat, Jul 21, 2012 at 10:44 PM, Igor Rodrigues da Costa
<igorrcosta at hotmail.com> wrote:
>
>
> Hi Peter,
> I would eliminate the problem of ID mapping (or at least
> pass it to the user) by using only the function that uses
> one sequence pair.

Making the function for doing one sequence pair part
of the public API seems sensible then.

> The other option is to check if the codon and the amino
> acid are equivalent at run time, using a given genetic
> code. I did this in my program that back translated
> using only the aligned protein sequence and the
> Uniprot/GI accession numbers (I did the search using
> Bio.Entrez), but in my case the nucleotide dictionary
> was only some different ways the nucleotide sequence
> could be imported from NCBI, each of them returning
> a different sequence.

Certainly optionally checking the translation seems wise.
There are potential complications with things like
ambiguous bases, but in general this is useful.
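
As a rough illustration of what a single-pair function with an optional translation check could look like, here is a sketch in plain Python. It is not the code from the back_trans branch. The name `back_translate_gapped`, its arguments, and the strict length check (no trailing stop codon allowed) are all assumptions made for the example, and the `translate` callback stands in for a lookup in a real genetic code table.

```python
def back_translate_gapped(gapped_protein, cds, gap="-", translate=None):
    """Thread an ungapped coding sequence onto a gapped protein sequence.

    Each residue is replaced by its codon and each gap by three gap
    characters. If translate is given, it is called with every codon and
    must return the matching amino acid, otherwise ValueError is raised.
    """
    n_residues = sum(1 for aa in gapped_protein if aa != gap)
    if len(cds) != 3 * n_residues:
        raise ValueError("CDS length does not match the protein length")
    pieces = []
    pos = 0
    for aa in gapped_protein:
        if aa == gap:
            pieces.append(gap * 3)
            continue
        codon = cds[pos:pos + 3]
        if translate is not None and translate(codon) != aa:
            raise ValueError("codon %s does not encode %s" % (codon, aa))
        pieces.append(codon)
        pos += 3
    return "".join(pieces)
```

An alignment-level function could then call this once per row, which keeps the ID mapping question outside the core routine.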

> I can't see any need for different gap characters
> between both alignments, and I feel there can be both
> a Bio.SeqIO (using a pair of sequences only) and a
> Bio.AlignIO (using multiple sequences, probably slower
> if checking at run time) versions of this function.

I agree that an alignment based function, and a
single sequence based function make sense - but
probably under Bio.Align rather than Bio.SeqIO and
Bio.AlignIO which are specifically for input/output
functionality.

Thanks for your thoughts,

Peter


From dilara.ally at gmail.com  Mon Jul 23 21:48:30 2012
From: dilara.ally at gmail.com (Dilara Ally)
Date: Mon, 23 Jul 2012 14:48:30 -0700
Subject: [Biopython] slice a record in two and writing both records
In-Reply-To: <CAKVJ-_6mqoWuJUr1NSqVT37f-H6VgadVmjJxTHQ2LqfPxGpFNQ@mail.gmail.com>
References: <CAEfb3sfpsCg4MgTPxXEh-oyPFe0OjCA9yw5d45cGZd_R6_9bvQ@mail.gmail.com>
	<CAKVJ-_6mqoWuJUr1NSqVT37f-H6VgadVmjJxTHQ2LqfPxGpFNQ@mail.gmail.com>
Message-ID: <9085DA29-9159-44EE-BED7-56E3306B8EA3@gmail.com>

Thanks.  Itertools is a fantastic module!
Dilara

On Jul 20, 2012, at 3:07 AM, Peter Cock wrote:

> On Thu, Jul 19, 2012 at 4:51 PM, Dilara Ally <dilara.ally at gmail.com> wrote:
>> If I have a function (modify_record) that slices up a SeqRecord into sub
>> records and then returns the sliced record if it has a certain length (for
>> e.g. the sliced record needs to be greater than 40bp), sometimes the
>> original record when sliced will have two different records both greater
>> than 40bp.  I want to keep both sliced reads and rewrite them as separate
>> records into a single fastq file.  Here is my code:
>> 
>> def modify_record(frec, win, len_threshold):
>>    quality_scores = array(frec.letter_annotations["phred_quality"])
>>    all_window_qc = slidingWindow(quality_scores, win,1)
>>    track_qc = windowQ(all_window_qc)
>>    myzeros = boolean_array(track_qc, q_threshold,win)
>>    Nrec = slice_points(myzeros,win)[0][1]-1
>>    where_to_slice = slice_points(myzeros,win)[1]
>>    where_to_slice.append(len(frec)+win)
>>    sub_record = slice_record(frec,Nrec,where_to_slice,win,len_threshold)
>>    return sub_record
>> ...
> 
> The key point is that for each input record you may want to
> produce several output records. A single function turning
> one input SeqRecord into one output SeqRecord won't work.
> I would suggest either,
> 
> 1. Modify your function to return a list of SeqRecord objects,
> which could be zero, one (as now), or several - depending on
> the slice points. Then use itertools.chain to combine them,
> something like this:
> 
> from itertools import chain
> # chain.from_iterable flattens the per-read lists into one stream of records
> good_reads = chain.from_iterable(modify_record(r) for r in SeqIO.parse(...))
> count = SeqIO.write(good_reads, "filtered.fastq", "fastq")
> print "Saved %i read fragments" % count
> 
> 2. Use a generator function to process the SeqRecord objects,
> 
> def select_fragments(records, win, len_threshold):
>    for record in records:
>         where_to_slice = ...
>         for slice_point in where_to_slice:
>             yield record[slice_point]
> 
> good_reads = select_fragments(SeqIO.parse(...))
> count = SeqIO.write(good_reads, "filtered.fastq", "fastq")
> print "Saved %i read fragments" % count
> 
> Both these approaches are generator/iteration based and will
> be memory efficient.
> 
> Note you may also want to alter the record identifiers so that
> different fragments from a single read get different IDs.
> 
> Peter


From llewelr at gmail.com  Tue Jul 24 02:24:06 2012
From: llewelr at gmail.com (Richard Llewellyn)
Date: Mon, 23 Jul 2012 20:24:06 -0600
Subject: [Biopython] Entrez.read(handle) for Bio._py3k.EvilHandleHack throws
 error with python 3.2
Message-ID: <CAJQgwq68pH4TAcCsY99o70fdWc8ziMOHxR6zkBVpUFeTUnUp1A@mail.gmail.com>

With python 3.2 and biopython 1.60 after getting a handle using
Entrez.esummary (and esearch, others?) I get a TypeError:


>>> from Bio import Entrez
>>> Entrez.email = "Your.Name.Here at example.org"
>>> handle = Entrez.esummary(db="journals", id="30367")
>>> record = Entrez.read(handle)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.2/site-packages/Bio/Entrez/__init__.py",
line 351, in read
    record = handler.read(handle)
  File "/usr/local/lib/python3.2/site-packages/Bio/Entrez/Parser.py",
line 169, in read
    self.parser.ParseFile(handle)
TypeError: read() did not return a bytes object (type=str)

>>> handle
<Bio._py3k.EvilHandleHack object at 0xb737a34c>

Ah, it is evil!

I realize py3k is not yet officially supported.

Thanks for the great work.


From llewelr at gmail.com  Tue Jul 24 03:20:00 2012
From: llewelr at gmail.com (Richard Llewellyn)
Date: Mon, 23 Jul 2012 21:20:00 -0600
Subject: [Biopython] Entrez.read(handle) for Bio._py3k.EvilHandleHack
 throws error with python 3.2
In-Reply-To: <CAJQgwq68pH4TAcCsY99o70fdWc8ziMOHxR6zkBVpUFeTUnUp1A@mail.gmail.com>
References: <CAJQgwq68pH4TAcCsY99o70fdWc8ziMOHxR6zkBVpUFeTUnUp1A@mail.gmail.com>
Message-ID: <CAJQgwq7jgK-onmMz=hfHkTjUTtqB-EekHxhgh9PF-KGW1QA=Cw@mail.gmail.com>

Follow up for Entrez.read error on EvilHandleHack object:

(this is python 3.2.3)

If I change the last line of the _open function in Bio/Entrez/__init__.py from

return _binary_to_string_handle(handle)
to
return handle

this error does not occur in the example given below.


On Mon, Jul 23, 2012 at 8:24 PM, Richard Llewellyn <llewelr at gmail.com> wrote:
> With python 3.2 and biopython 1.60 after getting a handle using
> Entrez.esummary (and esearch, others?) I get a TypeError:
>
>
>>>> from Bio import Entrez
>>>> Entrez.email = "Your.Name.Here at example.org"
>>>> handle = Entrez.esummary(db="journals", id="30367")
>>>> record = Entrez.read(handle)
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/local/lib/python3.2/site-packages/Bio/Entrez/__init__.py",
> line 351, in read
>     record = handler.read(handle)
>   File "/usr/local/lib/python3.2/site-packages/Bio/Entrez/Parser.py",
> line 169, in read
>     self.parser.ParseFile(handle)
> TypeError: read() did not return a bytes object (type=str)
>
>>>> handle
> <Bio._py3k.EvilHandleHack object at 0xb737a34c>
>
> Ah, it is evil!
>
> I realize py3k not yet officially supported.
>
> Thanks for the great work.


From markd at soe.ucsc.edu  Tue Jul 24 06:47:51 2012
From: markd at soe.ucsc.edu (Mark Diekhans)
Date: Mon, 23 Jul 2012 23:47:51 -0700
Subject: [Biopython] accessing PDB IDcode when using PDBParser
Message-ID: <20494.17687.678370.458937@hgwdev.cse.ucsc.edu>


How does one access the idCode in the PDB HEADER when using the PDBParser?
I can't find this in the documentation or the code.

Also, what is the function of the `id' argument for PDBParser.get_structure?
The documentation is just self-referential:
       o id - string, the id that will be used for the structure

There seems to be no obvious way via MMCIFParser either.

Thanks!


From anaryin at gmail.com  Tue Jul 24 08:37:44 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 24 Jul 2012 10:37:44 +0200
Subject: [Biopython] accessing PDB IDcode when using PDBParser
In-Reply-To: <20494.17687.678370.458937@hgwdev.cse.ucsc.edu>
References: <20494.17687.678370.458937@hgwdev.cse.ucsc.edu>
Message-ID: <CAJ9sUYOuEXc5DX-AkFfZGRV5xukaCe+VE4c-xgaOySDQpfFCpQ@mail.gmail.com>

Hey Mark,

Indeed there is no specific ID extraction from the HEADER. However, it
comes as part of the "head" key in the header dictionary. If you split by
whitespace and get the last field, you get the PDB ID.

Example:

HEADER    HYDROLASE(ASPARTYL PROTEINASE)          17-OCT-89   2RSP
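
If the raw HEADER line is to hand, the ID can also be pulled out directly. The sketch below is plain Python and does not use Bio.PDB. It relies on the PDB file format placing idCode in columns 63-66 of the HEADER record, and falls back to the last whitespace-separated field if the line has lost its fixed-width layout. The function name `pdb_id_from_header` is made up for illustration.

```python
def pdb_id_from_header(line):
    """Extract the four-character ID code from a PDB HEADER record."""
    if not line.startswith("HEADER"):
        raise ValueError("not a HEADER record")
    # Columns 63-66 (1-based) hold idCode in the fixed-width PDB format.
    id_code = line[62:66].strip()
    if not id_code:
        # Fallback for lines whose spacing was collapsed: take the last field.
        id_code = line.split()[-1]
    return id_code
```

The first line of the PDB file can be read with a normal `open(...)` call and passed to this function before or after handing the file to PDBParser.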


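The id argument of get_structure is simply the name you choose for the
resulting structure: whatever you pass as the first argument is stored as
the structure's id. It is not read from the file.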


Cheers,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao


2012/7/24 Mark Diekhans <markd at soe.ucsc.edu>

>
> How does one access the idCode in the PDB HEADER when using the PDBParser?
> I can't find this in the documentation or the code.
>
> Also, what is function of the `id' argument for PDBParser.get_structure:
> The documentation is just self-referential:
>        o id - string, the id that will be used for the structure
>
> Seems no obvious way via MMCIFParser either.
>
> Thanks!
>
>
>
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>


From p.j.a.cock at googlemail.com  Tue Jul 24 09:41:35 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 24 Jul 2012 10:41:35 +0100
Subject: [Biopython] Entrez.read(handle) for Bio._py3k.EvilHandleHack
 throws error with python 3.2
In-Reply-To: <CAJQgwq7jgK-onmMz=hfHkTjUTtqB-EekHxhgh9PF-KGW1QA=Cw@mail.gmail.com>
References: <CAJQgwq68pH4TAcCsY99o70fdWc8ziMOHxR6zkBVpUFeTUnUp1A@mail.gmail.com>
	<CAJQgwq7jgK-onmMz=hfHkTjUTtqB-EekHxhgh9PF-KGW1QA=Cw@mail.gmail.com>
Message-ID: <CAKVJ-_4dg5=-PWLGr7OrT0riV0dm8cKvamcLCm1odDWn3=4QuA@mail.gmail.com>

Hi Richard,

It's great to have some feedback on Python 3 support :)

On Tue, Jul 24, 2012 at 4:20 AM, Richard Llewellyn <llewelr at gmail.com> wrote:
> Follow up for Entrez.read error on EvilHandleHack object:
>
> (this is python 3.2.3)
>
> If I change last line of Entrez.__init__.py _open function from
>
> return _binary_to_string_handle(handle)
> to
> return handle
>
> this error does not occur in example given below.

Hmm. That call to _binary_to_string_handle converts from the
bytes (binary) network handle to a string (unicode) handle which
is required for most of the parsers in Biopython under Python 3
(e.g. FASTA, Genbank).

Surprisingly the Entrez parser seems to be wanting a binary
handle? That seems curious... I presume that means we
don't have this particular case covered in the unit tests :(
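
To illustrate the distinction outside Biopython, here is a small sketch using
only the standard library. It assumes the Entrez parser sits on top of
xml.parsers.expat (the ParseFile call in the traceback suggests as much).
Under Python 3, expat's ParseFile expects read() to return bytes, so a text
handle should fail in the same way:

```python
import io
import xml.parsers.expat

xml_bytes = b"<?xml version='1.0'?><root><item>1</item></root>"

# Binary handle: read() returns bytes, which expat accepts.
parser = xml.parsers.expat.ParserCreate()
parser.ParseFile(io.BytesIO(xml_bytes))

# Text handle: read() returns str, which expat rejects under Python 3.
parser = xml.parsers.expat.ParserCreate()
try:
    parser.ParseFile(io.StringIO(xml_bytes.decode("utf-8")))
    text_handle_error = None
except TypeError as err:
    text_handle_error = str(err)

print(text_handle_error)
```

So the string handle is right for the text-based parsers, but an XML parser
built on expat would need the original bytes handle instead.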

How familiar are you with the Python 3 split of bytes vs strings
(unicode), and binary versus text handles?

Peter


From bjorn_johansson at bio.uminho.pt  Fri Jul 27 08:03:41 2012
From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=)
Date: Fri, 27 Jul 2012 09:03:41 +0100
Subject: [Biopython] Restriction cutting SeqRecord objects
Message-ID: <CAG_4V=aWTTrCyF7ZA-CsZ0xw157YoyBO-Vyh2REtzk5LPd34Hw@mail.gmail.com>

Hi,
Does restriction with Bio.Restriction only work for Seq or MutableSeq objects?
I would like to digest SeqRecord objects and still keep the relevant
features of the sequences.

Did anyone perhaps implement something like this?
One way would be to subclass the restriction enzymes,
but they are created dynamically so I am not sure if this is a good idea.


btw is the biopython site down?

thanks,
bjorn

-- 
______O_________oO________oO______o_______oO__
Björn Johansson
Assistant Professor
Departament of Biology
University of Minho
Campus de Gualtar
4710-057 Braga
PORTUGAL
www.bio.uminho.pt
Google profile
metabolicengineeringgroup
Work (direct) +351-253 601517 | mob.  +351-967 147 704 | mob. (SWE) 0739 792 968
Dept of Biology (secr) +351-253 60 4310  | fax +351-253 678980


From dilara.ally at gmail.com  Thu Jul 26 17:48:44 2012
From: dilara.ally at gmail.com (Dilara Ally)
Date: Thu, 26 Jul 2012 10:48:44 -0700
Subject: [Biopython] matching headers and then writing the seq record
Message-ID: <DDBE81B2-6FEB-4ABD-BC40-1E1E313BB7D5@gmail.com>

Hi Everyone, 

I'm interested in finding headers that match (in other words, paired reads)
in two different FASTQ files.  Once the common headers are found, I go back
to the original FASTQ files and write the matched reads to new FASTQ files.
Right now, the part of the code that runs really slowly is the headers_read1
and headers_read2 lines, and I was wondering if there is a more elegant and
time-efficient way than what I have done.  It seems as if set undoes the
elegance of using a generator.  Any advice is greatly appreciated!  Here is
the code:

def get_header(seq_record):
    fields = seq_record.id.split(':')
    lastfield = fields[6].split('_')[0]
    return lastfield

def get_full_header(seq_record):
    fields = seq_record.id.split(':')
    headerInfo2 = fields[6].split('_')[0]
    # Rejoin the first six fields plus the trimmed seventh field.
    headerInfo = ":".join(fields[:6] + [headerInfo2])
    return headerInfo

def replace_header(seq_record,pairType):
    if pairType == 1:
        ending = "/1"
    elif pairType == 2:
        ending = "/2"
    seq_record.id=seq_record.id+ending
    seq_record.name = ""
    seq_record.description = ""
    return seq_record

def matched_records(records, pairType, header_matches):
    for rec in records:
        id = get_header(rec)
        result = id in header_matches
        #print result
        if (result == True):
            newrec = replace_header(rec,pairType)
            yield newrec

import sys
from Bio import SeqIO

headers_read1 = set(get_header(seq_record) for seq_record in SeqIO.parse(sys.argv[1], "fastq"))
headers_read2 = set(get_header(seq_record) for seq_record in SeqIO.parse(sys.argv[2], "fastq"))
header_matches = [x for x in headers_read1 if x in headers_read2]

records = SeqIO.parse(sys.argv[1], "fastq") 
pairType = 1
count = SeqIO.write(matched_records(records,pairType,header_matches), sys.argv[3], "fastq")
print("Saved %i matched reads." % count)

records = SeqIO.parse(sys.argv[2], "fastq") 
pairType = 2
count = SeqIO.write(matched_records(records,pairType,header_matches), sys.argv[4], "fastq")
print("Saved %i matched reads." % count)


From cartealy at yahoo.co.id  Fri Jul 27 06:30:58 2012
From: cartealy at yahoo.co.id (Imam Cartealy)
Date: Fri, 27 Jul 2012 14:30:58 +0800 (SGT)
Subject: [Biopython] Is biopython.org down ?
Message-ID: <1343370658.77420.YahooMailNeo@web190503.mail.sg3.yahoo.com>

Hi everyone,

I am having trouble accessing biopython.org for the last 2 days. Is biopython.org down ?

Cheers

ic

Imam Cartealy
Center for Biotechnology - BPPT
Indonesia


From idoerg at gmail.com  Sat Jul 28 20:19:01 2012
From: idoerg at gmail.com (Iddo Friedberg)
Date: Sat, 28 Jul 2012 16:19:01 -0400
Subject: [Biopython] Is biopython.org down ?
In-Reply-To: <1343370658.77420.YahooMailNeo@web190503.mail.sg3.yahoo.com>
References: <1343370658.77420.YahooMailNeo@web190503.mail.sg3.yahoo.com>
Message-ID: <CABm4-MRRY=ykF+ZoYF+KHZ3rDXO=gA_3-+B+FC4gb8CR6Yk7oQ@mail.gmail.com>

Has been down for the past couple of days, but it is up now.

On Fri, Jul 27, 2012 at 2:30 AM, Imam Cartealy <cartealy at yahoo.co.id> wrote:

> Hi everyone,
>
> I am having trouble accessing biopython.org for the last 2 days. Is
> biopython.org down ?
>
> Cheers
>
> ic
>
>
> Imam Cartealy
> Center for Biotechnology - BPPT
> Indonesia


-- 
Iddo Friedberg
http://iddo-friedberg.net/contact.html
++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.>
++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----.
.>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>>
>>----.<--.>++++++.<<<<------------------------------------.


From p.j.a.cock at googlemail.com  Sat Jul 28 20:48:32 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 28 Jul 2012 21:48:32 +0100
Subject: [Biopython] matching headers and then writing the seq record
In-Reply-To: <DDBE81B2-6FEB-4ABD-BC40-1E1E313BB7D5@gmail.com>
References: <DDBE81B2-6FEB-4ABD-BC40-1E1E313BB7D5@gmail.com>
Message-ID: <CAKVJ-_6vQisoFTG5h6XHjgqO5RK3usDf1rtri-A6PdcxAVq9YA@mail.gmail.com>

On Thu, Jul 26, 2012 at 6:48 PM, Dilara Ally <dilara.ally at gmail.com> wrote:
> ... It seems as if set undoes the elegance of using a generator.
> Any advice is greatly appreciated! ...
>
> headers_read1 = set(...)
> headers_read2 = set(...)
> header_matches = [x for x in headers_read1 if x in headers_read2]

I would expect that using the built in set's intersection operation would
be faster than this list comprehension solution to create header_matches.

Also, you should use a set not a list for header_matches because testing
membership with a set is much faster than a list. i.e. Try:

header_matches = headers_read1.intersection(headers_read2)

This might be a tiny change, but I expect it to be noticeably faster.
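
A small stand-alone sketch of the difference, using made-up header strings
rather than real FASTQ data (the values and the name matches_list are just
for illustration):

```python
headers_read1 = {"1101:1", "1101:2", "1101:3", "1101:4"}
headers_read2 = {"1101:2", "1101:4", "1101:5"}

# List comprehension: the result is a list, so each later
# "x in matches_list" test scans the list element by element.
matches_list = [x for x in headers_read1 if x in headers_read2]

# Set intersection: the result is a set, so each later
# "x in header_matches" test is a hash lookup.
header_matches = headers_read1.intersection(headers_read2)

print(sorted(header_matches))
```

Both give the same headers; the gain comes from the membership tests done
once per record later on.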

Also, here:

> def matched_records(records, pairType, header_matches):
>    for rec in records:
>        id = get_header(rec)
>        result = id in header_matches
>        if (result == True):
>            newrec = replace_header(rec,pairType)
>            yield newrec

If you don't mind my style comments, you don't really need
to create the variables 'id' and  'result', and 'newrec' - I would
just do:

def matched_records(records, pairType, header_matches):
    for rec in records:
        if get_header(rec) in header_matches:
            yield replace_header(rec,pairType)

And at that point you could write the whole thing as a
generator expression, which you may or may not find
more pleasing (I'm not sure if it makes any significant
difference to the speed). i.e.

records = SeqIO.parse(sys.argv[1], "fastq")
pairType = 1
wanted = (replace_header(rec, pairType)
          for rec in records
          if get_header(rec) in header_matches)
count = SeqIO.write(wanted, sys.argv[3], "fastq")

I hope that helps,

Peter


From aclark at aclark.net  Sat Jul 28 23:45:10 2012
From: aclark at aclark.net (Alex Clark)
Date: Sat, 28 Jul 2012 19:45:10 -0400
Subject: [Biopython] ANN: pythonpackages.com beta
Message-ID: <jv1ti6$bck$1@dough.gmane.org>

Hi biological computation folks,


I am reaching out to various Python-related programming communities in 
order to offer new help packaging your software.

If you have ever struggled with packaging and releasing Python software 
(e.g. to PyPI), please check out this service:


- http://pythonpackages.com


The basic idea is to automate packaging by checking out code, testing, 
and uploading (e.g. to PyPI) all through the web, as explained in this 
introduction:


- http://docs.pythonpackages.com/en/latest/introduction.html


Also, I will be available to answer your Python packaging questions most 
days/nights in #pythonpackages on irc.freenode.net. Hope to meet/talk 
with all of you soon.


Alex


-- 
Alex Clark · http://pythonpackages.com/ONE_CLICK


From dilara.ally at gmail.com  Tue Jul 31 18:53:27 2012
From: dilara.ally at gmail.com (Dilara Ally)
Date: Tue, 31 Jul 2012 11:53:27 -0700
Subject: [Biopython] matching headers and then writing the seq record
In-Reply-To: <CAKVJ-_6vQisoFTG5h6XHjgqO5RK3usDf1rtri-A6PdcxAVq9YA@mail.gmail.com>
References: <DDBE81B2-6FEB-4ABD-BC40-1E1E313BB7D5@gmail.com>
	<CAKVJ-_6vQisoFTG5h6XHjgqO5RK3usDf1rtri-A6PdcxAVq9YA@mail.gmail.com>
Message-ID: <BAB3D952-032A-4EC1-B381-22268D3E95D9@gmail.com>

Thanks Peter, that sped it up considerably!  I appreciate the fast replies on this listserv.


On Jul 28, 2012, at 1:48 PM, Peter Cock wrote:

> On Thu, Jul 26, 2012 at 6:48 PM, Dilara Ally <dilara.ally at gmail.com> wrote:
>> ... It seems as if set undoes the elegance of using a generator.
>> Any advice is greatly appreciated! ...
>> 
>> headers_read1 = set(...)
>> headers_read2 = set(...)
>> header_matches = [x for x in headers_read1 if x in headers_read2]
> 
> I would expect that using the built in set's intersection operation would
> be faster than this list comprehension solution to create header_matches.
> 
> Also, you should use a set not a list for header_matches because testing
> membership with a set is much faster than a list. i.e. Try:
> 
> header_matches = headers_read1.intersection(headers_read2)
> 
> This might be a tiny change, but I expect it to be noticeably faster.
> 
> Also, here:
> 
>> def matched_records(records, pairType, header_matches):
>>   for rec in records:
>>       id = get_header(rec)
>>       result = id in header_matches
>>       if (result == True):
>>           newrec = replace_header(rec,pairType)
>>           yield newrec
> 
> If you don't mind my style comments, you don't really need
> to create the variables 'id' and  'result', and 'newrec' - I would
> just do:
> 
> def matched_records(records, pairType, header_matches):
>    for rec in records:
>        if get_header(rec) in header_matches:
>            yield replace_header(rec,pairType)
> 
> And at that point you could write the whole thing as a
> generator expression, which you may or may not find
> more pleasing (I'm not sure if it makes any significant
> difference to the speed). i.e.
> 
> records = SeqIO.parse(sys.argv[1], "fastq")
> pairType = 1
> wanted = (replace_header(rec,pairType) \
>                 for rec in records \
>                 if get_header(rec) in header_matches)
> count = SeqIO.write(wanted, sys.argv[3], "fastq")
> 
> I hope that helps,
> 
> Peter


From devaniranjan at gmail.com  Tue Jul 31 19:24:34 2012
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Tue, 31 Jul 2012 15:24:34 -0400
Subject: [Biopython] Mocapy
Message-ID: <CAFU65PcikwfcKhnXGPo6r1UA6-5-QuvcTF7zbKFbK5Sdr1qi+g@mail.gmail.com>

I was wondering if Mocapy is part of Biopython.

I thought it was but I cannot find it in my biopython PDB folder.

Thank you,
George


From eric.talevich at gmail.com  Tue Jul 31 21:55:21 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 31 Jul 2012 17:55:21 -0400
Subject: [Biopython] Mocapy
In-Reply-To: <CAFU65PcikwfcKhnXGPo6r1UA6-5-QuvcTF7zbKFbK5Sdr1qi+g@mail.gmail.com>
References: <CAFU65PcikwfcKhnXGPo6r1UA6-5-QuvcTF7zbKFbK5Sdr1qi+g@mail.gmail.com>
Message-ID: <CAMC681mb5z0e2=y6P54FUOiXb5xOo=au_qhMaANwW3-r4g+PHA@mail.gmail.com>

On Tue, Jul 31, 2012 at 3:24 PM, George Devaniranjan <devaniranjan at gmail.com> wrote:

> I was wondering if Mocapy is part of Biopython.
>
> I thought it was but I cannot find it in my biopython PDB folder.
>
>
Hi George,

No, Mocapy++ is a separate project:
http://sourceforge.net/projects/mocapy/

There is a branch to add some integration with Mocapy++ to Biopython, but
we're waiting for the next stable release of Mocapy++ before merging it:
https://github.com/mchelem/biopython

-Eric