From ishengomae at nm-aist.ac.tz Sun Feb 2 14:28:23 2014 From: ishengomae at nm-aist.ac.tz (Edson Ishengoma) Date: Sun, 2 Feb 2014 22:28:23 +0300 Subject: [Biopython] Help modify this code so it can do what I want it to do Message-ID: Hi folks, I picked this code from somewhere and edited it a bit, but it still can't achieve what I need. I have XML output of tblastn hits on my customized database, and now I am in the process of extracting the results with Biopython. With tblastn, sometimes the returned hit is multiple local hits (HSPs) corresponding to certain positions along the query with significant scores. Now I want to concatenate these local hits, which first requires sorting them according to position.

for record in records:
    for alignment in record.alignments:
        hits = sorted((hsp.query_start, hsp.query_end, hsp.sbjct_start, hsp.sbjct_end, alignment.title, hsp.query, hsp.sbjct)
                      for hsp in alignment.hsps)  # sorting results according to positions
        complete_query_seq = ''
        complete_sbjct_seq = ''
        for q_start, q_end, sb_start, sb_end, title, query, sbjct in hits:
            print title
            print 'The query starts from position: ' + str(q_start)
            print 'The query ends at position: ' + str(q_end)
            print 'The hit starts at position: ' + str(sb_start)
            print 'The hit ends at position: ' + str(sb_end)
            print 'The query is: ' + query
            print 'The hit is: ' + sbjct
            complete_query_seq += str(query[q_start:q_end])  # concatenating subsequent query/subject portions with alignments
            complete_sbjct_seq += str(query[sb_start:sb_end])
        print 'Complete query seq is: ' + complete_query_seq
        print 'Complete subject seq is: ' + complete_sbjct_seq

This would print:

Species_1
The query starts from position: 1
The query ends at position: 184
The hit starts at position: 1
The hit ends at position: 552
The query is: ####### query_seq
The hit is: ######### hit_seq
Species_1
The query starts from position: 390
The query ends at position: 510
The hit starts at position: 549
The hit ends at position: 911
The query is: ####### query_seq
The hit is: ######### hit_seq
Species_1
The query starts from position: 492
The query ends at position: 787
The hit starts at position: 889
The hit ends at position: 1776
The query is: ####### query_seq
The hit is: ######### hit_seq
Complete query seq is: ####### query_seq
Complete subject seq is: ######### hit_seq

This is not what I want, as clearly the program did no concatenation at all, or I messed up seriously. What I want is:

Complete query seq is: ####### ##############

(color coded to mean the different portions of the query with significant hits), with no sequence overlaps. How do I achieve that? Thanks, Regards, Edson.

From saketkc at gmail.com Sun Feb 2 23:22:42 2014 From: saketkc at gmail.com (Saket Choudhary) Date: Mon, 3 Feb 2014 09:52:42 +0530 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References: Message-ID: On 31 January 2014 16:25, Peter Cock wrote: > On Wed, Jan 29, 2014 at 9:29 PM, Eric Talevich wrote: >> Hi folks, >> >> Google Summer of Code is on again for 2014, and the Open Bioinformatics >> Foundation (OBF) is once again applying as a mentoring organization. Participating in GSoC as an organization is very competitive, and we will >> need your help in gathering a good set of ideas and potential mentors for >> Biopython's role in GSoC this year. >> >> If you have an idea for a Summer of Code project, please post your idea >> here on the Biopython mailing list for discussion and start an outline on >> this wiki page: >> http://biopython.org/wiki/Google_Summer_of_Code >> >> We also welcome ideas that fit with OBF's mission but are not part of a >> single Bio* project, or span multiple projects -- these ideas can be posted >> on the OBF wiki and discussed on the OBF mailing list: >> http://www.open-bio.org/wiki/Google_Summer_of_Code#Project_ideas >> http://lists.open-bio.org/mailman/listinfo/open-bio-l >> >> Here's to another fun and productive Summer of Code!
>> >> Cheers, >> Eric & Raoul > > Thanks Eric & Raoul, > > Remember that the ideas don't have to come from potential mentors - > if as a student there is something you'd particularly like to work on > please ask, and perhaps we can find a suitable (Biopython) mentor. > > Regards, > > Peter I would like to propose a QC module for NGS & Microarray data. Essentially a fastQC[1] and limma[2], respectively ported to Biopython. [1] http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ [2] http://bioconductor.org/packages/devel/bioc/html/limma.html Saket > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Mon Feb 3 07:19:40 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Feb 2014 12:19:40 +0000 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: On Sun, Feb 2, 2014 at 7:28 PM, Edson Ishengoma wrote: > Hi folks, > > I picked this code from somewhere and edited it a bit but it still can't > achieve what I need. I have an xml output of tblastn hits on my customized > database and now I am in the process to extract the results with biopython. > With tblastn sometimes the returned hit is multiple local hits corresponding > to certain positions along the query with significant scores. Now I want to > concatenate these local hits which initially requires sorting according to > positions. > > ... > complete_query_seq += str(query[q_start:q_end]) > complete_sbjct_seq += str(query[sb_start:sb_end]) > ... Shouldn't you be taking a slice from the subject sequence (the database match) there, rather than the query sequence? Another approach would be to use the alignment sequence fragments BLAST gives you (and remove the gap characters). 
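Peter's suggestion above — sort the HSP fragments by query position, strip the gap characters, and trim any overlap before concatenating — can be sketched as follows. This is a minimal illustration, not Biopython API: merge_query_fragments is a hypothetical helper, and with real parsed output the tuples would come from hsp.query_start, hsp.query_end and hsp.query.

```python
def merge_query_fragments(hsps):
    """Concatenate gapped HSP fragments in query order, dropping overlaps.

    hsps: iterable of (q_start, q_end, fragment) tuples using BLAST's
    1-based inclusive coordinates; fragment is the aligned text, which
    may contain '-' gap characters.
    """
    merged = ""
    prev_end = 0
    for q_start, q_end, fragment in sorted(hsps):
        fragment = fragment.replace("-", "")  # remove alignment gap characters
        if q_start <= prev_end:
            # Fragment overlaps the previous one (compare HSPs 390-510 and
            # 492-787 in the thread): keep only the new residues.
            fragment = fragment[prev_end - q_start + 1:]
        merged += fragment
        prev_end = max(prev_end, q_end)
    return merged

# Toy fragments: positions 1-5, then an overlapping gapped fragment 4-8.
print(merge_query_fragments([(1, 5, "ABCDE"), (4, 8, "DE-FGH")]))  # ABCDEFGH
```

The same trimming idea applies to the subject fragments, except that with tblastn the subject coordinates are nucleotides (three per query residue), so the overlap there must be computed from sb_start/sb_end rather than reused from the query positions.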
Peter From ivangreg at gmail.com Mon Feb 3 08:43:17 2014 From: ivangreg at gmail.com (Ivan Gregoretti) Date: Mon, 3 Feb 2014 08:43:17 -0500 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: Hello Edson, There is an argument that you can pass to tblastn that is called max_hsps_per_subject. Try -max_hsps_per_subject 1 and be sure not to pass the flag -ungapped. That might do the job for you. The help says:

tblastn -help
...
 *** Statistical options
 -dbsize <Int8>
   Effective length of the database
 -searchsp <Int8, >=0>
   Effective length of the search space
 -max_hsps_per_subject <Integer, >=0>
   Override maximum number of HSPs per subject to save for ungapped searches
   (0 means do not override)
   Default = `0'
...

Ivan Ivan Gregoretti, PhD On Mon, Feb 3, 2014 at 7:19 AM, Peter Cock wrote: > On Sun, Feb 2, 2014 at 7:28 PM, Edson Ishengoma > wrote: >> Hi folks, >> >> I picked this code from somewhere and edited it a bit but it still can't >> achieve what I need. I have an xml output of tblastn hits on my customized >> database and now I am in the process to extract the results with biopython. >> With tblastn sometimes the returned hit is multiple local hits corresponding >> to certain positions along the query with significant scores. Now I want to >> concatenate these local hits which initially requires sorting according to >> positions. >> >> ... >> complete_query_seq += str(query[q_start:q_end]) >> complete_sbjct_seq += str(query[sb_start:sb_end]) >> ... > > Shouldn't you be taking a slice from the subject sequence (the database > match) there, rather than the query sequence? > > Another approach would be to use the alignment sequence fragments > BLAST gives you (and remove the gap characters).
> > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Mon Feb 3 12:15:44 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Feb 2014 17:15:44 +0000 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References: Message-ID: On Mon, Feb 3, 2014 at 4:21 PM, Lisa Cohen wrote: > Hello Everyone, > > I am a new bioinformatics student and interested in working on a Biopython > package for gene ontology and functional annotation. I've noticed that this > is in "discussion stages" on the wiki page [1]. Perhaps working with > blast2GO [2], b2g4pipe Galaxy wrapper [3], other existing tools [4]. > > Is this a feasible Google Summer of Code project idea? Is anyone interested > in working with me? > > Lisa > > [1] http://www.biopython.org/w/index.php?title=Gene_Ontology&redirect=no > [2] http://www.blast2go.com/b2ghome > [3] https://github.com/peterjc/galaxy_blast/tree/master/tools/blast2go > [4] https://github.com/tanghaibao/goatools Something based around (gene) ontology support might make a good project. Chris Lasher was once looking at this, as was Kyle Ellrott. On the general subject of ontologies, more recently Iddo Friedberg and Bartek Wilczynski were talking about some OBO work just last month: http://lists.open-bio.org/pipermail/biopython-dev/2014-January/thread.html Peter From ishengomae at nm-aist.ac.tz Mon Feb 3 14:16:55 2014 From: ishengomae at nm-aist.ac.tz (Edson Ishengoma) Date: Mon, 3 Feb 2014 22:16:55 +0300 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: Hi Peter, Sorry, that was a typo; it should be: complete_sbjct_seq += str(sbjct[sb_start:sb_end]). I tried Ivan's suggestion of providing the tblastn option [-max_hsps_per_subject 1], but the output still shows up as fragmented hits.
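For reference, the tblastn run being discussed would look roughly like this on the command line (file and database names are placeholders; -outfmt 5 writes the XML output the script parses, and -max_hsps_per_subject is the flag quoted from the tblastn help):

```shell
# Placeholder names: query_proteins.faa and my_custom_db are assumptions.
tblastn -query query_proteins.faa -db my_custom_db \
        -max_hsps_per_subject 1 \
        -outfmt 5 -out hits.xml
```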
Peter said: "Another approach would be to use the alignment sequence fragments BLAST gives you (and remove the gap characters)." With the script I have, I can only extract the first fragment for each hit. I don't know why the string slicing method [sb_start:sb_end] in my script does not include the start and end positions of subsequent fragments. Regards, Edson On Mon, Feb 3, 2014 at 4:43 PM, Ivan Gregoretti wrote: > Hello Edson, > > There is an argument that you can pass to tblastn that is called > max_hsps_per_subject. Try -max_hsps_per_subjec=1 and be sure not to > pass the flag -ungapped. That might do the job for you. > > The help says > > tblastn -help > ... > *** Statistical options > -dbsize > Effective length of the database > -searchsp =0> > Effective length of the search space > -max_hsps_per_subject =0> > Override maximum number of HSPs per subject to save for ungapped > searches > (0 means do not override) > Default = `0' > ... > > Ivan > > > > Ivan Gregoretti, PhD > > > On Mon, Feb 3, 2014 at 7:19 AM, Peter Cock > wrote: > > On Sun, Feb 2, 2014 at 7:28 PM, Edson Ishengoma > > wrote: > >> Hi folks, > >> > >> I picked this code from somewhere and edited it a bit but it still can't > >> achieve what I need. I have an xml output of tblastn hits on my > customized > >> database and now I am in the process to extract the results with > biopython. > >> With tblastn sometimes the returned hit is multiple local hits > corresponding > >> to certain positions along the query with significant scores. Now I > want to > >> concatenate these local hits which initially requires sorting according > to > >> positions. > >> > >> ... > >> complete_query_seq += str(query[q_start:q_end]) > >> complete_sbjct_seq += str(query[sb_start:sb_end]) > >> ... > > > > Shouldn't you be taking a slice from the subject sequence (the database > > match) there, rather than the query sequence?
> > > > Another approach would be to use the alignment sequence fragments > > BLAST gives you (and remove the gap characters). > > > > Peter > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > From p.j.a.cock at googlemail.com Mon Feb 3 15:14:04 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Feb 2014 20:14:04 +0000 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: On Monday, February 3, 2014, Edson Ishengoma wrote: > Hi Peter, > > Sorry that was the typo, it should be: > complete_sbjct_seq += str(sbjct[sb_start:sb_end]). > > I tried a suggestion by Ivan on the providing tblastn option > [-max_hsps_per_subject 1] but still the output shows up as fragmented hits. > > Peter said: "Another approach would be to use the alignment sequence > fragments BLAST gives you (and remove the gap characters)." > With the script I have I can only extract the first fragment only for each > hit. I don't know why string slicing method [sb_start:sb_end] in my script > does not include start and end positions for subsequent fragments. > > Regards, > > Edson > Hi Edson, Emails can mess up Python indentation, so posting the file online might show something silly we've missed - I find http://gist.github.com works well for this. It would also help if you could share a sample BLAST output file where the script is failing, as then people on the list could recreate your problem on their own computer, which is often the first step in solving it. Peter From ishengomae at nm-aist.ac.tz Mon Feb 3 16:45:38 2014 From: ishengomae at nm-aist.ac.tz (Edson Ishengoma) Date: Tue, 4 Feb 2014 00:45:38 +0300 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: Thanks Peter. 
Here is a link to my script at https://gist.github.com/EBIshengoma/efc4ad3e32427891931d Also, please find attached the sample xml output. On Mon, Feb 3, 2014 at 11:14 PM, Peter Cock wrote: > > On Monday, February 3, 2014, Edson Ishengoma > wrote: > >> Hi Peter, >> >> Sorry that was the typo, it should be: >> complete_sbjct_seq += str(sbjct[sb_start:sb_end]). >> >> I tried a suggestion by Ivan on the providing tblastn option >> [-max_hsps_per_subject 1] but still the output shows up as fragmented hits. >> >> Peter said: "Another approach would be to use the alignment sequence >> fragments BLAST gives you (and remove the gap characters)." >> With the script I have I can only extract the first fragment only for >> each hit. I don't know why string slicing method [sb_start:sb_end] in my >> script >> does not include start and end positions for subsequent fragments. >> >> Regards, >> >> Edson >> > > Hi Edson, > > Emails can mess up Python indentation, so posting the file online might > show something silly we've missed - I find http://gist.github.com works > well for this. > > It would also help if you could share a sample BLAST output file where the > script is failing, as then people on the list could recreate your problem > on their own computer, which is often the first step in solving it. > > Peter > > -------------- next part -------------- A non-text attachment was scrubbed... Name: Sample_output.xml Type: text/xml Size: 12909 bytes Desc: not available URL: From aradwen at gmail.com Mon Feb 3 19:08:27 2014 From: aradwen at gmail.com (Radhouane Aniba) Date: Mon, 3 Feb 2014 16:08:27 -0800 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: You can also try coderscrowd.com; you will get each proposed modification to your code separately and can validate the one that works best for you. Rad On Mon, Feb 3, 2014 at 1:45 PM, Edson Ishengoma wrote: > Thanks Peter.
> > Here is a link to my script at > https://gist.github.com/EBIshengoma/efc4ad3e32427891931d > > Also, please find attached the sample xml output. > > > > On Mon, Feb 3, 2014 at 11:14 PM, Peter Cock >wrote: > > > > > On Monday, February 3, 2014, Edson Ishengoma > > wrote: > > > >> Hi Peter, > >> > >> Sorry that was the typo, it should be: > >> complete_sbjct_seq += str(sbjct[sb_start:sb_end]). > >> > >> I tried a suggestion by Ivan on the providing tblastn option > >> [-max_hsps_per_subject 1] but still the output shows up as fragmented > hits. > >> > >> Peter said: "Another approach would be to use the alignment sequence > >> fragments BLAST gives you (and remove the gap characters)." > >> With the script I have I can only extract the first fragment only for > >> each hit. I don't know why string slicing method [sb_start:sb_end] in my > >> script > >> does not include start and end positions for subsequent fragments. > >> > >> Regards, > >> > >> Edson > >> > > > > Hi Edson, > > > > Emails can mess up Python indentation, so posting the file online might > > show something silly we've missed - I find http://gist.github.com works > > well for this. > > > > It would also help if you could share a sample BLAST output file where > the > > script is failing, as then people on the list could recreate your problem > > on their own computer, which is often the first step in solving it. 
> > > > Peter > > > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > --
Radhouane Aniba
Bioinformatics Postdoctoral Research Scientist
Institute for Advanced Computer Studies
Center for Bioinformatics and Computational Biology (CBCB)
University of Maryland, College Park, MD 20742

From p.j.a.cock at googlemail.com Tue Feb 4 03:46:11 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 4 Feb 2014 08:46:11 +0000 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: On Monday, February 3, 2014, Edson Ishengoma wrote: > Thanks Peter. > > Here is a link to my script at > https://gist.github.com/EBIshengoma/efc4ad3e32427891931d > > Also, please find attached the sample xml output. > > The start of the script is missing (import statements, how you loaded the query and subject sequences, and how you parsed the BLAST output). We'd need at least that to run your script. Regards, Peter From ishengomae at nm-aist.ac.tz Tue Feb 4 04:12:53 2014 From: ishengomae at nm-aist.ac.tz (Edson Ishengoma) Date: Tue, 4 Feb 2014 12:12:53 +0300 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: Hi Peter, My apologies, I have updated the code at https://gist.github.com/EBIshengoma/efc4ad3e32427891931d to appear exactly how I run it from my computer. Thanks.

Edson B. Ishengoma
PhD Candidate
School of Life Sciences and Engineering
Nelson Mandela African Institute of Science and Technology
Nelson Mandela Road, P. O. Box 447, Arusha, Tanzania (255)
ishengomae at nm-aist.ac.tz / ebarongo82 at yahoo.co.uk
Mobile: +255 762 348 037, +255 714 789 360
Website: www.nm-aist.ac.tz
Skype: edson.ishengoma

On Tue, Feb 4, 2014 at 11:46 AM, Peter Cock wrote: > > > On Monday, February 3, 2014, Edson Ishengoma > wrote: > >> Thanks Peter.
>> >> Here is a link to my script at >> https://gist.github.com/EBIshengoma/efc4ad3e32427891931d >> >> Also, please find attached the sample xml output. >> >> > The start of the script is missing (import statements, how > you loaded the query and subject sequences, and how > you parsed the BLAST output). We'd need at least that > to run your script. > > Regards, > > Peter > > From bartha.daniel at agrar.mta.hu Tue Feb 4 05:38:46 2014 From: bartha.daniel at agrar.mta.hu (Bartha Dániel) Date: Tue, 4 Feb 2014 11:38:46 +0100 Subject: [Biopython] help! entrez esearch popset issue Message-ID: Hi People, I have an issue with Biopython's esearch/efetch, and this drives me crazy. If I search for something in the PopSet database like this (the query itself is arbitrary):

query = "Homo sapiens[Organism] NOT mitochondrion[All Fields]"
esearch_handle = Entrez.esearch(db="popset", term=query)
search_results = Entrez.read(esearch_handle)
accnos = search_results['IdList']

I somehow always get only 20 results in my IdList, but the same term gives many thousands on the website. Is this a bug? By default, the website shows 20 results per page, and, surprise, my 20 results are equal to the first page. The Biopython documentation regarding the PopSet DB is not very talkative, so I ask you: how do I solve this problem elegantly ("Python only")? Since the same approach doesn't cause any issues when searching the protein or other sequence DBs, either the PopSet DB has some tricks I don't know, or this is a BUG(?). Regards: Daniel -- Dániel Bartha, molecular bionics engineer, BSc Bioinformatician Institute for Veterinary Medical Research Centre for Agricultural Research Hungarian Academy of Sciences Hungária körút 21.
Budapest 1143 Hungary e-mail: bartha.daniel at agrar.mta.hu From saketkc at gmail.com Tue Feb 4 07:25:45 2014 From: saketkc at gmail.com (Saket Choudhary) Date: Tue, 4 Feb 2014 12:25:45 +0000 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: <20140204231638.41daaf4a@kmserver> References: <20140204231638.41daaf4a@kmserver> Message-ID: Hi Kevin, In fact I forked this long ago[1], though I didn't have time to contribute to it. Thanks for the awesome work! [1] https://github.com/saketkc/pyNGSQC Saket On 4 February 2014 12:16, Kevin Murray wrote: > Saket, > > Apologies in advance if this is a little too unsolicited! =) > > Feel free to use pyNGSQC[1] as the basis for some of the proposed QC > stuff, if it is of any use. I've been meaning to refactor this to use > Biopython and in the long term submit a pull request, but I doubt I'll > have time. I can share the refactoring progress with you/push it to > github if you're interested. > > [1]: https://github.com/kdmurray91/pyNGSQC > > > Cheers, > > Kevin > > On Mon, 3 Feb 2014 09:52:42 +0530 > Saket Choudhary wrote: > >>On 31 January 2014 16:25, Peter Cock wrote: >>> On Wed, Jan 29, 2014 at 9:29 PM, Eric Talevich >>> wrote: >>>> Hi folks, >>>> >>>> Google Summer of Code is on again for 2014, and the Open >>>> Bioinformatics Foundation (OBF) is once again applying as a >>>> mentoring organization. Participating in GSoC as an organization is >>>> very competitive, and we will need your help in gathering a good >>>> set of ideas and potential mentors for Biopython's role in GSoC >>>> this year.
>>>> >>>> If you have an idea for a Summer of Code project, please post your >>>> idea here on the Biopython mailing list for discussion and start an >>>> outline on this wiki page: >>>> http://biopython.org/wiki/Google_Summer_of_Code >>>> >>>> We also welcome ideas that fit with OBF's mission but are not part >>>> of a single Bio* project, or span multiple projects -- these ideas >>>> can be posted on the OBF wiki and discussed on the OBF mailing list: >>>> http://www.open-bio.org/wiki/Google_Summer_of_Code#Project_ideas >>>> http://lists.open-bio.org/mailman/listinfo/open-bio-l >>>> >>>> Here's to another fun and productive Summer of Code! >>>> >>>> Cheers, >>>> Eric & Raoul >>> >>> Thanks Eric & Raoul, >>> >>> Remember that the ideas don't have to come from potential mentors - >>> if as a student there is something you'd particularly like to work on >>> please ask, and perhaps we can find a suitable (Biopython) mentor. >>> >>> Regards, >>> >>> Peter >> >>I would like to propose a QC module for NGS & Microarray data. >>Essentially a fastQC[1] and limma[2], respectively ported to >>Biopython. >> >> >> >>[1] http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ >>[2] http://bioconductor.org/packages/devel/bioc/html/limma.html >> >> >>Saket >> >>> _______________________________________________ >>> Biopython mailing list - Biopython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >>_______________________________________________ >>Biopython mailing list - Biopython at lists.open-bio.org >>http://lists.open-bio.org/mailman/listinfo/biopython From kevin at kdmurray.id.au Tue Feb 4 07:34:56 2014 From: kevin at kdmurray.id.au (Kevin Murray) Date: Tue, 4 Feb 2014 23:34:56 +1100 Subject: [Biopython] help! entrez esearch popset issue In-Reply-To: References: Message-ID: <20140204233456.7204362d@kmserver> Bartha, I believe that the retstart keyword argument is your friend. 
Something like [Completely contrived and untested]:

request = Entrez.read(Entrez.esearch(db, qry, retstart=0))
answers = request["IdList"]
expected = int(request["Count"])
returned = len(answers)
while returned < expected:
    request = Entrez.read(Entrez.esearch(db, qry, retstart=returned))
    returned += len(request["IdList"])
    answers.extend(request["IdList"])
print(answers)

This is documented here: http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_EFetch_ Others may have more intelligent/complete solutions. Cheers, Kevin On Tue, 4 Feb 2014 11:38:46 +0100 Bartha Dániel wrote: >Hi People, > >I have an issue with biopythons esearch/efetch, and this drives me >crazy. > >If I search for something in the PopSet, like this, but the query is >arbitrary: > >query = "Homo sapiens[Organism] NOT mitochondrion[All Fields]"; > >esearch_handle = Entrez.esearch(db="popset", term=query) >search_results = Entrez.read(esearch_handle) >accnos = search_results['IdList'] > >I get somehow always only 20 results in my IdList, but with the same >term, many thousands on the website. Is this a bug? > >Because by default, on the website, 20 results per page are shown, and >surprise, my 20 results are equal with the first page. The biopython >documentation regarding the PopSet DB is not very talkative, so I ask >you, how do I solve this problem elegant ("python only")? > >Since the same constellation doesn't cause any issues by searching in >the protein or other sequence DB, either has the PopSet DB some tricks >I don't kow or this is a BUG(?). > > >Regards: > >Daniel > > > From kevin at kdmurray.id.au Tue Feb 4 07:16:38 2014 From: kevin at kdmurray.id.au (Kevin Murray) Date: Tue, 4 Feb 2014 23:16:38 +1100 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References: Message-ID: <20140204231638.41daaf4a@kmserver> Saket, Apologies in advance if this is a little too unsolicited!
=) Feel free to use pyNGSQC[1] as the basis for some of the proposed QC stuff, if it is of any use. I've been meaning to refactor this to use Biopython and in the long term submit a pull request, but I doubt I'll have time. I can share the refactoring progress with you/push it to github if you're interested. [1]: https://github.com/kdmurray91/pyNGSQC Cheers, Kevin On Mon, 3 Feb 2014 09:52:42 +0530 Saket Choudhary wrote: >On 31 January 2014 16:25, Peter Cock wrote: >> On Wed, Jan 29, 2014 at 9:29 PM, Eric Talevich >> wrote: >>> Hi folks, >>> >>> Google Summer of Code is on again for 2014, and the Open >>> Bioinformatics Foundation (OBF) is once again applying as a >>> mentoring organization. Participating in GSoC as an organization is >>> very competitive, and we will need your help in gathering a good >>> set of ideas and potential mentors for Biopython's role in GSoC >>> this year. >>> >>> If you have an idea for a Summer of Code project, please post your >>> idea here on the Biopython mailing list for discussion and start an >>> outline on this wiki page: >>> http://biopython.org/wiki/Google_Summer_of_Code >>> >>> We also welcome ideas that fit with OBF's mission but are not part >>> of a single Bio* project, or span multiple projects -- these ideas >>> can be posted on the OBF wiki and discussed on the OBF mailing list: >>> http://www.open-bio.org/wiki/Google_Summer_of_Code#Project_ideas >>> http://lists.open-bio.org/mailman/listinfo/open-bio-l >>> >>> Here's to another fun and productive Summer of Code! >>> >>> Cheers, >>> Eric & Raoul >> >> Thanks Eric & Raoul, >> >> Remember that the ideas don't have to come from potential mentors - >> if as a student there is something you'd particularly like to work on >> please ask, and perhaps we can find a suitable (Biopython) mentor. >> >> Regards, >> >> Peter > >I would like to propose a QC module for NGS & Microarray data. >Essentially a fastQC[1] and limma[2], respectively ported to >Biopython. 
> > > >[1] http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ >[2] http://bioconductor.org/packages/devel/bioc/html/limma.html > > >Saket > >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >_______________________________________________ >Biopython mailing list - Biopython at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/biopython From idoerg at gmail.com Tue Feb 4 08:18:37 2014 From: idoerg at gmail.com (Iddo Friedberg) Date: Tue, 4 Feb 2014 08:18:37 -0500 Subject: [Biopython] help! entrez esearch popset issue In-Reply-To: References: Message-ID: Default number of records returned is 20. Read about the retmax and retstart arguments to see how to increase that number: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch On Tue, Feb 4, 2014 at 5:38 AM, Bartha Dániel wrote: > Hi People, > > I have an issue with biopythons esearch/efetch, and this drives me crazy. > > If I search for something in the PopSet, like this, but the query is > arbitrary: > > query = "Homo sapiens[Organism] NOT mitochondrion[All Fields]"; > > esearch_handle = Entrez.esearch(db="popset", term=query) > search_results = Entrez.read(esearch_handle) > accnos = search_results['IdList'] > > I get somehow always only 20 results in my IdList, but with the same term, > many thousands on the website. Is this a bug? > > Because by default, on the website, 20 results per page are shown, and > surprise, my 20 results are equal with the first page. The biopython > documentation regarding the PopSet DB is not very talkative, so I ask you, > how do I solve this problem elegant ("python only")? > > Since the same constellation doesn't cause any issues by searching in the > protein or other sequence DB, either has the PopSet DB some tricks I don't > kow or this is a BUG(?).
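The retstart/retmax paging that Iddo and Kevin describe can be wrapped in a small generic helper. This is a sketch: fetch_all_ids and its search callable are hypothetical, not part of Biopython; with Entrez you would pass something like lambda s: Entrez.read(Entrez.esearch(db="popset", term=query, retstart=s, retmax=500)).

```python
def fetch_all_ids(search):
    """Collect every ID from a paged esearch-style interface.

    search(retstart) must return a dict with "Count" (total matches,
    as a string) and "IdList" (the IDs for that page).
    """
    result = search(0)
    total = int(result["Count"])
    ids = list(result["IdList"])
    while len(ids) < total:
        result = search(len(ids))     # ask for the next page
        if not result["IdList"]:      # defensive: server returned nothing
            break
        ids.extend(result["IdList"])
    return ids

# Stub standing in for Entrez.esearch: two IDs per page, five matches total.
pages = {0: ["11", "12"], 2: ["13", "14"], 4: ["15"]}
fake_search = lambda retstart: {"Count": "5", "IdList": pages[retstart]}
print(fetch_all_ids(fake_search))  # ['11', '12', '13', '14', '15']
```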
> > > Regards: > > Daniel > > -- > Dániel Bartha, molecular bionics engineer, BSc > Bioinformatician > Institute for Veterinary Medical Research > Centre for Agricultural Research > Hungarian Academy of Sciences > Hungária körút 21. > Budapest > 1143 > Hungary > > e-mail: > bartha.daniel at agrar.mta.hu > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From jgrant at smith.edu Tue Feb 4 11:09:19 2014 From: jgrant at smith.edu (Jessica Grant) Date: Tue, 4 Feb 2014 11:09:19 -0500 Subject: [Biopython] amazon aws Message-ID: Hello, Has anyone been successful in installing Biopython on an instance of the Amazon cloud? If so, can I get some advice? I tried finding an easy install package, but couldn't, so I started to try installing from source. I ran into trouble with setup.py because it couldn't find gcc. I am going to try to find and install gcc... Also, will this need to get reinstalled every time I start an instance of the cloud? Thanks!! Jessica From zhigangwu.bgi at gmail.com Tue Feb 4 11:44:49 2014 From: zhigangwu.bgi at gmail.com (Zhigang Wu) Date: Tue, 4 Feb 2014 08:44:49 -0800 Subject: [Biopython] amazon aws In-Reply-To: References: Message-ID: <3C711647-B24C-4FF9-9E62-25EA1414BD4A@gmail.com> What is the Linux distribution of the EC2 instance you brought up? If it's Debian or Ubuntu, then sudo apt-get install biopython should be sufficient. The idea is just to use whatever package manager is available in the EC2 instance.
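Zhigang's package-manager route, with a PyPI fallback, might look like this on a fresh instance (package and command names are assumptions for a 2014-era Debian/Ubuntu image; Amazon Linux uses yum rather than apt-get):

```shell
# Debian/Ubuntu (assumed package name for the Python 2 build of Biopython):
sudo apt-get update
sudo apt-get install python-biopython

# Distribution-independent alternative:
# pip install biopython

# Sanity check that the install worked:
python -c "import Bio; print(Bio.__version__)"
```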
Zhigang Sent from my iPhone > On Feb 4, 2014, at 8:09 AM, Jessica Grant wrote: > > Hello, > > Has anyone been successful in installing Biopython on an instance of the > amazon cloud? If so, can I get some advice? I tried finding an easy > install package, but couldn't, so I started to try installing from source. > I ran into trouble because with setup.py bcause it couldn't find gcc. I > am going to try to find and install gcc... > > Also, will this need to get reinstalled every time I start an instance of > the cloud? > > Thanks!! > > Jessica > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From jgrant at smith.edu Tue Feb 4 11:47:41 2014 From: jgrant at smith.edu (Jessica Grant) Date: Tue, 4 Feb 2014 11:47:41 -0500 Subject: [Biopython] amazon aws In-Reply-To: <3C711647-B24C-4FF9-9E62-25EA1414BD4A@gmail.com> References: <3C711647-B24C-4FF9-9E62-25EA1414BD4A@gmail.com> Message-ID: I am just trying this out to see if this is going to work for us, so I am using the free version - Amazon Linux AMI x86_64 PV - and apt-get didn't work for me here. I will try launching an Ubuntu instance instead. Thank you for your response! Jessica On Tue, Feb 4, 2014 at 11:44 AM, Zhigang Wu wrote: > What is the Linux distribution of EC2 instance you bring up? If it's > Debian or Ubuntu, then sudo apt-get install biopython should be sufficient. > > The idea is just use whatever package manager available in EC2 instance. > > Zhigang > > Sent from my iPhone > > > On Feb 4, 2014, at 8:09 AM, Jessica Grant wrote: > > > > Hello, > > > > Has anyone been successful in installing Biopython on an instance of the > > amazon cloud? If so, can I get some advice? I tried finding an easy > > install package, but couldn't, so I started to try installing from > source. > > I ran into trouble because with setup.py bcause it couldn't find gcc. I > > am going to try to find and install gcc... 
> > > > Also, will this need to get reinstalled every time I start an instance of > > the cloud? > > > > Thanks!! > > > > Jessica > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > From jgrant at smith.edu Tue Feb 4 12:05:19 2014 From: jgrant at smith.edu (Jessica Grant) Date: Tue, 4 Feb 2014 12:05:19 -0500 Subject: [Biopython] amazon aws In-Reply-To: References: <3C711647-B24C-4FF9-9E62-25EA1414BD4A@gmail.com> Message-ID: Yes, that worked! Now on to RaxML... Thank you! On Tue, Feb 4, 2014 at 11:47 AM, Jessica Grant wrote: > I am just trying this out to see if this is going to work for us, so I am > using the free version - Amazon Linux AMI x86_64 PV - and apt-get didn't > work for me here. > I will try launching an Ubuntu instance instead. > > Thank you for your response! > > Jessica > > > > > On Tue, Feb 4, 2014 at 11:44 AM, Zhigang Wu wrote: > >> What is the Linux distribution of EC2 instance you bring up? If it's >> Debian or Ubuntu, then sudo apt-get install biopython should be sufficient. >> >> The idea is just use whatever package manager available in EC2 instance. >> >> Zhigang >> >> Sent from my iPhone >> >> > On Feb 4, 2014, at 8:09 AM, Jessica Grant wrote: >> > >> > Hello, >> > >> > Has anyone been successful in installing Biopython on an instance of the >> > amazon cloud? If so, can I get some advice? I tried finding an easy >> > install package, but couldn't, so I started to try installing from >> source. >> > I ran into trouble because with setup.py bcause it couldn't find gcc. I >> > am going to try to find and install gcc... >> > >> > Also, will this need to get reinstalled every time I start an instance >> of >> > the cloud? >> > >> > Thanks!! 
>> > >> > Jessica >> > _______________________________________________ >> > Biopython mailing list - Biopython at lists.open-bio.org >> > http://lists.open-bio.org/mailman/listinfo/biopython >> > > From cdshaffer at gmail.com Tue Feb 4 12:52:54 2014 From: cdshaffer at gmail.com (christopher shaffer) Date: Tue, 4 Feb 2014 11:52:54 -0600 Subject: [Biopython] amazon aws Message-ID: Jessica, I am not going to spam the biopython list as this is off topic, but you might want to look at the iPlant collaborative. This is an NSF funded "cyberinfrastructure" that has an AWS like service called Atmospheres. It is all free to registered users. They have recently been expanding from plant bioinformatics by adding more support for microbs and animals so there is a good chance they have a machine that has what you need. They appear to be down for maintenance right now, but once they are back up you could check through all the virtual machines and see if any have what you need. I just created an account myself so I am afraid I don't know much more but I was quite impressed with the "overview of iPlant" webinar I attended last week. Chris Shaffer Biology Washington Univ in St. Louis P.S. I have no connection to iPlant except as an interested user. > Date: Tue, 4 Feb 2014 11:09:19 -0500 > From: Jessica Grant > Subject: [Biopython] amazon aws > To: Biopython at lists.open-bio.org > Message-ID: > < > CAOuNqdnHV9GwSQURT7q_drpuH6OSNDjUjzYyv-2gBb4OPzJ5Zw at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > Hello, > > Has anyone been successful in installing Biopython on an instance of the > amazon cloud? If so, can I get some advice? I tried finding an easy > install package, but couldn't, so I started to try installing from source. > I ran into trouble because with setup.py bcause it couldn't find gcc. I > am going to try to find and install gcc... > > Also, will this need to get reinstalled every time I start an instance of > the cloud? > > Thanks!! 
> > Jessica > > From cjfields at illinois.edu Tue Feb 4 13:11:56 2014 From: cjfields at illinois.edu (Fields, Christopher J) Date: Tue, 4 Feb 2014 18:11:56 +0000 Subject: [Biopython] amazon aws In-Reply-To: References: Message-ID: Jessica, I suggest setting up an instance using whatever (*cough*linux*cough*) OS you want; could be Amazon AWS, iPlant (which I think uses OpenStack), or another snapshot-capable cloud service. Install what you need, then take a snapshot of the instance, which in general should store any customizations you made. Maybe look into CloudBioLinux, Scientific Linux, or similar images for a good start in this direction. chris On Feb 4, 2014, at 11:52 AM, christopher shaffer wrote: > Jessica, > I am not going to spam the biopython list as this is off topic, but you > might want to look at the iPlant collaborative. This is an NSF funded > "cyberinfrastructure" that has an AWS like service called Atmospheres. It > is all free to registered users. They have recently been expanding from > plant bioinformatics by adding more support for microbs and animals so > there is a good chance they have a machine that has what you need. > > They appear to be down for maintenance right now, but once they are back up > you could check through all the virtual machines and see if any have what > you need. > > I just created an account myself so I am afraid I don't know much more but > I was quite impressed with the "overview of iPlant" webinar I attended last > week. > > Chris Shaffer > Biology > Washington Univ in St. Louis > P.S. I have no connection to iPlant except as an interested user. 
> > >> Date: Tue, 4 Feb 2014 11:09:19 -0500
>> From: Jessica Grant
>> Subject: [Biopython] amazon aws
>> To: Biopython at lists.open-bio.org
>> Message-ID:
>> <
>> CAOuNqdnHV9GwSQURT7q_drpuH6OSNDjUjzYyv-2gBb4OPzJ5Zw at mail.gmail.com>
>> Content-Type: text/plain; charset=ISO-8859-1
>>
>> Hello,
>>
>> Has anyone been successful in installing Biopython on an instance of the
>> amazon cloud? If so, can I get some advice? I tried finding an easy
>> install package, but couldn't, so I started to try installing from source.
>> I ran into trouble because with setup.py bcause it couldn't find gcc. I
>> am going to try to find and install gcc...
>>
>> Also, will this need to get reinstalled every time I start an instance of
>> the cloud?
>>
>> Thanks!!
>>
>> Jessica
>>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From p.j.a.cock at googlemail.com Wed Feb 5 11:07:22 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 5 Feb 2014 16:07:22 +0000
Subject: [Biopython] Help modify this code so it can do what I want it to do
In-Reply-To: References: Message-ID:

Hi Edson,

I can see where the problem stems from now - it did puzzle me for a while.
For this part to make sense, query and sbjct need to be the FULL sequence
of the query and the subject (as given to BLAST as input):

complete_query_seq += str(query[q_start-1:q_end])
complete_sbjct_seq += str(sbjct[sb_start-1:sb_end])

(I had assumed these variables were set up at the beginning of the file,
which is partly why I asked for the full script.)

However, via the for loop, you are using hsp.query and hsp.sbjct as query
and sbjct. These are the PARTIAL sequences, aligned with gap characters.
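If you do go the full-sequence route, you would also want to merge
overlapping HSP regions so that no part of the sequence is included twice
(you mentioned wanting no overlaps). A minimal sketch, with a made-up
sequence and 1-based inclusive start/end pairs of the kind BLAST reports:

```python
def concat_hsp_regions(full_seq, regions):
    """Concatenate 1-based inclusive (start, end) regions of full_seq,
    merging overlapping or adjacent regions so no letter appears twice."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1] + 1:
            # overlaps or touches the previous region: extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # convert 1-based inclusive coordinates to Python slices
    return "".join(full_seq[s - 1:e] for s, e in merged)

# toy example: two overlapping HSP regions plus one separate region
query_full = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
hsp_coords = [(1, 10), (8, 15), (20, 25)]
print(concat_hsp_regions(query_full, hsp_coords))
```

In practice full_seq would come from the original FASTA record given to
BLAST, and the coordinate pairs from hsp.query_start / hsp.query_end.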
This might do what you seemed to want: complete_query_seq += query.replace("-", "") complete_sbjct_seq += sbjct.replace("-", "") However, this will concatenate the fragments with an HSP - any bit of the query or subject which did not align will not be included. Any bit which appears in more than one HSP will be there twice. And also if you're using masking you'll have XXXXX X regions in the sequence where the filter said it was low complexity. I would instead get the original unmodified query/subject sequences from the original FASTA files given to BLAST. Peter On Tue, Feb 4, 2014 at 9:12 AM, Edson Ishengoma wrote: > Hi Peter, > > My apology, I have updated the code at > https://gist.github.com/EBIshengoma/efc4ad3e32427891931d to appear exactly > how I run it from my computer. > > Thanks. > From ishengomae at nm-aist.ac.tz Wed Feb 5 12:52:17 2014 From: ishengomae at nm-aist.ac.tz (Edson Ishengoma) Date: Wed, 5 Feb 2014 20:52:17 +0300 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: Hi Peter, Woow, that made my day. Thank you very much and keep up the good work. Regards, Edson On Wed, Feb 5, 2014 at 7:07 PM, Peter Cock wrote: > Hi Edson, > > I can see where the problem stems from now - it did puzzle me for a while. > For this part to make sense, query and sbjct need to be the FULL sequence > of the query and the subject (as given to BLAST as input): > > complete_query_seq += str(query[q_start-1:q_end]) > complete_sbjct_seq += str(sbjct[sb_start-1:sb_end]) > > (I had assumed these variables were setup at the beginning of the file, > which I partly why I asked for the full script.) > > However, via the for loop, you are using hsp.query, hsp.sbjct as query > and sbjct, This are the PARTIAL sequences aligned with gap characters. 
> This might do what you seemed to want: > > complete_query_seq += query.replace("-", "") > complete_sbjct_seq += sbjct.replace("-", "") > > However, this will concatenate the fragments with an HSP - any bit of > the query or subject which did not align will not be included. Any bit > which appears in more than one HSP will be there twice. And also > if you're using masking you'll have XXXXX X regions in the sequence > where the filter said it was low complexity. > > I would instead get the original unmodified query/subject sequences > from the original FASTA files given to BLAST. > > Peter > > > On Tue, Feb 4, 2014 at 9:12 AM, Edson Ishengoma > wrote: > > Hi Peter, > > > > My apology, I have updated the code at > > https://gist.github.com/EBIshengoma/efc4ad3e32427891931d to appear > exactly > > how I run it from my computer. > > > > Thanks. > > > From anubhavmaity7 at gmail.com Sun Feb 9 10:05:23 2014 From: anubhavmaity7 at gmail.com (Anubhav Maity) Date: Sun, 9 Feb 2014 20:35:23 +0530 Subject: [Biopython] Fwd: [GSoC] Want to contribute to open-bio for GSOC 2014 In-Reply-To: References: Message-ID: Hi, Thanks You, Peter, for your reply. I have setup my github account and have forked the source code. I have build and install biopython after reading the README file in the github repository. I want to contribute code to bioython. I want some suggestions from where to start? Waiting for your reply. Thanks and Regards, Anubhav ---------- Forwarded message ---------- From: Peter Cock Date: Sat, Feb 8, 2014 at 6:28 PM Subject: Re: [GSoC] Want to contribute to open-bio for GSOC 2014 To: Anubhav Maity Cc: OBF GSoC On Fri, Feb 7, 2014 at 10:33 PM, Anubhav Maity wrote: > Hi, > > I am a BTech student from an Indian university and want to contribute code > for open-bio for GSOC 2014. > I love to code and can code in python. I have studied biology in high > school and have taken biotechnology during my college study. 
> I have looked on the projects of biopython i.e Codon alignment and > analysis, Bio.Phylo: filling in the gaps and Indexing & Lazy-loading > Sequence Parsers. All the projects are very interesting. I want to > contribute in one of these projects, please help me in getting started. > Waiting for your positive reply. > > Thanks and Regards, > Anubhav Hi Anubhav, Please sign up to the biopython and biopython-dev mailing lists and introduce yourself there too. You will also need a GitHub account to contribute to Biopython development - so you might want to set that up now as well: http://lists.open-bio.org/mailman/listinfo/biopython http://lists.open-bio.org/mailman/listinfo/biopython-dev https://github.com/biopython/biopython Regards, Peter From davidsshin at lbl.gov Mon Feb 10 09:23:58 2014 From: davidsshin at lbl.gov (David Shin) Date: Mon, 10 Feb 2014 06:23:58 -0800 Subject: [Biopython] Summer of Code 2014 - Call for project ideas Re: going from protein to gene to oligos for cloning Message-ID: Hi all - Just another suggestion for the summer of code project.... Going from protein sequences to gene coding regions. With the reduction of costs associated with DNA synthesis and the advent of "buying genes", along with more robust robotics, we are now at a time where many are making large lists of proteins to express for biochemistry, biophysics and structural biology. However, parsing the data available to make choices to refine those lists and then obtaining just the coding regions for the proteins of interest is a little daunting. As discussed previously, finding a protein at NCBI doesn't lend readily to getting the gene (coding region) for cloning in a readily automated fashion. I still haven't tested the code suggested by Peter below, but this could be cleanup project if it is broken, and or a similar project could be started from scratch. 
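For reference, the ELink step in the approach Peter suggested (quoted
below) reduces to a small amount of code. This is an untested sketch on my
side: the helper and the canned record are my guesses at the shapes
Entrez.read() returns for an elink query, and the GI number is just the
example from the earlier thread.

```python
def linked_nucleotide_ids(elink_record):
    """Collect linked nucleotide IDs from a parsed ELink result.
    The nested LinkSetDb/Link structure below mirrors what
    Bio.Entrez.read() returns, to the best of my knowledge."""
    ids = []
    for linkset in elink_record:
        for linksetdb in linkset.get("LinkSetDb", []):
            ids.extend(link["Id"] for link in linksetdb.get("Link", []))
    return ids

# Live usage would be (needs Biopython, network access, and your own email):
#   from Bio import Entrez
#   Entrez.email = "you@example.org"
#   handle = Entrez.elink(dbfrom="protein", db="nuccore", id="145323746")
#   ids = linked_nucleotide_ids(Entrez.read(handle))
#   handle.close()

# A canned record standing in for the parsed reply (the linked ID is made up):
sample = [{"LinkSetDb": [{"DbTo": "nuccore", "LinkName": "protein_nuccore",
                          "Link": [{"Id": "123456789"}]}]}]
print(linked_nucleotide_ids(sample))
```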
If it seems like something you are interested, I will test the code earlier, if that's a starting point someone would like to pursue... though, may need to speak to the author first, not sure. Thanks, Dave > Hi Dave, > > The catch here is the protein IDs are not directly usable in the > nucleotide database - which is where ELink (Entrez Link) comes > in, available as the Entrez.elink(...) function in Biopython. > > I've not tried it myself, but a colleague posted a long example > on his blog which sounds close to what you are aiming for: > > > http://armchairbiology.blogspot.co.uk/2013/02/surely-this-has-been-done-already.html > > https://github.com/widdowquinn/scripts/blob/master/bioinformatics/get_NCBI_cds_from_protein.py > > Peter > On Fri, Dec 6, 2013 at 2:24 AM, Peter Cock wrote: > On Fri, Dec 6, 2013 at 7:27 AM, David Shin wrote: > > Hi again, > > > > I'm trying to use biopython to help me grab a lot of protein sequences > that > > will eventually be used as the basis for cloning. I'm almost done > screening > > my protein sequences, and pretty much ok on that part... > > > > I was just curious if anyone has already developed, or has any decent > > advice on going from protein codes to getting the actual coding sequences > > of the genes. > > > > At this point, my plan is to take protein codes (ie. numbers in > > gi|145323746|) and use these to search entrez nucleotide databases > directly > > to get hits (I have tested it once seems to work to get genbank > records... > > then try to use the information inside to get the nucleotide sequences... > > or I guess the other way is to use the top hit from tblastn somehow? > > > > Thanks, > > > > Dave > From vishnuc11j93 at gmail.com Tue Feb 11 03:32:25 2014 From: vishnuc11j93 at gmail.com (Vishnu Chilakamarri) Date: Tue, 11 Feb 2014 14:02:25 +0530 Subject: [Biopython] Adding SVM in biopython Message-ID: Hello, I am currently working in a project to predict the GTP binding sites given an amino acid sequence. 
The classification algorithm I'm using is SVM. As of now I'm using SVM-light and python's scikit library for classification and evaluating the model. For adding this in biopython we can use libSVM as it has a python interface which can be used for this purpose.I would like to discuss the feasibility of adding this in biopython's library and also evaluation metrics such as F1 score and MCC. Thank you, Vishnu Chilakamarri From p.j.a.cock at googlemail.com Tue Feb 11 06:39:46 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 11 Feb 2014 11:39:46 +0000 Subject: [Biopython] Adding SVM in biopython In-Reply-To: References: Message-ID: On Tue, Feb 11, 2014 at 8:32 AM, Vishnu Chilakamarri wrote: > Hello, > > I am currently working in a project to predict the GTP binding sites given > an amino acid sequence. The classification algorithm I'm using is SVM. As > of now I'm using SVM-light and python's scikit library for classification > and evaluating the model. Hello Vishnu, General machine learning contributions would probably fit better under the scikit libraries than in Biopython - their use goes way beyond just biology after all ;) > For adding this in biopython we can use libSVM as > it has a python interface which can be used for this purpose.I would like > to discuss the feasibility of adding this in biopython's library ... Given libSVM has a Python interface, what would you be adding? https://github.com/cjlin1/libsvm/tree/master/python > and also evaluation metrics such as F1 score and MCC. > Isn't this already in scikit-learn? http://scikit-learn.org/stable/modules/model_evaluation.html Maybe I've not understood what you are suggesting? 
Regards, Peter From vishnuc11j93 at gmail.com Tue Feb 11 09:55:01 2014 From: vishnuc11j93 at gmail.com (Vishnu Chilakamarri) Date: Tue, 11 Feb 2014 20:25:01 +0530 Subject: [Biopython] Adding SVM in biopython In-Reply-To: References: Message-ID: Hello Peter, You're right , addition of another machine learning algorithm in biopython does not seem necessary.Sorry about that. I was actually looking for contributing to biopython for Google Summer of Code. I was reading about the lazy parsers idea which seems very interesting. Like you mentioned in the Biopython Wiki, I started reading about tabix and BAM indexing. Formats such as FASTA can be converted to BAM and then indexed using tabix. I read from here about how Tabix works : http://bioinformatics.oxfordjournals.org/content/27/5/718.full . Apart from this is there any source from where I can learn more about this? Thanks in advance. On Tue, Feb 11, 2014 at 8:12 PM, Peter Cock wrote: > On Tue, Feb 11, 2014 at 2:23 PM, Vishnu Chilakamarri > wrote: > > Hello Peter, > > > > You're right , addition of another machine learning algorithm in > biopython > > does not seem necessary. > > Do you want to reply on the list? > > > Sorry about that. I was actually looking for > > contributing to biopython for Google Summer of Code. I was reading about > the > > lazy parsers idea which seems very interesting. Like you mentioned in the > > Biopython Wiki, I started reading about tabix and BAM indexing. Formats > such > > as FASTA can be converted to BAM and then indexed using tabix. > > Not quite, you compress the FASTA file using bgzip (which uses > BGZF, a type of GZIP compression). See: > > http://blastedbio.blogspot.co.uk/2011/11/bgzf-blocked-bigger-better-gzip.html > > > I read from here about how Tabix works : > > http://bioinformatics.oxfordjournals.org/content/27/5/718.full . Apart > from > > this is there any source from where I can learn more about this? Thanks > in > > advance. 
> > For BGZF (used in BAM and tabix), my blog post and the Biopython code: > https://github.com/biopython/biopython/blob/master/Bio/bgzf.py > > Peter > -- Vishnu Chilakamarri +919049437582 Public Relations Team BITSAA B.E. Computer Science + Msc Biological Sciences From jttkim at googlemail.com Tue Feb 11 14:17:47 2014 From: jttkim at googlemail.com (Jan Kim) Date: Tue, 11 Feb 2014 19:17:47 +0000 Subject: [Biopython] Alignment Scores? Message-ID: <20140211191746.GF17385@localhost> Dear All, the EMBOSS "srspair" alignment format includes identity, similarity and gap statistics as well as the alignment score, see [1]. Is this info available from alignment objects as returned by Bio.AlgnIO.parse(...).next() ? I haven't found anything in the documentation and a peek into a sample object didn't reveal anything either: >>> p = Bio.AlignIO.parse('sa-needle.txt', 'emboss') >>> a = p.next() >>> a.__dict__.keys() ['_records', '_alphabet'] Obviously availability of properties such as (percent) identity etc. will vary with aligment format and type (e.g. some apply only to pairwise alignment), so I was looking for something perhaps like a dictionary of optional additional data, somewhat like the letter_annotations in the SeqRecord class. I'll probably start rolling my own simplistic solution based on a few regular expressions for now -- if this is a crude re-invention of a wheel that's been polished before please let me know, though. Best regards, Jan [1] http://emboss.sourceforge.net/docs/themes/alnformats/align.srspair -- +- Jan T. Kim -------------------------------------------------------+ | email: jttkim at gmail.com | | WWW: http://www.jtkim.dreamhosters.com/ | *-----=< hierarchical systems are for files, not for humans >=-----* From p.j.a.cock at googlemail.com Tue Feb 11 13:25:44 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 11 Feb 2014 18:25:44 +0000 Subject: [Biopython] Alignment Scores? 
In-Reply-To: <20140211191746.GF17385@localhost> References: <20140211191746.GF17385@localhost> Message-ID: On Tue, Feb 11, 2014 at 7:17 PM, Jan Kim wrote: > Dear All, > > the EMBOSS "srspair" alignment format includes identity, similarity and > gap statistics as well as the alignment score, see [1]. Is this info > available from alignment objects as returned by Bio.AlgnIO.parse(...).next() ? Not currently, no. > Obviously availability of properties such as (percent) identity etc. > will vary with aligment format and type (e.g. some apply only to pairwise > alignment), so I was looking for something perhaps like a dictionary > of optional additional data, somewhat like the letter_annotations in the > SeqRecord class. There's an open issue to do for something like that for the alignment object... some of the AlignIO parsers hide this kind of thing under a private attribute as a short term hack. However, read on. > I'll probably start rolling my own simplistic solution based on a few > regular expressions for now -- if this is a crude re-invention of a wheel > that's been polished before please let me know, though. You could tweak the AlignIO parser, but this would fit better as part of EMBOSS pair format support in (the quite new) SearchIO module, where this kind of attribute is expected: http://biopython.org/wiki/SearchIO Regards, Peter From mmokrejs at fold.natur.cuni.cz Thu Feb 13 15:38:34 2014 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Thu, 13 Feb 2014 21:38:34 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO Message-ID: <52FD2D4A.9010300@fold.natur.cuni.cz> Hi, I am in the process of conversion to the new XML parsing code written by Bow. 
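One gotcha I want to flag up front: in NCBIXML, hsp.score is the raw score
while hsp.bits is the bit score, so as far as I can tell they should map to
hsp.bitscore_raw and hsp.bitscore respectively (worth double-checking
against both APIs). Kept as a small lookup table for the mechanical part of
the port:

```python
# NCBIXML -> SearchIO (blast-xml) attribute renames; my own reading of the
# two APIs, so treat every entry as something to verify, not gospel.
NCBIXML_TO_SEARCHIO = {
    "hsp.identities": "hsp.ident_num",
    "hsp.positives": "hsp.pos_num",
    "hsp.gaps": "hsp.gap_num",
    "hsp.expect": "hsp.evalue",
    "hsp.bits": "hsp.bitscore",          # bit score
    "hsp.score": "hsp.bitscore_raw",     # raw score
    "hsp.align_length": "hsp.aln_span",
    "hsp.sbjct_start": "hsp.hit_start",  # NB: SearchIO starts are 0-based
    "hsp.sbjct_end": "hsp.hit_end",
    "record.alignments": "record.hits",
    "alignment.length": "hit.seq_len",
}

def searchio_name(old):
    """Translate an old NCBIXML attribute path, or return it unchanged
    (e.g. hsp.query_start keeps its name, though not its numbering)."""
    return NCBIXML_TO_SEARCHIO.get(old, old)

print(searchio_name("hsp.expect"))
```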
So far, I have deciphered the following replacement strings (somewhat
written in sed(1) format):

/hsp.identities/hsp.ident_num/
/hsp.expect/hsp.evalue/
/hsp.bits/hsp.bitscore/
/hsp.score/hsp.bitscore_raw/
/hsp.gaps/hsp.gap_num/
/hsp.positives/hsp.pos_num/
/hsp.sbjct_start/hsp.hit_start/
/hsp.sbjct_end/hsp.hit_end/
# hsp.query_start # no change from NCBIXML
# hsp.query_end # no change from NCBIXML
/record.query.split()[0]/record.id/
/alignment.hit_def.split(' ')[0]/alignment.hit_id/
/record.alignments/record.hits/
/hsp.align_length/hsp.aln_span/ # I hope these do the same as with NCBIXML
(don't remember whether the counts include minus signs of the alignment or
not)

Now I am uncertain. There used to be hsp.sbjct_length and alignment.length.
I think the former length was including the minus sign for gaps while the
latter is just the real length of the query sequence.

Nevertheless, what did alignment.length transform into? Into
len(hsp.query_all)? I don't think hsp.query_span but who knows. ;)

Meanwhile I see my biopython-1.62 doesn't understand hsp.gap_num; it looks
like that has been added to SearchIO in 1.63. So, that's all from me now
until I upgrade. ;)

Thank you,
Martin

From w.arindrarto at gmail.com Thu Feb 13 16:22:13 2014
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Thu, 13 Feb 2014 22:22:13 +0100
Subject: [Biopython] Converting from NCBIXML to SearchIO
In-Reply-To: <52FD2D4A.9010300@fold.natur.cuni.cz>
References: <52FD2D4A.9010300@fold.natur.cuni.cz>
Message-ID:

Hi Martin,

Here's the 'convention' I use on the length-related attributes in
SearchIO's blast parsers:

* The 'aln_span' attribute denotes the length of the alignment itself,
which means this includes the gap signs ('-'). In Blast, this is
always parsed from the file. You're right that this used to be
hsp.align_length.

* The 'seq_len' attributes denote the length of either the query (in
qresult.seq_len) or the hit (in hit.seq_len) sequences, excluding the
gaps.
These are parsed from the BLAST XML file itself. One of these, hit.seq_len, is the one that used to be alignment.length. * 'query_span' and 'hit_span' are always computed by SearchIO (always end coordinate - start coordinate of the query / hit match of the HSP, so they do not count the gap characters). They may or may not be equal to their seq_len counterparts, depending on how much the HSP covers the query / hit sequences. (I couldn't find any reference to sbjct_length in the current codebase, perhaps it was removed some time ago?) Since this is SearchIO, it also applies to other formats as well (e.g. aln_span always counts the gap character). The 'gap_num' error sounds a bit weird, though. If I recall correctly, it should work in 1.62 (it was added very early in the beginning). What problems are you having? Cheers, Bow On Thu, Feb 13, 2014 at 9:38 PM, Martin Mokrejs wrote: > Hi, > I am in the process of conversion to the new XML parsing code written by > Bow. > So far, I have deciphered the following replacement strings (somewhat > written in sed(1) format): > > > /hsp.identities/hsp.ident_num/ > /hsp.score/hsp.bitscore/ > /hsp.expect/hsp.evalue/ > /hsp.bits/hsp.bitscore/ > /hsp.gaps/hsp.gap_num/ > /hsp.bits/hsp.bitscore_raw/ > /hsp.positives/hsp.pos_num/ > /hsp.sbjct_start/hsp.hit_start/ > /hsp.sbjct_end/hsp.hit_end/ > # hsp.query_start # no change from NCBIXML > # hsp.query_end # no change from NCBIXML > /record.query.split()[0]/record.id/ > /alignment.hit_def.split(' ')[0]/alignment.hit_id/ > /record.alignments/record.hits/ > > /hsp.align_length/hsp.aln_span/ # I hope these do the same as with NCBIXML > (don't remember whether the counts include minus signs of the alignment or > not) > > > > > Now I am uncertain. There used to be hsp.sbjct_length and alignment.length. > I think the former length was including the minus sign for gaps while the > latter is just the real length of the query sequence. > > Nevertheless, what did alignment.length transform into? 
Into
> len(hsp.query_all)? I don't think hsp.query_span but who knows. ;)
>
> Meanwhile I see my biopython-1.62 doesn't understand hsp.gap_num, looks that
> has been added to SearchIO in 1.63. so, that's all from me now until I
> upgrade. ;)
>
> Thank you,
> Martin
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From mmokrejs at fold.natur.cuni.cz Thu Feb 13 16:46:51 2014
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Thu, 13 Feb 2014 22:46:51 +0100
Subject: [Biopython] Converting from NCBIXML to SearchIO
In-Reply-To: References: <52FD2D4A.9010300@fold.natur.cuni.cz>
Message-ID: <52FD3D4B.8040602@fold.natur.cuni.cz>

Hi Bow,
thank you for the thorough guidance. Comments interleaved.

Wibowo Arindrarto wrote:
> Hi Martin,
>
> Here's the 'convention' I use on the length-related attributes in
> SearchIO's blast parsers:
>
> * 'aln_span' attribute denote the length of the alignment itself,
> which means this includes the gaps sign ('-'). In Blast, this is
> always parsed from the file. You're right that this used to be
> hsp.align_length.
>
> * 'seq_len' attributes denote the length of either the query (in
> qresult.seq_len) or the hit (in hit.seq_len) sequences, excluding the
> gaps. These are parsed from the BLAST XML file itself. One of these,
> hit.seq_len, is the one that used to be alignment.length.

How about record.seq_len in SearchIO, isn't that the same as well? At least
I am hoping that it holds the length (163 below) of the original query
sequence, stored in <Iteration_query-len>163</Iteration_query-len> in the
XML input file. Having access to its value from under the hsp object would
be the best for me.

> * 'query_span' and 'hit_span' are always computed by SearchIO (always
> end coordinate - start coordinate of the query / hit match of the HSP,
> so they do not count the gap characters).
They may or may not be equal
> to their seq_len counterparts, depending on how much the HSP covers
> the query / hit sequences.

I hope you wanted to say "end - start + 1" ;-)

>
> (I couldn't find any reference to sbjct_length in the current
> codebase, perhaps it was removed some time ago?)

I have the feeling that either blast or biopython used subjct_* with the
'u' in the name.

> Since this is SearchIO, it also applies to other formats as well (e.g.
> aln_span always counts the gap character).

Fine with me, I need both values describing the length of the region
covered in the HSP, with and without the minus signs.

> The 'gap_num' error sounds a bit weird, though. If I recall correctly,
> it should work in 1.62 (it was added very early in the beginning).
> What problems are you having?

if str(_hsp.gap_num) == '(None, None)':
....
AttributeError: 'HSP' object has no attribute 'gap_num'

Here is the hsp object structure:

_hsp=['_NON_STICKY_ATTRS', '__class__', '__contains__', '__delattr__',
'__delitem__', '__dict__', '__doc__', '__format__', '__getattribute__',
'__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__module__',
'__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__',
'__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__',
'__weakref__', '_aln_span_get', '_get_coords', '_hit_end_get',
'_hit_inter_ranges_get', '_hit_inter_spans_get', '_hit_range_get',
'_hit_span_get', '_hit_start_get', '_inter_ranges_get', '_inter_spans_get',
'_items', '_query_end_get', '_query_inter_ranges_get',
'_query_inter_spans_get', '_query_range_get', '_query_span_get',
'_query_start_get', '_str_hsp_header', '_transfer_attrs',
'_validate_fragment', 'aln', 'aln_all', 'aln_annotation',
'aln_annotation_all', 'aln_span', 'alphabet', 'bitscore', 'bitscore_raw',
'evalue', 'fragment', 'fragments', 'hit', 'hit_all', 'hit_description',
'hit_end', 'hit_end_all', 'hit_features', 'hit_features_all', 'hit_frame',
'hit_frame_all', 'hit_id', 'hit_inter_ranges',
'hit_inter_spans', 'hit_range', 'hit_range_all', 'hit_span', 'hit_span_all', 'hit_start', 'hit_start_all', 'hit_strand', 'hit_strand_all', 'ident_num', 'is_fragmented', 'pos_num', 'query', 'query_all', 'query_description', 'query_end', 'query_end_all', 'query_features', 'query_features_all', 'query_frame', 'query_frame_all', 'query_id', 'query_inter_ranges', 'query_inter_spans', 'query_range', 'query_range_all', 'query_span', 'query_span_all', 'query_start', 'query_start_all', 'query_strand', 'query_strand_all'] And eventually if that matters, the super-parent/blast record object: ['_NON_STICKY_ATTRS', '_QueryResult__marker', '__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__module__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_blast_id', '_description', '_hit_key_function', '_id', '_items', '_transfer_attrs', 'absorb', 'append', 'description', 'fragments', 'hit_filter', 'hit_keys', 'hit_map', 'hits', 'hsp_filter', 'hsp_map', 'hsps', 'id', 'index', 'items', 'iterhit_keys', 'iterhits', 'iteritems', 'param_evalue_threshold', 'param_filter', 'param_gap_extend', 'param_gap_open', 'param_score_match', 'param_score_mismatch', 'pop', 'program', 'reference', 'seq_len', 'sort', 'stat_db_len', 'stat_db_num', 'stat_eff_space', 'stat_entropy', 'stat_hsp_len', 'stat_kappa', 'stat_lambda', 'target', 'version'] A new comment: The off-by-one change in SearchIO only complicates matters for me, so I immediately fix it to natural numbering, via: _query_start = hsp.query_start + 1 _hit_start = hsp.hit_start + 1 I know we talked about this in the past and this is just to say that I did not change my mind here. 
;) Same with SffIO although there are two reason for off-by-one numberings, one due to the SFF specs but the other is likewise, to keep in sync with pythonic numbering. These always caused more troubles to me than anything good. Any values I have in variables are 1-based and in the few cases I need to do python slicing, I adjust appropriately, but in remaining cases I am always printing or storing the 1-based values. So, this concept ( http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec114 ) is only for the sake of being pythonic, but bad for users. Thanks, Martin > > Cheers, > Bow > > On Thu, Feb 13, 2014 at 9:38 PM, Martin Mokrejs > wrote: >> Hi, >> I am in the process of conversion to the new XML parsing code written by >> Bow. >> So far, I have deciphered the following replacement strings (somewhat >> written in sed(1) format): >> >> >> /hsp.identities/hsp.ident_num/ >> /hsp.score/hsp.bitscore/ >> /hsp.expect/hsp.evalue/ >> /hsp.bits/hsp.bitscore/ >> /hsp.gaps/hsp.gap_num/ >> /hsp.bits/hsp.bitscore_raw/ >> /hsp.positives/hsp.pos_num/ >> /hsp.sbjct_start/hsp.hit_start/ >> /hsp.sbjct_end/hsp.hit_end/ >> # hsp.query_start # no change from NCBIXML >> # hsp.query_end # no change from NCBIXML >> /record.query.split()[0]/record.id/ >> /alignment.hit_def.split(' ')[0]/alignment.hit_id/ >> /record.alignments/record.hits/ >> >> /hsp.align_length/hsp.aln_span/ # I hope these do the same as with NCBIXML >> (don't remember whether the counts include minus signs of the alignment or >> not) >> >> >> >> >> Now I am uncertain. There used to be hsp.sbjct_length and alignment.length. >> I think the former length was including the minus sign for gaps while the >> latter is just the real length of the query sequence. >> >> Nevertheless, what did alignment.length transform into? Into >> len(hsp.query_all)? I don't think hsp.query_span but who knows. 
;) >> >> >> >> Meanwhile I see my biopython-1.62 doesn't understand hsp.gap_num, looks that >> has been added to SearchIO in 1.63. so, that's all from me now until I >> upgrade. ;) >> >> >> Thank you, >> Martin >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > From mmokrejs at fold.natur.cuni.cz Thu Feb 13 17:06:44 2014 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Thu, 13 Feb 2014 23:06:44 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: <52FD3D4B.8040602@fold.natur.cuni.cz> References: <52FD2D4A.9010300@fold.natur.cuni.cz> <52FD3D4B.8040602@fold.natur.cuni.cz> Message-ID: <52FD41F4.8080301@fold.natur.cuni.cz> Martin Mokrejs wrote: > Hi Bow, > thank you for thorough guidance. Comments interleaved. > > Wibowo Arindrarto wrote: >> Hi Martin, >> >> Here's the 'convention' I use on the length-related attributes in >> SearchIO's blast parsers: >> >> * 'aln_span' attribute denote the length of the alignment itself, >> which means this includes the gaps sign ('-'). In Blast, this is >> always parsed from the file. You're right that this used to be >> hsp.align_length. >> >> * 'seq_len' attributes denote the length of either the query (in >> qresult.seq_len) or the hit (in hit.seq_len) sequences, excluding the >> gaps. These are parsed from the BLAST XML file itself. One of these, >> hit.seq_len, is the one that used to be alignment.length. > > How about record.seq_len in SearchIO, isn't that same as well? At least > I am hoping that the length (163 below) of the original query sequence, stored in > > 163 > > of the XML input file. Having access to its value from under hsp object would be the best for me. > > >> * 'query_span' and 'hit_span' are always computed by SearchIO (always >> end coordinate - start coordinate of the query / hit match of the HSP, >> so they do not count the gap characters). 
They may or may not be equal >> to their seq_len counterparts, depending on how much the HSP covers >> the query / hit sequences. > > I hope you wanted to say "end - start + 1" ;-) > >> >> (I couldn't find any reference to sbjct_length in the current >> codebase, perhaps it was removed some time ago?) > > I have the feelings that either blast or biopython used subjct_* with the 'u' in the name. > > >> Since this is SearchIO, it also applies to other formats as well (e.g. >> aln_span always counts the gap character). > > Fine with me, I need both values describing length region covered in the HSP, with and without the minus signs. > > >> The 'gap_num' error sounds a bit weird, though. If I recall correctly, >> it should work in 1.62 (it was added very early in the beginning). >> What problems are you having? > > > if str(_hsp.gap_num) == '(None, None)': > .... > AttributeError: 'HSP' object has no attribute 'gap_num' Yeah, I know why. You told me once ( https://github.com/biopython/biopython/issues/222 ) that it is optional. Indeed, the XML file lacks in this case the section. Actually, this old silly test for (None, None) is in my code just because of that bug. I would prefer if SearchIO provided hsp.gap_num == None and likewise for the other, optional attributes to sanitize the blast XML output with some default values. I use None for such cases so that if an integer is later expected python chokes on the None value, which is good. Mostly I only check is the variable returns true or false so the None default is ok for me. alternatively, I have to check the dictionary of hsp whether it contains gap_num, which is inconvenient. 
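[Editorial note: for what it's worth, the "check whether hsp contains gap_num" pattern Martin finds inconvenient can be written concisely with getattr() and a default. A stdlib-only sketch; FakeHSP is a stand-in for a SearchIO HSP parsed from legacy BLAST XML, not a real Biopython class.]

```python
# Sketch: reading optional HSP attributes with a default instead of
# try/except or peeking into the instance dictionary.

class FakeHSP(object):
    """Stand-in for an HSP whose XML source omitted the gaps element."""
    def __init__(self, ident_num, aln_span):
        self.ident_num = ident_num
        self.aln_span = aln_span
        # deliberately no gap_num attribute, mirroring the issue above

hsp = FakeHSP(ident_num=60, aln_span=70)

gap_num = getattr(hsp, "gap_num", None)   # None when the attribute is absent
if gap_num is None:
    # fall back to a rough estimate from the other counts
    gap_num = hsp.aln_span - hsp.ident_num

print(gap_num)  # 10
```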
Martin From w.arindrarto at gmail.com Thu Feb 13 17:13:36 2014 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Thu, 13 Feb 2014 23:13:36 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: <52FD3D4B.8040602@fold.natur.cuni.cz> References: <52FD2D4A.9010300@fold.natur.cuni.cz> <52FD3D4B.8040602@fold.natur.cuni.cz> Message-ID: Hi Martin, >> Here's the 'convention' I use on the length-related attributes in >> SearchIO's blast parsers: >> >> * 'aln_span' attribute denote the length of the alignment itself, >> which means this includes the gaps sign ('-'). In Blast, this is >> always parsed from the file. You're right that this used to be >> hsp.align_length. >> >> * 'seq_len' attributes denote the length of either the query (in >> qresult.seq_len) or the hit (in hit.seq_len) sequences, excluding the >> gaps. These are parsed from the BLAST XML file itself. One of these, >> hit.seq_len, is the one that used to be alignment.length. > > > How about record.seq_len in SearchIO, isn't that same as well? At least > I am hoping that the length (163 below) of the original query sequence, > stored in > > 163 > > of the XML input file. Having access to its value from under hsp object > would be the best for me. if by 'record' you're referring to the top-most container (the QueryResult), then record.seq_len denotes the length of the full query sequence. This may or may not be the same as hit.seq_len. I did not choose to store it under the HSP object, for the following reasons because the HSP object is never meant to be used alone, always with Hit and QueryResult. So whenever one has access to an HSP, he/she must also have access to the containing Hit and QueryResult. Since the seq_len are attributes common to all HSPs (originating from the hit/query sequences), storing them in Hit and QueryResult objects seems most appropriate. 
>> * 'query_span' and 'hit_span' are always computed by SearchIO (always >> end coordinate - start coordinate of the query / hit match of the HSP, >> so they do not count the gap characters). They may or may not be equal >> to their seq_len counterparts, depending on how much the HSP covers >> the query / hit sequences. > > > I hope you wanted to say "end - start + 1" ;-) This is related to your comment below, I think. For better or worse, we needed to adhere to one consistent indexing and numbering system. Python's system was chosen based on the fact that anyone using Biopython should be (or is already) familiar with them and that SearchIO aims to unify all the different coordinate system that different programs use. Of course you'll notice that the consequence of this system is that one can calculate the length (or span, really) of the hit / query sequences by computing `end -start` instead of `end - start + 1` :). >> (I couldn't find any reference to sbjct_length in the current >> codebase, perhaps it was removed some time ago?) > > > I have the feelings that either blast or biopython used subjct_* with the > 'u' in the name. Couldn't find that either :/.. >> The 'gap_num' error sounds a bit weird, though. If I recall correctly, >> it should work in 1.62 (it was added very early in the beginning). >> What problems are you having? > > (pasting the comment from your other email) >> if str(_hsp.gap_num) == '(None, None)': >> .... >> AttributeError: 'HSP' object has no attribute 'gap_num' > > > Yeah, I know why. You told me once ( > https://github.com/biopython/biopython/issues/222 ) that it is optional. > Indeed, the XML file lacks in this case the section. Actually, > this old silly test for (None, None) is in my code just because of that bug. > I would prefer if SearchIO provided > > hsp.gap_num == None > > and likewise for the other, optional attributes to sanitize the blast XML > output with some default values. 
I use None for such cases so that if an > integer is later expected python chokes on the None value, which is good. > Mostly I only check is the variable returns true or false so the None > default is ok for me. > > alternatively, I have to check the dictionary of hsp whether it contains > gap_num, which is inconvenient. Guess you solved it. But yeah, I was a bit ambivalent on the issue on whether to note missing attributes as None or simply nothing (as in, not having the attribute at all). To me (others, feel free to weigh in here), having it store nothing at all seems more preferred. If the former is chosen, the only way to be consistent is to store all other attributes from other search programs (e.g. HMMER's parameter in a BLAST HSP) as None (otherwise we use None for one missing attribute and not for the other?). This seems a bit cumbersome, so I chose to store nothing at all. > A new comment: > > The off-by-one change in SearchIO only complicates matters for me, so I > immediately fix it to natural numbering, via: > > _query_start = hsp.query_start + 1 > _hit_start = hsp.hit_start + 1 > > I know we talked about this in the past and this is just to say that I did > not change my mind here. ;) Same with SffIO although there are two reason > for off-by-one numberings, one due to the SFF specs but the other is > likewise, to keep in sync with pythonic numbering. These always caused more > troubles to me than anything good. Any values I have in variables are > 1-based and in the few cases I need to do python slicing, I adjust > appropriately, but in remaining cases I am always printing or storing the > 1-based values. So, this concept ( > http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec114 ) is only for > the sake of being pythonic, but bad for users. This was addressed above :). 
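[Editorial note: for readers following along, the two numbering conventions compare like this. Plain Python, nothing Biopython-specific.]

```python
# Sketch of the two conventions: SearchIO's 0-based half-open coordinates
# versus the 1-based inclusive numbers BLAST prints in its reports.

searchio_start, searchio_end = 0, 4        # as SearchIO would store them
span = searchio_end - searchio_start       # 4, no "+ 1" correction needed

one_based_start = searchio_start + 1       # 1, for human-readable output
one_based_end = searchio_end               # 4, the end is already inclusive
span_inclusive = one_based_end - one_based_start + 1   # also 4

seq = "ACGTACGT"
fragment = seq[searchio_start:searchio_end]   # "ACGT", slices need no adjusting
```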
Cheers, Bow From mmokrejs at fold.natur.cuni.cz Thu Feb 13 17:37:38 2014 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Thu, 13 Feb 2014 23:37:38 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: References: <52FD2D4A.9010300@fold.natur.cuni.cz> <52FD3D4B.8040602@fold.natur.cuni.cz> Message-ID: <52FD4932.1060407@fold.natur.cuni.cz> Hi Bow, Wibowo Arindrarto wrote: > Hi Martin, > >>> Here's the 'convention' I use on the length-related attributes in >>> SearchIO's blast parsers: >>> >>> * 'aln_span' attribute denote the length of the alignment itself, >>> which means this includes the gaps sign ('-'). In Blast, this is >>> always parsed from the file. You're right that this used to be >>> hsp.align_length. >>> >>> * 'seq_len' attributes denote the length of either the query (in >>> qresult.seq_len) or the hit (in hit.seq_len) sequences, excluding the >>> gaps. These are parsed from the BLAST XML file itself. One of these, >>> hit.seq_len, is the one that used to be alignment.length. >> >> >> How about record.seq_len in SearchIO, isn't that same as well? At least >> I am hoping that the length (163 below) of the original query sequence, >> stored in >> >> 163 >> >> of the XML input file. Having access to its value from under hsp object >> would be the best for me. > > if by 'record' you're referring to the top-most container (the > QueryResult), then record.seq_len denotes the length of the full query > sequence. This may or may not be the same as hit.seq_len. > > I did not choose to store it under the HSP object, for the following > reasons because the HSP object is never meant to be used alone, always > with Hit and QueryResult. So whenever one has access to an HSP, he/she > must also have access to the containing Hit and QueryResult. Since the > seq_len are attributes common to all HSPs (originating from the > hit/query sequences), storing them in Hit and QueryResult objects > seems most appropriate. 
So far I had in one of my functions only hsp object and from it I accessed hsp.align_length. Due to the transition to SearchIO I have to modify the function so that it has access to record.seq_len (or QueryResult as you say). yes, I did it now but please consider some functionality is missing. I don't mind my own API change but others might be concerned. I believe I want record.seq_len and not pray on hit.seq_len. I am not sure if we are talking about the same but my testsuite will complain once the code compiles. > >>> * 'query_span' and 'hit_span' are always computed by SearchIO (always >>> end coordinate - start coordinate of the query / hit match of the HSP, >>> so they do not count the gap characters). They may or may not be equal >>> to their seq_len counterparts, depending on how much the HSP covers >>> the query / hit sequences. >> >> >> I hope you wanted to say "end - start + 1" ;-) > > This is related to your comment below, I think. For better or worse, Damn, right, in this case 4-1+1 = 4-0 ;) > we needed to adhere to one consistent indexing and numbering system. > Python's system was chosen based on the fact that anyone using > Biopython should be (or is already) familiar with them and that > SearchIO aims to unify all the different coordinate system that > different programs use. Of course you'll notice that the consequence > of this system is that one can calculate the length (or span, really) > of the hit / query sequences by computing `end -start` instead of `end > - start + 1` :). Well, took me a while. ;) > >>> (I couldn't find any reference to sbjct_length in the current >>> codebase, perhaps it was removed some time ago?) >> >> >> I have the feelings that either blast or biopython used subjct_* with the >> 'u' in the name. > > Couldn't find that either :/.. > >>> The 'gap_num' error sounds a bit weird, though. If I recall correctly, >>> it should work in 1.62 (it was added very early in the beginning). >>> What problems are you having? 
>> >> > > (pasting the comment from your other email) > >>> if str(_hsp.gap_num) == '(None, None)': >>> .... >>> AttributeError: 'HSP' object has no attribute 'gap_num' >> >> >> Yeah, I know why. You told me once ( >> https://github.com/biopython/biopython/issues/222 ) that it is optional. >> Indeed, the XML file lacks in this case the section. Actually, >> this old silly test for (None, None) is in my code just because of that bug. >> I would prefer if SearchIO provided >> >> hsp.gap_num == None >> >> and likewise for the other, optional attributes to sanitize the blast XML >> output with some default values. I use None for such cases so that if an >> integer is later expected python chokes on the None value, which is good. >> Mostly I only check is the variable returns true or false so the None >> default is ok for me. >> >> alternatively, I have to check the dictionary of hsp whether it contains >> gap_num, which is inconvenient. > > Guess you solved it. But yeah, I was a bit ambivalent on the issue on > whether to note missing attributes as None or simply nothing (as in, > not having the attribute at all). To me (others, feel free to weigh in > here), having it store nothing at all seems more preferred. If the > former is chosen, the only way to be consistent is to store all other > attributes from other search programs (e.g. HMMER's parameter in a > BLAST HSP) as None (otherwise we use None for one missing attribute > and not for the other?). This seems a bit cumbersome, so I chose to > store nothing at all. I will see in how many places I have to wrap access to any of these three (or maybe more) optional values and wrap them by an extra if conditional. I think I will just carelessly force my own defaults, that will keep the code shorter and easier to read. I understand your concern about defining defaults for all possible values but I have opposite opinions. Let's see what other say. 
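[Editorial note: forcing one's own defaults, as Martin describes, could look something like this. A hypothetical stdlib-only helper; with_defaults and FakeHSP are illustrative names, not Biopython API.]

```python
# Sketch: fill in missing optional attributes on an HSP-like object
# with explicit defaults, in the spirit of a "strict mode".

OPTIONAL_DEFAULTS = {"gap_num": None, "ident_num": None, "pos_num": None}

def with_defaults(hsp, defaults=OPTIONAL_DEFAULTS):
    """Set any missing optional attribute to its default value."""
    for name, value in defaults.items():
        if not hasattr(hsp, name):
            setattr(hsp, name, value)
    return hsp

class FakeHSP(object):
    """Stand-in for a parsed HSP that is missing gap_num and pos_num."""
    def __init__(self):
        self.ident_num = 60

hsp = with_defaults(FakeHSP())
print(hsp.ident_num)   # 60, untouched
print(hsp.gap_num)     # None, filled in
```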
The "good" thing is that now hsp.gap_num does not exist while before hsp.gaps was (None, None), hence the tests for True succeeded. Now the code breaks, cool. :)) Martin From mmokrejs at fold.natur.cuni.cz Fri Feb 14 17:57:25 2014 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Fri, 14 Feb 2014 23:57:25 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: <52FD4932.1060407@fold.natur.cuni.cz> References: <52FD2D4A.9010300@fold.natur.cuni.cz> <52FD3D4B.8040602@fold.natur.cuni.cz> <52FD4932.1060407@fold.natur.cuni.cz> Message-ID: <52FE9F55.4040508@fold.natur.cuni.cz> Hi Bow, regarding the missing .gap_num attribute and likewise the other optional ones ... I believe it is reasonable for BLAST XML output to omit them to save some space if there are just no gaps in the alignment or identity is 100%, etc. However, objects instantiated while parsing should have them. I don't like having some instances of the same class carrying more attributes while others have fewer. I wouldn't mind a global hook in SearchIO enforcing such a strict mode, affecting the default parameters inherited by the blast-result-related classes while parsing XML. Another issue: I used to poke over two iterators in a while loop, checking that each of the iterators returned a result object (evaluating as True). The reason for this ugliness was/is two-fold: 1. "for blah in zip(iter1, iter2):" would only iterate over the common length of items, but I wanted to make sure iter1 and iter2 did NOT have, accidentally, different lengths. One of the iterators came from the XML output stream, where counting the entries would have required an expensive extra sweep. The items of iter2 could be counted rather cheaply. However, outside biopython I could grep through the XML stream. 2.
The second reason for the ugly checks for _record evaluating as True was that blastall interleaves the XML stream with dummy entries (which evaluate as False objects from NCBIXML.parse()) and also, from time to time, blastall places the very first result into the stream again. So, I used to check that _record.id is not the same as the _record.id I got when I first started parsing the XML stream (I cache the very first result id, how ugly, right?). Both issues I already mentioned in biopython's bugzilla and on this email list and, notably, notified NCBI about. Unfortunately, they answered that they won't fix any of these (look into the archives of this biopython list from about a year ago or so). Back to the NCBIXML.parse() to SearchIO.parse() transition. It seemed I could have replaced "if _record:" with "if _record.id:", but that is unnecessarily expensive because python must get much deeper into the object. Unfortunately, this won't help me to deal with "empty" objects created by SearchIO when no match was found. I am talking about this XML section resulting in an object evaluating as False although _record.id gives 'FL40XAE01A1L3P': 2 lcl|2_0 FL40XAE01A1L3P length=127 xy=0311_1171 region=1 run=R_2008_12_17_15_21_00_ 374 99 47536 0 0 0.41 0.625 0.78 No hits found Here is the same through SearchIO: >>> _record = _blastn_iterator.next() >>> print _record Program: blastn (2.2.26) Query: FL40XAE01A1L3P (374) length=127 xy=0311_1171 region=1 run=R_2008_12_17_15_21_00_ Target: queries.fasta queries2.fasta Hits: 0 >>> >>> if _record: ... print "true" ... else: ... print "false" ... false >>> I understand that the object evaluates as False because it has no sequence and therefore appears to be "empty", but it is a real result. I understand you want to follow some universal biopython logic about empty/non-empty objects, but I don't think it is a good idea in this case. Or do you want me to check for _record.hits evaluating as True?
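[Editorial note: as an aside, the length guarantee that plain zip() cannot give (point 1 above) is obtainable with a sentinel and itertools.zip_longest. A generic sketch in Python 3 spelling; the Python 2 of the day called it itertools.izip_longest.]

```python
# Sketch: iterate two record streams in lockstep, raising if one stream
# runs out before the other instead of silently truncating like zip().
from itertools import zip_longest

_MISSING = object()   # sentinel that cannot collide with a real record

def paired(iter1, iter2):
    """Yield pairs from both iterables; raise on a length mismatch."""
    for a, b in zip_longest(iter1, iter2, fillvalue=_MISSING):
        if a is _MISSING or b is _MISSING:
            raise ValueError("streams yielded different numbers of records")
        yield a, b

pairs = list(paired(["q1", "q2"], ["r1", "r2"]))   # [('q1', 'r1'), ('q2', 'r2')]
# list(paired(["q1", "q2"], ["r1"])) would raise ValueError
```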
In my original pseudocode I had if _record: # either a match was found # or no match was found but the object is valid and evaluates as True else: # reached EOF # or # reached broken XML item interleaved in the stream (just ignore the crap) would read now: if _record.id: if _record.hits: # a match was found else: # no match was found else: # reached EOF # reached broken XML item interleaved in the stream (just ignore the crap) Looks I can accomplish what I used to have but I would like to know your opinion and a coding style advice before I get on my way. ;-) Thank you, Martin From p.j.a.cock at googlemail.com Sat Feb 15 07:25:45 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 15 Feb 2014 12:25:45 +0000 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: <52FE9F55.4040508@fold.natur.cuni.cz> References: <52FD2D4A.9010300@fold.natur.cuni.cz> <52FD3D4B.8040602@fold.natur.cuni.cz> <52FD4932.1060407@fold.natur.cuni.cz> <52FE9F55.4040508@fold.natur.cuni.cz> Message-ID: On Fri, Feb 14, 2014 at 10:57 PM, Martin Mokrejs wrote: > > Another issue I see now that I used to poke over two iterators in a while > loop. I was checking that each of the iterators returned a result object > (evaluating as True). With some of the BLAST output formats (e.g. tabular), if a query had no records it will not appear in the output at all - and so if you iterate over it, there will be less results than if you iterated over the query FASTA file. Similarly, if you had several BLAST files for the same query (e.g. against different databases) they might be missing results for different queries. In this kind of situation, a single loop using zip(...) isn't going to work. However, it would be a nice match to SearchIO.index(...) I think. e.g. 
Something like this (untested):

from Bio import SeqIO
from Bio import SearchIO

blast_index = SearchIO.index(blast_file, blast_format)
for query_seq_record in SeqIO.parse(query_file, "fasta"):
    query_id = query_seq_record.id
    if query_id not in blast_index:
        # BLAST format where empty results are missing, e.g. BLAST tabular
        continue
    query_result = blast_index[query_id]
    if not query_result.hits:
        # BLAST result with no hits, e.g. BLAST text
        continue
    print("Have hits for %s" % query_id)

Peter From mmokrejs at fold.natur.cuni.cz Sat Feb 15 11:28:18 2014 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Sat, 15 Feb 2014 17:28:18 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: <52FD2D4A.9010300@fold.natur.cuni.cz> References: <52FD2D4A.9010300@fold.natur.cuni.cz> Message-ID: <52FF95A2.7070102@fold.natur.cuni.cz> Martin Mokrejs wrote: > Hi, > I am in the process of conversion to the new XML parsing code written by Bow. > So far, I have deciphered the following replacement strings (somewhat written in sed(1) format): > > > /hsp.identities/hsp.ident_num/ > /hsp.score/hsp.bitscore/ > /hsp.expect/hsp.evalue/ > /hsp.bits/hsp.bitscore/ > /hsp.gaps/hsp.gap_num/ > /hsp.bits/hsp.bitscore_raw/ Aside from the fact that I pasted the _hsp.bits line twice, my guess was wrong. The code works now but needed the following changes from NCBIXML to SearchIO names: /_hsp.score/_hsp.bitscore_raw/ /_hsp.bits/_hsp.bitscore/ > /hsp.positives/hsp.pos_num/ > /hsp.sbjct_start/hsp.hit_start/ > /hsp.sbjct_end/hsp.hit_end/ > # hsp.query_start # no change from NCBIXML > # hsp.query_end # no change from NCBIXML > /record.query.split()[0]/record.id/ > /alignment.hit_def.split(' ')[0]/alignment.hit_id/ > /record.alignments/record.hits/ > > /hsp.align_length/hsp.aln_span/ # I hope these do the same as with NCBIXML (don't remember whether the counts include minus signs of the alignment or not) > > > > > Now I am uncertain. There used to be hsp.sbjct_length and alignment.length.
I think the former length was including the minus sign for gaps while the latter is just the real length of the query sequence. > > Nevertheless, what did alignment.length transform into? Into len(hsp.query_all)? I don't think hsp.query_span but who knows. ;) Answering myself: /alignment.hit_id/alignment.id/ /alignment.length/_record.hits[0].seq_len/ Other changes: _hsp.sbjct/_hsp.hit.seq.tostring() # aligned sequence including dashes [ATGCNatgcn-] _hsp.query/_hsp.query.seq.tostring() # aligned sequence including dashes [ATGCNatgcn-] _hsp.match/_hsp.aln_annotation['homology']/ # e.g. '||||||||||||||||||||||||||||||||||| |||||||||| | ||| || ||||||| |||||' I think the dictionary key should have been better named "similarity". The strand does not translate simply to SearchIO, one needs to do: /_hsp.strand/(_hsp.query_strand, _hsp.hit_strand)/ # the tuple will be e.g. (1, 1) while I think it used to be under NCBIXML as either ('Plus', 'Plus'), ('Plus, 'Minus'), (None, None), etc. > > > > Meanwhile I see my biopython-1.62 doesn't understand hsp.gap_num, looks that has been added to SearchIO in 1.63. so, that's all from me now until I upgrade. 
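[Editorial note: pulled together in one place, the renames worked out in this thread (with the corrected score/bits pair) could be kept as a plain reference mapping. Names exactly as given above; untested against any particular Biopython release.]

```python
# NCBIXML -> SearchIO attribute renames collected from this thread,
# for reference only; the "no change" entries are omitted.
NCBIXML_TO_SEARCHIO = {
    "hsp.identities": "hsp.ident_num",
    "hsp.expect": "hsp.evalue",
    "hsp.score": "hsp.bitscore_raw",    # corrected: raw score, not bitscore
    "hsp.bits": "hsp.bitscore",
    "hsp.gaps": "hsp.gap_num",
    "hsp.positives": "hsp.pos_num",
    "hsp.sbjct_start": "hsp.hit_start",
    "hsp.sbjct_end": "hsp.hit_end",
    "hsp.align_length": "hsp.aln_span",
    "record.query.split()[0]": "record.id",
    "record.alignments": "record.hits",
}

for old, new in sorted(NCBIXML_TO_SEARCHIO.items()):
    print("%-25s -> %s" % (old, new))
```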
;) I got around with try/except, although it is more expensive than the previously sufficient if/else tests:

# undo the off-by-one change in SearchIO and transform back to real-life numbers
_hit_start = _hsp.hit_start + 1
_query_start = _hsp.query_start + 1
try:
    _ident_num = _hsp.ident_num
except AttributeError:
    _ident_num = 0
try:
    _pos_num = _hsp.pos_num
except AttributeError:
    _pos_num = 0
try:
    _gap_num = _hsp.gap_num
except AttributeError:
    # calculate the gaps count, sometimes missing in legacy blast XML output;
    # see also https://redmine.open-bio.org/issues/3363 saying that
    # _multimer_hsp_identities and _multimer_hsp_positives are affected too
    _gap_num = _hsp.aln_span - _ident_num

So far I can conclude that the transition from NCBIXML to SearchIO gave me a 30% wallclock speedup, but the most important question for me is whether it will save memory when parsing huge XML files (>100GB uncompressed). That I don't know yet, am still testing. Martin From vishnuc11j93 at gmail.com Sat Feb 15 22:39:58 2014 From: vishnuc11j93 at gmail.com (Vishnu Chilakamarri) Date: Sun, 16 Feb 2014 09:09:58 +0530 Subject: [Biopython] Using Tabix on a bgzf file Message-ID: Hi Peter, I read your code on bgzf compression and the blog post. I used uniprot_sprot_varsplic.fasta.gz as the example (from the EBI ftp) to compress in bgzf and then index using Tabix. Now the file I've gotten has a .tbi extension. I'm trying to parse the file but it gives a "preset not provided" error, and when I'm trying to access columns I'm getting an "indexes overlap" error. Can you tell me where I've gone wrong? Thank you, Vishnu From jordan.r.willis at Vanderbilt.Edu Sun Feb 16 01:49:19 2014 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Sun, 16 Feb 2014 06:49:19 +0000 Subject: [Biopython] extra annotations for phyla tree Message-ID: Hi, First off, whoever wrote the DistanceTree and DistanceMatrix Calculator... hat's off! I have been looking for an easy way to do custom distance matrices for a while. Wow.
Anyway, I noticed you can add some extra annotations to your leaves by converting your tree into a PhyloXML. I was wondering if there are ways to color branches and adjust thickness to highlight branches of interest. I know you can simply open the trees in other programs like Dendroscope and color them manually, but you can imagine a scenario where you have thousands of trees to compare etc. Jordan From p.j.a.cock at googlemail.com Sun Feb 16 09:32:58 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 16 Feb 2014 14:32:58 +0000 Subject: [Biopython] Using Tabix on a bgzf file In-Reply-To: References: Message-ID: On Sunday, February 16, 2014, Vishnu Chilakamarri wrote: > Hi Peter, > > I read your code on bgzf compression and the blog post. I used > uniprot_sprot_varsplic.fasta.gz< > ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot_varsplic.fasta.gz > > as > the example (from the EBI ftp) to compress in bgzf and then index > using > Tabix. Now the file I've gotten has a .tbi extension. I'm trying to parse > the file but it gives a "preset not provided" error and when I'm trying to > access columns I'm getting an "indexes overlap" error. Can you tell me where > I've gone wrong? > > Thank you, > Vishnu > > Biopython doesn't (currently) use the tabix index (*.tbi) file. Biopython's Bio.SeqIO indexing code uses the BGZF compressed sequence file directly. Using the SeqIO.index(...) function will make an in-memory index, while SeqIO.index_db(...) will make an index on disk using SQLite. This system is quite separate from tabix (and Biopython uses it for many, many sequence file formats, not just FASTA). Peter From bjorn_johansson at bio.uminho.pt Sun Feb 16 14:23:45 2014 From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=) Date: Sun, 16 Feb 2014 19:23:45 +0000 Subject: [Biopython] CAI confusion Message-ID: Hi, I am trying to use the Bio.SeqUtils.CodonUsage module to calculate CAI for S. cerevisiae genes.
Biopython comes with the SharpEcoliIndex from Bio.SeqUtils.CodonUsageIndices, but none for S. cerevisiae. I found one here: http://downloads.yeastgenome.org/unpublished_data/codon/s_cerevisiae-codonusage.txt and here: http://downloads.yeastgenome.org/unpublished_data/codon/ysc.orf.cod I parsed the first table, which has the following format, unfortunately w/o headers:

Gly GGG 17673 6.05 0.12
Gly GGA 32723 11.20 0.23
Gly GGT 66198 22.66 0.46
Gly GGC 28522 9.76 0.20
Glu GAG 57046 19.52 0.30
...

I believe the last column is the fraction. I think biopython instead expects relative adaptedness w as input for each codon, see http://www.ncbi.nlm.nih.gov/pubmed/3547335 How do I calculate w from the frequency? Are there any examples or code available? I googled, but could not find anything. Grateful for help! /bjorn -- ______O_________oO________oO______o_______oO__ Björn Johansson Assistant Professor Department of Biology University of Minho Campus de Gualtar 4710-057 Braga PORTUGAL www.bio.uminho.pt Google profile Google Scholar Profile my group Office (direct) +351-253 601517 | (PT) mob. +351-967 147 704 | (SWE) mob. +46 739 792 968 Dept of Biology (secr) +351-253 60 4310 | fax +351-253 678980 From eric.talevich at gmail.com Mon Feb 17 01:25:18 2014 From: eric.talevich at gmail.com (Eric Talevich) Date: Sun, 16 Feb 2014 22:25:18 -0800 Subject: [Biopython] extra annotations for phyla tree In-Reply-To: References: Message-ID: On Sat, Feb 15, 2014 at 10:49 PM, Willis, Jordan R < jordan.r.willis at vanderbilt.edu> wrote: > > Hi, > > First off, whoever wrote the DistanceTree and DistanceMatrix > Calculator... hat's off! I have been looking for an easy way to do custom > distance matrices for a while. Wow. > > Anyway, I noticed you can add some extra annotations to your leaves by > converting your tree into a PhyloXML. I was wondering if there are ways to > color branches and adjust thickness to highlight branches of interest.
I > know you can simply open the trees in other programs like Dendroscope and > color them manually, but you can imagine a scenario where you have > thousands of trees to compare etc. > > Jordan > Hi Jordan, The TreeConstruction and Consensus modules are the recent work of Yanbo Ye. Good to hear you're using them and liking them. As for annotating branch display colors and widths, you can accomplish this by setting the .color and .width attributes of Clade objects. See: http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec233

tree = Phylo.read("mytree.nwk", "newick")
clade = tree.common_ancestor("A", "B")
clade.color = "red"
clade.width = 2

Note that the clade color and width are recursive, applying to all descendent clade branches too (per the phyloXML spec). To save the annotations so they can be read by Dendroscope and Archaeopteryx, the trees must be saved in phyloXML format:

Phylo.write(tree, "mytree-annotated.xml", "phyloxml")

Cheers, Eric From anaryin at gmail.com Wed Feb 19 09:39:10 2014 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 19 Feb 2014 15:39:10 +0100 Subject: [Biopython] Bio.PDB local MMCIF files In-Reply-To: <1388374650.98611.BPMail_high_carrier@web164003.mail.gq1.yahoo.com> References: <1388374650.98611.BPMail_high_carrier@web164003.mail.gq1.yahoo.com> Message-ID: Hello, The implementation I was referring to by the EBI people is here. I tested it during a workshop and it is very fast and robust (they use it, which should be reason enough), so maybe we could benefit a lot from either its incorporation or adaptation? As for what I suggested: since my GSOC period, already 4 years ago, I noticed that the PDB module is a bit messy in terms of organization. The module itself is named after the databank, which can be confused with the format name, the mmCIF parser is defined in a subfolder, and there are application wrappers there too (DSSP, NACCESS).
Besides this issue, which is not an issue at all and just my own pet peeve, there is a lot that the entire module could gain from a thorough revision. I've been using it very often and some normal manipulations of structures are not straightforward to carry out (calculating a center of mass for example, or removing double occupancies) due to the parser being slow and quite memory hungry. In fact, trying to run the parser on a very large collection of structures often results in a random crash due to memory issues. I've been toying with a lot of changes, performance improvements, etc, but I'm not satisfied at all with them. Some things I've been trying are having the structure coordinates defined as a full numpy array instead of N arrays per structure (one per atom), or the usage of __slots__ to mitigate memory usage (managed to get it down 33% this way). This would also go in line with a suggestion from Eric a long time ago to make a Bio.Struct module which would be the perfect "playground" to implement and test these changes. Other developments that I think are worth looking into are, for example, making a nice library to link a parsed structure to the PDB database and fetch information on it using the REST services they provide. I'd like to hear your opinion (as in, everybody, developers and users) on this and whether it makes sense to indeed give a bit of TLC to the Bio.PDB module. Also, on what changes you think should be carried out to improve the module, like which features are missing and which applications are worth wrapping. Just to kick off some discussion. Maybe a new thread should be opened for this later on.
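[Editorial note: the __slots__ saving João mentions is easy to demonstrate in isolation. A toy atom class, illustrative only and not actual Bio.PDB code.]

```python
# Sketch: __slots__ removes the per-instance __dict__, which dominates
# the memory cost of millions of small atom-like objects.

class AtomPlain(object):
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

class AtomSlots(object):
    __slots__ = ("x", "y", "z")
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

plain = AtomPlain(1.0, 2.0, 3.0)
slim = AtomSlots(1.0, 2.0, 3.0)

print(hasattr(plain, "__dict__"))   # True: every instance carries a dict
print(hasattr(slim, "__dict__"))    # False: attributes live in fixed slots
```

The trade-off is that slotted instances cannot gain arbitrary new attributes, which matters for a library whose users may attach their own data to parsed objects.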
Cheers, Jo?o From p.j.a.cock at googlemail.com Wed Feb 19 09:51:59 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 19 Feb 2014 14:51:59 +0000 Subject: [Biopython] Bio.PDB local MMCIF files In-Reply-To: References: <1388374650.98611.BPMail_high_carrier@web164003.mail.gq1.yahoo.com> Message-ID: On Wed, Feb 19, 2014 at 2:39 PM, Jo?o Rodrigues wrote: > Hello, > > The implementation I was referring to by the EBI people is here. I tested it > during a workshop and it is very fast and robust (they use it, that should > be enough reason) so maybe we could benefit a lot from either its > incorporation or adaptation? > > As for what I suggested. Since my GSOC period, already 4 years ago.., I > noticed that the PDB module is a bit messy in terms of organization. The > module itself if named after the databank, which can be confused with the > format name, the mmcif parser is defined inside in a subfolder and there are > application wrappers there too (DSSP, NACCESS). Besides this issue, which is > not an issue at all and just my own pet peeve, there is a lot that the > entire module could gain from a thorough revision. I've been using it very > often and some normal manipulations of structures are not straightforward to > carry out (calculating a center of mass for example, removing double > occupancies) due to the parser being slow and quite memory hungry. In fact, > trying to run the parser on a very large collection of structures often > results in a random crash due to memory issues. > > I've been toying with a lot of changes, performance improvements, etc, but > I'm not satisfied at all with them.. somethings that i've been trying is to > have the structure coordinates defined as a full numpy array instead of N > arrays per structure (one per atom) or the usage of __slots__ to mitigate > memory usage (managed to get it down 33% this way). 
This would also go in > line with a suggestion from Eric a long time ago to make a Bio.Struct module > which would be the perfect "playground" to implement and test these changes. > Other developments that I think are worth looking into are for example > making a nice library to link a parsed structure to the PDB database and > fetch information on it using the REST services they provide. > > I'd like to hear your opinion (as in, everybody, developers and users) on > this and if it makes sense to indeed give a bit of TLC to the Bio.PDB > module. Also, on what changes you think should be carried out to improve the > module, like which features are missing, which applications are worth > wrapping. > > Just to kick off some discussion. Maybe a new thread should be opened for > this later on. > > Cheers, > > Jo?o +1 on a new thread, and Bio.Struct (or better lower case, Bio.struct or Bio.structure or something to be a bit more PEP8 like?). Peter From anaryin at gmail.com Wed Feb 19 11:42:54 2014 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 19 Feb 2014 17:42:54 +0100 Subject: [Biopython] Bio.PDB.PDBParser() Superimposer() In-Reply-To: References: Message-ID: Hi Jurgens, Sorry for the delay.. hope it still goes on time. If the numbering of the two proteins is the same (equivalent residues have equivalent residue numbers), usually the case if you compare different models generated by simulation, then it is straightforward to trim them (check this gist ). Otherwise you have to perform a sequence alignment and parse the alignment to extract the equivalent atoms and do the same logic as before (this is quite tricky..). I have a script that does this but it's not trivial at all and might be extremely specific for your application. Cheers, Jo?o 2014-01-16 13:18 GMT+01:00 Jurgens de Bruin : > Hi Jo?o Rodrigues, > > Thanks for the reply much appreciated, this does make sense but I would > greatly appreciate examples with some code. 
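Since code examples were requested: a minimal sketch of the trimming logic being described, with the two structures mocked as plain dicts (residue number -> atom name -> coordinates) so it runs without Bio.PDB. With real parsed structures you would build the two atom lists the same way and pass them to Superimposer.set_atoms():

```python
def paired_atoms(fixed, moving, atom="CA"):
    """Return coordinate pairs for residues present in both structures.

    Only residues whose numbers occur in both structures, and which have
    the requested atom in both, contribute -- so the two sides are
    guaranteed to have equal length, as Superimposer requires.
    """
    common = sorted(set(fixed) & set(moving))
    return [(fixed[r][atom], moving[r][atom])
            for r in common
            if atom in fixed[r] and atom in moving[r]]

# Mock data: residue 1 is missing from the model, and residue 3 of the
# "native" structure has no CA atom.
native = {1: {"CA": (0.0, 0.0, 0.0)},
          2: {"CA": (1.5, 0.0, 0.0)},
          3: {"CB": (2.0, 1.0, 0.0)}}
model = {2: {"CA": (1.4, 0.1, 0.0)},
         3: {"CA": (2.9, 0.2, 0.0)},
         4: {"CA": (4.1, 0.0, 0.0)}}

print(paired_atoms(native, model))
# [((1.5, 0.0, 0.0), (1.4, 0.1, 0.0))]
```

This assumes equivalent residues share residue numbers, which (as noted in the thread) holds for models from simulation but not in general.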
> > Thanks > > On 16 January 2014 13:59, João Rodrigues wrote: > >> Hi Jurgens, >> >> When you pass the two sequences to the Superimposer I guess you can trim >> the sequence to that which you want (pass a list of residues that is sliced >> to those that you want to include). The only requirement would be that both >> have the same number of atoms. >> >> If this doesn't make much sense I can give an example with code. >> >> Cheers, >> >> João >> >> >> 2014/1/16 Jurgens de Bruin >> >>> Hi, >>> >>> I am trying to calculate the RMS for two pdb files but the proteins >>> differ >>> in length. Currently I want to exclude the leading/trailing parts of the >>> longer sequence but I am having difficulty figuring out how I will be >>> able >>> to do this. >>> >>> Any help would be appreciated. >>> >>> >>> -- >>> Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/ >>> distinti saluti/siong/du? y?/?????? >>> >>> Jurgens de Bruin >>> >>> _______________________________________________ >>> Biopython mailing list - Biopython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >>> >> >> > > > -- > Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/ > distinti saluti/siong/du? y?/?????? > > Jurgens de Bruin > From p.j.a.cock at googlemail.com Wed Feb 19 11:47:38 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 19 Feb 2014 16:47:38 +0000 Subject: [Biopython] Bio.PDB.PDBParser() Superimposer() In-Reply-To: References: Message-ID: On Wed, Feb 19, 2014 at 4:42 PM, João Rodrigues wrote: > Hi Jurgens, > > Sorry for the delay.. hope it still goes on time. > > If the numbering of the two proteins is the same (equivalent residues have > equivalent residue numbers), usually the case if you compare different > models generated by simulation, then it is straightforward to trim them (check > this gist ).
Here's a slightly more complex example picking out a stable core for the alignment (ignoring variable loops): http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > Otherwise you have to perform a sequence alignment and parse the alignment > to extract the equivalent atoms and do the same logic as before (this is > quite tricky..). I have a script that does this but it's not trivial at all > and might be extremely specific for your application. Yes. Fiddly. Peter From anaryin at gmail.com Wed Feb 19 12:07:17 2014 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 19 Feb 2014 18:07:17 +0100 Subject: [Biopython] Bio.PDB.PDBParser() Superimposer() In-Reply-To: References: Message-ID: Hey Jordan, Mind pasting that somewhere? I spent a few hours coding something like that recently so it would be nice to compare ! Cheers, Jo?o 2014-02-19 18:05 GMT+01:00 Willis, Jordan R : > I also have an example where I have one native and several models that > needs an RMSD. > > It performs a multiple sequence alignment one at a time and iterates > through the alignment file to do a one-to-one array of atoms in the > sequence alignment before calculating a superposition. If the atoms do not > match, they are thrown out of the alignment. Let me know if you want to > see this, it?s a bit complex. > > Jordan > > > > > On Feb 19, 2014, at 10:47 AM, Peter Cock > wrote: > > > On Wed, Feb 19, 2014 at 4:42 PM, Jo?o Rodrigues > wrote: > >> Hi Jurgens, > >> > >> Sorry for the delay.. hope it still goes on time. > >> > >> If the numbering of the two proteins is the same (equivalent residues > have > >> equivalent residue numbers), usually the case if you compare different > >> models generated by simulation, then it is straightforward to trim them > (check > >> this gist ). 
> > > > Here's a slightly more complex example picking out a stable core > > for the alignment (ignoring variable loops): > > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > > > >> Otherwise you have to perform a sequence alignment and parse the > alignment > >> to extract the equivalent atoms and do the same logic as before (this is > >> quite tricky..). I have a script that does this but it's not trivial at > all > >> and might be extremely specific for your application. > > > > Yes. Fiddly. > > > > Peter > > > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > From jordan.r.willis at Vanderbilt.Edu Wed Feb 19 12:05:31 2014 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Wed, 19 Feb 2014 17:05:31 +0000 Subject: [Biopython] Bio.PDB.PDBParser() Superimposer() In-Reply-To: References: Message-ID: I also have an example where I have one native and several models that needs an RMSD. It performs a multiple sequence alignment one at a time and iterates through the alignment file to do a one-to-one array of atoms in the sequence alignment before calculating a superposition. If the atoms do not match, they are thrown out of the alignment. Let me know if you want to see this, it?s a bit complex. Jordan On Feb 19, 2014, at 10:47 AM, Peter Cock wrote: > On Wed, Feb 19, 2014 at 4:42 PM, Jo?o Rodrigues wrote: >> Hi Jurgens, >> >> Sorry for the delay.. hope it still goes on time. >> >> If the numbering of the two proteins is the same (equivalent residues have >> equivalent residue numbers), usually the case if you compare different >> models generated by simulation, then it is straightforward to trim them (check >> this gist ). 
> > Here's a slightly more complex example picking out a stable core > for the alignment (ignoring variable loops): > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > >> Otherwise you have to perform a sequence alignment and parse the alignment >> to extract the equivalent atoms and do the same logic as before (this is >> quite tricky..). I have a script that does this but it's not trivial at all >> and might be extremely specific for your application. > > Yes. Fiddly. > > Peter > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From jordan.r.willis at Vanderbilt.Edu Wed Feb 19 12:52:36 2014 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Wed, 19 Feb 2014 17:52:36 +0000 Subject: [Biopython] Bio.PDB.PDBParser() Superimposer() In-Reply-To: References: Message-ID: This will calculate an all_atom_RMSD, c-alpha and backbone atom rmd. I took out all the extra stuff specific to the Rosetta community that will actually score the file too. But this is generalized scoreimposer_align.py -n native.pdb -m *.pdbs -m is the multiprocess flag (requires python2.7) https://gist.github.com/jwillis0720/9097426 Jordan On Feb 19, 2014, at 11:07 AM, Jo?o Rodrigues > wrote: Hey Jordan, Mind pasting that somewhere? I spent a few hours coding something like that recently so it would be nice to compare ! Cheers, Jo?o 2014-02-19 18:05 GMT+01:00 Willis, Jordan R >: I also have an example where I have one native and several models that needs an RMSD. It performs a multiple sequence alignment one at a time and iterates through the alignment file to do a one-to-one array of atoms in the sequence alignment before calculating a superposition. If the atoms do not match, they are thrown out of the alignment. Let me know if you want to see this, it?s a bit complex. 
Jordan On Feb 19, 2014, at 10:47 AM, Peter Cock > wrote: > On Wed, Feb 19, 2014 at 4:42 PM, Jo?o Rodrigues > wrote: >> Hi Jurgens, >> >> Sorry for the delay.. hope it still goes on time. >> >> If the numbering of the two proteins is the same (equivalent residues have >> equivalent residue numbers), usually the case if you compare different >> models generated by simulation, then it is straightforward to trim them (check >> this gist ). > > Here's a slightly more complex example picking out a stable core > for the alignment (ignoring variable loops): > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > >> Otherwise you have to perform a sequence alignment and parse the alignment >> to extract the equivalent atoms and do the same logic as before (this is >> quite tricky..). I have a script that does this but it's not trivial at all >> and might be extremely specific for your application. > > Yes. Fiddly. > > Peter > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From cjfields at illinois.edu Thu Feb 20 09:16:16 2014 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 20 Feb 2014 14:16:16 +0000 Subject: [Biopython] Bio.PDB local MMCIF files In-Reply-To: References: <1388374650.98611.BPMail_high_carrier@web164003.mail.gq1.yahoo.com> , Message-ID: <608E332B-F339-4474-A206-209ED6EA3D84@illinois.edu> On Feb 19, 2014, at 8:55 AM, "Peter Cock" wrote: > >> On Wed, Feb 19, 2014 at 2:39 PM, Jo?o Rodrigues wrote: >> Hello, >> >> The implementation I was referring to by the EBI people is here. I tested it >> during a workshop and it is very fast and robust (they use it, that should >> be enough reason) so maybe we could benefit a lot from either its >> incorporation or adaptation? >> >> As for what I suggested. Since my GSOC period, already 4 years ago.., I >> noticed that the PDB module is a bit messy in terms of organization. 
The >> module itself if named after the databank, which can be confused with the >> format name, the mmcif parser is defined inside in a subfolder and there are >> application wrappers there too (DSSP, NACCESS). Besides this issue, which is >> not an issue at all and just my own pet peeve, there is a lot that the >> entire module could gain from a thorough revision. I've been using it very >> often and some normal manipulations of structures are not straightforward to >> carry out (calculating a center of mass for example, removing double >> occupancies) due to the parser being slow and quite memory hungry. In fact, >> trying to run the parser on a very large collection of structures often >> results in a random crash due to memory issues. >> >> I've been toying with a lot of changes, performance improvements, etc, but >> I'm not satisfied at all with them.. somethings that i've been trying is to >> have the structure coordinates defined as a full numpy array instead of N >> arrays per structure (one per atom) or the usage of __slots__ to mitigate >> memory usage (managed to get it down 33% this way). This would also go in >> line with a suggestion from Eric a long time ago to make a Bio.Struct module >> which would be the perfect "playground" to implement and test these changes. >> Other developments that I think are worth looking into are for example >> making a nice library to link a parsed structure to the PDB database and >> fetch information on it using the REST services they provide. >> >> I'd like to hear your opinion (as in, everybody, developers and users) on >> this and if it makes sense to indeed give a bit of TLC to the Bio.PDB >> module. Also, on what changes you think should be carried out to improve the >> module, like which features are missing, which applications are worth >> wrapping. >> >> Just to kick off some discussion. Maybe a new thread should be opened for >> this later on. 
>> >> Cheers, >> >> Jo?o > > +1 on a new thread, and Bio.Struct (or better lower case, Bio.struct > or Bio.structure or something to be a bit more PEP8 like?). > > Peter The similarly designed (but terribly maintained) BioPerl code is Bio::Structure. It think it was designed years back to be agnostic to a specific database but of course based much of its design on PDB data. Chris From leo2 at stanford.edu Mon Feb 24 20:59:45 2014 From: leo2 at stanford.edu (Leo Alexander Hansmann) Date: Mon, 24 Feb 2014 17:59:45 -0800 (PST) Subject: [Biopython] consensus for forward and reverse reads from a sequencing run In-Reply-To: <997170947.1096602.1393293281154.JavaMail.zimbra@stanford.edu> Message-ID: <1368292051.1100084.1393293585839.JavaMail.zimbra@stanford.edu> Hi, I'm getting two fasta files from an Illumina MiSeq run. One contains forward, the other reverse reads. The lines in both files are corresponding, meaning the first sequence in the forward read file should pair with the first sequence line in the reverse read file. Both sequences overlap in the middle in a varying amount of nucleotides. How can I get python or biopython to generate a file with the consensus sequences of each read. For example: sequence in the forward read file: AATCGTCGGTTACTCTG corresponding line in the reverse read file: CTCTGAGGGAGAGATC I want: AATCGTCGGTTACTCTGAGGGAGAGATC Thank you so much! 
Leo

From jordan.r.willis at Vanderbilt.Edu Mon Feb 24 21:21:40 2014 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Tue, 25 Feb 2014 02:21:40 +0000 Subject: [Biopython] consensus for forward and reverse reads from a sequencing run In-Reply-To: <1368292051.1100084.1393293585839.JavaMail.zimbra@stanford.edu> References: <1368292051.1100084.1393293585839.JavaMail.zimbra@stanford.edu> Message-ID: <5661D2FC-430A-41DA-91D1-C61ADABE7EA6@vanderbilt.edu> Hi Leo,

I know this is not what you asked and I'm not sure if Biopython has a module, but I would really recommend pandaseq (https://github.com/neufeld/pandaseq). It's written in C, so it's much faster than Python, and it could not be simpler to use. I typically use this for HiSeq and MiSeq runs; it just requires the forward and reverse paired-end reads and spits out a consensus (with PHRED scores if you want).

Jordan

On Feb 24, 2014, at 7:59 PM, Leo Alexander Hansmann > wrote: Hi, I'm getting two fasta files from an Illumina MiSeq run. One contains forward, the other reverse reads. The lines in both files are corresponding, meaning the first sequence in the forward read file should pair with the first sequence line in the reverse read file. Both sequences overlap in the middle in a varying amount of nucleotides. How can I get python or biopython to generate a file with the consensus sequences of each read. For example: sequence in the forward read file: AATCGTCGGTTACTCTG corresponding line in the reverse read file: CTCTGAGGGAGAGATC I want: AATCGTCGGTTACTCTGAGGGAGAGATC Thank you so much!
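A naive pure-Python sketch of the merge Leo describes. It assumes, as in his example, that the second read is already in the same orientation as the first (real MiSeq reverse reads usually need reverse-complementing first), and it does exact matching only — no quality scores, which is why the dedicated tools recommended in this thread are the better choice in practice. The min_overlap parameter is an arbitrary illustrative threshold:

```python
# Naive overlap merge: find the longest suffix of `fwd` that equals a
# prefix of `rev`, then join the two reads at that overlap.
def merge_reads(fwd, rev, min_overlap=3):
    for size in range(min(len(fwd), len(rev)), min_overlap - 1, -1):
        if fwd.endswith(rev[:size]):
            return fwd + rev[size:]
    return None  # no sufficient overlap found

print(merge_reads("AATCGTCGGTTACTCTG", "CTCTGAGGGAGAGATC"))
# AATCGTCGGTTACTCTGAGGGAGAGATC
```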
Leo From ivangreg at gmail.com Mon Feb 24 22:34:24 2014 From: ivangreg at gmail.com (Ivan Gregoretti) Date: Mon, 24 Feb 2014 22:34:24 -0500 Subject: [Biopython] consensus for forward and reverse reads from a sequencing run In-Reply-To: <5661D2FC-430A-41DA-91D1-C61ADABE7EA6@vanderbilt.edu> References: <1368292051.1100084.1393293585839.JavaMail.zimbra@stanford.edu> <5661D2FC-430A-41DA-91D1-C61ADABE7EA6@vanderbilt.edu> Message-ID: Hello Leo, Besides pandaseq, also consider FLASH from the Salzberg lab. http://ccb.jhu.edu/software/FLASH/ I've been using it for over a year without problems. I wish there was a Biopython tool though. Cheers, Ivan Ivan Gregoretti, PhD Bioinformatics On Mon, Feb 24, 2014 at 9:21 PM, Willis, Jordan R wrote: > Hi Leo, > > I know this is not what you asked and I'm not sure if BioPython has a module, but I would really recommend pandaseq (https://github.com/neufeld/pandaseq). Its written in C, so its much faster than python and really could not be any more simple to use. I typically use this for HiSeq and MiSeq runs and it just requires the forward and reverse paired end reads and spits out a consensus (with PHRED scores if you want). > > Jordan > > On Feb 24, 2014, at 7:59 PM, Leo Alexander Hansmann > wrote: > > Hi, > I'm getting two fasta files from an Illumina MiSeq run. One contains forward, the other reverse reads. The lines in both files are corresponding, meaning the first sequence in the forward read file should pair with the first sequence line in the reverse read file. Both sequences overlap in the middle in a varying amount of nucleotides. How can I get python or biopython to generate a file with the consensus sequences of each read. For example: > sequence in the forward read file: AATCGTCGGTTACTCTG > corresponding line in the reverse read file: CTCTGAGGGAGAGATC > I want: AATCGTCGGTTACTCTGAGGGAGAGATC > Thank you so much! 
> Leo > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From egor.lakomkin at gmail.com Tue Feb 25 00:02:49 2014 From: egor.lakomkin at gmail.com (Lakomkin Egor) Date: Tue, 25 Feb 2014 13:02:49 +0800 Subject: [Biopython] [GSoC] Text mining for biopython Message-ID: Hello, I am PhD student, doing research in biomedical text mining, especially gene ontology term recognition. I would like to ask if there is any interest of doing GSoC text mining project under biopython? Regards, Egor From egor.lakomkin at gmail.com Tue Feb 25 00:07:20 2014 From: egor.lakomkin at gmail.com (Lakomkin Egor) Date: Tue, 25 Feb 2014 13:07:20 +0800 Subject: [Biopython] [GSoC] Text mining for biopython Message-ID: Hello, I am PhD student, doing research in biomedical text mining, especially gene ontology term recognition. I would like to ask if there is any interest of doing GSoC text mining project under biopython? Regards, Egor From p.j.a.cock at googlemail.com Tue Feb 25 06:22:09 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 25 Feb 2014 11:22:09 +0000 Subject: [Biopython] consensus for forward and reverse reads from a sequencing run In-Reply-To: References: <1368292051.1100084.1393293585839.JavaMail.zimbra@stanford.edu> <5661D2FC-430A-41DA-91D1-C61ADABE7EA6@vanderbilt.edu> Message-ID: I agree that for this specific task (merging overlapped paired FASTQ reads) an existing dedicated tool/script is a very sensible choice. There are plenty to choose from. What Biopython might benefit from is either sample code on the Cookbook wiki for how to do this, or perhaps a new function in Bio.SeqUtils? i.e. Bits to help you do something new or different, if you need to customise a bespoke analysis. Peter On Tue, Feb 25, 2014 at 3:34 AM, Ivan Gregoretti wrote: > Hello Leo, > > Besides pandaseq, also consider FLASH from the Salzberg lab. 
> http://ccb.jhu.edu/software/FLASH/ > > I've been using it for over a year without problems. I wish there was > a Biopython tool though. > > Cheers, > > Ivan > > > > Ivan Gregoretti, PhD > Bioinformatics > > > > On Mon, Feb 24, 2014 at 9:21 PM, Willis, Jordan R > wrote: >> Hi Leo, >> >> I know this is not what you asked and I'm not sure if BioPython has a module, but I would really recommend pandaseq (https://github.com/neufeld/pandaseq). Its written in C, so its much faster than python and really could not be any more simple to use. I typically use this for HiSeq and MiSeq runs and it just requires the forward and reverse paired end reads and spits out a consensus (with PHRED scores if you want). >> >> Jordan >> >> On Feb 24, 2014, at 7:59 PM, Leo Alexander Hansmann > wrote: >> >> Hi, >> I'm getting two fasta files from an Illumina MiSeq run. One contains forward, the other reverse reads. The lines in both files are corresponding, meaning the first sequence in the forward read file should pair with the first sequence line in the reverse read file. Both sequences overlap in the middle in a varying amount of nucleotides. How can I get python or biopython to generate a file with the consensus sequences of each read. For example: >> sequence in the forward read file: AATCGTCGGTTACTCTG >> corresponding line in the reverse read file: CTCTGAGGGAGAGATC >> I want: AATCGTCGGTTACTCTGAGGGAGAGATC >> Thank you so much! >> Leo >> From p.j.a.cock at googlemail.com Tue Feb 25 06:36:57 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 25 Feb 2014 11:36:57 +0000 Subject: [Biopython] [GSoC] Text mining for biopython In-Reply-To: References: Message-ID: On Tue, Feb 25, 2014 at 5:02 AM, Lakomkin Egor wrote: > Hello, > > I am PhD student, doing research in biomedical text mining, especially > gene ontology term recognition. I would like to ask if there is any > interest of doing GSoC text mining project under biopython? 
> > Regards, Egor Hi Egor, I'm not aware of any of the current Biopython development team doing any text mining work - but I can think of a few people I've met at hackathons/conferences which might be: Karin Verspoor, NICTA http://textminingscience.com/content/karin-verspoor https://twitter.com/karinv Kevin Cohen, University of Colorado School of Medicine http://compbio.ucdenver.edu/Hunter_lab/Cohen/index.shtml https://twitter.com/KevinBCohen Daniel Jamieson, PhD student at University of Manchester https://twitter.com/danielgjamieson (I've not checked if they use Python in their work) However, sorting out a nice combined module for Gene Ontology support (and ontologies in general) would be good. There are a number of people already looking at this (check the biopython and biopython-dev mailing list archives with Google). Regards, Peter From cjfields at illinois.edu Tue Feb 25 10:40:43 2014 From: cjfields at illinois.edu (Fields, Christopher J) Date: Tue, 25 Feb 2014 15:40:43 +0000 Subject: [Biopython] consensus for forward and reverse reads from a sequencing run In-Reply-To: References: <1368292051.1100084.1393293585839.JavaMail.zimbra@stanford.edu> <5661D2FC-430A-41DA-91D1-C61ADABE7EA6@vanderbilt.edu> Message-ID: <112D9B62-CA39-4072-BA01-08C332EC8FE9@illinois.edu> Torsten Seeman blogged on this and listed a bunch of tools, including a python-based approach: http://thegenomefactory.blogspot.com/2012/11/tools-to-merge-overlapping-paired-end.html He also mentioned the one we have been using internally for MiSeq data (PEAR), which we have found works much better than PandaSeq in many circumstances (complete or overextended overlaps): http://bioinformatics.oxfordjournals.org/content/early/2013/11/10/bioinformatics.btt593.full chris On Feb 25, 2014, at 5:22 AM, Peter Cock wrote: > I agree that for this specific task (merging overlapped paired > FASTQ reads) an existing dedicated tool/script is a very > sensible choice. There are plenty to choose from. 
> > What Biopython might benefit from is either sample code > on the Cookbook wiki for how to do this, or perhaps a new > function in Bio.SeqUtils? i.e. Bits to help you do something > new or different, if you need to customise a bespoke > analysis. > > Peter > > On Tue, Feb 25, 2014 at 3:34 AM, Ivan Gregoretti wrote: >> Hello Leo, >> >> Besides pandaseq, also consider FLASH from the Salzberg lab. >> http://ccb.jhu.edu/software/FLASH/ >> >> I've been using it for over a year without problems. I wish there was >> a Biopython tool though. >> >> Cheers, >> >> Ivan >> >> >> >> Ivan Gregoretti, PhD >> Bioinformatics >> >> >> >> On Mon, Feb 24, 2014 at 9:21 PM, Willis, Jordan R >> wrote: >>> Hi Leo, >>> >>> I know this is not what you asked and I'm not sure if BioPython has a module, but I would really recommend pandaseq (https://github.com/neufeld/pandaseq). Its written in C, so its much faster than python and really could not be any more simple to use. I typically use this for HiSeq and MiSeq runs and it just requires the forward and reverse paired end reads and spits out a consensus (with PHRED scores if you want). >>> >>> Jordan >>> >>> On Feb 24, 2014, at 7:59 PM, Leo Alexander Hansmann > wrote: >>> >>> Hi, >>> I'm getting two fasta files from an Illumina MiSeq run. One contains forward, the other reverse reads. The lines in both files are corresponding, meaning the first sequence in the forward read file should pair with the first sequence line in the reverse read file. Both sequences overlap in the middle in a varying amount of nucleotides. How can I get python or biopython to generate a file with the consensus sequences of each read. For example: >>> sequence in the forward read file: AATCGTCGGTTACTCTG >>> corresponding line in the reverse read file: CTCTGAGGGAGAGATC >>> I want: AATCGTCGGTTACTCTGAGGGAGAGATC >>> Thank you so much! 
>>> Leo >>> > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython

From harsh.beria93 at gmail.com Wed Feb 26 11:14:24 2014 From: harsh.beria93 at gmail.com (Harsh Beria) Date: Wed, 26 Feb 2014 21:44:24 +0530 Subject: [Biopython] Gsoc 2014 aspirant Message-ID: Hi,

I am Harsh Beria, a third-year UG student at the Indian Institute of Technology, Kharagpur. I have started working in computational biophysics recently, having written code for a PDB-to-FASTA parser, sequence alignment using Needleman-Wunsch and Smith-Waterman, secondary structure prediction, and Henikoff's weights, and am currently working on Monte Carlo simulation. Overall, I have started to like this field and want to carry my interest forward by pursuing a relevant project for GSoC 2014. I mainly code in C and Python and would like to start contributing to the Biopython library. I started going through the official contribution wiki page (http://biopython.org/wiki/Contributing) and also went through the wiki page on Bio.SeqIO. I seriously want to contribute to the Biopython library through GSoC. What do I do next?

Thanks -- Harsh Beria, Indian Institute of Technology, Kharagpur E-mail: harsh.beria93 at gmail.com

From p.j.a.cock at googlemail.com Thu Feb 27 08:49:22 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Feb 2014 13:49:22 +0000 Subject: [Biopython] Introductory Biopython material Message-ID: Hello all,

This is just to let you know that I've written some introductory Biopython material targeting Python novices, focused on some practical sequence manipulation examples, freely available under the CC-BY licence here: https://github.com/peterjc/biopython_workshop

I've run this as a workshop twice, but it should be fine for self study as well. I'm open to moving this under the Biopython project's GitHub account, if people think that would be better?
I've added a few links to this from the website - these can be moved/edited/removed if people think there's a better place to put them: http://biopython.org/wiki/SeqIO and http://biopython.org/wiki/Category:Wiki_Documentation

Regards, Peter

From tra at popgen.net Thu Feb 27 09:53:48 2014 From: tra at popgen.net (Tiago Antao) Date: Thu, 27 Feb 2014 14:53:48 +0000 Subject: [Biopython] Bio.PopGen.SimCoal partial deprecation Message-ID: <20140227145348.44cbe923@lnx> Dear all,

With the availability of the new fastsimcoal interface by Melissa Gymrek, I was planning on deprecating the code for the old version (SimCoal 2.0). This would mean deprecating the SimCoalController class (Bio.PopGen.SimCoal.Controller.py), along with the relevant test code (and the SimCoal2 dependency). All the other code would be maintained (e.g. the templating). Melissa's new fastsimcoal class (FastSimCoalController) would of course be added. If anybody has strong feelings against this deprecation, please do voice your concerns.
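For context, a deprecation like this is typically signalled with a warning when the old class is used. A sketch under stated assumptions: the warning class below is a stand-in for Biopython's own BiopythonDeprecationWarning, and the constructor body is illustrative, not the real SimCoalController:

```python
import warnings

# Stand-in for Bio.BiopythonDeprecationWarning (illustrative only).
class BiopythonDeprecationWarning(Warning):
    pass

class SimCoalController:
    def __init__(self, simcoal_dir):
        # Warn once per use so existing scripts keep working for now.
        warnings.warn(
            "SimCoalController is deprecated; use the new fastsimcoal "
            "interface (FastSimCoalController) instead.",
            BiopythonDeprecationWarning,
            stacklevel=2,
        )
        self.simcoal_dir = simcoal_dir

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    SimCoalController("/opt/simcoal")

print(caught[0].category.__name__)  # BiopythonDeprecationWarning
```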
Best, Tiago From Leighton.Pritchard at hutton.ac.uk Thu Feb 27 10:50:18 2014 From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard) Date: Thu, 27 Feb 2014 15:50:18 +0000 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas Message-ID: I would like to propose further development of the GenomeDiagram module (and maybe the KGML module, if it?s incorporated into Biopython) to enable browser-based interactive visualisation, along the lines of Bokeh[1] [1] http://bokeh.pydata.org/ -- Dr Leighton Pritchard Information and Computing Sciences Group; Weeds, Pests and Diseases Theme DG31, James Hutton Institute (Dundee) Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:leighton.pritchard at hutton.ac.uk w:http://www.hutton.ac.uk/staff/leighton-pritchard gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827 ________________________________________________________ This email is from the James Hutton Institute, however the views expressed by the sender are not necessarily the views of the James Hutton Institute and its subsidiaries. This email and any attachments are confidential and are intended solely for the use of the recipient(s) to whom they are addressed. If you are not the intended recipient, you should not read, copy, disclose or rely on any information contained in this email, and we would ask you to contact the sender immediately and delete the email from your system. Although the James Hutton Institute has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and any attachments. The James Hutton Institute is a Scottish charitable company limited by guarantee. Registered in Scotland No. SC374831 Registered Office: The James Hutton Institute, Invergowrie Dundee DD2 5DA. Charity No. 
SC041796 From p.j.a.cock at googlemail.com Thu Feb 27 11:12:31 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Feb 2014 16:12:31 +0000 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References: Message-ID: On Thu, Feb 27, 2014 at 3:50 PM, Leighton Pritchard wrote: > I would like to propose further development of the GenomeDiagram > module (and maybe the KGML module, if it's incorporated into Biopython) > to enable browser-based interactive visualisation, along the lines of Bokeh[1] > > [1] http://bokeh.pydata.org/ I presume you're offering to mentor this - which would be great :) Peter P.S. The KGML module Leighton's talking about is here: https://github.com/biopython/biopython/pull/173 Leighton's blog posts about this work: http://armchairbiology.blogspot.co.uk/2013/01/keggwatch-part-i.html http://armchairbiology.blogspot.co.uk/2013/02/keggwatch-part-ii.html http://armchairbiology.blogspot.co.uk/2013/02/keggwatch-part-iii.html From tra at popgen.net Thu Feb 27 11:19:44 2014 From: tra at popgen.net (Tiago Antao) Date: Thu, 27 Feb 2014 16:19:44 +0000 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References: Message-ID: <20140227161944.05640d0d@lnx> Hi, On Thu, 27 Feb 2014 16:12:31 +0000 Peter Cock wrote: > P.S. The KGML module Leighton's talking about is here: > https://github.com/biopython/biopython/pull/173 Would this add a new library dependency to Biopython (PIL)? I am all in favour of that (as independent modules could have their dependencies without causing problems - as you only need the dependency if you actually use the module). But that would require the revision of the module dependency policy, right? Which until now has been a bit on the conservative side... I am thinking here matplotlib and scipy, for instance... 
Tiago From p.j.a.cock at googlemail.com Thu Feb 27 11:31:11 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Feb 2014 16:31:11 +0000 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References: Message-ID: On Thu, Feb 27, 2014 at 4:25 PM, Fields, Christopher J wrote: > On Feb 27, 2014, at 10:12 AM, Peter Cock wrote: > >> On Thu, Feb 27, 2014 at 3:50 PM, Leighton Pritchard >> wrote: >>> I would like to propose further development of the GenomeDiagram >>> module (and maybe the KGML module, if it's incorporated into Biopython) >>> to enable browser-based interactive visualisation, along the lines of Bokeh[1] >>> >>> [1] http://bokeh.pydata.org/ >> >> I presume you're offering to mentor this - which would be great :) >> >> Peter > > I would add that to the wiki, and indicate whether you can mentor it. > Seems like a cool idea! > > chris Leighton left out the link, but had added this to the Biopython wiki: http://biopython.org/wiki/GSOC#Interactive_GenomeDiagram_Module Peter From cjfields at illinois.edu Thu Feb 27 11:25:18 2014 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 27 Feb 2014 16:25:18 +0000 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References: Message-ID: On Feb 27, 2014, at 10:12 AM, Peter Cock wrote: > On Thu, Feb 27, 2014 at 3:50 PM, Leighton Pritchard > wrote: >> I would like to propose further development of the GenomeDiagram >> module (and maybe the KGML module, if it's incorporated into Biopython) >> to enable browser-based interactive visualisation, along the lines of Bokeh[1] >> >> [1] http://bokeh.pydata.org/ > > I presume you're offering to mentor this - which would be great :) > > Peter > > P.S. 
The KGML module Leighton's talking about is here: > https://github.com/biopython/biopython/pull/173 > > Leighton's blog posts about this work: > http://armchairbiology.blogspot.co.uk/2013/01/keggwatch-part-i.html > http://armchairbiology.blogspot.co.uk/2013/02/keggwatch-part-ii.html > http://armchairbiology.blogspot.co.uk/2013/02/keggwatch-part-iii.html I would add that to the wiki, and indicate whether you can mentor it. Seems like a cool idea! chris From ishengomae at nm-aist.ac.tz Sun Feb 2 19:28:23 2014 From: ishengomae at nm-aist.ac.tz (Edson Ishengoma) Date: Sun, 2 Feb 2014 22:28:23 +0300 Subject: [Biopython] Help modify this code so it can do what I want it to do Message-ID: Hi folks, I picked this code from somewhere and edited it a bit but it still can't achieve what I need. I have an xml output of tblastn hits on my customized database and now I am in the process to extract the results with biopython. With tblastn sometimes the returned hit is multiple local hits corresponding to certain positions along the query with significant scores. Now I want to concatenate these local hits which initially requires sorting according to positions. 
for record in records: > for alignment in record.alignments: > hits = sorted((hsp.query_start, hsp.query_end, hsp.sbjct_start, hsp.sbjct_end, alignment.title, hsp.query, hsp.sbjct)\ > for hsp in alignment.hsps) # sorting results according to positions > complete_query_seq = '' > complete_sbjct_seq ='' > for q_start, q_end, sb_start, sb_end, title, query, sbjct in hits: > print title > print 'The query starts from position: ' + str(q_start) > print 'The query ends at position: ' + str(q_end) > print 'The hit starts at position: ' + str(sb_start) > print 'The hit ends at position: ' + str(sb_end) > print 'The query is: ' + query > print 'The hit is: ' + sbjct > complete_query_seq += str(query[q_start:q_end]) # concatenating subsequent query/subject portions with alignments > complete_sbjct_seq += str(query[sb_start:sb_end]) > print 'Complete query seq is: ' + complete_query_seq > print 'Complete subject seq is: ' + complete_sbjct_seq > > This would print: > Species_1The query starts from position: 1The query ends at position: 184The hit starts at position: 1The hit ends at position: 552The query is: ####### query_seqThe hit is: ######### hit_seqSpecies_1The query starts from position: 390The query ends at position: 510The hit starts at position: 549The hit ends at position: 911The query is: ####### query_seqThe hit is: ######### hit_seqSpecies_1The query starts from position: 492The query ends at position: 787The hit starts at position: 889The hit ends at position: 1776The query is: ####### query_seqThe hit is: ######### hit_seq > Complete query seq is: ####### query_seq > Complete subject seq is: ######### hit_seq > > This is not what I want as clearly the program did no concatenation at all, or I messed up seriously. What I want is Complete query seq is: ####### ############## (color coded to mean the different portions of query with significant hits) with no sequence overlaps. How do I achieve that? Thanks, Regards, Edson. 
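The slices above come up empty because hsp.query and hsp.sbjct are already just the aligned fragments, while q_start/sb_start are coordinates into the full sequences, so query[q_start:q_end] usually runs off the end of the fragment. A minimal self-contained sketch of the fix discussed later in the thread (sort the HSPs by query position, join the aligned fragments with gaps stripped, and trim overlaps); the HSP namedtuple and the toy fragments are hypothetical stand-ins for real Biopython HSP objects from Bio.Blast.NCBIXML.parse():

```python
from collections import namedtuple

# Toy stand-in for a Biopython HSP -- only the fields this sketch needs.
# Real code would take these from Bio.Blast.NCBIXML.parse() records.
HSP = namedtuple("HSP", "query_start query_end query sbjct")

def stitch(hsps):
    """Sort HSPs by query position, strip alignment gaps ('-'), and trim
    any overlap with the previous fragment so no residue is added twice."""
    query_parts, sbjct_parts = [], []
    prev_end = 0  # last query position (1-based, inclusive) already covered
    for h in sorted(hsps, key=lambda h: h.query_start):
        q = h.query.replace("-", "")
        s = h.sbjct.replace("-", "")
        # residues of this fragment already covered by an earlier HSP
        skip = max(0, prev_end - h.query_start + 1)
        query_parts.append(q[skip:])
        sbjct_parts.append(s[skip:])  # rough: assumes the overlap region is ungapped
        prev_end = max(prev_end, h.query_end)
    return "".join(query_parts), "".join(sbjct_parts)

# Hypothetical fragments, out of order: query positions 1-5 and 4-8 overlap on 4-5.
hsps = [
    HSP(4, 8, "DEFGH", "defgh"),
    HSP(1, 5, "ABC-DE", "abcxde"),
]
complete_query, complete_sbjct = stitch(hsps)
print(complete_query)  # ABCDEFGH
print(complete_sbjct)  # abcxdefgh
```

Note the overlap trimming here is counted in query residues, which is only approximate for tblastn if the overlapping columns contain gaps; for exact subject coordinates one would trim using sbjct_start/sbjct_end instead.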
From saketkc at gmail.com Mon Feb 3 04:22:42 2014 From: saketkc at gmail.com (Saket Choudhary) Date: Mon, 3 Feb 2014 09:52:42 +0530 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References: Message-ID: On 31 January 2014 16:25, Peter Cock wrote: > On Wed, Jan 29, 2014 at 9:29 PM, Eric Talevich wrote: >> Hi folks, >> >> Google Summer of Code is on again for 2014, and the Open Bioinformatics >> Foundation (OBF) is once again applying as a mentoring organization. >> Participating in GSoC as an organization is very competitive, and we will >> need your help in gathering a good set of ideas and potential mentors for >> Biopython's role in GSoC this year. >> >> If you have an idea for a Summer of Code project, please post your idea >> here on the Biopython mailing list for discussion and start an outline on >> this wiki page: >> http://biopython.org/wiki/Google_Summer_of_Code >> >> We also welcome ideas that fit with OBF's mission but are not part of a >> single Bio* project, or span multiple projects -- these ideas can be posted >> on the OBF wiki and discussed on the OBF mailing list: >> http://www.open-bio.org/wiki/Google_Summer_of_Code#Project_ideas >> http://lists.open-bio.org/mailman/listinfo/open-bio-l >> >> Here's to another fun and productive Summer of Code! >> >> Cheers, >> Eric & Raoul > > Thanks Eric & Raoul, > > Remember that the ideas don't have to come from potential mentors - > if as a student there is something you'd particularly like to work on > please ask, and perhaps we can find a suitable (Biopython) mentor. > > Regards, > > Peter I would like to propose a QC module for NGS & Microarray data. Essentially a fastQC[1] and limma[2], respectively ported to Biopython. 
[1] http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ [2] http://bioconductor.org/packages/devel/bioc/html/limma.html Saket > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Mon Feb 3 12:19:40 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Feb 2014 12:19:40 +0000 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: On Sun, Feb 2, 2014 at 7:28 PM, Edson Ishengoma wrote: > Hi folks, > > I picked this code from somewhere and edited it a bit but it still can't > achieve what I need. I have an xml output of tblastn hits on my customized > database and now I am in the process to extract the results with biopython. > With tblastn sometimes the returned hit is multiple local hits corresponding > to certain positions along the query with significant scores. Now I want to > concatenate these local hits which initially requires sorting according to > positions. > > ... > complete_query_seq += str(query[q_start:q_end]) > complete_sbjct_seq += str(query[sb_start:sb_end]) > ... Shouldn't you be taking a slice from the subject sequence (the database match) there, rather than the query sequence? Another approach would be to use the alignment sequence fragments BLAST gives you (and remove the gap characters). Peter From ivangreg at gmail.com Mon Feb 3 13:43:17 2014 From: ivangreg at gmail.com (Ivan Gregoretti) Date: Mon, 3 Feb 2014 08:43:17 -0500 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: Hello Edson, There is an argument that you can pass to tblastn that is called max_hsps_per_subject. Try -max_hsps_per_subject=1 and be sure not to pass the flag -ungapped. That might do the job for you. The help says tblastn -help ...
*** Statistical options -dbsize <int_value> Effective length of the database -searchsp <int_value, >=0> Effective length of the search space -max_hsps_per_subject <int_value, >=0> Override maximum number of HSPs per subject to save for ungapped searches (0 means do not override) Default = `0' ... Ivan Ivan Gregoretti, PhD On Mon, Feb 3, 2014 at 7:19 AM, Peter Cock wrote: > On Sun, Feb 2, 2014 at 7:28 PM, Edson Ishengoma > wrote: >> Hi folks, >> >> I picked this code from somewhere and edited it a bit but it still can't >> achieve what I need. I have an xml output of tblastn hits on my > customized >> database and now I am in the process to extract the results with > biopython. >> With tblastn sometimes the returned hit is multiple local hits > corresponding >> to certain positions along the query with significant scores. Now I > want to >> concatenate these local hits which initially requires sorting according > to >> positions. >> >> ... >> complete_query_seq += str(query[q_start:q_end]) >> complete_sbjct_seq += str(query[sb_start:sb_end]) >> ... > > Shouldn't you be taking a slice from the subject sequence (the database > match) there, rather than the query sequence? > > Another approach would be to use the alignment sequence fragments > BLAST gives you (and remove the gap characters). > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Mon Feb 3 17:15:44 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Feb 2014 17:15:44 +0000 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References: Message-ID: On Mon, Feb 3, 2014 at 4:21 PM, Lisa Cohen wrote: > Hello Everyone, > > I am a new bioinformatics student and interested in working on a Biopython > package for gene ontology and functional annotation. I've noticed that this > is in "discussion stages" on the wiki page [1].
Perhaps working with > blast2GO [2], b2g4pipe Galaxy wrapper [3], other existing tools [4]. > > Is this a feasible Google Summer of Code project idea? Is anyone interested > in working with me? > > Lisa > > [1] http://www.biopython.org/w/index.php?title=Gene_Ontology&redirect=no > [2] http://www.blast2go.com/b2ghome > [3] https://github.com/peterjc/galaxy_blast/tree/master/tools/blast2go > [4] https://github.com/tanghaibao/goatools Something based around (gene) ontology support might make a good project. Chris Lasher was once looking at this, as was Kyle Ellrott. On the general subject of ontologies, more recently Iddo Friedberg and Bartek Wilczynski were talking about some OBO work just last month: http://lists.open-bio.org/pipermail/biopython-dev/2014-January/thread.html Peter From ishengomae at nm-aist.ac.tz Mon Feb 3 19:16:55 2014 From: ishengomae at nm-aist.ac.tz (Edson Ishengoma) Date: Mon, 3 Feb 2014 22:16:55 +0300 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: Hi Peter, Sorry that was the typo, it should be: complete_sbjct_seq += str(sbjct[sb_start:sb_end]). I tried a suggestion by Ivan on providing the tblastn option [-max_hsps_per_subject 1] but still the output shows up as fragmented hits. Peter said: "Another approach would be to use the alignment sequence fragments BLAST gives you (and remove the gap characters)." With the script I have I can only extract the first fragment only for each hit. I don't know why string slicing method [sb_start:sb_end] in my script does not include start and end positions for subsequent fragments. Regards, Edson On Mon, Feb 3, 2014 at 4:43 PM, Ivan Gregoretti wrote: > Hello Edson, > > There is an argument that you can pass to tblastn that is called > max_hsps_per_subject. Try -max_hsps_per_subjec=1 and be sure not to > pass the flag -ungapped. That might do the job for you. > > The help says > > tblastn -help > ...
> *** Statistical options > -dbsize > Effective length of the database > -searchsp =0> > Effective length of the search space > -max_hsps_per_subject =0> > Override maximum number of HSPs per subject to save for ungapped > searches > (0 means do not override) > Default = `0' > ... > > Ivan > > > > Ivan Gregoretti, PhD > > > On Mon, Feb 3, 2014 at 7:19 AM, Peter Cock > wrote: > > On Sun, Feb 2, 2014 at 7:28 PM, Edson Ishengoma > > wrote: > >> Hi folks, > >> > >> I picked this code from somewhere and edited it a bit but it still can't > >> achieve what I need. I have an xml output of tblastn hits on my > customized > >> database and now I am in the process to extract the results with > biopython. > >> With tblastn sometimes the returned hit is multiple local hits > corresponding > >> to certain positions along the query with significant scores. Now I > want to > >> concatenate these local hits which initially requires sorting according > to > >> positions. > >> > >> ... > >> complete_query_seq += str(query[q_start:q_end]) > >> complete_sbjct_seq += str(query[sb_start:sb_end]) > >> ... > > > > Shouldn't you be taking a slice from the subject sequence (the database > > match) there, rather than the query sequence? > > > > Another approach would be to use the alignment sequence fragments > > BLAST gives you (and remove the gap characters). > > > > Peter > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > From p.j.a.cock at googlemail.com Mon Feb 3 20:14:04 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Feb 2014 20:14:04 +0000 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: On Monday, February 3, 2014, Edson Ishengoma wrote: > Hi Peter, > > Sorry that was the typo, it should be: > complete_sbjct_seq += str(sbjct[sb_start:sb_end]). 
> > I tried a suggestion by Ivan on the providing tblastn option > [-max_hsps_per_subject 1] but still the output shows up as fragmented hits. > > Peter said: "Another approach would be to use the alignment sequence > fragments BLAST gives you (and remove the gap characters)." > With the script I have I can only extract the first fragment only for each > hit. I don't know why string slicing method [sb_start:sb_end] in my script > does not include start and end positions for subsequent fragments. > > Regards, > > Edson > Hi Edson, Emails can mess up Python indentation, so posting the file online might show something silly we've missed - I find http://gist.github.com works well for this. It would also help if you could share a sample BLAST output file where the script is failing, as then people on the list could recreate your problem on their own computer, which is often the first step in solving it. Peter From ishengomae at nm-aist.ac.tz Mon Feb 3 21:45:38 2014 From: ishengomae at nm-aist.ac.tz (Edson Ishengoma) Date: Tue, 4 Feb 2014 00:45:38 +0300 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: Thanks Peter. Here is a link to my script at https://gist.github.com/EBIshengoma/efc4ad3e32427891931d Also, please find attached the sample xml output. On Mon, Feb 3, 2014 at 11:14 PM, Peter Cock wrote: > > On Monday, February 3, 2014, Edson Ishengoma > wrote: > >> Hi Peter, >> >> Sorry that was the typo, it should be: >> complete_sbjct_seq += str(sbjct[sb_start:sb_end]). >> >> I tried a suggestion by Ivan on the providing tblastn option >> [-max_hsps_per_subject 1] but still the output shows up as fragmented hits. >> >> Peter said: "Another approach would be to use the alignment sequence >> fragments BLAST gives you (and remove the gap characters)." >> With the script I have I can only extract the first fragment only for >> each hit. 
I don't know why string slicing method [sb_start:sb_end] in my >> script >> does not include start and end positions for subsequent fragments. >> >> Regards, >> >> Edson >> > > Hi Edson, > > Emails can mess up Python indentation, so posting the file online might > show something silly we've missed - I find http://gist.github.com works > well for this. > > It would also help if you could share a sample BLAST output file where the > script is failing, as then people on the list could recreate your problem > on their own computer, which is often the first step in solving it. > > Peter > > -------------- next part -------------- A non-text attachment was scrubbed... Name: Sample_output.xml Type: text/xml Size: 12909 bytes Desc: not available URL: From aradwen at gmail.com Tue Feb 4 00:08:27 2014 From: aradwen at gmail.com (Radhouane Aniba) Date: Mon, 3 Feb 2014 16:08:27 -0800 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: You can try use coderscrowd.com as well you will have all modifications separately on your code and you can validate the one it works better for you Rad On Mon, Feb 3, 2014 at 1:45 PM, Edson Ishengoma wrote: > Thanks Peter. > > Here is a link to my script at > https://gist.github.com/EBIshengoma/efc4ad3e32427891931d > > Also, please find attached the sample xml output. > > > > On Mon, Feb 3, 2014 at 11:14 PM, Peter Cock >wrote: > > > > > On Monday, February 3, 2014, Edson Ishengoma > > wrote: > > > >> Hi Peter, > >> > >> Sorry that was the typo, it should be: > >> complete_sbjct_seq += str(sbjct[sb_start:sb_end]). > >> > >> I tried a suggestion by Ivan on the providing tblastn option > >> [-max_hsps_per_subject 1] but still the output shows up as fragmented > hits. > >> > >> Peter said: "Another approach would be to use the alignment sequence > >> fragments BLAST gives you (and remove the gap characters)." 
> >> With the script I have I can only extract the first fragment only for > >> each hit. I don't know why string slicing method [sb_start:sb_end] in my > >> script > >> does not include start and end positions for subsequent fragments. > >> > >> Regards, > >> > >> Edson > >> > > > > Hi Edson, > > > > Emails can mess up Python indentation, so posting the file online might > > show something silly we've missed - I find http://gist.github.com works > > well for this. > > > > It would also help if you could share a sample BLAST output file where > the > > script is failing, as then people on the list could recreate your problem > > on their own computer, which is often the first step in solving it. > > > > Peter > > > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > -- *Radhouane Aniba* *Bioinformatics Postdoctoral Research Scientist* *Institute for Advanced Computer StudiesCenter for Bioinformatics and Computational Biology* *(CBCB)* *University of Maryland, College ParkMD 20742* From p.j.a.cock at googlemail.com Tue Feb 4 08:46:11 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 4 Feb 2014 08:46:11 +0000 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: On Monday, February 3, 2014, Edson Ishengoma wrote: > Thanks Peter. > > Here is a link to my script at > https://gist.github.com/EBIshengoma/efc4ad3e32427891931d > > Also, please find attached the sample xml output. > > The start of the script is missing (import statements, how you loaded the query and subject sequences, and how you parsed the BLAST output). We'd need at least that to run your script. 
Regards, Peter From ishengomae at nm-aist.ac.tz Tue Feb 4 09:12:53 2014 From: ishengomae at nm-aist.ac.tz (Edson Ishengoma) Date: Tue, 4 Feb 2014 12:12:53 +0300 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: Hi Peter, My apology, I have updated the code at https://gist.github.com/EBIshengoma/efc4ad3e32427891931d to appear exactly how I run it from my computer. Thanks. Edson B. Ishengoma PhD-Candidate *School of Life Sciences and Engineering Nelson Mandela African Institute of Science and Technology Nelson Mandela Road P. O. Box 447, Arusha Tanzania (255) * *ishengomae at nm-aist.ac.tz *ebarongo82 at yahoo.co.uk * Mobile: +255 762 348 037, +255 714 789 360, Website: www.nm-aist.ac.tz Skype: edson.ishengoma * * ** On Tue, Feb 4, 2014 at 11:46 AM, Peter Cock wrote: > > > On Monday, February 3, 2014, Edson Ishengoma > wrote: > >> Thanks Peter. >> >> Here is a link to my script at >> https://gist.github.com/EBIshengoma/efc4ad3e32427891931d >> >> Also, please find attached the sample xml output. >> >> > The start of the script is missing (import statements, how > you loaded the query and subject sequences, and how > you parsed the BLAST output). We'd need at least that > to run your script. > > Regards, > > Peter > > From bartha.daniel at agrar.mta.hu Tue Feb 4 10:38:46 2014 From: bartha.daniel at agrar.mta.hu (Bartha Dániel) Date: Tue, 4 Feb 2014 11:38:46 +0100 Subject: [Biopython] help! entrez esearch popset issue Message-ID: Hi People, I have an issue with biopythons esearch/efetch, and this drives me crazy.
If I search for something in the PopSet, like this, but the query is arbitrary: query = "Homo sapiens[Organism] NOT mitochondrion[All Fields]"; esearch_handle = Entrez.esearch(db="popset", term=query) search_results = Entrez.read(esearch_handle) accnos = search_results['IdList'] I get somehow always only 20 results in my IdList, but with the same term, many thousands on the website. Is this a bug? Because by default, on the website, 20 results per page are shown, and surprise, my 20 results are equal with the first page. The biopython documentation regarding the PopSet DB is not very talkative, so I ask you, how do I solve this problem elegantly ("python only")? Since the same constellation doesn't cause any issues by searching in the protein or other sequence DB, either has the PopSet DB some tricks I don't know or this is a BUG(?). Regards: Daniel -- Dániel Bartha, molecular bionics engineer, BSc Bioinformatician Institute for Veterinary Medical Research Centre for Agricultural Research Hungarian Academy of Sciences Hungária körút 21. Budapest 1143 Hungary e-mail: bartha.daniel at agrar.mta.hu From saketkc at gmail.com Tue Feb 4 12:25:45 2014 From: saketkc at gmail.com (Saket Choudhary) Date: Tue, 4 Feb 2014 12:25:45 +0000 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: <20140204231638.41daaf4a@kmserver> References: <20140204231638.41daaf4a@kmserver> Message-ID: Hi Kevin, In fact I had forked this long ago[1], didn't have time to contribute to it though. Thanks for the awesome work! [1] https://github.com/saketkc/pyNGSQC Saket On 4 February 2014 12:16, Kevin Murray wrote: > Saket, > > Apologies in advance if this is a little too unsolicited! =) > > Feel free to use pyNGSQC[1] as the basis for some of the proposed QC > stuff, if it is of any use. I've been meaning to refactor this to use > Biopython and in the long term submit a pull request, but I doubt I'll > have time.
I can share the refactoring progress with you/push it to > github if you're interested. > > [1]: https://github.com/kdmurray91/pyNGSQC > > > Cheers, > > Kevin > > On Mon, 3 Feb 2014 09:52:42 +0530 > Saket Choudhary wrote: > >>On 31 January 2014 16:25, Peter Cock wrote: >>> On Wed, Jan 29, 2014 at 9:29 PM, Eric Talevich >>> wrote: >>>> Hi folks, >>>> >>>> Google Summer of Code is on again for 2014, and the Open >>>> Bioinformatics Foundation (OBF) is once again applying as a >>>> mentoring organization. Participating in GSoC as an organization is >>>> very competitive, and we will need your help in gathering a good >>>> set of ideas and potential mentors for Biopython's role in GSoC >>>> this year. >>>> >>>> If you have an idea for a Summer of Code project, please post your >>>> idea here on the Biopython mailing list for discussion and start an >>>> outline on this wiki page: >>>> http://biopython.org/wiki/Google_Summer_of_Code >>>> >>>> We also welcome ideas that fit with OBF's mission but are not part >>>> of a single Bio* project, or span multiple projects -- these ideas >>>> can be posted on the OBF wiki and discussed on the OBF mailing list: >>>> http://www.open-bio.org/wiki/Google_Summer_of_Code#Project_ideas >>>> http://lists.open-bio.org/mailman/listinfo/open-bio-l >>>> >>>> Here's to another fun and productive Summer of Code! >>>> >>>> Cheers, >>>> Eric & Raoul >>> >>> Thanks Eric & Raoul, >>> >>> Remember that the ideas don't have to come from potential mentors - >>> if as a student there is something you'd particularly like to work on >>> please ask, and perhaps we can find a suitable (Biopython) mentor. >>> >>> Regards, >>> >>> Peter >> >>I would like to propose a QC module for NGS & Microarray data. >>Essentially a fastQC[1] and limma[2], respectively ported to >>Biopython. 
>> >> [1] http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ >>[2] http://bioconductor.org/packages/devel/bioc/html/limma.html >> >> >>Saket >> >>> _______________________________________________ >>> Biopython mailing list - Biopython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >>_______________________________________________ >>Biopython mailing list - Biopython at lists.open-bio.org >>http://lists.open-bio.org/mailman/listinfo/biopython From kevin at kdmurray.id.au Tue Feb 4 12:34:56 2014 From: kevin at kdmurray.id.au (Kevin Murray) Date: Tue, 4 Feb 2014 23:34:56 +1100 Subject: [Biopython] help! entrez esearch popset issue In-Reply-To: References: Message-ID: <20140204233456.7204362d@kmserver> Bartha, I believe that the retstart keyword argument is your friend. Something like [Completely contrived and untested]: request = Entrez.read(Entrez.esearch(db, qry, retstart=0)) answers = request["IdList"] expected = int(request["Count"]) returned = len(answers) while returned < expected: request = Entrez.read(Entrez.esearch(db, qry,retstart=returned)) returned += len(request["IdList"]) answers.extend(request["IdList"]) print(answers) This is documented here: http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_EFetch_ Others may have more intelligent/complete solutions. Cheers, Kevin On Tue, 4 Feb 2014 11:38:46 +0100 Bartha Dániel wrote: >Hi People, > >I have an issue with biopythons esearch/efetch, and this drives me >crazy. > >If I search for something in the PopSet, like this, but the query is >arbitrary: > >query = "Homo sapiens[Organism] NOT mitochondrion[All Fields]"; > >esearch_handle = Entrez.esearch(db="popset", term=query) >search_results = Entrez.read(esearch_handle) >accnos = search_results['IdList'] > >I get somehow always only 20 results in my IdList, but with the same >term, many thousands on the website. Is this a bug?
> >Because by default, on the website, 20 results per page are shown, and >surprise, my 20 results are equal with the first page. The biopython >documentation regarding the PopSet DB is not very talkative, so I ask >you, how do I solve this problem elegant ("python only")? > >Since the same constellation doesn't cause any issues by searching in >the protein or other sequence DB, either has the PopSet DB some tricks >I don't kow or this is a BUG(?). > > >Regards: > >Daniel > > > From kevin at kdmurray.id.au Tue Feb 4 12:16:38 2014 From: kevin at kdmurray.id.au (Kevin Murray) Date: Tue, 4 Feb 2014 23:16:38 +1100 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References: Message-ID: <20140204231638.41daaf4a@kmserver> Saket, Apologies in advance if this is a little too unsolicited! =) Feel free to use pyNGSQC[1] as the basis for some of the proposed QC stuff, if it is of any use. I've been meaning to refactor this to use Biopython and in the long term submit a pull request, but I doubt I'll have time. I can share the refactoring progress with you/push it to github if you're interested. [1]: https://github.com/kdmurray91/pyNGSQC Cheers, Kevin On Mon, 3 Feb 2014 09:52:42 +0530 Saket Choudhary wrote: >On 31 January 2014 16:25, Peter Cock wrote: >> On Wed, Jan 29, 2014 at 9:29 PM, Eric Talevich >> wrote: >>> Hi folks, >>> >>> Google Summer of Code is on again for 2014, and the Open >>> Bioinformatics Foundation (OBF) is once again applying as a >>> mentoring organization. Participating in GSoC as an organization is >>> very competitive, and we will need your help in gathering a good >>> set of ideas and potential mentors for Biopython's role in GSoC >>> this year. 
>>> >>> If you have an idea for a Summer of Code project, please post your >>> idea here on the Biopython mailing list for discussion and start an >>> outline on this wiki page: >>> http://biopython.org/wiki/Google_Summer_of_Code >>> >>> We also welcome ideas that fit with OBF's mission but are not part >>> of a single Bio* project, or span multiple projects -- these ideas >>> can be posted on the OBF wiki and discussed on the OBF mailing list: >>> http://www.open-bio.org/wiki/Google_Summer_of_Code#Project_ideas >>> http://lists.open-bio.org/mailman/listinfo/open-bio-l >>> >>> Here's to another fun and productive Summer of Code! >>> >>> Cheers, >>> Eric & Raoul >> >> Thanks Eric & Raoul, >> >> Remember that the ideas don't have to come from potential mentors - >> if as a student there is something you'd particularly like to work on >> please ask, and perhaps we can find a suitable (Biopython) mentor. >> >> Regards, >> >> Peter > >I would like to propose a QC module for NGS & Microarray data. >Essentially a fastQC[1] and limma[2], respectively ported to >Biopython. > > > >[1] http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ >[2] http://bioconductor.org/packages/devel/bioc/html/limma.html > > >Saket > >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >_______________________________________________ >Biopython mailing list - Biopython at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/biopython From idoerg at gmail.com Tue Feb 4 13:18:37 2014 From: idoerg at gmail.com (Iddo Friedberg) Date: Tue, 4 Feb 2014 08:18:37 -0500 Subject: [Biopython] help! entrez esearch popset issue In-Reply-To: References: Message-ID: Default number of records returned is 20. 
Read about the retmax and retstart arguments to see how to increase that number: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch On Tue, Feb 4, 2014 at 5:38 AM, Bartha Dániel wrote: > Hi People, > > I have an issue with Biopython's esearch/efetch, and this drives me crazy. > > If I search for something in the PopSet, like this, but the query is > arbitrary: > > query = "Homo sapiens[Organism] NOT mitochondrion[All Fields]"; > > esearch_handle = Entrez.esearch(db="popset", term=query) > search_results = Entrez.read(esearch_handle) > accnos = search_results['IdList'] > > I somehow always get only 20 results in my IdList, but with the same term, > many thousands on the website. Is this a bug? > > Because by default, on the website, 20 results per page are shown, and > surprise, my 20 results are identical to the first page. The biopython > documentation regarding the PopSet DB is not very talkative, so I ask you, > how do I solve this problem elegantly ("python only")? > > Since the same constellation doesn't cause any issues when searching in the > protein or other sequence DB, either the PopSet DB has some tricks I don't > know or this is a BUG(?). > > > Regards: > > Daniel > > > > -- > Dániel Bartha, molecular bionics engineer, BSc > Bioinformatician > Institute for Veterinary Medical Research > Centre for Agricultural Research > Hungarian Academy of Sciences > Hungária körút 21. > Budapest > 1143 > Hungary > > e-mail: > bartha.daniel at agrar.mta.hu > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. 
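The retmax/retstart paging that Iddo describes can be sketched as a simple windowing loop. The helper below is pure Python; the commented Entrez calls (the email address and batch size of 500 are illustrative placeholders, not values from the thread) show where the two arguments plug in — a first esearch with retmax=0 returns the total "Count", and the loop then pages through it:

```python
def esearch_windows(total, batch=500):
    """Yield (retstart, retmax) pairs that page through `total` records."""
    for start in range(0, total, batch):
        yield start, min(batch, total - start)

# Hypothetical usage (requires network access; set your own email):
# from Bio import Entrez
# Entrez.email = "you@example.org"
# query = "Homo sapiens[Organism] NOT mitochondrion[All Fields]"
# count = int(Entrez.read(
#     Entrez.esearch(db="popset", term=query, retmax=0))["Count"])
# accnos = []
# for retstart, retmax in esearch_windows(count, 500):
#     handle = Entrez.esearch(db="popset", term=query,
#                             retstart=retstart, retmax=retmax)
#     accnos.extend(Entrez.read(handle)["IdList"])
```

With this pattern the 20-record default never matters, since every request states its own window explicitly.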
From jgrant at smith.edu Tue Feb 4 16:09:19 2014 From: jgrant at smith.edu (Jessica Grant) Date: Tue, 4 Feb 2014 11:09:19 -0500 Subject: [Biopython] amazon aws Message-ID: Hello, Has anyone been successful in installing Biopython on an instance of the amazon cloud? If so, can I get some advice? I tried finding an easy install package, but couldn't, so I started to try installing from source. I ran into trouble with setup.py because it couldn't find gcc. I am going to try to find and install gcc... Also, will this need to get reinstalled every time I start an instance of the cloud? Thanks!! Jessica From zhigangwu.bgi at gmail.com Tue Feb 4 16:44:49 2014 From: zhigangwu.bgi at gmail.com (Zhigang Wu) Date: Tue, 4 Feb 2014 08:44:49 -0800 Subject: [Biopython] amazon aws In-Reply-To: References: Message-ID: <3C711647-B24C-4FF9-9E62-25EA1414BD4A@gmail.com> What is the Linux distribution of the EC2 instance you brought up? If it's Debian or Ubuntu, then sudo apt-get install biopython should be sufficient. The idea is just to use whatever package manager is available in the EC2 instance. Zhigang Sent from my iPhone > On Feb 4, 2014, at 8:09 AM, Jessica Grant wrote: > > Hello, > > Has anyone been successful in installing Biopython on an instance of the > amazon cloud? If so, can I get some advice? I tried finding an easy > install package, but couldn't, so I started to try installing from source. > I ran into trouble with setup.py because it couldn't find gcc. I > am going to try to find and install gcc... > > Also, will this need to get reinstalled every time I start an instance of > the cloud? > > Thanks!! 
> > Jessica > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From jgrant at smith.edu Tue Feb 4 16:47:41 2014 From: jgrant at smith.edu (Jessica Grant) Date: Tue, 4 Feb 2014 11:47:41 -0500 Subject: [Biopython] amazon aws In-Reply-To: <3C711647-B24C-4FF9-9E62-25EA1414BD4A@gmail.com> References: <3C711647-B24C-4FF9-9E62-25EA1414BD4A@gmail.com> Message-ID: I am just trying this out to see if this is going to work for us, so I am using the free version - Amazon Linux AMI x86_64 PV - and apt-get didn't work for me here. I will try launching an Ubuntu instance instead. Thank you for your response! Jessica On Tue, Feb 4, 2014 at 11:44 AM, Zhigang Wu wrote: > What is the Linux distribution of EC2 instance you bring up? If it's > Debian or Ubuntu, then sudo apt-get install biopython should be sufficient. > > The idea is just use whatever package manager available in EC2 instance. > > Zhigang > > Sent from my iPhone > > > On Feb 4, 2014, at 8:09 AM, Jessica Grant wrote: > > > > Hello, > > > > Has anyone been successful in installing Biopython on an instance of the > > amazon cloud? If so, can I get some advice? I tried finding an easy > > install package, but couldn't, so I started to try installing from > source. > > I ran into trouble because with setup.py bcause it couldn't find gcc. I > > am going to try to find and install gcc... > > > > Also, will this need to get reinstalled every time I start an instance of > > the cloud? > > > > Thanks!! 
> > > > Jessica > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > From jgrant at smith.edu Tue Feb 4 17:05:19 2014 From: jgrant at smith.edu (Jessica Grant) Date: Tue, 4 Feb 2014 12:05:19 -0500 Subject: [Biopython] amazon aws In-Reply-To: References: <3C711647-B24C-4FF9-9E62-25EA1414BD4A@gmail.com> Message-ID: Yes, that worked! Now on to RaxML... Thank you! On Tue, Feb 4, 2014 at 11:47 AM, Jessica Grant wrote: > I am just trying this out to see if this is going to work for us, so I am > using the free version - Amazon Linux AMI x86_64 PV - and apt-get didn't > work for me here. > I will try launching an Ubuntu instance instead. > > Thank you for your response! > > Jessica > > > > > On Tue, Feb 4, 2014 at 11:44 AM, Zhigang Wu wrote: > >> What is the Linux distribution of EC2 instance you bring up? If it's >> Debian or Ubuntu, then sudo apt-get install biopython should be sufficient. >> >> The idea is just use whatever package manager available in EC2 instance. >> >> Zhigang >> >> Sent from my iPhone >> >> > On Feb 4, 2014, at 8:09 AM, Jessica Grant wrote: >> > >> > Hello, >> > >> > Has anyone been successful in installing Biopython on an instance of the >> > amazon cloud? If so, can I get some advice? I tried finding an easy >> > install package, but couldn't, so I started to try installing from >> source. >> > I ran into trouble because with setup.py bcause it couldn't find gcc. I >> > am going to try to find and install gcc... >> > >> > Also, will this need to get reinstalled every time I start an instance >> of >> > the cloud? >> > >> > Thanks!! 
>> > >> > Jessica >> > _______________________________________________ >> > Biopython mailing list - Biopython at lists.open-bio.org >> > http://lists.open-bio.org/mailman/listinfo/biopython >> > > From cdshaffer at gmail.com Tue Feb 4 17:52:54 2014 From: cdshaffer at gmail.com (christopher shaffer) Date: Tue, 4 Feb 2014 11:52:54 -0600 Subject: [Biopython] amazon aws Message-ID: Jessica, I am not going to spam the biopython list as this is off topic, but you might want to look at the iPlant collaborative. This is an NSF-funded "cyberinfrastructure" that has an AWS-like service called Atmospheres. It is all free to registered users. They have recently been expanding from plant bioinformatics by adding more support for microbes and animals so there is a good chance they have a machine that has what you need. They appear to be down for maintenance right now, but once they are back up you could check through all the virtual machines and see if any have what you need. I just created an account myself so I am afraid I don't know much more but I was quite impressed with the "overview of iPlant" webinar I attended last week. Chris Shaffer Biology Washington Univ in St. Louis P.S. I have no connection to iPlant except as an interested user. > Date: Tue, 4 Feb 2014 11:09:19 -0500 > From: Jessica Grant > Subject: [Biopython] amazon aws > To: Biopython at lists.open-bio.org > Message-ID: > < > CAOuNqdnHV9GwSQURT7q_drpuH6OSNDjUjzYyv-2gBb4OPzJ5Zw at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > Hello, > > Has anyone been successful in installing Biopython on an instance of the > amazon cloud? If so, can I get some advice? I tried finding an easy > install package, but couldn't, so I started to try installing from source. > I ran into trouble with setup.py because it couldn't find gcc. I > am going to try to find and install gcc... > > Also, will this need to get reinstalled every time I start an instance of > the cloud? > > Thanks!! 
> > Jessica > > From cjfields at illinois.edu Tue Feb 4 18:11:56 2014 From: cjfields at illinois.edu (Fields, Christopher J) Date: Tue, 4 Feb 2014 18:11:56 +0000 Subject: [Biopython] amazon aws In-Reply-To: References: Message-ID: Jessica, I suggest setting up an instance using whatever (*cough*linux*cough*) OS you want; could be Amazon AWS, iPlant (which I think uses OpenStack), or another snapshot-capable cloud service. Install what you need, then take a snapshot of the instance, which in general should store any customizations you made. Maybe look into CloudBioLinux, Scientific Linux, or similar images for a good start in this direction. chris On Feb 4, 2014, at 11:52 AM, christopher shaffer wrote: > Jessica, > I am not going to spam the biopython list as this is off topic, but you > might want to look at the iPlant collaborative. This is an NSF-funded > "cyberinfrastructure" that has an AWS-like service called Atmospheres. It > is all free to registered users. They have recently been expanding from > plant bioinformatics by adding more support for microbes and animals so > there is a good chance they have a machine that has what you need. > > They appear to be down for maintenance right now, but once they are back up > you could check through all the virtual machines and see if any have what > you need. > > I just created an account myself so I am afraid I don't know much more but > I was quite impressed with the "overview of iPlant" webinar I attended last > week. > > Chris Shaffer > Biology > Washington Univ in St. Louis > P.S. I have no connection to iPlant except as an interested user. 
> > >> Date: Tue, 4 Feb 2014 11:09:19 -0500 >> From: Jessica Grant >> Subject: [Biopython] amazon aws >> To: Biopython at lists.open-bio.org >> Message-ID: >> < >> CAOuNqdnHV9GwSQURT7q_drpuH6OSNDjUjzYyv-2gBb4OPzJ5Zw at mail.gmail.com> >> Content-Type: text/plain; charset=ISO-8859-1 >> >> Hello, >> >> Has anyone been successful in installing Biopython on an instance of the >> amazon cloud? If so, can I get some advice? I tried finding an easy >> install package, but couldn't, so I started to try installing from source. >> I ran into trouble with setup.py because it couldn't find gcc. I >> am going to try to find and install gcc... >> >> Also, will this need to get reinstalled every time I start an instance of >> the cloud? >> >> Thanks!! >> >> Jessica >> >> > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Wed Feb 5 16:07:22 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Feb 2014 16:07:22 +0000 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: Hi Edson, I can see where the problem stems from now - it did puzzle me for a while. For this part to make sense, query and sbjct need to be the FULL sequence of the query and the subject (as given to BLAST as input): complete_query_seq += str(query[q_start-1:q_end]) complete_sbjct_seq += str(sbjct[sb_start-1:sb_end]) (I had assumed these variables were set up at the beginning of the file, which is partly why I asked for the full script.) However, via the for loop, you are using hsp.query, hsp.sbjct as query and sbjct. These are the PARTIAL sequences aligned with gap characters. 
This might do what you seemed to want: complete_query_seq += query.replace("-", "") complete_sbjct_seq += sbjct.replace("-", "") However, this will concatenate the fragments within an HSP - any bit of the query or subject which did not align will not be included. Any bit which appears in more than one HSP will be there twice. And also if you're using masking you'll have XXXXX regions in the sequence where the filter said it was low complexity. I would instead get the original unmodified query/subject sequences from the original FASTA files given to BLAST. Peter On Tue, Feb 4, 2014 at 9:12 AM, Edson Ishengoma wrote: > Hi Peter, > > My apology, I have updated the code at > https://gist.github.com/EBIshengoma/efc4ad3e32427891931d to appear exactly > how I run it from my computer. > > Thanks. > From ishengomae at nm-aist.ac.tz Wed Feb 5 17:52:17 2014 From: ishengomae at nm-aist.ac.tz (Edson Ishengoma) Date: Wed, 5 Feb 2014 20:52:17 +0300 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: Hi Peter, Woow, that made my day. Thank you very much and keep up the good work. Regards, Edson On Wed, Feb 5, 2014 at 7:07 PM, Peter Cock wrote: > Hi Edson, > > I can see where the problem stems from now - it did puzzle me for a while. > For this part to make sense, query and sbjct need to be the FULL sequence > of the query and the subject (as given to BLAST as input): > > complete_query_seq += str(query[q_start-1:q_end]) > complete_sbjct_seq += str(sbjct[sb_start-1:sb_end]) > > (I had assumed these variables were set up at the beginning of the file, > which is partly why I asked for the full script.) > > However, via the for loop, you are using hsp.query, hsp.sbjct as query > and sbjct. These are the PARTIAL sequences aligned with gap characters. 
> This might do what you seemed to want: > > complete_query_seq += query.replace("-", "") > complete_sbjct_seq += sbjct.replace("-", "") > > However, this will concatenate the fragments within an HSP - any bit of > the query or subject which did not align will not be included. Any bit > which appears in more than one HSP will be there twice. And also > if you're using masking you'll have XXXXX regions in the sequence > where the filter said it was low complexity. > > I would instead get the original unmodified query/subject sequences > from the original FASTA files given to BLAST. > > Peter > > > On Tue, Feb 4, 2014 at 9:12 AM, Edson Ishengoma > wrote: > > Hi Peter, > > > > My apology, I have updated the code at > > https://gist.github.com/EBIshengoma/efc4ad3e32427891931d to appear > exactly > > how I run it from my computer. > > > > Thanks. > > > From anubhavmaity7 at gmail.com Sun Feb 9 15:05:23 2014 From: anubhavmaity7 at gmail.com (Anubhav Maity) Date: Sun, 9 Feb 2014 20:35:23 +0530 Subject: [Biopython] Fwd: [GSoC] Want to contribute to open-bio for GSOC 2014 In-Reply-To: References: Message-ID: Hi, Thank you, Peter, for your reply. I have set up my GitHub account and have forked the source code. I have built and installed Biopython after reading the README file in the GitHub repository. I want to contribute code to Biopython, and would like some suggestions on where to start. Waiting for your reply. Thanks and Regards, Anubhav ---------- Forwarded message ---------- From: Peter Cock Date: Sat, Feb 8, 2014 at 6:28 PM Subject: Re: [GSoC] Want to contribute to open-bio for GSOC 2014 To: Anubhav Maity Cc: OBF GSoC On Fri, Feb 7, 2014 at 10:33 PM, Anubhav Maity wrote: > Hi, > > I am a BTech student from an Indian university and want to contribute code > for open-bio for GSOC 2014. > I love to code and can code in python. I have studied biology in high > school and have taken biotechnology during my college study. 
> I have looked at the projects of Biopython, i.e. Codon alignment and > analysis, Bio.Phylo: filling in the gaps, and Indexing & Lazy-loading > Sequence Parsers. All the projects are very interesting. I want to > contribute to one of these projects, please help me in getting started. > Waiting for your positive reply. > > Thanks and Regards, > Anubhav Hi Anubhav, Please sign up to the biopython and biopython-dev mailing lists and introduce yourself there too. You will also need a GitHub account to contribute to Biopython development - so you might want to set that up now as well: http://lists.open-bio.org/mailman/listinfo/biopython http://lists.open-bio.org/mailman/listinfo/biopython-dev https://github.com/biopython/biopython Regards, Peter From davidsshin at lbl.gov Mon Feb 10 14:23:58 2014 From: davidsshin at lbl.gov (David Shin) Date: Mon, 10 Feb 2014 06:23:58 -0800 Subject: [Biopython] Summer of Code 2014 - Call for project ideas Re: going from protein to gene to oligos for cloning Message-ID: Hi all - Just another suggestion for the summer of code project.... Going from protein sequences to gene coding regions. With the reduction of costs associated with DNA synthesis and the advent of "buying genes", along with more robust robotics, we are now at a time where many are making large lists of proteins to express for biochemistry, biophysics and structural biology. However, parsing the data available to make choices to refine those lists and then obtaining just the coding regions for the proteins of interest is a little daunting. As discussed previously, finding a protein at NCBI doesn't lend itself readily to getting the gene (coding region) for cloning in an automated fashion. I still haven't tested the code suggested by Peter below, but this could be a cleanup project if it is broken, and/or a similar project could be started from scratch. 
If it seems like something you are interested in, I will test the code sooner, if that's a starting point someone would like to pursue... though I may need to speak to the author first, not sure. Thanks, Dave > Hi Dave, > > The catch here is the protein IDs are not directly usable in the > nucleotide database - which is where ELink (Entrez Link) comes > in, available as the Entrez.elink(...) function in Biopython. > > I've not tried it myself, but a colleague posted a long example > on his blog which sounds close to what you are aiming for: > > > http://armchairbiology.blogspot.co.uk/2013/02/surely-this-has-been-done-already.html > > https://github.com/widdowquinn/scripts/blob/master/bioinformatics/get_NCBI_cds_from_protein.py > > Peter > On Fri, Dec 6, 2013 at 2:24 AM, Peter Cock wrote: > On Fri, Dec 6, 2013 at 7:27 AM, David Shin wrote: > > Hi again, > > > > I'm trying to use biopython to help me grab a lot of protein sequences > that > > will eventually be used as the basis for cloning. I'm almost done > screening > > my protein sequences, and pretty much ok on that part... > > > > I was just curious if anyone has already developed, or has any decent > > advice on going from protein codes to getting the actual coding sequences > > of the genes. > > > > At this point, my plan is to take protein codes (i.e. numbers in > > gi|145323746|) and use these to search entrez nucleotide databases > directly > > to get hits (I have tested it once and it seems to work to get genbank > records... > > then try to use the information inside to get the nucleotide sequences... > > or I guess the other way is to use the top hit from tblastn somehow? > > > > Thanks, > > > > Dave > From vishnuc11j93 at gmail.com Tue Feb 11 08:32:25 2014 From: vishnuc11j93 at gmail.com (Vishnu Chilakamarri) Date: Tue, 11 Feb 2014 14:02:25 +0530 Subject: [Biopython] Adding SVM in biopython Message-ID: Hello, I am currently working on a project to predict the GTP binding sites given an amino acid sequence. 
The classification algorithm I'm using is SVM. As of now I'm using SVM-light and Python's scikit library for classification and evaluating the model. For adding this in Biopython we can use libSVM, as it has a Python interface which can be used for this purpose. I would like to discuss the feasibility of adding this in Biopython's library, and also evaluation metrics such as F1 score and MCC. Thank you, Vishnu Chilakamarri From p.j.a.cock at googlemail.com Tue Feb 11 11:39:46 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 11 Feb 2014 11:39:46 +0000 Subject: [Biopython] Adding SVM in biopython In-Reply-To: References: Message-ID: On Tue, Feb 11, 2014 at 8:32 AM, Vishnu Chilakamarri wrote: > Hello, > > I am currently working on a project to predict the GTP binding sites given > an amino acid sequence. The classification algorithm I'm using is SVM. As > of now I'm using SVM-light and Python's scikit library for classification > and evaluating the model. Hello Vishnu, General machine learning contributions would probably fit better under the scikit libraries than in Biopython - their use goes way beyond just biology after all ;) > For adding this in Biopython we can use libSVM, as > it has a Python interface which can be used for this purpose. I would like > to discuss the feasibility of adding this in Biopython's library ... Given libSVM has a Python interface, what would you be adding? https://github.com/cjlin1/libsvm/tree/master/python > and also evaluation metrics such as F1 score and MCC. > Isn't this already in scikit-learn? http://scikit-learn.org/stable/modules/model_evaluation.html Maybe I've not understood what you are suggesting? 
Regards, Peter From vishnuc11j93 at gmail.com Tue Feb 11 14:55:01 2014 From: vishnuc11j93 at gmail.com (Vishnu Chilakamarri) Date: Tue, 11 Feb 2014 20:25:01 +0530 Subject: [Biopython] Adding SVM in biopython In-Reply-To: References: Message-ID: Hello Peter, You're right, the addition of another machine learning algorithm in Biopython does not seem necessary. Sorry about that. I was actually looking to contribute to Biopython for Google Summer of Code. I was reading about the lazy parsers idea, which seems very interesting. Like you mentioned in the Biopython Wiki, I started reading about tabix and BAM indexing. Formats such as FASTA can be converted to BAM and then indexed using tabix. I read from here about how Tabix works: http://bioinformatics.oxfordjournals.org/content/27/5/718.full . Apart from this, is there any source from where I can learn more about this? Thanks in advance. On Tue, Feb 11, 2014 at 8:12 PM, Peter Cock wrote: > On Tue, Feb 11, 2014 at 2:23 PM, Vishnu Chilakamarri > wrote: > > Hello Peter, > > > > You're right, the addition of another machine learning algorithm in > Biopython > > does not seem necessary. > > Do you want to reply on the list? > > > Sorry about that. I was actually looking to > > contribute to Biopython for Google Summer of Code. I was reading about > the > > lazy parsers idea, which seems very interesting. Like you mentioned in the > > Biopython Wiki, I started reading about tabix and BAM indexing. Formats > such > > as FASTA can be converted to BAM and then indexed using tabix. > > Not quite, you compress the FASTA file using bgzip (which uses > BGZF, a type of GZIP compression). See: > > http://blastedbio.blogspot.co.uk/2011/11/bgzf-blocked-bigger-better-gzip.html > > > I read from here about how Tabix works: > > http://bioinformatics.oxfordjournals.org/content/27/5/718.full . Apart > from > > this, is there any source from where I can learn more about this? Thanks > in > > advance. 
> > For BGZF (used in BAM and tabix), my blog post and the Biopython code: > https://github.com/biopython/biopython/blob/master/Bio/bgzf.py > > Peter > -- Vishnu Chilakamarri +919049437582 Public Relations Team BITSAA B.E. Computer Science + Msc Biological Sciences From jttkim at googlemail.com Tue Feb 11 19:17:47 2014 From: jttkim at googlemail.com (Jan Kim) Date: Tue, 11 Feb 2014 19:17:47 +0000 Subject: [Biopython] Alignment Scores? Message-ID: <20140211191746.GF17385@localhost> Dear All, the EMBOSS "srspair" alignment format includes identity, similarity and gap statistics as well as the alignment score, see [1]. Is this info available from alignment objects as returned by Bio.AlignIO.parse(...).next()? I haven't found anything in the documentation and a peek into a sample object didn't reveal anything either: >>> p = Bio.AlignIO.parse('sa-needle.txt', 'emboss') >>> a = p.next() >>> a.__dict__.keys() ['_records', '_alphabet'] Obviously availability of properties such as (percent) identity etc. will vary with alignment format and type (e.g. some apply only to pairwise alignment), so I was looking for something perhaps like a dictionary of optional additional data, somewhat like the letter_annotations in the SeqRecord class. I'll probably start rolling my own simplistic solution based on a few regular expressions for now -- if this is a crude re-invention of a wheel that's been polished before please let me know, though. Best regards, Jan [1] http://emboss.sourceforge.net/docs/themes/alnformats/align.srspair -- +- Jan T. Kim -------------------------------------------------------+ | email: jttkim at gmail.com | | WWW: http://www.jtkim.dreamhosters.com/ | *-----=< hierarchical systems are for files, not for humans >=-----* From p.j.a.cock at googlemail.com Tue Feb 11 18:25:44 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 11 Feb 2014 18:25:44 +0000 Subject: [Biopython] Alignment Scores? 
In-Reply-To: <20140211191746.GF17385@localhost> References: <20140211191746.GF17385@localhost> Message-ID: On Tue, Feb 11, 2014 at 7:17 PM, Jan Kim wrote: > Dear All, > > the EMBOSS "srspair" alignment format includes identity, similarity and > gap statistics as well as the alignment score, see [1]. Is this info > available from alignment objects as returned by Bio.AlignIO.parse(...).next()? Not currently, no. > Obviously availability of properties such as (percent) identity etc. > will vary with alignment format and type (e.g. some apply only to pairwise > alignment), so I was looking for something perhaps like a dictionary > of optional additional data, somewhat like the letter_annotations in the > SeqRecord class. There's an open issue for something like that on the alignment object... some of the AlignIO parsers hide this kind of thing under a private attribute as a short-term hack. However, read on. > I'll probably start rolling my own simplistic solution based on a few > regular expressions for now -- if this is a crude re-invention of a wheel > that's been polished before please let me know, though. You could tweak the AlignIO parser, but this would fit better as part of EMBOSS pair format support in (the quite new) SearchIO module, where this kind of attribute is expected: http://biopython.org/wiki/SearchIO Regards, Peter From mmokrejs at fold.natur.cuni.cz Thu Feb 13 20:38:34 2014 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Thu, 13 Feb 2014 21:38:34 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO Message-ID: <52FD2D4A.9010300@fold.natur.cuni.cz> Hi, I am in the process of converting to the new XML parsing code written by Bow. 
So far, I have deciphered the following replacement strings (somewhat written in sed(1) format):

/hsp.identities/hsp.ident_num/
/hsp.score/hsp.bitscore/
/hsp.expect/hsp.evalue/
/hsp.bits/hsp.bitscore/
/hsp.gaps/hsp.gap_num/
/hsp.bits/hsp.bitscore_raw/
/hsp.positives/hsp.pos_num/
/hsp.sbjct_start/hsp.hit_start/
/hsp.sbjct_end/hsp.hit_end/
# hsp.query_start # no change from NCBIXML
# hsp.query_end # no change from NCBIXML
/record.query.split()[0]/record.id/
/alignment.hit_def.split(' ')[0]/alignment.hit_id/
/record.alignments/record.hits/
/hsp.align_length/hsp.aln_span/ # I hope these do the same as with NCBIXML (don't remember whether the counts include minus signs of the alignment or not)

Now I am uncertain. There used to be hsp.sbjct_length and alignment.length. I think the former length was including the minus sign for gaps while the latter is just the real length of the query sequence. Nevertheless, what did alignment.length transform into? Into len(hsp.query_all)? I don't think hsp.query_span but who knows. ;) Meanwhile I see my biopython-1.62 doesn't understand hsp.gap_num, looks like that has been added to SearchIO in 1.63. So, that's all from me now until I upgrade. ;) Thank you, Martin From w.arindrarto at gmail.com Thu Feb 13 21:22:13 2014 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Thu, 13 Feb 2014 22:22:13 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: <52FD2D4A.9010300@fold.natur.cuni.cz> References: <52FD2D4A.9010300@fold.natur.cuni.cz> Message-ID: Hi Martin, Here's the 'convention' I use on the length-related attributes in SearchIO's blast parsers: * The 'aln_span' attribute denotes the length of the alignment itself, which means this includes the gap sign ('-'). In Blast, this is always parsed from the file. You're right that this used to be hsp.align_length. * The 'seq_len' attributes denote the length of either the query (in qresult.seq_len) or the hit (in hit.seq_len) sequences, excluding the gaps. 
These are parsed from the BLAST XML file itself. One of these, hit.seq_len, is the one that used to be alignment.length. * 'query_span' and 'hit_span' are always computed by SearchIO (always end coordinate - start coordinate of the query / hit match of the HSP, so they do not count the gap characters). They may or may not be equal to their seq_len counterparts, depending on how much the HSP covers the query / hit sequences. (I couldn't find any reference to sbjct_length in the current codebase, perhaps it was removed some time ago?) Since this is SearchIO, it also applies to other formats as well (e.g. aln_span always counts the gap character). The 'gap_num' error sounds a bit weird, though. If I recall correctly, it should work in 1.62 (it was added very early in the beginning). What problems are you having? Cheers, Bow On Thu, Feb 13, 2014 at 9:38 PM, Martin Mokrejs wrote: > Hi, > I am in the process of conversion to the new XML parsing code written by > Bow. > So far, I have deciphered the following replacement strings (somewhat > written in sed(1) format): > > > /hsp.identities/hsp.ident_num/ > /hsp.score/hsp.bitscore/ > /hsp.expect/hsp.evalue/ > /hsp.bits/hsp.bitscore/ > /hsp.gaps/hsp.gap_num/ > /hsp.bits/hsp.bitscore_raw/ > /hsp.positives/hsp.pos_num/ > /hsp.sbjct_start/hsp.hit_start/ > /hsp.sbjct_end/hsp.hit_end/ > # hsp.query_start # no change from NCBIXML > # hsp.query_end # no change from NCBIXML > /record.query.split()[0]/record.id/ > /alignment.hit_def.split(' ')[0]/alignment.hit_id/ > /record.alignments/record.hits/ > > /hsp.align_length/hsp.aln_span/ # I hope these do the same as with NCBIXML > (don't remember whether the counts include minus signs of the alignment or > not) > > > > > Now I am uncertain. There used to be hsp.sbjct_length and alignment.length. > I think the former length was including the minus sign for gaps while the > latter is just the real length of the query sequence. > > Nevertheless, what did alignment.length transform into? 
Into > len(hsp.query_all)? I don't think hsp.query_span but who knows. ;) > > Meanwhile I see my biopython-1.62 doesn't understand hsp.gap_num, looks like that > has been added to SearchIO in 1.63. So, that's all from me now until I > upgrade. ;) > > Thank you, > Martin > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From mmokrejs at fold.natur.cuni.cz Thu Feb 13 21:46:51 2014 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Thu, 13 Feb 2014 22:46:51 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: References: <52FD2D4A.9010300@fold.natur.cuni.cz> Message-ID: <52FD3D4B.8040602@fold.natur.cuni.cz> Hi Bow, thank you for the thorough guidance. Comments interleaved. Wibowo Arindrarto wrote: > Hi Martin, > > Here's the 'convention' I use on the length-related attributes in > SearchIO's blast parsers: > > * The 'aln_span' attribute denotes the length of the alignment itself, > which means this includes the gap sign ('-'). In Blast, this is > always parsed from the file. You're right that this used to be > hsp.align_length. > > * The 'seq_len' attributes denote the length of either the query (in > qresult.seq_len) or the hit (in hit.seq_len) sequences, excluding the > gaps. These are parsed from the BLAST XML file itself. One of these, > hit.seq_len, is the one that used to be alignment.length. How about record.seq_len in SearchIO, isn't that the same as well? At least I am hoping that it is the length (163 below) of the original query sequence, stored in the XML input file. Having access to its value from under the hsp object would be the best for me. > * 'query_span' and 'hit_span' are always computed by SearchIO (always > end coordinate - start coordinate of the query / hit match of the HSP, > so they do not count the gap characters). 
They may or may not be equal > to their seq_len counterparts, depending on how much the HSP covers > the query / hit sequences. I hope you wanted to say "end - start + 1" ;-) > > (I couldn't find any reference to sbjct_length in the current > codebase, perhaps it was removed some time ago?) I have a feeling that either blast or biopython used subjct_* with the 'u' in the name. > Since this is SearchIO, it also applies to other formats as well (e.g. > aln_span always counts the gap character). Fine with me, I need both values describing the length of the region covered by the HSP, with and without the minus signs. > The 'gap_num' error sounds a bit weird, though. If I recall correctly, > it should work in 1.62 (it was added very early in the beginning). > What problems are you having? if str(_hsp.gap_num) == '(None, None)': .... AttributeError: 'HSP' object has no attribute 'gap_num' Here is the hsp object structure: _hsp=['_NON_STICKY_ATTRS', '__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__module__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_aln_span_get', '_get_coords', '_hit_end_get', '_hit_inter_ranges_get', '_hit_inter_spans_get', '_hit_range_get', '_hit_span_get', '_hit_start_get', '_inter_ranges_get', '_inter_spans_get', '_items', '_query_end_get', '_query_inter_ranges_get', '_query_inter_spans_get', '_query_range_get', '_query_span_get', '_query_start_get', '_str_hsp_header', '_transfer_attrs', '_validate_fragment', 'aln', 'aln_all', 'aln_annotation', 'aln_annotation_all', 'aln_span', 'alphabet', 'bitscore', 'bitscore_raw', 'evalue', 'fragment', 'fragments', 'hit', 'hit_all', 'hit_description', 'hit_end', 'hit_end_all', 'hit_features', 'hit_features_all', 'hit_frame', 'hit_frame_all', 'hit_id', 'hit_inter_ranges',
'hit_inter_spans', 'hit_range', 'hit_range_all', 'hit_span', 'hit_span_all', 'hit_start', 'hit_start_all', 'hit_strand', 'hit_strand_all', 'ident_num', 'is_fragmented', 'pos_num', 'query', 'query_all', 'query_description', 'query_end', 'query_end_all', 'query_features', 'query_features_all', 'query_frame', 'query_frame_all', 'query_id', 'query_inter_ranges', 'query_inter_spans', 'query_range', 'query_range_all', 'query_span', 'query_span_all', 'query_start', 'query_start_all', 'query_strand', 'query_strand_all'] And eventually if that matters, the super-parent/blast record object: ['_NON_STICKY_ATTRS', '_QueryResult__marker', '__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__module__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_blast_id', '_description', '_hit_key_function', '_id', '_items', '_transfer_attrs', 'absorb', 'append', 'description', 'fragments', 'hit_filter', 'hit_keys', 'hit_map', 'hits', 'hsp_filter', 'hsp_map', 'hsps', 'id', 'index', 'items', 'iterhit_keys', 'iterhits', 'iteritems', 'param_evalue_threshold', 'param_filter', 'param_gap_extend', 'param_gap_open', 'param_score_match', 'param_score_mismatch', 'pop', 'program', 'reference', 'seq_len', 'sort', 'stat_db_len', 'stat_db_num', 'stat_eff_space', 'stat_entropy', 'stat_hsp_len', 'stat_kappa', 'stat_lambda', 'target', 'version'] A new comment: The off-by-one change in SearchIO only complicates matters for me, so I immediately fix it to natural numbering, via: _query_start = hsp.query_start + 1 _hit_start = hsp.hit_start + 1 I know we talked about this in the past and this is just to say that I did not change my mind here. 
;) Same with SffIO although there are two reasons for off-by-one numberings, one due to the SFF specs but the other is likewise, to keep in sync with pythonic numbering. These always caused more trouble to me than anything good. Any values I have in variables are 1-based and in the few cases I need to do python slicing, I adjust appropriately, but in remaining cases I am always printing or storing the 1-based values. So, this concept ( http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec114 ) is only for the sake of being pythonic, but bad for users. Thanks, Martin > > Cheers, > Bow > > On Thu, Feb 13, 2014 at 9:38 PM, Martin Mokrejs > wrote: >> Hi, >> I am in the process of conversion to the new XML parsing code written by >> Bow. >> So far, I have deciphered the following replacement strings (somewhat >> written in sed(1) format): >> >> >> /hsp.identities/hsp.ident_num/ >> /hsp.score/hsp.bitscore/ >> /hsp.expect/hsp.evalue/ >> /hsp.bits/hsp.bitscore/ >> /hsp.gaps/hsp.gap_num/ >> /hsp.bits/hsp.bitscore_raw/ >> /hsp.positives/hsp.pos_num/ >> /hsp.sbjct_start/hsp.hit_start/ >> /hsp.sbjct_end/hsp.hit_end/ >> # hsp.query_start # no change from NCBIXML >> # hsp.query_end # no change from NCBIXML >> /record.query.split()[0]/record.id/ >> /alignment.hit_def.split(' ')[0]/alignment.hit_id/ >> /record.alignments/record.hits/ >> >> /hsp.align_length/hsp.aln_span/ # I hope these do the same as with NCBIXML >> (don't remember whether the counts include minus signs of the alignment or >> not) >> >> >> >> >> Now I am uncertain. There used to be hsp.sbjct_length and alignment.length. >> I think the former length was including the minus sign for gaps while the >> latter is just the real length of the query sequence. >> >> Nevertheless, what did alignment.length transform into? Into >> len(hsp.query_all)? I don't think hsp.query_span but who knows.
;) >> >> >> >> Meanwhile I see my biopython-1.62 doesn't understand hsp.gap_num, looks that >> has been added to SearchIO in 1.63. so, that's all from me now until I >> upgrade. ;) >> >> >> Thank you, >> Martin >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > From mmokrejs at fold.natur.cuni.cz Thu Feb 13 22:06:44 2014 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Thu, 13 Feb 2014 23:06:44 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: <52FD3D4B.8040602@fold.natur.cuni.cz> References: <52FD2D4A.9010300@fold.natur.cuni.cz> <52FD3D4B.8040602@fold.natur.cuni.cz> Message-ID: <52FD41F4.8080301@fold.natur.cuni.cz> Martin Mokrejs wrote: > Hi Bow, > thank you for thorough guidance. Comments interleaved. > > Wibowo Arindrarto wrote: >> Hi Martin, >> >> Here's the 'convention' I use on the length-related attributes in >> SearchIO's blast parsers: >> >> * 'aln_span' attribute denote the length of the alignment itself, >> which means this includes the gaps sign ('-'). In Blast, this is >> always parsed from the file. You're right that this used to be >> hsp.align_length. >> >> * 'seq_len' attributes denote the length of either the query (in >> qresult.seq_len) or the hit (in hit.seq_len) sequences, excluding the >> gaps. These are parsed from the BLAST XML file itself. One of these, >> hit.seq_len, is the one that used to be alignment.length. > > How about record.seq_len in SearchIO, isn't that same as well? At least > I am hoping that the length (163 below) of the original query sequence, stored in > > 163 > > of the XML input file. Having access to its value from under hsp object would be the best for me. > > >> * 'query_span' and 'hit_span' are always computed by SearchIO (always >> end coordinate - start coordinate of the query / hit match of the HSP, >> so they do not count the gap characters). 
They may or may not be equal >> to their seq_len counterparts, depending on how much the HSP covers >> the query / hit sequences. > > I hope you wanted to say "end - start + 1" ;-) > >> >> (I couldn't find any reference to sbjct_length in the current >> codebase, perhaps it was removed some time ago?) > > I have a feeling that either blast or biopython used subjct_* with the 'u' in the name. > > >> Since this is SearchIO, it also applies to other formats as well (e.g. >> aln_span always counts the gap character). > > Fine with me, I need both values describing the length of the region covered by the HSP, with and without the minus signs. > > >> The 'gap_num' error sounds a bit weird, though. If I recall correctly, >> it should work in 1.62 (it was added very early in the beginning). >> What problems are you having? > > > if str(_hsp.gap_num) == '(None, None)': > .... > AttributeError: 'HSP' object has no attribute 'gap_num' Yeah, I know why. You told me once ( https://github.com/biopython/biopython/issues/222 ) that it is optional. Indeed, the XML file lacks in this case the <Hsp_gaps> section. Actually, this old silly test for (None, None) is in my code just because of that bug. I would prefer if SearchIO provided hsp.gap_num == None and likewise for the other, optional attributes to sanitize the blast XML output with some default values. I use None for such cases so that if an integer is later expected python chokes on the None value, which is good. Mostly I only check whether the variable returns true or false so the None default is ok for me. Alternatively, I have to check the dictionary of hsp whether it contains gap_num, which is inconvenient.
Martin From w.arindrarto at gmail.com Thu Feb 13 22:13:36 2014 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Thu, 13 Feb 2014 23:13:36 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: <52FD3D4B.8040602@fold.natur.cuni.cz> References: <52FD2D4A.9010300@fold.natur.cuni.cz> <52FD3D4B.8040602@fold.natur.cuni.cz> Message-ID: Hi Martin, >> Here's the 'convention' I use on the length-related attributes in >> SearchIO's blast parsers: >> >> * 'aln_span' attribute denote the length of the alignment itself, >> which means this includes the gaps sign ('-'). In Blast, this is >> always parsed from the file. You're right that this used to be >> hsp.align_length. >> >> * 'seq_len' attributes denote the length of either the query (in >> qresult.seq_len) or the hit (in hit.seq_len) sequences, excluding the >> gaps. These are parsed from the BLAST XML file itself. One of these, >> hit.seq_len, is the one that used to be alignment.length. > > > How about record.seq_len in SearchIO, isn't that same as well? At least > I am hoping that the length (163 below) of the original query sequence, > stored in > > 163 > > of the XML input file. Having access to its value from under hsp object > would be the best for me. if by 'record' you're referring to the top-most container (the QueryResult), then record.seq_len denotes the length of the full query sequence. This may or may not be the same as hit.seq_len. I did not choose to store it under the HSP object, for the following reason: the HSP object is never meant to be used alone, always with Hit and QueryResult. So whenever one has access to an HSP, he/she must also have access to the containing Hit and QueryResult. Since the seq_len are attributes common to all HSPs (originating from the hit/query sequences), storing them in Hit and QueryResult objects seems most appropriate.
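The practical consequence is that both lengths are always within reach while looping. Here is a sketch of the traversal with duck-typed stand-ins (FakeHSP, FakeHit and FakeQueryResult are illustrative only, not Biopython classes — in real code the outer object would come from SearchIO.parse, where iterating a QueryResult yields Hits and iterating a Hit yields HSPs):

```python
def hsps_with_context(qresult):
    """Yield (query_length, hit_length, hsp) triples.

    Works because every HSP is reached through its containing
    QueryResult and Hit, so their seq_len attributes are in scope.
    """
    for hit in qresult:
        for hsp in hit:
            yield qresult.seq_len, hit.seq_len, hsp


# Minimal stand-ins for demonstration only:
class FakeHSP(object):
    pass


class FakeHit(object):
    seq_len = 1776  # length of the hit sequence, without gaps

    def __iter__(self):
        return iter([FakeHSP()])


class FakeQueryResult(object):
    seq_len = 163  # length of the full query sequence

    def __iter__(self):
        return iter([FakeHit()])


triples = list(hsps_with_context(FakeQueryResult()))
```

So a function that only receives an HSP can instead receive (or yield) the triple, keeping the container lengths alongside it.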
>> * 'query_span' and 'hit_span' are always computed by SearchIO (always >> end coordinate - start coordinate of the query / hit match of the HSP, >> so they do not count the gap characters). They may or may not be equal >> to their seq_len counterparts, depending on how much the HSP covers >> the query / hit sequences. > > > I hope you wanted to say "end - start + 1" ;-) This is related to your comment below, I think. For better or worse, we needed to adhere to one consistent indexing and numbering system. Python's system was chosen based on the fact that anyone using Biopython should be (or is already) familiar with it and that SearchIO aims to unify all the different coordinate systems that different programs use. Of course you'll notice that the consequence of this system is that one can calculate the length (or span, really) of the hit / query sequences by computing `end - start` instead of `end - start + 1` :). >> (I couldn't find any reference to sbjct_length in the current >> codebase, perhaps it was removed some time ago?) > > > I have the feelings that either blast or biopython used subjct_* with the > 'u' in the name. Couldn't find that either :/.. >> The 'gap_num' error sounds a bit weird, though. If I recall correctly, >> it should work in 1.62 (it was added very early in the beginning). >> What problems are you having? > > (pasting the comment from your other email) >> if str(_hsp.gap_num) == '(None, None)': >> .... >> AttributeError: 'HSP' object has no attribute 'gap_num' > > > Yeah, I know why. You told me once ( > https://github.com/biopython/biopython/issues/222 ) that it is optional. > Indeed, the XML file lacks in this case the section. Actually, > this old silly test for (None, None) is in my code just because of that bug. > I would prefer if SearchIO provided > > hsp.gap_num == None > > and likewise for the other, optional attributes to sanitize the blast XML > output with some default values.
I use None for such cases so that if an > integer is later expected python chokes on the None value, which is good. > Mostly I only check is the variable returns true or false so the None > default is ok for me. > > alternatively, I have to check the dictionary of hsp whether it contains > gap_num, which is inconvenient. Guess you solved it. But yeah, I was a bit ambivalent on the issue on whether to note missing attributes as None or simply nothing (as in, not having the attribute at all). To me (others, feel free to weigh in here), having it store nothing at all seems more preferred. If the former is chosen, the only way to be consistent is to store all other attributes from other search programs (e.g. HMMER's parameter in a BLAST HSP) as None (otherwise we use None for one missing attribute and not for the other?). This seems a bit cumbersome, so I chose to store nothing at all. > A new comment: > > The off-by-one change in SearchIO only complicates matters for me, so I > immediately fix it to natural numbering, via: > > _query_start = hsp.query_start + 1 > _hit_start = hsp.hit_start + 1 > > I know we talked about this in the past and this is just to say that I did > not change my mind here. ;) Same with SffIO although there are two reason > for off-by-one numberings, one due to the SFF specs but the other is > likewise, to keep in sync with pythonic numbering. These always caused more > troubles to me than anything good. Any values I have in variables are > 1-based and in the few cases I need to do python slicing, I adjust > appropriately, but in remaining cases I am always printing or storing the > 1-based values. So, this concept ( > http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec114 ) is only for > the sake of being pythonic, but bad for users. This was addressed above :). 
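To make the arithmetic concrete, here is a standalone sketch with made-up coordinates (not taken from any real parse, and nothing here needs Biopython): SearchIO reports a 0-based start and an exclusive end, Python slicing style, so spans come out without any "+ 1".

```python
# Made-up HSP coordinates in SearchIO's convention:
# 0-based start, exclusive (half-open) end, Python slicing style.
query_start, query_end = 0, 184

# The span needs no "+ 1" because the end is exclusive:
query_span = query_end - query_start  # 184

# Converting to 1-based, inclusive numbers for display:
first_residue = query_start + 1  # 1
last_residue = query_end         # 184; the end is already the last residue

# The same coordinates slice the sequence directly, with no adjustment:
sequence = "M" * 400
fragment = sequence[query_start:query_end]  # 184 characters
```

The point of the convention is the last line: the parsed coordinates feed straight into slicing, and only display code ever adds the 1.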
Cheers, Bow From mmokrejs at fold.natur.cuni.cz Thu Feb 13 22:37:38 2014 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Thu, 13 Feb 2014 23:37:38 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: References: <52FD2D4A.9010300@fold.natur.cuni.cz> <52FD3D4B.8040602@fold.natur.cuni.cz> Message-ID: <52FD4932.1060407@fold.natur.cuni.cz> Hi Bow, Wibowo Arindrarto wrote: > Hi Martin, > >>> Here's the 'convention' I use on the length-related attributes in >>> SearchIO's blast parsers: >>> >>> * 'aln_span' attribute denote the length of the alignment itself, >>> which means this includes the gaps sign ('-'). In Blast, this is >>> always parsed from the file. You're right that this used to be >>> hsp.align_length. >>> >>> * 'seq_len' attributes denote the length of either the query (in >>> qresult.seq_len) or the hit (in hit.seq_len) sequences, excluding the >>> gaps. These are parsed from the BLAST XML file itself. One of these, >>> hit.seq_len, is the one that used to be alignment.length. >> >> >> How about record.seq_len in SearchIO, isn't that same as well? At least >> I am hoping that the length (163 below) of the original query sequence, >> stored in >> >> 163 >> >> of the XML input file. Having access to its value from under hsp object >> would be the best for me. > > if by 'record' you're referring to the top-most container (the > QueryResult), then record.seq_len denotes the length of the full query > sequence. This may or may not be the same as hit.seq_len. > > I did not choose to store it under the HSP object, for the following > reasons because the HSP object is never meant to be used alone, always > with Hit and QueryResult. So whenever one has access to an HSP, he/she > must also have access to the containing Hit and QueryResult. Since the > seq_len are attributes common to all HSPs (originating from the > hit/query sequences), storing them in Hit and QueryResult objects > seems most appropriate. 
So far I had in one of my functions only the hsp object and from it I accessed hsp.align_length. Due to the transition to SearchIO I have to modify the function so that it has access to record.seq_len (or QueryResult as you say). Yes, I did it now, but please consider that some functionality is missing. I don't mind my own API change but others might be concerned. I believe I want record.seq_len and not rely on hit.seq_len. I am not sure if we are talking about the same thing, but my testsuite will complain once the code compiles. > >>> * 'query_span' and 'hit_span' are always computed by SearchIO (always >>> end coordinate - start coordinate of the query / hit match of the HSP, >>> so they do not count the gap characters). They may or may not be equal >>> to their seq_len counterparts, depending on how much the HSP covers >>> the query / hit sequences. >> >> >> I hope you wanted to say "end - start + 1" ;-) > > This is related to your comment below, I think. For better or worse, Damn, right, in this case 4-1+1 = 4-0 ;) > we needed to adhere to one consistent indexing and numbering system. > Python's system was chosen based on the fact that anyone using > Biopython should be (or is already) familiar with them and that > SearchIO aims to unify all the different coordinate system that > different programs use. Of course you'll notice that the consequence > of this system is that one can calculate the length (or span, really) > of the hit / query sequences by computing `end -start` instead of `end > - start + 1` :). Well, took me a while. ;) > >>> (I couldn't find any reference to sbjct_length in the current >>> codebase, perhaps it was removed some time ago?) >> >> >> I have the feelings that either blast or biopython used subjct_* with the >> 'u' in the name. > > Couldn't find that either :/.. > >>> The 'gap_num' error sounds a bit weird, though. If I recall correctly, >>> it should work in 1.62 (it was added very early in the beginning). >>> What problems are you having?
>> >> > > (pasting the comment from your other email) > >>> if str(_hsp.gap_num) == '(None, None)': >>> .... >>> AttributeError: 'HSP' object has no attribute 'gap_num' >> >> >> Yeah, I know why. You told me once ( >> https://github.com/biopython/biopython/issues/222 ) that it is optional. >> Indeed, the XML file lacks in this case the section. Actually, >> this old silly test for (None, None) is in my code just because of that bug. >> I would prefer if SearchIO provided >> >> hsp.gap_num == None >> >> and likewise for the other, optional attributes to sanitize the blast XML >> output with some default values. I use None for such cases so that if an >> integer is later expected python chokes on the None value, which is good. >> Mostly I only check is the variable returns true or false so the None >> default is ok for me. >> >> alternatively, I have to check the dictionary of hsp whether it contains >> gap_num, which is inconvenient. > > Guess you solved it. But yeah, I was a bit ambivalent on the issue on > whether to note missing attributes as None or simply nothing (as in, > not having the attribute at all). To me (others, feel free to weigh in > here), having it store nothing at all seems more preferred. If the > former is chosen, the only way to be consistent is to store all other > attributes from other search programs (e.g. HMMER's parameter in a > BLAST HSP) as None (otherwise we use None for one missing attribute > and not for the other?). This seems a bit cumbersome, so I chose to > store nothing at all. I will see in how many places I have to wrap access to any of these three (or maybe more) optional values and guard them with an extra if conditional. I think I will just carelessly force my own defaults; that will keep the code shorter and easier to read. I understand your concern about defining defaults for all possible values but I have opposite opinions. Let's see what others say.
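One way to keep those call sites short in the meantime is getattr() with a default, which avoids both try/except and dictionary lookups. A minimal sketch with a stand-in class (FakeHSP is illustrative, not a Biopython type), using the same fallback computation as for legacy XML output that omits the gaps element:

```python
class FakeHSP(object):
    """Stand-in for an HSP parsed from legacy BLAST XML where the
    optional gaps element was absent, so gap_num was never set."""

    ident_num = 150
    aln_span = 163
    # note: no gap_num attribute at all


hsp = FakeHSP()

# getattr() returns the third argument instead of raising AttributeError:
gap_num = getattr(hsp, "gap_num", hsp.aln_span - hsp.ident_num)  # 13
```

The fallback expression is only evaluated for the default; when the attribute exists, the parsed value wins.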
The "good" thing is that now hsp.gap_num does not exist while before hsp.gaps was (None, None) hence the tests for True succeeded. Now the code breaks, cool. :)) Martin From mmokrejs at fold.natur.cuni.cz Fri Feb 14 22:57:25 2014 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Fri, 14 Feb 2014 23:57:25 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: <52FD4932.1060407@fold.natur.cuni.cz> References: <52FD2D4A.9010300@fold.natur.cuni.cz> <52FD3D4B.8040602@fold.natur.cuni.cz> <52FD4932.1060407@fold.natur.cuni.cz> Message-ID: <52FE9F55.4040508@fold.natur.cuni.cz> Hi Bow, regarding the missing .gap_num attributes and likewise other ... I believe it is reasonable for BLAST XML output to omit them to save some space if there are just no gaps in the alignment or identity is 100%, etc. However, objects instantiated while parsing should have them. I don't like having some instances of the same object with more attributes while some have less. I don't mind having a global hook in SearchIO forcing this strict mode and affecting default parameters inherited from blast-result related classes while parsing XML. Another issue I see now is that I used to poke over two iterators in a while loop. I was checking that each of the iterators returned a result object (evaluating as True). The reason for this ugly-ness was/is two-fold: 1. "for blah in zip(iter1, iter2):" would only poke over the same length of items but I wanted to make sure iter1 and iter2 did NOT have, accidentally, different lengths. One of the iterators was from the XML output stream and it was expensive to calculate the number of entries in an extra sweep. The iter2 could be counted for a number of its items rather cheaply. However, outside biopython I could grep through the XML stream. 2.
Second reason for the ugly checks for _record evaluating as True was because blastall interleaves the XML stream with dummy entries (which evaluate as False object from NCBIXML.parse()) and also, from time to time, blastall places into the stream the very first result. So, I used to check that _record.id is not the same as the _record.id I got when I just started parsing the XML stream (I cache the very first result id, how ugly, right?). Both issues I already mentioned in biopython's bugzilla and this email list and notably, notified NCBI. Unfortunately, they answered they won't fix any of these (look into archives of this biopython list about a year ago or so?). Back to the NCBIXML.parse() to SearchIO.parse() transition. It seemed I could have replaced if _record: ... with if _record.id: .... but that is unnecessarily expensive because python must get much deeper into the object. Unfortunately, this won't help me to deal with "empty" objects created by SearchIO when no match was found. I am talking about this XML section resulting in an object evaluating as False but _record.id gives 'FL40XAE01A1L3P': 2 lcl|2_0 FL40XAE01A1L3P length=127 xy=0311_1171 region=1 run=R_2008_12_17_15_21_00_ 374 99 47536 0 0 0.41 0.625 0.78 No hits found Here is the same through SearchIO:

>>> _record = _blastn_iterator.next()
>>> print _record
Program: blastn (2.2.26)
  Query: FL40XAE01A1L3P (374)
         length=127 xy=0311_1171 region=1 run=R_2008_12_17_15_21_00_
 Target: queries.fasta queries2.fasta
   Hits: 0
>>> if _record:
...     print "true"
... else:
...     print "false"
...
false

I understand that the object evaluates as False because it has no sequence and therefore appears to be "empty", but it is a real result. I understand you want to follow some universal logic of biopython about empty/non-empty objects but I don't think in this case it is a good idea. Or do you want me to check for _record.hits evaluating as True?
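For what it's worth, the falsy behaviour is just Python's container protocol at work: any object whose __len__ returns 0 evaluates as False, whatever its other attributes hold. A minimal stand-in (FakeQueryResult is illustrative, not Biopython's class) showing why testing the id still distinguishes the cases:

```python
class FakeQueryResult(object):
    """Container-like stand-in: truthiness follows __len__, just as
    with a container-style result object holding a list of hits."""

    def __init__(self, id, hits):
        self.id = id
        self.hits = hits

    def __len__(self):
        return len(self.hits)


empty = FakeQueryResult("FL40XAE01A1L3P", [])
# bool(empty) is False (zero hits), yet empty.id is still a truthy
# string, so "if record.id:" and "if record.hits:" can tell apart
# "valid result with no hits" from "no result at all".
```

(On Python 2 the same fallback happens through __nonzero__ deferring to __len__.)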
In my original pseudocode I had

if _record:
    # either a match was found
    # or no match was found but the object is valid and evaluates as True
else:
    # reached EOF
    # or
    # reached broken XML item interleaved in the stream (just ignore the crap)

which would read now:

if _record.id:
    if _record.hits:
        # a match was found
    else:
        # no match was found
else:
    # reached EOF
    # reached broken XML item interleaved in the stream (just ignore the crap)

Looks like I can accomplish what I used to have, but I would like to know your opinion and coding style advice before I get on my way. ;-) Thank you, Martin From p.j.a.cock at googlemail.com Sat Feb 15 12:25:45 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 15 Feb 2014 12:25:45 +0000 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: <52FE9F55.4040508@fold.natur.cuni.cz> References: <52FD2D4A.9010300@fold.natur.cuni.cz> <52FD3D4B.8040602@fold.natur.cuni.cz> <52FD4932.1060407@fold.natur.cuni.cz> <52FE9F55.4040508@fold.natur.cuni.cz> Message-ID: On Fri, Feb 14, 2014 at 10:57 PM, Martin Mokrejs wrote: > > Another issue I see now that I used to poke over two iterators in a while > loop. I was checking that each of the iterators returned a result object > (evaluating as True). With some of the BLAST output formats (e.g. tabular), if a query had no records it will not appear in the output at all - and so if you iterate over it, there will be fewer results than if you iterated over the query FASTA file. Similarly, if you had several BLAST files for the same query (e.g. against different databases) they might be missing results for different queries. In this kind of situation, a single loop using zip(...) isn't going to work. However, it would be a nice match to SearchIO.index(...) I think. e.g.
Something like this (untested):

from Bio import SeqIO
from Bio import SearchIO

blast_index = SearchIO.index(blast_file, blast_format)
for query_seq_record in SeqIO.parse(query_file, "fasta"):
    query_id = query_seq_record.id
    if query_id not in blast_index:
        # BLAST format where empty results are missing
        # e.g. BLAST tabular
        continue
    query_result = blast_index[query_id]
    if not query_result.hits:
        # BLAST result with no hits, e.g. BLAST text
        continue
    print("Have hits for %s" % query_id)

Peter From mmokrejs at fold.natur.cuni.cz Sat Feb 15 16:28:18 2014 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Sat, 15 Feb 2014 17:28:18 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: <52FD2D4A.9010300@fold.natur.cuni.cz> References: <52FD2D4A.9010300@fold.natur.cuni.cz> Message-ID: <52FF95A2.7070102@fold.natur.cuni.cz> Martin Mokrejs wrote: > Hi, > I am in the process of conversion to the new XML parsing code written by Bow. > So far, I have deciphered the following replacement strings (somewhat written in sed(1) format): > > > /hsp.identities/hsp.ident_num/ > /hsp.score/hsp.bitscore/ > /hsp.expect/hsp.evalue/ > /hsp.bits/hsp.bitscore/ > /hsp.gaps/hsp.gap_num/ > /hsp.bits/hsp.bitscore_raw/ Aside from the fact that I pasted the _hsp.bits line twice, my guess was wrong. The code works now but needed the following changes from NCBIXML to SearchIO names: /_hsp.score/_hsp.bitscore_raw/ /_hsp.bits/_hsp.bitscore/ > /hsp.positives/hsp.pos_num/ > /hsp.sbjct_start/hsp.hit_start/ > /hsp.sbjct_end/hsp.hit_end/ > # hsp.query_start # no change from NCBIXML > # hsp.query_end # no change from NCBIXML > /record.query.split()[0]/record.id/ > /alignment.hit_def.split(' ')[0]/alignment.hit_id/ > /record.alignments/record.hits/ > > /hsp.align_length/hsp.aln_span/ # I hope these do the same as with NCBIXML (don't remember whether the counts include minus signs of the alignment or not) > > > > > Now I am uncertain. There used to be hsp.sbjct_length and alignment.length.
I think the former length was including the minus sign for gaps while the latter is just the real length of the query sequence. > > Nevertheless, what did alignment.length transform into? Into len(hsp.query_all)? I don't think hsp.query_span but who knows. ;) Answering myself: /alignment.hit_id/alignment.id/ /alignment.length/_record.hits[0].seq_len/ Other changes: /_hsp.sbjct/_hsp.hit.seq.tostring()/ # aligned sequence including dashes [ATGCNatgcn-] /_hsp.query/_hsp.query.seq.tostring()/ # aligned sequence including dashes [ATGCNatgcn-] /_hsp.match/_hsp.aln_annotation['homology']/ # e.g. '||||||||||||||||||||||||||||||||||| |||||||||| | ||| || ||||||| |||||' I think the dictionary key would have been better named "similarity". The strand does not translate simply to SearchIO, one needs to do: /_hsp.strand/(_hsp.query_strand, _hsp.hit_strand)/ # the tuple will be e.g. (1, 1) while I think it used to be under NCBIXML as either ('Plus', 'Plus'), ('Plus', 'Minus'), (None, None), etc. > > > > Meanwhile I see my biopython-1.62 doesn't understand hsp.gap_num, looks that has been added to SearchIO in 1.63. so, that's all from me now until I upgrade.
;) I got around it with try/except, although it is more expensive than the previously sufficient if/else tests:

# undo the off-by-one change in SearchIO and transform back to real-life numbers
_hit_start = _hsp.hit_start + 1
_query_start = _hsp.query_start + 1
try:
    _ident_num = _hsp.ident_num
except AttributeError:
    _ident_num = 0
try:
    _pos_num = _hsp.pos_num
except AttributeError:
    _pos_num = 0
try:
    _gap_num = _hsp.gap_num
except AttributeError:
    # calculate gaps count missing sometimes in legacy blast XML output
    # see also https://redmine.open-bio.org/issues/3363 saying that also
    # _multimer_hsp_identities and _multimer_hsp_positives are affected
    _gap_num = _hsp.aln_span - _ident_num

So far I can conclude that by the transition from NCBIXML to SearchIO I got a 30% wallclock speedup, but the most important thing for me will be whether it saves me memory used for parsing of huge XML files (>100GB uncompressed). That I don't know yet, am still testing. Martin From vishnuc11j93 at gmail.com Sun Feb 16 03:39:58 2014 From: vishnuc11j93 at gmail.com (Vishnu Chilakamarri) Date: Sun, 16 Feb 2014 09:09:58 +0530 Subject: [Biopython] Using Tabix on a bgzf file Message-ID: Hi Peter, I read your code on bgzf compression and the blog post. I used uniprot_sprot_varsplic.fasta.gz as the example (from the EBI ftp) to compress in bgzf and then index using Tabix. Now the file I've gotten has a .tbi extension. I'm trying to parse the file but it gives a 'preset not provided' error and when I'm trying to access columns I'm getting an 'indexes overlap' error. Can you tell me where I've gone wrong? Thank you, Vishnu From jordan.r.willis at Vanderbilt.Edu Sun Feb 16 06:49:19 2014 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Sun, 16 Feb 2014 06:49:19 +0000 Subject: [Biopython] extra annotations for phyla tree Message-ID: Hi, First off, whomever wrote the DistanceTree and DistanceMatrix Calculator... hat's off! I have been looking for an easy way to do custom distance matrices for a while. Wow.
Anyway, I noticed you can add some extra annotations to your leaves by converting your tree into a PhyloXML. I was wondering if there are ways to color branches and adjust thickness to highlight branches of interest. I know you can simply open the trees in other programs like Dendroscope and color them manually, but you can imagine a scenario where you have thousands of trees to compare etc. Jordan From p.j.a.cock at googlemail.com Sun Feb 16 14:32:58 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 16 Feb 2014 14:32:58 +0000 Subject: [Biopython] Using Tabix on a bgzf file In-Reply-To: References: Message-ID: On Sunday, February 16, 2014, Vishnu Chilakamarri wrote: > Hi Peter, > > I read your code on bgzf compression and the blog post. I used > uniprot_sprot_varsplic.fasta.gz ( ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot_varsplic.fasta.gz ) as > the example (from the EBI ftp) to compress in bgzf and then index > using > Tabix. Now the file I've gotten has a .tbi extension. I'm trying to parse > the file but gives a preset not provided error and when I'm trying to > access columns I'm getting indexes overlap error. Can you tell me where > I've gone wrong? > > Thank you, > Vishnu > > Biopython doesn't (currently) use the tabix index (*.tbi) file. Biopython's Bio.SeqIO indexing code uses the BGZF compressed sequence file directly. Using the SeqIO.index(...) function will make an in-memory index, while SeqIO.index_db(...) will make an index on disk using SQLite. This system is quite separate from tabix (and Biopython uses it for many sequence file formats, not just FASTA). Peter From bjorn_johansson at bio.uminho.pt Sun Feb 16 19:23:45 2014 From: bjorn_johansson at bio.uminho.pt (Björn Johansson) Date: Sun, 16 Feb 2014 19:23:45 +0000 Subject: [Biopython] CAI confusion Message-ID: Hi, I am trying to use the Bio.SeqUtils.CodonUsage module to calculate CAI for S. cerevisiae genes.
Biopython comes with the SharpEcoliIndex from Bio.SeqUtils.CodonUsageIndices, but none for S. cerevisiae. I found one here: http://downloads.yeastgenome.org/unpublished_data/codon/s_cerevisiae-codonusage.txt and here: http://downloads.yeastgenome.org/unpublished_data/codon/ysc.orf.cod I parsed the first table, which has the following format, unfortunately w/o headers:

Gly GGG 17673  6.05 0.12
Gly GGA 32723 11.20 0.23
Gly GGT 66198 22.66 0.46
Gly GGC 28522  9.76 0.20
Glu GAG 57046 19.52 0.30
...

I believe the last column is the fraction. I think Biopython expects instead relative adaptedness w as input for each codon; see http://www.ncbi.nlm.nih.gov/pubmed/3547335 How do I calculate w from the frequency? Are there any examples or code available? I googled, but could not find anything. Grateful for help! /bjorn -- ______O_________oO________oO______o_______oO__ Björn Johansson Assistant Professor Department of Biology University of Minho Campus de Gualtar 4710-057 Braga PORTUGAL www.bio.uminho.pt Google profile Google Scholar Profile my group Office (direct) +351-253 601517 | (PT) mob. +351-967 147 704 | (SWE) mob. +46 739 792 968 Dept of Biology (secr) +351-253 60 4310 | fax +351-253 678980 From eric.talevich at gmail.com Mon Feb 17 06:25:18 2014 From: eric.talevich at gmail.com (Eric Talevich) Date: Sun, 16 Feb 2014 22:25:18 -0800 Subject: [Biopython] extra annotations for phyla tree In-Reply-To: References: Message-ID: On Sat, Feb 15, 2014 at 10:49 PM, Willis, Jordan R < jordan.r.willis at vanderbilt.edu> wrote: > > Hi, > > First off, whoever wrote the DistanceTree and DistanceMatrix > Calculator... hats off! I have been looking for an easy way to do custom > distance matrices for a while. Wow. > > Anyway, I noticed you can add some extra annotations to your leaves by > converting your tree into a PhyloXML. I was wondering if there are ways to > color branches and adjust thickness to highlight branches of interest.
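[Editor's note: on Björn's CAI question above — Sharp & Li (1987) define the relative adaptedness w of a codon as its usage divided by the usage of the most frequent synonymous codon for the same amino acid, so w can be computed directly from the raw counts in the quoted table (the per-thousand normalisation cancels in the ratio). A stdlib sketch using the Gly row:]

```python
from math import exp, log

# Raw synonymous-codon counts for glycine, from the third column
# of the yeast table quoted in the question:
gly_counts = {"GGG": 17673, "GGA": 32723, "GGT": 66198, "GGC": 28522}

# Relative adaptedness w: each codon's count divided by the count
# of the most-used synonymous codon for the same amino acid.
best = max(gly_counts.values())
w = {codon: count / best for codon, count in gly_counts.items()}
print(w["GGT"])  # 1.0 -- the optimal Gly codon

# The CAI of a gene is then the geometric mean of w over its codons:
def cai(codons, w_index):
    return exp(sum(log(w_index[c]) for c in codons) / len(codons))
```

[In practice you would build w per amino-acid family over the whole table, not just Gly, before scoring genes.]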
I > know you can simply open the trees in other programs like Dendroscope and > color them manually, but you can imagine a scenario where you have > thousands of trees to compare etc. > > Jordan > Hi Jordan, The TreeConstruction and Consensus modules are the recent work of Yanbo Ye. Good to hear you're using them and liking them. As for annotating branch display colors and widths, you can accomplish this by setting the .color and .width attributes of Clade objects. See: http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec233

tree = Phylo.read("mytree.nwk", "newick")
clade = tree.common_ancestor("A", "B")
clade.color = "red"
clade.width = 2

Note that the clade color and width are recursive, applying to all descendant clade branches too (per the phyloXML spec). To save the annotations so they can be read by Dendroscope and Archaeopteryx, the trees must be saved in phyloXML format:

Phylo.write(tree, "mytree-annotated.xml", "phyloxml")

Cheers, Eric From anaryin at gmail.com Wed Feb 19 14:39:10 2014 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 19 Feb 2014 15:39:10 +0100 Subject: [Biopython] Bio.PDB local MMCIF files In-Reply-To: <1388374650.98611.BPMail_high_carrier@web164003.mail.gq1.yahoo.com> References: <1388374650.98611.BPMail_high_carrier@web164003.mail.gq1.yahoo.com> Message-ID: Hello, The implementation I was referring to by the EBI people is here. I tested it during a workshop and it is very fast and robust (they use it, that should be enough reason) so maybe we could benefit a lot from either its incorporation or adaptation? As for what I suggested: since my GSoC period, already four years ago, I have noticed that the PDB module is a bit messy in terms of organization. The module itself is named after the databank, which can be confused with the format name, the mmCIF parser is defined in a subfolder, and there are application wrappers there too (DSSP, NACCESS).
Besides this issue, which is not an issue at all and just my own pet peeve, there is a lot that the entire module could gain from a thorough revision. I've been using it very often and some normal manipulations of structures are not straightforward to carry out (calculating a center of mass for example, removing double occupancies) due to the parser being slow and quite memory hungry. In fact, trying to run the parser on a very large collection of structures often results in a random crash due to memory issues. I've been toying with a lot of changes, performance improvements, etc, but I'm not satisfied at all with them. Some things I've been trying are defining the structure coordinates as a full numpy array instead of N arrays per structure (one per atom), and using __slots__ to mitigate memory usage (I managed to get it down 33% this way). This would also go in line with a suggestion from Eric a long time ago to make a Bio.Struct module which would be the perfect "playground" to implement and test these changes. Other developments that I think are worth looking into are for example making a nice library to link a parsed structure to the PDB database and fetch information on it using the REST services they provide. I'd like to hear your opinion (as in, everybody, developers and users) on this and whether it makes sense to indeed give a bit of TLC to the Bio.PDB module. Also, on what changes you think should be carried out to improve the module, like which features are missing, which applications are worth wrapping. Just to kick off some discussion. Maybe a new thread should be opened for this later on.
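[Editor's note: the __slots__ saving João mentions is easy to demonstrate with the standard library alone — a toy Atom class for illustration, not the real Bio.PDB one:]

```python
import sys

class AtomPlain:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

class AtomSlots:
    # Fixed attribute slots: no per-instance __dict__ is allocated,
    # which is where the per-atom memory saving comes from.
    __slots__ = ("x", "y", "z")
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

plain = AtomPlain(1.0, 2.0, 3.0)
slotted = AtomSlots(1.0, 2.0, 3.0)

plain_size = sys.getsizeof(plain) + sys.getsizeof(plain.__dict__)
slotted_size = sys.getsizeof(slotted)  # no __dict__ to add on top
print(plain_size, slotted_size)  # exact numbers vary by Python version
```

[Multiplied by the tens of thousands of Atom objects in a large structure, this kind of per-instance saving is consistent with the ~33% reduction reported above.]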
Cheers, João From p.j.a.cock at googlemail.com Wed Feb 19 14:51:59 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 19 Feb 2014 14:51:59 +0000 Subject: [Biopython] Bio.PDB local MMCIF files In-Reply-To: References: <1388374650.98611.BPMail_high_carrier@web164003.mail.gq1.yahoo.com> Message-ID: On Wed, Feb 19, 2014 at 2:39 PM, João Rodrigues wrote: > Hello, > > The implementation I was referring to by the EBI people is here. I tested it > during a workshop and it is very fast and robust (they use it, that should > be enough reason) so maybe we could benefit a lot from either its > incorporation or adaptation? > > As for what I suggested. Since my GSOC period, already 4 years ago.., I > noticed that the PDB module is a bit messy in terms of organization. The > module itself if named after the databank, which can be confused with the > format name, the mmcif parser is defined inside in a subfolder and there are > application wrappers there too (DSSP, NACCESS). Besides this issue, which is > not an issue at all and just my own pet peeve, there is a lot that the > entire module could gain from a thorough revision. I've been using it very > often and some normal manipulations of structures are not straightforward to > carry out (calculating a center of mass for example, removing double > occupancies) due to the parser being slow and quite memory hungry. In fact, > trying to run the parser on a very large collection of structures often > results in a random crash due to memory issues. > > I've been toying with a lot of changes, performance improvements, etc, but > I'm not satisfied at all with them.. somethings that i've been trying is to > have the structure coordinates defined as a full numpy array instead of N > arrays per structure (one per atom) or the usage of __slots__ to mitigate > memory usage (managed to get it down 33% this way).
This would also go in > line with a suggestion from Eric a long time ago to make a Bio.Struct module > which would be the perfect "playground" to implement and test these changes. > Other developments that I think are worth looking into are for example > making a nice library to link a parsed structure to the PDB database and > fetch information on it using the REST services they provide. > > I'd like to hear your opinion (as in, everybody, developers and users) on > this and if it makes sense to indeed give a bit of TLC to the Bio.PDB > module. Also, on what changes you think should be carried out to improve the > module, like which features are missing, which applications are worth > wrapping. > > Just to kick off some discussion. Maybe a new thread should be opened for > this later on. > > Cheers, > > João +1 on a new thread, and Bio.Struct (or better lower case, Bio.struct or Bio.structure or something to be a bit more PEP8 like?). Peter From anaryin at gmail.com Wed Feb 19 16:42:54 2014 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 19 Feb 2014 17:42:54 +0100 Subject: [Biopython] Bio.PDB.PDBParser() Superimposer() In-Reply-To: References: Message-ID: Hi Jurgens, Sorry for the delay.. hope it still goes on time. If the numbering of the two proteins is the same (equivalent residues have equivalent residue numbers), usually the case if you compare different models generated by simulation, then it is straightforward to trim them (check this gist ). Otherwise you have to perform a sequence alignment and parse the alignment to extract the equivalent atoms and do the same logic as before (this is quite tricky..). I have a script that does this but it's not trivial at all and might be extremely specific for your application. Cheers, João 2014-01-16 13:18 GMT+01:00 Jurgens de Bruin : > Hi João Rodrigues, > > Thanks for the reply much appreciated, this does make sense but I would > greatly appreciate examples with some code.
> > Thanks > > > On 16 January 2014 13:59, João Rodrigues wrote: > >> Hi Jurgens, >> >> When you pass the two sequences to the Superimposer I guess you can trim >> the sequence to that which you want (pass a list of residues that is sliced >> to those that you want to include). The only requirement would be that both >> have the same number of atoms. >> >> If this doesn't make much sense I can give an example with code. >> >> Cheers, >> >> João >> >> >> 2014/1/16 Jurgens de Bruin >> >>> Hi, >>> >>> I am trying to calculate the RMS for two pdb files but the proteins >>> differ >>> in length. Currently I want to exclude the leading/trailing parts of the >>> longer sequence but I am having difficulty figuring out how I will be >>> able >>> to do this. >>> >>> Any help would be appreciated. >>> >>> >>> -- >>> Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/ >>> distinti saluti/siong/du? y?/?????? >>> >>> Jurgens de Bruin >>> >>> _______________________________________________ >>> Biopython mailing list - Biopython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >>> >> >> > > > -- > Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/ > distinti saluti/siong/du? y?/?????? > > Jurgens de Bruin > From p.j.a.cock at googlemail.com Wed Feb 19 16:47:38 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 19 Feb 2014 16:47:38 +0000 Subject: [Biopython] Bio.PDB.PDBParser() Superimposer() In-Reply-To: References: Message-ID: On Wed, Feb 19, 2014 at 4:42 PM, João Rodrigues wrote: > Hi Jurgens, > > Sorry for the delay.. hope it still goes on time. > > If the numbering of the two proteins is the same (equivalent residues have > equivalent residue numbers), usually the case if you compare different > models generated by simulation, then it is straightforward to trim them (check > this gist ).
Here's a slightly more complex example picking out a stable core for the alignment (ignoring variable loops): http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > Otherwise you have to perform a sequence alignment and parse the alignment > to extract the equivalent atoms and do the same logic as before (this is > quite tricky..). I have a script that does this but it's not trivial at all > and might be extremely specific for your application. Yes. Fiddly. Peter From anaryin at gmail.com Wed Feb 19 17:07:17 2014 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 19 Feb 2014 18:07:17 +0100 Subject: [Biopython] Bio.PDB.PDBParser() Superimposer() In-Reply-To: References: Message-ID: Hey Jordan, Mind pasting that somewhere? I spent a few hours coding something like that recently so it would be nice to compare! Cheers, João 2014-02-19 18:05 GMT+01:00 Willis, Jordan R : > I also have an example where I have one native and several models that > needs an RMSD. > > It performs a multiple sequence alignment one at a time and iterates > through the alignment file to do a one-to-one array of atoms in the > sequence alignment before calculating a superposition. If the atoms do not > match, they are thrown out of the alignment. Let me know if you want to > see this, it's a bit complex. > > Jordan > > > > > On Feb 19, 2014, at 10:47 AM, Peter Cock > wrote: > > > On Wed, Feb 19, 2014 at 4:42 PM, João Rodrigues > wrote: > >> Hi Jurgens, > >> > >> Sorry for the delay.. hope it still goes on time. > >> > >> If the numbering of the two proteins is the same (equivalent residues > have > >> equivalent residue numbers), usually the case if you compare different > >> models generated by simulation, then it is straightforward to trim them > (check > >> this gist ).
> > > > Here's a slightly more complex example picking out a stable core > > for the alignment (ignoring variable loops): > > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > > > >> Otherwise you have to perform a sequence alignment and parse the > alignment > >> to extract the equivalent atoms and do the same logic as before (this is > >> quite tricky..). I have a script that does this but it's not trivial at > all > >> and might be extremely specific for your application. > > > > Yes. Fiddly. > > > > Peter > > > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > From jordan.r.willis at Vanderbilt.Edu Wed Feb 19 17:05:31 2014 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Wed, 19 Feb 2014 17:05:31 +0000 Subject: [Biopython] Bio.PDB.PDBParser() Superimposer() In-Reply-To: References: Message-ID: I also have an example where I have one native and several models that needs an RMSD. It performs a multiple sequence alignment one at a time and iterates through the alignment file to do a one-to-one array of atoms in the sequence alignment before calculating a superposition. If the atoms do not match, they are thrown out of the alignment. Let me know if you want to see this, it?s a bit complex. Jordan On Feb 19, 2014, at 10:47 AM, Peter Cock wrote: > On Wed, Feb 19, 2014 at 4:42 PM, Jo?o Rodrigues wrote: >> Hi Jurgens, >> >> Sorry for the delay.. hope it still goes on time. >> >> If the numbering of the two proteins is the same (equivalent residues have >> equivalent residue numbers), usually the case if you compare different >> models generated by simulation, then it is straightforward to trim them (check >> this gist ). 
> > Here's a slightly more complex example picking out a stable core > for the alignment (ignoring variable loops): > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > >> Otherwise you have to perform a sequence alignment and parse the alignment >> to extract the equivalent atoms and do the same logic as before (this is >> quite tricky..). I have a script that does this but it's not trivial at all >> and might be extremely specific for your application. > > Yes. Fiddly. > > Peter > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From jordan.r.willis at Vanderbilt.Edu Wed Feb 19 17:52:36 2014 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Wed, 19 Feb 2014 17:52:36 +0000 Subject: [Biopython] Bio.PDB.PDBParser() Superimposer() In-Reply-To: References: Message-ID: This will calculate an all-atom RMSD, plus C-alpha and backbone-atom RMSDs. I took out all the extra stuff specific to the Rosetta community that will actually score the file too. But this is generalized: scoreimposer_align.py -n native.pdb -m *.pdbs -m is the multiprocess flag (requires python2.7) https://gist.github.com/jwillis0720/9097426 Jordan On Feb 19, 2014, at 11:07 AM, João Rodrigues > wrote: Hey Jordan, Mind pasting that somewhere? I spent a few hours coding something like that recently so it would be nice to compare! Cheers, João 2014-02-19 18:05 GMT+01:00 Willis, Jordan R >: I also have an example where I have one native and several models that needs an RMSD. It performs a multiple sequence alignment one at a time and iterates through the alignment file to do a one-to-one array of atoms in the sequence alignment before calculating a superposition. If the atoms do not match, they are thrown out of the alignment. Let me know if you want to see this, it's a bit complex.
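[Editor's note: the logic common to all the scripts in this thread — pair up only the residues present in both structures, then measure over the matched atoms — can be sketched without Bio.PDB at all. Made-up C-alpha coordinates keyed by residue number; with real structures the matched Atom objects would be handed to Bio.PDB's Superimposer:]

```python
import math

# Toy C-alpha coordinates keyed by residue number, for two
# structures of different length (coordinates are invented):
native = {1: (0.0, 0.0, 0.0), 2: (1.5, 0.0, 0.0),
          3: (3.0, 0.0, 0.0), 4: (4.5, 0.0, 0.0)}
model = {2: (1.4, 0.1, 0.0), 3: (3.1, -0.1, 0.0),
         4: (4.5, 0.0, 0.2), 9: (9.9, 9.9, 9.9)}

# "Trimming": keep only residue numbers present in both structures,
# in a consistent order, so the atom lists pair up one-to-one.
common = sorted(native.keys() & model.keys())
pairs = [(native[r], model[r]) for r in common]

def rmsd(pairs):
    # Plain coordinate RMSD over the matched atoms (no fitting step;
    # Superimposer would rotate/translate the model before measuring).
    total = sum((a - b) ** 2
                for p, q in pairs for a, b in zip(p, q))
    return math.sqrt(total / len(pairs))

print(common)                 # [2, 3, 4]
print(round(rmsd(pairs), 3))  # 0.163
```

[When residue numbering differs between the structures, the `common` set would instead come from a sequence alignment, as discussed above.]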
Jordan On Feb 19, 2014, at 10:47 AM, Peter Cock > wrote: > On Wed, Feb 19, 2014 at 4:42 PM, Jo?o Rodrigues > wrote: >> Hi Jurgens, >> >> Sorry for the delay.. hope it still goes on time. >> >> If the numbering of the two proteins is the same (equivalent residues have >> equivalent residue numbers), usually the case if you compare different >> models generated by simulation, then it is straightforward to trim them (check >> this gist ). > > Here's a slightly more complex example picking out a stable core > for the alignment (ignoring variable loops): > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > >> Otherwise you have to perform a sequence alignment and parse the alignment >> to extract the equivalent atoms and do the same logic as before (this is >> quite tricky..). I have a script that does this but it's not trivial at all >> and might be extremely specific for your application. > > Yes. Fiddly. > > Peter > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From cjfields at illinois.edu Thu Feb 20 14:16:16 2014 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 20 Feb 2014 14:16:16 +0000 Subject: [Biopython] Bio.PDB local MMCIF files In-Reply-To: References: <1388374650.98611.BPMail_high_carrier@web164003.mail.gq1.yahoo.com> , Message-ID: <608E332B-F339-4474-A206-209ED6EA3D84@illinois.edu> On Feb 19, 2014, at 8:55 AM, "Peter Cock" wrote: > >> On Wed, Feb 19, 2014 at 2:39 PM, Jo?o Rodrigues wrote: >> Hello, >> >> The implementation I was referring to by the EBI people is here. I tested it >> during a workshop and it is very fast and robust (they use it, that should >> be enough reason) so maybe we could benefit a lot from either its >> incorporation or adaptation? >> >> As for what I suggested. Since my GSOC period, already 4 years ago.., I >> noticed that the PDB module is a bit messy in terms of organization. 
The >> module itself if named after the databank, which can be confused with the >> format name, the mmcif parser is defined inside in a subfolder and there are >> application wrappers there too (DSSP, NACCESS). Besides this issue, which is >> not an issue at all and just my own pet peeve, there is a lot that the >> entire module could gain from a thorough revision. I've been using it very >> often and some normal manipulations of structures are not straightforward to >> carry out (calculating a center of mass for example, removing double >> occupancies) due to the parser being slow and quite memory hungry. In fact, >> trying to run the parser on a very large collection of structures often >> results in a random crash due to memory issues. >> >> I've been toying with a lot of changes, performance improvements, etc, but >> I'm not satisfied at all with them.. somethings that i've been trying is to >> have the structure coordinates defined as a full numpy array instead of N >> arrays per structure (one per atom) or the usage of __slots__ to mitigate >> memory usage (managed to get it down 33% this way). This would also go in >> line with a suggestion from Eric a long time ago to make a Bio.Struct module >> which would be the perfect "playground" to implement and test these changes. >> Other developments that I think are worth looking into are for example >> making a nice library to link a parsed structure to the PDB database and >> fetch information on it using the REST services they provide. >> >> I'd like to hear your opinion (as in, everybody, developers and users) on >> this and if it makes sense to indeed give a bit of TLC to the Bio.PDB >> module. Also, on what changes you think should be carried out to improve the >> module, like which features are missing, which applications are worth >> wrapping. >> >> Just to kick off some discussion. Maybe a new thread should be opened for >> this later on. 
>> >> Cheers, >> >> João > > +1 on a new thread, and Bio.Struct (or better lower case, Bio.struct > or Bio.structure or something to be a bit more PEP8 like?). > > Peter The similarly designed (but terribly maintained) BioPerl code is Bio::Structure. I think it was designed years back to be agnostic to a specific database but of course based much of its design on PDB data. Chris From leo2 at stanford.edu Tue Feb 25 01:59:45 2014 From: leo2 at stanford.edu (Leo Alexander Hansmann) Date: Mon, 24 Feb 2014 17:59:45 -0800 (PST) Subject: [Biopython] consensus for forward and reverse reads from a sequencing run In-Reply-To: <997170947.1096602.1393293281154.JavaMail.zimbra@stanford.edu> Message-ID: <1368292051.1100084.1393293585839.JavaMail.zimbra@stanford.edu> Hi, I'm getting two fasta files from an Illumina MiSeq run. One contains forward, the other reverse reads. The lines in both files are corresponding, meaning the first sequence in the forward read file should pair with the first sequence line in the reverse read file. Both sequences overlap in the middle in a varying amount of nucleotides. How can I get python or biopython to generate a file with the consensus sequences of each read. For example: sequence in the forward read file: AATCGTCGGTTACTCTG corresponding line in the reverse read file: CTCTGAGGGAGAGATC I want: AATCGTCGGTTACTCTGAGGGAGAGATC Thank you so much!
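[Editor's note: the merge Leo asks for can be prototyped in a few lines of plain Python — a naive exact-overlap merge using his example reads. Real tools also handle mismatches, quality scores, and reverse-complementing the second read; here the reverse read is assumed to be already orientation-corrected, as in the example:]

```python
def merge_reads(fwd, rev, min_overlap=5):
    """Merge a forward read and an orientation-corrected reverse read
    by the longest exact suffix/prefix overlap.  Returns None if no
    overlap of at least min_overlap bases is found."""
    max_len = min(len(fwd), len(rev))
    # Try the longest possible overlap first, shrinking down to the
    # minimum, so the greediest merge wins.
    for k in range(max_len, min_overlap - 1, -1):
        if fwd[-k:] == rev[:k]:
            return fwd + rev[k:]
    return None

fwd = "AATCGTCGGTTACTCTG"   # forward read from the question
rev = "CTCTGAGGGAGAGATC"    # corresponding reverse read
print(merge_reads(fwd, rev))  # AATCGTCGGTTACTCTGAGGGAGAGATC
```

[For FASTQ data the overlap test would need to tolerate sequencing errors and combine the PHRED scores of overlapping bases, which is exactly what the dedicated tools discussed below the question do.]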
Leo From jordan.r.willis at Vanderbilt.Edu Tue Feb 25 02:21:40 2014 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Tue, 25 Feb 2014 02:21:40 +0000 Subject: [Biopython] consensus for forward and reverse reads from a sequencing run In-Reply-To: <1368292051.1100084.1393293585839.JavaMail.zimbra@stanford.edu> References: <1368292051.1100084.1393293585839.JavaMail.zimbra@stanford.edu> Message-ID: <5661D2FC-430A-41DA-91D1-C61ADABE7EA6@vanderbilt.edu> Hi Leo, I know this is not what you asked and I'm not sure if BioPython has a module, but I would really recommend pandaseq (https://github.com/neufeld/pandaseq). It's written in C, so it's much faster than Python and really could not be simpler to use. I typically use this for HiSeq and MiSeq runs and it just requires the forward and reverse paired end reads and spits out a consensus (with PHRED scores if you want). Jordan On Feb 24, 2014, at 7:59 PM, Leo Alexander Hansmann > wrote: Hi, I'm getting two fasta files from an Illumina MiSeq run. One contains forward, the other reverse reads. The lines in both files are corresponding, meaning the first sequence in the forward read file should pair with the first sequence line in the reverse read file. Both sequences overlap in the middle in a varying amount of nucleotides. How can I get python or biopython to generate a file with the consensus sequences of each read. For example: sequence in the forward read file: AATCGTCGGTTACTCTG corresponding line in the reverse read file: CTCTGAGGGAGAGATC I want: AATCGTCGGTTACTCTGAGGGAGAGATC Thank you so much!
Leo From ivangreg at gmail.com Tue Feb 25 03:34:24 2014 From: ivangreg at gmail.com (Ivan Gregoretti) Date: Mon, 24 Feb 2014 22:34:24 -0500 Subject: [Biopython] consensus for forward and reverse reads from a sequencing run In-Reply-To: <5661D2FC-430A-41DA-91D1-C61ADABE7EA6@vanderbilt.edu> References: <1368292051.1100084.1393293585839.JavaMail.zimbra@stanford.edu> <5661D2FC-430A-41DA-91D1-C61ADABE7EA6@vanderbilt.edu> Message-ID: Hello Leo, Besides pandaseq, also consider FLASH from the Salzberg lab. http://ccb.jhu.edu/software/FLASH/ I've been using it for over a year without problems. I wish there was a Biopython tool though. Cheers, Ivan Ivan Gregoretti, PhD Bioinformatics On Mon, Feb 24, 2014 at 9:21 PM, Willis, Jordan R wrote: > Hi Leo, > > I know this is not what you asked and I'm not sure if BioPython has a module, but I would really recommend pandaseq (https://github.com/neufeld/pandaseq). Its written in C, so its much faster than python and really could not be any more simple to use. I typically use this for HiSeq and MiSeq runs and it just requires the forward and reverse paired end reads and spits out a consensus (with PHRED scores if you want). > > Jordan > > On Feb 24, 2014, at 7:59 PM, Leo Alexander Hansmann > wrote: > > Hi, > I'm getting two fasta files from an Illumina MiSeq run. One contains forward, the other reverse reads. The lines in both files are corresponding, meaning the first sequence in the forward read file should pair with the first sequence line in the reverse read file. Both sequences overlap in the middle in a varying amount of nucleotides. How can I get python or biopython to generate a file with the consensus sequences of each read. For example: > sequence in the forward read file: AATCGTCGGTTACTCTG > corresponding line in the reverse read file: CTCTGAGGGAGAGATC > I want: AATCGTCGGTTACTCTGAGGGAGAGATC > Thank you so much! 
Leo > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From egor.lakomkin at gmail.com Tue Feb 25 05:02:49 2014 From: egor.lakomkin at gmail.com (Lakomkin Egor) Date: Tue, 25 Feb 2014 13:02:49 +0800 Subject: [Biopython] [GSoC] Text mining for biopython Message-ID: Hello, I am a PhD student, doing research in biomedical text mining, especially gene ontology term recognition. I would like to ask if there is any interest in doing a GSoC text mining project under Biopython? Regards, Egor From egor.lakomkin at gmail.com Tue Feb 25 05:07:20 2014 From: egor.lakomkin at gmail.com (Lakomkin Egor) Date: Tue, 25 Feb 2014 13:07:20 +0800 Subject: [Biopython] [GSoC] Text mining for biopython Message-ID: Hello, I am a PhD student, doing research in biomedical text mining, especially gene ontology term recognition. I would like to ask if there is any interest in doing a GSoC text mining project under Biopython? Regards, Egor From p.j.a.cock at googlemail.com Tue Feb 25 11:22:09 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 25 Feb 2014 11:22:09 +0000 Subject: [Biopython] consensus for forward and reverse reads from a sequencing run In-Reply-To: References: <1368292051.1100084.1393293585839.JavaMail.zimbra@stanford.edu> <5661D2FC-430A-41DA-91D1-C61ADABE7EA6@vanderbilt.edu> Message-ID: I agree that for this specific task (merging overlapped paired FASTQ reads) an existing dedicated tool/script is a very sensible choice. There are plenty to choose from. What Biopython might benefit from is either sample code on the Cookbook wiki for how to do this, or perhaps a new function in Bio.SeqUtils? i.e. Bits to help you do something new or different, if you need to customise a bespoke analysis. Peter On Tue, Feb 25, 2014 at 3:34 AM, Ivan Gregoretti wrote: > Hello Leo, > > Besides pandaseq, also consider FLASH from the Salzberg lab.
> http://ccb.jhu.edu/software/FLASH/ > > I've been using it for over a year without problems. I wish there was > a Biopython tool though. > > Cheers, > > Ivan > > > > Ivan Gregoretti, PhD > Bioinformatics > > > > On Mon, Feb 24, 2014 at 9:21 PM, Willis, Jordan R > wrote: >> Hi Leo, >> >> I know this is not what you asked and I'm not sure if BioPython has a module, but I would really recommend pandaseq (https://github.com/neufeld/pandaseq). Its written in C, so its much faster than python and really could not be any more simple to use. I typically use this for HiSeq and MiSeq runs and it just requires the forward and reverse paired end reads and spits out a consensus (with PHRED scores if you want). >> >> Jordan >> >> On Feb 24, 2014, at 7:59 PM, Leo Alexander Hansmann > wrote: >> >> Hi, >> I'm getting two fasta files from an Illumina MiSeq run. One contains forward, the other reverse reads. The lines in both files are corresponding, meaning the first sequence in the forward read file should pair with the first sequence line in the reverse read file. Both sequences overlap in the middle in a varying amount of nucleotides. How can I get python or biopython to generate a file with the consensus sequences of each read. For example: >> sequence in the forward read file: AATCGTCGGTTACTCTG >> corresponding line in the reverse read file: CTCTGAGGGAGAGATC >> I want: AATCGTCGGTTACTCTGAGGGAGAGATC >> Thank you so much! >> Leo >> From p.j.a.cock at googlemail.com Tue Feb 25 11:36:57 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 25 Feb 2014 11:36:57 +0000 Subject: [Biopython] [GSoC] Text mining for biopython In-Reply-To: References: Message-ID: On Tue, Feb 25, 2014 at 5:02 AM, Lakomkin Egor wrote: > Hello, > > I am PhD student, doing research in biomedical text mining, especially > gene ontology term recognition. I would like to ask if there is any > interest of doing GSoC text mining project under biopython? 
> > Regards, Egor Hi Egor, I'm not aware of any of the current Biopython development team doing any text mining work - but I can think of a few people I've met at hackathons/conferences who might be: Karin Verspoor, NICTA http://textminingscience.com/content/karin-verspoor https://twitter.com/karinv Kevin Cohen, University of Colorado School of Medicine http://compbio.ucdenver.edu/Hunter_lab/Cohen/index.shtml https://twitter.com/KevinBCohen Daniel Jamieson, PhD student at University of Manchester https://twitter.com/danielgjamieson (I've not checked if they use Python in their work) However, sorting out a nice combined module for Gene Ontology support (and ontologies in general) would be good. There are a number of people already looking at this (check the biopython and biopython-dev mailing list archives with Google). Regards, Peter From cjfields at illinois.edu Tue Feb 25 15:40:43 2014 From: cjfields at illinois.edu (Fields, Christopher J) Date: Tue, 25 Feb 2014 15:40:43 +0000 Subject: [Biopython] consensus for forward and reverse reads from a sequencing run In-Reply-To: References: <1368292051.1100084.1393293585839.JavaMail.zimbra@stanford.edu> <5661D2FC-430A-41DA-91D1-C61ADABE7EA6@vanderbilt.edu> Message-ID: <112D9B62-CA39-4072-BA01-08C332EC8FE9@illinois.edu> Torsten Seemann blogged on this and listed a bunch of tools, including a Python-based approach: http://thegenomefactory.blogspot.com/2012/11/tools-to-merge-overlapping-paired-end.html He also mentioned the one we have been using internally for MiSeq data (PEAR), which we have found works much better than PandaSeq in many circumstances (complete or overextended overlaps): http://bioinformatics.oxfordjournals.org/content/early/2013/11/10/bioinformatics.btt593.full chris On Feb 25, 2014, at 5:22 AM, Peter Cock wrote: > I agree that for this specific task (merging overlapped paired > FASTQ reads) an existing dedicated tool/script is a very > sensible choice. There are plenty to choose from.
> > What Biopython might benefit from is either sample code > on the Cookbook wiki for how to do this, or perhaps a new > function in Bio.SeqUtils? i.e. Bits to help you do something > new or different, if you need to customise a bespoke > analysis. > > Peter > > On Tue, Feb 25, 2014 at 3:34 AM, Ivan Gregoretti wrote: >> Hello Leo, >> >> Besides pandaseq, also consider FLASH from the Salzberg lab. >> http://ccb.jhu.edu/software/FLASH/ >> >> I've been using it for over a year without problems. I wish there was >> a Biopython tool though. >> >> Cheers, >> >> Ivan >> >> >> >> Ivan Gregoretti, PhD >> Bioinformatics >> >> >> >> On Mon, Feb 24, 2014 at 9:21 PM, Willis, Jordan R >> wrote: >>> Hi Leo, >>> >>> I know this is not what you asked and I'm not sure if BioPython has a module, but I would really recommend pandaseq (https://github.com/neufeld/pandaseq). Its written in C, so its much faster than python and really could not be any more simple to use. I typically use this for HiSeq and MiSeq runs and it just requires the forward and reverse paired end reads and spits out a consensus (with PHRED scores if you want). >>> >>> Jordan >>> >>> On Feb 24, 2014, at 7:59 PM, Leo Alexander Hansmann > wrote: >>> >>> Hi, >>> I'm getting two fasta files from an Illumina MiSeq run. One contains forward, the other reverse reads. The lines in both files are corresponding, meaning the first sequence in the forward read file should pair with the first sequence line in the reverse read file. Both sequences overlap in the middle in a varying amount of nucleotides. How can I get python or biopython to generate a file with the consensus sequences of each read. For example: >>> sequence in the forward read file: AATCGTCGGTTACTCTG >>> corresponding line in the reverse read file: CTCTGAGGGAGAGATC >>> I want: AATCGTCGGTTACTCTGAGGGAGAGATC >>> Thank you so much! 
>>> Leo
>>>
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From harsh.beria93 at gmail.com Wed Feb 26 16:14:24 2014
From: harsh.beria93 at gmail.com (Harsh Beria)
Date: Wed, 26 Feb 2014 21:44:24 +0530
Subject: [Biopython] Gsoc 2014 aspirant
Message-ID: 

Hi,

I am Harsh Beria, a third-year UG student at the Indian Institute of
Technology, Kharagpur. I have started working in computational biophysics
recently, having written code for a PDB-to-FASTA parser, sequence
alignment using Needleman-Wunsch and Smith-Waterman, secondary structure
prediction and Henikoff weights, and am currently working on a Monte
Carlo simulation. Overall, I have started to like this field and want to
carry my interest forward by pursuing a relevant project for GSoC 2014.

I mainly code in C and Python and would like to start contributing to the
Biopython library. I started going through the official contribution wiki
page (http://biopython.org/wiki/Contributing) and I also went through the
wiki page of Bio.SeqIO. I seriously want to contribute to the Biopython
library through GSoC. What do I do next?

Thanks

-- 
Harsh Beria,
Indian Institute of Technology, Kharagpur
E-mail: harsh.beria93 at gmail.com

From p.j.a.cock at googlemail.com Thu Feb 27 13:49:22 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 27 Feb 2014 13:49:22 +0000
Subject: [Biopython] Introductory Biopython material
Message-ID: 

Hello all,

This is just to let you know that I've written some introductory
Biopython material targeting Python novices, focused on some practical
sequence manipulation examples, freely available under the CC-BY licence
here:

https://github.com/peterjc/biopython_workshop

I've run this as a workshop twice, but it should be fine for self-study
as well. I'm open to moving this under the Biopython project's GitHub
account, if people think that would be better?
I've added a few links to this from the website - these can be
moved/edited/removed if people think there's a better place to put them:
http://biopython.org/wiki/SeqIO and
http://biopython.org/wiki/Category:Wiki_Documentation

Regards,

Peter

From tra at popgen.net Thu Feb 27 14:53:48 2014
From: tra at popgen.net (Tiago Antao)
Date: Thu, 27 Feb 2014 14:53:48 +0000
Subject: [Biopython] Bio.PopGen.SimCoal partial deprecation
Message-ID: <20140227145348.44cbe923@lnx>

Dear all,

With the availability of the new fastsimcoal interface by Melissa Gymrek,
I was planning on deprecating the code dealing with the old version
(SimCoal 2.0). This would mean deprecating the class SimCoalController
(Bio.PopGen.SimCoal.Controller.py), along with the relevant test code
(and the SimCoal2 dependency). All the other code would be maintained
(e.g. templating). And Melissa's new fastsimcoal class
(FastSimCoalController) would of course be added.

If somebody has strong feelings against this deprecation, please do
voice your concerns.
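A deprecation like this is usually phased in with a runtime warning first, so existing scripts keep working (noisily) for a release or two before removal. A minimal sketch of that pattern - the stand-in warning class and message text below are illustrative, not Biopython's actual code (Biopython defines its own warning class, `Bio.BiopythonDeprecationWarning`):

```python
import warnings

class BiopythonDeprecationWarning(Warning):
    """Stand-in for Biopython's own warning class in Bio/__init__.py."""

class SimCoalController:
    def __init__(self, simcoal_dir):
        # Warn on instantiation: old scripts still run, but users see
        # the deprecation notice and can migrate before removal.
        warnings.warn(
            "SimCoalController (for SimCoal 2.0) is deprecated; please "
            "use the new fastsimcoal interface (FastSimCoalController).",
            BiopythonDeprecationWarning,
        )
        self.simcoal_dir = simcoal_dir

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    SimCoalController("/opt/simcoal2")
print(caught[0].category.__name__)  # BiopythonDeprecationWarning
```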
Best,
Tiago

From Leighton.Pritchard at hutton.ac.uk Thu Feb 27 15:50:18 2014
From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard)
Date: Thu, 27 Feb 2014 15:50:18 +0000
Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas
Message-ID: 

I would like to propose further development of the GenomeDiagram module
(and maybe the KGML module, if it's incorporated into Biopython) to
enable browser-based interactive visualisation, along the lines of
Bokeh [1].

[1] http://bokeh.pydata.org/

-- 
Dr Leighton Pritchard
Information and Computing Sciences Group; Weeds, Pests and Diseases Theme
DG31, James Hutton Institute (Dundee)
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e: leighton.pritchard at hutton.ac.uk
w: http://www.hutton.ac.uk/staff/leighton-pritchard
gpg/pgp: 0xFEFC205C
tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827

________________________________________________________
This email is from the James Hutton Institute, however the views
expressed by the sender are not necessarily the views of the James Hutton
Institute and its subsidiaries. This email and any attachments are
confidential and are intended solely for the use of the recipient(s) to
whom they are addressed. If you are not the intended recipient, you
should not read, copy, disclose or rely on any information contained in
this email, and we would ask you to contact the sender immediately and
delete the email from your system. Although the James Hutton Institute
has taken reasonable precautions to ensure no viruses are present in this
email, neither the Institute nor the sender accepts any responsibility
for any viruses, and it is your responsibility to scan the email and any
attachments. The James Hutton Institute is a Scottish charitable company
limited by guarantee. Registered in Scotland No. SC374831
Registered Office: The James Hutton Institute, Invergowrie Dundee DD2 5DA.
Charity No. SC041796

From p.j.a.cock at googlemail.com Thu Feb 27 16:12:31 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 27 Feb 2014 16:12:31 +0000
Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas
In-Reply-To: 
References: 
Message-ID: 

On Thu, Feb 27, 2014 at 3:50 PM, Leighton Pritchard wrote:
> I would like to propose further development of the GenomeDiagram
> module (and maybe the KGML module, if it's incorporated into Biopython)
> to enable browser-based interactive visualisation, along the lines of Bokeh[1]
>
> [1] http://bokeh.pydata.org/

I presume you're offering to mentor this - which would be great :)

Peter

P.S. The KGML module Leighton's talking about is here:
https://github.com/biopython/biopython/pull/173

Leighton's blog posts about this work:
http://armchairbiology.blogspot.co.uk/2013/01/keggwatch-part-i.html
http://armchairbiology.blogspot.co.uk/2013/02/keggwatch-part-ii.html
http://armchairbiology.blogspot.co.uk/2013/02/keggwatch-part-iii.html

From tra at popgen.net Thu Feb 27 16:19:44 2014
From: tra at popgen.net (Tiago Antao)
Date: Thu, 27 Feb 2014 16:19:44 +0000
Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas
In-Reply-To: 
References: 
Message-ID: <20140227161944.05640d0d@lnx>

Hi,

On Thu, 27 Feb 2014 16:12:31 +0000 Peter Cock wrote:
> P.S. The KGML module Leighton's talking about is here:
> https://github.com/biopython/biopython/pull/173

Would this add a new library dependency to Biopython (PIL)? I am all in
favour of that (as independent modules could have their own dependencies
without causing problems - you only need a dependency if you actually
use the module). But that would require revising the module dependency
policy, right? Until now it has been a bit on the conservative side...
I am thinking here of matplotlib and scipy, for instance...
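The "only need the dependency if you actually use the module" approach is typically implemented with a guarded import, so importing the package stays cheap and the hard requirement is only enforced when the optional feature is called. A minimal sketch (the renderer function below is hypothetical; Biopython's own convention is to raise at that point, e.g. via its `MissingPythonDependencyError`):

```python
# Guarded optional import: the module loads fine without PIL installed,
# and only the feature that needs PIL complains when it is missing.
try:
    from PIL import Image
    HAS_PIL = True
except ImportError:
    HAS_PIL = False

def render_pathway_png(width=100, height=100):
    """Hypothetical KGML-style renderer that needs PIL to do its work."""
    if not HAS_PIL:
        raise RuntimeError(
            "Install PIL (Pillow) to use the KGML rendering module")
    return Image.new("RGB", (width, height))
```

With this pattern, users of the dependency-free parts of the package never pay for (or even need to install) the heavyweight library.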
Tiago

From p.j.a.cock at googlemail.com Thu Feb 27 16:31:11 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 27 Feb 2014 16:31:11 +0000
Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas
In-Reply-To: 
References: 
Message-ID: 

On Thu, Feb 27, 2014 at 4:25 PM, Fields, Christopher J wrote:
> On Feb 27, 2014, at 10:12 AM, Peter Cock wrote:
>
>> On Thu, Feb 27, 2014 at 3:50 PM, Leighton Pritchard wrote:
>>> I would like to propose further development of the GenomeDiagram
>>> module (and maybe the KGML module, if it's incorporated into Biopython)
>>> to enable browser-based interactive visualisation, along the lines of Bokeh[1]
>>>
>>> [1] http://bokeh.pydata.org/
>>
>> I presume you're offering to mentor this - which would be great :)
>>
>> Peter
>
> I would add that to the wiki, and indicate whether you can mentor it.
> Seems like a cool idea!
>
> chris

Leighton left out the link, but had added this to the Biopython wiki:
http://biopython.org/wiki/GSOC#Interactive_GenomeDiagram_Module

Peter

From cjfields at illinois.edu Thu Feb 27 16:25:18 2014
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 27 Feb 2014 16:25:18 +0000
Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas
In-Reply-To: 
References: 
Message-ID: 

On Feb 27, 2014, at 10:12 AM, Peter Cock wrote:
> On Thu, Feb 27, 2014 at 3:50 PM, Leighton Pritchard wrote:
>> I would like to propose further development of the GenomeDiagram
>> module (and maybe the KGML module, if it's incorporated into Biopython)
>> to enable browser-based interactive visualisation, along the lines of Bokeh[1]
>>
>> [1] http://bokeh.pydata.org/
>
> I presume you're offering to mentor this - which would be great :)
>
> Peter
>
> P.S. The KGML module Leighton's talking about is here:
> https://github.com/biopython/biopython/pull/173
>
> Leighton's blog posts about this work:
> http://armchairbiology.blogspot.co.uk/2013/01/keggwatch-part-i.html
> http://armchairbiology.blogspot.co.uk/2013/02/keggwatch-part-ii.html
> http://armchairbiology.blogspot.co.uk/2013/02/keggwatch-part-iii.html

I would add that to the wiki, and indicate whether you can mentor it.
Seems like a cool idea!

chris