From mdehoon at c2b2.columbia.edu  Thu Jun  1 20:57:35 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Thu, 01 Jun 2006 17:57:35 -0700
Subject: [BioPython] NCBIWWW.qblast with refseq by organism
In-Reply-To: <20060526165711.94194.qmail@web51708.mail.yahoo.com>
References: <20060526165711.94194.qmail@web51708.mail.yahoo.com>
Message-ID: <447F8CFF.9050204@c2b2.columbia.edu>

Denil Wickrama wrote:
> Hi, I would like to BLAST a list of proteins against the refseq 
> database and retrieve the corresponding accession numbers of the
> exact hits. I get errors when I change from the nr database to the
> refseq database. Also I am trying to restrict the results by organism
> name, but that was not successful.
> result_handle = NCBIWWW.qblast("blastp", "nr", seq,
>     entrez_query='"rattus norvegicus" [Organism]')
> result_handle = NCBIWWW.qblast("blastp", "refseq", seq,
>     entrez_query='"rattus norvegicus" [Organism]')
>
> Is it possible to do refseq searches with NCBIWWW.qblast?

It turns out that the NCBI server actually wants "refseq_protein" 
instead of "refseq". (You can check this by saving NCBI's 
Protein-protein blast page in HTML, and looking at the source). So if 
you replace "refseq" by "refseq_protein", your code should run.

Restricting the results by organism worked fine for me with the 
entrez_query you have.
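
For reference, a minimal sketch of the corrected call. The helper names
organism_query and blast_refseq_protein are made up for illustration; only
NCBIWWW.qblast, its entrez_query argument and the "refseq_protein" database
name come from the discussion above. The search itself needs Biopython and
network access, so the import is kept inside the function.

```python
def organism_query(name):
    """Build an Entrez query string restricting hits to one organism."""
    return '"%s" [Organism]' % name


def blast_refseq_protein(sequence, organism):
    """Run blastp against refseq_protein, limited to one organism.

    Hypothetical wrapper: requires Biopython and a network connection.
    """
    # Imported lazily so organism_query() works without Biopython installed.
    from Bio.Blast import NCBIWWW

    return NCBIWWW.qblast(
        "blastp",
        "refseq_protein",  # the server rejects plain "refseq"
        sequence,
        entrez_query=organism_query(organism),
    )
```

Calling blast_refseq_protein(seq, "rattus norvegicus") should then return a
handle that can be read or parsed like any other qblast result.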

--Michiel.


-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From mdehoon at c2b2.columbia.edu  Thu Jun  1 21:12:57 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Thu, 01 Jun 2006 18:12:57 -0700
Subject: [BioPython] NCBIWWW.qblast
In-Reply-To: <20060531114048.83077.qmail@web36813.mail.mud.yahoo.com>
References: <20060531114048.83077.qmail@web36813.mail.mud.yahoo.com>
Message-ID: <447F9099.1040800@c2b2.columbia.edu>

Try this instead:

from Bio import Fasta
from Bio.Blast import NCBIWWW

file_for_blast = open('fasta', 'r')
f_iterator = Fasta.Iterator(file_for_blast)

seqnum = 0

# Run one BLAST search per record and save each result to its own file.
for f_record in f_iterator:
    result_handle = NCBIWWW.qblast('blastp', 'nr', f_record)
    save_file = open('my_blast' + str(seqnum) + '.out', 'w')
    blast_results = result_handle.read()
    save_file.write(blast_results)
    save_file.close()
    seqnum += 1
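
The same idea, sketched with Bio.SeqIO for Biopython releases that no longer
ship Bio.Fasta. This is an untested-against-NCBI sketch: the function names
output_name and blast_each_record are invented here, and the key point is
simply that the search and the file write both happen inside the loop, once
per record.

```python
def output_name(index, prefix="my_blast"):
    """Return the output file name for the record at position `index`."""
    return "%s%d.out" % (prefix, index)


def blast_each_record(fasta_path, program="blastp", database="nr"):
    """BLAST every record in a FASTA file, one output file per record.

    Hypothetical helper: requires Biopython and a network connection.
    """
    from Bio import SeqIO
    from Bio.Blast import NCBIWWW

    for index, record in enumerate(SeqIO.parse(fasta_path, "fasta")):
        # A fresh search per record; a handle can only be read once.
        result_handle = NCBIWWW.qblast(program, database, record.format("fasta"))
        with open(output_name(index), "w") as save_file:
            save_file.write(result_handle.read())
        result_handle.close()
```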


--Michiel.

alper soyler wrote:
> Dear All,
> 
> I have a fasta file (called fasta) containing 20 proteins. I want to blast them in an order. How can I write the results of these 20 proteins in different output files. I tried to write the below script but the 'my_blast2.out' file turned empty. Can you help me please?
> 
> regards,
> Alper
> 
> #!usr/local/bin/python
> 
> from Bio import Fasta
> file_for_blast = open('fasta', 'r')
> f_iterator = Fasta.Iterator(file_for_blast)
> f_record = f_iterator.next()
> 
> from Bio.Blast import NCBIWWW
> result_handle = NCBIWWW.qblast('blastp', 'nr', f_record)
> 
> seqnum = 0
> 
> for f_record  in f_iterator:
>     save_file = open('my_blast.out', 'w')
>     blast_results = result_handle.read()
>     save_file.write(blast_results)
>     save_file.close()
>     seqnum += 1
>     save_file2 = open('my_blast2.out', 'w')
>     blast_results = result_handle.read()
>     save_file2.write(blast_results)
>     save_file2.close()
> 		
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython


-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From omid9dr18 at hotmail.com  Thu Jun  1 18:39:34 2006
From: omid9dr18 at hotmail.com (Omid Khalouei)
Date: Thu, 1 Jun 2006 22:39:34 +0000
Subject: [BioPython] Synthesized or Clinical PDB sequence
Message-ID: <BAY103-W77543FB5420D46D12C82EE6900@phx.gbl>

Hello,
 
Is there any way to find out if a sequence corresponding to a PDB structure was obtained clinically or was synthesized without having to read the primary citations?
 
Thanks for your help.
Omid K.

From boris.steipe at utoronto.ca  Thu Jun  1 22:25:48 2006
From: boris.steipe at utoronto.ca (Boris Steipe)
Date: Thu, 1 Jun 2006 22:25:48 -0400
Subject: [BioPython] Synthesized or Clinical PDB sequence
In-Reply-To: <BAY103-W77543FB5420D46D12C82EE6900@phx.gbl>
References: <BAY103-W77543FB5420D46D12C82EE6900@phx.gbl>
Message-ID: <CCAB0DA0-7488-41BA-BF83-E22B11AD5E59@utoronto.ca>

Since the PDB does not use a constrained vocabulary, this is a bit  
unreliable. But the information is supposed to be entered in the  
SOURCE record.
cf.: http://www.rcsb.org/pdb/file_formats/pdb/pdbguide2.2/part_20.html
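
A rough sketch of pulling the SOURCE record out of a PDB file with plain
Python, assuming the usual layout (record name in columns 1-6, continuation
number in columns 8-10, text from column 11) and "TOKEN: value;" pairs such
as SYNTHETIC: YES. The function names are made up, and because the
vocabulary is not constrained the result still needs a human eye.

```python
def read_source_text(pdb_text):
    """Join the text of all SOURCE records found in PDB-format text."""
    parts = []
    for line in pdb_text.splitlines():
        if line.startswith("SOURCE"):
            # Skip the record name and continuation number (columns 1-10).
            parts.append(line[10:79].strip())
    return " ".join(parts)


def source_tokens(source_text):
    """Split SOURCE text into (token, value) pairs, in file order."""
    pairs = []
    for item in source_text.split(";"):
        if ":" in item:
            key, value = item.split(":", 1)
            pairs.append((key.strip(), value.strip()))
    return pairs


def looks_synthetic(pdb_text):
    """True if any molecule in the entry is flagged SYNTHETIC: YES."""
    tokens = source_tokens(read_source_text(pdb_text))
    return ("SYNTHETIC", "YES") in tokens
```

Pairs are returned as a list rather than a dict because an entry with
several molecules repeats the same tokens under different MOL_ID values.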

HTH,
Boris


On 1 Jun 2006, at 18:39, Omid Khalouei wrote:

> Hello,
>
> Is there any way to find out if a sequence corresponding to a PDB  
> structure was obtained clinically or was synthesized without having  
> to read the primary citations?
>
> Thanks for your help.
> Omid K.


From lee.byung-chul at kaist.ac.kr  Fri Jun  2 05:45:09 2006
From: lee.byung-chul at kaist.ac.kr (Lee, Byung-chul)
Date: Fri, 02 Jun 2006 18:45:09 +0900
Subject: [BioPython] Drawing Ramanchandran plot
Message-ID: <448008A5.8090602@kaist.ac.kr>

Hi all,

While calculating the torsion angles of some atoms in PDB files, I want
to draw the Ramachandran plot for them.
However, I cannot find any modules or methods for doing that in Bio.PDB,
so if anyone knows where one is or how to make one, please let me know.

Thanks,
Byung-chul.

-- 
--------------------------------------------------------
The important thing is not to stop questioning.
                               : Albert Einstein

Byung chul Lee 
  a member of Protein BioInformatics Lab. (PBIL)
                at Detp. BioSystems KAIST, Korea
                                  Ph.D candidate
                                  82-42-869-4357
--------------------------------------------------------


From biopython at maubp.freeserve.co.uk  Fri Jun  2 08:15:25 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 02 Jun 2006 13:15:25 +0100
Subject: [BioPython] Drawing Ramanchandran plot
In-Reply-To: <448008A5.8090602@kaist.ac.kr>
References: <448008A5.8090602@kaist.ac.kr>
Message-ID: <44802BDD.6080703@maubp.freeserve.co.uk>

Lee, Byung-chul wrote:
> Hi all,
> 
> While calculating the torsion angles of some atoms in PDB files, I want
> to draw the Ramachandran plot for them.
> However, I cannot find any modules or methods for doing that in Bio.PDB,
> so if anyone knows where one is or how to make one, please let me know.
> 
> Thanks,
> Byung-chul.
> 

A work in progress:

http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/ramachandran/

Short summary about calculating the angles:
* MMTK is great, providing it can load the PDB file.
   Very very easy to get the angles
* BioPython's Bio.PDB will load most/all PDB files, but
   you have to work out the backbone and angles yourself.
* Python Macromolecular Library (mmLib) might also be worth looking at.

Once you have the angles, you will want to draw the plots - the link 
above suggests a package like Excel, R, or Peter Robinson's Java Program:

http://www.charite.de/ch/medgen/compgen/ramachandran/
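
If you do end up working out the angles yourself, the core of it is a
four-point torsion calculation. Here is a small standard-library sketch
(the function names are just for illustration, and the sign convention is
worth checking against a structure with known angles). For residue i, phi
uses the atoms C(i-1), N(i), CA(i), C(i) and psi uses N(i), CA(i), C(i),
N(i+1).

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees defined by four (x, y, z) points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Unit vector along the central bond p1 -> p2.
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Components of b0 and b2 perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))
```

The resulting (phi, psi) pairs can be written out as two columns and
plotted with any of the tools mentioned above.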

Peter


From sbassi at gmail.com  Wed Jun  7 15:25:44 2006
From: sbassi at gmail.com (Sebastian Bassi)
Date: Wed, 7 Jun 2006 16:25:44 -0300
Subject: [BioPython] From REF to sequence?
Message-ID: <b43bf2080606071225y4db4c23an5572468818908179@mail.gmail.com>

Hello,

I have a list like this:

>ref|NP_918285.1|
>dbj|BAD88119.1|
>dbj|BAD88118.1|
>ref|XP_475495.1|
>emb|CAD37200.1|
>gb|AAM64572.1|

(the list is much longer, but this sample should give you the idea).
I would like to create a URL from each entry to retrieve the full
NCBI information about these sequences. Is there a Biopython method for
doing this? I once read about an NCBI syntax for building URLs, but I
can't find it.
Best regards,
SB.

-- 
Bioinformatics news: http://www.bioinformatica.info
Lriser: http://www.linspire.com/lraiser_success.php?serial=318

From chris.lasher at gmail.com  Thu Jun  8 17:32:26 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Thu, 8 Jun 2006 17:32:26 -0400
Subject: [BioPython] Distance Matrix Parsers
Message-ID: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>

Hi all,
  Are there any modules in BioPython to parse distance matrices? My
poking around the BioPython modules and Google searching does not turn
up any signs indicating there are distance matrix parsers, currently.
Two particularly useful parsers would be a parser for the output of
DNADIST/PROTDIST/RESTDIST from PHYLIP
(http://evolution.genetics.washington.edu/phylip.html), and a parser
for the MEGA (http://www.megasoftware.net/mega.html) distance matrix
format. If not, would there be any interest in creating parsers for
these matrices, other than my own? I think parsers for distance
matrices could be very useful to the community.

Chris

From mcolosimo at mitre.org  Fri Jun  9 08:16:02 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Fri, 9 Jun 2006 08:16:02 -0400
Subject: [BioPython] Distance Matrix Parsers
In-Reply-To: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
Message-ID: <9BE2CFC6-BACE-4D98-86A0-99E9CFBA228A@mitre.org>

Hi Chris,

I don't think there is a parser for those. I have thought in the past
about writing them. I was looking over the structure of BioPython
to see where they would best fit [I'll save my rant on this for another
time, maybe later today]. In the meantime, the folks at BioPerl have the
Bio-Phylo CPAN module <http://search.cpan.org/~rvosa/Bio-Phylo/>,
which looks nice, but it does NOT have what you are looking for.
However, I am planning on following that.

Marc

On Jun 8, 2006, at 5:32 PM, Chris Lasher wrote:

> Hi all,
>   Are there any modules in BioPython to parse distance matrices? My
> poking around the BioPython modules and Google searching does not turn
> up any signs indicating there are distance matrix parsers, currently.
> Two particularly useful parsers would be a parser for the output of
> DNADIST/PROTDIST/RESTDIST from PHYLIP
> (http://evolution.genetics.washington.edu/phylip.html), and a parser
> for the MEGA (http://www.megasoftware.net/mega.html) distance matrix
> format. If not, would there be any interest in creating parsers for
> these matrices, other than my own? I think parsers for distance
> matrices could be very useful to the community.
>
> Chris


From chris.lasher at gmail.com  Fri Jun  9 11:59:56 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Fri, 9 Jun 2006 11:59:56 -0400
Subject: [BioPython] Distance Matrix Parsers
In-Reply-To: <9BE2CFC6-BACE-4D98-86A0-99E9CFBA228A@mitre.org>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<9BE2CFC6-BACE-4D98-86A0-99E9CFBA228A@mitre.org>
Message-ID: <128a885f0606090859x608e733ela89fdb879e531dc8@mail.gmail.com>

Hi Marc,

Thanks for the reply. I had not seen the Bio::Phylo package before.
Thanks for pointing that out. That seems to be a really useful
library, though it's not exactly what I was thinking about when I
originally posted. I was thinking more along the lines of the
Bio::Matrix modules
(http://bio.perl.org/wiki/Special:Search?search=matrix&go=Go).

I don't think writing parsers for these formats will be that
difficult. I am unsure, however, about what type of data structure the
matrix should be. The simplest solution is a nested list. Perhaps this
is the proper solution, as the user can then convert this over to a
NumPy multi-dimensional array, say, or some matrix object. I dunno.
Thoughts, comments, suggestions?

Chris

On 6/9/06, Marc Colosimo <mcolosimo at mitre.org> wrote:
> Hi Chris,
>
> I don't think there is a parser for those. I have in the past thought
> about writing them up. I was looking over the structure of BioPython
> to see where it would best fit [I'll save my rant on this for another
> time, maybe later today]. In the mean time, the folks at BioPerl have
> Bio-Phylo CPAN module <http://search.cpan.org/~rvosa/Bio-Phylo/>,
> which looks nice, but it does NOT have what you are looking for.
> However, I am planning on following that.
>
> Marc
>
> On Jun 8, 2006, at 5:32 PM, Chris Lasher wrote:
>
> > Hi all,
> >   Are there any modules in BioPython to parse distance matrices? My
> > poking around the BioPython modules and Google searching does not turn
> > up any signs indicating there are distance matrix parsers, currently.
> > Two particularly useful parsers would be a parser for the output of
> > DNADIST/PROTDIST/RESTDIST from PHYLIP
> > (http://evolution.genetics.washington.edu/phylip.html), and a parser
> > for the MEGA (http://www.megasoftware.net/mega.html) distance matrix
> > format. If not, would there be any interest in creating parsers for
> > these matrices, other than my own? I think parsers for distance
> > matrices could be very useful to the community.
> >
> > Chris

From mcolosimo at mitre.org  Fri Jun  9 14:41:29 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Fri, 9 Jun 2006 14:41:29 -0400
Subject: [BioPython] Distance Matrix Parsers
In-Reply-To: <128a885f0606090859x608e733ela89fdb879e531dc8@mail.gmail.com>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<9BE2CFC6-BACE-4D98-86A0-99E9CFBA228A@mitre.org>
	<128a885f0606090859x608e733ela89fdb879e531dc8@mail.gmail.com>
Message-ID: <8AC5BAA2-BA47-4772-88C7-DF4B2061A8E2@mitre.org>

Chris,

I likewise didn't know about the Bio::Matrix::PhylipDist module.  
Personally, I would opt for a Matrix object (since Python is an
OO language) and store it internally as a nested list. That way you
have the best of both worlds. The next question is the object  
hierarchy. Here I would opt for a top level Matrix class (or module)  
and then subclass that under Phylo. So, something like this:

Bio.Matrix
Bio.Phylo.Matrix

and maybe things like the following (which isn't used/followed much  
here in BioPython)

Bio.Phylo.IO
Bio.Phylo.Parsers.PhylipDist
Bio.Phylo.Parsers.Newick
Bio.Phylo.Parsers.Nexus

And/or have
Bio.Phylo.Matrix.IO that uses the PhylipDist parser.

The next big question is what should Bio.Phylo.IO return? For
inspiration, we might want to look at Mesquite
<http://mesquiteproject.org/mesquite/mesquite.html>.
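
To make the "Matrix object backed by a nested list" idea concrete, here is
a minimal sketch. The class name DistanceMatrix and its methods are
placeholders only, not an existing or agreed BioPython API.

```python
class DistanceMatrix:
    """Hypothetical matrix object that stores its data as a nested list."""

    def __init__(self, names, rows):
        if len(rows) != len(names) or any(len(r) != len(names) for r in rows):
            raise ValueError("matrix must be square and match the names")
        self.names = list(names)
        self._rows = [list(r) for r in rows]
        # Map each taxon name to its row/column position.
        self._index = {name: i for i, name in enumerate(self.names)}

    def distance(self, a, b):
        """Look up the distance between two taxa by name."""
        return self._rows[self._index[a]][self._index[b]]

    def as_nested_list(self):
        """Return a copy of the raw data, e.g. for conversion to an array."""
        return [list(r) for r in self._rows]
```

The nested list stays available through as_nested_list(), so a user who
prefers a NumPy array can convert it without the class depending on NumPy.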

Marc

On Jun 9, 2006, at 11:59 AM, Chris Lasher wrote:

> Hi Marc,
>
> Thanks for the reply. I had not seen the Bio::Phylo package before.
> Thanks for pointing that out. That seems to be a really useful
> library, though it's not exactly what I was thinking about when I
> originally posted. I was thinking more along the lines of the
> Bio::Matrix modules
> (http://bio.perl.org/wiki/Special:Search?search=matrix&go=Go).
>
> I don't think writing parsers for these formats will be that
> difficult. I am unsure, however, about what type of data structure the
> matrix should be. The simplest solution is a nested list. Perhaps this
> is the proper solution, as the user can then convert this over to a
> NumPy multi-dimensional array, say, or some matrix object. I dunno.
> Thoughts, comments, suggestions?
>
> Chris
>
> On 6/9/06, Marc Colosimo <mcolosimo at mitre.org> wrote:
>> Hi Chris,
>>
>> I don't think there is a parser for those. I have in the past thought
>> about writing them up. I was looking over the structure of BioPython
>> to see where it would best fit [I'll save my rant on this for another
>> time, maybe later today]. In the mean time, the folks at BioPerl have
>> Bio-Phylo CPAN module <http://search.cpan.org/~rvosa/Bio-Phylo/>,
>> which looks nice, but it does NOT have what you are looking for.
>> However, I am planning on following that.
>>
>> Marc
>>
>> On Jun 8, 2006, at 5:32 PM, Chris Lasher wrote:
>>
>>> Hi all,
>>>   Are there any modules in BioPython to parse distance matrices? My
>>> poking around the BioPython modules and Google searching does not  
>>> turn
>>> up any signs indicating there are distance matrix parsers,  
>>> currently.
>>> Two particularly useful parsers would be a parser for the output of
>>> DNADIST/PROTDIST/RESTDIST from PHYLIP
>>> (http://evolution.genetics.washington.edu/phylip.html), and a parser
>>> for the MEGA (http://www.megasoftware.net/mega.html) distance matrix
>>> format. If not, would there be any interest in creating parsers for
>>> these matrices, other than my own? I think parsers for distance
>>> matrices could be very useful to the community.
>>>
>>> Chris


From chris.lasher at gmail.com  Fri Jun  9 17:13:32 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Fri, 9 Jun 2006 17:13:32 -0400
Subject: [BioPython] Distance Matrix Parsers
In-Reply-To: <8AC5BAA2-BA47-4772-88C7-DF4B2061A8E2@mitre.org>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<9BE2CFC6-BACE-4D98-86A0-99E9CFBA228A@mitre.org>
	<128a885f0606090859x608e733ela89fdb879e531dc8@mail.gmail.com>
	<8AC5BAA2-BA47-4772-88C7-DF4B2061A8E2@mitre.org>
Message-ID: <128a885f0606091413o23088caesf4934a81f0cc0489@mail.gmail.com>

> I likewise didn't know about the Bio::Matrix::PhylipDist module.
> Personally, I would opt for a Matrix object (since Python is an
> OO language) and store it internally as a nested list. That way you
> have the best of both worlds. The next question is the object
> hierarchy. Here I would opt for a top level Matrix class (or module)
> and then subclass that under Phylo. So, something like this:
>
> Bio.Matrix
> Bio.Phylo.Matrix

So is this more appropriate than Bio.Matrix.Phylo? A phylogenetic
matrix is a type of matrix, so that hierarchy is immediately
appealing, however, a phylogenetic matrix is not of much use in and of
itself, so I can see the argument that it should be placed in a
phylogeny package (which we have yet to write but as mentioned
earlier, could be very useful).

> and maybe things like the following (which isn't used/followed much
> here in BioPython)
>
> Bio.Phylo.IO
> Bio.Phylo.Parsers.PhylipDist
> Bio.Phylo.Parsers.Newick
> Bio.Phylo.Parsers.Nexus
>
> And/or have
> Bio.Phylo.Matrix.IO that uses the PhylipDist parser.

This is very very good, in my opinion. Thanks for doing the
heavy-lifting of the brainwork on this! =-)

> The next big question is what should Bio.Phylo.IO return? For
> inspiration, we might want to look at Mesquite
> <http://mesquiteproject.org/mesquite/mesquite.html>.

I must give a better look at this site before commenting, but once
again, thanks for bringing this to my awareness! What a helpful past
couple of emails. I will be out for the weekend but will think more
about this.

As a sidenote, should this discussion be moved to biopython-dev or is
it fine here?

Thanks again Marc,
Chris

From biopython at maubp.freeserve.co.uk  Sat Jun 10 06:10:02 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 10 Jun 2006 11:10:02 +0100
Subject: [BioPython] Distance Matrix Parsers
In-Reply-To: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
Message-ID: <448A9A7A.6050501@maubp.freeserve.co.uk>

Chris Lasher wrote:
> Hi all, Are there any modules in BioPython to parse distance
> matrices? My poking around the BioPython modules and Google searching
> does not turn up any signs indicating there are distance matrix
> parsers, currently. Two particularly useful parsers would be a parser
> for the output of DNADIST/PROTDIST/RESTDIST from PHYLIP 
> (http://evolution.genetics.washington.edu/phylip.html),

I've done a very small amount of work with neighbour joining trees, 
using PHYLIP format distance matrices.  The closest I could find to a 
file format definition was this page:

http://evolution.genetics.washington.edu/phylip/doc/distance.html

Points to be aware of:

In my experience, most software tools usually write the distances as a 
full symmetric matrix.  However, the "standard" explicitly discusses 
lower triangular form (missing out the diagonal distance zero entries) 
which has the significant advantage of using about half the disk space. 
  This is significant once you get into thousands of taxa.

So, make sure any parser can cope with both full symmetric, and lower 
triangular forms - ideally without the user having to care.
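
A sketch of what "without the user having to care" could look like. It
makes two simplifying assumptions that real files may break: names are
whitespace-delimited (so names longer than 10 characters work, but names
containing spaces do not), and each matrix row sits on a single line. The
function name is made up.

```python
def parse_phylip_distances(text):
    """Parse a PHYLIP-style distance matrix into (names, nested list).

    Accepts either a full square matrix or a lower-triangular one
    without the diagonal, and always returns the full square form.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    n = int(lines[0].split()[0])
    rows = lines[1:n + 1]
    if len(rows) != n:
        raise ValueError("expected %d rows, found %d" % (n, len(rows)))

    names = []
    values = []
    for row in rows:
        fields = row.split()
        names.append(fields[0])
        values.append([float(x) for x in fields[1:]])

    if all(len(v) == n for v in values):
        return names, values  # already square

    if all(len(v) == i for i, v in enumerate(values)):
        # Lower triangle only: mirror it and leave zeros on the diagonal.
        matrix = [[0.0] * n for _ in range(n)]
        for i, v in enumerate(values):
            for j, d in enumerate(v):
                matrix[i][j] = matrix[j][i] = d
        return names, matrix

    raise ValueError("rows are neither square nor lower triangular")
```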

This also raises the point about how to store the matrix in memory. 
Does Numeric/NumPy have an efficient way of storing symmetric matrices? 
  This is less flexible than the suggested list of lists, but for large 
datasets would need much less memory.
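
I don't know of a dedicated symmetric type, so here is one way to get the
saving by hand: keep only the entries below the diagonal in a flat array of
doubles and compute the position from (i, j). This is only a sketch with a
made-up class name; it uses the standard library array module rather than
Numeric/NumPy.

```python
from array import array


class PackedSymmetricMatrix:
    """Symmetric matrix with a zero diagonal, storing only i > j entries."""

    def __init__(self, n):
        self.n = n
        # n*(n-1)/2 doubles instead of n*n list entries.
        self._data = array("d", [0.0]) * (n * (n - 1) // 2)

    def _index(self, i, j):
        if not (0 <= i < self.n and 0 <= j < self.n):
            raise IndexError("index out of range")
        if i < j:
            i, j = j, i
        # Row i of the strict lower triangle starts after 0+1+...+(i-1) entries.
        return i * (i - 1) // 2 + j

    def __getitem__(self, key):
        i, j = key
        if i == j:
            return 0.0
        return self._data[self._index(i, j)]

    def __setitem__(self, key, value):
        i, j = key
        if i == j:
            raise ValueError("diagonal is fixed at zero")
        self._data[self._index(i, j)] = value
```

Setting m[2, 1] also sets m[1, 2], since both map to the same slot.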

Second point - the "official" PHYLIP distance matrix file format 
truncates the taxa names at 10 characters.  Some tools (e.g. clustalw) 
ignore this limitation and will use as many as needed for the full name. 
  I personally find this much nicer - after all most gene identifiers 
(e.g. GI numbers) are eight characters to start with, and if you are 
dealing with multiple features in each gene 10 characters is tough going.

So, I would make sure you test the parser on this format variant (with 
names longer than 10 characters).  I can supply some examples if you like.

For writing matrices to file, the issue of following the strict 10 
character taxa limit might best be handled as an option (default to max 
10, with a warning if any names are truncated, and an error if 
truncation renders names non-unique?).

Likewise an option to save matrices as either fully symmetric or lower 
triangular.  I would lean towards using fully symmetric as the default 
as it seems to be more common.
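
Roughly what I have in mind for the writer, as a sketch only: the function
name, the argument names and the fixed four-decimal formatting are
arbitrary choices, not part of any file format definition.

```python
import warnings


def format_phylip_distances(names, matrix, max_name_len=10,
                            lower_triangular=False):
    """Return a PHYLIP-style distance matrix as a string.

    max_name_len=10 follows the strict limit; pass None to write full names.
    """
    short = [n[:max_name_len] if max_name_len else n for n in names]
    if len(set(short)) != len(short):
        raise ValueError("taxon names are not unique after truncation")
    if short != list(names):
        warnings.warn("some taxon names were truncated")

    width = max_name_len or 0
    lines = ["%d" % len(short)]
    for i, (name, row) in enumerate(zip(short, matrix)):
        # Lower-triangular output keeps only the entries left of the diagonal.
        values = row[:i] if lower_triangular else row
        fields = [name.ljust(width)] + ["%.4f" % d for d in values]
        lines.append(" ".join(fields).rstrip())
    return "\n".join(lines) + "\n"
```

The defaults give the strict, fully symmetric form; a caller who knows the
downstream tool is relaxed can pass max_name_len=None.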

> and a parser for the MEGA (http://www.megasoftware.net/mega.html)
> distance matrix format. If not, would there be any interest in
> creating parsers for these matrices, other than my own? I think
> parsers for distance matrices could be very useful to the community.

I suspect that for serious tree building pure Python will not be 
competitive with existing C/C++ code on speed, but it could nonetheless 
be useful.

Peter


From idoerg at burnham.org  Sat Jun 10 11:08:43 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Sat, 10 Jun 2006 08:08:43 -0700
Subject: [BioPython] Distance Matrix Parsers
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<448A9A7A.6050501@maubp.freeserve.co.uk>
Message-ID: <1F97379A556D0946AAEFE3F63FD6F5744D468D@MAIL.burnham.org>

Hi,

Bio.SubsMat has a parser for substitution matrices, lower triangular and square. Feel free to recycle code.

Best,

Iddo


--
Iddo Friedberg, PhD
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
T: +1 858 646 3100 x3516
http://iddo-friedberg.org
http://BioFunctionPrediction.org


-----Original Message-----
From: biopython-bounces at lists.open-bio.org on behalf of Peter
Sent: Sat 6/10/2006 3:10 AM
To: BioPython Mailing List
Subject: Re: [BioPython] Distance Matrix Parsers
 
Chris Lasher wrote:
> Hi all, Are there any modules in BioPython to parse distance
> matrices? My poking around the BioPython modules and Google searching
> does not turn up any signs indicating there are distance matrix
> parsers, currently. Two particularly useful parsers would be a parser
> for the output of DNADIST/PROTDIST/RESTDIST from PHYLIP 
> (http://evolution.genetics.washington.edu/phylip.html),

I've done a very small amount of work with neighbour joining trees, 
using PHYLIP format distance matrices.  The closest I could find to a 
file format definition was this page:

http://evolution.genetics.washington.edu/phylip/doc/distance.html

Points to be aware of:

In my experience, most software tools usually write the distances as a 
full symmetric matrix.  However, the "standard" explicitly discusses 
lower triangular form (missing out the diagonal distance zero entries) 
which has the significant advantage of using about half the disk space. 
  This is significant once you get into thousands of taxa.

So, make sure any parser can cope with both full symmetric, and lower 
triangular forms - ideally without the user having to care.

This also raises the point about how to store the matrix in memory. 
Does Numeric/NumPy have an efficient way of storing symmetric matrices? 
  This is less flexible than the suggested list of lists, but for large 
datasets would need much less memory.

Second point - the "official" PHYLIP distance matrix file format 
truncates the taxa names at 10 characters.  Some tools (e.g. clustalw) 
ignore this limitation and will use as many as needed for the full name. 
  I personally find this much nicer - after all most gene identifiers 
(e.g. GI numbers) are eight characters to start with, and if you are 
dealing with multiple features in each gene 10 characters is tough going.

So, I would make sure you test the parser on this format variant (with 
names longer than 10 characters).  I can supply some examples if you like.

For writing matrices to file, the issue of following the strict 10 
character taxa limit might best be handled as an option (default to max 
10, with a warning if any names are truncated, and an error if 
truncation renders names non-unique?).

Likewise an option to save matrices as either fully symmetric or lower 
triangular.  I would lean towards using fully symmetric as the default 
as it seems to be more common.

> and a parser for the MEGA (http://www.megasoftware.net/mega.html)
> distance matrix format. If not, would there be any interest in
> creating parsers for these matrices, other than my own? I think
> parsers for distance matrices could be very useful to the community.

I suspect that for serious tree building pure Python will not be 
competitive with existing C/C++ code on speed, but it could nonetheless 
be useful.

Peter



From mcolosimo at mitre.org  Mon Jun 12 08:38:18 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Mon, 12 Jun 2006 08:38:18 -0400
Subject: [BioPython] Distance Matrix Parsers
In-Reply-To: <128a885f0606091413o23088caesf4934a81f0cc0489@mail.gmail.com>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<9BE2CFC6-BACE-4D98-86A0-99E9CFBA228A@mitre.org>
	<128a885f0606090859x608e733ela89fdb879e531dc8@mail.gmail.com>
	<8AC5BAA2-BA47-4772-88C7-DF4B2061A8E2@mitre.org>
	<128a885f0606091413o23088caesf4934a81f0cc0489@mail.gmail.com>
Message-ID: <65DF4A7E-B365-4E61-93D4-156A36F6ED54@mitre.org>

[cross-posting to biopython-dev]

Chris,

Oops, didn't notice this was on the general biopython mailing list. I
think many of the developers also subscribe to this list, but just in
case I'm cross-posting this.

Iddo pointed out Bio.SubsMat; I didn't know what that module did.
That is one problem with names like that, and the API docs are
helpful only when you look at them
<http://biopython.org/DIST/docs/api/public/trees.html> (kudos to
those who add documentation).

Given Bio.SubsMat and the BioPerl Module, I would strongly consider  
combining the Bio.SubsMat and the PhylipDist into a new Bio.Matrix  
module. From a Phylo module, a function/class can always call the  
Bio.Matrix classes.

Marc

On Jun 9, 2006, at 5:13 PM, Chris Lasher wrote:

>> I likewise didn't know about the Bio::Matrix::PhylipDist module.
>> Personally, I would opt for a Matrix Object (since this is Python a
>> OO language) and store it internally as a nested list. That way you
>> have the best of both worlds. The next question is the object
>> hierarchy. Here I would opt for a top level Matrix class (or module)
>> and then subclass that under Phylo. So, something like this:
>>
>> Bio.Matrix
>> Bio.Phylo.Matrix
>
> So is this more appropriate than Bio.Matrix.Phylo? A phylogenetic
> matrix is a type of matrix, so that hierarchy is immediately
> appealing, however, a phylogenetic matrix is not of much use in and of
> itself, so I can see the argument that it should be placed in a
> phylogeny package (which we have yet to write but as mentioned
> earlier, could be very useful).
>
>> and maybe things like the following (which isn't used/followed much
>> here in BioPython)
>>
>> Bio.Phylo.IO
>> Bio.Phylo.Parsers.PhylipDist
>> Bio.Phylo.Parsers.Newick
>> Bio.Phylo.Parsers.Nexus
>>
>> And/or have
>> Bio.Phylo.Matrix.IO that uses the PhylipDist parser.
>
> This is very very good, in my opinion. Thanks for doing the
> heavy-lifting of the brainwork on this! =-)
>
>> The next big question is what should Bio.Phylo.IO return? For
>> inspiration, we might want to look at Mesquite <http://
>> mesquiteproject.org/mesquite/mesquite.html>.
>
> I must give a better look at this site before commenting, but once
> again, thanks for bringing this to my awareness! What a helpful past
> couple of emails. I will be out for the weekend but will think more
> about this.
>
> As a sidenote, should this discussion be moved to biopython-dev or is
> it fine here?
>
> Thanks again Marc,
> Chris


From mcolosimo at mitre.org  Mon Jun 12 09:18:41 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Mon, 12 Jun 2006 09:18:41 -0400
Subject: [BioPython] Distance Matrix Parsers
In-Reply-To: <448A9A7A.6050501@maubp.freeserve.co.uk>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<448A9A7A.6050501@maubp.freeserve.co.uk>
Message-ID: <CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>

[cross post]
On Jun 10, 2006, at 6:10 AM, Peter wrote:

> Chris Lasher wrote:
>> Hi all, Are there any modules in BioPython to parse distance
>> matrices? My poking around the BioPython modules and Google searching
>> does not turn up any signs indicating there are distance matrix
>> parsers, currently. Two particularly useful parsers would be a parser
>> for the output of DNADIST/PROTDIST/RESTDIST from PHYLIP
>> (http://evolution.genetics.washington.edu/phylip.html),
>
> I've done a very small amount of work with neighbour joining trees,
> using PHYLIP format distance matrices.  The closest I could find to a
> file format definition was this page:
>
> http://evolution.genetics.washington.edu/phylip/doc/distance.html
>
> Points to be aware of:
>
> In my experience, most software tools usually write the distances as a
> full symmetric matrix.  However, the "standard" explicitly discusses
> lower triangular form (missing out the diagonal distance zero entries)
> which has the significant advantage of using about half the disk  
> space.
>   This is significant once you get into thousands of taxa.

This is still small potatoes compared to the input needed to generate
the distance matrices (especially with DNA/RNA sequences of any
decently sized gene).

>
> So, make sure any parser can cope with both full symmetric, and lower
> triangular forms - ideally without the user having to care.

Phylip does ask you which form to read or write; this is a pain at
times. So, having a parser figure this out would be nice. However,
the user should still know about the choices.
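
A minimal sketch of a parser that works out the layout for itself and
reports which one it found. It assumes one taxon per line (no wrapped
rows) and names without embedded spaces, which is looser than strict
Phylip; the function name is made up for illustration.

```python
def parse_distance_matrix(text):
    """Parse a PHYLIP-style distance matrix held in a string.

    Returns (names, matrix, layout): matrix is always a full square
    list of lists, layout is "square" or "lower".
    Assumes one taxon per line and names without embedded spaces.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    n = int(lines[0].split()[0])          # first line: number of taxa
    names, raw = [], []
    for line in lines[1:n + 1]:
        fields = line.split()
        names.append(fields[0])
        raw.append([float(x) for x in fields[1:]])

    # Square form: every row carries n distances.
    if all(len(row) == n for row in raw):
        return names, raw, "square"

    # Lower triangular form without the diagonal: row i carries i distances.
    if all(len(row) == i for i, row in enumerate(raw)):
        full = [[0.0] * n for _ in range(n)]
        for i, row in enumerate(raw):
            for j, d in enumerate(row):
                full[i][j] = full[j][i] = d
        return names, full, "lower"

    raise ValueError("neither a square nor a lower-triangular matrix")
```

Returning the layout alongside the data lets the caller see which form
was read without having to choose it up front.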

>
> This also raises the point about how to store the matrix in memory.
> Does Numeric/NumPy have an efficient way of storing symmetric  
> matrices?
>   This is less flexible than the suggested list of lists, but for  
> large
> datasets would need much less memory.

I believe that SciPy (Numeric/NumPy, etc.) is more efficient at
storing these things. But you lose that when you want to do pythonish
things to it (like write it back out).
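
For comparison, a pure-Python sketch that keeps only the lower triangle
(including the diagonal) in a flat array of doubles, so an n x n
symmetric matrix needs n*(n+1)/2 values. The class name is hypothetical
and this is not an existing Biopython or NumPy type.

```python
from array import array


class SymmetricMatrix(object):
    """Store only the lower triangle (with diagonal) of an n x n matrix."""

    def __init__(self, n):
        self.n = n
        # One double per (i, j) pair with j <= i, initialised to zero.
        self._data = array("d", [0.0]) * (n * (n + 1) // 2)

    def _index(self, i, j):
        if not (0 <= i < self.n and 0 <= j < self.n):
            raise IndexError("index out of range")
        if i < j:                 # (i, j) and (j, i) share one slot
            i, j = j, i
        return i * (i + 1) // 2 + j

    def __getitem__(self, key):
        i, j = key
        return self._data[self._index(i, j)]

    def __setitem__(self, key, value):
        i, j = key
        self._data[self._index(i, j)] = value
```

Usage would look like m = SymmetricMatrix(4); m[1, 3] = 2.5, after
which m[3, 1] reads back the same value.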

>
> Second point - the "official" PHYLIP distance matrix file format
> truncates the taxa names at 10 characters.  Some tools (e.g. clustalw)
> ignore this limitation and will use as many as needed for the full  
> name.

ClustalW does the CORRECT thing: it truncates the name to 10
characters for Phylip output (alignments). And it does the CORRECT
thing for its distance matrix file.

In Clustalw's trees.c file

void distance_matrix_output(FILE *ofile)

	/* left justify to the maximum length of names in the current
	   alignment file and use a space as a separator */
	fprintf(ofile,"\n%-*s ",max_names,names[i]);

Spaces in names are bad in this case, but Phylip is okay with them,
since the first 10 characters are the taxon name.

>   I personally find this much nicer - after all most gene identifiers
> (e.g. GI numbers) are eight characters to start with, and if you are
> dealing with multiple features in each gene 10 characters is tough  
> going.
>
> So, I would make sure you test the parser on this format variant (with
> names longer than 10 characters).  I can supply some examples if  
> you like.

By definition this isn't a variant of Phylip, but another format. So,  
one would need two parsers: PhylipDist and Dist (or ClustalDist).

>
> For writing matrices to file, the issue of following the strict 10
> character taxa limit might best be handled as an option (default to  
> max
> 10, with a warning if any names are truncated, and an error if
> truncation renders names non-unique?).

DON'T give an option of 10 or more. That is NOT the definition of the  
Phylip file Matrix structure, so why give the option? Make another  
class that outputs the whole name (ClustalDist).

I am pretty sure that Phylip doesn't care about non-unique names so  
why error out? However, the class should have a means for the user to  
ask this question.
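
One way the "ask this question" part could look, as a rough sketch: a
helper that truncates names to the Phylip width and reports which
truncated names collide, leaving it to the caller to decide whether
that matters. The function name and return shape are assumptions.

```python
from collections import Counter


def truncate_names(names, width=10):
    """Truncate taxon names and list any truncated names that clash.

    Returns (short_names, clashes); clashes is a sorted list of the
    truncated names that occur more than once.
    """
    short = [name[:width] for name in names]
    counts = Counter(short)
    clashes = sorted(name for name, count in counts.items() if count > 1)
    return short, clashes
```

A writer class could call this and warn, raise, or stay silent
depending on what the user asked for.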

>
> Likewise an option to save matrices as either fully symmetric or lower
> triangular.  I would lean towards using fully symmetric as the default
> as it seems to be more common.

Phylip's default seems to be a "Square" distance matrix, i.e. fully
symmetric. Keep this in mind for naming and documentation.
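
A small sketch of a writer with square output as the default and lower
triangular as the option. The column widths and the function name are
my own choices, not taken from Phylip; names are cut to 10 characters
as discussed above.

```python
def format_distance_matrix(names, matrix, lower=False):
    """Return a PHYLIP-style distance matrix as a string.

    matrix is a full square list of lists; with lower=True only the
    entries below the diagonal are written.
    """
    n = len(names)
    lines = ["%5d" % n]
    for i, name in enumerate(names):
        row = matrix[i][:i] if lower else matrix[i]
        cells = "".join(" %8.4f" % d for d in row)
        # Names are truncated and padded to the 10-character field.
        lines.append(name[:10].ljust(10) + cells)
    return "\n".join(lines) + "\n"
```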

>
>> and a parser for the MEGA (http://www.megasoftware.net/mega.html)
>> distance matrix format. If not, would there be any interest in
>> creating parsers for these matrices, other than my own? I think
>> parsers for distance matrices could be very useful to the community.
>
> I suspect that for serious tree building pure python will not be
> competitive with existing C/C++ code on speed - but nonetheless could
> be useful.
>

Well, we do have things like SciPy and PyClustal, which make things  
more even.

Marc

From asmund.skjaveland at usit.uio.no  Mon Jun 12 11:45:26 2006
From: asmund.skjaveland at usit.uio.no (Åsmund Skjæveland)
Date: Mon, 12 Jun 2006 17:45:26 +0200
Subject: [BioPython] Generating Nexus file from Genbank file
Message-ID: <448D8C16.6050204@fys.uio.no>

I have a file of Genbank records, and want to extract some of them and
save to a Nexus file. As far as I can tell from the API, this should work:

#!/site/compython/Linux/bin/python

import Bio, sys, time
from Bio.GenBank import Iterator
from Bio.Nexus.Nexus import Nexus

gbfile='results/sequences-txid34828.genbank'

fp = Bio.GenBank.FeatureParser()
gb = open(gbfile, 'r')

it = Bio.GenBank.Iterator(gb, fp)

nex = Nexus()

nr = 0;
rec = it.next()
while rec:
     # A string to identify the sequence with
     nexusname=rec.features[0].qualifiers['db_xref'][0] + '--' + rec.name
     nex.add_sequence(nexusname, rec.seq)

     rec = it.next()

print "\n\n%d records, %d gene names" % (nr, len(genenames))

nex.write_nexus_data('results/genegrab.nex', mrbayes=True)


But it doesn't. When I run it:

Traceback (most recent call last):
   File "py_nexustest.py", line 39, in ?
     nex.add_sequence(nexusname, rec.seq)
   File
"/site/compython/Linux/lib/python2.4/site-packages/Bio/Nexus/Nexus.py",
line 1412, in add_sequence
     self.matrix[name]=Seq(sequence,self.alphabet)
AttributeError: 'Nexus' object has no attribute 'alphabet'

What am I doing wrong? I don't really know the Nexus format, I just want
to send certain sequences to MrBayes.
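
For reference, a minimal NEXUS data block can also be written by hand.
The sketch below does not use Bio.Nexus and does not explain the
AttributeError above; it assumes the sequences are already aligned to
the same length, and the function name is made up. Punctuation in names
is replaced with underscores so each name stays a single NEXUS token.

```python
import re


def nexus_data_block(records, datatype="dna"):
    """Build a NEXUS data block from (name, sequence) string pairs.

    All sequences must have the same length (i.e. be aligned).
    """
    lengths = set(len(seq) for _, seq in records)
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    nchar = lengths.pop()
    # Replace punctuation so names stay single NEXUS tokens.
    safe = [(re.sub(r"[^A-Za-z0-9_]", "_", name), seq)
            for name, seq in records]
    width = max(len(name) for name, _ in safe)
    lines = ["#NEXUS",
             "begin data;",
             "  dimensions ntax=%d nchar=%d;" % (len(safe), nchar),
             "  format datatype=%s missing=? gap=-;" % datatype,
             "  matrix"]
    for name, seq in safe:
        lines.append("%s  %s" % (name.ljust(width), seq))
    lines.extend(["  ;", "end;"])
    return "\n".join(lines) + "\n"
```

The returned string can be written to a file with an ordinary
open(...).write(...) call.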

-- 
Åsmund Skjæveland {
    Scientific Computing Group, UiO;
}


From rohini.damle at gmail.com  Tue Jun 13 15:09:21 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Tue, 13 Jun 2006 12:09:21 -0700
Subject: [BioPython] (no subject)
Message-ID: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>

Hi,
I am new to Biopython and am trying to use the NCBIStandalone parser to
parse my BLAST output, which is in plain text format.
The parser works well for older BLAST outputs but breaks down for newer
BLAST outputs. Can someone suggest a way to overcome this problem with
the BLAST parser?
Thanks

From winter at biotec.tu-dresden.de  Wed Jun 14 04:00:20 2006
From: winter at biotec.tu-dresden.de (Christof Winter)
Date: Wed, 14 Jun 2006 10:00:20 +0200
Subject: [BioPython] (no subject)
In-Reply-To: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>
Message-ID: <448FC214.20805@biotec.tu-dresden.de>

Hi Rohini,

can you provide a minimal example of your python code along with two blast reports 
(working/not working)?

Cheers,
Christof


Rohini Damle wrote:
> Hi,
> I am new to Biopython and am trying to use the NCBIStandalone parser to
> parse my BLAST output, which is in plain text format.
> The parser works well for older BLAST outputs but breaks down for newer
> BLAST outputs. Can someone suggest a way to overcome this problem with
> the BLAST parser?
> Thanks


From biopython at maubp.freeserve.co.uk  Wed Jun 14 05:09:48 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 14 Jun 2006 10:09:48 +0100
Subject: [BioPython] plain txt blast output - xml instead
In-Reply-To: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>
Message-ID: <448FD25C.20101@maubp.freeserve.co.uk>

Rohini Damle wrote:
> Hi,
> I am new to Biopython and am trying to use the NCBIStandalone parser to
> parse my BLAST output, which is in plain text format.
> The parser works well for older BLAST outputs but breaks down for newer
> BLAST outputs.

The NCBI standalone blast and web blast plain text output keeps changing 
slightly, and as a result, the parser isn't always up to date.

> Can someone suggest a way to overcome this problem with
> the BLAST parser?

We recommend you use the XML output instead (this is possible with both 
online blast and the standalone tools).

For the standalone tools, repeat your searches with the command line 
option -m 7 to get XML output.

If you are using the Bio.NCBIStandalone.blastall() command, use argument 
align_view to set this.

You still use NCBIStandalone.Iterator (if you have multiple queries) but 
now use NCBIXML.BlastParser instead of NCBIStandalone.BlastParser

e.g.
http://bugzilla.open-bio.org/attachment.cgi?id=293&action=view
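
As a rough illustration of the -m 7 option, this sketch only builds the
argument list for a legacy blastall run; it does not call any Biopython
function. The helper name and the file names in the usage note are
hypothetical.

```python
def blastall_xml_command(program, database, query_file, out_file):
    """Return a blastall argument list that requests XML output.

    -p, -d, -i and -o select program, database, query and output file;
    -m 7 asks blastall for XML instead of plain text.
    """
    return ["blastall",
            "-p", program,
            "-d", database,
            "-i", query_file,
            "-o", out_file,
            "-m", "7"]
```

The list could then be handed to subprocess.call(), for example
subprocess.call(blastall_xml_command("blastp", "nr", "query.fasta",
"my_blast.xml")), provided blastall is on the PATH.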

Peter


From rohini.damle at gmail.com  Wed Jun 14 14:22:59 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Wed, 14 Jun 2006 11:22:59 -0700
Subject: [BioPython] plain txt blast output - xml instead
In-Reply-To: <448FD25C.20101@maubp.freeserve.co.uk>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>
	<448FD25C.20101@maubp.freeserve.co.uk>
Message-ID: <d9fd76050606141122m2d104ee6gb2c84473182e388d@mail.gmail.com>

Thank you very much for your help.
I have 55-56 proteins and I am using BLAST to find short, nearly exact
matches. The XML parser works fine for the first record, but even when I
use the iterator I cannot iterate through the records. I have used the
same code as you have given; what might be wrong?
Rohini.


On 6/14/06, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> Rohini Damle wrote:
> > Hi,
> > I am new to Biopython and am trying to use the NCBIStandalone parser to
> > parse my BLAST output, which is in plain text format.
> > The parser works well for older BLAST outputs but breaks down for newer
> > BLAST outputs.
>
> The NCBI standalone blast and web blast plain text output keeps changing
> slightly, and as a result, the parser isn't always up to date.
>
> > Can someone suggest me a way to overcome this blast parser's
> > problem?
>
> We recommend you use the XML output instead (this is possible with both
> online blast and the standalone tools).
>
> For the stand alone tools, repeat your searches with the command line
> option -m 7 to get XML output.
>
> If you are using the Bio.NCBIStandalone.blastall() command, use argument
> align_view to set this.
>
> You still use NCBIStandalone.Iterator (if you have multiple queries) but
> now use NCBIXML.BlastParser instead of NCBIStandalone.BlastParser
>
> e.g.
> http://bugzilla.open-bio.org/attachment.cgi?id=293&action=view
>
> Peter
>
>

From manickam.muthuraman at wur.nl  Wed Jun 14 16:22:56 2006
From: manickam.muthuraman at wur.nl (Muthuraman, Manickam)
Date: Wed, 14 Jun 2006 22:22:56 +0200
Subject: [BioPython] parsing the blastoutput and printing the alingment
Message-ID: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>

I am new to Python.

I am getting an error when parsing BLAST output. The same problem has been addressed by Michiel de Hoon, but I could not resolve it. Here is the error I am getting.

First I got an error when I typed b_record=b_parser.parse(blast_out); as Michiel suggested, I changed to
b_record=b_parser.parse(blast_out)
Traceback (most recent call last):
  File "<input>", line 1, in ?
  File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 112, in parse
    self._parser.parse(handler)
  File "/usr/lib/python2.4/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.4/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.4/xml/sax/expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.4/xml/sax/handler.py", line 38, in fatalError
    raise exception
SAXParseException: my_blast.out:1:4: not well-formed (invalid token)
blast_out=open('my_blast.out','r')
from Bio.Blast import NCBIStandalone
from Bio.Blast import NCBIXML
b_parser=NCBIXML.BlastParser()
b_iterator1=NCBIStandalone.Iterator(blast_out,b_parser)
for alignment in b_iterator1.alignments:
    for hsp in alignment.hsps:
        print 'seq:',alignment.title
    
Traceback (most recent call last):
  File "<input>", line 1, in ?
AttributeError: Iterator instance has no attribute 'alignments'


How do I print the alignment title and so on from the BLAST output file?
Thanks in advance.
-- 
Manickam(melaimanik)


From biopython at maubp.freeserve.co.uk  Wed Jun 14 17:54:53 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 14 Jun 2006 22:54:53 +0100
Subject: [BioPython] plain txt blast output - xml instead
In-Reply-To: <d9fd76050606141122m2d104ee6gb2c84473182e388d@mail.gmail.com>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>	<448FD25C.20101@maubp.freeserve.co.uk>
	<d9fd76050606141122m2d104ee6gb2c84473182e388d@mail.gmail.com>
Message-ID: <449085AD.7010801@maubp.freeserve.co.uk>

Rohini Damle wrote:
> Thank you very much for your help.
> I have 55-56 proteins & I am using Blast to find out short, nearly exact
> matches. The xml parser works fine for first record but even if I used the
> iterator, I CAN NOT ITERATE through the records, I have used the same code
> as u have given, what might be wrong?
> Rohini.

If you send us a short bit of example code and the error message, 
that would help.  Also, what version of BioPython are you using, and do 
you have Windows or Linux or MacOS?

One guess is that you will need to update the NCBIStandalone.py file to 
include a recent fix for iterating XML files.

Assuming you are using BioPython 1.41 on Windows, then click on this link 
and pick "download" near the top of the page to get the latest version:

http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIStandalone.py?cvsroot=biopython

Save it here:

c:\python24\lib\site-packages\Bio\Blast\NCBIStandalone.py

(Make a copy of the old file first, just in case)

Peter


From mdehoon at c2b2.columbia.edu  Wed Jun 14 17:55:17 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Wed, 14 Jun 2006 17:55:17 -0400
Subject: [BioPython] parsing the blastoutput and printing the alingment
In-Reply-To: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>
Message-ID: <449085C5.4020101@c2b2.columbia.edu>

Muthuraman, Manickam wrote:
> b_parser=NCBIXML.BlastParser()
> b_iterator1=NCBIStandalone.Iterator(blast_out,b_parser)
> for alignment in b_iterator1.alignments:
>     for hsp in alignment.hsps:
>         print 'seq:',alignment.title
>     
> Traceback (most recent call last):
>   File "<input>", line 1, in ?
> AttributeError: Iterator instance has no attribute 'alignments'
> 
Use:
b_record = b_iterator1.next()
for alignment in b_record.alignments:
    ...

Just like the example in the tutorial.

--Michiel.


-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From biopython at maubp.freeserve.co.uk  Wed Jun 14 17:48:20 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 14 Jun 2006 22:48:20 +0100
Subject: [BioPython] parsing the blastoutput and printing the alingment
In-Reply-To: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>
Message-ID: <44908424.2070407@maubp.freeserve.co.uk>

Muthuraman, Manickam wrote:
> I am new to python 
> 
> I am getting error in parsing blastoutput more over the same problem
> was been addressed by Michiel De Hoon but i could not clear...
> 
> blast_out=open('my_blast.out','r')
> from Bio.Blast import NCBIStandalone
> from Bio.Blast import NCBIXML
> b_parser=NCBIXML.BlastParser()
> b_iterator1=NCBIStandalone.Iterator(blast_out,b_parser)
> for alignment in b_iterator1.alignments:
>     for hsp in alignment.hsps:
>         print 'seq:',alignment.title
 >

Your example code is wrong.  The iterator object will return blast 
record objects (which have an alignments property).

Try something like this:

blast_out=open('my_blast.out','r')
from Bio.Blast import NCBIStandalone
from Bio.Blast import NCBIXML
b_parser=NCBIXML.BlastParser()
b_iterator=NCBIStandalone.Iterator(blast_out,b_parser)
for b_record in b_iterator:
     for alignment in b_record.alignments:
         for hsp in alignment.hsps:
             print 'seq:',alignment.title


Or for a full and tested example, try this :

http://bugzilla.open-bio.org/attachment.cgi?id=293&action=view
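
If you only need the hit titles and accessions, the same nesting
(records, then alignments, then HSPs) can also be read straight from the
XML with the standard library. This is a sketch that assumes the usual
BLAST XML element names (Iteration, Hit, Hit_def, Hit_accession, Hsp);
it is not the Biopython parser, and the function name is made up.

```python
import xml.etree.ElementTree as ET


def hit_titles(xml_text):
    """Return (accession, title, number_of_hsps) for every hit.

    One Iteration element corresponds to one query (one blast record),
    each Hit to one alignment, and each Hsp to one HSP.
    """
    root = ET.fromstring(xml_text)
    results = []
    for iteration in root.iter("Iteration"):
        for hit in iteration.iter("Hit"):
            accession = hit.findtext("Hit_accession")
            title = hit.findtext("Hit_def")
            n_hsps = len(hit.findall("Hit_hsps/Hsp"))
            results.append((accession, title, n_hsps))
    return results
```

For a saved file, hit_titles(open('my_blast.out').read()) would give the
list, assuming the file really contains BLAST XML.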

Peter


From manickam.muthuraman at wur.nl  Thu Jun 15 07:47:34 2006
From: manickam.muthuraman at wur.nl (Muthuraman, Manickam)
Date: Thu, 15 Jun 2006 13:47:34 +0200
Subject: [BioPython] parsing the blastoutput and printing the alingment
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>
Message-ID: <4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>

I am still getting the same error. I tried what Peter suggested, but it fails.


I have attached the error and the code

[manickam at bioinfo python]$ cat blas.py
from Bio import Fasta
file_for_blast=open('/home/manickam/Documents/m_cold.fasta','r')
f_iterator=Fasta.Iterator(file_for_blast)
f_record=f_iterator.next()
from Bio.Blast import NCBIWWW
result_handle=NCBIWWW.qblast('blastp','nr',f_record)
save_file=open('/home/manickam/my_blast.out','w')
blast_results=result_handle.read()
save_file.write(blast_results)
save_file.close()
blast_out=open('/home/manickam/my_blast.out','r')
from Bio.Blast import NCBIXML
from Bio.Blast import NCBIStandalone
b_parser=NCBIXML.BlastParser()
b_iterator=NCBIStandalone.Iterator(blast_out,b_parser)
for b_record in b_iterator:
    print "inside (3)outer loop"
    for alignment in b_record.alignments:
        print "inside 2 loop"
        for hsp in alignment.hsps:
            print "inside 1 loop"
            print 'seq:',alignment.title
blast_out.close()

[manickam at bioinfo python]$
[manickam at bioinfo python]$ python blas.py
/usr/lib/python2.4/site-packages/Bio/Blast/NCBIWWW.py:1064: UserWarning: qblast works only with blastn and blastp for now.
  warnings.warn("qblast works only with blastn and blastp for now.")
Traceback (most recent call last):
  File "blas.py", line 16, in ?
    for b_record in b_iterator:
  File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIStandalone.py", line 1385, in next
    return self._parser.parse(File.StringHandle(data))
  File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 112, in parse
    self._parser.parse(handler)
  File "/usr/lib/python2.4/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.4/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.4/xml/sax/expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.4/xml/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:1:4: not well-formed (invalid token)
[manickam at bioinfo python]$                                                               


From manickam.muthuraman at wur.nl  Thu Jun 15 07:51:36 2006
From: manickam.muthuraman at wur.nl (Muthuraman, Manickam)
Date: Thu, 15 Jun 2006 13:51:36 +0200
Subject: [BioPython] parsing the blastoutput and printing the alingment
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>
	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>
Message-ID: <4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>

Dear Michiel

I tried your suggestion as well, but I am still getting an error. I cannot even understand where I am making a mistake.


[manickam at bioinfo python]$ cat blas.py
from Bio import Fasta
file_for_blast=open('/home/manickam/Documents/m_cold.fasta','r')
f_iterator=Fasta.Iterator(file_for_blast)
f_record=f_iterator.next()
from Bio.Blast import NCBIWWW
result_handle=NCBIWWW.qblast('blastp','nr',f_record)
save_file=open('/home/manickam/my_blast.out','w')
blast_results=result_handle.read()
save_file.write(blast_results)
save_file.close()
blast_out=open('/home/manickam/my_blast.out','r')
from Bio.Blast import NCBIXML
from Bio.Blast import NCBIStandalone
b_parser=NCBIXML.BlastParser()
b_iterator=NCBIStandalone.Iterator(blast_out,b_parser)
b_record = b_iterator.next()
for alignment in b_record.alignments:
    print "inside 2 loop"
    for hsp in alignment.hsps:
        print "inside 1 loop"
        print 'seq:',alignment.title
blast_out.close()

[manickam at bioinfo python]$ python blas.py
/usr/lib/python2.4/site-packages/Bio/Blast/NCBIWWW.py:1064: UserWarning: qblast works only with blastn and blastp for now.
  warnings.warn("qblast works only with blastn and blastp for now.")
Traceback (most recent call last):
  File "blas.py", line 16, in ?
    b_record = b_iterator.next()
  File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIStandalone.py", line 1385, in next
    return self._parser.parse(File.StringHandle(data))
  File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 112, in parse
    self._parser.parse(handler)
  File "/usr/lib/python2.4/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.4/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.4/xml/sax/expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.4/xml/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:1:4: not well-formed (invalid token)
[manickam at bioinfo python]$                              


From biopython at maubp.freeserve.co.uk  Thu Jun 15 08:25:06 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jun 2006 13:25:06 +0100
Subject: [BioPython] parsing the blastoutput and printing the alingment
In-Reply-To: <4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>
	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>
Message-ID: <449151A2.1040602@maubp.freeserve.co.uk>

Muthuraman, Manickam wrote:
> I am still getting the same error. I tried what Peter suggested, but it fails.
 > ...

I couldn't see anything clearly wrong just from reading your code.

Which version of BioPython do you have?

Since BioPython 1.41, NCBIWWW.qblast has used XML as the default output 
format, but you can request it explicitly with:

result_handle=NCBIWWW.qblast('blastp','nr',f_record, format_type="XML")

Try opening your output file /home/manickam/my_blast.out in a text 
editor to double check that it really is XML, i.e. does it start with 
<?xml ...?>

If it is XML, then BioPython doesn't like it for some reason.  Maybe you 
could email the file to me and Michiel to take a look?

Peter


From manickam.muthuraman at wur.nl  Thu Jun 15 10:13:17 2006
From: manickam.muthuraman at wur.nl (Muthuraman, Manickam)
Date: Thu, 15 Jun 2006 16:13:17 +0200
Subject: [BioPython] parsing the blastoutput and printing the alingment
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>
	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>
	<4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>
Message-ID: <4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>


Dear Peter,

Here are the code, my_blast.out, and the error. What I need is to get all the BLAST hit sequences in FASTA format. By parsing the output I can extract the accession numbers from it.

Code
from Bio import Fasta
file_for_blast=open('/home/manickam/Documents/m_cold.fasta','r')
f_iterator=Fasta.Iterator(file_for_blast)
f_record=f_iterator.next()
from Bio.Blast import NCBIWWW
result_handle=NCBIWWW.qblast('blastp','nr',f_record, format_type="XML")
save_file=open('/home/manickam/my_blast.out','w')
blast_results=result_handle.read()
save_file.write(blast_results)
save_file.close()
blast_out=open('/home/manickam/my_blast.out','r')
from Bio.Blast import NCBIXML
from Bio.Blast import NCBIStandalone
b_parser=NCBIXML.BlastParser()
b_iterator=NCBIStandalone.Iterator(blast_out,b_parser)
b_record = b_iterator.next()
for alignment in b_record.alignments:
    print "inside 2 loop"
    for hsp in alignment.hsps:
        print "inside 1 loop"
        print 'seq:',alignment.title
blast_out.close()

Error
[root at bioinfo python]# python blas.py
/usr/lib/python2.4/site-packages/Bio/Blast/NCBIWWW.py:1064: UserWarning: qblast works only with blastn and blastp for now.
  warnings.warn("qblast works only with blastn and blastp for now.")
Traceback (most recent call last):
  File "blas.py", line 16, in ?
    b_record = b_iterator.next()
  File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIStandalone.py", line 1410, in next
    return self._parser.parse(File.StringHandle(data))
  File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 112, in parse
    self._parser.parse(handler)
  File "/usr/lib/python2.4/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.4/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.4/xml/sax/expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.4/xml/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:1:4: not well-formed (invalid token)
[root at bioinfo python]#            

my_blast.out
HTTP/1.1 200 OK
Date: Thu, 15 Jun 2006 13:57:19 GMT
Server: Nde
Content-Type: application/xml
Connection: close

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastp</BlastOutput_program>
  <BlastOutput_version>BLASTP 2.2.14 [May-07-2006]</BlastOutput_version>
  <BlastOutput_reference>Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), &quot;Gapped BLAST and PSI-BLAST: a new generation of protein database search programs&quot;, Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
  <BlastOutput_db>nr</BlastOutput_db>
  <BlastOutput_query-ID>1_13944</BlastOutput_query-ID>
  <BlastOutput_query-def>1BK0</BlastOutput_query-def>
  <BlastOutput_query-len>331</BlastOutput_query-len>
  <BlastOutput_param>
.
.
.
.
.
.
.
.
 <Hsp_identity>76</Hsp_identity>
              <Hsp_positive>128</Hsp_positive>
              <Hsp_gaps>27</Hsp_gaps>
              <Hsp_align-len>295</Hsp_align-len>
              <Hsp_qseq>VPKIDVSPLFGD-DQAAKMRVAQQIDAASRDTGFFYAVNHGIN---VQRLSQKTKEFHMSITPEEKWDLAIRAYNKEHQDQVRAGYYLSIPGKKAVESFCYLNP--NFTPDHPRIQAKTPTHEVNVWPDETKHPGFQDFAEQYYWDVFGLSSALLKGYALALGKEENFFARHFKPDDTLASVVLIRYP-YLDPYPEAAIKTAADGTKLSFEWHEDVSLITVLYQSNVQNLQVETAAGYQDIEADDTGYLINCGSYMAHLTNNYYKAPIHRV--KWVNAERQSLPFFVNLGYDSVI</Hsp_qseq>
              <Hsp_hseq>LPVIDLSLLDGSPESAAKFR--DDLLCATHDVGFFYLVGHGVDESLMDDLLAASREFFD--LPEDQKFAVENVKSPQFRGYTRVGGELT-EGKTDWREQIDVGPERDVIDNAPGLADYWRLEGPNLWPDAV--PQLRGLVNEWNDKLSAVSLRLLRAWAHALGAPEDVFDNAFA-DKPFPQLKIVRYPGESNPEPKQGVGAHRDGGVLTL----------LMVEPGKGGLQVDYNGEWVDVPPKPGAFVVNIGEMLELATEGYLKATLHRVISPLIGDDRISIPFFFNPALDTVM</Hsp_hseq>
              <Hsp_midline>+P ID+S L G  + AAK R    +  A+ D GFFY V HG++   +  L   ++EF     PE++        + + +   R G  L+  GK        + P  +   + P +         N+WPD    P  +    ++   +  +S  LL+ +A ALG  E+ F   F  D     + ++RYP   +P P+  +    DG  L+           ++ +     LQV+    + D+      +++N G  +   T  Y KA +HRV    +  +R S+PFF N   D+V+</Hsp_midline>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>3695564</Statistics_db-num>
          <Statistics_db-len>1269795892</Statistics_db-len>
          <Statistics_hsp-len>0</Statistics_hsp-len>
          <Statistics_eff-space>0</Statistics_eff-space>
          <Statistics_kappa>0.041</Statistics_kappa>
          <Statistics_lambda>0.267</Statistics_lambda>
          <Statistics_entropy>0.14</Statistics_entropy>
        </Statistics>
      </Iteration_stat>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
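
One thing worth noting about the dump above: it begins with HTTP response 
headers rather than with the XML declaration. An XML parser would reject 
such a file within its first line, which looks consistent with the 1:4 
position in the traceback, although nothing in this thread confirms that 
this is the cause. As a sketch of one possible workaround, the text before 
the declaration can be dropped before parsing. The helper name and the 
shortened sample below are made up for illustration.

```python
import xml.etree.ElementTree as ET


def drop_leading_non_xml(text):
    """Return text starting at the XML declaration, discarding anything before it."""
    start = text.find("<?xml")
    if start == -1:
        raise ValueError("no XML declaration found")
    return text[start:]


# Shortened stand-in for a saved response; this is not real BLAST output.
saved = (
    "HTTP/1.1 200 OK\n"
    "Content-Type: application/xml\n"
    "\n"
    '<?xml version="1.0"?>\n'
    "<BlastOutput><BlastOutput_db>nr</BlastOutput_db></BlastOutput>\n"
)

# Parsing the raw text fails because of the header lines.
try:
    ET.fromstring(saved)
except ET.ParseError as err:
    print("raw text rejected:", err)

# Parsing from the declaration onwards succeeds.
root = ET.fromstring(drop_leading_non_xml(saved))
print(root.findtext("BlastOutput_db"))
```

The cleaned text could then be written back to my_blast.out before handing 
the file to the BLAST XML parser.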


From biopython at maubp.freeserve.co.uk  Thu Jun 15 11:01:42 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jun 2006 16:01:42 +0100
Subject: [BioPython] parsing the blastoutput and printing the alingment
In-Reply-To: <4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>
	<4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>
Message-ID: <44917656.6090602@maubp.freeserve.co.uk>

Muthuraman, Manickam wrote:
> Dear Peter,
> 
> Here are the code, my_blast.out, and the error. What I need is to get all
> the BLAST hit sequences in FASTA format. By parsing the output I can
> extract the accession numbers from it.

I made an example fasta file containing just this one sequence twice:

 >example1
VPKIDVSPLFGDDQAAKMRVAQQIDAASRDTGFFYAVNHGINVQRLSQKTKEFHMSITP
EEKWDLAIRAYNKEHQDQVRAGYYLSIPGKKAVESFCYLNPNFTPDHPRIQAKTPTHEV
NVWPDETKHPGFQDFAEQYYWDVFGLSSALLKGYALALGKEENFFARHFKPDDTLASVV
LIRYPYLDPYPEAAIKTAADGTKLSFEWHEDVSLITVLYQSNVQNLQVETAAGYQDIEA
DDTGYLINCGSYMAHLTNNYYKAPIHRVKWVNAERQSLPFFVNLGYDSVI
 >example2
VPKIDVSPLFGDDQAAKMRVAQQIDAASRDTGFFYAVNHGINVQRLSQKTKEFHMSITP
EEKWDLAIRAYNKEHQDQVRAGYYLSIPGKKAVESFCYLNPNFTPDHPRIQAKTPTHEV
NVWPDETKHPGFQDFAEQYYWDVFGLSSALLKGYALALGKEENFFARHFKPDDTLASVV
LIRYPYLDPYPEAAIKTAADGTKLSFEWHEDVSLITVLYQSNVQNLQVETAAGYQDIEA
DDTGYLINCGSYMAHLTNNYYKAPIHRVKWVNAERQSLPFFVNLGYDSVI

I then edited the filenames in your example, and ran the code.  It 
worked for me using a fresh install of BioPython 1.41 on Linux with 
Python 2.4.2.

So the good news is your code seems fine.

Maybe there is something "funny" with your fasta file?  Accented 
characters for example - which would then be in the output XML file?
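
A quick way to look for such characters is to scan the file for bytes 
outside plain 7-bit ASCII. This is a rough sketch only; the function name 
and the sample data are invented for illustration.

```python
def find_non_ascii(data):
    """Yield (line_number, column, byte_value) for every byte above 127."""
    for line_number, line in enumerate(data.splitlines(), start=1):
        for column, byte in enumerate(line, start=1):
            if byte > 127:
                yield line_number, column, byte


# Made-up FASTA data with an accented character in the second title line.
sample = b">example1\nVPKIDVSPLFGD\n>caf\xc3\xa9\nGLFDFVNFV\n"
for hit in find_non_ascii(sample):
    print(hit)
```

For a real file, read it in binary mode, e.g. 
find_non_ascii(open(path, "rb").read()), so that no decoding happens 
before the check.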

Could you send me the fasta file and the XML file (in full, as 
attachments), off the mailing list to avoid clogging up everyone's inboxes?

Thanks

Peter


From biopython at maubp.freeserve.co.uk  Thu Jun 15 11:08:32 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jun 2006 16:08:32 +0100
Subject: [BioPython] Abuse of the new Wiki Homepage
Message-ID: <449177F0.1010209@maubp.freeserve.co.uk>

I've noticed someone has created an account "Ceas" on the wiki and has 
been inserting junk/spam links.  For example, look at the history of the 
main page:

http://biopython.org/wiki/Biopython

Who is in charge of the Wiki?  Can we
(a) block this account (short term action)
(b) tighten up rules for creating new accounts?

Peter


From arareko at campus.iztacala.unam.mx  Thu Jun 15 12:13:50 2006
From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra)
Date: Thu, 15 Jun 2006 11:13:50 -0500
Subject: [BioPython] Abuse of the new Wiki Homepage
In-Reply-To: <449177F0.1010209@maubp.freeserve.co.uk>
References: <449177F0.1010209@maubp.freeserve.co.uk>
Message-ID: <4491873E.50509@campus.iztacala.unam.mx>

Hi Peter,

We started to have the same problem in the BioPerl wiki some months ago. 
The way we usually solve this is by blocking the user account and 
rolling back to the previous version of the affected document.

We have a list of wiki administrators who are constantly (and 
independently) monitoring the recent changes in the site. This way we 
can keep track of the changes and revert damage to the content:

http://bioperl.org/wiki/BioPerl:Administrators
http://bioperl.org/wiki/Special:Recentchanges

You can also keep track of the changes by using the RSS or Atom feeds 
provided by the Recentchanges page:

http://bioperl.org/w/index.php?title=Special:Recentchanges&feed=rss
http://bioperl.org/w/index.php?title=Special:Recentchanges&feed=atom
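
For anyone who prefers to watch such a feed from a script, here is a rough 
sketch of listing the entries of an RSS document. The sample feed is made 
up, and the use of a dc:creator element for the editor's name is an 
assumption about the feed layout, so check it against the real feed first.

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace, assumed here to carry the editor's name.
DC = "{http://purl.org/dc/elements/1.1/}"

# Invented sample feed; a real one would be downloaded from the wiki.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Recent changes</title>
    <item>
      <title>Main Page</title>
      <link>http://example.org/wiki/Main_Page</link>
      <dc:creator>ExampleUser</dc:creator>
    </item>
    <item>
      <title>Help:Blacklist</title>
      <link>http://example.org/wiki/Help:Blacklist</link>
      <dc:creator>AnotherUser</dc:creator>
    </item>
  </channel>
</rss>
"""


def list_changes(feed_text):
    """Return (page title, editor, link) for each item in an RSS feed."""
    root = ET.fromstring(feed_text)
    return [
        (item.findtext("title"), item.findtext(DC + "creator"), item.findtext("link"))
        for item in root.iter("item")
    ]


for title, editor, link in list_changes(SAMPLE_FEED):
    print(title, editor, link)
```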

The wiki system keeps a record of blocked users and IPs; you can have a 
look here:

http://bioperl.org/wiki/Special:Ipblocklist

There is also a Blacklist, which complements Wikimedia's main one and 
helps detect spam content before it goes into a document:

http://bioperl.org/wiki/Help:Blacklist
http://meta.wikimedia.org/wiki/Spam_blacklist

I don't know who's in charge of BioPython's wiki but I hope this info 
can be helpful to you.

Regards,
Mauricio.

-- 
MAURICIO HERRERA CUADRA
arareko at campus.iztacala.unam.mx
Laboratorio de Genética
Unidad de Morfofisiología y Función
Facultad de Estudios Superiores Iztacala, UNAM


From dag at sonsorol.org  Thu Jun 15 12:31:01 2006
From: dag at sonsorol.org (Chris Dagdigian)
Date: Thu, 15 Jun 2006 12:31:01 -0400
Subject: [BioPython] Abuse of the new Wiki Homepage
In-Reply-To: <4491873E.50509@campus.iztacala.unam.mx>
References: <449177F0.1010209@maubp.freeserve.co.uk>
	<4491873E.50509@campus.iztacala.unam.mx>
Message-ID: <F2ADE880-41EA-412F-A34A-61E6E6533AA5@sonsorol.org>


I deal with a number of wiki sites, all of which are subjected to a  
constant stream of automated spam posters.

The single best defense is volunteers who monitor the "Recent  
Changes" feed and take instant action to roll back the spam changes:

http://biopython.org/wiki/Special:Recentchanges

People can monitor that page (in web or RSS form) and roll back spam  
shortly after it happens. It really is the best way.  Anyone can roll  
back changes. If you find yourself doing it often, ask to become a  
wiki administrator and then you'll be able to blocklist people and IP  
addresses as well.

Behind the scenes we do other things to block spam, including regular  
expression tests on content, blacklists etc. but it is a constant  
arms race with the wiki spammers and we are always a bit behind.
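
To give an idea of what a regular expression test on content can look 
like, here is a small sketch. The patterns and the function name are 
hypothetical; this is not the actual configuration used on the server.

```python
import re

# Hypothetical blacklist patterns; a real list would be kept by the admins.
BLACKLIST = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"cheap-pills\.example", r"casino[-\w]*\.example")
]

# Rough pattern for picking URLs out of wiki text.
URL_PATTERN = re.compile(r"https?://[^\s\]<>\"]+")


def find_blacklisted_links(text):
    """Return the URLs in text that match any blacklist pattern."""
    urls = URL_PATTERN.findall(text)
    return [url for url in urls if any(p.search(url) for p in BLACKLIST)]


edit = "See http://biopython.org/wiki/Biopython and http://cheap-pills.example/buy now"
print(find_blacklisted_links(edit))
```

An edit for which this returns a non-empty list could be rejected or 
flagged for review before it is saved.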

My $.02

-Chris


From jason.stajich at duke.edu  Thu Jun 15 12:45:50 2006
From: jason.stajich at duke.edu (Jason Stajich)
Date: Thu, 15 Jun 2006 12:45:50 -0400
Subject: [BioPython] Fwd:  Abuse of the new Wiki Homepage
References: <FBF0A76F-92B2-4F8B-BA28-400B1E1A0C2E@duke.edu>
Message-ID: <29F97001-146E-414A-8E5D-330AEDAB3392@duke.edu>


Begin forwarded message:

> From: Jason Stajich <jason.stajich at duke.edu>
> Date: June 15, 2006 12:40:13 PM EDT
> To: Mauricio Herrera Cuadra <arareko at campus.iztacala.unam.mx>
> Cc: biopython at biopython.org, Chris Dagdigian <dag at sonsorol.org>,  
> Chris Fields <cjfields at uiuc.edu>
> Subject: Re: [BioPython] Abuse of the new Wiki Homepage
>
> I'm not convinced the blacklist is working - but we need to make  
> sure it is enabled in the conf file on the server.  I've locked the  
> blacklist page as well so that only sysops can edit it.  Iddo and  
> Michiel are the main site admins right now, other people can be  
> promoted by them or one of the main site admins if we know who you  
> are.
>
> I've blocked the previous spammer's account.  You can easily revert  
> changes by using the rollback button on the diff page.
>
> The biopython community will have to decide how it wants to handle  
> new accounts to the wiki site. Whether there is patrolling or if  
> you want to lock the site down.  I would encourage all legitimate  
> users to add something to their User page so that we can have an  
> easier time distinguishing random account creation from real people.
>
> -jason
>
> --
> Jason Stajich
> Duke University
> http://www.duke.edu/~jes12
>
>

--
Jason Stajich
Duke University
http://www.duke.edu/~jes12


From rohini.damle at gmail.com  Thu Jun 15 12:36:27 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Thu, 15 Jun 2006 09:36:27 -0700
Subject: [BioPython] plain txt blast output - xml instead
In-Reply-To: <449085AD.7010801@maubp.freeserve.co.uk>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>
	<448FD25C.20101@maubp.freeserve.co.uk>
	<d9fd76050606141122m2d104ee6gb2c84473182e388d@mail.gmail.com>
	<449085AD.7010801@maubp.freeserve.co.uk>
Message-ID: <d9fd76050606150936r5f0e1fe1o89fd12c16f4361d1@mail.gmail.com>

Hi,
I am using BioPython 1.41 on Windows. I have also updated
NCBIStandalone.py from the link you gave. Here is my code.

from Bio.Blast import NCBIStandalone
from Bio.Blast import NCBIXML
blast_out = open("4proteinblast.xml","r")
b_iterator = NCBIStandalone.Iterator(blast_out, NCBIXML.BlastParser())

for b_record in b_iterator:
    query_name = b_record.query
    print query_name
    for alignment in b_record.alignments:
        print '****Alignment****'
        print 'sequence:', alignment.title

This code lists the "sequences producing significant alignments" for all 4
proteins, but prints the query name as P1 every time. I am getting all the
information I want, but I have 4 protein queries and this code reports only
P1 as the query (never P2, P3, or P4, although it does give the information
about them). I am attaching the XML file of the 4-protein BLAST results.
Thank you for your help.


On 6/14/06, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> Rohini Damle wrote:
> > Thank you very much for your help.
> > I have 55-56 proteins & I am using Blast to find short, nearly exact
> > matches. The XML parser works fine for the first record, but even if I
> > use the iterator, I CAN NOT ITERATE through the records. I have used the
> > same code as you gave. What might be wrong?
> > Rohini.
>
> If you send us a short bit of example code and the error message,
> that would help.  Also, what version of BioPython are you using, and do
> you have Windows or Linux or MacOS...
>
> One guess is that you will need to update the NCBIStandalone.py file to
> include a recent fix for iterating XML files.
>
> Assuming you are using BioPython 1.41 on Windows, then click on this link
> and pick "download" near the top of the page to get the latest version:
>
>
> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIStandalone.py?cvsroot=biopython
>
> Save it here:
>
> c:\python24\lib\site-packages\Bio\Blast\NCBIStandalone.py
>
> (Make a copy of the old file first, just in case)
>
> Peter
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 4proteinblast.xml
Type: text/xml
Size: 98271 bytes
Desc: not available
Url : http://lists.open-bio.org/pipermail/biopython/attachments/20060615/722b8845/attachment-0001.xml 

From cjfields at uiuc.edu  Thu Jun 15 12:41:05 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 15 Jun 2006 11:41:05 -0500
Subject: [BioPython] Abuse of the new Wiki Homepage
In-Reply-To: <4491873E.50509@campus.iztacala.unam.mx>
Message-ID: <000601c6909a$7f4ec5b0$15327e82@pyrimidine>

Looks like Jason's doing some work on the BioPython wiki to get it up to
speed.  I added Help:Blacklist as a start.

As Mauricio said, we probably need to get a small group of sysadmins together
to keep an eye on things and block potential spammers.

Chris


From biopython at maubp.freeserve.co.uk  Thu Jun 15 13:30:18 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jun 2006 18:30:18 +0100
Subject: [BioPython] plain txt blast output - xml instead
In-Reply-To: <d9fd76050606150936r5f0e1fe1o89fd12c16f4361d1@mail.gmail.com>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>	<448FD25C.20101@maubp.freeserve.co.uk>	<d9fd76050606141122m2d104ee6gb2c84473182e388d@mail.gmail.com>	<449085AD.7010801@maubp.freeserve.co.uk>
	<d9fd76050606150936r5f0e1fe1o89fd12c16f4361d1@mail.gmail.com>
Message-ID: <4491992A.5040301@maubp.freeserve.co.uk>

Rohini Damle wrote:
> Hi,
> I am using BioPython 1.41 on Windows. I have also updated
> NCBIStandalone.py from the link you gave. Here is my code.
> 
> from Bio.Blast import NCBIStandalone
> from Bio.Blast import NCBIXML
> blast_out = open("4proteinblast.xml","r")
> b_iterator = NCBIStandalone.Iterator(blast_out, NCBIXML.BlastParser())
> 
> for b_record in b_iterator:
>     query_name = b_record.query
>     print query_name
>     for alignment in b_record.alignments:
>         print '****Alignment****'
>         print 'sequence:', alignment.title
> 
> This code lists the "sequences producing significant alignments" for all 4
> proteins, but prints the query name as P1

This code does the same thing, but prints less on screen so it's easier 
to read:

from Bio.Blast import NCBIStandalone
from Bio.Blast import NCBIXML
blast_out = open("4proteinblast.xml","r")
b_iterator = NCBIStandalone.Iterator(blast_out, NCBIXML.BlastParser())

for b_record in b_iterator :
     query_name = b_record.query
     print query_name
     for alignment in b_record.alignments:
         print query_name, alignment.title.split()[0]


> I am getting all the information I want, but I have 4 protein
> queries and this code reports only P1 as the query (never P2, P3, or
> P4, although it does give the information about them). I am attaching
> the XML file of the 4-protein BLAST results. Thank you for your help.

Looking at the raw XML file by hand, I could only see references to P1, 
the first protein.

If the file had results for all four proteins I would expect to see:

<?xml version="1.0"?>
... results for P1 ...
<?xml version="1.0"?>
... results for P2 ...
<?xml version="1.0"?>
... results for P3 ...
<?xml version="1.0"?>
... results for P4 ...

Are you sure you gave Blast all four input sequences - and not just the 
first sequence?
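
A file laid out like that, with several complete XML documents one after 
another, cannot be given to an XML parser in one go. It has to be cut at 
each declaration first. Here is a rough sketch of that step; the function 
name, the sample text, and its element names are placeholders, not real 
BLAST output.

```python
import re
import xml.etree.ElementTree as ET


def split_xml_documents(text):
    """Split text holding several XML documents back to back into one string each."""
    starts = [match.start() for match in re.finditer(r"<\?xml\s", text)]
    ends = starts[1:] + [len(text)]
    return [text[start:end] for start, end in zip(starts, ends)]


# Invented stand-in for a multi-query result file.
combined = (
    '<?xml version="1.0"?>\n<Result><Query>P1</Query></Result>\n'
    '<?xml version="1.0"?>\n<Result><Query>P2</Query></Result>\n'
)

# Each piece is now a well-formed document and can be parsed on its own.
queries = [
    ET.fromstring(document).findtext("Query")
    for document in split_xml_documents(combined)
]
print(queries)
```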

Peter


From mdehoon at c2b2.columbia.edu  Thu Jun 15 13:43:51 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Thu, 15 Jun 2006 13:43:51 -0400
Subject: [BioPython] plain txt blast output - xml instead
In-Reply-To: <4491992A.5040301@maubp.freeserve.co.uk>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>	<448FD25C.20101@maubp.freeserve.co.uk>	<d9fd76050606141122m2d104ee6gb2c84473182e388d@mail.gmail.com>	<449085AD.7010801@maubp.freeserve.co.uk>	<d9fd76050606150936r5f0e1fe1o89fd12c16f4361d1@mail.gmail.com>
	<4491992A.5040301@maubp.freeserve.co.uk>
Message-ID: <44919C57.7030204@c2b2.columbia.edu>

Peter wrote:
> 
> Looking at the raw XML file by hand, I could only see references to P1, 
> the first protein.
> 
> If the file had results for all four proteins I would expect to see:
> 
> <?xml version="1.0"?>
> ... results for P1 ...
> <?xml version="1.0"?>
> ... results for P2 ...
> <?xml version="1.0"?>
> ... results for P3 ...
> <?xml version="1.0"?>
> ... results for P4 ...
> 
There are results for all four proteins in the XML file, but they look 
like this:

  <Iteration>
    <Iteration_iter-num>2</Iteration_iter-num>
    <Iteration_query-ID>2_20304</Iteration_query-ID>
    <Iteration_query-def>p2</Iteration_query-def>
    ...
  </Iteration>

and so on. Could you let us know how this XML file was generated?
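
To show what reading this layout involves, here is a sketch that walks the 
<Iteration> blocks of a single XML document and collects the hits for each 
query. It uses only the Python standard library, not the BioPython parser. 
The sample document is invented, and the Hit_def element name is an 
assumption about the rest of the format.

```python
import xml.etree.ElementTree as ET

# Invented, heavily shortened document in the layout shown above.
SAMPLE = """<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_query-def>p1</Iteration_query-def>
      <Iteration_hits>
        <Hit><Hit_def>hypothetical hit A</Hit_def></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>2</Iteration_iter-num>
      <Iteration_query-def>p2</Iteration_query-def>
      <Iteration_hits>
        <Hit><Hit_def>hypothetical hit B</Hit_def></Hit>
        <Hit><Hit_def>hypothetical hit C</Hit_def></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def hits_by_query(xml_text):
    """Return a list of (query name, [hit descriptions]), one per Iteration."""
    root = ET.fromstring(xml_text)
    results = []
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        hits = [hit.findtext("Hit_def") for hit in iteration.iter("Hit")]
        results.append((query, hits))
    return results


for query, hits in hits_by_query(SAMPLE):
    print(query, hits)
```

The point is that the query name has to be read per <Iteration> block 
rather than once from the top of the file.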

--Michiel


-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From biopython at maubp.freeserve.co.uk  Thu Jun 15 13:53:53 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jun 2006 18:53:53 +0100
Subject: [BioPython] parsing the blastoutput and printing the alingment
In-Reply-To: <4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>
	<4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>
Message-ID: <44919EB1.1080805@maubp.freeserve.co.uk>

I know you haven't got the XML parsing working yet - but I thought I 
should point something else out...

Muthuraman, Manickam wrote:
> from Bio import Fasta
> file_for_blast=open('/home/manickam/Documents/m_cold.fasta','r')
> f_iterator=Fasta.Iterator(file_for_blast)
> f_record=f_iterator.next()

f_record will contain a single fasta record (the first entry in the file 
m_cold.fasta only).

> from Bio.Blast import NCBIWWW
> result_handle=NCBIWWW.qblast('blastp','nr',f_record, format_type="XML")

This will only run blast on the one record (i.e. the first fasta entry 
in m_cold.fasta), so the resulting XML file will only have blast results 
for this protein.
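
To illustrate the difference between taking one record and looping over 
all of them, here is a small sketch. The read_fasta function is a 
simplified stand-in written for this example, not the Bio.Fasta API, and 
the BLAST submission step itself is left out.

```python
import io


def read_fasta(handle):
    """Yield (title, sequence) pairs from a FASTA-formatted text handle."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


sample = io.StringIO(">p1\nFILGIIITV\n>p2\nGLFDFVNFV\n")
records = read_fasta(sample)

first = next(records)   # a single call returns only the first record
rest = list(records)    # looping is needed to reach the others
print(first)
print(rest)
```

Each record reached by the loop would then need its own search and its 
own output file.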

I'm not sure if you can use the online NCBI blast (i.e. NCBIWWW.qblast) 
to submit multiple queries...

You might want to install stand alone blast on your own machine - as 
this will accept multiple inputs.  You just tell it to read m_cold.fasta 
as its input file, and the resulting XML file will contain the results 
for each sequence in the fasta file.

Note that if you know in advance that the XML blast output is from a 
single input query, you don't need the NCBI iterator.

Peter


From rohini.damle at gmail.com  Thu Jun 15 14:24:38 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Thu, 15 Jun 2006 11:24:38 -0700
Subject: [BioPython] (no subject)
Message-ID: <d9fd76050606151124m5d9cd72eme04fd3327074cd29@mail.gmail.com>

> I opened the 'search for short nearly exact match' blast tool, then
> entered these protein sequences:
>  >p1
> FILGIIITV
>  >p2
> GLFDFVNFV
>  >p3
> FLIVSLCPT
>  >p4
> RVYEALYYV
>
>
> I set parameters like E-value and organism and chose XML as the output
> format. The output does not contain references to all 4 proteins at
> the start, only in the <Iteration> blocks (one block for each protein).
> Is there any other way to generate the XML-formatted output?
> -Rohini.

From biopython at maubp.freeserve.co.uk  Thu Jun 15 14:38:54 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jun 2006 19:38:54 +0100
Subject: [BioPython] plain txt blast output - xml instead
In-Reply-To: <44919C57.7030204@c2b2.columbia.edu>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>	<448FD25C.20101@maubp.freeserve.co.uk>	<d9fd76050606141122m2d104ee6gb2c84473182e388d@mail.gmail.com>	<449085AD.7010801@maubp.freeserve.co.uk>	<d9fd76050606150936r5f0e1fe1o89fd12c16f4361d1@mail.gmail.com>	<4491992A.5040301@maubp.freeserve.co.uk>
	<44919C57.7030204@c2b2.columbia.edu>
Message-ID: <4491A93E.2020306@maubp.freeserve.co.uk>

Michiel Jan Laurens de Hoon wrote:
> Peter wrote:
> 
>>Looking at the raw XML file by hand, I could only see references to P1, 
>>the first protein.
>>
>>If the file had results for all four proteins I would expect to see:
>>
>><?xml version="1.0"?>
>>... results for P1 ...
>><?xml version="1.0"?>
>>... results for P2 ...
>><?xml version="1.0"?>
>>... results for P3 ...
>><?xml version="1.0"?>
>>... results for P4 ...
>>
> 
> There are results for all four proteins in the XML file, but they look 
> like this:
> 
>   <Iteration>
>     <Iteration_iter-num>2</Iteration_iter-num>
>     <Iteration_query-ID>2_20304</Iteration_query-ID>
>     <Iteration_query-def>p2</Iteration_query-def>
>     ...
>   </Iteration>
> 
> and so on.

Oh yeah.  I should have seen that, sorry.

According to the XML file, it is from BLASTP 2.2.14 [May-07-2006], maybe 
they changed the XML format without telling anyone?

I couldn't see anything obvious on this page:

http://www.ncbi.nlm.nih.gov/blast/blast_whatsnew.shtml

This looks like the source code here:

ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/ncbi.tar.gz

And you can view their CVS here:

http://www.ncbi.nlm.nih.gov/cvsweb/index.cgi/ncbi/algo/blast/

There is nothing in the check-in comments that leaps out at me regarding 
XML iterations...

 >
 > Could you let us know how this XML file was generated?
 >

e.g. Standalone or online?

Peter


From cariaso at yahoo.com  Thu Jun 15 14:39:21 2006
From: cariaso at yahoo.com (Mike Cariaso)
Date: Thu, 15 Jun 2006 11:39:21 -0700 (PDT)
Subject: [BioPython] Fwd:  Abuse of the new Wiki Homepage
In-Reply-To: <29F97001-146E-414A-8E5D-330AEDAB3392@duke.edu>
Message-ID: <20060615183921.27494.qmail@web52711.mail.yahoo.com>

> The biopython community will have to decide how it wants to handle  
> new accounts to the wiki site. Whether there is patrolling or if  
> you want to lock the site down.  I would encourage all legitimate  
> users to add something to their User page so that we can have an  
> easier time distinguishing random account creation from real people.

Consider this my vote against any sort of lockdown against new users. It can be a real deterrent to new contributors, and that is something we sorely need. I'd be more willing to roll back the useless spam than to risk deterring valuable new contributions.

Thank you to Maubp for already removing all of Ceas's garbage. 


_______________________________________________
BioPython mailing list  -  BioPython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython


From rohini.damle at gmail.com  Thu Jun 15 14:44:38 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Thu, 15 Jun 2006 11:44:38 -0700
Subject: [BioPython] plain txt blast output - xml instead
In-Reply-To: <4491A93E.2020306@maubp.freeserve.co.uk>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>
	<448FD25C.20101@maubp.freeserve.co.uk>
	<d9fd76050606141122m2d104ee6gb2c84473182e388d@mail.gmail.com>
	<449085AD.7010801@maubp.freeserve.co.uk>
	<d9fd76050606150936r5f0e1fe1o89fd12c16f4361d1@mail.gmail.com>
	<4491992A.5040301@maubp.freeserve.co.uk>
	<44919C57.7030204@c2b2.columbia.edu>
	<4491A93E.2020306@maubp.freeserve.co.uk>
Message-ID: <d9fd76050606151144q44d935aai3b2bef9a6d71210d@mail.gmail.com>

I used online ncbi blast to generate the xml output
Rohini



From cjfields at uiuc.edu  Thu Jun 15 12:55:40 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 15 Jun 2006 11:55:40 -0500
Subject: [BioPython] Abuse of the new Wiki Homepage
In-Reply-To: <FBF0A76F-92B2-4F8B-BA28-400B1E1A0C2E@duke.edu>
Message-ID: <000701c6909c$88bca480$15327e82@pyrimidine>

> I'm not convinced the blacklist is working - but we need to make sure
> it is enabled in the conf file on the server.  I've locked the
> blacklist page as well so that only sysops can edit it.  Iddo and
> Michiel are the main site admins right now, other people can be
> promoted by them or one of the main site admins if we know who you are.

Agreed.  I actually added the page as 'Help:BlackList' then redirected it to
'Help:Blacklist'; someone with admin privies can delete that redirect link
if they want.  My oops.  Like Jason says, probably doesn't make much of a
difference (the wiki version of the raindance, to ward off evil spammers).
 
> I've blocked the previous spammer's account.  You can easily revert
> changes by using the rollback button on the diff page.
> 
> The biopython community will have to decide how it wants to handle
> new accounts to the wiki site. Whether there is patrolling or if you
> want to lock the site down.  I would encourage all legitimate users
> to add something to their User page so that we can have an easier
> time distinguishing random account creation from real people.
> 
> -jason
> On Jun 15, 2006, at 12:13 PM, Mauricio Herrera Cuadra wrote:
> 
> > Hi Peter,
> >
> > We started to have the same problem in the BioPerl wiki some months
> > ago. The way we usually solve this is by blocking the user account
> > and rolling back to the previous version of the affected document.
> >
> > We have a list of wiki administrators who are constantly (and
> > independently) monitoring the recent changes in the site. This way
> > we can keep track of the changes and revert damages to the content:
> >
> > http://bioperl.org/wiki/BioPerl:Administrators
> > http://bioperl.org/wiki/Special:Recentchanges
> >
> > You can also keep track of the changes by using the RSS or Atom
> > feeds provided by the Recentchanges page:
> >
> > http://bioperl.org/w/index.php?title=Special:Recentchanges&feed=rss
> > http://bioperl.org/w/index.php?title=Special:Recentchanges&feed=atom
> >
> > The wiki system has memory of the blocked users and IP's, you can
> > have a look here:
> >
> > http://bioperl.org/wiki/Special:Ipblocklist
> >
> > There also exists a Blacklist, which is a complement to the main
> > Wikimedia's one and helps detect spam content before it goes into a
> > document:
> >
> > http://bioperl.org/wiki/Help:Blacklist
> > http://meta.wikimedia.org/wiki/Spam_blacklist
> >
> > I don't know who's in charge of BioPython's wiki but I hope this
> > info can be helpful to you.
> >
> > Regards,
> > Mauricio.
> >
> > Peter wrote:
> >> I've noticed someone has created an account "Ceas" on the wiki and
> >> has been inserting junk/spam links.  For example, look at the
> >> history of the main page:
> >> http://biopython.org/wiki/Biopython
> >> Who is in charge of the Wiki?  Can we
> >> (a) block this account (short term action)
> >> (b) tighten up rules for creating new accounts?
> >> Peter
> >
> > --
> > MAURICIO HERRERA CUADRA
> > arareko at campus.iztacala.unam.mx
> > Laboratorio de Genética
> > Unidad de Morfofisiología y Función
> > Facultad de Estudios Superiores Iztacala, UNAM
> >
> 
> --
> Jason Stajich
> Duke University
> http://www.duke.edu/~jes12


From manickam.muthuraman at wur.nl  Thu Jun 15 17:29:51 2006
From: manickam.muthuraman at wur.nl (Muthuraman, Manickam)
Date: Thu, 15 Jun 2006 23:29:51 +0200
Subject: [BioPython] parsing the blastoutput and printing the alingment
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>
	<4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>
	<44917656.6090602@maubp.freeserve.co.uk>
Message-ID: <4CDD243B32D07748944828EA7A29E4A3E2AFA5@salte0008.wurnet.nl>

Dear Peter

In this mail I am attaching three files: the sequence file, the Python script, and the BLAST output. I am using Python 2.4.1 (#2, Aug 25 2005, 18:20:57) and Biopython 1.40.

I spent almost the whole evening trying to upgrade Python and Biopython on Mandriva Linux, but I failed.

Let me know whether the versions of Python and Biopython matter here.

Thanks for helping me out with this.
from
manickam

-----Original Message-----
From:	Peter [mailto:biopython at maubp.freeserve.co.uk]
Sent:	Thu 6/15/2006 5:01 PM
To:	Muthuraman, Manickam
Cc:	biopython at lists.open-bio.org
Subject:	Re: [BioPython] parsing the blastoutput and printing the alingment

Muthuraman, Manickam wrote:
> Dear peter
> 
> here is the code my_blast.out and the error. My need is to get all the
 > blast hit sequences in fasta format. By parsing and i can extract
 > accession number from it.

I made an example fasta file containing just this one sequence twice:

 >example1
VPKIDVSPLFGDDQAAKMRVAQQIDAASRDTGFFYAVNHGINVQRLSQKTKEFHMSITP
EEKWDLAIRAYNKEHQDQVRAGYYLSIPGKKAVESFCYLNPNFTPDHPRIQAKTPTHEV
NVWPDETKHPGFQDFAEQYYWDVFGLSSALLKGYALALGKEENFFARHFKPDDTLASVV
LIRYPYLDPYPEAAIKTAADGTKLSFEWHEDVSLITVLYQSNVQNLQVETAAGYQDIEA
DDTGYLINCGSYMAHLTNNYYKAPIHRVKWVNAERQSLPFFVNLGYDSVI
 >example2
VPKIDVSPLFGDDQAAKMRVAQQIDAASRDTGFFYAVNHGINVQRLSQKTKEFHMSITP
EEKWDLAIRAYNKEHQDQVRAGYYLSIPGKKAVESFCYLNPNFTPDHPRIQAKTPTHEV
NVWPDETKHPGFQDFAEQYYWDVFGLSSALLKGYALALGKEENFFARHFKPDDTLASVV
LIRYPYLDPYPEAAIKTAADGTKLSFEWHEDVSLITVLYQSNVQNLQVETAAGYQDIEA
DDTGYLINCGSYMAHLTNNYYKAPIHRVKWVNAERQSLPFFVNLGYDSVI

I then edited the filenames in your example, and ran the code.  It 
worked for me using a fresh install of BioPython 1.41 on Linux with 
Python 2.4.2

So the good news is your code seems fine.

Maybe there is something "funny" with your fasta file?  Accented 
characters for example - which would then be in the output XML file?

Could you send me the fasta file and the XML file (in full, as 
attachments), off the mailing list to avoid clogging up everyone's inboxes.

Thanks

Peter


-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/ms-tnef
Size: 164714 bytes
Desc: not available
Url : http://lists.open-bio.org/pipermail/biopython/attachments/20060615/2f04b211/attachment-0001.bin 

From mdehoon at c2b2.columbia.edu  Thu Jun 15 18:37:18 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Thu, 15 Jun 2006 18:37:18 -0400
Subject: [BioPython] plain txt blast output - xml instead
In-Reply-To: <4491A93E.2020306@maubp.freeserve.co.uk>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>	<448FD25C.20101@maubp.freeserve.co.uk>	<d9fd76050606141122m2d104ee6gb2c84473182e388d@mail.gmail.com>	<449085AD.7010801@maubp.freeserve.co.uk>	<d9fd76050606150936r5f0e1fe1o89fd12c16f4361d1@mail.gmail.com>	<4491992A.5040301@maubp.freeserve.co.uk>
	<44919C57.7030204@c2b2.columbia.edu>
	<4491A93E.2020306@maubp.freeserve.co.uk>
Message-ID: <4491E11E.5020705@c2b2.columbia.edu>

Peter wrote:
> According to the XML file, it is from BLASTP 2.2.14 [May-07-2006], maybe 
> they changed the XML format without telling anyone?
> 
It appears that the XML format did change.
With Blastp 2.2.14, multiple searches generate multiple 
<Iteration>...</Iteration> blocks, one for each search.
With an older Blastp, multiple searches effectively generate multiple 
XML files (each with one <Iteration>...</Iteration> block). These files 
are then concatenated into one output file. Biopython then parses this 
file by looking for the beginning of each XML file in this output file.

The new output is in a sense better because the output file is a valid 
XML file. It may be that Biopython's XML parser ignores the <Iteration> 
tags, since in the old format there was only one <Iteration> block 
anyway, and therefore fails with the new format.
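
To make the new layout concrete, here is a rough sketch of walking the 
<Iteration> blocks with Python's standard xml.etree.ElementTree module. 
This is not Biopython's parser. The sample XML is a trimmed, hypothetical 
stand-in (the enclosing BlastOutput elements are assumed, and only the 
tags quoted in this thread are filled in), and it is written for a 
current Python 3 rather than the Python 2.4 used elsewhere in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for a BLASTP 2.2.14 XML report.
# Only the <Iteration> tags quoted in this thread are included.
SAMPLE_XML = """<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_query-def>p1</Iteration_query-def>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>2</Iteration_iter-num>
      <Iteration_query-def>p2</Iteration_query-def>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def queries_by_iteration(xml_text):
    """Return (iteration number, query name) for every <Iteration> block."""
    root = ET.fromstring(xml_text)
    result = []
    # One <Iteration> element per query sequence in the new format.
    for iteration in root.iter("Iteration"):
        number = int(iteration.findtext("Iteration_iter-num"))
        query = iteration.findtext("Iteration_query-def")
        result.append((number, query))
    return result


for number, query in queries_by_iteration(SAMPLE_XML):
    print(number, query)
```

A parser that only expects one <Iteration> per XML document would stop 
after the first pair, which matches the symptom of only seeing p1.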

--Michiel.

-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From biopython at maubp.freeserve.co.uk  Thu Jun 15 18:31:59 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jun 2006 23:31:59 +0100
Subject: [BioPython] parsing the blastoutput and printing the alingment
In-Reply-To: <4CDD243B32D07748944828EA7A29E4A3E2AFA5@salte0008.wurnet.nl>
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>	<44917656.6090602@maubp.freeserve.co.uk>
	<4CDD243B32D07748944828EA7A29E4A3E2AFA5@salte0008.wurnet.nl>
Message-ID: <4491DFDF.9070506@maubp.freeserve.co.uk>

Muthuraman, Manickam wrote:
> Dear Peter
> 
> In this mail i am attaching three files :seq file,python script file
> and the blast output. I am using python Python 2.4.1 (#2, Aug 25
> 2005, 18:20:57)and biopython 1.40

Your attachment came as a weird winmail.dat file - something  Outlook 
and the Microsoft Exchange Client sometimes does.  There is a Linux tool 
to "unzip" the file called tnef, which I installed on Ubuntu with a 
simple "apt-get install tnef"

Anyway, the problem is simply that your XML file has this little HTTP 
header at the start:

HTTP/1.1 200 OK
Date: Thu, 15 Jun 2006 21:23:08 GMT
Server: Nde
Content-Type: application/xml
Connection: close

If you edit the file to remove this, then BioPython can read the file fine.
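
If you would rather not edit the file by hand, something like the 
following sketch would do the same job. It is only an illustration of 
the manual workaround, not the fix that was checked into NCBIWWW.py; the 
helper name strip_http_header is made up here, the XML body is a 
placeholder, and it assumes a current Python 3.

```python
def strip_http_header(text):
    """Return text starting at the XML declaration, dropping anything before it."""
    start = text.find("<?xml")
    if start == -1:
        raise ValueError("no XML declaration found")
    return text[start:]


# The header lines are the ones shown above; the XML body is a placeholder.
raw = (
    "HTTP/1.1 200 OK\n"
    "Date: Thu, 15 Jun 2006 21:23:08 GMT\n"
    "Server: Nde\n"
    "Content-Type: application/xml\n"
    "Connection: close\n"
    "\n"
    '<?xml version="1.0"?>\n'
    "<BlastOutput>...</BlastOutput>\n"
)

cleaned = strip_http_header(raw)
print(cleaned)
```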

Looking over my old email, Michiel de Hoon checked in a fix from 
Alexander Morgan for this in March.  You need to update this file:

/usr/lib/python2.4/site-packages/Bio/Blast/NCBIWWW.py

Latest code is available here:

http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIWWW.py?cvsroot=biopython

It also gets rid of this annoying message:

UserWarning: qblast works only with blastn and blastp for now.

Peter


From manickam.muthuraman at wur.nl  Fri Jun 16 10:27:00 2006
From: manickam.muthuraman at wur.nl (Muthuraman, Manickam)
Date: Fri, 16 Jun 2006 16:27:00 +0200
Subject: [BioPython] Running Blast locally
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>	<44917656.6090602@maubp.freeserve.co.uk>
	<4CDD243B32D07748944828EA7A29E4A3E2AFA5@salte0008.wurnet.nl>
	<4491DFDF.9070506@maubp.freeserve.co.uk>
	<4CDD243B32D07748944828EA7A29E4A3E2AFA9@salte0008.wurnet.nl>
Message-ID: <4CDD243B32D07748944828EA7A29E4A3E2AFAB@salte0008.wurnet.nl>


Dear peter

In the last mail I said that b_record is None, so I tried to run blastall on my local computer, and it works now.

Here is the command:
./blastall -d db/swissprot -i /home/manickam/Documents/m_cold.fasta -p blastp 
and I am getting the result. So let me know whether I need to put this command in a string and pass that string (for example as my_blast_exe). I also still want to know how to pass the input file (my_blast_file).

i think i confuse myself
let me know your view for this
from
manickam


From winter at biotec.tu-dresden.de  Fri Jun 16 10:35:56 2006
From: winter at biotec.tu-dresden.de (Christof Winter)
Date: Fri, 16 Jun 2006 16:35:56 +0200
Subject: [BioPython] Running Blast locally
In-Reply-To: <4CDD243B32D07748944828EA7A29E4A3E2AFAB@salte0008.wurnet.nl>
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>	<44917656.6090602@maubp.freeserve.co.uk>	<4CDD243B32D07748944828EA7A29E4A3E2AFA5@salte0008.wurnet.nl>	<4491DFDF.9070506@maubp.freeserve.co.uk>	<4CDD243B32D07748944828EA7A29E4A3E2AFA9@salte0008.wurnet.nl>
	<4CDD243B32D07748944828EA7A29E4A3E2AFAB@salte0008.wurnet.nl>
Message-ID: <4492C1CC.4020607@biotec.tu-dresden.de>

Dear Manickam,

Can you try
blastall -V T -d db/swissprot -i /home/manickam/Documents/m_cold.fasta -p blastp

instead?

Christof


Muthuraman, Manickam wrote:
> Dear peter
> 
> In the last mail i said that b_record is none , so i tried to run the blastall in my local computer and it works right now.
> 
> here is the command :
> ./blastall -d db/swissprot -i /home/manickam/Documents/m_cold.fasta -p blastp 
>  and i am getting the result. so let me know if i need to put this command in string and pass this string (example:my_blast_exe). Still i want to know how to pass the input file(my_blast_file).
> 
> i think i confuse myself
> let me know your view for this
> from
> manickam
> 

-- 
Christof Winter
Bioinformatics Group
TU Dresden
Tatzberg 47-51
01307 Dresden, Germany

From manickam.muthuraman at wur.nl  Fri Jun 16 10:52:15 2006
From: manickam.muthuraman at wur.nl (Muthuraman, Manickam)
Date: Fri, 16 Jun 2006 16:52:15 +0200
Subject: [BioPython] Running Blast locally
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>	<44917656.6090602@maubp.freeserve.co.uk>	<4CDD243B32D07748944828EA7A29E4A3E2AFA5@salte0008.wurnet.nl>	<4491DFDF.9070506@maubp.freeserve.co.uk>	<4CDD243B32D07748944828EA7A29E4A3E2AFA9@salte0008.wurnet.nl>
	<4CDD243B32D07748944828EA7A29E4A3E2AFAB@salte0008.wurnet.nl>
	<4492C1CC.4020607@biotec.tu-dresden.de>
Message-ID: <4CDD243B32D07748944828EA7A29E4A3E2AFAC@salte0008.wurnet.nl>

Dear Christof

Your command also works on its own, but my question was how to integrate BLAST into a Biopython script.

The Biopython tutorial and cookbook has the following code, where I need to provide the path to the database, the file to BLAST, and blast_exe.

I am not clear on how to set the paths for seq_file, db, and exe.

import os
my_blast_db=os.path.join(os.getcwd(),'at-est','a-cds-10-7.fasta')
my_blast_file=os.path.join(os.getcwd(),'at-est','test_blast','sorghum_est-test.fasta')
my_blast_exe=os.path.join(os.getcwd(),'blast','/home/manickam/blast/blastall')


here is the whole script
import os
my_blast_db=os.path.join(os.getcwd(),'at-est','a-cds-10-7.fasta')
my_blast_file=os.path.join(os.getcwd(),'at-est','test_blast','sorghum_est-test.fasta')
my_blast_exe=os.path.join(os.getcwd(),'blast','/home/manickam/blast/blastall')
from Bio.Blast import NCBIStandalone
blast_out,error_info=NCBIStandalone.blastall(my_blast_exe,'blastp',my_blast_db,my_blast_file)
b_parser=NCBIStandalone.BlastParser()
b_iterator=NCBIStandalone.Iterator(blast_out,b_parser)
b_record=b_iterator.next()
while 1:
    b_record=b_iterator.next()
    if b_record is None:
        break
    for alignment in b_record.alignments:
        print "inside 2 loop"
        for hsp in alignment.hsps:
            print "inside 1 loop"
            print 'seq:',alignment.title

It runs, but b_record is None, so it leaves the while loop on the very first pass. So it seems I am not getting any output from BLAST.

from
manickam


From manickam.muthuraman at wur.nl  Fri Jun 16 04:42:08 2006
From: manickam.muthuraman at wur.nl (Muthuraman, Manickam)
Date: Fri, 16 Jun 2006 10:42:08 +0200
Subject: [BioPython] parsing the blastoutput and printing the alingment
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>	<44917656.6090602@maubp.freeserve.co.uk>
	<4CDD243B32D07748944828EA7A29E4A3E2AFA5@salte0008.wurnet.nl>
	<4491DFDF.9070506@maubp.freeserve.co.uk>
Message-ID: <4CDD243B32D07748944828EA7A29E4A3E2AFA7@salte0008.wurnet.nl>


Thanks peter

After overwriting the NCBIWWW.py header file my script works. 
once again i would like to thank

from
manickam




From manickam.muthuraman at wur.nl  Fri Jun 16 09:12:08 2006
From: manickam.muthuraman at wur.nl (Muthuraman, Manickam)
Date: Fri, 16 Jun 2006 15:12:08 +0200
Subject: [BioPython] Running Blast locally
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>	<44917656.6090602@maubp.freeserve.co.uk>
	<4CDD243B32D07748944828EA7A29E4A3E2AFA5@salte0008.wurnet.nl>
	<4491DFDF.9070506@maubp.freeserve.co.uk>
Message-ID: <4CDD243B32D07748944828EA7A29E4A3E2AFA9@salte0008.wurnet.nl>


Dear Peter

i am not clear about the subtopic running blast locally

let me explain in detail
i have blast executable files in my home directory i.e
/home/manickam/blast/blastall

i have my database files of nr,swissprot,pdb in /usr/junk/

the files which i can see under /usr/junk/        folder are 
nr.00.phr    
 nr.00.ppi  
nr.01.phr     
nr.01.ppi  nr.pal        
pdbaa.00.msk

There are a lot more in there, and their extensions are *.phr, ppi, pal, msk, psq

It is not clear to me from the manual where I need to provide the input sequences and how I store the output after running the local BLAST.

Below is the code which I tried; it runs, but b_record is None.

import os
my_blast_db=os.path.join(os.getcwd(),'at-est','a-cds-10-7.fasta')
my_blast_file=os.path.join(os.getcwd(),'at-est','test_blast','sorghum_est-test.fasta')
my_blast_exe=os.path.join(os.getcwd(),'blast','/home/manickam/blast/blastall')
from Bio.Blast import NCBIStandalone
blast_out,error_info=NCBIStandalone.blastall(my_blast_exe,'blastp',my_blast_db,my_blast_file)
b_parser=NCBIStandalone.BlastParser()
b_iterator=NCBIStandalone.Iterator(blast_out,b_parser)
b_record=b_iterator.next()
while 1:
    b_record=b_iterator.next()
    if b_record is None:
        break
    for alignment in b_record.alignments:
        for hsp in alignment.hsps:
            print 'seq:',alignment.title

from
manickam


From biopython at maubp.freeserve.co.uk  Fri Jun 16 11:53:31 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 Jun 2006 16:53:31 +0100
Subject: [BioPython] Running Blast locally
In-Reply-To: <4CDD243B32D07748944828EA7A29E4A3E2AFAC@salte0008.wurnet.nl>
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>	<44917656.6090602@maubp.freeserve.co.uk>	<4CDD243B32D07748944828EA7A29E4A3E2AFA5@salte0008.wurnet.nl>	<4491DFDF.9070506@maubp.freeserve.co.uk>	<4CDD243B32D07748944828EA7A29E4A3E2AFA9@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AFAB@salte0008.wurnet.nl>	<4492C1CC.4020607@biotec.tu-dresden.de>
	<4CDD243B32D07748944828EA7A29E4A3E2AFAC@salte0008.wurnet.nl>
Message-ID: <4492D3FB.1040706@maubp.freeserve.co.uk>

Muthuraman, Manickam wrote:
> Dear Christof
> 
> Your command also works  separately but my question was how to intergrate blast in biopython script.
> 
> in biopython tutorial and cookbook they have the follwoing code where i need to provide the path to database ,file to blast and blast_exe.
> 
> I am not clear how to set the path for seq_file,db and exe.
> 
> import os
> my_blast_db=os.path.join(os.getcwd(),'at-est','a-cds-10-7.fasta')
> my_blast_file=os.path.join(os.getcwd(),'at-est','test_blast','sorghum_est-test.fasta')
> my_blast_exe=os.path.join(os.getcwd(),'blast','/home/manickam/blast/blastall')

Try typing this at the python prompt:

import os
help(os.path.join)

Are you familiar with relative paths etc?  You might find something like 
this easier to understand:

my_blast_db   = '/home/manickam/db/at-est/a-cds-10-7.fasta'
my_blast_file = '/home/manickam/sorghum_est-test.fasta'
my_blast_exe  = '/home/manickam/blast/blastall'

Or, based on your previous email you were using:

 > here is the command :
 > ./blastall -d db/swissprot -i /home/manickam/Documents/m_cold.fasta
 > -p blastp

Maybe something like this:

my_blast_db   = '/home/manickam/blast/db/swissprot'
my_blast_file = '/home/manickam/Documents/m_cold.fasta'
my_blast_exe  = '/home/manickam/blast/blastall'

It all depends on where you installed the blast program, where you put 
the blast databases, and where you are going to have your inputfile.
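
As a small illustration of what help(os.path.join) is getting at: when a 
later component is an absolute path, everything before it is thrown 
away. The sketch below uses posixpath (the Unix flavour of os.path) so 
that it behaves the same on any platform; '/some/cwd' is just a made-up 
stand-in for whatever os.getcwd() returns, and the print calls assume a 
current Python 3.

```python
import posixpath

# An absolute last component discards the earlier ones entirely.
joined = posixpath.join('/some/cwd', 'blast', '/home/manickam/blast/blastall')

# Relative components are appended to the starting directory as expected.
relative = posixpath.join('/some/cwd', 'at-est', 'a-cds-10-7.fasta')

print(joined)
print(relative)
```

So the my_blast_exe line in the script ends up pointing at the absolute 
path, while the database and input file paths depend on the directory 
the script is started from.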

> here is the whole script
01> import os
02> my_blast_db=os.path.join(os.getcwd(),'at-est','a-cds-10-7.fasta')
03> my_blast_file=os.path.join(os.getcwd(),'at-est','test_blast','sorghum_est-test.fasta')
04> my_blast_exe=os.path.join(os.getcwd(),'blast','/home/manickam/blast/blastall')
05> from Bio.Blast import NCBIStandalone
06> blast_out,error_info=NCBIStandalone.blastall(my_blast_exe,'blastp',my_blast_db,my_blast_file)

At this point, some example scripts will save the output to a file, and 
then reload it and carry on.  This is very helpful if you have problems 
because you can open the file by hand and look at it.

07> b_parser=NCBIStandalone.BlastParser()
08> b_iterator=NCBIStandalone.Iterator(blast_out,b_parser)
09> b_record=b_iterator.next()
10> while 1:
11>     b_record=b_iterator.next()
12>     if b_record is None:
13>         break
14>     for alignment in b_record.alignments:
15>         print "inside 2 loop"
16>         for hsp in alignment.hsps:
17>             print "inside 1 loop"
18>             print 'seq:',alignment.title
> 
> it runs but b_record is None so it comes out of the while loop at first time itself. so it mean i am not getting out put of the blast.

Notice that at line 9, you set b_record to the first set of results 
(i.e. from the first sequence in your FASTA file).

Then, inside the loop, at line 11 you set b_record to the SECOND set of 
results and try to look at it.

I suggest you comment out line 9, and it should work better.
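
To show the effect without needing BLAST at all, here is a sketch using 
an ordinary Python iterator as a stand-in for the BLAST iterator. The 
function names are made up for this example, next(iterator, None) 
imitates an iterator that returns None when it runs out, and the code 
assumes a current Python 3.

```python
def read_all(records):
    """Fetch records only inside the loop, so none are skipped."""
    iterator = iter(records)
    seen = []
    while True:
        record = next(iterator, None)
        if record is None:
            break
        seen.append(record)
    return seen


def read_all_with_extra_fetch(records):
    """Same loop, but with an extra fetch before it (like line 9)."""
    iterator = iter(records)
    record = next(iterator, None)  # fetched here, then overwritten below
    seen = []
    while True:
        record = next(iterator, None)
        if record is None:
            break
        seen.append(record)
    return seen


print(read_all(["first result"]))
print(read_all_with_extra_fetch(["first result"]))
```

With only one set of results, the second version never processes 
anything, which would look exactly like b_record being None straight 
away.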

Finally, this code is using the "plain text" blast output, which can 
sometimes cause BioPython trouble.  I would recommend the XML parser but 
as you might know from the mailing list, it looks like they have changed 
the file format for multiple results in XML output...

Peter


From biopython at maubp.freeserve.co.uk  Fri Jun 16 12:06:14 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 Jun 2006 17:06:14 +0100
Subject: [BioPython] Running Blast locally
In-Reply-To: <4CDD243B32D07748944828EA7A29E4A3E2AFA9@salte0008.wurnet.nl>
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>	<44917656.6090602@maubp.freeserve.co.uk>	<4CDD243B32D07748944828EA7A29E4A3E2AFA5@salte0008.wurnet.nl>	<4491DFDF.9070506@maubp.freeserve.co.uk>
	<4CDD243B32D07748944828EA7A29E4A3E2AFA9@salte0008.wurnet.nl>
Message-ID: <4492D6F6.5060100@maubp.freeserve.co.uk>

I didn't see this email - they arrived out of order at my computer. 
Please also read my longer reply...

Muthuraman, Manickam wrote:
> i have blast executable files in my home directory i.e
> /home/manickam/blast/blastall

Then use this:

my_blast_exe='/home/manickam/blast/blastall'

> i have my database files of nr,swissprot,pdb in /usr/junk/
> 
> the files which i can see under /usr/junk/        folder are 
> nr.00.phr    
>  nr.00.ppi  
> nr.01.phr     
> nr.01.ppi  nr.pal        
> pdbaa.00.msk
> 
> lot in there and there extenstions are *.phr , ppi ,pal,msk,psq

I think you should use one of these, but I haven't checked this:

my_blast_db='/usr/junk/nr'
my_blast_db='/usr/junk/swissprot'
my_blast_db='/usr/junk/pdb'

> i am not clear from the manual where do i need to provide the input sequences

The input fasta file can be anywhere - you just have to tell Blast where 
it is.  e.g.

my_blast_file='/home/manickam/Documents/m_cold.fasta'


 > and how to i store the out put after running the local blast.

If you run blast "by hand" at the command prompt, use the option -o 
outputfilename (that is a lower case letter o, not zero, not uppercase).

You can also use Python to write the results to a file.
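
A minimal sketch of the Python route, assuming the BLAST output is 
available as a file-like handle with a read() method. Here an 
io.StringIO object with placeholder text stands in for that handle, the 
helper name save_handle is made up, and the file goes into a temporary 
directory so the example can be run anywhere (current Python 3 assumed).

```python
import io
import os
import tempfile


def save_handle(handle, filename):
    """Copy everything from a file-like handle into a file on disk."""
    with open(filename, "w") as out:
        out.write(handle.read())


# Stand-in for the handle returned by running BLAST; the text is a placeholder.
blast_out = io.StringIO("BLASTP 2.2.14 [May-07-2006]\n(placeholder report text)\n")

save_name = os.path.join(tempfile.mkdtemp(), "my_blast.out")
save_handle(blast_out, save_name)

# Reopen the saved file, e.g. to inspect it by hand or pass it to a parser.
with open(save_name) as saved:
    saved_text = saved.read()
print(saved_text)
```

Having the output on disk means you can open it in an editor when the 
parser gives unexpected results.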

> below is the following code which i tried and it works but b_record is none.

See my other email

Peter


From gvwilson at cs.utoronto.ca  Sun Jun 18 14:15:04 2006
From: gvwilson at cs.utoronto.ca (Greg Wilson)
Date: Sun, 18 Jun 2006 14:15:04 -0400
Subject: [BioPython] ann: open source course on basic software development
	skills
Message-ID: <e7456q$pq2$11@sea.gmane.org>

http://www.third-bit.com/swc is an open source course on basic software
development skills, aimed primarily at people with backgrounds in
science, engineering, and medicine who have little formal training in
programming, but find themselves doing a lot of it.  The course was
developed in part through support from the Python Software Foundation;
all of the material can be used and modified free of charge (but with
attribution).  If you have questions, would like to contribute material,
or have a success story you'd like to share, please contact Greg Wilson
(gvwilson at cs.utoronto.ca).

Thanks,
Greg


From rohini.damle at gmail.com  Mon Jun 19 19:36:36 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Mon, 19 Jun 2006 16:36:36 -0700
Subject: [BioPython] plain txt blast output - xml instead
In-Reply-To: <4491E11E.5020705@c2b2.columbia.edu>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>
	<448FD25C.20101@maubp.freeserve.co.uk>
	<d9fd76050606141122m2d104ee6gb2c84473182e388d@mail.gmail.com>
	<449085AD.7010801@maubp.freeserve.co.uk>
	<d9fd76050606150936r5f0e1fe1o89fd12c16f4361d1@mail.gmail.com>
	<4491992A.5040301@maubp.freeserve.co.uk>
	<44919C57.7030204@c2b2.columbia.edu>
	<4491A93E.2020306@maubp.freeserve.co.uk>
	<4491E11E.5020705@c2b2.columbia.edu>
Message-ID: <d9fd76050606191636s7246b7e4va89754200ffd2eb1@mail.gmail.com>

So what does one need to do to get Biopython working?  Make changes in the
XML parser so that it will consider one iteration for one result output?
-Rohini


On 6/15/06, Michiel Jan Laurens de Hoon <mdehoon at c2b2.columbia.edu> wrote:
>
> Peter wrote:
> > According to the XML file, it is from BLASTP 2.2.14 [May-07-2006], maybe
> > they changed the XML format without telling anyone?
> >
> It appears that the XML format did change.
> With Blastp 2.2.14, multiple searches generate multiple
> <Iteration>...</Iteration> blocks, one for each search.
> With an older Blastp, multiple searches effectively generate multiple
> XML files (each with one <Iteration>...</Iteration> block). These files
> are then concatenated into one output file. Biopython then parses this
> file by looking for the beginning of each XML file in this output file.
>
> The new output is in a sense better because the output file is a valid
> XML file. It may be that Biopython's XML parser ignores the <Iteration>
> tags, since in the old format there was only one <Iteration> block
> anyway, and therefore fails with the new format.
>
> --Michiel.
>
> --
> Michiel de Hoon
> Center for Computational Biology and Bioinformatics
> Columbia University
> 1130 St Nicholas Avenue
> New York, NY 10032
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From biopython at maubp.freeserve.co.uk  Tue Jun 20 09:52:48 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 20 Jun 2006 14:52:48 +0100
Subject: [BioPython] plain txt blast output - xml instead
In-Reply-To: <d9fd76050606191636s7246b7e4va89754200ffd2eb1@mail.gmail.com>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>	<448FD25C.20101@maubp.freeserve.co.uk>	<d9fd76050606141122m2d104ee6gb2c84473182e388d@mail.gmail.com>	<449085AD.7010801@maubp.freeserve.co.uk>	<d9fd76050606150936r5f0e1fe1o89fd12c16f4361d1@mail.gmail.com>	<4491992A.5040301@maubp.freeserve.co.uk>	<44919C57.7030204@c2b2.columbia.edu>	<4491A93E.2020306@maubp.freeserve.co.uk>	<4491E11E.5020705@c2b2.columbia.edu>
	<d9fd76050606191636s7246b7e4va89754200ffd2eb1@mail.gmail.com>
Message-ID: <4497FDB0.1000903@maubp.freeserve.co.uk>

Peter wrote:
>>> According to the XML file, it is from BLASTP 2.2.14 [May-07-2006], 
 >>> maybe they changed the XML format without telling anyone?

Michiel wrote:
>>It appears that the XML format did change.
>>With Blastp 2.2.14, multiple searches generate multiple
>><Iteration>...</Iteration> blocks, one for each search.
>>With an older Blastp, multiple searches effectively generate multiple
>>XML files (each with one <Iteration>...</Iteration> block). These files
>>are then concatenated into one output file. Biopython then parses this
>>file by looking for the beginning of each XML file in this output file.
>>
>>The new output is in a sense better because the output file is a valid
>>XML file. It may be that Biopython's XML parser ignores the <Iteration>
>>tags, since in the old format there was only one <Iteration> block
>>anyway, and therefore fails with the new format.

Rohini Damle wrote:
 > So what does one need to do to get Biopython working?  Make changes in
 > the XML parser so that it will consider one iteration for one result
 > output?

Basically, yes, we need to change the BioPython NCBI Blast XML code 
somehow - this might be best moved to the development mailing list.

Some relevant but probably slightly out of date documentation:

ftp://ftp.ncbi.nlm.nih.gov/blast/documents/xml/README.blxml

Notice this appears to describe the <Iteration>...</Iteration> block as 
follows:

BlastOutput_iter-num: the psi-blast iteration number (optional)

So whatever we do, we should have a look at the psi-blast output as well...

One idea I was thinking about is to modify the existing Blast XML parser 
to specify WHICH iteration number it should parse (ignoring the rest). 
An invalid iteration number would raise a new exception.

Then, a new Blast XML iterator would call the parser repeatedly 
incrementing the iteration number until the "invalid iteration number" 
error was raised, which would signal the end.
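
To make the idea concrete, here is a minimal sketch using only the
standard library, not the existing Biopython parser. The function and
exception names are hypothetical, the sample XML is a made-up miniature,
and it assumes iteration numbers start at 1 and are consecutive. The tag
names (Iteration, Iteration_iter-num, Hit, Hit_accession) are the ones
that appear in NCBI's XML output.

```python
import xml.etree.ElementTree as ET

# Made-up miniature of the new single-file layout; real output has many more tags.
SAMPLE = """<BlastOutput><BlastOutput_iterations>
<Iteration><Iteration_iter-num>1</Iteration_iter-num><Iteration_hits>
<Hit><Hit_accession>ACC_A</Hit_accession></Hit>
</Iteration_hits></Iteration>
<Iteration><Iteration_iter-num>2</Iteration_iter-num><Iteration_hits>
<Hit><Hit_accession>ACC_B</Hit_accession></Hit>
<Hit><Hit_accession>ACC_C</Hit_accession></Hit>
</Iteration_hits></Iteration>
</BlastOutput_iterations></BlastOutput>"""


class InvalidIterationError(Exception):
    """Raised when the requested iteration number is not in the file."""


def parse_iteration(xml_text, iter_num):
    """Return the hit accessions of one <Iteration> block, ignoring the rest."""
    root = ET.fromstring(xml_text)
    for iteration in root.iter("Iteration"):
        if iteration.findtext("Iteration_iter-num") == str(iter_num):
            return [hit.findtext("Hit_accession") for hit in iteration.iter("Hit")]
    raise InvalidIterationError(iter_num)


def iterate_results(xml_text):
    """Call the parser with 1, 2, 3, ... until the number is rejected."""
    iter_num = 1
    while True:
        try:
            hits = parse_iteration(xml_text, iter_num)
        except InvalidIterationError:
            return  # the "invalid iteration number" error signals the end
        yield hits
        iter_num += 1


results = list(iterate_results(SAMPLE))
print(results)
```

Note that this sketch re-reads the whole document for every iteration, so
it is simple rather than fast.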

Note that with the "old style concatenated XML entries" we could parse 
each entry one by one, without having to load the entire XML file into 
memory at once.  I don't think that will be possible with the new style 
XML files.
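
That said, an incremental XML reader might get around this. The sketch
below is untested against real BLAST output and uses the standard
library's xml.etree.ElementTree.iterparse, not Biopython; the function
name and the sample data are made up. It reports each <Iteration> block as
soon as its closing tag has been read and then discards the block's
contents.

```python
import io
import xml.etree.ElementTree as ET

# Made-up miniature of the new single-file layout.
SAMPLE = b"""<BlastOutput><BlastOutput_iterations>
<Iteration><Iteration_iter-num>1</Iteration_iter-num><Iteration_hits>
<Hit><Hit_accession>ACC_A</Hit_accession></Hit>
</Iteration_hits></Iteration>
<Iteration><Iteration_iter-num>2</Iteration_iter-num><Iteration_hits>
<Hit><Hit_accession>ACC_B</Hit_accession></Hit>
<Hit><Hit_accession>ACC_C</Hit_accession></Hit>
</Iteration_hits></Iteration>
</BlastOutput_iterations></BlastOutput>"""


def stream_iterations(handle):
    """Yield (iteration number, hit accessions) one block at a time."""
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Iteration":
            number = int(elem.findtext("Iteration_iter-num"))
            accessions = [hit.findtext("Hit_accession") for hit in elem.iter("Hit")]
            yield number, accessions
            elem.clear()  # drop the children of the block just reported


records = list(stream_iterations(io.BytesIO(SAMPLE)))
print(records)
```

The emptied <Iteration> elements stay attached to the root, so memory use
still grows a little with each block, but far less than keeping every hit.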

Peter


From biopython at maubp.freeserve.co.uk  Wed Jun 21 10:27:06 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 Jun 2006 15:27:06 +0100
Subject: [BioPython] docs have moved on the website
Message-ID: <4499573A.5060409@maubp.freeserve.co.uk>

I don't know if anyone has noticed this, but for example this:

http://www.biopython.org/docs/cookbook/genbank_to_fasta.html

Has moved to here:

http://www.biopython.org/DIST/docs/cookbook/genbank_to_fasta.html

Is it too late to revert to the old position?

If it is, to preserve any old links from external sites (and also to 
save google and other search engines having to update their indexes) 
maybe the website could automatically forward queries for:

http://www.biopython.org/docs/*

to:

http://www.biopython.org/DIST/docs/*

Good idea?  Bad idea?

Peter


From rohini.damle at gmail.com  Wed Jun 21 15:06:29 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Wed, 21 Jun 2006 12:06:29 -0700
Subject: [BioPython] Biopython's XMl parser fails with NCBI blast changed
	XML output format
Message-ID: <d9fd76050606211206pa104f7dwdebfcb05dcab09d2@mail.gmail.com>

Hi,
I am trying to parse the blast output (XML formatted, using online NCBI's
blast) I got as a result for 'short nearly exact matches' for my 50-55 short
protein sequences.
It looks like the XML format has changed and biopython's XML parser fails to
parse the blast records.
can somebody show a way to fix this thing?
Thank you
Rohini Damle

From biopython at maubp.freeserve.co.uk  Sun Jun 25 17:37:53 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 25 Jun 2006 22:37:53 +0100
Subject: [BioPython] Distance Matrix Parsers
In-Reply-To: <CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>	<448A9A7A.6050501@maubp.freeserve.co.uk>
	<CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>
Message-ID: <449F0231.2050308@maubp.freeserve.co.uk>

[Off topic, but has anyone else recently had valid messages bounced due 
to a "suspicious header"?]

Hello List,

I recently wanted to load a "PHYLIP distance matrix file" created by
clustalw for my own research...

As discussed earlier, clustalw bends the official PHYLIP specification
by not truncating long names to 10 characters.  For my dataset I need
the long names to avoid ambiguity.

The attached code implements a fairly simple distance matrix class and
associated code to read (parse) and write PHYLIP style distance matrices.

There are options to control strict 10 character name truncation, and
the separator character(s) when writing files.
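
For readers without the attachment, the reading side can be sketched
roughly as follows. This is not the attached code; the function name and
the strict flag are hypothetical, and it assumes each taxon's distances
sit on a single line and that relaxed-mode names contain no spaces.

```python
def parse_phylip_distances(lines, strict=False):
    """Parse a PHYLIP-style distance matrix into (names, rows of floats).

    strict=True  takes the name from the first 10 columns, as in the spec.
    strict=False takes the name up to the first whitespace, as clustalw writes it.
    """
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    count = int(rows[0].split()[0])  # first line holds the number of taxa
    names, distances = [], []
    for row in rows[1:count + 1]:
        if strict:
            name, rest = row[:10].strip(), row[10:]
        else:
            parts = row.split(None, 1)
            name = parts[0]
            rest = parts[1] if len(parts) > 1 else ""
        names.append(name)
        distances.append([float(value) for value in rest.split()])
    return names, distances


# Made-up example with clustalw-style long names.
relaxed_lines = [
    "3\n",
    "alpha_long_name_1 0.0 0.1 0.2\n",
    "beta_long_name_2 0.1 0.0 0.5\n",
    "gamma 0.2 0.5 0.0\n",
]
names, distances = parse_phylip_distances(relaxed_lines)

# Made-up example following the strict 10-column name field.
strict_lines = ["2", "Alpha".ljust(10) + "0.0 0.3", "Beta".ljust(10) + "0.3 0.0"]
strict_names, strict_distances = parse_phylip_distances(strict_lines, strict=True)
print(names, strict_names)
```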

Internally, I store the distances as a list of lists (of different
lengths) to mimic a lower triangular matrix.

For example, this matrix:

[[0.0, 0.1, 0.2],
   [0.1, 0.0, 0.5],
   [0.2, 0.5, 0.0]]

Is stored as this:

[[], [0.1], [0.2, 0.5]]

This may not be the best way to do this in terms of speed and memory usage.
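
A minimal sketch of that storage scheme, assuming a symmetric matrix with
zeros on the diagonal. The class and method names are made up for this
illustration and are not those of the attached module.

```python
class LowerTriangle:
    """Symmetric distance matrix stored as rows of increasing length."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    @classmethod
    def from_square(cls, square):
        """Keep only the entries below the diagonal of a full matrix."""
        return cls([row[:i] for i, row in enumerate(square)])

    def __len__(self):
        return len(self.rows)

    def get(self, i, j):
        """Look up a distance; (i, j) and (j, i) share one stored value."""
        if i == j:
            return 0.0
        if i < j:
            i, j = j, i
        return self.rows[i][j]

    def to_square(self):
        """Rebuild the full matrix as a list of equal-length lists."""
        size = len(self)
        return [[self.get(i, j) for j in range(size)] for i in range(size)]


square = [[0.0, 0.1, 0.2],
          [0.1, 0.0, 0.5],
          [0.2, 0.5, 0.0]]
triangle = LowerTriangle.from_square(square)
print(triangle.rows)
```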

There are some simple test cases included, but I have not pushed the code
very far and there may be problems.  Anyway - in case anyone is
interested either in the short term, or for ideas for how BioPython
could support these files - here it is.

I'm sure someone more familiar with arrays (Numeric and NumPy) would be
able to make the class act more like an array - but the basics are there.

As far as I could see, neither Numeric nor NumPy has a specific
symmetric matrix / symmetric array class, which would be ideal.

Members of the list are welcome to use the code, but please contact me
before re-distributing it to anyone else.

Peter

-------------- next part --------------
A non-text attachment was scrubbed...
Name: phylip_dst.py
Type: text/x-python
Size: 16528 bytes
Desc: not available
Url : http://lists.open-bio.org/pipermail/biopython/attachments/20060625/8d20b314/attachment-0001.py 

From chris.lasher at gmail.com  Tue Jun 27 17:34:37 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Tue, 27 Jun 2006 17:34:37 -0400
Subject: [BioPython] Distance Matrix Parsers
In-Reply-To: <449F0231.2050308@maubp.freeserve.co.uk>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<448A9A7A.6050501@maubp.freeserve.co.uk>
	<CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>
	<449F0231.2050308@maubp.freeserve.co.uk>
Message-ID: <128a885f0606271434v4d5a40e9x1ceb0037d750f6a1@mail.gmail.com>

Hi Peter,

Would you be up for licensing your code under the BioPython license?
If not, I shouldn't  look at it, as I've started coding my own module
for the project. From your description, your module sounds very good.
=-)

Chris

On 6/25/06, Peter <biopython at maubp.freeserve.co.uk> wrote:
> [Off topic, but has anyone else recently had valid messages bounced due
> to a "suspicious header"?]
>
> Hello List,
>
> I recently wanted to load a "PHYLIP distance matrix file" created by
> clustalw for my own research...
>
> As discussed earlier, clustalw bends the official PHYLIP specification
> by not truncating long names to 10 characters.  For my dataset I need
> the long names to avoid ambiguity.
>
> The attached code implements a fairly simple distance matrix class and
> associated code to read (parse) and write PHYLIP style distance matrices.
>
> There are options to control strict 10 character name truncation, and
> the separator character(s) when writing files.
>
> Internally, I store the distances as a list of lists (of different
> lengths) to mimic a lower triangular matrix.
>
> For example, this matrix:
>
> [[0.0, 0.1, 0.2],
>    [0.1, 0.0, 0.5],
>    [0.2, 0.5, 0.0]]
>
> Is stored as this:
>
> [[], [0.1], [0.2, 0.5]]
>
> This may not be the best way to do this in terms of speed and memory usage.
>
> There are some simple test cases included, but I have not pushed the code
> very far and there may be problems.  Anyway - in case anyone is
> interested either in the short term, or for ideas for how BioPython
> could support these files - here it is.
>
> I'm sure someone more familiar with arrays (Numeric and NumPy) would be
> able to make the class act more like an array - but the basics are there.
>
> As far as I could see, neither Numeric nor NumPy has a specific
> symmetric matrix / symmetric array class, which would be ideal.
>
> Members of the list are welcome to use the code, but please contact me
> before re-distributing it to anyone else.
>
> Peter

From biopython at maubp.freeserve.co.uk  Tue Jun 27 18:33:34 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 27 Jun 2006 23:33:34 +0100
Subject: [BioPython] Distance Matrix Parsers
In-Reply-To: <128a885f0606271434v4d5a40e9x1ceb0037d750f6a1@mail.gmail.com>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>	<448A9A7A.6050501@maubp.freeserve.co.uk>	<CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>	<449F0231.2050308@maubp.freeserve.co.uk>
	<128a885f0606271434v4d5a40e9x1ceb0037d750f6a1@mail.gmail.com>
Message-ID: <44A1B23E.5080007@maubp.freeserve.co.uk>

Chris Lasher wrote:
> Hi Peter,
> 
> Would you be up for licensing your code under the BioPython license?
> If not, I shouldn't  look at it, as I've started coding my own module
> for the project. From your description, your module sounds very good.
> =-)
> 
> Chris

I am quite happy to contribute the code to BioPython under the 
appropriate license, so please go ahead.

I've filed a bug on adding PHYLIP distance parsers to BioPython and 
attached a slightly revised version of the code (added "fuzzy" equality 
testing of matrices - mainly for testing):

http://bugzilla.open-bio.org/show_bug.cgi?id=2034

If anyone else really wants the code under some other license (GPL 
maybe) I could probably be persuaded.

Peter


From chris.lasher at gmail.com  Tue Jun 27 19:32:12 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Tue, 27 Jun 2006 19:32:12 -0400
Subject: [BioPython] Distance Matrix Parsers
In-Reply-To: <44A1B23E.5080007@maubp.freeserve.co.uk>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<448A9A7A.6050501@maubp.freeserve.co.uk>
	<CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>
	<449F0231.2050308@maubp.freeserve.co.uk>
	<128a885f0606271434v4d5a40e9x1ceb0037d750f6a1@mail.gmail.com>
	<44A1B23E.5080007@maubp.freeserve.co.uk>
Message-ID: <128a885f0606271632q2988f2d7y543dd441535f9808@mail.gmail.com>

[Oops! I didn't realize I was posting to the user list! Reverting it
back to BP-Dev]
This code looks very good, Peter!

As far as licensing, I'm new to the game, but my guess is the
BioPython license (http://www.biopython.org/DIST/LICENSE ) is highly
preferred for BioPython. You still retain copyright with the license,
but the code is more "free" than under any version of the GPL.

Chris

On 6/27/06, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Chris Lasher wrote:
> > Hi Peter,
> >
> > Would you be up for licensing your code under the BioPython license?
> > If not, I shouldn't  look at it, as I've started coding my own module
> > for the project. From your description, your module sounds very good.
> > =-)
> >
> > Chris
>
> I am quite happy to contribute the code to BioPython under the
> appropriate license, so please go ahead.
>
> I've filed a bug on adding PHYLIP distance parsers to BioPython and
> attached a slightly revised version of the code (added "fuzzy" equality
> testing of matrices - mainly for testing):
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2034
>
> If anyone else really wants the code under some other license (GPL
> maybe) I could probably be persuaded.
>
> Peter

From cjfields at uiuc.edu  Wed Jun 28 14:30:44 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 28 Jun 2006 13:30:44 -0500
Subject: [BioPython] Wiki spammed
Message-ID: <005201c69ae0$f78c59c0$15327e82@pyrimidine>

Guys,

Just wanted to let whoever's in charge know that you need to roll back
changes to this page:

http://biopython.org/wiki/Biopython

The spammers have struck again!

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign 


From mdehoon at c2b2.columbia.edu  Fri Jun  2 00:57:35 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Thu, 01 Jun 2006 17:57:35 -0700
Subject: [BioPython] NCBIWWW.qblast with refseq by organism
In-Reply-To: <20060526165711.94194.qmail@web51708.mail.yahoo.com>
References: <20060526165711.94194.qmail@web51708.mail.yahoo.com>
Message-ID: <447F8CFF.9050204@c2b2.columbia.edu>

Denil Wickrama wrote:
> Hi, I would like to BLAST a list of proteins against the refseq 
> database and retrieve the corresponding accession numbers of the
> exact hits. I get errors when I change from the nr database to the
> refseq database. Also I am trying to restrict the results by organism
> name, but that was not successful.
 > result_handle = NCBIWWW.qblast("blastp", "nr", seq, 
entrez_query='"rattus norvegicus"
> [Organism]')
 > result_handle = NCBIWWW.qblast("blastp", "refseq", seq, 
entrez_query='"rattus norvegicus" [Organism]')
 > Is it possible to do refseq searches with NCBIWWW.qblast?

It turns out that the NCBI server actually wants "refseq_protein" 
instead of "refseq". (You can check this by saving NCBI's 
Protein-protein blast page in HTML, and looking at the source). So if 
you replace "refseq" by "refseq_protein", your code should run.

Restricting the results by organism worked fine for me with the 
entrez_query you have.
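
Put together, the corrected call might look like the sketch below. The
helper names are made up for this example and the sequence is a
placeholder, not real data; qblast itself needs Biopython and a network
connection, so the import is kept inside the function that uses it.

```python
def refseq_blastp_arguments(sequence, organism):
    """Keyword arguments for a blastp search of RefSeq limited to one organism."""
    return {
        "program": "blastp",
        "database": "refseq_protein",  # the server rejects plain "refseq"
        "sequence": sequence,
        "entrez_query": '"%s" [Organism]' % organism,
    }


def run_refseq_blastp(sequence, organism):
    """Submit the search; requires Biopython and access to the NCBI server."""
    from Bio.Blast import NCBIWWW  # imported here so the helper above works offline
    return NCBIWWW.qblast(**refseq_blastp_arguments(sequence, organism))


# Placeholder peptide, only to show the arguments that would be sent.
arguments = refseq_blastp_arguments("MKTAYIAKQR", "rattus norvegicus")
print(arguments)
```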

--Michiel.


-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032


From mdehoon at c2b2.columbia.edu  Fri Jun  2 01:12:57 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Thu, 01 Jun 2006 18:12:57 -0700
Subject: [BioPython] NCBIWWW.qblast
In-Reply-To: <20060531114048.83077.qmail@web36813.mail.mud.yahoo.com>
References: <20060531114048.83077.qmail@web36813.mail.mud.yahoo.com>
Message-ID: <447F9099.1040800@c2b2.columbia.edu>

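Try this instead. Your script calls qblast only once, before the loop, and
a result handle can be read only once, so the second read() returns
nothing and my_blast2.out ends up empty. Here qblast is called for each
record inside the loop, and each result goes to its own file:

from Bio import Fasta
from Bio.Blast import NCBIWWW

file_for_blast = open('fasta', 'r')
f_iterator = Fasta.Iterator(file_for_blast)

seqnum = 0

for f_record in f_iterator:
    # One search per FASTA record
    result_handle = NCBIWWW.qblast('blastp', 'nr', f_record)
    # Writes my_blast0.out, my_blast1.out, ...
    save_file = open('my_blast'+str(seqnum)+'.out', 'w')
    blast_results = result_handle.read()
    save_file.write(blast_results)
    save_file.close()
    seqnum += 1

file_for_blast.close()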


--Michiel.

alper soyler wrote:
> Dear All,
> 
> I have a fasta file (called fasta) containing 20 proteins. I want to blast them in an order. How can I write the results of these 20 proteins in different output files. I tried to write the below script but the 'my_blast2.out' file turned empty. Can you help me please?
> 
> regards,
> Alper
> 
> #!usr/local/bin/python
> 
> from Bio import Fasta
> file_for_blast = open('fasta', 'r')
> f_iterator = Fasta.Iterator(file_for_blast)
> f_record = f_iterator.next()
> 
> from Bio.Blast import NCBIWWW
> result_handle = NCBIWWW.qblast('blastp', 'nr', f_record)
> 
> seqnum = 0
> 
> for f_record  in f_iterator:
>     save_file = open('my_blast.out', 'w')
>     blast_results = result_handle.read()
>     save_file.write(blast_results)
>     save_file.close()
>     seqnum += 1
>     save_file2 = open('my_blast2.out', 'w')
>     blast_results = result_handle.read()
>     save_file2.write(blast_results)
>     save_file2.close()


-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032


From omid9dr18 at hotmail.com  Thu Jun  1 22:39:34 2006
From: omid9dr18 at hotmail.com (Omid Khalouei)
Date: Thu, 1 Jun 2006 22:39:34 +0000
Subject: [BioPython] Synthesized or Clinical PDB sequence
Message-ID: <BAY103-W77543FB5420D46D12C82EE6900@phx.gbl>

Hello,
 
Is there any way to find out if a sequence corresponding to a PDB structure was obtained clinically or was synthesized without having to read the primary citations?
 
Thanks for your help.
Omid K.


From boris.steipe at utoronto.ca  Fri Jun  2 02:25:48 2006
From: boris.steipe at utoronto.ca (Boris Steipe)
Date: Thu, 1 Jun 2006 22:25:48 -0400
Subject: [BioPython] Synthesized or Clinical PDB sequence
In-Reply-To: <BAY103-W77543FB5420D46D12C82EE6900@phx.gbl>
References: <BAY103-W77543FB5420D46D12C82EE6900@phx.gbl>
Message-ID: <CCAB0DA0-7488-41BA-BF83-E22B11AD5E59@utoronto.ca>

Since the PDB does not use a constrained vocabulary, this is a bit  
unreliable. But the information is supposed to be entered in the  
SOURCE record.
cf.: http://www.rcsb.org/pdb/file_formats/pdb/pdbguide2.2/part_20.html
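
A rough sketch of pulling that record out of a PDB file with plain Python,
assuming the usual fixed-column layout (record name in columns 1-6, text
from column 11) and "TOKEN: value;" pairs. The function names and the
sample lines are made up, and since the field is free text the result
should be treated as a hint only.

```python
def source_text(pdb_lines):
    """Join the text of all SOURCE records, skipping name and continuation columns."""
    parts = []
    for line in pdb_lines:
        if line.startswith("SOURCE"):
            parts.append(line[10:].strip())
    return " ".join(parts)


def source_tokens(text):
    """Split 'TOKEN: value;' pairs into a dict (later duplicates overwrite earlier ones)."""
    tokens = {}
    for item in text.split(";"):
        if ":" in item:
            key, value = item.split(":", 1)
            tokens[key.strip()] = value.strip()
    return tokens


# Made-up header lines in the documented layout.
sample = [
    "HEADER    EXAMPLE",
    "SOURCE    MOL_ID: 1;",
    "SOURCE   2 SYNTHETIC: YES;",
    "SOURCE   3 ORGANISM_SCIENTIFIC: HOMO SAPIENS",
]
tokens = source_tokens(source_text(sample))
is_synthetic = tokens.get("SYNTHETIC", "").upper().startswith("YES")
print(tokens, is_synthetic)
```

Entries with several molecules repeat the same tokens under different
MOL_ID values, so a real script would need to group them by MOL_ID.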

HTH,
Boris


On 1 Jun 2006, at 18:39, Omid Khalouei wrote:

> Hello,
>
> Is there any way to find out if a sequence corresponding to a PDB  
> structure was obtained clinically or was synthesized without having  
> to read the primary citations?
>
> Thanks for your help.
> Omid K.


From lee.byung-chul at kaist.ac.kr  Fri Jun  2 09:45:09 2006
From: lee.byung-chul at kaist.ac.kr (Lee, Byung-chul)
Date: Fri, 02 Jun 2006 18:45:09 +0900
Subject: [BioPython] Drawing Ramanchandran plot
Message-ID: <448008A5.8090602@kaist.ac.kr>

Hi all,

While calculating the torsion angles of some atoms in PDB files, I want
to draw a Ramachandran plot of them.
However, I cannot find any modules or methods for doing that in Bio.PDB,
so if anyone knows where one is or how to make one, please let me know.

Thanks,
Byung-chul.

-- 
--------------------------------------------------------
The important thing is not to stop questioning.
                               : Albert Einstein

Byung chul Lee 
  a member of Protein BioInformatics Lab. (PBIL)
                at Detp. BioSystems KAIST, Korea
                                  Ph.D candidate
                                  82-42-869-4357
--------------------------------------------------------


From biopython at maubp.freeserve.co.uk  Fri Jun  2 12:15:25 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 02 Jun 2006 13:15:25 +0100
Subject: [BioPython] Drawing Ramanchandran plot
In-Reply-To: <448008A5.8090602@kaist.ac.kr>
References: <448008A5.8090602@kaist.ac.kr>
Message-ID: <44802BDD.6080703@maubp.freeserve.co.uk>

Lee, Byung-chul wrote:
> Hi all,
> 
> While calculating the torsion angles of some atoms in PDB files, I want
> to draw a Ramachandran plot of them.
> However, I cannot find any modules or methods for doing that in Bio.PDB,
> so if anyone knows where one is or how to make one, please let me know.
> 
> Thanks,
> Byung-chul.
> 

A work in progress:

http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/ramachandran/

Short summary about calculating the angles:
* MMTK is great, providing it can load the PDB file.
   Very very easy to get the angles
* BioPython's Bio.PDB will load most/all PDB files, but
   you have to work out the backbone and angles yourself.
* Python Macromolecular Library (mmLib) might also be worth looking at.
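
If you do work out the angles yourself, the core is a four-point dihedral
calculation. The sketch below is plain Python with made-up coordinates and
helper names; it does not use Bio.PDB or MMTK. For residue i, phi uses the
atoms C(i-1), N(i), CA(i), C(i) and psi uses N(i), CA(i), C(i), N(i+1).

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees around the p1-p2 bond, in the range -180..180."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    length = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / length for x in b1)  # unit vector along the central bond
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


# Made-up points: outer atoms on the same side (cis), opposite sides (trans),
# and a quarter turn apart.
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
quarter = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(cis, trans, quarter)
```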

Once you have the angles, you will want to draw the plots - the link 
above suggests a package like Excel, R, or Peter Robinson's Java Program:

http://www.charite.de/ch/medgen/compgen/ramachandran/

Peter


From sbassi at gmail.com  Wed Jun  7 19:25:44 2006
From: sbassi at gmail.com (Sebastian Bassi)
Date: Wed, 7 Jun 2006 16:25:44 -0300
Subject: [BioPython] From REF to sequence?
Message-ID: <b43bf2080606071225y4db4c23an5572468818908179@mail.gmail.com>

Hello,

I have a list like this:

>ref|NP_918285.1|
>dbj|BAD88119.1|
>dbj|BAD88118.1|
>ref|XP_475495.1|
>emb|CAD37200.1|
>gb|AAM64572.1|

(the list is much bigger, but with this sample you could get the idea).
I would like to create a URL from each entry to retrieve the full
NCBI information about these sequences. Is there a Biopython method for
doing this? I once read about an NCBI syntax for building URLs, but I
can't find it.
Best regards,
SB.

-- 
Bioinformatics news: http://www.bioinformatica.info
Lriser: http://www.linspire.com/lraiser_success.php?serial=318


From chris.lasher at gmail.com  Thu Jun  8 21:32:26 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Thu, 8 Jun 2006 17:32:26 -0400
Subject: [BioPython] Distance Matrix Parsers
Message-ID: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>

Hi all,
  Are there any modules in BioPython to parse distance matrices? My
poking around the BioPython modules and Google searching does not turn
up any signs indicating there are distance matrix parsers, currently.
Two particularly useful parsers would be a parser for the output of
DNADIST/PROTDIST/RESTDIST from PHYLIP
(http://evolution.genetics.washington.edu/phylip.html), and a parser
for the MEGA (http://www.megasoftware.net/mega.html) distance matrix
format. If not, would there be any interest in creating parsers for
these matrices, other than my own? I think parsers for distance
matrices could be very useful to the community.

Chris


From mcolosimo at mitre.org  Fri Jun  9 12:16:02 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Fri, 9 Jun 2006 08:16:02 -0400
Subject: [BioPython] Distance Matrix Parsers
In-Reply-To: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
Message-ID: <9BE2CFC6-BACE-4D98-86A0-99E9CFBA228A@mitre.org>

Hi Chris,

I don't think there is a parser for those. I have in the past thought  
about writing them up. I was looking over the structure of BioPython  
to see where it would best fit [I'll save my rant on this for another  
time, maybe later today]. In the mean time, the folks at BioPerl have  
Bio-Phylo CPAN module <http://search.cpan.org/~rvosa/Bio-Phylo/>,  
which looks nice, but it does NOT have what you are looking for.  
However, I am planning on following that.

Marc

On Jun 8, 2006, at 5:32 PM, Chris Lasher wrote:

> Hi all,
>   Are there any modules in BioPython to parse distance matrices? My
> poking around the BioPython modules and Google searching does not turn
> up any signs indicating there are distance matrix parsers, currently.
> Two particularly useful parsers would be a parser for the output of
> DNADIST/PROTDIST/RESTDIST from PHYLIP
> (http://evolution.genetics.washington.edu/phylip.html), and a parser
> for the MEGA (http://www.megasoftware.net/mega.html) distance matrix
> format. If not, would there be any interest in creating parsers for
> these matrices, other than my own? I think parsers for distance
> matrices could be very useful to the community.
>
> Chris


From chris.lasher at gmail.com  Fri Jun  9 15:59:56 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Fri, 9 Jun 2006 11:59:56 -0400
Subject: [BioPython] Distance Matrix Parsers
In-Reply-To: <9BE2CFC6-BACE-4D98-86A0-99E9CFBA228A@mitre.org>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<9BE2CFC6-BACE-4D98-86A0-99E9CFBA228A@mitre.org>
Message-ID: <128a885f0606090859x608e733ela89fdb879e531dc8@mail.gmail.com>

Hi Marc,

Thanks for the reply. I had not seen the Bio::Phylo package before.
Thanks for pointing that out. That seems to be a really useful
library, though it's not exactly what I was thinking about when I
originally posted. I was thinking more along the lines of the
Bio::Matrix modules
(http://bio.perl.org/wiki/Special:Search?search=matrix&go=Go).

I don't think writing parsers for these formats will be that
difficult. I am unsure, however, about what type of data structure the
matrix should be. The simplest solution is a nested list. Perhaps this
is the proper solution, as the user can then convert this over to a
NumPy multi-dimensional array, say, or some matrix object. I dunno.
Thoughts, comments, suggestions?

Chris

On 6/9/06, Marc Colosimo <mcolosimo at mitre.org> wrote:
> Hi Chris,
>
> I don't think there is a parser for those. I have in the past thought
> about writing them up. I was looking over the structure of BioPython
> to see where it would best fit [I'll save my rant on this for another
> time, maybe later today]. In the mean time, the folks at BioPerl have
> Bio-Phylo CPAN module <http://search.cpan.org/~rvosa/Bio-Phylo/>,
> which looks nice, but it does NOT have what you are looking for.
> However, I am planning on following that.
>
> Marc
>
> On Jun 8, 2006, at 5:32 PM, Chris Lasher wrote:
>
> > Hi all,
> >   Are there any modules in BioPython to parse distance matrices? My
> > poking around the BioPython modules and Google searching does not turn
> > up any signs indicating there are distance matrix parsers, currently.
> > Two particularly useful parsers would be a parser for the output of
> > DNADIST/PROTDIST/RESTDIST from PHYLIP
> > (http://evolution.genetics.washington.edu/phylip.html), and a parser
> > for the MEGA (http://www.megasoftware.net/mega.html) distance matrix
> > format. If not, would there be any interest in creating parsers for
> > these matrices, other than my own? I think parsers for distance
> > matrices could be very useful to the community.
> >
> > Chris


From mcolosimo at mitre.org  Fri Jun  9 18:41:29 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Fri, 9 Jun 2006 14:41:29 -0400
Subject: [BioPython] Distance Matrix Parsers
In-Reply-To: <128a885f0606090859x608e733ela89fdb879e531dc8@mail.gmail.com>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<9BE2CFC6-BACE-4D98-86A0-99E9CFBA228A@mitre.org>
	<128a885f0606090859x608e733ela89fdb879e531dc8@mail.gmail.com>
Message-ID: <8AC5BAA2-BA47-4772-88C7-DF4B2061A8E2@mitre.org>

Chris,

I likewise didn't know about the Bio::Matrix::PhylipDist module.  
Personally, I would opt for a Matrix object (since Python is an OO  
language) and store it internally as a nested list. That way you  
have the best of both worlds. The next question is the object  
hierarchy. Here I would opt for a top level Matrix class (or module)  
and then subclass that under Phylo. So, something like this:

Bio.Matrix
Bio.Phylo.Matrix

and maybe things like the following (which isn't used/followed much  
here in BioPython)

Bio.Phylo.IO
Bio.Phylo.Parsers.PhylipDist
Bio.Phylo.Parsers.Newick
Bio.Phylo.Parsers.Nexus

And/or have
Bio.Phylo.Matrix.IO that uses the PhylipDist parser.
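
To show what I mean by a top-level class with a Phylo subclass, here is a
bare-bones sketch. The class names only stand in for the proposed
Bio.Matrix and Bio.Phylo.Matrix; none of this exists in BioPython, and the
taxon names and distances are made up.

```python
class Matrix:
    """Generic matrix backed by a nested list (stand-in for Bio.Matrix)."""

    def __init__(self, values):
        self.values = [list(row) for row in values]

    def __len__(self):
        return len(self.values)

    def __getitem__(self, index):
        row, column = index
        return self.values[row][column]

    def as_nested_list(self):
        """Hand back a copy, e.g. for conversion to a NumPy array."""
        return [list(row) for row in self.values]


class PhyloMatrix(Matrix):
    """Distance matrix with taxon names (stand-in for Bio.Phylo.Matrix)."""

    def __init__(self, names, values):
        super().__init__(values)
        if len(names) != len(self.values):
            raise ValueError("need exactly one name per row")
        self.names = list(names)

    def distance(self, first, second):
        """Look up a distance by taxon name instead of by index."""
        return self[self.names.index(first), self.names.index(second)]


matrix = PhyloMatrix(["A", "B", "C"],
                     [[0.0, 0.1, 0.2],
                      [0.1, 0.0, 0.5],
                      [0.2, 0.5, 0.0]])
print(matrix.distance("B", "C"))
```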

The next big question is what should Bio.Phylo.IO return? For  
inspiration, we might want to look at Mesquite
<http://mesquiteproject.org/mesquite/mesquite.html>.

Marc

On Jun 9, 2006, at 11:59 AM, Chris Lasher wrote:

> Hi Marc,
>
> Thanks for the reply. I had not seen the Bio::Phylo package before.
> Thanks for pointing that out. That seems to be a really useful
> library, though it's not exactly what I was thinking about when I
> originally posted. I was thinking more along the lines of the
> Bio::Matrix modules
> (http://bio.perl.org/wiki/Special:Search?search=matrix&go=Go).
>
> I don't think writing parsers for these formats will be that
> difficult. I am unsure, however, about what type of data structure the
> matrix should be. The simplest solution is a nested list. Perhaps this
> is the proper solution, as the user can then convert this over to a
> NumPy multi-dimensional array, say, or some matrix object. I dunno.
> Thoughts, comments, suggestions?
>
> Chris
>
> On 6/9/06, Marc Colosimo <mcolosimo at mitre.org> wrote:
>> Hi Chris,
>>
>> I don't think there is a parser for those. I have in the past thought
>> about writing them up. I was looking over the structure of BioPython
>> to see where it would best fit [I'll save my rant on this for another
>> time, maybe later today]. In the meantime, the folks at BioPerl have
>> the Bio-Phylo CPAN module <http://search.cpan.org/~rvosa/Bio-Phylo/>,
>> which looks nice, but it does NOT have what you are looking for.
>> However, I am planning on following that.
>>
>> Marc
>>
>> On Jun 8, 2006, at 5:32 PM, Chris Lasher wrote:
>>
>>> Hi all,
>>>   Are there any modules in BioPython to parse distance matrices? My
>>> poking around the BioPython modules and Google searching does not  
>>> turn
>>> up any signs indicating there are distance matrix parsers,  
>>> currently.
>>> Two particularly useful parsers would be a parser for the output of
>>> DNADIST/PROTDIST/RESTDIST from PHYLIP
>>> (http://evolution.genetics.washington.edu/phylip.html), and a parser
>>> for the MEGA (http://www.megasoftware.net/mega.html) distance matrix
>>> format. If not, would there be any interest in creating parsers for
>>> these matrices, other than my own? I think parsers for distance
>>> matrices could be very useful to the community.
>>>
>>> Chris
>>> _______________________________________________
>>> BioPython mailing list  -  BioPython at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biopython
>>
>>
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython


From chris.lasher at gmail.com  Fri Jun  9 21:13:32 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Fri, 9 Jun 2006 17:13:32 -0400
Subject: [BioPython] Distance Matrix Parsers
In-Reply-To: <8AC5BAA2-BA47-4772-88C7-DF4B2061A8E2@mitre.org>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<9BE2CFC6-BACE-4D98-86A0-99E9CFBA228A@mitre.org>
	<128a885f0606090859x608e733ela89fdb879e531dc8@mail.gmail.com>
	<8AC5BAA2-BA47-4772-88C7-DF4B2061A8E2@mitre.org>
Message-ID: <128a885f0606091413o23088caesf4934a81f0cc0489@mail.gmail.com>

> I likewise didn't know about the Bio::Matrix::PhylipDist module.
> Personally, I would opt for a Matrix object (since Python is an
> OO language) and store it internally as a nested list. That way you
> have the best of both worlds. The next question is the object
> hierarchy. Here I would opt for a top level Matrix class (or module)
> and then subclass that under Phylo. So, something like this:
>
> Bio.Matrix
> Bio.Phylo.Matrix

So is this more appropriate than Bio.Matrix.Phylo? A phylogenetic
matrix is a type of matrix, so that hierarchy is immediately
appealing, however, a phylogenetic matrix is not of much use in and of
itself, so I can see the argument that it should be placed in a
phylogeny package (which we have yet to write but as mentioned
earlier, could be very useful).

> and maybe things like the following (which isn't used/followed much
> here in BioPython)
>
> Bio.Phylo.IO
> Bio.Phylo.Parsers.PhylipDist
> Bio.Phylo.Parsers.Newick
> Bio.Phylo.Parsers.Nexus
>
> And/or have
> Bio.Phylo.Matrix.IO that uses the PhylipDist parser.

This is very very good, in my opinion. Thanks for doing the
heavy-lifting of the brainwork on this! =-)

> The next big question is what should Bio.Phylo.IO return? For
> inspiration, we might want to look at Mesquite
> <http://mesquiteproject.org/mesquite/mesquite.html>.

I must give a better look at this site before commenting, but once
again, thanks for bringing this to my awareness! What a helpful past
couple of emails. I will be out for the weekend but will think more
about this.

As a sidenote, should this discussion be moved to biopython-dev or is
it fine here?

Thanks again Marc,
Chris


From biopython at maubp.freeserve.co.uk  Sat Jun 10 10:10:02 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 10 Jun 2006 11:10:02 +0100
Subject: [BioPython] Distance Matrix Parsers
In-Reply-To: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
Message-ID: <448A9A7A.6050501@maubp.freeserve.co.uk>

Chris Lasher wrote:
> Hi all, Are there any modules in BioPython to parse distance
> matrices? My poking around the BioPython modules and Google searching
> does not turn up any signs indicating there are distance matrix
> parsers, currently. Two particularly useful parsers would be a parser
> for the output of DNADIST/PROTDIST/RESTDIST from PHYLIP 
> (http://evolution.genetics.washington.edu/phylip.html),

I've done a very small amount of work with neighbour joining trees, 
using PHYLIP format distance matrices.  The closest I could find to a 
file format definition was this page:

http://evolution.genetics.washington.edu/phylip/doc/distance.html

Points to be aware of:

In my experience, most software tools usually write the distances as a 
full symmetric matrix.  However, the "standard" explicitly discusses 
lower triangular form (missing out the diagonal distance zero entries) 
which has the significant advantage of using about half the disk space. 
  This is significant once you get into thousands of taxa.

So, make sure any parser can cope with both full symmetric, and lower 
triangular forms - ideally without the user having to care.
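
As a rough sketch of a parser that accepts both layouts without being told
which one it is reading: the function name is hypothetical, and it assumes
one taxon per line (no wrapped rows) and names without embedded spaces,
which is looser than the strict PHYLIP definition.

```python
def parse_distance_matrix(lines):
    """Parse a PHYLIP-style distance matrix into (names, nested list).

    Assumptions for this sketch: one taxon per line (no wrapped rows) and
    names without embedded spaces, separated from the numbers by whitespace.
    """
    rows = [line.split() for line in lines if line.strip()]
    count = int(rows[0][0])
    body = rows[1:count + 1]
    if len(body) != count:
        raise ValueError("expected %d taxa, found %d" % (count, len(body)))
    names = [fields[0] for fields in body]
    values = [[float(x) for x in fields[1:]] for fields in body]

    if all(len(row) == count for row in values):
        return names, values  # already a full square matrix
    if all(len(row) == i for i, row in enumerate(values)):
        # Lower triangular without the diagonal: mirror it into a square.
        full = [[0.0] * count for _ in range(count)]
        for i, row in enumerate(values):
            for j, dist in enumerate(row):
                full[i][j] = dist
                full[j][i] = dist
        return names, full
    raise ValueError("neither square nor lower triangular")


square = ["3", "A 0.0 0.1 0.2", "B 0.1 0.0 0.3", "C 0.2 0.3 0.0"]
lower = ["3", "A", "B 0.1", "C 0.2 0.3"]
print(parse_distance_matrix(square) == parse_distance_matrix(lower))
```

The layout is inferred from the number of values on each row, so the caller
gets the same square nested list either way.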

This also raises the point about how to store the matrix in memory. 
Does Numeric/NumPy have an efficient way of storing symmetric matrices? 
  This is less flexible than the suggested list of lists, but for large 
datasets would need much less memory.
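
Whatever Numeric/NumPy offer, the memory saving can be sketched in pure
Python by keeping only the entries below the diagonal in a flat array of
doubles. The class name PackedSymmetric is hypothetical, and the sketch
assumes a zero diagonal and non-negative integer indices.

```python
from array import array


class PackedSymmetric(object):
    """Symmetric matrix with a zero diagonal, stored as one flat array.

    Only the n*(n-1)/2 entries below the diagonal are kept.
    Hypothetical class for illustration.
    """

    def __init__(self, size):
        self.size = size
        self._data = array("d", [0.0]) * (size * (size - 1) // 2)

    def _index(self, i, j):
        if i < j:
            i, j = j, i
        # Row i of the lower triangle starts after i*(i-1)/2 earlier entries.
        return i * (i - 1) // 2 + j

    def __getitem__(self, key):
        i, j = key
        if i == j:
            return 0.0
        return self._data[self._index(i, j)]

    def __setitem__(self, key, value):
        i, j = key
        if i == j:
            raise ValueError("diagonal is fixed at zero")
        self._data[self._index(i, j)] = value


m = PackedSymmetric(4)
m[2, 1] = 0.5
print(m[1, 2])
```

For n taxa this stores n*(n-1)/2 doubles instead of n*n Python float
objects, at the cost of the index arithmetic on every access.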

Second point - the "official" PHYLIP distance matrix file format 
truncates the taxa names at 10 characters.  Some tools (e.g. clustalw) 
ignore this limitation and will use as many as needed for the full name. 
  I personally find this much nicer - after all most gene identifiers 
(e.g. GI numbers) are eight characters to start with, and if you are 
dealing with multiple features in each gene 10 characters is tough going.

So, I would make sure you test the parser on this format variant (with 
names longer than 10 characters).  I can supply some examples if you like.

For writing matrices to file, the issue of following the strict 10 
character taxa limit might best be handled as an option (default to max 
10, with a warning if any names are truncated, and an error if 
truncation renders names non-unique?).

Likewise an option to save matrices as either fully symmetric or lower 
triangular.  I would lean towards using fully symmetric as the default 
as it seems to be more common.
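
The two writer options suggested above could look roughly like this. The
function name, the option names and the "%.4f" number format are all
assumptions made for the sketch, not an existing API.

```python
import warnings


def format_distance_matrix(names, rows, lower_triangular=False,
                           strict_names=True):
    """Return PHYLIP-style text for a square nested-list matrix.

    Hypothetical helper. With strict_names, names are cut to 10 characters:
    a warning is issued on truncation and an error if names collide.
    """
    if strict_names:
        short = [name[:10] for name in names]
        if len(set(short)) != len(set(names)):
            raise ValueError("truncation to 10 characters makes names "
                             "non-unique")
        if short != list(names):
            warnings.warn("taxon names truncated to 10 characters")
        names = short
    # Pad names to a common width of at least 10 columns.
    width = max([10] + [len(name) for name in names])
    lines = ["%5d" % len(names)]
    for i, (name, row) in enumerate(zip(names, rows)):
        # Lower triangular form drops the diagonal and everything above it.
        values = row[:i] if lower_triangular else row
        fields = [name.ljust(width)] + ["%.4f" % v for v in values]
        lines.append(" ".join(fields).rstrip())
    return "\n".join(lines) + "\n"


print(format_distance_matrix(["Alpha", "Beta"], [[0.0, 0.3], [0.3, 0.0]]))
```

Fully symmetric output is the default here, matching the preference stated
above; passing lower_triangular=True writes the compact form.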

> and a parser for the MEGA (http://www.megasoftware.net/mega.html)
> distance matrix format. If not, would there be any interest in
> creating parsers for these matrices, other than my own? I think
> parsers for distance matrices could be very useful to the community.

I suspect that for serious tree building pure python will not be 
competitive with existing C/C++ code on speed - but nonetheless could 
be useful.

Peter


From idoerg at burnham.org  Sat Jun 10 15:08:43 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Sat, 10 Jun 2006 08:08:43 -0700
Subject: [BioPython] Distance Matrix Parsers
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<448A9A7A.6050501@maubp.freeserve.co.uk>
Message-ID: <1F97379A556D0946AAEFE3F63FD6F5744D468D@MAIL.burnham.org>

Hi,

Bio.SubsMat has a parser for substitution matrices, lower triangular and square. Feel free to recycle code.

Best,

Iddo


--
Iddo Friedberg, PhD
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
T: +1 858 646 3100 x3516
http://iddo-friedberg.org
http://BioFunctionPrediction.org


-----Original Message-----
From: biopython-bounces at lists.open-bio.org on behalf of Peter
Sent: Sat 6/10/2006 3:10 AM
To: BioPython Mailing List
Subject: Re: [BioPython] Distance Matrix Parsers
 
Chris Lasher wrote:
> Hi all, Are there any modules in BioPython to parse distance
> matrices? My poking around the BioPython modules and Google searching
> does not turn up any signs indicating there are distance matrix
> parsers, currently. Two particularly useful parsers would be a parser
> for the output of DNADIST/PROTDIST/RESTDIST from PHYLIP 
> (http://evolution.genetics.washington.edu/phylip.html),

I've done a very small amount of work with neighbour joining trees, 
using PHYLIP format distance matrices.  The closest I could find to a 
file format definition was this page:

http://evolution.genetics.washington.edu/phylip/doc/distance.html

Points to be aware of:

In my experience, most software tools usually write the distances as a 
full symmetric matrix.  However, the "standard" explicitly discusses 
lower triangular form (missing out the diagonal distance zero entries) 
which has the significant advantage of using about half the disk space. 
  This is significant once you get into thousands of taxa.

So, make sure any parser can cope with both full symmetric, and lower 
triangular forms - ideally without the user having to care.

This also raises the point about how to store the matrix in memory. 
Does Numeric/NumPy have an efficient way of storing symmetric matrices? 
  This is less flexible than the suggested list of lists, but for large 
datasets would need much less memory.

Second point - the "official" PHYLIP distance matrix file format 
truncates the taxa names at 10 characters.  Some tools (e.g. clustalw) 
ignore this limitation and will use as many as needed for the full name. 
  I personally find this much nicer - after all most gene identifiers 
(e.g. GI numbers) are eight characters to start with, and if you are 
dealing with multiple features in each gene 10 characters is tough going.

So, I would make sure you test the parser on this format variant (with 
names longer than 10 characters).  I can supply some examples if you like.

For writing matrices to file, the issue of following the strict 10 
character taxa limit might best be handled as an option (default to max 
10, with a warning if any names are truncated, and an error if 
truncation renders names non-unique?).

Likewise an option to save matrices as either fully symmetric or lower 
triangular.  I would lean towards using fully symmetric as the default 
as it seems to be more common.

> and a parser for the MEGA (http://www.megasoftware.net/mega.html)
> distance matrix format. If not, would there be any interest in
> creating parsers for these matrices, other than my own? I think
> parsers for distance matrices could be very useful to the community.

I suspect that for serious tree building pure python will not be 
competitive with existing C/C++ code on speed - but nonetheless could 
be useful.

Peter

_______________________________________________
BioPython mailing list  -  BioPython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython


From mcolosimo at mitre.org  Mon Jun 12 12:38:18 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Mon, 12 Jun 2006 08:38:18 -0400
Subject: [BioPython] Distance Matrix Parsers
In-Reply-To: <128a885f0606091413o23088caesf4934a81f0cc0489@mail.gmail.com>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<9BE2CFC6-BACE-4D98-86A0-99E9CFBA228A@mitre.org>
	<128a885f0606090859x608e733ela89fdb879e531dc8@mail.gmail.com>
	<8AC5BAA2-BA47-4772-88C7-DF4B2061A8E2@mitre.org>
	<128a885f0606091413o23088caesf4934a81f0cc0489@mail.gmail.com>
Message-ID: <65DF4A7E-B365-4E61-93D4-156A36F6ED54@mitre.org>

[cross-posting to biopython-dev]

Chris,

Oops, didn't notice this was on the general biopython mailing list. I  
think many of the developers also subscribe to this list, but just in  
case I'm cross posting this.

Iddo pointed out Bio.SubsMat; I didn't know what that module did.
That is one problem with names like that, and the API docs are
helpful only when you look at them
<http://biopython.org/DIST/docs/api/public/trees.html> (kudos to
those who add documentation).

Given Bio.SubsMat and the BioPerl Module, I would strongly consider  
combining the Bio.SubsMat and the PhylipDist into a new Bio.Matrix  
module. From a Phylo module, a function/class can always call the  
Bio.Matrix classes.

Marc

On Jun 9, 2006, at 5:13 PM, Chris Lasher wrote:

>> I likewise didn't know about the Bio::Matrix::PhylipDist module.
>> Personally, I would opt for a Matrix object (since Python is an
>> OO language) and store it internally as a nested list. That way you
>> have the best of both worlds. The next question is the object
>> hierarchy. Here I would opt for a top level Matrix class (or module)
>> and then subclass that under Phylo. So, something like this:
>>
>> Bio.Matrix
>> Bio.Phylo.Matrix
>
> So is this more appropriate than Bio.Matrix.Phylo? A phylogenetic
> matrix is a type of matrix, so that hierarchy is immediately
> appealing, however, a phylogenetic matrix is not of much use in and of
> itself, so I can see the argument that it should be placed in a
> phylogeny package (which we have yet to write but as mentioned
> earlier, could be very useful).
>
>> and maybe things like the following (which isn't used/followed much
>> here in BioPython)
>>
>> Bio.Phylo.IO
>> Bio.Phylo.Parsers.PhylipDist
>> Bio.Phylo.Parsers.Newick
>> Bio.Phylo.Parsers.Nexus
>>
>> And/or have
>> Bio.Phylo.Matrix.IO that uses the PhylipDist parser.
>
> This is very very good, in my opinion. Thanks for doing the
> heavy-lifting of the brainwork on this! =-)
>
>> The next big question is what should Bio.Phylo.IO return? For
>> inspiration, we might want to look at Mesquite
>> <http://mesquiteproject.org/mesquite/mesquite.html>.
>
> I must give a better look at this site before commenting, but once
> again, thanks for bringing this to my awareness! What a helpful past
> couple of emails. I will be out for the weekend but will think more
> about this.
>
> As a sidenote, should this discussion be moved to biopython-dev or is
> it fine here?
>
> Thanks again Marc,
> Chris
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython


From mcolosimo at mitre.org  Mon Jun 12 13:18:41 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Mon, 12 Jun 2006 09:18:41 -0400
Subject: [BioPython] Distance Matrix Parsers
In-Reply-To: <448A9A7A.6050501@maubp.freeserve.co.uk>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<448A9A7A.6050501@maubp.freeserve.co.uk>
Message-ID: <CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>

[cross post]
On Jun 10, 2006, at 6:10 AM, Peter wrote:

> Chris Lasher wrote:
>> Hi all, Are there any modules in BioPython to parse distance
>> matrices? My poking around the BioPython modules and Google searching
>> does not turn up any signs indicating there are distance matrix
>> parsers, currently. Two particularly useful parsers would be a parser
>> for the output of DNADIST/PROTDIST/RESTDIST from PHYLIP
>> (http://evolution.genetics.washington.edu/phylip.html),
>
> I've done a very small amount of work with neighbour joining trees,
> using PHYLIP format distance matrices.  The closest I could find to a
> file format definition was this page:
>
> http://evolution.genetics.washington.edu/phylip/doc/distance.html
>
> Points to be aware of:
>
> In my experience, most software tools usually write the distances as a
> full symmetric matrix.  However, the "standard" explicitly discusses
> lower triangular form (missing out the diagonal distance zero entries)
> which has the significant advantage of using about half the disk  
> space.
>   This is significant once you get into thousands of taxa.

This is still small potatoes compared to the input needed to generate
the distance matrices (especially with DNA/RNA sequences of any
decently sized gene).

>
> So, make sure any parser can cope with both full symmetric, and lower
> triangular forms - ideally without the user having to care.

Phylip does ask you which form to read or write; this is a pain at
times. So, having a parser figure this out would be nice. However,
the user should know about the choices.

>
> This also raises the point about how to store the matrix in memory.
> Does Numeric/NumPy have an efficient way of storing symmetric  
> matrices?
>   This is less flexible than the suggested list of lists, but for  
> large
> datasets would need much less memory.

I believe that SciPy  (Numeric/NumPy/etc..) is more efficient at  
storing these things. But you lose that when you want to do pythonish  
things to it (like write it back out).

>
> Second point - the "official" PHYLIP distance matrix file format
> truncates the taxa names at 10 characters.  Some tools (e.g. clustalw)
> ignore this limitation and will use as many as needed for the full  
> name.

ClustalW does the CORRECT thing: it truncates the name to 10
characters for Phylip output (alignments). And it does the CORRECT
thing for its distance matrix file.

In ClustalW's trees.c file:

void distance_matrix_output(FILE *ofile)

	/* left justify to the maximum length of names in the current
	   alignment file and use a space as a separator */
	fprintf(ofile,"\n%-*s ",max_names,names[i]);

Spaces in names are bad in this case, but Phylip is okay with them,
since the first 10 characters are the taxon name.

>   I personally find this much nicer - after all most gene identifiers
> (e.g. GI numbers) are eight characters to start with, and if you are
> dealing with multiple features in each gene 10 characters is tough  
> going.
>
> So, I would make sure you test the parser on this format variant (with
> names longer than 10 characters).  I can supply some examples if  
> you like.

By definition this isn't a variant of Phylip, but another format. So,  
one would need two parsers: PhylipDist and Dist (or ClustalDist).
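
The difference between the two formats comes down to how a row is split
into a name and its distances. A small sketch, with hypothetical function
names and made-up example rows:

```python
def split_phylip_strict(line):
    """Strict PHYLIP: the first 10 characters are the name, spaces included."""
    return line[:10].strip(), [float(x) for x in line[10:].split()]


def split_whitespace(line):
    """ClustalW-style: the name runs up to the first whitespace."""
    fields = line.split()
    return fields[0], [float(x) for x in fields[1:]]


# A strict PHYLIP row whose 10-character name contains a space.
print(split_phylip_strict("Homo sapie 0.0000 0.1234"))
# A row with a name longer than 10 characters and no spaces.
print(split_whitespace("Homo_sapiens_long 0.0 0.1234"))
```

Feeding either row to the other function raises ValueError, because part
of the name ends up being treated as a number. That is the practical reason
one splitting rule cannot serve both formats.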

>
> For writing matrices to file, the issue of following the strict 10
> character taxa limit might best be handled as an option (default to  
> max
> 10, with a warning if any names are truncated, and an error if
> truncation renders names non-unique?).

DON'T give an option of 10 or more. That is NOT the definition of the  
Phylip file Matrix structure, so why give the option? Make another  
class that outputs the whole name (ClustalDist).

I am pretty sure that Phylip doesn't care about non-unique names so  
why error out? However, the class should have a means for the user to  
ask this question.
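
One possible shape for such a query, as a sketch only (the function name
and the example names are made up): report which full names would collide
once cut to 10 characters, and leave the decision to the caller.

```python
def names_colliding_after_truncation(names, width=10):
    """Map each truncated name shared by several full names to those names."""
    groups = {}
    for name in names:
        groups.setdefault(name[:width], []).append(name)
    return dict((short, full) for short, full in groups.items()
                if len(full) > 1)


taxa = ["Escherichia_coli_K12", "Escherichia_coli_O157", "Bacillus"]
print(names_colliding_after_truncation(taxa))
```

An empty result means truncation keeps every name distinct.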

>
> Likewise an option to save matrices as either fully symmetric or lower
> triangular.  I would lean towards using fully symmetric as the default
> as it seems to be more common.

Phylip's default seems to be a "Square" distance matrix, i.e. fully
symmetric. Keep this in mind when naming things or writing documentation.

>
>> and a parser for the MEGA (http://www.megasoftware.net/mega.html)
>> distance matrix format. If not, would there be any interest in
>> creating parsers for these matrices, other than my own? I think
>> parsers for distance matrices could be very useful to the community.
>
> I suspect that for serious tree building pure python will not be
> competitive with existing C/C++ code on speed - but nonetheless could
> be useful.
>

Well, we do have things like SciPy and PyClustal, which make things  
more even.

Marc


From asmund.skjaveland at usit.uio.no  Mon Jun 12 15:45:26 2006
From: asmund.skjaveland at usit.uio.no (Åsmund Skjæveland)
Date: Mon, 12 Jun 2006 17:45:26 +0200
Subject: [BioPython] Generating Nexus file from Genbank file
Message-ID: <448D8C16.6050204@fys.uio.no>

I have a file of Genbank records, and want to extract some of them and
save to a Nexus file. As far as I can tell from the API, this should work:

#!/site/compython/Linux/bin/python

import Bio, sys, time
from Bio.GenBank import Iterator
from Bio.Nexus.Nexus import Nexus

gbfile='results/sequences-txid34828.genbank'

fp = Bio.GenBank.FeatureParser()
gb = open(gbfile, 'r')

it = Bio.GenBank.Iterator(gb, fp)

nex = Nexus()

nr = 0
rec = it.next()
while rec:
     # A string to identify the sequence with
     nexusname=rec.features[0].qualifiers['db_xref'][0] + '--' + rec.name
     nex.add_sequence(nexusname, rec.seq)
     nr += 1

     rec = it.next()

print "\n\n%d records" % nr

nex.write_nexus_data('results/genegrab.nex', mrbayes=True)


But it doesn't. When I run it:

Traceback (most recent call last):
   File "py_nexustest.py", line 39, in ?
     nex.add_sequence(nexusname, rec.seq)
   File
"/site/compython/Linux/lib/python2.4/site-packages/Bio/Nexus/Nexus.py",
line 1412, in add_sequence
     self.matrix[name]=Seq(sequence,self.alphabet)
AttributeError: 'Nexus' object has no attribute 'alphabet'

What am I doing wrong? I don't really know the Nexus format, I just want
to send certain sequences to MrBayes.

-- 
Åsmund Skjæveland {
    Scientific Computing Group, UiO;
}


From rohini.damle at gmail.com  Tue Jun 13 19:09:21 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Tue, 13 Jun 2006 12:09:21 -0700
Subject: [BioPython] (no subject)
Message-ID: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>

Hi,
 I am new to Biopython and am trying to use the NCBIStandalone parser to
parse my BLAST output, which is in plain text format.
The parser works well for older BLAST outputs but breaks down for newer
BLAST outputs. Can someone suggest a way to overcome this BLAST parser
problem?
Thanks


From winter at biotec.tu-dresden.de  Wed Jun 14 08:00:20 2006
From: winter at biotec.tu-dresden.de (Christof Winter)
Date: Wed, 14 Jun 2006 10:00:20 +0200
Subject: [BioPython] (no subject)
In-Reply-To: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>
Message-ID: <448FC214.20805@biotec.tu-dresden.de>

Hi Rohini,

can you provide a minimal example of your python code along with two blast reports 
(working/not working)?

Cheers,
Christof


Rohini Damle wrote:
> Hi,
>  I am new to Biopython and am trying to use the NCBIStandalone parser to
> parse my BLAST output, which is in plain text format.
> The parser works well for older BLAST outputs but breaks down for newer
> BLAST outputs. Can someone suggest a way to overcome this BLAST parser
> problem?
> Thanks
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython


From biopython at maubp.freeserve.co.uk  Wed Jun 14 09:09:48 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 14 Jun 2006 10:09:48 +0100
Subject: [BioPython] plain txt blast output - xml instead
In-Reply-To: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>
Message-ID: <448FD25C.20101@maubp.freeserve.co.uk>

Rohini Damle wrote:
> Hi,
> I am new to Biopython and am trying to use the NCBIStandalone parser to
> parse my BLAST output, which is in plain-text format.
> The parser works well for older BLAST outputs but breaks down for newer
> BLAST outputs.

The NCBI standalone blast and web blast plain text output keeps changing 
slightly, and as a result, the parser isn't always up to date.

> Can someone suggest a way to overcome this problem with the BLAST
> parser?

We recommend you use the XML output instead (this is possible with both 
online blast and the standalone tools).

For the stand alone tools, repeat your searches with the command line 
option -m 7 to get XML output.

If you are using the Bio.Blast.NCBIStandalone.blastall() function, use the 
align_view argument to set this.

You still use NCBIStandalone.Iterator (if you have multiple queries) but 
now use NCBIXML.BlastParser instead of NCBIStandalone.BlastParser.

e.g.
http://bugzilla.open-bio.org/attachment.cgi?id=293&action=view

Peter


From rohini.damle at gmail.com  Wed Jun 14 18:22:59 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Wed, 14 Jun 2006 11:22:59 -0700
Subject: [BioPython] plain txt blast output - xml instead
In-Reply-To: <448FD25C.20101@maubp.freeserve.co.uk>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>
	<448FD25C.20101@maubp.freeserve.co.uk>
Message-ID: <d9fd76050606141122m2d104ee6gb2c84473182e388d@mail.gmail.com>

Thank you very much for your help.
I have 55-56 proteins and I am using BLAST to find short, nearly exact
matches. The XML parser works fine for the first record, but even when I use
the iterator I cannot iterate through the records. I have used the same code
as you gave; what might be wrong?
Rohini.




From manickam.muthuraman at wur.nl  Wed Jun 14 20:22:56 2006
From: manickam.muthuraman at wur.nl (Muthuraman, Manickam)
Date: Wed, 14 Jun 2006 22:22:56 +0200
Subject: [BioPython] parsing the blastoutput and printing the alingment
Message-ID: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>

I am new to Python.

I am getting an error when parsing BLAST output. The same problem was already addressed by Michiel de Hoon, but I could not resolve it. Here are the errors I am getting.

First I got an error when I typed b_record=b_parser.parse(blast_out). As Michiel suggested, I changed my code, but both versions fail:
b_record=b_parser.parse(blast_out)
Traceback (most recent call last):
  File "<input>", line 1, in ?
  File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 112, in parse
    self._parser.parse(handler)
  File "/usr/lib/python2.4/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.4/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.4/xml/sax/expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.4/xml/sax/handler.py", line 38, in fatalError
    raise exception
SAXParseException: my_blast.out:1:4: not well-formed (invalid token)
blast_out=open('my_blast.out','r')
from Bio.Blast import NCBIStandalone
from Bio.Blast import NCBIXML
b_parser=NCBIXML.BlastParser()
b_iterator1=NCBIStandalone.Iterator(blast_out,b_parser)
for alignment in b_iterator1.alignments:
    for hsp in alignment.hsps:
        print 'seq:',alignment.title
    
Traceback (most recent call last):
  File "<input>", line 1, in ?
AttributeError: Iterator instance has no attribute 'alignments'


How do I print alignment.title and so on from the BLAST output file?
Thanks in advance.
-- 
Manickam(melaimanik)


From biopython at maubp.freeserve.co.uk  Wed Jun 14 21:54:53 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 14 Jun 2006 22:54:53 +0100
Subject: [BioPython] plain txt blast output - xml instead
In-Reply-To: <d9fd76050606141122m2d104ee6gb2c84473182e388d@mail.gmail.com>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>	<448FD25C.20101@maubp.freeserve.co.uk>
	<d9fd76050606141122m2d104ee6gb2c84473182e388d@mail.gmail.com>
Message-ID: <449085AD.7010801@maubp.freeserve.co.uk>

Rohini Damle wrote:
> Thank you very much for your help.
> I have 55-56 proteins & I am using Blast to find out short, nearly exact
> matches. The xml parser works fine for first record but even if I used the
> iterator, I CAN NOT ITERATE through the records, I have used the same code
> as u have given, what might be wrong?
> Rohini.

If you send us a short bit of example code and the error message, 
that would help.  Also, what version of BioPython are you using, and do 
you have Windows or Linux or MacOS?

One guess is that you will need to update the NCBIStandalone.py file to 
include a recent fix for iterating XML files.

Assuming you are using BioPython 1.41 on Windows, then click on this link 
and pick "download" near the top of the page to get the latest version:

http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIStandalone.py?cvsroot=biopython

Save it here:

c:\python24\lib\site-packages\Bio\Blast\NCBIStandalone.py

(Make a copy of the old file first, just in case)

Peter


From mdehoon at c2b2.columbia.edu  Wed Jun 14 21:55:17 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Wed, 14 Jun 2006 17:55:17 -0400
Subject: [BioPython] parsing the blastoutput and printing the alingment
In-Reply-To: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>
Message-ID: <449085C5.4020101@c2b2.columbia.edu>

Muthuraman, Manickam wrote:
> b_parser=NCBIXML.BlastParser()
> b_iterator1=NCBIStandalone.Iterator(blast_out,b_parser)
> for alignment in b_iterator1.alignments:
>     for hsp in alignment.hsps:
>         print 'seq:',alignment.title
>     
> Traceback (most recent call last):
>   File "<input>", line 1, in ?
> AttributeError: Iterator instance has no attribute 'alignments'
> 
Use:
b_record = b_iterator1.next()
for alignment in b_record.alignments:
    ...

Just like the example in the tutorial.

--Michiel.


-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032


From biopython at maubp.freeserve.co.uk  Wed Jun 14 21:48:20 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 14 Jun 2006 22:48:20 +0100
Subject: [BioPython] parsing the blastoutput and printing the alingment
In-Reply-To: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>
Message-ID: <44908424.2070407@maubp.freeserve.co.uk>

Muthuraman, Manickam wrote:
> I am new to python 
> 
> I am getting an error when parsing BLAST output. The same problem was
> already addressed by Michiel de Hoon, but I could not resolve it...
> 
> blast_out=open('my_blast.out','r')
> from Bio.Blast import NCBIStandalone
> from Bio.Blast import NCBIXML
> b_parser=NCBIXML.BlastParser()
> b_iterator1=NCBIStandalone.Iterator(blast_out,b_parser)
> for alignment in b_iterator1.alignments:
>     for hsp in alignment.hsps:
>         print 'seq:',alignment.title
>

Your example code is wrong.  The iterator object will return blast 
record objects (which have an alignments property).

Try something like this:

blast_out=open('my_blast.out','r')
from Bio.Blast import NCBIStandalone
from Bio.Blast import NCBIXML
b_parser=NCBIXML.BlastParser()
b_iterator=NCBIStandalone.Iterator(blast_out,b_parser)
for b_record in b_iterator:
     for alignment in b_record.alignments:
         for hsp in alignment.hsps:
             print 'seq:',alignment.title


Or for a full and tested example, try this :

http://bugzilla.open-bio.org/attachment.cgi?id=293&action=view

Peter


From rohini.damle at gmail.com  Wed Jun 14 18:21:18 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Wed, 14 Jun 2006 11:21:18 -0700
Subject: [BioPython] plain txt blast output - xml instead
In-Reply-To: <448FD25C.20101@maubp.freeserve.co.uk>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>
	<448FD25C.20101@maubp.freeserve.co.uk>
Message-ID: <d9fd76050606141121o548ff7e7of9c031344cdbb1cb@mail.gmail.com>

Thank you very much for your help.
I have 55-56 proteins and I am using BLAST to find short, nearly exact
matches. The XML parser works fine for the first record, but even when I use
the iterator, I




From manickam.muthuraman at wur.nl  Thu Jun 15 11:47:34 2006
From: manickam.muthuraman at wur.nl (Muthuraman, Manickam)
Date: Thu, 15 Jun 2006 13:47:34 +0200
Subject: [BioPython] parsing the blastoutput and printing the alingment
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>
Message-ID: <4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>

I am still getting the same error. I tried what Peter suggested, but it fails.


The code and the error are included below.

[manickam at bioinfo python]$ cat blas.py
from Bio import Fasta
file_for_blast=open('/home/manickam/Documents/m_cold.fasta','r')
f_iterator=Fasta.Iterator(file_for_blast)
f_record=f_iterator.next()
from Bio.Blast import NCBIWWW
result_handle=NCBIWWW.qblast('blastp','nr',f_record)
save_file=open('/home/manickam/my_blast.out','w')
blast_results=result_handle.read()
save_file.write(blast_results)
save_file.close()
blast_out=open('/home/manickam/my_blast.out','r')
from Bio.Blast import NCBIXML
from Bio.Blast import NCBIStandalone
b_parser=NCBIXML.BlastParser()
b_iterator=NCBIStandalone.Iterator(blast_out,b_parser)
for b_record in b_iterator:
    print "inside (3)outer loop"
    for alignment in b_record.alignments:
        print "inside 2 loop"
        for hsp in alignment.hsps:
            print "inside 1 loop"
            print 'seq:',alignment.title
blast_out.close()

[manickam at bioinfo python]$
[manickam at bioinfo python]$ python blas.py
/usr/lib/python2.4/site-packages/Bio/Blast/NCBIWWW.py:1064: UserWarning: qblast works only with blastn and blastp for now.
  warnings.warn("qblast works only with blastn and blastp for now.")
Traceback (most recent call last):
  File "blas.py", line 16, in ?
    for b_record in b_iterator:
  File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIStandalone.py", line 1385, in next
    return self._parser.parse(File.StringHandle(data))
  File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 112, in parse
    self._parser.parse(handler)
  File "/usr/lib/python2.4/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.4/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.4/xml/sax/expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.4/xml/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:1:4: not well-formed (invalid token)
[manickam at bioinfo python]$                                                               


From manickam.muthuraman at wur.nl  Thu Jun 15 11:51:36 2006
From: manickam.muthuraman at wur.nl (Muthuraman, Manickam)
Date: Thu, 15 Jun 2006 13:51:36 +0200
Subject: [BioPython] parsing the blastoutput and printing the alingment
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>
	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>
Message-ID: <4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>

Dear Michiel

I tried your suggestion as well, but I am still getting an error. I could not even work out where I am making a mistake.


[manickam at bioinfo python]$ cat blas.py
from Bio import Fasta
file_for_blast=open('/home/manickam/Documents/m_cold.fasta','r')
f_iterator=Fasta.Iterator(file_for_blast)
f_record=f_iterator.next()
from Bio.Blast import NCBIWWW
result_handle=NCBIWWW.qblast('blastp','nr',f_record)
save_file=open('/home/manickam/my_blast.out','w')
blast_results=result_handle.read()
save_file.write(blast_results)
save_file.close()
blast_out=open('/home/manickam/my_blast.out','r')
from Bio.Blast import NCBIXML
from Bio.Blast import NCBIStandalone
b_parser=NCBIXML.BlastParser()
b_iterator=NCBIStandalone.Iterator(blast_out,b_parser)
b_record = b_iterator.next()
for alignment in b_record.alignments:
    print "inside 2 loop"
    for hsp in alignment.hsps:
        print "inside 1 loop"
        print 'seq:',alignment.title
blast_out.close()

[manickam at bioinfo python]$ python blas.py
/usr/lib/python2.4/site-packages/Bio/Blast/NCBIWWW.py:1064: UserWarning: qblast works only with blastn and blastp for now.
  warnings.warn("qblast works only with blastn and blastp for now.")
Traceback (most recent call last):
  File "blas.py", line 16, in ?
    b_record = b_iterator.next()
  File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIStandalone.py", line 1385, in next
    return self._parser.parse(File.StringHandle(data))
  File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 112, in parse
    self._parser.parse(handler)
  File "/usr/lib/python2.4/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.4/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.4/xml/sax/expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.4/xml/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:1:4: not well-formed (invalid token)
[manickam at bioinfo python]$                              


From biopython at maubp.freeserve.co.uk  Thu Jun 15 12:25:06 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jun 2006 13:25:06 +0100
Subject: [BioPython] parsing the blastoutput and printing the alingment
In-Reply-To: <4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>
	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>
Message-ID: <449151A2.1040602@maubp.freeserve.co.uk>

Muthuraman, Manickam wrote:
> I am still getting the same error. I tried what Peter suggested, but it fails.
> ...

I couldn't see anything clearly wrong just from reading your code.

Which version of BioPython do you have?

Since BioPython 1.41 NCBIWWW.qblast uses XML as the default output 
format, but you can force this by:

result_handle=NCBIWWW.qblast('blastp','nr',f_record, format_type="XML")

Try opening your output file /home/manickam/my_blast.out in a text 
editor to double check it really is XML - i.e. does it start with <?xml ...?>
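
The same check can be done from Python instead of a text editor. This is only a sketch: the helper name starts_with_xml_declaration is made up for illustration and is not part of BioPython, and it only looks at the first few bytes of the file.

```python
def starts_with_xml_declaration(path):
    """Return True if the file begins with an XML declaration (<?xml ...)."""
    with open(path, "rb") as handle:
        head = handle.read(200)
    # Allow for an optional UTF-8 byte order mark, but nothing else:
    # an XML parser expects the declaration at the very start of the data.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    return head.startswith(b"<?xml")
```

If this returns False, printing the first line or two of the file should show what the parser is actually being given.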

If it is XML, then BioPython doesn't like it for some reason.  Maybe you 
could email the file to me and Michiel to take a look?

Peter


From manickam.muthuraman at wur.nl  Thu Jun 15 14:13:17 2006
From: manickam.muthuraman at wur.nl (Muthuraman, Manickam)
Date: Thu, 15 Jun 2006 16:13:17 +0200
Subject: [BioPython] parsing the blastoutput and printing the alingment
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>
	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>
	<4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>
Message-ID: <4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>


Dear Peter

Here is the code, my_blast.out, and the error. I need to get all the BLAST hit sequences in FASTA format. By parsing the output I can extract the accession numbers from it.

Code
from Bio import Fasta
file_for_blast=open('/home/manickam/Documents/m_cold.fasta','r')
f_iterator=Fasta.Iterator(file_for_blast)
f_record=f_iterator.next()
from Bio.Blast import NCBIWWW
result_handle=NCBIWWW.qblast('blastp','nr',f_record, format_type="XML")
save_file=open('/home/manickam/my_blast.out','w')
blast_results=result_handle.read()
save_file.write(blast_results)
save_file.close()
blast_out=open('/home/manickam/my_blast.out','r')
from Bio.Blast import NCBIXML
from Bio.Blast import NCBIStandalone
b_parser=NCBIXML.BlastParser()
b_iterator=NCBIStandalone.Iterator(blast_out,b_parser)
b_record = b_iterator.next()
for alignment in b_record.alignments:
    print "inside 2 loop"
    for hsp in alignment.hsps:
        print "inside 1 loop"
        print 'seq:',alignment.title
blast_out.close()

Error
[root at bioinfo python]# python blas.py
/usr/lib/python2.4/site-packages/Bio/Blast/NCBIWWW.py:1064: UserWarning: qblast works only with blastn and blastp for now.
  warnings.warn("qblast works only with blastn and blastp for now.")
Traceback (most recent call last):
  File "blas.py", line 16, in ?
    b_record = b_iterator.next()
  File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIStandalone.py", line 1410, in next
    return self._parser.parse(File.StringHandle(data))
  File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 112, in parse
    self._parser.parse(handler)
  File "/usr/lib/python2.4/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.4/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.4/xml/sax/expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.4/xml/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:1:4: not well-formed (invalid token)
[root at bioinfo python]#            

my_blast.out
HTTP/1.1 200 OK
Date: Thu, 15 Jun 2006 13:57:19 GMT
Server: Nde
Content-Type: application/xml
Connection: close

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastp</BlastOutput_program>
  <BlastOutput_version>BLASTP 2.2.14 [May-07-2006]</BlastOutput_version>
  <BlastOutput_reference>Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), &quot;Gapped BLAST and PSI-BLAST: a new generation of protein database search programs&quot;, Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
  <BlastOutput_db>nr</BlastOutput_db>
  <BlastOutput_query-ID>1_13944</BlastOutput_query-ID>
  <BlastOutput_query-def>1BK0</BlastOutput_query-def>
  <BlastOutput_query-len>331</BlastOutput_query-len>
  <BlastOutput_param>
.
.
.
.
.
.
.
.
 <Hsp_identity>76</Hsp_identity>
              <Hsp_positive>128</Hsp_positive>
              <Hsp_gaps>27</Hsp_gaps>
              <Hsp_align-len>295</Hsp_align-len>
              <Hsp_qseq>VPKIDVSPLFGD-DQAAKMRVAQQIDAASRDTGFFYAVNHGIN---VQRLSQKTKEFHMSITPEEKWDLAIRAYNKEHQDQVRAGYYLSIPGKKAVESFCYLNP--NFTPDHPRIQAKTPTHEVNVWPDETKHPGFQDFAEQYYWDVFGLSSALLKGYALALGKEENFFARHFKPDDTLASVVLIRYP-YLDPYPEAAIKTAADGTKLSFEWHEDVSLITVLYQSNVQNLQVETAAGYQDIEADDTGYLINCGSYMAHLTNNYYKAPIHRV--KWVNAERQSLPFFVNLGYDSVI</Hsp_qseq>
              <Hsp_hseq>LPVIDLSLLDGSPESAAKFR--DDLLCATHDVGFFYLVGHGVDESLMDDLLAASREFFD--LPEDQKFAVENVKSPQFRGYTRVGGELT-EGKTDWREQIDVGPERDVIDNAPGLADYWRLEGPNLWPDAV--PQLRGLVNEWNDKLSAVSLRLLRAWAHALGAPEDVFDNAFA-DKPFPQLKIVRYPGESNPEPKQGVGAHRDGGVLTL----------LMVEPGKGGLQVDYNGEWVDVPPKPGAFVVNIGEMLELATEGYLKATLHRVISPLIGDDRISIPFFFNPALDTVM</Hsp_hseq>
              <Hsp_midline>+P ID+S L G  + AAK R    +  A+ D GFFY V HG++   +  L   ++EF     PE++        + + +   R G  L+  GK        + P  +   + P +         N+WPD    P  +    ++   +  +S  LL+ +A ALG  E+ F   F  D     + ++RYP   +P P+  +    DG  L+           ++ +     LQV+    + D+      +++N G  +   T  Y KA +HRV    +  +R S+PFF N   D+V+</Hsp_midline>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>3695564</Statistics_db-num>
          <Statistics_db-len>1269795892</Statistics_db-len>
          <Statistics_hsp-len>0</Statistics_hsp-len>
          <Statistics_eff-space>0</Statistics_eff-space>
          <Statistics_kappa>0.041</Statistics_kappa>
          <Statistics_lambda>0.267</Statistics_lambda>
          <Statistics_entropy>0.14</Statistics_entropy>
        </Statistics>
      </Iteration_stat>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
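
Note that the saved file shown above begins with HTTP response header lines (HTTP/1.1 200 OK through Connection: close) ahead of the <?xml ...?> declaration, and an error position of line 1, column 4 is consistent with a parser stopping on that first header line rather than on the XML itself. A minimal sketch of one possible workaround, assuming the header lines are the only extra content: drop everything before the XML declaration and parse the rest. The function name strip_http_header is hypothetical (it is not a BioPython function), and the sample string is a shortened stand-in for my_blast.out that uses only tags appearing in the file above.

```python
import xml.dom.minidom


def strip_http_header(text):
    """Return text starting at the XML declaration, dropping anything before it."""
    start = text.find("<?xml")
    if start == -1:
        raise ValueError("no XML declaration found")
    return text[start:]


# Shortened stand-in for my_blast.out: HTTP headers followed by BLAST XML
saved = (
    "HTTP/1.1 200 OK\n"
    "Content-Type: application/xml\n"
    "Connection: close\n"
    "\n"
    '<?xml version="1.0"?>\n'
    "<BlastOutput>\n"
    "  <BlastOutput_program>blastp</BlastOutput_program>\n"
    "  <BlastOutput_db>nr</BlastOutput_db>\n"
    "</BlastOutput>\n"
)

# Parsing the cleaned text succeeds; parsing `saved` directly would not
document = xml.dom.minidom.parseString(strip_http_header(saved))
program = document.getElementsByTagName("BlastOutput_program")[0]
print(program.firstChild.data)
```

The cleaned text could then be written back to disk, or wrapped in a file-like object, before being handed to the BLAST XML parser.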


From biopython at maubp.freeserve.co.uk  Thu Jun 15 15:01:42 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jun 2006 16:01:42 +0100
Subject: [BioPython] parsing the blastoutput and printing the alingment
In-Reply-To: <4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>
	<4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>
Message-ID: <44917656.6090602@maubp.freeserve.co.uk>

Muthuraman, Manickam wrote:
> Dear peter
> 
> here is the code my_blast.out and the error. My need is to get all the
 > blast hit sequences in fasta format. By parsing and i can extract
 > accession number from it.

I made an example fasta file containing just this one sequence twice:

 >example1
VPKIDVSPLFGDDQAAKMRVAQQIDAASRDTGFFYAVNHGINVQRLSQKTKEFHMSITP
EEKWDLAIRAYNKEHQDQVRAGYYLSIPGKKAVESFCYLNPNFTPDHPRIQAKTPTHEV
NVWPDETKHPGFQDFAEQYYWDVFGLSSALLKGYALALGKEENFFARHFKPDDTLASVV
LIRYPYLDPYPEAAIKTAADGTKLSFEWHEDVSLITVLYQSNVQNLQVETAAGYQDIEA
DDTGYLINCGSYMAHLTNNYYKAPIHRVKWVNAERQSLPFFVNLGYDSVI
 >example2
VPKIDVSPLFGDDQAAKMRVAQQIDAASRDTGFFYAVNHGINVQRLSQKTKEFHMSITP
EEKWDLAIRAYNKEHQDQVRAGYYLSIPGKKAVESFCYLNPNFTPDHPRIQAKTPTHEV
NVWPDETKHPGFQDFAEQYYWDVFGLSSALLKGYALALGKEENFFARHFKPDDTLASVV
LIRYPYLDPYPEAAIKTAADGTKLSFEWHEDVSLITVLYQSNVQNLQVETAAGYQDIEA
DDTGYLINCGSYMAHLTNNYYKAPIHRVKWVNAERQSLPFFVNLGYDSVI

I then edited the filenames in your example, and ran the code.  It 
worked for me using a fresh install of BioPython 1.41 on Linux with 
Python 2.4.2

So the good news is your code seems fine.

Maybe there is something "funny" with your fasta file?  Accented 
characters for example - which would then be in the output XML file?

Could you send me the fasta file and the XML file (in full, as 
attachments), off the mailing list to avoid clogging up everyone's inboxes.

Thanks

Peter


From biopython at maubp.freeserve.co.uk  Thu Jun 15 15:08:32 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jun 2006 16:08:32 +0100
Subject: [BioPython] Abuse of the new Wiki Homepage
Message-ID: <449177F0.1010209@maubp.freeserve.co.uk>

I've noticed someone has created an account "Ceas" on the wiki and has 
been inserting junk/spam links.  For example, look at the history of the 
main page:

http://biopython.org/wiki/Biopython

Who is in charge of the Wiki?  Can we
(a) block this account (short term action)
(b) tighten up rules for creating new accounts?

Peter


From arareko at campus.iztacala.unam.mx  Thu Jun 15 16:13:50 2006
From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra)
Date: Thu, 15 Jun 2006 11:13:50 -0500
Subject: [BioPython] Abuse of the new Wiki Homepage
In-Reply-To: <449177F0.1010209@maubp.freeserve.co.uk>
References: <449177F0.1010209@maubp.freeserve.co.uk>
Message-ID: <4491873E.50509@campus.iztacala.unam.mx>

Hi Peter,

We started to have the same problem in the BioPerl wiki some months ago. 
The way we usually solve this is by blocking the user account and 
rolling back to the previous version of the affected document.

We have a list of wiki administrators who are constantly (and 
independently) monitoring the recent changes in the site. This way we 
can keep track of the changes and revert damages to the content:

http://bioperl.org/wiki/BioPerl:Administrators
http://bioperl.org/wiki/Special:Recentchanges

You can also keep track of the changes by using the RSS or Atom feeds 
provided by the Recentchanges page:

http://bioperl.org/w/index.php?title=Special:Recentchanges&feed=rss
http://bioperl.org/w/index.php?title=Special:Recentchanges&feed=atom

The wiki system keeps a record of blocked users and IPs; you can have a 
look here:

http://bioperl.org/wiki/Special:Ipblocklist

There is also a Blacklist, which complements the main Wikimedia one 
and helps detect spam content before it goes into a 
document:

http://bioperl.org/wiki/Help:Blacklist
http://meta.wikimedia.org/wiki/Spam_blacklist

I don't know who's in charge of BioPython's wiki but I hope this info 
can be helpful to you.

Regards,
Mauricio.


-- 
MAURICIO HERRERA CUADRA
arareko at campus.iztacala.unam.mx
Laboratorio de Genética
Unidad de Morfofisiología y Función
Facultad de Estudios Superiores Iztacala, UNAM


From dag at sonsorol.org  Thu Jun 15 16:31:01 2006
From: dag at sonsorol.org (Chris Dagdigian)
Date: Thu, 15 Jun 2006 12:31:01 -0400
Subject: [BioPython] Abuse of the new Wiki Homepage
In-Reply-To: <4491873E.50509@campus.iztacala.unam.mx>
References: <449177F0.1010209@maubp.freeserve.co.uk>
	<4491873E.50509@campus.iztacala.unam.mx>
Message-ID: <F2ADE880-41EA-412F-A34A-61E6E6533AA5@sonsorol.org>


I deal with a number of wiki sites, all of which are subjected to a  
constant stream of automated spam posters.

The single best defense is volunteers who monitor the "Recent  
Changes" feed and take instant action to rollback the spam changes:

http://biopython.org/wiki/Special:Recentchanges

People can monitor that page (in web or RSS form) and rollback spam  
shortly after it happens. It really is the best way.  Anyone can roll  
back changes. If you find yourself doing it often, ask to become a  
wiki administrator and then you'll be able to blocklist people and IP  
addresses as well.

Behind the scenes we do other things to block spam, including regular  
expression tests on content, blacklists etc. but it is a constant  
arms race with the wiki spammers and we are always a bit behind.
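
As a rough illustration of what a regular expression test on content can look like, here is a small Python sketch. The patterns and the function name looks_like_spam are invented for this example; they are not the rules actually used on the BioPython or BioPerl wikis.

```python
import re

# Hypothetical blacklist entries; a real list would be maintained by the wiki admins
BLACKLIST_PATTERNS = [
    r"cheap-pills\.example\.com",
    r"casino[0-9]*\.example\.net",
]
BLACKLIST = [re.compile(p, re.IGNORECASE) for p in BLACKLIST_PATTERNS]


def looks_like_spam(added_text):
    """Return True if any blacklisted pattern occurs in the newly added text."""
    for pattern in BLACKLIST:
        if pattern.search(added_text):
            return True
    return False


print(looks_like_spam("See http://CASINO77.example.net for details"))
print(looks_like_spam("Fixed a typo in the Bio.Blast section"))
```

A check like this only catches links that are already on the list, which is why watching the Recent Changes feed still matters.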

My $.02

-Chris




From jason.stajich at duke.edu  Thu Jun 15 16:45:50 2006
From: jason.stajich at duke.edu (Jason Stajich)
Date: Thu, 15 Jun 2006 12:45:50 -0400
Subject: [BioPython] Fwd:  Abuse of the new Wiki Homepage
References: <FBF0A76F-92B2-4F8B-BA28-400B1E1A0C2E@duke.edu>
Message-ID: <29F97001-146E-414A-8E5D-330AEDAB3392@duke.edu>


Begin forwarded message:

> From: Jason Stajich <jason.stajich at duke.edu>
> Date: June 15, 2006 12:40:13 PM EDT
> To: Mauricio Herrera Cuadra <arareko at campus.iztacala.unam.mx>
> Cc: biopython at biopython.org, Chris Dagdigian <dag at sonsorol.org>,  
> Chris Fields <cjfields at uiuc.edu>
> Subject: Re: [BioPython] Abuse of the new Wiki Homepage
>
> I'm not convinced the blacklist is working - but we need to make  
> sure it is enabled in the conf file on the server.  I've locked the  
> blacklist page as well so that only sysops can edit it.  Iddo and  
> Michiel are the main site admins right now, other people can be  
> promoted by them or one of the main site admins if we know who you  
> are.
>
> I've blocked the previous spammer's account.  You can easily revert  
> changes by using the rollback button on the diff page.
>
> The biopython community will have to decide how it wants to handle  
> new accounts to the wiki site. Whether there is patrolling or if  
> you want to lock the site down.  I would encourage all legitimate  
> users to add something to their User page so that we can have an  
> easier time distinguishing random account creation from real people.
>
> -jason
> On Jun 15, 2006, at 12:13 PM, Mauricio Herrera Cuadra wrote:
>
>> Hi Peter,
>>
>> We started to have the same problem in the BioPerl wiki some  
>> months ago. The way we usually solve this is by blocking the user  
>> account and rolling back to the previous version of the affected  
>> document.
>>
>> We have a list of wiki administrators who are constantly (and  
>> independently) monitoring the recent changes in the site. This way  
>> we can keep track of the changes and revert damages to the content:
>>
>> http://bioperl.org/wiki/BioPerl:Administrators
>> http://bioperl.org/wiki/Special:Recentchanges
>>
>> You can also keep track of the changes by using the RSS or Atom  
>> feeds provided by the Recentchanges page:
>>
>> http://bioperl.org/w/index.php?title=Special:Recentchanges&feed=rss
>> http://bioperl.org/w/index.php?title=Special:Recentchanges&feed=atom
>>
>> The wiki system has memory of the blocked users and IP's, you can  
>> have a look here:
>>
>> http://bioperl.org/wiki/Special:Ipblocklist
>>
>> There also exists a Blacklist, which is a complement to the main  
>> Wikimedia's one and helps detect spam content before it goes into  
>> a document:
>>
>> http://bioperl.org/wiki/Help:Blacklist
>> http://meta.wikimedia.org/wiki/Spam_blacklist
>>
>> I don't know who's in charge of BioPython's wiki but I hope this  
>> info can be helpful to you.
>>
>> Regards,
>> Mauricio.
>>
>> Peter wrote:
>>> I've noticed someone has created an account "Ceas" on the wiki  
>>> and has been inserting junk/spam links.  For example, look at the  
>>> history of the main page:
>>> http://biopython.org/wiki/Biopython
>>> Who is in charge of the Wiki?  Can we
>>> (a) block this account (short term action)
>>> (b) tighten up rules for creating new accounts?
>>> Peter
>>
>> -- 
>> MAURICIO HERRERA CUADRA
>> arareko at campus.iztacala.unam.mx
>> Laboratorio de Genética
>> Unidad de Morfofisiología y Función
>> Facultad de Estudios Superiores Iztacala, UNAM
>>
>
> --
> Jason Stajich
> Duke University
> http://www.duke.edu/~jes12
>
>

--
Jason Stajich
Duke University
http://www.duke.edu/~jes12


From rohini.damle at gmail.com  Thu Jun 15 16:36:27 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Thu, 15 Jun 2006 09:36:27 -0700
Subject: [BioPython] plain txt blast output - xml instead
In-Reply-To: <449085AD.7010801@maubp.freeserve.co.uk>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>
	<448FD25C.20101@maubp.freeserve.co.uk>
	<d9fd76050606141122m2d104ee6gb2c84473182e388d@mail.gmail.com>
	<449085AD.7010801@maubp.freeserve.co.uk>
Message-ID: <d9fd76050606150936r5f0e1fe1o89fd12c16f4361d1@mail.gmail.com>

Hi,
I am using BioPython 1.41 on Windows, and I have also updated
NCBIStandalone.py from the link you gave. Here is my code.

from Bio.Blast import NCBIStandalone
from Bio.Blast import NCBIXML
blast_out = open("4proteinblast.xml","r")
b_iterator = NCBIStandalone.Iterator(blast_out, NCBIXML.BlastParser())

for b_record in b_iterator :
        query_name = b_record.query
        print query_name
        for alignment in b_record.alignments:
                print '****Alignment****'
                print 'sequence:', alignment.title

This code gives "sequences producing significant alignments" for all 4
proteins, but prints the query name as P1 every time.
I mean I am getting all the information I want, but I have 4 protein queries
and this code gives only P1 as the query (not P2, P3, P4, although it does
give information about them). I am attaching the XML file of the 4-protein
BLAST results.
Thank you for your help.


On 6/14/06, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> Rohini Damle wrote:
> > Thank you very much for your help.
> > I have 55-56 proteins & I am using Blast to find short, nearly exact
> > matches. The xml parser works fine for the first record, but even if I
> > use the iterator, I CAN NOT ITERATE through the records. I have used the
> > same code as you have given, what might be wrong?
> > Rohini.
>
> If you send us a short bit of example code, and the error message,
> that would help.  Also, what version of BioPython are you using, and do
> you have Windows or Linux or MacOS...
>
> One guess is that you will need to update the NCBIStandalone.py file to
> include a recent fix for iterating XML files.
>
> Assuming you are using BioPython 1.41 on Windows, then click on this link
> and pick "download" near the top of the page to get the latest version:
>
>
> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIStandalone.py?cvsroot=biopython
>
> Save it here:
>
> c:\python24\lib\site-packages\Bio\Blast\NCBIStandalone.py
>
> (Make a copy of the old file first, just in case)
>
> Peter
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 4proteinblast.xml
Type: text/xml
Size: 98271 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20060615/722b8845/attachment-0002.xml>

From cjfields at uiuc.edu  Thu Jun 15 16:41:05 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 15 Jun 2006 11:41:05 -0500
Subject: [BioPython] Abuse of the new Wiki Homepage
In-Reply-To: <4491873E.50509@campus.iztacala.unam.mx>
Message-ID: <000601c6909a$7f4ec5b0$15327e82@pyrimidine>

Looks like Jason's doing some work on the BioPython wiki to get it up to
speed.  I added Help:Blacklist as a start.

Like Mauricio said, probably need to get a small group of sysadmins together
to keep an eye on things and block potential spammers.  

Chris

> -----Original Message-----
> From: Mauricio Herrera Cuadra [mailto:arareko at campus.iztacala.unam.mx]
> Sent: Thursday, June 15, 2006 11:14 AM
> To: biopython at lists.open-bio.org
> Cc: biopython at biopython.org; Jason Stajich; Chris Dagdigian; Chris Fields
> Subject: Re: [BioPython] Abuse of the new Wiki Homepage
> 
> Hi Peter,
> 
> We started to have the same problem in the BioPerl wiki some months ago.
> The way we usually solve this is by blocking the user account and
> rolling back to the previous version of the affected document.
> 
> We have a list of wiki administrators who are constantly (and
> independently) monitoring the recent changes in the site. This way we
> can keep track of the changes and revert damages to the content:
> 
> http://bioperl.org/wiki/BioPerl:Administrators
> http://bioperl.org/wiki/Special:Recentchanges
> 
> You can also keep track of the changes by using the RSS or Atom feeds
> provided by the Recentchanges page:
> 
> http://bioperl.org/w/index.php?title=Special:Recentchanges&feed=rss
> http://bioperl.org/w/index.php?title=Special:Recentchanges&feed=atom
> 
> The wiki system has memory of the blocked users and IP's, you can have a
> look here:
> 
> http://bioperl.org/wiki/Special:Ipblocklist
> 
> There also exists a Blacklist, which is a complement to the main
> Wikimedia's one and helps detect spam content before it goes into a
> document:
> 
> http://bioperl.org/wiki/Help:Blacklist
> http://meta.wikimedia.org/wiki/Spam_blacklist
> 
> I don't know who's in charge of BioPython's wiki but I hope this info
> can be helpful to you.
> 
> Regards,
> Mauricio.
> 
> Peter wrote:
> > I've noticed someone has created an account "Ceas" on the wiki and has
> > been inserting junk/spam links.  For example, look at the history of the
> > main page:
> >
> > http://biopython.org/wiki/Biopython
> >
> > Who is in charge of the Wiki?  Can we
> > (a) block this account (short term action)
> > (b) tighten up rules for creating new accounts?
> >
> > Peter
> >
> >
> 
> --
> MAURICIO HERRERA CUADRA
> arareko at campus.iztacala.unam.mx
> Laboratorio de Genética
> Unidad de Morfofisiología y Función
> Facultad de Estudios Superiores Iztacala, UNAM


From biopython at maubp.freeserve.co.uk  Thu Jun 15 17:30:18 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jun 2006 18:30:18 +0100
Subject: [BioPython] plain txt blast output - xml instead
In-Reply-To: <d9fd76050606150936r5f0e1fe1o89fd12c16f4361d1@mail.gmail.com>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>	<448FD25C.20101@maubp.freeserve.co.uk>	<d9fd76050606141122m2d104ee6gb2c84473182e388d@mail.gmail.com>	<449085AD.7010801@maubp.freeserve.co.uk>
	<d9fd76050606150936r5f0e1fe1o89fd12c16f4361d1@mail.gmail.com>
Message-ID: <4491992A.5040301@maubp.freeserve.co.uk>

Rohini Damle wrote:
> Hi,
> I am using BioPython 1.41 on Windows, and I have also updated
> NCBIStandalone.py from the link you gave. Here is my code.
> 
> from Bio.Blast import NCBIStandalone
> from Bio.Blast import NCBIXML
> blast_out = open("4proteinblast.xml","r")
> b_iterator = NCBIStandalone.Iterator(blast_out, NCBIXML.BlastParser())
> 
> for b_record in b_iterator :
>        query_name = b_record.query
>        print query_name
>        for alignment in b_record.alignments:
>               print '****Alignment****'
>               print 'sequence:', alignment.title
> 
> This code gives "sequences producing significant alignments" for all 4
> proteins, but prints the query name as P1

This code does the same thing, but prints less on screen so it's easier 
to read:

from Bio.Blast import NCBIStandalone
from Bio.Blast import NCBIXML
blast_out = open("4proteinblast.xml","r")
b_iterator = NCBIStandalone.Iterator(blast_out, NCBIXML.BlastParser())

for b_record in b_iterator :
     query_name = b_record.query
     print query_name
     for alignment in b_record.alignments:
         print query_name, alignment.title.split()[0]


> I mean I am getting all the information I want but I have 4 protein
> queries and this code is giving only P1 as a query (not P2, P3, P4
> but giving information about them). I am attaching the xml file of
> 4 protein blast results. Thank you for your help.

Looking at the raw XML file by hand, I could only see references to P1, 
the first protein.

If the file had results for all four proteins I would expect to see:

<?xml version="1.0"?>
... results for P1 ...
<?xml version="1.0"?>
... results for P2 ...
<?xml version="1.0"?>
... results for P3 ...
<?xml version="1.0"?>
... results for P4 ...

Are you sure you gave Blast all four input sequences - and not just the 
first sequence?

Peter


From mdehoon at c2b2.columbia.edu  Thu Jun 15 17:43:51 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Thu, 15 Jun 2006 13:43:51 -0400
Subject: [BioPython] plain txt blast output - xml instead
In-Reply-To: <4491992A.5040301@maubp.freeserve.co.uk>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>	<448FD25C.20101@maubp.freeserve.co.uk>	<d9fd76050606141122m2d104ee6gb2c84473182e388d@mail.gmail.com>	<449085AD.7010801@maubp.freeserve.co.uk>	<d9fd76050606150936r5f0e1fe1o89fd12c16f4361d1@mail.gmail.com>
	<4491992A.5040301@maubp.freeserve.co.uk>
Message-ID: <44919C57.7030204@c2b2.columbia.edu>

Peter wrote:
> 
> Looking at the raw XML file by hand, I could only see references to P1, 
> the first protein.
> 
> If the file had results for all four proteins I would expect to see:
> 
> <?xml version="1.0"?>
> ... results for P1 ...
> <?xml version="1.0"?>
> ... results for P2 ...
> <?xml version="1.0"?>
> ... results for P3 ...
> <?xml version="1.0"?>
> ... results for P4 ...
> 
There are results for all four proteins in the XML file, but they look 
like this:

  <Iteration>
    <Iteration_iter-num>2</Iteration_iter-num>
    <Iteration_query-ID>2_20304</Iteration_query-ID>
    <Iteration_query-def>p2</Iteration_query-def>
    ...
  </Iteration>

and so on. Could you let us know how this XML file was generated?

--Michiel


-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032


From biopython at maubp.freeserve.co.uk  Thu Jun 15 17:53:53 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jun 2006 18:53:53 +0100
Subject: [BioPython] parsing the blastoutput and printing the alingment
In-Reply-To: <4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>
	<4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>
Message-ID: <44919EB1.1080805@maubp.freeserve.co.uk>

I know you haven't got the XML parsing working yet - but I thought I 
should point something else out...

Muthuraman, Manickam wrote:
> from Bio import Fasta
> file_for_blast=open('/home/manickam/Documents/m_cold.fasta','r')
> f_iterator=Fasta.Iterator(file_for_blast)
> f_record=f_iterator.next()

f_record will contain a single fasta record (the first entry in the file 
m_cold.fasta only).

> from Bio.Blast import NCBIWWW
> result_handle=NCBIWWW.qblast('blastp','nr',f_record, format_type="XML")

This will only run blast on the one record (i.e. the first fasta entry 
in m_cold.fasta), so the resulting XML file will only have blast results 
for this protein.

I'm not sure if you can use the online NCBI blast (i.e. NCBIWWW.qblast) 
to submit multiple queries...

You might want to install stand alone blast on your own machine - as 
this will accept multiple inputs.  You just tell it to read m_cold.fasta 
as its input file, and the resulting XML file will contain the results 
for each sequence in the fasta file.

Note that if you know in advance that the XML blast output is from a 
single input query, you don't need the NCBI iterator.
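
The difference between taking only the first FASTA record and looping over 
all of them can be sketched as below.  This is a minimal illustration in 
plain modern Python, not the Bio.Fasta API; read_fasta is a hypothetical 
helper and the two sample records are only placeholder data.

```python
import io


def read_fasta(handle):
    """Yield (title, sequence) for every record in a FASTA handle."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)
    # Emit the final record once the file is exhausted.
    if title is not None:
        yield title, "".join(chunks)


# Placeholder data standing in for a multi-record FASTA file.
example = io.StringIO(">p1\nFILGIIITV\n>p2\nGLFDFVNFV\n")
records = list(read_fasta(example))

# Looping over every record, rather than calling next() once,
# is what gives one query per sequence.
for title, sequence in records:
    print(title, sequence)
```

Each (title, sequence) pair could then be submitted as its own query, 
instead of only the first one.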

Peter


From rohini.damle at gmail.com  Thu Jun 15 18:24:38 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Thu, 15 Jun 2006 11:24:38 -0700
Subject: [BioPython] (no subject)
Message-ID: <d9fd76050606151124m5d9cd72eme04fd3327074cd29@mail.gmail.com>

> I opened the 'search for short nearly exact match' blast tool then
> entered these protein sequences
>  >p1
> FILGIIITV
>  >p2
> GLFDFVNFV
>  >p3
> FLIVSLCPT
>  >p4
> RVYEALYYV
>
>
> Set parameters like evalue and organism and chose the output format as XML.
> The output does not contain references to all 4 proteins at the
> start, but only in the <Iteration> blocks (one block for each protein).
> Is there any other way to generate the XML formatted output?
> -Rohini.


From biopython at maubp.freeserve.co.uk  Thu Jun 15 18:38:54 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jun 2006 19:38:54 +0100
Subject: [BioPython] plain txt blast output - xml instead
In-Reply-To: <44919C57.7030204@c2b2.columbia.edu>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>	<448FD25C.20101@maubp.freeserve.co.uk>	<d9fd76050606141122m2d104ee6gb2c84473182e388d@mail.gmail.com>	<449085AD.7010801@maubp.freeserve.co.uk>	<d9fd76050606150936r5f0e1fe1o89fd12c16f4361d1@mail.gmail.com>	<4491992A.5040301@maubp.freeserve.co.uk>
	<44919C57.7030204@c2b2.columbia.edu>
Message-ID: <4491A93E.2020306@maubp.freeserve.co.uk>

Michiel Jan Laurens de Hoon wrote:
> Peter wrote:
> 
>>Looking at the raw XML file by hand, I could only see references to P1, 
>>the first protein.
>>
>>If the file had results for all four proteins I would expect to see:
>>
>><?xml version="1.0"?>
>>... results for P1 ...
>><?xml version="1.0"?>
>>... results for P2 ...
>><?xml version="1.0"?>
>>... results for P3 ...
>><?xml version="1.0"?>
>>... results for P4 ...
>>
> 
> There are results for all four proteins in the XML file, but they look 
> like this:
> 
>   <Iteration>
>     <Iteration_iter-num>2</Iteration_iter-num>
>     <Iteration_query-ID>2_20304</Iteration_query-ID>
>     <Iteration_query-def>p2</Iteration_query-def>
>     ...
>   </Iteration>
> 
> and so on.

Oh yeah.  I should have seen that, sorry.

According to the XML file, it is from BLASTP 2.2.14 [May-07-2006], maybe 
they changed the XML format without telling anyone?

I couldn't see anything obvious on this page:

http://www.ncbi.nlm.nih.gov/blast/blast_whatsnew.shtml

This looks like the source code here:

ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/ncbi.tar.gz

And you can view their CVS here:

http://www.ncbi.nlm.nih.gov/cvsweb/index.cgi/ncbi/algo/blast/

There is nothing in the check-in comments that leaps out at me regarding 
XML iterations...

 >
 > Could you let us know how this XML file was generated?
 >

e.g. Standalone or online?

Peter


From cariaso at yahoo.com  Thu Jun 15 18:39:21 2006
From: cariaso at yahoo.com (Mike Cariaso)
Date: Thu, 15 Jun 2006 11:39:21 -0700 (PDT)
Subject: [BioPython] Fwd:  Abuse of the new Wiki Homepage
In-Reply-To: <29F97001-146E-414A-8E5D-330AEDAB3392@duke.edu>
Message-ID: <20060615183921.27494.qmail@web52711.mail.yahoo.com>

> The biopython community will have to decide how it wants to handle  
> new accounts to the wiki site. Whether there is patrolling or if  
> you want to lock the site down.  I would encourage all legitimate  
> users to add something to their User page so that we can have an  
> easier time distinguishing random account creation from real people.

Consider this my vote against any sort of lock down against new users. It can be a real deterrent to new contributors, and that is something we sorely need. I'd be more willing to roll back the useless spam than to risk deterring valuable new contributions.

Thank you to Maubp for already removing all of Ceas's garbage. 




From rohini.damle at gmail.com  Thu Jun 15 18:44:38 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Thu, 15 Jun 2006 11:44:38 -0700
Subject: [BioPython] plain txt blast output - xml instead
In-Reply-To: <4491A93E.2020306@maubp.freeserve.co.uk>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>
	<448FD25C.20101@maubp.freeserve.co.uk>
	<d9fd76050606141122m2d104ee6gb2c84473182e388d@mail.gmail.com>
	<449085AD.7010801@maubp.freeserve.co.uk>
	<d9fd76050606150936r5f0e1fe1o89fd12c16f4361d1@mail.gmail.com>
	<4491992A.5040301@maubp.freeserve.co.uk>
	<44919C57.7030204@c2b2.columbia.edu>
	<4491A93E.2020306@maubp.freeserve.co.uk>
Message-ID: <d9fd76050606151144q44d935aai3b2bef9a6d71210d@mail.gmail.com>

I used online ncbi blast to generate the xml output
Rohini


On 6/15/06, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> Michiel Jan Laurens de Hoon wrote:
> > Peter wrote:
> >
> >>Looking at the raw XML file by hand, I could only see references to P1,
> >>the first protein.
> >>
> >>If the file had results for all four proteins I would expect to see:
> >>
> >><?xml version="1.0"?>
> >>... results for P1 ...
> >><?xml version="1.0"?>
> >>... results for P2 ...
> >><?xml version="1.0"?>
> >>... results for P3 ...
> >><?xml version="1.0"?>
> >>... results for P4 ...
> >>
> >
> > There are results for all four proteins in the XML file, but they look
> > like this:
> >
> >   <Iteration>
> >     <Iteration_iter-num>2</Iteration_iter-num>
> >     <Iteration_query-ID>2_20304</Iteration_query-ID>
> >     <Iteration_query-def>p2</Iteration_query-def>
> >     ...
> >   </Iteration>
> >
> > and so on.
>
> Oh yeah.  I should have seen that, sorry.
>
> According to the XML file, it is from BLASTP 2.2.14 [May-07-2006], maybe
> they changed the XML format without telling anyone?
>
> I couldn't see anything obvious on this page:
>
> http://www.ncbi.nlm.nih.gov/blast/blast_whatsnew.shtml
>
> This looks like the source code here:
>
> ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/ncbi.tar.gz
>
> And you can view their CVS here:
>
> http://www.ncbi.nlm.nih.gov/cvsweb/index.cgi/ncbi/algo/blast/
>
> There is nothing in the check-in comments that leaps out at me regarding
> XML iterations...
>
> >
> > Could you let us know how this XML file was generated?
> >
>
> e.g. Standalone or online?
>
> Peter
>
>


From cjfields at uiuc.edu  Thu Jun 15 16:55:40 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 15 Jun 2006 11:55:40 -0500
Subject: [BioPython] Abuse of the new Wiki Homepage
In-Reply-To: <FBF0A76F-92B2-4F8B-BA28-400B1E1A0C2E@duke.edu>
Message-ID: <000701c6909c$88bca480$15327e82@pyrimidine>

> I'm not convinced the blacklist is working - but we need to make sure
> it is enabled in the conf file on the server.  I've locked the
> blacklist page as well so that only sysops can edit it.  Iddo and
> Michiel are the main site admins right now, other people can be
> promoted by them or one of the main site admins if we know who you are.

Agreed.  I actually added the page as 'Help:BlackList' then redirected it to
'Help:Blacklist'; someone with admin privies can delete that redirect link
if they want.  My oops.  Like Jason says, probably doesn't make much of a
difference (the wiki version of the raindance, to ward off evil spammers).
 
> I've blocked the previous spammer's account.  You can easily revert
> changes by using the rollback button on the diff page.
> 
> The biopython community will have to decide how it wants to handle
> new accounts to the wiki site. Whether there is patrolling or if you
> want to lock the site down.  I would encourage all legitimate users
> to add something to their User page so that we can have an easier
> time distinguishing random account creation from real people.
> 
> -jason
> On Jun 15, 2006, at 12:13 PM, Mauricio Herrera Cuadra wrote:
> 
> > Hi Peter,
> >
> > We started to have the same problem in the BioPerl wiki some months
> > ago. The way we usually solve this is by blocking the user account
> > and rolling back to the previous version of the affected document.
> >
> > We have a list of wiki administrators who are constantly (and
> > independently) monitoring the recent changes in the site. This way
> > we can keep track of the changes and revert damages to the content:
> >
> > http://bioperl.org/wiki/BioPerl:Administrators
> > http://bioperl.org/wiki/Special:Recentchanges
> >
> > You can also keep track of the changes by using the RSS or Atom
> > feeds provided by the Recentchanges page:
> >
> > http://bioperl.org/w/index.php?title=Special:Recentchanges&feed=rss
> > http://bioperl.org/w/index.php?title=Special:Recentchanges&feed=atom
> >
> > The wiki system has memory of the blocked users and IP's, you can
> > have a look here:
> >
> > http://bioperl.org/wiki/Special:Ipblocklist
> >
> > There also exists a Blacklist, which is a complement to the main
> > Wikimedia's one and helps detect spam content before it goes into a
> > document:
> >
> > http://bioperl.org/wiki/Help:Blacklist
> > http://meta.wikimedia.org/wiki/Spam_blacklist
> >
> > I don't know who's in charge of BioPython's wiki but I hope this
> > info can be helpful to you.
> >
> > Regards,
> > Mauricio.
> >
> > Peter wrote:
> >> I've noticed someone has created an account "Ceas" on the wiki and
> >> has been inserting junk/spam links.  For example, look at the
> >> history of the main page:
> >> http://biopython.org/wiki/Biopython
> >> Who is in charge of the Wiki?  Can we
> >> (a) block this account (short term action)
> >> (b) tighten up rules for creating new accounts?
> >> Peter
> >
> > --
> > MAURICIO HERRERA CUADRA
> > arareko at campus.iztacala.unam.mx
> > Laboratorio de Genética
> > Unidad de Morfofisiología y Función
> > Facultad de Estudios Superiores Iztacala, UNAM
> >
> 
> --
> Jason Stajich
> Duke University
> http://www.duke.edu/~jes12


From manickam.muthuraman at wur.nl  Thu Jun 15 21:29:51 2006
From: manickam.muthuraman at wur.nl (Muthuraman, Manickam)
Date: Thu, 15 Jun 2006 23:29:51 +0200
Subject: [BioPython] parsing the blastoutput and printing the alingment
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>
	<4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>
	<44917656.6090602@maubp.freeserve.co.uk>
Message-ID: <4CDD243B32D07748944828EA7A29E4A3E2AFA5@salte0008.wurnet.nl>

Dear Peter

In this mail I am attaching three files: the seq file, the Python script file and the BLAST output. I am using Python 2.4.1 (#2, Aug 25 2005, 18:20:57) and Biopython 1.40.

I spent almost the whole evening trying to upgrade Python and Biopython on Mandriva Linux, but I failed.

Let me know whether the versions of Python and Biopython matter here.

Thanks for helping me out with this.
from
manickam

-----Original Message-----
From:	Peter [mailto:biopython at maubp.freeserve.co.uk]
Sent:	Thu 6/15/2006 5:01 PM
To:	Muthuraman, Manickam
Cc:	biopython at lists.open-bio.org
Subject:	Re: [BioPython] parsing the blastoutput and printing the alingment

Muthuraman, Manickam wrote:
> Dear peter
> 
> here is the code, my_blast.out and the error. My need is to get all the
> blast hit sequences in fasta format. By parsing, I can extract the
> accession number from it.

I made an example fasta file containing just this one sequence twice:

 >example1
VPKIDVSPLFGDDQAAKMRVAQQIDAASRDTGFFYAVNHGINVQRLSQKTKEFHMSITP
EEKWDLAIRAYNKEHQDQVRAGYYLSIPGKKAVESFCYLNPNFTPDHPRIQAKTPTHEV
NVWPDETKHPGFQDFAEQYYWDVFGLSSALLKGYALALGKEENFFARHFKPDDTLASVV
LIRYPYLDPYPEAAIKTAADGTKLSFEWHEDVSLITVLYQSNVQNLQVETAAGYQDIEA
DDTGYLINCGSYMAHLTNNYYKAPIHRVKWVNAERQSLPFFVNLGYDSVI
 >example2
VPKIDVSPLFGDDQAAKMRVAQQIDAASRDTGFFYAVNHGINVQRLSQKTKEFHMSITP
EEKWDLAIRAYNKEHQDQVRAGYYLSIPGKKAVESFCYLNPNFTPDHPRIQAKTPTHEV
NVWPDETKHPGFQDFAEQYYWDVFGLSSALLKGYALALGKEENFFARHFKPDDTLASVV
LIRYPYLDPYPEAAIKTAADGTKLSFEWHEDVSLITVLYQSNVQNLQVETAAGYQDIEA
DDTGYLINCGSYMAHLTNNYYKAPIHRVKWVNAERQSLPFFVNLGYDSVI

I then edited the filenames in your example, and ran the code.  It 
worked for me using a fresh install of BioPython 1.41 on Linux with 
Python 2.4.2

So the good news is your code seems fine.

Maybe there is something "funny" with your fasta file?  Accented 
characters for example - which would then be in the output XML file?

Could you send me the fasta file and the XML file (in full, as 
attachments), off the mailing list to avoid clogging up everyone's inboxes.

Thanks

Peter


-------------- next part --------------
A non-text attachment was scrubbed...
Name: winmail.dat
Type: application/ms-tnef
Size: 164714 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20060615/2f04b211/attachment-0002.bin>

From mdehoon at c2b2.columbia.edu  Thu Jun 15 22:37:18 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Thu, 15 Jun 2006 18:37:18 -0400
Subject: [BioPython] plain txt blast output - xml instead
In-Reply-To: <4491A93E.2020306@maubp.freeserve.co.uk>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>	<448FD25C.20101@maubp.freeserve.co.uk>	<d9fd76050606141122m2d104ee6gb2c84473182e388d@mail.gmail.com>	<449085AD.7010801@maubp.freeserve.co.uk>	<d9fd76050606150936r5f0e1fe1o89fd12c16f4361d1@mail.gmail.com>	<4491992A.5040301@maubp.freeserve.co.uk>
	<44919C57.7030204@c2b2.columbia.edu>
	<4491A93E.2020306@maubp.freeserve.co.uk>
Message-ID: <4491E11E.5020705@c2b2.columbia.edu>

Peter wrote:
> According to the XML file, it is from BLASTP 2.2.14 [May-07-2006], maybe 
> they changed the XML format without telling anyone?
> 
It appears that the XML format did change.
With Blastp 2.2.14, multiple searches generate multiple 
<Iteration>...</Iteration> blocks, one for each search.
With an older Blastp, multiple searches effectively generate multiple 
XML files (each with one <Iteration>...</Iteration> block). These files 
are then concatenated into one output file. Biopython then parses this 
file by looking for the beginning of each XML file in this output file.
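
As an illustration of that "look for the beginning of each XML file" 
approach, here is a minimal sketch in plain modern Python.  It is not 
Biopython's actual implementation; split_concatenated_xml is a 
hypothetical helper and the sample text is a cut-down stand-in for real 
BLAST output.

```python
import re


def split_concatenated_xml(text):
    """Split text holding several XML documents into one string each."""
    # Every document is assumed to start with an XML declaration.
    starts = [m.start() for m in re.finditer(r"<\?xml\b", text)]
    if not starts:
        return [text.strip()] if text.strip() else []
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]


# Two tiny documents concatenated, as the older output did per query.
sample = (
    '<?xml version="1.0"?>\n'
    "<BlastOutput><BlastOutput_query-def>P1</BlastOutput_query-def></BlastOutput>\n"
    '<?xml version="1.0"?>\n'
    "<BlastOutput><BlastOutput_query-def>P2</BlastOutput_query-def></BlastOutput>\n"
)
documents = split_concatenated_xml(sample)
print(len(documents))
```

Each returned string is a complete XML document and can be handed to an 
XML parser on its own.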

The new output is in a sense better because the output file is a valid 
XML file. It may be that Biopython's XML parser ignores the <Iteration> 
tags, since in the old format there was only one <Iteration> block 
anyway, and therefore fails with the new format.
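
For the new single-document layout, reading one result per <Iteration> 
block could look roughly like this.  It is a sketch using Python's 
standard xml.etree.ElementTree rather than Biopython's parser; 
query_names is a hypothetical helper, and the sample is heavily trimmed 
(the first block's contents are assumed by analogy with the second).

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for a BLAST XML file holding two queries.
sample = """<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_query-def>p1</Iteration_query-def>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>2</Iteration_iter-num>
      <Iteration_query-ID>2_20304</Iteration_query-ID>
      <Iteration_query-def>p2</Iteration_query-def>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def query_names(xml_text):
    """Return the query definition of every <Iteration> block, in order."""
    root = ET.fromstring(xml_text)
    return [
        iteration.findtext("Iteration_query-def")
        for iteration in root.iter("Iteration")
    ]


names = query_names(sample)
print(names)
```

A parser that treats each <Iteration> as its own record would report p1, 
p2 and so on, rather than the first query name for every record.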

--Michiel.

-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032


From biopython at maubp.freeserve.co.uk  Thu Jun 15 22:31:59 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jun 2006 23:31:59 +0100
Subject: [BioPython] parsing the blastoutput and printing the alingment
In-Reply-To: <4CDD243B32D07748944828EA7A29E4A3E2AFA5@salte0008.wurnet.nl>
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>	<44917656.6090602@maubp.freeserve.co.uk>
	<4CDD243B32D07748944828EA7A29E4A3E2AFA5@salte0008.wurnet.nl>
Message-ID: <4491DFDF.9070506@maubp.freeserve.co.uk>

Muthuraman, Manickam wrote:
> Dear Peter
> 
> In this mail I am attaching three files: the seq file, the Python
> script file and the BLAST output. I am using Python 2.4.1 (#2, Aug 25
> 2005, 18:20:57) and Biopython 1.40

Your attachment came as a weird winmail.dat file - something Outlook 
and the Microsoft Exchange Client sometimes do.  There is a Linux tool 
to "unzip" the file called tnef, which I installed on Ubuntu with a 
simple "apt-get install tnef"

Anyway, the problem is simply that your XML file has this little HTTP 
header at the start:

HTTP/1.1 200 OK
Date: Thu, 15 Jun 2006 21:23:08 GMT
Server: Nde
Content-Type: application/xml
Connection: close

If you edit the file to remove this, then BioPython can read the file fine.
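
If you would rather not edit the file by hand, the header could be 
dropped in code.  This is only a sketch in plain modern Python, assuming 
the XML always begins with an "<?xml" declaration; strip_http_header is a 
hypothetical helper, not part of BioPython.

```python
def strip_http_header(text):
    """Return text starting at the XML declaration, dropping what precedes it."""
    start = text.find("<?xml")
    if start == -1:
        raise ValueError("no XML declaration found")
    return text[start:]


# The header from the file above, followed by a stub XML body.
raw = (
    "HTTP/1.1 200 OK\n"
    "Date: Thu, 15 Jun 2006 21:23:08 GMT\n"
    "Server: Nde\n"
    "Content-Type: application/xml\n"
    "Connection: close\n"
    "\n"
    '<?xml version="1.0"?>\n'
    "<BlastOutput></BlastOutput>\n"
)
cleaned = strip_http_header(raw)
print(cleaned)
```

The cleaned text can then be written back to disk, or wrapped in a file-like 
object, before being given to the parser.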

Looking over my old email, Michiel de Hoon checked in a fix from 
Alexander Morgan for this in March.  You need to update this file:

/usr/lib/python2.4/site-packages/Bio/Blast/NCBIWWW.py

Latest code is available here:

http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIWWW.py?cvsroot=biopython

It also gets rid of this annoying message:

UserWarning: qblast works only with blastn and blastp for now.

Peter


From manickam.muthuraman at wur.nl  Fri Jun 16 14:27:00 2006
From: manickam.muthuraman at wur.nl (Muthuraman, Manickam)
Date: Fri, 16 Jun 2006 16:27:00 +0200
Subject: [BioPython] Running Blast locally
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>	<44917656.6090602@maubp.freeserve.co.uk>
	<4CDD243B32D07748944828EA7A29E4A3E2AFA5@salte0008.wurnet.nl>
	<4491DFDF.9070506@maubp.freeserve.co.uk>
	<4CDD243B32D07748944828EA7A29E4A3E2AFA9@salte0008.wurnet.nl>
Message-ID: <4CDD243B32D07748944828EA7A29E4A3E2AFAB@salte0008.wurnet.nl>


Dear peter

In the last mail I said that b_record is None, so I tried to run blastall on my local computer and it works now.

Here is the command:
./blastall -d db/swissprot -i /home/manickam/Documents/m_cold.fasta -p blastp 
and I am getting the result. So let me know if I need to put this command in a string and pass this string (example: my_blast_exe). I still want to know how to pass the input file (my_blast_file).

I think I have confused myself.
Let me know your view on this.
from
manickam


From winter at biotec.tu-dresden.de  Fri Jun 16 14:35:56 2006
From: winter at biotec.tu-dresden.de (Christof Winter)
Date: Fri, 16 Jun 2006 16:35:56 +0200
Subject: [BioPython] Running Blast locally
In-Reply-To: <4CDD243B32D07748944828EA7A29E4A3E2AFAB@salte0008.wurnet.nl>
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>	<44917656.6090602@maubp.freeserve.co.uk>	<4CDD243B32D07748944828EA7A29E4A3E2AFA5@salte0008.wurnet.nl>	<4491DFDF.9070506@maubp.freeserve.co.uk>	<4CDD243B32D07748944828EA7A29E4A3E2AFA9@salte0008.wurnet.nl>
	<4CDD243B32D07748944828EA7A29E4A3E2AFAB@salte0008.wurnet.nl>
Message-ID: <4492C1CC.4020607@biotec.tu-dresden.de>

Dear Manickam,

Can you try
blastall -V T -d db/swissprot -i /home/manickam/Documents/m_cold.fasta -p blastp

instead?

Christof


Muthuraman, Manickam wrote:
> Dear peter
> 
> In the last mail i said that b_record is none , so i tried to run the blastall in my local computer and it works right now.
> 
> here is the command :
> ./blastall -d db/swissprot -i /home/manickam/Documents/m_cold.fasta -p blastp 
>  and i am getting the result. so let me know if i need to put this command in string and pass this string (example:my_blast_exe). Still i want to know how to pass the input file(my_blast_file).
> 
> i think i confuse myself
> let me know your view for this
> from
> manickam
> 
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

-- 
Christof Winter
Bioinformatics Group
TU Dresden
Tatzberg 47-51
01307 Dresden, Germany


From manickam.muthuraman at wur.nl  Fri Jun 16 14:52:15 2006
From: manickam.muthuraman at wur.nl (Muthuraman, Manickam)
Date: Fri, 16 Jun 2006 16:52:15 +0200
Subject: [BioPython] Running Blast locally
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>	<44917656.6090602@maubp.freeserve.co.uk>	<4CDD243B32D07748944828EA7A29E4A3E2AFA5@salte0008.wurnet.nl>	<4491DFDF.9070506@maubp.freeserve.co.uk>	<4CDD243B32D07748944828EA7A29E4A3E2AFA9@salte0008.wurnet.nl>
	<4CDD243B32D07748944828EA7A29E4A3E2AFAB@salte0008.wurnet.nl>
	<4492C1CC.4020607@biotec.tu-dresden.de>
Message-ID: <4CDD243B32D07748944828EA7A29E4A3E2AFAC@salte0008.wurnet.nl>

Dear Christof

Your command also works on its own, but my question was how to integrate BLAST into a BioPython script.

The BioPython Tutorial and Cookbook has the following code, where I need to provide the paths to the database, the file to BLAST, and the BLAST executable.

I am not clear on how to set the paths for the sequence file, database, and executable.

import os
my_blast_db=os.path.join(os.getcwd(),'at-est','a-cds-10-7.fasta')
my_blast_file=os.path.join(os.getcwd(),'at-est','test_blast','sorghum_est-test.fasta')
my_blast_exe=os.path.join(os.getcwd(),'blast','/home/manickam/blast/blastall')


here is the whole script
import os
my_blast_db=os.path.join(os.getcwd(),'at-est','a-cds-10-7.fasta')
my_blast_file=os.path.join(os.getcwd(),'at-est','test_blast','sorghum_est-test.fasta')
my_blast_exe=os.path.join(os.getcwd(),'blast','/home/manickam/blast/blastall')
from Bio.Blast import NCBIStandalone
blast_out,error_info=NCBIStandalone.blastall(my_blast_exe,'blastp',my_blast_db,my_blast_file)
b_parser=NCBIStandalone.BlastParser()
b_iterator=NCBIStandalone.Iterator(blast_out,b_parser)
b_record=b_iterator.next()
while 1:
    b_record=b_iterator.next()
    if b_record is None:
        break
    for alignment in b_record.alignments:
        print "inside 2 loop"
        for hsp in alignment.hsps:
            print "inside 1 loop"
            print 'seq:',alignment.title

It runs, but b_record is None, so it leaves the while loop on the very first pass. That means I am not getting any output from BLAST.

from
manickam


From manickam.muthuraman at wur.nl  Fri Jun 16 08:42:08 2006
From: manickam.muthuraman at wur.nl (Muthuraman, Manickam)
Date: Fri, 16 Jun 2006 10:42:08 +0200
Subject: [BioPython] parsing the blastoutput and printing the alingment
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>	<44917656.6090602@maubp.freeserve.co.uk>
	<4CDD243B32D07748944828EA7A29E4A3E2AFA5@salte0008.wurnet.nl>
	<4491DFDF.9070506@maubp.freeserve.co.uk>
Message-ID: <4CDD243B32D07748944828EA7A29E4A3E2AFA7@salte0008.wurnet.nl>


Thanks peter

After overwriting my NCBIWWW.py file with the updated version, my script works.
Once again, thank you.

from
manickam

-----Original Message-----
From:	Peter [mailto:biopython at maubp.freeserve.co.uk]
Sent:	Fri 6/16/2006 12:31 AM
To:	Muthuraman, Manickam
Cc:	biopython at lists.open-bio.org
Subject:	Re: [BioPython] parsing the blastoutput and printing the alingment
Muthuraman, Manickam wrote:
> Dear Peter
> 
> In this mail i am attaching three files :seq file,python script file
> and the blast output. I am using python Python 2.4.1 (#2, Aug 25
> 2005, 18:20:57)and biopython 1.40

Your attachment came as a weird winmail.dat file - something  Outlook 
and the Microsoft Exchange Client sometimes does.  There is a Linux tool 
to "unzip" the file called tnef, which I installed on Ubuntu with a 
simple "apt-get install tnef"

Anyway, the problem is simply that your XML file has this little HTTP 
header at the start:

HTTP/1.1 200 OK
Date: Thu, 15 Jun 2006 21:23:08 GMT
Server: Nde
Content-Type: application/xml
Connection: close

If you edit the file to remove this, BioPython can read the file fine.

Looking over my old email, Michiel de Hoon checked in a fix from 
Alexander Morgan for this in March.  You need to update this file:

/usr/lib/python2.4/site-packages/Bio/Blast/NCBIWWW.py

Latest code is available here:

http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIWWW.py?cvsroot=biopython

It also gets rid of this annoying message:

UserWarning: qblast works only with blastn and blastp for now.

Peter



From manickam.muthuraman at wur.nl  Fri Jun 16 13:12:08 2006
From: manickam.muthuraman at wur.nl (Muthuraman, Manickam)
Date: Fri, 16 Jun 2006 15:12:08 +0200
Subject: [BioPython] Running Blast locally
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>	<44917656.6090602@maubp.freeserve.co.uk>
	<4CDD243B32D07748944828EA7A29E4A3E2AFA5@salte0008.wurnet.nl>
	<4491DFDF.9070506@maubp.freeserve.co.uk>
Message-ID: <4CDD243B32D07748944828EA7A29E4A3E2AFA9@salte0008.wurnet.nl>


Dear Peter

I am not clear about the section on running BLAST locally.

Let me explain in detail.
I have the BLAST executables in my home directory, i.e.
/home/manickam/blast/blastall

I have my database files for nr, swissprot and pdb in /usr/junk/

The files I can see under the /usr/junk/ folder include:
nr.00.phr
nr.00.ppi
nr.01.phr
nr.01.ppi
nr.pal
pdbaa.00.msk

There are many more, with the extensions .phr, .ppi, .pal, .msk and .psq.

It is not clear to me from the manual where I need to provide the input sequences and how to store the output after running BLAST locally.

Below is the code I tried. It runs, but b_record is None.

import os
my_blast_db=os.path.join(os.getcwd(),'at-est','a-cds-10-7.fasta')
my_blast_file=os.path.join(os.getcwd(),'at-est','test_blast','sorghum_est-test.fasta')
my_blast_exe=os.path.join(os.getcwd(),'blast','/home/manickam/blast/blastall')
from Bio.Blast import NCBIStandalone
blast_out,error_info=NCBIStandalone.blastall(my_blast_exe,'blastp',my_blast_db,my_blast_file)
b_parser=NCBIStandalone.BlastParser()
b_iterator=NCBIStandalone.Iterator(blast_out,b_parser)
b_record=b_iterator.next()
while 1:
    b_record=b_iterator.next()
    if b_record is None:
        break
    for alignment in b_record.alignments:
        for hsp in alignment.hsps:
            print 'seq:',alignment.title

from
manickam


From biopython at maubp.freeserve.co.uk  Fri Jun 16 15:53:31 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 Jun 2006 16:53:31 +0100
Subject: [BioPython] Running Blast locally
In-Reply-To: <4CDD243B32D07748944828EA7A29E4A3E2AFAC@salte0008.wurnet.nl>
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>	<44917656.6090602@maubp.freeserve.co.uk>	<4CDD243B32D07748944828EA7A29E4A3E2AFA5@salte0008.wurnet.nl>	<4491DFDF.9070506@maubp.freeserve.co.uk>	<4CDD243B32D07748944828EA7A29E4A3E2AFA9@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AFAB@salte0008.wurnet.nl>	<4492C1CC.4020607@biotec.tu-dresden.de>
	<4CDD243B32D07748944828EA7A29E4A3E2AFAC@salte0008.wurnet.nl>
Message-ID: <4492D3FB.1040706@maubp.freeserve.co.uk>

Muthuraman, Manickam wrote:
> Dear Christof
> 
> Your command also works on its own, but my question was how to integrate BLAST into a BioPython script.
> 
> The BioPython Tutorial and Cookbook has the following code, where I need to provide the paths to the database, the file to BLAST, and the BLAST executable.
> 
> I am not clear on how to set the paths for the sequence file, database, and executable.
> 
> import os
> my_blast_db=os.path.join(os.getcwd(),'at-est','a-cds-10-7.fasta')
> my_blast_file=os.path.join(os.getcwd(),'at-est','test_blast','sorghum_est-test.fasta')
> my_blast_exe=os.path.join(os.getcwd(),'blast','/home/manickam/blast/blastall')

Try typing this at the python prompt:

import os
help(os.path.join)

Are you familiar with relative paths etc?  You might find something like 
this easier to understand:

my_blast_db   = '/home/manickam/db/at-est/a-cds-10-7.fasta'
my_blast_file = '/home/manickam/sorghum_est-test.fasta'
my_blast_exe  = '/home/manickam/blast/blastall'

Or, based on your previous email, you were using:

 > here is the command :
 > ./blastall -d db/swissprot -i /home/manickam/Documents/m_cold.fasta
 > -p blastp

Maybe something like this:

my_blast_db   = '/home/manickam/blast/db/swissprot'
my_blast_file = '/home/manickam/Documents/m_cold.fasta'
my_blast_exe  = '/home/manickam/blast/blastall'

It all depends on where you installed the blast program, where you put 
the blast databases, and where you are going to have your input file.
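
To make the os.path.join behaviour concrete, here is a small sketch. It uses posixpath (the Unix flavour of os.path) so that it gives the same answer on any platform, and the working directory /home/manickam/work is just a made-up example. The point is that as soon as one argument is an absolute path, everything before it is thrown away.

```python
import posixpath

# Hypothetical working directory, standing in for os.getcwd().
cwd = "/home/manickam/work"

# The absolute last argument discards 'cwd' and 'blast'.
joined = posixpath.join(cwd, "blast", "/home/manickam/blast/blastall")

# With only relative pieces, everything is appended to cwd.
relative = posixpath.join(cwd, "at-est", "a-cds-10-7.fasta")

print(joined)
print(relative)
```

So the my_blast_exe line in the script happens to give the right answer, while the database and query paths only exist if those folders really sit under the current directory.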

> here is the whole script
01> import os
02> my_blast_db=os.path.join(os.getcwd(),'at-est','a-cds-10-7.fasta')
03> my_blast_file=os.path.join(os.getcwd(),'at-est','test_blast','sorghum_est-test.fasta')
04> my_blast_exe=os.path.join(os.getcwd(),'blast','/home/manickam/blast/blastall')
05> from Bio.Blast import NCBIStandalone
06> blast_out,error_info=NCBIStandalone.blastall(my_blast_exe,'blastp',my_blast_db,my_blast_file)

At this point, some example scripts will save the output to a file, and 
then reload it and carry on.  This is very helpful if you have problems 
because you can open the file by hand and look at it.
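
A minimal sketch of that save-and-reload step, assuming only that the BLAST output is a file-like object with a read() method. The helper save_and_reopen and the io.StringIO stand-in are invented for illustration, so this runs without BLAST installed.

```python
import io
import os
import tempfile


def save_and_reopen(handle, path):
    """Write everything from handle to path, then return a fresh read handle."""
    with open(path, "w") as out:
        out.write(handle.read())
    return open(path)


# Stand-in for the blast_out handle; real BLAST output would go here.
fake_blast_out = io.StringIO("BLASTP 2.2.14\nQuery= example\n")

out_path = os.path.join(tempfile.mkdtemp(), "blast_output.txt")
with save_and_reopen(fake_blast_out, out_path) as reloaded:
    first_line = reloaded.readline().rstrip()

print(first_line)
```

The reopened handle can be passed to the iterator in place of blast_out, and the saved file stays on disk for inspection.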

07> b_parser=NCBIStandalone.BlastParser()
08> b_iterator=NCBIStandalone.Iterator(blast_out,b_parser)
09> b_record=b_iterator.next()
10> while 1:
11>     b_record=b_iterator.next()
12>     if b_record is None:
13>         break
14>     for alignment in b_record.alignments:
15>         print "inside 2 loop"
16>         for hsp in alignment.hsps:
17>             print "inside 1 loop"
18>             print 'seq:',alignment.title
> 
> It runs, but b_record is None, so it leaves the while loop on the very first pass. That means I am not getting any output from BLAST.

Notice that at line 9, you set b_record to the first set of results 
(i.e. from the first sequence in your FASTA file).

Then, inside the loop, at line 11 you set b_record to the SECOND set of 
results and try to look at it, so the first set is never examined.

I suggest you comment out line 9, and it should work better.
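
The effect is easy to reproduce without BLAST. The sketch below uses a made-up RecordIterator class that mimics the relevant behaviour (next() returns None when there is nothing left); it is not the NCBIStandalone iterator itself.

```python
class RecordIterator:
    """Stand-in for an old-style iterator whose next() returns None at the end."""

    def __init__(self, records):
        self._records = list(records)
        self._pos = 0

    def next(self):
        if self._pos >= len(self._records):
            return None
        record = self._records[self._pos]
        self._pos += 1
        return record


def collect(iterator, prefetch):
    """Gather the records the loop body would actually see."""
    seen = []
    if prefetch:
        iterator.next()  # like line 9: takes the first record and discards it
    while True:
        record = iterator.next()  # like line 11
        if record is None:
            break
        seen.append(record)
    return seen


with_prefetch = collect(RecordIterator(["query1"]), prefetch=True)
without_prefetch = collect(RecordIterator(["query1"]), prefetch=False)
print(with_prefetch, without_prefetch)
```

With a single query in the input file, the extra call before the loop uses up the only record, so the loop exits immediately.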

Finally, this code is using the "plain text" blast output, which can 
sometimes cause BioPython trouble.  I would recommend the XML parser but 
as you might know from the mailing list, it looks like they have changed 
the file format for multiple results in XML output...

Peter


From biopython at maubp.freeserve.co.uk  Fri Jun 16 16:06:14 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 Jun 2006 17:06:14 +0100
Subject: [BioPython] Running Blast locally
In-Reply-To: <4CDD243B32D07748944828EA7A29E4A3E2AFA9@salte0008.wurnet.nl>
References: <4CDD243B32D07748944828EA7A29E4A3E2AF9B@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9D@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AF9E@salte0008.wurnet.nl>	<4CDD243B32D07748944828EA7A29E4A3E2AFA1@salte0008.wurnet.nl>	<44917656.6090602@maubp.freeserve.co.uk>	<4CDD243B32D07748944828EA7A29E4A3E2AFA5@salte0008.wurnet.nl>	<4491DFDF.9070506@maubp.freeserve.co.uk>
	<4CDD243B32D07748944828EA7A29E4A3E2AFA9@salte0008.wurnet.nl>
Message-ID: <4492D6F6.5060100@maubp.freeserve.co.uk>

I didn't see this email - they arrived out of order at my computer. 
Please also read my longer reply...

Muthuraman, Manickam wrote:
> i have blast executable files in my home directory i.e
> /home/manickam/blast/blastall

Then use this:

my_blast_exe='/home/manickam/blast/blastall'

> i have my database files of nr,swissprot,pdb in /usr/junk/
> 
> the files which i can see under /usr/junk/        folder are 
> nr.00.phr    
>  nr.00.ppi  
> nr.01.phr     
> nr.01.ppi  nr.pal        
> pdbaa.00.msk
> 
> lot in there and there extenstions are *.phr , ppi ,pal,msk,psq

I think you should use one of these, but I haven't checked this:

my_blast_db='/usr/junk/nr'
my_blast_db='/usr/junk/swissprot'
my_blast_db='/usr/junk/pdb'

> i am not clear from the manual where do i need to provide the input sequences

The input fasta file can be anywhere - you just have to tell Blast where 
it is.  e.g.

my_blast_file='/home/manickam/Documents/m_cold.fasta'


 > and how to i store the out put after running the local blast.

If you run blast "by hand" at the command prompt, use the option -o 
outputfilename (that is a lower case letter o, not zero, not uppercase).

You can also use Python to write the results to a file.
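
As a rough sketch of the first approach driven from Python, the command line can be built as a list and handed to the subprocess module. The helper names and the output filename m_cold_blast.txt are made up for illustration; the paths are the ones from this thread and may need adjusting. Only the command is printed here, and the function that actually runs it is defined but not called.

```python
import subprocess


def build_blastall_command(exe, program, database, query, output):
    """Assemble a blastall command line that writes its results to 'output'."""
    return [exe, "-p", program, "-d", database, "-i", query, "-o", output]


def run_blastall(command):
    """Run the command; needs BLAST installed at the given path."""
    subprocess.run(command, check=True)


command = build_blastall_command(
    "/home/manickam/blast/blastall",
    "blastp",
    "/usr/junk/swissprot",
    "/home/manickam/Documents/m_cold.fasta",
    "m_cold_blast.txt",  # hypothetical output filename
)
print(" ".join(command))
```

Calling run_blastall(command) would then leave the results in m_cold_blast.txt, ready to be opened and parsed.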

> below is the following code which i tried and it works but b_record is none.

See my other email

Peter


From gvwilson at cs.utoronto.ca  Sun Jun 18 18:15:04 2006
From: gvwilson at cs.utoronto.ca (Greg Wilson)
Date: Sun, 18 Jun 2006 14:15:04 -0400
Subject: [BioPython] ann: open source course on basic software development
	skills
Message-ID: <e7456q$pq2$11@sea.gmane.org>

http://www.third-bit.com/swc is an open source course on basic software
development skills, aimed primarily at people with backgrounds in
science, engineering, and medicine who have little formal training in
programming, but find themselves doing a lot of it.  The course was
developed in part through support from the Python Software Foundation;
all of the material can be used and modified free of charge (but with
attribution).  If you have questions, would like to contribute material,
or have a success story you'd like to share, please contact Greg Wilson
(gvwilson at cs.utoronto.ca).

Thanks,
Greg


From rohini.damle at gmail.com  Mon Jun 19 23:36:36 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Mon, 19 Jun 2006 16:36:36 -0700
Subject: [BioPython] plain txt blast output - xml instead
In-Reply-To: <4491E11E.5020705@c2b2.columbia.edu>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>
	<448FD25C.20101@maubp.freeserve.co.uk>
	<d9fd76050606141122m2d104ee6gb2c84473182e388d@mail.gmail.com>
	<449085AD.7010801@maubp.freeserve.co.uk>
	<d9fd76050606150936r5f0e1fe1o89fd12c16f4361d1@mail.gmail.com>
	<4491992A.5040301@maubp.freeserve.co.uk>
	<44919C57.7030204@c2b2.columbia.edu>
	<4491A93E.2020306@maubp.freeserve.co.uk>
	<4491E11E.5020705@c2b2.columbia.edu>
Message-ID: <d9fd76050606191636s7246b7e4va89754200ffd2eb1@mail.gmail.com>

So what does one need to do to get Biopython working?  Change the
XML parser so that it treats each iteration as one result output?
-Rohini


On 6/15/06, Michiel Jan Laurens de Hoon <mdehoon at c2b2.columbia.edu> wrote:
>
> Peter wrote:
> > According to the XML file, it is from BLASTP 2.2.14 [May-07-2006], maybe
> > they changed the XML format without telling anyone?
> >
> It appears that the XML format did change.
> With Blastp 2.2.14, multiple searches generate multiple
> <Iteration>...</Iteration> blocks, one for each search.
> With an older Blastp, multiple searches effectively generate multiple
> XML files (each with one <Iteration>...</Iteration> block). These files
> are then concatenated into one output file. Biopython then parses this
> file by looking for the beginning of each XML file in this output file.
>
> The new output is in a sense better because the output file is a valid
> XML file. It may be that Biopython's XML parser ignores the <Iteration>
> tags, since in the old format there was only one <Iteration> block
> anyway, and therefore fails with the new format.
>
> --Michiel.
>
> --
> Michiel de Hoon
> Center for Computational Biology and Bioinformatics
> Columbia University
> 1130 St Nicholas Avenue
> New York, NY 10032
>


From biopython at maubp.freeserve.co.uk  Tue Jun 20 13:52:48 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 20 Jun 2006 14:52:48 +0100
Subject: [BioPython] plain txt blast output - xml instead
In-Reply-To: <d9fd76050606191636s7246b7e4va89754200ffd2eb1@mail.gmail.com>
References: <d9fd76050606131209n104adb91mf1dc80090f9d209b@mail.gmail.com>	<448FD25C.20101@maubp.freeserve.co.uk>	<d9fd76050606141122m2d104ee6gb2c84473182e388d@mail.gmail.com>	<449085AD.7010801@maubp.freeserve.co.uk>	<d9fd76050606150936r5f0e1fe1o89fd12c16f4361d1@mail.gmail.com>	<4491992A.5040301@maubp.freeserve.co.uk>	<44919C57.7030204@c2b2.columbia.edu>	<4491A93E.2020306@maubp.freeserve.co.uk>	<4491E11E.5020705@c2b2.columbia.edu>
	<d9fd76050606191636s7246b7e4va89754200ffd2eb1@mail.gmail.com>
Message-ID: <4497FDB0.1000903@maubp.freeserve.co.uk>

Peter wrote:
>>> According to the XML file, it is from BLASTP 2.2.14 [May-07-2006], 
 >>> maybe they changed the XML format without telling anyone?

Michiel wrote:
>>It appears that the XML format did change.
>>With Blastp 2.2.14, multiple searches generate multiple
>><Iteration>...</Iteration> blocks, one for each search.
>>With an older Blastp, multiple searches effectively generate multiple
>>XML files (each with one <Iteration>...</Iteration> block). These files
>>are then concatenated into one output file. Biopython then parses this
>>file by looking for the beginning of each XML file in this output file.
>>
>>The new output is in a sense better because the output file is a valid
>>XML file. It may be that Biopython's XML parser ignores the <Iteration>
>>tags, since in the old format there was only one <Iteration> block
>>anyway, and therefore fails with the new format.

Rohini Damle wrote:
 > So what do one need to do to make biopython working?  Make changes in
 > the XML parser so that it will consider one iteration for one result
 > output?

Basically, yes, we need to change the BioPython NCBI Blast XML code 
somehow - this might be best moved to the development mailing list.

Some relevant but probably slightly out of date documentation:

ftp://ftp.ncbi.nlm.nih.gov/blast/documents/xml/README.blxml

Notice this appears to describe the <Iteration>...</Iteration> block as 
follows:

BlastOutput_iter-num: the psi-blast iteration number (optional)

So whatever we do, we should have a look at the psi-blast output as well...

One idea I was thinking about is to modify the existing Blast XML parser 
to specify WHICH iteration number it should parse (ignoring the rest). 
An invalid iteration number would raise a new exception.

Then, a new Blast XML iterator would call the parser repeatedly 
incrementing the iteration number until the "invalid iteration number" 
error was raised, which would signal the end.
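
Very roughly, and using the standard library's ElementTree rather than the existing BioPython parser, the idea might look like the sketch below. The names InvalidIterationError, parse_iteration and iterate_records are invented for illustration, iterations are picked by position rather than by their iter-num field, and the element names (Iteration, Hit, Hit_def) are taken on trust from the XML format.

```python
import xml.etree.ElementTree as ET


class InvalidIterationError(Exception):
    """Raised when the requested iteration number is not in the file."""


def parse_iteration(xml_text, number):
    """Return the hit definitions for the number-th <Iteration> (1-based)."""
    iterations = ET.fromstring(xml_text).findall(".//Iteration")
    if number < 1 or number > len(iterations):
        raise InvalidIterationError(number)
    return [hit.findtext("Hit_def") for hit in iterations[number - 1].iter("Hit")]


def iterate_records(xml_text):
    """Call the parser with 1, 2, 3, ... until it reports an invalid number."""
    number = 1
    while True:
        try:
            yield parse_iteration(xml_text, number)
        except InvalidIterationError:
            return
        number += 1


SAMPLE = (
    "<BlastOutput><BlastOutput_iterations>"
    "<Iteration><Iteration_hits><Hit><Hit_def>hitA</Hit_def></Hit></Iteration_hits></Iteration>"
    "<Iteration><Iteration_hits><Hit><Hit_def>hitB</Hit_def></Hit>"
    "<Hit><Hit_def>hitC</Hit_def></Hit></Iteration_hits></Iteration>"
    "</BlastOutput_iterations></BlastOutput>"
)

records = list(iterate_records(SAMPLE))
print(records)
```

Note that this version re-reads the whole document for every iteration, which is simple but wasteful for large files.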

Note that with the "old style concatenated XML entries" we could parse 
each entry one by one, without having to load the entire XML file into 
memory at once.  I don't think that will be possible with the new style 
XML files.
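
That said, an event-driven parser might get around the memory problem. The sketch below is an untested-against-real-data idea using ElementTree's iterparse from the standard library, not anything BioPython currently does: it yields each <Iteration> block as soon as its closing tag is seen and then clears it. The function name iter_iterations is made up, and the element names (Iteration_query-def, Hit_def) are assumptions about the new format.

```python
import io
import xml.etree.ElementTree as ET


def iter_iterations(handle):
    """Yield (query_def, [hit_def, ...]) for each <Iteration>, one at a time."""
    for event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Iteration":
            query = elem.findtext("Iteration_query-def")
            hits = [hit.findtext("Hit_def") for hit in elem.iter("Hit")]
            yield query, hits
            elem.clear()  # drop the children of the block we just finished


SAMPLE = (
    "<BlastOutput><BlastOutput_iterations>"
    "<Iteration><Iteration_query-def>query1</Iteration_query-def>"
    "<Iteration_hits><Hit><Hit_def>hitA</Hit_def></Hit></Iteration_hits></Iteration>"
    "<Iteration><Iteration_query-def>query2</Iteration_query-def>"
    "<Iteration_hits><Hit><Hit_def>hitB</Hit_def></Hit>"
    "<Hit><Hit_def>hitC</Hit_def></Hit></Iteration_hits></Iteration>"
    "</BlastOutput_iterations></BlastOutput>"
)

results = list(iter_iterations(io.BytesIO(SAMPLE.encode("utf-8"))))
print(results)
```

Whether this keeps memory use low enough on very large result files would need checking with real BLAST output.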

Peter


From biopython at maubp.freeserve.co.uk  Wed Jun 21 14:27:06 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 Jun 2006 15:27:06 +0100
Subject: [BioPython] docs have moved on the website
Message-ID: <4499573A.5060409@maubp.freeserve.co.uk>

I don't know if anyone has noticed this, but for example this:

http://www.biopython.org/docs/cookbook/genbank_to_fasta.html

Has moved to here:

http://www.biopython.org/DIST/docs/cookbook/genbank_to_fasta.html

Is it too late to revert to the old position?

If it is, to preserve any old links from external sites (and also to 
save google and other search engines having to update their indexes) 
maybe the website could automatically forward queries for:

http://www.biopython.org/docs/*

to:

http://www.biopython.org/DIST/docs/*

Good idea?  Bad idea?

Peter


From rohini.damle at gmail.com  Wed Jun 21 19:06:29 2006
From: rohini.damle at gmail.com (Rohini Damle)
Date: Wed, 21 Jun 2006 12:06:29 -0700
Subject: [BioPython] Biopython's XMl parser fails with NCBI blast changed
	XML output format
Message-ID: <d9fd76050606211206pa104f7dwdebfcb05dcab09d2@mail.gmail.com>

Hi,
I am trying to parse BLAST output (XML format, from NCBI's online
BLAST) that I obtained from 'short nearly exact matches' searches for my
50-55 short protein sequences.
It looks like the XML format has changed and Biopython's XML parser fails to
parse the BLAST records.
Can somebody suggest a way to fix this?
Thank you
Rohini Damle


From biopython at maubp.freeserve.co.uk  Sun Jun 25 21:37:53 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 25 Jun 2006 22:37:53 +0100
Subject: [BioPython] Distance Matrix Parsers
In-Reply-To: <CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>	<448A9A7A.6050501@maubp.freeserve.co.uk>
	<CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>
Message-ID: <449F0231.2050308@maubp.freeserve.co.uk>

[Off topic, but has anyone else recently had valid messages bounced due 
to a "suspicious header"?]

Hello List,

I recently wanted to load a "PHYLIP distance matrix file" created by
clustalw for my own research...

As discussed earlier, clustalw bends the official PHYLIP specification
by not truncating long names to 10 characters.  For my dataset I need
the long names to avoid ambiguity.

The attached code implements a fairly simple distance matrix class and
associated code to read (parse) and write PHYLIP style distance matrices.

There are options to control strict 10 character name truncation, and
the separator character(s) when writing files.
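
To give a flavour of the reading side, here is a much reduced sketch. It is not the attached module. The function name parse_phylip_distances and the strict_names flag are made up for this illustration, and it assumes a first line holding the number of entries followed by one line per entry (name, then distances).

```python
def parse_phylip_distances(text, strict_names=False):
    """Return (names, lower) where lower[i] holds the distances to entries 0..i-1."""
    lines = [line for line in text.splitlines() if line.strip()]
    count = int(lines[0].split()[0])
    names = []
    lower = []
    for line in lines[1:count + 1]:
        if strict_names:
            # Official layout: the name occupies exactly the first 10 columns.
            name, rest = line[:10].strip(), line[10:]
        else:
            # clustalw style: the name runs up to the first whitespace.
            parts = line.split(None, 1)
            name = parts[0]
            rest = parts[1] if len(parts) > 1 else ""
        values = [float(field) for field in rest.split()]
        lower.append(values[:len(names)])  # keep only the part below the diagonal
        names.append(name)
    return names, lower


relaxed_text = (
    "3\n"
    "alpha_long_name 0.0 0.1 0.2\n"
    "beta 0.1 0.0 0.5\n"
    "gamma 0.2 0.5 0.0\n"
)
names, lower = parse_phylip_distances(relaxed_text)
print(names, lower)
```

In strict mode a name longer than 10 characters would spill into the numbers and fail to parse, which is exactly the clustalw problem described above.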

Internally, I store the distances as a list of lists (of different
lengths) to mimic a lower triangular matrix.

For example, this matrix:

[[0.0, 0.1, 0.2],
   [0.1, 0.0, 0.5],
   [0.2, 0.5, 0.0]]

Is stored as this:

[[], [0.1], [0.2, 0.5]]

This may not be the best way to do this in terms of speed and memory usage.
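
Looking up a value in that layout only needs the two indices put in the right order. A minimal sketch (the function name get_distance is made up here, and the diagonal is assumed to be zero):

```python
def get_distance(rows, i, j):
    """Distance between entries i and j in a list-of-lists lower triangle."""
    if i == j:
        return 0.0  # the diagonal is not stored
    if i < j:
        i, j = j, i  # the matrix is symmetric, so always read below the diagonal
    return rows[i][j]


rows = [[], [0.1], [0.2, 0.5]]

# Rebuild the full symmetric matrix from the compact form.
full = [[get_distance(rows, i, j) for j in range(3)] for i in range(3)]
print(full)
```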

There are some simple test cases included, but I have not pushed the code
very far and there may be problems.  Anyway - in case anyone is
interested either in the short term, or for ideas for how BioPython
could support these files - here it is.

I'm sure someone more familiar with arrays (Numeric and NumPy) would be
able to make the class act more like an array - but the basics are there.

As far as I could see, neither Numeric nor NumPy has a specific
symmetric matrix / symmetric array class, which would be ideal.

Members of the list are welcome to use the code, but please contact me
before re-distributing it to anyone else.

Peter

-------------- next part --------------
A non-text attachment was scrubbed...
Name: phylip_dst.py
Type: text/x-python
Size: 16528 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20060625/8d20b314/attachment-0002.py>

From chris.lasher at gmail.com  Tue Jun 27 21:34:37 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Tue, 27 Jun 2006 17:34:37 -0400
Subject: [BioPython] Distance Matrix Parsers
In-Reply-To: <449F0231.2050308@maubp.freeserve.co.uk>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<448A9A7A.6050501@maubp.freeserve.co.uk>
	<CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>
	<449F0231.2050308@maubp.freeserve.co.uk>
Message-ID: <128a885f0606271434v4d5a40e9x1ceb0037d750f6a1@mail.gmail.com>

Hi Peter,

Would you be up for licensing your code under the BioPython license?
If not, I shouldn't  look at it, as I've started coding my own module
for the project. From your description, your module sounds very good.
=-)

Chris

On 6/25/06, Peter <biopython at maubp.freeserve.co.uk> wrote:
> [Off topic, but recently has anyone else get valid messages bounced due
> to a "suspicious header"?]
>
> Hello List,
>
> I recently wanted to load a "PHYLIP distance matrix file" created by
> clustalw for my own research...
>
> As discussed earlier, clustalw bends the official PHYLIP specification
> by not truncating long names to 10 characters.  For my dataset I need
> the long names to avoid ambiguity.
>
> The attached code implements a fairly simple distance matrix class and
> associated code to read (parse) and write PHYLIP style distance matrices.
>
> There are options to control strict 10 character name truncation, and
> the separator character(s) when writing files.
>
> Internally, I store the distances as a list of lists (of different
> lengths) to mimic a lower triangular matrix.
>
> For example, this matrix:
>
> [[0.0, 0.1, 0.2],
>    [0.1, 0.0, 0.5],
>    [0.2, 0.5, 0.0]]
>
> Is stored as this:
>
> [[], [0.1], [0.2, 0.5]]
>
> This may not be the best way to do this in terms of speed and memory usage.
>
> There are some simple test cases included, but I have pushed the code
> very far and there may be problems.  Anyway - in case anyone is
> interested either in the short term, or for ideas for how BioPython
> could support these files - here it is.
>
> I'm sure someone more familiar with arrays (Numeric and NumPy) would be
> able to make the class act more like an array - but the basics are there.
>
> As far as I could see, neither Numeric or NumPy have a specific
> symmetric matrix / symmetric array class which would be ideal.
>
> Members of the list are welcome to use the code, but please contact me
> before re-distributing it to anyone else.
>
> Peter


From biopython at maubp.freeserve.co.uk  Tue Jun 27 22:33:34 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 27 Jun 2006 23:33:34 +0100
Subject: [BioPython] Distance Matrix Parsers
In-Reply-To: <128a885f0606271434v4d5a40e9x1ceb0037d750f6a1@mail.gmail.com>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>	<448A9A7A.6050501@maubp.freeserve.co.uk>	<CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>	<449F0231.2050308@maubp.freeserve.co.uk>
	<128a885f0606271434v4d5a40e9x1ceb0037d750f6a1@mail.gmail.com>
Message-ID: <44A1B23E.5080007@maubp.freeserve.co.uk>

Chris Lasher wrote:
> Hi Peter,
> 
> Would you be up for licensing your code under the BioPython license?
> If not, I shouldn't  look at it, as I've started coding my own module
> for the project. From your description, your module sounds very good.
> =-)
> 
> Chris

I am quite happy to contribute the code to BioPython under the 
appropriate license, so please go ahead.

I've filed a bug on adding PHYLIP distance parsers to BioPython and 
attached a slightly revised version of the code (added "fuzzy" equality 
testing of matrices - mainly for testing):

http://bugzilla.open-bio.org/show_bug.cgi?id=2034
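
For anyone wondering what "fuzzy" equality means here, the general idea is to compare entries within a small tolerance instead of exactly. The sketch below is only an illustration of that idea on the list-of-lists layout, not the code attached to the bug; the function name and the default tolerance are made up.

```python
import math


def matrices_almost_equal(a, b, abs_tol=1e-9):
    """True if two list-of-lists matrices agree entry by entry within abs_tol."""
    if len(a) != len(b):
        return False
    for row_a, row_b in zip(a, b):
        if len(row_a) != len(row_b):
            return False
        for x, y in zip(row_a, row_b):
            if not math.isclose(x, y, rel_tol=0.0, abs_tol=abs_tol):
                return False
    return True


original = [[], [0.1], [0.2, 0.5]]
reloaded = [[], [0.1 + 1e-12], [0.2, 0.5]]  # tiny rounding difference
print(matrices_almost_equal(original, reloaded))
```

This is handy when a matrix is written to a file and read back, since the text format may round the values slightly.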

If anyone else really wants the code under some other license (GPL 
maybe) I could probably be persuaded.

Peter


From chris.lasher at gmail.com  Tue Jun 27 23:32:12 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Tue, 27 Jun 2006 19:32:12 -0400
Subject: [BioPython] Distance Matrix Parsers
In-Reply-To: <44A1B23E.5080007@maubp.freeserve.co.uk>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<448A9A7A.6050501@maubp.freeserve.co.uk>
	<CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>
	<449F0231.2050308@maubp.freeserve.co.uk>
	<128a885f0606271434v4d5a40e9x1ceb0037d750f6a1@mail.gmail.com>
	<44A1B23E.5080007@maubp.freeserve.co.uk>
Message-ID: <128a885f0606271632q2988f2d7y543dd441535f9808@mail.gmail.com>

[Oops! I didn't realize I was posting to the user list! Reverting it
back to BP-Dev]
This code looks very good, Peter!

As far as licensing goes, I'm new to the game, but my guess is the
BioPython license (http://www.biopython.org/DIST/LICENSE ) is highly
preferred for BioPython. You still retain copyright with the license,
but the code is more "free" than under any version of the GPL.

Chris

On 6/27/06, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Chris Lasher wrote:
> > Hi Peter,
> >
> > Would you be up for licensing your code under the BioPython license?
> > If not, I shouldn't  look at it, as I've started coding my own module
> > for the project. From your description, your module sounds very good.
> > =-)
> >
> > Chris
>
> I am quite happy to contribute the code to BioPython under the
> appropriate license, so please go ahead.
>
> I've filled a bug on adding PHYLIP distance parsers to BioPython and
> attached a slightly revised version of the code (added "fuzzy" equality
> testing of matrices - mainly for testing):
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2034
>
> If anyone else really wants the code under some other license (GPL
> maybe) I could probably be persuaded.
>
> Peter
>


From cjfields at uiuc.edu  Wed Jun 28 18:30:44 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 28 Jun 2006 13:30:44 -0500
Subject: [BioPython] Wiki spammed
Message-ID: <005201c69ae0$f78c59c0$15327e82@pyrimidine>

Guys,

Just wanted to let whoever's in charge know that you need to roll back
changes to this page:

http://biopython.org/wiki/Biopython

The spammers have struck again!

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign