From biopython at maubp.freeserve.co.uk  Mon May  2 18:06:21 2005
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon May  2 17:52:30 2005
Subject: [BioPython] Big GenBank files
In-Reply-To: <005b01c54e6f$55a796b0$0b413851@YSENGARD>
References: <1114422149.426cbb8506bc7@imp4-q.free.fr>
	<426CD193.5030801@maubp.freeserve.co.uk>
	<005b01c54e6f$55a796b0$0b413851@YSENGARD>
Message-ID: <4276A45D.6060001@maubp.freeserve.co.uk>

I sent Aurélie a Python file version of my patch (from bug 1747) off 
the mailing list, and it looks like there is a problem using it with 
the GenBank.NCBIDictionary (see below), which I had never used.

http://bugzilla.open-bio.org/show_bug.cgi?id=1747

Thanks for letting me know, Aurélie!

I will try and look at this as time permits, but I will have to work 
out how the NCBIDictionary code works first... so if someone else 
wants to leap in, please do :)

Peter

-------- Original Message --------
Subject: Re: [BioPython] Big GenBank files
Date: Sun, 1 May 2005 19:00:53 +0200
From: Aurélie Bornot <aurelie.bornot@free.fr>
To: Peter <biopython@maubp.freeserve.co.uk>

Hello Peter and everybody !

Sorry Peter: it took me a long time to get back to you about your patch
(GenBank.__init__.py)....

I have tried it with this code (which works with the "old" 
__init__.py):
from Bio import GenBank

fichier = open('AC008625.5.gb',"w")
record_parser = GenBank.FeatureParser()
ncbi_dict = GenBank.NCBIDictionary('nucleotide','genbank',parser=record_parser)
gb_record = ncbi_dict['AC008625.5']
fichier.close()

And I got this error:
Traceback (most recent call last):
  File "essais.py", line 112, in ?
    gb_record = ncbi_dict['AC008625.5']
  File "C:\Python24\lib\site-packages\Bio\GenBank\__init__.py", line 1736, in __getitem__
    return self.parser.parse(handle)
  File "C:\Python24\lib\site-packages\Bio\GenBank\__init__.py", line 219, in parse
    self._scanner.feed(handle, self._consumer)
  File "C:\Python24\lib\site-packages\Bio\GenBank\__init__.py", line 1261, in feed
    line = handle.readline()
AttributeError: ReseekFile instance has no attribute 'readline'

I don't really know why...

BUT !!!!!   : )

As you suggested, with something like:
import urllib2
from Bio import GenBank

# connection:
fichierGB = urllib2.urlopen(
    "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=" + ID
    + "&db=" + database + "&retmode=text&rettype=genbank")
record_parser = GenBank.RecordParser()
gb_iterator = GenBank.Iterator(fichierGB, record_parser)
cur_record = gb_iterator.next()
fichierGB.close()

It works !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
The big files are parsed without any problem....  : )
So I simply modified my code like this....

To conclude :
Peter , You are my savior !!!!
THANK YOU VERY VERY MUCH !!!

Aurelie
From mdehoon at ims.u-tokyo.ac.jp  Tue May  3 02:45:00 2005
From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon)
Date: Tue May  3 02:32:55 2005
Subject: [BioPython] Rethinking Seq objects
Message-ID: <42771DEC.7090100@ims.u-tokyo.ac.jp>

Hi everybody,

Recently, there was a discussion on biopython-dev about changes to the Seq and 
MutableSeq classes. I'd like to ask you if any of the proposed changes would 
cause you any problems.

The current proposal is:

1) Make Seq objects mutable, and get rid of MutableSeq. The Seq class and the 
MutableSeq class basically describe the same thing, except that one is read-only 
and the other one is not. If desired, we can add a readonly flag to the class to 
describe if it is mutable or not. (Given that e.g. Numerical Python arrays don't 
have such a flag, my feeling is that it is not really needed for Seq objects 
either). For performance reasons, the new Seq class will be implemented in C.

2) By default, a Seq class doesn't assume a particular alphabet. Same as current 
behavior:
 >>>  from Bio.Seq import *
 >>>  Seq('ATCG')
Seq('ATCG', Alphabet())
However, if the user decides to specify the alphabet explicitly, input to the 
sequence will be checked for consistency with the alphabet. So
 >>>  from Bio.Seq import *
 >>>  from Bio.Alphabet import IUPAC
 >>>  my_alpha = IUPAC.unambiguous_dna
 >>>  s = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
 >>>  s[:3] = "XYZ"
will raise an error.

3) Make Seq objects understand circular genomes. Many bacterial genomes are 
circular. It would be nice if we could take the indices [-1000:1000] from a Seq 
object, if it is circular, or [3999000:4001000] if the sequence is circular 
with length 4000000.
Circular genomes will likely be implemented as an optional keyword (perhaps 
"topology") when creating the Seq object, with corresponding set_topology, 
get_topology methods.

4) Perhaps it would be a good idea to add transcribe and translate methods to 
the Seq class. Currently, to translate a DNA sequence, we have to do
 >>> from Bio.Seq import Seq
 >>> from Bio import Translate
 >>> from Bio.Alphabet import IUPAC
 >>> my_alpha = IUPAC.unambiguous_dna
 >>> my_seq = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
 >>> standard_translator = Translate.unambiguous_dna_by_id[1]
 >>> standard_translator.translate(my_seq)
Seq('DRWAY*DRKS', HasStopCodon(IUPACProtein(), '*'))
which is too much typing for my taste.
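
To make the circular indexing in point 3 concrete, here is a minimal sketch on plain Python strings. Nothing like this exists in Biopython as far as this proposal goes: the helper name circular_slice and its argument checks are assumptions made for illustration only.

```python
def circular_slice(sequence, start, stop):
    """Return sequence[start:stop], wrapping around the origin.

    Hypothetical helper, not a Biopython function. Assumes
    start <= stop and a slice no longer than the sequence itself.
    """
    n = len(sequence)
    if n == 0:
        raise ValueError("cannot slice an empty sequence")
    if stop < start:
        raise ValueError("start must not exceed stop")
    if stop - start > n:
        raise ValueError("slice is longer than the sequence")
    begin = start % n              # map a negative or large start into 0..n-1
    end = begin + (stop - start)
    if end <= n:
        return sequence[begin:end]
    # The slice crosses the origin: take the tail, then the head.
    return sequence[begin:] + sequence[:end - n]


genome = "ABCDEFGHIJ"
print(circular_slice(genome, -3, 3))   # HIJABC
print(circular_slice(genome, 8, 12))   # IJAB
```

For a 4000000-letter genome, circular_slice(genome, 3999000, 4001000) would return the last 1000 letters followed by the first 1000.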


Questions/comments/suggestions are welcome. None of this has actually been coded 
yet, so it's all still open to discussion.


--Michiel.


-- 
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon
From bneron at pasteur.fr  Tue May  3 05:27:39 2005
From: bneron at pasteur.fr (bneron@pasteur.fr)
Date: Tue May  3 05:20:00 2005
Subject: [BioPython] Rethinking Seq objects
In-Reply-To: <42771DEC.7090100@ims.u-tokyo.ac.jp>
References: <42771DEC.7090100@ims.u-tokyo.ac.jp>
Message-ID: <20050503092739.GB10339@kerka-sis.pasteur.fr>

* Michiel Jan Laurens de Hoon <mdehoon@ims.u-tokyo.ac.jp> (20050503 15:45):
> Hi everybody,
> 
> Recently, there was a discussion on biopython-dev about changes to the Seq 
> and MutableSeq classes. I'd like to ask you if any of the proposed changes 
> would cause you any problems.
> 
> The current proposal is:
> 
> 1) Make Seq objects mutable, and get rid of MutableSeq. The Seq class and 
> the MutableSeq class basically describe the same thing, except that one is 
> read-only and the other one is not. If desired, we can add a readonly flag 
> to the class to describe if it is mutable or not. (Given that e.g. 
> Numerical Python arrays don't have such a flag, my feeling is that it is 
> not really needed for Seq objects either). For performance reasons, the new 
> Seq class will be implemented in C.
> 
> 2) By default, a Seq class doesn't assume a particular alphabet. Same as 
> current behavior:
> >>>  from Bio.Seq import *
> >>>  Seq('ATCG')
> Seq('ATCG', Alphabet())
> However, if the user decides to specify the alphabet explicitly, input to 
> the sequence will be checked for consistency with the alphabet. So
> >>>  from Bio.Seq import *
> >>>  from Bio.Alphabet import IUPAC
> >>>  my_alpha = IUPAC.unambiguous_dna
> >>>  s = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
> >>>  s[:3] = "XYZ"
> will raise an error.
> 
> 3) Make Seq objects understand circular genomes. Many bacterial genomes are 
> circular. It would be nice if we could take the indices [-1000:1000] from a 
> Seq object, if it is circular, or [3999000:4001000] if the sequence is 
> circular with length 4000000.
> Circular genomes will likely be implemented as an optional keyword (perhaps 
> "topology") when creating the Seq object, with corresponding set_topology, 
> get_topology methods.
> 
> 4) Perhaps it would be a good idea to add transcribe and translate methods 
> to the Seq class. Currently, to translate a DNA sequence, we have to do
> >>> from Bio.Seq import Seq
> >>> from Bio import Translate
> >>> from Bio.Alphabet import IUPAC
> >>> my_alpha = IUPAC.unambiguous_dna
> >>> my_seq = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
> >>> standard_translator = Translate.unambiguous_dna_by_id[1]
> >>> standard_translator.translate(my_seq)
> Seq('DRWAY*DRKS', HasStopCodon(IUPACProtein(), '*'))
> which is too much typing for my taste.
> 
> 
> Questions/comments/suggestions are welcome. None of this has actually been 
> coded yet, so it's all still open to discussion.
> 
> 
> --Michiel.
> 


I agree with the suggestions above, but I'd like to add a remark about the way 
the Seq object handles the alphabet used for the sequence, and more precisely 
the case (upper or lower) of the sequence.
Just an example:

Python 2.3.4 (#1, Mar 11 2005, 17:34:27) 
[GCC 3.3.5  (Gentoo Linux 3.3.5-r1, ssp-3.3.2-3, pie-8.7.7.1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Seq import Seq
>>> from Bio import Translate
>>> from Bio.Alphabet import IUPAC
>>> my_alpha = IUPAC.unambiguous_dna
>>> my_seq_upper = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
>>> my_seq_lower = Seq('gatcgatgggcctattaggatcgaaaatcgc', my_alpha)
>>> standard_translator = Translate.unambiguous_dna_by_id[1]
>>> standard_translator.translate(my_seq_upper)
Seq('DRWAY*DRKS', HasStopCodon(IUPACProtein(), '*'))
>>> standard_translator.translate(my_seq_lower)
Seq('**********', HasStopCodon(IUPACProtein(), '*'))
>>> 

Obviously, lower case doesn't work in the Seq object.
But I get no exception, either when the Seq is created or during the translation.
Worse, the translate method returns a value, but it doesn't mean anything.
(Transcription behaves the same way.)

I think it would be a good thing to correct this behavior.
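
A minimal sketch of the behavior asked for here, written on plain strings rather than Biopython classes. The function name translate and the partial codon table are assumptions for illustration; the table lists only the codons that occur in the example sequence, whereas a real table has 64 entries.

```python
# Partial codon table (standard genetic code), limited to the codons
# present in the example sequence. Illustrative only.
CODONS = {
    "GAT": "D", "CGA": "R", "TGG": "W", "GCC": "A",
    "TAT": "Y", "TAG": "*", "AAA": "K", "TCG": "S",
}


def translate(dna):
    """Translate DNA to protein, accepting upper or lower case.

    Unknown codons raise ValueError instead of being turned into '*'.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3   # ignore a trailing partial codon
    protein = []
    for i in range(0, usable, 3):
        codon = dna[i:i + 3]
        try:
            protein.append(CODONS[codon])
        except KeyError:
            raise ValueError("unknown codon %r at position %d" % (codon, i)) from None
    return "".join(protein)


print(translate("GATCGATGGGCCTATTAGGATCGAAAATCGC"))  # DRWAY*DRKS
print(translate("gatcgatgggcctattaggatcgaaaatcgc"))  # DRWAY*DRKS
```

Normalizing the case once at the start keeps the lookup table small; raising on an unknown codon makes bad input visible instead of returning a string of stop symbols.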


-- 
Bertrand Neron

Groupe Logiciels et Banques de Donnees 
Institut Pasteur

Tel: 01 45 68 86 78
Fax: 01 40 61 30 80
From letondal at pasteur.fr  Tue May  3 11:04:00 2005
From: letondal at pasteur.fr (Catherine Letondal)
Date: Tue May  3 10:52:30 2005
Subject: [BioPython] suggestions for Bio.PDB
In-Reply-To: <32950.83.92.3.59.1113745194.squirrel@www.binf.ku.dk>
References: <a5032f81273d6fe27019eea821b551ab@pasteur.fr>
	<32950.83.92.3.59.1113745194.squirrel@www.binf.ku.dk>
Message-ID: <7bb49749d8bfd65835e1e19fdb859a37@pasteur.fr>

Hi,

On Apr 17, 2005, at 3:39 PM, thamelry@binf.ku.dk wrote:

> Hi,
>
>> Would it be possible for the get_structure() method in PDBParser to
>> accept a filehandle
>
> You're not the first to suggest this - it's already in
> the CVS version, also for generating PDB output with PDBIO.

Ok, thanks.

>> Another suggestion: it could be useful to keep a record of the read
>> structure, just in case the user would like to benefit from biopython
>> PDB modules, but also do some custom analysis.
>
> I don't think this would be very useful, it's easy enough to just read 
> in
> the file separately.

Ok (the student who made the suggestion had a lot of files to read, 
that's why - but if you implement the filehandle parameter, it's fine)


> BTW, I'm soon going to start to implement a parser for the new
> PDB XML format. Any suggestions, comments, etc. regarding this
> are welcome.

A suggestion (not related to XML) from one of the teachers of our 
course: a method returning embedded elements at any level would be 
useful. For instance, a get_residues() method that lets you iterate 
directly over residues, whatever the chain, would be very convenient:

p = PDBParser()
s = p.get_structure('...')

for residue in s.get_residues():
	...

Similarly:
for atom in s.get_atoms():
	...

I'm aware it's easy to implement with a simple function - at the same 
time it might be useful enough to have it available directly in the 
Structure class?
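
For reference, the "simple function" version could look like the sketch below. It assumes only that iterating a structure yields models, a model yields chains, a chain yields residues and a residue yields atoms; nested lists stand in for the Bio.PDB objects here so the example runs without any PDB file.

```python
def get_residues(structure):
    """Yield every residue of every chain of every model."""
    for model in structure:
        for chain in model:
            for residue in chain:
                yield residue


def get_atoms(structure):
    """Yield every atom of every residue in the structure."""
    for residue in get_residues(structure):
        for atom in residue:
            yield atom


# Stand-in data: one model, two chains, atoms given by name only.
structure = [
    [
        [["N", "CA"], ["N", "CA", "CB"]],   # first chain: two residues
        [["N"]],                            # second chain: one residue
    ]
]

print(len(list(get_residues(structure))))   # 3
print(list(get_atoms(structure)))           # ['N', 'CA', 'N', 'CA', 'CB', 'N']
```

Generators keep memory use flat even for large structures, since no intermediate list of residues or atoms is built.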

Thanks in advance,

--
Catherine Letondal -- Institut Pasteur -- Informatics in Biology Course
www.pasteur.fr/formation/infobio/infobio-en.html

From GECrooks at lbl.gov  Wed May  4 12:37:18 2005
From: GECrooks at lbl.gov (Gavin Crooks)
Date: Wed May  4 12:30:15 2005
Subject: [BioPython] Rethinking Seq objects
In-Reply-To: <42771DEC.7090100@ims.u-tokyo.ac.jp>
References: <42771DEC.7090100@ims.u-tokyo.ac.jp>
Message-ID: <e5064a2f81499ed9966200ba7b79d635@lbl.gov>


On May 2, 2005, at 23:45, Michiel Jan Laurens de Hoon wrote:
> 1) Make Seq objects mutable, and get rid of MutableSeq. The Seq class 
> and the MutableSeq class basically describe the same thing, except 
> that one is read-only and the other one is not. If desired, we can add 
> a readonly flag to the class to describe if it is mutable or not. 
> (Given that e.g. Numerical Python arrays don't have such a flag, my 
> feeling is that it is not really needed for Seq objects either). For 
> performance reasons, the new Seq class will be implemented in C.
>

Although I agree that we don't need a Seq and a MutableSeq class, I 
don't follow why we need a mutable sequence class at all. What's the 
use case?

If, in the alternative, Seq was a simple immutable object then it could 
be implemented as a light weight subclass of str, with an alphabet 
attribute that is also a subclass of str. You'd edit it like you would 
edit any string in python;  split it into a list, do whatever 
manipulations are necessary, and then join the list back together into 
a new Seq.
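
A rough sketch of what such a class might look like, under stated assumptions: the alphabet is passed as a plain string of allowed letters, an empty alphabet means "no check", and the validation rule is a guess at what "consistency with the alphabet" would mean. This is not the existing Bio.Seq.Seq.

```python
class Seq(str):
    """Immutable sequence: a str that also carries an alphabet.

    Illustrative only. The alphabet is itself a str listing the
    allowed letters; an empty alphabet disables the check.
    """

    def __new__(cls, data, alphabet=""):
        if alphabet:
            bad = sorted(set(data) - set(alphabet))
            if bad:
                raise ValueError("letters not in alphabet: %s" % "".join(bad))
        self = super().__new__(cls, data)
        self.alphabet = alphabet
        return self


s = Seq("GATCGATG", "ACGT")
print(s.count("G"))        # 3, str methods come for free
print(s.alphabet)          # ACGT
print(isinstance(s, str))  # True
```

One caveat of this design: slicing and most str methods return a plain str rather than a Seq, so the alphabet is lost unless the result is wrapped again.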


Gavin Crooks


--
Gavin E. Crooks
Divisional Fellow                    tel:  (510) 486-7721
Physical Biosciences                 aim:notastring
Lawrence Berkeley Natl. Lab          http://threeplusone.com/
Berkeley, CA 94720, USA              GECrooks@lbl.gov

From mdehoon at ims.u-tokyo.ac.jp  Thu May  5 03:30:50 2005
From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon)
Date: Thu May  5 03:18:34 2005
Subject: [BioPython] Rethinking Seq objects
In-Reply-To: <e5064a2f81499ed9966200ba7b79d635@lbl.gov>
References: <42771DEC.7090100@ims.u-tokyo.ac.jp>
	<e5064a2f81499ed9966200ba7b79d635@lbl.gov>
Message-ID: <4279CBAA.809@ims.u-tokyo.ac.jp>

Gavin Crooks wrote:
> On May 2, 2005, at 23:45, Michiel Jan Laurens de Hoon wrote:
>> 1) Make Seq objects mutable, and get rid of MutableSeq.
> 
> Although I agree that we don't need a Seq and a MutableSeq class, I 
> don't follow why we need a mutable sequence class at all. What's the use 
> case?

Biopython itself uses a MutableSeq in various places, so there does seem to be a 
need for a mutable sequence class. However, in some places a MutableSeq is used 
where a Seq would do. As far as I can tell, Bio.GA and Bio.NeuralNetwork 
actually use the MutableSeq class; in this case, a simple array might work also. 
So maybe there is not much use for a mutable Seq class.

I'm a bit hesitant though to simply throw out MutableSeq, so I'd like to ask our 
users:
Can you give an example where you can't use an (immutable) Seq, but have to use 
a MutableSeq?

> If, in the alternative, Seq was a simple immutable object then it could 
> be implemented as a light weight subclass of str, with an alphabet 
> attribute that is also a subclass of str. You'd edit it like you would 
> edit any string in python;  split it into a list, do whatever 
> manipulations are necessary, and then join the list back together into a 
> new Seq.

There may be performance issues with this approach, if a Seq object is mutated 
often. So let's wait and see if any of our users actually want to mutate a 
sequence object, and if so, if the performance is critical.

--Michiel.


-- 
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon
From Frederic.Sohm at iaf.cnrs-gif.fr  Thu May  5 09:29:46 2005
From: Frederic.Sohm at iaf.cnrs-gif.fr (Frédéric Sohm)
Date: Thu May  5 09:26:02 2005
Subject: [BioPython] (no subject)
Message-ID: <1115299786.427a1fcaacb25@mail.iaf.cnrs-gif.fr>

Hi Michiel and everyone,

Just a thought, don't flame me for that.
Since you will be making a new Seq object, would it be worth making it behave
more like a typical Python object?

But first a disclaimer: I realise the proposed change could mean breaking a
lot of code, so it might be a very bad idea in the end.

When I first used Biopython, I was surprised by the behaviour of the
Seq object with regard to the built-in str() and repr() functions
(I should have read the manual first, but hey...):

OK, here is the current Seq behaviour:
>>> from Bio.Seq import Seq
>>> a = 'a'*80
>>> a

'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaa'

>>> s = Seq(a)
>>> s

Seq('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaa', Alphabet())

>>> str(s)

"Seq('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaa ...', Alphabet())"

>>> repr(s)

"Seq('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaa', Alphabet())"

>>> s.tostring()

'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaa'

Now here is what I was expecting at the time, given the respective
meanings of str and repr:

>>> a = 'a'*80
>>> a

'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaa'

>>> s = Seq(a)
>>> s

Seq('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaa', Alphabet())

>>> str(s)

'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaa'

>>> repr(s)

"Seq('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaa', Alphabet())"


So what I would propose is to:
1) change str(seq) to return the actual sequence, as seq.tostring() does right
   now;
2) leave repr(seq) as it is;
3) make seq.tostring() return str(seq) for backward compatibility (it would
   eventually be removed);
4) add a new method, Seq.short() for example, which would behave like the
   current str(seq).
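
A minimal stand-in class showing how the four pieces of this proposal would fit together. It is a sketch, not the Biopython Seq: the alphabet is kept as a display string, and the 60-letter cut-off in short() is an assumption based on the truncated output shown above.

```python
class Seq:
    """Stand-in for the proposed behaviour (not Biopython's Seq)."""

    def __init__(self, data, alphabet="Alphabet()"):
        self.data = data
        self.alphabet = alphabet   # kept as text purely for display

    def __repr__(self):
        # Unchanged: full, unambiguous representation.
        return "Seq(%r, %s)" % (self.data, self.alphabet)

    def __str__(self):
        # Proposed: str() gives the bare sequence.
        return self.data

    def tostring(self):
        # Kept for backward compatibility; same result as str().
        return str(self)

    def short(self, width=60):
        # Proposed home for the truncated display.
        shown = self.data
        if len(shown) > width:
            shown = shown[:width] + " ..."
        return "Seq(%r, %s)" % (shown, self.alphabet)


s = Seq("a" * 80)
print(str(s) == "a" * 80)   # True
print(s.short())
```

With this split, "%s" formatting and file.write(str(seq)) do what a Python programmer would expect, while the abbreviated form stays available on request.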

I don't have any idea how much code this would break. The feasibility of
it will also depend on the way the new Seq is released (I mean, do you
plan to have the current Seq and the new one co-existing for a while, or to
directly replace the old Seq?).
If the latter is the way we go, this change is certainly not desirable;
otherwise it might be something to consider.

Personally I have mixed feelings about it, but I think it is worth discussing
the matter now.

This change would make Seq objects behave more like a Python programmer
would expect; on the other hand, Biopython has been built on the current
model and it might be a bad idea to change it after so much time.


Since the only real problem with this is the replacement of the str() method,
it all boils down to how frequently people use the current string method of Seq
in their code.
I do not have the impression it is very frequent, but...

What do you think?

Fred


-- 
Frédéric Sohm
Equipe INRA U1126 "Morphogenèse du système nerveux des Chordés"
UPR 2197 DEPSN, CNRS
Institut de Neurosciences A. Fessard
1 Avenue de la Terrasse
91 198 GIF-SUR-YVETTE
FRANCE
Phone: +33 (0) 1 69 82 34 12
Fax:+33 (0) 1 69 82 34 47

From GECrooks at lbl.gov  Thu May  5 14:35:19 2005
From: GECrooks at lbl.gov (Gavin Crooks)
Date: Thu May  5 14:28:28 2005
Subject: [BioPython] Rethinking Seq objects
In-Reply-To: <4279CBAA.809@ims.u-tokyo.ac.jp>
References: <42771DEC.7090100@ims.u-tokyo.ac.jp>
	<e5064a2f81499ed9966200ba7b79d635@lbl.gov>
	<4279CBAA.809@ims.u-tokyo.ac.jp>
Message-ID: <b5ffeeefee529b007424168fc5b255b0@lbl.gov>


On May 5, 2005, at 00:30, Michiel Jan Laurens de Hoon wrote:

>
>> If, in the alternative, Seq was a simple immutable object then it 
>> could be implemented as a light weight subclass of str, with an 
>> alphabet attribute that is also a subclass of str. You'd edit it like 
>> you would edit any string in python;  split it into a list, do 
>> whatever manipulations are necessary, and then join the list back 
>> together into a new Seq.
>
> There may be performance issues with this approach, if a Seq object is 
> mutated often. So let's wait and see if any of our users actually want 
> to mutate a sequence object, and if so, if the performance is 
> critical.

Performance would be no worse than for string manipulation in standard 
python. The Way of The Python is not to use MutableString's (Which are 
in the standard library, but not really canonical) but to split string 
into lists or arrays, do whatever manipulations are necessary and then 
join the string back together. Is there any reason why Seq's can't be 
mutated analogously?
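
The idiom described here, sketched on a plain string. The helper name apply_point_mutations is made up for the example; a Seq built on str could be edited the same way and re-wrapped at the end.

```python
def apply_point_mutations(sequence, mutations):
    """Return a new string with {index: letter} substitutions applied."""
    letters = list(sequence)            # split into a mutable list
    for index, letter in mutations.items():
        letters[index] = letter         # cheap in-place edits
    return "".join(letters)             # join once at the end


original = "GATCGATGGG"
mutated = apply_point_mutations(original, {0: "A", 9: "T"})
print(mutated)    # AATCGATGGT
print(original)   # GATCGATGGG, the original is untouched
```

The split and the join each cost one pass over the sequence, so the pattern pays off when many edits are batched between them rather than re-joining after every change.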


Gavin Crooks

--
Gavin E. Crooks
Divisional Fellow                    tel:  (510) 486-7721
Physical Biosciences                 aim:notastring
Lawrence Berkeley Natl. Lab          http://threeplusone.com/
Berkeley, CA 94720, USA              GECrooks@lbl.gov

From john.corradi at bms.com  Fri May  6 11:51:48 2005
From: john.corradi at bms.com (John Corradi)
Date: Fri May  6 11:45:20 2005
Subject: [BioPython] handling sequence ambiguity in SeqUtils
Message-ID: <427B9294.6010605@bms.com>

Hi All,

I just noticed that the protein molecular weight calculation in 
ProtParam.py chokes on sequence ambiguities (i.e. X's).  It just throws 
a KeyError exception.  How about using either an average amino acid 
molecular weight or calculating a minimum and maximum for those cases?  
Thanks.
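
A sketch of the minimum/maximum idea, independent of ProtParam.py. Everything here is an assumption for illustration: the function name weight_range, the usual "sum of free amino acid masses minus one water per peptide bond" model, and a table deliberately truncated to three residues with approximate average masses. A real table would cover all 20 amino acids.

```python
WATER = 18.015   # approximate mass of one water molecule (Da)

# Approximate average masses of free amino acids (Da).
# Truncated on purpose: glycine, alanine, tryptophan only.
WEIGHTS = {"G": 75.07, "A": 89.09, "W": 204.23}


def weight_range(protein, ambiguous="X"):
    """Return (minimum, maximum) molecular weight of a protein.

    Each ambiguous letter counts as the lightest residue for the
    minimum and the heaviest residue for the maximum.
    """
    lightest = min(WEIGHTS.values())
    heaviest = max(WEIGHTS.values())
    low = high = 0.0
    for letter in protein:
        if letter == ambiguous:
            low += lightest
            high += heaviest
        else:
            weight = WEIGHTS[letter]   # other unknown letters still raise KeyError
            low += weight
            high += weight
    bonds = max(len(protein) - 1, 0)   # one water lost per peptide bond
    return low - bonds * WATER, high - bonds * WATER


low, high = weight_range("GXA")
print("%.2f to %.2f" % (low, high))
```

For a sequence without any X the two values coincide, so existing callers that expect a single number could keep using either one.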

John

P.S.  I guess this is a consideration for the other utils as well.
From mdehoon at ims.u-tokyo.ac.jp  Sat May  7 01:25:28 2005
From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon)
Date: Sat May  7 01:13:35 2005
Subject: [BioPython] Rethinking Seq objects
In-Reply-To: <20050503092739.GB10339@kerka-sis.pasteur.fr>
References: <42771DEC.7090100@ims.u-tokyo.ac.jp>
	<20050503092739.GB10339@kerka-sis.pasteur.fr>
Message-ID: <427C5148.9080708@ims.u-tokyo.ac.jp>

bneron@pasteur.fr wrote:
> just an exemple:
> 
> Python 2.3.4 (#1, Mar 11 2005, 17:34:27) 
> [GCC 3.3.5  (Gentoo Linux 3.3.5-r1, ssp-3.3.2-3, pie-8.7.7.1)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> 
>>>>from Bio.Seq import Seq
>>>>from Bio import Translate
>>>>from Bio.Alphabet import IUPAC
>>>>my_alpha = IUPAC.unambiguous_dna
>>>>my_seq_upper = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
>>>>my_seq_lower = Seq('gatcgatgggcctattaggatcgaaaatcgc', my_alpha)
>>>>standard_translator = Translate.unambiguous_dna_by_id[1]
>>>>standard_translator.translate(my_seq_upper)
> 
> Seq('DRWAY*DRKS', HasStopCodon(IUPACProtein(), '*'))
> 
>>>>standard_translator.translate(my_seq_lower)
> 
> Seq('**********', HasStopCodon(IUPACProtein(), '*'))
> 
> 
> obviously the lower case doesn't work in the Seq object.

I agree, this should be corrected. The translate and transcribe methods should 
work with both uppercase and lowercase.

--Michiel.

-- 
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon
From mdehoon at ims.u-tokyo.ac.jp  Sat May  7 01:36:01 2005
From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon)
Date: Sat May  7 01:23:49 2005
Subject: [BioPython] Rethinking Seq objects
In-Reply-To: <b5ffeeefee529b007424168fc5b255b0@lbl.gov>
References: <42771DEC.7090100@ims.u-tokyo.ac.jp>	<e5064a2f81499ed9966200ba7b79d635@lbl.gov>	<4279CBAA.809@ims.u-tokyo.ac.jp>
	<b5ffeeefee529b007424168fc5b255b0@lbl.gov>
Message-ID: <427C53C1.7040008@ims.u-tokyo.ac.jp>

Gavin Crooks wrote:
> On May 5, 2005, at 00:30, Michiel Jan Laurens de Hoon wrote:
>>> If, in the alternative, Seq was a simple immutable object then it 
>>> could be implemented as a light weight subclass of str, with an 
>>> alphabet attribute that is also a subclass of str. You'd edit it like 
>>> you would edit any string in python;  split it into a list, do 
>>> whatever manipulations are necessary, and then join the list back 
>>> together into a new Seq.
>>
>> There may be performance issues with this approach, if a Seq object is 
>> mutated often. So let's wait and see if any of our users actually want 
>> to mutate a sequence object, and if so, if the performance is critical.
> 
> Performance would be no worse than for string manipulation in standard 
> python. The Way of The Python is not to use MutableString's (Which are 
> in the standard library, but not really canonical) but to split string 
> into lists or arrays, do whatever manipulations are necessary and then 
> join the string back together. Is there any reason why Seq's can't be 
> mutated analogously?
> 
Well, I was gonna say that Seq objects can be very large, certainly much larger 
than common usage of strings in Python, and that this will be a performance 
issue. But when I tried to modify a long string by splitting and rejoining, it 
doesn't seem to be bad at all. So maybe this is the way to go.

--Michiel.


-- 
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon
From wligtenberg at gmail.com  Sat May  7 10:28:47 2005
From: wligtenberg at gmail.com (Willem Ligtenberg)
Date: Sat May  7 10:22:03 2005
Subject: [BioPython] getting the positions as a string for comparison reason
Message-ID: <cbb068b305050707287b0833e4@mail.gmail.com>

Hello,
I am trying to parse a genbank file and want to compare the start (or
end) positions of a gene with start (or end) positions I already may
have parsed from Ensembl.
These positions have been stored as a string and the string is empty
"" if none is yet stored in the gene object.

[code snippet]
for gen in record.features:
    if gen.location != "":
        gene.addStartBP(gen.location.start)
        gene.addStartBP(gen.location.end)
[/code snippet]

[code from gene class]
def addStartBP(self, startBP):
    if self.startBP == "":
        self.startBP = startBP
[/code from gene class]

This gives (of course) this error:
AssertionError: We can only do comparisons between Biopython Position objects.

But how can I convert this position object to a string?
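
One possible way around the comparison, sketched with stand-in classes so that it runs without Biopython. The Position class below only imitates the assertion shown in the error message; that str() on a real Biopython position returns its coordinate is an assumption to check against the installed version.

```python
class Position:
    """Stand-in for a Biopython position object (not the real class)."""

    def __init__(self, position):
        self.position = position

    def __eq__(self, other):
        # Imitates the assertion from the error message above.
        assert isinstance(other, Position), \
            "We can only do comparisons between Position objects."
        return self.position == other.position

    def __str__(self):
        return str(self.position)


class Gene:
    def __init__(self):
        self.startBP = ""   # empty string means "nothing stored yet"

    def addStartBP(self, startBP):
        # Convert first, so only strings are ever stored and compared.
        startBP = str(startBP)
        if self.startBP == "":
            self.startBP = startBP


gene = Gene()
gene.addStartBP(Position(1042))
gene.addStartBP(Position(2000))   # ignored, a start is already stored
print(gene.startBP)               # 1042
```

Because the stored value is always a string, the later comparison with "" never involves a position object, and values parsed from Ensembl as strings can be compared directly.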

Thanks in advance,

Willem Ligtenberg

From JBonis at imim.es  Mon May  9 09:34:17 2005
From: JBonis at imim.es (BONIS SANZ, JULIO)
Date: Mon May  9 09:28:54 2005
Subject: [BioPython] Problems parsing xml with sax
Message-ID: <66373AD054447F47851FCC5EB49B3611061251@basquet.imim.es>

Maybe this is not closely related to Biopython, maybe it is... anyway:


I am using Biopython's GenBank.EUtils.ThinClient.ThinClient() to get some XML from NCBI.

After that, I built some XML parsers with SAX to extract information.

My problem is that SAX cannot handle the way NCBI references the DTD for SNP records.

I did:

snpdbi = GenBank.DBIds("snp",['6313'])
file = GenBank.EUtils.ThinClient.ThinClient.efetch_using_dbids(snpdbi,rettype = 'flt',retmode = 'xml')

The problem is that the file starts with:

### <?xml version="1.0"?>
### <!DOCTYPE NSE-rs PUBLIC "-//NCBI//NSE/EN" "/entrez/query/DTD/NSE.dtd">

And sax returns this error: 

ValueError: unknown url type: /entrez/query/DTD/NSE.dtd


When retrieving from elink I don't have that problem. For example:

>>> dbid = GenBank.DBIds("nucleotide",['55956922'])
>>> xmlFileWithSNPsStream = eutils.elink_using_dbids(dbid,db="snp")

And I can parse with sax, as the file starts with:

###  <?xml version="1.0"?>
###  <!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_020511.dtd">

That is a well formed URL....


Any idea?

Regards, 

Julio Bonis Sanz MD
www.juliobonis.com



From JBonis at imim.es  Mon May  9 10:22:47 2005
From: JBonis at imim.es (BONIS SANZ, JULIO)
Date: Mon May  9 10:16:27 2005
Subject: [BioPython] Problems parsing xml with sax
Message-ID: <66373AD054447F47851FCC5EB49B3611061252@basquet.imim.es>

Solved ;)

Just after creating the parser

>>> from xml.sax import make_parser
>>> parser = make_parser()

put:

>>> parser.setFeature('http://xml.org/sax/features/external-general-entities', False)

This disables the resolution of external entities, so the parser no longer tries to fetch the DTD.
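
A self-contained version of this fix, for anyone who wants to try it without contacting NCBI. The DOCTYPE line is the one from the SNP record above; the element names inside the document (NSE-rs with a single id child) and the ElementLogger handler are made up for the example.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler, feature_external_ges


class ElementLogger(ContentHandler):
    """Collect the names of all elements seen during parsing."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


# Same DOCTYPE as the NCBI SNP output; the body is invented.
DOCUMENT = b"""<?xml version="1.0"?>
<!DOCTYPE NSE-rs PUBLIC "-//NCBI//NSE/EN" "/entrez/query/DTD/NSE.dtd">
<NSE-rs><id>6313</id></NSE-rs>
"""


def element_names(xml_bytes):
    parser = make_parser()
    # feature_external_ges is the constant for
    # 'http://xml.org/sax/features/external-general-entities'.
    parser.setFeature(feature_external_ges, False)
    handler = ElementLogger()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.names


print(element_names(DOCUMENT))   # ['NSE-rs', 'id']
```

Recent Python versions already leave this feature off by default, so setting it explicitly mainly documents the intent and protects code that may run on older interpreters.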

Regards, 

Julio Bonis Sanz MD
www.juliobonis.com

-----Original Message-----
From: biopython-bounces@portal.open-bio.org
[mailto:biopython-bounces@portal.open-bio.org] On behalf of BONIS SANZ,
JULIO
Sent: Monday, 09 May 2005 15:34
To: Biopython mailing list
Subject: [BioPython] Problems parsing xml with sax


Maybe this is not closely related to Biopython, maybe it is... anyway:


I am using Biopython's GenBank.EUtils.ThinClient.ThinClient() to get some XML from NCBI.

After that I have built some XML parsers with SAX to extract information.

My problem is that SAX does not recognize the format that NCBI uses for the DTD of SNP records.

I did:

snpdbi = GenBank.DBIds("snp",['6313'])
file = GenBank.EUtils.ThinClient.ThinClient.efetch_using_dbids(snpdbi,rettype = 'flt',retmode = 'xml')

the problem is that file starts with:

### <?xml version="1.0"?>
### <!DOCTYPE NSE-rs PUBLIC "-//NCBI//NSE/EN" "/entrez/query/DTD/NSE.dtd">

And sax returns this error: 

ValueError: unknown url type: /entrez/query/DTD/NSE.dtd


When retrieving from elink I do not have that problem. For example:

>>> dbid = GenBank.DBIds("nucleotide",['55956922'])
>>> xmlFileWithSNPsStream = eutils.elink_using_dbids(dbid,db="snp")

And I can parse with sax, as the file starts with:

###  <?xml version="1.0"?>
###  <!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_020511.dtd">

That is a well formed URL....


Any idea?

Regards, 

Julio Bonis Sanz MD
www.juliobonis.com



From eirik.sonneland at student.umb.no  Tue May 10 06:09:34 2005
From: eirik.sonneland at student.umb.no (=?ISO-8859-1?Q?Eirik_S=F8nneland?=)
Date: Tue May 10 10:29:48 2005
Subject: [BioPython] Megablast
Message-ID: <4280885E.7070201@student.umb.no>

Hi!

Is there someone who knows what the correct code is if I want to use 
megablast instead of blastn in:

b_results = NCBIWWW.qblast('blastn', 'bta_genome/all_contig', 
f_record).read()

Thank you!

Cheers,
Eirik
From aurelie.bornot at free.fr  Wed May 11 11:52:58 2005
From: aurelie.bornot at free.fr (aurelie.bornot@free.fr)
Date: Wed May 11 11:45:52 2005
Subject: [BioPython] pairwise alignment tools ???
Message-ID: <1115826778.42822a5a1161d@imp4-q.free.fr>


Hello everybody !

I would like to know whether Biopython has anything for computing optimal
pairwise alignments of DNA and protein sequences.
I need something that allows the use of BLOSUM62 and BLOSUM45 for proteins.

If it doesn't exist, does anyone know where I can find binaries that do this?
(Unfortunately I have to work on Windows XP, and I am having trouble finding
something.)


Thanks a lot !!
Aurelie
From kris at Math.Princeton.EDU  Thu May 12 11:31:23 2005
From: kris at Math.Princeton.EDU (Kristina Rogale Plazonic)
Date: Thu May 12 11:24:53 2005
Subject: [BioPython] some files missing in sources of 1.40b and weird
	non-error
Message-ID: <Pine.LNX.4.62.0505111520380.15779@max.math.Princeton.EDU>


Hi!

Both the tar.gz and zip archive of 1.40b offered on the download page of 
the website seem to be incomplete - in particular these files are missing
KDTree/__init__.py, 
KDTree/KDTree.py
and maybe others. (source was downloaded yesterday.)

As a result PDB module's NeighborSearch.py doesn't work anymore with a 
strange error:

>>> ns = Bio.PDB.NeighborSearch(atlist)
Traceback (most recent call last):
   File "<stdin>", line 1, in ?
TypeError: 'module' object is not callable

- as you see, python doesn't report missing KDTree module called from 
NeighborSearch at all!!! (Sample scripts do report missing KDTree module.) 
Indeed, it seems that NeighborSearch is then a module with no classes at 
all; i.e. all that is contained in NeighborSearch.py after the second line 
from Bio.KDTree import *
is ignored, with no error reported.

I'm utterly confused. Does this happen because of some setting in 
biopython?

Thanks,
Kristina
From kris at Math.Princeton.EDU  Thu May 12 11:56:12 2005
From: kris at Math.Princeton.EDU (Kristina Rogale Plazonic)
Date: Thu May 12 11:48:49 2005
Subject: [BioPython] some files missing in sources of 1.40b and weird
	non-error
In-Reply-To: <200505121732.47339.thamelry@binf.ku.dk>
References: <Pine.LNX.4.62.0505111520380.15779@max.math.Princeton.EDU>
	<200505121732.47339.thamelry@binf.ku.dk>
Message-ID: <Pine.LNX.4.62.0505121150030.26242@max.math.Princeton.EDU>


> KDTree is C++ code, which causes problems on some
> systems, and hence compilation is disabled by default. Un-commenting
> the KDTree lines in setup.py and re-installing will quite likely solve
> your problem.

Hi, this is the first thing I tried. Then I discovered that some of the 
KDTree files are MISSING from the source archives of 1.40b on the download 
page.  I had to fetch the current CVS version to get the complete source.

Kristina
From thamelry at binf.ku.dk  Thu May 12 11:32:47 2005
From: thamelry at binf.ku.dk (Thomas Hamelryck)
Date: Thu May 12 12:04:42 2005
Subject: [BioPython] some files missing in sources of 1.40b and weird
	non-error
In-Reply-To: <Pine.LNX.4.62.0505111520380.15779@max.math.Princeton.EDU>
References: <Pine.LNX.4.62.0505111520380.15779@max.math.Princeton.EDU>
Message-ID: <200505121732.47339.thamelry@binf.ku.dk>

Hi Kristina,

> I'm utterly confused. Does this happen because of some setting in
> biopython?

KDTree is C++ code, which causes problems on some 
systems, and hence compilation is disabled by default. Un-commenting
the KDTree lines in setup.py and re-installing will quite likely solve
your problem.

Best regards,

-Thomas

From mdehoon at ims.u-tokyo.ac.jp  Fri May 13 01:14:13 2005
From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon)
Date: Fri May 13 01:01:58 2005
Subject: [BioPython] some files missing in sources of 1.40b and weird
	non-error
In-Reply-To: <Pine.LNX.4.62.0505121150030.26242@max.math.Princeton.EDU>
References: <Pine.LNX.4.62.0505111520380.15779@max.math.Princeton.EDU>	<200505121732.47339.thamelry@binf.ku.dk>
	<Pine.LNX.4.62.0505121150030.26242@max.math.Princeton.EDU>
Message-ID: <428437A5.1030407@ims.u-tokyo.ac.jp>

KDTree's *.py were missing in MANIFEST.in, which caused them to be skipped when 
creating the source distribution. I've fixed MANIFEST.in in CVS, however the 
source distribution on www.biopython.org is still the old one.
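
A check of this kind can be automated before a release is uploaded. The
following is a minimal sketch, not part of the Biopython build: the function
names and the archive layout (a biopython-1.40b/ top-level directory) are
assumptions made for the example, and the demo archive is built in memory
rather than downloaded.

```python
import io
import tarfile


def missing_members(archive, expected):
    """Return the expected paths that are absent from an open tarfile."""
    present = set(archive.getnames())
    return [path for path in expected if path not in present]


def build_demo_archive(paths):
    """Create a small in-memory tar.gz so the check can be tried offline."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for path in paths:
            info = tarfile.TarInfo(name=path)
            info.size = 0  # empty placeholder files are enough for the demo
            tar.addfile(info)
    buffer.seek(0)
    return tarfile.open(fileobj=buffer, mode="r:gz")


# Hypothetical layout: the archive contains NeighborSearch.py but no KDTree files.
EXPECTED = [
    "biopython-1.40b/Bio/PDB/NeighborSearch.py",
    "biopython-1.40b/Bio/KDTree/__init__.py",
    "biopython-1.40b/Bio/KDTree/KDTree.py",
]
demo = build_demo_archive(["biopython-1.40b/Bio/PDB/NeighborSearch.py"])
print(missing_members(demo, EXPECTED))
```

For a real source distribution, tarfile.open("name-of-archive.tar.gz") would
replace the in-memory demo archive.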

Iddo, are you planning a 1.40 (final) release? The current release is still 1.40 
beta.

--Michiel.

Kristina Rogale Plazonic wrote:

> 
>> KDTree is C++ code, which causes problems on some
>> systems, and hence compilation is disabled by default. Un-commenting
>> the KDTree lines in setup.py and re-installing will quite likely solve
>> your problem.
> 
> 
> Hi, this is the first thing I tried. Then I discovered that some of the 
> KDTree files are MISSING from the source archives of 1.40b on the 
> download page.  I had to fetch the current CVS version to get the 
> complete source.
> 
> Kristina

-- 
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon
From edeveaud at pasteur.fr  Tue May 17 05:57:28 2005
From: edeveaud at pasteur.fr (edeveaud@pasteur.fr)
Date: Tue May 17 05:51:07 2005
Subject: [BioPython] NCBIStandalone Parser problem
In-Reply-To: <426DFB17.5010409@ims.u-tokyo.ac.jp>
References: <20050421091119.GA12744@hebus.sis.pasteur.fr>
	<426DFB17.5010409@ims.u-tokyo.ac.jp>
Message-ID: <20050517095728.GA19359@hebus.sis.pasteur.fr>

On Tue, Apr 26, 2005 at 05:25:59PM +0900, Michiel Jan Laurens de Hoon wrote:
> Could you try this again with Biopython version 1.40b and see if the 
> problem still occurs there? If so, could you send me the query that you are 
> using so I can replicate this error?


Sorry for the long delay; now that the course has finished I will have more
time to dig into the problem.

Anyway, I reproduced the same error with Biopython version 1.40b.

If you want to check with the data, I have put a tar.gz of all the necessary
files here:
<URL:http://instcard1.free.fr/biopython_test.tgz>

The tarball contains:
*) a bank containing the following 2 sequences, CV793585 and 
   CV793586, taken from the genbank nc0421.flat update 

*) the generated index for this bank (formatdb -p F -i test_bank)
   NB: these files may need to be rebuilt

*) 2 query files 
   query_ok: a query giving some hits
   query_crash: the query that produces '***** No hits found ******'
   and leads the parser to crash

*) the basic python script used  
   python ./crash.py query_ok bank/test_bank
   python ./crash.py query_crash bank/test_bank


	Eric
-- 
 E> sorry, but I'm not really used to discussion groups
 Lesson no. 1: reply at the top and delete the message you are replying to
 Deleting it makes reading so much easier!!!
 -+- DrN in <http://neuneu.mine.nu> : Le Neuneu par l'exemple -+-
From jleigh at dal.ca  Tue May 17 13:13:49 2005
From: jleigh at dal.ca (Jessica Leigh)
Date: Tue May 17 13:06:31 2005
Subject: [BioPython] retrieve results with RID
Message-ID: <428A264D.9000607@dal.ca>

Hi,

I'm new to BioPython, and what I REALLY want to do is use blastp, 
restricting results to a particular entrez query.  The blast function 
allows me to do this, but not the qblast... this would be a useful 
addition, it's really easy to add.  My real problem, though, is this: 
when I use either blast or qblast, instead of getting a blast result 
page, I get the RID page (the one that says "This page will NOT be 
automatically updated.")  I know that this is just NCBI being a pain, 
but is there any function in BioPython that allows me to retrieve the 
results associated with an RID?
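
For what it is worth, NCBI's URL interface to BLAST accepts a request of the
form CMD=Get with the RID, so a result page can in principle be fetched by
hand. The sketch below only builds such a URL and does not contact NCBI. The
server address shown is the present-day one and is an assumption here (the
address used in 2005 was different), the RID is made up, and rid_result_url
is a hypothetical helper, not a Biopython function.

```python
from urllib.parse import urlencode

# Assumed address of the BLAST CGI; check the NCBI documentation before use.
BLAST_CGI = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def rid_result_url(rid, format_type="Text"):
    """Build the URL that asks the server for the results stored under an RID."""
    query = urlencode({"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type})
    return BLAST_CGI + "?" + query


# "ABC123" is a made-up RID used purely for illustration.
print(rid_result_url("ABC123"))
```

The results are usually not ready immediately, so a real script would have
to request this URL repeatedly, with a polite delay, until the search has
finished.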

Thanks,
Jessica
From cgw501 at york.ac.uk  Tue May 17 15:07:03 2005
From: cgw501 at york.ac.uk (cgw501@york.ac.uk)
Date: Tue May 17 14:59:33 2005
Subject: [BioPython] alignment processing
Message-ID: <Prayer.1.0.10.0505172007030.19755@webmail0.york.ac.uk>

Hi,

I have a file processing task I'm trying to do with Biopython. I have to 
take a bunch of Clustal alignment files that cover one arm of a whole 
chromosome, strip off the lowercase letters at the end of each sequence, 
and produce a file containing all the stripped sequences together in FASTA 
format. This is what I have so far:

import Bio.Clustalw
from Bio.Alphabet import IUPAC
import string
from Bio.Seq import Seq
from Bio.SeqIO import FASTA
from Bio.SeqRecord import SeqRecord
from sys import *
import sys

inputs = sys.argv[1:-2]
output = open(sys.argv[-1], 'w')


for f in inputs:
    align = Bio.Clustalw.parse_file(f, alphabet=IUPAC.ambiguous_dna)
    lines = align.get_all_seqs()

    strippedAlignRecord = []
    for line in lines:
        lineSeq = line.seq
        lineString = lineSeq.tostring()
        strippedSeq = lineString.rstrip('atcg-')
        strippedSeqObj = Seq(strippedSeq, IUPAC.ambiguous_dna)
        strippedRecObj = SeqRecord(strippedSeqObj, id = line.description)
        out = FASTA.FastaWriter(output)
        out.write(strippedRecObj)

When I run this from the command line I don't get any errors, but the 
outfile is not created. I'm a bit flummoxed. Any ideas?
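
One thing worth checking in the script above is the slice sys.argv[1:-2].
This is only a possible contributing factor, not a confirmed diagnosis: the
slice stops two items before the end of the list, so the last input file is
never read, and with a single input file the loop body never runs at all.
The sketch below uses plain lists in place of sys.argv; the file names and
the helper split_args are made up for the example.

```python
def split_args(argv):
    """Return (slice as posted, slice up to the output file, output name)."""
    as_posted = argv[1:-2]  # stops two items before the end of the list
    corrected = argv[1:-1]  # everything between the script name and the output
    return as_posted, corrected, argv[-1]


# Hypothetical command lines: script name, input files, then the output file.
print(split_args(["strip.py", "arm1.aln", "out.fasta"]))
print(split_args(["strip.py", "arm1.aln", "arm2.aln", "out.fasta"]))

# rstrip() with a set of characters removes only the trailing run of them.
print("ACGTTGacg-tc--".rstrip("atcg-"))
```

It may also help to close the output handle explicitly at the end of the
script, so that everything written is flushed to disk.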

Thanks,

Chris
From amairgen at gmail.com  Fri May 20 13:06:25 2005
From: amairgen at gmail.com (Mattias de Hollander)
Date: Fri May 20 12:58:57 2005
Subject: [BioPython] [ClustalW] alignment score
Message-ID: <6eeafc6305052010062f2e97be@mail.gmail.com>

Is it possible to get the 'alignment score' from a clustalw alignment, just 
like when you run ClustalW over the web?

-- 
Mattias

From fkauff at duke.edu  Fri May 20 13:20:04 2005
From: fkauff at duke.edu (Frank Kauff)
Date: Fri May 20 13:13:53 2005
Subject: [BioPython] [ClustalW] alignment score
In-Reply-To: <6eeafc6305052010062f2e97be@mail.gmail.com>
References: <6eeafc6305052010062f2e97be@mail.gmail.com>
Message-ID: <1116609605.4513.25.camel@osiris.biology.duke.edu>

Mattias,

On Fri, 2005-05-20 at 19:06 +0200, Mattias de Hollander wrote:
> Is it possible to get the 'alignment score' from a clustalw alignment, just 
> like when you run ClustalW over the web?
> 

Clustalw doesn't save the score together with the alignment, but only in
the log (or the on-screen output). I think the only way to get the score
might be to parse the log for the magic words 'Alignment score' and get
the value associated with them.
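
A minimal sketch of that log-parsing approach, using only the standard
library. The sample log text is invented for illustration, and the exact
wording and capitalisation that ClustalW prints may differ between versions,
which is why the pattern below is case-insensitive.

```python
import re

# Matches e.g. "Alignment Score 4021"; the spelling in real logs is assumed.
SCORE_PATTERN = re.compile(r"Alignment\s+Score\s+(-?\d+)", re.IGNORECASE)


def alignment_scores(log_text):
    """Return every alignment score found in a ClustalW log, as integers."""
    return [int(value) for value in SCORE_PATTERN.findall(log_text)]


# Invented sample output, not copied from a real ClustalW run.
SAMPLE_LOG = """\
Start of Multiple Alignment
Group 1: Sequences:   2      Score:1520
Alignment Score 4021
"""

print(alignment_scores(SAMPLE_LOG))
```

The log itself would have to be captured from the standard output of the
clustalw process, since the score is not written to the alignment file.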

Frank

-- 
Frank Kauff
Dept. of Biology
Duke University
Box 90338
Durham, NC 27708
USA

Phone 919-660-7382
Fax 919-660-7293
Web http://www.lutzonilab.net/member/frankkauff.shtml


From julie.bernauer at ibbmc.u-psud.fr  Tue May 24 13:26:13 2005
From: julie.bernauer at ibbmc.u-psud.fr (Julie Bernauer)
Date: Tue May 24 13:25:15 2005
Subject: [BioPython] Comment/Suggestion about Bio.PDB.Polypeptide class. How
	to keep gaps information ?
Message-ID: <1116955573.3946.250.camel@fifi.ibbmc.u-psud.fr>

Hello

Let's imagine we want a fasta file or a seq object containing gaps
describing the amino acids that are present in a structure :

Ex : 1t6b chain X

Using this code :
 for pp in ppd.build_peptides(structure[0][X]):
                        print pp
We get :
	<Polypeptide start=16 end=158>
	<Polypeptide start=175 end=275>
	<Polypeptide start=288 end=303>
	<Polypeptide start=320 end=735>

If we want to bind those peptides together, let's try to define an empty
polypeptide :
	pp1=Polypeptide.Polypeptide([])
and extend it with the peptides we get :

	pp1=Polypeptide.Polypeptide([])
	for pp in ppd.build_peptides(structurecomplex[0][chaineR]):
                      pp1.extend(pp)
	print pp1
	seq=pp1.get_sequence()
	print seq.tostring()

We have :
	<Polypeptide start=16 end=735>
SQGLLGYYFSDLNFQAPMVVTSSTTGDLSIPSSELENIPSENQYFQSAIWSGFIKVKKSDEYTFATSADNHVTMWVDDQEVINKASNSNKIRLEKGRLYQIKIQYQRENPTEKGLDFKLYWTDSQNKKEVISSDNLQLPELKQVPDRDNDGIPDSLEVEGYTVDVKNKRTFLSPWISNIHEKKGLTKYKSSPEKWSTASDPYSDFEKVTGRIDKNVSPEARHPLVAAYPIVHVDMENIILSKNETISKNTSTSRTHTSEVVSAGFSNSNSSTVAIDHSLSLAGERTWAETMGLNTADTARLNANIRYVNTGTAPIYNVLPTTSLVLGKNQTLATIKAKENQLSQILAPNNYYPSKNLAPIALNAQDDFSSTPITMNYNQFLELEKTKQLRLDTDQVYGNIATYNFENGRVRVDTGSNWSEVLPQIQETTARIIFNGKDLNLVERRIAAVNPSDPLETTKPDMTLKEALKIAFGFNEPNGNLQYQGKDITEFDFNFDQQTSQNIKNQLAELNATNIYTVLDKIKLNAKMNILIRDKRFHYDRNNIAVGADESVVKEAHREVINSSTEGLLLNIDKDIRKILSGYIVEIEDTEGLKEVINDRYDMLNISSLRQDGKTFIDFKKYNDKLPLYISNPNYKVNVYAVTKENTIINPSENGDTSTNGIKKILIFSKKGYEIG

i.e.: We totally lose the information of gaps. "pp1" still contains this
information but cannot give it to "seq" even if using the gapped
alphabet.
I know it would be possible to get it from an iteration on residue from
the structure. However, I think it would be better to fill gap with an
'X' or a '-' while doing pp1.get_sequence(). I mean changing the method
get_sequence to handle this case.

Instead of :

	for res in self:
            resname=res.get_resname()
            if to_one_letter_code.has_key(resname):
                resname=to_one_letter_code[resname]
            else:
                resname='X'
            s=s+resname

I think it would be nice to iterate over resseq.

What do you think ? 


-- 
Julie BERNAUER
Equipe de Génomique Structurale          http://www.genomics.eu.org
IBBMC - UMR 8619 - U.P.S. Bât.430          Tel. : +33 1 69 15 31 57
91405 Orsay - FRANCE                       Fax. : +33 1 69 85 37 15

From thamelry at binf.ku.dk  Tue May 24 14:26:32 2005
From: thamelry at binf.ku.dk (thamelry@binf.ku.dk)
Date: Tue May 24 14:25:44 2005
Subject: [BioPython] Comment/Suggestion about Bio.PDB.Polypeptide 
	class. How to keep gaps information ?
In-Reply-To: <1116955573.3946.250.camel@fifi.ibbmc.u-psud.fr>
References: <1116955573.3946.250.camel@fifi.ibbmc.u-psud.fr>
Message-ID: <35376.83.92.3.59.1116959192.squirrel@www.binf.ku.dk>


Hi Julie,

> i.e.: We totally lose the information of gaps. "pp1" still contains this
> information but cannot give it to "seq" even if using the gapped
> alphabet.
> I know it would be possible to get it from an iteration on residue from
> the structure. However, I think it would be better to fill gap with an
> 'X' or a '-' while doing pp1.get_sequence(). I mean changing the method
> get_sequence to handle this case.

I'll start by pointing out that you cannot rely on the resseq numbering
being meaningful AT ALL. There are plenty of structures in the PDB where
residue X is firmly attached to residue X+Y (with Y>1), and structures
where X is not attached to X+1. That's the reason why Bio.PDB uses a
distance criterion to find polypeptides.
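
To make the idea of a distance criterion concrete, here is a minimal sketch
in plain Python. It is not the Bio.PDB code: residues are represented as
simple dictionaries, the function name is hypothetical, and the 1.8 Angstrom
cutoff for the C-N distance is an assumption chosen for the example. A new
segment is started whenever the C atom of one residue is too far from the N
atom of the next, regardless of the residue numbers.

```python
import math


def split_by_peptide_bond(residues, max_cn_distance=1.8):
    """Split residues into segments wherever C(i) and N(i+1) are too far apart.

    Each residue is a dict with "resseq", "N" and "C" keys, where "N" and "C"
    hold (x, y, z) coordinate tuples.
    """
    segments = []
    for residue in residues:
        if segments and math.dist(segments[-1][-1]["C"], residue["N"]) <= max_cn_distance:
            segments[-1].append(residue)  # bonded to the previous residue
        else:
            segments.append([residue])  # chain break: start a new segment
    return segments


# Invented coordinates: 10 and 15 are bonded although the numbering jumps,
# while 15 and 16 are numbered consecutively but lie far apart.
RESIDUES = [
    {"resseq": 10, "N": (0.0, 0.0, 0.0), "C": (2.4, 0.0, 0.0)},
    {"resseq": 15, "N": (3.7, 0.0, 0.0), "C": (6.1, 0.0, 0.0)},
    {"resseq": 16, "N": (20.0, 0.0, 0.0), "C": (22.4, 0.0, 0.0)},
]

for segment in split_by_peptide_bond(RESIDUES):
    print([residue["resseq"] for residue in segment])
```

The example data shows both situations described above: a jump in numbering
without a break, and consecutive numbers with a break.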

OTOH it would certainly be useful to have gap information, but I'd like to
put that in a separate class, i.e. BrokenPolypeptide. PolypeptideBuilder
could have a method build_broken_peptide that would return a
BrokenPolypeptide object. That class could have fancy methods to deal with
gaps and the sequences of the missing parts, for example.

I'll try to add this, but I'm busy at the moment (4 articles in the
pipeline), but you're welcome to give it a try and send me your code :-).

Best regards,

-Thomas


From julie.bernauer at ibbmc.u-psud.fr  Wed May 25 04:59:58 2005
From: julie.bernauer at ibbmc.u-psud.fr (Julie Bernauer)
Date: Wed May 25 04:52:59 2005
Subject: [BioPython] Comment/Suggestion about Bio.PDB.Polypeptide 
	class. How to keep gaps information ?
In-Reply-To: <35376.83.92.3.59.1116959192.squirrel@www.binf.ku.dk>
References: <1116955573.3946.250.camel@fifi.ibbmc.u-psud.fr>
	<35376.83.92.3.59.1116959192.squirrel@www.binf.ku.dk>
Message-ID: <1117011599.3946.257.camel@fifi.ibbmc.u-psud.fr>

On Tue, 2005-05-24 at 20:26 +0200, thamelry@binf.ku.dk wrote:
> Hi Julie,
> [...]
> I'll try to add this, but I'm busy at the moment (4 articles in the
> pipeline), but you're welcome to give it a try and send me your code :-).

Hi Thomas,

Thank you for your quick answer.

Here is a quick and dirty hack that works for me, just see whether you
may use it:

class BrokenPolypeptide(list):
    """
    A broken polypeptide is simply a list of polypeptide objects.
    """
    def get_sequence(self):
        """
        Return the AA sequence, filling gap or unknown residue with X.

        @return: polypeptide sequence
        @rtype: L{Seq}
        """
        s=""
        if self == []:
            end=0
        else :
            end=self[0][-1].get_id()[1]
        for peptide in self:
            start=peptide[0].get_id()[1]
            gaplength=start-end-1  # residues missing strictly between end and start
            for indexgap in range(0, gaplength):
                s=s+'X'
            for res in peptide:
                resname=res.get_resname()
                if to_one_letter_code.has_key(resname):
                    resname=to_one_letter_code[resname]
                else:
                    resname='X'
                s=s+resname
            end=peptide[-1].get_id()[1]
        seq=Seq(s, ProteinAlphabet)
        return seq

HTH, regards,

J.

-- 
Julie BERNAUER
Equipe de Génomique Structurale          http://www.genomics.eu.org
IBBMC - UMR 8619 - U.P.S. Bât.430          Tel. : +33 1 69 15 31 57
91405 Orsay - FRANCE                       Fax. : +33 1 69 85 37 15

From edeveaud at pasteur.fr  Thu May 26 09:31:53 2005
From: edeveaud at pasteur.fr (edeveaud@pasteur.fr)
Date: Thu May 26 09:26:43 2005
Subject: [BioPython] iterative ace parsing
Message-ID: <20050526133153.GA23295@hebus.sis.pasteur.fr>

	Hi,
	
after reading the doc for Bio.Sequencing.Ace

I would like to run some analyses on an assembly composed of 174 contigs 
based on approximately 49000 reads.

The only problem is that parsing the whole ace file at once needs 872M of memory.

My idea was to iterate over the contigs in order to decrease the memory needs,
but the doc claims 
 2) *** DEPRECATED: not entirely suitable for ACE files! 
             Or you can iterate over the contigs of an ace file one by one in
			 the usual way:        
			  
could someone point me to some explanation about this warning ??

is the ace parser suitable for iterative tasks ??

	thanks

-- 
>       dvips -o $@ $<     
Careful not to cut yourself on that thing, you've put scissors ($<)
everywhere :))
-+- Dom in Guide du linuxien pervers - "J'aime pas les Makefile !" -+-
From fkauff at duke.edu  Thu May 26 09:59:48 2005
From: fkauff at duke.edu (Frank Kauff)
Date: Thu May 26 09:52:00 2005
Subject: [BioPython] iterative ace parsing
In-Reply-To: <20050526133153.GA23295@hebus.sis.pasteur.fr>
References: <20050526133153.GA23295@hebus.sis.pasteur.fr>
Message-ID: <1117115988.4496.26.camel@osiris.biology.duke.edu>

Hi,

On Thu, 2005-05-26 at 15:31 +0200, edeveaud@pasteur.fr wrote:
> 	Hi,
> 	
> after reading the doc for Bio.Sequencing.Ace
> 
> I would like to run some analyses on an assembly composed of 174 contigs 
> based on approximately 49000 reads.
> 
> The only problem is that parsing the whole ace file at once needs 872M of memory.
> 
> My idea was to iterate over the contigs in order to decrease the memory needs,
> but the doc claims 
>  2) *** DEPRECATED: not entirely suitable for ACE files! 
>              Or you can iterate over the contigs of an ace file one by one in
> 			 the usual way:        
> 			  
> could someone point me to some explanation about this warning ??
> 

It works fine, in theory. The problem with ace files is that they are
not entirely suitable for contig-by-contig parsing: they can contain
contig-specific information at the very end of the file. So in your
case, after reading contig no. 174, there might still be some more info
left in the file about contigs no. 12, 132, and 160. Depending on what
kind of contigs you have, there might be no such info at all, or it may
just be irrelevant for your analysis. The phrap manual (you're using phrap
to create the contigs?) lists the tags that can appear at the end of an ace
file, so you might want to have a look there and decide whether they are
important for you or not. If not, iterating over contigs should do just
fine. 
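
To illustrate the point about trailing information, here is a minimal sketch
of lazy, block-by-block reading. It is not the Bio.Sequencing.Ace parser:
the function name is hypothetical, the sample text is invented, and the
assumption is that contigs start with a "CO " line and that trailing tag
blocks start with "CT{", "WA{" or "RT{". Only one block is held in memory at
a time, and the tag block that refers back to Contig1 only turns up after
the last contig has been yielded.

```python
def iter_ace_blocks(lines):
    """Yield (kind, block_lines) lazily; kind is "contig" or "tag"."""
    kind, block = None, []
    for line in lines:
        line = line.rstrip("\n")
        starts_contig = line.startswith("CO ")
        starts_tag = line[:3] in ("CT{", "WA{", "RT{")
        if starts_contig or starts_tag:
            if block:
                yield kind, block  # hand over the finished block
            kind = "contig" if starts_contig else "tag"
            block = []
        if kind is not None and line:  # skip the file header and blank lines
            block.append(line)
    if block:
        yield kind, block


# Invented, heavily shortened example; real ACE files hold many more records.
SAMPLE_ACE = """\
AS 2 3

CO Contig1 4 2 1 U
ACGT

CO Contig2 4 1 1 U
TTGA

CT{
Contig1 comment phrap 1 4 050526:120000
}
"""

for kind, block in iter_ace_blocks(SAMPLE_ACE.splitlines()):
    print(kind, block[0])
```

If the trailing tags matter for an analysis, they have to be matched back to
their contigs after the whole file has been read once.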

Frank

> is the ace parser suitable for iterative tasks ??
> 


> 	thank's
> 
-- 
Frank Kauff
Dept. of Biology
Duke University
Box 90338
Durham, NC 27708
USA

Phone 919-660-7382
Fax 919-660-7293
Web http://www.lutzonilab.net/member/frankkauff.shtml


From edeveaud at pasteur.fr  Thu May 26 11:49:03 2005
From: edeveaud at pasteur.fr (edeveaud@pasteur.fr)
Date: Thu May 26 11:43:42 2005
Subject: [BioPython] iterative ace parsing
In-Reply-To: <1117115988.4496.26.camel@osiris.biology.duke.edu>
References: <20050526133153.GA23295@hebus.sis.pasteur.fr>
	<1117115988.4496.26.camel@osiris.biology.duke.edu>
Message-ID: <20050526154903.GA23750@hebus.sis.pasteur.fr>

On Thu, May 26, 2005 at 09:59:48AM -0400, Frank Kauff wrote:
> Hi,
> 
> On Thu, 2005-05-26 at 15:31 +0200, edeveaud@pasteur.fr wrote:
> > 	Hi,
> > 	
> > after reading the doc for Bio.Sequencing.Ace
> > 
> > my idea was to iterate over the contigs in order to decrease the memory
> > needs, but the doc claims 
> >  2) *** DEPRECATED: not entirely suitable for ACE files! 
> >              Or you can iterate over the contigs of an ace file one by one in
> > 			 the usual way:        
> > 			  
> > could someone point me to some explanation about this warning ??
> > 
> 
> It works fine, in theory. The problem with ace files is that they are
> not entirely suitable for contig-by-contig parsing: they can contain
> contig-specific information at the very end of the file. So in your
> case, after reading contig no. 174, there might still be some more info
> left in the file about contigs no. 12, 132, and 160. Depending on what
> kind of contigs you have, there might be no such info at all, or it may
> just be irrelevant for your analysis. The phrap manual (you're using phrap
> to create the contigs?) lists the tags that can appear at the end of an ace
> file, so you might want to have a look there and decide whether they are
> important for you or not. If not, iterating over contigs should do just
> fine. 

Thanks for the clarification.

Yes, indeed we use phred/phrap to create the assembly,
and for the analysis we want to perform we don't care about any tags
set by phrap. We just need the contig coverage and the read starts. 

I'll check the iterative way. 

	Thanks again

	Eric

-- 
  Here, the example is a bit far-fetched. 
  What if we picked a dilemma between fr.comp.os.unix and
  fr.rec.arts.os.unix instead? 
  -+- APM in: Guide du Cabaliste Usenet - La Cabale est-elle barbue ? -+-
From fredgca at hotmail.com  Sat May 28 11:22:59 2005
From: fredgca at hotmail.com (Frederico Arnoldi)
Date: Sat May 28 11:15:42 2005
Subject: [BioPython] Tool for Biomolecular data edition and analysis in
	Python
Message-ID: <BAY20-F3925198F6A696B4AD93EA8BF010@phx.gbl>

Hello,

   My name is Frederico G. Colombo Arnoldi, and I am a PhD student (subject:
Molecular Evolution) in Brazil.
  I have been developing software for biomolecular data editing and
analysis in Python, mainly for gene analysis. It started as a little
tool to help me and became a more serious piece of work. At present the
tool consists of a GTK interface that allows the user to align sequences
with Malign and Clustalw; color previously aligned sequences according to
conservation and residue characteristics; create reverse,
reverse/complement and consensus sequences; search for conserved regions
and for given sequences inside a larger one (such as restriction enzyme
sites and ORFs); generate colored alignment reports; and more.
   The program and its portal have been written in English although, as
anyone can see from my bad English, I'm not a native English speaker.
    I would like to know about the possibility of, and interest in,
integrating it into Biopython. The portal is: 
http://mpalign.incubadora.fapesp.br/portal

    Thanks a lot.
    Frederico


From idoerg at burnham.org  Sun May 29 02:13:37 2005
From: idoerg at burnham.org (Iddo Friedberg)
Date: Sun May 29 02:06:42 2005
Subject: [BioPython] Tool for Biomolecular data edition and analysis in
	Python
In-Reply-To: <BAY20-F3925198F6A696B4AD93EA8BF010@phx.gbl>
Message-ID: <Pine.SGI.4.10.10505282300380.3022179-100000@pines2.ljcrf.edu>

Frederico,

This looks like a really useful tool, thanks for sharing.

One way I can see this fitting into Biopython is as an addition to the
Align module. We'll have to think exactly how.

However, I am wondering if this really fits within Biopython: you have
what seems to me to be a standalone program. Biopython's goal is to provide
building blocks for such tools. So your code is the next step: you have
constructed a house (a very fine one, may I add), while Biopython provides
the lumber, bricks, etc.

Like I said, I'll have to look at it. But with ISMB coming up, it will
probably not be for a month. If someone else has any thoughts on the
matter, please share.

Cheers, and thanks again,

Iddo


--
Iddo Friedberg, Ph.D.
The Burnham Institute
10901 N. Torrey Pines Rd.
La Jolla, CA 92037, USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 646 3171
http://ffas.ljcrf.edu/~iddo
-------------------------------------
Automated Protein Function Prediction Meeting, June 24, 2005
http://ffas.burnham.org/AFP

On Sat, 28 May 2005, Frederico Arnoldi wrote:


> Hello,
> 
>    My name is Frederico G. Colombo Arnoldi, and I am a PhD student (subject:
> Molecular Evolution) in Brazil.
>   I have been developing software for biomolecular data editing and
> analysis in Python, mainly for gene analysis. It started as a little
> tool to help me and became a more serious piece of work. At present the
> tool consists of a GTK interface that allows the user to align sequences
> with Malign and Clustalw; color previously aligned sequences according to
> conservation and residue characteristics; create reverse,
> reverse/complement and consensus sequences; search for conserved regions
> and for given sequences inside a larger one (such as restriction enzyme
> sites and ORFs); generate colored alignment reports; and more.
>    The program and its portal have been written in English although, as
> anyone can see from my bad English, I'm not a native English speaker.
>     I would like to know about the possibility of, and interest in,
> integrating it into Biopython. The portal is: 
> http://mpalign.incubadora.fapesp.br/portal
> 
> 


>     Thanks a lot.
>     Frederico

From janaspe at web.de  Mon May 30 10:10:41 2005
From: janaspe at web.de (Jana Sperschneider)
Date: Mon May 30 10:03:03 2005
Subject: [BioPython] mxtexttools link
Message-ID: <506450146@web.de>

Hi there,

I need to get the mxtexttools package for Windows, the link

http://www.egenix.com/files/python/eGenix-mx-Extensions.html

doesn't seem to work.. maybe the server is down.. can anyone help?

Cheers
Jana

From biopython at maubp.freeserve.co.uk  Mon May 30 10:42:45 2005
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon May 30 10:27:42 2005
Subject: [BioPython] mxtexttools link
In-Reply-To: <506450146@web.de>
References: <506450146@web.de>
Message-ID: <429B2665.1020709@maubp.freeserve.co.uk>

Jana Sperschneider wrote:
> Hi there,
> 
> I need to get the mxtexttools package for Windows, the link
> 
> http://www.egenix.com/files/python/eGenix-mx-Extensions.html
> 
> doesn't seem to work.. maybe the server is down.. can anyone help?
> 
> Cheers
> Jana

Their website does seem to be down.  I do have 
egenix-mx-base-2.0.5.win32-py2.3.exe on my hard disk, which I could 
email you if you like (off the list; it's 574 KB).

Note - you would have to be using Python 2.3 for this to work.

Peter
From janaspe at web.de  Mon May 30 12:26:51 2005
From: janaspe at web.de (Jana Sperschneider)
Date: Mon May 30 12:23:41 2005
Subject: [BioPython] mxtexttools link
Message-ID: <506597832@web.de>

Hi Peter,

thank you so much for your help, would be great if you could send the file to my email address! I have Python 2.4 on my computer, should work?

Cheers

Jana

From amorgan at mitre.org  Tue May 31 17:43:46 2005
From: amorgan at mitre.org (Alexander A. Morgan)
Date: Tue May 31 17:36:39 2005
Subject: [BioPython] qblast through a proxy
Message-ID: <429CDA92.7080704@mitre.org>

Hello:
    Most of the parts of BioPython use urllib to connect to webservices 
which makes using a proxy (without a password at least) very 
straightforward.  However, Blast.NCBIWWW uses the socket library in 
'_send_to_qblast()'.  There doesn't seem to be an easy way to get 
through a proxy using the low level socket library.  Does anyone have a 
quick fix/workaround for this?
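
For comparison, this is roughly what the urllib-style proxy setup looks like
in present-day Python. It is only a sketch: the proxy address is a
placeholder, build_proxy_opener is a hypothetical helper, and it does not by
itself change what '_send_to_qblast()' does, since that function talks to a
socket directly.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that sends http and https requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Placeholder proxy address; no connection is made until opener.open() is called.
opener = build_proxy_opener("http://proxy.example.org:8080")
print([type(h).__name__ for h in opener.handlers if isinstance(h, urllib.request.ProxyHandler)])
```

Code that fetches its pages through such an opener (or through
urllib.request.install_opener) picks up the proxy without further changes,
which is the convenience that the socket-based code lacks.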

Thanks,

Alex