From bugzilla-daemon at portal.open-bio.org  Tue Nov  1 16:31:21 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Nov  1 16:57:29 2005
Subject: [Biopython-dev] [Bug 1885] KEGG Compound db format changes
Message-ID: <200511012131.jA1LVLwo011087@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1885


------- Comment #4 from edmonds@fas.harvard.edu  2005-11-01 16:31 -------
(In reply to comment #3)
> How did you download the new test cases for KEGG compound? Are the existing
> test cases in Tests/KEGG no longer valid? The submitted patch causes
> test_KEGG.py to fail, but I'm not sure if that is due to a bug in the patch or
> whether the existing test cases don't satisfy the current KEGG standard.
> 

The entire KEGG database can be downloaded at
http://www.genome.ad.jp/kegg/kegg5.html , so I took some test cases from there. 

There are two features of the existing test cases that do not resemble how the
entries are currently formatted:

In the past, the entry line used to have only the compound ID.  Now the group
the ligand belongs to is also named.  So on the right side of that line, it now
says "Compound" or "Drug" or "Glycan", ...  All the entries in the database
have that now, so I don't think it makes sense to make it optional just to
accommodate the old test cases.  

In the past, the formula line could come right after the name block or
somewhere at the end of the entry.  Now all formula lines come right after the
name block.  

Changing these two features of compound.sample and compound.irregular causes
test_KEGG.py not to fail.  
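
The first change can be sketched roughly as follows. This is only an illustration: the helper name parse_entry_line, the assumed whitespace-separated layout (keyword, ID, group) and both sample lines are made up for the example and are not taken from the submitted patch or from the KEGG files.

```python
def parse_entry_line(line):
    """Split a KEGG-style ENTRY line into (compound_id, group).

    Assumed layout: the keyword ENTRY, the ID, then the group name,
    separated by whitespace. Lines without a group are rejected.
    """
    fields = line.split()
    if len(fields) != 3 or fields[0] != "ENTRY":
        raise ValueError("unexpected ENTRY line: %r" % line)
    return fields[1], fields[2]


# Hypothetical sample lines in the new and the old style.
new_style = "ENTRY       C00001                      Compound"
old_style = "ENTRY       C00001"

print(parse_entry_line(new_style))
try:
    parse_entry_line(old_style)
except ValueError as err:
    print("rejected:", err)
```

Treating the group name as mandatory keeps the parser strict, which matches the reasoning above that the old test files should change rather than the format rules.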

As an aside, neither what I submitted nor the original works for the glycan or
reaction parts of the ligand database, and I suspect that they also don't work
properly for the enzyme part of the database.  


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org  Tue Nov  1 21:26:58 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Nov  1 21:57:29 2005
Subject: [Biopython-dev] [Bug 1885] KEGG Compound db format changes
Message-ID: <200511020226.jA22QwUg017373@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1885


mdehoon@ims.u-tokyo.ac.jp changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #5 from mdehoon@ims.u-tokyo.ac.jp  2005-11-01 21:26 -------
Accepted; in CVS.
I agree though that the KEGG parsers are still not quite up to date.
Unfortunately I don't understand Martel well enough to do something about it
now.


From mcolosimo at mitre.org  Wed Nov  2 14:01:49 2005
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Wed Nov  2 14:42:47 2005
Subject: [Biopython-dev] SeqFeature's FeatureLocation for GenBank
Message-ID: <5bc01dafdc6f69c88c127c0895efb7bf@mitre.org>

I want to point out the very bizarre behavior of FeatureLocations when 
using GenBank.FeatureParser (well to me anyways).

  When I was testing out some code, I noticed that the start positions 
were 1 less than in the GenBank Record, but the end positions were 
correct. My first thought was that this must be a bug, and so I went 
looking for it. I soon gave up because I just don't have the time to 
understand all the code that is involved (I was going to file a bug 
report). So, I just added 1 to the start positions and went on to get 
the features from the DNA. Suddenly I now understand why the positions 
were like that: slicing!

Unless I missed something, I didn't see anything talking about this 
behavior. Is this consistent with other parsers? If so, I would suggest 
that this is included in the Cookbook and that the classes are modified 
so that when printed (__str__) reports 1 instead of 0 (basically +1). 
Also, it would be nice to be able to do things like location.start + 1 
instead of location.start.position + 1.

Marc

From biopython-dev at maubp.freeserve.co.uk  Thu Nov  3 05:38:38 2005
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Thu Nov  3 06:02:36 2005
Subject: [Biopython-dev] SeqFeature's FeatureLocation for GenBank
In-Reply-To: <5bc01dafdc6f69c88c127c0895efb7bf@mitre.org>
References: <5bc01dafdc6f69c88c127c0895efb7bf@mitre.org>
Message-ID: <4369E8AE.1080701@maubp.freeserve.co.uk>

Marc Colosimo wrote:
> I want to point out the very bizarre behavior of FeatureLocations when 
> using GenBank.FeatureParser (well to me anyways).

It's by design...

> When I was testing out some code, I noticed that the start positions 
> were 1 less that in the GenBank Record, but the end positions were 
> correct. My first thought was that this must be a bug and such went 
> looking for it. I soon gave up because I just don't have the time to 
> understand all the code that is involved (I was going to file a bug 
> report). So, I just added 1 to the start positions and went on to get 
> the features from the DNA. Suddenly I now understand why the positions 
> were like that: slicing!

Exactly, e.g. something like:

seq[feature.location.start.position:feature.location.end.position]

> Unless I missed something, I didn't see anything talking about this 
> behavior.

Python (like C) starts counting at zero, and this behaviour is 
deliberate to make handling of the Biopython sequence objects as easy as 
possible.  Why? Because the Biopython DNA/RNA/protein sequences are meant 
to behave as much like Python strings as possible.

For example, to extract letters 5 to 7 from "abcdefghijk" (using one 
based counting, i.e. "efg") in Python you say "abcdefghijk"[4:7]

Suppose your gene is bases 150..300 (using one based counting as in a 
GenBank file).

To extract this from the full DNA sequence, you would use something 
like: fullsequence[149:300]
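
As a minimal sketch of that conversion (the helper name genbank_to_slice and the 400-base stand-in sequence are invented for illustration, they are not part of Biopython):

```python
def genbank_to_slice(start, end):
    """Convert a one-based, inclusive GenBank range to Python slice bounds."""
    return start - 1, end


letters = "abcdefghijk"
lo, hi = genbank_to_slice(5, 7)
print(letters[lo:hi])   # efg

# A made-up 400-base sequence standing in for the full DNA sequence.
fullsequence = "ACGT" * 100
lo, hi = genbank_to_slice(150, 300)
gene = fullsequence[lo:hi]
print(lo, hi, len(gene))   # 149 300 151
```

Only the start moves by one; the end number is unchanged because a Python slice stops just before its end index.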

I suppose the CookBook may have assumed people were familiar with Python 
strings already...

> Is this consistent with other parsers? If so, I would suggest
> that this is included in the Cookbook ...

It should be consistent with other parsers.  Would you be able to 
suggest some rewording of the CookBook to clarify this?

(I'm sure I have seen a similar question on the mailing list in the 
past, so something could be improved)

> ... and that the classes are modified so that when printed (__str__)
> reports 1 instead of 0 (basically +1).

That would be bad for people using the existing behaviour.

You'll get used to it (especially if you have to switch between zero 
based and one based languages).

Peter
From mcolosimo at mitre.org  Thu Nov  3 08:45:35 2005
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Thu Nov  3 08:44:17 2005
Subject: [Biopython-dev] SeqFeature's FeatureLocation for GenBank
In-Reply-To: <4369E8AE.1080701@maubp.freeserve.co.uk>
References: <5bc01dafdc6f69c88c127c0895efb7bf@mitre.org>
	<4369E8AE.1080701@maubp.freeserve.co.uk>
Message-ID: <436A147F.8050005@mitre.org>

Thank you for the response. However, I know how lists work in Python 
(and C, and Java, etc...). That was not my question. Here is some code to 
show you what I mean about the inconsistent behavior of Locations.

from Bio import GenBank
gi_list = GenBank.search_for("AB077698")
ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank',
                                   parser=GenBank.FeatureParser())
seq_rec = ncbi_dict[gi_list[0]]
print len(seq_rec.seq)               # returns 2701, which is correct
# now let's look at a feature location
source_feature = seq_rec.features[0]
print source_feature.type            # should be 'source'
print source_feature.location        # (0..2701); in the gb record it was
                                     # (1..2701). The start is correct, the end is NOT
# get a slice
seq_rec.seq[source_feature.location.start.position :
            source_feature.location.end.position]
# returns the correct thing
# now let's see what the first nt looks like
seq_rec.seq[source_feature.location.start.position]   # works fine
# now let's see what the last nt looks like
seq_rec.seq[source_feature.location.end.position]
# IndexError: string index out of range
# The correct answer is...
seq_rec.seq[source_feature.location.end.position - 1]   # now, this is
                                                        # different from how start position works!
# but wait, there is more...
# What if I didn't know about the funny end position business and wrote this:
seq_rec.seq[source_feature.location.start.position :
            source_feature.location.end.position + 1]
# This works, but it is not correct because it has added a nt from the
# beginning to the end (slices are nice about that)
# If I were to use this on the other internal features I would get the
# wrong thing (by one nt)

So, either location End should be 2700,  Start should be 1, or state 
'explicitly' what Locations positions represent. But not 0..2701. 
Changing the end position probably would mess up lots of code. So that 
leaves documentation. You can add my code above to the cookbook 
<http://biopython.org/docs/tutorial/Tutorial004.html#toc16>.

Marc


From biopython-dev at maubp.freeserve.co.uk  Thu Nov  3 09:17:47 2005
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Thu Nov  3 09:33:13 2005
Subject: [Biopython-dev] SeqFeature's FeatureLocation for GenBank
In-Reply-To: <436A147F.8050005@mitre.org>
References: <5bc01dafdc6f69c88c127c0895efb7bf@mitre.org>
	<4369E8AE.1080701@maubp.freeserve.co.uk>
	<436A147F.8050005@mitre.org>
Message-ID: <436A1C0B.5080008@maubp.freeserve.co.uk>

Marc Colosimo wrote:
> Thank you for the response. However,  I know how lists work in Python 
> (and C, and Java, etc...). That was not question. Here is some code to 
> show you what I mean about the inconsistent behavior of Locations.
> 
> from Bio import GenBank
> gi_list = GenBank.search_for("AB077698")
> ncbi_dict = GenBank.NCBIDictionary( 'nucleotide', 'genbank', parser = 
> GenBank.FeatureParser() )
> seq_rec = ncbi_dict[gi_list[0]]
> print len(seq_rec.seq)               # returns 2701, which is correct
> # now lets look at a feature location
> source_feature = seq_rec.features[0]
> print source_feature.type           # should be 'source'
> print source_feature.location      # (0..2701), in the gb record it was 
> (1..2701). The start is correct, the end is NOT

The start and end ARE correct in that seq_rec.seq[0:2701] will return 
all of the sequence.

The first nucleotide is seq_rec.seq[0]
The last nucleotide is seq_rec.seq[2700]
The length is 2701

It makes more sense in the case of (sub)features, rather than the source 
'feature' which is everything.

In the same way, a string of length 5, e.g. "abcde"
"abcde"[0] == "a"
"abcde"[4] == "e"
"abcde"[0:5] == "abcde"
"abcde"[5] is out of range.

 From memory, the location object is actually rather more complicated 
because it copes with nasty locations like 123..<150 plus joins etc, as 
well as the simple cases like 123..150

It took me a while to get my head round the location object too.

> # get a slice
> seq_rec.seq[source_feature.location.start.position : 
> source_feature.location.end.position]
> # returns the correct thing

> # now lets see what the first nt looks like
> seq_rec.seq[source_feature.location.start.position]   # works fine

> # now lets see what the last nt looks like
> seq_rec.seq[source_feature.location.end.position]
> IndexError: string index out of range

This is correct.  See my example with a string "abcde"[5]

> # The correct answer for is...
> seq_rec.seq[source_feature.location.end.position - 1]   # now, this is 
> different from how start position works!

It has to be different for the splicing trick.  Again, it's the "fault" 
of trying to be the same as Python strings.
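
A short sketch of why the end is one past the last letter, using the same plain string rather than a Biopython object (an illustration only):

```python
seq = "abcde"
start, end = 0, 5          # slice-style bounds for the whole string

assert seq[start:end] == "abcde"
assert seq[start] == "a"          # first letter is at start
assert seq[end - 1] == "e"        # last letter is at end - 1
assert end - start == len(seq)    # length needs no +1 correction

# Adjacent ranges share a boundary and rejoin without gaps or overlap.
left, right = seq[0:2], seq[2:5]
print(left, right, left + right)   # ab cde abcde

# Slicing past the end is silently clamped, while indexing past it fails.
print(seq[0:6])                    # abcde
```

The clamping in the last line is why an accidental "+ 1" on the end goes unnoticed for a feature that spans the whole sequence, yet picks up one extra letter for a feature in the middle.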

> # but wait there is more...
> # What if I didn't know about the funny end position business and wrote 
> this,
> seq_rec.seq[source_feature.location.start.position: 
> source_feature.location.end.position + 1]
> # This works, but it is not correct because it has added a nt from the 
> beginning to the end (slices are nice about that)
> # If I were to use this on the other internal features I would get the 
> wrong thing (by one nt)
> 
> So, either location End should be 2700,  Start should be 1, or state 
> 'explicitly' what Locations positions represent. But not 0..2701. 

I personally am happy with having start 0, end 2701 for a GenBank 
location of 1..2701, and I think it is logical.

However, the documentation could be improved.

> Changing the end position probably would mess up lots of code. So that 
> leaves documentation. You can add my code above to the cookbook 
> <http://biopython.org/docs/tutorial/Tutorial004.html#toc16>.

Maybe an extra sub section, before the current "3.7.2.2  Locations" for 
the location simple case?  i.e. No joins, no fuzzy locations.  Just some 
very simple examples...

Peter
From mcolosimo at mitre.org  Thu Nov  3 13:35:34 2005
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Thu Nov  3 13:58:37 2005
Subject: [Biopython-dev] SeqFeature's FeatureLocation for GenBank
In-Reply-To: <436A1C0B.5080008@maubp.freeserve.co.uk>
References: <5bc01dafdc6f69c88c127c0895efb7bf@mitre.org>
	<4369E8AE.1080701@maubp.freeserve.co.uk>
	<436A147F.8050005@mitre.org>
	<436A1C0B.5080008@maubp.freeserve.co.uk>
Message-ID: <09567ab0c0445c1073172ab020441db9@mitre.org>

Okay, I think we are arguing in a circle here and this could go on for 
a long time.

Since we both have in the past had to get our heads around this 
behavior (which to me indicates that the behavior is not intuitive), I 
suggest that we update the Cookbook. I think my code is a good example 
(I can even change it to include CDS instead of the source feature). 
Also, something like the following,

Under 3.7.2.1:

location
	- The location of the SeqFeature on the sequence that you are dealing 
with. The locations are designed to return the sequence when slicing. 
Thus, the end position is ONE more than the actual end position in the 
sequence.

Under  3.7.2.2:

<insert code snippet> for ExactPositions.

Marc

On Nov 3, 2005, at 9:17 AM, Peter wrote:

> Marc Colosimo wrote:
>> # The correct answer for is...
>> seq_rec.seq[source_feature.location.end.position - 1]   # now, this 
>> is different from how start position works!
>
> It has to be different for the splicing trick.  Again, its the "fault" 
> of trying to be the same as python strings.

  I don't see the connection between Location and Sequences behaving 
like python strings. The first part is the key to my confusion 
("splicing trick")


From mdehoon at c2b2.columbia.edu  Thu Nov  3 15:05:20 2005
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Thu Nov  3 15:04:52 2005
Subject: [Biopython-dev] SeqFeature's FeatureLocation for GenBank
In-Reply-To: <436A1C0B.5080008@maubp.freeserve.co.uk>
References: <5bc01dafdc6f69c88c127c0895efb7bf@mitre.org>	<4369E8AE.1080701@maubp.freeserve.co.uk>	<436A147F.8050005@mitre.org>
	<436A1C0B.5080008@maubp.freeserve.co.uk>
Message-ID: <436A6D80.7050109@c2b2.columbia.edu>

I think the confusion is coming from the way FeatureLocation prints 
itself. (0..2701) looks too much like a GenBank-style location. If 
FeatureLocation were to print (0:2701) instead, it's pretty clear that 
this is a Python-style slice. One solution might be to let 
FeatureLocation inherit from list, and override as needed for abstract 
positions.
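
A rough sketch of the printing idea, for exact positions only. The class SliceLocation and its methods are hypothetical and are not the existing FeatureLocation; the sketch also does not inherit from list, it only shows the slice-style string form:

```python
class SliceLocation:
    """Hypothetical stand-in for a simple, exact feature location."""

    def __init__(self, start, end):
        self.start = start   # zero-based, inclusive
        self.end = end       # zero-based, exclusive

    @classmethod
    def from_genbank(cls, first, last):
        """Build from a one-based, inclusive GenBank range such as 1..2701."""
        return cls(first - 1, last)

    def __str__(self):
        # A colon instead of ".." signals Python slice semantics.
        return "(%d:%d)" % (self.start, self.end)

    def extract(self, sequence):
        """Return the part of the sequence covered by this location."""
        return sequence[self.start:self.end]


loc = SliceLocation.from_genbank(1, 2701)
print(loc)                            # (0:2701)
print(len(loc.extract("A" * 2701)))   # 2701
```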

--Michiel.


-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032


From mcolosimo at mitre.org  Fri Nov  4 10:20:15 2005
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Fri Nov  4 10:21:47 2005
Subject: [Biopython-dev] SeqFeature's FeatureLocation for GenBank
In-Reply-To: <436A6D80.7050109@c2b2.columbia.edu>
References: <5bc01dafdc6f69c88c127c0895efb7bf@mitre.org>	<4369E8AE.1080701@maubp.freeserve.co.uk>	<436A147F.8050005@mitre.org>
	<436A1C0B.5080008@maubp.freeserve.co.uk>
	<436A6D80.7050109@c2b2.columbia.edu>
Message-ID: <436B7C2F.7050102@mitre.org>

Michiel,

That would probably help. Along with some additions to the cookbook.

Marc


From bugzilla-daemon at portal.open-bio.org  Fri Nov  4 14:00:16 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Nov  4 14:57:31 2005
Subject: [Biopython-dev] [Bug 1747] GenBank parser is very slow and memory
	hungry for large input files
Message-ID: <200511041900.jA4J0FrS018788@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1747


------- Comment #7 from mdehoon@ims.u-tokyo.ac.jp  2005-11-04 14:00 -------
This patch causes an error when running the example in section 3.4.1 in the
tutorial/cookbook:

Python 2.4.1 (#1, Aug 25 2005, 12:45:44)
[GCC 3.4.4 (cygming special) (gdc 0.12, using dmd 0.125)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import GenBank
>>>
>>> gi_list = GenBank.search_for("Opuntia AND rpl16")
>>> ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank')
>>> gb_record = ncbi_dict[gi_list[0]]
>>> record_parser = GenBank.FeatureParser()
>>> ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank',parser = record_parser)
>>> gb_seqrecord = ncbi_dict[gi_list[0]]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 1736, in __getitem__
    return self.parser.parse(handle)
  File "/usr/local/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 219, in parse
    self._scanner.feed(handle, self._consumer)
  File "/usr/local/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 1261, in feed
    line = handle.readline()
AttributeError: ReseekFile instance has no attribute 'readline'
>>>

Can this be fixed? I'm pretty much in favor of a hand-written parser instead of
Martel, because it's easier to understand and maintain (there are several other
GenBank bugs waiting).


From bugzilla-daemon at portal.open-bio.org  Mon Nov  7 04:30:26 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Mon Nov  7 04:57:39 2005
Subject: [Biopython-dev] [Bug 1897] New: In IsoelectricPoint.py infinite
	loop observed. 
Message-ID: <200511070930.jA79UQQA010394@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1897

           Summary: In IsoelectricPoint.py infinite loop observed.
           Product: Biopython
           Version: 1.24
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev@biopython.org
        ReportedBy: clowney.lester@aist.go.jp


Using getIsoElectricPoint, some proteins (just one for me) can put you into an
infinite loop. There is a stopping condition that apparently is too coarse.
Using smaller pH increments allows the test to pass without oscillating.

This is the fixed fragment I used, where delta pH was changed from 0.001 to
0.0005.


Frag of IsoelectricPoint.py:

    while abs(Charge) > 0.01:
        if Charge > 0:
            pH += 0.0005  # was 0.001
        else:
            pH -= 0.0005  # was 0.001
        Charge = self._chargeR(pH, Cterm, Nterm)
        print Charge
    print "returning pH", pH
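
A fixed pH step is only guaranteed to land inside the window abs(Charge) <= 0.01
when a single step changes the charge by less than the width of that window
(0.02); with a steeper charge curve the loop may jump over the window in both
directions and never stop. Halving the step reduces the jump but does not remove
the risk. A minimal sketch of one alternative, bisection, is given below. It is
not what IsoelectricPoint.py does: the names net_charge and isoelectric_point,
the pK values and the example peptide are illustrative assumptions, and the code
is written for current Python rather than Python 2.4.

```python
def net_charge(pH, basic_groups, acidic_groups):
    """Toy Henderson-Hasselbalch net charge; each group is a (pK, count) pair."""
    charge = 0.0
    for pK, count in basic_groups:
        # Basic groups carry +1 while protonated.
        charge += count / (1.0 + 10.0 ** (pH - pK))
    for pK, count in acidic_groups:
        # Acidic groups carry -1 once deprotonated.
        charge -= count / (1.0 + 10.0 ** (pK - pH))
    return charge


def isoelectric_point(charge_at, low=0.0, high=14.0, tolerance=1e-4):
    """Bisect on pH, assuming charge_at(pH) only decreases as pH rises."""
    while high - low > tolerance:
        middle = (low + high) / 2.0
        if charge_at(middle) > 0:
            low = middle   # still positive, so the root lies at higher pH
        else:
            high = middle  # zero or negative, so the root lies at lower pH
    return (low + high) / 2.0


# Hypothetical peptide: N-terminus plus one lysine, C-terminus plus two aspartates.
basic = [(9.0, 1), (10.0, 1)]
acidic = [(2.0, 1), (4.05, 2)]
pI = isoelectric_point(lambda pH: net_charge(pH, basic, acidic))
```

Because the search interval halves on every pass, the loop always terminates,
whatever the slope of the charge curve.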


Regards,
Les


From bugzilla-daemon at portal.open-bio.org  Mon Nov  7 08:15:02 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Mon Nov  7 08:57:35 2005
Subject: [Biopython-dev] [Bug 1747] GenBank parser is very slow and memory
	hungry for large input files
Message-ID: <200511071315.jA7DF2ac015757@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1747


------- Comment #8 from biopython-bugzilla@maubp.freeserve.co.uk  2005-11-07 08:15 -------
I was aware there was some problem with the NCBIDictionary support (which had
been noted on the mailing list).

The problem shown by Michael's example (comment 7 on the bug report) is
due to ReseekFile.py only supporting the 'read' method, and not the 'readline'
method.  According to the comments in this file, this is all the Martel parsers
needed.

I tried adding the following to the class ReseekFile in ReseekFile.py, and this
seems to fix Michael's example.

    def _readline(self, size):
        """The readline support is just a quick guess..."""
        if size < 0:
            y = self.file.readline()
            z = self.buffer_file.readline() + y
            self.buffer_file.write(y)
            return z
        if size == 0:
            return ""
        x = self.buffer_file.readline(size)
        if len(x) < size:
            y = self.file.readline(size - len(x))
            self.buffer_file.write(y)
            return x + y
        return x

    def readline(self, size = -1):
        x = self._readline(size)
        if self.at_beginning and x:
            self.at_beginning = 0
        self._check_no_buffer()
        return x

I want to stress that I'm not entirely sure that my 'readline' code is valid,
it was just my best guess based on how the 'read' method was done.  And the
test functions at the end of the ReseekFile.py file could be extended...
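
One point any readline on a buffering wrapper has to settle is when a line
served from the buffer is already complete, so that nothing more is pulled from
the underlying file. The sketch below illustrates that check in isolation. It is
neither Bio/ReseekFile.py nor Bio/EUtils/ReseekFile.py: the class name
ReplayableHandle and its rewind method are hypothetical, and the code is written
for current Python.

```python
import io


class ReplayableHandle(object):
    """Hypothetical wrapper that remembers what was read so it can be replayed."""

    def __init__(self, handle):
        self._handle = handle
        self._buffer = io.StringIO()  # everything read from handle so far

    def rewind(self):
        """Replay from the start of the remembered data."""
        self._buffer.seek(0)

    def readline(self, size=-1):
        line = self._buffer.readline(size)
        if line.endswith("\n"):
            return line  # a complete line was already buffered
        if 0 <= size <= len(line):
            return line  # the size limit was reached inside the buffer
        # The buffer is exhausted here, so writing appends to its end.
        remaining = -1 if size < 0 else size - len(line)
        fresh = self._handle.readline(remaining)
        self._buffer.write(fresh)
        return line + fresh


wrapped = ReplayableHandle(io.StringIO(u"first\nsecond\n"))
start = wrapped.readline()
wrapped.rewind()
replayed = [wrapped.readline(), wrapped.readline(), wrapped.readline()]
```

After the rewind, the first line comes from the buffer and the second from the
underlying handle; the final call returns an empty string at end of file.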

It would help if I had used the NCBIDictionary before ;)

P.S. I have just tried this change and the GenBank/__init__.py patch against
BioPython 1.41 on Linux, and test_GenBank.py passed fine.

P.P.S. Would it be easy to add an offline version of Michael's example using
NCBIDictionary to the GenBank unit test?


From bugzilla-daemon at portal.open-bio.org  Mon Nov  7 15:18:55 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Mon Nov  7 15:57:36 2005
Subject: [Biopython-dev] [Bug 1747] GenBank parser is very slow and memory
	hungry for large input files
Message-ID: <200511072018.jA7KItD5022646@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1747


------- Comment #9 from mdehoon@ims.u-tokyo.ac.jp  2005-11-07 15:18 -------
Sorry, I'm not following. ReseekFile already has a readline method.
Also, do we need to use ReseekFile if we're not using Martel?


From bugzilla-daemon at portal.open-bio.org  Mon Nov  7 18:47:26 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Mon Nov  7 18:57:33 2005
Subject: [Biopython-dev] [Bug 1747] GenBank parser is very slow and memory
	hungry for large input files
Message-ID: <200511072347.jA7NlQrt000492@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1747


------- Comment #10 from biopython-bugzilla@maubp.freeserve.co.uk  2005-11-07 18:47 -------
Mystery solved: There are two different ReseekFile.py files in BioPython.  The
one I changed (based on tracing the exception thrown) lives in
Bio/ReseekFile.py

Revision: 1.3, Sun Mar 21 16:56:53 2004 UTC (19 months, 2 weeks ago) by
chapmanb

There is a second version in Bio/EUtils/ReseekFile.py which appears to be more
advanced (more comments, supports readline, readlines, ...)

Revision: 1.1, Fri Jun 13 00:49:37 2003 UTC (2 years, 4 months ago) by dalke 

This version is however "older", but this is only due to a minor comment
related change by Brad Chapman on the first file.

It looks like Bio/ReseekFile.py should be removed, and Bio/EUtils/ReseekFile.py
used instead.  Andrew Dalke's implementation of readline is a safer bet than my
quick hack, plus he included some test cases for it as well. 

I haven't tried this yet as it is late here, and I need to go to sleep now
instead ;)


From bugzilla-daemon at portal.open-bio.org  Tue Nov  8 05:13:41 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Nov  8 05:57:33 2005
Subject: [Biopython-dev] [Bug 1747] GenBank parser is very slow and memory
	hungry for large input files
Message-ID: <200511081013.jA8ADfOc013443@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1747


------- Comment #11 from biopython-bugzilla@maubp.freeserve.co.uk  2005-11-08 05:13 -------
OK, revised instructions:-

(1) Apply my patch to Bio/GenBank/__init__.py

(2) Remove the old file Bio/ReseekFile.py in favour of just
Bio/EUtils/ReseekFile.py

(3) Update FormatIO.py line 4:

Change 'import ReseekFile'
To 'from Bio.EUtils import ReseekFile'

(4) Update Bio/config/DBRegistry.py line 229:

Change 'from Bio.ReseekFile import ReseekFile'
To 'from Bio.EUtils.ReseekFile import ReseekFile'

Then Michael's example in comment 7 using GenBank.NCBIDictionary works.

(My grep skills are a bit rusty, but I didn't notice any other uses of
ReseekFile)


From bugzilla-daemon at portal.open-bio.org  Tue Nov  8 11:58:22 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Nov  8 12:57:39 2005
Subject: [Biopython-dev] [Bug 1747] GenBank parser is very slow and memory
	hungry for large input files
Message-ID: <200511081658.jA8GwMYH026815@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1747


mdehoon@ims.u-tokyo.ac.jp changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #12 from mdehoon@ims.u-tokyo.ac.jp  2005-11-08 11:58 -------
Patch accepted in CVS, thanks.


From bugzilla-daemon at portal.open-bio.org  Tue Nov  8 13:32:19 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Nov  8 13:57:34 2005
Subject: [Biopython-dev] [Bug 1747] GenBank parser is very slow and memory
	hungry for large input files
Message-ID: <200511081832.jA8IWJDh028552@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1747


biopython-bugzilla@maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
OtherBugsDependingO|                            |1899
              nThis|                            |


From bugzilla-daemon at portal.open-bio.org  Tue Nov  8 14:28:44 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Nov  8 14:57:34 2005
Subject: [Biopython-dev] [Bug 1680] Problems with the GenBank indexing
Message-ID: <200511081928.jA8JSi7a029940@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1680


mdehoon@ims.u-tokyo.ac.jp changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         OS/Version|Windows XP                  |All


------- Comment #2 from mdehoon@ims.u-tokyo.ac.jp  2005-11-08 14:28 -------
This file, with the spaces between the records, can be downloaded from Entrez
Nucleotide by selecting the three records, setting "Display" to "GenBank", and
"Send to" to "File". So I agree with Sameet that this is a bug in Bio.GenBank.
This bug still exists in Biopython 1.41.


From bugzilla-daemon at portal.open-bio.org  Wed Nov  9 07:16:30 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Wed Nov  9 07:57:45 2005
Subject: [Biopython-dev] [Bug 1680] Problems with the GenBank indexing
Message-ID: <200511091216.jA9CGUYV020040@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1680


------- Comment #3 from biopython-bugzilla@maubp.freeserve.co.uk  2005-11-09 07:16 -------
(In reply to comment #1)
> This is because the Martel/Mindy iterator does not recognize spaces
> between the records in the file.  If you remove the spaces from the
> file, the code will index the file without complaint.
..
> One fix might be to modify the system to ignore whitespaces between
> records. But that may cause problems if the system were applied to a
> format where the number of blank lines were important.

Where is the "format" used by the Martel/Mindy iterator defined?

In the normal case, GenBank.index_file() simply calls
SimpleSeqRecord.create_flatdb() to do the work.

Does this use the same "format" for all types of file when asked to create a
flat database?
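
The fix floated in the quoted comment, ignoring blank lines between records, can
be sketched without Martel or Mindy as a plain generator. This is only an
illustration under stated assumptions: the function name iter_genbank_records is
hypothetical, each record is assumed to end with a "//" line, and the code is
written for current Python.

```python
import io


def iter_genbank_records(handle):
    """Yield each record's text, skipping blank lines between records."""
    current = []
    for line in handle:
        if not current and not line.strip():
            continue  # blank line before the next record starts
        current.append(line)
        if line.rstrip() == "//":
            yield "".join(current)
            current = []
    if current:
        # Trailing text without a closing "//" is still handed back.
        yield "".join(current)


sample = io.StringIO(u"LOCUS       A\n//\n\n\nLOCUS       B\n//\n")
records = list(iter_genbank_records(sample))
```

Blank lines inside a record are kept, so the concern about formats where blank
lines are significant only applies to the gaps between records.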


From bugzilla-daemon at portal.open-bio.org  Wed Nov  9 07:46:14 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Wed Nov  9 07:57:47 2005
Subject: [Biopython-dev] [Bug 1773] Martel.Parser.ParserPositionException
Message-ID: <200511091246.jA9CkE8W020644@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1773


------- Comment #1 from biopython-bugzilla@maubp.freeserve.co.uk  2005-11-09 07:46 -------
It looks to me like fixing bug 1680 may also fix this.


From bugzilla-daemon at portal.open-bio.org  Wed Nov  9 07:09:11 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Wed Nov  9 07:57:48 2005
Subject: [Biopython-dev] [Bug 1762] Bio.GenBank.FeatureParser dislikes valid
	accessions and locus lines
Message-ID: <200511091209.jA9C9BTp019930@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1762


------- Comment #3 from biopython-bugzilla@maubp.freeserve.co.uk  2005-11-09 07:09 -------
The limitation with truncated LOCUS lines is still present with the switch to
my non-martel parser (see bug 1747), but this also means the original patch is
not applicable anymore.

The new parser seems to be OK with this style ACCESSION line:

ACCESSION   U00096 AE000111-AE000510

e.g. using RecordParser() it is accessible as cur_record.accession ==
['U00096', 'AE000111-AE000510']
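
As a rough illustration of how such a line maps onto that list, the sketch
below splits an ACCESSION line on whitespace. The helper name
split_accession_line is hypothetical and is not taken from Bio/GenBank; it
assumes the whole entry sits on one line.

```python
def split_accession_line(line):
    """Return the whitespace-separated entries after the ACCESSION keyword."""
    keyword, _, rest = line.partition(" ")
    if keyword != "ACCESSION":
        raise ValueError("Not an ACCESSION line: %r" % line)
    return rest.split()


accessions = split_accession_line("ACCESSION   U00096 AE000111-AE000510")
```

The range AE000111-AE000510 is kept as a single entry rather than expanded.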

Test script:
===========================================================
import time
from Bio import GenBank
#gb_file = "/tmp/U00096_full_locus.gbk"
gb_file = "/tmp/U00096_truncated_locus.gbk"

feature_parser = GenBank.FeatureParser()

gb_handle = open(gb_file, 'r')

start_time = time.time()

gb_iterator = GenBank.Iterator(gb_handle, feature_parser)

count = 0
while 1:
     print "Starting...",
     cur_record = gb_iterator.next()
     print "Done"

     if cur_record is None:
         break

     count = count + 1

     # now do something with the record
     print count, cur_record.name, len(cur_record.features), len(cur_record.seq)
     if 'data_file_division' in cur_record.annotations :
         print cur_record.annotations['data_file_division']
     if 'date' in cur_record.annotations :
         print cur_record.annotations['date']

job_time = time.time() - start_time

print "Time elapsed %0.2f seconds for %s" % (job_time, gb_file)
==============================================================

Test script output for the undoctored GenBank file from the NCBI's website
(sent to file, I just searched for U00096 in google):

==============================================================
Starting... Done
1 U00096 8877 4639675
BCT
08-SEP-2005
Starting... Done
Time elapsed 79.05 seconds for /tmp/U00096_full_locus.gbk
==============================================================

Test script output for the truncated locus line version, where I edited the
first line by hand from:

LOCUS       U00096               4639675 bp    DNA     circular BCT

to:

LOCUS       U00096

==============================================================
Starting...

Traceback (most recent call last):
  File "/tmp/U00096_test.py", line 17, in -toplevel-
    cur_record = gb_iterator.next()
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 129, in next
    return self._parser.parse(File.StringHandle(data))
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 219, in parse
    self._scanner.feed(handle, self._consumer)
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 1382, in feed
    assert False, \
AssertionError: Did not recognise the LOCUS line layout:
LOCUS       U00096
==============================================================

The patch should be straightforward (I'll try to do this this afternoon) but
note Jan T. Kim's warning:-

> I haven't checked whether missing division / length /
> DNA/RNA/protein / circular/linear information results in
> appropriate defaults in the objects created by parsing.
> As long as the corresponding members are not used, there
> should not be any problem.
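
One way to make the defaults in that warning explicit is to split the LOCUS line
into tokens and leave every missing field as None. The sketch below is an
assumption-laden illustration, not the code in the patch: the function name
parse_locus_line and the dictionary keys are hypothetical, and it handles only
whitespace-separated layouts like the two lines shown above.

```python
def parse_locus_line(line):
    """Split a LOCUS line into fields; fields that are absent stay None."""
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "LOCUS":
        raise ValueError("Not a usable LOCUS line: %r" % line)
    info = {"name": tokens[1], "size": None, "residue_type": None,
            "topology": None, "division": None}
    rest = tokens[2:]
    # The length, when present, is a number followed by its unit.
    if len(rest) >= 2 and rest[0].isdigit() and rest[1] in ("bp", "aa"):
        info["size"] = int(rest[0])
        rest = rest[2:]
    for token in rest:
        if token in ("linear", "circular"):
            info["topology"] = token
        elif info["residue_type"] is None:
            info["residue_type"] = token
        elif info["division"] is None:
            info["division"] = token
    return info


full = parse_locus_line(
    "LOCUS       U00096               4639675 bp    DNA     circular BCT")
truncated = parse_locus_line("LOCUS       U00096")
```

Code that later reads size or division then has to cope with None, which is
exactly the situation the warning describes.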

Peter


From bugzilla-daemon at portal.open-bio.org  Wed Nov  9 10:49:25 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Wed Nov  9 10:57:59 2005
Subject: [Biopython-dev] [Bug 1762] Bio.GenBank.FeatureParser dislikes valid
	accessions and locus lines
Message-ID: <200511091549.jA9FnPZh026146@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1762


biopython-bugzilla@maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #202 is|0                           |1
           obsolete|                            |


------- Comment #4 from biopython-bugzilla@maubp.freeserve.co.uk  2005-11-09 10:49 -------
Created an attachment (id=247)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=247&action=view)
Patch to Bio/GenBank/__init__.py

Patch to my non-martel GenBank parser to:

(1) tackle the truncated LOCUS line problem (i.e. this bug)

Plus a few changes that should be on bug 1899 really:

(2) started to split the feed function into sub functions
(3) minor changes to comments to remove references to Martel
(4) removed the ErrorParser class used by the Martel parser

Patch created with:

diff cvs__init__.py __init__.py > patch.txt

test_GenBank.py still passes (only tested on Windows).

Sample output from the test script posted earlier, this time the truncated
LOCUS line is accepted:

>>> 
Starting... Done
1 U00096 8877 4639675
BCT
08-SEP-2005
Starting... Done
Time elapsed 48.34 seconds for c:\temp\U00096_full_locus.gbk
>>> 
Starting... Done
1 U00096 8877 4639675
Starting... Done
Time elapsed 49.01 seconds for c:\temp\U00096_truncated_locus.gbk


From bugzilla-daemon at portal.open-bio.org  Wed Nov  9 12:42:05 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Wed Nov  9 12:57:37 2005
Subject: [Biopython-dev] [Bug 1902] Change notation for FeatureLocation
	string representation
Message-ID: <200511091742.jA9Hg5YD029584@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1902


------- Comment #1 from biopython-bugzilla@maubp.freeserve.co.uk  2005-11-09 12:42 -------
Created an attachment (id=248)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=248&action=view)
Patch to Bio/SeqFeature.py

Patch as described in original bug report


From bugzilla-daemon at portal.open-bio.org  Wed Nov  9 12:41:04 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Wed Nov  9 12:57:40 2005
Subject: [Biopython-dev] [Bug 1902] New: Change notation for FeatureLocation
	string representation
Message-ID: <200511091741.jA9Hf4M1029526@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1902

           Summary: Change notation for FeatureLocation string
                    representation
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev@biopython.org
        ReportedBy: biopython-bugzilla@maubp.freeserve.co.uk


This follows from a discussion on the development mailing list between Marc
Colosimo, Michiel de Hoon and myself on how GenBank locations are represented
in BioPython.

In particular Michiel suggested changing the representation so that it looks
less like the GenBank syntax and more like the Python slicing syntax:-

http://www.biopython.org/pipermail/biopython-dev/2005-November/002176.html

Changes
-------
Change the __str__ function from:

"(%s..%s)" % (self._start, self._end)

To:

"[%s:%s]" % (self._start, self._end)

Add the following to the __doc__ string for FeatureLocation.__str__

Returns a representation of the location.  For the simple case this
uses the python slicing syntax, [122:150] (zero based counting) which
GenBank would call 123..150 (one based counting).

Add the following to the __doc__ string for class FeatureLocation

Note that the start and end location numbering is designed with slicing
in mind, thus a GenBank entry of 123..150 (one based counting) becomes
a location of [122:150] (zero based counting).
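
The relationship between the two notations can be checked with a small
stand-alone sketch. SimpleLocation and from_genbank are hypothetical names used
only for illustration; this is not Bio.SeqFeature.FeatureLocation, and only the
simple start..end case is handled.

```python
class SimpleLocation(object):
    """Toy location holding zero-based, end-exclusive positions."""

    def __init__(self, start, end):
        self._start = start
        self._end = end

    @classmethod
    def from_genbank(cls, text):
        """Convert a one-based, inclusive 'first..last' string."""
        first, last = text.split("..")
        return cls(int(first) - 1, int(last))

    def __str__(self):
        return "[%s:%s]" % (self._start, self._end)

    def extract(self, sequence):
        """Slice the sequence with the stored positions."""
        return sequence[self._start:self._end]


location = SimpleLocation.from_genbank("123..150")
sequence = "ACGT" * 50
fragment = location.extract(sequence)
```

GenBank's 123..150 covers 150 - 123 + 1 = 28 bases, and the slice [122:150]
returns the same 28.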

(patch to follow...)

Note:
-----
This will require updating the expected output for the test_Genbank and
test_Location files in Bio/Tests/output/

Impact on existing code SHOULD be minimal, unless anyone is actively parsing
the string output of a location somewhere?

Documentation:
--------------
The tutorial does include some examples of output using the current syntax, and
thus would also need updating, see Section 3.7.2.2  Locations

As per the mailing list discussion, the whole topic of dealing with locations
could be expanded further...


From biopython-dev at maubp.freeserve.co.uk  Thu Nov 10 11:08:00 2005
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Thu Nov 10 12:17:38 2005
Subject: [Biopython-dev] Updates to the tutorial for parsing GenBank files
Message-ID: <43737060.4070006@maubp.freeserve.co.uk>

There should be a patch attached for Biopython Doc/Tutorial.tex which 
tries to clarify GenBank parsing.

Created on Windows using:-

diff cvs_Tutorial.tex new_Tutorial.tex -E -Naur > patch.txt

In particular, I have tried to make it clear that GenBank.Iterator() and 
GenBank.index_file() are overkill/unnecessary when dealing with GenBank 
files which contain only a single record (which is the typical case in my 
personal experience).
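
To find out which case applies to a given file, it is enough to count the LOCUS
lines before choosing between a single parse call and an iterator. The helper
below is a hypothetical sketch, not part of Bio.GenBank, and assumes that every
record starts with a LOCUS line in the first column.

```python
import io


def count_genbank_records(handle):
    """Count records by counting lines that begin with the LOCUS keyword."""
    count = 0
    for line in handle:
        if line.startswith("LOCUS "):
            count += 1
    return count


single = io.StringIO(u"LOCUS       AE017199\nORIGIN\n//\n")
several = io.StringIO(u"LOCUS       A\n//\nLOCUS       B\n//\n")
single_count = count_genbank_records(single)
several_count = count_genbank_records(several)
```

A count of one suggests calling the parser directly on the file; anything higher
suggests the iterator or the index.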

My changes add an introductory example: parsing a small bacterial genome 
(a single large GenBank record), before moving on to the 
GenBank.Iterator() and GenBank.index_file() examples.

I have also pointed out that the multi-record example GenBank file used 
in these examples (cor6_6.gb) is included in the downloadable BioPython 
source code.

Plus there is a minor correction to the GenBank.index_file example, 
len(gb_dict) gives 6, not 7.

Peter
-------------- next part --------------
--- cvs_Tutorial.tex	2005-11-10 12:49:45.685675200 +0000
+++ new_Tutorial.tex	2005-11-10 15:45:58.688907200 +0000
@@ -1455,7 +1455,6 @@
 
 One very nice feature of the GenBank libraries is the ability to automate retrieval of entries from GenBank. This is very convenient for creating scripts that automate a lot of your daily work. In this example we'll show how to query the NCBI databases, and to retrieve the records from the query.
 
-
 First, we want to make a query and find out the ids of the records to retrieve. Here we'll do a quick search for \emph{Opuntia}, my favorite organism (since I work on it). We can do quick search and get back the GIs (GenBank identifiers) for all of the corresponding records:
 
 \begin{verbatim}
@@ -1511,7 +1510,6 @@
 
 For more information of formats you can parse GenBank records into, please see section~\ref{sec:gb-parsing}.
 
-
 Using these automated query retrieval functionality is a big plus over doing things by hand. Additionally, the retrieval has nice built in features like a time-delay, which will prevent NCBI from getting mad at you and blocking your access.
 
 \subsection{Parsing GenBank records}
@@ -1525,39 +1523,69 @@
   \item FeatureParser -- This parses the raw record in a SeqRecord object with all of the feature table information represented in SeqFeatures (see section~\ref{sec:advanced-seq} for more info on these objects). This is best to use if you are interested in getting things in a more standard format.
 \end{enumerate}
 
-Either way you chose to go, the most common usage of these will be creating an iterator and parsing through a file on GenBank records. Doing this is very similar to how things are done in other formats, as the following code demonstrates:
+Depending on the type of GenBank files you are interested in, they will either contain a single record, or multiple records.  Each record will start with a {\tt LOCUS} line, various other header lines, a list of features, and finally the sequence data, ending with a {\tt //} line.
+
+Dealing with a GenBank file containing a single record is very easy.  For example, lets use a small bacterial genome, {\it Nanoarchaeum equitans Kin4-M} (RefSeq NC\_005213, GenBank AE017199) which can be downloaded from the NCBI here \ahrefurl{\url{ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/Nanoarchaeum_equitans/AE017199.gbk}} (only 1.15 MB):
 
 \begin{verbatim}
 from Bio import GenBank
 
-gb_file = "my_file.gb"
+feature_parser = GenBank.FeatureParser()
+
+gb_file = "AE017199.gbk"
+
+gb_record = feature_parser.parse(open(gb_file,"r"))
+
+# now do something with the record
+print "Name %s, %i features" % (gb_record.name, len(gb_record.features))
+print gb_record.seq
+\end{verbatim}
+
+This gives the following output:
+
+\begin{verbatim}
+Name AE017199, 1107 features
+Seq('TCTCGCAGAGTTCTTTTTTGTATTAACAAACCCAAAACCCATAGAATTTAATGAACCCAA ...', IUPACAmbiguousDNA())
+\end{verbatim}
+
+\subsection{Iterating over GenBank records}
+\label{sec:gb-parsing-iterator}
+
+For multi-record GenBank files, the most common usage will be creating an iterator, and parsing through the file record by record. Doing this is very similar to how things are done in other formats, as the following code demonstrates, using an example file \verb|cor6_6.gb| included in the BioPython source code under the Tests/GenBank/ directory:
+
+\begin{verbatim}
+from Bio import GenBank
+
+gb_file = "cor6_6.gb"
 gb_handle = open(gb_file, 'r')
 
 feature_parser = GenBank.FeatureParser()
 
 gb_iterator = GenBank.Iterator(gb_handle, feature_parser)
 
-while 1:
+while True:
    cur_record = gb_iterator.next()
 
    if cur_record is None:
        break
 
    # now do something with the record
+   print "Name %s, %i features" % (cur_record.name, len(cur_record.features))
    print cur_record.seq
 \end{verbatim}
 
 This just iterates over a GenBank file, parsing it into SeqRecord and SeqFeature objects, and prints out the Seq objects representing the sequences in the record.
 
-
 As with other formats, you have lots of tools for dealing with GenBank records. This should make it possible to do whatever you need to with GenBank.
 
 \subsection{Making your very own GenBank database}
 
 One very cool thing that you can do is set up your own personal GenBank database and access it like a dictionary (this can be extra cool because you can also allow access to these local databases over a network using BioCorba -- see the BioCorba documentation for more information).
 
+Note - this is only worth doing {\it if} your GenBank file contains more than one record.
 
-Making a local database first involves creating an index file, which will allow quick access to any record in the file. To do this, we use the index file function:
+Making a local database first involves creating an index file, which will allow quick access to any record in the file. To do this, we use the index file function.
+Again, this example uses the file \verb|cor6_6.gb| which is included in the BioPython source code under the Tests/GenBank/ directory:
 
 \begin{verbatim}
 >>> from Bio import GenBank
@@ -1566,7 +1594,7 @@
 >>> GenBank.index_file(dict_file, index_file)
 \end{verbatim}
 
-This will create the file \verb|my_index_file.idx|. Now, we can use this index to create a dictionary object that allows individual access to every record. Like the Iterator and NCBIDictionary interfaces, we can either get back raw records, or we can pass the dictionary a parser that will parse the records before returning them. In this case, we pass a \verb|FeatureParser| so that when we get a record, then we retrieve a SeqRecord object. 
+This will create a directory called \verb|cor6_6.idx| containing the index files. Now, we can use this index to create a dictionary object that allows individual access to every record. Like the Iterator and NCBIDictionary interfaces, we can either get back raw records, or we can pass the dictionary a parser that will parse the records before returning them. In this case, we pass a \verb|FeatureParser| so that when we get a record, then we retrieve a SeqRecord object. 
 
 
 Setting up the dictionary is as easy as one line:
@@ -1579,7 +1607,7 @@
 
 \begin{verbatim}
 >>> len(gb_dict)
-7
+6
 >>> gb_dict.keys()
 ['L31939', 'AJ237582', 'X62281', 'AF297471', 'M81224', 'X55053']
 \end{verbatim}
@@ -1589,6 +1617,8 @@
 \begin{verbatim}
 >>> gb_dict['AJ237582']
 <Bio.SeqRecord.SeqRecord instance at 0x102fdd8c>
+>>> print len(gb_dict['X55053'].features)
+3
 \end{verbatim}
 
 \section{Dealing with alignments}
From bugzilla-daemon at portal.open-bio.org  Fri Nov 11 11:16:50 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Nov 11 11:57:46 2005
Subject: [Biopython-dev] [Bug 1903] New: GenBank parses fails with unusual
	quoting and line breaks
Message-ID: <200511111616.jABGGooT030390@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1903

           Summary: GenBank parses fails with unusual quoting and line
                    breaks
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: minor
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev@biopython.org
        ReportedBy: biopython-bugzilla@maubp.freeserve.co.uk


I've been testing my new parser (recently checked in) and have discovered an
oddity that it currently fails on, locus Bd2676 in this file:

LOCUS       NC_005363            3782950 bp    DNA     circular BCT 22-NOV-2004
DEFINITION  Bdellovibrio bacteriovorus HD100, complete genome.
ACCESSION   NC_005363
VERSION     NC_005363.1  GI:42521650
KEYWORDS    complete genome.
SOURCE      Bdellovibrio bacteriovorus HD100
  ORGANISM  Bdellovibrio bacteriovorus HD100

Look at this bit, /note="\n                     hypothetical protein"

Normally this would be written as /note="hypothetical protein"


     gene            2594436..2596394
                     /locus_tag="Bd2676"
                     /db_xref="GeneID:2736184"
     CDS             2594436..2596394
                     /locus_tag="Bd2676"
                     /note="
                     hypothetical protein"
                     /codon_start=1
                     /evidence=not_experimental
                     /transl_table=11
                     /product="hypothetical protein"
                     /protein_id="NP_969474.1"
                     /db_xref="GI:42524094"
                     /db_xref="GeneID:2736184"
                     /translation="MKRAYYSNDISRFLVDAPSSILGLLSKAHDFTLEEQQKNAWVKQ
                     IEILQTSLQGIPGHVYFEYSIPRVGKRVDLIVISGNALFSIEFKVGSSQFDSYAADQA
                     MDYALDLKNFHEGSHQIDIFPVLVATEATHTEALPSRFDDGVWSLTRTNSQNLSTHLQ
                     ALKTNAKGPEIDLLKWDASGYKPTPTIVEAAKALYSGHQVEEISRSDAGATNLSITSA
                     ALKKIIDESISQKKKTICLVTGVPGAGKTLVGLDLATSWNNPVANQHAVLLSGNGPLV
                     EILQEALAKDEANRSKASSPVKLSAARAKAKSFIQNIHHFRDEGLRTDAPPPEKVVIF
                     DEAQRAWNKTQTTKFMKTKKGVADFDHSEPEYLIKLMDRHADWAVIICLVGGGQEINT
                     GEAGISEWLDAIHNKFPHWQVCLPSTTSSADIPNIEKFVQAFSSRHHVDKNLHLTASV
                     RSFRSERVSDFMSALLDKDIDKAKALYSEIKEKYPIKLTRSLEEAKLWLKEKSRGNER
                     YGILASSGAGRLKAHGLDVKSRIEPVNWFLNDKKDVRSSFFMEDVATEFHVQGLELDW
                     TCVAWDIDFILSLKKETKFRSFAGTKWNNIKSSTDQSYLKNKYRVLLTRARQGLVLFV
                     PKGDPHDGTRPPGDYEELFSYLQYILND"
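
One way to tolerate a qualifier whose opening quote is followed directly by a
line break is to join the continuation lines first and strip the quotes
afterwards. The sketch below only illustrates that idea; it is not the pending
patch, the function name parse_qualifier is hypothetical, and it is written for
current Python.

```python
def parse_qualifier(lines):
    """Turn the raw lines of one feature qualifier into a (key, value) pair."""
    parts = [line.strip() for line in lines]
    if not parts or not parts[0].startswith("/"):
        raise ValueError("Expected a qualifier line starting with '/'")
    key, _, first_value = parts[0][1:].partition("=")
    # Join before unquoting so a quote alone on the first line is harmless.
    value = " ".join([first_value] + parts[1:])
    if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
        value = value[1:-1].strip()
    return key, value


odd = parse_qualifier(['                     /note="',
                       '                     hypothetical protein"'])
usual = parse_qualifier(['                     /note="hypothetical protein"'])
```

Joining with a space suits free-text qualifiers such as /note; a /translation
value would need its lines joined without spaces instead.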

Patch to follow, once I work out what exactly my code is doing wrong.

Also, I have an existing patch pending for Bio/GenBank/__init__.py attached to
bug 1762.  Should any patch for this new bug be against the current CVS file,
or against the version after applying the bug 1762 patch?


From bugzilla-daemon at portal.open-bio.org  Thu Nov 17 13:27:53 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Thu Nov 17 13:57:51 2005
Subject: [Biopython-dev] [Bug 1762] Bio.GenBank.FeatureParser dislikes valid
	accessions and locus lines
Message-ID: <200511171827.jAHIRrwr015387@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1762


------- Comment #5 from mdehoon@ims.u-tokyo.ac.jp  2005-11-17 13:27 -------
I downloaded U00096 from GenBank, ran it through seqret, and tried to parse the
resulting file with the patched GenBank parser. Whereas the LOCUS line doesn't
cause a problem any more, there are other (seqret-specific?) lines that cause
the parsing to fail (starting with the "BASE COUNT" line).
Can this patch be fixed? It's better to start from the file created by seqret
to make sure all nonstandard lines can be handled.


From bugzilla-daemon at portal.open-bio.org  Thu Nov 17 18:37:05 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Thu Nov 17 18:57:54 2005
Subject: [Biopython-dev] [Bug 1762] Bio.GenBank.FeatureParser dislikes valid
	accessions and locus lines
Message-ID: <200511172337.jAHNb50s022480@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1762


------- Comment #6 from biopython-bugzilla@maubp.freeserve.co.uk  2005-11-17 18:37 -------
I've never used seqret, so if you (Michiel) wouldn't mind emailing me this file
(U00096 GenBank after seqret has changed it) then I'll be happy to have a look
at the BASE COUNT problem.


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org  Thu Nov 17 19:38:19 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Thu Nov 17 19:57:45 2005
Subject: [Biopython-dev] [Bug 1762] Bio.GenBank.FeatureParser dislikes valid
	accessions and locus lines
Message-ID: <200511180038.jAI0cJPE023052@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1762


------- Comment #7 from mdehoon@ims.u-tokyo.ac.jp  2005-11-17 19:38 -------
I put the file (gzipped) at
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/u00096.gb.gz


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org  Mon Nov 21 10:17:36 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Mon Nov 21 10:57:50 2005
Subject: [Biopython-dev] [Bug 1903] GenBank parses fails with unusual
	quoting and line breaks
Message-ID: <200511211517.jALFHaFY021490@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1903


biopython-bugzilla@maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|biopython-dev@biopython.org |biopython-
                   |                            |bugzilla@maubp.freeserve.co.
                   |                            |uk


------- Comment #1 from biopython-bugzilla@maubp.freeserve.co.uk  2005-11-21 10:17 -------
> Also, I have an existing patch pending for Bio/GenBank/__init__.py
> attached to bug 1762.  Should any patch for this new bug be against
> the current CVS file, or against the version after applying the
> bug 1762 patch?

As the patch for bug 1762 needed further work, I included the one-line change
needed for this problem (bug 1903) in that revised patch (attachment 252).


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org  Mon Nov 21 10:13:27 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Mon Nov 21 10:57:51 2005
Subject: [Biopython-dev] [Bug 1762] Bio.GenBank.FeatureParser dislikes valid
	accessions and locus lines
Message-ID: <200511211513.jALFDQr7021397@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1762


biopython-bugzilla@maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #247 is|0                           |1
           obsolete|                            |
         AssignedTo|biopython-dev@biopython.org |biopython-
                   |                            |bugzilla@maubp.freeserve.co.
                   |                            |uk
             Status|NEW                         |ASSIGNED


------- Comment #8 from biopython-bugzilla@maubp.freeserve.co.uk  2005-11-21 10:13 -------
Created an attachment (id=252)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=252&action=view)
Bio/GenBank/__init__.py patch

Patches to my non-martel GenBank parser to:

(1) tackle the truncated LOCUS line problem (i.e. this bug)
(2) tackle missing features as in seqret output (i.e. this bug)

Plus a few changes that should be on bug 1899 really:

(3) started to split the feed function into sub functions
(4) minor changes to comments to remove references to Martel
(5) removed the ErrorParser class used by the Martel parser

And fix for bug 1903 as well:

(6) GenBank parses fails with unusual quoting and line breaks

The genbank unit test still works.

Michiel - could you create another (smaller) seqret file to go in the GenBank
unit tests?


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From matteo.io at email.it  Mon Nov 21 15:40:43 2005
From: matteo.io at email.it (Matteo)
Date: Mon Nov 21 19:38:20 2005
Subject: [Biopython-dev] Blast Parser error
Message-ID: <438230CB.7060707@email.it>

Hi, I'm using the latest Biopython (1.41) with Python 2.4 on Windows. I 
have a problem with a script that works fine on Linux, and I don't know what 
to do.
This is the code:

blast_db   = os.path.join(os.getcwd(), FASTA_FILE)
blast_file = os.path.join(os.getcwd(), QUERY_FILE)
blast_exe  = os.path.join(os.getcwd(), 'blastall.exe')
blast_out, error_info = NCBIStandalone.blastall(blast_exe, 'blastn', 
blast_db, blast_file)
save_file = open('my_blast.out', 'w')
blast_results = blast_out.read()
save_file.write(blast_results)
save_file.close()
blast_out = open('my_blast.out', 'r')
blastparser = NCBIStandalone.BlastParser()
alignrecord = blastparser.parse(blast_out)
[...]

And this is the error:

File "blastMaker.py", line 55, in ?
    alignrecord = blastparser.parse(blast_out)
rd.Blast
  File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIStandalone.py", line 623, in parse
    self._scanner.feed(handle, self._consumer)
  File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIStandalone.py", line 93, in feed
    read_and_call_until(uhandle, consumer.noevent, contains='BLAST')
  File "C:\Python24\Lib\site-packages\Bio\ParserSupport.py", line 335, in read_and_call_until
    line = safe_readline(uhandle)
  File "C:\Python24\Lib\site-packages\Bio\ParserSupport.py", line 411, in safe_readline
    raise SyntaxError, "Unexpected end of stream."
SyntaxError: Unexpected end of stream.

Could someone help me?
Thanks in advance,

--
Matteo De Felice
University of Rome "Roma Tre"
 
 
From matteo.io at email.it  Tue Nov 22 04:53:06 2005
From: matteo.io at email.it (Matteo)
Date: Tue Nov 22 04:50:17 2005
Subject: [Biopython-dev] Blast Parser error
In-Reply-To: <200511220859.58707.sohm@iaf.cnrs-gif.fr>
References: <438230CB.7060707@email.it>
	<200511220859.58707.sohm@iaf.cnrs-gif.fr>
Message-ID: <4382EA82.9070306@email.it>

Frederic Sohm ha scritto:

>Hi, 
>
>Just guessing.
>Did you check that blast_results is correct ?
>What did you get in error_info ?
>  
>
If I put a "print error_info" I get something like:
<open file 'C:\PATH\TO\PROGRAM\blastall.exe -p blastn -d 
C:\PATH\TO\PROGRAM\refsets.fasta -i C:\PATH\TO\PROGRAM\query.fasta', 
mode 'r' at 0x00AF6AD0>

>If everything is normal there, then the end of line is different in windows 
>('\r\n') and linux ('\n'), it might just be that.
>So try :
>
>...
>>save_file.write(blast_results)
>>save_file.close()
>>#blast_out = open('my_blast.out', 'r')
>blast_out = open('my_blast.out', 'rU')
>>blastparser = NCBIStandalone.BlastParser()
>>alignrecord = blastparser.parse(blast_out)
>>[...]
>
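
The line-ending suggestion quoted above can be checked in isolation. A minimal sketch, assuming a current Python 3, where text mode already translates '\r\n' to '\n' on reading (the 'rU' flag in the quoted code is the Python 2 spelling). The helper name `read_lines_universal`, the file name and the file contents are invented for the demonstration.

```python
import os
import tempfile


def read_lines_universal(path):
    """Read a text file with universal newline translation."""
    # newline=None (the default) turns '\r\n' and '\r' into '\n' on read
    with open(path, "r", newline=None) as handle:
        return handle.readlines()


# Write a small file with Windows line endings, in binary mode so that
# nothing is translated on the way out.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_blast.out")
with open(demo_path, "wb") as handle:
    handle.write(b"BLASTN\r\nQuery= demo\r\n")

lines = read_lines_universal(demo_path)
print(lines)
```

If the lines come back ending in a plain '\n', line endings alone are unlikely to explain an "Unexpected end of stream" error.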
At the end of the program the output "my_blast.out" is blank! I tried 
launching blastall from the command line and THEN parsing the output 
with biopython, and that way it works! Maybe NCBIStandalone has a 
problem... I have downloaded the CVS version but there is no change.
And remember that the SAME code works on Linux. I also tried 
downloading the version that I have on my Ubuntu Linux (2.2.10) but there 
is no change.
Thank you for helping me,

Matteo De Felice
University of Rome "Roma Tre"

 
From mdehoon at c2b2.columbia.edu  Wed Nov 23 13:10:20 2005
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Wed Nov 23 13:17:34 2005
Subject: [Biopython-dev] Blast Parser error
Message-ID: <6CA15ADD82E5724F88CB53D50E61C9AE9ECD9E@cgcmail.cgc.cpmc.columbia.edu>

Can you use the XML parser in NCBIXML instead?
See the updated tutorial on the biopython website on how to use it.

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032



From biopython-dev at maubp.freeserve.co.uk  Wed Nov 23 17:11:36 2005
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Wed Nov 23 20:31:42 2005
Subject: [Biopython-dev] Blast Parser error
In-Reply-To: <4382EA82.9070306@email.it>
References: <438230CB.7060707@email.it>	<200511220859.58707.sohm@iaf.cnrs-gif.fr>
	<4382EA82.9070306@email.it>
Message-ID: <4384E918.7090807@maubp.freeserve.co.uk>

Matteo wrote:
> this is the simple code:
> 
> blast_db   = os.path.join(os.getcwd(), FASTA_FILE)
> blast_file = os.path.join(os.getcwd(), QUERY_FILE)
> blast_exe  = os.path.join(os.getcwd(), 'blastall.exe')
> blast_out, error_info = NCBIStandalone.blastall(blast_exe, 'blastn', blast_db, blast_file)
> save_file = open('my_blast.out', 'w')
> blast_results = blast_out.read()
> save_file.write(blast_results)
> save_file.close()
> blast_out = open('my_blast.out', 'r')
> blastparser = NCBIStandalone.BlastParser()
> alignrecord = blastparser.parse(blast_out)
> [...]
>
...
 > At the end of the program the output "my_blast.out" is blank!
 > I tried launching blastall from the command line and THEN
 > parsing the output with biopython, and that way it works!

At Frederic Sohm's suggestion (off list?):

Matteo wrote:
> If I put a "print error_info" I get something like:
> <open file 'C:\PATH\TO\PROGRAM\blastall.exe -p blastn -d 
> C:\PATH\TO\PROGRAM\refsets.fasta -i C:\PATH\TO\PROGRAM\query.fasta', 
> mode 'r' at 0x00AF6AD0>

Could you post the actual message with the real "PATH TO PROGRAM" 
included?  I'm wondering if this is a problem with spaces in 
paths/filenames.
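
If spaces do turn out to be the cause, quoting each argument when the command string is built would avoid it. A rough sketch, not what NCBIStandalone does internally: `quote_if_needed` and `build_command` are hypothetical helpers, the paths are invented, and only the -p, -d and -i options visible in the message above are used.

```python
def quote_if_needed(arg):
    """Wrap an argument in double quotes when it contains whitespace."""
    if any(ch.isspace() for ch in arg) and not arg.startswith('"'):
        return '"%s"' % arg
    return arg


def build_command(exe, program, database, infile):
    """Assemble a blastall-style command line with quoted arguments."""
    parts = [exe, "-p", program, "-d", database, "-i", infile]
    return " ".join(quote_if_needed(part) for part in parts)


# Invented paths; the first one contains a space on purpose.
cmd = build_command(
    r"C:\Program Files\blast\blastall.exe",
    "blastn",
    r"C:\data\refsets.fasta",
    r"C:\data\query.fasta",
)
print(cmd)
```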

Also in your example script, you are explicitly creating a file and 
saving the blast output to it.  It is possible that you need to wait a 
second or two on Windows for the file to be properly closed, before you 
can open it and parse it (just a guess!)

When I used NCBIStandalone.BlastParser (which worked for me on both 
Linux and Windows) I followed the cookbook approach, which doesn't 
create a temp file in this way.

What happens if you try this:

blast_db   = os.path.join(os.getcwd(), FASTA_FILE)
blast_file = os.path.join(os.getcwd(), QUERY_FILE)
blast_exe  = os.path.join(os.getcwd(), 'blastall.exe')
blast_out, error_info = NCBIStandalone.blastall(blast_exe, \
                              'blastn', blast_db, blast_file)
# save_file = open('my_blast.out', 'w')
# blast_results = blast_out.read()
# save_file.write(blast_results)
# save_file.close()
# blast_out = open('my_blast.out', 'r')
blastparser = NCBIStandalone.BlastParser()
alignrecord = blastparser.parse(blast_out)
[...]

You don't get to keep a copy of the raw blast output though.
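
To keep a copy as well, one option is to read the output once, write it to disk, and hand the parser an in-memory handle instead of reopening the file. This is a sketch only: `save_and_rewrap` is a hypothetical helper, the `io.StringIO` objects stand in for the two handles returned by blastall, and a real run would pass the returned handle to `blastparser.parse`. It also raises, quoting the error stream, when the output is empty, which is the symptom reported above.

```python
import io
import os
import tempfile


def save_and_rewrap(out_handle, err_handle, copy_path):
    """Save the raw output to copy_path and return a new handle over the same text."""
    results = out_handle.read()
    errors = err_handle.read()
    if not results.strip():
        # An empty result usually means the program itself failed to run.
        raise ValueError("empty BLAST output; error stream said: %r" % errors)
    with open(copy_path, "w") as save_file:
        save_file.write(results)
    return io.StringIO(results)


# Stand-in handles with invented content.
fake_out = io.StringIO("BLASTN demo output\n")
fake_err = io.StringIO("")
copy_path = os.path.join(tempfile.mkdtemp(), "my_blast.out")

handle = save_and_rewrap(fake_out, fake_err, copy_path)
print(handle.readline().rstrip())
```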

I would try this for you now, but I'm on a Linux box at the moment.

Peter

From bugzilla-daemon at portal.open-bio.org  Tue Nov 29 16:17:47 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Nov 29 16:58:01 2005
Subject: [Biopython-dev] [Bug 1909] New: Format issue with GenBank with
	segmented BACs (eg GI:55276707)
Message-ID: <200511292117.jATLHl3G015910@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1909

           Summary: Format issue with GenBank with segmented BACs (eg
                    GI:55276707)
           Product: Biopython
           Version: Not Applicable
          Platform: PC
        OS/Version: Windows
            Status: NEW
          Severity: blocker
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev@biopython.org
        ReportedBy: moscou@iastate.edu


When I use the FeatureParser, it is unable to retrieve this record. 
The following is the error message from BioPython.  I am assuming this is
because this BAC was kept together because of repetitive regions that could not
be resolved, but still represent different sequences.  For now I am just going
to skip this record.


55276707
Traceback (most recent call last):
  File "search_ncbi_for_barley_bacs.py", line 43, in ?
    gb_seqrecord = ncbi_dict[gi]
  File "C:\Python24\Lib\site-packages\Bio\GenBank\__init__.py", line 1364, in __getitem__
    return self.parser.parse(handle)
  File "C:\Python24\Lib\site-packages\Bio\GenBank\__init__.py", line 219, in parse
    self._scanner.feed(handle, self._consumer)
  File "C:\Python24\Lib\site-packages\Bio\GenBank\__init__.py", line 1259, in feed
    self._parser.parseFile(handle)
  File "C:\Python24\Lib\site-packages\Martel\Parser.py", line 328, in parseFile
    self.parseString(fileobj.read())
  File "C:\Python24\Lib\site-packages\Martel\Parser.py", line 361, in parseString
    self._err_handler.fatalError(ParserIncompleteException(pos))
  File "C:\Python24\lib\xml\sax\handler.py", line 38, in fatalError
    raise exception
Martel.Parser.ParserIncompleteException: error parsing at or beyond character 18263 (unparsed text remains)
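
The character offset in the exception message can be turned into a line number to see roughly where parsing stopped. A small sketch: `locate_offset` is a hypothetical helper and the sample text is invented, so it only illustrates the arithmetic and says nothing about what Martel actually objected to.

```python
def locate_offset(text, pos):
    """Return (1-based line number, line content) for character offset pos."""
    line_no = text.count("\n", 0, pos) + 1
    start = text.rfind("\n", 0, pos) + 1  # 0 when pos is on the first line
    end = text.find("\n", pos)
    if end == -1:
        end = len(text)
    return line_no, text[start:end]


# Invented text standing in for the downloaded record.
sample = "LOCUS       DEMO\nDEFINITION  made-up record.\nSOURCE      demo\n"
where = locate_offset(sample, sample.index("SOURCE"))
print(where)
```

With the real record saved to a file, calling the helper on its text with the reported offset would show the line at or just before the point where the parser gave up.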


Thanks a lot!


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org  Tue Nov 29 18:19:42 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Nov 29 18:58:01 2005
Subject: [Biopython-dev] [Bug 1909] Format issue with GenBank with segmented
	BACs (eg GI:55276707)
Message-ID: <200511292319.jATNJghK017521@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1909


------- Comment #1 from biopython-bugzilla@maubp.freeserve.co.uk  2005-11-29 18:19 -------
You haven't said which version of BioPython you are using, I would guess
BioPython 1.40b with Python 2.4 on Windows XP.

Since the 1.40 release, the Genbank parser has been switched from using Martel
to a "simpler" python parser which should be easier to maintain.  Could you
download the latest Bio/GenBank/__init__.py file from CVS and repeat the test?

http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/GenBank/__init__.py?cvsroot=biopython

You could rename the existing
C:\Python24\lib\site-packages\Bio\GenBank\__init__.py file to something else
(so you can undo the change) and save the latest version in its place.
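
The rename-and-replace step could be scripted roughly as follows. This is a sketch with hypothetical names (`swap_in`, the ".orig" suffix), and temporary files stand in for the installed file and the downloaded one.

```python
import os
import shutil
import tempfile


def swap_in(installed_path, new_path, suffix=".orig"):
    """Rename installed_path as a backup, then copy new_path into its place."""
    backup_path = installed_path + suffix
    if os.path.exists(backup_path):
        # Refuse to overwrite an earlier backup.
        raise IOError("backup already exists: %s" % backup_path)
    os.rename(installed_path, backup_path)
    shutil.copyfile(new_path, installed_path)
    return backup_path


# Temporary stand-ins for the real files.
work_dir = tempfile.mkdtemp()
installed = os.path.join(work_dir, "__init__.py")
latest = os.path.join(work_dir, "latest_init.py")
with open(installed, "w") as handle:
    handle.write("# old parser\n")
with open(latest, "w") as handle:
    handle.write("# new parser\n")

backup = swap_in(installed, latest)
print(backup)
```

Undoing the change is then a matter of renaming the backup over the installed file again.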

It would also be useful to see the full test script... 

Thanks


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.