From bugzilla-daemon at portal.open-bio.org Tue Nov 1 16:31:21 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Nov 1 16:57:29 2005
Subject: [Biopython-dev] [Bug 1885] KEGG Compound db format changes
Message-ID: <200511012131.jA1LVLwo011087@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1885

------- Comment #4 from edmonds@fas.harvard.edu 2005-11-01 16:31 -------
(In reply to comment #3)
> How did you download the new test cases for KEGG compound? Are the existing
> test cases in Tests/KEGG no longer valid? The submitted patch causes
> test_KEGG.py to fail, but I'm not sure if that is due to a bug in the patch or
> whether the existing test cases don't satisfy the current KEGG standard.
>

The entire KEGG database can be downloaded at
http://www.genome.ad.jp/kegg/kegg5.html , so I took some test cases from there.
There are two features of the existing test cases that do not resemble how the
entries are currently formatted:

In the past, the entry line used to have only the compound ID. Now the group
the ligand belongs to is also named. So on the right side of that line, it now
says "Compound" or "Drug" or "Glycan", ... All the entries in the database
have that now, so I don't think it makes sense to make it optional just to
accommodate the old test cases.

In the past, the formula line could come right after the name block or
somewhere at the end of the entry. Now all formula lines come right after the
name block.

Changing these two features of compound.sample and compound.irregular causes
test_KEGG.py not to fail.

As an aside, neither what I submitted nor the original works for the glycan or
reaction parts of the ligand database, and I suspect that they also don't work
properly for the enzyme part of the database.

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
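
Editorial sketch of the entry-line change described in comment #4 above: the
old entry line carried only the compound ID, the new one also names the group
on the right. This is a plain regular expression, not the Martel-based
Bio.KEGG parser. The pattern name, the function name and the exact column
layout of the two sample lines are assumptions made for illustration only.

```python
import re

# Hypothetical pattern: the keyword, the compound ID, then an optional group.
ENTRY_RE = re.compile(r"^ENTRY\s+(?P<id>\S+)(?:\s+(?P<group>\S+))?\s*$")


def parse_entry_line(line):
    """Return (entry_id, group); group is None for an old-style line."""
    match = ENTRY_RE.match(line)
    if match is None:
        raise ValueError("not an ENTRY line: %r" % line)
    return match.group("id"), match.group("group")


# Illustrative lines only; the real column widths are an assumption.
old_style = "ENTRY       C00001"
new_style = "ENTRY       C00001                      Compound"

print(parse_entry_line(old_style))
print(parse_entry_line(new_style))
```

Making the group optional, as here, is exactly the accommodation the comment
argues against for the real parser; the sketch only shows what the two layouts
look like side by side.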
From bugzilla-daemon at portal.open-bio.org Tue Nov 1 21:26:58 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Nov 1 21:57:29 2005
Subject: [Biopython-dev] [Bug 1885] KEGG Compound db format changes
Message-ID: <200511020226.jA22QwUg017373@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1885

mdehoon@ims.u-tokyo.ac.jp changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #5 from mdehoon@ims.u-tokyo.ac.jp 2005-11-01 21:26 -------
Accepted; in CVS.
I agree though that the KEGG parsers are still not quite up to date.
Unfortunately I don't understand Martel well enough to do something about it
now.

From mcolosimo at mitre.org Wed Nov 2 14:01:49 2005
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Wed Nov 2 14:42:47 2005
Subject: [Biopython-dev] SeqFeature's FeatureLocation for GenBank
Message-ID: <5bc01dafdc6f69c88c127c0895efb7bf@mitre.org>

I want to point out the very bizarre behavior of FeatureLocations when
using GenBank.FeatureParser (well to me anyways).

When I was testing out some code, I noticed that the start positions
were 1 less than in the GenBank Record, but the end positions were
correct. My first thought was that this must be a bug and as such went
looking for it. I soon gave up because I just don't have the time to
understand all the code that is involved (I was going to file a bug
report). So, I just added 1 to the start positions and went on to get
the features from the DNA. Suddenly I now understand why the positions
were like that: slicing!

Unless I missed something, I didn't see anything talking about this
behavior. Is this consistent with other parsers?
If so, I would suggest that this is included in the Cookbook and that
the classes are modified so that when printed (__str__) reports 1
instead of 0 (basically +1). Also, it would be nice to be able to do
things like location.start + 1 instead of location.start.position + 1.

Marc

From biopython-dev at maubp.freeserve.co.uk Thu Nov 3 05:38:38 2005
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Thu Nov 3 06:02:36 2005
Subject: [Biopython-dev] SeqFeature's FeatureLocation for GenBank
In-Reply-To: <5bc01dafdc6f69c88c127c0895efb7bf@mitre.org>
References: <5bc01dafdc6f69c88c127c0895efb7bf@mitre.org>
Message-ID: <4369E8AE.1080701@maubp.freeserve.co.uk>

Marc Colosimo wrote:
> I want to point out the very bizarre behavior of FeatureLocations when
> using GenBank.FeatureParser (well to me anyways).

It's by design...

> When I was testing out some code, I noticed that the start positions
> were 1 less than in the GenBank Record, but the end positions were
> correct. My first thought was that this must be a bug and as such went
> looking for it. I soon gave up because I just don't have the time to
> understand all the code that is involved (I was going to file a bug
> report). So, I just added 1 to the start positions and went on to get
> the features from the DNA. Suddenly I now understand why the positions
> were like that: slicing!

Exactly, e.g. something like:

seq[feature.location.start.position:feature.location.end.position]

> Unless I missed something, I didn't see anything talking about this
> behavior.

Python (like C) starts counting at zero, and this behaviour is
deliberate to make handling of the BioPython sequence objects as easy as
possible. Why - because the biopython DNA/RNA/Proteins sequences are as
much like Python strings as possible.

For example, to extract letters 5 to 7 from "abcdefghijk" (using
one based counting, i.e. "efg") in Python you say "abcdefghijk"[4:7]

Suppose your gene is bases 150..300 (using one based counting as in a
GenBank file).
To extract this from the full DNA sequence, you would use something
like: fullsequence[149:300]

I suppose the CookBook may have assumed people were familiar with Python
strings already...

> Is this consistent with other parsers? If so, I would suggest
> that this is included in the Cookbook ...

It should be consistent with other parsers.

Would you be able to suggest some rewording of the CookBook to clarify
this? (I'm sure I have seen a similar question on the mailing list in
the past, so something could be improved)

> ... and that the classes are modified so that when printed (__str__)
> reports 1 instead of 0 (basically +1).

That would be bad for people using the existing behaviour.

You'll get used to it (especially if you have to switch between zero
based and one based languages).

Peter

From mcolosimo at mitre.org Thu Nov 3 08:45:35 2005
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Thu Nov 3 08:44:17 2005
Subject: [Biopython-dev] SeqFeature's FeatureLocation for GenBank
In-Reply-To: <4369E8AE.1080701@maubp.freeserve.co.uk>
References: <5bc01dafdc6f69c88c127c0895efb7bf@mitre.org> <4369E8AE.1080701@maubp.freeserve.co.uk>
Message-ID: <436A147F.8050005@mitre.org>

Peter wrote:

> Marc Colosimo wrote:
>
>> I want to point out the very bizarre behavior of FeatureLocations
>> when using GenBank.FeatureParser (well to me anyways).
>
> It's by design...
>
>> When I was testing out some code, I noticed that the start positions
>> were 1 less than in the GenBank Record, but the end positions were
>> correct. My first thought was that this must be a bug and as such went
>> looking for it. I soon gave up because I just don't have the time to
>> understand all the code that is involved (I was going to file a bug
>> report). So, I just added 1 to the start positions and went on to get
>> the features from the DNA. Suddenly I now understand why the
>> positions were like that: slicing!
>
> Exactly, e.g.
> something like:
>
> seq[feature.location.start.position:feature.location.end.position]
>
>> Unless I missed something, I didn't see anything talking about this
>> behavior.
>
> Python (like C) starts counting at zero, and this behaviour is
> deliberate to make handling of the BioPython sequence objects as easy
> as possible. Why - because the biopython DNA/RNA/Proteins sequences
> are as much like Python strings as possible.
>
> For example, to extract letters 5 to 7 from "abcdefghijk" (using
> one based counting, i.e. "efg") in Python you say "abcdefghijk"[4:7]
>
> Suppose your gene is bases 150..300 (using one based counting as in a
> GenBank file).
>
> To extract this from the full DNA sequence, you would use something
> like: fullsequence[149:300]
>
> I suppose the CookBook may have assumed people were familiar with
> Python strings already...
>
>> Is this consistent with other parsers? If so, I would suggest
>> that this is included in the Cookbook ...

Thank you for the response. However, I know how lists work in Python
(and C, and Java, etc...). That was not the question. Here is some code
to show you what I mean about the inconsistent behavior of Locations.

from Bio import GenBank

gi_list = GenBank.search_for("AB077698")
ncbi_dict = GenBank.NCBIDictionary( 'nucleotide', 'genbank',
                                    parser = GenBank.FeatureParser() )
seq_rec = ncbi_dict[gi_list[0]]
print len(seq_rec.seq) # returns 2701, which is correct

# now lets look at a feature location
source_feature = seq_rec.features[0]
print source_feature.type # should be 'source'
print source_feature.location # (0..2701), in the gb record it was (1..2701).
# The start is correct, the end is NOT

# get a slice
seq_rec.seq[source_feature.location.start.position :
            source_feature.location.end.position]
# returns the correct thing

# now lets see what the first nt looks like
seq_rec.seq[source_feature.location.start.position] # works fine

# now lets see what the last nt looks like
seq_rec.seq[source_feature.location.end.position]
IndexError: string index out of range

# The correct answer is...
seq_rec.seq[source_feature.location.end.position - 1]
# now, this is different from how start position works!

# but wait there is more...
# What if I didn't know about the funny end position business and wrote this,
seq_rec.seq[source_feature.location.start.position:
            source_feature.location.end.position + 1]
# This works, but it is not correct because it has added a nt from the
# beginning to the end (slices are nice about that)
# If I were to use this on the other internal features I would get the
# wrong thing (by one nt)

So, either location End should be 2700, Start should be 1, or state
'explicitly' what Locations positions represent. But not 0..2701.

Changing the end position probably would mess up lots of code. So that
leaves documentation. You can add my code above to the cookbook.

Marc

From biopython-dev at maubp.freeserve.co.uk Thu Nov 3 09:17:47 2005
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Thu Nov 3 09:33:13 2005
Subject: [Biopython-dev] SeqFeature's FeatureLocation for GenBank
In-Reply-To: <436A147F.8050005@mitre.org>
References: <5bc01dafdc6f69c88c127c0895efb7bf@mitre.org> <4369E8AE.1080701@maubp.freeserve.co.uk> <436A147F.8050005@mitre.org>
Message-ID: <436A1C0B.5080008@maubp.freeserve.co.uk>

Marc Colosimo wrote:
> Thank you for the response. However, I know how lists work in Python
> (and C, and Java, etc...). That was not the question. Here is some code
> to show you what I mean about the inconsistent behavior of Locations.
>
> from Bio import GenBank
> gi_list = GenBank.search_for("AB077698")
> ncbi_dict = GenBank.NCBIDictionary( 'nucleotide', 'genbank', parser =
> GenBank.FeatureParser() )
> seq_rec = ncbi_dict[gi_list[0]]
> print len(seq_rec.seq) # returns 2701, which is correct
> # now lets look at a feature location
> source_feature = seq_rec.features[0]
> print source_feature.type # should be 'source'
> print source_feature.location # (0..2701), in the gb record it was
> (1..2701). The start is correct, the end is NOT

The start and end ARE correct in that seq_rec.seq[0:2701] will return
all of the sequence.

The first nucleotide is seq_rec.seq[0]
The last nucleotide is seq_rec.seq[2700]
The length is 2701

It makes more sense in the case of (sub)features, rather than the source
'feature' which is everything.

In the same way, a string of length 5, e.g. "abcde"
"abcde"[0] == "a"
"abcde"[4] == "e"
"abcde"[0:5] == "abcde"
"abcde"[5] is out of range.

From memory, the location object is actually rather more complicated
because it copes with nasty locations like 123..<150 plus joins etc, as
well as the simple cases like 123..150

It took me a while to get my head round the location object too.

> # get a slice
> seq_rec.seq[source_feature.location.start.position :
> source_feature.location.end.position]
> # returns the correct thing
> # now lets see what the first nt looks like
> seq_rec.seq[source_feature.location.start.position] # works fine
> # now lets see what the last nt looks like
> seq_rec.seq[source_feature.location.end.position]
> IndexError: string index out of range

This is correct. See my example with a string "abcde"[5]

> # The correct answer is...
> seq_rec.seq[source_feature.location.end.position - 1] # now, this is
> different from how start position works!

It has to be different for the splicing trick. Again, it's the "fault"
of trying to be the same as python strings.

> # but wait there is more...
> # What if I didn't know about the funny end position business and wrote
> this,
> seq_rec.seq[source_feature.location.start.position:
> source_feature.location.end.position + 1]
> # This works, but it is not correct because it has added a nt from the
> beginning to the end (slices are nice about that)
> # If I were to use this on the other internal features I would get the
> wrong thing (by one nt)
>
> So, either location End should be 2700, Start should be 1, or state
> 'explicitly' what Locations positions represent. But not 0..2701.

I personally am happy with having start 0, end 2701 for a genbank
location of 1..2701 and think it is logical.

However, the documentation could be improved.

> Changing the end position probably would mess up lots of code. So that
> leaves documentation. You can add my code above to the cookbook.

Maybe an extra sub section, before the current "3.7.2.2 Locations" for
the location simple case? i.e. No joins, no fuzzy locations. Just some
very simple examples...

Peter

From mcolosimo at mitre.org Thu Nov 3 13:35:34 2005
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Thu Nov 3 13:58:37 2005
Subject: [Biopython-dev] SeqFeature's FeatureLocation for GenBank
In-Reply-To: <436A1C0B.5080008@maubp.freeserve.co.uk>
References: <5bc01dafdc6f69c88c127c0895efb7bf@mitre.org> <4369E8AE.1080701@maubp.freeserve.co.uk> <436A147F.8050005@mitre.org> <436A1C0B.5080008@maubp.freeserve.co.uk>
Message-ID: <09567ab0c0445c1073172ab020441db9@mitre.org>

Okay, I think we are arguing in a circle here and this could go on for a
long time. Since we both have in the past had to get our heads around
this behavior (which to me indicates that the behavior is not
intuitive), I suggest that we update the Cookbook. I think my code is a
good example (I can even change it to include CDS instead of the source
feature). Also, something like the following,

Under 3.7.2.1:
location - The location of the SeqFeature on the sequence that you are dealing with.
The locations are designed to return the sequence when slicing. Thus,
the end position is ONE more than the actual end position in the
sequence.

Under 3.7.2.2: for ExactPositions.

Marc

On Nov 3, 2005, at 9:17 AM, Peter wrote:

> Marc Colosimo wrote:
>> Thank you for the response. However, I know how lists work in Python
>> (and C, and Java, etc...). That was not the question. Here is some code
>> to show you what I mean about the inconsistent behavior of Locations.
>> from Bio import GenBank
>> gi_list = GenBank.search_for("AB077698")
>> ncbi_dict = GenBank.NCBIDictionary( 'nucleotide', 'genbank', parser =
>> GenBank.FeatureParser() )
>> seq_rec = ncbi_dict[gi_list[0]]
>> print len(seq_rec.seq) # returns 2701, which is correct
>> # now lets look at a feature location
>> source_feature = seq_rec.features[0]
>> print source_feature.type # should be 'source'
>> print source_feature.location # (0..2701), in the gb record it
>> was (1..2701). The start is correct, the end is NOT
>
> The start and end ARE correct in that seq_rec.seq[0:2701] will return
> all of the sequence.
>
> The first nucleotide is seq_rec.seq[0]
> The last nucleotide is seq_rec.seq[2700]
> The length is 2701
>
> It makes more sense in the case of (sub)features, rather than the
> source 'feature' which is everything.
>
> In the same way, a string of length 5, e.g. "abcde"
> "abcde"[0] == "a"
> "abcde"[4] == "e"
> "abcde"[0:5] == "abcde"
> "abcde"[5] is out of range.
>
> From memory, the location object is actually rather more complicated
> because it copes with nasty locations like 123..<150 plus joins etc,
> as well as the simple cases like 123..150
>
> It took me a while to get my head round the location object too.
>
>> # get a slice
>> seq_rec.seq[source_feature.location.start.position :
>> source_feature.location.end.position]
>> # returns the correct thing
>
>> # now lets see what the first nt looks like
>> seq_rec.seq[source_feature.location.start.position] # works fine
>
>> # now lets see what the last nt looks like
>> seq_rec.seq[source_feature.location.end.position]
>> IndexError: string index out of range
>
> This is correct. See my example with a string "abcde"[5]
>
>> # The correct answer is...
>> seq_rec.seq[source_feature.location.end.position - 1] # now, this
>> is different from how start position works!
>
> It has to be different for the splicing trick. Again, it's the "fault"
> of trying to be the same as python strings.

I don't see the connection between Location and Sequences behaving like
python strings. The first part is the key to my confusion ("splicing
trick")

>
>> # but wait there is more...
>> # What if I didn't know about the funny end position business and
>> wrote this,
>> seq_rec.seq[source_feature.location.start.position:
>> source_feature.location.end.position + 1]
>> # This works, but it is not correct because it has added a nt from
>> the beginning to the end (slices are nice about that)
>> # If I were to use this on the other internal features I would get
>> the wrong thing (by one nt)
>> So, either location End should be 2700, Start should be 1, or state
>> 'explicitly' what Locations positions represent. But not 0..2701.
>
> I personally am happy with having start 0, end 2701 for a genbank
> location of 1..2701 and think it is logical.
>
> However, the documentation could be improved.
>
>> Changing the end position probably would mess up lots of code. So
>> that leaves documentation. You can add my code above to the cookbook.
>
> Maybe an extra sub section, before the current "3.7.2.2 Locations"
> for the location simple case? i.e. No joins, no fuzzy locations.
> Just some very simple examples...
>
> Peter

From mdehoon at c2b2.columbia.edu Thu Nov 3 15:05:20 2005
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Thu Nov 3 15:04:52 2005
Subject: [Biopython-dev] SeqFeature's FeatureLocation for GenBank
In-Reply-To: <436A1C0B.5080008@maubp.freeserve.co.uk>
References: <5bc01dafdc6f69c88c127c0895efb7bf@mitre.org> <4369E8AE.1080701@maubp.freeserve.co.uk> <436A147F.8050005@mitre.org> <436A1C0B.5080008@maubp.freeserve.co.uk>
Message-ID: <436A6D80.7050109@c2b2.columbia.edu>

I think the confusion is coming from the way FeatureLocation prints
itself. (0..2701) looks too much like a GenBank-style location. If
FeatureLocation were to print (0:2701) instead, it would be pretty clear
that this is a Python-style slice. One solution might be to let
FeatureLocation inherit from list, and override as needed for abstract
positions.

--Michiel.

Peter wrote:

> Marc Colosimo wrote:
>
>> Thank you for the response. However, I know how lists work in Python
>> (and C, and Java, etc...). That was not the question. Here is some code
>> to show you what I mean about the inconsistent behavior of Locations.
>>
>> from Bio import GenBank
>> gi_list = GenBank.search_for("AB077698")
>> ncbi_dict = GenBank.NCBIDictionary( 'nucleotide', 'genbank', parser =
>> GenBank.FeatureParser() )
>> seq_rec = ncbi_dict[gi_list[0]]
>> print len(seq_rec.seq) # returns 2701, which is correct
>> # now lets look at a feature location
>> source_feature = seq_rec.features[0]
>> print source_feature.type # should be 'source'
>> print source_feature.location # (0..2701), in the gb record it
>> was (1..2701). The start is correct, the end is NOT
>
> The start and end ARE correct in that seq_rec.seq[0:2701] will return
> all of the sequence.
>
> The first nucleotide is seq_rec.seq[0]
> The last nucleotide is seq_rec.seq[2700]
> The length is 2701
>
> It makes more sense in the case of (sub)features, rather than the
> source 'feature' which is everything.
>
> In the same way, a string of length 5, e.g. "abcde"
> "abcde"[0] == "a"
> "abcde"[4] == "e"
> "abcde"[0:5] == "abcde"
> "abcde"[5] is out of range.
>
> From memory, the location object is actually rather more complicated
> because it copes with nasty locations like 123..<150 plus joins etc,
> as well as the simple cases like 123..150
>
> It took me a while to get my head round the location object too.
>
>> # get a slice
>> seq_rec.seq[source_feature.location.start.position :
>> source_feature.location.end.position]
>> # returns the correct thing
>
>> # now lets see what the first nt looks like
>> seq_rec.seq[source_feature.location.start.position] # works fine
>
>> # now lets see what the last nt looks like
>> seq_rec.seq[source_feature.location.end.position]
>> IndexError: string index out of range
>
> This is correct. See my example with a string "abcde"[5]
>
>> # The correct answer is...
>> seq_rec.seq[source_feature.location.end.position - 1] # now, this
>> is different from how start position works!
>
> It has to be different for the splicing trick. Again, it's the "fault"
> of trying to be the same as python strings.
>
>> # but wait there is more...
>> # What if I didn't know about the funny end position business and
>> wrote this,
>> seq_rec.seq[source_feature.location.start.position:
>> source_feature.location.end.position + 1]
>> # This works, but it is not correct because it has added a nt from
>> the beginning to the end (slices are nice about that)
>> # If I were to use this on the other internal features I would get
>> the wrong thing (by one nt)
>>
>> So, either location End should be 2700, Start should be 1, or state
>> 'explicitly' what Locations positions represent. But not 0..2701.
>
> I personally am happy with having start 0, end 2701 for a genbank
> location of 1..2701 and think it is logical.
>
> However, the documentation could be improved.
>
>> Changing the end position probably would mess up lots of code. So
>> that leaves documentation. You can add my code above to the cookbook.
>
> Maybe an extra sub section, before the current "3.7.2.2 Locations"
> for the location simple case? i.e. No joins, no fuzzy locations.
> Just some very simple examples...
>
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev@biopython.org
> http://biopython.org/mailman/listinfo/biopython-dev

-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032

From mcolosimo at mitre.org Fri Nov 4 10:20:15 2005
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Fri Nov 4 10:21:47 2005
Subject: [Biopython-dev] SeqFeature's FeatureLocation for GenBank
In-Reply-To: <436A6D80.7050109@c2b2.columbia.edu>
References: <5bc01dafdc6f69c88c127c0895efb7bf@mitre.org> <4369E8AE.1080701@maubp.freeserve.co.uk> <436A147F.8050005@mitre.org> <436A1C0B.5080008@maubp.freeserve.co.uk> <436A6D80.7050109@c2b2.columbia.edu>
Message-ID: <436B7C2F.7050102@mitre.org>

Michiel,

That would probably help. Along with some additions to the cookbook.

Marc

Michiel Jan Laurens de Hoon wrote:

> I think the confusion is coming from the way FeatureLocation prints
> itself. (0..2701) looks too much like a GenBank-style location. If
> FeatureLocation were to print (0:2701) instead, it would be pretty
> clear that this is a Python-style slice. One solution might be to let
> FeatureLocation inherit from list, and override as needed for abstract
> positions.
>
> --Michiel.
>
> Peter wrote:
>
>> Marc Colosimo wrote:
>>
>>> Thank you for the response. However, I know how lists work in
>>> Python (and C, and Java, etc...). That was not the question. Here is
>>> some code to show you what I mean about the inconsistent behavior of
>>> Locations.
>>>
>>> from Bio import GenBank
>>> gi_list = GenBank.search_for("AB077698")
>>> ncbi_dict = GenBank.NCBIDictionary( 'nucleotide', 'genbank', parser
>>> = GenBank.FeatureParser() )
>>> seq_rec = ncbi_dict[gi_list[0]]
>>> print len(seq_rec.seq) # returns 2701, which is correct
>>> # now lets look at a feature location
>>> source_feature = seq_rec.features[0]
>>> print source_feature.type # should be 'source'
>>> print source_feature.location # (0..2701), in the gb record it
>>> was (1..2701). The start is correct, the end is NOT
>>
>> The start and end ARE correct in that seq_rec.seq[0:2701] will return
>> all of the sequence.
>>
>> The first nucleotide is seq_rec.seq[0]
>> The last nucleotide is seq_rec.seq[2700]
>> The length is 2701
>>
>> It makes more sense in the case of (sub)features, rather than the
>> source 'feature' which is everything.
>>
>> In the same way, a string of length 5, e.g. "abcde"
>> "abcde"[0] == "a"
>> "abcde"[4] == "e"
>> "abcde"[0:5] == "abcde"
>> "abcde"[5] is out of range.
>>
>> From memory, the location object is actually rather more complicated
>> because it copes with nasty locations like 123..<150 plus joins etc,
>> as well as the simple cases like 123..150
>>
>> It took me a while to get my head round the location object too.
>>
>>> # get a slice
>>> seq_rec.seq[source_feature.location.start.position :
>>> source_feature.location.end.position]
>>> # returns the correct thing
>>
>>> # now lets see what the first nt looks like
>>> seq_rec.seq[source_feature.location.start.position] # works fine
>>
>>> # now lets see what the last nt looks like
>>> seq_rec.seq[source_feature.location.end.position]
>>> IndexError: string index out of range
>>
>> This is correct. See my example with a string "abcde"[5]
>>
>>> # The correct answer is...
>>> seq_rec.seq[source_feature.location.end.position - 1] # now, this
>>> is different from how start position works!
>>
>> It has to be different for the splicing trick.
>> Again, it's the "fault" of trying to be the same as python strings.
>>
>>> # but wait there is more...
>>> # What if I didn't know about the funny end position business and
>>> wrote this,
>>> seq_rec.seq[source_feature.location.start.position:
>>> source_feature.location.end.position + 1]
>>> # This works, but it is not correct because it has added a nt from
>>> the beginning to the end (slices are nice about that)
>>> # If I were to use this on the other internal features I would get
>>> the wrong thing (by one nt)
>>>
>>> So, either location End should be 2700, Start should be 1, or state
>>> 'explicitly' what Locations positions represent. But not 0..2701.
>>
>> I personally am happy with having start 0, end 2701 for a genbank
>> location of 1..2701 and think it is logical.
>>
>> However, the documentation could be improved.
>>
>>> Changing the end position probably would mess up lots of code. So
>>> that leaves documentation. You can add my code above to the cookbook.
>>
>> Maybe an extra sub section, before the current "3.7.2.2 Locations"
>> for the location simple case? i.e. No joins, no fuzzy locations.
>> Just some very simple examples...
>>
>> Peter
>

From bugzilla-daemon at portal.open-bio.org Fri Nov 4 14:00:16 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Nov 4 14:57:31 2005
Subject: [Biopython-dev] [Bug 1747] GenBank parser is very slow and memory hungry for large input files
Message-ID: <200511041900.jA4J0FrS018788@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1747

------- Comment #7 from mdehoon@ims.u-tokyo.ac.jp 2005-11-04 14:00 -------
This patch causes an error when running the example in section 3.4.1 in the
tutorial/cookbook:

Python 2.4.1 (#1, Aug 25 2005, 12:45:44)
[GCC 3.4.4 (cygming special) (gdc 0.12, using dmd 0.125)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import GenBank
>>>
>>> gi_list = GenBank.search_for("Opuntia AND rpl16")
>>> ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank')
>>> gb_record = ncbi_dict[gi_list[0]]
>>> record_parser = GenBank.FeatureParser()
>>> ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank',parser = record_parser)
>>> gb_seqrecord = ncbi_dict[gi_list[0]]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 1736, in __getitem__
    return self.parser.parse(handle)
  File "/usr/local/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 219, in parse
    self._scanner.feed(handle, self._consumer)
  File "/usr/local/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 1261, in feed
    line = handle.readline()
AttributeError: ReseekFile instance has no attribute 'readline'
>>>

Can this be fixed? I'm pretty much in favor of a hand-written parser instead of
Martel, because it's easier to understand and maintain (there are several other
GenBank bugs waiting).

From bugzilla-daemon at portal.open-bio.org Mon Nov 7 04:30:26 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Mon Nov 7 04:57:39 2005
Subject: [Biopython-dev] [Bug 1897] New: In IsoelectricPoint.py infinite loop observed.
Message-ID: <200511070930.jA79UQQA010394@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1897

           Summary: In IsoelectricPoint.py infinite loop observed.
           Product: Biopython
           Version: 1.24
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev@biopython.org
        ReportedBy: clowney.lester@aist.go.jp

Using getIsoElectricPoint some proteins (just one for me) can put you into an
infinite loop. There is a stopping condition that apparently is too coarse.
Using smaller pH increments allows the test to pass and not oscillate.
This is the fixed fragment I used, where delta pH was changed from 0.001 to
0.0005.

Frag of IsoelectricPoint.py:

        while abs(Charge) > 0.01:
            if Charge > 0:
                pH += 0.0005  # was 0.001
            else:
                pH -= 0.0005  # was 0.001
            Charge = self._chargeR(pH, Cterm, Nterm)
            print Charge
        print "returning pH", pH

Regards, Les

From bugzilla-daemon at portal.open-bio.org Mon Nov 7 08:15:02 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Mon Nov 7 08:57:35 2005
Subject: [Biopython-dev] [Bug 1747] GenBank parser is very slow and memory hungry for large input files
Message-ID: <200511071315.jA7DF2ac015757@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1747

------- Comment #8 from biopython-bugzilla@maubp.freeserve.co.uk 2005-11-07 08:15 -------
I was aware there was some problem with the NCBIDictionary support (which had
been noted on the mailing list).

The problem shown by Michiel's example (comment 7 on the bug report) is due to
ReseekFile.py only supporting the 'read' method, and not the 'readline' method.
According to the comments in this file, this is all the Martel parsers needed.

I tried adding the following to the class ReseekFile in ReseekFile.py, and this
seems to fix Michiel's example.
    def _readline(self, size):
        """The readline support is just a quick guess..."""
        if size < 0:
            y = self.file.readline()
            z = self.buffer_file.readline() + y
            self.buffer_file.write(y)
            return z
        if size == 0:
            return ""
        x = self.buffer_file.readline(size)
        if len(x) < size:
            y = self.file.readline(size - len(x))
            self.buffer_file.write(y)
            return x + y
        return x

    def readline(self, size = -1):
        x = self._readline(size)
        if self.at_beginning and x:
            self.at_beginning = 0
        self._check_no_buffer()
        return x

I want to stress that I'm not entirely sure that my 'readline' code is valid, it was just my best guess based on how the 'read' method was done. And the test functions at the end of the ReseekFile.py file could be extended...

It would help if I had used the NCBIDictionary before ;)

P.S. I have just tried this change and the GenBank/__init__.py patch against BioPython 1.41 on Linux, and test_GenBank.py passed fine.

P.P.S. Would it be easy to add an offline version of Michael's example using NCBIDictionary to the GenBank unit test?

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Mon Nov 7 15:18:55 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Mon Nov 7 15:57:36 2005
Subject: [Biopython-dev] [Bug 1747] GenBank parser is very slow and memory hungry for large input files
Message-ID: <200511072018.jA7KItD5022646@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1747

------- Comment #9 from mdehoon@ims.u-tokyo.ac.jp 2005-11-07 15:18 -------
Sorry, I'm not following. ReseekFile already has a readline method. Also, do we need to use ReseekFile if we're not using Martel?

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Mon Nov 7 18:47:26 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Mon Nov 7 18:57:33 2005
Subject: [Biopython-dev] [Bug 1747] GenBank parser is very slow and memory hungry for large input files
Message-ID: <200511072347.jA7NlQrt000492@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1747

------- Comment #10 from biopython-bugzilla@maubp.freeserve.co.uk 2005-11-07 18:47 -------
Mystery solved: There are two different ReseekFile.py files in BioPython.

The one I changed (based on tracing the exception thrown) lives in Bio/ReseekFile.py
Revision: 1.3, Sun Mar 21 16:56:53 2004 UTC (19 months, 2 weeks ago) by chapmanb

There is a second version in Bio/EUtils/ReseekFile.py which appears to be more advanced (more comments, supports readline, readlines, ...)
Revision: 1.1, Fri Jun 13 00:49:37 2003 UTC (2 years, 4 months ago) by dalke

This version is however "older", but this is only due to a minor comment related change by Brad Chapman on the first file.

It looks like Bio/ReseekFile.py should be removed, and Bio/EUtils/ReseekFile.py used instead. Andrew Dalke's implementation of readline is a safer bet than my quick hack, plus he included some test cases for it as well.

I haven't tried this yet as it is late here, and I need to go to sleep now instead ;)

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
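
For readers who have not met ReseekFile: the idea discussed in comments #8 to #10 is a wrapper that remembers what it has read from a non-seekable stream so that a parser can go back to the start. The following is only a rough sketch of that idea, written for a current Python rather than the 2.4 used in this thread. The class name RewindableReader and its rewind() method are hypothetical; this is neither Bio/ReseekFile.py nor Andrew Dalke's Bio/EUtils/ReseekFile.py, and it only supports readline.

```python
import io


class RewindableReader:
    """Hypothetical illustration only; not part of Biopython."""

    def __init__(self, stream):
        self._stream = stream        # underlying text stream, may not support seek()
        self._seen = io.StringIO()   # copy of everything handed out so far
        self._replay = None          # set by rewind(); drained before the stream

    def rewind(self):
        """Arrange for subsequent reads to start again from the first character."""
        self._replay = io.StringIO(self._seen.getvalue())

    def readline(self):
        if self._replay is not None:
            line = self._replay.readline()
            if line.endswith("\n"):
                return line          # a complete line came from the buffer
            # Buffer exhausted (possibly mid-line): finish from the real stream.
            self._replay = None
            rest = self._stream.readline()
            self._seen.write(rest)
            return line + rest
        line = self._stream.readline()
        self._seen.write(line)
        return line


if __name__ == "__main__":
    reader = RewindableReader(io.StringIO("LOCUS X\nORIGIN\n//\n"))
    print(repr(reader.readline()))   # first line, read from the stream
    reader.rewind()
    print(repr(reader.readline()))   # same line again, replayed from the buffer
```

The point of the sketch is that readline has to fall through to the underlying stream only when the buffered copy runs out, which is the case the quick hack in comment #8 was guessing at.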
From bugzilla-daemon at portal.open-bio.org Tue Nov 8 05:13:41 2005 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org) Date: Tue Nov 8 05:57:33 2005 Subject: [Biopython-dev] [Bug 1747] GenBank parser is very slow and memory hungry for large input files Message-ID: <200511081013.jA8ADfOc013443@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1747 ------- Comment #11 from biopython-bugzilla@maubp.freeserve.co.uk 2005-11-08 05:13 ------- OK, revised instructions:- (1) Apply my patch to Bio/GenBank/__init__.py (2) Remove the old file Bio/ReseekFile.py in favour of just Bio/EUtils/ReseekFile.py (3) Update FormatIO.py line 4: Change 'import ReseekFile' To 'from Bio.EUtils import ReseekFile' (4) Update Bio/config/DBRegistry.py line 229: Change 'from Bio.ReseekFile import ReseekFile' To 'from Bio.EUtils.ReseekFile import ReseekFile' Then Michael's example in comment 7 using GenBank.NCBIDictionary works. (My grep skills are a bit rusty, but I didn't notice any other uses of ReseekFile) ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Nov 8 11:58:22 2005 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org) Date: Tue Nov 8 12:57:39 2005 Subject: [Biopython-dev] [Bug 1747] GenBank parser is very slow and memory hungry for large input files Message-ID: <200511081658.jA8GwMYH026815@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1747 mdehoon@ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #12 from mdehoon@ims.u-tokyo.ac.jp 2005-11-08 11:58 ------- Patch accepted in CVS, thanks. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Nov 8 13:32:19 2005 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org) Date: Tue Nov 8 13:57:34 2005 Subject: [Biopython-dev] [Bug 1747] GenBank parser is very slow and memory hungry for large input files Message-ID: <200511081832.jA8IWJDh028552@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1747 biopython-bugzilla@maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- OtherBugsDependingO| |1899 nThis| | ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Nov 8 14:28:44 2005 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org) Date: Tue Nov 8 14:57:34 2005 Subject: [Biopython-dev] [Bug 1680] Problems with the GenBank indexing Message-ID: <200511081928.jA8JSi7a029940@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1680 mdehoon@ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- OS/Version|Windows XP |All ------- Comment #2 from mdehoon@ims.u-tokyo.ac.jp 2005-11-08 14:28 ------- This file, with the spaces between the records, can be downloaded from Entrez Nucleotide by selecting the three records, setting "Display" to "GenBank", and "Send to" to "File". So I agree with Sameet that this is a bug in Bio.GenBank. This bug still exists in Biopython 1.41. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
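
To make the blank-line problem in bug 1680 concrete, here is a minimal sketch of a record splitter that tolerates empty lines between records. It assumes only that each record ends with a line containing "//"; the function name iter_genbank_records is hypothetical and this is not how the Martel/Mindy indexer works. It is written for a current Python rather than the 2.4 used in this thread.

```python
import io


def iter_genbank_records(handle):
    """Yield each record as one string, skipping blank lines between records.

    Hypothetical helper for illustration; not part of Bio.GenBank.
    """
    lines = []
    for line in handle:
        if not lines and not line.strip():
            continue                 # blank line before a record starts: ignore it
        lines.append(line)
        if line.rstrip() == "//":    # record terminator
            yield "".join(lines)
            lines = []
    if lines:
        yield "".join(lines)         # trailing text without a terminator


if __name__ == "__main__":
    text = "LOCUS A\n//\n\n\nLOCUS B\n//\n"
    for record in iter_genbank_records(io.StringIO(text)):
        print(repr(record))
```

Blank lines inside a record are kept as they are; only those between a "//" line and the next non-blank line are dropped.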
From bugzilla-daemon at portal.open-bio.org Wed Nov 9 07:16:30 2005 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org) Date: Wed Nov 9 07:57:45 2005 Subject: [Biopython-dev] [Bug 1680] Problems with the GenBank indexing Message-ID: <200511091216.jA9CGUYV020040@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1680 ------- Comment #3 from biopython-bugzilla@maubp.freeserve.co.uk 2005-11-09 07:16 ------- (In reply to comment #1) > This is because the Martel/Mindy iterator does not recognize spaces > between the records in the file. If you remove the spaces from the > file, the code will index the file without complaint. .. > One fix might be to modify the system to ignore whitespaces between > records. But that may cause problems if the system were applied to a > format where the number of blank lines were important. Where is the "format" used by the Martel/Mindy iterator defined? In the normal case, GenBank.index_file() simply calls SimpleSeqRecord.create_flatdb() to do the work. Does this use the same "format" for all types of file when asked to create a flat database? ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Nov 9 07:46:14 2005 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org) Date: Wed Nov 9 07:57:47 2005 Subject: [Biopython-dev] [Bug 1773] Martel.Parser.ParserPositionException Message-ID: <200511091246.jA9CkE8W020644@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1773 ------- Comment #1 from biopython-bugzilla@maubp.freeserve.co.uk 2005-11-09 07:46 ------- It looks to me like fixing bug 1680 may also fix this. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Nov 9 07:09:11 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Wed Nov 9 07:57:48 2005
Subject: [Biopython-dev] [Bug 1762] Bio.GenBank.FeatureParser dislikes valid accessions and locus lines
Message-ID: <200511091209.jA9C9BTp019930@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1762

------- Comment #3 from biopython-bugzilla@maubp.freeserve.co.uk 2005-11-09 07:09 -------
The limitation with truncated LOCUS lines is still present with the switch to my non-martel parser (see bug 1747), but this also means the original patch is not applicable anymore.

The new parser seems to be OK with this style ACCESSION line:

ACCESSION U00096 AE000111-AE000510

e.g. Using the RecordParser() it is accessible as

cur_record.accession == ['U00096', 'AE000111-AE000510']

Test script:
===========================================================
import time
from Bio import GenBank

#gb_file = "/tmp/U00096_full_locus.gbk"
gb_file = "/tmp/U00096_truncated_locus.gbk"

feature_parser = GenBank.FeatureParser()
gb_handle = open(gb_file, 'r')
start_time = time.time()
gb_iterator = GenBank.Iterator(gb_handle, feature_parser)
count = 0
while 1:
    print "Starting...",
    cur_record = gb_iterator.next()
    print "Done"
    if cur_record is None:
        break
    count = count + 1
    # now do something with the record
    print count, cur_record.name, len(cur_record.features), len(cur_record.seq)
    if 'data_file_division' in cur_record.annotations :
        print cur_record.annotations['data_file_division']
    if 'date' in cur_record.annotations :
        print cur_record.annotations['date']
job_time = time.time() - start_time
print "Time elapsed %0.2f seconds for %s" % (job_time, gb_file)
==============================================================

Test script output for the undoctored GenBank file from the NCBI's website (sent to file, I just searched for U00096 in google):

==============================================================
Starting... Done
1 U00096 8877 4639675
BCT
08-SEP-2005
Starting... Done
Time elapsed 79.05 seconds for /tmp/U00096_full_locus.gbk
==============================================================

Test script output for the truncated locus line version, where I edited the first line by hand from:

LOCUS U00096 4639675 bp DNA circular BCT

to:

LOCUS U00096

==============================================================
Starting...
Traceback (most recent call last):
  File "/tmp/U00096_test.py", line 17, in -toplevel-
    cur_record = gb_iterator.next()
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 129, in next
    return self._parser.parse(File.StringHandle(data))
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 219, in parse
    self._scanner.feed(handle, self._consumer)
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 1382, in feed
    assert False, \
AssertionError: Did not recognise the LOCUS line layout:
LOCUS U00096
==============================================================

The patch should be straightforward (I'll try and do this this afternoon) but note Jan T. Kim's warning:-

> I haven't checked whether missing division / length /
> DNA/RNA/protein / circular/linear information results in
> appropriate defaults in the objects created by parsing.
> As long as the corresponding members are not used, there
> should not be any problem.

Peter

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
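
As an illustration of what "appropriate defaults" for a truncated LOCUS line could look like, here is a small sketch. It is not the patch attached to this bug: real GenBank LOCUS lines are column-based, whereas this hypothetical parse_locus_line() helper simply splits on whitespace and leaves any missing field as None. It is written for a current Python rather than the 2.4 used in this thread.

```python
import re

DATE_PATTERN = re.compile(r"^\d{2}-[A-Z]{3}-\d{4}$")


def parse_locus_line(line):
    """Split a LOCUS line into fields; missing fields default to None.

    Hypothetical, token-based illustration; not the Bio.GenBank scanner.
    """
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("Not a LOCUS line: %r" % line)
    info = dict.fromkeys(
        ["name", "size", "residue_type", "topology", "division", "date"])
    if len(tokens) > 1:
        info["name"] = tokens[1]
    rest = tokens[2:]
    # A length is written as a number followed by its unit, e.g. "4639675 bp".
    if len(rest) >= 2 and rest[0].isdigit() and rest[1] in ("bp", "aa"):
        info["size"] = int(rest[0])
        rest = rest[2:]
    for token in rest:
        if token in ("circular", "linear"):
            info["topology"] = token
        elif DATE_PATTERN.match(token):
            info["date"] = token
        elif info["residue_type"] is None:
            info["residue_type"] = token   # e.g. DNA, RNA, mRNA
        else:
            info["division"] = token       # e.g. BCT
    return info


if __name__ == "__main__":
    print(parse_locus_line("LOCUS U00096 4639675 bp DNA circular BCT 08-SEP-2005"))
    print(parse_locus_line("LOCUS U00096"))
```

With the truncated line only the name is filled in, so code that later reads the division or date has to cope with None, which is the situation Jan T. Kim's warning describes.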
From bugzilla-daemon at portal.open-bio.org Wed Nov 9 10:49:25 2005 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org) Date: Wed Nov 9 10:57:59 2005 Subject: [Biopython-dev] [Bug 1762] Bio.GenBank.FeatureParser dislikes valid accessions and locus lines Message-ID: <200511091549.jA9FnPZh026146@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1762 biopython-bugzilla@maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #202 is|0 |1 obsolete| | ------- Comment #4 from biopython-bugzilla@maubp.freeserve.co.uk 2005-11-09 10:49 ------- Created an attachment (id=247) --> (http://bugzilla.open-bio.org/attachment.cgi?id=247&action=view) Patch to Bio/GenBank/__init__.py Patch to my non-martel GenBank parser to: (1) tackle the truncated LOCUS line problem (i.e. this bug) Plus a few changes that should be on bug 1899 really: (2) started to split the feed function into sub functions (3) minor changes to comments to remove references to Martel (4) removed the ErrorParser class used by the Martel parser Patch created with: diff cvs__init__.py __init__.py > patch.txt test_GenBank.py still passes (only tested on Windows). Sample output from the test script posted earlier, this time the truncated LOCUS line is accepted: >>> Starting... Done 1 U00096 8877 4639675 BCT 08-SEP-2005 Starting... Done Time elapsed 48.34 seconds for c:\temp\U00096_full_locus.gbk >>> Starting... Done 1 U00096 8877 4639675 Starting... Done Time elapsed 49.01 seconds for c:\temp\U00096_truncated_locus.gbk ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Nov 9 12:42:05 2005 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org) Date: Wed Nov 9 12:57:37 2005 Subject: [Biopython-dev] [Bug 1902] Change notation for FeatureLocation string representation Message-ID: <200511091742.jA9Hg5YD029584@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1902 ------- Comment #1 from biopython-bugzilla@maubp.freeserve.co.uk 2005-11-09 12:42 ------- Created an attachment (id=248) --> (http://bugzilla.open-bio.org/attachment.cgi?id=248&action=view) Patch to Bio/SeqFeature.py Patch as described in original bug report ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Nov 9 12:41:04 2005 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org) Date: Wed Nov 9 12:57:40 2005 Subject: [Biopython-dev] [Bug 1902] New: Change notation for FeatureLocation string representation Message-ID: <200511091741.jA9Hf4M1029526@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1902 Summary: Change notation for FeatureLocation string representation Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev@biopython.org ReportedBy: biopython-bugzilla@maubp.freeserve.co.uk This follows from a discussion on the development mailing list between Marc Colosimo, Michiel de Hoon and myself on how GenBank locations are represented in BioPython. 
In particular Michiel suggested changing the representation to avoid looking too like the GenBank syntax, and use something more like the Python slicing syntax:-

http://www.biopython.org/pipermail/biopython-dev/2005-November/002176.html

Changes
-------

Change the __str__ function from:

"(%s..%s)" % (self._start, self._end)

To:

"[%s:%s]" % (self._start, self._end)

Add following to __doc__ string for FeatureLocation.__str__

Returns a representation of the location. For the simple case this uses the python slicing syntax, [122:150] (zero based counting) which GenBank would call 123..150 (one based counting).

Add following to __doc__ string for class FeatureLocation

Note that the start and end location numbering is designed with slicing in mind, thus a GenBank entry of 123..150 (one based counting) becomes a location of [122:150] (zero based counting).

(patch to follow...)

Note:
-----

This will require updating the expected output for the test_Genbank and test_Location files in Bio/Tests/output/

Impact on existing code SHOULD be minimal, unless anyone is actively parsing the string output of a location somewhere?

Documentation:
--------------

The tutorial does include some examples of output using the current syntax, and thus would also need updating, see Section 3.7.2.2 Locations

As per the mailing list discussion, the whole topic of dealing with locations could be expanded further...

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From biopython-dev at maubp.freeserve.co.uk Thu Nov 10 11:08:00 2005
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Thu Nov 10 12:17:38 2005
Subject: [Biopython-dev] Updates to the tutorial for parsing GenBank files
Message-ID: <43737060.4070006@maubp.freeserve.co.uk>

There should be a patch attached for Biopython Doc/Tutorial.tex which tries to clarify GenBank parsing.
Created on Windows using:-

diff cvs_Tutorial.tex new_Tutorial.tex -E -Naur > patch.txt

In particular, I have tried to make it clear that GenBank.Iterator() and GenBank.index_file() are overkill/unnecessary when dealing with GenBank files which contain only a single record (which is the typical case in my personal experience).

My changes add an introductory example: parsing a small bacterial genome (a single large GenBank record), before moving on to the GenBank.Iterator() and GenBank.index_file() examples.

I have also pointed out that the multi-record example GenBank file used in these examples (cor6_6.gb) is included in the downloadable BioPython source code.

Plus there is a minor correction to the GenBank.index_file example, len(gb_dict) gives 6, not 7.

Peter

-------------- next part --------------
--- cvs_Tutorial.tex 2005-11-10 12:49:45.685675200 +0000 +++ new_Tutorial.tex 2005-11-10 15:45:58.688907200 +0000 @@ -1455,7 +1455,6 @@ One very nice feature of the GenBank libraries is the ability to automate retrieval of entries from GenBank. This is very convenient for creating scripts that automate a lot of your daily work. In this example we'll show how to query the NCBI databases, and to retrieve the records from the query. - First, we want to make a query and find out the ids of the records to retrieve. Here we'll do a quick search for \emph{Opuntia}, my favorite organism (since I work on it). We can do quick search and get back the GIs (GenBank identifiers) for all of the corresponding records: \begin{verbatim} @@ -1511,7 +1510,6 @@ For more information of formats you can parse GenBank records into, please see section~\ref{sec:gb-parsing}. - Using these automated query retrieval functionality is a big plus over doing things by hand. Additionally, the retrieval has nice built in features like a time-delay, which will prevent NCBI from getting mad at you and blocking your access.
\subsection{Parsing GenBank records} @@ -1525,39 +1523,69 @@ \item FeatureParser -- This parses the raw record in a SeqRecord object with all of the feature table information represented in SeqFeatures (see section~\ref{sec:advanced-seq} for more info on these objects). This is best to use if you are interested in getting things in a more standard format. \end{enumerate} -Either way you chose to go, the most common usage of these will be creating an iterator and parsing through a file on GenBank records. Doing this is very similar to how things are done in other formats, as the following code demonstrates: +Depending on the type of GenBank files you are interested in, they will either contain a single record, or multiple records. Each record will start with a {\tt LOCUS} line, various other header lines, a list of features, and finally the sequence data, ending with a {\tt //} line. + +Dealing with a GenBank file containing a single record is very easy. For example, lets use a small bacterial genome, {\it Nanoarchaeum equitans Kin4-M} (RefSeq NC\_005213, GenBank AE017199) which can be downloaded from the NCBI here \ahrefurl{\url{ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/Nanoarchaeum_equitans/AE017199.gbk}} (only 1.15 MB): \begin{verbatim} from Bio import GenBank -gb_file = "my_file.gb" +feature_parser = GenBank.FeatureParser() + +gb_file = "AE017199.gbk" + +gb_record = feature_parser.parse(open(gb_file,"r")) + +# now do something with the record +print "Name %s, %i features" % (gb_record.name, len(gb_record.features)) +print gb_record.seq +\end{verbatim} + +This gives the following output: + +\begin{verbatim} +Name AE017199, 1107 features +Seq('TCTCGCAGAGTTCTTTTTTGTATTAACAAACCCAAAACCCATAGAATTTAATGAACCCAA ...', IUPACAmbiguousDNA()) +\end{verbatim} + +\subsection{Iterating over GenBank records} +\label{sec:gb-parsing-iterator} + +For multi-record GenBank files, the most common usage will be creating an iterator, and parsing through the file record by 
record. Doing this is very similar to how things are done in other formats, as the following code demonstrates, using an example file \verb|cor6_6.gb| included in the BioPython source code under the Tests/GenBank/ directory: + +\begin{verbatim} +from Bio import GenBank + +gb_file = "cor6_6.gb" gb_handle = open(gb_file, 'r') feature_parser = GenBank.FeatureParser() gb_iterator = GenBank.Iterator(gb_handle, feature_parser) -while 1: +while True: cur_record = gb_iterator.next() if cur_record is None: break # now do something with the record + print "Name %s, %i features" % (cur_record.name, len(cur_record.features)) print cur_record.seq \end{verbatim} This just iterates over a GenBank file, parsing it into SeqRecord and SeqFeature objects, and prints out the Seq objects representing the sequences in the record. - As with other formats, you have lots of tools for dealing with GenBank records. This should make it possible to do whatever you need to with GenBank. \subsection{Making your very own GenBank database} One very cool thing that you can do is set up your own personal GenBank database and access it like a dictionary (this can be extra cool because you can also allow access to these local databases over a network using BioCorba -- see the BioCorba documentation for more information). +Note - this is only worth doing {\it if} your GenBank file contains more than one record. -Making a local database first involves creating an index file, which will allow quick access to any record in the file. To do this, we use the index file function: +Making a local database first involves creating an index file, which will allow quick access to any record in the file. To do this, we use the index file function. 
+Again, this example uses the file \verb|cor6_6.gb| which is included in the BioPython source code under the Tests/GenBank/ directory: \begin{verbatim} >>> from Bio import GenBank @@ -1566,7 +1594,7 @@ >>> GenBank.index_file(dict_file, index_file) \end{verbatim} -This will create the file \verb|my_index_file.idx|. Now, we can use this index to create a dictionary object that allows individual access to every record. Like the Iterator and NCBIDictionary interfaces, we can either get back raw records, or we can pass the dictionary a parser that will parse the records before returning them. In this case, we pass a \verb|FeatureParser| so that when we get a record, then we retrieve a SeqRecord object. +This will create a directory called \verb|cor6_6.idx| containing the index files. Now, we can use this index to create a dictionary object that allows individual access to every record. Like the Iterator and NCBIDictionary interfaces, we can either get back raw records, or we can pass the dictionary a parser that will parse the records before returning them. In this case, we pass a \verb|FeatureParser| so that when we get a record, then we retrieve a SeqRecord object. 
Setting up the dictionary is as easy as one line: @@ -1579,7 +1607,7 @@ \begin{verbatim} >>> len(gb_dict) -7 +6 >>> gb_dict.keys() ['L31939', 'AJ237582', 'X62281', 'AF297471', 'M81224', 'X55053'] \end{verbatim} @@ -1589,6 +1617,8 @@ \begin{verbatim} >>> gb_dict['AJ237582'] +>>> print len(gb_dict['X55053'].features) +3 \end{verbatim} \section{Dealing with alignments} From bugzilla-daemon at portal.open-bio.org Fri Nov 11 11:16:50 2005 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org) Date: Fri Nov 11 11:57:46 2005 Subject: [Biopython-dev] [Bug 1903] New: GenBank parses fails with unusual quoting and line breaks Message-ID: <200511111616.jABGGooT030390@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1903 Summary: GenBank parses fails with unusual quoting and line breaks Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: minor Priority: P2 Component: Main Distribution AssignedTo: biopython-dev@biopython.org ReportedBy: biopython-bugzilla@maubp.freeserve.co.uk I've been testing my new parser (recently checked in) and have discovered an oddity that it currently fails on, locus Bd2676 in this file: LOCUS NC_005363 3782950 bp DNA circular BCT 22-NOV-2004 DEFINITION Bdellovibrio bacteriovorus HD100, complete genome. ACCESSION NC_005363 VERSION NC_005363.1 GI:42521650 KEYWORDS complete genome. 
SOURCE      Bdellovibrio bacteriovorus HD100
  ORGANISM  Bdellovibrio bacteriovorus HD100

Look at this bit, /note="\n hypothetical protein"

Normally this would be written as /note="hypothetical protein"

     gene            2594436..2596394
                     /locus_tag="Bd2676"
                     /db_xref="GeneID:2736184"
     CDS             2594436..2596394
                     /locus_tag="Bd2676"
                     /note="
                     hypothetical protein"
                     /codon_start=1
                     /evidence=not_experimental
                     /transl_table=11
                     /product="hypothetical protein"
                     /protein_id="NP_969474.1"
                     /db_xref="GI:42524094"
                     /db_xref="GeneID:2736184"
                     /translation="MKRAYYSNDISRFLVDAPSSILGLLSKAHDFTLEEQQKNAWVKQ
                     IEILQTSLQGIPGHVYFEYSIPRVGKRVDLIVISGNALFSIEFKVGSSQFDSYAADQA
                     MDYALDLKNFHEGSHQIDIFPVLVATEATHTEALPSRFDDGVWSLTRTNSQNLSTHLQ
                     ALKTNAKGPEIDLLKWDASGYKPTPTIVEAAKALYSGHQVEEISRSDAGATNLSITSA
                     ALKKIIDESISQKKKTICLVTGVPGAGKTLVGLDLATSWNNPVANQHAVLLSGNGPLV
                     EILQEALAKDEANRSKASSPVKLSAARAKAKSFIQNIHHFRDEGLRTDAPPPEKVVIF
                     DEAQRAWNKTQTTKFMKTKKGVADFDHSEPEYLIKLMDRHADWAVIICLVGGGQEINT
                     GEAGISEWLDAIHNKFPHWQVCLPSTTSSADIPNIEKFVQAFSSRHHVDKNLHLTASV
                     RSFRSERVSDFMSALLDKDIDKAKALYSEIKEKYPIKLTRSLEEAKLWLKEKSRGNER
                     YGILASSGAGRLKAHGLDVKSRIEPVNWFLNDKKDVRSSFFMEDVATEFHVQGLELDW
                     TCVAWDIDFILSLKKETKFRSFAGTKWNNIKSSTDQSYLKNKYRVLLTRARQGLVLFV
                     PKGDPHDGTRPPGDYEELFSYLQYILND"

Patch to follow, once I work out what exactly my code is doing wrong.

Also, I have an existing patch pending for Bio/GenBank/__init__.py attached to bug 1762. Should any patch for this new bug be against the current CVS file, or against the version after applying the bug 1762 patch?

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
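
One way to picture the handling this report asks for: treat the physical lines of a quoted qualifier as one logical value and strip the whitespace left behind by the line break. The sketch below is an assumption about a reasonable approach, not the one-line change that eventually went into Bio/GenBank/__init__.py; clean_qualifier() is a hypothetical name. It is written for a current Python rather than the 2.4 used in this thread.

```python
def clean_qualifier(raw_lines):
    """Return (key, value) for one quoted qualifier spread over several lines.

    Hypothetical illustration; not the Bio.GenBank implementation.
    """
    # Join the continuation lines with single spaces, as for free-text notes.
    text = " ".join(part.strip() for part in raw_lines)
    key, _, value = text.partition("=")
    value = value.strip()
    if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
        # Drop the surrounding quotes, then any padding the line break left.
        value = value[1:-1].strip()
    return key.lstrip("/"), value


if __name__ == "__main__":
    print(clean_qualifier(['/note="', 'hypothetical protein"']))
    print(clean_qualifier(['/note="hypothetical protein"']))
```

Both calls give the same key and value, so the oddly wrapped /note from locus Bd2676 ends up identical to the usual form. Joining with a space suits free text only; a /translation value would need its lines joined without spaces.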
From bugzilla-daemon at portal.open-bio.org Thu Nov 17 13:27:53 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Thu Nov 17 13:57:51 2005
Subject: [Biopython-dev] [Bug 1762] Bio.GenBank.FeatureParser dislikes valid accessions and locus lines
Message-ID: <200511171827.jAHIRrwr015387@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1762

------- Comment #5 from mdehoon@ims.u-tokyo.ac.jp 2005-11-17 13:27 -------
I downloaded U00096 from Genbank, ran it through seqret, and tried to parse the resulting file with the patched Genbank parser. Whereas the LOCUS line doesn't cause a problem any more, there are other (seqret-specific?) lines that cause the parsing to fail (starting with the "BASE COUNT" line). Can this patch be fixed? It's better to start from the file created by seqret to make sure all nonstandard lines can be handled.

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Thu Nov 17 18:37:05 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Thu Nov 17 18:57:54 2005
Subject: [Biopython-dev] [Bug 1762] Bio.GenBank.FeatureParser dislikes valid accessions and locus lines
Message-ID: <200511172337.jAHNb50s022480@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1762

------- Comment #6 from biopython-bugzilla@maubp.freeserve.co.uk 2005-11-17 18:37 -------
I've never used seqret, so if you (Michiel) wouldn't mind emailing me this file (U00096 GenBank after seqret has changed it) then I'll be happy to have a look at the BASE COUNT problem.

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Nov 17 19:38:19 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Thu Nov 17 19:57:45 2005
Subject: [Biopython-dev] [Bug 1762] Bio.GenBank.FeatureParser dislikes valid accessions and locus lines
Message-ID: <200511180038.jAI0cJPE023052@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1762

------- Comment #7 from mdehoon@ims.u-tokyo.ac.jp 2005-11-17 19:38 -------
I put the file at http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/u00096.gb.gz. (gzipped file).

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Mon Nov 21 10:17:36 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Mon Nov 21 10:57:50 2005
Subject: [Biopython-dev] [Bug 1903] GenBank parses fails with unusual quoting and line breaks
Message-ID: <200511211517.jALFHaFY021490@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1903

biopython-bugzilla@maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|biopython-dev@biopython.org |biopython-
                   |                            |bugzilla@maubp.freeserve.co.
                   |                            |uk

------- Comment #1 from biopython-bugzilla@maubp.freeserve.co.uk 2005-11-21 10:17 -------
> Also, I have an existing patch pending for Bio/GenBank/__init__.py
> attached to bug 1762. Should any patch for this new bug be against
> the current CVS file, or against the version after applying the
> bug 1762 patch?

As the patch for bug 1762 needed further work, I included the one line change needed for this problem (bug 1903) in that revised patch (attachment 252)

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Mon Nov 21 10:13:27 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Mon Nov 21 10:57:51 2005
Subject: [Biopython-dev] [Bug 1762] Bio.GenBank.FeatureParser dislikes valid accessions and locus lines
Message-ID: <200511211513.jALFDQr7021397@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1762

biopython-bugzilla@maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #247 is|0                           |1
           obsolete|                            |
         AssignedTo|biopython-dev@biopython.org |biopython-
                   |                            |bugzilla@maubp.freeserve.co.
                   |                            |uk
             Status|NEW                         |ASSIGNED

------- Comment #8 from biopython-bugzilla@maubp.freeserve.co.uk 2005-11-21 10:13 -------
Created an attachment (id=252)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=252&action=view)
Bio/GenBank/__init__.py patch

Patches to my non-martel GenBank parser to:

(1) tackle the truncated LOCUS line problem (i.e. this bug)
(2) tackle missing features as in seqret output (i.e. this bug)

Plus a few changes that should be on bug 1899 really:

(3) started to split the feed function into sub functions
(4) minor changes to comments to remove references to Martel
(5) removed the ErrorParser class used by the Martel parser

And fix for bug 1903 as well:

(6) GenBank parses fails with unusual quoting and line breaks

The genbank unit test still works.

Michiel - could you create another (smaller) seqret file to go in the GenBank unit tests?

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From matteo.io at email.it Mon Nov 21 15:40:43 2005
From: matteo.io at email.it (Matteo)
Date: Mon Nov 21 19:38:20 2005
Subject: [Biopython-dev] Blast Parser error
Message-ID: <438230CB.7060707@email.it>

Hi,
I'm using the latest biopython (1.41) with python 2.4 on windows.
I have a problem with a script that works fine on linux, and I don't know
what to do... This is the simple code:

blast_db = os.path.join(os.getcwd(), FASTA_FILE)
blast_file = os.path.join(os.getcwd(), QUERY_FILE)
blast_exe = os.path.join(os.getcwd(), 'blastall.exe')
blast_out, error_info = NCBIStandalone.blastall(blast_exe, 'blastn', blast_db, blast_file)
save_file = open('my_blast.out', 'w')
blast_results = blast_out.read()
save_file.write(blast_results)
save_file.close()
blast_out = open('my_blast.out', 'r')
blastparser = NCBIStandalone.BlastParser()
alignrecord = blastparser.parse(blast_out)
[...]

And this is the error:

  File "blastMaker.py", line 55, in ?
    alignrecord = blastparser.parse(blast_out)
  File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIStandalone.py", line 623, in parse
    self._scanner.feed(handle, self._consumer)
  File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIStandalone.py", line 93, in feed
    read_and_call_until(uhandle, consumer.noevent, contains='BLAST')
  File "C:\Python24\Lib\site-packages\Bio\ParserSupport.py", line 335, in read_and_call_until
    line = safe_readline(uhandle)
  File "C:\Python24\Lib\site-packages\Bio\ParserSupport.py", line 411, in safe_readline
    raise SyntaxError, "Unexpected end of stream."
SyntaxError: Unexpected end of stream.

Could someone help me?
Thanks in advance,

--
Matteo De Felice
University of Rome "Roma Tre"

From matteo.io at email.it Tue Nov 22 04:53:06 2005
From: matteo.io at email.it (Matteo)
Date: Tue Nov 22 04:50:17 2005
Subject: [Biopython-dev] Blast Parser error
In-Reply-To: <200511220859.58707.sohm@iaf.cnrs-gif.fr>
References: <438230CB.7060707@email.it> <200511220859.58707.sohm@iaf.cnrs-gif.fr>
Message-ID: <4382EA82.9070306@email.it>

Frederic Sohm wrote:

>Hi,
>
>Just guessing.
>Did you check that blast_results is correct ?
>What did you get in error_info ?

If I put a "print error_info" I get something like:

>If everything is normal there, then the end of line is different in windows
>('\r\n') and linux ('\n'), it might just be that.
>So try :
>
>...
>>save_file.write(blast_results)
>>save_file.close()
>>#blast_out = open('my_blast.out', 'r')
>blast_out = open('my_blast.out', 'rU')
>>blastparser = NCBIStandalone.BlastParser()
>>alignrecord = blastparser.parse(blast_out)
>>[...]

At the end of the program the output "my_blast.out" is blank!
I tried launching blastall from the command line and THEN parsing the
output with biopython, and in this way it works!
Maybe NCBIStandalone has a problem... I have downloaded the CVS version but
there is no change...
And remember that the SAME code works on linux... I also tried to download
the version that I have on my Ubuntu Linux (2.2.10) but there is no change...

Thank you for helping me,

Matteo De Felice
University of Rome "Roma Tre"

From mdehoon at c2b2.columbia.edu Wed Nov 23 13:10:20 2005
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Wed Nov 23 13:17:34 2005
Subject: [Biopython-dev] Blast Parser error
Message-ID: <6CA15ADD82E5724F88CB53D50E61C9AE9ECD9E@cgcmail.cgc.cpmc.columbia.edu>

Can you use the XML parser in NCBIXML instead? See the updated tutorial on
the biopython website on how to use it.

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032

From biopython-dev at maubp.freeserve.co.uk Wed Nov 23 17:11:36 2005
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Wed Nov 23 20:31:42 2005
Subject: [Biopython-dev] Blast Parser error
In-Reply-To: <4382EA82.9070306@email.it>
References: <438230CB.7060707@email.it> <200511220859.58707.sohm@iaf.cnrs-gif.fr> <4382EA82.9070306@email.it>
Message-ID: <4384E918.7090807@maubp.freeserve.co.uk>

Matteo wrote:
> this is the simple code:
>
> blast_db = os.path.join(os.getcwd(), FASTA_FILE)
> blast_file = os.path.join(os.getcwd(), QUERY_FILE)
> blast_exe = os.path.join(os.getcwd(), 'blastall.exe')
> blast_out, error_info = NCBIStandalone.blastall(blast_exe, 'blastn', blast_db, blast_file)
> save_file = open('my_blast.out', 'w')
> blast_results = blast_out.read()
> save_file.write(blast_results)
> save_file.close()
> blast_out = open('my_blast.out', 'r')
> blastparser = NCBIStandalone.BlastParser()
> alignrecord = blastparser.parse(blast_out)
> [...]
> ...
> At the end of the program the output "my_blast.out" is blank!
> I tried to launching blastall from the command line and THEN
> parsing the output with biopython, in this way it works!

At Frederic Sohm's suggestion (off list?):

Matteo wrote:
> If I put a "print error_info" I get something like:
> C:\PATH\TO\PROGRAM\refsets.fasta -i C:\PATH\TO\PROGRAM\query.fasta',
> mode 'r' at 0x00AF6AD0>

Could you post the actual message with the real "PATH TO PROGRAM" included?
I'm wondering if this is a problem with spaces in paths/filenames.

Also in your example script, you are explicitly creating a file and saving
the blast output to it. It is possible that you need to wait a second or two
on windows for the file to be properly closed, before you can open it and
parse it (just a guess!)

When I used NCBIStandalone.BlastParser (which worked for me on both Linux and
Windows) I followed the cookbook approach, which doesn't create a temp file
in this way.

What happens if you try this:

blast_db = os.path.join(os.getcwd(), FASTA_FILE)
blast_file = os.path.join(os.getcwd(), QUERY_FILE)
blast_exe = os.path.join(os.getcwd(), 'blastall.exe')
blast_out, error_info = NCBIStandalone.blastall(blast_exe, \
    'blastn', blast_db, blast_file)
# save_file = open('my_blast.out', 'w')
# blast_results = blast_out.read()
# save_file.write(blast_results)
# save_file.close()
# blast_out = open('my_blast.out', 'r')
blastparser = NCBIStandalone.BlastParser()
alignrecord = blastparser.parse(blast_out)
[...]

You don't get to keep a copy of the raw blast output though.

I would try this for you now, but I'm on a Linux box at the moment.

Peter

From bugzilla-daemon at portal.open-bio.org Tue Nov 29 16:17:47 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Nov 29 16:58:01 2005
Subject: [Biopython-dev] [Bug 1909] New: Format issue with GenBank with segmented BACs (eg GI:55276707)
Message-ID: <200511292117.jATLHl3G015910@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1909

           Summary: Format issue with GenBank with segmented BACs (eg GI:55276707)
           Product: Biopython
           Version: Not Applicable
          Platform: PC
        OS/Version: Windows
            Status: NEW
          Severity: blocker
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev@biopython.org
        ReportedBy: moscou@iastate.edu

When using the FeatureParser, it will not be able to retrieve this function.
The following is the error message from BioPython.
I am assuming this is because this BAC was kept together because of
repetitive regions that could not be resolved, but still represent different
sequences. Instead I am just going to skip this record for now.

55276707
Traceback (most recent call last):
  File "search_ncbi_for_barley_bacs.py", line 43, in ?
    gb_seqrecord = ncbi_dict[gi]
  File "C:\Python24\Lib\site-packages\Bio\GenBank\__init__.py", line 1364, in __getitem__
    return self.parser.parse(handle)
  File "C:\Python24\Lib\site-packages\Bio\GenBank\__init__.py", line 219, in parse
    self._scanner.feed(handle, self._consumer)
  File "C:\Python24\Lib\site-packages\Bio\GenBank\__init__.py", line 1259, in feed
    self._parser.parseFile(handle)
  File "C:\Python24\Lib\site-packages\Martel\Parser.py", line 328, in parseFile
    self.parseString(fileobj.read())
  File "C:\Python24\Lib\site-packages\Martel\Parser.py", line 361, in parseString
    self._err_handler.fatalError(ParserIncompleteException(pos))
  File "C:\Python24\lib\xml\sax\handler.py", line 38, in fatalError
    raise exception
Martel.Parser.ParserIncompleteException: error parsing at or beyond character 18263 (unparsed text remains)

Thanks a lot!

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Tue Nov 29 18:19:42 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Nov 29 18:58:01 2005
Subject: [Biopython-dev] [Bug 1909] Format issue with GenBank with segmented BACs (eg GI:55276707)
Message-ID: <200511292319.jATNJghK017521@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1909

------- Comment #1 from biopython-bugzilla@maubp.freeserve.co.uk 2005-11-29 18:19 -------
You haven't said which version of BioPython you are using; I would guess
BioPython 1.40b with Python 2.4 on Windows XP.
Since the 1.40 release, the GenBank parser has been switched from using
Martel to a "simpler" Python parser which should be easier to maintain.

Could you download the latest Bio/GenBank/__init__.py file from CVS and
repeat the test?

http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/GenBank/__init__.py?cvsroot=biopython

You could rename the existing
C:\Python24\lib\site-packages\Bio\GenBank\__init__.py file to something else
(so you can undo the change) and save the latest version in its place.

It would also be useful to see the full test script...

Thanks

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
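
The rename-and-replace step suggested in the comment above could be scripted
roughly as follows. This is only a sketch under stated assumptions:
swap_in_module and restore_module are hypothetical helper names, not part of
Biopython, and the ".orig" backup suffix is an arbitrary choice. The idea is
simply to keep the installed file under another name so the change can be
undone.

```python
import shutil
from pathlib import Path


def swap_in_module(installed: Path, replacement: Path, suffix: str = ".orig") -> Path:
    """Back up `installed`, then copy `replacement` into its place.

    Returns the backup path so the swap can be undone later.
    (Hypothetical helper for illustration; not part of Biopython.)
    """
    backup = installed.with_name(installed.name + suffix)
    if backup.exists():
        # Refuse to overwrite an earlier backup; pick another suffix instead.
        raise FileExistsError(f"backup already exists: {backup}")
    installed.rename(backup)                 # keep the old file so the swap is reversible
    shutil.copyfile(replacement, installed)  # put the downloaded file in its place
    return backup


def restore_module(installed: Path, backup: Path) -> None:
    """Undo swap_in_module by moving the backup over the installed file."""
    backup.replace(installed)
```

For the case in this thread, `installed` would be
C:\Python24\lib\site-packages\Bio\GenBank\__init__.py and `replacement` the
copy of __init__.py saved from the CVS viewer; calling restore_module with
the returned backup path puts the original file back.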