[Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole, unparsed multiline entry?
bugzilla-daemon at portal.open-bio.org
bugzilla-daemon at portal.open-bio.org
Fri Mar 12 16:57:53 EST 2010
http://bugzilla.open-bio.org/show_bug.cgi?id=3000
------- Comment #4 from mmokrejs at ribosome.natur.cuni.cz 2010-03-12 16:57 EST -------
Hi Peter,
I finally got back to this. Thank your for all your work. I would be glad if
one could use the accession without the trailing ".1", etc for get_raw() and
get(). I think just any version of the record should be returned, and maybe a
list if there were multiple versions of the same.
>>> print data.get_raw("BC035166")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "Bio/SeqIO/_index.py", line 280, in get_raw
handle.seek(dict.__getitem__(self, key))
KeyError: 'BC035166'
>>>
Similarly, if I loop over the entries I have to do:
>>> mylist = ['ACC1', 'ACC2', 'ACC3']
>>> sequences = []
>>> for acc in data.keys():
... if data.get(acc).id.split('.')[0] in mylist:
... sequences.append(data.get(acc))
Oh no, this is not what I wanted, in full:
from Bio import SeqIO
data = SeqIO.index("full.gb", "gb")
mylist = ['AC11111.1', 'AC2222.2', 'AC3333.3']
sequences = []
for acc in mylist:
if acc in map(lambda x: x.split('.')[0], data.keys()):
print "Found %s" % acc
if data.get(acc + '.1'):
sequences.append(data.get(acc + '.1'))
else:
if data.get(acc + '.2'):
sequences.append(data.get(acc + '.2'))
else:
sequences.append(data.get(acc + '.3'))
else:
print "Missing %s" % acc
output_handle = open("filtered.gb", "w")
SeqIO.write(sequences, output_handle, "genbank")
There was already a discussing on the user mailing list, I do not think forcing
uppercase letters for genbank files is a good idea. Just stick with what was
supplied. Myself, I use mixed typically to emphasize, ORFs, but sometimes in
lower-case low-quality regions. Anyway, I provided original NCBI-web GenBank
file of an EST and the DNA sequence was in lowercase, biopython returned
uppercase. In turn, diff(1) command returns too many changed lines,
unnecessarily. I suggest giving use an opportunity to specify on input parsing
whether to keep mixed-case/lower-case or force uppercase. Also, protein
sequences I have often seen in lower-case, which is ugly to my eyes, btw.
Finally, the remaining differences are here (probably the first is in bug
#2578):
--- /tmp/orig.gb 2010-03-12 21:09:24.000000000 +0100
+++ /tmp/new.gb 2010-03-12 21:09:38.000000000 +0100
@@ -1,4 +1,4 @@
-LOCUS CR603932 1625 bp mRNA linear HTC
16-OCT-2008
+LOCUS CR603932 1625 bp DNA HTC
16-OCT-2008
DEFINITION full-length cDNA clone CS0DK007YH24 of HeLa cells Cot
25-normalized
of Homo sapiens (human).
ACCESSION CR603932
@@ -29,39 +29,39 @@
division of Invitrogen.
FEATURES Location/Qualifiers
source 1..1625
- /organism="Homo sapiens"
/mol_type="mRNA"
- /db_xref="taxon:9606"
/clone="CS0DK007YH24"
+ /db_xref="taxon:9606"
/tissue_type="HeLa cells Cot 25-normalized"
/plasmid="pCMVSPORT_6"
+ /organism="Homo sapiens"
ORIGIN
Thanks for all you work on this, it is a great service. ;-) Next, I will try to
filter by .features['tissue_type'] but sadly will have to search for the very
same string through COMMENT string as well.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
More information about the Biopython-dev
mailing list