[Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole, unparsed multiline entry?

Fri Mar 12 16:57:53 EST 2010

http://bugzilla.open-bio.org/show_bug.cgi?id=3000





------- Comment #4 from mmokrejs at ribosome.natur.cuni.cz  2010-03-12 16:57 EST -------
Hi Peter,
  I finally got back to this. Thank your for all your work. I would be glad if
one could use the accession without the trailing ".1", etc for get_raw() and
get(). I think just any version of the record should be returned, and maybe a
list if there were multiple versions of the same.

>>> print data.get_raw("BC035166")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/SeqIO/_index.py", line 280, in get_raw
    handle.seek(dict.__getitem__(self, key))
KeyError: 'BC035166'
>>>

Similarly, if I loop over the entries I have to do:
>>> mylist = ['ACC1', 'ACC2', 'ACC3']
>>> sequences = []
>>> for acc in data.keys():
...     if data.get(acc).id.split('.')[0] in mylist:
...         sequences.append(data.get(acc))



Oh no, this is not what I wanted, in full:

from Bio import SeqIO
data = SeqIO.index("full.gb", "gb")

mylist = ['AC11111.1', 'AC2222.2', 'AC3333.3']
sequences = []
for acc in mylist:
    if acc in map(lambda x: x.split('.')[0], data.keys()):
        print "Found %s" % acc
        if data.get(acc + '.1'):
            sequences.append(data.get(acc + '.1'))
        else:
            if data.get(acc + '.2'):
                sequences.append(data.get(acc + '.2'))
            else:
                sequences.append(data.get(acc + '.3'))
    else:
        print "Missing %s" % acc

output_handle = open("filtered.gb", "w")
SeqIO.write(sequences, output_handle, "genbank")



There was already a discussing on the user mailing list, I do not think forcing
uppercase letters for genbank files is a good idea. Just stick with what was
supplied. Myself, I use mixed typically to emphasize, ORFs, but sometimes in
lower-case low-quality regions. Anyway, I provided original NCBI-web GenBank
file of an EST and the DNA sequence was in lowercase, biopython returned
uppercase. In turn, diff(1) command returns too many changed lines,
unnecessarily. I suggest giving use an opportunity to specify on input parsing
whether to keep mixed-case/lower-case or force uppercase. Also, protein
sequences I have often seen in lower-case, which is ugly to my eyes, btw.

Finally, the remaining differences are here (probably the first is in bug
#2578):

--- /tmp/orig.gb        2010-03-12 21:09:24.000000000 +0100
+++ /tmp/new.gb 2010-03-12 21:09:38.000000000 +0100
@@ -1,4 +1,4 @@
-LOCUS       CR603932                1625 bp    mRNA    linear   HTC
16-OCT-2008
+LOCUS       CR603932                1625 bp    DNA              HTC
16-OCT-2008
 DEFINITION  full-length cDNA clone CS0DK007YH24 of HeLa cells Cot
25-normalized
             of Homo sapiens (human).
 ACCESSION   CR603932
@@ -29,39 +29,39 @@
             division of Invitrogen.
 FEATURES             Location/Qualifiers
      source          1..1625
-                     /organism="Homo sapiens"
                      /mol_type="mRNA"
-                     /db_xref="taxon:9606"
                      /clone="CS0DK007YH24"
+                     /db_xref="taxon:9606"
                      /tissue_type="HeLa cells Cot 25-normalized"
                      /plasmid="pCMVSPORT_6"
+                     /organism="Homo sapiens"
 ORIGIN

Thanks for all you work on this, it is a great service. ;-) Next, I will try to
filter by .features['tissue_type'] but sadly will have to search for the very
same string through COMMENT string as well.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.