[Biopython] losing information

Peter biopython at maubp.freeserve.co.uk
Thu Oct 29 14:04:20 UTC 2009


On Thu, Oct 29, 2009 at 10:13 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Thu, Oct 29, 2009 at 4:53 AM, Liam Thompson <dejmail at gmail.com> wrote:
>> hi everyone
>>
>> I'm running a simple script to remove genbank records from
>> a GB file that I have identified as undesirable. ...
>
> ...
>
> That said, I would approach your problem in a very different
> way. I would NOT parse the file into objects at all - I would
> just loop over the lines, toggling between desired or not,
> and outputting the lines for desired records as is. This
> assumes your criteria for "desired" is simple to define,
> e.g. a list of LOCUS identifiers.

If the LOCUS line alone tells you whether a record is wanted,
this is very easy in Python (you don't need Biopython at all).
It will also be very fast as there is no complicated parsing
or object creation, e.g.

wanted = {"AB493847", "AB493848"}  # LOCUS names of the records to keep
save = False  # True while we are inside a wanted record
with open("original.txt") as inp_handle, open("new.txt", "w") as out_handle:
    for line in inp_handle:
        if line.startswith("LOCUS"):  # start of record
            # the record name is the second field on the LOCUS line
            save = line.split()[1] in wanted
        if save:
            out_handle.write(line)
        if line.strip() == "//":  # end of record
            save = False

I've written this using a set of good record identifiers. If you have a
list of bad records instead, change the "in" check to "not in".
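
For the bad-records case, the same loop might look something like
this rough sketch. The function name drop_unwanted and its arguments
are my own invention for illustration, and it assumes the record
name is always the second field on the LOCUS line:

```python
def drop_unwanted(inp_handle, out_handle, unwanted):
    """Copy GenBank records line by line, skipping those named in unwanted.

    drop_unwanted is a hypothetical helper, not part of Biopython.
    """
    save = False  # True while we are inside a record we want to keep
    for line in inp_handle:
        if line.startswith("LOCUS"):  # start of record
            # keep the record unless its name is on the bad list
            save = line.split()[1] not in unwanted
        if save:
            out_handle.write(line)
        if line.strip() == "//":  # end of record
            save = False
```

You would call it with two open file handles and a set of names,
e.g. drop_unwanted(inp, out, {"AB493847", "AB493848"}).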

If you need to access something like the annotation, or the sequence,
then it does make sense to parse the records - but keep a copy of
the raw GenBank record as a string to use for output. One way to
do this is to read each record's text into a string, and wrap that
string in a StringIO handle to give to the parser.
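
As a rough sketch of that idea (untested against your file, and
assuming Python 3, where StringIO lives in the io module): collect
the lines of each record into a string, parse that string with
SeqIO.read via a StringIO handle, and write out the untouched string
if the record passes. The helper names and the minimum-length
criterion are made up for illustration only:

```python
from io import StringIO


def iter_raw_records(handle):
    """Yield each GenBank record in handle as one raw string."""
    lines = []
    for line in handle:
        lines.append(line)
        if line.strip() == "//":  # end of record
            yield "".join(lines)
            lines = []


def filter_records(inp_handle, out_handle, is_wanted):
    """Write the raw text of every record for which is_wanted(raw) is true."""
    for raw in iter_raw_records(inp_handle):
        if is_wanted(raw):
            out_handle.write(raw)  # original text, not a re-formatted record


def has_long_sequence(raw, min_length=1000):
    """Hypothetical criterion: keep records of at least min_length bases."""
    # Imported here so the two helpers above work without Biopython
    from Bio import SeqIO
    record = SeqIO.read(StringIO(raw), "genbank")
    return len(record.seq) >= min_length
```

With two open handles this would be called as
filter_records(inp, out, has_long_sequence). Any function taking the
raw record text and returning True or False can stand in for
has_long_sequence.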

Peter


