[Biopython] instances

Mon Jun 29 14:06:04 UTC 2009

Hi everyone

Ok, so I managed to write a parser for Genbank files ( I will post to
script central once completed, it works well with single genes from
genomic sequences) which can search for a gene from a genomic sequence
and copy it out as a FASTA. The problem of course is that these
entries are often incorrectly annotated, more often than I can
manually correct. For instance, in HBV sequences that I am using you
get "precore" and "core" which are pretty much the same sequence, but
sometimes they're annotated separately, and sometimes not, which is
what I am trying to control for in my little parser.

So I thought, I can copy out the start position from precore, and then
the end position from core (the one ends immediately where the other
begins), and I have the whole sequence, irrespective of the
annotation. I am just having a little trouble getting it to work.

I had to refactor my code to take this into account, so I have some functions

def findgene(gene_list, curentry)

gene_list = a dictionary of genes are potentially annotated as under
the /gene or /product part of genbank features (there is also not
always /gene and /product, sometimes one or the other)
curentry = the current genbank record being processed & comes from
iterator.next() which is defined as iterator =
GenBank.Iterator(gb_handle, feature_parser)

at the end, it returns, if the gene is found, the gene.location and
gene.sequence and is a tuple.

I then attempt to print the sequence at the given coordinates

 if corecur_seq > 0:
            print "core sequence only \n"
            corestart = corecur_seq[0]._start
            coreend = corecur_seq[0]._end
            coreseq = corecur_seq[1]
            print coreseq[corestart:coreend]

getting the following error message

Traceback (most recent call last):
  File "/media/RESCUE/HBx_Bioinformatics/reannotate.py", line 171, in <module>
    print coreseq[corestart:coreend]
  File "/var/lib/python-support/python2.6/Bio/Seq.py", line 132, in __getitem__
    return Seq(self._data[index], self.alphabet)
TypeError: object cannot be interpreted as an index

So I guess I've changed the type of the variable in the definition

I then changed it to

    if precorecur_seq == None:
        corecur_seq =  findgene(core_list, current_entry)
        if corecur_seq > 0:
            print "core sequence only \n"
            corestart = corecur_seq[0]._start
            coreend = corecur_seq[0]._end
            print current_entry.seq[corestart:coreend]

giving the same error

I think the error is (although I don't know, I am pretty new to python
and programming in biopython) with the variable type of
corestart and coreend, both defined as <type 'instance'> and when I
print them on the shell I get

Bio.SeqFeature.ExactPosition(1900)

Bio.SeqFeature.ExactPosition(2452)

as an example, do I need to convert these to integers ? I have tried,
but I think I would need to replace or copy out the number
into a different variable ?

Specific thanks to Peter, Andrew Dalke and Brad who posted numerous
examples on their pages and on the mailing lists which
have helped me tremendously.

I would appreciate any comments.

Kind regards
Liam

-- 
-----------------------------------------------------------
Antiviral Gene Therapy Research Unit
University of the Witwatersrand