[Biopython] instances
Liam Thompson
dejmail at gmail.com
Mon Jun 29 14:06:04 UTC 2009
Hi everyone
Ok, so I managed to write a parser for Genbank files ( I will post to
script central once completed, it works well with single genes from
genomic sequences) which can search for a gene from a genomic sequence
and copy it out as a FASTA. The problem of course is that these
entries are often incorrectly annotated, more often than I can
manually correct. For instance, in HBV sequences that I am using you
get "precore" and "core" which are pretty much the same sequence, but
sometimes they're annotated separately, and sometimes not, which is
what I am trying to control for in my little parser.
So I thought, I can copy out the start position from precore, and then
the end position from core (the one ends immediately where the other
begins), and I have the whole sequence, irrespective of the
annotation. I am just having a little trouble getting it to work.
I had to refactor my code to take this into account, so I have some functions
def findgene(gene_list, curentry)
gene_list = a dictionary of genes are potentially annotated as under
the /gene or /product part of genbank features (there is also not
always /gene and /product, sometimes one or the other)
curentry = the current genbank record being processed & comes from
iterator.next() which is defined as iterator =
GenBank.Iterator(gb_handle, feature_parser)
at the end, it returns, if the gene is found, the gene.location and
gene.sequence and is a tuple.
I then attempt to print the sequence at the given coordinates
if corecur_seq > 0:
print "core sequence only \n"
corestart = corecur_seq[0]._start
coreend = corecur_seq[0]._end
coreseq = corecur_seq[1]
print coreseq[corestart:coreend]
getting the following error message
Traceback (most recent call last):
File "/media/RESCUE/HBx_Bioinformatics/reannotate.py", line 171, in <module>
print coreseq[corestart:coreend]
File "/var/lib/python-support/python2.6/Bio/Seq.py", line 132, in __getitem__
return Seq(self._data[index], self.alphabet)
TypeError: object cannot be interpreted as an index
So I guess I've changed the type of the variable in the definition
I then changed it to
if precorecur_seq == None:
corecur_seq = findgene(core_list, current_entry)
if corecur_seq > 0:
print "core sequence only \n"
corestart = corecur_seq[0]._start
coreend = corecur_seq[0]._end
print current_entry.seq[corestart:coreend]
giving the same error
I think the error is (although I don't know, I am pretty new to python
and programming in biopython) with the variable type of
corestart and coreend, both defined as <type 'instance'> and when I
print them on the shell I get
Bio.SeqFeature.ExactPosition(1900)
Bio.SeqFeature.ExactPosition(2452)
as an example, do I need to convert these to integers ? I have tried,
but I think I would need to replace or copy out the number
into a different variable ?
Specific thanks to Peter, Andrew Dalke and Brad who posted numerous
examples on their pages and on the mailing lists which
have helped me tremendously.
I would appreciate any comments.
Kind regards
Liam
--
-----------------------------------------------------------
Antiviral Gene Therapy Research Unit
University of the Witwatersrand
More information about the Biopython
mailing list