[BioPython] Parsing Genbank file examples

Brad Chapman chapmanb@arches.uga.edu
Sat, 28 Jul 2001 14:50:39 -0400


Hi Scott;
Apologies about the delay in getting back with you; as you probably
heard people mention, many of us have been partying, I mean working 
hard, at ISMB in Denmark. 

> Well I managed to get Biopython working on my computer

Great! If you are running on Windows regularly and have any problems,
please report 'em -- most of us are unix-geek-types and I really only
venture unto Windows to make the installers.

> The documentation
> is excellent BTW. The reason I say "potential" is that I haven't quite
> gotten it to work completely. (I think mostly it is my lack of comprehension
> about how the parsers really work.)

Thanks for the kind words about the docs. We spent a bit of time
talking about the type of work they need while we were in Denmark, and
I'm trying to start getting on top of that. If you have any specific
feedback, it would be much appreciated.
 
> Thus, I'd really like some pointers on parsing Genbank files and extracting
> data from these files. What I am really interested in at the moment is
> accessing multiple mRNA features from a Genbank record. 

Sure! One of the things we talked about was trying to put together a
"Cookbook" similiar to the python cookbook that ActiveState has:

http://aspn.activestate.com/ASPN/Python/Cookbook/

(or the Perl Cookbook from O'Reilly). I'm not yet sure about the best
way to set it up, but right now I'll just start writing examples by
hand. Your question seemed like a good starting entry for the
cookbook, so I got something together and attached it below. Let me
know if this answers your questions, or if it needs
clarification/help, etc.

> P.S. Is this list the right place to point out bugs and errors?

Definately. If you are sending big patches or want to have lengthy 
discussions about things, the development list 
(biopython-dev@biopython.org) is also a good place to go.

Thanks for the feedback. Hope the example helps!

Brad


 Parsing Features from a GenBank file
=====================================
  
 Author
-------
    Brad Chapman (chapmanb@arches.uga.edu)

 Summary
--------
   Describes the basic idea for getting features out of a GenBank feature
table. Explains both how GenBank parsing works, and how to find your way
around the SeqFeatures produced by GenBank. This specific example gets
information about mRNA features in an entry.

 Source
-------
   
<<
     1 #!/usr/bin/env python
     2 """Cookbook example showing how to parse features from a GenBank file.
     3 
     4 This script prints out the location of mRNAs as annotated in a
     5 GenBank file. It was tested and run with accession number Y17722
     6 but should probably work with any GenBank file.
     7 """
     8 from Bio import GenBank
     9 from Bio.Seq import MutableSeq
    10 from Bio.Alphabet import IUPAC
    11 from Bio import utils
    12 
    13 # --- load a parser and iterator for our GenBank file
    14 gb_handle = open("mRNA_entry.gb", "r")
    15 # -- a parser that will give you back SeqFeature objects
    16 feature_parser = GenBank.FeatureParser()
    17 iterator = GenBank.Iterator(gb_handle, feature_parser)
    18 
    19 # begin iterating through the file and getting GenBank records
    20 while 1:
    21     # get a SeqFeature object for the next GenBank record. When we run
    22     # out of records in the file, cur_entry will be None
    23     cur_entry = iterator.next()
    24 
    25     if cur_entry is None:
    26         break
    27 
    28     print "Printing mRNA entries for %s" % cur_entry.id
    29 
    30     # loop through all of the features for the entry
    31     for feature in cur_entry.features:
    32         # when we've got mRNA features, parse the info out of them
    33         if feature.type == "mRNA":
    34             # loop through all of the subfeatures and get the
    35             # individual exons
    36             all_exons = []
    37             for sub_feature in feature.sub_features:
    38                 # add the info about the exon, ignoring fuzzy locations
    39                 all_exons.append((sub_feature.location.nofuzzy_start,
    40                                   sub_feature.location.nofuzzy_end))
    41 
    42             # now print out the exon information
    43             print "   Information for %s" % feature.qualifiers["gene"]
    44             print "\tExons"
    45             for exon in all_exons:
    46                 print "\t  Start: %s, End: %s" % (exon[0], exon[1])
    47 
    48             # get the mRNA sequence for the exon
    49             # first add up all of the DNA info from the GenBank file
    50             exon_seq = MutableSeq("", IUPAC.ambiguous_dna)
    51             for exon in all_exons:
    52                 start_pos = exon[0] - 1
    53                 end_pos = exon[1] - 1
    54                 exon_seq.extend(cur_entry.seq[start_pos:end_pos])
    55             # now transcribe the sequence
    56             mrna_seq = utils.transcribe(exon_seq.toseq())
    57             print "\tSequence"
    58             print "\t  %s" % mrna_seq
>>
  

 Discussion
-----------
  This example demonstrates just about the full potential of the Biopython
GenBank parser, so it might be a bit in-depth. Apologies, but I also wanted to
show off the things that it can do. Let's just look through the script a
little at a time and touch on the things it's trying to demonstrate.After
importing the necessary modules, lines 13 to 17 show the standard process for
setting up an iterator to parse a file of GenBank records. In this case, we
use the GenBank FeatureParser, which parses the records from the file into
generic SeqRecord objects, with features. It would also be possible to use the
non-generic RecordParser, which parses the records into a GenBank-specific
format. We pass our `feature_parser' and file handle to an iterator, which
will allow us to go through an entire file full of GenBank entries. Now that
we've got the iterator set up, we are ready to deal with all of the records in
the file. Lines 20 to 26 are the standard python way to loop through all
records with an iterator. We start an infinite loop with `while 1:' and on
each re-iteration of the loop get a new record object. When we run out of
record objects the iterator will return None. We check for this, and break out
of the looping when this occurs. We've now set up a loop to get all of the
GenBank records parsed in our file; now we just need to do something with
them.In this case we're interested in getting out information about annotated
mRNAs in the GenBank file. This type of information is related to an actual
biological sequence as features on the sequence. Since we are using a
FeatureParser, we get back SeqRecord objects to deal with, which contain the
features of a sequence in a attribute appropriately called `features'.Line 31
lets us loop through all of the features in our record. Each feature is a
SeqFeature object, and to find only mRNA features we look for the objects
which have 'mRNA' set as the type attribute of the feature. This is done in
line 33.Now that we've isolated our features of interest, we need to extract
information from them. The feature will look something like:
<<
       mRNA           join(143..367,938..1127,1222..1837,1929..2057,2151..2198,
                      2284..2363,2436..2626,2714..3034,3127..3190,3283..3809)
                      /gene="trp1"
>>
  What we want to do is find the locations of all exons of this mRNA (ie. 143
to 367, 938 to 1127 ...). All of these individual locations will be held as
`sub_features' of the top level mRNA feature. These `sub_features' are also
SeqFeature objects, but each `sub_feature' will be an individual exon in the
mRNA.To begin going through the `sub_features' and collecting the exon
locations, line 37 sets up a loop. With each exon in the loop, we look at the
location attribute of the feature. This location is also a class, which has
start and end attributes. There is only one additional level of complication:
the start and end are not just simple numbers. In order to deal with dreaded
``fuzzy'' locations (ie. '<143' can be a valid start item that will indicate
that a location is somewhere less then 143), we need to make each position a
class which can potentially give information about fuzziness. In this script,
we'll assume we don't really care about this level of complication, and just
want a number. All of the locations have attributes called `nofuzzy_start' and
`nofuzzy_end' which will just return reasonable numbers specifying the start
and end of the location. We use this information to collect up the exon
positions in lines 39 and 40 by adding them to a python list that will look
like:
<<
  [(143, 367), (938, 1127), (1222, 1837), (1929, 2057), (2151, 2198),
   (2284, 2363), (2436, 2626), (2714, 3034), (3127, 3190), (3283, 3809)]
>>
  Lines 42 to 46 simply print out this information collected about the exons.
You could do whatever kind of analysis you wanted to do with it here.To
demonstrate something you might want to do with the exons, lines 48 to 58 go a
little further and show how we could use this information to get a Seq object
with the actual mRNA sequence. We start with a MutableSeq object in line 50,
and build up this sequence with the appropriate DNA sequences from the
sequence. For each exon, we collect the sequence from the start to the end of
the exon. Notice that since the annotations in the file are in ``biological''
coordinates (starting at 1 for the first base) and things in python and
Biopython work in ``computer'' coordinates (starting at 0 for the first base),
we subtract 1 from the annotated start and end positions so we get the exact
sequences we want.Once we've build up the sequence, line 56 uses the
transcribe functionality of `Bio.utils' to convert our DNA into the
appropriate mRNA. Finally, we'll end up with a predicted mRNA in a Seq object
that looks like:
<<
  Seq('GAGAGAAAGAAAUAGGAGAUUGUUUUUGCUUGUGGUGAGAGAAGAUCGCAUCGUGUGGGU...', 
  IUPACAmbiguousRNA())
>>
  Whew, we just got some serious information out of that GenBank file; that
sure was a lot of fun. Hopefully this example helps demonstrate how to
effectively use the GenBank parser to extract information about a sequence.