[BioPython] writing clustalw alignment to file

Brad Chapman chapmanb@arches.uga.edu
Thu, 6 Dec 2001 08:31:36 -0500


Hi Scott;
Gack! Discovered this while I was cleaning out my immense mail backlog.
Sorry for the very late reply. Hope this still helps.

> (1) I am trying to convert a clustalw alignment to a fasta alignment (pg.
> 38-39 of Cookbook). I can do this just like in the book, but then I want to
> write the fasta alignment to a file and I don't know how to do this.

You should just be able to do:

handle = open("yourfile.fasta", "w")
handle.write(str(fasta_align))
handle.close()

It looks like I wrote a custom __str__ representation, so this should
write a sensible FASTA format to the file.

> Also, I
> would really like to search through the fasta or clustalw file for
> particular sequence names (titles) so that I can retrieve the particular
> sequence but I do not know how this is done.
> 
> For instance, I would like to find the following sequence from a file:
> 
> >APE2174
> ------------------------------------------------------------
> --------------------------MVKGSQVKPSTTELLLKAVSAKAPSGD-------
> ---PIHQK---IGDLPFEKIVEIAIEKKPD--LLAKTLKAAVKTILGSARSIGVTVDGKD
> PKEVTRQVDEGVYDAVLAKYEEKWEEAEG

For this, I would just load the file in with a fasta parser, and then
search for the title you are looking for. Something like:

handle = open("my_file", "r")
parser = Fasta.RecordParser()
iterator = Fasta.Iterator(handle, parser)

while 1:
    cur_record = iterator.next()
    if cur_record is None:
        break
    if cur_record.title.find("your_title_of_interest") >= 0:
        print "Here's the record:\n %s" % cur_record

Does this do what you need?

> (2) The other question has to do with parsing Genbank files. I can use the
> Genbank parser to retrieve exon and intron sequences just fine and I give a
> (partial) example of a function I wrote to do this below from the "mRNA"
> feature (very similar to the example in the cookbook). However, when I try
> to get the positions from the "gene" feature I run into problems. For
> example:
> 
>      gene            2..1021
>                      /gene="HI0001"
> 
> All I want is the positions of the gene and the gene name but I keep running
> into problems. 

Okay dokee. It looks like you are almost there. I think the following
would work:

def get_introns(genbank_file_name):
    gb_handle = open(genbank_file_name, "r")
    parser = GenBank.FeatureParser()
    iterator = GenBank.Iterator(gb_handle, parser)
    
    while 1:
        cur_record = iterator.next()
        if cur_record is None:
            break
        gene_info = find_genes(record)
        # do whatever you want to do with the gene information

def find_genes(gb_record):
    """Find features reporting gene information in a GenBank Record.

    Returns a list of three items, where each item has:
        (gene_start, gene_end, gene_name)
    """
    mrna_info = []
    for feature in gb_record.features:
        if feature.type == "mRNA":
            mrna_start = feature.location.nofuzzy_start
            mrna_end = feature.location.nofuzzy_end
            if feature.qualifiers.has_key("gene"):
                mrna_name = feature.qualifiers["gene"]
            else:
                mrna_name = "not given"
            mrna_info.append((mrna_start, mrna_end, mrna_name))


    return mrna_info

Does this do the right thing? Or at least given you a hint?

Hope this helps. Sorry again for the long delay in responding.

Careful-with-the-untested-code-I-wrote-in-my-mailer-ly yr's,
Brad