[Biopython-dev] 7/5 biopython Questions - BioStar

Tue Jul 5 10:56:46 UTC 2011

// GenBank to Fasta failing with CONTIG fields
// July 5, 2011 at 6:31 AM

http://biostar.stackexchange.com/questions/9892/genbank-to-fasta-failing-with-contig-fields
I used to generate FASTA out of my GenBank source files using a simple conversion script:

#!/usr/bin/env python
import sys
from Bio import SeqIO

def wrap(text, width=80):
    # Yield successive chunks of at most `width` characters.
    for i in range(0, len(text), width):
        yield text[i:i+width]

if __name__ == "__main__":
    for record in SeqIO.parse(sys.stdin, "genbank"):
        # Not every record carries a GI number.
        try:
            gi = record.annotations["gi"]
        except KeyError:
            gi = None

        accession = record.id
        desc = record.description
        seq = record.seq
        locus = record.name
        # FASTA header line, then the sequence wrapped at 80 columns.
        print(">gi|%s|emb|%s|%s| %s" % (gi, accession, locus, desc))
        for block in wrap(seq):
            print(block)

When I changed the sequence files to newer versions some of the resulting FASTA file sequences were just filled with Ns. After closer inspection of the GenBank source files, it turns out that they have replaced the ORIGIN block

ORIGIN
  sequence...

with a CONTIG block, something like

CONTIG      join(BX640437.1:1..347356,BX640438.1:51..347786,...)

Is there a way to resolve this using BioPython?
I was working with BioPython 1.52 and 1.57 (latest).

Thanks for your suggestions.
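
For reference, a CONTIG line only lists the component records and coordinate ranges that make up the sequence, so the bases themselves have to come from those components, or from a download that already includes them (NCBI's efetch has a "gbwithparts" return type for this). Below is a minimal sketch, using only the standard library, of pulling the pieces out of such a line. The function name parse_contig and the regular expression are illustrative assumptions, and real CONTIG entries can also contain gap(...) and complement(...) elements that this sketch does not handle.

```python
import re

# Matches one "ACCESSION.VERSION:start..end" element of a CONTIG join().
# Assumption: plain forward-strand pieces only; gap() and complement()
# elements are ignored by this sketch.
PIECE = re.compile(r"([A-Za-z]+_?\d+\.\d+):(\d+)\.\.(\d+)")

def parse_contig(line):
    """Return a list of (accession, start, end) tuples from a CONTIG line."""
    return [(acc, int(start), int(end))
            for acc, start, end in PIECE.findall(line)]

if __name__ == "__main__":
    example = "CONTIG      join(BX640437.1:1..347356,BX640438.1:51..347786)"
    for acc, start, end in parse_contig(example):
        print("%s\t%d\t%d" % (acc, start, end))
```

Each tuple names a record and the 1-based range to take from it; the full sequence would be those ranges concatenated in order.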

// Parsing BLAST output BioPython Error
// July 5, 2011 at 2:25 AM

http://biostar.stackexchange.com/questions/9882/parsing-blast-output-biopython-error
Hi, 
I have the following code

    def runBLAST(self):
        print "Running BLAST .........."
        cmd=subprocess.Popen("blastp -db nr -query repeat.txt -out out.faa -evalue 0.001 -gapopen 11 -gapextend 1 -matrix BLOSUM62 -remote -outfmt 5",shell=True)
        cmd.communicate()[0]
        f1=open("out.faa")
        blast_records = NCBIXML.parse(f1)
        save_file = open("my_fasta_seq.fasta", 'w')
        for blast_record in blast_records[:10]:
            for alignment in blast_record.alignments:
                for hsp in alignment.hsps:
                    save_file.write('>%s\n' % (alignment.hseq,))
        save_file.close()
        f1.close()
        f2=open("my_fasta_seq.fasta")
        for record in SeqIO.parse(f2,"fasta"):
            f=open("tempBLAST1.txt","w")
            f.write(">"+"\n"+str(record.name)+"\n"+str(record.seq)+"\n")
            f.close()

I get a TypeError on the line "for blast_record in blast_records[:10]:" saying 'generator' object is not subscriptable.
I am trying to get the top 10 BLAST hits (sequences only).
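
The TypeError arises because NCBIXML.parse returns a generator, which can be iterated but not sliced. Below is a minimal sketch of two ways around this: itertools.islice limits how many records are read, while blast_record.alignments is an ordinary list and can be sliced directly. The helper name top_hit_sequences is hypothetical, and the stand-in objects only mimic the alignments, title, hsps and sbjct attributes of Biopython's BLAST record classes so that the example runs without a BLAST result file.

```python
from itertools import islice
from types import SimpleNamespace

def top_hit_sequences(blast_records, max_records=None, max_hits=10):
    """Yield (hit title, subject sequence) for the best hits of each record.

    blast_records may be any iterable, including the generator returned
    by NCBIXML.parse(handle).
    """
    # islice works on generators, unlike blast_records[:10].
    for record in islice(blast_records, max_records):
        # alignments is a list, so it can be sliced.
        for alignment in record.alignments[:max_hits]:
            # The aligned subject sequence lives on the HSP, not the alignment.
            # Assumption: the first HSP is the one of interest.
            hsp = alignment.hsps[0]
            yield alignment.title, hsp.sbjct

if __name__ == "__main__":
    # Stand-in data (hypothetical) shaped like parsed BLAST records.
    def fake_records():
        hits = [SimpleNamespace(title="hit%d" % i,
                                hsps=[SimpleNamespace(sbjct="MKV" + "A" * i)])
                for i in range(15)]
        yield SimpleNamespace(alignments=hits)

    for title, seq in top_hit_sequences(fake_records()):
        print(">%s\n%s" % (title, seq))
```

Note that the subject string of an HSP covers only the aligned region and may contain gap characters.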

// Getting top 10 sequences of BLAST results Bio Python
// July 5, 2011 at 12:29 AM

http://biostar.stackexchange.com/questions/9880/getting-top-10-sequences-of-blast-results-bio-python
Hi, 

I want to get the top 10 sequences from BLAST results (just the sequences, no alignment, score, e-value, etc.). My input is a text file containing 5 FASTA sequences, so the output should be the top 10 BLAST hits for each of them, i.e. 50 sequences in total.

I am reading each input FASTA record through Bio.SeqIO, writing it to temp.faa and then passing it to command-line BLAST through subprocess as

blastp -db nr -query temp.faa -out out.faa -evalue 0.001 -gapopen 11 -gapextend 1 -matrix BLOSUM62  -remote -outfmt 2

The output contains a lot of other information. Should I parse this output, or is there a better way?

Thanks

P.S. XML might be the way, but I didn't find the relevant NCBIXML parser syntax.
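
One alternative to parsing the full report is BLAST+'s tabular output with a custom column list, for example -outfmt "6 qseqid sseqid sseq", where sseq is the aligned part of the subject sequence. Below is a rough sketch of keeping the first ten subjects per query from such output. The function name top_subjects and the three-column layout are assumptions, and hits are assumed to be listed best-first within each query.

```python
import csv
from collections import OrderedDict

def top_subjects(lines, n=10):
    """Group tabular BLAST rows by query and keep the first n distinct subjects.

    Assumes three tab-separated columns: qseqid, sseqid, sseq.
    """
    best = OrderedDict()
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 3:  # skip blank or malformed lines
            continue
        query, subject, seq = row[0], row[1], row[2]
        hits = best.setdefault(query, OrderedDict())
        # A subject can appear once per HSP; keep only its first row.
        if subject not in hits and len(hits) < n:
            # Aligned sequences may contain gap characters.
            hits[subject] = seq.replace("-", "")
    return best

if __name__ == "__main__":
    sample = ["q1\ts1\tMKV-A", "q1\ts1\tAAA", "q1\ts2\tMKVL", "q2\ts3\tGGH"]
    for query, hits in top_subjects(sample, n=10).items():
        for subject, seq in hits.items():
            print(">%s|%s\n%s" % (query, subject, seq))
```

In real use, lines would be an open file handle on the BLAST output rather than a list of strings.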

--
Website: http://biostar.stackexchange.com/questions/tagged/biopython
