[Biopython] Biopython local blastn query

Ivan Gregoretti ivangreg at gmail.com
Tue Jul 30 15:14:06 UTC 2013


Hi Ara,

If you are interested only in the most obvious matches, and I think you
are, pass the following parameter values to blastn

-max_hsps_per_subject 1 -num_alignments 1

>From the blastn documentation:

 -max_hsps_per_subject <Integer, >=0>
   Override maximum number of HSPs per subject to save for ungapped searches
   (0 means do not override)
   Default = `0'

 -max_target_seqs <Integer, >=1>
   Maximum number of aligned sequences to keep
   Not applicable for outfmt <= 4
   Default = `500'


I hope this helps with your thesis.

Ivan





Ivan Gregoretti, PhD
Bioinformatics



On Tue, Jul 30, 2013 at 10:14 AM, Ara Kooser <ghashsnaga at gmail.com> wrote:

> Peter,
>
>   Thank you for your quick response! I added in the -perc_identity and
> closed the file. I end up with the same results. I do get the full
> sequences but also a bunch of partials.
>
> ara
>
>
> On Tue, Jul 30, 2013 at 2:12 AM, Peter Cock <p.j.a.cock at googlemail.com
> >wrote:
>
> > On Tue, Jul 30, 2013 at 2:45 AM, Ara Kooser <ghashsnaga at gmail.com>
> wrote:
> > > Hello all,
> > >
> > >    I goofed up on curating accession numbers for part of my PhD
> project.
> > > But I have the sequences in a big fasta file! I wrote a quick script
> that
> > > read in one sequence at a time from the file, blasted it and then
> > filtered
> > > it based on 0 gaps and 100% id match. I did this for just the first 6
> > > sequences as to not anger the NCBI. This worked great! But it's slow
> > > (really slow) and I can't submit the whole file.
> > >
> > >  I installed a local blast db and wrote this script.(attached as
> > > meta_data_local.py and the query file, clear_genus_level.fasta ):
> > >
> > >
> >
> ########################################################################################
> > > #I want to read in one sequence at a time from a fasta file and blast
> it
> > > against a local
> > > #blast db.
> > >
> > > from Bio.Blast.Applications import NcbiblastnCommandline
> > > from Bio.Blast import NCBIXML
> > > from Bio import SeqIO
> > > from Bio import Seq
> > > from Bio.SeqRecord import SeqRecord
> > >
> > > nt = "/Users/arakooser/blast/db/nt.00"
> > > #Where the database is located at
> > > file_out = open("metadata_genus.level.csv","w+")
> > >
> > > #Contains all the data my boss wants on the sequences
> > > file_in = open("clear_genus_level.fasta")
> > >
> > > #The main fasta file that needs to be blasted
> > >
> > > fas_rec = SeqIO.parse(file_in,"fasta")
> > > #Parses the main fasta file
> > >
> > > for first_seq in fas_rec:
> > > #Hopefully grabs the first sequence
> > > #Takes that sequence from standard in and sumbits it to the blast
> > > commandline and spits
> > > #out an xml
> > >     result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001,
> > outfmt=5,
> > >                                    out="temp.xml")
> >
> > You could ask BLAST itself to apply the percentage
> > identity threshold, blastn has a -perc_identity option.
> >
> > >     stdout, stderr = result(stdin=first_seq.format("fasta"))
> > >
> > > #Reading in the xml file.
> > > #
> > >
> > >     record = open("temp.xml")
> > >     ...
> >
> > You never close this file handle, perhaps that is
> > causing problems reusing the filename?
> >
> > It might be safer to use a different temporary
> > file each time (there are standard functions to
> > generate these names in Python)?
> >
> > Peter
> >
>
>
>
> --
> Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an
> sub cardine glacialis ursae.
>
> Geoscience website: http://www.tattooedscience.org/
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>



More information about the Biopython mailing list