[Bioperl-l] primer candidates validation by comparing the wgs blast results between fwd and rev.

Thu Feb 18 06:55:43 UTC 2010

I think you are best off starting with:
http://www.bioperl.org/wiki/Module:Bio::Tools::Run::RemoteBlast
and
http://www.bioperl.org/wiki/FAQ#I_want_to_parse_BLAST.2C_how_do_I_do_this.3F

Cheers,
Jelle

2010/2/18 teetee <sclantw at hotmail.com>

>
> I am totally new to BioPerl.
>
> I would like to see if anyone could give me a hint or clue for tackling
> the problem I am trying to solve: use a BioPerl/Perl script and CGI to
> create a primer quality-control web interface.
>
> The steps I would like to automate:
> I design many primer pairs (~500+) flanking intron regions of silkworm wgs
> sequences close to selected cDNA/mRNA/EST/molecular anchor loci. After the
> primers are generated (I wish this step could be automated, but it really
> can't), I have to blast each and every one of them against the wgs database
> of the same organism to make sure that the forward and reverse primer
> blastn hits share no contig other than the target intron region, to avoid
> non-target amplification.
> The steps I take to validate the primers are as follows:
> 1. On the NCBI blastn web page, enter the forward primer sequence in the
> search field, label it ("job title"), choose the wgs database and the
> organism, and click "submit" to start the search.
> 2. Open another browser tab, go to the NCBI blastn web page, enter the
> reverse primer sequence in the search field, label it ("job title"), choose
> the wgs database and the organism, and click "submit" to start the search.
> 3. On the forward primer blastn result page, write down the top 20 wgs
> sequences that are from the build 2 genomic sequencing project (the title
> of each hit contains a string of the form
> "Bm_scaf<number>_contig<number>").
> 4. On the reverse primer blastn result page, write down the top 20 wgs
> sequences that are from the build 2 genomic sequencing project (the title
> of each hit contains a string of the form
> "Bm_scaf<number>_contig<number>").
> 5. Compare the recorded blast hits from steps 3 and 4 and list the
> common hit(s) between the two primer sequences (those with the same
> scaffold and contig number).
> 6. Show a warning if there is more than one common hit, since there should
> be only one target hit.
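
Steps 3 to 6 above boil down to a pattern match followed by a list intersection. Here is a minimal sketch of that logic, written in Python purely for illustration (the same idea ports directly to a Perl regex and a hash). The function names `build2_ids` and `common_hits` are made up for this example, and the hit titles are copied from the example results further down in this message; fetching and parsing the BLAST report itself is not shown.

```python
import re

# Build 2 contigs carry an ID such as "Bm_scaf21_contig15134" in the hit title.
BUILD2_ID = re.compile(r"Bm_scaf\d+_contig\d+")

def build2_ids(hit_titles, limit=20):
    """Return the first `limit` distinct build 2 contig IDs, in hit order."""
    ids = []
    for title in hit_titles:
        match = BUILD2_ID.search(title)
        if match and match.group(0) not in ids:
            ids.append(match.group(0))
            if len(ids) == limit:
                break
    return ids

def common_hits(fwd_titles, rev_titles, limit=20):
    """Return the contig IDs hit by both primers, in forward-hit order."""
    rev_ids = set(build2_ids(rev_titles, limit))
    return [cid for cid in build2_ids(fwd_titles, limit) if cid in rev_ids]

# Hit titles taken from the example results quoted in this message.
fwd_titles = [
    "Bombyx mori DNA, contig: Bm_scaf21_contig15134, strain: p50T/Dazao, build 2",
    "Bombyx mori DNA, contig: Bm_scaf121_contig38273, strain: p50T/Dazao, build 2",
    "Bombyx mori strain Dazao Ctg021213, whole genome shotgun sequence",
    "Bombyx mori DNA, contig: Bm_scaf8_contig7204, strain: p50T/Dazao, build 2",
    "Bombyx mori DNA, contig: Bm_scaf33_contig20379, strain: p50T/Dazao, build 2",
]
rev_titles = [
    "Bombyx mori DNA, contig: Bm_scaf21_contig15134, strain: p50T/Dazao, build 2",
    "Bombyx mori strain Dazao Ctg021213, whole genome shotgun sequence",
    "Bombyx mori DNA, contig: Bm_scaf56_contig28024, strain: p50T/Dazao, build 2",
    "Bombyx mori DNA, contig: Bm_scaf2_contig1710, strain: p50T/Dazao, build 2",
]

shared = common_hits(fwd_titles, rev_titles)
print("common build 2 hits:", shared)
if len(shared) > 1:
    print("WARNING: more than one common hit, possible non-target amplification")
```

With a parser in place, `limit` can be raised well beyond 20 at no extra effort.
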
>
> Example:
> I have these two primer sequences:
> GCATCGGTGAACGAGCTA
> CGCCTGCAAACGAGAATA
>
> First I blast each of the above primer sequences against the wgs database
> for Bombyx mori (organismid:7091) on the blast website:
>
> http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&BLAST_PROGRAMS=megaBlast&PAGE_TYPE=BlastSearch&SHOW_DEFAULTS=on&LINK_LOC=blasthome
> After I get the results in two different browser tabs, I record the
> results. For example, from the forward primer result page (only the first
> several hits are listed):
> =================== first few hits from forward primer blast result ===================
> BABH01015134.1
> Bombyx mori DNA, contig: Bm_scaf21_contig15134,
> strain: p50T/Dazao, build 2, whole genome shotgun sequence
> 34.2 34.2 94% 0.12 100%
>
> BABH01038273.1
> Bombyx mori DNA, contig: Bm_scaf121_contig38273,
> strain: p50T/Dazao, build 2, whole genome shotgun sequence
> 34.2 34.2 94% 0.12 100%
>
> AADK01021213.1
> Bombyx mori strain Dazao Ctg021213, whole genome shotgun sequence
> 34.2 34.2 94% 0.12 100%
>
> BAAB01106839.1
> Bombyx mori DNA, contig477862, whole genome shotgun sequence
> 34.2 34.2 94% 0.12 100%
>
> BAAB01154920.1
> Bombyx mori DNA, contig585939, whole genome shotgun sequence
> 34.2 34.2 94% 0.12 100%
>
> BABH01007204.1
> Bombyx mori DNA, contig: Bm_scaf8_contig7204,
> strain: p50T/Dazao, build 2, whole genome shotgun sequence
> 32.2 32.2 88% 0.48 100%
>
> BABH01020379.1
> Bombyx mori DNA, contig: Bm_scaf33_contig20379,
> strain: p50T/Dazao, build 2, whole genome shotgun sequence
> 32.2 32.2 88% 0.48 100%
> =================== end of the forward blast result ===================
> =================== first few hits from reverse primer blast result ===================
> BABH01015134.1
> Bombyx mori DNA, contig: Bm_scaf21_contig15134,
> strain: p50T/Dazao, build 2, whole genome shotgun sequence
> 36.2 36.2 100% 0.031 100%
>
> AADK01021213.1
> Bombyx mori strain Dazao Ctg021213, whole genome shotgun sequence
> 36.2 36.2 100% 0.031 100%
>
> AADK01032592.1
> Bombyx mori strain Dazao Ctg032592, whole genome shotgun sequence
> 36.2 36.2 100% 0.031 100%
>
> BAAB01106839.1
> Bombyx mori DNA, contig477862, whole genome shotgun sequence
> 36.2 36.2 100% 0.031 100%
>
> BABH01028024.1
> Bombyx mori DNA, contig: Bm_scaf56_contig28024,
> strain: p50T/Dazao, build 2, whole genome shotgun sequence
> 30.2 30.2 83% 1.9 100%
>
> AADK01039561.1
> Bombyx mori strain Dazao Ctg039561, whole genome shotgun sequence
> 30.2 30.2 83% 1.9 100%
>
> AADK01056852.1
> Bombyx mori strain Dazao Ctg063892, whole genome shotgun sequence
> 30.2 30.2 83% 1.9 100%
>
> BABH01001710.1
> Bombyx mori DNA, contig: Bm_scaf2_contig1710,
> strain: p50T/Dazao, build 2, whole genome shotgun sequence
> =================== end of the reverse result ===================
>
> From the list, I would record the ones with "Bm_scaf#_contig#" (e.g.
> Bm_scaf21_contig15134), since that is the string pattern I would like to
> compare with the hits from the reverse primer blast results.
> After I record the first 20 qualifying blast hits (hopefully with a blast
> parser program I can use more than 20), I compare them with the ones from
> the reverse primer search result and see if there is any common result
> other than the target.
>
> I am OK with going through this validation process manually if there are
> only tens of primers to design. However, with 500 and maybe more primers
> to come, I believe there is an easier way.
>
>
> I imagine the code will have the following functions:
> input: the user's primer pairs, with multiple-entry capability
> output: compare the blastn results of the forward and reverse primers and
> generate a list of common blastn hits (with the same scaf# and contig# from
> build 2 wgs sequences - naming convention: Bm_scaf#_contig#) on the wgs
> (accept the blast search parameters through the web and pass them to the
> blast command)
> background record-keeping mechanism: save the blast report of each primer
> vs. wgs blastn search and name the files properly.
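
For the multiple-entry and record-keeping parts, one possible shape is a per-pair summary plus a predictable report file name. The sketch below is again Python for illustration only. The naming scheme in `report_filename`, the status strings, and the contig IDs in the example are all assumptions made up for this sketch, not an established convention. Flagging a pair with no common hit at all is also an addition of this sketch; the original steps only call for a warning on more than one common hit.

```python
import re

def report_filename(pair_name, direction, database="wgs"):
    """Build a report file name such as 'pair_001_fwd_vs_wgs.blastn.txt'.

    The naming scheme is hypothetical; adapt it to your own conventions.
    """
    if direction not in ("fwd", "rev"):
        raise ValueError("direction must be 'fwd' or 'rev'")
    # Replace anything that is awkward in a file name with an underscore.
    safe = re.sub(r"[^A-Za-z0-9_-]+", "_", pair_name).strip("_")
    return f"{safe}_{direction}_vs_{database}.blastn.txt"

def summarize(pairs):
    """Map each pair name to (status, sorted list of shared contig IDs).

    `pairs` maps a pair name to a tuple of
    (forward contig IDs, reverse contig IDs).
    """
    summary = {}
    for name, (fwd_ids, rev_ids) in pairs.items():
        shared = sorted(set(fwd_ids) & set(rev_ids))
        if len(shared) == 1:
            status = "ok"
        elif shared:
            status = "warning: multiple common hits"
        else:
            status = "no common hit"
        summary[name] = (status, shared)
    return summary

# Hypothetical input: contig IDs already extracted from each blast report.
pairs = {
    "pair 001": (["Bm_scaf21_contig15134", "Bm_scaf8_contig7204"],
                 ["Bm_scaf21_contig15134", "Bm_scaf2_contig1710"]),
    "pair 002": (["Bm_scaf1_contig10", "Bm_scaf2_contig20"],
                 ["Bm_scaf2_contig20", "Bm_scaf1_contig10"]),
}

for name, (status, shared) in summarize(pairs).items():
    print(name, status, shared, report_filename(name, "fwd"))
```

Keeping the file name derivable from the pair name means a stored report can be found again without a separate index.
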
>
> I guess my question is:
> What would be the most straightforward approach? (I know you probably think
> I already know the method since I posted the question here, but the more
> suggestions the better.) And where should I start?
>
> My background:
> 1. I've written code for retrieving a PDB file from a user's 4-letter PDB
> protein ID entry and for atom-atom distance measurement, with the same
> setup (Perl script + web CGI interface).
> 2. I've used command-line megablast to batch blast multiple sequences;
> however, my impression is that it is not intended for short-sequence
> blast (primers are usually around 20-24 bp long).
> ref.:
>
> http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
> 3. I can modify simple Perl code and do some text string manipulation in
> Perl.
> 4. I have my own Linux box and have Apache/Perl/BioPerl/CGI ready.
>