[Bioperl-l] primer candidates validation by comparing the wgs blast results between fwd and rev.

Wed Feb 17 21:10:32 EST 2010

I am totally new to bioperl.

I would like to see if anyone could give me a hint or clue for tackling this
problem I am trying to solve:
Use bioperl/perl script and CGI to create a primer quality control web
interface

the steps I would like to be automated:
I design many primer pairs (~500+) flanking intron regions of silkworm wgs
sequences close to cDNA/mRNA/EST/molecular anchor loci selected. After the
primers are generated (I wish this step could be automated but it really
can't), I have to blast each and every of them against the wgs database of
the same organism to make sure there is no common hits in terms of the same
contig number result between the forward and the reverse primers blastn hits
to avoid the non-target amplification except for the target intron region.
The steps I take to validate the primers are as follows:
1. At NCBI blastn webpage, put in the forward primer sequence in the search
field, label it ("job title"), choose the wgs database and the organism, and
click "submit" to start search.
2. Open another browser tab and go to NCBI blastn webpage, put in the
reverse primer sequence in the search field, label it ("job title"), choose
the wgs database and the organism, and click "submit" to start search.
3. On the forward primer blastn result page, write down the top 20 wgs
sequence that was from build 2 genomic sequencing project (the title of each
hit has a text string with certain format like
"Bm_scaf<number>_contig<number>"). 
4. On the reverse primer blastn result page, write down the top 20 wgs
sequence that was from build 2 genomic sequencing project (the title of each
hit has a text string with certain format like
"Bm_scaf<number>_contig<number>"). 
5. compare the recorded blast hits from step 3 and step 4 and list the
common hit(s) between the two primer sequences (with the same scaffold and
contig number)
6. show a warning if there is more than one common hit since there should be
only one target hit.

Example:
I have these two primer sequences:
GCATCGGTGAACGAGCTA
CGCCTGCAAACGAGAATA

First I blast each of the above primer sequences against wgs database bombyx
mori (organismid:7091) on blast website
http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&BLAST_PROGRAMS=megaBlast&PAGE_TYPE=BlastSearch&SHOW_DEFAULTS=on&LINK_LOC=blasthome
After I get the results in two different browser tabs, I record down the
results. For example, from the forward primer result page(only the first
several hits are listed):
=================== first few hits from forward primer blast result
==================
BABH01015134.1
Bombyx mori DNA, contig: Bm_scaf21_contig15134,
strain: p50T/Dazao, build 2, whole genome shotgun
sequence
34.2 34.2 94% 0.12 100%

BABH01038273.1
Bombyx mori DNA, contig: Bm_scaf121_contig38273,
strain: p50T/Dazao, build 2, whole genome shotgun
sequence
34.2 34.2 94% 0.12 100%

AADK01021213.1
Bombyx mori strain Dazao Ctg021213, whole genome
shotgun sequence
34.2 34.2 94% 0.12 100%

BAAB01106839.1
Bombyx mori DNA, contig477862, whole genome shotgun
sequence
34.2 34.2 94% 0.12 100%

BAAB01154920.1
Bombyx mori DNA, contig585939, whole genome shotgun
sequence
34.2 34.2 94% 0.12 100%

BABH01007204.1
Bombyx mori DNA, contig: Bm_scaf8_contig7204,
strain: p50T/Dazao, build 2, whole genome shotgun
sequence
32.2 32.2 88% 0.48 100%

BABH01020379.1
Bombyx mori DNA, contig: Bm_scaf33_contig20379,
strain: p50T/Dazao, build 2, whole genome shotgun
sequence
32.2 32.2 88% 0.48 100%
================================ end of the forward blast result
================================ first few hits from reverse primer blast
result ==================
BABH01015134.1
Bombyx mori DNA, contig:
Bm_scaf21_contig15134, strain: p50T/Dazao,
build 2, whole genome shotgun sequence
36.2 36.2 100% 0.031 100%

AADK01021213.1
Bombyx mori strain Dazao Ctg021213, whole
genome shotgun sequence
36.2 36.2 100% 0.031 100%

AADK01032592.1
Bombyx mori strain Dazao Ctg032592, whole
genome shotgun sequence
36.2 36.2 100% 0.031 100%

BAAB01106839.1
Bombyx mori DNA, contig477862, whole genome
shotgun sequence
36.2 36.2 100% 0.031 100%

BABH01028024.1
Bombyx mori DNA, contig:
Bm_scaf56_contig28024, strain: p50T/Dazao,
build 2, whole genome shotgun sequence
30.2 30.2 83% 1.9 100%

AADK01039561.1
Bombyx mori strain Dazao Ctg039561, whole
genome shotgun sequence
30.2 30.2 83% 1.9 100%

AADK01056852.1
Bombyx mori strain Dazao Ctg063892, whole
genome shotgun sequence
30.2 30.2 83% 1.9 100%

BABH01001710.1
Bombyx mori DNA, contig: Bm_scaf2_contig1710,
strain: p50T/Dazao, build 2, whole genome
shotgun sequence
===============================end of the reverse result

>From the list, I would record down the ones with "Bm_scaf#_contig#"(ex.
Bm_scaf21_contig15134) since that's the string pattern I would like to
compare with the hits from reverse primer blast results.
After I record down the first 20 qualified blast hits(hopefully with blast
paser program I can use more than 20), I compare them with the ones from the
reverse primer search result and see if there is any common result other
than the target.

I am OK to go through this validation process manually if there are only
tens of primers I have to design. However with 500 and maybe more primers to
come I believe there is an easier way.

I imagine the code will have the following functions:
input: user's primer pairs, multiple entry capatability
output: compare the blastn results between both fwd and reverse primer and
generate a list of common blastn hits (w/ same scaf# and contig# from build
2 wgs sequences - naming convention: Bm_scaf#_contig#) on the wgs (accept
the blast search parameters through the web and pass to the blast command)
background record-keeping mechanism: create records of the blast report for
each primer vs wgs blastn results and properly name the files.

I guess my question is:
What would be the most stright-forward approach? (I know you probably think
I already know the method since I post the question here, but more
suggestions the better) and where should I start?

My background: 
1. I've written codes for retrieving PDB file upon user's PDB 4-letter
protein ID entry and atom-atom distance measurement with the same setup(perl
script+web CGI interface)
2. I've used command-line megablast to batch blast multiple sequences
however my impression is that it's not intend to do short sequence
blast(primers are usually around 20-24bp long).
ref.:
http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
3. I can modify simple perl codes and do some text string menupilation in
perl
4. I have my own linux box and have apache/perl/bioperl/cgi ready.

-- 
View this message in context: http://old.nabble.com/primer-candidates-validation-by-comparing-the-wgs-blast-results-between-fwd-and-rev.-tp27633496p27633496.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.