[Bioperl-l] automation of translation based on alignment

Mon Mar 22 16:51:08 EDT 2010

Ross, Chris F,

I'd like to just comment on this since we are working in parallel on a
similar problem. See also the prior thread in the archives on Peter's
work in BioPython that I instigated: "Polyproteins, robo slippage, viral
mat_peptides"

The dialog below is just to clarify the science that will guide the
pseudocode and logic flow that would need to be built out into a
BioPerl module. There are plenty of comments on the string mashing
required, and it's a harrowing morass, but here are some other thoughts.
Three line-item comments first, and then some open general ideas for
moving this block of concepts forward:

1.
>> Ross Said:

>> I am working on virus sequences and one of the Genbank file is here:
>>
>> http://www.ncbi.nlm.nih.gov/nuccore/DQ089804.1?ordinalpos=1
>> <http://www.ncbi.nlm.nih.gov/nuccore/DQ089804.1?ordinalpos=1&itool=EntrezSys
>> tem2.PEntrez.Sequence.Sequence_ResultsPanel.Sequence_RVDocSum>

If you are transferring protein annotation, why not use the RefSeq one
instead of a GenBank one? In our experience at Virusbrc.org we find
that protein annotation transfer is only a valid idea if you have
reference sequences for each serotype, or your annotations will have
propagation errors from the reference. They just don't align more than
80% of the time, for instance in Dengue, and I assume you want better
than that? Yes, this HepB is a decent sequence, but the problem is that
HepB has four main serotypes, and yet there is only one RefSeq:
NC_003977. My guess is that you will have to define reference peptide
seqs for all four serotypes first, and then grab the Taxon_ID from the
unknown input file so you align to the right one, i.e. you need to do
virus annotation below the species level or it isn't accurate. The
number of reference sequences that you use is related to the
conservation of your virus family. The script needs to know which one
to align to, so we have pulled that from the taxon_ID field of the
*.gbk file. You could also use blast and pull the high scorer. Your
choice.
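
The reference-selection step above could be sketched roughly like this.
It is a minimal Python sketch rather than BioPerl, and it assumes the
record carries a /db_xref="taxon:..." qualifier on its source feature.
The names REFERENCE_BY_TAXON, extract_taxon_id and pick_reference, and
the taxon IDs and reference labels in the table, are placeholders made
up for illustration only.

```python
import re

# Hypothetical lookup table: taxon ID -> reference label for that serotype.
# The IDs and labels are placeholders, not real NCBI identifiers.
REFERENCE_BY_TAXON = {
    "1000001": "serotype_A_reference",
    "1000002": "serotype_B_reference",
}

# Matches the taxon cross-reference qualifier of a GenBank source feature.
TAXON_RE = re.compile(r'/db_xref="taxon:(\d+)"')


def extract_taxon_id(genbank_text):
    """Return the first taxon ID found in a GenBank flat file, or None."""
    match = TAXON_RE.search(genbank_text)
    return match.group(1) if match else None


def pick_reference(genbank_text, default=None):
    """Choose a reference by taxon ID; otherwise return `default`
    (for example, the top-scoring BLAST hit chosen by the caller)."""
    taxon_id = extract_taxon_id(genbank_text)
    return REFERENCE_BY_TAXON.get(taxon_id, default)
```

If the taxon ID is missing or not in the table, the caller gets the
default back, which is where the "use blast and pull the high scorer"
option would plug in.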

2.
>> Ross said:
>>
>> Thanks for your response. While the one with Genbank file can be
>> extracted, those without have to rely on alignment. Scripts certainly
>> can be written to move forward and backward on the multiple alignment
>> but it is an error-prone

We also find that viruses don't have the proteins annotated most of the
time. It's just a genome file. Part of the problem is that /host/
proteases sometimes cleave the /viral/ polyproteins, in a
species-specific way, and since there is only one database entry, but
many hosts, you can /only/ give the genome code and still be right for
everything it /might/ infect. You can't define the peptides in the
file, because they might be different, depending on the host. Sick,
isn't it? The proteins produced in different animals, based on their
proteases' cleavage specificity, help determine whether the virus
affects that animal or not. This is my hunch based on experience; no,
I cannot give an example.

3.
Chris F said:

> To preface this, any reason you're not translating the alignment  
> sequences using the above sequence's features as a reference?

A logical place to start. But they are usually not given. In addition
to the above reason, data for viral sequences are scarcer, since fewer
grad students want to sequence things that maim you or make you hurl
if you screw up on the nucleic acid extraction. Also, the locations
for protein processing sites can be variable, like > or < instead of a
real location in the string. So, the GenBank file isn't really very
good as a reference, 5% of the time. Last, if there are three child
proteins from a CDS, and one is made by a host protease, one by a
viral protease, and one by a start codon, what do you say is 'mature'?
What should be in the 'feature' field? It's not standardized right
now. Nobody has this nailed at NCBI or UniProt.

Still, like Chris says, a script that asks first for the coordinates,
and takes that as the first go-round, is best. The GenBank coords, when
provided, are accurate most of the time. After that, you end up
comparing everything and making your choice.
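
A rough sketch of that "coordinates first, then compare" decision, in
Python rather than BioPerl. It assumes only simple GenBank locations
such as 100..400, <1..400 or 100..>400; compound locations (join,
complement) are treated as unsupported and sent to the fallback. The
function names parse_location and coordinates_or_fallback are
hypothetical, not part of any library.

```python
import re

# Simple GenBank locations only: "100..400", "<1..400", "100..>400".
LOCATION_RE = re.compile(r'^(<?)(\d+)\.\.(>?)(\d+)$')


def parse_location(location):
    """Return (start, end, start_is_fuzzy, end_is_fuzzy),
    or None if the location string is not a simple range."""
    match = LOCATION_RE.match(location.strip())
    if match is None:
        return None
    start_flag, start, end_flag, end = match.groups()
    return int(start), int(end), start_flag == "<", end_flag == ">"


def coordinates_or_fallback(location):
    """Trust annotated coordinates when both ends are exact;
    otherwise tell the caller to compare against a reference."""
    parsed = parse_location(location) if location else None
    if parsed is None or parsed[2] or parsed[3]:
        return ("align", None)
    return ("coords", (parsed[0], parsed[1]))
```

For example, coordinates_or_fallback("100..400") gives
("coords", (100, 400)), while a missing location or one with < or > at
either end gives ("align", None).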

4.
Last thoughts:

* We tried BL2Seq to align query to target one at a time, with good
reference sequences. It works, for exactly what you ask for. But! Only
in a few virus families. And it's 1200 lines long, doing error
checking; as you say, it's just not easy. Pulling an HSP from a blast
report leaves one with a lot of end trimming and comparing to do,
since the HSP ends in an identity, and well, sometimes viruses vary at
the point of cleavage of proteins. Good luck with that task, it gave
us fits. It's not really appropriate to look at the ends of the HSP and
say they are right. It requires that extra code. Still, we may open
that code to the public after the April database release. It only
works for well conserved viruses. (I know... Jumbo Shrimp).

* I know of no BioPerl module that can parse an MSA and take out the
relevant alignments, so you don't have to assign a reference sequence
from scratch every time you do this. Is there one?

* Sometimes the features on viruses are named differently:
/mat_peptide, /sig_peptide; sometimes they are named differently in
/note or /product. There is no standard for much of this. It needs to
be proposed. Maybe we can do that together.

* If you want to use a synoptic MSA for all Hepatitis B viruses, and  
then pull the alignments out of that, I'd love to talk to you. The  
VBRC used precomputed MSAs for all their virus families and got  
forward a little bit. We are looking into that code.
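
To make the "pull the alignments out of an MSA" idea concrete, here is
a minimal Python sketch (not BioPerl, and not our production code) of
projecting a reference peptide's coordinates onto a query through two
rows of an alignment. It assumes both rows come from the same alignment
and have equal length, that "-" is the gap character, and that
coordinates are 1-based and inclusive. The names column_map and
project_interval are hypothetical. It does not check whether the
projected ends are real cleavage sites; that end checking is still
extra work.

```python
def column_map(aligned_seq, gap="-"):
    """Map 1-based ungapped positions to 0-based alignment columns."""
    pos_to_col = {}
    position = 0
    for column, char in enumerate(aligned_seq):
        if char != gap:
            position += 1
            pos_to_col[position] = column
    return pos_to_col


def project_interval(aligned_ref, aligned_query, ref_start, ref_end, gap="-"):
    """Project a 1-based inclusive reference interval onto the query.

    Returns (query_start, query_end) in ungapped query coordinates,
    or None if the query has no residues in that region.
    """
    ref_cols = column_map(aligned_ref, gap)
    first_col, last_col = ref_cols[ref_start], ref_cols[ref_end]
    query_positions = []
    position = 0
    for column, char in enumerate(aligned_query):
        if char != gap:
            position += 1
            if first_col <= column <= last_col:
                query_positions.append(position)
    if not query_positions:
        return None
    return query_positions[0], query_positions[-1]
```

With a precomputed MSA, the reference row and each query row could be
fed through project_interval once per annotated peptide, instead of
re-aligning to a reference from scratch each time.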

All ideas. Nothing set in stone. Dialog welcome.

Good luck all.

Chris

-- 

Christopher Larsen, Ph.D.
Sr. Scientist / Grants Manager
Vecna Technologies
6404 Ivy Lane #500
Greenbelt, MD 20770
Phone: (240) 965-4525
Fax: (240) 547-6133

clarsen at vecna.com