[Bioperl-l] automation of translation based on alignment
Chris Larsen
clarsen at vecna.com
Mon Mar 22 16:51:08 EDT 2010
Ross, Chris F,
I'd like to just comment on this since we are working in parallel on a
similar problem. See also the prior thread in archives for Peters work
in BioPython that I instigated: "Polyproteins, robo slippage, viral
mat_peptides"
This dialog below is just to clarify the science that will guide the
pseudocode and logic flow would be needed to be built out into a
BioPerl module. There are plenty of comments on the string mashing
required, and its a harrowing morass, but heres some other thoughts.
Three line item comments first, and then some open general ideas for
moving this block of concepts forward:
1.
>> Ross Said:
>> I am working on virus sequences and one of the Genbank file is here:
>>
>> http://www.ncbi.nlm.nih.gov/nuccore/DQ089804.1?ordinalpos=1
>> <http://www.ncbi.nlm.nih.gov/nuccore/DQ089804.1?ordinalpos=1&itool=EntrezSys
>> tem2.PEntrez.Sequence.Sequence_ResultsPanel.Sequence_RVDocSum>
If you are transferring protein annotation, why not use the RefSeq one
instead of a GenBank one? In our experience at Virusbrc.org we find
that protein annotation transfer is only a valid idea if you have
reference sequences for each serotype, or your annotations will have
propagation errors from the reference. They just dont align more than
80% of the time for instance in Dengue, and I assume you want better
then that? Yes this HepB is a decent sequence, but the problem is that
HepB has four main serotypes, and yet there is only one RefSeq:
NC_003977. My guess is that you will have to define reference peptide
seqs for all four serotypes first, and then grab the Taxon_ID from the
input unknown file so you align right i.e. you need to do virus
annotation below the species level or it isnt accurate. The number of
reference sequences that you use is related to the conservation of
your virus family. The script needs to know which one to align to, so
we have pulled that from the taxon_ID field of the *.gbk file. You
could also use blast and pull the high scorer. Your choice.
>> Ross said:
>>
>> Thanks for your response. While the one with Genbank file can be
>> extracted,
>> those without have to rely on alignment. Scripts certainly can be
>> written to
>> move forward and backward on the multiple alignment but it is an
>> error-prone
We find also that viruses dont have the proteins annotated most of the
time. It's just genome file. Part of the problem is that /host/
proteases sometimes cleave the /viral/ polyproteins, in a species-
specific way, and since there is only one database entry, but many
hosts, you can /only/ give the genome code and still be right for
everything it /might/ infect. You cant define the peptides in the
file, because they might be different, depending on the host. Sick,
isnt it? The proteins produced in different animals based on their
proteases cleavage specificity help determine whether the virus
effects that animal or not. This is my hunch based on experience, no,
I cannot give an example.
3.
Chris F said:
> To preface this, any reason you're not translating the alignment
> sequences using the above sequence's features as a reference?
A logical place to start. But-they are usually not given. In addition
to the above reason, the amount of data for viral sequences is rarer
since fewer grad students want to sequence things that mame you or
make you hurl, if you screw up on the nucleic acid extraction. Also,
the locations for protein processing sites can be variable, like > or
< instead of a real location in the string. So, the GenBank file isnt
really very good as a reference, 5% of the time. Last, if there are
three child proteins from a CDS, and one is made by a host protease,
one by a viral protease, and one by a start codon, what do you say is
'mature'? What should be in the 'feature' field? Its not standardized
right now. Nobody has this nailed at NCBI or UniProt.
Still, like Chris says, a script that asks first for the coordinates,
and takes that as the first go round, is best. The GenBank coords when
provided, are accurate most of the time. AFter that, you end up
comparing everything and making your choice.
4.
Last thoughts:
* We tried BL2Seq to align query to target one at a time, with good
reference sequences. It works, for exactly what you ask for. But! Only
in a few virus families. And, its 1200 lines long, doing error
checking; as you say its just not easy. Pulling an HSP from a blast
report leaves one with with a lot of end trimming and comparing to do,
since the HSP ends in an identity, and well, sometimes viruses vary at
the point of cleavage of proteins. Good luck with that task, it gave
us fits. Its not really appropriate to look at the ends of the hsp and
say they are right. It requires that extra code. Still, we may open
that code to the public after April database release. It only works
for well conserved viruses. (I know... Jumbo Shrimp).
* I know of no BioPerl module that can parse an MSA and take out the
relevant alignments, so you dont have to assign a reference sequence
from scratch, every time you do this. Is there one?
*Sometimes the features on viruses are named differently: /
mat_peptide, /sig_peptide; sometimes they are named different in /note
or /product. There is no standard for much of this. It needs to be
proposed. Maybe we can do that together.
* If you want to use a synoptic MSA for all Hepatitis B viruses, and
then pull the alignments out of that, I'd love to talk to you. The
VBRC used precomputed MSAs for all their virus families and got
forward a little bit. We are looking into that code.
All ideas. Nothing set in stone. Dialog welcome.
Good luck all.
Chris
--
Christopher Larsen, Ph.D.
Sr. Scientist / Grants Manager
Vecna Technologies
6404 Ivy Lane #500
Greenbelt, MD 20770
Phone: (240) 965-4525
Fax: (240) 547-6133
clarsen at vecna.com
More information about the Bioperl-l
mailing list