[Bioperl-l] locate introns in a protein sequence
Jason Stajich
jason at bioperl.org
Tue Mar 8 07:42:54 UTC 2011
I think there are several ways to do this, the easiest is to walk
through each exon, and count up the codons, and print out the
translation of the codons so far when you get to the end of an exon,
keeping the overhang for the next exon.
That's basically what I do here in terms of counting, though I'm mapping
introns into an alignment not a single peptide (though undocumented so
it is a bit impenetrable)
https://github.com/hyphaltip/thesis/blob/master/src/intron/map_introns_aln.pl
Use Bio::DB::SeqFeature::Store will make this much easier. You will
want to convert this Broad GTF to GFF3 with this script
https://github.com/hyphaltip/genome-scripts/blob/master/data_format/gtf2gff3_3level.pl
Then load the annotation along with a scaffold GFF3 (like one from this
script run on your FASTA genome file
https://github.com/hyphaltip/genome-scripts/blob/master/gbrowse_tools/fasta_to_gbrowse_scaffold.pl
- just watch out -- the Broad is capitalizing the contig name
(Supercontig_3.1) while the FASTA file will have it uncapped and
truncated name (supercont_3.1) so you need to fix that in either of the
files so it is consistent - a perl one liner will work here;
perl -i -p -e 's/^>supercont3\./>Supercontig_3./' GENOME.fasta
and load this in and you'll then be able to use the Bio::DB::SeqFeature
to get all genes, walk through them a CDS for each gene, one a time, and
then just keep count of the frame you are in and the exon length and
you'll know where the introns go in the protein.
for my $gene ( $dbh->features(-type =>
"gene:S_cryophilus_OY26_V3_CALLGENES_FINAL_2
") ) {
for my $mRNA ( $gene->get_SeqFeatures('mRNA') ) {
for my $CDS ( $mRNA->get_SeqFeatures('CDS') ) {
my $length = $CDS->length;
# do some stuff to keep count or whatever you need to figure out
where you are to put the intron
}
}
}
you can also see this code which converts folders of GFF3 + genomes into
coding seqs, translated peps, etc.
https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/gff3_to_cdspep.pl
It all depends on what you want to report from this type of query.
Tao Zhu wrote:
> Hello, everyone. For example, I have a GTF file annotating like this,
>
> Supercontig_3.3 S_cryophilus_OY26_V3_CALLGENES_FINAL_2 start_codon 25009
> 25011 . + 0 gene_id "SPOG_00008"; transcript_id "SPOG_00008T0";
> Supercontig_3.3 S_cryophilus_OY26_V3_CALLGENES_FINAL_2 stop_codon 26003
> 26005 . + 0 gene_id "SPOG_00008"; transcript_id "SPOG_00008T0";
> Supercontig_3.3 S_cryophilus_OY26_V3_CALLGENES_FINAL_2 exon 24828
> 25172 . + . gene_id "SPOG_00008"; transcript_id "SPOG_00008T0";
> Supercontig_3.3 S_cryophilus_OY26_V3_CALLGENES_FINAL_2 CDS 25009 25172 .
> + 0 gene_id "SPOG_00008"; transcript_id "SPOG_00008T0";
> Supercontig_3.3 S_cryophilus_OY26_V3_CALLGENES_FINAL_2 exon 25245
> 25364 . + . gene_id "SPOG_00008"; transcript_id "SPOG_00008T0";
> Supercontig_3.3 S_cryophilus_OY26_V3_CALLGENES_FINAL_2 CDS 25245 25364 .
> + 1 gene_id "SPOG_00008"; transcript_id "SPOG_00008T0";
> Supercontig_3.3 S_cryophilus_OY26_V3_CALLGENES_FINAL_2 exon 25414
> 26178 . + . gene_id "SPOG_00008"; transcript_id "SPOG_00008T0";
> Supercontig_3.3 S_cryophilus_OY26_V3_CALLGENES_FINAL_2 CDS 25414 26002 .
> + 1 gene_id "SPOG_00008"; transcript_id "SPOG_00008T0";
>
> Obviously this transcript "SPOG_00008T0" has two introns.
>
> Also I have a corresponding protein sequence file like this( in fasta
> format),
>
>> SPOG_00008T0 | SPOG_00008 | Schizosaccharomyces cryophilus OY26 exosome
> subunit Rrp45 (292 aa)
> MSKSLEPSANNKGFIVNALKKELRLDGRSLTDFRDLKIEFGEDYGQVDISLGSTRVMARI
> SAEITKPYSDRPFDGIFAITTELTPLASPAFETGRVSEQEVIISRLIEQAIRRSNALDTE
> SLCIISGQKCWSVRASVHFINHDGNLVDAACIAVITGLCHFRRPEITVLGDEVTVHSIEE
> RVPVPLSVLHTPICVTFSFFEDGSLSAIDASLEEEELRTGSMTVTLNKNREICQIFKAGG
> VTIEASSVVACAHTAFQKTTSIISEIQRALDEDLSKKETQFFGGSAENQRS*
>
> I hope to precisely locate these two introns into the protein
> sequence(find their location among the amino acids). Please recommend a
> relatively convenient method. Thank you!
>
>
>
--
Jason Stajich
jason at bioperl.org
http://bioperl.org/wiki
More information about the Bioperl-l
mailing list