[Biojava-dev] biojava3 sequence tools

Scott Markel SMarkel at accelrys.com
Wed Aug 11 16:51:09 UTC 2010


Andreas,

You might want to look at the _guess_alphabet subroutine in BioPerl's
Bio::PrimarySeq module.

Here's the core logic.

   my $u = ($str =~ tr/Uu//);
   # The assumption here is that most of sequences comprised of mainly
   # ATGC, with some N, will be 'dna' despite the fact that N could
   # also be Asparagine
   my $atgc = ($str =~ tr/ATGCNatgcn//);

   if( ($atgc / $total) > 0.85 ) {
       $type = 'dna';
   } elsif( (($atgc + $u) / $total) > 0.85 ) {
       $type = 'rna';
   } else {
       $type = 'protein';
   }
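
For the Java side, the same heuristic could be sketched roughly as below. This is only an illustration, not BioJava or BioPerl code: the class and method names are made up, and it assumes that $total in the Perl snippet is the length of the sequence string.

```java
/** Hypothetical helper; not part of BioJava or BioPerl. */
final class SequenceTypeGuesser {

    // Same cutoff as in the Perl snippet.
    private static final double THRESHOLD = 0.85;

    /** Returns "dna", "rna" or "protein" for a non-empty sequence string. */
    static String guess(String seq) {
        // Assumption: $total in the Perl code is the sequence length.
        int total = seq.length();
        if (total == 0) {
            throw new IllegalArgumentException("empty sequence");
        }
        int atgc = 0; // A, T, G, C or N, either case
        int u = 0;    // U, either case
        for (int i = 0; i < total; i++) {
            char c = Character.toUpperCase(seq.charAt(i));
            if (c == 'U') {
                u++;
            } else if ("ATGCN".indexOf(c) >= 0) {
                atgc++;
            }
        }
        // Check DNA first, then RNA, exactly as the Perl branches do.
        if ((double) atgc / total > THRESHOLD) {
            return "dna";
        } else if ((double) (atgc + u) / total > THRESHOLD) {
            return "rna";
        }
        return "protein";
    }
}
```

Calling SequenceTypeGuesser.guess("AUGCAUGCAU") falls through the DNA test (7 of 10 characters) and is caught by the RNA test once the three U's are added in.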

Scott

Scott Markel, Ph.D.
Principal Bioinformatics Architect  email:  smarkel at accelrys.com
Accelrys (Pipeline Pilot R&D)       mobile: +1 858 205 3653
10188 Telesis Court, Suite 100      voice:  +1 858 799 5603
San Diego, CA 92121                 fax:    +1 858 799 5222
USA                                 web:    http://www.accelrys.com

http://www.linkedin.com/in/smarkel
Vice President, Board of Directors:
    International Society for Computational Biology
Chair: ISCB Publications Committee
Associate Editor: PLoS Computational Biology
Editorial Board: Briefings in Bioinformatics


-----Original Message-----
From: biojava-dev-bounces at lists.open-bio.org [mailto:biojava-dev-bounces at lists.open-bio.org] On Behalf Of Andreas Prlic
Sent: Wednesday, 11 August 2010 8:58 AM
To: Andy Yates
Cc: biojava-dev
Subject: Re: [Biojava-dev] biojava3 sequence tools

Thanks for the replies. I was trying to see how to improve a web form
into which the user can paste any type of sequence and the server
selects the correct version of BLAST to run... I will probably check
what percentage of the sequence looks like nucleotides. It is unlikely
that a longer protein sequence consists only of A, T, C and G...
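
A rough sketch of what such a check might look like for the web form follows. Everything in it is an assumption for illustration: the class and method names, the 0.85 cutoff, and the choice to ignore non-letter characters (whitespace, digits) that tend to come along with pasted input. Only the program names blastn and blastp are real.

```java
/** Hypothetical form helper; names and cutoff are illustrative only. */
final class BlastProgramChooser {

    // Assumed cutoff; it would need tuning against real submissions.
    private static final double NUCLEOTIDE_FRACTION = 0.85;

    /** Fraction of letters in the input that are A, C, G, T, U or N. */
    static double nucleotideFraction(String seq) {
        int letters = 0;
        int nucleotides = 0;
        for (char raw : seq.toCharArray()) {
            if (!Character.isLetter(raw)) {
                continue; // skip whitespace, digits and other pasted debris
            }
            letters++;
            if ("ACGTUN".indexOf(Character.toUpperCase(raw)) >= 0) {
                nucleotides++;
            }
        }
        return letters == 0 ? 0.0 : (double) nucleotides / letters;
    }

    /** Picks "blastn" for nucleotide-looking input, otherwise "blastp". */
    static String choose(String seq) {
        return nucleotideFraction(seq) > NUCLEOTIDE_FRACTION ? "blastn" : "blastp";
    }
}
```

A short peptide such as GATTACA would still be sent to blastn by this check, so it is only reasonable for longer inputs.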

Andreas


On Wed, Aug 11, 2010 at 1:26 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
> Building a Sequence object which can contain AminoAcidCompound or NucleotideCompound is easy; the return type makes this incredibly hard, since we'd have to return Sequence<Compound>, which forces the user to start casting to a more useful type. Every auto-detector I've known gets it wrong, since they all apply arbitrary thresholds to decide the switch.
>
> However, if the need is there (which I'm sure it is for writing some interfaces), something can be knocked up quickly, I think.
>
> On 11 Aug 2010, at 05:46, Mark Schreiber wrote:
>
>> I think SeqIOTools had a method for this, possibly also available in
>> RichSequence.IOTools.
>>
>> As Richard says, not guaranteed to work in all cases.
>>
>> On Wed, Aug 11, 2010 at 12:05 PM, Richard Holland <holland at eaglegenomics.com> wrote:
>>
>>> You mean an auto-detector that takes a String input, guesses based on
>>> content what it is, and returns a Sequence object of the appropriate type,
>>> being Protein or DNA etc.? Not that I know of. A bit hard too - if all the
>>> letters in the String are a valid subset from two or more alphabets (e.g.
>>> ATCG are all in the Protein alphabet as well as being DNA), how do we know
>>> which one it is?
>>>
>>> On 11 Aug 2010, at 03:24, Andreas Prlic wrote:
>>>
>>>> Hi,
>>>>
>>>> Just wondering if we already have a class that can accept any protein
>>>> or DNA sequence as input and return a Sequence object of the
>>>> correct type?
>>>>
>>>> Andreas
>>>> _______________________________________________
>>>> biojava-dev mailing list
>>>> biojava-dev at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>
>>> --
>>> Richard Holland, BSc MBCS
>>> Operations and Delivery Director, Eagle Genomics Ltd
>>> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
>>> http://www.eaglegenomics.com/
>>>
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>



-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------
