[Biojava-dev] biojava3 sequence tools
Andreas Prlic
andreas at sdsc.edu
Wed Aug 11 17:58:14 UTC 2010
Thanks, Scott,
here are similar utility methods in Java ...
Andreas
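// Characters treated as nucleotide letters (the check is case-sensitive).
protected static final String NUCLEOTIDE_LETTERS = "GCTAUX";

/** Returns the percentage (0-100, rounded down) of characters that are nucleotide letters. */
public static int percentNucleotideSequence(String sequence)
{
    if (sequence == null || sequence.length() == 0) return 0;

    int l = sequence.length();
    int n = 0;
    for (int i = 0; i < l; i++)
    {
        if (NUCLEOTIDE_LETTERS.indexOf(sequence.charAt(i)) < 0)
        {
            continue;
        }
        n++;
    }
    // Multiply as long so that very long sequences cannot overflow int.
    return (int) ((100L * n) / l);
}

/** Returns true only if every character is a nucleotide letter. */
public static boolean isNucleotideSequence(String sequence)
{
    if (sequence == null || sequence.length() == 0) return false;

    int l = sequence.length();
    for (int i = 0; i < l; i++)
    {
        if (NUCLEOTIDE_LETTERS.indexOf(sequence.charAt(i)) < 0)
        {
            return false;
        }
    }
    return true;
}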
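
A quick check of two edge cases for the methods above, as a minimal sketch. The class name AmbiguityDemo and the helper looksLikeNucleotides are hypothetical; the helper only repeats the logic of isNucleotideSequence so that the example is self-contained.

```java
class AmbiguityDemo {
    // Same letter set as NUCLEOTIDE_LETTERS in the methods above.
    static final String NUCLEOTIDE_LETTERS = "GCTAUX";

    // Hypothetical helper: same logic as isNucleotideSequence.
    static boolean looksLikeNucleotides(String sequence) {
        if (sequence == null || sequence.isEmpty()) return false;
        for (int i = 0; i < sequence.length(); i++) {
            if (NUCLEOTIDE_LETTERS.indexOf(sequence.charAt(i)) < 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // G, A, T and C are also the one-letter codes for Gly, Ala, Thr and Cys,
        // so this string is a valid (if unlikely) peptide as well as valid DNA.
        String peptide = "GATTACA";
        System.out.println(looksLikeNucleotides(peptide)); // prints "true"

        // The check is case-sensitive, so lowercase DNA is not recognised.
        String lower = "gattaca";
        System.out.println(looksLikeNucleotides(lower)); // prints "false"
    }
}
```

So a short peptide made only of G, A, T and C is reported as nucleotide, and lowercase input is rejected unless it is upper-cased first.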
On Wed, Aug 11, 2010 at 9:51 AM, Scott Markel <SMarkel at accelrys.com> wrote:
> Andreas,
>
> You might want to look at the _guess_alphabet subroutine in BioPerl's
> Bio::PrimarySeq module.
>
> Here's the core logic.
>
> my $u = ($str =~ tr/Uu//);
> # The assumption here is that most of sequences comprised of mainly
> # ATGC, with some N, will be 'dna' despite the fact that N could
> # also be Asparagine
> my $atgc = ($str =~ tr/ATGCNatgcn//);
>
> if( ($atgc / $total) > 0.85 ) {
>     $type = 'dna';
> } elsif( (($atgc + $u) / $total) > 0.85 ) {
>     $type = 'rna';
> } else {
>     $type = 'protein';
> }
>
> Scott
>
> Scott Markel, Ph.D.
> Principal Bioinformatics Architect email: smarkel at accelrys.com
> Accelrys (Pipeline Pilot R&D) mobile: +1 858 205 3653
> 10188 Telesis Court, Suite 100 voice: +1 858 799 5603
> San Diego, CA 92121 fax: +1 858 799 5222
> USA web: http://www.accelrys.com
>
> http://www.linkedin.com/in/smarkel
> Vice President, Board of Directors:
> International Society for Computational Biology
> Chair: ISCB Publications Committee
> Associate Editor: PLoS Computational Biology
> Editorial Board: Briefings in Bioinformatics
>
>
> -----Original Message-----
> From: biojava-dev-bounces at lists.open-bio.org [mailto:biojava-dev-bounces at lists.open-bio.org] On Behalf Of Andreas Prlic
> Sent: Wednesday, 11 August 2010 8:58 AM
> To: Andy Yates
> Cc: biojava-dev
> Subject: Re: [Biojava-dev] biojava3 sequence tools
>
> Thanks for the replies. I was trying to see how to improve a web form
> into which the user can paste any type of sequence, with the server
> selecting the correct version of BLAST to run... I will probably
> check what percentage of the sequence looks like nucleotides. It is
> unlikely that a longer protein sequence consists only of the letters
> A, T, C and G ...
>
> Andreas
>
>
> On Wed, Aug 11, 2010 at 1:26 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>> Building a Sequence object which can contain AminoAcidCompound or NucleotideCompound is easy, but the return type makes this incredibly hard, since we'd have to return Sequence<Compound>, which forces the user to start casting to a more useful type. Every auto-detector I've known gets it wrong, since they all apply arbitrary thresholds to decide the switch.
>>
>> However, if the need is there (and I'm sure it is for writing some interfaces), something can be knocked up quickly, I think.
>>
>> On 11 Aug 2010, at 05:46, Mark Schreiber wrote:
>>
>>> I think SeqIOTools had a method for this, possibly also available in
>>> RichSequence.IOTools.
>>>
>>> As Richard says, not guaranteed to work in all cases.
>>>
>>>
>>>
>>>
>>> On Wed, Aug 11, 2010 at 12:05 PM, Richard Holland <holland at eaglegenomics.com> wrote:
>>>
>>>> You mean an auto-detector that takes a String input, guesses based on
>>>> content what it is, and returns a Sequence object of the appropriate type,
>>>> being Protein or DNA etc.? Not that I know of. A bit hard too - if all the
>>>> letters in the String are a valid subset from two or more alphabets (e.g.
>>>> ATCG are all in the Protein alphabet as well as being DNA), how do we know
>>>> which one it is?
>>>>
>>>> On 11 Aug 2010, at 03:24, Andreas Prlic wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> just wondering if we already have a class that can accept any protein
>>>>> or DNA sequence as input and return a Sequence object of the
>>>>> correct type?
>>>>>
>>>>> Andreas
>>>>> _______________________________________________
>>>>> biojava-dev mailing list
>>>>> biojava-dev at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>>
>>>> --
>>>> Richard Holland, BSc MBCS
>>>> Operations and Delivery Director, Eagle Genomics Ltd
>>>> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
>>>> http://www.eaglegenomics.com/
>>>>
>>>>
>>
>> --
>> Andrew Yates Ensembl Genomes Engineer
>> EMBL-EBI Tel: +44-(0)1223-492538
>> Wellcome Trust Genome Campus Fax: +44-(0)1223-494468
>> Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/
>>
>>
>>
>>
>>
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>
>
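
For reference, the BioPerl heuristic quoted above could be ported to Java roughly as follows. This is only a sketch under several assumptions: the class, enum and method names are made up for illustration and are not BioJava API; the 0.85 cutoff is taken from the Perl snippet and is as arbitrary there as here; the total is assumed to be the full string length, so the caller would need to strip whitespace or FASTA headers first; and mapping the result to blastn or blastp is a guess at how the web form would use it.

```java
import java.util.Locale;

class SequenceTypeGuesser {

    /** Hypothetical result type; not part of BioJava. */
    public enum SeqType { DNA, RNA, PROTEIN }

    // Cutoff borrowed from the BioPerl snippet; an arbitrary heuristic.
    private static final double THRESHOLD = 0.85;

    public static SeqType guess(String sequence) {
        if (sequence == null || sequence.isEmpty()) {
            throw new IllegalArgumentException("sequence must not be empty");
        }
        // Upper-case once so that pasted lowercase sequences are handled.
        String s = sequence.toUpperCase(Locale.ROOT);
        int atgcn = 0;
        int u = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if ("ATGCN".indexOf(c) >= 0) {
                atgcn++;
            } else if (c == 'U') {
                u++;
            }
        }
        double total = s.length();
        if (atgcn / total > THRESHOLD) return SeqType.DNA;
        if ((atgcn + u) / total > THRESHOLD) return SeqType.RNA;
        return SeqType.PROTEIN;
    }

    /** Picks a BLAST program name for the guessed type (assumed mapping). */
    public static String blastProgramFor(String sequence) {
        return guess(sequence) == SeqType.PROTEIN ? "blastp" : "blastn";
    }

    public static void main(String[] args) {
        System.out.println(blastProgramFor("atgcgtacgttagc"));  // prints "blastn"
        System.out.println(blastProgramFor("MKVLAAGIVGLLLAQ")); // prints "blastp"
    }
}
```

As noted earlier in the thread, any such threshold will misclassify some inputs, so a web form would probably still want to let the user override the guess.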