[Biojava-l] Extract non-gene regions
Florian Schatz
mail at florianschatz.de
Thu Apr 24 12:09:24 UTC 2008
Hello,
I tried that, but it is as slow as a version operating on Strings.
However, I created a Cookbook entry:
http://biojava.org/wiki/BioJava:Cookbook:Sequence:ExtractGeneRegions
Is there a better way to get a Sequence from a SymbolList than:
Sequence newsequence = DNATools.createDNASequence(symbolL.seqString(), "New Sequence");
Best,
Florian
On 24.04.2008 at 04:29, Mark Schreiber wrote:
> Hi Florian -
>
> There are at least two approaches. You are on the right track with
> making a union of all gene locations. The compound location that
> results from the Union will contain all the nucleotides that are
> coding. You can then iterate through each nucleotide in the genome and
> find out if the union contains the nucleotide. If it doesn't then it
> is non coding. This is surprisingly rapid as the comparisons are
> simple. The pseudo code would be something like...
>
> // Initialize this by making a union of all locations of CDS or gene features.
> RichLocation coding;
>
> RichSequence genome; // read from file or database
>
> // You might need to be a bit more sophisticated for a circular genome.
> for (int i = 1; i <= genome.length(); i++) {
>     if (!coding.contains(i)) {
>         // you have a non-coding nucleotide
>     }
> }
>
> The other approach is to use the blockIterator() method of the
> compound location that results from the union of coding sequences.
> This will output each contiguous chunk of coding sequence. If you know
> the length of the sequence then you can rapidly figure out the
> intervening pieces.
>
> For example, if the block iterator tells you that [10..50], [90..100],
> [350..380] are coding and you know the genome is of length 400 then
> you can quickly derive [1..9], [51..89], [101..349] and [381..400] are
> non-coding. Again it is more complicated for circular sequences and
> more complex if you consider the opposite strand of a gene (the gene
> shadow) to be non-coding. Unfortunately there is no convenience method
> to do this but if you code something up it would be great to put it in
> the cookbook so others can re-use it.
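
The interval arithmetic in the example above can be sketched in plain Java. This is only an illustration and not BioJava API: the class name NonCodingBlocks, the method complement() and the int[]{start, end} representation are assumptions made for this sketch. It assumes a linear sequence, 1-based inclusive coordinates, and coding blocks that have already been collected from blockIterator() and sorted by start position.

```java
import java.util.ArrayList;
import java.util.List;

public class NonCodingBlocks {

    /**
     * Returns the gaps between coding blocks on a linear sequence.
     * Blocks are {start, end} pairs, 1-based and inclusive, sorted by start.
     */
    public static List<int[]> complement(List<int[]> codingBlocks, int sequenceLength) {
        List<int[]> gaps = new ArrayList<>();
        int next = 1; // first position not yet known to be coding
        for (int[] block : codingBlocks) {
            if (block[0] > next) {
                gaps.add(new int[] {next, block[0] - 1});
            }
            // Math.max keeps this correct if two blocks happen to overlap.
            next = Math.max(next, block[1] + 1);
        }
        if (next <= sequenceLength) {
            gaps.add(new int[] {next, sequenceLength});
        }
        return gaps;
    }

    public static void main(String[] args) {
        List<int[]> coding = new ArrayList<>();
        coding.add(new int[] {10, 50});
        coding.add(new int[] {90, 100});
        coding.add(new int[] {350, 380});
        for (int[] gap : complement(coding, 400)) {
            System.out.println(gap[0] + ".." + gap[1]);
        }
        // prints 1..9, 51..89, 101..349 and 381..400, one per line
    }
}
```

Each resulting pair could then be turned into a location on the real sequence; circular genomes and strand handling are left out here.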
>
> - Mark
>
> You could actually make point locations of all the non-coding
> nucleotides and then merge the whole lot at the end into a compound
> location of non-coding regions.
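
The per-nucleotide scan combined with the merging idea might look roughly like the following plain Java sketch. It does not use BioJava: the IntPredicate isCoding is a stand-in for a call such as coding.contains(i), and the class and method names are hypothetical. Instead of creating one point per nucleotide and merging afterwards, it merges consecutive non-coding positions into ranges while scanning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class NonCodingScan {

    /**
     * Scans positions 1..length and merges runs of consecutive non-coding
     * positions into {start, end} ranges (1-based, inclusive).
     */
    public static List<int[]> nonCodingRanges(int length, IntPredicate isCoding) {
        List<int[]> ranges = new ArrayList<>();
        int start = -1; // start of the currently open non-coding run, or -1 if none
        for (int i = 1; i <= length; i++) {
            if (!isCoding.test(i)) {
                if (start < 0) {
                    start = i; // open a new run
                }
            } else if (start > 0) {
                ranges.add(new int[] {start, i - 1}); // close the run
                start = -1;
            }
        }
        if (start > 0) {
            ranges.add(new int[] {start, length}); // run reaches the end
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Toy predicate: positions 3..5 and 8 are "coding".
        IntPredicate isCoding = i -> (i >= 3 && i <= 5) || i == 8;
        for (int[] r : nonCodingRanges(10, isCoding)) {
            System.out.println(r[0] + ".." + r[1]);
        }
        // prints 1..2, 6..7 and 9..10, one per line
    }
}
```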
>
> On Wed, Apr 23, 2008 at 9:49 PM, Florian Schatz
> <mail at florianschatz.de> wrote:
>> Hello,
>>
>> I am new to BioJava and have worked with it a lot in the last few weeks.
>> I hope this is the right place for questions; if not, please tell me.
>>
>> I want to get the nucleotide sequence outside the genes of a GenBank file,
>> so everything that is not marked by a 'gene' feature. Unfortunately, there
>> is no subtract or exclude function for the Location class. Any hints?
>>
>> Btw: union() of locations worked fine for extracting the nucleotides of
>> the genes only.
>>
>> Best,
>> Florian
>> _______________________________________________
>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>