[Bioperl-l] problem to fit genomic coordinates

Thu Mar 26 08:31:40 UTC 2009

Yes, but this is a school problem that my teacher asked us to solve
without using BioPerl modules!

I have written a piece of code in awk, but it takes too much time to
perform the task:

#!/usr/bin/awk -f
#usage: myprog.awk file1.txt
#         file2.txt                       file1.txt
#   CDS 3760 3913 + AT1G01010       acc_1762592 24 89 112 -
#   exon 3631 3913 + AT1G01010      acc_2739797 24 304 327 -
#   CDS 3996 4276 + AT1G01010       acc_1955650 18 308 325 -
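BEGIN {
        # Load the annotation table (file2.txt, sorted by start) into arrays.
        while ((getline < "file2.txt") > 0) {
                cpt++
                descr[cpt]  = $1
                start[cpt]  = $2 + 0    # "+ 0" forces a numeric comparison
                end[cpt]    = $3 + 0
                strand[cpt] = $4
                tair[cpt]   = $5
        }
        close("file2.txt")
}
{
        # For each line of file1.txt ($3 = from, $4 = to), print every
        # annotation that contains the range. j restarts at 1 on every line.
        for (j = 1; j <= cpt && start[j] <= $3; j++) {
                if (end[j] >= $4)
                        print "from="$3, "to="$4, "start="start[j], "end="end[j], "j="j
        }
}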
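
A single-pass alternative might look like the sketch below. It is written in Python rather than awk, and it has not been run on the full files. The function name `annotate` and the tuple layouts are made up for illustration; the sample rows are taken from the file listings quoted further down in this thread. It assumes both inputs can be sorted by start coordinate, and it keeps the same containment test as the awk script (annotation start <= read start and annotation end >= read end). The difference is that the annotation pointer `j` never moves backwards, so the annotations are not rescanned from the top for every read.

```python
def annotate(annotations, reads):
    """Yield (read, annotation) pairs where the annotation contains the read.

    annotations: list of (kind, start, end), sorted by start
    reads:       list of (name, start, end, strand), sorted by start
    Both layouts are assumptions made for this sketch.
    """
    active = []  # annotations already started that may still cover a read
    j = 0        # next annotation not yet activated; only ever increases
    for read in reads:
        _, r_start, r_end, _ = read
        # Activate every annotation that starts at or before this read.
        while j < len(annotations) and annotations[j][1] <= r_start:
            active.append(annotations[j])
            j += 1
        # Drop annotations that end before this read starts. Reads are
        # sorted by start, so these cannot contain any later read either.
        active = [a for a in active if a[2] >= r_start]
        for ann in active:
            if ann[2] >= r_end:
                yield read, ann


# Sample rows copied from the two listings in the original post.
annotations = [
    ("five_prime_UTR", 3631, 3759),
    ("CDS", 3760, 3913),
    ("exon", 3631, 3913),
    ("CDS", 3996, 4276),
    ("exon", 3996, 4276),
]
reads = [
    ("acc_362358", 3640, 3656, "-"),
    ("acc_3279485", 3793, 3813, "+"),
    ("acc_1987802", 3922, 3938, "-"),
    ("acc_1679660", 4113, 4129, "+"),
]

# The annotation listing is not strictly ordered by start, so sort first.
annotations.sort(key=lambda a: a[1])
reads.sort(key=lambda r: r[1])

hits = [(read[0], ann[0]) for read, ann in annotate(annotations, reads)]
for name, kind in hits:
    print(name, kind)
```

The cost per read depends on how many annotations overlap the current position, which should stay small for gene features, so the total work should be close to linear in the size of the two files plus the number of hits reported.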

Chris Fields wrote:
> Laurent,
>
> All BioPerl modules, including Bio::SimpleAlign, have documentation 
> via 'perldoc'; you should have a look at that for specific examples.  
> Myself, I recommend using Bio::DB::SeqFeature::Store (or another 
> Bio::SeqFeature::CollectionI) for this.
>
> chris
>
> On Mar 25, 2009, at 3:09 PM, Laurent Manchon wrote:
>
>> -- Yes, perhaps,
>> but I don't know how to use a Bio::SimpleAlign object to solve my
>> problem, which is a pity for me,
>> so I will keep looking for another way, using procedural programming.
>>
>> Thank you --
>>
>> Kevin Brown wrote:
>>> Please keep all replies on list.
>>> Doing it with SimpleAlign gets rid of the incrementing problem and
>>> reduces the number of loop iterations you'll have to do. Based on
>>> your sample data, a lot of your IDs share the same location
>>> information, and the first file also contains overlapping features.
>>> So you'll still need to decide which item is the one you really
>>> want (e.g. CDS vs. exon).
>>>
>>>
>>> ________________________________
>>>
>>>     From: Laurent MANCHON [mailto:lmanchon at univ-montp2.fr]
>>>     Sent: Wednesday, March 25, 2009 9:44 AM
>>>     To: Kevin Brown
>>>     Subject: Re: [Bioperl-l] problem to fit genomic coordinates
>>>     
>>>     
>>>     Okay, but I don't think this method is an easy way to do it.
>>>     The files are already sorted on the numeric columns, so maybe
>>>     another logical method exists that does not use BioPerl libraries,
>>>     for example a while loop,
>>>     
>>>     something like:
>>>     
>>>     $i = $j = 1;
>>>     # $idx = number of lines in file1
>>>     # $cpt = number of lines in file2
>>>     while ($i <= $idx && $j <= $cpt) {
>>>         # compare current elements
>>>         # increment either $i or $j, depending on which segment comes
>>>         # before the other
>>>     }
>>>     The difficulty is deciding when to increment $i or $j inside
>>>     the loop.
>>>     
>>>     Laurent --
>>>     
>>>     Kevin Brown wrote:
>>>         Read in first file and create a Bio::SimpleAlign object
>>>        
>>>         Then use the slice method to find the features that are between
>>>         the start/end values of your second file
>>>        
>>>         =head2 slice
>>>        
>>>          Title     : slice
>>>          Usage     : $aln2 = $aln->slice(20,30)
>>>          Function  : Creates a slice from the alignment inclusive of start and
>>>                      end columns, and the first column in the alignment is
>>>                      denoted 1. Sequences with no residues in the slice are
>>>                      excluded from the new alignment and a warning is printed.
>>>                      Slice beyond the length of the sequence does not do padding.
>>>          Returns   : A Bio::SimpleAlign object
>>>          Args      : Positive integer for start column, positive integer for end
>>>                      column, optional boolean which if true will keep gap-only
>>>                      columns in the newly created slice. Example:
>>>        
>>>                      $aln2 = $aln->slice(20,30,1)
>>>        
>>>         =cut
>>>        
>>>             -----Original Message-----
>>>             From: bioperl-l-bounces at lists.open-bio.org
>>>             [mailto:bioperl-l-bounces at lists.open-bio.org]
>>>             On Behalf Of Laurent MANCHON
>>>             Sent: Wednesday, March 25, 2009 7:57 AM
>>>             To: bioperl-l at lists.open-bio.org
>>>             Subject: [Bioperl-l] problem to fit genomic coordinates
>>>            
>>>             This is my problem:
>>>             how can I match ranges of genomic coordinates stored in two
>>>             distinct files?
>>>            
>>>             The first file (file1.txt) is my annotation file, with this
>>>             format:
>>>            
>>>             regulatory_region 3455 3463
>>>             regulatory_region 3535 3544
>>>             regulatory_region 3601 3608
>>>             transcriptional_cis_regulatory_region 3622 3630
>>>             five_prime_UTR 3631 3759
>>>             CDS 3760 3913
>>>             exon 3631 3913
>>>             CDS 3996 4276
>>>             exon 3996 4276
>>>             CDS 4486 4605
>>>             exon 4486 4605
>>>             CDS 4706 5095
>>>             exon 4706 5095
>>>             CDS 5174 5326
>>>             exon 5174 5326
>>>             ....
>>>             ....
>>>            
>>>             The second file (file2.txt) is my experimental file, with
>>>             this format:
>>>            
>>>             acc_2765773 3222 3239 -
>>>             acc_2842543 3222 3239 -
>>>             acc_2842544 3222 3239 -
>>>             acc_442945 3222 3239 -
>>>             acc_442946 3222 3239 -
>>>             acc_4873 3222 3239 -
>>>             acc_53956 3222 3239 -
>>>             acc_562588 3222 3239 -
>>>             acc_807114 3222 3239 -
>>>             acc_84146 3222 3239 -
>>>             acc_2419732 3268 3285 +
>>>             acc_3041065 3565 3583 +
>>>             acc_362358 3640 3656 -
>>>             acc_3279485 3793 3813 +
>>>             acc_3091017 3794 3811 -
>>>             acc_2807380 3832 3848 +
>>>             acc_3105138 3832 3848 +
>>>             acc_3105139 3832 3848 +
>>>             acc_3105140 3832 3848 +
>>>             acc_3116450 3832 3848 +
>>>             acc_86708 3832 3848 +
>>>             acc_1987802 3922 3938 -
>>>             acc_1679660 4113 4129 +
>>>             acc_891489 4113 4129 +
>>>             acc_2829973 4299 4318 +
>>>             ....
>>>             ....
>>>            
>>>            
>>>             number of lines in file1.txt ~ 150000
>>>             number of lines in file2.txt ~ 800000
>>>            
>>>             So, how can I annotate my file2 using the genomic coordinates
>>>             stored in file1? Do I need to compare each range in file2 with
>>>             each range in file1, i.e. 800000 x 150000 combinations (a
>>>             quadratic analysis)?
>>>             I'm looking for a fast method to do that, something that
>>>             scales linearly.
>>>            
>>>             Thank you so much if you have any ideas to help me.
>>>            
>>>             Laurent --
>>>             _______________________________________________
>>>             Bioperl-l mailing list
>>>             Bioperl-l at lists.open-bio.org
>>>             http://lists.open-bio.org/mailman/listinfo/bioperl-l