[Bioperl-l] problem to fit genomic coordinates

Wed Mar 25 20:09:21 UTC 2009

-- Yes, perhaps,
but I don't know how to use a Bio::SimpleAlign object to solve my
problem, which is a pity.
So I will keep looking for another way, using procedural programming.

thank you --
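
One possible shape for the procedural approach is a single sweep over both lists, sketched here in plain Perl without BioPerl. This is only a sketch under assumptions: the helper name `annotate_reads`, the array-of-arrays layout and the inline sample rows are illustrative, coordinates are treated as closed intervals on a single sequence, and strand is ignored. Both lists are sorted by start first, because the sample file1 is not strictly ordered (exon 3631 is listed after CDS 3760).

```perl
use strict;
use warnings;

# Hypothetical helper: for each read, list the annotation types overlapping it.
# Each annotation is [type, start, end]; each read is [id, start, end, strand].
sub annotate_reads {
    my ($annotations, $reads) = @_;

    # Sort both lists by start coordinate.
    my @ann = sort { $a->[1] <=> $b->[1] } @$annotations;
    my @rds = sort { $a->[1] <=> $b->[1] } @$reads;

    my @active;     # annotations that may still overlap this or a later read
    my $next = 0;   # index of the first annotation not yet moved into @active
    my @result;

    for my $read (@rds) {
        my ($id, $start, $end) = @$read;

        # Pull in every annotation that starts at or before this read's end.
        while ($next < @ann && $ann[$next][1] <= $end) {
            push @active, $ann[$next++];
        }

        # Drop annotations that end before this read starts: later reads
        # start no earlier, so these can never overlap again.
        @active = grep { $_->[2] >= $start } @active;

        # What remains ends at or after $start; keep those that also start
        # at or before $end (closed-interval overlap).
        my @hits = map { $_->[0] } grep { $_->[1] <= $end } @active;
        push @result, [ $id, @hits ];
    }
    return \@result;
}

# Illustrative rows copied from the sample data in this thread.
my @annotations = (
    [ 'five_prime_UTR', 3631, 3759 ],
    [ 'CDS',            3760, 3913 ],
    [ 'exon',           3631, 3913 ],
    [ 'CDS',            3996, 4276 ],
    [ 'exon',           3996, 4276 ],
);
my @reads = (
    [ 'acc_3041065', 3565, 3583, '+' ],
    [ 'acc_362358',  3640, 3656, '-' ],
    [ 'acc_3279485', 3793, 3813, '+' ],
    [ 'acc_1987802', 3922, 3938, '-' ],
);

for my $row (@{ annotate_reads(\@annotations, \@reads) }) {
    my ($id, @hits) = @$row;
    print join("\t", $id, @hits ? join(',', @hits) : 'none'), "\n";
}
```

After the two sorts, the sweep visits each line once, so the cost should stay close to linear as long as only a few annotations overlap any given position. Choosing between overlapping features (e.g. CDS vs. exon) is left to the caller.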

Kevin Brown wrote:
> Please keep all replies on list.
>  
> Doing it with SimpleAlign removes the problem of deciding which index to increment and reduces the number of loop iterations you will have to do. Based on your sample data, many of the IDs in your second file share the same location, and the first file contains overlapping features. So you will still need to decide which item you really want (e.g. CDS vs. exon).
>
>
> ________________________________
>
> 	From: Laurent MANCHON [mailto:lmanchon at univ-montp2.fr] 
> 	Sent: Wednesday, March 25, 2009 9:44 AM
> 	To: Kevin Brown
> 	Subject: Re: [Bioperl-l] problem to fit genomic coordinates
> 	
> 	
> 	Okay, but I don't think this method is an easy way to do it.
> 	The files are already sorted on the numeric columns, so maybe there is another logical method
> 	that does not use the BioPerl libraries, for example a while loop,
> 	
> 	something like:
> 	
> 	$i = $j = 1;
> 	$idx = number of lines in file1
> 	$cpt = number of lines in file2
> 	while ($i <= $idx && $j <= $cpt) {
> 	 #compare current elements
> 	 #increment either $i or $j depending which segment comes before the other
> 	}
> 	The difficulty is deciding when to increment $i or $j inside the loop.
> 	
> 	Laurent --
> 	
> 	Kevin Brown wrote:
>
> 		Read in first file and create a Bio::SimpleAlign object
> 		
> 		Then use the slice method to find the features that are between the
> 		start/end values of your second file
> 		
> 		=head2 slice
> 		
> 		 Title     : slice
> 		 Usage     : $aln2 = $aln->slice(20,30)
> 		 Function  : Creates a slice from the alignment inclusive of start and
> 		             end columns, and the first column in the alignment is
> 		             denoted 1. Sequences with no residues in the slice are
> 		             excluded from the new alignment and a warning is printed.
> 		             Slice beyond the length of the sequence does not do padding.
> 		 Returns   : A Bio::SimpleAlign object
> 		 Args      : Positive integer for start column, positive integer for end
> 		             column, optional boolean which if true will keep gap-only
> 		             columns in the newly created slice. Example:
> 		
> 		             $aln2 = $aln->slice(20,30,1)
> 		
> 		=cut 
> 		
> 		  
>
> 			-----Original Message-----
> 			From: bioperl-l-bounces at lists.open-bio.org 
> 			[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of 
> 			Laurent MANCHON
> 			Sent: Wednesday, March 25, 2009 7:57 AM
> 			To: bioperl-l at lists.open-bio.org
> 			Subject: [Bioperl-l] problem to fit genomic coordinates
> 			
> 			Here is my problem:
> 			how can I match ranges of genomic coordinates stored in two
> 			distinct files?
> 			
> 			The first file (file1.txt) is my annotation file, in this format:
> 			
> 			regulatory_region 3455 3463
> 			regulatory_region 3535 3544
> 			regulatory_region 3601 3608
> 			transcriptional_cis_regulatory_region 3622 3630
> 			five_prime_UTR 3631 3759
> 			CDS 3760 3913
> 			exon 3631 3913
> 			CDS 3996 4276
> 			exon 3996 4276
> 			CDS 4486 4605
> 			exon 4486 4605
> 			CDS 4706 5095
> 			exon 4706 5095
> 			CDS 5174 5326
> 			exon 5174 5326
> 			....
> 			....
> 			
> 			The second file (file2.txt) is my experimental file, in this format:
> 			
> 			acc_2765773 3222 3239 -
> 			acc_2842543 3222 3239 -
> 			acc_2842544 3222 3239 -
> 			acc_442945 3222 3239 -
> 			acc_442946 3222 3239 -
> 			acc_4873 3222 3239 -
> 			acc_53956 3222 3239 -
> 			acc_562588 3222 3239 -
> 			acc_807114 3222 3239 -
> 			acc_84146 3222 3239 -
> 			acc_2419732 3268 3285 +
> 			acc_3041065 3565 3583 +
> 			acc_362358 3640 3656 -
> 			acc_3279485 3793 3813 +
> 			acc_3091017 3794 3811 -
> 			acc_2807380 3832 3848 +
> 			acc_3105138 3832 3848 +
> 			acc_3105139 3832 3848 +
> 			acc_3105140 3832 3848 +
> 			acc_3116450 3832 3848 +
> 			acc_86708 3832 3848 +
> 			acc_1987802 3922 3938 -
> 			acc_1679660 4113 4129 +
> 			acc_891489 4113 4129 +
> 			acc_2829973 4299 4318 +
> 			....
> 			....
> 			
> 			
> 			number of lines in file1.txt ~ 150000
> 			number of lines in file2.txt ~ 800000
> 			
> 			So, how can I annotate file2 using the genomic coordinates stored in
> 			file1? Do I need to compare each range in file2 with each range in
> 			file1, i.e. 800000 x 150000 combinations (a quadratic
> 			analysis)?
> 			I'm looking for a fast method to do that, something that progresses
> 			linearly through the data.
> 			
> 			Thank you very much for any ideas that could help me.
> 			
> 			Laurent --
> 			_______________________________________________
> 			Bioperl-l mailing list
> 			Bioperl-l at lists.open-bio.org
> 			http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 			
> 			    
>
> 		