[Bioperl-l] problem to fit genomic coordinates

Wed Mar 25 18:06:41 EDT 2009

Laurent,

All BioPerl modules, including Bio::SimpleAlign, are documented via
'perldoc' (e.g. 'perldoc Bio::SimpleAlign'); have a look there for
specific examples. Personally, I recommend using
Bio::DB::SeqFeature::Store (or another Bio::SeqFeature::CollectionI
implementation) for this.
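
If you would rather stay with plain Perl, the sorted two-file comparison
discussed further down the thread can be sketched roughly as follows.
This is only an illustration under stated assumptions, not BioPerl API:
the sub name annotate_reads and the array layouts are made up for the
example, all coordinates are assumed to lie on a single sequence, and
both files are assumed to be already parsed into memory.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: report every annotation overlapping each read.
# $annots: array ref of [type, start, end]
# $reads : array ref of [id, start, end, strand]
sub annotate_reads {
    my ($annots, $reads) = @_;

    # Sort both lists by start so the annotation pointer only moves forward.
    my @ann    = sort { $a->[1] <=> $b->[1] } @$annots;
    my @sorted = sort { $a->[1] <=> $b->[1] } @$reads;

    my @active;      # annotations that may still overlap this or a later read
    my $next = 0;    # index of the next annotation not yet pulled in
    my @result;

    for my $read (@sorted) {
        my ($id, $rstart, $rend) = @$read;

        # Pull in every annotation starting at or before the end of this read.
        push @active, $ann[$next++]
            while $next < @ann && $ann[$next][1] <= $rend;

        # Drop annotations ending before this read starts; reads are sorted
        # by start, so no later read can reach them either.
        @active = grep { $_->[2] >= $rstart } @active;

        # An earlier, longer read may have pulled in annotations that start
        # after this read ends, so check the start again before reporting.
        my @hits = map { $_->[0] } grep { $_->[1] <= $rend } @active;
        push @result, [$id, @hits ? join(',', @hits) : 'none'];
    }
    return \@result;
}

# A few lines taken from the sample data in the original post.
my @annots = (
    ['five_prime_UTR', 3631, 3759],
    ['CDS',            3760, 3913],
    ['exon',           3631, 3913],
    ['CDS',            3996, 4276],
    ['exon',           3996, 4276],
);
my @reads = (
    ['acc_2419732', 3268, 3285, '+'],
    ['acc_362358',  3640, 3656, '-'],
    ['acc_3279485', 3793, 3813, '+'],
    ['acc_1679660', 4113, 4129, '+'],
);
print join("\t", @$_), "\n" for @{ annotate_reads(\@annots, \@reads) };
```

Each annotation is pulled in once, so after the two sorts the cost per
read is the size of the active list, which should stay small unless many
annotations are nested. Reads that hit both a CDS and its exon are
reported with both, so you still have to decide which one you want.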

chris

On Mar 25, 2009, at 3:09 PM, Laurent Manchon wrote:

> -- Yes, perhaps,
> but I don't know how to use a Bio::SimpleAlign object to solve my
> problem, which is a pity,
> so I will keep looking for another way, using procedural
> programming.
>
> thank you --
>
> Kevin Brown wrote:
>> Please keep all replies on list.
>> Doing it with SimpleAlign gets rid of the incrementing problem and
>> reduces the number of loop iterations you'll have to do. Based on
>> your sample data, a lot of IDs in the second file share the same
>> location, and the first file also contains overlapping features. So
>> you'll still need to decide which item is the one you really want
>> (e.g. CDS vs. exon).
>>
>>
>> ________________________________
>>
>> 	From: Laurent MANCHON [mailto:lmanchon at univ-montp2.fr]
>> 	Sent: Wednesday, March 25, 2009 9:44 AM
>> 	To: Kevin Brown
>> 	Subject: Re: [Bioperl-l] problem to fit genomic coordinates
>> 	
>> 	
>> 	Okay, but I don't think this method is an easy way to do it.
>> 	The files are already sorted on the numeric columns, so maybe
>> 	another logical method exists that doesn't use the BioPerl
>> 	libraries, for example a while loop,
>> 	
>> 	something like:
>> 	
>> 	$i = $j = 1;
>> 	# $idx = number of lines in file1
>> 	# $cpt = number of lines in file2
>> 	while ($i <= $idx && $j <= $cpt) {
>> 	    # compare the current elements
>> 	    # increment either $i or $j, depending on which segment comes
>> 	    # before the other
>> 	}
>> 	The difficulty is deciding when to increment $i or $j inside the
>> 	loop.
>> 	
>> 	Laurent --
>> 	
>> 	Kevin Brown wrote:
>> 		Read in first file and create a Bio::SimpleAlign object
>> 		
>> 		Then use the slice method to find the features that are between the
>> 		start/end values of your second file
>> 		
>> 		=head2 slice
>> 		
>> 		 Title     : slice
>> 		 Usage     : $aln2 = $aln->slice(20,30)
>> 		 Function  : Creates a slice from the alignment inclusive of start and
>> 		             end columns, and the first column in the alignment is
>> 		             denoted 1.
>> 		             Sequences with no residues in the slice are excluded from
>> 		             the new alignment and a warning is printed. Slice beyond
>> 		             the length of the sequence does not do padding.
>> 		 Returns   : A Bio::SimpleAlign object
>> 		 Args      : Positive integer for start column, positive integer for end
>> 		             column, optional boolean which if true will keep gap-only
>> 		             columns in the newly created slice. Example:
>> 		
>> 		             $aln2 = $aln->slice(20,30,1)
>> 		
>> 		=cut 		
>> 		
>> 			-----Original Message-----
>> 			From: bioperl-l-bounces at lists.open-bio.org
>> 			[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
>> 			Laurent MANCHON
>> 			Sent: Wednesday, March 25, 2009 7:57 AM
>> 			To: bioperl-l at lists.open-bio.org
>> 			Subject: [Bioperl-l] problem to fit genomic coordinates
>> 			
>> 			This is my problem:
>> 			how can I match ranges of genomic coordinates stored in two
>> 			distinct files?
>> 			
>> 			The first file (file1.txt) is my annotation file, in this format:
>> 			
>> 			regulatory_region 3455 3463
>> 			regulatory_region 3535 3544
>> 			regulatory_region 3601 3608
>> 			transcriptional_cis_regulatory_region 3622 3630
>> 			five_prime_UTR 3631 3759
>> 			CDS 3760 3913
>> 			exon 3631 3913
>> 			CDS 3996 4276
>> 			exon 3996 4276
>> 			CDS 4486 4605
>> 			exon 4486 4605
>> 			CDS 4706 5095
>> 			exon 4706 5095
>> 			CDS 5174 5326
>> 			exon 5174 5326
>> 			....
>> 			....
>> 			
>> 			The second file (file2.txt) is my experimental file, in this format:
>> 			
>> 			acc_2765773 3222 3239 -
>> 			acc_2842543 3222 3239 -
>> 			acc_2842544 3222 3239 -
>> 			acc_442945 3222 3239 -
>> 			acc_442946 3222 3239 -
>> 			acc_4873 3222 3239 -
>> 			acc_53956 3222 3239 -
>> 			acc_562588 3222 3239 -
>> 			acc_807114 3222 3239 -
>> 			acc_84146 3222 3239 -
>> 			acc_2419732 3268 3285 +
>> 			acc_3041065 3565 3583 +
>> 			acc_362358 3640 3656 -
>> 			acc_3279485 3793 3813 +
>> 			acc_3091017 3794 3811 -
>> 			acc_2807380 3832 3848 +
>> 			acc_3105138 3832 3848 +
>> 			acc_3105139 3832 3848 +
>> 			acc_3105140 3832 3848 +
>> 			acc_3116450 3832 3848 +
>> 			acc_86708 3832 3848 +
>> 			acc_1987802 3922 3938 -
>> 			acc_1679660 4113 4129 +
>> 			acc_891489 4113 4129 +
>> 			acc_2829973 4299 4318 +
>> 			....
>> 			....
>> 			
>> 			
>> 			number of lines in file1.txt ~ 150000
>> 			number of lines in file2.txt ~ 800000
>> 			
>> 			So, how can I annotate file2 using the genomic coordinates
>> 			stored in file1? Do I need to compare each range in file2
>> 			with each range in file1, i.e. 800000 x 150000 combinations
>> 			(a quadratic analysis)?
>> 			I'm looking for a fast method to do that, something that
>> 			scales linearly.
>> 			
>> 			Thank you so much if you have ideas to help me.
>> 			
>> 			Laurent --
>> 			_______________________________________________
>> 			Bioperl-l mailing list
>> 			Bioperl-l at lists.open-bio.org
>> 			http://lists.open-bio.org/mailman/listinfo/bioperl-l