[Bioperl-l] problem to fit genomic coordinates
Laurent MANCHON
lmanchon at univ-montp2.fr
Thu Mar 26 04:31:40 EDT 2009
yes but this is a school problem that my teacher ask us to resolve
without using Bioperl modules !
i have written a piece of code in awk but it takes too much times to
perform the task:
#!/usr/bin/awk -f
#usage: myprog.awk file1.txt
# file2.txt file1.txt
# CDS 3760 3913 + AT1G01010 acc_1762592 24 89 112 -
# exon 3631 3913 + AT1G01010 acc_2739797 24 304 327 -
# CDS 3996 4276 + AT1G01010 acc_1955650 18 308 325 -
BEGIN {
while((getline < "file2.txt") > 0){
cpt++
descr[cpt]=$1
start[cpt]=$2
end[cpt]=$3
strand[cpt]=$4
tair[cpt]=$5
}
close("file2.txt")
}
{
j=1
while(start[j]<=$3 && j<=cpt){
if(end[j]>=$4){print
"from="$3,"to="$4,"start="start[j],"end="end[j],"j="j;j++}
else{j++}
}
}
Chris Fields a écrit :
> Laurent,
>
> All BioPerl modules, including Bio::SimpleAlign, have documentation
> via 'perldoc', you should have a look at that for specific examples.
> Myself, I recommend using Bio::DB::SeqFeature::Store (or another
> Bio::SeqFeature::CollectionI) for this.
>
> chris
>
> On Mar 25, 2009, at 3:09 PM, Laurent Manchon wrote:
>
>> -- yes perhaps,
>> but i don't know how to use Bio::SimpleAlign object to resolve my
>> problem, what a pity for me,
>> so i'm going on to search using in another way procedural programmation.
>>
>> thank you --
>>
>> Kevin Brown a écrit :
>>> Please keep all replies on list.
>>> Doing it with the SimpleAlign gets rid of the problem of
>>> incrementing and reduces the complexity of the number of loop
>>> iterations you'll have to do. Based on your sample data you have a
>>> lot of IDs that actually have the same location information that
>>> they are needing, you also have overlapping information from the
>>> first file. So you'll still need to make decisions as to which item
>>> is what you really want (e.g. CDS vs Exon).
>>>
>>>
>>> ________________________________
>>>
>>> From: Laurent MANCHON [mailto:lmanchon at univ-montp2.fr] Sent:
>>> Wednesday, March 25, 2009 9:44 AM
>>> To: Kevin Brown
>>> Subject: Re: [Bioperl-l] problem to fit genomic coordinates
>>>
>>>
>>> Okay but i think it's not an easy way with this method,
>>> the files are already sorted on colum numbers, so maybe another
>>> logical method
>>> without using Bioperl libraries exist, for example using a while
>>> loop,
>>>
>>> something like:
>>>
>>> $i = $j = 1;
>>> $idx = number of lines in file1
>>> $cpt = number of lines in file2
>>> while ($i <= $idx && $j <= $cpt) {
>>> #compare current elements
>>> #increment either $i or $j depending which segment comes before
>>> the other
>>> }
>>> the difficulty is when to decide to incremente $i or $j inside
>>> the loop
>>>
>>> Laurent --
>>>
>>> Kevin Brown a écrit :
>>> Read in first file and create a Bio::SimpleAlign object
>>>
>>> Then use the slice method to find the features that are
>>> between the
>>> start/end values of your second file
>>>
>>> =head2 slice
>>>
>>> Title : slice
>>> Usage : $aln2 = $aln->slice(20,30)
>>> Function : Creates a slice from the alignment inclusive of
>>> start and
>>> end columns, and the first column in the
>>> alignment is
>>> denoted 1.
>>> Sequences with no residues in the slice are
>>> excluded from
>>> the
>>> new alignment and a warning is printed. Slice
>>> beyond the
>>> length of
>>> the sequence does not do padding.
>>> Returns : A Bio::SimpleAlign object
>>> Args : Positive integer for start column, positive
>>> integer for end
>>> column,
>>> optional boolean which if true will keep
>>> gap-only columns
>>> in the newly
>>> created slice. Example:
>>>
>>> $aln2 = $aln->slice(20,30,1)
>>>
>>> =cut
>>>
>>> -----Original Message-----
>>> From: bioperl-l-bounces at lists.open-bio.org
>>> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf
>>> Of Laurent MANCHON
>>> Sent: Wednesday, March 25, 2009 7:57 AM
>>> To: bioperl-l at lists.open-bio.org
>>> Subject: [Bioperl-l] problem to fit genomic coordinates
>>>
>>> this is my problem:
>>> how is it possible to fit range of genomic coordinates
>>> stored in two distinct files ?
>>>
>>> first file (file1.txt) is my annotation file with format
>>> as:
>>>
>>> regulatory_region 3455 3463
>>> regulatory_region 3535 3544
>>> regulatory_region 3601 3608
>>> transcriptional_cis_regulatory_region 3622 3630
>>> five_prime_UTR 3631 3759
>>> CDS 3760 3913
>>> exon 3631 3913
>>> CDS 3996 4276
>>> exon 3996 4276
>>> CDS 4486 4605
>>> exon 4486 4605
>>> CDS 4706 5095
>>> exon 4706 5095
>>> CDS 5174 5326
>>> exon 5174 5326
>>> ....
>>> ....
>>>
>>> second file (file2.txt) is my experimental file with
>>> format as:
>>>
>>> acc_2765773 3222 3239 -
>>> acc_2842543 3222 3239 -
>>> acc_2842544 3222 3239 -
>>> acc_442945 3222 3239 -
>>> acc_442946 3222 3239 -
>>> acc_4873 3222 3239 -
>>> acc_53956 3222 3239 -
>>> acc_562588 3222 3239 -
>>> acc_807114 3222 3239 -
>>> acc_84146 3222 3239 -
>>> acc_2419732 3268 3285 +
>>> acc_3041065 3565 3583 +
>>> acc_362358 3640 3656 -
>>> acc_3279485 3793 3813 +
>>> acc_3091017 3794 3811 -
>>> acc_2807380 3832 3848 +
>>> acc_3105138 3832 3848 +
>>> acc_3105139 3832 3848 +
>>> acc_3105140 3832 3848 +
>>> acc_3116450 3832 3848 +
>>> acc_86708 3832 3848 +
>>> acc_1987802 3922 3938 -
>>> acc_1679660 4113 4129 +
>>> acc_891489 4113 4129 +
>>> acc_2829973 4299 4318 +
>>> ....
>>> ....
>>>
>>>
>>> number of lines in file1.txt ~ 150000
>>> number of lines in file2.txt ~ 800000
>>>
>>> so, how to annotate my file2 using the genomic
>>> coordinates stored in file1. I need to compare each
>>> couple of range of my file2 with each couple of range of
>>> my file1: 800000x150000 combinaisons (quadratic analysis) ?
>>> i'm looking for a fast method to do that, something like
>>> linear progression in the analysis
>>>
>>> thank you so much if you have ideas for help me.
>>>
>>> Laurent --
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>
More information about the Bioperl-l
mailing list