[Bioperl-l] problem to fit genomic coordinates
Chris Fields
cjfields at illinois.edu
Wed Mar 25 22:06:41 UTC 2009
Laurent,
All BioPerl modules, including Bio::SimpleAlign, have documentation
available via 'perldoc'; have a look at that for specific examples.
Personally, I recommend using Bio::DB::SeqFeature::Store (or another
Bio::SeqFeature::CollectionI implementation) for this.
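
If you would rather stay in plain Perl, the comparison from the original
question can be done as a single sweep over both lists once they are
sorted by start coordinate. The following is only a minimal sketch, not
BioPerl API: annotate_reads and the [label, start, end] record layout are
hypothetical names chosen for illustration, and it assumes everything is
on one sequence and ignores strand.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl).
# $features and $reads are array refs of [label, start, end] records.
# Returns an array ref of [read_id, [labels of overlapping features]].
sub annotate_reads {
    my ($features, $reads) = @_;

    # Sort both lists by start coordinate so one forward pass is enough.
    my @feat  = sort { $a->[1] <=> $b->[1] } @$features;
    my @query = sort { $a->[1] <=> $b->[1] } @$reads;

    my $next = 0;    # index of the next feature not yet pulled in
    my @active;      # features that may overlap this or a later read
    my @result;

    for my $r (@query) {
        my ($id, $start, $end) = @$r;

        # Pull in every feature that starts at or before this read's end.
        push @active, $feat[$next++]
            while $next < @feat && $feat[$next][1] <= $end;

        # Drop features that end before this read starts; later reads
        # start no earlier, so those features can never overlap again.
        @active = grep { $_->[2] >= $start } @active;

        # An active feature may still start after this read ends
        # (it was pulled in for an earlier, longer read), so check that too.
        my @hits = map { $_->[0] } grep { $_->[1] <= $end } @active;
        push @result, [ $id, \@hits ];
    }
    return \@result;
}

# A few records taken from the sample data in the original question.
my @features = (
    [ 'five_prime_UTR', 3631, 3759 ],
    [ 'CDS',            3760, 3913 ],
    [ 'exon',           3631, 3913 ],
);
my @reads = (
    [ 'acc_2419732', 3268, 3285 ],
    [ 'acc_362358',  3640, 3656 ],
    [ 'acc_3279485', 3793, 3813 ],
);

for my $row (@{ annotate_reads(\@features, \@reads) }) {
    printf "%s\t%s\n", $row->[0], join(',', @{ $row->[1] }) || 'none';
}
```

After the two sorts, the sweep pulls in each feature once and visits each
read once, so the cost stays close to linear as long as only a few
features are active at any position. It reports every overlapping
feature, so choosing between, say, CDS and exon is still up to you.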
chris
On Mar 25, 2009, at 3:09 PM, Laurent Manchon wrote:
> -- Yes, perhaps,
> but I don't know how to use a Bio::SimpleAlign object to solve my
> problem, which is a pity,
> so I will keep looking for another way using procedural
> programming.
>
> Thank you --
>
> Kevin Brown wrote:
>> Please keep all replies on list.
>> Doing it with SimpleAlign gets rid of the incrementing problem and
>> reduces the number of loop iterations you'll have to do. Based on
>> your sample data, a lot of IDs share the same location, and the
>> first file also contains overlapping features. So you'll still need
>> to decide which item is the one you really want (e.g. CDS vs. exon).
>>
>>
>> ________________________________
>>
>> From: Laurent MANCHON [mailto:lmanchon at univ-montp2.fr]
>> Sent: Wednesday, March 25, 2009 9:44 AM
>> To: Kevin Brown
>> Subject: Re: [Bioperl-l] problem to fit genomic coordinates
>>
>>
>> Okay, but I don't think this method is an easy way to do it.
>> The files are already sorted on the numeric columns, so maybe
>> another logical method exists that doesn't use the BioPerl
>> libraries, for example a while loop,
>>
>> something like:
>>
>> $i = $j = 1;
>> $idx = number of lines in file1
>> $cpt = number of lines in file2
>> while ($i <= $idx && $j <= $cpt) {
>>     # compare current elements
>>     # increment either $i or $j depending on which segment comes
>>     # before the other
>> }
>> The difficulty is deciding when to increment $i or $j inside the
>> loop.
>>
>> Laurent --
>>
>> Kevin Brown wrote:
>> Read in first file and create a Bio::SimpleAlign object
>>
>> Then use the slice method to find the features that are between the
>> start/end values of your second file
>>
>> =head2 slice
>>
>>  Title    : slice
>>  Usage    : $aln2 = $aln->slice(20,30)
>>  Function : Creates a slice from the alignment inclusive of start and
>>             end columns, and the first column in the alignment is
>>             denoted 1. Sequences with no residues in the slice are
>>             excluded from the new alignment and a warning is printed.
>>             Slice beyond the length of the sequence does not do
>>             padding.
>>  Returns  : A Bio::SimpleAlign object
>>  Args     : Positive integer for start column, positive integer for
>>             end column, optional boolean which if true will keep
>>             gap-only columns in the newly created slice. Example:
>>
>>             $aln2 = $aln->slice(20,30,1)
>>
>> =cut
>>
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Laurent MANCHON
>> Sent: Wednesday, March 25, 2009 7:57 AM
>> To: bioperl-l at lists.open-bio.org
>> Subject: [Bioperl-l] problem to fit genomic coordinates
>>
>> This is my problem:
>> how can I match ranges of genomic coordinates stored in two
>> distinct files?
>>
>> The first file (file1.txt) is my annotation file, in this format:
>>
>> regulatory_region 3455 3463
>> regulatory_region 3535 3544
>> regulatory_region 3601 3608
>> transcriptional_cis_regulatory_region 3622 3630
>> five_prime_UTR 3631 3759
>> CDS 3760 3913
>> exon 3631 3913
>> CDS 3996 4276
>> exon 3996 4276
>> CDS 4486 4605
>> exon 4486 4605
>> CDS 4706 5095
>> exon 4706 5095
>> CDS 5174 5326
>> exon 5174 5326
>> ....
>> ....
>>
>> The second file (file2.txt) is my experimental file, in this format:
>>
>> acc_2765773 3222 3239 -
>> acc_2842543 3222 3239 -
>> acc_2842544 3222 3239 -
>> acc_442945 3222 3239 -
>> acc_442946 3222 3239 -
>> acc_4873 3222 3239 -
>> acc_53956 3222 3239 -
>> acc_562588 3222 3239 -
>> acc_807114 3222 3239 -
>> acc_84146 3222 3239 -
>> acc_2419732 3268 3285 +
>> acc_3041065 3565 3583 +
>> acc_362358 3640 3656 -
>> acc_3279485 3793 3813 +
>> acc_3091017 3794 3811 -
>> acc_2807380 3832 3848 +
>> acc_3105138 3832 3848 +
>> acc_3105139 3832 3848 +
>> acc_3105140 3832 3848 +
>> acc_3116450 3832 3848 +
>> acc_86708 3832 3848 +
>> acc_1987802 3922 3938 -
>> acc_1679660 4113 4129 +
>> acc_891489 4113 4129 +
>> acc_2829973 4299 4318 +
>> ....
>> ....
>>
>>
>> number of lines in file1.txt ~ 150000
>> number of lines in file2.txt ~ 800000
>>
>> So, how can I annotate file2 using the genomic coordinates stored
>> in file1? Do I need to compare each range in file2 with each range
>> in file1: 800000 x 150000 combinations (quadratic analysis)?
>> I'm looking for a fast method to do that, something that
>> progresses linearly through the data.
>>
>> Thank you so much if you have any ideas to help me.
>>
>> Laurent --
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l