[Bioperl-l] Chromosome coordinates

Thu Dec 1 17:31:29 UTC 2011

You might try using BEDtools, intersectBED which can do a lot of what you are doing in simple command line program.

Jason
On Dec 1, 2011, at 8:59 AM, Scott Cain wrote:

> Hi Jay,
> 
> Since the maize GFF file is likely to be fairly large, I would consider
> putting it in a database, using either Bio::DB::GFF if it is GFF2 or
> Bio::DB::SeqFeature::Store if it is gff3.  Then you can use the methods
> that come along with either of those modules to search regions for for
> genes.  They both support a get_features_by_location method, so you could
> get the range for each of the regions you want to look at, and check the
> database with that method to see if anything is there.
> 
> Scott
> 
> 
> On Thu, Dec 1, 2011 at 11:38 AM, Boddu, Jayanand <jboddu at illinois.edu>wrote:
> 
>> Hello
>> I am newbie to Perl scripts.
>> I have a file with short reads mapped to the MAIZE genome
>> The format is a simple BLASTN output.
>> READ_ID
>> 
>> Chr
>> 
>> % Similarity
>> 
>> Alignment
>> 
>> Mismatches
>> 
>> Gaps
>> 
>> READ Start
>> 
>> READ End
>> 
>> Chr Start
>> 
>> Chr End
>> 
>> E Value
>> 
>> Score
>> 
>> READ1
>> 
>> chrPt
>> 
>> 100
>> 
>> 17
>> 
>> 0
>> 
>> 0
>> 
>> 1
>> 
>> 17
>> 
>> 35021
>> 
>> 35037
>> 
>> 0.21
>> 
>> 34.2
>> 
>> READ1
>> 
>> chr10
>> 
>> 100
>> 
>> 17
>> 
>> 0
>> 
>> 0
>> 
>> 1
>> 
>> 17
>> 
>> 128587356
>> 
>> 128587372
>> 
>> 0.21
>> 
>> 34.2
>> 
>> READ1
>> 
>> chr6
>> 
>> 100
>> 
>> 17
>> 
>> 0
>> 
>> 0
>> 
>> 1
>> 
>> 17
>> 
>> 160769803
>> 
>> 160769787
>> 
>> 0.21
>> 
>> 34.2
>> 
>> READ1
>> 
>> chr5
>> 
>> 100
>> 
>> 17
>> 
>> 0
>> 
>> 0
>> 
>> 1
>> 
>> 17
>> 
>> 172103083
>> 
>> 172103067
>> 
>> 0.21
>> 
>> 34.2
>> 
>> READ1
>> 
>> chr4
>> 
>> 100
>> 
>> 17
>> 
>> 0
>> 
>> 0
>> 
>> 1
>> 
>> 17
>> 
>> 213173683
>> 
>> 213173699
>> 
>> 0.21
>> 
>> 34.2
>> 
>> READ1
>> 
>> chr3
>> 
>> 100
>> 
>> 17
>> 
>> 0
>> 
>> 0
>> 
>> 1
>> 
>> 17
>> 
>> 23689132
>> 
>> 23689116
>> 
>> 0.21
>> 
>> 34.2
>> 
>> READ2
>> 
>> chr8
>> 
>> 100
>> 
>> 17
>> 
>> 0
>> 
>> 0
>> 
>> 1
>> 
>> 17
>> 
>> 161048603
>> 
>> 161048587
>> 
>> 0.21
>> 
>> 34.2
>> 
>> READ2
>> 
>> chr6
>> 
>> 100
>> 
>> 17
>> 
>> 0
>> 
>> 0
>> 
>> 1
>> 
>> 17
>> 
>> 155768884
>> 
>> 155768868
>> 
>> 0.21
>> 
>> 34.2
>> 
>> READ2
>> 
>> chr5
>> 
>> 100
>> 
>> 17
>> 
>> 0
>> 
>> 0
>> 
>> 1
>> 
>> 17
>> 
>> 32958812
>> 
>> 32958828
>> 
>> 0.21
>> 
>> 34.2
>> 
>> READ2
>> 
>> chr3
>> 
>> 100
>> 
>> 17
>> 
>> 0
>> 
>> 0
>> 
>> 1
>> 
>> 17
>> 
>> 212451090
>> 
>> 212451074
>> 
>> 0.21
>> 
>> 34.2
>> 
>> READ2
>> 
>> chr2
>> 
>> 100
>> 
>> 17
>> 
>> 0
>> 
>> 0
>> 
>> 1
>> 
>> 17
>> 
>> 2046449
>> 
>> 2046465
>> 
>> 0.21
>> 
>> 34.2
>> 
>> READ2
>> 
>> chr1
>> 
>> 100
>> 
>> 17
>> 
>> 0
>> 
>> 0
>> 
>> 1
>> 
>> 17
>> 
>> 223233801
>> 
>> 223233785
>> 
>> 0.21
>> 
>> 34.2
>> 
>> READ2
>> 
>> chr1
>> 
>> 100
>> 
>> 17
>> 
>> 0
>> 
>> 0
>> 
>> 1
>> 
>> 17
>> 
>> 277573037
>> 
>> 277573021
>> 
>> 0.21
>> 
>> 34.2
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> As expected the same read maps to multiple places on the same/different
>> chromosome.
>> I have a GFF file with annotated coordinates.
>> I would like to run a PERL script to find out READS that are within the
>> GENES in the GFF file and that are not.
>> The anticipated script should;
>> 
>> 1.       Take the READ coordinates on the genome (by chromosome);
>> 
>> 2.       Go the GFF file;
>> 
>> 3.       Find the Chromosome;
>> 
>> 4.       Find the GENE (by coordinates);
>> 
>> 5.       and report READ-its coordinates-Chromosome-GENE-and its
>> coordinates.
>> 
>> It doesn't need to be in the same order.
>> After this, I guess I could use simple Microsoft ACCESS query to pull out
>> READS that are not mapped to the GENEs.
>> I would greatly appreciate if anyone can has a script that more or less
>> similar job.
>> 
>> Thanks
>> Jay
>> 
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> 
> 
> 
> 
> -- 
> ------------------------------------------------------------------------
> Scott Cain, Ph. D.                                   scott at scottcain dot
> net
> GMOD Coordinator (http://gmod.org/)                     216-392-3087
> Ontario Institute for Cancer Research
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l