[Bioperl-l] assembling chromosomes from contigs and .agp file
Smithies, Russell
Russell.Smithies at agresearch.co.nz
Thu Jan 8 03:59:08 UTC 2009
It was easier than I thought, although I couldn't work out a way to "build" a Bio::Seq directly from bits.
Here's how I did it:
-------------------------------
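use strict;
use warnings;
use Bio::DB::Fasta;
use Bio::Seq;
use Bio::SeqIO;

open(my $agp, '<', "Mt2.0_pgp.agp") or die "Can't open Mt2.0_pgp.agp: $!";
my @chr = ();
my $db = Bio::DB::Fasta->new("contigs.fa");
while (<$agp>) {
    chomp;
    next if /^#/ or !/\S/;    # skip comment and blank lines
    my @f = split ' ';        # split into an explicit array, not @_
    my ($c, $beg, $end) = @f[0 .. 2];
    # extend temp string if it's too short
    $chr[$c] = '' unless defined $chr[$c];
    $chr[$c] .= ' ' x 1_000_000 while length($chr[$c]) < $end;
    if ($f[4] !~ m/N/) {
        # component line: on the minus strand, swap start and stop so
        # Bio::DB::Fasta returns the reverse complement
        my ($start, $stop) = $f[8] eq '+' ? ($f[6], $f[7]) : ($f[7], $f[6]);
        # $f[9] is the component length (an extra 10th column in this file);
        # .agp coordinates are 1-based, substr offsets are 0-based
        substr($chr[$c], $beg - 1, $f[9], $db->seq($f[5], $start, $stop));
    } else {
        # gap line: $f[5] is the gap length
        substr($chr[$c], $beg - 1, $f[5], "N" x $f[5]);
    }
}
close($agp);
# remove the unused padding
for my $seq (@chr) { $seq =~ s/\s+//g if defined $seq; }
# print the sequences. chromosomes are chr0 -> chr8
foreach my $i (0 .. $#chr) {
    my $seqobj  = Bio::Seq->new(-display_id => "chr$i", -seq => $chr[$i]);
    my $seq_out = Bio::SeqIO->new(-file => ">chr$i.fa", -format => 'fasta');
    $seq_out->write_seq($seqobj);
}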
-------------------------------
Please excuse my hacky use of substrings, but this .agp file had overlapping runs of 'N' and overwriting by position was the easiest way to deal with them,
e.g.
0 1 50000 1 N 50000 clone yes
0 50001 167645 2 F AC144644.3 1 117645 + 117645
0 167646 217645 3 N 50000 clone yes
0 217646 317645 4 N 100000 contig no
0 317646 367645 5 N 50000 clone yes
0 367646 411754 6 F AC146805.17 1 44109 + 44109
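
To show why overwriting by position copes with overlapping records, here is a minimal sketch in plain Perl (no BioPerl needed). The toy records and the place_records name are made up for illustration and are not taken from the real .agp file; the two gap records deliberately overlap.

```perl
use strict;
use warnings;

# Hypothetical toy records: [start, end, type, payload], 1-based coordinates.
# For 'F' the payload is the sequence; for 'N' it is the gap length.
# The two gap records overlap on positions 5-6.
my @records = (
    [1, 4,  'F', 'ACGT'],
    [5, 6,  'N', 2],
    [5, 8,  'N', 4],
    [9, 12, 'F', 'TTGA'],
);

sub place_records {
    my @recs = @_;
    my $buf  = '';
    for my $r (@recs) {
        my ($beg, $end, $type, $payload) = @$r;
        # pad with spaces until the buffer covers this record
        $buf .= ' ' x 16 while length($buf) < $end;
        my $text = $type eq 'N' ? 'N' x $payload : $payload;
        # 4-argument substr overwrites in place, so an overlapping
        # record rewrites the same positions instead of adding to them
        substr($buf, $beg - 1, length($text), $text);
    }
    $buf =~ s/\s+//g;    # drop the unused padding
    return $buf;
}

print place_records(@records), "\n";    # prints ACGTNNNNTTGA
```

Simply appending each record would have given 14 bases here instead of 12. One caveat of the padding trick: any position not covered by a record is stripped along with the padding, so it only works when the records tile the whole chromosome.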
--Russell
> -----Original Message-----
> From: Smithies, Russell
> Sent: Thursday, 8 January 2009 1:33 p.m.
> To: 'bioperl-l at lists.open-bio.org'
> Subject: assembling chromosomes from contigs and .agp file
>
> Does anyone have a script for building chromosomes from an .agp file
> and a directory full of contigs?
> If not, I'll write something but I didn't want to re-invent the wheel
> if there's something "in the wild".
>
> Would something like a Bio::Assembly::IO::agp.pm be a good idea? Could
> an .agp file be regarded as a Bio::Assembly?
>
> --Russell
>