[Bioperl-l] Strange behaviour in the write_seq function for large fasta

Tue Dec 27 15:03:28 UTC 2011

This is a strange one.  Personally I haven't seen this behavior, but that maybe it's OS-dependent?

We'll need more information, particularly what version of BioPerl you are using, the OS, version of perl, etc.  Also, in general to make sure we don't lose track of this issue it is best to submit a bug report:

https://redmine.open-bio.org/projects/bioperl

I'm planning on triaging bugs next week, I could take a look then.

chris
________________________________________
From: bioperl-l-bounces at lists.open-bio.org [bioperl-l-bounces at lists.open-bio.org] on behalf of David Gacquer [dgacquer at ulb.ac.be]
Sent: Wednesday, December 21, 2011 7:26 AM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Strange behaviour in the write_seq function for large      fasta

Dear BioPerl users/developers,

I am facing a strange issue with the $seq_out->write_seq function when
using large fasta files

I have downloaded the hg19 chromosome 1, and applied the following code
(basically I wanted to mask some regions in it but the problem also
appears when copying the sequence without modifications):

sub main{
     my $seq_in  = Bio::SeqIO->new( -format => 'largefasta', -file =>
$ARGV[0]);
     my $seq_out  = Bio::SeqIO->new( -format => 'largefasta', -file =>
'>'.$ARGV[1]);
     my $seq_obj_in = $seq_in->next_seq();
     my $modified_seq = $seq_obj_in->seq();
     my $seq_obj_out = Bio::Seq::LargePrimarySeq->new( -seq =>
$modified_seq, -id  => $seq_obj_in->id, -desc => $seq_obj_in->desc);
     $seq_out->write_seq($seq_obj_out);
}

when checking the output fasta file, the sequence of chr1 is 1-bp shorter.

I have noticed that in the original fasta file, each line contains
exactly 50 nucleotides, while the output of the $seq_out->write_seq
function contains always 60 characters per line.
chr1 is exactly 249,250,621 bp (4,154,177 * 60 + 1) so to verify that
the very last base was missing, I created the following fasta files:

chr121.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAG

chr122.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAG

They contain respectively 121 (60*2+1) and 122 (60*2+2) bp, the last
character being a G. When running the above code:

chr121.out.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

chr122.out.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AG

The output for the 122 bp chromosome is correct (2 lines of 60 bp and
the last line with 2 bp, AG) but for the 121 bp chromosome, the last
character is missing (2 lines of 60 bp only, last G is missing).

When replacing -format => 'largefasta' by -format => 'fasta' or writing
the output without the write_seq function however, the problem is solved.

Am I missing something or is there a problem with the write_seq function
used with large fasta files? (I am running BioPerl on a Mac under OS X
Snow Leopard)

Best regards

David

--
David Gacquer, Ph. D.

IRIBHM - Universite Libre de Bruxelles
Bldg C, room C.4.117
ULB, Campus Erasme, CP602
808 route de Lennik
B-1070 Brussels
Belgium

Phone: +32-2-555 4187
Fax: +32-2-555 4655
E-mail: dgacquer at ulb.ac.be

_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l