[Bioperl-l] Problems with Bio::DB::Fasta

Florent Angly florent.angly at gmail.com
Mon May 30 21:53:04 UTC 2011


Hi Justin,

Please "reply all" so that our emails stay on the BioPerl mailing list.

Weirdness regarding newlines is often indicative of a file that has 
traveled between different operating systems (which represent line 
endings differently, e.g. CR+LF on Windows versus LF on Unix). You may 
try to follow these instructions if that's the case:
http://www.cyberciti.biz/faq/howto-unix-linux-convert-dos-newlines-cr-lf-unix-text-format/
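
As a rough illustration of the kind of cleanup described at that link, here is a minimal shell sketch. It assumes a POSIX shell with grep, tr, sed and cut available. The function names and the output file name are made up for this example and are not part of BioPerl. It counts lines ending in a carriage return, lists blank lines, and writes a cleaned copy with both removed. My understanding is that Bio::DB::Fasta relies on consistent line lengths and line endings within each sequence when it computes offsets, which is why stray carriage returns or empty lines can matter.

```shell
#!/bin/sh
# Hypothetical helpers for checking and cleaning a FASTA file.
# None of these names come from BioPerl; they are illustrative only.

# Print the number of lines that end in a carriage return (DOS-style CR+LF).
count_crlf_lines() {
    # $(printf '\r') expands to a literal CR; \$ anchors it at end of line.
    grep -c "$(printf '\r')\$" "$1"
}

# Print the line numbers of empty or whitespace-only lines, one per line.
list_blank_lines() {
    grep -n '^[[:space:]]*$' "$1" | cut -d: -f1
}

# Write a copy of file $1 to $2 with carriage returns stripped
# and empty or whitespace-only lines dropped.
normalize_fasta() {
    tr -d '\r' < "$1" | sed '/^[[:space:]]*$/d' > "$2"
}
```

For example, `normalize_fasta Test2.Fasta Test2.clean.fasta` writes a cleaned copy. Indexing the new file means Bio::DB::Fasta has to build a fresh index rather than reuse one made from the old file.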

Florent




On 31/05/11 04:28, Justin Chu wrote:
> Hi Florent:
>
> It seems that it does not detect the spaces in my files at times for 
> some reason and will proceed to run the script with no problem. 
> Strangely, empty lines I insert myself seem to be detected in 
> Test1.Fasta, but not in Test2.Fasta.
>
> Justin
>
> On Fri, May 27, 2011 at 5:33 PM, Florent Angly 
> <florent.angly at gmail.com> wrote:
>
>
>
>     On 28/05/11 05:07, Justin Chu wrote:
>>     Thanks for your reply. I think something is wrong with my
>>     installation because I keep getting an error when running your
>>     script. I have already tried reinstalling with a version from
>>     CPAN to make sure my problem is not due to missing dependencies,
>>     but I still get the following error:
>>
>>     Can't locate Test/Exception.pm in @INC (@INC contains: t/lib
>>     /home/justin/workspace/.metadata/.plugins/org.epic.debug
>>     /home/justin/workspace/LocalTools/Testing /etc/perl
>>     /usr/local/lib/perl/5.10.1 /usr/local/share/perl/5.10.1
>>     /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.10
>>     /usr/share/perl/5.10 /usr/local/lib/site_perl .) at (eval 46) line 2.
>>     BEGIN failed--compilation aborted at (eval 46) line 2.
>>
>>     BEGIN failed--compilation aborted at
>>     /usr/local/share/perl/5.10.1/Bio/Root/Test.pm line 152.
>>     Compilation failed in require at
>>     /home/justin/workspace/LocalTools/Testing/test.pl line 6.
>>     BEGIN failed--compilation aborted at
>>     /home/justin/workspace/LocalTools/Testing/test.pl line 6.
>
>     Hi Justin,
>     Install the Test::Exception module this way (for Debian-like
>     systems): sudo apt-get install libtest-exception-perl
>     Once it is installed, you should get error messages for the
>     blank lines of your FASTA file when running the script. If you
>     don't get errors on the blank lines, and the script continues
>     happily, then that's very likely the reason why you get the wrong
>     subsequences.
>     Florent
>
>
>
>
>>
>>     However, I did post my problem somewhere else and found that other
>>     people did get errors when trying to make an index with my files.
>>     The weird thing is that I could make index files, but lines
>>     without sequence would cause my sequence retrieval to be offset by
>>     one sequence position for each empty line. I found that removing
>>     all the spaces fixed the retrieval, but this still does not
>>     explain the lack of error messages.
>>
>>     Thanks for your help,
>>
>>     Justin
>>
>>     On Thu, May 26, 2011 at 8:55 PM, Florent Angly
>>     <florent.angly at gmail.com> wrote:
>>
>>         Hi Justin,
>>
>>         I have been trying to reproduce your issue. A problem I ran into
>>         was that there were some extra empty lines in your FASTA
>>         files. Then I made a test script that gets the subsequences
>>         you mentioned using three different methods:
>>         Bio::SeqIO+Bio::Seq, Bio::DB::Fasta, and your
>>         InMemoryFastaAccess. These three methods return the same
>>         answer, so, I see no problem there.
>>
>>         My system is pretty similar to yours:
>>         Bioperl-live from the BioPerl GitHub master branch from 27/5/11
>>         Perl 5.12.3
>>         Linux 2.6.38-2-amd64 (Linux Mint Debian Edition)
>>
>>         Can you run the attached script on the attached FASTA files
>>         and see if all tests pass?
>>
>>         Thanks,
>>
>>         Florent
>>
>>
>>
>>
>>         On 21/05/11 05:51, Justin Chu wrote:
>>>         Hello:
>>>
>>>         I'm having trouble with Bio::DB::Fasta. The problem sometimes occurs when I use
>>>         large FASTA files and retrieve sequence from a bit past the start of the file. I
>>>         think some characters are being ignored or a rounding error is occurring or
>>>         something when using the offset to retrieve entries from the index file. I
>>>         have attached the FASTA files I have been using, just in case my problem is
>>>         due to improper formatting of my files.
>>>
>>>         For example:
>>>
>>>         my $refDB   = Bio::DB::Fasta->new('Test2.Fasta');
>>>         my $queryDB = Bio::DB::Fasta->new('Test1.Fasta');
>>>
>>>         print $refDB->subseq( "gi|294675557|ref|NC_014034.1|", 161067, 161788 )."\n";
>>>         print $queryDB->subseq( "gi|169245903|gb|EU376363.1|", 1, 722 )."\n";
>>>
>>>         output:
>>>         GGTAGTCCACGCCGTAAACGATGAATGCCAGTCGTCGGCAG...
>>>         GTAGTCCCGGCCGTAAACGATGGATGCTAGCCGTCGGATAG...
>>>
>>>         my $refDB2  = InMemoryFastaAccess->new('Test2.Fasta');
>>>         my $queryDB2 = InMemoryFastaAccess->new('Test1.Fasta');
>>>
>>>         print $refDB2->subseq( "gi|294675557|ref|NC_014034.1|", 161067, 161788 )."\n";
>>>         print $queryDB2->subseq( "gi|169245903|gb|EU376363.1|", 1, 722 )."\n";
>>>
>>>         I get:
>>>
>>>         output:
>>>         GTAGTCCACGCCGTAAACGATGAATGCCAGTCGTCGGCA...
>>>         GTAGTCCCGGCCGTAAACGATGGATGCTAGCCGTCGGAT...
>>>
>>>         Basically, sometimes the sequences retrieved are correct, but other times they
>>>         are offset by a few base pairs. Interestingly, it seems that the
>>>         offset problem gets worse as you retrieve sequence chunks further and
>>>         further down the sequence.
>>>
>>>         print $refDB->subseq( "gi|294675557|ref|NC_014034.1|", 1514858, 1515579 )."\n";
>>>
>>>         output:
>>>         CCCTGGTAGTCCACGCCGTAAACGATGAATGCCAGTCGT...
>>>
>>>         when it should be:
>>>
>>>         print $refDB2->subseq( "gi|294675557|ref|NC_014034.1|", 1514858, 1515579 )."\n";
>>>
>>>         output:
>>>         GTAGTCCACGCCGTAAACGATGAATGCCAGTCGTCGGCA...
>>>
>>>         This module is still way faster than what I have, so I want to keep using
>>>         it. Do you think there is something I'm overlooking that could be the problem,
>>>         or do you see a way to fix this?
>>>
>>>         I am currently running:
>>>         Bioperl-live from the BioPerl GitHub master branch from 19/5/11
>>>         Perl 5.10.1
>>>         Debian 6.0.1
>>>
>>>         If you need any other information please let me know.
>>>
>>>         Thanks,
>>>
>>>         Justin Chu
>>>
>>>
>>>
>>>         _______________________________________________
>>>         Bioperl-l mailing list
>>>         Bioperl-l at lists.open-bio.org
>>>         http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>
>



