[Bioperl-l] Bio::AlignIO ignores questionmarks?

Chris Fields cjfields at uiuc.edu
Fri Apr 14 11:41:09 EDT 2006


> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of David Messina
> Sent: Friday, April 14, 2006 12:14 AM
> To: Kai Müller
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bio::AlignIO ignores questionmarks?
> 
> Hi Kai,
> 
> I'm by no means an expert with this module, but I'll take a shot.
> 
> Running your code through a debugger, I'm seeing that
> Bio::AlignIO::fasta is gobbling the question marks:
> 
> line 66: $MATCHPATTERN = '^A-Za-z\.\-';
> 
> and then where $entry contains a line of sequence from the input file
> 
> line 118: $entry =~ s/[$MATCHPATTERN]//g;
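
To see the effect in isolation, here is a minimal sketch using only core
Perl.  It assumes the pattern and the substitution are exactly as quoted
above; strip_entry() and $WITH_QMARK are names made up for illustration and
are not part of BioPerl.  Because the pattern starts with '^', the character
class is negated, so everything that is not a letter, '.' or '-' is deleted.

```perl
use strict;
use warnings;

# Pattern as quoted above from Bio::AlignIO::fasta (line 66).
my $MATCHPATTERN = '^A-Za-z\.\-';

# Hypothetical variant for comparison, NOT BioPerl code: also keep '?'.
my $WITH_QMARK = '^A-Za-z\.\-\?';

# Apply the same substitution as the quoted line 118 to one input line.
sub strip_entry {
    my ($entry, $pattern) = @_;
    $entry =~ s/[$pattern]//g;    # expands to s/[^A-Za-z\.\-]//g
    return $entry;
}

my $line = "ACGT????AC-GT\n";
print strip_entry($line, $MATCHPATTERN), "\n";    # '?' and the newline are removed
print strip_entry($line, $WITH_QMARK),   "\n";    # '?' survives
```

With the quoted pattern this prints ACGTAC-GT; with the hypothetical variant
it prints ACGT????AC-GT.  Whether '?' should be allowed through in the real
module is a separate question.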
> 
> As far as I can tell, a question mark is not a valid character for
> the FASTA format (see http://en.wikipedia.org/wiki/FASTA_format) --
> perhaps that's the reason Bio::AlignIO::fasta doesn't permit them?
 
I wouldn't trust Wikipedia on that one.  Check out the BioPerl page:

http://www.bioperl.org/wiki/FASTA_sequence_format

The problem is that there is no well-established, universal rule for the FASTA
format.  These are three valid FASTA input sequences for some programs:

>xyz
X

>
-

>


It all depends on how a program or web interface imports the sequence.  You
don't need a description line; just '>' will do.  Some don't even require a
sequence, though most filters will warn you.  Even the rules for wrapping
the sequence across multiple lines differ (is it 60, 80, 100, or none?).
When I first started (early '90s), a quick and easy way to get sequences
ready for BLAST searches that required FASTA was to copy and paste, then add
'>' and a CR on a line above, with no additional line breaks in the sequence
(all on one line).  Still works AFAIK...
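
As a rough illustration of how permissive an importer can be, here is a
sketch of a lenient reader in core Perl that accepts all three records
above.  parse_lenient_fasta() is a made-up name, and this is not how any
particular program (or BioPerl) does it; it only assumes that a line
starting with '>' opens a record and that everything up to the next '>' is
sequence, wrapped at any width or not at all.

```perl
use strict;
use warnings;

# Hypothetical lenient reader: empty descriptions and empty sequences are
# both accepted, and wrapped sequence lines are simply concatenated.
sub parse_lenient_fasta {
    my ($text) = @_;
    my @records;
    for my $line (split /\r?\n/, $text) {
        if ($line =~ /^>(.*)$/) {
            # A new record starts; the description may be empty.
            push @records, { desc => $1, seq => '' };
        }
        elsif (@records) {
            # Sequence line: drop whitespace, keep every other character.
            $line =~ s/\s+//g;
            $records[-1]{seq} .= $line;
        }
    }
    return @records;
}

# The three example records from above.
my $input = ">xyz\nX\n\n>\n-\n\n>\n\n";
for my $rec (parse_lenient_fasta($input)) {
    printf "desc='%s' seq='%s' length=%d\n",
        $rec->{desc}, $rec->{seq}, length($rec->{seq});
}
```

This reader keeps '?' (and anything else that is not whitespace), which is
exactly the kind of choice that varies from one importer to the next.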

> And then by the time missing_char() is applied, the question marks
> are already gone.
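
That also accounts for the shifted columns Kai describes.  A small sketch in
core Perl, assuming the stripping pattern quoted earlier in this thread; the
padding step only mimics the symptom Kai reports (shorter rows filled with
'-' at the 3' end) and is not taken from the BioPerl source.

```perl
use strict;
use warnings;

# Two aligned rows of equal length; the first has missing data marked '?'.
my @rows = ('ACGT????ACGT', 'ACGTTTTTACGT');

# Stripping pattern as quoted from Bio::AlignIO::fasta.
my $strip = '^A-Za-z\.\-';

# Step 1: the reader deletes '?' while parsing, so the first row gets shorter.
my @stripped = map { (my $s = $_) =~ s/[$strip]//g; $s } @rows;

# Step 2 (assumed, modelled on the reported symptom): pad the shorter rows
# with '-' on the right so that all rows have the same length again.
my ($max) = sort { $b <=> $a } map { length($_) } @stripped;
my @padded = map { $_ . ('-' x ($max - length($_))) } @stripped;

print "$_\n" for @padded;
```

The first row comes out as ACGTACGT----, so its last four bases now sit
under TTTT instead of under the final ACGT, and no later call to
missing_char() can tell where the question marks used to be.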
> 
> What happens if you read in your sequence with question marks in a
> format that explicitly permits question marks?
> 
> Dave
> 
> 
> On Apr 13, 2006, at 7:38 PM, Kai Müller wrote:
> 
> > hi,
> >
> > I'm very new to BioPerl and have a maybe silly question.
> > When using Bio::AlignIO to load a set of sequences, the question marks are
> > simply lost (they refer to missing characters as opposed to gap characters
> > [-] or ambiguity [N]). I thought that 'missing_char()' might help, but it
> > didn't (I probably used it the wrong way).
> >
> > When $filename contains sequences with ????, the following snippet would
> > produce an alignment with the ???? lost, the downstream nucleotides just
> > shifted, and the resulting length differences filled by '---' at the 3' end:
> >
> >
> > use Bio::AlignIO;
> >
> > # Read the first alignment from the FASTA file
> > my $aln_in = Bio::AlignIO->new(-file => $filename, -format => 'fasta');
> > my $aln = $aln_in->next_aln();
> > $aln->gap_char('-');
> > $aln->missing_char('?');
> >
> > # Write the alignment to STDOUT in ClustalW format
> > my $testout = Bio::AlignIO->new(-fh => \*STDOUT, -format => 'clustalw');
> > $testout->write_aln($aln);
> >
> >
> >
> > Can somebody give me a hint here?
> >
> > thanks and all the best,
> >
> > Kai Müller
> >
> >
> >
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
> 




