[Biopython-dev] Biopython status
Jared Flatow
jflatow at northwestern.edu
Tue Oct 16 16:02:19 UTC 2007
Please forgive me for ever doubting your health; it seems the group
is very much alive!
On Oct 16, 2007, at 3:16 AM, Peter wrote:
> Jared Flatow wrote:
>> I have also needed to create a modified FASTA parser so that I can
>> read things like quality score files.
>
> Could you be a little more specific - what exactly do you mean by
> quality score files (links and/or examples)? It may be that this
> warrants setting up a new file format in Bio.SeqIO.
That is what I did. The quality score files I meant are simply
FASTA-like records that indicate the quality of each base pair read
from a sequencing machine, on a scale of something like 1 to 64. The
values are tab separated and correspond to 'reads' in another FASTA
file that contains the actual sequences read. This is the way the 454
GSFlex machines output their sequencing reads, so for every set of
reads there will be a pair of files, 454Reads.fna and 454Reads.qual.
The only difference between a parser that processes these qual files
and one that processes the sequence files is that it shouldn't get rid
of spaces, and the newlines should not be stripped but converted into
spaces (when 454 wraps a line of scores, the separating space is
omitted at the line break). Essentially I have made a duplicate of
FastaIO's iterator, named it something else, made these two small
changes and put an entry for it in the SeqIO file.
16,17c16,17
< def GSQualIterator(handle, alphabet = single_letter_alphabet, title2ids = None) :
< """Generator function to iterate over GSFlex quality records (as SeqRecord objects).
---
> def FastaIterator(handle, alphabet = single_letter_alphabet, title2ids = None) :
> """Generator function to iterate over Fasta records (as SeqRecord objects).
54c54
< lines.append(line.rstrip()) # .replace(" ","")) leave off the replacing internal spaces so we can process qscore files (jf)
---
> lines.append(line.rstrip().replace(" ",""))
58c58
< yield SeqRecord(Seq(" ".join(lines), alphabet),
---
> yield SeqRecord(Seq("".join(lines), alphabet),
63a64,199
As you can see, a parser like this might be useful for other
FASTA-like formats as well and is in no way specific to the GS quality
files (it's just a space-preserving parser). If it were to be
implemented in Biopython you might call it something else.
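
To make the idea concrete without the Biopython internals, here is a
minimal standalone sketch of a space-preserving reader. It is not the
patch itself: the function names (iter_space_preserving_records,
parse_scores) and the sample data are made up for illustration, and it
assumes scores are separated by any whitespace.

```python
import io


def iter_space_preserving_records(handle):
    """Yield (title, text) pairs from a FASTA-like handle.

    Internal whitespace is kept, and the lines of one record are
    joined with a single space instead of being concatenated.
    """
    title = None
    lines = []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if title is not None:
                yield title, " ".join(lines)
            title = line[1:]
            lines = []
        elif title is not None and line:
            lines.append(line)
    if title is not None:
        yield title, " ".join(lines)


def parse_scores(text):
    """Split on any whitespace (spaces or tabs) and convert to ints."""
    return [int(token) for token in text.split()]


# Made-up example data in the shape of a .qual file.
example = io.StringIO(
    ">read1 length=6\n30 31 32\n33 34 35\n>read2 length=2\n12 40\n"
)
records = [
    (title, parse_scores(text))
    for title, text in iter_space_preserving_records(example)
]
```

Joining with a space is what keeps the last score on one line from
running into the first score on the next.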
>
>> I would be happy to submit the changes to the group or an individual
>> for inspection, but I would like to avoid having to maintain my own
>> separate version of Biopython if possible.
>
> As has already been said - please file some (enhancement) bugs and
> attach your patches, or raise specific issues for discussion on this
> mailing list.
>
> Depending on the nature of your changes, you might be able to achieve
> some of them by subclassing Biopython's objects - rather than
> literally maintaining your own branch of the project.
>
>> I am also wondering how it would be received if I did something like
>> add a to_fasta method to SeqRecord instead of having to go
>> through writing it to a file using a SeqIO when all I want is the
>> string.
>
> Out of interest, why do you want to create a FASTA record as a string?
I am serving the fasta from a database of sequences dynamically via a
web server.
>
> Did you know you can write to a string using any Bio.SeqIO supported
> file format using StringIO? Perhaps we should spell this out more
> explicitly in the documentation, but a motivating example would help.
This is what I do now, but it seems like a hack to me to go this
route. To always have to write to a file feels strange, but I see
that it would be messy to go OO since there are so many formats.
However, giving preference to fasta over other formats by making it
innate doesn't seem like such a terrible idea. I do have mixed
feelings about 'bloating' the code which is why I asked, and you have
convinced me that this is not quite appropriate given existing
convention. However, the idea would be to put the to_fasta or
to_format method inside the SeqRecord, then to call it from the IO
when needed to actually write to a file, but call it directly when
all that is wanted is a string...
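
A rough sketch of what I have in mind, using a stand-in class rather
than the real SeqRecord. Everything here (SimpleRecord, to_fasta,
write_fasta, the 60-column wrap) is hypothetical and only meant to
show the string method being reused by the file writer:

```python
import io


class SimpleRecord:
    """Hypothetical stand-in for a sequence record (not Bio.SeqRecord)."""

    def __init__(self, id, seq, description=""):
        self.id = id
        self.seq = seq
        self.description = description

    def to_fasta(self, width=60):
        """Return this record as a FASTA-formatted string."""
        title = f"{self.id} {self.description}".strip()
        # Wrap the sequence into fixed-width lines.
        chunks = [
            self.seq[i:i + width] for i in range(0, len(self.seq), width)
        ]
        return f">{title}\n" + "\n".join(chunks) + "\n"


def write_fasta(records, handle):
    """Write records to a handle by reusing to_fasta; return the count."""
    count = 0
    for record in records:
        handle.write(record.to_fasta())
        count += 1
    return count


rec = SimpleRecord("seq1", "ACGT" * 20, "example")

# Direct call when only the string is wanted (e.g. a web response).
text = rec.to_fasta()

# The file-writing path goes through the same method.
buffer = io.StringIO()
written = write_fasta([rec], buffer)
```

Both paths produce the same text, so the formatting logic lives in
one place.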
>
> I would suggest rather than adding a to_fasta method to the
> SeqRecord, simply write your own "seqrecord_to_string" function (or
> create a subclass of SeqRecord with this method).
>
I'll leave it alone for now until I can come up with a real proposal =)
>> Finally, are there plans to move to a subversion repository at any
>> point?
>
> It was raised a while ago, and our cunning plan was to let BioPerl try
> the move first. Once that has been proven, it should be fairly easy
> for the OBF guys to also move us over. I should email them to see how
> things stand...
BioPerl seems to be the guinea pig for everything. Leading the way
on this might put a stop to those nasty rumors about Biopython.
Best Regards,
Jared