[Biopython-dev] Cleaning up Bio.SeqUtils
Thomas Sicheritz-Ponten
thomas at cbs.dtu.dk
Fri Sep 26 05:54:12 EDT 2008
Ok, fair enough :-)
Please also remove the OBSOLETE tag, as Bio.SeqIO.parse is not really a
substitute for quick_FASTA_reader
cheers
-thomas
Peter wrote:
> On Thu, Sep 25, 2008 at 11:47 PM, Thomas Sicheritz-Ponten
> <thomas at cbs.dtu.dk> wrote:
>> Peter, can you check in the corrected version of quick_FASTA_reader for me?
>> I added the changes which were suggested in earlier posts (changes not
>> affecting speed and simplicity)
>>
>> def quick_FASTA_reader(file):
>>     "simple and quick FASTA reader to be used on large FASTA files"
>>     from os import linesep
>>     txt = open(file).read()
>>     entries = []
>>     splitter = "%s>" % linesep
>>     for entry in txt.split(splitter):
>>         name,seq= entry.split(linesep,1)
>>         if name[0]=='>': name = name[1:]
>>         seq = seq.replace('\n','').replace(' ','').upper()
>>         entries.append((name, seq))
>>     return entries
>
> I'm pretty sure we shouldn't be using os.linesep in this way. I'd
> have to double check on a Windows box to confirm this, but I believe
> from memory that any CRLF in the file becomes just a \n in python.
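
The newline point can be checked without a Windows box. The following is a minimal sketch, assuming Python 3, where text-mode open() uses universal newlines by default; the helper name read_text_mode is made up for illustration and is not part of Bio.SeqUtils.

```python
import os
import tempfile


def read_text_mode(raw_bytes):
    """Write raw bytes to a temporary file and read them back in text mode.

    Hypothetical helper for illustration only.
    """
    fd, path = tempfile.mkstemp(suffix=".fasta")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(raw_bytes)
        # Default newline=None enables universal newlines: CRLF is read as "\n".
        with open(path) as handle:
            return handle.read()
    finally:
        os.remove(path)


# A FASTA fragment with Windows-style line endings.
text = read_text_mode(b">seq1\r\nACGT\r\n>seq2\r\nTTAA\r\n")
```

If text contains no "\r", then splitting on os.linesep + ">" (which is "\r\n>" on Windows) would not find any record boundaries there, whereas splitting on "\n>" would.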
>
> The basic idea is we want to split on "\n>" so that any additional ">"
> inside a name are ignored. This then means the first record in the
> file is a special case. You've also added an extra if statement in
> the loop - I assume to cope with the fact that using a split on "\n>"
> would leave a leading ">" on the first record's name -- but this would
> go wrong if the name itself started with a ">" too (i.e. a line
> starting with ">>..." which would be unusual).
>
> Perhaps instead, as a typical FASTA file starts immediately with ">",
> we can just prepend "\n" to the contents of the file and split that
> on "\n>". I've updated CVS
> based on this, and added a minimal test for quick_FASTA_reader (and
> GC) to test_SeqUtils.py as well.
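
The idea described above could look roughly like the following. This is a sketch only, not the code checked in as revision 1.17; the function name quick_fasta_reader_sketch is made up, and it assumes the file is small enough to read into memory in one go.

```python
def quick_fasta_reader_sketch(path):
    """Return a list of (name, sequence) tuples from a FASTA file.

    Illustrative sketch: prepend a newline so the first record is split
    off in the same way as all the others.
    """
    with open(path) as handle:
        txt = "\n" + handle.read()
    entries = []
    # Everything before the first "\n>" (normally an empty string) is skipped.
    for entry in txt.split("\n>")[1:]:
        # The first line is the name; the rest is the sequence.
        name, _, seq = entry.partition("\n")
        seq = seq.replace("\n", "").replace(" ", "").upper()
        entries.append((name, seq))
    return entries
```

With this approach a name that itself begins with ">" (a line starting with ">>...") keeps its leading ">", and any ">" inside a name is left alone, because only a ">" directly after a newline marks a new record.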
>
> Checking in Bio/SeqUtils/__init__.py;
> /home/repository/biopython/biopython/Bio/SeqUtils/__init__.py,v <--
> __init__.py
> new revision: 1.17; previous revision: 1.16
> done
> Checking in Tests/test_SeqUtils.py;
> /home/repository/biopython/biopython/Tests/test_SeqUtils.py,v <--
> test_SeqUtils.py
> new revision: 1.2; previous revision: 1.1
> done
> Checking in Tests/output/test_SeqUtils;
> /home/repository/biopython/biopython/Tests/output/test_SeqUtils,v <--
> test_SeqUtils
> new revision: 1.2; previous revision: 1.1
> done
>
> Could you have a look at Bio/SeqUtils/__init__.py revision 1.17 for
> review? It will be up on ViewCVS shortly...
> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SeqUtils/__init__.py?cvsroot=biopython
>
> Do you think I should remove the "OBSOLETE" tag in the docstring for
> the quick_FASTA_reader function?
>
>> Concerning the seq3 function, I am not sure where it came from, I don't
>> think I have added it.
>
> OK, thanks.
>
> Peter
--
Sicheritz-Ponten Thomas, Associate Professor, Ph.D (
Head of Metagenomics, Technical University of Denmark \
Center for Biological Sequence Analysis, BioCentrum )
CBS: +45 45 252422 Building 208, DK-2800 Lyngby ##----->
Fax: +45 45 931585 http://www.cbs.dtu.dk/~thomas )
/
... damn arrow eating trees ... (