[Biopython-dev] Cleaning up Bio.SeqUtils
Peter
biopython at maubp.freeserve.co.uk
Fri Sep 26 09:38:57 UTC 2008
On Thu, Sep 25, 2008 at 11:47 PM, Thomas Sicheritz-Ponten
<thomas at cbs.dtu.dk> wrote:
> Peter, can you check in the corrected version of quick_FASTA_reader for me?
> I added the changes which were suggested in earlier posts (changes not
> affecting speed and simplicity)
>
> def quick_FASTA_reader(file):
>     "simple and quick FASTA reader to be used on large FASTA files"
>     from os import linesep
>     txt = open(file).read()
>     entries = []
>     splitter = "%s>" % linesep
>     for entry in txt.split(splitter):
>         name,seq= entry.split(linesep,1)
>         if name[0]=='>': name = name[1:]
>         seq = seq.replace('\n','').replace(' ','').upper()
>         entries.append((name, seq))
>     return entries
I'm pretty sure we shouldn't be using os.linesep in this way. I'd
have to double check on a Windows box to confirm this, but I believe
from memory that any CRLF in the file becomes just a \n in Python.
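A quick way to check the newline point without a Windows box is to write
CRLF bytes by hand and read them back. This is only a sketch, and it
assumes Python 3, where text mode with the default newline=None
translates "\r\n" to "\n" on every platform; older interpreters may
behave differently depending on platform and open mode. The sample
records are made up.

```python
import os
import tempfile

# Write a FASTA-like file with Windows (CRLF) line endings, byte for byte.
raw = b">seq1 demo\r\nACGT\r\nTTGA\r\n>seq2\r\nGGCC\r\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".fasta") as handle:
    handle.write(raw)
    path = handle.name

try:
    # Text mode (default newline=None): "\r\n" is translated to "\n".
    with open(path) as handle:
        text = handle.read()
    # Binary mode: the bytes come back untouched.
    with open(path, "rb") as handle:
        data = handle.read()
finally:
    os.remove(path)

has_cr_in_text = "\r" in text

# Splitting on "\n>" therefore works, while "\r\n>" would find nothing.
record_count = len(("\n" + text).split("\n>")) - 1
crlf_pieces = len(text.split("\r\n>"))
```

If that holds, a splitter built from os.linesep ("\r\n>" on Windows)
would never match the text that read() returns there.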
The basic idea is that we want to split on "\n>" so that any additional
">" inside a name is ignored. This then means the first record in the
file is a special case. You've also added an extra if statement in
the loop - I assume to cope with the fact that splitting on "\n>"
leaves a leading ">" on the first record's name -- but this would
go wrong if the name itself started with a ">" too (i.e. a line
starting with ">>...", which would be unusual).
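To make the ">>" corner case concrete, here is a small sketch of the
"split, then strip a leading '>' from the name" logic applied to a
string rather than a file. The helper name split_with_strip and the
sample records are made up for illustration; "\n" is used instead of
os.linesep so the snippet behaves the same everywhere.

```python
def split_with_strip(text):
    """Split on newline-'>' and strip one leading '>' from every name.

    Hypothetical helper that mirrors the extra if statement in the loop.
    """
    entries = []
    for entry in text.split("\n>"):
        name, seq = entry.split("\n", 1)
        if name[0] == ">":
            name = name[1:]
        entries.append((name, seq.replace("\n", "")))
    return entries


# The second title line is ">>odd name", so its real name is ">odd name".
text = ">first\nACGT\n>>odd name\nGGCC\n"
result = split_with_strip(text)
```

The first record comes out right, but the second one loses the ">"
that belongs to its name, because the if statement cannot tell the two
cases apart.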
Perhaps instead, as a typical FASTA file starts immediately with ">",
we can just prepend "\n" to the contents of the file and then split
on "\n>". I've updated CVS based on this, and added a minimal test
for quick_FASTA_reader (and GC) to test_SeqUtils.py as well.
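For reference, the idea can be sketched roughly as follows. This is not
the code checked in as revision 1.17, just an illustration; the name
quick_fasta_sketch is made up, and it takes the file contents as a
string (already using "\n" line endings) rather than a filename. It
assumes the text starts with ">" as described above.

```python
def quick_fasta_sketch(text):
    """Return a list of (name, sequence) tuples from FASTA-formatted text.

    Illustrative sketch only: assumes the text starts with '>' and uses
    plain newline characters as line endings.
    """
    entries = []
    # Prepending a newline makes the first record look like all the
    # others, so every record starts right after a newline-'>' pair.
    # The piece before the first marker is empty and is skipped.
    for entry in ("\n" + text).split("\n>")[1:]:
        # partition() copes with a title line that has no sequence after it.
        name, _, seq = entry.partition("\n")
        seq = seq.replace("\n", "").replace(" ", "").upper()
        entries.append((name, seq))
    return entries


sample = ">first desc\nacgt\nTT GA\n>>odd\nggcc\n"
records = quick_fasta_sketch(sample)
```

No if statement is needed in the loop, and a name that itself starts
with ">" is kept as it is.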
Checking in Bio/SeqUtils/__init__.py;
/home/repository/biopython/biopython/Bio/SeqUtils/__init__.py,v <--
__init__.py
new revision: 1.17; previous revision: 1.16
done
Checking in Tests/test_SeqUtils.py;
/home/repository/biopython/biopython/Tests/test_SeqUtils.py,v <--
test_SeqUtils.py
new revision: 1.2; previous revision: 1.1
done
Checking in Tests/output/test_SeqUtils;
/home/repository/biopython/biopython/Tests/output/test_SeqUtils,v <--
test_SeqUtils
new revision: 1.2; previous revision: 1.1
done
Could you have a look at Bio/SeqUtils/__init__.py revision 1.17 for
review? It will be up on ViewCVS shortly...
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SeqUtils/__init__.py?cvsroot=biopython
Do you think I should remove the "OBSOLETE" tag in the docstring for
the quick_FASTA_reader function?
> Concerning the seq3 function, I am not sure where it came from, I don't
> think I have added it.
OK, thanks.
Peter