[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc

Tue Aug 1 10:42:37 UTC 2006

On Mon, 2006-07-31 at 12:08 -0400, Marc Colosimo wrote: 
> On Jul 31, 2006, at 11:14 AM, Peter (BioPython Dev) wrote:
> >>> The SeqUtils/quick_FASTA_reader is interesting in that it loads the
> >>> entire file into memory in one go, and then parses it.  On the other
> >>> hand it's not perfect: I would use "\n>" as the split marker  
> >>> rather than
> >>> ">" which could appear in the description of a sequence.
> >>
> >> I agree (not that it's bitten me, yet), but I'd be inclined to go  
> >> with
> >> "%s>" % os.linesep as the split marker, just in case.
> >
> > Good point.  I wonder how many people even know this function exists?
> >
> 
> The only problem with this arises if someone sends you a file not  
> created on your system. [...]  
> This has mostly simplified down to two - Unix and Windows - unless the  
> person uses a Mac GUI app, some of which use \r (CR) instead of \n  
> (LF), whereas Windows uses \r\n (CRLF). I think the standard Python  
> distro comes with crlf.py and lfcr.py that can convert the line endings.

Also a good point.  I had a play about with regular expression
splitting/substitution and the SeqUtils.quick_FASTA_reader method to see
if I could capture this variability in line-endings:

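import re

def method_quick_FASTA_reader3(filename):
    # Read the whole file in one go (text mode, so Python 3 maps \r\n and \r to \n).
    with open(filename) as handle:
        txt = handle.read()
    entries = []
    # '^>' with re.M only matches a '>' at the start of a line.
    split_marker = re.compile(r'^>', re.M)
    for entry in split_marker.split(txt)[1:]:
        # The first line is the description; the rest is sequence data.
        name, seq = re.split(r'[\r\n]', entry, maxsplit=1)
        # Strip all whitespace, including line breaks, from the sequence.
        seq = re.sub(r'\s', '', seq).upper()
        entries.append((name, seq))
    return "SeqUtils/quick_FASTA_reader (import re)", len(entries)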

Using regular expressions in this way seems to slow things down to about
the same speed as the SeqIO parser, with the disadvantage of still
having to process the entries into SeqRecord objects (if that's what you
want to do with them).  quick_FASTA_reader is a bit of a misnomer in
this case, I guess ;)
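
As a small illustration of the split-marker discussion above (not part of
the benchmark): a bare ">" split breaks on descriptions that contain ">",
and Python's "^" under re.M only matches at the start of the string and
after "\n", so a CR-only file needs its line endings normalised first.
This is only a sketch; the records are made up and the helper names
(build_fasta, split_records) are hypothetical.

```python
import re

# Made-up sample data: the second description contains a '>' character.
RECORDS = [
    ("seq1 first record", "ACGT", "TTGA"),
    ("seq2 A>G variant", "GGCC", "AATT"),
]


def build_fasta(line_ending):
    """Join the sample records into FASTA text using the given line ending."""
    lines = []
    for name, first, second in RECORDS:
        lines.extend([">" + name, first, second])
    return line_ending.join(lines) + line_ending


def split_records(text):
    """Split FASTA text into (name, sequence) tuples, whatever the line ending."""
    # Normalise CRLF and bare CR to LF so that '^' (re.M) sees every line start.
    text = re.sub(r"\r\n?", "\n", text)
    entries = []
    for entry in re.split(r"^>", text, flags=re.M)[1:]:
        name, seq = entry.split("\n", 1)
        entries.append((name, re.sub(r"\s", "", seq).upper()))
    return entries


for ending in ("\n", "\r\n", "\r"):
    text = build_fasta(ending)
    naive = text.split(">")[1:]  # also splits inside "A>G"
    print(repr(ending), len(naive), len(split_records(text)))
```

For each of the three line endings the naive split yields three pieces for
two records, while the anchored split yields two.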

4.15s SeqIO.FASTA.FastaReader (for record in iterator)
3.95s SeqIO.FASTA.FastaReader (iterator.next)
4.13s SeqIO.FASTA.FastaReader (iterator[i])
1.89s SeqUtils/quick_FASTA_reader
1.03s pyfastaseqlexer/next_record
0.52s pyfastaseqlexer/quick_FASTA_reader
4.44s SeqUtils/quick_FASTA_reader (import re)

Results are typical for the 72000 record set, and this doesn't look to
be a promising route.
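
The timing script itself is not included in this message.  A minimal
harness for this kind of comparison might look like the sketch below,
assuming each method takes a filename and returns a (label, count) pair
as method_quick_FASTA_reader3 does.  The names method_count_headers,
time_method and run_demo are hypothetical, and the throwaway file is far
smaller than the 72000 record set.

```python
import os
import tempfile
import time


def method_count_headers(filename):
    """Hypothetical stand-in reader: count header lines, return (label, count)."""
    with open(filename) as handle:
        count = sum(1 for line in handle if line.startswith(">"))
    return "count header lines", count


def time_method(method, filename, repeats=3):
    """Run one reader several times; return (label, count, best_seconds)."""
    best = None
    for _ in range(repeats):
        start = time.perf_counter()
        label, count = method(filename)
        elapsed = time.perf_counter() - start
        if best is None or elapsed < best:
            best = elapsed
    return label, count, best


def run_demo(n_records=1000):
    """Time the stand-in reader on a small temporary FASTA file."""
    with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
        for i in range(n_records):
            tmp.write(">record%d\nACGTACGT\n" % i)
    try:
        return time_method(method_count_headers, tmp.name)
    finally:
        os.remove(tmp.name)


label, count, seconds = run_demo()
print("%.2fs %s (%d records)" % (seconds, label, count))
```

Taking the best of a few repeats reduces the effect of disk caching and
other noise on a single run.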

L.

-- 
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc at scri.sari.ac.uk   W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B  http://www.keyserver.net             