[Biopython-dev] [Bug 2382] New: Generic FASTA parser

Tue Oct 16 15:57:35 EDT 2007

http://bugzilla.open-bio.org/show_bug.cgi?id=2382

           Summary: Generic FASTA parser
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: jflatow at northwestern.edu

I would like to be able read in and iterate over records in generic fasta files
of the format:
>header
data
>header
data
...

This iterator should return Bio.Fasta.Record objects with the corresponding
header and data fields.

I suggest putting this inside the existing Bio.Fasta module and updating
Bio.SeqIO.Fasta to use this iterator and transform the records returned into
Bio.SeqRecord objects.

This should make it easier to add metadata to SeqRecord objects parsed in from
FASTA. Consider the following example for illustration. I have data from a
genome sequencing machine that outputs pairs of files. One contains the
sequence reads which look like this, the other contains estimates of the
quality of each base call in the sequence.

The sequence file might look something like this (only with hundreds of
thousands more entries):

>ERSGEES02IKV6B length=97 xy=3401_1361 region=2 run=R_runname
CAATATAATTTCTCTTAAAATTATTCCCATGGCCAGGTGTGGTGGCTCACACCTGTAGTC
CCGGCACTTTGGGAGGCCAAGGCACACAGGGGATAGG
>ERSGEES02GGZDB length=142 xy=2536_2685 region=2 run= R_runname
GGTCTCCAGTGCCCTGTCTCCCCATATTTCTGACACACCTTCTCACAGCCTGGCCCATCT
TGCTGGGTCCCTCTTCTCCTCCCTTCCTGCTCCATTTGTCAACACTGCTGGGACATTAGA
ATTCAGATCTCCCGGGTCACCG
>ERSGEES02JQUCP length=113 xy=3879_0663 region=2 run= R_runname
AAAGTGACTAAAGAATCAATTTACATTAATATTCTATGTGAACAGGCAAAATACTTACAA
AGAAGTAGAGAAAATATGAATTCAGTACAGAATTCAGATCTCCCGGGTCACCG

The corresponding quality score file might look something like this:

>ERSGEES02IKV6B length=97 xy=3401_1361 region=2 run= R_runname
27 28 21 27 27 27 28 22 28 25 3 27 27 27 28 21 33 31 20 6 28 21 26 26 18 28 25
2 26 25 29 23 31 24 27 29 22 27 27 27 29 23 27 31 25 27 27 27 27 27 27 32 26 27
27 27 27 26 27 33
30 12 32 26 27 27 27 33 30 12 33 30 12 26 31 25 33 27 32 28 33 28 27 27 27 27
27 26 33 32 20 7 27 27 27 32 26
>ERSGEES02GGZDB length=142 xy=2536_2685 region=2 run= R_runname
28 9 26 24 27 27 20 26 18 25 27 32 29 10 26 26 27 18 25 32 30 17 1 25 27 22 32
30 12 27 27 22 26 25 27 23 25 28 21 32 27 27 27 25 26 27 26 25 27 20 26 26 19
28 25 3 25 27 22 27
19 24 24 24 32 29 11 24 34 31 17 23 23 30 23 27 25 30 23 27 33 31 17 27 20 28
21 27 25 26 26 30 24 27 33 31 13 26 27 27 31 25 27 25 23 26 16 26 27 30 27 7 27
27 27 32 27 26 26 32
27 30 26 27 27 27 27 27 27 27 30 27 6 34 31 17 27 21 27 32 28 18
>ERSGEES02JQUCP length=113 xy=3879_0663 region=2 run= R_runname
29 26 5 25 27 24 27 27 27 30 27 7 26 27 19 25 26 31 26 34 32 16 20 27 26 32 27
32 28 27 25 26 18 27 25 27 26 26 24 27 31 25 27 27 31 26 26 34 32 23 11 26 22
27 32 26 27 26 32 30
11 26 31 24 27 27 25 23 27 27 33 30 19 4 17 26 25 26 31 27 30 26 27 26 22 26 18
24 27 26 32 26 32 28 27 27 25 27 25 24 25 31 28 10 34 31 15 27 21 27 28 21 27

I would like to be able to do the following:

# create a function to parse the header line and return a dictionary
def parse_gsflex_header(gs_header):
        parts = gs_record.description.split(' ')
        assert len(parts) == 5
        xy = parts[2].split('=')[1].split('_')
        return {'letters': gs_record.seq.tostring(),
                'name': parts[0],
                'length': parts[1].split('=')[1],
                'xpos': xy[0],
                'ypos': xy[1],
                'region': parts[3].split('=')[1],
                'run': parts[4].split('=')[1]}

# Bio.SeqIO.FastaIO wraps the Bio.Fasta parser, might look something like this
class Fasta(): # or however its organized
   def data_toseq(data):
      # do some parsing of the data
      return Seq(...)

   def parse(file, header_todict):
     return (SeqRecord(seq=data_toseq(record.data),
**header_todict(record.header)) for record in Bio.Fasta.parse(file))

# I create an initial SeqRecord dictionary using the Bio.SeqIO.FastaIO parser
seq_dict = SeqIO.to_dict(SeqIO.FastaIO.parse(seq_file,  parse_gsflex_header))

# Then I iterate over the sequences in the qual file and look them up in the
seq_dict
# setting the quality scores as I find them them
for record in Bio.Fasta.parse(qual_file):
        seq_dict[my_header_todict(record.header)['id']].quality = 
my_qualitycleanup(record.data)

This would work well for parsing all kinds of FASTA-like files and provides a
simple mechanism for dealing with them record by record.

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.