[Biopython] SwissProt parser: get entire entry as string?

Chevreux, Bastien bastien.chevreux at dsm.com
Fri Aug 19 21:19:11 UTC 2016


From: Peter Cock [mailto:p.j.a.cock at googlemail.com]
> If you or anyone else reading is using Bio.SeqIO, this is just a wrapper round the (older and more file format specific/faithful) Bio.SwissProt parser to reformat the data into our more file format neutral SeqRecord:
>
> https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SwissIO.py#L67
>
> The underlying Bio.SwissProt parser is here:
>
> https://github.com/biopython/biopython/blob/master/Bio/SwissProt/__init__.py
>
> Here OC lines map to the record.organism_classification list, DR maps to the record.cross_references list, and CC maps to record.comments - so you may find checking the (repr of) those three attributes enough? Or does this restructure the data just enough to hinder your search?
>
> If you want to experiment, you can probably just copy the Bio/SwissProt/__init__.py to example.py in your working directory and replace the final section with your test:
>
> if __name__ == "__main__":
>     print("Bastien's search test")
>     # ....
>
> I hope that helps - but if you can give a more specific example that would be useful (e.g. data you are looking for and a specific accession to look in).


Here's the simplest current use case: create reduced SwissProt/TrEMBL files that contain only the entries from "Bacteria; Firmicutes" and "Archaea".

Ideally, this tiny script would look like this:

---
import sys
from Bio import SwissProt

# Read SwissProt/TrEMBL flat-file records from standard input.
for rec in SwissProt.parse(sys.stdin):
    oc = rec.organism_classification
    # Keep Firmicutes (within Bacteria) and all Archaea; slices are safe on short lists.
    if oc[:2] == ["Bacteria", "Firmicutes"] or oc[:1] == ["Archaea"]:
        # string_of_full_entry does not exist; it stands for the wished-for raw entry text.
        print(rec.string_of_full_entry)
---
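Until the parser offers something like that attribute, one possible workaround is to skip the parser for this step and cut the flat file into raw entries directly. The following is only a minimal sketch under stated assumptions: every entry ends with a line starting with "//", and the taxonomy sits on lines starting with "OC   ". The function names (iter_raw_entries, organism_classification, wanted, filter_entries) are made up for illustration and are not part of Biopython.

```python
def iter_raw_entries(handle):
    """Yield each flat-file entry as one string, including its '//' terminator."""
    lines = []
    for line in handle:
        lines.append(line)
        if line.startswith("//"):
            yield "".join(lines)
            lines = []


def organism_classification(raw_entry):
    """Return the taxa named on the OC lines of one raw entry, in order."""
    oc_text = " ".join(
        line[5:].strip()
        for line in raw_entry.splitlines()
        if line.startswith("OC   ")
    )
    # OC lines form a ';'-separated list that ends with a '.'.
    return [t.strip() for t in oc_text.rstrip(".").split(";") if t.strip()]


def wanted(taxa):
    """Keep Firmicutes (within Bacteria) and all Archaea."""
    return taxa[:2] == ["Bacteria", "Firmicutes"] or taxa[:1] == ["Archaea"]


def filter_entries(handle, out):
    """Copy every entry whose classification passes wanted() from handle to out."""
    for raw_entry in iter_raw_entries(handle):
        if wanted(organism_classification(raw_entry)):
            out.write(raw_entry)
```

Used as a filter, this would be called as filter_entries(sys.stdin, sys.stdout) after an import sys. If the parsed attributes are needed for more elaborate tests, each raw string could presumably also be wrapped in io.StringIO and handed to SwissProt.read, at the cost of parsing every entry.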

After that, a whole bunch of other, already existing scripts and programs take over to extract FASTA from these reduced .dat files (continuing into BLAST DBs etc.) and to build mapping tables for pathways, GOs, Pfam domains, etc.

Bastien

PS: One can imagine (and I would implement) other, arbitrarily more complicated filters, e.g., scanning the CC comments for specific "-!- SUBCELLULAR LOCATION:" entries, to create other subsets of SwissProt/TrEMBL .dat files.
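A filter of that kind could be sketched on the raw entry text roughly as follows. This is an illustration only: it assumes comment blocks start with "CC   -!- TOPIC:", that further CC lines continue the current block, and that a CC line of dashes opens the trailing copyright notice. The helper names comment_topics and has_subcellular_location are hypothetical.

```python
def comment_topics(raw_entry):
    """Return (topic, text) pairs collected from the CC lines of one raw entry."""
    topics = []
    for line in raw_entry.splitlines():
        if not line.startswith("CC   "):
            continue
        body = line[5:]
        if body.startswith("---"):
            # Assumed start of the copyright footer: stop collecting.
            break
        if body.startswith("-!- "):
            # A new comment block of the form "-!- TOPIC: text".
            topic, _, text = body[4:].partition(":")
            topics.append([topic.strip(), text.strip()])
        elif topics:
            # Continuation line of the current comment block.
            topics[-1][1] += " " + body.strip()
    return [(topic, text) for topic, text in topics]


def has_subcellular_location(raw_entry, keyword):
    """True if a SUBCELLULAR LOCATION comment mentions keyword (case-insensitive)."""
    return any(
        topic == "SUBCELLULAR LOCATION" and keyword.lower() in text.lower()
        for topic, text in comment_topics(raw_entry)
    )
```

A test such as has_subcellular_location(raw_entry, "membrane") could then decide whether the complete entry string is written to the reduced .dat file.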

--
DSM Nutritional Products Microbia Inc | Bioinformatics
60 Westview Street | Lexington, MA 02421 | United States
Phone +1 781 259 7613 | Fax +1 781 259 0615




