[Biopython] parsing fasta based on header

Matthew MacManes macmanes at gmail.com
Tue Nov 1 19:19:24 UTC 2011


Hi All,

I have a large fasta file that I am trying to sort into multiple smaller
files based on their ID's. The File starts like this:

>1MUSgi|116063569|ref|NM_010065.2|
AGGGG-TGGTTGACCATCAACAACATCGGCATCATG-AAGGGAGGCTCCAAGGAGTACTGGTTTGTGCTGACTGCTGAG-
>2MUSgi|118130562|ref|NM_019880.3|
CGGCCCGCGGCTCAGCCGTCGGCGCGCAGGATGGACGGCG-A
>2MUSgi|118130562|ref|NM_019880.3|
AGTTTAGCCAGGCCCTGGCCATCCGGAGCTACACCAAGTTTGTGATGGGGATTGCAGTGAGCATGCTGACCTACCCCTTCCTGCTCGTTGGAGATCTCATGGCAGTGAACAACCCTGGAGTAACCT
>1HOMOgi|59853098|ref|NM_004408.2|
GCATCCGCAAGGGCTGGCTGACTATCAATAATATTGGCATCATGAAAGGGGGCTCCAAGGAGTACTGGTTTGTGCTGACTGCTGAG-
>1
GGTGATCCGCAGGGGCTGGCTGACCATCAACAACATTGGCATCATGAAAGGGGGCTCCAAGGAGTACTGGTTCGTGCTCACTGCCGAGTCACTGTCCTGGTACAAGGACGAAGAGGAGAAAGAGAG
>2
CGCGCCAGCACCGGCCCGCGGCGCAGCCCTCGGCCCGCAGGATGGACGGCGCGTCCGGGGGCCTGGGCTCTGGGGATAGTGCC

I want all of the ID's beginning with 1's to go on one file, ID's starting
with 2's in another.

I have been trying to use SeqIO

for record in SeqIO.parse(open("QHM-clean.fasta", "rU"), "fasta") :
for i in range(1,3):
 if record.id %i:  #this needs to be changed "if record.id *STARTS WITH* %i"
 print record.id
output_handle = open("%i.fasta", "w") #naming in this manner does not seem
to be allowed
 SeqIO.write(output_handle, "fasta")
output_handle.close()

But this seems to be not working in many obvious ways... Can anybody help
me out with some advice on how to proceed?

Thanks a lot, Matt



More information about the Biopython mailing list