[Biopython] parsing fasta based on header
Matthew MacManes
macmanes at gmail.com
Tue Nov 1 19:19:24 UTC 2011
Hi All,
I have a large FASTA file that I am trying to split into multiple smaller
files based on the record IDs. The file starts like this:
>1MUSgi|116063569|ref|NM_010065.2|
AGGGG-TGGTTGACCATCAACAACATCGGCATCATG-AAGGGAGGCTCCAAGGAGTACTGGTTTGTGCTGACTGCTGAG-
>2MUSgi|118130562|ref|NM_019880.3|
CGGCCCGCGGCTCAGCCGTCGGCGCGCAGGATGGACGGCG-A
>2MUSgi|118130562|ref|NM_019880.3|
AGTTTAGCCAGGCCCTGGCCATCCGGAGCTACACCAAGTTTGTGATGGGGATTGCAGTGAGCATGCTGACCTACCCCTTCCTGCTCGTTGGAGATCTCATGGCAGTGAACAACCCTGGAGTAACCT
>1HOMOgi|59853098|ref|NM_004408.2|
GCATCCGCAAGGGCTGGCTGACTATCAATAATATTGGCATCATGAAAGGGGGCTCCAAGGAGTACTGGTTTGTGCTGACTGCTGAG-
>1
GGTGATCCGCAGGGGCTGGCTGACCATCAACAACATTGGCATCATGAAAGGGGGCTCCAAGGAGTACTGGTTCGTGCTCACTGCCGAGTCACTGTCCTGGTACAAGGACGAAGAGGAGAAAGAGAG
>2
CGCGCCAGCACCGGCCCGCGGCGCAGCCCTCGGCCCGCAGGATGGACGGCGCGTCCGGGGGCCTGGGCTCTGGGGATAGTGCC
I want all of the records whose IDs begin with 1 to go into one file, and
those whose IDs begin with 2 into another.
I have been trying to use SeqIO:

    for record in SeqIO.parse(open("QHM-clean.fasta", "rU"), "fasta"):
        for i in range(1, 3):
            if record.id % i:  # this needs to be changed to "if record.id *STARTS WITH* %i"
                print record.id
                output_handle = open("%i.fasta", "w")  # naming in this manner does not seem to be allowed
                SeqIO.write(output_handle, "fasta")
                output_handle.close()
But this is clearly not working, in several obvious ways... Can anybody help
me out with some advice on how to proceed?
Thanks a lot, Matt
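
For what it is worth, here is a minimal sketch of one way the split could be done, not a definitive solution. It assumes a reasonably recent Biopython in which SeqIO.parse accepts a filename, and it relies on record.id being a plain string, so str.startswith can do the "starts with" test. The helper names pick_prefix and split_fasta are made up for illustration. Note that "%i.fasta" only becomes "1.fasta" once the % operator is actually applied to a value, as in "%i.fasta" % i.

```python
def pick_prefix(record_id, prefixes=("1", "2")):
    """Return the first prefix that record_id starts with, or None."""
    for prefix in prefixes:
        if record_id.startswith(prefix):
            return prefix
    return None


def split_fasta(in_path, prefixes=("1", "2")):
    """Stream records from in_path into one '<prefix>.fasta' file per prefix."""
    # Imported here so that pick_prefix can be used without Biopython installed.
    from Bio import SeqIO

    # One output handle per prefix, opened once and reused for every record.
    handles = {p: open("%s.fasta" % p, "w") for p in prefixes}
    try:
        for record in SeqIO.parse(in_path, "fasta"):
            prefix = pick_prefix(record.id, prefixes)
            if prefix is not None:
                # SeqIO.write takes the record(s) first, then the handle, then the format.
                SeqIO.write(record, handles[prefix], "fasta")
    finally:
        for handle in handles.values():
            handle.close()
```

Calling split_fasta("QHM-clean.fasta") would then write 1.fasta and 2.fasta in the current directory. Records are written as they are read, so the whole input never has to sit in memory. One caveat: a plain prefix test would also send an ID such as "10..." to the "1" file, so the test would need tightening if IDs like that can occur.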