[Biopython] Local alignment using a single fasta file with multiple paired end reads
Damian Menning
dmenning at mail.usf.edu
Wed Sep 16 21:25:22 UTC 2015
Hello All,
I have a fasta dataset in a single file with multiple paired end reads in
paired sets of forward and reverse sequences (the reverse sequence is in
the correct orientation). I am pretty sure this is the real-world example
requested in 6.1.3 of the Biopython Cookbook :). Within this dataset all of
the record information is the same, i.e. ID, Name, and Number of features.
The only exceptions are the descriptions and sequences. For example:
>UAR Kaktovik 11-004 F L15774b(M13F)
GTAGTATAGCAATTACCTTGGTCTTGTAAGCCAAAAACGGAGAATACCTACTCTCCCTAA
GACTCAAGGAAGAAGCAACAGCTCCACTACCAGCACCCAAAGCTAATGTTCTATTTAAAC
TATTCCCTGGTACATACTACTATTTTACCCCATGTCCTATTCATTTCATATATACCATCT
TATGTGCTGTGCCATCGCAGTATGTCCTCGAATACCTTTCCCCCCCTATGTATATCGTGC
ATTAATGGTGTGCCCCATGCATATAAGCATGTACATATTACGCTTGGTCTTACATAAGGA
CTTACGTTCCGAAAGCTTATTTCAGGTGTATGGTCTGTGAGCATGTATTTCACTTAGTCC
GAGAGCTTAATCACCGGGCCTCGAGAAACCAGCAACCCTTGCGAGTACGTGTACCTCTTC
TCGCTCCGGGCCCATGGGGTGTGGGGGTTTCTATGTTGAAACTATACCTGGCATCTG
>UAR Kaktovik 11-004 R CSBCH(M13R)
TCCCTTCATTATTATCGGACAACTAGCCTCCATTCTCTACTTTACAATCCTCCTAGTACT
TATACCTATCGCTGGAATTATTGAAAACAGCCTCTTAAAGTGGAGAGTCTTTGTAGTATA
GCAATTACCTTGGTCTTGTAAGCCAAAAACGGAGAATACCTACTCTCCCTAAGACTCAAG
GAAGAAGCAACAGCTCCACTACCAGCACCCAAAGCTAATGTTCTATTTAAACTATTCCCT
GGTACATACTACTATTTTACCCCATGTCCTATTCATTTCATATATACCATCTTATGTGCT
GTGCCATCGCAGTATGTCCTCGAATACCTTTCCCCCCCTATGTATATCGTGCATTAATGG
TGTGCCCCATGCATATAAGCATGTACATATTACGCTTGGTCTTACATAAGGACTTACGTT
CCGAAAGCTTATTTCAGGTGTATGGTCTGTGAGCATGTATTTCACTTAGTCCGAGAGCTT
AATCACCGGGCCTCGAGAAACCAGCAACCCTTGCGAGTACGTGTACCTCTTCTCGCTCCG
GGCCCATGGGGTGTGGGGGTTTCTATGTTGAAACTATACCTG
My end goal is to align the paired ends of the sequences that have the same
description and save the aligned sequence to another file for further
analyses. I have a few problems:
1) The descriptions of the two reads in a pair are not identical, so I need
to keep only the first three parts of each description along with the
associated sequence, i.e. remove F L15774b(M13F) and R CSBCH(M13R) above.
The script below is what I have so far to build a dictionary in this format.
Is this the best way to proceed in order to align the sequences in the next
step?
from Bio import SeqIO

# Append one {description: sequence} dictionary per record to the output file.
with open("pairedend2.txt") as handle, open("AlignDict.txt", "a") as output_handle:
    for seq_record in SeqIO.parse(handle, "fasta"):
        # Keep only the first three parts of the description,
        # e.g. "UAR Kaktovik 11-004".
        parts = seq_record.description.split(" ")
        key = " ".join(parts[:3])
        desc2 = {key: str(seq_record.seq)}
        print("\n" + str(desc2))
        output_handle.write(str(desc2) + "\n")
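One possible alternative, offered only as a sketch: since the forward and reverse reads share the first three parts of their description, they can be collected under that shared key in a single dictionary, so that each key ends up holding both reads of a pair. The helper names pair_key, group_reads and load_groups are hypothetical and not part of Biopython; SimpleFastaParser is Biopython's lightweight parser that yields (title, sequence) tuples. The sketch assumes the shared part is always exactly the first three whitespace-separated words.

```python
from collections import defaultdict


def pair_key(description, n_words=3):
    """Return the first n_words of a FASTA description, e.g. 'UAR Kaktovik 11-004'."""
    return " ".join(description.split()[:n_words])


def group_reads(records, n_words=3):
    """Group (description, sequence) tuples by their shortened description.

    Returns {key: [(description, sequence), ...]}, keeping file order.
    """
    groups = defaultdict(list)
    for description, sequence in records:
        groups[pair_key(description, n_words)].append((description, sequence))
    return dict(groups)


def load_groups(path, n_words=3):
    """Read a FASTA file with Biopython and group its records by shared key."""
    # Imported here so the grouping helpers above can be used without Biopython.
    from Bio.SeqIO.FastaIO import SimpleFastaParser

    with open(path) as handle:
        return group_reads(SimpleFastaParser(handle), n_words)
```

Calling load_groups("pairedend2.txt") would then return one entry per sample, and any key whose list does not contain exactly two reads points at a missing or duplicated read.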
2) My second issue is figuring out how to do the alignment. I thought I
would do a pairwise alignment using something like EMBOSS needle (global)
or water (local), or is there a better way? The script examples I have seen
so far use two files with a single sequence in each, whereas I have one file
with multiple sequences. There is no easy way to split these out into one
file per sequence, as the data sets are quite large.
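As a hedged sketch of one way to avoid per-sequence files: the pairs can be aligned in memory with Bio.Align.PairwiseAligner, a class added to Biopython in releases later than this message. The code assumes the reads are already grouped into a dictionary mapping the shared description to a list of (description, sequence) tuples, forward read first, and that the reverse read needs no reverse-complementing, as stated above. The function names and the scoring values are illustrative assumptions, not recommended settings.

```python
def make_local_aligner():
    """Build a local (Smith-Waterman style) aligner; the scores are illustrative."""
    # Imported here so iter_pairs below can be used without Biopython.
    from Bio.Align import PairwiseAligner

    aligner = PairwiseAligner()
    aligner.mode = "local"
    aligner.match_score = 2
    aligner.mismatch_score = -1
    aligner.open_gap_score = -5
    aligner.extend_gap_score = -1
    return aligner


def iter_pairs(groups):
    """Yield (key, forward_seq, reverse_seq) for each group with exactly two reads."""
    for key, reads in groups.items():
        if len(reads) == 2:
            (_, forward), (_, reverse) = reads
            yield key, forward, reverse


def align_all(groups, out_path):
    """Write the best local alignment of every read pair to out_path."""
    aligner = make_local_aligner()
    with open(out_path, "w") as out:
        for key, forward, reverse in iter_pairs(groups):
            # Take the first reported alignment, or skip if none was found.
            best = next(iter(aligner.align(forward, reverse)), None)
            if best is None:
                continue
            out.write(f">{key} score={best.score}\n{best}\n")
```

Groups that do not hold exactly two reads are silently skipped here; logging them instead may be preferable for large data sets. If an external tool is preferred, EMBOSS water is the local counterpart of needle.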
Any help/ideas would be greatly appreciated.
Thank you
Damian
--
Damian Menning, Ph.D.