[Biopython] Local alignment using a single fasta file with multiple paired end reads

Thu Sep 17 11:08:18 UTC 2015

Hi Damian,

This sounds very like read merging down with paired end Illumina FASTQ
files, although here you are presumably using "Sanger" capillary
sequencing? If so the ABI files can be turned into FASTQ files with
quality scores rather than just FASTA files (e.g. with Biopython's
SeqIO). You would probably have to rename your reads, e.g.
"identifier/1 (space) optional text" and "identifier/2 (space)
optional text" but I'm not sure how well pair-merging tools would cope
with these longer reads.

Peter

Peter

On Wed, Sep 16, 2015 at 10:25 PM, Damian Menning <dmenning at mail.usf.edu> wrote:
> Hello All,
>
>
>   I have a fasta dataset in a single file with multiple paired end reads in
> paired sets of forward and reverse sequences (the reverse sequence is in the
> correct orientation).  I am pretty sure this is the real world example
> requested in 6.1.3 of the Biopython Cookbook J.  Within this dataset all of
> the information is the same i.e. ID:, Name:, Number of features:. The only
> exceptions are the descriptions and sequences.  Ex.
>
>
>>UAR Kaktovik 11-004 F L15774b(M13F)
>
> GTAGTATAGCAATTACCTTGGTCTTGTAAGCCAAAAACGGAGAATACCTACTCTCCCTAA
>
> GACTCAAGGAAGAAGCAACAGCTCCACTACCAGCACCCAAAGCTAATGTTCTATTTAAAC
>
> TATTCCCTGGTACATACTACTATTTTACCCCATGTCCTATTCATTTCATATATACCATCT
>
> TATGTGCTGTGCCATCGCAGTATGTCCTCGAATACCTTTCCCCCCCTATGTATATCGTGC
>
> ATTAATGGTGTGCCCCATGCATATAAGCATGTACATATTACGCTTGGTCTTACATAAGGA
>
> CTTACGTTCCGAAAGCTTATTTCAGGTGTATGGTCTGTGAGCATGTATTTCACTTAGTCC
>
> GAGAGCTTAATCACCGGGCCTCGAGAAACCAGCAACCCTTGCGAGTACGTGTACCTCTTC
>
> TCGCTCCGGGCCCATGGGGTGTGGGGGTTTCTATGTTGAAACTATACCTGGCATCTG
>
>
>
>>UAR Kaktovik 11-004 R CSBCH(M13R)
>
> TCCCTTCATTATTATCGGACAACTAGCCTCCATTCTCTACTTTACAATCCTCCTAGTACT
>
> TATACCTATCGCTGGAATTATTGAAAACAGCCTCTTAAAGTGGAGAGTCTTTGTAGTATA
>
> GCAATTACCTTGGTCTTGTAAGCCAAAAACGGAGAATACCTACTCTCCCTAAGACTCAAG
>
> GAAGAAGCAACAGCTCCACTACCAGCACCCAAAGCTAATGTTCTATTTAAACTATTCCCT
>
> GGTACATACTACTATTTTACCCCATGTCCTATTCATTTCATATATACCATCTTATGTGCT
>
> GTGCCATCGCAGTATGTCCTCGAATACCTTTCCCCCCCTATGTATATCGTGCATTAATGG
>
> TGTGCCCCATGCATATAAGCATGTACATATTACGCTTGGTCTTACATAAGGACTTACGTT
>
> CCGAAAGCTTATTTCAGGTGTATGGTCTGTGAGCATGTATTTCACTTAGTCCGAGAGCTT
>
> AATCACCGGGCCTCGAGAAACCAGCAACCCTTGCGAGTACGTGTACCTCTTCTCGCTCCG
>
> GGCCCATGGGGTGTGGGGGTTTCTATGTTGAAACTATACCTG
>
>
>
> My end goal is to align the paired ends of the sequences that have the same
> description and save the aligned sequence to another file for further
> analyses.  I have a few problems:
>
>
>
> 1) The descriptions of each sequence are not identical so I need to delete
> all but the first three parts and include the associated sequence. I.e.
> remove F L15774b(M13F) and  R CSBCH(M13R) above. The script below is what I
> have to make a new dictionary in this format.  Is this the best way to
> proceed in order to align the sequences in the next step?
>
>
>
> handle = open("pairedend2.txt", 'r')
>
>
> output_handle = open("AlignDict.txt", "a")
>
>
> desc2=dict()
>
> from Bio import SeqIO
>
> for seq_record in SeqIO.parse(handle, "fasta"):
>
>     parts = seq_record.description.split(" ")
>
>     des = [str(parts[0] + ' ' + parts[1] + ' ' + parts[2] + ':' +
> seq_record.seq)]
>
>     desc2=(dict(v.split(':') for v in des))
>
>     print ('\n' + str(desc2))
>
>     output_handle.write(str(desc2) + '\n')
>
>
>
> output_handle.close()
>
>
>
> 2) My second issue is figuring out how to do the alignment.  I thought I
> would do a local alignment using something like needle (or is there a better
> way?) but the script examples I have seen so far use two files with a single
> sequence in each and I have one file with multiple sequences.  There is no
> easy way to separate these out into individual sequences into different
> files as the data sets are quite large.
>
>
>
> Any help/ideas would be greatly appreciated.
>
>
>
> Thank you
>
>
>   Damian
>
>
> --
> Damian Menning, Ph.D.
>
> _______________________________________________
> Biopython mailing list  -  Biopython at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/biopython