[Biopython] Local alignment using a single fasta file with multiple paired end reads

Thu Sep 17 12:01:08 UTC 2015

In case it is needed, merging paired reads in FASTQ format can be done
with a tool called FLASH, "Fast Length Adjustment of SHort reads".

I use it routinely for merging pairs of 2x300 bp from Illumina's technology.

I hope this helps.

Ivan

Ivan Gregoretti, PhD
Bioinformatics

On Thu, Sep 17, 2015 at 7:08 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> Hi Damian,
>
> This sounds very like read merging down with paired end Illumina FASTQ
> files, although here you are presumably using "Sanger" capillary
> sequencing? If so the ABI files can be turned into FASTQ files with
> quality scores rather than just FASTA files (e.g. with Biopython's
> SeqIO). You would probably have to rename your reads, e.g.
> "identifier/1 (space) optional text" and "identifier/2 (space)
> optional text" but I'm not sure how well pair-merging tools would cope
> with these longer reads.
>
> Peter
>
>
>
> Peter
>
>
> On Wed, Sep 16, 2015 at 10:25 PM, Damian Menning <dmenning at mail.usf.edu> wrote:
>> Hello All,
>>
>>
>>   I have a fasta dataset in a single file with multiple paired end reads in
>> paired sets of forward and reverse sequences (the reverse sequence is in the
>> correct orientation).  I am pretty sure this is the real world example
>> requested in 6.1.3 of the Biopython Cookbook J.  Within this dataset all of
>> the information is the same i.e. ID:, Name:, Number of features:. The only
>> exceptions are the descriptions and sequences.  Ex.
>>
>>
>>>UAR Kaktovik 11-004 F L15774b(M13F)
>>
>> GTAGTATAGCAATTACCTTGGTCTTGTAAGCCAAAAACGGAGAATACCTACTCTCCCTAA
>>
>> GACTCAAGGAAGAAGCAACAGCTCCACTACCAGCACCCAAAGCTAATGTTCTATTTAAAC
>>
>> TATTCCCTGGTACATACTACTATTTTACCCCATGTCCTATTCATTTCATATATACCATCT
>>
>> TATGTGCTGTGCCATCGCAGTATGTCCTCGAATACCTTTCCCCCCCTATGTATATCGTGC
>>
>> ATTAATGGTGTGCCCCATGCATATAAGCATGTACATATTACGCTTGGTCTTACATAAGGA
>>
>> CTTACGTTCCGAAAGCTTATTTCAGGTGTATGGTCTGTGAGCATGTATTTCACTTAGTCC
>>
>> GAGAGCTTAATCACCGGGCCTCGAGAAACCAGCAACCCTTGCGAGTACGTGTACCTCTTC
>>
>> TCGCTCCGGGCCCATGGGGTGTGGGGGTTTCTATGTTGAAACTATACCTGGCATCTG
>>
>>
>>
>>>UAR Kaktovik 11-004 R CSBCH(M13R)
>>
>> TCCCTTCATTATTATCGGACAACTAGCCTCCATTCTCTACTTTACAATCCTCCTAGTACT
>>
>> TATACCTATCGCTGGAATTATTGAAAACAGCCTCTTAAAGTGGAGAGTCTTTGTAGTATA
>>
>> GCAATTACCTTGGTCTTGTAAGCCAAAAACGGAGAATACCTACTCTCCCTAAGACTCAAG
>>
>> GAAGAAGCAACAGCTCCACTACCAGCACCCAAAGCTAATGTTCTATTTAAACTATTCCCT
>>
>> GGTACATACTACTATTTTACCCCATGTCCTATTCATTTCATATATACCATCTTATGTGCT
>>
>> GTGCCATCGCAGTATGTCCTCGAATACCTTTCCCCCCCTATGTATATCGTGCATTAATGG
>>
>> TGTGCCCCATGCATATAAGCATGTACATATTACGCTTGGTCTTACATAAGGACTTACGTT
>>
>> CCGAAAGCTTATTTCAGGTGTATGGTCTGTGAGCATGTATTTCACTTAGTCCGAGAGCTT
>>
>> AATCACCGGGCCTCGAGAAACCAGCAACCCTTGCGAGTACGTGTACCTCTTCTCGCTCCG
>>
>> GGCCCATGGGGTGTGGGGGTTTCTATGTTGAAACTATACCTG
>>
>>
>>
>> My end goal is to align the paired ends of the sequences that have the same
>> description and save the aligned sequence to another file for further
>> analyses.  I have a few problems:
>>
>>
>>
>> 1) The descriptions of each sequence are not identical so I need to delete
>> all but the first three parts and include the associated sequence. I.e.
>> remove F L15774b(M13F) and  R CSBCH(M13R) above. The script below is what I
>> have to make a new dictionary in this format.  Is this the best way to
>> proceed in order to align the sequences in the next step?
>>
>>
>>
>> handle = open("pairedend2.txt", 'r')
>>
>>
>> output_handle = open("AlignDict.txt", "a")
>>
>>
>> desc2=dict()
>>
>> from Bio import SeqIO
>>
>> for seq_record in SeqIO.parse(handle, "fasta"):
>>
>>     parts = seq_record.description.split(" ")
>>
>>     des = [str(parts[0] + ' ' + parts[1] + ' ' + parts[2] + ':' +
>> seq_record.seq)]
>>
>>     desc2=(dict(v.split(':') for v in des))
>>
>>     print ('\n' + str(desc2))
>>
>>     output_handle.write(str(desc2) + '\n')
>>
>>
>>
>> output_handle.close()
>>
>>
>>
>> 2) My second issue is figuring out how to do the alignment.  I thought I
>> would do a local alignment using something like needle (or is there a better
>> way?) but the script examples I have seen so far use two files with a single
>> sequence in each and I have one file with multiple sequences.  There is no
>> easy way to separate these out into individual sequences into different
>> files as the data sets are quite large.
>>
>>
>>
>> Any help/ideas would be greatly appreciated.
>>
>>
>>
>> Thank you
>>
>>
>>   Damian
>>
>>
>> --
>> Damian Menning, Ph.D.
>>
>> _______________________________________________
>> Biopython mailing list  -  Biopython at mailman.open-bio.org
>> http://mailman.open-bio.org/mailman/listinfo/biopython
> _______________________________________________
> Biopython mailing list  -  Biopython at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/biopython