[Biopython] Local alignment using a single fasta file with multiple paired end reads

Thu Sep 17 15:52:07 UTC 2015

Thank you both.  I'll get to work on both of those suggestions and let you
know what I figure out.

  Damian

On Thu, Sep 17, 2015 at 4:01 AM, Ivan Gregoretti <ivangreg at gmail.com> wrote:

> In case it is needed, merging paired reads in FASTQ format can be done
> with a tool called FLASH, "Fast Length Adjustment of SHort reads".
>
> I use it routinely for merging pairs of 2x300 bp from Illumina's
> technology.
>
> I hope this helps.
>
> Ivan
>
>
>
> Ivan Gregoretti, PhD
> Bioinformatics
>
>
>
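To make "merging" a pair concrete, here is a toy sketch in plain Python. It is not how FLASH works internally: real merging tools tolerate mismatches and use quality scores, whereas this only accepts an exact overlap. The function name `merge_overlap` and the `min_overlap` default are assumptions made for illustration, and both reads are assumed to be in the same orientation.

```python
def merge_overlap(left, right, min_overlap=10):
    """Merge two same-strand reads where a suffix of `left` equals a prefix of `right`.

    Returns the merged string, or None if no exact overlap of at least
    `min_overlap` bases exists. The longest overlap is tried first.
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            # Keep all of `left`, then only the part of `right` past the overlap.
            return left + right[size:]
    return None


if __name__ == "__main__":
    print(merge_overlap("AAACCCGGG", "CCCGGGTTT", min_overlap=3))
```

Which read goes in as `left` depends on which one starts further upstream, so it may be worth trying both orders.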
> On Thu, Sep 17, 2015 at 7:08 AM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
> > Hi Damian,
> >
> > This sounds very much like the read merging done with paired-end Illumina
> > FASTQ files, although here you are presumably using "Sanger" capillary
> > sequencing? If so the ABI files can be turned into FASTQ files with
> > quality scores rather than just FASTA files (e.g. with Biopython's
> > SeqIO). You would probably have to rename your reads, e.g.
> > "identifier/1 (space) optional text" and "identifier/2 (space)
> > optional text" but I'm not sure how well pair-merging tools would cope
> > with these longer reads.
> >
> > Peter
> >
> >
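A minimal sketch of the ABI-to-FASTQ idea above, not a tested pipeline. `paired_read_id` and `abi_to_fastq` are hypothetical helper names and the file paths are placeholders. Joining the sample words with underscores is an assumption (FASTQ identifiers end at the first space), and the fourth word of the description is assumed to be F or R, as in the example records quoted in this thread. Recent Biopython releases accept a filename for the binary "abi" format; older ones may need a handle opened in "rb" mode.

```python
def paired_read_id(description):
    """Turn 'UAR Kaktovik 11-004 F L15774b(M13F)' into 'UAR_Kaktovik_11-004/1'."""
    words = description.split()
    sample, direction = words[:3], words[3]
    suffix = {"F": "1", "R": "2"}[direction]  # assumption: F is read 1, R is read 2
    return "_".join(sample) + "/" + suffix


def abi_to_fastq(abi_path, fastq_path, description):
    """Convert one ABI trace to FASTQ, renaming it in the 'identifier/1' style."""
    # Imported here so paired_read_id can be used without Biopython installed.
    from Bio import SeqIO

    record = SeqIO.read(abi_path, "abi")  # one trace file holds one read
    record.id = paired_read_id(description)
    record.description = ""
    return SeqIO.write(record, fastq_path, "fastq")  # number of records written
```

For example, `paired_read_id("UAR Kaktovik 11-004 R CSBCH(M13R)")` gives `"UAR_Kaktovik_11-004/2"`.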
> > On Wed, Sep 16, 2015 at 10:25 PM, Damian Menning <dmenning at mail.usf.edu>
> > wrote:
> >> Hello All,
> >>
> >>
> >>   I have a fasta dataset in a single file with multiple paired end reads in
> >> paired sets of forward and reverse sequences (the reverse sequence is in the
> >> correct orientation).  I am pretty sure this is the real world example
> >> requested in 6.1.3 of the Biopython Cookbook :).  Within this dataset all of
> >> the information is the same i.e. ID:, Name:, Number of features:. The only
> >> exceptions are the descriptions and sequences.  Ex.
> >>
> >>
> >> >UAR Kaktovik 11-004 F L15774b(M13F)
> >> GTAGTATAGCAATTACCTTGGTCTTGTAAGCCAAAAACGGAGAATACCTACTCTCCCTAA
> >> GACTCAAGGAAGAAGCAACAGCTCCACTACCAGCACCCAAAGCTAATGTTCTATTTAAAC
> >> TATTCCCTGGTACATACTACTATTTTACCCCATGTCCTATTCATTTCATATATACCATCT
> >> TATGTGCTGTGCCATCGCAGTATGTCCTCGAATACCTTTCCCCCCCTATGTATATCGTGC
> >> ATTAATGGTGTGCCCCATGCATATAAGCATGTACATATTACGCTTGGTCTTACATAAGGA
> >> CTTACGTTCCGAAAGCTTATTTCAGGTGTATGGTCTGTGAGCATGTATTTCACTTAGTCC
> >> GAGAGCTTAATCACCGGGCCTCGAGAAACCAGCAACCCTTGCGAGTACGTGTACCTCTTC
> >> TCGCTCCGGGCCCATGGGGTGTGGGGGTTTCTATGTTGAAACTATACCTGGCATCTG
> >>
> >> >UAR Kaktovik 11-004 R CSBCH(M13R)
> >> TCCCTTCATTATTATCGGACAACTAGCCTCCATTCTCTACTTTACAATCCTCCTAGTACT
> >> TATACCTATCGCTGGAATTATTGAAAACAGCCTCTTAAAGTGGAGAGTCTTTGTAGTATA
> >> GCAATTACCTTGGTCTTGTAAGCCAAAAACGGAGAATACCTACTCTCCCTAAGACTCAAG
> >> GAAGAAGCAACAGCTCCACTACCAGCACCCAAAGCTAATGTTCTATTTAAACTATTCCCT
> >> GGTACATACTACTATTTTACCCCATGTCCTATTCATTTCATATATACCATCTTATGTGCT
> >> GTGCCATCGCAGTATGTCCTCGAATACCTTTCCCCCCCTATGTATATCGTGCATTAATGG
> >> TGTGCCCCATGCATATAAGCATGTACATATTACGCTTGGTCTTACATAAGGACTTACGTT
> >> CCGAAAGCTTATTTCAGGTGTATGGTCTGTGAGCATGTATTTCACTTAGTCCGAGAGCTT
> >> AATCACCGGGCCTCGAGAAACCAGCAACCCTTGCGAGTACGTGTACCTCTTCTCGCTCCG
> >> GGCCCATGGGGTGTGGGGGTTTCTATGTTGAAACTATACCTG
> >>
> >>
> >>
> >> My end goal is to align the paired ends of the sequences that have the same
> >> description and save the aligned sequence to another file for further
> >> analyses.  I have a few problems:
> >>
> >>
> >>
> >> 1) The descriptions of each sequence are not identical so I need to delete
> >> all but the first three parts and include the associated sequence. I.e.
> >> remove F L15774b(M13F) and R CSBCH(M13R) above. The script below is what I
> >> have to make a new dictionary in this format.  Is this the best way to
> >> proceed in order to align the sequences in the next step?
> >>
> >>
> >>
> >> from collections import defaultdict
> >>
> >> from Bio import SeqIO
> >>
> >> # Group sequences by the first three words of the description
> >> # (e.g. "UAR Kaktovik 11-004") so each key keeps both its F and R reads.
> >> desc2 = defaultdict(list)
> >>
> >> for seq_record in SeqIO.parse("pairedend2.txt", "fasta"):
> >>     parts = seq_record.description.split()
> >>     key = " ".join(parts[:3])
> >>     desc2[key].append(str(seq_record.seq))
> >>
> >> # Write one tab-separated line per sample: the key, then its sequences.
> >> with open("AlignDict.txt", "w") as output_handle:
> >>     for key, seqs in desc2.items():
> >>         print(key, len(seqs))
> >>         output_handle.write(key + "\t" + "\t".join(seqs) + "\n")
> >>
> >>
> >>
> >> 2) My second issue is figuring out how to do the alignment.  I thought I
> >> would do a local alignment using something like needle (or is there a better
> >> way?) but the script examples I have seen so far use two files with a single
> >> sequence in each and I have one file with multiple sequences.  There is no
> >> easy way to separate these out into individual sequences into different
> >> files as the data sets are quite large.
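One note on the tools named above: EMBOSS needle performs a global (Needleman-Wunsch) alignment, and water is its local (Smith-Waterman) counterpart. The sketch below avoids splitting the file by pairing the records in memory and aligning each pair with Biopython's own `PairwiseAligner` in local mode, which assumes a Biopython release that ships `Bio.Align.PairwiseAligner`. The helper names (`pair_up`, `local_align`, `align_file`) are hypothetical, the scoring values are arbitrary placeholders, and the fourth word of each description is assumed to be F or R as in the example records.

```python
def pair_up(records):
    """Group (description, sequence) tuples by the first three description words."""
    pairs = {}
    for description, sequence in records:
        words = description.split()
        key = " ".join(words[:3])
        direction = words[3]  # assumption: "F" or "R"
        pairs.setdefault(key, {})[direction] = sequence
    return pairs


def local_align(forward, reverse):
    """Return one best-scoring local alignment of the two reads."""
    from Bio import Align  # lazy import: pair_up itself needs no Biopython

    aligner = Align.PairwiseAligner()
    aligner.mode = "local"
    # Arbitrary example scores, not tuned for any data set.
    aligner.match_score = 2
    aligner.mismatch_score = -1
    aligner.open_gap_score = -5
    aligner.extend_gap_score = -1
    return aligner.align(forward, reverse)[0]


def align_file(fasta_path):
    """Yield (sample key, alignment) for every complete F/R pair in one file."""
    from Bio import SeqIO

    records = ((rec.description, str(rec.seq))
               for rec in SeqIO.parse(fasta_path, "fasta"))
    for key, reads in pair_up(records).items():
        if "F" in reads and "R" in reads:  # skip samples missing a mate
            yield key, local_align(reads["F"], reads["R"])
```

A possible use would be `for key, alignment in align_file("pairedend2.txt"): print(key, alignment.score)`, writing each alignment to an output file instead of printing it.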
> >>
> >>
> >>
> >> Any help/ideas would be greatly appreciated.
> >>
> >>
> >>
> >> Thank you
> >>
> >>
> >>   Damian
> >>
> >>
> >> --
> >> Damian Menning, Ph.D.
> >>
> >> _______________________________________________
> >> Biopython mailing list  -  Biopython at mailman.open-bio.org
> >> http://mailman.open-bio.org/mailman/listinfo/biopython
>

-- 
Damian Menning, Ph.D.