[Biopython] PDB Tidy

Thu Mar 18 17:11:56 UTC 2010

Hi Carlos and BioPythoneers

Has anyone come across PDB-Tools? http://code.google.com/p/pdb-tools/
It's a Python package for cleaning up PDB files, plus some other things.

Might be useful for someone interested in the PDB-Tidy project. :) :)

Thanks
Subho

> Message: 1
> Date: Wed, 17 Mar 2010 14:08:21 -0300
> From: Carlos Ríos V. <crosvera at gmail.com>
> Subject: Re: [Biopython] BioPython GSOC 2010
> To: biopython at lists.open-bio.org
> Message-ID: <1268845701.2161.10.camel at cabernet>
> Content-Type: text/plain; charset="UTF-8"
>
> Hello people,
>
> I'm very interested in this idea:
>
> http://biopython.org/wiki/Google_Summer_of_Code#PDB-Tidy:_command-line_tools_for_manipulating_PDB_files
>
> I have some experience with the Bio.PDB module, and I think it would
> be a very useful tool for labs.
>
> Brad Chapman wrote in an e-mail that we have to demonstrate our
> knowledge of the project and our open source coding abilities. Where
> should I show you that?
>
> Regards.
>
> --
> http://crosvera.blogspot.com
>
> Carlos Ríos V.
> Estudiante de Ing. (E) en Computación e Informática.
> Universidad del Bío-Bío
> VIII Región, Chile
>
> Linux user number 425502
>
> ------------------------------
>
> Message: 2
> Date: Wed, 17 Mar 2010 14:32:44 -0400
> From: Eric Talevich <eric.talevich at gmail.com>
> Subject: Re: [Biopython] sort fasta file
> To: xyz <mitlox at op.pl>, biopython at lists.open-bio.org
> Message-ID:
>        <3f6baf361003171132s4ec12e4bw12d80e2a5edf6977 at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> xyz <mitlox at op.pl> wrote:
>
> >
> > Hello,
> > I would like to sort a multi-record FASTA file by sequence length,
> > i.e. from the read with the longest sequence to the read with the
> > shortest sequence.
> >
> > I have tried to do it, but I do not know how to sort the records by
> > sequence length.
> >
> > [...]
> >
> > If I could not hold all the records in memory at once what could I do?
> >
>
> There's also a program called uclust which can sort reads by sequence
> length very quickly:
> http://www.drive5.com/uclust/
>
> It's designed for clustering short reads, but it includes a feature to sort
> sequences by decreasing length. I think it can handle files larger than
> available RAM, too, though I haven't tested that.
>
> -Eric
>
>
> ------------------------------
>
> Message: 3
> Date: Thu, 18 Mar 2010 10:44:09 +0000
> From: Peter <biopython at maubp.freeserve.co.uk>
> Subject: Re: [Biopython] sort fasta file
> To: xyz <mitlox at op.pl>
> Cc: biopython at lists.open-bio.org
> Message-ID:
>        <320fb6e01003180344n47fc9ba3y54c7284fc6747e25 at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> On Wed, Mar 17, 2010 at 12:01 PM, xyz <mitlox at op.pl> wrote:
> > On Wed, 17 Mar 2010 10:22:48 +0000
> > Peter <biopython at maubp.freeserve.co.uk> wrote:
> >> For example,
> >>
> >> from Bio import SeqIO
> >>
> >> handle = open("example.fasta")
> >> records = list(SeqIO.parse(handle, "fasta"))
> >> handle.close()
> >> # Sort by sequence length, shortest first
> >> records.sort(key=len)
> >> # For longest first use: records.sort(key=len, reverse=True)
> >> out_handle = open("sorted.fasta", "w")
> >> SeqIO.write(records, out_handle, "fasta")
> >> out_handle.close()
> >>
> >> Peter
> >
> > Thank you for the code. I only changed this and it works.
> >
> > records.sort(key=lambda rec: len(rec.seq), reverse=True)
> >
> > If I could not hold all the records in memory at once what could I do?
>
> I would use Bio.SeqIO.index() to give random access to the
> records. You would also need to load and sort the record
> identifiers and the lengths. Something like this:
>
> from Bio import SeqIO
>
> # Get the lengths and ids, and sort on length (shortest first)
> len_and_ids = sorted((len(rec), rec.id)
>                      for rec in SeqIO.parse("ls_orchid.fasta", "fasta"))
> # Once sorted we only need the ids, so we can free some memory
> ids = [rec_id for (length, rec_id) in len_and_ids]
> del len_and_ids
> # Now prepare the index
> record_index = SeqIO.index("ls_orchid.fasta", "fasta")
> # Now prepare a generator expression to give the
> # records one-by-one for output
> records = (record_index[rec_id] for rec_id in ids)
> # Finally write these to a file
> with open("sorted.fasta", "w") as handle:
>     count = SeqIO.write(records, handle, "fasta")
> print("Sorted %i records" % count)
>
> That code should work for any file format supported by
> the Bio.SeqIO parse, index and write functions (e.g.
> GenBank files, FASTQ, etc).
>
> Notice that it actually reads through the input file twice,
> once to get the ids and lengths, and once to build the
> index (getting the ids and file offsets). If you wanted to
> get a bit more low level you could do this in a single
> pass - but it would be more effort than using the SeqIO
> functions.
>
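
For what it's worth, the single-pass idea mentioned above could be sketched
roughly as follows. This is not Peter's code and it bypasses Bio.SeqIO
entirely: it assumes a plain FASTA file in which every record starts with a
">" line, and the function name sort_fasta_by_length is made up for
illustration. It notes each record's sequence length and byte offset while
reading the file once, sorts that small table, then seeks back to copy the
raw records out in order, so only the table is held in memory.

```python
def sort_fasta_by_length(in_path, out_path, reverse=True):
    """Copy FASTA records from in_path to out_path, ordered by sequence length.

    Hypothetical helper (not part of Biopython). Assumes plain FASTA input.
    Returns the number of records written.
    """
    entries = []  # one (sequence length, byte offset, size in bytes) per record
    with open(in_path, "rb") as src:
        rec_start = None  # byte offset of the current record's ">" line
        seq_len = 0       # residues counted so far in the current record
        offset = 0        # byte offset of the line being read
        for line in src:
            if line.startswith(b">"):
                if rec_start is not None:
                    entries.append((seq_len, rec_start, offset - rec_start))
                rec_start, seq_len = offset, 0
            elif rec_start is not None:
                seq_len += len(line.strip())
            offset += len(line)
        if rec_start is not None:
            entries.append((seq_len, rec_start, offset - rec_start))

        # Sort on length only; the sort is stable, so ties keep file order.
        entries.sort(key=lambda entry: entry[0], reverse=reverse)

        with open(out_path, "wb") as dst:
            for _, start, size in entries:
                src.seek(start)
                chunk = src.read(size)
                # Make sure a final record lacking a newline still ends cleanly.
                dst.write(chunk if chunk.endswith(b"\n") else chunk + b"\n")
    return len(entries)
```

Calling sort_fasta_by_length("ls_orchid.fasta", "sorted.fasta") would write
the longest record first; pass reverse=False for shortest first. Records are
copied byte for byte, so the original line wrapping is kept.
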
> I wonder if this example is useful enough to go in the
> tutorial? What do you think?
>
> Peter
>
>
> ------------------------------
>
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>
> End of Biopython Digest, Vol 87, Issue 19
> *****************************************
>

-- 
Subhodeep Moitra
First Year, Masters Student
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA , USA