From jdeuts01 at students.poly.edu  Thu Dec  1 09:09:19 2011
From: jdeuts01 at students.poly.edu (jdeuts01 at students.poly.edu)
Date: Thu, 1 Dec 2011 14:09:19 +0000
Subject: [Bioperl-l] question
Message-ID: <SNT134-W43F83A46574EEDD841600186B10@phx.gbl>


Dear Bioperl,
This is my first experience with BioPerl and I need some help, please.
1. The BioPerl version is 1.6.1, and I have also installed Bundle-BioPerl 2.1.8 and mGen 1.03. I was unable to install Bribes and Trouchelle DB. Will this prevent the BioPerl package from functioning correctly?
2. The operating platform is Windows 7 (64-bit) with ActiveState Perl v5.12.2.
3. The script is as follows:
#!/usr/bin/perl
# Write a script using OOP to write protein sequences to the file fasta.txt.
use strict;
use warnings;
use Bio::SeqIO::fasta;

# Declare and initialize input and output files.
my $protein_fasta = "protein.fa";
my $protein_out = ">fasta.txt";

# Setup objects for input and output.
my $seq_in = Bio::SeqIO->new(-file => "$protein_fasta", -format => 'Fasta');
my $seq_out = Bio::SeqIO->new(-file => "$protein_out", -format => 'fasta');

# Establish while loop using "next_seq" method
# to read in multiple sequences from "protein.fa" file
# one by one until none were remaining.
while (my $seq = $seq_in->next_seq) {
    $seq_out->write_seq($seq);
}
The information is successfully written to the file: fasta.txt. 
4. Receiving the following error messages: 
Replacement list is longer than search list at C:/Perl64/site/lib/Bio/Range.pm line 251.
Subroutine _initialize redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 92.
Subroutine next_seq redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 112.
Subroutine write_seq redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 193.
Subroutine width redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 272.
Subroutine preferred_id_type redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 295.
Thanks in advance for your help.
John Deutsch
 		 	   		  

From jboddu at illinois.edu  Thu Dec  1 11:38:00 2011
From: jboddu at illinois.edu (Boddu, Jayanand)
Date: Thu, 1 Dec 2011 16:38:00 +0000
Subject: [Bioperl-l] Chromosome coordinates
Message-ID: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>

Hello
I am a newbie to Perl scripts.
I have a file with short reads mapped to the MAIZE genome.
The format is a simple BLASTN output:
READ_ID  Chr    % Similarity  Alignment  Mismatches  Gaps  READ Start  READ End  Chr Start  Chr End    E Value  Score
READ1    chrPt  100           17         0           0     1           17        35021      35037      0.21     34.2
READ1    chr10  100           17         0           0     1           17        128587356  128587372  0.21     34.2
READ1    chr6   100           17         0           0     1           17        160769803  160769787  0.21     34.2
READ1    chr5   100           17         0           0     1           17        172103083  172103067  0.21     34.2
READ1    chr4   100           17         0           0     1           17        213173683  213173699  0.21     34.2
READ1    chr3   100           17         0           0     1           17        23689132   23689116   0.21     34.2
READ2    chr8   100           17         0           0     1           17        161048603  161048587  0.21     34.2
READ2    chr6   100           17         0           0     1           17        155768884  155768868  0.21     34.2
READ2    chr5   100           17         0           0     1           17        32958812   32958828   0.21     34.2
READ2    chr3   100           17         0           0     1           17        212451090  212451074  0.21     34.2
READ2    chr2   100           17         0           0     1           17        2046449    2046465    0.21     34.2
READ2    chr1   100           17         0           0     1           17        223233801  223233785  0.21     34.2
READ2    chr1   100           17         0           0     1           17        277573037  277573021  0.21     34.2

As expected, the same read maps to multiple places on the same or different chromosomes.
I have a GFF file with annotated coordinates.
I would like to run a Perl script to find out which READS are within the GENES in the GFF file and which are not.
The anticipated script should:

1.       Take the READ coordinates on the genome (by chromosome);

2.       Go to the GFF file;

3.       Find the Chromosome;

4.       Find the GENE (by coordinates);

5.       and report READ-its coordinates-Chromosome-GENE-and its coordinates.

It doesn't need to be in the same order.
After this, I guess I could use a simple Microsoft Access query to pull out the READS that are not mapped to the GENES.
I would greatly appreciate it if anyone has a script that does more or less a similar job.
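
The steps listed above can be sketched in plain Perl, without any BioPerl modules. This is only a minimal sketch under stated assumptions: the gene table is hypothetical (in a real script it would be filled from the GFF file), the two hits are copied from the table above, and a read counts as "within" a gene only if it lies entirely inside the gene's coordinates. Some hits have Chr Start greater than Chr End, so the coordinates are put in ascending order before comparing.

```perl
use strict;
use warnings;

# Hypothetical gene table: chromosome => list of [gene_id, start, end].
# In a real script this would be built from the GFF file.
my %genes = (
    chr5 => [ [ 'geneA', 32958000, 32959000 ] ],
    chr2 => [ [ 'geneB', 3000000,  3001000  ] ],
);

# Return the genes on $chr that fully contain the hit.
sub genes_for_hit {
    my ($genes, $chr, $start, $end) = @_;
    ($start, $end) = ($end, $start) if $start > $end;    # ascending order
    return grep { $_->[1] <= $start && $end <= $_->[2] }
           @{ $genes->{$chr} || [] };
}

# Two hits from the BLASTN table (12 whitespace-separated columns).
my @hits = (
    'READ2 chr5 100 17 0 0 1 17 32958812 32958828 0.21 34.2',
    'READ2 chr3 100 17 0 0 1 17 212451090 212451074 0.21 34.2',
);

for my $line (@hits) {
    # Columns 0, 1, 8 and 9 are READ_ID, Chr, Chr Start and Chr End.
    my ($read, $chr, $chr_start, $chr_end) = (split ' ', $line)[0, 1, 8, 9];
    my @found = genes_for_hit(\%genes, $chr, $chr_start, $chr_end);
    if (@found) {
        # One report line per containing gene: read, hit, gene id, gene coordinates.
        print join("\t", $read, $chr, $chr_start, $chr_end, @$_), "\n" for @found;
    }
    else {
        print join("\t", $read, $chr, $chr_start, $chr_end, 'no gene'), "\n";
    }
}
```

Reads reported as "no gene" are the ones that would otherwise be pulled out with a separate query. To count partial overlaps as well, the test inside grep could be relaxed to check that the two intervals intersect.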

Thanks
Jay


From scott at scottcain.net  Thu Dec  1 11:59:56 2011
From: scott at scottcain.net (Scott Cain)
Date: Thu, 1 Dec 2011 11:59:56 -0500
Subject: [Bioperl-l] Chromosome coordinates
In-Reply-To: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>
References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>
Message-ID: <CA+JTaoy0o1CXCm5EZi_4K5vTAaW9kFJUFVXg1rWYB84ZnN7+5A@mail.gmail.com>

Hi Jay,

Since the maize GFF file is likely to be fairly large, I would consider
putting it in a database, using either Bio::DB::GFF if it is GFF2 or
Bio::DB::SeqFeature::Store if it is GFF3.  Then you can use the methods
that come along with either of those modules to search regions for
genes.  They both support a get_features_by_location method, so you could
get the range for each of the regions you want to look at, and check the
database with that method to see if anything is there.

Scott


On Thu, Dec 1, 2011 at 11:38 AM, Boddu, Jayanand <jboddu at illinois.edu>wrote:

> Hello
> I am newbie to Perl scripts.
> I have a file with short reads mapped to the MAIZE genome
> The format is a simple BLASTN output.
> READ_ID  Chr    % Similarity  Alignment  Mismatches  Gaps  READ Start  READ End  Chr Start  Chr End    E Value  Score
> READ1    chrPt  100           17         0           0     1           17        35021      35037      0.21     34.2
> READ1    chr10  100           17         0           0     1           17        128587356  128587372  0.21     34.2
> READ1    chr6   100           17         0           0     1           17        160769803  160769787  0.21     34.2
> READ1    chr5   100           17         0           0     1           17        172103083  172103067  0.21     34.2
> READ1    chr4   100           17         0           0     1           17        213173683  213173699  0.21     34.2
> READ1    chr3   100           17         0           0     1           17        23689132   23689116   0.21     34.2
> READ2    chr8   100           17         0           0     1           17        161048603  161048587  0.21     34.2
> READ2    chr6   100           17         0           0     1           17        155768884  155768868  0.21     34.2
> READ2    chr5   100           17         0           0     1           17        32958812   32958828   0.21     34.2
> READ2    chr3   100           17         0           0     1           17        212451090  212451074  0.21     34.2
> READ2    chr2   100           17         0           0     1           17        2046449    2046465    0.21     34.2
> READ2    chr1   100           17         0           0     1           17        223233801  223233785  0.21     34.2
> READ2    chr1   100           17         0           0     1           17        277573037  277573021  0.21     34.2
>
> As expected the same read maps to multiple places on the same/different
> chromosome.
> I have a GFF file with annotated coordinates.
> I would like to run a PERL script to find out READS that are within the
> GENES in the GFF file and that are not.
> The anticipated script should;
>
> 1.       Take the READ coordinates on the genome (by chromosome);
>
> 2.       Go the GFF file;
>
> 3.       Find the Chromosome;
>
> 4.       Find the GENE (by coordinates);
>
> 5.       and report READ-its coordinates-Chromosome-GENE-and its
> coordinates.
>
> It doesn't need to be in the same order.
> After this, I guess I could use simple Microsoft ACCESS query to pull out
> READS that are not mapped to the GENEs.
> I would greatly appreciate if anyone can has a script that more or less
> similar job.
>
> Thanks
> Jay
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>


-- 
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot
net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research

From jason.stajich at gmail.com  Thu Dec  1 12:31:29 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Thu, 1 Dec 2011 09:31:29 -0800
Subject: [Bioperl-l] Chromosome coordinates
In-Reply-To: <CA+JTaoy0o1CXCm5EZi_4K5vTAaW9kFJUFVXg1rWYB84ZnN7+5A@mail.gmail.com>
References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>
	<CA+JTaoy0o1CXCm5EZi_4K5vTAaW9kFJUFVXg1rWYB84ZnN7+5A@mail.gmail.com>
Message-ID: <470A08F7-E238-44E1-B0D1-69FDBC0BA2A3@gmail.com>

You might try BEDtools' intersectBED, which can do a lot of what you are describing in a simple command-line program.

Jason
On Dec 1, 2011, at 8:59 AM, Scott Cain wrote:

> Hi Jay,
> 
> Since the maize GFF file is likely to be fairly large, I would consider
> putting it in a database, using either Bio::DB::GFF if it is GFF2 or
> Bio::DB::SeqFeature::Store if it is gff3.  Then you can use the methods
> that come along with either of those modules to search regions for for
> genes.  They both support a get_features_by_location method, so you could
> get the range for each of the regions you want to look at, and check the
> database with that method to see if anything is there.
> 
> Scott
> 
> 
> On Thu, Dec 1, 2011 at 11:38 AM, Boddu, Jayanand <jboddu at illinois.edu>wrote:
> 
>> Hello
>> I am newbie to Perl scripts.
>> I have a file with short reads mapped to the MAIZE genome
>> The format is a simple BLASTN output.
>> READ_ID  Chr    % Similarity  Alignment  Mismatches  Gaps  READ Start  READ End  Chr Start  Chr End    E Value  Score
>> READ1    chrPt  100           17         0           0     1           17        35021      35037      0.21     34.2
>> READ1    chr10  100           17         0           0     1           17        128587356  128587372  0.21     34.2
>> READ1    chr6   100           17         0           0     1           17        160769803  160769787  0.21     34.2
>> READ1    chr5   100           17         0           0     1           17        172103083  172103067  0.21     34.2
>> READ1    chr4   100           17         0           0     1           17        213173683  213173699  0.21     34.2
>> READ1    chr3   100           17         0           0     1           17        23689132   23689116   0.21     34.2
>> READ2    chr8   100           17         0           0     1           17        161048603  161048587  0.21     34.2
>> READ2    chr6   100           17         0           0     1           17        155768884  155768868  0.21     34.2
>> READ2    chr5   100           17         0           0     1           17        32958812   32958828   0.21     34.2
>> READ2    chr3   100           17         0           0     1           17        212451090  212451074  0.21     34.2
>> READ2    chr2   100           17         0           0     1           17        2046449    2046465    0.21     34.2
>> READ2    chr1   100           17         0           0     1           17        223233801  223233785  0.21     34.2
>> READ2    chr1   100           17         0           0     1           17        277573037  277573021  0.21     34.2
>> 
>> As expected the same read maps to multiple places on the same/different
>> chromosome.
>> I have a GFF file with annotated coordinates.
>> I would like to run a PERL script to find out READS that are within the
>> GENES in the GFF file and that are not.
>> The anticipated script should;
>> 
>> 1.       Take the READ coordinates on the genome (by chromosome);
>> 
>> 2.       Go the GFF file;
>> 
>> 3.       Find the Chromosome;
>> 
>> 4.       Find the GENE (by coordinates);
>> 
>> 5.       and report READ-its coordinates-Chromosome-GENE-and its
>> coordinates.
>> 
>> It doesn't need to be in the same order.
>> After this, I guess I could use simple Microsoft ACCESS query to pull out
>> READS that are not mapped to the GENEs.
>> I would greatly appreciate if anyone can has a script that more or less
>> similar job.
>> 
>> Thanks
>> Jay
>> 
>> 
> 
> 
> 
> -- 
> ------------------------------------------------------------------------
> Scott Cain, Ph. D.                                   scott at scottcain dot
> net
> GMOD Coordinator (http://gmod.org/)                     216-392-3087
> Ontario Institute for Cancer Research


From jovel_juan at hotmail.com  Thu Dec  1 12:36:32 2011
From: jovel_juan at hotmail.com (Juan Jovel)
Date: Thu, 1 Dec 2011 17:36:32 +0000
Subject: [Bioperl-l] Error when using SearchIO
In-Reply-To: <CA+JTaoy0o1CXCm5EZi_4K5vTAaW9kFJUFVXg1rWYB84ZnN7+5A@mail.gmail.com>
References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>,
	<CA+JTaoy0o1CXCm5EZi_4K5vTAaW9kFJUFVXg1rWYB84ZnN7+5A@mail.gmail.com>
Message-ID: <COL102-W29B31BD46108B9F1D9EF46FAB10@phx.gbl>


Hello Everybody!
I have been running endless standalone BLASTs these days, on two different clusters. The first one runs Rock Linux, while the other one runs Debian. On the Rock Linux one, when I use SearchIO to parse my blastp or blastx reports, I invariably get this message:
"Replacement list is longer than search list at /share/apps/lib/perl5/site_perl/5.14.1/Bio/Range.pm line 251"
What does it mean? Would it have any effect on my parsing results?
Thanks, 
JUAN 		 	   		  

From cjfields at illinois.edu  Thu Dec  1 14:03:45 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 1 Dec 2011 19:03:45 +0000
Subject: [Bioperl-l] Error when using SearchIO
In-Reply-To: <COL102-W29B31BD46108B9F1D9EF46FAB10@phx.gbl>
References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>,
	<CA+JTaoy0o1CXCm5EZi_4K5vTAaW9kFJUFVXg1rWYB84ZnN7+5A@mail.gmail.com>
	<COL102-W29B31BD46108B9F1D9EF46FAB10@phx.gbl>
Message-ID: <7233AC75-E03B-401A-8D0A-260ED76956A4@illinois.edu>

On Dec 1, 2011, at 11:36 AM, Juan Jovel wrote:

> Hello Everybody!
> I have been running endless standalone Blasts these days, in two different clusters. The first one runs on Rock Linux, while the other one on Devian. In the Rock Linux one, when I use SeachIO to parse my blastp or blastx reports, I get invariably a message:
> "Replacement list is longer than search list at /share/apps/lib/perl5/site_perl/5.14.1/Bio/Range.pm line 251"
> What it does mean? Would it have any effect on my parsing results?
> Thanks, 
> JUAN

This is a bug that has been fixed, I think in the latest BioPerl release (1.6.901).  A transliteration error produced this warning, but it is otherwise harmless. The warning depends on the Perl version; I think it appears in Perl 5.12 and up.  This user post (about halfway down) runs into the same issue: http://www.sysarchitects.com/bioperl.
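
For readers who have not met this warning before, it can be reproduced in plain Perl. The snippet below is a hedged illustration only: it is not the code from Bio/Range.pm (which is not shown in this thread), and the strings are made up. It compiles a tr/// whose replacement list is longer than its search list and captures the resulting compile-time warning.

```perl
use strict;
use warnings;

my @warnings;
$SIG{__WARN__} = sub { push @warnings, $_[0] };

# Compile the snippet at run time so that the compile-time warning
# is captured by the handler above instead of going straight to STDERR.
my $result = eval q{
    use warnings;
    my $text = 'ab';
    $text =~ tr/ab/xyz/;    # 'z' has no counterpart in the search list
    $text;
};
die $@ if $@;

print "result:  $result\n";
print "warning: $_" for @warnings;
```

The transliteration itself still does its work (here 'ab' becomes 'xy'); the extra replacement character is simply never used, which fits the description of the warning as harmless.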

chris

From David.Messina at sbc.su.se  Thu Dec  1 17:02:20 2011
From: David.Messina at sbc.su.se (Dave Messina)
Date: Thu, 1 Dec 2011 23:02:20 +0100
Subject: [Bioperl-l] re trieving blast multiple alignment in fasta form
In-Reply-To: <32886592.post@talk.nabble.com>
References: <32886592.post@talk.nabble.com>
Message-ID: <CAM3TQQURpsF8+Tq2AQ6yAzmDVuDnip-6mFAfGgUdR5ScTus4YA@mail.gmail.com>

Hi Eric,

Wait, do you want multiple pairwise alignments in your output FASTA file,
or a single multiple alignment of your query and all the hits?

If the former, get_aln() will give you one pairwise alignment per hsp, but
you'll need to move the output file creation statement (my $alnIO = ...)
before the loops so it gets created only once. Then, when you do the write
statement ($alnIO->write_aln($aln);), all of the alignments will go to the
same file.

If on the other hand you'd like to have a multiple alignment between a
query and all of its hits, you'll have to take the IDs of the hits, pull
the corresponding sequences out of the database, and then run a multiple
alignment algorithm on them.


Dave

From scuoppo at gmail.com  Fri Dec  2 17:50:28 2011
From: scuoppo at gmail.com (Claudio Scuoppo)
Date: Fri, 2 Dec 2011 17:50:28 -0500
Subject: [Bioperl-l] List of genes from genomic intervals
Message-ID: <CAEz0Wfv_yj6g5rZEbmj+UhOkJX8ryzObEO7N6rqvLRMV6Yd_Pg@mail.gmail.com>

Hi,

I am new to BioPerl. I was wondering what's the best strategy to get
the genes contained in a series of human genomic intervals.
Basically, I have a table with:

Chromosome Start End

Which module should I be looking at?
Thanks,
Claudio

From awitney at sgul.ac.uk  Mon Dec  5 06:09:39 2011
From: awitney at sgul.ac.uk (Adam Witney)
Date: Mon, 5 Dec 2011 11:09:39 +0000
Subject: [Bioperl-l] Bio::Graphics imagemap and padding
Message-ID: <44A27378-CC4B-4396-817E-AA31004847C7@sgul.ac.uk>

Hi,

Image maps seem to be out of position if you use padding in the Panel, like this:

my $panel = Bio::Graphics::Panel->new( ... -pad_left  => 20, -pad_right => 20 ... );

Without these options, the image map is fine. Is this a known issue?

Also on a side note, I noticed that when using Bio::Graphics with Dancer, some of the CGI code was blocking somewhere (I found a reference to a similar problem with CGI and Catalyst), swapping CGI with HTML::Entities fixes it:

sub create_web_map {
...
	eval "require HTML::Entities" unless HTML::Entities->can('encode_entities');
...
	my $title  = HTML::Entities::encode_entities($self->make_link($tr,$feature,1));
	my $target = HTML::Entities::encode_entities($self->make_link($tgr,$feature,1));
...
}

Thanks

Adam

From momin.amin at gmail.com  Mon Dec  5 18:00:23 2011
From: momin.amin at gmail.com (Amin Momin)
Date: Mon, 5 Dec 2011 15:00:23 -0800 (PST)
Subject: [Bioperl-l] SimpleAlign and consensus_string
Message-ID: <a08b3694-1cc6-4f5d-ad80-1306c5b777f7@v5g2000yqn.googlegroups.com>

Hi ,

I am generating a consensus sequence by aligning two protein homologs
using Bio::Tools::Run::Alignment::TCoffee. However, I am unable to
understand the criteria that the consensus_string() method of SimpleAlign
uses to determine the consensus at positions with dissimilar amino acids
or nucleotides. Also, how would the % cutoff provided to
consensus_string() affect the outcome?


Thanks,
Amin

From jason.stajich at gmail.com  Mon Dec  5 18:58:59 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Mon, 5 Dec 2011 15:58:59 -0800
Subject: [Bioperl-l] SimpleAlign and consensus_string
In-Reply-To: <a08b3694-1cc6-4f5d-ad80-1306c5b777f7@v5g2000yqn.googlegroups.com>
References: <a08b3694-1cc6-4f5d-ad80-1306c5b777f7@v5g2000yqn.googlegroups.com>
Message-ID: <4A59F912-BEE6-48D8-91EC-EAACB293A574@gmail.com>

There are several methods that do related things. 

Below is the documentation for the method. It allows stricter calling than simple majority-rule voting for a position: if you set the cutoff at 50%, then the called AA or NT has to appear at least 50% of the time. There is also an IUPAC consensus string, which is useful for generating DNA consensus patterns.

If you want a summary of whether or not a column contains only conservative amino acid changes, use the match_line method, which generates a string similar to the one CLUSTALW reports as a summary for each column.

=head2 consensus_string

 Title     : consensus_string
 Usage     : $str = $ali->consensus_string($threshold_percent)
 Function  : Makes a strict consensus
 Returns   : Consensus string
 Argument  : Optional threshold ranging from 0 to 100.
             The consensus residue has to appear in at least threshold %
             of the sequences at a given location, otherwise a '?'
             character will be placed at that location.
             (Default value = 0%)

=cut
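
To make the effect of the threshold concrete, here is a small plain-Perl sketch of the behaviour the documentation describes. It is a simplified stand-in, not BioPerl's implementation: the function name consensus, the tie-breaking rule, and the treatment of every character (including gaps) as an ordinary residue are all assumptions made for this example.

```perl
use strict;
use warnings;

# Simplified thresholded consensus over equal-length aligned rows.
# NOT BioPerl's implementation; it only mirrors the documented idea.
sub consensus {
    my ($rows, $threshold_percent) = @_;
    $threshold_percent = 0 unless defined $threshold_percent;
    my $consensus = '';
    for my $col (0 .. length($rows->[0]) - 1) {
        my %count;
        $count{ substr($_, $col, 1) }++ for @$rows;
        # Most frequent residue; ties are broken alphabetically so the
        # result is deterministic (an arbitrary choice for this sketch).
        my ($best) = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
        my $percent = 100 * $count{$best} / scalar @$rows;
        $consensus .= $percent >= $threshold_percent ? $best : '?';
    }
    return $consensus;
}

my @aligned = ('ACGT', 'ACGA', 'ACTA');
print consensus(\@aligned, 0),  "\n";    # ACGA
print consensus(\@aligned, 50), "\n";    # ACGA (2 of 3 is about 67%)
print consensus(\@aligned, 70), "\n";    # AC??
```

With only two aligned sequences, a column where they differ gives each residue 50%, so in this sketch any threshold above 50 turns that column into '?', while at 50 or below the tie-break decides. How the real method resolves such ties is not covered by the documentation quoted above.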

On Dec 5, 2011, at 3:00 PM, Amin Momin wrote:

> Hi ,
> 
> I am generating a consensus sequence by aligning two protein homologs
> using Bio::Tools::Run::Alignment::TCoffee. However, I am unable to
> understand the criteria consensus_string() method of simpleAlign uses
> to determine the consensus at position with dissimilar aminoacids/
> nucleotide. Also how would the % cutoffs provided to
> consensus_string() affect the outcome.
> 
> 
> Thanks,
> Amin


From wenbinmei at gmail.com  Tue Dec  6 11:09:35 2011
From: wenbinmei at gmail.com (wenbin mei)
Date: Tue, 6 Dec 2011 11:09:35 -0500
Subject: [Bioperl-l] revcom the multiple sequence alignment
Message-ID: <CAHdrE2RfvOCy_3aUP_nD9Gp=VSmqLh52wh0vQmgKmaebVgyL-w@mail.gmail.com>

Hi,

I have a question about reverse-complementing (revcom) a multiple sequence
alignment. One way is to convert the alignment into FASTA and revcom the
individual sequences. I wonder whether there is an easy way to revcom the
multiple sequence alignment as a whole.  Thank you for your help.

-best,
wenbin

From jason.stajich at gmail.com  Tue Dec  6 12:40:37 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Tue, 6 Dec 2011 09:40:37 -0800
Subject: [Bioperl-l] revcom the multiple sequence alignment
In-Reply-To: <CAHdrE2RfvOCy_3aUP_nD9Gp=VSmqLh52wh0vQmgKmaebVgyL-w@mail.gmail.com>
References: <CAHdrE2RfvOCy_3aUP_nD9Gp=VSmqLh52wh0vQmgKmaebVgyL-w@mail.gmail.com>
Message-ID: <CAAC48B9-371B-4446-897B-B1D8968CA7EA@gmail.com>

I think this would work to update it in place though I haven't tried it myself

for my $seq ( $aln->each_seq ) {
 $seq->seq( $seq->revcom->seq );
}
$out->write_aln($aln);

This may also work - not entirely sure if there is any extra work done on the meta data (start/end) of the Seq object when this is done.  You may want to flip start/end for the sequences (the seqs are Bio::LocatableSeq objects) explicitly if not. Or you may not care about those data and can ignore.

   $seq = $seq->revcom
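
The underlying idea, independent of BioPerl, is that reversing and complementing every row of a gapped alignment keeps the columns lined up, because each gap moves together with its column. A minimal plain-Perl sketch follows; the name revcom_row and the sample rows are made up for illustration, only A, C, G and T are complemented (IUPAC ambiguity codes are ignored), and the start/end metadata mentioned above is not handled at all.

```perl
use strict;
use warnings;

# Reverse-complement one gapped DNA row; gap characters are left as they are.
sub revcom_row {
    my ($row) = @_;
    my $rc = scalar reverse $row;     # reverse the string, gaps included
    $rc =~ tr/ACGTacgt/TGCAtgca/;     # complement the four bases only
    return $rc;
}

my @alignment = ('AAC-G', 'A-C-G');
my @flipped   = map { revcom_row($_) } @alignment;
print "$_\n" for @flipped;
# C-GTT
# C-G-T
```

Both output rows still have their gaps in matching columns, which is the property needed when flipping a whole alignment.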

Jason
On Dec 6, 2011, at 8:09 AM, wenbin mei wrote:

> Hi,
> 
> I have a question about revcom the multiple sequence alignment. One way I
> can do convert the format into fasta and revcom individual sequences. I
> wonder is there a easy way to convert the multiple sequence alignment as a
> whole.  Thank you for help.
> 
> -best,
> wenbin


From wenbinmei at gmail.com  Tue Dec  6 12:51:18 2011
From: wenbinmei at gmail.com (wenbin mei)
Date: Tue, 6 Dec 2011 12:51:18 -0500
Subject: [Bioperl-l] revcom the multiple sequence alignment
In-Reply-To: <CAAC48B9-371B-4446-897B-B1D8968CA7EA@gmail.com>
References: <CAHdrE2RfvOCy_3aUP_nD9Gp=VSmqLh52wh0vQmgKmaebVgyL-w@mail.gmail.com>
	<CAAC48B9-371B-4446-897B-B1D8968CA7EA@gmail.com>
Message-ID: <CAHdrE2TqBxYN+wPC=LH433GwMt3sp5oas_OQhwWY9DbfbjdCpg@mail.gmail.com>

I think I did not explain my question clearly. I extract individual
gene alignments from the whole-genome alignment. Since some genes are on the
reverse strand, I want to revcom those gene alignments. Here is part of my
script. I can read the strand information from another file.

my $newstart = $refseq->column_from_residue_number($start);
my $newend = $refseq->column_from_residue_number($end);
$seq{$genename} = $aln->slice($newstart, $newend);


Any suggestion that helps me revcom the gene alignments on the minus strand
would be appreciated. Thank you.

-best,
wenbin


On Tue, Dec 6, 2011 at 12:40 PM, Jason Stajich <jason.stajich at gmail.com>wrote:

> I think this would work to update it in place though I haven't tried it
> myself
>
> for my $seq ( $aln->each_seq ) {
>  $seq->seq( $seq->revcom->seq );
> }
> $out->write_aln($aln);
>
> This may also work - not entirely sure if there is any extra work done on
> the meta data (start/end) of the Seq object when this is done.  You may
> want to flip start/end for the sequences (the seqs are Bio::LocatableSeq
> objects) explicitly if not. Or you may not care about those data and can
> ignore.
>
>   $seq = $seq->revcom
>
> Jason
> On Dec 6, 2011, at 8:09 AM, wenbin mei wrote:
>
> > Hi,
> >
> > I have a question about revcom the multiple sequence alignment. One way I
> > can do convert the format into fasta and revcom individual sequences. I
> > wonder is there a easy way to convert the multiple sequence alignment as
> a
> > whole.  Thank you for help.
> >
> > -best,
> > wenbin
>
>


-- 
Wenbin Mei Ph.D. Student
Dr. Brad Barbazuk's Lab
Department of Biology
University of Florida
509-899-3067
wmei at ufl.edu <wmei at ufl.edu>

From kellert at ohsu.edu  Tue Dec  6 13:21:39 2011
From: kellert at ohsu.edu (Tom Keller)
Date: Tue, 6 Dec 2011 10:21:39 -0800
Subject: [Bioperl-l] Bioperl-l Digest, Vol 104, Issue 3
In-Reply-To: <mailman.3.1322931604.28955.bioperl-l@lists.open-bio.org>
References: <mailman.3.1322931604.28955.bioperl-l@lists.open-bio.org>
Message-ID: <B68BC6F2-8C57-4749-902D-3232B0DA6113@ohsu.edu>

I'd start by looking for the section "Searching for genes in genomic DNA" in the HOWTO:Beginners - BioPerl website.

Thomas (Tom) Keller, PhD
kellert at ohsu.edu
503.494.2442
6588 R Jones Hall (BSc/CROET)
MMI DNA Services
Member of OHSU Shared Resources

On Dec 3, 2011, at 9:00 AM, <bioperl-l-request at lists.open-bio.org> <bioperl-l-request at lists.open-bio.org> wrote:

> Today's Topics:
> 
>   1.  List of genes from genomic intervals (Claudio Scuoppo)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Fri, 2 Dec 2011 17:50:28 -0500
> From: Claudio Scuoppo <scuoppo at gmail.com>
> Subject: [Bioperl-l] List of genes from genomic intervals
> To: bioperl-l at lists.open-bio.org
> Message-ID:
> 	<CAEz0Wfv_yj6g5rZEbmj+UhOkJX8ryzObEO7N6rqvLRMV6Yd_Pg at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> Hi,
> 
> I am new to BioPerl. I was wondering what`s the best strategy to get
> the genes contained in a a series of human genomic interval.
> Basically, I have a table with:
> 
> Chromosome Start End
> 
> Which module should I be looking at?
> Thanks,
> Claudio
> 


From wenbinmei at gmail.com  Tue Dec  6 17:54:51 2011
From: wenbinmei at gmail.com (wenbin mei)
Date: Tue, 6 Dec 2011 17:54:51 -0500
Subject: [Bioperl-l] revcom the multiple sequence alignment
In-Reply-To: <CAAC48B9-371B-4446-897B-B1D8968CA7EA@gmail.com>
References: <CAHdrE2RfvOCy_3aUP_nD9Gp=VSmqLh52wh0vQmgKmaebVgyL-w@mail.gmail.com>
	<CAAC48B9-371B-4446-897B-B1D8968CA7EA@gmail.com>
Message-ID: <CAHdrE2S33zcuchbSXuH2NwM5gM-=BnxVx9xA13ye18gPi2Mtcg@mail.gmail.com>

Figured out! Thanks for help.

-best,
wenbin


On Tue, Dec 6, 2011 at 12:40 PM, Jason Stajich <jason.stajich at gmail.com>wrote:

> I think this would work to update it in place though I haven't tried it
> myself
>
> for my $seq ( $aln->each_seq ) {
>  $seq->seq( $seq->revcom->seq );
> }
> $out->write_aln($aln);
>
> This may also work - not entirely sure if there is any extra work done on
> the meta data (start/end) of the Seq object when this is done.  You may
> want to flip start/end for the sequences (the seqs are Bio::LocatableSeq
> objects) explicitly if not. Or you may not care about those data and can
> ignore.
>
>   $seq = $seq->revcom
>
> Jason
> On Dec 6, 2011, at 8:09 AM, wenbin mei wrote:
>
> > Hi,
> >
> > I have a question about revcom the multiple sequence alignment. One way I
> > can do convert the format into fasta and revcom individual sequences. I
> > wonder is there a easy way to convert the multiple sequence alignment as
> a
> > whole.  Thank you for help.
> >
> > -best,
> > wenbin
>
>


-- 
Wenbin Mei Ph.D. Student
Dr. Brad Barbazuk's Lab
Department of Biology
University of Florida
509-899-3067
wmei at ufl.edu <wmei at ufl.edu>

From momin.amin at gmail.com  Tue Dec  6 12:37:16 2011
From: momin.amin at gmail.com (Amin Momin)
Date: Tue, 6 Dec 2011 11:37:16 -0600
Subject: [Bioperl-l] SimpleAlign and consensus_string
In-Reply-To: <4A59F912-BEE6-48D8-91EC-EAACB293A574@gmail.com>
References: <a08b3694-1cc6-4f5d-ad80-1306c5b777f7@v5g2000yqn.googlegroups.com>
	<4A59F912-BEE6-48D8-91EC-EAACB293A574@gmail.com>
Message-ID: <CAA0DaRhm+jsPpFFYR6q2xj0YOkYy3Enh8rrRD-YQJ26z_U+Fkw@mail.gmail.com>

Thanks Jason


On Mon, Dec 5, 2011 at 5:58 PM, Jason Stajich <jason.stajich at gmail.com> wrote:
> There are several methods that do related things.
>
> Below is the documentation for the method - it allows a more strict calling besides majority rule voting for the position - if you set the cutoff at 50% then the called AA or NT has to appear at least 50% of the time. There is also an IUPAC consensus string which is useful for DNA consensus generating patterns.
>
> If you want a summary as to whether or not there are only conservative amino acid changes in the column use match_line method which generates the similar string that CLUSTALW reports as a summary string for each column.
>
> =head2 consensus_string
>
>  Title     : consensus_string
>  Usage     : $str = $ali->consensus_string($threshold_percent)
>  Function  : Makes a strict consensus
>  Returns   : Consensus string
>  Argument  : Optional threshold ranging from 0 to 100.
>              The consensus residue has to appear in at least threshold %
>              of the sequences at a given location, otherwise a '?'
>              character will be placed at that location.
>              (Default value = 0%)
>
> =cut
>
> On Dec 5, 2011, at 3:00 PM, Amin Momin wrote:
>
>> Hi ,
>>
>> I am generating a consensus sequence by aligning two protein homologs
>> using Bio::Tools::Run::Alignment::TCoffee. However, I am unable to
>> understand the criteria consensus_string() method of simpleAlign uses
>> to determine the consensus at position with dissimilar aminoacids/
>> nucleotide. Also how would the % cutoffs provided to
>> consensus_string() affect the outcome.
>>
>>
>> Thanks,
>> Amin
>


From sunwukong at potc.net  Wed Dec  7 14:05:20 2011
From: sunwukong at potc.net (sunwukong)
Date: Wed, 07 Dec 2011 11:05:20 -0800
Subject: [Bioperl-l] DNA Sequencing two questions
Message-ID: <4EDFB8F0.8080001@potc.net>

I am not a medical professional but I have two DNA related questions.

A year or so ago I realized that if the standard building blocks of life 
were the nucleotides G, A, T and C, then they could be represented as a base 4 
number system (e.g., 0,1,2 and 3).  Then any life form could be 
represented by a number (it would be very long).  So I set out on a 
quest to do this with a small life form.  For fun I chose the Spanish 
Flu which I believe I found on an NIH site.  Then I set out and realized 
that there was no standard.  And I did not know if the number would be 
built with the most significant digit on the left or right.

1.  Is there a standard method for representing the GATC molecules as 
numbers
g = 0
a = 1
t  = 2
c = 3

2. is the sequence read left to right or right to left?

note:  It may be biologically significant if the right values are 
assigned to the letters GATC, there could be a pattern somewhere that 
holds significant information.  One idea might be to look at DNA 
sequences in bases other than 4 to see if something jumps out.
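
For illustration, the encoding described above can be sketched in a few lines of Perl. This is only a sketch under stated assumptions: it uses the digit assignment proposed in this message (g=0, a=1, t=2, c=3), treats the leftmost base as the most significant digit (sequence files conventionally list bases 5' to 3', left to right), and dna_to_number is a hypothetical helper name. Math::BigInt ships with Perl and keeps the arithmetic exact for long sequences.

```perl
use strict;
use warnings;
use Math::BigInt;

# Digit assignment taken from the question; any other assignment works too.
my %digit = (g => 0, a => 1, t => 2, c => 3);

# Hypothetical helper: read a DNA string as one base-4 number,
# leftmost base = most significant digit.
sub dna_to_number {
    my ($dna) = @_;
    my $n = Math::BigInt->new(0);
    for my $base (split //, lc $dna) {
        die "unexpected base '$base'\n" unless exists $digit{$base};
        $n->bmul(4)->badd($digit{$base});    # shift one digit left, add new one
    }
    return $n;
}

print dna_to_number('GATC'), "\n";    # 0*64 + 1*16 + 2*4 + 3 = 27
```

One caveat with this scheme: leading bases that map to 0 vanish from the number, so the sequence length has to be stored separately to decode it again.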

http://www.insectscience.org/2.10/ref/fig5a.gif

VR
Pat Kirol
509 442-2214

From Russell.Smithies at agresearch.co.nz  Wed Dec  7 16:59:18 2011
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Thu, 8 Dec 2011 10:59:18 +1300
Subject: [Bioperl-l] DNA Sequencing two questions
In-Reply-To: <4EDFB8F0.8080001@potc.net>
References: <4EDFB8F0.8080001@potc.net>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>

I did something similar a few years ago (after watching the movie "Contact" I think) and encoded codons as RGB values and drew an image of a genome. Looked much like random noise but I might try it again and draw as a space filling curve.
I guess if you're looking for "hidden messages", why restrict yourself to 2 dimensions?  Perhaps something pops out as a single-image stereogram eg. http://upload.wikimedia.org/wikipedia/commons/8/8f/Stereogram_Tut_Random_Dot_Shark.png
Perhaps it's a 3D "object" represented by slices drawn in a series of 2D planes?

But you need a bit of biological background as there will be patterns simply because of the way genes "work" and are laid out in chromosomes. You need to remember that DNA is effectively a 2D representation of a 3D protein structure and there is already much hidden information we know we don't understand - a "simple" task like how proteins fold is barely understood and why some become prions is still a mystery.

But don't let this stop you uncovering the great secret hidden in our genes :-)

On a similar note, have a look at http://medgadget.com/2011/10/send-your-secret-message-hidden-in-bacteria.html

--Russell

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of sunwukong
> Sent: Thursday, 8 December 2011 8:05 a.m.
> To: bioperl-l at bioperl.org
> Subject: [Bioperl-l] DNA Sequencing two questions
> 
> I am not a medical professional but I have two DNA related questions.
> 
> A year or so ago I realized that if the standard building blocks of life were the
> amino acids GATC then they could be represented as a base 4 number
> system (e.g., 0,1,2 and 3).  Then any life form could be represented by a
> number (it would be very long).  So I set out on a quest to do this with a small
> life form.  For fun I chose the Spanish Flu which I believe I found on an NIH
> site.  Then I set out and realized that there was no standard.  And I did not
> know if the number would be built with the most significant digit on the left
> or right.
> 
> 1.  Is there a standard method for representing the ATCD molecules as
> numbers g = 0 a = 1 t  = 2 c = 3
> 
> 2. is the sequence read left to right or right to left?
> 
> note:  It may be biologically significant if the right values are assigned to the
> letters GATC, there could be a pattern somewhere that holds significant
> information.  One idea might be to look at DNA sequences in bases other
> than 4 to see if something jumps out.
> 
> http://www.insectscience.org/2.10/ref/fig5a.gif
> 
> VR
> Pat Kirol
> 509 442-2214
=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================


From jason.stajich at gmail.com  Wed Dec  7 17:53:10 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Wed, 7 Dec 2011 14:53:10 -0800
Subject: [Bioperl-l] DNA Sequencing two questions
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>
References: <4EDFB8F0.8080001@potc.net>
	<18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>
Message-ID: <9BDA2BFA-264F-45B2-9234-A6BC9402FBA8@gmail.com>

For other fun picture games -- 

You can look at patterns of motifs/words in a chaos game representation of genomes.
http://mbe.oxfordjournals.org/content/16/10/1391.long
http://mbe.oxfordjournals.org/content/20/6/901.long
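
As a rough sketch of what a chaos game representation computes: start in the middle of the unit square and, for each base, move halfway towards that base's corner. The corner assignment below is one common convention and is an assumption here (check the papers above for the one they use); cgr_points is a hypothetical helper name.

```perl
use strict;
use warnings;

# Assumed corner assignment; other layouts are used as well.
my %corner = (A => [0, 0], C => [0, 1], G => [1, 1], T => [1, 0]);

# Hypothetical helper: return one [x, y] point per recognised base.
sub cgr_points {
    my ($dna) = @_;
    my ($x, $y) = (0.5, 0.5);    # start at the centre of the square
    my @points;
    for my $base (split //, uc $dna) {
        my $c = $corner{$base} or next;    # skip N and other ambiguity codes
        ($x, $y) = (($x + $c->[0]) / 2, ($y + $c->[1]) / 2);
        push @points, [$x, $y];
    }
    return @points;
}

printf "%.5f\t%.5f\n", @$_ for cgr_points('GATC');
```

Plotting the points for a whole genome gives the picture; frequent words show up as dense sub-squares.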


On Dec 7, 2011, at 1:59 PM, Smithies, Russell wrote:

> I did something similar a few years ago (after watching the movie "Contact" I think) and encoded codons as RGB values and drew an image of a genome. Looked much like random noise but I might try it again and draw as a space filling curve.
> I guess if you're looking for "hidden messages", why restrict yourself to 2 dimensions?  Perhaps something pops out as a single-image stereogram eg. http://upload.wikimedia.org/wikipedia/commons/8/8f/Stereogram_Tut_Random_Dot_Shark.png
> Perhaps it's a 3D "object" represented by slices drawn in a series of 2D planes?
> 
> But you need a bit of biological background as there will be patterns simply because of the way genes "work" and are laid out in chromosomes. You need to remember that DNA is effectively a 2D representation of a 3D protein structure and there is already much hidden information we know we don't understand - a "simple" task like how proteins fold is barely understood and why some become prions is still a mystery.
> 
> But don't let this stop you uncovering the great secret hidden in our genes :-)
> 
> On a similar note, have a look at http://medgadget.com/2011/10/send-your-secret-message-hidden-in-bacteria.html
> 
> --Russell
> 
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>> bounces at lists.open-bio.org] On Behalf Of sunwukong
>> Sent: Thursday, 8 December 2011 8:05 a.m.
>> To: bioperl-l at bioperl.org
>> Subject: [Bioperl-l] DNA Sequencing two questions
>> 
>> I am not a medical professional but I have two DNA related questions.
>> 
>> A year or so ago I realized that if the standard building blocks of life were the
>> amino acids GATC then they could be represented as a base 4 number
>> system (e.g., 0,1,2 and 3).  Then any life form could be represented by a
>> number (it would be very long).  So I set out on a quest to do this with a small
>> life form.  For fun I chose the Spanish Flu which I believe I found on an NIH
>> site.  Then I set out and realized that there was no standard.  And I did not
>> know if the number would be built with the most significant digit on the left
>> or right.
>> 
>> 1.  Is there a standard method for representing the ATCD molecules as
>> numbers g = 0 a = 1 t  = 2 c = 3
>> 
>> 2. is the sequence read left to right or right to left?
>> 
>> note:  It may be biologically significant if the right values are assigned to the
>> letters GATC, there could be a pattern somewhere that holds significant
>> information.  One idea might be to look at DNA sequences in bases other
>> than 4 to see if something jumps out.
>> 
>> http://www.insectscience.org/2.10/ref/fig5a.gif
>> 
>> VR
>> Pat Kirol
>> 509 442-2214


From Russell.Smithies at agresearch.co.nz  Wed Dec  7 19:29:47 2011
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Thu, 8 Dec 2011 13:29:47 +1300
Subject: [Bioperl-l] DNA Sequencing two questions
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>
References: <4EDFB8F0.8080001@potc.net>
	<18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF340186CF245@exchsth.agresearch.co.nz>

I tried again and came up with this:
http://www.bioperl.org/w/images/7/7a/Autostereogram.png
If you look carefully, you can see the answer to life, the universe, and everything!!

--Russell

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Smithies, Russell
> Sent: Thursday, 8 December 2011 10:59 a.m.
> To: 'sunwukong'; bioperl-l at bioperl.org
> Subject: Re: [Bioperl-l] DNA Sequencing two questions
> 
> I did something similar a few years ago (after watching the movie "Contact" I
> think) and encoded codons as RGB values and drew an image of a genome.
> Looked much like random noise but I might try it again and draw as a space
> filling curve.
> I guess if you're looking for "hidden messages", why restrict yourself to 2
> dimensions?  Perhaps something pops out as a single-image stereogram eg.
> http://upload.wikimedia.org/wikipedia/commons/8/8f/Stereogram_Tut_Ra
> ndom_Dot_Shark.png
> Perhaps it's a 3D "object" represented by slices drawn in a series of 2D
> planes?
> 
> But you need a bit of biological background as there will be patterns simply
> because of the way genes "work" and are laid out in chromosomes. You
> need to remember that DNA is effectively a 2D representation of a 3D
> protein structure and there is already much hidden information we know we
> don't understand - a "simple" task like how proteins fold is barely understood
> and why some become prions is still a mystery.
> 
> But don't let this stop you uncovering the great secret hidden in our genes :-)
> 
> On a similar note, have a look at http://medgadget.com/2011/10/send-your-
> secret-message-hidden-in-bacteria.html
> 
> --Russell
> 
> > -----Original Message-----
> > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> > bounces at lists.open-bio.org] On Behalf Of sunwukong
> > Sent: Thursday, 8 December 2011 8:05 a.m.
> > To: bioperl-l at bioperl.org
> > Subject: [Bioperl-l] DNA Sequencing two questions
> >
> > I am not a medical professional but I have two DNA related questions.
> >
> > A year or so ago I realized that if the standard building blocks of
> > life were the amino acids GATC then they could be represented as a
> > base 4 number system (e.g., 0,1,2 and 3).  Then any life form could be
> > represented by a number (it would be very long).  So I set out on a
> > quest to do this with a small life form.  For fun I chose the Spanish
> > Flu which I believe I found on an NIH site.  Then I set out and
> > realized that there was no standard.  And I did not know if the number
> > would be built with the most significant digit on the left or right.
> >
> > 1.  Is there a standard method for representing the ATCD molecules as
> > numbers g = 0 a = 1 t  = 2 c = 3
> >
> > 2. is the sequence read left to right or right to left?
> >
> > note:  It may be biologically significant if the right values are
> > assigned to the letters GATC, there could be a pattern somewhere that
> > holds significant information.  One idea might be to look at DNA
> > sequences in bases other than 4 to see if something jumps out.
> >
> > http://www.insectscience.org/2.10/ref/fig5a.gif
> >
> > VR
> > Pat Kirol
> > 509 442-2214


From lumos.lumos.lumos at gmail.com  Fri Dec  9 11:47:36 2011
From: lumos.lumos.lumos at gmail.com (lumos lumos)
Date: Fri, 9 Dec 2011 08:47:36 -0800
Subject: [Bioperl-l] Mouse->Human homologues ?
Message-ID: <CAJbewumvg+iQ787GYQTrL8bLRqPGb6T3jNzw2GZJQAVwZU+aAA@mail.gmail.com>

Hello,

Is there a way to get human homologues for a mouse gene list where I get
all human genes(symbols) as text output ?

Thank you
LM

From cjfields at illinois.edu  Fri Dec  9 12:17:20 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Fri, 9 Dec 2011 17:17:20 +0000
Subject: [Bioperl-l] Mouse->Human homologues ?
In-Reply-To: <CAJbewumvg+iQ787GYQTrL8bLRqPGb6T3jNzw2GZJQAVwZU+aAA@mail.gmail.com>
References: <CAJbewumvg+iQ787GYQTrL8bLRqPGb6T3jNzw2GZJQAVwZU+aAA@mail.gmail.com>
Message-ID: <C19C1032-5FE7-400A-AEB5-152EEFC4C435@illinois.edu>

There are lots of databases that have this capability (ensembl, orthodb, homologene, oma, to name only a few).  Have you tried a simple search for this, or did you want expert opinion on the matter?  

chris

PS - Just to note, there is a lot of controversy swirling about re: the ortholog conjecture and some recently published papers calling it into question using human-mouse data, worth a look if you're trotting this path to know the current situation.  If you have access to F1000, see the following (paper itself is open :)

Faculty of 1000 evaluations, dissents and comments for [Nehrt NL et al. Testing the ortholog conjecture with comparative functional genomic data from mammals. PLoS Comput Biol. 2011 Jun; 7(6):e1002073; doi: 10.1371/journal.pcbi.1002073]. Faculty of 1000, 31 Aug 2011. F1000.com/12462957

On Dec 9, 2011, at 10:47 AM, lumos lumos wrote:

> Hello,
> 
> Is there a way to get human homologues for a mouse gene list where I get
> all human genes(symbols) as text output ?
> 
> Thank you
> LM


From lumos.lumos.lumos at gmail.com  Fri Dec  9 12:29:24 2011
From: lumos.lumos.lumos at gmail.com (lumos lumos)
Date: Fri, 9 Dec 2011 09:29:24 -0800
Subject: [Bioperl-l] Mouse->Human homologues ?
In-Reply-To: <C19C1032-5FE7-400A-AEB5-152EEFC4C435@illinois.edu>
References: <CAJbewumvg+iQ787GYQTrL8bLRqPGb6T3jNzw2GZJQAVwZU+aAA@mail.gmail.com>
	<C19C1032-5FE7-400A-AEB5-152EEFC4C435@illinois.edu>
Message-ID: <CAJbewukt_xCCpQaWTsvqi2z1NkbsTZRG6xXJUcZhcK5jdAZhWQ@mail.gmail.com>

Hi Chris,

Thanks for your reply. I wanted to know if there is any way to do it
automatically, via a Perl script, for a list of mouse genes whose human
homologues I require.
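
A minimal sketch of one way such a script could look, assuming a tab-delimited homology table whose first four columns are group ID, NCBI taxonomy ID, gene ID and gene symbol (NCBI's HomoloGene download is laid out roughly like this, but check the file you actually use). The rows and IDs below are made up for illustration, and human_homologues is a hypothetical helper name.

```perl
use strict;
use warnings;

# Made-up rows in the assumed layout: group, taxon, gene ID, symbol.
my @rows = (
    "1\t10090\t101\tBrca1",
    "1\t9606\t201\tBRCA1",
    "2\t10090\t102\tTnf",
    "2\t9606\t202\tTNF",
    "3\t10090\t103\tOmg",
    "3\t9606\t203\tOMG",
);

my ($MOUSE, $HUMAN) = (10090, 9606);    # NCBI taxonomy IDs

# Hypothetical helper: map each mouse symbol to the human symbols
# that share its homology group (empty list if none is found).
sub human_homologues {
    my ($rows, @mouse_symbols) = @_;
    my (%group_of, %human_in);
    for my $row (@$rows) {
        my ($group, $taxon, undef, $symbol) = split /\t/, $row;
        if    ($taxon == $MOUSE) { $group_of{$symbol} = $group }
        elsif ($taxon == $HUMAN) { push @{ $human_in{$group} }, $symbol }
    }
    my %result;
    for my $sym (@mouse_symbols) {
        my $group = $group_of{$sym};
        $result{$sym} = defined $group ? ($human_in{$group} // []) : [];
    }
    return \%result;
}

my @query = qw(Brca1 Tnf Xyz1);    # Xyz1 is deliberately not in the table
my $hits  = human_homologues(\@rows, @query);
for my $sym (@query) {
    print join("\t", $sym, join(',', @{ $hits->{$sym} }) || '-'), "\n";
}
```

In practice @rows would be read from the downloaded table and @query from the mouse gene list; the output is plain tab-delimited text.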

LM

On Fri, Dec 9, 2011 at 9:17 AM, Fields, Christopher J <cjfields at illinois.edu
> wrote:

> There are lots of databases that have this capability (ensembl, orthodb,
> homologene, oma, to name only a few).  Have you tried a simple search for
> this, or did you want expert opinion on the matter?
>
> chris
>
> PS - Just to note, there is a lot of controversy swirling about re: the
> ortholog conjecture and some recently published papers calling it into
> question using human-mouse data, worth a look if you're trotting this path
> to know the current situation.  If you have access to F1000, see the
> following (paper itself is open :)
>
> Faculty of 1000 evaluations, dissents and comments for [Nehrt NL et al.
> Testing the ortholog conjecture with comparative functional genomic data
> from mammals. PLoS Comput Biol. 2011 Jun; 7(6):e1002073; doi:
> 10.1371/journal.pcbi.1002073]. Faculty of 1000, 31 Aug 2011.
> F1000.com/12462957
>
> On Dec 9, 2011, at 10:47 AM, lumos lumos wrote:
>
> > Hello,
> >
> > Is there a way to get human homologues for a mouse gene list where I get
> > all human genes(symbols) as text output ?
> >
> > Thank you
> > LM
>
>

From lumos.lumos.lumos at gmail.com  Wed Dec  7 23:47:19 2011
From: lumos.lumos.lumos at gmail.com (lumos lumos)
Date: Wed, 7 Dec 2011 20:47:19 -0800
Subject: [Bioperl-l] Perl parsing
Message-ID: <CAJbewunJa-_FVfjiOjZqZrv_h9RcZuCNpjGviWDM6O0mXjQQ=A@mail.gmail.com>

Hello,

I have a text file(tab-delim) with some gene names as shown below.

*BRCA1: breast cancer 1, early onset

TNF: tumor necrosis factor

OMG: oligodendrocyte myelin glycoprotein*

I would like to get the list of gene name BRCA1,TNF,OMG that is before the
colon(:) .
How do I parse in perl this text file with this list of genes?
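
A minimal sketch of the task as described, in plain Perl with no BioPerl involved. It assumes each record looks like the lines above ("SYMBOL: description"), tolerates the leading "*" seen in the first line, and skips blank lines; gene_symbols is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: return whatever precedes the first colon on each line.
sub gene_symbols {
    my @symbols;
    for my $line (@_) {
        # optional leading '*', then the symbol (no colon, no whitespace)
        push @symbols, $1 if $line =~ /^\s*\*?([^:\s]+)\s*:/;
    }
    return @symbols;
}

my @lines = (
    '*BRCA1: breast cancer 1, early onset',
    '',
    'TNF: tumor necrosis factor',
    '',
    'OMG: oligodendrocyte myelin glycoprotein*',
);
print join(',', gene_symbols(@lines)), "\n";    # BRCA1,TNF,OMG
```

For a real file, open it and pass the lines read from the filehandle instead of the hard-coded list.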

Thanks in advance.
LM

From b.m.forde at umail.ucc.ie  Fri Dec  9 11:52:56 2011
From: b.m.forde at umail.ucc.ie (BForde)
Date: Fri, 9 Dec 2011 08:52:56 -0800 (PST)
Subject: [Bioperl-l]  Genbank files
Message-ID: <32941955.post@talk.nabble.com>


Hello all,

I am new to BioPerl so I apologise if this is a stupid question.

For CDS features I wish to add additional qualifiers, e.g. /colour and /note
qualifiers. I have looked at the BioPerl wiki but am still unsure how to
do this.

regards

Brian
-- 
View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32941955.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From jboddu at illinois.edu  Fri Dec  9 14:59:39 2011
From: jboddu at illinois.edu (Boddu, Jayanand)
Date: Fri, 9 Dec 2011 19:59:39 +0000
Subject: [Bioperl-l] Batch processing of Data
Message-ID: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>

Hi Anyone:
Please let me know if the following is practical with PERL.
My data output can be described as follows.

1.       Hundreds of samples are run.

2.       A batch output sends data from each sample to its own "folder". Output is in the form of a few text files, spreadsheets and PDF files.

3.       One of the spreadsheets has the data of most interest.

4.       This means I end up having hundreds of folders.

5.       The spreadsheet with the data has multiple worksheets, of which a couple have the interesting data to be processed (please find attached a spreadsheet output in which the data is organized; the worksheets of interest are named "Compound" and "Peak", and the yellow-highlighted columns in each worksheet have the data to be processed).
OK. That's a long description.
NOW. Is it practical to write a PERL (or any other) script to:

1.       Enter each folder.

2.       Look for the spreadsheet of interest.

3.       Look for worksheets named "Compound" and "Peak".

4.       Look for the specific columns of interest.

5.       Copy paste the columns of interest into a new spreadsheet/text file with data from each folder next to each other.

This final spreadsheet will pass through a bunch of other calculations.

I apologize for this long and painful description.
However, it would be great if this can be done.
Thanks
Jay
-------------- next part --------------
A non-text attachment was scrubbed...
Name: REPORT01.xls
Type: application/vnd.ms-excel
Size: 93696 bytes
Desc: REPORT01.xls
URL: <http://lists.open-bio.org/pipermail/bioperl-l/attachments/20111209/0528887b/attachment-0001.xls>

From cjfields at illinois.edu  Fri Dec  9 15:37:48 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Fri, 9 Dec 2011 20:37:48 +0000
Subject: [Bioperl-l] Perl parsing
In-Reply-To: <CAJbewunJa-_FVfjiOjZqZrv_h9RcZuCNpjGviWDM6O0mXjQQ=A@mail.gmail.com>
References: <CAJbewunJa-_FVfjiOjZqZrv_h9RcZuCNpjGviWDM6O0mXjQQ=A@mail.gmail.com>
Message-ID: <E70D9F7E-7419-47AE-A0F9-3FB2B7B25907@illinois.edu>

On Dec 7, 2011, at 10:47 PM, lumos lumos wrote:

> Hello,
> 
> I have a text file(tab-delim) with some gene names as shown below.
> 
> *BRCA1: breast cancer 1, early onset
> 
> TNF: tumor necrosis factor
> 
> OMG: oligodendrocyte myelin glycoprotein*
> 
> I would like to get the list of gene name BRCA1,TNF,OMG that is before the
> colon(:) .
> How do I parse in perl this text file with this list of genes?

'Very carefully?'

Okay, I'll try to refrain from further sarcasm, but I'm confused, what does this have to do with BioPerl (*the toolkit*) specifically?  That is what this mailing list is for.  

Just to note, this is a very common perl task. The answer is attainable by searching for it (not to mention taking the time to learn basic perl).  For instance:

   http://stackoverflow.com/questions/4500407/in-perl-how-can-i-correctly-parse-tab-space-delimited-files-with-quoted-strings

One of the many links found by simply using Google:

   http://lmgtfy.com/?q=perl+parse+tab+file

I'll leave the regex munging to you.  

(okay, I failed at refraining from sarcasm, ah well it's friday).

chris


> Thanks in advance.
> LM


From jason.stajich at gmail.com  Fri Dec  9 16:18:38 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Fri, 9 Dec 2011 13:18:38 -0800
Subject: [Bioperl-l] Genbank files
In-Reply-To: <32941955.post@talk.nabble.com>
References: <32941955.post@talk.nabble.com>
Message-ID: <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>

$feature->add_tag_value('color','blue');

On Dec 9, 2011, at 8:52 AM, BForde wrote:

> 
> Hello all,
> 
> I am new to Bioperl so I apologise if this is stupid question. 
> 
> For CDS features I which to add additional qualifiers e.g. /colour and /note
> qualifiers. I have looked at the BioPerl wiki but am still unsure as how to
> do this?
> 
> regards
> 
> Brian

Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org


From bosborne11 at verizon.net  Fri Dec  9 15:31:15 2011
From: bosborne11 at verizon.net (Brian Osborne)
Date: Fri, 09 Dec 2011 15:31:15 -0500
Subject: [Bioperl-l] Genbank files
In-Reply-To: <32941955.post@talk.nabble.com>
References: <32941955.post@talk.nabble.com>
Message-ID: <3AB0716B-EC07-4BD6-9FC8-0C47A29FC0BA@verizon.net>

Brian,

Reasonable question. Start here:

http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

If you've never used Bioperl then:

http://www.bioperl.org/wiki/HOWTO:Beginners

Brian


On Dec 9, 2011, at 11:52 AM, BForde wrote:

> 
> Hello all,
> 
> I am new to Bioperl so I apologise if this is stupid question. 
> 
> For CDS features I which to add additional qualifiers e.g. /colour and /note
> qualifiers. I have looked at the BioPerl wiki but am still unsure as how to
> do this?
> 
> regards
> 
> Brian


From asjo at koldfront.dk  Fri Dec  9 17:25:00 2011
From: asjo at koldfront.dk (Adam =?iso-8859-1?Q?Sj=F8gren?=)
Date: Fri, 09 Dec 2011 23:25:00 +0100
Subject: [Bioperl-l] Batch processing of Data
References: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>
Message-ID: <871usdpemb.fsf@topper.koldfront.dk>

On Fri, 9 Dec 2011 19:59:39 +0000, Boddu, wrote:

> Please let me know if the following is practical with PERL.

It might very well be, yes.

Modules you might be interested in include Spreadsheet::ParseExcel,
Spreadsheet::XLSX, Spreadsheet::WriteExcel and Excel::Writer::XLSX [1].

A big help in finding interesting CPAN modules is the search engine on
https://metacpan.org/

Depending on your platform and preference using find(1) might also be
helpful to traverse the folders, rather than doing so in Perl.

Note that none of this has anything to do with BioPerl as such, though,
and you'll need to do some actual programming to get the job done.


  Best regards,

    Adam


[1] http://blogs.perl.org/users/john_mcnamara/2011/10/spreadsheetwriteexcel-is-dead-long-live-excelwriterxlsx.html

-- 
 "Angels can fly because they take themselves lightly."       Adam Sjøgren
                                                         asjo at koldfront.dk


From David.Messina at sbc.su.se  Fri Dec  9 17:30:23 2011
From: David.Messina at sbc.su.se (Dave Messina)
Date: Fri, 9 Dec 2011 23:30:23 +0100
Subject: [Bioperl-l] Batch processing of Data
In-Reply-To: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>
References: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>
Message-ID: <CAM3TQQWMnqwShBhYQWH9iZqtDphVFYLYtuVWEMxxfVY1OqSbhg@mail.gmail.com>

Yes, it can be done. However, it has nothing to do with this mailing list.

Steps 1 and 2 are basic Perl.
For steps 3 through 5, try googling "perl parse excel".
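
A rough sketch of how the five steps could fit together, assuming the reports are .xls files (as the attachment name suggests) and using Spreadsheet::ParseExcel, one of the CPAN modules suggested earlier in this thread. The column numbers (0 and 1) are placeholders, since the highlighted columns are not known here, and find_reports and read_columns are hypothetical helper names. For simplicity it writes one tab-delimited line per spreadsheet row rather than placing folders side by side.

```perl
use strict;
use warnings;
use File::Find;

# Steps 1-2: walk every sample folder under $root and collect the .xls files.
sub find_reports {
    my ($root) = @_;
    my @found;
    find(sub { push @found, $File::Find::name if -f $_ && /\.xls\z/i }, $root);
    my @sorted = sort @found;
    return @sorted;
}

# Steps 3-4: pull the requested columns (0-based) from one named worksheet.
sub read_columns {
    my ($file, $sheet_name, @cols) = @_;
    require Spreadsheet::ParseExcel;    # CPAN module, not part of core Perl
    my $workbook = Spreadsheet::ParseExcel->new->parse($file)
        or die "cannot parse $file\n";
    my $sheet = $workbook->worksheet($sheet_name) or return;
    my ($row_min, $row_max) = $sheet->row_range;
    my @rows;
    for my $r ($row_min .. $row_max) {
        push @rows, [ map {
            my $cell = $sheet->get_cell($r, $_);
            defined $cell ? $cell->value : '';
        } @cols ];
    }
    return @rows;
}

# Step 5: print file, worksheet and the chosen columns as tab-delimited text.
if (my $root = shift @ARGV) {
    for my $file (find_reports($root)) {
        for my $sheet_name (qw(Compound Peak)) {
            print join("\t", $file, $sheet_name, @$_), "\n"
                for read_columns($file, $sheet_name, 0, 1);    # placeholder columns
        }
    }
}
```

Run it with the top-level results folder as its only argument and redirect the output to a text file, which can then be opened in a spreadsheet for the later calculations.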


Dave


On Fri, Dec 9, 2011 at 20:59, Boddu, Jayanand <jboddu at illinois.edu> wrote:

> Hi Anyone:
> Please let me know if the following is practical with PERL.
> My data output can be described as following.
>
> 1.       Hundreds of samples are run.
>
> 2.       A batch output sends data from each sample to its own "folder".
> Output is in the form of few text files, spreadsheets and PDF files.
>
> 3.       One of the spreadsheet has the data of most interest.
>
> 4.       This means I end up having hundreds of folders.
>
> 5.       The spreadsheet with the data has multiple worksheets out of
> which a couple have the interesting data to be processed (Please find
> attached a spreadsheet output in which the data is organized and the
> worksheets of my interest are named as "Compound" and "Peak". Yellow
> high-lighted columns in each worksheet has the data to be processed).
> OK. That's long description.
> NOW. Is it practical to write a PERL/or any script to;
>
> 1.       Enter each folder.
>
> 2.       Look for the spreadsheet of interest.
>
> 3.       Look for worksheets named "Compound" and "Peak".
>
> 4.       Look for the specific columns of interest.
>
> 5.       Copy paste the columns of interest into a new spreadsheet/text
> file with data from each folder next to each other.
>
> This final spreadsheet will pass through a bunch of other calculations.
>
> I apologize for this long and painful description.
> However, it would be great if this can be done.
> Thanks
> Jay
>
>

From lsbrath at gmail.com  Sat Dec 10 16:39:44 2011
From: lsbrath at gmail.com (Mgavi Brathwaite)
Date: Sat, 10 Dec 2011 16:39:44 -0500
Subject: [Bioperl-l] Perl parsing
In-Reply-To: <E70D9F7E-7419-47AE-A0F9-3FB2B7B25907@illinois.edu>
References: <CAJbewunJa-_FVfjiOjZqZrv_h9RcZuCNpjGviWDM6O0mXjQQ=A@mail.gmail.com>
	<E70D9F7E-7419-47AE-A0F9-3FB2B7B25907@illinois.edu>
Message-ID: <CAJm=ba98HUgAB1kUG29_KA+ZvNWP_AsHoJQNPQ-_Fe=Pa7b74Q@mail.gmail.com>

Yes grasshopper you have to suffer a little bit. Learn Perl first, then
step up to BioPerl. Chris I feel you concerning the power of Regex, and the
sarcasm.

Lom

On Fri, Dec 9, 2011 at 3:37 PM, Fields, Christopher J <cjfields at illinois.edu
> wrote:

> On Dec 7, 2011, at 10:47 PM, lumos lumos wrote:
>
> > Hello,
> >
> > I have a text file(tab-delim) with some gene names as shown below.
> >
> > *BRCA1: breast cancer 1, early onset
> >
> > TNF: tumor necrosis factor
> >
> > OMG: oligodendrocyte myelin glycoprotein*
> >
> > I would like to get the list of gene name BRCA1,TNF,OMG that is before
> the
> > colon(:) .
> > How do I parse in perl this text file with this list of genes?
>
> 'Very carefully?'
>
> Okay, I'll try to refrain from further sarcasm, but I'm confused, what
> does this have to do with BioPerl (*the toolkit*) specifically?  That is
> what this mailing list is for.
>
> Just to note, this is a very common perl task. The answer is attainable by
> searching for it (not to mention taking the time to learn basic perl).  For
> instance:
>
>
> http://stackoverflow.com/questions/4500407/in-perl-how-can-i-correctly-parse-tab-space-delimited-files-with-quoted-strings
>
> One of the many links found by simply using Google:
>
>   http://lmgtfy.com/?q=perl+parse+tab+file
>
> I'll leave the regex munging to you.
>
> (okay, I failed at refraining from sarcasm, ah well it's friday).
>
> chris
>
>
> > Thanks in advance.
> > LM
>
>
>
>

From pawan.mani2 at gmail.com  Mon Dec  5 17:00:09 2011
From: pawan.mani2 at gmail.com (pawan.mani2 at gmail.com)
Date: Tue, 6 Dec 2011 03:30:09 +0530
Subject: [Bioperl-l] bioperl in cygwin
Message-ID: <7B046189155547CEB23F0846D7577D1F@PAWANKUMARPC>

Hi
     I would like to install BioPerl under Cygwin. I gave the following commands in the Cygwin terminal:


 perl -MCPAN -e shell

then I typed

    o conf prerequisites_policy follow
    o conf commit
    install Bundle::CPAN
    install Module::Build
    d /bioperl/

which gives a list of different versions.
I selected CJFIELDS/BioPerl-1.6.1.96:

    install CJFIELDS/BioPerl-1.6.1.96.tar.gz


but build.install was not ok.

Kindly let me know the steps for a complete BioPerl installation in Cygwin under Windows 7.

Thanks in advance.

with best regards,
Pawan

From cjfields at illinois.edu  Sun Dec 11 13:22:01 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sun, 11 Dec 2011 18:22:01 +0000
Subject: [Bioperl-l] bioperl in cygwin
In-Reply-To: <7B046189155547CEB23F0846D7577D1F@PAWANKUMARPC>
References: <7B046189155547CEB23F0846D7577D1F@PAWANKUMARPC>
Message-ID: <B674A464-E650-4CBF-B2CE-2100AB0B29B9@illinois.edu>

Pawan,

Hard to say what the problem is w/o supplying warnings/errors.  Prior to doing this, you should try installing BioPerl-1.6.901 (the latest CPAN release).  You can try a direct installation of the distribution, but the easiest way to get the latest version is to try installing Bio::Perl.

(I'm not sure what BioPerl-1.6.1.96 is, but this seems wrong)

chris



From b.m.forde at umail.ucc.ie  Tue Dec 13 06:03:50 2011
From: b.m.forde at umail.ucc.ie (BForde)
Date: Tue, 13 Dec 2011 03:03:50 -0800 (PST)
Subject: [Bioperl-l] Genbank files
In-Reply-To: <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
References: <32941955.post@talk.nabble.com>
	<5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
Message-ID: <32965574.post@talk.nabble.com>


Thank you for the replies.

My script (below) reads in a list of locus_tags from a tab-delimited text
file, compares them to the locus_tags in a GenBank file and,
where they are equal, adds new features.
The line
$feat->add_tag_value()
needs to be defined. In the BioPerl wiki this variable appears to be defined
by giving it coordinates etc. (creating a new feature). I wish to add
features to the CDS key when the locus_tags are identical. Is this possible?

use strict; 
use Bio::SeqIO; 

my @V; 
open (LIST1, 'list') ||die; 
while (<LIST1>){ 
    push @V, (split(/\t/, $_))[0]; 
} 
close(LIST1); 

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb"); 
my $seq_object = $seqio_object->next_seq; 

for my $feat_object ($seq_object->get_SeqFeatures){ 
    if ($feat_object->primary_tag eq "CDS"){ 
        if ($feat_object->has_tag('locus_tag')){ 
            for my $V3 ($feat_object->get_tag_values('locus_tag')){ 
                for my $V1 (@V) { 
                    if ($V1 eq $V3){ 
                        # ADD NEW FEATURES HERE
                    }     
                } 
            } 
        } 
    } 
} 
  
The script works as far as the comparison point, where locus_tags in the
GenBank file "Contig100.gb" are compared against a list of locus_tags from a
tab-delimited text file.


regards 

Brian 

Jason Stajich-5 wrote:
> 
> $feature->add_tag_value('color','blue');
> 
> On Dec 9, 2011, at 8:52 AM, BForde wrote:
> 
>> 
>> Hello all,
>> 
>> I am new to Bioperl so I apologise if this is stupid question. 
>> 
>> For CDS features I which to add additional qualifiers e.g. /colour and
>> /note
>> qualifiers. I have looked at the BioPerl wiki but am still unsure as how
>> to
>> do this?
>> 
>> regards
>> 
>> Brian
>> -- 
>> View this message in context:
>> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>> 
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
> Jason Stajich
> jason.stajich at gmail.com
> jason at bioperl.org
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
> 

-- 
View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32965574.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From roy.chaudhuri at gmail.com  Tue Dec 13 06:52:05 2011
From: roy.chaudhuri at gmail.com (Roy Chaudhuri)
Date: Tue, 13 Dec 2011 11:52:05 +0000
Subject: [Bioperl-l] Genbank files
In-Reply-To: <32965574.post@talk.nabble.com>
References: <32941955.post@talk.nabble.com>
	<5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
	<32965574.post@talk.nabble.com>
Message-ID: <4EE73C65.1080101@gmail.com>

Hi Brian,

Just to check I have understood you, you want to read through a genbank 
file and add additional tags to features which are listed in a 
tab-delimited file of locus tags?

Your code is on the right lines, but it would be much more efficient to 
read your tab-delimited locus_tags into a hash, and check using exists, 
rather than ploughing through the (potentially very long) list of locus 
tags every time. Also, be careful with newlines in your tab file (you 
can safely get rid of them using "chomp"). You can miss out the 
"has_tag" check by using "get_tagset_values" instead of 
"get_tag_values", since the former does not complain if the tag is not 
present. Once you have modified your sequence object, you need to write 
it out to a new file (or STDOUT) using Bio::SeqIO.

Also, just a couple of general points, you should always "use warnings" 
(or even better "use warnings FATAL=>qw(all)") since that can help solve 
many problems, and your code may be easier to read if you don't include 
the word "object" in all your variable names (after all you wouldn't say 
you write on a paper object using a pen object).

use strict;
use warnings FATAL=>qw(all);
use Bio::SeqIO;

# Read the first column of the tab-delimited file into a hash for fast lookup.
open (my $list, '<', 'list') or die $!;
my %V;
while (<$list>){
    chomp;
    $V{(split(/\t/, $_))[0]}=1;
}
close($list);

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;

# Detach every feature, tag the listed CDS features, then re-attach each one.
for my $feat_object ($seq_object->remove_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        for my $V3 ($feat_object->get_tagset_values('locus_tag')){
            if (exists $V{$V3}){
                $feat_object->add_tag_value(listed_in_tab_file=>'yes');
                last;    # one matching locus_tag is enough
            }
        }
    }
    $seq_object->add_SeqFeature($feat_object);
}

# With no -file or -fh argument, the record is written to STDOUT.
Bio::SeqIO->new(-format=>'genbank')->write_seq($seq_object);
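
The hash lookup and the "chomp" advice can also be tried in isolation,
without BioPerl. The following is only a sketch: the locus tags and the
in-memory stand-in for the tab-delimited file are made up for illustration,
and %wanted is a hypothetical name.

```perl
use strict;
use warnings;

# Made-up stand-in for the tab-delimited file: locus tag in the first column.
# The last line has no tab, so without chomp its key would end in "\n".
my $tab_data = "LOC_0001\tfoo\nLOC_0003\tbar\nLOC_0007\n";

# Open the string as an in-memory file so the sketch needs no file on disk.
open(my $list, '<', \$tab_data) or die $!;
my %wanted;
while (my $line = <$list>) {
    chomp $line;                       # strip the trailing newline first
    my ($tag) = split /\t/, $line;     # keep only the first column
    $wanted{$tag} = 1 if defined $tag && length $tag;
}
close($list);

# A single hash probe per tag replaces a loop over the whole list.
for my $tag (qw(LOC_0001 LOC_0002 LOC_0007)) {
    print "$tag: ", (exists $wanted{$tag} ? "listed" : "not listed"), "\n";
}
```

Dropping the chomp line makes LOC_0007 report "not listed", which is the
newline problem mentioned above.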

Hope this helps.
Cheers,
Roy.



From b.m.forde at umail.ucc.ie  Tue Dec 13 09:22:01 2011
From: b.m.forde at umail.ucc.ie (Brian Forde)
Date: Tue, 13 Dec 2011 14:22:01 +0000
Subject: [Bioperl-l] Genbank files
In-Reply-To: <4EE73C65.1080101@gmail.com>
References: <32941955.post@talk.nabble.com>
	<5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
	<32965574.post@talk.nabble.com> <4EE73C65.1080101@gmail.com>
Message-ID: <CAJLmuD+0Ts_5hPLL6T2vToY8+oW+PxXHaBiGGKoLXZZoGiBptg@mail.gmail.com>

Hi Roy,

Thank you. That works perfectly. I have to confess that someone else told
me to use hashes but I could not get them to work. Thanks again.

regards

Brian



-- 
Brian Forde
Microbiology Dept.
Bioscience Institute. Room 4.11
University College Cork
Cork
Ireland
tel:+353 21 4901306
email: b.m.forde at umail.ucc.ie

From b.m.forde at umail.ucc.ie  Mon Dec 12 12:20:53 2011
From: b.m.forde at umail.ucc.ie (BForde)
Date: Mon, 12 Dec 2011 09:20:53 -0800 (PST)
Subject: [Bioperl-l] Genbank files
In-Reply-To: <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
References: <32941955.post@talk.nabble.com>
	<5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
Message-ID: <32959999.post@talk.nabble.com>


Thank you for the replies.

I am unsure as to how to use the line below with my script. My script so far
reads

use strict;
use Bio::SeqIO;

my @V;
open (LIST1, 'list') ||die;
while (<LIST1>){
    push @V, (split(/\t/, $_))[0];
}
close(LIST1);

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;

for my $feat_object ($seq_object->get_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        if ($feat_object->has_tag('locus_tag')){
            for my $V3 ($feat_object->get_tag_values('locus_tag')){
                for my $V1 (@V) {
                    if ($V1 eq $V3){
                        # ADD NEW FEATURES HERE
                    }    
                }
            }
        }
    }
}
 
The script works as far as the comparison point, where locus_tags in the
GenBank file "Contig100.gb" are compared against a list of locus_tags from a
tab-delimited text file.
If possible, could you show me how to amend my script so I can add new
features?

regards

Brian

Jason Stajich-5 wrote:
> 
> $feature->add_tag_value('color','blue');

-- 
View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32959999.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From Russell.Smithies at agresearch.co.nz  Tue Dec 13 22:17:02 2011
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Wed, 14 Dec 2011 16:17:02 +1300
Subject: [Bioperl-l] Genbank files
In-Reply-To: <32959999.post@talk.nabble.com>
References: <32941955.post@talk.nabble.com>
	<5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
	<32959999.post@talk.nabble.com>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF340186CF27A@exchsth.agresearch.co.nz>

Something like this:

use strict;
use Bio::SeqIO;

my @V;
open (LIST1, 'list') ||die;
while (<LIST1>){
    push @V, (split(/\t/, $_))[0];
}
close(LIST1);

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;

for my $feat_object ($seq_object->get_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        if ($feat_object->has_tag('locus_tag')){
            for my $V3 ($feat_object->get_tag_values('locus_tag')){
                for my $V1 (@V) {
                    if ($V1 eq $V3){
                        #ADD NEW FEATURES
                        $feat_object->add_tag_value('color','blue');
                    }
                }
            }
        }
    }
}
#write the new annotations
my $io = Bio::SeqIO->new(-format => "genbank", -file => ">new.gb" );
$io->write_seq($seq_object);

Take another look at http://www.bioperl.org/wiki/HOWTO:Feature-Annotation#Building_Your_Own_Sequences

--Russell




From l.m.timmermans at students.uu.nl  Wed Dec 14 10:43:24 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Wed, 14 Dec 2011 16:43:24 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
Message-ID: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>

Hi all,

As already mentioned on IRC, I recently wrote an SFF parser and uploaded it
to CPAN as Bio::SFF. I haven't written a Bio::SeqIO::sff wrapper yet (SFF
entries don't map 1:1 with Bio::Seq objects), but if anyone has the time
to write one I'd be most grateful.

Leon

From p.j.a.cock at googlemail.com  Wed Dec 14 11:03:05 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 14 Dec 2011 16:03:05 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
Message-ID: <CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>

On Wed, Dec 14, 2011 at 3:43 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> Hi all,
>
> As already mentioned on IRC, I recently wrote an SFF parser and uploaded it
> to CPAN as Bio::SFF. I haven't written a Bio::SeqIO::sff wrapper yet (SFF
> entries don't map 1:1 with Bio::Seq objects), but if anyone has the time
> to write one I'd be most grateful.
>
> Leon

Hi Leon,

Have you looked at the index block at all, in order to offer random
access by read ID, or to access the Roche XML manifest? Please
ask if you need more information about this - or if you can read Python:
https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py

Is this building on Miguel Pignatelli's work? I don't recall seeing
any follow up posts from him after this one:
http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html

Peter

From cjfields at illinois.edu  Wed Dec 14 11:12:58 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Wed, 14 Dec 2011 16:12:58 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>,
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
Message-ID: <3BBC1132-E768-45D9-A107-ACD51791722D@illinois.edu>

Leon, 

Nice!  Definitely a good idea to keep the lower-level parser and the BioPerl-bridging code separate. One of my concerns with the various parsers we have right now is that they hard-wire BioPerl classes into the parser, which makes optimization hard.

Chris

PS- Peter, I don't think the two projects are related, but I suppose Leon is the best to answer that.

Sent from my stupid iPad, now my laptop's on the fritz



From l.m.timmermans at students.uu.nl  Wed Dec 14 11:27:58 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Wed, 14 Dec 2011 17:27:58 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
Message-ID: <CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>

On Wed, Dec 14, 2011 at 5:03 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> Hi Leon,
>
> Have you looked at the index block at all, in order to offer random
> access by read ID, or to access the Roche XML manifest? Please
> ask if you need more information about this - or if you can read Python:
> https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py
>

I have looked at it, but not implemented it yet. There is no standardized
index, and the ones that are in common use either seem stupid (the Roche
index, which is essentially just a weirdly formatted sequential list,
though that should still be faster than a table scan) or undocumented (hash
based index).

> Is this building on Miguel Pignatelli's work? I don't recall seeing
> any follow up posts from him after this one:
> http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html
>

It isn't. I like his idea for reusing BioPython's test files though.

Leon

From p.j.a.cock at googlemail.com  Wed Dec 14 11:44:28 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 14 Dec 2011 16:44:28 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
Message-ID: <CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>

On Wed, Dec 14, 2011 at 4:27 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> On Wed, Dec 14, 2011 at 5:03 PM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
>>
>> Hi Leon,
>>
>> Have you looked at the index block at all, in order to offer random
>> access by read ID, or to access the Roche XML manifest? Please
>> ask if you need more information about this - or if you can read Python:
>> https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py
>
> I have looked at it, but not implemented it yet. There is no standardized
> index, and the ones that are in common use either seem stupid (the Roche
> index, which is essentially just a weirdly formatted sequential list, though
> that should still be faster than a table scan) or undocumented (hash based
> index).

There are two widely used indexes, both from Roche (one with and
one without an XML manifest, magic bytes .mft and .srt). They are
both just a simple table of the read names and offsets, sorted
alphabetically. This works pretty well for rapid lookup for SFF files
(because the read count is not so high), and is pretty easy.
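
To illustrate why a sorted table of names and offsets gives rapid lookup,
here is a rough sketch of a binary search over such a table. It is not the
actual Roche index layout (the binary encoding is not reproduced here); the
read names, offsets and the lookup_offset helper are all made up for
illustration.

```perl
use strict;
use warnings;

# Hypothetical in-memory index: [read name, byte offset] pairs,
# sorted alphabetically by read name as described above.
my @index = sort { $a->[0] cmp $b->[0] } (
    [ 'READ_C', 2048 ],
    [ 'READ_A',  840 ],
    [ 'READ_D', 3310 ],
    [ 'READ_B', 1432 ],
);

# Binary search over the sorted table.
# Returns the offset for $name, or undef if the name is not indexed.
sub lookup_offset {
    my ($index, $name) = @_;
    my ($lo, $hi) = (0, $#{$index});
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $cmp = $name cmp $index->[$mid][0];
        return $index->[$mid][1] if $cmp == 0;
        if ($cmp < 0) { $hi = $mid - 1 } else { $lo = $mid + 1 }
    }
    return;
}

for my $name (qw(READ_C READ_Z)) {
    my $offset = lookup_offset(\@index, $name);
    print "$name: ", (defined $offset ? "offset $offset" : "not in index"), "\n";
}
```

Each lookup needs about log2(N) comparisons, so even a plain sorted list
stays cheap at the read counts typical of SFF files.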

I don't think anyone used the hash table style indexes (.hsh), which
I assume was a proof of principle or trial in the early days of SFF.

One thing to check is what Ion Torrent's SFF files use. I would
guess they've followed Roche, but I don't know. After all, the
index structure is not defined in the SFF specification - it was
left extensible on purpose.

>> Is this building on Miguel Pignatelli's work? I don't recall seeing
>> any follow up posts from him after this one:
>> http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html
>
> It isn't. I like his idea for reusing BioPython's test files though.

Yes, please do.

Peter

From gingerplum at gmail.com  Wed Dec 14 00:18:55 2011
From: gingerplum at gmail.com (plum ginger)
Date: Tue, 13 Dec 2011 21:18:55 -0800 (PST)
Subject: [Bioperl-l] a problem about BLAST
Message-ID: <447aac17-935f-4f3e-89d8-bd0b6edaf04f@r6g2000yqr.googlegroups.com>

Hi. I'm trying to run BLAST with Bio::Tools::Run::StandAloneBlast. I
need to run BLAST on more than one sequence. However, the BLAST outfile
only stores the result for the last sequence. How can I make the outfile
store all the results?

I would appreciate your help. Thanks very much!


Best regards

From jason.stajich at gmail.com  Thu Dec 15 12:02:47 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Thu, 15 Dec 2011 11:02:47 -0600
Subject: [Bioperl-l] a problem about BLAST
In-Reply-To: <447aac17-935f-4f3e-89d8-bd0b6edaf04f@r6g2000yqr.googlegroups.com>
References: <447aac17-935f-4f3e-89d8-bd0b6edaf04f@r6g2000yqr.googlegroups.com>
Message-ID: <58E5B487-7FF0-4018-B109-D6595DC2E493@gmail.com>

You are probably setting the outfile in each iteration, so each run overwrites the last. You need to show your code if you want someone to help you debug the problem.


Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org


From pengyu.ut at gmail.com  Fri Dec 16 17:10:27 2011
From: pengyu.ut at gmail.com (Peng Yu)
Date: Fri, 16 Dec 2011 16:10:27 -0600
Subject: [Bioperl-l] How to stop rather than emit warnings with
	Bio::Das::segment?
Message-ID: <CABrM6w=tiMNRJjZtvK-8Ot-TE3WuZgvtod6q-3g=UAWk8K_WRg@mail.gmail.com>

Hi,

Bio::Das::segment gives me the following warning, without stopping
the whole program, when the position for the query doesn't exist. I
could test the returned result and quit when it is []. But this would
mean adding a test every time I call segment. I'm wondering
if there is an automatic way to make Bio::Das::segment stop in such
cases.

--------------------- WARNING ---------------------
MSG: Sequence is not dna or rna, but []. Attempting to revcom, but
unsure if this is right
---------------------------------------------------


-- 
Regards,
Peng

From cjfields at illinois.edu  Fri Dec 16 21:48:07 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sat, 17 Dec 2011 02:48:07 +0000
Subject: [Bioperl-l] How to stop rather than emit warnings
	with	Bio::Das::segment?
In-Reply-To: <CABrM6w=tiMNRJjZtvK-8Ot-TE3WuZgvtod6q-3g=UAWk8K_WRg@mail.gmail.com>
References: <CABrM6w=tiMNRJjZtvK-8Ot-TE3WuZgvtod6q-3g=UAWk8K_WRg@mail.gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF0881CF8E@CITESMBX4.ad.uillinois.edu>

Setting verbosity to 2 should convert warnings to exceptions.   

IIRC, set '-verbose => 2' in the Bio::Das constructor, set '$das->verbose(2)' explicitly, or set the env variable BIOPERLDEBUG=2.  
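
To make the idea concrete, here is a toy illustration in plain Perl of a verbosity level that promotes warnings to exceptions. ToyRoot is a hypothetical stand-in written for this sketch, not BioPerl's own code, and whether a given Bio::Das object passes its verbosity on to the sequence object that emits the warning is not confirmed here.

```perl
use strict;
use warnings;

# Toy stand-in for an object with BioPerl-style verbosity; NOT BioPerl code.
package ToyRoot;

sub new {
    my ($class, %args) = @_;
    return bless { verbose => $args{-verbose} // 0 }, $class;
}

# Combined getter/setter for the verbosity level.
sub verbose {
    my ($self, $level) = @_;
    $self->{verbose} = $level if defined $level;
    return $self->{verbose};
}

# At verbosity 2 or higher a warning is promoted to an exception;
# below that it is only printed.
sub warn {
    my ($self, $msg) = @_;
    die "EXCEPTION: $msg\n" if $self->verbose >= 2;
    CORE::warn "WARNING: $msg\n";
    return;
}

package main;

my $obj = ToyRoot->new(-verbose => 2);

# eval catches the promoted warning, so the caller can decide to stop.
my $ok    = eval { $obj->warn("Sequence is not dna or rna"); 1 };
my $error = $ok ? '' : $@;
print "stopped: $error" unless $ok;
```

Without the eval the die would end the program, which is the "stop instead of warn" behaviour asked about.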

chris

________________________________________
From: bioperl-l-bounces at lists.open-bio.org [bioperl-l-bounces at lists.open-bio.org] on behalf of Peng Yu [pengyu.ut at gmail.com]
Sent: Friday, December 16, 2011 4:10 PM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] How to stop rather than emit warnings with Bio::Das::segment?

Hi,

Bio::Das::segment can give me the following warnings without stopping
the whole program when the position for the query doesn't exist. I
could test the returned result and quit when it is []. But this would
mean my program needs a test whenever I call segment. I'm wondering
if there is an automatic way to make Bio::Das::segment stop in such
cases.

--------------------- WARNING ---------------------
MSG: Sequence is not dna or rna, but []. Attempting to revcom, but
unsure if this is right
---------------------------------------------------


--
Regards,
Peng


From anna.fr at gmail.com  Mon Dec 19 02:09:15 2011
From: anna.fr at gmail.com (Anna Friedlander)
Date: Mon, 19 Dec 2011 20:09:15 +1300
Subject: [Bioperl-l] StandAloneBlastPlus blastdbcmd question
Message-ID: <CALv2E+1Yvt1OhcTE_YXqho+zYZhPjihhCFupybArxMjLfD1S_g@mail.gmail.com>

Hi all

I have a question about using blastdbcmd via
Bio::Tools::Run::StandAloneBlastPlus

I have some Blast+ search results that I am manipulating in a perl
programme, and I would like to retrieve some sequence information for
some results using subject sequence IDs, and associated subject start
and end indices. If I was using blastdbcmd directly, I would do so
using the -entry and -range options.

My question is, can I use all the blastdbcmd options (or more
specifically, just the -entry and -range options) from within the
StandAloneBlastPlus module?

My apologies if I don't properly understand how this "wrapper" works!

Thanks in advance for your help
Anna Friedlander

From l.m.timmermans at students.uu.nl  Mon Dec 19 09:19:14 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Mon, 19 Dec 2011 15:19:14 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
Message-ID: <CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>

On Wed, Dec 14, 2011 at 5:44 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> There are two widely used indexes, both from Roche (one with and
> one without an XML manifest, magic bytes .mft and .srt). They are
> both just a simple table of the reads names and offsets, sorted
> alphabetically.


Yeah, that's what I got from the BioPython code. I didn't know it was
sorted though (it doesn't make much sense either, unless they wanted to do
a binary search or something).

> This works pretty well for rapid lookup for SFF files
> (because the read count is not so high), and is pretty easy.
>

It's implemented in Bio::SFF 0.003. I did restructure my code into two
readers though, since doing sequential and random-access in the class
didn't make much sense code-wise.

> I don't think anyone used the hash table style indexes (.hsh), which
> I assume was a proof of principle or trial in the early days of SFF.
>

I see, too bad.


> One thing to check is what Ion Torrent's SFF files use. I would
> guess they've followed Roche, but I don't know. After all, the
> index structure is not defined in the SFF specification - it was
> left extensible on purpose.
>

Yeah, we should check that too.

> Yes, please do.
>

It's added to 0.003. The lack of tests was bothering me, but the SFFs I had
at hand were not suitable.

Leon

From p.j.a.cock at googlemail.com  Mon Dec 19 09:31:18 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 19 Dec 2011 14:31:18 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
Message-ID: <CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>

On Mon, Dec 19, 2011 at 2:19 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> On Wed, Dec 14, 2011 at 5:44 PM, Peter Cock <p.j.a.cock at googlemail.com>wrote:
>
>> There are two widely used indexes, both from Roche (one with and
>> one without an XML manifest, magic bytes .mft and .srt). They are
>> both just a simple table of the reads names and offsets, sorted
>> alphabetically.
>
> Yeah, that's what I got from the BioPython code. I didn't know it
> was sorted though (it doesn't make much sense either, unless they
> wanted to do a binary search or something).

I presume that's what Roche uses if they keep the index on disk.

The alternative is to load the index into RAM, which is really fast.
You just open the SFF, read the header, seek to the index, load
the index. Without the index, you have to scan the entire SFF file
to find each record and its offset - which is much slower.
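
As a rough sketch of why a name-sorted table helps, the plain-Perl example below does a binary search over an in-memory list of [name, offset] pairs. The read names and offsets are invented for illustration, and this is not the Roche index layout itself, only the lookup idea.

```perl
use strict;
use warnings;

# Hypothetical in-memory index: [read name, byte offset] pairs sorted by
# name. The names and offsets are invented for illustration.
my @index = sort { $a->[0] cmp $b->[0] } (
    [ 'READ_C', 2048 ],
    [ 'READ_A', 512  ],
    [ 'READ_D', 3100 ],
    [ 'READ_B', 1300 ],
);

# Binary search over the sorted table; returns the offset, or undef if
# the name is not present.
sub find_offset {
    my ($index, $name) = @_;
    my ($low, $high) = (0, $#{$index});
    while ($low <= $high) {
        my $mid = int(($low + $high) / 2);
        my $cmp = $name cmp $index->[$mid][0];
        return $index->[$mid][1] if $cmp == 0;
        if ($cmp < 0) { $high = $mid - 1 }
        else          { $low  = $mid + 1 }
    }
    return;
}

my $offset = find_offset(\@index, 'READ_C');
print "READ_C starts at byte $offset\n";
```

With the offset in hand, a reader would seek() to that position in the SFF file instead of scanning every record. Once the index is in RAM, a Perl hash keyed by read name would do the same job without needing the sort.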

>> This works pretty well for rapid lookup for SFF files
>> (because the read count is not so high), and is pretty easy.
>
> It's implemented in Bio::SFF 0.003. I did restructure my code into two
> readers though, since doing sequential and random-access in the class
> didn't make much sense code-wise.
>
>> I don't think anyone used the hash table style indexes (.hsh), which
>> I assume was a proof of principle or trial in the early days of SFF.
>
> I see, too bad.
>
>> One thing to check is what Ion Torrent's SFF files use. I would
>> guess they've followed Roche, but I don't know. After all, the
>> index structure is not defined in the SFF specification - it was
>> left extensible on purpose.
>
> Yeah, we should check that too.

I don't have any Ion Torrent data first hand, and the public
samples I've seen were FASTQ not SFF. But I know a few
people with Ion Torrent machines that might be able to help...

> It's added to 0.003. The lack of tests was bothering me, but the
> SFFs I had at hand were not suitable.

Have you looked at the sample SFF data in Biopython? Please
use them for the BioPerl unit tests (we've been talking about a
cross project collection of test data files like this), the README
file should be self-explanatory:
https://github.com/biopython/biopython/tree/master/Tests/Roche

Peter

From p.j.a.cock at googlemail.com  Mon Dec 19 10:13:53 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 19 Dec 2011 15:13:53 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <4FCCB636-3CC8-40E1-A6DB-55DD8F5F3CBD@sgul.ac.uk>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<4FCCB636-3CC8-40E1-A6DB-55DD8F5F3CBD@sgul.ac.uk>
Message-ID: <CAKVJ-_4U_Yt5A8f4QLxb-SzT8Y7n-2kRvGH=g9n+NfqAFegxgA@mail.gmail.com>

On Mon, Dec 19, 2011 at 3:03 PM, Adam Witney <awitney at sgul.ac.uk> wrote:
>> I don't have any Ion Torrent data first hand, and the public
>> samples I've seen were FASTQ not SFF. But I know a few
>> people with Ion Torrent machines that might be able to help...
>
> I can let you have some Ion Torrent SFF files if it helps
>
> adam

Hi Adam,

I've just had a quick look at a file from an IonTorrent 314 chip
that a colleague kindly sent me, and that SFF file had no index
(but only 50k reads so this isn't so important).

If you can send me (and Leon?) one or two original SFF files that
would be useful, even if just to confirm that Ion Torrent's SFF files
do indeed typically lack an index. If that is the case, I may need to
remove the warning message Biopython currently prints when
indexing these files: No SFF index, doing it the slow way

Off list is fine if you'd like to keep the data private, use dropbox
or something if you don't have an FTP server.

Thanks,

Peter


From awitney at sgul.ac.uk  Mon Dec 19 10:03:16 2011
From: awitney at sgul.ac.uk (Adam Witney)
Date: Mon, 19 Dec 2011 15:03:16 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
Message-ID: <4FCCB636-3CC8-40E1-A6DB-55DD8F5F3CBD@sgul.ac.uk>

>>> One thing to check is what Ion Torrent's SFF files use. I would
>>> guess they've followed Roche, but I don't know. After all, the
>>> index structure is not defined in the SFF specification - it was
>>> left extensible on purpose.
>> 
>> Yeah, we should check that too.
> 
> I don't have any Ion Torrent data first hand, and the public
> samples I've seen were FASTQ not SFF. But I know a few
> people with Ion Torrent machines that might be able to help...

I can let you have some Ion Torrent SFF files if it helps

adam


From l.m.timmermans at students.uu.nl  Mon Dec 19 10:48:34 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Mon, 19 Dec 2011 16:48:34 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
Message-ID: <CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>

On Mon, Dec 19, 2011 at 3:31 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> I presume that's what Roche uses if they keep the index on disk.
>
> The alternative is to load the index into RAM, which is really fast.
> You just open the SFF, read the header, seek to the index, load
> the index. Without the index, you have to scan the entire SFF file
> to find each record and its offset - which is much slower.
>

That's what I'm doing now. It's much faster, but it still takes a
noticeable amount of time on large files.

> Have you looked at the sample SFF data in Biopython? Please
> use them for the BioPerl unit tests (we've been talking about a
> cross project collection of test data files like this), the README
> file should be self-explanatory:
> https://github.com/biopython/biopython/tree/master/Tests/Roche
>

Yeah, I'm using those now (
https://github.com/Leont/bio-sff/blob/master/t/reader.t). I must say there
were some interesting corner cases in it.

Leon

From p.j.a.cock at googlemail.com  Mon Dec 19 11:15:15 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 19 Dec 2011 16:15:15 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
Message-ID: <CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>

On Mon, Dec 19, 2011 at 3:48 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> On Mon, Dec 19, 2011 at 3:31 PM, Peter Cock wrote:
>
>> Have you looked at the sample SFF data in Biopython? Please
>> use them for the BioPerl unit tests (we've been talking about a
>> cross project collection of test data files like this), the README
>> file should be self-explanatory:
>> https://github.com/biopython/biopython/tree/master/Tests/Roche
>
> Yeah, I'm using those now
> (https://github.com/Leont/bio-sff/blob/master/t/reader.t).

Could you add a link to your /corpus/README.txt file pointing
back to the Biopython original for acknowledgement and
future reference?

>
> I must say there were some interesting corner cases in it.
>

I'm glad you agree - and if you can think of any more special
cases to verify that would be great.

Are you doing just SFF parsing for now? Not writing?

Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
format name "sff" to mean the full read sequence (with mixed
case, upper case for the good sequence, lower cases for any
left/right clipping - as in the Roche tools), and "sff-trim" to mean
the trimmed sequences. I would encourage you to do the
same, as part of the general aim of having consistent
sequence format names between BioPerl, Biopython, and
EMBOSS, where possible.
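
A small plain-Perl sketch of the two representations described above. The sequence and clip positions are invented, and the clip positions are assumed to be 1-based and inclusive; this is an illustration of the naming convention, not code from Biopython or Bio::SFF.

```perl
use strict;
use warnings;

# Assumed inputs: raw bases plus 1-based, inclusive clip positions
# (hypothetical values, not taken from a real SFF file).

# "sff" style: full read, good region upper case, clipped ends lower case.
sub full_read {
    my ($bases, $clip_left, $clip_right) = @_;
    my $head = lc(substr($bases, 0, $clip_left - 1));
    my $good = uc(substr($bases, $clip_left - 1, $clip_right - $clip_left + 1));
    my $tail = lc(substr($bases, $clip_right));
    return $head . $good . $tail;
}

# "sff-trim" style: only the good region.
sub trimmed_read {
    my ($bases, $clip_left, $clip_right) = @_;
    return uc(substr($bases, $clip_left - 1, $clip_right - $clip_left + 1));
}

my ($bases, $left, $right) = ('TCAGGGTACCAGTTGAC', 5, 14);
print "sff:      ", full_read($bases, $left, $right), "\n";
print "sff-trim: ", trimmed_read($bases, $left, $right), "\n";
```

For this input the first line shows tcagGGTACCAGTTgac and the second GGTACCAGTT, so the trimmed form is exactly the upper-case part of the full form.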

Peter

From l.m.timmermans at students.uu.nl  Mon Dec 19 11:47:41 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Mon, 19 Dec 2011 17:47:41 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
Message-ID: <CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>

On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> Could you add a link to your /corpus/README.txt file pointing
> back to the Biopython original for acknowledgement and
> future reference?
>

I forgot about that, I will add it to the next release.

> Are you doing just SFF parsing for now? Not writing?
>

I haven't written the writer yet (haven't needed it so far). I'd rather
release working code early instead of waiting until everything is complete.

> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
> format name "sff" to mean the full read sequence (with mixed
> case, upper case for the good sequence, lower cases for any
> left/right clipping - as in the Roche tools), and "sff-trim" to mean
> the trimmed sequences. I would encourage you to do the
> same, as part of the general aim of having consistent
> sequence format names between BioPerl, Biopython, and
> EMBOSS, where possible.
>

I agree, consistency is good.

Leon

From p.j.a.cock at googlemail.com  Mon Dec 19 12:00:03 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 19 Dec 2011 17:00:03 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
Message-ID: <CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>

On Mon, Dec 19, 2011 at 4:47 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
>>
>> Could you add a link to your /corpus/README.txt file pointing
>> back to the Biopython original for acknowledgement and
>> future reference?
>
> I forgot about that, I will add it to the next release.

Thanks.

>> Are you doing just SFF parsing for now? Not writing?
>
>
> I haven't written the writer yet (haven't needed it so far). I'd rather
> release working code early instead of waiting until everything is complete.

I understand - but make sure you've designed the data structures
in the parser so as to allow the original record to be re-built as SFF.

>> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
>> format name "sff" to mean the full read sequence (with mixed
>> case, upper case for the good sequence, lower cases for any
>> left/right clipping - as in the Roche tools), and "sff-trim" to mean
>> the trimmed sequences. I would encourage you to do the
>> same, as part of the general aim of having consistent
>> sequence format names between BioPerl, Biopython, and
>> EMBOSS, where possible.
>
> I agree, consistency is good.

Great. I'd guess Bio::SeqIO integration would be more important
than SFF output initially.

Peter

From cjfields at illinois.edu  Mon Dec 19 14:44:22 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 19 Dec 2011 19:44:22 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>,
	<CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
Message-ID: <D5B63715-D712-44E4-BBAC-35C721AF2098@illinois.edu>

Kinda joining this a little late, but I think if there is a way to have a low-level parser/writer that generically parses the data into simple (possibly hash-tagged) data structures, that would be best.  Barring that, a very simple class for storing data.  We've found BioPerl objects/classes pretty heavy.

(for an example of this, see Heng Li's readfq parser on github, which has some stats for Fastq/fasta parsing).

Any way we can separate the parser from object instantiation would enable us to optimize the object/class layer and parser/writer layers separately, with the possible nice side effect of making the parser more broadly used.  

For instance, if someone wanted a faster parser, use the low level, otherwise use the higher level (possibly BioPerl-specific) API. Lincoln does this to a certain degree with Bio-samtools; I would go further and put the bp- and non-bp code in separate dists.
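
A minimal sketch of that layering in plain Perl, using FASTA only because it is short to parse. parse_fasta() and the SimpleSeq class are hypothetical names made up for this example, not existing BioPerl or readfq APIs.

```perl
use strict;
use warnings;

# Low-level layer: parse FASTA text into plain hashes, no objects involved.
sub parse_fasta {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq '';
        if ($line =~ /^>(\S*)/) {
            push @records, { id => $1, seq => '' };
        }
        elsif (@records) {
            $records[-1]{seq} .= $line;
        }
    }
    return @records;
}

# Optional high-level layer: a thin, hypothetical wrapper class built on
# top of the hashes, for callers who prefer methods.
{
    package SimpleSeq;
    sub new        { my ($class, $record) = @_; return bless { %{$record} }, $class }
    sub id         { return $_[0]{id} }
    sub seq        { return $_[0]{seq} }
    sub seq_length { return length $_[0]{seq} }
}

# In-memory filehandle so the sketch needs no input file.
my $fasta = ">r1 first\nACGT\nAC\n>r2\nGGG\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
my @records = parse_fasta($fh);
close $fh;

# Callers who want speed stop at @records; others wrap them in objects.
my @objects = map { SimpleSeq->new($_) } @records;
printf "%s\t%d\n", $_->id, $_->seq_length for @objects;
```

Because the parser knows nothing about the wrapper class, the two layers can be tuned, or shipped in separate distributions, independently.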

Chris

On Dec 19, 2011, at 11:05 AM, "Peter Cock" <p.j.a.cock at googlemail.com> wrote:

> On Mon, Dec 19, 2011 at 4:47 PM, Leon Timmermans
> <l.m.timmermans at students.uu.nl> wrote:
>> On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock <p.j.a.cock at googlemail.com>
>> wrote:
>>> 
>>> Could you add a link to your /corpus/README.txt file pointing
>>> back to the Biopython original for acknowledgement and
>>> future reference?
>> 
>> I forgot about that, I will add it to the next release.
> 
> Thanks.
> 
>>> Are you doing just SFF parsing for now? Not writing?
>> 
>> 
>> I haven't written the writer yet (haven't needed it so far). I'd rather
>> release working code early instead of waiting until everything is complete.
> 
> I understand - but make sure you've designed the data structures
> in the parser so as to allow the original record to be re-built as SFF.
> 
>>> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
>>> format name "sff" to mean the full read sequence (with mixed
>>> case, upper case for the good sequence, lower cases for any
>>> left/right clipping - as in the Roche tools), and "sff-trim" to mean
>>> the trimmed sequences. I would encourage you to do the
>>> same, as part of the general aim of having consistent
>>> sequence format names between BioPerl, Biopython, and
>>> EMBOSS, where possible.
>> 
>> I agree, consistency is good.
> 
> Great. I'd guess Bio::SeqIO integration would be more important
> than SFF output initially.
> 
> Peter


From cjfields at illinois.edu  Mon Dec 19 19:28:25 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Mon, 19 Dec 2011 18:28:25 -0600
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
Message-ID: <4EEFD6A9.3010303@illinois.edu>

On 12/19/2011 10:47 AM, Leon Timmermans wrote:
> On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
>> Could you add a link to your /corpus/README.txt file pointing
>> back to the Biopython original for acknowledgement and
>> future reference?
>>
> I forgot about that, I will add it to the next release.
>
>> Are you doing just SFF parsing for now? Not writing?
>>
> I haven't written the writer yet (haven't needed it so far). I'd rather
> release working code early instead of waiting until everything is complete.
>
>> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
>> format name "sff" to mean the full read sequence (with mixed
>> case, upper case for the good sequence, lower cases for any
>> left/right clipping - as in the Roche tools), and "sff-trim" to mean
>> the trimmed sequences. I would encourage you to do the
>> same, as part of the general aim of having consistent
>> sequence format names between BioPerl, Biopython, and
>> EMBOSS, where possible.
>>
> I agree, consistency is good.
>
> Leon

This is already implemented in Bio::SeqIO, I believe.  It is the same
line of thinking as with the FASTQ format: one can have a
'format-variant' combination that (as one might guess) tells the
parser which variation of the format to expect, so logic within the
parser can deal with it.  You can also pass the '-variant => "foo"'
parameter, IIRC.  You would just check the variant with the variant()
method.
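
As a toy illustration of the 'format-variant' naming convention, the plain-Perl sketch below splits a name into its two parts. split_format_name() is a hypothetical helper written for this example and is not the Bio::SeqIO implementation; 'fastq-illumina' is used only as a sample name.

```perl
use strict;
use warnings;

# Split a format name such as 'sff-trim' into (format, variant).
# The variant is undef when the name has no '-' part.
sub split_format_name {
    my ($name) = @_;
    my ($format, $variant) = split /-/, lc($name), 2;
    return ($format, $variant);
}

for my $name (qw(sff sff-trim fastq-illumina)) {
    my ($format, $variant) = split_format_name($name);
    printf "%-15s format=%s variant=%s\n",
        $name, $format, defined $variant ? $variant : '(none)';
}
```

A parser written this way needs only one module per format and can branch on the variant where the variations actually differ.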

chris

From l.m.timmermans at students.uu.nl  Tue Dec 20 10:25:13 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Tue, 20 Dec 2011 16:25:13 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
	<CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
Message-ID: <CAC1jpXDmU3jF-3G9dnB=-0fCkydqWd18J0eo2nKkSLvKTbOxjQ@mail.gmail.com>

On Mon, Dec 19, 2011 at 6:00 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> I understand - but make sure you've designed the data structures
> in the parser so as to allow the original record to be re-built as SFF.
>

 I did, though currently it's rather hard to make new entries from scratch.
That said, I can hardly imagine anyone wanting to do this.

> Great. I'd guess Bio::SeqIO integration would be more important
> than SFF output initially.
>

Probably. It looks like it's quite easy, it's just rather underdocumented.

Leon

From l.m.timmermans at students.uu.nl  Tue Dec 20 10:26:11 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Tue, 20 Dec 2011 16:26:11 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <D5B63715-D712-44E4-BBAC-35C721AF2098@illinois.edu>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
	<CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
	<D5B63715-D712-44E4-BBAC-35C721AF2098@illinois.edu>
Message-ID: <CAC1jpXAq3cN0GriHndyQnbpn3v26M1o927i8z_a=x5bQU-iPQg@mail.gmail.com>

On Mon, Dec 19, 2011 at 8:44 PM, Fields, Christopher J <
cjfields at illinois.edu> wrote:

> Kinda joining this a little late, but I think if there is a way to have a
> low-level parser/writer that generically parses the data into simple
> (possibly hash-tagged) data structures, that would be best.  Barring that,
> a very simple class for storing data.  We've found BioPerl objects/classes
> pretty heavy.
>
> (for an example of this, see Heng Li's readfq parser on github, which has
> some stats for Fastq/fasta parsing).
>
> Any way we can separate the parser from object instantiation would enable
> us to optimize the object/class layer and parser/writer layers separately,
> with the possible nice side effect of making the parser more broadly used.
>
> For instance, if someone wanted a faster parser, use the low level,
> otherwise use the higher level (possibly BioPerl-specific) API. Lincoln
> does this to a certain degree with Bio-samtools; I would go further and
> put the bp- and non-bp code in separate dists.
>

A good OO system can actually help make things faster. For example, I'm
unpacking the flowspace and quality data lazily, which made scanning
through an SFF file 2.5-3 times as fast while having marginal extra costs
when you do need them.
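
A minimal sketch of the lazy-unpacking idea in plain Perl. LazyRead is a hypothetical class written for this example, not the Bio::SFF code, and it assumes quality scores are stored as one unsigned byte per base.

```perl
use strict;
use warnings;

package LazyRead;   # hypothetical class, not the Bio::SFF implementation

sub new {
    my ($class, %args) = @_;
    # Keep the packed bytes as they are; nothing is decoded yet.
    return bless {
        name        => $args{name},
        raw_quality => $args{raw_quality},
    }, $class;
}

sub name { return $_[0]{name} }

# Decode one unsigned byte per base on first access and cache the result.
sub quality {
    my ($self) = @_;
    $self->{quality} //= [ unpack 'C*', $self->{raw_quality} ];
    return $self->{quality};
}

package main;

my $read = LazyRead->new(
    name        => 'READ_A',
    raw_quality => pack('C*', 40, 38, 12),
);

print $read->name, "\n";                       # no unpacking needed
print join(' ', @{ $read->quality }), "\n";    # unpacked here, then cached
```

A scan that only looks at read names never pays for the unpack, while code that does need the scores pays for it once per read.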

Leon

From l.m.timmermans at students.uu.nl  Tue Dec 20 10:30:54 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Tue, 20 Dec 2011 16:30:54 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <4EEFD6A9.3010303@illinois.edu>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
	<4EEFD6A9.3010303@illinois.edu>
Message-ID: <CAC1jpXD_sSYoU2DS33Yn99c0WToyXvTY2aJdcS6w-yZ0xfCFMg@mail.gmail.com>

On Tue, Dec 20, 2011 at 1:28 AM, Chris Fields <cjfields at illinois.edu> wrote:

> This is already implemented in Bio::SeqIO I believe.  This is the same
> line of thinking with the FASTQ format, that one can have a
> 'format-variant' combination that (as one might guess) indicates to the
> parser any variation of the parser so logic within the parser can deal with
> it.  You can also pass the '-variant => "foo"' parameter as well IIRC.  You
> would just check the variant with the variant() method.
>

Great. That makes life much easier :-)

Leon

From p.j.a.cock at googlemail.com  Tue Dec 20 10:31:59 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 20 Dec 2011 15:31:59 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXDmU3jF-3G9dnB=-0fCkydqWd18J0eo2nKkSLvKTbOxjQ@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
	<CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
	<CAC1jpXDmU3jF-3G9dnB=-0fCkydqWd18J0eo2nKkSLvKTbOxjQ@mail.gmail.com>
Message-ID: <CAKVJ-_7v+wKQVXkLz_CMJXviYApyirjG9CA89mti5a3N40V8iA@mail.gmail.com>

On Tue, Dec 20, 2011 at 3:25 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> On Mon, Dec 19, 2011 at 6:00 PM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
>>
>> I understand - but make sure you've designed the data structures
>> in the parser so as to allow the original record to be re-built as SFF.
>
> I did, though currently it's rather hard to make new entries from scratch.
> That said, I can hardly imagine anyone wanting to do this.

Typical use cases I've found in using the Biopython SFF code are
filtering an SFF file (taking some records only), and modifying the
clipping values. In both cases, the user isn't creating the SFF
records from scratch.

Peter


From cjfields at illinois.edu  Tue Dec 20 17:40:31 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 20 Dec 2011 22:40:31 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXAq3cN0GriHndyQnbpn3v26M1o927i8z_a=x5bQU-iPQg@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
	<CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
	<D5B63715-D712-44E4-BBAC-35C721AF2098@illinois.edu>,
	<CAC1jpXAq3cN0GriHndyQnbpn3v26M1o927i8z_a=x5bQU-iPQg@mail.gmail.com>
Message-ID: <CE1C3005-EA13-4C4E-A4B5-7F387D0E8E0B@illinois.edu>


On Dec 20, 2011, at 9:26 AM, "Leon Timmermans" <l.m.timmermans at students.uu.nl> wrote:

On Mon, Dec 19, 2011 at 8:44 PM, Fields, Christopher J <cjfields at illinois.edu> wrote:
Kinda joining this a little late, but I think if there is a way to have a low-level parser/writer that generically parses the data into simple (possibly hash-tagged) data structures, that would be best.  Barring that, a very simple class for storing data.  We've found BioPerl objects/classes pretty heavy.

(for an example of this, see Heng Li's readfq parser on github, which has some stats for Fastq/fasta parsing).

Any way we can separate the parser from object instantiation would enable us to optimize the object/class layer and parser/writer layers separately, with the possible nice side effect of making the parser more broadly used.
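
A minimal sketch of that split, in plain Perl and assuming FASTA input: the low-level layer is an iterator that yields plain hash references, and the object layer is a thin optional wrapper. The names fasta_iterator and Simple::Seq are made up for illustration; this is not BioPerl code and not the readfq parser.

```perl
use strict;
use warnings;

# Low-level layer: returns a closure that yields one plain hash reference
# per FASTA record, or undef at end of input. No objects are created.
sub fasta_iterator {
    my ($fh) = @_;
    my $pending;    # header line of the next record, read one line ahead
    return sub {
        my ($header, $seq) = ($pending, '');
        undef $pending;
        while (my $line = <$fh>) {
            $line =~ s/\r?\n\z//;
            if ($line =~ /^>/) {
                if (defined $header) { $pending = $line; last; }
                $header = $line;
            }
            elsif (defined $header) {
                $seq .= $line;
            }
        }
        return unless defined $header;
        my ($id, $desc) = $header =~ /^>\s*(\S*)\s*(.*)$/;
        return { id => $id, desc => $desc, seq => $seq };
    };
}

# Optional object layer (hypothetical), built on top of the hash records.
{
    package Simple::Seq;
    sub new { my ($class, $rec) = @_; return bless { %$rec }, $class }
    sub id  { $_[0]{id} }
    sub seq { $_[0]{seq} }
}

my $text = ">chrA demo record\nAAAA\nCCG\n>chrB\nTT\n";
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
my $next = fasta_iterator($fh);
while (my $rec = $next->()) {
    # Fast path: use the hash directly; wrap it only when objects are wanted.
    my $obj = Simple::Seq->new($rec);
    printf "%s\t%d\n", $obj->id, length $rec->{seq};
}
```

Because the iterator knows nothing about the wrapper class, either layer can be optimized or replaced without touching the other.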

For instance, if someone wanted a faster parser, they could use the low-level one; otherwise they would use the higher-level (possibly BioPerl-specific) API. Lincoln does this to a certain degree with Bio-samtools; I would go further and put the bp- and non-bp code in separate dists.

A good OO system can actually help make things faster. For example, I'm unpacking the flowspace and quality data lazily, which made scanning through an SFF file 2.5-3 times as fast while having marginal extra costs when you do need them.
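
The lazy-unpacking idea can be sketched roughly as follows. This is an illustration only, not the actual Bio::SFF code: the class name Lazy::Read and the field names are hypothetical, and it assumes quality scores stored as one unsigned byte each.

```perl
use strict;
use warnings;

{
    package Lazy::Read;    # hypothetical class name, for illustration only

    # Keep the packed bytes as they were read; nothing is decoded here.
    sub new {
        my ($class, %args) = @_;
        return bless {
            name           => $args{name},
            packed_quality => $args{packed_quality},
        }, $class;
    }

    sub name { $_[0]{name} }

    # Decode the quality scores on first access and cache the result.
    sub quality {
        my ($self) = @_;
        $self->{quality} //= [ unpack 'C*', $self->{packed_quality} ];
        return $self->{quality};
    }
}

my $read = Lazy::Read->new(
    name           => 'read1',
    packed_quality => pack('C*', 30, 31, 40),
);
print $read->name, "\n";                       # no unpacking has happened yet
print join(' ', @{ $read->quality }), "\n";    # unpacked here, then cached
```

A scan that only looks at read names never pays for the unpack call, which is where a speedup of the kind described would come from.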

Leon

Yep, thinking about using the same approach for the Fastq variants.

Chris

Sent from my ancient iPad b/c my laptop's borked


From dgacquer at ulb.ac.be  Wed Dec 21 08:26:07 2011
From: dgacquer at ulb.ac.be (David Gacquer)
Date: Wed, 21 Dec 2011 14:26:07 +0100
Subject: [Bioperl-l] Strange behaviour in the write_seq function for large
	fasta
Message-ID: <4EF1DE6F.4070508@ulb.ac.be>

Dear BioPerl users/developers,

I am facing a strange issue with the $seq_out->write_seq function when 
using large fasta files

I have downloaded the hg19 chromosome 1, and applied the following code 
(basically I wanted to mask some regions in it but the problem also 
appears when copying the sequence without modifications):

use strict;
use warnings;
use Bio::SeqIO;
use Bio::Seq::LargePrimarySeq;

sub main{
     my $seq_in  = Bio::SeqIO->new( -format => 'largefasta', -file => $ARGV[0]);
     my $seq_out  = Bio::SeqIO->new( -format => 'largefasta', -file => '>'.$ARGV[1]);
     my $seq_obj_in = $seq_in->next_seq();
     my $modified_seq = $seq_obj_in->seq();
     my $seq_obj_out = Bio::Seq::LargePrimarySeq->new( -seq => $modified_seq, -id  => $seq_obj_in->id, -desc => $seq_obj_in->desc);
     $seq_out->write_seq($seq_obj_out);
}

main();

When checking the output FASTA file, the sequence of chr1 is 1 bp shorter.

I have noticed that in the original FASTA file each line contains
exactly 50 nucleotides, while the output of the $seq_out->write_seq
function always contains 60 characters per line.
chr1 is exactly 249,250,621 bp (4,154,177 * 60 + 1), so to verify that
the very last base was missing, I created the following FASTA files:

chr121.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAG

chr122.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAG

They contain respectively 121 (60*2+1) and 122 (60*2+2) bp, the last 
character being a G. When running the above code:

chr121.out.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

chr122.out.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AG

The output for the 122 bp chromosome is correct (2 lines of 60 bp and 
the last line with 2 bp, AG) but for the 121 bp chromosome, the last 
character is missing (2 lines of 60 bp only, last G is missing).

When replacing -format => 'largefasta' by -format => 'fasta' or writing 
the output without the write_seq function however, the problem is solved.
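
For anyone who wants to check their own output for the same symptom, here is a small plain-Perl sketch of the line arithmetic. It does not reproduce or fix the largefasta behaviour; it only shows what a correct 60-column wrap looks like when the sequence length is a multiple of 60 plus one, and how to compare input and output lengths. The helper names wrap_sequence and fasta_length are made up for this example.

```perl
use strict;
use warnings;

# Split a sequence into lines of at most $width characters.
sub wrap_sequence {
    my ($seq, $width) = @_;
    return $seq =~ /(.{1,$width})/g;
}

# Count sequence characters in FASTA text, skipping header lines.
sub fasta_length {
    my ($fasta_text) = @_;
    my $n = 0;
    for my $line (split /\n/, $fasta_text) {
        next if $line =~ /^>/;
        $n += length $line;
    }
    return $n;
}

# 121 = 60*2 + 1 is the failing case reported above: the last line
# must hold exactly one base.
for my $len (120, 121, 122) {
    my $seq   = ('A' x ($len - 1)) . 'G';
    my @lines = wrap_sequence($seq, 60);
    my $out   = join "\n", '>chrA', @lines;
    printf "%d bp in, %d bp out, last line has %d bp\n",
        $len, fasta_length($out), length $lines[-1];
}
```

Running fasta_length on both the input and the written file is a quick way to confirm whether a base was dropped.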

Am I missing something or is there a problem with the write_seq function 
used with large fasta files? (I am running BioPerl on a Mac under OS X 
Snow Leopard)

Best regards

David

-- 
David Gacquer, Ph. D.

IRIBHM - Universite Libre de Bruxelles
Bldg C, room C.4.117
ULB, Campus Erasme, CP602
808 route de Lennik
B-1070 Brussels
Belgium

Phone: +32-2-555 4187
Fax: +32-2-555 4655
E-mail: dgacquer at ulb.ac.be


From koraydogankaya at gmail.com  Sat Dec 24 03:44:43 2011
From: koraydogankaya at gmail.com (Koray)
Date: Sat, 24 Dec 2011 00:44:43 -0800 (PST)
Subject: [Bioperl-l] exons
Message-ID: <8454dd72-4fed-41c5-977e-83d9300cd68b@z25g2000vbs.googlegroups.com>

I need explicit code for getting the exon sequences of an mRNA or gene
fetched by get_Seq_by_acc or by ID.

In Ensembl this is easy, but here it is not; there are many IOs.

For example:

How can I get such a $gene object from the databases (GenBank or
Entrez Gene) by accession number or ID?


 Title   : exons()
 Usage   : @exons = $gene->exons();
           @initial_exons = $gene->exons('Initial');
 Function: Get all exon features or all exons of a specified type of this gene
           structure.

           Exon type is treated as a case-insensitive regular expression and
           optional. For consistency, use only the following types:
           initial, internal, terminal, utr, utr5prime, and utr3prime.
           A special and virtual type is 'coding', which refers to all types
           except utr.

           This method basically merges the exons returned by transcripts.

 Returns : An array of Bio::SeqFeature::Gene::ExonI implementing objects.
 Args    : An optional string specifying the type of exon.

From challa_ghanashyam at yahoo.com  Sat Dec 24 15:09:09 2011
From: challa_ghanashyam at yahoo.com (GSC)
Date: Sat, 24 Dec 2011 12:09:09 -0800 (PST)
Subject: [Bioperl-l] re trieve description for a list of gi ids..
Message-ID: <33034438.post@talk.nabble.com>


Hi all:
I am new to Perl. I am working on a script to retrieve the record
description (the name given to a sequence record in GenBank) for a list of GI
IDs. The script works fine for 1000 IDs, but my list is about 250,000 IDs
long and the script does not work for it. Any suggestions?

GS
-- 
View this message in context: http://old.nabble.com/retrieve-description-for-a-list-of-gi-ids..-tp33034438p33034438.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From cjfields at illinois.edu  Tue Dec 27 10:03:28 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 27 Dec 2011 15:03:28 +0000
Subject: [Bioperl-l] Strange behaviour in the write_seq function for
 large	fasta
In-Reply-To: <4EF1DE6F.4070508@ulb.ac.be>
References: <4EF1DE6F.4070508@ulb.ac.be>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF0BB29547@CITESMBX4.ad.uillinois.edu>

This is a strange one.  Personally I haven't seen this behavior, but maybe it's OS-dependent?

We'll need more information, particularly what version of BioPerl you are using, the OS, version of perl, etc.  Also, in general, to make sure we don't lose track of this issue it is best to submit a bug report:

https://redmine.open-bio.org/projects/bioperl

I'm planning on triaging bugs next week, so I could take a look then.

chris

_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l


From jdeuts01 at students.poly.edu  Thu Dec  1 09:09:19 2011
From: jdeuts01 at students.poly.edu (jdeuts01 at students.poly.edu)
Date: Thu, 1 Dec 2011 14:09:19 +0000
Subject: [Bioperl-l] question
Message-ID: <SNT134-W43F83A46574EEDD841600186B10@phx.gbl>


Dear Bioperl,
       This is my first experience with bioperl and I need help please.
1. The version of BioPerl is 1.6.1, and I have also installed Bundle-BioPerl 2.1.8 and mGen 1.03.    I was unable to install Bribes and trouchelle DB.     Will this prevent the BioPerl package from functioning correctly?
2. The operating platform is Windows 7 (64-bit) using ActiveState Perl v5.12.2.
3. The script is as follows:
#!/usr/bin/perl
# Write a script using OOP to write protein sequences to the file fasta.txt.
use strict;
use warnings;
use Bio::SeqIO::fasta;
# Declare and initialize input and output files.
my $protein_fasta = "protein.fa";
my $protein_out = ">fasta.txt";
# Setup objects for input and output.
my $seq_in = Bio::SeqIO->new(-file => "$protein_fasta", -format => 'Fasta');
my $seq_out = Bio::SeqIO->new(-file => "$protein_out", -format => 'fasta');
# Establish while loop using "next_seq" method
# to read in multiple sequences from "protein.fa" file
# one by one until none were remaining.
while(my $seq = $seq_in -> next_seq){
		$seq_out->write_seq($seq);
}
The information is successfully written to the file: fasta.txt.
4. Receiving the following error messages:
Replacement list is longer than search list at C:/Perl64/site/lib/Bio/Range.pm line 251.
Subroutine _initialize redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 92.
Subroutine next_seq redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 112.
Subroutine write_seq redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 193.
Subroutine width redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 272.
Subroutine preferred_id_type redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 295.
Thanks in advance for your help.
John Deutsch

From jboddu at illinois.edu  Thu Dec  1 11:38:00 2011
From: jboddu at illinois.edu (Boddu, Jayanand)
Date: Thu, 1 Dec 2011 16:38:00 +0000
Subject: [Bioperl-l] Chromosome coordinates
Message-ID: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>

Hello
I am a newbie to Perl scripts.
I have a file with short reads mapped to the maize genome.
The format is a simple BLASTN output:
READ_ID  Chr    % Similarity  Alignment  Mismatches  Gaps  READ Start  READ End  Chr Start  Chr End    E Value  Score
READ1    chrPt  100           17         0           0     1           17        35021      35037      0.21     34.2
READ1    chr10  100           17         0           0     1           17        128587356  128587372  0.21     34.2
READ1    chr6   100           17         0           0     1           17        160769803  160769787  0.21     34.2
READ1    chr5   100           17         0           0     1           17        172103083  172103067  0.21     34.2
READ1    chr4   100           17         0           0     1           17        213173683  213173699  0.21     34.2
READ1    chr3   100           17         0           0     1           17        23689132   23689116   0.21     34.2
READ2    chr8   100           17         0           0     1           17        161048603  161048587  0.21     34.2
READ2    chr6   100           17         0           0     1           17        155768884  155768868  0.21     34.2
READ2    chr5   100           17         0           0     1           17        32958812   32958828   0.21     34.2
READ2    chr3   100           17         0           0     1           17        212451090  212451074  0.21     34.2
READ2    chr2   100           17         0           0     1           17        2046449    2046465    0.21     34.2
READ2    chr1   100           17         0           0     1           17        223233801  223233785  0.21     34.2
READ2    chr1   100           17         0           0     1           17        277573037  277573021  0.21     34.2


As expected, the same read maps to multiple places on the same or different chromosomes.
I have a GFF file with annotated coordinates.
I would like to run a Perl script to find out which reads fall within genes in the GFF file and which do not.
The anticipated script should:

1.       Take the read coordinates on the genome (by chromosome);

2.       Go to the GFF file;

3.       Find the chromosome;

4.       Find the gene (by coordinates);

5.       and report the read, its coordinates, the chromosome, the gene, and the gene's coordinates.

The output doesn't need to be in the same order.
After this, I guess I could use a simple Microsoft Access query to pull out the reads that are not mapped to genes.
I would greatly appreciate it if anyone has a script that does a more or less similar job.

Thanks
Jay


From scott at scottcain.net  Thu Dec  1 11:59:56 2011
From: scott at scottcain.net (Scott Cain)
Date: Thu, 1 Dec 2011 11:59:56 -0500
Subject: [Bioperl-l] Chromosome coordinates
In-Reply-To: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>
References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>
Message-ID: <CA+JTaoy0o1CXCm5EZi_4K5vTAaW9kFJUFVXg1rWYB84ZnN7+5A@mail.gmail.com>

Hi Jay,

Since the maize GFF file is likely to be fairly large, I would consider
putting it in a database, using either Bio::DB::GFF if it is GFF2 or
Bio::DB::SeqFeature::Store if it is GFF3.  Then you can use the methods
that come along with either of those modules to search regions for
genes.  They both support a get_features_by_location method, so you could
get the range for each of the regions you want to look at, and check the
database with that method to see if anything is there.
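
To make the lookup itself concrete, here is a rough plain-Perl sketch of the containment test that such a query performs. It is not the Bio::DB::GFF or Bio::DB::SeqFeature::Store API: it keeps everything in memory, assumes tab-separated hit lines with the 12 columns from the original post, assumes standard 9-column GFF with 'gene' in the type column, and the sub names are made up. Note that some hits in the posted table have Chr Start greater than Chr End (minus-strand matches), so the coordinates are put in order first.

```perl
use strict;
use warnings;

# Parse one tab-separated hit line (12 columns as in the original post).
sub parse_hit {
    my ($line) = @_;
    chomp $line;
    my @f = split /\t/, $line;
    my ($start, $end) = @f[8, 9];
    ($start, $end) = ($end, $start) if $start > $end;    # minus-strand hit
    return { read => $f[0], chr => $f[1], start => $start, end => $end };
}

# Index the 'gene' features of a list of GFF lines by chromosome.
sub index_genes {
    my (@gff_lines) = @_;
    my %genes;
    for my $line (@gff_lines) {
        next if $line =~ /^#/ || $line !~ /\S/;
        chomp(my $copy = $line);
        my @f = split /\t/, $copy;
        next unless @f >= 9 && $f[2] eq 'gene';
        push @{ $genes{ $f[0] } },
            { start => $f[3], end => $f[4], attr => $f[8] };
    }
    return \%genes;
}

# Return the genes on the hit's chromosome that fully contain the hit.
sub genes_containing {
    my ($genes, $hit) = @_;
    return grep { $_->{start} <= $hit->{start} && $hit->{end} <= $_->{end} }
           @{ $genes->{ $hit->{chr} } || [] };
}

# Made-up gene line for demonstration; the hit line is from the posted table.
my $genes = index_genes("chr6\tdemo\tgene\t160769000\t160770000\t.\t+\t.\tID=geneA\n");
my $hit   = parse_hit("READ1\tchr6\t100\t17\t0\t0\t1\t17\t160769803\t160769787\t0.21\t34.2\n");
for my $gene (genes_containing($genes, $hit)) {
    print join("\t", @{$hit}{qw(read chr start end)},
               $gene->{attr}, $gene->{start}, $gene->{end}), "\n";
}
```

The linear scan per hit is only reasonable for small inputs; for a whole-genome annotation an indexed database is the better fit.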

Scott




-- 
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot
net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research


From jason.stajich at gmail.com  Thu Dec  1 12:31:29 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Thu, 1 Dec 2011 09:31:29 -0800
Subject: [Bioperl-l] Chromosome coordinates
In-Reply-To: <CA+JTaoy0o1CXCm5EZi_4K5vTAaW9kFJUFVXg1rWYB84ZnN7+5A@mail.gmail.com>
References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>
	<CA+JTaoy0o1CXCm5EZi_4K5vTAaW9kFJUFVXg1rWYB84ZnN7+5A@mail.gmail.com>
Message-ID: <470A08F7-E238-44E1-B0D1-69FDBC0BA2A3@gmail.com>

You might try BEDtools; its intersectBed tool can do a lot of what you are doing in a simple command-line program.

Jason


From jovel_juan at hotmail.com  Thu Dec  1 12:36:32 2011
From: jovel_juan at hotmail.com (Juan Jovel)
Date: Thu, 1 Dec 2011 17:36:32 +0000
Subject: [Bioperl-l] Error when using SearchIO
In-Reply-To: <CA+JTaoy0o1CXCm5EZi_4K5vTAaW9kFJUFVXg1rWYB84ZnN7+5A@mail.gmail.com>
References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>,
	<CA+JTaoy0o1CXCm5EZi_4K5vTAaW9kFJUFVXg1rWYB84ZnN7+5A@mail.gmail.com>
Message-ID: <COL102-W29B31BD46108B9F1D9EF46FAB10@phx.gbl>


Hello Everybody!
I have been running endless standalone BLASTs these days, on two different clusters. The first one runs Rock Linux, the other one Debian. On the Rock Linux one, when I use SearchIO to parse my blastp or blastx reports, I invariably get this message:
"Replacement list is longer than search list at /share/apps/lib/perl5/site_perl/5.14.1/Bio/Range.pm line 251"
What does it mean? Would it have any effect on my parsing results?
Thanks,
JUAN


From cjfields at illinois.edu  Thu Dec  1 14:03:45 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 1 Dec 2011 19:03:45 +0000
Subject: [Bioperl-l] Error when using SearchIO
In-Reply-To: <COL102-W29B31BD46108B9F1D9EF46FAB10@phx.gbl>
References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>,
	<CA+JTaoy0o1CXCm5EZi_4K5vTAaW9kFJUFVXg1rWYB84ZnN7+5A@mail.gmail.com>
	<COL102-W29B31BD46108B9F1D9EF46FAB10@phx.gbl>
Message-ID: <7233AC75-E03B-401A-8D0A-260ED76956A4@illinois.edu>

On Dec 1, 2011, at 11:36 AM, Juan Jovel wrote:

> Hello Everybody!
> I have been running endless standalone BLASTs these days, on two different clusters. The first one runs Rock Linux, the other one Debian. On the Rock Linux one, when I use SearchIO to parse my blastp or blastx reports, I invariably get this message:
> "Replacement list is longer than search list at /share/apps/lib/perl5/site_perl/5.14.1/Bio/Range.pm line 251"
> What does it mean? Would it have any effect on my parsing results?
> Thanks,
> JUAN

This is a bug that was fixed; I think the fix went into the latest BioPerl release (1.6.901).  A transliteration error produced this warning, but it is otherwise harmless. The warning is perl-version dependent; I think it pops up in perl 5.12 and up.  This user post (about halfway down) runs into the same issue: http://www.sysarchitects.com/bioperl.
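
For the curious, the same class of warning can be triggered with a deliberately wrong tr/// in plain Perl. The expression below is a made-up example, not the actual line from Bio/Range.pm (which is not quoted in this thread); it only shows that the extra replacement characters are ignored, so the result is still what a correct tr/// would give.

```perl
use strict;
use warnings;

my @warnings;
my $result;
{
    # Collect warnings instead of letting them go to STDERR.
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };

    # The tr/// is compiled inside a string eval so that its compile-time
    # warning is raised here, where the handler above can catch it.
    # Search list 'ab' has two characters, replacement list 'xyz' has three.
    $result = eval q{ (my $s = 'ab') =~ tr/ab/xyz/; $s };
    die $@ if $@;
}

print "result:  $result\n";
print "warning: $_" for @warnings;
```

The surplus 'z' in the replacement list is never used, which is why the message is noise rather than a sign of wrong parsing results.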

chris


From David.Messina at sbc.su.se  Thu Dec  1 17:02:20 2011
From: David.Messina at sbc.su.se (Dave Messina)
Date: Thu, 1 Dec 2011 23:02:20 +0100
Subject: [Bioperl-l] re trieving blast multiple alignment in fasta form
In-Reply-To: <32886592.post@talk.nabble.com>
References: <32886592.post@talk.nabble.com>
Message-ID: <CAM3TQQURpsF8+Tq2AQ6yAzmDVuDnip-6mFAfGgUdR5ScTus4YA@mail.gmail.com>

Hi Eric,

Wait, do you want multiple pairwise alignments in your output FASTA file,
or a single multiple alignment of your query and all the hits?

If the former, get_aln() will give you one pairwise alignment per hsp, but
you'll need to move the output file creation statement (my $alnIO = ...)
before the loops so it gets created only once. Then, when you do the write
statement ($alnIO->write_aln($aln);), all of the alignments will go to the
same file.
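
The reason the placement matters can be shown with plain filehandles, as a stand-in for the AlignIO object: opening with '>' truncates the target, so an open inside the loops keeps only the last alignment. This sketch uses made-up data and in-memory files; in the real script the print calls would be $alnIO->write_aln($aln) and the strings would come from get_aln().

```perl
use strict;
use warnings;

# Stand-in data: each hit holds a list of HSP alignments (plain strings here).
my %hits = (
    hitA => [ 'aln1', 'aln2' ],
    hitB => [ 'aln3' ],
);

# Problem case: the output is reopened with '>' for every HSP,
# which truncates it each time.
my $inside = '';
for my $hit (sort keys %hits) {
    for my $aln (@{ $hits{$hit} }) {
        open my $out, '>', \$inside or die "open failed: $!";
        print {$out} "$hit $aln\n";
        close $out;
    }
}

# Fixed case: open once before the loops, write many times, close at the end.
my $outside = '';
open my $out, '>', \$outside or die "open failed: $!";
for my $hit (sort keys %hits) {
    for my $aln (@{ $hits{$hit} }) {
        print {$out} "$hit $aln\n";
    }
}
close $out;

print "opened inside the loops:\n$inside";
print "opened before the loops:\n$outside";
```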

If on the other hand you'd like to have a multiple alignment between a
query and all of its hits, you'll have to take the IDs of the hits, pull
the corresponding sequences out of the database, and then run a multiple
alignment algorithm on them.


Dave


From scuoppo at gmail.com  Fri Dec  2 17:50:28 2011
From: scuoppo at gmail.com (Claudio Scuoppo)
Date: Fri, 2 Dec 2011 17:50:28 -0500
Subject: [Bioperl-l] List of genes from genomic intervals
Message-ID: <CAEz0Wfv_yj6g5rZEbmj+UhOkJX8ryzObEO7N6rqvLRMV6Yd_Pg@mail.gmail.com>

Hi,

I am new to BioPerl. I was wondering what's the best strategy to get
the genes contained in a series of human genomic intervals.
Basically, I have a table with:

Chromosome Start End

Which module should I be looking at?
Thanks,
Claudio


From awitney at sgul.ac.uk  Mon Dec  5 06:09:39 2011
From: awitney at sgul.ac.uk (Adam Witney)
Date: Mon, 5 Dec 2011 11:09:39 +0000
Subject: [Bioperl-l] Bio::Graphics imagemap and padding
Message-ID: <44A27378-CC4B-4396-817E-AA31004847C7@sgul.ac.uk>

Hi,

Image maps seem to be out of position if you use padding in the Panel, like this:

my $panel = Bio::Graphics::Panel->new( ... -pad_left  => 20, -pad_right => 20 ... );

Without these options, the image map is fine. Is this a known issue?

Also on a side note, I noticed that when using Bio::Graphics with Dancer, some of the CGI code was blocking somewhere (I found a reference to a similar problem with CGI and Catalyst), swapping CGI with HTML::Entities fixes it:

sub create_web_map {
...
	eval "require HTML::Entities" unless HTML::Entities->can('encode_entities');
...
	my $title  = HTML::Entities::encode_entities($self->make_link($tr,$feature,1));
 	my $target = HTML::Entities::encode_entities($self->make_link($tgr,$feature,1));
...
}

Thanks

Adam


From momin.amin at gmail.com  Mon Dec  5 18:00:23 2011
From: momin.amin at gmail.com (Amin Momin)
Date: Mon, 5 Dec 2011 15:00:23 -0800 (PST)
Subject: [Bioperl-l] SimpleAlign and consensus_string
Message-ID: <a08b3694-1cc6-4f5d-ad80-1306c5b777f7@v5g2000yqn.googlegroups.com>

Hi ,

I am generating a consensus sequence by aligning two protein homologs
using Bio::Tools::Run::Alignment::TCoffee. However, I am unable to
understand the criteria that the consensus_string() method of SimpleAlign
uses to determine the consensus at positions with dissimilar amino acids
or nucleotides. Also, how would the % cutoff provided to
consensus_string() affect the outcome?


Thanks,
Amin


From jason.stajich at gmail.com  Mon Dec  5 18:58:59 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Mon, 5 Dec 2011 15:58:59 -0800
Subject: [Bioperl-l] SimpleAlign and consensus_string
In-Reply-To: <a08b3694-1cc6-4f5d-ad80-1306c5b777f7@v5g2000yqn.googlegroups.com>
References: <a08b3694-1cc6-4f5d-ad80-1306c5b777f7@v5g2000yqn.googlegroups.com>
Message-ID: <4A59F912-BEE6-48D8-91EC-EAACB293A574@gmail.com>

There are several methods that do related things. 

Below is the documentation for the method. Besides plain majority-rule voting for each position, it allows stricter calling: if you set the cutoff at 50%, then the called AA or NT has to appear at least 50% of the time. There is also an IUPAC consensus string, which is useful for generating consensus patterns for DNA.

If you want a summary of whether or not there are only conservative amino acid changes in a column, use the match_line method, which generates a string similar to the summary line that CLUSTALW reports for each column.

=head2 consensus_string

 Title     : consensus_string
 Usage     : $str = $ali->consensus_string($threshold_percent)
 Function  : Makes a strict consensus
 Returns   : Consensus string
 Argument  : Optional threshold ranging from 0 to 100.
             The consensus residue has to appear at least threshold %
             of the sequences at a given location, otherwise a '?'
             character will be placed at that location.
             (Default value = 0%)

=cut
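
As a rough illustration of the documented rule (not the SimpleAlign implementation), the threshold logic can be sketched in plain Perl on equal-length aligned strings. The sub name strict_consensus is made up, gaps are counted like any other character, and ties are broken alphabetically; both of those are assumptions, so the real method may behave differently in those cases.

```perl
use strict;
use warnings;

# Per-column consensus: take the most frequent character and keep it only
# if it occurs in at least $threshold percent of the rows, otherwise '?'.
sub strict_consensus {
    my ($threshold, @rows) = @_;
    $threshold //= 0;
    my $consensus = '';
    for my $col (0 .. length($rows[0]) - 1) {
        my %count;
        $count{ substr($_, $col, 1) }++ for @rows;
        # Most frequent character first; ties are broken alphabetically
        # (an assumption made for this sketch).
        my ($best) = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
        my $percent = 100 * $count{$best} / @rows;
        $consensus .= $percent >= $threshold ? $best : '?';
    }
    return $consensus;
}

my @rows = ('MKV', 'MRV');    # two homologs that differ at column 2
print strict_consensus(0,  @rows), "\n";
print strict_consensus(51, @rows), "\n";
```

With only two sequences, each residue at a mismatched column accounts for exactly 50%, so any threshold above 50 turns every such column into '?'.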

> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From wenbinmei at gmail.com  Tue Dec  6 11:09:35 2011
From: wenbinmei at gmail.com (wenbin mei)
Date: Tue, 6 Dec 2011 11:09:35 -0500
Subject: [Bioperl-l] revcom the multiple sequence alignment
Message-ID: <CAHdrE2RfvOCy_3aUP_nD9Gp=VSmqLh52wh0vQmgKmaebVgyL-w@mail.gmail.com>

Hi,

I have a question about reverse-complementing a multiple sequence
alignment. One way I can do it is to convert the alignment to FASTA
format and revcom the individual sequences. Is there an easy way to
reverse-complement the alignment as a whole?  Thank you for your help.

-best,
wenbin


From jason.stajich at gmail.com  Tue Dec  6 12:40:37 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Tue, 6 Dec 2011 09:40:37 -0800
Subject: [Bioperl-l] revcom the multiple sequence alignment
In-Reply-To: <CAHdrE2RfvOCy_3aUP_nD9Gp=VSmqLh52wh0vQmgKmaebVgyL-w@mail.gmail.com>
References: <CAHdrE2RfvOCy_3aUP_nD9Gp=VSmqLh52wh0vQmgKmaebVgyL-w@mail.gmail.com>
Message-ID: <CAAC48B9-371B-4446-897B-B1D8968CA7EA@gmail.com>

I think this would work to update it in place, though I haven't tried it myself:

for my $seq ( $aln->each_seq ) {
 $seq->seq( $seq->revcom->seq );
}
$out->write_aln($aln);

This may also work - I am not entirely sure whether any extra work is done on the metadata (start/end) of the Seq object when this is done.  If not, you may want to flip start/end for the sequences (the seqs are Bio::LocatableSeq objects) explicitly. Or you may not care about those data and can ignore them.

   $seq = $seq->revcom
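
For anyone who wants to see what the operation does to the gapped rows themselves, here is a minimal sketch in plain Perl, independent of the BioPerl objects above. The helper name revcom_row is hypothetical; it assumes '-' or '.' as gap characters and IUPAC nucleotide codes.

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl method): reverse-complement one
# gapped alignment row. Reversing the whole string, gaps included, keeps
# the columns lined up across rows; tr/// complements the IUPAC codes
# and leaves gap characters untouched.
sub revcom_row {
    my ($row) = @_;
    my $rc = reverse $row;
    $rc =~ tr/ACGTUMRWSYKVHDBNacgtumrwsykvhdbn/TGCAAKYWSRMBDHVNtgcaakywsrmbdhvn/;
    return $rc;
}

my %aln = (seq1 => 'ATG-CCA', seq2 => 'ATGACCA');
for my $id (sort keys %aln) {
    printf "%s %s\n", $id, revcom_row($aln{$id});
}
# prints:
# seq1 TGG-CAT
# seq2 TGGTCAT
```

Because every row is reversed over the same number of columns, column i of the original alignment becomes column (length - i + 1) of the result, which is why any start/end metadata would need flipping separately.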

Jason


From wenbinmei at gmail.com  Tue Dec  6 12:51:18 2011
From: wenbinmei at gmail.com (wenbin mei)
Date: Tue, 6 Dec 2011 12:51:18 -0500
Subject: [Bioperl-l] revcom the multiple sequence alignment
In-Reply-To: <CAAC48B9-371B-4446-897B-B1D8968CA7EA@gmail.com>
References: <CAHdrE2RfvOCy_3aUP_nD9Gp=VSmqLh52wh0vQmgKmaebVgyL-w@mail.gmail.com>
	<CAAC48B9-371B-4446-897B-B1D8968CA7EA@gmail.com>
Message-ID: <CAHdrE2TqBxYN+wPC=LH433GwMt3sp5oas_OQhwWY9DbfbjdCpg@mail.gmail.com>

I think I did not explain my question clearly. I extract individual
gene alignments from a whole-genome alignment. Since some genes are on
the reverse strand, I want to revcom those gene alignments. Here is part
of my script. I can read the strand information from another file.

my $newstart = $refseq->column_from_residue_number($start);
my $newend = $refseq->column_from_residue_number($end);
$seq{$genename} = $aln->slice($newstart, $newend);
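
As an illustration of what the two steps above amount to (map residue numbers on the reference row to alignment columns, then cut that column range from every row), here is a plain-Perl sketch. The helper column_of_residue is hypothetical; it assumes '-' is the only gap character and that residues are numbered from 1 on the row itself, which may differ from how a Bio::LocatableSeq with its own start coordinate numbers them.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 1-based alignment column that holds the
# $n-th residue of a gapped row, or nothing if the row has fewer residues.
sub column_of_residue {
    my ($row, $n) = @_;
    my $seen = 0;
    for my $col (1 .. length($row)) {
        next if substr($row, $col - 1, 1) eq '-';   # gaps are not residues
        return $col if ++$seen == $n;
    }
    return;
}

my $ref   = 'AT--GCC-A';
my $first = column_of_residue($ref, 3);   # third residue (G) is in column 5
my $last  = column_of_residue($ref, 5);   # fifth residue (C) is in column 7

# Cut the same column range out of every row of the alignment.
my @rows = ($ref, 'ATTTGCCGA');
print substr($_, $first - 1, $last - $first + 1), "\n" for @rows;
```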


Any suggestions on how to revcom the gene alignments on the minus strand
would be helpful. Thank you.

-best,
wenbin




-- 
Wenbin Mei Ph.D. Student
Dr. Brad Barbazuk's Lab
Department of Biology
University of Florida
509-899-3067
wmei at ufl.edu <wmei at ufl.edu>


From kellert at ohsu.edu  Tue Dec  6 13:21:39 2011
From: kellert at ohsu.edu (Tom Keller)
Date: Tue, 6 Dec 2011 10:21:39 -0800
Subject: [Bioperl-l] Bioperl-l Digest, Vol 104, Issue 3
In-Reply-To: <mailman.3.1322931604.28955.bioperl-l@lists.open-bio.org>
References: <mailman.3.1322931604.28955.bioperl-l@lists.open-bio.org>
Message-ID: <B68BC6F2-8C57-4749-902D-3232B0DA6113@ohsu.edu>

I'd start by looking for the section "Searching for genes in genomic DNA" in the HOWTO:Beginners page on the BioPerl website.

Thomas (Tom) Keller, PhD
kellert at ohsu.edu
503.494.2442
6588 R Jones Hall (BSc/CROET)
MMI DNA Services
Member of OHSU Shared Resources

On Dec 3, 2011, at 9:00 AM, <bioperl-l-request at lists.open-bio.org> <bioperl-l-request at lists.open-bio.org> wrote:

> Message: 1
> Date: Fri, 2 Dec 2011 17:50:28 -0500
> From: Claudio Scuoppo <scuoppo at gmail.com>
> Subject: [Bioperl-l] List of genes from genomic intervals
> To: bioperl-l at lists.open-bio.org
> Message-ID:
> 	<CAEz0Wfv_yj6g5rZEbmj+UhOkJX8ryzObEO7N6rqvLRMV6Yd_Pg at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> Hi,
> 
> I am new to BioPerl. I was wondering what's the best strategy to get
> the genes contained in a series of human genomic intervals.
> Basically, I have a table with:
> 
> Chromosome Start End
> 
> Which module should I be looking at?
> Thanks,
> Claudio


From wenbinmei at gmail.com  Tue Dec  6 17:54:51 2011
From: wenbinmei at gmail.com (wenbin mei)
Date: Tue, 6 Dec 2011 17:54:51 -0500
Subject: [Bioperl-l] revcom the multiple sequence alignment
In-Reply-To: <CAAC48B9-371B-4446-897B-B1D8968CA7EA@gmail.com>
References: <CAHdrE2RfvOCy_3aUP_nD9Gp=VSmqLh52wh0vQmgKmaebVgyL-w@mail.gmail.com>
	<CAAC48B9-371B-4446-897B-B1D8968CA7EA@gmail.com>
Message-ID: <CAHdrE2S33zcuchbSXuH2NwM5gM-=BnxVx9xA13ye18gPi2Mtcg@mail.gmail.com>

Figured it out! Thanks for the help.

-best,
wenbin




-- 
Wenbin Mei Ph.D. Student
Dr. Brad Barbazuk's Lab
Department of Biology
University of Florida
509-899-3067
wmei at ufl.edu <wmei at ufl.edu>


From momin.amin at gmail.com  Tue Dec  6 12:37:16 2011
From: momin.amin at gmail.com (Amin Momin)
Date: Tue, 6 Dec 2011 11:37:16 -0600
Subject: [Bioperl-l] SimpleAlign and consensus_string
In-Reply-To: <4A59F912-BEE6-48D8-91EC-EAACB293A574@gmail.com>
References: <a08b3694-1cc6-4f5d-ad80-1306c5b777f7@v5g2000yqn.googlegroups.com>
	<4A59F912-BEE6-48D8-91EC-EAACB293A574@gmail.com>
Message-ID: <CAA0DaRhm+jsPpFFYR6q2xj0YOkYy3Enh8rrRD-YQJ26z_U+Fkw@mail.gmail.com>

Thanks Jason




From sunwukong at potc.net  Wed Dec  7 14:05:20 2011
From: sunwukong at potc.net (sunwukong)
Date: Wed, 07 Dec 2011 11:05:20 -0800
Subject: [Bioperl-l] DNA Sequencing two questions
Message-ID: <4EDFB8F0.8080001@potc.net>

I am not a medical professional but I have two DNA related questions.

A year or so ago I realized that if the standard building blocks of life
were the nucleotide bases G, A, T and C, then they could be represented
as a base 4 number system (e.g., 0, 1, 2 and 3).  Then any life form
could be represented by a number (it would be very long).  So I set out
on a quest to do this with a small life form.  For fun I chose the
Spanish Flu, which I believe I found on an NIH site.  Then I set out and
realized that there was no standard.  And I did not know if the number
would be built with the most significant digit on the left or right.

1.  Is there a standard method for representing the GATC bases as
numbers?
g = 0
a = 1
t  = 2
c = 3

2. Is the sequence read left to right or right to left?
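
The numbering scheme in question 1 can be sketched in a few lines of Perl. This is only an illustration under stated assumptions: the digit assignment is the arbitrary one proposed above, not a standard, and the string is read left to right (5' to 3', the order in which sequences are conventionally written) with the most significant digit first. The helper name dna_to_number is hypothetical. Math::BigInt is used because a genome-sized number will not fit in a native integer.

```perl
use strict;
use warnings;
use Math::BigInt;

# The digit assignment proposed in the post; arbitrary, not a standard.
my %digit = (g => 0, a => 1, t => 2, c => 3);

# Hypothetical helper: read a DNA string left to right as a base-4 number,
# most significant digit first.
sub dna_to_number {
    my ($dna) = @_;
    my $n = Math::BigInt->new(0);
    for my $base (split //, lc $dna) {
        die "unexpected base '$base'\n" unless exists $digit{$base};
        $n->bmul(4)->badd($digit{$base});   # shift one base-4 place, add digit
    }
    return $n;
}

print dna_to_number('GATC'), "\n";   # digits 0,1,2,3 -> prints 27
```

One consequence of this particular mapping: leading G's act as leading zeros, so 'GGA' and 'A' give the same number, and the sequence length would have to be stored alongside the number to recover the original string.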

note:  It may be biologically significant if the right values are
assigned to the letters GATC; there could be a pattern somewhere that
holds significant information.  One idea might be to look at DNA 
sequences in bases other than 4 to see if something jumps out.

http://www.insectscience.org/2.10/ref/fig5a.gif

VR
Pat Kirol
509 442-2214


From Russell.Smithies at agresearch.co.nz  Wed Dec  7 16:59:18 2011
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Thu, 8 Dec 2011 10:59:18 +1300
Subject: [Bioperl-l] DNA Sequencing two questions
In-Reply-To: <4EDFB8F0.8080001@potc.net>
References: <4EDFB8F0.8080001@potc.net>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>

I did something similar a few years ago (after watching the movie "Contact" I think) and encoded codons as RGB values and drew an image of a genome. Looked much like random noise but I might try it again and draw as a space filling curve.
I guess if you're looking for "hidden messages", why restrict yourself to 2 dimensions?  Perhaps something pops out as a single-image stereogram, e.g. http://upload.wikimedia.org/wikipedia/commons/8/8f/Stereogram_Tut_Random_Dot_Shark.png
Perhaps it's a 3D "object" represented by slices drawn in a series of 2D planes?

But you need a bit of biological background as there will be patterns simply because of the way genes "work" and are laid out in chromosomes. You need to remember that DNA is effectively a 2D representation of a 3D protein structure and there is already much hidden information we know we don't understand - a "simple" task like how proteins fold is barely understood and why some become prions is still a mystery.

But don't let this stop you uncovering the great secret hidden in our genes :-)

On a similar note, have a look at http://medgadget.com/2011/10/send-your-secret-message-hidden-in-bacteria.html

--Russell

=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================


From jason.stajich at gmail.com  Wed Dec  7 17:53:10 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Wed, 7 Dec 2011 14:53:10 -0800
Subject: [Bioperl-l] DNA Sequencing two questions
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>
References: <4EDFB8F0.8080001@potc.net>
	<18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>
Message-ID: <9BDA2BFA-264F-45B2-9234-A6BC9402FBA8@gmail.com>

For other fun picture games -- 

You can look at patterns of motifs/words in a chaos game representation of genomes.
http://mbe.oxfordjournals.org/content/16/10/1391.long
http://mbe.oxfordjournals.org/content/20/6/901.long




From Russell.Smithies at agresearch.co.nz  Wed Dec  7 19:29:47 2011
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Thu, 8 Dec 2011 13:29:47 +1300
Subject: [Bioperl-l] DNA Sequencing two questions
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>
References: <4EDFB8F0.8080001@potc.net>
	<18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF340186CF245@exchsth.agresearch.co.nz>

I tried again and came up with this:
http://www.bioperl.org/w/images/7/7a/Autostereogram.png
If you look carefully, you can see the answer to life, the universe, and everything!!

--Russell



From lumos.lumos.lumos at gmail.com  Fri Dec  9 11:47:36 2011
From: lumos.lumos.lumos at gmail.com (lumos lumos)
Date: Fri, 9 Dec 2011 08:47:36 -0800
Subject: [Bioperl-l] Mouse->Human homologues ?
Message-ID: <CAJbewumvg+iQ787GYQTrL8bLRqPGb6T3jNzw2GZJQAVwZU+aAA@mail.gmail.com>

Hello,

Is there a way to get the human homologues for a list of mouse genes,
with all the human gene symbols as text output?

Thank you
LM


From cjfields at illinois.edu  Fri Dec  9 12:17:20 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Fri, 9 Dec 2011 17:17:20 +0000
Subject: [Bioperl-l] Mouse->Human homologues ?
In-Reply-To: <CAJbewumvg+iQ787GYQTrL8bLRqPGb6T3jNzw2GZJQAVwZU+aAA@mail.gmail.com>
References: <CAJbewumvg+iQ787GYQTrL8bLRqPGb6T3jNzw2GZJQAVwZU+aAA@mail.gmail.com>
Message-ID: <C19C1032-5FE7-400A-AEB5-152EEFC4C435@illinois.edu>

There are lots of databases that have this capability (Ensembl, OrthoDB, HomoloGene, OMA, to name only a few).  Have you tried a simple search for this, or did you want expert opinion on the matter?

chris

PS - Just to note, there is a lot of controversy swirling about re: the ortholog conjecture and some recently published papers calling it into question using human-mouse data, worth a look if you're trotting this path to know the current situation.  If you have access to F1000, see the following (paper itself is open :)

Faculty of 1000 evaluations, dissents and comments for [Nehrt NL et al. Testing the ortholog conjecture with comparative functional genomic data from mammals. PLoS Comput Biol. 2011 Jun; 7(6):e1002073; doi: 10.1371/journal.pcbi.1002073]. Faculty of 1000, 31 Aug 2011. F1000.com/12462957



From lumos.lumos.lumos at gmail.com  Fri Dec  9 12:29:24 2011
From: lumos.lumos.lumos at gmail.com (lumos lumos)
Date: Fri, 9 Dec 2011 09:29:24 -0800
Subject: [Bioperl-l] Mouse->Human homologues ?
In-Reply-To: <C19C1032-5FE7-400A-AEB5-152EEFC4C435@illinois.edu>
References: <CAJbewumvg+iQ787GYQTrL8bLRqPGb6T3jNzw2GZJQAVwZU+aAA@mail.gmail.com>
	<C19C1032-5FE7-400A-AEB5-152EEFC4C435@illinois.edu>
Message-ID: <CAJbewukt_xCCpQaWTsvqi2z1NkbsTZRG6xXJUcZhcK5jdAZhWQ@mail.gmail.com>

Hi Chris,

Thanks for your reply. I wanted to know if there is any way to do it
automatically with a Perl script, for a list of mouse genes whose human
homologues I require.

LM



From lumos.lumos.lumos at gmail.com  Wed Dec  7 23:47:19 2011
From: lumos.lumos.lumos at gmail.com (lumos lumos)
Date: Wed, 7 Dec 2011 20:47:19 -0800
Subject: [Bioperl-l] Perl parsing
Message-ID: <CAJbewunJa-_FVfjiOjZqZrv_h9RcZuCNpjGviWDM6O0mXjQQ=A@mail.gmail.com>

Hello,

I have a tab-delimited text file with some gene names as shown below.

*BRCA1: breast cancer 1, early onset

TNF: tumor necrosis factor

OMG: oligodendrocyte myelin glycoprotein*

I would like to get the list of gene names (BRCA1, TNF, OMG), i.e. the
text before the colon (:) on each line.
How do I parse this text file in Perl to get this list of genes?
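
One possible way to do this in plain Perl is sketched below. The helper name gene_symbols is hypothetical, and the pattern assumes that each gene symbol contains no spaces and that a leading '*' (as in the pasted sample) is formatting rather than part of the name.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text before the first colon on each
# line that has one, skipping blank lines and an optional leading '*'.
sub gene_symbols {
    my (@lines) = @_;
    my @symbols;
    for my $line (@lines) {
        next unless $line =~ /^\s*\*?\s*([^:\s]+)\s*:/;
        push @symbols, $1;
    }
    return @symbols;
}

my @sample = (
    "*BRCA1: breast cancer 1, early onset\n",
    "\n",
    "TNF: tumor necrosis factor\n",
    "\n",
    "OMG: oligodendrocyte myelin glycoprotein*\n",
);
print join(',', gene_symbols(@sample)), "\n";   # prints BRCA1,TNF,OMG
```

To run it on a file, the same function can be handed the lines read from an opened filehandle, e.g. gene_symbols(<$fh>).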

Thanks in advance.
LM


From b.m.forde at umail.ucc.ie  Fri Dec  9 11:52:56 2011
From: b.m.forde at umail.ucc.ie (BForde)
Date: Fri, 9 Dec 2011 08:52:56 -0800 (PST)
Subject: [Bioperl-l]  Genbank files
Message-ID: <32941955.post@talk.nabble.com>


Hello all,

I am new to BioPerl, so I apologise if this is a stupid question.

For CDS features I wish to add additional qualifiers, e.g. /colour and
/note qualifiers. I have looked at the BioPerl wiki but am still unsure
how to do this.
regards

Brian


From jboddu at illinois.edu  Fri Dec  9 14:59:39 2011
From: jboddu at illinois.edu (Boddu, Jayanand)
Date: Fri, 9 Dec 2011 19:59:39 +0000
Subject: [Bioperl-l] Batch processing of Data
Message-ID: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>

Hi Anyone:
Please let me know if the following is practical with Perl.
My data output can be described as follows.

1.       Hundreds of samples are run.

2.       A batch output sends the data from each sample to its own "folder". The output is in the form of a few text files, spreadsheets and PDF files.

3.       One of the spreadsheets has the data of most interest.

4.       This means I end up having hundreds of folders.

5.       The spreadsheet with the data has multiple worksheets, of which a couple have the interesting data to be processed (please find attached a spreadsheet output in which the data is organized; the worksheets of interest are named "Compound" and "Peak", and the yellow-highlighted columns in each worksheet hold the data to be processed).
OK. That's a long description.
NOW. Is it practical to write a Perl (or any other) script to:

1.       Enter each folder.

2.       Look for the spreadsheet of interest.

3.       Look for worksheets named "Compound" and "Peak".

4.       Look for the specific columns of interest.

5.       Copy paste the columns of interest into a new spreadsheet/text file with data from each folder next to each other.

This final spreadsheet will pass through a bunch of other calculations.

I apologize for this long and painful description.
However, it would be great if this can be done.
Thanks
Jay
-------------- next part --------------
A non-text attachment was scrubbed...
Name: REPORT01.xls
Type: application/vnd.ms-excel
Size: 93696 bytes
Desc: REPORT01.xls
URL: <http://lists.open-bio.org/pipermail/bioperl-l/attachments/20111209/0528887b/attachment-0002.xls>

From cjfields at illinois.edu  Fri Dec  9 15:37:48 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Fri, 9 Dec 2011 20:37:48 +0000
Subject: [Bioperl-l] Perl parsing
In-Reply-To: <CAJbewunJa-_FVfjiOjZqZrv_h9RcZuCNpjGviWDM6O0mXjQQ=A@mail.gmail.com>
References: <CAJbewunJa-_FVfjiOjZqZrv_h9RcZuCNpjGviWDM6O0mXjQQ=A@mail.gmail.com>
Message-ID: <E70D9F7E-7419-47AE-A0F9-3FB2B7B25907@illinois.edu>

On Dec 7, 2011, at 10:47 PM, lumos lumos wrote:

> Hello,
> 
> I have a text file(tab-delim) with some gene names as shown below.
> 
> *BRCA1: breast cancer 1, early onset
> 
> TNF: tumor necrosis factor
> 
> OMG: oligodendrocyte myelin glycoprotein*
> 
> I would like to get the list of gene name BRCA1,TNF,OMG that is before the
> colon(:) .
> How do I parse in perl this text file with this list of genes?

'Very carefully?'

Okay, I'll try to refrain from further sarcasm, but I'm confused, what does this have to do with BioPerl (*the toolkit*) specifically?  That is what this mailing list is for.  

Just to note, this is a very common Perl task. The answer is attainable by searching for it (not to mention taking the time to learn basic Perl).  For instance:

   http://stackoverflow.com/questions/4500407/in-perl-how-can-i-correctly-parse-tab-space-delimited-files-with-quoted-strings

One of the many links found by simply using Google:

   http://lmgtfy.com/?q=perl+parse+tab+file

I'll leave the regex munging to you.  

(okay, I failed at refraining from sarcasm, ah well it's Friday).

chris


> Thanks in advance.
> LM
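
For readers of the archive, here is a minimal plain-Perl sketch of the kind of regex approach the reply alludes to. It assumes one "SYMBOL: description" entry per line, as in the pasted example; the subroutine name gene_symbols and the handling of the stray '*' markers are illustrative assumptions, not something given in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text before the first colon on each line.
sub gene_symbols {
    my @lines = @_;
    my @symbols;
    for my $line (@lines) {
        # Skip leading whitespace and any stray '*' markers, then capture
        # everything up to (but not including) the first colon.
        if ($line =~ /^[\s*]*([^:\s]+)\s*:/) {
            push @symbols, $1;
        }
    }
    return @symbols;
}

my @input = (
    "*BRCA1: breast cancer 1, early onset",
    "",
    "TNF: tumor necrosis factor",
    "OMG: oligodendrocyte myelin glycoprotein*",
);

# Prints: BRCA1,TNF,OMG
print join(',', gene_symbols(@input)), "\n";
```

To read from a file instead, the same subroutine could be fed the lines of an opened filehandle, e.g. gene_symbols(<$fh>).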


From jason.stajich at gmail.com  Fri Dec  9 16:18:38 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Fri, 9 Dec 2011 13:18:38 -0800
Subject: [Bioperl-l] Genbank files
In-Reply-To: <32941955.post@talk.nabble.com>
References: <32941955.post@talk.nabble.com>
Message-ID: <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>

$feature->add_tag_value('color','blue');

On Dec 9, 2011, at 8:52 AM, BForde wrote:

> 
> Hello all,
> 
> I am new to Bioperl so I apologise if this is stupid question. 
> 
> For CDS features I which to add additional qualifiers e.g. /colour and /note
> qualifiers. I have looked at the BioPerl wiki but am still unsure as how to
> do this?
> 
> regards
> 
> Brian
> -- 
> View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32941955.html
> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org


From bosborne11 at verizon.net  Fri Dec  9 15:31:15 2011
From: bosborne11 at verizon.net (Brian Osborne)
Date: Fri, 09 Dec 2011 15:31:15 -0500
Subject: [Bioperl-l] Genbank files
In-Reply-To: <32941955.post@talk.nabble.com>
References: <32941955.post@talk.nabble.com>
Message-ID: <3AB0716B-EC07-4BD6-9FC8-0C47A29FC0BA@verizon.net>

Brian,

Reasonable question. Start here:

http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

If you've never used Bioperl then:

http://www.bioperl.org/wiki/HOWTO:Beginners

Brian


On Dec 9, 2011, at 11:52 AM, BForde wrote:

> 
> Hello all,
> 
> I am new to Bioperl so I apologise if this is stupid question. 
> 
> For CDS features I which to add additional qualifiers e.g. /colour and /note
> qualifiers. I have looked at the BioPerl wiki but am still unsure as how to
> do this?
> 
> regards
> 
> Brian
> -- 
> View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32941955.html
> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
> 


From asjo at koldfront.dk  Fri Dec  9 17:25:00 2011
From: asjo at koldfront.dk (Adam Sjøgren)
Date: Fri, 09 Dec 2011 23:25:00 +0100
Subject: [Bioperl-l] Batch processing of Data
References: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>
Message-ID: <871usdpemb.fsf@topper.koldfront.dk>

On Fri, 9 Dec 2011 19:59:39 +0000, Boddu, wrote:

> Please let me know if the following is practical with PERL.

It might very well be, yes.

Modules you might be interested in include Spreadsheet::ParseExcel,
Spreadsheet::XLSX, Spreadsheet::WriteExcel and Excel::Writer::XLSX[1].

A big help in finding interesting CPAN modules is the search engine on
https://metacpan.org/

Depending on your platform and preference using find(1) might also be
helpful to traverse the folders, rather than doing so in Perl.

Note that none of this has anything to do with BioPerl as such, though,
and you'll need to do some actual programming to get the job done.
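
If the traversal is done in Perl rather than with find(1), the core File::Find module is one option. The following is only a sketch: the helper name find_spreadsheets is made up, and it assumes the spreadsheets of interest can be recognised by an .xls or .xlsx extension (the attachment in the question was REPORT01.xls).

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: collect every .xls/.xlsx file below a top-level
# directory, one sample folder per subdirectory (layout is an assumption).
sub find_spreadsheets {
    my ($top) = @_;
    my @found;
    find(
        sub {
            # Inside the callback, $_ is the bare file name and
            # $File::Find::name is the path including the directory.
            push @found, $File::Find::name if -f $_ && /\.xlsx?\z/i;
        },
        $top,
    );
    my @sorted = sort @found;
    return @sorted;
}

# Usage: perl find_sheets.pl /path/to/batch/output
my $top = shift @ARGV;
if (defined $top) {
    print "$_\n" for find_spreadsheets($top);
}
```

Each path returned could then be handed to one of the spreadsheet modules mentioned above.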


  Best regards,

    Adam


[1] http://blogs.perl.org/users/john_mcnamara/2011/10/spreadsheetwriteexcel-is-dead-long-live-excelwriterxlsx.html

-- 
 "Angels can fly because they take themselves lightly."       Adam Sjøgren
                                                         asjo at koldfront.dk


From David.Messina at sbc.su.se  Fri Dec  9 17:30:23 2011
From: David.Messina at sbc.su.se (Dave Messina)
Date: Fri, 9 Dec 2011 23:30:23 +0100
Subject: [Bioperl-l] Batch processing of Data
In-Reply-To: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>
References: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>
Message-ID: <CAM3TQQWMnqwShBhYQWH9iZqtDphVFYLYtuVWEMxxfVY1OqSbhg@mail.gmail.com>

Yes, it can be done. However, it has nothing to do with this mailing list.

Steps 1 and 2 are basic Perl.
For steps 3 through 5, try googling "perl parse excel".
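
As a rough illustration of the last step only (placing the data from each folder next to each other in a text file), here is a sketch in core Perl. It assumes the column values have already been extracted from each spreadsheet by some other means; the helper name columns_side_by_side and the sample labels are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: join several columns side by side as tab-delimited
# rows. %columns maps a label (e.g. a folder name) to an array ref of
# values; shorter columns are padded with empty cells.
sub columns_side_by_side {
    my (%columns) = @_;
    my @labels = sort keys %columns;

    # Find the longest column so every row has one cell per label.
    my $rows = 0;
    for my $label (@labels) {
        my $n = scalar @{ $columns{$label} };
        $rows = $n if $n > $rows;
    }

    # Header line first, then one line per row index.
    my @lines = (join("\t", @labels));
    for my $i (0 .. $rows - 1) {
        push @lines, join("\t",
            map { defined $columns{$_}[$i] ? $columns{$_}[$i] : '' } @labels);
    }
    return @lines;
}

print "$_\n" for columns_side_by_side(
    sample01 => [ 12, 34, 56 ],
    sample02 => [ 78, 90 ],
);
```

The tab-delimited output opens directly in a spreadsheet program for the later calculations.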


Dave


On Fri, Dec 9, 2011 at 20:59, Boddu, Jayanand <jboddu at illinois.edu> wrote:

> Hi Anyone:
> Please let me know if the following is practical with PERL.
> My data output can be described as following.
>
> 1.       Hundreds of samples are run.
>
> 2.       A batch output sends data from each sample to its own "folder".
> Output is in the form of few text files, spreadsheets and PDF files.
>
> 3.       One of the spreadsheet has the data of most interest.
>
> 4.       This means I end up having hundreds of folders.
>
> 5.       The spreadsheet with the data has multiple worksheets out of
> which a couple have the interesting data to be processed (Please find
> attached a spreadsheet output in which the data is organized and the
> worksheets of my interest are named as "Compound" and "Peak". Yellow
> high-lighted columns in each worksheet has the data to be processed).
> OK. That's long description.
> NOW. Is it practical to write a PERL/or any script to;
>
> 1.       Enter each folder.
>
> 2.       Look for the spreadsheet of interest.
>
> 3.       Look for worksheets named "Compound" and "Peak".
>
> 4.       Look for the specific columns of interest.
>
> 5.       Copy paste the columns of interest into a new spreadsheet/text
> file with data from each folder next to each other.
>
> This final spreadsheet will pass through a bunch of other calculations.
>
> I apologize for this long and painful description.
> However, it would be great if this can be done.
> Thanks
> Jay
>
>


From lsbrath at gmail.com  Sat Dec 10 16:39:44 2011
From: lsbrath at gmail.com (Mgavi Brathwaite)
Date: Sat, 10 Dec 2011 16:39:44 -0500
Subject: [Bioperl-l] Perl parsing
In-Reply-To: <E70D9F7E-7419-47AE-A0F9-3FB2B7B25907@illinois.edu>
References: <CAJbewunJa-_FVfjiOjZqZrv_h9RcZuCNpjGviWDM6O0mXjQQ=A@mail.gmail.com>
	<E70D9F7E-7419-47AE-A0F9-3FB2B7B25907@illinois.edu>
Message-ID: <CAJm=ba98HUgAB1kUG29_KA+ZvNWP_AsHoJQNPQ-_Fe=Pa7b74Q@mail.gmail.com>

Yes grasshopper you have to suffer a little bit. Learn Perl first, then
step up to BioPerl. Chris I feel you concerning the power of Regex, and the
sarcasm.

Lom

On Fri, Dec 9, 2011 at 3:37 PM, Fields, Christopher J <cjfields at illinois.edu
> wrote:

> On Dec 7, 2011, at 10:47 PM, lumos lumos wrote:
>
> > Hello,
> >
> > I have a text file(tab-delim) with some gene names as shown below.
> >
> > *BRCA1: breast cancer 1, early onset
> >
> > TNF: tumor necrosis factor
> >
> > OMG: oligodendrocyte myelin glycoprotein*
> >
> > I would like to get the list of gene name BRCA1,TNF,OMG that is before
> the
> > colon(:) .
> > How do I parse in perl this text file with this list of genes?
>
> 'Very carefully?'
>
> Okay, I'll try to refrain from further sarcasm, but I'm confused, what
> does this have to do with BioPerl (*the toolkit*) specifically?  That is
> what this mailing list is for.
>
> Just to note, this is a very common perl task. The answer is attainable by
> searching for it (not to mention taking the time to learn basic perl).  For
> instance:
>
>
> http://stackoverflow.com/questions/4500407/in-perl-how-can-i-correctly-parse-tab-space-delimited-files-with-quoted-strings
>
> One of the many links found by simply using Google:
>
>   http://lmgtfy.com/?q=perl+parse+tab+file
>
> I'll leave the regex munging to you.
>
> (okay, I failed at refraining from sarcasm, ah well it's friday).
>
> chris
>
>
> > Thanks in advance.
> > LM
>
>
>
>


From pawan.mani2 at gmail.com  Mon Dec  5 17:00:09 2011
From: pawan.mani2 at gmail.com (pawan.mani2 at gmail.com)
Date: Tue, 6 Dec 2011 03:30:09 +0530
Subject: [Bioperl-l] bioperl in cygwin
Message-ID: <7B046189155547CEB23F0846D7577D1F@PAWANKUMARPC>

Hi
     I would like to install BioPerl. I gave the following commands in a Cygwin terminal:


 perl -MCPAN -e shell

then I type

    o conf prerequisites_policy follow
    o conf commit
    install Bundle::CPAN 
install Module::Build 
d /bioperl/ 
then you get a list of different versions. 
I selected CJFIELDS/BioPerl-1.6.1.96
install CJFIELDS/BioPerl-1.6.1.96.tar.gz 


but build.install was not ok.

Kindly let me know the steps for a complete BioPerl installation in Cygwin under Windows 7.

Thanks in advance.

with best regards,
Pawan


From cjfields at illinois.edu  Sun Dec 11 13:22:01 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sun, 11 Dec 2011 18:22:01 +0000
Subject: [Bioperl-l] bioperl in cygwin
In-Reply-To: <7B046189155547CEB23F0846D7577D1F@PAWANKUMARPC>
References: <7B046189155547CEB23F0846D7577D1F@PAWANKUMARPC>
Message-ID: <B674A464-E650-4CBF-B2CE-2100AB0B29B9@illinois.edu>

Pawan,

Hard to say what the problem is w/o supplying warnings/errors.  Prior to doing this, you should try installing BioPerl-1.6.901 (the latest CPAN release).  You can try a direct installation of the distribution, but the easiest way to get the latest version is to try installing Bio::Perl.

(I'm not sure what BioPerl-1.6.1.96 is, but this seems wrong)

chris

On Dec 5, 2011, at 4:00 PM, <pawan.mani2 at gmail.com>
 <pawan.mani2 at gmail.com> wrote:

> Hi
>     I would like to after the givibg following commands in cgwin terminal:
> 
> 
> perl -MCPAN -e shell
> 
> then I type
> 
>    o conf prerequisites_policy follow
>    o conf commit
>    install Bundle::CPAN 
> install Module::Build 
> d /bioperl/ 
> then we  you get a list of different versions. 
> I selected CJFIELDS/BioPerl-1.6.1.96
> install CJFIELDS/BioPerl-1.6.1.96.tar.gz 
> 
> 
> but build.install was not ok.
> 
> Kindly allow me to know the step of complete bioprl installatin cygwin undre windows 7.
> 
> thanks in advanced.
> 
> with best regards,
> Pawan


From b.m.forde at umail.ucc.ie  Tue Dec 13 06:03:50 2011
From: b.m.forde at umail.ucc.ie (BForde)
Date: Tue, 13 Dec 2011 03:03:50 -0800 (PST)
Subject: [Bioperl-l] Genbank files
In-Reply-To: <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
References: <32941955.post@talk.nabble.com>
	<5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
Message-ID: <32965574.post@talk.nabble.com>


Thank you for the replies. 

My script (below) reads in a list of locus_tags from a tab-delimited text
file, compares these locus_tags to the locus_tags in a GenBank file, and
where they are equal adds new features.
In the line
$feat->add_tag_value()
the variable $feat needs to be defined. In the BioPerl wiki this variable
appears to be defined by giving it coordinates etc. (creating a new feature).
I wish to add features to the CDS key when the locus_tags are identical. Is
this possible?

use strict; 
use Bio::SeqIO; 

my @V; 
open (LIST1, 'list') ||die; 
while (<LIST1>){ 
    push @V, (split(/\t/, $_))[0]; 
} 
close(LIST1); 

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb"); 
my $seq_object = $seqio_object->next_seq; 

for my $feat_object ($seq_object->get_SeqFeatures){ 
    if ($feat_object->primary_tag eq "CDS"){ 
        if ($feat_object->has_tag('locus_tag')){ 
            for my $V3 ($feat_object->get_tag_values('locus_tag')){ 
                for my $V1 (@V) { 
                    if ($V1 eq $V3){ 
                        # ADD NEW FEATURES
                        
                    }     
                } 
            } 
        } 
    } 
} 
  
The script works down as far as the comparison point where locus_tags in the
genbankfile "Contig100.gb" are compared against a list of locus_tags from a
delimited txt file. 


regards 

Brian 

Jason Stajich-5 wrote:
> 
> $feature->add_tag_value('color','blue');
> 
> On Dec 9, 2011, at 8:52 AM, BForde wrote:
> 
>> 
>> Hello all,
>> 
>> I am new to Bioperl so I apologise if this is stupid question. 
>> 
>> For CDS features I which to add additional qualifiers e.g. /colour and
>> /note
>> qualifiers. I have looked at the BioPerl wiki but am still unsure as how
>> to
>> do this?
>> 
>> regards
>> 
>> Brian
>> -- 
>> View this message in context:
>> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>> 
> 
> Jason Stajich
> jason.stajich at gmail.com
> jason at bioperl.org
> 
> 
> 
> 

-- 
View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32965574.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From roy.chaudhuri at gmail.com  Tue Dec 13 06:52:05 2011
From: roy.chaudhuri at gmail.com (Roy Chaudhuri)
Date: Tue, 13 Dec 2011 11:52:05 +0000
Subject: [Bioperl-l] Genbank files
In-Reply-To: <32965574.post@talk.nabble.com>
References: <32941955.post@talk.nabble.com>
	<5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
	<32965574.post@talk.nabble.com>
Message-ID: <4EE73C65.1080101@gmail.com>

Hi Brian,

Just to check I have understood you, you want to read through a genbank 
file and add additional tags to features which are listed in a 
tab-delimited file of locus tags?

Your code is on the right lines, but it would be much more efficient to 
read your tab-delimited locus_tags into a hash, and check using exists, 
rather than ploughing through the (potentially very long) list of locus 
tags every time. Also, be careful with new lines in your tab file (you 
can safely get rid of them using "chomp"). You can miss out the 
"has_tag" check by using "get_tagset_values" instead of 
"get_tag_values", since the former does not complain if the tag is not 
present. Once you have modified your sequence object, you need to write 
it out to a new file (or STDOUT) using Bio::SeqIO.

Also, just a couple of general points, you should always "use warnings" 
(or even better "use warnings FATAL=>qw(all)") since that can help solve 
many problems, and your code may be easier to read if you don't include 
the word "object" in all your variable names (after all you wouldn't say 
you write on a paper object using a pen object).

use strict;
use warnings FATAL=>qw(all);
use Bio::SeqIO;

# Read the first column of the tab-delimited file into a hash for fast lookup.
open (my $list, 'list') or die $!;
my %V;
while (<$list>){
    chomp;
    $V{(split(/\t/, $_))[0]}=1;
}

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;

# Detach every feature, tag the CDS features that are listed, then re-attach.
for my $feat_object ($seq_object->remove_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        for my $V3 ($feat_object->get_tagset_values('locus_tag')){
            if (exists $V{$V3}){
                $feat_object->add_tag_value(listed_in_tab_file=>'yes');
                next;
            }
        }
    }
    $seq_object->add_SeqFeature($feat_object);
}

# No -file is given here, so the modified record goes to STDOUT.
Bio::SeqIO->new(-format=>'genbank')->write_seq($seq_object);
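
The point about "chomp" and the hash lookup can be tried in isolation, without BioPerl. This is only a sketch: the helper name first_column_lookup and the ABC_000x locus tags are invented for illustration, and the helper takes lines rather than a filehandle so that it runs without an input file.

```perl
use strict;
use warnings;

# Hypothetical helper: build a lookup hash from the first tab-delimited
# column of each line.
sub first_column_lookup {
    my @lines = @_;
    my %seen;
    for my $line (@lines) {
        chomp $line;    # without this, a one-column line keeps its "\n"
        next if $line eq '';
        my ($tag) = split /\t/, $line;
        $seen{$tag} = 1;
    }
    return %seen;
}

my %wanted = first_column_lookup("ABC_0001\tsome note\n", "ABC_0002\n");

# Each check is a single hash lookup, however long the list is.
for my $tag (qw(ABC_0001 ABC_0002 ABC_0003)) {
    print "$tag\t", (exists $wanted{$tag} ? 'listed' : 'not listed'), "\n";
}
```

Without the chomp, the second entry would be stored as "ABC_0002\n" and would never match the locus_tag read from the GenBank file.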

Hope this helps.
Cheers,
Roy.

On 13/12/2011 11:03, BForde wrote:
>
> Than you for the replies.
>
> My script (below) reads in a list of locus_tags from a tab delimited text
> file. Compares these locus_tags to the locus_tags in  a genbank file and
> where they are equal adds new features.
> the line
> $feat->add_tag_value()
> needs to be defined. In the bioperl wiki this variable appears to be defined
> by giving it coordinates etc (creating a new feature). I wish to add
> features to CDS key when the locus_tags are identical. Is this possible?
>
> use strict;
> use Bio::SeqIO;
>
> my @V;
> open (LIST1, 'list') ||die;
> while (<LIST1>){
>      push @V, (split(/\t/, $_))[0];
> }
> close(LIST1);
>
> my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
> my $seq_object = $seqio_object->next_seq;
>
> for my $feat_object ($seq_object->get_SeqFeatures){
>      if ($feat_object->primary_tag eq "CDS"){
>          if ($feat_object->has_tag('locus_tag')){
>              for my $V3 ($feat_object->get_tag_values('locus_tag')){
>                  for my $V1 (@V) {
>                      if ($V1 eq $V3){
>                          ADD NEW FEATURES
>
>                      }
>                  }
>              }
>          }
>      }
> }
>
> The script works down as far as the comparison point where locus_tags in the
> genbankfile "Contig100.gb" are compared against a list of locus_tags from a
> delimited txt file.
>
>
> regards
>
> Brian
>
> Jason Stajich-5 wrote:
>>
>> $feature->add_tag_value('color','blue');
>>
>> On Dec 9, 2011, at 8:52 AM, BForde wrote:
>>
>>>
>>> Hello all,
>>>
>>> I am new to Bioperl so I apologise if this is stupid question.
>>>
>>> For CDS features I which to add additional qualifiers e.g. /colour and
>>> /note
>>> qualifiers. I have looked at the BioPerl wiki but am still unsure as how
>>> to
>>> do this?
>>>
>>> regards
>>>
>>> Brian
>>> --
>>> View this message in context:
>>> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
>>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>>>
>>
>> Jason Stajich
>> jason.stajich at gmail.com
>> jason at bioperl.org
>>
>>
>>
>>
>


From b.m.forde at umail.ucc.ie  Tue Dec 13 09:22:01 2011
From: b.m.forde at umail.ucc.ie (Brian Forde)
Date: Tue, 13 Dec 2011 14:22:01 +0000
Subject: [Bioperl-l] Genbank files
In-Reply-To: <4EE73C65.1080101@gmail.com>
References: <32941955.post@talk.nabble.com>
	<5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
	<32965574.post@talk.nabble.com> <4EE73C65.1080101@gmail.com>
Message-ID: <CAJLmuD+0Ts_5hPLL6T2vToY8+oW+PxXHaBiGGKoLXZZoGiBptg@mail.gmail.com>

Hi Roy,

Thank you. That works perfectly. I have to confess that someone else told
me to use hashes but I could not get them to work. Thanks again.

regards

Brian

On Tue, Dec 13, 2011 at 11:52 AM, Roy Chaudhuri <roy.chaudhuri at gmail.com> wrote:

> Hi Brian,
>
> Just to check I have understood you, you want to read through a genbank
> file and add additional tags to features which are listed in a
> tab-delimited file of locus tags?
>
> Your code is on the right lines, but it would be much more efficient to
> read your tab-delimited locus_tags into a hash, and check using exists,
> rather than ploughing through the (potentially very long) list of locus
> tags every time. Also, be careful with new lines in your tab file (you can
> safely get rid of them using "chomp"). You can miss out the "has_tag" check
> by using "get_tagset_values" instead of "get_tag_values", since the former
> does not complain if the tag is not present. Once you have modified your
> sequence object, you need to write it out to a new file (or STDOUT) using
> Bio::SeqIO.
>
> Also, just a couple of general points, you should always "use warnings"
> (or even better "use warnings FATAL=>qw(all)") since that can help solve
> many problems, and your code may be easier to read if you don't include the
> word "object" in all your variable names (after all you wouldn't say you
> write on a paper object using a pen object).
>
> use strict;
> use warnings FATAL=>qw(all);
> use Bio::SeqIO;
> open (my $list, 'list') or die $!;
> my %V;
> while (<$list>){
>    chomp;
>    $V{(split(/\t/, $_))[0]}=1;
>
> }
> my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
> my $seq_object = $seqio_object->next_seq;
> for my $feat_object ($seq_object->remove_SeqFeatures){
>
>    if ($feat_object->primary_tag eq "CDS"){
>        for my $V3 ($feat_object->get_tagset_values('locus_tag')){
>            if (exists $V{$V3}){
>                $feat_object->add_tag_value(listed_in_tab_file=>'yes');
>                next;
>            }
>        }
>    }
>    $seq_object->add_SeqFeature($feat_object);
> }
> Bio::SeqIO->new(-format=>'genbank')->write_seq($seq_object);
>
> Hope this helps.
> Cheers,
> Roy.
>
>
> On 13/12/2011 11:03, BForde wrote:
>
>>
>> Than you for the replies.
>>
>> My script (below) reads in a list of locus_tags from a tab delimited text
>> file. Compares these locus_tags to the locus_tags in  a genbank file and
>> where they are equal adds new features.
>> the line
>> $feat->add_tag_value()
>> needs to be defined. In the bioperl wiki this variable appears to be
>> defined
>> by giving it coordinates etc (creating a new feature). I wish to add
>> features to CDS key when the locus_tags are identical. Is this possible?
>>
>> use strict;
>> use Bio::SeqIO;
>>
>> my @V;
>> open (LIST1, 'list') ||die;
>> while (<LIST1>){
>>     push @V, (split(/\t/, $_))[0];
>> }
>> close(LIST1);
>>
>> my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
>> my $seq_object = $seqio_object->next_seq;
>>
>> for my $feat_object ($seq_object->get_SeqFeatures){
>>     if ($feat_object->primary_tag eq "CDS"){
>>         if ($feat_object->has_tag('locus_tag')){
>>             for my $V3 ($feat_object->get_tag_values('locus_tag')){
>>                 for my $V1 (@V) {
>>                     if ($V1 eq $V3){
>>                         ADD NEW FEATURES
>>
>>                     }
>>                 }
>>             }
>>         }
>>     }
>> }
>>
>> The script works down as far as the comparison point where locus_tags in
>> the
>> genbankfile "Contig100.gb" are compared against a list of locus_tags from
>> a
>> delimited txt file.
>>
>>
>> regards
>>
>> Brian
>>
>> Jason Stajich-5 wrote:
>>
>>>
>>> $feature->add_tag_value('color','blue');
>>>
>>> On Dec 9, 2011, at 8:52 AM, BForde wrote:
>>>
>>>
>>>> Hello all,
>>>>
>>>> I am new to Bioperl so I apologise if this is stupid question.
>>>>
>>>> For CDS features I which to add additional qualifiers e.g. /colour and
>>>> /note
>>>> qualifiers. I have looked at the BioPerl wiki but am still unsure as how
>>>> to
>>>> do this?
>>>>
>>>> regards
>>>>
>>>> Brian
>>>> --
>>>> View this message in context:
>>>> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
>>>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>>>>
>>>>
>>>
>>> Jason Stajich
>>> jason.stajich at gmail.com
>>> jason at bioperl.org
>>>
>>>
>>>
>>>
>>>
>>
>


-- 
Brian Forde
Microbiology Dept.
Bioscience Institute. Room 4.11
University College Cork
Cork
Ireland
tel:+353 21 4901306
email: b.m.forde at umail.ucc.ie


From b.m.forde at umail.ucc.ie  Mon Dec 12 12:20:53 2011
From: b.m.forde at umail.ucc.ie (BForde)
Date: Mon, 12 Dec 2011 09:20:53 -0800 (PST)
Subject: [Bioperl-l] Genbank files
In-Reply-To: <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
References: <32941955.post@talk.nabble.com>
	<5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
Message-ID: <32959999.post@talk.nabble.com>


Thank you for the replies.

I am unsure as to how to use the line below with my script. My script so far
reads

use strict;
use Bio::SeqIO;

my @V;
open (LIST1, 'list') ||die;
while (<LIST1>){
    push @V, (split(/\t/, $_))[0];
}
close(LIST1);

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;

for my $feat_object ($seq_object->get_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        if ($feat_object->has_tag('locus_tag')){
            for my $V3 ($feat_object->get_tag_values('locus_tag')){
                for my $V1 (@V) {
                    if ($V1 eq $V3){
                        # ADD NEW FEATURES
                        
                    }    
                }
            }
        }
    }
}
 
The script works down as far as the comparison point where locus_tags in the
genbankfile "Contig100.gb" are compared against a list of locus_tags from a
delimited txt file.
If possible, could you show me how to amend my script so I can add new
features?

regards

Brian

Jason Stajich-5 wrote:
> 
> $feature->add_tag_value('color','blue');
> 
> On Dec 9, 2011, at 8:52 AM, BForde wrote:
> 
>> 
>> Hello all,
>> 
>> I am new to Bioperl so I apologise if this is stupid question. 
>> 
>> For CDS features I which to add additional qualifiers e.g. /colour and
>> /note
>> qualifiers. I have looked at the BioPerl wiki but am still unsure as how
>> to
>> do this?
>> 
>> regards
>> 
>> Brian
>> -- 
>> View this message in context:
>> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>> 
> 
> Jason Stajich
> jason.stajich at gmail.com
> jason at bioperl.org
> 
> 
> 
> 

-- 
View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32959999.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From Russell.Smithies at agresearch.co.nz  Tue Dec 13 22:17:02 2011
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Wed, 14 Dec 2011 16:17:02 +1300
Subject: [Bioperl-l] Genbank files
In-Reply-To: <32959999.post@talk.nabble.com>
References: <32941955.post@talk.nabble.com>
	<5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
	<32959999.post@talk.nabble.com>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF340186CF27A@exchsth.agresearch.co.nz>

Something like this:

use strict;
use warnings;
use Bio::SeqIO;

# Read the locus_tags (first tab-delimited column) from the list file.
my @V;
open (LIST1, 'list') ||die;
while (<LIST1>){
    chomp;    # drop the newline so single-column lines still match
    push @V, (split(/\t/, $_))[0];
}
close(LIST1);

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;

for my $feat_object ($seq_object->get_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        if ($feat_object->has_tag('locus_tag')){
            for my $V3 ($feat_object->get_tag_values('locus_tag')){
                for my $V1 (@V) {
                    if ($V1 eq $V3){
                        #ADD NEW FEATURES
                        $feat_object->add_tag_value('color','blue');
                    }
                }
            }
        }
    }
}
#write the new annotations
my $io = Bio::SeqIO->new(-format => "genbank", -file => ">new.gb" );
$io->write_seq($seq_object);

Take another look at http://www.bioperl.org/wiki/HOWTO:Feature-Annotation#Building_Your_Own_Sequences

--Russell


> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of BForde
> Sent: Tuesday, 13 December 2011 6:21 a.m.
> To: Bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Genbank files
> 
> 
> Than you for the replies.
> 
> I am unsure as to how to use the line below with my script. My script so far
> reads
> 
> use strict;
> use Bio::SeqIO;
> 
> my @V;
> open (LIST1, 'list') ||die;
> while (<LIST1>){
>     push @V, (split(/\t/, $_))[0];
> }
> close(LIST1);
> 
> my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
> my $seq_object = $seqio_object->next_seq;
> 
> for my $feat_object ($seq_object->get_SeqFeatures){
>     if ($feat_object->primary_tag eq "CDS"){
>         if ($feat_object->has_tag('locus_tag')){
>             for my $V3 ($feat_object->get_tag_values('locus_tag')){
>                 for my $V1 (@V) {
>                     if ($V1 eq $V3){
>                         ADD NEW FEATURES
> 
>                     }
>                 }
>             }
>         }
>     }
> }
> 
> The script works down as far as the comparison point where locus_tags in the
> genbankfile "Contig100.gb" are compared against a list of locus_tags from a
> delimited txt file.
> I possbile could you show me how to amend my script so I can add new
> features
> 
> regards
> 
> Brian
> 
> Jason Stajich-5 wrote:
> >
> > $feature->add_tag_value('color','blue');
> >
> > On Dec 9, 2011, at 8:52 AM, BForde wrote:
> >
> >>
> >> Hello all,
> >>
> >> I am new to Bioperl so I apologise if this is stupid question.
> >>
> >> For CDS features I which to add additional qualifiers e.g. /colour
> >> and /note qualifiers. I have looked at the BioPerl wiki but am still
> >> unsure as how to do this?
> >>
> >> regards
> >>
> >> Brian
> >> --
> >> View this message in context:
> >> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
> >> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
> >>
> >
> > Jason Stajich
> > jason.stajich at gmail.com
> > jason at bioperl.org
> >
> >
> >
> >
> 


From l.m.timmermans at students.uu.nl  Wed Dec 14 10:43:24 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Wed, 14 Dec 2011 16:43:24 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
Message-ID: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>

Hi all,

As already mentioned on IRC, I recently wrote a SFF parser and uploaded it
to CPAN as Bio::SFF. I haven't written a Bio::SeqIO::sff wrapper yet (SFF
entries don't map 1:1 with Bio::Seq objects), but if anyone has the time
to write one I'd be most grateful.

Leon


From p.j.a.cock at googlemail.com  Wed Dec 14 11:03:05 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 14 Dec 2011 16:03:05 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
Message-ID: <CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>

On Wed, Dec 14, 2011 at 3:43 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> Hi all,
>
> As already mentioned on IRC, I recently wrote a SFF parser and uploaded it
> to CPAN as Bio::SFF. I haven't written a Bio::SeqIO::sff wrapper yet (SFF
> entries don't map 1:1 with Bio::Seq objects), but if anyone has the time
> to write one I'd be most grateful.
>
> Leon

Hi Leon,

Have you looked at the index block at all, in order to offer random
access by read ID, or to access the Roche XML manifest? Please
ask if you need more information about this - or if you can read Python:
https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py

Is this building on Miguel Pignatelli's work? I don't recall seeing
any follow up posts from him after this one:
http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html

Peter


From cjfields at illinois.edu  Wed Dec 14 11:12:58 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Wed, 14 Dec 2011 16:12:58 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>,
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
Message-ID: <3BBC1132-E768-45D9-A107-ACD51791722D@illinois.edu>

Leon, 

Nice!  Definitely a good idea to keep the lower-level parser and the BioPerl-bridging code separate. One of my concerns with the various parsers we have right now is that they hard-wire BioPerl classes into the parser, which makes optimization hard.

Chris

PS- Peter, I don't think the two projects are related, but I suppose Leon is the best person to answer that.

Sent from my stupid iPad, now my laptop's on the fritz



From l.m.timmermans at students.uu.nl  Wed Dec 14 11:27:58 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Wed, 14 Dec 2011 17:27:58 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
Message-ID: <CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>

On Wed, Dec 14, 2011 at 5:03 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> Hi Leon,
>
> Have you looked at the index block at all, in order to offer random
> access by read ID, or to access the Roche XML manifest? Please
> ask if you need more information about this - or if you can read Python:
> https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py
>

I have looked at it, but not implemented it yet. There is no standardized
index, and the ones that are in common use either seem stupid (the Roche
index, which is essentially just a weirdly formatted sequential list,
though that should still be faster than a table scan) or undocumented (hash
based index).

> Is this building on Miguel Pignatelli's work? I don't recall seeing
> any follow up posts from him after this one:
> http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html
>

It isn't. I like his idea for reusing BioPython's test files though.

Leon


From p.j.a.cock at googlemail.com  Wed Dec 14 11:44:28 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 14 Dec 2011 16:44:28 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
Message-ID: <CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>

On Wed, Dec 14, 2011 at 4:27 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> On Wed, Dec 14, 2011 at 5:03 PM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
>>
>> Hi Leon,
>>
>> Have you looked at the index block at all, in order to offer random
>> access by read ID, or to access the Roche XML manifest? Please
>> ask if you need more information about this - or if you can read Python:
>> https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py
>
> I have looked at it, but not implemented it yet. There is no standardized
> index, and the ones that are in common use either seem stupid (the Roche
> index, which is essentially just a weirdly formatted sequential list, though
> that should still be faster than a table scan) or undocumented (hash based
> index).

There are two widely used indexes, both from Roche (one with and
one without an XML manifest, magic bytes .mft and .srt). They are
both just a simple table of the read names and offsets, sorted
alphabetically. This works pretty well for rapid lookup for SFF files
(because the read count is not so high), and is pretty easy.
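
[Illustrative sketch, not part of the original message: a sorted table of
read names and offsets allows lookup by binary search. The index entries
below are made up for the example and do not reflect the real on-disk
layout of the Roche index; plain string comparison (cmp) is assumed as the
sort order.]

```perl
use strict;
use warnings;

# Hypothetical index: [read name, byte offset] pairs, sorted by name.
my @index = (
    [ 'READ_A', 840 ],
    [ 'READ_C', 2320 ],
    [ 'READ_F', 3976 ],
    [ 'READ_K', 5440 ],
);

# Return the offset stored for $name, or undef if the name is absent.
sub find_offset {
    my ($index, $name) = @_;
    my ($lo, $hi) = (0, $#{$index});
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $cmp = $name cmp $index->[$mid][0];
        return $index->[$mid][1] if $cmp == 0;
        if ($cmp < 0) { $hi = $mid - 1 } else { $lo = $mid + 1 }
    }
    return;
}

my $offset = find_offset(\@index, 'READ_F');
print defined $offset ? "READ_F at byte $offset\n" : "READ_F not found\n";
```

[Each lookup touches only about log2(N) entries, which is why a sorted
table remains usable even when it is left on disk.]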

I don't think anyone used the hash table style indexes (.hsh), which
I assume was a proof of principle or trial in the early days of SFF.

One thing to check is what Ion Torrent's SFF files use. I would
guess they've followed Roche, but I don't know. After all, the
index structure is not defined in the SFF specification - it was
left extensible on purpose.

>> Is this building on Miguel Pignatelli's work? I don't recall seeing
>> any follow up posts from him after this one:
>> http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html
>
> It isn't. I like his idea for reusing BioPython's test files though.

Yes, please do.

Peter


From gingerplum at gmail.com  Wed Dec 14 00:18:55 2011
From: gingerplum at gmail.com (plum ginger)
Date: Tue, 13 Dec 2011 21:18:55 -0800 (PST)
Subject: [Bioperl-l] a problem about BLAST
Message-ID: <447aac17-935f-4f3e-89d8-bd0b6edaf04f@r6g2000yqr.googlegroups.com>

Hi. I'm trying to run BLAST with Bio::Tools::Run::StandAloneBlast. I
need to run BLAST on more than one sequence. However, the BLAST outfile
only stores the result for the last sequence. How can I make the outfile
store all results?

I would appreciate your help. Thanks very much!


Best regards


From jason.stajich at gmail.com  Thu Dec 15 12:02:47 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Thu, 15 Dec 2011 11:02:47 -0600
Subject: [Bioperl-l] a problem about BLAST
In-Reply-To: <447aac17-935f-4f3e-89d8-bd0b6edaf04f@r6g2000yqr.googlegroups.com>
References: <447aac17-935f-4f3e-89d8-bd0b6edaf04f@r6g2000yqr.googlegroups.com>
Message-ID: <58E5B487-7FF0-4018-B109-D6595DC2E493@gmail.com>

You are probably setting the outfile in each parsing iteration, so each run overwrites the previous result. You need to show your code if you want someone to help you debug the problem.
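
[Illustrative sketch, not part of the original message. The poster's code
was never shown, so this only demonstrates the general pattern in plain
Perl, without calling StandAloneBlast: an output file opened once before
the loop keeps every result, whereas re-opening it with '>' on each pass
would truncate it. The query names and result strings are placeholders.]

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my @queries = qw(seqA seqB seqC);                 # stand-ins for query sequences
my (undef, $outfile) = tempfile(UNLINK => 1);     # hypothetical output path

# Open the output once, before the loop. Opening it with '>' inside the
# loop would truncate the file each time and leave only the last result.
open my $out, '>', $outfile or die "Cannot write $outfile: $!";
for my $query (@queries) {
    my $result = "result for $query";             # placeholder for one report
    print {$out} "$result\n";
}
close $out or die "Cannot close $outfile: $!";

# Read the file back to confirm that all results were kept.
open my $in, '<', $outfile or die "Cannot read $outfile: $!";
my @lines = <$in>;
close $in;
print scalar(@lines), " results kept\n";
```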

On Dec 13, 2011, at 11:18 PM, plum ginger wrote:

> Hi. I'm trying to run BLAST with Bio::Tools::Run::StandAloneBlast. I
> need to run BLAST on more than one sequence. However, the BLAST outfile
> only stores the result for the last sequence. How can I make the outfile
> store all results?
> 
> I would appreciate your help. Thanks very much!
> 
> 
> Best regards

Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org


From pengyu.ut at gmail.com  Fri Dec 16 17:10:27 2011
From: pengyu.ut at gmail.com (Peng Yu)
Date: Fri, 16 Dec 2011 16:10:27 -0600
Subject: [Bioperl-l] How to stop rather than emit warnings with
	Bio::Das::segment?
Message-ID: <CABrM6w=tiMNRJjZtvK-8Ot-TE3WuZgvtod6q-3g=UAWk8K_WRg@mail.gmail.com>

Hi,

Bio::Das::segment can give me the following warning without stopping
the whole program when the position for the query doesn't exist. I
could test the returned result and quit when it is []. But this would
mean my program needs a test every time I call segment. I'm wondering
if there is an automatic way to make Bio::Das::segment stop in such
cases.

--------------------- WARNING ---------------------
MSG: Sequence is not dna or rna, but []. Attempting to revcom, but
unsure if this is right
---------------------------------------------------


-- 
Regards,
Peng


From cjfields at illinois.edu  Fri Dec 16 21:48:07 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sat, 17 Dec 2011 02:48:07 +0000
Subject: [Bioperl-l] How to stop rather than emit warnings
	with	Bio::Das::segment?
In-Reply-To: <CABrM6w=tiMNRJjZtvK-8Ot-TE3WuZgvtod6q-3g=UAWk8K_WRg@mail.gmail.com>
References: <CABrM6w=tiMNRJjZtvK-8Ot-TE3WuZgvtod6q-3g=UAWk8K_WRg@mail.gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF0881CF8E@CITESMBX4.ad.uillinois.edu>

Setting verbosity to 2 should convert warnings to exceptions.   

IIRC, set '-verbose => 2' in the Bio::Das constructor, set '$das->verbose(2)' explicitly, or set the env variable BIOPERLDEBUG=2.  
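
[Illustrative sketch, not part of the original message: how verbosity 2
turns a BioPerl warning into an exception that eval can catch. A bare
Bio::Root::Root object stands in for the Bio::Das object here; whether
the Bio::Das constructor passes -verbose through was only recalled from
memory above and is not checked by this example.]

```perl
use strict;
use warnings;
use Bio::Root::Root;

# Objects derived from Bio::Root::Root carry a verbosity level.
# This bare root object is a stand-in for the Bio::Das object.
my $obj = Bio::Root::Root->new(-verbose => 2);

# At verbosity 2, warn() throws instead of printing, so eval sees a failure.
my $ok = eval { $obj->warn("Sequence is not dna or rna"); 1 };
print "warning was raised as an exception\n" unless $ok;

# The level can also be changed on an existing object.
$obj->verbose(0);
```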

chris



From anna.fr at gmail.com  Mon Dec 19 02:09:15 2011
From: anna.fr at gmail.com (Anna Friedlander)
Date: Mon, 19 Dec 2011 20:09:15 +1300
Subject: [Bioperl-l] StandAloneBlastPlus blastdbcmd question
Message-ID: <CALv2E+1Yvt1OhcTE_YXqho+zYZhPjihhCFupybArxMjLfD1S_g@mail.gmail.com>

Hi all

I have a question about using blastdbcmd via
Bio::Tools::Run::StandAloneBlastPlus

I have some Blast+ search results that I am manipulating in a perl
programme, and I would like to retrieve some sequence information for
some results using subject sequence IDs, and associated subject start
and end indices. If I was using blastdbcmd directly, I would do so
using the -entry and -range options.

My question is, can I use all the blastdbcmd options (or more
specifically, just the -entry and -range options) from within the
StandAloneBlastPlus module?

My apologies if I don't properly understand how this "wrapper" works!

Thanks in advance for your help
Anna Friedlander


From l.m.timmermans at students.uu.nl  Mon Dec 19 09:19:14 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Mon, 19 Dec 2011 15:19:14 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
Message-ID: <CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>

On Wed, Dec 14, 2011 at 5:44 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> There are two widely used indexes, both from Roche (one with and
> one without an XML manifest, magic bytes .mft and .srt). They are
> both just a simple table of the read names and offsets, sorted
> alphabetically.


Yeah, that's what I got from the BioPython code. I didn't know it was
sorted though (it doesn't make much sense either, unless they wanted to do
a binary search or something).

> This works pretty well for rapid lookup for SFF files
> (because the read count is not so high), and is pretty easy.
>

It's implemented in Bio::SFF 0.003. I did restructure my code into two
readers though, since doing sequential and random access in the same class
didn't make much sense code-wise.

> I don't think anyone used the hash table style indexes (.hsh), which
> I assume was a proof of principle or trial in the early days of SFF.
>

I see, too bad.


> One thing to check is what Ion Torrent's SFF files use. I would
> guess they've followed Roche, but I don't know. After all, the
> index structure is not defined in the SFF specification - it was
> left extensible on purpose.
>

Yeah, we should check that too.

> Yes, please do.
>

It's added to 0.003. The lack of tests was bothering me, but the SFFs I had
at hand were not suitable.

Leon


From p.j.a.cock at googlemail.com  Mon Dec 19 09:31:18 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 19 Dec 2011 14:31:18 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
Message-ID: <CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>

On Mon, Dec 19, 2011 at 2:19 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> On Wed, Dec 14, 2011 at 5:44 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
>> There are two widely used indexes, both from Roche (one with and
>> one without an XML manifest, magic bytes .mft and .srt). They are
>> both just a simple table of the read names and offsets, sorted
>> alphabetically.
>
> Yeah, that's what I got from the BioPython code. I didn't know it
> was sorted though (it doesn't make much sense either, unless they
> wanted to do a binary search or something).

I presume that's what Roche uses if they keep the index on disk.

The alternative is to load the index into RAM, which is really fast.
You just open the SFF, read the header, seek to the index, load
the index. Without the index, you have to scan the entire SFF file
to find each record and its offset - which is much slower.
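
[Illustrative sketch, not part of the original message: once the index
has been read, keeping it in memory as a hash makes each lookup a single
hash access. The entries below are made up; decoding the real index block
from an SFF file is not shown.]

```perl
use strict;
use warnings;

# Hypothetical decoded index entries (read name, byte offset), as they
# might look after seeking to the index block and unpacking it.
my @entries = ( [ 'READ_A', 840 ], [ 'READ_C', 2320 ], [ 'READ_F', 3976 ] );

# Build the in-memory lookup once; later queries need no file scan.
my %offset_of = map { $_->[0] => $_->[1] } @entries;

for my $name (qw(READ_C READ_X)) {
    if (exists $offset_of{$name}) {
        print "$name: seek to byte $offset_of{$name}\n";
    }
    else {
        print "$name: not in index\n";
    }
}
```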

>> This works pretty well for rapid lookup for SFF files
>> (because the read count is not so high), and is pretty easy.
>
> It's implemented in Bio::SFF 0.003. I did restructure my code into two
> readers though, since doing sequential and random access in the same class
> didn't make much sense code-wise.
>
>> I don't think anyone used the hash table style indexes (.hsh), which
>> I assume was a proof of principle or trial in the early days of SFF.
>
> I see, too bad.
>
>> One thing to check is what Ion Torrent's SFF files use. I would
>> guess they've followed Roche, but I don't know. After all, the
>> index structure is not defined in the SFF specification - it was
>> left extensible on purpose.
>
> Yeah, we should check that too.

I don't have any Ion Torrent data first hand, and the public
samples I've seen were FASTQ not SFF. But I know a few
people with Ion Torrent machines that might be able to help...

> It's added to 0.003. The lack of tests was bothering me, but the
> SFFs I had at hand were not suitable.

Have you looked at the sample SFF data in Biopython? Please
use them for the BioPerl unit tests (we've been talking about a
cross project collection of test data files like this), the README
file should be self-explanatory:
https://github.com/biopython/biopython/tree/master/Tests/Roche

Peter


From p.j.a.cock at googlemail.com  Mon Dec 19 10:13:53 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 19 Dec 2011 15:13:53 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <4FCCB636-3CC8-40E1-A6DB-55DD8F5F3CBD@sgul.ac.uk>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<4FCCB636-3CC8-40E1-A6DB-55DD8F5F3CBD@sgul.ac.uk>
Message-ID: <CAKVJ-_4U_Yt5A8f4QLxb-SzT8Y7n-2kRvGH=g9n+NfqAFegxgA@mail.gmail.com>

On Mon, Dec 19, 2011 at 3:03 PM, Adam Witney <awitney at sgul.ac.uk> wrote:
>> I don't have any Ion Torrent data first hand, and the public
>> samples I've seen were FASTQ not SFF. But I know a few
>> people with Ion Torrent machines that might be able to help...
>
> I can let you have some Ion Torrent SFF files if it helps
>
> adam

Hi Adam,

I've just had a quick look at a file from an IonTorrent 314 chip
that a colleague kindly sent me, and that SFF file had no index
(but only 50k reads so this isn't so important).

If you can send me (and Leon?) one or two original SFF files that
would be useful, even if just to confirm that Ion Torrent's SFF files
do indeed typically lack an index. If that is the case, I may need to
remove the warning message Biopython currently prints when
indexing these files: No SFF index, doing it the slow way

Off list is fine if you'd like to keep the data private, use dropbox
or something if you don't have an FTP server.

Thanks,

Peter


From awitney at sgul.ac.uk  Mon Dec 19 10:03:16 2011
From: awitney at sgul.ac.uk (Adam Witney)
Date: Mon, 19 Dec 2011 15:03:16 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
Message-ID: <4FCCB636-3CC8-40E1-A6DB-55DD8F5F3CBD@sgul.ac.uk>

>>> One thing to check is what Ion Torrent's SFF files use. I would
>>> guess they've followed Roche, but I don't know. After all, the
>>> index structure is not defined in the SFF specification - it was
>>> left extensible on purpose.
>> 
>> Yeah, we should check that too.
> 
> I don't have any Ion Torrent data first hand, and the public
> samples I've seen were FASTQ not SFF. But I know a few
> people with Ion Torrent machines that might be able to help...

I can let you have some Ion Torrent SFF files if it helps.

adam


From l.m.timmermans at students.uu.nl  Mon Dec 19 10:48:34 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Mon, 19 Dec 2011 16:48:34 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
Message-ID: <CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>

On Mon, Dec 19, 2011 at 3:31 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> I presume that's what Roche uses if they keep the index on disk.
>
> The alternative is to load the index into RAM, which is really fast.
> You just open the SFF, read the header, seek to the index, load
> the index. Without the index, you have to scan the entire SFF file
> to find each record and its offset - which is much slower.
>

That's what I'm doing now. It's much faster, but it still takes a
noticeable amount of time on large files.

> Have you looked at the sample SFF data in Biopython? Please
> use them for the BioPerl unit tests (we've been talking about a
> cross project collection of test data files like this), the README
> file should be self-explanatory:
> https://github.com/biopython/biopython/tree/master/Tests/Roche
>

Yeah, I'm using those now (
https://github.com/Leont/bio-sff/blob/master/t/reader.t). I must say there
were some interesting corner cases in it.

Leon


From p.j.a.cock at googlemail.com  Mon Dec 19 11:15:15 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 19 Dec 2011 16:15:15 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
Message-ID: <CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>

On Mon, Dec 19, 2011 at 3:48 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> On Mon, Dec 19, 2011 at 3:31 PM, Peter Cock wrote:
>
>> Have you looked at the sample SFF data in Biopython? Please
>> use them for the BioPerl unit tests (we've been talking about a
>> cross project collection of test data files like this), the README
>> file should be self-explanatory:
>> https://github.com/biopython/biopython/tree/master/Tests/Roche
>
> Yeah, I'm using those now
> (https://github.com/Leont/bio-sff/blob/master/t/reader.t).

Could you add a link to your /corpus/README.txt file pointing
back to the Biopython original for acknowledgement and
future reference?

>
> I must say there were some interesting corner cases in it.
>

I'm glad you agree - and if you can think of any more special
cases to verify that would be great.

Are you doing just SFF parsing for now? Not writing?

Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
format name "sff" to mean the full read sequence (with mixed
case, upper case for the good sequence, lower case for any
left/right clipping - as in the Roche tools), and "sff-trim" to mean
the trimmed sequences. I would encourage you to do the
same, as part of the general aim of having consistent
sequence format names between BioPerl, Biopython, and
EMBOSS, where possible.
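
[Illustrative sketch, not part of the original message: the difference
between the "sff" and "sff-trim" conventions described above, written in
plain Perl. The function names and the 1-based, inclusive clip arguments
are assumptions made for this example, not the Bio::SFF or Biopython API.]

```perl
use strict;
use warnings;

# $left and $right are the 1-based, inclusive bounds of the good region
# (hypothetical arguments for this sketch).

# "sff" style: the full read, good region upper case, clipped ends lower case.
sub sff_full {
    my ($bases, $left, $right) = @_;
    my $head = lc(substr($bases, 0, $left - 1));
    my $good = uc(substr($bases, $left - 1, $right - $left + 1));
    my $tail = lc(substr($bases, $right));
    return $head . $good . $tail;
}

# "sff-trim" style: only the good region, upper case.
sub sff_trim {
    my ($bases, $left, $right) = @_;
    return uc(substr($bases, $left - 1, $right - $left + 1));
}

my $read = 'TCAGGGTACCAGTTNN';
print sff_full($read, 5, 14), "\n";    # prints tcagGGTACCAGTTnn
print sff_trim($read, 5, 14), "\n";    # prints GGTACCAGTT
```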

Peter


From l.m.timmermans at students.uu.nl  Mon Dec 19 11:47:41 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Mon, 19 Dec 2011 17:47:41 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
Message-ID: <CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>

On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> Could you add a link to your /corpus/README.txt file pointing
> back to the Biopython original for acknowledgement and
> future reference?
>

I forgot about that, I will add it to the next release.

> Are you doing just SFF parsing for now? Not writing?
>

I haven't written the writer yet (haven't needed it so far). I'd rather
release working code early instead of waiting until everything is complete.

> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
> format name "sff" to mean the full read sequence (with mixed
> case, upper case for the good sequence, lower case for any
> left/right clipping - as in the Roche tools), and "sff-trim" to mean
> the trimmed sequences. I would encourage you to do the
> same, as part of the general aim of having consistent
> sequence format names between BioPerl, Biopython, and
> EMBOSS, where possible.
>

I agree, consistency is good.

Leon


From p.j.a.cock at googlemail.com  Mon Dec 19 12:00:03 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 19 Dec 2011 17:00:03 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
Message-ID: <CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>

On Mon, Dec 19, 2011 at 4:47 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
>>
>> Could you add a link to your /corpus/README.txt file pointing
>> back to the Biopython original for acknowledgement and
>> future reference?
>
> I forgot about that, I will add it to the next release.

Thanks.

>> Are you doing just SFF parsing for now? Not writing?
>
>
> I haven't written the writer yet (haven't needed it so far). I'd rather
> release working code early instead of waiting until everything is complete.

I understand - but make sure you've designed the data structures
in the parser so as to allow the original record to be re-built as SFF.

>> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
>> format name "sff" to mean the full read sequence (with mixed
>> case, upper case for the good sequence, lower case for any
>> left/right clipping - as in the Roche tools), and "sff-trim" to mean
>> the trimmed sequences. I would encourage you to do the
>> same, as part of the general aim of having consistent
>> sequence format names between BioPerl, Biopython, and
>> EMBOSS, where possible.
>
> I agree, consistency is good.

Great. I'd guess Bio::SeqIO integration would be more important
than SFF output initially.

Peter


From cjfields at illinois.edu  Mon Dec 19 14:44:22 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 19 Dec 2011 19:44:22 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>,
	<CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
Message-ID: <D5B63715-D712-44E4-BBAC-35C721AF2098@illinois.edu>

Kinda joining this a little late, but I think if there is a way to have a low-level parser/writer that generically parses the data into simple (possibly hash-tagged) data structures, that would be best.  Barring that, a very simple class for storing data.  We've found BioPerl objects/classes pretty heavy.

(for an example of this, see Heng Li's readfq parser on github, which has some stats for Fastq/fasta parsing).

Any way we can separate the parser from object instantiation would enable us to optimize the object/class layer and parser/writer layers separately, with the possible nice side effect of making the parser more broadly used.  

For instance, if someone wanted a faster parser, they could use the low-level one; otherwise they would use the higher-level (possibly BioPerl-specific) API. Lincoln does this to a certain degree with Bio-samtools; I would go further and put the bp- and non-bp code in separate dists.
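
As a rough illustration of that split (a sketch only, not code from BioPerl or readfq): the low-level layer below turns FASTA text into plain hash references, and a thin optional class wraps those hashes for callers who want methods. The names read_fasta_records and Sketch::Seq are invented for this example.

```perl
use strict;
use warnings;

# Low-level layer: parse FASTA from a filehandle into plain hashes.
# No objects are created here, so this part can be tuned on its own.
sub read_fasta_records {
    my ($fh) = @_;
    my (@records, $current);
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;            # strip Unix or Windows line endings
        next if $line eq '';
        if ($line =~ /^>(\S*)\s*(.*)$/) {
            $current = { id => $1, desc => $2, seq => '' };
            push @records, $current;
        }
        elsif ($current) {
            $current->{seq} .= $line;    # a sequence may span several lines
        }
    }
    return \@records;
}

# Higher-level layer (hypothetical): a thin wrapper around one parsed hash,
# built only when the caller actually wants an object.
{
    package Sketch::Seq;
    sub new  { my ($class, $record) = @_; return bless { %$record }, $class }
    sub id   { $_[0]{id} }
    sub desc { $_[0]{desc} }
    sub seq  { $_[0]{seq} }
}
```

A caller who only needs speed can loop over the array of hashes directly; a caller who prefers methods can map each hash through Sketch::Seq->new.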

Chris

Sent from my iPad

On Dec 19, 2011, at 11:05 AM, "Peter Cock" <p.j.a.cock at googlemail.com> wrote:

> On Mon, Dec 19, 2011 at 4:47 PM, Leon Timmermans
> <l.m.timmermans at students.uu.nl> wrote:
>> On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock <p.j.a.cock at googlemail.com>
>> wrote:
>>> 
>>> Could you add a link to your /corpus/README.txt file pointing
>>> back to the Biopython original for acknowledgement and
>>> future reference?
>> 
>> I forgot about that, I will add it to the next release.
> 
> Thanks.
> 
>>> Are you doing just SFF parsing for now? Not writing?
>> 
>> 
>> I haven't written the writer yet (haven't needed it so far). I'd rather
>> release working code early instead of waiting until everything is complete.
> 
> I understand - but make sure you've designed the data structures
> in the parser so as to allow the original record to be re-built as SFF.
> 
>>> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
>>> format name "sff" to mean the full read sequence (with mixed
>>> case, upper case for the good sequence, lower cases for any
>>> left/right clipping - as in the Roche tools), and "sff-trim" to mean
>>> the trimmed sequences. I would encourage you to do the
>>> same, as part of the general aim of having consistent
>>> sequence format names between BioPerl, Biopython, and
>>> EMBOSS, where possible.
>> 
>> I agree, consistency is good.
> 
> Great. I'd guess Bio::SeqIO integration would be more important
> than SFF output initially.
> 
> Peter
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From cjfields at illinois.edu  Mon Dec 19 19:28:25 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Mon, 19 Dec 2011 18:28:25 -0600
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
Message-ID: <4EEFD6A9.3010303@illinois.edu>

On 12/19/2011 10:47 AM, Leon Timmermans wrote:
> On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
>> Could you add a link to your /corpus/README.txt file pointing
>> back to the Biopython original for acknowledgement and
>> future reference?
>
> I forgot about that, I will add it to the next release.
>
>> Are you doing just SFF parsing for now? Not writing?
>
> I haven't written the writer yet (haven't needed it so far). I'd rather
> release working code early instead of waiting until everything is complete.
>
>> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
>> format name "sff" to mean the full read sequence (with mixed
>> case, upper case for the good sequence, lower cases for any
>> left/right clipping - as in the Roche tools), and "sff-trim" to mean
>> the trimmed sequences. I would encourage you to do the
>> same, as part of the general aim of having consistent
>> sequence format names between BioPerl, Biopython, and
>> EMBOSS, where possible.
>
> I agree, consistency is good.
>
> Leon

This is already implemented in Bio::SeqIO I believe.  This is the same 
line of thinking with the FASTQ format, that one can have a 
'format-variant' combination that (as one might guess) indicates to the 
parser any variation of the parser so logic within the parser can deal 
with it.  You can also pass the '-variant => "foo"' parameter as well 
IIRC.  You would just check the variant with the variant() method.
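
A small sketch of the 'format-variant' idea, independent of BioPerl: a combined name such as "sff-trim" is split into a format and an optional variant, and the reader branches on the variant. The function names split_format_name and describe_reader are invented for illustration, and letting an explicit variant argument override the one in the name is a design choice of this sketch, not a statement about how Bio::SeqIO behaves internally.

```perl
use strict;
use warnings;

# Split "format-variant" into its two parts; the variant may be absent.
sub split_format_name {
    my ($name) = @_;
    my ($format, $variant) = split /-/, lc $name, 2;
    return ($format, $variant);
}

# Pick a reader description from a format name and an optional explicit
# variant (in this sketch the explicit variant wins over the name).
sub describe_reader {
    my (%args) = @_;
    my ($format, $variant) = split_format_name($args{format});
    $variant = lc $args{variant} if defined $args{variant};
    return defined $variant
        ? "$format reader, variant '$variant'"
        : "$format reader, default variant";
}
```

With this, describe_reader(format => 'sff-trim') and describe_reader(format => 'sff', variant => 'trim') select the same behaviour.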

chris


From l.m.timmermans at students.uu.nl  Tue Dec 20 10:25:13 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Tue, 20 Dec 2011 16:25:13 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
	<CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
Message-ID: <CAC1jpXDmU3jF-3G9dnB=-0fCkydqWd18J0eo2nKkSLvKTbOxjQ@mail.gmail.com>

On Mon, Dec 19, 2011 at 6:00 PM, Peter Cock <p.j.a.cock at googlemail.com>wrote:

> I understand - but make sure you've designed the data structures
> in the parser so as to allow the original record to be re-built as SFF.
>

I did, though currently it's rather hard to make new entries from scratch.
That said, I can hardly imagine anyone wanting to do this.

> Great. I'd guess Bio::SeqIO integration would be more important
> than SFF output initially.
>

Probably. It looks like it's quite easy, it's just rather underdocumented.

Leon


From l.m.timmermans at students.uu.nl  Tue Dec 20 10:26:11 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Tue, 20 Dec 2011 16:26:11 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <D5B63715-D712-44E4-BBAC-35C721AF2098@illinois.edu>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
	<CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
	<D5B63715-D712-44E4-BBAC-35C721AF2098@illinois.edu>
Message-ID: <CAC1jpXAq3cN0GriHndyQnbpn3v26M1o927i8z_a=x5bQU-iPQg@mail.gmail.com>

On Mon, Dec 19, 2011 at 8:44 PM, Fields, Christopher J <
cjfields at illinois.edu> wrote:

> Kinda joining this a little late, but I think if there is a way to have a
> low-level parser/writer that generically parses the data into simple
> (possibly hash-tagged) data structures, that would be best.  Barring that,
> a very simple class for storing data.  We've found BioPerl objects/classes
> pretty heavy.
>
> (for an example of this, see Heng Li's readfq parser on github, which has
> some stats for Fastq/fasta parsing).
>
> Any way we can separate the parser from object instantiation would enable
> us to optimize the object/class layer and parser/writer layers separately,
> with the possible nice side effect of making the parser more broadly used.
>
> For instance, if someone wanted a faster parser, use the low level,
> otherwise use the higher level (possibly BioPerl-specific) API. Lincoln
> does this to a certain degree with Bio-samtools; I would go further and
> make the bp- and non-bp code in separate dists.
>

A good OO system can actually help make things faster. For example, I'm
unpacking the flowspace and quality data lazily, which made scanning
through an SFF file 2.5-3 times as fast while having marginal extra costs
when you do need them.
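
To illustrate lazy unpacking in general terms (a hypothetical sketch, not the actual Bio::SFF code): the object keeps the packed bytes exactly as they were read and only calls unpack the first time the quality scores are requested. The class name Sketch::LazyRead and its fields are invented for this example.

```perl
use strict;
use warnings;

{
    package Sketch::LazyRead;

    sub new {
        my ($class, %args) = @_;
        return bless {
            name           => $args{name},
            packed_quality => $args{packed_quality},   # raw bytes, one per base
        }, $class;
    }

    sub name { $_[0]{name} }

    # Unpack on first use, then cache the result for later calls.
    sub quality {
        my ($self) = @_;
        $self->{quality} //= [ unpack 'C*', $self->{packed_quality} ];
        return $self->{quality};
    }
}
```

Scanning a file for read names never touches the unpack call; only code that asks for quality() pays for it, and only once per read.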

Leon


From l.m.timmermans at students.uu.nl  Tue Dec 20 10:30:54 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Tue, 20 Dec 2011 16:30:54 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <4EEFD6A9.3010303@illinois.edu>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
	<4EEFD6A9.3010303@illinois.edu>
Message-ID: <CAC1jpXD_sSYoU2DS33Yn99c0WToyXvTY2aJdcS6w-yZ0xfCFMg@mail.gmail.com>

On Tue, Dec 20, 2011 at 1:28 AM, Chris Fields <cjfields at illinois.edu> wrote:

> This is already implemented in Bio::SeqIO I believe.  This is the same
> line of thinking with the FASTQ format, that one can have a
> 'format-variant' combination that (as one might guess) indicates to the
> parser any variation of the parser so logic within the parser can deal with
> it.  You can also pass the '-variant => "foo"' parameter as well IIRC.  You
> would just check the variant with the variant() method.
>

Great. That makes life much easier :-)

Leon


From p.j.a.cock at googlemail.com  Tue Dec 20 10:31:59 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 20 Dec 2011 15:31:59 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXDmU3jF-3G9dnB=-0fCkydqWd18J0eo2nKkSLvKTbOxjQ@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
	<CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
	<CAC1jpXDmU3jF-3G9dnB=-0fCkydqWd18J0eo2nKkSLvKTbOxjQ@mail.gmail.com>
Message-ID: <CAKVJ-_7v+wKQVXkLz_CMJXviYApyirjG9CA89mti5a3N40V8iA@mail.gmail.com>

On Tue, Dec 20, 2011 at 3:25 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> On Mon, Dec 19, 2011 at 6:00 PM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
>>
>> I understand - but make sure you've designed the data structures
>> in the parser so as to allow the original record to be re-built as SFF.
>
> I did, though currently it's rather hard to make new entries from scratch.
> That said, I can hardly imagine anyone wanting to do this.

Typical use cases I've found in using the Biopython SFF code are
filtering an SFF file (taking some records only), and modifying the
clipping values. In both cases, the user isn't creating the SFF
records from scratch.
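
How clipping values relate to the two representations mentioned earlier in the thread ("sff" with lower-case clipped ends, "sff-trim" with only the good region) can be sketched as follows. This is a simplified illustration: it assumes 1-based inclusive clip positions and treats 0 as "no clipping", the function name render_clipped is invented, and it does not read or write real SFF records.

```perl
use strict;
use warnings;

# Render one read both ways from its bases and clip positions.
sub render_clipped {
    my ($bases, $clip_left, $clip_right) = @_;
    my $len = length $bases;
    $clip_left  = 1    if !$clip_left;                           # 0 means "not set"
    $clip_right = $len if !$clip_right || $clip_right > $len;
    die "clip_left must not exceed clip_right\n" if $clip_left > $clip_right;

    my $head = substr $bases, 0, $clip_left - 1;
    my $good = substr $bases, $clip_left - 1, $clip_right - $clip_left + 1;
    my $tail = substr $bases, $clip_right;

    return {
        full    => lc($head) . uc($good) . lc($tail),   # "sff"-style view
        trimmed => uc($good),                            # "sff-trim"-style view
    };
}
```

In this sketch, changing the clip values only changes how the same stored bases are rendered, which matches the re-clipping use case: the record is modified, not created from scratch.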

Peter


From cjfields at illinois.edu  Tue Dec 20 17:40:31 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 20 Dec 2011 22:40:31 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXAq3cN0GriHndyQnbpn3v26M1o927i8z_a=x5bQU-iPQg@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
	<CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
	<D5B63715-D712-44E4-BBAC-35C721AF2098@illinois.edu>,
	<CAC1jpXAq3cN0GriHndyQnbpn3v26M1o927i8z_a=x5bQU-iPQg@mail.gmail.com>
Message-ID: <CE1C3005-EA13-4C4E-A4B5-7F387D0E8E0B@illinois.edu>


On Dec 20, 2011, at 9:26 AM, "Leon Timmermans" <l.m.timmermans at students.uu.nl> wrote:

> On Mon, Dec 19, 2011 at 8:44 PM, Fields, Christopher J <cjfields at illinois.edu> wrote:
>> Kinda joining this a little late, but I think if there is a way to have a low-level parser/writer that generically parses the data into simple (possibly hash-tagged) data structures, that would be best.  Barring that, a very simple class for storing data.  We've found BioPerl objects/classes pretty heavy.
>>
>> (for an example of this, see Heng Li's readfq parser on github, which has some stats for Fastq/fasta parsing).
>>
>> Any way we can separate the parser from object instantiation would enable us to optimize the object/class layer and parser/writer layers separately, with the possible nice side effect of making the parser more broadly used.
>>
>> For instance, if someone wanted a faster parser, use the low level, otherwise use the higher level (possibly BioPerl-specific) API. Lincoln does this to a certain degree with Bio-samtools; I would go further and make the bp- and non-bp code in separate dists.
>
> A good OO system can actually help make things faster. For example, I'm unpacking the flowspace and quality data lazily, which made scanning through an SFF file 2.5-3 times as fast while having marginal extra costs when you do need them.
>
> Leon

Yep, thinking about using the same approach for the Fastq variants.

Chris

Sent from my ancient iPad b/c my laptop's borked


From dgacquer at ulb.ac.be  Wed Dec 21 08:26:07 2011
From: dgacquer at ulb.ac.be (David Gacquer)
Date: Wed, 21 Dec 2011 14:26:07 +0100
Subject: [Bioperl-l] Strange behaviour in the write_seq function for large
	fasta
Message-ID: <4EF1DE6F.4070508@ulb.ac.be>

Dear BioPerl users/developers,

I am facing a strange issue with the $seq_out->write_seq function when 
using large fasta files

I have downloaded the hg19 chromosome 1, and applied the following code 
(basically I wanted to mask some regions in it but the problem also 
appears when copying the sequence without modifications):

use strict;
use warnings;
use Bio::SeqIO;
use Bio::Seq::LargePrimarySeq;

sub main {
     # Read the first sequence from the input file and write a copy of it.
     my $seq_in  = Bio::SeqIO->new( -format => 'largefasta', -file => $ARGV[0] );
     my $seq_out = Bio::SeqIO->new( -format => 'largefasta', -file => '>' . $ARGV[1] );
     my $seq_obj_in   = $seq_in->next_seq();
     my $modified_seq = $seq_obj_in->seq();
     my $seq_obj_out  = Bio::Seq::LargePrimarySeq->new(
          -seq  => $modified_seq,
          -id   => $seq_obj_in->id,
          -desc => $seq_obj_in->desc,
     );
     $seq_out->write_seq($seq_obj_out);
}

main();

when checking the output fasta file, the sequence of chr1 is 1-bp shorter.

I have noticed that in the original fasta file, each line contains 
exactly 50 nucleotides, while the output of the $seq_out->write_seq 
function contains always 60 characters per line.
chr1 is exactly 249,250,621 bp (4,154,177 * 60 + 1) so to verify that 
the very last base was missing, I created the following fasta files:

chr121.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAG

chr122.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAG

They contain respectively 121 (60*2+1) and 122 (60*2+2) bp, the last 
character being a G. When running the above code:

chr121.out.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

chr122.out.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AG

The output for the 122 bp chromosome is correct (2 lines of 60 bp and 
the last line with 2 bp, AG) but for the 121 bp chromosome, the last 
character is missing (2 lines of 60 bp only, last G is missing).

When replacing -format => 'largefasta' by -format => 'fasta' or writing 
the output without the write_seq function however, the problem is solved.

Am I missing something or is there a problem with the write_seq function 
used with large fasta files? (I am running BioPerl on a Mac under OS X 
Snow Leopard)
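
One way to check for this kind of off-by-one independently of any writer is to count the residues in the input and output files and compare the totals. The sketch below does that in plain Perl and also shows a line-wrapping routine that keeps a short final line; the function names are invented, and this is not the BioPerl implementation.

```perl
use strict;
use warnings;

# Split a sequence into lines of at most $width characters.
# The {1,$width} quantifier keeps a short final chunk, even a single base.
sub wrap_sequence {
    my ($seq, $width) = @_;
    $width ||= 60;
    return [ $seq =~ /(.{1,$width})/g ];
}

# Count sequence characters per record in FASTA text read from a filehandle.
sub count_residues {
    my ($fh) = @_;
    my (%length_of, $id);
    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;                       # drop newline and trailing blanks
        if ($line =~ /^>(\S+)/) { $id = $1; $length_of{$id} = 0 }
        elsif (defined $id)     { $length_of{$id} += length $line }
    }
    return \%length_of;
}
```

Running count_residues on both chr121.fa and chr121.out.fa and comparing the numbers for chrA would confirm whether the last base was dropped, whatever the line width of each file.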

Best regards

David

-- 
David Gacquer, Ph. D.

IRIBHM - Universite Libre de Bruxelles
Bldg C, room C.4.117
ULB, Campus Erasme, CP602
808 route de Lennik
B-1070 Brussels
Belgium

Phone: +32-2-555 4187
Fax: +32-2-555 4655
E-mail: dgacquer at ulb.ac.be


From koraydogankaya at gmail.com  Sat Dec 24 03:44:43 2011
From: koraydogankaya at gmail.com (Koray)
Date: Sat, 24 Dec 2011 00:44:43 -0800 (PST)
Subject: [Bioperl-l] exons
Message-ID: <8454dd72-4fed-41c5-977e-83d9300cd68b@z25g2000vbs.googlegroups.com>

I need explicit code for getting the exon sequences of an mRNA or gene
fetched with get_Seq_by_acc or by ID.

In Ensembl this is easy, but here it is not, because many IOs exist.

For example, how can I get such a $gene object from a database (GenBank or
Entrez Gene) by accession number or ID?


 Title   : exons()
 Usage   : @exons = $gene->exons();
           @inital_exons = $gene->exons('Initial');
 Function: Get all exon features or all exons of a specified type of this gene
           structure.

           Exon type is treated as a case-insensitive regular expression and
           optional. For consistency, use only the following types:
           initial, internal, terminal, utr, utr5prime, and utr3prime.
           A special and virtual type is 'coding', which refers to all types
           except utr.

           This method basically merges the exons returned by transcripts.

 Returns : An array of Bio::SeqFeature::Gene::ExonI implementing objects.
 Args    : An optional string specifying the type of exon.


From challa_ghanashyam at yahoo.com  Sat Dec 24 15:09:09 2011
From: challa_ghanashyam at yahoo.com (GSC)
Date: Sat, 24 Dec 2011 12:09:09 -0800 (PST)
Subject: [Bioperl-l] re trieve description for a list of gi ids..
Message-ID: <33034438.post@talk.nabble.com>


Hi all:
I am new to Perl. I am working on a script to retrieve the record
description (the name given to a sequence record in GenBank) for a list of GI
IDs. The script works fine for 1,000 IDs, but my list is about 250,000 IDs
long and the script does not work for it. Any suggestions?

GS
-- 
View this message in context: http://old.nabble.com/retrieve-description-for-a-list-of-gi-ids..-tp33034438p33034438.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From cjfields at illinois.edu  Tue Dec 27 10:03:28 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 27 Dec 2011 15:03:28 +0000
Subject: [Bioperl-l] Strange behaviour in the write_seq function for
 large	fasta
In-Reply-To: <4EF1DE6F.4070508@ulb.ac.be>
References: <4EF1DE6F.4070508@ulb.ac.be>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF0BB29547@CITESMBX4.ad.uillinois.edu>

This is a strange one.  Personally I haven't seen this behavior, but maybe it's OS-dependent?

We'll need more information, particularly what version of BioPerl you are using, the OS, version of perl, etc.  Also, in general to make sure we don't lose track of this issue it is best to submit a bug report:

https://redmine.open-bio.org/projects/bioperl

I'm planning on triaging bugs next week, I could take a look then.

chris
________________________________________
From: bioperl-l-bounces at lists.open-bio.org [bioperl-l-bounces at lists.open-bio.org] on behalf of David Gacquer [dgacquer at ulb.ac.be]
Sent: Wednesday, December 21, 2011 7:26 AM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Strange behaviour in the write_seq function for large      fasta

Dear BioPerl users/developers,

I am facing a strange issue with the $seq_out->write_seq function when
using large fasta files

I have downloaded the hg19 chromosome 1, and applied the following code
(basically I wanted to mask some regions in it but the problem also
appears when copying the sequence without modifications):

sub main{
     my $seq_in  = Bio::SeqIO->new( -format => 'largefasta', -file =>
$ARGV[0]);
     my $seq_out  = Bio::SeqIO->new( -format => 'largefasta', -file =>
'>'.$ARGV[1]);
     my $seq_obj_in = $seq_in->next_seq();
     my $modified_seq = $seq_obj_in->seq();
     my $seq_obj_out = Bio::Seq::LargePrimarySeq->new( -seq =>
$modified_seq, -id  => $seq_obj_in->id, -desc => $seq_obj_in->desc);
     $seq_out->write_seq($seq_obj_out);
}

when checking the output fasta file, the sequence of chr1 is 1-bp shorter.

I have noticed that in the original fasta file, each line contains
exactly 50 nucleotides, while the output of the $seq_out->write_seq
function contains always 60 characters per line.
chr1 is exactly 249,250,621 bp (4,154,177 * 60 + 1) so to verify that
the very last base was missing, I created the following fasta files:

chr121.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAG

chr122.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAG

They contain respectively 121 (60*2+1) and 122 (60*2+2) bp, the last
character being a G. When running the above code:

chr121.out.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

chr122.out.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AG

The output for the 122 bp chromosome is correct (2 lines of 60 bp and
the last line with 2 bp, AG) but for the 121 bp chromosome, the last
character is missing (2 lines of 60 bp only, last G is missing).

When replacing -format => 'largefasta' by -format => 'fasta' or writing
the output without the write_seq function however, the problem is solved.

Am I missing something or is there a problem with the write_seq function
used with large fasta files? (I am running BioPerl on a Mac under OS X
Snow Leopard)

Best regards

David

--
David Gacquer, Ph. D.

IRIBHM - Universite Libre de Bruxelles
Bldg C, room C.4.117
ULB, Campus Erasme, CP602
808 route de Lennik
B-1070 Brussels
Belgium

Phone: +32-2-555 4187
Fax: +32-2-555 4655
E-mail: dgacquer at ulb.ac.be

_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l


From jdeuts01 at students.poly.edu  Thu Dec  1 09:09:19 2011
From: jdeuts01 at students.poly.edu (jdeuts01 at students.poly.edu)
Date: Thu, 1 Dec 2011 14:09:19 +0000
Subject: [Bioperl-l] question
Message-ID: <SNT134-W43F83A46574EEDD841600186B10@phx.gbl>


Dear Bioperl,
       This is my first experience with bioperl and I need help please.
1. The version of bioperl is 1.6.1 and have also installed Bundle-BioPerl 2.1.8 and mGen 1.03.    I was unable to install Bribes and trouchelle DB.     Will this prevent the BioPerl package from functioning correctly?
2. The operating platform is windows 7 - 64 bit using ActiveState - Perl v5.12.2
3. The script is as follows:
#!/usr/bin/perl
# Write a script using OOP to write protein sequences to the file fasta.txt.
use strict;
use warnings;
use Bio::SeqIO::fasta;

# Declare and initialize input and output files.
my $protein_fasta = "protein.fa";
my $protein_out = ">fasta.txt";

# Setup objects for input and output.
my $seq_in = Bio::SeqIO->new(-file => "$protein_fasta", -format => 'Fasta');
my $seq_out = Bio::SeqIO->new(-file => "$protein_out", -format => 'fasta');

# Establish while loop using "next_seq" method
# to read in multiple sequences from "protein.fa" file
# one by one until none were remaining.
while (my $seq = $seq_in->next_seq) {
    $seq_out->write_seq($seq);
}

The information is successfully written to the file: fasta.txt.
4. Receiving the following error messages:
Replacement list is longer than search list at C:/Perl64/site/lib/Bio/Range.pm line 251.
Subroutine _initialize redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 92.
Subroutine next_seq redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 112.
Subroutine write_seq redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 193.
Subroutine width redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 272.
Subroutine preferred_id_type redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 295.

Thanks in advance for your help.
John Deutsch
 		 	   		  

From jboddu at illinois.edu  Thu Dec  1 11:38:00 2011
From: jboddu at illinois.edu (Boddu, Jayanand)
Date: Thu, 1 Dec 2011 16:38:00 +0000
Subject: [Bioperl-l] Chromosome coordinates
Message-ID: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>

Hello
I am a newbie to Perl scripts.
I have a file with short reads mapped to the MAIZE genome.
The format is a simple BLASTN output:
READ_ID  Chr    % Similarity  Alignment  Mismatches  Gaps  READ Start  READ End  Chr Start  Chr End    E Value  Score
READ1    chrPt  100           17         0           0     1           17        35021      35037      0.21     34.2
READ1    chr10  100           17         0           0     1           17        128587356  128587372  0.21     34.2
READ1    chr6   100           17         0           0     1           17        160769803  160769787  0.21     34.2
READ1    chr5   100           17         0           0     1           17        172103083  172103067  0.21     34.2
READ1    chr4   100           17         0           0     1           17        213173683  213173699  0.21     34.2
READ1    chr3   100           17         0           0     1           17        23689132   23689116   0.21     34.2
READ2    chr8   100           17         0           0     1           17        161048603  161048587  0.21     34.2
READ2    chr6   100           17         0           0     1           17        155768884  155768868  0.21     34.2
READ2    chr5   100           17         0           0     1           17        32958812   32958828   0.21     34.2
READ2    chr3   100           17         0           0     1           17        212451090  212451074  0.21     34.2
READ2    chr2   100           17         0           0     1           17        2046449    2046465    0.21     34.2
READ2    chr1   100           17         0           0     1           17        223233801  223233785  0.21     34.2
READ2    chr1   100           17         0           0     1           17        277573037  277573021  0.21     34.2


As expected, the same read maps to multiple places on the same or different chromosomes.
I have a GFF file with annotated coordinates.
I would like to run a Perl script to find out which READS are within the GENES in the GFF file and which are not.
The anticipated script should:

1.       Take the READ coordinates on the genome (by chromosome);

2.       Go to the GFF file;

3.       Find the Chromosome;

4.       Find the GENE (by coordinates);

5.       and report READ, its coordinates, Chromosome, GENE, and its coordinates.

It doesn't need to be in the same order.
After this, I guess I could use a simple Microsoft ACCESS query to pull out the READS that are not mapped to the GENES.
I would greatly appreciate it if anyone has a script that does more or less this job.
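
The steps listed above can be sketched in plain Perl as follows. This is only an illustration under stated assumptions: the hits are tab-separated with the twelve columns shown in the table, the annotation is GFF with "gene" in the third column, "within" is taken to mean any overlap, and every function name here is invented. For a large genome a linear scan per hit will be slow, so treat it as a starting point rather than a finished tool.

```perl
use strict;
use warnings;

# Read gene features from GFF into { chromosome => [ [start, end, attributes], ... ] }.
sub load_genes {
    my ($fh) = @_;
    my %genes_on;
    while (my $line = <$fh>) {
        next if $line =~ /^#/;                     # skip GFF comments and pragmas
        chomp $line;
        my ($chr, undef, $type, $start, $end, undef, undef, undef, $attr)
            = split /\t/, $line;
        next unless defined $type && $type eq 'gene';
        push @{ $genes_on{$chr} }, [ $start, $end, defined $attr ? $attr : '' ];
    }
    return \%genes_on;
}

# For each hit, report the overlapping genes, or 'none'.
sub annotate_hits {
    my ($fh, $genes_on) = @_;
    my @report;
    while (my $line = <$fh>) {
        chomp $line;
        my @col = split /\t/, $line;
        # Skip blank lines and a header row (non-numeric coordinates).
        next unless @col >= 10 && $col[8] =~ /^\d+$/ && $col[9] =~ /^\d+$/;
        my ($read, $chr) = @col[0, 1];
        # Minus-strand hits have start > end, so order the pair first.
        my ($start, $end) = sort { $a <=> $b } @col[8, 9];
        my @found = grep { $_->[0] <= $end && $_->[1] >= $start }
                    @{ $genes_on->{$chr} || [] };
        my @names = @found ? (map { $_->[2] } @found) : ('none');
        push @report, [ $read, $chr, $start, $end, @names ];
    }
    return \@report;
}
```

Each report row holds the read, chromosome, ordered hit coordinates, and the attribute column of every overlapping gene; rows ending in 'none' are the reads that fall outside annotated genes. To require that a read lies entirely inside a gene, the grep condition would become `$_->[0] <= $start && $_->[1] >= $end`.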

Thanks
Jay


From scott at scottcain.net  Thu Dec  1 11:59:56 2011
From: scott at scottcain.net (Scott Cain)
Date: Thu, 1 Dec 2011 11:59:56 -0500
Subject: [Bioperl-l] Chromosome coordinates
In-Reply-To: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>
References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>
Message-ID: <CA+JTaoy0o1CXCm5EZi_4K5vTAaW9kFJUFVXg1rWYB84ZnN7+5A@mail.gmail.com>

Hi Jay,

Since the maize GFF file is likely to be fairly large, I would consider
putting it in a database, using either Bio::DB::GFF if it is GFF2 or
Bio::DB::SeqFeature::Store if it is GFF3.  Then you can use the methods
that come along with either of those modules to search regions for
genes.  They both support a get_features_by_location method, so you could
get the range for each of the regions you want to look at, and check the
database with that method to see if anything is there.

Scott


On Thu, Dec 1, 2011 at 11:38 AM, Boddu, Jayanand <jboddu at illinois.edu>wrote:

> Hello
> I am newbie to Perl scripts.
> I have a file with short reads mapped to the MAIZE genome
> The format is a simple BLASTN output.
> READ_ID  Chr    % Similarity  Alignment  Mismatches  Gaps  READ Start  READ End  Chr Start  Chr End    E Value  Score
> READ1    chrPt  100           17         0           0     1           17        35021      35037      0.21     34.2
> READ1    chr10  100           17         0           0     1           17        128587356  128587372  0.21     34.2
> READ1    chr6   100           17         0           0     1           17        160769803  160769787  0.21     34.2
> READ1    chr5   100           17         0           0     1           17        172103083  172103067  0.21     34.2
> READ1    chr4   100           17         0           0     1           17        213173683  213173699  0.21     34.2
> READ1    chr3   100           17         0           0     1           17        23689132   23689116   0.21     34.2
> READ2    chr8   100           17         0           0     1           17        161048603  161048587  0.21     34.2
> READ2    chr6   100           17         0           0     1           17        155768884  155768868  0.21     34.2
> READ2    chr5   100           17         0           0     1           17        32958812   32958828   0.21     34.2
> READ2    chr3   100           17         0           0     1           17        212451090  212451074  0.21     34.2
> READ2    chr2   100           17         0           0     1           17        2046449    2046465    0.21     34.2
> READ2    chr1   100           17         0           0     1           17        223233801  223233785  0.21     34.2
> READ2    chr1   100           17         0           0     1           17        277573037  277573021  0.21     34.2
>
> As expected the same read maps to multiple places on the same/different
> chromosome.
> I have a GFF file with annotated coordinates.
> I would like to run a PERL script to find out READS that are within the
> GENES in the GFF file and that are not.
> The anticipated script should;
>
> 1.       Take the READ coordinates on the genome (by chromosome);
>
> 2.       Go the GFF file;
>
> 3.       Find the Chromosome;
>
> 4.       Find the GENE (by coordinates);
>
> 5.       and report READ-its coordinates-Chromosome-GENE-and its
> coordinates.
>
> It doesn't need to be in the same order.
> After this, I guess I could use simple Microsoft ACCESS query to pull out
> READS that are not mapped to the GENEs.
> I would greatly appreciate if anyone can has a script that more or less
> similar job.
>
> Thanks
> Jay
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>


-- 
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot
net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research


From jason.stajich at gmail.com  Thu Dec  1 12:31:29 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Thu, 1 Dec 2011 09:31:29 -0800
Subject: [Bioperl-l] Chromosome coordinates
In-Reply-To: <CA+JTaoy0o1CXCm5EZi_4K5vTAaW9kFJUFVXg1rWYB84ZnN7+5A@mail.gmail.com>
References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>
	<CA+JTaoy0o1CXCm5EZi_4K5vTAaW9kFJUFVXg1rWYB84ZnN7+5A@mail.gmail.com>
Message-ID: <470A08F7-E238-44E1-B0D1-69FDBC0BA2A3@gmail.com>

You might try BEDTools' intersectBed, which can do a lot of what you are doing in a simple command-line program.

Jason
On Dec 1, 2011, at 8:59 AM, Scott Cain wrote:

> Hi Jay,
> 
> Since the maize GFF file is likely to be fairly large, I would consider
> putting it in a database, using either Bio::DB::GFF if it is GFF2 or
> Bio::DB::SeqFeature::Store if it is GFF3.  Then you can use the methods
> that come along with either of those modules to search regions for
> genes.  They both support a get_features_by_location method, so you could
> get the range for each of the regions you want to look at, and check the
> database with that method to see if anything is there.
> 
> Scott
> 
> 
> On Thu, Dec 1, 2011 at 11:38 AM, Boddu, Jayanand <jboddu at illinois.edu>wrote:
> 
>> Hello
>> I am a newbie to Perl scripts.
>> I have a file with short reads mapped to the MAIZE genome.
>> The format is simple BLASTN output.
>> READ_ID  Chr    % Similarity  Alignment  Mismatches  Gaps  READ Start  READ End  Chr Start  Chr End    E Value  Score
>> READ1    chrPt  100           17         0           0     1           17        35021      35037      0.21     34.2
>> READ1    chr10  100           17         0           0     1           17        128587356  128587372  0.21     34.2
>> READ1    chr6   100           17         0           0     1           17        160769803  160769787  0.21     34.2
>> READ1    chr5   100           17         0           0     1           17        172103083  172103067  0.21     34.2
>> READ1    chr4   100           17         0           0     1           17        213173683  213173699  0.21     34.2
>> READ1    chr3   100           17         0           0     1           17        23689132   23689116   0.21     34.2
>> READ2    chr8   100           17         0           0     1           17        161048603  161048587  0.21     34.2
>> READ2    chr6   100           17         0           0     1           17        155768884  155768868  0.21     34.2
>> READ2    chr5   100           17         0           0     1           17        32958812   32958828   0.21     34.2
>> READ2    chr3   100           17         0           0     1           17        212451090  212451074  0.21     34.2
>> READ2    chr2   100           17         0           0     1           17        2046449    2046465    0.21     34.2
>> READ2    chr1   100           17         0           0     1           17        223233801  223233785  0.21     34.2
>> READ2    chr1   100           17         0           0     1           17        277573037  277573021  0.21     34.2
>>
>> As expected the same read maps to multiple places on the same/different
>> chromosome.
>> I have a GFF file with annotated coordinates.
>> I would like to run a Perl script to find out which READS fall within
>> GENES in the GFF file and which do not.
>> The anticipated script should:
>>
>> 1.       Take the READ coordinates on the genome (by chromosome);
>>
>> 2.       Go to the GFF file;
>>
>> 3.       Find the chromosome;
>>
>> 4.       Find the GENE (by coordinates);
>>
>> 5.       and report READ, its coordinates, chromosome, GENE, and its
>> coordinates.
>>
>> The output doesn't need to be in the same order.
>> After this, I guess I could use a simple Microsoft Access query to pull out
>> READS that are not mapped to GENES.
>> I would greatly appreciate it if anyone has a script that does a more or
>> less similar job.
>> 
>> Thanks
>> Jay
>> 
> 
> 
> 
> -- 
> ------------------------------------------------------------------------
> Scott Cain, Ph. D.                                   scott at scottcain dot
> net
> GMOD Coordinator (http://gmod.org/)                     216-392-3087
> Ontario Institute for Cancer Research


From jovel_juan at hotmail.com  Thu Dec  1 12:36:32 2011
From: jovel_juan at hotmail.com (Juan Jovel)
Date: Thu, 1 Dec 2011 17:36:32 +0000
Subject: [Bioperl-l] Error when using SearchIO
In-Reply-To: <CA+JTaoy0o1CXCm5EZi_4K5vTAaW9kFJUFVXg1rWYB84ZnN7+5A@mail.gmail.com>
References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>,
	<CA+JTaoy0o1CXCm5EZi_4K5vTAaW9kFJUFVXg1rWYB84ZnN7+5A@mail.gmail.com>
Message-ID: <COL102-W29B31BD46108B9F1D9EF46FAB10@phx.gbl>


Hello Everybody!
I have been running endless standalone BLAST jobs these days, on two different clusters. The first one runs Rock Linux, while the other one runs Debian. On the Rock Linux one, when I use SearchIO to parse my blastp or blastx reports, I invariably get this message:
"Replacement list is longer than search list at /share/apps/lib/perl5/site_perl/5.14.1/Bio/Range.pm line 251"
What does it mean? Would it have any effect on my parsing results?
Thanks, 
JUAN 		 	   		  


From cjfields at illinois.edu  Thu Dec  1 14:03:45 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 1 Dec 2011 19:03:45 +0000
Subject: [Bioperl-l] Error when using SearchIO
In-Reply-To: <COL102-W29B31BD46108B9F1D9EF46FAB10@phx.gbl>
References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>,
	<CA+JTaoy0o1CXCm5EZi_4K5vTAaW9kFJUFVXg1rWYB84ZnN7+5A@mail.gmail.com>
	<COL102-W29B31BD46108B9F1D9EF46FAB10@phx.gbl>
Message-ID: <7233AC75-E03B-401A-8D0A-260ED76956A4@illinois.edu>

On Dec 1, 2011, at 11:36 AM, Juan Jovel wrote:

> Hello Everybody!
> I have been running endless standalone BLAST jobs these days, on two different clusters. The first one runs Rock Linux, while the other one runs Debian. On the Rock Linux one, when I use SearchIO to parse my blastp or blastx reports, I invariably get this message:
> "Replacement list is longer than search list at /share/apps/lib/perl5/site_perl/5.14.1/Bio/Range.pm line 251"
> What does it mean? Would it have any effect on my parsing results?
> Thanks, 
> JUAN

This bug has been fixed; I think the fix went into the latest BioPerl release (1.6.901).  A transliteration error produced this warning, but otherwise it's harmless.  The warning is Perl-version dependent; I think it pops up in Perl 5.12 and up.  This user post (about halfway down) runs into the same issue: http://www.sysarchitects.com/bioperl.
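
To illustrate the kind of mistake that triggers this warning, here is a minimal sketch. It is not the actual code from Bio/Range.pm (which is not reproduced in this thread); the string and variable names are made up. The string eval is used only so that the compile-time warning can be trapped and printed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @warnings;
my $text = 'abc';
{
    # tr/// is compiled once, so the warning is issued at compile time;
    # a string eval lets us trap it with a local __WARN__ handler.
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    # Search list 'a' has one character, replacement list 'xy' has two.
    eval q{ $text =~ tr/a/xy/; 1 } or die $@;
}

print "result:  $text\n";          # the surplus 'y' is never used
print "warning: $_" for @warnings;
```

The transliteration itself still behaves as if the surplus replacement characters were absent, which is why the warning does not affect parsing results.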

chris


From David.Messina at sbc.su.se  Thu Dec  1 17:02:20 2011
From: David.Messina at sbc.su.se (Dave Messina)
Date: Thu, 1 Dec 2011 23:02:20 +0100
Subject: [Bioperl-l] re trieving blast multiple alignment in fasta form
In-Reply-To: <32886592.post@talk.nabble.com>
References: <32886592.post@talk.nabble.com>
Message-ID: <CAM3TQQURpsF8+Tq2AQ6yAzmDVuDnip-6mFAfGgUdR5ScTus4YA@mail.gmail.com>

Hi Eric,

Wait, do you want multiple pairwise alignments in your output FASTA file,
or a single multiple alignment of your query and all the hits?

If the former, get_aln() will give you one pairwise alignment per hsp, but
you'll need to move the output file creation statement (my $alnIO = ...)
before the loops so it gets created only once. Then, when you do the write
statement ($alnIO->write_aln($aln);), all of the alignments will go to the
same file.
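
As a rough sketch of that first option (Eric's original script is not shown here, so the subroutine name and file arguments below are placeholders; the BioPerl modules are loaded at run time only so the sketch compiles on machines without BioPerl):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Write one pairwise alignment per HSP from a BLAST report into a single
# FASTA alignment file. Both file names are supplied by the caller.
sub write_hsp_alignments {
    my ($blast_report, $out_fasta) = @_;

    require Bio::SearchIO;
    require Bio::AlignIO;

    my $searchio = Bio::SearchIO->new(-format => 'blast',
                                      -file   => $blast_report);
    # Created once, before the loops, so every alignment lands in one file.
    my $alnIO = Bio::AlignIO->new(-format => 'fasta',
                                  -file   => ">$out_fasta");
    my $written = 0;
    while (my $result = $searchio->next_result) {
        while (my $hit = $result->next_hit) {
            while (my $hsp = $hit->next_hsp) {
                my $aln = $hsp->get_aln;   # pairwise alignment for this HSP
                $alnIO->write_aln($aln);
                $written++;
            }
        }
    }
    return $written;
}

# Usage: perl hsp2fasta.pl report.blast alignments.fa
if (@ARGV == 2) {
    my $n = write_hsp_alignments(@ARGV);
    print "wrote $n alignments\n";
}
```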

If on the other hand you'd like to have a multiple alignment between a
query and all of its hits, you'll have to take the IDs of the hits, pull
the corresponding sequences out of the database, and then run a multiple
alignment algorithm on them.


Dave


From scuoppo at gmail.com  Fri Dec  2 17:50:28 2011
From: scuoppo at gmail.com (Claudio Scuoppo)
Date: Fri, 2 Dec 2011 17:50:28 -0500
Subject: [Bioperl-l] List of genes from genomic intervals
Message-ID: <CAEz0Wfv_yj6g5rZEbmj+UhOkJX8ryzObEO7N6rqvLRMV6Yd_Pg@mail.gmail.com>

Hi,

I am new to BioPerl. I was wondering what's the best strategy to get
the genes contained in a series of human genomic intervals.
Basically, I have a table with:

Chromosome Start End

Which module should I be looking at?
Thanks,
Claudio


From awitney at sgul.ac.uk  Mon Dec  5 06:09:39 2011
From: awitney at sgul.ac.uk (Adam Witney)
Date: Mon, 5 Dec 2011 11:09:39 +0000
Subject: [Bioperl-l] Bio::Graphics imagemap and padding
Message-ID: <44A27378-CC4B-4396-817E-AA31004847C7@sgul.ac.uk>

Hi,

Image maps seem to be out of position if you use padding in the Panel, like this:

my $panel = Bio::Graphics::Panel->new( ... -pad_left  => 20, -pad_right => 20 ... );

Without these options, the image map is fine. Is this a known issue?

Also, on a side note, I noticed that when using Bio::Graphics with Dancer, some of the CGI code was blocking somewhere (I found a reference to a similar problem with CGI and Catalyst). Swapping CGI for HTML::Entities fixes it:

sub create_web_map {
	...
	eval "require HTML::Entities" unless HTML::Entities->can('encode_entities');
	...
	my $title  = HTML::Entities::encode_entities($self->make_link($tr,$feature,1));
	my $target = HTML::Entities::encode_entities($self->make_link($tgr,$feature,1));
	...
}

Thanks

Adam


From momin.amin at gmail.com  Mon Dec  5 18:00:23 2011
From: momin.amin at gmail.com (Amin Momin)
Date: Mon, 5 Dec 2011 15:00:23 -0800 (PST)
Subject: [Bioperl-l] SimpleAlign and consensus_string
Message-ID: <a08b3694-1cc6-4f5d-ad80-1306c5b777f7@v5g2000yqn.googlegroups.com>

Hi ,

I am generating a consensus sequence by aligning two protein homologs
using Bio::Tools::Run::Alignment::TCoffee. However, I am unable to
understand the criteria that the consensus_string() method of SimpleAlign
uses to determine the consensus at positions with dissimilar amino acids
or nucleotides. Also, how would the % cutoffs provided to
consensus_string() affect the outcome?


Thanks,
Amin


From jason.stajich at gmail.com  Mon Dec  5 18:58:59 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Mon, 5 Dec 2011 15:58:59 -0800
Subject: [Bioperl-l] SimpleAlign and consensus_string
In-Reply-To: <a08b3694-1cc6-4f5d-ad80-1306c5b777f7@v5g2000yqn.googlegroups.com>
References: <a08b3694-1cc6-4f5d-ad80-1306c5b777f7@v5g2000yqn.googlegroups.com>
Message-ID: <4A59F912-BEE6-48D8-91EC-EAACB293A574@gmail.com>

There are several methods that do related things. 

Below is the documentation for the method. It allows stricter calling than plain majority-rule voting at each position: if you set the cutoff at 50%, the called AA or NT has to appear in at least 50% of the sequences at that position. There is also an IUPAC consensus method (consensus_iupac), which is useful for generating consensus patterns for DNA.

If you want a summary of whether a column contains only conservative amino acid changes, use the match_line method, which generates a string similar to the per-column summary line that CLUSTALW reports.

=head2 consensus_string

 Title     : consensus_string
 Usage     : $str = $ali->consensus_string($threshold_percent)
 Function  : Makes a strict consensus
 Returns   : Consensus string
 Argument  : Optional threshold ranging from 0 to 100.
             The consensus residue has to appear in at least threshold %
             of the sequences at a given location, otherwise a '?'
             character will be placed at that location.
             (Default value = 0%)

=cut
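
To make the effect of the cutoff concrete, here is a small plain-Perl sketch of the rule described in that documentation. It is an illustration only, not the code inside Bio::SimpleAlign: the subroutine name simple_consensus, the sample sequences, and the alphabetical tie-break are all assumptions, and gap characters get no special treatment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative re-implementation of a thresholded consensus,
# NOT Bio::SimpleAlign's own code.
sub simple_consensus {
    my ($threshold_percent, @rows) = @_;
    my $consensus = '';
    for my $col (0 .. length($rows[0]) - 1) {
        my %count;
        $count{ substr($_, $col, 1) }++ for @rows;
        # Most frequent residue; ties are broken alphabetically (assumption).
        my ($best) = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
        my $percent = 100 * $count{$best} / @rows;
        $consensus .= $percent >= $threshold_percent ? $best : '?';
    }
    return $consensus;
}

my @aln = ('MKTAY', 'MRTAY');    # two made-up aligned protein sequences
print simple_consensus(0,  @aln), "\n";   # every column gets a residue
print simple_consensus(60, @aln), "\n";   # the K/R column falls below 60%
```

With only two sequences, a mismatched column gives each residue 50%, so in this sketch any cutoff above 50 turns that column into '?'.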

On Dec 5, 2011, at 3:00 PM, Amin Momin wrote:

> Hi ,
> 
> I am generating a consensus sequence by aligning two protein homologs
> using Bio::Tools::Run::Alignment::TCoffee. However, I am unable to
> understand the criteria that the consensus_string() method of SimpleAlign
> uses to determine the consensus at positions with dissimilar amino acids
> or nucleotides. Also, how would the % cutoffs provided to
> consensus_string() affect the outcome?
> 
> 
> Thanks,
> Amin


From wenbinmei at gmail.com  Tue Dec  6 11:09:35 2011
From: wenbinmei at gmail.com (wenbin mei)
Date: Tue, 6 Dec 2011 11:09:35 -0500
Subject: [Bioperl-l] revcom the multiple sequence alignment
Message-ID: <CAHdrE2RfvOCy_3aUP_nD9Gp=VSmqLh52wh0vQmgKmaebVgyL-w@mail.gmail.com>

Hi,

I have a question about reverse-complementing a multiple sequence
alignment. One way I can do it is to convert the alignment to FASTA and
revcom the individual sequences. I wonder whether there is an easy way to
revcom the multiple sequence alignment as a whole.  Thank you for your help.

-best,
wenbin


From jason.stajich at gmail.com  Tue Dec  6 12:40:37 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Tue, 6 Dec 2011 09:40:37 -0800
Subject: [Bioperl-l] revcom the multiple sequence alignment
In-Reply-To: <CAHdrE2RfvOCy_3aUP_nD9Gp=VSmqLh52wh0vQmgKmaebVgyL-w@mail.gmail.com>
References: <CAHdrE2RfvOCy_3aUP_nD9Gp=VSmqLh52wh0vQmgKmaebVgyL-w@mail.gmail.com>
Message-ID: <CAAC48B9-371B-4446-897B-B1D8968CA7EA@gmail.com>

I think this would work to update it in place, though I haven't tried it myself:

for my $seq ( $aln->each_seq ) {
 $seq->seq( $seq->revcom->seq );
}
$out->write_aln($aln);

This may also work, though I'm not entirely sure whether the metadata (start/end) of the Seq object is updated when this is done.  If not, you may want to flip start/end for the sequences explicitly (the seqs are Bio::LocatableSeq objects). Or you may not care about those data and can ignore them.

   $seq = $seq->revcom;
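
If it helps to see what the operation does to the gapped rows themselves, here is a plain-string sketch that does not use BioPerl objects at all. It assumes DNA rows of equal length with '-' as the gap character; the sample rows and the name revcom_row are made up, and start/end metadata is not handled.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reverse-complement one gapped alignment row; gap characters stay gaps.
sub revcom_row {
    my ($row) = @_;
    my $rc = reverse $row;            # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;     # complement bases, leave '-' untouched
    return $rc;
}

# Made-up alignment rows of equal length.
my %aln = (
    seq1 => 'ATG-CCA',
    seq2 => 'ATGGCCA',
);

my %rc;
$rc{$_} = revcom_row($aln{$_}) for keys %aln;
print "$_\t$rc{$_}\n" for sort keys %rc;
```

Because every row is reversed the same way, the columns stay aligned with each other after the operation.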

Jason
On Dec 6, 2011, at 8:09 AM, wenbin mei wrote:

> Hi,
> 
> I have a question about reverse-complementing a multiple sequence
> alignment. One way I can do it is to convert the alignment to FASTA and
> revcom the individual sequences. I wonder whether there is an easy way to
> revcom the multiple sequence alignment as a whole.  Thank you for your help.
> 
> -best,
> wenbin


From wenbinmei at gmail.com  Tue Dec  6 12:51:18 2011
From: wenbinmei at gmail.com (wenbin mei)
Date: Tue, 6 Dec 2011 12:51:18 -0500
Subject: [Bioperl-l] revcom the multiple sequence alignment
In-Reply-To: <CAAC48B9-371B-4446-897B-B1D8968CA7EA@gmail.com>
References: <CAHdrE2RfvOCy_3aUP_nD9Gp=VSmqLh52wh0vQmgKmaebVgyL-w@mail.gmail.com>
	<CAAC48B9-371B-4446-897B-B1D8968CA7EA@gmail.com>
Message-ID: <CAHdrE2TqBxYN+wPC=LH433GwMt3sp5oas_OQhwWY9DbfbjdCpg@mail.gmail.com>

I think I did not explain my question clearly. I extract individual
gene alignments from the whole-genome alignment. Since some genes are on the
reverse strand, I want to revcom those gene alignments. Here is part of my
script. I can read the strand information from another file.

my $newstart = $refseq->column_from_residue_number($start);
my $newend = $refseq->column_from_residue_number($end);
$seq{$genename} = $aln->slice($newstart, $newend);


Any suggestion that helps me revcom the gene alignments on the minus strand
would be helpful. Thank you.

-best,
wenbin


On Tue, Dec 6, 2011 at 12:40 PM, Jason Stajich <jason.stajich at gmail.com>wrote:

> I think this would work to update it in place though I haven't tried it
> myself
>
> for my $seq ( $aln->each_seq ) {
>  $seq->seq( $seq->revcom->seq );
> }
> $out->write_aln($aln);
>
> This may also work - not entirely sure if there is any extra work done on
> the meta data (start/end) of the Seq object when this is done.  You may
> want to flip start/end for the sequences (the seqs are Bio::LocatableSeq
> objects) explicitly if not. Or you may not care about those data and can
> ignore.
>
>   $seq = $seq->revcom
>
> Jason
> On Dec 6, 2011, at 8:09 AM, wenbin mei wrote:
>
> > Hi,
> >
> > I have a question about reverse-complementing a multiple sequence
> > alignment. One way I can do it is to convert the alignment to FASTA and
> > revcom the individual sequences. I wonder whether there is an easy way to
> > revcom the multiple sequence alignment as a whole.  Thank you for your help.
> >
> > -best,
> > wenbin
>
>


-- 
Wenbin Mei Ph.D. Student
Dr. Brad Barbazuk's Lab
Department of Biology
University of Florida
509-899-3067
wmei at ufl.edu <wmei at ufl.edu>


From kellert at ohsu.edu  Tue Dec  6 13:21:39 2011
From: kellert at ohsu.edu (Tom Keller)
Date: Tue, 6 Dec 2011 10:21:39 -0800
Subject: [Bioperl-l] Bioperl-l Digest, Vol 104, Issue 3
In-Reply-To: <mailman.3.1322931604.28955.bioperl-l@lists.open-bio.org>
References: <mailman.3.1322931604.28955.bioperl-l@lists.open-bio.org>
Message-ID: <B68BC6F2-8C57-4749-902D-3232B0DA6113@ohsu.edu>

I'd start by looking for the section "Searching for genes in genomic DNA" in the HOWTO:Beginners - BioPerl website.

Thomas (Tom) Keller, PhD
kellert at ohsu.edu
503.494.2442
6588 R Jones Hall (BSc/CROET)
MMI DNA Services
Member of OHSU Shared Resources

On Dec 3, 2011, at 9:00 AM, <bioperl-l-request at lists.open-bio.org> <bioperl-l-request at lists.open-bio.org> wrote:

> Today's Topics:
> 
>   1.  List of genes from genomic intervals (Claudio Scuoppo)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Fri, 2 Dec 2011 17:50:28 -0500
> From: Claudio Scuoppo <scuoppo at gmail.com>
> Subject: [Bioperl-l] List of genes from genomic intervals
> To: bioperl-l at lists.open-bio.org
> Message-ID:
> 	<CAEz0Wfv_yj6g5rZEbmj+UhOkJX8ryzObEO7N6rqvLRMV6Yd_Pg at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> Hi,
> 
> I am new to BioPerl. I was wondering what's the best strategy to get
> the genes contained in a series of human genomic intervals.
> Basically, I have a table with:
> 
> Chromosome Start End
> 
> Which module should I be looking at?
> Thanks,
> Claudio
> 
> 


From wenbinmei at gmail.com  Tue Dec  6 17:54:51 2011
From: wenbinmei at gmail.com (wenbin mei)
Date: Tue, 6 Dec 2011 17:54:51 -0500
Subject: [Bioperl-l] revcom the multiple sequence alignment
In-Reply-To: <CAAC48B9-371B-4446-897B-B1D8968CA7EA@gmail.com>
References: <CAHdrE2RfvOCy_3aUP_nD9Gp=VSmqLh52wh0vQmgKmaebVgyL-w@mail.gmail.com>
	<CAAC48B9-371B-4446-897B-B1D8968CA7EA@gmail.com>
Message-ID: <CAHdrE2S33zcuchbSXuH2NwM5gM-=BnxVx9xA13ye18gPi2Mtcg@mail.gmail.com>

Figured it out! Thanks for the help.

-best,
wenbin


On Tue, Dec 6, 2011 at 12:40 PM, Jason Stajich <jason.stajich at gmail.com>wrote:

> I think this would work to update it in place though I haven't tried it
> myself
>
> for my $seq ( $aln->each_seq ) {
>  $seq->seq( $seq->revcom->seq );
> }
> $out->write_aln($aln);
>
> This may also work - not entirely sure if there is any extra work done on
> the meta data (start/end) of the Seq object when this is done.  You may
> want to flip start/end for the sequences (the seqs are Bio::LocatableSeq
> objects) explicitly if not. Or you may not care about those data and can
> ignore.
>
>   $seq = $seq->revcom
>
> Jason
> On Dec 6, 2011, at 8:09 AM, wenbin mei wrote:
>
> > Hi,
> >
> > I have a question about reverse-complementing a multiple sequence
> > alignment. One way I can do it is to convert the alignment to FASTA and
> > revcom the individual sequences. I wonder whether there is an easy way to
> > revcom the multiple sequence alignment as a whole.  Thank you for your help.
> >
> > -best,
> > wenbin
>
>


-- 
Wenbin Mei Ph.D. Student
Dr. Brad Barbazuk's Lab
Department of Biology
University of Florida
509-899-3067
wmei at ufl.edu <wmei at ufl.edu>


From momin.amin at gmail.com  Tue Dec  6 12:37:16 2011
From: momin.amin at gmail.com (Amin Momin)
Date: Tue, 6 Dec 2011 11:37:16 -0600
Subject: [Bioperl-l] SimpleAlign and consensus_string
In-Reply-To: <4A59F912-BEE6-48D8-91EC-EAACB293A574@gmail.com>
References: <a08b3694-1cc6-4f5d-ad80-1306c5b777f7@v5g2000yqn.googlegroups.com>
	<4A59F912-BEE6-48D8-91EC-EAACB293A574@gmail.com>
Message-ID: <CAA0DaRhm+jsPpFFYR6q2xj0YOkYy3Enh8rrRD-YQJ26z_U+Fkw@mail.gmail.com>

Thanks Jason


On Mon, Dec 5, 2011 at 5:58 PM, Jason Stajich <jason.stajich at gmail.com> wrote:
> There are several methods that do related things.
>
> Below is the documentation for the method - it allows a more strict calling besides majority rule voting for the position - if you set the cutoff at 50% then the called AA or NT has to appear at least 50% of the time. There is also an IUPAC consensus string which is useful for DNA consensus generating patterns.
>
> If you want a summary as to whether or not there are only conservative amino acid changes in the column use match_line method which generates the similar string that CLUSTALW reports as a summary string for each column.
>
> =head2 consensus_string
>
>  Title     : consensus_string
>  Usage     : $str = $ali->consensus_string($threshold_percent)
>  Function  : Makes a strict consensus
>  Returns   : Consensus string
>  Argument  : Optional threshold ranging from 0 to 100.
>              The consensus residue has to appear in at least threshold %
>              of the sequences at a given location, otherwise a '?'
>              character will be placed at that location.
>              (Default value = 0%)
>
> =cut
>
> On Dec 5, 2011, at 3:00 PM, Amin Momin wrote:
>
>> Hi ,
>>
>> I am generating a consensus sequence by aligning two protein homologs
>> using Bio::Tools::Run::Alignment::TCoffee. However, I am unable to
>> understand the criteria that the consensus_string() method of SimpleAlign
>> uses to determine the consensus at positions with dissimilar amino acids
>> or nucleotides. Also, how would the % cutoffs provided to
>> consensus_string() affect the outcome?
>>
>>
>> Thanks,
>> Amin
>


From sunwukong at potc.net  Wed Dec  7 14:05:20 2011
From: sunwukong at potc.net (sunwukong)
Date: Wed, 07 Dec 2011 11:05:20 -0800
Subject: [Bioperl-l] DNA Sequencing two questions
Message-ID: <4EDFB8F0.8080001@potc.net>

I am not a medical professional but I have two DNA related questions.

A year or so ago I realized that if the standard building blocks of life
are the nucleotides G, A, T and C, then they could be represented as a
base 4 number system (e.g., 0, 1, 2 and 3).  Then any life form could be
represented by a number (it would be very long).  So I set out on a
quest to do this with a small life form.  For fun I chose the Spanish
Flu, which I believe I found on an NIH site.  Then I realized that there
was no standard, and I did not know whether the number should be built
with the most significant digit on the left or on the right.

1.  Is there a standard method for representing the GATC molecules as
numbers, for example:
g = 0
a = 1
t = 2
c = 3

2.  Is the sequence read left to right or right to left?

Note:  It may be biologically significant if the right values are
assigned to the letters GATC; there could be a pattern somewhere that
holds significant information.  One idea might be to look at DNA
sequences in bases other than 4 to see if something jumps out.
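
The mapping proposed above can be sketched in Perl as follows. This is only an illustration under stated assumptions: the digit assignment is the arbitrary one from question 1, the sequence is read left to right as written (sequences are conventionally written 5' to 3'), and the leftmost base is treated as the most significant digit. The name dna_to_number is made up; Math::BigInt is used because real genomes overflow native integers.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Math::BigInt;

# Arbitrary digit assignment taken from the question, not a standard.
my %digit = (g => 0, a => 1, t => 2, c => 3);

# Interpret a DNA string as a base-4 number, leftmost base most significant.
sub dna_to_number {
    my ($dna) = @_;
    my $n = Math::BigInt->new(0);
    for my $base (split //, lc $dna) {
        die "unexpected base '$base'\n" unless exists $digit{$base};
        $n = $n * 4 + $digit{$base};
    }
    return $n;
}

my $number = dna_to_number('GATC');
print "GATC -> $number\n";    # 0*64 + 1*16 + 2*4 + 3 = 27
```

One caveat with this scheme: a leading G is a leading zero, so GAT and AT give the same number unless the sequence length is stored alongside it.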

http://www.insectscience.org/2.10/ref/fig5a.gif

VR
Pat Kirol
509 442-2214


From Russell.Smithies at agresearch.co.nz  Wed Dec  7 16:59:18 2011
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Thu, 8 Dec 2011 10:59:18 +1300
Subject: [Bioperl-l] DNA Sequencing two questions
In-Reply-To: <4EDFB8F0.8080001@potc.net>
References: <4EDFB8F0.8080001@potc.net>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>

I did something similar a few years ago (after watching the movie "Contact" I think) and encoded codons as RGB values and drew an image of a genome. Looked much like random noise but I might try it again and draw as a space filling curve.
I guess if you're looking for "hidden messages", why restrict yourself to 2 dimensions?  Perhaps something pops out as a single-image stereogram eg. http://upload.wikimedia.org/wikipedia/commons/8/8f/Stereogram_Tut_Random_Dot_Shark.png
Perhaps it's a 3D "object" represented by slices drawn in a series of 2D planes?

But you need a bit of biological background as there will be patterns simply because of the way genes "work" and are laid out in chromosomes. You need to remember that DNA is effectively a 2D representation of a 3D protein structure and there is already much hidden information we know we don't understand - a "simple" task like how proteins fold is barely understood and why some become prions is still a mystery.

But don't let this stop you uncovering the great secret hidden in our genes :-)

On a similar note, have a look at http://medgadget.com/2011/10/send-your-secret-message-hidden-in-bacteria.html

--Russell

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of sunwukong
> Sent: Thursday, 8 December 2011 8:05 a.m.
> To: bioperl-l at bioperl.org
> Subject: [Bioperl-l] DNA Sequencing two questions
> 
> I am not a medical professional but I have two DNA related questions.
> 
> A year or so ago I realized that if the standard building blocks of life
> are the nucleotides G, A, T and C, then they could be represented as a
> base 4 number system (e.g., 0, 1, 2 and 3).  Then any life form could be
> represented by a number (it would be very long).  So I set out on a
> quest to do this with a small life form.  For fun I chose the Spanish
> Flu, which I believe I found on an NIH site.  Then I realized that there
> was no standard, and I did not know whether the number should be built
> with the most significant digit on the left or on the right.
> 
> 1.  Is there a standard method for representing the GATC molecules as
> numbers, for example g = 0, a = 1, t = 2, c = 3?
> 
> 2.  Is the sequence read left to right or right to left?
> 
> note:  It may be biologically significant if the right values are assigned to the
> letters GATC, there could be a pattern somewhere that holds significant
> information.  One idea might be to look at DNA sequences in bases other
> than 4 to see if something jumps out.
> 
> http://www.insectscience.org/2.10/ref/fig5a.gif
> 
> VR
> Pat Kirol
> 509 442-2214
=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================


From jason.stajich at gmail.com  Wed Dec  7 17:53:10 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Wed, 7 Dec 2011 14:53:10 -0800
Subject: [Bioperl-l] DNA Sequencing two questions
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>
References: <4EDFB8F0.8080001@potc.net>
	<18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>
Message-ID: <9BDA2BFA-264F-45B2-9234-A6BC9402FBA8@gmail.com>

For other fun picture games -- 

You can look at patterns of motifs/words in a chaos game representation of genomes.
http://mbe.oxfordjournals.org/content/16/10/1391.long
http://mbe.oxfordjournals.org/content/20/6/901.long


On Dec 7, 2011, at 1:59 PM, Smithies, Russell wrote:

> I did something similar a few years ago (after watching the movie "Contact" I think) and encoded codons as RGB values and drew an image of a genome. Looked much like random noise but I might try it again and draw as a space filling curve.
> I guess if you're looking for "hidden messages", why restrict yourself to 2 dimensions?  Perhaps something pops out as a single-image stereogram eg. http://upload.wikimedia.org/wikipedia/commons/8/8f/Stereogram_Tut_Random_Dot_Shark.png
> Perhaps it's a 3D "object" represented by slices drawn in a series of 2D planes?
> 
> But you need a bit of biological background as there will be patterns simply because of the way genes "work" and are laid out in chromosomes. You need to remember that DNA is effectively a 2D representation of a 3D protein structure and there is already much hidden information we know we don't understand - a "simple" task like how proteins fold is barely understood and why some become prions is still a mystery.
> 
> But don't let this stop you uncovering the great secret hidden in our genes :-)
> 
> On a similar note, have a look at http://medgadget.com/2011/10/send-your-secret-message-hidden-in-bacteria.html
> 
> --Russell
> 
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>> bounces at lists.open-bio.org] On Behalf Of sunwukong
>> Sent: Thursday, 8 December 2011 8:05 a.m.
>> To: bioperl-l at bioperl.org
>> Subject: [Bioperl-l] DNA Sequencing two questions
>> 
>> I am not a medical professional but I have two DNA related questions.
>> 
>> A year or so ago I realized that if the standard building blocks of life
>> are the nucleotides G, A, T and C, then they could be represented as a
>> base 4 number system (e.g., 0, 1, 2 and 3).  Then any life form could be
>> represented by a number (it would be very long).  So I set out on a
>> quest to do this with a small life form.  For fun I chose the Spanish
>> Flu, which I believe I found on an NIH site.  Then I realized that there
>> was no standard, and I did not know whether the number should be built
>> with the most significant digit on the left or on the right.
>> 
>> 1.  Is there a standard method for representing the GATC molecules as
>> numbers, for example g = 0, a = 1, t = 2, c = 3?
>> 
>> 2.  Is the sequence read left to right or right to left?
>> 
>> note:  It may be biologically significant if the right values are assigned to the
>> letters GATC, there could be a pattern somewhere that holds significant
>> information.  One idea might be to look at DNA sequences in bases other
>> than 4 to see if something jumps out.
>> 
>> http://www.insectscience.org/2.10/ref/fig5a.gif
>> 
>> VR
>> Pat Kirol
>> 509 442-2214


From Russell.Smithies at agresearch.co.nz  Wed Dec  7 19:29:47 2011
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Thu, 8 Dec 2011 13:29:47 +1300
Subject: [Bioperl-l] DNA Sequencing two questions
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>
References: <4EDFB8F0.8080001@potc.net>
	<18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF340186CF245@exchsth.agresearch.co.nz>

I tried again and came up with this:
http://www.bioperl.org/w/images/7/7a/Autostereogram.png
If you look carefully, you can see the answer to life, the universe, and everything!!

--Russell

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Smithies, Russell
> Sent: Thursday, 8 December 2011 10:59 a.m.
> To: 'sunwukong'; bioperl-l at bioperl.org
> Subject: Re: [Bioperl-l] DNA Sequencing two questions
> 
> I did something similar a few years ago (after watching the movie "Contact" I
> think) and encoded codons as RGB values and drew an image of a genome.
> Looked much like random noise but I might try it again and draw as a space
> filling curve.
> I guess if you're looking for "hidden messages", why restrict yourself to 2
> dimensions?  Perhaps something pops out as a single-image stereogram eg.
> http://upload.wikimedia.org/wikipedia/commons/8/8f/Stereogram_Tut_Random_Dot_Shark.png
> Perhaps it's a 3D "object" represented by slices drawn in a series of 2D
> planes?
> 
> But you need a bit of biological background as there will be patterns simply
> because of the way genes "work" and are laid out in chromosomes. You
> need to remember that DNA is effectively a 2D representation of a 3D
> protein structure and there is already much hidden information we know we
> don't understand - a "simple" task like how proteins fold is barely understood
> and why some become prions is still a mystery.
> 
> But don't let this stop you uncovering the great secret hidden in our genes :-)
> 
> On a similar note, have a look at http://medgadget.com/2011/10/send-your-secret-message-hidden-in-bacteria.html
> 
> --Russell
> 
> > -----Original Message-----
> > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> > bounces at lists.open-bio.org] On Behalf Of sunwukong
> > Sent: Thursday, 8 December 2011 8:05 a.m.
> > To: bioperl-l at bioperl.org
> > Subject: [Bioperl-l] DNA Sequencing two questions
> >
> > I am not a medical professional but I have two DNA related questions.
> >
> > A year or so ago I realized that if the standard building blocks of
> > life were the amino acids GATC then they could be represented as a
> > base 4 number system (e.g., 0,1,2 and 3).  Then any life form could be
> > represented by a number (it would be very long).  So I set out on a
> > quest to do this with a small life form.  For fun I chose the Spanish
> > Flu which I believe I found on an NIH site.  Then I set out and
> > realized that there was no standard.  And I did not know if the number
> > would be built with the most significant digit on the left or right.
> >
> > 1.  Is there a standard method for representing the ATCD molecules as
> > numbers g = 0 a = 1 t  = 2 c = 3
> >
> > 2. is the sequence read left to right or right to left?
> >
> > note:  It may be biologically significant if the right values are
> > assigned to the letters GATC, there could be a pattern somewhere that
> > holds significant information.  One idea might be to look at DNA
> > sequences in bases other than 4 to see if something jumps out.
> >
> > http://www.insectscience.org/2.10/ref/fig5a.gif
> >
> > VR
> > Pat Kirol
> > 509 442-2214
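
On the base-4 idea in the quoted question, here is a minimal Perl sketch. It assumes the digit assignment proposed in the question (g = 0, a = 1, t = 2, c = 3), which is an arbitrary choice rather than a standard, and it assumes the leftmost base is the most significant digit. Sequences are conventionally written 5' to 3', left to right, so the sketch reads them in that order. The sub names are made up for illustration.

```perl
use strict;
use warnings;
use Math::BigInt;

# Digit assignment taken from the question; it is arbitrary, not a standard.
my %DIGIT = (G => 0, A => 1, T => 2, C => 3);

# Translate a nucleotide string into a string of base-4 digits.
sub dna_to_digits {
    my ($dna) = @_;
    my @digits;
    for my $base (split //, uc $dna) {
        die "unexpected base '$base'\n" unless exists $DIGIT{$base};
        push @digits, $DIGIT{$base};
    }
    return join '', @digits;
}

# Interpret the digits as one base-4 integer, leftmost digit most significant.
# Math::BigInt is used because real sequences overflow native integers quickly.
sub dna_to_number {
    my ($dna) = @_;
    my $n = Math::BigInt->new(0);
    for my $digit (split //, dna_to_digits($dna)) {
        $n->bmul(4)->badd($digit);
    }
    return $n;
}

print dna_to_digits('GATC'), "\n";          # prints 0123
print dna_to_number('GATC')->bstr, "\n";    # prints 27
```

Note that a leading G (digit 0) does not change the number, so the digit string keeps information that the integer loses.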


From lumos.lumos.lumos at gmail.com  Fri Dec  9 11:47:36 2011
From: lumos.lumos.lumos at gmail.com (lumos lumos)
Date: Fri, 9 Dec 2011 08:47:36 -0800
Subject: [Bioperl-l] Mouse->Human homologues ?
Message-ID: <CAJbewumvg+iQ787GYQTrL8bLRqPGb6T3jNzw2GZJQAVwZU+aAA@mail.gmail.com>

Hello,

Is there a way to get human homologues for a mouse gene list where I get
all human genes(symbols) as text output ?

Thank you
LM


From cjfields at illinois.edu  Fri Dec  9 12:17:20 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Fri, 9 Dec 2011 17:17:20 +0000
Subject: [Bioperl-l] Mouse->Human homologues ?
In-Reply-To: <CAJbewumvg+iQ787GYQTrL8bLRqPGb6T3jNzw2GZJQAVwZU+aAA@mail.gmail.com>
References: <CAJbewumvg+iQ787GYQTrL8bLRqPGb6T3jNzw2GZJQAVwZU+aAA@mail.gmail.com>
Message-ID: <C19C1032-5FE7-400A-AEB5-152EEFC4C435@illinois.edu>

There are lots of databases that have this capability (ensembl, orthodb, homologene, oma, to name only a few).  Have you tried a simple search for this, or did you want expert opinion on the matter?  

chris

PS - Just to note, there is a lot of controversy swirling about re: the ortholog conjecture and some recently published papers calling it into question using human-mouse data, worth a look if you're trotting this path to know the current situation.  If you have access to F1000, see the following (paper itself is open :)

Faculty of 1000 evaluations, dissents and comments for [Nehrt NL et al. Testing the ortholog conjecture with comparative functional genomic data from mammals. PLoS Comput Biol. 2011 Jun; 7(6):e1002073; doi: 10.1371/journal.pcbi.1002073]. Faculty of 1000, 31 Aug 2011. F1000.com/12462957

On Dec 9, 2011, at 10:47 AM, lumos lumos wrote:

> Hello,
> 
> Is there a way to get human homologues for a mouse gene list where I get
> all human genes(symbols) as text output ?
> 
> Thank you
> LM


From lumos.lumos.lumos at gmail.com  Fri Dec  9 12:29:24 2011
From: lumos.lumos.lumos at gmail.com (lumos lumos)
Date: Fri, 9 Dec 2011 09:29:24 -0800
Subject: [Bioperl-l] Mouse->Human homologues ?
In-Reply-To: <C19C1032-5FE7-400A-AEB5-152EEFC4C435@illinois.edu>
References: <CAJbewumvg+iQ787GYQTrL8bLRqPGb6T3jNzw2GZJQAVwZU+aAA@mail.gmail.com>
	<C19C1032-5FE7-400A-AEB5-152EEFC4C435@illinois.edu>
Message-ID: <CAJbewukt_xCCpQaWTsvqi2z1NkbsTZRG6xXJUcZhcK5jdAZhWQ@mail.gmail.com>

Hi Chris,

Thanks for your reply. I wanted to know if there is anyway you can do it
via script/automatically in perl for a list of mouse genes whose human
homologues I require.

LM

On Fri, Dec 9, 2011 at 9:17 AM, Fields, Christopher J <cjfields at illinois.edu> wrote:

> There are lots of databases that have this capability (ensembl, orthodb,
> homologene, oma, to name only a few).  Have you tried a simple search for
> this, or did you want expert opinion on the matter?
>
> chris
>
> PS - Just to note, there is a lot of controversy swirling about re: the
> ortholog conjecture and some recently published papers calling it into
> question using human-mouse data, worth a look if you're trotting this path
> to know the current situation.  If you have access to F1000, see the
> following (paper itself is open :)
>
> Faculty of 1000 evaluations, dissents and comments for [Nehrt NL et al.
> Testing the ortholog conjecture with comparative functional genomic data
> from mammals. PLoS Comput Biol. 2011 Jun; 7(6):e1002073; doi:
> 10.1371/journal.pcbi.1002073]. Faculty of 1000, 31 Aug 2011.
> F1000.com/12462957
>
> On Dec 9, 2011, at 10:47 AM, lumos lumos wrote:
>
> > Hello,
> >
> > Is there a way to get human homologues for a mouse gene list where I get
> > all human genes(symbols) as text output ?
> >
> > Thank you
> > LM
>
>
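
One way to script the mouse-to-human lookup, sketched under assumptions: it works from a local, tab-delimited HomoloGene-style table in which each line holds a homology group ID, an NCBI taxonomy ID, a gene ID and a gene symbol, in that order. That column layout is an assumption to check against whatever file you actually download. The sub name and the group IDs in the example are invented. The taxonomy IDs 10090 (mouse) and 9606 (human) are the NCBI ones.

```perl
use strict;
use warnings;

# Assumed input: tab-separated columns
#   group ID, taxonomy ID, gene ID, gene symbol (further columns are ignored).
my ($MOUSE, $HUMAN) = ('10090', '9606');    # NCBI taxonomy IDs

# Takes two array refs (table lines, wanted mouse symbols) and returns a
# hash ref mapping each mouse symbol to an array ref of human symbols.
sub mouse_to_human {
    my ($lines, $wanted) = @_;
    my (%group_of, %human_in);
    for my $line (@$lines) {
        chomp(my $copy = $line);
        my ($group, $taxon, undef, $symbol) = split /\t/, $copy;
        next unless defined $symbol;
        if    ($taxon eq $MOUSE) { $group_of{$symbol} = $group }
        elsif ($taxon eq $HUMAN) { push @{ $human_in{$group} }, $symbol }
    }
    my %result;
    for my $gene (@$wanted) {
        my $group = $group_of{$gene};
        $result{$gene} = defined $group ? ($human_in{$group} || []) : [];
    }
    return \%result;
}

# Usage: perl script.pl table.txt mouse_genes.txt
if (@ARGV == 2) {
    open my $table, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    open my $genes, '<', $ARGV[1] or die "Cannot open $ARGV[1]: $!";
    chomp(my @wanted = <$genes>);
    my $map = mouse_to_human([<$table>], \@wanted);
    print "$_\t", join(',', @{ $map->{$_} }), "\n" for @wanted;
}
```

Mouse symbols with no human counterpart in the table come out with an empty second column, which makes them easy to spot.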


From lumos.lumos.lumos at gmail.com  Wed Dec  7 23:47:19 2011
From: lumos.lumos.lumos at gmail.com (lumos lumos)
Date: Wed, 7 Dec 2011 20:47:19 -0800
Subject: [Bioperl-l] Perl parsing
Message-ID: <CAJbewunJa-_FVfjiOjZqZrv_h9RcZuCNpjGviWDM6O0mXjQQ=A@mail.gmail.com>

Hello,

I have a text file(tab-delim) with some gene names as shown below.

*BRCA1: breast cancer 1, early onset

TNF: tumor necrosis factor

OMG: oligodendrocyte myelin glycoprotein*

I would like to get the list of gene name BRCA1,TNF,OMG that is before the
colon(:) .
How do I parse in perl this text file with this list of genes?

Thanks in advance.
LM


From b.m.forde at umail.ucc.ie  Fri Dec  9 11:52:56 2011
From: b.m.forde at umail.ucc.ie (BForde)
Date: Fri, 9 Dec 2011 08:52:56 -0800 (PST)
Subject: [Bioperl-l]  Genbank files
Message-ID: <32941955.post@talk.nabble.com>


Hello all,

I am new to BioPerl, so I apologise if this is a stupid question.

For CDS features I wish to add additional qualifiers, e.g. /colour and /note
qualifiers. I have looked at the BioPerl wiki but am still unsure how to
do this.

regards

Brian
-- 
View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32941955.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From jboddu at illinois.edu  Fri Dec  9 14:59:39 2011
From: jboddu at illinois.edu (Boddu, Jayanand)
Date: Fri, 9 Dec 2011 19:59:39 +0000
Subject: [Bioperl-l] Batch processing of Data
Message-ID: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>

Hi Anyone:
Please let me know if the following is practical with Perl.
My data output can be described as follows.

1.       Hundreds of samples are run.

2.       A batch output sends data from each sample to its own "folder". Output is in the form of a few text files, spreadsheets and PDF files.

3.       One of the spreadsheets has the data of most interest.

4.       This means I end up having hundreds of folders.

5.       The spreadsheet with the data has multiple worksheets, of which a couple have the interesting data to be processed (please find attached a spreadsheet output showing how the data is organized; the worksheets of interest are named "Compound" and "Peak", and the yellow-highlighted columns in each worksheet hold the data to be processed).
OK, that's a long description.
Now, is it practical to write a Perl (or any other) script to:

1.       Enter each folder.

2.       Look for the spreadsheet of interest.

3.       Look for worksheets named "Compound" and "Peak".

4.       Look for the specific columns of interest.

5.       Copy and paste the columns of interest into a new spreadsheet/text file, with the data from each folder next to each other.

This final spreadsheet will pass through a bunch of other calculations.

I apologize for this long and painful description.
However, it would be great if this can be done.
Thanks
Jay
-------------- next part --------------
A non-text attachment was scrubbed...
Name: REPORT01.xls
Type: application/vnd.ms-excel
Size: 93696 bytes
Desc: REPORT01.xls
URL: <http://lists.open-bio.org/pipermail/bioperl-l/attachments/20111209/0528887b/attachment-0003.xls>

From cjfields at illinois.edu  Fri Dec  9 15:37:48 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Fri, 9 Dec 2011 20:37:48 +0000
Subject: [Bioperl-l] Perl parsing
In-Reply-To: <CAJbewunJa-_FVfjiOjZqZrv_h9RcZuCNpjGviWDM6O0mXjQQ=A@mail.gmail.com>
References: <CAJbewunJa-_FVfjiOjZqZrv_h9RcZuCNpjGviWDM6O0mXjQQ=A@mail.gmail.com>
Message-ID: <E70D9F7E-7419-47AE-A0F9-3FB2B7B25907@illinois.edu>

On Dec 7, 2011, at 10:47 PM, lumos lumos wrote:

> Hello,
> 
> I have a text file(tab-delim) with some gene names as shown below.
> 
> *BRCA1: breast cancer 1, early onset
> 
> TNF: tumor necrosis factor
> 
> OMG: oligodendrocyte myelin glycoprotein*
> 
> I would like to get the list of gene name BRCA1,TNF,OMG that is before the
> colon(:) .
> How do I parse in perl this text file with this list of genes?

'Very carefully?'

Okay, I'll try to refrain from further sarcasm, but I'm confused, what does this have to do with BioPerl (*the toolkit*) specifically?  That is what this mailing list is for.  

Just to note, this is a very common perl task. The answer is attainable by searching for it (not to mention taking the time to learn basic perl).  For instance:

   http://stackoverflow.com/questions/4500407/in-perl-how-can-i-correctly-parse-tab-space-delimited-files-with-quoted-strings

One of the many links found by simply using Google:

   http://lmgtfy.com/?q=perl+parse+tab+file

I'll leave the regex munging to you.  

(okay, I failed at refraining from sarcasm, ah well it's friday).

chris


> Thanks in advance.
> LM


From jason.stajich at gmail.com  Fri Dec  9 16:18:38 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Fri, 9 Dec 2011 13:18:38 -0800
Subject: [Bioperl-l] Genbank files
In-Reply-To: <32941955.post@talk.nabble.com>
References: <32941955.post@talk.nabble.com>
Message-ID: <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>

$feature->add_tag_value('color','blue');

On Dec 9, 2011, at 8:52 AM, BForde wrote:

> 
> Hello all,
> 
> I am new to Bioperl so I apologise if this is stupid question. 
> 
> For CDS features I which to add additional qualifiers e.g. /colour and /note
> qualifiers. I have looked at the BioPerl wiki but am still unsure as how to
> do this?
> 
> regards
> 
> Brian

Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org


From bosborne11 at verizon.net  Fri Dec  9 15:31:15 2011
From: bosborne11 at verizon.net (Brian Osborne)
Date: Fri, 09 Dec 2011 15:31:15 -0500
Subject: [Bioperl-l] Genbank files
In-Reply-To: <32941955.post@talk.nabble.com>
References: <32941955.post@talk.nabble.com>
Message-ID: <3AB0716B-EC07-4BD6-9FC8-0C47A29FC0BA@verizon.net>

Brian,

Reasonable question. Start here:

http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

If you've never used Bioperl then:

http://www.bioperl.org/wiki/HOWTO:Beginners

Brian


On Dec 9, 2011, at 11:52 AM, BForde wrote:

> 
> Hello all,
> 
> I am new to Bioperl so I apologise if this is stupid question. 
> 
> For CDS features I which to add additional qualifiers e.g. /colour and /note
> qualifiers. I have looked at the BioPerl wiki but am still unsure as how to
> do this?
> 
> regards
> 
> Brian


From asjo at koldfront.dk  Fri Dec  9 17:25:00 2011
From: asjo at koldfront.dk (Adam Sjøgren)
Date: Fri, 09 Dec 2011 23:25:00 +0100
Subject: [Bioperl-l] Batch processing of Data
References: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>
Message-ID: <871usdpemb.fsf@topper.koldfront.dk>

On Fri, 9 Dec 2011 19:59:39 +0000, Boddu, wrote:

> Please let me know if the following is practical with PERL.

It might very well be, yes.

Modules you might be interested in include Spreadsheet::ParseExcel,
Spreadsheet::XLSX, Spreadsheet::WriteExcel and Excel::Writer::XLSX [1].

A big help in finding interesting CPAN modules is the search engine on
https://metacpan.org/

Depending on your platform and preference using find(1) might also be
helpful to traverse the folders, rather than doing so in Perl.

Note that none of this has anything to do with BioPerl as such, though,
and you'll need to do some actual programming to get the job done.


  Best regards,

    Adam


[1] http://blogs.perl.org/users/john_mcnamara/2011/10/spreadsheetwriteexcel-is-dead-long-live-excelwriterxlsx.html

-- 
 "Angels can fly because they take themselves lightly."       Adam Sjøgren
                                                         asjo at koldfront.dk
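
Once the columns of interest have been read out of each workbook (for example with one of the spreadsheet modules named in this message, which is not shown here), placing the data from each folder next to each other is plain Perl. A minimal sketch, assuming every column arrives as an array ref whose first element is a header; the sub name and the sample data are hypothetical:

```perl
use strict;
use warnings;

# Turn a list of columns (array refs) into tab-delimited rows, so that
# column 1 of every row comes from the first sample, column 2 from the
# second, and so on. Short columns are padded with empty cells.
sub side_by_side {
    my @columns = @_;
    my $rows = 0;
    for my $column (@columns) {
        $rows = @$column if @$column > $rows;
    }
    my @lines;
    for my $i (0 .. $rows - 1) {
        push @lines,
            join "\t", map { defined $_->[$i] ? $_->[$i] : '' } @columns;
    }
    return @lines;
}

print "$_\n" for side_by_side(
    [ 'sample01', 1.5, 2.5, 3.5 ],
    [ 'sample02', 4.0, 5.0 ],
);
```

The resulting text opens directly in a spreadsheet program, or it can be passed to a writer module instead of being printed.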


From David.Messina at sbc.su.se  Fri Dec  9 17:30:23 2011
From: David.Messina at sbc.su.se (Dave Messina)
Date: Fri, 9 Dec 2011 23:30:23 +0100
Subject: [Bioperl-l] Batch processing of Data
In-Reply-To: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>
References: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>
Message-ID: <CAM3TQQWMnqwShBhYQWH9iZqtDphVFYLYtuVWEMxxfVY1OqSbhg@mail.gmail.com>

Yes, it can be done. However, it has nothing to do with this mailing list.

Steps 1 and 2 are basic Perl.
For steps 3 through 5, try googling "perl parse excel".


Dave


On Fri, Dec 9, 2011 at 20:59, Boddu, Jayanand <jboddu at illinois.edu> wrote:

> Hi Anyone:
> Please let me know if the following is practical with PERL.
> My data output can be described as following.
>
> 1.       Hundreds of samples are run.
>
> 2.       A batch output sends data from each sample to its own "folder".
> Output is in the form of few text files, spreadsheets and PDF files.
>
> 3.       One of the spreadsheet has the data of most interest.
>
> 4.       This means I end up having hundreds of folders.
>
> 5.       The spreadsheet with the data has multiple worksheets out of
> which a couple have the interesting data to be processed (Please find
> attached a spreadsheet output in which the data is organized and the
> worksheets of my interest are named as "Compound" and "Peak". Yellow
> high-lighted columns in each worksheet has the data to be processed).
> OK. That's long description.
> NOW. Is it practical to write a PERL/or any script to;
>
> 1.       Enter each folder.
>
> 2.       Look for the spreadsheet of interest.
>
> 3.       Look for worksheets named "Compound" and "Peak".
>
> 4.       Look for the specific columns of interest.
>
> 5.       Copy paste the columns of interest into a new spreadsheet/text
> file with data from each folder next to each other.
>
> This final spreadsheet will pass through a bunch of other calculations.
>
> I apologize for this long and painful description.
> However, it would be great if this can be done.
> Thanks
> Jay
>
>
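
Steps 1 and 2 of the quoted question (enter each folder, look for the spreadsheet of interest) can be sketched with the core File::Find module. This is only a sketch under assumptions: the reports are taken to be recognisable by an .xls extension, as with the REPORT01.xls attachment, and the sub name is made up. Reading the worksheets themselves needs a spreadsheet module and is not shown.

```perl
use strict;
use warnings;
use File::Find;

# Return the sorted paths of all plain files below $top whose file name
# matches $pattern.
sub find_reports {
    my ($top, $pattern) = @_;
    my @found;
    find(
        sub {
            # Inside the callback $_ is the bare file name and
            # $File::Find::name is the path including the folder.
            push @found, $File::Find::name if -f $_ && $_ =~ $pattern;
        },
        $top,
    );
    @found = sort @found;
    return @found;
}

# Usage: perl script.pl /path/to/top/folder
if (@ARGV) {
    print "$_\n" for find_reports($ARGV[0], qr/\.xls$/i);
}
```

Each path returned can then be handed to whatever code opens the workbook.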


From lsbrath at gmail.com  Sat Dec 10 16:39:44 2011
From: lsbrath at gmail.com (Mgavi Brathwaite)
Date: Sat, 10 Dec 2011 16:39:44 -0500
Subject: [Bioperl-l] Perl parsing
In-Reply-To: <E70D9F7E-7419-47AE-A0F9-3FB2B7B25907@illinois.edu>
References: <CAJbewunJa-_FVfjiOjZqZrv_h9RcZuCNpjGviWDM6O0mXjQQ=A@mail.gmail.com>
	<E70D9F7E-7419-47AE-A0F9-3FB2B7B25907@illinois.edu>
Message-ID: <CAJm=ba98HUgAB1kUG29_KA+ZvNWP_AsHoJQNPQ-_Fe=Pa7b74Q@mail.gmail.com>

Yes grasshopper you have to suffer a little bit. Learn Perl first, then
step up to BioPerl. Chris I feel you concerning the power of Regex, and the
sarcasm.

Lom

On Fri, Dec 9, 2011 at 3:37 PM, Fields, Christopher J <cjfields at illinois.edu> wrote:

> On Dec 7, 2011, at 10:47 PM, lumos lumos wrote:
>
> > Hello,
> >
> > I have a text file(tab-delim) with some gene names as shown below.
> >
> > *BRCA1: breast cancer 1, early onset
> >
> > TNF: tumor necrosis factor
> >
> > OMG: oligodendrocyte myelin glycoprotein*
> >
> > I would like to get the list of gene name BRCA1,TNF,OMG that is before the
> > colon(:) .
> > How do I parse in perl this text file with this list of genes?
>
> 'Very carefully?'
>
> Okay, I'll try to refrain from further sarcasm, but I'm confused, what
> does this have to do with BioPerl (*the toolkit*) specifically?  That is
> what this mailing list is for.
>
> Just to note, this is a very common perl task. The answer is attainable by
> searching for it (not to mention taking the time to learn basic perl).  For
> instance:
>
>
> http://stackoverflow.com/questions/4500407/in-perl-how-can-i-correctly-parse-tab-space-delimited-files-with-quoted-strings
>
> One of the many links found by simply using Google:
>
>   http://lmgtfy.com/?q=perl+parse+tab+file
>
> I'll leave the regex munging to you.
>
> (okay, I failed at refraining from sarcasm, ah well it's friday).
>
> chris
>
>
> > Thanks in advance.
> > LM
>
>
>
>


From pawan.mani2 at gmail.com  Mon Dec  5 17:00:09 2011
From: pawan.mani2 at gmail.com (pawan.mani2 at gmail.com)
Date: Tue, 6 Dec 2011 03:30:09 +0530
Subject: [Bioperl-l] bioperl in cygwin
Message-ID: <7B046189155547CEB23F0846D7577D1F@PAWANKUMARPC>

Hi
     I am trying to install BioPerl by giving the following commands in a Cygwin terminal:


    perl -MCPAN -e shell

then I type

    o conf prerequisites_policy follow
    o conf commit
    install Bundle::CPAN
    install Module::Build
    d /bioperl/

This gives a list of different versions.
I selected CJFIELDS/BioPerl-1.6.1.96:

    install CJFIELDS/BioPerl-1.6.1.96.tar.gz


but build.install was not OK.

Kindly let me know the steps for a complete BioPerl installation in Cygwin under Windows 7.

Thanks in advance.

with best regards,
Pawan


From cjfields at illinois.edu  Sun Dec 11 13:22:01 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sun, 11 Dec 2011 18:22:01 +0000
Subject: [Bioperl-l] bioperl in cygwin
In-Reply-To: <7B046189155547CEB23F0846D7577D1F@PAWANKUMARPC>
References: <7B046189155547CEB23F0846D7577D1F@PAWANKUMARPC>
Message-ID: <B674A464-E650-4CBF-B2CE-2100AB0B29B9@illinois.edu>

Pawan,

Hard to say what the problem is w/o supplying warnings/errors.  Prior to doing this, you should try installing BioPerl-1.6.901 (the latest CPAN release).  You can try a direct installation of the distribution, but the easiest way to get the latest version is to try installing Bio::Perl.

(I'm not sure what BioPerl-1.6.1.96 is, but this seems wrong)

chris

On Dec 5, 2011, at 4:00 PM, <pawan.mani2 at gmail.com>
 <pawan.mani2 at gmail.com> wrote:

> Hi
>     I would like to after the givibg following commands in cgwin terminal:
> 
> 
> perl -MCPAN -e shell
> 
> then I type
> 
>    o conf prerequisites_policy follow
>    o conf commit
>    install Bundle::CPAN 
> install Module::Build 
> d /bioperl/ 
> then we  you get a list of different versions. 
> I selected CJFIELDS/BioPerl-1.6.1.96
> install CJFIELDS/BioPerl-1.6.1.96.tar.gz 
> 
> 
> but build.install was not ok.
> 
> Kindly allow me to know the step of complete bioprl installatin cygwin undre windows 7.
> 
> thanks in advanced.
> 
> with best regards,
> Pawan
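
When reporting an installation problem like this one, it can help to state which modules Perl can actually load and at what version. A minimal sketch using only core Perl; the sub name is made up, and the module names you pass in (for example Bio::Perl) are up to you:

```perl
use strict;
use warnings;

# Return the installed version of a module, 'unknown' if it loads but
# declares no version, or undef if it cannot be loaded at all.
sub installed_version {
    my ($module) = @_;
    (my $path = "$module.pm") =~ s{::}{/}g;
    my $version = eval { require $path; $module->VERSION };
    return undef if $@;
    return defined $version ? $version : 'unknown';
}

# Usage: perl script.pl Bio::Perl Bio::SeqIO Module::Build
for my $module (@ARGV) {
    my $version = installed_version($module);
    print "$module: ", (defined $version ? $version : 'not installed'), "\n";
}
```

Including that output, together with the full error text from the CPAN shell, gives the list something concrete to work with.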


From b.m.forde at umail.ucc.ie  Tue Dec 13 06:03:50 2011
From: b.m.forde at umail.ucc.ie (BForde)
Date: Tue, 13 Dec 2011 03:03:50 -0800 (PST)
Subject: [Bioperl-l] Genbank files
In-Reply-To: <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
References: <32941955.post@talk.nabble.com>
	<5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
Message-ID: <32965574.post@talk.nabble.com>


Thank you for the replies.

My script (below) reads in a list of locus_tags from a tab-delimited text
file, compares them to the locus_tags in a GenBank file and, where they are
equal, should add new features (qualifiers).
In the line
$feat->add_tag_value()
the variable $feat needs to be defined. In the BioPerl wiki this variable
appears to be defined by giving it coordinates etc. (creating a new feature).
I wish to add qualifiers to an existing CDS feature when the locus_tags are
identical. Is this possible?

use strict; 
use Bio::SeqIO; 

my @V; 
open (LIST1, 'list') ||die; 
while (<LIST1>){ 
    push @V, (split(/\t/, $_))[0]; 
} 
close(LIST1); 

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb"); 
my $seq_object = $seqio_object->next_seq; 

for my $feat_object ($seq_object->get_SeqFeatures){ 
    if ($feat_object->primary_tag eq "CDS"){ 
        if ($feat_object->has_tag('locus_tag')){ 
            for my $V3 ($feat_object->get_tag_values('locus_tag')){ 
                for my $V1 (@V) { 
                    if ($V1 eq $V3){ 
                        ADD NEW FEATURES 
                        
                    }     
                } 
            } 
        } 
    } 
} 
  
The script works down as far as the comparison point where locus_tags in the
genbankfile "Contig100.gb" are compared against a list of locus_tags from a
delimited txt file. 


regards 

Brian 

Jason Stajich-5 wrote:
> 
> $feature->add_tag_value('color','blue');
> 
> On Dec 9, 2011, at 8:52 AM, BForde wrote:
> 
>> 
>> Hello all,
>> 
>> I am new to Bioperl so I apologise if this is stupid question. 
>> 
>> For CDS features I which to add additional qualifiers e.g. /colour and
>> /note
>> qualifiers. I have looked at the BioPerl wiki but am still unsure as how
>> to
>> do this?
>> 
>> regards
>> 
>> Brian
> 
> Jason Stajich
> jason.stajich at gmail.com
> jason at bioperl.org

-- 
View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32965574.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From roy.chaudhuri at gmail.com  Tue Dec 13 06:52:05 2011
From: roy.chaudhuri at gmail.com (Roy Chaudhuri)
Date: Tue, 13 Dec 2011 11:52:05 +0000
Subject: [Bioperl-l] Genbank files
In-Reply-To: <32965574.post@talk.nabble.com>
References: <32941955.post@talk.nabble.com>
	<5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
	<32965574.post@talk.nabble.com>
Message-ID: <4EE73C65.1080101@gmail.com>

Hi Brian,

Just to check I have understood you, you want to read through a genbank 
file and add additional tags to features which are listed in a 
tab-delimited file of locus tags?

Your code is on the right lines, but it would be much more efficient to 
read your tab-delimited locus_tags into a hash, and check using exists, 
rather than ploughing through the (potentially very long) list of locus 
tags every time. Also, be careful with new lines in your tab file (you 
can safely get rid of them using "chomp"). You can miss out the 
"has_tag" check by using "get_tagset_values" instead of 
"get_tag_values", since the former does not complain if the tag is not 
present. Once you have modified your sequence object, you need to write 
it out to a new file (or STDOUT) using Bio::SeqIO.

Also, just a couple of general points, you should always "use warnings" 
(or even better "use warnings FATAL=>qw(all)") since that can help solve 
many problems, and your code may be easier to read if you don't include 
the word "object" in all your variable names (after all you wouldn't say 
you write on a paper object using a pen object).

use strict;
use warnings FATAL=>qw(all);
use Bio::SeqIO;

# Read the first column of the tab-delimited file into a hash for fast lookup.
open (my $list, '<', 'list') or die $!;
my %V;
while (<$list>){
    chomp;
    $V{(split(/\t/, $_))[0]}=1;
}
close $list;

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;

# Take every feature off the sequence, tag the CDS features whose
# locus_tag is listed, then put each feature back.
for my $feat_object ($seq_object->remove_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        for my $V3 ($feat_object->get_tagset_values('locus_tag')){
            if (exists $V{$V3}){
                $feat_object->add_tag_value(listed_in_tab_file=>'yes');
                next;
            }
        }
    }
    $seq_object->add_SeqFeature($feat_object);
}

# With no -file argument the GenBank record is written to STDOUT.
Bio::SeqIO->new(-format=>'genbank')->write_seq($seq_object);
Hope this helps.
Cheers,
Roy.

On 13/12/2011 11:03, BForde wrote:
>
> Than you for the replies.
>
> My script (below) reads in a list of locus_tags from a tab delimited text
> file. Compares these locus_tags to the locus_tags in  a genbank file and
> where they are equal adds new features.
> the line
> $feat->add_tag_value()
> needs to be defined. In the bioperl wiki this variable appears to be defined
> by giving it coordinates etc (creating a new feature). I wish to add
> features to CDS key when the locus_tags are identical. Is this possible?
>
> use strict;
> use Bio::SeqIO;
>
> my @V;
> open (LIST1, 'list') ||die;
> while (<LIST1>){
>      push @V, (split(/\t/, $_))[0];
> }
> close(LIST1);
>
> my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
> my $seq_object = $seqio_object->next_seq;
>
> for my $feat_object ($seq_object->get_SeqFeatures){
>      if ($feat_object->primary_tag eq "CDS"){
>          if ($feat_object->has_tag('locus_tag')){
>              for my $V3 ($feat_object->get_tag_values('locus_tag')){
>                  for my $V1 (@V) {
>                      if ($V1 eq $V3){
>                          ADD NEW FEATURES
>
>                      }
>                  }
>              }
>          }
>      }
> }
>
> The script works down as far as the comparison point where locus_tags in the
> genbankfile "Contig100.gb" are compared against a list of locus_tags from a
> delimited txt file.
>
>
> regards
>
> Brian
>
> Jason Stajich-5 wrote:
>>
>> $feature->add_tag_value('color','blue');
>>
>> On Dec 9, 2011, at 8:52 AM, BForde wrote:
>>
>>>
>>> Hello all,
>>>
>>> I am new to Bioperl so I apologise if this is stupid question.
>>>
>>> For CDS features I which to add additional qualifiers e.g. /colour and
>>> /note
>>> qualifiers. I have looked at the BioPerl wiki but am still unsure as how
>>> to
>>> do this?
>>>
>>> regards
>>>
>>> Brian
>>
>> Jason Stajich
>> jason.stajich at gmail.com
>> jason at bioperl.org
>>
>>
>


From b.m.forde at umail.ucc.ie  Tue Dec 13 09:22:01 2011
From: b.m.forde at umail.ucc.ie (Brian Forde)
Date: Tue, 13 Dec 2011 14:22:01 +0000
Subject: [Bioperl-l] Genbank files
In-Reply-To: <4EE73C65.1080101@gmail.com>
References: <32941955.post@talk.nabble.com>
	<5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
	<32965574.post@talk.nabble.com> <4EE73C65.1080101@gmail.com>
Message-ID: <CAJLmuD+0Ts_5hPLL6T2vToY8+oW+PxXHaBiGGKoLXZZoGiBptg@mail.gmail.com>

Hi Roy,

Thank you. That works perfectly. I have to confess that someone else told
me to use hashes but I could not get them to work. Thanks again.

regards

Brian
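
For anyone else who gets stuck on the hash approach mentioned above, here is a minimal sketch in plain Perl (no BioPerl needed). The tab-delimited lines and the locus tags are made-up sample data standing in for the 'list' file and for the tags read from the GenBank features; only the lookup technique is the point.

```perl
use strict;
use warnings;

# Made-up sample data standing in for the lines of the tab-delimited 'list' file.
my @lines = ("LT_0001\tgeneA\n", "LT_0003\tgeneC\n", "LT_0007\n");

# Build the lookup table: the first column of each line becomes a hash key.
my %wanted;
for my $line (@lines) {
    chomp $line;                          # remove the trailing newline first
    my ($locus_tag) = split /\t/, $line;  # keep only the first column
    $wanted{$locus_tag} = 1;
}

# Made-up locus tags standing in for those read from the GenBank features.
my @tags_in_genbank = qw(LT_0001 LT_0002 LT_0003 LT_0007);

# One hash lookup per tag replaces the inner loop over the whole list.
my @matched = grep { exists $wanted{$_} } @tags_in_genbank;
print "matched: @matched\n";
```

Without the chomp, the single-column line would be stored as "LT_0007" plus a newline and would never match, which is the pitfall with new lines in the tab file.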

On Tue, Dec 13, 2011 at 11:52 AM, Roy Chaudhuri <roy.chaudhuri at gmail.com>wrote:

> Hi Brian,
>
> Just to check I have understood you, you want to read through a genbank
> file and add additional tags to features which are listed in a
> tab-delimited file of locus tags?
>
> Your code is on the right lines, but it would be much more efficient to
> read your tab-delimited locus_tags into a hash, and check using exists,
> rather than ploughing through the (potentially very long) list of locus
> tags every time. Also, be careful with new lines in your tab file (you can
> safely get rid of them using "chomp"). You can miss out the "has_tag" check
> by using "get_tagset_values" instead of "get_tag_values", since the former
> does not complain if the tag is not present. Once you have modified your
> sequence object, you need to write it out to a new file (or STDOUT) using
> Bio::SeqIO.
>
> Also, just a couple of general points, you should always "use warnings"
> (or even better "use warnings FATAL=>qw(all)") since that can help solve
> many problems, and your code may be easier to read if you don't include the
> word "object" in all your variable names (after all you wouldn't say you
> write on a paper object using a pen object).
>
> use strict;
> use warnings FATAL=>qw(all);
> use Bio::SeqIO;
> open (my $list, 'list') or die $!;
> my %V;
> while (<$list>){
>    chomp;
>    $V{(split(/\t/, $_))[0]}=1;
>
> }
> my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
> my $seq_object = $seqio_object->next_seq;
> for my $feat_object ($seq_object->remove_SeqFeatures){
>
>    if ($feat_object->primary_tag eq "CDS"){
>        for my $V3 ($feat_object->get_tagset_values('locus_tag')){
>            if (exists $V{$V3}){
>                $feat_object->add_tag_value(listed_in_tab_file=>'yes');
>                next;
>            }
>        }
>    }
>    $seq_object->add_SeqFeature($feat_object);
> }
> Bio::SeqIO->new(-format=>'genbank')->write_seq($seq_object);
>
> Hope this helps.
> Cheers,
> Roy.
>
>
> On 13/12/2011 11:03, BForde wrote:
>
>>
>> Thank you for the replies.
>>
>> My script (below) reads in a list of locus_tags from a tab delimited text
>> file. Compares these locus_tags to the locus_tags in  a genbank file and
>> where they are equal adds new features.
>> the line
>> $feat->add_tag_value()
>> needs to be defined. In the bioperl wiki this variable appears to be
>> defined
>> by giving it coordinates etc (creating a new feature). I wish to add
>> features to CDS key when the locus_tags are identical. Is this possible?
>>
>> use strict;
>> use Bio::SeqIO;
>>
>> my @V;
>> open (LIST1, 'list') ||die;
>> while (<LIST1>){
>>     push @V, (split(/\t/, $_))[0];
>> }
>> close(LIST1);
>>
>> my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
>> my $seq_object = $seqio_object->next_seq;
>>
>> for my $feat_object ($seq_object->get_SeqFeatures){
>>     if ($feat_object->primary_tag eq "CDS"){
>>         if ($feat_object->has_tag('locus_tag')){
>>             for my $V3 ($feat_object->get_tag_values('locus_tag')){
>>                 for my $V1 (@V) {
>>                     if ($V1 eq $V3){
>>                         ADD NEW FEATURES
>>
>>                     }
>>                 }
>>             }
>>         }
>>     }
>> }
>>
>> The script works down as far as the comparison point where locus_tags in
>> the
>> genbankfile "Contig100.gb" are compared against a list of locus_tags from
>> a
>> delimited txt file.
>>
>>
>> regards
>>
>> Brian
>>
>> Jason Stajich-5 wrote:
>>
>>>
>>> $feature->add_tag_value('color','blue');
>>>
>>> On Dec 9, 2011, at 8:52 AM, BForde wrote:
>>>
>>>
>>>> Hello all,
>>>>
>>>> I am new to Bioperl so I apologise if this is a stupid question.
>>>>
>>>> For CDS features I wish to add additional qualifiers, e.g. /colour and
>>>> /note qualifiers. I have looked at the BioPerl wiki but am still unsure
>>>> how to do this.
>>>>
>>>> regards
>>>>
>>>> Brian
>>>> --
>>>> View this message in context:
>>>> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
>>>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>
>>> Jason Stajich
>>> jason.stajich at gmail.com
>>> jason at bioperl.org
>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>>>
>>
>


-- 
Brian Forde
Microbiology Dept.
Bioscience Institute. Room 4.11
University College Cork
Cork
Ireland
tel:+353 21 4901306
email: b.m.forde at umail.ucc.ie


From b.m.forde at umail.ucc.ie  Mon Dec 12 12:20:53 2011
From: b.m.forde at umail.ucc.ie (BForde)
Date: Mon, 12 Dec 2011 09:20:53 -0800 (PST)
Subject: [Bioperl-l] Genbank files
In-Reply-To: <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
References: <32941955.post@talk.nabble.com>
	<5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
Message-ID: <32959999.post@talk.nabble.com>


Thank you for the replies.

I am unsure as to how to use the line below with my script. My script so far
reads

use strict;
use Bio::SeqIO;

my @V;
open (LIST1, 'list') ||die;
while (<LIST1>){
    push @V, (split(/\t/, $_))[0];
}
close(LIST1);

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;

for my $feat_object ($seq_object->get_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        if ($feat_object->has_tag('locus_tag')){
            for my $V3 ($feat_object->get_tag_values('locus_tag')){
                for my $V1 (@V) {
                    if ($V1 eq $V3){
                        ADD NEW FEATURES
                        
                    }    
                }
            }
        }
    }
}
 
The script works down as far as the comparison point where locus_tags in the
genbankfile "Contig100.gb" are compared against a list of locus_tags from a
delimited txt file.
If possible, could you show me how to amend my script so I can add new
features?

regards

Brian

Jason Stajich-5 wrote:
> 
> $feature->add_tag_value('color','blue');
> 
> On Dec 9, 2011, at 8:52 AM, BForde wrote:
> 
>> 
>> Hello all,
>> 
>> I am new to Bioperl so I apologise if this is a stupid question.
>>
>> For CDS features I wish to add additional qualifiers, e.g. /colour and
>> /note qualifiers. I have looked at the BioPerl wiki but am still unsure
>> how to do this.
>> 
>> regards
>> 
>> Brian
>> -- 
>> View this message in context:
>> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>> 
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
> Jason Stajich
> jason.stajich at gmail.com
> jason at bioperl.org
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
> 

-- 
View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32959999.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From Russell.Smithies at agresearch.co.nz  Tue Dec 13 22:17:02 2011
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Wed, 14 Dec 2011 16:17:02 +1300
Subject: [Bioperl-l] Genbank files
In-Reply-To: <32959999.post@talk.nabble.com>
References: <32941955.post@talk.nabble.com>
	<5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
	<32959999.post@talk.nabble.com>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF340186CF27A@exchsth.agresearch.co.nz>

Something like this:

use strict;
use Bio::SeqIO;

my @V;
open (LIST1, 'list') || die "Cannot open list: $!";
while (<LIST1>){
    chomp;    # strip the newline so single-column lines still match
    push @V, (split(/\t/, $_))[0];
}
close(LIST1);

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;

for my $feat_object ($seq_object->get_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        if ($feat_object->has_tag('locus_tag')){
            for my $V3 ($feat_object->get_tag_values('locus_tag')){
                for my $V1 (@V) {
                    if ($V1 eq $V3){
                        #ADD NEW FEATURES
                        $feat_object->add_tag_value('color','blue');
                    }
                }
            }
        }
    }
}
#write the new annotations
my $io = Bio::SeqIO->new(-format => "genbank", -file => ">new.gb" );
$io->write_seq($seq_object);

Take another look at http://www.bioperl.org/wiki/HOWTO:Feature-Annotation#Building_Your_Own_Sequences

--Russell


> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of BForde
> Sent: Tuesday, 13 December 2011 6:21 a.m.
> To: Bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Genbank files
> 
> 
> Thank you for the replies.
> 
> I am unsure as to how to use the line below with my script. My script so far
> reads
> 
> use strict;
> use Bio::SeqIO;
> 
> my @V;
> open (LIST1, 'list') ||die;
> while (<LIST1>){
>     push @V, (split(/\t/, $_))[0];
> }
> close(LIST1);
> 
> my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
> my $seq_object = $seqio_object->next_seq;
> 
> for my $feat_object ($seq_object->get_SeqFeatures){
>     if ($feat_object->primary_tag eq "CDS"){
>         if ($feat_object->has_tag('locus_tag')){
>             for my $V3 ($feat_object->get_tag_values('locus_tag')){
>                 for my $V1 (@V) {
>                     if ($V1 eq $V3){
>                         ADD NEW FEATURES
> 
>                     }
>                 }
>             }
>         }
>     }
> }
> 
> The script works down as far as the comparison point where locus_tags in the
> genbankfile "Contig100.gb" are compared against a list of locus_tags from a
> delimited txt file.
> If possible, could you show me how to amend my script so I can add new
> features?
> 
> regards
> 
> Brian
> 
> Jason Stajich-5 wrote:
> >
> > $feature->add_tag_value('color','blue');
> >
> > On Dec 9, 2011, at 8:52 AM, BForde wrote:
> >
> >>
> >> Hello all,
> >>
> >> I am new to Bioperl so I apologise if this is a stupid question.
> >>
> >> For CDS features I wish to add additional qualifiers, e.g. /colour
> >> and /note qualifiers. I have looked at the BioPerl wiki but am still
> >> unsure how to do this.
> >>
> >> regards
> >>
> >> Brian
> >> --
> >> View this message in context:
> >> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
> >> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
> >>
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> > Jason Stajich
> > jason.stajich at gmail.com
> > jason at bioperl.org
> >
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> >
> 
> --
> View this message in context: http://old.nabble.com/Genbank-files-
> tp32941955p32959999.html
> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From l.m.timmermans at students.uu.nl  Wed Dec 14 10:43:24 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Wed, 14 Dec 2011 16:43:24 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
Message-ID: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>

Hi all,

As already mentioned on IRC, I recently wrote an SFF parser and uploaded it
to CPAN as Bio::SFF. I haven't written a Bio::SeqIO::sff wrapper yet (SFF
entries don't map 1:1 with Bio::Seq objects), but if anyone has the time
to write one I'd be most grateful.

Leon


From p.j.a.cock at googlemail.com  Wed Dec 14 11:03:05 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 14 Dec 2011 16:03:05 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
Message-ID: <CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>

On Wed, Dec 14, 2011 at 3:43 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> Hi all,
>
> As already mentioned on IRC, I recently wrote an SFF parser and uploaded it
> to CPAN as Bio::SFF. I haven't written a Bio::SeqIO::sff wrapper yet (SFF
> entries don't map 1:1 with Bio::Seq objects), but if anyone has the time
> to write one I'd be most grateful.
>
> Leon

Hi Leon,

Have you looked at the index block at all, in order to offer random
access by read ID, or to access the Roche XML manifest? Please
ask if you need more information about this - or if you can read Python:
https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py

Is this building on Miguel Pignatelli's work? I don't recall seeing
any follow up posts from him after this one:
http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html

Peter


From cjfields at illinois.edu  Wed Dec 14 11:12:58 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Wed, 14 Dec 2011 16:12:58 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>,
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
Message-ID: <3BBC1132-E768-45D9-A107-ACD51791722D@illinois.edu>

Leon, 

Nice! Definitely a good idea to keep the lower-level parser and the BioPerl-bridging code separate. One of my concerns with the various parsers we have right now is that they hard-wire BioPerl classes into the parser, which makes optimization hard.

Chris

PS- Peter, I don't think the two projects are related, but I suppose Leon is the best to answer that.

Sent from my stupid iPad, now my laptop's on the fritz

On Dec 14, 2011, at 10:04 AM, "Peter Cock" <p.j.a.cock at googlemail.com> wrote:

> On Wed, Dec 14, 2011 at 3:43 PM, Leon Timmermans
> <l.m.timmermans at students.uu.nl> wrote:
>> Hi all,
>> 
>> As already mentioned on IRC, I recently wrote an SFF parser and uploaded it
>> to CPAN as Bio::SFF. I haven't written a Bio::SeqIO::sff wrapper yet (SFF
>> entries don't map 1:1 with Bio::Seq objects), but if anyone has the time
>> to write one I'd be most grateful.
>> 
>> Leon
> 
> Hi Leon,
> 
> Have you looked at the index block at all, in order to offer random
> access by read ID, or to access the Roche XML manifest? Please
> ask if you need more information about this - or if you can read Python:
> https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py
> 
> Is this building on Miguel Pignatelli's work? I don't recall seeing
> any follow up posts from him after this one:
> http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html
> 
> Peter
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From l.m.timmermans at students.uu.nl  Wed Dec 14 11:27:58 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Wed, 14 Dec 2011 17:27:58 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
Message-ID: <CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>

On Wed, Dec 14, 2011 at 5:03 PM, Peter Cock <p.j.a.cock at googlemail.com>wrote:

> Hi Leon,
>
> Have you looked at the index block at all, in order to offer random
> access by read ID, or to access the Roche XML manifest? Please
> ask if you need more information about this - or if you can read Python:
> https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py
>

I have looked at it, but not implemented it yet. There is no standardized
index, and the ones that are in common use either seem stupid (the Roche
index, which is essentially just a weirdly formatted sequential list,
though that should still be faster than a table scan) or undocumented (hash
based index).

> Is this building on Miguel Pignatelli's work? I don't recall seeing
> any follow up posts from him after this one:
> http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html
>

It isn't. I like his idea for reusing BioPython's test files though.

Leon


From p.j.a.cock at googlemail.com  Wed Dec 14 11:44:28 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 14 Dec 2011 16:44:28 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
Message-ID: <CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>

On Wed, Dec 14, 2011 at 4:27 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> On Wed, Dec 14, 2011 at 5:03 PM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
>>
>> Hi Leon,
>>
>> Have you looked at the index block at all, in order to offer random
>> access by read ID, or to access the Roche XML manifest? Please
>> ask if you need more information about this - or if you can read Python:
>> https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py
>
> I have looked at it, but not implemented it yet. There is no standardized
> index, and the ones that are in common use either seem stupid (the Roche
> index, which is essentially just a weirdly formatted sequential list, though
> that should still be faster than a table scan) or undocumented (hash based
> index).

There are two widely used indexes, both from Roche (one with and
one without an XML manifest, magic bytes .mft and .srt). They are
both just a simple table of the read names and offsets, sorted
alphabetically. This works pretty well for rapid lookup for SFF files
(because the read count is not so high), and is pretty easy.
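
A table of names and offsets sorted by name is what a binary search needs. A rough sketch of such a lookup in Perl, assuming the index has already been decoded into an array of [name, offset] pairs: the read names and offsets are invented, find_offset is a hypothetical helper, and nothing here reproduces the on-disk layout of the Roche index.

```perl
use strict;
use warnings;

# Invented index entries, already sorted by read name: [name, byte offset].
my @index = (
    [ 'READ_A', 840 ],
    [ 'READ_C', 2216 ],
    [ 'READ_F', 3608 ],
    [ 'READ_K', 5120 ],
);

# Hypothetical helper: return the offset for $name, or undef if it is absent.
sub find_offset {
    my ($index, $name) = @_;
    my ($lo, $hi) = (0, $#{$index});
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $cmp = $name cmp $index->[$mid][0];   # string comparison on the name
        return $index->[$mid][1] if $cmp == 0;
        if ($cmp < 0) { $hi = $mid - 1 } else { $lo = $mid + 1 }
    }
    return;
}

my $offset = find_offset(\@index, 'READ_F');
print "READ_F starts at byte $offset\n";
```

Each lookup halves the remaining range, so the sort order is what keeps the search cheap even when the table stays on disk.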

I don't think anyone used the hash table style indexes (.hsh), which
I assume was a proof of principle or trial in the early days of SFF.

One thing to check is what Ion Torrent's SFF files use. I would
guess they've followed Roche, but I don't know. After all, the
index structure is not defined in the SFF specification - it was
left extensible on purpose.

>> Is this building on Miguel Pignatelli's work? I don't recall seeing
>> any follow up posts from him after this one:
>> http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html
>
> It isn't. I like his idea for reusing BioPython's test files though.

Yes, please do.

Peter


From gingerplum at gmail.com  Wed Dec 14 00:18:55 2011
From: gingerplum at gmail.com (plum ginger)
Date: Tue, 13 Dec 2011 21:18:55 -0800 (PST)
Subject: [Bioperl-l] a problem about BLAST
Message-ID: <447aac17-935f-4f3e-89d8-bd0b6edaf04f@r6g2000yqr.googlegroups.com>

Hi. I'm trying to run BLAST with Bio::Tools::Run::StandAloneBlast. I
need to run BLAST on more than one sequence. However, the BLAST outfile
only stores the result of the last sequence. How can I make the outfile
store all results?

I would appreciate your help. Thanks very much!


Best regards


From jason.stajich at gmail.com  Thu Dec 15 12:02:47 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Thu, 15 Dec 2011 11:02:47 -0600
Subject: [Bioperl-l] a problem about BLAST
In-Reply-To: <447aac17-935f-4f3e-89d8-bd0b6edaf04f@r6g2000yqr.googlegroups.com>
References: <447aac17-935f-4f3e-89d8-bd0b6edaf04f@r6g2000yqr.googlegroups.com>
Message-ID: <58E5B487-7FF0-4018-B109-D6595DC2E493@gmail.com>

you are probably setting the outfile in each parsing iteration -- you need to show your code if you want someone to help you debug the problem.
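
The suspected cause can be illustrated without BLAST at all. The sketch below is plain Perl, not the StandAloneBlast API, and the query names and file names are invented: reopening the same file with '>' for every query truncates it each time, so only the last result survives, while opening it once keeps everything.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir     = tempdir(CLEANUP => 1);
my @queries = qw(seq1 seq2 seq3);     # invented query names

# Pattern that loses results: the same file is reopened with '>' for every
# query, so each iteration truncates what the previous one wrote.
my $overwritten = "$dir/overwritten.txt";
for my $q (@queries) {
    open my $out, '>', $overwritten or die "Cannot write $overwritten: $!";
    print {$out} "result for $q\n";
    close $out;
}

# Pattern that keeps everything: open once, write all results, close once.
my $combined = "$dir/combined.txt";
open my $out, '>', $combined or die "Cannot write $combined: $!";
print {$out} "result for $_\n" for @queries;
close $out;

# Count the lines that survived in a file.
sub count_lines {
    my ($file) = @_;
    open my $in, '<', $file or die "Cannot read $file: $!";
    my @lines = <$in>;
    close $in;
    return scalar @lines;
}

printf "overwritten: %d line(s), combined: %d line(s)\n",
    count_lines($overwritten), count_lines($combined);
```

With StandAloneBlast the equivalent fix would presumably be to give each query its own outfile name, or to submit all sequences in a single call; which one applies depends on the code that was not posted.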

On Dec 13, 2011, at 11:18 PM, plum ginger wrote:

> Hi. I'm trying to run BLAST with Bio::Tools::Run::StandAloneBlast. I
> need to run BLAST on more than one sequence. However, the BLAST outfile
> only stores the result of the last sequence. How can I make the outfile
> store all results?
> 
> I would appreciate your help. Thanks very much!
> 
> 
> Best regards
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org


From pengyu.ut at gmail.com  Fri Dec 16 17:10:27 2011
From: pengyu.ut at gmail.com (Peng Yu)
Date: Fri, 16 Dec 2011 16:10:27 -0600
Subject: [Bioperl-l] How to stop rather than emit warnings with
	Bio::Das::segment?
Message-ID: <CABrM6w=tiMNRJjZtvK-8Ot-TE3WuZgvtod6q-3g=UAWk8K_WRg@mail.gmail.com>

Hi,

Bio::Das::segment can give me the following warning without stopping
the whole program when the position for the query doesn't exist. I
could test the return result and quit when it is []. But this would
force my program to run a test every time I call segment. I'm wondering
if there is an automatic way to make Bio::Das::segment stop in such
cases.

--------------------- WARNING ---------------------
MSG: Sequence is not dna or rna, but []. Attempting to revcom, but
unsure if this is right
---------------------------------------------------


-- 
Regards,
Peng


From cjfields at illinois.edu  Fri Dec 16 21:48:07 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sat, 17 Dec 2011 02:48:07 +0000
Subject: [Bioperl-l] How to stop rather than emit warnings
	with	Bio::Das::segment?
In-Reply-To: <CABrM6w=tiMNRJjZtvK-8Ot-TE3WuZgvtod6q-3g=UAWk8K_WRg@mail.gmail.com>
References: <CABrM6w=tiMNRJjZtvK-8Ot-TE3WuZgvtod6q-3g=UAWk8K_WRg@mail.gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF0881CF8E@CITESMBX4.ad.uillinois.edu>

Setting verbosity to 2 should convert warnings to exceptions.   

IIRC, set '-verbose => 2' in the Bio::Das constructor, set '$das->verbose(2)' explicitly, or set the env variable BIOPERLDEBUG=2.  
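
As a rough illustration of the verbosity setting, assuming BioPerl is installed: the sketch uses a bare Bio::Root::Root object as a stand-in, since the Bio::Das constructor call is not shown in this thread and whether it honours -verbose is the "IIRC" part above. At verbosity 2 the object's warn() throws instead of printing, so eval can trap it.

```perl
use strict;
use warnings;
use Bio::Root::Root;

# Stand-in object (an assumption): any class built on Bio::Root::Root
# accepts -verbose in its constructor.
my $obj = Bio::Root::Root->new(-verbose => 2);

# With verbosity 2, warn() raises an exception rather than printing.
my $caught = 0;
eval {
    $obj->warn("Sequence is not dna or rna");
    1;
} or do {
    $caught = 1;
    print "warning became an exception\n";
};
```

Outside an eval the same exception simply ends the program, which is the stop-on-warning behaviour asked for.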

chris

________________________________________
From: bioperl-l-bounces at lists.open-bio.org [bioperl-l-bounces at lists.open-bio.org] on behalf of Peng Yu [pengyu.ut at gmail.com]
Sent: Friday, December 16, 2011 4:10 PM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] How to stop rather than emit warnings with Bio::Das::segment?

Hi,

Bio::Das::segment can give me the following warning without stopping
the whole program when the position for the query doesn't exist. I
could test the return result and quit when it is []. But this would
force my program to run a test every time I call segment. I'm wondering
if there is an automatic way to make Bio::Das::segment stop in such
cases.

--------------------- WARNING ---------------------
MSG: Sequence is not dna or rna, but []. Attempting to revcom, but
unsure if this is right
---------------------------------------------------


--
Regards,
Peng
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l


From anna.fr at gmail.com  Mon Dec 19 02:09:15 2011
From: anna.fr at gmail.com (Anna Friedlander)
Date: Mon, 19 Dec 2011 20:09:15 +1300
Subject: [Bioperl-l] StandAloneBlastPlus blastdbcmd question
Message-ID: <CALv2E+1Yvt1OhcTE_YXqho+zYZhPjihhCFupybArxMjLfD1S_g@mail.gmail.com>

Hi all

I have a question about using blastdbcmd via
Bio::Tools::Run::StandAloneBlastPlus

I have some Blast+ search results that I am manipulating in a perl
programme, and I would like to retrieve some sequence information for
some results using subject sequence IDs, and associated subject start
and end indices. If I was using blastdbcmd directly, I would do so
using the -entry and -range options.

My question is, can I use all the blastdbcmd options (or more
specifically, just the -entry and -range options) from within the
StandAloneBlastPlus module?

My apologies if I don't properly understand how this "wrapper" works!

Thanks in advance for your help
Anna Friedlander


From l.m.timmermans at students.uu.nl  Mon Dec 19 09:19:14 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Mon, 19 Dec 2011 15:19:14 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
Message-ID: <CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>

On Wed, Dec 14, 2011 at 5:44 PM, Peter Cock <p.j.a.cock at googlemail.com>wrote:

> There are two widely used indexes, both from Roche (one with and
> one without an XML manifest, magic bytes .mft and .srt). They are
> both just a simple table of the read names and offsets, sorted
> alphabetically.


Yeah, that's what I got from the BioPython code. I didn't know it was
sorted though (it doesn't make much sense either, unless they wanted to do
a binary search or something).

> This works pretty well for rapid lookup for SFF files
> (because the read count is not so high), and is pretty easy.
>

It's implemented in Bio::SFF 0.003. I did restructure my code into two
readers though, since doing sequential and random-access in the class
didn't make much sense code-wise.

> I don't think anyone used the hash table style indexes (.hsh), which
> I assume was a proof of principle or trial in the early days of SFF.
>

I see, too bad.


> One thing to check is what Ion Torrent's SFF files use. I would
> guess they've followed Roche, but I don't know. After all, the
> index structure is not defined in the SFF specification - it was
> left extensible on purpose.
>

Yeah, we should check that too.

> Yes, please do.
>

It's added to 0.003. The lack of tests was bothering me, but the SFFs I had
at hand were not suitable.

Leon


From p.j.a.cock at googlemail.com  Mon Dec 19 09:31:18 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 19 Dec 2011 14:31:18 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
Message-ID: <CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>

On Mon, Dec 19, 2011 at 2:19 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> On Wed, Dec 14, 2011 at 5:44 PM, Peter Cock <p.j.a.cock at googlemail.com>wrote:
>
>> There are two widely used indexes, both from Roche (one with and
>> one without an XML manifest, magic bytes .mft and .srt). They are
>> both just a simple table of the read names and offsets, sorted
>> alphabetically.
>
> Yeah, that's what I got from the BioPython code. I didn't know it
> was sorted though (it doesn't make much sense either, unless they
> wanted to do a binary search or something).

I presume that's what Roche uses if they keep the index on disk.

The alternative is to load the index into RAM, which is really fast.
You just open the SFF, read the header, seek to the index, load
the index. Without the index, you have to scan the entire SFF file
to find each record and its offset - which is much slower.
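
The in-RAM variant can be sketched in plain Perl as follows. Everything here is invented for illustration: the "file" is an in-memory string of made-up text records, not SFF data, and the offsets are computed by walking the records, where a real reader would load them from the index block.

```perl
use strict;
use warnings;

# Invented "file": three text records stored back to back.
my @records = ("READ_A:ACGT\n", "READ_B:GGCCTTAA\n", "READ_C:TTTT\n");
my $data    = join '', @records;

# Build the in-memory index once: read name => byte offset of its record.
my %offset_of;
my $pos = 0;
for my $rec (@records) {
    my ($name) = split /:/, $rec;
    $offset_of{$name} = $pos;
    $pos += length $rec;
}

# Random access: seek straight to the record instead of scanning from the start.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
seek $fh, $offset_of{'READ_C'}, 0 or die "seek failed: $!";
my $line = <$fh>;
close $fh;
print $line;
```

Once the hash is loaded, each lookup is a single hash access plus one seek, regardless of where the record sits in the file.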

>> This works pretty well for rapid lookup for SFF files
>> (because the read count is not so high), and is pretty easy.
>
> It's implemented in Bio::SFF 0.003. I did restructure my code into two
> readers though, since doing sequential and random-access in the class
> didn't make much sense code-wise.
>
>> I don't think anyone used the hash table style indexes (.hsh), which
>> I assume was a proof of principle or trial in the early days of SFF.
>
> I see, too bad.
>
>> One thing to check is what Ion Torrent's SFF files use. I would
>> guess they've followed Roche, but I don't know. After all, the
>> index structure is not defined in the SFF specification - it was
>> left extensible on purpose.
>
> Yeah, we should check that too.

I don't have any Ion Torrent data first hand, and the public
samples I've seen were FASTQ not SFF. But I know a few
people with Ion Torrent machines that might be able to help...

> It's added to 0.003. The lack of tests was bothering me, but the
> SFFs I had at hand were not suitable.

Have you looked at the sample SFF data in Biopython? Please
use them for the BioPerl unit tests (we've been talking about a
cross-project collection of test data files like this), the README
file should be self-explanatory:
https://github.com/biopython/biopython/tree/master/Tests/Roche

Peter


From p.j.a.cock at googlemail.com  Mon Dec 19 10:13:53 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 19 Dec 2011 15:13:53 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <4FCCB636-3CC8-40E1-A6DB-55DD8F5F3CBD@sgul.ac.uk>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<4FCCB636-3CC8-40E1-A6DB-55DD8F5F3CBD@sgul.ac.uk>
Message-ID: <CAKVJ-_4U_Yt5A8f4QLxb-SzT8Y7n-2kRvGH=g9n+NfqAFegxgA@mail.gmail.com>

On Mon, Dec 19, 2011 at 3:03 PM, Adam Witney <awitney at sgul.ac.uk> wrote:
>> I don't have any Ion Torrent data first hand, and the public
>> samples I've seen were FASTQ not SFF. But I know a few
>> people with Ion Torrent machines that might be able to help...
>
> I can let you have some Ion Torrent SFF files if it helps
>
> adam

Hi Adam,

I've just had a quick look at a file from an IonTorrent 314 chip
that a colleague kindly sent me, and that SFF file had no index
(but only 50k reads so this isn't so important).

If you can send me (and Leon?) one or two original SFF files that
would be useful, even if just to confirm that Ion Torrent's SFF files
do indeed typically lack an index. If that is the case, I may need to
remove the warning message Biopython currently prints when
indexing these files: No SFF index, doing it the slow way

Off list is fine if you'd like to keep the data private, use dropbox
or something if you don't have an FTP server.

Thanks,

Peter


From awitney at sgul.ac.uk  Mon Dec 19 10:03:16 2011
From: awitney at sgul.ac.uk (Adam Witney)
Date: Mon, 19 Dec 2011 15:03:16 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
Message-ID: <4FCCB636-3CC8-40E1-A6DB-55DD8F5F3CBD@sgul.ac.uk>

>>> One thing to check is what Ion Torrent's SFF files use. I would
>>> guess they've followed Roche, but I don't know. After all, the
>>> index structure is not defined in the SFF specification - it was
>>> left extensible on purpose.
>> 
>> Yeah, we should check that too.
> 
> I don't have any Ion Torrent data first hand, and the public
> samples I've seen were FASTQ not SFF. But I know a few
> people with Ion Torrent machines that might be able to help...

I can let you have some Ion Torrent SFF files if it helps.

adam


From l.m.timmermans at students.uu.nl  Mon Dec 19 10:48:34 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Mon, 19 Dec 2011 16:48:34 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
Message-ID: <CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>

On Mon, Dec 19, 2011 at 3:31 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> I presume that's what Roche uses if they keep the index on disk.
>
> The alternative is to load the index into RAM, which is really fast.
> You just open the SFF, read the header, seek to the index, load
> the index. Without the index, you have to scan the entire SFF file
> to find each record and its offset - which is much slower.
>

That's what I'm doing now. It's much faster, but it still takes a
noticeable amount of time on large files.
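
As a rough illustration of the "read the header, seek to the index" step quoted above, here is a minimal sketch in core Perl. It assumes the fixed 31-byte, big-endian common header layout from the SFF specification (magic number, version, index offset, index length, number of reads, header length, key length, flows per read, flowgram format). The sub name read_sff_header and the synthetic header are made up for the demonstration; this is not the Bio::SFF code.

```perl
use strict;
use warnings;

# Parse the fixed part of an SFF common header (31 bytes, big-endian).
# Field layout assumed from the SFF specification; read_sff_header is a
# hypothetical name, not part of Bio::SFF.
sub read_sff_header {
    my ($fh) = @_;
    my $got = read($fh, my $buf, 31);
    die "Short read on SFF header\n" unless defined $got && $got == 31;

    my ($magic, $version, $off_hi, $off_lo, $index_length, $reads,
        $header_length, $key_length, $flows, $flow_format)
        = unpack 'a4 a4 N N N N n n n C', $buf;
    die "Not an SFF file\n" unless $magic eq '.sff';

    return {
        # 64-bit offset rebuilt from two 32-bit words
        index_offset    => $off_hi * 2**32 + $off_lo,
        index_length    => $index_length,
        number_of_reads => $reads,
        header_length   => $header_length,
        key_length      => $key_length,
        flows_per_read  => $flows,
    };
}

# Synthetic header for demonstration only: index at byte 1024, 256 bytes long.
my $fake = pack 'a4 a4 N N N N n n n C',
    '.sff', "\0\0\0\1", 0, 1024, 256, 10, 440, 4, 400, 1;
open my $fh, '<', \$fake or die "Cannot open in-memory file: $!";
binmode $fh;
my $header = read_sff_header($fh);
close $fh;

if ($header->{index_length} > 0) {
    # A real reader would now seek to index_offset and load the index.
    print "index at $header->{index_offset}, $header->{index_length} bytes\n";
}
else {
    # No index: every record has to be scanned to learn its offset.
    print "no index: scan every record\n";
}
```

An index length of zero is the case where the whole file has to be scanned, which is the slow path described above.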

> Have you looked at the sample SFF data in Biopython? Please
> use them for the BioPerl unit tests (we're been talking about a
> cross project collection of test data files like this), the README
> file should be self-explanatory:
> https://github.com/biopython/biopython/tree/master/Tests/Roche
>

Yeah, I'm using those now (
https://github.com/Leont/bio-sff/blob/master/t/reader.t). I must say there
were some interesting corner cases in it.

Leon


From p.j.a.cock at googlemail.com  Mon Dec 19 11:15:15 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 19 Dec 2011 16:15:15 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
Message-ID: <CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>

On Mon, Dec 19, 2011 at 3:48 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> On Mon, Dec 19, 2011 at 3:31 PM, Peter Cock wrote:
>
>> Have you looked at the sample SFF data in Biopython? Please
>> use them for the BioPerl unit tests (we're been talking about a
>> cross project collection of test data files like this), the README
>> file should be self-explanatory:
>> https://github.com/biopython/biopython/tree/master/Tests/Roche
>
> Yeah, I'm using those now
> (https://github.com/Leont/bio-sff/blob/master/t/reader.t).

Could you add a link to your /corpus/README.txt file pointing
back to the Biopython original for acknowledgement and
future reference?

>
> I must say there were some interesting corner cases in it.
>

I'm glad you agree - and if you can think of any more special
cases to verify that would be great.

Are you doing just SFF parsing for now? Not writing?

Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
format name "sff" to mean the full read sequence (with mixed
case, upper case for the good sequence, lower case for any
left/right clipping - as in the Roche tools), and "sff-trim" to mean
the trimmed sequences. I would encourage you to do the
same, as part of the general aim of having consistent
sequence format names between BioPerl, Biopython, and
EMBOSS, where possible.
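
To make the difference between the two format names concrete, here is a minimal sketch in core Perl. It assumes the clip points have already been combined into one left and one right value, 1-based and inclusive, with 0 meaning "not set" as in the SFF read header. The helper name mixed_case_and_trimmed and the example read are hypothetical.

```perl
use strict;
use warnings;

# Render one read as the full mixed-case sequence ("sff" style) and as the
# trimmed sequence ("sff-trim" style). $left and $right are 1-based,
# inclusive clip points; 0 means "not set". Hypothetical helper name.
sub mixed_case_and_trimmed {
    my ($bases, $left, $right) = @_;
    my $len = length $bases;
    $left  = 1    if $left  < 1;
    $right = $len if $right < 1 || $right > $len;
    return (lc $bases, '') if $left > $right;   # nothing survives clipping

    my $head = lc substr($bases, 0, $left - 1);                  # left clip
    my $good = uc substr($bases, $left - 1, $right - $left + 1); # kept part
    my $tail = lc substr($bases, $right);                        # right clip
    return ($head . $good . $tail, $good);
}

my ($full, $trimmed) = mixed_case_and_trimmed('TCAGACGTACGTNN', 5, 12);
print "$full\n";     # tcagACGTACGTnn
print "$trimmed\n";  # ACGTACGT
```

Both outputs come from the same record, so a reader only needs one parser and a switch for which of the two strings to hand back.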

Peter


From l.m.timmermans at students.uu.nl  Mon Dec 19 11:47:41 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Mon, 19 Dec 2011 17:47:41 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
Message-ID: <CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>

On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> Could you a link to your /corpus/README.txt file pointing
> back to the Biopython original for acknowledgement and
> future reference?
>

I forgot about that, I will add it to the next release.

> Are you doing just SFF parsing for now? Not writing?
>

I haven't written the writer yet (haven't needed it so far). I'd rather
release working code early instead of waiting until everything is complete.

> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
> format name "sff" to mean the full read sequence (with mixed
> case, upper case for the good sequence, lower cases for any
> left/right clipping - as in the Roche tools), and "sff-trim" to mean
> the trimmed sequences. I would encourage you to do the
> same, as part of the general aim of having consistent
> sequence format names between BioPerl, Biopython, and
> EMBOSS, where possible.
>

I agree, consistency is good.

Leon


From p.j.a.cock at googlemail.com  Mon Dec 19 12:00:03 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 19 Dec 2011 17:00:03 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
Message-ID: <CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>

On Mon, Dec 19, 2011 at 4:47 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
>>
>> Could you a link to your /corpus/README.txt file pointing
>> back to the Biopython original for acknowledgement and
>> future reference?
>
> I forgot about that, I will add it to the next release.

Thanks.

>> Are you doing just SFF parsing for now? Not writing?
>
>
> I haven't written the writer yet (haven't needed it so far). I'd rather
> release working code early instead of waiting until everything is complete.

I understand - but make sure you've designed the data structures
in the parser so as to allow the original record to be re-built as SFF.

>> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
>> format name "sff" to mean the full read sequence (with mixed
>> case, upper case for the good sequence, lower cases for any
>> left/right clipping - as in the Roche tools), and "sff-trim" to mean
>> the trimmed sequences. I would encourage you to do the
>> same, as part of the general aim of having consistent
>> sequence format names between BioPerl, Biopython, and
>> EMBOSS, where possible.
>
> I agree, consistency is good.

Great. I'd guess Bio::SeqIO integration would be more important
than SFF output initially.

Peter


From cjfields at illinois.edu  Mon Dec 19 14:44:22 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 19 Dec 2011 19:44:22 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>,
	<CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
Message-ID: <D5B63715-D712-44E4-BBAC-35C721AF2098@illinois.edu>

Kinda joining this a little late, but I think if there is a way to have a low-level parser/writer that generically parses the data into simple (possibly hash-tagged) data structures, that would be best.  Barring that, a very simple class for storing data.  We've found BioPerl objects/classes pretty heavy.

(For an example of this, see Heng Li's readfq parser on GitHub, which has some stats for FASTQ/FASTA parsing.)

Any way we can separate the parser from object instantiation would enable us to optimize the object/class layer and parser/writer layers separately, with the possible nice side effect of making the parser more broadly used.

For instance, if someone wanted a faster parser, they could use the low level; otherwise they could use the higher-level (possibly BioPerl-specific) API. Lincoln does this to a certain degree with Bio-samtools; I would go further and put the bp and non-bp code in separate dists.
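
The layering described above could look roughly like the following sketch in core Perl. It uses FASTA rather than SFF to stay short. The names fasta_iterator and My::SimpleSeq are invented for illustration and do not exist in BioPerl.

```perl
use strict;
use warnings;

# Low-level layer: an iterator that yields plain hash references and
# creates no objects. fasta_iterator is a hypothetical name.
sub fasta_iterator {
    my ($fh) = @_;
    my $next_header;    # look-ahead header line of the following record
    return sub {
        my $header = defined $next_header ? $next_header : <$fh>;
        undef $next_header;
        # Skip anything before the first '>' line
        $header = <$fh> while defined $header && $header !~ /^>/;
        return unless defined $header;

        my ($id, $desc) = $header =~ /^>\s*(\S*)\s*(.*?)\s*$/;
        my $seq = '';
        while (defined(my $line = <$fh>)) {
            if ($line =~ /^>/) { $next_header = $line; last }
            $line =~ s/\s+//g;
            $seq .= $line;
        }
        return { id => $id, desc => $desc, seq => $seq };
    };
}

# High-level layer: a thin, optional object wrapper around the same hash.
package My::SimpleSeq;    # hypothetical class
sub new    { my ($class, $rec) = @_; return bless { %$rec }, $class }
sub id     { $_[0]{id} }
sub seq    { $_[0]{seq} }
sub length { CORE::length $_[0]{seq} }

package main;

my $data = ">chrA test sequence\nACGT\nAC\n>chrB\nGGG\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $next = fasta_iterator($fh);
my @objects;
while (my $rec = $next->()) {
    print "$rec->{id}\t", length($rec->{seq}), "\n";   # fast path: plain hash
    push @objects, My::SimpleSeq->new($rec);           # object path, on demand
}
close $fh;
```

Callers that only need speed stop at the hash; the object layer can then be tuned or replaced without touching the parser.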

Chris

Sent from my iPad

On Dec 19, 2011, at 11:05 AM, "Peter Cock" <p.j.a.cock at googlemail.com> wrote:

> On Mon, Dec 19, 2011 at 4:47 PM, Leon Timmermans
> <l.m.timmermans at students.uu.nl> wrote:
>> On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock <p.j.a.cock at googlemail.com>
>> wrote:
>>> 
>>> Could you a link to your /corpus/README.txt file pointing
>>> back to the Biopython original for acknowledgement and
>>> future reference?
>> 
>> I forgot about that, I will add it to the next release.
> 
> Thanks.
> 
>>> Are you doing just SFF parsing for now? Not writing?
>> 
>> 
>> I haven't written the writer yet (haven't needed it so far). I'd rather
>> release working code early instead of waiting until everything is complete.
> 
> I understand - but make sure you've designed the data structures
> in the parser so as to allow the original record to be re-built as SFF.
> 
>>> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
>>> format name "sff" to mean the full read sequence (with mixed
>>> case, upper case for the good sequence, lower cases for any
>>> left/right clipping - as in the Roche tools), and "sff-trim" to mean
>>> the trimmed sequences. I would encourage you to do the
>>> same, as part of the general aim of having consistent
>>> sequence format names between BioPerl, Biopython, and
>>> EMBOSS, where possible.
>> 
>> I agree, consistency is good.
> 
> Great. I'd guess Bio::SeqIO integration would be more important
> that SFF output initially.
> 
> Peter


From cjfields at illinois.edu  Mon Dec 19 19:28:25 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Mon, 19 Dec 2011 18:28:25 -0600
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
Message-ID: <4EEFD6A9.3010303@illinois.edu>

On 12/19/2011 10:47 AM, Leon Timmermans wrote:
> On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
>> Could you a link to your /corpus/README.txt file pointing
>> back to the Biopython original for acknowledgement and
>> future reference?
>>
> I forgot about that, I will add it to the next release.
>
>> Are you doing just SFF parsing for now? Not writing?
> I haven't written the writer yet (haven't needed it so far). I'd rather
> release working code early instead of waiting until everything is complete.
>
>> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
>> format name "sff" to mean the full read sequence (with mixed
>> case, upper case for the good sequence, lower cases for any
>> left/right clipping - as in the Roche tools), and "sff-trim" to mean
>> the trimmed sequences. I would encourage you to do the
>> same, as part of the general aim of having consistent
>> sequence format names between BioPerl, Biopython, and
>> EMBOSS, where possible.
>>
> I agree, consistency is good.
>
> Leon

This is already implemented in Bio::SeqIO, I believe.  It follows the same
line of thinking as the FASTQ format: one can use a 'format-variant'
combination that (as one might guess) tells the parser which variation of
the format it is handling, so logic within the parser can deal with it.
You can also pass the '-variant => "foo"' parameter, IIRC.  You would
just check the variant with the variant() method.
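
The 'format-variant' idea can be sketched in core Perl as below. This is an illustration of the naming convention only, not the Bio::SeqIO source. The helper split_format is hypothetical, and letting an explicit variant argument override one embedded in the name is this sketch's own choice.

```perl
use strict;
use warnings;

# Split a 'format-variant' name such as 'sff-trim' or 'fastq-illumina'
# into its two parts. split_format is a hypothetical helper; here an
# explicit variant argument wins over one embedded in the format name.
sub split_format {
    my ($format, $explicit_variant) = @_;
    my ($base, $variant) = split /-/, lc $format, 2;
    $variant = lc $explicit_variant if defined $explicit_variant;
    return ($base, $variant);
}

for my $name ('sff', 'sff-trim', 'fastq-illumina') {
    my ($base, $variant) = split_format($name);
    printf "%-15s format=%s variant=%s\n", $name, $base,
        defined $variant ? $variant : '(none)';
}
```

A parser module would then branch on the variant value, for example returning trimmed reads when it is 'trim'.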

chris


From l.m.timmermans at students.uu.nl  Tue Dec 20 10:25:13 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Tue, 20 Dec 2011 16:25:13 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
	<CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
Message-ID: <CAC1jpXDmU3jF-3G9dnB=-0fCkydqWd18J0eo2nKkSLvKTbOxjQ@mail.gmail.com>

On Mon, Dec 19, 2011 at 6:00 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> I understand - but make sure you've designed the data structures
> in the parser so as to allow the original record to be re-built as SFF.
>

I did, though currently it's rather hard to make new entries from scratch.
That said, I can hardly imagine anyone wanting to do this.

> Great. I'd guess Bio::SeqIO integration would be more important
> that SFF output initially.
>

Probably. It looks like it's quite easy, it's just rather underdocumented.

Leon


From l.m.timmermans at students.uu.nl  Tue Dec 20 10:26:11 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Tue, 20 Dec 2011 16:26:11 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <D5B63715-D712-44E4-BBAC-35C721AF2098@illinois.edu>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
	<CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
	<D5B63715-D712-44E4-BBAC-35C721AF2098@illinois.edu>
Message-ID: <CAC1jpXAq3cN0GriHndyQnbpn3v26M1o927i8z_a=x5bQU-iPQg@mail.gmail.com>

On Mon, Dec 19, 2011 at 8:44 PM, Fields, Christopher J <
cjfields at illinois.edu> wrote:

> Kinda joining this a little late, but I think if there is a way to have a
> low-level parser/writer that generically parses the data into simple
> (possibly hash-tagged) data structures, that would be best.  Barring that,
> a very simple class for storing data.  We've found BioPerl objects/classes
> pretty heavy.
>
> (for an example of this, see Heng Li's readfq parser on github, which has
> some stats for Fastq/fasta parsing).
>
> Any way we can separate the parser from object instantiation would enable
> us to optimize the object/class layer and parser/writer layers separately,
> with the possible nice side effect of making the parser more broadly used.
>
> For instance, if someone wanted a faster parser, use the low level,
> otherwise use the higher level (possibly BioPerl-specific) API. Lincoln
> does this do a certain degree with Bio-samtools; I would go further and
> make the bp- and non-bp code in separate dists.
>

A good OO system can actually help make things faster. For example, I'm
unpacking the flowspace and quality data lazily, which made scanning
through an SFF file 2.5-3 times as fast while having marginal extra costs
when you do need them.
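
The lazy unpacking idea can be sketched as follows in core Perl. It assumes flowgram values are stored as unsigned 16-bit big-endian integers and quality scores as single bytes. The class My::LazyRead and its call counter are hypothetical and only serve the demonstration; this is not the Bio::SFF implementation.

```perl
use strict;
use warnings;

package My::LazyRead;    # hypothetical class, not Bio::SFF's actual code

my $unpack_calls = 0;    # counts real decoding work, for demonstration only

sub new {
    my ($class, %args) = @_;
    # Keep the raw bytes; decode nothing yet.
    return bless {
        name      => $args{name},
        raw_flows => $args{raw_flows},
        raw_quals => $args{raw_quals},
    }, $class;
}

sub name { $_[0]{name} }

# Flowgram values are decoded on first access and cached afterwards.
sub flowgram {
    my ($self) = @_;
    $self->{flowgram} ||= do {
        $unpack_calls++;
        [ unpack 'n*', $self->{raw_flows} ];
    };
    return $self->{flowgram};
}

# Quality scores: one unsigned byte per base, also decoded lazily.
sub quality {
    my ($self) = @_;
    $self->{quality} ||= [ unpack 'C*', $self->{raw_quals} ];
    return $self->{quality};
}

sub unpack_calls { $unpack_calls }

package main;

my $read = My::LazyRead->new(
    name      => 'read1',
    raw_flows => pack('n*', 101, 2, 198, 5),
    raw_quals => pack('C*', 37, 37, 30),
);
print $read->name, "\n";           # no decoding has happened yet
print "@{ $read->flowgram }\n";    # decoded now
print "@{ $read->flowgram }\n";    # served from the cache
print My::LazyRead::unpack_calls(), "\n";
```

Scanning a file for read names never pays for the unpack calls, while a caller that does want the flowgram pays for it once per read.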

Leon


From l.m.timmermans at students.uu.nl  Tue Dec 20 10:30:54 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Tue, 20 Dec 2011 16:30:54 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <4EEFD6A9.3010303@illinois.edu>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
	<4EEFD6A9.3010303@illinois.edu>
Message-ID: <CAC1jpXD_sSYoU2DS33Yn99c0WToyXvTY2aJdcS6w-yZ0xfCFMg@mail.gmail.com>

On Tue, Dec 20, 2011 at 1:28 AM, Chris Fields <cjfields at illinois.edu> wrote:

> This is already implemented in Bio::SeqIO I believe.  This is the same
> line of thinking with the FASTQ format, that one can have a
> 'format-variant' combination that (as one might guess) indicates to the
> parser any variation of the parser so logic within the parser can deal with
> it.  You can also pass the '-variant => "foo"' parameter as well IIRC.  You
> would just check the variant with the variant() method.
>

Great. That makes life much easier :-)

Leon


From p.j.a.cock at googlemail.com  Tue Dec 20 10:31:59 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 20 Dec 2011 15:31:59 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXDmU3jF-3G9dnB=-0fCkydqWd18J0eo2nKkSLvKTbOxjQ@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
	<CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
	<CAC1jpXDmU3jF-3G9dnB=-0fCkydqWd18J0eo2nKkSLvKTbOxjQ@mail.gmail.com>
Message-ID: <CAKVJ-_7v+wKQVXkLz_CMJXviYApyirjG9CA89mti5a3N40V8iA@mail.gmail.com>

On Tue, Dec 20, 2011 at 3:25 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> On Mon, Dec 19, 2011 at 6:00 PM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
>>
>> I understand - but make sure you've designed the data structures
>> in the parser so as to allow the original record to be re-built as SFF.
>
> I did, though currently it's rather hard to make new entries from scratch.
> That said, I can hardly imagine anyone wanting to do this.

Typical use cases I've found in using the Biopython SFF code are
filtering an SFF file (taking some records only), and modifying the
clipping values. In both cases, the user isn't creating the SFF
records from scratch.

Peter


From cjfields at illinois.edu  Tue Dec 20 17:40:31 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 20 Dec 2011 22:40:31 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXAq3cN0GriHndyQnbpn3v26M1o927i8z_a=x5bQU-iPQg@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
	<CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
	<D5B63715-D712-44E4-BBAC-35C721AF2098@illinois.edu>,
	<CAC1jpXAq3cN0GriHndyQnbpn3v26M1o927i8z_a=x5bQU-iPQg@mail.gmail.com>
Message-ID: <CE1C3005-EA13-4C4E-A4B5-7F387D0E8E0B@illinois.edu>


On Dec 20, 2011, at 9:26 AM, "Leon Timmermans" <l.m.timmermans at students.uu.nl> wrote:

> On Mon, Dec 19, 2011 at 8:44 PM, Fields, Christopher J <cjfields at illinois.edu> wrote:
>> Kinda joining this a little late, but I think if there is a way to have a low-level parser/writer that generically parses the data into simple (possibly hash-tagged) data structures, that would be best.  Barring that, a very simple class for storing data.  We've found BioPerl objects/classes pretty heavy.
>>
>> (for an example of this, see Heng Li's readfq parser on github, which has some stats for Fastq/fasta parsing).
>>
>> Any way we can separate the parser from object instantiation would enable us to optimize the object/class layer and parser/writer layers separately, with the possible nice side effect of making the parser more broadly used.
>>
>> For instance, if someone wanted a faster parser, use the low level, otherwise use the higher level (possibly BioPerl-specific) API. Lincoln does this do a certain degree with Bio-samtools; I would go further and make the bp- and non-bp code in separate dists.
>
> A good OO system can actually help make things faster. For example, I'm unpacking the flowspace and quality data lazily, which made scanning through an SFF file 2.5-3 times as fast while having marginal extra costs when you do need them.
>
> Leon

Yep, thinking about using the same approach for the Fastq variants.

Chris

Sent from my ancient iPad b/c my laptop's borked


From dgacquer at ulb.ac.be  Wed Dec 21 08:26:07 2011
From: dgacquer at ulb.ac.be (David Gacquer)
Date: Wed, 21 Dec 2011 14:26:07 +0100
Subject: [Bioperl-l] Strange behaviour in the write_seq function for large
	fasta
Message-ID: <4EF1DE6F.4070508@ulb.ac.be>

Dear BioPerl users/developers,

I am facing a strange issue with the $seq_out->write_seq function when
using large fasta files.

I have downloaded the hg19 chromosome 1, and applied the following code 
(basically I wanted to mask some regions in it but the problem also 
appears when copying the sequence without modifications):

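use strict;
use warnings;
use Bio::SeqIO;
use Bio::Seq::LargePrimarySeq;

# Copy the first sequence of the input file ($ARGV[0]) to the output
# file ($ARGV[1]) using the largefasta reader and writer.
sub main {
    my $seq_in  = Bio::SeqIO->new( -format => 'largefasta', -file => $ARGV[0] );
    my $seq_out = Bio::SeqIO->new( -format => 'largefasta', -file => '>' . $ARGV[1] );
    my $seq_obj_in   = $seq_in->next_seq();
    my $modified_seq = $seq_obj_in->seq();    # masking would happen here
    my $seq_obj_out  = Bio::Seq::LargePrimarySeq->new(
        -seq => $modified_seq, -id => $seq_obj_in->id, -desc => $seq_obj_in->desc );
    $seq_out->write_seq($seq_obj_out);
}

main();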

When checking the output fasta file, the sequence of chr1 is 1 bp shorter.

I have noticed that in the original fasta file, each line contains 
exactly 50 nucleotides, while the output of the $seq_out->write_seq 
function contains always 60 characters per line.
chr1 is exactly 249,250,621 bp (4,154,177 * 60 + 1) so to verify that 
the very last base was missing, I created the following fasta files:

chr121.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAG

chr122.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAG

They contain respectively 121 (60*2+1) and 122 (60*2+2) bp, the last 
character being a G. When running the above code:

chr121.out.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

chr122.out.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AG

The output for the 122 bp chromosome is correct (2 lines of 60 bp and 
the last line with 2 bp, AG) but for the 121 bp chromosome, the last 
character is missing (2 lines of 60 bp only, last G is missing).

When replacing -format => 'largefasta' by -format => 'fasta' or writing 
the output without the write_seq function however, the problem is solved.

Am I missing something or is there a problem with the write_seq function 
used with large fasta files? (I am running BioPerl on a Mac under OS X 
Snow Leopard)
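
The symptom above (a base lost only when the length is a multiple of 60 plus 1) is what an off-by-one in a line-wrapping loop would produce. The following core-Perl sketch reproduces that pattern under the assumption of 1-based, inclusive coordinates. The sub wrap_lines is hypothetical, and nothing here confirms that this is what the largefasta writer actually does.

```perl
use strict;
use warnings;

# Wrap a sequence into fixed-width lines using 1-based, inclusive
# coordinates. With $inclusive false the loop condition is '<' instead of
# '<=', which is one hypothetical way to lose the final base; it is NOT
# confirmed to be the cause in the largefasta writer.
sub wrap_lines {
    my ($seq, $width, $inclusive) = @_;
    my $len = length $seq;
    my @lines;
    for (my $start = 1;
         $inclusive ? $start <= $len : $start < $len;
         $start += $width) {
        my $end = $start + $width - 1;
        $end = $len if $end > $len;
        push @lines, substr($seq, $start - 1, $end - $start + 1);
    }
    return @lines;
}

for my $len (121, 122) {
    my $seq  = ('A' x ($len - 1)) . 'G';
    my $good = join '', wrap_lines($seq, 60, 1);
    my $bad  = join '', wrap_lines($seq, 60, 0);
    printf "%d bp: '<=' writes %d bp, '<' writes %d bp\n",
        $len, length($good), length($bad);
}
```

With the '<' condition the 121 bp input comes out as 120 bp while the 122 bp input is written in full, matching the two test files; the same arithmetic applies to chr1, since 249,250,621 = 4,154,177 * 60 + 1.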

Best regards

David

-- 
David Gacquer, Ph. D.

IRIBHM - Universite Libre de Bruxelles
Bldg C, room C.4.117
ULB, Campus Erasme, CP602
808 route de Lennik
B-1070 Brussels
Belgium

Phone: +32-2-555 4187
Fax: +32-2-555 4655
E-mail: dgacquer at ulb.ac.be


From koraydogankaya at gmail.com  Sat Dec 24 03:44:43 2011
From: koraydogankaya at gmail.com (Koray)
Date: Sat, 24 Dec 2011 00:44:43 -0800 (PST)
Subject: [Bioperl-l] exons
Message-ID: <8454dd72-4fed-41c5-977e-83d9300cd68b@z25g2000vbs.googlegroups.com>

I need explicit code for getting the exon sequences of an mRNA or gene
fetched by get_Seq_by_acc or by ID.

In Ensembl this is easy, but here it is not, because many IO modules exist.

For example, how can I get a $gene object like the one below from the
databases (GenBank or Entrez Gene) by accession number or ID?


exons
 Title   : exons()
 Usage   : @exons = $gene->exons();
           @initial_exons = $gene->exons('Initial');
 Function: Get all exon features or all exons of a specified type of this
           gene structure.

           Exon type is treated as a case-insensitive regular expression
           and optional. For consistency, use only the following types:
           initial, internal, terminal, utr, utr5prime, and utr3prime.
           A special and virtual type is 'coding', which refers to all
           types except utr.

           This method basically merges the exons returned by transcripts.

 Returns : An array of Bio::SeqFeature::Gene::ExonI implementing objects.
 Args    : An optional string specifying the type of exon.


From challa_ghanashyam at yahoo.com  Sat Dec 24 15:09:09 2011
From: challa_ghanashyam at yahoo.com (GSC)
Date: Sat, 24 Dec 2011 12:09:09 -0800 (PST)
Subject: [Bioperl-l] re trieve description for a list of gi ids..
Message-ID: <33034438.post@talk.nabble.com>


Hi all:
I am new to Perl. I am working on a script to retrieve the record
description (the name given to a sequence record in GenBank) for a list of GI
IDs. The script works fine for 1,000 IDs, but my list is about 250,000 IDs
long and the script does not work for it. Any suggestions?

GS


From cjfields at illinois.edu  Tue Dec 27 10:03:28 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 27 Dec 2011 15:03:28 +0000
Subject: [Bioperl-l] Strange behaviour in the write_seq function for
 large	fasta
In-Reply-To: <4EF1DE6F.4070508@ulb.ac.be>
References: <4EF1DE6F.4070508@ulb.ac.be>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF0BB29547@CITESMBX4.ad.uillinois.edu>

This is a strange one.  Personally I haven't seen this behavior, but maybe it's OS-dependent?

We'll need more information, particularly what version of BioPerl you are using, the OS, version of perl, etc.  Also, in general to make sure we don't lose track of this issue it is best to submit a bug report:

https://redmine.open-bio.org/projects/bioperl

I'm planning on triaging bugs next week, I could take a look then.

chris
________________________________________
From: bioperl-l-bounces at lists.open-bio.org [bioperl-l-bounces at lists.open-bio.org] on behalf of David Gacquer [dgacquer at ulb.ac.be]
Sent: Wednesday, December 21, 2011 7:26 AM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Strange behaviour in the write_seq function for large      fasta

Dear BioPerl users/developers,

I am facing a strange issue with the $seq_out->write_seq function when
using large fasta files

I have downloaded the hg19 chromosome 1, and applied the following code
(basically I wanted to mask some regions in it but the problem also
appears when copying the sequence without modifications):

sub main{
     my $seq_in  = Bio::SeqIO->new( -format => 'largefasta', -file => $ARGV[0]);
     my $seq_out  = Bio::SeqIO->new( -format => 'largefasta', -file => '>'.$ARGV[1]);
     my $seq_obj_in = $seq_in->next_seq();
     my $modified_seq = $seq_obj_in->seq();
     my $seq_obj_out = Bio::Seq::LargePrimarySeq->new( -seq => $modified_seq,
         -id  => $seq_obj_in->id, -desc => $seq_obj_in->desc);
     $seq_out->write_seq($seq_obj_out);
}

when checking the output fasta file, the sequence of chr1 is 1-bp shorter.

I have noticed that in the original fasta file, each line contains
exactly 50 nucleotides, while the output of the $seq_out->write_seq
function contains always 60 characters per line.
chr1 is exactly 249,250,621 bp (4,154,177 * 60 + 1) so to verify that
the very last base was missing, I created the following fasta files:

chr121.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAG

chr122.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAG

They contain respectively 121 (60*2+1) and 122 (60*2+2) bp, the last
character being a G. When running the above code:

chr121.out.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

chr122.out.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AG

The output for the 122 bp chromosome is correct (2 lines of 60 bp and
the last line with 2 bp, AG) but for the 121 bp chromosome, the last
character is missing (2 lines of 60 bp only, last G is missing).

When replacing -format => 'largefasta' by -format => 'fasta' or writing
the output without the write_seq function however, the problem is solved.
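For comparison, the 60-column wrapping arithmetic can be checked in plain Perl. This is only a sketch of what correct wrapping should produce, not the code inside Bio::SeqIO::largefasta; the helper name wrap_sequence is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): split a sequence string into
# lines of at most $width characters, keeping the final partial line.
sub wrap_sequence {
    my ($seq, $width) = @_;
    $width ||= 60;
    my @lines;
    for (my $pos = 0; $pos < length $seq; $pos += $width) {
        push @lines, substr($seq, $pos, $width);
    }
    return @lines;
}

# 121 bp: two full lines of 60 plus a single trailing base.
my @lines = wrap_sequence(('A' x 120) . 'G');
printf "%d lines, last line: %s\n", scalar @lines, $lines[-1];
```

A writer that only emits a trailing line when more than one character is left over would show exactly the symptom described here, although that is a guess about the cause and not something confirmed in this thread.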

Am I missing something or is there a problem with the write_seq function
used with large fasta files? (I am running BioPerl on a Mac under OS X
Snow Leopard)

Best regards

David

--
David Gacquer, Ph. D.

IRIBHM - Universite Libre de Bruxelles
Bldg C, room C.4.117
ULB, Campus Erasme, CP602
808 route de Lennik
B-1070 Brussels
Belgium

Phone: +32-2-555 4187
Fax: +32-2-555 4655
E-mail: dgacquer at ulb.ac.be

_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l


From jdeuts01 at students.poly.edu  Thu Dec  1 14:09:19 2011
From: jdeuts01 at students.poly.edu (jdeuts01 at students.poly.edu)
Date: Thu, 1 Dec 2011 14:09:19 +0000
Subject: [Bioperl-l] question
Message-ID: <SNT134-W43F83A46574EEDD841600186B10@phx.gbl>


Dear Bioperl,
       This is my first experience with bioperl and I need help please.
1. The version of BioPerl is 1.6.1, and I have also installed Bundle-BioPerl 2.1.8 and mGen 1.03. I was unable to install Bribes and trouchelle DB. Will this prevent the BioPerl package from functioning correctly?
2. The operating platform is windows 7 - 64 bit using ActiveState - Perl v5.12.2
3. The script is as follows:
#!/usr/bin/perl
# Write a script using OOP to write protein sequences to the file fasta.txt.
use strict;
use warnings;
use Bio::SeqIO::fasta;
# Declare and initialize input and output files.
my $protein_fasta = "protein.fa";
my $protein_out = ">fasta.txt";
# Setup objects for input and output.
my $seq_in = Bio::SeqIO->new(-file => "$protein_fasta", -format => 'Fasta');
my $seq_out = Bio::SeqIO->new(-file => "$protein_out", -format => 'fasta');
# Establish while loop using "next_seq" method
# to read in multiple sequences from "protein.fa" file
# one by one until none were remaining.
while (my $seq = $seq_in->next_seq) {
    $seq_out->write_seq($seq);
}
The information is successfully written to the file: fasta.txt.
4. Receiving the following error messages:
Replacement list is longer than search list at C:/Perl64/site/lib/Bio/Range.pm line 251.
Subroutine _initialize redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 92.
Subroutine next_seq redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 112.
Subroutine write_seq redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 193.
Subroutine width redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 272.
Subroutine preferred_id_type redefined at C:/Perl64/site/lib/Bio\SeqIO\fasta.pm line 295.
Thanks in advance for your help.
John Deutsch

From jboddu at illinois.edu  Thu Dec  1 16:38:00 2011
From: jboddu at illinois.edu (Boddu, Jayanand)
Date: Thu, 1 Dec 2011 16:38:00 +0000
Subject: [Bioperl-l] Chromosome coordinates
Message-ID: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>

Hello
I am a newbie to Perl scripts.
I have a file with short reads mapped to the MAIZE genome.
The format is simple BLASTN tabular output:
READ_ID  Chr    % Similarity  Alignment  Mismatches  Gaps  READ Start  READ End  Chr Start  Chr End    E Value  Score
READ1    chrPt  100           17         0           0     1           17        35021      35037      0.21     34.2
READ1    chr10  100           17         0           0     1           17        128587356  128587372  0.21     34.2
READ1    chr6   100           17         0           0     1           17        160769803  160769787  0.21     34.2
READ1    chr5   100           17         0           0     1           17        172103083  172103067  0.21     34.2
READ1    chr4   100           17         0           0     1           17        213173683  213173699  0.21     34.2
READ1    chr3   100           17         0           0     1           17        23689132   23689116   0.21     34.2
READ2    chr8   100           17         0           0     1           17        161048603  161048587  0.21     34.2
READ2    chr6   100           17         0           0     1           17        155768884  155768868  0.21     34.2
READ2    chr5   100           17         0           0     1           17        32958812   32958828   0.21     34.2
READ2    chr3   100           17         0           0     1           17        212451090  212451074  0.21     34.2
READ2    chr2   100           17         0           0     1           17        2046449    2046465    0.21     34.2
READ2    chr1   100           17         0           0     1           17        223233801  223233785  0.21     34.2
READ2    chr1   100           17         0           0     1           17        277573037  277573021  0.21     34.2


As expected, the same read maps to multiple places on the same or different chromosomes.
I have a GFF file with annotated coordinates.
I would like to run a Perl script to find out which READS fall within the GENES in the GFF file and which do not.
The anticipated script should:

1.       Take the READ coordinates on the genome (by chromosome);

2.       Go to the GFF file;

3.       Find the Chromosome;

4.       Find the GENE (by coordinates);

5.       and report READ, its coordinates, Chromosome, GENE, and its coordinates.

The output doesn't need to be in the same order.
After this, I guess I could use a simple Microsoft ACCESS query to pull out the READS that are not mapped to GENES.
I would greatly appreciate it if anyone has a script that does more or less a similar job.
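The five steps above could be sketched in plain Perl roughly as follows. This is a minimal illustration under assumptions: the GFF lines are tab-separated with 'gene' in the type column, the helper names index_genes and genes_for_hit are hypothetical, and the two gene records are invented so the example has something to match.

```perl
use strict;
use warnings;

# Hypothetical helper: index the 'gene' rows of some GFF lines by chromosome.
sub index_genes {
    my @gff_lines = @_;
    my %genes;
    for my $line (@gff_lines) {
        next if $line =~ /^#/ || $line !~ /\S/;    # skip comments and blanks
        my ($chr, undef, $type, $start, $end, undef, undef, undef, $attr)
            = split /\t/, $line;
        next unless defined $type && $type eq 'gene';
        push @{ $genes{$chr} },
            { start => $start, end => $end, attr => $attr // '' };
    }
    return \%genes;
}

# Hypothetical helper: return every indexed gene overlapping one BLAST hit.
sub genes_for_hit {
    my ($genes, $chr, $hit_start, $hit_end) = @_;
    # Minus-strand hits are listed with start > end, so normalise first.
    ($hit_start, $hit_end) = ($hit_end, $hit_start) if $hit_start > $hit_end;
    return grep { $_->{start} <= $hit_end && $_->{end} >= $hit_start }
           @{ $genes->{$chr} || [] };
}

# Invented gene records, for illustration only.
my $genes = index_genes(
    "chr6\texample\tgene\t160769000\t160770000\t.\t+\t.\tID=geneA",
    "chr5\texample\tgene\t1000\t2000\t.\t-\t.\tID=geneB",
);

# One row from the BLASTN data in this message (READ1 on chr6).
my @row = split ' ', "READ1 chr6 100 17 0 0 1 17 160769803 160769787 0.21 34.2";
for my $gene (genes_for_hit($genes, @row[1, 8, 9])) {
    print join("\t", @row[0, 1, 8, 9], @{$gene}{qw(attr start end)}), "\n";
}
```

The linear scan over each chromosome's genes is fine for a quick check; for a full genome annotation an indexed store or a dedicated interval tool is likely to scale better.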

Thanks
Jay


From scott at scottcain.net  Thu Dec  1 16:59:56 2011
From: scott at scottcain.net (Scott Cain)
Date: Thu, 1 Dec 2011 11:59:56 -0500
Subject: [Bioperl-l] Chromosome coordinates
In-Reply-To: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>
References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>
Message-ID: <CA+JTaoy0o1CXCm5EZi_4K5vTAaW9kFJUFVXg1rWYB84ZnN7+5A@mail.gmail.com>

Hi Jay,

Since the maize GFF file is likely to be fairly large, I would consider
putting it in a database, using either Bio::DB::GFF if it is GFF2 or
Bio::DB::SeqFeature::Store if it is GFF3.  Then you can use the methods
that come along with either of those modules to search regions for
genes.  They both support a get_features_by_location method, so you could
get the range for each of the regions you want to look at, and check the
database with that method to see if anything is there.
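A rough sketch of that approach for the GFF3 case might look like the following. It assumes a GFF3 file whose path is given on the command line and uses the in-memory adaptor purely for brevity (for a genome-sized file a DBI-backed adaptor would be the more realistic choice). The helper names are hypothetical, and the modules are loaded on demand so the file compiles even where BioPerl is not installed.

```perl
use strict;
use warnings;

# Open a feature store over a GFF3 file (path supplied by the caller).
sub open_gff3_db {
    my ($path) = @_;
    require Bio::DB::SeqFeature::Store;
    return Bio::DB::SeqFeature::Store->new(
        -adaptor => 'memory',
        -dsn     => $path,
    );
}

# BLAST lists minus-strand hits with start > end; the query wants start <= end.
sub ordered_range {
    my ($start, $end) = @_;
    return $start <= $end ? ($start, $end) : ($end, $start);
}

# Return the gene features overlapping one region.
sub genes_in_region {
    my ($db, $chr, $start, $end) = @_;
    ($start, $end) = ordered_range($start, $end);
    return grep { $_->primary_tag eq 'gene' }
           $db->get_features_by_location(
               -seq_id => $chr, -start => $start, -end => $end);
}

if (@ARGV) {
    my $db = open_gff3_db($ARGV[0]);
    # Coordinates of one READ1 hit from the original post.
    for my $gene (genes_in_region($db, 'chr6', 160769803, 160769787)) {
        print join("\t", $gene->seq_id, $gene->start, $gene->end,
                   $gene->display_name), "\n";
    }
}
```

A region that returns an empty list is a read hit outside any annotated gene, which covers the "and that are not" half of the question.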

Scott


On Thu, Dec 1, 2011 at 11:38 AM, Boddu, Jayanand <jboddu at illinois.edu> wrote:

> Hello
> I am newbie to Perl scripts.
> I have a file with short reads mapped to the MAIZE genome
> The format is a simple BLASTN output.
> READ_ID  Chr    % Similarity  Alignment  Mismatches  Gaps  READ Start  READ End  Chr Start  Chr End    E Value  Score
> READ1    chrPt  100           17         0           0     1           17        35021      35037      0.21     34.2
> READ1    chr10  100           17         0           0     1           17        128587356  128587372  0.21     34.2
> READ1    chr6   100           17         0           0     1           17        160769803  160769787  0.21     34.2
> READ1    chr5   100           17         0           0     1           17        172103083  172103067  0.21     34.2
> READ1    chr4   100           17         0           0     1           17        213173683  213173699  0.21     34.2
> READ1    chr3   100           17         0           0     1           17        23689132   23689116   0.21     34.2
> READ2    chr8   100           17         0           0     1           17        161048603  161048587  0.21     34.2
> READ2    chr6   100           17         0           0     1           17        155768884  155768868  0.21     34.2
> READ2    chr5   100           17         0           0     1           17        32958812   32958828   0.21     34.2
> READ2    chr3   100           17         0           0     1           17        212451090  212451074  0.21     34.2
> READ2    chr2   100           17         0           0     1           17        2046449    2046465    0.21     34.2
> READ2    chr1   100           17         0           0     1           17        223233801  223233785  0.21     34.2
> READ2    chr1   100           17         0           0     1           17        277573037  277573021  0.21     34.2
>
> As expected the same read maps to multiple places on the same/different
> chromosome.
> I have a GFF file with annotated coordinates.
> I would like to run a PERL script to find out READS that are within the
> GENES in the GFF file and that are not.
> The anticipated script should;
>
> 1.       Take the READ coordinates on the genome (by chromosome);
>
> 2.       Go the GFF file;
>
> 3.       Find the Chromosome;
>
> 4.       Find the GENE (by coordinates);
>
> 5.       and report READ-its coordinates-Chromosome-GENE-and its
> coordinates.
>
> It doesn't need to be in the same order.
> After this, I guess I could use simple Microsoft ACCESS query to pull out
> READS that are not mapped to the GENEs.
> I would greatly appreciate if anyone can has a script that more or less
> similar job.
>
> Thanks
> Jay
>
>


-- 
------------------------------------------------------------------------
Scott Cain, Ph. D.                               scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research


From jason.stajich at gmail.com  Thu Dec  1 17:31:29 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Thu, 1 Dec 2011 09:31:29 -0800
Subject: [Bioperl-l] Chromosome coordinates
In-Reply-To: <CA+JTaoy0o1CXCm5EZi_4K5vTAaW9kFJUFVXg1rWYB84ZnN7+5A@mail.gmail.com>
References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>
	<CA+JTaoy0o1CXCm5EZi_4K5vTAaW9kFJUFVXg1rWYB84ZnN7+5A@mail.gmail.com>
Message-ID: <470A08F7-E238-44E1-B0D1-69FDBC0BA2A3@gmail.com>

You might try BEDTools: its intersectBed program can do a lot of what you describe from the command line.

Jason
On Dec 1, 2011, at 8:59 AM, Scott Cain wrote:

> Hi Jay,
> 
> Since the maize GFF file is likely to be fairly large, I would consider
> putting it in a database, using either Bio::DB::GFF if it is GFF2 or
> Bio::DB::SeqFeature::Store if it is gff3.  Then you can use the methods
> that come along with either of those modules to search regions for for
> genes.  They both support a get_features_by_location method, so you could
> get the range for each of the regions you want to look at, and check the
> database with that method to see if anything is there.
> 
> Scott
> 
> 
> On Thu, Dec 1, 2011 at 11:38 AM, Boddu, Jayanand <jboddu at illinois.edu> wrote:
> 
>> Hello
>> I am newbie to Perl scripts.
>> I have a file with short reads mapped to the MAIZE genome
>> The format is a simple BLASTN output.
>> READ_ID  Chr    % Similarity  Alignment  Mismatches  Gaps  READ Start  READ End  Chr Start  Chr End    E Value  Score
>> READ1    chrPt  100           17         0           0     1           17        35021      35037      0.21     34.2
>> READ1    chr10  100           17         0           0     1           17        128587356  128587372  0.21     34.2
>> READ1    chr6   100           17         0           0     1           17        160769803  160769787  0.21     34.2
>> READ1    chr5   100           17         0           0     1           17        172103083  172103067  0.21     34.2
>> READ1    chr4   100           17         0           0     1           17        213173683  213173699  0.21     34.2
>> READ1    chr3   100           17         0           0     1           17        23689132   23689116   0.21     34.2
>> READ2    chr8   100           17         0           0     1           17        161048603  161048587  0.21     34.2
>> READ2    chr6   100           17         0           0     1           17        155768884  155768868  0.21     34.2
>> READ2    chr5   100           17         0           0     1           17        32958812   32958828   0.21     34.2
>> READ2    chr3   100           17         0           0     1           17        212451090  212451074  0.21     34.2
>> READ2    chr2   100           17         0           0     1           17        2046449    2046465    0.21     34.2
>> READ2    chr1   100           17         0           0     1           17        223233801  223233785  0.21     34.2
>> READ2    chr1   100           17         0           0     1           17        277573037  277573021  0.21     34.2
>> 
>> As expected the same read maps to multiple places on the same/different
>> chromosome.
>> I have a GFF file with annotated coordinates.
>> I would like to run a PERL script to find out READS that are within the
>> GENES in the GFF file and that are not.
>> The anticipated script should;
>> 
>> 1.       Take the READ coordinates on the genome (by chromosome);
>> 
>> 2.       Go the GFF file;
>> 
>> 3.       Find the Chromosome;
>> 
>> 4.       Find the GENE (by coordinates);
>> 
>> 5.       and report READ-its coordinates-Chromosome-GENE-and its
>> coordinates.
>> 
>> It doesn't need to be in the same order.
>> After this, I guess I could use simple Microsoft ACCESS query to pull out
>> READS that are not mapped to the GENEs.
>> I would greatly appreciate if anyone can has a script that more or less
>> similar job.
>> 
>> Thanks
>> Jay
>> 
>> 
> 
> 
> 
> -- 
> ------------------------------------------------------------------------
> Scott Cain, Ph. D.                               scott at scottcain dot net
> GMOD Coordinator (http://gmod.org/)                     216-392-3087
> Ontario Institute for Cancer Research


From jovel_juan at hotmail.com  Thu Dec  1 17:36:32 2011
From: jovel_juan at hotmail.com (Juan Jovel)
Date: Thu, 1 Dec 2011 17:36:32 +0000
Subject: [Bioperl-l] Error when using SearchIO
In-Reply-To: <CA+JTaoy0o1CXCm5EZi_4K5vTAaW9kFJUFVXg1rWYB84ZnN7+5A@mail.gmail.com>
References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>,
	<CA+JTaoy0o1CXCm5EZi_4K5vTAaW9kFJUFVXg1rWYB84ZnN7+5A@mail.gmail.com>
Message-ID: <COL102-W29B31BD46108B9F1D9EF46FAB10@phx.gbl>


Hello Everybody!
I have been running endless standalone BLAST jobs these days on two different clusters. The first one runs Rock Linux, the other one Debian. On the Rock Linux cluster, whenever I use SearchIO to parse my blastp or blastx reports, I invariably get this message:
"Replacement list is longer than search list at /share/apps/lib/perl5/site_perl/5.14.1/Bio/Range.pm line 251"
What does it mean? Would it have any effect on my parsing results?
Thanks, 
JUAN 		 	   		  


From cjfields at illinois.edu  Thu Dec  1 19:03:45 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 1 Dec 2011 19:03:45 +0000
Subject: [Bioperl-l] Error when using SearchIO
In-Reply-To: <COL102-W29B31BD46108B9F1D9EF46FAB10@phx.gbl>
References: <1DE33F161A86CD479E40A2AF7FD4E8D004903111@CITESMBX2.ad.uillinois.edu>,
	<CA+JTaoy0o1CXCm5EZi_4K5vTAaW9kFJUFVXg1rWYB84ZnN7+5A@mail.gmail.com>
	<COL102-W29B31BD46108B9F1D9EF46FAB10@phx.gbl>
Message-ID: <7233AC75-E03B-401A-8D0A-260ED76956A4@illinois.edu>

On Dec 1, 2011, at 11:36 AM, Juan Jovel wrote:

> Hello Everybody!
> I have been running endless standalone Blasts these days, in two different clusters. The first one runs on Rock Linux, while the other one on Devian. In the Rock Linux one, when I use SeachIO to parse my blastp or blastx reports, I get invariably a message:
> "Replacement list is longer than search list at /share/apps/lib/perl5/site_perl/5.14.1/Bio/Range.pm line 251"
> What it does mean? Would it have any effect on my parsing results?
> Thanks, 
> JUAN

This is a bug that has been fixed, I think in the latest BioPerl release (1.6.901).  A transliteration error produced this warning, but otherwise it's harmless. The warning depends on the perl version; I think it pops up in perl 5.12 and up.  This user post (about halfway down) runs into the same issue: http://www.sysarchitects.com/bioperl.
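To see what kind of code triggers this warning, here is a small plain-Perl sketch. The tr/// statements are made up for illustration and are not the actual line from Bio/Range.pm; the helper warnings_from is likewise hypothetical.

```perl
use strict;
use warnings;

# Compile a snippet at run time and collect any warnings it triggers.
sub warnings_from {
    my ($code) = @_;
    my @seen;
    local $SIG{__WARN__} = sub { push @seen, @_ };
    eval "use warnings; $code; 1" or die $@;
    return @seen;
}

# Illustrative only: two characters to search for, three to replace them with.
my @bad  = warnings_from(q{my $s = '+'; $s =~ tr/+-/123/;});
# Same statement with lists of equal length.
my @good = warnings_from(q{my $s = '+'; $s =~ tr/+-/12/;});

print "mismatched lists: ", scalar(@bad),  " warning(s)\n";
print "matched lists:    ", scalar(@good), " warning(s)\n";
```

The extra replacement characters are simply ignored, which is why the warning does not change parsing results.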

chris


From David.Messina at sbc.su.se  Thu Dec  1 22:02:20 2011
From: David.Messina at sbc.su.se (Dave Messina)
Date: Thu, 1 Dec 2011 23:02:20 +0100
Subject: [Bioperl-l] re trieving blast multiple alignment in fasta form
In-Reply-To: <32886592.post@talk.nabble.com>
References: <32886592.post@talk.nabble.com>
Message-ID: <CAM3TQQURpsF8+Tq2AQ6yAzmDVuDnip-6mFAfGgUdR5ScTus4YA@mail.gmail.com>

Hi Eric,

Wait, do you want multiple pairwise alignments in your output FASTA file,
or a single multiple alignment of your query and all the hits?

If the former, get_aln() will give you one pairwise alignment per hsp, but
you'll need to move the output file creation statement (my $alnIO = ...)
before the loops so it gets created only once. Then, when you do the write
statement ($alnIO->write_aln($aln);), all of the alignments will go to the
same file.

If on the other hand you'd like to have a multiple alignment between a
query and all of its hits, you'll have to take the IDs of the hits, pull
the corresponding sequences out of the database, and then run a multiple
alignment algorithm on them.
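For the first case, the loop structure could look roughly like this. It is a sketch and not Eric's script: the sub name and file names are hypothetical, and the modules are loaded on demand so the file compiles even where BioPerl is not installed.

```perl
use strict;
use warnings;

# Write one pairwise alignment per HSP from a BLAST report into a single
# FASTA alignment file; returns the number of alignments written.
sub write_hsp_alignments {
    my ($blast_report, $fasta_out) = @_;
    require Bio::SearchIO;
    require Bio::AlignIO;

    my $searchio = Bio::SearchIO->new(-format => 'blast',
                                      -file   => $blast_report);
    # Created once, before the loops, so every alignment lands in one file.
    my $alnIO = Bio::AlignIO->new(-format => 'fasta',
                                  -file   => ">$fasta_out");

    my $written = 0;
    while (my $result = $searchio->next_result) {
        while (my $hit = $result->next_hit) {
            while (my $hsp = $hit->next_hsp) {
                $alnIO->write_aln($hsp->get_aln);    # one alignment per HSP
                $written++;
            }
        }
    }
    return $written;
}

if (@ARGV == 2) {
    my $count = write_hsp_alignments(@ARGV);
    print "wrote $count pairwise alignments to $ARGV[1]\n";
}
```

If the output object were created inside the innermost loop with a '>' file name, each new alignment would overwrite the previous one, which is the behaviour the advice above avoids.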


Dave


From scuoppo at gmail.com  Fri Dec  2 22:50:28 2011
From: scuoppo at gmail.com (Claudio Scuoppo)
Date: Fri, 2 Dec 2011 17:50:28 -0500
Subject: [Bioperl-l] List of genes from genomic intervals
Message-ID: <CAEz0Wfv_yj6g5rZEbmj+UhOkJX8ryzObEO7N6rqvLRMV6Yd_Pg@mail.gmail.com>

Hi,

I am new to BioPerl. I was wondering what's the best strategy to get
the genes contained in a series of human genomic intervals.
Basically, I have a table with:

Chromosome Start End

Which module should I be looking at?
Thanks,
Claudio


From awitney at sgul.ac.uk  Mon Dec  5 11:09:39 2011
From: awitney at sgul.ac.uk (Adam Witney)
Date: Mon, 5 Dec 2011 11:09:39 +0000
Subject: [Bioperl-l] Bio::Graphics imagemap and padding
Message-ID: <44A27378-CC4B-4396-817E-AA31004847C7@sgul.ac.uk>

Hi,

Image maps seem to be out of position if you use padding in the Panel, like this:

my $panel = Bio::Graphics::Panel->new( ... -pad_left  => 20, -pad_right => 20 ... );

Without these options, the image map is fine. Is this a known issue?

Also on a side note, I noticed that when using Bio::Graphics with Dancer, some of the CGI code was blocking somewhere (I found a reference to a similar problem with CGI and Catalyst), swapping CGI with HTML::Entities fixes it:

sub create_web_map {
...
	eval "require HTML::Entities" unless HTML::Entities->can('encode_entities');
...
	my $title  = HTML::Entities::encode_entities($self->make_link($tr,$feature,1));
 	my $target = HTML::Entities::encode_entities($self->make_link($tgr,$feature,1));
...
}

Thanks

Adam


From momin.amin at gmail.com  Mon Dec  5 23:00:23 2011
From: momin.amin at gmail.com (Amin Momin)
Date: Mon, 5 Dec 2011 15:00:23 -0800 (PST)
Subject: [Bioperl-l] SimpleAlign and consensus_string
Message-ID: <a08b3694-1cc6-4f5d-ad80-1306c5b777f7@v5g2000yqn.googlegroups.com>

Hi ,

I am generating a consensus sequence by aligning two protein homologs
using Bio::Tools::Run::Alignment::TCoffee. However, I am unable to
understand the criteria that the consensus_string() method of SimpleAlign
uses to determine the consensus at positions with dissimilar amino acids or
nucleotides. Also, how would the % cutoff provided to
consensus_string() affect the outcome?


Thanks,
Amin


From jason.stajich at gmail.com  Mon Dec  5 23:58:59 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Mon, 5 Dec 2011 15:58:59 -0800
Subject: [Bioperl-l] SimpleAlign and consensus_string
In-Reply-To: <a08b3694-1cc6-4f5d-ad80-1306c5b777f7@v5g2000yqn.googlegroups.com>
References: <a08b3694-1cc6-4f5d-ad80-1306c5b777f7@v5g2000yqn.googlegroups.com>
Message-ID: <4A59F912-BEE6-48D8-91EC-EAACB293A574@gmail.com>

There are several methods that do related things. 

Below is the documentation for the method - it allows a more strict calling besides majority rule voting for the position - if you set the cutoff at 50% then the called AA or NT has to appear at least 50% of the time. There is also an IUPAC consensus string which is useful for DNA consensus generating patterns. 

If you want a summary as to whether or not there are only conservative amino acid changes in the column use match_line method which generates the similar string that CLUSTALW reports as a summary string for each column.

=head2 consensus_string

 Title     : consensus_string
 Usage     : $str = $ali->consensus_string($threshold_percent)
 Function  : Makes a strict consensus
 Returns   : Consensus string
 Argument  : Optional threshold ranging from 0 to 100.
             The consensus residue has to appear at least threshold %
             of the sequences at a given location, otherwise a '?'
             character will be placed at that location.
             (Default value = 0%)

=cut
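To make the threshold rule concrete, here is a plain-Perl sketch that applies the same idea to a few aligned strings. It is not BioPerl's implementation: the sub name is hypothetical, ties are broken alphabetically here, and gaps are counted like any other character.

```perl
use strict;
use warnings;

# Thresholded consensus over equal-length aligned strings: per column, take
# the most frequent character if it reaches the threshold, otherwise '?'.
sub threshold_consensus {
    my ($threshold_percent, @rows) = @_;
    my $consensus = '';
    for my $col (0 .. length($rows[0]) - 1) {
        my %count;
        $count{ substr($_, $col, 1) }++ for @rows;
        # Most frequent first; alphabetical order breaks ties (an assumption).
        my ($best) = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
        my $percent = 100 * $count{$best} / scalar(@rows);
        $consensus .= $percent >= $threshold_percent ? $best : '?';
    }
    return $consensus;
}

my @aln = qw(MKVL MKIL MRIL MKIF);
print threshold_consensus(0,   @aln), "\n";    # every column gets a call
print threshold_consensus(100, @aln), "\n";    # only unanimous columns survive
```

With only two sequences, as in the question, a column where the residues differ has each residue at 50%, so in this sketch any cutoff above 50 turns that column into '?'.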

On Dec 5, 2011, at 3:00 PM, Amin Momin wrote:

> Hi ,
> 
> I am generating a consensus sequence by aligning two protein homologs
> using Bio::Tools::Run::Alignment::TCoffee. However, I am unable to
> understand the criteria consensus_string() method of simpleAlign uses
> to determine the consensus at position with dissimilar aminoacids/
> nucleotide. Also how would the % cutoffs provided to
> consensus_string() affect the outcome.
> 
> 
> Thanks,
> Amin


From wenbinmei at gmail.com  Tue Dec  6 16:09:35 2011
From: wenbinmei at gmail.com (wenbin mei)
Date: Tue, 6 Dec 2011 11:09:35 -0500
Subject: [Bioperl-l] revcom the multiple sequence alignment
Message-ID: <CAHdrE2RfvOCy_3aUP_nD9Gp=VSmqLh52wh0vQmgKmaebVgyL-w@mail.gmail.com>

Hi,

I have a question about reverse-complementing a multiple sequence alignment.
One way is to convert the alignment to FASTA and revcom the individual
sequences. I wonder whether there is an easy way to revcom the multiple
sequence alignment as a whole.  Thank you for your help.

-best,
wenbin


From jason.stajich at gmail.com  Tue Dec  6 17:40:37 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Tue, 6 Dec 2011 09:40:37 -0800
Subject: [Bioperl-l] revcom the multiple sequence alignment
In-Reply-To: <CAHdrE2RfvOCy_3aUP_nD9Gp=VSmqLh52wh0vQmgKmaebVgyL-w@mail.gmail.com>
References: <CAHdrE2RfvOCy_3aUP_nD9Gp=VSmqLh52wh0vQmgKmaebVgyL-w@mail.gmail.com>
Message-ID: <CAAC48B9-371B-4446-897B-B1D8968CA7EA@gmail.com>

I think this would work to update it in place though I haven't tried it myself

for my $seq ( $aln->each_seq ) {
 $seq->seq( $seq->revcom->seq );
}
$out->write_aln($aln);

This may also work - not entirely sure if there is any extra work done on the meta data (start/end) of the Seq object when this is done.  You may want to flip start/end for the sequences (the seqs are Bio::LocatableSeq objects) explicitly if not. Or you may not care about those data and can ignore.

   $seq = $seq->revcom
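Why reversing every row keeps the alignment intact can be seen on plain gapped strings. This is a sketch outside BioPerl with a hypothetical helper name; it handles only A, C, G and T and leaves gaps and any other characters unchanged.

```perl
use strict;
use warnings;

# Reverse-complement one gapped alignment row. Gap characters are left
# untouched, so the columns still line up after every row is processed.
sub revcom_aligned {
    my ($row) = @_;
    my $rc = reverse $row;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

my @rows = ('AAC-GT', 'AACTG-');
print "$_\n" for map { revcom_aligned($_) } @rows;
```

Each gap moves to the mirrored column in every row at once, so the rows stay aligned; only the start/end coordinates need separate attention, as noted above.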

Jason
On Dec 6, 2011, at 8:09 AM, wenbin mei wrote:

> Hi,
> 
> I have a question about revcom the multiple sequence alignment. One way I
> can do convert the format into fasta and revcom individual sequences. I
> wonder is there a easy way to convert the multiple sequence alignment as a
> whole.  Thank you for help.
> 
> -best,
> wenbin


From wenbinmei at gmail.com  Tue Dec  6 17:51:18 2011
From: wenbinmei at gmail.com (wenbin mei)
Date: Tue, 6 Dec 2011 12:51:18 -0500
Subject: [Bioperl-l] revcom the multiple sequence alignment
In-Reply-To: <CAAC48B9-371B-4446-897B-B1D8968CA7EA@gmail.com>
References: <CAHdrE2RfvOCy_3aUP_nD9Gp=VSmqLh52wh0vQmgKmaebVgyL-w@mail.gmail.com>
	<CAAC48B9-371B-4446-897B-B1D8968CA7EA@gmail.com>
Message-ID: <CAHdrE2TqBxYN+wPC=LH433GwMt3sp5oas_OQhwWY9DbfbjdCpg@mail.gmail.com>

I think I did not explain my question clearly. I extract individual
gene alignments from a whole-genome alignment. Since some genes are on the
reverse strand, I want to revcom those gene alignments. Here is part of my
script. I can read the strand information from another file.

my $newstart = $refseq->column_from_residue_number($start);
my $newend = $refseq->column_from_residue_number($end);
$seq{$genename} = $aln->slice($newstart, $newend);


Any suggestion on how to revcom the gene alignments on the minus strand
would be helpful. Thank you.

-best,
wenbin


On Tue, Dec 6, 2011 at 12:40 PM, Jason Stajich <jason.stajich at gmail.com> wrote:

> I think this would work to update it in place though I haven't tried it
> myself
>
> for my $seq ( $aln->each_seq ) {
>  $seq->seq( $seq->revcom->seq );
> }
> $out->write_aln($aln);
>
> This may also work - not entirely sure if there is any extra work done on
> the meta data (start/end) of the Seq object when this is done.  You may
> want to flip start/end for the sequences (the seqs are Bio::LocatableSeq
> objects) explicitly if not. Or you may not care about those data and can
> ignore.
>
>   $seq = $seq->revcom
>
> Jason
> On Dec 6, 2011, at 8:09 AM, wenbin mei wrote:
>
> > Hi,
> >
> > I have a question about revcom the multiple sequence alignment. One way I
> > can do convert the format into fasta and revcom individual sequences. I
> > wonder is there a easy way to convert the multiple sequence alignment as a
> > whole.  Thank you for help.
> >
> > -best,
> > wenbin
>
>


-- 
Wenbin Mei Ph.D. Student
Dr. Brad Barbazuk's Lab
Department of Biology
University of Florida
509-899-3067
wmei at ufl.edu <wmei at ufl.edu>


From kellert at ohsu.edu  Tue Dec  6 18:21:39 2011
From: kellert at ohsu.edu (Tom Keller)
Date: Tue, 6 Dec 2011 10:21:39 -0800
Subject: [Bioperl-l] Bioperl-l Digest, Vol 104, Issue 3
In-Reply-To: <mailman.3.1322931604.28955.bioperl-l@lists.open-bio.org>
References: <mailman.3.1322931604.28955.bioperl-l@lists.open-bio.org>
Message-ID: <B68BC6F2-8C57-4749-902D-3232B0DA6113@ohsu.edu>

I'd start by looking for the section "Searching for genes in genomic DNA" in the HOWTO:Beginners - BioPerl website.

Thomas (Tom) Keller, PhD
kellert at ohsu.edu
503.494.2442
6588 R Jones Hall (BSc/CROET)
MMI DNA Services
Member of OHSU Shared Resources

On Dec 3, 2011, at 9:00 AM, <bioperl-l-request at lists.open-bio.org> <bioperl-l-request at lists.open-bio.org> wrote:

> Message: 1
> Date: Fri, 2 Dec 2011 17:50:28 -0500
> From: Claudio Scuoppo <scuoppo at gmail.com>
> Subject: [Bioperl-l] List of genes from genomic intervals
> To: bioperl-l at lists.open-bio.org
> Message-ID:
> 	<CAEz0Wfv_yj6g5rZEbmj+UhOkJX8ryzObEO7N6rqvLRMV6Yd_Pg at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> Hi,
> 
> I am new to BioPerl. I was wondering what`s the best strategy to get
> the genes contained in a a series of human genomic interval.
> Basically, I have a table with:
> 
> Chromosome Start End
> 
> Which module should I be looking at?
> Thanks,
> Claudio
> 
> 


From wenbinmei at gmail.com  Tue Dec  6 22:54:51 2011
From: wenbinmei at gmail.com (wenbin mei)
Date: Tue, 6 Dec 2011 17:54:51 -0500
Subject: [Bioperl-l] revcom the multiple sequence alignment
In-Reply-To: <CAAC48B9-371B-4446-897B-B1D8968CA7EA@gmail.com>
References: <CAHdrE2RfvOCy_3aUP_nD9Gp=VSmqLh52wh0vQmgKmaebVgyL-w@mail.gmail.com>
	<CAAC48B9-371B-4446-897B-B1D8968CA7EA@gmail.com>
Message-ID: <CAHdrE2S33zcuchbSXuH2NwM5gM-=BnxVx9xA13ye18gPi2Mtcg@mail.gmail.com>

Figured out! Thanks for help.

-best,
wenbin


On Tue, Dec 6, 2011 at 12:40 PM, Jason Stajich <jason.stajich at gmail.com> wrote:

> I think this would work to update it in place though I haven't tried it
> myself
>
> for my $seq ( $aln->each_seq ) {
>  $seq->seq( $seq->revcom->seq );
> }
> $out->write_aln($aln);
>
> This may also work - not entirely sure if there is any extra work done on
> the meta data (start/end) of the Seq object when this is done.  You may
> want to flip start/end for the sequences (the seqs are Bio::LocatableSeq
> objects) explicitly if not. Or you may not care about those data and can
> ignore.
>
>   $seq = $seq->revcom
>
> Jason
> On Dec 6, 2011, at 8:09 AM, wenbin mei wrote:
>
> > Hi,
> >
> > I have a question about reverse-complementing a multiple sequence
> > alignment. One way I can do it is to convert the format into fasta and
> > revcom the individual sequences. I wonder whether there is an easy way
> > to convert the multiple sequence alignment as a whole.  Thank you for
> > your help.
> >
> > -best,
> > wenbin


-- 
Wenbin Mei Ph.D. Student
Dr. Brad Barbazuk's Lab
Department of Biology
University of Florida
509-899-3067
wmei at ufl.edu <wmei at ufl.edu>
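
The in-place approach suggested in this thread can be written out as a
complete script. This is a minimal sketch and, like the original
suggestion, not verified against real data: the 'clustalw' format and the
file names on the command line are placeholders, and revcom_alignment is a
helper name introduced here for illustration. It leaves the start/end
coordinates of the Bio::LocatableSeq objects untouched.

```perl
use strict;
use warnings;
use Bio::AlignIO;

# Hypothetical helper: reverse-complement every row of an alignment in place,
# using the loop suggested in this thread. Coordinates (start/end) are not
# adjusted, so flip them yourself if you depend on them.
sub revcom_alignment {
    my ($aln) = @_;
    for my $seq ( $aln->each_seq ) {
        $seq->seq( $seq->revcom->seq );
    }
    return $aln;
}

# Placeholder usage: perl revcom_aln.pl in.aln out.aln
my ( $infile, $outfile ) = @ARGV;
if ( defined $infile and defined $outfile ) {
    my $in  = Bio::AlignIO->new( -file => $infile,     -format => 'clustalw' );
    my $out = Bio::AlignIO->new( -file => ">$outfile", -format => 'clustalw' );
    while ( my $aln = $in->next_aln ) {
        $out->write_aln( revcom_alignment($aln) );
    }
}
```

Gap characters stay in the sequence string, so the columns of the
alignment are reversed together with the residues.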


From momin.amin at gmail.com  Tue Dec  6 17:37:16 2011
From: momin.amin at gmail.com (Amin Momin)
Date: Tue, 6 Dec 2011 11:37:16 -0600
Subject: [Bioperl-l] SimpleAlign and consensus_string
In-Reply-To: <4A59F912-BEE6-48D8-91EC-EAACB293A574@gmail.com>
References: <a08b3694-1cc6-4f5d-ad80-1306c5b777f7@v5g2000yqn.googlegroups.com>
	<4A59F912-BEE6-48D8-91EC-EAACB293A574@gmail.com>
Message-ID: <CAA0DaRhm+jsPpFFYR6q2xj0YOkYy3Enh8rrRD-YQJ26z_U+Fkw@mail.gmail.com>

Thanks Jason


On Mon, Dec 5, 2011 at 5:58 PM, Jason Stajich <jason.stajich at gmail.com> wrote:
> There are several methods that do related things.
>
> Below is the documentation for the method - it allows a more strict calling besides majority rule voting for the position - if you set the cutoff at 50% then the called AA or NT has to appear at least 50% of the time. There is also an IUPAC consensus string which is useful for DNA consensus generating patterns.
>
> If you want a summary as to whether or not there are only conservative amino acid changes in a column, use the match_line method, which generates a string similar to the one CLUSTALW reports as a summary for each column.
>
> =head2 consensus_string
>
>  Title     : consensus_string
>  Usage     : $str = $ali->consensus_string($threshold_percent)
>  Function  : Makes a strict consensus
>  Returns   : Consensus string
>  Argument  : Optional threshold ranging from 0 to 100.
>              The consensus residue has to appear in at least threshold %
>              of the sequences at a given location, otherwise a '?'
>              character will be placed at that location.
>              (Default value = 0%)
>
> =cut
>
> On Dec 5, 2011, at 3:00 PM, Amin Momin wrote:
>
>> Hi ,
>>
>> I am generating a consensus sequence by aligning two protein homologs
>> using Bio::Tools::Run::Alignment::TCoffee. However, I am unable to
>> understand the criteria the consensus_string() method of SimpleAlign uses
>> to determine the consensus at positions with dissimilar amino acids/
>> nucleotides. Also, how would the % cutoffs provided to
>> consensus_string() affect the outcome?
>>
>>
>> Thanks,
>> Amin
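
One way to see how the cutoff affects the result is to print the
consensus of the same alignment at several thresholds. This is a minimal
sketch: the 'clustalw' format and the input file name are placeholders,
and consensus_by_threshold is a helper name introduced here for
illustration. It only uses the consensus_string and match_line methods
described in this thread.

```perl
use strict;
use warnings;
use Bio::AlignIO;

# Hypothetical helper: map each cutoff (in percent) to the consensus string
# that Bio::SimpleAlign reports for it.
sub consensus_by_threshold {
    my ( $aln, @thresholds ) = @_;
    return map { $_ => $aln->consensus_string($_) } @thresholds;
}

# Placeholder usage: perl consensus.pl alignment.aln
my ($infile) = @ARGV;
if ( defined $infile ) {
    my $in  = Bio::AlignIO->new( -file => $infile, -format => 'clustalw' );
    my $aln = $in->next_aln or die "no alignment found in $infile\n";

    my %consensus = consensus_by_threshold( $aln, 50, 75, 100 );
    printf "%3d%%  %s\n", $_, $consensus{$_} for sort { $a <=> $b } keys %consensus;

    # Per-column summary line in the style CLUSTALW prints under an alignment.
    print "match ", $aln->match_line, "\n";
}
```

With only two sequences in the alignment, a column where they disagree
cannot reach a cutoff above 50%, so it should come out as '?' at the
higher thresholds.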


From sunwukong at potc.net  Wed Dec  7 19:05:20 2011
From: sunwukong at potc.net (sunwukong)
Date: Wed, 07 Dec 2011 11:05:20 -0800
Subject: [Bioperl-l] DNA Sequencing two questions
Message-ID: <4EDFB8F0.8080001@potc.net>

I am not a medical professional but I have two DNA related questions.

A year or so ago I realized that if the standard building blocks of life 
were the nucleotide bases GATC then they could be represented as a base 4 
number system (e.g., 0,1,2 and 3).  Then any life form could be 
represented by a number (it would be very long).  So I set out on a 
quest to do this with a small life form.  For fun I chose the Spanish 
Flu which I believe I found on an NIH site.  Then I set out and realized 
that there was no standard.  And I did not know if the number would be 
built with the most significant digit on the left or right.

1.  Is there a standard method for representing the ATCG molecules as 
numbers?
g = 0
a = 1
t  = 2
c = 3

2. is the sequence read left to right or right to left?

note:  It may be biologically significant if the right values are 
assigned to the letters GATC, there could be a pattern somewhere that 
holds significant information.  One idea might be to look at DNA 
sequences in bases other than 4 to see if something jumps out.

http://www.insectscience.org/2.10/ref/fig5a.gif

VR
Pat Kirol
509 442-2214
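
The base 4 idea in the question can be sketched in a few lines of plain
Perl. The digit assignment below is the one from the question and is
arbitrary, not a standard. Sequences are conventionally written 5' to 3',
left to right; treating the leftmost base as the most significant digit is
an assumption made here, and dna_to_number is a helper name introduced for
illustration.

```perl
use strict;
use warnings;
use Math::BigInt;

# Digit assignment taken from the question; it is arbitrary, not a standard.
my %digit = ( g => 0, a => 1, t => 2, c => 3 );

# Read the sequence left to right, treating the leftmost base as the most
# significant base 4 digit (an assumption; the reverse works equally well).
sub dna_to_number {
    my ($dna) = @_;
    my $n = Math::BigInt->new(0);
    for my $base ( split //, lc $dna ) {
        die "unexpected base '$base'\n" unless exists $digit{$base};
        $n->bmul(4)->badd( $digit{$base} );
    }
    return $n;
}

print dna_to_number('gattaca'), "\n";    # prints 1693
```

Note that with g = 0 any leading g's vanish, just like leading zeros, so
the number alone does not record the length of the sequence.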


From Russell.Smithies at agresearch.co.nz  Wed Dec  7 21:59:18 2011
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Thu, 8 Dec 2011 10:59:18 +1300
Subject: [Bioperl-l] DNA Sequencing two questions
In-Reply-To: <4EDFB8F0.8080001@potc.net>
References: <4EDFB8F0.8080001@potc.net>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>

I did something similar a few years ago (after watching the movie "Contact" I think) and encoded codons as RGB values and drew an image of a genome. Looked much like random noise but I might try it again and draw as a space filling curve.
I guess if you're looking for "hidden messages", why restrict yourself to 2 dimensions?  Perhaps something pops out as a single-image stereogram eg. http://upload.wikimedia.org/wikipedia/commons/8/8f/Stereogram_Tut_Random_Dot_Shark.png
Perhaps it's a 3D "object" represented by slices drawn in a series of 2D planes?

But you need a bit of biological background as there will be patterns simply because of the way genes "work" and are laid out in chromosomes. You need to remember that DNA is effectively a 2D representation of a 3D protein structure and there is already much hidden information we know we don't understand - a "simple" task like how proteins fold is barely understood and why some become prions is still a mystery.

But don't let this stop you uncovering the great secret hidden in our genes :-)

On a similar note, have a look at http://medgadget.com/2011/10/send-your-secret-message-hidden-in-bacteria.html

--Russell

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of sunwukong
> Sent: Thursday, 8 December 2011 8:05 a.m.
> To: bioperl-l at bioperl.org
> Subject: [Bioperl-l] DNA Sequencing two questions
> 
> I am not a medical professional but I have two DNA related questions.
> 
> A year or so ago I realized that if the standard building blocks of life were the
> nucleotide bases GATC then they could be represented as a base 4 number
> system (e.g., 0,1,2 and 3).  Then any life form could be represented by a
> number (it would be very long).  So I set out on a quest to do this with a small
> life form.  For fun I chose the Spanish Flu which I believe I found on an NIH
> site.  Then I set out and realized that there was no standard.  And I did not
> know if the number would be built with the most significant digit on the left
> or right.
> 
> 1.  Is there a standard method for representing the ATCG molecules as
> numbers? g = 0 a = 1 t  = 2 c = 3
> 
> 2. is the sequence read left to right or right to left?
> 
> note:  It may be biologically significant if the right values are assigned to the
> letters GATC, there could be a pattern somewhere that holds significant
> information.  One idea might be to look at DNA sequences in bases other
> than 4 to see if something jumps out.
> 
> http://www.insectscience.org/2.10/ref/fig5a.gif
> 
> VR
> Pat Kirol
> 509 442-2214
=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================


From jason.stajich at gmail.com  Wed Dec  7 22:53:10 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Wed, 7 Dec 2011 14:53:10 -0800
Subject: [Bioperl-l] DNA Sequencing two questions
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>
References: <4EDFB8F0.8080001@potc.net>
	<18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>
Message-ID: <9BDA2BFA-264F-45B2-9234-A6BC9402FBA8@gmail.com>

For other fun picture games -- 

You can look at patterns of motifs/words in a chaos game representation of genomes.
http://mbe.oxfordjournals.org/content/16/10/1391.long
http://mbe.oxfordjournals.org/content/20/6/901.long


On Dec 7, 2011, at 1:59 PM, Smithies, Russell wrote:

> I did something similar a few years ago (after watching the movie "Contact" I think) and encoded codons as RGB values and drew an image of a genome. Looked much like random noise but I might try it again and draw as a space filling curve.
> I guess if you're looking for "hidden messages", why restrict yourself to 2 dimensions?  Perhaps something pops out as a single-image stereogram eg. http://upload.wikimedia.org/wikipedia/commons/8/8f/Stereogram_Tut_Random_Dot_Shark.png
> Perhaps it's a 3D "object" represented by slices drawn in a series of 2D planes?
> 
> But you need a bit of biological background as there will be patterns simply because of the way genes "work" and are laid out in chromosomes. You need to remember that DNA is effectively a 2D representation of a 3D protein structure and there is already much hidden information we know we don't understand - a "simple" task like how proteins fold is barely understood and why some become prions is still a mystery.
> 
> But don't let this stop you uncovering the great secret hidden in our genes :-)
> 
> On a similar note, have a look at http://medgadget.com/2011/10/send-your-secret-message-hidden-in-bacteria.html
> 
> --Russell


From Russell.Smithies at agresearch.co.nz  Thu Dec  8 00:29:47 2011
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Thu, 8 Dec 2011 13:29:47 +1300
Subject: [Bioperl-l] DNA Sequencing two questions
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>
References: <4EDFB8F0.8080001@potc.net>
	<18DF7D20DFEC044098A1062202F5FFF340186CF244@exchsth.agresearch.co.nz>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF340186CF245@exchsth.agresearch.co.nz>

I tried again and came up with this:
http://www.bioperl.org/w/images/7/7a/Autostereogram.png
If you look carefully, you can see the answer to life, the universe, and everything!!

--Russell

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Smithies, Russell
> Sent: Thursday, 8 December 2011 10:59 a.m.
> To: 'sunwukong'; bioperl-l at bioperl.org
> Subject: Re: [Bioperl-l] DNA Sequencing two questions
> 
> I did something similar a few years ago (after watching the movie "Contact" I
> think) and encoded codons as RGB values and drew an image of a genome.
> Looked much like random noise but I might try it again and draw as a space
> filling curve.
> I guess if you're looking for "hidden messages", why restrict yourself to 2
> dimensions?  Perhaps something pops out as a single-image stereogram eg.
> http://upload.wikimedia.org/wikipedia/commons/8/8f/Stereogram_Tut_Random_Dot_Shark.png
> Perhaps it's a 3D "object" represented by slices drawn in a series of 2D
> planes?
> 
> But you need a bit of biological background as there will be patterns simply
> because of the way genes "work" and are laid out in chromosomes. You
> need to remember that DNA is effectively a 2D representation of a 3D
> protein structure and there is already much hidden information we know we
> don't understand - a "simple" task like how proteins fold is barely understood
> and why some become prions is still a mystery.
> 
> But don't let this stop you uncovering the great secret hidden in our genes :-)
> 
> On a similar note, have a look at
> http://medgadget.com/2011/10/send-your-secret-message-hidden-in-bacteria.html
> 
> --Russell


From lumos.lumos.lumos at gmail.com  Fri Dec  9 16:47:36 2011
From: lumos.lumos.lumos at gmail.com (lumos lumos)
Date: Fri, 9 Dec 2011 08:47:36 -0800
Subject: [Bioperl-l] Mouse->Human homologues ?
Message-ID: <CAJbewumvg+iQ787GYQTrL8bLRqPGb6T3jNzw2GZJQAVwZU+aAA@mail.gmail.com>

Hello,

Is there a way to get human homologues for a mouse gene list where I get
all human genes(symbols) as text output ?

Thank you
LM


From cjfields at illinois.edu  Fri Dec  9 17:17:20 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Fri, 9 Dec 2011 17:17:20 +0000
Subject: [Bioperl-l] Mouse->Human homologues ?
In-Reply-To: <CAJbewumvg+iQ787GYQTrL8bLRqPGb6T3jNzw2GZJQAVwZU+aAA@mail.gmail.com>
References: <CAJbewumvg+iQ787GYQTrL8bLRqPGb6T3jNzw2GZJQAVwZU+aAA@mail.gmail.com>
Message-ID: <C19C1032-5FE7-400A-AEB5-152EEFC4C435@illinois.edu>

There are lots of databases that have this capability (Ensembl, OrthoDB, HomoloGene, OMA, to name only a few).  Have you tried a simple search for this, or did you want expert opinion on the matter?  

chris

PS - Just to note, there is a lot of controversy swirling about re: the ortholog conjecture and some recently published papers calling it into question using human-mouse data, worth a look if you're trotting this path to know the current situation.  If you have access to F1000, see the following (paper itself is open :)

Faculty of 1000 evaluations, dissents and comments for [Nehrt NL et al. Testing the ortholog conjecture with comparative functional genomic data from mammals. PLoS Comput Biol. 2011 Jun; 7(6):e1002073; doi: 10.1371/journal.pcbi.1002073]. Faculty of 1000, 31 Aug 2011. F1000.com/12462957

On Dec 9, 2011, at 10:47 AM, lumos lumos wrote:

> Hello,
> 
> Is there a way to get human homologues for a mouse gene list where I get
> all human genes(symbols) as text output ?
> 
> Thank you
> LM


From lumos.lumos.lumos at gmail.com  Fri Dec  9 17:29:24 2011
From: lumos.lumos.lumos at gmail.com (lumos lumos)
Date: Fri, 9 Dec 2011 09:29:24 -0800
Subject: [Bioperl-l] Mouse->Human homologues ?
In-Reply-To: <C19C1032-5FE7-400A-AEB5-152EEFC4C435@illinois.edu>
References: <CAJbewumvg+iQ787GYQTrL8bLRqPGb6T3jNzw2GZJQAVwZU+aAA@mail.gmail.com>
	<C19C1032-5FE7-400A-AEB5-152EEFC4C435@illinois.edu>
Message-ID: <CAJbewukt_xCCpQaWTsvqi2z1NkbsTZRG6xXJUcZhcK5jdAZhWQ@mail.gmail.com>

Hi Chris,

Thanks for your reply. I wanted to know if there is any way to do this
automatically with a Perl script for a list of mouse genes whose human
homologues I require.

LM
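
If one of the databases named in this thread is downloaded as a flat
table, the lookup itself is plain Perl. This is a minimal sketch under a
stated assumption: a tab-delimited table with one gene per row and the
columns group id, taxonomy id, gene id, gene symbol (which I believe is how
the HomoloGene data file is laid out, but check its README). The rows in
the example are placeholders for illustration, not real database records,
and mouse_to_human is a helper name introduced here.

```perl
use strict;
use warnings;

# Assumed column layout (tab-delimited, one gene per row):
#   group id, taxonomy id, gene id, gene symbol, ...
# These rows are placeholders for illustration, not real database records.
my @rows = (
    "1\t10090\t0\tBrca1",
    "1\t9606\t0\tBRCA1",
    "2\t10090\t0\tTnf",
    "2\t9606\t0\tTNF",
    "3\t10090\t0\tOmg",
);

my ( $MOUSE, $HUMAN ) = ( 10090, 9606 );    # NCBI taxonomy ids

# Hypothetical helper: for each mouse symbol, list the human symbols that
# share its homology group (an empty list if there are none).
sub mouse_to_human {
    my ( $wanted, @lines ) = @_;
    my ( %group_of, %human_in );
    for my $line (@lines) {
        my ( $group, $taxid, undef, $symbol ) = split /\t/, $line;
        $group_of{ lc $symbol } = $group if $taxid == $MOUSE;
        push @{ $human_in{$group} }, $symbol if $taxid == $HUMAN;
    }
    my %map;
    for my $gene (@$wanted) {
        my $group = $group_of{ lc $gene };
        $map{$gene} = defined $group ? ( $human_in{$group} || [] ) : [];
    }
    return \%map;
}

my $map = mouse_to_human( [qw(Brca1 Tnf Omg)], @rows );
for my $gene ( sort keys %$map ) {
    print join( "\t", $gene, join( ',', @{ $map->{$gene} } ) || '-' ), "\n";
}
```

In real use the rows would be read from the downloaded file instead of the
inline list, and the mouse gene list from your own input.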

On Fri, Dec 9, 2011 at 9:17 AM, Fields, Christopher J <cjfields at illinois.edu> wrote:

> There are lots of databases that have this capability (Ensembl, OrthoDB,
> HomoloGene, OMA, to name only a few).  Have you tried a simple search for
> this, or did you want expert opinion on the matter?
>
> chris
>
> PS - Just to note, there is a lot of controversy swirling about re: the
> ortholog conjecture and some recently published papers calling it into
> question using human-mouse data, worth a look if you're trotting this path
> to know the current situation.  If you have access to F1000, see the
> following (paper itself is open :)
>
> Faculty of 1000 evaluations, dissents and comments for [Nehrt NL et al.
> Testing the ortholog conjecture with comparative functional genomic data
> from mammals. PLoS Comput Biol. 2011 Jun; 7(6):e1002073; doi:
> 10.1371/journal.pcbi.1002073]. Faculty of 1000, 31 Aug 2011.
> F1000.com/12462957


From lumos.lumos.lumos at gmail.com  Thu Dec  8 04:47:19 2011
From: lumos.lumos.lumos at gmail.com (lumos lumos)
Date: Wed, 7 Dec 2011 20:47:19 -0800
Subject: [Bioperl-l] Perl parsing
Message-ID: <CAJbewunJa-_FVfjiOjZqZrv_h9RcZuCNpjGviWDM6O0mXjQQ=A@mail.gmail.com>

Hello,

I have a text file(tab-delim) with some gene names as shown below.

*BRCA1: breast cancer 1, early onset

TNF: tumor necrosis factor

OMG: oligodendrocyte myelin glycoprotein*

I would like to get the list of gene name BRCA1,TNF,OMG that is before the
colon(:) .
How do I parse in perl this text file with this list of genes?

Thanks in advance.
LM
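
For the three sample lines above, taking the text before the first colon
is enough. This is a minimal sketch in plain Perl: gene_names is a helper
name introduced here for illustration, and treating the '*' characters in
the sample as e-mail emphasis markers rather than file content is an
assumption.

```perl
use strict;
use warnings;

# Return the text before the first colon on each line, skipping lines
# that have no colon. A leading '*' (as in the sample) is dropped.
sub gene_names {
    my @names;
    for my $line (@_) {
        next unless $line =~ /^\s*\*?\s*([^:\s]+)\s*:/;
        push @names, $1;
    }
    return @names;
}

my @lines = (
    "*BRCA1: breast cancer 1, early onset",
    "",
    "TNF: tumor necrosis factor",
    "",
    "OMG: oligodendrocyte myelin glycoprotein*",
);
print join( ',', gene_names(@lines) ), "\n";    # prints BRCA1,TNF,OMG
```

To read from a file instead, open it and pass the lines of the file
handle to gene_names in place of the inline list.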


From b.m.forde at umail.ucc.ie  Fri Dec  9 16:52:56 2011
From: b.m.forde at umail.ucc.ie (BForde)
Date: Fri, 9 Dec 2011 08:52:56 -0800 (PST)
Subject: [Bioperl-l]  Genbank files
Message-ID: <32941955.post@talk.nabble.com>


Hello all,

I am new to Bioperl so I apologise if this is a stupid question. 

For CDS features I wish to add additional qualifiers e.g. /colour and /note
qualifiers. I have looked at the BioPerl wiki but am still unsure how to
do this.

regards

Brian
-- 
View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32941955.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From jboddu at illinois.edu  Fri Dec  9 19:59:39 2011
From: jboddu at illinois.edu (Boddu, Jayanand)
Date: Fri, 9 Dec 2011 19:59:39 +0000
Subject: [Bioperl-l] Batch processing of Data
Message-ID: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>

Hi Anyone:
Please let me know if the following is practical with PERL.
My data output can be described as follows.

1.       Hundreds of samples are run.

2.       A batch output sends the data from each sample to its own "folder". The output takes the form of a few text files, spreadsheets and PDF files.

3.       One of the spreadsheets has the data of most interest.

4.       This means I end up with hundreds of folders.

5.       The spreadsheet with the data has multiple worksheets, of which a couple hold the interesting data to be processed (please find attached a sample spreadsheet; the worksheets of interest are named "Compound" and "Peak", and the yellow-highlighted columns in each worksheet hold the data to be processed).
OK. That's a long description.
NOW. Is it practical to write a Perl (or any other) script to:

1.       Enter each folder.

2.       Look for the spreadsheet of interest.

3.       Look for worksheets named "Compound" and "Peak".

4.       Look for the specific columns of interest.

5.       Copy the columns of interest into a new spreadsheet/text file, with the data from each folder next to each other.

This final spreadsheet will pass through a bunch of other calculations.

I apologize for this long and painful description.
However, it would be great if this can be done.
Thanks
Jay
-------------- next part --------------
A non-text attachment was scrubbed...
Name: REPORT01.xls
Type: application/vnd.ms-excel
Size: 93696 bytes
Desc: REPORT01.xls
URL: <http://lists.open-bio.org/pipermail/bioperl-l/attachments/20111209/0528887b/attachment-0004.xls>

From cjfields at illinois.edu  Fri Dec  9 20:37:48 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Fri, 9 Dec 2011 20:37:48 +0000
Subject: [Bioperl-l] Perl parsing
In-Reply-To: <CAJbewunJa-_FVfjiOjZqZrv_h9RcZuCNpjGviWDM6O0mXjQQ=A@mail.gmail.com>
References: <CAJbewunJa-_FVfjiOjZqZrv_h9RcZuCNpjGviWDM6O0mXjQQ=A@mail.gmail.com>
Message-ID: <E70D9F7E-7419-47AE-A0F9-3FB2B7B25907@illinois.edu>

On Dec 7, 2011, at 10:47 PM, lumos lumos wrote:

> Hello,
> 
> I have a text file(tab-delim) with some gene names as shown below.
> 
> *BRCA1: breast cancer 1, early onset
> 
> TNF: tumor necrosis factor
> 
> OMG: oligodendrocyte myelin glycoprotein*
> 
> I would like to get the list of gene name BRCA1,TNF,OMG that is before the
> colon(:) .
> How do I parse in perl this text file with this list of genes?

'Very carefully?'

Okay, I'll try to refrain from further sarcasm, but I'm confused, what does this have to do with BioPerl (*the toolkit*) specifically?  That is what this mailing list is for.  

Just to note, this is a very common perl task. The answer is attainable by searching for it (not to mention taking the time to learn basic perl).  For instance:

   http://stackoverflow.com/questions/4500407/in-perl-how-can-i-correctly-parse-tab-space-delimited-files-with-quoted-strings

One of the many links found by simply using Google:

   http://lmgtfy.com/?q=perl+parse+tab+file

I'll leave the regex munging to you.  

(okay, I failed at refraining from sarcasm, ah well it's friday).

chris


> Thanks in advance.
> LM


From jason.stajich at gmail.com  Fri Dec  9 21:18:38 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Fri, 9 Dec 2011 13:18:38 -0800
Subject: [Bioperl-l] Genbank files
In-Reply-To: <32941955.post@talk.nabble.com>
References: <32941955.post@talk.nabble.com>
Message-ID: <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>

$feature->add_tag_value('color','blue');

On Dec 9, 2011, at 8:52 AM, BForde wrote:

> 
> Hello all,
> 
> I am new to Bioperl so I apologise if this is a stupid question. 
> 
> For CDS features I wish to add additional qualifiers e.g. /colour and /note
> qualifiers. I have looked at the BioPerl wiki but am still unsure how to
> do this.
> 
> regards
> 
> Brian
> -- 
> View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32941955.html
> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.

Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org
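
The one-liner above needs a feature object to act on. A minimal sketch of
the surrounding script follows, assuming a GenBank file as input: the file
names come from the command line, the tag values are placeholders, and
tag_cds_features is a helper name introduced here for illustration.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Hypothetical helper: add the given tag/value pairs to every CDS feature
# of a sequence object.
sub tag_cds_features {
    my ( $seq, %tags ) = @_;
    for my $feature ( $seq->get_SeqFeatures ) {
        next unless $feature->primary_tag eq 'CDS';
        $feature->add_tag_value( $_, $tags{$_} ) for sort keys %tags;
    }
    return $seq;
}

# Placeholder usage: perl tag_cds.pl input.gbk output.gbk
my ( $infile, $outfile ) = @ARGV;
if ( defined $infile and defined $outfile ) {
    my $in  = Bio::SeqIO->new( -file => $infile,     -format => 'genbank' );
    my $out = Bio::SeqIO->new( -file => ">$outfile", -format => 'genbank' );
    while ( my $seq = $in->next_seq ) {
        # Tag values here are arbitrary examples.
        tag_cds_features( $seq, colour => 3, note => 'added by script' );
        $out->write_seq($seq);
    }
}
```

Each tag added this way should appear as a qualifier line under the CDS
feature when the record is written back out.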


From bosborne11 at verizon.net  Fri Dec  9 20:31:15 2011
From: bosborne11 at verizon.net (Brian Osborne)
Date: Fri, 09 Dec 2011 15:31:15 -0500
Subject: [Bioperl-l] Genbank files
In-Reply-To: <32941955.post@talk.nabble.com>
References: <32941955.post@talk.nabble.com>
Message-ID: <3AB0716B-EC07-4BD6-9FC8-0C47A29FC0BA@verizon.net>

Brian,

Reasonable question. Start here:

http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

If you've never used Bioperl then:

http://www.bioperl.org/wiki/HOWTO:Beginners

Brian


On Dec 9, 2011, at 11:52 AM, BForde wrote:

> 
> Hello all,
> 
> I am new to Bioperl so I apologise if this is a stupid question. 
> 
> For CDS features I wish to add additional qualifiers e.g. /colour and /note
> qualifiers. I have looked at the BioPerl wiki but am still unsure how to
> do this.
> 
> regards
> 
> Brian
> -- 
> View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32941955.html
> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From asjo at koldfront.dk  Fri Dec  9 22:25:00 2011
From: asjo at koldfront.dk (Adam Sjøgren)
Date: Fri, 09 Dec 2011 23:25:00 +0100
Subject: [Bioperl-l] Batch processing of Data
References: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>
Message-ID: <871usdpemb.fsf@topper.koldfront.dk>

On Fri, 9 Dec 2011 19:59:39 +0000, Boddu, wrote:

> Please let me know if the following is practical with PERL.

It might very well be, yes.

Modules you might be interested in include Spreadsheet::ParseExcel,
Spreadsheet::XLSX, Spreadsheet::WriteExcel and Excel::Writer::XLSX [1].

A big help in finding interesting CPAN modules is the search engine on
https://metacpan.org/

Depending on your platform and preference using find(1) might also be
helpful to traverse the folders, rather than doing so in Perl.

Note that none of this has anything to do with BioPerl as such, though,
and you'll need to do some actual programming to get the job done.


  Best regards,

    Adam


[1] http://blogs.perl.org/users/john_mcnamara/2011/10/spreadsheetwriteexcel-is-dead-long-live-excelwriterxlsx.html

-- 
 "Angels can fly because they take themselves lightly."       Adam Sjøgren
                                                         asjo at koldfront.dk


From David.Messina at sbc.su.se  Fri Dec  9 22:30:23 2011
From: David.Messina at sbc.su.se (Dave Messina)
Date: Fri, 9 Dec 2011 23:30:23 +0100
Subject: [Bioperl-l] Batch processing of Data
In-Reply-To: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>
References: <1DE33F161A86CD479E40A2AF7FD4E8D0049039B7@CITESMBX2.ad.uillinois.edu>
Message-ID: <CAM3TQQWMnqwShBhYQWH9iZqtDphVFYLYtuVWEMxxfVY1OqSbhg@mail.gmail.com>

Yes, it can be done. However, it has nothing to do with this mailing list.

Steps 1 and 2 are basic Perl.
For steps 3 through 5, try googling "perl parse excel".


Dave
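
The five steps in the question can be sketched with File::Find and
Spreadsheet::ParseExcel, one of the modules mentioned in this thread. This
is a minimal sketch, not a finished tool: the column indexes are
placeholders because the real report layout is not shown here, every .xls
file is treated as a report, and find_reports and column_values are helper
names introduced for illustration.

```perl
use strict;
use warnings;
use File::Find;
use Spreadsheet::ParseExcel;

# Collect every .xls file below a top-level directory.
sub find_reports {
    my ($root) = @_;
    my @files;
    find( sub { push @files, $File::Find::name if -f $_ && /\.xls\z/i }, $root );
    return sort @files;
}

# Return the formatted values of one column (0-based) of a worksheet.
sub column_values {
    my ( $worksheet, $col ) = @_;
    my ( $row_min, $row_max ) = $worksheet->row_range;
    my @values;
    for my $row ( $row_min .. $row_max ) {
        my $cell = $worksheet->get_cell( $row, $col );
        push @values, defined $cell ? $cell->value : '';
    }
    return @values;
}

# Placeholder column choices; the real indexes depend on the report layout.
my %wanted = ( Compound => 2, Peak => 3 );

# Placeholder usage: perl collect.pl /path/to/top/folder > combined.txt
my ($root) = @ARGV;
if ( defined $root ) {
    my $parser = Spreadsheet::ParseExcel->new;
    for my $file ( find_reports($root) ) {
        my $workbook = $parser->parse($file);
        unless ($workbook) {
            warn "$file: ", $parser->error, "\n";
            next;
        }
        for my $sheet ( sort keys %wanted ) {
            my $worksheet = $workbook->worksheet($sheet) or next;
            print join( "\t", $file, $sheet, column_values( $worksheet, $wanted{$sheet} ) ), "\n";
        }
    }
}
```

Each output line holds one column from one folder, written as a
tab-delimited row; transposing the result in a spreadsheet program puts
the folders next to each other.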


On Fri, Dec 9, 2011 at 20:59, Boddu, Jayanand <jboddu at illinois.edu> wrote:

> Hi Anyone:
> Please let me know if the following is practical with PERL.
> My data output can be described as follows.
>
> 1.       Hundreds of samples are run.
>
> 2.       A batch output sends the data from each sample to its own "folder".
> The output takes the form of a few text files, spreadsheets and PDF files.
>
> 3.       One of the spreadsheets has the data of most interest.
>
> 4.       This means I end up with hundreds of folders.
>
> 5.       The spreadsheet with the data has multiple worksheets, of
> which a couple hold the interesting data to be processed (please find
> attached a sample spreadsheet; the worksheets of interest are named
> "Compound" and "Peak", and the yellow-highlighted columns in each
> worksheet hold the data to be processed).
> OK. That's a long description.
> NOW. Is it practical to write a Perl (or any other) script to:
>
> 1.       Enter each folder.
>
> 2.       Look for the spreadsheet of interest.
>
> 3.       Look for worksheets named "Compound" and "Peak".
>
> 4.       Look for the specific columns of interest.
>
> 5.       Copy the columns of interest into a new spreadsheet/text
> file, with the data from each folder next to each other.
>
> This final spreadsheet will pass through a bunch of other calculations.
>
> I apologize for this long and painful description.
> However, it would be great if this can be done.
> Thanks
> Jay


From lsbrath at gmail.com  Sat Dec 10 21:39:44 2011
From: lsbrath at gmail.com (Mgavi Brathwaite)
Date: Sat, 10 Dec 2011 16:39:44 -0500
Subject: [Bioperl-l] Perl parsing
In-Reply-To: <E70D9F7E-7419-47AE-A0F9-3FB2B7B25907@illinois.edu>
References: <CAJbewunJa-_FVfjiOjZqZrv_h9RcZuCNpjGviWDM6O0mXjQQ=A@mail.gmail.com>
	<E70D9F7E-7419-47AE-A0F9-3FB2B7B25907@illinois.edu>
Message-ID: <CAJm=ba98HUgAB1kUG29_KA+ZvNWP_AsHoJQNPQ-_Fe=Pa7b74Q@mail.gmail.com>

Yes, grasshopper, you have to suffer a little bit. Learn Perl first, then
step up to BioPerl. Chris, I feel you concerning the power of regex, and the
sarcasm.
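
For the record, one possible regex for the quoted question is sketched below. It assumes the lines look exactly like the three samples (gene symbol, colon, description, with a stray "*" here and there); the sub name gene_names is made up for the example.

```perl
use strict;
use warnings;

# Return the text before the first colon on each line, e.g. "BRCA1".
# Leading "*" and whitespace are stripped because the sample lines in the
# question carry them; lines without a colon are skipped.
sub gene_names {
    my @names;
    for my $line (@_) {
        next unless $line =~ /^[\s*]*([^:\s]+)\s*:/;
        push @names, $1;
    }
    return @names;
}

my @lines = (
    "*BRCA1: breast cancer 1, early onset",
    "TNF: tumor necrosis factor",
    "OMG: oligodendrocyte myelin glycoprotein*",
);
print join(",", gene_names(@lines)), "\n";   # prints BRCA1,TNF,OMG
```

To run it on a real file, read the lines with while (<$fh>) and pass each one to the sub.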

Lom

On Fri, Dec 9, 2011 at 3:37 PM, Fields, Christopher J <cjfields at illinois.edu> wrote:

> On Dec 7, 2011, at 10:47 PM, lumos lumos wrote:
>
> > Hello,
> >
> > I have a text file(tab-delim) with some gene names as shown below.
> >
> > *BRCA1: breast cancer 1, early onset
> >
> > TNF: tumor necrosis factor
> >
> > OMG: oligodendrocyte myelin glycoprotein*
> >
> > I would like to get the list of gene name BRCA1,TNF,OMG that is before
> the
> > colon(:) .
> > How do I parse in perl this text file with this list of genes?
>
> 'Very carefully?'
>
> Okay, I'll try to refrain from further sarcasm, but I'm confused, what
> does this have to do with BioPerl (*the toolkit*) specifically?  That is
> what this mailing list is for.
>
> Just to note, this is a very common perl task. The answer is attainable by
> searching for it (not to mention taking the time to learn basic perl).  For
> instance:
>
>
> http://stackoverflow.com/questions/4500407/in-perl-how-can-i-correctly-parse-tab-space-delimited-files-with-quoted-strings
>
> One of the many links found by simply using Google:
>
>   http://lmgtfy.com/?q=perl+parse+tab+file
>
> I'll leave the regex munging to you.
>
> (okay, I failed at refraining from sarcasm, ah well it's friday).
>
> chris
>
>
> > Thanks in advance.
> > LM
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>


From pawan.mani2 at gmail.com  Mon Dec  5 22:00:09 2011
From: pawan.mani2 at gmail.com (pawan.mani2 at gmail.com)
Date: Tue, 6 Dec 2011 03:30:09 +0530
Subject: [Bioperl-l] bioperl in cygwin
Message-ID: <7B046189155547CEB23F0846D7577D1F@PAWANKUMARPC>

Hi
     I am trying to install BioPerl. I gave the following commands in a Cygwin terminal:


    perl -MCPAN -e shell

then I type

    o conf prerequisites_policy follow
    o conf commit
    install Bundle::CPAN
    install Module::Build
    d /bioperl/

Then you get a list of different versions.
I selected CJFIELDS/BioPerl-1.6.1.96:

    install CJFIELDS/BioPerl-1.6.1.96.tar.gz


but build.install was not OK.

Kindly let me know the steps for a complete BioPerl installation in Cygwin under Windows 7.

Thanks in advance.

with best regards,
Pawan


From cjfields at illinois.edu  Sun Dec 11 18:22:01 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sun, 11 Dec 2011 18:22:01 +0000
Subject: [Bioperl-l] bioperl in cygwin
In-Reply-To: <7B046189155547CEB23F0846D7577D1F@PAWANKUMARPC>
References: <7B046189155547CEB23F0846D7577D1F@PAWANKUMARPC>
Message-ID: <B674A464-E650-4CBF-B2CE-2100AB0B29B9@illinois.edu>

Pawan,

Hard to say what the problem is without seeing the warnings/errors.  Before anything else, you should try installing BioPerl-1.6.901 (the latest CPAN release).  You can try a direct installation of the distribution, but the easiest way to get the latest version is to try installing Bio::Perl.

(I'm not sure what BioPerl-1.6.1.96 is, but this seems wrong)
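
Once the install has run, a quick way to see what actually landed is to ask Perl for the version. This is only a sketch: it assumes the version number is carried by Bio::Root::Version (as it is in the 1.6.x releases), and the sub name bioperl_version is made up for the example.

```perl
use strict;
use warnings;

# Return the installed BioPerl version string, or undef if BioPerl
# cannot be loaded from the current @INC.
sub bioperl_version {
    return eval { require Bio::Root::Version; Bio::Root::Version->VERSION };
}

my $version = bioperl_version();
print defined $version
    ? "BioPerl $version is installed\n"
    : "BioPerl does not appear to be installed\n";
```

If it reports that BioPerl is missing, the errors from the failed build step are the thing to post to the list.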

chris

On Dec 5, 2011, at 4:00 PM, <pawan.mani2 at gmail.com>
 <pawan.mani2 at gmail.com> wrote:

> Hi
>     I would like to after the givibg following commands in cgwin terminal:
> 
> 
> perl -MCPAN -e shell
> 
> then I type
> 
>    o conf prerequisites_policy follow
>    o conf commit
>    install Bundle::CPAN 
> install Module::Build 
> d /bioperl/ 
> then we  you get a list of different versions. 
> I selected CJFIELDS/BioPerl-1.6.1.96
> install CJFIELDS/BioPerl-1.6.1.96.tar.gz 
> 
> 
> but build.install was not ok.
> 
> Kindly allow me to know the step of complete bioprl installatin cygwin undre windows 7.
> 
> thanks in advanced.
> 
> with best regards,
> Pawan
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From b.m.forde at umail.ucc.ie  Tue Dec 13 11:03:50 2011
From: b.m.forde at umail.ucc.ie (BForde)
Date: Tue, 13 Dec 2011 03:03:50 -0800 (PST)
Subject: [Bioperl-l] Genbank files
In-Reply-To: <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
References: <32941955.post@talk.nabble.com>
	<5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
Message-ID: <32965574.post@talk.nabble.com>


Thank you for the replies.

My script (below) reads in a list of locus_tags from a tab-delimited text
file, compares these locus_tags to the locus_tags in a GenBank file, and
where they are equal should add new features.
In the line
$feat->add_tag_value()
the variable $feat needs to be defined. In the BioPerl wiki this variable appears to be defined
by giving it coordinates etc. (creating a new feature). I wish to add
features to the CDS key when the locus_tags are identical. Is this possible?

use strict; 
use Bio::SeqIO; 

my @V; 
open (LIST1, 'list') ||die; 
while (<LIST1>){ 
    push @V, (split(/\t/, $_))[0]; 
} 
close(LIST1); 

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb"); 
my $seq_object = $seqio_object->next_seq; 

for my $feat_object ($seq_object->get_SeqFeatures){ 
    if ($feat_object->primary_tag eq "CDS"){ 
        if ($feat_object->has_tag('locus_tag')){ 
            for my $V3 ($feat_object->get_tag_values('locus_tag')){ 
                for my $V1 (@V) { 
                    if ($V1 eq $V3){ 
                        ADD NEW FEATURES 
                        
                    }     
                } 
            } 
        } 
    } 
} 
  
The script works down as far as the comparison point, where locus_tags in the
GenBank file "Contig100.gb" are compared against a list of locus_tags from a
delimited text file.


regards 

Brian 

Jason Stajich-5 wrote:
> 
> $feature->add_tag_value('color','blue');
> 
> On Dec 9, 2011, at 8:52 AM, BForde wrote:
> 
>> 
>> Hello all,
>> 
>> I am new to Bioperl so I apologise if this is stupid question. 
>> 
>> For CDS features I which to add additional qualifiers e.g. /colour and
>> /note
>> qualifiers. I have looked at the BioPerl wiki but am still unsure as how
>> to
>> do this?
>> 
>> regards
>> 
>> Brian
>> -- 
>> View this message in context:
>> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>> 
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
> Jason Stajich
> jason.stajich at gmail.com
> jason at bioperl.org
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
> 

-- 
View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32965574.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From roy.chaudhuri at gmail.com  Tue Dec 13 11:52:05 2011
From: roy.chaudhuri at gmail.com (Roy Chaudhuri)
Date: Tue, 13 Dec 2011 11:52:05 +0000
Subject: [Bioperl-l] Genbank files
In-Reply-To: <32965574.post@talk.nabble.com>
References: <32941955.post@talk.nabble.com>
	<5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
	<32965574.post@talk.nabble.com>
Message-ID: <4EE73C65.1080101@gmail.com>

Hi Brian,

Just to check I have understood you, you want to read through a genbank 
file and add additional tags to features which are listed in a 
tab-delimited file of locus tags?

Your code is on the right lines, but it would be much more efficient to 
read your tab-delimited locus_tags into a hash, and check using exists, 
rather than ploughing through the (potentially very long) list of locus 
tags every time. Also, be careful with new lines in your tab file (you 
can safely get rid of them using "chomp"). You can miss out the 
"has_tag" check by using "get_tagset_values" instead of 
"get_tag_values", since the former does not complain if the tag is not 
present. Once you have modified your sequence object, you need to write 
it out to a new file (or STDOUT) using Bio::SeqIO.

Also, just a couple of general points, you should always "use warnings" 
(or even better "use warnings FATAL=>qw(all)") since that can help solve 
many problems, and your code may be easier to read if you don't include 
the word "object" in all your variable names (after all you wouldn't say 
you write on a paper object using a pen object).
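
To illustrate what the FATAL form buys you, here is a small sketch (the sub name dies_on_undef is made up for the demonstration). The same statement, concatenating an undefined value, only warns under plain "use warnings" but stops the program under "use warnings FATAL=>qw(all)":

```perl
use strict;

# Run a snippet that concatenates an undefined value, once with ordinary
# warnings and once with fatal warnings, and report whether it died.
sub dies_on_undef {
    my ($fatal) = @_;
    my $survived = eval {
        local $SIG{__WARN__} = sub { };     # keep the demo output quiet
        my $undefined;
        if ($fatal) {
            use warnings FATAL => qw(all);
            my $text = "tag: " . $undefined;    # dies here
        }
        else {
            use warnings;
            my $text = "tag: " . $undefined;    # only warns
        }
        1;
    };
    return $survived ? 0 : 1;
}

print "plain warnings: ", (dies_on_undef(0) ? "died" : "carried on"), "\n";
print "fatal warnings: ", (dies_on_undef(1) ? "died" : "carried on"), "\n";
```

The first line reports "carried on" and the second "died", which is the point: a silent problem becomes one you cannot miss.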

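use strict;
use warnings FATAL=>qw(all);
use Bio::SeqIO;

# Read the first column of the tab-delimited file into a hash for fast lookup.
open (my $list, '<', 'list') or die "Cannot open list: $!";
my %V;
while (<$list>){
     chomp;
     next unless length;    # skip blank lines (undef key is fatal here)
     $V{(split(/\t/, $_))[0]}=1;
}
close($list);

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;

# Take every feature off the sequence, tag the listed CDS features,
# then put each feature back.
for my $feat_object ($seq_object->remove_SeqFeatures){
     if ($feat_object->primary_tag eq "CDS"){
         for my $V3 ($feat_object->get_tagset_values('locus_tag')){
             if (exists $V{$V3}){
                 $feat_object->add_tag_value(listed_in_tab_file=>'yes');
             }
         }
     }
     $seq_object->add_SeqFeature($feat_object);
}

# With no -file argument, the modified record is written to STDOUT.
Bio::SeqIO->new(-format=>'genbank')->write_seq($seq_object);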

Hope this helps.
Cheers,
Roy.

On 13/12/2011 11:03, BForde wrote:
>
> Than you for the replies.
>
> My script (below) reads in a list of locus_tags from a tab delimited text
> file. Compares these locus_tags to the locus_tags in  a genbank file and
> where they are equal adds new features.
> the line
> $feat->add_tag_value()
> needs to be defined. In the bioperl wiki this variable appears to be defined
> by giving it coordinates etc (creating a new feature). I wish to add
> features to CDS key when the locus_tags are identical. Is this possible?
>
> use strict;
> use Bio::SeqIO;
>
> my @V;
> open (LIST1, 'list') ||die;
> while (<LIST1>){
>      push @V, (split(/\t/, $_))[0];
> }
> close(LIST1);
>
> my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
> my $seq_object = $seqio_object->next_seq;
>
> for my $feat_object ($seq_object->get_SeqFeatures){
>      if ($feat_object->primary_tag eq "CDS"){
>          if ($feat_object->has_tag('locus_tag')){
>              for my $V3 ($feat_object->get_tag_values('locus_tag')){
>                  for my $V1 (@V) {
>                      if ($V1 eq $V3){
>                          ADD NEW FEATURES
>
>                      }
>                  }
>              }
>          }
>      }
> }
>
> The script works down as far as the comparison point where locus_tags in the
> genbankfile "Contig100.gb" are compared against a list of locus_tags from a
> delimited txt file.
>
>
> regards
>
> Brian
>
> Jason Stajich-5 wrote:
>>
>> $feature->add_tag_value('color','blue');
>>
>> On Dec 9, 2011, at 8:52 AM, BForde wrote:
>>
>>>
>>> Hello all,
>>>
>>> I am new to Bioperl so I apologise if this is stupid question.
>>>
>>> For CDS features I which to add additional qualifiers e.g. /colour and
>>> /note
>>> qualifiers. I have looked at the BioPerl wiki but am still unsure as how
>>> to
>>> do this?
>>>
>>> regards
>>>
>>> Brian
>>> --
>>> View this message in context:
>>> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
>>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>> Jason Stajich
>> jason.stajich at gmail.com
>> jason at bioperl.org
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>


From b.m.forde at umail.ucc.ie  Tue Dec 13 14:22:01 2011
From: b.m.forde at umail.ucc.ie (Brian Forde)
Date: Tue, 13 Dec 2011 14:22:01 +0000
Subject: [Bioperl-l] Genbank files
In-Reply-To: <4EE73C65.1080101@gmail.com>
References: <32941955.post@talk.nabble.com>
	<5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
	<32965574.post@talk.nabble.com> <4EE73C65.1080101@gmail.com>
Message-ID: <CAJLmuD+0Ts_5hPLL6T2vToY8+oW+PxXHaBiGGKoLXZZoGiBptg@mail.gmail.com>

Hi Roy,

Thank you. That works perfectly. I have to confess that someone else told
me to use hashes but I could not get them to work. Thanks again.

regards

Brian

On Tue, Dec 13, 2011 at 11:52 AM, Roy Chaudhuri <roy.chaudhuri at gmail.com> wrote:

> Hi Brian,
>
> Just to check I have understood you, you want to read through a genbank
> file and add additional tags to features which are listed in a
> tab-delimited file of locus tags?
>
> Your code is on the right lines, but it would be much more efficient to
> read your tab-delimited locus_tags into a hash, and check using exists,
> rather than ploughing through the (potentially very long) list of locus
> tags every time. Also, be careful with new lines in your tab file (you can
> safely get rid of them using "chomp"). You can miss out the "has_tag" check
> by using "get_tagset_values" instead of "get_tag_values", since the former
> does not complain if the tag is not present. Once you have modified your
> sequence object, you need to write it out to a new file (or STDOUT) using
> Bio::SeqIO.
>
> Also, just a couple of general points, you should always "use warnings"
> (or even better "use warnings FATAL=>qw(all)") since that can help solve
> many problems, and your code may be easier to read if you don't include the
> word "object" in all your variable names (after all you wouldn't say you
> write on a paper object using a pen object).
>
> use strict;
> use warnings FATAL=>qw(all);
> use Bio::SeqIO;
> open (my $list, 'list') or die $!;
> my %V;
> while (<$list>){
>    chomp;
>    $V{(split(/\t/, $_))[0]}=1;
>
> }
> my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
> my $seq_object = $seqio_object->next_seq;
> for my $feat_object ($seq_object->remove_SeqFeatures){
>
>    if ($feat_object->primary_tag eq "CDS"){
>        for my $V3 ($feat_object->get_tagset_values('locus_tag')){
>            if (exists $V{$V3}){
>                $feat_object->add_tag_value(listed_in_tab_file=>'yes');
>                next;
>            }
>        }
>    }
>    $seq_object->add_SeqFeature($feat_object);
> }
> Bio::SeqIO->new(-format=>'genbank')->write_seq($seq_object);
>
> Hope this helps.
> Cheers,
> Roy.
>
>
> On 13/12/2011 11:03, BForde wrote:
>
>>
>> Than you for the replies.
>>
>> My script (below) reads in a list of locus_tags from a tab delimited text
>> file. Compares these locus_tags to the locus_tags in  a genbank file and
>> where they are equal adds new features.
>> the line
>> $feat->add_tag_value()
>> needs to be defined. In the bioperl wiki this variable appears to be
>> defined
>> by giving it coordinates etc (creating a new feature). I wish to add
>> features to CDS key when the locus_tags are identical. Is this possible?
>>
>> use strict;
>> use Bio::SeqIO;
>>
>> my @V;
>> open (LIST1, 'list') ||die;
>> while (<LIST1>){
>>     push @V, (split(/\t/, $_))[0];
>> }
>> close(LIST1);
>>
>> my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
>> my $seq_object = $seqio_object->next_seq;
>>
>> for my $feat_object ($seq_object->get_SeqFeatures){
>>     if ($feat_object->primary_tag eq "CDS"){
>>         if ($feat_object->has_tag('locus_tag')){
>>             for my $V3 ($feat_object->get_tag_values('locus_tag')){
>>                 for my $V1 (@V) {
>>                     if ($V1 eq $V3){
>>                         ADD NEW FEATURES
>>
>>                     }
>>                 }
>>             }
>>         }
>>     }
>> }
>>
>> The script works down as far as the comparison point where locus_tags in
>> the
>> genbankfile "Contig100.gb" are compared against a list of locus_tags from
>> a
>> delimited txt file.
>>
>>
>> regards
>>
>> Brian
>>
>> Jason Stajich-5 wrote:
>>
>>>
>>> $feature->add_tag_value('color','blue');
>>>
>>> On Dec 9, 2011, at 8:52 AM, BForde wrote:
>>>
>>>
>>>> Hello all,
>>>>
>>>> I am new to Bioperl so I apologise if this is stupid question.
>>>>
>>>> For CDS features I which to add additional qualifiers e.g. /colour and
>>>> /note
>>>> qualifiers. I have looked at the BioPerl wiki but am still unsure as how
>>>> to
>>>> do this?
>>>>
>>>> regards
>>>>
>>>> Brian
>>>> --
>>>> View this message in context:
>>>> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
>>>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>
>>> Jason Stajich
>>> jason.stajich at gmail.com
>>> jason at bioperl.org
>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>>>
>>
>


-- 
Brian Forde
Microbiology Dept.
Bioscience Institute. Room 4.11
University College Cork
Cork
Ireland
tel:+353 21 4901306
email: b.m.forde at umail.ucc.ie


From b.m.forde at umail.ucc.ie  Mon Dec 12 17:20:53 2011
From: b.m.forde at umail.ucc.ie (BForde)
Date: Mon, 12 Dec 2011 09:20:53 -0800 (PST)
Subject: [Bioperl-l] Genbank files
In-Reply-To: <5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
References: <32941955.post@talk.nabble.com>
	<5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
Message-ID: <32959999.post@talk.nabble.com>


Thank you for the replies.

I am unsure as to how to use the line below with my script. My script so far
reads

use strict;
use Bio::SeqIO;

my @V;
open (LIST1, 'list') ||die;
while (<LIST1>){
    push @V, (split(/\t/, $_))[0];
}
close(LIST1);

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;

for my $feat_object ($seq_object->get_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        if ($feat_object->has_tag('locus_tag')){
            for my $V3 ($feat_object->get_tag_values('locus_tag')){
                for my $V1 (@V) {
                    if ($V1 eq $V3){
                        ADD NEW FEATURES
                        
                    }    
                }
            }
        }
    }
}
 
The script works down as far as the comparison point where locus_tags in the
genbankfile "Contig100.gb" are compared against a list of locus_tags from a
delimited txt file.
If possible, could you show me how to amend my script so I can add new
features?

regards

Brian

Jason Stajich-5 wrote:
> 
> $feature->add_tag_value('color','blue');
> 
> On Dec 9, 2011, at 8:52 AM, BForde wrote:
> 
>> 
>> Hello all,
>> 
>> I am new to Bioperl so I apologise if this is stupid question. 
>> 
>> For CDS features I which to add additional qualifiers e.g. /colour and
>> /note
>> qualifiers. I have looked at the BioPerl wiki but am still unsure as how
>> to
>> do this?
>> 
>> regards
>> 
>> Brian
>> -- 
>> View this message in context:
>> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>> 
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
> Jason Stajich
> jason.stajich at gmail.com
> jason at bioperl.org
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
> 

-- 
View this message in context: http://old.nabble.com/Genbank-files-tp32941955p32959999.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From Russell.Smithies at agresearch.co.nz  Wed Dec 14 03:17:02 2011
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Wed, 14 Dec 2011 16:17:02 +1300
Subject: [Bioperl-l] Genbank files
In-Reply-To: <32959999.post@talk.nabble.com>
References: <32941955.post@talk.nabble.com>
	<5674EF55-50B1-46F0-ADA5-774518B30595@gmail.com>
	<32959999.post@talk.nabble.com>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF340186CF27A@exchsth.agresearch.co.nz>

Something like this:

use strict;
use warnings;
use Bio::SeqIO;

# Read the locus_tags (first column) from the tab-delimited list.
my @V;
open (LIST1, '<', 'list') || die "Cannot open list: $!";
while (<LIST1>){
    chomp;    # strip the newline so single-column lines still match
    push @V, (split(/\t/, $_))[0];
}
close(LIST1);

my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
my $seq_object = $seqio_object->next_seq;

for my $feat_object ($seq_object->get_SeqFeatures){
    if ($feat_object->primary_tag eq "CDS"){
        if ($feat_object->has_tag('locus_tag')){
            for my $V3 ($feat_object->get_tag_values('locus_tag')){
                for my $V1 (@V) {
                    if ($V1 eq $V3){
                        # add the new qualifier to the matching CDS feature
                        $feat_object->add_tag_value('color','blue');
                    }
                }
            }
        }
    }
}
# write the sequence with its new annotations to a new GenBank file
my $io = Bio::SeqIO->new(-format => "genbank", -file => ">new.gb" );
$io->write_seq($seq_object);

Take another look at http://www.bioperl.org/wiki/HOWTO:Feature-Annotation#Building_Your_Own_Sequences

--Russell


> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of BForde
> Sent: Tuesday, 13 December 2011 6:21 a.m.
> To: Bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Genbank files
> 
> 
> Than you for the replies.
> 
> I am unsure as to how to use the line below with my script. My script so far
> reads
> 
> use strict;
> use Bio::SeqIO;
> 
> my @V;
> open (LIST1, 'list') ||die;
> while (<LIST1>){
>     push @V, (split(/\t/, $_))[0];
> }
> close(LIST1);
> 
> my $seqio_object = Bio::SeqIO->new(-file=>"Contig100.gb");
> my $seq_object = $seqio_object->next_seq;
> 
> for my $feat_object ($seq_object->get_SeqFeatures){
>     if ($feat_object->primary_tag eq "CDS"){
>         if ($feat_object->has_tag('locus_tag')){
>             for my $V3 ($feat_object->get_tag_values('locus_tag')){
>                 for my $V1 (@V) {
>                     if ($V1 eq $V3){
>                         ADD NEW FEATURES
> 
>                     }
>                 }
>             }
>         }
>     }
> }
> 
> The script works down as far as the comparison point where locus_tags in the
> genbankfile "Contig100.gb" are compared against a list of locus_tags from a
> delimited txt file.
> I possbile could you show me how to amend my script so I can add new
> features
> 
> regards
> 
> Brian
> 
> Jason Stajich-5 wrote:
> >
> > $feature->add_tag_value('color','blue');
> >
> > On Dec 9, 2011, at 8:52 AM, BForde wrote:
> >
> >>
> >> Hello all,
> >>
> >> I am new to Bioperl so I apologise if this is stupid question.
> >>
> >> For CDS features I which to add additional qualifiers e.g. /colour
> >> and /note qualifiers. I have looked at the BioPerl wiki but am still
> >> unsure as how to do this?
> >>
> >> regards
> >>
> >> Brian
> >> --
> >> View this message in context:
> >> http://old.nabble.com/Genbank-files-tp32941955p32941955.html
> >> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
> >>
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> > Jason Stajich
> > jason.stajich at gmail.com
> > jason at bioperl.org
> >
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> >
> 
> --
> View this message in context: http://old.nabble.com/Genbank-files-
> tp32941955p32959999.html
> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From l.m.timmermans at students.uu.nl  Wed Dec 14 15:43:24 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Wed, 14 Dec 2011 16:43:24 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
Message-ID: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>

Hi all,

As already mentioned on IRC, I recently wrote an SFF parser and uploaded it
to CPAN as Bio::SFF. I haven't written a Bio::SeqIO::sff wrapper yet (SFF
entries don't map 1:1 with Bio::Seq objects), but if anyone has the time
to write one I'd be most grateful.

Leon


From p.j.a.cock at googlemail.com  Wed Dec 14 16:03:05 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 14 Dec 2011 16:03:05 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
Message-ID: <CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>

On Wed, Dec 14, 2011 at 3:43 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> Hi all,
>
> As already mentioned on IRC, I recently wrote a SFF parser and uploaded it
> to CPAN as Bio::SFF. I haven't written a Bio::SeqIO::sff wrapper yet (SFF
> entries don't map 1:1 with Bio::Seq objects), but if anyone is has the time
> to write one I'd be most grateful.
>
> Leon

Hi Leon,

Have you looked at the index block at all, in order to offer random
access by read ID, or to access the Roche XML manifest? Please
ask if you need more information about this - or if you can read Python:
https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py

Is this building on Miguel Pignatelli's work? I don't recall seeing
any follow up posts from him after this one:
http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html

Peter


From cjfields at illinois.edu  Wed Dec 14 16:12:58 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Wed, 14 Dec 2011 16:12:58 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>,
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
Message-ID: <3BBC1132-E768-45D9-A107-ACD51791722D@illinois.edu>

Leon, 

Nice!  It is definitely a good idea to keep the lower-level parser and the BioPerl-bridging code separate. One of my concerns with the various parsers we have right now is that they hard-wire BioPerl classes into the parser, which makes optimization hard.

Chris

PS- Peter, I don't think the two projects are related, but I suppose Leon is the best to answer that.

Sent from my stupid iPad, now my laptop's on the fritz

On Dec 14, 2011, at 10:04 AM, "Peter Cock" <p.j.a.cock at googlemail.com> wrote:

> On Wed, Dec 14, 2011 at 3:43 PM, Leon Timmermans
> <l.m.timmermans at students.uu.nl> wrote:
>> Hi all,
>> 
>> As already mentioned on IRC, I recently wrote a SFF parser and uploaded it
>> to CPAN as Bio::SFF. I haven't written a Bio::SeqIO::sff wrapper yet (SFF
>> entries don't map 1:1 with Bio::Seq objects), but if anyone is has the time
>> to write one I'd be most grateful.
>> 
>> Leon
> 
> Hi Leon,
> 
> Have you looked at the index block at all, in order to offer random
> access by read ID, or to access the Roche XML manifest? Please
> ask if you need more information about this - or if you can read Python:
> https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py
> 
> Is this building on Miguel Pignatelli's work? I don't recall seeing
> any follow up posts from him after this one:
> http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html
> 
> Peter
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


From l.m.timmermans at students.uu.nl  Wed Dec 14 16:27:58 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Wed, 14 Dec 2011 17:27:58 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
Message-ID: <CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>

On Wed, Dec 14, 2011 at 5:03 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> Hi Leon,
>
> Have you looked at the index block at all, in order to offer random
> access by read ID, or to access the Roche XML manifest? Please
> ask if you need more information about this - or if you can read Python:
> https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py
>

I have looked at it, but not implemented it yet. There is no standardized
index, and the ones that are in common use either seem stupid (the Roche
index, which is essentially just a weirdly formatted sequential list,
though that should still be faster than a table scan) or undocumented (hash
based index).

> Is this building on Miguel Pignatelli's work? I don't recall seeing
> any follow up posts from him after this one:
> http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html
>

It isn't. I like his idea for reusing BioPython's test files though.

Leon


From p.j.a.cock at googlemail.com  Wed Dec 14 16:44:28 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 14 Dec 2011 16:44:28 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
Message-ID: <CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>

On Wed, Dec 14, 2011 at 4:27 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> On Wed, Dec 14, 2011 at 5:03 PM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
>>
>> Hi Leon,
>>
>> Have you looked at the index block at all, in order to offer random
>> access by read ID, or to access the Roche XML manifest? Please
>> ask if you need more information about this - or if you can read Python:
>> https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py
>
> I have looked at it, but not implemented it yet. There is no standardized
> index, and the ones that are in common use either seem stupid (the Roche
> index, which is essentially just a weirdly formatted sequential list, though
> that should still be faster than a table scan) or undocumented (hash based
> index).

There are two widely used indexes, both from Roche (one with and
one without an XML manifest, magic bytes .mft and .srt). They are
both just a simple table of the read names and offsets, sorted
alphabetically. This works pretty well for rapid lookup for SFF files
(because the read count is not so high), and is pretty easy.
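
To show why a sorted name/offset table is enough for quick lookup, here is a minimal sketch of a binary search over such a table once it is in memory. It does not read the actual index bytes from an SFF file; the read names, offsets and the sub name lookup_offset are all made up for illustration.

```perl
use strict;
use warnings;

# Binary search over a list of [read_name, byte_offset] pairs that is
# already sorted by name. Returns the offset, or undef if the name is absent.
sub lookup_offset {
    my ($index, $name) = @_;
    my ($lo, $hi) = (0, $#$index);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $cmp = $name cmp $index->[$mid][0];
        return $index->[$mid][1] if $cmp == 0;
        if ($cmp < 0) { $hi = $mid - 1 } else { $lo = $mid + 1 }
    }
    return;
}

# Made-up read names and offsets, purely for illustration.
my @index = sort { $a->[0] cmp $b->[0] } (
    [ 'READ_C', 4096 ], [ 'READ_A', 840 ], [ 'READ_B', 2210 ],
);
my $offset = lookup_offset(\@index, 'READ_B');
print "READ_B starts at byte $offset\n";   # prints: READ_B starts at byte 2210
```

With the offset in hand, a parser can seek straight to that read instead of scanning every record before it.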

I don't think anyone used the hash table style indexes (.hsh), which
I assume was a proof of principle or trial in the early days of SFF.

One thing to check is what Ion Torrent's SFF files use. I would
guess they've followed Roche, but I don't know. After all, the
index structure is not defined in the SFF specification - it was
left extensible on purpose.

>> Is this building on Miguel Pignatelli's work? I don't recall seeing
>> any follow up posts from him after this one:
>> http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html
>
> It isn't. I like his idea for reusing BioPython's test files though.

Yes, please do.

Peter


From gingerplum at gmail.com  Wed Dec 14 05:18:55 2011
From: gingerplum at gmail.com (plum ginger)
Date: Tue, 13 Dec 2011 21:18:55 -0800 (PST)
Subject: [Bioperl-l] a problem about BLAST
Message-ID: <447aac17-935f-4f3e-89d8-bd0b6edaf04f@r6g2000yqr.googlegroups.com>

Hi. I'm trying to run BLAST with Bio::Tools::Run::StandAloneBlast. I
need to run BLAST on more than one sequence. However, the BLAST outfile
only stores the result for the last sequence. How can I make the outfile
store all the results?

I would appreciate your help. Thanks very much!


Best regards


From jason.stajich at gmail.com  Thu Dec 15 17:02:47 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Thu, 15 Dec 2011 11:02:47 -0600
Subject: [Bioperl-l] a problem about BLAST
In-Reply-To: <447aac17-935f-4f3e-89d8-bd0b6edaf04f@r6g2000yqr.googlegroups.com>
References: <447aac17-935f-4f3e-89d8-bd0b6edaf04f@r6g2000yqr.googlegroups.com>
Message-ID: <58E5B487-7FF0-4018-B109-D6595DC2E493@gmail.com>

You are probably setting the outfile in each iteration, so every run overwrites the previous result. You need to show your code if you want someone to help you debug the problem.
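
Without seeing the code this is only a guess at the failure mode. As a
language-neutral sketch (written in Python, with a placeholder standing in
for the real BLAST call, and with made-up query names), opening the output
once outside the loop keeps every report:

```python
import os
import tempfile

def run_all(queries, outfile):
    """Write one report per query to outfile, keeping all of them."""
    # Open the output once, before the loop. Re-opening it in "w" mode
    # inside the loop would truncate it each time and keep only the
    # last report.
    with open(outfile, "w") as out:
        for name, seq in queries:
            # Placeholder for the real BLAST call (hypothetical output).
            out.write(f"report for {name} ({len(seq)} residues)\n")

# Hypothetical queries and a throwaway output path.
queries = [("seq1", "MKV"), ("seq2", "MKVLA")]
path = os.path.join(tempfile.mkdtemp(), "blast.out")
run_all(queries, path)
with open(path) as handle:
    lines = handle.read().splitlines()
```

The alternative is one outfile per query, with the query name in the file
name.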

On Dec 13, 2011, at 11:18 PM, plum ginger wrote:

> Hi. I'm trying to run BLAST with Bio::Tools::Run::StandAloneBlast. I
> need to run BLAST on more than one sequence. However, the BLAST outfile
> only stores the result for the last sequence. How can I make the outfile
> store all the results?
> 
> I would appreciate your help. Thanks very much!
> 
> 
> Best regards
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org


From pengyu.ut at gmail.com  Fri Dec 16 22:10:27 2011
From: pengyu.ut at gmail.com (Peng Yu)
Date: Fri, 16 Dec 2011 16:10:27 -0600
Subject: [Bioperl-l] How to stop rather than emit warnings with
	Bio::Das::segment?
Message-ID: <CABrM6w=tiMNRJjZtvK-8Ot-TE3WuZgvtod6q-3g=UAWk8K_WRg@mail.gmail.com>

Hi,

Bio::Das::segment can give me the following warnings without stopping
the whole program when the position for the query doesn't exist. I
could test the return result and quit when it is []. But this would
force my program to have a test whenever I call segment. I'm wondering
if there is an automatic way to let Bio::Das::segment stop in such
cases.

--------------------- WARNING ---------------------
MSG: Sequence is not dna or rna, but []. Attempting to revcom, but
unsure if this is right
---------------------------------------------------


-- 
Regards,
Peng


From cjfields at illinois.edu  Sat Dec 17 02:48:07 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sat, 17 Dec 2011 02:48:07 +0000
Subject: [Bioperl-l] How to stop rather than emit warnings
	with	Bio::Das::segment?
In-Reply-To: <CABrM6w=tiMNRJjZtvK-8Ot-TE3WuZgvtod6q-3g=UAWk8K_WRg@mail.gmail.com>
References: <CABrM6w=tiMNRJjZtvK-8Ot-TE3WuZgvtod6q-3g=UAWk8K_WRg@mail.gmail.com>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF0881CF8E@CITESMBX4.ad.uillinois.edu>

Setting verbosity to 2 should convert warnings to exceptions.   

IIRC, set '-verbose => 2' in the Bio::Das constructor, set '$das->verbose(2)' explicitly, or set the env variable BIOPERLDEBUG=2.  
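
The idea behind the verbosity levels, as I understand it, can be sketched
like this. This is a toy Python model and not BioPerl code: the class name
and the exact thresholds are assumptions for illustration.

```python
import warnings

class Root:
    """Toy stand-in for an object that carries a verbosity level."""

    def __init__(self, verbose=0):
        self.verbose = verbose

    def warn(self, msg):
        if self.verbose >= 2:
            # At verbosity 2 a warning is promoted to an exception,
            # so the program stops instead of carrying on.
            raise RuntimeError(msg)
        if self.verbose >= 0:
            warnings.warn(msg)
        # Negative verbosity: stay silent.
```

With verbose=0 the call to warn() only prints a warning; with verbose=2
the same call raises, so no per-call test on the return value is needed.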

chris

________________________________________
From: bioperl-l-bounces at lists.open-bio.org [bioperl-l-bounces at lists.open-bio.org] on behalf of Peng Yu [pengyu.ut at gmail.com]
Sent: Friday, December 16, 2011 4:10 PM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] How to stop rather than emit warnings with Bio::Das::segment?

Hi,

Bio::Das::segment can give me the following warnings without stopping
the whole program when the position for the query doesn't exist. I
could test the return result and quit when it is []. But this would
force my program to have a test whenever I call segment. I'm wondering
if there is an automatic way to let Bio::Das::segment stop in such
cases.

--------------------- WARNING ---------------------
MSG: Sequence is not dna or rna, but []. Attempting to revcom, but
unsure if this is right
---------------------------------------------------


--
Regards,
Peng


From anna.fr at gmail.com  Mon Dec 19 07:09:15 2011
From: anna.fr at gmail.com (Anna Friedlander)
Date: Mon, 19 Dec 2011 20:09:15 +1300
Subject: [Bioperl-l] StandAloneBlastPlus blastdbcmd question
Message-ID: <CALv2E+1Yvt1OhcTE_YXqho+zYZhPjihhCFupybArxMjLfD1S_g@mail.gmail.com>

Hi all

I have a question about using blastdbcmd via
Bio::Tools::Run::StandAloneBlastPlus

I have some Blast+ search results that I am manipulating in a perl
programme, and I would like to retrieve some sequence information for
some results using subject sequence IDs, and associated subject start
and end indices. If I was using blastdbcmd directly, I would do so
using the -entry and -range options.

My question is, can I use all the blastdbcmd options (or more
specifically, just the -entry and -range options) from within the
StandAloneBlastPlus module?

My apologies if I don't properly understand how this "wrapper" works!

Thanks in advance for your help
Anna Friedlander


From l.m.timmermans at students.uu.nl  Mon Dec 19 14:19:14 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Mon, 19 Dec 2011 15:19:14 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
Message-ID: <CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>

On Wed, Dec 14, 2011 at 5:44 PM, Peter Cock <p.j.a.cock at googlemail.com>wrote:

> There are two widely used indexes, both from Roche (one with and
> one without an XML manifest, magic bytes .mft and .srt). They are
> both just a simple table of the read names and offsets, sorted
> alphabetically.


Yeah, that's what I got from the BioPython code. I didn't know it was
sorted though (it doesn't make much sense either, unless they wanted to do
a binary search or something).

> This works pretty well for rapid lookup for SFF files
> (because the read count is not so high), and is pretty easy.
>

It's implemented in Bio::SFF 0.003. I did restructure my code into two
readers though, since doing sequential and random-access in the class
didn't make much sense code-wise.

> I don't think anyone used the hash table style indexes (.hsh), which
> I assume was a proof of principle or trial in the early days of SFF.
>

I see, too bad.


> One thing to check is what Ion Torrent's SFF files use. I would
> guess they've followed Roche, but I don't know. After all, the
> index structure is not defined in the SFF specification - it was
> left extensible on purpose.
>

Yeah, we should check that too.

> Yes, please do.
>

It's added to 0.003. The lack of tests was bothering me, but the SFFs I had
at hand were not suitable.

Leon


From p.j.a.cock at googlemail.com  Mon Dec 19 14:31:18 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 19 Dec 2011 14:31:18 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
Message-ID: <CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>

On Mon, Dec 19, 2011 at 2:19 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> On Wed, Dec 14, 2011 at 5:44 PM, Peter Cock <p.j.a.cock at googlemail.com>wrote:
>
>> There are two widely used indexes, both from Roche (one with and
>> one without an XML manifest, magic bytes .mft and .srt). They are
>> both just a simple table of the read names and offsets, sorted
>> alphabetically.
>
> Yeah, that's what I got from the BioPython code. I didn't know it
> was sorted though (it doesn't make much sense either, unless they
> wanted to do a binary search or something).

I presume that's what Roche uses if they keep the index on disk.

The alternative is to load the index into RAM, which is really fast.
You just open the SFF, read the header, seek to the index, load
the index. Without the index, you have to scan the entire SFF file
to find each record and its offset - which is much slower.
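
A rough Python sketch of those steps, for reference. The field layout is
the fixed-size part of the SFF common header as I understand it
(big-endian); the function name and the example bytes are made up, and
decoding the index block itself is left out.

```python
import io
import struct

# magic, 4 version bytes, index_offset (uint64), index_length (uint32),
# number_of_reads (uint32), header_length, key_length,
# number_of_flows_per_read (uint16 each), flowgram_format_code (uint8)
HEADER_FMT = ">4s4BQIIHHHB"

def read_index_block(handle):
    """Return the raw index bytes, or None when the file has no index."""
    handle.seek(0)
    fixed = handle.read(struct.calcsize(HEADER_FMT))
    fields = struct.unpack(HEADER_FMT, fixed)
    magic, index_offset, index_length = fields[0], fields[5], fields[6]
    if magic != b".sff":
        raise ValueError("not an SFF file")
    if index_offset == 0 or index_length == 0:
        return None  # no index: the caller has to scan the whole file
    handle.seek(index_offset)
    return handle.read(index_length)

# Tiny synthetic "file": a 40-byte header region followed by 8 index bytes.
header = struct.pack(HEADER_FMT, b".sff", 0, 0, 0, 1, 40, 8, 0, 40, 4, 0, 1)
fake_sff = io.BytesIO(header.ljust(40, b"\0") + b".srt1.00")
index_bytes = read_index_block(fake_sff)
```

A zero index_offset or index_length is treated as "no index", which is the
case where a full scan is unavoidable.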

>> This works pretty well for rapid lookup for SFF files
>> (because the read count is not so high), and is pretty easy.
>
> It's implemented in Bio::SFF 0.003. I did restructure my code into two
> readers though, since doing sequential and random-access in the class
> didn't make much sense code-wise.
>
>> I don't think anyone used the hash table style indexes (.hsh), which
>> I assume was a proof of principle or trial in the early days of SFF.
>
> I see, too bad.
>
>> One thing to check is what Ion Torrent's SFF files use. I would
>> guess they've followed Roche, but I don't know. After all, the
>> index structure is not defined in the SFF specification - it was
>> left extensible on purpose.
>
> Yeah, we should check that too.

I don't have any Ion Torrent data first hand, and the public
samples I've seen were FASTQ not SFF. But I know a few
people with Ion Torrent machines that might be able to help...

> It's added to 0.003. The lack of tests was bothering me, but the
> SFFs I had at hand were not suitable.

Have you looked at the sample SFF data in Biopython? Please
use them for the BioPerl unit tests (we've been talking about a
cross project collection of test data files like this), the README
file should be self-explanatory:
https://github.com/biopython/biopython/tree/master/Tests/Roche

Peter


From p.j.a.cock at googlemail.com  Mon Dec 19 15:13:53 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 19 Dec 2011 15:13:53 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <4FCCB636-3CC8-40E1-A6DB-55DD8F5F3CBD@sgul.ac.uk>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<4FCCB636-3CC8-40E1-A6DB-55DD8F5F3CBD@sgul.ac.uk>
Message-ID: <CAKVJ-_4U_Yt5A8f4QLxb-SzT8Y7n-2kRvGH=g9n+NfqAFegxgA@mail.gmail.com>

On Mon, Dec 19, 2011 at 3:03 PM, Adam Witney <awitney at sgul.ac.uk> wrote:
>> I don't have any Ion Torrent data first hand, and the public
>> samples I've seen were FASTQ not SFF. But I know a few
>> people with Ion Torrent machines that might be able to help...
>
> I can let you have some Ion Torrent SFF files if it helps
>
> adam

Hi Adam,

I've just had a quick look at a file from an IonTorrent 314 chip
that a colleague kindly sent me, and that SFF file had no index
(but only 50k reads so this isn't so important).

If you can send me (and Leon?) one or two original SFF files that
would be useful, even if just to confirm that Ion Torrent's SFF files
do indeed typically lack an index. If that is the case, I may need to
remove the warning message Biopython currently prints when
indexing these files: No SFF index, doing it the slow way

Off list is fine if you'd like to keep the data private, use dropbox
or something if you don't have an FTP server.

Thanks,

Peter


From awitney at sgul.ac.uk  Mon Dec 19 15:03:16 2011
From: awitney at sgul.ac.uk (Adam Witney)
Date: Mon, 19 Dec 2011 15:03:16 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
Message-ID: <4FCCB636-3CC8-40E1-A6DB-55DD8F5F3CBD@sgul.ac.uk>

>>> One thing to check is what Ion Torrent's SFF files use. I would
>>> guess they've followed Roche, but I don't know. After all, the
>>> index structure is not defined in the SFF specification - it was
>>> left extensible on purpose.
>> 
>> Yeah, we should check that too.
> 
> I don't have any Ion Torrent data first hand, and the public
> samples I've seen were FASTQ not SFF. But I know a few
> people with Ion Torrent machines that might be able to help...

I can let you have some Ion Torrent SFF files if it helps

adam


From l.m.timmermans at students.uu.nl  Mon Dec 19 15:48:34 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Mon, 19 Dec 2011 16:48:34 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
Message-ID: <CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>

On Mon, Dec 19, 2011 at 3:31 PM, Peter Cock <p.j.a.cock at googlemail.com>wrote:

> I presume that's what Roche uses if they keep the index on disk.
>
> The alternative is to load the index into RAM, which is really fast.
> You just open the SFF, read the header, seek to the index, load
> the index. Without the index, you have to scan the entire SFF file
> to find each record and its offset - which is much slower.
>

That's what I'm doing now. It's much faster, but it still takes a
noticeable amount of time on large files.

> Have you looked at the sample SFF data in Biopython? Please
> use them for the BioPerl unit tests (we've been talking about a
> cross project collection of test data files like this), the README
> file should be self-explanatory:
> https://github.com/biopython/biopython/tree/master/Tests/Roche
>

Yeah, I'm using those now
(https://github.com/Leont/bio-sff/blob/master/t/reader.t). I must say there
were some interesting corner cases in it.

Leon


From p.j.a.cock at googlemail.com  Mon Dec 19 16:15:15 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 19 Dec 2011 16:15:15 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
Message-ID: <CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>

On Mon, Dec 19, 2011 at 3:48 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> On Mon, Dec 19, 2011 at 3:31 PM, Peter Cock wrote:
>
>> Have you looked at the sample SFF data in Biopython? Please
>> use them for the BioPerl unit tests (we've been talking about a
>> cross project collection of test data files like this), the README
>> file should be self-explanatory:
>> https://github.com/biopython/biopython/tree/master/Tests/Roche
>
> Yeah, I'm using those now
> (https://github.com/Leont/bio-sff/blob/master/t/reader.t).

Could you add a link to your /corpus/README.txt file pointing
back to the Biopython original for acknowledgement and
future reference?

>
> I must say there were some interesting corner cases in it.
>

I'm glad you agree - and if you can think of any more special
cases to verify that would be great.

Are you doing just SFF parsing for now? Not writing?

Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
format name "sff" to mean the full read sequence (with mixed
case, upper case for the good sequence, lower cases for any
left/right clipping - as in the Roche tools), and "sff-trim" to mean
the trimmed sequences. I would encourage you to do the
same, as part of the general aim of having consistent
sequence format names between BioPerl, Biopython, and
EMBOSS, where possible.
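
To illustrate the two conventions, here is a small Python sketch. It
assumes the clip points have already been converted to 0-based, half-open
slice bounds (the SFF record stores 1-based clip values, with 0 meaning
"not set", and that conversion is omitted). The function name and the
example read are made up.

```python
def apply_clipping(bases, clip_left, clip_right):
    """Return (full, trimmed) versions of a read.

    full    -> "sff" style: good region upper case, clipped ends lower case
    trimmed -> "sff-trim" style: only the good region
    """
    good = bases[clip_left:clip_right].upper()
    full = bases[:clip_left].lower() + good + bases[clip_right:].lower()
    return full, good

# Hypothetical read: 4-base key, 7 good bases, 3 clipped bases at the end.
full, trimmed = apply_clipping("TCAGGATTACAGGT", 4, 11)
# full == "tcagGATTACAggt", trimmed == "GATTACA"
```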

Peter


From l.m.timmermans at students.uu.nl  Mon Dec 19 16:47:41 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Mon, 19 Dec 2011 17:47:41 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
Message-ID: <CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>

On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock <p.j.a.cock at googlemail.com>wrote:

> Could you add a link to your /corpus/README.txt file pointing
> back to the Biopython original for acknowledgement and
> future reference?
>

I forgot about that, I will add it to the next release.

> Are you doing just SFF parsing for now? Not writing?
>

I haven't written the writer yet (haven't needed it so far). I'd rather
release working code early instead of waiting until everything is complete.

> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
> format name "sff" to mean the full read sequence (with mixed
> case, upper case for the good sequence, lower cases for any
> left/right clipping - as in the Roche tools), and "sff-trim" to mean
> the trimmed sequences. I would encourage you to do the
> same, as part of the general aim of having consistent
> sequence format names between BioPerl, Biopython, and
> EMBOSS, where possible.
>

I agree, consistency is good.

Leon


From p.j.a.cock at googlemail.com  Mon Dec 19 17:00:03 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 19 Dec 2011 17:00:03 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
Message-ID: <CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>

On Mon, Dec 19, 2011 at 4:47 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
>>
>> Could you add a link to your /corpus/README.txt file pointing
>> back to the Biopython original for acknowledgement and
>> future reference?
>
> I forgot about that, I will add it to the next release.

Thanks.

>> Are you doing just SFF parsing for now? Not writing?
>
>
> I haven't written the writer yet (haven't needed it so far). I'd rather
> release working code early instead of waiting until everything is complete.

I understand - but make sure you've designed the data structures
in the parser so as to allow the original record to be re-built as SFF.

>> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
>> format name "sff" to mean the full read sequence (with mixed
>> case, upper case for the good sequence, lower cases for any
>> left/right clipping - as in the Roche tools), and "sff-trim" to mean
>> the trimmed sequences. I would encourage you to do the
>> same, as part of the general aim of having consistent
>> sequence format names between BioPerl, Biopython, and
>> EMBOSS, where possible.
>
> I agree, consistency is good.

Great. I'd guess Bio::SeqIO integration would be more important
than SFF output initially.

Peter


From cjfields at illinois.edu  Mon Dec 19 19:44:22 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 19 Dec 2011 19:44:22 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>,
	<CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
Message-ID: <D5B63715-D712-44E4-BBAC-35C721AF2098@illinois.edu>

Kinda joining this a little late, but I think if there is a way to have a low-level parser/writer that generically parses the data into simple (possibly hash-tagged) data structures, that would be best.  Barring that, a very simple class for storing data.  We've found BioPerl objects/classes pretty heavy.

(for an example of this, see Heng Li's readfq parser on github, which has some stats for Fastq/fasta parsing).

Any way we can separate the parser from object instantiation would enable us to optimize the object/class layer and parser/writer layers separately, with the possible nice side effect of making the parser more broadly used.  

For instance, if someone wanted a faster parser, use the low level, otherwise use the higher level (possibly BioPerl-specific) API. Lincoln does this to a certain degree with Bio-samtools; I would go further and put the bp- and non-bp code in separate dists.
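
A rough sketch of the layering I have in mind, in Python for brevity. FASTA
stands in for SFF here, and SimpleSeq and both function names are
hypothetical; this is not taken from readfq or Bio-samtools.

```python
import io

def parse_fasta(handle):
    """Low-level layer: yield plain (name, sequence) tuples, no objects."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split(None, 1)
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

class SimpleSeq:
    """Hypothetical thin object layer built on top of the tuples."""

    def __init__(self, name, seq):
        self.name = name
        self.seq = seq

def parse_records(handle):
    """High-level layer: same parser, wrapped into objects on demand."""
    for name, seq in parse_fasta(handle):
        yield SimpleSeq(name, seq)

tuples = list(parse_fasta(io.StringIO(">a first\nAC\nGT\n>b\nTT\n")))
```

Callers who only need speed use parse_fasta() directly; the object layer
can then be optimized or replaced without touching the parser.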

Chris

Sent from my iPad

On Dec 19, 2011, at 11:05 AM, "Peter Cock" <p.j.a.cock at googlemail.com> wrote:

> On Mon, Dec 19, 2011 at 4:47 PM, Leon Timmermans
> <l.m.timmermans at students.uu.nl> wrote:
>> On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock <p.j.a.cock at googlemail.com>
>> wrote:
>>> 
>>> Could you add a link to your /corpus/README.txt file pointing
>>> back to the Biopython original for acknowledgement and
>>> future reference?
>> 
>> I forgot about that, I will add it to the next release.
> 
> Thanks.
> 
>>> Are you doing just SFF parsing for now? Not writing?
>> 
>> 
>> I haven't written the writer yet (haven't needed it so far). I'd rather
>> release working code early instead of waiting until everything is complete.
> 
> I understand - but make sure you've designed the data structures
> in the parser so as to allow the original record to be re-built as SFF.
> 
>>> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
>>> format name "sff" to mean the full read sequence (with mixed
>>> case, upper case for the good sequence, lower cases for any
>>> left/right clipping - as in the Roche tools), and "sff-trim" to mean
>>> the trimmed sequences. I would encourage you to do the
>>> same, as part of the general aim of having consistent
>>> sequence format names between BioPerl, Biopython, and
>>> EMBOSS, where possible.
>> 
>> I agree, consistency is good.
> 
> Great. I'd guess Bio::SeqIO integration would be more important
> than SFF output initially.
> 
> Peter


From cjfields at illinois.edu  Tue Dec 20 00:28:25 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Mon, 19 Dec 2011 18:28:25 -0600
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
Message-ID: <4EEFD6A9.3010303@illinois.edu>

On 12/19/2011 10:47 AM, Leon Timmermans wrote:
> On Mon, Dec 19, 2011 at 5:15 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
>> Could you add a link to your /corpus/README.txt file pointing
>> back to the Biopython original for acknowledgement and
>> future reference?
>>
> I forgot about that, I will add it to the next release.
>
>> Are you doing just SFF parsing for now? Not writing?
>>
> I haven't written the writer yet (haven't needed it so far). I'd rather
> release working code early instead of waiting until everything is complete.
>
>> Now, as to Bio::SeqIO integration, Biopython's SeqIO uses
>> format name "sff" to mean the full read sequence (with mixed
>> case, upper case for the good sequence, lower cases for any
>> left/right clipping - as in the Roche tools), and "sff-trim" to mean
>> the trimmed sequences. I would encourage you to do the
>> same, as part of the general aim of having consistent
>> sequence format names between BioPerl, Biopython, and
>> EMBOSS, where possible.
>>
> I agree, consistency is good.
>
> Leon

This is already implemented in Bio::SeqIO, I believe.  It is the same
line of thinking as with the FASTQ format: one can pass a
'format-variant' combination that (as one might guess) tells the
parser which variation of the format to expect, so logic within the
parser can deal with it.  You can also pass the '-variant => "foo"'
parameter, IIRC.  You would just check the variant with the variant()
method.
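
The naming convention itself is simple enough to sketch. This is only an
illustration in Python of splitting a 'format-variant' name; it is not the
Bio::SeqIO implementation, and the function name is made up.

```python
def split_format(spec):
    """Split 'format-variant' into (format, variant); variant may be None."""
    fmt, _, variant = spec.lower().partition("-")
    return fmt, (variant or None)

sff_trim = split_format("sff-trim")          # ("sff", "trim")
fastq = split_format("fastq-illumina")       # ("fastq", "illumina")
plain = split_format("fasta")                # ("fasta", None)
```

The parser for the base format then branches on the variant where the two
flavours differ.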

chris


From l.m.timmermans at students.uu.nl  Tue Dec 20 15:25:13 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Tue, 20 Dec 2011 16:25:13 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
	<CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
Message-ID: <CAC1jpXDmU3jF-3G9dnB=-0fCkydqWd18J0eo2nKkSLvKTbOxjQ@mail.gmail.com>

On Mon, Dec 19, 2011 at 6:00 PM, Peter Cock <p.j.a.cock at googlemail.com>wrote:

> I understand - but make sure you've designed the data structures
> in the parser so as to allow the original record to be re-built as SFF.
>

I did, though currently it's rather hard to make new entries from scratch.
That said, I can hardly imagine anyone wanting to do this.

> Great. I'd guess Bio::SeqIO integration would be more important
> than SFF output initially.
>

Probably. It looks like it's quite easy, it's just rather underdocumented.

Leon


From l.m.timmermans at students.uu.nl  Tue Dec 20 15:26:11 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Tue, 20 Dec 2011 16:26:11 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <D5B63715-D712-44E4-BBAC-35C721AF2098@illinois.edu>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
	<CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
	<D5B63715-D712-44E4-BBAC-35C721AF2098@illinois.edu>
Message-ID: <CAC1jpXAq3cN0GriHndyQnbpn3v26M1o927i8z_a=x5bQU-iPQg@mail.gmail.com>

On Mon, Dec 19, 2011 at 8:44 PM, Fields, Christopher J <
cjfields at illinois.edu> wrote:

> Kinda joining this a little late, but I think if there is a way to have a
> low-level parser/writer that generically parses the data into simple
> (possibly hash-tagged) data structures, that would be best.  Barring that,
> a very simple class for storing data.  We've found BioPerl objects/classes
> pretty heavy.
>
> (for an example of this, see Heng Li's readfq parser on github, which has
> some stats for Fastq/fasta parsing).
>
> Any way we can separate the parser from object instantiation would enable
> us to optimize the object/class layer and parser/writer layers separately,
> with the possible nice side effect of making the parser more broadly used.
>
> For instance, if someone wanted a faster parser, use the low level,
> otherwise use the higher level (possibly BioPerl-specific) API. Lincoln
> does this to a certain degree with Bio-samtools; I would go further and
> make the bp- and non-bp code in separate dists.
>

A good OO system can actually help make things faster. For example, I'm
unpacking the flowspace and quality data lazily, which made scanning
through an SFF file 2.5-3 times as fast while having marginal extra costs
when you do need them.
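
The general technique looks roughly like this. It is a Python sketch rather
than the Bio::SFF code: the class and attribute names are hypothetical, and
it assumes flow values are stored as big-endian 16-bit integers and
qualities as one byte per base.

```python
import struct
from functools import cached_property

class LazyRead:
    """Keep the raw bytes and decode them only on first access."""

    def __init__(self, name, bases, raw_flows, raw_quals):
        self.name = name
        self.bases = bases
        self._raw_flows = raw_flows
        self._raw_quals = raw_quals
        self.unpack_count = 0  # only here to show that decoding happens once

    @cached_property
    def flowgram(self):
        self.unpack_count += 1
        count = len(self._raw_flows) // 2
        return struct.unpack(f">{count}H", self._raw_flows)

    @cached_property
    def quality(self):
        return tuple(self._raw_quals)

# Hypothetical record with three flow values and three quality scores.
read = LazyRead("READ_A", "TCA", struct.pack(">3H", 101, 0, 198), bytes([40, 38, 35]))
```

Scanning code that only looks at names and bases never pays for the
decoding; code that needs the flowgram pays once per read.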

Leon


From l.m.timmermans at students.uu.nl  Tue Dec 20 15:30:54 2011
From: l.m.timmermans at students.uu.nl (Leon Timmermans)
Date: Tue, 20 Dec 2011 16:30:54 +0100
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <4EEFD6A9.3010303@illinois.edu>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
	<4EEFD6A9.3010303@illinois.edu>
Message-ID: <CAC1jpXD_sSYoU2DS33Yn99c0WToyXvTY2aJdcS6w-yZ0xfCFMg@mail.gmail.com>

On Tue, Dec 20, 2011 at 1:28 AM, Chris Fields <cjfields at illinois.edu> wrote:

> This is already implemented in Bio::SeqIO I believe.  This is the same
> line of thinking with the FASTQ format, that one can have a
> 'format-variant' combination that (as one might guess) indicates to the
> parser any variation of the parser so logic within the parser can deal with
> it.  You can also pass the '-variant => "foo"' parameter as well IIRC.  You
> would just check the variant with the variant() method.
>

Great. That makes life much easier :-)

Leon


From p.j.a.cock at googlemail.com  Tue Dec 20 15:31:59 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 20 Dec 2011 15:31:59 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXDmU3jF-3G9dnB=-0fCkydqWd18J0eo2nKkSLvKTbOxjQ@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
	<CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
	<CAC1jpXDmU3jF-3G9dnB=-0fCkydqWd18J0eo2nKkSLvKTbOxjQ@mail.gmail.com>
Message-ID: <CAKVJ-_7v+wKQVXkLz_CMJXviYApyirjG9CA89mti5a3N40V8iA@mail.gmail.com>

On Tue, Dec 20, 2011 at 3:25 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> On Mon, Dec 19, 2011 at 6:00 PM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
>>
>> I understand - but make sure you've designed the data structures
>> in the parser so as to allow the original record to be re-built as SFF.
>
> I did, though currently it's rather hard to make new entries from scratch.
> That said, I can hardly imagine anyone wanting to do this.

Typical use cases I've found in using the Biopython SFF code are
filtering an SFF file (taking some records only), and modifying the
clipping values. In both cases, the user isn't creating the SFF
records from scratch.

Peter


From cjfields at illinois.edu  Tue Dec 20 22:40:31 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 20 Dec 2011 22:40:31 +0000
Subject: [Bioperl-l] Announcing Bio::SFF
In-Reply-To: <CAC1jpXAq3cN0GriHndyQnbpn3v26M1o927i8z_a=x5bQU-iPQg@mail.gmail.com>
References: <CAC1jpXBPv91+gpSRWZVfdvV7OJ_ZnoS0N8z3EB599nFZ5cYWEQ@mail.gmail.com>
	<CAKVJ-_4Z_EvnX=ST+dv8+p3AZbAW0mgHri7xcHAetT9KfGYqdQ@mail.gmail.com>
	<CAC1jpXAcuMiHV27CK+tvNFPas1BJJ=txO2kcaqQPc=EXzy_+8w@mail.gmail.com>
	<CAKVJ-_4nJd_cVeLa6xUapM4xfHFS6ZKG6nYwxxugQ83Juew6Kw@mail.gmail.com>
	<CAC1jpXBr=Z4bC9vp+1u3QAO-8N9gqzpfGmeb_-Wyi2FPd=J63w@mail.gmail.com>
	<CAKVJ-_7CMbVn8yi8hcAr4YcKyg-dX4PZeMq4o8mY7bDC+w8FOw@mail.gmail.com>
	<CAC1jpXB46vLppi+Y-2+q3RTG15XJcZiZ=gg+C-pjfSTdHVp5Ag@mail.gmail.com>
	<CAKVJ-_4xPceuLJbEQrfzQUv7r_HeYbYKTRhDEmNBPCwQWZ0XkA@mail.gmail.com>
	<CAC1jpXBEFR2hH=0yW+eQHS5Fowo9PUz2rOCcGTwSixnVZxVQWQ@mail.gmail.com>
	<CAKVJ-_4RqBDSpt1ed0t093WTbN_vAnHJFpBEoKNO32VDE2aK0w@mail.gmail.com>
	<D5B63715-D712-44E4-BBAC-35C721AF2098@illinois.edu>,
	<CAC1jpXAq3cN0GriHndyQnbpn3v26M1o927i8z_a=x5bQU-iPQg@mail.gmail.com>
Message-ID: <CE1C3005-EA13-4C4E-A4B5-7F387D0E8E0B@illinois.edu>


On Dec 20, 2011, at 9:26 AM, "Leon Timmermans" <l.m.timmermans at students.uu.nl> wrote:

> On Mon, Dec 19, 2011 at 8:44 PM, Fields, Christopher J <cjfields at illinois.edu> wrote:
>> Kinda joining this a little late, but I think if there is a way to have a low-level parser/writer that generically parses the data into simple (possibly hash-tagged) data structures, that would be best.  Barring that, a very simple class for storing data.  We've found BioPerl objects/classes pretty heavy.
>>
>> (for an example of this, see Heng Li's readfq parser on github, which has some stats for Fastq/fasta parsing).
>>
>> Any way we can separate the parser from object instantiation would enable us to optimize the object/class layer and parser/writer layers separately, with the possible nice side effect of making the parser more broadly used.
>>
>> For instance, if someone wanted a faster parser, use the low level, otherwise use the higher level (possibly BioPerl-specific) API. Lincoln does this to a certain degree with Bio-samtools; I would go further and put the bp- and non-bp code in separate dists.
>
> A good OO system can actually help make things faster. For example, I'm unpacking the flowspace and quality data lazily, which made scanning through an SFF file 2.5-3 times as fast while having marginal extra costs when you do need them.
>
> Leon

Yep, thinking about using the same approach for the Fastq variants.

Chris

Sent from my ancient iPad b/c my laptop's borked


From dgacquer at ulb.ac.be  Wed Dec 21 13:26:07 2011
From: dgacquer at ulb.ac.be (David Gacquer)
Date: Wed, 21 Dec 2011 14:26:07 +0100
Subject: [Bioperl-l] Strange behaviour in the write_seq function for large
	fasta
Message-ID: <4EF1DE6F.4070508@ulb.ac.be>

Dear BioPerl users/developers,

I am facing a strange issue with the $seq_out->write_seq function when 
using large FASTA files.

I have downloaded the hg19 chromosome 1, and applied the following code 
(basically I wanted to mask some regions in it but the problem also 
appears when copying the sequence without modifications):

use strict;
use warnings;
use Bio::SeqIO;
use Bio::Seq::LargePrimarySeq;

sub main {
    my $seq_in  = Bio::SeqIO->new( -format => 'largefasta', -file => $ARGV[0] );
    my $seq_out = Bio::SeqIO->new( -format => 'largefasta', -file => '>' . $ARGV[1] );
    # Read the first record and copy its sequence unchanged.
    my $seq_obj_in   = $seq_in->next_seq();
    my $modified_seq = $seq_obj_in->seq();
    my $seq_obj_out  = Bio::Seq::LargePrimarySeq->new(
        -seq  => $modified_seq,
        -id   => $seq_obj_in->id,
        -desc => $seq_obj_in->desc,
    );
    $seq_out->write_seq($seq_obj_out);
}

main();

When checking the output FASTA file, the sequence of chr1 is 1 bp shorter.

I have noticed that in the original fasta file, each line contains 
exactly 50 nucleotides, while the output of the $seq_out->write_seq 
function contains always 60 characters per line.
chr1 is exactly 249,250,621 bp (4,154,177 * 60 + 1) so to verify that 
the very last base was missing, I created the following fasta files:

chr121.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAG

chr122.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAG

They contain respectively 121 (60*2+1) and 122 (60*2+2) bp, the last 
character being a G. When running the above code:

chr121.out.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

chr122.out.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AG

The output for the 122 bp chromosome is correct (2 lines of 60 bp and 
the last line with 2 bp, AG) but for the 121 bp chromosome, the last 
character is missing (2 lines of 60 bp only, last G is missing).

When replacing -format => 'largefasta' by -format => 'fasta' or writing 
the output without the write_seq function however, the problem is solved.
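
For comparison, a minimal pure-Perl sketch of wrapping at 60 columns that
keeps a trailing partial line. wrap_sequence is a hypothetical helper, not
the largefasta writer; it only illustrates the output one would expect for
the 121 bp and 122 bp test sequences.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): split a sequence into lines of
# at most $width characters, keeping a final short line if there is one.
sub wrap_sequence {
    my ($seq, $width) = @_;
    $width ||= 60;
    my @lines;
    for (my $pos = 0; $pos < length($seq); $pos += $width) {
        push @lines, substr($seq, $pos, $width);   # last chunk may be shorter
    }
    return @lines;
}

# Rebuild the two test sequences: a run of A's ending in a single G.
for my $len (121, 122) {
    my $seq   = ('A' x ($len - 1)) . 'G';
    my @lines = wrap_sequence($seq, 60);
    printf "%d bp -> %d lines, last line '%s'\n", $len, scalar @lines, $lines[-1];
}
```

Both lengths should give three lines here, with the 121 bp case ending in a
line that holds only the final G.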

Am I missing something or is there a problem with the write_seq function 
used with large fasta files? (I am running BioPerl on a Mac under OS X 
Snow Leopard)

Best regards

David

-- 
David Gacquer, Ph. D.

IRIBHM - Universite Libre de Bruxelles
Bldg C, room C.4.117
ULB, Campus Erasme, CP602
808 route de Lennik
B-1070 Brussels
Belgium

Phone: +32-2-555 4187
Fax: +32-2-555 4655
E-mail: dgacquer at ulb.ac.be


From koraydogankaya at gmail.com  Sat Dec 24 08:44:43 2011
From: koraydogankaya at gmail.com (Koray)
Date: Sat, 24 Dec 2011 00:44:43 -0800 (PST)
Subject: [Bioperl-l] exons
Message-ID: <8454dd72-4fed-41c5-977e-83d9300cd68b@z25g2000vbs.googlegroups.com>

I need explicit code for getting the exon sequences of an mRNA or gene
fetched by get_Seq_by_acc or by ID.

In Ensembl this is easy, but here it is not, because there are many IO
modules.

For example, how can I get such a $gene object from the databases (GenBank
or Entrez Gene) by accession number or ID?


 Title   : exons()
 Usage   : @exons = $gene->exons();
           @initial_exons = $gene->exons('Initial');
 Function: Get all exon features or all exons of a specified type of this gene
           structure.

           Exon type is treated as a case-insensitive regular expression and
           optional. For consistency, use only the following types:
           initial, internal, terminal, utr, utr5prime, and utr3prime.
           A special and virtual type is 'coding', which refers to all types
           except utr.

           This method basically merges the exons returned by transcripts.

 Returns : An array of Bio::SeqFeature::Gene::ExonI implementing objects.
 Args    : An optional string specifying the type of exon.


From challa_ghanashyam at yahoo.com  Sat Dec 24 20:09:09 2011
From: challa_ghanashyam at yahoo.com (GSC)
Date: Sat, 24 Dec 2011 12:09:09 -0800 (PST)
Subject: [Bioperl-l] re trieve description for a list of gi ids..
Message-ID: <33034438.post@talk.nabble.com>


Hi all:
I am new to Perl. I am working on a script to retrieve the record
description (the name given to a sequence record in GenBank) for a list of
GI IDs. The script works fine for 1000 IDs, but my list is about 250,000 IDs
long and it does not work for that. Any suggestions?

GS
-- 
View this message in context: http://old.nabble.com/retrieve-description-for-a-list-of-gi-ids..-tp33034438p33034438.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.


From cjfields at illinois.edu  Tue Dec 27 15:03:28 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 27 Dec 2011 15:03:28 +0000
Subject: [Bioperl-l] Strange behaviour in the write_seq function for
 large	fasta
In-Reply-To: <4EF1DE6F.4070508@ulb.ac.be>
References: <4EF1DE6F.4070508@ulb.ac.be>
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF0BB29547@CITESMBX4.ad.uillinois.edu>

This is a strange one.  Personally I haven't seen this behavior, but maybe it's OS-dependent?

We'll need more information, particularly what version of BioPerl you are using, the OS, version of perl, etc.  Also, in general to make sure we don't lose track of this issue it is best to submit a bug report:

https://redmine.open-bio.org/projects/bioperl

I'm planning on triaging bugs next week; I could take a look then.

chris