[BioSQL-l] load_taxonomy.pl question
Aaron J. Mackey
amackey at pcbi.upenn.edu
Fri Jun 24 09:09:17 EDT 2005
I agree that it would be mildly complicated, but it's not Perl's
fault at all. It would be mildly complicated in any language. The
complication stems from the fact that the data we load from is a tab-
delimited flat file of "taxon parent-taxon" tuples, so as we load we
cannot know (without some additional upfront work) whether any given
row is desirable. If we could know that, the solution would be
trivial. One way to know that is to first read the whole file
into an in-memory representation of the tree (keeping only node IDs to
conserve memory), keep the desired subtree, and purge the rest; then,
as we process the input files, act only on those lines that apply to
members of the subtree (probably flattened to a hash to make lookup
quicker). No big deal really, and something Perl can do just as well
as any other programming language.
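
For anyone who wants a starting point, here is a minimal sketch of that
two-pass idea. It is written in Python rather than Perl, and it is not
taken from load_taxonomy.pl. It assumes a simplified input of
tab-delimited "taxon parent-taxon" lines; the real NCBI dump files may
need different splitting. The function names (parse_rows, subtree_ids,
filter_lines) are hypothetical, made up for this illustration.

```python
from collections import defaultdict


def parse_rows(lines):
    """Yield (taxon_id, parent_id) pairs from tab-delimited lines.

    Assumption: the first two columns are integer node ids.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        taxon_id, parent_id = line.split("\t")[:2]
        yield int(taxon_id), int(parent_id)


def subtree_ids(rows, root_id):
    """Return the set of node ids in the subtree rooted at root_id.

    Only ids are kept in memory, as a parent -> children map.
    """
    children = defaultdict(list)
    for taxon_id, parent_id in rows:
        if taxon_id != parent_id:  # ignore a root that lists itself as parent
            children[parent_id].append(taxon_id)

    keep = set()
    stack = [root_id]
    while stack:  # iterative walk, so deep trees cannot exhaust recursion
        node = stack.pop()
        if node in keep:
            continue
        keep.add(node)
        stack.extend(children.get(node, ()))
    return keep


def filter_lines(lines, keep):
    """Second pass: yield only the lines whose first column is in keep."""
    for line in lines:
        if line.strip() and int(line.split("\t", 1)[0]) in keep:
            yield line
```

The first pass builds the set of ids to keep (for example,
subtree_ids(parse_rows(open(path)), 1117)); the second pass runs each
input file through filter_lines before doing any real loading. The set
plays the role of the flattened lookup hash.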
I leave the implementation as an exercise for the reader, however, as
I agree that deleting everything but the desired subtree via SQL
would also work nicely, though it would not save any processing time ;)
-Aaron
On Jun 23, 2005, at 10:16 PM, Hilmar Lapp wrote:
> I guess there could be a way, but it's got to be very complicated,
> because now you're trying to do something in Perl for which Perl's
> not made.
>
> Why not just load up everything? It doesn't take that much disk space.
> Also, if you're really eager to keep only the subtree, you could
> delete the rest using SQL.
>
> -hilmar
>
> On Jun 23, 2005, at 2:00 PM, Renee Halbrook wrote:
>
>
>> Hi,
>> Is it possible to alter the load_taxonomy.pl script to
>> load data for only a certain subtree? For example, to
>> grab the taxonomy structure starting with
>> Cyanobacteria (id = 1117) as the root?
>>
>> Thanks for any feedback,
>> Renee
>>
>> _______________________________________________
>> BioSQL-l mailing list
>> BioSQL-l at open-bio.org
>> http://open-bio.org/mailman/listinfo/biosql-l
>>
>>
> --
> -------------------------------------------------------------
> Hilmar Lapp email: lapp at gnf.org
> GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
> -------------------------------------------------------------
--
Aaron J. Mackey, Ph.D.
Project Manager, ApiDB Bioinformatics Resource Center
Penn Genomics Institute, University of Pennsylvania
email: amackey at pcbi.upenn.edu
office: 215-898-1205
fax: 215-746-6697
postal: Penn Genomics Institute
Goddard Labs 212
415 S. University Avenue
Philadelphia, PA 19104-6017