One of the aspects of tuning a CollectionSpace system for
performance is tweaking ("tuning") various settings for its
underlying database system, PostgreSQL. (Another related aspect of
tuning can involve experimenting with different OS settings and
virtual machine/hardware configurations, such as changing memory or
disk size, or the number of processors.)
This is to announce the availability of a recently-released tool for
this task: a benchmarking script, written by Ray Lee, for evaluating
the effects of such changes on execution times for SQL queries. It's
already proving very helpful:
https://github.com/collectionspace/Tools/tree/master/benchmarks/db-benchmark
This script is written in Perl and should run on any Unix-like system
(Linux, Unix, Mac OS X). You can use it to automate the task of
repeatedly running the types of queries you wish to speed up, allowing
you to systematically analyze how each tuning change you make to your
database, its OS, or its machine environment affects query
performance.
To use this script, place one or more files, each containing a
single SQL query, into a directory or directories of your choice. You
can then point the tool at a single SQL file or at a directory of such
files; e.g.:
perl db_benchmark.pl my_sql_directory_name
and it will run your SQL queries multiple times (by default, 10
times), reporting on their min/max/average times and standard
deviations (to show query variability). It will also save detailed
analyses (via EXPLAIN ANALYZE) of each query's execution, as well as a
snapshot of your database configuration, database table settings
(including indexes), and selected OS configuration settings.
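As a quick sketch, setup along these lines should work (the directory
name matches the example command above, but the table and column names
in the query are purely illustrative, not actual CollectionSpace
schema):

```shell
# Create a directory holding one SQL query per file.
# (The table and column names below are illustrative only.)
mkdir -p my_sql_directory_name
cat > my_sql_directory_name/slow_search.sql <<'EOF'
SELECT id, title FROM collectionobjects WHERE title LIKE '%keyword%';
EOF

# Then run the benchmark against the whole directory (requires
# db_benchmark.pl from the repository above, plus a reachable
# PostgreSQL instance):
#   perl db_benchmark.pl my_sql_directory_name
```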
To obtain meaningful SQL scripts for this tool to run, you can turn
on logging, in PostgreSQL, of queries whose execution takes longer
than a threshold of your choosing (such as 2 seconds or 10 seconds).
You will typically do this by editing PostgreSQL's main configuration
file, postgresql.conf, to set one or more logging-related
configuration settings, and then telling PostgreSQL to reload its
configuration.
This document gives a succinct overview:
http://heatware.net/databases/how-to-find-log-slow-queries-postgresql/
(Note that, in contrast to the example given in the above post, a
recommended starting value would likely be 'log_min_duration_statement
= 1000'; i.e., 1 second. As well, the method you'll use for reloading
PostgreSQL's configuration may differ on your system from
'service postgresql reload'. You might try, for example, these
system-independent methods for doing so:
http://heatware.net/databases/postgresql-reload-config-without-restarting/)
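Putting those settings together, the relevant postgresql.conf lines
might look like this (the values are examples; tune the threshold to
your needs):

```
# postgresql.conf -- log any statement that runs longer than 1 second
log_min_duration_statement = 1000   # in milliseconds; -1 disables, 0 logs everything
logging_collector = on              # capture logs to files (takes effect only on a full restart)
```

After editing, the duration threshold can be picked up with a reload
rather than a restart; from psql, 'SELECT pg_reload_conf();' is one
system-independent way to trigger that.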
Once you've enabled selective logging of queries, you can then
examine PostgreSQL's log files to find the underlying SQL queries
associated with specific activities (such as searches) you've
performed via CollectionSpace's user interface. You can also gather
the SQL queries associated with running your reports, or with requests
submitted via CollectionSpace's RESTful API.
By default, PostgreSQL appears to write its logs to a 'pg_log'
directory, within its default 'data' directory. Here's how to find
that data directory:
http://dba.stackexchange.com/questions/1350/how-do-i-find-postgresqls-data-directory
Note as well that you may need to do a bit of editing to reconstruct
the original SQL queries. When logging queries, PostgreSQL sometimes
replaces actual values with placeholder variables (such as $1), then
writes the corresponding actual values in a second, following log
statement; you may need to substitute those values back into the query
by hand.
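As a sketch of what that substitution looks like (the log excerpt
below is fabricated to mimic the typical two-line form, and the sed
commands assume a single $1 parameter; adapt for $2, $3, and so on):

```shell
# A made-up log excerpt: the statement with a $1 placeholder, followed
# by a DETAIL line carrying the actual value.
cat > slow_query.log <<'EOF'
LOG:  duration: 2543.291 ms  execute <unnamed>: SELECT * FROM hierarchy WHERE primarytype = $1
DETAIL:  parameters: $1 = 'CollectionObject'
EOF

# Extract the SQL and replace the placeholder with its quoted value
# (one substitution per parameter).
sed -n 's/.*execute [^:]*: //p' slow_query.log |
  sed "s/[$]1/'CollectionObject'/" > query.sql
cat query.sql
```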
Aron