talk@lists.collectionspace.org

Tuning CollectionSpace's database settings: new benchmarking tool available

Aron Roberts
Thu, Aug 23, 2012 2:21 AM

One of the aspects of tuning a CollectionSpace system for
performance is adjusting ("tuning") various settings of its
underlying database system, PostgreSQL.  (A related aspect of
tuning involves experimenting with different OS settings and
virtual machine/hardware configurations, such as changing memory or
disk size, or the number of processors.)

This is to announce the availability of a recently released tool
for that task: a benchmarking script, written by Ray Lee, for
evaluating the effects of such changes on the execution times of
SQL queries.  It's already proving very helpful:

https://github.com/collectionspace/Tools/tree/master/benchmarks/db-benchmark

The script is written in Perl and should run on any Unix-like
system (Linux, Unix, Mac OS X).  You can use it to automate
repeatedly running the kinds of queries you wish to speed up,
letting you systematically analyze how each tuning change you make
to your database, or to its OS and machine environment, affects
query performance.

To use the script, place one or more files, each containing a
single SQL query, into a directory (or directories) of your choice.
You can then point the tool at a single SQL file or at a directory
of such files; e.g.

perl db_benchmark.pl my_sql_directory_name

and it will run your SQL queries multiple times (by default, 10
times), reporting on their min/max/average times and standard
deviations (to show query variability).  It will also save detailed
analyses (via EXPLAIN ANALYZE) of each query's execution, as well as a
snapshot of your database configuration, database table settings
(including indexes), and selected OS configuration settings.
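
For instance, you might create a directory containing one file per
query, each file holding a single SQL statement.  (The file name,
table, and query below are purely illustrative; use queries
captured from your own system, as described in the next section.)

  -- my_sql_directory_name/slow_search.sql
  SELECT * FROM collectionobjects_common
  WHERE objectnumber LIKE '2012.%';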

Obtaining SQL scripts to run
----------------------------------------------

To obtain meaningful SQL scripts for this tool to run, you can
turn on PostgreSQL's logging of queries whose execution takes
longer than a threshold of your choosing (such as 2 seconds or
10 seconds).
You will typically do this by editing PostgreSQL's main configuration
file, postgresql.conf, to set one or more logging-related
configuration settings, and then telling PostgreSQL to reload its
configuration.
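
For example, the relevant line in postgresql.conf might look
something like this (a 2-second threshold is just one reasonable
choice):

  log_min_duration_statement = 2000  # log statements slower than 2000 ms

(If your server isn't already writing log files, you may also need
'logging_collector = on'; note that changing that particular
setting requires a full restart, not just a reload.)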

This document gives a succinct overview:

http://heatware.net/databases/how-to-find-log-slow-queries-postgresql/

(Note that, in contrast to the example given in the above post, a
recommended minimum value would likely be 'log_min_duration_statement
= 1000'; i.e., 1 second.  As well, the method for reloading
PostgreSQL's configuration on your system may differ from
'service postgresql reload'.  You might try, for example, these
system-independent methods for doing so:
http://heatware.net/databases/postgresql-reload-config-without-restarting/)
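
For instance, either of the following should work regardless of
platform (run as the PostgreSQL superuser; the data directory path
is a placeholder -- see below for how to find yours):

  psql -U postgres -c "SELECT pg_reload_conf();"
  pg_ctl reload -D /path/to/your/data/directory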

Once you've enabled selective logging of queries, you can then
examine PostgreSQL's log files to find the underlying SQL queries
associated with specific activities (such as searches) you've
performed via CollectionSpace's user interface.  You can also gather
the SQL queries associated with running your reports, or with requests
submitted via CollectionSpace's RESTful API.
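
With log_min_duration_statement enabled, each slow statement shows
up in the log on a line resembling the following (the duration and
query shown here are illustrative):

  LOG:  duration: 2435.071 ms  statement: SELECT * FROM collectionobjects_common WHERE objectnumber LIKE '2012.%'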

By default, PostgreSQL appears to write its logs to a 'pg_log'
directory, within its default 'data' directory.  Here's how to find
that data directory:

http://dba.stackexchange.com/questions/1350/how-do-i-find-postgresqls-data-directory
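
For instance, you can ask a running server directly, then search
its logs for slow statements (the log file name pattern varies by
version and configuration):

  psql -U postgres -c "SHOW data_directory;"
  grep "duration:" /path/to/data/pg_log/postgresql-*.log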

Note as well that you may need to do a bit of editing to
reconstruct the original SQL queries.  It seems that, when logging
SQL queries, PostgreSQL sometimes replaces actual values with
placeholder variables, and then writes the actual values
corresponding to those variables in a second, following log
statement; you may need to substitute the actual values back in
for those placeholder variables.
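
For example, a query logged in that two-part form might look like
the following (values illustrative); to replay it, you'd substitute
'2012.1.0001' for the $1 placeholder:

  LOG:  execute <unnamed>: SELECT * FROM collectionobjects_common WHERE objectnumber = $1
  DETAIL:  parameters: $1 = '2012.1.0001'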

Aron
