Database Compaction

Because CouchDB/BigCouch writes changes to an append-only file, you must monitor the size of those files on disk and periodically compact them. Compaction takes all the current revisions in the file and writes a new file containing only the active documents, effectively deleting all old revisions. Compaction can recover significant disk space and improve performance (seek times drop on smaller files).

Automatic Compaction

Kazoo allows the cluster operator to enable automatic compaction of all databases. A process will walk the list of known databases (updating each time the process starts again) and check each database against a heuristic to decide whether to compact or not.
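
A minimal sketch of turning this on, assuming the usual sup kapps_config maintenance helpers are available on your Kazoo nodes:

    # Enable automatic compaction cluster-wide
    # (compact_automatically defaults to false; see the schema below)
    sup kapps_config set_default kazoo_couch compact_automatically true

    # Confirm the current value
    sup kapps_config get kazoo_couch compact_automatically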

System Configuration Schema

Schema

Schema for kazoo_couch system_config

| Key | Description | Type | Default | Required | Support Level |
| --- | --- | --- | --- | --- | --- |
| `admin_port` | The CouchDB API port, typically 5986 | `integer()` | `5986` | `false` | |
| `allow_maintenance_db_delete` | kazoo_couch allow maintenance db delete | `boolean()` | `false` | `false` | |
| `api_port` | The CouchDB API port, typically 5984 | `integer()` | `5984` | `false` | |
| `compact_automatically` | kazoo_couch compact automatically | `boolean()` | `false` | `false` | |
| `default_chunk_size` | kazoo_couch default chunk size | `integer()` | `1000` | `false` | |
| `max_compacting_shards` | kazoo_couch maximum compacting shards | `integer()` | `2` | `false` | |
| `max_compacting_views` | kazoo_couch maximum compacting views | `integer()` | `2` | `false` | |
| `max_concurrent_docs_to_archive` | kazoo_couch maximum concurrent docs to archive | `integer()` | `500` | `false` | |
| `max_wait_for_compaction_pids` | kazoo_couch maximum wait for compaction pids | `integer()` | `360000` | `false` | |
| `min_data_size` | kazoo_couch minimum data size | `integer()` | `131072` | `false` | |
| `min_ratio` | kazoo_couch minimum ratio | `number()` | `1.2` | `false` | |
| `sleep_between_poll` | kazoo_couch sleep between poll | `integer()` | `3000` | `false` | |
| `use_bigcouch_direct` | kazoo_couch use bigcouch direct | `boolean()` | `true` | `false` | |
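
The min_ratio and min_data_size values drive the heuristic: roughly, a database is worth compacting when its on-disk footprint is sufficiently larger than its live data and the database is not trivially small. A rough manual check of a single database might look like the following; the field names and the comparison are an approximation of the heuristic (the authoritative logic lives in kt_compactor), and jq is assumed to be installed:

    # Approximate "would this database be compacted?" check
    DB=accounts
    curl -s "http://localhost:5984/${DB}" | jq '{
      disk: .disk_size,
      data: .data_size,
      ratio: (.disk_size / .data_size),
      worth_compacting: (((.disk_size / .data_size) > 1.2) and (.data_size > 131072))
    }'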

Operations

The compaction module kt_compactor binds into the database-crawl triggers for all databases. Each time the timer controlled by tasks.browse_dbs_interval_s in system_config expires, every known database is processed. The default interval is daily, but the next timer does not start until all databases have been processed by every task module bound to them. On large systems there can be so many databases, and operations on them can take so long, that the next timer won't start for more than a day.

Typically there's nothing wrong with that; just be aware that the timer is not a strict daily timer but instead fires 86400 seconds after the last database has been processed.
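
To inspect or adjust the crawl interval, a sketch using the standard kapps_config SUP commands (the value is in seconds):

    # Show the current database-crawl interval
    sup kapps_config get tasks browse_dbs_interval_s

    # Example: stretch the interval to two days on a busy cluster
    sup kapps_config set_default tasks browse_dbs_interval_s 172800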

Per-node settings

It's important to set per-couch-node settings if your setup varies from the default ports (5984 and 5986).

    sup kapps_config set_node kazoo_couch admin_port 35986 couchdb@db001.host.com

Repeat for admin_port and api_port for each database node in your setup (curl localhost:5984/_membership if you need your list of nodes).
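
A sketch of listing the nodes and applying non-default ports to each one (assumes jq is installed; the ports shown are examples):

    # List the CouchDB/BigCouch nodes in the cluster
    curl -s 'http://localhost:5984/_membership' | jq -r '.all_nodes[]'

    # Apply custom ports to every node (adjust 35984/35986 to your setup)
    for NODE in $(curl -s 'http://localhost:5984/_membership' | jq -r '.all_nodes[]'); do
        sup kapps_config set_node kazoo_couch api_port 35984 "$NODE"
        sup kapps_config set_node kazoo_couch admin_port 35986 "$NODE"
    done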

API-initiated compaction

In addition to the automatic database crawl, the tasks API provides methods for initiating compaction across the whole cluster, on a single node, or on a single database.

Compact Everything

This is an input-less compaction job. At the end of the run you should have a CSV describing each database shard on each node with its disk/data usage before and after compaction. The individual steps are listed below; a consolidated script sketch follows the list.

  1. Create the task:

    curl -v -X PUT \
    -H "X-Auth-Token: {AUTH_TOKEN}" \
    'http://{SERVER}:8000/v2/tasks?category=compaction&action=compact_all'
    
  2. Start the task:

    curl -v -X PATCH \
    -H "X-Auth-Token: {AUTH_TOKEN}" \
    'http://{SERVER}:8000/v2/tasks/{TASK_ID}'
    
  3. Query the task’s status:

    curl -v -X GET \
    -H "X-Auth-Token: {AUTH_TOKEN}" \
    'http://{SERVER}:8000/v2/tasks/{TASK_ID}'
    
  4. Export the results:

    curl -v -X GET \
    -o compaction_results.csv \
    -H "Accept: text/csv" \
    -H "X-Auth-Token: {AUTH_TOKEN}" \
    'http://{SERVER}:8000/v2/tasks/{TASK_ID}?csv_name=out.csv'
    
    node,database,before_disk,before_data,after_disk,after_data
    node3@127.0.0.1,faxes,804116,1544,804116,1544
    node2@127.0.0.1,faxes,554260,1544,1230100,1544
    node1@127.0.0.1,faxes,132372,1544,132372,1544
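
The four steps above can be strung together into a single script. This is a sketch only: it assumes jq is available and that the task id and status are exposed at data._read_only.id and data._read_only.status in the Crossbar response envelope, which you should verify against your version:

    BASE="http://{SERVER}:8000/v2"
    AUTH="X-Auth-Token: {AUTH_TOKEN}"

    # 1. Create the compact_all task and capture its id
    TASK_ID=$(curl -s -X PUT -H "$AUTH" \
        "$BASE/tasks?category=compaction&action=compact_all" \
        | jq -r '.data._read_only.id')

    # 2. Start it
    curl -s -X PATCH -H "$AUTH" "$BASE/tasks/$TASK_ID" > /dev/null

    # 3. Poll until the task is no longer pending or executing
    while curl -s -H "$AUTH" "$BASE/tasks/$TASK_ID" \
        | jq -e '(.data._read_only.status == "pending") or (.data._read_only.status == "executing")' > /dev/null; do
        sleep 30
    done

    # 4. Export the results as CSV
    curl -s -H "$AUTH" -H "Accept: text/csv" \
        -o compaction_results.csv "$BASE/tasks/$TASK_ID?csv_name=out.csv"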
    

Compact a node

If you prefer, you can supply a CSV of nodes to compact, with a per-node flag to force compaction or to apply the ratio heuristic.

CSV columns

| Column | Description |
| --- | --- |
| `node` | Which node(s) to run compaction on |
| `force` | Whether to force compaction on each database or apply the ratio heuristic |

For instance, compact two nodes, forcing all databases to be compacted on the first node, and only those meeting the ratio criteria on the second node:

"node","force"
"bigcouch@db001.zone1.host.com","true"
"bigcouch@db004.zone3.host.co.uk","false"

API commands

Create the task:

    curl -v -X PUT \
    -H "X-Auth-Token: $AUTH_TOKEN" \
    -H "Content-Type: text/csv" \
    'http://localhost:8000/v2/tasks?category=compaction&action=compact_node' \
    --data-binary $'"node","force"\n"node1@127.0.0.1","false"\n"node2@127.0.0.1","true"'

Note: if you want to specify the CSV content in the args to cURL, you need to use $'...' to allow the \n to be passed to Crossbar properly. Otherwise create a CSV file and use @file.csv instead.
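
The file-based alternative might look like this (nodes.csv is just an example filename):

    # Write the CSV to a file, then reference it with @
    cat > nodes.csv <<'EOF'
    "node","force"
    "node1@127.0.0.1","false"
    "node2@127.0.0.1","true"
    EOF

    curl -v -X PUT \
    -H "X-Auth-Token: $AUTH_TOKEN" \
    -H "Content-Type: text/csv" \
    'http://localhost:8000/v2/tasks?category=compaction&action=compact_node' \
    --data-binary @nodes.csv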

Start, query, and export the results as before.

Compact a database

Compacting a database will ensure all shards on relevant nodes are compacted.

CSV columns

| Column | Description |
| --- | --- |
| `database` | Which database to run compaction on |
| `force` | Whether to force compaction on the database or apply the ratio heuristic |

For instance, compact two databases, forcing compaction on the first and applying the ratio heuristic to decide whether the second should be compacted:

"database","force"
"system_config","true"
"accounts","false"

API commands

  1. Create the task:
    curl -v -X PUT \
    -H "X-Auth-Token: {AUTH_TOKEN}" \
    -H "Content-Type: application/json" \
    --data-binary @/path/to/node.csv
    'http://{SERVER}:8000/v2/tasks?category=compaction&action=compact_db'
    

Start, query and export as before.

Via SUP

You can compact all databases on a node:

> sup kt_compactor compact_node {BIGCOUCH@SERVER1.COM}
node, database, before_disk, before_data, after_disk, after_data
{BIGCOUCH@SERVER1.COM}, webhooks, 41056, 8273, 24672, 8273
{BIGCOUCH@SERVER1.COM}, token_auth, 24672, 283, 8288, 283
{BIGCOUCH@SERVER1.COM}, tasks, 28768, 2724, 12384, 2724
...

You can compact a database across all nodes:

> sup kt_compactor compact_db {DATABASE}
node, database, before_disk, before_data, after_disk, after_data
{BIGCOUCH@SERVER1.COM}, {DATABASE}, 24672, 4192, 16480, 4192
{BIGCOUCH@SERVER2.COM}, {DATABASE}, 24672, 4192, 16480, 4192
{BIGCOUCH@SERVER3.COM}, {DATABASE}, 24672, 4192, 16480, 4192

You can compact account databases by using the account id; compact MODBs with {ACCOUNT_ID}-{YYYY}{MM}.
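
For example, mirroring the compact_db commands above:

    # Compact a single account's database across all nodes
    sup kt_compactor compact_db {ACCOUNT_ID}

    # Compact that account's MODB for a particular month
    sup kt_compactor compact_db {ACCOUNT_ID}-{YYYY}{MM}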