Unfortunately, Couchbase's restore tool failed and so did the automatic data recovery, but at least our backups worked (of course!). While waiting for the backups to be restored, we started thinking about what we could have done if the backups had also failed. That is how we came up with the idea of restoring data directly from vBuckets.
What are vBuckets anyway? Every key in the database is assigned to a vBucket, and multiple vBuckets reside on each server. In theory, a vBucket is defined as the “owner” of a subset of the key space of a Couchbase Server cluster (for more detailed information, check out this paper). In practice, vBuckets are a bunch of files that contain your data. These files reside in a directory named after the bucket inside the document path. If I list the files of one of our buckets, I get a bunch of files named “%d.couch.%d” (i.e. a number, followed by a dot, the word ‘couch’ and another number). These ‘magic’ files contain all your data.
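As an illustration, here is a minimal sketch that lists those files for one bucket. The data path below is the default location of a Debian install and the bucket name is just a placeholder, so adjust both to your setup:

#!/usr/bin/python
# Sketch: list the vBucket files of a single bucket.
# DATA_PATH is the default data directory of a Debian install and
# 'bucket' is a placeholder name -- both are assumptions.
from glob import glob
from os import path

DATA_PATH = '/opt/couchbase/var/lib/couchbase/data'
BUCKET = 'bucket'

for f in sorted(glob(path.join(DATA_PATH, BUCKET, '*.couch.*'))):
    print path.basename(f)  # e.g. 986.couch.11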
Dump the data
Unfortunately, the data is stored in a binary format inside the vBuckets. Fortunately, Couchbase provides a small utility called couch_dbdump (if you’ve installed the Debian package, its full path is /opt/couchbase/bin/couch_dbdump). Let’s look over its output (in this example, we’ve changed the actual JSON document):
$ /opt/couchbase/bin/couch_dbdump 986.couch.11
Doc seq: 1873
     id: offers::3bb6d87e0acfbd1431e33955a8c068d3ad967a8e
     rev: 1
     content_meta: 128
     cas: 10486229639083020, expiry: 0, flags: 0
     data: (snappy) {"document-type: json"}
( repeated thousands of times )
Let’s go ahead and do this for all the vBuckets on a Couchbase server:
$ for f in $BUCKET_PATH/*.couch.*; do /opt/couchbase/bin/couch_dbdump "$f" > "$DUMP_DIR/$(basename "$f")"; done
The above snippet goes through the bucket found in the $BUCKET_PATH directory and dumps the contents into multiple files in $DUMP_DIR. The result is the same number of files, with the same names as in the bucket, but in the human-readable format of couch_dbdump.
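Before moving on, a quick sanity check does not hurt. The following sketch (an addition of ours; it assumes the dump directory layout produced by the loop above) counts the document records in the dump files, so you can compare the total against the number of items the bucket used to hold:

#!/usr/bin/python
# Sketch: count the 'Doc seq:' records across all dump files.
# Pass the dump directory as the first argument.
from glob import glob
from sys import argv

total = 0
for name in glob(argv[1] + '/*couch*'):
    with open(name) as f:
        total += f.read().count('Doc seq:')

print 'document records found:', total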
Import the data
What if we were to write a script to reinsert the documents in a new (or flushed) bucket? Let’s use Python:
#!/usr/bin/python
from couchbase import Couchbase
from sys import argv
from json import loads

def process(doc, cb):
    # drop empty lines from the record
    fields = [f for f in doc.split('\n') if f.strip() != '']
    # defaults, in case a field is missing from the record
    key = None
    data = {}
    expiry = 0
    for f in fields:
        if 'id: ' in f:
            key = f.split('id: ')[1]
        if 'data: ' in f:
            try:
                data_j = f.split('data:', 1)[1]
                data_j = '{' + data_j.split('{', 1)[1]
                data = loads(data_j)
            except Exception:
                print 'Could not load data for', doc
                data = {}
        if 'expiry: ' in f:
            expiry = f.split('expiry:', 1)[1].split(',', 1)[0]
            expiry = int(expiry)
    try:
        cb.set(key = key, value = data, ttl = expiry)
    except Exception:
        print 'Could not set to CB for', doc

if __name__ == '__main__':
    f = open(argv[1])
    txt = f.read()
    # separate docs
    docs = txt.split('Doc seq:')
    # first one's always empty
    docs.pop(0)
    cb = Couchbase.connect(
        host = 'localhost',
        bucket = 'bucket',
        password = 'swordfish')
    for d in docs:
        process(d, cb)
Please note that you need to have the couchbase Python client library installed for the script to work.
The above script parses the couch_dbdump output and puts the documents into Couchbase. Important: only the key, the actual JSON document and the expiry are preserved!
Now, let’s run the above on every dumped file using the power of bash:
#!/bin/bash
for f in $DUMP_DIR/*couch*
do
    ./cb_recovery.py "$f" &
    sleep 1
done
The above snippet runs an instance of the script for each dump file; that is what the & does (if for some reason you want to do it serially and slowly, remove the &). Hopefully, after several minutes or hours, your data should be back in the DB.
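As a final touch, you can spot-check a few of the restored documents. The sketch below is our addition; the host, bucket, password and example key are placeholders taken from the snippets above, so substitute values from your own cluster and dump output:

#!/usr/bin/python
# Sketch: fetch one restored document back to verify the import.
# Connection details and the key are placeholders.
from couchbase import Couchbase

cb = Couchbase.connect(host = 'localhost', bucket = 'bucket', password = 'swordfish')

key = 'offers::3bb6d87e0acfbd1431e33955a8c068d3ad967a8e'  # example key from the dump
try:
    print key, '->', cb.get(key).value
except Exception:
    print key, 'was not restored'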