Migrating from 0.3 to 0.4
Migrating from 0.3 to 0.4 is unsupported. This document is only intended to document the process internally for the Deuxfleurs cluster where we have to do it. Do not try it yourself, you will lose your data and we will not help you.
Migrating from 0.2 to 0.4 will break everything for sure. Never try it.
The internal data format of Garage hasn't changed much between 0.3 and 0.4. The Sled database is still the same, and the data directory as well.
The following has changed, all in the meta directory:
- `node_id` in 0.3 contains the identifier of the current node. In 0.4, this file does nothing and should be deleted. It is replaced by `node_key` (the secret key) and `node_key.pub` (the associated public key). A node's identifier on the ring is its public key.
- `peer_info` in 0.3 contains the list of peers saved automatically by Garage. The format has changed and it is now stored in `peer_list` (`peer_info` should be deleted).
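
The cleanup of these two files can be scripted. The following is only a minimal sketch: the file names `node_id` and `peer_info` come from the list above, while the function names and the `dry_run` parameter are hypothetical. It defaults to a dry run so that nothing is deleted until the metadata directory has been backed up.

```python
from pathlib import Path

# File names taken from the list above; everything else is hypothetical.
OBSOLETE_FILES = ("node_id", "peer_info")


def obsolete_meta_files(meta_dir):
    """Return the 0.3-only files that still exist in meta_dir, in a stable order."""
    meta = Path(meta_dir)
    return [meta / name for name in OBSOLETE_FILES if (meta / name).is_file()]


def remove_obsolete_meta_files(meta_dir, dry_run=True):
    """Delete leftover 0.3 files; with dry_run=True, only report them."""
    found = obsolete_meta_files(meta_dir)
    for path in found:
        if not dry_run:
            path.unlink()
    return found
```

Calling `remove_obsolete_meta_files(meta_dir)` returns the files that would be removed; passing `dry_run=False` actually removes them. Files such as `node_key` and `peer_list` are never touched.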
When migrating, all node identifiers will change. This also means that the assignment of data partitions to nodes on the ring will change, and lots of data will have to be rebalanced.
- If your cluster has only 3 nodes, all nodes store everything, therefore nothing has to be rebalanced.
- If your cluster has only 4 nodes, for any partition there will always be at least 2 nodes that stored data before that still store it after. Therefore the migration should in theory be transparent and Garage should continue to work during the rebalance.
- If your cluster has 5 or more nodes, data will disappear during the migration. Do not migrate (fortunately we don't have this scenario at Deuxfleurs), or if you do, make Garage unavailable until things stabilize (disable web and api access).
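
The three cases above can be checked with a small counting argument. The sketch below assumes that every partition is stored on exactly 3 nodes (which is what "all nodes store everything" implies for a 3-node cluster) and that the new assignment can be any set of 3 nodes. It ignores zones, and the function name is hypothetical.

```python
from itertools import combinations

REPLICAS = 3  # assumption: every partition is stored on 3 nodes


def min_surviving_replicas(node_count, replicas=REPLICAS):
    """Worst-case number of nodes that hold a partition both before and
    after its replica set is reassigned to an arbitrary new set of nodes."""
    nodes = range(node_count)
    replica_sets = [set(c) for c in combinations(nodes, replicas)]
    return min(len(old & new) for old in replica_sets for new in replica_sets)


for n in (3, 4, 5, 6):
    print(n, "nodes ->", min_surviving_replicas(n), "replica(s) kept in the worst case")
```

Under these assumptions the worst case is 3, 2, 1 and 0 surviving replicas for 3, 4, 5 and 6 nodes respectively. With 5 nodes only one copy is guaranteed to stay in place, which would not be enough if reads need two replicas to answer (an assumption about the quorum that is not stated above). With 6 or more nodes a partition can move entirely to nodes that do not have it yet.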
The migration steps are as follows:
1. Prepare a new configuration file for 0.4. For each node, point to the same meta and data directories as Garage 0.3. Basically, the things that change are the following:
   - No more `rpc_tls` section
   - You have to generate a shared `rpc_secret` and put it in all config files
   - `bootstrap_peers` has a different syntax as it has to contain node keys. Leave it empty and use `garage node-id` and `garage node connect` instead (new features of 0.4)
   - Put the publicly accessible RPC address of your node in `rpc_public_addr` if possible (it's optional but recommended)
   - If you are using Consul, change the `consul_service_name` to NOT be the name advertised by Nomad. Garage is now responsible for advertising its own service.
2. Disable API and Web access for some time (Garage does not support disabling these endpoints, but you can change the port number or stop your reverse proxy, for instance).
3. Do `garage repair -a --yes tables` and `garage repair -a --yes blocks`, check the logs and check that all data seems to be synced correctly between nodes.
4. Save the output of `garage status` somewhere. We will need this to remember how to reconfigure nodes in 0.4.
5. Turn off Garage 0.3.
6. Back up the metadata folders if you can (i.e. if you have space to do it somewhere). Backing up the data folders could also be useful, but that's much harder to do. If your filesystem supports snapshots, this could be a good time to use them.
7. Turn on Garage 0.4.
8. At this point, running `garage status` should indicate that all nodes of the previous cluster are "unavailable". The nodes have new identifiers that should appear in healthy nodes once they can talk to one another (use `garage node connect` if necessary). They should have NO ROLE ASSIGNED at the moment.
9. Prepare a script with several `garage node configure` commands that replace each of the v0.3 node IDs with the corresponding v0.4 node ID, with the same zone/tag/capacity. For example, if your node `drosera` had identifier `c24e` before and now has identifier `789a`, and it was configured with capacity `2` in zone `dc1`, put the following command in your script:
   `garage node configure 789a -z dc1 -c 2 -t drosera --replace c24e`
10. Run your reconfiguration script. Check that the new output of `garage status` contains the correct node IDs with the correct values for capacity and zone. Old nodes should no longer be mentioned.
11. If your cluster has 4 nodes or fewer, and you are feeling adventurous, you can re-enable Web and API access now. Things will probably work.
12. Garage might already be resyncing stuff. Issue a `garage repair -a --yes tables` and a `garage repair -a --yes blocks` to force it to do so.
13. Wait for resyncing activity to stop in the logs. Do steps 12 and 13 two or three times, until you see that when you issue the repair commands, nothing gets resynced any longer.
14. Your upgraded cluster should be in a working state. Re-enable API and Web access and check that everything went well.
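
The reconfiguration script described above can be generated from the saved `garage status` output rather than typed by hand. The following is a sketch only: the `NodeMigration` class and the `configure_command` function are hypothetical helpers, and the only Garage flags used are those that appear in the example command above (`-z`, `-c`, `-t`, `--replace`).

```python
import shlex
from dataclasses import dataclass


@dataclass
class NodeMigration:
    old_id: str    # identifier shown by `garage status` in 0.3
    new_id: str    # identifier shown by `garage status` in 0.4
    zone: str
    capacity: int
    tag: str


def configure_command(node):
    """Build one `garage node configure` line, mirroring the example command above."""
    args = ["garage", "node", "configure", node.new_id,
            "-z", node.zone, "-c", str(node.capacity), "-t", node.tag,
            "--replace", node.old_id]
    # Quote each argument so unusual tags cannot break the shell script.
    return " ".join(shlex.quote(a) for a in args)


# One entry per node, filled in by hand from the two `garage status` outputs.
nodes = [
    NodeMigration(old_id="c24e", new_id="789a", zone="dc1", capacity=2, tag="drosera"),
]

for node in nodes:
    print(configure_command(node))
```

For the `drosera` example this prints the same command as the one given above. The printed lines can be reviewed and saved as the reconfiguration script before anything is run against the cluster.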