
This very website is hosted using Garage. In other words: the doc is the PoC!

The Garage Geo-Distributed Data Store

Garage is a lightweight geo-distributed data store. It comes from the observation that, despite the many object stores available, many people still have broken data management policies (backup/replication on a single site, or none at all). To promote better data management policies, we focused on the following desirable properties:

  • Self-contained & lightweight: works everywhere and integrates well in existing environments to target hyperconverged infrastructures.
  • Highly resilient: resilient to network failures, network latency, disk failures, and sysadmin failures.
  • Simple: simple to understand, simple to operate, simple to debug.
  • Internet enabled: made for multi-site deployments (e.g. datacenters, offices, households, etc.) interconnected through regular Internet connections.

We also noted that pursuing some other goals is detrimental to our initial goals. The following have been identified as non-goals (if these points matter to you, you should not use Garage):

  • Extreme performance: high performance places strong constraints on both the design and the infrastructure; we seek performance through minimalism only.
  • Feature extensiveness: a complete implementation of the S3 API or of any other API to make Garage a drop-in replacement is not targeted, as it could lead to decisions that impact our desirable properties.
  • Storage optimizations: erasure coding or any other coding technique increases the difficulty of placing data and synchronizing it; we limit ourselves to duplication.
  • POSIX/Filesystem compatibility: we do not aim at being POSIX compatible or at emulating any kind of filesystem. Indeed, in a distributed environment, such synchronization translates into network messages that impose severe constraints on the deployment.

Supported and planned protocols

Garage speaks (or will speak) the following protocols:

  • S3 - SUPPORTED - Enable applications to store large blobs such as pictures, video, images, documents, etc. S3 is versatile enough to also be used to publish a static website.
  • IMAP - PLANNED - getting good performance out of email storage is quite complex, which is why most IMAP servers only support on-disk storage. We plan to add logic to Garage to make it a viable solution for email storage.
  • More to come

Use Cases

Deuxfleurs: Garage is used by Deuxfleurs, a non-profit hosting organization. In particular, it hosts their main website, this documentation and some of their members' blogs. Additionally, Garage is used as a backend for Nextcloud. Deuxfleurs also plans to use Garage as the media backend of their Matrix server and as the backend of OCIS.

Are you using Garage? Open a pull request to add your organization here!

Comparison to existing software

MinIO: MinIO shares our Self-contained & lightweight goal but selected two of our non-goals: Storage optimizations through erasure coding and POSIX/Filesystem compatibility through strong consistency. However, by pursuing these two non-goals, MinIO does not reach our desirable properties. Firstly, it fails on the Simple property: due to the erasure coding, MinIO has severe limitations on how drives can be added to or removed from a cluster. Secondly, it fails on the Internet enabled property: due to its strong consistency, MinIO is latency sensitive. Furthermore, MinIO has no knowledge of "sites" and thus cannot distribute data so as to minimize the impact of losing a given site.

Openstack Swift: OpenStack Swift at least fails on the Self-contained & lightweight goal. Starting it requires around 8GB of RAM, which is too much, especially in a hyperconverged infrastructure. We also do not classify Swift as Simple.

Ceph: This review holds for the whole Ceph stack, including the RADOS paper, the Ceph Object Storage module, the RADOS Gateway, etc. At its core, Ceph has been designed to provide POSIX/Filesystem compatibility, which requires strong consistency, which in turn makes Ceph latency-sensitive and makes it fail our Internet enabled goal. Due to its industry-oriented design, Ceph is also far from being Simple to operate and from being Self-contained & lightweight, which makes it hard to integrate in a hyperconverged infrastructure. In a certain way, Ceph and MinIO are closer to each other than they are to Garage or OpenStack Swift.

More comparisons are available in our Related Work chapter.

Other Resources

This website is not the only source of information about Garage! We reference here other places on the Internet where you can learn more about Garage.

Rust API (docs.rs)

If you encounter a specific bug in Garage or plan to patch it, you may jump directly to the source code's documentation!

Talks

We love to talk and hear about Garage; that's why we keep a log here:

Did you write or talk about Garage? Open a pull request to add a link here!

Community

If you want to talk with us, you can join our Matrix channel at #garage:deuxfleurs.fr. Our code repository and issue tracker, which is where you should report bugs, is managed on Deuxfleurs' Gitea.

License

Garage's source code is released under the AGPL v3 License. Please note that if you patch Garage and then use it to provide any service over a network, you must share your code!

Funding

The Deuxfleurs association has received a grant promise to fund 3 people working on Garage for a year, from October 2021 to September 2022.


This project has received funding from the European Union’s Horizon 2020 research and innovation programme within the framework of the NGI-POINTER Project funded under grant agreement No 871528.

Quick Start

Let's start your Garage journey! In this chapter, we explain how to deploy Garage as a single-node server and how to interact with it.

Our goal is to introduce you to Garage's workflows. Following this guide is recommended before moving on to configuring a real-world deployment.

Note that this kind of deployment should not be used in production, as it provides no redundancy for your data! We will also skip intra-cluster TLS configuration, meaning that if you add nodes to your cluster, communication between them will not be secure.

Get a binary

Download the latest Garage binary from the release pages on our repository:

https://git.deuxfleurs.fr/Deuxfleurs/garage/releases

Place this binary somewhere in your $PATH so that you can invoke the garage command directly (for instance you can copy the binary in /usr/local/bin or in ~/.local/bin).

If a binary of the last version is not available for your architecture, you can build Garage from source.

Writing a first configuration file

This first configuration file should allow you to get started easily with the simplest possible Garage deployment:

metadata_dir = "/tmp/meta"
data_dir = "/tmp/data"

replication_mode = "none"

rpc_bind_addr = "[::]:3901"

bootstrap_peers = [
	"127.0.0.1:3901",
]

[s3_api]
s3_region = "garage"
api_bind_addr = "[::]:3900"

[s3_web]
bind_addr = "[::]:3902"
root_domain = ".web.garage"
index = "index.html"

Save your configuration file as garage.toml.

As you can see in the metadata_dir and data_dir parameters, we are saving Garage's data in /tmp which gets erased when your system reboots. This means that data stored on this Garage server will not be persistent. Change these to locations on your local disk if you want your data to be persisted properly.
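
For example, assuming you want to keep data under /var/lib/garage (an arbitrary choice; any persistent location you can write to works), you could create the directories like this and then point metadata_dir and data_dir at them in garage.toml:

sudo mkdir -p /var/lib/garage/meta /var/lib/garage/data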

Launching the Garage server

Use the following command to launch the Garage server with our configuration file:

RUST_LOG=garage=info garage server -c garage.toml

You can tune Garage's verbosity as follows (from less verbose to more verbose):

RUST_LOG=garage=info garage server -c garage.toml
RUST_LOG=garage=debug garage server -c garage.toml
RUST_LOG=garage=trace garage server -c garage.toml

Log level info is recommended for most use cases. Log level debug can help you check why your S3 API calls are not working.

Checking that Garage runs correctly

The garage utility is also used as a CLI tool to configure your Garage deployment. It tries to connect to a Garage server through the RPC protocol, by default looking for a Garage server at localhost:3901.

Since our deployment already binds to port 3901, the following command should be sufficient to show Garage's status:

garage status

This should show something like this:

Healthy nodes:
2a638ed6c775b69a…	linuxbox	127.0.0.1:3901	UNCONFIGURED/REMOVED

Configuring your Garage node

Configuring the nodes in a Garage deployment means informing Garage of the disk space available on each node of the cluster as well as the zone (e.g. datacenter) each machine is located in.

For our test deployment, we are using only one node. The way in which we configure it does not really matter here; you can simply write:

garage node configure -z dc1 -c 1 <node_id>

where <node_id> corresponds to the identifier of the node shown by garage status (first column). You can simply enter a prefix of that identifier. For instance, here you could write just garage node configure -z dc1 -c 1 2a63.

Creating buckets and keys

In this section, we will suppose that we want to create a bucket named nextcloud-bucket that will be accessed through a key named nextcloud-app-key.

Don't forget that the help command and the --help flag can help you anywhere: the CLI tool is self-documented! Two examples:

garage help
garage bucket allow --help

Create a bucket

Let's take an example where we want to deploy NextCloud using Garage as the main data storage.

First, create a bucket with the following command:

garage bucket create nextcloud-bucket

Check that everything went well:

garage bucket list
garage bucket info nextcloud-bucket

Create an API key

The nextcloud-bucket bucket now exists on the Garage server, however it cannot be accessed until we add an API key with the proper access rights.

Note that API keys are independent of buckets: one key can access multiple buckets, multiple keys can access one bucket.

Create an API key using the following command:

garage key new --name nextcloud-app-key

The output should look as follows:

Key name: nextcloud-app-key
Key ID: GK3515373e4c851ebaad366558
Secret key: 7d37d093435a41f2aab8f13c19ba067d9776c90215f56614adad6ece597dbb34
Authorized buckets:

Check that everything works as intended:

garage key list
garage key info nextcloud-app-key

Allow a key to access a bucket

Now that we have a bucket and a key, we need to give permissions to the key on the bucket:

garage bucket allow \
  --read \
  --write \
  nextcloud-bucket \
  --key nextcloud-app-key

You can check at any time the allowed keys on your bucket with:

garage bucket info nextcloud-bucket
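
Since keys are independent of buckets, the same key could later be granted access to additional buckets in exactly the same way. For instance (another-bucket is only an illustrative name):

garage bucket create another-bucket
garage bucket allow --read --write another-bucket --key nextcloud-app-key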

Uploading and downloading from Garage

We recommend using the MinIO Client (mc) to interact with Garage files. Instructions to install and use it are provided on the MinIO website. Before reading the following, you need a working mc command in your $PATH.

Note that on certain Linux distributions such as Arch Linux, the Minio client binary is called mcli instead of mc (to avoid name clashes with the Midnight Commander).

Configure mc

You need the access key and secret key created above. We will assume you are invoking mc on the same machine as the Garage server, so your S3 API endpoint is http://127.0.0.1:3900. You must also choose an alias name: we chose my-garage, which you will use in all subsequent commands.

Adapt the following command accordingly and run it:

mc alias set \
  my-garage \
  http://127.0.0.1:3900 \
  <access key> \
  <secret key> \
  --api S3v4

You must also add an environment variable to your configuration to inform MinIO of our region (garage by default, corresponding to the s3_region parameter in the configuration file). The best way is to add the following snippet to your $HOME/.bash_profile or $HOME/.bashrc file:

export MC_REGION=garage

Use mc

You cannot currently list buckets with mc.

But the following commands and many more should work:

mc cp image.png my-garage/nextcloud-bucket
mc cp my-garage/nextcloud-bucket/image.png .
mc ls my-garage/nextcloud-bucket
mc mirror localdir/ my-garage/another-bucket

Other tools for interacting with Garage

The following tools can also be used to send and receive files from/to Garage:

Refer to the "configuring clients" page to learn how to configure these clients to interact with a Garage server.

Cookbook

A cookbook, when you cook, is a collection of recipes. Similarly, Garage's cookbook contains a collection of recipes that are known to work well! This chapter could also be referred to as "Tutorials" or "Best practices".

  • Deploying Garage: This page will walk you through all of the necessary steps to deploy Garage in a real-world setting.

  • Configuring S3 clients: This page will explain how to configure popular S3 clients to interact with a Garage server.

  • Hosting a website: This page explains how to use Garage to host a static website.

  • Recovering from failures: Garage's first selling point is resilience to hardware failures. This section explains how to recover from such a failure in the best possible way.

  • Building from source: This page explains how to build Garage from source in case a binary is not provided for your architecture, or if you want to hack with us!

  • Starting with Systemd: This page explains how to run Garage as a Systemd service (instead of as a Docker container).

Deploying Garage on a real-world cluster

To run Garage in cluster mode, we recommend having at least 3 nodes. This will allow you to set up Garage for three-way replication of your data, the safest and most available mode proposed by Garage.

We recommend first following the quick start guide in order to get familiar with Garage's command line and usage patterns.

Prerequisites

To run a real-world deployment, make sure the following conditions are met:

  • You have at least three machines with sufficient storage space available.

  • Each machine has a public IP address which is reachable by other machines. Running behind a NAT is possible, but having several Garage nodes behind a single NAT is slightly more involved as each will have to have a different RPC port number (the local port number of a node must be the same as the port number exposed publicly by the NAT).

  • Ideally, each machine should have an SSD available in addition to the HDD you are dedicating to Garage. This will allow for faster access to metadata and has the potential to drastically reduce Garage's response times.

  • This guide will assume you are using Docker containers to deploy Garage on each node. Garage can also be run independently, for instance as a Systemd service. You can also use an orchestrator such as Nomad or Kubernetes to automatically manage Docker containers on a fleet of nodes.

Before deploying Garage on your infrastructure, you must inventory your machines. For our example, we will suppose the following infrastructure with IPv6 connectivity:

Location    Name      IP Address    Disk Space
Paris       Mercury   fc00:1::1     1 To
Paris       Venus     fc00:1::2     2 To
London      Earth     fc00:B::1     2 To
Brussels    Mars      fc00:F::1     1.5 To

Get a Docker image

Our Docker image is currently named lxpz/garage_amd64 and is stored on the Docker Hub. We encourage you to use a fixed tag (e.g. v0.3.0) and not the latest tag. For this example, we will use the latest published version at the time of writing, which is v0.3.0, but make sure to check for more recent versions on the Docker Hub.

For example:

sudo docker pull lxpz/garage_amd64:v0.3.0

Generating TLS certificates

You first need to generate TLS certificates to encrypt traffic between Garage nodes (referred to as RPC traffic).

To generate your TLS certificates, run on your machine:

wget https://git.deuxfleurs.fr/Deuxfleurs/garage/raw/branch/main/genkeys.sh
chmod +x genkeys.sh
./genkeys.sh

It will create a folder named pki/ containing the keys that you will use for the cluster. These files will have to be copied to all of your cluster nodes, as explained below.
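
For example, assuming you have SSH access as root to your nodes (the hostname mercury is a placeholder from our example table; adapt paths and permissions to your own provisioning workflow), copying the files to one node could look like:

ssh root@mercury 'mkdir -p /etc/garage/pki'
scp pki/garage-ca.crt pki/garage.crt pki/garage.key root@mercury:/etc/garage/pki/

Repeat for the other nodes of the cluster.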

Deploying and configuring Garage

On each machine, we will have a similar setup; in particular, you must consider the following folders/files:

  • /etc/garage/garage.toml: Garage daemon's configuration (see below)

  • /etc/garage/pki/: Folder containing Garage certificates, must be generated on your computer and copied to the servers. Only the files garage-ca.crt, garage.crt and garage.key are necessary.

  • /var/lib/garage/meta/: Folder containing Garage's metadata, put this folder on an SSD if possible

  • /var/lib/garage/data/: Folder containing Garage's data, this folder will be your main data storage and must be on a large storage (e.g. large HDD)

A valid /etc/garage/garage.toml for our cluster would be:

metadata_dir = "/var/lib/garage/meta"
data_dir = "/var/lib/garage/data"

replication_mode = "3"

rpc_bind_addr = "[::]:3901"

bootstrap_peers = [
  "[fc00:1::1]:3901",
  "[fc00:1::2]:3901",
  "[fc00:B::1]:3901",
  "[fc00:F::1]:3901",
]

[rpc_tls]
ca_cert = "/etc/garage/pki/garage-ca.crt"
node_cert = "/etc/garage/pki/garage.crt"
node_key = "/etc/garage/pki/garage.key"

[s3_api]
s3_region = "garage"
api_bind_addr = "[::]:3900"

[s3_web]
bind_addr = "[::]:3902"
root_domain = ".web.garage"
index = "index.html"

Please make sure to change bootstrap_peers to your IP addresses!

Check the configuration file reference documentation to learn more about all available configuration options.

Starting Garage using Docker

On each machine, you can run the daemon with:

docker run \
  -d \
  --name garaged \
  --restart always \
  --network host \
  -v /etc/garage/pki:/etc/garage/pki \
  -v /etc/garage/garage.toml:/garage/garage.toml \
  -v /var/lib/garage/meta:/var/lib/garage/meta \
  -v /var/lib/garage/data:/var/lib/garage/data \
  lxpz/garage_amd64:v0.3.0

It should be restarted automatically at each reboot. Please note that we use host networking, as otherwise Docker containers cannot communicate over IPv6.

Upgrading between Garage versions should be supported transparently, but please check the release notes before doing so! To upgrade, simply stop and remove this container and run the command again with a newer version of Garage.
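
As a sketch, an upgrade from v0.3.0 to a newer tag (written v0.x.y here as a placeholder; check the Docker Hub and the release notes for the actual tag) could look like:

sudo docker pull lxpz/garage_amd64:v0.x.y
sudo docker stop garaged
sudo docker rm garaged
sudo docker run \
  -d \
  --name garaged \
  --restart always \
  --network host \
  -v /etc/garage/pki:/etc/garage/pki \
  -v /etc/garage/garage.toml:/garage/garage.toml \
  -v /var/lib/garage/meta:/var/lib/garage/meta \
  -v /var/lib/garage/data:/var/lib/garage/data \
  lxpz/garage_amd64:v0.x.y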

Controlling the daemon

The garage binary has two purposes:

  • it acts as a daemon when launched with garage server ...
  • it acts as a control tool for the daemon when launched with any other command

In this section, we will see how to use the garage binary as a control tool for the daemon we just started. You first need to get a shell having access to this binary. For instance, enter the Docker container with:

sudo docker exec -ti garaged bash

You will now have a shell where the Garage binary is available as /garage/garage.

You can also install the binary on your machine to remotely control the cluster.

Talk to the daemon and create an alias

garage requires 4 options to talk with the daemon:

--ca-cert <ca-cert>            
--client-cert <client-cert>    
--client-key <client-key>      
-h, --rpc-host <rpc-host>

The first three are the certificates and keys needed for TLS; the last one is simply the address of Garage's RPC endpoint.

If you are invoking garage from a server node directly, you do not need to set --rpc-host as the default value 127.0.0.1:3901 will allow it to contact Garage correctly.
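
For example, if you have installed the binary on your own machine and copied the three PKI files there, a remote invocation could look like the following sketch (Mercury's RPC address is taken from our example table; adapt it to your cluster):

garage \
  --ca-cert pki/garage-ca.crt \
  --client-cert pki/garage.crt \
  --client-key pki/garage.key \
  -h '[fc00:1::1]:3901' \
  status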

To avoid typing the first three options each time we want to run a command, you can use the following alias:

alias garagectl='/garage/garage \
  --ca-cert /etc/garage/pki/garage-ca.crt \
  --client-cert /etc/garage/pki/garage.crt \
  --client-key /etc/garage/pki/garage.key'

You can now use all of the commands presented in the quick start guide, simply replacing occurrences of garage with garagectl.

Test the alias

You can test your alias by running a simple command such as:

garagectl status

You should get something like this as a result:

Healthy nodes:
8781c50c410a41b3…	Mercury	[fc00:1::1]:3901	UNCONFIGURED/REMOVED
2a638ed6c775b69a…	Venus	[fc00:1::2]:3901	UNCONFIGURED/REMOVED
68143d720f20c89d…	Earth	[fc00:B::1]:3901	UNCONFIGURED/REMOVED
212f7572f0c89da9…	Mars	[fc00:F::1]:3901	UNCONFIGURED/REMOVED

Configuring a cluster

We will now inform Garage of the disk space available on each node of the cluster as well as the zone (e.g. datacenter) in which each machine is located.

For our example, we will suppose we have the following infrastructure (Capacity, Identifier and Zone are Garage-specific values described below):

Location    Name      Disk Space    Capacity    Identifier    Zone
Paris       Mercury   1 To          2           8781c5        par1
Paris       Venus     2 To          4           2a638e        par1
London      Earth     2 To          4           68143d        lon1
Brussels    Mars      1.5 To        3           212f75        bru1

Node identifiers

After its first launch, Garage generates a random and unique identifier for each node, such as:

8781c50c410a41b363167e9d49cc468b6b9e4449b6577b64f15a249a149bdcbc

Often a shorter form can be used, containing only the beginning of the identifier, like 8781c5, which identifies the server "Mercury" located in "Paris" according to our previous table.

The simplest way to match an identifier to a node is to run:

garagectl status

It will display the IP address associated with each node; from the IP address you will be able to recognize the node.

Zones

A zone is simply a user-chosen identifier for a group of servers that are logically grouped together. It is up to the system administrator deploying Garage to decide what "grouped together" means.

In most cases, a zone will correspond to a geographical location (i.e. a datacenter). Behind the scenes, Garage uses the zone definitions to try to store copies of the same data in different zones, in order to provide high availability despite the failure of a zone.

Capacity

Garage reasons about disk storage using an abstract metric called the capacity of a node. The capacity configured in Garage must be proportional to the disk space dedicated to the node. Due to the way the Garage allocation algorithm works, capacity values must be integers, and should be kept as small as possible, for instance with 1 representing the size of your smallest server.

Here we chose that 1 unit of capacity = 0.5 To, so that we can express servers of size 1 To and 2 To, as well as the intermediate size 1.5 To, with the integer values 2, 4 and 3 respectively (see table above).

Note that the amount of data stored by Garage on each server may not be strictly proportional to its capacity value, as Garage will prioritize having 3 copies of data in different zones, even if this means that capacities will not be strictly respected. For example, in the table above, nodes Earth and Mars will each always store a copy of everything, and the third copy will have a 66% chance of being stored by Venus and a 33% chance of being stored by Mercury.

Injecting the topology

Given the information above, we will configure our cluster as follows:

garagectl node configure -z par1 -c 2 -t mercury 8781c5
garagectl node configure -z par1 -c 4 -t venus 2a638e
garagectl node configure -z lon1 -c 4 -t earth 68143d
garagectl node configure -z bru1 -c 3 -t mars 212f75

Using your Garage cluster

Creating buckets and managing keys is done using the garagectl CLI, and is covered in the quick start guide. Remember also that the CLI is self-documented thanks to the --help flag and the help subcommand (e.g. garage help, garage key --help).

Configuring an S3 client to interact with Garage is covered in the next section.

Configuring S3 clients to interact with Garage

To configure an S3 client to interact with Garage, you will need the following parameters:

  • An API endpoint: this corresponds to the HTTP or HTTPS address used to contact the Garage server. When running Garage locally this will usually be http://127.0.0.1:3900. In a real-world setting, you would usually have a reverse proxy that adds TLS support and makes your Garage server available under a public hostname such as https://garage.example.com.

  • An API access key and its associated secret key. These usually look something like this: GK3515373e4c851ebaad366558 (access key), 7d37d093435a41f2aab8f13c19ba067d9776c90215f56614adad6ece597dbb34 (secret key). These keys are created and managed using the garage CLI, as explained in the quick start guide.

Most S3 clients can be configured easily with these parameters, provided that you follow these guidelines:

  • Force path style: Garage does not support DNS-style buckets, which are now the default on Amazon S3. Instead, Garage uses the legacy path-style bucket addressing. Remember to configure your client to acknowledge this fact.

  • Configuring the S3 region: Garage requires your client to talk to the correct "S3 region", which is set in the configuration file. This is often set just to garage. If this is not configured explicitly, clients usually try to talk to region us-east-1. Garage should normally redirect your client to the correct region, but in case your client does not support this you might have to configure it manually.

We will now provide example configurations for the most common S3 clients.

AWS CLI

Export the following environment variables:

export AWS_ACCESS_KEY_ID=<access key>
export AWS_SECRET_ACCESS_KEY=<secret key>
export AWS_DEFAULT_REGION=<region>

Now invoke aws as follows:

aws --endpoint-url <endpoint> s3 <command...>

For instance: aws --endpoint-url http://127.0.0.1:3900 s3 ls s3://my-bucket/.
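
Other aws s3 subcommands work the same way once the endpoint is specified, for example (bucket and file names are only illustrative):

aws --endpoint-url http://127.0.0.1:3900 s3 cp image.png s3://nextcloud-bucket/
aws --endpoint-url http://127.0.0.1:3900 s3 ls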

Minio client

Use the following command to set an "alias", i.e. define a new S3 server to be used by the Minio client:

mc alias set \
  garage \
  <endpoint> \
  <access key> \
  <secret key> \
  --api S3v4

Remember that mc is sometimes called mcli (such as on Arch Linux), to avoid conflicts with the Midnight Commander.

rclone

rclone can be configured using the interactive assistant invoked by running rclone config.

You can also configure rclone by writing its configuration file directly. Here is a template rclone.ini configuration file:

[garage]
type = s3
provider = Other
env_auth = false
access_key_id = <access key>
secret_access_key = <secret key>
region = <region>
endpoint = <endpoint>
force_path_style = true
acl = private
bucket_acl = private

Cyberduck

TODO

s3cmd

Here is a template for the s3cmd.cfg file to talk with Garage:

[default]
access_key = <access key>
secret_key = <secret key>
host_base = <endpoint without http(s)://>
host_bucket = <same as host_base>
use_https = False | True

Hosting a website

TODO

Recovering from failures

Garage is meant to work on old, second-hand hardware. In particular, this makes it likely that some of your drives will fail, and some manual intervention will be needed. Fear not! For Garage is fully equipped to handle drive failures, in most common cases.

A note on availability of Garage

With nodes dispersed in 3 zones or more, here are the guarantees Garage provides with the 3-way replication strategy (3 copies of all data, which is the recommended replication mode):

  • The cluster remains fully functional as long as the machines that fail are in only one zone. This includes a whole zone going down due to power/Internet outage.
  • No data is lost as long as the machines that fail are in at most two zones.

Of course this only works if your Garage nodes are correctly configured to be aware of the zone in which they are located. Make sure this is the case using garage status to check on the state of your cluster's configuration.

In case of temporarily disconnected nodes, Garage should automatically re-synchronize when the nodes come back up. This guide will deal with recovering from disk failures that caused the loss of the data of a node.

First option: removing a node

If you don't have spare parts (HDD, SSD) to replace the failed component, and if there are enough remaining nodes in your cluster (at least 3), you can simply remove the failed node from Garage's configuration. Note that if you do intend to replace the failed parts by new ones, using this method followed by adding back the node is not recommended (although it should work), and you should instead use one of the methods detailed in the next sections.

Removing a node is done with the following command:

garage node remove --yes <node_id>

(you can get the node_id of the failed node by running garage status)

This will repartition the data and ensure that 3 copies of everything are present on the nodes that remain available.

Replacement scenario 1: only data is lost, metadata is fine

The recommended deployment for Garage uses an SSD to store metadata, and an HDD to store blocks of data. In the case where only a single HDD crashes, the blocks of data are lost but the metadata is still fine.

Recovering from this is very easy: simply set up a new HDD to replace the failed one. The node does not need to be fully replaced and the configuration doesn't need to change. We just need to tell Garage to get back all the data blocks and store them on the new HDD.

First, set up a new HDD to store Garage's data directory on the failed node, and restart Garage using the existing configuration. Then, run:

garage repair -a --yes blocks

This will re-synchronize blocks of data that are missing to the new HDD, reading them from copies located on other nodes.

You can check the progress of this process by running the following command:

garage stats -a

Look out for the following output:

Block manager stats:
  resync queue length: 26541

This indicates that one of the Garage nodes is in the process of retrieving missing data from other nodes. The number decreases to zero when the node is fully synchronized.
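
If you want to follow the progress over time, one simple approach (assuming the watch utility is available where you run the CLI) is:

watch -n 60 garage stats -a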

Replacement scenario 2: metadata (and possibly data) is lost

This scenario covers the case where a full node fails, i.e. both the metadata directory and the data directory are lost, as well as the case where only the metadata directory is lost.

To replace the lost node, we will start from an empty metadata directory, which means Garage will generate a new node ID for the replacement node. We will thus need to remove the previous node ID from Garage's configuration and replace it by the ID of the new node.

If your data directory is stored on a separate drive and is still fine, you can keep it, but it is not necessary to do so. In all cases, the data will be rebalanced and the replacement node will not store the same pieces of data as were originally stored on the one that failed. So if you keep the data files, the rebalancing might be faster but most of the pieces will be deleted anyway from the disk and replaced by other ones.

First, set up a new drive to store the metadata directory for the replacement node (an SSD is recommended), and for the data directory if necessary. You can then start Garage on the new node. The restarted node should generate a new node ID, and it should be shown as NOT CONFIGURED in garage status. The ID of the lost node should be shown in garage status in the section for disconnected/unavailable nodes.

Then, replace the broken node by the new one, using:

garage node configure --replace <old_node_id> \
		-c <capacity> -z <zone> -t <node_tag> <new_node_id>

Garage will then start synchronizing all required data on the new node. This process can be monitored using the garage stats -a command.

Compiling Garage from source

Garage is a standard Rust project. First, you need rust and cargo. For instance on Debian:

sudo apt-get update
sudo apt-get install -y rustc cargo

You can also use Rustup to set up a Rust toolchain easily.

Using source from crates.io

Garage's source code is published on crates.io, Rust's official package repository. This means you can simply ask cargo to download and build this source code for you:

cargo install garage

That's all; garage should now be in $HOME/.cargo/bin.

You can add this folder to your $PATH or copy the binary somewhere else on your system. For instance:

sudo cp $HOME/.cargo/bin/garage /usr/local/bin/garage

Using source from the Gitea repository

The primary location for Garage's source code is the Gitea repository.

Clone the repository and build Garage with the following commands:

git clone https://git.deuxfleurs.fr/Deuxfleurs/garage.git
cd garage
cargo build

Be careful, as this will make a debug build of Garage, which will be extremely slow! To make a release build, invoke cargo build --release (this takes much longer).

The binaries built this way are found in target/{debug,release}/garage.

Starting Garage with systemd instead of Docker

NOTE: This guide is incomplete. Typically, you would also want to create a separate Unix user to run Garage.

Make sure you have the Garage binary installed on your system (see quick start), e.g. at /usr/local/bin/garage.

Create a file named /etc/systemd/system/garage.service:

[Unit]
Description=Garage Data Store
After=network-online.target
Wants=network-online.target

[Service]
Environment='RUST_LOG=garage=info' 'RUST_BACKTRACE=1'
ExecStart=/usr/local/bin/garage server -c /etc/garage/garage.toml

[Install]
WantedBy=multi-user.target

To start the service then automatically enable it at boot:

sudo systemctl start garage
sudo systemctl enable garage

To see if the service is running and to browse its logs:

sudo systemctl status garage
sudo journalctl -u garage

If you want to modify the service file, do not forget to run systemctl daemon-reload to inform systemd of your modifications.
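
For example, after editing the service file:

sudo systemctl daemon-reload
sudo systemctl restart garage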

Reference Manual

A reference manual contains extensive descriptions of the features and the behaviour of the software. Reading this chapter is recommended once you have a good knowledge/understanding of Garage. It will be useful if you want to tune it or to use it in some exotic conditions.

Garage configuration file format reference

Here is an example garage.toml configuration file that illustrates all of the possible options:

metadata_dir = "/var/lib/garage/meta"
data_dir = "/var/lib/garage/data"

block_size = 1048576

replication_mode = "3"

rpc_bind_addr = "[::]:3901"

bootstrap_peers = [
  "[fc00:1::1]:3901",
  "[fc00:1::2]:3901",
  "[fc00:B::1]:3901",
  "[fc00:F::1]:3901",
]

consul_host = "consul.service"
consul_service_name = "garage-daemon"

max_concurrent_rpc_requests = 12

sled_cache_capacity = 134217728
sled_flush_every_ms = 2000

[rpc_tls]
ca_cert = "/etc/garage/pki/garage-ca.crt"
node_cert = "/etc/garage/pki/garage.crt"
node_key = "/etc/garage/pki/garage.key"

[s3_api]
s3_region = "garage"
api_bind_addr = "[::]:3900"

[s3_web]
bind_addr = "[::]:3902"
root_domain = ".web.garage"
index = "index.html"

The following gives details about each available configuration option.

Available configuration options

metadata_dir

The directory in which Garage will store its metadata. This contains the node identifier, the network configuration and the peer list, the list of buckets and keys, as well as the index of all objects, object versions and object blocks.

Store this folder on a fast SSD drive if possible to maximize Garage's performance.

data_dir

The directory in which Garage will store the data blocks of objects. This folder can be placed on an HDD. The space available for data_dir should be counted to determine a node's capacity when configuring it.

block_size

Garage splits stored objects into consecutive chunks of size block_size (except the last one, which might be smaller). The default size is 1MB and should work in most cases. If you are interested in tuning this, feel free to do so (and remember to report your findings to us!)

replication_mode

Garage supports the following replication modes:

  • none or 1: data stored on Garage is stored on a single node. There is no redundancy, and data will be unavailable as soon as one node fails or its network is disconnected. Do not use this for anything other than test deployments.

  • 2: data stored on Garage will be stored on two different nodes, if possible in different zones. Garage tolerates one node failure before losing data. Data should be available read-only when one node is down, but write operations will fail. Use this only if you really have to.

  • 3: data stored on Garage will be stored on three different nodes, if possible each in a different zone. Garage tolerates two node failures before losing data. Data should remain available read-only when two nodes are down, and writes should be possible if only a single node is down.

Note that in modes 2 and 3, if at least that many zones are available, an arbitrary number of failures in any given zone is tolerated, as copies of the data will be spread over several zones.

Make sure replication_mode is the same in the configuration files of all nodes. Never run a Garage cluster where that is not the case.

Changing the replication_mode of a cluster might work (make sure to shut down all nodes and change it everywhere at the same time), but is not officially supported.

rpc_bind_addr

The address and port on which to bind for inter-cluster communications (referred to as RPC, for remote procedure calls). The port specified here should be the same one that other nodes will use to contact the node, even in the case of a NAT: the NAT should be configured to forward the external port number to the same internal port number. This means that if you have several nodes running behind a NAT, they should each use a different RPC port number.

bootstrap_peers

A list of IPs and ports on which to contact other Garage peers of this cluster. This should correspond to the RPC ports set up with rpc_bind_addr.

consul_host and consul_service_name

Garage supports discovering other nodes of the cluster using Consul. This works only when nodes are announced in Consul by an orchestrator such as Nomad, as Garage is not able to announce itself.

The consul_host parameter should be set to the hostname of the Consul server, and consul_service_name should be set to the service name under which Garage's RPC ports are announced.

max_concurrent_rpc_requests

Garage implements rate limiting for RPC requests: no more than max_concurrent_rpc_requests concurrent outbound RPC requests will be made by a Garage node (additional requests will be put in a waiting queue).

sled_cache_capacity

This parameter can be used to tune the capacity of the cache used by sled, the database Garage uses internally to store metadata. Tune this to fit the RAM you wish to make available to your Garage instance. More cache means faster Garage, but the default value (128MB) should be plenty for most use cases.

sled_flush_every_ms

This parameter can be used to tune the flushing interval of sled. Increase this if sled is thrashing your SSD, at the risk of losing more data in case of a power outage (though this should not matter much, as data is replicated on other nodes). The default value, 2000ms, should be appropriate for most use cases.

The [rpc_tls] section

This section should be used to configure the TLS certificates used to encrypt intra-cluster traffic (RPC traffic). The following parameters should be set:

  • ca_cert: the certificate of the CA that is allowed to sign individual node certificates
  • node_cert: the node certificate for the current node
  • node_key: the key associated with the node certificate

Note that several nodes may use the same node certificate, as long as it is signed by the CA.

If this section is absent, TLS is not used to encrypt intra-cluster traffic.

The [s3_api] section

api_bind_addr

The IP and port on which to bind for accepting S3 API calls. This endpoint does not support TLS: a reverse proxy should be used to provide it.

s3_region

Garage will accept S3 API calls that are targeted to the S3 region defined here. API calls targeted to other regions will fail with an AuthorizationHeaderMalformed error message that redirects the client to the correct region.

The [s3_web] section

Garage can publish the content of buckets as websites. This section configures the behaviour of this module.

bind_addr

The IP and port on which to bind for accepting HTTP requests to buckets configured for website access. This endpoint does not support TLS: a reverse proxy should be used to provide it.

root_domain

The optional suffix appended to bucket names to form the corresponding HTTP Host.

For instance, if root_domain is web.garage.eu, a bucket called deuxfleurs.fr will be accessible either with hostname deuxfleurs.fr.web.garage.eu or with hostname deuxfleurs.fr.
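
As a quick sanity check (a sketch only, assuming the [s3_web] endpoint from the example configuration listening on port 3902 and a bucket already configured for website access), you could query the web endpoint directly while setting the Host header:

curl -H 'Host: deuxfleurs.fr.web.garage.eu' http://localhost:3902/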

index

The name of the index file to return for requests ending with / (usually index.html).

Garage CLI

The Garage CLI is mostly self-documented. Make use of the help subcommand and the --help flag to discover all available options.

S3 Compatibility status

Global S3 features

Implemented:

  • path-style URLs (garage.tld/bucket/key)
  • putting and getting objects in buckets
  • multipart uploads
  • listing objects
  • access control on a per-key-per-bucket basis

Not implemented:

  • vhost-style URLs (bucket.garage.tld/key)
  • object-level ACL
  • object versioning
  • encryption
  • most x-amz- headers

Endpoint implementation

All API endpoints that are not mentioned here are not implemented and will return a 400 Bad Request.

AbortMultipartUpload

Implemented.

CompleteMultipartUpload

Implemented badly. Garage will not check that all the parts stored correspond to the list given by the client in the request body. This means that the multipart upload might be completed with an invalid size. This is a bug and will be fixed.

CopyObject

Implemented.

CreateBucket

Garage does not accept creating buckets or giving access using API calls; this has to be done using the CLI tools. CreateBucket will return a 200 if the bucket exists and the user has write access, and a 403 Forbidden in all other cases.

CreateMultipartUpload

Implemented.

DeleteBucket

Garage does not accept deleting buckets using API calls; this has to be done using the CLI tools. This request will return a 403 Forbidden.

DeleteObject

Implemented.

DeleteObjects

Implemented.

GetBucketLocation

Implemented.

GetBucketVersioning

Stub implementation (Garage does not yet support versioning, so this always returns "versioning not enabled").

GetObject

Implemented.

HeadBucket

Implemented.

HeadObject

Implemented.

ListBuckets

Implemented.

ListObjects

Implemented, but there isn't a very good specification of what encoding-type=url covers, so there might be some encoding bugs. In our implementation, the URL-encoded fields are the same in ListObjects as they are in ListObjectsV2.

ListObjectsV2

Implemented.

PutObject

Implemented.

UploadPart

Implemented.

Design

The design section helps you see Garage from a "big picture" perspective. It will allow you to understand whether Garage is a good fit for you, how to use it better, how to contribute to it, what Garage can and cannot do, etc.

Related Work

Context

Data storage is critical: doing it badly can lead to data loss on hardware failure. Filesystems + RAID can help on a single machine, but a machine failure can take the whole storage offline. Moreover, this puts a hard limit on scalability. Often this limit can be pushed far back by buying expensive machines, but here we consider non-specialized, off-the-shelf machines that can be as low-powered and failure-prone as a Raspberry Pi.

Distributed storage may help to solve both availability and scalability problems on these machines. Many solutions have been proposed; they can be categorized as block storage, file storage and object storage depending on the abstraction they provide.

Overview

Block storage is the lowest-level abstraction: it is like exposing your raw hard drive over the network. It requires very low latency and a stable, often dedicated, network. However, it provides disk devices that the operating system can manipulate with the fewest constraints: they can be partitioned with any filesystem, meaning that even the most exotic features are supported. We can cite iSCSI or Fibre Channel. OpenStack Cinder proxies the previous solutions to provide a uniform API.

File storage provides a higher abstraction: these systems are one filesystem among others, which means they don't necessarily have all the exotic features of every filesystem. Often, they relax some POSIX constraints while many applications remain compatible without any modification. As an example, we are able to run MariaDB (very slowly) over GlusterFS... We can also mention CephFS (read the RADOS whitepaper), Lustre, LizardFS, MooseFS, etc. OpenStack Manila proxies the previous solutions to provide a uniform API.

Finally, object storage provides the highest-level abstraction. It is the testimony that the POSIX filesystem API is not adapted to distributed filesystems. In particular, strong consistency has been dropped in favor of eventual consistency, which is way more convenient and powerful in the presence of high latency and unreliability. S3, which pioneered the concept, is often described as a filesystem for the WAN. Applications must be adapted to work with the desired object storage service. Today, the S3 HTTP REST API acts as a de facto standard in the industry. However, Amazon S3's source code is not open, and alternatives have been proposed. We identified MinIO, Pithos, Swift and Ceph. MinIO/Ceph enforce a total order, which gives them properties similar to a (relaxed) filesystem. Swift and Pithos are probably the most similar to AWS S3, with their consistent hashing ring. However, Pithos is not maintained anymore; more precisely, the company that published Pithos version 1 has developed a version 2 but has not open-sourced it. Some tests conducted by the ACIDES project have shown that OpenStack Swift consumes way more resources (CPU+RAM) than we can afford. Furthermore, the people developing Swift have not designed their software for geo-distribution.

There have been many attempts in research too; we can mention LBFS, which was used as a basis for Seafile. But none of them have been effectively implemented yet.

Existing software

Pithos: Pithos has been abandoned and should probably not be used; in the following we explain why we did not pick its design. Pithos relied on being an S3 proxy in front of Cassandra (and also worked with ScyllaDB). According to its designers, storing data in Cassandra showed limitations that justified abandoning the project. They built a closed-source version 2 that does not store blobs in the database (only metadata), but did not communicate further about it. We considered their v2 design but concluded that it fits neither our Self-contained & lightweight nor our Simple properties: it makes development, deployment and operations more complicated while reducing flexibility.

IPFS : Not written yet

Specific research papers

Not yet written

WARNING: this documentation is more a "design draft", which was written before Garage's actual implementation. The general principle is similar but details have not yet been updated.

Modules

  • membership/: configuration, membership management (gossip of node's presence and status), ring generation --> what about Serf (used by Consul/Nomad) : https://www.serf.io/? Seems a huge library with many features so maybe overkill/hard to integrate
  • metadata/: metadata management
  • blocks/: block management, writing, GC and rebalancing
  • internal/: server to server communication (HTTP server and client that reuses connections, TLS if we want, etc)
  • api/: S3 API
  • web/: web management interface

Metadata tables

Objects:

  • Hash key: Bucket name (string)
  • Sort key: Object key (string)
  • Sort key: Version timestamp (int)
  • Sort key: Version UUID (string)
  • Complete: bool
  • Inline: bool, true for objects < threshold (say 1024)
  • Object size (int)
  • Mime type (string)
  • Data for inlined objects (blob)
  • Hash of first block otherwise (string)

Having only a hash key on the bucket name will lead to storing all file entries of this table for a specific bucket on a single node. At the same time, it is the only way I see to be able to rapidly list all bucket entries...

Blocks:

  • Hash key: Version UUID (string)
  • Sort key: Offset of block in total file (int)
  • Hash of data block (string)

A version is defined by the existence of at least one entry in the blocks table for a certain version UUID. We must keep the following invariant: if a version exists in the blocks table, it has to be referenced in the objects table. We explicitly manage concurrent versions of an object: the version timestamp and version UUID columns are index columns, thus we may have several concurrent versions of an object. Important: before deleting an older version from the objects table, we must make sure that we did a successful delete of the blocks of that version from the blocks table.

Thus, the workflow for reading an object is as follows:

  1. Check permissions (LDAP)
  2. Read entry in object table. If data is inline, we have its data, stop here. -> if several versions, take newest one and launch deletion of old ones in background
  3. Read first block from cluster. If size <= 1 block, stop here.
  4. Simultaneously with previous step, if size > 1 block: query the Blocks table for the IDs of the next blocks
  5. Read subsequent blocks from cluster

Workflow for PUT:

  1. Check write permission (LDAP)
  2. Select a new version UUID
  3. Write a preliminary entry for the new version in the objects table with complete = false
  4. Send blocks to cluster and write entries in the blocks table
  5. Update the version with complete = true and all of the accurate information (size, etc)
  6. Return success to the user
  7. Launch a background job to check and delete older versions

Workflow for DELETE:

  1. Check write permission (LDAP)
  2. Get current version (or versions) in object table
  3. Do the deletion of those versions NOT IN A BACKGROUND JOB THIS TIME
  4. Return success to the user if we were able to delete blocks from the blocks table and entries from the object table

To delete a version:

  1. List the blocks from Cassandra
  2. For each block, delete it from cluster. Don't care if some deletions fail, we can do GC.
  3. Delete all of the blocks from the blocks table
  4. Finally, delete the version from the objects table

Known issue: if someone is reading from a version that we want to delete and the object is big, the read might be interrupted. I think it is ok to leave it like this, we just cut the connection if data disappears during a read.

("Soit P un problème, on s'en fout est une solution à ce problème")

Block storage on disk

Blocks themselves:

  • file path = /blobs/(first 3 hex digits of hash)/(rest of hash)

Reverse index for GC & other block-level metadata:

  • file path = /meta/(first 3 hex digits of hash)/(rest of hash)
  • map block hash -> set of version UUIDs where it is referenced

Useful metadata:

  • list of versions that reference this block in the Cassandra table, so that we can do GC by checking in Cassandra that the lines still exist
  • list of other nodes that we know have acknowledged a write of this block, useful in the rebalancing algorithm
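
As an illustration of the path scheme above, both paths can be derived from a block hash with a small shell sketch like the following (the hash value is only an example):

HASH=8781c50c410a41b363167e9d49cc468b6b9e4449b6577b64f15a249a149bdcbc
echo "/blobs/${HASH:0:3}/${HASH:3}"   # where the block data itself lives
echo "/meta/${HASH:0:3}/${HASH:3}"    # where its block-level metadata lives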

Write strategy: have a single thread that does all write IO so that it is serialized (or have several threads that manage independent parts of the hash space). When writing a blob, write it to a temporary file, close, then rename so that a concurrent read gets a consistent result (either not found or found with whole content).
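
A minimal shell sketch of this write-then-rename pattern (purely illustrative, not the actual implementation; HASH and block_data are placeholders):

dir="/blobs/${HASH:0:3}"
mkdir -p "$dir"
tmp="$(mktemp "$dir/.tmp.XXXXXX")"   # temporary file on the same filesystem
cp block_data "$tmp"                 # write the full content first
mv "$tmp" "$dir/${HASH:3}"           # atomic rename: readers see all or nothing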

Read strategy: the only read operation is get(hash) that returns either the data or not found (can do a corruption check as well and return corrupted state if it is the case). Can be done concurrently with writes.

Internal API:

  • get(block hash) -> ok+data/not found/corrupted
  • put(block hash & data, version uuid + offset) -> ok/error
  • put with no data(block hash, version uuid + offset) -> ok/not found plz send data/error
  • delete(block hash, version uuid + offset) -> ok/error

GC: when last ref is deleted, delete block. Long GC procedure: check in Cassandra that version UUIDs still exist and references this block.

Rebalancing: takes as argument the list of newly added nodes.

  • List all blocks that we have. For each block:
  • If it hits a newly introduced node, send it to them. Use put with no data first to check whether it actually needs to be sent or not. Use a random listing order to avoid race conditions (they do no harm but we might have two nodes sending the same thing at the same time, thus wasting time).
  • If it doesn't hit us anymore, delete it and its reference list.

Only one rebalancing can be running at the same time. It can be restarted from the beginning with new parameters.

Membership management

Two sets of nodes:

  • set of nodes from which a ping was recently received, with status: number of stored blocks, request counters, error counters, GC%, rebalancing% (eviction from this set after say 30 seconds without ping)
  • set of nodes that are part of the system, explicitly modified by the operator using the web UI (persisted to disk), is a CRDT using a version number for the value of the whole set

Thus, three states for nodes:

  • healthy: in both sets
  • missing: not pingable but part of desired cluster
  • unused/draining: currently present but not part of the desired cluster; empty if it contains nothing, draining if it still contains some blocks

Membership messages between nodes:

  • ping with current state + hash of current membership info -> reply with same info
  • send&get back membership info (the ids of nodes that are in the two sets): used when no local membership change in a long time and membership info hash discrepancy detected with first message (passive membership fixing with full CRDT gossip)
  • inform of newly pingable node(s) -> no result, when receive new info repeat to all (reliable broadcast)
  • inform of operator membership change -> no result, when receive new info repeat to all (reliable broadcast)

Ring: generated from the desired set of nodes, however when doing reads/writes on the ring, skip nodes that are known to be not pingable. The tokens are generated in a deterministic fashion from node IDs (hash of node id + token number from 1 to K). Number K of tokens per node: decided by the operator & stored in the operator's list of nodes CRDT. Default value proposal: with the node status information, also broadcast total disk size and free space, and propose a default number of tokens equal to 80% × free space / 10 GB. (This is all user interface.)

Constants

  • Block size: around 1MB? --> Exoscale uses 16MB chunks
  • Number of tokens in the hash ring: one every 10Gb of allocated storage
  • Threshold for storing data directly in the Cassandra objects table: 1 KB (maybe up to 4 KB?)
  • Ping timeout (time after which a node is registered as unresponsive/missing): 30 seconds
  • Ping interval: 10 seconds
  • ??

Development

Now that you are a Garage expert, you want to enhance it, you are in the right place! We discuss here how to hack on Garage, how we manage its development, etc.

Setup your development environment

We propose the following quickstart to set up a full dev environment as quickly as possible:

  1. Set up a Rust/Cargo environment, e.g. dnf install rust cargo
  2. Install awscli v2 by following the guide here.
  3. Run cargo build to build the project
  4. Run ./script/dev-cluster.sh to launch a test cluster (feel free to read the script)
  5. Run ./script/dev-configure.sh to configure your test cluster with default values (same datacenter, 100 tokens)
  6. Run ./script/dev-bucket.sh to create a bucket named eprouvette and an API key that will be stored in /tmp/garage.s3
  7. Run source ./script/dev-env-aws.sh to configure your CLI environment
  8. You can use garage to manage the cluster. Try garage --help.
  9. You can use the awsgrg alias to add, remove, and delete files. Try awsgrg help, awsgrg cp /proc/cpuinfo s3://eprouvette/cpuinfo.txt, or awsgrg ls s3://eprouvette. awsgrg is a wrapper on the aws s3 command pre-configured with the previously generated API key (the one in /tmp/garage.s3) and localhost as the endpoint.

Now you should be ready to start hacking on garage!

Working Documents

Working documents are documents that reflect the fact that Garage is a piece of software that evolves quickly. They are a way to communicate our ideas, our changes, and so on before or while we are implementing them in Garage. If you like to live on the edge, they can also serve as documentation of the next features to be released.

Ideally, once the feature/patch has been merged, the working document should serve as a source to update the rest of the documentation and then be removed.

Load Balancing Data (planned for version 0.2)

I have conducted a quick study of different methods to load-balance data over different Garage nodes using consistent hashing.

Requirements

  • good balancing: two nodes that have the same announced capacity should receive close to the same number of items

  • multi-datacenter: the replicas of a partition should be distributed over as many datacenters as possible

  • minimal disruption: when adding or removing a node, as few partitions as possible should have to move around

  • order-agnostic: the same set of nodes (each associated with a datacenter name and a capacity) should always return the same distribution of partition replicas, independently of the order in which nodes were added/removed (this is to keep the implementation simple)

Methods

Naive multi-DC ring walking strategy

This strategy can be used with any ring-like algorithm to make it aware of the multi-datacenter requirement:

In this method, the ring is a list of positions, each associated with a single node in the cluster. Partitions contain all the keys between two consecutive items of the ring. To find the nodes that store replicas of a given partition:

  • select the node for the position of the partition's lower bound
  • go clockwise on the ring, skipping nodes that:
    • we have already selected
    • are in a datacenter of a node we have selected, except if we already have nodes from all possible datacenters

In this way the selected nodes will always be distributed over min(n_datacenters, n_replicas) different datacenters, which is the best we can do.
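
A minimal Python sketch of this walk follows, with illustrative names (the real implementation differs; we also assume the ring contains enough distinct nodes to satisfy n_replicas):

def walk_ring(ring: list, start_idx: int, n_replicas: int, n_datacenters: int) -> list:
    # ring: list of (node, datacenter) entries sorted by ring position;
    # start_idx: index of the partition's lower bound on the ring.
    selected, selected_dcs = [], set()
    # Walk at most two full turns: a second turn lets us pick up nodes that were
    # skipped for datacenter reasons before all datacenters were covered.
    for step in range(2 * len(ring)):
        node, dc = ring[(start_idx + step) % len(ring)]
        if node in selected:
            continue                 # skip nodes we have already selected
        if dc in selected_dcs and len(selected_dcs) < n_datacenters:
            continue                 # skip DCs we already cover, while other DCs remain
        selected.append(node)
        selected_dcs.add(dc)
        if len(selected) == n_replicas:
            break
    return selected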

This method was implemented in the first version of Garage, with the basic ring construction from Dynamo that consists of associating n_token random positions to each node (I know it's not optimal, the Dynamo paper already studies this).

Better rings

The ring construction that selects n_token random positions for each node gives a ring of positions that is not well balanced: the space between the tokens varies a lot, and some partitions are thus bigger than others. This problem was demonstrated in the original Dynamo paper.

To solve this, we want to apply a better, second method for partitioning our dataset:

  1. fix an initially large number of partitions (say 1024) with evenly-spaced delimiters,

  2. attribute each partition randomly to a node, with a probability proportional to its capacity (which is what n_tokens represented in the first method)

For now we continue using the multi-DC ring walking described above.

I have studied two deterministic ways to do the attribution of partitions to nodes (a sketch of the first follows the list):

  • Min-hash: for each partition, select node that minimizes hash(node, partition_number)
  • MagLev: see here
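
Here is a minimal sketch of the min-hash variant, under the assumption that capacity weighting is done by giving each node one candidate identity per unit of capacity; this is illustrative and may differ from what scripts/simulate_ring.py actually does:

import hashlib

def minhash_attribution(nodes: dict, n_partitions: int = 1024) -> list:
    # nodes: node name -> integer capacity. A node with capacity c gets c
    # candidate identities, so it wins a partition with probability ~ c / total.
    candidates = [(name, i) for name, cap in nodes.items() for i in range(cap)]
    def h(cand, part):
        name, i = cand
        return hashlib.sha256(f"{name}:{i}:{part}".encode()).digest()
    # Partition p goes to the candidate minimizing hash(candidate, p)
    return [min(candidates, key=lambda c: h(c, part))[0]
            for part in range(n_partitions)]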

MagLev provided significantly better balancing, as it guarantees that the exact same number of partitions is attributed to all nodes that have the same capacity (and that this number is proportional to the node's capacity, except for large values); however, in both cases:

  • the distribution is still bad, because we use the naive multi-DC ring walking that behaves strangely due to interactions between consecutive positions on the ring

  • the disruption in case of adding/removing a node is not as low as it can be, as we show with the following method.

A quick description of MagLev (backend = node, lookup table = ring):

The basic idea of Maglev hashing is to assign a preference list of all the lookup table positions to each backend. Then all the backends take turns filling their most-preferred table positions that are still empty, until the lookup table is completely filled in. Hence, Maglev hashing gives an almost equal share of the lookup table to each of the backends. Heterogeneous backend weights can be achieved by altering the relative frequency of the backends’ turns…
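
The quoted description translates into a fairly short table-population loop. The sketch below follows the MagLev paper's offset/skip construction, with the hash functions simplified to SHA-256, and ignores capacity weighting (which would be obtained by giving some backends more turns):

import hashlib

def maglev_table(backends: list, table_size: int) -> list:
    # table_size should be prime so that each preference list is a permutation.
    def h(name, salt):
        return int.from_bytes(hashlib.sha256(f"{salt}:{name}".encode()).digest(), "big")
    # Each backend's preference list: positions offset, offset+skip, offset+2*skip, ...
    prefs = {}
    for b in backends:
        offset = h(b, "offset") % table_size
        skip = h(b, "skip") % (table_size - 1) + 1
        prefs[b] = [(offset + j * skip) % table_size for j in range(table_size)]
    table = [None] * table_size
    nexts = {b: 0 for b in backends}
    filled = 0
    while filled < table_size:
        for b in backends:                 # backends take turns...
            pos = prefs[b][nexts[b]]
            nexts[b] += 1
            while table[pos] is not None:  # ...claiming their most-preferred free slot
                pos = prefs[b][nexts[b]]
                nexts[b] += 1
            table[pos] = b
            filled += 1
            if filled == table_size:
                break
    return table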

Here are some stats (run scripts/simulate_ring.py to reproduce):

##### Custom-ring (min-hash) #####

#partitions per node (capacity in parenthesis):
- datura (8) :  227
- digitale (8) :  351
- drosera (8) :  259
- geant (16) :  476
- gipsie (16) :  410
- io (16) :  495
- isou (8) :  231
- mini (4) :  149
- mixi (4) :  188
- modi (4) :  127
- moxi (4) :  159

Variance of load distribution for load normalized to intra-class mean
(a class being the set of nodes with the same announced capacity): 2.18%     <-- REALLY BAD

Disruption when removing nodes (partitions moved on 0/1/2/3 nodes):
removing atuin digitale : 63.09% 30.18% 6.64% 0.10%
removing atuin drosera : 72.36% 23.44% 4.10% 0.10%
removing atuin datura : 73.24% 21.48% 5.18% 0.10%
removing jupiter io : 48.34% 38.48% 12.30% 0.88%
removing jupiter isou : 74.12% 19.73% 6.05% 0.10%
removing grog mini : 84.47% 12.40% 2.93% 0.20%
removing grog mixi : 80.76% 16.60% 2.64% 0.00%
removing grog moxi : 83.59% 14.06% 2.34% 0.00%
removing grog modi : 87.01% 11.43% 1.46% 0.10%
removing grisou geant : 48.24% 37.40% 13.67% 0.68%
removing grisou gipsie : 53.03% 33.59% 13.09% 0.29%
on average:  69.84% 23.53% 6.40% 0.23%                  <-- COULD BE BETTER

--------

##### MagLev #####

#partitions per node:
- datura (8) :  273
- digitale (8) :  256
- drosera (8) :  267
- geant (16) :  452
- gipsie (16) :  427
- io (16) :  483
- isou (8) :  272
- mini (4) :  184
- mixi (4) :  160
- modi (4) :  144
- moxi (4) :  154

Variance of load distribution: 0.37%                <-- Already much better, but not optimal

Disruption when removing nodes (partitions moved on 0/1/2/3 nodes):
removing atuin digitale : 62.60% 29.20% 7.91% 0.29%
removing atuin drosera : 65.92% 26.56% 7.23% 0.29%
removing atuin datura : 63.96% 27.83% 7.71% 0.49%
removing jupiter io : 44.63% 40.33% 14.06% 0.98%
removing jupiter isou : 63.38% 27.25% 8.98% 0.39%
removing grog mini : 72.46% 21.00% 6.35% 0.20%
removing grog mixi : 72.95% 22.46% 4.39% 0.20%
removing grog moxi : 74.22% 20.61% 4.98% 0.20%
removing grog modi : 75.98% 18.36% 5.27% 0.39%
removing grisou geant : 46.97% 36.62% 15.04% 1.37%
removing grisou gipsie : 49.22% 36.52% 12.79% 1.46%
on average:  62.94% 27.89% 8.61% 0.57%                  <-- WORSE THAN PREVIOUSLY

The magical solution: multi-DC aware MagLev

Suppose we want to select three replicas for each partition (this is what we do in our simulation and in most Garage deployments). We apply MagLev three times consecutively, once for each replica selection. The first time is pretty much the same as normal MagLev, but on the following passes, when a node runs through its preference list to select a partition to replicate, we skip partitions for which adding this node would not bring datacenter diversity. More precisely, we skip a partition in the preference list if:

  • the node already replicates the partition (from one of the previous rounds of MagLev)
  • the node is in a datacenter where a node already replicates the partition and there are other datacenters available

Refer to method4 in the simulation script for a formal definition.
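
As a rough sketch (the script's method4 remains the authoritative definition), the skip test could look like this, where replicas[p] holds the (node, datacenter) pairs already chosen for partition p in previous rounds; all names here are illustrative:

def must_skip(node: str, dc: str, p: int, replicas: dict, all_dcs: set) -> bool:
    # The two skip conditions listed above.
    chosen_nodes = {n for n, _ in replicas[p]}
    chosen_dcs = {d for _, d in replicas[p]}
    if node in chosen_nodes:
        return True   # the node already replicates this partition
    if dc in chosen_dcs and len(chosen_dcs) < len(all_dcs):
        return True   # its DC is already covered and other DCs are still available
    return False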

##### Multi-DC aware MagLev #####

#partitions per node:
- datura (8) :  268                 <-- NODES WITH THE SAME CAPACITY
- digitale (8) :  267                   HAVE THE SAME NUM OF PARTITIONS
- drosera (8) :  267                    (+- 1)
- geant (16) :  470
- gipsie (16) :  472
- io (16) :  516
- isou (8) :  268
- mini (4) :  136
- mixi (4) :  136
- modi (4) :  136
- moxi (4) :  136

Variance of load distribution: 0.06%                <-- CAN'T DO BETTER THAN THIS

Disruption when removing nodes (partitions moved on 0/1/2/3 nodes):
removing atuin digitale : 65.72% 33.01% 1.27% 0.00%
removing atuin drosera : 64.65% 33.89% 1.37% 0.10%
removing atuin datura : 66.11% 32.62% 1.27% 0.00%
removing jupiter io : 42.97% 53.42% 3.61% 0.00%
removing jupiter isou : 66.11% 32.32% 1.56% 0.00%
removing grog mini : 80.47% 18.85% 0.68% 0.00%
removing grog mixi : 80.27% 18.85% 0.88% 0.00%
removing grog moxi : 80.18% 19.04% 0.78% 0.00%
removing grog modi : 79.69% 19.92% 0.39% 0.00%
removing grisou geant : 44.63% 52.15% 3.22% 0.00%
removing grisou gipsie : 43.55% 52.54% 3.91% 0.00%
on average:  64.94% 33.33% 1.72% 0.01%                  <-- VERY GOOD (VERY LOW VALUES FOR 2 AND 3 NODES)