Since its 2.0 release in September 2020, Patroni can rely on the pysyncobj module to use a pure-Python Raft implementation as its DCS (Distributed Configuration Store).
In this post, we will set up a demo cluster to illustrate that feature.
Installation
For this demo, we will install 3 PostgreSQL nodes in Streaming Replication, running on Rocky Linux 8.
If you’re familiar with Vagrant, here’s a simple Vagrantfile to spin up the 3 virtual machines:
# Vagrantfile
Vagrant.configure(2) do |config|
  config.vm.box = 'rockylinux/8'

  config.vm.provider 'libvirt' do |lv|
    lv.cpus = 1
    lv.memory = 1024
  end

  config.vm.synced_folder ".", "/vagrant", disabled: true

  nodes = 'srv1', 'srv2', 'srv3'
  nodes.each do |node|
    config.vm.define node do |conf|
      conf.vm.hostname = node
    end
  end

  config.vm.provision "shell", inline: <<-SHELL
    #-----------------------------
    sudo dnf install -y bind-utils
    #-----------------------------
  SHELL
end
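Assuming the vagrant-libvirt provider is available on your workstation, bring the machines up and connect to them with the usual commands:
$ vagrant up
$ vagrant ssh srv1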
PostgreSQL
First of all, let’s install PostgreSQL on all the nodes:
$ sudo dnf install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-8-x86_64/pgdg-redhat-repo-latest.noarch.rpm
$ sudo dnf -qy module disable postgresql
$ sudo dnf install -y postgresql14-server postgresql14-contrib
$ sudo systemctl disable postgresql-14
Patroni will bootstrap (create) the initial PostgreSQL cluster and will be in charge of starting the service, so make sure the PostgreSQL service is disabled in systemd.
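You can double-check that the unit is indeed disabled:
$ systemctl is-enabled postgresql-14
disabled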
Watchdog
Watchdog devices are software or hardware mechanisms that reset the whole system when they do not receive a keepalive heartbeat within a specified timeframe. This adds an additional fail-safe layer in case the usual Patroni split-brain protection mechanisms fail.
Patroni will be the component interacting with the watchdog device. Set the permissions of the software watchdog:
$ cat <<EOF | sudo tee /etc/udev/rules.d/99-watchdog.rules
KERNEL=="watchdog", OWNER="postgres", GROUP="postgres"
EOF
$ sudo sh -c 'echo "softdog" >> /etc/modules-load.d/softdog.conf'
$ sudo modprobe softdog
$ sudo chown postgres: /dev/watchdog
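You can then verify that the module is loaded and that the device now belongs to postgres:
$ lsmod | grep softdog
$ ls -l /dev/watchdog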
Patroni
Install Patroni and its Raft dependencies:
$ sudo dnf install -y python39
$ sudo -iu postgres pip3 install --user --upgrade pip
$ sudo -iu postgres pip3 install --user setuptools_rust
$ sudo -iu postgres pip3 install --user 'psycopg[binary]>=3.0.0'
$ sudo -iu postgres pip3 install --user 'patroni[raft]'
Remark: since version 2.1.2 (December 2021), Patroni supports psycopg 3.
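To check which versions ended up installed for the postgres user, pip can be queried directly:
$ sudo -iu postgres pip3 show psycopg patroni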
Since we installed Patroni for the postgres user, let’s add its location to that user’s PATH:
$ sudo -u postgres sh -c 'echo "export PATH=\"/var/lib/pgsql/.local/bin:\$PATH\"" >> ~/.bash_profile'
$ sudo -iu postgres patroni --version
$ sudo -iu postgres syncobj_admin --help
Create the data directory for Raft:
$ sudo mkdir /var/lib/raft
$ sudo chown postgres: /var/lib/raft
Patroni configuration
We will need to define the list of Patroni nodes participating in the Raft consensus cluster.
To fetch it dynamically, you can use this simple shell script (where srv1, srv2 and srv3 are the 3 Patroni hosts):
# Fetch the IP addresses of the other Patroni hosts
MY_IP=$(hostname -I | awk '{print $1}')
patroni_nodes=( srv1 srv2 srv3 )
i=0
for node in "${patroni_nodes[@]}"
do
    i=$((i+1))
    target_ip=$(dig +short "$node")
    if [[ "$target_ip" = "$MY_IP" ]]; then
        continue
    fi
    target_array[$i]="'$target_ip:5010'"
done
RAFT_PARTNER_ADDRS=$(printf ",%s" "${target_array[@]}")
export RAFT_PARTNER_ADDRS="[${RAFT_PARTNER_ADDRS:1}]"
echo "partner_addrs: $RAFT_PARTNER_ADDRS"
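On srv1, with the IP addresses used later in this demo, the script would print something like:
partner_addrs: ['192.168.121.12:5010','192.168.121.194:5010']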
Let us now define the Patroni configuration in /etc/patroni.yml:
$ CLUSTER_NAME="demo-cluster-1"
$ MY_NAME=$(hostname --short)
$ MY_IP=$(hostname -I | awk ' {print $1}')
$ cat <<EOF | sudo tee /etc/patroni.yml
scope: $CLUSTER_NAME
namespace: /db/
name: $MY_NAME

restapi:
  listen: "0.0.0.0:8008"
  connect_address: "$MY_IP:8008"
  authentication:
    username: patroni
    password: mySuperSecretPassword

raft:
  data_dir: /var/lib/raft
  self_addr: "$MY_IP:5010"
  partner_addrs: $RAFT_PARTNER_ADDRS

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: false
      use_slots: true
      parameters:
        archive_mode: "on"
        archive_command: "/bin/true"
  initdb:
  - encoding: UTF8
  - data-checksums
  - auth-local: peer
  - auth-host: scram-sha-256
  pg_hba:
  - host replication replicator 0.0.0.0/0 scram-sha-256
  - host all all 0.0.0.0/0 scram-sha-256
  # Some additional users which need to be created after initializing the new cluster
  users:
    admin:
      password: admin%
      options:
        - createrole
        - createdb

postgresql:
  listen: "0.0.0.0:5432"
  connect_address: "$MY_IP:5432"
  data_dir: /var/lib/pgsql/14/data
  bin_dir: /usr/pgsql-14/bin
  pgpass: /tmp/pgpass0
  authentication:
    replication:
      username: replicator
      password: confidential
    superuser:
      username: postgres
      password: my-super-password
    rewind:
      username: rewind_user
      password: rewind_password
  parameters:
    unix_socket_directories: '/var/run/postgresql,/tmp'

watchdog:
  mode: required
  device: /dev/watchdog
  safety_margin: 5

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
EOF
Except for $MY_IP, $MY_NAME and $RAFT_PARTNER_ADDRS, which are related to the local host, the patroni.yml configuration should be the same on all Patroni nodes.
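Before starting anything, the configuration file can be checked for obvious mistakes with the built-in validator available in recent Patroni versions:
$ sudo -iu postgres patroni --validate-config /etc/patroni.yml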
Depending on the Patroni installation source, create the systemd file if not done during the installation and start the Patroni service:
$ cat <<EOF | sudo tee /etc/systemd/system/patroni.service
[Unit]
Description=Runners to orchestrate a high-availability PostgreSQL
After=syslog.target network.target
[Service]
Type=simple
User=postgres
Group=postgres
ExecStart=python3 /var/lib/pgsql/.local/bin/patroni /etc/patroni.yml
ExecReload=/bin/kill -s HUP \$MAINPID
KillMode=process
TimeoutSec=30
Restart=no
[Install]
WantedBy=multi-user.target
EOF
$ sudo systemctl daemon-reload
$ sudo systemctl enable patroni
$ sudo systemctl start patroni
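Check that the service started properly, and keep an eye on its logs:
$ sudo systemctl status patroni
$ sudo journalctl -u patroni -f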
To check the Raft cluster status, use the syncobj_admin
command:
$ sudo -iu postgres syncobj_admin -conn localhost:5010 -status
To list the members of the cluster, use the patronictl
command:
$ sudo -iu postgres patronictl -c /etc/patroni.yml topology
+--------+-----------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: demo-cluster-1 (7117556723320621508) ----+-----------+
| srv2 | 192.168.121.12 | Leader | running | 1 | |
| + srv1 | 192.168.121.126 | Replica | running | 1 | 0 |
| + srv3 | 192.168.121.194 | Replica | running | 1 | 0 |
+--------+-----------------+---------+---------+----+-----------+
Database connection
Instead of connecting directly to the database server, it is possible to set up HAProxy so that the application connects to the proxy, which then forwards the request to PostgreSQL. When HAProxy is used for this, it is also possible to route read-only requests to one or more replicas for load balancing. HAProxy can be installed as an independent server, but it can also be installed on the application server or on the database server itself.
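Here is a minimal HAProxy sketch relying on the Patroni REST API health checks (/primary returns 200 only on the leader, /replica only on healthy replicas); the frontend ports and timings are arbitrary and should be adapted to your context:
# /etc/haproxy/haproxy.cfg (excerpt)
listen postgres_primary
    bind *:5000
    mode tcp
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server srv1 192.168.121.126:5432 check port 8008
    server srv2 192.168.121.12:5432 check port 8008
    server srv3 192.168.121.194:5432 check port 8008

listen postgres_replicas
    bind *:5001
    mode tcp
    balance roundrobin
    option httpchk GET /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server srv1 192.168.121.126:5432 check port 8008
    server srv2 192.168.121.12:5432 check port 8008
    server srv3 192.168.121.194:5432 check port 8008
With this setup, the application would point read-write connections to port 5000 and read-only connections to port 5001.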
Another possibility is to use PostgreSQL client libraries like libpq and JDBC, which support client connection fail-over. The connection string lists multiple servers (e.g. host=srv1,srv2,srv3) and the client library loops over the available hosts to find a connection that is available and capable of read-write or read-only operations. This capability allows clients to follow the primary cluster during a switchover.
Example:
$ psql "host=srv1,srv2,srv3 dbname=postgres user=admin target_session_attrs=read-write" -c "SELECT pg_is_in_recovery();"
pg_is_in_recovery
-------------------
f
(1 row)
$ psql "host=srv1,srv2,srv3 dbname=postgres user=admin target_session_attrs=read-only" -c "SELECT pg_is_in_recovery();"
pg_is_in_recovery
-------------------
t
(1 row)
$ psql "host=srv1,srv2,srv3 dbname=postgres user=admin target_session_attrs=read-only" -c "\conninfo"
You are connected to database "postgres" as user "admin" on host "srv2" (address "192.168.121.36") at port "5432".
Automatic failover test
By default, Patroni will set up the watchdog to expire 5 seconds before the TTL expires. With the default setup of loop_wait=10 and ttl=30, this gives the HA loop at least 15 seconds (ttl - safety_margin - loop_wait) to complete before the system gets forcefully reset. By default, accessing the DCS is configured to time out after 10 seconds. This means that when the DCS is unavailable, for example due to network issues, Patroni and PostgreSQL will have at least 5 seconds (ttl - safety_margin - loop_wait - retry_timeout) to come to a state where all client connections are terminated.
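To make the arithmetic explicit with the values used in this demo (ttl=30, loop_wait=10, retry_timeout=10, safety_margin=5):
# watchdog timeout       = ttl - safety_margin                             = 30 - 5           = 25 s
# HA loop budget         = ttl - safety_margin - loop_wait                 = 30 - 5 - 10      = 15 s
# time to kill sessions  = ttl - safety_margin - loop_wait - retry_timeout = 30 - 5 - 10 - 10 =  5 s
These numbers match the "25 second timeout, timing slack 15 seconds" message visible in the log below.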
Simply run pgbench on the leader node and disconnect the VM network interface for a few seconds to notice that a failover may happen very (too?) quickly!
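For instance, a possible pgbench run on the leader could look like this (scale factor and duration are arbitrary):
$ sudo -iu postgres /usr/pgsql-14/bin/pgbench -i -s 100 postgres
$ sudo -iu postgres /usr/pgsql-14/bin/pgbench -c 4 -j 2 -T 600 postgres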
Example of the Patroni log on srv1 during such a failover:
09:22:45,326 INFO: no action. I am (srv1), a secondary, and following a leader (srv2)
09:22:55,333 INFO: no action. I am (srv1), a secondary, and following a leader (srv2)
09:23:05,355 INFO: Got response from srv3 http://192.168.121.194:8008/patroni: {"state": "running", ...}
09:23:07,268 WARNING: Request failed to srv2: GET http://192.168.121.12:8008/patroni (...)
09:23:07,280 INFO: Software Watchdog activated with 25 second timeout, timing slack 15 seconds
09:23:07,319 INFO: promoted self to leader by acquiring session lock
09:23:07 srv1 python3[27101]: server promoting
09:23:07,340 INFO: cleared rewind state after becoming the leader
09:23:08,760 INFO: no action. I am (srv1), the leader with the lock
When the network interface comes back up on srv2, if it had received additional data before the failover, replication might be broken:
FATAL: could not start WAL streaming: ERROR: requested starting point 0/D9000000 on timeline 1 is not in this server's history
DETAIL: This server's history forked from timeline 1 at 0/CDBDA4A8.
LOG: new timeline 2 forked off current database system timeline 1 before current recovery point 0/D9683DC8
Since we didn’t configure Patroni to use pg_rewind, the replication lag of the broken standby might grow very quickly (see the note after the listing below for a way to enable pg_rewind):
$ sudo -iu postgres patronictl -c /etc/patroni.yml list
+--------+-----------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: demo-cluster-1 (7117556723320621508) ----+-----------+
| srv1 | 192.168.121.126 | Leader | running | 2 | |
| srv2 | 192.168.121.12 | Replica | running | 1 | 169 |
| srv3 | 192.168.121.194 | Replica | running | 2 | 0 |
+--------+-----------------+---------+---------+----+-----------+
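If you want a diverged former leader to be able to rejoin automatically, pg_rewind could be enabled in the DCS configuration, for instance with patronictl edit-config (a sketch; pg_rewind needs data checksums, which we enabled at initdb time, or wal_log_hints):
$ sudo -iu postgres patronictl -c /etc/patroni.yml edit-config
# then, in the editor, under the postgresql section:
# postgresql:
#   use_pg_rewind: true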
Fortunately, we defined maximum_lag_on_failover, which prevents a failover (or switchover) to the lagging standby:
$ sudo -iu postgres patronictl -c /etc/patroni.yml switchover --candidate srv2 --force
Current cluster topology
+--------+-----------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: demo-cluster-1 (7117556723320621508) ----+-----------+
| srv1 | 192.168.121.126 | Leader | running | 2 | |
| srv2 | 192.168.121.12 | Replica | running | 1 | 1046 |
| srv3 | 192.168.121.194 | Replica | running | 2 | 0 |
+--------+-----------------+---------+---------+----+-----------+
Switchover failed, details: 503, Switchover failed
From Patroni logs:
INFO: Member srv2 exceeds maximum replication lag
WARNING: manual failover: no healthy members found, failover is not possible
We have to reinitialize the failing standby:
$ sudo -iu postgres patronictl -c /etc/patroni.yml reinit demo-cluster-1 srv2
+--------+-----------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: demo-cluster-1 (7117556723320621508) ----+-----------+
| srv1 | 192.168.121.126 | Leader | running | 2 | |
| srv2 | 192.168.121.12 | Replica | running | 1 | 1575 |
| srv3 | 192.168.121.194 | Replica | running | 2 | 0 |
+--------+-----------------+---------+---------+----+-----------+
Are you sure you want to reinitialize members srv2? [y/N]: y
Success: reinitialize for member srv2
From Patroni logs:
INFO: Removing data directory: /var/lib/pgsql/14/data
INFO: Lock owner: srv1; I am srv2
INFO: reinitialize in progress
...
The reinit step is by default performed using pg_basebackup without the fast checkpoint mode. So, depending on the checkpoint configuration and database size, it may take a lot of time.
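Depending on your needs, options can reportedly be passed to pg_basebackup through the postgresql section of patroni.yml, for example to request a fast checkpoint (a sketch, to be checked against the Patroni documentation for your version):
postgresql:
  basebackup:
    checkpoint: 'fast'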
Conclusion
It is very important to understand the parameters affecting the automatic failover kick-off, the consequences of a switchover/failover, and even the impact of not using pg_rewind.
As usual, testing your own configuration is important, and once in production, what’s even more important is to have a good monitoring and alerting system!