Channel: Percona Database Performance Blog

How To Scale a Single-Host PostgreSQL Database With Citus


When it comes to Citus, successfully building out and scaling a PostgreSQL cluster across multiple nodes, and even across data centers, can feel at times like an art form because there are so many ways of building it out.

There’s an axiom that I think aptly applies to this situation describing the differences between science and art:

– Science: 1 problem -> 1 solution
– Art: 1 problem -> 1,000 solutions

While the two previous Citus blogs that I wrote covered the fairly straightforward concepts of creating columnar tables and leveraging data redundancy, this blog explores architectural design considerations when using this freely downloadable and fully featured PostgreSQL extension.

Rather than listing the concepts, function calls, etc., available in Citus, which frankly is a bit boring, I'm going to explore scaling out a database system starting with a single host. I won't cover all the features, but I will show just enough to make you want to explore what you can accomplish yourself.

About the cluster

Citus PostgreSQL

Following a step-by-step process, the objective is to create a four-node cluster consisting of:

  • PostgreSQL version 15
  • Citus extension (I’ll be using version 11, but there are newer ones available.)
  • PostgreSQL Cluster
    • One coordinator node
      • citus-coord-01
    • Three worker nodes
      • citus1
      • citus2
      • citus3
  • Hardware (AWS instance)
    • Ubuntu Server 20.04, SSD volume type
    • 64-bit (x86)
    • c5.xlarge: 4 vCPU, 8GB RAM
    • Storage: 80GB EBS root volume, gp2 (IOPS 240/3000)

High availability will also be integrated, guaranteeing cluster viability in the case that one worker node goes down.

Leveraging pgbench, a benchmarking utility that comes bundled with PostgreSQL, I will put the cluster through its paces by executing a series of DML operations. Using its default TPC-B-like benchmark, one can stress-test a database of any size, ranging from a few clients to thousands, simulated against a system sized into the terabytes if need be.
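For context, each default TPC-B-like pgbench transaction is a short read-write sequence along these lines (paraphrased from the pgbench documentation; :aid, :bid, :tid, and :delta are variables pgbench randomizes per transaction):

```sql
-- the default pgbench TPC-B-like transaction (simplified)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
    VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;
```

This workload matters for Citus: the accounts update is routed by the distribution key, which is why the choice of distribution column below is significant.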

Steps

Provisioning

The first step is to provision the four nodes with both PostgreSQL and Citus. Refer here or here for getting and installing the extension.

-- on each postgres node, enable the extension
-- (a server restart is required afterward)
alter system set shared_preload_libraries = 'citus';
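Once each server has been restarted, a quick sanity check in psql confirms that the library is preloaded; the output should include citus:

```sql
-- run after the restart; the setting should report 'citus'
show shared_preload_libraries;
```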

On each of the four nodes, create the database with the Citus extension:

-- using psql
create database pgbench;
\c pgbench
create extension citus;

One worker node

Beginning with a single worker node, log onto citus-coord-01, declaring the coordinator host, and starting with the single node, citus1:

-- execute on citus-coord-01
    psql pgbench <<_eof1_
        select citus_set_coordinator_host('citus-coord-01', 5432);
        select citus_add_node('citus1', 5432);
_eof1_

While still logged into database pgbench, get a list of the worker nodes, which at this time is just the one:

select * from master_get_active_worker_nodes();

node_name | node_port
-----------+-----------
 citus1    |      5432

Now, things get a little interesting: the normal initialization of pgbench is to create and populate the tables with the appropriate constraints. In this case, however, only the tables are initialized, without data or constraints:

-- execute the following command on citus-coord-01
pgbench -iI t pgbench

pgbench=# \d+
                                  List of relations
 Schema |       Name       | Type  |  Owner   | Persistence | Access method |  Size
--------+------------------+-------+----------+-------------+---------------+---------
 public | citus_tables     | view  | postgres | permanent   |               | 0 bytes
 public | pgbench_accounts | table | postgres | permanent   | heap          | 0 bytes
 public | pgbench_branches | table | postgres | permanent   | heap          | 0 bytes
 public | pgbench_history  | table | postgres | permanent   | heap          | 0 bytes
 public | pgbench_tellers  | table | postgres | permanent   | heap          | 0 bytes

This next set of commands, executed on the coordinator node, distributes the pgbench tables across node citus1:

psql pgbench <<_eof1_
    BEGIN;
        select create_distributed_table('pgbench_history', 'aid');
        select create_distributed_table('pgbench_accounts', 'aid');
        select create_distributed_table('pgbench_branches', 'bid');
        select create_distributed_table('pgbench_tellers', 'tid');
    COMMIT;
_eof1_
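Note that pgbench_accounts and pgbench_history are both distributed on aid, so their matching shards are colocated; joins on the distribution key can then be pushed down and executed shard-locally. A quick sketch of such a colocated join (a hypothetical query for illustration, not part of the benchmark):

```sql
-- accounts and history share distribution column 'aid', so this join
-- can be executed within each colocated shard pair rather than
-- shuffling rows between workers
SELECT count(*)
FROM pgbench_accounts a
JOIN pgbench_history h ON a.aid = h.aid;
```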

And now, we can begin populating the tables. The following invocation generates almost 4GB of data. Notice I still haven’t added any constraints:

pgbench -iI g -s 300 pgbench

This query lists the distributed tables along with their shard counts. The default number of shards generated for a given table is 32.

select * from citus_tables;

table_name    | citus_table_type | distribution_column | colocation_id | table_size | shard_count 
------------------+------------------+---------------------+---------------+------------+-------------
 pgbench_accounts | distributed      | aid                 |             1 | 3844 MB    |      32 
 pgbench_branches | distributed      | bid                 |             1 | 256 kB     |      32 
 pgbench_history  | distributed      | aid                 |             1 | 0 bytes    |      32 
 pgbench_tellers  | distributed      | tid                 |             1 | 256 kB     |      32
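The shard count is governed by the citus.shard_count setting and, like the replication factor discussed later, applies only to tables distributed after it is changed. A hedged sketch of raising it (64 is an arbitrary illustrative value):

```sql
-- applies only to subsequently created distributed tables;
-- existing tables keep their current shard count
alter system set citus.shard_count = 64;
select pg_reload_conf();
```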

You’ll also notice that only some of the shards have been populated with data, which is normal when initializing pgbench with data:

select * from citus_shards;

table_name    | shardid |       shard_name        | citus_table_type | colocation_id | nodename | nodeport | shard_size
------------------+---------+-------------------------+------------------+---------------+----------+----------+------------
 pgbench_accounts |  102040 | pgbench_accounts_102040 | distributed      |             1 | citus1   |     5432 |  125853696
 pgbench_accounts |  102041 | pgbench_accounts_102041 | distributed      |             1 | citus1   |     5432 |  126173184
 pgbench_accounts |  102042 | pgbench_accounts_102042 | distributed      |             1 | citus1   |     5432 |  125739008
 pgbench_accounts |  102043 | pgbench_accounts_102043 | distributed      |             1 | citus1   |     5432 |  125968384
.
.
 pgbench_branches |  102072 | pgbench_branches_102072 | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_branches |  102073 | pgbench_branches_102073 | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_branches |  102074 | pgbench_branches_102074 | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_branches |  102075 | pgbench_branches_102075 | distributed      |             1 | citus1   |     5432 |       8192
.
.
 pgbench_history  |  102008 | pgbench_history_102008  | distributed      |             1 | citus1   |     5432 |          0
 pgbench_history  |  102009 | pgbench_history_102009  | distributed      |             1 | citus1   |     5432 |          0
 pgbench_history  |  102010 | pgbench_history_102010  | distributed      |             1 | citus1   |     5432 |          0
 pgbench_history  |  102011 | pgbench_history_102011  | distributed      |             1 | citus1   |     5432 |          0
.
.
 pgbench_tellers  |  102104 | pgbench_tellers_102104  | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_tellers  |  102105 | pgbench_tellers_102105  | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_tellers  |  102106 | pgbench_tellers_102106  | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_tellers  |  102107 | pgbench_tellers_102107  | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_tellers  |  102108 | pgbench_tellers_102108  | distributed      |             1 | citus1   |     5432 |       8192
.
.

And now, execute the benchmark:

-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench

The results are not pretty.

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 1524
number of failed transactions: 0 (0.000%)
latency average = 791.374 ms
latency stddev = 157.710 ms
initial connection time = 15.704 ms
tps = 25.197427 (without initial connection time)

ATTENTION: Refer to the bottom of this blog, where the various benchmark runs are tabulated as this cluster evolves.

In order to speed up the benchmark, indexes must be added. But this standard pgbench method of adding indexes fails:

postgres@citus-coord-01:~$ pgbench -iI p pgbench
creating primary keys...
pgbench: error: query failed: ERROR:  cannot create constraint without a name on a distributed table
pgbench: detail: Query was: alter table pgbench_branches add primary key (bid)

Successfully adding indexes and constraints requires them to be explicitly named and created in the following fashion. Note that table pgbench_history requires a REPLICA IDENTITY because it doesn't have a primary key:

# THIS WORKS!
psql pgbench <<_eof1_
--
-- indexes and constraints must be explicitly named
--
BEGIN;
create unique index pgbench_accounts_pk on pgbench_accounts(aid);
create unique index pgbench_branches_pk on pgbench_branches(bid);
create unique index pgbench_tellers_pk on pgbench_tellers(tid);

alter table pgbench_accounts
   add constraint pk_accounts primary key using index pgbench_accounts_pk;

alter table pgbench_branches
   add constraint pk_branches primary key using index pgbench_branches_pk;

alter table pgbench_tellers
   add constraint pk_tellers primary key using index pgbench_tellers_pk;

\qecho adding REPLICA IDENTITY (no PK present)
alter table pgbench_history replica identity full;
COMMIT;
_eof1_

Repeating the benchmark run yields a much better result:

-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench

scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 135273
number of failed transactions: 0 (0.000%)
latency average = 8.865 ms
latency stddev = 9.452 ms
initial connection time = 15.556 ms
tps = 2254.544852 (without initial connection time)

Two worker nodes

Adding a second worker node is straightforward, requiring only two steps:

  • Adding the node.
  • Rebalancing the shards across the two nodes.

Execute this on the coordinator node, citus-coord-01:

psql pgbench <<_eof1_
        --
        -- it's understood that database pgbench already exists on citus2
        --  and extension citus has been created
        --
        \qecho adding node citus2 ...
        select citus_add_node('citus2', 5432);
        \qecho rebalancing shards across TWO nodes ...
        select * from rebalance_table_shards();
        ANALYZE;
_eof1_

Here’s a partial output as the tables are rebalanced across the two nodes:

adding node citus2 ...
citus_add_node
----------------
             3

rebalancing shards across TWO nodes ...
NOTICE:  Moving shard 102008 from citus1:5432 to citus2:5432 ...
NOTICE:  Moving shard 102009 from citus1:5432 to citus2:5432 ...
NOTICE:  Moving shard 102010 from citus1:5432 to citus2:5432 ...
NOTICE:  Moving shard 102011 from citus1:5432 to citus2:5432 ...
.
.
NOTICE:  Moving shard 102020 from citus1:5432 to citus2:5432 ...
NOTICE:  Moving shard 102021 from citus1:5432 to citus2:5432 ...
NOTICE:  Moving shard 102022 from citus1:5432 to citus2:5432 ...
NOTICE:  Moving shard 102023 from citus1:5432 to citus2:5432 ...
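Once the rebalance completes, a simple aggregate over the citus_shards view confirms that the shard placements are now split roughly evenly between the two workers (a sketch; exact counts depend on the rebalance strategy):

```sql
-- count shard placements per worker node
select nodename, count(*) as shard_count
from citus_shards
group by nodename
order by nodename;
```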

Repeating the benchmark once again demonstrates that performance has degraded somewhat. This is to be expected, considering my POC consists of containers on a single machine, all sharing the same system resources:

-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 130190
number of failed transactions: 0 (0.000%)
latency average = 9.214 ms
latency stddev = 19.159 ms
initial connection time = 15.776 ms
tps = 2169.559969 (without initial connection time)

Three worker nodes

As with the two-worker node configuration, adding another node is straightforward:

psql pgbench <<_eof1_
        \qecho adding node citus3 ...
        select citus_add_node('citus3', 5432);
        \qecho rebalancing shards across THREE nodes ...
        select * from rebalance_table_shards();
_eof1_

Repeating the benchmark gives us the updated performance metrics.

-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench

scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 142630
number of failed transactions: 0 (0.000%)
latency average = 8.410 ms
latency stddev = 5.542 ms
initial connection time = 16.407 ms
tps = 2377.227920 (without initial connection time)

Three worker nodes, with redundancy/replication

This next step is interesting because it adds redundancy by increasing the replication factor from one to two, such that each shard of each table now exists as two copies located on two distinct nodes.

ATTENTION: A newly updated replication factor applies only to tables created after the change; pre-existing tables are not affected:

psql pgbench <<_eof1_
        \qecho increasing replication factor to TWO
        alter system set citus.shard_replication_factor = 2;
        select pg_reload_conf();
_eof1_
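Because the setting was changed with alter system followed by a configuration reload, it can be verified from any new session:

```sql
-- should now report 2
show citus.shard_replication_factor;
```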

TIP: In order to replicate tables, set the replication factor before creating them.

In order to replicate the existing tables across the nodes, they must first be undistributed, which centralizes them onto the coordinator node. Redistributing them across the three-node cluster then automatically adds the replication.

This centralizes the tables onto the coordinator node:

psql pgbench <<_eof1_
    BEGIN;
        select undistribute_table('pgbench_history');
        select undistribute_table('pgbench_accounts');
        select undistribute_table('pgbench_branches');
        select undistribute_table('pgbench_tellers');
    COMMIT;
_eof1_

The tables are then redistributed and rebalanced across the three-node cluster with the desired redundancy of two:

psql pgbench <<_eof1_
BEGIN;
  select create_distributed_table('pgbench_history', 'aid');
  select create_distributed_table('pgbench_accounts', 'aid');
  select create_distributed_table('pgbench_branches', 'bid');
  select create_distributed_table('pgbench_tellers', 'tid');
COMMIT;

-- remove legacy tables from the coordinator node
BEGIN;
   select truncate_local_data_after_distributing_table($$public.pgbench_history$$);
   select truncate_local_data_after_distributing_table($$public.pgbench_accounts$$);
   select truncate_local_data_after_distributing_table($$public.pgbench_branches$$);
   select truncate_local_data_after_distributing_table($$public.pgbench_tellers$$);
COMMIT;
_eof1_
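To confirm that every shard now has two placements, the pg_dist_placement metadata table can be aggregated per shard (a sketch; if replication is correct, this should return zero rows):

```sql
-- any shard with a placement count other than 2 indicates a problem
select shardid, count(*) as placements
from pg_dist_placement
group by shardid
having count(*) <> 2;
```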

Repeating the benchmark gives us the updated performance metrics.

-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 68834
number of failed transactions: 0 (0.000%)
latency average = 17.430 ms
latency stddev = 12.818 ms
initial connection time = 16.321 ms
tps = 1147.093369 (without initial connection time)

Benchmark results

Here’s a summary of the various benchmarking runs, giving you an idea of relative performance.

 RUN | NUM NODES | WITH IDX | REPLICATION FACTOR |  TPS
-----+-----------+----------+--------------------+------
  1  |     1     |    no    |         1          |   25
  2  |     1     |   yes    |         1          | 2255
  3  |     2     |   yes    |         1          | 2170
  4  |     3     |   yes    |         1          | 2377
  5  |     3     |   yes    |         2          | 1147

Caveat

Although not mentioned in the blog, there are a few things you'll need to keep in mind as you work with Citus:

  1. The password for the account used to propagate operations across the cluster must be stored in the home account's .pgpass file on all nodes.
  2. The "wal_level" must be set to logical.
  3. Redundancy can potentially decrease overall performance.
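For point one, the .pgpass file uses the standard libpq format, one line per connection; the hostnames and credentials below are placeholders for illustration:

```
# hostname:port:database:username:password
citus-coord-01:5432:pgbench:postgres:mysecret
citus1:5432:pgbench:postgres:mysecret
citus2:5432:pgbench:postgres:mysecret
citus3:5432:pgbench:postgres:mysecret
```

The file must be readable only by its owner (chmod 600) or libpq will ignore it.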

Keep in mind that the performance metrics depend not only upon the cluster's hardware and system resources but also upon the level of tuning, which, in this case, has not been addressed.

A future blog will continue my exploration into Citus by scaling out pgbench into other architectures.

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community, in a single distribution, designed and tested to work together.

 

Download Percona Distribution for PostgreSQL Today!

