Linux cluster with ZFS on Cluster-in-a-Box

This guide covers the creation of a simple two-node HA cluster serving an iSCSI target, with ZFS as the storage foundation.

Architecture

Hardware

The most common hardware for such a solution is two servers and one JBOD connected via SAS HBAs.


A similar system can be built on a Cluster-in-a-Box class system: two servers and a JBOD in one enclosure, as mentioned in the article about Shared DAS.

Such a system serves as the SUT (system under test) for this article.

Software

The software architecture is typical for building high availability (HA) clusters on GNU/Linux. Notes on the overall solution architecture:

  • Hardware failure resistance is provided by redundant components: two or more servers, fault-tolerant JBOD, redundant SAS (multipath) and networking links.
  • OS-level multipath is provided by the multipathd daemon. Ethernet links are combined with bonding/teaming.
  • Storage redundancy is based on ZFS, which also manages volumes and caching.
  • corosync+pacemaker provide the infrastructure for HA cluster setup and resource management.

Initial GNU/Linux setup

NOTE: All steps must be performed on both nodes.

Install a GNU/Linux OS with OpenSSH on both nodes. This article is written for Red Hat Enterprise Linux 7, applies equally to CentOS 7, and will work (with minor differences) on other modern GNU/Linux distributions.

Networking

How to disable automatic network management (for true Linux gurus)

systemctl stop NetworkManager

systemctl disable NetworkManager

yum erase NetworkManager*

The system has three physical 10 Gbps interfaces:

  • ens11f0, ens11f1 — external interfaces, bonded into team0;
  • enp130s0 — internal, used only for corosync traffic.

Interface bonding is done via Team.

External interfaces:

# cat ifcfg-ens11f0

DEVICETYPE=TeamPort

BOOTPROTO=none

USERCTL=no

ONBOOT=no

TEAM_MASTER=team0

TEAM_PORT_CONFIG='{"prio":100}'

NAME="ens11f0"

UUID="704d85d9-7430-4d8f-b920-792263d192ba"

HWADDR="00:8C:FA:E5:6D:E0"

# cat ifcfg-ens11f1

DEVICETYPE=TeamPort

BOOTPROTO=none

USERCTL=no

ONBOOT=no

TEAM_MASTER=team0

TEAM_PORT_CONFIG='{"prio":100}'

NAME=ens11f1

UUID=4bd90873-9097-442a-8ac8-7971756b0fc5

HWADDR=00:8C:FA:E5:6D:E1

team0 interface:

# cat ./ifcfg-team0

DEVICE=team0

DEVICETYPE=Team

BOOTPROTO=static

USERCTL=no

ONBOOT=yes

IPADDR=10.3.254.64

NETMASK=255.255.255.0

GATEWAY=10.3.254.1

DNS1=192.168.10.107

TEAM_CONFIG='{"runner":{"name":"activebackup"},"link_watch":{"name":"ethtool"}}'
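Once team0 is up, its runner state can be checked (a quick sanity check; the teamdctl utility comes with the teamd package that Team devices rely on):

teamdctl team0 state

ip addr show team0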

Internal:

# cat ./ifcfg-enp130s0

TYPE=Ethernet

BOOTPROTO=none

NAME=enp130s0

UUID=2933ee35-eb16-485e-b65c-e186d772b480

ONBOOT=yes

HWADDR=00:8C:FA:CE:56:DB

IPADDR=172.30.0.1

PREFIX=28
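To apply the new ifcfg files, restart the legacy network service (a minimal sketch; with NetworkManager removed, the network initscript is what brings the interfaces up on RHEL/CentOS 7):

systemctl restart network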

/etc/hosts

# cat /etc/hosts

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4

::1 localhost localhost.localdomain localhost6 localhost6.localdomain6

172.30.0.1 node1

172.30.0.2 node2

10.3.254.64 node1-ext

10.3.254.12 node2-ext
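A quick connectivity check over the internal link is worthwhile before going further (run from node1, assuming node2 is configured the same way):

ping -c 3 node2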

Notes

It is recommended to disable the firewall for the initial setup:

systemctl stop firewalld

systemctl disable firewalld
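If disabling firewalld is not acceptable, the required ports can be opened instead (a sketch assuming default ports: the high-availability service covers pcsd and corosync, and 3260/tcp is the standard iSCSI port):

firewall-cmd --permanent --add-service=high-availability

firewall-cmd --permanent --add-port=3260/tcp

firewall-cmd --reload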

Multipath

If a block device is connected to the SAS HBA over more than a single path (as in a CiB with dual expander access), each drive shows up twice in the OS device list. To handle this correctly, the multipathd daemon is needed.

yum install device-mapper-multipath.x86_64

touch /etc/multipath.conf

systemctl start multipathd

systemctl enable multipathd

If everything is correct, then the multipath -l output will look like:

35000c50077ad9a3f dm-6 SEAGATE ,ST2000NX0263
size=1.8T features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| `- 0:0:0:0 sdb 8:16 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
  `- 0:0:6:0 sdg 8:96 active undef running
35000c50077580317 dm-4 SEAGATE ,ST2000NX0273
size=1.8T features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| `- 0:0:1:0 sdc 8:32 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
  `- 0:0:7:0 sdh 8:112 active undef running

The actual drive paths will look like this:

/dev/mapper/35000c50077580317

/dev/mapper/35000c50077ad9287

/dev/mapper/35000c50077ad9a3f

/dev/mapper/35000c50077ad8aab

/dev/mapper/35000c50077ad92ef

ZFS

ZFS is used as an alternative to the MD+LVM stack. It reduces the variety of resources the cluster has to manage and simplifies administration. Block access (iSCSI/FC) or file access (NFS), SSD caching and deduplication are all available without additional tools or extra control from the cluster side.

ZFS deployment on CentOS is simple – http://zfsonlinux.org/epel.html:

yum install ftp://ftp.yandex.ru/epel/7/x86_64/e/epel-release-7-5.noarch.rpm

yum install http://archive.zfsonlinux.org/epel/zfs-release.el7.noarch.rpm

yum install kernel-devel zfs

After installation, create the drive pool and a volume, then export the pool:

zpool create -o cachefile=none pool72 raidz1 /dev/mapper/35000c50077*

zfs create -s -V 1T pool72/vol1T

zpool export pool72
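After the export, a quick check that a node still sees the pool without importing it (zpool import with no pool name only lists what it finds; -d points at the same multipath directory used for pool creation):

zpool import -d /dev/mapper/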

The pool will be exposed via iSCSI. Since there is no way to know what data and metadata the initiators will store on it, it is better to prevent udev from scanning the volumes in that pool:

cp /lib/udev/rules.d/60-persistent-storage.rules /etc/udev/rules.d/60-persistent-storage.rules

sed -i '1s/^/KERNEL=="zd*" SUBSYSTEM=="block" GOTO="persistent_storage_end"\n/' /etc/udev/rules.d/60-persistent-storage.rules

systemctl restart systemd-udevd

iSCSI

yum install targetcli.noarch
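targetcli itself is not used for configuration here (the iSCSITarget/iSCSILogicalUnit resource agents will drive LIO), but it is handy for inspecting what the cluster creates later (an optional check):

targetcli ls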

Corosync+Pacemaker

Installation (on all cluster nodes):

Packages required:

yum install pcs fence-agents-all

Pacemaker resource agents for ZFS and iSCSI:

cd /usr/lib/ocf/resource.d/heartbeat/

wget https://github.com/skiselkov/stmf-ha/raw/master/heartbeat/ZFS

wget https://github.com/ClusterLabs/resource-agents/raw/master/heartbeat/iSCSITarget

wget https://github.com/ClusterLabs/resource-agents/raw/master/heartbeat/iSCSILogicalUnit

chmod a+x ./ZFS

chmod a+x ./iSCSILogicalUnit

chmod a+x ./iSCSITarget
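An optional check that pacemaker can see the downloaded agents (the output should include ZFS, iSCSITarget and iSCSILogicalUnit):

pcs resource agents ocf:heartbeat | grep -E 'ZFS|iSCSI'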

Set a password for the hacluster user (it will be used for node authorization):

passwd hacluster

systemctl enable pcsd

systemctl enable corosync

systemctl enable pacemaker

systemctl start pcsd

Cluster creation (on any node)

Node authorization:

pcs cluster auth node1 node2
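The command prompts for credentials; typical output, assuming the hacluster account and the password set earlier:

Username: hacluster

Password:

node1: Authorized

node2: Authorized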

The cluster is set up with two corosync “rings”, on the internal and external interfaces:

pcs cluster setup --start --name cib node1,node1-ext node2,node2-ext

pcs status

Cluster name: cib

WARNING: no stonith devices and stonith-enabled is not false

Last updated: Thu Mar 12 14:38:37 2015

Last change: Thu Mar 12 14:38:24 2015 via crmd on node1

Current DC: NONE

2 Nodes configured

0 Resources configured

Node node1 (1): UNCLEAN (offline)

Node node2 (2): UNCLEAN (offline)

Full list of resources:

PCSD Status:

node1: Online

node2: Online

Daemon Status:

corosync: active/disabled

pacemaker: active/disabled

pcsd: active/enabled

The warning “no stonith devices and stonith-enabled is not false” means that no STONITH resources are configured. In our case we will only “fence” the storage subsystem using SCSI-3 Persistent Reservation: in a split-brain event, one node is blocked from writing to the storage. The resource covers the drives of the ZFS pool:

pcs stonith create fence-pool72 fence_scsi \

devices="/dev/mapper/35000c50077580317, \

/dev/mapper/35000c50077ad8aab, \

/dev/mapper/35000c50077ad9287, \

/dev/mapper/35000c50077ad92ef, \

/dev/mapper/35000c50077ad9a3f" meta provides=unfencing
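Whether the SCSI-3 reservations actually land on the drives can be verified with sg_persist from the sg3_utils package (an optional check on any pool device; fence_scsi registers one key per node):

sg_persist --in --read-keys --device=/dev/mapper/35000c50077ad9a3f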

Since our cluster has only two nodes, the quorum policy has to be ignored:

pcs property set no-quorum-policy=ignore

Cluster resources

All resources serving the storage pool are added to a single group, group-pool72. The pool:

pcs resource create pool72 ZFS \

params pool="pool72" importargs="-d /dev/mapper/" \

op start timeout="90" op stop timeout="90" --group=group-pool72

iSCSI-target and LUN:

pcs resource create target-pool72 iSCSITarget \

portals="10.3.254.230" iqn="iqn.2005-05.com.etegro:cib.pool72" \

implementation="lio-t" --group group-pool72

pcs resource create lun1-pool72 iSCSILogicalUnit \

target_iqn="iqn.2005-05.com.etegro:cib.pool72" lun="1" \

path="/dev/pool72/vol1T" --group group-pool72

It should be pointed out that the LIO target is used only because it is the default in CentOS; a discussion of target quality is not the purpose of this material. The IP address:

pcs resource create ip-pool72 IPaddr2 \

ip="10.3.254.230" cidr_netmask=24 --group group-pool72

Start order of the resources within the group:

pcs constraint order pool72 then target-pool72

pcs constraint order target-pool72 then lun1-pool72

pcs constraint order lun1-pool72 then ip-pool72
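A simple failover test to finish: put the active node into standby, watch the group migrate to the other node, then bring it back (a sketch using the resource and node names from above):

pcs status resources

pcs cluster standby node1

pcs status resources

pcs cluster unstandby node1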
