I'm sitting with a situation where my cluster falls over every couple of days: it starts with one node, then the rest follow. The roles themselves migrate between the nodes without issue.
We have the following hardware:
Fujitsu Primergy BX400 S1 chassis
6x Primergy BX2560 M2 blades
Node 1: 2x Intel Xeon E5-2640 v4, 320GB RAM
Node 2: 2x Intel Xeon E5-2640 v4, 320GB RAM
Node 3: 2x Intel Xeon E5-2640 v4, 320GB RAM
Node 4: 2x Intel Xeon E5-2640 v4, 320GB RAM
Node 5: 2x Intel Xeon E5-2640 v4, 320GB RAM
Node 6: 2x Intel Xeon E5-2640 v4, 448GB RAM
Our storage is a Pure Storage FA-X20R2.
The BX400 is connected to two Brocade SAN switches via 12x 8Gb Fibre Channel cables (two per node, one FC link to each switch for failover), and the storage is connected via 4x 8Gb Fibre Channel cables (two FC links to each switch for failover). The nodes are configured with MPIO to the storage and are all running Windows Server 2019 Datacenter.
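For reference, this is roughly how I verify the MPIO side from each node (a sketch; the recommended values are from Pure's Windows Server best-practices guidance as I understood it when we set this up, so verify against their current doc):

# Show the current MPIO timer settings on this node
Get-MPIOSetting

# Pure's guidance (as I recall) is path verification on and a 30 s
# PDORemovePeriod - double-check against the current best-practices doc
Set-MPIOSetting -NewPathVerificationState Enabled
Set-MPIOSetting -NewPDORemovePeriod 30

# List MPIO disks and per-LUN path counts; with two HBA ports per node and
# our zoning I'd expect to see four paths to every Pure volume
mpclaim -s -d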
We configured the Pure Storage array with 3x 8TB volumes and a 10GB disk witness (quorum).
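And this is how I check that the witness itself is healthy (sketch):

# Confirm the quorum model and which disk is acting as witness
Get-ClusterQuorum | Format-List *

# Make sure the witness disk resource is online and owned by a live node
Get-ClusterResource | Where-Object ResourceType -eq 'Physical Disk' |
    Format-Table Name, State, OwnerNode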
The BX400 is connected to our core switch via 2x 10Gb fiber uplinks, and the Pure uses 2x 1Gb copper connections for management.
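Because the node removals look like missed heartbeats, I also checked the cluster networks and heartbeat thresholds (sketch; the defaults I mention are what I believe Server 2019 ships with, so double-check):

# List the cluster networks and what traffic each is allowed to carry
# (Role: ClusterAndClient, Cluster, or None)
Get-ClusterNetwork | Format-Table Name, Role, Address, State

# Heartbeat tuning - I believe 2019 defaults to a 1000 ms delay and a
# threshold of 10 missed heartbeats on the same subnet
Get-Cluster | Format-List SameSubnetDelay, SameSubnetThreshold, CrossSubnetDelay, CrossSubnetThreshold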
Below are the typical errors we get when the cluster has its wobble:
Cluster Shared Volume 'Pure_CSV08' ('Pure_CSV08') has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.
Cluster Shared Volume 'Pure_CSV09' ('Pure_CSV09') has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.
Cluster node 'NODE 6' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
At this point all nodes get "removed" from the cluster but then come back online; node 6 goes into an isolated state.
Cluster node 'NODE 6' has entered the isolated state.
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.
Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
Cluster Shared Volume 'Pure_CSV10' ('Pure_CSV10') has entered a paused state because of 'STATUS_UNEXPECTED_NETWORK_ERROR(c00000c4)'. All I/O will temporarily be queued until a path to the volume is reestablished. This error is usually caused by an infrastructure failure. For example, losing connectivity to storage or the node owning the Cluster Shared Volume being removed from active cluster membership.
Cluster node 'NODE 6' has been quarantined. The node experienced '3' consecutive failures within a short amount of time and has been removed from the cluster to avoid further disruptions. The node will be quarantined until '2023/01/25-02:37:02.374' and then the node will automatically attempt to re-join the cluster.
Refer to the System and Application event logs to determine the issues on this node. When the issue is resolved, quarantine can be manually cleared to allow the node to rejoin with the 'Start-ClusterNode –ClearQuarantine' Windows PowerShell cmdlet.
Node Name : NODE 6
Number of consecutive cluster membership loses: 3
Time quarantine will be automatically cleared: 2023/01/25-02:37:02.374
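When this happens I don't wait out the quarantine; I clear it manually (sketch):

# Run on the quarantined node to let it rejoin immediately
Start-ClusterNode -ClearQuarantine

# The trigger and duration are tunable cluster properties - I believe the
# defaults are 3 membership losses and a 7200 s (2 hour) quarantine
Get-Cluster | Format-List QuarantineThreshold, QuarantineDuration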
The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
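To try to pin down which resource is taking RHS down, I pull the cluster log from every node right after an incident (sketch; the destination folder is just an example):

# Dump the last 30 minutes of the cluster log from all nodes, in local
# time, into one folder for review
Get-ClusterLog -UseLocalTime -TimeSpan 30 -Destination C:\Temp\ClusterLogs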
What am I missing here? Why is this happening?