I'm sitting with a situation where my cluster falls over every couple of days: it starts with one node, then the rest follow. The roles themselves migrate between the nodes without issue.
We have the following hardware:
Fujitsu Primergy BX400 S1 chassis
6x Primergy BX2560 M2 blades
Node 1: 2x Intel Xeon E5-2640 v4, 320GB RAM
Node 2: 2x Intel Xeon E5-2640 v4, 320GB RAM
Node 3: 2x Intel Xeon E5-2640 v4, 320GB RAM
Node 4: 2x Intel Xeon E5-2640 v4, 320GB RAM
Node 5: 2x Intel Xeon E5-2640 v4, 320GB RAM
Node 6: 2x Intel Xeon E5-2640 v4, 448GB RAM
Our storage is a Pure Storage FA-X20R2.
The BX400 is connected to two Brocade SAN switches via 12x 8Gb Fibre Channel cables (two per node, one FC link to each switch for failover), and the storage is connected via 4x 8Gb Fibre Channel cables (two FC links to each switch for failover). The nodes are configured with MPIO to the storage and are all running Windows Server 2019 Datacenter.
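For reference, this is roughly how I verify the MPIO side from each node (a sketch; the recommended values are from Pure's Windows Server best-practices guidance as I understood it when we set this up, so verify against their current doc):

# Show the current MPIO timer settings on this node
Get-MPIOSetting

# Pure's guidance (as I recall) is path verification on and a 30 s
# PDORemovePeriod - double-check against the current best-practices doc
Set-MPIOSetting -NewPathVerificationState Enabled
Set-MPIOSetting -NewPDORemovePeriod 30

# List MPIO disks and per-LUN path counts; with two HBA ports per node and
# our zoning I'd expect to see four paths to every Pure volume
mpclaim -s -d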
We configured the Pure Storage array with 3x 8TB volumes and a 10GB disk witness (quorum).
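And this is how I check that the witness itself is healthy (sketch):

# Confirm the quorum model and which disk is acting as witness
Get-ClusterQuorum | Format-List *

# Make sure the witness disk resource is online and owned by a live node
Get-ClusterResource | Where-Object ResourceType -eq 'Physical Disk' |
    Format-Table Name, State, OwnerNode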
The BX400 is connected to our core switch via 2x 10Gb fiber uplinks, and the Pure uses 2x 1Gb copper connections for management.
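Because the node removals look like missed heartbeats, I also checked the cluster networks and heartbeat thresholds (sketch; the defaults I mention are what I believe Server 2019 ships with, so double-check):

# List the cluster networks and what traffic each is allowed to carry
# (Role: ClusterAndClient, Cluster, or None)
Get-ClusterNetwork | Format-Table Name, Role, Address, State

# Heartbeat tuning - I believe 2019 defaults to a 1000 ms delay and a
# threshold of 10 missed heartbeats on the same subnet
Get-Cluster | Format-List SameSubnetDelay, SameSubnetThreshold, CrossSubnetDelay, CrossSubnetThreshold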
Below are the typical errors we get when the cluster has its wobble:
Cluster Shared Volume 'Pure_CSV08' ('Pure_CSV08') has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.
Cluster Shared Volume 'Pure_CSV09' ('Pure_CSV09') has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.
Cluster node 'NODE 6' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
At this point all nodes get "removed" from the cluster but then come back online; node 6 goes into an isolated state.
Cluster node 'NODE 6' has entered the isolated state.
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.
Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
Cluster Shared Volume 'Pure_CSV10' ('Pure_CSV10') has entered a paused state because of 'STATUS_UNEXPECTED_NETWORK_ERROR(c00000c4)'. All I/O will temporarily be queued until a path to the volume is reestablished. This error is usually caused by an infrastructure failure. For example, losing connectivity to storage or the node owning the Cluster Shared Volume being removed from active cluster membership.
Cluster node 'NODE 6' has been quarantined. The node experienced '3' consecutive failures within a short amount of time and has been removed from the cluster to avoid further disruptions. The node will be quarantined until '2023/01/25-02:37:02.374' and then the node will automatically attempt to re-join the cluster.
Refer to the System and Application event logs to determine the issues on this node. When the issue is resolved, quarantine can be manually cleared to allow the node to rejoin with the 'Start-ClusterNode –ClearQuarantine' Windows PowerShell cmdlet.
Node Name : NODE 6
Number of consecutive cluster membership loses: 3
Time quarantine will be automatically cleared: 2023/01/25-02:37:02.374
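When this happens I don't wait out the quarantine; I clear it manually (sketch):

# Run on the quarantined node to let it rejoin immediately
Start-ClusterNode -ClearQuarantine

# The trigger and duration are tunable cluster properties - I believe the
# defaults are 3 membership losses and a 7200 s (2 hour) quarantine
Get-Cluster | Format-List QuarantineThreshold, QuarantineDuration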
The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
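To try to pin down which resource is taking RHS down, I pull the cluster log from every node right after an incident (sketch; the destination folder is just an example):

# Dump the last 30 minutes of the cluster log from all nodes, in local
# time, into one folder for review
Get-ClusterLog -UseLocalTime -TimeSpan 30 -Destination C:\Temp\ClusterLogs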
What am I missing here? Why is this happening?