Cluster operation when certain nodes are unreachable

Some of the nodes in my cluster became unreachable. I expected the cluster to recover from this automatically, but that doesn’t seem to be the case. As a result, the cluster is non-operational. A couple of questions:

  • What is the expected behavior in this situation?
  • If automatic recovery is not possible, what can I do to recover manually, assuming the faulty nodes remain down?

Hi Jim,

The recovery behavior depends on the type of failure. If multiple servers in the cluster fail, manual recovery is required. For transient errors, the system retries and brings the affected services back up on its own. Recovery of a failed service is manual by default. For the RESTPP service, customers can use an automation script to restart it automatically. Happy to provide more information offline.
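As a rough illustration of what such an automation script could look like, here is a minimal Python sketch of a watchdog that checks RESTPP and requests a restart when it looks unhealthy. The `gadmin status restpp` and `gadmin restart restpp -y` commands, the "Running"/"Down" keywords in the status output, and all function names are assumptions for illustration; please verify them against your TigerGraph version before relying on this.

```python
import subprocess
import time
from typing import Callable, Sequence

# Assumed commands; confirm the exact syntax for your TigerGraph version.
STATUS_CMD = ["gadmin", "status", "restpp"]
RESTART_CMD = ["gadmin", "restart", "restpp", "-y"]


def run(cmd: Sequence[str]) -> str:
    """Run a command and return its stdout and stderr as one string."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout + result.stderr


def is_healthy(status_output: str) -> bool:
    """Guess service health from the status text.

    Assumption: a healthy service reports "Running" and never "Down".
    Adjust the keywords to match the output you actually see.
    """
    lowered = status_output.lower()
    return "running" in lowered and "down" not in lowered


def watchdog_once(get_status: Callable[[], str],
                  restart: Callable[[], object]) -> bool:
    """Check once; call restart() if unhealthy. Returns True if restarted."""
    if is_healthy(get_status()):
        return False
    restart()
    return True


def watch_forever(interval_seconds: float = 60.0) -> None:
    """Poll the service status and request a restart whenever it looks down."""
    while True:
        restarted = watchdog_once(lambda: run(STATUS_CMD),
                                  lambda: run(RESTART_CMD))
        if restarted:
            print("RESTPP looked unhealthy; restart requested")
        time.sleep(interval_seconds)
```

Calling `watch_forever()` as the TigerGraph user on a cluster node (for example from a systemd unit) would start the loop. Note that this only covers a failed service on a reachable node; it does not address the case where multiple servers are down, which still needs manual recovery.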

Thanks
Rayees