I have an unhealthy etcd cluster because of a member that failed to join. That member doesn't exist, and etcd is stuck electing a leader.

That is, there was a cluster with 3 nodes; a fourth tried to join but failed, and now the cluster has 4 members, one of which is unavailable. etcd started a leader election and got stuck in that state.

As a result, etcdctl no longer works, but I can still access the node API with curl.

Obviously, it's unhealthy:

curl https://10.0.0.1:2379/health
{"health":"false","reason":"RAFT NO LEADER"}
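The same check can be scripted. Below is a minimal Python sketch, assuming the /health body has the shape shown above, where "health" is a string rather than a boolean. The name `parse_health` is a hypothetical helper, not part of etcd.

```python
import json
from typing import Tuple


def parse_health(body: str) -> Tuple[bool, str]:
    """Return (healthy, reason) parsed from a /health response body."""
    data = json.loads(body)
    # The response above carries health as the string "false";
    # str() also tolerates a real JSON boolean.
    healthy = str(data.get("health", "")).lower() == "true"
    # "reason" is assumed to be optional and defaults to an empty string.
    return healthy, data.get("reason", "")
```

For the response above, this would report the node as unhealthy together with the reason string, which is handy when polling several endpoints in a loop.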

The members list returns 4 members:

curl https://10.0.0.1:2379/v2/members | jq

{
  "members": [
    {
      "id": "32ee161a1cedcf0a",
      "name": "",
      "peerURLs": ["https://10.0.0.13:2380"],
      "clientURLs": []
    },
    +3 more, which are actual nodes
  ]
}
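The failed member stands out in this output because it has an empty name and no client URLs. A small Python sketch for picking such entries out of the /v2/members response follows. It assumes the JSON shape shown above and treats "empty name and empty clientURLs" as the marker of a member that never started; `unstarted_member_ids` is a hypothetical helper name.

```python
import json
from typing import List


def unstarted_member_ids(members_json: str) -> List[str]:
    """Return IDs of members that have neither a name nor client URLs."""
    doc = json.loads(members_json)
    return [
        m["id"]
        for m in doc.get("members", [])
        # Assumption: a member that was added but never joined
        # reports an empty name and an empty clientURLs list.
        if not m.get("name") and not m.get("clientURLs")
    ]
```

The returned IDs are the candidates for the DELETE request on /v2/members/<id>.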

When I try to remove it, I get an error:

curl https://10.0.0.1:2379/v2/members/32ee161a1cedcf0a -XDELETE
{"message":"Internal Server Error"}

And the etcd log shows:

... "caller":"v2http/client.go:267","msg":"failed to remove a member","member-id":"32ee161a1cedcf0a","error":"context deadline exceeded"}
... "caller":"etcdhttp/base.go:136","msg":"unexpected v2 response error","remote-addr":"10.0.0.1:41562","internal-server-error":"context deadline exceeded"}

All other log entries are etcd trying to reach consensus on a leader election, and timing out.

As I understand it, a node without a leader cannot accept or propagate any changes, including changes to the member list. But to fix the unhealthy state, I need to change the member list.
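The arithmetic behind this can be sketched in Python. It assumes the usual Raft majority rule, where a change (including a membership change) needs more than half of the configured members; the function names are illustrative only.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def has_quorum(members: int, reachable: int) -> bool:
    """True if enough members are reachable to elect a leader and commit."""
    return reachable >= quorum(members)


# With the original 3 members, 2 are enough for a majority.
print(quorum(3))         # 2
# After the failed join the cluster counts 4 members, so 3 are needed.
print(quorum(4))         # 3
# All 3 real nodes must then reach each other; losing any one breaks quorum.
print(has_quorum(4, 3))  # True
print(has_quorum(4, 2))  # False
```

Under that assumption, the dead fourth member raises the majority from 2 to 3, so the cluster has no spare capacity: any connectivity problem between the three real nodes leaves it without a leader.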

How do I deal with this state?

  • UPDATE: It seems the cluster was simply stuck in some invalid and unrecoverable state, most likely caused by other changes to the network topology. Eventually, I restarted the whole cluster, i.e. every single machine at the same time, and after that the nodes were able to connect and I could add and remove members. Nov 19 at 21:51
