I have an unhealthy etcd cluster because a member failed to join. That member doesn't exist, and etcd is stuck electing a leader.
I.e., there was a cluster with 3 nodes; a fourth tried to join but failed, and now the cluster has 4 members, one of which is unavailable. etcd started a leader election and got stuck in that state.
As a result, etcdctl is not working anymore, but I can still access the node API with curl.
Obviously, it's unhealthy:
curl https://10.0.0.1:2379/health
{"health":"false","reason":"RAFT NO LEADER"}
The members list returns 4 members:
curl https://10.0.0.1:2379/v2/members | jq
{
  "members": [
    {
      "id": "32ee161a1cedcf0a",
      "name": "",
      "peerURLs": ["https://10.0.0.13:2380"],
      "clientURLs": []
    },
    +3 more, which are actual nodes
  ]
}
When I try to remove it I get an error:
curl https://10.0.0.1:2379/v2/members/32ee161a1cedcf0a -XDELETE
{"message":"Internal Server Error"}
And in the etcd log there is:
... "caller":"v2http/client.go:267","msg":"failed to remove a member","member-id":"32ee161a1cedcf0a","error":"context deadline exceeded"}
... "caller":"etcdhttp/base.go:136","msg":"unexpected v2 response error","remote-addr":"10.0.0.1:41562","internal-server-error":"context deadline exceeded"}
All the other log entries are etcd trying to reach consensus on a leader election and timing out.
As I understand it, without a leader the node cannot accept or propagate any changes, including changes to the member list. But to fix this unhealthy state I need to change the member list.
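To make the numbers concrete, here is a minimal sketch of the majority arithmetic, assuming etcd follows the usual Raft rule that a quorum is floor(N/2) + 1 members. The function names `quorum` and `fault_tolerance` are illustrative helpers of my own, not part of etcd or etcdctl.

```shell
#!/bin/sh
# quorum N: smallest majority of an N-member cluster, i.e. floor(N/2) + 1.
# Assumption: etcd uses the standard Raft majority rule.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# fault_tolerance N: how many members may be unavailable while a majority
# can still be formed.
fault_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

# Compare the original 3-member cluster with the current 4-member list.
for n in 3 4; do
    echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

By that arithmetic, the 4-member list needs 3 members to agree, so with the phantom member counted there is no room left for any other member to be slow or unreachable.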
How can I get the cluster out of this state?