
I have read other posts; they do not shed much light on this.

Situation:

  • Kubernetes cluster with ingress points to
  • Several nginx containers that proxy-pass to a
  • Node application on a specific URI via location /app/ (a rough sketch of this proxy block is just below)
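
For reference, the relevant proxy block looks roughly like the following (the upstream service name and port are placeholders, not our actual values):

    # Roughly what each nginx container does; the name and port are hypothetical
    location /app/ {
        proxy_pass http://node-app:3000/;
    }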

What we see:

After days of working without problems, all 3 nginx containers start reporting upstream issues with the node app at the same time - that the connection is unexpectedly closed by the client.

However, going direct to the node app (via the direct ingress route), or even curling it from the nginx containers, has a 100% success rate. I.e., the issue does not seem to exist in the app itself; if it did, we would expect direct requests to fail at a similar rate to those going via nginx, for the same reasons.

  • CPU - well below the max
  • Memory - well below the max
  • 1024 sockets configured (see the config sketch after this list)
  • 100k file descriptors (hard and soft limits).
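
On the sockets point: if that 1024 figure is nginx's worker_connections (the stock nginx.conf ships with 1024), raising it alongside the file descriptor limit would look like the following; the numbers are purely illustrative, not our production values:

    # Illustrative only; assumes the "1024 sockets" above is worker_connections
    worker_rlimit_nofile 100000;
    events {
        worker_connections 8192;
    }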

This is all I have, but this is a pressing issue, and it is not clear what the issue is, or why going direct yields such different behavior. More to the point, why does accessing the nginx container via docker exec and curling from there not reproduce the issue?

Right now the working hypothesis is that some resource is being exhausted, but it is not immediately clear what that resource is.

We are not maintaining keep-alive connections, but if socket / port exhaustion were the issue, surely we would have seen the same behavior when logging into the container and curling directly.
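
For completeness, enabling keep-alive to the upstream (which, again, we are not currently doing) would look roughly like this per the nginx docs; without it, every proxied request opens and then tears down a fresh TCP connection to the node app. The names and pool size are illustrative:

    # Illustrative only - not our current config
    upstream node_app {
        server node-app:3000;   # hypothetical service name/port
        keepalive 32;           # idle connections kept open per worker
    }

    location /app/ {
        proxy_pass http://node_app/;
        proxy_http_version 1.1;          # keep-alive to upstreams requires HTTP/1.1
        proxy_set_header Connection "";  # clear "Connection: close" so sockets are reused
    }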

I am starting to run out of ideas - so any help is hugely appreciated.
Right now, I have a ticking time bomb: flawless service for days, and then suddenly - BOOM! Clients experience latencies of either 60 or 120 seconds (the request appears to fail over from one nginx to another on the first failure) 30% of the time.
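
One observation on those numbers: nginx's proxy timeouts (proxy_connect_timeout, proxy_send_timeout, proxy_read_timeout) all default to 60 seconds, so 60s would be consistent with one nginx giving up on the upstream, and 120s with a second nginx doing the same after the failover. The directives, with their defaults spelled out explicitly, are:

    # Defaults shown explicitly; not a recommendation to change them
    location /app/ {
        proxy_pass http://node_app/;
        proxy_connect_timeout 60s;
        proxy_send_timeout    60s;
        proxy_read_timeout    60s;
    }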

Lastly - restart the containers: problem goes away.
Well, until it comes back, days or sometimes weeks later. This is why we do not see it as an issue with the node app itself:
if it were the node app, why does hitting it directly work 100% of the time, and how would restarting the nginx container, or an nginx reload (the master process remains active, but new worker processes are spawned with the new config), fix an issue in the node app?

Because of this, the issue is believed to be in nginx - but it is very unclear where. Resources don't appear to be wildly different post-restart compared with pre-restart, but the fact that a restart completely solves the issue, and that the issue takes days to appear, makes us feel it is somehow resource related. We couldn't offer a decent suggestion as to what that resource is, though.

  • The FIRST place to look when investigating issues on a server is the log files.
    – symcbean
    Jun 15 at 8:33
  • Why assume that hasn't been done? The logs report nothing of particular interest: from nginx it is the 499 status, and the node app reports no errors. The only thing of slight interest is that there are no entries in the node app's logs when nginx reports the 499, which would appear to suggest the node app never received the request from nginx, but that's a bit of a leap given the status can indicate a forcibly closed connection. Either way, the node app is unaware of these requests that are bombing out, despite there being nothing between the node app and nginx.
    – peteisace
    Jun 17 at 9:43
  • Why assume? Because a good question would include any relevant log entries or list the logs which were checked.
    – symcbean
    Jun 18 at 20:10

1 Answer


A 499 status is specific to the client and does not seem to be an issue with nginx itself. It means the client closed the connection while nginx was still waiting on the backend (your node application) to finish processing the request.

A basic curl command will not reproduce the same error. To get closer, watch the application logs at the moment your ingress starts reporting the 499 errors and replay those same requests with curl. If you find a particular endpoint that takes longer than expected, replicate it with curl/Postman and note how long the request takes to process; that gives you a good starting point for adjusting the max execution time (or server.timeout in Node.js) for that particular request.

  • The curl command is to the upstream application to test its availability / latencies - there is just one of them and it is 100% available via curl. As such, there would be no value in increasing the timeout on this application, since it is not timing out. Despite that, we are still getting intermittent reports of the client closing the connection / unavailability via the 499 status. This is what makes this a difficult problem to solve: the gap appears to be between nginx and the node app, since neither reports any issues, but as for what / why / where ...
    – peteisace
    Jun 17 at 9:38
