Previously, a global server balancer was used to assign the next DNS server to
send a query to. That could lead to a non-uniform distribution of servers per
request. A request could be assigned to the same dead server several times in a
row and wait longer for a valid server or even time out without being processed.
Now each query is sent to all servers sequentially in a circle until a
response is received or timeout expires. Initial server for each request is
still globally balanced.
When several requests were waiting for a response, then after getting
a CNAME response only the last request's context had the name updated.
Contexts of other requests had the wrong name. This name was used by
ngx_resolve_name_done() to find the node to remove the request context
from. When the name was wrong, the request could not be properly
cancelled, its context was freed but stayed linked to the node's waiting
list. This happened e.g. when the first request was aborted or timed
out before the resolving completed. When it completed, this triggered
a use-after-free memory access by calling ctx->handler of already freed
request context. The bug manifests itself by
"could not cancel <name> resolving" alerts in error_log.
When a request was responded with a CNAME, the request context kept
the pointer to the original node's rn->u.cname. If the original node
expired before the resolving timed out or completed with an error,
this would trigger a use-after-free memory access via ctx->name in
ctx->handler().
The fix is to keep ctx->name unmodified. The name from context
is no longer used by ngx_resolve_name_done(). Instead, we now keep
the pointer to resolver node to which this request is linked.
Keeping the original name intact also improves logging.
If one or more requests were waiting for a response, then after
getting a CNAME response, the timeout event on the first request
remained active, pointing to the wrong node with an empty
rn->waiting list, and that could cause either null pointer
dereference or use-after-free memory access if this timeout
expired.
If several requests were waiting for a response, and the first
request terminated (e.g., due to client closing a connection),
other requests were left without a timeout and could potentially
wait indefinitely.
This is fixed by introducing per-request independent timeouts.
This change also reverts 954867a2f0a6 and 5004210e8c78.
In 954867a2f0a6, we switched to using resolver node as the timer event data.
This broke debug event logging.
Replaced now unused ngx_resolver_ctx_t.ident with ngx_resolver_node_t.ident
so that ngx_event_ident() extracts something sensible when accessing
ngx_resolver_node_t as ngx_connection_t.