Previously, a global server balancer was used to assign the next DNS server to
send a query to. That could lead to a non-uniform distribution of servers per
request. A request could be assigned to the same dead server several times in a
row and wait longer for a valid server or even time out without being processed.
Now each query is sent to all servers sequentially in a circle until a
response is received or the timeout expires. The initial server for each
request is still globally balanced.
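The server selection described above can be sketched roughly as follows. This is a minimal illustration, not the resolver code itself: the names next_start, pick_initial_server() and server_for_attempt() are hypothetical and assume the servers are held in a simple indexed list.

```c
#include <stddef.h>

/* Hypothetical global counter, used only to choose the first server
 * for each new request; not an nginx identifier. */
static size_t next_start;

/* Pick the starting server for a new request (global balancing). */
size_t
pick_initial_server(size_t nservers)
{
    size_t start = next_start % nservers;

    next_start = (next_start + 1) % nservers;
    return start;
}

/* Server to use for the given attempt (0-based) of a request that
 * started at "start": walk the server list in a circle. */
size_t
server_for_attempt(size_t start, size_t attempt, size_t nservers)
{
    return (start + attempt) % nservers;
}
```

With this scheme a request never retries the same dead server twice in a row unless only one server is configured.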
When several requests were waiting for a response, then after getting
a CNAME response only the last request's context had the name updated.
Contexts of other requests had the wrong name. This name was used by
ngx_resolve_name_done() to find the node to remove the request context
from. When the name was wrong, the request could not be properly
cancelled, its context was freed but stayed linked to the node's waiting
list. This happened e.g. when the first request was aborted or timed
out before the resolving completed. When it completed, this triggered
a use-after-free memory access by calling ctx->handler of already freed
request context. The bug manifests itself by
"could not cancel <name> resolving" alerts in error_log.
When a request was answered with a CNAME, the request context kept
the pointer to the original node's rn->u.cname. If the original node
expired before the resolving timed out or completed with an error,
this would trigger a use-after-free memory access via ctx->name in
ctx->handler().
The fix is to keep ctx->name unmodified. The name from context
is no longer used by ngx_resolve_name_done(). Instead, we now keep
the pointer to resolver node to which this request is linked.
Keeping the original name intact also improves logging.
When several requests were waiting for a response, then after getting
a CNAME response only the last request was properly processed, while
others were left waiting.
If one or more requests were waiting for a response, then after
getting a CNAME response, the timeout event on the first request
remained active, pointing to the wrong node with an empty
rn->waiting list, and that could cause either null pointer
dereference or use-after-free memory access if this timeout
expired.
If several requests were waiting for a response, and the first
request terminated (e.g., due to client closing a connection),
other requests were left without a timeout and could potentially
wait indefinitely.
This is fixed by introducing per-request independent timeouts.
This change also reverts 954867a2f0a6 and 5004210e8c78.
If enabled, workers are bound to available CPUs, each worker to one CPU
in order. If there are more workers than available CPUs, the remaining ones
are bound in a loop, starting again from the first available CPU.
The optional mask parameter defines which CPUs are available for automatic
binding.
In collaboration with Vladimir Homutov.
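The binding order can be illustrated with a short sketch. It assumes the available CPUs are given as a 64-bit mask and that workers are numbered from 0; cpu_for_worker() is a hypothetical helper, not an nginx function.

```c
#include <stdint.h>

/* Return the CPU index for worker "n" (0-based), given a bitmask of
 * CPUs available for binding; -1 if the mask is empty. */
int
cpu_for_worker(uint64_t mask, unsigned n)
{
    unsigned  ncpus = 0, i, k;

    for (i = 0; i < 64; i++) {
        if (mask & ((uint64_t) 1 << i)) {
            ncpus++;
        }
    }

    if (ncpus == 0) {
        return -1;
    }

    k = n % ncpus;    /* wrap around when workers outnumber CPUs */

    for (i = 0; i < 64; i++) {
        if (mask & ((uint64_t) 1 << i)) {
            if (k == 0) {
                return (int) i;
            }
            k--;
        }
    }

    return -1;    /* not reached */
}
```

For example, with a mask selecting CPUs 1 and 3, workers 0, 1 and 2 would be bound to CPUs 1, 3 and 1 respectively.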
The code failed to ensure that "s" is within the buffer passed for
parsing when checking for "ms", and this resulted in unexpected errors when
parsing non-null-terminated strings with trailing "m". The bug manifested
itself when the expires directive was used with variables.
Found by Roman Arutyunyan.
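The kind of bounds check involved can be sketched as below. This is an illustration under the assumption that the input is a byte buffer with an explicit length; unit_at() is a hypothetical helper and does not reproduce the actual parsing code.

```c
#include <stddef.h>

/* Classify the unit starting at buf[i] in a buffer of "len" bytes that
 * is NOT null-terminated.  Returns 2 for "ms", 1 for a plain "m",
 * 0 otherwise. */
int
unit_at(const unsigned char *buf, size_t len, size_t i)
{
    if (i >= len || buf[i] != 'm') {
        return 0;
    }

    /* only look at the next byte if it is inside the buffer */
    if (i + 1 < len && buf[i + 1] == 's') {
        return 2;
    }

    return 1;
}
```

Without the "i + 1 < len" test, a trailing "m" followed by an unrelated "s" byte just past the end of the buffer would be misread as "ms".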
The code for displaying version info and configuration info seemed to be
cluttering up the main function. I was finding it hard to read main. This
extracts out all of the logic for displaying version and configuration info
into its own function, thus making main easier to read.
A configuration like
server { server_name .foo^@; }
server { server_name .foo; }
resulted in a segmentation fault during construction of server names hash.
Reported by Markus Linnala.
Found with afl-fuzz.
Iterating through all connections takes a lot of CPU time, especially
with a large number of worker connections configured. As a result
nginx processes used to consume CPU time during graceful shutdown.
To mitigate this we now only do a full scan for idle connections when
a shutdown signal is received.
Transitions of connections to idle ones are now expected to be
avoided if the ngx_exiting flag is set. The upstream keepalive module
was modified to follow this.
If nginx was used under OpenVZ and a container with nginx was suspended
and resumed, configuration tests started to fail because of EADDRINUSE
returned from listen() instead of bind():
# nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: [emerg] listen() to 0.0.0.0:80, backlog 511 failed (98: Address already in use)
nginx: configuration file /etc/nginx/nginx.conf test failed
With this change EADDRINUSE errors returned by listen() are handled
similarly to errors returned by bind(), and configuration tests work
fine in the same environment:
# nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
More details about OpenVZ suspend/resume bug:
https://bugzilla.openvz.org/show_bug.cgi?id=2470
If the -T option is passed, configuration files are output to stdout in
addition to the configuration test.
In the debug mode, configuration files are kept in memory and can be accessed
using a debugger.
The function is now called ngx_parse_http_time(), and can be used by
any code to parse HTTP-style date and time. In particular, it will be
used for OCSP stapling.
For compatibility, a macro mapping ngx_http_parse_time() to the new name is
provided for a while.
When configured, an individual listen socket on a given address is
created for each worker process. This reduces in-kernel lock
contention on configurations with high accept rates, resulting in better
performance. As of now it works on Linux and DragonFly BSD.
Note that on Linux incoming connection requests are currently tied
to a specific listen socket, and if some sockets are closed, connection
requests will be reset, see https://lwn.net/Articles/542629/. With
nginx, this may happen if the number of worker processes is reduced.
There is no such problem on DragonFly BSD.
Based on previous work by Sepherosa Ziehau and Yingqi Lu.
Two mechanisms are implemented to make it possible to store pointers
in shared memory on Windows, in particular on Windows Vista and later
versions with ASLR:
- The ngx_shm_remap() function was added to allow remapping of a shared memory
zone to the address originally used for it in the master process. While
important, it doesn't solve the problem by itself as in many cases it's
not possible to use the address because of conflicts with other
allocations.
- We now create mappings at the same address in all processes by starting
mappings at predefined addresses normally unused by newborn processes.
These two mechanisms combined allow using shared memory on Windows
almost without problems, including reloads.
Based on the patch by Sergey Brester:
http://mailman.nginx.org/pipermail/nginx-devel/2015-April/006836.html
Similar to ngx_http_file_cache_set_slot(), the last component of file->name
with a fixed length of 10 bytes, as generated in ngx_create_temp_path(), is
used as a source for the names of intermediate subdirectories with each one
taking its own part. Ensure that the sum of specified levels with slashes
fits into the length (ticket #731).
Example of usage:
error_log memory:16m debug;
This allows configuring debug logging with minimal impact on performance.
It's especially useful when rare crashes are experienced under high load.
The log can be extracted from a coredump using the following gdb script:
set $log = ngx_cycle->log
while $log->writer != ngx_log_memory_writer
set $log = $log->next
end
set $buf = (ngx_log_memory_buf_t *) $log->wdata
dump binary memory debug_log.txt $buf->start $buf->end
Initial size as calculated from the number of elements may be bigger
than max_size. If this happens, make sure to set size to max_size.
Reported by Chris West.
Previously, this function checked for connection local address existence
and returned an error if it was missing. Now a new address is assigned in this
case, making it possible to call this function not only for accepted connections.
It appeared that the NGX_HAVE_AIO_SENDFILE macro was defined regardless of
the "--with-file-aio" configure option and the NGX_HAVE_FILE_AIO macro.
Now they are related.
Additionally, fixed one macro.
This reduces layering violation and simplifies the logic of AIO preread, since
it's now triggered by the send chain function itself without falling back to
the copy filter. The context of AIO operation is now stored per file buffer,
which makes it possible to properly handle cases when multiple buffers come
from different locations, each with its own configuration.
The mtx->wait counter was not decremented if we were able to obtain the lock
right after incrementing it. This resulted in unneeded sem_post() calls,
eventually leading to EOVERFLOW errors being logged, "sem_post() failed
while wake shmtx (75: Value too large for defined data type)".
To close the race, mtx->wait is now decremented if we obtain the lock right
after incrementing it in ngx_shmtx_lock(). The result can become -1 if a
concurrent ngx_shmtx_unlock() decrements mtx->wait before the added code does.
However, that only leads to one extra iteration in the next call of
ngx_shmtx_lock().
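The idea of the fix can be sketched with C11 atomics. This is a simplified model, not the ngx_shmtx_lock() code: demo_mtx_t and demo_lock_slow_step() are hypothetical names, and the semaphore wait itself is left out.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int  lock;    /* 0 = free, 1 = held */
    atomic_int  wait;    /* number of waiters expecting a wakeup */
} demo_mtx_t;

/* One pass of the slow path: returns 1 if the lock was obtained,
 * 0 if the caller would have to sleep on the semaphore. */
int
demo_lock_slow_step(demo_mtx_t *m)
{
    int  expected = 0;

    atomic_fetch_add(&m->wait, 1);    /* announce ourselves as a waiter */

    if (atomic_compare_exchange_strong(&m->lock, &expected, 1)) {
        /* got the lock without sleeping: undo the increment so that
         * unlock does not post the semaphore for nobody */
        atomic_fetch_sub(&m->wait, 1);
        return 1;
    }

    return 0;
}
```

Without the decrement, every lock obtained this way would leave a stale waiter count behind, and each later unlock would post the semaphore once more than needed.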
The use_temp_path http cache feature is now implemented using a separate temp
hierarchy in cache directory. Prefix-based temp files are no longer needed.
The original check for NGX_AGAIN was surplus, since the function returns
only NGX_OK or NGX_ERROR. Now it looks similar to other places.
No functional changes.
In 954867a2f0a6, we switched to using resolver node as the timer event data.
This broke debug event logging.
Replaced now unused ngx_resolver_ctx_t.ident with ngx_resolver_node_t.ident
so that ngx_event_ident() extracts something sensible when accessing
ngx_resolver_node_t as ngx_connection_t.
In 954867a2f0a6, we switched to using resolver node as the
timer event data, so make sure we do not free resolver node
memory until the corresponding timer is deleted.
If a syslog daemon is restarted and the unix socket is used, further logging
might stop working. In case of a send error, the socket is now closed, forcing
a reconnection at the next logging attempt.
The ngx_cycle->log is used when sending the message. This allows logging syslog
send errors to another log.
Logging to syslog after its cleanup handler has been executed is now prohibited.
Previously, this was possible from ngx_destroy_pool(), which resulted in error
messages caused by attempts to write into the closed socket.
The "processing" flag is renamed to "busy" to better match its semantics.
In theory, this can provide a bit better distribution of latencies.
Also it simplifies the code, since ngx_queue_t is now used instead
of custom implementation.
RFC3986 says that, for consistency, URI producers and normalizers
should use uppercase hexadecimal digits for all percent-encodings.
This is also what modern web browsers and other tools use.
Using lowercase hexadecimal digits makes it harder to interact with
those tools in case when use of the percent-encoded URI is required,
for example when $request_uri is part of the cache key.
Signed-off-by: Piotr Sikora <piotr@cloudflare.com>
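As a small illustration of the convention, encoding a single byte with uppercase hexadecimal digits could look like this. The helper name pct_encode_byte() is hypothetical and is not taken from the nginx escaping code.

```c
/* Percent-encode one byte into dst[0..2] using uppercase hex digits.
 * The output is not null-terminated. */
void
pct_encode_byte(unsigned char c, char dst[3])
{
    static const char  hex[] = "0123456789ABCDEF";

    dst[0] = '%';
    dst[1] = hex[c >> 4];
    dst[2] = hex[c & 0x0f];
}
```

With this table the byte 0x2f is written as "%2F" rather than "%2f", which matches what browsers typically produce.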
The check became meaningless after refactoring in 2a92804f4109.
With the loop currently in place, "current" can't be NULL, hence
the check can be dropped.
Additionally, the local variable "current" was removed to
simplify code, and pool->current is now used directly instead.
Found by Coverity (CID 714236).
This isn't really important as configuration testing shortly ends with
a process termination which will free all sockets, though Coverity
complains.
Prodded by Coverity (CID 400872).
Large allocations from a slab pool result in free page blocks being fragmented,
eventually leading to a situation where no further allocations larger than a
page are possible from the pool. While this isn't a problem for nginx itself,
it is known to be bad for various 3rd party modules. The fix is to merge adjacent
blocks of free pages in the ngx_slab_free_pages() function.
Prodded by Wandenberg Peixoto and Yichun Zhang.
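Merging adjacent free blocks can be modelled with a toy page map. This sketch assumes a fixed pool of 8 pages tracked in a plain array; blk[], NPAGES and demo_free_pages() are hypothetical, and the real ngx_slab_free_pages() works on linked page descriptors instead.

```c
#include <stddef.h>

#define NPAGES  8

/* blk[i] > 0: a free block of blk[i] pages starts at page i.
 * blk[i] == 0: page i is in use or lies inside a free block. */
static size_t  blk[NPAGES];

/* Free "n" pages starting at "start", merging with adjacent free blocks. */
void
demo_free_pages(size_t start, size_t n)
{
    size_t  i, next;

    /* merge with the free block that follows, if any */
    next = start + n;
    if (next < NPAGES && blk[next] > 0) {
        n += blk[next];
        blk[next] = 0;
    }

    /* merge with the free block that ends right before "start", if any */
    for (i = 0; i < start; i++) {
        if (blk[i] > 0 && i + blk[i] == start) {
            blk[i] += n;
            return;
        }
    }

    blk[start] = n;
}
```

Freeing pages 0 and 2 and then page 1 leaves a single free block of 3 pages at page 0, so a later 3-page allocation can succeed.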
Previous code failed to properly restore cf->conf_file in case of
ngx_close_file() errors, potentially resulting in double free of
cf->conf_file->buffer->start.
Found by Coverity (CID 1087507).
The flag allows suppressing "ngx_slab_alloc() failed: no memory" messages
from a slab allocator, e.g., if an LRU expiration is used by a consumer
and allocation failures aren't fatal.
The flag is now used in the SSL session cache code, and in the limit_req
module.
Client address specified in the PROXY protocol header is now
saved in the $proxy_protocol_addr variable and can be used in
the realip module.
This is currently not implemented for mail.
Linux returns EOPNOTSUPP for non-TCP sockets and ENOPROTOOPT for TCP
sockets, because getsockopt(TCP_FASTOPEN) is not implemented so far.
While there, lower the log level from ALERT to NOTICE to match other
getsockopt() failures.
Signed-off-by: Piotr Sikora <piotr@cloudflare.com>
Backed out 05a56ebb084a, as it turns out that the kernel can return connections
without any delay if syncookies are used. This basically means we can't
assume anything about connections returned with deferred accept set.
To solve the original problem that 05a56ebb084a tried to solve, i.e. to avoid
waiting longer than needed if a connection was accepted after the deferred accept
timeout, this patch changes the timeout set with setsockopt(TCP_DEFER_ACCEPT)
to 1 second, unconditionally. This is believed to be enough for speed
improvements, and doesn't imply major changes to timeouts used.
Note that before 2.6.32 connections were dropped after a timeout. Though
it is believed that 1s is still appropriate for kernels before 2.6.32,
as previously tcp_synack_retries controlled the actual timeout and 1s results
in more than 1 minute actual timeout by default.
Previously pool->current wasn't moved back to pool, resulting in blocks
not used for further allocations if pool->current was already moved at the
time of ngx_reset_pool(). Additionally, to preserve logic of moving
pool->current, the p->d.failed counters are now properly cleared. While
here, pool->chain is also cleared.
This change is essentially a nop with current code, but generally improves
things.
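A reduced model of the reset logic might look like the following. The demo_pool_t structure and demo_reset_pool() are hypothetical simplifications of the real pool (large allocations and pool->chain are left out).

```c
#include <stddef.h>

typedef struct demo_pool_s  demo_pool_t;

struct demo_pool_s {
    char         *start;     /* first usable byte of this block */
    char         *last;      /* next free byte in this block */
    unsigned      failed;    /* failed allocation attempts on this block */
    demo_pool_t  *next;      /* next block in the chain */
    demo_pool_t  *current;   /* block where allocation starts (head only) */
};

/* Make every block of the pool fully available again. */
void
demo_reset_pool(demo_pool_t *pool)
{
    demo_pool_t  *p;

    for (p = pool; p != NULL; p = p->next) {
        p->last = p->start;
        p->failed = 0;
    }

    /* start allocating from the first block again */
    pool->current = pool;
}
```

If "current" were left pointing at a later block, the blocks before it would stay unused after the reset even though they are empty.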
Fallback to synchronous sendfile() is now only done on the 3rd EBUSY in a row
without any progress. Not falling back is believed to be better
in case of occasional EBUSY, though protection is still needed to
make sure there will be no infinite loop.
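The counting rule can be sketched as a small state machine. The names demo_aio_state_t and demo_should_fallback() are hypothetical, and the sketch assumes "progress" means that at least one byte was sent.

```c
/* Hypothetical per-operation state. */
typedef struct {
    unsigned  busy_count;    /* consecutive EBUSY results without progress */
} demo_aio_state_t;

/* Record the outcome of one sendfile attempt.  Returns 1 if the caller
 * should fall back to synchronous sendfile, 0 to retry asynchronously. */
int
demo_should_fallback(demo_aio_state_t *st, int got_ebusy, long sent)
{
    if (!got_ebusy || sent > 0) {
        st->busy_count = 0;    /* progress was made: start counting again */
        return 0;
    }

    if (++st->busy_count >= 3) {
        st->busy_count = 0;
        return 1;
    }

    return 0;
}
```

An occasional EBUSY therefore costs only a retry, while three in a row without progress still end the loop.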
Tightened response header checks: ensure that reserved bits are zeroes,
and that the opcode is "standard query".
Fixed the "zero-length domain name in DNS response" condition.
Renamed ngx_resolver_query_t to ngx_resolver_hdr_t as it describes
the header that is common to DNS queries and answers.
Replaced the magic number 12 by the size of the header structure.
The other changes are self-explanatory.
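The header checks can be illustrated with a standalone sketch. It assumes the RFC 1035 header layout (QR bit, 4-bit opcode, 3 reserved Z bits); demo_dns_hdr_t and demo_dns_hdr_ok() are hypothetical names, and the field names are illustrative rather than those of ngx_resolver_hdr_t.

```c
#include <stddef.h>

/* 12-byte DNS message header, one byte per field half so that the
 * layout has no padding. */
typedef struct {
    unsigned char  ident_hi, ident_lo;
    unsigned char  flags_hi, flags_lo;
    unsigned char  nqs_hi, nqs_lo;
    unsigned char  nan_hi, nan_lo;
    unsigned char  nns_hi, nns_lo;
    unsigned char  nar_hi, nar_lo;
} demo_dns_hdr_t;

/* Returns 1 if buf holds a response to a standard query with the
 * reserved bits cleared, 0 otherwise. */
int
demo_dns_hdr_ok(const unsigned char *buf, size_t n)
{
    unsigned               flags;
    const demo_dns_hdr_t  *h;

    /* use the structure size instead of a magic 12 */
    if (n < sizeof(demo_dns_hdr_t)) {
        return 0;
    }

    h = (const demo_dns_hdr_t *) buf;
    flags = ((unsigned) h->flags_hi << 8) | h->flags_lo;

    /* QR = 1 (response), opcode = 0 (standard query), Z bits = 0 */
    return (flags & 0xf870) == 0x8000;
}
```

A typical answer with flags 0x8180 passes, while a query (QR = 0) or a message with any reserved bit set is rejected.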
Recent Linux versions started to return EOPNOTSUPP to getsockopt() calls
on unix sockets, resulting in log pollution on binary upgrade. Such errors
are silently ignored now.
The accept_filter and deferred options were not applied to sockets
that were added to configuration during binary upgrade cycle.
Signed-off-by: Piotr Sikora <piotr@cloudflare.com>
This patch fixes incorrect handling of auto redirect in configurations
like:
location /0 { }
location /a- { }
location /a/ { proxy_pass ... }
With previously used sorting, this resulted in the following locations
tree (as "-" is less than "/"):
"/a-"
"/0" "/a/"
and a request to "/a" didn't match "/a/" with auto_redirect, as it
didn't traverse relevant tree node during lookup (it tested "/a-",
then "/0", and then fell back to null location).
To preserve locale use for non-ASCII characters on case-insensitive
systems, libc's tolower() is used.
Found by using auth_basic.t from mdounin nginx-tests under valgrind.
==10470== Invalid write of size 1
==10470== at 0x43603D: ngx_crypt_to64 (ngx_crypt.c:168)
==10470== by 0x43648E: ngx_crypt (ngx_crypt.c:153)
==10470== by 0x489D8B: ngx_http_auth_basic_crypt_handler (ngx_http_auth_basic_module.c:297)
==10470== by 0x48A24A: ngx_http_auth_basic_handler (ngx_http_auth_basic_module.c:240)
==10470== by 0x44EAB9: ngx_http_core_access_phase (ngx_http_core_module.c:1121)
==10470== by 0x44A822: ngx_http_core_run_phases (ngx_http_core_module.c:895)
==10470== by 0x44A932: ngx_http_handler (ngx_http_core_module.c:878)
==10470== by 0x455EEF: ngx_http_process_request (ngx_http_request.c:1852)
==10470== by 0x456527: ngx_http_process_request_headers (ngx_http_request.c:1283)
==10470== by 0x456A91: ngx_http_process_request_line (ngx_http_request.c:964)
==10470== by 0x457097: ngx_http_wait_request_handler (ngx_http_request.c:486)
==10470== by 0x4411EE: ngx_epoll_process_events (ngx_epoll_module.c:691)
==10470== Address 0x5866fab is 0 bytes after a block of size 27 alloc'd
==10470== at 0x4A074CD: malloc (vg_replace_malloc.c:236)
==10470== by 0x43B251: ngx_alloc (ngx_alloc.c:22)
==10470== by 0x421B0D: ngx_malloc (ngx_palloc.c:119)
==10470== by 0x421B65: ngx_pnalloc (ngx_palloc.c:147)
==10470== by 0x436368: ngx_crypt (ngx_crypt.c:140)
==10470== by 0x489D8B: ngx_http_auth_basic_crypt_handler (ngx_http_auth_basic_module.c:297)
==10470== by 0x48A24A: ngx_http_auth_basic_handler (ngx_http_auth_basic_module.c:240)
==10470== by 0x44EAB9: ngx_http_core_access_phase (ngx_http_core_module.c:1121)
==10470== by 0x44A822: ngx_http_core_run_phases (ngx_http_core_module.c:895)
==10470== by 0x44A932: ngx_http_handler (ngx_http_core_module.c:878)
==10470== by 0x455EEF: ngx_http_process_request (ngx_http_request.c:1852)
==10470== by 0x456527: ngx_http_process_request_headers (ngx_http_request.c:1283)
==10470==