Previously, listenings sockets were not cloned if the worker_processes
directive was specified after "listen ... reuseport".
This also simplifies upcoming configuration check on the number
of worker connections, as it needs to know the number of listening
sockets before cloning.
Previously, only one client packet could be processed in a udp stream session
even though multiple response packets were supported. Now multiple packets
coming from the same client address and port are delivered to the same stream
session.
If it's required to maintain a single stream of data, nginx should be
configured in a way that all packets from a client are delivered to the same
worker. On Linux and DragonFly BSD the "reuseport" parameter should be
specified for this. Other systems do not currently provide appropriate
mechanisms. For these systems a single stream of udp packets is only
guaranteed in single-worker configurations.
The proxy_response directive now specifies how many packets are expected in
response to a single client packet.
FreeBSD returns EINVAL when getsockopt(TCP_FASTOPEN) is called on a unix
domain socket, resulting in "getsockopt(TCP_FASTOPEN) ... failed" messages
during binary upgrade when unix domain listen sockets are present in
the configuration. Added EINVAL to the list of ignored error codes.
Some OSes (notably macOS, NetBSD, and Solaris) allow unix socket addresses
larger than struct sockaddr_un. Moreover, some of them (macOS, Solaris)
return socklen of the socket address before it was truncated to fit the
buffer provided. As such, on these systems socklen must not be used without
additional check that it is within the buffer provided.
Appropriate checks added to ngx_event_accept() (after accept()),
ngx_event_recvmsg() (after recvmsg()), and ngx_set_inherited_sockets()
(after getsockname()).
We also obtain socket addresses via getsockname() in
ngx_connection_local_sockaddr(), but it does not need any checks as
it is only used for INET and INET6 sockets (as there can be no
wildcard unix sockets).
When closing a socket with SO_REUSEPORT, Linux drops all connections waiting
in this socket's listen queue. Previously, it was believed to only result
in connection resets when reconfiguring nginx to use smaller number of worker
processes. It also results in connection resets during configuration
testing though.
Workaround is to avoid using SO_REUSEPORT when testing configuration. It
should prevent listening sockets from being created if a conflicting socket
already exists, while still preserving detection of other possible errors.
It should also cover UDP sockets.
The only downside of this approach seems to be that a configuration testing
won't be able to properly report the case when nginx was compiled with
SO_REUSEPORT, but the kernel is not able to set it. Such errors will be
reported on a real start instead.
Closing up to 32 connections might be too aggressive if worker_connections
is set to a comparable number (and/or there are only a small number of
reusable connections). If an occasional connection shorage happens in
such a configuration, it leads to closing all reusable connections instead
of gradually reducing keepalive timeout to a smaller value. To improve
granularity in such configurations we now close no more than 1/8 of all
reusable connections at once.
Suggested by Joel Cunningham.
On nginx reload or binary upgrade, an attempt is made to inherit listen sockets
from the previous configuration. Previously, no check for socket type was made
and the inherited socket could have the wrong type. On binary upgrade, socket
type was not detected at all. Wrong socket type could lead to errors on that
socket due to different logic and unsupported syscalls. For example, a UDP
socket, inherited as TCP, lead to the following error after arrival of a
datagram: "accept() failed (102: Operation not supported on socket)".
Iterating through all connections takes a lot of CPU time, especially
with large number of worker connections configured. As a result
nginx processes used to consume CPU time during graceful shutdown.
To mitigate this we now only do a full scan for idle connections when
shutdown signal is received.
Transitions of connections to idle ones are now expected to be
avoided if the ngx_exiting flag is set. The upstream keepalive module
was modified to follow this.
If nginx was used under OpenVZ and a container with nginx was suspended
and resumed, configuration tests started to fail because of EADDRINUSE
returned from listen() instead of bind():
# nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: [emerg] listen() to 0.0.0.0:80, backlog 511 failed (98: Address already in use)
nginx: configuration file /etc/nginx/nginx.conf test failed
With this change EADDRINUSE errors returned by listen() are handled
similarly to errors returned by bind(), and configuration tests work
fine in the same environment:
# nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
More details about OpenVZ suspend/resume bug:
https://bugzilla.openvz.org/show_bug.cgi?id=2470
When configured, an individual listen socket on a given address is
created for each worker process. This allows to reduce in-kernel lock
contention on configurations with high accept rates, resulting in better
performance. As of now it works on Linux and DragonFly BSD.
Note that on Linux incoming connection requests are currently tied up
to a specific listen socket, and if some sockets are closed, connection
requests will be reset, see https://lwn.net/Articles/542629/. With
nginx, this may happen if the number of worker processes is reduced.
There is no such problem on DragonFly BSD.
Based on previous work by Sepherosa Ziehau and Yingqi Lu.
Previously, this function checked for connection local address existence
and returned error if it was missing. Now a new address is assigned in this
case making it possible to call this function not only for accepted connections.
In theory, this can provide a bit better distribution of latencies.
Also it simplifies the code, since ngx_queue_t is now used instead
of custom implementation.
This isn't really important as configuration testing shortly ends with
a process termination which will free all sockets, though Coverity
complains.
Prodded by Coverity (CID 400872).
Linux returns EOPNOTSUPP for non-TCP sockets and ENOPROTOOPT for TCP
sockets, because getsockopt(TCP_FASTOPEN) is not implemented so far.
While there, lower the log level from ALERT to NOTICE to match other
getsockopt() failures.
Signed-off-by: Piotr Sikora <piotr@cloudflare.com>
Backed out 05a56ebb084a, as it turns out that kernel can return connections
without any delay if syncookies are used. This basically means we can't
assume anything about connections returned with deferred accept set.
To solve original problem the 05a56ebb084a tried to solve, i.e. to don't
wait longer than needed if a connection was accepted after deferred accept
timeout, this patch changes a timeout set with setsockopt(TCP_DEFER_ACCEPT)
to 1 second, unconditionally. This is believed to be enough for speed
improvements, and doesn't imply major changes to timeouts used.
Note that before 2.6.32 connections were dropped after a timeout. Though
it is believed that 1s is still appropriate for kernels before 2.6.32,
as previously tcp_synack_retries controlled the actual timeout and 1s results
in more than 1 minute actual timeout by default.
Recent Linux versions started to return EOPNOTSUPP to getsockopt() calls
on unix sockets, resulting in log pollution on binary upgrade. Such errors
are silently ignored now.
Several warnings silenced, notably (ngx_socket_t) -1 is now checked
on socket operations instead of -1, as ngx_socket_t is unsigned on win32
and gcc complains on comparison.
With this patch, it's now possible to compile nginx using mingw gcc,
with options we normally compile on win32.
The call to ngx_sock_ntop() in ngx_connection_local_sockaddr() might be
performed with the uninitialized "len" variable. The fix is to initialize
variable to the size of corresponding socket address type.
The issue was introduced in commit 05ba5bce31e0.
On Linux, sockaddr length is required to process unix socket addresses properly
due to unnamed sockets (which don't have sun_path set at all) and abstract
namespace sockets.
On Win32 platforms 0 is used to indicate errors in file operations, so
comparing against -1 is not portable.
This was not much of an issue in patched code, since only ngx_fd_info() test
is actually reachable on Win32 and in worst case it might result in bogus
error log entry.
Patch by Piotr Sikora.
And corresponding variable $connections_waiting was added.
Previously, waiting connections were counted as the difference between
active connections and the sum of reading and writing connections.
That made it impossible to count more than one request in one connection
as reading or writing (as is the case for SPDY).
Also, we no longer count connections in handshake state as waiting.
The c->single_connection was intended to be used as lock mechanism
to serialize modifications of request object from several threads
working with client and upstream connections. The flag is redundant
since threads in nginx have never been used that way.
There is a general consensus that this change results in better
consistency between different operating systems and differently
tuned operating systems.
Note: this changes the width and meaning of the ipv6only field
of the ngx_listening_t structure. 3rd party modules that create
their own listening sockets might need fixing.