The "listen" directive in the http module can be used multiple times
in different server blocks. Originally, it was supposed to be specified
once with various socket options, and without any parameters in virtual
server blocks. For example:
server { listen 80 backlog=1024; server_name foo; ... }
server { listen 80; server_name bar; ... }
server { listen 80; server_name bazz; ... }
The address part of the syntax ("address[:port]" / "port" / "unix:path")
uniquely identifies the listening socket, and therefore is enough for
name-based virtual servers (to let nginx know that the virtual server
accepts requests on the listening socket in question).
To ensure that listening options do not conflict between virtual servers,
they were allowed only once. For example, the following configuration
will be rejected ("duplicate listen options for 0.0.0.0:80 in ..."):
server { listen 80 backlog=1024; server_name foo; ... }
server { listen 80 backlog=512; server_name bar; ... }
At some point it was, however, noticed that it is sometimes convenient
to repeat some options for clarity. In nginx 0.8.51 the "ssl" parameter
was allowed to be specified multiple times, e.g.:
server { listen 443 ssl backlog=1024; server_name foo; ... }
server { listen 443 ssl; server_name bar; ... }
server { listen 443 ssl; server_name bazz; ... }
This approach makes configuration more readable, since SSL sockets are
immediately visible in the configuration. If this is not needed, just the
address can still be used.
Later, additional protocol-specific options similar to "ssl" were
introduced, notably "http2" and "proxy_protocol". With these options,
one can write:
server { listen 443 ssl backlog=1024; server_name foo; ... }
server { listen 443 http2; server_name bar; ... }
server { listen 443 proxy_protocol; server_name bazz; ... }
The resulting socket will use ssl, http2, and proxy_protocol, but this
is not really obvious from the configuration.
To emphasize that such misleading configurations are discouraged, nginx now
warns if the "listen" directive is used with options different from the
options previously used, when this is potentially confusing.
In particular, the following configurations are allowed:
server { listen 8401 ssl backlog=1024; server_name foo; }
server { listen 8401 ssl; server_name bar; }
server { listen 8401 ssl; server_name bazz; }
server { listen 8402 ssl http2 backlog=1024; server_name foo; }
server { listen 8402 ssl; server_name bar; }
server { listen 8402 ssl; server_name bazz; }
server { listen 8403 ssl; server_name bar; }
server { listen 8403 ssl; server_name bazz; }
server { listen 8403 ssl http2; server_name foo; }
server { listen 8404 ssl http2 backlog=1024; server_name foo; }
server { listen 8404 http2; server_name bar; }
server { listen 8404 http2; server_name bazz; }
server { listen 8405 ssl http2 backlog=1024; server_name foo; }
server { listen 8405 ssl http2; server_name bar; }
server { listen 8405 ssl http2; server_name bazz; }
server { listen 8406 ssl; server_name foo; }
server { listen 8406; server_name bar; }
server { listen 8406; server_name bazz; }
And the following configurations will generate warnings:
server { listen 8501 ssl http2 backlog=1024; server_name foo; }
server { listen 8501 http2; server_name bar; }
server { listen 8501 ssl; server_name bazz; }
server { listen 8502 backlog=1024; server_name foo; }
server { listen 8502 ssl; server_name bar; }
server { listen 8503 ssl; server_name foo; }
server { listen 8503 http2; server_name bar; }
server { listen 8504 ssl; server_name foo; }
server { listen 8504 http2; server_name bar; }
server { listen 8504 proxy_protocol; server_name bazz; }
server { listen 8505 ssl http2 proxy_protocol; server_name foo; }
server { listen 8505 ssl http2; server_name bar; }
server { listen 8505 ssl; server_name bazz; }
server { listen 8506 ssl http2; server_name foo; }
server { listen 8506 ssl; server_name bar; }
server { listen 8506; server_name bazz; }
server { listen 8507 ssl; server_name bar; }
server { listen 8507; server_name bazz; }
server { listen 8507 ssl http2; server_name foo; }
server { listen 8508 ssl; server_name bar; }
server { listen 8508; server_name bazz; }
server { listen 8508 ssl backlog=1024; server_name foo; }
server { listen 8509; server_name bazz; }
server { listen 8509 ssl; server_name bar; }
server { listen 8509 ssl backlog=1024; server_name foo; }
The basic idea is that at most two sets of protocol options are allowed:
the main one (with socket options, if any), and a shorter one, with options
being a subset of the main options, repeated for clarity. As long as the
shorter set of protocol options is used, all listen directives except the
main one should use it.
Now "listen" directve has a new "quic" parameter which enables QUIC protocol
for the address. Further, to enable HTTP/3, a new directive "http3" is
introduced. The hq-interop protocol is enabled by "http3_hq" as before.
Now application protocol is chosen by ALPN.
Previously used "http3" parameter of "listen" is deprecated.
A QUIC handshake failure breaks down into several cases:
- a handshake error which leads to a send_alert call
- an error triggered by the add_handshake_data callback
- internal errors (allocation etc)
Previously, in the first case, only the error code was set in the send_alert
callback. Now the "handshake failed" reason phrase is set there as well.
In the second case, both code and reason are set by add_handshake_data.
In the last case, setting the reason phrase is removed: returning NGX_ERROR
now leads to closing the connection with just INTERNAL_ERROR.
Reported by Jiuzhou Cui.
Previously, since 3550b00d9dc8, the token was allocated on the stack, to get
rid of pool usage. Now the token is allocated by ngx_quic_copy_buffer()
in QUIC buffers, also used for STREAM, CRYPTO and ACK frames.
Previously, location prefix length in ngx_http_location_tree_node_t was
stored as "u_char", and therefore location prefixes longer than 255 bytes
were handled incorrectly.
Fix is to use "u_short" instead. With "u_short", prefixes up to 65535 bytes
can be safely used, and this isn't reachable due to NGX_CONF_BUFFER, which
is 4096 bytes.
In contrast to on-the-fly gzipping with gzip filter, static gzipped
representation as returned by gzip_static is persistent, and therefore
the same binary representation is available for future requests, making
it possible to use range requests.
Further, if a gzipped representation is re-generated with different
compression settings, it is expected to result in different ETag and
different size reported in the Content-Range header, making it possible
to safely use range requests anyway.
As such, ranges are now allowed for files returned by gzip_static.
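For example, with a configuration like the following (location is
illustrative), a pre-compressed file such as /download/file.gz can now be
served with byte ranges in response to requests for /download/file:
location /download/ {
    gzip_static on;
}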
Specifically, now it is kept unset until streams are initialized.
Notably, this unbreaks OCSP with client certificates after 35e27117b593.
Previously, the read event could be posted prematurely via ngx_quic_set_event(),
e.g. as part of handling a STREAM frame.
Previously, streams were initialized in the early keys handler. However, client
transport parameters may not be available by then. This happens, for example,
when using QuicTLS. Now streams are initialized in ngx_quic_crypto_input()
after calling SSL_do_handshake() for both 0-RTT and 1-RTT.
Previously, there was no timeout for a request stream blocked on insert count,
which could result in an infinite wait. Now client_header_timeout is set when
the stream is first blocked.
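The timeout used is the regular one (the value shown is the default):
client_header_timeout 60s;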
Now, when RESET_STREAM is sent or received, or when streams are closed, the
stream connection error flag is set. Previously, only the stream state was
changed, which resulted in setting the error flag only after calling
recv()/send()/send_chain(). However, there are cases when none of these
functions is called, but it's still important to know if the stream is being
closed. For example, when an HTTP/3 request stream is blocked on insert count,
receiving RESET_STREAM should trigger stream closure, which was not the case.
The change also fixes ngx_http_upstream_check_broken_connection() and
ngx_http_test_reading() with QUIC streams.
Previously, stream events were added and deleted by ngx_handle_read_event() and
ngx_handle_write_event() in a way similar to level-triggered events. However,
QUIC stream events are effectively edge-triggered and can stay active all the
time.
Moreover, the events are now active since the moment a stream is created.
Previously, start_time wasn't set for a new stream.
The fix is to derive it from the parent connection.
It is also used to simplify tracking keepalive_time.
As per RFC 9204, section 3.2.2, a new entry can reference an entry in the
dynamic table that will be evicted when adding this new entry into the dynamic
table.
Previously, such inserts resulted in use-after-free since the old entry was
evicted before the insertion (ticket #2431). Now it's evicted after the
insertion.
This change fixes Insert with Name Reference and Duplicate encoder instructions.
Port differences must be respected when checking addresses for duplicates,
otherwise configurations like this are broken:
listen 127.0.0.1:6000-6005
It was broken by 4cc2bfeff46c (nginx 1.23.3).
Fixed event flags handling edge cases in ngx_wsarecv() and ngx_wsarecv_chain(),
notably to always reset rev->ready in case of errors (which wasn't the case
after ngx_socket_nread() errors), and after EOF (rev->ready was not cleared
if due to a misconfiguration a zero-sized buffer was used for reading).
With this change, behaviour of ngx_ssl_recv() now matches ngx_unix_recv(),
which used to always reset c->read->ready to 0 when returning errors.
This fixes an infinite loop in unbuffered SSL proxying if writing to the
client is blocked and an SSL error happens (ticket #2418).
With this change, the fix for a similar issue in the stream module
(6868:ee3645078759), which used a different approach of explicitly
testing c->read->error instead, is no longer needed and was reverted.
Casts are believed to be unneeded, since memcmp() has had "const void *"
arguments since the introduction of the "void" type in C89. And on pre-C89
platforms nginx is unlikely to compile without warnings anyway, as there
are no casts in memcpy() and memmove() calls.
These casts were added in 1648:89a47f19b9ec without any details on why they
were added, and Igor does not remember details either. The most plausible
explanation is that they were copied from ngx_strcmp() and were not really
needed even at that time.
Prodded by Alejandro Colomar.
Binary upgrades are not supported without the master process, but it is,
however, possible that nginx running with the master process is asked
to upgrade the binary, and the configuration file as available on disk
at this time includes "master_process off;".
If this happens, listening sockets inherited from the previous binary
will have ls[i].previous set. But the old cycle on initial process
startup, including startup after binary upgrade, is destroyed by
ngx_init_cycle() once configuration parsing is complete. As a result,
an attempt to dereference ls[i].previous in ngx_event_process_init()
accesses already freed memory.
Fix is to avoid looking into ls[i].previous if the old cycle is already
freed.
With this change it is also no longer needed to clear ls[i].previous in
worker processes, so the relevant code was removed.
Cloning of listening sockets for each worker process does not make sense
when working without the master process, and causes some of the connections
not to be accepted if worker_processes is set to more than one and there
are listening sockets configured with the reuseport flag. Fix is to
disable cloning when the master process is disabled.
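A sketch of a configuration that was previously affected (address is
illustrative):
master_process off;
worker_processes 2;

http {
    server { listen 80 reuseport; ... }
}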
Due to the glibc bug[1], getaddrinfo("localhost") with AI_ADDRCONFIG
on a typical host with glibc and without IPv6 returns two 127.0.0.1
addresses, and therefore "listen localhost:80;" used to result in
"duplicate ... address and port pair" after 4f9b72a229c1.
Fix is to explicitly filter out duplicate addresses returned during
resolution of a name.
[1] https://sourceware.org/bugzilla/show_bug.cgi?id=14969
Previously, if an event was posted by a read event handler, called by
ngx_close_idle_connections(), that event was not processed until the next
event loop iteration, which could happen after a timeout.
As the SSI parser always uses the context from the main request for storing
variables and blocks, that context should always exist for subrequests using
SSI, even though the main request does not necessarily have SSI enabled.
However, ngx_http_get_module_ctx(r->main, ...) returns NULL in such cases,
resulting in the worker crashing with SIGSEGV when accessing its attributes.
This patch links the first initialized context to the main request, and
upgrades it only when the main context is initialized.
The check is not expected to fail unless there is a bug in the calling
code. But given the check is here, it should log an alert if it fails
instead of silently closing the connection.
Maximum size for reading the PROXY protocol header is increased to 4096 to
accommodate a bigger number of TLVs, which are supported since cca4c8a715de.
Maximum size for writing the PROXY protocol header is not changed since only
version 1 is currently supported.
Previously, the keepalive timer was deleted in ngx_http_v3_wait_request_handler()
and set in the request cleanup handler. This worked for HTTP/3 connections, but
not for hq connections. Now the keepalive timer is deleted in
ngx_http_v3_init_request_stream() and set in the connection cleanup handler,
which works both for HTTP/3 and hq.
It's called after handshake completion or prior to the first early data stream
creation. The callback should initialize application-level data before
creating streams.
HTTP/3 callback implementation sets keepalive timer and sends SETTINGS.
Also, this makes it possible to limit the maximum handshake time in
ngx_http_v3_init_stream().
Most atoms should not appear more than once in a container. Previously,
this was not enforced by the module, which could result in a worker process
crash, memory corruption, and disclosure.
Now it properly detects invalid shared zone configuration with omitted size.
Previously it used to read outside of the buffer boundary.
Found with AddressSanitizer.
OpenSSL with TLSv1.3 updates the session creation time on session
resumption and keeps the session timeout unmodified, making it possible
to maintain the session forever, bypassing client certificate expiration
and revocation. To make sure session timeouts are actually used, we
now update the session creation time and reduce the session timeout
accordingly.
BoringSSL with TLSv1.3 ignores configured session timeouts and uses a
hardcoded timeout of 7 days instead. So we update the session timeout to
the configured value as soon as a session is created.
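With both fixes, a configured timeout, e.g. (the value is illustrative):
ssl_session_timeout 4h;
is now effectively enforced with TLSv1.3.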
Instead of syncing keys with shared memory on each ticket operation,
the code now does this only when the worker is going to change expiration
of the current key, or going to switch to a new key: that is, usually
at most once per second.
To do so without races, the code maintains 3 keys: current, previous,
and next. If a worker switches to the next key earlier, other workers
will still be able to decrypt new tickets, since they will be encrypted
with the next key.
As long as ssl_session_cache in shared memory is configured, session ticket
keys are now automatically generated in shared memory, and rotated
periodically. This can be beneficial from forward secrecy point of view,
and also avoids increased CPU usage after configuration reloads.
This also helps BoringSSL to properly resume sessions in configurations
with multiple worker processes and no ssl_session_ticket_key directives,
as BoringSSL tries to automatically rotate session ticket keys and does
this independently in different worker processes, thus breaking session
resumption between worker processes.
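For example (zone name and size are illustrative):
ssl_session_cache shared:SSL:1m;
With such a configuration, ticket keys are now generated and rotated in the
shared memory zone automatically, with no ssl_session_ticket_key directives.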
Given the present typical SSL session sizes, on 32-bit platforms it is
now beneficial to store all data in a single allocation, since rbtree
node + session id + ASN.1 representation of a session takes 256 bytes of
shared memory (36 + 32 + 150 = about 218 bytes plus SNI server name).
Storing all data in a single allocation is beneficial for SNI names up to
about 40 characters long and makes it possible to store about 4000 sessions
in one megabyte (instead of about 3000 sessions now). This also slightly
simplifies the code.
Session ids are not expected to be longer than 32 bytes, but this is
theoretically possible with TLSv1.3, where session ids are essentially
arbitrary and sent as session tickets. Since on 64-bit platforms we
use a fixed 32-byte buffer for session ids, an explicit length check was
added to make sure the buffer is large enough.
Session cache allocations might fail as long as the new session is different
in size from the one least recently used (and freed when the first allocation
fails). In particular, it might not be possible to allocate space for
sessions with client certificates, since they are noticeably bigger than
normal sessions.
To ensure such allocation failures won't clutter logs, the logging level was
changed to "warn", and logging is now limited to at most one warning per second.
OpenSSL tries to save TLSv1.3 sessions into session cache even when using
tickets for stateless session resumption, "because some applications just
want to know about the creation of a session". To avoid trashing session
cache with useless data, we do not save such sessions now.
The variables have prefix $proxy_protocol_tlv_ and are accessible by name
and by type. Examples are: $proxy_protocol_tlv_0x01, $proxy_protocol_tlv_alpn.
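For example, assuming a listener with the proxy_protocol parameter, the TLVs
can be logged as follows (the format name is illustrative):
log_format pp '$remote_addr alpn=$proxy_protocol_tlv_alpn '
              'tlv0x01=$proxy_protocol_tlv_0x01';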
Previously, all received user input was logged. If multi-line text was
received from a client and logged, it could reduce log readability and also
make it harder for scripts to parse the nginx log. The change brings to the
PROXY protocol the same behaviour that exists for the HTTP request line in
ngx_http_log_error_handler().
SSL_sendfile() expects an integer file descriptor as an argument, but nginx
uses OS file handles (HANDLE) to work with files on Windows, and passing a
HANDLE instead of an integer correctly results in a build failure. Since
SSL_sendfile() is not expected to work on Windows anyway, the code is now
disabled on Windows with appropriate compile-time checks.
Multiple C4306 warnings (conversion from 'type1' to 'type2' of greater size)
appear during 64-bit compilation with MSVC 2010 (and older) due to extensively
used constructs like "(void *) -1", so they were disabled.
In newer MSVC versions C4306 warnings were replaced with C4312 ones, and
these are not generated for such trivial type casts.
In 2014ed60f17f, "#if SSL_CTRL_SET_ECDH_AUTO" test was incorrectly used
instead of "#ifdef SSL_CTRL_SET_ECDH_AUTO". There is no practical
difference, since SSL_CTRL_SET_ECDH_AUTO evaluates to a non-zero numeric
value when defined, but anyway it's better to correctly test if the value
is defined.
All these events are created in context of a client connection and are deleted
when the connection is closed. Setting ev->cancelable could trigger premature
connection closure and a socket leak alert.
Now the main QUIC connection for HTTP/3 always has the c->idle flag set. This
allows the connection to receive worker shutdown notifications. They are passed
to the application level via a new conf->shutdown() callback.
The HTTP/3 shutdown callback sends GOAWAY to client and gracefully shuts down
the QUIC connection.
Previously, a stream was kept alive until all its data was sent. This
resulted in disabled retransmission of the final part of a stream when the
QUIC connection was closed right after closing the stream connection.
The connection is automatically switched to this mode by the transport layer
when there are no non-cancelable streams. Currently, cancelable streams are
the HTTP/3 encoder/decoder/control streams.
Previously, ngx_quic_finalize_connection() closed the connection with NGX_ERROR
code, which resulted in immediate connection closure. Now the code is NGX_OK,
which provides a more graceful shutdown with a timeout.
Previously, zero was used for this purpose. However, NGX_QUIC_ERR_NO_ERROR is
zero too. As a result, NGX_QUIC_ERR_NO_ERROR was changed to
NGX_QUIC_ERR_INTERNAL_ERROR when closing a QUIC connection.
If a client packet carrying a stream data frame is not acked due to packet
loss, the stream data is retransmitted later by the client. It's also possible
that the
retransmitted range is bigger than before due to more stream data being
available by then. If the original data was read out by the application,
there would be no read event triggered by the retransmitted frame, even though
it contains new data.
They are not supported by MSVC till 2012.
SSL_QUIC_METHOD initialization is moved to run-time to preserve portability
among SSL library implementations, which also allows reducing its visibility.
Note the use of static storage to keep the SSL_set_quic_method() reference
valid.
Previously, ngx_quic_hkdf_t variables used declaration with assignment
in the middle of a function, which is not supported by MSVC 2010.
Fixing this also required rewriting the ngx_quic_hkdf_set macro
and switching to an explicit array size.
Previously, the HTTP/3 stream connection didn't inherit the servername regex
from the main QUIC connection, saved when processing SNI with regular
expressions in server names. As a result, the regex was not executed to set
captures when choosing the virtual server while parsing HTTP/3 headers.
The type field was added in 7999d3fbb765 at early stages of the QUIC
implementation and was not initialized for the default listen. The missing
initialization resulted in an error creating the default listen socket.
SSL_CIPHER_get_protocol_id() appeared in BoringSSL somewhere between
BORINGSSL_API_VERSION 12 and 13 for compatibility with OpenSSL 1.1.1.
It was adopted without a proper macro test, which remained unnoticed.
This suggests that such an old BoringSSL API isn't widely used, and its
support can be dropped.
While here, removed SSL_set_quic_use_legacy_codepoint() that became
useless after the default was flipped in BoringSSL over a year ago.
Setting QUIC methods is converted to use C99 designated initializers
for simplicity, as LibreSSL 3.6.0 has a different SSL_QUIC_METHOD layout.
Additionally, only the set_read_secret/set_write_secret callbacks are set.
Although they are preferred in LibreSSL over set_encryption_secrets, it is
better to be on the safe side, as LibreSSL has an unexpectedly incompatible
set_encryption_secrets calling convention: read and write secrets are passed
in separate calls, unlike what is documented in old BoringSSL sources.
To avoid introducing further changes for the old API, it is simply disabled.
This function is present in QuicTLS only. After SSL_READ_EARLY_DATA_SUCCESS
became visible in LibreSSL together with the experimental QUIC API, this
required revising the conditional compilation test to use narrower macros.
After BoringSSL aligned[1] with OpenSSL on TLS1_3_CK_* macros, and
LibreSSL uses OpenSSL naming, our own variants can be dropped now.
Compatibility is preserved with libraries that lack these macros.
Additionally, transition to SSL_CIPHER_get_id() fixes build error
with LibreSSL that doesn't implement SSL_CIPHER_get_protocol_id().
[1] https://boringssl.googlesource.com/boringssl/+/dfddbc4ded
The SSL_R_BAD_RECORD_TYPE ("bad record type") errors are reported by
OpenSSL 1.1.1 or newer when using TLSv1.3 if the client sends a record
with unknown or unexpected type. These errors are now logged at the
"info" level.
When a client DATA frame header and its content come in different QUIC packets,
it may happen that only the header is processed by the first
ngx_http_v3_request_body_filter() call. In this case an empty request body
buffer is added to r->request_body->bufs, which is later reused in a
subsequent ngx_http_v3_request_body_filter() call without being removed from
the body chain. As a result, rb->request_body->bufs ends up with two copies of
the same buffer.
The fix is to avoid adding empty request body buffers to r->request_body->bufs.
When reading exactly rev->available bytes, rev->available might become 0
after the FIONREAD usage introduction in efd71d49bde0. On the next call of
ngx_readv_chain() on systems with EPOLLRDHUP this resulted in returning
without any action, that is, with rev->ready still set, which in turn
resulted in no timers being set in the event pipe, leading to socket leaks.
Fix is to reset rev->ready in ngx_readv_chain() when returning due to
rev->available being 0 with EPOLLRDHUP, much like it is already done in
ngx_unix_recv(). This ensures that if rev->available becomes 0, on
systems with EPOLLRDHUP support appropriate EPOLLRDHUP-specific handling
will happen on the next ngx_readv_chain() call.
While here, also synced ngx_readv_chain() to match ngx_unix_recv() and
reset rev->ready when returning due to rev->available being 0 with kqueue.
This is a mostly cosmetic change, as rev->ready is anyway reset when
rev->available is set to 0.
Some servers might emit a Content-Range header on 200 responses, and this
does not seem to contradict RFC 9110: as per RFC 9110, the Content-Range
header has no meaning for status codes other than 206 and 416. Previously
this resulted in duplicate Content-Range headers in nginx responses handled
by the range filter. Fix is to clear pre-existing headers.
Starting with OpenSSL 1.1.1, various additional errors can be reported
by OpenSSL in case of client-related issues, most notably during TLSv1.3
handshakes. In particular, SSL_R_BAD_KEY_SHARE ("bad key share"),
SSL_R_BAD_EXTENSION ("bad extension"), SSL_R_BAD_CIPHER ("bad cipher"),
SSL_R_BAD_ECPOINT ("bad ecpoint"). These are now logged at the "info"
level.
To ensure optimal use of memory, SSL contexts for proxying are now
inherited from previous levels as long as relevant proxy_ssl_* directives
are not redefined.
Further, when no proxy_ssl_* directives are redefined in a server block,
plcf->upstream.ssl is now preserved in the "http" section configuration
to inherit it to all servers.
Similar changes were made in the uwsgi, grpc, and stream proxy modules.
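A sketch of a configuration that now shares one SSL context across servers
(names and paths are illustrative):
http {
    proxy_ssl_verify on;
    proxy_ssl_trusted_certificate trusted.pem;

    server {
        # no proxy_ssl_* directives are redefined here, so the SSL
        # context is inherited from the http level
        location / { proxy_pass https://backend.example.com; }
    }
}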
Similar to 70e65bf8dfd7, the change is made to ensure that the ability to
cancel resolver tasks is fully controlled by the caller. As mentioned in the
referenced commit, it is safe to make this timer cancelable because resolve
tasks can have their own timeouts that are not cancelable.
The scenario where this may become a problem is a periodic background resolve
task (not tied to a specific request or a client connection), which receives
a response with a short TTL that is large enough to warrant fallback to a TCP
query.
With each event loop wakeup, we either have a previously set write timer
instance or schedule a new one. The non-cancelable write timer can delay or
block graceful shutdown of a worker even if the ngx_resolver_ctx_t->cancelable
flag is set by the API user, and there are no other tasks or connections.
We use the resolver API in this way to maintain the list of upstream server
addresses specified with the 'resolve' parameter, and there could be third-party
modules implementing similar logic.
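A sketch of such a background resolve configuration (names, addresses, and
sizes are illustrative; the 'resolve' parameter requires a shared memory
zone):
resolver 127.0.0.53 valid=30s;

upstream backend {
    zone backend 64k;
    server backend.example.com:8080 resolve;
}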
When we generate the last_buf buffer for a UDP upstream recv error, it does
not contain any data from the wire. ngx_stream_write_filter attempts to
forward it anyway, which is incorrect (e.g., a UDP upstream ECONNREFUSED will
be translated to an empty packet).
This happens because we mark the buffer as both 'flush' and 'last_buf', and
ngx_stream_write_filter has special handling for flush with certain types of
connections (see d127837c714f, 32b0ba4855a6). The flags are meant to be
mutually exclusive, so the fix is to ensure that flush and last_buf are not set
at the same time.
Reproduction:
stream {
    upstream unreachable {
        server 127.0.0.1:8880;
    }
    server {
        listen 127.0.0.1:8998 udp;
        proxy_pass unreachable;
    }
}
1 0.000000000 127.0.0.1 → 127.0.0.1 UDP 47 45588 → 8998 Len=5
2 0.000166300 127.0.0.1 → 127.0.0.1 UDP 47 51149 → 8880 Len=5
3 0.000172600 127.0.0.1 → 127.0.0.1 ICMP 75 Destination unreachable (Port unreachable)
4 0.000202400 127.0.0.1 → 127.0.0.1 UDP 42 8998 → 45588 Len=0
Fixes d127837c714f.
Both "count" and "duration" variables are 32-bit, so their product might
potentially overflow. It is used to reduce the 64-bit start_time variable,
and with a very large start_time this can result in incorrect seeking.
Found by Coverity (CID 1499904).
Now, if the directive is given an empty string, such configuration cancels
loading of certificates, in particular, if they would be otherwise inherited
from the previous level. This restores previous behaviour, before variables
support in certificates was introduced (3ab8e1e2f0f7).
Previously, if caching was disabled due to Expires in the past, nginx
failed to cache the response even if it was cacheable as per subsequently
parsed Cache-Control header (ticket #964).
Similarly, if caching was disabled due to Expires in the past,
"Cache-Control: no-cache" or "Cache-Control: max-age=0", caching was not
used even if the response was cacheable as per a subsequently parsed
X-Accel-Expires header.
Fix is to avoid disabling caching immediately after parsing Expires in
the past or Cache-Control, but rather set flags which are later checked by
ngx_http_upstream_process_headers() (and cleared by "Cache-Control: max-age"
and X-Accel-Expires).
Additionally, now X-Accel-Expires does not prevent parsing of cache control
extensions, notably stale-while-revalidate and stale-if-error. This
ensures that order of the X-Accel-Expires and Cache-Control headers is not
important.
Prodded by Vadim Fedorenko and Yugo Horie.
If a module adds multiple WWW-Authenticate headers (ticket #485) to the
response, linked in r->headers_out.www_authenticate, all headers are now
cleared if another module later allows access.
This change is a nop for standard modules, since the only access module which
can add multiple WWW-Authenticate headers is the auth request module, and
it is checked after other standard access modules. Though this might
affect some third party access modules.
Note that if a 3rd party module adds a single WWW-Authenticate header
and is not yet modified to set the header's next pointer to NULL, an attempt
to clear such a header with this change will result in a segmentation fault.
When using auth_request with an upstream server which returns 401
(Unauthorized), multiple WWW-Authenticate headers from the upstream server
response are now properly copied to the response.
When using proxy_intercept_errors and an error page for error 401
(Unauthorized), multiple WWW-Authenticate headers from the upstream server
response are now properly copied to the response.
Most of the known duplicate upstream response headers are now ignored
with a warning.
If syntax permits multiple headers, these are now properly linked to
the lists, notably Vary and WWW-Authenticate. This makes it possible
to further handle such lists where it makes sense.
With this change, duplicate Content-Length and Transfer-Encoding headers
are now rejected. Further, responses with invalid Content-Length or
Transfer-Encoding headers are now rejected, as well as responses with both
Content-Length and Transfer-Encoding.
The h->next pointer is now properly provided as NULL in all cases where known
output headers are added.
Note that there are 3rd party modules which might not do this, and it
might be risky to rely on this for arbitrary headers.
Since introduction of offset handling in ngx_http_upstream_copy_header_line()
in revision 573:58475592100c, the ngx_http_upstream_copy_content_encoding()
function is no longer needed, as its behaviour is exactly equivalent to
ngx_http_upstream_copy_header_line() with appropriate offset. As such,
the ngx_http_upstream_copy_content_encoding() function was removed.
Further, the u->headers_in.content_encoding field is not used anywhere,
so it was removed as well.
Further, Content-Encoding handling no longer depends on NGX_HTTP_GZIP,
as it can be used even without any gzip handling compiled in (for example,
in the charset filter).
As all known input headers are now linked lists, these are now handled
identically. In particular, this makes it possible to access properly
combined values of headers not specifically handled previously, such
as "Via" or "Connection".
The ngx_http_process_multi_header_lines() function is removed, as it is
exactly equivalent to ngx_http_process_header_line(). Similarly,
ngx_http_variable_header() is used instead of ngx_http_variable_headers().
Multi headers are now using linked lists instead of arrays. Notably,
the following fields were changed: r->headers_in.cookies (renamed
to r->headers_in.cookie), r->headers_in.x_forwarded_for,
r->headers_out.cache_control, r->headers_out.link,
u->headers_in.cache_control, and u->headers_in.cookies (renamed to
u->headers_in.set_cookie).
The r->headers_in.cookies and u->headers_in.cookies fields were renamed
to r->headers_in.cookie and u->headers_in.set_cookie to match header names.
The ngx_http_parse_multi_header_lines() and ngx_http_parse_set_cookie_lines()
functions were changed accordingly.
With this change, multi headers are now essentially equivalent to normal
headers, and following changes will further make them equivalent.
Previously, $http_*, $sent_http_*, $sent_trailer_*, $upstream_http_*,
and $upstream_trailer_* variables returned only the first header (with
a few specially handled exceptions: $http_cookie, $http_x_forwarded_for,
$sent_http_cache_control, $sent_http_link).
With this change, all headers are returned, combined together. For
example, $http_foo variable will be "a, b" if there are "Foo: a" and
"Foo: b" headers in the request.
Note that $upstream_http_set_cookie will also return all "Set-Cookie"
headers (ticket #1843), though this might not be what one wants, since
the "Set-Cookie" header does not follow the list syntax (see RFC 7230,
section 3.2.2).
The uwsgi specification states that "The uwsgi block vars represent a
dictionary/hash". This implies that no duplicate headers are expected.
Further, provided headers are expected to follow the CGI specification,
which also requires combining headers (RFC 3875, section "4.1.18.
Protocol-Specific Meta-Variables"): "If multiple header fields with
the same field-name are received then the server MUST rewrite them
as a single value having the same semantics".
SCGI specification explicitly forbids headers with duplicate names
(section "3. Request Format"): "Duplicate names are not allowed in
the headers".
Further, provided headers are expected to follow the CGI specification,
which also requires combining headers (RFC 3875, section "4.1.18.
Protocol-Specific Meta-Variables"): "If multiple header fields with
the same field-name are received then the server MUST rewrite them
as a single value having the same semantics".
FastCGI responder is expected to receive CGI/1.1 environment variables
in the parameters (see section "6.2 Responder" of the FastCGI specification).
Obviously enough, there cannot be multiple environment variables with
the same name.
Further, the CGI specification (RFC 3875, section "4.1.18. Protocol-Specific
Meta-Variables") explicitly requires combining headers: "If multiple
header fields with the same field-name are received then the server MUST
rewrite them as a single value having the same semantics".
Previously, the r->header_in->connection pointer was never set despite
being present in ngx_http_headers_in, resulting in an incorrect value
returned by $r->header_in("Connection") in embedded perl.
In 7583:efd71d49bde0 (nginx 1.17.5), along with the introduction of
ioctl(FIONREAD) support, proper handling of systems without EPOLLRDHUP
support in the kernel (but with EPOLLRDHUP in headers) was broken.
Before the change, rev->available was never set to 0 unless
ngx_use_epoll_rdhup was also set (that is, runtime test for EPOLLRDHUP
introduced in 6536:f7849bfb6d21 succeeded). After the change,
rev->available might reach 0 on systems without runtime EPOLLRDHUP
support, stopping further reading in ngx_readv_chain() and ngx_unix_recv().
And, if EOF happened to be already reported along with the last event,
it is not reported again by epoll_wait(), leading to connection hangs
and timeouts on such systems.
This affects Linux kernels before 2.6.17 if nginx was compiled
with newer headers, and, more importantly, emulation layers, such as
DigitalOcean's App Platform's / gVisor's epoll emulation layer.
Fix is to explicitly check ngx_use_epoll_rdhup before the corresponding
rev->pending_eof tests in ngx_readv_chain() and ngx_unix_recv().
Previously, QUIC used the existing UDP framework, which was created for UDP
in Stream. However, the way QUIC connections are created and looked up is
different from the way UDP connections in Stream are created and looked up.
Now these two implementations are decoupled.
Previously, last buffer was tracked by keeping a pointer to the previous
chain link "next" field. When the previous buffer was split and then removed,
the pointer was no longer valid. Writing at this pointer resulted in broken
data chains.
Now last buffer is tracked by keeping a direct pointer to it.
The object is used instead of an ngx_chain_t pointer for buffer operations like
ngx_quic_write_chain() and ngx_quic_read_chain(). These functions are renamed
to ngx_quic_write_buffer() and ngx_quic_read_buffer().
Such fatal errors are reported by OpenSSL 1.1.1, and similarly by BoringSSL,
if application data is encountered during SSL shutdown, which started to be
observed on the second SSL_shutdown() call after SSL shutdown fixes made in
09fb2135a589 (1.19.2). The error means that the client continues to send
application data after receiving the "close_notify" alert (ticket #2318).
Previously it was reported as an SSL_shutdown() error of SSL_ERROR_SYSCALL.
Now ngx_quic_stream_t is decoupled from ngx_connection_t in a way that it
can persist after the connection is closed by the application. During this
period, the server expects the stream final size from the client for correct
flow control. Also, buffered output is sent to the client as more flow
control credit is granted.
As shown in RFC 8446, section 2.2, Figure 3, and further specified in
section 4.6.1, BoringSSL releases session tickets in Application Data
(along with Finished) early, based on a precalculated client Finished
transcript, once the client signalled early data in extensions.
Initially, frames are generated and stored in ctx->frames.
Next, ngx_quic_output() collects frames to be sent in ctx->sending.
On failure, ngx_quic_revert_send() returns frames into ctx->frames.
On success, ngx_quic_commit_send() moves ack-eliciting frames into
ctx->sent and frees non-ack-eliciting frames.
This function also updates the in-flight bytes counter, so only actually
sent frames are accounted.
The counter is decremented in the following cases:
- acknowledgment is received
- packet was declared lost
- we are discarding context completely
In each of these cases the frame is removed from the ctx->sent queue and the
in-flight counter is decremented accordingly.
The patch fixes the case of discarding a context: removing frames
from ctx->sent must be followed by an in-flight bytes counter decrement,
otherwise cg->in_flight could underflow.
The issue appeared in b1676cd64dc9.
The cd8018bc81a5 change fixed the unintended send of non-padded initial
packets, but failed to restore context properly: only processed contexts need
to be restored. As a consequence, a packet number could be restored from an
uninitialized value.
Previously, the flag could be reset after send_chain() with a limit, even
though there was room for more data. The application then started waiting for
a write event notification, which never happened.
Now the wev->ready flag is only reset when flow control is exhausted.
With large http2_max_concurrent_streams or http2_max_concurrent_pushes, more
than 255 ngx_http_v2_node_t structures might be allocated, eventually leading
to h2c->closed_nodes overflow when closing corresponding streams. This will
in turn result in additional allocations in ngx_http_v2_get_node_by_id().
While mostly harmless, it can result in excessive memory usage by an HTTP/2
connection, notably in configurations with many keepalive_requests allowed.
Fix is to use ngx_uint_t for h2c->closed_nodes instead of unsigned:8.
The switch happens when the received byte counter reaches the stream final
size. Previously, this state was skipped. The stream went from SIZE_KNOWN to
DATA_READ when all bytes were read by the application.
The change prevents STOP_SENDING frames from being sent when all data is
received from the client, but not yet fully read by the application.
Previously, size was calculated based on the number of input bytes processed
by the function. Now only the copied bytes are considered. This prevents
overlapping buffers from contributing twice to the overall written size.
Response headers can be buffered in the SSL buffer. But the stream's fake
connection buffered flag did not reflect this, so any attempts to flush
the buffer without sending additional data were stopped by the write filter.
It does not seem to be possible to reflect this in fc->buffered though, as
we never know whether the main connection's c->buffered corresponds to the
particular stream or not. As such, fc->buffered might prevent request
finalization due to sending data on some other stream.
Fix is to implement handling of flush buffers when the c->need_flush_buf
flag is set, similarly to the existing last buffer handling. The same
flag is now used for UDP sockets in the stream module instead of explicit
checking of c->type.
Previously, non-padded initial packet could be sent as a result of the
following situation:
- initial queue is not empty (so padding to 1200 is required)
- handshake queue is not empty (so padding is to be added after h/s packet)
- path is limited
If serializing the handshake packet would violate the path limit, such packet
was omitted, and the non-padded initial packet was sent.
The fix is to avoid sending the packet at all in such case. This follows the
original intention introduced in c5155a0cb12f.
During configuration reload two cache managers might exist for a short
time. If both tried to delete the same cache node, the "ignore long locked
inactive cache entry" alert appeared in logs. Additionally,
ngx_http_file_cache_forced_expire() might also be called by worker
processes, with similar results.
Fix is to ignore cache nodes being deleted, similarly to how it is
done in ngx_http_file_cache_expire() since 3755:76e3a93821b1. This
was somehow missed in 7002:ab199f0eb8e8, when ignoring long locked
cache entries was introduced in ngx_http_file_cache_forced_expire().
- wording in log->action is adjusted to match function names.
- connection close steps are made obvious and start with "quic close" prefix:
*1 quic close initiated rc:-4
*1 quic close silent drain:0 timedout:1
*1 quic close resumed rc:-1
*1 quic close resumed rc:-1
*1 quic close resumed rc:-4
*1 quic close completed
this makes it easy to understand whether a particular "close" record is the
initial cause, an ongoing process, or the final one.
- cases of close without a quic connection are now logged as "packet rejected":
*14 quic run
*14 quic packet rx long flags:ec version:1
*14 quic packet rx hs len:61
*14 quic packet rx dcid len:20 00000000000002c32f60e4aa2b90a64a39dc4228
*14 quic packet rx scid len:8 81190308612cd019
*14 quic expected initial, got handshake
*14 quic packet done rc:-1 level:hs decr:0 pn:0 perr:0
*14 quic packet rejected rc:-1, cleanup connection
*14 reusable connection: 0
this makes it easy to spot early packet rejection and avoid confusion with
closing of a quic connection (which in fact was not even created).
- the packet processing summary now uses the same prefix "quic packet done rc:"
- added debug logging to places where a packet was rejected without any reason
  logged
NGX_DECLINED is replaced with NGX_DONE to better match the return code
of ngx_quic_handle_packet() and the ngx_quic_close_connection() rc argument.
The ngx_quic_close_connection() rc code is used only when a quic connection
exists, thus anything goes if qc == NULL.
ngx_quic_handle_datagram() does not return NGX_OK in cases when a quic
connection is not yet created.
Previously, closure detection for server-initiated uni streams was not properly
implemented. Instead, HTTP/3 code relied on QUIC code posting the read event
and setting rev->error when it needed to close the stream. Then, the regular
uni stream read handler called c->recv() and received an error, which closed
the stream. This was an ad-hoc solution. If, for whatever reason, the read
handler was called earlier, c->recv() would return 0, which would also close
the stream.
Now server-initiated uni streams have a separate read event handler for
tracking stream closure. The handler calls c->recv(), which normally returns
0, but may return error in case of closure.
Sending the instruction is delayed until the end of the current event cycle.
Delaying the instruction is allowed by quic-qpack-21, section 2.2.2.3.
The goal is to reduce the amount of data sent back to the client by
accumulating several inserts in one instruction, and sometimes not sending
the instruction at all, if a Section Acknowledgement was sent just before it.
Operations like ngx_quic_open_stream(), ngx_http_quic_get_connection(),
ngx_http_v3_finalize_connection(), ngx_http_v3_shutdown_connection() used to
receive a QUIC stream connection. Now they can receive the main QUIC
connection as well. This is useful when calling them from a stream context.
This is to ease transition with older BoringSSL versions; the default
for SSL_set_quic_use_legacy_codepoint() was flipped in BoringSSL commit
a1d3bfb64fd7ef2cb178b5b515522ffd75d7b8c5.
Chrome only uses TLS session tickets once with TLS 1.3, likely following
RFC 8446 Appendix C.4 recommendation. With OpenSSL, this works fine with
built-in session tickets, since these are explicitly renewed in case of
TLS 1.3 on each session reuse, but results in only two connections being
reused after an initial handshake when using ssl_session_ticket_key.
Fix is to always renew TLS session tickets in case of TLS 1.3 when using
ssl_session_ticket_key, similarly to how it is done by OpenSSL internally.
Previously, when input ended on a QUIC buffer boundary, the input chain was
not advanced to the next buffer. As a result, ngx_quic_write_chain() returned
a chain with an empty buffer instead of NULL. This broke the HTTP write
filter, preventing it from closing the HTTP request, which eventually timed
out.
Now the input chain is always advanced to a buffer that has data before
checking the QUIC buffer boundary condition.
RFC 9000, 9.3. Responding to Connection Migration:
An endpoint only changes the address to which it sends packets in
response to the highest-numbered non-probing packet.
The patch extends this requirement to probing packets. Although it may
seem excessive, it helps with mitigation of replay attacks (when an off-path
attacker has copied a packet with PATH_CHALLENGE and uses different
addresses to exhaust available connection ids).
The quic connection now holds active, backup and probe paths instead
of sockets. The number of migration paths is now limited and cannot
be inflated by a bad client or an attacker.
The client id is now associated with a path rather than a socket. This
simplifies output processing and connection id handling.
A new migration abandons any previously started migrations. This allows
freeing consumed client ids and requesting new ones for use in future
migrations, and making progress in case the connection id limit is hit
during migration.
A path can now be revalidated without losing its state.
The patch also fixes various issues with NAT rebinding case handling:
- paths are now validated (previously, there was no validation
and paths were left in limited state)
- attempt to reuse id on different path is now again verified
(this was broken in 40445fc7c403)
- former path is now validated in case of apparent migration
The function splits a buffer at a given offset. The function is now
called from ngx_quic_read_chain() and ngx_quic_write_chain(), which
simplifies both functions.
After 9ae239d2547d, ngx_quic_handle_write_event() no longer runs into
ngx_send_lowat() for QUIC connections, so the check became excessive.
It is assumed that external modules operating with SO_SNDLOWAT
(I'm not aware of any) should do this check on their own.
Previously, ngx_quic_write_chain() treated each input buffer as a memory
buffer, which is not always the case. Special buffers were not skipped, which
is especially important when hitting the input byte limit.
The issue manifested itself with ngx_quic_write_chain() returning a non-empty
chain consisting of a special last_buf buffer when called from QUIC stream
send_chain(). In order for this to happen, input byte limit should be equal to
the chain length, and the input chain should end with an empty last_buf buffer.
An easy way to achieve this is the following:
    location /empty {
        return 200;
    }
When this non-empty chain was returned from send_chain(), it signalled to the
caller that input was blocked, while in fact it wasn't. This prevented the
HTTP request from being finalized, which prevented QUIC from sending STREAM
FIN to the client. The QUIC stream was then reset after a timeout.
Now special buffers are skipped and send_chain() returns NULL in the case
above, which signals to the caller a successful operation.
Also, the original byte limit is now passed to ngx_quic_write_chain() from
send_chain() instead of the actual chain length, to make sure it's never zero.
Previously, when a STREAM FIN frame with no data bytes was received after all
prior stream data were already read by the application layer, the frame was
ignored and eof was not reported to the application.
When a worker process is shutting down, keepalive is not used: this is checked
before the ngx_http_set_keepalive() call in ngx_http_finalize_connection().
Yet the "Connection: keep-alive" header was still sent, even if we know that
the worker process is shutting down, potentially resulting in additional
requests being sent to the connection which is going to be closed anyway.
While clients are expected to be able to handle asynchronous close events
(see ticket #1022), it is certainly possible to send the "Connection: close"
header instead, informing the client that the connection is going to be closed
and potentially saving some unneeded work.
With this change, we additionally check for worker process shutdown just
before sending response headers, and disable keepalive accordingly.
Linux with EPOLLEXCLUSIVE usually notifies only the process which was first
to add the listening socket to the epoll instance. As a result most of the
connections are handled by the first worker process (ticket #2285). To fix
this, we re-add the socket periodically, so other workers will get a chance
to accept connections.
The SF_NOCACHE flag, introduced in FreeBSD 11 along with the new non-blocking
sendfile() implementation by glebius@, makes it possible to use sendfile()
along with the "directio" directive.
Starting with FreeBSD 11, there is no need to use AIO operations to preload
data into cache for sendfile(SF_NODISKIO) to work. Instead, sendfile()
handles non-blocking loading of data from disk by itself. It still can, however,
return EBUSY if a page is already being loaded (for example, by a different
process). If this happens, we now post an event for the next event loop
iteration, so sendfile() is retried "after a short period", as the manpage
recommends.
The limit on the number of EBUSY results tolerated without any progress is
preserved,
but now it does not result in an alert, since on an idle system event loop
iteration might be very short and EBUSY can happen many times in a row.
Instead, SF_NODISKIO is simply disabled for one call once the limit is
reached.
With this change, sendfile(SF_NODISKIO) is now used automatically as long as
sendfile() is enabled, and no longer requires "aio on;".
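A sketch of a configuration that benefits from this on FreeBSD 11+ (path and
size are illustrative):
location /media/ {
    sendfile on;
    directio 4m;
}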
It was mostly a copy of ngx_quic_listen(). Now ngx_quic_listen() no
longer generates a server id or increments seqnum. Instead, the server
id is generated when the socket is created.
The ngx_quic_alloc_socket() function is renamed to ngx_quic_create_socket().
Notably, NAXSI is known to misuse ngx_regex_compile() with rc.options set
to PCRE_CASELESS | PCRE_MULTILINE. With PCRE2 support, and notably binary
compatibility changes, it is no longer possible to set the PCRE[2]_MULTILINE
option without using the proper interface. To facilitate correct usage,
this change adds the NGX_REGEX_MULTILINE option.
With this change, dynamic modules using nginx regex interface can be used
regardless of the variant of the PCRE library nginx was compiled with.
If a module is compiled with a different PCRE library variant, it will report
a wrong function name in error messages in case of ngx_regex_exec() errors.
This is believed to be tolerable, given that fixing this will
require interface changes.
The PCRE2 library is now used by default if found, instead of the
original PCRE library. If needed for some reason, this can be disabled
with the --without-pcre2 configure option.
To make it possible to specify paths to the library and include files
via --with-cc-opt / --with-ld-opt, the library is first tested without
any additional paths and options. If this fails, the pcre2-config script
is used.
Similarly to the original PCRE library, it is now possible to build PCRE2
from sources with nginx configure, by using the --with-pcre= option.
It automatically detects if PCRE or PCRE2 sources are provided.
Note that compiling PCRE2 10.33 and later requires inttypes.h. When
compiling on Windows with MSVC, inttypes.h is only available starting
with MSVC 2013. In older versions some replacement needs to be provided
("echo '#include <stdint.h>' > pcre2-10.xx/src/inttypes.h" is good enough
for MSVC 2010).
The interface on nginx side remains unchanged.
If configuration parsing fails for some reason, ngx_regex_module_init()
is not called, and ngx_pcre_studies remains set despite the fact that
the pool it was allocated from is already freed. This might result in
a segmentation fault during runtime regular expression compilation, such
as in SSI, for example, in single process mode, or if a worker process
died and was respawned from a master process in such an inconsistent state.
Fix is to clear ngx_pcre_studies from the pool cleanup handler (which is
anyway used to free JIT-compiled patterns).
Previously, buffer lists were used to track used buffers. Now a reference
counter is used instead. The new implementation is simpler and faster with
many buffer clones.
They are replaced with ngx_quic_write_chain() and ngx_quic_read_chain().
These functions represent the API for data buffering.
The first function adds data of a given size at a given offset to the buffer.
It now returns the unwritten part of the chain, similar to c->send_chain().
The second function returns data of a given size from the beginning of the
buffer.
Its second argument and return value are swapped compared to
ngx_quic_split_bufs() to better match ngx_quic_write_chain().
Added, returned and stored data are regular ngx_chain_t/ngx_buf_t chains.
Missing data is marked with the b->sync flag.
The functions are now used in both send and recv data chains in QUIC streams.
Previously, when a few bytes were sent to a QUIC stream by the application, a
4K buffer was allocated for these bytes. Then a STREAM frame was created and
that entire buffer was used as data for that frame. The frame and the buffer
were in use until the frame was acked by the client. Meanwhile, when more
bytes were sent to the stream, more buffers were allocated and assigned as
data to newer STREAM frames. In this scenario most buffer memory is unused.
Now the unused part of the stream output buffer is available for further
stream output while earlier parts of the buffer are waiting to be acked.
This is achieved by splitting the output buffer.
The output is always sent to the active path, which is stored in the
quic connection. There is no need to pass it in arguments.
When output has to be sent to a specific path (in rare cases, such as
path probing), a separate method exists (ngx_quic_frame_sendto()).
The path validation status and the anti-amplification limit status are
actually two different variables. It is possible that a path being validated
should not be limited (for example, when re-validating a former path).
Previously, a path was considered valid during an arbitrarily selected
10-minute timeout since validation. This is not quite what RFC 9000 says;
the relevant part is:
An endpoint MAY skip validation of a peer address if that
address has been seen recently.
The patch considers a path to be 'recently seen' if packets were received
during the idle timeout. If a packet is received from a path that was seen
not so recently, such a path is considered new, and anti-amplification
restrictions apply.
After creation, a client stream is added to the qc->streams.uninitialized
queue. After initialization it's removed from the queue. If a stream is never
initialized, it is freed in ngx_quic_close_streams(). The stream initializer
is now set as the read event handler in the stream connection.
Previously qc->streams.uninitialized was used only for delayed stream
initialization.
The change makes it possible not to handle separately the case of a new stream
in stream-related frame handlers. It makes these handlers simpler since new
streams and existing streams are now handled by the same code.
With sendfile() in threads ("aio threads; sendfile on;"), a client
connection can block on writing, waiting for sendfile() to complete.
In HTTP/2 this might result in a request hang, since an attempt to continue
processing in the thread event handler will call the request's write event
handler, which is usually stopped by ngx_http_v2_send_chain(): it does
nothing if there is no additional data and stream->queued is set. Further,
HTTP/2 resets the stream's c->write->ready to 0 if writing blocks, so just
fixing ngx_http_v2_send_chain() is not enough.
Can be reproduced with test suite on Linux with:
TEST_NGINX_GLOBALS_HTTP="aio threads; sendfile on;" prove h2*.t
The following tests currently fail: h2_keepalive.t, h2_priority.t,
h2_proxy_max_temp_file_size.t, h2.t, h2_trailers.t.
Similarly, sendfile() with AIO preloading on FreeBSD can block as well,
with similar results. This is, however, harder to reproduce, especially
on modern FreeBSD systems, since sendfile() usually does not return EBUSY.
Fix is to modify ngx_http_v2_send_chain() so it actually tries to send
data to the main connection when called, and to make sure that
c->write->ready is set by the relevant event handlers.
With sendfile in threads, "task already active" alerts might appear in logs
if a write event happens on the main HTTP/2 connection, triggering a
sendfile in threads while another thread operation is already running.
Observed with "aio threads; aio_write on; sendfile on;" and with thread
event handlers modified to post a write event to the main HTTP/2 connection
(though it can happen without any modifications).
Similarly, sendfile() with AIO preloading on FreeBSD can trigger duplicate
aio operation, resulting in "second aio post" alerts. This is, however,
harder to reproduce, especially on modern FreeBSD systems, since sendfile()
usually does not return EBUSY.
Fix is to avoid starting a sendfile operation if other thread operation
is active by checking r->aio in the thread handler (and, similarly, in
aio preload handler). The added check also makes duplicate calls protection
redundant, so it is removed.
The SSL_OP_ENABLE_MIDDLEBOX_COMPAT option is provided by QuicTLS and enabled
by default in newly created SSL contexts. SSL_set_quic_method() is used
to clear it, which is required for the SSL handshake to work on QUIC
connections.
Switching contexts in the ngx_http_ssl_servername() SNI callback overrides
SSL options with those from the new SSL context. This results in the option
being set again.
Fix is to explicitly clear it when switching to another SSL context.
Initially reported here (in Russian):
http://mailman.nginx.org/pipermail/nginx-ru/2021-November/063989.html
ngx_http_v3_tables.h and ngx_http_v3_tables.c are renamed to
ngx_http_v3_table.h and ngx_http_v3_table.c to better match HTTP/2 code.
ngx_http_v3_streams.h and ngx_http_v3_streams.c are renamed to
ngx_http_v3_uni.h and ngx_http_v3_uni.c to better match their content.
Directives that set transport parameters are removed from the configuration.
Corresponding values are derived from the quic configuration or initialized
to defaults. Whenever possible, quic configuration parameters are taken from
higher-level protocol settings, i.e. HTTP/3.
A new variable $http3 is added. The variable equals "h3" for HTTP/3
connections, "hq" for hq connections, and is an empty string otherwise.
The $quic variable is eliminated.
The new variable is similar to the $http2 variable.
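For example, the new variable can be used in an access log (a minimal
sketch; the log format name and the set of fields are arbitrary):
# logs "h3" for HTTP/3 requests, "hq" for hq, empty string otherwise
log_format quic '$remote_addr "$request" $status $http3';
access_log logs/access.log quic;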
RFC 9000, 19.16:
The sequence number specified in a RETIRE_CONNECTION_ID frame MUST NOT
refer to the Destination Connection ID field of the packet in which the
frame is contained.
Before the patch, the RETIRE_CONNECTION_ID frame was sent before switching
to the new client id. If the retired client id was currently in use, this
led to a violation of the spec.
The c->udp->dgram may be NULL only if the quic connection was just
created: ngx_event_udp_recvmsg() passes information about datagrams
to existing connections by providing it in c->udp.
In case of a new connection, c->udp is allocated by the QUIC code during
creation of the quic connection (it uses c->sockaddr to initialize
qsock->path). Thus the check for qsock->path is excessive and can be
misread as suggesting that other options are possible, leading to warnings
from clang static analyzer.
Removed sending CONNECTION_CLOSE directly, to avoid duplicate frames,
since the frame is sent again later in SSL_do_handshake() error handling.
As such, removed redundant settings of error fields set elsewhere.
While here, improved the debug message.
All open sockets are stored in a queue. There is no need to close some
of them separately. If it happens that active and backup point to the same
socket, a double close may happen (possibly leading to a segfault).
RFC 9000 allows a packet with a known CID to arrive from an unknown path:
These requirements regarding connection ID reuse apply only to the
sending of packets, as unintentional changes in path without a change
in connection ID are possible. For example, after a period of
network inactivity, NAT rebinding might cause packets to be sent on a
new path when the client resumes sending.
Before the patch, such packets were rejected with an error in the
ngx_quic_check_migration() function. Removing the check makes the
separate function excessive - the remaining checks are the early migration
check and the "disable_active_migration" check. The latter is a transport
parameter sent to the client, and it should not be used by the server.
The server should send "disable_active_migration" "if the endpoint does
not support active connection migration" (18.2). The support status depends
on nginx configuration: to have migration working with multiple workers,
the bpf helper is required, which is available on recent Linux systems.
The patch does not set "disable_active_migration" automatically and leaves
it to the administrator. By default, active migration is enabled.
RFC 9000 says that it is ok to migrate if the peer violates the
"disable_active_migration" flag requirements:
If the peer violates this requirement,
the endpoint MUST either drop the incoming packets on that path without
generating a Stateless Reset
OR
proceed with path validation and allow the peer to migrate. Generating a
Stateless Reset or closing the connection would allow third parties in the
network to cause connections to close by spoofing or otherwise manipulating
observed traffic.
So, nginx adheres to the second option and proceeds to path validation.
Note:
The ngtcp2 client may be used for testing both active migration and NAT
rebinding:
ngtcp2/client --change-local-addr=200ms --delay-stream=500ms <ip> <port> <url>
ngtcp2/client --change-local-addr=200ms --delay-stream=500ms --nat-rebinding \
<ip> <port> <url>
A single UDP datagram may contain multiple coalesced QUIC packets. To
facilitate handling of such cases, a 'first' flag is introduced in the
ngx_quic_header_t structure.
Previously, the retired socket was not closed if it did not match the
active or backup socket.
New sockets could not be created (due to the count limit), since the
retired socket was not closed before calling ngx_quic_create_sockets().
When replacing the retired socket, a new socket is now only requested after
closing the old one, to avoid hitting the limit on the number of active
connection ids.
Together with the added restrictions, this fixes an issue where the current
socket could be closed during migration, then recreated and erroneously
reused, leading to a null pointer dereference.
Previously, the frame was not handled and the connection was closed with an
error. Now, after receiving this frame, global flow control is updated and
new flow control credit is sent to the client.
Previously, after receiving STREAM_DATA_BLOCKED, the current flow control
limit was sent to the client. Now, if the limit can be updated to the full
window size, it is updated and the new value is sent to the client;
otherwise nothing is sent.
The change lets the client update flow control credit on demand. Also, it
saves traffic by not sending MAX_STREAM_DATA with the same value twice.
The reasons why a stream may not be created by the server currently include
hitting the worker_connections limit and memory allocation errors.
Previously, in these cases the entire QUIC connection was closed and all its
streams were shut down. Now the new stream is rejected and existing streams
continue working.
To reject an HTTP/3 request stream, RESET_STREAM and STOP_SENDING with the
H3_REQUEST_REJECTED error code are sent to the client. HTTP/3 uni streams
and Stream module streams are not rejected.
The variable contains the negotiated curve used for the handshake key
exchange process. Known curves are listed by their names; unknown
ones are shown in hex.
Note that for resumed sessions in TLSv1.2 and older protocols,
$ssl_curve contains the curve used during the initial handshake,
while in TLSv1.3 it contains the curve used during the session
resumption (see the SSL_get_negotiated_group manual page for
details).
The variable is only meaningful when using OpenSSL 3.0 and above.
With older versions the variable is empty.
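For example, the negotiated curve can be exposed for debugging (a sketch;
the header name is arbitrary):
# report the key exchange curve in a response header
add_header X-Key-Exchange-Curve $ssl_curve always;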
Without this change, aio used with HTTP/2 can result in a connection hang,
as observed with "aio threads; aio_write on;" and proxying (ticket #2248).
The problem is that HTTP/2 updates buffers outside of the output filters
(notably, marks them as sent), and then posts a write event to call
output filters. If a filter does not call the next one for some reason
(for example, because of an AIO operation in progress), this might
result in a state when the owner of a buffer already called
ngx_chain_update_chains() and can reuse the buffer, while the same buffer
is still sitting in the busy chain of some other filter.
In the particular case, a buffer was sitting in the output chain's
ctx->busy and was reused by the event pipe. The output chain's ctx->busy
was permanently blocked by it, and this resulted in the connection hang.
Fix is to change ngx_chain_update_chains() to skip buffers from other
modules unconditionally, without trying to wait for these buffers to
become empty.
The "sendfile_max_chunk" directive is important to prevent worker
monopolization by fast connections. The 2m value implies maximum 200ms
delay with 100 Mbps links, 20ms delay with 1 Gbps links, and 2ms on
10 Gbps links. It also seems to be a good value for disks.
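A minimal configuration sketch with the value discussed above:
sendfile on;
sendfile_max_chunk 2m;  # cap the amount sent per sendfile() call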
Previously, connections to upstream servers used sendfile() if it was
enabled, but never honored sendfile_max_chunk. This might result
in worker monopolization for a long time if large request bodies
are allowed.
On Linux starting with 2.6.16, sendfile() silently limits all operations
to MAX_RW_COUNT, defined as (INT_MAX & PAGE_MASK). This incorrectly
triggered the interrupt check, and resulted in 0-sized writev() on the
next loop iteration.
Fix is to make sure the limit is always checked, so we will return from
the loop if the limit is already reached, even if the number of bytes sent
is not exactly equal to the number of bytes we've tried to send.
Previously, it was checked that sendfile_max_chunk was enabled and
almost the whole sendfile_max_chunk was sent (see e67ef50c3176), to avoid
delaying connections where sendfile_max_chunk wasn't reached (for example,
when sending responses smaller than sendfile_max_chunk). Now we instead
check whether there is unsent data and whether the connection is still
ready for writing. Additionally, we check c->write->delayed to ignore
connections already delayed by limit_rate.
This approach is believed to be more robust, and correctly handles
not only sendfile_max_chunk, but also internal limits of c->send_chain(),
such as sendfile() maximum supported length (ticket #1870).
Previously, a 1 millisecond delay was used instead. In certain edge cases
this might result in noticeable performance degradation though, notably on
Linux with typical CONFIG_HZ=250 (so the 1ms delay becomes 4ms),
sendfile_max_chunk 2m, and link speeds above 2.5 Gbps.
Using posted next events removes the artificial delay and makes processing
fast in all cases.
The directive enables including all frames from the start time to the most
recent key frame in the result. Those frames are removed from the
presentation timeline using mp4 edit lists.
Edit lists are currently supported by popular players and browsers such as
Chrome, Safari, QuickTime and ffmpeg. Among those not supporting them properly
is Firefox[1].
Based on a patch by Tracey Jaquith, Internet Archive.
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1735300
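A minimal configuration sketch, assuming the mp4_start_key_frame directive
name used in released nginx versions:
location /video/ {
    mp4;
    mp4_start_key_frame on;  # include frames back to the preceding key frame
}
With this, a request such as /video/file.mp4?start=10.5 produces a file
that begins at the preceding key frame, with the extra frames hidden from
the presentation timeline by an edit list.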
The function updates the duration field of the mdhd atom. Previously it was
updated in ngx_http_mp4_read_mdhd_atom(). The change makes it possible to
alter track duration as a result of processing track frames.
As per quic-qpack-21:
When a stream is reset or reading is abandoned, the decoder emits a
Stream Cancellation instruction.
Previously, the instruction was not sent. Now it is sent when closing a QUIC
stream connection if the dynamic table capacity is non-zero and eof has not
been received from the client. The latter condition means that a trailers
section may still be on its way from the client and the stream needs to be
cancelled.
A QUIC stream connection is treated as reusable until the first bytes of
the request arrive, which is also when the request object is now allocated.
A connection closed as a result of draining is reset with the error code
H3_REQUEST_REJECTED. Such behavior is allowed by quic-http-34:
Once a request stream has been opened, the request MAY be cancelled
by either endpoint. Clients cancel requests if the response is no
longer of interest; servers cancel requests if they are unable to or
choose not to respond.
When the server cancels a request without performing any application
processing, the request is considered "rejected." The server SHOULD
abort its response stream with the error code H3_REQUEST_REJECTED.
The client can treat requests rejected by the server as though they had
never been sent at all, thereby allowing them to be retried later.
When an HTTP/3 function returns an error in the context of a QUIC stream,
it is now this function's responsibility to finalize the entire QUIC
connection with the right code, if required. Previously, QUIC connection
finalization could be done both outside and inside such functions. The new
rule follows a similar rule for logging, leads to cleaner code, and makes
it possible to provide more details about the error.
While here, a few error cases are no longer treated as fatal, and the QUIC
connection is no longer finalized in these cases. A few other cases now
lead to a stream reset instead of connection finalization.
The PATH_RESPONSE frame must be expanded to 1200 bytes, except when the
anti-amplification limit is in effect, i.e. on unvalidated paths.
Previously, the anti-amplification limit was always applied.
If a client ID was never used, its refcount is zero. To keep things simple,
the ngx_quic_unref_client_id() function is now aware of such IDs.
If a client ID was used, the ngx_quic_replace_retired_client_id() function
is supposed to find all users and unref the ID, so ngx_quic_unref_client_id()
should not be called after it.
Previously, it was not enforced in the stream module.
Now, since b9e02e9b2f1d, it is possible to specify protocols.
Since ALPN is always required, the 'require_alpn' setting is now obsolete.
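A sketch of such a configuration, assuming the ssl_alpn directive
introduced in b9e02e9b2f1d (the protocol name is illustrative):
stream {
    server {
        listen 12345 ssl;
        ssl_alpn smpp;  # only clients offering this ALPN protocol are accepted
        ...
    }
}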
The "min" and "max" arguments refer to UDP datagram size. Generating payload
requires to account properly for header size, which is variable and depends on
payload size and packet number.
After fe919fd63b0b, processing QUIC streams was postponed until after handshake
completion, which means that 0-RTT is effectively off. With ssl_ocsp enabled,
it could be further delayed. This differs from how OCSP validation works with
SSL_read_early_data(). With this change, processing QUIC streams is unlocked
when obtaining 0-RTT secret.
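A sketch of a configuration where this interaction matters, assuming the
"listen ... quic" syntax of current releases; ssl_ocsp here applies to
client certificate verification:
server {
    listen 443 quic reuseport;
    ssl_early_data on;       # enable 0-RTT
    ssl_verify_client on;
    ssl_ocsp on;             # OCSP validation may delay handshake completion
    ...
}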
The sent queue is sorted by packet number. It is possible to avoid
traversing the full queue while handling ACK ranges. It makes sense to
start traversing from the queue head (i.e. to check the oldest packets
first).
With this patch, all traffic over a QUIC connection is compared to traffic
over QUIC streams. As long as total traffic is many times larger than stream
traffic, we consider this to be a flood.
With this patch, all traffic over HTTP/3 bidi and uni streams is counted in
the h3c->total_bytes field, and payload traffic is counted in the
h3c->payload_bytes field. As long as total traffic is many times larger than
payload traffic, we consider this to be a flood.
Request header traffic is counted as if all fields are literal. Response
header traffic is counted as is.
Checking for the reset after decryption fails avoids false positives.
More importantly, it avoids the check entirely in the usual case where
decryption succeeds.
RFC 9000, 10.3.1, Detecting a Stateless Reset:
Endpoints MAY skip this check if any packet from a datagram is
successfully processed.
As per RFC 9000:
An endpoint that receives a STOP_SENDING frame MUST send a RESET_STREAM
frame if the stream is in the "Ready" or "Send" state.
An endpoint SHOULD copy the error code from the STOP_SENDING frame to
the RESET_STREAM frame it sends, but it can use any application error code.
The flag indicates that the entire response was sent to the socket up to the
last_buf flag. The flag is only usable for protocol implementations that
call ngx_http_write_filter() from the header filter, such as HTTP/1.x and
HTTP/3.
Similar to the previous change, a segmentation fault occurred when
evaluating SSL certificates on a QUIC connection due to an uninitialized
stream session. The fix is to postpone initializing the QUIC part of a
connection until after it has the session and variables initialized.
Similarly, this appends logging error context for QUIC connections:
- client 127.0.0.1:54749 connected to 127.0.0.1:8880 while handling frames
- quic client timed out (60: Operation timed out) while handling quic input
A QUIC connection doesn't have c->log->data and friends initialized to
sensible values. Yet, a request can be created in the certificate callback
under such an assumption, which leads to a segmentation fault due to a null
pointer dereference in ngx_http_free_request(). The fix is to adjust
initialization of the QUIC part of a connection such that it has all of
that in place.
Further, this appends logging error context for unsuccessful QUIC handshakes:
- cannot load certificate .. while handling frames
- SSL_do_handshake() failed .. while sending frames
Previously, the counter was not incremented for HTTP/3 streams, but was
still decremented in ngx_http_close_connection(). There are two solutions
here: one is to increment the counter for HTTP/3 streams, and the other is
not to decrement it for HTTP/3 streams. The latter solution looks
inconsistent with ngx_stat_reading/ngx_stat_writing, which are incremented
on a per-request basis. The change adds the ngx_stat_active increment for
HTTP/3 request and push streams.
Notably, this avoids setting the TCP_NODELAY flag for QUIC streams
in ngx_http_upstream_send_response(). It is an invalid operation on
inherently SOCK_DGRAM sockets, which leads to QUIC connection close.
The change reduces diff to the default branch in stream content phase.
This function was only referenced from ngx_http_v3_create_push_request() to
initialize push connection log. Now the log handler is copied from the parent
request connection.
The change reduces diff to the default branch.
The functions ngx_quic_handle_read_event() and ngx_quic_handle_write_event()
are added. Previously this code was a part of ngx_handle_read_event() and
ngx_handle_write_event().
The change simplifies ngx_handle_read_event() and ngx_handle_write_event()
by moving QUIC-related code to a QUIC source file.
Previously it had -1 as fd. This fixes proxying, which relies on downstream
connection having a real fd. Also, this reduces diff to the default branch for
ngx_close_connection().
The request body filter chain is no longer called after processing
a DATA frame. Instead, we now post a read event to do this. This
ensures that multiple small DATA frames read during the same event loop
iteration are coalesced together, resulting in much faster processing.
Since rb->buf can now contain unprocessed data, window update is no
longer sent in ngx_http_v2_state_read_data() in case of flow control
being used due to filter buffering. Instead, window will be updated
by ngx_http_v2_read_client_request_body_handler() in the posted read
event.
Following rb->filter_need_buffering changes, request body reading is
only finished after the filter chain is called and rb->last_saved is set.
As such, with r->request_body_no_buffering, timer on fc->read is no
longer removed when the last part of the body is received, potentially
resulting in incorrect behaviour.
The fix is to call ngx_http_v2_process_request_body() from the
ngx_http_v2_read_unbuffered_request_body() function instead of
directly calling ngx_http_v2_filter_request_body(), so the timer
is properly removed.
In the body read handler, the window was incorrectly calculated
based on the full buffer size instead of the amount of free space
in the buffer. If the request body is buffered by a filter, and
the buffer is not empty after the read event is generated by the
filter to resume request body processing, this could result in
"http2 negative window update" alerts.
Further, in the body read handler and in ngx_http_v2_state_read_data()
the buffer wasn't cleared when the data was already written to disk,
so the client might get stuck without window updates.
If a MAX_DATA frame was received before any stream was created, the worker
process would crash in ngx_quic_handle_max_data_frame() while traversing the
stream tree. The issue is solved by adding a check that makes sure the tree
is not empty.