The history of acknowledged packets is kept in the send context as ranges.
Up to NGX_QUIC_MAX_RANGES ranges are stored.
As a result, instead of separate ack frames, a single frame with ranges
is sent.
Previously, if there were multiple limits configured, errors in
ngx_http_complex_value() during processing of a non-first limit
resulted in reference count leak in shared memory nodes of already
processed limits. The fix is to explicitly unlock relevant nodes, much
like we do when rejecting requests.
The proxy_smtp_auth directive instructs nginx to authenticate users
on backend via the AUTH command (using the PLAIN SASL mechanism),
similar to what is normally done for IMAP and POP3.
If xclient is enabled along with proxy_smtp_auth, the XCLIENT command
won't try to send the LOGIN parameter.
In 7717:e3e8b8234f05, the 1st bit was incorrectly used. It shouldn't
be used for bitmask values, as it is used by NGX_CONF_BITMASK_SET.
Additionally, the special value "off" is added to make it possible to clear
the inherited userid_flags value.
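The bit collision can be illustrated with a minimal sketch. It assumes the usual nginx convention that NGX_CONF_BITMASK_SET has the value 1 and is OR'ed into the default when bitmask values are merged. The flag names and the merge helper below are hypothetical and only mirror that convention; they are not the nginx macros themselves.

```c
#include <stdint.h>

/* Mirrors NGX_CONF_BITMASK_SET (assumed to be 1): bit 0 marks a set value. */
#define CONF_BITMASK_SET  0x1u

/* Hypothetical flags for illustration; real flags must start at bit 1. */
#define FLAG_OFF          0x2u
#define FLAG_SECURE       0x4u
#define FLAG_HTTPONLY     0x8u

/*
 * Simplified merge: a zero child inherits from the parent, a zero parent
 * falls back to the default.  The default always carries CONF_BITMASK_SET.
 */
uint32_t
bitmask_merge(uint32_t child, uint32_t parent, uint32_t dflt)
{
    if (child == 0) {
        child = (parent == 0) ? dflt : parent;
    }

    return child;
}
```

With these assumptions, a child level configured with "off" holds a non-zero value and therefore does not inherit FLAG_SECURE from its parent, while a flag placed at bit 0 would appear enabled in every merged default.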
The "false" parameter of the proxy_redirect directive is deprecated.
A warning has been emitted since c2230102df6f (0.7.54).
The "off" parameter of the proxy_redirect, proxy_cookie_domain, and
proxy_cookie_path directives tells nginx not to inherit the
configuration from the previous configuration level.
Previously, after specifying the directive with the "off" parameter,
any other directives were ignored, and syntax checking was disabled.
The syntax was enforced to allow either one directive with the "off"
parameter, or several directives with other parameters.
Also, specifying "proxy_redirect default foo" no longer works like
"proxy_redirect default".
Previously, this field was not set while creating a QUIC stream connection.
As a result, calling ngx_connection_local_sockaddr() led to getsockname()
bad descriptor error.
If client acknowledged an Initial packet with CRYPTO frame and then
sent another Initial packet containing duplicate CRYPTO again, this
could result in resending frames off the empty send queue.
The new "quic_stateless_reset_token_key" directive is added. It sets the
endpoint key used to generate stateless reset tokens and enables the feature.
If the endpoint receives a short-header packet that can't be matched to
an existing connection, a stateless reset packet is generated with
a proper token.
If a valid stateless reset token is found in the incoming packet,
the connection is closed.
Example configuration:
    http {
        quic_stateless_reset_token_key  "foo";
        ...
    }
The flag is tied to the initial secret creation. The presence of c->quic
pointer is sufficient to enable execution of ngx_quic_close_quic().
The ngx_quic_new_connection() function now returns the allocated quic
connection object and the c->quic pointer is set by the caller.
If an early error occurs before secrets initialization (i.e. in cases
of invalid retry token or nginx exiting), it is still possible to
generate an error response by trying to initialize secrets directly
in the ngx_quic_send_cc() function.
Before the change such early errors failed to send proper connection close
message and logged an error.
An auxiliary ngx_quic_init_secrets() function is introduced to avoid
a verbose call to ngx_quic_set_initial_secret() requiring a local variable.
All packet header parsing is now performed by ngx_quic_parse_packet()
function, located in the ngx_quic_transport.c file.
The packet processing is centralized in the ngx_quic_process_packet()
function which decides if the packet should be accepted, ignored or
connection should be closed, depending on the connection state.
As a result of refactoring, behavior has changed in some places:
- minimal size of Initial packet is now always tested
- connection IDs are always tested in existing connections
- old keys are discarded on encryption level switch
Now flags are processed in ngx_quic_input(), and raw->pos points to the first
byte after the flags. Redundant checks from ngx_quic_parse_short_header() and
ngx_quic_parse_long_header() are removed.
Previously, when a packet was declared lost, another packet was sent with the
same frames. Now lost frames are moved to the output frame queue and push
event is posted. This has the advantage of forming packets with more frames
than before.
Also, the start argument is removed from the ngx_quic_resend_frames()
function as excess information.
Previously the default server configuration context was used until the
:authority or host header was parsed. This led to using the configuration
parameters like client_header_buffer_size or request_pool_size from the default
server rather than from the server selected by SNI.
Also, the switch to the right server log is implemented. This issue manifested
itself as QUIC stream being logged to the default server log until :authority
or host is parsed.
Initially, client certificate verification didn't work due to the missing
hc->ssl on a QUIC stream, which started being set in 7738:7f0981be07c4.
Then it was lost in 7999:0d2b2664b41c, which introduced the "quic" listen
parameter.
This change re-adds hc->ssl for all QUIC connections, similar to SSL.
As per HTTP/3 draft 30, section 7.2.8:
    Frame types that were used in HTTP/2 where there is no corresponding
    HTTP/3 frame have also been reserved (Section 11.2.1).  These frame
    types MUST NOT be sent, and their receipt MUST be treated as a
    connection error of type H3_FRAME_UNEXPECTED.
In rare cases, such as memory allocation failure, SSL_set_SSL_CTX() returns
NULL, which could mean that a different SSL configuration has not been set.
Note that this new behaviour seemingly originated in the OpenSSL 1.1.0 release.
HTTP/2 code failed to run posted requests after calling the request body
handler, and this resulted in connection hang if a subrequest was created
in the body handler and no other actions were made.
If 400 errors were redirected to an upstream server using the error_page
directive, DATA frames from the client might cause segmentation fault
due to null pointer dereference. The bug had appeared in 6989:2c4dbcd6f2e4
(1.13.0).
Fix is to skip such frames in ngx_http_v2_state_read_data() (similarly
to 7561:9f1f9d6e056a). With the fix, behaviour of 400 errors in HTTP/2
is now similar to one in HTTP/1.x, that is, nginx doesn't try to read the
request body.
Note that proxying 400 errors, as well as other early stage errors, to
upstream servers might not be a good idea anyway. These errors imply
that reading and processing of the request (and the request headers)
wasn't complete, and proxying of such incomplete request might lead to
various errors.
Reported by Chenglong Zhang.
This fixes "SSL_shutdown() failed (SSL: ... bad write retry)" errors
as observed on the second SSL_shutdown() call after SSL shutdown fixes in
09fb2135a589 (1.19.2), notably when HTTP/2 connections are closed due
to read timeouts while there are incomplete writes.
This fixes "SSL_shutdown() failed (SSL: ... bad write retry)" errors
as observed on the second SSL_shutdown() call after SSL shutdown fixes in
09fb2135a589 (1.19.2), notably when sending fails in ngx_http_test_expect(),
similarly to ticket #1194.
Note that there are some places where c->error is misused to prevent
further output, such as ngx_http_v2_finalize_connection() if there
are pending streams, or in filter finalization. These places seem
to be extreme enough not to care about missing shutdown though.
For example, filter finalization currently prevents keepalive from
being used.
The c->read->ready and c->write->ready flags need to be cleared to ensure
that appropriate read or write events will be reported by kernel. Without
this, SSL shutdown might wait till the timeout after blocking on writing
or reading even if there is a socket activity.
OpenSSL 1.1.1 fails to return SSL_ERROR_SYSCALL if an error happens
during SSL_write() after close_notify alert from the peer, and returns
SSL_ERROR_ZERO_RETURN instead. Broken by this commit, which removes
the "i == 0" check around the SSL_RECEIVED_SHUTDOWN one:
https://git.openssl.org/?p=openssl.git;a=commitdiff;h=8051ab2
In particular, if a client closed the connection without reading
the response but with properly sent close_notify alert, this resulted in
unexpected "SSL_write() failed while ..." critical log message instead
of correct "SSL_write() failed (32: Broken pipe)" at the info level.
Since SSL_ERROR_ZERO_RETURN cannot be legitimately returned after
SSL_write(), the fix is to convert all SSL_ERROR_ZERO_RETURN errors
after SSL_write() to SSL_ERROR_SYSCALL.
If the variant hash doesn't match one we used as a secondary cache key,
we switch back to the original key. In this case, c->body_start was kept
updated from an existing cache node overwriting the new response value.
After file cache update, it led to discrepancy between a cache node and
cache file seen as critical errors "file cache .. has too long header".
As per HTTP/3 draft 29, section 4.1:
    Frames of unknown types (Section 9), including reserved frames
    (Section 7.2.8) MAY be sent on a request or push stream before,
    after, or interleaved with other frames described in this section.
Also, trailers frame is now used as an indication of the request body end.
While for HTTP/1 unexpected eof always means an error, for HTTP/3 an eof right
after a DATA frame end means the end of the request body. For this reason,
since adding HTTP/3 support, eof no longer produced an error right after recv()
but was passed to filters which would make a decision. This decision was made
in ngx_http_parse_chunked() and ngx_http_v3_parse_request_body() based on the
b->last_buf flag.
Now that, since 0f7f1a509113 (1.19.2), rb->chunked->length is a lower threshold
for the expected number of bytes, it can be set to zero to indicate that more
bytes may or may not follow. Now it's possible to move the check for eof from
parser functions to ngx_http_request_body_chunked_filter() and clean up the
parsing code.
Also, in the default branch, in case of eof, the following three things
happened, which were replaced with returning NGX_ERROR while implementing
HTTP/3:
- "client prematurely closed connection" message was logged
- c->error flag was set
- NGX_HTTP_BAD_REQUEST was returned
The change brings back this behavior for HTTP/1 as well as HTTP/3.
If a packet sent in response to an initial client packet was lost, then
successive client initial packets were dropped by nginx with the unexpected
dcid message logged. This was because the new DCID generated by the server was
not available to the client.
The check tested the total size of a packet header and unprotected packet
payload, which doesn't include the packet number length and expansion of
the packet protection AEAD. If the packet was corrupted, it could cause
false triggering of the condition due to unsigned type underflow leading
to a connection error.
Existing checks for the QUIC header and protected packet payload lengths
should be enough.
From quic-tls draft, section 5.4.2:
    An endpoint MUST discard packets that are not long enough to contain
    a complete sample.
The check includes the Packet Number field assumed to be 4 bytes long.
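A minimal sketch of such a length check follows. It assumes a 16-byte header protection sample, which is the sample size quic-tls defines for the AES and ChaCha20 based algorithms; the function name and the constants are hypothetical.

```c
#include <stddef.h>

#define PN_ASSUMED_LEN  4     /* Packet Number field assumed 4 bytes long */
#define HP_SAMPLE_LEN   16    /* assumed header protection sample size */

/*
 * len is the number of bytes available starting at the Packet Number
 * field.  Returns 1 if a complete sample can be taken, 0 if the packet
 * must be discarded.
 */
int
packet_has_full_sample(size_t len)
{
    return len >= PN_ASSUMED_LEN + HP_SAMPLE_LEN;
}
```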
During long packet header parsing, pkt->len is updated with the Length
field value that is used to find next coalesced packets in a datagram.
For short packets it still contained the whole QUIC packet size.
This change unifies packet length handling so that pkt->len always contains
the total length of the packet number and protected packet payload.
Previously STOP_SENDING was sent to client upon stream closure if rev->eof and
rev->error were not set. This was an indirect indication that no RESET_STREAM
or STREAM fin has arrived. But it is indeed possible that rev->eof is not set,
but STREAM fin has already been received, just not read out by the application.
In this case sending STOP_SENDING does not make sense and can be misleading for
some clients.
The peer may issue additional connection IDs up to the limit defined by
transport parameter "active_connection_id_limit", using NEW_CONNECTION_ID
frames, and retire such IDs using RETIRE_CONNECTION_ID frame.
It is required to distinguish internal errors from corrupted packets and
perform actions accordingly: drop the packet or close the connection.
While there, made processing of ngx_quic_decrypt() errors similar and
removed a couple of protocol violation errors.
quic-transport
5.2:
    Packets that are matched to an existing connection are discarded if
    the packets are inconsistent with the state of that connection.
5.2.2:
    Servers MUST drop incoming packets under all other circumstances.
The removal of QUIC packet protection depends on the largest packet number
received. When a garbage packet was received, the decoder still updated the
largest packet number from that packet. This could affect removing protection
from subsequent QUIC packets.
As per HTTP/3 draft 29, section 4.1:
    When the server does not need to receive the remainder of the request,
    it MAY abort reading the request stream, send a complete response, and
    cleanly close the sending part of the stream.
On QUIC connections, SSL_shutdown() is used to call the send_alert callback
to send a CONNECTION_CLOSE frame. The reverse side is handled by other means.
At least BoringSSL doesn't differentiate whether this is a QUIC SSL method,
so waiting for the peer's close_notify alert should be explicitly disabled.
The logical quic connection state is tested by handler functions that
process corresponding types of packets (initial/handshake/application).
The packet is declined if state is incorrect.
No timeout is required for the input queue.
If a client attempts to start a new connection with an unsupported version,
a version negotiation packet is sent that contains a list of supported
versions (currently this is a single version, selected at compile time).
The function ngx_http_upstream_check_broken_connection() terminates the HTTP/1
request if client sends eof. For QUIC (including HTTP/3) the c->write->error
flag is now checked instead. This flag is set when the entire QUIC connection
is closed or STOP_SENDING was received from client.
Previously the request body DATA frame header was read by one byte because
filters were called only when the requested number of bytes were read. Now,
after 08ff2e10ae92 (1.19.2), filters are called after each read. More bytes
can be read at once, which simplifies and optimizes the code.
This also reduces diff with the default branch.
Previously, such packets weren't handled as the resulting zero remaining time
prevented setting the loss detection timer, which, instead, could be disarmed.
For implementation details, see quic-recovery draft 29, appendix A.10.
The PTO handler is split into separate PTO and loss detection handlers
that operate interchangeably depending on which timer should be set.
The present ngx_quic_lost_handler is now only used for packet loss detection.
It replaces ngx_quic_pto_handler if there are packets preceding largest_ack.
Once there are no more such packets, ngx_quic_pto_handler is installed again.
Probes carry unacknowledged data previously sent in the oldest packet number,
one per each packet number space. That is, it could be up to two probes.
PTO backoff is now increased before scheduling next probes.
In particular, this prevents declaring packet number 0 as lost if
there aren't yet any acknowledgements in this packet number space.
For example, only Initial packets were acknowledged in handshake.
Previously a single STREAM frame was created for each buffer in stream output
chain which is wasteful with respect to memory. The following changes were
made in the stream send code:
- ngx_quic_stream_send_chain() no longer calls ngx_quic_stream_send() and got
a separate implementation that coalesces neighbouring buffers into a single
frame
- the new ngx_quic_stream_send_chain() respects the limit argument, which fixes
sendfile_max_chunk and limit_rate
- ngx_quic_stream_send() is reimplemented to call ngx_quic_stream_send_chain()
- stream frame size limit is moved out to a separate function
ngx_quic_max_stream_frame()
- flow control is moved out to a separate function ngx_quic_max_stream_flow()
- ngx_quic_stream_send_chain() is relocated next to ngx_quic_stream_send()
Reworked connections reuse, so closing connections is attempted in
advance, as long as number of free connections is less than 1/16 of
worker connections configured. This ensures that new connections can
be handled even if closing a reusable connection requires some time,
for example, for a lingering close (ticket #2017).
The 1/16 ratio is selected to be smaller than 1/8 used for disabling
accept when working with accept mutex, so nginx will try to balance
new connections to different workers first, and will start reusing
connections only if this won't help.
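The two thresholds can be compared with a short sketch. The helper names are hypothetical, and the comparisons follow the description above rather than the exact nginx code.

```c
#include <stddef.h>

/* With accept mutex: stop accepting once free connections drop below 1/8. */
int
accept_disabled(size_t free_connections, size_t worker_connections)
{
    return free_connections < worker_connections / 8;
}

/* Start closing reusable connections in advance below 1/16. */
int
reuse_needed(size_t free_connections, size_t worker_connections)
{
    return free_connections < worker_connections / 16;
}
```

For example, with 1024 worker connections and 100 free ones, accept is already disabled in this worker (100 < 128), but reusing connections does not start until fewer than 64 are free.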
Previously, reusing connections happened silently and was only
visible in monitoring systems. This was shown to be not very user-friendly,
and administrators often didn't realize there were too few connections
available to withstand the load, and configured timeouts (keepalive_timeout
and http2_idle_timeout) were effectively reduced to keep things running.
To provide at least some information about this, a warning is now logged
(at most once per second, to avoid flooding the logs).
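The once-per-second limit can be sketched by remembering the second in which the warning was last logged. The variable and function names are hypothetical; in nginx the cached time would come from ngx_time().

```c
#include <time.h>

/* Hypothetical state: the second in which the warning was last logged. */
static time_t  reuse_warning_time;

/* Returns 1 if the warning should be logged now, 0 to stay quiet. */
int
reuse_warning_allowed(time_t now)
{
    if (reuse_warning_time == now) {
        return 0;
    }

    reuse_warning_time = now;

    return 1;
}
```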
Sending shutdown when ngx_http_test_reading() detects the connection is
closed can result in "SSL_shutdown() failed (SSL: ... bad write retry)"
critical log messages if there are blocked writes.
Fix is to avoid sending shutdown via the c->ssl->no_send_shutdown flag,
similarly to how it is done in ngx_http_keepalive_handler() for kqueue
when pending EOF is detected.
Reported by Jan Prachař
(http://mailman.nginx.org/pipermail/nginx-devel/2018-December/011702.html).
Without the flag, SSL shutdown is attempted on such connections,
resulting in useless work and/or bogus "SSL_shutdown() failed
(SSL: ... bad write retry)" critical log messages if there are
blocked writes.
Previously, bidirectional shutdown never worked, due to two issues
in the code:
1. The code only tested SSL_ERROR_WANT_READ and SSL_ERROR_WANT_WRITE
when there was an error in the error queue, which cannot happen.
The bug was introduced in an attempt to fix unexpected error logging
as reported with OpenSSL 0.9.8g
(http://mailman.nginx.org/pipermail/nginx/2008-January/003084.html).
2. The code never called SSL_shutdown() for the second time to wait for
the peer's close_notify alert.
This change fixes both issues.
Note that after this change bidirectional shutdown is expected to work for
the first time, so c->ssl->no_wait_shutdown now makes a difference. This
is not a problem for HTTP code which always uses c->ssl->no_wait_shutdown,
but might be a problem for stream and mail code, as well as 3rd party
modules.
To minimize the effect of the change, the timeout, which used to be 30
seconds and not configurable, though never actually used, is now set to
3 seconds. It is also expanded to apply to both SSL_ERROR_WANT_READ and
SSL_ERROR_WANT_WRITE, so timeout is properly set if writing to the socket
buffer is not possible.
If some additional data from a pipelined request happens to be
read into the body buffer, we copy it to r->header_in or allocate
an additional large client header buffer for it.
This ensures that copying won't write more than the buffer size
even if the buffer comes from hc->free and it is smaller than the large
client header buffer size in the virtual host configuration. This might
happen if size of large client header buffers is different in name-based
virtual hosts, similarly to the problem with number of buffers fixed
in 6926:e662cbf1b932.
Creating client-initiated streams is moved from ngx_quic_handle_stream_frame()
to a separate function ngx_quic_create_client_stream(). This function is
responsible for creating streams with lower ids as well.
Also, simplified and fixed initial data buffering in
ngx_quic_handle_stream_frame(). It is now done before calling the initial
handler as the handler can destroy the stream.
Previously this function generated an error trying to figure out if client shut
down the write end of the connection. The reason for this error was that a
QUIC stream has no socket descriptor. However checking for eof is not the
right thing to do for an HTTP/3 QUIC stream since HTTP/3 clients are expected
to shut down the write end of the stream after sending the request.
Now the function handles QUIC streams separately. It checks if c->read->error
is set. The error flags for c->read and c->write are now set for all streams
when closing the QUIC connection instead of setting the pending_eof flag.
According to quic-transport draft 29, section 19.3.1:
    The value of the Gap field establishes the largest packet number
    value for the subsequent ACK Range using the following formula:
        largest = previous_smallest - gap - 2
    Thus, given a largest packet number for the range, the smallest value
    is determined by the formula:
        smallest = largest - ack_range
While here, changed min/max to uint64_t for consistency.
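The two formulas can be applied step by step as in the sketch below. The type and function names are hypothetical; the only logic beyond the quoted formulas is the check that rejects ranges that would go below packet number 0.

```c
#include <stdint.h>

typedef struct {
    uint64_t  smallest;
    uint64_t  largest;
    int       valid;      /* 0 if the range would underflow */
} ack_span_t;

/* Computes the ACK range that follows the one ending at prev_smallest. */
ack_span_t
ack_next_range(uint64_t prev_smallest, uint64_t gap, uint64_t ack_range)
{
    ack_span_t  s = { 0, 0, 0 };

    /* largest = previous_smallest - gap - 2 must not underflow */
    if (gap > prev_smallest || prev_smallest - gap < 2) {
        return s;
    }

    s.largest = prev_smallest - gap - 2;

    /* smallest = largest - ack_range must not underflow either */
    if (ack_range > s.largest) {
        s.largest = 0;
        return s;
    }

    s.smallest = s.largest - ack_range;
    s.valid = 1;

    return s;
}
```

For example, if the previous range ended at packet 18, then gap=1 and ack_range=3 describe packets 12 to 15, leaving 16 and 17 unacknowledged.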
A QUIC stream could be destroyed by handler while in ngx_quic_stream_input().
To detect this, ngx_quic_find_stream() is used to check that it still exists.
Previously, a stream id was passed to this routine off the frame structure.
In case of stream cleanup, it is freed along with other frames belonging to
the stream on cleanup. Then, a cleanup handler reuses last frames to update
MAX_STREAMS and serve other purposes. Thus, ngx_quic_find_stream() is passed
a reused frame with the part pointed to by stream_id zeroed out. If a stream
with id 0x0 still exists, this leads to use-after-free.
After 05e42236e95b (1.19.1) responses with extra data might result in
zero size buffers being generated and "zero size buf" alerts in writer
(if f->rest happened to be 0 when processing additional stdout data).
The limits on active bidi and uni client streams are maintained at their
initial values initial_max_streams_bidi and initial_max_streams_uni by sending
a MAX_STREAMS frame upon each client stream closure.
Also, the following is changed for data arriving to non-existing streams:
- if a stream was already closed, such data is ignored
- when creating a new stream, all streams of the same type with lower ids are
created too
Previously, the document generated by the xslt filter was always fully sent
to client even if a range was requested and response status was 206 with
appropriate Content-Range.
The xslt module is unable to serve a range because of suspending the header
filter chain. By the moment full response xml is buffered by the xslt filter,
range header filter is not called yet, but the range body filter has already
been called and did nothing.
The fix is to disable ranges by resetting the r->allow_ranges flag much like
the image filter that employs a similar technique.
The ngx_http_perl_module module doesn't have a notion of including additional
search paths through --with-cc-opt, which results in the compile error
"incomplete type 'enum ssl_encryption_level_t'" when building nginx without
QUIC support.
The enum is visible from quic event headers and eventually pollutes ngx_core.h.
The fix is to limit including headers to compile units that are real consumers.
According to quic-transport draft 29, section 21.12.1.1:
    Prior to validation, endpoints are limited in what they are able to
    send.  During the handshake, a server cannot send more than three
    times the data it receives; clients that initiate new connections or
    migrate to a new network path are limited.
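The three-times limit quoted above can be expressed as a remaining send budget. This is a sketch only; the function name and the way byte counters are passed in are assumptions, not the nginx implementation.

```c
#include <stdint.h>

/*
 * Returns how many more bytes the server may send to a client whose
 * address is not yet validated, given the bytes received from it and
 * the bytes already sent to it.
 */
uint64_t
amplification_budget(uint64_t received, uint64_t sent)
{
    uint64_t  limit;

    limit = received * 3;

    return (sent >= limit) ? 0 : limit - sent;
}
```

For example, after receiving a single 1200-byte datagram the server may send up to 3600 bytes before it has to wait for more data from the client.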
The ngx_quic_queue_frame() function puts a frame into the send queue and
schedules a push timer to actually send data.
The patch adds tracking of the amount of data in the queue and sends data
immediately if the amount exceeds the limit.
Instead of timer-based retransmissions with constant packet lifetime,
this patch implements ack-based loss detection and probe timeout
for cases when no ack is received, according to the quic-recovery
draft 29.
The c->quic->retransmit timer is now called "pto".
The ngx_quic_retransmit() function is renamed to "ngx_quic_detect_lost()".
This is a preparation for the following patches.
According to quic-recovery draft 29, section 5 (Estimating the Round-Trip Time).
Currently, integer arithmetic is used, which loses sub-millisecond accuracy.
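The sketch below shows the estimator with integer millisecond arithmetic and the resulting loss of accuracy. The formulas (7/8 and 3/4 weights, ack delay adjustment) are taken from the quic-recovery draft as recalled here; the function names are hypothetical, and the exact comparison used for the ack delay adjustment should be checked against the draft.

```c
#include <stdint.h>

/* Subtracts ack_delay unless that would go below min_rtt. */
uint64_t
rtt_adjust(uint64_t latest_rtt, uint64_t min_rtt, uint64_t ack_delay)
{
    if (latest_rtt > min_rtt + ack_delay) {
        return latest_rtt - ack_delay;
    }

    return latest_rtt;
}

/* rttvar = 3/4 * rttvar + 1/4 * |smoothed_rtt - adjusted_rtt| */
uint64_t
rtt_var_next(uint64_t rttvar, uint64_t smoothed_rtt, uint64_t adjusted_rtt)
{
    uint64_t  sample;

    sample = (smoothed_rtt > adjusted_rtt) ? smoothed_rtt - adjusted_rtt
                                           : adjusted_rtt - smoothed_rtt;

    return (3 * rttvar + sample) / 4;
}

/* smoothed_rtt = 7/8 * smoothed_rtt + 1/8 * adjusted_rtt */
uint64_t
rtt_smoothed_next(uint64_t smoothed_rtt, uint64_t adjusted_rtt)
{
    return (7 * smoothed_rtt + adjusted_rtt) / 8;
}
```

With a smoothed RTT of 100 ms and an adjusted sample of 110 ms, the exact result is 101.25 ms, but integer division yields 101 ms. The variance must be updated before the smoothed RTT, since it uses the previous smoothed value.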
The slice filter allows ranges for the response by setting the r->allow_ranges
flag, which enables the range filter. If the range was not requested, the
range filter adds an Accept-Ranges header to the response to signal the
support for ranges.
Previously, if an Accept-Ranges header was already present in the first slice
response, client received two copies of this header. Now, the slice filter
removes the Accept-Ranges header from the response prior to setting the
r->allow_ranges flag.
As long as the "Content-Length" header is given, we now make sure
it exactly matches the size of the response. If it doesn't,
the response is considered malformed and must not be forwarded
(https://tools.ietf.org/html/rfc7540#section-8.1.2.6). While it
is not really possible to "not forward" the response which is already
being forwarded, we generate an error instead, which is the closest
equivalent.
Previous behaviour was to pass everything to the client, but this
seems to be suboptimal and causes issues (ticket #1695). Also this
directly contradicts HTTP/2 specification requirements.
Note that the new behaviour for the gRPC proxy is more strict than that
applied in other variants of proxying. This is intentional, as HTTP/2
specification requires us to do so, while in other types of proxying
malformed responses from backends are well known and historically
tolerated.
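The strict length check can be sketched as a comparison of the declared Content-Length with the number of body bytes received so far. The names below are hypothetical and the sketch ignores HTTP/2 framing details.

```c
#include <stdint.h>

typedef enum {
    BODY_OK = 0,
    BODY_TOO_LONG,
    BODY_TOO_SHORT
} body_status_t;

/*
 * content_length is -1 if the header was not given.  end_of_stream is
 * non-zero once the upstream has finished sending the response body.
 */
body_status_t
body_length_check(int64_t content_length, int64_t received, int end_of_stream)
{
    if (content_length < 0) {
        return BODY_OK;
    }

    if (received > content_length) {
        return BODY_TOO_LONG;
    }

    if (end_of_stream && received < content_length) {
        return BODY_TOO_SHORT;
    }

    return BODY_OK;
}
```

In this sketch both BODY_TOO_LONG and BODY_TOO_SHORT correspond to a malformed response for which an error is generated.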
Previous behaviour was to pass everything to the client, but this
seems to be suboptimal and causes issues (ticket #1695). Fix is to
drop extra data instead, as it naturally happens in most clients.
Additionally, we now also issue a warning if the response is too
short, and make sure the fact it is truncated is propagated to the
client. The u->error flag is introduced to make it possible to
propagate the error to the client in case of unbuffered proxying.
For responses to HEAD requests there is an exception: we do allow
both responses without body and responses with body matching the
Content-Length header.
Previous behaviour was to pass everything to the client, but this
seems to be suboptimal and causes issues (ticket #1695). Fix is to
drop extra data instead, as it naturally happens in most clients.
This change covers generic buffered and unbuffered filters as used
in the scgi and uwsgi modules. Appropriate input filter init
handlers are provided by the scgi and uwsgi modules to set corresponding
lengths.
Note that for responses to HEAD requests there is an exception:
we do allow any response length. This is because responses to HEAD
requests might be actual full responses, and it is up to nginx
to remove the response body. If caching is enabled, only full
responses matching the Content-Length header will be cached
(see b779728b180c).
Previously, additional data after final chunk was either ignored
(in the same buffer, or during unbuffered proxying) or sent to the
client (in the next buffer, if it had already been read from the
socket). Now additional data are properly detected and ignored
in all cases. Additionally, a warning is now logged and keepalive
is disabled in the connection.
Previous behaviour was to pass everything to the client, but this
seems to be suboptimal and causes issues (ticket #1695). Fix is to
drop extra data instead, as it naturally happens in most clients.
If a memcached response was followed by a correct trailer, and then by
the NUL character followed by some extra data, this was accepted by
the trailer checking code. This in turn resulted in ctx->rest underflow
and caused negative size buffer on the next reading from the upstream,
followed by the "negative size buf in writer" alert.
Fix is to always check for too long responses, so a correct trailer cannot
be followed by extra data.
After sending the GOAWAY frame, a connection is now closed using
the lingering close mechanism.
This allows for the reliable delivery of the GOAWAY frames, while
also fixing connection resets observed when http2_max_requests is
reached (ticket #1250), or with graceful shutdown (ticket #1544),
when some additional data from the client is received on a fully
closed connection.
For HTTP/2, the settings lingering_close, lingering_timeout, and
lingering_time are taken from the "server" level.
Previously, the expression (ch & 0x7f) was promoted to a signed integer.
Depending on the platform, the size of this integer could be less than 8 bytes,
leading to overflow when handling the higher bits of the result. Also, sign
bit of this integer could be replicated when adding to the 64-bit st->value.
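The promotion issue can be shown with a sketch that accumulates the continuation bytes of a prefixed integer. The function name is hypothetical and overflow detection is omitted; the point is the cast to uint64_t before the shift.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sums the 7-bit groups of the continuation bytes (least significant
 * group first).  The caller adds the result to the prefix value.
 */
uint64_t
prefix_int_tail(const uint8_t *p, size_t len)
{
    size_t    i;
    unsigned  shift;
    uint64_t  value;

    value = 0;
    shift = 0;

    for (i = 0; i < len && shift < 64; i++) {

        /*
         * (p[i] & 0x7f) alone has type int; shifting it by 28 or more
         * bits could overflow a 32-bit int, so widen it first.
         */
        value += (uint64_t) (p[i] & 0x7f) << shift;
        shift += 7;

        if ((p[i] & 0x80) == 0) {
            break;
        }
    }

    return value;
}
```

Without the cast, a fifth continuation byte of 0x10 would be shifted left by 28 bits as an int, which does not fit in 32 bits.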
Previously errors led only to closing streams.
To simplify closing QUIC connection from a QUIC stream context, new macro
ngx_http_v3_finalize_connection() is introduced. It calls
ngx_quic_finalize_connection() for the parent connection.
The function finalizes QUIC connection with an application protocol error
code and sends a CONNECTION_CLOSE frame with type=0x1d.
Also, renamed NGX_QUIC_FT_CONNECTION_CLOSE2 to NGX_QUIC_FT_CONNECTION_CLOSE_APP.
Previously dynamic table was not functional because of zero limit on its size
set by default. Now the following changes enable it:
- new directives to set SETTINGS_QPACK_MAX_TABLE_CAPACITY and
SETTINGS_QPACK_BLOCKED_STREAMS
- send settings with SETTINGS_QPACK_MAX_TABLE_CAPACITY and
SETTINGS_QPACK_BLOCKED_STREAMS to the client
- send Insert Count Increment to the client
- send Header Acknowledgement to the client
- evict old dynamic table entries on overflow
- decode Required Insert Count from client
- block stream if Required Insert Count is not reached
Using SSL_CTX_set_verify(SSL_VERIFY_PEER) implies that OpenSSL will
send a certificate request during an SSL handshake, leading to unexpected
certificate requests from browsers as long as there are any client
certificates installed. Given that ngx_ssl_trusted_certificate()
is called unconditionally by the ngx_http_ssl_module, this affected
all HTTPS servers. Broken by 699f6e55bbb4 (not released yet).
Fix is to set verify callback in the ngx_ssl_trusted_certificate() function
without changing the verify mode.
Client streams may send literal strings which are now limited in size by the
new directive. The default value is 4096.
The directive is similar to HTTP/2 directive http2_max_field_size.
This is done so that connections are protected from failing due to on-path
attacks.
Decryption failure of long packets used during handshake still leads
to connection close since it barely makes sense to handle them there.
A previously used undefined error code is now replaced with the generic one.
Note that quic-transport prescribes keeping connection intact, discarding such
QUIC packets individually, in the sense that coalesced packets could be there.
This is selectively handled in the next change.
The patch removes remnants of the old state tracking mechanism, which did
not take into account the asymmetry of read/write states and was not very
useful.
The encryption state now is entirely tracked using SSL_quic_read/write_level().
quic-transport draft 29:
section 7:
    * authenticated negotiation of an application protocol (TLS uses
      ALPN [RFC7301] for this purpose)
    ...
    Endpoints MUST explicitly negotiate an application protocol.  This
    avoids situations where there is a disagreement about the protocol
    that is in use.
section 8.1:
    When using ALPN, endpoints MUST immediately close a connection (see
    Section 10.3 of [QUIC-TRANSPORT]) with a no_application_protocol TLS
    alert (QUIC error code 0x178; see Section 4.10) if an application
    protocol is not negotiated.
Changes in the ngx_quic_close_quic() function are required to avoid attempts
to generate and send packets without proper keys, which happens in case
of a failed ALPN check.
quic-transport draft 29, section 14:
    QUIC depends upon a minimum IP packet size of at least 1280 bytes.
    This is the IPv6 minimum size [RFC8200] and is also supported by most
    modern IPv4 networks.  Assuming the minimum IP header size, this
    results in a QUIC maximum packet size of 1232 bytes for IPv6 and 1252
    bytes for IPv4.
Since the packet size can change during connection lifetime, the
ngx_quic_max_udp_payload() function is introduced that currently
returns minimal allowed size, depending on address family.
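The arithmetic behind the quoted sizes can be sketched as follows. This is a minimal illustration, not the actual ngx_quic_max_udp_payload() code: the function name quic_min_udp_payload() and all constants are hypothetical, and the sketch assumes minimum IP header sizes (20 bytes for IPv4, 40 bytes for IPv6) plus the 8-byte UDP header.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>   /* AF_INET, AF_INET6 */

/* Hypothetical names, for illustration only. */
#define QUIC_MIN_IP_PACKET  1280   /* minimum IP packet size QUIC depends on */
#define IPV4_MIN_HEADER       20
#define IPV6_HEADER           40
#define UDP_HEADER             8

/* Returns the minimal allowed UDP payload size for an address family. */
size_t
quic_min_udp_payload(int family)
{
    assert(family == AF_INET || family == AF_INET6);

    if (family == AF_INET6) {
        return QUIC_MIN_IP_PACKET - IPV6_HEADER - UDP_HEADER;     /* 1232 */
    }

    return QUIC_MIN_IP_PACKET - IPV4_MIN_HEADER - UDP_HEADER;     /* 1252 */
}
```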
quic-tls, 8.2:
The quic_transport_parameters extension is carried in the ClientHello
and the EncryptedExtensions messages during the handshake. Endpoints
MUST send the quic_transport_parameters extension; endpoints that
receive ClientHello or EncryptedExtensions messages without the
quic_transport_parameters extension MUST close the connection with an
error of type 0x16d (equivalent to a fatal TLS missing_extension
alert, see Section 4.10).
Clearing cache based on free space left on a file system is
expected to allow better disk utilization in some cases, notably
when disk space might be also used for something other than nginx
cache (including nginx own temporary files) and while loading
cache (when cache size might be inaccurate for a while, effectively
disabling max_size cache clearing).
Based on a patch by Adam Bambuch.
With XFS, using "allocsize=64m" mount option results in large preallocation
being reported in the st_blocks as returned by fstat() till the file is
closed. This in turn results in incorrect cache size calculations and
wrong clearing based on max_size.
To avoid too aggressive cache clearing on such volumes, st_blocks values
which result in sizes larger than st_size and eight blocks (an arbitrary
limit) are no longer trusted, and we use st_size instead.
The ngx_de_fs_size() counterpart is intentionally not modified, as
it is used on closed files and hence not affected by this problem.
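The rule above might be sketched roughly as below. The function name file_fs_size() is hypothetical, st_blocks is assumed to count 512-byte units (as on Linux), and "eight blocks" is assumed to mean eight times st_blksize; the actual macro in nginx may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns the size to account for a cached file.  The allocated size
 * derived from st_blocks is trusted only if it exceeds st_size by less
 * than eight blocks; otherwise st_size is used.
 */
int64_t
file_fs_size(int64_t st_size, int64_t st_blocks, int64_t st_blksize)
{
    int64_t  allocated;

    assert(st_size >= 0 && st_blocks >= 0 && st_blksize > 0);

    allocated = st_blocks * 512;

    if (allocated > st_size && allocated < st_size + 8 * st_blksize) {
        return allocated;
    }

    /* large preallocation (or a sparse file): fall back to st_size */
    return st_size;
}
```

For example, a 1000-byte file reporting 64m of preallocated blocks is accounted as 1000 bytes, while the same file reporting a single 4096-byte block is accounted as 4096 bytes.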
NFS on Linux is known to report wsize as a block size (in both f_bsize
and f_frsize, both in statfs() and statvfs()). On the other hand,
typical file system block sizes on Linux (ext2/ext3/ext4, XFS) are limited
to pagesize. (With FAT, block sizes can be at least up to 512k in
extreme cases, but this doesn't really matter, see below.)
To avoid too aggressive cache clearing on NFS volumes on Linux, block
sizes larger than pagesize are now ignored.
Note that it is safe to ignore large block sizes. Since 3899:e7cd13b7f759
(1.0.1) cache size is calculated based on fstat() st_blocks, and rounding
to file system block size is preserved mostly for Windows.
Note well that on other OSes valid block sizes seen are at least up
to 65536. In particular, UFS on FreeBSD is known to work well with block
and fragment sizes set to 65536.
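A minimal sketch of the Linux-specific rule described above. The function name fs_block_size() and the 512-byte fallback are assumptions made for illustration; the text only states that block sizes larger than pagesize are ignored.

```c
#include <assert.h>
#include <stddef.h>

#define FS_FALLBACK_BLOCK_SIZE  512   /* assumed fallback value */

/*
 * Returns the block size used for rounding cache file sizes.
 * Values above the page size (for example, NFS wsize) are ignored.
 */
size_t
fs_block_size(size_t reported, size_t pagesize)
{
    assert(pagesize > 0);

    if (reported > pagesize) {
        return FS_FALLBACK_BLOCK_SIZE;
    }

    return reported;
}
```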
When validating second and further certificates, ssl callback could be called
twice to report the error. After the first call client connection is
terminated and its memory is released. Before and during the second call,
the released connection memory is accessed.
Errors triggering this behavior:
- failure to create the request
- failure to start resolving OCSP responder name
- failure to start connecting to the OCSP responder
The fix is to rearrange the code to eliminate the second call.
The flush flag was not set when forwarding the request body to the uwsgi
server. When using uwsgi_pass suwsgi://..., this causes the uwsgi server
to wait indefinitely for the request body and eventually time out due to
SSL buffering.
This is essentially the same change as 4009:3183165283cc, which was made
to ngx_http_proxy_module.c.
This will fix the uwsgi bug https://github.com/unbit/uwsgi/issues/1490.
This is a temporary workaround; a proper retransmission mechanism based on
the quic-recovery draft is yet to be implemented.
The currently hardcoded value is too small for real networks. The patch
sets a static PTO, considering an rtt of ~333ms, which gives about 1s.
This ensures that certificate verification is properly logged to debug
log during upstream server certificate verification. This should help
with debugging various certificate issues.
Listening UNIX sockets were not removed on graceful shutdown, preventing
the next runs. The fix is to replace the custom socket closing code in
ngx_master_process_cycle() by the ngx_close_listening_sockets() call.
When changing binary, sending a SIGTERM to the new binary's master process
should not remove inherited UNIX sockets unless the old binary's master
process has exited.
Also, if both are present, require that they have the same value. These
requirements are specified in HTTP/3 draft 28.
Current implementation of HTTP/2 treats ":authority" and "Host"
interchangeably. New checks only make sure at least one of these values is
present in the request. A similar check existed earlier and was limited only
to HTTP/1.1 in 38c0898b6df7.
The flag was originally added by 8f038068f4bc, and is propagated correctly
in the stream module. With QUIC introduction, http module now uses datagram
sockets as well, thus the fix.
Previously, invalid connection preface errors were only logged at debug
level, providing no visible feedback, in particular, when a plain text
HTTP/2 listening socket is erroneously used for HTTP/1.x connections.
Now these are explicitly logged at the info level, much like other
client-related errors.
When enabled, certificate status is stored in cache and is used to validate
the certificate in future requests.
New directive ssl_ocsp_cache is added to configure the cache.
OCSP validation for client certificates is enabled by the "ssl_ocsp" directive.
OCSP responder can be optionally specified by "ssl_ocsp_responder".
When session is reused, peer chain is not available for validation.
If the verified chain contains certificates from the peer chain not available
at the server, validation will fail.
Previously only the first responder address was used per each stapling update.
Now, in case of a network or parsing error, next address is used.
This also fixes the issue with unsupported responder address families
(ticket #1330).
Preserving pointers within the client buffer is not needed for HTTP/3 because
all data is either allocated from pool or static. Unlike with HTTP/1, data
typically cannot be referenced directly within the client buffer. Trying to
preserve NULLs or external pointers leads to broken pointers.
Also, reverted changes in ngx_http_alloc_large_header_buffer() not relevant
for HTTP/3 to minimize diff to mainstream.
New field r->parse_start is introduced to substitute r->request_start and
r->header_name_start for request length accounting. These fields only work for
this purpose in HTTP/1 because HTTP/1 request line and header line start with
these values.
Also, error logging is now fixed to output the right part of the request.
As per HTTP/3 draft 27, a request or response containing uppercase header
field names MUST be treated as malformed. Also, existing rules applied
when parsing HTTP/1 header names are also applied to HTTP/3 header names:
- null character is not allowed
- underscore character may or may not be treated as invalid depending on the
value of "underscores_in_headers"
- all non-alphanumeric characters with the exception of '-' are treated as
invalid
Also, the r->locase_header field is now filled while parsing an HTTP/3
header.
Error logging for invalid headers is fixed as well.
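The header name rules listed above can be sketched as a simple classifier. This is an illustration only: the function name h3_header_name_valid() is hypothetical, rejecting an empty name is an assumption, and the sketch does not cover what nginx does with an invalid name afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true if the name consists only of lowercase letters, digits
 * and '-', plus '_' when underscores are allowed.  Uppercase letters,
 * the null character and any other character make the name invalid.
 */
bool
h3_header_name_valid(const unsigned char *name, size_t len,
    bool underscores_in_headers)
{
    size_t         i;
    unsigned char  ch;

    assert(name != NULL);

    if (len == 0) {
        return false;   /* assumption: an empty name is invalid */
    }

    for (i = 0; i < len; i++) {
        ch = name[i];

        if ((ch >= 'a' && ch <= 'z') || (ch >= '0' && ch <= '9')
            || ch == '-')
        {
            continue;
        }

        if (ch == '_' && underscores_in_headers) {
            continue;
        }

        return false;
    }

    return true;
}
```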
The first one parses pseudo-headers and is analogous to the request line
parser in HTTP/1. The second one parses regular headers and is analogous to
the header parser in HTTP/1.
Additionally, error handling of client passing malformed uri is now fixed.
The function ngx_http_parse_chunked() is also called from the proxy module to
parse the upstream response. It should always parse HTTP/1 body in this case.
According to quic-transport draft 28 section 10.3.1:
When sending CONNECTION_CLOSE, the goal is to ensure that the peer
will process the frame. Generally, this means sending the frame in a
packet with the highest level of packet protection to avoid the
packet being discarded. After the handshake is confirmed (see
Section 4.1.2 of [QUIC-TLS]), an endpoint MUST send any
CONNECTION_CLOSE frames in a 1-RTT packet. However, prior to
confirming the handshake, it is possible that more advanced packet
protection keys are not available to the peer, so another
CONNECTION_CLOSE frame MAY be sent in a packet that uses a lower
packet protection level.
There is no need for a separate type for the QUIC connection state.
The only state not found in the SSL library is NGX_QUIC_ST_UNAVAILABLE,
which is actually a flag used by the ngx_quic_close_quic() function
to prevent cleanup of uninitialized connection.
Sections 4.10.1 and 4.10.2 of quic transport describe discarding of initial
and handshake keys. Since the keys are discarded, we no longer need
to retransmit packets and corresponding queues should be emptied.
This patch removes previously added workaround that did not require
acknowledgement for initial packets, resulting in avoiding retransmission,
which is wrong because a packet could be lost and we have to retransmit it.
It was possible that the retransmit timer was not set after the first
retransmission attempt, because ngx_quic_retransmit() did not set the
wait time properly, and the condition in the retransmit handler was incorrect.
Section 17.2 and 17.3 of QUIC transport:
Fixed bit: Packets containing a zero value for this bit are not
valid packets in this version and MUST be discarded.
Reserved bit: An endpoint MUST treat receipt of a packet that has
a non-zero value for these bits, after removing both packet and
header protection, as a connection error of type PROTOCOL_VIOLATION.
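A sketch of the two checks quoted above, applied to the first byte of a packet after header protection has been removed. The names are hypothetical; the bit masks follow the long and short header layouts of QUIC transport (fixed bit 0x40, reserved bits 0x0c in long headers and 0x18 in short headers). Version Negotiation packets are not considered here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUIC_PKT_LONG            0x80   /* header form bit */
#define QUIC_PKT_FIXED_BIT       0x40
#define QUIC_PKT_LONG_RESERVED   0x0c
#define QUIC_PKT_SHORT_RESERVED  0x18

typedef enum {
    QUIC_FLAGS_OK,
    QUIC_FLAGS_DISCARD,              /* invalid packet, drop silently */
    QUIC_FLAGS_PROTOCOL_VIOLATION    /* connection error */
} quic_flags_rc_t;

/* Checks the fixed and reserved bits of an unprotected packet header. */
quic_flags_rc_t
quic_check_flags(const uint8_t *data, size_t len)
{
    uint8_t  flags, reserved;

    assert(data != NULL);

    if (len == 0) {
        return QUIC_FLAGS_DISCARD;
    }

    flags = data[0];

    if (!(flags & QUIC_PKT_FIXED_BIT)) {
        return QUIC_FLAGS_DISCARD;
    }

    reserved = (flags & QUIC_PKT_LONG) ? QUIC_PKT_LONG_RESERVED
                                       : QUIC_PKT_SHORT_RESERVED;

    if (flags & reserved) {
        return QUIC_FLAGS_PROTOCOL_VIOLATION;
    }

    return QUIC_FLAGS_OK;
}
```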
When an error occurs, the c->quic->error field may be populated
with an appropriate error code, and the CONNECTION CLOSE frame will be
sent to the peer before the connection is closed. Otherwise, the error
is treated as internal and the INTERNAL_ERROR code is sent.
The pkt->error field is populated by functions processing packets to
indicate an error when it does not fit into pass/fail return status.
As per QUIC transport, the first flight of 0-RTT packets obviously uses same
Destination and Source Connection ID values as the client's first Initial.
The fix is to match 0-RTT against original DCID after it has been switched.
The ordered frame handler is always called for the existing stream, as it is
allocated from this stream. Instead of searching stream by id, pointer to the
stream node is passed.
The idea is to skip any zeroes that follow valid QUIC packet. Currently such
behavior can be only observed with Firefox which sends zero-padded initial
packets.
Now there's no need to annotate every frame in ACK-eliciting packet.
Sending ACK was moved to the first place, so that queueing an ACK frame
is no longer postponed up to the next packet after pushing STREAM frames.
+ added "quic" prefix to all error messages
+ rephrased some messages
+ removed excessive error logging from frame parser
+ added ngx_quic_check_peer() function to check proper source/destination
match and do it in one place
- the ngx_quic_hexdump0() macro is renamed to ngx_quic_hexdump();
the original ngx_quic_hexdump() macro with variable argument is
removed, extra information is logged normally, with ngx_log_debug()
- all labels in hex dumps are prefixed with "quic"
- the hexdump format is simplified, length is moved forward to avoid
situations when the dump is truncated, and length is not shown
- ngx_quic_flush_flight() function contents is debug-only, placed under
NGX_DEBUG macro to avoid "unused variable" warnings from compiler
- frame names in labels are capitalized, similar to other places
+ all dumps are moved under one of the following macros (undefined by default):
NGX_QUIC_DEBUG_PACKETS
NGX_QUIC_DEBUG_FRAMES
NGX_QUIC_DEBUG_FRAMES_ALLOC
NGX_QUIC_DEBUG_CRYPTO
+ all QUIC debug messages got "quic " prefix
+ all input frames are reported as "quic frame in FOO_FRAME bar:1 baz:2"
+ all outgoing frames are reported as "quic frame out foo bar baz"
+ all stream operations are prefixed with id, like: "quic stream id 0x33 recv"
+ all transport parameters are prefixed with "quic tp"
(hex dump is moved to caller, to avoid using ngx_cycle->log)
+ packet flags and some other debug messages are updated to
include packet type
As per https://tools.ietf.org/html/rfc7540#section-8.1,
: A server can send a complete response prior to the client
: sending an entire request if the response does not depend on
: any portion of the request that has not been sent and
: received. When this is true, a server MAY request that the
: client abort transmission of a request without error by
: sending a RST_STREAM with an error code of NO_ERROR after
: sending a complete response (i.e., a frame with the
: END_STREAM flag). Clients MUST NOT discard responses as a
: result of receiving such a RST_STREAM, though clients can
: always discard responses at their discretion for other
: reasons.
Previously, RST_STREAM(NO_ERROR) received from upstream after
a frame with the END_STREAM flag was incorrectly treated as an
error. Now, a single RST_STREAM(NO_ERROR) is properly handled.
This fixes problems observed with modern grpc-c [1], as well
as with the Go gRPC module.
[1] https://github.com/grpc/grpc/pull/1661
We always generate stream frames that have length. The 'len' member is used
during parsing incoming frames and can be safely ignored when generating
output.
There are following flags in quic connection:
closing - true, when a connection close is initiated, for whatever reason
draining - true, when a CC frame is received from peer
The following state machine is used for closing:
+------------------+
| I/HS/AD |
+------------------+
| | |
| | V
| | immediate close initiated:
| | reasons: close by top-level protocol, fatal error
| | + sends CC (probably with app-level message)
| | + starts close_timer: 3 * PTO (current probe timeout)
| | |
| | V
| | +---------+ - Reply to input with CC (rate-limited)
| | | CLOSING | - Close/Reset all streams
| | +---------+
| | | |
| V V |
| receives CC |
| | |
idle | |
timer | |
| V |
| +----------+ | - MUST NOT send anything (MAY send a single CC)
| | DRAINING | | - if not already started, starts close_timer: 3 * PTO
| +----------+ | - if not already done, close all streams
| | |
| | |
| close_timer fires
| |
V V
+------------------------+
| CLOSED | - clean up all the resources, drop connection
+------------------------+ state completely
The ngx_quic_close_connection() function gets an "rc" argument, that signals
reason of connection closing:
NGX_OK - initiated by application (i.e. http/3), follow state machine
NGX_DONE - timedout (while idle or draining)
NGX_ERROR - fatal error, destroy connection immediately
The PTO calculations are not yet implemented, hardcoded value of 5s is used.
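The transitions drawn in the diagram above can be written down as a small transition function. This is only a sketch with hypothetical names: nginx tracks the state with the "closing" and "draining" flags rather than an enum, and events that the diagram does not show for a state are assumed to leave the state unchanged.

```c
#include <assert.h>

typedef enum {
    QUIC_ST_ACTIVE,      /* I/HS/AD in the diagram */
    QUIC_ST_CLOSING,
    QUIC_ST_DRAINING,
    QUIC_ST_CLOSED
} quic_close_state_t;

typedef enum {
    QUIC_EV_IMMEDIATE_CLOSE,   /* top-level protocol close or fatal error */
    QUIC_EV_RECV_CC,           /* CC frame received from peer */
    QUIC_EV_IDLE_TIMER,
    QUIC_EV_CLOSE_TIMER        /* close_timer (3 * PTO) fired */
} quic_close_event_t;

/* Returns the state that follows "st" when "ev" happens. */
quic_close_state_t
quic_close_next(quic_close_state_t st, quic_close_event_t ev)
{
    switch (st) {

    case QUIC_ST_ACTIVE:
        if (ev == QUIC_EV_IMMEDIATE_CLOSE) return QUIC_ST_CLOSING;
        if (ev == QUIC_EV_RECV_CC)         return QUIC_ST_DRAINING;
        if (ev == QUIC_EV_IDLE_TIMER)      return QUIC_ST_CLOSED;
        return st;

    case QUIC_ST_CLOSING:
        if (ev == QUIC_EV_RECV_CC)         return QUIC_ST_DRAINING;
        if (ev == QUIC_EV_CLOSE_TIMER)     return QUIC_ST_CLOSED;
        return st;

    case QUIC_ST_DRAINING:
        if (ev == QUIC_EV_CLOSE_TIMER)     return QUIC_ST_CLOSED;
        return st;

    case QUIC_ST_CLOSED:
        return st;
    }

    assert(0);   /* not reached for valid states */
    return QUIC_ST_CLOSED;
}
```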
The function is split into three:
ngx_quic_close_connection() itself cleans up all core nginx things
ngx_quic_close_quic() deals with everything inside c->quic
ngx_quic_close_streams() deals with streams cleanup
The quic and streams cleanup functions may return NGX_AGAIN, thus signalling
that cleanup is not ready yet, and the close cannot continue to next step.
The header size macros for long and short packets were fixed to provide
correct values in bytes.
Currently the sending code limits frames so they don't exceed max_packet_size.
But it does not account for the case when a single frame can exceed the limit.
As a result of this patch, big payload (CRYPTO and STREAM) will be split
into a number of smaller frames that fit into advertised max_packet_size
(which specifies final packet size, after encryption).
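The splitting can be illustrated with the sketch below. All names are hypothetical, and max_chunk is assumed to be the room left for frame data once packet header, frame header and encryption overhead have been subtracted from max_packet_size; how nginx computes that overhead is not shown here.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t  offset;   /* offset of the chunk within the payload */
    size_t  len;      /* chunk length, at most max_chunk */
} quic_chunk_t;

/*
 * Splits "total" bytes of payload into chunks of at most "max_chunk"
 * bytes.  Returns the number of chunks written to "out", or 0 if the
 * payload is empty or "out" is too small.
 */
size_t
quic_split_payload(size_t total, size_t max_chunk, quic_chunk_t *out,
    size_t nout)
{
    size_t  n, offset, len;

    assert(max_chunk > 0);

    n = 0;
    offset = 0;

    while (offset < total) {

        if (n == nout) {
            return 0;
        }

        len = total - offset;

        if (len > max_chunk) {
            len = max_chunk;
        }

        out[n].offset = offset;
        out[n].len = len;

        n++;
        offset += len;
    }

    return n;
}
```

With these assumptions, a 2500-byte CRYPTO payload and 1000 bytes of room per frame yield three frames of 1000, 1000 and 500 bytes.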
chrome-unstable 83.0.4103.7 starts with Initial packet number 1.
I couldn't find a proper explanation besides this text in quic-transport:
An endpoint MAY skip packet numbers when sending
packets to detect this (Optimistic ACK Attack) behavior.
+ MAX_STREAM_DATA frame is sent when recv() is performed on stream
The new value is a sum of total bytes received by stream + free
space in a buffer;
The sending of MAX_STREAM_DATA frame in response to STREAM_DATA_BLOCKED
frame is adjusted to follow the same logic as above.
+ MAX_DATA frame is sent when total amount of received data is 2x
of current limit. The limit is doubled.
+ Default values of transport parameters are adjusted to more meaningful
values:
initial stream limits are set to quic buffer size instead of
unrealistically small 255.
initial max data is decreased to 16 buffer sizes, in an assumption that
this is enough for a relatively short connection, instead of randomly
chosen big number.
All this allows to initiate a stable flow of streams that does not block
on stream/connection limits (tested with FF 77.0a1 and 100K requests)
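The MAX_STREAM_DATA rule described above (total bytes received by the stream plus free space in its buffer) might be sketched as follows. The function and parameter names are hypothetical, and the MAX_DATA doubling is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns the limit to advertise in MAX_STREAM_DATA: everything received
 * so far plus the space currently free in the stream buffer.
 */
uint64_t
quic_stream_recv_limit(uint64_t received, size_t buf_size, size_t buf_used)
{
    assert(buf_used <= buf_size);

    return received + (buf_size - buf_used);
}
```

For example, after 10000 bytes received on a stream with a 65536-byte buffer holding 1536 unread bytes, the advertised limit would be 74000.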
Before the patch, full STREAM frame handling was delayed until the frame with
zero offset was received. Only a node in the streams tree was created.
This led to problems when such a stream was deleted; in particular, it had no
handlers set for read events.
This patch creates new stream immediately, but delays data delivery until
the proper offset will arrive. This is somewhat similar to how accept()
operation works.
The ngx_quic_add_stream() function is no longer needed and merged into stream
handler. The ngx_quic_stream_input() now only handles frames for existing
streams and does not deal with stream creation.
Frames can still float in the following queues:
- crypto frames reordering queues (one per encryption level)
- moved crypto frames cleanup to the moment where all streams are closed
- stream frames reordering queues (one per packet number namespace)
- frames retransmit queues (one per packet number namespace)
Each stream node now includes incoming frames queue and sent/received counters
for tracking offset. The sent counter is not used; c->sent is used instead,
unlike in crypto buffers, which have no connections.
If offset in CRYPTO frame doesn't match expected, following actions are taken:
a) Duplicate frames or frames within [0...current offset] are ignored
b) New data from intersecting ranges (starts before current_offset, ends
after) is consumed
c) "Future" frames are stored in a sorted queue (min offset .. max offset)
Once a frame is consumed, current offset is updated and the queue is inspected:
we iterate the queue until the gap is found and act as described
above for each frame.
The amount of data in buffered frames is limited by corresponding macro.
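Cases a) to c) above can be sketched as a classifier that decides what to do with an incoming frame. The names are hypothetical, the sorted queue itself is not shown, and offsets are assumed to stay within the 62-bit range of QUIC varints so that offset + len does not overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    QUIC_FRAME_IGNORE,    /* a) duplicate or already consumed data */
    QUIC_FRAME_CONSUME,   /* in order, or b) intersecting with new data */
    QUIC_FRAME_BUFFER     /* c) "future" frame, goes to the sorted queue */
} quic_order_action_t;

/*
 * Classifies a frame covering [offset, offset + len) against the current
 * offset.  For QUIC_FRAME_CONSUME, *skip is set to the number of leading
 * bytes that were already consumed and must be dropped.
 */
quic_order_action_t
quic_order_classify(uint64_t current, uint64_t offset, uint64_t len,
    uint64_t *skip)
{
    assert(skip != NULL);

    *skip = 0;

    if (offset > current) {
        return QUIC_FRAME_BUFFER;
    }

    if (offset + len <= current) {
        return QUIC_FRAME_IGNORE;
    }

    *skip = current - offset;

    return QUIC_FRAME_CONSUME;
}
```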
The CRYPTO and STREAM frame structures are now compatible: they share
the same set of initial fields. This makes it possible to have code that
deals with both of these frames.
The ordering layer now processes the frame with offset and invokes the
handler when it can organise an ordered stream of data.
Quote: Conceptually, a packet number space is the context in which a packet
can be processed and acknowledged.
ngx_quic_namespace_t => ngx_quic_send_ctx_t
qc->ns => qc->send_ctx
ns->largest => send_ctx->largest_ack
The ngx_quic_ns(level) macro now returns pointer, not just index:
ngx_quic_get_send_ctx(c->quic, level)
ngx_quic_retransmit_ns() => ngx_quic_retransmit()
ngx_quic_output_ns() => ngx_quic_output_frames()
The request processing is delayed by a timer. Since nginx updates
internal time once at the start of each event loop iteration, this
normally ensures constant time delay, adding a mitigation from
time-based attacks.
A notable exception to this is the case when there are no additional
events before the timer expires. To ensure constant-time processing
in this case as well, we trigger an additional event loop iteration
by posting a dummy event for the next event loop iteration.
The offset in client CRYPTO frames is tracked in c->quic->crypto_offset_in.
This means that CRYPTO frames with non-zero offset are now accepted, making it
possible to finish handshake with client certificates that exceed max packet
size (if no reordering happens).
The c->quic->crypto_offset field is renamed to crypto_offset_out to avoid
confusion with tracking of incoming CRYPTO stream.
+ since number of ranges is unknown, provide a function to parse them once
again in handler to avoid memory allocation
+ ack handler now processes all ranges, not only the first
+ ECN counters are parsed and saved into frame if present
Such frames are grouped together in a switch and just ignored, instead of
closing the connection. This may improve test coverage. All such frames
require acknowledgment.
The qc->closing flag is set when a connection close is initiated for the first
time.
No timers will be set if the flag is active.
TODO: this is a temporary solution to avoid running timer handlers after
connection (and its pool) was destroyed. It looks like currently we have
no clear policy of connection closing in regard to timers.
Found with a previously received Initial packet with ACK only, which
instantiates a new connection but does not produce the handshake keys.
This can be triggered by a fairly well behaving client, if the server
stands behind a load balancer that stripped Initial packets exchange.
Found by F5 test suite.
This makes sending a large number of bidirectional streams work within ngtcp2,
which doesn't bother sending optional STREAMS_BLOCKED when exhausted.
This also introduces tracking currently opened and maximum allowed streams.
Currently, the output is called periodically, each 200 ms to invoke
ngx_quic_output() that will push all pending frames into packets.
TODO: implement flags à la Nagle & co (NO_DELAY/NO_PUSH...)
All frames collected to packet are moved into a per-namespace send queue.
QUIC connection has a timer which fires on the closest max_ack_delay time.
The frame is deleted from the queue when a corresponding packet is acknowledged.
The NGX_QUIC_MAX_RETRANSMISSION is a timeout that defines maximum length
of retransmission of a frame.
The quic->keys[4] array now contains secrets related to the corresponding
encryption level. All protection-level functions get proper keys and do
not need to switch manually between levels.
If early data is accepted, SSL_do_handshake() completes as soon as ClientHello
is processed. SSL_in_init() will report the handshake is still in progress.
Static buffers are used instead in functions where decryption takes place.
The pkt->plaintext points to the beginning of a static buffer.
The pkt->payload.data points to decrypted data actual start.
+ ngx_quic_encrypt():
- no longer accepts pool as argument
- pkt is 1st arg
- payload is passed as pkt->payload
- performs encryption to the specified static buffer
+ ngx_quic_create_long/short_packet() functions:
- single buffer for everything, allocated by caller
- buffer layout is: [ ad | payload | TAG ]
the result is in the beginning of buffer with proper length
- nonce is calculated on stack
- log is passed explicitly, pkt is 1st arg
- no more allocations inside
+ ngx_quic_create_long_header():
- args changed: no need to pass str_t
+ added ngx_quic_create_short_header()
+ Client-related errors (i.e. parsing) are done at INFO level
+ c->log->action is updated through the process of receiving, parsing,
handling packet/payload and generating frames/output.
For ngx_http_process_request() part to work, this required to set both
r->http_connection->ssl and c->ssl on a QUIC stream. To avoid damaging
global SSL object, ngx_ssl_shutdown() is managed to ignore QUIC streams.
+ ngx_quic_init_ssl_methods() is no longer there, we setup methods on SSL
connection directly.
+ the handshake_handler is actually a generic quic input handler
+ updated c->log->action and debug to reflect changes and be more informative
+ c->quic is always set in ngx_quic_input()
+ the quic connection state is set by the results of SSL_do_handshake();
note:
+ parameters are available in SSL connection since they are obtained by ssl
stack
quote:
During connection establishment, both endpoints make authenticated
declarations of their transport parameters. These declarations are
made unilaterally by each endpoint.
and really, we send our parameters before we read client's.
no handling of incoming parameters is made by this patch.
- integer parameters can be configured using the following directives:
quic_max_idle_timeout
quic_max_ack_delay
quic_max_packet_size
quic_initial_max_data
quic_initial_max_stream_data_bidi_local
quic_initial_max_stream_data_bidi_remote
quic_initial_max_stream_data_uni
quic_initial_max_streams_bidi
quic_initial_max_streams_uni
quic_ack_delay_exponent
quic_active_migration
quic_active_connection_id_limit
- only following parameters are actually sent:
active_connection_id_limit
initial_max_streams_uni
initial_max_streams_bidi
initial_max_stream_data_bidi_local
initial_max_stream_data_bidi_remote
initial_max_stream_data_uni
(other parameters are to be added into ngx_quic_create_transport_params()
function as needed, should be easy now)
- draft 24 and draft 27 are now supported
(at compile-time using quic_version macro)
The ngx_quic_parse_frame() function now has a new 'pkt' argument: the packet
header of a currently processed frame. This allows to log errors/debug
closer to reasons and perform additional checks regarding possible frame
types. The handler only performs processing of good frames.
A number of functions like read_uint32(), parse_int[_multi] probably should
be implemented as a macro, but currently it is better to have them as
functions for simpler debugging.
Cleanup in ngx_event_quic.c:
+ reordered functions, structures
+ added missing prototypes
+ added separate handlers for each frame type
+ numerous indentation/comments/TODO fixes
+ removed non-implemented qc->state and corresponding enum;
this requires deep thinking, stub was unused.
+ streams inside quic connection are now in own structure
All code dealing with serializing/deserializing
is moved into src/event/ngx_event_quic_transport.c/h file.
All macros for dealing with data are internal to source file.
The header file exposes frame types and error codes.
The exported functions are currently packet header parsers and writers
and frames parser/writer.
The ngx_quic_header_t structure is updated with 'log' member. This avoids
passing extra argument to parsing functions that need to report errors.
+ support for more than one initial packet
+ workaround for trailing zeroes in packet
+ ignore application data packet if no keys yet (issue in draft 27/ff nightly)
+ fixed PING frame parser
+ STREAM frames need to be acknowledged
The following HTTP configuration is used for firefox (v74):
http {
ssl_certificate_key localhost.key;
ssl_certificate localhost.crt;
ssl_protocols TLSv1.2 TLSv1.3;
server {
listen 127.0.0.1:10368 reuseport http3;
ssl_quic on;
server_name localhost;
location / {
return 200 "This-is-QUICK\n";
}
}
server {
listen 127.0.0.1:5555 ssl; # point the browser here
server_name localhost;
location / {
add_header Alt-Svc 'h3-24=":10368";ma=100';
return 200 "ALT-SVC";
}
}
}
New files:
src/event/ngx_event_quic_protection.h
src/event/ngx_event_quic_protection.c
The protection.h header provides interface to the crypto part of the QUIC:
2 functions to initialize corresponding secrets:
ngx_quic_set_initial_secret()
ngx_quic_set_encryption_secret()
and 2 functions to deal with packet processing:
ngx_quic_encrypt()
ngx_quic_decrypt()
Also, structures representing secrets are defined there.
All functions require SSL connection and a pool, only crypto operations
inside, no access to nginx connections or events.
Currently pool->log is used for the logging (instead of original c->log).
- events handling moved into src/event/ngx_event_quic.c
- http invokes once ngx_quic_run() and passes stream callback
(diff to original http_request.c is now minimal)
- streams are stored in rbtree using ID as a key
- when a new stream is registered, appropriate callback is called
- ngx_quic_stream_t type represents STREAM and stored in c->qs
- now NEW_CONNECTION_ID frames can be received and parsed
The packet structure is created in ngx_quic_input() and passed
to all handlers (initial, handshake and application data).
The UDP datagram buffer is saved as pkt->raw;
The QUIC packet is stored as pkt->data and pkt->len (instead of pkt->buf)
(pkt->len is adjusted after parsing headers to actual length)
The pkt->pos is removed, pkt->raw->pos is used instead.
- added basic parsing of ACK, PING and PADDING frames on input
- added preliminary parsing of SHORT headers
The ngx_quic_output() is now called after processing of each input packet.
Frames are added into output queue according to their level: initial packets
go ahead of handshake and application data, so they can be merged properly.
The payload handler is called from the new, handshake and application data
handlers (latter is a stub).
As was observed with ngtcp2 client, Finished CRYPTO frame within Handshake
packet may not be sent for some reason if there's nothing to append on 1-RTT.
This results in unnecessary retransmit. To avoid this edge case, a non-zero
active_connection_id_limit transport parameter is now used to append datagram
with NEW_CONNECTION_ID 1-RTT frames.
Now handshake generates frames, and they are queued in c->quic->frames.
The ngx_quic_output() is called from ngx_quic_flush_flight() or manually,
processes the queue and encrypts all frames according to required encryption
level.
ngx_quic_hexdump0(log, format, buffer, buffer_size);
- logs hexdump of buffer to specified error log
ngx_quic_hexdump0(c->log, "this is foo:", foo.data, foo.len);
ngx_quic_hexdump(log, format, buffer, buffer_size, ...)
- same as hexdump0, but more format/args possible:
ngx_quic_hexdump(c->log, "a=%d b=%d, foo is:", foo.data, foo.len, a, b);
When "aio" or "aio threads" is used while processing the response body of an
in-memory background subrequest, the subrequest could be finalized with an aio
operation still in progress. Upon aio completion either parent request is
woken or the old r->write_event_handler is called again. The latter may result
in request errors. In either case post_subrequest handler is never called with
the full response body, which is typically expected when using in-memory
subrequests.
Currently in nginx background subrequests are created by the upstream module
and the mirror module. The issue does not manifest itself with these
subrequests because they are header-only. But it can manifest itself with
third-party modules which create in-memory background subrequests.
We used to have default error_page overwrite for 495, 496, and 497, so
a configuration like
error_page 495 /error;
will result in error 400, much like without any error_page configured.
The 494 status code was introduced later (in 3848:de59ad6bf557, nginx 0.9.4),
and relevant changes to ngx_http_core_error_page() were missed, resulting
in inconsistent behaviour of "error_page 494" - with error_page configured
it results in 494 being returned instead of 400.
Reported by Frank Liu,
http://mailman.nginx.org/pipermail/nginx/2020-February/058957.html.
Introduced ngx_quic_input() and ngx_quic_output() as interface between
nginx and protocol. They are the only functions that are exported.
While there, added copyrights.
In "co64" atom chunk start offset is a 64-bit unsigned integer. When trimming
the "mdat" atom, chunk offsets are cast to off_t values which are typically
64-bit signed integers. A specially crafted mp4 file with huge chunk offsets
may lead to off_t overflow and result in negative trim boundaries.
The consequences of the overflow are:
- Incorrect Content-Length header value in the response.
- Negative left boundary of the response file buffer holding the trimmed "mdat".
This leads to pread()/sendfile() errors followed by closing the client
connection.
On rare systems where off_t is a 32-bit integer, this scenario is also feasible
with the "stco" atom.
The fix is to add checks which make sure data chunks referenced by each track
are within the mp4 file boundaries. Additionally a few more checks are added to
ensure mp4 file consistency and log errors.
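The kind of boundary check described above can be sketched as below. This is not the actual ngx_http_mp4_module code: the function name mp4_chunk_in_file() is hypothetical, and the comparison is done in unsigned 64-bit arithmetic so that huge "co64" offsets are rejected before any cast to off_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if the chunk [chunk_offset, chunk_offset + chunk_size)
 * lies entirely within a file of "file_end" bytes.  Written so that
 * chunk_offset + chunk_size is never computed and cannot overflow.
 */
bool
mp4_chunk_in_file(uint64_t chunk_offset, uint64_t chunk_size,
    uint64_t file_end)
{
    /* the file size originates from off_t, so it fits into 63 bits */
    assert(file_end <= (uint64_t) INT64_MAX);

    if (chunk_offset > file_end) {
        return false;
    }

    if (chunk_size > file_end - chunk_offset) {
        return false;
    }

    return true;
}
```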
Duplicate "Host" headers were allowed in nginx 0.7.0 (revision b9de93d804ea)
as a workaround for some broken Motorola phones which used to generate
requests with two "Host" headers[1]. It is believed that this workaround
is no longer relevant.
[1] http://mailman.nginx.org/pipermail/nginx-ru/2008-May/017845.html
The "identity" transfer coding has been removed in RFC 7230. It is
believed that it is not used in real life, and at the same time it
provides a potential attack vector.
We do not support more than one transfer encoding anyway, so accepting
requests with multiple Transfer-Encoding headers doesn't make sense.
Further, we do not handle multiple headers, and ignore anything but
the first header.
Reported by Filippo Valsorda.
A connection could get stuck without timers if a client has partially sent
the HEADERS frame such that it was split on the individual header boundary.
In this case, it cannot be processed without the rest of the HEADERS frame.
The fix is to call ngx_http_v2_state_headers_save() in this case. Normally,
it would be called from the ngx_http_v2_state_header_block() handler on the
next iteration, when there is not enough data to continue processing. This
isn't the case if recv_buffer became empty and there's no more data to read.
With the recent change to prevent frames flood in d4448892a294,
nginx will finalize the connection with NGX_HTTP_V2_INTERNAL_ERROR
whenever flood is detected, causing nginx to abort or stop if
the debug_points directive is used in nginx config.
Previous change 1ce3f01a4355 incorrectly introduced processing of the
ngx_posted_next_events queue at the end of operation, effectively making
posted next events a nop, since at the end of an event loop iteration
the queue is always empty. Correct approach is to move events to the
ngx_posted_events queue at an iteration start, as it was done previously.
Further, in some cases the c->read event might be already in the
ngx_posted_events queue, and calling ngx_post_event() with the
ngx_posted_next_events queue won't do anything. To make sure the event
will be correctly placed into the ngx_posted_next_events queue
we now check if it is already posted.
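The queue handling described above can be modelled with a toy event loop in
Python. This is only a sketch of the idea, not nginx code: the EventLoop
class and its method names are hypothetical, and events are represented by
plain callables.

```python
from collections import deque


class EventLoop:
    """Toy event loop with a regular and a "next iteration" posted queue."""

    def __init__(self):
        self.posted = deque()
        self.posted_next = deque()

    def post_next(self, handler):
        # An event already sitting in the regular queue is moved,
        # otherwise posting it for the next iteration would do nothing.
        if handler in self.posted:
            self.posted.remove(handler)
        if handler not in self.posted_next:
            self.posted_next.append(handler)

    def iterate(self):
        # Start of the iteration: hand over everything postponed last time.
        self.posted.extend(self.posted_next)
        self.posted_next.clear()
        # (a real loop would wait for kernel events here)
        while self.posted:
            handler = self.posted.popleft()
            handler(self)
```

Moving the postponed events at the start of an iteration means a handler
that re-posts itself runs once per iteration instead of looping within one.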
Available bytes handling in SSL, introduced in 9d2ad2fb4423, relied
on connection read handler being overwritten to set the ready flag
and the amount of available bytes. This approach, however, does
not work properly when connection read handler is changed, for example,
when switching to a next pipelined request, and can result in unexpected
connection timeouts, see here:
http://mailman.nginx.org/pipermail/nginx-devel/2019-December/012825.html
Fix is to introduce ngx_event_process_posted_next() instead, which
will set ready and available regardless of how event handler is set.
When ngx_http_v2_close_stream_handler() is used to retry stream close
after queued frames are sent, client timeouts on the stream can be
logged multiple times and/or in addition to already happened errors.
To resolve this, separate ngx_http_v2_retry_close_stream_handler()
was introduced, which does not try to log timeouts.
If a stream is closed with queued frames, it is possible that no further
write events will occur on the stream, leading to the socket leak.
To fix this, the stream's fake connection read handler is set to
ngx_http_v2_close_stream_handler(), to make sure that finalizing the
connection with ngx_http_v2_finalize_connection() will be able to
close the stream regardless of the current number of queued frames.
Additionally, the stream's fake connection fc->error flag is explicitly
set, so ngx_http_v2_handle_stream() will post a write event when queued
frames are finally sent even if stream flow control window is exhausted.
These checks were missed when chunked support was introduced. Also
added an explicit error message to ngx_http_dav_copy_move_handler()
(it was missed for some reason, in contrast to DELETE and MKCOL handlers).
While empty replacements were caught at run-time, parsing code
of the "rewrite" directive expects that a minimum length of the
"replacement" argument is 1.
If a rewritten URI has the null character, only a part of URI was
copied to a memory buffer allocated for path. In some setups this
could be exploited to expose uninitialized memory via the Location
header.
The "alias" directive cannot be used in the same location where URI
was rewritten. This has been detected in the "rewrite ... break"
case, but not when the standalone "break" directive was used.
This change also fixes proxy_pass with URI component in a similar
case:
    location /aaa/ {
        rewrite ^ /xxx/yyy;
        break;
        proxy_pass http://localhost:8080/bbb/;
    }
Previously, the "/bbb/yyy" would be sent to a backend instead of
"/xxx/yyy". And if location's prefix was longer than the rewritten
URI, a segmentation fault might occur.
Previously, connections returned from keepalive cache had c->data
pointing to the keepalive cache item. While this shouldn't be a problem
for correct code, as c->data is not expected to be used before it is set,
explicitly clearing it might help to avoid confusion.
Previously only an rbtree was associated with a limit_conn. To make it
possible to associate more data with a limit_conn, shared context is introduced
similar to limit_req. Also, shared pool pointer is kept in a way similar to
limit_req.
Now a new structure ngx_proxy_protocol_t holds these fields. This allows
adding more PROXY protocol fields in the future without modifying the
connection structure.
With MinGW-w64, building 64-bit nginx binary with GCC 8 and above
results in warning due to cast of GetProcAddress() result to ngx_wsapoll_pt,
which GCC thinks is incorrect. Added intermediate cast to "void *" to
silence the warning.
FormatMessage() seems to return many errors which essentially indicate that
the language in question is not available. At least the following were
observed in the wild and during testing: ERROR_MUI_FILE_NOT_FOUND (15100)
(ticket #1868), ERROR_RESOURCE_TYPE_NOT_FOUND (1813). While documentation
says it should be ERROR_RESOURCE_LANG_NOT_FOUND (1815), this doesn't seem
to be the case.
As such, checking error code was removed, and as long as FormatMessage()
returns an error, we now always try the default language.
Added code to track number of bytes available in the socket.
This makes it possible to avoid looping for a long time while
working with fast enough peer when data are added to the socket buffer
faster than we are able to read and process data.
When kernel does not provide number of bytes available, it is
retrieved using ioctl(FIONREAD) as long as a buffer is filled by
SSL_read().
It is assumed that number of bytes returned by SSL_read() is close
to the number of bytes read from the socket, as we do not use
SSL compression. But even if it is not true for some reason, this
is not important, as we post an additional reading event anyway.
Note that data can be buffered at SSL layer, and it is not possible
to simply stop reading at some point and wait till the event will
be reported by the kernel again. This can be only done when there
are no data in SSL buffers, and there is no good way to find out if
it's the case.
Instead of trying to figure out if SSL buffers are empty, this patch
introduces events posted for the next event loop iteration - such
events will be processed only on the next event loop iteration,
after going into the kernel and retrieving additional events. This
seems to be a simple and reliable approach.
This makes it possible to avoid looping for a long time while working
with a fast enough peer when data are added to the socket buffer faster
than we are able to read and process them (ticket #1431). This is
basically what we already do on FreeBSD with kqueue, where information
about the number of bytes in the socket buffer is returned by
the kevent() call.
With other event methods rev->available is now set to -1 when the socket
is ready for reading. Later in ngx_recv() and ngx_recv_chain(), if
full buffer is received, real number of bytes in the socket buffer is
retrieved using ioctl(FIONREAD). Reading more than this number of bytes
ensures that even with edge-triggered event methods the event will be
triggered again, so it is safe to stop processing of the socket and
switch to other connections.
Using ioctl(FIONREAD) only after reading a full buffer is an optimization.
With this approach we only call ioctl(FIONREAD) when there are at least
two recv()/readv() calls.
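A rough Python sketch of this reading strategy follows. It is an
illustration under assumptions rather than the actual ngx_recv() logic:
bytes_available() and read_bounded() are hypothetical names, and the
example relies on termios.FIONREAD being available (Unix-like systems).

```python
import fcntl
import socket
import struct
import termios


def bytes_available(sock):
    """Ask the kernel how many bytes are queued for reading on sock."""
    raw = fcntl.ioctl(sock.fileno(), termios.FIONREAD, b"\0" * 4)
    return struct.unpack("i", raw)[0]


def read_bounded(sock, bufsize):
    """Read what is queued now, calling FIONREAD only after a full buffer."""
    chunks = []
    available = -1                  # -1: ready for reading, amount unknown
    while True:
        data = sock.recv(bufsize)
        chunks.append(data)
        n = len(data)
        if available >= 0:
            available -= n
            if available <= 0:
                break               # consumed everything that was reported
        if n < bufsize:
            break                   # short read: nothing more queued
        if available < 0:
            available = bytes_available(sock)
            if available == 0:
                break               # buffer is empty, no extra recv() needed
    return b"".join(chunks)
```

With a socket pair, sending 8 bytes and calling read_bounded(sock, 4) takes
two recv() calls and one ioctl(), and stops there even if the peer keeps
writing.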
As long as there are data to read in the socket, yet the amount of data
is less than total size of the buffers in the chain, this saves one
unneeded read() syscall. Before this change, reading only stopped if
ngx_ssl_recv() returned no data, that is, two read() syscalls in a row
returned EAGAIN.
In SSL connections, data can be buffered by the SSL layer, and it is
wrong to avoid doing c->recv_chain() if c->read->available is 0 and
c->read->pending_eof is set. And tests show that the optimization in
question indeed can result in incorrect detection of premature connection
close if upstream closes the connection without sending a close notify
alert at the same time. Fix is to disable c->read->available optimization
for SSL connections.
This could happen when graceful shutdown configured by worker_shutdown_timeout
times out and is then followed by another timeout such as proxy_read_timeout.
In this case, the HEADERS frame is added to the output queue, but attempt to
send it fails (due to c->error forcibly set during graceful shutdown timeout).
This triggers request finalization which attempts to close the stream. But the
stream cannot be closed because there is a frame in the output queue, and the
connection cannot be finalized. This leaves the connection open without any
timer events leading to alert.
The fix is to post write event when sending output queue fails on c->error.
That will finalize the connection.
With this patch, all traffic over an HTTP/2 connection is counted in
the h2c->total_bytes field, and payload traffic is counted in
the h2c->payload_bytes field. As long as total traffic is many times
larger than payload traffic, we consider this to be a flood.
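The accounting can be sketched in Python as follows. The class and method
names are hypothetical, and the ratio and allowance thresholds are
illustrative assumptions, since the text above only says "many times larger".

```python
class H2Connection:
    """Minimal stand-in for the two counters described above."""

    def __init__(self):
        self.total_bytes = 0
        self.payload_bytes = 0

    def account(self, frame_size, payload_size=0):
        # Every frame counts towards the total, only useful data as payload.
        self.total_bytes += frame_size
        self.payload_bytes += payload_size

    def is_flood(self, ratio=8, allowance=1 << 20):
        # ratio and allowance are assumed values chosen for illustration.
        return self.total_bytes // ratio > self.payload_bytes + allowance
```

A fixed allowance keeps short connections with mostly control frames from
being misdetected, while a long stream of frames without payload eventually
trips the check.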
In 8df664ebe037, we've switched to maximizing stream window instead
of sending RST_STREAM. Since then handling of RST_STREAM with NO_ERROR
was fixed at least in Chrome, hence we switch back to using RST_STREAM.
This allows more effective rejecting of large bodies, and also minimizes
non-payload traffic to be accounted in the next patch.
Previously, if a response to the PTR request was cached, and ngx_resolver_dup()
failed to allocate memory for the resulting name, then the original node was
freed but left in expire_queue. A subsequent address resolving would end up
in a use-after-free memory access of the node either in ngx_resolver_expire()
or ngx_resolver_process_ptr(), when accessing it through expire_queue.
The fix is to leave the resolver node intact.
Don't waste server resources by sending RST_STREAM frames. Instead,
reject WINDOW_UPDATE frames with invalid zero increment by closing
connection with PROTOCOL_ERROR.
Don't waste server resources by sending RST_STREAM frames. Instead,
reject HEADERS and PRIORITY frames with self-dependency by closing
connection with PROTOCOL_ERROR.
When ngx_http_discard_request_body() call was added to ngx_http_send_response(),
there were no return codes other than NGX_OK and NGX_HTTP_INTERNAL_SERVER_ERROR.
Now it can also return NGX_HTTP_BAD_REQUEST, but ngx_http_send_response() still
incorrectly transforms it to NGX_HTTP_INTERNAL_SERVER_ERROR.
The fix is to propagate ngx_http_discard_request_body() errors.
As defined in HTTP/1.1, body chunks have the following ABNF:
    chunk = chunk-size [ chunk-ext ] CRLF chunk-data CRLF
where chunk-data is a sequence of chunk-size octets.
With this change, chunk-data that doesn't end with CRLF at chunk-size
offset will be treated as invalid, such as in the example provided below:
    4
    SEE-THIS-AND-
    4
    THAT
    0
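The stricter check can be illustrated with a small Python decoder. It is a
sketch of the rule only, not the nginx parser: parse_chunked() is a
hypothetical helper, and trailers and other hardening are left out.

```python
def parse_chunked(body):
    """Decode a chunked body, raising ValueError on malformed framing."""
    out = bytearray()
    pos = 0
    while True:
        eol = body.find(b"\r\n", pos)
        if eol < 0:
            raise ValueError("missing CRLF after chunk-size")
        # Anything after ";" is a chunk extension and is ignored here.
        size_field = body[pos:eol].split(b";", 1)[0].strip()
        size = int(size_field, 16)
        if size < 0:
            raise ValueError("negative chunk-size")
        pos = eol + 2
        if size == 0:
            return bytes(out)       # last chunk; trailers are not parsed
        data_end = pos + size
        # chunk-data must be followed by CRLF exactly at the chunk-size offset.
        if body[data_end:data_end + 2] != b"\r\n":
            raise ValueError("chunk-data is not followed by CRLF")
        out += body[pos:data_end]
        pos = data_end + 2
```

Fed the example above with CRLF line endings, the decoder fails on the first
chunk, because the four announced octets "SEE-" are followed by "TH" rather
than CRLF.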
Previously, if unbuffered request body reading wasn't finished before
the request was redirected to a different location using error_page
or X-Accel-Redirect, and the request body is read again, this could
lead to disastrous effects, such as a duplicate post_handler call or
"http request count is zero" alert followed by a segmentation fault.
This happened in the following configuration (ticket #1819):
    location / {
        proxy_request_buffering off;
        proxy_pass http://bad;
        proxy_intercept_errors on;
        error_page 502 = /error;
    }
    location /error {
        proxy_pass http://backend;
    }
Fixed excessive memory growth and CPU usage if stream windows are
manipulated in a way that results in generating many small DATA frames.
Fix is to limit the number of simultaneously allocated DATA frames.
Fixed uncontrolled memory growth if peer sends a stream of
headers with a 0-length header name and 0-length header value.
Fix is to reject headers with zero name length.
When using SMTP with SSL and resolver, read events might be enabled
during address resolving, leading to duplicate ngx_mail_ssl_handshake_handler()
calls if something arrives from the client, and duplicate session
initialization - including starting another resolving. This can lead
to a segmentation fault if the session is closed after first resolving
finished. Fix is to block read events while resolving.
Reported by Robert Norris,
http://mailman.nginx.org/pipermail/nginx/2019-July/058204.html.
After ac5a741d39cf it is now possible that after zstream.avail_out
reaches 0 and we allocate additional buffer, there will be no more data
to put into this buffer, triggering "zero size buf" alert. Fix is to
reset b->temporary flag in this case.
Additionally, an optimization added to avoid allocating additional buffer
in this case, by checking if last deflate() call returned Z_STREAM_END.
Note that checking for Z_STREAM_END by itself is not enough to fix alerts,
as deflate() can return Z_STREAM_END without producing any output if the
buffer is smaller than gzip trailer.
Reported by Witold Filipczyk,
http://mailman.nginx.org/pipermail/nginx-devel/2019-July/012469.html.
Due to shortcomings of the ccv->zero flag implementation in complex value
interface, length of the resulting string from ngx_http_complex_value()
might either not include terminating null character or include it,
so the only safe way to work with the result is to use it as a
null-terminated string.
Reported by Patrick Wollgast.
When "-" follows a parameter of maximum length, a single byte buffer
overflow happens, since the error branch does not check parameter length.
Fix is to avoid saving "-" to the parameter key, and instead use an error
message with "-" explicitly written. The message is mostly identical to
one used in similar cases in the preequal state.
Reported by Patrick Wollgast.
With level-triggered event methods it is important to specify
the NGX_CLOSE_EVENT flag to ngx_handle_read_event(), otherwise
the event won't be removed, resulting in CPU hog.
Reported by Patrick Wollgast.
To save memory hash code uses u_short to store resulting bucket sizes,
so maximum bucket size is limited to 65536 minus ngx_cacheline_size (larger
values will be aligned to 65536 which will overflow u_short). However,
there were no checks to enforce this, and using larger bucket sizes
resulted in overflows and segmentation faults.
Appropriate safety checks to enforce this added to ngx_hash_init().
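The arithmetic behind the limit can be checked with a few lines of Python.
This is a sketch, not the ngx_hash_init() code: align() and
check_bucket_size() are hypothetical helpers, and a 64-byte cache line is
assumed.

```python
U_SHORT_MASK = 0xFFFF


def align(size, boundary):
    """Round size up to a multiple of boundary (a power of two)."""
    return (size + boundary - 1) & ~(boundary - 1)


def check_bucket_size(bucket_size, cacheline_size=64):
    """Reject bucket sizes whose aligned value no longer fits in u_short."""
    if bucket_size > 65536 - cacheline_size:
        raise ValueError("bucket size is too large")
    return align(bucket_size, cacheline_size)


too_big = 65536 - 64 + 1
aligned = align(too_big, 64)            # rounds up to 65536
stored = aligned & U_SHORT_MASK         # what a 16-bit field would keep
```

With these numbers the aligned size is 65536, which is truncated to 0 when
stored in a 16-bit field, hence the explicit check.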
When nginx is used with zlib patched with [1], which provides
integration with the future IBM Z hardware deflate acceleration, it ends
up computing CRC32 twice: one time in hardware, which always does this,
and one time in software by explicitly calling crc32().
crc32() calls were added in changesets 133:b27548f540ad ("nginx-0.0.1-
2003-09-24-23:51:12 import") and 134:d57c6835225c ("nginx-0.0.1-
2003-09-26-09:45:21 import") as part of gzip wrapping feature - back
then zlib did not support it.
However, since then gzip wrapping was implemented in zlib v1.2.0.4,
and it's already being used by nginx for log compression.
This patch replaces hand-written gzip wrapping with the one provided by
zlib. It simplifies the code, and makes it avoid computing CRC32 twice
when using hardware acceleration.
[1] https://github.com/madler/zlib/pull/410
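The difference between the two approaches can be shown with Python's zlib
binding. This is an illustration only, not the nginx code: gzip_by_hand()
and gzip_by_zlib() are hypothetical helpers, and the header bytes are a
minimal gzip header with no optional fields.

```python
import struct
import zlib


def gzip_by_hand(data):
    """Raw deflate wrapped in a hand-written gzip header and trailer."""
    header = b"\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03"
    deflater = zlib.compressobj(wbits=-15)      # raw deflate, no wrapper
    body = deflater.compress(data) + deflater.flush()
    # The trailer needs CRC32 and length computed by the caller.
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF,
                          len(data) & 0xFFFFFFFF)
    return header + body + trailer


def gzip_by_zlib(data):
    """Let zlib produce the gzip header, CRC32 and trailer itself."""
    deflater = zlib.compressobj(wbits=16 + 15)  # 16 + window bits: gzip
    return deflater.compress(data) + deflater.flush()
```

Both outputs decode to the same data, but only the first variant calls
crc32() in application code.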
Similarly to the change in 5491:74bfa803a5aa (1.5.9), we should accept
properly escaped URIs and unescape them as needed, else it is not possible
to handle URIs with question marks.
As we now have ctx->header_sent flag, it is further used to prevent
duplicate $r->send_http_header() calls, prevent output before sending
header, and $r->internal_redirect() after sending header.
Further, $r->send_http_header() protected from calls after
$r->internal_redirect().
Returning NGX_HTTP_INTERNAL_SERVER_ERROR if a perl code died after
sending header will lead to a "header already sent" alert. To avoid
it, we now check if header was already sent, and return NGX_ERROR
instead if it was.
Previously, redirects scheduled with $r->internal_redirect() were followed
even if the code then died. Now these are ignored and nginx will return
an error instead.
Variable handlers are not expected to send anything to the client, cannot
sleep or read body, and are not expected to modify the request. Added
appropriate protection to prevent accidental foot shooting.
Duplicate $r->sleep() and/or $r->has_request_body() calls result
in undefined behaviour (in practice, connection leaks were observed).
To prevent this, croak() added in appropriate places.
Previously, allocation errors in nginx.xs were more or less ignored,
potentially resulting in incorrect code execution in specific low-memory
conditions. This is changed to use ctx->error bit and croak(), similarly
to how output errors are now handled.
Note that this is mostly a cosmetic change, as Perl itself exits on memory
allocation errors, and hence nginx with Perl is hardly usable in low-memory
conditions.
When an error happens, the ctx->error bit is now set, and croak()
is called to terminate further processing. The ctx->error bit is
checked in ngx_http_perl_call_handler() to cancel further processing,
and is also checked in various output functions - to make sure these won't
be called if croak() was handled by an eval{} in perl code.
In particular, this ensures that output chain won't be called after
errors, as filters might not expect this to happen. This fixes some
segmentation faults under low memory conditions. Also this stops
request processing after filter finalization or request body reading
errors.
For cases where an HTTP error status can be additionally returned (for
example, 416 (Requested Range Not Satisfiable) from the range filter),
the ctx->status field is also added.
This ensures that correct ctx is always available, including after
filter finalization. In particular, this fixes a segmentation fault
with the following configuration:
    location / {
        image_filter test;
        perl 'sub {
            my $r = shift;
            $r->send_http_header();
            $r->print("foo\n");
            $r->print("bar\n");
        }';
    }
This also seems to be the only way to correctly handle filter finalization
in various complex cases, for example, when embedded perl is used both
in the original handler and in an error page called after filter
finalization.
The NGX_DONE test in ngx_http_perl_handle_request() was introduced
in 1702:86bb52e28ce0, which also modified ngx_http_perl_call_handler()
to return NGX_DONE with c->destroyed. The latter part was then
removed in 3050:f54b02dbb12b, so NGX_DONE test is no longer needed.
Embedded perl does not set any request fields needed for conditional
requests processing. Further, filter finalization in the not_modified
filter can cause segmentation faults due to cleared ctx as in
ticket #1786.
Before 5fb1e57c758a (1.7.3) the not_modified filter was implicitly disabled
for perl responses, as r->headers_out.last_modified_time was -1. This
change restores this behaviour by using the explicit r->disable_not_modified
flag.
Note that this patch doesn't try to address perl module robustness against
filter finalization and other errors returned from filter chains. It should
be eventually reworked to handle errors instead of ignoring them.
A new directive limit_req_dry_run allows enabling the dry run mode. In this
mode requests are neither rejected nor delayed, but reject/delay status is
logged as usual.
In case of filter finalization, essential request fields like r->uri,
r->args etc could be changed, which affected the cache update subrequest.
Also, after filter finalization r->cache could be set to NULL, leading to
null pointer dereference in ngx_http_upstream_cache_background_update().
The fix is to create background cache update subrequest before sending the
cached response.
Since initial introduction in 1aeaae6e9446 (1.11.10) background cache update
subrequest was created after sending the cached response because otherwise it
blocked the parent request output. In 9552758a786e (1.13.1) background
subrequests were introduced to eliminate the delay before sending the final
part of the cached response. This also made it possible to create the
background cache update subrequest before sending the response.
Note that creating the subrequest earlier does not change the fact that in case
of filter finalization the background cache update subrequest will likely not
have enough time to successfully update the cache entry. Filter finalization
leads to the main request termination as soon as the current iteration of request
processing is complete.
Previously, a variant not present in shared memory and stored on disk using a
secondary key was read using c->body_start from a variant stored with a main
key. This could result in critical errors "cache file .. has too long header".
Previously the stale-if-error extension of the Cache-Control upstream header
triggered the return of a stale response for all error conditions that can be
specified in the proxy_cache_use_stale directive. The list of these errors
includes both network/timeout/format errors, as well as some HTTP codes like
503, 504, 403, 429 etc. The latter prevented a cache entry from being updated
by a response with any of these HTTP codes during the stale-if-error period.
Now stale-if-error only works for network/timeout/format errors and ignores
the upstream HTTP code. The return of a stale response for certain HTTP codes
is still possible using the proxy_cache_use_stale directive.
This change also applies to the stale-while-revalidate extension of the
Cache-Control header, which triggers stale-if-error if it is missing.
Reported at
http://mailman.nginx.org/pipermail/nginx/2020-July/059723.html.
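The new decision can be summarised with a small Python model. It is a
sketch under assumptions, not the upstream module's code: may_serve_stale()
is a hypothetical helper, and failures are named after the
proxy_cache_use_stale parameters.

```python
# Failures that stale-if-error still covers after this change.
NETWORK_ERRORS = {"error", "timeout", "invalid_header"}


def may_serve_stale(failure, use_stale, stale_if_error_active):
    """Decide whether a stale cached response may be returned.

    failure: "error", "timeout", "invalid_header" or "http_NNN"
    use_stale: set of conditions listed in proxy_cache_use_stale
    stale_if_error_active: True while the stale-if-error period lasts
    """
    if failure in use_stale:
        return True                 # explicitly configured, as before
    return stale_if_error_active and failure in NETWORK_ERRORS
```

For example, a 503 from upstream during the stale-if-error period now
replaces the cache entry unless http_503 is listed in proxy_cache_use_stale.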
In ngx_http_range_singlepart_body() special buffers were passed
unmodified, including ones after the end of the range. As such,
if the last buffer of a response was sent separately as a special
buffer, two buffers with b->last_buf set were present in the response.
In particular, this might result in a duplicate final chunk when using
chunked transfer encoding (normally range filter and chunked transfer
encoding are not used together, but this may happen if there are trailers
in the response). This is also likely to cause problems in HTTP/2.
Fix is to skip all special buffers after we've sent the last part of
the range requested. These special buffers are not meaningful anyway,
since we set b->last_buf in the buffer with the last part of the range,
and everything is expected to be flushed due to it.
Additionally, ngx_http_next_body_filter() is now called even
if no buffers are to be passed to it. This ensures that various
write events are properly propagated through the filter chain. In
particular, this fixes test failures observed with the above change
and aio enabled.
Filters are not allowed to change incoming chain links, and should allocate
their own links if any modifications are needed. Nevertheless
ngx_http_range_singlepart_body() modified incoming chain links in some
cases, notably at the end of the requested range.
No problems caused by this are currently known, mostly because of
limited number of possible modifications and the position of the range
body filter in the filter chain. Though this behaviour is clearly incorrect
and tests demonstrate that it can at least cause some proxy buffers being
lost when using proxy_force_ranges, leading to less effective handling
of responses.
Fix is to always allocate new chain links in ngx_http_range_singlepart_body().
Links are explicitly freed to ensure constant memory usage with long-lived
requests.
If a complex value is expected to be of type size_t, and the compiled
value is constant, the constant size_t value is remembered at compile
time.
The value is accessed through ngx_http_complex_value_size() which
either returns the remembered constant or evaluates the expression
and parses it as size_t.
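The idea can be modelled in Python as follows. This is a sketch, not the
complex value implementation: the ComplexValueSize class and parse_size()
are hypothetical, variables are looked up in a plain dict, and only the
"k" and "m" suffixes are handled.

```python
import re


def parse_size(text):
    """Parse a size such as "512", "8k" or "1m" into bytes."""
    units = {"k": 1024, "m": 1024 * 1024}
    suffix = text[-1:].lower()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)


class ComplexValueSize:
    """A value that is either a remembered constant or an expression."""

    def __init__(self, source):
        self.source = source
        self.constant = None
        if "$" not in source:
            # No variables: parse once, at "compile" time.
            self.constant = parse_size(source)

    def value(self, variables):
        if self.constant is not None:
            return self.constant
        # Variables present: evaluate, then parse on every call.
        expanded = re.sub(r"\$(\w+)",
                          lambda m: variables[m.group(1)], self.source)
        return parse_size(expanded)
```

A constant such as "8k" is parsed once, while "$limit" is evaluated against
the request's variables each time value() is called.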
Previously, ngx_utf8_decode() was called from ngx_utf8_length() with
incorrect length, potentially resulting in out-of-bounds read when
handling invalid UTF-8 strings.
In practice out-of-bounds reads are not possible though, as autoindex, the
only user of ngx_utf8_length(), provides null-terminated strings, and
ngx_utf8_decode() returns an error anyway when it sees a null in the
middle of a UTF-8 sequence.
Reported by Yunbin Liu.
If OCSP stapling was enabled with dynamic certificate loading, with some
OpenSSL versions (1.0.2o and older, 1.1.0h and older; fixed in 1.0.2p,
1.1.0i, 1.1.1) a segmentation fault might happen.
The reason is that during an abbreviated handshake the certificate
callback is not called, but the certificate status callback was called
(https://github.com/openssl/openssl/issues/1662), leading to NULL being
returned from SSL_get_certificate().
Fix is to explicitly check SSL_get_certificate() result.
If X509_get_issuer_name() or X509_get_subject_name() returned NULL,
this could lead to a certificate reference leak. It cannot happen
in practice though, since each function returns an internal pointer
to a mandatory subfield of the certificate successfully decoded by
d2i_X509() during certificate message processing (closes #1751).
Previously the ngx_inet_resolve_host() function sorted addresses in a way that
IPv4 addresses came before IPv6 addresses. This was implemented in eaf95350d75c
(1.3.10) along with the introduction of getaddrinfo() which could resolve host
names to IPv6 addresses. Since the "listen" directive only used the first
address, sorting made it possible to preserve "listen" compatibility with the previous
behavior and with the behavior of nginx built without IPv6 support. Now
"listen" uses all resolved addresses which makes sorting pointless.
Previously only one address was used by the listen directive handler even if
host name resolved to multiple addresses. Now a separate listening socket is
created for each address.
This makes it possible to provide certificates directly via variables
in ssl_certificate / ssl_certificate_key directives, without using
intermediate files.
It was accidentally introduced in 77436d9951a1 (1.15.9). In MSVC 2015
and more recent MSVC versions it triggers warning C4456 (declaration of
'pkey' hides previous local declaration). Previously, all such warnings
were resolved in 2a621245f4cf.
Reported by Steve Stevenson.
Server name callback is always called by OpenSSL, even
if server_name extension is not present in ClientHello. As such,
checking c->ssl->handshaked before the SSL_get_servername() result
should help to more effectively prevent renegotiation in
OpenSSL 1.1.0 - 1.1.0g, where neither SSL3_FLAGS_NO_RENEGOTIATE_CIPHERS
nor SSL_OP_NO_RENEGOTIATION is available.
The SSL_OP_NO_CLIENT_RENEGOTIATION option was introduced in LibreSSL 2.5.1.
Unlike OpenSSL's SSL_OP_NO_RENEGOTIATION, it only disables client-initiated
renegotiation, and hence can be safely used on all SSL contexts.
If ngx_pool_cleanup_add() fails, we have to clean up the just created SSL
context manually, so an appropriate call was added.
Additionally, ngx_pool_cleanup_add() moved closer to ngx_ssl_create() in
the ngx_http_ssl_module, to make sure there are no leaks due to intermediate
code.
Notably this affects various allocation errors, and should generally
improve things if an allocation error actually happens during a callback.
Depending on the OpenSSL version, returning an error can result in
either SSL_R_CALLBACK_FAILED or SSL_R_CLIENTHELLO_TLSEXT error from
SSL_do_handshake(), so both errors were switched to the "info" level.
OpenSSL 1.1.1 does not save server name to the session if server name
callback returns anything but SSL_TLSEXT_ERR_OK, thus breaking
the $ssl_server_name variable in resumed sessions.
Since $ssl_server_name can be used even if we've selected the default
server and there are no other servers, it looks like the only viable
solution is to always return SSL_TLSEXT_ERR_OK regardless of the actual
result.
To fix things in the stream module as well, added a dummy server name
callback which always returns SSL_TLSEXT_ERR_OK.
A virtual server may have no SSL context if it does not have certificates
defined, so we have to use config of the ngx_http_ssl_module from the
SSL context in the certificate callback. To do so, it is now passed as
the argument of the callback.
The stream module doesn't really need any changes, but was modified as
well to match http code.
Dynamic certificates re-introduce problem with incorrect session
reuse (AKA "virtual host confusion", CVE-2014-3616), since there are
no server certificates to generate session id context from.
To prevent this, session id context is now generated from ssl_certificate
directives as specified in the configuration. This approach prevents
incorrect session reuse in most cases, while still allowing sharing
sessions across multiple machines with ssl_session_ticket_key set as
long as configurations are identical.
Passwords have to be copied to the configuration pool to be used
at runtime. Also, to prevent blocking on stdin (with "daemon off;")
an empty password list is provided.
To make things simpler, password handling was modified to allow
an empty array (with 0 elements and elts set to NULL) as an equivalent
of an array with 1 empty password.
To evaluate variables, a request is created in the certificate callback,
and then freed. To do this without side effects on the stub_status
counters and connection state, an additional function was introduced,
ngx_http_alloc_request().
Only works with OpenSSL 1.0.2+, since there is no SSL_CTX_set_cert_cb()
in older versions.
This makes it possible to reuse certificate loading at runtime,
as introduced in the following patches.
Additionally, this improves error logging, so nginx will now log
human-friendly messages "cannot load certificate" instead of only
referring to sometimes cryptic names of OpenSSL functions.
The "(SSL:)" snippet currently appears in logs when nginx code uses
ngx_ssl_error() to log an error, but OpenSSL's error queue is empty.
This can happen either because the error wasn't in fact from OpenSSL,
or because OpenSSL did not indicate the error in the error queue
for some reason.
In particular, currently "(SSL:)" can be seen in errors at least in
the following cases:
- When SSL_write() fails due to a syscall error,
"[info] ... SSL_write() failed (SSL:) (32: Broken pipe)...".
- When loading a certificate with no data in it,
"[emerg] PEM_read_bio_X509_AUX(...) failed (SSL:)".
This can easily happen due to an additional empty line before
the end line, so all lines of the certificate are interpreted
as header lines.
- When trying to configure an unknown curve,
"[emerg] SSL_CTX_set1_curves_list("foo") failed (SSL:)".
Likely there are other cases as well.
With this change, "(SSL:...)" will be only added to the error message
if there is something in the error queue. This is expected to make
logs more readable in the above cases. Additionally, with this change
it is now possible to use ngx_ssl_error() to log errors when some
of the possible errors are not from OpenSSL and not expected to have
anything in the error queue.
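The resulting message layout can be sketched in Python. This only models
the formatting rule and is not ngx_ssl_error() itself: format_ssl_error()
is a hypothetical helper, and the error queue is represented by a list of
strings.

```python
def format_ssl_error(text, queue, errno_text=None):
    """Build a log message, adding "(SSL: ...)" only for a non-empty queue."""
    message = text
    if queue:
        message += " (SSL: " + " ".join(queue) + ")"
    if errno_text:
        message += " (" + errno_text + ")"
    return message


# Syscall failure with an empty error queue: no "(SSL:)" snippet.
plain = format_ssl_error("SSL_write() failed", [], "32: Broken pipe")
```

With an empty queue the message reads "SSL_write() failed (32: Broken pipe)",
without the empty "(SSL:)" part.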
Checking multiple errors at once is a bad practice, as in general
it is not guaranteed that an object can be used after the error.
In this particular case, checking errors after multiple allocations
can result in excessive errors being logged when there is no memory
available.
On Windows, connect() errors are only reported via exceptfds descriptor set
from select(). Previously exceptfds was set to NULL, and connect() errors
were not detected at all, so connects to closed ports were waiting till
a timeout occurred.
Since ongoing connect() means that there will be a write event active,
except descriptor set is copied from the write one. While it is possible
to construct except descriptor set as a concatenation of both read and write
descriptor sets, this looks unneeded.
With this change, connect() errors are properly detected now when using
select(). Note well that it is not possible to detect connect() errors with
WSAPoll() (see https://daniel.haxx.se/blog/2012/10/10/wsapoll-is-broken/).
WSAPoll() is only available with Windows Vista and newer (and only
available during compilation if _WIN32_WINNT >= 0x0600). To make
sure the code works with Windows XP, we do not redefine _WIN32_WINNT,
but instead load WSAPoll() dynamically if it is not available during
compilation.
Also, sockets are not guaranteed to be small integers on Windows.
So an index array is used instead of NGX_USE_FD_EVENT to map
events to connections.
Previously, the code incorrectly assumed "ngx_event_t *" elements
instead of "struct pollfd".
This is mostly a cosmetic change, as this code is never called now.
Previously, when using proxy_upload_rate and proxy_download_rate, the buffer
size for reading from a socket could be reduced as a result of rate limiting.
For connection-oriented protocols this behavior is normal since unread data will
normally be read at the next iteration. But for datagram-oriented protocols
this is not the case, and unread part of the datagram is lost.
Now buffer size is not limited for datagrams. Rate limiting still works in this
case by delaying the next reading event.
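A minimal sketch of the idea follows. The function names, the "limit"
budget (bytes still allowed by the rate limit) and the delay formula are
assumptions made for illustration, not the stream proxy code itself.

```c
#include <stddef.h>

/*
 * Hypothetical sketch.  "size" is the free space in the buffer, "limit" is
 * the number of bytes still allowed by the rate limit (assumed > 0 here;
 * otherwise the caller delays the read event instead of reading).
 */
size_t
demo_recv_size(size_t size, long long limit, int is_datagram)
{
    if (!is_datagram && (long long) size > limit) {
        /* stream: unread bytes stay in the socket for the next iteration */
        return (size_t) limit;
    }

    /* datagram: a short read would discard the rest of the packet */
    return size;
}

/*
 * Delay in milliseconds before the next read once the budget is used up
 * (limit <= 0), proportional to the overdraft at "rate" bytes per second.
 */
unsigned long long
demo_read_delay(long long limit, long long rate)
{
    return (unsigned long long) (-limit * 1000 / rate + 1);
}
```

For a datagram the full buffer is always offered to the read call, and the
limit is enforced only through the delay.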
A shared connection does not own its file descriptor, which means that
ngx_handle_read_event/ngx_handle_write_event calls should do nothing for it.
Currently the c->shared flag is checked in several places in the stream proxy
module prior to calling these functions. However it was not done everywhere.
Missing checks could lead to calling
ngx_handle_read_event/ngx_handle_write_event on shared connections.
The problem manifested itself when using proxy_upload_rate and resulted in
either duplicate file descriptor error (e.g. with epoll) or incorrect further
udp packet processing (e.g. with kqueue).
The fix is to set and reset the event active flag in a way that prevents
ngx_handle_read_event/ngx_handle_write_event from scheduling socket events.
The previous interface of ngx_open_dir() assumed that the passed directory
name has room for NGX_DIR_MASK at the end (NGX_DIR_MASK_LEN bytes). While all
direct users of ngx_open_dir() followed this interface, this also implied
similar requirements for indirect uses - in particular, via ngx_walk_tree().
Currently none of the ngx_walk_tree() uses provides appropriate space, and
fixing this does not look like the right way to go. Instead, the
ngx_open_dir() interface was changed to not require any additional space and
use appropriate allocations instead.
If SSL_write_early_data() returned SSL_ERROR_WANT_WRITE, stop further reading
using a newly introduced c->ssl->write_blocked flag, as otherwise this would
result in SSL error "ssl3_write_bytes:bad length". Eventually, normal reading
will be restored by read event posted from successful SSL_write_early_data().
While here, place "SSL_write_early_data: want write" debug on the path.
Previously, if an SRV record was successfully resolved, but all of its A
records failed to resolve, NXDOMAIN was returned to the caller, which is
considered a successful resolve rather than an error. This could result in
losing the result of a previous successful resolve by the caller.
Now NXDOMAIN is only returned if at least one A resolve completed with this
code. Otherwise the error state of the first A resolve is returned.
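The selection rule can be sketched as follows. The helper name and the
constants are hypothetical (the numeric values follow the DNS RCODE field),
and treating any successful A resolve as overall success is an assumption
of this sketch.

```c
#include <stddef.h>

/* Response codes used in this sketch (values as in the DNS RCODE field). */
enum {
    DEMO_RESOLVE_OK       = 0,
    DEMO_RESOLVE_SERVFAIL = 2,
    DEMO_RESOLVE_NXDOMAIN = 3
};

/*
 * Hypothetical helper: combine the states of the n (n > 0) A resolves
 * started for one SRV record into the state reported to the caller.
 */
int
demo_srv_state(const int *states, size_t n)
{
    size_t  i;
    int     nxdomain = 0;

    for (i = 0; i < n; i++) {
        if (states[i] == DEMO_RESOLVE_OK) {
            return DEMO_RESOLVE_OK;     /* at least one address resolved */
        }

        if (states[i] == DEMO_RESOLVE_NXDOMAIN) {
            nxdomain = 1;
        }
    }

    /* NXDOMAIN only if some A resolve said so, else the first error */
    return nxdomain ? DEMO_RESOLVE_NXDOMAIN : states[0];
}
```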
Previously, unnamed regex captures matched in the parent request were not
available in a cloned subrequest. Now 3 fields related to unnamed captures
are copied to a cloned subrequest: r->ncaptures, r->captures and
r->captures_data. Since r->captures cannot be changed by either request after
creating a clone, a new flag r->realloc_captures is introduced to force
reallocation of r->captures.
The issue was reported as a proxy_cache_background_update misbehavior in
http://mailman.nginx.org/pipermail/nginx/2018-December/057251.html.
In the past, there were several security issues which resulted in
worker process memory disclosure due to buffers with negative size.
It looks reasonable to check for such buffers in various places,
much like we already check for zero size buffers.
While here, removed "#if 1 / #endif" around zero size buffer checks.
It looks highly unlikely that we'll disable these checks anytime soon.
On 32-bit platforms mp4->buffer_pos might overflow when a large
enough (close to 4 gigabytes) atom is being skipped, resulting in
incorrect memory addresses being read further in the code. In most
cases this results in harmless errors being logged, though may also
result in a segmentation fault if hitting unmapped pages.
To address this, ngx_mp4_atom_next() now only increments mp4->buffer_pos
up to mp4->buffer_end. This ensures that overflow cannot happen.
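A minimal sketch of such a clamped increment is shown below. The
demo_mp4_t structure and demo_atom_next() are hypothetical stand-ins, and
returning the number of bytes that remain to be skipped is an assumption
made to keep the example self-contained.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of the mp4 parser state. */
typedef struct {
    const unsigned char  *buffer_pos;
    const unsigned char  *buffer_end;
} demo_mp4_t;

/*
 * Skip n bytes, but never move buffer_pos past buffer_end, so the pointer
 * cannot wrap around even for atoms close to 4 gigabytes.  Returns the
 * number of bytes that still have to be skipped outside of the buffer.
 */
uint64_t
demo_atom_next(demo_mp4_t *mp4, uint64_t n)
{
    size_t  avail = (size_t) (mp4->buffer_end - mp4->buffer_pos);

    if (n > avail) {
        mp4->buffer_pos = mp4->buffer_end;
        return n - avail;
    }

    mp4->buffer_pos += (size_t) n;
    return 0;
}
```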
Variables now do not depend on presence of the HTTP status code in response.
If the corresponding event occurred, variables contain time between request
creation and the event, and "-" otherwise.
Previously, intermediate value of the $upstream_response_time variable held
unix timestamp.
The directive allows dropping the binding between a client and an existing
UDP stream session after receiving a specified number of packets. The first
packet from the same client address and port will start a new session. The
old session continues to exist and will terminate at the moment defined by
configuration: either after
receiving the expected number of responses, or after timeout, as specified by
the "proxy_responses" and/or "proxy_timeout" directives.
By default, proxy_requests is zero (disabled).
An attack that continuously switches HTTP/2 connection between
idle and active states can result in excessive CPU usage.
This is because when a connection switches to the idle state,
all of its memory pool caches are freed.
This change limits the maximum allowed number of idle state
switches to 10 * http2_max_requests (i.e., 10000 by default).
This limits possible CPU usage in one connection, and also
imposes a limit on the maximum lifetime of a connection.
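The limit can be pictured as a simple per-connection counter. The structure
and function below are hypothetical and only illustrate the arithmetic
(10 * http2_max_requests); they are not the HTTP/2 module code.

```c
/* Hypothetical per-connection state; field names are illustrative. */
typedef struct {
    unsigned long  idle_switches;   /* switches to the idle state so far */
    unsigned long  max_requests;    /* value of http2_max_requests */
} demo_h2_conn_t;

/*
 * Returns 0 if the connection may switch to the idle state once more,
 * or -1 if the budget of 10 * max_requests switches is exhausted and
 * the connection should be closed.
 */
int
demo_h2_enter_idle(demo_h2_conn_t *h2c)
{
    if (h2c->idle_switches >= 10 * h2c->max_requests) {
        return -1;
    }

    h2c->idle_switches++;
    return 0;
}
```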
Initially reported by Gal Goldshtein from F5 Networks.
Fixed uncontrolled memory growth in case peer is flooding us with
some frames (e.g., SETTINGS and PING) and doesn't read data. Fix
is to limit the number of allocated control frames.
Previously there was no validation for the size of a 64-bit atom
in an mp4 file. This could lead to a CPU hog when the size is 0,
or various other problems due to integer underflow when calculating
atom data size, including segmentation fault or worker process
memory disclosure.
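The kind of validation involved can be sketched as follows, assuming the
usual 16-byte header of a 64-bit atom (32-bit size, 4-byte name, 64-bit
size). The names are hypothetical and the sketch is not the module code.

```c
#include <stdint.h>

/* 4-byte size + 4-byte name + 8-byte extended size */
#define DEMO_ATOM_HEADER64_SIZE  16

/*
 * Hypothetical helper: reject 64-bit atoms smaller than their own header
 * (including size 0), so that the data size never underflows.
 * Returns 0 and stores the data size on success, -1 otherwise.
 */
int
demo_atom64_data_size(uint64_t atom_size, uint64_t *data_size)
{
    if (atom_size < DEMO_ATOM_HEADER64_SIZE) {
        return -1;
    }

    *data_size = atom_size - DEMO_ATOM_HEADER64_SIZE;
    return 0;
}
```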
The size of a shared memory zone must be at least two pages - one page
for slab allocator internal data, and another page for actual allocations.
Using 8192 instead is wrong, as there are systems with page sizes other
than 4096.
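A sketch of a check based on the actual page size rather than a hardcoded
8192 is given below. It uses the POSIX sysconf(_SC_PAGESIZE) call directly;
the demo_zone_size_ok() name is hypothetical.

```c
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical check: a zone needs one page for slab allocator internal
 * data and at least one more page for actual allocations.
 */
int
demo_zone_size_ok(size_t size)
{
    long  page = sysconf(_SC_PAGESIZE);

    if (page <= 0) {
        return 0;       /* page size unknown, be conservative */
    }

    return size >= 2 * (size_t) page;
}
```

On a system with 16k pages, a zone of 8192 bytes is rejected by this check,
while it would have passed a comparison against a constant.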
Note well that two pages is usually too low as well. In particular, cache
is likely to use two allocations of different sizes for global structures,
and at least four pages will be needed to properly allocate cache nodes.
Except in a few very special cases, with keys zone of just two pages nginx
won't be able to start. Other uses of shared memory impose a limit
of 8 pages, which provides some room for global allocations. This patch
doesn't try to address this though.
Inspired by ticket #1665.
With the maximum version explicitly set, TLSv1.3 will not be unexpectedly
enabled if nginx compiled with OpenSSL 1.1.0 (without TLSv1.3 support)
is run with OpenSSL 1.1.1 (with TLSv1.3 support).
In e3ba4026c02d (1.15.4) nginx's own renegotiation checks were disabled
if SSL_OP_NO_RENEGOTIATION is available. But since SSL_OP_NO_RENEGOTIATION
is only set on a connection, not in an SSL context, SSL_clear_option()
removed it as long as a matching virtual server was found. This resulted
in a segmentation fault similar to the one fixed in a6902a941279 (1.9.8),
affecting nginx built with OpenSSL 1.1.0h or higher.
To fix this, SSL_OP_NO_RENEGOTIATION is now explicitly set in
ngx_http_ssl_servername() after adjusting options. Additionally, instead
of c->ssl->renegotiation we now check c->ssl->handshaked, which seems
to be a more correct flag to test, and will prevent the segmentation fault
from happening even if SSL_OP_NO_RENEGOTIATION is not working.
The "no suitable signature algorithm" errors are reported by OpenSSL 1.1.1
when using TLSv1.3 if there are no shared signature algorithms. In
particular, this can happen if the client limits available signature
algorithms to something we don't have a certificate for, or to an empty
list. For example, the following command:
openssl s_client -connect 127.0.0.1:8443 -sigalgs rsa_pkcs1_sha1
will always result in the "no suitable signature algorithm" error
as the "rsa_pkcs1_sha1" algorithm refers solely to signatures which
appear in certificates and is not defined for use in TLS 1.3 handshake
messages.
The SSL_R_NO_COMMON_SIGNATURE_ALGORITHMS error is what BoringSSL returns
in the same situation.
The "no suitable key share" errors are reported by OpenSSL 1.1.1 when
using TLSv1.3 if there are no shared groups (that is, elliptic curves).
In particular, it is easy enough to trigger by using only a single
curve in ssl_ecdh_curve:
ssl_ecdh_curve secp384r1;
and using a different curve in the client:
openssl s_client -connect 127.0.0.1:443 -curves prime256v1
On the client side it is seen as "sslv3 alert handshake failure",
"SSL alert number 40":
0:error:14094410:SSL routines:ssl3_read_bytes:sslv3 alert handshake failure:ssl/record/rec_layer_s3.c:1528:SSL alert number 40
It can be also triggered with default ssl_ecdh_curve by using a curve
which is not in the default list (X25519, prime256v1, X448, secp521r1,
secp384r1):
openssl s_client -connect 127.0.0.1:8443 -curves brainpoolP512r1
Given that many clients hardcode prime256v1, these errors might become
a common problem with TLSv1.3 if ssl_ecdh_curve is redefined. Previously
this resulted in not using ECDH with such clients, but with TLSv1.3 it
is no longer possible and will result in a handshake failure.
The SSL_R_NO_SHARED_GROUP error is what BoringSSL returns in the same
situation.
Seen at:
https://serverfault.com/questions/932102/nginx-ssl-handshake-error-no-suitable-key-share
Previously, configurations with a typo, for example
    fastcgi_cache_valid 200301 302 5m;
successfully passed the configuration test. A check for status
codes > 599 was added, and such configurations are now properly rejected.
The bgcolor attribute overrides compatibility settings in browsers
and leads to undesirable behavior when the default font color is set
to white in the browser, since font-color is not also overridden.
Following 7319:dcab86115261, as long as SSL_OP_NO_RENEGOTIATION is
defined, it is the OpenSSL library's responsibility to prevent renegotiation,
so the checks are meaningless.
Additionally, with TLSv1.3 OpenSSL tends to report SSL_CB_HANDSHAKE_START
at various unexpected moments - notably, on KeyUpdate messages and
when sending tickets. This change prevents unexpected connection
close on KeyUpdate messages and when finishing handshake with upcoming
early data changes.
Trying to look into r->err_status in the "return" directive
makes it behave differently than real errors generated in other
parts of the code, and is an endless source of various problems.
This behaviour was introduced in 726:7b71936d5299 (0.4.4) with
the comment "fix: "return" always overrode "error_page" response code".
It is not clear if there were any real cases this was expected to fix,
but there are several cases which are broken due to this change, some
previously fixed (4147:7f64de1cc2c0).
In ticket #1634, the problem is that when r->err_status is set to
a non-special status code, it is not possible to return a response
by simply returning r->err_status. If this is the case, the only
option is to return script's e->status instead. An example
configuration:
    location / {
        error_page 404 =200 /err502;
        return 404;
    }
    location = /err502 {
        return 502;
    }
After the change, such a configuration will properly return
standard 502 error, much like it happens when a 502 error is
generated by proxy_pass.
This also fixes the following configuration to properly close
connection as clearly requested by "return 444":
    location / {
        error_page 404 /close;
        return 404;
    }
    location = /close {
        return 444;
    }
Previously, this required "error_page 404 = /close;" to work
as intended.
Socket leak was observed in the following configuration:
    error_page 400 = /close;
    location = /close {
        return 444;
    }
The problem is that "return 444" triggers termination of the request,
and due to error_page termination thinks that it needs to use a posted
request to clear stack. But at the early request processing where 400
errors are generated there are no ngx_http_run_posted_requests() calls,
so the request is only terminated after an external event.
Variants of the problem include "error_page 497" instead (ticket #695)
and various other errors generated during early request processing
(405, 414, 421, 494, 495, 496, 501, 505).
The same problem can be also triggered with "return 499" and "return 408"
as both codes trigger ngx_http_terminate_request(), much like "return 444".
To fix this, the patch adds ngx_http_run_posted_requests() calls to
ngx_http_process_request_line() and ngx_http_process_request_headers()
functions, and to ngx_http_v2_run_request() and ngx_http_v2_push_stream()
functions in HTTP/2.
Since the ngx_http_process_request() function is now only called via
other functions which call ngx_http_run_posted_requests(), the call
there is no longer needed and was removed.
It is possible that after SSL_read() returns SSL_ERROR_WANT_WRITE,
further calls will return SSL_ERROR_WANT_READ without reading any
application data. We have to call ngx_handle_write_event() and
switch back to normal write handling much like we do if there are some
application data, or the write there will be reported again and again.
Similarly, we have to switch back to normal read handling if there
is saved read handler and SSL_write() returns SSL_ERROR_WANT_WRITE.
While SSL_read() is most likely to return SSL_ERROR_WANT_WRITE (and
SSL_write() accordingly SSL_ERROR_WANT_READ) during an SSL renegotiation, it
does not necessarily mean that a renegotiation was started. In particular,
it can never happen during a renegotiation or can happen multiple times
during a renegotiation.
Because of the above, misleading "peer started SSL renegotiation" info
messages were replaced with "SSL_read: want write" and "SSL_write: want read"
debug ones.
Additionally, "SSL write handler" and "SSL read handler" are now logged
by the SSL write and read handlers, to make it easier to understand that
temporary SSL handlers are called instead of normal handlers.
The "do { c->recv() } while (c->read->ready)" form used in the
ngx_http_lingering_close_handler() is not really correct, as for
example with SSL c->read->ready may be still set when returning NGX_AGAIN
due to SSL_ERROR_WANT_WRITE. Therefore the above might be an infinite loop.
This doesn't really matter in lingering close, as we shut down the write side
of the socket anyway and also disable renegotiation (and even without shutdown
and with renegotiation it requires using very large certificate chain and
tuning socket buffers to trigger SSL_ERROR_WANT_WRITE). But for the sake of
correctness added an NGX_AGAIN check.
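The loop shape with the extra check can be sketched with a fake connection
object. All names below are hypothetical; DEMO_AGAIN and DEMO_ERROR merely
mirror the roles of NGX_AGAIN and NGX_ERROR, and demo_recv_want_write()
simulates a recv() that reports "again" while the ready flag stays set.

```c
#include <stddef.h>

#define DEMO_ERROR  (-1)
#define DEMO_AGAIN  (-2)

typedef struct demo_conn_s  demo_conn_t;

struct demo_conn_s {
    int    ready;       /* stand-in for c->read->ready */
    int    calls;       /* how many times recv() was invoked */
    long (*recv)(demo_conn_t *c, char *buf, size_t size);
};

/* Fake recv(): nothing was read, but "ready" is left set. */
long
demo_recv_want_write(demo_conn_t *c, char *buf, size_t size)
{
    (void) buf;
    (void) size;

    c->calls++;
    return DEMO_AGAIN;
}

/* Discard pending input; returns 0 to keep waiting, -1 to close. */
int
demo_discard_input(demo_conn_t *c)
{
    char  buf[4096];
    long  n;

    do {
        n = c->recv(c, buf, sizeof(buf));

        if (n == DEMO_AGAIN) {
            break;      /* without this check the loop might never end */
        }

        if (n == DEMO_ERROR || n == 0) {
            return -1;  /* error or connection closed by the peer */
        }

    } while (c->ready);

    return 0;
}
```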
If sending request body was not completed (u->request_body_sent is not set),
the upstream keepalive module won't save such a connection. However, it
is theoretically possible (though highly unlikely) that sending of some
control frames can be blocked after the request body was sent. The
ctx->output_blocked flag was introduced to disable keepalive in such cases.
The code is now able to parse additional control frames after
the response is received, and can send control frames as well.
This fixes keepalive problems as observed with grpc-c, which can
send window update and ping frames after the response, see
http://mailman.nginx.org/pipermail/nginx/2018-August/056620.html.
Previously the preread phase code ignored NGX_AGAIN value returned from
c->recv() and relied only on c->read->ready. But this flag is not reliable and
should only be checked for optimization purposes. For example, when using
SSL, c->read->ready may be set when no input is available. This can lead to
calling preread handler infinitely in a loop.
The problem does not manifest itself currently, because in case of
non-buffered reading, chain link created by u->create_request method
consists of a single element.
Found by PVS-Studio.
The directive configures maximum number of requests allowed on
a connection kept in the cache. Once a connection reaches the number
of requests configured, it is no longer saved to the cache.
The default is 100.
Much like keepalive_requests for client connections, this is mostly
a safeguard to make sure connections are closed periodically and the
memory allocated from the connection pool is freed.
The directive configures maximum time a connection can be kept in the
cache. By configuring a time which is smaller than the corresponding
timeout on the backend side one can avoid the race between closing
a connection by the backend and nginx trying to use the same connection
to send a request at the same time.
LibreSSL 2.8.0 "added const annotations to many existing APIs from OpenSSL,
making interoperability easier for downstream applications". This includes
the const change in the SSL_CTX_sess_set_get_cb() callback function (see
9dd43f4ef67e), which breaks compilation.
To fix this, added a condition on how we redefine OPENSSL_VERSION_NUMBER
when working with LibreSSL (see 382fc7069e3a). With LibreSSL 2.8.0,
we now set OPENSSL_VERSION_NUMBER to 0x1010000fL (OpenSSL 1.1.0), so the
appropriate conditions in the code will use "const" as it happens with
OpenSSL 1.1.0 and later versions.
There are clients which cannot handle HPACK's dynamic table size updates
as added in 12cadc4669a7 (1.13.6). Notably, old versions of OkHttp library
are known to fail on it (ticket #1397).
This change makes it possible to work with such clients by only sending
dynamic table size updates in response to SETTINGS_HEADER_TABLE_SIZE. As
a downside, clients which do not use SETTINGS_HEADER_TABLE_SIZE will
continue to maintain default 4k table.
Previously, a chunk of spaces larger than NGX_CONF_BUFFER (4096 bytes)
resulted in the "too long parameter" error during parsing such a
configuration. This was because the code only set start and start_line
on non-whitespace characters, and hence adjacent whitespace characters
were preserved when reading additional data from the configuration file.
Fix is to always move start and start_line if the last character was
a space.
Early data AKA 0-RTT mode is enabled as long as "ssl_early_data on" is
specified in the configuration (default is off).
The $ssl_early_data variable evaluates to "1" if the SSL handshake
isn't yet completed, and can be used to set the Early-Data header as
per draft-ietf-httpbis-replay-04.
BoringSSL currently requires SSL_CTX_set_max_proto_version(TLS1_3_VERSION)
to be able to enable TLS 1.3. This is because by default max protocol
version is set to TLS 1.2, and the SSL_OP_NO_* options are merely used
as a blacklist within the version range specified using the
SSL_CTX_set_min_proto_version() and SSL_CTX_set_max_proto_version()
functions.
With this change, we now call SSL_CTX_set_max_proto_version() with an
explicit maximum version set. This enables TLS 1.3 with BoringSSL.
As a side effect, this change also limits maximum protocol version to
the newest protocol we know about, TLS 1.3. This seems to be a good
change, as enabling unknown protocols might have unexpected results.
Additionally, we now explicitly call SSL_CTX_set_min_proto_version()
with 0. This is expected to help with Debian system-wide default
of MinProtocol set to TLSv1.2, see
http://mailman.nginx.org/pipermail/nginx-ru/2017-October/060411.html.
Note that there is no SSL_CTX_set_min_proto_version macro in BoringSSL,
so we call SSL_CTX_set_min_proto_version() and SSL_CTX_set_max_proto_version()
as long as the TLS1_3_VERSION macro is defined.
The behaviour is now in line with COPY of a directory with contents,
which preserves access masks on individual files, as well as the "cp"
command.
Requested by Roman Arutyunyan.
This fixes wrong permissions and file time after cross-device MOVE
in the DAV module (ticket #1577). Broken in 8101d9101ed8 (0.8.9) when
cross-device copying was introduced in ngx_ext_rename_file().
With this change, ngx_copy_file() always calls ngx_set_file_time(),
either with the time provided, or with the time from the original file.
This is considered acceptable given that copying the file is costly anyway,
and optimizing cases when we do not need to preserve time will require
interface changes.
Previously, ngx_open_file(NGX_FILE_CREATE_OR_OPEN) was used, resulting
in the destination file being partially rewritten if it exists. Notably,
this affected the WebDAV COPY command (ticket #1576).
Previously, "%uA" was used, which corresponds to ngx_atomic_uint_t.
Size of ngx_atomic_uint_t can be easily different from uint64_t,
leading to undefined results.
In TLSv1.3, NewSessionTicket messages arrive after the handshake and
can come at any time. Therefore we use a callback to save the session
when we know about it. This approach works for < TLSv1.3 as well.
The callback function is set once per location on merge phase.
Since SSL_get_session() in BoringSSL returns an unresumable session for
TLSv1.3, peer save_session() methods have been updated as well to use a
session supplied within the callback. To preserve API, the session is
cached in c->ssl->session. It is preferably accessed in save_session()
methods by ngx_ssl_get_session() and ngx_ssl_get0_session() wrappers.
In OpenSSL 1.1.0 the SSL_CTRL_CLEAR_OPTIONS macro was removed, so
conditional compilation test on it results in SSL_clear_options()
and SSL_CTX_clear_options() not being used. Notably, this caused
"ssl_prefer_server_ciphers off" to not work in SNI-based virtual
servers if server preference was switched on in the default server.
It looks like the only possible fix is to test OPENSSL_VERSION_NUMBER
explicitly.
Starting with OpenSSL 1.1.0, SSL_R_UNSUPPORTED_PROTOCOL instead of
SSL_R_UNKNOWN_PROTOCOL is reported when a protocol is disabled via
an SSL_OP_NO_* option.
Additionally, SSL_R_VERSION_TOO_LOW is reported when using MinProtocol
or when seclevel checks (as set by @SECLEVEL=n in the cipher string)
reject a protocol, and this is what happens with SSLv3 and @SECLEVEL=1,
which is the default.
There is also the SSL_R_VERSION_TOO_HIGH error code, but it looks like
it is not possible to trigger it.
There should be at least one worker connection for each listening socket,
plus an additional connection for channel between worker and master,
or starting worker processes will fail.
Previously, listening sockets were not cloned if the worker_processes
directive was specified after "listen ... reuseport".
This also simplifies upcoming configuration check on the number
of worker connections, as it needs to know the number of listening
sockets before cloning.
The variable keeps the latest SSL protocol version supported by the client.
The variable has the same format as $ssl_protocol.
The version is read from the client_version field of ClientHello. If the
supported_versions extension is present in the ClientHello, then the version
is set to TLSv1.3.
Errors when sending UDP datagrams can happen, e.g., when local IP address
changes (see fa0e093b64d7), or an unavailable DNS server on the LAN can cause
send() to fail with EHOSTDOWN on BSD systems. If this happens during
initial query, retry sending immediately, to a different DNS server when
possible. If this is not enough, allow normal resend to happen by ignoring
the return code of the second ngx_resolver_send_query() call, much like we
do in ngx_resolver_resend().
The "http request" and "https proxy request" errors cannot happen
with HTTP due to pre-handshake checks in ngx_http_ssl_handshake(),
but can happen when SSL is used in stream and mail modules.
With gRPC it is possible that a request sending is blocked due to flow
control. Moreover, further sending might be only allowed once the
backend sees all the data we've already sent. With such a backend
it is required to clear the TCP_NOPUSH socket option to make sure all
the data we've sent are actually delivered to the backend.
As such, we now clear TCP_NOPUSH in ngx_http_upstream_send_request()
also on NGX_AGAIN if c->write->ready is set. This fixes a test (which
waits for all the 64k bytes as per initial window before allowing more
bytes) with sendfile enabled when the body was written to a file
in a different context.
Now tcp_nopush on peer connections is disabled if it is disabled on
the client connection, similar to how we handle c->sendfile. Previously,
tcp_nopush was always used on upstream connections, regardless of
the "tcp_nopush" directive.
We copy input buffers to our buffers, so various flags might be
unexpectedly set in buffers returned by ngx_chain_get_free_buf().
In particular, the b->in_file flag might be set when the body was
written to a file in a different context. With sendfile enabled this
in turn might result in protocol corruption if such a buffer was reused
for a control frame.
Make sure to clear buffers and set only fields we really need to be set.
The module implements random load-balancing algorithm with optional second
choice. In the latter case, the best of two servers is chosen, taking into
account the number of connections and server weight.
Example:
    upstream u {
        random [two [least_conn]];
        server 127.0.0.1:8080;
        server 127.0.0.1:8081;
        server 127.0.0.1:8082;
        server 127.0.0.1:8083;
    }
Before 4a8c9139e579, ngx_resolver_create() didn't use configuration
pool, and allocations were done using malloc().
In 016352c19049, when resolver gained support of several servers,
new allocations were done from the pool.
With u->conf->preserve_output set the request body file might be used
after the response header is sent, so avoid cleaning it. (Normally
this is not a problem as u->conf->preserve_output is only set with
r->request_body_no_buffering, but the request body might be already
written to a file in a different context.)
Previously, only one client packet could be processed in a udp stream session
even though multiple response packets were supported. Now multiple packets
coming from the same client address and port are delivered to the same stream
session.
If it's required to maintain a single stream of data, nginx should be
configured in a way that all packets from a client are delivered to the same
worker. On Linux and DragonFly BSD the "reuseport" parameter should be
specified for this. Other systems do not currently provide appropriate
mechanisms. For these systems a single stream of udp packets is only
guaranteed in single-worker configurations.
The proxy_responses directive now specifies how many packets are expected in
response to a single client packet.
Previously, ngx_event_recvmsg() got remote socket addresses after creating
the connection object. In preparation to handling multiple UDP packets in a
single session, this code was moved up.
On Linux recvmsg() syscall may return a zero-length client address when
receiving a datagram from an unbound unix datagram socket. It is usually
assumed that socket address has at least the sa_family member. Zero-length
socket address caused buffer over-read in functions which receive socket
address, for example ngx_sock_ntop(). Typically the over-read resulted in
unexpected socket family followed by session close. Now a fake socket address
is allocated instead of a zero-length client address.
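The substitution can be sketched as below. The demo_fix_client_addr() helper
is hypothetical; it only shows a zero-length address being replaced by a
zeroed struct sockaddr that carries the family of the listening socket.

```c
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: if recvmsg() reported a zero-length client address,
 * fill in a fake one so that later code can safely read sa_family.
 * Returns the (possibly adjusted) address length.
 */
socklen_t
demo_fix_client_addr(struct sockaddr *sa, socklen_t socklen,
    sa_family_t family)
{
    if (socklen == 0) {
        memset(sa, 0, sizeof(struct sockaddr));
        sa->sa_family = family;
        return sizeof(struct sockaddr);
    }

    return socklen;
}
```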
Negative times can appear since workers only update time on an event
loop iteration start. If a worker was blocked for a long time during
an event loop iteration, it is possible that another worker already
updated the time stored in the node. As such, time since last update
of the node (ms) will be negative.
Previous code used ngx_abs(ms) in the calculations. That is, negative
times were effectively treated as positive ones. As a result, it was
not possible to maintain high request rates, where the same node can be
updated multiple times during an event loop iteration.
In particular, this affected setups with many SSL handshakes, see
http://mailman.nginx.org/pipermail/nginx/2018-May/056291.html.
Fix is to only update the last update time stored in the node if the
new time is larger than previously stored one. If a future time is
stored in the node, we preserve this time as is.
To prevent breaking things on platforms without monotonic time available
if system time is updated backwards, a safety limit of 60 seconds is
used. If the time stored in the node is more than 60 seconds in the future,
we assume that the time was changed backwards and update lr->last
to the current time.
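The update rule can be sketched as follows. The structure, the function and
the choice to report zero elapsed time in both "future" cases are
assumptions of this sketch; only the 60 second safety limit and the role of
lr->last come from the description above.

```c
#include <stdint.h>

#define DEMO_SAFETY_MS  60000   /* the 60 second safety limit */

/* Hypothetical stand-in for the node; "last" is the last update time. */
typedef struct {
    int64_t  last;              /* milliseconds */
} demo_node_t;

/*
 * Returns the elapsed time in milliseconds to be used in the rate
 * calculation and updates lr->last according to the rules above.
 */
int64_t
demo_update_last(demo_node_t *lr, int64_t now)
{
    int64_t  ms = now - lr->last;

    if (ms < -DEMO_SAFETY_MS) {
        /* implausibly far in the future: assume time went backwards */
        lr->last = now;
        return 0;
    }

    if (ms < 0) {
        /* another worker already stored a newer time: preserve it */
        return 0;
    }

    lr->last = now;
    return ms;
}
```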
The bug in question was fixed in glibc 2.3.2 and is no longer expected
to manifest itself on real servers. On the other hand, the workaround
causes compilation problems on various systems. Previously, we've
already fixed the code to compile with musl libc (fd6fd02f6a4d), and
now it is broken on Fedora 28 where glibc's crypt library was replaced
by libxcrypt. So the workaround was removed.
FreeBSD returns EINVAL when getsockopt(TCP_FASTOPEN) is called on a unix
domain socket, resulting in "getsockopt(TCP_FASTOPEN) ... failed" messages
during binary upgrade when unix domain listen sockets are present in
the configuration. Added EINVAL to the list of ignored error codes.
Previously, only unix domain sockets were reopened to tolerate cases when
local syslog server was restarted. It makes sense to treat other cases
(for example, local IP address changes) similarly.
Cast to intermediate "void *" to lose compiler knowledge about the original
type and suppress the warning. This is not a real fix but rather a workaround.
Found by gcc8.
In mail and stream modules, no certificate provided is a fatal condition,
much like with the "ssl" and "starttls" directives.
In http, "listen ... ssl" can be used in a non-default server without
certificates as long as there is a certificate in the default one, so
missing certificate is only fatal for default servers.
In 51e1f047d15d, the "ssl" directive name was incorrectly hardcoded
in the error message shown when there are some SSL keys defined, but
not for all certificates. Right approach is to use the "mode" variable,
which can be either "ssl" or "starttls".
Previously, result of ngx_atoi() was assigned to an ngx_uint_t variable,
and errors reported by ngx_atoi() became positive, so the following
"status < 100" check failed to catch them. This resulted in configurations
like "proxy_cache_valid 2xx 30s" being accepted as correct, while they
in fact do nothing. Changing type to ngx_int_t fixes this, and such
configurations are now properly rejected.
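The effect of the variable type can be reproduced in isolation. In the
sketch below demo_atoi() is a simplified stand-in for ngx_atoi() (it returns
-1 on error), and the two check functions are hypothetical; they differ only
in the type used to hold the result.

```c
#include <stddef.h>

/* Stand-in for ngx_atoi(): parses n decimal digits, returns -1 on error. */
long
demo_atoi(const char *s, size_t n)
{
    long  value = 0;

    if (n == 0) {
        return -1;
    }

    for ( /* void */ ; n--; s++) {
        if (*s < '0' || *s > '9') {
            return -1;
        }

        value = value * 10 + (*s - '0');
    }

    return value;
}

/* Old behaviour: -1 stored in an unsigned variable becomes a huge value. */
int
demo_check_unsigned(const char *s, size_t n)
{
    unsigned long  status = (unsigned long) demo_atoi(s, n);

    return (status < 100) ? -1 : 0;
}

/* New behaviour: the error stays negative and is caught by the check. */
int
demo_check_signed(const char *s, size_t n)
{
    long  status = demo_atoi(s, n);

    return (status < 100) ? -1 : 0;
}
```

Here demo_check_unsigned("2xx", 3) accepts the value, while
demo_check_signed("2xx", 3) rejects it.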
Previously, ngx_http_upstream_process_header() might be called after
we've finished reading response headers and switched to a different read
event handler, leading to errors with gRPC proxying. Additionally,
the u->conf->read_timeout timer might be re-armed during reading response
headers (while this is expected to be a single timeout on reading
the whole response header).
Previously, ngx_http_upstream_test_next() used an outdated condition on
whether it will be possible to switch to a different server or not. It
did not take into account restrictions on non-idempotent requests, requests
with non-buffered request body, and the next upstream timeout.
For such requests, switching to the next upstream server was rejected
later in ngx_http_upstream_next(), resulting in nginx's own error page
being returned instead of the original upstream response.
- use normal prefixes for types and macros
- removed some macros and types
- revised debug messages
- removed useless check of ngx_sock_ntop() returning 0
- removed special processing of AF_UNSPEC
The protocol used on inbound connection is auto-detected and corresponding
parser is used to extract passed addresses. TLV parameters are ignored.
The maximum supported size of PROXY protocol header is 107 bytes
(similar to version 1).
All cases are harmless and should not happen on valid values, though can
result in bad values being shown incorrectly in logs.
Found by Coverity (CID 1430311, 1430312, 1430313).
The fields "uri", "location", and "url" from ngx_http_upstream_conf_t
were moved to ngx_http_proxy_loc_conf_t and ngx_http_proxy_vars_t; reflect
this change in the create_loc_conf comments.
The gRPC protocol makes a distinction between HEADERS frame with
the END_STREAM flag set, and a HEADERS frame followed by an empty
DATA frame with the END_STREAM flag. The latter is not permitted,
and results in errors not being propagated through nginx. Instead,
gRPC clients complain that "server closed the stream without sending
trailers" (seen in grpc-go) or "13: Received RST_STREAM with error
code 2" (seen in grpc-c).
To fix this, nginx now returns HEADERS with the END_STREAM flag if
the response length is known to be 0, and we are not expecting
any trailer headers to be added. And the response length is
explicitly set to 0 in the gRPC proxy if we see initial HEADERS frame
with the END_STREAM flag set.
According to the gRPC protocol specification, the "TE" header is used
to detect incompatible proxies, and at least grpc-c server rejects
requests without "TE: trailers".
To preserve the logic, we have to pass "TE: trailers" to the backend if
and only if the original request contains "trailers" in the "TE" header.
Note that no other TE values are allowed in HTTP/2, so we have to remove
anything else.
The module allows passing requests to upstream gRPC servers.
The module is built by default as long as HTTP/2 support is compiled in.
Example configuration:
grpc_pass 127.0.0.1:9000;
Alternatively, the "grpc://" scheme can be used:
grpc_pass grpc://127.0.0.1:9000;
Keepalive support is available via the upstream keepalive module. Note
that keepalive connections won't currently work with grpc-go as it fails
to handle SETTINGS_HEADER_TABLE_SIZE.
To use with SSL:
grpc_pass grpcs://127.0.0.1:9000;
SSL connections use ALPN "h2" when available. At least grpc-go works fine
without ALPN, so if ALPN is not available we just establish a connection
without it.
Tested with grpc-c++ and grpc-go.
The flag can be used to continue sending request body even after we've
got a response from the backend. In particular, this is needed for gRPC
proxying of bidirectional streaming RPCs, and also to send control frames
in other forms of RPCs.
The flag indicates whether last ngx_output_chain() returned NGX_AGAIN
or not. If the flag is set, we arm the u->conf->send_timeout timer.
The flag complements the c->write->ready test, and makes it possible to stop
sending the request body in an output filter due to protocol-specific flow
control.
Basic trailer headers support allows one to access response trailers
via the $upstream_trailer_* variables.
Additionally, the u->conf->pass_trailers flag was introduced. When the
flag is set, trailer headers from the upstream response are passed to
the client. Like normal headers, trailer headers will be hidden
if present in u->conf->hide_headers_hash.
When clock_gettime(CLOCK_MONOTONIC) (or faster variants, _FAST on FreeBSD,
and _COARSE on Linux) is available, we now use it for ngx_current_msec.
This should improve handling of timers if system time changes (ticket #189).
The r->out chain link could be left uninitialized in case of error.
A segfault could happen if the subrequest handler accessed it.
The issue was introduced in commit 20f139e9ffa8.
Previously, only the upstream response body could be accessed with the
NGX_HTTP_SUBREQUEST_IN_MEMORY feature. Now any response body from a subrequest
can be saved in a memory buffer. It is available as a single buffer in r->out
and the buffer size is configured by the subrequest_output_buffer_size
directive.
Upstream, proxy and fastcgi code used to handle the old-style feature is
removed.
On some platforms (for example, Linux with glibc 2.12-2.25) IPv4 transparent
proxying is available, but IPv6 transparent proxying is not. The entire feature
is enabled in this case and NGX_HAVE_TRANSPARENT_PROXY macro is set to 1.
Previously, an attempt to enable transparency for an IPv6 socket was silently
ignored in this case and was usually followed by a bind(2) EADDRNOTAVAIL error
(ticket #1487). Now the error is generated for unavailable IPv6 transparent
proxy.
If memory allocation failed while parsing the geo directive, the pool
used to parse configuration inside the block, and sometimes the
temporary pool, were not destroyed.
There is no need to calculate hashes of static strings at runtime. The
ngx_hash() macro can be used to do it during compilation instead, similarly
to how it is done in ngx_http_proxy_module.c for "Server" and "Date" headers.
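A minimal sketch of the idea follows. It assumes the hash step has the shape "key * 31 + c" and that header names are hashed in lowercase; the SKETCH_* names are hypothetical and only mirror the pattern of nesting the macro over character constants.

```c
#include <stddef.h>

/* Assumed shape of one hash step: multiply by 31 and add the next byte. */
#define SKETCH_HASH(key, c)  ((unsigned long) (key) * 31 + (c))

/* Constant expression for "date"; a compiler can fold it at build time. */
#define SKETCH_HASH_DATE                                                     \
    SKETCH_HASH(SKETCH_HASH(SKETCH_HASH('d', 'a'), 't'), 'e')

/* Runtime equivalent, shown only to check that both forms agree. */
static unsigned long
sketch_hash_runtime(const char *s, size_t n)
{
    unsigned long  key = 0;
    size_t         i;

    for (i = 0; i < n; i++) {
        key = SKETCH_HASH(key, s[i]);
    }

    return key;
}
```

Since the nested macro expands to a constant expression, no per-request work is needed to obtain the hash of a fixed header name.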
In particular, if a stream object allocation failed, and a client sent
the PRIORITY frame for this stream, ngx_http_v2_set_dependency() could
dereference a null pointer while trying to re-parent a dependency node.
r->headers_in.host can be NULL in ngx_http_v2_push_resource().
This happens when a request is terminated with 400 before the :authority
or Host header is parsed, and either pushing is enabled on the server{}
level or error_page 400 redirects to a location with pushes configured.
Found by Coverity (CID 1429156).
Resources to be pushed are configured with the "http2_push" directive.
Also, preload links from the Link response headers, as described in
https://www.w3.org/TR/preload/#server-push-http-2, can be pushed, if
enabled with the "http2_push_preload" directive.
Only relative URIs with absolute paths can be pushed.
The number of concurrent pushes is normally limited by a client, but
cannot exceed a hard limit set by the "http2_max_concurrent_pushes"
directive.
Previously, when request body was not available or was previously read in
memory rather than a file, client received HTTP 500 error, but no explanation
was logged in error log. This could happen, for example, if request body was
read or discarded prior to error_page redirect, or if mirroring was enabled
along with dav.
This fixes segfault in configurations with multiple virtual servers sharing
the same port, where a non-default virtual server block misses certificate.
Following ad3f342f14ba046c (1.9.13), it is possible that a request where
header was already sent will be finalized with NGX_HTTP_BAD_GATEWAY,
triggering an attempt to return additional error response and the
"header already sent" alert as a result.
In particular, it is trivial to reproduce the problem with a HEAD request
and caching enabled. With caching enabled nginx will change HEAD to GET
and will set u->pipe->downstream_error to suppress sending the response
body to the client. When a backend-related error occurs (for example,
proxy_read_timeout expires), ngx_http_finalize_upstream_request() will
be called with NGX_HTTP_BAD_GATEWAY. After ad3f342f14ba046c this will
result in ngx_http_finalize_request(NGX_HTTP_BAD_GATEWAY).
Fix is to move u->pipe->downstream_error handling to a later point,
where all special response codes are changed to NGX_ERROR.
Reported by Jan Prachar,
http://mailman.nginx.org/pipermail/nginx-devel/2018-January/010737.html.
Specifically, it is now allowed to start with a variable expression with braces:
${name}. The opening curly bracket in such a token was previously considered
the start of a new block. Variables located anywhere else in a token worked
fine: foo${name}.
Previously, capset(2) was called with the 64-bit capabilities version
_LINUX_CAPABILITY_VERSION_3. With this version Linux kernel expected two
copies of struct __user_cap_data_struct, while only one was submitted. As a
result, random stack memory was accessed and random capabilities were requested
by the worker. This sometimes caused capset() errors. Now the 32-bit version
_LINUX_CAPABILITY_VERSION_1 is used instead. This is OK since CAP_NET_RAW is
a 32-bit capability (CAP_NET_RAW = 13).
The previously included file sys/capability.h, mentioned in the capset(2) man
page, belongs to the libcap-dev package, which may not be installed on some
Linux systems when compiling nginx. This prevented the capabilities feature
from being detected and compiled on those systems.
Now linux/capability.h system header is included instead. Since capset()
declaration is located in sys/capability.h, now capset() syscall is defined
explicitly in code using the SYS_capset constant, similarly to other
Linux-specific features in nginx.
The capability is retained automatically in unprivileged worker processes after
changing UID if transparent proxying is enabled at least once in nginx
configuration.
The feature is only available in Linux.
If the flag space_in_uri is set, the URI in HTTP upstream request is escaped to
convert space to %20. However this flag is not checked while creating the
default cache key. This leads to different cache keys for requests
'/foo bar' and '/foo%20bar', while the upstream requests are identical.
Additionally, the change fixes background cache updates when the client URI
contains unescaped space. Default cache key in a subrequest is always based on
escaped URI, while the main request may not escape it. As a result, background
cache update subrequest may update a different cache entry.
Inheriting this flag will make the cloned subrequest behave consistently with
the parent. Specifically, the upstream HTTP request and cache key created by
the proxy module may depend directly on unparsed_uri if valid_unparsed_uri flag
is set. Previously, the flag was zero for cloned requests, which could make
background update proxy a request different than its parent and cache the result
with a different key. For example, if client URI contained the escaped slash
character %2F, it was used as is by the proxy module in the main request, but
was unescaped in the subrequests.
Similar problems exist in the slice module.
Previously, the unparsed uri was explicitly allowed to be used only by the main
request. However the valid_unparsed_uri flag is nonzero only in the main
request, which makes the main request check pointless.
If the data to write is bigger than what the socket can send, and the
remainder is smaller than NGX_SSL_BUFSIZE, then SSL_write() fails with
SSL_ERROR_WANT_WRITE. The remainder of payload however is successfully
copied to the low-level buffer and all the output chain buffers are
flushed. This means that retry logic doesn't work because
ngx_http_upstream_process_non_buffered_request() checks only if there's
anything in the output chain buffers and ignores the fact that something
may be buffered in low-level parts of the stack.
Signed-off-by: Patryk Lesiewicz <patryk@google.com>
If a connection with the read delayed flag set was stored in the keepalive
cache, and after picking it from the cache a read timer was set on that
connection, this timer was considered a delay timer rather than a socket read
event timer as expected. The latter timeout is usually much longer than the
former, which caused a significant delay in request processing.
The issue manifested itself with proxy_limit_rate and upstream keepalive
enabled and exists since 973ee2276300 (1.7.7) when proxy_limit_rate was
introduced.
On some systems, it's possible that reaper of orphaned processes is
set to something other than "init" process. On such systems, the
changing binary procedure did not work.
The fix is to check if PPID has changed, instead of assuming it's
always 1 for orphaned processes.
The ngx_http_upstream_process_upgraded() did not handle c->close request,
and upgraded connections do not use the write filter. As a result,
worker_shutdown_timeout did not affect upgraded connections (ticket #1419).
Fix is to handle c->close in the ngx_http_request_handler() function, thus
covering most of the possible cases in http handling.
Additionally, mail proxying handled neither c->close nor c->error,
and thus worker_shutdown_timeout did not work for mail connections. Fix is
to add c->close handling to ngx_mail_proxy_handler().
Also, added explicit handling of c->close to stream proxy,
ngx_stream_proxy_process_connection(). This improves worker_shutdown_timeout
handling in stream, it will no longer wait for some data being transferred
in a connection before closing it, and will also provide appropriate
logging at the "info" level.
A zlib variant from Intel as available from https://github.com/jtkukunas/zlib
uses 64K hash instead of scaling it from the specified memory level, and
also uses 16-byte padding in one of the window-sized memory buffers, and can
force window bits to 13 if compression level is set to 1 and appropriate
compile options are used. As a result, nginx complained with "gzip filter
failed to use preallocated memory" alerts.
This change improves deflate_state allocation detection by testing that
items is 1 (deflate_state is the only allocation where items is 1).
Additionally, on first failure to use preallocated memory we now assume
that we are working with Intel's modified zlib, and switch to using
appropriate preallocations. If this does not help, we complain with the
usual alerts.
Previous version of this patch was published at
http://mailman.nginx.org/pipermail/nginx/2014-July/044568.html.
The zlib variant in question is used by default in ClearLinux from Intel,
see http://mailman.nginx.org/pipermail/nginx-ru/2017-October/060421.html,
http://mailman.nginx.org/pipermail/nginx-ru/2017-November/060544.html.
Previously, nginx failed to move buffer position when parsing an incomplete
record header, and due to this wasn't able to continue parsing once
remaining bytes of the record header were received.
This can affect response header parsing, potentially generating spurious errors
like "upstream sent unexpected FastCGI request id high byte: 1 while reading
response header from upstream". While this is very unlikely, since usually
record headers are written in a single buffer, this still can happen in real
life, for example, if a record header is split across two TCP packets
and the second packet is delayed.
This does not affect non-buffered response body proxying, due to "buf->pos =
buf->last;" at the start of the ngx_http_fastcgi_non_buffered_filter()
function. Also this does not affect buffered response body proxying, as
each input buffer is only passed to the filter once.
This is what usually happens for zones no longer used in the new
configuration, but zones where size or tag were changed were freed
when creating new memory zones. If reconfiguration failed (for
example, due to a conflicting listening socket), this resulted in a
segmentation fault in the master process.
Reported by Zhihua Cao,
http://mailman.nginx.org/pipermail/nginx-devel/2017-October/010536.html.
In particular, if ngx_http_postpone_filter_add() fails in ngx_chain_add_copy(),
the output chain of the postponed request was left in an invalid state.
This header carries the definition of HMAC_Init_ex(). In OpenSSL this
header is included by <openssl/ssl.h>, but it's not so in BoringSSL.
It's probably a good idea to explicitly include this header anyway,
regardless of whether it's included by other headers or not.
Upgrading an upstream connection is usually followed by reading from the client
which a subrequest is not allowed to do. Moreover, accessing the header_in
request field while processing upgraded connection ends up with a null pointer
dereference since the header_in buffer is only created for the main request.
If proxy_next_upstream includes http_503/http_504, and upstream
returns 503/504, $upstream_status converted this to 502 for any
values except the last one.
The NGX_DONE value returned from ngx_http_upstream_cache_send() indicates
that upstream was already finalized in ngx_http_upstream_process_headers().
It was treated as a generic error which resulted in duplicate finalization.
Handled NGX_HTTP_UPSTREAM_INVALID_HEADER from ngx_http_upstream_cache_send().
Previously, it could return within ngx_http_upstream_finalize_request(), and
since it's below NGX_HTTP_SPECIAL_RESPONSE, a client connection could get stuck.
When parsing of headers in a cache file fails, already parsed headers
need to be cleared, and protocol state needs to be reinitialized. To do
so, u->request_sent is now set to ensure ngx_http_upstream_reinit() will
be called.
This change complements improvements in 46ddff109e72.
This slightly reduces cost of selecting a peer if all or almost all peers
failed, see ticket #1030. There should be no measurable difference with
other workloads.
While this may result in non-ideal distribution of requests if nginx
won't be able to select a server in a reasonable number of attempts,
this still looks better than severe performance degradation observed
if there is no limit and there are many points configured (ticket #1030).
This is also in line with what we do for other hash balancing methods.
Previously, unix sockets were treated as AF_INET ones, and this may
result in buffer overread on Linux, where unbound unix sockets have
2-byte addresses.
Note that it is not correct to use just sun_path as a binary representation
for unix sockets. This will result in an empty string for unbound unix
sockets, and thus behaviour of limit_req and limit_conn will change when
switching from $remote_addr to $binary_remote_addr. As such, normal text
representation is used.
Reported by Stephan Dollberg.
At least FreeBSD, macOS, NetBSD, and OpenBSD can return unix sockets
with non-null-terminated sun_path. Additionally, the address may become
non-null-terminated if it does not fit into the buffer provided and was
truncated (may happen on macOS, NetBSD, and Solaris, which allow unix socket
addresses larger than struct sockaddr_un). As such, ngx_sock_ntop() might
overread the sockaddr provided, as it used "%s" format and thus assumed
null-terminated string.
To fix this, the ngx_strnlen() function was introduced, and it is now used
to calculate correct length of sun_path.
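A bounded length scan of this kind can be sketched as follows. The function name sketch_strnlen() is hypothetical and the code is an illustration of the technique rather than the nginx implementation.

```c
#include <stddef.h>

/* Bounded length scan: stops at the first NUL byte or after n bytes,
 * whichever comes first, so it never reads past the buffer. */
static size_t
sketch_strnlen(const unsigned char *p, size_t n)
{
    size_t  i;

    for (i = 0; i < n; i++) {
        if (p[i] == '\0') {
            return i;
        }
    }

    return n;
}
```

For a sun_path that fills its buffer without a terminating NUL, the scan returns the buffer size instead of running on into adjacent memory.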
Some OSes (notably macOS, NetBSD, and Solaris) allow unix socket addresses
larger than struct sockaddr_un. Moreover, some of them (macOS, Solaris)
return socklen of the socket address before it was truncated to fit the
buffer provided. As such, on these systems socklen must not be used without
additional check that it is within the buffer provided.
Appropriate checks added to ngx_event_accept() (after accept()),
ngx_event_recvmsg() (after recvmsg()), and ngx_set_inherited_sockets()
(after getsockname()).
We also obtain socket addresses via getsockname() in
ngx_connection_local_sockaddr(), but it does not need any checks as
it is only used for INET and INET6 sockets (as there can be no
wildcard unix sockets).
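The kind of check described above can be sketched as a simple clamp. The helper name is hypothetical; the actual checks in nginx are placed inline after each system call.

```c
#include <stddef.h>

/* Returns the number of address bytes that are actually valid. The kernel
 * may report the untruncated address length, which can exceed the size of
 * the buffer that was passed in. */
static size_t
sketch_valid_socklen(size_t reported, size_t buffer_size)
{
    return (reported > buffer_size) ? buffer_size : reported;
}
```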
The sync flag of HTTP/2 request body buffer is used when the size of request
body is unknown or bigger than configured "client_body_buffer_size". In this
case the buffer points to body data inside the global receive buffer that is
used for reading all HTTP/2 connections in the worker process. Thus, when the
sync flag is set, the buffer must be flushed to a temporary file, otherwise
the request body data can be overwritten.
Previously, the sync buffer wasn't flushed to a temporary file if the whole
body was received in one DATA frame with the END_STREAM flag and wasn't
copied into the HTTP/2 body preread buffer. As a result, the request body
might be corrupted (ticket #1384).
Now, setting r->request_body_in_file_only enforces writing the sync buffer
to a temporary file in all cases.
When caching intercepted errors, previous behaviour was to use
proxy_cache_valid times specified, regardless of various cache control
headers present in the response. Fix is to check u->cacheable and
use u->cache->valid_sec as set by various cache control response headers,
similar to how we do this in the normal caching code path.
If cache file is truncated, it is possible that u->process_header()
will return NGX_AGAIN. Added appropriate handling of this case by
changing the error to NGX_HTTP_UPSTREAM_INVALID_HEADER.
Also, added appropriate logging of this and NGX_HTTP_UPSTREAM_INVALID_HEADER
cases at the "crit" level. Note that this will result in duplicate logging
in case of NGX_HTTP_UPSTREAM_INVALID_HEADER. While this is something better
to avoid, it is considered overkill to implement cache-specific
error logging in u->process_header().
Additionally, u->buffer.start is now reset to be able to receive a new
response, and u->cache_status set to MISS to provide the value in the
$upstream_cache_status variable, much like it happens on other cache file
errors detected by ngx_http_file_cache_read(), instead of HIT, which is
believed to be misleading.
It is to be used as a bitmask with various bits set/reset when appropriate.
63b8b157b776 made a similar change to ngx_http_upstream_rr_peer_t.down and
ngx_stream_upstream_rr_peer_t.down.
Previously, "get indexed header" message was logged when in fact only
header name was obtained using an index, and "get indexed header name"
was logged when full header representation (name and value) was obtained
using an index. Fixed version logs "get indexed name" and "get indexed
header" respectively.
Previously, when the first UDP response packet was not received from the
proxied server within proxy_timeout, no error message was logged before
switching to the next upstream. Additionally, when one of succeeding response
packets was not received within the timeout, the timeout error had low severity
because it was logged as a client connection error as opposed to upstream
connection error.
Various buffers are allocated on the assumption that there would be
no more than 4 year digits. This might not be true on platforms
with 64-bit time_t, as 64-bit time_t is able to represent more than that.
Such dates with more than 4 year digits hardly make sense though, as
various date formats in use do not allow them anyway.
As such, all dates are now truncated by ngx_gmtime() to December 31, 9999.
This should have no effect on valid dates, though will prevent potential
buffer overflows on invalid ones.
In ngx_gmtime(), instead of casting to ngx_uint_t we now work with
time_t directly. This allows using dates after 2038 on 32-bit platforms
which use 64-bit time_t, notably NetBSD and OpenBSD.
As the code is not able to work with negative time_t values, argument
is now set to 0 for negative values. As a positive side effect, this
results in Epoch being used for such values instead of a date in distant
future.
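Both adjustments, truncation at the end of year 9999 and replacing negative values with 0, amount to clamping the input. A minimal sketch, using int64_t as a stand-in for a 64-bit time_t and a hypothetical function name:

```c
#include <stdint.h>

/* 9999-12-31 23:59:59 UTC expressed as seconds since the Epoch. */
#define SKETCH_MAX_TIME  INT64_C(253402300799)

/* Clamps a timestamp to the range [Epoch, end of year 9999]. */
static int64_t
sketch_clamp_time(int64_t t)
{
    if (t < 0) {
        return 0;
    }

    if (t > SKETCH_MAX_TIME) {
        return SKETCH_MAX_TIME;
    }

    return t;
}
```

Any value inside the range passes through unchanged, so valid dates are not affected.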
This change lets NGINX talk to clients with SETTINGS_HEADER_TABLE_SIZE
smaller than the default 4KB. Previously, NGINX would ACK the SETTINGS
frame with a small dynamic table size, but it would never send dynamic
table size update, leading to a connection-level COMPRESSION_ERROR.
Also, it allows clients to release 4KB of memory per connection, since
NGINX doesn't use HPACK's dynamic table when encoding headers, however
clients had to maintain it, since NGINX never signaled that it doesn't
use it.
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
When switching to a next upstream, some buffers could be stuck in the middle
of the filter chain. A condition existed that raised an error when this
happened. As it turned out, this condition prevented switching to a next
upstream if ssl preread was used with the TCP protocol (see the ticket).
In fact, the condition does not make sense for TCP, since after successful
connection to an upstream switching to another upstream never happens. As for
UDP, the issue with stuck buffers is unlikely to happen, but is still possible,
specifically if a filter delays sending data to upstream.
The condition can be relaxed to only check the "buffered" bitmask of the
upstream connection. The new condition is simpler and fixes the ticket issue
as well. Additionally, the upstream_out chain is now reset for UDP prior to
connecting to a new upstream to prevent repeating the client data twice.
When secure link checksum has length of 23 or 24 bytes, decoded base64 value
could occupy 17 or 18 bytes which is more than 16 bytes previously allocated
for it on stack. The buffer overflow does not have any security implications
since only one local variable was corrupted and this variable was not used in
this case.
The fix is to increase buffer size up to 18 bytes. Useless buffer size
initialization is removed as well.
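The arithmetic behind these sizes can be sketched as follows. The function is hypothetical and computes the decoded size of base64 input under the assumption that it carries no "=" padding characters.

```c
#include <stddef.h>

/* Decoded size of base64 input of the given length, assuming the input
 * contains no "=" padding: every 4 characters yield 3 bytes, and a tail
 * of 2 or 3 characters yields 1 or 2 bytes respectively. */
static size_t
sketch_base64_decoded_size(size_t len)
{
    size_t  full = len / 4;
    size_t  rest = len % 4;
    size_t  extra = (rest == 2) ? 1 : (rest == 3) ? 2 : 0;

    return full * 3 + extra;
}
```

A 22-character value decodes to 16 bytes, which is the size of an MD5 digest, while 23 and 24 characters decode to 17 and 18 bytes.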
This fixes at least the following cases, where no last_modified_time
(assuming caching is not enabled) resulted in incorrect behaviour:
- slice filter and If-Range requests (ticket #1357);
- If-Range requests with proxy_force_ranges;
- expires modified.
The $ssl_server_name variable used SSL_get_servername() result directly,
but this is not safe: it references a memory allocation in an SSL
session, and this memory might be freed at any time due to renegotiation.
Instead, copy the name to memory allocated from the pool.
This variable contains URL-encoded client SSL certificate. In contrast
to $ssl_client_cert, it doesn't depend on deprecated header continuation.
The NGX_ESCAPE_URI_COMPONENT variant of encoding is used, so the resulting
variable can be safely used not only in headers, but also as a request
argument.
The $ssl_client_cert variable should be considered deprecated now.
The $ssl_client_raw_cert variable will be eventually renamed back
to $ssl_client_cert.
Total length of a response with multiple ranges can be larger than a size_t
variable can hold, so type changed to off_t. Previously, an incorrect
Content-Length was returned when requesting more than 4G of ranges from
a large enough file on a 32-bit system.
An additional size_t variable introduced to calculate size of the boundary
header buffer, as off_t is not needed here and will require type casts on
win32.
Reported by Shuxin Yang,
http://mailman.nginx.org/pipermail/nginx/2017-July/054384.html.
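The wraparound can be reproduced with a small sketch in which uint32_t stands in for size_t on a 32-bit system and int64_t stands in for off_t. The function names are hypothetical.

```c
#include <stdint.h>

/* Sums range lengths in a 32-bit unsigned accumulator, as a size_t would
 * behave on a 32-bit system: the result wraps modulo 2^32. */
static uint32_t
sketch_total_32(const int64_t *lens, int n)
{
    uint32_t  total = 0;
    int       i;

    for (i = 0; i < n; i++) {
        total += (uint32_t) lens[i];
    }

    return total;
}

/* Sums the same lengths in a 64-bit signed accumulator, as an off_t
 * would behave with large file support. */
static int64_t
sketch_total_64(const int64_t *lens, int n)
{
    int64_t  total = 0;
    int      i;

    for (i = 0; i < n; i++) {
        total += lens[i];
    }

    return total;
}
```

Three ranges of 2000000000 bytes each add up to 6000000000 bytes, which does not fit in 32 bits, so the narrow accumulator yields a much smaller length.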
The "fd" field should be after 3 pointers for ngx_event_ident() to use it.
This was broken by ccad84a174e0. While it does not seem to be currently used
for aio-related events, it should be a good idea to preserve the correct
layout nevertheless.
Pass NGX_FILE_OPEN to ngx_open_file() to fix "The parameter is incorrect"
error on win32 when using the ssl_session_ticket_key directive or loading
a binary geo base. On UNIX, this change is a no-op.
On Windows, a worker process does not call ngx_slab_init() from
ngx_init_zone_pool(), so ngx_slab_max_size, ngx_slab_exact_size,
and ngx_slab_exact_shift were left uninitialized.
The variable was considered non-existent in the absence of any
valid_referers directives.
Given the following config snippet,
location / {
    return 200 $invalid_referer;
}
location /referer {
    valid_referers server_names;
}
"location /" should work identically and independently of the other
"location /referer".
The fix is to always add the $invalid_referer variable as long
as the module is compiled in, as is done by other modules.
The shared objects should generally be allocated from shared memory.
While peers->name and the data it points to allocated from cf->pool
happened to work on UNIX, it broke on Windows. On UNIX this worked
only because the shared memory zone for upstreams is re-created for
every new configuration.
But on Windows, a worker process does not inherit the address space
of the master process, so the peers->name pointed to data allocated
from cf->pool by the master process, and was invalid.
The phase is added instead of the try_files phase. Unlike the old phase, the
new one supports registering multiple handlers. The try_files implementation is
moved to a separate ngx_http_try_files_module, which now registers a precontent
phase handler.
The new request flag "preserve_body" indicates that the request body file should
not be removed by the upstream module because it may be used later by a
subrequest. The flag is set by the SSI (ticket #585), addition and slice
modules. Additionally, it is also set by the upstream module when a background
cache update subrequest is started to prevent the request body file removal
after an internal redirect. Only the main request is now allowed to remove the
file.
When closing a socket with SO_REUSEPORT, Linux drops all connections waiting
in this socket's listen queue. Previously, it was believed to only result
in connection resets when reconfiguring nginx to use smaller number of worker
processes. It also results in connection resets during configuration
testing though.
Workaround is to avoid using SO_REUSEPORT when testing configuration. It
should prevent listening sockets from being created if a conflicting socket
already exists, while still preserving detection of other possible errors.
It should also cover UDP sockets.
The only downside of this approach seems to be that configuration testing
won't be able to properly report the case when nginx was compiled with
SO_REUSEPORT, but the kernel is not able to set it. Such errors will be
reported on a real start instead.
Suffix ranges are no longer allowed to set negative start values, to prevent
ranges with negative start from appearing even if total size protection
is removed.
The overflow can be used to circumvent the restriction on total size of
ranges introduced in c2a91088b0c0 (1.1.2). Additionally, overflow
allows producing ranges with negative start (such ranges can be created
by using a suffix, "bytes=-100"; normally this results in 200 due to
the total size check). These can result in the following errors in logs:
[crit] ... pread() ... failed (22: Invalid argument)
[alert] ... sendfile() failed (22: Invalid argument)
When using cache, it can be also used to reveal cache file header.
It is believed that there are no other negative effects, at least with
standard nginx modules.
In theory, this can also result in memory disclosure and/or segmentation
faults if multiple ranges are allowed, and the response is returned in a
single in-memory buffer. This never happens with standard nginx modules
though, as well as known 3rd party modules.
Fix is to properly protect from possible overflow when incrementing size.
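One common way to guard such an increment is to compare against the remaining headroom before adding. The sketch below uses int64_t as a stand-in for off_t and a hypothetical helper name; it illustrates the technique and is not the code from the range filter.

```c
#include <stdint.h>

/* Adds len to *size unless the sum would exceed INT64_MAX.
 * Returns 0 on success and -1 if the addition would overflow.
 * Both values are assumed to be non-negative. */
static int
sketch_add_size(int64_t *size, int64_t len)
{
    if (len > INT64_MAX - *size) {
        return -1;
    }

    *size += len;

    return 0;
}
```

The comparison is done by subtraction on the right-hand side, so the check itself cannot overflow.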
It is safe because re-sending still works during graceful shutdown as
long as resolving takes place (and resolve tasks set their own timeouts
that are not cancelable).
Also, the new ctx->cancelable flag can be set to make resolve task's
timeout event cancelable.
Notably, on ppc64 with 64k pagesize, slab 0 (of size 8) requires
128 64-bit elements for bitmasks. The code bogusly assumed that
one uintptr_t is enough for bitmasks plus at least one free slot.
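The figure of 128 elements follows from simple arithmetic: one bit per slot, 64 bits per element. A minimal sketch with a hypothetical function name:

```c
#include <stddef.h>

/* Number of 64-bit words needed for a bitmask with one bit per slot
 * in a page of the given size. */
static size_t
sketch_bitmask_words(size_t page_size, size_t slot_size)
{
    size_t  slots = page_size / slot_size;

    return (slots + 63) / 64;
}
```

A 64k page with 8-byte slots holds 8192 slots and needs 128 words, while a 4k page with the same slot size needs only 8.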
Resolving an SRV record includes resolving its host names in subrequests.
Previously, if memory allocation failed while reporting a subrequest result
after receiving a response from a DNS server, the SRV resolve handler was
called immediately with the NGX_ERROR state. However, if the SRV record
included another copy of the resolved name, it was reported once again.
This could trigger the use-after-free memory access after SRV resolve
handler freed the resolve context by calling ngx_resolve_name_done().
Now the SRV resolve handler is called only when all its subrequests are
completed.
Previously, each configured header was represented in one of two ways,
depending on whether or not its value included any variables.
If the value didn't include any variables, then it would be represented
as a single script that contained complete header line with HTTP/1.1
delimiters, i.e.:
"Header: value\r\n"
But if the value included any variables, then it would be represented
as a series of three scripts: first contained header name and the ": "
delimiter, second evaluated to header value, and third contained only
"\r\n", i.e.:
"Header: "
"$value"
"\r\n"
This commit changes that, so that each configured header is represented
as a series of two scripts: first contains only header name, and second
contains (or evaluates to) only header value, i.e.:
"Header"
"$value"
or
"Header"
"value"
This not only makes things more consistent, but also allows header name
and value to be accessed separately.
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
As per RFC 2616 / RFC 7233, any range request to an empty file
is expected to result in 416 Range Not Satisfiable response, as
there cannot be a "byte-range-spec whose first-byte-pos is less
than the current length of the entity-body". On the other hand,
this makes use of byte-range requests inconvenient in some cases,
as reported for the slice module here:
http://mailman.nginx.org/pipermail/nginx-devel/2017-June/010177.html
This commit changes range filter to instead return 200 if the file
is empty and the range requested starts at 0.
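A simplified sketch of the resulting decision is shown below. It only models
a single range of the "first-byte-pos-" form, and the function name
range_status() is hypothetical; the real range filter handles more cases.

```c
#include <stdint.h>

/*
 * Status code for a request with a single "bytes=first-" range against
 * an entity of content_length bytes (simplified model).
 */
static int
range_status(int64_t content_length, int64_t first_byte_pos)
{
    if (first_byte_pos < content_length) {
        return 206;     /* satisfiable range */
    }

    if (content_length == 0 && first_byte_pos == 0) {
        return 200;     /* empty file, range starts at 0 */
    }

    return 416;         /* Range Not Satisfiable */
}
```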
This change reworks 13a5f4765887 to only run posted requests once,
with nothing on stack. Running posted requests with other request
functions on stack may result in use-after-free in case of errors,
similar to the one reported in #788.
To only run posted request once, a separate function was introduced
to be used as ssl handshake handler in c->ssl->handler,
ngx_http_upstream_ssl_handshake_handler(). The ngx_http_run_posted_requests()
is only called in this function, and not in ngx_http_upstream_ssl_handshake()
which may be called directly on stack.
Additionally, ngx_http_upstream_ssl_handshake_handler() now does appropriate
debug logging of the current subrequest, similar to what is done in other
event handlers.
Previously, the upstream resolve handler always called
ngx_http_run_posted_requests() to run posted requests after processing the
resolver response. However, if the handler was called directly from the
ngx_resolve_name() function (for example, if the resolver response was cached),
running posted requests from the handler could lead to the following errors:
- If the request was scheduled for termination, it could actually be terminated
in the resolve handler. Upper stack frames could reference the freed request
object in this case.
- If a significant number of requests were posted, and for each of them the
resolve handler was called directly from the ngx_resolve_name() function,
posted requests could be run recursively and lead to stack overflow.
Now ngx_http_run_posted_requests() is only called from asynchronously invoked
resolve handlers.
Trailers added using this directive are evaluated after response body
is processed by output filters (but before it's written to the wire),
so it's possible to use variables calculated from the response body
as the trailer value.
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Example:
    ngx_table_elt_t  *h;

    h = ngx_list_push(&r->headers_out.trailers);
    if (h == NULL) {
        return NGX_ERROR;
    }

    ngx_str_set(&h->key, "Fun");
    ngx_str_set(&h->value, "with trailers");
    h->hash = ngx_hash_key_lc(h->key.data, h->key.len);
The code above adds "Fun: with trailers" trailer to the response.
Modules that want to emit trailers must set r->expect_trailers = 1
in header filter, otherwise they might not be emitted for HTTP/1.1
responses that aren't already chunked.
This change also adds $sent_trailer_* variables.
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
The current style in variable handlers returning NGX_OK is to either set
v->not_found to 1, or to initialize the entire ngx_http_variable_value_t
structure.
In theory, always setting v->valid = 1 for NGX_OK would be useful, which
would mean that the value was computed and is thus valid, including the
special case of v->not_found = 1. But currently that's not the case and
causes the (v->valid || v->not_found) check to access an uninitialized
v->valid value, which is safe only because its value doesn't matter when
v->not_found is set.
When evaluating a mapped $reset_uid variable in the userid filter,
if get_handler set to ngx_http_map_variable() returned an error,
this previously resulted in a NULL pointer dereference.
If memory allocation of a new r->uri.data storage failed, reset its length as
well. Request URI is used in ngx_http_finalize_request() for debug logging.
Previously, when using NGX_HTTP_SSI_ERROR, error was ignored in ssi processing,
thus timefmt could be accessed later in ngx_http_ssi_date_gmt_local_variable()
as part of "set" handler, or NULL format pointer could be passed to strftime().
Previously, SETTINGS ACK was sent immediately upon receipt of SETTINGS
frame, before already queued DATA frames created using old SETTINGS.
This incorrect behavior was source of interoperability issues, because
peers rely on the fact that new SETTINGS are in effect after receiving
SETTINGS ACK.
Reported by Feng Li.
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Previously, new frames could be emitted in the middle of applying
new (and already acknowledged) SETTINGS params, which is illegal.
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
If the main request was finalized while a background request performed an
asynchronous operation, the main request ended up in ngx_http_writer() and was
not finalized until a network event or a timeout. For example, cache
background update with aio enabled made nginx unable to process further client
requests or close the connection, keeping it open until client closes it.
Now regular finalization of the main request is not suspended because of an
asynchronous operation in another request.
If a background request was terminated while an asynchronous operation was in
progress, background request's write event handler was changed to
ngx_http_request_finalizer() and never called again.
Now, whenever a request is terminated while an asynchronous operation is in
progress, connection error flag is set to make further finalizations of any
request with this connection lead to termination.
These issues appeared in 1aeaae6e9446 (not yet released).
In http these checks were changed in a6d6d762c554, though mail module
was missed at that time. Since then, the stream module was introduced
based on mail, using "== NGX_ERROR" check.
With OpenSSL 1.1.0+, the workaround for handshake buffer size as introduced
in a720f0b0e083 (ticket #413) no longer works, as OpenSSL no longer exposes
handshake buffers, see https://github.com/openssl/openssl/commit/2e7dc7cd688.
Moreover, it is no longer possible to adjust handshake buffers at all now.
To avoid additional RTT if handshake uses more than 4k we now set TCP_NODELAY
on SSL connections before handshake. While this still results in sub-optimal
network utilization due to incomplete packets being sent, it seems to be
better than nothing.
Previously, cache background update might not work as expected, making client
wait for it to complete before receiving the final part of a stale response.
This could happen if the response could not be sent to the client socket in one
filter chain call.
Now background cache update is done in a background subrequest. This type of
subrequest does not block any other subrequests or the main request.
Previously, the read event of the accepted connection was marked ready, but not
available. This made EPOLLRDHUP-related code (for example, in ngx_unix_recv())
expect more data from the socket, leading to unexpected behavior.
For example, if SSL, PROXY protocol and deferred accept were enabled on a listen
socket, the client connection was aborted due to unexpected return value of
c->recv().
If allocation of cleanup handler in the HTTP/2 header filter failed, then
a stream might be freed with a HEADERS frame left in the output queue.
Now the HEADERS frame is accounted in the queue before trying to allocate
the cleanup handler.
Abnormally exited workers may leave locked cache entries; this can
result in the cache size on disk exceeding max_size and shared memory
exhaustion.
This change mitigates the issue by ignoring locked entries during forced
expire. It also increases the visibility of the problem by logging such
entries.
Previously, an allocation error resulted in uninitialized memory access
when evaluating $upstream_http_ variables.
On a related note, see r->headers_out.headers cleanup work in 0cdee26605f3.
In ac9b1df5b246 (1.13.0) we attempted to allow renegotiation in client mode,
but when using OpenSSL 1.0.2 or older versions it was additionally disabled
by SSL3_FLAGS_NO_RENEGOTIATE_CIPHERS.
If initialization of a header failed for some reason after ngx_list_push(),
leaving the header as is can result in uninitialized memory access by
the header filter or the log module. The fix is to clear partially
initialized headers in case of errors.
For the Cache-Control header, the fix is to postpone pushing
r->headers_out.cache_control until its value is completed.
Previously, ngx_http_sub_header_filter() could fail with a partially
initialized context, later accessed in ngx_http_sub_body_filter()
if called from the perl content handler.
The issue had appeared in 2c045e5b8291 (1.9.4).
A better fix would be to handle ngx_http_send_header() errors in
the perl module, though this doesn't seem to be easy enough.
The SSL_CTRL_SET_CURVES_LIST macro is removed in the OpenSSL master branch.
SSL_CTX_set1_curves_list is preserved for compatibility with previous versions.
CVE-2009-3555 is no longer relevant and mitigated by the renegotiation
info extension (secure renegotiation). On the other hand, unexpected
renegotiation still introduces potential security risks, and hence we do
not allow renegotiation on the server side, as we never request renegotiation.
On the client side the situation is different though. There are backends
which explicitly request renegotiation, and disabled renegotiation
introduces interoperability problems. This change allows renegotiation
on the client side, and fixes interoperability problems as observed with
such backends (ticket #872).
Additionally, with TLSv1.3 the SSL_CB_HANDSHAKE_START flag is currently set
by OpenSSL when receiving a NewSessionTicket message, and was detected by
nginx as a renegotiation attempt. This looks like a bug in OpenSSL, though
this change also allows better interoperability till the problem is fixed.
Previously, the source IP address of a response UDP datagram could differ from
the original datagram destination address. This could happen if the server UDP
socket is bound to a wildcard address and the network interface chosen to output
the response packet has a different default address than the destination address
of the original packet. For example, if two addresses from the same network are
configured on an interface.
Now source address is set explicitly if a response is sent for a server UDP
socket bound to a wildcard address.
This change adds "http_429" parameter to "proxy_next_upstream" for
retrying rate-limited requests, and to "proxy_cache_use_stale" for
serving stale cached responses after being rate-limited.
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
This change adds reason phrase in status line and pretty response body
when "429" status code is used in "return", "limit_conn_status" and/or
"limit_req_status" directives.
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
When a slice subrequest was redirected to a new location, its context was lost.
After its completion, a new slice subrequest for the same slice was created.
This could lead to infinite loop. Now the slice module makes sure each slice
subrequest starts output with the slice context available.
With post_action or subrequests, it is possible that the timer set for
wev->delayed will expire while the active subrequest write event handler
is not ready to handle this. This results in request hangs as observed
with limit_rate / sendfile_max_chunk and post_action (ticket #776) or
subrequests (ticket #1228).
Moving the handling to the connection event handler fixes the hangs observed,
and also slightly simplifies the code.
Since limit_req uses connection's write event to delay request processing,
it can conflict with timers in other subrequests. In particular, even
if applied to an active subrequest, it can break things if wev->delayed
is already set (due to limit_rate or sendfile_max_chunk), since after
limit_req finishes the wev->delayed flag will be set and no timer will be
active.
Fix is to use the wev->delayed flag in limit_req as well. This ensures that
wev->delayed won't be set after limit_req finishes, and also ensures that
limit_req's timers will be properly handled by other subrequests if the one
delayed by limit_req is not active.
All streams in connection must be finalized before the connection
itself can be finalized and all related memory is freed. That's
not always possible on the current event loop iteration.
Thus when the last stream is finalized, it sets the special read
event handler ngx_http_v2_handle_connection_handler() and posts
the event.
Previously, this handler didn't check the connection state and
could call the regular event handler on a connection that was
already in finalization stage. In the worst case that could
lead to a segmentation fault, since some data structures aren't
supposed to be used during connection finalization. Particularly,
the waiting queue can contain already freed streams, so the
WINDOW_UPDATE frame received at that moment could trigger
access to these freed streams.
Now, the connection error flag is explicitly checked in
ngx_http_v2_handle_connection_handler().
In order to finalize stream the error flag is set on fake connection and
either "write" or "read" event handler is called. The read events of fake
connections are always ready, but it's not the case with the write events.
When the ready flag isn't set, the error flag might not be checked in some
cases and as a result the stream isn't finalized. Now the ready flag is
explicitly set on write events for proper finalization in all cases.
Previously, flow control didn't account for padding in DATA frames,
which meant that its view of the world could drift from peer's view
by up to 256 bytes per received padded DATA frame, which could lead
to a deadlock.
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
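The drift mentioned above follows from the fact that the whole DATA frame
payload counts towards flow control, including the Pad Length octet and the
padding itself. A minimal sketch, with hypothetical function names, of how
much a padded frame adds on top of the application data:

```c
#include <stddef.h>

/*
 * Octets of a DATA frame payload that are not application data:
 * one Pad Length octet plus pad_length octets of padding (0..255),
 * present only when the PADDED flag is set.
 */
static size_t
data_frame_padding_overhead(int padded, unsigned char pad_length)
{
    return padded ? (size_t) 1 + pad_length : 0;
}

/* Octets to subtract from the receive window for one DATA frame. */
static size_t
data_frame_flow_controlled(size_t data_octets, int padded,
    unsigned char pad_length)
{
    return data_octets + data_frame_padding_overhead(padded, pad_length);
}
```

With the maximum pad length of 255 the overhead is 256 octets, which matches
the per-frame drift described above.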
Previously, its value accounted for payloads of HEADERS, CONTINUATION
and DATA frames, as well as frame headers of HEADERS and DATA frames,
but it didn't account for frame headers of CONTINUATION frames.
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Previously, connection write handler was called, resulting in wake up
of the active subrequest. This change makes it possible to read data
in non-active subrequests as well. For example, this allows SSI to
process instructions in non-active subrequests earlier and start
additional subrequests if needed, reducing overall response time.
If the subrequest is already finalized, the handler set with aio_write
may still be used by sendfile in threads when using range requests
(see also e4c1f5b32868, and the original note in 9fd738b85fad). Calling
already finalized subrequest's r->write_event_handler in practice
results in request hang in some cases.
Fix is to trigger connection event handler if the subrequest was already
finalized.
The ngx_linux_sendfile() function is now used for both normal sendfile()
and sendfile in threads. The ngx_linux_sendfile_thread() function was
modified to use the same interface as ngx_linux_sendfile(), and is simply
called from ngx_linux_sendfile() when threads are enabled.
Special return code NGX_DONE is used to indicate that a thread task was
posted and no further actions are needed.
If the number of bytes sent is less than what we were sending, we now always
retry sending. This is needed for sendfile() in threads as the number
of bytes we are sending might have been changed since the thread task
was posted. And this is also needed for Linux 4.3+, as sendfile() might
be interrupted at any time and provides no indication if it was interrupted
or not (ticket #1174).
If of.err is 0, it means that there was a memory allocation error
and no further logging and/or processing is needed. The of.failed
string can be only accessed if of.err is not 0.
The directive configures a timeout to be used when gracefully shutting down
worker processes. When the timer expires, nginx will try to close all
the connections currently open to facilitate shutdown.
There is no need to cancel timers early if there are other timers blocking
shutdown anyway. Preserving such timers allows nginx to continue some
periodic work till the shutdown is actually possible.
With the new approach, timers with ev->cancelable are simply ignored when
checking if there are any timers left during shutdown.
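The check described above can be sketched as follows. The sketch_timer_t type
and the array-based iteration are assumptions made to keep the example
self-contained; nginx keeps its timers in a tree, and only the cancelable
flag corresponds to ev->cancelable.

```c
#include <stddef.h>

/* Hypothetical stand-in for a timer event. */
typedef struct {
    unsigned  cancelable:1;
} sketch_timer_t;

/*
 * Returns 1 if at least one timer still blocks shutdown, that is,
 * it is not marked as cancelable; returns 0 otherwise.
 */
static int
timers_block_shutdown(const sketch_timer_t *timers, size_t n)
{
    size_t  i;

    for (i = 0; i < n; i++) {
        if (!timers[i].cancelable) {
            return 1;
        }
    }

    return 0;
}
```

Cancelable timers are skipped rather than removed, so they keep firing for
as long as some other timer blocks shutdown.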
The ev->timedout flag is set on first timer expiration, and never reset
after it. Due to this the code to stop the timer when the timer was
canceled never worked (except in a very specific time frame immediately
after start), and the timer was always armed again. This essentially
resulted in a buffer flush at the end of an event loop iteration.
This behaviour actually seems to be better than just stopping the flush
timer for the whole shutdown, so it is preserved as is instead of fixing
the code to actually remove the timer. It will be further improved by
upcoming changes to preserve cancelable timers if there are other timers
blocking shutdown.
Most notably, this fixes possible buffer overflows if number of large
client header buffers in a virtual server is different from the one in
the default server.
Reported by Daniil Bondarev.
Cloned subrequests should inherit r->content_handler. This way they will
be able to use the same location configuration as the original request
if there are "if" directives in the configuration.
Without r->content_handler inherited, the following configuration tries
to access a static file in the update request:
    location / {
        set $true 1;
        if ($true) {
            # nothing
        }
        proxy_pass http://backend;
        proxy_cache one;
        proxy_cache_use_stale updating;
        proxy_cache_background_update on;
    }
See http://mailman.nginx.org/pipermail/nginx/2017-February/053019.html for
initial report.
With "proxy_ignore_client_abort off" (the default), upstream module changes
r->read_event_handler to ngx_http_upstream_rd_check_broken_connection().
If the handler is not cleared during upstream finalization, it can be
triggered later, causing unexpected effects, if, for example, a request
was redirected to a different location using error_page or X-Accel-Redirect.
In particular, it makes "proxy_ignore_client_abort on" non-working after
a redirection in a configuration like this:
    location = / {
        error_page 502 = /error;
        proxy_pass http://127.0.0.1:8082;
    }
    location /error {
        proxy_pass http://127.0.0.1:8083;
        proxy_ignore_client_abort on;
    }
It is also known to cause segmentation faults with aio used, see
http://mailman.nginx.org/pipermail/nginx-ru/2015-August/056570.html.
Fix is to explicitly set r->read_event_handler to ngx_http_block_reading()
during upstream finalization, similar to how it is done in the request body
reading code and in the limit_req module.
This allows storing larger ETag values for proxy_cache_revalidate,
including ones generated as SHA256, and cache responses with longer
Vary (ticket #826).
In particular, this fixes caching of Amazon S3 responses with CORS
enabled, which now use "Vary: Origin, Access-Control-Request-Headers,
Access-Control-Request-Method".
Cache version bumped accordingly.
Previously, slice subrequest location was selected based on request URI.
If the request was then redirected to a new location, its context array was
cleared, making the slice module lose current slice range information. This led to
broken output. Now subrequests with the NGX_HTTP_SUBREQUEST_CLONE flag are
created for slices. Such subrequests stay in the same location as the parent
request and keep the right slice context.
Previously, there was no way to enable the proxy_cache_use_stale behavior by
reading the backend response. Now, stale-while-revalidate and stale-if-error
Cache-Control extensions (RFC 5861) are supported. They specify how long a
stale response can be used when a cache entry is being updated, or in case of
an error.
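As an illustration of the extensions above, the following sketch extracts the
delta-seconds value of one extension from a Cache-Control value. The function
name cache_control_seconds() is hypothetical, and the plain substring search
is a simplification that does not tokenize the header the way a real parser
would.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Returns the number of seconds given for "name" (which must include the
 * trailing '=', e.g. "stale-if-error=") in a Cache-Control value, or -1
 * if the extension is absent or is not followed by a digit.
 */
static long
cache_control_seconds(const char *value, const char *name)
{
    const char  *p;

    p = strstr(value, name);
    if (p == NULL) {
        return -1;
    }

    p += strlen(name);

    if (*p < '0' || *p > '9') {
        return -1;
    }

    return strtol(p, NULL, 10);
}
```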
The function may leave error in the error queue while returning success,
e.g., when taking a DSO reference to itself as of OpenSSL 1.1.0d:
https://git.openssl.org/?p=openssl.git;a=commit;h=4af9f7f
Notably, this fixes alert seen with statically linked OpenSSL on some platforms.
While here, check OPENSSL_init_ssl() return value.
Previously, buffer size was not changed from the one saved during
initial ngx_ssl_create_connection(), even if the buffer itself was not
yet created. Fix is to change c->ssl->buffer_size in the SNI callback.
Note that it should be also possible to update buffer size even in non-SNI
virtual hosts as long as the buffer is not yet allocated. This looks
like an overcomplication though.
The ngx_event_pipe() function wasn't called on write events with
wev->delayed set. As a result, threaded writing results weren't
properly collected in ngx_event_pipe_write_to_downstream() when a
write event was triggered for a completed write.
Further, this wasn't detected, as p->aio was reset by a thread completion
handler, and results were later collected in ngx_event_pipe_read_upstream()
instead of scheduling a new write of additional data. If this happened
on the last reading from an upstream, last part of the response was never
written to the cache file.
Similar problems might also happen in case of timeouts when writing to
client, as this also results in ngx_event_pipe() not being called on write
events. In this scenario socket leaks were observed.
Fix is to check if p->writing is set in ngx_event_pipe_read_upstream(), and
therefore collect results of previous write operations in case of read events
as well, similar to how we do so in ngx_event_pipe_write_downstream().
This is enough to fix the wev->delayed case. Additionally, we now call
ngx_event_pipe() from ngx_http_upstream_process_request() if there are
uncollected write operations (p->writing and !p->aio). This also fixes
the wev->timedout case.
The ngx_chain_coalesce_file() function may produce more bytes to send than
requested in the limit passed, as it aligns the last file position
to send to memory page boundary. As a result, (limit - send) may become
negative. This resulted in big positive number when converted to size_t
while calling ngx_output_chain_to_iovec().
Another part of the problem is in ngx_chain_coalesce_file(): it changes cl
to the next chain link even if the current buffer is only partially sent
due to limit.
Therefore, if a file buffer was not expected to be fully sent due to limit,
and was followed by a memory buffer, nginx called sendfile() with a part
of the file buffer, and the memory buffer in trailer. If there was enough
room in the socket buffer, this resulted in a part of the file buffer being
skipped, and corresponding part of the memory buffer sent instead.
The bug was introduced in 8e903522c17a (1.7.8). Configurations affected
are ones using limits, that is, limit_rate and/or sendfile_max_chunk, and
memory buffers after file ones (may happen when using subrequests or
with proxying with disk buffering).
Fix is to explicitly check if (send < limit) before constructing trailer
with ngx_output_chain_to_iovec(). Additionally, ngx_chain_coalesce_file()
was modified to preserve unfinished file buffers in cl.
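The size_t conversion problem and the guard against it can be sketched as
follows. The helper name trailer_room() and the use of int64_t in place of
off_t are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Room left for a trailer after "send" bytes were already scheduled
 * against "limit". Returns 0 when send has reached or exceeded the
 * limit, instead of converting a negative difference to size_t.
 */
static size_t
trailer_room(int64_t limit, int64_t send)
{
    if (send >= limit) {
        return 0;
    }

    return (size_t) (limit - send);
}
```

Without the check, a difference such as 4096 - 8192 converted to size_t
wraps around to a very large positive number.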
Closing up to 32 connections might be too aggressive if worker_connections
is set to a comparable number (and/or there are only a small number of
reusable connections). If an occasional connection shortage happens in
such a configuration, it leads to closing all reusable connections instead
of gradually reducing keepalive timeout to a smaller value. To improve
granularity in such configurations we now close no more than 1/8 of all
reusable connections at once.
Suggested by Joel Cunningham.
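The new limit can be sketched as below. The function name
connections_to_close() is hypothetical, and closing at least one connection
when fewer than eight are reusable is an assumption of this sketch rather
than something stated above.

```c
/*
 * Number of reusable connections to close at once: no more than 1/8 of
 * all reusable connections and no more than 32, but at least one
 * (the lower bound is an assumption of this sketch).
 */
static unsigned
connections_to_close(unsigned reusable)
{
    unsigned  n = reusable / 8;

    if (n > 32) {
        n = 32;
    }

    if (n < 1) {
        n = 1;
    }

    return n;
}
```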
A missing check could cause ngx_stream_ssl_handler() to be applied
to a non-ssl session, which resulted in a null pointer dereference
if ssl_verify_client is enabled.
The bug had appeared in 1.11.8 (41cb1b64561d).
Previously, an unavailable peer was considered recovered after a successful
proxy session to this peer. Until then, only a single client connection per
fail_timeout was allowed to be proxied to the peer.
Since stream sessions can be long, it may take indefinite time for a peer to
recover, limiting the ability of the peer to receive new connections.
Now, a peer is considered recovered after a successful TCP connection is
established to it. Balancers are notified of this event via the notify()
callback.
OpenSSL 1.1.0 now uses normal "nmake; nmake install" instead of using
custom "ms\do_ms.bat" script and "ms\nt.mak" makefile. And Configure
now requires --prefix to be absolute, and no longer derives --openssldir
from prefix (so it's specified explicitly). Generated libraries are now
called "libcrypto.lib" and "libssl.lib" instead of "libeay32.lib"
and "ssleay32.lib". Appropriate tests added to support both old and new
variants.
Additionally, openssl/lhash.h now triggers warning C4090 ('function' :
different 'const' qualifiers), so the warning was disabled.
There are lots of C4244 warnings (conversion from 'type1' to 'type2',
possible loss of data), so they were disabled.
The same applies to C4267 warnings (conversion from 'size_t' to 'type',
possible loss of data), most notably - conversion from ngx_str_t.len to
ngx_variable_value_t.len (which is unsigned:28). Additionally, there
is at least one case when it is not possible to fix the warning properly
without introducing win32-specific code: recv() on win32 uses "int len",
while POSIX defines "size_t len".
The ssize_t type now properly defined for 64-bit compilation with MSVC.
Caught by warning C4305 (truncation from '__int64' to 'ssize_t'), on
"cutoff = NGX_MAX_SIZE_T_VALUE / 10" in ngx_atosz().
Several C4334 warnings (result of 32-bit shift implicitly converted to 64 bits)
were fixed by adding explicit conversions.
Several C4214 warnings (nonstandard extension used: bit field types other
than int) in ngx_http_script.h fixed by changing bit field types from
uintptr_t to unsigned.
Most notably, warning W8012 (comparing signed and unsigned values) reported
in multiple places where an unsigned value of small type (e.g., u_short) is
promoted to an int and compared to an unsigned value.
Warning W8072 (suspicious pointer arithmetic) disabled, it is reported
when we increment base pointer in ngx_shm_alloc().
These types are available with MSVC (at least since 2003, in stddef.h),
all variants of GCC (in stdint.h) and Watcom C. We need to define them
only for Borland C.
This implies ticket key size of 80 bytes instead of previously used 48,
as both HMAC and AES keys are 32 bytes now. When an old 48-byte ticket key
is provided, we fall back to using backward-compatible AES128 encryption.
OpenSSL switched to using AES256 in 1.1.0, and we are providing equivalent
security. While here, order of HMAC and AES keys was reverted to make
the implementation compatible with keys used by OpenSSL with
SSL_CTX_set_tlsext_ticket_keys().
Prodded by Christian Klinger.
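The key file sizes above can be derived from the component sizes, assuming a
16-byte key name in both layouts. The constant and function names below are
hypothetical and only illustrate how a key file size could select the cipher.

```c
#include <stddef.h>

enum {
    TICKET_NAME_SIZE    = 16,   /* key name, assumed the same in both */
    TICKET_OLD_KEY_SIZE = 16,   /* HMAC and AES128 keys, old layout */
    TICKET_NEW_KEY_SIZE = 32    /* HMAC and AES256 keys, new layout */
};

/*
 * Returns 256 for an 80-byte key file, 128 for a 48-byte one,
 * and -1 for any other size.
 */
static int
ticket_key_aes_bits(size_t file_size)
{
    if (file_size == TICKET_NAME_SIZE + 2 * TICKET_NEW_KEY_SIZE) {
        return 256;
    }

    if (file_size == TICKET_NAME_SIZE + 2 * TICKET_OLD_KEY_SIZE) {
        return 128;
    }

    return -1;
}
```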
The current version of HTTP/1.1 standard allows relative references in
redirects (https://tools.ietf.org/html/rfc7231#section-7.1.2).
Allow this form for redirects generated by nginx by introducing the new
directive absolute_redirect.
SSL version 3.0 can be specified by the client at the record level for
compatibility reasons. Previously, ssl_preread module rejected such
connections, presuming they don't have SNI. Now SSL 3.0 is allowed at
the record level.
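A simplified sketch of such a record-level check is shown below. It assumes
the usual 5-byte record header (content type, major version, minor version,
two length bytes) and a hypothetical function name; it is not the actual
ssl_preread parser.

```c
#include <stddef.h>

/*
 * Returns 1 if the buffer starts with a handshake record (type 22)
 * whose record-level major version is 3. The minor version is not
 * restricted, so SSL 3.0 (3.0) is accepted as well as TLS (3.1+).
 */
static int
record_may_carry_client_hello(const unsigned char *p, size_t len)
{
    if (len < 5) {
        return 0;       /* not enough data for a record header */
    }

    return p[0] == 22 && p[1] == 3;
}
```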
The resolver handles SRV requests in two stages. In the first
stage it gets all SRV RRs, and in the second stage it resolves
the names from SRV RRs into addresses.
Previously, if a response to an SRV request was cached, the
queries to resolve names were not limited by a timeout. If a
response to any of these queries was not received, the SRV
request could never complete.
If a response to an SRV request was not cached, and some of the
queries to resolve names timed out, NGX_RESOLVE_TIMEDOUT was
returned instead of successfully resolved addresses.
To fix both issues, resolving of names is now always limited by
a timeout.
Changeset e7cb5deb951d breaks build on CentOS 5 with "dereferencing
type-punned pointer will break strict-aliasing rules" warning. It is
backed out.
Instead, to keep builds with BoringSSL happy, type of the "value"
variable changed to "char *", and an explicit cast added before calling
ngx_parse_http_time().
A bug was introduced by 82efcedb310b that could lead to timing out of
responses or segmentation fault, when accept_mutex was enabled.
The output queue in HTTP/2 can contain frames from different streams.
When the queue is sent, all related write handlers need to be called.
In order to do so, the streams were added to the h2c->posted queue
after handling sent frames. Then this queue was processed in
ngx_http_v2_write_handler().
If accept_mutex is enabled, the event's "ready" flag is set but its
handler is not called immediately. Instead, the event is added to
the ngx_posted_events queue. The same queue can also contain
events from upstream connections. Such events can result in sending
output queue before ngx_http_v2_write_handler() is triggered. And
at the time ngx_http_v2_write_handler() is called, the output queue
can be already empty with some streams added to h2c->posted.
But after 82efcedb310b, these streams weren't processed if all frames
have already been sent and the output queue was empty. This might lead
to a situation when a number of streams got stuck in h2c->posted
queue for a long time. Eventually these streams might get closed by
the send timeout.
In the worst case this might also lead to a segmentation fault, if
already freed stream was left in the h2c->posted queue. This could
happen if one of the streams was terminated but wasn't closed, due to
the HEADERS frame or a partially sent DATA frame left in the output
queue. If this happened the ngx_http_v2_filter_cleanup() handler
removed the stream from the h2c->waiting or h2c->posted queue on
termination stage, before the frame has been sent, and the stream
was again added to the h2c->posted queue after the frame was sent.
In order to fix all these problems and simplify the code, write
events of fake stream connections are now added to ngx_posted_events
instead of using a custom h2c->posted queue.
By default, "map" creates cacheable variables [1]. With this
parameter it creates a non-cacheable variable.
An original idea was to deduce the cacheability of the "map"
variable by checking the cacheability of variables specified
in source and resulting values, but it turned out to be too hard.
For example, a cacheable variable can be overridden with the
"set" directive or with the SSI "set" command. Also, keeping
"map" variables cacheable by default is good for performance
reasons. This required adding a new parameter.
[1] Before db699978a33f (1.11.0), the cacheability of the
"map" variable could vary depending on the cacheability of
variables specified in resulting values (ticket #1090).
This is believed to be a bug rather than a feature.
Removed code that would cause an endless loop, and removed condition
check that is always false. The first page in the slot list is
guaranteed to satisfy an allocation.
On exit, the environment allocated from a pool is no longer available, leading
to a segmentation fault if, for example, a library tries to use it from
an atexit() handler.
Fix is to allocate environment via ngx_alloc() instead, and explicitly
free it using a pool cleanup handler if it's no longer used (e.g., on
configuration reload).
In Perl 5.8.6 the default was switched to use putenv() when used as
an embedded library unless "PL_use_safe_putenv = 0" is explicitly used
in the code. Therefore, for modern versions of Perl it is no longer
necessary to restore the previous environment when calling
perl_destruct().
For Perl compiled with threads, without PERL_SET_INTERP() the PL_curinterp
remains set to the first interpreter created (that is, the one created at
original start). As a result, after a reload Perl thinks that operations
are done within a thread and, most notably, refuses to change the
environment.
For example, the following code properly works on original start,
but fails after a reload:
    perl 'sub {
        my $r = shift;
        $r->send_http_header("text/plain");
        $ENV{TZ} = "UTC";
        $r->print("tz: " . $ENV{TZ} . " (localtime " . (localtime()) . ")\n");
        $ENV{TZ} = "Europe/Moscow";
        $r->print("tz: " . $ENV{TZ} . " (localtime " . (localtime()) . ")\n");
        return OK;
    }';
To fix this, PERL_SET_INTERP() is now called everywhere
PERL_SET_CONTEXT() was previously used.
Note that PERL_SET_INTERP() doesn't seem to be documented anywhere.
Yet it is used in some other software, and also seems to be the only
solution possible.
Atom size is the sum of atom header size and atom data size. The
specification says that the first 4 bytes are set to one when
the atom size is greater than the maximum unsigned 32-bit value.
This means that the atom header size should be taken into account
when comparing the atom data size with 0xffffffff.
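The comparison described above can be sketched as follows. This is an
illustration only, not the actual nginx code: the function name and the
ATOM_HEADER_SIZE constant are hypothetical, and the header is assumed to
be 8 bytes (a 32-bit size followed by a 4-byte type).

```c
#include <stdint.h>

/* Assumption for this sketch: a regular atom header is a 32-bit size
 * field followed by a 4-byte type, 8 bytes in total. */
#define ATOM_HEADER_SIZE  8

/*
 * Returns 1 if an atom with the given data size needs the extended
 * 64-bit size form (first 4 bytes set to one), 0 otherwise.
 * The atom size includes the header, so the header size is accounted
 * for in the comparison.  The subtraction form avoids overflow for
 * very large data_size values.
 */
static int
atom_needs_64bit_size(uint64_t data_size)
{
    return data_size > 0xffffffffULL - ATOM_HEADER_SIZE;
}
```

Comparing only the data size with 0xffffffff would miss the last few
data sizes whose total, header included, no longer fits in 32 bits.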
The variable contains a list of curves as supported by the client.
Known curves are listed by their names, unknown ones are shown
in hex, e.g., "0x001d:prime256v1:secp521r1:secp384r1".
Note that OpenSSL uses session data for SSL_get1_curves(), and
it doesn't store the full list of curves supported by the client when
serializing a session. As a result, $ssl_curves is only available
for new sessions (and will be empty for reused ones).
The variable is only meaningful when using OpenSSL 1.0.2 and above.
With older versions the variable is empty.
The variable contains the list of ciphers as supported by the client.
Known ciphers are listed by their names, unknown ones are shown
in hex, e.g., "AES128-SHA:AES256-SHA:0x00ff".
The variable is fully supported only when using OpenSSL 1.0.2 and above.
With older versions there is an attempt to provide some information
using SSL_get_shared_ciphers(). It only lists known ciphers though.
Moreover, OpenSSL uses session data for SSL_get_shared_ciphers(),
and it doesn't store the relevant data when serializing a session. As
a result, $ssl_ciphers is only available for new sessions (and not
available for reused ones) when using OpenSSL older than 1.0.2.
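The list format described above (names for known ciphers, hex for
unknown ones, separated by colons) can be sketched as follows. This is
a standalone illustration and not the nginx implementation: the
cipher_name() lookup table and the format_ciphers() helper are
hypothetical, and a real implementation would ask the TLS library for
the names.

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookup table with two well-known cipher suite ids. */
static const char *
cipher_name(uint16_t id)
{
    switch (id) {
    case 0x002f: return "AES128-SHA";
    case 0x0035: return "AES256-SHA";
    default:     return NULL;
    }
}

/*
 * Writes a colon-separated cipher list into buf.
 * Returns 0 on success, -1 if buf is too small.
 */
static int
format_ciphers(const uint16_t *ids, size_t n, char *buf, size_t size)
{
    size_t  i, used = 0;

    if (size == 0) {
        return -1;
    }

    buf[0] = '\0';

    for (i = 0; i < n; i++) {
        const char  *name = cipher_name(ids[i]);
        const char  *sep = (i == 0) ? "" : ":";
        int          len;

        if (name != NULL) {
            len = snprintf(buf + used, size - used, "%s%s", sep, name);
        } else {
            /* unknown cipher: show its id in hex */
            len = snprintf(buf + used, size - used, "%s0x%04x",
                           sep, (unsigned) ids[i]);
        }

        if (len < 0 || (size_t) len >= size - used) {
            return -1;
        }

        used += (size_t) len;
    }

    return 0;
}
```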
Now in case of a verification failure $ssl_client_verify contains
"FAILED:<reason>", similar to Apache's SSL_CLIENT_VERIFY, e.g.,
"FAILED:certificate has expired".
Detailed description of possible errors can be found in the verify(1)
manual page as provided by OpenSSL.
Normally, the epoll module calls the read and write handlers depending
on whether EPOLLIN and EPOLLOUT are reported by epoll_wait(). No error
processing is done in the module; the handlers are expected to get an
error when doing I/O.
If an error event is reported without EPOLLIN and EPOLLOUT, the module
sets both EPOLLIN and EPOLLOUT to ensure the error event is handled in
at least one active handler.
This works well unless the error is delivered along with only one of
EPOLLIN or EPOLLOUT, and the corresponding handler does not do any I/O.
For example, it happened when getting EPOLLERR|EPOLLOUT from
epoll_wait() upon receiving "ICMP port unreachable" while proxying UDP.
As the write handler had nothing to send it was not able to detect and
log an error, and did not switch to the next upstream.
The fix is to unconditionally set EPOLLIN and EPOLLOUT in case of an
error event. In the aforementioned case, this causes the read handler
to be called which does recv() and detects an error.
In addition to the epoll module, analogous changes were made in
devpoll/eventport/poll.
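The fixed behaviour can be sketched as follows. This is a simplified
illustration rather than the actual module code: normalize_revents() is
a hypothetical helper, and treating EPOLLHUP the same way as EPOLLERR
is an assumption of this sketch.

```c
#include <stdint.h>
#include <sys/epoll.h>

/*
 * Given the event mask returned by epoll_wait(), return the mask used
 * to decide which handlers to call.
 */
static uint32_t
normalize_revents(uint32_t revents)
{
    if (revents & (EPOLLERR|EPOLLHUP)) {
        /*
         * On an error event, request both handlers unconditionally,
         * so that at least one of them does I/O and gets the error.
         */
        revents |= EPOLLIN|EPOLLOUT;
    }

    return revents;
}
```

With this approach, EPOLLERR|EPOLLOUT is turned into a mask that also
contains EPOLLIN, so the read handler is called even when the write
handler has nothing to send.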
Previously, a request body bigger than "client_body_buffer_size" wasn't written
into a temporary file if it had been pre-read entirely. The preread buffer
is freed after processing, thus subsequent use of it might result in sending
a corrupted body or cause a segfault.
On Linux, the rename syscall can be slow due to a global file system lock,
acquired for the entire rename operation, unless both old and new files are
in the same directory. To address this, temporary files are now created
in the same directory as the expected resulting cache file when using the
"use_temp_path=off" parameter.
This change mostly reverts 99639bfdfa2a and 3281de8142f5, restoring the
behaviour as of a9138c35120d (with minor changes).
Holding a cache node lock doesn't make sense as we can't use caching
anyway, and results in "ignore long locked inactive cache entry" alerts
if a node is locked for a long time.
The same is done for unbuffered connections, as they can be alive for
a long time as well.
It configures a threshold in bytes, above which client range
requests are not cached. In such a case the client's Range
header is passed directly to a proxied server.
As the pointer to the first argument was tested instead of the argument
itself, the array of arguments was always created, even if there were no
arguments. The fix is to test args[0] instead of args.
Found by Coverity (CID 1356862).
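The difference between the two tests can be illustrated with a
simplified sketch. The helper names are hypothetical, and plain C
strings are used here instead of the real argument type.

```c
#include <stddef.h>

/*
 * Buggy form: args points to the argument slots and is never NULL,
 * so this test is always true.
 */
static int
has_args_buggy(const char **args)
{
    return args != NULL;
}

/* Fixed form: test the first argument itself. */
static int
has_args_fixed(const char **args)
{
    return args[0] != NULL;
}
```

For an empty argument list such as { NULL }, the buggy form still
reports that arguments are present, while the fixed form does not.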
The only thing that the default_port comparison did in the current
code was to prevent implicit upstreams to the same address/port
from being aliased for http and https, e.g.:
    proxy_pass http://10.0.0.1:12345;
    proxy_pass https://10.0.0.1:12345;
This is inconsistent because it doesn't work for a similar case
with uwsgi_pass:
    uwsgi_pass uwsgi://10.0.0.1:12345;
    uwsgi_pass suwsgi://10.0.0.1:12345;
or with an explicit upstream:
    upstream u {
        server 10.0.0.1:12345;
    }
    proxy_pass http://u;
    proxy_pass https://u;
Before c9059bd5445b, default_port comparison was needed to
differentiate implicit upstreams in
    proxy_pass http://example.com;
and
    proxy_pass https://example.com;
as u->port was not set.
When an upstream{} block follows a proxy_pass reference to it,
such an upstream inherited port and default_port settings from
proxy_pass. This was different from when they came in another
order (see ticket #1059). Explicit upstreams should not have
port and default_port in any case.
This fixes the following case:
    server { location / { proxy_pass http://u; } ... }
    upstream u { server 127.0.0.1; }
    server { location / { proxy_pass https://u; } ... }
but not the following:
    server { location / { proxy_pass http://u; } ... }
    server { location / { proxy_pass https://u; } ... }
    upstream u { server 127.0.0.1; }
Previously, if proxy_pass (and friends) with variables evaluated to an
upstream specified with a literal address, nginx always created a
per-request upstream.
Now, if there's a matching upstream specified in the configuration
(either implicit or explicit), it will be used instead.
This fixes an inconsistency in what is stored in the "host" field.
Normally it would contain the "host" part of the parsed URL
(e.g., proxy_pass with variables), but for the case of an
implicit upstream specified with a literal address it contained
the text representation of the socket address (that is, host
including port for IP).
Now the "host" field always contains the "host" part of the URL,
while the text representation of the socket address is stored
in the newly added "name" field.
The ngx_http_upstream_create_round_robin_peer() function was
modified accordingly in a way to be compatible with the code
that does not know about the new "name" field.
The "stream" code was similarly modified except for not adding
compatibility in ngx_stream_upstream_create_round_robin_peer().
This change is also a prerequisite for the next change.
The new directive "http2_max_requests" is introduced. From the user's
point of view it works quite similarly to "keepalive_requests" but has
a significantly bigger default value that is more suitable for HTTP/2.
This allows correct parsing of "start" and "end" arguments without
null-termination (ticket #475), and also fixes rounding errors observed
with strtod() when using i387 instructions.
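One way to parse such arguments without relying on null-termination or
floating point is sketched below. This is an illustration of the
general approach and not the code from this change: parse_msec() is a
hypothetical helper that converts a decimal number of seconds in a
length-delimited buffer into integer milliseconds.

```c
#include <stddef.h>

/*
 * Parses a decimal number such as "12.345" from a buffer of known
 * length into an integer scaled by 1000.  Returns -1 on malformed
 * input.  Fractional digits beyond the third are ignored; negative
 * values and overflow are not handled in this sketch.
 */
static long long
parse_msec(const char *p, size_t len)
{
    long long  value = 0;
    int        frac = -1;    /* -1 until the dot is seen */
    int        digits = 0;
    size_t     i;

    for (i = 0; i < len; i++) {
        char  c = p[i];

        if (c == '.') {
            if (frac >= 0) {
                return -1;   /* second dot */
            }
            frac = 0;
            continue;
        }

        if (c < '0' || c > '9') {
            return -1;
        }

        digits++;

        if (frac >= 3) {
            continue;        /* ignore extra precision */
        }

        value = value * 10 + (c - '0');

        if (frac >= 0) {
            frac++;
        }
    }

    if (digits == 0) {
        return -1;
    }

    if (frac < 0) {
        frac = 0;
    }

    /* scale to exactly three fractional digits */
    while (frac < 3) {
        value *= 10;
        frac++;
    }

    return value;
}
```

Since only integer arithmetic is used, the result does not depend on
the floating-point unit, and the explicit length makes it safe to parse
a value that is followed by other data in the same buffer.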
Originally, the variables kept a result of X509_NAME_oneline(),
which is, according to the official documentation, a legacy
function. It produces a non-standard output form and has
various quirks and inconsistencies.
The RFC 2253 compliant behavior is introduced for these variables.
The original variables are available through $ssl_client_s_dn_legacy
and $ssl_client_i_dn_legacy.