-
Type: Bug
-
Resolution: Fixed
-
Priority: Critical - P2
-
Affects Version/s: 4.0.24
-
Component/s: None
-
None
-
Fully Compatible
-
ALL
-
Sharding 2021-05-03
PoolForHost::_maxInUse is init to <int>::max and cannot be changed, and when connections are maxed using the DBConnectionPool::_maxInUse DBConnectionPool::get() always throws, immediately.
To understand this bug first pay attention that there is PoolForHost::_maxInUse here and DBConnectionPool::_maxInUse here. They are not synchronized in any way.
The DBConnectionPool::_maxInUse can be set with public method setMaxInUse() here. The PoolForHost::_maxInUse is initialized in constructor to std::numeric_limits<int>::max() and there is no code path that changes it, it remains MAX_INT for every host for the lifetime of the pool.
The PoolForHost::_maxInUse is used only here, in waitForFreeConnection(). The logic is that waitForFreeConnection() is invoked from DBConnectionPool::Detail::get() here when the DBConnectionPool::_maxInUse is exhausted.
However, this code fragment in waitForFreeConnection() always throws:
auto condition = [&] { return (numInUse() < _maxInUse || _inShutdown.load()); }; if (timeout > 0) { stdx::chrono::seconds timeoutSeconds{timeout}; // If we timed out waiting without getting a new connection, throw. uassert(ErrorCodes::ExceededTimeLimit, str::stream() << "too many connections to " << _hostName << ":" << timeout, !_cv.wait_for(lk, timeoutSeconds, condition));
because the 'condition' "numInUse() < _maxInUse" is always true because _maxInUse is MAX_INT. Thus the "_cv.wait_for()" never blocks because the condition is always true when it's invoked. However the assertion "!_cv.wait_for(...)" always throws because wait_for() always returns true and its negation is asserted.
As a result of this bug the behavior of DBConnectionPool is that once the connection count reaches the limit the get() always throws, immediately. The code actually expects that it should block for specified timeout, throwing without blocking should create very tight loops somewhere up the stack, if the caller logic retries whatever it needs the connection pool for.
The proposed fix actually contains 4 bugfixes in one change, CR pending
1. PoolForHost::_maxInUse is removed as there is a similar field in DBConnectionPool and it was never set properly
2. PoolKey::timeout is changed from double to int64. The reason is that it is dangerous to make a double key for a container, container lookup may become non-deterministic because of double precision ops errors
3. In PoolForHost::waitForFreeConnection(), "!_cv.wait_for(...)" is changed to "_cv.wait_for()". The inversion of the wait return value for assertion is a bug, in case of timeout wait() returns false and we should assert it without inversion
4. DBConnectionPool::decrementEgress() must notify waiters. This is SERVER-56147, the CV is never notified if the connection is released using this method.
- related to
-
SERVER-56229 Forward-port SERVER-56217 bug in connection pool to head
- Backlog