Type: Bug
Resolution: Done
Priority: Major - P3
Affects Version/s: WT3.1.0, WT3.2.1
Component/s: Checkpoints

(copied to CRM)
I'm creating this issue related to WT-1598, mostly for documentation purposes, because it took us weeks to figure out what was happening and we could not find good information online before debugging the WT code ourselves. The only other mention of this problem we found was a StackExchange post, which also has no solution to date.
Our application is not particularly heavy in document or collection size, so we have not yet seen the need for horizontal scaling. It does, however, use a large number of collections and indexes.
"files currently open": 126583
"connection data handles currently active": 209848
Our problems began after upgrading to MongoDB 3.6 (WT3.1) from MongoDB 3.4 (WT2.9). One obvious difference we noticed is that the number of data handles was previously similar to the number of open files, but now seems to be about twice that, as can be seen above. A possible reason may be the introduction of client sessions, but this is merely conjecture. The checkpoint code mentioned below was also changed significantly between WT2 and WT3.
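For anyone else trying to track this down: the two counters above are plain WiredTiger connection statistics (on mongod they also appear under db.serverStatus().wiredTiger). Below is a minimal sketch of reading them directly with WiredTiger's public statistics cursor against a standalone data directory; the "WT_HOME" path and the open configuration string are placeholder assumptions, and the substring match is used because the cursor's description strings may carry a category prefix.

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <wiredtiger.h>

int
main(void)
{
    WT_CONNECTION *conn;
    WT_CURSOR *cursor;
    WT_SESSION *session;
    const char *desc, *pvalue;
    int64_t value;
    int ret;

    /* "WT_HOME" is a placeholder for the WiredTiger home directory. */
    if ((ret = wiredtiger_open("WT_HOME", NULL, "statistics=(all)", &conn)) != 0)
        return (ret);
    if ((ret = conn->open_session(conn, NULL, NULL, &session)) != 0)
        return (ret);

    /* A cursor on "statistics:" iterates the connection-level statistics. */
    if ((ret = session->open_cursor(session, "statistics:", NULL, NULL, &cursor)) != 0)
        return (ret);

    while (cursor->next(cursor) == 0) {
        if (cursor->get_value(cursor, &desc, &pvalue, &value) != 0)
            break;
        if (strstr(desc, "files currently open") != NULL ||
            strstr(desc, "connection data handles currently active") != NULL)
            printf("%s: %s\n", desc, pvalue);
    }

    return (conn->close(conn, NULL));
}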
What we started noticing were occasional, extremely slow queries that hung solely on schema lock acquisition and were not reproducible.
planSummary: IXSCAN { ... }
keysExamined:0
..
nreturned:0
...
storage: { timeWaitingMicros: { schemaLock: 89210873 } }
protocol:op_msg 89214ms
Eventually, we traced the issue back to the WiredTigerCheckpointThread, which acquires the lock in the following line of txn_ckpt.c:
WT_WITH_SCHEMA_LOCK(session, ret = __checkpoint_prepare(session, &tracking, cfg));
and subsequently spends a large amount of time iterating over all active data handles in the following line of the same file:
__checkpoint_apply_all(session, cfg, __wt_checkpoint_get_handles)
Commands are only occasionally affected by this, namely when they require a new data handle (e.g., an index that has not been used yet or is open in the wrong mode). In that case they run into the following code block in session_dhandle.c:
/*
 * For now, we need the schema lock and handle list locks to
 * open a file for real.
 *
 * Code needing exclusive access (such as drop or verify)
 * assumes that it can close all open handles, then open an
 * exclusive handle on the active tree and no other threads can
 * reopen handles in the meantime. A combination of the schema
 * and handle list locks are used to enforce this.
 */
if (!F_ISSET(session, WT_SESSION_LOCKED_SCHEMA)) {
    dhandle->excl_session = NULL;
    dhandle->excl_ref = 0;
    F_CLR(dhandle, WT_DHANDLE_EXCLUSIVE);
    __wt_writeunlock(session, &dhandle->rwlock);

    WT_WITH_SCHEMA_LOCK(session,
        ret = __wt_session_get_dhandle(session, uri, checkpoint, cfg, flags));
    return (ret);
}
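To make the interaction concrete, here is a minimal standalone sketch of the two code paths above, written with plain pthreads rather than WiredTiger's own locking primitives; the handle count and per-handle delay are made-up values chosen only so the effect is visible, not measurements from our system.

#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Arbitrary illustrative numbers. */
#define N_HANDLES 20000
#define PER_HANDLE_WORK_US 100

static pthread_mutex_t schema_lock = PTHREAD_MUTEX_INITIALIZER;

static double
now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6);
}

/*
 * "Checkpoint" thread: takes the schema lock, then visits every active
 * data handle while still holding it (the __checkpoint_prepare /
 * __wt_checkpoint_get_handles phase described above).
 */
static void *
checkpoint_thread(void *arg)
{
    int i;

    (void)arg;
    pthread_mutex_lock(&schema_lock);
    for (i = 0; i < N_HANDLES; ++i)
        usleep(PER_HANDLE_WORK_US); /* stand-in for per-handle work */
    pthread_mutex_unlock(&schema_lock);
    return (NULL);
}

/*
 * "Query" thread: needs to open a data handle it has not used before,
 * which requires the same schema lock (the session_dhandle.c path above).
 */
static void *
query_thread(void *arg)
{
    double start;

    (void)arg;
    start = now_ms();
    pthread_mutex_lock(&schema_lock);
    pthread_mutex_unlock(&schema_lock);
    printf("query waited %.0f ms for the schema lock\n", now_ms() - start);
    return (NULL);
}

int
main(void)
{
    pthread_t ckpt, query;

    pthread_create(&ckpt, NULL, checkpoint_thread, NULL);
    usleep(10000); /* crude ordering: let the checkpoint grab the lock first */
    pthread_create(&query, NULL, query_thread, NULL);

    pthread_join(query, NULL);
    pthread_join(ckpt, NULL);
    return (0);
}

The wait the query thread reports is simply proportional to however much of the handle walk is left when it arrives, which is the shape of the ~89 second schemaLock wait in the slow-query log above once the handle count reaches the hundreds of thousands.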
I am not sure whether this can be improved without all of WT-1598. At the very least, this current limitation of the WT storage engine could be documented, to advise application developers against excessive use of collections and indexes.
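One knob we are aware of in this area is WiredTiger's handle-sweep configuration, the file_manager group (close_idle_time, close_handle_minimum, close_scan_interval), which mongod can pass through via --wiredTigerEngineConfigString or the wiredTigerEngineRuntimeConfig server parameter. The values below are purely illustrative, and it is unclear to us how much closing idle handles sooner helps with the schema-lock stalls themselves:

file_manager=(close_idle_time=300,close_handle_minimum=250,close_scan_interval=10)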
Any recommendations for tweaks to our setup are also welcome. Thank you!
duplicates:
- SERVER-31704 Periodic drops in throughput during checkpoints while waiting on schema lock (Closed)

is related to:
- WT-7381 Cache btree's ckptlist between checkpoints (Closed)

related to:
- WT-6598 Add new API allowing changing dhandle hash bucket size (Closed)
- WT-1598 Remove the schema, table locks (Closed)
- WT-6421 Avoid parsing metadata checkpoint for clean files (Closed)
- WT-5042 Reduce configuration parsing overhead from checkpoints (Closed)
- WT-7028 Sweep thread shouldn't lock during checkpoint gathering handles (Closed)
- WT-7004 Architecture guide page for checkpoints (Closed)