When the storage engine is shutting down, all of the collections are deregistered from the CollectionCatalog before shutting down the underlying storage engine.
However, a recent change in the CollectionCatalog by SERVER-47885 made it so that the Collection object was no longer uniquely owned by the CollectionCatalog, but a shared object instead.
This becomes problematic during a clean shutdown for the OplogVisibilityThread. The OplogVisibilityThread is stopped when the RecordStore for the Collection is being destructed. But because the Collection wasn't destructed during shut down (the shared_ptr is being held by something else outside of the CollectionCatalog), the destructor never got to run and therefore the OplogVisibilityThread was not stopped. After deregistering all the Collections from the CollectionCatalog, the storage engine is shut down but it's still possible for the OplogVisibilityThread to run and try to query the storage engine for the durable timestamp. This causes a crash.
After digging around the code a bit more, I noticed that SERVER-49731 also changed the RecordStore from a unique_ptr to a shared_ptr in the Collection object. I'm wondering if this could lead to similar problems in a way. I'm predicting that this is okay as long as all of the Collections sharing the same RecordStore are destructed during a clean shut down before the storage engines stops.
This issue was discovered in a build failure. As seen, the OplogVisibilityThread attempted to access the storage engine for the all durable timestamp after it has shut down.
The reason that the storage engine is starting up again a bit later is because it's reconfiguring itself. But it will still shut down again! This probably extended the window just long enough for the OplogVisbilityThread to run and cause a crash.
[SignalHandler] "Closing WiredTiger in preparation for reconfiguring","attr":{"closeConfig":"leak_memory=true,"} [SignalHandler] "WiredTiger closed","attr":{"durationMillis":114} [OplogVisibilityThread] "Writing fatal message","attr":{"message":"Invalid access at address: 0x48"} [OplogVisibilityThread] "Writing fatal message","attr":{"message":"Got signal: 11 (Segmentation fault).\n"} [SignalHandler] "WiredTiger message","attr":{"message":"[1599108139:459597][57428:0x7fa180b38700], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 1 through 2"} [OplogVisibilityThread] "BACKTRACE","attr":{"bt":{"backtrace":[{"a":"7FA1911F4ACE","b":"7FA191027000","o":"1CDACE","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.627","s+":"1EE"},{"a":"7FA1911F5EE9","b":"7FA191027000","o":"1CEEE9","s":"_ZN5mongo15printStackTraceEv","s+":"29"},{"a":"7FA1911F2B53","b":"7FA191027000","o":"1CBB53","s":"_ZN5mongo12_GLOBAL__N_124abruptQuitWithAddrSignalEiP7siginfoPv","s+":"F3"},{"a":"7FA18F4DC7E0","b":"7FA18F4CD000","o":"F7E0","s":"_L_unlock_16","s+":"2D"},{"a":"7FA1888ADF16","b":"7FA188829000","o":"84F16","s":"_ZNK5mongo18WiredTigerKVEngine22getAllDurableTimestampEv","s+":"36"},{"a":"7FA1888C0538","b":"7FA188829000","o":"97538","s":"_ZN5mongo22WiredTigerOplogManager26_updateOplogVisibilityLoopEPNS_22WiredTigerSessionCacheEPNS_21WiredTigerRecordStoreE","s+":"358"},{"a":"7FA1888C1A09","b":"7FA188829000","o":"98A09","s":"_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5mongo4stdx6threadC4IMNS3_22WiredTigerOplogManagerEFvPNS3_22WiredTigerSessionCacheEPNS3_21WiredTigerRecordStoreEEJPS7_S9_RSB_ELi0EEET_DpOT0_EUlvE_EEEEE6_M_runEv","s+":"69"},{"a":"7FA18B5E307F","b":"7FA18B5D6000","o":"D07F","s":"execute_native_thread_routine","s+":"F"},{"a":"7FA18F4D4AA1","b":"7FA18F4CD000","o":"7AA1","s":"start_thread","s+":"D1"},{"a":"7FA18F221C4D","b":"7FA18F139000","o":"E8C4D","s":"clone","s+":"6D"}],"processInfo":{"mongodbVersion":"4.5.0-2994-g8e64b07","gitVersion":"8e64b07e3bb363347ee2c11a56aba873365ed74a","compiledModules":["enterprise"],"uname":{"sysname":"Linux","release":"2.6.32-220.el6.x86_64","version":"#1 SMP Wed Nov 9 08:03:13 EST 2011","machine":"x86_64"},"somap":[{"b":"7FA191027000","path":"/data/mci/3f10f1c8535546d89d7b83865c3d9b72/src/dist-test/bin/../lib/libbase.so","elfType":3,"buildId":"83D34D173525B78B7B0880182EE09AC9629D1AB8"},{"b":"7FA18B5D6000","path":"/data/mci/3f10f1c8535546d89d7b83865c3d9b72/src/dist-test/bin/../lib/../lib/libbackground_job.so","elfType":3,"buildId":"82B33038BB8BDE9582F6F0BEC9FAE4BDD33C3773"},{"b":"7FA188829000","path":"/data/mci/3f10f1c8535546d89d7b83865c3d9b72/src/dist-test/bin/../lib/../lib/../lib/libstorage_wiredtiger_core.so","elfType":3,"buildId":"F5CB5410FDAFD95B86150A4A9A9E614E118301C0"}]}}} [OplogVisibilityThread] "Frame","attr":{"frame":{"a":"7FA1911F4ACE","b":"7FA191027000","o":"1CDACE","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.627","s+":"1EE"}} [OplogVisibilityThread] "Frame","attr":{"frame":{"a":"7FA1911F5EE9","b":"7FA191027000","o":"1CEEE9","s":"_ZN5mongo15printStackTraceEv","s+":"29"}} [OplogVisibilityThread] "Frame","attr":{"frame":{"a":"7FA1911F2B53","b":"7FA191027000","o":"1CBB53","s":"_ZN5mongo12_GLOBAL__N_124abruptQuitWithAddrSignalEiP7siginfoPv","s+":"F3"}} [OplogVisibilityThread] "Frame","attr":{"frame":{"a":"7FA18F4DC7E0","b":"7FA18F4CD000","o":"F7E0","s":"_L_unlock_16","s+":"2D"}} [OplogVisibilityThread] "Frame","attr":{"frame":{"a":"7FA1888ADF16","b":"7FA188829000","o":"84F16","s":"_ZNK5mongo18WiredTigerKVEngine22getAllDurableTimestampEv","s+":"36"}} [OplogVisibilityThread] "Frame","attr":{"frame":{"a":"7FA1888C0538","b":"7FA188829000","o":"97538","s":"_ZN5mongo22WiredTigerOplogManager26_updateOplogVisibilityLoopEPNS_22WiredTigerSessionCacheEPNS_21WiredTigerRecordStoreE","s+":"358"}} [OplogVisibilityThread] "Frame","attr":{"frame":{"a":"7FA1888C1A09","b":"7FA188829000","o":"98A09","s":"_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5mongo4stdx6threadC4IMNS3_22WiredTigerOplogManagerEFvPNS3_22WiredTigerSessionCacheEPNS3_21WiredTigerRecordStoreEEJPS7_S9_RSB_ELi0EEET_DpOT0_EUlvE_EEEEE6_M_runEv","s+":"69"}} [OplogVisibilityThread] "Frame","attr":{"frame":{"a":"7FA18B5E307F","b":"7FA18B5D6000","o":"D07F","s":"execute_native_thread_routine","s+":"F"}} [OplogVisibilityThread] "Frame","attr":{"frame":{"a":"7FA18F4D4AA1","b":"7FA18F4CD000","o":"7AA1","s":"start_thread","s+":"D1"}} [OplogVisibilityThread] "Frame","attr":{"frame":{"a":"7FA18F221C4D","b":"7FA18F139000","o":"E8C4D","s":"clone","s+":"6D"}}
- related to
-
SERVER-47885 Have lookupCollectionBy{Namespace,UUID} return a shared_ptr
- Closed