Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-53747

replica set down after setFeatureCompatibilityVersion 4.4

    • Type: Icon: Question Question
    • Resolution: Duplicate
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: 4.4.2
    • Component/s: None
    • None
    • Service Arch

      Today we have experienced a major issue in our mongodb production database.

      3 weeks ago we upgraded binaries from 4.2 to 4.4 in all nodes in the replica set.

      Today, in order to complete the upgrade, we ran the following command from the primary (as mentioned in the official doc https://docs.mongodb.com/manual/release-notes/4.4-upgrade-replica-set/#feature-compatibility-version{{db.adminCommand( { setFeatureCompatibilityVersion: "4.4" } )}}

      Right after execute the command all data-bearing nodes of our replica set got unhealthy. CPU spike to 100% in all data-bearing nodes.

      Any attempt to establish a connection to the replica set end with a timeout or a rejection.

      The mongod log shows this fatal error:

      {"t":{"$date":"2021-01-13T11:50:17.983+00:00"},"s":"F",  "c":"-",        "id":23079,   "ctx":"TopologyVersionObserver","msg":"Invariant failure","attr":{"expr":"opCtx != nullptr && _opCtx == nullptr","file":"src/mongo/db/client.cpp","line":126}}
      {"t":{"$date":"2021-01-13T11:50:17.985+00:00"},"s":"F",  "c":"-",        "id":23080,   "ctx":"TopologyVersionObserver","msg":"\n\n***aborting after invariant() failure\n\n"}
      {"t":{"$date":"2021-01-13T11:50:17.985+00:00"},"s":"F",  "c":"CONTROL",  "id":4757800, "ctx":"TopologyVersionObserver","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}
      {"t":{"$date":"2021-01-13T11:50:18.081+00:00"},"s":"I",  "c":"CONTROL",  "id":31431,   "ctx":"TopologyVersionObserver","msg":"BACKTRACE: {bt}","attr":{"bt":{"backtrace":[{"a":"55606F60F921","b":"55606C92C000","o":"2CE3921","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.606","s+":"1E1"},{"a":"55606F610F59","b":"55606C92C000","o":"2CE4F59","s":"_ZN5mongo15printStackTraceEv","s+":"29"},{"a":"55606F60E7B6","b":"55606C92C000","o":"2CE27B6","s":"_ZN5mongo12_GLOBAL__N_116abruptQuitActionEiP9siginfo_tPv","s+":"66"},{"a":"7EFF61C048A0","b":"7EFF61BF2000","o":"128A0","s":"funlockfile","s+":"50"},{"a":"7EFF6183FF47","b":"7EFF61801000","o":"3EF47","s":"gsignal","s+":"C7"},{"a":"7EFF618418B1","b":"7EFF61801000","o":"408B1","s":"abort","s+":"141"},{"a":"55606D840D8F","b":"55606C92C000","o":"F14D8F","s":"_ZN5mongo15invariantFailedEPKcS1_j","s+":"12C"},{"a":"55606D814458","b":"55606C92C000","o":"EE8458","s":"_ZN5mongo6Client19setOperationContextEPNS_16OperationContextE.cold.135","s+":"18"},{"a":"55606F4D1069","b":"55606C92C000","o":"2BA5069","s":"_ZN5mongo14ServiceContext20makeOperationContextEPNS_6ClientE","s+":"129"},{"a":"55606F4C5B17","b":"55606C92C000","o":"2B99B17","s":"_ZN5mongo6Client20makeOperationContextEv","s+":"27"},{"a":"55606DD474A1","b":"55606C92C000","o":"141B4A1","s":"_ZN5mongo4repl23TopologyVersionObserver17_workerThreadBodyEv","s+":"251"},{"a":"55606DD47B4C","b":"55606C92C000","o":"141BB4C","s":"_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5mongo4stdx6threadC4IZNS3_4repl23TopologyVersionObserver4initEPNS3_14ServiceContextEPNS7_22ReplicationCoordinatorEEUlvE0_JELi0EEET_DpOT0_EUlvE_EEEEE6_M_runEv","s+":"5C"},{"a":"55606F7B547F","b":"55606C92C000","o":"2E8947F","s":"execute_native_thread_routine","s+":"F"},{"a":"7EFF61BF96DB","b":"7EFF61BF2000","o":"76DB","s":"start_thread","s+":"DB"},{"a":"7EFF61922A3F","b":"7EFF61801000","o":"121A3F","s":"clone","s+":"3F"}],"processInfo":{"mongodbVersion":"4.4.2","gitVersion":"15e73dc5738d2278b688f8929aee605fe4279b0e","compiledModules":[],"uname":{"sysname":"Linux","release":"5.0.0-1036-azure","version":"#38-Ubuntu SMP Sun Mar 22 21:27:21 UTC 2020","machine":"x86_64"},"somap":[{"b":"55606C92C000","elfType":3,"buildId":"D18F657A1E06C333C2AEE534E3047044B0653DBF"},{"b":"7EFF61BF2000","path":"/lib/x86_64-linux-gnu/libpthread.so.0","elfType":3,"buildId":"BC3C06107774266C5F7DB3F1F380A3DA68AF90FA"},{"b":"7EFF61801000","path":"/lib/x86_64-linux-gnu/libc.so.6","elfType":3,"buildId":"D3CF764B2F97AC3EFE366DDD07AD902FB6928FD7"}]}}}}
      {"t":{"$date":"2021-01-13T11:50:18.081+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"55606F60F921","b":"55606C92C000","o":"2CE3921","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.606","s+":"1E1"}}}
      {"t":{"$date":"2021-01-13T11:50:18.081+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"55606F610F59","b":"55606C92C000","o":"2CE4F59","s":"_ZN5mongo15printStackTraceEv","s+":"29"}}}
      {"t":{"$date":"2021-01-13T11:50:18.081+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"55606F60E7B6","b":"55606C92C000","o":"2CE27B6","s":"_ZN5mongo12_GLOBAL__N_116abruptQuitActionEiP9siginfo_tPv","s+":"66"}}}
      {"t":{"$date":"2021-01-13T11:50:18.081+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"7EFF61C048A0","b":"7EFF61BF2000","o":"128A0","s":"funlockfile","s+":"50"}}}
      {"t":{"$date":"2021-01-13T11:50:18.081+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"7EFF6183FF47","b":"7EFF61801000","o":"3EF47","s":"gsignal","s+":"C7"}}}
      {"t":{"$date":"2021-01-13T11:50:18.081+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"7EFF618418B1","b":"7EFF61801000","o":"408B1","s":"abort","s+":"141"}}}
      {"t":{"$date":"2021-01-13T11:50:18.081+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"55606D840D8F","b":"55606C92C000","o":"F14D8F","s":"_ZN5mongo15invariantFailedEPKcS1_j","s+":"12C"}}}
      {"t":{"$date":"2021-01-13T11:50:18.081+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"55606D814458","b":"55606C92C000","o":"EE8458","s":"_ZN5mongo6Client19setOperationContextEPNS_16OperationContextE.cold.135","s+":"18"}}}
      {"t":{"$date":"2021-01-13T11:50:18.082+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"55606F4D1069","b":"55606C92C000","o":"2BA5069","s":"_ZN5mongo14ServiceContext20makeOperationContextEPNS_6ClientE","s+":"129"}}}
      {"t":{"$date":"2021-01-13T11:50:18.082+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"55606F4C5B17","b":"55606C92C000","o":"2B99B17","s":"_ZN5mongo6Client20makeOperationContextEv","s+":"27"}}}
      {"t":{"$date":"2021-01-13T11:50:18.082+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"55606DD474A1","b":"55606C92C000","o":"141B4A1","s":"_ZN5mongo4repl23TopologyVersionObserver17_workerThreadBodyEv","s+":"251"}}}
      {"t":{"$date":"2021-01-13T11:50:18.082+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"55606DD47B4C","b":"55606C92C000","o":"141BB4C","s":"_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5mongo4stdx6threadC4IZNS3_4repl23TopologyVersionObserver4initEPNS3_14ServiceContextEPNS7_22ReplicationCoordinatorEEUlvE0_JELi0EEET_DpOT0_EUlvE_EEEEE6_M_runEv","s+":"5C"}}}
      {"t":{"$date":"2021-01-13T11:50:18.082+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"55606F7B547F","b":"55606C92C000","o":"2E8947F","s":"execute_native_thread_routine","s+":"F"}}}
      {"t":{"$date":"2021-01-13T11:50:18.082+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"7EFF61BF96DB","b":"7EFF61BF2000","o":"76DB","s":"start_thread","s+":"DB"}}}
      {"t":{"$date":"2021-01-13T11:50:18.082+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"7EFF61922A3F","b":"7EFF61801000","o":"121A3F","s":"clone","s+":"3F"}}}
      
      {"t":{"$date":"2021-01-13T12:17:39.998+00:00"},"s":"I",  "c":"CONTROL",  "id":20698,   "ctx":"main","msg":"***** SERVER RESTARTED *****"}
      {"t":{"$date":"2021-01-13T12:17:40.002+00:00"},"s":"I",  "c":"CONTROL",  "id":23285,   "ctx":"main","msg":"Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'"}
      {"t":{"$date":"2021-01-13T12:17:40.068+00:00"},"s":"W",  "c":"ASIO",     "id":22601,   "ctx":"main","msg":"No TransportLayer configured during NetworkInterface startup"}
      

      Issues likely related I have found already reported are:

      Attached relevant part of the mongod.log.

      After several minutes the mongod process got killed, after restart all data bearing node we recover the cluster. Arbiter was not affected.

      The featureCompatibilityVersion was not updated to 4.4.

      db.adminCommand( { getParameter: 1, featureCompatibilityVersion: 1 } ).featureCompatibilityVersion.version
      // 4.2 

      More info:

      • Replica set is running 4.4.2 binaries
      • Replica set is composed by
        • 3 Data bearing node (1 with priority 0 and votes 0, 2 with priority 1 and votes 1)
        • 1 Arbiter
      • OS: Linux Ubuntu 18.0 LTS
      • Cloud: Azure

       

      What should be the recommendation in this case to overcome the issue and being able to finalize the upgrade?

      • Can we try to run the FCV command in a secondary only first?
      • Is it known or expected such behavior after running the FCV command?

        1. mongod.log
          177 kB
        2. image-2021-01-20-10-05-02-966.png
          image-2021-01-20-10-05-02-966.png
          97 kB
        3. image-2021-01-20-10-00-43-525.png
          image-2021-01-20-10-00-43-525.png
          105 kB
        4. image-2021-01-14-18-36-39-527.png
          image-2021-01-14-18-36-39-527.png
          11 kB
        5. image-2021-01-14-18-32-15-793.png
          image-2021-01-14-18-32-15-793.png
          9 kB
        6. image-2021-01-14-18-31-45-693.png
          image-2021-01-14-18-31-45-693.png
          6 kB
        7. image-2021-01-14-13-07-05-310.png
          image-2021-01-14-13-07-05-310.png
          43 kB
        8. image-2021-01-13-17-54-12-906.png
          image-2021-01-13-17-54-12-906.png
          46 kB
        9. arbiter-44.png
          arbiter-44.png
          27 kB

            Assignee:
            backlog-server-servicearch [DO NOT USE] Backlog - Service Architecture
            Reporter:
            francesco.rivola@gmail.com Francesco Rivola
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

              Created:
              Updated:
              Resolved: