SERVER-28232: Performance degradation after upgrade from 3.0.11 to 3.2.12

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 3.2.12
    • Component/s: Performance

      Environment: CentOS 6 (kernel 2.6.32), MongoDB Community Edition 3.0.11 and 3.2.12, MMAPv1 storage engine, SCCC config servers

      We are experiencing performance degradation when moving from 3.0.11 to 3.2.12: application throughput drops by a factor of 5-10 on 3.2.12 compared to 3.0.11. In the past we attempted an upgrade from 3.0.11 to 3.2.8; application throughput was fine on 3.2.8, but because the mongos processes were randomly crashing due to https://jira.mongodb.org/browse/SERVER-26159, we rolled back to 3.0.11. SERVER-26159 was fixed in 3.2.10, so we attempted the upgrade again, but performance dropped and we rolled back to 3.0.11 once more. We opened SERVER-26654 about the performance issue (several other people reported almost the same symptoms), and according to Jira it was resolved in 3.2.12. We then attempted the upgrade to 3.2.12 but hit the same performance degradation as on 3.2.10.

      After increasing the log verbosity from 1 to 2, the issue we see in the logs is the following:

      I ASIO    [NetworkInterfaceASIO-TaskExecutorPool-2-0] Failed to connect to (node) - ExceededTimeLimit: Operation timed out
      D ASIO    [NetworkInterfaceASIO-TaskExecutorPool-2-0] Failed to execute command: RemoteCommand 23628777 -- target:(node) db:admin cmd:{ isMaster: 1 } reason: ExceededTimeLimit: Operation timed out

      The isMaster command keeps timing out, each time on a different "TaskExecutorPool".
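
      For reference, the command that is timing out is a plain isMaster. Run by hand from the mongos host against one of the shard members (the "(node)" in the log above), it looks like this in the mongo shell:

      // the command that is timing out in the log above
      db.adminCommand({ isMaster: 1 })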
      

      Note: I am not changing "protocolVersion" to 1 after the 3.0.11 to 3.2.12 upgrade, as doing so makes a rollback harder.
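
      For reference, the reconfig I am deliberately not running would look roughly like this in the mongo shell on each replica set primary (a sketch only; we have not applied it):

      // switch the replica set election protocol to pv1 (makes rolling back to 3.0 harder)
      cfg = rs.conf()
      cfg.protocolVersion = 1
      rs.reconfig(cfg)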

      We managed to reproduce the issue with sysbench-mongodb on 3.2.12 on a 10-node sharded cluster, although not at the scale we see on our production system.

      To remedy the issue in testing we changed the taskExecutorPoolSize value:

      Our mongos has 6 CPUs, so I assume it creates 6 connection pools with the defaults. Using a smaller value such as "taskExecutorPoolSize"=2 reduces the timeouts, so it seems the more connection pools in use, the more timeouts occur during the benchmark. When I set "taskExecutorPoolSize"=1, which I believe results in a single connection pool, I do not get the timeouts above.
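
      One way to see how many executor pools the mongos actually created is connPoolStats from a mongo shell connected to the mongos (a sketch; I am assuming the 3.2 output lists the NetworkInterfaceASIO-TaskExecutorPool-* pools seen in the log above):

      // reports per-pool host and connection counts on the mongos
      db.adminCommand({ connPoolStats: 1 })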

      We also increased ShardingTaskExecutorPoolRefreshTimeoutMS from the default of 20 seconds to 60 seconds, which also eliminated the timeouts.

      We combined both settings on production but, unfortunately, the timeouts did not go away and we still saw the same performance degradation.

      setParameter:
        ShardingTaskExecutorPoolRefreshTimeoutMS: 60000
        taskExecutorPoolSize: 1
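
      To double-check that the mongos picked up both values, I read them back with getParameter from the mongo shell (a sketch; I am assuming both server parameters can be read back this way on 3.2.12):

      db.adminCommand({
        getParameter: 1,
        taskExecutorPoolSize: 1,
        ShardingTaskExecutorPoolRefreshTimeoutMS: 1
      })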
      

      I want to believe that it is not our workload that is triggering the performance degradation, as it runs fine on 3.0.11.

      The purpose of this ticket is to understand what changed between 3.2.8 and 3.2.12 that might cause the isMaster requests between mongos and mongod to time out.

      It would be much appreciated if anyone has insight into the internals of the change, or is facing the same problem and has found a workaround.

      Thanks in advance,

      Jason

            Assignee: Kelsey Schubert
            Reporter: Jason Terpko
            Votes: 3
            Watchers: 19
