  Core Server / SERVER-61125

TooManyLogicalSessions error on sharded cluster when balancer active

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 4.4.9
    • Component/s: None
    • Operating System: ALL

      This happened twice within a two-day time range.

      We have a sharded cluster with two shards (A and B), 3 mongos routers, and 3 config servers.

      On the first day (2021-10-27T20:45-0700), the shard B primary started reporting the TooManyLogicalSessions error and became basically unusable (we couldn't even run rs.stepDown() or db.shutdownServer()).

      We stopped traffic to the cluster and waited, and the machine became responsive again. We re-enabled the traffic, and the machine became unusable again (we couldn't even tell the number of active sessions at that point).
      We stopped the traffic again and got to a state where we could promote a different primary and restart the machine. This fixed the issue.
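
      For reference, the manual failover we eventually performed used the standard shell commands; the exact arguments below are illustrative, not a verbatim transcript:

      // on the stuck shard primary, once it became responsive again
      rs.stepDown(120)                            // ask the primary to step down and let another member win the election
      db.getSiblingDB("admin").shutdownServer()   // then shut the node down cleanly before restarting the machine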

      Around 24 hours after that (2021-10-28T20:45-0700), shard A experienced the same issue. On that occasion we were able to see the number of sessions (db.serverStatus().logicalSessionRecordCache.activeSessionsCount), which was around 950k.
      We stopped the traffic again and noticed that, after a few minutes, the number of sessions started going up and down: it would stay at a very low number (around 50), then suddenly jump back to 950k, then down again and up again. This continued until we decided to stop the shard balancer, which seemed to fix the issue.
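
      For reference, this is roughly how we sampled the session count from the shell on the shard primary (activeSessionsCount is the field mentioned above; dumping the whole section also shows the refresh/reap job metrics):

      var cache = db.serverStatus().logicalSessionRecordCache;
      cache.activeSessionsCount    // the value that was around 950k during the incident
      printjson(cache)             // full section, including the refresh and reaper job stats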

      We kept the balancer off for a few hours, then switched it back on, and we haven't seen the issue again yet (after around 12 hours of it being on).
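
      For completeness, stopping and re-enabling the balancer was done with the standard shell helpers:

      mongos> sh.getBalancerState()     // true if the balancer is enabled
      mongos> sh.isBalancerRunning()    // true if a balancing round is currently in progress
      mongos> sh.stopBalancer()         // what we ran during the incident
      mongos> sh.startBalancer()        // what we ran a few hours later to turn it back on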

      I think it is worth noting that we are in the middle of a very big chunk migration: shard A has >500k chunks that we are slowly moving to shard B, which started out empty. The migration has been extremely slow. To try to speed it up we have merged some chunks so that they are ~200MB each, and those are the ones currently being migrated.
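
      To give an idea of what the migration side looks like, these are the kinds of commands involved; the namespace and shard key below are placeholders, not our real ones:

      mongos> // chunk count per shard for the collection being moved ("mydb.mycoll" is a placeholder)
      mongos> db.getSiblingDB("config").chunks.aggregate([
      ...       { $match: { ns: "mydb.mycoll" } },
      ...       { $group: { _id: "$shard", chunks: { $sum: 1 } } }
      ...     ])
      mongos> // the kind of mergeChunks command we used; bounds must be existing contiguous chunk boundaries
      mongos> db.adminCommand({
      ...       mergeChunks: "mydb.mycoll",
      ...       bounds: [ { myShardKey: MinKey }, { myShardKey: 1000 } ]   // placeholder bounds
      ...     })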

      This piece of the log from shard A during the Oct 28th event seems relevant:

      {"t":{"$date":"2021-10-28T20:15:12.311-07:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"LogicalSessionCacheReap","msg":"Slow query","attr":{"type":"command","ns":"config.transactions","command":{"getMore":8936347662363425973,"collection":"transactions","$db":"config"},"originatingCommand":{"find":"transactions","filter":{"lastWriteDate":{"$lt":{"$date":"2021-10-29T02:45:08.517Z"}}},"projection":{"_id":1},"sort":{"_id":1},"readConcern":{},"$db":"config"},"planSummary":"IXSCAN { _id: 1 }","cursorid":8936347662363425973,"keysExamined":187923,"docsExamined":187923,"cursorExhausted":true,"numYields":193,"nreturned":136087,"reslen":12137122,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":194}},"Global":{"acquireCount":{"r":194}},"Database":{"acquireCount":{"r":194}},"Collection":{"acquireCount":{"r":194}},"Mutex":{"acquireCount":{"r":1}}},"readConcern":{"provenance":"clientSupplied"},"storage":{"data":{"bytesRead":88715347,"timeReadingMicros":253343}},"protocol":"op_msg","durationMillis":1057}}
      {"t":{"$date":"2021-10-28T20:18:39.894-07:00"},"s":"I",  "c":"CONTROL",  "id":20710,   "ctx":"LogicalSessionCacheRefresh","msg":"Failed to refresh session cache, will try again at the next refresh interval","attr":{"error":"WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true, writeConcern: { w: \"majority\", wtimeout: 15000, provenance: \"clientSupplied\" } } at raw_shard_brsn0"}}
      {"t":{"$date":"2021-10-28T20:21:33.303-07:00"},"s":"I",  "c":"CONTROL",  "id":20710,   "ctx":"LogicalSessionCacheRefresh","msg":"Failed to refresh session cache, will try again at the next refresh interval","attr":{"error":"WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true, writeConcern: { w: \"majority\", wtimeout: 15000, provenance: \"clientSupplied\" } } at raw_shard_brsn0"}}
      {"t":{"$date":"2021-10-28T20:27:01.972-07:00"},"s":"I",  "c":"CONTROL",  "id":20710,   "ctx":"LogicalSessionCacheRefresh","msg":"Failed to refresh session cache, will try again at the next refresh interval","attr":{"error":"WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true, writeConcern: { w: \"majority\", wtimeout: 15000, provenance: \"clientSupplied\" } } at raw_shard_brsn0"}}
      {"t":{"$date":"2021-10-28T20:30:31.718-07:00"},"s":"I",  "c":"CONTROL",  "id":20710,   "ctx":"LogicalSessionCacheRefresh","msg":"Failed to refresh session cache, will try again at the next refresh interval","attr":{"error":"WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true, writeConcern: { w: \"majority\", wtimeout: 15000, provenance: \"clientSupplied\" } } at raw_shard_brsn0"}}
      {"t":{"$date":"2021-10-28T20:36:12.032-07:00"},"s":"I",  "c":"CONTROL",  "id":20710,   "ctx":"LogicalSessionCacheRefresh","msg":"Failed to refresh session cache, will try again at the next refresh interval","attr":{"error":"WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true, writeConcern: { w: \"majority\", wtimeout: 15000, provenance: \"clientSupplied\" } } at raw_shard_brsn0"}}
      {"t":{"$date":"2021-10-28T20:40:32.434-07:00"},"s":"I",  "c":"CONTROL",  "id":20710,   "ctx":"LogicalSessionCacheRefresh","msg":"Failed to refresh session cache, will try again at the next refresh interval","attr":{"error":"WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true, writeConcern: { w: \"majority\", wtimeout: 15000, provenance: \"clientSupplied\" } } at raw_shard_brsn0"}}
      {"t":{"$date":"2021-10-28T20:50:36.166-07:00"},"s":"I",  "c":"CONTROL",  "id":20710,   "ctx":"LogicalSessionCacheRefresh","msg":"Failed to refresh session cache, will try again at the next refresh interval","attr":{"error":"WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true, writeConcern: { w: \"majority\", wtimeout: 15000, provenance: \"clientSupplied\" } } at raw_shard_brsn0"}}
      {"t":{"$date":"2021-10-28T20:50:58.783-07:00"},"s":"I",  "c":"CONTROL",  "id":20710,   "ctx":"LogicalSessionCacheRefresh","msg":"Failed to refresh session cache, will try again at the next refresh interval","attr":{"error":"WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true, writeConcern: { w: \"majority\", wtimeout: 15000, provenance: \"clientSupplied\" } } at raw_shard_brsn0"}}
      

      The other thing I want to mention is that when I run this command on a mongos (as suggested in other related Jira tickets), I get an Unauthorized error:

      mongos> db.getSiblingDB('config').system.sessions.stats()
      {
      	"ok" : 0,
      	"errmsg" : "not authorized on config to execute command { collStats: \"system.sessions\", scale: undefined, lsid: { id: UUID(\"051501ce-f5ab-4eaa-80a4-687810b8f7da\") }, $clusterTime: { clusterTime: Timestamp(1635569249, 19), signature: { hash: BinData(0, 2E85395E3928AF293BC4339EB8FBB389043E3BA7), keyId: 6962678295087284246 } }, $db: \"config\" }",
      	"code" : 13,
      	"codeName" : "Unauthorized",
      	"operationTime" : Timestamp(1635569255, 54),
      	"$clusterTime" : {
      		"clusterTime" : Timestamp(1635569255, 54),
      		"signature" : {
      			"hash" : BinData(0,"ckGeG5WkxyjzXSgDOrVk0Cd4xfE="),
      			"keyId" : NumberLong("6962678295087284246")
      		}
      	}
      }
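
      In case it's useful, a couple of alternative ways to look at config.system.sessions that might work with different privileges (I haven't verified exactly which roles are required):

      mongos> // count the persisted session records directly
      mongos> db.getSiblingDB("config").system.sessions.countDocuments({})
      mongos> // storage-level stats via the $collStats aggregation stage
      mongos> db.getSiblingDB("config").system.sessions.aggregate([ { $collStats: { storageStats: {} } } ])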

            Assignee: Eric Sedor (eric.sedor@mongodb.com)
            Reporter: Daniele Tessaro (daniele@spiketrap.io)
            Votes: 0
            Watchers: 4
