Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-38798

shard registry busy-loops in commitTransaction

    • Type: Icon: Bug Bug
    • Resolution: Duplicate
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: None
    • Component/s: Sharding
    • None
    • ALL
    • Sharding 2019-01-28
    • 9

      If a node is down, the shard registry will hammer retries with no backoff. This generates so much system log that it overloads logkeeper and makes it difficult to diagnose problems:

      [ShardedClusterFixture:job0:shard1:secondary1] 2018-12-23T23:49:10.133+0000 I REPL     [replexec-4] Member localhost:20004 is now in state RS_DOWN
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 D TXN      [ShardRegistry] Coordinator shard got response {} for { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinD
      ata(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard-rs1
      [ShardedClusterFixture:job0:shard1:secondary1] 2018-12-23T23:49:10.133+0000 D REPL_HB  [replexec-4] Scheduling heartbeat to localhost:20004 at 2018-12-23T23:49:10.633Z
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 D TXN      [ShardRegistry] Coordinator shard received a retryable error in response to { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9
      d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } }from shard shard-rs1 :: caused by :: HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) 
      :: caused by :: Connection refused, resending command.
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 D TXN      [TransactionCoordinatorService-6] Coordinator going to send command { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37
      e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard shard-rs1
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 I ASIO     [ShardRegistry] Connecting to localhost:20004
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 I ASIO     [ShardRegistry] Failed to connect to localhost:20004 - HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 I ASIO     [ShardRegistry] Dropping all pooled connections to localhost:20004 due to HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 D TXN      [ShardRegistry] Coordinator shard got response {} for { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinD
      ata(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard-rs1
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 D TXN      [ShardRegistry] Coordinator shard received a retryable error in response to { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } }from shard shard-rs1 :: caused by :: HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused, resending command.
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 D TXN      [TransactionCoordinatorService-0] Coordinator going to send command { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard shard-rs1
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 I ASIO     [ShardRegistry] Connecting to localhost:20004
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 I ASIO     [ShardRegistry] Failed to connect to localhost:20004 - HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 I ASIO     [ShardRegistry] Dropping all pooled connections to localhost:20004 due to HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 D TXN      [ShardRegistry] Coordinator shard got response {} for { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard-rs1
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 D TXN      [ShardRegistry] Coordinator shard received a retryable error in response to { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } }from shard shard-rs1 :: caused by :: HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused, resending command.
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 D TXN      [TransactionCoordinatorService-1] Coordinator going to send command { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard shard-rs1
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 I ASIO     [ShardRegistry] Connecting to localhost:20004
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 I ASIO     [ShardRegistry] Failed to connect to localhost:20004 - HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 I ASIO     [ShardRegistry] Dropping all pooled connections to localhost:20004 due to HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 D TXN      [ShardRegistry] Coordinator shard got response {} for { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard-rs1
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 D TXN      [ShardRegistry] Coordinator shard received a retryable error in response to { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } }from shard shard-rs1 :: caused by :: HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused, resending command.
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 D TXN      [TransactionCoordinatorService-2] Coordinator going to send command { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard shard-rs1
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 I ASIO     [ShardRegistry] Connecting to localhost:20004
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 I ASIO     [ShardRegistry] Failed to connect to localhost:20004 - HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 I ASIO     [ShardRegistry] Dropping all pooled connections to localhost:20004 due to HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 D TXN      [ShardRegistry] Coordinator shard got response {} for { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard-rs1
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 D TXN      [ShardRegistry] Coordinator shard received a retryable error in response to { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } }from shard shard-rs1 :: caused by :: HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused, resending command.
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 D TXN      [TransactionCoordinatorService-5] Coordinator going to send command { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard shard-rs1
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 I ASIO     [ShardRegistry] Connecting to localhost:20004
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.135+0000 I ASIO     [ShardRegistry] Failed to connect to localhost:20004 - HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.135+0000 I ASIO     [ShardRegistry] Dropping all pooled connections to localhost:20004 due to HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.135+0000 D TXN      [ShardRegistry] Coordinator shard got response {} for { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard-rs1
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.135+0000 D TXN      [ShardRegistry] Coordinator shard received a retryable error in response to { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } }from shard shard-rs1 :: caused by :: HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused, resending command.
      [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.135+0000 D TXN      [TransactionCoordinatorService-3] Coordinator going to send command { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard shard-rs1
      

      This behavior happened in the linked BF, but it's very difficult to actually download the logs from that failure; I have pasted in a portion of the log here for reference.

            Assignee:
            kaloian.manassiev@mongodb.com Kaloian Manassiev
            Reporter:
            milkie@mongodb.com Eric Milkie
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Created:
              Updated:
              Resolved: