-
Type: Bug
-
Resolution: Duplicate
-
Priority: Major - P3
-
None
-
Affects Version/s: None
-
Component/s: Sharding
-
None
-
ALL
-
Sharding 2019-01-28
-
9
If a node is down, the shard registry will hammer retries with no backoff. This generates so much system log that it overloads logkeeper and makes it difficult to diagnose problems:
[ShardedClusterFixture:job0:shard1:secondary1] 2018-12-23T23:49:10.133+0000 I REPL [replexec-4] Member localhost:20004 is now in state RS_DOWN [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 D TXN [ShardRegistry] Coordinator shard got response {} for { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinD ata(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard-rs1 [ShardedClusterFixture:job0:shard1:secondary1] 2018-12-23T23:49:10.133+0000 D REPL_HB [replexec-4] Scheduling heartbeat to localhost:20004 at 2018-12-23T23:49:10.633Z [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 D TXN [ShardRegistry] Coordinator shard received a retryable error in response to { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9 d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } }from shard shard-rs1 :: caused by :: HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused, resending command. [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 D TXN [TransactionCoordinatorService-6] Coordinator going to send command { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37 e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard shard-rs1 [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 I ASIO [ShardRegistry] Connecting to localhost:20004 [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 I ASIO [ShardRegistry] Failed to connect to localhost:20004 - HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 I ASIO [ShardRegistry] Dropping all pooled connections to localhost:20004 due to HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 D TXN [ShardRegistry] Coordinator shard got response {} for { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinD ata(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard-rs1 [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 D TXN [ShardRegistry] Coordinator shard received a retryable error in response to { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } }from shard shard-rs1 :: caused by :: HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused, resending command. [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 D TXN [TransactionCoordinatorService-0] Coordinator going to send command { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard shard-rs1 [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 I ASIO [ShardRegistry] Connecting to localhost:20004 [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 I ASIO [ShardRegistry] Failed to connect to localhost:20004 - HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 I ASIO [ShardRegistry] Dropping all pooled connections to localhost:20004 due to HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 D TXN [ShardRegistry] Coordinator shard got response {} for { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard-rs1 [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 D TXN [ShardRegistry] Coordinator shard received a retryable error in response to { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } }from shard shard-rs1 :: caused by :: HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused, resending command. [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 D TXN [TransactionCoordinatorService-1] Coordinator going to send command { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard shard-rs1 [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 I ASIO [ShardRegistry] Connecting to localhost:20004 [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 I ASIO [ShardRegistry] Failed to connect to localhost:20004 - HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 I ASIO [ShardRegistry] Dropping all pooled connections to localhost:20004 due to HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 D TXN [ShardRegistry] Coordinator shard got response {} for { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard-rs1 [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 D TXN [ShardRegistry] Coordinator shard received a retryable error in response to { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } }from shard shard-rs1 :: caused by :: HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused, resending command. [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 D TXN [TransactionCoordinatorService-2] Coordinator going to send command { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard shard-rs1 [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 I ASIO [ShardRegistry] Connecting to localhost:20004 [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 I ASIO [ShardRegistry] Failed to connect to localhost:20004 - HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 I ASIO [ShardRegistry] Dropping all pooled connections to localhost:20004 due to HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 D TXN [ShardRegistry] Coordinator shard got response {} for { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard-rs1 [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 D TXN [ShardRegistry] Coordinator shard received a retryable error in response to { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } }from shard shard-rs1 :: caused by :: HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused, resending command. [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 D TXN [TransactionCoordinatorService-5] Coordinator going to send command { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard shard-rs1 [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 I ASIO [ShardRegistry] Connecting to localhost:20004 [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.135+0000 I ASIO [ShardRegistry] Failed to connect to localhost:20004 - HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.135+0000 I ASIO [ShardRegistry] Dropping all pooled connections to localhost:20004 due to HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.135+0000 D TXN [ShardRegistry] Coordinator shard got response {} for { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard-rs1 [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.135+0000 D TXN [ShardRegistry] Coordinator shard received a retryable error in response to { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } }from shard shard-rs1 :: caused by :: HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused, resending command. [ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.135+0000 D TXN [TransactionCoordinatorService-3] Coordinator going to send command { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard shard-rs1
This behavior happened in the linked BF, but it's very difficult to actually download the logs from that failure; I have pasted in a portion of the log here for reference.
- duplicates
-
SERVER-37880 Add in backoff for retrying sending commit and abort messages
- Closed