-
Type: Bug
-
Resolution: Cannot Reproduce
-
Priority: Major - P3
-
None
-
Affects Version/s: 4.4.5
-
Component/s: None
-
ALL
MongoDB version : v4.4.5
MongoDB Java Driver Version: mongodb-driver-sync 4.1.2
A transaction T1 failed to commit due to the WritableServerSelector-Timeout exception. However, another transaction T2 read the updates of T1. I use the withTransaction API and T1 retries for many times.
The exception T1 throws are:
2022-08-25 00:35:04,237{GMT} WARN [jepsen worker 46] jepsen.mongodb.rw-register: #error { :cause Timed out after 1000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, servers=[{address=public-cd-a5.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=34.4 ms, state=CONNECTED}, {address=public-cd-a1.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=156.8 ms, state=CONNECTED}, {address=public-cd-a4.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=38.3 ms, state=CONNECTED}, {address=public-cd-a2.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=35.9 ms, state=CONNECTED}, {address=public-cd-a3.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=35.1 ms, state=CONNECTED}] :via [{:type com.mongodb.MongoTimeoutException :message Timed out after 1000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, servers=[{address=public-cd-a5.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=34.4 ms, state=CONNECTED}, {address=public-cd-a1.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=156.8 ms, state=CONNECTED}, {address=public-cd-a4.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=38.3 ms, state=CONNECTED}, {address=public-cd-a2.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=35.9 ms, state=CONNECTED}, {address=public-cd-a3.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=35.1 ms, state=CONNECTED}] :at [com.mongodb.internal.connection.BaseCluster createTimeoutException BaseCluster.java 407]}] :trace [[com.mongodb.internal.connection.BaseCluster createTimeoutException BaseCluster.java 407] [com.mongodb.internal.connection.BaseCluster selectServer BaseCluster.java 118] [com.mongodb.internal.connection.AbstractMultiServerCluster selectServer AbstractMultiServerCluster.java 52] [com.mongodb.internal.binding.ClusterBinding$ClusterBindingConnectionSource <init> ClusterBinding.java 103] [com.mongodb.internal.binding.ClusterBinding$ClusterBindingConnectionSource <init> ClusterBinding.java 99] [com.mongodb.internal.binding.ClusterBinding getWriteConnectionSource ClusterBinding.java 91] [com.mongodb.client.internal.ClientSessionBinding getWriteConnectionSource ClientSessionBinding.java 93] [com.mongodb.internal.operation.OperationHelper withReleasableConnection OperationHelper.java 618] [com.mongodb.internal.operation.CommandOperationHelper executeRetryableCommand CommandOperationHelper.java 705] [com.mongodb.internal.operation.TransactionOperation execute TransactionOperation.java 69] [com.mongodb.internal.operation.CommitTransactionOperation execute CommitTransactionOperation.java 133] [com.mongodb.internal.operation.CommitTransactionOperation execute CommitTransactionOperation.java 54] [com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor execute MongoClientDelegate.java 195] [com.mongodb.client.internal.ClientSessionImpl commitTransaction ClientSessionImpl.java 129] [com.mongodb.client.internal.ClientSessionImpl withTransaction ClientSessionImpl.java 210] [jdk.internal.reflect.GeneratedMethodAccessor29 invoke nil -1] [jdk.internal.reflect.DelegatingMethodAccessorImpl invoke DelegatingMethodAccessorImpl.java 43] [java.lang.reflect.Method invoke Method.java 566] [clojure.lang.Reflector invokeMatchingMethod Reflector.java 167] [clojure.lang.Reflector invokeInstanceMethod Reflector.java 102] [jepsen.mongodb.rw_register.Client$fn__2566 invoke rw_register.clj 137] [jepsen.mongodb.rw_register.Client invoke_BANG_ rw_register.clj 128] [jepsen.client.Validate invoke_BANG_ client.clj 83] [jepsen.generator.interpreter.ClientWorker invoke_BANG_ interpreter.clj 62] [jepsen.generator.interpreter$spawn_worker$fn__9422$fn__9423 invoke interpreter.clj 137] [jepsen.generator.interpreter$spawn_worker$fn__9422 invoke interpreter.clj 120] [clojure.core$binding_conveyor_fn$fn__5739 invoke core.clj 2030] [clojure.lang.AFn call AFn.java 18] [java.util.concurrent.FutureTask run FutureTask.java 264] [java.util.concurrent.ThreadPoolExecutor runWorker ThreadPoolExecutor.java 1128] [java.util.concurrent.ThreadPoolExecutor$Worker run ThreadPoolExecutor.java 628] [java.lang.Thread run Thread.java 829]]}
The core message is
Client view of cluster state is {type=REPLICA_SET, servers=[ {address=public-cd-a5.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=34.4 ms, state=CONNECTED}, {address=public-cd-a1.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=156.8 ms, state=CONNECTED}, {address=public-cd-a4.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=38.3 ms, state=CONNECTED}, {address=public-cd-a2.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=35.9 ms, state=CONNECTED}, {address=public-cd-a3.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=35.1 ms, state=CONNECTED} ]
I have a mongo replica-set cluster consisting of 5 nodes. The primary is public-cd-a1.disalg.cn:37017. However, in this exception, they are all secondaries in the client's view. I have checked the mongod.log and there are no network issues or leader re-election. Node a1 is always the primary.
Moreover, there are other types of nodes in the client's view except REPLICA_SET_SECONDARY. In this exception, I found REPLICA_SET_OTHER and UNKNOWN.
:cause Timed out after 1000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, servers=[ {address=public-cd-a5.disalg.cn:37017, type=, roundTripTime=37.0 ms, state=CONNECTED}, {address=public-cd-a1.disalg.cn:37017, type=UNKNOWN, state=CONNECTING}, {address=public-cd-a4.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=822.4 ms, state=CONNECTED}, {address=public-cd-a2.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=34.1 ms, state=CONNECTED}, {address=public-cd-a3.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=195.9 ms, state=CONNECTED}]
This issue makes us can not determine whether a transaction is successfully committed. Why does the client see a wrong view of the cluster state?