Type: Task
Resolution: Done
Priority: Major - P3
Affects Version/s: 2.4.1
Component/s: Sharding
Environment: Red Hat Linux
We set up a 30-node MongoDB system on our cluster, spread over two physical machines, node40 and node41, with 15 nodes on each. I tried to upload as much as 500 GB of data into this database; however, one node shut down with this error report:
Sun Nov 24 14:58:15.910 [conn9] command admin.$cmd command:
{ writebacklisten: ObjectId('528f8628c868398bb45de20e') } ntoreturn:1 keyUpdates:0 reslen:262523 50686ms
Sun Nov 24 14:58:15.973 [conn3642] waiting till out of critical section
Sun Nov 24 14:58:25.540 [conn15] Socket recv() timeout 192.168.1.159:27020
Sun Nov 24 14:58:25.540 [conn15] SocketException: remote: 192.168.1.159:27020 error: 9001 socket exception [3] server [192.168.1.159:27020]
Sun Nov 24 14:58:25.540 [conn15] DBClientCursor::init call() failed
Sun Nov 24 14:58:25.977 [conn3642] waiting till out of critical section
Sun Nov 24 14:58:29.726 [conn15] scoped connection to node40.clus.cci.emory.edu:27020,node40.clus.cci.emory.edu:27021,node41.clus.cci.emory.edu:27020 not being returned to the pool
Sun Nov 24 14:58:29.727 [conn15] warning: 13104 SyncClusterConnection::findOne prepare failed: 10276 DBClientBase::findN: transport error: node40.clus.cci.emory.edu:27020 ns: admin.$cmd query: node40.clus.cci.emory.edu:27020:{}
Sun Nov 24 14:58:29.727 [conn15] warning: moveChunk commit outcome ongoing: { applyOps: [ { op: "u", b: false, ns: "config.chunks", o: { _id: "dicomdb.fs.chunks-files_id_ObjectId('5292592d0cf23a90f681c5e0')", lastmod: Timestamp 12000|0, lastmodEpoch: ObjectId('529258b4c868398bb45e3755'), ns: "dicomdb.fs.chunks", min: , max: { files_id: ObjectId('529259620cf23a90f681d1ec') }, shard: "dicom2" }, o2: { _id: "dicomdb.fs.chunks-files_id_ObjectId('5292592d0cf23a90f681c5e0')" }}, { op: "u", b: false, ns: "config.chunks", o: { _id: "dicomdb.fs.chunks-files_id_ObjectId('529259620cf23a90f681d1ec')", lastmod: Timestamp 12000|1, lastmodEpoch: ObjectId('529258b4c868398bb45e3755'), ns: "dicomdb.fs.chunks", min: { files_id: ObjectId('529259620cf23a90f681d1ec') }, max: { files_id: ObjectId('5292597d0cf23a90f681de3f') }, shard: "dicom1" }, o2: { _id: "dicomdb.fs.chunks-files_id_ObjectId('529259620cf23a90f681d1ec')" }} ], preCondition: [ { ns: "config.chunks", q: { query: { ns: "dicomdb.fs.chunks" }, orderby: { lastmod: -1 }}, res: { lastmod: Timestamp 11000|3 }} ] } for command :{ $err: "SyncClusterConnection::findOne prepare failed: 10276 DBClientBase::findN: transport error: node40.clus.cci.emory.edu:27020 ns: admin.$cmd query: { fsy...", code: 13104 }
Sun Nov 24 14:58:35.980 [conn3642] waiting till out of critical section
Sun Nov 24 14:58:38.672 [DataFileSync] flushing mmaps took 48713ms for 11 files
Sun Nov 24 14:58:39.729 [conn15] SyncClusterConnection connecting to [node40.clus.cci.emory.edu:27020]
Sun Nov 24 14:58:39.730 [conn15] SyncClusterConnection connecting to [node40.clus.cci.emory.edu:27021]
Sun Nov 24 14:58:39.730 [conn15] SyncClusterConnection connecting to [node41.clus.cci.emory.edu:27020]
Sun Nov 24 14:58:39.731 [conn15] ERROR: moveChunk commit failed: version is at11|3||000000000000000000000000 instead of 12|1||529258b4c868398bb45e3755
Sun Nov 24 14:58:39.731 [conn15] ERROR: TERMINATING
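For context, the log above shows chunks of dicomdb.fs.chunks being migrated between shards dicom1 and dicom2 with files_id as the chunk boundary. A sharded GridFS collection like that would typically have been created along the following lines in the mongo shell; this is only a sketch, where the shard names and the files_id key are taken from the log and the shard host strings are placeholders rather than the reporter's actual configuration:
// Minimal sketch of the sharding setup implied by the log (mongo shell, 2.4-era helpers).
// Shard names dicom1/dicom2 and the files_id key appear in the log; hosts/ports are hypothetical.
sh.addShard("node40.clus.cci.emory.edu:27018")    // placeholder host for shard dicom1
sh.addShard("node41.clus.cci.emory.edu:27018")    // placeholder host for shard dicom2
sh.enableSharding("dicomdb")
// GridFS data collection; the chunk ranges in the log suggest { files_id: 1 } as the shard key
sh.shardCollection("dicomdb.fs.chunks", { files_id: 1 })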
At the beginning, all data is inserted into the primary shard, and then the balancer tries to move chunks to the other shards. It can move chunks for a short period, about 10 minutes, and then it always raises this error.
I thought it was a network issue, but the nodes can actually ping each other. Do you know how I can solve this problem, or how I can find out what is causing it?
I have also attached the log file; you can locate the error by searching for “failed”.
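To narrow down where the migration commit is failing, a few shell checks are usually informative. This is only a sketch; the three config server addresses are copied from the SyncClusterConnection string in the log above, not from the actual --configdb setting:
// Run from a mongo shell connected to a mongos.
// 1. Confirm each config server is reachable (the transport error above points at node40:27020).
new Mongo("node40.clus.cci.emory.edu:27020").getDB("admin").runCommand({ ping: 1 })
new Mongo("node40.clus.cci.emory.edu:27021").getDB("admin").runCommand({ ping: 1 })
new Mongo("node41.clus.cci.emory.edu:27020").getDB("admin").runCommand({ ping: 1 })
// 2. Review recent migration activity and errors recorded in the config database.
db.getSiblingDB("config").changelog.find({ what: /moveChunk/ }).sort({ time: -1 }).limit(10).pretty()
// 3. Check the current chunk distribution and whether the balancer is enabled.
sh.status()
sh.getBalancerState()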
is related to: SERVER-10458 Sanity check on "from" side that all cloned and modified documents were sent to TO side on commit (Closed)