  Core Server / SERVER-12137

Socket recv() timeout problem

    • Type: Task
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 2.4.1
    • Component/s: Sharding
    • Environment:
      Red Hat Linux

    • Description:

      We set up a 30-node MongoDB system on our cluster with two physical machines, node40 and node41, 15 on each. I tried to upload as much as 500 GB of data into this database; however, one node shut down with this error report:

      Sun Nov 24 14:58:15.910 [conn9] command admin.$cmd command: { writebacklisten: ObjectId('528f8628c868398bb45de20e') } ntoreturn:1 keyUpdates:0 reslen:262523 50686ms
      Sun Nov 24 14:58:15.973 [conn3642] waiting till out of critical section
      Sun Nov 24 14:58:25.540 [conn15] Socket recv() timeout 192.168.1.159:27020
      Sun Nov 24 14:58:25.540 [conn15] SocketException: remote: 192.168.1.159:27020 error: 9001 socket exception [3] server [192.168.1.159:27020]
      Sun Nov 24 14:58:25.540 [conn15] DBClientCursor::init call() failed
      Sun Nov 24 14:58:25.977 [conn3642] waiting till out of critical section
      Sun Nov 24 14:58:29.726 [conn15] scoped connection to node40.clus.cci.emory.edu:27020,node40.clus.cci.emory.edu:27021,node41.clus.cci.emory.edu:27020 not being returned to the pool
      Sun Nov 24 14:58:29.727 [conn15] warning: 13104 SyncClusterConnection::findOne prepare failed: 10276 DBClientBase::findN: transport error: node40.clus.cci.emory.edu:27020 ns: admin.$cmd query: { fsync: 1 } node40.clus.cci.emory.edu:27020:{}
      Sun Nov 24 14:58:29.727 [conn15] warning: moveChunk commit outcome ongoing: { applyOps: [ { op: "u", b: false, ns: "config.chunks", o: { _id: "dicomdb.fs.chunks-files_id_ObjectId('5292592d0cf23a90f681c5e0')", lastmod: Timestamp 12000|0, lastmodEpoch: ObjectId('529258b4c868398bb45e3755'), ns: "dicomdb.fs.chunks", min: { files_id: ObjectId('5292592d0cf23a90f681c5e0') }, max: { files_id: ObjectId('529259620cf23a90f681d1ec') }, shard: "dicom2" }, o2: { _id: "dicomdb.fs.chunks-files_id_ObjectId('5292592d0cf23a90f681c5e0')" } }, { op: "u", b: false, ns: "config.chunks", o: { _id: "dicomdb.fs.chunks-files_id_ObjectId('529259620cf23a90f681d1ec')", lastmod: Timestamp 12000|1, lastmodEpoch: ObjectId('529258b4c868398bb45e3755'), ns: "dicomdb.fs.chunks", min: { files_id: ObjectId('529259620cf23a90f681d1ec') }, max: { files_id: ObjectId('5292597d0cf23a90f681de3f') }, shard: "dicom1" }, o2: { _id: "dicomdb.fs.chunks-files_id_ObjectId('529259620cf23a90f681d1ec')" } } ], preCondition: [ { ns: "config.chunks", q: { query: { ns: "dicomdb.fs.chunks" }, orderby: { lastmod: -1 } }, res: { lastmod: Timestamp 11000|3 } } ] } for command :{ $err: "SyncClusterConnection::findOne prepare failed: 10276 DBClientBase::findN: transport error: node40.clus.cci.emory.edu:27020 ns: admin.$cmd query: { fsy...", code: 13104 }

      Sun Nov 24 14:58:35.980 [conn3642] waiting till out of critical section
      Sun Nov 24 14:58:38.672 [DataFileSync] flushing mmaps took 48713ms for 11 files
      Sun Nov 24 14:58:39.729 [conn15] SyncClusterConnection connecting to [node40.clus.cci.emory.edu:27020]
      Sun Nov 24 14:58:39.730 [conn15] SyncClusterConnection connecting to [node40.clus.cci.emory.edu:27021]
      Sun Nov 24 14:58:39.730 [conn15] SyncClusterConnection connecting to [node41.clus.cci.emory.edu:27020]
      Sun Nov 24 14:58:39.731 [conn15] ERROR: moveChunk commit failed: version is at11|3||000000000000000000000000 instead of 12|1||529258b4c868398bb45e3755
      Sun Nov 24 14:58:39.731 [conn15] ERROR: TERMINATING

      At the beginning, all data is inserted into the primary node, and then the balancer tries to move chunks to the other nodes. It can move chunks for a short period of time, about 10 minutes, and then it always raises this error.
      I thought it was a network issue, but the nodes can actually ping each other. Do you know how I can solve this problem, or how I can find out what is causing it?
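
      A minimal diagnostic sketch (not from the original report): the TERMINATING line above means the shard expected to commit chunk version 12|1 while the config servers still report 11|3, and the SyncClusterConnection lines suggest node40:27020, node40:27021, and node41:27020 are the three config servers. Assuming a mongos is reachable on node40 at the default port 27017 (the mongos host and port are guesses; the database, collection, and config-server addresses are taken from the log), one might check the cluster metadata and exercise the config-server port itself rather than relying on ping:

        # 1. Shard layout, balancer state, and per-shard distribution of the GridFS
        #    chunks collection (run against a mongos; host/port are assumptions)
        mongo --host node40.clus.cci.emory.edu --port 27017 --eval 'sh.status()'
        mongo --host node40.clus.cci.emory.edu --port 27017 --eval 'print(sh.getBalancerState())'
        mongo --host node40.clus.cci.emory.edu --port 27017 --eval 'db.getSiblingDB("dicomdb").getCollection("fs.chunks").getShardDistribution()'

        # 2. Which chunk version do the config servers actually hold for the collection
        #    named in the error? (mirrors the preCondition query shown in the log)
        mongo --host node40.clus.cci.emory.edu --port 27020 --eval 'printjson(db.getSiblingDB("config").chunks.find({ ns: "dicomdb.fs.chunks" }).sort({ lastmod: -1 }).limit(1).toArray())'

        # 3. ping only proves ICMP reachability; exercise the mongod port itself from
        #    the shard host that logged the Socket recv() timeout
        mongo --host node40.clus.cci.emory.edu --port 27020 --eval 'printjson(db.adminCommand({ ping: 1 }))'

      If step 2 still shows lastmod 11|3 while the migration expected 12|1, the applyOps update never took effect on the config servers, which would be consistent with the fsync transport error above.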

      I also attached the log file; you can locate the error part by searching for “failed”.
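
      For example, with standard grep (the file name is taken from the attachment listed below):

        grep -n "failed" dicom1.log   # the findOne prepare / moveChunk commit failure warnings
        grep -n "ERROR" dicom1.log    # should surface the "moveChunk commit failed" and TERMINATING lines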

        Attachments:
          1. dicom1.log (862 kB, dejun teng)

            Assignee: Unassigned
            Reporter: dejun teng (terrytdj)
            Votes: 0
            Watchers: 3
