-
Type: Bug
-
Resolution: Done
-
Priority: Major - P3
-
None
-
Affects Version/s: None
-
Component/s: Replication
-
None
-
Replication
-
ALL
In a master-slave configuration, slaves (fatally) halt replication if the oplog getmore returns an $err response. Instead, slaves should retry the oplog query.
The master will return an $err response to an oplog getmore under a number of conditions, most notably if a closeAllDatabases command was just issued (this can be evidenced during dropdb_race.js test failures under the small_oplog passthrough).
This issue does not affect replica set secondaries (secondaries correctly retry the oplog query when this condition occurs).
Reproduce with the following script, which forces an $err response with a killOp (failure observed on line 14, after replication halts).
var rt = new ReplTest("foo"); var m = rt.start(true); var s = rt.start(false); var db = m.getDB("test") for(var i=0; i<5; i++) { for(var j=0; j<1000; j++) { db.foo.insert({}); } var x = db.currentOp({ns:'local.oplog.$main', op:'getmore'}).inprog; if (x.length) { db.killOp(x[0].opid) } assert.writeOK(db.foo.insert({}, {writeConcern:{w:2, wtimeout:20000}})); }
Log output for the above script, showing the failure:
m31001| 2014-03-25T18:09:48.560-0400 [replslave] resync: cloning database test to get an initial copy m31000| 2014-03-25T18:09:48.560-0400 [initandlisten] connection accepted from 127.0.0.1:57289 #3 (3 connections now open) m31001| 2014-03-25T18:09:48.560-0400 [FileAllocator] allocating new datafile /data/db/foo-slave/test.ns, filling with zeroes... m31001| 2014-03-25T18:09:48.627-0400 [FileAllocator] done allocating datafile /data/db/foo-slave/test.ns, size: 16MB, took 0.066 secs m31001| 2014-03-25T18:09:48.652-0400 [FileAllocator] allocating new datafile /data/db/foo-slave/test.0, filling with zeroes... m31001| 2014-03-25T18:09:48.713-0400 [FileAllocator] done allocating datafile /data/db/foo-slave/test.0, size: 16MB, took 0.06 secs m31001| 2014-03-25T18:09:48.740-0400 [replslave] build index on: test.foo properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "test.foo" } m31001| 2014-03-25T18:09:48.741-0400 [replslave] building index using bulk method m31001| 2014-03-25T18:09:48.741-0400 [replslave] build index done. scanned 114 total records. 0.001 secs m31001| 2014-03-25T18:09:48.741-0400 [replslave] resync: done with initial clone for db: test m31000| 2014-03-25T18:09:48.741-0400 [conn3] end connection 127.0.0.1:57289 (2 connections now open) m31000| 2014-03-25T18:09:49.746-0400 [slaveTracking] build index on: local.slaves properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "local.slaves" } m31000| 2014-03-25T18:09:49.747-0400 [slaveTracking] added index to empty collection m31000| 2014-03-25T18:09:55.100-0400 [conn1] going to kill op: op: 3916.0 m31000| 2014-03-25T18:09:55.105-0400 [conn2] getmore local.oplog.$main cursorid:19256671971 ntoreturn:0 keyUpdates:0 exception: operation was interrupted code:11601 numYields:0 locks(micros) r:81 nreturned:1 reslen:71 20ms m31001| 2014-03-25T18:09:55.105-0400 [replslave] sync error: problem querying remote oplog record m31001| 2014-03-25T18:09:55.105-0400 [replslave] op: { $err: "operation was interrupted", code: 11601 } m31001| 2014-03-25T18:09:55.105-0400 [replslave] halting replication m31001| 2014-03-25T18:09:55.105-0400 [replslave] caught SyncException m31001| 2014-03-25T18:09:55.105-0400 [replslave] repl: sleep 10 sec before next pass m31001| 2014-03-25T18:10:05.107-0400 [replslave] all sources dead: sync error: no ts found querying remote oplog record, sleeping for 5 seconds m31000| 2014-03-25T18:10:05.107-0400 [conn2] killcursors: found 0 of 1 m31000| 2014-03-25T18:10:05.107-0400 [conn2] end connection 127.0.0.1:57288 (1 connection now open) m31001| 2014-03-25T18:10:10.108-0400 [replslave] all sources dead: sync error: no ts found querying remote oplog record, sleeping for 5 seconds m31000| 2014-03-25T18:10:15.106-0400 [conn1] command test.$cmd command: insert { insert: "foo", documents: [ { _id: ObjectId('5331feb3cd9f37898e100a9f') } ], ordered: true, writeConcern: { w: 2.0, wtimeout: 20000.0 } } keyUpdates:0 numYields:0 locks(micros) w:3 reslen:161 20001ms m31001| 2014-03-25T18:10:15.108-0400 [replslave] all sources dead: sync error: no ts found querying remote oplog record, sleeping for 5 seconds assert: write concern failed with errors: { "nInserted" : 1, "writeConcernError" : { "code" : 64, "errInfo" : { "wtimeout" : true }, "errmsg" : "waiting for replication timed out" } } Error: write concern failed with errors: { "nInserted" : 1, "writeConcernError" : { "code" : 64, "errInfo" : { "wtimeout" : true }, "errmsg" : "waiting for replication timed out" } } at Error (<anonymous>) at doassert (src/mongo/shell/assert.js:11:14) at Function.assert.writeOK (src/mongo/shell/assert.js:363:13) at foo.js:14:10 2014-03-25T18:10:15.154-0400 Error: write concern failed with errors: { "nInserted" : 1, "writeConcernError" : { "code" : 64, "errInfo" : { "wtimeout" : true }, "errmsg" : "waiting for replication timed out" } } at src/mongo/shell/assert.js:13
- related to
-
SERVER-13435 dropdb_race.js in small_oplog (causes replication to halt on slave)
- Closed