Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Duplicate
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 2.4.4, 2.5.0
Component/s: Replication
Labels:
None
Environment:
Ubuntu 12.04 on a Dell XPS13 laptop

Operating System:
ALL

When a primary node has died and a secondary has won an election, it proceeds to assume primary role. However, assumePrimary() will wait for producerThread() which is sleeping 10 seconds. The reason it is sleeping is the same reason we are assuming primary, the old primary node has disconnected.

I've attached a verbose syslog where all 3 replicaset members run on the same host, but the summary is:

port 27001 is the primary:

Jun 17 00:07:00 hingo-sputnik mongod.27001[2060]: Mon Jun 17 02:07:00.980 [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
Jun 17 00:07:00 hingo-sputnik mongod.27001[2060]: Mon Jun 17 02:07:00.980 [signalProcessingThread] shutdown: going to close listening sockets...

mongod on ports 27002 and 27003 immediately notice the connection was closed:

Jun 17 00:07:00 hingo-sputnik mongod.27002[2218]: Mon Jun 17 02:07:00.981 [rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: hingo-sputnik:27001
Jun 17 00:07:00 hingo-sputnik mongod.27003[2377]: Mon Jun 17 02:07:00.981 [rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: hingo-sputnik:27001

This causes the following to happen in producerThread():

            try {
                _producerThread();
            }
            catch (DBException& e) {
                sethbmsg(str::stream() << "db exception in producer: " << e.toString());
                sleepsecs(10);
            }

Soon after, well at most 2 seconds later, 27002 has won the election and proceeds to assume primary role

Jun 17 00:07:02 hingo-sputnik mongod.27002[2218]: Mon Jun 17 02:07:02.793 [rsMgr] replSet election succeeded, assuming primary role

In the code, we end up here, and step 2 is blocked for next 10 seconds:

    void BackgroundSync::stopReplicationAndFlushBuffer() {
        boost::unique_lock<boost::mutex> lck(_mutex);

        // 1. Tell syncing to stop
        _assumingPrimary = true;

        // 2. Wait for syncing to stop and buffer to be applied
        while (!(_pause && _appliedBuffer)) {
            _condvar.wait(lck);
        }

        // 3. Now actually become primary
        _assumingPrimary = false;
    }

And 10 seconds later, without much fanfare, 27002 actually is PRIMARY:

Jun 17 00:07:10 hingo-sputnik mongod.27002[2218]: Mon Jun 17 02:07:10.981 [rsMgr] replSet PRIMARY

It seems to me this is an unintended wait that arises from 2 threads depending on each other. The producer thread in isolation thinks "oh no, couldn't read from primary, I'll try again (10 sec) later". At the same time the assumePrimary thread correctly asks the producer to finish what it might be doing, but unfortunately all it is doing is sleeping.

Note that while there might be an argument to insert delay before electing a new primary, that is not what is happening here: the replica set members have already called for election and agreed on the winner, it's just the state change itself that is blocked for <10 seconds.

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

verbose.log
116 kB
Jun 17 2013 10:54:44 AM UTC

duplicates

SERVER-9934 Slow failovers after step down due to sleeping rsBackgroundSync

Closed

Assignee:: Unassigned

Reporter:: Henrik Ingo (Inactive)

Participants:: Henrik Ingo

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Created:: Jun 17 2013 10:53:44 AM UTC

Updated:: Jul 20 2013 02:31:25 AM UTC

Resolved:: Jun 17 2013 12:54:28 PM UTC

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates