Core Server / SERVER-28172

Renaming collection causes permanent performance issue on secondary

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical - P2
    • Affects Version/s: 3.4.2
    • Component/s: None
    • Operating System: ALL

      Here's the rough pseudocode to reproduce:

      // Legacy MongoDB Java driver (DB/DBCollection API); assumes an existing DB handle "db"
      List<DBObject> rows = doBigAggregation(); // this aggregation uses disk, is in cursor mode, and returns roughly 35,000 rows
      DBCollection temp = db.getCollection("schoolusage_temp");
      temp.drop(); // just in case this process failed last night
      BulkWriteOperation bulkwrite = temp.initializeOrderedBulkOperation();
      for (DBObject row : rows) bulkwrite.insert(row); // in reality we insert in batches of 250 (see the batched sketch below)
      bulkwrite.execute();
      temp.rename("schoolusage", true); // rename with dropTarget = true

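      For completeness, a rough, untested sketch of the batched variant mentioned in the comment above, using the same placeholder doBigAggregation() and db handle as in the pseudocode:

      int batchSize = 250;
      List<DBObject> rows = doBigAggregation();
      DBCollection temp = db.getCollection("schoolusage_temp");
      temp.drop();
      for (int i = 0; i < rows.size(); i += batchSize) {
          BulkWriteOperation bulk = temp.initializeOrderedBulkOperation();
          // queue up at most 250 inserts, then flush them in one bulk write
          for (DBObject row : rows.subList(i, Math.min(i + batchSize, rows.size()))) {
              bulk.insert(row);
          }
          bulk.execute();
      }
      temp.rename("schoolusage", true);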

      Every night, we do an aggregation, followed by a bulk insert, followed by a rename with the "drop target option".
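      For reference, the rename("schoolusage", true) call corresponds to the renameCollection admin command with dropTarget: true, which drops the existing target collection as part of the rename. Roughly the following (the "thinkmap" database name is taken from the log lines below; mongoClient is an assumed connection):

      DB admin = mongoClient.getDB("admin");
      DBObject renameCmd = new BasicDBObject("renameCollection", "thinkmap.schoolusage_temp")
              .append("to", "thinkmap.schoolusage")
              .append("dropTarget", true); // drop the existing "schoolusage" before renaming
      CommandResult result = admin.command(renameCmd);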

      Without fail, roughly one second (the time it takes for the rename to propagate to the secondary) after the rename completes, our secondary goes completely haywire and performance goes out the window. Performance does not recover unless we restart the secondary.

      This behavior did not occur until we upgraded from 3.2.10 to 3.4.2.

      Once performance drops, every couple of seconds we see hundreds of very simple queries piling up. They all block and then complete simultaneously. It looks like some sort of concurrency issue.
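      A rough, untested sketch of how the blocked operations could be listed on the secondary while this is happening, assuming the currentOp admin command is available on this server version (secondaryClient is an assumed connection to the secondary):

      DB admin = secondaryClient.getDB("admin");
      // filter the currentOp output down to operations waiting on a lock
      DBObject currentOpCmd = new BasicDBObject("currentOp", true)
              .append("waitingForLock", true);
      CommandResult inprog = admin.command(currentOpCmd);
      System.out.println(inprog); // blocked operations are listed under "inprog"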

      Included are some graphs and logs that illustrate the problem.

      Server restarted at 22:00 and aggregation happened at 00:00:

      Same period in MMS:

      Here are some very simple queries taking over 6 seconds to complete:

      2017-03-02T16:02:32.034+0000 I COMMAND  [conn1328] command thinkmap.gamequestion command: find { find: "gamequestion", filter: { _id: 163258 }, projection: { workflow-writing: 0, votes: 0, ag: 0 }, limit: 1, singleBatch: true } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:740 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 2 }, timeAcquiringMicros: { r: 6288267 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_query 6356ms
      2017-03-02T16:02:32.034+0000 I COMMAND  [conn1289] command thinkmap.gamequestion command: find { find: "gamequestion", filter: { _id: 300683 }, projection: { workflow-writing: 0, votes: 0, ag: 0 }, limit: 1, singleBatch: true } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:642 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 2 }, timeAcquiringMicros: { r: 6378585 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_query 6443ms
      2017-03-02T16:02:32.034+0000 I COMMAND  [conn1166] command thinkmap.gamequestion command: find { find: "gamequestion", filter: { _id: 131553 }, projection: { workflow-writing: 0, votes: 0, ag: 0 }, limit: 1, singleBatch: true } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:526 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 2 }, timeAcquiringMicros: { r: 6555928 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_query 6623ms
      2017-03-02T16:02:32.034+0000 I COMMAND  [conn1369] command thinkmap.gamequestion command: find { find: "gamequestion", filter: { _id: 212580 }, projection: { workflow-writing: 0, votes: 0, ag: 0 }, limit: 1, singleBatch: true } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:683 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 2 }, timeAcquiringMicros: { r: 6193103 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_query 6257ms
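      For context, each of these is just a point lookup by _id with a small exclusion projection, roughly the following in driver terms (database, collection, and field names taken from the log lines above; client is an assumed connection):

      DBCollection gamequestion = client.getDB("thinkmap").getCollection("gamequestion");
      DBObject query = new BasicDBObject("_id", 163258);
      DBObject projection = new BasicDBObject("workflow-writing", 0)
              .append("votes", 0)
              .append("ag", 0);
      // IDHACK point lookup; normally sub-millisecond, but here ~6s is spent acquiring the global lock
      DBObject doc = gamequestion.findOne(query, projection);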
      

        1. bad-mongod.log.gz (5.64 MB)
        2. Screen Shot 2017-03-02 at 11.32.19 AM.png (94 kB)
        3. Screen Shot 2017-03-02 at 11.34.55 AM.png (137 kB)
        4. Screen Shot 2017-03-02 at 9.34.42 PM.png (505 kB)

            Assignee:
            Mark Agarunov (mark.agarunov)
            Reporter:
            Marc Tinkler (tinkler@vocabulary.com)
            Votes:
            0
            Watchers:
            13

              Created:
              Updated:
              Resolved: