Core Server / SERVER-17206

WT Secondaries fall behind on heavy insert workload but MMAPv1 secondaries don't

    • Type: Bug
    • Resolution: Done
    • Priority: Critical - P2
    • Fix Version/s: 3.0.0-rc9, 3.1.0
    • Affects Version/s: 3.0.0-rc8
    • Component/s: Replication, Storage
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL

      Socialite load workload (available at https://github.com/10gen-labs/socialite).

      java -jar target/socialite-0.0.1-SNAPSHOT.jar load  --users 10000000 --maxfollows 1000 --messages 2000 --threads 32 sample-config.yml
      

      Three-node replica set on c3.2xlarge instances (8 CPUs, 15 GB RAM).
      Replica set configuration:

      {
      	"_id" : "shard1",
      	"version" : 1,
      	"members" : [
      		{
      			"_id" : 1,
      			"host" : "shard1-01.knuckleboys.com:27017"
      		},
      		{
      			"_id" : 2,
      			"host" : "shard1-02.knuckleboys.com:27017"
      		},
      		{
      			"_id" : 3,
      			"host" : "shard1-03.knuckleboys.com:27017"
      		}
      	]
      }
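
      The configuration above could be applied from the mongo shell with `rs.initiate()`. This is a sketch, not the reporter's exact steps; the hostnames are the ones from this report, and the mixed-storage-engine startup lines in the comments are an assumption based on the 3.0 `--storageEngine` option.

      ```javascript
      // Sketch: applying the replica-set configuration above from the mongo shell.
      // Hostnames are taken from this report; adjust for your environment.
      var cfg = {
        _id: "shard1",
        version: 1,
        members: [
          { _id: 1, host: "shard1-01.knuckleboys.com:27017" },
          { _id: 2, host: "shard1-02.knuckleboys.com:27017" },
          { _id: 3, host: "shard1-03.knuckleboys.com:27017" }
        ]
      };
      // In a shell connected to the first node:
      //   rs.initiate(cfg);
      //
      // To reproduce the mixed-engine setup described below, each mongod would be
      // started with a different storage engine (3.0 syntax, paths hypothetical):
      //   mongod --replSet shard1 --storageEngine wiredTiger --dbpath /data/wt
      //   mongod --replSet shard1 --storageEngine mmapv1    --dbpath /data/mmap
      ```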
      


      Indexes:

      [
      	{
      		"v" : 1,
      		"key" : {
      			"_id" : 1
      		},
      		"name" : "_id_",
      		"ns" : "socialite.content"
      	},
      	{
      		"v" : 1,
      		"key" : {
      			"_a" : 1,
      			"_id" : 1
      		},
      		"name" : "_a_1__id_1",
      		"ns" : "socialite.content"
      	}
      ]
      [
      	{
      		"v" : 1,
      		"key" : {
      			"_id" : 1
      		},
      		"name" : "_id_",
      		"ns" : "socialite.followers"
      	},
      	{
      		"v" : 1,
      		"unique" : true,
      		"key" : {
      			"_f" : 1,
      			"_t" : 1
      		},
      		"name" : "_f_1__t_1",
      		"ns" : "socialite.followers"
      	}
      ]
      [
      	{
      		"v" : 1,
      		"key" : {
      			"_id" : 1
      		},
      		"name" : "_id_",
      		"ns" : "socialite.following"
      	},
      	{
      		"v" : 1,
      		"unique" : true,
      		"key" : {
      			"_f" : 1,
      			"_t" : 1
      		},
      		"name" : "_f_1__t_1",
      		"ns" : "socialite.following"
      	}
      ]
      [
      	{
      		"v" : 1,
      		"key" : {
      			"_id" : 1
      		},
      		"name" : "_id_",
      		"ns" : "socialite.users"
      	}
      ]
      


      Running the Socialite load workload (primarily writes) against a three-member replica set with a 4 GB oplog, with one secondary using MMAPv1 and the other using WiredTiger, the WiredTiger secondary falls behind while the MMAPv1 secondary keeps up.
      Typically the WiredTiger secondary gets to about 1500 s behind the primary before the lag clears; this takes about two hours to reproduce. After roughly 8-10 hours of total run time the WiredTiger secondary starts to fall behind again, and eventually falls off the tail of the oplog.
      Attached are screenshots of MMS during a recovered lag, and of timeline output (from https://github.com/10gen/support-tools) with correlated activity. The full timeline files are attached as well.
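
      Lag of the kind described above can be watched from the shell by comparing member optimes in `rs.status()`. The helper below is a sketch of that calculation; `computeLagSeconds()` is a hypothetical name, not a built-in.

      ```javascript
      // Sketch: per-secondary replication lag from an rs.status() document.
      // computeLagSeconds() is a hypothetical helper, not part of the shell.
      function computeLagSeconds(status) {
        var primary = status.members.filter(function (m) {
          return m.stateStr === "PRIMARY";
        })[0];
        return status.members
          .filter(function (m) { return m.stateStr === "SECONDARY"; })
          .map(function (m) {
            return {
              host: m.name,
              // Date subtraction yields milliseconds; convert to seconds.
              lagSecs: (primary.optimeDate - m.optimeDate) / 1000
            };
          });
      }
      // In a mongo shell connected to any member:
      //   computeLagSeconds(rs.status());
      ```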

        1. 1lag.html
          1.45 MB
          Michael Grundy
        2. 1-op-cs.log
          1.06 MB
          Michael Grundy
        3. 1-oplog-cs.html
          467 kB
          Michael Grundy
        4. 1-ss.log
          3.39 MB
          Michael Grundy
        5. 2lag.html
          554 kB
          Michael Grundy
        6. 2-op-cs.log
          136 kB
          Michael Grundy
        7. 2-oplog-cs.html
          63 kB
          Michael Grundy
        8. 2-ss.log
          2.01 MB
          Michael Grundy
        9. 3lag.html
          1.21 MB
          Michael Grundy
        10. 3-op-cs.log
          1.08 MB
          Michael Grundy
        11. 3-oplog-cs.html
          475 kB
          Michael Grundy
        12. 3-ss.log
          3.32 MB
          Michael Grundy
        13. Dashboard___MMS__MongoDB_Management_Service.png
          83 kB
          Michael Grundy
        14. mongod-1-.log.gz
          3.99 MB
          Michael Grundy
        15. mongod-2-.log.gz
          35 kB
          Michael Grundy
        16. mongod-3-.log.gz
          36 kB
          Michael Grundy
        17. node 1 WT primary timeline.png
          288 kB
          Michael Grundy
        18. node 2 MMAPv1 secondary timeline.png
          213 kB
          Michael Grundy
        19. node 3 WT secondary timeline.png
          294 kB
          Michael Grundy
        20. socialite_oplog_1G.png
          73 kB
          Darren Wood
        21. socialite-3wtnode.png
          110 kB
          Darren Wood

            Assignee:
            darren.wood@10gen.com (Darren Wood)
            Reporter:
            michael.grundy (Michael Grundy)
            Votes:
            0
            Watchers:
            11
