Core Server / SERVER-14759

Splitting very close to an existing double-precision value causes missing chunks

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: 2.7.4
    • Component/s: Sharding
    • ALL
    • Sharding 2019-09-09

      Consider a shard key whose values are double-precision floats (i.e. "numbers" in JavaScript). If a split is attempted at a point very close to an existing chunk min or max (i.e. at a double that is "adjacent" to it, or nearly so), the chunk metadata becomes corrupted. Specifically, at least one chunk goes missing, leaving a gap in the chunk ranges; because the config metadata is then invalid, it cannot be loaded for that collection. Only certain values trigger the problem.
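      To make "gap" concrete, here is a minimal sketch of the contiguity property a valid chunk set must satisfy: sorted by min, the chunks must start at MinKey, end at MaxKey, and each chunk's max must equal the next chunk's min. `chunk_problems`, `MIN_KEY`, and `MAX_KEY` are hypothetical names, and infinities stand in for MinKey/MaxKey on the assumption that no shard-key value is infinite. This is an illustration, not MongoDB's actual metadata loader.

```python
import math

# Stand-ins for MongoDB's MinKey/MaxKey (assumes no shard-key value is infinite).
MIN_KEY = -math.inf
MAX_KEY = math.inf


def chunk_problems(chunks):
    """Return reasons why a list of (min, max) chunks does not tile [MinKey, MaxKey)."""
    if not chunks:
        return ["no chunks"]
    ordered = sorted(chunks)
    problems = []
    if ordered[0][0] != MIN_KEY:
        problems.append("first chunk does not start at MinKey")
    if ordered[-1][1] != MAX_KEY:
        problems.append("last chunk does not end at MaxKey")
    # Each chunk's max must be exactly the next chunk's min.
    for (_, prev_max), (next_min, _) in zip(ordered, ordered[1:]):
        if prev_max != next_min:
            problems.append(f"gap or overlap between {prev_max!r} and {next_min!r}")
    return problems


# Chunks left behind by test1 (split at 1, then at 1.0000000000000002).
print(chunk_problems([(MIN_KEY, 1.0), (1.0000000000000002, MAX_KEY)]))
# ['gap or overlap between 1.0 and 1.0000000000000002']

# Chunks left behind by test11 (split at 1.0000000000000002, then at 1).
print(chunk_problems([(MIN_KEY, 1.0), (1.0, 1.0000000000000002)]))
# ['last chunk does not end at MaxKey']
```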

      The exact outcome depends on whether the split point is just above or just below the existing chunk endpoint.

      • Just larger than the existing endpoint: the split command fails (ok: 0) and the "left" chunk (chunk B) is missing:
                   existing endpoint     new split point
        ...------------------------|-----|------------------------...
                    Chunk A         Chunk       Chunk C
                                      B
                                  (missing)
        
      • Just smaller than the existing endpoint: the split command succeeds (ok: 1) and the chunk is correctly split into chunks A and B, but the "subsequent" chunk (chunk C) is missing:
                     new split point     existing endpoint
        ...------------------------|-----|------------------------...
                    Chunk A         Chunk       Chunk C
                                      B        (missing)
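      One mechanism that would produce exactly these two symptoms can be sketched as a purely hypothetical model; this report does not confirm it. The model assumes that each chunk document's `_id` is derived from the chunk's min value rendered with about 16 significant digits (the `_id` strings in the results, such as `test1.test1-field_1.0` on the chunk whose min is 1.0000000000000002, are consistent with this), and that a split writes both resulting chunks by `_id`, so a colliding `_id` silently overwrites an existing chunk. `chunk_id`, `split`, and `run` are illustrative names, not MongoDB code; infinities stand in for MinKey/MaxKey.

```python
import math

# HYPOTHETICAL model, not MongoDB's implementation.
MIN_KEY = -math.inf  # stand-in for MinKey
MAX_KEY = math.inf   # stand-in for MaxKey


def chunk_id(min_value):
    """Assumed lossy _id derivation: min value rendered to 16 significant digits."""
    return "MinKey" if min_value == MIN_KEY else "%.16g" % min_value


def split(chunks, point):
    """Split the chunk containing `point`, writing both halves keyed by _id."""
    for lo, hi in list(chunks.values()):
        if lo < point < hi:
            chunks[chunk_id(lo)] = (lo, point)
            chunks[chunk_id(point)] = (point, hi)  # may overwrite another chunk
            return
    raise ValueError("no chunk strictly contains the split point")


def run(first, second):
    """Start with one chunk covering everything, then split twice."""
    chunks = {chunk_id(MIN_KEY): (MIN_KEY, MAX_KEY)}
    split(chunks, first)
    split(chunks, second)
    return sorted(chunks.values())


# Just larger than the existing endpoint: chunk B [1, 1.0000000000000002) is lost.
print(run(1.0, 1.0000000000000002))
# [(-inf, 1.0), (1.0000000000000002, inf)]

# Just smaller than the existing endpoint: chunk C [1.0000000000000002, MaxKey) is lost.
print(run(1.0000000000000002, 1.0))
# [(-inf, 1.0), (1.0, 1.0000000000000002)]

# Far enough apart that the rendered _ids differ: all three chunks survive.
print(run(1.0, 1.0000000000000007))
```

      Under these assumptions the first call loses chunk B and the second loses chunk C, matching the two diagrams; the model does not account for the first case also returning ok: 0.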
        

      The attached reproducer shows some values that work and some that don't. Each test splits at two double-precision values in turn. Since some combinations work and others don't, something specific about the actual values (or the difference between them) must be causing the failure. Whenever the "A then B" case fails, "B then A" also fails, though with the different symptoms described above.

      Here are the results. Every test is expected to pass (or at least not to leave an invalid config).

      test1: *** FAILED ***: 1 then 1.0000000000000002: [ "(second) split not ok", "(second) wrong chunk count", "(second) gaps" ]
              {  "_id" : "test1.test1-field_MinKey",  "lastmod" : Timestamp(1, 1),  "lastmodEpoch" : ObjectId("53db166ec333c70bae888422"),  "ns" : "test1.test1",  "min" : {  "field" : { "$minKey" : 1 } },  "max" : {  "field" : 1 },  "shard" : "shard0000" }
              {  "_id" : "test1.test1-field_1.0",  "lastmod" : Timestamp(1, 4),  "lastmodEpoch" : ObjectId("53db166ec333c70bae888422"),  "ns" : "test1.test1",  "min" : {  "field" : 1.0000000000000002 },  "max" : {  "field" : { "$maxKey" : 1 } },  "shard" : "shard0000" }
      test2: *** FAILED ***: 1 then 1.0000000000000004: [ "(second) split not ok", "(second) wrong chunk count", "(second) gaps" ]
              {  "_id" : "test2.test2-field_MinKey",  "lastmod" : Timestamp(1, 1),  "lastmodEpoch" : ObjectId("53db166ec333c70bae888427"),  "ns" : "test2.test2",  "min" : {  "field" : { "$minKey" : 1 } },  "max" : {  "field" : 1 },  "shard" : "shard0000" }
              {  "_id" : "test2.test2-field_1.0",  "lastmod" : Timestamp(1, 4),  "lastmodEpoch" : ObjectId("53db166ec333c70bae888427"),  "ns" : "test2.test2",  "min" : {  "field" : 1.0000000000000004 },  "max" : {  "field" : { "$maxKey" : 1 } },  "shard" : "shard0000" }
      test3: passed: 1 then 1.0000000000000007
      test4: passed: 1 then 1.0000000000000009
      test5: *** FAILED ***: 1.0000000000000002 then 1.0000000000000004: [ "(second) split not ok", "(second) wrong chunk count", "(second) gaps" ]
              {  "_id" : "test5.test5-field_MinKey",  "lastmod" : Timestamp(1, 1),  "lastmodEpoch" : ObjectId("53db166fc333c70bae888436"),  "ns" : "test5.test5",  "min" : {  "field" : { "$minKey" : 1 } },  "max" : {  "field" : 1.0000000000000002 },  "shard" : "shard0000" }
              {  "_id" : "test5.test5-field_1.0",  "lastmod" : Timestamp(1, 4),  "lastmodEpoch" : ObjectId("53db166fc333c70bae888436"),  "ns" : "test5.test5",  "min" : {  "field" : 1.0000000000000004 },  "max" : {  "field" : { "$maxKey" : 1 } },  "shard" : "shard0000" }
      test6: passed: 1.0000000000000002 then 1.0000000000000007
      test7: passed: 1.0000000000000002 then 1.0000000000000009
      test8: passed: 1.0000000000000004 then 1.0000000000000007
      test9: passed: 1.0000000000000004 then 1.0000000000000009
      test10: *** FAILED ***: 1.0000000000000007 then 1.0000000000000009: [ "(second) split not ok", "(second) wrong chunk count", "(second) gaps" ]
              {  "_id" : "test10.test10-field_MinKey",  "lastmod" : Timestamp(1, 1),  "lastmodEpoch" : ObjectId("53db166fc333c70bae88844f"),  "ns" : "test10.test10",  "min" : {  "field" : { "$minKey" : 1 } },  "max" : {  "field" : 1.0000000000000007 },  "shard" : "shard0000" }
              {  "_id" : "test10.test10-field_1.000000000000001",  "lastmod" : Timestamp(1, 4),  "lastmodEpoch" : ObjectId("53db166fc333c70bae88844f"),  "ns" : "test10.test10",  "min" : {  "field" : 1.0000000000000009 },  "max" : {  "field" : { "$maxKey" : 1 } },  "shard" : "shard0000" }
      test11: *** FAILED ***: 1.0000000000000002 then 1: [ "(second) wrong chunk count", "(second) bad min/max chunk" ]
              {  "_id" : "test11.test11-field_MinKey",  "lastmod" : Timestamp(1, 3),  "lastmodEpoch" : ObjectId("53db166fc333c70bae888454"),  "ns" : "test11.test11",  "min" : {  "field" : { "$minKey" : 1 } },  "max" : {  "field" : 1 },  "shard" : "shard0000" }
              {  "_id" : "test11.test11-field_1.0",  "lastmod" : Timestamp(1, 4),  "lastmodEpoch" : ObjectId("53db166fc333c70bae888454"),  "ns" : "test11.test11",  "min" : {  "field" : 1 },  "max" : {  "field" : 1.0000000000000002 },  "shard" : "shard0000" }
      test12: *** FAILED ***: 1.0000000000000004 then 1: [ "(second) wrong chunk count", "(second) bad min/max chunk" ]
              {  "_id" : "test12.test12-field_MinKey",  "lastmod" : Timestamp(1, 3),  "lastmodEpoch" : ObjectId("53db1670c333c70bae888459"),  "ns" : "test12.test12",  "min" : {  "field" : { "$minKey" : 1 } },  "max" : {  "field" : 1 },  "shard" : "shard0000" }
              {  "_id" : "test12.test12-field_1.0",  "lastmod" : Timestamp(1, 4),  "lastmodEpoch" : ObjectId("53db1670c333c70bae888459"),  "ns" : "test12.test12",  "min" : {  "field" : 1 },  "max" : {  "field" : 1.0000000000000004 },  "shard" : "shard0000" }
      test13: passed: 1.0000000000000007 then 1
      test14: passed: 1.0000000000000009 then 1
      test15: *** FAILED ***: 1.0000000000000004 then 1.0000000000000002: [ "(second) wrong chunk count", "(second) bad min/max chunk" ]
              {  "_id" : "test15.test15-field_MinKey",  "lastmod" : Timestamp(1, 3),  "lastmodEpoch" : ObjectId("53db1670c333c70bae888468"),  "ns" : "test15.test15",  "min" : {  "field" : { "$minKey" : 1 } },  "max" : {  "field" : 1.0000000000000002 },  "shard" : "shard0000" }
              {  "_id" : "test15.test15-field_1.0",  "lastmod" : Timestamp(1, 4),  "lastmodEpoch" : ObjectId("53db1670c333c70bae888468"),  "ns" : "test15.test15",  "min" : {  "field" : 1.0000000000000002 },  "max" : {  "field" : 1.0000000000000004 },  "shard" : "shard0000" }
      test16: passed: 1.0000000000000007 then 1.0000000000000002
      test17: passed: 1.0000000000000009 then 1.0000000000000002
      test18: passed: 1.0000000000000007 then 1.0000000000000004
      test19: passed: 1.0000000000000009 then 1.0000000000000004
      test20: *** FAILED ***: 1.0000000000000009 then 1.0000000000000007: [ "(second) wrong chunk count", "(second) bad min/max chunk" ]
              {  "_id" : "test20.test20-field_MinKey",  "lastmod" : Timestamp(1, 3),  "lastmodEpoch" : ObjectId("53db1671c333c70bae888481"),  "ns" : "test20.test20",  "min" : {  "field" : { "$minKey" : 1 } },  "max" : {  "field" : 1.0000000000000007 },  "shard" : "shard0000" }
              {  "_id" : "test20.test20-field_1.000000000000001",  "lastmod" : Timestamp(1, 4),  "lastmodEpoch" : ObjectId("53db1671c333c70bae888481"),  "ns" : "test20.test20",  "min" : {  "field" : 1.0000000000000007 },  "max" : {  "field" : 1.0000000000000009 },  "shard" : "shard0000" }
      test21: *** FAILED ***: -4204176258327475000 then -4204176258327474700: [ "(second) split not ok", "(second) wrong chunk count", "(second) gaps" ]
              {  "_id" : "test21.test21-field_MinKey",  "lastmod" : Timestamp(1, 1),  "lastmodEpoch" : ObjectId("53db1671c333c70bae888486"),  "ns" : "test21.test21",  "min" : {  "field" : { "$minKey" : 1 } },  "max" : {  "field" : -4204176258327475000 },  "shard" : "shard0000" }
              {  "_id" : "test21.test21-field_-4.204176258327475e+18",  "lastmod" : Timestamp(1, 4),  "lastmodEpoch" : ObjectId("53db1671c333c70bae888486"),  "ns" : "test21.test21",  "min" : {  "field" : -4204176258327474700 },  "max" : {  "field" : { "$maxKey" : 1 } },  "shard" : "shard0000" }
      test22: *** FAILED ***: -4204176258327474700 then -4204176258327475000: [ "(second) wrong chunk count", "(second) bad min/max chunk" ]
              {  "_id" : "test22.test22-field_MinKey",  "lastmod" : Timestamp(1, 3),  "lastmodEpoch" : ObjectId("53db1671c333c70bae88848b"),  "ns" : "test22.test22",  "min" : {  "field" : { "$minKey" : 1 } },  "max" : {  "field" : -4204176258327475000 },  "shard" : "shard0000" }
              {  "_id" : "test22.test22-field_-4.204176258327475e+18",  "lastmod" : Timestamp(1, 4),  "lastmodEpoch" : ObjectId("53db1671c333c70bae88848b"),  "ns" : "test22.test22",  "min" : {  "field" : -4204176258327475000 },  "max" : {  "field" : -4204176258327474700 },  "shard" : "shard0000" }
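      The "something specific about the values" can be checked against the whole result table under one hypothesis: if chunk `_id`s are built from a roughly 16-significant-digit rendering of the split value (an assumption suggested by the `_id` strings in the output, not a confirmed cause), then a pair should fail exactly when both values render to the same string. The sketch below uses Python's `%.16g`, which omits the `.0` suffix seen in the real `_id`s but appears to group the values the same way. `FAILED` is transcribed from tests 1 to 20, where each unordered pair fails in both orders or in neither; `render` is a hypothetical helper.

```python
from itertools import combinations

# Split values used by tests 1-20.
VALUES = [1.0, 1.0000000000000002, 1.0000000000000004,
          1.0000000000000007, 1.0000000000000009]

# Unordered pairs that failed, transcribed from the results.
FAILED = {
    (1.0, 1.0000000000000002),
    (1.0, 1.0000000000000004),
    (1.0000000000000002, 1.0000000000000004),
    (1.0000000000000007, 1.0000000000000009),
}


def render(x):
    """Assumed lossy rendering: 16 significant digits, as with C's %.16g."""
    return "%.16g" % x


mismatches = []
for a, b in combinations(VALUES, 2):
    collides = render(a) == render(b)
    failed = (a, b) in FAILED
    print(f"{render(a):>17} {render(b):>17}  collides={collides}  failed={failed}")
    if collides != failed:
        mismatches.append((a, b))

# Tests 21/22 use two neighbouring doubles near -4.2e18; they render identically too.
print(render(-4204176258327475000.0), render(-4204176258327474700.0))
print("pairs where the hypothesis disagrees with the results:", mismatches)
```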
      

      The test case values are:

      double              IEEE 754 bits (hex)   bits as 64-bit integer
      1.0000000000000000  0x3ff0000000000000    4607182418800017408
      1.0000000000000002  0x3ff0000000000001    4607182418800017409
      1.0000000000000004  0x3ff0000000000002    4607182418800017410
      1.0000000000000007  0x3ff0000000000003    4607182418800017411
      1.0000000000000009  0x3ff0000000000004    4607182418800017412
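      The hex and integer columns can be regenerated by reinterpreting each double's 8 bytes as a 64-bit integer, which shows that the test values are consecutive representable doubles (their bit patterns differ by 1). This is a standalone sketch using Python's standard `struct` module; `bits` is a hypothetical helper name, and `math.nextafter` requires Python 3.9 or later.

```python
import math
import struct

VALUES = [1.0, 1.0000000000000002, 1.0000000000000004,
          1.0000000000000007, 1.0000000000000009]


def bits(x):
    """Reinterpret the IEEE 754 binary64 encoding of x as a signed 64-bit int."""
    return struct.unpack("<q", struct.pack("<d", x))[0]


# Reproduce the table: value, bit pattern in hex, bit pattern as an integer.
for v in VALUES:
    print(f"{v:.16f}  {bits(v):#018x}  {bits(v)}")

# Each value is the next representable double after the previous one.
for lo, hi in zip(VALUES, VALUES[1:]):
    assert math.nextafter(lo, math.inf) == hi

# The magnitudes used in tests 21/22 are also neighbouring doubles.
print(bits(4204176258327475000.0) - bits(4204176258327474700.0))  # 1
```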

            Assignee: Janna Golden (janna.golden@mongodb.com)
            Reporter: Kevin Pulo (kevin.pulo@mongodb.com)
            Votes: 1
            Watchers: 11
