Uploaded image for project: 'C Driver'
  1. C Driver
  2. CDRIVER-4231

Duplicates in ObjectID when inserting in parallel

    • Type: Icon: Task Task
    • Resolution: Fixed
    • Priority: Icon: Unknown Unknown
    • 1.21.0
    • Affects Version/s: None
    • Component/s: None
    • None

      Prior to driver version 1.14, ObjectID included 3 bytes identifying the machine and 2 bytes denoting PID. This made duplicate ID's impossible. Following 1.14 (and CDRIVER-2771) we see users bulk loading in parallel getting the error:

      Error from Mongo database on operation 'bulk operation execute': E11000 duplicate key error collection: gfxdb.Rpt_AggregatedExposure_HU index: id dup key: { : ObjectId('60ab881bb7b6e778e76e34a2') }

      The new algorithm includes a 5-byte random value generated once per process. This random value is meant to be unique to the machine and process. My testing shows that different PIDs are capable of creating the same 5-byte random value. 

      An additional mechanism for avoiding duplicates is that the 3-byte counter at the end of the ObjectID which starts at a random number. In the cases of duplicate 5-byte situation, the counter number starts at the same number.

      To demonstrate this behavior, I wrote a program that loads to a Mongo collection with 1 document per process, and ran it 5000 times. Each ObjectID is then parsed into it's constituent parts. Here I show those parts for a number of 5-byte random number duplicates:

      random duplicates timestamp counter write PID
      191,162,307,748 2 1637005599 1943458 20617
          1637005804 1943458 4837
      162,692,425,265 2 1637005688 6330770 9673
          1637005995 6330770 17759
      335,481,554,228 2 1637005711 4833330 15678
          1637006002 4833330 19618
      -146,439,389,216 2 1637005841 6191586 13851
          1637006026 6191586 25035
      -484,404,755,953 2 1637005985 6445778 15646
          1637006023 6445778 24430
      51,631,052,349 2 1637005747 2533378 23997
          1637005882 2533378 23354

      Testing against a 1.6 C Driver, I see the 5-bytes (3 for machine and 2 for PID) identical in the case of processes which happen to have identical PIDs (running at different times), but in those cases the counter bytes are randomly distributed, e.g.:

      random duplicates timestamp counter write PID
      206,863,667,287 3 1637004679 461522 1057
          1637004817 1525762 1057
          1637005104 6964834 1057

      Given there are 1/1 trillion odds of two 5-byte random numbers colliding, combined with the odds of the 1/8 million counter numbers colliding (all in the same second), if the code matched the spec, the modern algorithm would not be a problem.

            Assignee:
            colby.pike@mongodb.com Colby Pike
            Reporter:
            seamus.riley@gmail.com Seamus Riley
            Votes:
            2 Vote for this issue
            Watchers:
            12 Start watching this issue

              Created:
              Updated:
              Resolved: