  Core Server / SERVER-4364

Method for running large mapReduce operations in a working replica set

    • Type: Improvement
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 1.8.4, 2.0.1
    • Component/s: MapReduce
    • Environment:
      replica set -- e.g., a PRIMARY, a SECONDARY, and an ARBITER -- receiving a steady trickle of read & write operations
    • Query Optimization

      As a user of a MongoDB database served from an N-node replica set (running in --auth mode) that is responding to a steady trickle of read and write requests, I want to be able to complete a large map/reduce task without material disruption to the operation of the replica set.

      By "large," I mean that I cannot guarantee the result set fits in a single 16MB document – i.e., my task is a poor candidate for "out:

      { inline: 1 }

      ".

      By "material disruption," I mean a variety of pathologies I'd like to avoid if possible. For example:

      1. failover to the SECONDARY because the PRIMARY becomes unresponsive
        • (with subsequent failure of the mapReduce – db assertion 13312 – when it cannot commit its result collection because the former PRIMARY is no longer PRIMARY)
      2. error RS102 ("too stale to catch up") on a secondary because of the deluge of a suddenly-appearing result set being replicated;
      3. significant degradation of overall performance re: the latency of servicing other requests;
      4. etc.

      For the sake of argument, let's further assume my map/reduce task is not amenable to an incremental application of successive mapReduce calls; and that I would potentially like to preserve the map/reduce output collection as a target for future queries, though it would be acceptable to have to manually move that result data around among hosts.
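      For context, the incremental pattern being ruled out above looks roughly like the following sketch (mapFn and reduceFn as in the earlier sketch; the timestamp bookkeeping is a placeholder):

          // Incremental map/reduce: each pass processes only documents newer
          // than the previous run and re-reduces them into the prior output.
          var lastRunTimestamp = ISODate("2011-12-01T00:00:00Z")  // placeholder
          db.events.mapReduce(mapFn, reduceFn, {
              query: { ts: { $gt: lastRunTimestamp } },  // new documents only
              out: { reduce: "mr_results" }  // fold into the existing output via reduce
          })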

      I've spoken with a couple members of the 10gen engineering team, and they've been very helpful in brainstorming approaches and workarounds. But for the sake of the product, I want to make sure the use case is captured here, as I don't think any of the approaches I'm currently aware of really addresses it satisfactorily.

      Working within the current constraints, here's where we end up ...

      The "large" / no inline requirement today will force us to execute the mapReduce() call on the PRIMARY. The reason being: an output collection will have to be created, and only the PRIMARY is permitted to do these writes.

      But today, for a certain class of job, executing on the PRIMARY is a non-starter. It will consume too many system resources, monopolize the one and only JavaScript thread, hold the write lock for extended periods of time, etc. – risking running afoul of pathology #1 above. If that failure mode is dodged, it runs the risk of pathology #2, as the large result set (created much faster than any more "natural" insertion of data) floods the SECONDARY nodes and overflows their oplogs. And of course failure mode #3 is always a concern – the PRIMARY is too critical to the overall responsiveness of any app that uses the database.

      That's the problem, in a nutshell. Some potential solutions that have been suggested – all of which involve changes to the core server:

      A. If we could relax or finesse the only-PRIMARY-can-write constraint for this kind of operation, we could run the mapReduce() call on a SECONDARY – perhaps even a "hidden" node – and leave the overall cluster and any apps almost completely unaffected. Then we'd take responsibility for getting the result collection out of that node as a separate task. Currently, though, this does not seem possible – cf. issue SERVER-4264 – as we can't target even the 'local' database there. (And even if we could target 'local', today it requires admin/separate privileges to do so.)
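      For reference, a hidden member of the kind mentioned above is configured like this (the member index is illustrative):

          // Make member 2 hidden: it replicates the full data set but is
          // never advertised to clients and can never become PRIMARY.
          cfg = rs.conf()
          cfg.members[2].priority = 0   // a hidden member must have priority 0
          cfg.members[2].hidden = true
          rs.reconfig(cfg)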

      B. Alternatively, if we could alter the mapReduce() command in such a way as to allow "out:" to target a remote database – perhaps in an entirely disjoint replica set – a secondary could run the calculation and save the result set without violating the proscription against writing on a secondary.
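      Purely as a sketch of what such an option might look like (this is not existing syntax; the "host" field in particular is imagined):

          // Hypothetical: compute on a SECONDARY, ship results to a remote
          // target so the node never writes to its own replicated data set.
          db.events.mapReduce(mapFn, reduceFn, {
              out: {
                  replace: "mr_results",
                  db: "reporting",
                  host: "archive-rs/archive1.example.com:27017"  // imagined option
              }
          })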

      C. Another possibility is to flag the result set for exclusion from replication – though this leaves us running on the PRIMARY, where failure modes #1 and #3 are still an issue.
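      Again purely illustrative (no such flag exists today):

          // Hypothetical: exclude the result collection from the oplog so
          // the SECONDARY nodes never see, and are never flooded by, it.
          db.events.mapReduce(mapFn, reduceFn, {
              out: { replace: "mr_results", replicate: false }  // imagined flag
          })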

      My intent here, though, is not to describe or attach to any particular solution; rather, I'm just seeking to articulate the problem standing in the way of (what I consider) a rather important use case.

      Thanks for reading this far!

      TD

            Assignee:
            Backlog - Query Optimization
            Reporter:
            T. Dampier
            Votes:
            7
            Watchers:
            13
