  Core Server / SERVER-4364

Method for running large mapReduce operations in a working replica set

    • Type: Improvement
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 1.8.4, 2.0.1
    • Component/s: MapReduce
    • Environment:
      replica set -- e.g., a PRIMARY, a SECONDARY, and an ARBITER -- receiving a steady trickle of read & write operations
    • Query Optimization

      As a user of a MongoDB database served from an N-node replica set (running in --auth mode) that is responding to a steady trickle of read and write requests, I want to be able to complete a large map/reduce task without material disruption to the operation of the replica set.

      By "large," I mean that I cannot guarantee the result set fits in a single 16MB document – i.e., my task is a poor candidate for "out:

      { inline: 1 }

      ".

      By "material disruption," I mean a variety of pathologies I'd like to avoid if possible. For example:

      1. failover to the SECONDARY because the PRIMARY becomes unresponsive
        • (with subsequent failure of the mapReduce – db assertion 13312 – when it cannot commit its result collection because the former PRIMARY is no longer PRIMARY)
      2. error RS102 ("too stale to catch up") on a secondary because of the deluge of a suddenly-appearing result set being replicated;
      3. significant degradation of overall performance re: the latency of servicing other requests;
      4. etc.

      For the sake of argument, let's further assume my map/reduce task is not amenable to an incremental application of successive mapReduce calls; and that I would potentially like to preserve the map/reduce output collection as a target for future queries, though it would be acceptable to have to manually move that result data around among hosts.
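      For context, the incremental pattern being ruled out above looks roughly like the following sketch (mapFn and reduceFn as in the earlier sketch; the timestamp bookkeeping is a placeholder):

          // Incremental map/reduce: each pass processes only documents newer
          // than the previous run and re-reduces them into the prior output.
          var lastRunTimestamp = ISODate("2011-12-01T00:00:00Z")  // placeholder
          db.events.mapReduce(mapFn, reduceFn, {
              query: { ts: { $gt: lastRunTimestamp } },  // new documents only
              out: { reduce: "mr_results" }  // fold into the existing output via reduce
          })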

      I've spoken with a couple members of the 10gen engineering team, and they've been very helpful in brainstorming approaches and workarounds. But for the sake of the product, I want to make sure the use case is captured here, as I don't think any of the approaches I'm currently aware of really addresses it satisfactorily.

      Working within the current constraints, here's where we end up ...

      The "large" / no inline requirement today will force us to execute the mapReduce() call on the PRIMARY. The reason being: an output collection will have to be created, and only the PRIMARY is permitted to do these writes.

      But today, for a certain class of job, executing on the PRIMARY is a non-starter. It will consume too many system resources, monopolize the one and only JavaScript thread, hold the write lock for extended periods of time, etc. – risking running afoul of pathology #1 above. If that failure mode is dodged, it runs the risk of pathology #2, as the large result set (created much faster than any more "natural" insertion of data) floods the SECONDARY nodes and overflows their oplogs. And of course failure mode #3 is always a concern – the PRIMARY is too critical to the overall responsiveness of any app that uses the database.

      That's the problem, in a nutshell. Some potential solutions that have been suggested – all of which involve changes to the core server:

      A. If we could relax or finesse the only-PRIMARY-can-write constraint for this kind of operation, we could run the mapReduce() call on a SECONDARY – perhaps even a "hidden" node – and leave the overall cluster and any apps almost completely unaffected. Then we'd take responsibility for getting the result collection out of that node as a separate task. Currently, though, this does not seem possible – cf. issue SERVER-4264 – as we can't target even the 'local' database there. (And even if we could target 'local', today it requires admin/separate privileges to do so.)
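      For reference, a hidden member of the kind mentioned above is configured like this (the member index is illustrative):

          // Make member 2 hidden: it replicates the full data set but is
          // never advertised to clients and can never become PRIMARY.
          cfg = rs.conf()
          cfg.members[2].priority = 0   // a hidden member must have priority 0
          cfg.members[2].hidden = true
          rs.reconfig(cfg)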

      B. Alternatively, if we could alter the mapReduce() command in such a way as to allow "out:" to target a remote database – perhaps in an entirely disjoint replica set – a secondary could run the calculation and save the result set without violating the proscription against writing on a secondary.
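      Purely as a sketch of what such an option might look like (this is not existing syntax; the "host" field in particular is imagined):

          // Hypothetical: compute on a SECONDARY, ship results to a remote
          // target so the node never writes to its own replicated data set.
          db.events.mapReduce(mapFn, reduceFn, {
              out: {
                  replace: "mr_results",
                  db: "reporting",
                  host: "archive-rs/archive1.example.com:27017"  // imagined option
              }
          })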

      C. Another possibility is to flag the result set for exclusion from replication – though this leaves us running on the PRIMARY, where failure modes #1 and #3 are still an issue.
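      Again purely illustrative (no such flag exists today):

          // Hypothetical: exclude the result collection from the oplog so
          // the SECONDARY nodes never see, and are never flooded by, it.
          db.events.mapReduce(mapFn, reduceFn, {
              out: { replace: "mr_results", replicate: false }  // imagined flag
          })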

      My intent here, though, is not to describe or attach to any particular solution; rather, I'm just seeking to articulate the problem standing in the way of (what I consider) a rather important use case.

      Thanks for reading this far!

      TD

            Assignee:
            Backlog - Query Optimization
            Reporter:
            T. Dampier
            Votes:
            7
            Watchers:
            13
