Core Server / SERVER-10634

Failover doesn't occur on disk full and other non-crash errors

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: 2.4.6
    • Component/s: Replication
    • Environment: Amazon Linux, official MongoDB.org packages
    • Operating System: ALL

      Set up a 3-node replica set. Use separate servers, or at least a different data disk for each node.

      Using the following script, insert data into a collection:

      #!/bin/bash
      # Insert a timestamped document in a loop until writes start failing.
      # (The document body was garbled by a wiki macro in the original report;
      # reconstructed here from the TIME variable it defines.)
      while true
      do
        TIME=$(date)
        echo "db.test.insert({ time: \"$TIME\" })" | mongo test
      done

      Use a tool like dd if=/dev/zero of=/tmp/eatspace bs=1024 count=1024 to fill the disk. Note that even after the disk is full, MongoDB will continue to successfully insert more data until the last 2 GB data file becomes full.
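      Note that the dd invocation above writes only 1 MiB per run. A minimal sketch of the fill step (/tmp/eatspace is the scratch path from the report; point it at the mongod data volume and raise count= to actually exhaust space):

```shell
#!/bin/sh
# Fill the volume with a throwaway file, then confirm remaining space.
# bs=1024 count=1024 writes 1 MiB per invocation; loop or raise count=
# until the data disk is actually full.
dd if=/dev/zero of=/tmp/eatspace bs=1024 count=1024 2>/dev/null
df -h /tmp            # check free space on the volume holding the filler file
wc -c < /tmp/eatspace # filler size in bytes (1048576 for one run)
```

      Deleting /tmp/eatspace frees the space again once the test is done.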

      What actually happens:

      Observe the following errors from the insert:

      MongoDB shell version: 2.4.6
      connecting to: test
      Can't take a write lock while out of disk space
      bye

      And in the log:

      Wed Aug 28 11:08:07.306 [FileAllocator] allocating new datafile /var/lib/mongo/test.2, filling with zeroes...
      Wed Aug 28 11:08:07.307 [FileAllocator] FileAllocator: posix_fallocate failed: errno:28 No space left on device falling back
      Wed Aug 28 11:08:07.307 [FileAllocator] error: failed to allocate new file: /var/lib/mongo/test.2 size: 268435456 failure creating new datafile; lseek failed for fd 22 with errno: errno:2 No such file or directory. will try again in 10 seconds

      What should happen

      When failing to allocate a new datafile, the primary should step down and allow another node to become primary. In addition, it should go into a state where it cannot become primary again (for example, if it has a high priority) until the problem has been fixed.

      Workarounds

      When noticing the failure, the DBA must call rs.stepDown() or shut down the failing mongod process. rs.stepDown() could also be called automatically by an application that receives a disk-full or similar error. In addition, it might make sense to set the node into a hidden or priority=0 state until the problem is fixed.
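      The automated variant suggested above can be sketched as a small log watchdog. This is a minimal sketch, assuming the default log path and an unauthenticated local mongo shell (both assumptions; adjust for the deployment):

```shell
#!/bin/bash
# Watchdog sketch: tail the mongod log and step the node down when a
# disk-full allocation error (errno:28, as in the log excerpt above) appears.

LOGFILE=${LOGFILE:-/var/log/mongodb/mongod.log}  # assumed default path

disk_full_detected() {
  # Match the errno:28 signature from the FileAllocator log lines
  printf '%s' "$1" | grep -q 'errno:28 No space left on device'
}

watch_and_stepdown() {
  tail -F "$LOGFILE" | while read -r line; do
    if disk_full_detected "$line"; then
      # Step down, and refuse re-election for an hour while the disk is fixed
      echo 'rs.stepDown(3600)' | mongo admin  # assumed local, unauthenticated
      break
    fi
  done
}
# Run with: watch_and_stepdown   (requires a live mongod and a readable log)
```

      rs.stepDown(3600) also keeps the node from standing for election for that period; alternatively, an rs.reconfig() setting the member's priority to 0 keeps it secondary until an operator reverts the change.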


      Summary:

      Given a replica set with 3 or more nodes, if the PRIMARY node is shut down, crashes, or becomes unavailable due to network issues, the other nodes will proceed to elect a new PRIMARY, and automatic failover occurs within seconds.

      However, in other error situations where the mongod process remains alive and continues to respond to heartbeats, failover will not happen, yet write operations to the PRIMARY will fail, rendering the cluster unusable and de facto unavailable for writes.

      An example of such an error situation is a disk error, such as a full disk.

            Assignee: Unassigned
            Reporter: Henrik Ingo (Inactive) <henrik.ingo@mongodb.com>
            Votes: 0
            Watchers: 2
