Drivers / DRIVERS-1811

Astrolabe Testing Improvements

    • Type: Epic
    • Resolution: Done
    • Priority: Unknown
    • Component/s: None
    • To Do
    • Astrolabe Testing Improvements

      Engineer(s): Oleg, Jeff

      Summary: Make improvements to Astrolabe testing in order to be sure we are receiving the full benefit of the tool. This project captures the effort to increase stability of Astrolabe as well as verify the efficacy of its testing such that it is accurately surfacing driver bugs.

      2021-12-01:

      • No updates since Oleg returned from vacation earlier this week

      2021-11-17:

      • On pause while Oleg is on vacation

      2021-10-06:

      • Completed DRIVERS-1924: Retrieve server logs only when tests fail
      • Completed DRIVERS-1923: Retrieve server logs in a separate task
      • Deferred DRIVERS-1691: Research ways to avoid OutOfMemory exception when we handle huge batch of events
      • Started DRIVERS-1932: When Astrolabe run fails due to issue with Atlas, color it lavender to indicate a setup failure rather than a task failure
      • Next up: root cause analysis (RCA) of the regularly occurring timeouts in Atlas QA. Cory is going to help with triage in the #astrolabe-triage Slack channel. Eventually, the on-call leads will be added to assist.

    • Not Needed

      Summary

      We need to make improvements to Astrolabe testing in order to be sure we are receiving the full benefit of the tool. This project captures the effort to increase stability of Astrolabe as well as verify the efficacy of its testing such that it is accurately surfacing driver bugs.

      Efficacy Improvements

      The Astrolabe project currently tests various Atlas planned maintenance scenarios in an attempt to find problematic scenarios that indicate either driver bugs or bugs in the planned maintenance scenarios themselves. So far the project has not found bugs in either (aside from timeout issues in cloud-dev).

      We need to test Astrolabe itself to see if it actually achieves its goals. One way to do this would be to test a driver version that is known to have bugs related to planned maintenance, along with the workload that was used to reproduce each bug, and see if those bugs are reproduced by Astrolabe. We would want to continually test Astrolabe this way to ensure future changes to it don't obscure known driver bugs. As new driver bugs are found, the pre-bugfix version of the driver and the workload that reproduces the bug should be added to a corpus of such tests.
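
      The corpus idea above could be sketched roughly as follows. This is only an illustration: the KnownBugCase fields, the missed_bugs helper, and the run_case callback are all hypothetical names, not part of Astrolabe, and the way a case is actually executed is left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class KnownBugCase:
    """One corpus entry (hypothetical shape): a pre-bugfix driver build
    plus the workload that originally reproduced the bug."""
    ticket: str          # placeholder key for the driver bug
    driver_version: str  # pre-bugfix driver version to test against
    workload_file: str   # workload that reproduced the bug
    scenario: str        # planned maintenance scenario to run


def missed_bugs(
    corpus: Iterable[KnownBugCase],
    run_case: Callable[[KnownBugCase], bool],
) -> List[KnownBugCase]:
    """Return the corpus entries whose known bug was NOT reproduced.

    run_case is supplied by the caller (an assumption of this sketch): it
    runs one case and returns True when the run failed, i.e. the known bug
    surfaced. A non-empty result means the tool is hiding a known bug.
    """
    return [case for case in corpus if not run_case(case)]
```

      A continuous job could call missed_bugs on every change to Astrolabe and go red whenever the returned list is non-empty.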

      Stability Improvements

      The Atlas Planned Maintenance test suite should only be red due to a driver, server, or Atlas bug, not because of an Astrolabe bug or cloud-dev/cloud-qa instability. The Atlas Planned Maintenance tests have a number of stability issues that obscure the problems the test suite was intended to find, reducing or eliminating its value:

      • Tests often time out waiting for maintenance to complete
      • Tests often time out attempting to download logs
      • Tests sometimes OOM tracking APM events
      • etc.

      A policy for triage of testing failures

      • Automatically notify language teams when Astrolabe failures occur for their driver
      • Define a process for handing off triage to other teams (server, cloud, etc.) when triage determines there is no driver bug
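
      One possible shape for such a triage policy is sketched below. Everything in it is an assumption made for illustration: the three failure categories, the message substrings used to detect them, and the team names returned by notify_target are hypothetical and do not reflect how Astrolabe or the triage process actually works.

```python
from enum import Enum


class FailureKind(Enum):
    SETUP = "setup"      # Atlas or cloud-dev/cloud-qa problem, not a test result
    HARNESS = "harness"  # the test tooling itself misbehaved
    TEST = "test"        # the workload reported errors: possible driver bug


# Hypothetical message fragments; real failure messages may differ.
SETUP_MARKERS = ("timed out waiting for maintenance",)
HARNESS_MARKERS = ("timed out downloading logs", "out of memory")


def classify(message: str) -> FailureKind:
    """Bucket a failure message; anything unrecognized is treated as a test failure."""
    text = message.lower()
    if any(marker in text for marker in SETUP_MARKERS):
        return FailureKind.SETUP
    if any(marker in text for marker in HARNESS_MARKERS):
        return FailureKind.HARNESS
    return FailureKind.TEST


def notify_target(kind: FailureKind, driver: str) -> str:
    """Pick who is notified first (team names are placeholders)."""
    if kind is FailureKind.TEST:
        return f"{driver} driver team"
    if kind is FailureKind.SETUP:
        return "cloud team"
    return "Astrolabe maintainers"
```

      Defaulting unknown failures to the driver team keeps possible driver bugs from being silently dropped; the hand-off to server or cloud teams would then happen once triage rules out a driver bug.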

      Motivation

      Who is the affected end user?

      No end users will be affected by this work, as it is internal testing. Driver engineers are essentially the end users here, and they may need to make changes to their Atlas testing to accommodate updates that come out of this project.

      Is this issue urgent?

      This ticket is urgent because we must have a strong functional validation mechanism between drivers and Atlas. This only becomes more urgent over time as Atlas functionality expands and usage continues to grow.

      Is this ticket required by a downstream team?

      Not functionally, but our testing with Astrolabe is an essential verification mechanism between drivers and Atlas, so it is implicitly required cross-org.

      Is this ticket only for tests?

      Sort of. This project is solely about testing; however, it involves making changes to Astrolabe and is more significant in urgency and scope than simply adding tests.

      Cast of Characters

      Engineering Lead:
      Document Author:
      POCers:
      Product Owner:
      Program Manager:
      Stakeholders:

      Channels & Docs

      Running List of Astrolabe Issues

      Slack Channel

      [Scope Document|some.url]

      [Technical Design Document|some.url]

            Assignee:
            Unassigned
            Reporter:
            Rachelle Palmer (rachelle.palmer@mongodb.com)
            Votes:
            0
            Watchers:
            5

              Created:
              Updated:
              Resolved: