  1. Core Server
  2. SERVER-93641

Kafka connection failures sometimes return bad error message

    • Atlas Streams
    • Sprint 60, Sprint 62

      There have been a few prod examples of poor/unhelpful error messages when connecting to Kafka.

      We should investigate why, since we already have code that is supposed to return more detailed error information: https://github.com/10gen/mongo/blob/master/src/mongo/db/modules/enterprise/src/streams/exec/kafka_event_callback.cpp#L48 

      There is likely an intermittent issue we need to fix in the "append recent errors" logic: https://github.com/10gen/mongo/blob/master/src/mongo/db/modules/enterprise/src/streams/exec/kafka_partition_consumer.cpp#L575

      1. We should use Splunk to see whether librdkafka is printing any useful error information that should have been included by the "append recent errors" logic: https://github.com/10gen/mongo/blob/master/src/mongo/db/modules/enterprise/src/streams/exec/kafka_event_callback.cpp#L60
      2. If so, why are those error messages not picked up by the "appendRecentErrorsToStatus" logic? Do we need to wait about a second for those error messages to come in?
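
To make the timing question in item 2 concrete, here is a minimal sketch of a "wait briefly for late errors" approach. It is not the code in kafka_event_callback.cpp or kafka_partition_consumer.cpp: the class `RecentErrorBuffer`, its methods, and the message format are all hypothetical names chosen for illustration. It assumes errors arrive on a separate callback thread and that the status builder may run before the first error has been recorded.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for a buffer that an event callback fills with
// recent error messages. Not the actual streams implementation.
class RecentErrorBuffer {
public:
    explicit RecentErrorBuffer(std::size_t capacity) : _capacity(capacity) {}

    // Assumed to be called from the callback thread for each error event.
    void add(std::string message) {
        if (_capacity == 0) {
            return;  // Nothing can be stored.
        }
        {
            std::lock_guard<std::mutex> lk(_mutex);
            if (_errors.size() == _capacity) {
                _errors.pop_front();  // Drop the oldest entry.
            }
            _errors.push_back(std::move(message));
        }
        _cv.notify_all();
    }

    // Waits up to `timeout` for at least one error to arrive, then returns
    // `reason` with any buffered errors appended. If nothing arrives in
    // time, `reason` is returned unchanged.
    std::string appendRecentErrors(const std::string& reason,
                                   std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait_for(lk, timeout, [this] { return !_errors.empty(); });
        if (_errors.empty()) {
            return reason;
        }
        std::string out = reason + " (recent errors: ";
        bool first = true;
        for (const auto& e : _errors) {
            if (!first) {
                out += "; ";
            }
            out += e;
            first = false;
        }
        out += ")";
        return out;
    }

private:
    const std::size_t _capacity;
    std::mutex _mutex;
    std::condition_variable _cv;
    std::deque<std::string> _errors;
};
```

With a design like this, the failure path pays the timeout only when no error has been recorded yet, and returns immediately otherwise. Whether a bounded wait is acceptable on the real connection-failure path is still an open question for this ticket.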

      Here is a recent repro: https://mongodb.slack.com/archives/C07HH6A8M55/p1730824228637259

      Note this can also repro for VPC peering errors. Those will have better error messages after https://github.com/10gen/mongo/pull/26113 .

            Assignee:
            calvin.nix@mongodb.com Calvin Nix
            Reporter:
            matthew.normyle@mongodb.com Matthew Normyle
            Votes:
            1
            Watchers:
            3
