- Type: Task
- Resolution: Done
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: None
- Atlas Streams
- Fully Compatible
- Sprint 32, Sprint 33
- 135
https://mongodb.slack.com/archives/C04AH2TF7E1/p1695164357496619
Conversation above ^
–
aadesh 13 days ago
@ kenny.gorman re: throughput numbers for streams
__
aadesh 13 days ago
was chatting with Sandeep, so tomorrow I'm planning on getting a bunch of numbers together with various setups. We have a few Genny workloads set up for streams right now, but those use the in-memory source/sink operators, so they're not super reflective of production setups. The plan is to run those same workloads against a Kafka source in different regions, with streams running in us-east-1, so that we have throughput numbers for Kafka-source pipelines, and then send over a bunch of throughput numbers to you for each workload and source operator setup. How does that generally sound?
- in-memory source operator
- same region kafka -> mstreams
- different regions kafka -> mstreams
will run that setup for every workload we have in Genny ^ along with the avg document size (in bytes) that we're using for those workloads
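For reference, a minimal sketch of what these pipeline variants might look like as Atlas Stream Processing definitions, expressed here as plain Python dicts; the connection names, topic, and target namespace are placeholders, not the actual benchmark config:

```python
# Sketch only: connection names, topic, and namespace below are placeholders.
# Each Genny workload would swap its own stages in between source and sink.

# Kafka source; whether this is same-region or cross-region depends on where
# the "benchKafka" connection points relative to the us-east-1 stream processor.
kafka_source = {"$source": {"connectionName": "benchKafka", "topic": "bench_topic"}}

# Sink into an Atlas collection so throughput is measured end to end.
atlas_sink = {"$merge": {"into": {"connectionName": "benchCluster",
                                  "db": "bench",
                                  "coll": "out"}}}

def build_pipeline(workload_stages):
    """Assemble a benchmark pipeline around a workload's middle stages."""
    return [kafka_source, *workload_stages, atlas_sink]

# Example: a trivial pass-through workload.
pipeline = build_pipeline([{"$match": {}}])
```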
__
aadesh 13 days ago
@ kenny.gorman
__
aadesh 13 days ago
each workload will be a diff type of stream pipeline and document size
__
Sandeep Dhoot 13 days ago
@aadesh we will want to change pipelines later and get these numbers a few times. So please do try to make the whole process repeatable.
__
Joe Niemiec 13 days ago
I think it makes sense; it'll be good to have some baseline idea of the impact of the bandwidth-delay product with cross-region traffic
__
Joe Niemiec 13 days ago
I would make sure you really document what your Kafka setup is as well; Kafka tends to rely heavily on the Linux page cache, so there could be a difference between a Kafka that has buffered properly versus one that hasn't because it's cold (edited)
__
Joe Niemiec 13 days ago
We also have some customers where reading change streams could potentially be cross region or merging to a cluster cross region (edited)
__
kenny.gorman 13 days ago
Maybe I missed it but we need source and sink variations. Like Kafka to Kafka and Kafka to Mongo. To a lesser degree we need change stream source to Mongo.
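A rough sketch of how those source/sink combinations differ at the pipeline level (connection names, topics, and namespaces are again placeholders; the change-stream variant reads from an Atlas collection instead of a Kafka topic):

```python
# Placeholder connection names, topics, and namespaces throughout.
kafka_source = {"$source": {"connectionName": "srcKafka", "topic": "in_topic"}}

# Change-stream source: watch an Atlas collection rather than a Kafka topic.
changestream_source = {"$source": {"connectionName": "srcCluster",
                                   "db": "app", "coll": "events"}}

kafka_sink = {"$emit": {"connectionName": "sinkKafka", "topic": "out_topic"}}
mongo_sink = {"$merge": {"into": {"connectionName": "sinkCluster",
                                  "db": "app", "coll": "out"}}}

combinations = {
    "kafka_to_kafka": [kafka_source, kafka_sink],
    "kafka_to_mongo": [kafka_source, mongo_sink],
    "changestream_to_mongo": [changestream_source, mongo_sink],
}
```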
__
kenny.gorman 13 days ago
Yeah, the exact Kafka config is important. Repeatable is critical. Maybe something anyone can run, not just engineering (thinking field), but maybe I am being too optimistic
__
kenny.gorman 13 days ago
This is awesome guys. Can’t wait to see the results
__
Sandeep Dhoot 13 days ago
Btw sources/sinks will introduce variability in results for N different reasons (source is in a different region, unique kind of Kafka deployment). It would require too much effort to try to cover all the different scenarios. I hope we can just test with only a couple of different scenarios and use those as ballpark numbers. (edited)
__
kenny.gorman 12 days ago
A couple different scenarios is what I meant yes, not all
__
kenny.gorman 12 days ago
The main ones are intra-region Kafka to MongoDB, and intra-region MongoDB to MongoDB. I am not sure (@ joe) if we have lots of Kafka to Kafka use cases just yet.
__
Joe Niemiec 12 days ago
some rough telemetry based on data I have for combinations over 30 customers (a customer may do more than 1 pattern):
CS 2 Kafka - 7
Kafka 2 Col - 12
Kafka 2 Kafka - 8
CS 2 Collection - 14
__
Joe Niemiec 12 days ago
so really Kafka to Collection and CS to Collection are the top dogs
__
aadesh 12 days ago
perf, that's super helpful
__
aadesh 11 days ago
re: repeatability, might take a bit more time to get to a system where we can easily repeat different setups for different stream pipelines
need to make various changes to the existing perf tooling infra (DSI specifically) to distinguish mongod vs mstreams, but looking into that more today so that we can get to a place where it's easy to do all this
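As a sketch of the repeatability goal, the run matrix could be captured as plain data so the same (workload, source, region) combinations can be replayed run after run; everything here (workload names, source setups, the run_workload hook) is hypothetical and not part of the existing DSI or Genny tooling:

```python
import itertools
import json

# Hypothetical run matrix; workload names and source setups are illustrative.
WORKLOADS = ["passthrough_1kb", "tumbling_window_1kb", "passthrough_8kb"]
SOURCES = ["in_memory", "kafka_same_region", "kafka_cross_region"]
STREAMS_REGION = "us-east-1"

def run_workload(workload, source):
    """Placeholder hook: run the given workload against the given source
    setup and return the observed throughput in docs/sec."""
    raise NotImplementedError("wire this up to the actual perf tooling")

def run_matrix():
    results = []
    for workload, source in itertools.product(WORKLOADS, SOURCES):
        results.append({"workload": workload,
                        "source": source,
                        "streams_region": STREAMS_REGION,
                        "docs_per_sec": run_workload(workload, source)})
    # Persist results so successive runs can be compared like for like.
    with open("streams_throughput.json", "w") as fh:
        json.dump(results, fh, indent=2)
    return results
```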
- related to: SERVER-81719 Streams: Integrate mongostream into performance testing infrastructure (Closed)