Thanos use case - Any ideas on this scale? #4872
Replies: 3 comments
Hi, in case Slack didn't answer, here are my thoughts: I can't say I've worked at that scale with Thanos, but I have with other software. Remote write is certainly the better approach compared to the sidecar. In general, I can't say whether Thanos will scale well at that size; as mentioned, the compactor is a bottleneck that can't easily be fixed as of today. My suggestion above is based on experience with Thanos and other software at large scale, but I don't claim it will work for sure. It would be great to know how you solved this in the end.
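For reference, the remote-write path usually means running Thanos Receive as the write endpoint in front of object storage. A minimal sketch of such an invocation (addresses, paths, and the bucket config file are placeholders, not a tested deployment):

```shell
# Thanos Receive: accepts Prometheus remote_write and uploads TSDB blocks
# to object storage. --objstore.config-file points at a bucket config
# (e.g. S3 endpoint/credentials); the paths here are hypothetical.
thanos receive \
  --grpc-address=0.0.0.0:10901 \
  --remote-write.address=0.0.0.0:19291 \
  --tsdb.path=/var/thanos/receive \
  --tsdb.retention=1d \
  --label='receive_replica="0"' \
  --objstore.config-file=/etc/thanos/bucket.yml
```

Each Prometheus instance would then point its `remote_write` URL at `http://<receive-host>:19291/api/v1/receive`.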
Did you test Thanos on your infrastructure? I am currently deploying Thanos and need some feedback about using Thanos Store / Receiver / bucket / ... Thanks :)
Hi, can you share your Thanos use case? We are currently studying Thanos.
Hello,
We are looking at using Thanos. We have the following environment/use case and want to know whether Thanos would be a good fit.
- There will be around 27,000 devices across different sites for which metrics are scraped.
- These devices are scraped by a total of 3,000-5,000 Prometheus instances.
- Metrics ingestion into Thanos will be via remote write.
- The total number of metrics will be around 5 million per minute.
- Data will be stored in S3.
- Data will be retained for three years and must remain queryable.
- We will be using a containerized environment (Cloud Foundry, with Amazon EKS as another option under consideration).
- Current limitations in CF: max 16 GB memory per container, max 20 GB disk space per container.
- Containers will have a fixed open file limit of 16k.
- We will run the different Thanos components as separate apps in Cloud Foundry.
Question to the community: is Thanos a viable solution given the above requirements and limitations? Does anyone have experience using Thanos in a similar infrastructure or deployment, and at this scale?
What are your thoughts on the data retention requirement and the ability to query three years of data?
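On the three-year retention question, a common pattern is to let the Compactor downsample and keep raw data for a shorter window while retaining the 5m and 1h resolutions longer, which keeps long-range queries tractable. A sketch of the relevant flags (the concrete retention windows below are illustrative choices, not recommendations, and the paths are placeholders):

```shell
# Thanos Compactor: downsampling is on by default; retention is set per resolution.
# Example: raw data kept 90 days, 5-minute downsampled data 1 year,
# 1-hour downsampled data 3 years.
thanos compact \
  --wait \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --retention.resolution-raw=90d \
  --retention.resolution-5m=1y \
  --retention.resolution-1h=3y
```

Note that only one Compactor instance should run against a given bucket, which is part of why it becomes a bottleneck at scale.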
What would be the best way to set up Thanos Store, taking into account the limits mentioned above (disk space and open file limits)? We have done some initial testing and are now sharding data into two-week windows. Is there another way to go about this? Needing separate stores that each shard two weeks of data will mean a lot of stores across three years of data (150+ when taking HA into account).
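For what it's worth, the two-week sharding described above maps onto the Store Gateway's time-based partitioning flags. A sketch of one shard (the timestamps and paths are placeholders):

```shell
# One Store Gateway shard serving only a fixed two-week window of blocks.
# --min-time/--max-time accept RFC3339 timestamps or relative durations
# (e.g. --min-time=-6w for a rolling recent-data shard).
thanos store \
  --data-dir=/var/thanos/store \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --min-time=2021-10-01T00:00:00Z \
  --max-time=2021-10-15T00:00:00Z
```

Since older data is downsampled, older shards serve far fewer blocks, so they may be able to cover much wider windows than two weeks, reducing the total store count below the 150+ estimate.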
PS: Also posted the same in the Thanos Slack channel.