Better Metrics with Prometheus

Generating millions of metrics from microservices and infrastructure is easy (perhaps too easy). Collecting those metrics and sculpting them into meaningful dashboards and alarms is a challenge operations teams face as they grow and scale their applications. In Prometheus, we found a capable and flexible metrics collection platform with an entire ecosystem of out-of-the-box functionality that we could adapt to our existing tooling. Today, we share the story of our adoption of Prometheus for IdentityNow.

First… What is Prometheus?

Prometheus is a pull-based monitoring system that scrapes metrics from configured endpoints, stores them efficiently, and supports a powerful query language for composing dynamic information from a variety of otherwise unrelated data points. It boasts a strong library of exporters that deliver immediate value by letting you start collecting metrics quickly from things like your computing systems (via node exporter), MySQL, and Elasticsearch.
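To make "scrapes metrics from configured endpoints" a little more concrete: a scrape target answers an HTTP request with plain-text lines of the form name{label="value"} value. The sketch below renders one such line. The function name render_sample and the example metric are illustrative assumptions, not part of any Prometheus client library, and real applications would normally use an official client instead.

```python
def render_sample(name: str, labels: dict, value) -> str:
    """Render one sample in the Prometheus text exposition format."""

    def escape(label_value: str) -> str:
        # Label values escape backslash, double quote, and newline.
        return (
            label_value.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )

    if not labels:
        return f"{name} {value}"
    # Sort labels so the output is stable from scrape to scrape.
    body = ",".join(f'{key}="{escape(val)}"' for key, val in sorted(labels.items()))
    return f"{name}{{{body}}} {value}"


print(render_sample("http_requests_total", {"method": "GET", "code": "200"}, 1027))
```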

Metrics and Observability Before Prometheus

When we started down this path, we were using one SaaS provider for system-level metrics (CPU, memory, etc.), and another for custom application metrics. The first obstacle we faced was that both services needed to make AWS API calls to get metrics from CloudWatch, plus additional calls to enrich that information with EC2, RDS, and ELB configuration data. Together, the two services saturated our AWS API limits, which caused problems in other areas such as deployment orchestration with Terraform.

With two SaaS monitoring services, there were two different systems to check when troubleshooting a production issue (in addition to logging, which was already self-hosted). We also had to implement workarounds to counter things like “predictive” alerting, which didn’t translate well across all our workloads. We realized that our new metrics system needed to avoid all of these problems so that it could grow with our platform.

Next Generation Observability

The IdentityNow team at SailPoint holds semi-annual “hack days” where engineers get to put that really cool idea they have into action and present it to the organization. One of these events started us down the path toward a Prometheus and Grafana proof of concept. The next step was to make sure we could instrument our systems and get quality data from the various exporters in the Prometheus ecosystem. We were able to get the node exporter, Redis exporter, and JMX exporter running against development environments, which gave us confidence that we could instrument our entire environment with Prometheus.

Prometheus in an ECS Environment

Armed with a viable proof of concept and a project plan to move forward, we set forth to design the next generation of metrics for IdentityNow. The most pressing issue was reducing our API calls to AWS, both to lower deployment risk and to avoid hitting API limits while simply viewing the console.

Since we needed to run the old and new systems together for a period of time, we couldn’t add load to our already strained API limits. The new system also had to be highly elastic, with no manual modifications required. Automation first! To address these requirements, we built three custom components and utilized Prometheus’ file-based discovery.

For target discovery, we added a sidecar called Registrator to every ECS task.  This lightweight Go program is responsible for reading the ECS metadata endpoint from inside the container and adding the necessary IP/Port and label information to a DynamoDB table, avoiding any AWS API calls for ECS task data.
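As a rough illustration of the registration step (this is not the actual Registrator code, which is written in Go), the sketch below turns a task metadata document into the kind of record that could be stored in DynamoDB. It assumes the metadata has already been fetched and parsed into a dict; the record layout, the "service" label, and the function name build_target_record are all hypothetical.

```python
def build_target_record(task_metadata: dict, metrics_port: int) -> dict:
    """Build a scrape-target record from an ECS task metadata document.

    Assumes the first container and its first network carry the address
    Prometheus should scrape; a real sidecar would likely need to pick
    the application container explicitly.
    """
    container = task_metadata["Containers"][0]
    ip_address = container["Networks"][0]["IPv4Addresses"][0]
    return {
        # "ip:port" is the form Prometheus expects for a target.
        "target": f"{ip_address}:{metrics_port}",
        "task_arn": task_metadata["TaskARN"],
        # Hypothetical label set; label values must be strings.
        "labels": {"service": task_metadata["Family"]},
    }


example_metadata = {
    "TaskARN": "arn:aws:ecs:us-east-1:123456789012:task/example",
    "Family": "example-service",
    "Containers": [{"Networks": [{"IPv4Addresses": ["10.0.1.5"]}]}],
}
print(build_target_record(example_metadata, 8080))
```

Because everything here comes from the metadata document the task can read locally, nothing in this step needs to call the ECS API.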

Prometheus then needs to know about these targets, which drove the creation of Hesione (named after the wife of the Titan Prometheus).  Hesione is another Go-based microservice which acts as a sidecar to Prometheus.  Hesione retrieves the application target list from DynamoDB and provides a formatted prometheus.yml with all static and dynamic target jobs.  Hesione also powers our alerts-as-code functionality by taking Git source-controlled alert and recording rules and distributing them to all known Prometheus instances from S3.
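Prometheus' file-based discovery reads JSON or YAML files containing a list of target groups, each with a "targets" list and a "labels" map. The sketch below shows one way registered records could be grouped into that shape. It is an illustration rather than Hesione's implementation; the record layout (a "target" string plus a "labels" dict) and the function name to_file_sd are assumptions.

```python
import json
from collections import defaultdict


def to_file_sd(records: list) -> list:
    """Group target records that share a label set into file_sd target groups."""
    groups = defaultdict(list)
    for record in records:
        # A sorted tuple of label pairs is hashable and order-independent.
        key = tuple(sorted(record["labels"].items()))
        groups[key].append(record["target"])
    return [
        {"targets": sorted(targets), "labels": dict(key)}
        for key, targets in sorted(groups.items())
    ]


records = [
    {"target": "10.0.1.6:8080", "labels": {"service": "api"}},
    {"target": "10.0.1.5:8080", "labels": {"service": "api"}},
    {"target": "10.0.2.9:9000", "labels": {"service": "worker"}},
]
# The JSON printed here is what would be written to a file referenced
# from a file_sd_configs entry in prometheus.yml.
print(json.dumps(to_file_sd(records), indent=2))
```

Sorting both the groups and the targets keeps the generated file stable between runs, so it only changes when the set of targets changes.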

In an elastic cloud environment, we also have to handle the case where services are de-provisioned, whether through a standard rolling deployment or an actual failure. An alarm was configured within Prometheus to fire when any dynamic target stops responding, notifying our Deregistrator service via SNS. Deregistrator then confirms the target is supposed to be stopped and removes it from DynamoDB. Hesione picks up the change and updates the file-based service discovery within Prometheus.
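The confirm-then-remove logic can be sketched as follows. This is a simplified illustration, not the Deregistrator service itself: the registry is a plain dict standing in for the DynamoDB table, and is_task_running is a hypothetical callback standing in for whatever check confirms a task's state.

```python
def handle_target_down(target: str, registry: dict, is_task_running) -> str:
    """Remove a down target from the registry only if its task is really gone.

    Returns "unknown" if the target was never registered, "kept" if the
    task is still running (so the alert is not a deregistration case),
    and "removed" if the record was deleted.
    """
    record = registry.get(target)
    if record is None:
        return "unknown"
    if is_task_running(record["task_arn"]):
        return "kept"
    del registry[target]
    return "removed"


registry = {
    "10.0.1.5:8080": {"task_arn": "arn:task-a"},
    "10.0.1.6:8080": {"task_arn": "arn:task-b"},
}
still_running = {"arn:task-b"}
print(handle_target_down("10.0.1.5:8080", registry, lambda arn: arn in still_running))
```

Keeping a target whose task is still running matters: a target that is up but not answering scrapes is a real problem that should keep alerting rather than silently disappear from discovery.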

And last but not least, the application metrics themselves. Most of our microservices were already collecting metrics for the third-party SaaS metrics solution using Dropwizard, which has a simple Prometheus client available from prometheus.io. We were able to expose the same metrics we had previously instrumented, much as we had with our former SaaS integrations.

A Few Bumps in the Road

Prometheus best practices tell us to avoid high-cardinality metrics because they increase the load on the system. Once we were in production, we began to register more applications, which brought with them lots of metrics. We kept an eye on the Prometheus head series metric (prometheus_tsdb_head_series), which indicates how many total time series Prometheus is tracking. Each series consumes a bit of memory and corresponds to one unique combination of metric name and label values. As this number grew larger and larger, the Prometheus UI became less and less responsive, often crashing our browsers when we simply used the type-ahead search box. Time to investigate. We found that one microservice was emitting a new metric on every website user click. Each click resulted in a new time series (with 5-7 labels each), requiring more memory for Prometheus to maintain and providing no actual value. The Engineering team provided a fix to stop emitting those metrics, allowing us to delete the offending time series and clean the tombstones, which returned Prometheus to a usable state.
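A small sketch shows why a per-click label is so costly. A series is identified by its metric name plus its full set of label values, so any label whose value is unique per event creates one series per event. The metric and label names below are made up for illustration.

```python
def count_series(samples) -> int:
    """Count distinct time series: one per unique (name, label set) pair."""
    return len({(name, tuple(sorted(labels.items()))) for name, labels in samples})


pages = ["home", "search", "admin"]

# Unbounded: every click carries a unique id, so every click is a new series.
per_click = [
    ("page_click", {"page": pages[i % 3], "click_id": str(i)}) for i in range(1000)
]

# Bounded: clicks increment a counter labeled only by page.
per_page = [("page_clicks_total", {"page": pages[i % 3]}) for i in range(1000)]

print(count_series(per_click))  # 1000 series for 1000 clicks
print(count_series(per_page))   # 3 series, no matter how many clicks
```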

Another surprise came during an event where approximately 80% of the targets on a Prometheus server stopped responding to metric scrapes at once. When the thousands of alarms started to fire, Prometheus dutifully tried to send all of them to our Deregistrator Lambda by way of the SNS Forwarder. The payload size was greater than the 256KB allowed by SNS, so the SNS Forwarder returned a 5xx error to Alertmanager, which in turn caused Alertmanager to retry until the alert queue was full and no more alarms could be triggered.

To address this, we modified the alarm to send only a single down target per notification, which directly addressed the 256KB SNS payload limit. But the retries could still be a problem, and this isn’t unique to us, so we also sent a patch to the SNS Forwarder to ensure everyone could benefit from this lesson learned. Deregistrator also found itself saturating the ECS API limits during this event. To add another layer of resiliency, we set up a Lambda invocation concurrency limit on the Deregistrator function to ensure the fan-out wouldn’t become an API problem as well.
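The idea of keeping each notification under the SNS size limit can be sketched as a small batching function. With max_per_message=1 it mirrors the one-target-per-notification approach described above; larger values show a more general greedy packing, which is a variation and not what the text describes. The payload shape and function names are assumptions for illustration.

```python
import json

SNS_MAX_BYTES = 256 * 1024  # SNS message size limit: 256KB (262,144 bytes)


def payload_size(alerts: list) -> int:
    """Size in bytes of the JSON payload that would carry these alerts."""
    return len(json.dumps({"alerts": alerts}).encode("utf-8"))


def split_alerts(alerts: list, max_bytes: int = SNS_MAX_BYTES, max_per_message: int = 1) -> list:
    """Split alerts into batches that respect both a count and a size limit.

    A single alert that is larger than max_bytes on its own is still
    emitted as its own batch; the caller has to decide how to handle it.
    """
    batches = []
    current = []
    for alert in alerts:
        candidate = current + [alert]
        too_many = len(candidate) > max_per_message
        too_big = payload_size(candidate) > max_bytes
        if current and (too_many or too_big):
            batches.append(current)
            current = [alert]
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches


down = [{"instance": f"10.0.0.{i}:9100"} for i in range(5)]
print(len(split_alerts(down)))  # one batch per down target
```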

What We Gained

Since converting all the existing infrastructure and applications to Prometheus, we’ve seen a steady increase in adoption and improvement of instrumentation throughout the organization.  Our dashboards within Grafana contain rich information across runtime, instance, and application metrics, and are now making use of the many features Grafana offers, such as annotations and dynamically generated variable lists.

We’ve also been able to convert some data points that were formerly instrumented as logs in Elasticsearch into Prometheus metrics, reducing the overall data size and query time needed to get the same information.

Next Steps

We have several million metrics in Prometheus today, and we want to make them more useful for monitoring. We’re actively working with our engineering teams to collect more discrete, interesting metrics on a per-service basis. By using techniques such as histograms and event counters, we can more accurately measure our services’ behavior against our service-level objectives, without worrying about averages over time hiding outlying errors or slow response times.
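The difference between an average and a histogram is easy to see with a few made-up numbers. The sketch below builds cumulative bucket counts in the style of a Prometheus histogram (each bucket counts observations less than or equal to its upper bound) and compares them with the mean. The durations, bucket bounds, and the 0.5 second objective are assumptions chosen for illustration.

```python
from statistics import mean


def cumulative_buckets(durations: list, upper_bounds: list) -> dict:
    """Count observations at or below each upper bound, plus a +Inf bucket."""
    counts = {
        bound: sum(1 for d in durations if d <= bound)
        for bound in sorted(upper_bounds)
    }
    counts[float("inf")] = len(durations)
    return counts


def fraction_within(buckets: dict, threshold: float) -> float:
    """Fraction of observations at or below a threshold that is a bucket bound."""
    return buckets[threshold] / buckets[float("inf")]


# 98 fast requests and 2 very slow ones, in seconds.
durations = [0.1] * 98 + [5.0] * 2
buckets = cumulative_buckets(durations, [0.25, 0.5, 1.0])

print(mean(durations) < 0.5)          # the average alone looks healthy
print(fraction_within(buckets, 0.5))  # but only 98% of requests met 0.5s
```

The average sits comfortably under half a second, while the bucket counts make it plain that two requests in every hundred took far longer.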

We’re also reshaping our definition of observability as it relates to logging and metrics.  By instrumenting interesting data points such as health check status and interesting errors, we can drive the alarms with Prometheus instead of log-based alerting.  Sometimes, log ingestion is running a few minutes behind, but Prometheus is dutifully capturing metrics every minute, allowing us to respond more quickly to changes in our applications.  We can also remove all of these events from the logs, allowing us to more quickly triage an issue and sort through fewer log messages in Kibana.

The DevOps team is also migrating to Kubernetes for container orchestration, where we can benefit from the many built-in service discovery options Prometheus offers out of the box. We expect this will simplify the container discovery and deregistration processes we use today.

And that’s the story of how Prometheus became a core component of how we do SaaS observability.  Each day we find new opportunities to further instrument our applications, create actionable alarms, and construct information-rich dashboards.   We are now better positioned to leverage data-driven decision making, where the question, “How are we going to measure that change?” is met with ideas and assurance we’re doing the right thing.  Remember – if you don’t measure it, it didn’t happen.

