
Reducing Audit Events Lag From 1 Hour to 1 Second

Authors: Ajay Srinivas, Mark Bruce

What Are Audit Events?

Audit events are critical to our clients for various reasons—most notably, compliance and troubleshooting. SailPoint’s Identity Security Platform audits all important changes to an identity, such as role assignments, password change activities, access requests, and more.

Audit events emitted by many different microservices within the platform are processed asynchronously by the Audit service, which normalizes the data to make it suitable for search and then publishes each event onto a Kafka topic. These audit events are ingested into Elasticsearch by the Search service, making them accessible via the Search UI and Search APIs.

SailPoint’s custom event processing framework adds value by decorating each event with tenant context as it publishes the audit event to a multi-tenant Kafka topic. It also captures processing metrics: latency (how long an event waits in the topic before it is processed) and consumer lag (the number of events in the Kafka consumer group still waiting to be processed).
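For illustration, the latency metric can be derived directly from the Kafka record timestamp. The sketch below is a minimal, hypothetical example of decorating an event with tenant context and computing its latency; the class and field names are illustrative, not the framework's actual types.

```java
// Hypothetical sketch: wrap a raw Kafka record with tenant context and record
// how long it waited in the topic. Names are illustrative, not SailPoint's types.
import java.time.Duration;
import java.time.Instant;
import org.apache.kafka.clients.consumer.ConsumerRecord;

public final class TenantAwareAuditEvent {
    private final String tenantId;   // added by the framework before publishing
    private final String payload;    // normalized audit event JSON
    private final Duration latency;  // time the event spent waiting in the topic

    private TenantAwareAuditEvent(String tenantId, String payload, Duration latency) {
        this.tenantId = tenantId;
        this.payload = payload;
        this.latency = latency;
    }

    // Compute latency from the record timestamp set when the event was produced.
    public static TenantAwareAuditEvent decorate(ConsumerRecord<String, String> record,
                                                 String tenantId) {
        Duration latency = Duration.between(
                Instant.ofEpochMilli(record.timestamp()), Instant.now());
        return new TenantAwareAuditEvent(tenantId, record.value(), latency);
    }

    public String tenantId()  { return tenantId; }
    public String payload()   { return payload; }
    public Duration latency() { return latency; }
}
```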

Why Was This a Problem?

Every morning, millions of audit records were published to the audit Kafka topic in a short period of time. This spike severely impacted the Search service’s ingestion latency and lag, and these delays caused an unpleasant customer experience and an increase in the number of support tickets. The latency, normally measured in seconds, sometimes reached several hours.

What Caused the Latency?

To avoid overwhelming the Elasticsearch cluster, the original Search ingestion engine submitted batches of up to 5MB, waited for Elasticsearch to process them, and then processed the results. If Elasticsearch returned errors, the engine threw exceptions, and all ingestion request threads were blocked while those errors were handled. Because error handling relied on retry logic, every document in an ingestion request was reprocessed whenever any part of the request failed. This design had a major adverse effect on throughput. On top of that, the framework processed one Kafka event at a time and committed the offset to Kafka only after the ingestion completed without exception, which further contributed to the lag.
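The following sketch is a simplified illustration of that original pattern; the class and helper names are hypothetical, not the actual Search service code. Each bulk request is submitted synchronously, the ingestion thread blocks on the response, and a single failed document causes the entire request to be retried.

```java
// Simplified, hypothetical illustration of the original blocking ingestion flow.
import java.util.List;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

public class BlockingIngestionLoop {
    private static final long MAX_BATCH_BYTES = 5 * 1024 * 1024; // ~5MB per bulk request

    private final RestHighLevelClient client;

    public BlockingIngestionLoop(RestHighLevelClient client) {
        this.client = client;
    }

    // Called once per Kafka message; the offset is committed only if this method
    // returns without throwing, so any Elasticsearch error forces the entire
    // message (and every document in it) to be reprocessed.
    public void ingest(List<SearchDocument> documents) throws Exception {
        BulkRequest batch = new BulkRequest();
        for (SearchDocument doc : documents) {
            batch.add(doc.toIndexRequest());
            if (batch.estimatedSizeInBytes() >= MAX_BATCH_BYTES) {
                submitAndWait(batch);            // the ingestion thread blocks here
                batch = new BulkRequest();
            }
        }
        if (batch.numberOfActions() > 0) {
            submitAndWait(batch);
        }
    }

    private void submitAndWait(BulkRequest batch) throws Exception {
        // Synchronous bulk call: no further documents are processed until
        // Elasticsearch responds.
        BulkResponse response = client.bulk(batch, RequestOptions.DEFAULT);
        if (response.hasFailures()) {
            // A single failed document fails the whole request, so every
            // document in it gets retried.
            throw new IllegalStateException(response.buildFailureMessage());
        }
    }

    /** Hypothetical document abstraction standing in for the real ingestion model. */
    public interface SearchDocument {
        IndexRequest toIndexRequest();
    }
}
```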

How Did We Fix It?

Our approach was to process the documents to be ingested using a Kafka Streams application with the following design criteria:

  1. Process the documents in batches, optimally sized for Elasticsearch ingestion (5MB)
  2. Submit batches within a short, bounded period of time so clients do not notice an ingestion delay; this period can vary based on observed latency to increase throughput
  3. Separate document batch processing from batch submission response processing
  4. Use Kafka Streams-backed state stores to optimize the processing time for each document and to protect against application container failures
  5. Support batching for documents targeted to different Elasticsearch hosts
  6. Reprocess individual failed documents, using a delay that increases with the retry count
  7. Gather comprehensive metrics to support analysis of the state of the ingestion process
  8. Ensure that older versions of a document do not override newer versions of that document during the ingestion process

Our new ingestion engine is a Kafka Streams application that balances the objective of submitting large batches of documents against the objective of ingesting individual documents quickly.
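A minimal sketch of how such a Kafka Streams application could be wired together is shown below. The topic, store, and processor names are assumptions for illustration rather than the actual implementation; the processor stages documents in a changelogged state store and flushes them on a schedule, as described in the following sections.

```java
// Hypothetical topology wiring for a Kafka Streams ingestion engine backed by a
// state store. Topic, store, and processor names are illustrative assumptions.
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.state.Stores;

public class IngestionTopology {
    public static KafkaStreams build(String bootstrapServers) {
        Topology topology = new Topology();

        // Read the normalized documents to be ingested.
        topology.addSource("audit-events-source", "audit-events");

        // The processor stages documents in a state store and flushes them via a
        // scheduled punctuator (see the process and punctuate sketches below).
        topology.addProcessor("batching-processor",
                BatchingDocumentProcessor::new, "audit-events-source");

        // Kafka Streams-backed store: changelogged to Kafka, so staged documents
        // survive application container failures.
        topology.addStateStore(
                Stores.keyValueStoreBuilder(
                        Stores.persistentKeyValueStore("pending-documents"),
                        Serdes.String(), Serdes.String()),
                "batching-processor");

        Properties config = new Properties();
        config.put(StreamsConfig.APPLICATION_ID_CONFIG, "search-ingestion-engine");
        config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        return new KafkaStreams(topology, config);
    }
}
```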

The process Method
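The process method's job is mainly to stage each incoming document in the state store. The sketch below is a hedged approximation under the design criteria above, with hypothetical names; a real implementation would also compare document versions so an older event can never replace a newer staged document (criterion 8).

```java
// Hypothetical sketch of the processor's process method. Each incoming document
// is staged in the Kafka Streams state store keyed by document ID, so a later,
// newer version of the same document simply replaces the staged one.
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

public class BatchingDocumentProcessor implements Processor<String, String, Void, Void> {
    private KeyValueStore<String, String> pendingDocuments;

    @Override
    public void init(ProcessorContext<Void, Void> context) {
        // Fault-tolerant store declared in the topology sketch above.
        this.pendingDocuments = context.getStateStore("pending-documents");
        // A wall-clock punctuator that flushes batches is also scheduled here;
        // see the punctuate sketch in the next section.
    }

    @Override
    public void process(Record<String, String> record) {
        // Staging is a cheap local write, so per-document processing time stays
        // small regardless of Elasticsearch response times.
        pendingDocuments.put(record.key(), record.value());
    }

    @Override
    public void close() {
        // Nothing to release in this sketch.
    }
}
```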

The punctuate Method
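The punctuate step runs on a short wall-clock schedule, drains the state store into batches of roughly 5MB, and hands each batch to the submit step. Again, this is a sketch under assumed names, not the production code.

```java
// Hypothetical punctuator: drains pending documents from the state store into
// ~5MB batches and hands each batch to the submit step. It would be registered
// from the processor's init method, e.g.:
//   context.schedule(Duration.ofSeconds(1), PunctuationType.WALL_CLOCK_TIME,
//                    new BatchFlushPunctuator(pendingDocuments, submitter));
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.Punctuator;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class BatchFlushPunctuator implements Punctuator {
    private static final long MAX_BATCH_BYTES = 5 * 1024 * 1024; // ~5MB per bulk request

    private final KeyValueStore<String, String> pendingDocuments;
    private final BatchSubmitter submitter; // hypothetical; see the submit sketch

    public BatchFlushPunctuator(KeyValueStore<String, String> pendingDocuments,
                                BatchSubmitter submitter) {
        this.pendingDocuments = pendingDocuments;
        this.submitter = submitter;
    }

    @Override
    public void punctuate(long timestamp) {
        List<KeyValue<String, String>> batch = new ArrayList<>();
        long batchBytes = 0;

        try (KeyValueIterator<String, String> pending = pendingDocuments.all()) {
            while (pending.hasNext()) {
                KeyValue<String, String> entry = pending.next();
                batch.add(entry);
                batchBytes += entry.value.getBytes(StandardCharsets.UTF_8).length;

                // Flush as soon as a batch reaches the optimal Elasticsearch size.
                if (batchBytes >= MAX_BATCH_BYTES) {
                    flush(batch);
                    batch = new ArrayList<>();
                    batchBytes = 0;
                }
            }
        }
        // Flush whatever is left so documents never wait longer than the schedule.
        if (!batch.isEmpty()) {
            flush(batch);
        }
    }

    private void flush(List<KeyValue<String, String>> batch) {
        submitter.submit(batch);                                     // asynchronous submission
        batch.forEach(entry -> pendingDocuments.delete(entry.key));  // drop flushed documents
    }

    /** Hypothetical submission interface; an implementation is sketched below. */
    public interface BatchSubmitter {
        void submit(List<KeyValue<String, String>> batch);
    }
}
```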

The submit Method
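The submit step sends a batch to Elasticsearch asynchronously, so the streams thread never blocks on the bulk response, and only the individual documents that failed are retried, with a delay that grows with the retry count. The sketch below uses the Elasticsearch High Level REST Client and a hypothetical retry queue purely for illustration; it implements the BatchSubmitter interface from the previous sketch.

```java
// Hypothetical submit step: asynchronous bulk submission with per-document
// failure handling. The index name and RetryQueue abstraction are assumptions.
import java.util.List;
import org.apache.kafka.streams.KeyValue;
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.bulk.BulkItemResponse;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class ElasticsearchBatchSubmitter implements BatchFlushPunctuator.BatchSubmitter {
    private final RestHighLevelClient client;
    private final RetryQueue retryQueue; // hypothetical: re-queues failed docs with a delay

    public ElasticsearchBatchSubmitter(RestHighLevelClient client, RetryQueue retryQueue) {
        this.client = client;
        this.retryQueue = retryQueue;
    }

    @Override
    public void submit(List<KeyValue<String, String>> batch) {
        BulkRequest request = new BulkRequest();
        for (KeyValue<String, String> doc : batch) {
            request.add(new IndexRequest("audit-events")   // index name is an assumption
                    .id(doc.key)
                    .source(doc.value, XContentType.JSON));
        }

        // Asynchronous submission: response processing is decoupled from batching.
        client.bulkAsync(request, RequestOptions.DEFAULT, new ActionListener<BulkResponse>() {
            @Override
            public void onResponse(BulkResponse response) {
                for (BulkItemResponse item : response.getItems()) {
                    if (item.isFailed()) {
                        // Only the failed document is retried, with a delay that
                        // increases with its retry count.
                        retryQueue.scheduleRetry(item.getId(), item.getFailureMessage());
                    }
                }
            }

            @Override
            public void onFailure(Exception e) {
                // The whole request failed (e.g. the host was unreachable):
                // re-queue every document in the batch for a delayed retry.
                batch.forEach(doc -> retryQueue.scheduleRetry(doc.key, e.getMessage()));
            }
        });
    }

    /** Hypothetical retry scheduler that applies an increasing backoff per document. */
    public interface RetryQueue {
        void scheduleRetry(String documentId, String reason);
    }
}
```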

Results