Uber’s IngestionNext: Transforming Data Ingestion with a Streaming-First Approach
Uber has taken a significant step forward in data management by re-architecting its data lake ingestion platform. The new platform, dubbed IngestionNext, represents a major shift from traditional batch processing methods to a modern, streaming-first architecture. This transition is designed to improve data availability for analytics and machine learning workloads, reducing ingestion latency from hours to mere minutes.
The Shift from Batch to Streaming
Previously, Uber’s data ingestion pipelines relied heavily on Apache Spark to handle large-scale processing through scheduled batch jobs. While effective in managing vast datasets, these batch pipelines often delayed data availability, making it challenging for teams to access timely insights for analytics and experiments.
Kai Waehner, Global Field CTO at Confluent, describes the significance of this shift in a recent LinkedIn post:
“This move is all about treating data freshness as a key dimension of data quality.”
Architecture of IngestionNext
IngestionNext introduces a streaming-first pipeline in which event streams are processed continuously before they are committed to the data lake. Events arrive through Apache Kafka and are processed by Apache Flink jobs, which write the data into Apache Hudi tables. Hudi provides transactional commits, rollbacks, and time travel. Both data freshness and data completeness are measured end to end, from the upstream event to the committed table.
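As a rough illustration of what measuring freshness and completeness end to end can mean, the sketch below computes both for a single commit. It is not Uber's implementation: the `CommittedBatch` record shape, the function names, and the sample numbers are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CommittedBatch:
    """One commit to a lake table (hypothetical record shape)."""
    event_times: List[float]  # epoch seconds at which each event was produced
    commit_time: float        # epoch seconds at which the commit became visible


def freshness_seconds(batch: CommittedBatch) -> float:
    """Worst-case lag in the batch: commit time minus the oldest event time."""
    return batch.commit_time - min(batch.event_times)


def completeness(produced: int, committed: int) -> float:
    """Fraction of upstream events that reached the lake (1.0 means none missing)."""
    if produced == 0:
        return 1.0
    return committed / produced


# Sample values are invented for illustration.
batch = CommittedBatch(event_times=[100.0, 130.0, 160.0], commit_time=220.0)
lag = freshness_seconds(batch)                      # oldest event waited 120 seconds
ratio = completeness(produced=1000, committed=990)  # 99% of events landed
```

Taking the oldest event in each commit gives a conservative freshness figure; a percentile over many commits would be a reasonable alternative.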
IngestionNext architecture (Source: Uber Blog Post)
Handling High Data Volumes
The architecture is designed to serve thousands of datasets at high global data volumes, so that analytics dashboards, experimentation platforms, and machine learning models receive fresh data sooner. A control plane automates job lifecycle management, configuration, and health monitoring. To maintain continuity and prevent data loss during outages, regional failover and fallback strategies have been implemented.
Overcoming Technical Challenges
Transitioning to a streaming ingestion model was not without its challenges. A primary hurdle was that continuous writes create many small files in the data lake, which degrade query performance and storage efficiency. To mitigate this, the engineering team implemented row-group-level merging for Parquet files and added compaction mechanisms that maintain efficient file layouts while ingestion continues.
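The planning side of compaction can be sketched as a bin-packing problem: choose which small files to merge so that each merged file stays under a target size. The example below is a minimal sketch using a first-fit-decreasing heuristic; it does not perform the row-group-level Parquet merge itself, and the function name, thresholds, and file names are assumptions made for illustration.

```python
from typing import Dict, List


def plan_compaction(file_sizes_mb: Dict[str, int],
                    target_mb: int = 64,
                    small_mb: int = 32) -> List[List[str]]:
    """Group files smaller than small_mb into merge groups of at most target_mb."""
    # Largest first, with the name as a tie-breaker so the plan is deterministic.
    small = sorted(
        ((name, size) for name, size in file_sizes_mb.items() if size < small_mb),
        key=lambda item: (-item[1], item[0]),
    )
    groups = []  # each entry is [total_size_mb, [file names]]
    for name, size in small:
        for group in groups:
            if group[0] + size <= target_mb:
                group[0] += size
                group[1].append(name)
                break
        else:
            groups.append([size, [name]])
    # A group with a single file has nothing to merge with, so skip it.
    return [names for _, names in groups if len(names) > 1]


# Hypothetical file listing; sizes in megabytes.
files = {
    "a.parquet": 30,
    "b.parquet": 25,
    "c.parquet": 20,
    "d.parquet": 10,
    "e.parquet": 500,  # already large, left untouched
}
plan = plan_compaction(files)
```

With these inputs the plan merges `a.parquet` with `b.parquet` and `c.parquet` with `d.parquet`, while the large file is left alone.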
Additionally, open-source efforts such as Apache Hudi have explored schema-evolution-aware merging, which uses padding and masking to align files whose schemas differ. The trade-off is added implementation complexity and a greater ongoing maintenance burden.
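One way to read "padding and masking" is sketched below: rows written under an older schema are padded with nulls up to the union of all columns, and a separate mask records which values were actually present. This is an interpretation for illustration only, operating on plain dictionaries rather than Parquet row groups; the function name and sample columns are assumptions, and it is not Hudi's implementation.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def align_schemas(batches: List[List[Row]]) -> Tuple[List[str], List[Row], List[Dict[str, bool]]]:
    """Pad rows to the union of all columns and return a presence mask per row."""
    columns: List[str] = []
    for rows in batches:
        for row in rows:
            for col in row:
                if col not in columns:
                    columns.append(col)  # keep first-seen column order

    aligned: List[Row] = []
    mask: List[Dict[str, bool]] = []
    for rows in batches:
        for row in rows:
            aligned.append({col: row.get(col) for col in columns})  # padding
            mask.append({col: col in row for col in columns})       # masking
    return columns, aligned, mask


# Hypothetical data: the newer batch added a "tip" column.
old_batch = [{"id": 1, "city": "SF"}]
new_batch = [{"id": 2, "city": "NYC", "tip": 3.5}]
columns, aligned, mask = align_schemas([old_batch, new_batch])
```

The mask distinguishes a value that was genuinely null from one that is null only because the column did not exist when the row was written.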
Mechanisms for Reliability
To make streaming ingestion reliable, the team added mechanisms for checkpointing, partition-skew handling, and recovery in distributed streaming pipelines. By tracking offsets from upstream streams and coordinating them with table commits, the system keeps ingested data correct and lets jobs recover from failures by resuming from the last committed position.
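The idea of coordinating offsets with commits can be sketched with an in-memory stand-in: data and the next offset to read are published together, so a job that fails mid-batch restarts from the last committed offset without losing or duplicating events. The `LakeTable` class and `run_ingestion` function are hypothetical and do not reflect the Kafka, Flink, or Hudi APIs.

```python
from typing import List, Optional


class LakeTable:
    """In-memory stand-in for a transactional table (hypothetical)."""

    def __init__(self) -> None:
        self.rows: List[str] = []
        self.committed_offset = 0  # next upstream offset to read

    def commit(self, rows: List[str], next_offset: int) -> None:
        # Rows and offset are published together, so they can never disagree.
        self.rows = self.rows + rows
        self.committed_offset = next_offset


def run_ingestion(stream: List[str], table: LakeTable, batch_size: int,
                  crash_before_commit_at: Optional[int] = None) -> None:
    """Ingest from the last committed offset, optionally simulating a crash."""
    offset = table.committed_offset
    while offset < len(stream):
        batch = stream[offset:offset + batch_size]
        next_offset = offset + len(batch)
        if crash_before_commit_at is not None and next_offset >= crash_before_commit_at:
            raise RuntimeError("simulated failure before commit")
        table.commit(batch, next_offset)
        offset = next_offset


stream = [f"event-{i}" for i in range(10)]
table = LakeTable()
try:
    run_ingestion(stream, table, batch_size=3, crash_before_commit_at=6)
except RuntimeError:
    pass  # the in-flight batch was never committed, so nothing partial is visible
offset_after_crash = table.committed_offset
run_ingestion(stream, table, batch_size=3)  # restart resumes from the committed offset
```

After the restart the table holds every event exactly once, because the second run begins at offset 3 rather than at the start of the stream.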
Improved Resource Efficiency
IngestionNext also improves resource efficiency. By replacing scheduled batch workloads with continuously running streaming jobs that scale dynamically with incoming data volume, Uber has reduced compute usage by an estimated 25%, which lowers infrastructure costs.
Suqiang Song, co-author of the Uber Engineering blog post, notes in his LinkedIn post:
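Scaling with incoming volume can be as simple as deriving the number of parallel tasks from the observed event rate. The sketch below is a hypothetical policy, not the one Uber uses: the per-task capacity, the bounds, and the function name are all assumptions.

```python
import math


def target_parallelism(events_per_sec: float,
                       per_task_capacity: float,
                       min_tasks: int = 1,
                       max_tasks: int = 64) -> int:
    """Number of parallel tasks needed for the current rate, clamped to bounds."""
    needed = math.ceil(events_per_sec / per_task_capacity)
    return max(min_tasks, min(max_tasks, needed))


# Assumed capacity of 5,000 events per second per task.
quiet = target_parallelism(2_000, 5_000)      # low traffic needs a single task
peak = target_parallelism(120_000, 5_000)     # higher traffic needs more tasks
surge = target_parallelism(1_000_000, 5_000)  # capped at max_tasks
```

A job sized this way uses few resources during quiet periods, whereas a scheduled batch job reserves its full allocation on every run regardless of how much data arrived.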
“This enabled a fully end-to-end real-time data stack, from ingestion to transformation to analytics.”
Future Directions
While IngestionNext has significantly improved the freshness of raw data entering the data lake, engineers recognize that downstream transformation and analytics pipelines may still introduce additional latency. Future efforts will focus on extending streaming capabilities throughout the data processing stack, so that improvements in data freshness carry through to the entire analytics workflow.
With IngestionNext, Uber shows how a streaming-first design can make data available sooner for analytics and machine learning. As data ingestion practices continue to evolve, Uber’s approach is likely to serve as a reference for organizations looking to reduce latency in their own data platforms.