Uber’s IngestionNext: Transforming Data Ingestion with a Streaming-First Approach

Uber has taken a significant step forward in data management by re-architecting its data lake ingestion platform. The new platform, dubbed IngestionNext, represents a major shift from traditional batch processing methods to a modern, streaming-first architecture. This transition is designed to improve data availability for analytics and machine learning workloads, reducing ingestion latency from hours to mere minutes.

Contents

The Shift from Batch to Streaming
Architecture of IngestionNext

Handling High Data Volumes

Overcoming Technical Challenges

Mechanisms for Reliability

Improved Resource Efficiency
Future Directions

The Shift from Batch to Streaming

Previously, Uber’s data ingestion pipelines relied heavily on Apache Spark to handle large-scale processing through scheduled batch jobs. While effective in managing vast datasets, these batch pipelines often delayed data availability, making it challenging for teams to access timely insights for analytics and experiments.

Kai Waehner, Uber’s Global Field CTO, articulates the significance of this paradigm shift in a recent LinkedIn post:

“This move is all about treating data freshness as a key dimension of data quality.”

Architecture of IngestionNext

IngestionNext introduces a streaming-first pipeline where event streams are processed continuously before they are committed to the data lake. The flow of events is managed through Apache Kafka, with processing accomplished via Apache Flink jobs. This setup writes data into Hudi tables, which support transactional commits, rollbacks, and time travel capabilities. The architecture not only enhances data freshness but also ensures completeness through comprehensively measured end-to-end metrics.

IngestionNext architecture (Source: Uber Blog Post)

Handling High Data Volumes

The architecture is designed to accommodate thousands of datasets while managing high global data volumes. This capability leads to faster availability of analytics dashboards, experimentation platforms, and machine learning models. A control plane automates the job lifecycle, configuration, and health monitoring, further enhancing the system’s robustness. To ensure continuity and prevent data loss during outages, regional failovers and fallback strategies have been implemented.

Overcoming Technical Challenges

Transitioning to a streaming ingestion model was not without its challenges. A primary hurdle was the creation of numerous small files in the data lake, which could adversely affect query performance and storage efficiency. To mitigate this, the engineering team implemented row-group-level merging strategies for Parquet files and added compaction mechanisms to uphold efficient file layouts during ongoing data ingestion.

Additionally, open-source initiatives, including Apache Hudi, have been explored for schema-evolution-aware merging. This involves using padding and masking techniques to align differing schemas, although it introduces complexities in implementation and ongoing maintenance.

Mechanisms for Reliability

To further fortify the streaming ingestion process, mechanisms for handling checkpointing, partition skew, and recovery in distributed streaming pipelines were integrated. By tracking offsets from upstream streams and coordinating commits, the system ensures that ingestion jobs maintain data correctness and can effectively recover from failures.

Improved Resource Efficiency

One of the standout benefits of IngestionNext is the improved resource efficiency it offers. By replacing scheduled batch workloads with continuously running streaming jobs that dynamically scale with incoming data volume, Uber has achieved a remarkable reduction in compute usage, estimated at around 25%. This efficiency boost not only reduces costs but also optimizes resource allocation across various departments.

Suqiang Song, co-author of the Uber Engineering blog, notes in his LinkedIn post:

“This enabled a fully end-to-end real-time data stack, from ingestion to transformation to analytics.”

Future Directions

While the implementation of IngestionNext has significantly improved the freshness of raw data entering the data lake, engineers recognize that downstream transformations and analytics pipelines may still introduce additional latency. Future efforts will focus on extending streaming capabilities throughout the data processing stack, ensuring that improvements in data freshness carry through to the entire analytics workflow.

With IngestionNext, Uber has set a powerful example in the realm of data management, showcasing how implementing cutting-edge technologies can elevate data accessibility and capabilities for critical analytics and machine learning. As the landscape of data ingestion continues to evolve, Uber’s innovations will likely serve as a guide for organizations aiming to enhance their own data handling practices.

Inspired by: Source

Uber Unveils IngestionNext: Next-Gen Streaming Data Lake Reduces Latency and Compute Costs by 25%

Uber’s IngestionNext: Transforming Data Ingestion with a Streaming-First Approach

The Shift from Batch to Streaming

Architecture of IngestionNext

Handling High Data Volumes

Overcoming Technical Challenges

Mechanisms for Reliability

Improved Resource Efficiency

Future Directions

Stay Connected

Explore Top AI Tools Instantly

Latest News

Meta Disables Instagram Feature Allowing Users to Create AI Deepfakes of Public Accounts

Optimizing Layer-Adaptive Large Language Models: Curvature-Weighted Capacity Allocation Using Minimum Description Length Framework

Concerns Rise as UK Shops Launch Facial Recognition Technology with Real-Time Police Alerts

Cloudflare Launches Temporary Accounts for Seamless Autonomous Worker Deployment

Leading global tech insights for 20M+ innovators

Quick Link

Support

Sign Up for Our Newsletter

Uber’s IngestionNext: Transforming Data Ingestion with a Streaming-First Approach

The Shift from Batch to Streaming

Architecture of IngestionNext

More Read

Handling High Data Volumes

Overcoming Technical Challenges

Mechanisms for Reliability

Improved Resource Efficiency

Future Directions

Sign Up For Daily Newsletter

Get AI news first! Join our newsletter for fresh updates on open-source models.

Stay Connected

Explore Top AI Tools Instantly

Latest News

Meta Disables Instagram Feature Allowing Users to Create AI Deepfakes of Public Accounts

Optimizing Layer-Adaptive Large Language Models: Curvature-Weighted Capacity Allocation Using Minimum Description Length Framework

Concerns Rise as UK Shops Launch Facial Recognition Technology with Real-Time Police Alerts

Cloudflare Launches Temporary Accounts for Seamless Autonomous Worker Deployment