Google BigQuery Launches Third-Party Generative AI Inference: What You Need to Know
Google has recently unveiled a trailblazing capability in BigQuery, enabling third-party generative AI inference for open models. This feature allows data teams to deploy and run models from Hugging Face or Vertex AI Model Garden using simple SQL commands. Let’s dive into what this means for data teams, how it works, and what advantages it brings to the table.
Simplifying AI Deployment
Historically, deploying open-source AI models has been a cumbersome task for data teams. They faced a multitude of challenges, including managing Kubernetes clusters, configuring endpoints, and coordinating various tools. As Virinchi T noted in a Medium article, "This process requires multiple tools, different skill sets, and significant operational overhead." For many teams, this friction prevented them from harnessing AI capabilities—even when the models were readily available.
With the latest enhancement in BigQuery, this complexity has been significantly reduced. Now, utilizing a SQL interface, the entire workflow can be distilled down to merely two SQL statements.
How to Use the New Feature
Deploying a Model
To get started, users can create a model by executing a CREATE MODEL statement, specifying either a Hugging Face model ID, such as sentence-transformers/all-MiniLM-L6-v2, or a model name from the Vertex AI Model Garden. Google’s BigQuery takes care of provisioning compute resources with default configurations, typically completing the deployment process within 3 to 10 minutes, depending on the chosen model’s size.
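As a sketch, deploying the embedding model named above might look like the following. The project, dataset, and model names are placeholders, and the exact option keys (`model_type`, `hugging_face_model_id`) are assumptions based on BigQuery's CREATE MODEL conventions for deployed models, so verify them against the official documentation:

```sql
-- Deploy an open Hugging Face embedding model from SQL.
-- Option names here are assumed; check BigQuery's CREATE MODEL reference.
CREATE OR REPLACE MODEL `my_project.my_dataset.minilm_embedder`
OPTIONS (
  model_type = 'DEPLOYED_MODEL',
  hugging_face_model_id = 'sentence-transformers/all-MiniLM-L6-v2'
);
```

BigQuery then provisions the Vertex AI endpoint behind the scenes with default compute settings.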
Running Inference
Once the model is deployed, running inference is seamless. Users can utilize AI.GENERATE_TEXT for language models or AI.GENERATE_EMBEDDING for embeddings, querying the necessary data directly from BigQuery tables. BigQuery also smartly manages the resource lifecycle with the endpoint_idle_ttl option, automatically shutting down idle endpoints to prevent unnecessary charges. If a team needs to undeploy endpoints, they can easily do so using the ALTER MODEL statement when batch jobs conclude.
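A minimal inference-and-cleanup sketch, assuming the model created earlier. The table and column names are placeholders, and the function signature and the `deployed` option in ALTER MODEL are assumptions modeled on BigQuery's AI function conventions rather than confirmed syntax:

```sql
-- Generate embeddings for rows in an existing table.
-- The input alias (content) and exact signature are assumptions.
SELECT *
FROM AI.GENERATE_EMBEDDING(
  MODEL `my_project.my_dataset.minilm_embedder`,
  (SELECT review_text AS content FROM `my_project.my_dataset.reviews`)
);

-- Undeploy the endpoint once the batch job concludes (option name assumed).
ALTER MODEL `my_project.my_dataset.minilm_embedder`
SET OPTIONS (deployed = FALSE);
```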
Customization for Production Use
One of the standout features of this new capability is its customization options for production use cases. Users can specify machine types, set replica counts, and configure endpoint idle times directly within the CREATE MODEL statement. Additionally, Compute Engine reservations can secure GPU instances to ensure consistent performance. When it’s time to retire a model, a simple DROP MODEL statement cleans up all associated resources in Vertex AI.
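The production-oriented options described above might be combined in a single CREATE MODEL statement along these lines. The machine type, replica counts, model ID, and `endpoint_idle_ttl` values are illustrative placeholders, and the option keys are assumptions to be checked against the CREATE MODEL reference:

```sql
-- A production-oriented deployment with explicit sizing and idle shutdown.
-- All option names and values below are assumptions for illustration.
CREATE OR REPLACE MODEL `my_project.my_dataset.llama_generator`
OPTIONS (
  model_type = 'DEPLOYED_MODEL',
  hugging_face_model_id = 'meta-llama/Llama-3.1-8B-Instruct',
  machine_type = 'g2-standard-12',
  min_replica_count = 1,
  max_replica_count = 4,
  endpoint_idle_ttl = INTERVAL 2 HOUR
);

-- Retiring the model cleans up the associated Vertex AI resources.
DROP MODEL `my_project.my_dataset.llama_generator`;
```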
Granular Resource Control
Google’s blog emphasizes "granular resource control" and "automated resource management," which allow teams to strike an effective balance between performance and cost without leaving the SQL environment. Earlier posts demonstrated that, using similar patterns with open-source embedding models, processing 38 million rows cost as little as $2-3.
Model Compatibility
This new feature supports an impressive array of over 13,000 Hugging Face text embedding models and more than 170,000 text generation models, including Meta’s Llama series and Google’s Gemma family. However, models must meet Vertex AI Model Garden’s deployment requirements, such as regional availability and quota limits.
Impacts on Data Roles
The launch has distinct advantages for various roles within data teams:
- For Data Analysts: The new SQL interface empowers you to experiment with ML models directly in your SQL environment, eliminating the need to wait for engineering resources.
- For Data Engineers: It simplifies the process of building ML-powered data pipelines, removing the need for separate ML infrastructure maintenance.
Competitive Landscape
With the introduction of this feature, BigQuery enters the competitive landscape alongside Snowflake’s Cortex AI and Databricks’ Model Serving, both of which offer SQL-accessible ML inference. BigQuery’s strength lies in its direct integration with the extensive Hugging Face model catalog, making it an attractive option for users already leveraging Google Cloud.
Learning Resources
For those eager to explore this new functionality, comprehensive documentation and tutorials are readily available for text generation with Gemma models and embedding generation, ensuring users can quickly get up to speed and make the most of these advancements.
This new capability is set to revolutionize how data teams approach machine learning, streamlining processes and making cutting-edge AI more accessible than ever.