At the Data + AI Summit, Databricks announced the latest generation of its industry-leading machine learning (ML) offering with the launch of Databricks Machine Learning, a new data-native platform built on top of an open lakehouse architecture. With Databricks Machine Learning, new and existing ML capabilities on the Databricks Lakehouse Platform are integrated into a collaborative, purpose-built experience that provides ML engineers with everything they need to build, train, deploy, and manage ML models from experimentation to production, uniquely combining data and the full ML lifecycle. Databricks Machine Learning also includes two new capabilities. Databricks AutoML augments the machine learning process by automating the tedious steps that data scientists currently perform by hand, while still exposing enough control and transparency. Databricks Feature Store improves the discoverability, reuse, and governance of model features in a system integrated with the enterprise’s data engineering platform.
Many ML platforms fall short because they ignore a key challenge in machine learning: they assume that high-quality data is already available and ready for training. That assumption forces data teams to stitch together solutions that are good at data but not AI with others that are good at AI but not data. To complicate things further, the people responsible for data platforms and pipelines (data engineers) are different from those who train ML models (data scientists), who in turn are different from those who deploy product applications (engineering teams who own business applications). As a result, solutions for ML need to bridge gaps between data and AI, the tooling required, and the people involved.
Databricks Machine Learning provides each member of the data team with the right tools in one collaborative environment. Users can switch between the Data Science / Engineering, SQL Analytics, and new Machine Learning experiences to access the tools and features relevant to their everyday workflow. Databricks Machine Learning also provides a new ML-focused start page that surfaces the new ML capabilities and resources, with quick access to Experiments, the Feature Store, and the Model Registry. Because it is built on an open lakehouse foundation, Databricks Machine Learning lets customers work with any type of data at any scale for machine learning, from traditional structured tables, to unstructured data like videos and images, to streaming data from real-time applications and IoT sensors. Customers can then move quickly through the ML workflow and get more models to production faster.
“Humana’s machine learning platform, FlorenceAI, is enabling us to automate and accelerate the delivery lifecycle of ML solutions at scale. Databricks has been an essential underlying technology, with hundreds of our data scientists using the platform to deliver dozens of models in production, so that our teams are able to operate at orders of magnitude faster than before,” said Slawek Kierner, Senior Vice President of Enterprise Data and Analytics at Humana.
Databricks AutoML: Jumpstart new projects and automate tedious ML tasks
AutoML has the potential to let data teams build ML models more quickly by automating much of the heavy lifting involved in the experimentation and training phases. But enterprises that use AutoML tools today often struggle to get AutoML models to production. This happens because the tools provide no visibility into how they arrive at their final model, which makes it impossible to tune the model or to troubleshoot it when edge cases in the data lead to low-confidence predictions. Additionally, organizations can find it difficult to satisfy compliance requirements to explain how a model works, because they lack visibility into the model’s code.
The AutoML capabilities within Databricks ML take a unique ‘glass box’ approach instead. They not only let data teams quickly produce trained models through either a UI or an API, but also auto-generate the underlying experiments and notebooks with code, so data scientists can easily validate an unfamiliar data set or modify the generated ML project. Data scientists have full transparency into how a model operates and can take control at any time. This transparency is critical in highly regulated environments and for collaboration with expert data scientists.
All AutoML experiments are integrated with the rest of the Databricks Lakehouse Platform, including MLflow, which tracks the parameters, metrics, artifacts, and models associated with every trial run. This makes it easy to compare models and deploy them to production.
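To make the idea of comparing tracked trial runs concrete, here is a minimal sketch in plain Python. It does not use the MLflow or Databricks AutoML APIs. The TrialRun record, the best_run helper, and all parameter and metric values are hypothetical and invented purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TrialRun:
    """Hypothetical record of one trial: what was tried and how it scored."""
    run_id: str
    params: Dict[str, float]
    metrics: Dict[str, float]


def best_run(runs: List[TrialRun], metric: str) -> TrialRun:
    """Return the run with the highest value of the given metric."""
    return max(runs, key=lambda run: run.metrics[metric])


# Made-up trials; real values would come from the tracked experiment.
runs = [
    TrialRun("trial-1", {"max_depth": 3}, {"val_accuracy": 0.81}),
    TrialRun("trial-2", {"max_depth": 6}, {"val_accuracy": 0.86}),
    TrialRun("trial-3", {"max_depth": 9}, {"val_accuracy": 0.84}),
]
winner = best_run(runs, "val_accuracy")
print(winner.run_id, winner.params)
```

Because every trial carries its parameters alongside its metrics, the winning configuration can be inspected and reproduced rather than treated as an opaque result.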
Databricks Feature Store: Streamline ML at scale with simplified feature sharing and discovery
Machine learning models are built using features, which are the attributes a model uses to make a prediction. To work efficiently, data scientists need to be able to discover which features exist within their organization, how they are built, and where they are used, rather than wasting significant time reinventing them. Additionally, feature code needs to be kept consistent across the several teams that participate in the ML workflow. Otherwise, model performance will drift apart between real-time and batch use cases, a problem called online/offline skew.
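Online/offline skew is easier to see with a small example. The sketch below is illustrative only: the data, the function names, and the feature itself are hypothetical. It shows the same feature implemented twice, once for batch training and once for real-time serving, with a subtle difference between the two.

```python
from statistics import mean

# Hypothetical raw data: purchase amounts per customer (illustrative values only).
PURCHASES = {"c1": [20.0, 40.0, 60.0]}


def batch_avg_purchase(customer_id: str) -> float:
    """Feature as the batch (training) pipeline computes it: mean of all purchases."""
    return mean(PURCHASES[customer_id])


def online_avg_purchase(customer_id: str) -> float:
    """Same feature re-implemented for serving: mean of the last two purchases only."""
    return mean(PURCHASES[customer_id][-2:])


def skew(customer_id: str) -> float:
    """Absolute gap between the online and offline values of the same feature."""
    return abs(online_avg_purchase(customer_id) - batch_avg_purchase(customer_id))


print(skew("c1"))
```

The model was trained on one definition of the feature but is served a different one, so its real-time predictions can diverge from what batch evaluation suggested.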
The Databricks Feature Store is the first feature store co-designed with a data and MLOps platform. Tight integration with the popular open source frameworks Delta Lake and MLflow ensures that data stored in the Feature Store is open and that models trained with any ML framework can benefit from the integration of the Feature Store with the MLflow model format. Most importantly, the Feature Store eliminates online/offline skew by packaging feature store references with the model, so that the model itself can look up features from the Feature Store instead of requiring a client application to do so. As a result, features can be updated without any changes to the client application that sends requests to the model. The Feature Store also enables reusability and discoverability through automated lineage tracking, which records the data sources used for feature computation as well as the exact version of the code that was used. With this, a data scientist can find all of the features that have already been defined on the raw data they are planning to use. Finally, the Feature Store knows exactly which models and endpoints consume any given feature, which facilitates end-to-end lineage as well as safe decision-making on whether a feature can be updated or deleted.
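The idea of packaging feature references with the model can be sketched in a few lines of plain Python. This is not the Databricks Feature Store API. InMemoryFeatureStore, PackagedModel, the table and feature names, and the scoring function are all hypothetical stand-ins chosen to illustrate the pattern.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


class InMemoryFeatureStore:
    """Hypothetical stand-in for a feature store: table -> key -> feature row."""

    def __init__(self) -> None:
        self._tables: Dict[str, Dict[str, Dict[str, float]]] = {}

    def write(self, table: str, key: str, row: Dict[str, float]) -> None:
        self._tables.setdefault(table, {})[key] = dict(row)

    def read(self, table: str, key: str, names: List[str]) -> List[float]:
        row = self._tables[table][key]
        return [row[name] for name in names]


@dataclass
class PackagedModel:
    """A model bundled with references to the features it was trained on."""
    predict_fn: Callable[[List[float]], float]
    feature_table: str
    feature_names: List[str]

    def predict(self, store: InMemoryFeatureStore, key: str) -> float:
        # The model resolves its own inputs; the client supplies only the key.
        features = store.read(self.feature_table, key, self.feature_names)
        return self.predict_fn(features)


store = InMemoryFeatureStore()
store.write("customer_features", "c1", {"avg_purchase": 40.0, "visits": 3.0})

model = PackagedModel(
    predict_fn=lambda x: 0.5 * x[0] + 2.0 * x[1],  # toy scoring function
    feature_table="customer_features",
    feature_names=["avg_purchase", "visits"],
)

first = model.predict(store, "c1")
# The feature value is refreshed in the store; the client call does not change.
store.write("customer_features", "c1", {"avg_purchase": 50.0, "visits": 3.0})
second = model.predict(store, "c1")
print(first, second)
```

In this sketch the client passes only a lookup key, so training and serving read the same stored feature values, and refreshing a feature requires no change on the client side.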