Data Versioning & Lineage

A model is only as good as the data it was trained on. MLflow helps you track the lineage of your data: which dataset, from which source, was used by which run.

1. Logging Datasets

In MLflow 2.4 and later, the mlflow.data module lets you log dataset metadata (name, source, schema, and a content digest) directly against a run.

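import mlflow
import mlflow.data
import pandas as pd
from mlflow.data.pandas_dataset import PandasDataset

# Load the training data into a DataFrame.
df = pd.read_csv("data.csv")

# Wrap the DataFrame as an MLflow dataset; `source` records where the data came from.
dataset: PandasDataset = mlflow.data.from_pandas(df, source="data.csv", name="iris_training_set")

with mlflow.start_run():
    # Attach the dataset to the run and label how it was used.
    mlflow.log_input(dataset, context="training")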
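Part of the metadata logged above is a digest, a short fingerprint that changes when the data changes. The sketch below illustrates that idea with the standard library only. It is not MLflow's own digest algorithm; the helper name `file_digest` and the 8-character truncation are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Return a short SHA-256 fingerprint of a file's bytes.

    Hypothetical helper for illustration; MLflow computes its own digest
    and this function does not reproduce it.
    """
    hasher = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    # Keep the first 8 hex characters purely for readability.
    return hasher.hexdigest()[:8]
```

Two files with identical bytes produce the same fingerprint, and any edit to the file produces a different one, which is what makes a digest useful for telling dataset versions apart.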

2. Integration with Delta Lake

If you use Delta Lake, you can log the specific version of the table that was read for training, so the same snapshot can be reloaded later through Delta time travel.

# 5 is the Delta table version that was read for training.
mlflow.log_param("delta_version", 5)
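Rather than hard-coding the version number, it can be looked up at run time. The sketch below is a minimal, standard-library-only illustration that assumes the table lives on a local filesystem and relies on the Delta transaction log convention of naming commit files with a zero-padded 20-digit version (for example `00000000000000000005.json`). The helper name `latest_delta_version` is hypothetical; in a real pipeline the Delta library's own history (for example `DESCRIBE HISTORY`) is the authoritative source.

```python
import re
from pathlib import Path
from typing import Optional

# Commit files in _delta_log look like 00000000000000000005.json.
# Checkpoint and checksum files do not match this pattern and are ignored.
_COMMIT_FILE = re.compile(r"^(\d{20})\.json$")


def latest_delta_version(table_path: str) -> Optional[int]:
    """Return the highest commit version in <table_path>/_delta_log, or None.

    Hypothetical helper: assumes a local path, not an object store.
    """
    log_dir = Path(table_path) / "_delta_log"
    if not log_dir.is_dir():
        return None
    versions = [
        int(match.group(1))
        for entry in log_dir.iterdir()
        if (match := _COMMIT_FILE.match(entry.name))
    ]
    return max(versions) if versions else None
```

The returned integer could then be passed to `mlflow.log_param("delta_version", ...)` in place of the literal 5. Note that this reports the latest version at lookup time, so it should be read at the same point the training data is loaded.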