A model is only as good as the data it was trained on. MLflow helps you track the lineage of that data.
1. Logging Datasets
In MLflow 2.4 and later, you can log dataset metadata (name, source, schema, and a content digest) directly on a run.
import mlflow.data
import pandas as pd
from mlflow.data.pandas_dataset import PandasDataset
df = pd.read_csv("data.csv")
dataset: PandasDataset = mlflow.data.from_pandas(df, source="data.csv", name="iris_training_set")
with mlflow.start_run():
    mlflow.log_input(dataset, context="training")
2. Integration with Delta Lake
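If you use Delta Lake, you can log the specific version of the table used for training as a run parameter, so the exact snapshot can be identified later. Here 5 is an example version number: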
mlflow.log_param("delta_version", 5)
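A version number alone does not say which table it belongs to, so it can help to log the table path alongside it. The following is a minimal sketch of that idea, not an MLflow or Delta Lake feature: the helper names `delta_lineage_params` and `log_delta_lineage` and the parameter key `delta_table_path` are hypothetical, chosen only for illustration. The only MLflow call used is `mlflow.log_params`.

```python
from typing import Dict, Union


def delta_lineage_params(table_path: str, version: int) -> Dict[str, Union[str, int]]:
    """Build the run parameters that identify one Delta table snapshot.

    Hypothetical helper: the key names are an assumption, not an MLflow convention.
    """
    if version < 0:
        raise ValueError("Delta table versions start at 0")
    return {"delta_table_path": table_path, "delta_version": version}


def log_delta_lineage(table_path: str, version: int) -> None:
    """Log the table path and version as parameters of the active MLflow run."""
    import mlflow  # imported lazily so the helper above has no MLflow dependency

    mlflow.log_params(delta_lineage_params(table_path, version))
```

Inside a `with mlflow.start_run():` block you would call, for example, `log_delta_lineage("/data/delta/iris", 5)`, where the path is a placeholder. Assuming a Spark session with Delta Lake configured, the same snapshot can later be reread with `spark.read.format("delta").option("versionAsOf", 5).load("/data/delta/iris")`.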