This is a series of blog posts about MLOps. The series consists of four posts:

1 - Deploying and setting up MLFlow

2 - Deploying and setting up Kubeflow

3 - Deploying Machine Learning endpoints with different tools

4 - Deploying Streamlit applications using various ML endpoints for a user interface

To follow along with this blog post, the repo is here.

Architecture

Here is an overview of the MLOps workflow with MLflow.


And in today’s post, we will focus on the MLflow portion of this workflow.


What is MLflow?


MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It tackles three primary functions: tracking experiments to record and compare parameters and results, packaging code into reproducible runs, and sharing and deploying models. MLflow is library-agnostic: all functions are accessible through REST APIs and CLI commands, so it can be used with any machine learning library and in any programming language, and it integrates easily with frameworks such as scikit-learn, TensorFlow, and PyTorch.

To start off the series, in today’s post we will:

1 - Deploy MLflow on Google Cloud Platform (GCP) using Cloud Run.

2 - Set up load balancing, a Cloud Armor policy, a custom domain, and an SSL certificate for the MLflow server for restricted access.

3 - Set up Secret Manager to store the MLflow server’s credentials and access them during the deployment process.

4 - Deploy a simple machine learning model using MLflow and test the deployment.

Deploying MLflow on GCP

Basic Configuration

1 - Create a service account with the necessary permissions to access the GCP resources. At a minimum, the service account should have the following roles:

  • Cloud Run Invoker
  • Storage Object Viewer
  • Secret Manager Secret Accessor
  • Cloud container admin

For example, the following command can be used to add the roles to the service account. Note that gcloud accepts only one --role flag per add-iam-policy-binding call (if the flag is repeated, only the last one takes effect), so we loop over the roles:

for role in roles/run.invoker \
            roles/storage.objectViewer \
            roles/secretmanager.secretAccessor \
            roles/container.admin; do
    gcloud projects add-iam-policy-binding gcp-prj-id-123 \
        --member "serviceAccount:svc@gcp-prj-id-123.iam.gserviceaccount.com" \
        --role "$role"
done

2 - Obtain a credential file for the service account.

You can do so by navigating to IAM & Admin > Service Accounts > select the service account > Keys > Add Key > Create new key.
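Alternatively, the key can be created from the command line (the key file name is your choice):

gcloud iam service-accounts keys create credentials.json \
    --iam-account svc@gcp-prj-id-123.iam.gserviceaccount.com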

3 - Create a Cloud Storage bucket to store the MLflow artifacts.

Confirm that the service account that will be used to deploy the application has the necessary permissions to access the bucket.
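For example, the bucket can be created from the command line (the bucket name here is illustrative; it must be globally unique):

gcloud storage buckets create gs://mlflow-artifacts-123 \
    --project gcp-prj-id-123 \
    --location us-central1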

4 - Create a Cloud SQL instance to store the MLflow metadata.

You can choose either MySQL or PostgreSQL as the database engine. In this example, we will use PostgreSQL.

Fill in the instance name, password, and other necessary details.

Create a database and a user for the MLFlow server.

To create the database: SQL instance details > Databases > Create Database.

To create the user: SQL instance details > Users > Add User Account.

Enable SSL connections to the database: SQL instance details > Connections > Security > Manage SSL mode > select Allow only SSL connections.

Then, under Manage client certificates, click on the “Create client certificate” button to download the client certificate. A command-line sketch of this step follows below.
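Here is a minimal sketch of the same step from the command line (the instance, database, and user names are illustrative; choose a strong password):

gcloud sql instances create mlflow-pgsql \
    --database-version POSTGRES_15 \
    --region us-central1 \
    --tier db-g1-small

gcloud sql databases create mlflow --instance mlflow-pgsql

gcloud sql users create mlflow-user \
    --instance mlflow-pgsql \
    --password <db-pass>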

5 - Create secrets in Secret Manager to store the database credentials.

Create the following secrets:

  • mlflow_artifact_url: the storage bucket where the MLflow artifacts are stored, for example gs://mlflow
  • mlflow_database_url: for example postgresql+psycopg2://<db-user>:<db-pass>@/<db-name>?host=/cloudsql/gcp-prj-id-123:us-central1:<cloud-sql-instance-name>
  • mlflow_tracking_username: basic HTTP auth username for MLflow (your choice)
  • mlflow_tracking_password: basic HTTP auth password for MLflow (your choice)

Note that the Cloud SQL instance name can be copied from the Cloud SQL instance overview page. Later, once the server has a public URL, we will also store it in a mlflow_tracking_url secret so that client code can read it from Secret Manager.

The secrets can be created in the console or from the command line, as shown below.
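For example (echo -n avoids storing a trailing newline in the secret value):

echo -n "gs://mlflow" | \
    gcloud secrets create mlflow_artifact_url --data-file=-

echo -n "postgresql+psycopg2://<db-user>:<db-pass>@/<db-name>?host=/cloudsql/gcp-prj-id-123:us-central1:<cloud-sql-instance-name>" | \
    gcloud secrets create mlflow_database_url --data-file=-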

Deploying the server

Folder structure for the project:

.
├── Dockerfile
├── README.md
├── app
│   ├── __init__.py
│   ├── get_secret.py
│   └── mlflow_auth.py
├── deploy
│   └── production
│       └── service.yaml
├── entry-point.sh
├── poetry.lock
├── pyproject.toml
└── skaffold.yaml

Dockerfile as below:

FROM python:3.10-slim

ENV PYTHONUNBUFFERED=True

ENV POETRY_VERSION=1.7.1 \
    USERNAME=nonroot

WORKDIR /usr/src/app

# Run the server as a non-root user
RUN adduser $USERNAME
USER $USERNAME

ENV HOME=/home/$USERNAME
ENV PATH="$HOME/.local/bin:$PATH"

# Install Poetry via pipx so it stays isolated from the app's dependencies
RUN pip install pipx
RUN pipx install poetry==${POETRY_VERSION}

# Copy the lockfiles first so the dependency layer is cached across code changes
COPY ./poetry.lock ./pyproject.toml /usr/src/app/

RUN poetry install -nv --no-root

COPY app .

COPY entry-point.sh ./entry-point.sh

ENTRYPOINT ["/usr/bin/env", "bash", "./entry-point.sh"]

EXPOSE 8080
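The content of entry-point.sh is not reproduced here; see the repo for the full version. As a minimal sketch, assuming get_secret.py prints a secret’s value when invoked with the secret’s name (the real repo also wires up basic HTTP auth via mlflow_auth.py), it would look roughly like this:

#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper calls: fetch configuration from Secret Manager
ARTIFACT_URL=$(poetry run python get_secret.py mlflow_artifact_url)
DATABASE_URL=$(poetry run python get_secret.py mlflow_database_url)

# Launch the tracking server on the port Cloud Run expects
exec poetry run mlflow server \
    --backend-store-uri "$DATABASE_URL" \
    --default-artifact-root "$ARTIFACT_URL" \
    --host 0.0.0.0 \
    --port 8080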

Set up skaffold.yaml as below:

apiVersion: skaffold/v4beta2
kind: Config
metadata:
  name: mlflow-gcp
build:
  artifacts:
  - image: us-central1-docker.pkg.dev/gcp-prj-id-123/mlflow-gcp/mlflow-gcp
    docker:
      dockerfile: Dockerfile
    platforms:
      - "linux/amd64"
profiles:
- name: production
  manifests:
    rawYaml:
    - deploy/production/service.yaml
deploy:
  cloudrun:
    projectid: gcp-prj-id-123
    region: us-central1

And the service.yaml file in the deploy/production folder as below:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: mlflow-gcp
spec:
  template:
    metadata:
      annotations:
        run.googleapis.com/cloudsql-instances: gcp-prj-id-123:us-central1:mlflow-pgsql
        run.googleapis.com/startup-cpu-boost: 'true'
    spec:
      serviceAccountName: svc@developer.gserviceaccount.com
      containers:
      - image: us-central1-docker.pkg.dev/gcp-prj-id-123/mlflow-gcp/mlflow-gcp
        ports:
        - name: http1
          containerPort: 8080
        env:
        - name: GCP_PROJECT
          value: gcp-prj-id-123
        resources:
          limits:
            memory: 1Gi
            cpu: 1000m

To deploy the server, run the following command:

skaffold run -p production

And there you go: after a successful deployment, you can access the MLflow server at the URL provided in the Cloud Run service details.
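The URL can also be fetched from the command line:

gcloud run services describe mlflow-gcp \
    --region us-central1 \
    --format 'value(status.url)'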

When you open it, you will be prompted for the username and password that you set in Secret Manager.


Setting up restricted access to the MLflow server

Next, we will set up our MLflow UI so that it is only accessible from specific IP addresses, instead of making the application public and protected only by basic HTTP auth.

Custom access restriction with Cloud Armor

Under Network Security > Cloud Armor Policies > Create Policy. For Policy type, select Backend security policy.

Fill in the necessary details, choose the Allow action, and add the IP addresses that you want to allow to access the MLflow server; a command-line sketch of this policy follows below.
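A minimal sketch of the same policy from the command line (the policy name and IP range are illustrative):

gcloud compute security-policies create mlflow-allowlist \
    --description "Allow only trusted IPs to reach MLflow"

# Allow the trusted range
gcloud compute security-policies rules create 1000 \
    --security-policy mlflow-allowlist \
    --action allow \
    --src-ip-ranges "203.0.113.0/24"

# Flip the default rule (lowest priority) to deny everything else
gcloud compute security-policies rules update 2147483647 \
    --security-policy mlflow-allowlist \
    --action deny-403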


Creating an SSL certificate

Under Security > SSL Certificates > Create Certificate, fill in the necessary details, choose Create Google-managed certificate, then click on the Create button.
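Or from the command line (the domain is illustrative):

gcloud compute ssl-certificates create mlflow-cert \
    --domains mlflow.example.com \
    --global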

Setting up Load Balancing

To set up the load balancer, navigate to Network Services > Load balancing > Create Load Balancer.


For type of load balancer, choose HTTP(S) Load Balancing, and follow the default settings.


The next step is to set up the frontend configuration: for the protocol, choose HTTPS, and for the certificate, choose the SSL certificate that we created in the previous step.


For the backend configuration, click on Create a backend service, choose Serverless network endpoint group as the backend type, and under Backend click on Add backend.


Then click on Create Serverless network endpoint group and, under Select service, choose the Cloud Run app that we created for MLflow.
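The equivalent command-line sketch (the endpoint group and backend service names are illustrative):

gcloud compute network-endpoint-groups create mlflow-neg \
    --region us-central1 \
    --network-endpoint-type serverless \
    --cloud-run-service mlflow-gcp

gcloud compute backend-services create mlflow-backend --global

gcloud compute backend-services add-backend mlflow-backend \
    --global \
    --network-endpoint-group mlflow-neg \
    --network-endpoint-group-region us-central1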


Under Cloud CDN, leave Cloud CDN disabled.

Under Security, choose the Cloud Armor policy that we created in the previous step.


After all of the above steps, click on the Create button to create the load balancer.

Setting up IAP

Navigate to Security > Identity-Aware Proxy. If you don’t see the IAP option, you may have to add it by clicking on the Connect Application button.

Turn on IAP for the load balancer’s backend service of the Cloud Run service, under All Web Services > Backend Services.
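IAP can also be enabled from the command line, assuming an OAuth client already exists (the backend service name matches the sketch above; the client ID and secret are placeholders):

gcloud iap web enable \
    --resource-type backend-services \
    --service mlflow-backend \
    --oauth2-client-id <client-id> \
    --oauth2-client-secret <client-secret>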

Setting up a custom domain

To begin with this step, we will have to enable the Cloud DNS API on the GCP project. Then, we can navigate to Network Services > Cloud Domains > Register Domain, fill in the necessary details, and click on the Create button.

Setting up Cloud DNS

To set up Cloud DNS, navigate to the Cloud DNS page; the domain that we registered in the step above should already be showing there. Click on the domain name and then click on Add Standard under Record Sets.

Under Resource Type, choose A, and for IPv4 Address, enter the IP address of the load balancer that we created in the previous step.
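Or create the record from the command line (the zone name and domain are illustrative):

gcloud dns record-sets create mlflow.example.com. \
    --zone mlflow-zone \
    --type A \
    --ttl 300 \
    --rrdatas <load-balancer-ip>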


Setting up the OAuth consent screen

Navigate to APIs & Services > OAuth consent screen and add our domain to the Authorized domains.

Restricting access to Cloud Run ingress

Now, as the final step, we will restrict access to the Cloud Run ingress so that traffic must come through the load balancer. Navigate to the Cloud Run service details > Networking > Ingress Control and select Internal and Cloud Load Balancing.

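The same setting can be applied from the command line:

gcloud run services update mlflow-gcp \
    --region us-central1 \
    --ingress internal-and-cloud-load-balancing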

Or, if we don’t want to do this manually at all, we can modify deploy/production/service.yaml as below:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: mlflow-gcp
  annotations:
    run.googleapis.com/ingress: internal-and-cloud-load-balancing
    run.googleapis.com/ingress-status: internal-and-cloud-load-balancing
spec:
# Rest of the original spec

Now, if you are not connecting to the MLflow server from one of the allowed IP addresses, you will not be able to access the server; it will return a 403 Forbidden response.


Simple Machine Learning Model

Now, with everything set up, let’s deploy a simple machine learning model using MLflow and test the deployment.

As a first step, let’s preprocess the data and prepare it for training.

import pandas as pd

# Load the red and white wine quality datasets
white_wine = pd.read_csv("./winequality-white.csv", sep=";")
red_wine = pd.read_csv("./winequality-red.csv", sep=";")

# Combine the two datasets, flagging the wine colour
red_wine['is_red'] = 1
white_wine['is_red'] = 0
data = pd.concat([red_wine, white_wine], axis=0)
data.rename(columns=lambda x: x.replace(' ', '_'), inplace=True)

# Binarise the target: quality >= 7 counts as high quality
high_quality = (data.quality >= 7).astype(int)
data.quality = high_quality

Next, let’s split the data into training, validation, and test sets.

from sklearn.model_selection import train_test_split 
X = data.drop(["quality"], axis=1)
y = data.quality

X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.6, random_state=123)

X_val, X_test, y_val, y_test = train_test_split(X_rem, y_rem, test_size=0.5, random_state=123)

Set up the necessary classes and functions to log the model and deploy it using MLflow.

import mlflow.pyfunc
from google.cloud import secretmanager

PROJECT_ID = "gcp-prj-id-123"  # your GCP project ID


class SklearnModelWrapper(mlflow.pyfunc.PythonModel):
    """Wrap the classifier so predict() returns the positive-class probability."""

    def __init__(self, model):
        self.model = model

    def predict(self, context, model_input):
        return self.model.predict_proba(model_input)[:, 1]


class GCPSecrets:
    """Small helper to read secret values from GCP Secret Manager."""

    def __init__(self) -> None:
        self.client = secretmanager.SecretManagerServiceClient()
        self.base_url = f"projects/{PROJECT_ID}/secrets"

    def get_secret(self, secret):
        secret_response = self.client.access_secret_version(
            name=f"{self.base_url}/{secret}/versions/latest"
        )
        creds = secret_response.payload.data.decode("UTF-8")
        return creds

Set up the MLflow configuration. Here mlflow_tracking_url is the extra secret mentioned earlier, holding the server’s URL (for example the custom domain behind the load balancer):

import os

secrets = GCPSecrets()

os.environ["MLFLOW_TRACKING_URI"] = secrets.get_secret("mlflow_tracking_url")
os.environ["MLFLOW_TRACKING_USERNAME"] = secrets.get_secret("mlflow_tracking_username")
os.environ["MLFLOW_TRACKING_PASSWORD"] = secrets.get_secret("mlflow_tracking_password")

Now, let’s train the model and log it using MLFlow.

import cloudpickle
import numpy as np
import sklearn
from mlflow.models.signature import infer_signature
from mlflow.utils.environment import _mlflow_conda_env
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

with mlflow.start_run(run_name='untuned_random_forest'):
    n_estimators = 10
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=np.random.RandomState(123))
    model.fit(X_train, y_train)

    # predict_proba returns [p(negative), p(positive)]; keep the positive class
    predictions_test = model.predict_proba(X_test)[:, 1]
    auc_score = roc_auc_score(y_test, predictions_test)
    mlflow.log_param('n_estimators', n_estimators)
    mlflow.log_metric('auc', auc_score)
    wrappedModel = SklearnModelWrapper(model)

    # Infer the model signature so inputs and outputs are validated at serving time
    signature = infer_signature(X_train, wrappedModel.predict(None, X_train))

    # Pin the environment so the model can be reproduced when loaded elsewhere
    conda_env = _mlflow_conda_env(
        additional_conda_deps=None,
        additional_pip_deps=[
            "cloudpickle=={}".format(cloudpickle.__version__),
            "scikit-learn=={}".format(sklearn.__version__),
        ],
        additional_conda_channels=None,
    )
    mlflow.pyfunc.log_model("random_forest_model", python_model=wrappedModel, conda_env=conda_env, signature=signature)

Examine the learned feature importances output by the model as a sanity-check.

feature_importances = pd.DataFrame(model.feature_importances_, index=X_train.columns.tolist(), columns=['importance'])
print(feature_importances.sort_values('importance', ascending=False))

Now, let’s register the model and transition it to the “Production” stage.

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Find the run we just logged by its run name
run_id = mlflow.search_runs(filter_string='tags.mlflow.runName = "untuned_random_forest"').iloc[0].run_id

model_name = "wine_quality"
model_version = mlflow.register_model(f"runs:/{run_id}/random_forest_model", model_name)

# Point the "Production" alias at this version; the stage transition below is the
# legacy equivalent, deprecated in recent MLflow releases but kept for older servers
client.set_registered_model_alias(name=model_name, alias="Production", version=model_version.version)
client.transition_model_version_stage(
    name=model_name,
    version=model_version.version,
    stage="Production",
)

Finally, load the model and test it.

model = mlflow.pyfunc.load_model(f"models:/{model_name}@Production")

print(f'AUC: {roc_auc_score(y_test, model.predict(X_test))}')


Making a prediction with the model

prediction = model.predict(X_test[:5])

print(f'Prediction: {prediction}')


Now, if we navigate to the MLflow UI, we should see the model’s artifacts.


You will also see that the model is registered under the Model Registry.


And that’s it!

We have successfully deployed a simple machine learning model using MLFlow on GCP and tested the deployment.

Thank you for reading and have a nice day!

If you want to support my work,

Buy me a coffee

Honestly, if you made it this far you already made my day :)