Reading a subset of data from an Azure Machine Learning "folder" dataset

Training a model is an interactive and iterative process which requires a fair bit of tuning and quite a lot of debugging. The ability to write, run and debug locally makes this process significantly faster.

However, the data we use to train our models has grown big enough that it is no longer feasible, or even possible, to use the entire dataset during development.

Azure ML SDK v1 offered the TabularDataset and FileDataset types, which had a take(count) function that we were using to get a subset of our data. Azure ML SDK v2 replaced them with a new data type: uri_folder.

uri_folder refers to a folder containing data, either non-tabular (e.g. a folder containing a large number of images) or tabular with multiple files (e.g. a folder containing partitioned parquet files or csv output from a Spark job). We have two ways to read data (or a subset of data) from a uri_folder dataset: using mltable, or using AzureMachineLearningFileSystem.
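Before diving into either approach, for context, this is roughly how such a data asset gets registered in the first place (a minimal sketch, assuming an MLClient has already been initialised as in the snippets below; the asset name and datastore path are illustrative):

from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# register a folder (e.g. the output directory of a Spark job)
# as a uri_folder data asset; name, version and path are illustrative
apps_data = Data(
    name="applications",
    version="1",
    type=AssetTypes.URI_FOLDER,
    path="azureml://datastores/workspaceblobstore/paths/applications/",
)
ml_client.data.create_or_update(apps_data)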

Using mltable

mltable is a feature-rich Python library for working with tabular datasets. It comes with built-in functionality to get a subset of a large dataset using the take(count) method.

Once we have mltable installed using the following command:

pip install -U mltable azureml-dataprep[pandas]

it is pretty straightforward to read the Azure ML dataset:

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

import mltable

# get credentials
credential = DefaultAzureCredential()

# initialise Azure ML SDK v2 client
ml_client = MLClient.from_config(path=".", credential=credential)

# get reference to Azure ML dataset
apps_dataset = ml_client.data.get(name="applications", version="1")

# initialise mltable
# apps_dataset.path refers to a folder containing multiple parquet files,
# so we append a glob so the pattern matches the files inside the folder
tbl = mltable.from_parquet_files([{
    "pattern": f"{apps_dataset.path}*.parquet"
}])

# take the first 1000 rows and load them as a pandas dataframe
df = tbl.take(1000).to_pandas_dataframe()
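take(count) simply grabs the first rows it encounters. If a more representative subset is needed, mltable also offers take_random_sample (a quick sketch, reusing tbl from above; the probability and seed values are arbitrary):

# sample roughly 1% of rows pseudo-randomly; the seed makes it repeatable
sampled = tbl.take_random_sample(probability=0.01, seed=7)
df_sample = sampled.to_pandas_dataframe()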
💡 mltable can read any tabular dataset even if it is saved as an Azure ML dataset of type uri_folder. The most common examples are parquet / csv outputs generated by an Apache Spark job, where Spark creates a folder and writes one or many files inside it. The best part: the code shown above can be used to read a uri_folder containing tabular data without any change.
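For csv part files specifically, the only change is the factory function (a sketch, assuming the same folder layout as above):

# same idea for csv part files written by a Spark job
tbl = mltable.from_delimited_files([{
    "pattern": f"{apps_dataset.path}*.csv"
}])

df = tbl.take(1000).to_pandas_dataframe()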

Using AzureMachineLearningFileSystem

If for some reason we don't want to use mltable, or we have non-tabular data, there is no direct equivalent of take(count); instead, we can use AzureMachineLearningFileSystem to read a subset of the data.

Once installed and configured, it provides filesystem-like operations (such as ls, open and many others) to interact with Azure ML datasets.
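To give a quick flavour (a sketch; it assumes fs has already been initialised as shown further below, and the file name is hypothetical):

# list the files inside the dataset folder
print(fs.ls())

# open a single (hypothetical) file as a file-like object
with fs.open("part-00000.parquet") as f:
    first_bytes = f.read(4)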

First things first, let's install the library:

pip install -U azureml-fsspec

Then, all we need to do is initialise the filesystem using the Azure ML dataset:

import os
import tempfile

import pandas as pd

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azureml.fsspec import AzureMachineLearningFileSystem

# get credentials
credential = DefaultAzureCredential()

# initialise Azure ML SDK v2 client
ml_client = MLClient.from_config(path=".", credential=credential)

# get reference to Azure ML dataset
apps_dataset = ml_client.data.get(name="applications", version="1")

# initialise file system
fs = AzureMachineLearningFileSystem(apps_dataset.path)

temp_dir = tempfile.mkdtemp()
df_list = []
count = 0
limit = 1000

# iterate through the individual files in the folder,
# read each one as a pandas dataframe,
# and concatenate them into a single dataframe at the end
for path in fs.glob(f'{apps_dataset.path}/*.parquet'):
    _, tail = os.path.split(path)
    dest_path = os.path.join(temp_dir, tail)
    fs.download(rpath=path, lpath=dest_path)

    df_part = pd.read_parquet(dest_path)
    df_list.append(df_part)
    count += len(df_part)

    if count >= limit:
        break

# the last file may overshoot the limit, so trim to exactly `limit` rows
df = pd.concat(df_list, ignore_index=True).head(limit)
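As a design note: downloading to a temp directory keeps things easy to debug, but since pandas happily reads from file-like objects, the temp directory can be skipped entirely and each file streamed directly (a sketch, reusing fs, apps_dataset and limit from above):

df_list = []
count = 0

# alternative: stream each file directly as a file-like object
# instead of downloading it to a local temp directory
for path in fs.glob(f'{apps_dataset.path}/*.parquet'):
    with fs.open(path) as f:
        df_part = pd.read_parquet(f)
    df_list.append(df_part)
    count += len(df_part)
    if count >= limit:
        break

df = pd.concat(df_list, ignore_index=True).head(limit)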