Reading a subset of data from an Azure Machine Learning "folder" dataset
Training a model is an interactive and iterative process that requires a fair bit of tuning and quite a lot of debugging. The ability to write, run and debug code locally makes this process significantly faster.
However, the data we use to train our models has grown big enough that it is no longer feasible to use the entire dataset during development.
Azure ML SDK v1 offered the `TabularDataset` and `FileDataset` types, which had a function called `take(count)` that we were using to get a subset of our data. Azure ML SDK v2 replaced them with a new data type: `uri_folder`.

A `uri_folder` refers to a folder containing data, either non-tabular (e.g. a folder containing a large number of images) or tabular with multiple files (e.g. a folder containing partitioned parquet files or csv output from a Spark job). We have two ways to read data (or a subset of data) from a `uri_folder` dataset:
Using `mltable`
`mltable` is a feature-rich Python library for working with tabular datasets. It comes with built-in functionality to get a subset of a large dataset using the `take(count)` method.
Once we have `mltable` installed using the following command:

```bash
pip install -U mltable azureml-dataprep[pandas]
```

it is pretty straightforward to read the Azure ML dataset:
```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
import mltable

# get credentials
credential = DefaultAzureCredential()

# initialise Azure ML SDK v2 client
ml_client = MLClient.from_config(path=".", credential=credential)

# get reference to Azure ML dataset
apps_dataset = ml_client.data.get(name="applications", version="1")

# initialise mltable
# apps_dataset.path refers to a folder containing multiple parquet files,
# so the pattern globs all parquet files inside that folder
tbl = mltable.from_parquet_files(paths=[{
    "pattern": f"{apps_dataset.path}/*.parquet"
}])

# take a subset of the data and load it as a pandas dataframe
df = tbl.take(1000).to_pandas_dataframe()
```
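If we want a random subset rather than the first rows, `mltable` also offers a `take_random_sample(probability, seed)` method. And if the folder holds csv files instead of parquet, only the loader changes. A minimal sketch, assuming the same "applications" dataset now contains csv output:

```python
import mltable

# csv files are read with from_delimited_files instead of from_parquet_files
tbl = mltable.from_delimited_files(paths=[{
    "pattern": f"{apps_dataset.path}/*.csv"
}])

# same subsetting API as before
df = tbl.take(1000).to_pandas_dataframe()
```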
`mltable` can read any tabular dataset even if it is saved as an Azure ML dataset of type `uri_folder`. The most common examples are the parquet / csv outputs generated by an Apache Spark job, where Spark creates a folder and writes one or many files inside it. The best part: the code shown above can read the `uri_folder` data type containing tabular data without any change.

Using `AzureMachineLearningFileSystem`
If for some reason we don't want to use `mltable`, or we have non-tabular data, there is no direct way but to use `AzureMachineLearningFileSystem` to read a subset of the data.
Once installed and configured, it provides filesystem-like commands (such as `ls`, `open` and many others) to interact with the Azure ML dataset.
First things first, install the library:

```bash
pip install -U azureml-fsspec
```
Then, all we need to do is initialise the filesystem using the Azure ML dataset:
```python
import os
import tempfile

import pandas as pd
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azureml.fsspec import AzureMachineLearningFileSystem

# get credentials
credential = DefaultAzureCredential()

# initialise Azure ML SDK v2 client
ml_client = MLClient.from_config(path=".", credential=credential)

# get reference to Azure ML dataset
apps_dataset = ml_client.data.get(name="applications", version="1")

# initialise file system
fs = AzureMachineLearningFileSystem(apps_dataset.path)

temp_dir = tempfile.mkdtemp()
df_list = []
count = 0
limit = 1000

# iterate through the individual files in the folder,
# read them as pandas dataframes and
# concatenate them into a single dataframe at the end
for path in fs.glob(f"{apps_dataset.path}/*.parquet"):
    _, tail = os.path.split(path)
    dest_path = os.path.join(temp_dir, tail)
    fs.download(rpath=path, lpath=dest_path)
    df_part = pd.read_parquet(dest_path)
    df_list.append(df_part)
    count = count + len(df_part)
    if count >= limit:
        break

df = pd.concat(df_list)
```
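Downloading to a temporary directory is not strictly necessary: since `fs.open` returns a file-like object, pandas should be able to read each parquet file straight from the datastore. A sketch of the same loop under that assumption:

```python
df_list = []
count = 0
limit = 1000

for path in fs.glob(f"{apps_dataset.path}/*.parquet"):
    # stream the remote file instead of downloading it first
    with fs.open(path) as f:
        df_part = pd.read_parquet(f)
    df_list.append(df_part)
    count = count + len(df_part)
    if count >= limit:
        break

df = pd.concat(df_list)
```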