# Reading a subset of data from an Azure Machine Learning "folder" dataset

Training a model is an interactive and iterative process that requires a fair bit of tuning and quite a lot of debugging. The ability to write, run and debug code locally makes this process significantly faster.

However, the data we use to train our models has grown so large that it is no longer feasible to use the entire dataset during development.

Azure ML SDK v1 offered the `TabularDataset` and `FileDataset` types, both of which had a `take(count)` method that we used to get a subset of our data. Azure ML SDK v2 replaced them with a new data type: `uri_folder`.

`uri_folder` refers to a folder containing data, either non-tabular (e.g. a folder containing a large number of images) or tabular with multiple files (e.g. a folder containing partitioned parquet files or csv output from a Spark job). We have two ways to read data (or a subset of it) from a `uri_folder` dataset:

## Using mltable

`mltable` is a feature-rich Python library for working with tabular datasets. It comes with built-in functionality for getting a subset of a large dataset through its `take(count)` method.

Once we have `mltable` installed using the following command:

```bash
pip install -U mltable azureml-dataprep[pandas]
```

Reading the Azure ML dataset is then pretty straightforward:

```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.ai.ml.entities import Data

import mltable

# get credentials
credential = DefaultAzureCredential()

# initialise Azure ML SDK v2 client
# (pass the credential object created above)
ml_client = MLClient.from_config(path=".", credential=credential)

# get reference to Azure ML dataset
apps_dataset = ml_client.data.get(name="applications", version=1)

# initialise mltable
# apps_dataset.path refers to a folder containing multiple parquet files
tbl = mltable.from_parquet_files([{
    "pattern": apps_dataset.path
}])

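# take the first 1000 rows and load them as a pandas dataframe
# (take() returns an MLTable, so it has to be materialised explicitly)
df = tbl.take(1000).to_pandas_dataframe()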
```

<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><code>mltable</code> can read any tabular dataset, even one saved as an Azure ML dataset of type <code>uri_folder</code>. The most common examples are the parquet / csv outputs generated by an Apache Spark job, where Spark creates a folder and writes one or more files inside it. The best part: the code shown above can be used to read a <code>uri_folder</code> data type containing tabular data without <strong>any</strong> change.</div>
</div>

## Using AzureMachineLearningFileSystem

If for some reason we don't want to use `mltable`, or we have non-tabular data, the only option is to use `AzureMachineLearningFileSystem` to read a subset of the data.

Once installed and configured, it provides filesystem-like commands (such as `ls`, `open` and many others) to interact with an Azure ML dataset.

First things first, install the library:

```bash
pip install -U azureml-fsspec
```

Then, all we need to do is initialise the filesystem with the Azure ML dataset:

```python
import os
import tempfile

import pandas as pd
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azureml.fsspec import AzureMachineLearningFileSystem

# get credentials
credential = DefaultAzureCredential()

# initialise Azure ML SDK v2 client
# (pass the credential object created above)
ml_client = MLClient.from_config(path=".", credential=credential)

# get reference to Azure ML dataset
apps_dataset = ml_client.data.get(name="applications", version=1)

# initialise file system
fs = AzureMachineLearningFileSystem(apps_dataset.path)

temp_dir = tempfile.mkdtemp()
df_list = []
count = 0
limit = 1000

# iterate through the individual files in the folder,
# download each one and read it as a pandas dataframe,
# stop once enough rows are collected, then concatenate at the end
for path in fs.glob(f'{apps_dataset.path}/*.parquet'):
    _, tail = os.path.split(path)
    dest_path = os.path.join(temp_dir, tail)
    fs.download(rpath=path, lpath=dest_path)

    df_part = pd.read_parquet(dest_path)
    df_list.append(df_part)
    count = count + len(df_part)

    if count >= limit:
        break

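# the last file may push the row count past the limit, so trim to exactly `limit` rows
df = pd.concat(df_list, ignore_index=True).head(limit)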
```
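
The early-stopping logic in the loop above does not depend on Azure ML, so it can be pulled out and tested locally. Here is a minimal sketch under some assumptions: `read_until` and its parameters are hypothetical names, not part of `azureml-fsspec`, and `read_part` stands for any callable that turns a path into a sequence of rows. The sketch works on plain Python sequences; with pandas, the accumulation step would stay a `pd.concat` as in the loop above.

```python
from typing import Callable, Iterable, List, Sequence, TypeVar

Row = TypeVar("Row")


def read_until(
    paths: Iterable[str],
    read_part: Callable[[str], Sequence[Row]],
    limit: int,
) -> List[Row]:
    """Read partitions one by one and stop once `limit` rows are collected."""
    rows: List[Row] = []
    for path in paths:
        # stop before opening another partition if we already have enough rows
        if len(rows) >= limit:
            break
        rows.extend(read_part(path))
    # the last partition may overshoot, so trim to exactly `limit` rows
    return rows[:limit]


# hypothetical in-memory "partitions" standing in for parquet files
fake_parts = {"part-0": [1, 2, 3], "part-1": [4, 5, 6], "part-2": [7, 8, 9]}
opened = []  # records which partitions were actually read


def fake_reader(path: str) -> List[int]:
    opened.append(path)
    return fake_parts[path]


subset = read_until(fake_parts, fake_reader, limit=4)
```

In this toy run only the first two partitions are opened. That is the point of the approach: files beyond the limit are never downloaded.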
