Create your Explainable Parquet Dataset#
See the key concepts to instantiate a project before creating your dataset.
1. Convert your Raw Data#
Currently, Xpdeep supports the .parquet format through the ParquetDataset class. Parquet files are
backed by an Arrow Table, where each column represents a feature.
Note
Xpdeep currently supports cloud storage (S3, GCP, etc.) for your .parquet files. It relies internally on
fsspec to read the data from the cloud location.
You must then split your data into train, validation, and test sets. Each split is itself a ParquetDataset.
The Arrow Format#
As HuggingFace's Datasets package underlines, Arrow makes it possible to process and move large quantities of data quickly.
It is a specific data format that stores data in a columnar memory layout. This provides several significant advantages:
- Arrow’s standard format allows zero-copy reads, which remove virtually all serialization overhead.
- Arrow is language-agnostic so it supports different programming languages.
- Arrow is column-oriented so it is faster at querying and processing slices or columns of data.
- Arrow allows for copy-free hand-offs to standard machine learning tools such as NumPy, Pandas, PyTorch, and TensorFlow.
- Arrow supports many, possibly nested, column types.
Let's make an Arrow Table from a toy dataset made of 8 samples, containing two numerical features "petal_length" and "petal_width", with a categorical target "flower_type".
```python
import pyarrow as pa
import pyarrow.parquet as pq

raw_data = pa.table(
    {
        "petal_length": [1.4, 1.5, 1.3, 4.5, 4.1, 5.0, 6.0, 5.5],
        "petal_width": [0.2, 0.2, 0.2, 1.5, 1.3, 1.8, 2.5, 2.3],
        "flower_type": ["Setosa", "Setosa", "Setosa", "Versicolor", "Versicolor", "Versicolor", "Virginica", "Virginica"],
    }
)

# Write the table to a Parquet file
pq.write_table(raw_data, "train.parquet")
```
👀 Full file preview
```python
import pyarrow as pa
import pyarrow.parquet as pq

import xpdeep
from xpdeep import Project
from xpdeep.dataset.parquet_dataset import ParquetDataset

demo = {"api_key": "your_api_key", "api_url": "your_api_url"}
xpdeep.init(**demo)
xpdeep.set_project(Project.create_or_get(name="toy dataset example", description="tutorial"))

# Create a Dataset
S3_DATASET_ENDPOINT_URL = "S3_DATASET_ENDPOINT_URL"
S3_DATASET_ACCESS_KEY_ID = "S3_DATASET_ACCESS_KEY_ID"
S3_DATASET_SECRET_ACCESS_KEY = "S3_DATASET_SECRET_ACCESS_KEY"
S3_DATASET_BUCKET_NAME = "S3_DATASET_BUCKET_NAME"

raw_data = pa.table({
    "petal_length": [1.4, 1.5, 1.3, 4.5, 4.1, 5.0, 6.0, 5.5],
    "petal_width": [0.2, 0.2, 0.2, 1.5, 1.3, 1.8, 2.5, 2.3],
    "flower_type": ["Setosa", "Setosa", "Setosa", "Versicolor", "Versicolor", "Versicolor", "Virginica", "Virginica"],
})

# Write the table to a Parquet file
pq.write_table(raw_data, "train.parquet")

path = f"s3://{S3_DATASET_BUCKET_NAME}/train.parquet"
storage_options = {
    "key": S3_DATASET_ACCESS_KEY_ID,
    "secret": S3_DATASET_SECRET_ACCESS_KEY,
    "client_kwargs": {
        "endpoint_url": S3_DATASET_ENDPOINT_URL,
    },
    "s3_additional_kwargs": {"addressing_style": "path"},
}
train_dataset = ParquetDataset(name="dataset_name", path=path, storage_options=storage_options)
```
Tip
Please follow Getting Started to get your API key.
2. Upload then Share your Converted Dataset#
As mentioned earlier in this tutorial, Xpdeep expects your converted data to be stored on fsspec-compatible cloud storage.
You then need to grant Xpdeep permission to read your data files. The ParquetDataset expects an fsspec-compatible
path (which should be a URL) and optional storage_options (the credentials, if required).
Please refer again to the Datasets cloud storage documentation for the expected storage_options format.
An example with fake credentials, assuming your data was uploaded to path on an S3 bucket:
```python
S3_DATASET_ENDPOINT_URL = "S3_DATASET_ENDPOINT_URL"
S3_DATASET_ACCESS_KEY_ID = "S3_DATASET_ACCESS_KEY_ID"
S3_DATASET_SECRET_ACCESS_KEY = "S3_DATASET_SECRET_ACCESS_KEY"
S3_DATASET_BUCKET_NAME = "S3_DATASET_BUCKET_NAME"

path = f"s3://{S3_DATASET_BUCKET_NAME}/train.parquet"
storage_options = {
    "key": S3_DATASET_ACCESS_KEY_ID,
    "secret": S3_DATASET_SECRET_ACCESS_KEY,
    "client_kwargs": {
        "endpoint_url": S3_DATASET_ENDPOINT_URL,
    },
    "s3_additional_kwargs": {"addressing_style": "path"},
}
```
Warning
When creating an access key for an S3 bucket, make sure to allow at least the GetObject permission in its policy. See the following example:
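As an illustration only, an IAM-style policy granting read access could look like the sketch below; the exact policy shape depends on your storage provider, and the bucket name is a placeholder:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::your-bucket-name/*"]
    }
  ]
}
```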
3. Instantiate a Dataset#
From your newly converted Arrow Table data, you can instantiate an Xpdeep explainable dataset.
```python
from xpdeep.dataset.parquet_dataset import ParquetDataset

train_dataset = ParquetDataset(
    name="dataset_name",
    path=path,
    storage_options=storage_options,
)
```
4. Create a Schema#
This dataset does not yet contain a Schema object. As a schema is required to obtain any results from your data, you
can either use this dataset to infer a schema automatically or build your own schema from scratch. Please check the next
section to learn how to get a schema for your explainable dataset.