Create your Explainable Parquet Dataset#
See the key concepts to instantiate a project before creating your dataset.
1. Convert your Raw Data#
Currently, Xpdeep supports the .parquet format through the ParquetDataset class. Parquet files are
backed by an Arrow Table, where each column represents a feature.
Note
Xpdeep currently supports cloud storage (S3, GCP, etc.) for your .parquet files. It relies internally on
fsspec to read the data from the cloud location.
You must then split your data into train, validation, and test sets. Each split is itself a ParquetDataset.
The Arrow Format#
As HuggingFace's Datasets package underlines, Arrow makes it possible to process and move large quantities of data quickly.
It is a specific data format that stores data in a columnar memory layout. This provides several significant advantages:
- Arrow’s standard format allows zero-copy reads, which remove virtually all serialization overhead.
- Arrow is language-agnostic so it supports different programming languages.
- Arrow is column-oriented so it is faster at querying and processing slices or columns of data.
- Arrow allows for copy-free hand-offs to standard machine learning tools such as NumPy, Pandas, PyTorch, and TensorFlow.
- Arrow supports many, possibly nested, column types.
Let's make an Arrow Table from a toy dataset made of 8 samples, containing two numerical features "petal_length" and "petal_width", with a categorical target "flower_type".
```python
import pyarrow as pa
import pyarrow.parquet as pq

raw_data = pa.table(
    {
        "petal_length": [1.4, 1.5, 1.3, 4.5, 4.1, 5.0, 6.0, 5.5],
        "petal_width": [0.2, 0.2, 0.2, 1.5, 1.3, 1.8, 2.5, 2.3],
        "flower_type": ["Setosa", "Setosa", "Setosa", "Versicolor", "Versicolor", "Versicolor", "Virginica", "Virginica"],
    }
)

# Write the table to a Parquet file
pq.write_table(raw_data, "train.parquet")
```
👀 Full file preview
```python
import pyarrow as pa
import pyarrow.parquet as pq

import xpdeep
from xpdeep import Project
from xpdeep.dataset.parquet_dataset import ParquetDataset

demo = {"api_key": "your_api_key", "api_url": "your_api_url"}
xpdeep.init(**demo)
xpdeep.set_project(Project.create_or_get(name="toy dataset example", description="tutorial"))

# Create a Dataset
S3_DATASET_ENDPOINT_URL = "S3_DATASET_ENDPOINT_URL"
S3_DATASET_ACCESS_KEY_ID = "S3_DATASET_ACCESS_KEY_ID"
S3_DATASET_SECRET_ACCESS_KEY = "S3_DATASET_SECRET_ACCESS_KEY"
S3_DATASET_BUCKET_NAME = "S3_DATASET_BUCKET_NAME"

raw_data = pa.table({
    "petal_length": [1.4, 1.5, 1.3, 4.5, 4.1, 5.0, 6.0, 5.5],
    "petal_width": [0.2, 0.2, 0.2, 1.5, 1.3, 1.8, 2.5, 2.3],
    "flower_type": ["Setosa", "Setosa", "Setosa", "Versicolor", "Versicolor", "Versicolor", "Virginica", "Virginica"],
})

# Write the table to a Parquet file
pq.write_table(raw_data, "train.parquet")

path = f"s3://{S3_DATASET_BUCKET_NAME}/train.parquet"
storage_options = {
    "key": S3_DATASET_ACCESS_KEY_ID,
    "secret": S3_DATASET_SECRET_ACCESS_KEY,
    "client_kwargs": {
        "endpoint_url": S3_DATASET_ENDPOINT_URL,
    },
    "s3_additional_kwargs": {"addressing_style": "path"},
}
train_dataset = ParquetDataset(name="dataset_name", path=path, storage_options=storage_options)
```
Tip
Please follow Getting Started to get your API key.
2. Upload then Share your Converted Dataset#
As mentioned earlier in this tutorial, Xpdeep expects your converted data to be stored on fsspec-compatible cloud storage.
You then need to grant Xpdeep permission to read your data files. The ParquetDataset expects an fsspec-compatible
path (which should be a URL) and optional storage_options (the credentials, if required).
Please refer again to the Datasets cloud storage documentation for the expected storage_options format.
An example with fake credentials, assuming your data was uploaded to path on an S3 bucket:
```python
S3_DATASET_ENDPOINT_URL = "S3_DATASET_ENDPOINT_URL"
S3_DATASET_ACCESS_KEY_ID = "S3_DATASET_ACCESS_KEY_ID"
S3_DATASET_SECRET_ACCESS_KEY = "S3_DATASET_SECRET_ACCESS_KEY"
S3_DATASET_BUCKET_NAME = "S3_DATASET_BUCKET_NAME"

path = f"s3://{S3_DATASET_BUCKET_NAME}/train.parquet"
storage_options = {
    "key": S3_DATASET_ACCESS_KEY_ID,
    "secret": S3_DATASET_SECRET_ACCESS_KEY,
    "client_kwargs": {
        "endpoint_url": S3_DATASET_ENDPOINT_URL,
    },
    "s3_additional_kwargs": {"addressing_style": "path"},
}
```
Warning
When creating an access key for an S3 bucket, make sure to allow at least the GetObject permission in its policy. See the following example:
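As an illustration only, an IAM-style policy granting read access could look like the sketch below; the exact policy shape depends on your storage provider, and the bucket name is a placeholder:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::your-bucket-name/*"]
    }
  ]
}
```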
3. Instantiate a Dataset#
From your newly converted Arrow Table data, you can instantiate an Xpdeep explainable dataset.
```python
from xpdeep.dataset.parquet_dataset import ParquetDataset

train_dataset = ParquetDataset(
    name="dataset_name",
    path=path,
    storage_options=storage_options,
)
```
4. Create a Schema#
This dataset does not yet contain a Schema object. As a schema is required to obtain any results from your data, you
can either use this dataset to infer a schema automatically or build your own schema from scratch. Please check the next
section to learn how to get a schema for your explainable dataset.