Create a Schema#
To use your dataset with the Xpdeep framework, you need a `Schema` object defining the dataset structure. The schema determines how each column in your dataset is interpreted, how features are preprocessed, and how the data connects across samples through the index.
The schema exists at several levels (analyzed, then fitted).
As it can be tedious to find the correct schema by hand, Xpdeep provides an Auto Analyzer to help you get a first schema version.
You can later update it if any feature analysis seems incorrect.
Note
The dataset used during the training process corresponds to the train dataset.
1. Find a Schema#
The first step consists of finding each feature's type and its associated preprocessor.
You can find the list of available Feature types in the API reference.
Warning
For security reasons, we do not yet allow arbitrary code to be executed in the framework. Therefore, with StandardDataset,
your preprocessing must come from a list of trusted preprocessors.
Xpdeep currently supports scikit-learn and PyTorch preprocessing for building your preprocessors.
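For instance, a scikit-learn scaler can be wrapped in a SklearnPreprocessor, as used throughout this page:
from sklearn.preprocessing import StandardScaler
from xpdeep.dataset.preprocessor.preprocessor import SklearnPreprocessor
preprocessor = SklearnPreprocessor(preprocess_function=StandardScaler())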
Understanding IndexMetadata#
The dataset schema in Xpdeep should include exactly one column acting as a unique identifier for samples or sequences.
This column is represented by the class IndexMetadata, which is used internally by the framework to
track data instances.
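For example, an index column named "my_index" (also used in the from-scratch example below) is declared as:
from xpdeep.dataset.feature import IndexMetadata
index = IndexMetadata(name="my_index")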
If a dataset schema does not include IndexMetadata, Xpdeep issues a warning.
With the Auto Analyzer#
With a dataset object, you can get a first schema proposal using the analyze method of a ParquetDataset instance.
Set the Target(s)#
The only requirement is to indicate which feature(s) should be considered as target(s).
Use the target_names parameter to specify which features should be treated as targets prior to the analysis.
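For example, with the toy dataset used on this page:
analyzed_train_dataset = train_dataset.analyze(target_names=["flower_type"])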
👀 Full file preview
"""Create a schema."""
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from sklearn.preprocessing import StandardScaler
import xpdeep
from xpdeep import Project
from xpdeep.dataset.feature import ExplainableFeature
from xpdeep.dataset.feature.feature_types import NumericalFeature
from xpdeep.dataset.parquet_dataset import ParquetDataset
from xpdeep.dataset.preprocessor.preprocessor import SklearnPreprocessor
demo = {"api_key": "your_api_key", "api_url": "your_api_url"}
xpdeep.init(**demo)
xpdeep.set_project(Project.create_or_get(name="toy dataset example", description="tutorial"))
# Create a Dataset
S3_DATASET_ENDPOINT_URL = "S3_DATASET_ENDPOINT_URL"
S3_DATASET_ACCESS_KEY_ID = "S3_DATASET_ACCESS_KEY_ID"
S3_DATASET_SECRET_ACCESS_KEY = "S3_DATASET_SECRET_ACCESS_KEY"
S3_DATASET_BUCKET_NAME = "S3_DATASET_BUCKET_NAME"
raw_data = pa.Table.from_pandas(
    pd.DataFrame({
        "petal_length": [1.4, 1.5, 1.3, 4.5, 4.1, 5.0, 6.0, 5.5],
        "petal_width": [0.2, 0.2, 0.2, 1.5, 1.3, 1.8, 2.5, 2.3],
        "flower_type": [
            "Setosa",
            "Setosa",
            "Setosa",
            "Versicolor",
            "Versicolor",
            "Versicolor",
            "Virginica",
            "Virginica",
        ],
    }).reset_index(names="index_xp_deep"),
    preserve_index=False,
)
# Write the table to a Parquet file
pq.write_table(raw_data, "train.parquet")
path = f"s3://{S3_DATASET_BUCKET_NAME}/train.parquet"
storage_options = {
    "key": S3_DATASET_ACCESS_KEY_ID,
    "secret": S3_DATASET_SECRET_ACCESS_KEY,
    "client_kwargs": {
        "endpoint_url": S3_DATASET_ENDPOINT_URL,
    },
    "s3_additional_kwargs": {"addressing_style": "path"},
}
train_dataset = ParquetDataset(name="dataset_name", path=path, storage_options=storage_options)
# Create a Schema
forced_feature = ExplainableFeature(
    name="petal_length",
    is_target=False,
    preprocessor=SklearnPreprocessor(preprocess_function=StandardScaler()),
    feature_type=NumericalFeature(),
)
analyzed_train_dataset = train_dataset.analyze(target_names=["flower_type"])
fitted_train_dataset = analyzed_train_dataset.fit()
print(fitted_train_dataset.fitted_schema)
You can also set the target name directly on the analyzed schema, after the analysis.
analyzed_train_dataset = train_dataset.analyze()
analyzed_train_dataset.analyzed_schema["flower_type"].is_target = True
Set the Features#
In addition, you can force a feature type by calling the analyze method with specific features. In the following
example, the feature named "petal_length" will be a NumericalFeature.
from xpdeep.dataset.feature.feature_types import NumericalFeature
from xpdeep.dataset.preprocessor.preprocessor import SklearnPreprocessor
from sklearn.preprocessing import StandardScaler
from xpdeep.dataset.feature import ExplainableFeature
forced_feature = ExplainableFeature(
    name="petal_length",
    is_target=False,
    preprocessor=SklearnPreprocessor(preprocess_function=StandardScaler()),
    feature_type=NumericalFeature()
)
analyzed_train_dataset = train_dataset.analyze(forced_feature)
As the returned schema is only a proposal, you can edit it later if it doesn't correctly match your needs. Any feature can be overwritten or updated.
from xpdeep.dataset.feature import ExplainableFeature
from xpdeep.dataset.feature.feature_types import NumericalFeature
from xpdeep.dataset.preprocessor.preprocessor import SklearnPreprocessor
from sklearn.preprocessing import StandardScaler
# Set feature type name after the schema inference
analyzed_train_dataset = train_dataset.analyze()
analyzed_train_dataset.analyzed_schema["petal_length"] = ExplainableFeature(
    name="petal_length",
    is_target=False,
    preprocessor=SklearnPreprocessor(
        preprocess_function=StandardScaler(),
    ),
    feature_type=NumericalFeature()
)
If needed, you can also remove a feature from the schema using its name.
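A minimal sketch, assuming the analyzed schema supports item deletion by name with the same bracket syntax used above for item access:
# Assumed API: delete a feature from the analyzed schema by its name
del analyzed_train_dataset.analyzed_schema["petal_width"]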
Set the Index#
By default, the Auto Analyzer recognizes columns named "__index_level_0__" or "index_xp_deep" as IndexMetadata.
Tip
If you convert a pandas DataFrame to a PyArrow Table, use preserve_index=True so that the pandas index (stored as "__index_level_0__") is used as IndexMetadata automatically.
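For example, with pyarrow:
import pandas as pd
import pyarrow as pa
df = pd.DataFrame({"petal_length": [1.4, 1.5], "petal_width": [0.2, 0.2]})
# The unnamed pandas index is stored as the "__index_level_0__" column
table = pa.Table.from_pandas(df, preserve_index=True)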
Or from Scratch#
You can also create your own analyzed schema from scratch, without using the Auto Analyzer.
Here, we use the column named "my_index" as the IndexMetadata.
from xpdeep.dataset.feature import ExplainableFeature, IndexMetadata
from xpdeep.dataset.feature.feature_types import NumericalFeature
from xpdeep.dataset.preprocessor.preprocessor import SklearnPreprocessor
from sklearn.preprocessing import StandardScaler
from xpdeep.dataset.schema import AnalyzedSchema
feature_1 = ExplainableFeature(
    name="petal_length",
    is_target=False,
    preprocessor=SklearnPreprocessor(
        preprocess_function=StandardScaler(),
    ),
    feature_type=NumericalFeature()
)
feature_2 = ExplainableFeature(
    name="petal_width",
    is_target=False,
    preprocessor=SklearnPreprocessor(
        preprocess_function=StandardScaler(),
    ),
    feature_type=NumericalFeature()
)
index = IndexMetadata(
    name="my_index",
)
analyzed_schema = AnalyzedSchema(feature_1, feature_2, index)
Finally, use the analyzed schema to build the AnalyzedParquetDataset.
from xpdeep.dataset.parquet_dataset import AnalyzedParquetDataset
analyzed_train_dataset = AnalyzedParquetDataset(
    name="train_dataset",
    path=directory["train_set_path"],
    analyzed_schema=analyzed_schema
)
2. Fit the Schema#
Once you are satisfied with the feature types and their preprocessors, the next step is to fit the dataset schema. Each preprocessor is responsible for the mapping between the raw and preprocessed spaces, and must be fitted on the data to establish that mapping.
From the AnalyzedParquetDataset#
Calling fit on the analyzed dataset automatically fits each feature's preprocessor.
fitted_train_dataset = analyzed_train_dataset.fit()
print(fitted_train_dataset.fitted_schema)
+----------------------------------------------------+
| Schema Contents |
+--------------------+-------------------+-----------+
| Type | Name | Is Target |
+--------------------+-------------------+-----------+
| NumericalFeature | petal_length | ❌ |
| NumericalFeature | petal_width | ❌ |
| CategoricalFeature | flower_type | ✅ |
| IndexMetadata | index_xp_deep | |
+--------------------+-------------------+-----------+
👀 Full file preview
"""Create a schema."""
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from sklearn.preprocessing import StandardScaler
import xpdeep
from xpdeep import Project
from xpdeep.dataset.feature import ExplainableFeature
from xpdeep.dataset.feature.feature_types import NumericalFeature
from xpdeep.dataset.parquet_dataset import ParquetDataset
from xpdeep.dataset.preprocessor.preprocessor import SklearnPreprocessor
demo = {"api_key": "your_api_key", "api_url": "your_api_url"}
xpdeep.init(**demo)
xpdeep.set_project(Project.create_or_get(name="toy dataset example", description="tutorial"))
# Create a Dataset
S3_DATASET_ENDPOINT_URL = "S3_DATASET_ENDPOINT_URL"
S3_DATASET_ACCESS_KEY_ID = "S3_DATASET_ACCESS_KEY_ID"
S3_DATASET_SECRET_ACCESS_KEY = "S3_DATASET_SECRET_ACCESS_KEY"
S3_DATASET_BUCKET_NAME = "S3_DATASET_BUCKET_NAME"
raw_data = pa.Table.from_pandas(
    pd.DataFrame({
        "petal_length": [1.4, 1.5, 1.3, 4.5, 4.1, 5.0, 6.0, 5.5],
        "petal_width": [0.2, 0.2, 0.2, 1.5, 1.3, 1.8, 2.5, 2.3],
        "flower_type": [
            "Setosa",
            "Setosa",
            "Setosa",
            "Versicolor",
            "Versicolor",
            "Versicolor",
            "Virginica",
            "Virginica",
        ],
    }).reset_index(names="index_xp_deep"),
    preserve_index=False,
)
# Write the table to a Parquet file
pq.write_table(raw_data, "train.parquet")
path = f"s3://{S3_DATASET_BUCKET_NAME}/train.parquet"
storage_options = {
    "key": S3_DATASET_ACCESS_KEY_ID,
    "secret": S3_DATASET_SECRET_ACCESS_KEY,
    "client_kwargs": {
        "endpoint_url": S3_DATASET_ENDPOINT_URL,
    },
    "s3_additional_kwargs": {"addressing_style": "path"},
}
train_dataset = ParquetDataset(name="dataset_name", path=path, storage_options=storage_options)
# Create a Schema
forced_feature = ExplainableFeature(
    name="petal_length",
    is_target=False,
    preprocessor=SklearnPreprocessor(preprocess_function=StandardScaler()),
    feature_type=NumericalFeature(),
)
analyzed_train_dataset = train_dataset.analyze(target_names=["flower_type"])
fitted_train_dataset = analyzed_train_dataset.fit()
print(fitted_train_dataset.fitted_schema)
Or from Scratch#
It is also possible to build a FittedParquetDataset directly from an existing FittedSchema, using the default constructor.
This can be useful to instantiate a FittedParquetDataset for a test set from another dataset's schema.
from xpdeep.dataset.parquet_dataset import FittedParquetDataset
fitted_validation_dataset = FittedParquetDataset(
    name="validation_dataset",
    path=directory["test_set_path"],
    fitted_schema=fitted_train_dataset.fitted_schema
)
👀 Full file preview
"""Create a schema."""
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from sklearn.preprocessing import StandardScaler
import xpdeep
from xpdeep import Project
from xpdeep.dataset.feature import ExplainableFeature
from xpdeep.dataset.feature.feature_types import NumericalFeature
from xpdeep.dataset.parquet_dataset import ParquetDataset
from xpdeep.dataset.preprocessor.preprocessor import SklearnPreprocessor
demo = {"api_key": "your_api_key", "api_url": "your_api_url"}
xpdeep.init(**demo)
xpdeep.set_project(Project.create_or_get(name="toy dataset example", description="tutorial"))
# Create a Dataset
S3_DATASET_ENDPOINT_URL = "S3_DATASET_ENDPOINT_URL"
S3_DATASET_ACCESS_KEY_ID = "S3_DATASET_ACCESS_KEY_ID"
S3_DATASET_SECRET_ACCESS_KEY = "S3_DATASET_SECRET_ACCESS_KEY"
S3_DATASET_BUCKET_NAME = "S3_DATASET_BUCKET_NAME"
raw_data = pa.Table.from_pandas(
    pd.DataFrame({
        "petal_length": [1.4, 1.5, 1.3, 4.5, 4.1, 5.0, 6.0, 5.5],
        "petal_width": [0.2, 0.2, 0.2, 1.5, 1.3, 1.8, 2.5, 2.3],
        "flower_type": [
            "Setosa",
            "Setosa",
            "Setosa",
            "Versicolor",
            "Versicolor",
            "Versicolor",
            "Virginica",
            "Virginica",
        ],
    }).reset_index(names="index_xp_deep"),
    preserve_index=False,
)
# Write the table to a Parquet file
pq.write_table(raw_data, "train.parquet")
path = f"s3://{S3_DATASET_BUCKET_NAME}/train.parquet"
storage_options = {
    "key": S3_DATASET_ACCESS_KEY_ID,
    "secret": S3_DATASET_SECRET_ACCESS_KEY,
    "client_kwargs": {
        "endpoint_url": S3_DATASET_ENDPOINT_URL,
    },
    "s3_additional_kwargs": {"addressing_style": "path"},
}
train_dataset = ParquetDataset(name="dataset_name", path=path, storage_options=storage_options)
# Create a Schema
forced_feature = ExplainableFeature(
    name="petal_length",
    is_target=False,
    preprocessor=SklearnPreprocessor(preprocess_function=StandardScaler()),
    feature_type=NumericalFeature(),
)
analyzed_train_dataset = train_dataset.analyze(target_names=["flower_type"])
fitted_train_dataset = analyzed_train_dataset.fit()
print(fitted_train_dataset.fitted_schema)
Once you have a suitable fitted schema associated with your FittedParquetDataset, the next step is to
build your explainable model.
3. Data Augmentation#
You can also add data augmentation functions (mainly for image features) using
FeatureAugmentation.
These functions generate new data from the raw data and/or the preprocessed data.
Currently, we only support augmentation using torchvision.
The augmentation must be defined on the FittedSchema and not on the AnalyzedSchema, otherwise it will not be taken into account.
from torchvision.transforms import Compose, RandomRotation
from xpdeep.dataset.feature.augmentation.augmentation import FeatureAugmentation
from xpdeep.dataset.feature import ExplainableFeature
from xpdeep.dataset.feature.feature_types import ImageFeature
from xpdeep.dataset.schema import FittedSchema
augmentation = Compose([RandomRotation(90)])
image_rotation_augmentation = FeatureAugmentation(augment_preprocessed=augmentation)
image_feature = ExplainableFeature(
    preprocessor=ScaleImage(input_size=(28, 28)),
    name="image",
    feature_type=ImageFeature()
)
fitted_schema = FittedSchema(image_feature)
fitted_schema["image"].augmentation = image_rotation_augmentation
Future Release
In the future, xpdeep will support augmentation on other feature types.
Warning
The ImageFeature expects the channel-last format, i.e. batch_size x H x W x num_channels.
You may need to use Compose([Permute([0, 3, 1, 2]), YourTransformation(), Permute([0, 2, 3, 1])]) if your augmentation
requires channels first, as is usually the case for torchvision.transforms objects.
You don't need to add a transform to convert to a torch tensor first, as this is automatically handled by xpdeep.