Creating datasets for HistomicsML2

The dataset creation docker container provides all the functionality needed to create new HistomicsML2 datasets. This page describes how to use this docker image to generate new datasets from whole-slide image datasets.

Note

When working with user-generated datasets you will be mounting folders from the local filesystem to be visible as volumes within the various Docker containers. When we refer to paths in the documentation we are careful to note whether these paths are on the local system or a mounted volume inside a container.

Note

Processing time for creating datasets varies depending on hardware. We observed 40 minutes for superpixel segmentation (CPU) and 1.5 hours for feature extraction (GPU) on a 40X objective 66K x 76K slide with 382,225 superpixels performed on a two-CPU system equipped with two NVIDIA P100 GPUs.

Download the dataset creation container

Use the docker pull command to download the dataset creation container. Use cancerdatascience/hml_dataset_gpu:1.0 if running on a GPU-equipped system to accelerate feature extraction.

$ docker pull cancerdatascience/hml_dataset_superpixel:1.0
$ docker pull cancerdatascience/hml_dataset_cpu:1.0
#use the GPU-enabled docker if runnnig on a GPU system
$ docker pull cancerdatascience/hml_dataset_gpu:1.0

Create directories

HistomicsML2 datasets should be stored inside subdirectories within a master directory to simplifies file sharing with the HistomicsML2 containers.

Create the master directory on the local file system and navigate to this folder

$ mkdir HistomicsML
$ cd HistomicsML

Then create directories inside the master to store labeled data, model, and outputs from training sessions

$ mkdir -p classifiers/tmp
$ mkdir -p models
$ mkdir -p outputs

Now create a project directory and generate subdirectories to store superpixel boundaries, centroids, and whole-slide images

$ mkdir myproject
$ cd myproject
$ mkdir boundary centroid svs tif

These directories will be mounted inside the data creation docker during dataset creation, and again by the database and server containers when running HistomicsML2.

Create slide information table

Use CreateSlideInformation.py to create a .csv in the project directory that describes the whole-slide images

$ cd myproject
$ docker run -it --rm --name createinfo -v "$PWD":/"${PWD##*/}" cancerdatascience/hml_dataset_gpu:1.0 python scripts/CreateSlideInformation.py --projectName "${PWD##*/}"

Here the -v option mounts the project directory inside the container and with the same name under root ‘/’.

Note

The slide_info.csv file describes the whole-slide image dimensions and magnifications and will be ingested by the HistomicsML2 database when datasets are ingested.

Note

Whole-slide image filenames must not contain any ‘.’ characters other than the extension (e.g. .svs). This character interferes with the database ingestion and will prevent dataset import.

Convert whole-slide images

A pyramidal .tif format is needed to serve images inside the UI with IIPImage server. The data generation docker contains the VIPs library to support conversion of whole-slide-images to pyramidal tifs.

Use create_tiff.sh to convert ‘.svs’ to ‘.tif’ format

$ cd myproject
$ docker run -it --rm --name convertslide -v "$PWD":/"${PWD##*/}" cancerdatascience/hml_dataset_gpu:1.0 bash scripts/create_tiff.sh /"${PWD##*/}"/svs /"${PWD##*/}"/tif

/"${PWD##*/}"/svs and /"${PWD##*/}"/tif are the paths where the whole-slide image and converted tif folders are mounted in the data creation container. As the converted tif files are written they will also appear in the local file system outside the container.

Generate superpixel segmentation

Use SuperpixelSegmentation.py to generate superpixel boundaries and centroids

$ cd myproject
$ docker run -it --rm --name createboundary -v "$PWD":/"${PWD##*/}" cancerdatascience/hml_dataset_superpixel:1.0 python scripts/SuperpixelSegmentation.py --projectName "${PWD##*/}" --superpixelSize 64 --patchSize 128

Parameters of the superpixel segmentation script SuperpixelSegmentation.py can be adjusted to change the size, shape, and threshold of superpixels to discard background regions

--superpixelSize

Superpixel edge length in pixels. Range is [8, 256] (default 64).

—patchSize Patch edge length in pixels. Range is [8, 512] (default 128).

--compactness

SLIC compactness parameter. Range is [0.01, 100] (default 50).

—projectName Name of the project directory. Default ‘myproject’.

Generate features and PCA transformation

Use FeatureExtraction.py to extract features from the superpixel segmentation.

To extract features on a CPU system

$ docker run -it --rm --name extractfeatures -v "$PWD":/"${PWD##*/}" cancerdatascience/hml_dataset_cpu:1.0 python scripts/FeatureExtraction.py --projectName "${PWD##*/}"

To extract features on a GPU equipped system (currently supporting CUDA 9.0, Linux x86_64 Driver Version >= 384.81):

$ docker run --runtime=nvidia -it --rm --name extractfeatures -v "$PWD":/"${PWD##*/}" cancerdatascience/hml_dataset_gpu:1.0 python scripts/FeatureExtraction.py --projectName "${PWD##*/}"

Parameters of the feature extraction script can be adjusted to change the patch size and dimensionality reduction process

--superpixelSize

Superpixel edge length in pixels. Range is [8, 256] (default 64).

—patchSize Patch size of each superpixel. Range is [8, 512] (default 128).

--inputPCAModel

Path and filename of .pkl when importing a PCA transform. This specifies the location of the .pkl as mounted inside the docker. If the .pkl file was copied to the current project then –inputPCAModel /${PWD##*/}/pca_model_sample.pkl.

—projectName Name of the project directory (default - current working directory name).

An important note on training, inference, and the PCA transformation:

Note

HistomicsML2 can be used to train new classifiers or to apply existing classifiers to new datasets (inference). For inference it is important that features are extracted consistently in both the training dataset and the inference dataset.

Since features are transformed through principal component analysis (PCA), the same PCA transform used in training datasets needs to be re-used where these classifier are applied to inference datasets. The data creation container provides the option to generate a new PCA transform when creating a training set, or to re-use an existing PCA transform when creating an inference dataset.

HistomicsML2 stores a PCA transforms as .pkl files. Each project directory needs a .pkl file to be imported into HistomicsML2, and so users should manage and copy these files when creating inference datasets.

Completed dataset

The above steps will generate a series of files in your project folder:

myproject/
|----- HistomicsML_dataset.h5
|----- pca_model_sample.pkl
|----- slide_info.csv
|----- boundary/
|      |----- slide1.txt
|      |----- slide2.txt
|      |----- slide3.txt
.
.
.
|----- centroid/
|      |----- slide1.h5
|      |----- slide2.h5
|      |----- slide3.h5
.
.
.
|----- svs/
|      |----- slide1.svs
|      |----- slide2.svs
|      |----- slide3.svs
.
.
.
|----- tif/
|      |----- slide1.svs.dzi.tif
|      |----- slide2.svs.dzi.tif
|      |----- slide3.svs.dzi.tif
.
.
.

Next steps

See how to import HistomicsML2 datasets using the command-line and user interface.