labeled-datasets

96 Labeled Datasets

In this repository, we provide 96 publicly available labeled datasets. The datasets were originally collected for the paper “Measuring the Validity of Clustering Validation Datasets” (previously entitled “Sanity Check for External Clustering Validation Benchmarks using Internal Validation Measure”) as potential candidates for external clustering validation. However, they can still be used for various other purposes (e.g., classification and dimensionality reduction). For better applicability, we provide the datasets in both numpy (.npy) and compressed (.bin) formats. We also provide reader code for the compressed files.

A full list of the datasets is available at this website and in the Appendix of our reference paper (TBA).

Reader API

API

The reader for the compressed files is implemented in reader.py. It assumes that the location of reader.py relative to the compressed datasets is the same as in this repository. The reader depends on numpy and zlib (zlib is part of the Python standard library).

read_dataset(name)

read_dataset_by_path(path)

read_multiple_datasets(names)

read_all_datasets()
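
The actual layout of the .bin files is defined by reader.py and is not documented here. As a rough illustration of why the reader depends on both numpy and zlib, the following sketch shows a generic compress/decompress round trip for an array. The helpers pack_array and unpack_array and the float32 raw-bytes layout are hypothetical assumptions for illustration, not the repository's format.

```python
import zlib
import numpy as np

def pack_array(arr):
    """Compress an array's raw bytes with zlib.

    ASSUMPTION: float32 raw-bytes layout chosen for illustration only;
    this is not the format used by the repository's .bin files.
    """
    arr = np.ascontiguousarray(arr, dtype=np.float32)
    return zlib.compress(arr.tobytes()), arr.shape

def unpack_array(blob, shape):
    """Decompress the bytes and rebuild the array with the given shape."""
    raw = zlib.decompress(blob)
    return np.frombuffer(raw, dtype=np.float32).reshape(shape)

original = np.arange(12, dtype=np.float32).reshape(3, 4)
blob, shape = pack_array(original)
restored = unpack_array(blob, shape)
```

To read the datasets shipped in this repository, use the functions listed above rather than this sketch.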

Example

import reader as rd
import numpy as np

data, label = rd.read_dataset("cifar10")
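
Once a dataset is loaded, a quick sanity check of its size and class balance is often useful. The sketch below uses synthetic stand-in arrays so that it runs on its own. It assumes that data is a 2-D array of shape (n_samples, n_features) and that label is a 1-D array with one entry per sample; these shapes are assumptions, and summarize is a hypothetical helper that is not part of reader.py.

```python
import numpy as np

# Synthetic stand-ins for the arrays returned by rd.read_dataset(...).
# ASSUMPTION: data is (n_samples, n_features); label is (n_samples,).
rng = np.random.default_rng(0)
data = rng.normal(size=(6, 4))
label = np.array([0, 1, 0, 2, 1, 0])

def summarize(data, label):
    """Return (n_samples, n_features, {class: count}) for a labeled dataset."""
    data = np.asarray(data)
    label = np.asarray(label)
    if data.shape[0] != label.shape[0]:
        raise ValueError("data and label must have the same number of rows")
    classes, counts = np.unique(label, return_counts=True)
    return data.shape[0], data.shape[1], dict(zip(classes.tolist(), counts.tolist()))

n_samples, n_features, class_counts = summarize(data, label)
print(n_samples, n_features, class_counts)
```

With a real dataset, the synthetic arrays would be replaced by the output of rd.read_dataset.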

Contact

If you have any issues using the datasets, feel free to contact us at hj@hcil.snu.ac.kr.

Reference

TBA