Datasets for Multitask Learning

During my PhD, I spent a lot of time searching for, collecting, and preparing datasets for multitask learning. To help new researchers in the field, I'm sharing the datasets I have collected over the last few years. All the Global Climate Model (GCM) combination datasets were created by our research group at the University of Minnesota, Twin Cities. See one of our papers for a detailed explanation of the problem and the datasets.

| Name | Problem type | Number of tasks | Number of samples per task | Dimensions | Description |
|---|---|---|---|---|---|
| Landmine_19 | Classification | 19 | 445 ~ 690 | 9 | Data collected from 19 landmine fields with distinct characteristics. Each object in a given dataset is represented by a 9-dimensional feature vector and a binary label (1 for landmine, 0 for clutter). The feature vectors are extracted from radar images and concatenate four moment-based features, three correlation-based features, one energy ratio feature, and one spatial variance feature. The goal is to classify each object as mine or clutter. For more details see paper. |
| Spam_15 | Classification | 15 | 400 | 500 | Spam detection problem. Each task is to learn a binary classifier for a specific user. This dataset contains emails from 15 different users. |
| Spam_3 | Classification | 3 | 2500 | 500 | Spam detection problem. Each task is to learn a binary classifier for a specific user. This dataset contains emails from 3 different users, but has more samples per task than Spam_15. |
| Mnist | Classification | 45 | 15000 | 784 | Consists of 28×28 images of handwritten digits from 0 through 9. We transform this multiclass classification problem with the all-versus-all decomposition, which yields 45 binary classification problems (tasks). After training, when a new test sample arrives, the classifiers vote and the class with the most votes is chosen. |
| Yale-faces | Classification | 105 | 2 | 1024 | The face recognition dataset contains 165 grayscale images (32×32 pixels) of 15 individuals. As with MNIST, the problem is transformed by the all-versus-all decomposition, totalling 105 binary classification problems (tasks). |
| Letter | Classification | 8 | 3057 ~ 7931 | 128 | The handwritten letter dataset consists of eight tasks, each a binary classification between two letters: a/g, a/o, c/e, f/t, g/y, h/n, m/n, and i/j. The input for each data point consists of 128 features representing the pixel values of the handwritten letter. |
| Five-Regions | Regression | 25 | 1200 | 10 | Each task is a linear combination of Global Climate Models for land temperature at a specific location. This dataset contains 5 neighboring locations from each of the following regions: North America, South America, Africa, Australia, and Russia. The region each task's data comes from is not known beforehand. |
| South-America-GCM | Regression | 250 | 1200 | 10 | Each task is a linear combination of Global Climate Models for land temperature at a specific location. Covers 250 spatial locations in South America. For more details see paper. |
| North-America-GCM | Regression | 490 | 1200 | 10 | Each task is a linear combination of Global Climate Models for land temperature at a specific location. Covers 490 spatial locations over land in North America. For more details see paper. |
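
The Mnist and Yale-faces entries above rely on the all-versus-all decomposition, which turns one multiclass problem into one binary task per pair of classes and then lets the pairwise classifiers vote at prediction time. The following is a minimal sketch of that idea in plain Python, not the code used to build these datasets. The function names `make_pairwise_tasks` and `vote` are hypothetical, and the tie-breaking rule (first class to reach the top vote count wins) is an assumption, since the description above does not say how ties are handled.

```python
from collections import Counter
from itertools import combinations


def make_pairwise_tasks(samples, labels):
    """Split a multiclass dataset into one binary task per pair of classes."""
    classes = sorted(set(labels))
    tasks = {}
    for a, b in combinations(classes, 2):
        # Keep only the samples of the two classes; label 1 means class b.
        xs = [x for x, y in zip(samples, labels) if y in (a, b)]
        ys = [int(y == b) for y in labels if y in (a, b)]
        tasks[(a, b)] = (xs, ys)
    return tasks


def vote(pair_predictions):
    """Return the class with the most votes.

    pair_predictions maps a class pair (a, b) to the output of that pair's
    binary classifier: 0 is a vote for a, 1 is a vote for b.
    Ties are broken arbitrarily (an assumption of this sketch).
    """
    votes = Counter(
        b if pred == 1 else a for (a, b), pred in pair_predictions.items()
    )
    return votes.most_common(1)[0][0]


# Ten classes give 10 * 9 / 2 = 45 tasks; fifteen classes give 105.
ten_class_tasks = make_pairwise_tasks([[0.0]] * 10, list(range(10)))
fifteen_class_tasks = make_pairwise_tasks([[0.0]] * 15, list(range(15)))
print(len(ten_class_tasks), len(fifteen_class_tasks))
```

With `k` classes the decomposition produces `k * (k - 1) / 2` tasks, which matches the task counts listed for Mnist (10 classes, 45 tasks) and Yale-faces (15 individuals, 105 tasks).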