Datasets for Multitask Learning
During my PhD, I spent a lot of time searching for, collecting, and preparing datasets for multitask learning. To help new researchers in the field, I’m sharing the datasets that I’ve collected over the last few years. All the Global Climate Model (GCM) combination datasets were created by our research group at the University of Minnesota - Twin Cities. See one of our papers for a detailed explanation of the problem and the datasets.
Name | Problem type | Number of tasks | Number of samples per task | Dimensions | Description |
---|---|---|---|---|---|
Landmine_19 | Classification | 19 | 445 ~ 690 | 9 | Data collected from 19 landmine fields with distinct characteristics. Each object in a given dataset is represented by a 9-dimensional feature vector and a corresponding binary label (1 for landmine, 0 for clutter). The feature vectors are extracted from radar images by concatenating four moment-based features, three correlation-based features, one energy-ratio feature, and one spatial-variance feature. The goal is to classify each object as mine or clutter. For more details, see the paper. |
Spam_15 | Classification | 15 | 400 | 500 | Spam detection problem. Each task is to learn a binary classifier for a specific user. This dataset contains emails from 15 different users. |
Spam_3 | Classification | 3 | 2500 | 500 | Spam detection problem. Each task is to learn a binary classifier for a specific user. This dataset contains emails from 3 different users, but has more samples per task than the previous dataset. |
MNIST | Classification | 45 | 15000 | 784 | Consists of 28×28 images of handwritten digits from 0 through 9. We transform this multiclass classification problem by applying the all-versus-all decomposition, leading to 45 binary classification problems (tasks). After training, when a new test sample arrives, the classifiers vote and the class with the most votes is chosen. |
Yale-faces | Classification | 105 | 2 | 1024 | This face recognition dataset contains 165 grayscale images of 15 individuals, each 32×32 pixels. As with MNIST, the problem is transformed by all-versus-all decomposition, giving 105 binary classification problems (tasks). |
Letter | Classification | 8 | 3057 ~ 7931 | 128 | The handwritten letter dataset consists of eight tasks, with each one being a binary classification of two letters: a/g, a/o, c/e, f/t, g/y, h/n, m/n and i/j. The input for each data point consists of 128 features representing the pixel values of the handwritten letter. |
Five-Regions | Regression | 25 | 1200 | 10 | Each task is a linear combination of Global Climate Models for land temperature at a specific location. This dataset contains 5 neighboring locations from each of the following regions: North America, South America, Africa, Australia, and Russia. The region each task's data comes from is not known beforehand. |
South-America-GCM | Regression | 250 | 1200 | 10 | Each task is a linear combination of Global Climate Models for land temperature at a specific location. Covers 250 spatial locations in South America. For more details, see the paper. |
North-America-GCM | Regression | 490 | 1200 | 10 | Each task is a linear combination of Global Climate Models for land temperature at a specific location. Covers 490 spatial locations over land in North America. For more details, see the paper. |
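
The all-versus-all decomposition mentioned in the MNIST and Yale-faces rows can be sketched roughly as follows. This is only an illustration under my own assumptions, not the exact code used to build the datasets: the function names `all_versus_all_tasks` and `vote`, the 0/1 label convention, and the toy data are all hypothetical.

```python
from collections import Counter
from itertools import combinations

import numpy as np


def all_versus_all_tasks(X, y):
    """Split a multiclass dataset into one binary task per pair of classes."""
    tasks = {}
    for a, b in combinations(sorted(set(y.tolist())), 2):
        mask = (y == a) | (y == b)
        # Assumed convention: label 1 for class b, 0 for class a.
        tasks[(a, b)] = (X[mask], (y[mask] == b).astype(int))
    return tasks


def vote(pair_predictions):
    """Return the class with the most votes.

    pair_predictions maps each class pair (a, b) to the binary output
    of that pair's classifier for a single test sample.
    """
    votes = Counter(b if pred == 1 else a for (a, b), pred in pair_predictions.items())
    return votes.most_common(1)[0][0]


# Toy data standing in for a 10-class problem (random values, not real MNIST).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.arange(200) % 10

tasks = all_versus_all_tasks(X, y)
print(len(tasks))  # 45 pairs for 10 classes
```

With K classes the decomposition yields K(K-1)/2 tasks, which is where the 45 tasks for 10 digits and the 105 tasks for 15 individuals come from. Each binary classifier only sees the samples of its two classes, and at test time `vote` would be called with one prediction per pair.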