Datasets for Multitask Learning

Published: May 24, 2018

During my PhD, I spent a lot of time searching for, collecting, and preparing datasets for multitask learning. To help new researchers in the field, I’m sharing the datasets that I’ve collected over the last few years. All the Global Climate Model (GCM) combination datasets were created by our research group at the University of Minnesota, Twin Cities. For a detailed explanation of the problem and the datasets, see one of our papers.

Name	Problem type	Number of tasks	Number of samples per task	Dimensions	Description
Landmine_19	Classification	19	445 ~ 690	9	Data were collected from 19 landmine fields with distinct characteristics. Each object in a given dataset is represented by a 9-dimensional feature vector and a corresponding binary label (1 for landmine and 0 for clutter). The feature vectors are extracted from radar images by concatenating four moment-based features, three correlation-based features, one energy ratio feature and one spatial variance feature. The goal is to classify each object as mine or clutter. For more details see paper.
Spam_15	Classification	15	400	500	Spam detection problem. Each task is to learn a binary classifier for a specific user. This dataset contains emails from 15 different users.
Spam_3	Classification	3	2500	500	Spam detection problem. Each task is to learn a binary classifier for a specific user. This dataset contains emails from 3 different users, but has more samples per task than the previous dataset.
Mnist	Classification	45	15000	784	Consists of 28×28 images of handwritten digits from 0 through 9. We transform this multiclass classification problem by applying the all-versus-all decomposition, which yields one binary classification problem (task) per pair of digits, 45 in total. After training, when a new test sample arrives, the classifiers vote and the class with the most votes is chosen.
Yale-faces	Classification	105	2	1024	The face recognition dataset contains 165 grayscale images (32×32 pixels) of 15 individuals. As with MNIST, the problem is transformed by all-versus-all decomposition, giving 105 binary classification problems (tasks).
Letter	Classification	8	3057 ~ 7931	128	The handwritten letter dataset consists of eight tasks, with each one being a binary classification of two letters: a/g, a/o, c/e, f/t, g/y, h/n, m/n and i/j. The input for each data point consists of 128 features representing the pixel values of the handwritten letter.
Five-Regions	Regression	25	1200	10	Each task is a linear combination of Global Climate Models for land temperature for a specific location. This dataset contains 5 neighboring locations from each of the following regions: North America, South America, Africa, Australia and Russia. We don’t know beforehand where each task’s data come from.
South-America-GCM	Regression	250	1200	10	Each task is a linear combination of Global Climate Models for land temperature for a specific location. Covers 250 spatial locations in South America. For more details see paper.
North-America-GCM	Regression	490	1200	10	Each task is a linear combination of Global Climate Models for land temperature for a specific location. Covers 490 spatial locations over land in North America. For more details see paper.
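
The all-versus-all decomposition described for the MNIST and Yale-faces datasets can be sketched roughly as follows. This is a minimal illustration, not the code used to build the datasets: the function names `all_versus_all_tasks` and `vote` and the toy data are assumptions made for the example, and ties in the vote are broken arbitrarily.

```python
from collections import Counter
from itertools import combinations


def all_versus_all_tasks(samples, labels):
    """Split a multiclass dataset into one binary task per pair of classes."""
    classes = sorted(set(labels))
    tasks = {}
    for a, b in combinations(classes, 2):
        # Keep only samples of the two classes; label 1 means class b, 0 means class a.
        xs = [x for x, y in zip(samples, labels) if y in (a, b)]
        ys = [int(y == b) for y in labels if y in (a, b)]
        tasks[(a, b)] = (xs, ys)
    return tasks


def vote(pairwise_predictions):
    """Return the class chosen by the largest number of pairwise classifiers.

    pairwise_predictions maps each class pair (a, b) to the class that the
    pair's binary classifier predicted for one test sample.
    """
    votes = Counter(pairwise_predictions.values())
    return votes.most_common(1)[0][0]


# Toy data standing in for digit images: ten classes, three samples each.
labels = [d for d in range(10) for _ in range(3)]
samples = [[float(d), float(i)] for i, d in enumerate(labels)]

tasks = all_versus_all_tasks(samples, labels)
print(len(tasks))  # number of binary tasks, one per pair of classes
```

With 10 classes this gives 10 · 9 / 2 = 45 tasks, and with 15 classes 15 · 14 / 2 = 105, matching the task counts in the table.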

Andre Goncalves
