:floppy_disk: Data Repository for the "Levels" dataset
This repository contains the Levels data collected in support of the manuscript Aligning Machine and Human Visual Representations across Abstraction Levels (arXiv:2409.06509; see the citation below). In brief, the Levels dataset contains the `sourcedata` and `processed_data` from the HumanEval Experiment, conducted at the Max Planck Institute for Human Development (MPIB) in collaboration with BIFOLD at TU Berlin in 2024 by Frieda Born and colleagues.
🪁 What does the Levels dataset contain and why did we collect it? The Levels dataset is a new dataset of human similarity judgments spanning multiple levels of semantic abstraction.
In our main 'AligNet' project, the Levels dataset is used for evaluation. We used the human odd-one-out similarity judgments at multiple levels of semantic abstraction to evaluate whether (synthetically generated) similarity judgments of existing state-of-the-art human-aligned models (Muttenthaler et al., 2023) correspond to ground-truth human judgments. Please see the main manuscript (referenced below) for details on results and methods.
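For intuition, the evaluation logic can be sketched as follows: given a model's pairwise similarity matrix, its predicted odd-one-out for a triplet is the image left out of the most similar pair, and agreement with the human choices yields an odd-one-out accuracy. Below is a minimal, illustrative sketch; the function and variable names are ours, not taken from the released code:

```python
import numpy as np


def model_odd_one_out(sim: np.ndarray, i: int, j: int, k: int) -> int:
    """Predict the odd-one-out in triplet (i, j, k) from a pairwise
    similarity matrix: the two most similar items form a pair, and
    the remaining item is the odd one out."""
    # Map each candidate "most similar pair" to the item it leaves out.
    pairs = {(i, j): k, (i, k): j, (j, k): i}
    _, odd = max(pairs.items(), key=lambda kv: sim[kv[0]])
    return odd
```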
🗞 What is the broader background of this work? Human alignment is becoming central to representation learning (e.g., Muttenthaler et al., 2023; Sucholutsky et al., 2023). Models are needed that not only perform well on downstream machine learning tasks but also align with human perception and intentions. To this end, we used existing approaches for aligning neural network models with human object perception (Muttenthaler et al., 2023) to create a large-scale dataset of human-aligned similarity judgments. To test whether these judgments are in line with actual human similarity judgments, we needed to collect a set of ground-truth human similarity judgments: 🚀 the Levels dataset 🚀. While this dataset was primarily designed for evaluation, it can, of course, be used for a wide range of purposes, including training models or other applications that require human similarity evaluations.
:file_folder: Overview
This is a condensed overview of the data. For more detailed information, please see the materials linked below and feel free to contact the author(s) if needed.
The `sourcedata` folder contains the raw subject data as one `*.json` file per subject. Detailed information about each variable can be found in the variable codebook within the `docs/` folder. The `processed_data` folder contains preprocessed data files, offering a streamlined and efficient way to work with the dataset; for more details, please refer to the separate README file in that folder.
🚨 Main variables
- :page_facing_up: Key Data Columns:
  - `rt`: Response times
  - `image1Path`: Stimulus name of triplet image 1
  - `image2Path`: Stimulus name of triplet image 2
  - `image3Path`: Stimulus name of triplet image 3
  - `selected_image`: Name of the image selected as the odd-one-out in each trial
  - `exp_trial_type`: Type of trial (e.g., experiment or training trial)
  - `response`: Demographic information about the participant
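To make these columns concrete, a single trial record might look like the following sketch (all values are invented for illustration; consult the codebook in `docs/` for exact formats and units):

```python
# A hypothetical trial record (all values invented for illustration):
trial = {
    "rt": 1432,                               # response time; see codebook for units
    "image1Path": "images/stimulus_a.png",    # hypothetical stimulus names
    "image2Path": "images/stimulus_b.png",
    "image3Path": "images/stimulus_c.png",
    "selected_image": "images/stimulus_c.png",  # chosen odd-one-out
    "exp_trial_type": "experiment",           # vs. a training trial
}
```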
:floppy_disk: Download Instructions
You can download the full dataset using the following methods:
- Via the GIN client:
  - Install: Follow the GIN CLI Setup for installation instructions.
  - Clone: Run `gin get fborn/Levels_dataset` to clone the repository.
  - Navigate to the root of the dataset: `cd Levels_dataset`
  - Download all data: `gin download --content`
  - If you wish to work on or edit the files, run: `gin unlock *`
- Alternatively, click the small "download icon" on the right side above the list of files in the repository overview on GIN. This will point you to the GIN documentation on downloading the data.
:file_folder: Loading Raw Data
If you are using Python, you can load the raw data like this:
```python
import json
import os
from typing import Dict, List, Union


def load_response_data(path_to_responses: str) -> List[Dict[str, Union[float, int, str]]]:
    """Load human odd-one-out responses from disk."""
    trials = []
    for file in os.scandir(path_to_responses):
        if file.name.endswith(".json"):
            # Each subject file stores one JSON object per line.
            with open(file.path, "r") as f:
                for line in f:
                    trials.append(json.loads(line))
    return trials
```
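As a usage sketch building on the loader above, you could collect the trials into a pandas DataFrame (this assumes a local `sourcedata/` directory and the column names listed earlier; the `"experiment"` label is an assumption, so check `exp_trial_type` values against the codebook):

```python
import pandas as pd

trials = load_response_data("sourcedata")  # folder downloaded via GIN above
df = pd.DataFrame(trials)

# Keep experimental trials only ("experiment" is an assumed label).
experiment_df = df[df["exp_trial_type"] == "experiment"]
print(experiment_df[["image1Path", "image2Path", "image3Path", "selected_image"]].head())
```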
📄 Using this Dataset
If you use this dataset in your work, please consider citing the following paper:
```bibtex
@article{muttenthaler2024aligning,
  title={Aligning Machine and Human Visual Representations across Abstraction Levels},
  author={Muttenthaler, Lukas and Greff, Klaus and Born, Frieda and Spitzer, Bernhard
          and Kornblith, Simon and Mozer, Michael C and M{\"u}ller, Klaus-Robert and
          Unterthiner, Thomas and Lampinen, Andrew K},
  journal={arXiv preprint arXiv:2409.06509},
  year={2024}
}
```
You can access the paper on arXiv: arXiv:2409.06509.
:warning: License
This data is made available under the Public Domain Dedication and License (PDDL) v1.0, whose full text can be found in the LICENSE file. See also the human-readable summary at opendatacommons.org/licenses/pddl/summary/.
📬 Please do not hesitate to contact us (born[at]mpib-berlin.mpg.de) if you have questions about the data or wish to receive it in a different format.