Benchmark Tutorial

Benchmarks are at the core of Brain-Score and test models’ match to experimental observations. New benchmarks keep models in check and require them to generalize to new experiments.

A benchmark brings together the experimental paradigm, which is reproduced on a model candidate, the experimentally observed data, and a metric to compare model with experimental observations.

To submit a new benchmark, there are three steps: 1. packaging stimuli and data, 2. creating the benchmark with the experimental paradigm and a metric to compare against data, and 3. opening a pull request on the GitHub repository to commit the updates from 1 and 2. To ensure the continued validity of the benchmark, we require unit tests for all components (stimuli and data as well as the benchmark itself).

1. Package stimuli and data

We require a certain format for stimuli and data so that we can maintain them for long-term use. In particular, we use BrainIO for data management. BrainIO uses StimulusSet (a subclass of pandas DataFrame) to maintain stimuli, and DataAssembly (a subclass of xarray DataArray) to maintain experimental measurements. Aside from unifying data from different sources, the advantage of these formats is that all data are kept together with metadata such as image parameters, electrode locations, and details on behavioral choices. For both StimulusSet and DataAssembly, BrainIO provides packaging methods that upload to S3 cloud storage, and add the entries to lookup.csv from which they can later be accessed.

Data and stimuli can be made public or kept private. It is your choice if you wish to release the data itself or only the benchmark. If you choose to keep the data private, model submissions can be scored on the data, but the actual data itself will not be visible. Publicly released data can also be scored against, but will be fully accessible.

To get started, please create a new folder <authoryear>/ in the packaging directory in which you keep all your packaging scripts. If your code depends on additional requirements, it is good practice to additionally keep a requirements.txt or similar file specifying the dependencies.

Before executing the packaging methods to actually upload to S3, please check in with us via Slack or Github Issue so that we can give you access. With the credentials, you can then configure the awscli (pip install awscli, aws configure using region us-east-1, output format json) to make the packaging methods upload successfully.
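The setup steps above boil down to two commands (the region and output format are taken from the paragraph; the credential values themselves are whatever we send you):

```shell
# one-time setup, assuming you have received credentials from us
pip install awscli
aws configure  # enter the provided access key and secret; region: us-east-1; output format: json
```
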

StimulusSet: The StimulusSet contains the stimuli that were used in the experiment as well as any kind of metadata for the stimuli. Below is a slim example of creating and uploading a StimulusSet. The package_stimulus_set method returns the AWS metadata (such as the sha1 and the version_id) that is later needed in the data plugin's __init__.py file. In this example, we store the metadata in the packaged_stimulus_metadata variable.

from pathlib import Path
from brainio.stimuli import StimulusSet
from brainio.packaging import package_stimulus_set

stimuli = []  # collect meta
stimulus_paths = {}  # collect mapping of stimulus_id to filepath
for filepath in Path(stimuli_directory).glob('*.png'):
    stimulus_id = filepath.stem
    object_name = filepath.stem.split('_')[0]  # if the filepath contains meta, this can come from anywhere
    # ...and other metadata
    stimulus_paths[stimulus_id] = filepath
    stimuli.append({
        'stimulus_id': stimulus_id,
        'object_name': object_name,
        # ...and other metadata
        # optionally you can set 'stimulus_path_within_store' to define the filename in the packaged stimuli
    })
stimuli = StimulusSet(stimuli)
stimuli.stimulus_paths = stimulus_paths
stimuli.name = '<AuthorYear>'  # give the StimulusSet an identifier name

assert len(stimuli) == 1600  # make sure the StimulusSet is what you would expect

packaged_stimulus_metadata = package_stimulus_set(catalog_name=None, proto_stimulus_set=stimuli,
                                                  stimulus_set_identifier=stimuli.name,
                                                  bucket_name="brainio-brainscore")  # upload to S3


DataAssemblies contain the actual experimental measurements as well as any metadata on them. Note that these do not necessarily have to be raw data; they can also be previously published characterizations of the data, such as preference distributions. As such, the person submitting the data to Brain-Score does not have to be involved in the data collection. If you package someone else’s data, we do however recommend checking the specifics with them to avoid misinterpretation. So far, we have encountered data in three forms:

  • NeuroidAssembly: neural data recorded from “neuroids” – neurons or their analogues such as multi-unit activity from Utah array electrodes. These assemblies typically contain spike rates structured in three dimensions presentation x neuroid x time_bin where the presentation dimension represents stimulus presentations (e.g. images x trials), the neuroid dimension represents e.g. electrodes (with metadata such as neuroid_id and location), and the time_bin dimension contains information about the start (time_bin_start) and end (time_bin_end) of a time bin of spike rates.

  • BehavioralAssembly: behavioral measurements, typically choices in a task structured in one dimension presentation that represents stimulus presentations (e.g. images x trials, with metadata on the task such as the sample object and the distractor object in a match-to-sample task) with the actual choices (e.g. “dog”/”cat”, “left”/”right”) in the assembly values.

  • PropertiesAssembly: any kind of data in a pre-processed form, such as a surround suppression index per neuroid.
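For illustration only (the numbers and names below are made up, and plain xarray stands in for brainio's assembly subclasses), a NeuroidAssembly is essentially a labeled 3-d array whose metadata travels with the values:

```python
import numpy as np
import xarray as xr

# hypothetical spike rates: 4 stimulus presentations x 3 neuroids x 2 time bins
assembly = xr.DataArray(
    np.random.rand(4, 3, 2),
    dims=['presentation', 'neuroid', 'time_bin'],
    coords={
        'stimulus_id': ('presentation', ['im1', 'im2', 'im3', 'im4']),
        'neuroid_id': ('neuroid', ['e1', 'e2', 'e3']),
        'region': ('neuroid', ['IT', 'IT', 'IT']),
        'time_bin_start': ('time_bin', [70, 170]),
        'time_bin_end': ('time_bin', [170, 270]),
    })
# metadata makes slicing easy, e.g. all presentations and neuroids in the first time bin
first_bin = assembly.isel(time_bin=0)
assert first_bin.shape == (4, 3)
```

The BehavioralAssembly and PropertiesAssembly forms follow the same pattern, just with fewer dimensions.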

Here is an example of a BehavioralAssembly:

from brainio.assemblies import BehavioralAssembly
from brainio.packaging import package_data_assembly

assembly = BehavioralAssembly(['dog', 'dog', 'cat', 'dog', ...],
                              coords={
                                  'stimulus_id': ('presentation', ['image1', 'image2', 'image3', 'image4', ...]),
                                  'sample_object': ('presentation', ['dog', 'cat', 'cat', 'dog', ...]),
                                  'distractor_object': ('presentation', ['cat', 'dog', 'dog', 'cat', ...]),
                                  # ...more meta
                                  # Note that meta from the StimulusSet will automatically be merged into the
                                  #  presentation dimension
                              },
                              dims=['presentation'])
assembly.name = '<authoryear>'  # give the assembly an identifier name

# make sure the assembly is what you would expect
assert len(assembly['presentation']) == 179660
assert len(set(assembly['stimulus_id'].values)) == 1600
assert len(set(assembly['choice'].values)) == len(set(assembly['sample_object'].values)) \
       == len(set(assembly['distractor_object'].values)) == 2

# upload to S3
packaged_assembly_metadata = package_data_assembly(proto_data_assembly=assembly,
                                                   assembly_identifier=assembly.name,
                                                   stimulus_set_identifier=stimuli.name,  # link to the StimulusSet packaged above
                                                   assembly_class_name="BehavioralAssembly",
                                                   bucket_name="brainio-brainscore",
                                                   catalog_identifier=None)

In our experience, it is generally a good idea to include as much metadata as possible (on both StimulusSet and Assembly). This will increase the utility of the data and make it a more valuable long-term contribution. Please note that, like package_stimulus_set, the package_data_assembly method returns the AWS metadata (such as the sha1 and the version_id) needed in the data plugin's __init__.py file. In this example, we store the metadata in the packaged_assembly_metadata variable.

You can also put both of these packaging methods inside one Python file, called e.g. data_packaging.py. This file would then package and upload both the stimulus_set and assembly.

Unit Tests (test.py): We ask that packaged stimuli and assemblies are tested so that their validity can be confirmed for a long time, even as details in the system might change. For instance, we want to avoid accidental overwrites of a packaged experiment, and the unit tests guard against that.

When creating your benchmark, we require you to include a test.py file. For what this file should contain, see below.

We realize that unit tests can be a hurdle and we can take over this task for you. Please let us know of any hurdles and we will do our best to support.

There are already generic tests in place to which you can add your StimulusSet and assembly identifiers:

  1. tests.test_stimuli.test_list_stimulus_set()

  2. tests.test_assemblies.test_list_assembly()

  3. tests.test_assemblies.test_existence()

Simply add your identifiers to the list.

Additionally, you can write your own test method to run some more detailed checks on the validity of StimulusSet and assembly:

# in test_stimuli.py
import brainio

def test_<authoryear>():
    stimulus_set = brainio.get_stimulus_set('<authoryear>')
    assert len(stimulus_set) == 123  # check number of stimuli
    assert len(set(stimulus_set['stimulus_id'])) == 12  # check number of unique stimuli
    assert set(stimulus_set['object_name']) == {'dog', 'cat'}
    # etc

# in test_assemblies.py
import numpy as np
import brainscore

def test_<authoryear>():
    assembly = brainscore.get_assembly('<authoryear>')
    np.testing.assert_array_equal(assembly.dims, ['presentation'])
    assert len(set(assembly['stimulus_id'].values)) == 123  # check number of stimuli
    assert len(assembly) == 123456  # check number of trials
    assert assembly.stimulus_set is not None
    assert len(assembly.stimulus_set) == 123  # make sure number of stimuli in stimulus_set lines up with assembly
    # etc

Adding your data to Brain-Score: You will also need an __init__.py file to go along with your submission. The purpose of this file is to register your contribution inside the Brain-Score ecosystem, by adding the stimuli and the data to the stimulus_set_registry and data_registry, respectively. See below for an example from the data for Geirhos2021:

# assembly
data_registry['Geirhos2021_colour'] = lambda: load_assembly_from_s3(
    identifier='Geirhos2021_colour',
    version_id="<version_id>",  # placeholders: use the values returned by package_data_assembly
    sha1="<sha1>",
    bucket="brainio-brainscore",
    cls=BehavioralAssembly,
    stimulus_set_loader=lambda: load_stimulus_set('Geirhos2021_colour'))

# stimulus set
stimulus_set_registry['Geirhos2021_colour'] = lambda: load_stimulus_set_from_s3(
    identifier='Geirhos2021_colour',
    bucket="brainio-brainscore",
    csv_sha1="<csv_sha1>",  # placeholders: use the values returned by package_stimulus_set
    zip_sha1="<zip_sha1>",
    csv_version_id="<csv_version_id>",
    zip_version_id="<zip_version_id>")

Data Packaging Summary: Part 1 of creating a benchmark involves packaging the stimuli and data, adding a test.py file, and registering the stimuli and data in the stimulus_set_registry and data_registry via the __init__.py file. An example submission structure is shown below:
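Assuming the brainscore_vision plugin conventions (the file names here are illustrative, not prescriptive), a data submission might look like:

```
brainscore_vision/data/<authoryear>/
    __init__.py         # registers the StimulusSet and assembly in the registries
    data_packaging.py   # packaging script(s) that uploaded stimuli and data to S3
    test.py             # unit tests on the packaged StimulusSet and assembly
    requirements.txt    # optional: additional dependencies
```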


2. Create the benchmark

The Benchmark brings together the experimental paradigm with stimuli, and a Metric to compare model measurements against experimental data. The paradigm typically involves telling the model candidate to perform a task or start recording in a particular area, while looking at images from the previously packaged StimulusSet. Interacting with the model candidate is agnostic of the specific model and is guided by the BrainModel – all models implement this interface, and through this interface the benchmark can interact with all current and future model candidates.

Typically, all benchmarks inherit from BenchmarkBase, a super-class requesting the commonly used attributes. These attributes include:

  • the identifier which uniquely designates the benchmark

  • the version number which increases when changes to the benchmark are made

  • a ceiling_func that, when run, returns a ceiling for this benchmark

  • the benchmark’s parent to group under e.g. V1, V2, V4, IT, behavior, or engineering (machine learning benchmarks)

  • a bibtex that is used to link to the publication from the benchmark and website for further details (we are working on crediting benchmark submitters more prominently, in addition to the data source)

Here is an example of a behavioral benchmark that uses an already defined metric, I2n, to compare image-level behaviors:

import brainscore
from brainscore.benchmarks import BenchmarkBase
from brainscore.benchmarks.screen import place_on_screen
from brainscore.metrics.image_level_behavior import I2n
from brainscore.model_interface import BrainModel
from brainscore.utils import LazyLoad

# the BIBTEX will be used to link to the publication from the benchmark for further details
BIBTEX = """@article {AuthorYear,
                author = {Author},
                title = {title},
                year = {2021},
                url = {link},
                journal = {bioRxiv}}"""
class AuthorYearI2n(BenchmarkBase):
    def __init__(self):
        self._metric = I2n()  # use a previously defined metric
        # we typically use the LazyLoad wrapper to only load the assembly on demand
        self._fitting_stimuli = LazyLoad(lambda: brainscore.get_stimulus_set('<authoryear>'))
        self._assembly = LazyLoad(lambda: brainscore.get_assembly('<authoryear>'))
        # at what degree visual angle stimuli were presented
        self._visual_degrees = 8
        # how many repeated trials each stimulus was shown for
        self._number_of_trials = 2
        super(AuthorYearI2n, self).__init__(
            identifier='<AuthorYear>-i2n',
            # the version number increases when changes to the benchmark are made; start with 1
            version=1,
            # the ceiling function outputs a ceiling estimate of how reliable the data is, or in other words, how
            # well we would expect the perfect model to perform on this benchmark
            ceiling_func=lambda: self._metric.ceiling(self._assembly),
            parent='behavior',
            bibtex=BIBTEX)

    # The __call__ method takes as input a candidate BrainModel and outputs a similarity score of how brain-like
    # the candidate is under this benchmark.
    # A candidate here could be a model such as CORnet or brain-mapped Alexnet, but importantly the benchmark can be
    # agnostic to the details of the candidate and instead only engage with the BrainModel interface.
    def __call__(self, candidate: BrainModel):
        # place the stimuli on a screen so that they subtend the correct visual angle,
        # based on the visual degrees of the candidate
        fitting_stimuli = place_on_screen(self._fitting_stimuli, target_visual_degrees=candidate.visual_degrees(),
                                          source_visual_degrees=self._visual_degrees)
        candidate.start_task(BrainModel.Task.probabilities, fitting_stimuli)
        stimulus_set = place_on_screen(self._assembly.stimulus_set, target_visual_degrees=candidate.visual_degrees(),
                                       source_visual_degrees=self._visual_degrees)
        probabilities = candidate.look_at(stimulus_set, number_of_trials=self._number_of_trials)
        score = self._metric(probabilities, self._assembly)
        score = self._metric.ceil_score(score, self.ceiling)
        return score
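The final ceiling step typically normalizes the raw score by the ceiling, so that a model matching the data as well as the data's own reliability allows would score 1. A rough sketch with hypothetical numbers (actual metrics such as I2n may normalize in more involved ways):

```python
# hedged sketch of ceiling normalization; the values are made up
raw_score = 0.35  # hypothetical raw model-data similarity
ceiling = 0.70    # hypothetical data reliability (expected score of a "perfect" model)
ceiled_score = raw_score / ceiling
assert abs(ceiled_score - 0.5) < 1e-9  # normalized score
```
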

We also need to register the benchmark in the benchmark registry in order to make it accessible by its identifier. This is done in the __init__.py file inside the benchmark directory:

# in brainscore_vision/benchmarks/mybenchmark/__init__.py

from brainscore_vision import benchmark_registry

benchmark_registry['mybenchmark-i2n'] = AuthorYearI2n  # specify the class and not the object, i.e. without `()`

Unit Tests

Like with the stimuli and data, we want to ensure the continued validity of the benchmark so that it remains valuable and can be maintained. All tests are in your plugin folder’s test.py, e.g. brainscore_vision/benchmarks/mybenchmark/test.py.

We realize that unit tests can be a hurdle and we can take over this task for you. Please let us know of any hurdles and we will do our best to support.

We ask that all benchmarks test at least two things:

  1. The ceiling value of the benchmark:

from brainscore_vision import load_benchmark

benchmark = load_benchmark('mybenchmark')
assert benchmark.ceiling == expected

  2. The score of one or more models:

The idea for scores of existing models is to run a few models on the benchmark, and test that running them on the benchmark will reproduce the same score.

from brainscore_vision import score

actual_score = score(model_identifier='your-favorite-model', benchmark_identifier='mybenchmark')
assert actual_score == expected
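Since scores are floating-point values, an exact equality check can be brittle across environments; a tolerance-based comparison is more robust (sketch with hypothetical numbers; Brain-Score tests often use pytest.approx for the same purpose):

```python
import math

actual_score = 0.4213  # hypothetical score returned by score(...)
expected = 0.42        # hypothetical previously recorded score
assert math.isclose(actual_score, expected, abs_tol=0.005)
```
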

Benchmark Summary: To summarize, Part 2 of creating a benchmark involves making the actual benchmark package. This is done by adding the benchmark.py file, the test.py file, and registering the benchmark via the __init__.py file.

An example submission structure is shown below:
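Assuming the same plugin conventions as for the data (illustrative, not prescriptive), a benchmark submission might look like:

```
brainscore_vision/benchmarks/mybenchmark/
    __init__.py    # registers the benchmark in benchmark_registry
    benchmark.py   # the benchmark class: paradigm, data, and metric
    test.py        # unit tests on the ceiling and on precomputed model scores
```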


3. Submit the benchmark and iterate to finalize

Finally, submit your entire benchmark plugin. You can do this either by opening a pull request on the Brain-Score GitHub repository or by submitting a zip file containing your plugin (<zip>/benchmarks/mybenchmark) on the website.

This will trigger server-side unit tests which ensure that all unit tests pass successfully. Often, this step can highlight some issues in the code, so it can take some iterations on the code to make sure everything runs smoothly. Please open an issue if you run into trouble or get stuck.

If any stimuli or data should be made public, please let us know so that we can change the corresponding S3 bucket policy.

After the PR has been merged, the submission system will automatically run all existing models on the new benchmark.