Exploring spatial representations in Visual Foundation Models

Recent publications studying the inner workings of foundation models have focused largely on internal representations of large language models (LLMs). In this short blog post, I will explore neural representations in large vision models (Radford et al., Oquab et al., Kirillov et al.) with the aim of addressing a specific question: Can we find spatially selective cells that encode parts of the visual field, akin to the ones found in biological neural networks?

For this I first create a dataset that can be used to estimate receptive fields among synthetic neurons ('design'). We then record neuronal responses across many different artificial layers ('record') and visualize and quantify the responses using methods traditionally employed in neuroscience ('analyse'). Finally, we establish an 'artificial-causality' by virtually lesioning a subset of spatially selective neurons ('perturb').

Each of the aforementioned steps — design, record, analyse, perturb — mirrors techniques used in neuroscience, with artificial neural networks (ANNs) simplifying at least two (record & perturb) of these procedures dramatically. This simplification is made possible by the fact that, in contrast to biological tissues, we have direct access to every single neuron and its connections, enabling an investigation with an unprecedented level of detail.


We first create a dataset containing 3000 stimuli which are randomly distributed across the visual field. Initially, I had the stimuli take several geometric forms — square, circle, triangle, rectangle — which were randomly picked during dataset creation. I ultimately decided against the added complexity of different shapes and used only circles, which are randomly sampled to have a diameter between 20 and 80 pixels and are positioned to cover the full visual field.
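As a rough illustration of the design step, a stimulus generator could look like the sketch below. This is not the code used for the post: the image size of 224 pixels, the black-on-white rendering, the constraint that the whole circle stays inside the image, and the function names `make_circle_stimulus` and `make_dataset` are all assumptions made for this example.

```python
import numpy as np

def make_circle_stimulus(rng, image_size=224, min_diameter=20, max_diameter=80):
    """Return (image, (cx, cy)) for one black circle on a white background."""
    # Diameter is sampled uniformly from the inclusive range given in the text.
    diameter = rng.integers(min_diameter, max_diameter + 1)
    radius = diameter / 2.0
    # Assumption: the centre is sampled so that the circle is never cropped.
    cx = rng.uniform(radius, image_size - radius)
    cy = rng.uniform(radius, image_size - radius)
    yy, xx = np.mgrid[0:image_size, 0:image_size]
    inside = (xx - cx) ** 2 + (yy - cy) ** 2 <= radius ** 2
    image = np.full((image_size, image_size), 255, dtype=np.uint8)
    image[inside] = 0
    return image, (cx, cy)

def make_dataset(n_stimuli=3000, seed=0, **kwargs):
    """Return images of shape (n, H, W) and centres of shape (n, 2) as (x, y)."""
    rng = np.random.default_rng(seed)
    images, centres = [], []
    for _ in range(n_stimuli):
        image, centre = make_circle_stimulus(rng, **kwargs)
        images.append(image)
        centres.append(centre)
    return np.stack(images), np.array(centres)
```

Keeping the centres alongside the images matters for the later steps, since every activation is eventually indexed by where its stimulus was placed.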

Stimuli Visualization
Overview of dataset and stimulus locations. (Left) Eight representative stimuli of the synthetic dataset that are used to extract neural activations across model layers. (Right) For each stimulus the center is visualized (top), accompanied by the composite image formed by taking the average across all stimuli (bottom).

With these stimuli at hand, we can characterize the responses of neurons within a vision model with respect to each location in the visual field. The model I will be using here is based on the original CLIP publication by OpenAI; more specifically, we will use the ViT-B/32 architecture. (These models are trained using a contrastive objective on pairs of images and text. During training, the loss is minimized so that each image lies close to its paired text in the embedding space. To quantify performance during inference, people usually measure the cosine similarity between the image embedding and each of the text embeddings, with the most similar text being the best match.)

To explore whether CLIP can accurately identify the above-created stimuli, we first need to generate text embeddings to compare with the visual embeddings of these stimuli. Initially, I used three texts: "a circle", "a square" and "a triangle". The correct response for all stimuli would be "a circle", which CLIP gets right 88% of the time (see Table below). However, introducing a fourth descriptor ("an empty image") changes the outcome considerably: the performance for "a circle" drops to 9%, while "an empty image" reaches 90%. This shift suggests that, rather than recognizing the shape, CLIP now classifies the stimuli as resembling an empty image.

I struggled with this problem for a while, as I had naively assumed that classifying the circle correctly would be trivial for CLIP. After some experiments with other texts, I noticed that providing additional color information is sufficient for CLIP to recognize the correct shape. When using "a black circle" instead of "a circle", the classification performance goes back up to 73%. This shows that adding seemingly simple details to a descriptor can substantially change the model's ability to identify visual elements correctly.

Prompt set    Circle         Square         Triangle       Empty
Shape only    0.88 ± 0.08    0.05 ± 0.03    0.07 ± 0.06    -
+ Empty       0.09 ± 0.06    0.00 ± 0.00    0.00 ± 0.00    0.90 ± 0.06
+ Color       0.73 ± 0.22    0.06 ± 0.02    0.00 ± 0.00    0.21 ± 0.21
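The classification step described above boils down to a softmax over cosine similarities between image and text embeddings. The sketch below shows this on precomputed embeddings with NumPy only; it is an illustration, not the evaluation code behind the table. The function names and the `logit_scale` value of 100 are assumptions, and how the table's mean and spread were aggregated is not specified in the post.

```python
import numpy as np

def zero_shot_probabilities(image_embeddings, text_embeddings, logit_scale=100.0):
    """Softmax over scaled cosine similarities; rows are images, columns are prompts."""
    image_embeddings = np.asarray(image_embeddings, dtype=float)
    text_embeddings = np.asarray(text_embeddings, dtype=float)
    # L2-normalise so that the dot product equals the cosine similarity.
    img = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    logits = logit_scale * img @ txt.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def fraction_classified_as(probs, target_index):
    """Fraction of images whose most probable prompt is `target_index`."""
    return float(np.mean(probs.argmax(axis=1) == target_index))
```

Because the softmax runs over whichever prompts are supplied, adding a prompt such as "an empty image" redistributes probability mass across all columns, which is consistent with the drop seen in the second row of the table.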


Having established that CLIP is capable of "seeing" the circle with consistent accuracy, we can proceed with the second step: recording neural responses. For this we register hooks on all model layers, which lets us capture and store the responses to each stimulus for subsequent analysis. In general, hooks are functions that are triggered after a specified event; here they are triggered during the forward pass and store the layer activations (hooks can also be triggered during the backward pass, e.g. to clip the gradients):

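def _register_hook(layer, layer_name, activations):
    """Register a forward hook to record the layer's activations."""
    # Return the handle so the hook can later be removed with handle.remove().
    return layer.register_forward_hook(_create_activation_hook(layer_name, activations))

def _create_activation_hook(layer_name, activations):
    """Define and return a hook for recording the layer's activations."""
    def hook(module, input, output):
        # Some modules (e.g. nn.MultiheadAttention) return a tuple; keep the first element.
        if isinstance(output, tuple):
            output = output[0]
        activations[layer_name] = output.detach()
    return hook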

Note that the "recording" phase contrasts starkly with traditional neuroscience experiments, where single-cell recordings still demand a tremendous amount of effort. This typically involves skilled experimentalists doing meticulous work, which encompasses surgeries, implant placements, management of electrophysiological or calcium imaging devices, and numerous other complex tasks. ANNs provide us with a more efficient pathway: we can access and analyze neural responses directly, bypassing the invasive procedures inherent to biological studies and allowing for a more detailed exploration of the synthetic networks' internal processes.


There are several methods to visualize neuronal responses from a trained neural network, e.g. by visualizing the filters or activation maps of convolutional layers or by adapting the input to maximally excite a given neuron within a specific model layer. Here, I will visualize the neuron activation with respect to the spatial location of each stimulus, i.e. a ratemap of the activity of each neuron shown in the image reference frame. As there are millions of parameters within CLIP, I only visualize responses for the vision transformer (63 million parameters) and focus in particular on the attention layers of specific residual blocks. (More specifically, I use layer visual.transformer.resblocks.[L].attn for each residual block \( L \), with \( L \in \{0, 1, \dots, 11\} \).)
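A ratemap in this sense can be obtained by binning the stimulus centres on a grid and averaging each neuron's activation per bin. The sketch below is one way to do this, not necessarily how the figures in this post were produced; the bin count of 16, the image size of 224 and the function name `compute_ratemaps` are assumptions.

```python
import numpy as np

def compute_ratemaps(centres, activations, image_size=224, n_bins=16):
    """Average each neuron's activation over stimuli, binned by stimulus centre.

    centres:     (n_stimuli, 2) array of (x, y) positions in pixels.
    activations: (n_stimuli, n_neurons) array, e.g. the 'cls' token of one layer.
    Returns ratemaps of shape (n_neurons, n_bins, n_bins) and the per-bin
    stimulus counts of shape (n_bins, n_bins). Bins without stimuli are NaN.
    """
    centres = np.asarray(centres, dtype=float)
    activations = np.asarray(activations, dtype=float)
    edges = np.linspace(0, image_size, n_bins + 1)
    # digitize returns 1..n_bins inside the range; clip handles the right edge.
    col = np.clip(np.digitize(centres[:, 0], edges) - 1, 0, n_bins - 1)
    row = np.clip(np.digitize(centres[:, 1], edges) - 1, 0, n_bins - 1)
    flat = row * n_bins + col
    n_neurons = activations.shape[1]
    sums = np.zeros((n_bins * n_bins, n_neurons))
    np.add.at(sums, flat, activations)  # unbuffered add handles repeated bins
    counts = np.bincount(flat, minlength=n_bins * n_bins).astype(float)
    with np.errstate(invalid="ignore", divide="ignore"):
        means = sums / counts[:, None]
    ratemaps = means.T.reshape(n_neurons, n_bins, n_bins)
    return ratemaps, counts.reshape(n_bins, n_bins)
```

The counts play the role of the occupancy map in a rodent experiment: they say how often each location was "visited" by a stimulus.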

The output of each residual block has a shape of 50x768, where the first dimension is the number of image patches (7x7) plus the 'cls' token and the second is the number of neurons per token. In CLIP the 'cls' token is prepended to the sequence of image patches and processed like any other token so that it gathers global image information; at the end it is extracted as the main representation of the entire image. (To understand the intricacies of the vision transformer it was helpful for me to look more closely at the code, especially where the class embedding is initialized and used.)

Below, I plot the ratemaps of the 'cls' token across all 12 residual layers, highlighting the top 5 neurons for three different scores that are commonly used in neuroscience to quantify spatially selective cells. The spatial information score quantifies how well the firing rate of a cell predicts the animal's location in space, i.e. the amount of spatial information that the neuron's activity conveys. The grid score quantifies the degree to which the firing fields of a neuron form a periodic grid-like pattern, akin to grid cells found in the entorhinal cortex. (I use the grid score here because my initial hypothesis was to find visual grid cells which encode locations within the image reference frame.) The border score measures the tendency of a neuron to fire selectively near the boundaries of the environment, resembling border cells, which signal environmental borders.
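For reference, the spatial information score is commonly defined, following Skaggs et al., as \( I = \sum_i p_i \frac{r_i}{\bar{r}} \log_2 \frac{r_i}{\bar{r}} \), where \( p_i \) is the occupancy probability of bin \( i \), \( r_i \) the mean rate in that bin and \( \bar{r} = \sum_i p_i r_i \) the overall mean rate. The sketch below implements this definition; whether the scores in this post were computed exactly this way is an assumption, as is the shift applied to make activations non-negative (artificial neurons, unlike firing rates, can be negative).

```python
import numpy as np

def spatial_information(ratemap, occupancy):
    """Skaggs-style spatial information of one ratemap, in bits per unit of activity.

    ratemap:   mean activation per spatial bin (NaN for bins without stimuli).
    occupancy: number of stimuli per bin, same shape as ratemap.
    """
    rate = np.asarray(ratemap, dtype=float).ravel()
    occ = np.asarray(occupancy, dtype=float).ravel()
    valid = (occ > 0) & np.isfinite(rate)
    rate, occ = rate[valid], occ[valid]
    if rate.size == 0:
        return 0.0
    # Assumption: shift activations so the minimum is not below zero.
    rate = rate - min(rate.min(), 0.0)
    p = occ / occ.sum()
    mean_rate = np.sum(p * rate)
    if mean_rate <= 0:
        return 0.0
    ratio = rate / mean_rate
    nonzero = ratio > 0  # bins with zero rate contribute nothing to the sum
    return float(np.sum(p[nonzero] * ratio[nonzero] * np.log2(ratio[nonzero])))
```

A spatially uniform neuron scores 0, while a neuron that is active in a single one of four equally sampled bins scores 2 bits.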

Layer Visualization
Neuronal activations across different model layers. For each layer, the ratemaps show the neurons with the highest scores with respect to the spatial information criterion, the grid score and the border score. When toggling the random weights box, all responses are based on a randomly initialized CLIP model, using the initialization scheme from the original paper.

As shown above, spatially selective cells persist even in deeper network layers, which implies that information about the position in screen coordinates is directly encoded in the activity of these neurons. I had originally expected neurons in these deeper layers to be invariant to an object's position in screen coordinates, as seen in hippocampal structures where cells (mostly) fire with respect to world coordinates. A more careful examination of these responses and their distribution across the model layers will be discussed in a future blog post.


Finally, we can assess whether perturbing the layers and the above described spatially selective cells will change the model's performance. For this, I first systematically lesion every layer by setting the weights to zero. In this case we use all layers within each residual block for both the visual and the language transformer, not just the attention layer as in the ratemaps above. Here, I again quantify performance based on the model classifying the input stimuli as "a black circle" or "an empty image".

Performance of model across lesioned layers. Every layer within the model is lesioned by replacing the weights with 0. The performance is then quantified with the lesioned model before restoring the model weights. Note that the information flow is not zeroed out completely, due to the skip connections within the residual blocks.
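The lesion-and-restore procedure, and the reason a lesioned layer does not block all information, can be illustrated with a toy residual layer in NumPy. This is a deliberately simplified stand-in and not the CLIP code: the class `ToyResidualLayer` and the helpers `run` and `evaluate_with_lesion` are hypothetical names introduced for this example.

```python
import numpy as np

class ToyResidualLayer:
    """y = x + W x, a stand-in for one sublayer wrapped in a skip connection."""

    def __init__(self, weight):
        self.weight = np.array(weight, dtype=float)

    def __call__(self, x):
        return x + self.weight @ x

def run(layers, x):
    """Pass the input through all layers in order."""
    for layer in layers:
        x = layer(x)
    return x

def evaluate_with_lesion(layers, lesion_index, x, evaluate):
    """Zero one layer's weights, evaluate the model, then restore the weights."""
    layer = layers[lesion_index]
    saved = layer.weight.copy()
    layer.weight[...] = 0.0
    try:
        return evaluate(run(layers, x))
    finally:
        # Restore even if evaluation fails, so later lesions start from intact weights.
        layer.weight[...] = saved
```

With zeroed weights the toy layer reduces to the identity, so its input still reaches the next layer unchanged through the skip connection.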

As seen in the plot above, performance varies widely between different layers. Lesioning certain layers has a more pronounced effect on the model's ability to classify the stimuli accurately: in particular, when lesioning visual layers around "Resblock 9", the model struggles to detect any circle within the image. Lesioning the first convolutional layer also has a pronounced effect, suggesting that this layer is essential for capturing basic features and patterns from the input, which subsequent layers build upon to make accurate classifications. Conversely, lesioning some layers causes minimal disruption in performance ("Resblock 11"), suggesting that their contributions might be redundant or less critical for this specific task.

To find out whether the neurons that we identified as encoding information about visual space also play a causal role in stimulus classification, we can make targeted lesions instead of lesioning layers indiscriminately. From the layer lesions above, we can infer that the "best lesion" (in the sense that it resulted in the worst performance) occurred at Resblock 9. With a targeted lesion, we aim to achieve the same drop in performance while ablating fewer neurons. For this I implemented individual lesions for every single stimulus. To perform individual lesions, we first obtain ratemaps for each neuron, calculated across all stimuli. For each stimulus in our dataset, we then lesion the neurons which have the highest activity at the central location of that stimulus.

Artificial Lesions in Residual Block 9. The x-axis quantifies the proportion of neurons ablated (n=768 neurons). The y-axis represents the performance of the model in categorizing the stimulus as a circle. \( \textit{Random} \) refers to random neuronal lesions within the layer, \( \textit{Targeted} \) denotes lesions applied to neurons exhibiting heightened activity at the stimulus location, and \( \textit{Targeted}_{\textit{Abs}} \) pertains to lesions targeting neurons with high or low activity at the stimulus location.
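The three lesion conditions in the figure can be sketched as a neuron-selection step followed by an ablation step. The code below is an illustration under several assumptions: that neurons are ablated by zeroing their activations, that the \( \textit{Targeted}_{\textit{Abs}} \) condition ranks neurons by the absolute value of their ratemap at the stimulus location, and that bins without data count as zero. The function names and the image size of 224 are likewise assumptions.

```python
import numpy as np

def select_neurons_to_lesion(ratemaps, centre, fraction, image_size=224,
                             mode="targeted", rng=None):
    """Return the indices of the neurons to ablate for one stimulus.

    ratemaps: (n_neurons, n_bins, n_bins) array, centre: (x, y) in pixels,
    fraction: proportion of neurons to ablate, between 0 and 1.
    """
    n_neurons, n_bins, _ = ratemaps.shape
    n_lesion = int(round(fraction * n_neurons))
    if mode == "random":
        rng = np.random.default_rng() if rng is None else rng
        return rng.choice(n_neurons, size=n_lesion, replace=False)
    # Look up the ratemap bin that contains the stimulus centre.
    col = min(int(centre[0] / image_size * n_bins), n_bins - 1)
    row = min(int(centre[1] / image_size * n_bins), n_bins - 1)
    local = np.nan_to_num(ratemaps[:, row, col], nan=0.0)
    if mode == "targeted_abs":
        local = np.abs(local)
    elif mode != "targeted":
        raise ValueError(f"unknown mode: {mode}")
    order = np.argsort(local)[::-1]  # highest local activity first
    return order[:n_lesion]

def lesion(activations, neuron_indices):
    """Return a copy of the activations with the selected neurons set to zero."""
    out = np.array(activations, dtype=float, copy=True)
    out[..., neuron_indices] = 0.0
    return out
```

In a real model, the ablation would be applied inside a forward hook on the chosen layer, so that downstream layers only ever see the lesioned activations.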

The targeted lesion shows that the spatially selective neurons described above seem to have a causal influence on the performance of the model: for example, when lesioning cells that have high activity in the lower left part of the image, the model is unable to detect stimuli with a circle in the lower left part of the image. Note that the performance deficit is smaller than one might have expected, likely because neurons have multiple fields of high activity instead of a single bump (see the ratemaps for layer 9 above).


In this post I explored the inner representations of one of the most widely used vision models (CLIP) and showed that there are spatially selective cells which are causally linked to the performance of the model. I hope to address some of the open questions in a future post, where I will explore some of the following research questions (your suggestions or additional questions are very welcome):