Deep Learning Gymnastics #3: Tensor (re)Shaping

Welcome to the 3rd episode of the Deep Learning Gymnastics series. By now, you should already be getting in shape. That’s good, because today we’ll talk about how to shape (or, more precisely, reshape) tensors, a basic yet critical operation needed in any sufficiently advanced deep learning model implementation.

To best understand this post, it is highly recommended to read the previous gymnastic exercise, on tensor indexing, as we’ll build on top of it.

Motivating example: MLP

To illustrate the power of tensor (re-)shaping, we’ll continue to draw inspiration from Andrej Karpathy’s makemore series, where he implements from scratch the famous paper “A neural probabilistic language model”. As Andrej says, it is not the first paper that proposed a neural network approach to predict the next token in a sequence, but it is one that is very often cited, and it is a really nice write-up.

The gymnastic exercise will consist of implementing the bottom part of the figure below, which describes the architecture of the neural network (or Multi-Layer Perceptron, MLP for short) defined in the paper. First, we’ll explain the diagram a bit so that the goal of the exercise is crystal clear.

Let’s assume that the 3 green dots at the bottom are the last three characters of a word and that we’re trying to predict (or generate) the next character. The first layer is nothing else than the embeddings of each of the three characters. It turns out this is exactly the output of the example we introduced in our previous gymnastic exercise around tensor indexing: we ended up with a tensor of shape (8,3,4), the one on the right in the figure below. As a reminder, an embedding here is simply a one-dimensional tensor (of size 4 in our case).

So in our example, the first layer of the neural net is nothing else than the 3 embeddings of each character, as seen below:

So the first example of the batch is associated with those three embeddings:
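
In code, it is simply the first line of the embedded batch that we’ll reconstruct below (a sketch; the actual values depend on the random seed):

print(embedded_batch[0]) # the 3 embeddings (of size 4 each) of the first example, shape (3,4)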

Now, in order to pass this to the next layer (the hidden layer of the MLP), we need to concatenate those three embeddings of size 4 each into a single long one of size 12.

So here is the gymnastic exercise: take our (8,3,4) tensor, and for each of the 8 lines of the batch, transform the 3 embeddings of size 4 into one of size 12 (which is just the concatenation of the 3). We should thus end up with a tensor of shape (8,12).

The basics of PyTorch Views

Let’s introduce the concept that will allow us to solve the gymnastic exercise with ease: PyTorch views. The easiest way to understand PyTorch views is through a simple example.

Let’s create a one-dimensional tensor with elements from 0 to 17.
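
Something like this (a minimal sketch; the variable name t is our choice):

import torch

t = torch.arange(18) # tensor with elements 0..17
print(t)
# tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17])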

The exact same underlying storage can be viewed as a (2,9) tensor:
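
print(t.view(2, 9))
# tensor([[ 0,  1,  2,  3,  4,  5,  6,  7,  8],
#         [ 9, 10, 11, 12, 13, 14, 15, 16, 17]])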

Or as a (9,2) one:
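
print(t.view(9, 2))
# tensor([[ 0,  1],
#         [ 2,  3],
#         [ 4,  5],
#         [ 6,  7],
#         [ 8,  9],
#         [10, 11],
#         [12, 13],
#         [14, 15],
#         [16, 17]])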

Or a (3,2,3) one:
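
print(t.view(3, 2, 3))
# tensor([[[ 0,  1,  2],
#          [ 3,  4,  5]],
#
#         [[ 6,  7,  8],
#          [ 9, 10, 11]],
#
#         [[12, 13, 14],
#          [15, 16, 17]]])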

As you can see, as long as the product of the dimensions equals the number of elements in the underlying storage (18 in our case), we can view (or reshape) the tensor however we like.

Beyond being very convenient, the big advantage of this is that it is blazing fast: no new tensor is created, the underlying storage stays the same, and only some metadata about the tensor (its shape and strides) is modified.
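
We can even verify that the storage is shared; data_ptr returns the memory address of the first element of a tensor’s underlying storage:

print(t.data_ptr() == t.view(2, 9).data_ptr()) # True: same underlying storage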

Bonus: we can also use -1 to infer a dimension automatically. E.g., if the underlying storage holds 18 numbers, then invoking the view function with shape (-1,9) will deduce that the first dimension has to be 2:
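
print(t.view(-1, 9).shape)
# torch.Size([2, 9])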

Solving our gymnastic exercise with views

Now that we understand views, let’s get back to our gymnastic exercise: we have a tensor of shape (8,3,4) and we need to transform it into a tensor of shape (8,12). First, let’s reproduce the embedded batch of shape (8,3,4) (see our previous gymnastic exercise to understand the code below):

import torch
torch.manual_seed(18)

# Create a random batch of shape (8,3)
# with indices between 0 and 25
random_tensor = torch.randint(low=0, high=26, size=(8,3))

# Create a random embedding matrix of shape (27,4):
# one embedding for each of the 27 possible index values
embeddings = torch.randn(size=(27, 4))

# Create the embedded batch of shape (8,3,4)
embedded_batch = embeddings[random_tensor]

Get ready, and let’s solve our exercise. As in the last post, it will be a short yet sharp (tensor) movement:

input_layer = embedded_batch.view(8,12)

Yes, that’s it, just one line. With it, each of the 8 lines of the batch takes its 3 associated embeddings of size 4 and concatenates them together, extremely efficiently and in parallel, thus ending up with a tensor of shape (8,12).

Let’s actually validate it on the first example of the batch:
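
A minimal check (the exact values depend on the random seed):

print(embedded_batch[0]) # the 3 embeddings of size 4, shape (3,4)
print(input_layer[0])    # their concatenation, shape (12)
print(input_layer.shape) # torch.Size([8, 12])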

We obtain an embedding of size 12 as expected, which is nothing else than the concatenation of the 3 embeddings of size 4 that we showed at the end of our motivating example above. Baam.

Oh, and let’s not forget that we created this to pass it as input to a layer of a neural net. So let’s do it: we create the initial random weights and biases of the layer, pass our (reshaped) batch into it, and apply tanh on top of it. In other words:

W1 = torch.randn((12, 100)) # weights
b1 = torch.randn(100) # biases
h = torch.tanh(embedded_batch.view(-1, 12) @ W1 + b1) # (8,12) @ (12,100) => (8,100)

PyTorch view vs. reshape?

There is another function in PyTorch called reshape that seems to achieve the exact same goal as view. So what’s the difference?

Typically, view is extremely efficient as it won’t move any underlying data; it just modifies the shape metadata of the tensor. But it comes with a constraint: the underlying data has to be contiguous, otherwise calling view will raise an error (see example below).

If you’re not sure whether your tensor is contiguous, you can either call the contiguous function before calling view (it will make the tensor contiguous, copying the data if needed), or simply use reshape, which returns a view if the shapes are compatible, and copies otherwise.

You might ask why anyone would use view over reshape. I asked myself the same question, and I assume that, since view is guaranteed to be efficient, seeing it in the code gives any reader the guarantee that there is nothing to optimize there. As for the one writing the code, if there is a case where an inefficient copy would occur, view will at least fail explicitly and make you aware of the potential efficiency bottleneck.

Below is an example illustrating where view doesn’t work:

import torch

# Create a non-contiguous tensor
tensor = torch.tensor([[1, 2, 3], [4, 5, 6]]).t()  # Transpose to make it non-contiguous

# Reshape works successfully
reshaped_tensor = tensor.reshape(6)
print(reshaped_tensor)  # Output: tensor([1, 4, 2, 5, 3, 6])

# View fails with an error
try:
    viewed_tensor = tensor.view(6)
except RuntimeError as e:
    print(e)  # Output: RuntimeError: view size is not compatible with input tensor's size and stride
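
And, as mentioned above, making the tensor contiguous first (at the cost of a copy) makes view work:

# Calling contiguous first allows view to succeed
contiguous_view = tensor.contiguous().view(6)
print(contiguous_view)  # Output: tensor([1, 4, 2, 5, 3, 6])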

TensorFlow reshape

Obviously, TensorFlow also supports the same powerful reshape operation. TensorFlow doesn’t expose an explicit view function, but its reshape handles non-contiguous tensors gracefully, similar to PyTorch’s reshape.

Below is the full TensorFlow code equivalent to what we illustrated above in PyTorch.

import tensorflow as tf
tf.random.set_seed(18)

# Create a random batch of shape (8,3) with indices between 0 and 25
random_tensor = tf.random.uniform(shape=(8,3), minval=0, maxval=26, dtype=tf.int32)

# Create a random embedding matrix of shape (27,4): one embedding for each of the 27 possible index values
embeddings = tf.random.uniform((27,4), dtype=tf.float32)

# Create the embedded batch with the tf.gather function (see previous exercise)
embedded_batch = tf.gather(embeddings, random_tensor)

# Validating the results
print(random_tensor)
print(embeddings)
print(embedded_batch.shape) # (8,3,4) which is the expected dimension
print(embedded_batch[0,0])

# Solving the gymnastic exercise with tf.reshape
W1 = tf.random.normal([12, 100]) # weights
b1 = tf.random.normal([100]) # biases
h = tf.math.tanh(tf.linalg.matmul(tf.reshape(embedded_batch, [8, 12]), W1) + b1) # (8,12) @ (12,100) => (8,100)

Another example of usage: CNNs

Reshaping is a very useful operation in many deep learning contexts. Another frequent example is image manipulation in convolutional neural networks (CNNs), where you need, for instance, to connect the output of a convolutional layer to a fully connected layer:

import torch

# An output from a convolutional layer
conv_output = torch.randn(10, 8, 5, 5)  # (batch size, channels, height, width)

# Flatten for a fully connected layer
flattened = conv_output.view(-1, 8 * 5 * 5)  # (batch size, flattened features)

print(flattened.shape)  # Output: torch.Size([10, 200])

Alright, that’s it for today. Hope you’re now in better shape, and see you next time for other gymnastic exercises 🤸.

References

  • Part 2 of the amazing makemore series by Andrej Karpathy (which inspired this post).
  • Great blog post on the internal representation of tensors, and a very cool stride visualizer (both from a PyTorch research engineer, so they are about PyTorch 🙂 but the concepts are generally useful).
