How do we do this?

What does a CNN do?

At a high level, in order for a network to perform image classification (which this network has been trained to do), it must understand the image.

This requires taking the raw image as input pixels and building an internal representation that converts the raw image pixels into a complex understanding of the features present within the image.

Starting from the network's input layer, the first few layer activations represent low-level features like edges and textures.

As you step through the network, the final few layers represent higher-level features—object parts like wheels or eyes.

Since the CNNs try to understand the images, they are able to generalize well: they’re able to capture the invariances and defining features within classes (e.g. cats vs. dogs) that are agnostic to background noise and other nuisances.

Thus, somewhere between where the raw image is fed into the model and the output classification label, the model serves as a complex feature extractor.

What are we going to do?

By accessing intermediate layers of the model, we will able to describe the content and style of input images.

Thus, we could use the intermediate layers of the model to get the content and style representations of the image. We will be using the intermediate layers of the VGG19 network architecture, a pre-trained image classification network.

enter image description here

We follow the below steps to define the content and style representations:

Using the pre-trained VGG19 weights, we shall input the content image to the network, to get the target content representations from the content image.
Similarly, we shall input the style image to the network, to get the target gram matrices(we would discuss gram matrices later) which represent the style from the style image.
Having gotten the target content and style representations, we shall now pass the copy of the content image(let us call it the generated image) to get it's content and style representations.
We shall then calculate the distance of the target content and style representations from those of the generated image, to calculate the corresponding content and style errors.
We shall then define the total loss, having the content, and style losses as its components.
We shall use gradient descent to update the image pixels of the generated image, such that the total loss is minimized.

Thus, for an input image, we try to match the corresponding style and content target representations at the corresponding intermediate layers.

Project - Introduction to Neural Style Transfer using Deep Learning & TensorFlow 2 (Art Generation Project)

How do we do this?

XP

Loading comments...