Chapter 6 - Neural Network Architectures

Questions and Answers List

How does convolution work?
• Repeated application of a filter (kernel) on a sliding window over the input
• The stride indicates how much the kernel shifts at each step (a stride of 1 moves it one pixel at a time)
• Kernels can have various shapes, for example 3x3x1
• The same kernel is applied across the whole image, so its weights are shared
• The values inside each kernel are learned with backpropagation
By computing the dot product between the input and the filter at each position, we say that the filter is convolved with the input.
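A minimal sketch of this sliding-window computation in NumPy, assuming a single-channel 5x5 input, a 3x3 kernel and a stride of 1 (function and variable names are illustrative only):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Dot product between the kernel and the current sliding window
            window = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0          # a simple averaging filter; in a CNN these values are learned
print(conv2d(image, kernel, stride=1))  # 3x3 feature map
```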
Activation function and pooling
• An activation function (ReLU) is applied to the feature map
• A (max) pooling layer groups together a selection of pixels and keeps only the maximum
• This identifies the area of the image where the filter found the best match
• It makes the network robust against small shifts in the image
• It also reduces the size of the input while keeping the important information
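A minimal sketch of ReLU followed by 2x2 max pooling on a small feature map (NumPy, with illustrative values):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def max_pool2d(feature_map, size=2):
    h, w = feature_map.shape
    out = feature_map[:h - h % size, :w - w % size]   # drop rows/cols that do not fit a full block
    out = out.reshape(h // size, size, w // size, size)
    return out.max(axis=(1, 3))                       # maximum over each size x size block

fmap = np.array([[ 1., -2.,  3.,  0.],
                 [-1.,  5., -3.,  2.],
                 [ 0.,  1., -1.,  4.],
                 [ 2., -2.,  0., -5.]])
print(max_pool2d(relu(fmap)))   # 2x2 output keeping the strongest activation per region
```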
Generative Adversarial Networks (GAN) - Introduction
• CNNs are well suited for image classification, but is it possible to exploit them to create an image generator?
• GANs are implemented as a min-max two-player game between two systems: a Generator G and a Discriminator D
• In adversarial training, the goal of the generator is to create images that seem as real as possible, while the goal of the discriminator is to detect which images are real and which are generated
• They are trained in alternating cycles (see the sketch below). At the end, the discriminator can be thrown away and we are left with a generator that can create seemingly real images from random noise.
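A minimal sketch of the min-max training cycle in PyTorch, using toy 2-D data in place of images; the network sizes, learning rates and data distribution are illustrative assumptions, not the architecture from the lecture:

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 8, 2
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(64, data_dim) * 0.5 + 3.0      # stand-in for a batch of real images
    noise = torch.randn(64, latent_dim)

    # Discriminator step: push real samples towards label 1, generated samples towards 0
    fake = G(noise).detach()                          # do not backprop into G here
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator step: try to make D output 1 on generated samples
    g_loss = bce(D(G(noise)), torch.ones(64, 1))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()

# After training, D can be thrown away; G maps random noise to samples that look "real"
samples = G(torch.randn(5, latent_dim))
```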
GAN - Why it worked?
• It exploits the loss function of the discriminator to train the generator
• The generator never interacts directly with the dataset, only through the discriminator; this minimizes cases where the generator "cheats" by simply copying images from the dataset
• This technique can be used with any machine learning algorithm, and not only for image generation (cheap chatbots)
Transformer - Architecture
• Transformers are the basic architecture used in NLP (chatbots, translators)
• Contrary to LSTMs, they do not work sequentially, which allows high parallelization
• They need positional encoding to specify the position of each word
• They use multiple attention layers to keep track of important information across sentences
• The output sentence is produced word by word: the output of the network is a probability distribution across all words in the dictionary, used to predict the next word in the sentence
• The process stops only when <EOS> is predicted (see the decoding sketch below)
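A minimal sketch of the word-by-word decoding loop; `toy_next_word_distribution` is a stand-in for the real transformer decoder (an assumption), but the stopping condition on <EOS> is the one described above:

```python
import numpy as np

vocab = ["<BOS>", "the", "cat", "sat", "<EOS>"]

def toy_next_word_distribution(prefix):
    # A real transformer would attend over `prefix`; here we simply walk through the vocabulary
    probs = np.zeros(len(vocab))
    probs[min(len(prefix), len(vocab) - 1)] = 1.0
    return probs

sentence = ["<BOS>"]
while sentence[-1] != "<EOS>" and len(sentence) < 20:
    distribution = toy_next_word_distribution(sentence)    # one probability per word in the dictionary
    sentence.append(vocab[int(np.argmax(distribution))])   # greedy choice of the next word
print(" ".join(sentence))    # <BOS> the cat sat <EOS>
```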
Transformer - Vector embeddings
• Key idea: similar words should have similar representation vectors
• Words (or images) are mapped into a multi-dimensional space called the embedding space, or latent space
• In this space, similar concepts are close to each other
For example, the words "cat" and "dog" are often used in similar contexts, therefore they are close in the latent space. The word "car", while very similar in spelling to the word "cat", is used in different contexts, and is therefore far away in the latent space.
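A minimal sketch of "close in the latent space" using hand-made 3-D embeddings and cosine similarity (the vectors are illustrative assumptions, not learned values):

```python
import numpy as np

embeddings = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.85, 0.75, 0.2]),
    "car": np.array([0.1, 0.2, 0.95]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # high: used in similar contexts
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # low: similar spelling, different contexts
```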
Transformer - Positional encoding
Key idea: add a new vector containing positional information to the current vectors.
Checklist:
1) Unique encoding for each time-step
2) Consistent distance between any two time-steps
3) Should generalize to longer sentences
4) Deterministic
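A minimal sketch of the sinusoidal positional encoding commonly used in transformers, which satisfies the checklist above (deterministic, unique per time-step, consistent distances between positions):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    positions = np.arange(max_len)[:, None]                        # (max_len, 1)
    dims = np.arange(d_model)[None, :]                             # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                          # sine on even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                          # cosine on odd dimensions
    return pe

# The encoding is simply added to the word-embedding vectors
print(positional_encoding(max_len=50, d_model=16).shape)           # (50, 16)
```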
Transformer - Single-head attention
Key idea: calculate attention using different copies (projections) of the same input, as shown in the sketch below.
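A minimal sketch of single-head (scaled dot-product) attention: queries, keys and values are three projections of the same input sequence; the projection matrices here are random stand-ins:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # how much each word attends to every other word
    weights = softmax(scores, axis=-1)
    return weights @ V                          # weighted sum of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                     # 4 words, embedding size 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(single_head_attention(X, W_q, W_k, W_v).shape)   # (4, 8)
```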
Query, Key, Value - Intuition
• The query indicates what you are interested in (e.g. the name of a person)
• Keys are pointers to the different concepts (name, height, age)
• The value is the actual concept (e.g. the name itself)
• Based on the query, the keys select the values that are most related to the query (see the toy lookup below)
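A toy soft lookup illustrating this intuition: the query is scored against every key, and the value whose key matches best is selected (the vectors and names are made up for illustration):

```python
import numpy as np

keys = {"name": np.array([1.0, 0.0, 0.0]),
        "height": np.array([0.0, 1.0, 0.0]),
        "age": np.array([0.0, 0.0, 1.0])}
values = {"name": "Alice", "height": "170 cm", "age": "30"}

query = np.array([0.9, 0.1, 0.0])                      # "what is the person's name?"
scores = {k: float(query @ v) for k, v in keys.items()}
weights = np.exp(list(scores.values())); weights /= weights.sum()   # softmax over the key scores
best = max(scores, key=scores.get)
print(dict(zip(scores, weights)), "->", values[best])  # the "name" key wins, so its value is selected
```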
Transformer - Multi-head attention
1) Concatenate all the attention heads
2) Multiply by a weight matrix W^o that was trained jointly with the model
3) The result is the Z matrix, which captures information from all the attention heads; we can send it forward to the FFNN
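A minimal sketch of this combination step: concatenate the per-head outputs and multiply by W^o (the shapes and matrices here are illustrative stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
num_heads, seq_len, d_head, d_model = 8, 4, 8, 64
heads = [rng.normal(size=(seq_len, d_head)) for _ in range(num_heads)]   # one Z_i per attention head

Z_concat = np.concatenate(heads, axis=-1)        # (seq_len, num_heads * d_head) = (4, 64)
W_o = rng.normal(size=(num_heads * d_head, d_model))
Z = Z_concat @ W_o                               # (4, 64): sent forward to the FFNN
print(Z.shape)
```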
Transformer - Add & Norm
• Add: the output of the previous layer (the attention block) is added to its input embedding (from the first step), i.e. a residual connection
• Norm: layer normalization is then applied to the sum
• Benefits: faster training, reduced bias, prevents weight explosion
• Types of normalization: batch and layer normalization; layer normalization is preferable for transformers, especially for natural language processing tasks
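A minimal sketch of Add & Norm in NumPy: residual addition followed by per-token layer normalization (the learned gain and bias are omitted for simplicity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)              # normalize each token across its features

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))                     # input embeddings of 4 tokens
attention_out = rng.normal(size=(4, 16))         # stand-in for the attention block's output
out = layer_norm(x + attention_out)              # Add (residual) & Norm
print(out.shape)                                 # (4, 16)
```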
Transformer - Feed forward
A standard feed-forward fully connected NN with two layers. This helps to:
• Format the input for the decoder layer
• Capture more abstract concepts from the input data through the non-linear layers and pass them forward
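A minimal sketch of the two-layer position-wise feed-forward block, applied to every token independently (the weights here are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def feed_forward(x):
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2   # Linear -> ReLU -> Linear

tokens = rng.normal(size=(4, d_model))              # output of the Add & Norm step
print(feed_forward(tokens).shape)                   # (4, 16)
```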
Conclusion
• CNNs revolutionized the image classification field thanks to smart architecture choices that allowed them to exploit the spatial relations of images
• The success of GANs as image generators builds upon the success of CNNs: this architecture made it possible to train a generator by exploiting the loss function of the (CNN) discriminator
• Transformers use complex architectural tricks in order to encode the sentence structure while using a non-sequential algorithm (at least during training)