Latent Space and Contrastive Learning
Latent Space
- Definition: a representation of compressed data
  - Data compression: the process of encoding information using fewer bits than the original representation
- Ekin Tiu has a Medium article about why it is called latent “space”
- Tasks where latent space is necessary:
  - Representation learning:
    - Definition: a set of techniques that allow a system to discover the representations needed for feature detection or classification from raw data
    - The latent space representation of our data must contain all the important info (features) needed to represent our original data input
  - Manifold learning (a subfield of representation learning):
    - Definition: groups or subsets of data that are “similar” in some way in the latent space, which does not quite show in the higher-dimensional space
    - Manifolds just mean groups of similar data (see the sketch below)
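To make “groups of similar data in a lower-dimensional space” concrete, here is a minimal scikit-learn sketch; the digits dataset and Isomap are assumptions chosen for illustration, not something from these notes:

```python
# Minimal manifold-learning sketch: embed 64-dimensional digit images into a
# 2-D latent space and check that similar samples land near each other.
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

X, y = load_digits(return_X_y=True)            # raw data: 64-D pixel vectors
Z = Isomap(n_components=2).fit_transform(X)    # 2-D latent coordinates

# Images of the same digit tend to form a group ("manifold") in Z,
# even though that grouping is hard to see in the original 64-D space.
print(Z.shape)         # (1797, 2)
print(Z[y == 0][:3])   # a few latent points for the digit "0"
```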
Autoencoders and Generative Models
- Autoencoders: a neural network that acts as an identity function and has both an encoder and a decoder
  - We need the model to compress the representation (encode) in a way that we can accurately reconstruct it (decode)
    - i.e. image in, image out; audio in, audio out (see the sketch below)
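A minimal PyTorch sketch of that idea; the layer sizes, latent dimension, and the random stand-in batch are assumptions for illustration:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Identity-like network: compress to a small latent vector, then reconstruct."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),   # compressed (latent) representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),    # reconstruction of the original input
        )

    def forward(self, x):
        z = self.encoder(x)               # point in latent space
        return self.decoder(z), z

# Train so the output matches the input (image in, image out).
model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                   # stand-in batch of flattened images
recon, z = model(x)
loss = nn.functional.mse_loss(recon, x)   # reconstruction loss
loss.backward()
optimizer.step()
```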
- Generative models: interpolate in latent space to generate a “new” image
  - Interpolate: estimate the dependent variable at values of the independent variable that fall in between known points
  - Example: if chair images have 2D latent space vectors [0.4, 0.5] and [0.45, 0.45], whereas a table has [0.6, 0.75], then to generate a picture that is a morph between a chair and a table, we would sample points in latent space between the chair cluster and the table cluster (see the sketch below)
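A minimal NumPy sketch of that interpolation, using the example vectors above; the `decoder` mentioned in the comment is a hypothetical trained decoder, not defined here:

```python
import numpy as np

z_chair = np.array([0.4, 0.5])     # latent vector of a chair image
z_table = np.array([0.6, 0.75])    # latent vector of a table image

# Sample points on the straight line between the two latent vectors.
# Decoding each one would yield a gradual chair-to-table morph.
for t in np.linspace(0.0, 1.0, num=5):
    z_new = (1 - t) * z_chair + t * z_table   # linear interpolation
    print(round(t, 2), z_new)
    # image = decoder(z_new)   # hypothetical trained decoder
```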
- Difference between discriminative and generative models:
  - Generative models can generate new data instances; they capture the joint probability p(X, Y), or p(X) if Y does not exist (see the sketch below)
  - Discriminative models classify instances into different labels; they capture p(Y | X) -> given the image, how likely is it a cat?
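A minimal scikit-learn sketch of the difference; the toy 2-D data is an assumption for illustration, and the `theta_` / `var_` attribute names assume a recent scikit-learn version:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Toy data (illustrative assumption): 2-D features, labels 0 = "cat", 1 = "dog".
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(3.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Generative: Gaussian naive Bayes fits p(X | Y) and p(Y), i.e. the joint p(X, Y).
gen = GaussianNB().fit(X, y)
# Because it models p(X | Y), we can sample a "new" feature vector for class 0
# from the fitted per-class Gaussians (theta_ = means, var_ = variances).
x_new = rng.normal(gen.theta_[0], np.sqrt(gen.var_[0]))
print("sampled features for class 0:", x_new)

# Discriminative: logistic regression only models p(Y | X) -- it can classify,
# but it has no model of the inputs themselves, so it cannot generate new data.
disc = LogisticRegression().fit(X, y)
print("p(Y | X) for one input:", disc.predict_proba(X[:1]))
```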
Contrastive Learning with SimCLRv2
- Definition: a technique that learns general features of a dataset without labels by teaching the model which data points are similar or different
  - Happens before classification or segmentation
  - A type of self-supervised learning; the other type is non-contrastive learning
  - Can significantly improve model performance even when only a fraction of the dataset is labeled
- Process (see the sketch below):
  - Data augmentation: create two augmented views of each image using two augmentation combos (i.e. crop + resize + recolor, etc.)
  - Encoding: feed the two augmented images into a deep learning model to create vector representations
    - Goal is to train the model to output similar representations for similar images
  - Minimize loss: maximize the similarity of the two vector representations by minimizing a contrastive loss function
    - Goal is to quantify the similarity of the two vector representations, then maximize the probability that the two vector representations are similar
    - We use cosine similarity as an example way to quantify similarity: the angle between the two vectors in space; the closer they are, the bigger the similarity score
    - Next, compute the probability with a softmax over the similarity scores of all pairs in the batch
    - Last, we apply -log() to make it a loss function, so that minimizing this value corresponds to maximizing the probability that the two views of a pair are similar
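A minimal PyTorch sketch of those three loss steps for one batch. This follows the NT-Xent loss used in SimCLR; the batch size, embedding dimension, temperature, and the random tensors standing in for encoder outputs are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """Contrastive loss for two batches of embeddings, where z1[i] and z2[i]
    come from two augmented views of the same image."""
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # unit vectors, so dot product = cosine similarity
    sim = z @ z.T / temperature                          # pairwise cosine similarities, temperature-scaled
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))  # a view is never its own positive

    # For row i, the positive is the other augmented view of the same image.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])

    # cross_entropy = softmax over each row of similarities, then -log(prob of the positive).
    return F.cross_entropy(sim, targets)

# Illustrative usage with random tensors standing in for encoder outputs.
z1 = torch.randn(8, 128, requires_grad=True)   # encoded views from augmentation combo 1
z2 = torch.randn(8, 128, requires_grad=True)   # encoded views from augmentation combo 2
loss = nt_xent_loss(z1, z2)
loss.backward()        # in real training, gradients flow back into the encoder
print(loss.item())
```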