Week 2 — Dimensionality Reduction
Dimensionality Effect on Performance
A note about neural networks
- Yes, neural networks will perform a kind of automatic feature selection
- However, that's not as efficient as a well-designed dataset and model
- Much of the model can be largely "shut off" to ignore unwanted features
- Even unused parts of the model consume space and compute resources
- Unwanted features can still introduce unwanted noise
- Each feature requires infrastructure to collect, store, and manage.
# WORD EMBEDDING EXAMPLE
import tensorflow as tf
from tensorflow import keras
import numpy as np
from keras.datasets import reuters
from keras.preprocessing import sequence
NUM_WORDS = 1_000  # Keep only the 1,000 most frequent words; rarer words are treated as unknown
(reuters_train_x, reuters_train_y), (reuters_test_x, reuters_test_y) = \
    reuters.load_data(num_words=NUM_WORDS)
n_labels = np.unique(reuters_train_y).shape[0]  # 46 topic classes
# Further preprocessing
from tensorflow.keras.utils import to_categorical
reuters_train_y = to_categorical(reuters_train_y, 46)
reuters_test_y = to_categorical(reuters_test_y, 46)
reuters_train_x = sequence.pad_sequences(reuters_train_x, maxlen=20)
reuters_test_x = sequence.pad_sequences(reuters_test_x, maxlen=20)
# Using all dimensions
from tensorflow.keras import layers
model = keras.Sequential([
# The embedding is projected to 1000 dimensions here (2nd parameter)
layers.Embedding(NUM_WORDS, 1000, input_length=20),
layers.Flatten(),
layers.Dense(256),
layers.Dropout(0.25),
layers.Activation('relu'),
layers.Dense(46),
layers.Activation('softmax')
])
# Model compilation and training
model.compile(loss='categorical_crossentropy',
optimizer='rmsprop',
metrics=['accuracy'])
model_1 = model.fit(reuters_train_x, reuters_train_y,
                    validation_data=(reuters_test_x,
                                     reuters_test_y),
                    batch_size=128, epochs=20, verbose=0)
tion="This model is overfitting as the model clearly perform poorly on validation set, and a high loss for validation set compared to training loss." >}}
Word embeddings: 6 dimensions instead of 1000
model = keras.Sequential([
# The embedding is projected to only 6 dimensions here (2nd parameter)
layers.Embedding(NUM_WORDS, 6, input_length=20),
layers.Flatten(),
layers.Dense(256),
layers.Dropout(0.25),
layers.Activation('relu'),
layers.Dense(46),
layers.Activation('softmax')
])
tion="Although there is some overfitting, The newer model seems to perform better" >}}
Curse of Dimensionality
Many ML methods, such as KNN, SVM, and recommendation systems, rely on distance measures, most commonly the Euclidean distance.
Why is high-dimensional data a problem?
- More dimensions $\rightarrow$ more features
- Risk of overfitting our models
- Distances grow more and more alike, vectors might appear equidistant from all others.
- No clear distinction between clustered objects
- Concentration phenomenon for Euclidean distance
- Distribution of norms (distance between vectors) in a given distribution of points tends to concentrate.
- Adding dimensions increases feature space volume
- Solutions take longer to converge, might get stuck in local optima
- Runtime and system memory requirements
Why are more features bad?
- Redundant/irrelevant features
- More noise added than signal
- Hard to interpret and visualize
- Hard to store and process data
Curse of dimensionality in the distance function
$$ d_{i,j} = \sqrt{\sum_{k=1}^n(x_{ik} - x_{jk})^2} \tag{Euclidean distance} $$
- New dimensions add non-negative terms to the sum
- Distance increases with the number of dimensions
- For a given number of examples, the feature space becomes increasingly sparse
- The size of the feature space grows exponentially as the number of features increases, making it much harder to generalize efficiently.
- The variance increases, features might even be correlated and thus there are higher chances of overfitting to noise.
- The challenge is to keep as much of the predictive information as possible using as few features as possible; the distance-concentration effect is illustrated in the sketch below.
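To see the concentration phenomenon concretely, the following minimal sketch (random uniform data; the point and dimension counts are arbitrary choices for illustration) compares the relative spread of pairwise Euclidean distances as dimensionality grows:
import numpy as np
from scipy.spatial.distance import pdist
rng = np.random.default_rng(0)
n_points = 500
for n_dims in [2, 10, 100, 1000]:
    X = rng.uniform(size=(n_points, n_dims))
    d = pdist(X)  # all pairwise Euclidean distances
    # As dimensionality grows, the relative spread of distances shrinks,
    # so every point starts to look roughly equidistant from the rest.
    print(n_dims, 'dims, relative spread:', (d.max() - d.min()) / d.mean())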
The Hughes effect
The more features there are, the larger the hypothesis space.
Curse of Dimensionality: an Example
- More features aren't better if they don't add predictive information
- Number of training instances needed increases exponentially with each added feature
What do ML models need?
- No hard and fast rule on how many features are required
- Number of features to be used vary depending on the amount of training data available, the variance in that data, the complexity of the decision surface, and the type of classifier that is used. It can also depend on which features actually contain predictive information.
- Prefer uncorrelated features that contain predictive information, to produce correct results
Manual Dimensionality Reduction
Increasing predictive performance
- Features must have information to produce correct results
- Derive features from the inherent features
- Extract and recombine them to create new features
Why reduce dimensionality?
Dimensionality reduction looks for patterns in the data and re-expresses it in a lower-dimensional form.
- Reduce multicollinearity by removing redundant features (a manual version of this is sketched below).
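As a simple manual starting point, one can drop a feature from every highly correlated pair. The helper below is only a sketch: it assumes the features live in a pandas DataFrame, and the 0.95 threshold is an arbitrary choice.
import numpy as np
import pandas as pd
def drop_correlated_features(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    # Absolute pairwise Pearson correlations between features
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # Drop every feature that is highly correlated with an earlier one
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)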
Feature Engineering
Manual Dimensionality Reduction: Case Study with Taxi Fare dataset
CSV_COLUMNS = [
'fare_amount',
'pickup_datetime', 'pickup_longitude', 'pickup_latitude',
'dropoff_longitude', 'dropoff_latitude',
'passenger_count', 'key'
]
LABEL_COLUMN = 'fare_amount'
STRING_COLS = ['pickup_datetime']
NUMERIC_COLS = ['pickup_longitude', 'pickup_latitude',
'dropoff_longitude', 'dropoff_latitude',
'passenger_count']
DEFAULTS = [[0.0], ['na'], [0.0], [0.0], [0.0], [0.0], [0.0], ['na']]
DAYS = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
# Build a baseline model using raw features
from tensorflow.keras import layers
from tensorflow.keras.metrics import RootMeanSquaredError as RMSE
dnn_inputs = layers.DenseFeatures(feature_columns.values())(inputs)
h1 = layers.Dense(32, activation='relu', name='h1')(dnn_inputs)
h2 = layers.Dense(8, activation='relu', name='h2')(h1)
output = layers.Dense(1, activation='linear', name='fare')(h2)
model = keras.models.Model(inputs, output)
model.compile(optimizer='adam', loss='mse',
metrics=[RMSE(name='rmse'), 'mse'])
Increasing model performance with Feature Engineering
- Carefully craft features for the data types
- Temporal (pickup date & time)
- Geographical (latitude and longitude)
Handling temporal features
def parse_datetime(s):
if type(s) is not str:
s = s.numpy().decode('utf-8')
return datetime.datetime.strptime(s, "%Y-%m-%d %H:%M:%S %Z")
def get_dayofweek(s):
ts = parse_datetime(s)
return DAYS[ts.weekday()]
@tf.function
def dayofweek(ts_in):
return tf.map_fn(
lambda s: tf.py_function(get_dayofweek, inp=[s],
Tout=tf.string),
ts_in)
Geographical features
def euclidean(params):
    lon1, lat1, lon2, lat2 = params
    londiff = lon2 - lon1
    latdiff = lat2 - lat1
    return tf.sqrt(londiff * londiff + latdiff * latdiff)
Scaling latitude and longitude
def scale_longitude(lon_column):
    return (lon_column + 78)/8.  # Min: -78 | Max: -70
def scale_latitude(lat_column):
    return (lat_column - 37)/8.  # Min: 37 | Max: 45
Preparing the transformations
def transform(inputs, numeric_cols, string_cols, nbuckets):
...
feature_columns = {
colname: tf.feature_column.numeric_column(colname)
for colname in numeric_cols
}
for lon_col in ['pickup_longitude', 'dropoff_longitude']:
transformed[lon_col] = layers.Lambda(scale_longitude,
    ...)(inputs[lon_col])
for lat_col in ['pickup_latitude', 'dropoff_latitude']:
transformed[lat_col] = layers.Lambda(scale_latitude,
...)(inputs[lat_col])
Bucketizing and feature crossing
Unless the specific geometry of the earth is relevant to your data, a bucketized version of the map is likely to be more useful than the raw inputs.
def transform(inputs, numeric_cols, string_cols, nbuckets):
...
latbuckets = np.linspace(0, 1, nbuckets).tolist()
lonbuckets = ...
b_plat = fc.bucketized_column(
    feature_columns['pickup_latitude'], latbuckets)
b_dlat = ...  # Bucketize 'dropoff_latitude'
b_plon = ...  # Bucketize 'pickup_longitude'
b_dlon = ...  # Bucketize 'dropoff_longitude'
ploc = fc.crossed_column([b_plat, b_plon], nbuckets * nbuckets)
dloc = ...  # Feature cross 'b_dlat' and 'b_dlon'
pd_pair = fc.crossed_column([ploc, dloc], nbuckets ** 4)
feature_columns['pickup_and_dropoff'] = fc.embedding_column(pd_pair, 100)
Algorithmic Dimensionality Reduction
Linear dimensionality reduction
- Linearly project $n$-dimensional data onto a $k$-dimensional subspace ($k < n$, often $k \ll n$)
- There are infinitely many $k$-dimensional subspaces we can project the data onto
- Which one should we choose?
Projecting onto a line
- Let's think of features as vectors existing in a high-dimensional space.
- High-dimensional vectors are difficult to visualize, but projecting them onto a lower dimension allows us to visualize the data more easily. This is referred to as an embedding.
Best k-dimensional subspace for projection
- Classification: maximum separation among classes. Example: Linear Discriminant Analysis (LDA)
- Regression: maximize correlation between projected data and the output variable. Example: Partial Least Squares (PLS)
- Unsupervised: retain as much data variance as possible. Example: Principal Component Analysis (PCA)
Principal Components Analysis
- Relies on eigen-decomposition (which can only be done for square matrices)
- PCA minimizes the orthogonal distance from the data points to their projections
- Widely used method for unsupervised & linear dimensionality reduction
- Accounts for variance of data in as few dimensions as possible using linear projections
PCA performs dimensionality reduction in two steps:
- PCA rotates the samples so that they are aligned with the coordinate axes.
- PCA also shifts the samples so that they have a mean of zero.
Principal components (PCs)
- PCs maximize the variance of projections
- PCs are orthogonal
- Gives the best axis to project
- Goal of PCA: Minimize total squared reconstruction error
Repeat until we have $k$ orthogonal lines.
from sklearn.decomposition import PCA
# PCA that will retain 99% of the variance
pca = PCA(n_components=0.99, whiten=True)
pca.fit(X)
X_pca = pca.transform(X)
Plot the explained variance
tot = sum(pca.explained_variance_)
var_exp = [(i / tot) * 100 for i in sorted(pca.explained_variance_, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
PCA factor loadings
The factor loadings are the unstandardized values of the eigenvectors.
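As a sketch, assuming the fitted `pca` object from the snippet above: scikit-learn exposes the eigenvectors as `pca.components_`, and if variance-scaled loadings are preferred, a common convention multiplies each eigenvector by the square root of the variance it explains.
eigenvectors = pca.components_  # one principal axis (eigenvector) per row
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)  # variance-scaled convention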
When to use PCA?
Strengths:
- A versatile technique
- Fast and simple
- Offers several variations and extensions (e.g., kernel/sparse PCA)
Weaknesses:
- Result is not interpretable
- Requires setting threshold for cumulative explained variance
Other Techniques
Unsupervised
- Latent Semantic Indexing/ Analysis (LSI and LSA) (SVD)
- Independent Component Analysis (ICA)
Matrix Factorization
- Non-Negative Matrix Factorization (NMF)
Latent Methods
- Latent Dirichlet Allocation (LDA)
Singular Value Decomposition (SVD)
- SVD decomposes non-square matrices
- Useful for sparse matrices and matrices that are not square, such as those produced by TF-IDF (see the sketch below)
- Removes redundant features from the dataset
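For example, scikit-learn's TruncatedSVD works directly on a sparse TF-IDF matrix. This is a minimal sketch with a made-up toy corpus, and n_components=2 is an arbitrary choice:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
docs = ["grain futures fell", "oil prices rose sharply", "wheat and grain exports grew"]
tfidf = TfidfVectorizer().fit_transform(docs)  # sparse, non-square TF-IDF matrix
svd = TruncatedSVD(n_components=2)             # keep 2 latent components (LSA)
doc_vectors = svd.fit_transform(tfidf)         # dense low-dimensional document vectors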
Independent Component Analysis
- PCA seeks directions in feature space that minimize reconstruction error, i.e., uncorrelated factors
- ICA seeks directions that are most statistically independent
- ICA addresses higher order dependence
How does ICA work?
- Assume there exist independent signals:
- $S = [s_1(t), s_2(t), ..., s_N(t)]$
- Linear combinations of signals: $Y(t) = A S(t)$
- Both $A$ and $S$ are unknown
- $A$ - mixing matrix
- Goal of ICA: recover the original signals $S(t)$ from $Y(t)$ (a minimal sketch follows)
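A minimal sketch of this setup with scikit-learn's FastICA; the sources and mixing matrix below are made up for illustration:
import numpy as np
from sklearn.decomposition import FastICA
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]  # two independent sources
A = np.array([[1.0, 0.5], [0.5, 2.0]])            # unknown mixing matrix
Y = S @ A.T                                       # observed mixtures, Y = A S
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(Y)  # recovered sources (up to sign, scale, and order)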
Comparing PCA and ICA
Non-negative Matrix Factorization (NMF)
- NMF models are interpretable and easier to understand
- NMF requires the sample features to be non-negative, so it cannot be applied to every dataset (a minimal sketch follows)
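A minimal sketch with scikit-learn's NMF; the random non-negative data and n_components=5 are arbitrary choices for illustration:
import numpy as np
from sklearn.decomposition import NMF
X = np.abs(np.random.default_rng(0).normal(size=(100, 20)))  # features must be non-negative
nmf = NMF(n_components=5, init='nndsvd', random_state=0)
W = nmf.fit_transform(X)  # per-sample weights over 5 additive parts
H = nmf.components_       # the parts themselves, so X is approximately W @ H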
Dimensionality Reduction Techniques
If you wish to dive more deeply into dimensionality reduction techniques, feel free to check out these optional references. You won’t have to read these to complete this week’s practice quizzes.
Quantization and Pruning — Mobile, IoT, and Similar Use Cases
Factors driving the trend toward edge computing
- Demand to move ML capability from the cloud to on-device
- Cost-effectiveness
- Compliance with privacy regulations
Online ML inference
- To generate real-time predictions you can:
- Host the model on a server
- Embed the model in the device
- Is it faster on a server, or on-device?
- Mobile processing limitations?
Mobile inference
Model development
Benefits and Process of Quantization
Quantization involves transforming a model into an equivalent representation that uses parameters and computations at a lower precision.
This improves model execution performance and efficiency but at the cost of slightly lower model accuracy.
Think of a picture: a grid of pixels, each encoded with a certain number of bits. If we reduce the continuous color spectrum of real life to a set of discrete colors, we are quantizing (approximating) the image.
Beyond a certain point it may get harder to recognize what the data/image really is.
Why quantize neural networks?
- Neural networks have many parameters and take up space
- Shrinking model file size
- Reduce computational resources
- Make models run faster and use less power with low-precision
MobileNets: Latency vs Accuracy trade-off
MobileNets are a family of small, low-latency, low-power models, parameterized to meet the resource constraints of a variety of use cases.
Benefits of quantization
- Faster compute
- Low memory bandwidth
- Low power
- Integer operations supported across CPU/DSP/NPUs
The quantization process
The weights and activations of a particular layer often lie in a small range that can be estimated beforehand, so we don't need a wide floating-point type to store them. Instead, we can concentrate the (fewer) available bits within that smaller range.
Find the maximum absolute weight value, $m$, then map the floating-point range $-m$ to $+m$ onto the fixed-point range $-127$ to $+127$.
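A rough NumPy sketch of that symmetric mapping (not the actual TensorFlow Lite implementation):
import numpy as np
def quantize_symmetric_int8(w):
    m = np.max(np.abs(w))                    # maximum absolute weight
    scale = m / 127.0 if m > 0 else 1.0      # float units per integer step
    q = np.round(w / scale).astype(np.int8)  # weights mapped to [-127, 127]
    return q, scale
def dequantize(q, scale):
    return q.astype(np.float32) * scale      # approximate float reconstruction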
What parts of the model are affected?
- Static parameters (like weights of the layers)
- Dynamic parameters (like activations inside networks)
- Computation (transformations)
Trade-offs
- Optimizations impact model accuracy
- Difficult to predict ahead of time
- In rare cases, models may actually gain some accuracy
- Undefined effects on ML interpretability.
Choose the best model for the task
Post Training Quantization
In this technique, the size of an already trained TensorFlow model is reduced by converting it with the TensorFlow Lite converter and saving it in the TensorFlow Lite format.
- Reduced precision representation
- Incur small loss in model accuracy
- Joint optimization for model size and latency
- In dynamic range quantization, during inference the weights are converted back from eight bits to floating point and activations are computed using floating point kernels.
- This conversion is done once, cached to reduce latency
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()
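The converted flat buffer can then be written to disk for deployment (the file name here is just an example):
with open('model_quant.tflite', 'wb') as f:
    f.write(tflite_quant_model)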
Model accuracy
- Small accuracy loss incurred (mostly for smaller networks)
- Use the benchmarking tools to evaluate model accuracy
- If the accuracy drop is not within acceptable limits, consider using quantization-aware training
Quantization Aware Training
Quantization-aware training adds fake quantization operations to the model so that it learns parameters that are robust to quantization noise during training.
- Inserts fake quantization (FQ) nodes in the forward pass
- Rewrites the graph to emulate quantized inference
- Reduces the loss of accuracy due to quantization
- Resulting model contains all data to be quantized according to spec
QAT on entire model
import tensorflow_model_optimization as tfmot
model = tf.keras.Sequential([
...
])
# Quantize the entire model.
quantized_model = tfmot.quantization.keras.quantize_model(model)
# Continue with training as usual
quantized_model.compile(...)
quantized_model.fit(...)
Quantize part(s) of a model
import tensorflow_model_optimization as tfmot
quantize_annotate_layer = tfmot.quantization.keras.quantize_annotate_layer
model = tf.keras.Sequential([
...
# Only annotated layers will be quantized
quantize_annotate_layer(Conv2D()),
quantize_annotate_layer(ReLU()),
Dense(),
...
])
# Quantize the model
quantized_model = tfmot.quantization.keras.quantize_apply(model)
Quantize custom Keras layer
quantize_annotate_layer = tfmot.quantization.keras.quantize_annotate_layer
quantize_annotate_model = tfmot.quantization.keras.quantize_annotate_model
quantize_scope = tfmot.quantization.keras.quantize_scope
model = quantize_annotate_model(tf.keras.Sequential([
quantize_annotate_layer(CustomLayer(20, input_shape=(20,)),
DefaultDenseQuantizeConfig()),
tf.keras.layers.Flatten()
]))
# `quantize_apply` requires mentioning `DefaultDenseQuantizeConfig` with
# `quantize_scope`
with quantize_scope(
{'DefaultDenseQuantizeConfig': DefaultDenseQuantizeConfig,
'CustomLayer': CustomLayer}):
# Use `quantize_apply` to actually make the model quantization aware
quant_aware_model = tfmot.quantization.keras.quantize_apply(model)
Quantization
If you wish to dive more deeply into quantization, feel free to check out these optional references. You won’t have to read these to complete this week’s practice quizzes.
Pruning
Pruning increases the efficiency of a model by removing parts (connections) of the model that do not contribute substantially to producing accurate results.
Model Sparsity
Origins of weight pruning
The first major paper advocating sparsity in neural networks dates back to 1990: "Optimal Brain Damage" by Yann LeCun, John S. Denker, and Sara A. Solla.
At that time, pruning a trained neural network was already a popular way to reduce model size. Pruning was mainly done by using weight magnitude as an approximation for saliency, to determine the less useful connections.
- The intuition: the smaller the weight, the smaller its effect on the output.
- The saliency of each weight was estimated as the change in the loss function upon applying a perturbation to the nodes in the network; the model was then retrained.
- But retraining became a lot harder.
- The answer came with the "lottery ticket hypothesis"
Lottery Ticket Hypothesis
As the network size increases, the number of possible subnetworks and the probability of finding the ‘lucky subnetwork’ also increase. As per the lottery ticket hypothesis, if we find this lucky subnetwork, we can train small and sparsified networks to give higher performance even when 90 percent of the full network’s parameters are removed.
Finding Sparse Neural Networks
"A randomly-initialized, dense neural network contains a subnetwork that is initialized such that — when trained in isolation — it can match the test accuracy of the original network after training for at most the same number of iterations" - Jonathan Frankle and Michael Carbin
Basically, instead of fine-tuning the pruned weights, just reset them to their original initialization values and retrain.
This led to acceptance of the idea that over-parameterized dense networks contain several sparse subnetworks with varying performance, and that one of these subnetworks is the "winning ticket" that outperforms all others.
Pruning research is evolving
- The new method didn't perform well at large scale
- The new method failed to identify the randomly initialized winners
- Active area of research
Eliminate connections based on their magnitude
TensorFlow includes a weight pruning API, which can iteratively remove connections based on their magnitude during training.
Sparsity increases with training
tion="Image source: TensorFlow Model Optimization Toolkit — Pruning API (morioh.com) Black cells indicate where the non-zero weights exist as pruning is applied to a tensor." >}}
What's special about pruning?
- Better storage and/or transmission
- Gain speedups in CPU and some ML accelerators
- Can be used in tandem with quantization to get additional benefits
- Unlock performance improvements
Pruning with the TF Model Optimization Toolkit and Keras
import tensorflow_model_optimization as tfmot
model = build_your_model()
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.5, final_sparsity=0.8,
    begin_step=2000, end_step=4000)
model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude(
    model,
    pruning_schedule=pruning_schedule)
...
model_for_pruning.fit(...,
    # The UpdatePruningStep callback is required so the pruning wrappers advance each training step
    callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
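Before exporting, the pruning wrappers are typically stripped with tfmot.sparsity.keras.strip_pruning so the saved model keeps only the sparse weights; the save path below is just an example:
model_for_export = tfmot.sparsity.keras.strip_pruning(model_for_pruning)
model_for_export.save('pruned_model')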
Pruning
If you wish to dive more deeply into pruning, feel free to check out these optional references. You won’t have to read these to complete this week’s practice quizzes.