
Line search¶

Let's assume we have already chosen the descent direction \(d\), and we want to choose the step factor \(\alpha\) that determines the next design point. One approach is to use line search, which selects the step factor that minimizes:

\[ \min_{\alpha} f(x + \alpha d) \]

This can be solved with any univariate optimization method. However, doing so exactly at every iteration is expensive. Instead, we find a reasonable \(\alpha^{(k)}\) and move on.

Some algorithms use a fixed step factor. Here, however, there is a risk that we overshoot the minimum or that convergence is slow. Another common approach is to select a decaying step factor:

\[ \alpha^{(k)} = \alpha^{(1)}\gamma^{(k-1)}, \quad \gamma \in (0,1] \]

Decaying step factors are especially useful when minimizing a noisy objective function (typical in machine learning).
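The decay schedule above can be sketched in a few lines; the initial step factor \(\alpha^{(1)} = 1\) and decay rate \(\gamma = 0.5\) below are hypothetical values chosen for illustration:

```python
def decaying_step(alpha1, gamma, k):
    # alpha^(k) = alpha^(1) * gamma^(k-1), with gamma in (0, 1]
    return alpha1 * gamma ** (k - 1)

# First four step factors for alpha^(1) = 1.0, gamma = 0.5
steps = [decaying_step(1.0, 0.5, k) for k in range(1, 5)]
# -> [1.0, 0.5, 0.25, 0.125]
```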

Approximate Line search¶

Approximate line search tries to find a suitable step size in a small number of evaluations. In a descent method, any value of \(\alpha\) that causes a decrease in the objective function is acceptable. However, if we enforce additional conditions, we may achieve faster convergence. The condition of sufficient decrease (the Armijo condition, or first Wolfe condition) requires that the step size cause a sufficient decrease in the objective function:

\[ f(x^{(k+1)} ) \le f(x^{(k)}) + \beta \alpha \nabla_{d^{(k)}}f(x^{(k)}) \]
  • \(\beta \in [0,1]\) is often set to \(\beta = 10^{-4}\). If we set \(\beta = 0\), we accept any decrease. If \(\beta = 1\), then the decrease has to be at least as much as what would be predicted by a first-order approximation.

  • \(\nabla_{d^{(k)}}f(x^{(k)})\) is the directional derivative of \(f\) at \(x^{(k)}\) along \(d^{(k)}\)

Algorithm for Backtracking line Search¶

Start with a large value of \(\alpha\). We require: \(f, \nabla f, x, d, \alpha, p=0.5, \beta = 10^{-4}\)

\(y, g = f(x), \nabla f(x)\)

while \(f(x + \alpha d) > y + \beta \alpha (g \cdot d)\):
\(\; \; \; \; \alpha = \alpha p\)

return \(\alpha\)

Here we try to accept the largest possible step size.
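The procedure above translates directly into code. This is a minimal sketch; the starting \(\alpha = 10\), the quadratic test function, and the steepest-descent direction are illustrative choices, not part of the original algorithm statement:

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, d, alpha=10.0, p=0.5, beta=1e-4):
    # Shrink alpha by factor p until the Armijo (sufficient decrease)
    # condition holds: f(x + alpha*d) <= f(x) + beta * alpha * (g . d)
    y, g = f(x), grad_f(x)
    while f(x + alpha * d) > y + beta * alpha * np.dot(g, d):
        alpha *= p
    return alpha

# Example: f(x) = x'x, with the steepest-descent direction d = -grad f(x)
f = lambda x: np.dot(x, x)
grad_f = lambda x: 2 * x
x = np.array([2.0, 1.0])
d = -grad_f(x)
alpha = backtracking_line_search(f, grad_f, x, d)
# alpha = 0.625 here; f(x + alpha*d) = 0.3125 < f(x) = 5.0
```

Because we start large and only shrink, the returned \(\alpha\) is the largest step of the form \(\alpha p^i\) that satisfies sufficient decrease.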

The black lines show the iterations taken by the descent method, and red lines show the points considered during each line search.


By Marek Barak
© Copyright 2020.