- Applicability of ML: translation, recognition, computer vision, game playing, natural language
- More emerging applications: medical diagnosis, financial services, transportation, e-commerce

Desiderata of ML systems:
- Modular, with proper abstraction
- Understandable
- Reliable
- Fair
- Transparent

The empirical paradigm drives DL:
- Availability of benchmark data
- Compute power
- Trial and error

Downsides of the empirical paradigm:
- Difficult to do a differential diagnosis when something goes wrong
- Impossible to isolate the cause: training data, model, or something else?
- Cyclic learning rates => unprincipled, fragile, excessive tuning
- Batch normalization: "internal covariate shift" is ill defined, or not defined at all
- Sometimes changing the random seed changes the error

- Focus: why does SGD work, and when does it generalize well?
- Test error = optimization error + generalization error: optimization error is the gap between the training loss SGD reaches and the best achievable training loss; generalization error is the gap between test and training loss
- Does SGD find solutions with both low optimization error and low generalization error?
- Why is this challenging? Optimization: non-convex, many local optima. Statistical: more parameters than samples, and the answer is algorithm dependent.

Why is non-convexity hard? Finding global minima is non-trivial and hard in general.
- Gradient methods do not stick around at saddle points
- Saddle points are unstable
- Gradient methods do converge to local minimizers
- Key: for DL problems, all local minima are approximately global minima

Landscape analysis: all local minimizers of the loss are approximate global minimizers.
Well-behaved non-convex problems: ICA, PCA, matrix factorization; the same holds for matrix completion.

# Overparameterization

- Landscape design => change the loss landscape by altering the architecture
- The first common way is to increase the number of layers (overparameterize)
- With a small network, SGD can get stuck at bad local minima; overparameterizing (increasing the layers) leads SGD to global minima
- Residual links: before 2016 plain feedforward networks were the norm; a residual architecture feeds a layer's input forward to later layers through an identity mapping, allowing very deep networks to be trained
- Theorem: training an overparameterized residual network reaches a global minimum. If the Jacobian is rank deficient, training can get stuck; but at initialization the Jacobian is well conditioned (random matrices), and the overparameterization in ResNets helps keep the Jacobian well conditioned

# Generalization

- Generalization error <= sqrt{complexity / n}, where complexity can be measured, e.g., by the number of parameters
- Textbook overfitting predicts that adding parameters increases test loss, yet in practice the generalization error decreases as the network grows
- Rademacher complexity is another complexity measure used in such bounds
- How does overparameterization affect the margin? Overparameterization acts as an implicit regularizer
- SGD, without any explicit regularization, converges to the solution of the max-margin problem (hard-margin SVM)
- It finds the direction an SVM would choose, the one that maximizes the margin
- Does this extend to other deep architectures? That still requires understanding the predictor function the architecture implements
- Improvements come from architectural changes
- By changing the architecture of a network, training can be improved

Adaptive data analysis: choosing analyses or questions based on the data already observed.
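The notes below collect a few small code sketches for points raised above. First, the cyclic-learning-rate complaint: a minimal sketch, assuming PyTorch, of how such a schedule is typically wired up. The `base_lr`, `max_lr`, and `step_size_up` values are arbitrary illustrative choices, which is exactly the extra tuning surface being criticized.

```python
# Minimal sketch of a cyclic learning-rate schedule (assumes PyTorch).
# The bounds and cycle length are arbitrary -- extra hyperparameters
# with little principled guidance.
import torch

model = torch.nn.Linear(10, 1)                      # toy model
opt = torch.optim.SGD(model.parameters(), lr=0.01)  # base optimizer
sched = torch.optim.lr_scheduler.CyclicLR(
    opt, base_lr=1e-4, max_lr=1e-1, step_size_up=200)

x, y = torch.randn(32, 10), torch.randn(32, 1)
for step in range(1000):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    sched.step()  # learning rate oscillates between base_lr and max_lr
```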
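The batch-normalization criticism targets the stated justification ("internal covariate shift"), not the mechanics. For reference, a minimal numpy sketch of what a BatchNorm layer computes at training time: per-feature normalization by the batch mean and variance, followed by a learned scale and shift.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization over a batch of shape (N, D).

    Each feature is normalized by the batch mean and variance, then
    rescaled by learned parameters gamma (scale) and beta (shift).
    """
    mean = x.mean(axis=0)                  # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(64, 4))   # badly scaled activations
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1 per feature
```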
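The claim that gradient methods do not stick around at saddle points can be seen on the textbook example f(x, y) = x^2 - y^2, which has a saddle at the origin: gradient descent started exactly at the saddle stays put (the gradient is zero there), but any tiny perturbation, such as the noise SGD injects, is amplified along the negative-curvature direction. A minimal numpy sketch:

```python
import numpy as np

def grad(p):
    # f(x, y) = x^2 - y^2 has a saddle point at (0, 0)
    x, y = p
    return np.array([2 * x, -2 * y])

def run_gd(p0, lr=0.1, steps=200):
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p -= lr * grad(p)
    return p

print(run_gd([0.0, 0.0]))   # stays at the saddle: the gradient is exactly zero
print(run_gd([0.0, 1e-8]))  # a tiny perturbation escapes along the y-axis
                            # (and keeps going, since f is unbounded below there)
```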
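As a concrete instance of a well-behaved non-convex problem, the sketch below (numpy, with illustrative sizes and step size) runs plain gradient descent on the matrix-factorization objective ||M - U V^T||_F^2 from a small random start; despite non-convexity, it reaches an approximately global minimum when M has the assumed low rank.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 30, 20, 3
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, m))  # ground-truth rank-3 matrix

# Non-convex objective: f(U, V) = ||M - U V^T||_F^2
U = 0.1 * rng.normal(size=(n, r))
V = 0.1 * rng.normal(size=(m, r))
lr = 0.005
for step in range(10_000):
    R = U @ V.T - M          # residual
    gU = 2 * R @ V           # gradient w.r.t. U
    gV = 2 * R.T @ U         # gradient w.r.t. V
    U -= lr * gU
    V -= lr * gV

print("final loss:", np.linalg.norm(M - U @ V.T) ** 2)  # ~0: an (approximately) global minimum
```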
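A minimal experiment, assuming PyTorch, in the spirit of the small-network-versus-overparameterized-network bullet: the same full-batch gradient descent (via `torch.optim.SGD`) on the same data with a narrow and a wide one-hidden-layer network. With random labels, only the wide network typically drives the training loss near zero; the widths, learning rate, and data sizes are illustrative choices.

```python
import torch

torch.manual_seed(0)
X = torch.randn(64, 10)
y = torch.randn(64, 1)   # random labels: only a big enough network can fit them

def train(width, steps=5000, lr=5e-3):
    """Full-batch gradient descent on a one-hidden-layer ReLU network."""
    model = torch.nn.Sequential(
        torch.nn.Linear(10, width), torch.nn.ReLU(), torch.nn.Linear(width, 1))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(X), y)
        loss.backward()
        opt.step()
    return loss.item()

# The narrow net plateaus at a high training loss; the wide (overparameterized)
# net typically drives the training loss close to zero.
print("width    4:", train(4))
print("width 1024:", train(1024))
```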
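A minimal residual block, assuming PyTorch, showing the identity mapping described above: the block's input is added back to the output of its learned transformation, so even a deep stack of such blocks can pass the signal (and gradients) straight through.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): the identity (skip) path carries the input forward."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.f(x)   # identity mapping plus a learned residual

# A "very deep" network is just a stack of residual blocks.
depth, dim = 50, 32
net = nn.Sequential(*[ResidualBlock(dim) for _ in range(depth)])
x = torch.randn(8, dim)
print(net(x).shape)  # torch.Size([8, 32])
```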
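A loose illustration of the "random matrices are well conditioned" step: for a Gaussian random matrix, the smallest singular value stays away from zero once the matrix is sufficiently rectangular, which is the kind of effect overparameterization is said to give the Jacobian at initialization. The matrices below are stand-ins, not actual network Jacobians.

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_condition_number(rows, cols, trials=20):
    """Average condition number of a rows x cols Gaussian matrix (entries N(0, 1/cols))."""
    vals = []
    for _ in range(trials):
        A = rng.normal(size=(rows, cols)) / np.sqrt(cols)
        s = np.linalg.svd(A, compute_uv=False)
        vals.append(s[0] / s[-1])
    return float(np.mean(vals))

n = 100
print("square   (100 x 100):", avg_condition_number(n, n))      # very large: nearly singular
print("2x wider (100 x 200):", avg_condition_number(n, 2 * n))  # much smaller (~6)
print("8x wider (100 x 800):", avg_condition_number(n, 8 * n))  # smaller still (~2)
```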
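To ground the complexity-based bound, here is a small Monte Carlo estimate of the empirical Rademacher complexity of norm-bounded linear predictors. For that class the supremum has a closed form, (B/n) * ||sum_i sigma_i x_i||, so the estimate only averages over random sign vectors; it shrinks roughly like 1/sqrt(n), matching the 1/sqrt(n) scaling in the bound above.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher_linear(X, B, trials=2000):
    """Monte Carlo estimate of E_sigma sup_{||w|| <= B} (1/n) sum_i sigma_i <w, x_i>.

    For the norm-bounded linear class the supremum has the closed form
    (B/n) * ||sum_i sigma_i x_i||_2.
    """
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(trials, n))  # Rademacher sign vectors
    sums = sigma @ X                                    # each row is sum_i sigma_i x_i
    return B / n * np.linalg.norm(sums, axis=1).mean()

d, B = 5, 1.0
for n in (100, 400, 1600):
    X = rng.normal(size=(n, d))
    print(n, empirical_rademacher_linear(X, B))  # shrinks roughly like 1/sqrt(n)
```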
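A small numerical check of the "SGD converges to the max-margin solution" claim in its simplest setting: full-batch gradient descent on the logistic loss over linearly separable 2D data (a numpy sketch with a hand-crafted dataset). The weight norm grows without bound, but the direction w/||w|| slowly approaches the hard-margin SVM direction, which for this symmetric dataset is (2, 1)/sqrt(5).

```python
import numpy as np

# Linearly separable toy data (symmetric under negation, so no bias is needed).
X = np.array([[2.0, 1.0], [3.0, -1.0], [-2.0, -1.0], [-3.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Hard-margin SVM direction for this dataset (worked out by hand): (2, 1)/sqrt(5).
w_svm = np.array([2.0, 1.0]) / np.sqrt(5.0)

def logistic_grad(w):
    margins = y * (X @ w)
    # gradient of (1/n) sum_i log(1 + exp(-y_i <w, x_i>))
    return -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)

w = np.zeros(2)
lr = 0.5
for t in range(1, 200_001):
    w -= lr * logistic_grad(w)
    if t in (1_000, 10_000, 100_000, 200_000):
        direction = w / np.linalg.norm(w)
        min_margin = (y * (X @ direction)).min()
        print(t, direction.round(3), round(min_margin, 3))

# ||w|| keeps growing, but the direction slowly (logarithmically) drifts toward
# (0.894, 0.447) and the normalized minimum margin toward sqrt(5) ~ 2.236.
print("max-margin direction:", w_svm.round(3))
```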
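Finally, to see why adaptive data analysis is flagged as a risk, the sketch below reuses one dataset both to select "promising" features and to evaluate the resulting predictor. With pure-noise features, the reused data reports high accuracy while fresh data shows chance performance. Sample sizes and the selection rule are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 2000, 20              # few samples, many pure-noise features
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)  # labels independent of X: no real signal

# Adaptive step: pick the k features most correlated with y *on this same data* ...
corr = X.T @ y / n
top = np.argsort(np.abs(corr))[-k:]

# ... and build a predictor from them, evaluated on the same data again.
def predict(X):
    return np.sign(X[:, top] @ np.sign(corr[top]))

acc_reused = (predict(X) == y).mean()

# A fresh dataset reveals the truth: there was never any signal to find.
X_new = rng.normal(size=(n, d))
y_new = rng.choice([-1.0, 1.0], size=n)
acc_fresh = (predict(X_new) == y_new).mean()

print("accuracy on reused data:", acc_reused)  # well above 0.5
print("accuracy on fresh data: ", acc_fresh)   # ~0.5 (chance level)
```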