Neural Networks
Overview
- Neural networks (NNs) are considered black-box models
- They are complex and do not provide much insight into variable relationships
- They have the potential to model very complicated patterns (they are universal approximators)
- They can be used for both classification and continuous prediction tasks
The History
- The concept was welcomed with enthusiasm in the 1980s
- It didn't live up to expectations then; too much hype, perhaps
- Overtaken by other black-box techniques, such as Support Vector Machines with kernels, in the 2000s
- Now, in the age of image and visual recognition problems, neural networks have made a comeback
- An area of rapid development, rebranded as Deep Learning:
  - Recurrent neural networks
  - Convolutional neural networks
  - Feedforward neural networks
The Structure of a Neural Network
These networks are often called Multilayer Perceptrons (MLPs).
The Structure of a Neural Network
[Diagram: an input layer (with a bias node fixed at 1) feeding Hidden Layer 1, then Hidden Layer 2 (each with its own bias node), then the output $\hat{y}$]
The Structure of a Neural Network
Associated with each line (connection) in this diagram is a parameter to be solved for!
A Simpler Neural Network
[Diagram: inputs $x_1, x_2, x_3$ and a bias node (=1), one hidden layer with units $h_1, h_2$, and a single output $y$]
To avoid triple subscripts, let's simplify our network to one hidden layer and just 3 input variables. We'll assume a binary target.
Math Structure of a Neural Network
$$h_1 = \tanh(w_{10} + w_{11} x_1 + w_{12} x_2 + w_{13} x_3)$$
The hyperbolic tangent is one of many possible sigmoid functions. Its range is $-1$ to $1$, and it is related to the logistic function.
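To make the "related to the logistic function" remark precise: tanh is just the logistic function $\sigma$ rescaled to the range $(-1, 1)$,
$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = 2\,\sigma(2z) - 1, \qquad \sigma(z) = \frac{1}{1 + e^{-z}}.$$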
Sigmoid Function
[Plot: the tanh curve for inputs from $-5$ to $5$, rising from $-1$ to $1$]
Math Structure of a Neural Network
$$h_1 = \tanh(w_{10} + w_{11} x_1 + w_{12} x_2 + w_{13} x_3)$$
$$h_2 = \tanh(w_{20} + w_{21} x_1 + w_{22} x_2 + w_{23} x_3)$$
The intercept of each equation ($w_{10}$, $w_{20}$) is called the bias term.
Math Structure of a Neural Network
$$\mathrm{logit}(p) = w_{00} + w_{01} h_1 + w_{02} h_2$$
Math Structure of a Neural Network
- With just 3 input variables and 1 hidden layer containing 2 hidden units, we have to estimate 11 parameters: $2 \times (3+1) = 8$ hidden-layer weights plus $2 + 1 = 3$ output-layer weights!
  $$h_1 = \tanh(w_{10} + w_{11} x_1 + w_{12} x_2 + w_{13} x_3)$$
  $$h_2 = \tanh(w_{20} + w_{21} x_1 + w_{22} x_2 + w_{23} x_3)$$
  $$\mathrm{logit}(p) = w_{00} + w_{01} h_1 + w_{02} h_2$$
- Weight estimates are found by maximizing the log-likelihood function for a class target
- The process involves an algorithm called backpropagation
- Probability estimates are obtained by solving the logit equation for $p$ for each input $(x_1, x_2, x_3)$:
  $$p = \frac{1}{1 + e^{-\mathrm{logit}(p)}}$$
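As a concrete illustration, here is a minimal forward-pass sketch of this 3-input, 2-hidden-unit network in Python with NumPy. The function name and the weight values are invented for illustration; they are not estimates from any fitted model.

```python
import numpy as np

def forward(x, W_hidden, w_out):
    """Compute p = P(y = 1 | x) for the 3-input, 2-hidden-unit network.

    W_hidden: 2x4 array; row i holds (w_i0, w_i1, w_i2, w_i3)
    w_out:    length-3 array (w_00, w_01, w_02)
    """
    xb = np.concatenate(([1.0], x))       # prepend the bias input (=1)
    h = np.tanh(W_hidden @ xb)            # hidden units h_1, h_2
    logit = w_out[0] + w_out[1:] @ h      # logit(p) = w_00 + w_01 h_1 + w_02 h_2
    return 1.0 / (1.0 + np.exp(-logit))   # solve the logit equation for p

# Example call with arbitrary weights -- 8 + 3 = 11 parameters in total:
W_hidden = np.array([[ 0.1,  0.5, -0.3,  0.8],
                     [-0.2,  0.4,  0.7, -0.6]])
w_out = np.array([0.05, 1.2, -0.9])
print(forward(np.array([1.0, 2.0, -1.0]), W_hidden, w_out))
```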
Training a Neural Net (the Backpropagation Algorithm)
- Forward phase: starting with some initial weights (often random), the calculations are passed through the network to the output layer, where a predicted value is computed.
- Backward phase: the predicted value is compared to the actual value, and the error is propagated backwards through the network to modify the connection weights.
- Repeat until convergence (or some similar stopping criterion).
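Below is a minimal sketch of this training loop for the small network above, using plain gradient descent on the negative log-likelihood. The synthetic data, learning rate, and epoch count are all arbitrary choices for illustration, not settings from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # 200 rows, 3 inputs
y = (X[:, 0] + X[:, 1] ** 2 > 0.5).astype(float)   # synthetic binary target

W = rng.normal(scale=0.1, size=(2, 4))   # hidden weights, incl. bias column
v = rng.normal(scale=0.1, size=3)        # output weights (w_00, w_01, w_02)
lr = 0.1

for epoch in range(500):
    for x_i, y_i in zip(X, y):
        # Forward phase: compute the predicted probability
        xb = np.concatenate(([1.0], x_i))
        h = np.tanh(W @ xb)
        p = 1.0 / (1.0 + np.exp(-(v[0] + v[1:] @ h)))

        # Backward phase: propagate the error back through the network
        err = p - y_i                               # dLoss/dlogit
        grad_v = err * np.concatenate(([1.0], h))   # output-layer gradient
        grad_W = np.outer(err * v[1:] * (1 - h ** 2), xb)  # hidden-layer gradient

        # Modify the connection weights
        v -= lr * grad_v
        W -= lr * grad_W

# Check the fit on the training data
Xb = np.column_stack([np.ones(len(X)), X])
probs = 1.0 / (1.0 + np.exp(-(v[0] + np.tanh(Xb @ W.T) @ v[1:])))
print("training accuracy:", ((probs > 0.5) == y).mean())
```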
Standardization
- NNs work best when the input data are scaled to a narrow range around 0
- For bell-shaped data, statistical z-score standardization is appropriate
- For severely non-normal data, range standardization is more appropriate
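A minimal sketch of both scalings in NumPy; the example values are made up:

```python
import numpy as np

X = np.array([[170.0, 30.0],
              [160.0, 45.0],
              [180.0, 22.0]])

# z-score standardization: (x - mean) / std, for roughly bell-shaped inputs
X_z = (X - X.mean(axis=0)) / X.std(axis=0)

# range (min-max) standardization: maps each column onto [0, 1],
# a safer choice for severely non-normal inputs
X_range = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```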
Probability Surface of a Neural Network
[Plot: the fitted probability surface over the input space, showing a region of high probability of the yellow class and a region of lower probability of the yellow class]
Advantages of a Neural Network
- Can be adapted to classification or numerical prediction problems
- Capable of modelling complex nonlinear patterns (more complex than any other algorithm right now)
- Makes few assumptions about the data's underlying relationships
Disadvantages of a Neural Network
- NNs have no mechanism for variable selection: you provide the inputs
- It is very difficult to see the relationships underlying the data
  - The signs of the weights can cancel each other out through the network
  - Each input gets a weight for each hidden unit, and these weights are then combined
- Extremely computationally intensive; slow to train
  - Particularly if the network structure is complex or the number of variables is large
- Prone to overfitting the training data