Algorithm Overview - Neural Networks

Algorithm Description 

The terms "Neural Net", "Deep Learning", "Deep Neural Net", and other similar terms describe a classical neural network algorithm with at least one hidden layer. The name "Neural Network" came about because the algorithm is loosely patterned after the human brain, which is an interconnected collection (network) of neurons. In the brain, a neuron receives electrical signals and 'decides' how much of the signal to pass on to its neighboring neurons. An electrochemical signal called the "action potential" governs this.

Generally, neural nets are complex algorithms that can do a very good job of modeling data that have complex relationships. They do not require any pre-determined notion of how variables are related to one another, or any assumptions about the distributions of the variables. From this standpoint, they are less statistical techniques than mathematical algorithms or classical machine learning techniques. The downside in some cases is that they can overfit, meaning that the algorithm fits not just the true patterns in the data, but also, to some extent, the noise that comes from the randomness inherent in any data collection process. Good selection of parameters can control overfitting well.

For Neural Net algorithms in general, data enters the algorithm and each node (neuron) has to "decide" how much of the information from each data point should be passed on. As an oversimplification, one can think of a number coming in with a certain magnitude, and the node deciding how much of that magnitude gets passed on to other nodes. It "decides" this using a simple scalar weight assigned to each connection between nodes (each arrow in the figure below). The algorithm is trained iteratively by searching for the weights that minimize the overall error, and it uses matrix transformations to do this.
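As a rough illustration of how weights control what each node passes on, the following Python sketch pushes one row of data through a network with a single hidden layer. The function names, the sigmoid squashing function, and all weight values are assumptions chosen for the example; this is not LityxIQ's implementation.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, hidden_weights, output_weights):
    """Pass one row of data through a network with one hidden layer.

    hidden_weights[j][i] is the weight on the connection from input i
    to hidden node j; output_weights[j] is the weight from hidden node j
    to the single output node. Bias terms are omitted for brevity.
    """
    hidden = [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)))
        for node_weights in hidden_weights
    ]
    return sum(w * h for w, h in zip(output_weights, hidden))

# Two inputs, two hidden nodes, one output (all weights are made-up values).
row = [1.0, 2.0]
hidden_weights = [[0.5, -0.25], [0.0, 0.0]]
output_weights = [1.0, 2.0]
prediction = forward(row, hidden_weights, output_weights)
print(prediction)
```

Training amounts to repeatedly adjusting the numbers in hidden_weights and output_weights so that predictions like this one move closer to the known target values.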

 

Figure: Artificial neural network of multiple layers and outputs. Source: Araujo et al., https://www.researchgate.net/figure/Artificial-neural-network-of-multiple-layers-and-outputs-31_fig2_331097835

 

The final output is a decimal value or a classification label; like any predictive algorithm, the network is trying to predict some value based on a given row of data. Because the nodes are often "fully connected" as shown above, with each node connected to every node in the next layer, the network can find linear and/or non-linear patterns. It can also handle mixed column types, both continuous and categorical, since all data must be transformed into a numerical matrix before entering the neural net.
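The transformation into a numerical matrix can be sketched as follows. This is a minimal Python example assuming one-hot encoding for categorical columns; the function encode_rows and the column names are hypothetical, and LityxIQ may prepare data differently.

```python
def encode_rows(rows, categorical_levels):
    """Turn rows of mixed values into a purely numerical matrix.

    rows: list of dicts mapping column name -> value.
    categorical_levels: dict mapping each categorical column name to the
    ordered list of values it can take. Every other column is assumed to
    already be numeric.
    """
    matrix = []
    for row in rows:
        encoded = []
        for column, value in row.items():
            if column in categorical_levels:
                # One-hot encoding: one 0/1 column per category level.
                encoded.extend(
                    1.0 if value == level else 0.0
                    for level in categorical_levels[column]
                )
            else:
                encoded.append(float(value))
        matrix.append(encoded)
    return matrix

rows = [
    {"age": 34, "region": "west"},
    {"age": 51, "region": "east"},
]
matrix = encode_rows(rows, {"region": ["east", "west"]})
print(matrix)  # [[34.0, 0.0, 1.0], [51.0, 1.0, 0.0]]
```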

In practice, deep neural nets often perform better than neural nets without hidden layers, and so have become more of the default. When one says "neural net", one is often referring to a "deep neural net" but could also be referring to other subtypes, such as a "convolutional neural net". Under the right circumstances, neural nets can provide the best predictive power of all algorithms, especially for audio or visual data. However, they are known to be among the most difficult to train properly, and they give little direct insight into feature importance, which makes quality control very difficult. They sit at the extreme performance end of the "performance vs. interpretability" spectrum, so the user must take extra care while training and running these models.

See the Additional Technical Information and Terms section below for more details.

 

 

Additional Links

http://news.mit.edu/2017/explained-neural-networks-deep-learning-0414 

https://en.wikipedia.org/wiki/Artificial_neural_network 

 

LityxIQ Parameters

Neural Net

Decay Rate - this helps control overfitting of the neural net model by adding a small adjustment term ("regularization term") to the model as weights are updated from one iteration to the next.  The term helps to maintain the weights within reasonable bounds.  A value of zero would eliminate this term.
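One common way a decay term enters a weight update is sketched below in Python. It assumes plain gradient descent with an L2-style penalty; the function update_weight and its learning_rate argument are illustrative only and are not LityxIQ settings, and the optimizer LityxIQ actually uses may differ.

```python
def update_weight(weight, gradient, learning_rate, decay):
    """One gradient-descent step with an L2-style decay term.

    Adding the penalty decay * weight**2 to the error contributes
    2 * decay * weight to the gradient, which pulls the weight toward zero.
    """
    return weight - learning_rate * (gradient + 2.0 * decay * weight)

# With a zero error gradient, only the decay term acts on the weight.
shrunk = update_weight(weight=4.0, gradient=0.0, learning_rate=0.1, decay=0.5)
unchanged = update_weight(weight=4.0, gradient=0.0, learning_rate=0.1, decay=0.0)
print(shrunk, unchanged)
```

With a decay of zero the weight is left alone, which matches the description above of a zero value eliminating the term.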

Number of Hidden Nodes - the Neural Net algorithm in LityxIQ supports a single hidden layer.  This parameter specifies the number of hidden nodes to be used.

Range for Initial Weights - neural net node weights are initialized to random values within the range specified by this setting.  It is generally a good idea to keep this range small, between 0 and 1.

Skip Layer Connection - if checked, this allows the architecture of the neural net to additionally have a connection from the input variables directly to the output, skipping over the hidden layer nodes.  This can be useful in helping the algorithm find the best solution efficiently.
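The effect of a skip layer connection can be sketched in Python as an extra direct path added to the usual path through the hidden layer. The function forward_with_skip, the tanh hidden activation, and the weights are assumptions made for illustration, not a description of LityxIQ internals.

```python
import math

def forward_with_skip(inputs, hidden_weights, output_weights, skip_weights):
    """Compute one output as (path through hidden layer) + (direct path)."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(node_weights, inputs)))
        for node_weights in hidden_weights
    ]
    through_hidden = sum(w * h for w, h in zip(output_weights, hidden))
    # Skip-layer term: inputs feed the output directly, bypassing hidden nodes.
    direct = sum(w * x for w, x in zip(skip_weights, inputs))
    return through_hidden + direct

inputs = [1.0, 2.0]
# Hidden weights of zero silence the hidden path, leaving only the direct one.
only_direct = forward_with_skip(inputs, [[0.0, 0.0]], [1.0], [0.5, 0.25])
print(only_direct)
```

Setting every entry of skip_weights to zero recovers an ordinary network without the skip connection.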

Maximum Number of Iterations - the maximum number of times the algorithm will iterate through the dataset to adjust node weights.  Larger values can make the algorithm take longer to execute, but could lead to incrementally better solutions.

Absolute/Relative Convergence Tolerance - on each iteration, the neural net algorithm determines whether it has found an optimal solution by checking whether the target metric (e.g., sum of squared errors) has changed very little.  These tolerance values set the thresholds for "very little", both in absolute terms and in relative (percentage) terms.
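A convergence check of this kind might look like the Python sketch below. It assumes that meeting either tolerance is enough to stop and that the relative change is measured against the previous error; the function has_converged is hypothetical, and the exact rule LityxIQ applies is not specified here.

```python
def has_converged(previous_error, current_error, abs_tol, rel_tol):
    """Return True when the error has stopped changing meaningfully."""
    change = abs(previous_error - current_error)
    if change <= abs_tol:
        return True
    # Relative change is measured against the previous error value.
    return previous_error != 0 and change / abs(previous_error) <= rel_tol

print(has_converged(100.0, 99.9999, abs_tol=1e-3, rel_tol=1e-8))  # absolute test passes
print(has_converged(100.0, 99.0, abs_tol=1e-3, rel_tol=0.05))     # relative test passes
print(has_converged(100.0, 50.0, abs_tol=1e-3, rel_tol=0.05))     # still improving
```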

Maximum No. Model Terms - the maximum number of variables that will be used in building the neural net.  Note that LityxIQ will use a variety of techniques to reduce the full set of available variables down to this value prior to executing the algorithm.

 

DeepNet

Hidden Nodes - use this to specify the number of hidden layers in the network and the number of nodes in each layer.  Specify this using a comma-separated list of nodes per layer.  For example, setting it to "100,40,10" would set up a network with three hidden layers, having 100, 40, and 10 nodes respectively.  The maximum number of hidden layers allowed is three.
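To make the format concrete, here is a small Python sketch that reads such a setting into a list of layer sizes and enforces the three-layer limit. The function parse_hidden_nodes is hypothetical and only mirrors the rules stated above; it is not LityxIQ's own parser.

```python
def parse_hidden_nodes(setting, max_layers=3):
    """Convert a string such as "100,40,10" into a list of layer sizes."""
    sizes = [int(part) for part in setting.split(",")]
    if not 1 <= len(sizes) <= max_layers:
        raise ValueError(f"expected 1 to {max_layers} hidden layers, got {len(sizes)}")
    if any(size <= 0 for size in sizes):
        raise ValueError("every hidden layer needs at least one node")
    return sizes

layers = parse_hidden_nodes("100,40,10")
print(layers)  # [100, 40, 10]
```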

Activation Function - the activation function to use in hidden layers.  Set this to either "Sigmoid" (the default, and commonly used transformation function), or "Tanh" which uses a hyperbolic tangent transformation function.
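The two choices can be compared with a few lines of Python. The standard definitions are used here (the logistic function for "Sigmoid" and math.tanh for "Tanh"); that LityxIQ uses exactly these forms is an assumption.

```python
import math

def sigmoid(z):
    """Logistic function: output always lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# tanh output lies between -1 and 1 and is centered on zero.
for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 4), round(math.tanh(z), 4))
```

The two functions have the same S shape; tanh is a rescaled sigmoid, since tanh(z) = 2 * sigmoid(2z) - 1.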

Learning Rate - how quickly and drastically the algorithm adjusts the weights. If the rate is too high, it might skip over the global minimum or maximum.

Momentum - high momentum allows the algorithm to move past a local minimum in search of the global minimum error, but too high a value could cause it to miss the global minimum.

Learning Rate Scale - a factor which adjusts the learning rate with each iteration.  Setting it to 1.0 will keep the same learning rate from one iteration to the next.  Setting it to a value less than 1 (e.g., 0.9) will cause the learning rate to decline over the iterations.
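How learning rate, momentum, and learning rate scale interact can be sketched in Python on a toy one-weight problem. The update rule shown is the common "heavy ball" form of momentum; the function train, the toy error function, and all values are assumptions for illustration rather than LityxIQ's actual training loop.

```python
def train(gradient, start, learning_rate, momentum, learning_rate_scale, num_epochs):
    """Gradient descent with momentum and a per-iteration learning rate scale."""
    weight = start
    velocity = 0.0
    for _ in range(num_epochs):
        # Momentum carries over a fraction of the previous step.
        velocity = momentum * velocity - learning_rate * gradient(weight)
        weight += velocity
        # A scale below 1.0 shrinks the learning rate after every iteration.
        learning_rate *= learning_rate_scale
    return weight, learning_rate

# Toy error (w - 3)**2 has gradient 2 * (w - 3) and its minimum at w = 3.
final_weight, final_rate = train(
    gradient=lambda w: 2.0 * (w - 3.0),
    start=0.0,
    learning_rate=0.1,
    momentum=0.5,
    learning_rate_scale=0.99,
    num_epochs=200,
)
print(final_weight, final_rate)
```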

Num Epochs - this is the number of iterations or passes through the dataset the network will use to train the network weights. 

Batch Size - this controls how dataset rows are aggregated together during an iteration.  A batch size of one means that each record is treated individually (slower, more accurate), while a larger batch size (e.g., 1000) will average over 1000 records at a time during an iteration pass (faster, potentially less accurate).
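The following Python sketch shows one way batching can work: rows are split into groups, and the weight is updated once per group using the average gradient of that group. The one-weight model (prediction = weight * x), the helper names, and the data are all made up for the example.

```python
def batches(rows, batch_size):
    """Yield consecutive slices of rows, each with at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

def mean_gradient(batch, weight):
    """Average squared-error gradient for the model prediction = weight * x."""
    return sum(2.0 * (weight * x - y) * x for x, y in batch) / len(batch)

def run_epoch(rows, weight, learning_rate, batch_size):
    """One pass through the data: one weight update per batch."""
    for batch in batches(rows, batch_size):
        weight -= learning_rate * mean_gradient(batch, weight)
    return weight

rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2 * x
weight = 0.0
for _ in range(50):
    weight = run_epoch(rows, weight, learning_rate=0.01, batch_size=2)
print(weight)  # approaches 2.0
```

With batch_size=1 the same pass would make four updates instead of two, which is why small batches take longer per epoch.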

Hidden/Visible Dropout - the fraction of the hidden or input nodes that should be dropped in each iteration of the training process.  This is a successful method for helping to reduce overfitting in the network by trying different network sizes during the training process.  See https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/ for more information.
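A dropout step can be sketched in Python as randomly zeroing a fraction of node outputs on each training pass. This example uses the "inverted dropout" convention, in which surviving values are scaled up to compensate; whether LityxIQ scales values this way is an assumption, and the function apply_dropout is hypothetical.

```python
import random

def apply_dropout(activations, dropout, rng):
    """Randomly zero out roughly a `dropout` fraction of node outputs.

    Surviving values are divided by (1 - dropout) so the expected total
    signal stays the same. Used during training only.
    """
    if not 0.0 <= dropout < 1.0:
        raise ValueError("dropout must be in the range [0, 1)")
    keep = 1.0 - dropout
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the example is repeatable
dropped = apply_dropout([1.0] * 10, dropout=0.5, rng=rng)
print(dropped)  # a mix of 0.0 (dropped nodes) and 2.0 (kept, rescaled)
```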

Maximum No. Model Terms - same as for Neural Net.

 

Additional Technical Information and Terms

Optimization Contour

 

Having high momentum can be thought of as a way for the model to get out of local minima to find the global minimum.

 

A high learning rate sounds like a good thing, but in the above figure it would mean long arrows, so the algorithm can easily overshoot the minimum. It is often preferable to have a low (slow) learning rate (short arrows), but this increases the time it takes to train.
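The trade-off can be seen on a toy problem. The Python sketch below runs plain gradient descent on the error function w**2 (minimum at w = 0) with three learning rates; the function descend and the specific rates are assumptions picked to make the behavior visible.

```python
def descend(learning_rate, steps, start=1.0):
    """Gradient descent on error(w) = w**2, whose minimum is at w = 0."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2.0 * w  # the gradient of w**2 is 2 * w
    return w

slow = descend(learning_rate=0.01, steps=20)  # short steps: still far from 0
good = descend(learning_rate=0.4, steps=20)   # reaches the minimum quickly
wild = descend(learning_rate=1.1, steps=20)   # every step overshoots and grows
print(slow, good, wild)
```

After the same 20 steps, the low rate has covered only part of the distance, the moderate rate has essentially arrived, and the high rate has moved farther from the minimum than where it started.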

Dropout: increasing dropout increases the number of nodes which are temporarily removed from the model. Sometimes neural nets can overfit the data and essentially memorize every single datapoint, making it difficult to generalize to future, unseen datapoints. Dropout adds randomness (as a random forest does), which helps prevent overfitting. Too much randomness will hurt performance.