Proposal Title: Implement Multi-Layer Perceptrons with backpropagation learning on Hadoop (addresses issue Mahout-342)

Student Name: Zaid Md. Abdul Wahab Sheikh

Student E-mail: (gmail id) sheikh.zaid

## I. Brief Description

A feedforward neural network (NN) reveals several degrees of parallelism within it such as weight parallelism, node parallelism, network parallelism, layer parallelism and training parallelism. However network based parallelism requires fine-grained synchronization and communication and thus is not suitable for map/reduce based algorithms. On the other hand, training-set parallelism is coarse grained. This can be easily exploited on Hadoop which can split up the input among different mappers. Each of the mappers will then propagate the 'InputSplit' through their own copy of the complete neural network.

The backpropagation algorithm will operate in batch mode. This is because updating a common set of parameters after each training example creates a bottleneck for parallelization. The overall error gradient vector calculation can be parallelized by calculating the gradients from each training vector in the Mapper, combining them to get partial batch gradients and then adding them in a reducer to get the overall batch gradient.

In a similiar manner, error function evaluations during line searches (for the conjugate gradient and quasi-Newton algorithms) can be efficiently parallelized.

Lastly, to avoid local minima in its error function, we can take advantage of training session parallelism to start multiple training sessions in parallel with different initial weights (simulated annealing).

## II. Detailed Proposal

The most important step is to design the base neural network classes in such a way that other NN architectures like Hopfield nets, Boltzman machines, SOM etc can be easily implemented by deriving from these base classes. For that I propose to implement a set of core classes that correspond to basic neural network concepts like artificial neuron, neuron layer, neuron connections, weight, transfer function, input function, learning rule etc. This architecture is inspired from that of the opensource Neuroph neural network framework (http://imgur.com/gDIOe.jpg). This design of the base architecture allows for great flexibility in deriving newer NNs and learning rules. All that needs to be done is to derive from the NeuralNetwork class, provide the method for network creation, create a new training method by deriving from LearningRule, and then add that learning rule to the network during creation. In addition, the API is very intuitive and easy to understand (in comparision to other NN frameworks like Encog and JOONE).

### The approach to parallelization in Hadoop:

In the Driver class:

- The input parameters are read and the NeuralNetwork with a specified LearningRule (training algorithm) created.
- Initial weight values are randomly generated and written to the FileSystem. If number of training sessions (for simulated annealing) is specified, multiple sets of initial weight values are generated.
- Training is started by calling the NeuralNetwork's learn() method. For each iteration, every time the error gradient vector needs to be calculated, the method submits a Job where the input path to the training-set vectors and various key properties (like path to the stored weight values) are set. The gradient vectors calculated by the Reducers are written back to an output path in the FileSystem.
- After the JobClient.runJob() returns, the gradient vectors are retrieved from the FileSystem and tested to see if the stopping criterion is satisfied. The weights are then updated, using the method implemented by the particular LearningRule. For line searches, each error function evaluation is again done by submitting a job.
- The NN is trained in iterations until it converges.

In the Mapper class:

- Each Mapper is initialized using the configure method, the weights are retrieved and the complete NeuralNetwork created.
- The map function then takes in the training vectors as key/value pairs (the key is ignored), runs them through the NN to calculate the outputs and backpropagates the errors to find out the error gradients. The error gradient vectors are then output as key/value pairs where all the keys are set to a common value, such as the training session number (for each training session, all keys in the outputs of all the mappers have to be identical).

In the Combiner class:

- Iterates through the all individual error gradient vectors output by the mapper (since they all have the same key) and adds them up to get a partial batch gradient.

In the Reducer class:

- There's a single reducer class that will combine all the partial gradients from the Mappers to get the overall batch gradient.
- The final error gradient vector is written back to the FileSystem

### I propose to complete all of the following sub-tasks during GSoC 2010:

Implementation of the Backpropagation algorithm:

- Initialization of weights: using the Nguyen-Widrow algorithm to select the initial range of starting weight values.
- Input, transfer and error functions: implement basic input functions like WeightedSum and transfer functions like Sigmoid, Gaussian, tanh and linear. Implement the sum-of-squares error function.
- Optimization methods to update the weights: (a) Batch Gradient descent, with momentum and a variable learning rate method [2] (b) A Conjugate gradient method with Brent's line search.

Improving generalization:

- Validating the network to test for overfitting (Early stopping method)
- Regularization (weight decay method)

Create examples for:

- Classification: using the Abalone Data Set from UCI Machine Learning Repository
- Classification, Regression: Breast Cancer Wisconsin (Prognostic) Data Set

If time permits, also implement:

- Resilient Backpropagation (RPROP)

## III. Week Plan with list of deliverables

- (Till May 23rd, community bonding period)

Brainstorm with my mentor and the Apache Mahout community to come up with the most optimal design for an extensible Neural Network framework. Code prototypes to identify potential problems and/or investigate new solutions.

Deliverable: A detailed report or design document on how to implement the basic Neural Network framework and the learning algorithms.

- (May 24th, coding starts) Week 1:

Deliverables: Basic Neural network classes (Neuron, Connection, Weight, Layer, LearningRule, NeuralNetwork etc) and the various input, transfer and error functions mentioned previously.

- (May 31st) Week 2 and Week 3:

Deliverable: Driver, Mapper, Combiner and Reducer classes with basic functonality to run a feedforward Neural Network on Hadoop (no training methods yet, weights are generated using Nguyen-Widrow algorithm).

- (June 14th) Week 4:

Deliverable: Backpropagation algorithm using standard Batch Gradient descent.

- (June 21st) Week 5:

Deliverables: Variable learning rate and momentum during Batch Gradient descent. Validation tests support. Do some big tests.

- (June 28th) Week 6:

Deliverable: Support for Early stopping and Regularization (weight decay) during training.

- (July 5th) Week 7 and Week 8:

Deliverable: Conjugate gradient method with Brent's line search algorithm.

- (July 19th) Week 9:

Deliverable: Write unit tests. Do bigger scale tests for both batch gradient descent and conjugate gradient method.

- (July 26th) Week 10 and Week 11:

Deliverable: 2 examples of classification and regression on real-world datasets from UCI Machine Learning Repository. More tests.

- (August 9th, tentative 'pencils down' date) Week 12:

Deliverable: Wind up the work. Scrub code. Improved documentation, tutorials (on the wiki) etc.

- (August 16: Final evaluation)

## IV. Additional Information

I am a final year Computer Science student at NIT Allahabad (India) graduating in May. For my final year project/thesis, I am working on Open Domain Question Answering. I participated in GSoC last year for the Apertium machine translation system (http://google-opensource.blogspot.com/2009/11/apertium-projects-first-google-summer.html). I am familiar with the three major opensource Neural Network frameworks in Java, JOONE, Encog and Neuroph since I have used them in past projects on fingerprint recognition and face recognition (during a summer course on image and speech processing). My research interests are machine learning and statistical natural language processing and I will be enrolling for a Ph.D. next semester(i.e. next fall) in the same institute.

I have no specific time constraints throughout the GSoC period. I will devote a minimum of 6 hours everyday to GSoC.

Time offset: UTC+5:30 (IST)

## V. References

[1] Fast parallel off-line training of multilayer perceptrons, S McLoone, GW Irwin - IEEE Transactions on Neural Networks, 1997

[2] Optimization of the backpropagation algorithm for training multilayer perceptrons, W. Schiffmann, M. Joost and R. Werner, 1994

[3] Map-Reduce for Machine Learning on Multicore, Cheng T. Chu, Sang K. Kim, Yi A. Lin, et al - in NIPS, 2006

[4] Neural networks for pattern recognition, CM Bishop - 1995 [BOOK]