[SPARK-5575] Artificial neural networks for MLlib deep learning - ASF JIRA

Details

Type: Umbrella
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 1.2.0
Fix Version/s: None
Component/s: MLlib
Labels:
- bulk-closed

Description

Goal: Implement various types of artificial neural networks

Motivation: (from https://issues.apache.org/jira/browse/SPARK-15581)
Having deep learning within Spark's ML library is a question of convenience. Spark has broad analytic capabilities, and it is useful to have deep learning at hand as one of these tools. Deep learning is a model of choice for several important modern use cases, and Spark ML might want to cover them. After all, it is hard to explain why we have PCA in ML but do not provide an autoencoder. To summarize, Spark should have at least the most widely used deep learning models, such as the fully connected artificial neural network, the convolutional network, and the autoencoder. Advanced and experimental deep learning features might reside in packages or as pluggable external tools. These three models will provide a comprehensive deep learning set for Spark ML. We might also include recurrent networks.

Requirements:

- Extensible API compatible with Spark ML. Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and Backpropagation should be implemented as traits or interfaces so that they can be easily extended or reused. Define the Spark ML API for deep learning. This interface is similar to the other analytics tools in Spark and supports ML pipelines, which makes deep learning easy to use and to plug into analytics workloads for Spark users.
- Efficiency. The current implementation of the multilayer perceptron in Spark is less than 2x slower than Caffe, both measured on CPU. The main sources of overhead are the JVM and Spark's communication layer. For more details, please refer to https://github.com/avulanov/ann-benchmark. That said, an efficient implementation of deep learning in Spark should be only a few times slower than a specialized tool. This is very reasonable for a platform that does much more than deep learning, and I believe the community understands it.
- Scalability. Implement efficient distributed training. It relies heavily on efficient communication and scheduling mechanisms. The default implementation is based on Spark. More efficient implementations might use external libraries but keep the same interface.
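
To make the "traits or interfaces" requirement more concrete, here is a minimal sketch in plain Python of what a Layer abstraction with forward and backward passes could look like. All names (Layer, Affine, Sigmoid, forward_all, backward_all, squared_error) are hypothetical and chosen for illustration only; this is not the internal Spark ANN API, and it ignores distribution, batching and regularization.

```python
import math
from abc import ABC, abstractmethod


class Layer(ABC):
    """Hypothetical layer interface: one forward and one backward method."""

    @abstractmethod
    def forward(self, x):
        """Map an input vector (list of floats) to an output vector."""

    @abstractmethod
    def backward(self, x, grad_out):
        """Return (dL/dx, parameter gradients) given the input and dL/d(output)."""


class Affine(Layer):
    """y = W x + b, with W stored as a list of rows."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def forward(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(self.weights, self.bias)]

    def backward(self, x, grad_out):
        # dL/dx = W^T * grad_out
        grad_in = [sum(self.weights[j][i] * grad_out[j]
                       for j in range(len(grad_out)))
                   for i in range(len(x))]
        # dL/dW[j][i] = grad_out[j] * x[i]; dL/db = grad_out
        grad_w = [[g * xi for xi in x] for g in grad_out]
        return grad_in, (grad_w, list(grad_out))


class Sigmoid(Layer):
    """Element-wise logistic activation; it has no parameters."""

    def forward(self, x):
        return [1.0 / (1.0 + math.exp(-v)) for v in x]

    def backward(self, x, grad_out):
        y = self.forward(x)
        return [g * yi * (1.0 - yi) for g, yi in zip(grad_out, y)], None


def forward_all(layers, x):
    """Run the stack, keeping each layer's input for the backward pass."""
    inputs = []
    for layer in layers:
        inputs.append(x)
        x = layer.forward(x)
    return x, inputs


def backward_all(layers, inputs, grad_out):
    """Backpropagate dL/d(output); return per-layer gradients and dL/d(input)."""
    grads = []
    for layer, x in zip(reversed(layers), reversed(inputs)):
        grad_out, grad_params = layer.backward(x, grad_out)
        grads.append(grad_params)
    grads.reverse()
    return grads, grad_out


def squared_error(output, target):
    """Error function: returns the loss and its gradient w.r.t. the output."""
    loss = 0.5 * sum((o - t) ** 2 for o, t in zip(output, target))
    return loss, [o - t for o, t in zip(output, target)]
```

A new layer type only has to implement forward and backward to take part in training, which is the kind of extension point the requirement asks for.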

Main features:

- Multilayer perceptron classifier (MLP)
- Autoencoder
- Convolutional neural networks for computer vision. The interface has to provide a few architectures for deep learning that are widely used in practice, such as AlexNet

Additional features:

- Other architectures, such as recurrent neural network (RNN), long short-term memory (LSTM), restricted Boltzmann machine (RBM), deep belief network (DBN), and MLP multivariate regression
- Regularizers, such as L1, L2, and dropout
- Normalizers
- Network customization. The internal API of Spark ANN is designed to be flexible and can handle different types of layers. However, only part of the API is public. We have to limit the number of public classes to make it simpler to support other languages. This forces us to use String or Number parameters instead of introducing new public classes. One option for specifying the architecture of an ANN is a text configuration with a layer-wise description. We have considered using the Caffe format for this. It offers compatibility with a well-known deep learning tool and simplifies the support of other languages in Spark. Implementing a parser for a subset of the Caffe format might be the first step towards supporting general ANN architectures in Spark.
- Hardware-specific optimization. One can wrap other deep learning implementations with this interface, allowing users to pick a particular back-end, e.g. Caffe or TensorFlow, alongside the default one. The interface has to provide a few architectures for deep learning that are widely used in practice, such as AlexNet. The main motivation for using specialized libraries for deep learning would be to take full advantage of the hardware on which Spark runs, in particular GPUs. With the default interface in Spark, we will need to wrap only a subset of functions from a given specialized library. This does require effort, but it is not the same as wrapping all functions. Wrappers can be provided as packages without pulling new dependencies into Spark.
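
As an illustration of the layer-wise text configuration mentioned under network customization, the sketch below parses a made-up, line-oriented format into plain layer specifications. The format ("<kind> <size> [activation]"), the names LayerSpec, parse_topology and layer_sizes, and the sets of known kinds and activations are all assumptions for this example. It is not the Caffe format and not an existing Spark API; a real parser for a subset of the Caffe format would be considerably larger.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSpec:
    """One parsed line of the hypothetical topology format."""
    kind: str
    size: int
    activation: str = "identity"


# Assumed vocabulary for this sketch only.
KNOWN_KINDS = {"input", "dense"}
KNOWN_ACTIVATIONS = {"identity", "sigmoid", "softmax"}


def parse_topology(text):
    """Parse lines of the form '<kind> <size> [activation]'; '#' starts a comment."""
    specs = []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # skip blank lines and comment-only lines
        parts = line.split()
        if len(parts) not in (2, 3):
            raise ValueError(f"line {line_no}: expected '<kind> <size> [activation]'")
        kind = parts[0]
        activation = parts[2] if len(parts) == 3 else "identity"
        if kind not in KNOWN_KINDS:
            raise ValueError(f"line {line_no}: unknown layer kind '{kind}'")
        try:
            size = int(parts[1])
        except ValueError:
            raise ValueError(f"line {line_no}: size must be an integer") from None
        if size <= 0:
            raise ValueError(f"line {line_no}: size must be positive")
        if activation not in KNOWN_ACTIVATIONS:
            raise ValueError(f"line {line_no}: unknown activation '{activation}'")
        specs.append(LayerSpec(kind, size, activation))
    if not specs or specs[0].kind != "input":
        raise ValueError("the first layer must be 'input'")
    if any(spec.kind == "input" for spec in specs[1:]):
        raise ValueError("only the first layer may be 'input'")
    return specs


def layer_sizes(specs):
    """Reduce the specs to a plain list of integers, one size per layer."""
    return [spec.size for spec in specs]
```

For example, parse_topology("input 4\ndense 5 sigmoid\ndense 3 softmax") yields three LayerSpec values, and layer_sizes reduces them to [4, 5, 3]. The public surface stays limited to strings and numbers, which is the constraint described above for supporting other languages.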

Completed (merged to the main Spark branch):

Requirements: https://issues.apache.org/jira/browse/SPARK-9471
- API: https://spark-summit.org/eu-2015/events/a-scalable-implementation-of-deep-learning-on-spark/
- Efficiency & Scalability: https://github.com/avulanov/ann-benchmark
Features:
- Multilayer perceptron classifier: https://issues.apache.org/jira/browse/SPARK-9471

In progress (pull request):

Features:
- Autoencoder: https://issues.apache.org/jira/browse/SPARK-2623

Additional features:
- MLP regression: https://issues.apache.org/jira/browse/SPARK-10409

Scalable deep learning package:

This package is intended for new Spark deep learning features that have not yet been merged into Spark ML or that are too specific to be merged: https://spark-packages.org/package/avulanov/scalable-deeplearning

Issue Links

incorporates
- SPARK-2352 [MLLIB] Add Artificial Neural Network (ANN) to Spark (Resolved)
- SPARK-10408 Autoencoder (Resolved)
- SPARK-10409 Multilayer perceptron regression (Resolved)
- SPARK-10627 Regularization for artificial neural networks (Resolved)

is duplicated by
- SPARK-2352 [MLLIB] Add Artificial Neural Network (ANN) to Spark (Resolved)

is related to
- SPARK-2352 [MLLIB] Add Artificial Neural Network (ANN) to Spark (Resolved)
- SPARK-2623 Stacked Auto Encoder (Deep Learning) (Resolved)
- SPARK-4251 Add Restricted Boltzmann machine(RBM) algorithm to MLlib (Resolved)
- SPARK-9471 Multilayer perceptron classifier (Resolved)
- SPARK-2351 Add Artificial Neural Network (ANN) to Spark (Closed)
- SPARK-4752 Classifier based on artificial neural network (Closed)
- SPARK-9273 Add Convolutional Neural network to Spark MLlib (Closed)
- SPARK-4288 Add Sparse Autoencoder algorithm to MLlib (Resolved)

relates to
- SPARK-8449 HDF5 read/write support for Spark MLlib (Resolved)

