Home

Foreword - please read

The PMLS project is organized into four open-source (BSD 3-clause license) GitHub repositories:

  • Bösen, a bounded-asynchronous key-value store for data-parallel ML algorithms
  • Strads, a scheduler for model-parallel ML algorithms
  • JBösen, a Java implementation of Bösen
  • PMLS-Caffe, for deep learning

To install Bösen and Strads, please continue reading this manual. If you have a Java environment and want to use JBösen, please start with the JBösen documentation. If you wish to use PMLS-Caffe for Deep Learning, please see the PMLS-Caffe documentation.

Introduction to PMLS

PMLS is a distributed machine learning framework. It takes care of the difficult system “plumbing work”, allowing you to focus on the ML. PMLS runs efficiently at scale on research clusters and on cloud platforms such as Amazon EC2 and Google GCE.

PMLS provides essential distributed programming tools to tackle the challenges of ML at scale: Big Data (many data samples) and Big Models (very large parameter and intermediate-variable spaces). To address these challenges, PMLS provides two key platforms:

  • Bösen, a bounded-asynchronous key-value store for Data-Parallel ML algorithms
  • Strads, a scheduler for Model-Parallel ML algorithms

Unlike general-purpose distributed programming platforms, PMLS is designed specifically for ML algorithms: it takes advantage of data correlation, staleness, and other statistical properties to maximize performance for ML algorithms.

ML programs are built around update functions that are applied iteratively until convergence.
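
To make the pattern concrete, here is a minimal, self-contained C++ sketch of the iterate-until-convergence loop. It is illustrative only, not PMLS code: Model, ComputeDelta, and Converged are hypothetical names, and gradient descent on a toy quadratic stands in for a real ML objective.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Illustrative sketch only (not the PMLS API). An ML program iterates an
    // update function over data and model parameters until convergence.
    // Model, ComputeDelta, and Converged are hypothetical placeholder names;
    // gradient descent on f(x) = x^2 stands in for a real ML objective.
    using Model = std::vector<double>;

    Model ComputeDelta(const Model& model) {  // the update function
      Model delta(model.size());
      for (std::size_t i = 0; i < model.size(); ++i)
        delta[i] = -0.1 * 2.0 * model[i];     // -learning_rate * gradient
      return delta;
    }

    bool Converged(const Model& before, const Model& after) {
      double change = 0.0;
      for (std::size_t i = 0; i < before.size(); ++i)
        change += std::fabs(after[i] - before[i]);
      return change < 1e-8;                   // stop when updates are tiny
    }

    int main() {
      Model model = {5.0, -3.0};              // initial parameters
      while (true) {
        Model before = model;
        Model delta = ComputeDelta(model);
        for (std::size_t i = 0; i < model.size(); ++i)
          model[i] += delta[i];               // apply the change to the model
        if (Converged(before, model)) break;  // iterate until convergence
      }
    }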

Data and Model Parallelism

The update function takes the data and model parameters as input, and outputs a change to the model parameters. Data parallelism divides the data among different workers, whereas model parallelism divides the parameters among different workers. Both styles of parallelism can be found in modern ML algorithms: for example, Sparse Coding via Stochastic Gradient Descent is a data-parallel algorithm, while Lasso regression via Coordinate Descent is a model-parallel algorithm. The PMLS Bösen and Strads systems are built to enable data-parallel and model-parallel styles, respectively.
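
The difference between the two styles comes down to what gets partitioned. The following C++ is a hypothetical illustration rather than the PMLS API: DataShards and ParamShards are made-up helpers contrasting a data split (each worker holds the full model but a slice of the samples) with a parameter split (each worker owns only a slice of the coordinates).

    #include <cstddef>
    #include <vector>

    // Hypothetical illustration, not the PMLS API: DataShards and ParamShards
    // are made-up helpers contrasting the two partitioning styles.
    struct Sample { std::vector<double> features; double label; };

    // Data parallelism: every worker holds the full model but computes its
    // update (e.g., an SGD gradient) from its own slice of the samples.
    std::vector<std::vector<Sample>> DataShards(const std::vector<Sample>& data,
                                                std::size_t num_workers) {
      std::vector<std::vector<Sample>> shards(num_workers);
      for (std::size_t i = 0; i < data.size(); ++i)
        shards[i % num_workers].push_back(data[i]);  // round-robin data split
      return shards;
    }

    // Model parallelism: every worker updates only its own slice of the
    // parameters (e.g., one pass of coordinate descent for Lasso).
    std::vector<std::vector<std::size_t>> ParamShards(std::size_t num_params,
                                                      std::size_t num_workers) {
      std::vector<std::vector<std::size_t>> shards(num_workers);
      for (std::size_t j = 0; j < num_params; ++j)
        shards[j % num_workers].push_back(j);        // round-robin param split
      return shards;
    }

    int main() {
      std::vector<Sample> data(1000);
      auto data_split  = DataShards(data, 4);   // data-parallel: 4 x 250 samples
      auto param_split = ParamShards(10000, 4); // model-parallel: 4 x 2500 coords
      (void)data_split; (void)param_split;
    }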

Key PMLS features

  • Runs on compute clusters and cloud compute, scaling to hundreds of machines
  • Bösen, a bounded-asynchronous distributed key-value store for data-parallel ML programming
    • Bösen uses the Stale Synchronous Parallel (SSP) consistency model, which allows asynchronous-like performance that outperforms MapReduce and bulk synchronous execution, yet does not sacrifice ML algorithm correctness (a sketch of the staleness rule appears after this list)
  • Strads, a dynamic scheduler for model-parallel ML programming
    • Strads performs fine-grained scheduling of ML update operations, prioritizing computation on the parts of the ML program that need it most, while avoiding unsafe parallel operations that could hurt performance
  • Programming interfaces for C++ and Java
  • YARN and HDFS support, allowing execution on Hadoop clusters
  • ML library with 10+ ready-to-run algorithms
    • Newer algorithms such as discriminative topic models, deep learning, distance metric learning, and sparse coding
    • Classic algorithms such as logistic regression, k-means, and random forest
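
To illustrate the consistency model mentioned above, here is a hedged C++ sketch of the SSP staleness rule; it is not Bösen's actual implementation, and SspClock, MayProceed, and Commit are invented names. With staleness bound s, a worker may begin clock c only if the slowest worker has committed at least clock c - s; setting s = 0 recovers bulk synchronous execution.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hedged sketch of the SSP staleness rule; not Bösen's actual
    // implementation. SspClock, MayProceed, and Commit are invented names.
    struct SspClock {
      std::vector<std::int64_t> worker_clock;  // last clock each worker committed
      std::int64_t staleness;                  // the bound s

      // A worker may begin clock `next_clock` only if the slowest worker
      // has committed at least next_clock - s.
      bool MayProceed(std::int64_t next_clock) const {
        std::int64_t slowest =
            *std::min_element(worker_clock.begin(), worker_clock.end());
        return next_clock <= slowest + staleness;
      }

      void Commit(std::size_t worker) { ++worker_clock[worker]; }
    };

    int main() {
      SspClock ssp{{3, 5, 4}, /*staleness=*/2};     // three workers, s = 2
      bool fast_worker_waits = !ssp.MayProceed(6);  // 6 > 3 + 2: must wait
      bool mid_worker_runs   =  ssp.MayProceed(5);  // 5 <= 3 + 2: may proceed
      ssp.Commit(0);                                // slowest worker advances a clock
      (void)fast_worker_waits; (void)mid_worker_runs;
    }

Because a read can miss at most s clocks of recent updates, the error introduced by staleness is bounded; this is the sense in which SSP lets workers run ahead asynchronously without sacrificing algorithm correctness.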

Support and Bug reports

For support or to report a bug, please send email to pmls-support@googlegroups.com. Please provide your name and affiliation; we do not support anonymous inquiries.