Matrix Factorization

Given an input matrix A (with some missing entries), MF learns two matrices W and H such that W*H approximately equals A (except where elements of A are missing). If A is N-by-M, then W will be N-by-K and H will be K-by-M. Here, K is a user-supplied parameter (the “rank”) that controls the accuracy of the factorization. Higher values of K usually yield a more accurate factorization, but require more computation.

MF is commonly used to perform Collaborative Filtering, where A represents the known relationships between two categories of things. For example, A(i,j) = v might mean that “person i gave product j rating v”. If some relationships A(i,j) are missing, we can use the learnt matrices W and H to predict them:

A(i,j) = W(i,1)*H(1,j) + W(i,2)*H(2,j) + ... + W(i,K)*H(K,j)
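
As a minimal sketch of this prediction rule, the plain-Python snippet below computes a predicted entry as the dot product of row i of W and column j of H. The function name predict and the toy factor values are illustrative assumptions, not part of PMLS. Note that the code uses 0-based indices, whereas the formula above counts from 1.

```python
def predict(W, H, i, j):
    """Return the predicted A(i,j): dot product of row i of W and column j of H."""
    K = len(H)  # the rank K equals the number of rows of H
    return sum(W[i][k] * H[k][j] for k in range(K))


# Made-up factors: W is 2-by-2 (N=2, K=2) and H is 2-by-3 (K=2, M=3).
W = [[1.0, 2.0],
     [0.5, 0.0]]
H = [[1.0, 0.0, 2.0],
     [3.0, 1.0, 0.0]]

print(predict(W, H, 0, 0))  # 1.0*1.0 + 2.0*3.0 = 7.0
print(predict(W, H, 1, 2))  # 0.5*2.0 + 0.0*0.0 = 1.0
```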

The PMLS MF app uses a model-parallel coordinate descent scheme, implemented on the Strads scheduler. If you would like to use the older Bösen-based PMLS MF app, you may obtain it from the PMLS v0.93 release.

Performance

The Strads MF app finishes training on the Netflix dataset (480k by 20k matrix) with rank=40 in 2 minutes, using 25 machines (16 cores each).

Quick start

PMLS MF uses the Strads scheduler, and can be found in src/strads/apps/cdmf_release/. From this point on, all instructions assume you are in that directory. After building the main PMLS libraries (as explained under Installation), build the MF app by running

make

Test the app (on your local machine) by running

./run.py

This will perform a rank K=40 decomposition on a synthetic 10k-by-10k matrix, and output the factors W and H to tmplog/wfile-mach-* and tmplog/hfile-mach-* respectively.

Input data format

The MF app uses the MatrixMarket format:

%%MatrixMarket matrix coordinate real general
num_rows num_cols num_nonzeros
row col value 
row col value 
row col value 

The first line is the MatrixMarket header, and should be copied as-is. The second line gives the number of rows N, the number of columns M, and the number of non-zero entries in the matrix. This is followed by num_nonzeros lines, each representing a single matrix entry A(row,col) = value. Here row and col are 0-indexed, unlike the standard MatrixMarket convention, which is 1-indexed.
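
To illustrate the layout, here is a small sketch that writes a list of (row, col, value) triples in this format. The helper name write_matrix_market, the file name ratings.mmt, and the sample ratings are assumptions made for the example; the function simply emits the header, the size line, and one line per entry, using indices exactly as given.

```python
import os
import tempfile


def write_matrix_market(path, num_rows, num_cols, entries):
    """Write (row, col, value) triples in the coordinate format described above."""
    with open(path, "w") as f:
        # Header line, copied as-is.
        f.write("%%MatrixMarket matrix coordinate real general\n")
        # Size line: num_rows num_cols num_nonzeros.
        f.write("{} {} {}\n".format(num_rows, num_cols, len(entries)))
        # One line per known entry; indices are written unchanged.
        for row, col, value in entries:
            f.write("{} {} {}\n".format(row, col, float(value)))


# Made-up ratings for a 2-by-3 matrix with three known entries.
ratings = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0)]
path = os.path.join(tempfile.mkdtemp(), "ratings.mmt")
write_matrix_market(path, 2, 3, ratings)

with open(path) as f:
    print(f.read())
```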

Output format

The MF app outputs W and H to tmplog/wfile-mach-* and tmplog/hfile-mach-* respectively. The W files have the following format:

row-id: value-0 value-1 ... value-(K-1)
row-id: value-0 value-1 ... value-(K-1)
...

Each line represents one row in W, beginning with the row index row-id, and followed by all K values that make up the row.

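The H files follow a similar format, except that each line represents one column of H, beginning with the column index col-id and followed by the K values that make up that column: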

col-id: value-0 value-1 ... value-(K-1)
col-id: value-0 value-1 ... value-(K-1)
...

The number of files wfile-mach-* and hfile-mach-* depends on the number of worker processes used; see the next section for more information.
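
The following is a minimal sketch of how these files could be read back into Python. It assumes that every line has exactly the "id: values" shape shown above, with an integer index and whitespace-separated values, and that each index appears in only one file. The function names parse_factor_lines and load_factors are hypothetical helpers, not part of PMLS.

```python
import glob


def parse_factor_lines(lines):
    """Parse 'id: v0 v1 ... v(K-1)' lines into a dict {id: [v0, ..., v(K-1)]}."""
    factors = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        index, _, values = line.partition(":")
        factors[int(index)] = [float(v) for v in values.split()]
    return factors


def load_factors(prefix):
    """Merge all files whose names start with prefix, e.g. 'tmplog/wfile-mach-'."""
    factors = {}
    for path in sorted(glob.glob(prefix + "*")):
        with open(path) as f:
            factors.update(parse_factor_lines(f))
    return factors


# Made-up sample lines in the W file format, with K=3.
sample = ["0: 0.1 0.2 0.3", "2: 1.0 -0.5 0.25"]
W_rows = parse_factor_lines(sample)
print(W_rows[2])
```

After a training run, load_factors("tmplog/wfile-mach-") and load_factors("tmplog/hfile-mach-") would collect the rows of W and the columns of H across all worker output files.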

Program options

The MF app is launched using a Python script, such as the run.py used earlier:

#!/usr/bin/python
import os
import sys

datafile = ['./sampledata/mftest.mmt ']
threads = [' 16 ']
rank = [' 40 ']
iterations = [' 10 ']
lambda_param = [' 0.05 ']

machfile = ['./singlemach.vm']

prog = ['./bin/lccdmf ']
os.system("mpirun -machinefile "+machfile[0]+" "+prog[0]+" --machfile "+machfile[0]+" -threads "+threads[0]+" -num_rank "+rank[0]+" -num_iter "+iterations[0]+" -lambda "+lambda_param[0]+" -data_file "+datafile[0]+"  -wfile_pre tmplog/wfile -hfile_pre tmplog/hfile");

The basic options are:

  • datafile: Path to the data file, which must be present/visible to all machines. We strongly recommend providing the full path name to the data file.
  • threads: How many threads to use for each worker.
  • rank: The desired decomposition rank K.
  • iterations: How many iterations to run.
  • lambda_param: Regularization parameter (PMLS MF uses an L2 regularizer).
  • machfile: Strads machine file; see below for details.
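
To make the mapping from these options to command-line flags explicit, the sketch below assembles the same mpirun invocation as an argument list instead of a concatenated string. The flag names are copied from run.py above; the function name build_command and its keyword defaults are assumptions for illustration. The snippet only builds and prints the command and does not launch anything.

```python
def build_command(machfile, prog, datafile, threads, rank, iterations,
                  lambda_param, wfile_pre="tmplog/wfile",
                  hfile_pre="tmplog/hfile"):
    """Return the launch command from run.py as a list of arguments."""
    return [
        "mpirun", "-machinefile", machfile,
        prog, "--machfile", machfile,
        "-threads", str(threads),
        "-num_rank", str(rank),
        "-num_iter", str(iterations),
        "-lambda", str(lambda_param),
        "-data_file", datafile,
        "-wfile_pre", wfile_pre,
        "-hfile_pre", hfile_pre,
    ]


# Same settings as the run.py example above.
cmd = build_command(
    machfile="./singlemach.vm",
    prog="./bin/lccdmf",
    datafile="./sampledata/mftest.mmt",
    threads=16,
    rank=40,
    iterations=10,
    lambda_param=0.05,
)
print(" ".join(cmd))
```

A list like this can be passed to subprocess.call(cmd) from the Python standard library, which avoids the stray-whitespace issues that string concatenation can introduce.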

Strads requires a machine file (singlemach.vm in the above example). Strads machine files control which machines house Workers, the Scheduler, and the Coordinator (the 3 architectural elements of Strads). In singlemach.vm, we spawn all element processes on the local machine 127.0.0.1, so the file simply looks like this:

127.0.0.1
127.0.0.1
127.0.0.1
127.0.0.1

To prepare a multi-machine file, please refer to the Strads section under Configuration Files for PMLS Apps.