Non-negative Matrix Factorization (NMF)¶
After compiling Bösen, we can test that it is working correctly with Non-negative Matrix Factorization, a popular algorithm for recommender systems.
Introduction to NMF¶
Given an input matrix X
, the NMF app on Bösen learns two non-negative matrices L
and R
such that L*R
is approximately equal to X
.
If X
is N-by-M, then L
will be N-by-K and R
will be K-by-M where N is the number of data points, M is the dimension of the data, K is a user-supplied parameter that controls the rank of the factorization. The objective of NMF app is to minimize the function ||X - L*R||^2
subject to the constraint that the elements in matrices are non-negative.
Our NMF app uses the projected Stochastic Gradient Descent (SGD) algorithm to learn the two matrices L
and R
.
Quick Start¶
The NMF app can be found in bosen/app/NMF/
. From this point on, all instructions will assume you are in bosen/app/NMF/
. After building PMLS (as explained earlier in this manual), you can build the NMF app from bosen/app/NMF
by running
make
This will put the NMF binary in the subdirectory bin/
.
Create directory
mkdir sample
mkdir sample/data
mkdir sample/output
Create a simulated dataset:
./script/make_synth_data.py 3 3 sample/data/sample.txt
Change app_dir
in script/run_local.py
; see example below. You need to provide the full path to the NMF directory. This will allow for correct YARN operation (if required).
app_dir = "/home/user/bosen/app/NMF"
then you can test that the app works on the local machine (with 4 worker threads). From the bosen/app/NMF/
directory, run:
./script/launch.py
The NMF app runs in the background (progress is output to stdout). After a few minutes, you should get the following output files in the subdirectory sample/output/
:
L.txt.0
R.txt
loss.txt
time.txt
The file L.txt.0
and the file R.txt
are the optimization results of two matrices L
and R
, which looks something like this (note that there exist many valid solutions; the NMF app produces one at random):
0.80641 0 0
0.80641 0 0
0.80641 0 0
0 0 1.47865
0 0 1.47865
0 0 1.47865
0 2.2296 0
0 2.2296 0
0 2.2296 0
1.24006 1.24006 1.24006 0 0 0 0 0 0
0 0 0 0 0 0 1.34553 1.34553 1.34553
0 0 0 1.35258 1.35258 1.35258 0 0 0
The file loss.txt
contains the statistics of loss function evaluated at iterations specified by user. If there are num_client
machines in the experiment, it will has num_client
columns, and the i-th column is the loss function evaluated by the i-th machine. Note that loss function is evaluated by averaging loss function at randomly sampled data points. The file time.txt
also has num_client
columns, and contains the time (seconds) taken between evaluations on each client (Evaluating time is excluded). Don’t be surprised if the first row of time.txt
is nearly 0, that’s because the loss function is evaluated once before any optimization is conducted. Also note that at the end of time.txt
and loss.txt
there might be some N/A
s, that’s invalid and is due to the fact that different clients may evaluate different times.
You won’t see the exact same numeric values in different runs as there are multiple optimums of the optimization problem. All you need to confirm is that in optimized L
, lines 1-3, lines 4-6 and line 7-9 shall be similar. Similarly, in optimized R
, columns 1-3, 4-6, 7-9 shall be similar.
Data format¶
The NMF app takes a dense matrix X
as input, where each row corresponds to a data sample and each column corresponds to a feature. The file containing X
should be a binary or text file, and elements are sorted by rows then columns:
ele_0_0 ele_0_1 ... ele_0_n-1
ele_1_0 ele_1_1 ... ele_1_n-1
...
ele_n-1_0 ele_n-1_1 ... ele_n-1_n-1
You must specify the format of file (binary or text) by using the parameter file_format
, which will be explained in Running the NMF application section.
Creating synthetic data¶
We provide a synthetic data generator for testing purposes (we have tested it on python 2.7). To see detailed instructions, run
python script/make_synth_data.py
Running the NMF application¶
There are many parameters involved for running NMF. We provide default values for all of them in script/run_local.py
. Some important ones are:
- Input files
data_filename="sample/data/sample.txt"
: The data file name. The data file, which represents input matrixX
, can be text file or binary file. The format of file needs to be specified indata_format
. Matrix elements shall be sorted by rows then columns. Matrix elements in text file needs to be separated by blank characters. Matrix elements in binary file needs to be single-precision floating-point format which occupies 4 bytes each.is_partitioned=0
: Indicates whether or not the input file has been partitioned. For distributed setting, each machine needs to access their part of data. Ifis_partitioned
is set to 0, then each machine needs to get access to the whole data file and NMF app will do the partitioning automatically. Otherwise, the whole data file needs to be partitioned by column id (See Data partitioning for details), and each machine will read file named data_filename.client_id, e.g. if data_filename is “sample.txt”, then machine 0 will read file “sample.txt.0”.data_format="text"
: Specify the format of input and output data. Can be “text” or “binary”. Note that the format oftime.txt
andloss.txt
does not depend on this parameter and is always text.load_cache=0
: Specify if whether or not load previous saved results. Ifload_cache
is set to 1, the app will load the results in directorycache_dirname
, which has to contain the matrixR
and the partial matrixLi
. The data format of files incache_dirname
need to be the same as specified indata_format
.cache_dirname="N/A"
: Required ifload_cache
is set to 1. Seeload_cache
.
- Output files
output_dirname="sample/output"
: Directory where the results will be put into. Theoutput_dirname
is path relative tobosen/app/NMF
. Setoutput_path
directly if you want to use absolute path.
- Objective function parameters
m
: Dimension of input data. It is also the number of columns (for text format input file) in input data.n
: Size of input data. It is also the number of rows (for text file) in input data.rank
: Rank of matrix factorization.
- Optimization parameters
num_epochs=500
: Perform how many epochs during optimization. Each epoch approximately visit all data points once (not exactly because the data points are visited stochastically).minibatch_size=100
: Size of minibatch.init_step_size=0.01
: Base init step size. The step size at the t-th iter is in the form ofinit_step_size
* (t +step_size_offset
)^(-step_size_pow
). As we are alternatively optimizingR
andL
, and there might be multiple threads or multiple clients running, the actual step size forR
andL
is rescaled by a factor related to number of threads and dimension of data. For most applications, it is enough to only tune this parameter, keepingstep_size_offset
andstep_size_pow
to 0. If the output contains nan during optimization, then theinit_step_size
shall be decreased. But smaller step size may result in lower convergence speed.step_size_offset=0.0
: Seeinit_step_size
.step_size_pow=0.0
: Seeinit_step_size
.init_L_low=0.0
: Elements in matrixL
are initialized from uniform distribution with lower boundinit_L_low
and upper boundinit_L_high
.init_L_high=0.0
: Elements in matrixL
are initialized from uniform distribution with lower boundinit_L_low
and upper boundinit_L_high
.init_R_low=0.0
: Elements in matrixR
are initialized from uniform distribution with lower boundinit_R_low
and upper boundinit_R_high
.init_R_high=0.0
: Elements in matrixR
are initialized from uniform distribution with lower boundinit_R_low
and upper boundinit_R_high
.num_iter_L_per_minibatch=10
: How many iterations to perform gradient on a randomly picked data point to update the corresponding row inL
matrix. The default value is enough for most applications. A bigger value will result in better optimization results at a given iteration, but at the cost of more time.init_step_size_R=0.0
: Optional. Valid when it is set to nonzero values. For advanced users, the step size forL
andR
can be set directly by settinginit_step_size_R
,step_size_offset_R
,step_size_pow_R
,init_step_size_L
,step_size_offset_L
,step_size_pow_L
directly. Note thatinit_step_size_R
orinit_step_size_L
must be set to nonzero values if you want to set them directly instead of using step size determined by base step size. For example, ifinit_step_size_R
is set to a nonzero value, the step size at the t-th iter will beinit_step_size_R
* (t +step_size_offset_R
)^(-step_size_pow_R
). The step size formula forL
is analogous.step_size_offset_R=0.0
: Optional. Seeinit_step_size_R
.step_size_pow_R=0.0
: Optional. Seeinit_step_size_R
.init_step_size_L=0.0
: Optional. Seeinit_step_size_R
.step_size_offset_L=0.0
: Optional. Seeinit_step_size_R
.step_size_pow_L=0.0
: Optional. Seeinit_step_size_R
.
- Evaluation parameters
num_eval_minibatch=100
: Evaluate objective function per how many minibatches.num_eval_samples=$n
: How many samples to pick for each worker thread to evaluate objective function. If the size of data is so large that evaluating objective function takes too much time, you can use either a smallernum_eval_samples
or a largernum_eval_minibatch
value.
- System parameters
num_worker_threads=4
: Number of threads running NMF on each machine.table_staleness=0
: The staleness for tables.maximum_running_time=0.0
: The app will try to terminate after runningmaximum_running_time
hours. Valid if the value is greater than 0.0.
After all parameters have been chosen appropriately, use the command ./script/run_local.py
, then NMF application will run in background. The results will be put as specified in output_dirname
. Matrix R
will be stored in B.txt
in client 0 (whose ip appears in the first line in hostfile). Matrix L
will be stored in a row-partitioned manner, i.e., client i will have a L.txt.i
in output_dirname
, and the whole matrix L
can be obtained by putting all L.txt.i
together, which will be explained in Data partitioning.
Terminating the NMF app¶
The NMF app runs in the background, and outputs its progress to log files in user-specified directory. If you need to terminate the app before it finishes, just run
./script/kill.py <petuum_ps_hostfile>
Data partitioning¶
If there are multiple machines in host file, each machine will only take a part of input matrix. Concretely, if there are num_client
clients, client i will read the j-th row of X
if j mod num_client
=i. For example, if the data matrix X
is:
1.0 1.0 1.0 1.0
2.0 2.0 2.0 2.0
3.0 3.0 3.0 3.0
4.0 4.0 4.0 4.0
5.0 5.0 5.0 5.0
In the above X
, the feature dimension is 4, and the size of data is 5. Suppose there are 3 machines in host file, then machine 0 will read the 1-st, 4-th row of X
, machine 2 will read the 2-nd, 5-th row of X
and machine 3 will read the 3-rd row of X
.
Each machine only needs to read part of data, so we provide a parameter is_partitoned
. In order to use partitioned data (for example, when each client’s disk is not enough to hold all data), is_partitioned
shall be set to 1, and data needs to be partitioned according to the above partitioning strategy. Note that the name of partitioned data on each client needs to be data_filename
.client_id. We have provided a tool for partitioning data in script/partition_data.py
. Run python script/partition_data.py
for its usage.
When using multiple machines, the result matrix L
will be stored distributedly corresponding to the part of input data that client reads. For example, In the above example, machine 0 will store L.txt.0
, which is the 1-st, 4-th row of L
. We have provided a tool for merging partitioned L.txt.i
together in script/merge_data.py
. Run python script/merge_data.py
for its usage.
File IO from HDFS¶
Put datasets to HDFS
hadoop fs -mkdir -p /user/bosen/dataset/nmf
hadoop fs -put sample /user/bosen/dataset/nmf
Change the corresponding file paths in script/run_local.py
to the right HDFS path. Comment out the local path.
# , "output_path": join(app_dir, "sample/output")
, "output_path": "hdfs://hdfs-domain/user/bosen/dataset/nmf/sample/output"
# , "data_file": join(app_dir, "sample/data/sample.txt")
, "data_file": "hdfs://hdfs-domain/user/bosen/dataset/nmf/sample/data/sample.txt"
Launch it over ssh
./script/launch.py
Check the output
hadoop fs -cat /user/bosen/dataset/nmf/sample/output/time.txt
Use Yarn to launch NMF app¶
./script/launch_on_yarn.py