Sparse Coding¶
Given an input matrix X
, the Sparse Coding app (implemented on Bösen) learns the dictionary B
and the coefficients matrix S
such that S*B
is approximately equal to X
and the coefficients matrix S
is sparse.
If X
is N-by-M, then B
will be K-by-M and S
will be N-by-K, where N is the number of data points, M is the dimension of the data, K is a user-supplied parameter that controls the size of the dictionary. The objective of sparse coding app is to minimize the function ||X - S*B||^2 + lambda*||S||_1
subject to the constraint that ||B(i,:)||_2 <= c
, where lambda
and c
are hyperparameters and lambda/c
controls the regulariztion strength.
Our Sparse Coding app uses the projected Stochastic Gradient Descent (SGD) algorithm to learn the dictionary B
and the coefficients S
.
Performance¶
On a dataset with 1M data instances; 22K feature dimension; 22M parameters, the Bösen Sparse Coding implementation converges in approximately 12.5 hours, using 8 machines (16 cores each).
Quick Start¶
The Sparse Coding app can be found in bosen/app/sparsecoding/
. From this point on, all instructions will assume you are in bosen/app/sparsecoding/
. After building PMLS (as explained earlier in this manual), you can build the Sparse Coding app from bosen/app/sparsecoding
by running
make
This will put the Sparse Coding binary in the subdirectory bin/
.
Create directory
mkdir sample
mkdir sample/data
mkdir sample/output
Create a simulated dataset:
./script/make_synth_data.py 5 100 6 sample/data/sample.txt 1
Change app_dir
in script/run_local.py
; see example below. You need to provide the full path to the Sparse Coding directory. This will allow for correct YARN operation (if required).
app_dir = "/home/user/bosen/app/sparsecoding"
then you can test that the app works on the local machine (with 4 worker threads). Now, run:
./script/launch.py
The Sparse Coding app runs in the background (progress is output to stdout). After a few minutes, you should get the following output files in the subdirectory sample/output/
:
B.txt
S.txt.0
loss.txt
time.txt
The file B.txt
and the file S.txt.0
are the optimization results of dictionary B
and coefficients S
, which looks like this:
0.187123 0.584781 0.716686 0.259561 -0.20495
0.378984 0.0456092 -0.667287 -0.218985 -0.600886
-0.863578 -0.162164 -0.244522 -0.171278 -0.372572
0.188104 0.558853 0.744098 0.252362 0.186899
-0.71181 -0.165808 -0.562194 0.00220546 -0.386997
0.075108 0.496782 0.0201851 0.692971 0.516672
0 -0.232664 0 0.273726 -0.123055 0.216856
0.342858 0 0 0.249487 0 0.121574
0.392845 0 -0.214571 0.486131 -0.486699 0
0.346318 0 0 0.258979 0 0.118132
0 -0.221088 0 0.297128 -0.113161 0.209509
0 0 -0.058158 0.0697259 -0.00593826 0.378869
0 -0.252994 0 0.286514 -0.102685 0.203422
0.156121 0 -0.0727321 0.183034 -0.0765836 0.0734904
0.359529 0 0 0.249562 0 0.120825
...
The file loss.txt
contains the statistics of loss function evaluated at iterations specified by user. If there are num_client
machines in the experiment, it will has num_client
columns, and the i-th column is the loss function evaluated by the i-th machine. Note that loss function is evaluated by averaging loss function at randomly sampled data points. The file time.txt
also has num_client
columns, and contains the time (seconds) taken between evaluations on each client (Evaluating time is excluded). Don’t be surprised if the first row of time.txt
is nearly 0, that’s because the loss function is evaluated once before any optimization is conducted. Also note that at the end of time.txt
and loss.txt
there might be some N/A
s, that’s invalid and is due to the fact that different clients may evaluate different times.
You won’t see the exact same numeric values in different runs as there are multiple optimums of the optimization problem. All you need to confirm is that the optimized S
shall be sparse and average objective function value of the final result shoud be lower than or close to 0.1.
Data format¶
The Sparse Coding app takes a dense matrix X
as input, where each row corresponds to a data sample and each column corresponds to a feature. The file containing X
should be a binary or text file, and elements are sorted by rows then columns:
ele_0_0 ele_0_1 ... ele_0_n-1
ele_1_0 ele_1_1 ... ele_1_n-1
...
ele_n-1_0 ele_n-1_1 ... ele_n-1_n-1
You must specify the format of file (binary or text) by using the parameter file_format
, which will be explained in Running the Sparse Coding application section.
Creating synthetic data¶
We provide a synthetic data generator for testing the app (requires Python 2.7). To see detailed instructions, run
python script/make_synth_data.py
Running the Sparse Coding application¶
There are many parameters involved for running Sparse Coding. We provide default values for all of them in scripts/run_local.py
. Some important ones are:
- Input files
data_filename="sample/data/sample.txt"
: The data file name. The data file, which represents input matrixX
, can be text file or binary file. The format of file needs to be specified indata_format
. Matrix elements shall be sorted by rows then columns. Matrix elements in text file needs to be separated by blank characters. Matrix elements in binary file needs to be single-precision floating-point format which occupies 4 bytes each.is_partitioned=0
: Indicates whether or not the input file has been partitioned. For distributed setting, each machine needs to access their part of data. Ifis_partitioned
is set to 0, then each machine needs to get access to the whole data file and Sparse Coding app will do the partitioning automatically. Otherwise, the whole data file needs to be partitioned by column id (See Data partitioning for details), and each machine will read file named data_filename.client_id, e.g. if data_filename is “sample.txt”, then machine 0 will read file “sample.txt.0”.data_format="text"
: Specify the format of input and output data. Can be “text” or “binary”. Note that the format oftime.txt
andloss.txt
does not depend on this parameter and is always text.load_cache=0
: Specify if whether or not load previous saved results. Ifload_cache
is set to 1, the app will load the results in directorycache_dirname
, which has to contain the matrixB
and the partial matrixSi
. The data format of files incache_dirname
need to be the same as specified indata_format
.cache_dirname="N/A"
: Required ifload_cache
is set to 1. Seeload_cache
.
- Output files
output_dirname
: Directory where the results will be put into. Theoutput_dirname
is path relative tobosen/app/sparsecoding
. Changeoutput_path
directly if you want to use absolute path.
- Objective function parameters
m
: Dimension of input data. It is also the number of columns (for text format input file) in input data.n
: Size of input data. It is also the number of rows (for text file) in input data.dictionary_size
: Size of dictionary.c
: The l2-norm constraint on dictionary elements.lambda
: The regularization strength in objective function.
- Optimization parameters
num_epochs=500
: Perform how many epochs during optimization. Each epoch approximately visit all data points once (not exactly because the data points are visited stochastically).minibatch_size=100
: Size of minibatch.init_step_size=0.01
: Base init step size. The step size at the t-th iter is in the form ofinit_step_size
* (t +step_size_offset
)^(-step_size_pow
). As we are alternatively optimizingB
andS
, and there might be multiple threads or multiple clients running, the actual step size forB
andS
is rescaled by a factor related to number of threads and dimension of data. For most applications, it is enough to only tune this parameter, keepingstep_size_offset
andstep_size_pow
to 0. If you get nan or inf during optimization, then you should decreaseinit_step_size
. However, too small a step size will cause slow convergence speed.step_size_offset=0.0
: Seeinit_step_size
.step_size_pow=0.0
: Seeinit_step_size
.init_S_low=0.0
: Elements in matrixS
are initialized from uniform distribution with lower boundinit_S_low
and upper boundinit_S_high
.init_S_high=0.0
: Elements in matrixS
are initialized from uniform distribution with lower boundinit_S_low
and upper boundinit_S_high
.init_B_low=0.0
: Elements in matrixB
are initialized from uniform distribution with lower boundinit_B_low
and upper boundinit_B_high
.init_B_high=0.0
: Elements in matrixB
are initialized from uniform distribution with lower boundinit_B_low
and upper boundinit_B_high
.num_iter_S_per_minibatch=10
: How many iterations to perform gradient on a randomly picked data point to update the corresponding coefficient. The default value is enough for most applications. A bigger value will result in better optimization results at a given iteration, but at the cost of more time.init_step_size_B=0.0
: Optional. Valid when it is set to nonzero values. For advanced users, the step size forB
andS
can be set directly by settinginit_step_size_B
,step_size_offset_B
,step_size_pow_B
,init_step_size_S
,step_size_offset_S
,step_size_pow_S
directly. Note thatinit_step_size_B
orinit_step_size_S
must be set to nonzero values if you want to set them directly instead of using step size determined by base step size. For example, ifinit_step_size_B
is set to a nonzero value, the step size at the t-th iter will beinit_step_size_B
* (t +step_size_offset_B
)^(-step_size_pow_B
). The step size formula forS
is analogous.step_size_offset_B=0.0
: Optional. Seeinit_step_size_B
.step_size_pow_B=0.0
: Optional. Seeinit_step_size_B
.init_step_size_S=0.0
: Optional. Seeinit_step_size_B
.step_size_offset_S=0.0
: Optional. Seeinit_step_size_B
.step_size_pow_S=0.0
: Optional. Seeinit_step_size_B
.
- Evaluation parameters
num_eval_minibatch=100
: Evaluate objective function per how many minibatches.num_eval_samples=$n
: How many samples to pick for each worker thread to evaluate objective function. If the size of data is so large that evaluating objective function takes too much time, you can use either a smallernum_eval_samples
or a largernum_eval_minibatch
value.
- System parameters
num_worker_threads=4
: Number of threads running Sparse Coding on each machine.table_staleness=0
: The staleness for tables.maximum_running_time=0.0
: The app will try to terminate after runningmaximum_running_time
hours. Valid if the value is greater than 0.0.
After all parameters have been chosen appropriately, use the command ./script/run_local.py
, then Sparse Coding application will run in background. The results will be put as specified in output_dirname
. Matrix B
will be stored in B.txt
in client 0 (whose ip appears in the first line in hostfile). Matrix S
will be stored in a row-partitioned manner, i.e., client i will have a S.txt.i
in output_dirname
, and the whole matrix S
can be obtained by putting all S.txt.i
together, which will be explained in Data partitioning.
Terminating the Sparse Coding app¶
The Sparse Coding app runs in the background, and outputs its progress to log files in user-specified directory. If you need to terminate the app before it finishes, just run
./script/kill.py <petuum_ps_hostfile>
Data partitioning¶
If there are multiple machines in host file, each machine will only take a part of input matrix. Concretely, if there are num_client
clients, client i will read the j-th row of X
if j mod num_client
=i. For example, if the data matrix X
is:
1.0 1.0 1.0 1.0
2.0 2.0 2.0 2.0
3.0 3.0 3.0 3.0
4.0 4.0 4.0 4.0
5.0 5.0 5.0 5.0
In the above X
, the feature dimension is 4, and the size of data is 5. Suppose there are 3 machines in host file, then machine 0 will read the 1-st, 4-th row of X
, machine 2 will read the 2-nd, 5-th row of X
and machine 3 will read the 3-rd row of X
.
Each machine only needs to read part of data, so we provide a parameter is_partitioned
. In order to use partitioned data (for example, when each client’s disk is not enough to hold all data), is_partitioned
shall be set to 1, and data needs to be partitioned according to the above partitioning strategy. Note that the name of partitioned data on each client needs to be data_filename
.client_id. We have provided a tool for partitioning data in scripts/partition_data.py
. Run python scripts/partition_data.py
for its usage.
When using multiple machines, the result coefficients S
will be stored in a distributed fashion, corresponding to the part of input data that client reads. For example, In the above example, machine 0 will store S.txt.0
, which is the 1-st, 4-th row of X
. We have provided a tool for merging partitioned S.txt.i
together in scripts/merge_data.py
. Run python scripts/merge_data.py
for its usage.
File IO from HDFS¶
Put datasets to HDFS
hadoop fs -mkdir -p /user/bosen/dataset/sc
hadoop fs -put sample /user/bosen/dataset/sc
Change the corresponding file paths in script/run_local.py
to the right HDFS path. Comment out the local path.
# , "output_path": join(app_dir, "sample/output")
, "output_path": "hdfs://hdfs-domain/user/bosen/dataset/sc/sample/output"
# , "data_file": join(app_dir, "sample/data/sample.txt")
, "data_file": "hdfs://hdfs-domain/user/bosen/dataset/sc/sample/data/sample.txt"
Launch it over ssh
./script/launch.py
Check the output
hadoop fs -cat /user/bosen/dataset/sc/sample/output/time.txt
Use Yarn to launch Sparse Coding app¶
./script/launch_on_yarn.py