Deep Neural Network for Speech Recognition
This tutorial shows how the Deep Neural Network (DNN) application (implemented on Bösen) can be applied to speech recognition, using Kaldi (http://kaldi.sourceforge.net/about.html) as our tool for feature extraction and decoding. Kaldi is a toolkit for speech recognition written in C++ and licensed under the Apache License v2.0. It also provides two DNN applications of its own (http://kaldi.sourceforge.net/dnn.html); we follow Dan's setup for feature extraction, preprocessing and decoding.
Our DNN consists of an input layer, an arbitrary number of hidden layers, and an output layer. Each layer contains some number of neuron units. Each unit in the input layer corresponds to an element of the feature vector. We represent the class label using 1-of-K coding, so each unit in the output layer corresponds to a class label. The number of hidden layers and the number of units in each hidden layer are configured by the user. Units in adjacent layers are fully connected. For DNN learning, we use the cross-entropy loss and stochastic gradient descent, where the gradient is computed by backpropagation.
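The forward pass, cross-entropy loss on 1-of-K labels, and backpropagation updates described above can be sketched in a few lines of NumPy. This is a toy illustration with a single tanh hidden layer, not the Bösen implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot(labels, num_classes):
    """1-of-K coding: each label becomes a K-dimensional indicator vector."""
    out = np.zeros((len(labels), num_classes))
    out[np.arange(len(labels)), labels] = 1.0
    return out

def init_layer(n_in, n_out):
    """Random weight matrix and zero bias vector for one fully connected layer."""
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)    # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sgd_step(x, y, W1, b1, W2, b2, stepsize):
    """One stochastic gradient descent step on a one-hidden-layer network."""
    # forward pass
    h = np.tanh(x @ W1 + b1)                # hidden-layer activations
    p = softmax(h @ W2 + b2)                # output-layer class probabilities
    loss = -np.mean(np.sum(y * np.log(p + 1e-12), axis=1))  # cross-entropy
    # backpropagation
    d2 = (p - y) / len(x)                   # gradient at the output pre-activation
    dW2, db2 = h.T @ d2, d2.sum(axis=0)
    d1 = (d2 @ W2.T) * (1.0 - h ** 2)       # chain rule through tanh
    dW1, db1 = x.T @ d1, d1.sum(axis=0)
    # in-place SGD updates
    W1 -= stepsize * dW1; b1 -= stepsize * db1
    W2 -= stepsize * dW2; b2 -= stepsize * db2
    return loss
```

Repeated calls to sgd_step drive the cross-entropy loss down on a batch; the real app does the same with an arbitrary number of hidden layers, with the parameters distributed via Bösen.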
Installation

PMLS Deep Neural Network Application

The DNN for Speech Recognition app can be found in bosen/app/dnn_speech/. From this point on, all instructions assume you are in bosen/app/dnn_speech/. After building PMLS (as explained earlier in this manual), you can build the DNN from bosen/app/dnn_speech/ by running

make

This will put the DNN binary in the subdirectory bin/.
Kaldi

From bosen/app/dnn_speech/, extract Kaldi by running:

tar -xvf kaldi-trunk.tar.gz
cd kaldi-trunk/tools

Next, we must build the ATLAS libraries in a local directory. From kaldi-trunk/tools, run:

sudo apt-get install gfortran
./install_atlas.sh

This process will take some time, and will report some messages of the form Error 1 (ignored); this is normal. More details can be found in kaldi-trunk/INSTALL, kaldi-trunk/tools/INSTALL and kaldi-trunk/src/INSTALL.
Once ATLAS has been set up, run:
make
cd ../src/
./configure
make depend
make
The first make may produce some "Error 1 (ignored)" messages; this is normal. The ./configure step may produce a warning about GCC 4.8.2, which can also be ignored for our purposes. Be advised that these steps will take a while (up to 1-2 hours).
The Whole Pipeline

Currently, we only support the TIMIT dataset (https://catalog.ldc.upenn.edu/LDC93S1), a well-known benchmark dataset for speech recognition. You can process this dataset through the following steps.
1. Feature extraction
WARNING: this stage will take several hours, and requires at least 16GB of free RAM.
Run
sh scripts/PrepDNNFeature.sh <TIMIT_path>
where <TIMIT_path> is the absolute path to the TIMIT directory (you can obtain TIMIT through https://catalog.ldc.upenn.edu/LDC93S1). This will extract features and do some preprocessing to generate Train.fea, Train.label, Train.para, head.txt and tail.txt in the app/dnn_speech directory, and exp/petuum_dnn in the kaldi-trunk/egs/timit/s5 directory. The script will take 1-2 hours to complete.

- Train.fea and Train.label save the features and labels of the training examples, one example per line.
- Train.para saves information about the training examples: the feature dimension, the number of label classes, and the number of examples.
- head.txt contains the transition model and the preprocessing information for the training features, including splicing and linear discriminant analysis (LDA), saved in the format of Dan's setup in Kaldi.
- tail.txt contains the empirical distribution of the label classes, saved in the format of Dan's setup in Kaldi.
- kaldi-trunk/egs/timit/s5/exp/petuum_dnn contains all the log files and intermediate results.
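Assuming the file formats just described, a quick consistency check of the generated files could look like the sketch below. The check is illustrative and not part of the pipeline; it only relies on Train.para holding the feature dimension, number of classes, and number of examples:

```python
def check_training_files(prefix):
    """Verify that <prefix>.fea, <prefix>.label and <prefix>.para agree.

    <prefix>.para holds: <feature_dim> <num_classes> <num_examples>.
    <prefix>.fea has one blank-separated feature vector per line;
    <prefix>.label has one class label (starting from 0) per line.
    """
    with open(prefix + ".para") as f:
        feat_dim, num_classes, num_examples = map(int, f.read().split())
    with open(prefix + ".fea") as f:
        feats = [line.split() for line in f if line.strip()]
    with open(prefix + ".label") as f:
        labels = [int(line) for line in f if line.strip()]
    assert len(feats) == len(labels) == num_examples, "example count mismatch"
    assert all(len(v) == feat_dim for v in feats), "feature dimension mismatch"
    assert all(0 <= l < num_classes for l in labels), "label out of range"
    return feat_dim, num_classes, num_examples
```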
2. DNN Training
According to the information in Train.para, you can set the configuration file for the DNN (more details can be found in the Input data format and Format of DNN Configuration File sections). For example, if Train.para is
360 2001 1031950
you can set datasets/data_partition.txt by
/home/user/bosen/app/dnn_speech/Train 1031950
and datasets/para_imnet.txt by
num_layers: 4
num_units_in_each_layer: 360 512 512 2001
num_epochs: 2
stepsize: 0.1
mini_batch_size: 256
num_smp_evaluate: 2000
num_iters_evaluate: 100
Then run
scripts/run_dnn.sh 4 5 machinefiles/localserver datasets/para_imnet.txt datasets/data_partition.txt DNN_para.txt
The DNN app runs in the background (progress is output to stdout). After the app terminates, you should get one output file:

DNN_para.txt

which stores the weight matrices and bias vectors in the format of Dan's setup in Kaldi. More details can be found in the Output format section.
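Deriving the two configuration files from Train.para can be scripted. The sketch below uses the example hidden-layer sizes and hyperparameters shown above; all of those values are illustrative choices, not required settings:

```python
import os

def write_dnn_config(para_path, app_dir, hidden_units=(512, 512),
                     out_para="datasets/para_imnet.txt",
                     out_partition="datasets/data_partition.txt"):
    """Derive the DNN config and data partition files from Train.para.

    para_path points at Train.para (<feature_dim> <num_classes> <num_examples>);
    app_dir is the absolute path of the dnn_speech app directory.
    """
    with open(para_path) as f:
        feat_dim, num_classes, num_examples = map(int, f.read().split())
    layers = [feat_dim, *hidden_units, num_classes]
    with open(out_para, "w") as f:
        # the parameter order below is fixed and must not be changed
        f.write(f"num_layers: {len(layers)}\n")
        f.write("num_units_in_each_layer: " + " ".join(map(str, layers)) + "\n")
        f.write("num_epochs: 2\n")
        f.write("stepsize: 0.1\n")
        f.write("mini_batch_size: 256\n")
        f.write("num_smp_evaluate: 2000\n")
        f.write("num_iters_evaluate: 100\n")
    with open(out_partition, "w") as f:
        # each line: <data_file> \t <num_data_in_partition>, absolute path required
        f.write(f"{os.path.join(app_dir, 'Train')}\t{num_examples}\n")
```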
3. Decoding
Run
scripts/NetworkDecode.sh DNN_para.txt datasets/para_imnet.txt
Wait for several minutes and you will see the decoding result over the core test set of TIMIT.
Running the Deep Neural Network application

Notice that the interface of the DNN for speech recognition is slightly different from that of the general-purpose DNN in app/dnn. To see the instructions for the DNN for speech app, run
scripts/run_dnn.sh
The basic syntax is
scripts/run_dnn.sh <num_worker_threads> <staleness> <hostfile> <parameter_file> <data_partition_file> <model_para_file> "additional options"
- <num_worker_threads>: how many worker threads to use on each machine
- <staleness>: staleness value
- <hostfile>: machine configuration file
- <parameter_file>: configuration file for the DNN parameters
- <data_partition_file>: a file containing the data file path and the number of training points in each data partition
- <model_para_file>: the path where the output weight matrices and bias vectors will be stored
The final argument, “additional options”, is an optional quote-enclosed string of the form "--opt1 x --opt2 y ..."
(you may omit this if you wish). This is used to pass in the following optional arguments:
- ps_snapshot_clock x: take snapshots every x iterations
- ps_snapshot_dir x: save snapshots to directory x (please make sure x already exists!)
- ps_resume_clock x: if specified, resume from iteration x (note: if --staleness s is specified, then we resume from iteration x-s instead)
- ps_resume_dir x: resume from snapshots in directory x. You can continue to take snapshots by specifying ps_snapshot_dir y, but do make sure directory y is not the same as x!
For example, to run the DNN app on the local machine (one client) with 4 worker threads, staleness 5, machine file machinefiles/localserver, DNN configuration file datasets/para_imnet.txt, data partition file datasets/data_partition.txt, and model parameter file DNN_para.txt, use the following command:
scripts/run_dnn.sh 4 5 machinefiles/localserver datasets/para_imnet.txt datasets/data_partition.txt DNN_para.txt
Input data format

We assume users have partitioned the data into M pieces, where M is the total number of clients (machines). Each client is in charge of one piece. The user needs to provide a file recording the data partition information. In this file, each line corresponds to one data partition. The format of each line is
<data_file> \t <num_data_in_partition>
<num_data_in_partition> is the number of data points in this partition. <data_file> is the prefix of the class label file (<data_file>.label) and the feature file (<data_file>.fea), and it must be an absolute path.
For example,
/home/user/bosen/app/dnn_speech/Train 1031950
means there are 1031950 training examples, the class label file is /home/user/bosen/app/dnn_speech/Train.label, and the feature file is /home/user/bosen/app/dnn_speech/Train.fea.
The format of <data_file>.fea is:
<feature vector 1>
<feature vector 2>
...
Elements in the feature vector are separated by a single blank.
The format of <data_file>.label is:
<label 1>
<label 2>
...
Note that class labels start from 0: if there are K classes, the labels range over [0, 1, ..., K-1].
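To make the format concrete, the following sketch writes a toy dataset and partition file in the layout just described (the paths and values are invented for the example):

```python
def write_dataset(prefix, features, labels):
    """Write <prefix>.fea and <prefix>.label: one example per line,
    blank-separated feature elements, class labels starting from 0."""
    assert len(features) == len(labels)
    with open(prefix + ".fea", "w") as f:
        for vec in features:
            f.write(" ".join(str(x) for x in vec) + "\n")
    with open(prefix + ".label", "w") as f:
        for lab in labels:
            f.write(str(lab) + "\n")

def write_partition_file(path, data_file, num_examples):
    """Write one partition line: <data_file> \\t <num_data_in_partition>.

    data_file must be the absolute-path prefix shared by the .fea/.label pair.
    """
    with open(path, "w") as f:
        f.write(f"{data_file}\t{num_examples}\n")
```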
Format of DNN Configuration File

The DNN configuration is stored in <parameter_file>. Each line corresponds to one parameter, and its format is
<parameter_name>: <parameter_value>
<parameter_name> is the name of the parameter, followed by a : (there is no blank between <parameter_name> and :). <parameter_value> is the value of this parameter. Note that : and <parameter_value> must be separated by a blank.
The parameters and their meanings are:

- num_layers: number of layers, including the input layer, hidden layers, and output layer
- num_units_in_each_layer: number of units in each layer
- num_epochs: number of epochs of stochastic gradient descent training
- stepsize: learning rate of stochastic gradient descent
- mini_batch_size: mini-batch size in each iteration
- num_smp_evaluate: when evaluating the objective function, we randomly sample <num_smp_evaluate> points to compute the objective
- num_iters_evaluate: every <num_iters_evaluate> iterations, we do an objective function evaluation

Note that the order of the parameters cannot be changed.
Here is an example:
num_layers: 4
num_units_in_each_layer: 360 512 512 2001
num_epochs: 2
stepsize: 0.1
mini_batch_size: 256
num_smp_evaluate: 2000
num_iters_evaluate: 100
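A small parser for this file format, enforcing the fixed parameter order described above (a sketch, not the app's own reader):

```python
PARAM_ORDER = ["num_layers", "num_units_in_each_layer", "num_epochs",
               "stepsize", "mini_batch_size", "num_smp_evaluate",
               "num_iters_evaluate"]

def parse_dnn_config(path):
    """Parse <parameter_file>, checking names appear in the required order."""
    params = {}
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    assert len(lines) == len(PARAM_ORDER), "unexpected number of parameters"
    for expected, line in zip(PARAM_ORDER, lines):
        # split only on the first ": " so multi-valued parameters stay intact
        name, _, value = line.partition(": ")
        assert name == expected, f"parameter out of order: {name}"
        params[name] = value
    # convert values to their natural types
    params["num_units_in_each_layer"] = [
        int(u) for u in params["num_units_in_each_layer"].split()]
    for key in ("num_layers", "num_epochs", "mini_batch_size",
                "num_smp_evaluate", "num_iters_evaluate"):
        params[key] = int(params[key])
    params["stepsize"] = float(params["stepsize"])
    assert len(params["num_units_in_each_layer"]) == params["num_layers"]
    return params
```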
Output format
The DNN app outputs just one file:
<model_para_file>
<model_para_file> saves the weight matrices and bias vectors. The order is: the weight matrix between layer 1 (the input layer) and layer 2 (the first hidden layer), the bias vector for layer 2, the weight matrix between layer 2 and layer 3, the bias vector for layer 3, and so on. All matrices are saved in row-major order, one row per line. Elements in each row are separated by a blank.
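Given the layer sizes from the configuration file, the file can be read back as sketched below. Note the orientation assumed here (each weight matrix has one row per unit of the lower layer) is an illustration; check it against Dan's setup in Kaldi before relying on it:

```python
def load_dnn_parameters(path, layer_sizes):
    """Read weights and biases in the order W(1->2), b(2), W(2->3), b(3), ...

    Each matrix row and each bias vector occupies one blank-separated line.
    Assumption: W between layers l and l+1 has layer_sizes[l] rows of
    layer_sizes[l+1] elements (the file may store the transpose instead).
    """
    with open(path) as f:
        lines = [line.split() for line in f if line.strip()]
    weights, biases, pos = [], [], 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        W = [list(map(float, row)) for row in lines[pos:pos + n_in]]
        pos += n_in
        b = list(map(float, lines[pos]))
        pos += 1
        assert all(len(row) == n_out for row in W) and len(b) == n_out
        weights.append(W)
        biases.append(b)
    assert pos == len(lines), "file length does not match layer sizes"
    return weights, biases
```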
Terminating the DNN app

The DNN app runs in the background, outputting its progress to stdout. If you need to terminate the app before it finishes (for the distributed version), run
scripts/kill_dnn.sh <hostfile>