# Term-Frequency Matrix Preprocessor¶

## Overview¶

Term-frequency matrices feature prominently in text processing and topic modeling algorithms. In these problems one typically starts with a set of documents and a list of words (the dictionary). A term-frequency matrix is constructed from the dictionary and the document set by counting the number of occurrences of each dictionary word in each document. If the rows of the matrix index the words and the columns index the documents, the matrix element at coordinates (r, c) represents the number of occurrences of dictionary word r in document c. Thus each entry of the matrix is either zero or a positive integer.

Construction of such a matrix is conceptually simple, but problems can arise if
the matrix contains duplicate rows or columns. The presence of duplicate
columns means that the documents at those indices are *identical*
with respect to the given dictionary. The linear algebra algorithms underlying
many text processing and information retrieval tasks can exhibit instability or
extremely slow convergence if duplicates are present. Mathematically, a
term-frequency matrix with duplicate columns has a rank that is numerically
less than the column count. Under such conditions it is advantageous to remove
the duplicated columns (and/or rows) and work with a smaller,
fuller-rank matrix.

The ClarityNLP matrix preprocessor is a command-line tool that scans a term-frequency matrix looking for duplicate rows and columns. If it finds any duplicates it prunes them and keeps only one row or column from each set of duplicates. After pruning it scans the matrix again, since removal of rows or columns could create further duplicates. This process of scanning and checking for duplicates proceeds iteratively until either a stable matrix is achieved or nothing is left (a rare occurrence, mainly for ill-posed problems). The resulting matrix is written to disk, along with the surviving row and column index lists.

## Source Code¶

The source code for the matrix preprocessor is located in
`nlp/algorithms/matrix_preprocessor`

. The code is written in C++ with a
python driver `preprocess.py`

.

### Building the Code¶

A C++ compiler is required to build the matrix preprocessor.

On Linux systems, use your package manager to install the `build-essential`

package, which should contain the Gnu C++ compiler and other tools needed to
build C++ code. After installation, run the command `g++ --version`

, which
should print out the version string for the Gnu C++ compiler. If this command
produces a `command not found`

error, then use your package manager to
explicitly install the `g++`

package.

On MacOSX, install the xcode command-line tools with this command:
`xcode-select --install`

. After installation run the command
`clang++ --version`

, which should generate a version string for the clang
C++ compiler.

After verifying that the C++ compiler works, build the matrix preprocessor code with these commands:

```
cd nlp/algorithms/matrix_preprocessor
make
```

The build process should run to completion with no errors, after which these
binaries should be present in the `build/bin`

folder: `libpreprocess.a`

,
`preprocessor`

, and `test_preprocessor`

.

## Inputs¶

The matrix preprocessor requires a single input file. The input file must be in MatrixMarket format, a popular and efficient format for sparse matrices.

Python supports the MatrixMarket format via the `scipy`

module and the
functions `scipy.io.mmwrite`

and `scipy.io.mmread`

.

### Input Options¶

The matrix preprocessor supports the following set of command line options. All
are optional except for `--infile`

, which specifies the file containing the
term-frequency matrix to be processed:

Option | Argument | Explanation |

`-i` , `--infile` |
string | path to input file, MatrixMarket format |

`-r` , `--min_docs_per_term` |
integer | min number of docs per dictionary term, default 3 |

`-c` , `--min_terms_per_doc` |
integer | min number of dictionary terms per doc, default 5 |

`-p` , `--precision` |
integer | precision of values in output file, default 4 digits
(valid only if `--weights` flag is present) |

`-w` , `--weights` |
none | if present, generate TF-IDF weights for entries and output a floating point term-document matrix |

`-b` , `--boolean` |
none | if present, enable boolean mode, in which nonzero values in the input matrix are set to 1 |

`-h` , `--help` |
none | print user help to stdout |

`-v` , `--version` |
none | print version information to stdout |

The `--min_docs_per_term`

option is the cutoff value for pruning rows. Any
dictionary term that appears in fewer than this many documents will be pruned.
In other words, a row of the input matrix will be pruned if its row sum is less
than this value.

Similarly, the `--min_terms_per_doc`

option is the cutoff value for pruning
columns. Any document that contains fewer than this many dictionary words will
be pruned. In other words, a column of the input matrix will be pruned if its
column sum is less than this value.

## Outputs¶

The matrix preprocessor generates three output files.

One file, `reduced_dictionary_indices.txt`

, is a list of row indices from the
original matrix that survived the pruning process. Another file,
`reduced_document_indices.txt`

, contains a list of original document indices
that survived the pruning process.

The third file, in MatrixMarket format, is the pruned matrix. The contents and
name of this file depend on whether the `--weights`

flag was used for the
run.

If the `--weights`

flag was absent, the output is another term-frequency
matrix in MatrixMarket format. The output file name is `reduced_matrix_tf.mtx`

and it contains nonnegative integer entries.

If the `--weights`

flag was present, the output is a term-document matrix
containing TF_IDF weights for the entries. In this case the output file name
is `reduced_matrix.mtx`

and it contains floating point entries. The precision
of each entry is set by the `--precision`

flag.

All output files are written to the current directory.

## Examples¶

Prune duplicate rows/columns from the input term-frequency matrix. Write pruned matrix to

`reduced_matrix_tf.mtx`

; generate the two index files as well:python3 ./preprocess.py --infile /path/to/mymatrix.mtx

Same as in example 1, but generate an output term-document matrix containing TF-IDF weights. Write result matrix to

`reduced_matrix.mtx`

; generate the two index files also:python3 ./preprocess.py --infile /path/to/mymatrix.mtx --weights

Same as 2, but require a mininim row sum of 6 and a mininum column sum of 8 in the pruned term-frequency matrix. Compute TF-IDF weights and output a floating point term-document matrix.

python ./preprocess.py -i /path/to/mymatrix.mtx -r 6 -c 8 -w

## Important Note¶

The matrix preprocessor was designed for sparse matrices. The term-frequency matrices that occur in typical text processing problems are extremely sparse, with occupancies of only a few percent. Dense matrices should be handled with different techniques.