NMF Algorithm

Optimization:

For $\boldsymbol{u}\in \mathbb{R}^d$, $\boldsymbol{V} \in \mathbb{R}^{n\times d}$ and $V \subseteq \mathbb{R}^d$, let $\mathscr{D}(\boldsymbol{u};\boldsymbol{V})$, $\mathscr{D}(\boldsymbol{u};V)$ be the squared $\ell_2$ distances of $\boldsymbol{u}$ from the convex envelope of the rows of $\boldsymbol{V}$ and from the set $V$, respectively. Namely, letting $\Delta^n$ be the standard $(n-1)$-simplex, we let

$$\mathscr{D}(\boldsymbol{u};\boldsymbol{V}) \equiv \min_{\boldsymbol{\pi}\in \Delta^n}\|\boldsymbol{u} - \boldsymbol{V}^{\mathsf{T}}\boldsymbol{\pi}\|_2^2, \qquad \mathscr{D}(\boldsymbol{u};V) \equiv \min_{\boldsymbol{v}\in V}\|\boldsymbol{u} - \boldsymbol{v}\|_2^2.$$

Using this notation, if $\boldsymbol{X}\in \mathbb{R}^{n\times d}$ is the matrix with data points $\boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_n \in \mathbb{R}^d$ as its rows and $\boldsymbol{H}\in \mathbb{R}^{r\times d}$ is the matrix with archetypes $\boldsymbol{h}_1, \boldsymbol{h}_2, \dots, \boldsymbol{h}_r \in \mathbb{R}^d$ as its rows, we attempt to solve the following nonconvex problem:

$$\mathop{\mathrm{minimize}}_{\boldsymbol{H}} ~~ \sum_{i=1}^{n}\mathscr{D}(\boldsymbol{x}_i; \boldsymbol{H}) + \lambda\sum_{\ell=1}^{r}\mathscr{D}(\boldsymbol{h}_\ell; \boldsymbol{X}).$$
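
As a concrete illustration, here is a minimal NumPy sketch of this cost, with $\mathscr{D}$ taken as the squared distance defined above. It is not the package's implementation; the helper names (`project_simplex`, `simplex_lsq`, `dist2`, `objective`) and the projected-gradient inner solver are our own assumptions.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1} (sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def simplex_lsq(u, V, n_iter=300):
    """Approximately solve min_{pi in simplex} ||u - V.T @ pi||_2^2 by projected gradient."""
    pi = np.full(V.shape[0], 1.0 / V.shape[0])
    step = 1.0 / (np.linalg.norm(V, 2) ** 2 + 1e-12)  # 1/L with L = ||V||_2^2
    for _ in range(n_iter):
        pi = project_simplex(pi - step * V @ (V.T @ pi - u))
    return pi

def dist2(u, V):
    """Squared distance D(u; V) of u from the convex envelope of the rows of V."""
    r = u - V.T @ simplex_lsq(u, V)
    return float(r @ r)

def objective(X, H, lam):
    """sum_i D(x_i; H) + lam * sum_l D(h_l; X), the cost displayed above."""
    return sum(dist2(x, H) for x in X) + lam * sum(dist2(h, X) for h in H)
```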

We propose an initialization algorithm based on the successive projections algorithm of Araújo et al. In addition, we use the Proximal Alternating Linearized Minimization (PALM) scheme of Bolte et al. to minimize the above cost function.

Initialization

Input: Data $\{\boldsymbol{x}_i\}_{i\le n}$, $\boldsymbol{x}_i\in\mathbb{R}^d$, integer $r$;
Output: Initial archetypes $\{\hat{\boldsymbol{h}}_{\ell}^{(0)}\}_{1\le \ell\le r}$;

  1. Set $i(1) = \arg\max_{i \le n} \|\boldsymbol{x}_i\|_2^2$;

  2. Set $\hat{\boldsymbol{h}}^{(0)}_{1} = \boldsymbol{x}_{i(1)}$;

  3. For $\ell\in \{1,\dots, r-1\}$

    1. Define $V_{\ell}\equiv \mathrm{aff}(\hat{\boldsymbol{h}}_{1}^{(0)},\hat{\boldsymbol{h}}_{2}^{(0)},\dots,\hat{\boldsymbol{h}}_{\ell}^{(0)})$, the affine hull of $\hat{\boldsymbol{h}}_{1}^{(0)},\hat{\boldsymbol{h}}_{2}^{(0)},\dots,\hat{\boldsymbol{h}}_{\ell}^{(0)}$;

    2. Set $i(\ell+1) = \arg\max_{i\le n} \mathscr{D}(\boldsymbol{x}_{i};V_{\ell})$;

    3. Set $\hat{\boldsymbol{h}}^{(0)}_{\ell+1} = \boldsymbol{x}_{i(\ell+1)}$;

  4. Return $\{\hat{\boldsymbol{h}}_{\ell}^{(0)}\}_{1\le \ell\le r}$.
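
The sketch below mirrors these steps in NumPy, computing the squared distance to the affine hull with an ordinary least-squares solve; `spa_init` is a name we introduce for illustration, not one from the original code.

```python
def spa_init(X, r):
    """Successive-projections initialization (a sketch of the steps above)."""
    idx = int(np.argmax(np.sum(X ** 2, axis=1)))      # step 1: point of largest norm
    H = [X[idx]]                                      # step 2
    for _ in range(r - 1):                            # step 3
        base = H[0]
        if len(H) == 1:
            # aff(H) is a single point, so the distance is just to that point.
            dists = [float((x - base) @ (x - base)) for x in X]
        else:
            # Columns of B span the directions of aff(H) relative to h_1.
            B = np.stack([h - base for h in H[1:]], axis=1)
            dists = []
            for x in X:
                coef, *_ = np.linalg.lstsq(B, x - base, rcond=None)
                res = x - base - B @ coef
                dists.append(float(res @ res))
        H.append(X[int(np.argmax(dists))])            # steps 3.2-3.3
    return np.stack(H)                                # step 4: r initial archetypes
```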

After finding the initial set of archetypes $\{\hat{\boldsymbol{h}}_{\ell}^{(0)}\}_{1\le \ell\le r}$, the initial set of weights $\{\hat{\boldsymbol{w}}_{i}^{(0)}\}_{1\le i\le n}$ can be found by

$$\hat{\boldsymbol{w}}_{i}^{(0)} = \arg\min_{\boldsymbol{w}\in \Delta^r}\|\boldsymbol{x}_{i} - \boldsymbol{w}\hat{\boldsymbol{H}}^{(0)}\|_2.$$
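
Each $\hat{\boldsymbol{w}}_i^{(0)}$ is again a simplex-constrained least-squares problem, so the `simplex_lsq` helper from the first sketch applies directly (our construction, not the package's interface):

```python
# H0 = spa_init(X, r) from the sketch above; rows of W0 are the initial weights.
W0 = np.stack([simplex_lsq(x, H0) for x in X])
```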

After finding the above initial estimates, we perform the PALM iterations, which are guaranteed to converge to critical points of the risk function. For $\boldsymbol{u}\in \mathbb{R}^d$ and $V \subseteq \mathbb{R}^d$, if we let

$$\boldsymbol{\Pi}_{V}(\boldsymbol{u}) = \arg\min_{\boldsymbol{v}\in V}\|\boldsymbol{u} - \boldsymbol{v}\|_2$$

denote the projection of $\boldsymbol{u}$ onto $V$, the iterations for this risk function take the following form:
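
For $V = \mathrm{conv}(\boldsymbol{X})$ this projection is itself a simplex-constrained least squares, so it can be sketched with the same helper as before (again our own approximation, not the reference implementation):

```python
def proj_conv(u, X):
    """Approximate projection of u onto the convex envelope of the rows of X."""
    return X.T @ simplex_lsq(u, X)
```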

Proximal Alternating Linearized Minimization (PALM) Iterations

Let $\mathrm{conv}(\boldsymbol{X})$ represent the convex envelope of the rows of $\boldsymbol{X}$, and initialize $k = 0$. Below, the projections $\boldsymbol{\Pi}_{\mathrm{conv}(\boldsymbol{X})}$ and $\boldsymbol{\Pi}_{\Delta^r}$ act on each row of their matrix arguments. While $\|\boldsymbol{H}^{k+1}-\boldsymbol{H}^{k}\|_F > \epsilon_1$ or $\|\boldsymbol{W}^{k+1}-\boldsymbol{W}^{k}\|_F > \epsilon_2$:

  1. $\widetilde{\boldsymbol{H}}^{k} = \boldsymbol{H}^{k} - \frac{1}{\gamma_1^k}(\boldsymbol{W}^k)^{\mathsf{T}}\big(\boldsymbol{W}^k\boldsymbol{H}^k - \boldsymbol{X}\big)$;

  2. $\boldsymbol{H}^{k+1} = \widetilde{\boldsymbol{H}}^{k} - \frac{\lambda}{\lambda + \gamma_1^k}\Big(\widetilde{\boldsymbol{H}}^{k} - \boldsymbol{\Pi}_{\mathrm{conv}(\boldsymbol{X})}\big(\widetilde{\boldsymbol{H}}^{k}\big)\Big)$;

  3. $\boldsymbol{W}^{k+1} = \boldsymbol{\Pi}_{\Delta^r}\bigg(\boldsymbol{W}^k - \frac{1}{\gamma_2^k}\Big(\boldsymbol{W}^k\boldsymbol{H}^{k+1}-\boldsymbol{X}\Big)(\boldsymbol{H}^{k+1})^{\mathsf{T}}\bigg)$;

  4. k := k+1.
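
Putting the pieces together, one PALM loop might look like the following sketch, reusing `project_simplex` and `proj_conv` from above. Setting the step sizes $\gamma_1^k, \gamma_2^k$ to the usual Lipschitz constants of the block gradients is our choice here, not necessarily the reference implementation's.

```python
def palm(X, H0, W0, lam, eps1=1e-6, eps2=1e-6, max_iter=1000):
    """PALM iterations for the archetypal cost (a sketch of steps 1-4 above)."""
    H, W = H0.copy(), W0.copy()
    for _ in range(max_iter):
        gamma1 = np.linalg.norm(W, 2) ** 2 + 1e-12        # Lipschitz const., H-block
        H_t = H - (W.T @ (W @ H - X)) / gamma1            # step 1
        Pi_H = np.stack([proj_conv(h, X) for h in H_t])   # row-wise projection
        H_new = H_t - lam / (lam + gamma1) * (H_t - Pi_H) # step 2
        gamma2 = np.linalg.norm(H_new, 2) ** 2 + 1e-12    # Lipschitz const., W-block
        G = W - ((W @ H_new - X) @ H_new.T) / gamma2
        W_new = np.stack([project_simplex(g) for g in G]) # step 3
        done = (np.linalg.norm(H_new - H) <= eps1
                and np.linalg.norm(W_new - W) <= eps2)
        H, W = H_new, W_new                               # step 4
        if done:
            break
    return H, W
```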

In our code, we have also provided an accelerated version of the PALM iterations, employing the monotone acceleration technique used by Beck and Teboulle in MFISTA and the extension of this technique by Li and Lin. Our code gives the user the option to choose the parameter $\lambda$ or to let the algorithm pick an appropriate value using a data-driven procedure, explained below.
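
For intuition only, a much-simplified version of that idea wraps each PALM pass in a FISTA-style extrapolation with a monotone safeguard that keeps the best iterate found so far. The sketch below is our own simplification in the spirit of MFISTA, not the authors' accelerated code; it assumes the `objective` and `palm` helpers from the earlier sketches.

```python
def accelerated_palm(X, H0, W0, lam, max_iter=500):
    """Extrapolated PALM passes with a monotone safeguard -- a rough sketch."""
    def palm_pass(H, W):
        return palm(X, H, W, lam, max_iter=1)     # one pass of steps 1-3
    H_prev, W_prev = H0, W0
    H_best, W_best = palm_pass(H0, W0)
    t = 1.0
    for _ in range(max_iter):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        beta = (t - 1.0) / t_next                 # FISTA extrapolation weight
        H_c, W_c = palm_pass(H_best + beta * (H_best - H_prev),
                             W_best + beta * (W_best - W_prev))
        H_prev, W_prev, t = H_best, W_best, t_next
        # Monotone safeguard: accept the candidate only if the cost decreases.
        if objective(X, H_c, lam) <= objective(X, H_best, lam):
            H_best, W_best = H_c, W_c
    return H_best, W_best
```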

How we choose the parameter $\lambda$:

Let $\boldsymbol{X} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\mathsf{T}}$ be the singular value decomposition of the data matrix, with $\Sigma_{1,1}\geq \Sigma_{2,2} \geq \dots \geq \Sigma_{n,n}$. Taking $\widehat{\boldsymbol{\Sigma}}^{(r)}$ such that $\widehat{\Sigma}_{i,i}^{(r)} = \Sigma_{i,i}$ for $i \leq r$ and $\widehat{\Sigma}_{i,i}^{(r)} = 0$ for $i \geq r+1$, we let

$$\widehat{\boldsymbol{X}}^{(r)}\equiv \boldsymbol{U}\widehat{\boldsymbol{\Sigma}}^{(r)}\boldsymbol{V}^{\mathsf{T}}$$

be the rank-$r$ PCA approximation of $\boldsymbol{X}$ and

$$\mathcal{D}^{(r)}_{\mathrm{pca}} \equiv \big\|\boldsymbol{X} - \widehat{\boldsymbol{X}}^{(r)}\big\|_F$$

be its corresponding error. Starting from $\lambda = \lambda_0$, we increase the value of $\lambda$ over a grid of values until we reach $\lambda^*$, the smallest $\lambda$ such that

$$\mathcal{D}\big(\boldsymbol{X}, \widehat{\boldsymbol{H}}_\lambda\big) - \mathcal{D}^{(r)}_{\mathrm{pca}} \geq c \big(\mathcal{D}\big(\boldsymbol{X}, \widehat{\boldsymbol{H}}_{\lambda_0}\big) - \mathcal{D}^{(r)}_{\mathrm{pca}}\big)$$

for some constant $c>1$, which the user can choose; the default in the code is $c = 1.2$.
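
A sketch of this selection loop follows, reusing the helpers defined above. We read $\mathcal{D}(\boldsymbol{X}, \widehat{\boldsymbol{H}}_\lambda)$ as the Frobenius fitting error $\|\boldsymbol{X} - \widehat{\boldsymbol{W}}_\lambda\widehat{\boldsymbol{H}}_\lambda\|_F$ to match $\mathcal{D}^{(r)}_{\mathrm{pca}}$; that reading, the `choose_lambda` name, and the grid interface are our assumptions.

```python
def choose_lambda(X, r, lambdas, c=1.2):
    """Pick the smallest lambda on an increasing grid (starting at lambda_0)
    that satisfies the criterion above."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    D_pca = np.linalg.norm(X - (U[:, :r] * S[:r]) @ Vt[:r])  # rank-r PCA error
    H0 = spa_init(X, r)
    W0 = np.stack([simplex_lsq(x, H0) for x in X])

    def fit_error(lam):
        H, W = palm(X, H0, W0, lam)
        return np.linalg.norm(X - W @ H)

    gap0 = fit_error(lambdas[0]) - D_pca        # baseline gap at lambda_0
    for lam in lambdas[1:]:
        if fit_error(lam) - D_pca >= c * gap0:
            return lam
    return lambdas[-1]  # criterion never triggered on this grid
```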