Knockoffs-SPR
The irrepresentable condition and the ground-truth clean set in the noisy-set-recovery theorem are unknown in practice, which makes this theory hard to apply in real-world settings. In particular, with \(O\) unknown, the algorithm may stop at an improper time, so that the noise rate of the selected clean data can remain high and the model trained in the next round is heavily corrupted by noisy patterns.
To resolve the problem of false selection in SPR, in this section we propose a data-adaptive sample selection method that controls the expected noise rate of the selected data, dubbed the False-Selection-Rate (FSR), under a desired level \(q\) (\(0 < q < 1\)):
\[
\mathrm{FSR}=\mathbb{E}\left[\frac{\#\left\{ j: j\in \hat{C}\setminus\mathcal{H}_0\right\} }{\#\left\{ j:j\in\hat{C}\right\} \lor1}\right],
\]
where \(\hat{C}=\{j:\hat{\gamma}_j= 0\}\) is the recovered clean set, and \(\mathcal{H}_0: \gamma^*_j = 0\) denotes the null hypothesis that sample \(j\) belongs to the clean dataset. The FSR thus controls the false rate among the selected null hypotheses, which is also known as the expected rate of the type-II error in hypothesis testing.
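When ground-truth noise indicators are available (e.g., in simulation), the realized counterpart of this expectation is straightforward to compute. A minimal sketch in Python; the function name `empirical_fsr` is illustrative, not part of the method:

```python
import numpy as np

def empirical_fsr(selected, is_clean):
    """Realized false-selection rate on one dataset: the fraction of
    selected samples that are actually noisy, with the denominator
    clipped at 1 as in the FSR definition."""
    selected = np.asarray(selected, dtype=bool)
    is_clean = np.asarray(is_clean, dtype=bool)
    n_selected = selected.sum()
    n_false = np.logical_and(selected, ~is_clean).sum()
    return n_false / max(n_selected, 1)
```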
Formally, we split the whole dataset \(\mathcal{D}\) into \(\mathcal{D}_1:=(X_1,Y_1)\) and \(\mathcal{D}_2:=(X_2,Y_2)\) with \(n_i:=|\mathcal{D}_i|\), and run Knockoffs-SPR on both \(\mathcal{D}_1\) and \(\mathcal{D}_2\). In the following, we only describe the procedure on \(\mathcal{D}_2\), as the procedure on \(\mathcal{D}_1\) is identical in spirit. Roughly, the procedure consists of three steps: i) estimate \(\beta\) on \(\mathcal{D}_1\); ii) estimate \(\left(\gamma(\lambda),\tilde{\gamma}(\lambda)\right)\) on \(\mathcal{D}_2\); and iii) construct the comparison statistics and selection filters.
Step i): Estimating \(\beta\) on \(\mathcal{D}_1\).
Our target is to provide an estimate of \(\beta\) that is independent of \(\mathcal{D}_2\).
The simplest strategy is to use the standard OLS estimator to obtain \(\hat{\beta}_1\).
However, this estimator may be inaccurate, since it is corrupted by noisy samples. To address this, we first run SPR on \(\mathcal{D}_1\) to obtain an estimated clean set and then solve for \(\beta\) via OLS on that set.
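Step i) can be sketched as follows, assuming SPR has already produced a boolean clean mask; `ols_on_clean` is an illustrative name, not part of the method:

```python
import numpy as np

def ols_on_clean(X1, Y1, clean_mask):
    """Least-squares fit of beta using only the samples SPR marked clean.
    X1: (n1, p) features; Y1: (n1, c) one-hot labels; clean_mask: boolean
    mask of length n1 produced by running SPR on D_1."""
    Xc, Yc = X1[clean_mask], Y1[clean_mask]
    # lstsq handles rank-deficient design matrices more gracefully
    # than forming a normal-equation inverse explicitly
    beta_hat, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    return beta_hat
```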
Step ii): Estimating \(\left(\gamma(\lambda),\tilde{\gamma}(\lambda)\right)\) on \(\mathcal{D}_2\).
After obtaining the solution \(\hat{\beta}_1\) on \(\mathcal{D}_1\), we learn \(\gamma_2(\lambda)\) on \(\mathcal{D}_2\) by minimizing:
\begin{equation}
\frac{1}{2}\left\Vert Y_2-X_2\hat{\beta}_1-\gamma_{2}\right\Vert_{\mathrm{F}}^{2}+\sum_j P(\gamma_{2,j};\lambda).
\end{equation}
For each one-hot encoded label vector \(y_{2,j}\), we randomly permute the position of the 1 to obtain another one-hot vector \(\tilde{y}_{2,j}\neq y_{2,j}\).
For a clean sample \(j\), \(\tilde{y}_{2,j}\) becomes a noisy label; for a noisy sample, \(\tilde{y}_{2,j}\) is switched to another noisy label with probability \(\frac{c-2}{c-1}\) or to the clean label with probability \(\frac{1}{c-1}\), where \(c\) is the number of classes.
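The permutation above can be sketched as follows; `permute_one_hot` is an illustrative name, and we assume labels are stored as one-hot rows with \(c \geq 2\) classes:

```python
import numpy as np

def permute_one_hot(Y, rng=None):
    """For each one-hot row, move the 1 to a uniformly random *different*
    class, producing the permuted labels Y_tilde (requires c >= 2)."""
    rng = np.random.default_rng(rng)
    n, c = Y.shape
    labels = Y.argmax(axis=1)
    # an offset in {1, ..., c-1} guarantees the new class differs from the old
    offsets = rng.integers(1, c, size=n)
    new_labels = (labels + offsets) % c
    Y_tilde = np.zeros_like(Y)
    Y_tilde[np.arange(n), new_labels] = 1
    return Y_tilde
```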
After obtaining the permuted matrix as \(\tilde{Y}_2\), we learn the solution paths \(\left(\gamma_{2}(\lambda), \tilde{\gamma}_2(\lambda)\right)\) using the same algorithm as SPR via:
\begin{equation}
\begin{cases}
\frac{1}{2}\left\Vert Y_2-X_2\hat{\beta}_1-\gamma_2\right\Vert _{\mathrm{F}}^{2}+\sum_j P(\gamma_{2,j};\lambda),\\
\frac{1}{2}\left\Vert \tilde{Y}_2-X_2\hat{\beta}_1-\tilde{\gamma}_2\right\Vert _{\mathrm{F}}^{2}+\sum_j P(\tilde{\gamma}_{2,j};\lambda).
\end{cases}
\end{equation}
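Assuming the solution paths are evaluated on a finite grid of \(\lambda\) values and stored as arrays of group norms \(\Vert\gamma_{2,i}(\lambda)\Vert_2\), the per-sample selection times used in the next step can be read off as follows (a sketch; `selection_times` is an illustrative name):

```python
import numpy as np

def selection_times(gamma_norms, lambdas):
    """gamma_norms: (n_lambda, n_samples) array of ||gamma_{2,i}(lambda)||_2
    along a lambda grid. Z_i is the largest lambda at which row i of the
    path is still nonzero, or 0 if the sample never activates."""
    active = np.asarray(gamma_norms) > 0
    lambdas = np.asarray(lambdas, dtype=float)
    Z = np.zeros(active.shape[1])
    for i in range(active.shape[1]):
        idx = np.where(active[:, i])[0]
        if idx.size:
            Z[i] = lambdas[idx].max()
    return Z
```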
Step iii): Comparison statistics and selection filters. After obtaining the solution paths \(\left(\gamma_{2}(\lambda), \tilde{\gamma}_2(\lambda)\right)\), we define the sample significance scores of each sample \(i\) with respect to \(Y_{2,i}\) and \(\tilde{Y}_{2,i}\) as the selection times \(Z_i := \sup\{\lambda: \Vert \gamma_{2,i}(\lambda) \Vert_2 \neq 0\}\) and \(\tilde{Z}_i := \sup\{\lambda: \Vert \tilde{\gamma}_{2,i} (\lambda) \Vert_2 \neq 0\}\). Our motivation is as follows. For a clean sample, we are more likely to have \(Z_i < \tilde{Z}_i\). For a noisy sample, if the permuted label is still a noisy label, the two selection times carry no information to distinguish the sample, so the probability of \(Z_i < \tilde{Z}_i\) is close to that of \(Z_i > \tilde{Z}_i\); if the permuted label is the clean label, we are more likely to have \(Z_i > \tilde{Z}_i\). Based on this difference, we use \(Z_i, \tilde{Z}_i\) to construct a comparison statistic:
\begin{equation}
\label{eq:w}
W_i := Z_i \cdot \mathrm{sign}(Z_i - \tilde{Z}_i).
\end{equation}
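Computing \(W\) from the two selection times is then a one-liner; negative \(W_i\) is evidence that sample \(i\) is clean, since clean samples tend to have \(Z_i < \tilde{Z}_i\):

```python
import numpy as np

def comparison_statistic(Z, Z_tilde):
    """W_i = Z_i * sign(Z_i - Z_tilde_i).
    Negative W_i (i.e., Z_i < Z_tilde_i) suggests sample i is clean;
    ties give W_i = 0, which is excluded from the selection filter."""
    Z = np.asarray(Z, dtype=float)
    Z_tilde = np.asarray(Z_tilde, dtype=float)
    return Z * np.sign(Z - Z_tilde)
```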
Based on these statistics, we define a data-dependent threshold \(T\) as
\begin{equation}
T=\max\left\{ t>0:\frac{1 + \#\left\{ j:0 < W_{j} \leq t\right\} }{\#\left\{ j:-t\leq W_{j} < 0 \right\} \lor 1}\leq q\right\},
\end{equation}
or \(T=0\) if this set is empty,
where \(q\) is the pre-defined upper bound.
Our algorithm will select the clean subset identified by
\begin{equation}
\label{eq:clean-set}
C_2:=\{j:-T\leq W_j < 0 \}.
\end{equation}
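The threshold search and the resulting clean set can be sketched as follows; since the ratio in the definition of \(T\) only changes at the values \(|W_j|\), it suffices to scan those candidates from largest to smallest (function names are illustrative):

```python
import numpy as np

def knockoff_threshold(W, q):
    """Largest t > 0 with (1 + #{0 < W_j <= t}) / (#{-t <= W_j < 0} or 1) <= q;
    returns 0.0 if no candidate t satisfies the bound."""
    W = np.asarray(W, dtype=float)
    # scan candidate thresholds |W_j| from largest to smallest
    candidates = np.sort(np.unique(np.abs(W[W != 0])))[::-1]
    for t in candidates:
        n_pos = np.sum((W > 0) & (W <= t))
        n_neg = np.sum((W < 0) & (W >= -t))
        if (1 + n_pos) / max(n_neg, 1) <= q:
            return float(t)
    return 0.0

def select_clean(W, T):
    """Clean set C_2 = {j : -T <= W_j < 0}."""
    W = np.asarray(W, dtype=float)
    return np.where((W < 0) & (W >= -T))[0]
```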
Empirically, \(T\) may equal \(0\) when the target level \(q\) is too small. In that case no clean data are selected, which renders the procedure vacuous.
Therefore, we start with a small \(q\) and iteratively increase it, recomputing \(T\), until we reach a \(T > 0\); this bounds the FSR at the smallest attainable level.
In practice, if the FSR cannot be bounded even with \(q=50\%\), we terminate the selection and simply pick the half of the samples that are most likely clean according to \(\{W_j\}\).
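The fallback can be sketched as follows (`fallback_half` is an illustrative name); it simply ranks samples by \(W_j\), where more negative values look cleaner:

```python
import numpy as np

def fallback_half(W):
    """When no q <= 50% yields a positive threshold, keep the half of the
    samples whose W_j rank as most likely clean (most negative)."""
    W = np.asarray(W, dtype=float)
    order = np.argsort(W)  # ascending: most negative (clean-looking) first
    return np.sort(order[: len(W) // 2])
```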