The linear regression model with incidental parameters is expressed as \[y_i=x_i^{\top}\beta+\gamma_i+\varepsilon_i,\] where the data-dependent \(\gamma_i\) serves as a measure of the credibility of the corresponding pseudo-labeled instance: the larger \(\Vert \gamma_i \Vert\) is, the more difficult the instance is for the model to fit.
Our optimization problem is as follows: \[(\hat{\beta},\hat{\gamma})=\underset{\beta,\gamma}{\mathrm{argmin}}\;\Vert Y-X\beta-\gamma\Vert_\mathrm{F}^2+\lambda R(\gamma).\] For fixed \(\gamma\), \(\beta\) has the closed-form solution \(\hat{\beta}(\gamma)=(X^{\top}X)^{-1}X^{\top}(Y-\gamma)\); substituting it back transforms the problem into \[\underset{\gamma}{\mathrm{argmin}}\left\Vert \tilde{Y}-\tilde{X}\gamma\right\Vert_\mathrm{F}^2+\lambda R\left(\gamma\right)\] with \(\tilde{Y}=(I-H)Y\) and \(\tilde{X}=I-H\), where \(H=X(X^{\top}X)^{-1}X^{\top}\) is the hat matrix. We can then solve the regularization path of \(\gamma\) to obtain the sparsity level of each instance.
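As a minimal illustration of this step, the following Python sketch traces the regularization path of \(\gamma\) with an \(\ell_1\) penalty on simulated data. The data, the corruption pattern, and the use of scikit-learn's lasso_path are illustrative assumptions on our part (scikit-learn parameterizes the penalty as \(\alpha\approx\lambda/(2n)\)), not a reference implementation of the method.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n, p = 60, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.1, size=n)
y[:3] += 4.0  # corrupt a few pseudo-labels so their gamma_i should survive longest

# Profile out beta: with hat matrix H = X (X^T X)^{-1} X^T, the problem
# reduces to argmin_gamma ||(I - H) y - (I - H) gamma||^2 + lambda * ||gamma||_1.
H = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - H   # plays the role of X_tilde
y_tilde = M @ y     # plays the role of Y_tilde

# lambdas is a decreasing grid; gammas[i, j] is the i-th entry of gamma_hat at lambdas[j].
lambdas, gammas, _ = lasso_path(M, y_tilde)
```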
The logistic regression model with incidental parameters can be formulated as \[y_{i,c} = \dfrac{\exp(x_{i,\cdot} \beta_{\cdot,c}+\gamma_{i,c})}{\sum_{l=1}^C\exp(x_{i,\cdot}\beta_{\cdot,l}+\gamma_{i,l})} + \varepsilon_{i,c}.\] It can be reformulated as a standard logistic regression model on the augmented design by setting \[\bar{X}=(X,I),\qquad\bar{\beta}=(\beta^{\top},\gamma^{\top})^{\top}.\] The corresponding optimization problem is \[\underset{\bar{\beta}=(\beta^{\top},\gamma^{\top})^{\top}}{\mathrm{argmin}}\; -\frac{1}{n} \sum_{i=1}^n \left(\sum_{l=1}^C Y_{i,l}\,\bar{X}_{i,\cdot}\bar{\beta}_{\cdot,l}-\log\Bigl(\sum_{l=1}^C\exp\bigl(\bar{X}_{i,\cdot}\bar{\beta}_{\cdot,l}\bigr)\Bigr)\right) + \lambda_1 R(\beta) + \lambda_2 R(\gamma).\]
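Because \(\lambda_1\) and \(\lambda_2\) penalize separate blocks of \(\bar{\beta}\), off-the-shelf solvers do not apply directly. The sketch below is one way to attack this objective with proximal gradient descent, assuming an elementwise \(\ell_1\) penalty for \(R(\beta)\), the row-wise group penalty \(\sum_i\Vert\gamma_i\Vert_2\) for \(R(\gamma)\), one-hot \(Y\), and a fixed step size; fit_incidental_logreg is a hypothetical helper of ours, not an interface from the paper.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)  # stabilize before exponentiating
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_incidental_logreg(X, Y, lam1=0.01, lam2=0.1, step=0.5, n_iter=2000):
    """Proximal gradient for
    -(1/n) loglik(X beta + gamma) + lam1 ||beta||_1 + lam2 sum_i ||gamma_i||_2.
    Works on the logits X beta + gamma directly, i.e. X_bar beta_bar without
    forming the augmented design (X, I) explicitly."""
    n, p = X.shape
    C = Y.shape[1]
    beta, gamma = np.zeros((p, C)), np.zeros((n, C))
    for _ in range(n_iter):
        G = (softmax(X @ beta + gamma) - Y) / n   # gradient of the NLL wrt the logits
        beta -= step * (X.T @ G)
        gamma -= step * G
        # prox of lam1 * ||.||_1: elementwise soft-thresholding of beta
        beta = np.sign(beta) * np.maximum(np.abs(beta) - step * lam1, 0.0)
        # prox of lam2 * sum_i ||gamma_i||_2: shrink each row of gamma as a group
        norms = np.maximum(np.linalg.norm(gamma, axis=1, keepdims=True), 1e-12)
        gamma *= np.maximum(1.0 - step * lam2 / norms, 0.0)
    return beta, gamma
```

Sweeping lam2 over a decreasing grid and recording when each row of the returned gamma reaches zero reproduces, for this model, the path-based ranking described next.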
We regard \(\hat{\gamma}\) as a function of \(\lambda\). As \(\lambda\) increases from \(0\) to \(\infty\), \(\hat{\gamma}\) becomes increasingly sparse until all of its elements are forced to vanish. Further, we choose the penalty \(R(\gamma)\) to encourage \(\gamma\) to vanish row by row, i.e., instance by instance; for example, \(R(\gamma)=\sum_{i=1}^n\sum_{j=1}^C|\gamma_{i,j}|\) or \(R(\gamma)=\sum_{i=1}^n\Vert\gamma_{i}\Vert_2\). Moreover, the penalty first zeroes out the rows of \(\hat{\gamma}\) whose instances have the lowest deviations, indicating less discrepancy between the prediction and the ground truth. Hence we can rank the pseudo-labeled data by the smallest \(\lambda\) at which the corresponding \(\hat{\gamma}_i\) vanishes.
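Continuing the hypothetical linear-regression sketch above, this ranking can be read off the path by finding, for each instance, the largest \(\lambda\) at which \(\hat{\gamma}_i\) is still active (equivalently, the smallest \(\lambda\) beyond which it has vanished):

```python
# For each instance, find the largest lambda on the path at which gamma_hat_i
# is nonzero; rows that stay zero over the whole grid get 0 (most credible).
entry = np.array([
    lambdas[np.nonzero(g)[0]].max() if np.any(g) else 0.0
    for g in gammas
])
ranking = np.argsort(entry)  # most credible pseudo-labels first
```

On the toy data above, the three corrupted instances should receive the largest entry values and therefore rank last.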