| Title: | Hierarchical Neyman-Pearson Classification for Ordered Classes |
|---|---|
| Description: | The Hierarchical Neyman-Pearson (H-NP) classification framework extends the Neyman-Pearson classification paradigm to multi-class settings where classes have a natural priority ordering. This is particularly useful for classification on imbalanced datasets, for example disease severity classification, where under-classification errors (misclassifying patients into less severe categories) are more consequential than other misclassifications. The package implements H-NP umbrella algorithms that control under-classification errors at user-specified levels with high probability. It supports the creation of H-NP classifiers using scoring functions based on built-in classification methods (including logistic regression, support vector machines, and random forests), as well as user-trained scoring functions. For theoretical details, please refer to Lijia Wang, Y. X. Rachel Wang, Jingyi Jessica Li & Xin Tong (2024) <doi:10.1080/01621459.2023.2270657>. |
| Authors: | Che Shen [aut, cre] (Implementation and maintenance), Lujia Yang [aut] (Testing and debugging), Lijia Wang [aut] (Original theory and supervision), Shunan Yao [aut] (Supervision and debugging) |
| Maintainer: | Che Shen <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-05-13 09:13:23 UTC |
| Source: | https://github.com/cran/HNPclassifier |
Fit one of the supported classifiers for ternary classification:
Random Forest, SVM (with probabilities), or multinomial logistic regression
via nnet::multinom.
base_function(x, y, method = "randomforest")
| x | A data.frame of predictors/features. |
|---|---|
| y | A factor response with levels "1","2","3". |
| method | Character string: one of 'randomforest', 'svm', or 'logistic'. |
A trained model object compatible with the downstream scoring functions.
set.seed(123)
x <- data.frame(a = rnorm(20), b = rnorm(20))
y <- factor(sample(c("1","2","3"), 20, TRUE))
model <- base_function(x, y, method = 'randomforest')
Runs multiple iterations of the HNP experiment on a dataset (using random 7:3 train/test splits by default) and generates a PDF with 15 boxplots comparing performance before and after NP control.
hnp_box_plot(
  data,
  class_col,
  method = "logistic",
  n_runs = 100,
  levels = c(0.05, 0.05),
  tolerances = c(0.05, 0.05),
  output_file = NULL,
  hnp_split = NULL,
  split_ratio = c(0.7, 0.3)
)
| data | A data.frame containing features and class label. |
|---|---|
| class_col | Character. Name of the class column (must be mapped to "1","2","3"). |
| method | Character. Base classifier method ('randomforest', 'svm', 'logistic'). |
| n_runs | Integer. Number of iterations to run. |
| levels | Numeric vector. Alpha levels (constraints) for classes (e.g., c(0.05, 0.1)). |
| tolerances | Numeric vector. Delta tolerances for classes (e.g., c(0.01, 0.02)). |
| output_file | Character. Path to save the PDF output. |
| hnp_split | List. Split configuration for HNP internal validation. |
| split_ratio | Numeric vector. Ratio of data used for training and testing (e.g., c(0.7, 0.3)). |
No return value, called for side effects.
set.seed(123)
n <- 2000
features <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n)
)
y <- factor(sample(c("1", "2", "3"), n, replace = TRUE, prob = c(0.2, 0.3, 0.5)))
data <- cbind(features, y)
hnp_box_plot(
  data = data,
  class_col = "y",
  method = "logistic",
  n_runs = 2,
  levels = c(0.05, 0.05),
  tolerances = c(0.05, 0.05),
  output_file = tempfile(fileext = ".pdf")
)
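A single random 7:3 split of the kind described above might be sketched in base R as follows. This is only an illustration: `split_once` is a hypothetical helper, and it assumes plain (non-stratified) row sampling, which may not be how `hnp_box_plot` partitions the data internally.

```r
# Hypothetical helper: one random train/test split by row index.
# Assumption: plain (non-stratified) sampling; the package may differ.
split_once <- function(data, split_ratio = c(0.7, 0.3)) {
  n <- nrow(data)
  n_train <- round(n * split_ratio[1] / sum(split_ratio))
  train_idx <- sample.int(n, n_train)
  list(
    train = data[train_idx, , drop = FALSE],
    test  = data[-train_idx, , drop = FALSE]
  )
}

set.seed(1)
toy <- data.frame(x1 = rnorm(10), y = factor(rep(c("1", "2"), 5)))
parts <- split_once(toy)
nrow(parts$train)  # 7 rows for training
nrow(parts$test)   # 3 rows for testing
```

Repeating such a split `n_runs` times is what produces the spread shown in a boxplot.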
Calculate the order k of the order statistic that satisfies the given confidence requirements when determining classification thresholds.
hnp_delta_search(n, level, delta)
| n | Integer specifying the cardinality of the grid set Tau (size of |
|---|---|
| level | Numeric between 0 and 1 representing the desired control level (alpha) for the ith under-classification error. |
| delta | Numeric tolerance parameter for the confidence bound. |
An integer k, the order of the order statistic that meets the
confidence requirements. Returns NA if no valid solution exists.
Lijia Wang, Y. X. Rachel Wang, Jingyi Jessica Li, and Xin Tong (2024). "Hierarchical Neyman-Pearson Classification for Prioritizing Severe Disease Categories in COVID-19 Patient Data." Journal of the American Statistical Association, 119(545), 39-51. doi:10.1080/01621459.2023.2270657
k <- hnp_delta_search(n = 100, level = 0.05, delta = 0.01)
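The idea behind such a search can be illustrated with the standard order-statistic argument. Assume n i.i.d. continuous scores from one class, and that an observation is under-classified when its score falls below the threshold. If the threshold is the k-th smallest score, the probability that the true under-classification error exceeds `level` equals P(Binomial(n, level) <= k - 1). The sketch below picks the largest k for which this probability is at most `delta`. The name `np_order_sketch` is hypothetical and this is not the package's implementation, which may use a different bound.

```r
# Hypothetical sketch (not the package code): largest rank k such that
# P(error > level) = pbinom(k - 1, n, level) is at most delta.
# Assumes i.i.d. continuous scores and a "score below threshold" error.
np_order_sketch <- function(n, level, delta) {
  k <- seq_len(n)
  violation <- pbinom(k - 1, size = n, prob = level)
  ok <- which(violation <= delta)
  if (length(ok) == 0) NA_integer_ else max(ok)
}

np_order_sketch(n = 100, level = 0.05, delta = 0.01)  # 1
np_order_sketch(n = 50,  level = 0.05, delta = 0.01)  # NA: sample too small
```

With n = 50 even the smallest score fails the requirement, since 0.95^50 is about 0.077, which exceeds 0.01. This mirrors the documented behaviour of returning NA when no valid solution exists.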
Validate the class column and re-label provided class names to canonical factor levels "1", "2", and "3". Useful for preparing datasets before training and evaluation in the HNP Umbrella pipeline.
hnp_map_classes(data, class_col, class_1, class_2, class_3)
| data | A data.frame or data.table containing the dataset. |
|---|---|
| class_col | Character scalar. Name of the class/label column in |
| class_1 | Character. Original label that should map to level "1" (most severe, highest priority). |
| class_2 | Character. Original label that should map to level "2" (moderately severe). |
| class_3 | Character. Original label that should map to level "3" (normal or less important). |
The input data with class_col converted to a factor with
levels c("1","2","3").
df <- data.frame(y = c("low","mid","high","mid"), x1 = rnorm(4))
df2 <- hnp_map_classes(df, class_col = "y", class_1 = "low", class_2 = "mid", class_3 = "high")
table(df2$y)
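For intuition, a comparable relabeling can be written in base R with `factor()`. This is a sketch of the general technique, not the package's code. In this toy data "high" is assumed to be the most severe label, so it maps to level "1".

```r
# Base-R sketch of relabeling to canonical levels "1", "2", "3".
# Assumption: "high" is the most severe label in this toy data.
df <- data.frame(
  y  = c("low", "mid", "high", "mid"),
  x1 = c(0.1, 0.2, 0.3, 0.4)
)
df$y <- factor(df$y, levels = c("high", "mid", "low"), labels = c("1", "2", "3"))

# Labels outside the three given names would become NA, so check for that.
stopifnot(!anyNA(df$y))

levels(df$y)        # "1" "2" "3"
as.character(df$y)  # "3" "2" "1" "2"
```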
Compute confusion matrix, class-wise false positive/negative rates, over- and under-classification errors, overall accuracy, and a normalized error table for a ternary classifier produced by the HNP pipeline.
hnp_summary(classifier, data, class_col, class_number = NULL)
| classifier | A function |
|---|---|
| data | A data.frame containing features and the true class column. |
| class_col | Character scalar. Name of the true class/label column. |
| class_number | Optional integer. Number of classes; if |
A list with components: confusion_matrix, false_positive_rate,
false_negative_rate, overall_accuracy, predictions,
under_classification_error, over_classification_error,
total_over_classification_error, total_under_classification_error, and
error_table.
set.seed(123)
n <- 50
x <- data.frame(a = rnorm(n), b = rnorm(n))
y <- factor(sample(c("1","2","3"), n, TRUE))
df <- cbind(x, y)
clf <- function(X) sample(c(1,2,3), nrow(X), replace=TRUE)
res <- hnp_summary(clf, data = df, class_col = "y")
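To make the error types concrete, the sketch below computes under- and over-classification rates from a confusion matrix in base R. It assumes class 1 has the highest priority, so that predicting a larger class index than the truth counts as under-classification and a smaller one as over-classification. The exact definitions and normalization used by `hnp_summary` may differ.

```r
# Toy labels; rows of the confusion matrix are true classes.
truth <- factor(c(1, 1, 2, 2, 3, 3), levels = 1:3)
pred  <- factor(c(1, 2, 2, 3, 3, 1), levels = 1:3)
cm <- table(truth = truth, predicted = pred)
row_totals <- as.vector(rowSums(cm))

# Under-classification: true class i predicted as a less severe class (> i).
under <- sapply(1:2, function(i) sum(cm[i, (i + 1):3]) / row_totals[i])
# Over-classification: true class i predicted as a more severe class (< i).
over <- sapply(2:3, function(i) sum(cm[i, 1:(i - 1)]) / row_totals[i])
accuracy <- sum(diag(cm)) / sum(cm)

under     # 0.5 0.5 (classes 1 and 2)
over      # 0.0 0.5 (classes 2 and 3)
accuracy  # 0.5
```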
Implementation of the HNP Umbrella algorithm for ternary classification.
hnp_umbrella(
  S,
  levels,
  tolerances,
  A1 = NULL,
  method = "randomforest",
  hnp_split = NULL,
  class_col
)
| S | Training dataset |
|---|---|
| levels | Confidence levels (alpha) for each class |
| tolerances | Tolerance parameters (delta) for each class |
| A1 | Candidate thresholds for class 1 |
| method | Classification method to use ('randomforest', 'svm', 'logistic') |
| hnp_split | Data splitting ratios for each class |
| class_col | Character scalar. Name of the class column in the dataset (must be mapped to levels "1","2","3"). |
A classifier function that takes new data and assigns each observation to a class, with under-classification error rates controlled at the specified levels.
Lijia Wang, Y. X. Rachel Wang, Jingyi Jessica Li, and Xin Tong (2024). "Hierarchical Neyman-Pearson Classification for Prioritizing Severe Disease Categories in COVID-19 Patient Data." Journal of the American Statistical Association, 119(545), 39-51. doi:10.1080/01621459.2023.2270657
set.seed(123)
n <- 500
features <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n)
)
y <- factor(sample(c("1", "2", "3"), n, replace = TRUE, prob = c(0.2, 0.3, 0.5)))
data <- cbind(features, y)
clf <- hnp_umbrella(
  S = data,
  levels = c(0.1, 0.1),
  tolerances = c(0.1, 0.1),
  class_col = "y",
  method = "randomforest"
)
Flexible variant of the HNP Umbrella algorithm that accepts user-provided scoring functions and explicit data splits for thresholding and error estimation. It bypasses internal model training and focuses on threshold selection with confidence controls.
hnp_umbrella_flex(
  score_data,
  threshold_data,
  error_data,
  levels,
  tolerances,
  A1 = NULL,
  score_functions = NULL,
  class_col
)
| score_data | A data.frame for fitting/deriving scoring behavior. |
|---|---|
| threshold_data | A data.frame used to compute thresholds. |
| error_data | A data.frame used to estimate empirical errors. |
| levels | Numeric vector of length 2. Confidence levels (alpha) for class 1 and class 2 under-classification controls. |
| tolerances | Numeric vector of length 2. Tolerance (delta) values for the corresponding classes. |
| A1 | Optional numeric vector of candidate thresholds for class 1. |
| score_functions | A list with at least two functions: |
| class_col | Character scalar. Name of the class column in the data. |
A classifier function function(new_data) data.frame(result=...),
or NULL if no valid classifier is found.
Lijia Wang, Y. X. Rachel Wang, Jingyi Jessica Li, and Xin Tong (2024). "Hierarchical Neyman-Pearson Classification for Prioritizing Severe Disease Categories in COVID-19 Patient Data." Journal of the American Statistical Association, 119(545), 39-51. doi:10.1080/01621459.2023.2270657
set.seed(123)
n <- 500
score_data <- data.frame(x=rnorm(n), y=factor(sample(1:3, n, replace=TRUE)))
threshold_data <- data.frame(x=rnorm(n), y=factor(sample(1:3, n, replace=TRUE)))
error_data <- data.frame(x=rnorm(n), y=factor(sample(1:3, n, replace=TRUE)))
T1 <- function(d) as.numeric(d$x > 0)
T2 <- function(d) as.numeric(d$x > 0.5)
clf <- hnp_umbrella_flex(score_data, threshold_data, error_data,
  levels = c(0.05, 0.05), tolerances = c(0.01, 0.01),
  score_functions = list(T1, T2), class_col = 'y')
preds <- clf(score_data)
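Once two thresholds have been chosen, a ternary classifier of the documented form `function(new_data) data.frame(result = ...)` can apply them sequentially. The sketch below assumes the rule "class 1 if the first score reaches its threshold, otherwise class 2 if the second score reaches its threshold, otherwise class 3". The helper name `make_sequential_classifier` and the fixed thresholds are hypothetical, and the package's internal rule may differ in detail (for example in how ties are handled).

```r
# Hypothetical sketch: build a classifier from two score functions and
# two already-selected thresholds (threshold selection is not shown).
make_sequential_classifier <- function(score_functions, thresholds) {
  function(new_data) {
    s1 <- score_functions[[1]](new_data)
    s2 <- score_functions[[2]](new_data)
    result <- ifelse(s1 >= thresholds[1], 1L,
                     ifelse(s2 >= thresholds[2], 2L, 3L))
    data.frame(result = result)
  }
}

# Toy score functions, chosen only for illustration.
score_1 <- function(d) d$x
score_2 <- function(d) -abs(d$x)
toy_clf <- make_sequential_classifier(list(score_1, score_2), thresholds = c(1, -0.5))
toy_clf(data.frame(x = c(2, 0.2, -3)))$result  # 1 2 3
```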
Compute the optimal threshold for class i from the score functions and confidence bounds, given the tolerance and the under-classification error level.
hnp_upper_bound(S_it, level, delta_i, score_functions, thresholds, i)
| S_it | The left-out class-i samples. |
|---|---|
| level | Desired control level (alpha) for the ith under-classification error. |
| delta_i | The ith tolerance parameter. |
| score_functions | A list of score functions (T_1, ..., T_i). |
| thresholds | Numeric vector of length |
| i | Class index i. |
t_i_bar: the optimal ith threshold.
Lijia Wang, Y. X. Rachel Wang, Jingyi Jessica Li, and Xin Tong (2024). "Hierarchical Neyman-Pearson Classification for Prioritizing Severe Disease Categories in COVID-19 Patient Data." Journal of the American Statistical Association, 119(545), 39-51. doi:10.1080/01621459.2023.2270657
set.seed(123)
n <- 200
S_it <- data.frame(
  feature1 = rnorm(n, mean = 2, sd = 1),
  feature2 = runif(n, min = 0, max = 5)
)
level <- 0.05
delta_i <- 0.01
score_functions <- list(
  function(data) runif(nrow(data)),
  function(data) runif(nrow(data))
)
thresholds <- c(2.5, NA)
i <- 1
t_i_bar <- hnp_upper_bound(S_it, level, delta_i, score_functions, thresholds, i)
Return a function that takes new data and outputs the score
for class 1, typically the predicted probability P(Y=1|X). Works with the
supported methods used by base_function.
probability_to_score_1(model, method)
| model | A fitted model returned by |
|---|---|
| method | Character string specifying the model family used: one of 'svm', 'randomforest', or 'logistic'. |
A function of the form function(X) numeric, where X is a
data.frame of features and the returned numeric vector contains the scores for class 1.
Lijia Wang, Y. X. Rachel Wang, Jingyi Jessica Li, and Xin Tong (2024). "Hierarchical Neyman-Pearson Classification for Prioritizing Severe Disease Categories in COVID-19 Patient Data." Journal of the American Statistical Association, 119(545), 39-51. doi:10.1080/01621459.2023.2270657
set.seed(123)
x <- data.frame(a = rnorm(20), b = rnorm(20))
y <- factor(sample(c("1","2","3"), 20, TRUE))
model <- base_function(x, y, method = 'randomforest')
T1 <- probability_to_score_1(model, method = 'randomforest')
newx <- data.frame(a = rnorm(5), b = rnorm(5))
scores <- T1(newx)
Return a function that produces the ratio of predicted
probabilities P(Y=2|X) / P(Y=3|X), with safeguards for zeros/NA and
infinite values. Works with the supported methods used by base_function.
probability_to_score_2(model, method)
| model | A fitted model returned by |
|---|---|
| method | Character string specifying the model family used: one of 'svm', 'randomforest', or 'logistic'. |
A function of the form function(X) numeric, where X is a
data.frame of features and the returned numeric vector contains the T2 scores.
Lijia Wang, Y. X. Rachel Wang, Jingyi Jessica Li, and Xin Tong (2024). "Hierarchical Neyman-Pearson Classification for Prioritizing Severe Disease Categories in COVID-19 Patient Data." Journal of the American Statistical Association, 119(545), 39-51. doi:10.1080/01621459.2023.2270657
set.seed(123)
x <- data.frame(a = rnorm(20), b = rnorm(20))
y <- factor(sample(c("1","2","3"), 20, TRUE))
model <- base_function(x, y, method = 'randomforest')
T2 <- probability_to_score_2(model, method = 'randomforest')
newx <- data.frame(a = rnorm(5), b = rnorm(5))
scores <- T2(newx)
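A probability ratio with safeguards of the kind described above might look like the following sketch. The helper `safe_ratio`, the floor `eps`, the cap `cap`, and the choice to map missing values to 0 are all assumptions made for illustration. The manual does not state which safeguards `probability_to_score_2` actually applies.

```r
# Hypothetical sketch of P(Y=2|X) / P(Y=3|X) with guards for zero
# denominators, missing values, and very large ratios.
safe_ratio <- function(p2, p3, eps = 1e-12, cap = 1e12) {
  ratio <- p2 / pmax(p3, eps)  # floor the denominator to avoid division by zero
  ratio[is.na(ratio)] <- 0     # assumption: missing probabilities score 0
  pmin(ratio, cap)             # keep every score finite
}

p2 <- c(0.2, 0.5, NA)
p3 <- c(0.4, 0.0, 0.1)
toy_scores <- safe_ratio(p2, p3)
toy_scores[1]                  # 0.5
all(is.finite(toy_scores))     # TRUE
```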