| Title: | Generalized Entropy Calibration |
|---|---|
| Description: | Generalized Entropy Calibration produces calibration weights using generalized entropy as the objective function for optimization. This approach, as implemented in the 'GECal' package, is based on Kwon, Kim, and Qiu (2024) <doi:10.48550/arXiv.2404.01076>. 'GECal' incorporates design weights into the constraints to maintain design consistency, rather than including them in the objective function itself. |
| Authors: | Yonghyun Kwon [aut, cre] (ORCID: <https://orcid.org/0000-0001-9923-6790>), Jae Kwang Kim [aut] (ORCID: <https://orcid.org/0000-0002-0246-6029>), Yumou Qiu [aut] (ORCID: <https://orcid.org/0000-0003-4846-1263>) |
| Maintainer: | Yonghyun Kwon <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.7 |
| Built: | 2026-06-02 10:01:44 UTC |
| Source: | https://github.com/yonghyun-k/gecal |
estimate performs statistical inference after calibration.
estimate(formula, data = NULL, calibration, pimat = NULL)estimate(formula, data = NULL, calibration, pimat = NULL)
formula |
An object of class "formula" specifying the calibration model. |
data |
An optional data frame containing the variables in the model (specified by |
calibration |
An object of class "calibration", generated by |
pimat |
An optional matrix contatining the joint inclusion probability matrix used for variance estimation. |
A list of class estimation including the point estimates and its standard error.
Kwon, Y., Kim, J., & Qiu, Y. (2024). Debiased calibration estimation using generalized entropy in survey sampling. Arxiv preprint <https://arxiv.org/abs/2404.01076>
Deville, J. C., and Särndal, C. E. (1992). Calibration estimators in survey sampling. Journal of the American statistical Association, 87(418), 376-382.
set.seed(11) N = 10000 x = data.frame(x1 = rnorm(N, 2, 1), x2= runif(N, 0, 4)) pi = pt((-x[,1] / 2 - x[,2] / 2), 3); pi = ifelse(pi >.7, .7, pi) delta = rbinom(N, 1, pi) Index_S = (delta == 1) pi_S = pi[Index_S]; d_S = 1 / pi_S x_S = x[Index_S,,drop = FALSE] # pimat = diag(d_S^2 - d_S) / N^2 # 1 / pi_i * (1 - 1 / pi_i) e = rnorm(N, 0, 1) y = x[,1] + x[,2] + e; y_S = y[Index_S] # plot(x_S, y_S) calibration0 <- GECal::GEcalib(~ 1, dweight = d_S, data = x_S, const = N, entropy = "SL", method = "DS") GECal::estimate(y_S ~ 1, calibration = calibration0)$estimate # Hajek estimator # sum(y_S * d_S) * N / sum(d_S) calibration <- GECal::GEcalib(~ 0, dweight = d_S, data = x_S, const = numeric(0), entropy = "SL", method = "DS") GECal::estimate(y_S ~ 1, calibration = calibration)$estimate # HT estimator calibration1 <- GECal::GEcalib(~ ., dweight = d_S, data = x_S, const = colSums(cbind(1, x)), entropy = "ET", method = "DS") GECal::estimate(y_S ~ 1, calibration = calibration1)$estimate calibration2 <- GECal::GEcalib(~ ., dweight = d_S, data = x_S, const = colSums(cbind(1, x)), entropy = "ET", method = "GEC0") GECal::estimate(y_S ~ 1, calibration = calibration2)$estimate calibration3 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S, const = colSums(cbind(1, x, log(1 / pi))), entropy = "ET", method = "GEC") GECal::estimate(y_S ~ 1, calibration = calibration3)$estimate calibration4 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S, const = colSums(cbind(1, x, NA)), entropy = "ET", method = "GEC") GECal::estimate(y_S ~ 1, calibration = calibration4)$estimate calibration5 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S, const = colSums(cbind(1, x, NA)), entropy = "ET", method = "GEC", K_alpha = "log") GECal::estimate(y_S ~ 1, calibration = calibration5)$estimateset.seed(11) N = 10000 x = data.frame(x1 = rnorm(N, 2, 1), x2= runif(N, 0, 4)) pi = pt((-x[,1] / 2 - x[,2] / 2), 3); pi = ifelse(pi >.7, .7, pi) delta = rbinom(N, 1, pi) Index_S = (delta == 1) pi_S = pi[Index_S]; d_S = 1 / pi_S x_S = x[Index_S,,drop = FALSE] # pimat = diag(d_S^2 - d_S) / N^2 # 1 / pi_i * (1 - 1 / pi_i) e = rnorm(N, 0, 1) y = x[,1] + x[,2] + e; y_S = y[Index_S] # plot(x_S, y_S) calibration0 <- GECal::GEcalib(~ 1, dweight = d_S, data = x_S, const = N, entropy = "SL", method = "DS") GECal::estimate(y_S ~ 1, calibration = calibration0)$estimate # Hajek estimator # sum(y_S * d_S) * N / sum(d_S) calibration <- GECal::GEcalib(~ 0, dweight = d_S, data = x_S, const = numeric(0), entropy = "SL", method = "DS") GECal::estimate(y_S ~ 1, calibration = calibration)$estimate # HT estimator calibration1 <- GECal::GEcalib(~ ., dweight = d_S, data = x_S, const = colSums(cbind(1, x)), entropy = "ET", method = "DS") GECal::estimate(y_S ~ 1, calibration = calibration1)$estimate calibration2 <- GECal::GEcalib(~ ., dweight = d_S, data = x_S, const = colSums(cbind(1, x)), entropy = "ET", method = "GEC0") GECal::estimate(y_S ~ 1, calibration = calibration2)$estimate calibration3 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S, const = colSums(cbind(1, x, log(1 / pi))), entropy = "ET", method = "GEC") GECal::estimate(y_S ~ 1, calibration = calibration3)$estimate calibration4 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S, const = colSums(cbind(1, x, NA)), entropy = "ET", method = "GEC") GECal::estimate(y_S ~ 1, calibration = calibration4)$estimate calibration5 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S, const = colSums(cbind(1, x, NA)), entropy = "ET", method = "GEC", K_alpha = "log") GECal::estimate(y_S ~ 1, calibration = calibration5)$estimate
It returns the debiasing covariate, which is equivalent to the first order derivatie of
the generalized entropy .
g(x, entropy = NULL, del = NULL)g(x, entropy = NULL, del = NULL)
x |
A vector of design weights |
entropy |
An optional data frame containing the variables in the model (specified by |
del |
The optional vector for threshold ( |
A vector of debiasing covariate.
set.seed(11) N = 10000 x = data.frame(x1 = rnorm(N, 2, 1), x2= runif(N, 0, 4)) pi = pt((-x[,1] / 2 - x[,2] / 2), 3); pi = ifelse(pi >.7, .7, pi) g_EL <- g(1 / pi, entropy = 1) g_ET <- g(1 / pi, entropy = 0) g_EL <- g(1 / pi, entropy = -1)set.seed(11) N = 10000 x = data.frame(x1 = rnorm(N, 2, 1), x2= runif(N, 0, 4)) pi = pt((-x[,1] / 2 - x[,2] / 2), 3); pi = ifelse(pi >.7, .7, pi) g_EL <- g(1 / pi, entropy = 1) g_ET <- g(1 / pi, entropy = 0) g_EL <- g(1 / pi, entropy = -1)
GEcalib computes the calibration weights.
Generalized entropy calibration weights maximize the generalized entropy:
subject to the calibration constraints ,
where denotes the sample index, and represents the population index.
The auxiliary variables, whose population totals are known, are defined as ,
where is the first-order derivative of the gerenalized entropy ,
and is the design weight for each sampled unit .
GEcalib( formula, dweight, data = NULL, const, method = c("GEC", "GEC0", "DS"), entropy = c("SL", "EL", "ET", "CE", "HD", "PH"), weight.scale = 1, G.scale = 1, K_alpha = NULL, is.total = TRUE, del = NULL, xtol = 1e-16, maxit = 1e+05, allowSingular = T )GEcalib( formula, dweight, data = NULL, const, method = c("GEC", "GEC0", "DS"), entropy = c("SL", "EL", "ET", "CE", "HD", "PH"), weight.scale = 1, G.scale = 1, K_alpha = NULL, is.total = TRUE, del = NULL, xtol = 1e-16, maxit = 1e+05, allowSingular = T )
formula |
An object of class "formula" specifying the calibration model. |
dweight |
A vector of sampling weights. |
data |
An optional data frame containing the variables in the model (specified by |
const |
A vector used in the calibration constraint for population totals( or means). |
method |
The method to be used in calibration. See "Details" for more information. |
entropy |
The generalized entropy used in calibration, which can be either a numeric value or a string.
If numeric, |
weight.scale |
Positive scaling factor for the calibration weights |
G.scale |
Positive scaling factor for the generalized entropy function |
K_alpha |
The |
is.total |
Logical, |
del |
The optional threshold ( |
xtol |
Optional relative steplength tolerance in nleqslv |
maxit |
Optional maximum number of major iterations in nleqslv |
allowSingular |
Optional logical value indicating if a small correction to the Jacobian is allowed in nleqslv
|
The GEcal object returns the calibration weights and necessary information for estimating population totals(or mean).
The terms to the right of the ~ symbol in the formula argument define the calibration constraints.
When method == "GEC", the debiasing covariate g(dweight) must be included in the formula.
If the population total(mean) of g(dweight) is unavailable, const that corresponds to g(dweight) can be set to NA.
In this case, GECalib performs joint optimization over
both the calibration weights and the missing value of const.
The length of the const vector should match the number of columns in the model.matrix generated by formula.
Additionally, the condition number of the model.matrix must exceed .Machine$double.eps to ensure its invertibility.
Both weight.scale and G.scale are positive scaling factors used for calibration.
Note that weight.scale is not supported when method == "DS".
Let be the scaling factor for the generalized entropy function ,
and be the scaling factor for the calibration weights .
If method == "GEC", GEcalib minimizes the negative entropy:
with respect to subject to the calibration constraints ,
where , denotes the sample index, and represents the population index.
If method == "GEC", but an element of const corresponding to the debiasing covariate
is NA, GEcalib minimizes the negative adjusted entropy:
with respect to and subject to the calibration constraints
,
where the solution is an estimate of population total for .
Examples of includes when K_alpha == NULL, and
when K_alpha == "log".
If method == "GEC0", GEcalib minimizes the negative adjusted entropy:
with respect to subject to the calibration constraints .
If method == "DS", GEcalib minimizes the divergence between and :
with respect to subject to the calibration constraints .
When method == "DS", weight.scale, the scaling factor for the calibration weights , is not applicable.
Examples of and are given in "Summary".
A list of class calibration including the calibration weights
and data needed for estimation.
The table below provides a comparison between the GEC and DS methods.
| GEC | DS |
|
|
s.t. |
s.t. |
|
|
If method == "GEC", further examples include
when entropy == "CE", and
for a threshold when entropy == "PH".
Yonghyun Kwon
Kwon, Y., Kim, J., & Qiu, Y. (2024). Debiased calibration estimation using generalized entropy in survey sampling. Arxiv preprint <https://arxiv.org/abs/2404.01076>
Deville, J. C., and Särndal, C. E. (1992). Calibration estimators in survey sampling. Journal of the American statistical Association, 87(418), 376-382.
set.seed(11) N = 10000 x = data.frame(x1 = rnorm(N, 2, 1), x2= runif(N, 0, 4)) pi = pt((-x[,1] / 2 - x[,2] / 2), 3); pi = ifelse(pi >.7, .7, pi) delta = rbinom(N, 1, pi) Index_S = (delta == 1) pi_S = pi[Index_S]; d_S = 1 / pi_S x_S = x[Index_S,] # Deville & Sarndal(1992)'s calibration using divergence w1 <- GECal::GEcalib(~ ., dweight = d_S, data = x_S, const = colSums(cbind(1, x)), entropy = "ET", method = "DS")$w # Generalized entropy calibration without debiasing covariate w2 <- GECal::GEcalib(~ ., dweight = d_S, data = x_S, const = colSums(cbind(1, x)), entropy = "ET", method = "GEC0")$w all.equal(w1, w2) # Generalized entropy calibration with debiasing covariate w3 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S, const = colSums(cbind(1, x, log(1 / pi))), entropy = "ET", method = "GEC")$w # Generalized entropy calibration with debiasing covariate # when its population total is unknown w4 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S, const = colSums(cbind(1, x, NA)), entropy = "ET", method = "GEC")$w all.equal(w1, w4) w5 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S, const = colSums(cbind(1, x, NA)), entropy = "ET", method = "GEC", K_alpha = "log")$wset.seed(11) N = 10000 x = data.frame(x1 = rnorm(N, 2, 1), x2= runif(N, 0, 4)) pi = pt((-x[,1] / 2 - x[,2] / 2), 3); pi = ifelse(pi >.7, .7, pi) delta = rbinom(N, 1, pi) Index_S = (delta == 1) pi_S = pi[Index_S]; d_S = 1 / pi_S x_S = x[Index_S,] # Deville & Sarndal(1992)'s calibration using divergence w1 <- GECal::GEcalib(~ ., dweight = d_S, data = x_S, const = colSums(cbind(1, x)), entropy = "ET", method = "DS")$w # Generalized entropy calibration without debiasing covariate w2 <- GECal::GEcalib(~ ., dweight = d_S, data = x_S, const = colSums(cbind(1, x)), entropy = "ET", method = "GEC0")$w all.equal(w1, w2) # Generalized entropy calibration with debiasing covariate w3 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S, const = colSums(cbind(1, x, log(1 / pi))), entropy = "ET", method = "GEC")$w # Generalized entropy calibration with debiasing covariate # when its population total is unknown w4 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S, const = colSums(cbind(1, x, NA)), entropy = "ET", method = "GEC")$w all.equal(w1, w4) w5 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S, const = colSums(cbind(1, x, NA)), entropy = "ET", method = "GEC", K_alpha = "log")$w
A synthetic proprietary pesticide usage survey data in Iowa CRD(Crop Reporting District) collected from GfK Kynetec in 2020.
A data frame with 1197 rows on the following 32 variables:
Haversted acres of corn in each CRD
Haversted acres of soybean in each CRD
Haversted acres of alfalfa in each CRD
Acres of pasture in each CRD
Design weights, or inverse first-order inclusion probabilities of the sample
Pesticide usage($) which is of an interest.
The original data is contaminated by adding noise and creating missing values and imputation.
data(IAdata) data(IApimat) total <- c( Corn10 = 2093000, Corn20 = 1993600, Corn30 = 1803200, Corn40 = 2084600, Corn50 = 2056600, Corn60 = 1429400, Corn70 = 2539600, Soybean10 = 1472980, Soybean20 = 1192860, Soybean30 = 721920, Soybean40 = 1477680, Soybean50 = 1353600, Soybean60 = 918380, Soybean70 = 1485200, Soybean90 = 777380, Alfalfa10 = 60590, Alfalfa30 = 154395, Alfalfa40 = 57816, Alfalfa50 = 150453, Alfalfa70 = 66065, Alfalfa80 = 240681, Pasture10 = 141947, Pasture20 = 61476, Pasture30 = 188310, Pasture40 = 213635, Pasture50 = 160737, Pasture60 = 222214, Pasture70 = 250807, Pasture80 = 570647, Pasture90 = 232630 ) calibration <- GECal::GEcalib(~ 0, dweight = d, data = IAdata, const = numeric(0), entropy = "EL", method = "DS") GECal::estimate(y ~ 1, data = IAdata, calibration = calibration, pimat = IApimat)$estimate calibration <- GECal::GEcalib(~ 0 + . -y -d, dweight = d, data = IAdata, const = total, entropy = "SL", method = "DS") GECal::estimate(y ~ 1, data = IAdata, calibration = calibration, pimat = IApimat)$estimate calibration <- GECal::GEcalib(~ 0 + . -y -d, dweight = d, data = IAdata, const = c(total), entropy = "ET", method = "DS") GECal::estimate(y ~ 1, data = IAdata, calibration = calibration, pimat = IApimat)$estimate calibration <- GECal::GEcalib(~ 0 + . -y -d + g(d), dweight = d, data = IAdata, const = c(total, NA), entropy = "HD", method = "GEC") GECal::estimate(y ~ 1, data = IAdata, calibration = calibration, pimat = IApimat)$estimate calibration <- GECal::GEcalib(~ 0 + . -y -d, dweight = d, data = IAdata, const = total, entropy = "HD", method = "GEC0") GECal::estimate(y ~ 1, data = IAdata, calibration = calibration, pimat = IApimat)$estimatedata(IAdata) data(IApimat) total <- c( Corn10 = 2093000, Corn20 = 1993600, Corn30 = 1803200, Corn40 = 2084600, Corn50 = 2056600, Corn60 = 1429400, Corn70 = 2539600, Soybean10 = 1472980, Soybean20 = 1192860, Soybean30 = 721920, Soybean40 = 1477680, Soybean50 = 1353600, Soybean60 = 918380, Soybean70 = 1485200, Soybean90 = 777380, Alfalfa10 = 60590, Alfalfa30 = 154395, Alfalfa40 = 57816, Alfalfa50 = 150453, Alfalfa70 = 66065, Alfalfa80 = 240681, Pasture10 = 141947, Pasture20 = 61476, Pasture30 = 188310, Pasture40 = 213635, Pasture50 = 160737, Pasture60 = 222214, Pasture70 = 250807, Pasture80 = 570647, Pasture90 = 232630 ) calibration <- GECal::GEcalib(~ 0, dweight = d, data = IAdata, const = numeric(0), entropy = "EL", method = "DS") GECal::estimate(y ~ 1, data = IAdata, calibration = calibration, pimat = IApimat)$estimate calibration <- GECal::GEcalib(~ 0 + . -y -d, dweight = d, data = IAdata, const = total, entropy = "SL", method = "DS") GECal::estimate(y ~ 1, data = IAdata, calibration = calibration, pimat = IApimat)$estimate calibration <- GECal::GEcalib(~ 0 + . -y -d, dweight = d, data = IAdata, const = c(total), entropy = "ET", method = "DS") GECal::estimate(y ~ 1, data = IAdata, calibration = calibration, pimat = IApimat)$estimate calibration <- GECal::GEcalib(~ 0 + . -y -d + g(d), dweight = d, data = IAdata, const = c(total, NA), entropy = "HD", method = "GEC") GECal::estimate(y ~ 1, data = IAdata, calibration = calibration, pimat = IApimat)$estimate calibration <- GECal::GEcalib(~ 0 + . -y -d, dweight = d, data = IAdata, const = total, entropy = "HD", method = "GEC0") GECal::estimate(y ~ 1, data = IAdata, calibration = calibration, pimat = IApimat)$estimate
The matrix that is used for variance estimation in IAdata. The sample is collected from a straitified random sampling.
data(IApimat)data(IApimat)