Title: | Generalized Entropy Calibration |
---|---|
Description: | Generalized Entropy Calibration produces calibration weights using generalized entropy as the objective function for optimization. This approach, as implemented in the 'GECal' package, is based on Kwon, Kim, and Qiu (2024) <doi:10.48550/arXiv.2404.01076>. Unlike traditional methods, 'GECal' incorporates design weights into the constraints to maintain design consistency, rather than including them in the objective function itself. |
Authors: | Yonghyun Kwon [aut, cre] , Jae Kwang Kim [aut] , Yumou Qiu [aut] |
Maintainer: | Yonghyun Kwon <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.5 |
Built: | 2024-10-28 14:15:26 UTC |
Source: | https://github.com/yonghyun-k/gecal |
estimate
performs statistical inference after calibration.
estimate(formula, data = NULL, calibration, pimat = NULL)
estimate(formula, data = NULL, calibration, pimat = NULL)
formula |
An object of class "formula" specifying the calibration model. |
data |
An optional data frame containing the variables in the model (specified by |
calibration |
An object of class "calibration", generated by |
pimat |
An optional matrix contatining the joint inclusion probability matrix used for variance estimation. |
A list of class estimation
including the point estimates and its standard error.
Kwon, Y., Kim, J., & Qiu, Y. (2024). Debiased calibration estimation using generalized entropy in survey sampling. Arxiv preprint <https://arxiv.org/abs/2404.01076>
Deville, J. C., and Särndal, C. E. (1992). Calibration estimators in survey sampling. Journal of the American statistical Association, 87(418), 376-382.
set.seed(11) N = 10000 x = data.frame(x1 = rnorm(N, 2, 1), x2= runif(N, 0, 4)) pi = pt((-x[,1] / 2 - x[,2] / 2), 3); pi = ifelse(pi >.7, .7, pi) delta = rbinom(N, 1, pi) Index_S = (delta == 1) pi_S = pi[Index_S]; d_S = 1 / pi_S x_S = x[Index_S,,drop = FALSE] # pimat = diag(d_S^2 - d_S) / N^2 # 1 / pi_i * (1 - 1 / pi_i) e = rnorm(N, 0, 1) y = x[,1] + x[,2] + e; y_S = y[Index_S] # plot(x_S, y_S) calibration0 <- GECal::GEcalib(~ 1, dweight = d_S, data = x_S, const = N, entropy = "SL", method = "DS") GECal::estimate(y_S ~ 1, calibration = calibration0)$estimate # Hajek estimator # sum(y_S * d_S) * N / sum(d_S) calibration <- GECal::GEcalib(~ 0, dweight = d_S, data = x_S, const = numeric(0), entropy = "SL", method = "DS") GECal::estimate(y_S ~ 1, calibration = calibration)$estimate # HT estimator calibration1 <- GECal::GEcalib(~ ., dweight = d_S, data = x_S, const = colSums(cbind(1, x)), entropy = "ET", method = "DS") GECal::estimate(y_S ~ 1, calibration = calibration1)$estimate calibration2 <- GECal::GEcalib(~ ., dweight = d_S, data = x_S, const = colSums(cbind(1, x)), entropy = "ET", method = "GEC0") GECal::estimate(y_S ~ 1, calibration = calibration2)$estimate calibration3 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S, const = colSums(cbind(1, x, log(1 / pi))), entropy = "ET", method = "GEC") GECal::estimate(y_S ~ 1, calibration = calibration3)$estimate calibration4 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S, const = colSums(cbind(1, x, NA)), entropy = "ET", method = "GEC") GECal::estimate(y_S ~ 1, calibration = calibration4)$estimate calibration5 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S, const = colSums(cbind(1, x, NA)), entropy = "ET", method = "GEC", K_alpha = "log") GECal::estimate(y_S ~ 1, calibration = calibration5)$estimate
set.seed(11) N = 10000 x = data.frame(x1 = rnorm(N, 2, 1), x2= runif(N, 0, 4)) pi = pt((-x[,1] / 2 - x[,2] / 2), 3); pi = ifelse(pi >.7, .7, pi) delta = rbinom(N, 1, pi) Index_S = (delta == 1) pi_S = pi[Index_S]; d_S = 1 / pi_S x_S = x[Index_S,,drop = FALSE] # pimat = diag(d_S^2 - d_S) / N^2 # 1 / pi_i * (1 - 1 / pi_i) e = rnorm(N, 0, 1) y = x[,1] + x[,2] + e; y_S = y[Index_S] # plot(x_S, y_S) calibration0 <- GECal::GEcalib(~ 1, dweight = d_S, data = x_S, const = N, entropy = "SL", method = "DS") GECal::estimate(y_S ~ 1, calibration = calibration0)$estimate # Hajek estimator # sum(y_S * d_S) * N / sum(d_S) calibration <- GECal::GEcalib(~ 0, dweight = d_S, data = x_S, const = numeric(0), entropy = "SL", method = "DS") GECal::estimate(y_S ~ 1, calibration = calibration)$estimate # HT estimator calibration1 <- GECal::GEcalib(~ ., dweight = d_S, data = x_S, const = colSums(cbind(1, x)), entropy = "ET", method = "DS") GECal::estimate(y_S ~ 1, calibration = calibration1)$estimate calibration2 <- GECal::GEcalib(~ ., dweight = d_S, data = x_S, const = colSums(cbind(1, x)), entropy = "ET", method = "GEC0") GECal::estimate(y_S ~ 1, calibration = calibration2)$estimate calibration3 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S, const = colSums(cbind(1, x, log(1 / pi))), entropy = "ET", method = "GEC") GECal::estimate(y_S ~ 1, calibration = calibration3)$estimate calibration4 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S, const = colSums(cbind(1, x, NA)), entropy = "ET", method = "GEC") GECal::estimate(y_S ~ 1, calibration = calibration4)$estimate calibration5 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S, const = colSums(cbind(1, x, NA)), entropy = "ET", method = "GEC", K_alpha = "log") GECal::estimate(y_S ~ 1, calibration = calibration5)$estimate
It returns the debiasing covariate, which is equivalent to the first order derivatie of
the generalized entropy .
g(x, entropy = NULL, del = NULL)
g(x, entropy = NULL, del = NULL)
x |
A vector of design weights |
entropy |
An optional data frame containing the variables in the model (specified by |
del |
The optional vector for threshold ( |
A vector of debiasing covariate.
set.seed(11) N = 10000 x = data.frame(x1 = rnorm(N, 2, 1), x2= runif(N, 0, 4)) pi = pt((-x[,1] / 2 - x[,2] / 2), 3); pi = ifelse(pi >.7, .7, pi) g_EL <- g(1 / pi, entropy = 1) g_ET <- g(1 / pi, entropy = 0) g_EL <- g(1 / pi, entropy = -1)
set.seed(11) N = 10000 x = data.frame(x1 = rnorm(N, 2, 1), x2= runif(N, 0, 4)) pi = pt((-x[,1] / 2 - x[,2] / 2), 3); pi = ifelse(pi >.7, .7, pi) g_EL <- g(1 / pi, entropy = 1) g_ET <- g(1 / pi, entropy = 0) g_EL <- g(1 / pi, entropy = -1)
GEcalib
computes the calibration weights.
Generalized entropy calibration weights maximize the generalized entropy:
subject to the calibration constraints ,
where
denotes the sample index, and
represents the population index.
The auxiliary variables, whose population totals are known, are defined as
,
where
is the first-order derivative of the gerenalized entropy
,
and
is the design weight for each sampled unit
.
GEcalib( formula, dweight, data = NULL, const, method = c("GEC", "GEC0", "DS"), entropy = c("SL", "EL", "ET", "CE", "HD", "PH"), weight.scale = 1, G.scale = 1, K_alpha = NULL, is.total = TRUE, del = NULL )
GEcalib( formula, dweight, data = NULL, const, method = c("GEC", "GEC0", "DS"), entropy = c("SL", "EL", "ET", "CE", "HD", "PH"), weight.scale = 1, G.scale = 1, K_alpha = NULL, is.total = TRUE, del = NULL )
formula |
An object of class "formula" specifying the calibration model. |
dweight |
A vector of sampling weights. |
data |
An optional data frame containing the variables in the model (specified by |
const |
A vector used in the calibration constraint for population totals( or means). |
method |
The method to be used in calibration. See "Details" for more information. |
entropy |
The generalized entropy used in calibration, which can be either a numeric value or a string.
If numeric, |
weight.scale |
Positive scaling factor for the calibration weights |
G.scale |
Positive scaling factor for the generalized entropy function |
K_alpha |
The |
is.total |
Logical, |
del |
The optional threshold ( |
The GEcal
object returns the calibration weights and necessary information for estimating population totals(or mean).
The terms to the right of the ~ symbol in the formula
argument define the calibration constraints.
When method == "GEC"
, the debiasing covariate g(dweight)
must be included in the formula
.
If the population total(mean) of g(dweight)
is unavailable, const
that corresponds to g(dweight)
can be set to NA
.
In this case, GECalib
performs joint optimization over
both the calibration weights and the missing value of
const
.
The length of the const
vector should match the number of columns in the model.matrix
generated by formula
.
Additionally, the condition number of the model.matrix
must exceed .Machine$double.eps
to ensure its invertibility.
Both weight.scale
and G.scale
are positive scaling factors used for calibration.
Note that weight.scale
is not supported when method == "DS"
.
Let be the scaling factor for the generalized entropy function
,
and
be the scaling factor for the calibration weights
.
If method == "GEC"
, GEcalib
minimizes the negative entropy:
with respect to subject to the calibration constraints
,
where
,
denotes the sample index, and
represents the population index.
If method == "GEC"
, but an element of const
corresponding to the debiasing covariate
is
NA
, GEcalib
minimizes the negative adjusted entropy:
with respect to and
subject to the calibration constraints
,
where the solution
is an estimate of population total for
.
Examples of
includes
when
K_alpha == NULL
, and
when K_alpha == "log"
.
If method == "GEC0"
, GEcalib
minimizes the negative adjusted entropy:
with respect to subject to the calibration constraints
.
If method == "DS"
, GEcalib
minimizes the divergence between and
:
with respect to subject to the calibration constraints
.
When
method == "DS"
, weight.scale
, the scaling factor for the calibration weights , is not applicable.
Examples of and
are given in "Summary".
A list of class calibration
including the calibration weights
and data needed for estimation.
The table below provides a comparison between the GEC and DS methods.
GEC | DS |
|
|
s.t. |
s.t. |
|
|
If method == "GEC"
, further examples include
when entropy == "CE"
, and
for a threshold when
entropy == "PH"
.
Yonghyun Kwon
Kwon, Y., Kim, J., & Qiu, Y. (2024). Debiased calibration estimation using generalized entropy in survey sampling. Arxiv preprint <https://arxiv.org/abs/2404.01076>
Deville, J. C., and Särndal, C. E. (1992). Calibration estimators in survey sampling. Journal of the American statistical Association, 87(418), 376-382.
set.seed(11) N = 10000 x = data.frame(x1 = rnorm(N, 2, 1), x2= runif(N, 0, 4)) pi = pt((-x[,1] / 2 - x[,2] / 2), 3); pi = ifelse(pi >.7, .7, pi) delta = rbinom(N, 1, pi) Index_S = (delta == 1) pi_S = pi[Index_S]; d_S = 1 / pi_S x_S = x[Index_S,] # Deville & Sarndal(1992)'s calibration using divergence w1 <- GECal::GEcalib(~ ., dweight = d_S, data = x_S, const = colSums(cbind(1, x)), entropy = "ET", method = "DS")$w # Generalized entropy calibration without debiasing covariate w2 <- GECal::GEcalib(~ ., dweight = d_S, data = x_S, const = colSums(cbind(1, x)), entropy = "ET", method = "GEC0")$w all.equal(w1, w2) # Generalized entropy calibration with debiasing covariate w3 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S, const = colSums(cbind(1, x, log(1 / pi))), entropy = "ET", method = "GEC")$w # Generalized entropy calibration with debiasing covariate # when its population total is unknown w4 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S, const = colSums(cbind(1, x, NA)), entropy = "ET", method = "GEC")$w all.equal(w1, w4) w5 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S, const = colSums(cbind(1, x, NA)), entropy = "ET", method = "GEC", K_alpha = "log")$w
set.seed(11) N = 10000 x = data.frame(x1 = rnorm(N, 2, 1), x2= runif(N, 0, 4)) pi = pt((-x[,1] / 2 - x[,2] / 2), 3); pi = ifelse(pi >.7, .7, pi) delta = rbinom(N, 1, pi) Index_S = (delta == 1) pi_S = pi[Index_S]; d_S = 1 / pi_S x_S = x[Index_S,] # Deville & Sarndal(1992)'s calibration using divergence w1 <- GECal::GEcalib(~ ., dweight = d_S, data = x_S, const = colSums(cbind(1, x)), entropy = "ET", method = "DS")$w # Generalized entropy calibration without debiasing covariate w2 <- GECal::GEcalib(~ ., dweight = d_S, data = x_S, const = colSums(cbind(1, x)), entropy = "ET", method = "GEC0")$w all.equal(w1, w2) # Generalized entropy calibration with debiasing covariate w3 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S, const = colSums(cbind(1, x, log(1 / pi))), entropy = "ET", method = "GEC")$w # Generalized entropy calibration with debiasing covariate # when its population total is unknown w4 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S, const = colSums(cbind(1, x, NA)), entropy = "ET", method = "GEC")$w all.equal(w1, w4) w5 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S, const = colSums(cbind(1, x, NA)), entropy = "ET", method = "GEC", K_alpha = "log")$w
A synthetic proprietary pesticide usage survey data in Iowa CRD(Crop Reporting District) collected from GfK Kynetec in 2020.
A data frame with 1197 rows on the following 32 variables:
Haversted acres of corn in each CRD
Haversted acres of soybean in each CRD
Haversted acres of alfalfa in each CRD
Acres of pasture in each CRD
Design weights, or inverse first-order inclusion probabilities of the sample
Pesticide usage($) which is of an interest.
The original data is contaminated by adding noise and creating missing values and imputation.
data(IAdata) data(IApimat) total <- c( Corn10 = 2093000, Corn20 = 1993600, Corn30 = 1803200, Corn40 = 2084600, Corn50 = 2056600, Corn60 = 1429400, Corn70 = 2539600, Soybean10 = 1472980, Soybean20 = 1192860, Soybean30 = 721920, Soybean40 = 1477680, Soybean50 = 1353600, Soybean60 = 918380, Soybean70 = 1485200, Soybean90 = 777380, Alfalfa10 = 60590, Alfalfa30 = 154395, Alfalfa40 = 57816, Alfalfa50 = 150453, Alfalfa70 = 66065, Alfalfa80 = 240681, Pasture10 = 141947, Pasture20 = 61476, Pasture30 = 188310, Pasture40 = 213635, Pasture50 = 160737, Pasture60 = 222214, Pasture70 = 250807, Pasture80 = 570647, Pasture90 = 232630 ) calibration <- GECal::GEcalib(~ 0, dweight = d, data = IAdata, const = numeric(0), entropy = "EL", method = "DS") GECal::estimate(y ~ 1, data = IAdata, calibration = calibration, pimat = IApimat)$estimate calibration <- GECal::GEcalib(~ 0 + . -y -d, dweight = d, data = IAdata, const = total, entropy = "SL", method = "DS") GECal::estimate(y ~ 1, data = IAdata, calibration = calibration, pimat = IApimat)$estimate calibration <- GECal::GEcalib(~ 0 + . -y -d, dweight = d, data = IAdata, const = c(total), entropy = "ET", method = "DS") GECal::estimate(y ~ 1, data = IAdata, calibration = calibration, pimat = IApimat)$estimate calibration <- GECal::GEcalib(~ 0 + . -y -d + g(d), dweight = d, data = IAdata, const = c(total, NA), entropy = "HD", method = "GEC") GECal::estimate(y ~ 1, data = IAdata, calibration = calibration, pimat = IApimat)$estimate calibration <- GECal::GEcalib(~ 0 + . -y -d, dweight = d, data = IAdata, const = total, entropy = "HD", method = "GEC0") GECal::estimate(y ~ 1, data = IAdata, calibration = calibration, pimat = IApimat)$estimate
data(IAdata) data(IApimat) total <- c( Corn10 = 2093000, Corn20 = 1993600, Corn30 = 1803200, Corn40 = 2084600, Corn50 = 2056600, Corn60 = 1429400, Corn70 = 2539600, Soybean10 = 1472980, Soybean20 = 1192860, Soybean30 = 721920, Soybean40 = 1477680, Soybean50 = 1353600, Soybean60 = 918380, Soybean70 = 1485200, Soybean90 = 777380, Alfalfa10 = 60590, Alfalfa30 = 154395, Alfalfa40 = 57816, Alfalfa50 = 150453, Alfalfa70 = 66065, Alfalfa80 = 240681, Pasture10 = 141947, Pasture20 = 61476, Pasture30 = 188310, Pasture40 = 213635, Pasture50 = 160737, Pasture60 = 222214, Pasture70 = 250807, Pasture80 = 570647, Pasture90 = 232630 ) calibration <- GECal::GEcalib(~ 0, dweight = d, data = IAdata, const = numeric(0), entropy = "EL", method = "DS") GECal::estimate(y ~ 1, data = IAdata, calibration = calibration, pimat = IApimat)$estimate calibration <- GECal::GEcalib(~ 0 + . -y -d, dweight = d, data = IAdata, const = total, entropy = "SL", method = "DS") GECal::estimate(y ~ 1, data = IAdata, calibration = calibration, pimat = IApimat)$estimate calibration <- GECal::GEcalib(~ 0 + . -y -d, dweight = d, data = IAdata, const = c(total), entropy = "ET", method = "DS") GECal::estimate(y ~ 1, data = IAdata, calibration = calibration, pimat = IApimat)$estimate calibration <- GECal::GEcalib(~ 0 + . -y -d + g(d), dweight = d, data = IAdata, const = c(total, NA), entropy = "HD", method = "GEC") GECal::estimate(y ~ 1, data = IAdata, calibration = calibration, pimat = IApimat)$estimate calibration <- GECal::GEcalib(~ 0 + . -y -d, dweight = d, data = IAdata, const = total, entropy = "HD", method = "GEC0") GECal::estimate(y ~ 1, data = IAdata, calibration = calibration, pimat = IApimat)$estimate
The matrix that is used for variance estimation in IAdata. The sample is collected from a straitified random sampling.
data(IApimat)
data(IApimat)