Estimation of the mean of the partially linear single-index errors-in-variables model with missing response variables

In this paper, we estimate the mean of the response in the partially linear single-index errors-in-variables model when some responses are missing. Because the linear covariate is measured with additive error, the responses are in general not missing at random. Two estimators are proposed: a semiparametric regression imputation estimator and a marginal average estimator. Both estimators are shown to be asymptotically normal with the same asymptotic variance. A simulation experiment illustrates the proposed method.


Introduction
Semiparametric errors-in-variables (EV) models have attracted broad attention and have been studied in depth during the last two decades. Relevant studies include partially linear EV models (Liang et al. [5], He and Liang [4]), varying coefficient EV models (You et al. [16], Zhao and Xue [17]), partially linear varying coefficient EV models (You and Chen [15], Wei and Mei [13]), and partially linear additive EV models (Wei et al. [11,12]). Here, we consider the following partially linear single-index EV model:
$$Y = X^{T}\beta + g(Z^{T}\alpha) + \varepsilon, \qquad V = X + e, \tag{1.1}$$
where $Y$ is a response variable, the single-index covariate $Z \in \mathbb{R}^{p}$ is observed completely, the linear covariate $X \in \mathbb{R}^{q}$ is observed with additive error, and only its substitute $V$ can be observed; $g(\cdot)$ is an unknown smooth link function, $\varepsilon$ is the random error with $E(\varepsilon \mid Z, X) = 0$ and $\mathrm{Var}(\varepsilon \mid Z, X) < \infty$; $(\alpha, \beta)$ is an unknown vector in $\mathbb{R}^{p} \times \mathbb{R}^{q}$ with $\|\alpha\| = 1$, which ensures identifiability, and the first nonzero component of $\alpha$ is positive, where $\|\cdot\|$ denotes the Euclidean norm. The measurement error $e$ is independent of $(Y, Z, X)$ with $E(e) = 0$ and $\mathrm{Cov}(e) = \Sigma_{e}$. Here, we assume that $\Sigma_{e}$ is known. If it is unknown, it can be estimated in a way analogous to the partial replication method of Liang et al. [5] for the partially linear EV model.
For the complete data set, the partially linear single-index EV model has been discussed by Liang and Wang [6] and Chen and Cui [1].
It is well known that the mean $E(Y) = \theta$ is an important quantity in regression models. If all the responses in the sample are available, the response mean can usually be estimated directly. In practice, however, some responses may be missing, for various reasons. For example, it may be too expensive to acquire every response $Y$, so that only part of the responses are available. Missing-data problems occur frequently in epidemiological studies, survey sampling, social science, and many other fields. Therefore, it is necessary to study the mean $E(Y) = \theta$ based on data with missing responses.
However, there is little research on the response mean in the partially linear single-index model. In this paper, we focus on the mean $E(Y) = \theta$ when there are missing responses in the partially linear single-index EV model (1.1). An indicator variable $\delta$ records whether an observation of $Y$ is missing or observed: $\delta = 0$ indicates that $Y$ is missing and $\delta = 1$ indicates that $Y$ is observed. Throughout this paper, if $X$ is observable, we assume that the missing-data mechanism is
$$P(\delta = 1 \mid Y, Z, X) = \pi(Z, X)$$
for some unknown $\pi(Z, X)$. In addition, we assume that the measurement error $e$ is independent of $\delta$, that is, $P(\delta = 1 \mid Y, Z, X, V) = \pi(Z, X)$. Since $X$ is observed with measurement error, $Y$ is not missing at random unless further assumptions are made. Details can be found in Liang et al. [7].
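To make the mechanism concrete, the following minimal sketch draws $\delta$ from a response probability that depends on $(Z, X)$ only, while the analyst would see just $(Z, V, \delta)$. The logistic form of `response_prob`, its coefficients, the scalar $Z$, and the function names are all illustrative assumptions; the paper leaves $\pi(Z, X)$ unspecified.

```python
import math
import random


def response_prob(z, x):
    # Hypothetical pi(Z, X): a logistic form chosen only for illustration.
    return 1.0 / (1.0 + math.exp(-(0.8 + 0.5 * z - 0.5 * x)))


def draw_sample(n, sigma_e=0.25, seed=0):
    """Draw (z, x, v, delta) tuples; delta depends on (z, x) only."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.random()                   # scalar Z ~ U[0, 1] (a simplification)
        x = rng.gauss(0.0, 1.0)            # true covariate X, never observed
        v = x + rng.gauss(0.0, sigma_e)    # surrogate V = X + e
        delta = 1 if rng.random() < response_prob(z, x) else 0
        rows.append((z, x, v, delta))
    return rows


sample = draw_sample(2000)
observed_rate = sum(row[3] for row in sample) / len(sample)
print(f"fraction of observed responses: {observed_rate:.3f}")
```

Because $\delta$ is driven by the unobserved $X$ rather than by the surrogate $V$, conditioning on the observed $(Z, V)$ alone does not make the missingness ignorable, which is the point made in the paragraph above.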
Imputation is a common way of dealing with missing data: a plausible value is filled in for each missing datum, and the result is then analyzed as if it were complete data. When some responses are missing, Cheng [2] applied kernel regression imputation to estimate $\theta$ in a nonparametric model. Following the method of Cheng [2], Wei [10] estimated $\theta$ in a partially linear varying-coefficient EV model with missing responses. In addition, the marginal average method can be used in place of the imputation method. When some responses are missing in a partially linear model, Wang et al. [9] and Liang et al. [7] used these two methods to estimate the mean of the responses with the covariates $X$ being observed and not observed, respectively. In this paper, we extend the method of Liang et al. [7] to partially linear single-index EV models and propose two estimators of $\theta$ in model (1.1) with missing responses. The estimators are shown to be asymptotically normal and to have the same asymptotic variance.
The rest of this paper is organized as follows. In Sect. 2, two estimators of $\theta$ are proposed and the related asymptotic result is presented. In Sect. 3, some simulation results are reported. All proofs are given in Sect. 4.

Estimation of the mean E(Y) = θ
In order to derive the estimators of $\theta$, we first use the complete method of Qi and Wang [8] to estimate the regression coefficients, the single-index coefficients and the nonparametric function. By the least-squares method and the correction for attenuation technique, an estimator $\hat{\beta}_{n}$ of $\beta$ can be defined as in (2.1), with $K_{1}(\cdot)$ being a kernel function and $h_{1}$ being a suitable bandwidth.
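The correction for attenuation can be seen in isolation in a stripped-down setting. The sketch below drops the nonparametric component and the kernel weighting entirely and fits a scalar linear EV model $Y = \beta X + \varepsilon$, $V = X + e$ with known error variance; it is therefore not the estimator (2.1), and the function name and parameter values are assumptions made for illustration.

```python
import random


def attenuation_demo(n=20000, beta=1.0, sigma_e=0.5, seed=1):
    """Compare naive and attenuation-corrected least squares, scalar case."""
    rng = random.Random(seed)
    sum_vv = 0.0
    sum_vy = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)              # true covariate (unobserved)
        v = x + rng.gauss(0.0, sigma_e)      # observed surrogate V = X + e
        y = beta * x + rng.gauss(0.0, 0.1)   # response
        sum_vv += v * v
        sum_vy += v * y
    # Naive least squares regresses Y on V and is biased toward zero.
    naive = sum_vy / sum_vv
    # Subtracting n * sigma_e^2 removes the error variance from sum of V^2.
    corrected = sum_vy / (sum_vv - n * sigma_e ** 2)
    return naive, corrected


naive_beta, corrected_beta = attenuation_demo()
print(f"naive: {naive_beta:.3f}, corrected: {corrected_beta:.3f}")
```

With $\mathrm{Var}(X) = 1$ and $\sigma_{e} = 0.5$, the naive slope converges to $\beta / (1 + \sigma_{e}^{2}) = 0.8\beta$, while the corrected slope is consistent for $\beta$. In the multivariate case the same idea replaces $\sum V_{i}V_{i}^{T}$ by $\sum V_{i}V_{i}^{T} - n\Sigma_{e}$.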
After obtaining the estimator of $\beta$, we can obtain the estimators $\hat{g}_{n}(\cdot)$ and $\hat{g}'_{n}(\cdot)$ of $g(\cdot)$ and $g'(\cdot)$ for any fixed $\alpha$ by the local linear method of Fan and Gijbels [3], as in (2.2), with $K_{2}(\cdot)$ being a kernel function and $h_{2}$ being a suitable bandwidth. However, (2.1) and (2.2) cannot be applied directly in practice, since $\alpha$ is unknown. So we need to estimate $\alpha$ by minimizing (2.3), which yields, say, $\hat{\alpha}_{n}$. Note that $\hat{\beta}_{n}$ and $\hat{g}_{n}(\cdot)$ can also be used to obtain $\hat{\alpha}_{n}$ in (2.3). The complete estimation procedure is an iterative process with the following steps:
Step 1. Acquire an initial value $\hat{\alpha}_{0}$, for example, by the method of Xia and Härdle [14], and let $\hat{\alpha}_{n} = \hat{\alpha}_{0} / \|\hat{\alpha}_{0}\|$.
Step 2. With $\alpha = \hat{\alpha}_{n}$, obtain $\hat{\beta}_{nk}$ and $\hat{g}_{nk}(\cdot)$ based on (2.1) and (2.2).
Step 3. Denote the solution of (2.3) by $\hat{\alpha}_{n(k+1)}$ and let $\hat{\alpha}_{n} = \hat{\alpha}_{n(k+1)}$.
Step 4. Iterate Steps 2 and 3 until convergence is achieved.
Next, we turn to estimating the mean $E(Y) = \theta$. Similar to Wang et al. [9] and Liang et al. [7], we construct two estimators of $\theta$. First, each missing $Y_{i}$ is imputed by the estimated regression function $V_{i}^{T}\hat{\beta}_{n} + \hat{g}_{n}(Z_{i}^{T}\hat{\alpha}_{n})$. Consequently, we obtain the semiparametric regression imputation estimator of $\theta$, which is defined as
$$\hat{\theta}_{1} = \frac{1}{n}\sum_{i=1}^{n}\left\{\delta_{i}Y_{i} + (1-\delta_{i})\left[V_{i}^{T}\hat{\beta}_{n} + \hat{g}_{n}(Z_{i}^{T}\hat{\alpha}_{n})\right]\right\}.$$
Second, we consider only the sample average of the estimated regression function, that is, every $Y_{i}$ is ignored. Accordingly, we get the marginal average estimator of $\theta$, which is defined as
$$\hat{\theta}_{2} = \frac{1}{n}\sum_{i=1}^{n}\left[V_{i}^{T}\hat{\beta}_{n} + \hat{g}_{n}(Z_{i}^{T}\hat{\alpha}_{n})\right].$$
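Once the fitted regression values are available, the two estimators are simple averages. The sketch below assumes the fitted values $V_{i}^{T}\hat{\beta}_{n} + \hat{g}_{n}(Z_{i}^{T}\hat{\alpha}_{n})$ have already been computed and are passed in as a list named `fitted`; the function names and the toy numbers are hypothetical and serve only to show the arithmetic.

```python
def imputation_estimator(y, delta, fitted):
    """Regression imputation: keep observed Y_i, impute the missing ones."""
    total = 0.0
    for y_i, d_i, m_i in zip(y, delta, fitted):
        total += y_i if d_i == 1 else m_i
    return total / len(fitted)


def marginal_average_estimator(fitted):
    """Marginal average: mean of the fitted values, ignoring every Y_i."""
    return sum(fitted) / len(fitted)


# Toy data: None marks a missing response (delta = 0).
y = [1.0, None, 3.0, None]
delta = [1, 0, 1, 0]
fitted = [1.2, 2.0, 2.6, 4.0]   # assumed output of the fitting procedure

theta_1 = imputation_estimator(y, delta, fitted)
theta_2 = marginal_average_estimator(fitted)
print(theta_1, theta_2)
```

In the toy data the two values differ only through the observed cases, where the imputation estimator uses $Y_{i}$ and the marginal average uses the fitted value.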

Asymptotic result
In this section, we state the asymptotic normality of the estimators $\hat{\theta}_{1}$ and $\hat{\theta}_{2}$ and show that they are asymptotically equivalent. For a concise representation, let . Moreover, in order to state the asymptotic results, the following assumptions will be used:

Simulation
In this section, we present a simulation study to analyze the finite-sample performance of the regression imputation estimator $\hat{\theta}_{1}$ and the marginal average estimator $\hat{\theta}_{2}$. The simulation uses the partially linear single-index EV model (1.1) with a specific link function: where $X$ is generated from the standard normal distribution, the trivariate $Z$ is simulated from the uniform distribution $U[0, 1]$, $e$ is generated from the normal distribution $N(0, 0.25^{2})$, $\varepsilon$ is simulated from the normal distribution with mean 0 and variance 0.01, and α = ( 3 ) T , $\beta = 1$. The kernel functions were taken to be $K_{i}(t) = \frac{3}{4}(1 - t^{2})^{2}$ if $|t| \le 1$, and 0 otherwise, $i = 1, 2$.
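The choice of bandwidths is crucial. In this paper, we use the least-squares delete-one cross-validation (CV) method to select the bandwidths: $\hat{h}_{1}$ and $\hat{h}_{2}$ are chosen as
$$(\hat{h}_{1}, \hat{h}_{2}) = \arg\min_{h_{1}, h_{2}} \sum_{i=1}^{n} \delta_{i}\left\{Y_{i} - V_{i}^{T}\hat{\beta}_{n}^{(-i)} - \hat{g}_{n}^{(-i)}\left(Z_{i}^{T}\hat{\alpha}_{n}^{(-i)}\right)\right\}^{2}, \tag{3.2}$$
where $\hat{\beta}_{n}^{(-i)}$, $\hat{g}_{n}^{(-i)}$ and $\hat{\alpha}_{n}^{(-i)}$ are the "leave-one-out" versions of $\hat{\beta}_{n}$, $\hat{g}_{n}$ and $\hat{\alpha}_{n}$, respectively. However, the $\hat{h}_{i}$, $i = 1, 2$, from (3.2) may not be the optimal bandwidths because they may not satisfy the conditions imposed in the theorems. According to these conditions, the optimal bandwidth based on (3.2) is obtained by choosing a constant $h_{0}$.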
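The delete-one idea can be sketched in a much simpler setting than the full model. The code below selects a single bandwidth for a Nadaraya-Watson smoother of $Y$ on a scalar index, using only the cases with observed responses; it omits $\hat{\beta}_{n}$, $\hat{\alpha}_{n}$ and the second bandwidth, so it is an illustration of the criterion rather than the procedure used in the paper. The sine link, the 80% response rate, the bandwidth grid and all function names are assumptions.

```python
import math
import random


def kernel(t):
    # Compactly supported weight (1 - t^2)^2 on [-1, 1]; the constant 3/4 is
    # irrelevant here because the smoother is a ratio of weighted sums.
    return 0.75 * (1.0 - t * t) ** 2 if abs(t) <= 1.0 else 0.0


def cv_score(u, y, delta, h):
    """Sum of squared leave-one-out prediction errors over observed cases."""
    total = 0.0
    for i in range(len(u)):
        if delta[i] == 0:
            continue                      # Y_i missing: no error to evaluate
        num = 0.0
        den = 0.0
        for j in range(len(u)):
            if j == i or delta[j] == 0:
                continue                  # leave case i out, skip missing Y_j
            w = kernel((u[j] - u[i]) / h)
            num += w * y[j]
            den += w
        if den == 0.0:
            return math.inf               # no neighbours: bandwidth too small
        total += (y[i] - num / den) ** 2
    return total


def select_bandwidth(u, y, delta, grid):
    return min(grid, key=lambda h: cv_score(u, y, delta, h))


rng = random.Random(2)
n = 200
u = [rng.random() for _ in range(n)]
delta = [1 if rng.random() < 0.8 else 0 for _ in range(n)]
y = [math.sin(2 * math.pi * u_i) + rng.gauss(0.0, 0.1) if d_i else None
     for u_i, d_i in zip(u, delta)]

grid = [0.02, 0.05, 0.1, 0.2, 0.4, 0.8]
best_h = select_bandwidth(u, y, delta, grid)
print("selected bandwidth:", best_h)
```

A grid search is used here for transparency; the cost is quadratic in $n$ per candidate bandwidth, which is acceptable at simulation sample sizes.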
Based on model (3.1), we considered the following four response probabilities of missing, namely:  Table 1.
From Table 1, we observe that (a) the biases and SEs decrease as $n$ increases for every fixed missing rate, and the SEs increase as the missing rate increases for every fixed sample size $n$; (b) the SEs of $\hat{\theta}_{1}$ and $\hat{\theta}_{2}$ are nearly the same for every fixed missing rate and sample size.
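Summaries of this kind are computed from repeated simulation runs. The sketch below shows one common way to do so, assuming that SE denotes the standard deviation of the estimates across replications (the text does not spell out its definition). A plain sample mean of $N(1, 1)$ draws stands in for $\hat{\theta}_{1}$ or $\hat{\theta}_{2}$, so the numbers it prints are not those of Table 1; all names are hypothetical.

```python
import random
import statistics


def monte_carlo_summary(estimates, theta):
    """Return (bias, SE) of a list of replicated estimates of theta."""
    bias = statistics.fmean(estimates) - theta
    se = statistics.stdev(estimates)      # spread across replications
    return bias, se


def toy_replications(n, reps, seed):
    # Stand-in estimator: the sample mean of n draws with true mean 1.0.
    rng = random.Random(seed)
    return [statistics.fmean([rng.gauss(1.0, 1.0) for _ in range(n)])
            for _ in range(reps)]


bias_50, se_50 = monte_carlo_summary(toy_replications(50, 300, seed=3), 1.0)
bias_200, se_200 = monte_carlo_summary(toy_replications(200, 300, seed=3), 1.0)
print(f"n=50:  bias={bias_50:.4f}, SE={se_50:.4f}")
print(f"n=200: bias={bias_200:.4f}, SE={se_200:.4f}")
```

For a root-$n$ consistent estimator, quadrupling the sample size should roughly halve the SE, which is the pattern described in observation (a).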

Proof of the main result
In order to prove the main result, we first give some lemmas.
Proof of Lemma 4.1 When $\alpha = \hat{\alpha}_{n}$, the estimators of $g(\cdot)$ and $g'(\cdot)$ can be obtained from (2.2). By a straightforward calculation, Then focusing on the top equation only and using Taylor expansion, we have Note that 1 Noting that Similarly, we also have Thus we get equation (4.1).

Lemma 4.2 Under conditions (C1)-(C7), we have
Proof of Lemma 4.2 This proof is given in Qi and Wang [8], so we omit the details here.
Proof of Lemma 4.3 The proof of Lemma 4.3 is similar to the proof of Theorem 1 of Liang et al. [7], so we omit the details here.
Proof of Theorem 2.1 Here we only consider the asymptotic normality of $\hat{\theta}_{1}$. The asymptotic result for $\hat{\theta}_{2}$ is obtained similarly.
For θ 1 , we havê where β), From Taylor expansion and the continuity of g (·), we obtain that By Lemma 4.1 and (4.4), it is easy to get where , , .