Introducing Nonparametric Predictive Inference Methods for Reproducibility of Likelihood Ratio Tests

This paper introduces the nonparametric predictive inference approach for reproducibility of likelihood ratio tests. The general idea of this approach is outlined for tests between two simple hypotheses, followed by an investigation of reproducibility for tests between two beta distributions. The paper reports on the first steps of a wider research programme towards tests involving composite hypotheses and substantial computational challenges.


Introduction
In recent years, reproducibility of statistical hypotheses tests has received increasing attention [4].The issue involves a straightforward question: if a statistical test were repeated, under the same circumstances, would it lead to the same conclusion with regard to rejection or non-rejection of the null hypothesis?Or, more precisely, what would be the probability of the test conclusion for the repeated test to be the same as for the original test?This is called the reproducibility probability (RP).The issue was first raised by Goodman [20], who pointed out that among practitioners there 15 Page 2 of 14 appeared to be a misunderstanding about the meaning of the p-value.Senn [26] provided an extensive discussion of Goodman's paper, from a statistical perspective, emphasizing the difference between RP and the p-value.Of course, as explained by Senn, upon rejection of a null hypothesis, a smaller value of the p-value suggests a larger RP.But, remarkably, it was not clear at all how RP could be computed or estimated.Traditionally, a test is designed for a specified level of significance, and the power of the test for a precisely specified alternative hypothesis, also called a 'simple hypothesis', can also be taken into account for the sample size or more general aspects of the test design.But it remains somewhat vague how the concept of a repeat of a test, and hence reproducibility of the test results, fits into the classical frequentist framework of statistics.
The power of a test is the probability that the null hypothesis is rejected if a simple alternative hypothesis is true.Due to the typical test formulation where a strong indication is being sought in favour of the alternative hypothesis, it is the correct rejection of the null hypothesis that is often considered to be the target of the test.For example, in the development of new medication one may test its superior effect on patients compared to existing medication by formulating a null hypothesis of there being no difference and an alternative hypothesis specifying a specific level of improvement for the patients by using the new medication.This led Shao and Chow [27] to focus on the power of the test, and they suggested to call an estimate of the power, based on the data of the original test, the 'reproducibility' of the test.In this approach, if the hypotheses involve a value for a parameter for an assumed model, and the original null hypothesis is rejected, then the data are used to estimate the parameter value, and this estimated parameter is then considered to be the simple alternative hypothesis parameter value for which the power of the test is computed; hence, overall, this leads to an estimate for the power of the test which is interpreted as an estimate of RP.This approach was also followed by De Martini [18], who in addition proposed to use the estimated RP to design tests, and following work by De Capitani and De Martini [17].While it is of course necessary to base inference on RP on the data of the original test, the explicit focus on the power of the test, hence on the assumption that the simple alternative hypothesis is true, is somewhat restrictive.In this 'estimated power' approach, RP can be regarded as a 'within the model' concept in the sense that the data of the original are used to estimate a parameter of the model, which in turn is linked to the power and then interpreted as RP.While such a power estimate is of interest, we do not think that it is in line with the natural interpretation of test reproducibility, both because it only considers cases where the null hypothesis is rejected and because it does not really consider repeat application of the test, which would lead to different data.We should emphasize that there have been quite a few further attempts to specify RP, but they are less convincing than the approach by Shao and Chow [27]; a short introduction was provided by Coolen and Bin Himd [14].
Coolen and Bin Himd [14] presented a different perspective on test reproducibility, using the nonparametric predictive inference (NPI) framework of frequentist statistical methods [5,12,13].The main difference to the estimated power approach of Shao and Chow [27] is that the NPI approach for the reproducibility probability of a test (NPI-RP) is explicitly predictive, so it considers the test result for a predicted future sample of the same size as the original sample.This approach seems to be well aligned to the nature of test reproducibility, which is more naturally considered as a prediction problem than as an estimation problem.Given the observed data from the original test, the NPI-RP approach first predicts future data sets, this is accomplished nonparametrically and without any consideration of the model assumed for the test.Then, the same test as performed on the original data is considered for the random future data sets, and the proportion of these that lead to the same conclusion as the original test is investigated.Due to NPI being only based on few modelling assumptions, there is imprecision in this process [6] as will be explained in more detail later within the context of the specific test scenario considered in this paper.
Coolen and Bin Himd [14] introduced NPI for RP by considering some basic nonparametric tests, namely the sign test, Wilcoxon's signed-rank test and the twosample rank-sum test [19].For these inferences, NPI for Bernoulli quantities [11] and for real-valued observations [5] is used.Recently, Coolen and Alqifari [16] presented NPI-RP for two basic nonparametric tests based on order statistics, namely a quantile test [19] and a precedence test [7], using NPI for future order statistics [1,15].These basic tests all enabled analytical results for NPI-RP.To enable NPI for more complex test scenarios, the NPI-bootstrap method can be used, as introduced by Bin Himd [8] who illustrates the use of NPI-bootstrap for NPI-RP for the Kolmogorov-Smirnov test.Computational aspects for more complex test scenarios are briefly commented on in this paper; they are an important topic for future research towards real-world implementation of NPI-RP.
This paper introduces NPI-RP to the important setting of likelihood ratio tests (LRT).These tests were introduced by Neyman and Pearson in 1928 and since then have been widely applied in the most different fields of statistics; for example, applications can be easily found in engineering, economics, medicine and ecology [9,24,25,29].Their good large sample properties and Wilks' theorem [28] which states, for composite hypotheses, that the distribution of the logarithm of the LRT statistic can be approximated by a 2 distribution allow the simple use of this testing procedure and make it an attractive and commonly used tool.However, as already stated by several authors (see, for example Johansen [22] and Marques et al. [23]), this approximation, although simple and easy to use, does not provide precise results in many situations; for example, in the multivariate setting, it often does not perform well in scenarios with large number of variables or for small sample sizes.For these more complex scenarios, other more precise approximations can be considered, namely the so-called near-exact approximations [10].In this paper, one will consider mainly the case of simple hypotheses, but the case of composite hypotheses will be addressed in a possible follow-up paper.For simple hypotheses, where all the parameters are specified, one will have, for a random sample X 1 , … , X n extracted from a population X, the null and alternative hypotheses specified in the following form where f 0 and f 1 stand for the densities of the model considered under the null and alternative hypotheses, respectively.The LRT statistic is given by 15 Page 4 of 14 As already mentioned before, we will be interested in studying the RP of the LRT for some introductory examples.Along the way, one will also derive the exact distribution of the LR statistics which will allow us to determine exact quantiles.Therefore, Sect. 2 of this paper introduces the general idea of NPI-RP for LRT through the case of a test between two simple hypotheses.This is then illustrated, first with a simple scenario, in Sect.3, where one considers tests between a distribution with an increasing density in (0,1) versus the uniform distribution and then, in Sect.4, for a test between two beta distributions.Section 5 provides some concluding remarks and outlines important challenges for future research related to the generalization of this methodology to composite hypotheses and the computational problems involved.

NPI-RP for LRT with Simple Hypotheses
Nonparametric predictive inference (NPI) [5,12,13] is a frequentist statistical method based on Hill's assumption A (n) [21].This assumption considers a single future real-valued observation X n+1 , given n data observations, with the assumption that there are no ties among the data (this assumption is made throughout this paper), and assigns probability 1∕(n + 1) for X n+1 to each open interval between con- secutive data observations (and −∞ and ∞ for the left-and right-most intervals).We denote the n data observations by x 1 < x 2 < … < x n and for ease of notation, we define x 0 = −∞ and x n+1 = ∞ .Of course, if finite bounds are known for the obser- vation values, then we can use these bounds as x 0 and x n+1 .It should be emphasized that no further assumptions are made, in particular not on the distribution of the probability 1∕(n + 1) within each interval.As a generalization, NPI for m ≥ 1 future real-valued observations, based on n ≥ 1 data observations, uses the sequential assumptions A (n) , … , A (n+m−1) [2,3], and by doing so, it explicitly takes the interde- pendence of the future observations into account.These assumptions lead to the following inferential method: given n data observations and m future observations, the m + n m different orderings of all these observations are all equally likely, with again no further assumptions on where future observations would be within intervals between consecutive data observations.In this paper, we restrict attention to the case m = n , as this is most logical for studying reproducibility of a test based on n observations. We For these future observations, nothing more is assumed, so they can take on any value within the specific interval.We wish to perform the same test on the future data that was applied to the real data, and hence, we wish to compute the likelihood ratio based on the future data, for each given ordering O j .This is not possible, but we can find bounds for the likelihood ratio by minimizing and maximizing it over the ranges of values that the observations can have, given the specific ordering.This leads to three groups of orderings.Firstly, orderings for which we certainly do not reject H 0 , so for all possible locations of the n future observations within the respective intervals, the resulting value of the likelihood ratio leads to non-rejection of H 0 .Secondly, and following from similar arguments as for the first case, there are orderings for which we certainly reject H 0 .Thirdly, orderings for which the minimum and maximum values of the LR lead to different conclusions with regard to rejection of H 0 .All the orderings O j are equally likely, so to calculate the NPI lower RP, if for the original data we do not reject H 0 then we count the number of orderings in the first group, and to calculate the corresponding NPI upper RP in this case, we count the number of orderings in the first and third groups.Similarly, if for the original data we reject H 0 then we count the number of orderings in the second group to calculate the NPI lower RP, and to calculate the corresponding NPI upper RP in this case, we count the number of orderings in the second and third groups.
Clearly, imprecision in NPI-RP results from future orderings for which it is both possible that H 0 would be rejected or not rejected, given the ranges of values the observations can have within the intervals created by the original data.The main challenge for the NPI-RP approach to LRT is the derivation of the minimum and maximum values of the LR for each ordering of future observations.We start exploring this method, in Sect.3, by considering a basic scenario for NPI-RP for the likelihood ratio test with two simple hypotheses.Suppose we have independent and identically distributed random quantities X i on [0, 1] and wish to test , with probability density function (pdf) f (x) > 0 increasing on [0, 1] and U[0, 1] denoting the uniform distribution on [0, 1].Due to the specific choice of H 1 , the likelihood ratio based on data The LRT is such that H 0 is not rejected if LR(x) > K , for some K depending on the chosen level of significance for the test, and H 0 is rejected if LR(x) ≤ K .With this specific support for the probability distribution of the random quantities of interest, we define x 0 = 0 and x n+1 = 1.
Due to the assumption that the pdf f is increasing on [0, 1], the NPI lower and upper reproducibility probabilities are relatively straightforward to derive.For ordering O j , the likelihood ratio LR(O j ) has minimum possible value LR(O j ) and maximum possible value LR(O j ) derived by Suppose that the LRT for the original data x 1 , … , x n leads to non-rejection of H 0 , so LR(x) > K .Then, the NPI lower RP is derived by counting all of the 2n n orderings O j for which LR(O j ) > K , while the corresponding NPI upper RP is derived by counting all orderings for which LR(O j ) > K .Similarly, if the LRT for the original data leads to rejection of H 0 , so LR(x) ≤ K , then the NPI lower RP is derived by counting all of the 2n n orderings O j for which LR(O j ) ≤ K , while the correspond- ing NPI upper RP is derived by counting all orderings for which LR(O j ) ≤ K .In this methodology, the computation of all possible orderings, O j , may be time-consuming and an obstacle to its implementation when large samples are considered.However, this problem may be overcome by using bootstrap techniques [8] or by sampling the orderings.The authors intend to extend and apply these techniques to NPI-RP for LRTs in future works.In Sect.4, one considers a more general setting where the densities may not be increasing functions and may assume the value zero in the extremes of their support.In this case, for independent and identically distributed random quantities In the example provided in Sect.4, f 0 and f 1 are the densities of beta distributions.The LR based on the observed data Then, if f is a monotone function, for an ordering O j , the likelihood ratio LR(O j ) will have minimum possible value LR(O j ) and maximum possible value LR(O j ) given respectively by where, for a given l, with l = 1, … , n + 1 and For the general case with any two likelihood functions, we need to find lr l and lr l , the infimum and supremum, respectively, of the (likelihood) ratio of the probability density functions over (x l−1 , x l ) ; while this may not be trivial, it only needs to be done once for a given data set, and perhaps some approximate results may be possible, note that for large data sets it is likely that for most of these intervals the ratio of the pdf values within it does not change much.These lr l and lr l then replace the pdfs in Eqs. ( 1) and (2) above.
To avoid possible issues related to the fact that the density functions of the models under the null and alternative hypotheses may assume the value zero in the extremes of interval (0,1), we propose a basic and initial approach to this problem, which is to consider 2 .Other possible techniques as well as the possible effects of this choice will be studied in future works.

A First Simple Example
In our first example, we consider the following hypotheses with the LR, for an observed sample of size n, is given by The exact distribution of the LR statistic, under the null hypothesis, is given in the following theorem.
Theorem 3.1 Let X 1 , … , X n be independent and identically distributed with density given in (5).Then, the cumulative distribution function of ) is given by where F (n,2) (.) represents the cumulative distribution function of Gamma distribu- tion with shape parameter n, rate parameter 2. Proof It is easy to note that Y i = X i + 1∕2 has density given by and that the h-th moment of Y is given by given the properties of characteristic functions, we have that the characteristic function of W is given by which after some algebraic manipulation and using the binomial expansion is possible to write as The previous expression is the characteristic function of a mixture of shifted gamma distributions.After the necessary transformations, the cumulative distribution function of the LR can be written as where F (n,2) (.) represents the cumulative distribution function of gamma distribu- tion with shape parameter n, rate parameter 2. □ Using the previous results, one may determine the exact quantiles for the LR statistic and develop numerical simulations to study the RP of this test.In the following simulations, we consider samples of sizes n = 5 and n = 10 and for each case, we consider 15 and 50 replications simulated under H 0 .The 0.2 exact quantiles were determined using the cumulative distribution function given in Theorem 3.1 and are equal to 0.7267 and 0.7240 respectively for n = 5 and n = 10 .We have considered the 0.2 significance level in order to make it easier to analyse the figures.In Fig. 1, the filled circle dots are the upper RP and the filled square dots are the lower RP evaluated for each simulated value of the LR statistic.The vertical line marks the value of the exact quantile.From Fig. 1, one may observe that the values of the RP tend to increase when the simulated value of the LR statistic moves away from the quantile considered, this was already expected, and it is also observed in the next examples.

A Test Between Two Beta Distributions
In this section, we will consider a more complex case using the procedure illustrated in Sect. 2. We are interested in testing 15 Page 10 of 14 for a ≠ b , the LR, for a sample of size n, is given by with thus and

Fig. 1 Simulated values of the RP
The following theorem specifies the distribution of the LR statistic.In this theorem, one will consider just the case a > b ; however, the case a < b can be considered using a similar procedure.Proof Following a similar procedure to the one used in two previous cases, we will use the random variable Since we know that − log(X i ) has an exponential distribution with parameter a and that the hth moment of X a−b i is given by the expression of the characteristic function of −(a − b) log(X i ) will be given by and the characteristic function of W by which is the characteristic function of a gamma distribution with shape parameter n and rate parameter a a−b .Therefore, it is, again, straightforward to determine, with the necessary transformations, the cumulative distribution function of the LR which is given by For example, when a = 5 and b = 1∕5 , for samples of sizes n = 5 and n = 10 , the 0.2 exact quantiles were determined using the cumulative distribution function given in Theorem 4.1 and are respectively 15401.8 and 5.755 × 10 8 .The simulations were performed considering 15 and 50 replications of data generated under H 0 .
In Fig. 2, we observe similar features to the ones already described in Fig. 1; (1) the values of the lower and upper RPs tend to increase with increasing distance between the observed LR and the quantiles considered; (2) if the observed values of the LR are close to the quantile, the lower RP decreases substantially and may even assume values below 0.5; (3) these figures show some oscillation of the values of the RPs which is, essentially, due to randomness and to the products involved in the expression of the LR which are reflected in the process for computing the minimum and maximum possible values of the LR.

Concluding Remarks
This paper introduced nonparametric predictive inference methods for reproducibility of likelihood ratio tests.The main idea is exemplified with two examples of testing procedures.The simulations carried out show the increasing trend of the lower and upper RPs together with the decreasing trend of the difference between these probabilities, for increasing values of the distance between the observed LRs and the quantiles.The simulations were performed for small samples due to the computational time required for the computation of the number of possible orderings.However, this difficulty may be overcome by using bootstrap techniques [8] or sampling of orderings.Further research challenges include discussion of significance level (p-value), power and RP to be taken into account for test designs, and the development of a general set up for composite hypotheses.

Theorem 4 . 1
For a sample X 1 , … , X n , independent and identically distributed, with X i ∼ Beta(a, 1) , the cumulative distribution function of LR = is the cumulative distribution function of a Gamma distribution with shape parameter n and rate parameter a a−b .

15
Page 12 of 14 where F n, a a−b (.) is the cumulative distribution function of a gamma distribution with shape parameter n and rate parameter a a−b .□

Fig. 2
Fig. 2 Simulated values of the RP is the number of future observations in the interval (x i−1 , x i ) , according to ordering O j .Here s The general idea of the NPI-RP approach is as follows.Given n real-valued observations for which the original test is performed, we consider the denote the 2n n different orderings of the n future real-valued observations among the n data observations, by O j for j = 1, … , 2n n .Each ordering O j can be LR =