# Different formulas for two-sided mid $p$-value ($2\times2$ table setting)

by stats134711   Last Updated September 11, 2019 09:19 AM

I'm interested in testing independence of two groups (e.g. case and control) in a $$2\times 2$$ table, i.e. $$H_0: \theta=1$$ against the two-sided alternative $$H_1:\theta\neq 1$$, where $$\theta$$ is the odds ratio. If the margins of the table are fixed, the number of hits among cases is a random variable $$X\sim \text{HyperGeom}(n,N_1,N_2)$$, where $$n$$ is the total number of hits, $$N_1$$ is the total number of cases and $$N_2$$ is the total number of controls. The pmf is $$Pr(X=x)=\frac{\binom{N_1}{x}\binom{N_2}{n-x}}{\binom{N_1+N_2}{n}}$$ for $$\max(0,n-N_2)\leq x\leq \min(n,N_1)$$.
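As a quick sanity check on this setup, the sketch below codes the hypergeometric pmf by hand, with $$\binom{N_1}{x}\binom{N_2}{n-x}$$ in the numerator, and compares it with R's built-in dhyper. The helper name hyper_pmf is made up for illustration, and the margins are borrowed from the first example table further down.

```r
# Hand-written hypergeometric pmf (hypothetical helper, for illustration only).
# N1 = number of cases, N2 = number of controls, n = total number of hits.
hyper_pmf <- function(x, n, N1, N2) {
  choose(N1, x) * choose(N2, n - x) / choose(N1 + N2, n)
}

# Margins assumed for the check (those of the first example table)
N1 <- 10; N2 <- 14; n <- 8
support <- max(0, n - N2):min(n, N1)

by_hand  <- hyper_pmf(support, n, N1, N2)
built_in <- dhyper(support, N1, N2, n)   # same argument order as in the functions below

all.equal(by_hand, built_in)   # expected to be TRUE
sum(by_hand)                   # probabilities over the support add up to 1
```

Note that dhyper takes the number of cases, the number of controls and the number of hits in that order, which is how the parameters map onto its white-ball, black-ball and draws arguments.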

For testing, I'm using the mid $$p$$-value, as it is one way to reduce the conservativeness of Fisher's exact test without resorting to randomized tests. Suppose that the observed number of hits among cases is $$x_0$$. I've seen two formulations of the two-sided mid $$p$$-value in the literature:

Formulation 1 (Eq 1.10 or Section 2.2): $$p^{(1)}_{\text{mid}}=\sum_{j:Pr(X=j)<Pr(X=x_0)}Pr(X=j)+0.5\sum_{j:Pr(X=j)=Pr(X=x_0)}Pr(X=j)$$

Formulation 2: $$p_{lt} = Pr(X<x_0)+0.5~Pr(X=x_0)\\ p_{gt} = Pr(X>x_0)+0.5~Pr(X=x_0)\\ p^{(2)}_{\text{mid}}=2\min(p_{lt},p_{gt})=2\min(p_{lt},1-p_{lt})$$ where the one-sided versions, $$p_{lt}$$ or $$p_{gt}$$, can be found in: Eq 1.7 or Section 5.1, to name a few.
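The second equality in Formulation 2 relies on the two one-sided mid $$p$$-values summing to one. Spelling that step out, with $$p_{lt}$$ and $$p_{gt}$$ each taking half of the point mass at $$x_0$$:

$$p_{lt}+p_{gt}=Pr(X<x_0)+Pr(X>x_0)+2\cdot 0.5~Pr(X=x_0)=1\quad\Rightarrow\quad p_{gt}=1-p_{lt}$$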

In fact, Formulation 2 is the one used in SAS PROC FREQ and in certain functions in R packages such as epitools::ormidp.test.

From simple tests on the $$2\times 2$$ tables below in R, I noticed that the two formulations don't always produce the same $$p$$-value. Trying several tables suggests that Formulation 1 can be much less conservative than Formulation 2. Additionally, Formulation 2 can be more conservative than the two-sided Fisher's exact test, as shown below.

Question: Which formulation is appropriate (and under what situations)?

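midpval_f1 <- function(ct){

x <- ct[1,1]         # observed hits among cases
n <- sum(ct[1,])     # total number of hits (first row total)
N1 <- sum(ct[,1])    # total number of cases (first column total)
N2 <- sum(ct[,2])    # total number of controls (second column total)

lo <- max(0L, n - N2)
hi <- min(n, N1)

support <- lo : hi
out <- dhyper(support, N1, N2, n)   # pmf over the whole support
p_obs <- out[x - lo + 1]            # probability of the observed table

return(sum(out[out < p_obs]) + sum(out[out == p_obs])/2)   # less likely outcomes plus half of the ties (exact comparison, no tolerance)
}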

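midpval_f2 <- function(ct){

x <- ct[1,1]         # observed hits among cases
n <- sum(ct[1,])     # total number of hits (first row total)
N1 <- sum(ct[,1])    # total number of cases (first column total)
N2 <- sum(ct[,2])    # total number of controls (second column total)

plt <- phyper(x-1,N1,N2,n) + 0.5*dhyper(x,N1,N2,n)                     # Pr(X < x) + 0.5 Pr(X = x)
pgt <- phyper(x,N1,N2,n,lower.tail = FALSE) + 0.5*dhyper(x,N1,N2,n)    # Pr(X > x) + 0.5 Pr(X = x)

return(2*min(plt,pgt))
}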

test_ct <- matrix(c(3,5,7,9),ncol=2,byrow=TRUE)

> midpval_f1(test_ct)
[1] 0.8366761
> midpval_f2(test_ct)
[1] 0.7956208

test_ct2 <- matrix(c(5,10,2,38),ncol=2,byrow=TRUE)

> midpval_f1(test_ct2)
[1] 0.006789634
> midpval_f2(test_ct2)
[1] 0.01357927
> fisher.test(test_ct2)$p.value
[1] 0.012561
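
To see where the three numbers for test_ct2 come from, the following sketch lists the hypergeometric probabilities over the support and rebuilds each $$p$$-value from its pieces. It is not part of the original functions; the variable names are made up for illustration, and the margins are read off test_ct2.

```r
# Margins of test_ct2: 5 hits among cases, 15 hits in total, 7 cases, 48 controls
x0 <- 5; n <- 15; N1 <- 7; N2 <- 48

support <- max(0, n - N2):min(n, N1)
pmf     <- dhyper(support, N1, N2, n)
p_obs   <- pmf[support == x0]          # probability of the observed table

# Which outcomes are strictly less likely than the observed one?
print(data.frame(x = support, prob = signif(pmf, 4), less_likely = pmf < p_obs))

upper_tail  <- sum(pmf[support > x0])                  # Pr(X > x0)
lower_small <- sum(pmf[support < x0 & pmf < p_obs])    # lower-tail mass counted by Formulation 1

# Formulation 1: less likely outcomes plus half of the ties
form1 <- sum(pmf[pmf < p_obs]) + 0.5 * sum(pmf[pmf == p_obs])

# Formulation 2: twice the smaller one-sided mid p-value
p_lt  <- sum(pmf[support < x0]) + 0.5 * p_obs
p_gt  <- sum(pmf[support > x0]) + 0.5 * p_obs
form2 <- 2 * min(p_lt, p_gt)

# Two-sided Fisher p-value: all outcomes no more likely than the observed one
# (plain <= here; fisher.test itself applies a small relative tolerance)
fisher <- sum(pmf[pmf <= p_obs])

round(c(lower_small = lower_small, upper_tail = upper_tail,
        form1 = form1, form2 = form2, fisher = fisher), 6)
```

For this table no outcome below $$x_0$$ is less likely than the observed one, so lower_small is 0. Formulation 1 then coincides with the one-sided mid $$p$$-value $$p_{gt}$$, Formulation 2 is exactly twice that, and Formulation 2 exceeds the Fisher $$p$$-value by $$Pr(X>x_0)$$.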
