Identifying a cutoff value from bimodal data

Installing and loading the package
Bimodal data
Finite mixture models
Identifying a cutoff value

This tutorial shows step by step how to use finite mixture models to model bimodal data and to derive a cut-off value that separates the two peaks for a given type-I error. It shows in particular how to estimate the parameters of the finite mixture model with the Expectation-Maximization (E-M) algorithm together with the confidence intervals of the estimations. This method is used in (Trang et al. 2015). For questions and/or help, contact me at marc.choisy@ird.fr.

Installing and loading the package

Install the package from GitHub:

> install.packages("devtools")
> devtools::install_github("choisy/cutoff")

This needs to be done only once on a computer, unless there is a new version of R or the package.

Once the package is installed on your computer you need to load it to use it

> library(cutoff)

and this need to be done for each working session. This package contains two main functions: * that fit a finite mixture model to bimodal data with the Expectation-Maximization algorithm * that calculate a cutoff value from a fitted finite mixture model, given a type-1 error There are additional functions, mainly to visualize the results and that we introduce below.

Bimodal data

The data should be contained in a numerical vector. Here is an example of bimodal data containing the IgG concentration against measles virus:

> length(measles)
[1] 1044
> range(measles)
[1] -0.1157462  9.0195135
> # A histogram of the data:
> hist(measles,100,F,xlab="concentration",ylab="density",ylim=c(0,.55),
+      main=NULL,col="grey")
> # A kernel density estimation of the distribution:
> lines(density(measles),lwd=1.5,col="blue")

This figure shows the histogram of the data together with a non-parametric estimation of the distribution. The two suggest a bimodal distribution of the data.

Finite mixture models

We can model such a bimodal distribution by a finite mixture model that uses continuous distributions from the exponential family (Schlattmann 2009). In absence of clear expectation on the shape of these distributions, we can consider the normal, the gamma and the Weibull distributions. The normal distribution is defined on all the real numbers, whereas the gamma and Weibull distributions are defined on positive reals. These three distributions are characterized by 2 parameters: a location parameter \(\mu\) and a scale parameter \(\sigma\). The first parameter accounts for the location of most of the data and corresponds to the mean of the normal distribution, and the shape parameter for the gamma and Weibull distributions. The second parameter accounts for the spread of the data around the location parameter and corresponds to the standard deviation in the case of the normal distribution, the rate parameter (1/scale) for the gamma distribution and the scale parameter for the Weibull distribution. We refer to \(\mu_1\) and \(\sigma_1\) for the location and scale parameters of the peak of the lower concentrations and \(\mu_2\) and \(\sigma_2\) for the location and scale parameters of the peak of the higher concentrations. The density of the bimodal distribution of concentrations thus reads \[ f(x|\lambda,\mu_1,\sigma_1,\mu_2,\sigma_2) = \lambda\times\mathcal{D}_1(x|\mu_1,\sigma_1) + (1-\lambda)\times\mathcal{D}_2(x|\mu_2,\sigma_2) \]

The parameters of the finite mixture model are estimated by the E-M algorithm (Do and Batzoglou 2008) as coded by the function. These parameters include two parameters for each of the two probability distributions and a mixture parameter.

> # Estimating the parameters of the finite mixture model:
> (measles_out <- em(measles,"normal","normal"))

        Finite mixture model fitting to dataset "measles":

probability to belong to distribution 1 = 0.3197664 
distribution 1: normal 
  location (mean) = 3.6030882
  scale (sd) = 1.0746808
distribution 2: normal 
  location (mean) = 6.6245496
  scale (sd) = 0.7363208
deviance of the fitted model: 1305.9575861 
> # The confidence interval of the parameter estimates:
> confint(measles_out,nb=100,level=.95)
Loading required package: bbmle
Loading required package: stats4
        Estimate     2.5 %    97.5 %
mu1    3.6030882 3.4779257 3.7293115
sigma1 1.0746808 0.9940315 1.1651894
mu2    6.6245496 6.5668056 6.6818508
sigma2 0.7363208 0.6972442 0.7787633
lambda 0.3199565 0.2916654 0.3482475
> # The plot:
> hist(measles,100,F,xlab="concentration",ylab="density",ylim=c(0,.55),
+      main=NULL,col="grey")
> lines(measles_out,lwd=1.5,col="red")

Confidence interval for the mixture parameter \(\lambda\) is found using the method of Oakes (1999).

Identifying a cutoff value

We can use the fitted finite mixture model to identify a cutoff value that discriminate the two modes of the dataset. For that, we compute the probability for a datum to belong to distribution \(\mathcal{D}_1\) as \[ p_1 = \frac{\lambda\times\mathcal{D}_1(x|\mu_1,\sigma_1)} {\lambda\times\mathcal{D}_1(x|\mu_1,\sigma_1)+ (1-\lambda)\times\mathcal{D}_2(x|\mu_2,\sigma_2)} \] and the probability for a datum to belong to distribution \(\mathcal{D}_1\) as \[ p_2 = \frac{(1-\lambda)\times\mathcal{D}_2(x|\mu_2,\sigma_2)} {\lambda\times\mathcal{D}_1(x|\mu_1,\sigma_1)+ (1-\lambda)\times\mathcal{D}_2(x|\mu_2,\sigma_2)} \] We equate this probability to the type-I error we aim at to find the cutoff value. The confidence interval of the cutoff value is computed by Monte Carlo simulations (see the help of the cutoff function for more details). In practice, this gives:

> hist(measles,100,F,xlab="concentration",ylab="density",ylim=c(0,.55),
+      main=NULL,col="grey")
> lines(measles_out,lwd=1.5,col="red")
> # Estimating a cutoff value from this fitted finite mixture model:
> (cut_off <- cutoff(measles_out))
Estimate    2.5 %   97.5 % 
 5.90017  5.87821  5.92213 
> # Plotting it:
> polygon(c(cut_off[-1],rev(cut_off[-1])),c(0,0,.55,.55),
+         col=rgb(0,0,1,.2),border=NA)
> abline(v=cut_off[-1],lty=2,col="blue")
> abline(v=cut_off[1],col="blue")

References

Do, C. B., and S. Batzoglou. 2008. What is the expectation maximization algorithm? Nat Biotechnol 26:897–899.

Oakes, D. 1999. Direct calculation of the information matrix via the EM algorithm. J R Statist Soc B 61:479–482.

Schlattmann, P. 2009. Medical Applications of Finite Mixture Models. Springer Verlag.

Trang, N. V., M. Choisy, N. T. Nakagomi, N. T. U. Chinh, Y. H. Doan, T. Yamashiro, J. E. Bryant, et al. 2015. Determination of cut-off cycle threshold values in routine RT–PCR assays to assist differential diagnosis of norovirus in children hospitalized for acute gastroenteritis. Epidemiol Infect. [PDF]