Package 'optBiomarker'

Title: Estimation of Optimal Number of Biomarkers for Two-Group Microarray Based Classifications at a Given Error Tolerance Level for Various Classification Rules
Description: Estimates optimal number of biomarkers for two-group classification based on microarray data.
Authors: Mizanur Khondoker <[email protected]>
Maintainer: Mizanur Khondoker <[email protected]>
License: GPL (>= 2)
Version: 1.0-28
Built: 2025-02-26 04:31:09 UTC
Source: https://github.com/mkhondoker/optbiomarker

Help Index


R package for estimating optimal number of biomarkers at a given error tolerance level for various classification rules

Description

Using interactive control panel (rpanel) and 3D real-time rendering system (rgl) , this package provides a user friendly GUI for estimating the minimum number of biomarkers (variables) needed to achieve a given level of accuracy for two-group classification problems based on microarray data.

Details

The function optimiseBiomarker is a user friendly GUI for interrogating the database of leave-one-out cross-validation errors, errorDbase, to estimate optimal number of biomarkers for microarray based classifications. The database is built on the basis of simulated data using the classificationError function. The function simData is used for simulating microarray data for various combinations of factors such as the number of biomarkers, training set size, biological variation, experimental variation, fold change, replication, and correlation.

Author(s)

Mizanur Khondoker, Till Bachmann, Peter Ghazal

Maintainer: Mizanur Khondoker [email protected].

References

Khondoker, M. R., Till T. Bachmann, T. T., Mewissen, M., Dickinson, P. et al.(2010). Multi-factorial analysis of class prediction error: estimating optimal number of biomarkers for various classification rules. Journal of Bioinformatics and Computational Biology, 8, 945-965.

Breiman, L. (2001). Random Forests, Machine Learning 45(1), 5–32.

Chang, Chih-Chung and Lin, Chih-Jen: LIBSVM: a library for Support Vector Machines, https://www.csie.ntu.edu.tw/~cjlin/libsvm/.

Ripley, B. D. (1996). Pattern Recognition and Neural Networks.Cambridge: Cambridge University Press.

Efron, B. and Tibshirani, R. (1997). Improvements on Cross-Validation: The .632+ Bootstrap Estimator. Journal of the American Statistical Association 92(438), 548–560.

Bowman, A., Crawford, E., Alexander, G. and Bowman, R. W. (2007). rpanel: Simple interactive controls for R functions using the tcltk package. Journal of Statistical Software 17(9).

See Also

simData classificationError optimiseBiomarker

Examples

if(interactive()){
data(errorDbase)
optimiseBiomarker(error=errorDbase)
}

Estimation of misclassification errors (generalisation errors) based on statistical and various machine learning methods

Description

Estimates misclassification errors (generalisation errors), sensitivity and specificity using cross-validation, bootstrap and 632plus bias corrected bootstrap methods based on Random Forest, Support Vector Machines, Linear Discriminant Analysis and k-Nearest Neighbour methods.

Usage

## S3 method for class 'data.frame'
classificationError(
          formula,
          data, 
          method=c("RF","SVM","LDA","KNN"), 
          errorType = c("cv", "boot", "six32plus"),
	  senSpec=TRUE,
          negLevLowest=TRUE,
	  na.action=na.omit, 
          control=control.errorest(k=NROW(na.action(data)),nboot=100),
          ...)

Arguments

formula

A formula of the form lhs ~ rhs relating response (class) variable and the explanatory variables. See lm for more detail.

data

A data frame containing the response (class membership) variable and the explanatory variables in the formula.

method

A character vector of length 1 to 4 representing the classification methods to be used. Can be one or more of "RF" (Random Forest), "SVM" (Support Vector Machines), "LDA" (Linear Discriminant Analysis) and "KNN" (k-Nearest Neighbour). Defaults to all four methods.

errorType

A character vector of length 1 to 3 representing the type of estimators to be used for computing misclassification errors. Can be one or more of the "cv" (cross-validation), "boot" (bootstrap) and "632plus" (632plus bias corrected bootstrap) estimators. Defaults to all three estimators.

senSpec

Logical. Should sensitivity and specificity (for cross-validation estimator only) be computed? Defaults to TRUE.

negLevLowest

Logical. Is the lowest of the ordered levels of the class variable represnts the negative control? Defaults to TRUE.

na.action

Function which indicates what should happen when the data contains NA's, defaults to na.omit.

control

Control parameters of the the function errorest.

...

additional parameters to method.

Details

In the current version of the package, estimation of sensitivity and specificity is limited to cross-validation estimator only. For LDA sample size must be greater than the number of explanatory variables to avoid singularity. The function classificationError does not check if this is satisfied, but the underlying function lda produces warnings if this condition is violated.

Value

Returns an object of class classificationError with components

call

The call of the classificationError function.

errorRate

A length(errorType) by length(method) matrix of classification errors.

rocData

A 2 by length(method) matrix of sensitivities (first row) and specificities (second row).

Author(s)

Mizanur Khondoker, Till Bachmann, Peter Ghazal
Maintainer: Mizanur Khondoker [email protected].

References

Khondoker, M. R., Till T. Bachmann, T. T., Mewissen, M., Dickinson, P. et al.(2010). Multi-factorial analysis of class prediction error: estimating optimal number of biomarkers for various classification rules. Journal of Bioinformatics and Computational Biology, 8, 945-965.

Breiman, L. (2001). Random Forests, Machine Learning 45(1), 5–32.

Chang, Chih-Chung and Lin, Chih-Jen: LIBSVM: a library for Support Vector Machines, https://www.csie.ntu.edu.tw/~cjlin/libsvm/.

Ripley, B. D. (1996). Pattern Recognition and Neural Networks.Cambridge: Cambridge University Press.

Efron, B. and Tibshirani, R. (1997). Improvements on Cross-Validation: The .632+ Bootstrap Estimator. Journal of the American Statistical Association 92(438), 548–560.

See Also

simData

Examples

## Not run: 
mydata<-simData(nTrain=30,nBiom=3)$data
classificationError(formula=class~., data=mydata)

## End(Not run)

Database of leave-one-out cross validation errors for various combinations of data characteristics

Description

This is a 7-dimensional array (database) of leave-one-out cross validation errors for Random Forest, Support Vector Machines, Linear Discriminant Analysis and k-Nearest Neighbour classifiers. The database is the basis for estimating the optimal number of biomarkers at a given error tolerance level using optimiseBiomarker function. See Details for more information.

Usage

data(errorDbase)

Format

7-dimensional numeric array.

Details

The following table gives the dimension names, lengths and values/levels of the data object errorDbase.

Dimension name Length Values/Levels
No. of biomarkers 14 (1-6, 7, 9, 11, 15, 20, 30, 40, 50, 100)
Size of replication 5 (1, 3, 5, 7, 10)
Biological variation (σb\sigma_b) 4 (0.5, 1.0, 1.5, 2.5)
Experimental variation (σe\sigma_e) 4 (0.1, 0.5, 1.0, 1.5)
Minimum (Average) fold change 4 (1 (1.73), 2(2.88), 3(4.03), 5(6.33))
Training set size 5 (10, 20, 50, 100, 250)
Classification method 3 (Random Forest, Support Vector Machine, k-Nearest Neighbour)

We have a plan to expand the database to a 8-dimensional one by adding another dimension to store error rates at different level of correlation between biomarkers. Length of each dimension will also be increased leading to a bigger database with a wider coverage of the parameter space. Current version of the database contain error rates for independent (correlation = 0) biomarkers only. Also, it does not contain error rates for Linear Discriminant Analysis, which we plan to implement in the next release of the package. With the current version of the database, optimal number of biomarkers can be estimated using the optimiseBiomarker function for any intermediate values of the factors represented by the dimensions of the database.

Author(s)

Mizanur Khondoker, Till Bachmann, Peter Ghazal
Maintainer: Mizanur Khondoker [email protected].

References

Khondoker, M. R., Till T. Bachmann, T. T., Mewissen, M., Dickinson, P. et al.(2010). Multi-factorial analysis of class prediction error: estimating optimal number of biomarkers for various classification rules. Journal of Bioinformatics and Computational Biology, 8, 945-965.

See Also

optimiseBiomarker


Estimates optimal number of biomarkers at a given error tolerance level for various classification rules

Description

Using interactive control panel (see rpanel) and 3D real-time rendering system (rgl), this package provides a user friendly GUI for estimating the minimum number of biomarkers (variables) needed to achieve a given level of accuracy for two-group classification problems based on microarray data.

Usage

optimiseBiomarker (error,
                   errorTol = 0.05,
                   method = "RF", nTrain = 100, 
                   sdB = 1.5,
                   sdW = 1,
                   foldAvg = 2.88,
                   nRep = 3)

Arguments

error

The database of classification errors. See errorDbase for details.

errorTol

Error tolerance limit.

method

Classification method. Can be one of "RF", "SVM", and "KNN" for Random Forest, Support Vector Machines, Linear Discriminant Analysis and k-Nearest Neighbour respectively.

nTrain

Training set size, i.e., the total number of biological samples in group 1 and group 2.

sdB

Biological variation (σb\sigma_b) of data in log (base 2) scale.

sdW

Experimental (technical) variation (σe\sigma_e) of data in log (base 2) scale.

foldAvg

Average fold change of the biomarkers.

nRep

Number of technical replications.

Details

The function optimiseBiomarker is a user friendly GUI for interrogating the database of leave-one-out cross-validation errors, errorDbase, to estimate optimal number of biomarkers for microarray based classifications. The database is built on the basis of simulated data using the classificationError function. The function simData is used for simulating microarray data for various combinations of factors such as the number of biomarkers, training set size, biological variation, experimental variation, fold change, replication, and correlation.

Author(s)

Mizanur Khondoker, Till Bachmann, Peter Ghazal

Maintainer: Mizanur Khondoker [email protected].

References

Khondoker, M. R., Till T. Bachmann, T. T., Mewissen, M., Dickinson, P. et al.(2010). Multi-factorial analysis of class prediction error: estimating optimal number of biomarkers for various classification rules. Journal of Bioinformatics and Computational Biology, 8, 945-965.

Breiman, L. (2001). Random Forests, Machine Learning 45(1), 5–32.

Chang, Chih-Chung and Lin, Chih-Jen: LIBSVM: a library for Support Vector Machines, https://www.csie.ntu.edu.tw/~cjlin/libsvm/.

Ripley, B. D. (1996). Pattern Recognition and Neural Networks.Cambridge: Cambridge University Press.

Efron, B. and Tibshirani, R. (1997). Improvements on Cross-Validation: The .632+ Bootstrap Estimator. Journal of the American Statistical Association 92(438), 548–560.

Bowman, A., Crawford, E., Alexander, G. and Bowman, R. W. (2007). rpanel: Simple interactive controls for R functions using the tcltk package. Journal of Statistical Software 17(9).

See Also

simData classificationError

Examples

if(interactive()){
data(errorDbase)
optimiseBiomarker(error=errorDbase)
}

A set of 54359 median gene expressions in log (base 2) scale

Description

This data set contains a set of 54359 log base 2 gene expression values from a neonatal whole blood gene expression study described in Smith et al. (2007). The data represent the median of 28 microarrays corresponding to 28 control (healthy) patients of the neonatal study. This data set is used as a base expressions set for simulating biomarker data using simData function of the optBiomarker package.

Usage

data(realBiomarker)

Format

A vector of 54359 gene expressions in log (base 2) scale.

References

Smith, C. L., Dickinson, P., Forster, T., Khondoker, M. R., Craigon, M., Ross, A., Storm, P., Burgess, S., Lacaze, P., Stenson, B. J.and Ghazal, P. (2007). Quantitative assessment of whole blood RNA as a potential biomarker for infectious disease. Analyst 132, 1200–1209.

Khondoker, M. R., Till T. Bachmann, T. T., Mewissen, M., Dickinson, P. et al.(2010). Multi-factorial analysis of class prediction error: estimating optimal number of biomarkers for various classification rules. Journal of Bioinformatics and Computational Biology, 8, 945-965.


Simulation of microarray data

Description

The function simulates microarray data for two-group comparison with user supplied parameters such as number of biomarkers (genes or proteins), sample size, biological and experimental (technical) variation, replication, differential expression, and correlation between biomarkers.

Usage

simData(nTrain=100,
        nGr1=floor(nTrain/2),
        nBiom=50,nRep=3,
        sdW=1.0,
        sdB=1.0,
        rhoMax=NULL, rhoMin=NULL, nBlock=NULL,bsMin=3, bSizes=NULL, gamma=NULL,
        sigma=0.1,diffExpr=TRUE,
        foldMin=2,
        orderBiom=TRUE,
        baseExpr=NULL)

Arguments

nTrain

Training set size,.i.e., the total number of biological samples in group 1 (nGr1) and group 2.

nGr1

Size of group 1. Defaults to floor(nTrain/2).

nBiom

Number of biomarkers (genes, probes or proteins).

nRep

Number of technical replications.

sdW

Experimental (technical) variation (σe\sigma_e) of data in log (base 2) scale.

sdB

Biological variation (σb\sigma_b) of data in log (base 2) scale.

rhoMax

Maximum Pearson's correlation coefficient between biomarkers. To ensure positive definiteness, allowed values are restricted between 0 and 0.95 inclusive. If NULL, set to runif(1,min=0.6,max=0.8).

rhoMin

Minimum Pearson's correlation coefficient between biomarkers. To ensure positive definiteness, allowed values are restricted between 0 and 0.95 inclusive. If NULL, set to runif(1,min=0.2,max=0.4).

nBlock

Number of blocks in the block diagonal (Hub-Toeplitz) correlation matrix. If NULL, set to 1 for nBiom<5 and randomly selected from c(1:floor(nBiom/bsMin)) for nBiom>=5.

bsMin

Minimum block size. bsMin=3 by default.

bSizes

A vector of length nBlock representing the block sizes (should sum to nBlock). If NULL, set to c(bs+mod,rep(bs,nBlock-1), where bs is the integer part of nBiom/nBlock and mod is the remainder after integer division.

gamma

Specifies a correlation structure. If NULL, assumes independence.gamma=0 indicates a single block exchangeable correlation marix with constant correlation rho=0.5*(rhoMin+rhoMax). A value greater than zero indicates block diagonal (Hub-Toeplitz) correlation matrix with decline rate determined by the value of gamma. Decline rate is linear for gamma=1.

sigma

Standard deviation of the normal distribution (before truncation) where fold changes are generated from. See details.

diffExpr

Logical. Should systematic difference be introduced between the data of the two groups?

foldMin

Minimum value of fold changes. See details.

orderBiom

Logical. Should columns (biomarkers) be arranged in order of differential expression?

baseExpr

A vector of length nBiom to be used as base expressions μ\mu. See realBiomarker for details.

Details

Differential expressions are introduced by adding zδz\delta to the data of group 2 where δ\delta values are generated from a truncated normal distribution and zz is randomly selected from (-1,1) to characterise up- or down-regulation.

Assuming that Y is N(μ,σ2)Y ~is~ N(\mu, \sigma^2), and A=[a1,a2]A=[a_1,a_2], a subset of Inf<y<Inf-Inf <y < Inf, the conditional distribution of YY given AA is called truncated normal distribution:

f(y,μ,σ)=(1/σ)ϕ((yμ)/σ)/(Φ((a2μ)/σ)Φ((a1μ)/σ))f(y, \mu, \sigma)= (1/\sigma) \phi((y-\mu)/\sigma) / (\Phi((a2-\mu)/\sigma) - \Phi((a_1-\mu)/\sigma))

for a1<=y<=a2a_1 <= y <= a_2, and 0 otherwise,

where μ\mu is the mean of the original Normal distribution before truncation, σ\sigma is the corresponding standard deviation,a2a_2 is the upper truncation point, a1a_1 is the lower truncation point, ϕ(x)\phi(x) is the density of the standard normal distribution, and Φ(x)\Phi(x) is the distribution function of the standard normal distribution. For simData function, we consider a1=log2(foldMin)a_1=log_2(\code{foldMin}) and a2=Infa_2=Inf. This ensures that the biomarkers are differentially expressed by a fold change of foldMin or more.

Value

A dataframe of dimension nTrain by nBiom+1. The first column is a factor (class) representing the group memberships of the samples.

Author(s)

Mizanur Khondoker, Till Bachmann, Peter Ghazal
Maintainer: Mizanur Khondoker [email protected].

References

Khondoker, M. R., Till T. Bachmann, T. T., Mewissen, M., Dickinson, P. et al.(2010). Multi-factorial analysis of class prediction error: estimating optimal number of biomarkers for various classification rules. Journal of Bioinformatics and Computational Biology, 8, 945-965.

See Also

classificationError

Examples

simData(nTrain=10,nBiom=3)