Title: | Distribution-Free Exact High Dimensional Low Sample Size k-Sample Tests |
---|---|
Description: | Testing homogeneity of k multivariate distributions is a classical and challenging problem in statistics, and it becomes even more challenging when the dimension of the data exceeds the sample size. We construct some tests for this purpose which are exact level (size) alpha tests based on clustering. These tests are easy to implement and distribution-free in finite sample situations. Under appropriate regularity conditions, these tests have the consistency property in the HDLSS asymptotic regime, where the dimension of the data grows to infinity while the sample size remains fixed. We also consider a multiscale approach, where the results for different numbers of partitions are aggregated judiciously. Details are in Biplab Paul, Shyamal K De and Anil K Ghosh (2021) <doi:10.1016/j.jmva.2021.104897>; Soham Sarkar and Anil K Ghosh (2019) <doi:10.1109/TPAMI.2019.2912599>; William M Rand (1971) <doi:10.1080/01621459.1971.10482356>; Cyrus R Mehta and Nitin R Patel (1983) <doi:10.2307/2288652>; Joseph C Dunn (1973) <doi:10.1080/01969727308546046>; Sture Holm (1979) <doi:10.2307/4615733>; Yoav Benjamini and Yosef Hochberg (1995) <doi:10.2307/2346101>. |
Authors: | Biplab Paul [aut, cre], Shyamal K. De [aut], Anil K. Ghosh [aut] |
Maintainer: | Biplab Paul <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.1.0 |
Built: | 2024-10-27 04:33:50 UTC |
Source: | https://github.com/cran/HDLSSkST |
Testing homogeneity of k (k >= 2) multivariate distributions is a classical and challenging problem in statistics, and it becomes even more challenging when the dimension of the data exceeds the sample size. We construct some tests for this purpose which are exact level (size) alpha tests based on clustering. These tests are easy to implement and distribution-free in finite sample situations. Under appropriate regularity conditions, these tests have the consistency property in the HDLSS asymptotic regime, where the dimension of the data grows to infinity while the sample size remains fixed. We also consider a multiscale approach, where the results for different numbers of partitions are aggregated judiciously. This package includes eight tests, namely the (i) RI test, (ii) FS test, (iii) MRI test, (iv) MFS test, (v) MTRI test, (vi) MTFS test, (vii) ARI test and (viii) AFS test. The MRI and MFS tests modify the RI and FS tests, respectively, by using an estimated number of clusters. In the multiscale approach (MTRI and MTFS), we use Holm's step-down procedure (1979) and the Benjamini-Hochberg FDR-controlling procedure (1995).
Biplab Paul, Shyamal K. De and Anil K. Ghosh
Maintainer: Biplab Paul<[email protected]>
Biplab Paul, Shyamal K De and Anil K Ghosh (2021). Some clustering based exact distribution-free k-sample tests applicable to high dimension, low sample size data, Journal of Multivariate Analysis, doi:10.1016/j.jmva.2021.104897.
Soham Sarkar and Anil K Ghosh (2019). On perfect clustering of high dimension, low sample size data, IEEE transactions on pattern analysis and machine intelligence, doi:10.1109/TPAMI.2019.2912599.
William M Rand (1971). Objective criteria for the evaluation of clustering methods, Journal of the American Statistical association, 66(336):846-850, doi:10.1080/01621459.1971.10482356.
Cyrus R Mehta and Nitin R Patel (1983). A network algorithm for performing Fisher's exact test in rxc contingency tables, Journal of the American Statistical Association, 78(382):427-434, doi:10.2307/2288652.
Joseph C Dunn (1973). A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters, doi:10.1080/01969727308546046.
Sture Holm (1979). A simple sequentially rejective multiple test procedure, Scandinavian journal of statistics, 65-70, doi:10.2307/4615733.
Yoav Benjamini and Yosef Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal statistical society: series B (Methodological) 57.1: 289-300, doi: 10.2307/2346101.
Performs the distribution-free exact k-sample test for equality of multivariate distributions in the HDLSS regime. This is an aggregate of the two-sample versions of the FS test over all k(k-1)/2 two-sample comparisons, and the test statistic is the minimum of these two-sample FS test statistics. Holm's step-down procedure (1979) and the Benjamini-Hochberg procedure (1995) are applied for multiple testing.
AFStest(M, sizes, randomization = TRUE, clust_alg = "knwClustNo", kmax = 4, multTest = "Holm", s_psi = 1, s_h = 1, lb = 1, n_sts = 1000, alpha = 0.05)
M |
n x d observation matrix; rows are the observations |
sizes |
vector of sample sizes of the k populations |
randomization |
logical; if TRUE (default), a randomized test is performed, if FALSE, a non-randomized test |
clust_alg |
"knwClustNo" (default) or "estClustNo"; modified K-means algorithm used for clustering, with known or estimated total number of clusters |
kmax |
maximum value of the total number of clusters when estimating the total number of clusters in a two-sample comparison, default: 4 |
multTest |
"Holm" (default) or "BenHoch"; multiple testing procedure applied to the pairwise comparisons |
s_psi |
function required for clustering: an index selecting the psi function of the MADD dissimilarity, default: 1 |
s_h |
function required for clustering: an index selecting the h function of the MADD dissimilarity, default: 1 |
lb |
each observation is partitioned into a number of smaller sub-vectors of equal length, default: 1 |
n_sts |
number of simulations of the test statistic, default: 1000 |
alpha |
numeric; significance level of the test, default: 0.05 |
AFStest returns a list containing the following items:
AFSStat |
value of the observed test statistic |
AFCutoff |
cut-off of the test |
randomGamma |
randomized coefficient of the test |
decisionAFS |
1 if the null hypothesis of homogeneity is rejected, 0 otherwise |
multipleTest |
indicates which pairs of populations differ according to the multiple testing procedure |
Biplab Paul, Shyamal K. De and Anil K. Ghosh
Maintainer: Biplab Paul <paul.biplab497@gmail.com>
Biplab Paul, Shyamal K. De and Anil K. Ghosh (2021). Some clustering based exact distribution-free k-sample tests applicable to high dimension, low sample size data. Journal of Multivariate Analysis, doi:10.1016/j.jmva.2021.104897.
Cyrus R. Mehta and Nitin R. Patel (1983). A network algorithm for performing Fisher's exact test in r x c contingency tables. Journal of the American Statistical Association, 78(382):427-434, doi:10.2307/2288652.
Sture Holm (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65-70, doi:10.2307/4615733.
Yoav Benjamini and Yosef Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289-300, doi:10.2307/2346101.
# multivariate normal distribution:
# generate data with dimension d = 500
set.seed(151)
n1=n2=n3=n4=10
d = 500
I1 <- matrix(rnorm(n1*d,mean=0,sd=1),n1,d)
I2 <- matrix(rnorm(n2*d,mean=0.5,sd=1),n2,d)
I3 <- matrix(rnorm(n3*d,mean=1,sd=1),n3,d)
I4 <- matrix(rnorm(n4*d,mean=1.5,sd=1),n4,d)
X <- as.matrix(rbind(I1,I2,I3,I4))
# AFS test:
results <- AFStest(M=X, sizes = c(n1,n2,n3,n4))
## outputs:
results$AFSStat
#[1] 5.412544e-06
results$AFCutoff
#[1] 0.0109604
results$randomGamma
#[1] 0
results$decisionAFS
#[1] 1
results$multipleTest
#  Population.1 Population.2 rejected pvalues
#1            1            2     TRUE       0
#2            1            3     TRUE       0
#3            1            4     TRUE       0
#4            2            3     TRUE       0
#5            2            4     TRUE       0
#6            3            4     TRUE       0
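To make the aggregation concrete, here is a small schematic of the AFS-type construction (not the package's internal code); stat2 is a hypothetical stand-in for a two-sample FS statistic:
k <- 4
pairs <- combn(k, 2)              # 2 x 6 matrix: all k(k-1)/2 population pairs
stat2 <- function(i, j) runif(1)  # hypothetical two-sample FS statistic
stats_ij <- apply(pairs, 2, function(p) stat2(p[1], p[2]))
aggStat <- min(stats_ij)          # the aggregate statistic is the minimum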
Performs the distribution-free exact k-sample test for equality of multivariate distributions in the HDLSS regime. This is an aggregate of the two-sample versions of the RI test over all k(k-1)/2 two-sample comparisons, and the test statistic is the minimum of these two-sample RI test statistics. Holm's step-down procedure (1979) and the Benjamini-Hochberg procedure (1995) are applied for multiple testing.
ARItest(M, sizes, randomization = TRUE, clust_alg = "knwClustNo", kmax = 4, multTest = "Holm", s_psi = 1, s_h = 1, lb = 1, n_sts = 1000, alpha = 0.05)
M |
n x d observation matrix; rows are the observations |
sizes |
vector of sample sizes of the k populations |
randomization |
logical; if TRUE (default), a randomized test is performed, if FALSE, a non-randomized test |
clust_alg |
"knwClustNo" (default) or "estClustNo"; modified K-means algorithm used for clustering, with known or estimated total number of clusters |
kmax |
maximum value of the total number of clusters when estimating the total number of clusters in a two-sample comparison, default: 4 |
multTest |
"Holm" (default) or "BenHoch"; multiple testing procedure applied to the pairwise comparisons |
s_psi |
function required for clustering: an index selecting the psi function of the MADD dissimilarity, default: 1 |
s_h |
function required for clustering: an index selecting the h function of the MADD dissimilarity, default: 1 |
lb |
each observation is partitioned into a number of smaller sub-vectors of equal length, default: 1 |
n_sts |
number of simulations of the test statistic, default: 1000 |
alpha |
numeric; significance level of the test, default: 0.05 |
ARItest returns a list containing the following items:
ARIStat |
value of the observed test statistic |
ARICutoff |
cut-off of the test |
randomGamma |
randomized coefficient of the test |
decisionARI |
1 if the null hypothesis of homogeneity is rejected, 0 otherwise |
multipleTest |
indicates which pairs of populations differ according to the multiple testing procedure |
Biplab Paul, Shyamal K. De and Anil K. Ghosh
Maintainer: Biplab Paul <paul.biplab497@gmail.com>
Biplab Paul, Shyamal K. De and Anil K. Ghosh (2021). Some clustering based exact distribution-free k-sample tests applicable to high dimension, low sample size data. Journal of Multivariate Analysis, doi:10.1016/j.jmva.2021.104897.
William M. Rand (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846-850, doi:10.1080/01621459.1971.10482356.
Sture Holm (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65-70, doi:10.2307/4615733.
Yoav Benjamini and Yosef Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289-300, doi:10.2307/2346101.
# multivariate normal distribution:
# generate data with dimension d = 500
set.seed(151)
n1=n2=n3=n4=10
d = 500
I1 <- matrix(rnorm(n1*d,mean=0,sd=1),n1,d)
I2 <- matrix(rnorm(n2*d,mean=0.5,sd=1),n2,d)
I3 <- matrix(rnorm(n3*d,mean=1,sd=1),n3,d)
I4 <- matrix(rnorm(n4*d,mean=1.5,sd=1),n4,d)
X <- as.matrix(rbind(I1,I2,I3,I4))
# ARI test:
results <- ARItest(M=X, sizes = c(n1,n2,n3,n4))
## outputs:
results$ARIStat
#[1] 0
results$ARICutoff
#[1] 0.3368421
results$randomGamma
#[1] 0
results$decisionARI
#[1] 1
results$multipleTest
#  Population.1 Population.2 rejected pvalues
#1            1            2     TRUE       0
#2            1            3     TRUE       0
#3            1            4     TRUE       0
#4            2            3     TRUE       0
#5            2            4     TRUE       0
#6            3            4     TRUE       0
Benjamini-Hochberg's step-up procedure (1995) for multiple tests.
BenHoch(pvalues, alpha)
pvalues |
vector of p-values |
alpha |
numeric, false discovery rate controlling level |
a vector of 0s and 1s.
0: fails to reject the corresponding hypothesis, and
1: rejects the corresponding hypothesis
Biplab Paul, Shyamal K. De and Anil K. Ghosh
Maintainer: Biplab Paul <paul.biplab497@gmail.com>
Yoav Benjamini and Yosef Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289-300, doi:10.2307/2346101.
# Benjamini-Hochberg's step-up procedure:
pvalues <- c(0.50,0.01,0.001,0.69,0.02,0.05,0.0025)
alpha <- 0.05
BenHoch(pvalues, alpha)
## outputs:
#[1] 0 1 1 0 1 0 1
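For intuition, the step-up rule can be written from scratch in a few lines; this sketch is equivalent in effect to p.adjust(pvalues, "BH") <= alpha and reproduces the output above:
bh_sketch <- function(pvalues, alpha) {
  m <- length(pvalues)
  ord <- order(pvalues)
  # reject everything up to the largest i with p_(i) <= i * alpha / m
  below <- which(pvalues[ord] <= alpha * seq_len(m) / m)
  reject <- integer(m)
  if (length(below) > 0) reject[ord[seq_len(max(below))]] <- 1
  reject
}
bh_sketch(c(0.50,0.01,0.001,0.69,0.02,0.05,0.0025), 0.05)
#[1] 0 1 1 0 1 0 1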
Performs the distribution-free exact k-sample test for equality of multivariate distributions in the HDLSS regime. The FS test statistic is the generalized hypergeometric probability (as in Fisher's exact test) of the contingency table obtained by cross-classifying the true population labels against the estimated cluster labels.
FStest(M, labels, sizes, n_clust, randomization = TRUE, clust_alg = "knwClustNo", kmax = 2 * n_clust, s_psi = 1, s_h = 1, lb = 1, n_sts = 1000, alpha = 0.05)
M |
n x d observation matrix; rows are the observations |
labels |
vector of length n of population labels of the observations |
sizes |
vector of sample sizes of the k populations |
n_clust |
number of populations, k |
randomization |
logical; if TRUE (default), a randomized test is performed, if FALSE, a non-randomized test |
clust_alg |
"knwClustNo" (default) or "estClustNo"; modified K-means algorithm used for clustering, with known or estimated total number of clusters |
kmax |
maximum value of the total number of clusters when estimating the total number of clusters in the pooled sample, default: 2 * n_clust |
s_psi |
function required for clustering: an index selecting the psi function of the MADD dissimilarity, default: 1 |
s_h |
function required for clustering: an index selecting the h function of the MADD dissimilarity, default: 1 |
lb |
each observation is partitioned into a number of smaller sub-vectors of equal length, default: 1 |
n_sts |
number of simulations of the test statistic, default: 1000 |
alpha |
numeric; significance level of the test, default: 0.05 |
FStest returns a list containing the following items:
estClustLabel |
a vector of length n of estimated cluster (class) labels of the observations |
obsCtyTab |
observed contingency table |
ObservedProb |
value of the observed test statistic |
FCutoff |
cut-off of the test |
randomGamma |
randomized coefficient of the test |
estPvalue |
estimated p-value of the test |
decisionF |
1 if the null hypothesis of homogeneity is rejected, 0 otherwise |
estClustNo |
estimated total number of classes (clusters) |
Biplab Paul, Shyamal K. De and Anil K. Ghosh
Maintainer: Biplab Paul <paul.biplab497@gmail.com>
Biplab Paul, Shyamal K. De and Anil K. Ghosh (2021). Some clustering based exact distribution-free k-sample tests applicable to high dimension, low sample size data. Journal of Multivariate Analysis, doi:10.1016/j.jmva.2021.104897.
Cyrus R. Mehta and Nitin R. Patel (1983). A network algorithm for performing Fisher's exact test in r x c contingency tables. Journal of the American Statistical Association, 78(382):427-434, doi:10.2307/2288652.
# multivariate normal distribution:
# generate data with dimension d = 500
set.seed(151)
n1=n2=n3=n4=10
k = 4
d = 500
I1 <- matrix(rnorm(n1*d,mean=0,sd=1),n1,d)
I2 <- matrix(rnorm(n2*d,mean=0.5,sd=1),n2,d)
I3 <- matrix(rnorm(n3*d,mean=1,sd=1),n3,d)
I4 <- matrix(rnorm(n4*d,mean=1.5,sd=1),n4,d)
levels <- c(rep(0,n1), rep(1,n2), rep(2,n3), rep(3,n4))
X <- as.matrix(rbind(I1,I2,I3,I4))
# FS test:
results <- FStest(M=X, labels=levels, sizes = c(n1,n2,n3,n4), n_clust = k)
## outputs:
results$estClustLabel
#[1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
results$obsCtyTab
#     [,1] [,2] [,3] [,4]
#[1,]   10    0    0    0
#[2,]    0   10    0    0
#[3,]    0    0   10    0
#[4,]    0    0    0   10
results$ObservedProb
#[1] 2.125236e-22
results$FCutoff
#[1] 1.115958e-07
results$randomGamma
#[1] 0
results$estPvalue
#[1] 0
results$decisionF
#[1] 1
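The observed contingency table above is the cross-classification of the true population labels against the estimated cluster labels; reusing levels and results from the example, a quick cross-check is:
table(true = levels, est = results$estClustLabel)
# a 4 x 4 table with 10 on the diagonal, matching results$obsCtyTab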
Performs a modified K-means algorithm that uses the Mean Absolute Difference of Distances (MADD) dissimilarity measure, and provides estimated cluster (class) labels or memberships of the observations.
gMADD(s_psi, s_h, n_clust, lb, M)
s_psi |
function required for clustering: an index selecting the psi function of the MADD dissimilarity |
s_h |
function required for clustering: an index selecting the h function of the MADD dissimilarity |
n_clust |
total number of classes (clusters) in the pooled sample |
lb |
each observation is partitioned into a number of smaller sub-vectors of equal length |
M |
n x d observation matrix; rows are the observations |
a vector of length n of estimated cluster (class) labels of observations
Biplab Paul, Shyamal K. De and Anil K. Ghosh
Maintainer: Biplab Paul <paul.biplab497@gmail.com>
Biplab Paul, Shyamal K. De and Anil K. Ghosh (2021). Some clustering based exact distribution-free k-sample tests applicable to high dimension, low sample size data. Journal of Multivariate Analysis, doi:10.1016/j.jmva.2021.104897.
Soham Sarkar and Anil K. Ghosh (2019). On perfect clustering of high dimension, low sample size data. IEEE Transactions on Pattern Analysis and Machine Intelligence, doi:10.1109/TPAMI.2019.2912599.
# Modified K-means algorithm:
# multivariate normal distribution
# generate data with dimension d = 500
set.seed(151)
n1=n2=n3=n4=10
d = 500
I1 <- matrix(rnorm(n1*d,mean=0,sd=1),n1,d)
I2 <- matrix(rnorm(n2*d,mean=0.5,sd=1),n2,d)
I3 <- matrix(rnorm(n3*d,mean=1,sd=1),n3,d)
I4 <- matrix(rnorm(n4*d,mean=1.5,sd=1),n4,d)
n_cl <- 4
X <- as.matrix(rbind(I1,I2,I3,I4))
gMADD(1,1,n_cl,1,X)
## outputs:
#[1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
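For intuition, here is a minimal sketch of the plain MADD dissimilarity of Sarkar and Ghosh (2019): the mean absolute difference of (dimension-scaled Euclidean) distances to the remaining observations. The s_psi and s_h arguments of gMADD select generalized variants of this measure.
madd <- function(X) {
  D <- as.matrix(dist(X)) / sqrt(ncol(X))  # dimension-scaled Euclidean distances
  n <- nrow(X)
  out <- matrix(0, n, n)
  for (i in seq_len(n)) for (j in seq_len(n)) if (i != j)
    out[i, j] <- mean(abs(D[i, -c(i, j)] - D[j, -c(i, j)]))
  out
}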
Performs a modified K-means algorithm that uses the MADD dissimilarity measure together with the Dunn index, and provides estimated cluster (class) labels or memberships of the observations along with the corresponding Dunn indexes.
gMADD_DI(s_psi, s_h, kmax, lb, M)
s_psi |
function required for clustering: an index selecting the psi function of the MADD dissimilarity |
s_h |
function required for clustering: an index selecting the h function of the MADD dissimilarity |
kmax |
maximum value of the total number of clusters when estimating the total number of clusters in the pooled sample |
lb |
each observation is partitioned into a number of smaller sub-vectors of equal length |
M |
n x d observation matrix; rows are the observations |
The Dunn index is commonly used for cluster validation; here we use it to estimate the total number of clusters as khat = argmax of DI(k) over 2 <= k <= kmax, where DI(k) denotes the Dunn index of the partition of the pooled sample into k clusters.
a matrix of the estimated cluster (class) labels and the corresponding Dunn indexes of the observations
The result of the gMADD_DI function is a matrix. The first row of this matrix does not carry estimated class labels or a Dunn index, since the Dunn index is only defined for k >= 2 clusters. The last column of the matrix holds the Dunn indexes. The estimated cluster labels of the observations are obtained from the row with the maximum Dunn index.
Biplab Paul, Shyamal K. De and Anil K. Ghosh
Maintainer: Biplab Paul <paul.biplab497@gmail.com>
Biplab Paul, Shyamal K. De and Anil K. Ghosh (2021). Some clustering based exact distribution-free k-sample tests applicable to high dimension, low sample size data. Journal of Multivariate Analysis, doi:10.1016/j.jmva.2021.104897.
Soham Sarkar and Anil K. Ghosh (2019). On perfect clustering of high dimension, low sample size data. IEEE Transactions on Pattern Analysis and Machine Intelligence, doi:10.1109/TPAMI.2019.2912599.
Joseph C. Dunn (1973). A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 3(3):32-57, doi:10.1080/01969727308546046.
# Modified K-means algorithm:
# multivariate normal distribution
# generate data with dimension d = 500
set.seed(151)
n1=n2=n3=n4=10
d = 500
I1 <- matrix(rnorm(n1*d,mean=0,sd=1),n1,d)
I2 <- matrix(rnorm(n2*d,mean=0.5,sd=1),n2,d)
I3 <- matrix(rnorm(n3*d,mean=1,sd=1),n3,d)
I4 <- matrix(rnorm(n4*d,mean=1.5,sd=1),n4,d)
n_cl <- 4
N <- n1+n2+n3+n4
X <- as.matrix(rbind(I1,I2,I3,I4))
dvec_di_mat <- gMADD_DI(1,1,2*n_cl,1,X)
est_no_cl <- which.max(dvec_di_mat[ ,(N+1)])
dvec_di_mat[est_no_cl,1:N]
## outputs:
#[1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
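A minimal sketch of the Dunn index itself, for a dissimilarity matrix D and cluster labels cl (the smallest between-cluster dissimilarity divided by the largest within-cluster diameter); this is the quantity maximized over k above:
dunn_index <- function(D, cl) {
  ks <- unique(cl)
  between <- min(sapply(ks, function(a)
    sapply(ks[ks != a], function(b) min(D[cl == a, cl == b]))))
  within <- max(sapply(ks, function(a) max(D[cl == a, cl == a])))
  between / within
}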
Holm's step-down procedure (1979) for multiple tests.
Holm(pvalues, alpha)
pvalues |
vector of p-values |
alpha |
numeric, family-wise error rate controlling level |
a vector of 0s and 1s.
0: fails to reject the corresponding hypothesis, and
1: rejects the corresponding hypothesis
Biplab Paul, Shyamal K. De and Anil K. Ghosh
Maintainer: Biplab Paul <paul.biplab497@gmail.com>
Sture Holm (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65-70, doi:10.2307/4615733.
# Holm's step-down procedure:
pvalues <- c(0.50,0.01,0.001,0.69,0.02,0.05,0.0025)
alpha <- 0.05
Holm(pvalues, alpha)
## outputs:
#[1] 0 0 1 0 0 0 1
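For intuition, a from-scratch sketch of the step-down rule; judging from the example above, Holm() rejects only while p_(i) is strictly below alpha/(m-i+1), so the borderline p-value 0.01 is retained (p.adjust(pvalues, "holm") <= alpha would reject it):
holm_sketch <- function(pvalues, alpha) {
  m <- length(pvalues)
  ord <- order(pvalues)
  reject <- integer(m)
  for (i in seq_len(m)) {
    if (pvalues[ord[i]] < alpha / (m - i + 1)) reject[ord[i]] <- 1
    else break  # stop at the first non-rejection
  }
  reject
}
holm_sketch(c(0.50,0.01,0.001,0.69,0.02,0.05,0.0025), 0.05)
#[1] 0 0 1 0 0 0 1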
Performs the distribution-free exact k-sample test for equality of multivariate distributions in the HDLSS regime. This test is a multiscale approach based on the FS test, where the results for different numbers of partitions are aggregated judiciously.
MTFStest(M, labels, sizes, k_max, multTest = "Holm", s_psi = 1, s_h = 1, lb = 1, n_sts = 1000, alpha = 0.05)
M |
n x d observation matrix; rows are the observations |
labels |
vector of length n of population labels of the observations |
sizes |
vector of sample sizes of the k populations |
k_max |
maximum value of the total number of clusters considered by the multiscale test |
multTest |
"Holm" (default) or "BenHoch"; multiple testing procedure used to aggregate the results over different numbers of clusters |
s_psi |
function required for clustering: an index selecting the psi function of the MADD dissimilarity, default: 1 |
s_h |
function required for clustering: an index selecting the h function of the MADD dissimilarity, default: 1 |
lb |
each observation is partitioned into a number of smaller sub-vectors of equal length, default: 1 |
n_sts |
number of simulations of the test statistic, default: 1000 |
alpha |
numeric; significance level of the test, default: 0.05 |
MTFStest returns a list containing the following items:
fpmfvec |
a vector of the generalized hypergeometric probabilities (FS test statistics) based on different numbers of clusters |
Pvalues |
a vector of FS test p-values based on different numbers of clusters |
decisionMTFS |
1 if the null hypothesis of homogeneity is rejected, 0 otherwise |
contTabs |
a list of the observed contingency tables based on different numbers of clusters |
mulTestdec |
a vector of 0s and 1s: the decisions of the multiple testing procedure for the different numbers of clusters (1: reject) |
Biplab Paul, Shyamal K. De and Anil K. Ghosh
Maintainer: Biplab Paul <paul.biplab497@gmail.com>
Biplab Paul, Shyamal K. De and Anil K. Ghosh (2021). Some clustering based exact distribution-free k-sample tests applicable to high dimension, low sample size data. Journal of Multivariate Analysis, doi:10.1016/j.jmva.2021.104897.
Sture Holm (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65-70, doi:10.2307/4615733.
Yoav Benjamini and Yosef Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289-300, doi:10.2307/2346101.
# multivariate normal distribution:
# generate data with dimension d = 500
set.seed(151)
n1=n2=n3=n4=10
d = 500
I1 <- matrix(rnorm(n1*d,mean=0,sd=1),n1,d)
I2 <- matrix(rnorm(n2*d,mean=0.5,sd=1),n2,d)
I3 <- matrix(rnorm(n3*d,mean=1,sd=1),n3,d)
I4 <- matrix(rnorm(n4*d,mean=1.5,sd=1),n4,d)
levels <- c(rep(0,n1), rep(1,n2), rep(2,n3), rep(3,n4))
X <- as.matrix(rbind(I1,I2,I3,I4))
# MTFS test:
results <- MTFStest(X, levels, c(n1,n2,n3,n4), 8)
## outputs:
results$fpmfvec
#[1] 7.254445e-12 6.137740e-16 2.125236e-22 2.125236e-22 2.125236e-22 2.125236e-22 2.125236e-22
results$Pvalues
#[1] 0 0 0 0 0 0 0
results$decisionMTFS
#[1] 1
results$contTabs
#$contTabs[[1]]
#     [,1] [,2]
#[1,]   10    0
#[2,]   10    0
#[3,]    0   10
#[4,]    0   10
#
#$contTabs[[2]]
#     [,1] [,2] [,3]
#[1,]   10    0    0
#[2,]    0   10    0
#[3,]    0    8    2
#[4,]    0    0   10
#
#$contTabs[[3]]
#     [,1] [,2] [,3] [,4]
#[1,]   10    0    0    0
#[2,]    0   10    0    0
#[3,]    0    0   10    0
#[4,]    0    0    0   10
#
#$contTabs[[4]]
#     [,1] [,2] [,3] [,4] [,5]
#[1,]   10    0    0    0    0
#[2,]    0   10    0    0    0
#[3,]    0    0    4    6    0
#[4,]    0    0    0    0   10
#
#$contTabs[[5]]
#     [,1] [,2] [,3] [,4] [,5] [,6]
#[1,]   10    0    0    0    0    0
#[2,]    0   10    0    0    0    0
#[3,]    0    0    4    6    0    0
#[4,]    0    0    0    0    8    2
#
#$contTabs[[6]]
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#[1,]   10    0    0    0    0    0    0
#[2,]    0    5    5    0    0    0    0
#[3,]    0    0    0    4    6    0    0
#[4,]    0    0    0    0    0    8    2
#
#$contTabs[[7]]
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#[1,]    8    2    0    0    0    0    0    0
#[2,]    0    0    5    5    0    0    0    0
#[3,]    0    0    0    0    4    6    0    0
#[4,]    0    0    0    0    0    0    8    2
results$mulTestdec
#[1] 1 1 1 1 1 1 1
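The multiscale logic boils down to correcting the per-k p-values for multiplicity and rejecting the overall null if any survives; a schematic with made-up p-values (this mirrors, but does not reproduce, MTFStest's internals):
pvals_by_k <- c(0.30, 0.004, 0.001, 0.02, 0.12, 0.08, 0.40)  # illustrative only
alpha <- 0.05
rej_holm <- p.adjust(pvals_by_k, method = "holm") <= alpha
rej_bh <- p.adjust(pvals_by_k, method = "BH") <= alpha
decision <- as.integer(any(rej_holm))  # 1: reject the overall null of homogeneity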
Performs the distribution-free exact k-sample test for equality of multivariate distributions in the HDLSS regime. This test is a multiscale approach based on the RI test, where the results for different numbers of partitions are aggregated judiciously.
MTRItest(M, labels, sizes, k_max, multTest = "Holm", s_psi = 1, s_h = 1, lb = 1, n_sts = 1000, alpha = 0.05)
M |
n x d observation matrix; rows are the observations |
labels |
vector of length n of population labels of the observations |
sizes |
vector of sample sizes of the k populations |
k_max |
maximum value of the total number of clusters considered by the multiscale test |
multTest |
"Holm" (default) or "BenHoch"; multiple testing procedure used to aggregate the results over different numbers of clusters |
s_psi |
function required for clustering: an index selecting the psi function of the MADD dissimilarity, default: 1 |
s_h |
function required for clustering: an index selecting the h function of the MADD dissimilarity, default: 1 |
lb |
each observation is partitioned into a number of smaller sub-vectors of equal length, default: 1 |
n_sts |
number of simulations of the test statistic, default: 1000 |
alpha |
numeric; significance level of the test, default: 0.05 |
MTRItest returns a list containing the following items:
RIvec |
a vector of the Rand indices based on different numbers of clusters |
Pvalues |
a vector of RI test p-values based on different numbers of clusters |
decisionMTRI |
1 if the null hypothesis of homogeneity is rejected, 0 otherwise |
contTabs |
a list of the observed contingency tables based on different numbers of clusters |
mulTestdec |
a vector of 0s and 1s: the decisions of the multiple testing procedure for the different numbers of clusters (1: reject) |
Biplab Paul, Shyamal K. De and Anil K. Ghosh
Maintainer: Biplab Paul <paul.biplab497@gmail.com>
Biplab Paul, Shyamal K. De and Anil K. Ghosh (2021). Some clustering based exact distribution-free k-sample tests applicable to high dimension, low sample size data. Journal of Multivariate Analysis, doi:10.1016/j.jmva.2021.104897.
Sture Holm (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65-70, doi:10.2307/4615733.
Yoav Benjamini and Yosef Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289-300, doi:10.2307/2346101.
# multivariate normal distribution:
# generate data with dimension d = 500
set.seed(151)
n1=n2=n3=n4=10
d = 500
I1 <- matrix(rnorm(n1*d,mean=0,sd=1),n1,d)
I2 <- matrix(rnorm(n2*d,mean=0.5,sd=1),n2,d)
I3 <- matrix(rnorm(n3*d,mean=1,sd=1),n3,d)
I4 <- matrix(rnorm(n4*d,mean=1.5,sd=1),n4,d)
levels <- c(rep(0,n1), rep(1,n2), rep(2,n3), rep(3,n4))
X <- as.matrix(rbind(I1,I2,I3,I4))
# MTRI test:
results <- MTRItest(X, levels, c(n1,n2,n3,n4), 8)
## outputs:
results$RIvec
#[1] 0.25641026 0.14871795 0.00000000 0.03076923 0.05128205 0.08333333 0.10384615
results$Pvalues
#[1] 0 0 0 0 0 0 0
results$decisionMTRI
#[1] 1
results$contTabs
#$contTabs[[1]]
#     [,1] [,2]
#[1,]   10    0
#[2,]   10    0
#[3,]    0   10
#[4,]    0   10
#
#$contTabs[[2]]
#     [,1] [,2] [,3]
#[1,]   10    0    0
#[2,]    0   10    0
#[3,]    0   10    0
#[4,]    0    0   10
#
#$contTabs[[3]]
#     [,1] [,2] [,3] [,4]
#[1,]   10    0    0    0
#[2,]    0   10    0    0
#[3,]    0    0   10    0
#[4,]    0    0    0   10
#
#$contTabs[[4]]
#     [,1] [,2] [,3] [,4] [,5]
#[1,]   10    0    0    0    0
#[2,]    0   10    0    0    0
#[3,]    0    0    4    6    0
#[4,]    0    0    0    0   10
#
#$contTabs[[5]]
#     [,1] [,2] [,3] [,4] [,5] [,6]
#[1,]   10    0    0    0    0    0
#[2,]    0   10    0    0    0    0
#[3,]    0    0    4    6    0    0
#[4,]    0    0    0    0    8    2
#
#$contTabs[[6]]
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#[1,]   10    0    0    0    0    0    0
#[2,]    0    5    5    0    0    0    0
#[3,]    0    0    0    4    6    0    0
#[4,]    0    0    0    0    0    8    2
#
#$contTabs[[7]]
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#[1,]    8    2    0    0    0    0    0    0
#[2,]    0    0    5    5    0    0    0    0
#[3,]    0    0    0    0    4    6    0    0
#[4,]    0    0    0    0    0    0    8    2
results$mulTestdec
#[1] 1 1 1 1 1 1 1
A function that provides the probability of observing a given r x c contingency table, computed using the generalized hypergeometric probability (row and column margins held fixed).
pmf(M)
M |
r x c contingency table (matrix of non-negative integer counts) |
a single value between 0 and 1
Biplab Paul, Shyamal K. De and Anil K. Ghosh
Maintainer: Biplab Paul <paul.biplab497@gmail.com>
Biplab Paul, Shyamal K. De and Anil K. Ghosh (2021). Some clustering based exact distribution-free k-sample tests applicable to high dimension, low sample size data. Journal of Multivariate Analysis, doi:10.1016/j.jmva.2021.104897.
Cyrus R. Mehta and Nitin R. Patel (1983). A network algorithm for performing Fisher's exact test in r x c contingency tables. Journal of the American Statistical Association, 78(382):427-434, doi:10.2307/2288652.
# Generalized hypergeometric probability of an r x c contingency table:
mat <- matrix(1:20,5,4, byrow = TRUE)
pmf(mat)
## outputs:
#[1] 4.556478e-09
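The probability computed here is the standard generalized hypergeometric (Fisher) probability of a table with fixed margins, P = (prod_i R_i!)(prod_j C_j!) / (N! prod_ij n_ij!); a log-factorial sketch for cross-checking:
loghyper <- function(M) {
  sum(lfactorial(rowSums(M))) + sum(lfactorial(colSums(M))) -
    lfactorial(sum(M)) - sum(lfactorial(M))
}
mat <- matrix(1:20, 5, 4, byrow = TRUE)
exp(loghyper(mat))  # expected to agree with pmf(mat) above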
Measures the dissimilarity between the exact cluster labels (memberships) and the estimated cluster labels (memberships) of the observations.
randfun(lvel, dv)
lvel |
exact cluster labels of the observations |
dv |
estimated cluster labels of the observations |
a single value between 0 and 1
Biplab Paul, Shyamal K. De and Anil K. Ghosh
Maintainer: Biplab Paul <paul.biplab497@gmail.com>
William M. Rand (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846-850, doi:10.1080/01621459.1971.10482356.
# Measures of dissimilarity:
exl <- c(rep(0,5), rep(1,5), rep(2,5), rep(3,5))
el <- c(0,0,1,0,0,1,2,1,0,1,2,2,3,2,2,3,2,3,1,3)
randfun(exl,el)
## outputs:
#[1] 0.2368421
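Assuming randfun() returns the proportion of pairs of observations on which the two partitions disagree (i.e., one minus the Rand index), a minimal sketch reproduces the example above:
rand_dissim <- function(a, b) {
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  mean(xor(same_a, same_b)[upper.tri(same_a)])  # discordant-pair proportion
}
rand_dissim(exl, el)
#[1] 0.2368421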
A function that generates a random r x c contingency table with the same marginal totals as a given r x c contingency table.
rctab(M)
M |
r x c contingency table (matrix of non-negative integer counts) |
a generated r x c contingency table with the same marginal totals as M
Biplab Paul, Shyamal K. De and Anil K. Ghosh
Maintainer: Biplab Paul <paul.biplab497@gmail.com>
Cyrus R. Mehta and Nitin R. Patel (1983). A network algorithm for performing Fisher's exact test in r x c contingency tables. Journal of the American Statistical Association, 78(382):427-434, doi:10.2307/2288652.
# Generation of an r x c contingency table:
set.seed(151)
mat <- matrix(1:20,5,4, byrow = TRUE)
rctab(mat)
## outputs:
#     [,1] [,2] [,3] [,4]
#[1,]    3    4    0    3
#[2,]    4    5   10    7
#[3,]    8    7   12   15
#[4,]   18   16   13   11
#[5,]   12   18   20   24
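Base R's stats::r2dtable() also draws random two-way tables with fixed margins (Patefield's algorithm); it uses a different sampler, so it will not reproduce rctab's exact output, but it can cross-check the margin-preserving behaviour:
set.seed(151)
tab <- stats::r2dtable(1, rowSums(mat), colSums(mat))[[1]]
all(rowSums(tab) == rowSums(mat)) && all(colSums(tab) == colSums(mat))
#[1] TRUE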
Performs the distribution-free exact k-sample test for equality of multivariate distributions in the HDLSS regime. The RI test statistic is the Rand-index-based dissimilarity between the true population labels and the estimated cluster labels.
RItest(M, labels, sizes, n_clust, randomization = TRUE, clust_alg = "knwClustNo", kmax = 2 * n_clust, s_psi = 1, s_h = 1, lb = 1, n_sts = 1000, alpha = 0.05)
M |
n x d observation matrix; rows are the observations |
labels |
vector of length n of population labels of the observations |
sizes |
vector of sample sizes of the k populations |
n_clust |
number of populations, k |
randomization |
logical; if TRUE (default), a randomized test is performed, if FALSE, a non-randomized test |
clust_alg |
"knwClustNo" (default) or "estClustNo"; modified K-means algorithm used for clustering, with known or estimated total number of clusters |
kmax |
maximum value of the total number of clusters when estimating the total number of clusters in the pooled sample, default: 2 * n_clust |
s_psi |
function required for clustering: an index selecting the psi function of the MADD dissimilarity, default: 1 |
s_h |
function required for clustering: an index selecting the h function of the MADD dissimilarity, default: 1 |
lb |
each observation is partitioned into a number of smaller sub-vectors of equal length, default: 1 |
n_sts |
number of simulations of the test statistic, default: 1000 |
alpha |
numeric; significance level of the test, default: 0.05 |
RItest returns a list containing the following items:
estClustLabel |
a vector of length n of estimated cluster (class) labels of the observations |
obsCtyTab |
observed contingency table |
ObservedRI |
value of the observed test statistic |
RICutoff |
cut-off of the test |
randomGamma |
randomized coefficient of the test |
estPvalue |
estimated p-value of the test |
decisionRI |
1 if the null hypothesis of homogeneity is rejected, 0 otherwise |
estClustNo |
estimated total number of classes (clusters) |
Biplab Paul, Shyamal K. De and Anil K. Ghosh
Maintainer: Biplab Paul <paul.biplab497@gmail.com>
Biplab Paul, Shyamal K. De and Anil K. Ghosh (2021). Some clustering based exact distribution-free k-sample tests applicable to high dimension, low sample size data. Journal of Multivariate Analysis, doi:10.1016/j.jmva.2021.104897.
William M. Rand (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846-850, doi:10.1080/01621459.1971.10482356.
# multivariate normal distribution:
# generate data with dimension d = 500
set.seed(151)
n1=n2=n3=n4=10
k = 4
d = 500
I1 <- matrix(rnorm(n1*d,mean=0,sd=1),n1,d)
I2 <- matrix(rnorm(n2*d,mean=0.5,sd=1),n2,d)
I3 <- matrix(rnorm(n3*d,mean=1,sd=1),n3,d)
I4 <- matrix(rnorm(n4*d,mean=1.5,sd=1),n4,d)
levels <- c(rep(0,n1), rep(1,n2), rep(2,n3), rep(3,n4))
X <- as.matrix(rbind(I1,I2,I3,I4))
# RI test:
results <- RItest(M=X, labels=levels, sizes = c(n1,n2,n3,n4), n_clust = k)
## outputs:
results$estClustLabel
#[1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
results$obsCtyTab
#     [,1] [,2] [,3] [,4]
#[1,]   10    0    0    0
#[2,]    0   10    0    0
#[3,]    0    0   10    0
#[4,]    0    0    0   10
results$ObservedRI
#[1] 0
results$RICutoff
#[1] 0.3307692
results$randomGamma
#[1] 0
results$estPvalue
#[1] 0
results$decisionRI
#[1] 1
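The observed RI statistic is the Rand-index-based dissimilarity between the true labels and the estimated clustering; reusing levels and results from the example, it can be cross-checked with randfun():
randfun(levels, results$estClustLabel)
#[1] 0   # expected to match results$ObservedRI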