2 Generating data

We start by defining functions that we can use to generate a design matrix with two kinds of groups of variables, which we call cluster structures and complex dependencies.

2.1 Cluster structure

Here, we generate \(s\) clusters of features, each with \(d\) features. A cluster consists of a representative variable \(X_k\sim N_n(0, I_n)\) and \(d-1\) proxy variables \(X_j=X_k+\delta_j\), with \(\delta_j\sim N_n(0,\eta^2 I_n)\).

#' Generate cluster design matrix
#' 
#' A function that generates a design matrix with cluster structures. 
#' The representative feature follows N(0,1), and perturbation direction follows
#'  N(0, eta^2).
#' 
#' @param n sample size
#' @param s number of clusters
#' @param d number of features in each cluster
#' @param eta standard deviation of the perturbation direction
#' 
#' @return an n by s*d matrix
#' 
#' @export
gen_clust <- function(n, s, d, eta = 0.1) {
  X = matrix(stats::rnorm(n*s*d), nrow = n)
  
  id_rep = (1:s) * d - (d-1)
  id_prox = setdiff(1:ncol(X), id_rep)
  track_rep = 0
  for(i in 1:ncol(X)) {
    if(i %in% id_rep) {
      track_rep = i
    } else {
      X[,i] = X[,track_rep] + stats::rnorm(n, sd = eta)
    }
  }
  return(X)
}

We have used a function in the package stats, so let’s import it.

usethis::use_package("stats")
## ✔ Adding stats to 'Imports' field in DESCRIPTION.
## ☐ Refer to functions with `stats::fun()`.

This creates s*d features, where the first feature of each cluster is the representative feature. To quickly check this, observe that if we set eta to be 1, the variance of the proxy features are all about twice that of the representative feature:

testthat::test_that("gen_clust works as expected", {
  set.seed(123)
  Xclust <- gen_clust(n=1000, s=2, d = 4, eta = 1)
  testthat::expect_equal(round(apply(Xclust, 2, var)),
                         c(1, 2, 2, 2, 1, 2, 2, 2))
})
## Test passed 😸

2.2 Complex Dependencies Block

The complex dependencies blocks consist of parent-child relationships. As described in the paper, “the parent features are generated by \(X_j \overset{iid}{\sim} \mathcal{N}(0, I_n)\), \(j\in \mathcal{K}_l\), and the child is defined as \(X_{c_l} = \sum_{j\in \mathcal{K}_l} X_j + \delta_{c_l}\), where the perturbation directions follow \(\delta_{c_l} \overset{iid}{\sim} \mathcal{N}(0, \eta_l^2 I_n)\).”

#' Generate design matrix with complex dependency
#' 
#' A function that generates a design matrix with complex dependency structures. 
#' The parent features follow N(0,1), and perturbation direction follows N(0, eta^2).
#' 
#' @param n sample size
#' @param nb number of blocks
#' @param b number of parents in each block
#' @param eta standard deviation of the perturbation direction
#' 
#' @return an n by nb*(b+1) matrix
#' 
#' @export
gen_block <- function(n, nb, b, eta = 0.1) {
  X = matrix(stats::rnorm(n*nb*(b+1)), nrow = n)
  id_child = seq(from = b+1, to = nb*(b+1), by = b+1)
  id_parents = setdiff(1:ncol(X), id_child)
  for(i in 1:ncol(X)) {
    if(i %in% id_parents) {
    } else {
      X[,i] = apply(X[,(i-b):(i-1)], 1, sum) + stats::rnorm(n, sd = eta)
    }
  }
  return(X)
}

In this function, the child feature is the last of each block. Let’s check that we have this correct:

testthat::test_that("gen_block works as expected", {
  set.seed(123)
  Xblock <- gen_block(n=1000, nb=2, b = 4, eta = 1)
  testthat::expect_equal(round(apply(Xblock, 2, var)),
                         c(1, 1, 1, 1, 5, 1, 1, 1, 1, 5))
})
## Test passed 🥇

2.3 Make some example data

With these two functions defined, we can now generate some example data that we’ll use to demonstrate the functions defined in the rest of this document.

ex <- list(
  n = 300,
  nclust = 2,
  sindep = 5,
  p = 50,
  eta = 0.1
  )
ex$Xclust = gen_clust(ex$n, ex$nclust, d = 2, eta = ex$eta)
ex$Xblock = gen_block(ex$n, nb = 1, b = 2, eta = ex$eta)
ex$Xindep = matrix(stats::rnorm(ex$n * (ex$p - ncol(ex$Xclust) - ncol(ex$Xblock) )),
                   nrow = ex$n)
ex$X = cbind(ex$Xclust, ex$Xblock, ex$Xindep)
ex$beta = c(
  rep(c(1, 0), ex$nclust),
  c(1, -1, 0),
  rep(0, ex$p - ncol(ex$Xclust) - ncol(ex$Xblock))
)
ex$y = ex$X %*% ex$beta + stats::rnorm(ex$n, sd = 1.5)