Package 'simtrait' reference manual

Title:	Simulate Complex Traits from Genotypes
Description:	Simulate complex traits given a SNP genotype matrix and model parameters (the desired heritability, optional environment group effects, number of causal loci, and either the true ancestral allele frequencies used to generate the genotypes or the mean kinship for a real dataset). Emphasis is on avoiding common biases due to the use of estimated allele frequencies. The code selects random loci to be causal, constructs coefficients for these loci and random independent non-genetic effects, and can optionally generate random group effects. Traits can follow three models: random coefficients, fixed effect sizes, and infinitesimal (multivariate normal). GWAS method benchmarking functions are also provided. Described in Yao and Ochoa (2023) <doi:10.7554/eLife.79238>.
Authors:	Alejandro Ochoa [aut, cre] , Grace E. Rhodes [aut]
Maintainer:	Alejandro Ochoa <[email protected]>
License:	GPL-3
Version:	1.1.9.9000
Built:	2025-02-28 03:46:59 UTC
Source:	https://github.com/ochoalab/simtrait

Compute locus allele frequencies

Description

On a regular matrix, this is essentially a wrapper for colMeans() or rowMeans() depending on loci_on_cols. On a BEDMatrix object, the locus allele frequencies are computed keeping memory usage low.

Usage

allele_freqs(
  X,
  loci_on_cols = FALSE,
  fold = FALSE,
  m_chunk_max = 1000,
  subset_ind = NULL,
  want_counts = FALSE
)
allele_freqs(
  X,
  loci_on_cols = FALSE,
  fold = FALSE,
  m_chunk_max = 1000,
  subset_ind = NULL,
  want_counts = FALSE
)

Arguments

`X`	The genotype matrix (regular R matrix or BEDMatrix object). Missing values are ignored in averages.
`loci_on_cols`	If `TRUE`, `X` has loci on columns and individuals on rows; if false (the default), loci are on rows and individuals on columns. If `X` is a BEDMatrix object, code assumes loci on columns (`loci_on_cols` is ignored).
`fold`	If `TRUE`, allele frequencies are converted to minor allele frequencies. Default is to return frequencies for the given allele counts in `X` (regardless of whether it is the minor or major allele).
`m_chunk_max`	BEDMatrix-specific, sets the maximum number of loci to process at the time. If memory usage is excessive, set to a lower value than default (expected only for extremely large numbers of individuals).
`subset_ind`	Optionally subset individuals by providing their indexes (negative indexes to exclude) or a boolean vector (in other words, the usual ways to subset matrices). Most useful for BEDMatrix inputs, to subset chunks and retain low memory usage.
`want_counts`	If `TRUE` (default `FALSE`), raw allele counts are also returned. Note `fold` option has no effect on these counts.

Value

If want_counts = FALSE, the vector of estimated ancestral allele frequencies, one per locus. Names are set to the locus names, if present. If want_counts = TRUE, a named list containing both the estimated ancestral allele frequencies p_anc_est and the allele counts matrix, with loci along the rows (also with names if present), and alleles along the columns.

Examples

# Construct toy data
X <- matrix(
    c(0, 1, 2,
      1, 0, 1,
      1, NA, 2),
    nrow = 3,
    byrow = TRUE
)

# row means
allele_freqs(X)
c(1/2, 1/3, 3/4)

# row means, in minor allele frequencies
allele_freqs(X, fold = TRUE)
c(1/2, 1/3, 1/4)

# col means
allele_freqs(X, loci_on_cols = TRUE)
c(1/3, 1/4, 5/6)

# Construct toy data
X <- matrix(
    c(0, 1, 2,
      1, 0, 1,
      1, NA, 2),
    nrow = 3,
    byrow = TRUE
)

# row means
allele_freqs(X)
c(1/2, 1/3, 3/4)

# row means, in minor allele frequencies
allele_freqs(X, fold = TRUE)
c(1/2, 1/3, 1/4)

# col means
allele_freqs(X, loci_on_cols = TRUE)
c(1/3, 1/4, 5/6)

The model covariance matrix of the trait

Description

This function returns the expected covariance matrix of trait vectors simulated via sim_trait() and sim_trait_mvn(). Below there are n individuals.

Usage

cov_trait(kinship, herit, sigma_sq = 1, labs = NULL, labs_sigma_sq = NULL)
cov_trait(kinship, herit, sigma_sq = 1, labs = NULL, labs_sigma_sq = NULL)

Arguments

`kinship`	The `n`-by-`n` kinship matrix of the individuals. These values should be scaled such that an outbred individual has 1/2 self-kinship, the parent-child relationship is 1/4, etc (which is half the values sometimes defined for kinship).
`herit`	The desired heritability (proportion of trait variance due to genetics).
`sigma_sq`	The desired parametric variance factor of the trait (scalar, default 1). Corresponds to the variance of an outbred individual.
`labs`	Optional labels assigning individuals to groups, to simulate environment group effects. Values can be numeric or strings, simply assigning the same values to individuals in the same group. If vector (single environment), length must be number of individuals. If matrix (multiple environments), individuals must be along rows, and environments along columns. The environments are not required to be nested. If this is non-`NULL`, then `labs_sigma_sq` must also be given!
`labs_sigma_sq`	Optional vector of environment effect variance proportions, one value for each environment given in `labs` (a scalar if `labs` is a vector, otherwise its length should be the number of columns of `labs`). Ignored unless `labs` is also given. As these are variance proportions, each value must be non-negative and `sum(labs_sigma_sq) + herit <= 1` is required so residual variance is non-negative.

Value

The n-by-n trait covariance matrix, which under no environment effects equals sigma_sq * ( herit * 2 * kinship + sigma_sq_residual * I ), where I is the n-by-n identity matrix and sigma_sq_residual = 1 - herit. If there are labels, covariance will include the specified block diagonal effects and sigma_sq_residual = 1 - herit - sum(labs_sigma_sq).

Examples

# create a dummy kinship matrix
kinship <- matrix(
    data = c(
        0.6, 0.1, 0.0,
        0.1, 0.6, 0.1,
        0.0, 0.1, 0.6
    ),
    nrow = 3,
    byrow = TRUE
)
# covariance of simulated traits
V <- cov_trait(kinship = kinship, herit = 0.8)

# create a dummy kinship matrix
kinship <- matrix(
    data = c(
        0.6, 0.1, 0.0,
        0.1, 0.6, 0.1,
        0.0, 0.1, 0.6
    ),
    nrow = 3,
    byrow = TRUE
)
# covariance of simulated traits
V <- cov_trait(kinship = kinship, herit = 0.8)

Per-locus heritability contribution from allele frequency and causal coefficient

Description

Calculates the vector of per-locus heritability values, with each causal locus i calculated as h_i^2 = 2 * p_i * ( 1 - p_i ) * beta_i^2 / sigma_sq, where p_i is the ancestral allele frequency, beta_i is the causal coefficient, and sigma_sq is the trait variance scale. These are all assumed to be true parameters (not estimated). These per-locus heritabilities equal per-locus effect sizes divided by sigma_sq.

Usage

herit_loci(p_anc, causal_coeffs, causal_indexes = NULL, sigma_sq = 1)
herit_loci(p_anc, causal_coeffs, causal_indexes = NULL, sigma_sq = 1)

Arguments

`p_anc`	The ancestral allele frequency vector.
`causal_coeffs`	The vector of causal coefficients.
`causal_indexes`	The optional vector of causal indexes. If `NULL` (default), `p_anc` and `causal_coeffs` are assumed to be for causal loci only (must be the same length). If non-`NULL`, `causal_loci` is used to subset both `p_anc` and `causal_coeffs` as needed: if each of these vectors is longer than `causal_loci`, then it is subset; otherwise they must have equal lengths as `causal_loci` or an error is thrown.
`sigma_sq`	The parametric variance factor of the trait (default 1). This factor corresponds to the variance of an outbred individual.

Value

The vector of per-locus heritability contributions. The sum of these values gives the overall heritability. This value can be greater than one (or wrong, more generally) if sigma_sq is misspecified.

Examples

# create toy random data
m_loci <- 10
# ancestral allele frequencies
p_anc <- runif( m_loci )
# causal loci
causal_coeffs <- rnorm( m_loci ) / m_loci
# resulting heritability contributions vector
herit_loci( p_anc, causal_coeffs )

# create toy random data
m_loci <- 10
# ancestral allele frequencies
p_anc <- runif( m_loci )
# causal loci
causal_coeffs <- rnorm( m_loci ) / m_loci
# resulting heritability contributions vector
herit_loci( p_anc, causal_coeffs )

Calculate expectations of inverse variance terms over the posterior distribution of the ancestral allele frequency

Description

This function assumes that each sample allele frequency x in the input has a Beta distribution with mean p and variance p*(1-p)*kinship⁠, where ⁠p⁠is the true allele frequency and is unknown, while⁠kinship⁠is the mean kinship of the sample and is known. Using this assumption, an expectation over the posterior distribution of⁠p⁠is calculated for what we call an inverse variance term, namely⁠1 / ( p * ( 1 - p ) )^g⁠, where ⁠p * ( 1 - p )⁠corresponds to the Bernoulli variance, and⁠g⁠is some desired exponent. Note that this function estimates these posteriors each from a single data point⁠x'.

Usage

inv_var_est_bayesian(p_anc_est, kinship, g)
inv_var_est_bayesian(p_anc_est, kinship, g)

Arguments

`p_anc_est`	A vector of sample allele frequencies (each refered to as `x` above)
`kinship`	The mean kinship coefficient of the data
`g`	The exponent of the variance term

Value

A vector of expectations of 1 / ( p * ( 1 - p ) )^g over the marginalized, unknown true ancestral allelel frequencies p, weighted by how likely they are in the posterior given each input sample allele frequency. The length of the vector is the same as the input p_anc_est, and each element in the output was estimated from each corresponding element in the input only.

Examples

# select a relatively high value for example
kinship <- 0.1

# try a grid of values, including edge cases (0,1)
m_loci <- 1000
p_anc_est <- 0 : (m_loci - 1) / (m_loci - 1)

# calculate the desired estimates!
# here g=0.5 estimates `1 / sqrt( p * ( 1 - p ) )`
inv_var_est <- inv_var_est_bayesian( p_anc_est, kinship, 0.5 )

# select a relatively high value for example
kinship <- 0.1

# try a grid of values, including edge cases (0,1)
m_loci <- 1000
p_anc_est <- 0 : (m_loci - 1) / (m_loci - 1)

# calculate the desired estimates!
# here g=0.5 estimates `1 / sqrt( p * ( 1 - p ) )`
inv_var_est <- inv_var_est_bayesian( p_anc_est, kinship, 0.5 )

Calculate maximum likelihood estimates of allele frequencies from Beta model

Description

For each sample allele frequency x in the input, this function calculates the maximum likelihood estimate of the true allele frequency p⁠assuming that the sample was drawn from a Beta distribution with mean⁠p⁠ and variance p*(1-p)*kinship⁠, where kinship is the mean kinship of the sample and is known. Note that, oddly, this function estimates the unknown p from a single data point x. Nevertheless, this procedure results in favorable shrinkage of estimate towards 0.5.

Usage

p_anc_est_beta_mle(p_anc_est, kinship)
p_anc_est_beta_mle(p_anc_est, kinship)

Arguments

`p_anc_est`	A vector of sample allele frequencies (each refered to as `x` above)
`kinship`	The mean kinship coefficient of the data

Value

A vector of maximum likelihood estimates of allele frequencies. The length of the vector is the same as the input p_anc_est, and each element in the output was estimated from each corresponding element in the input only.

Examples

# select a relatively high value for example
kinship <- 0.1

# try a grid of values, including edge cases (0,1)
m_loci <- 1000
p_anc_est <- 0 : (m_loci - 1) / (m_loci - 1)

# calculate the desired estimates!
p_anc_mle <- p_anc_est_beta_mle( p_anc_est, kinship )

# notice values tend to be shrunk towards 0.5,
# except values near the edges of the range are shrunk less
plot( p_anc_est, p_anc_mle )
abline(0,1)

# select a relatively high value for example
kinship <- 0.1

# try a grid of values, including edge cases (0,1)
m_loci <- 1000
p_anc_est <- 0 : (m_loci - 1) / (m_loci - 1)

# calculate the desired estimates!
p_anc_mle <- p_anc_est_beta_mle( p_anc_est, kinship )

# notice values tend to be shrunk towards 0.5,
# except values near the edges of the range are shrunk less
plot( p_anc_est, p_anc_mle )
abline(0,1)

Area under the precision-recall curve

Description

Calculates the Precision-Recall (PR) Area Under the Curve (AUC) given a vector of p-values and the true classes (causal (alternative) vs non-causal (null)). This is a wrapper around PRROC::pr.curve(), which actually calculates the AUC (see that for details).

Usage

pval_aucpr(pvals, causal_indexes, curve = FALSE)
pval_aucpr(pvals, causal_indexes, curve = FALSE)

Arguments

`pvals`	The vector of association p-values to analyze. `NA` values are allowed in input, are internally set to 1 (worst score) prior to AUC calculation (to prevent methods to get good AUCs by setting more cases to `NA`). Non-`NA` values outside of [0,1] will trigger an error.
`causal_indexes`	The vector of causal indexes, defining the true classes used for AUC calculation. Values of `causal_indexes` as returned by `sim_trait` work. There must be at least one causal index and at least one non-causal case.
`curve`	If `FALSE` (default), only scalar AUC is returned. If `TRUE`, then `curve = TRUE` is passed to `PRROC::pr.curve()` and the full object (class `PRROC`) is returned (see below).

Value

If curve = FALSE, returns the PR AUC scalar value. If curve = TRUE, returns the PRROC object as returned by PRROC::pr.curve(), which can be plotted directly, and which contains the AUC under the named value auc.integral.

However, if the input pvals is NULL (taken for case of singular association test, which is rare but may happen), then the returned value is NA.

Examples

# simulate truly null p-values, which should be uniform
pvals <- runif(10)
# for toy example, take the first two p-values to be truly causal
causal_indexes <- 1:2
# calculate desired measure
pval_aucpr( pvals, causal_indexes )

# simulate truly null p-values, which should be uniform
pvals <- runif(10)
# for toy example, take the first two p-values to be truly causal
causal_indexes <- 1:2
# calculate desired measure
pval_aucpr( pvals, causal_indexes )

Apply genomic control transformation to a list of p-values

Description

Genomic control (GC) is a procedure designed to correct the calibration of p-values calculated from test statistics that are supposed to have a Chi-squared distribution but are misspecified and have a different median than desired, in which case the test statistics are divided by the inflation factor (the ratio between the observed and desired medians). This function applies the correction using p-values only, calculating the underlying test statistics assuming they are Chi-squared ditributed with known degrees of freedom. The function guarantees that the corrected p-values have a median of 0.5, as desired if they were calibrated, but otherwise there is no guarantee that these corrected p-values will have a reasonable distribution.

Usage

pval_gc(pvals, df = 1)
pval_gc(pvals, df = 1)

Arguments

`pvals`	The vector of association p-values to analyze. This function assumes all p-values are provided (a mix of null and alternative tests). `NA` values are allowed in input and removed. Non-`NA` values outside of [0, 1] will trigger an error.
`df`	The degrees of freedom of the assumed chi-squared distribution (default 1).

Value

A list with the following named elements:

pvals: the vector of GC-corrected p-values
lambda: the inflation factor of the input p-values, which was used to calculate the corrected p-values

Examples

# a simulated set of highly inflated p-values, skewed toward zero
pvals <- rbeta( 100, 1, 10 )
hist( pvals )

# calculate the GC-corrected p-values
obj <- pval_gc( pvals )
pvals2 <- obj$pvals
# and get the inflation factor of the original data:
obj$lambda

# note GC-corrected p-values are often not uniform:
hist( pvals2 )
# but they have a median of 0.5
median( pvals2 )

# a simulated set of highly inflated p-values, skewed toward zero
pvals <- rbeta( 100, 1, 10 )
hist( pvals )

# calculate the GC-corrected p-values
obj <- pval_gc( pvals )
pvals2 <- obj$pvals
# and get the inflation factor of the original data:
obj$lambda

# note GC-corrected p-values are often not uniform:
hist( pvals2 )
# but they have a median of 0.5
median( pvals2 )

Calculate inflation factor from p-values

Description

The inflation factor is defined as the median association test statistic divided by the expected median under the null hypothesis, which is typically assumed to have a chi-squared distribution. This function takes a p-value distribution and maps its median back to the chi-squared value (using the quantile function) in order to compute the inflation factor in the chi-squared scale. The full p-value distribution (a mix of null and alternative cases) is used to calculate the desired median value (the true causal_loci is not needed, unlike pval_srmsd()).

Usage

pval_infl(pvals, df = 1)
pval_infl(pvals, df = 1)

Arguments

`pvals`	The vector of association p-values to analyze. This function assumes all p-values are provided (a mix of null and alternative tests). `NA` values are allowed in input and removed. Non-`NA` values outside of [0, 1] will trigger an error.
`df`	The degrees of freedom of the assumed chi-squared distribution (default 1).

Value

The inflation factor

Examples

# simulate truly null p-values, which should be uniform
pvals <- runif(10)
# calculate desired measure
pval_infl( pvals )

# simulate truly null p-values, which should be uniform
pvals <- runif(10)
# calculate desired measure
pval_infl( pvals )

Estimate calibrated power

Description

Given a significance level alpha and p-values with known causal status, this function estimates the calibrated power. First it estimates the p-value threshold at which the desired type I error of alpha is achieved, then it uses this p-value threshold (not alpha) to estimate statistical power. Note that these simple empirical estimates are likely to be inaccurate unless the number of p-values is much larger than 1/alpha.

Usage

pval_power_calib(pvals, causal_indexes, alpha = 0.05)
pval_power_calib(pvals, causal_indexes, alpha = 0.05)

Arguments

`pvals`	The vector of association p-values to analyze. This function assumes all p-values are provided (a mix of null and alternative tests). `NA` values are allowed in input and removed. Non-`NA` values outside of [0, 1] will trigger an error.
`causal_indexes`	The vector of causal indexes, defining the true classes used for calibrated power estimation. Values of `causal_indexes` as returned by `sim_trait` work. There must be at least one causal index and at least one non-causal case.
`alpha`	The desired significance level (default 0.05). May be a vector.

Value

The calibrated power estimates at each alpha

Examples

# simulate truly null p-values, which should be uniform
pvals <- runif(10)
# for toy example, take the first two p-values to be truly causal
causal_indexes <- 1:2
# estimate desired measure
pval_power_calib( pvals, causal_indexes )

# simulate truly null p-values, which should be uniform
pvals <- runif(10)
# for toy example, take the first two p-values to be truly causal
causal_indexes <- 1:2
# estimate desired measure
pval_power_calib( pvals, causal_indexes )

Signed RMSD measure of null p-value uniformity

Description

Quantifies null p-value uniformity by computing the RMSD (root mean square deviation) between the sorted observed null (truly non-causal) p-values and their expected quantiles under a uniform distribution. Meant as a more robust alternative to the "inflation factor" common in the GWAS literature, which compares median values only and uses all p-values (not just null p-values). Our signed RMSD, to correspond with the inflation factor, includes a sign that depends on the median null p-value: positive if this median is ⁠<= 0.5⁠ (corresponds with test statistic inflation), negative otherwise (test statistic deflation). Zero corresponds to uniform null p-values, which arises in expectation only if test statistics have their assumed null distribution (there is no misspecification, including inflation).

Usage

pval_srmsd(pvals, causal_indexes, detailed = FALSE)
pval_srmsd(pvals, causal_indexes, detailed = FALSE)

Arguments

`pvals`	The vector of association p-values to analyze. This function assumes all p-values are provided (a mix of null and alternative tests). `NA` values are allowed in input and removed. Non-`NA` values outside of [0, 1] will trigger an error.
`causal_indexes`	The vector of causal indexes, whose p-values will be omitted. Values of `causal_indexes` as returned by `sim_trait` work. This parameter is required to prevent use of this function except when the true status of every test (null vs alternative) is known. Set to `NULL` if all loci are truly null (non-causal). Otherwise, `causal_indexes` must have at least one causal index.
`detailed`	If `FALSE` (default) only SRMSD is returned. If `TRUE`, sorted null p-values without NAs and their expectations are returned (useful for plots).

Value

If detailed is FALSE, returns the signed RMSD between the observed p-value order statistics and their expectation under true uniformity. If detailed is TRUE, returns data useful for plots, a named list containing:

srmsd: The signed RMSD between the observed p-value order statistics and their expectation under true uniformity.
pvals_null: Sorted null p-values (observed order statistics). If any input null p-values were NA, these have been removed here (removed by sort()).
pvals_unif: Expected order statistics assuming uniform distribution, same length as pvals_null.

If the input pvals is NULL (taken for case of singular association test, which is rare but may happen), then the returned value is NA if detailed was FALSE, or otherwise the list contains NA, NULL and NULL for the above three items.

Examples

# simulate truly null p-values, which should be uniform
pvals <- runif(10)
# for toy example, take the first p-value to be truly causal (will be ignored below)
causal_indexes <- 1
# calculate desired measure
pval_srmsd( pvals, causal_indexes )

# simulate truly null p-values, which should be uniform
pvals <- runif(10)
# for toy example, take the first p-value to be truly causal (will be ignored below)
causal_indexes <- 1
# calculate desired measure
pval_srmsd( pvals, causal_indexes )

Estimate type I error rate

Description

Given a significance level and p-values with known causal status, this function estimates the type I error rate, defined as the proportion of null p-values that are below or equal to the threshold. Note that these simple empirical estimates are likely to be zero unless the number of p-values is much larger than 1/alpha.

Usage

pval_type_1_err(pvals, causal_indexes, alpha = 0.05)
pval_type_1_err(pvals, causal_indexes, alpha = 0.05)

Arguments

`pvals`	The vector of association p-values to analyze. This function assumes all p-values are provided (a mix of null and alternative tests). `NA` values are allowed in input and removed. Non-`NA` values outside of [0, 1] will trigger an error.
`causal_indexes`	The vector of causal indexes, whose p-values will be omitted. Values of `causal_indexes` as returned by `sim_trait` work. This parameter is required to prevent use of this function except when the true status of every test (null vs alternative) is known. Set to `NULL` if all loci are truly null (non-causal). Otherwise, `causal_indexes` must have at least one causal index.
`alpha`	The desired significance level (default 0.05). May be a vector.

Value

The type I error rate estimates at each alpha

Examples

# simulate truly null p-values, which should be uniform
pvals <- runif(10)
# for toy example, take the first p-value to be truly causal (will be ignored below)
causal_indexes <- 1
# estimate desired measure
pval_type_1_err( pvals, causal_indexes )

# simulate truly null p-values, which should be uniform
pvals <- runif(10)
# for toy example, take the first p-value to be truly causal (will be ignored below)
causal_indexes <- 1
# estimate desired measure
pval_type_1_err( pvals, causal_indexes )

Root mean square deviation

Description

Calculates the euclidean distance between two vectors x and y divided by the square root of the lengths of the vectors. NA values are ignored by default when calculating the mean squares (so the denominator is the number of non-NA differences).

Usage

rmsd(x, y, na.rm = TRUE)
rmsd(x, y, na.rm = TRUE)

Arguments

`x`	The first vector to compare (required).
`y`	The second vector to compare (required). Lengths of `x` and `y` must be equal.
`na.rm`	If `TRUE` (default), `NA` values are removed before calculating the mean square difference. If `FALSE`, any missing values in either `x` or `y` result in `NA` returned. Passed to `mean()`, see that for more info.

Value

the square root of the mean square difference between x and y, after removing NA comparisons (cases where either is NA).

Examples

x <- rnorm(10)
y <- rnorm(10)
rmsd( x, y )

x <- rnorm(10)
y <- rnorm(10)
rmsd( x, y )

Simulate a complex trait from genotypes

Description

Simulate a complex trait given a SNP genotype matrix and model parameters, which are minimally: the number of causal loci, the heritability, and either the true ancestral allele frequencies used to generate the genotypes or the mean kinship of all individuals. An optional minimum marginal allele frequency for the causal loci can be set. The output traits have by default a zero mean and unit variance (for outbred individuals), but those parameters can be modified. The code selects random loci to be causal, constructs coefficients for these loci (scaled appropriately) and random Normal independent non-genetic effects and random environment group effects if specified. There are two models for constructing causal coefficients: random coefficients (RC; default) and fixed effect sizes (FES; i.e., coefficients roughly inversely proportional to allele frequency; use fes = TRUE). Suppose there are m loci and n individuals.

Usage

sim_trait(
  X,
  m_causal,
  herit,
  p_anc = NULL,
  kinship = NULL,
  mu = 0,
  sigma_sq = 1,
  labs = NULL,
  labs_sigma_sq = NULL,
  maf_cut = NA,
  mac_cut = NA,
  loci_on_cols = FALSE,
  m_chunk_max = 1000,
  fes = FALSE,
  fes_kinship_method = c("original", "mle", "bayesian"),
  causal_indexes = NULL,
  old_center_scale = FALSE,
  old_sample_var = FALSE
)
sim_trait(
  X,
  m_causal,
  herit,
  p_anc = NULL,
  kinship = NULL,
  mu = 0,
  sigma_sq = 1,
  labs = NULL,
  labs_sigma_sq = NULL,
  maf_cut = NA,
  mac_cut = NA,
  loci_on_cols = FALSE,
  m_chunk_max = 1000,
  fes = FALSE,
  fes_kinship_method = c("original", "mle", "bayesian"),
  causal_indexes = NULL,
  old_center_scale = FALSE,
  old_sample_var = FALSE
)

Arguments

`X`	The `m`-by-`n` genotype matrix (if `loci_on_cols = FALSE`, transposed otherwise), or a BEDMatrix object. This is a numeric matrix consisting of reference allele counts (in `c(0, 1, 2, NA)` for a diploid organism).
`m_causal`	The desired number of causal loci. Ignored if `causal_indexes` is provided.
`herit`	The desired heritability (proportion of trait variance due to genetics).
`p_anc`	The length-`m` vector of true ancestral allele frequencies. Optional but recommended for simulations. Either this or `kinship` must be specified.
`kinship`	The mean kinship value of the individuals in the data. The `n`-by-`n` kinship matrix of the individuals in the data is also accepted. Optional but recommended for real data. Either this or `p_anc` must be specified.
`mu`	The desired parametric mean value of the trait (scalar, default 0).
`sigma_sq`	The desired parametric variance factor of the trait (scalar, default 1). Corresponds to the variance of an outbred individual.
`labs`	Optional labels assigning individuals to groups, to simulate environment group effects. Values can be numeric or strings, simply assigning the same values to individuals in the same group. If vector (single environment), length must be number of individuals. If matrix (multiple environments), individuals must be along rows, and environments along columns. The environments are not required to be nested. If this is non-`NULL`, then `labs_sigma_sq` must also be given!
`labs_sigma_sq`	Optional vector of environment effect variance proportions, one value for each environment given in `labs` (a scalar if `labs` is a vector, otherwise its length should be the number of columns of `labs`). Ignored unless `labs` is also given. As these are variance proportions, each value must be non-negative and `sum(labs_sigma_sq) + herit <= 1` is required so residual variance is non-negative.
`maf_cut`	The optional minimum allele frequency threshold (default `NA`, no threshold). This prevents rare alleles from being causal in the simulation. Threshold is applied to the sample allele frequencies and not their true parametric values (`p_anc`), even if these are available. Ignored if `causal_indexes` is provided.
`mac_cut`	The optional minimum allele count threshold (default `NA`, no threshold). If both this and `maf_cut` are provided, both thresholds are applied (they are not redundant under missingness). Ignored if `causal_indexes` is provided.
`loci_on_cols`	If `TRUE`, `X` has loci on columns and individuals on rows; if `FALSE` (the default), loci are on rows and individuals on columns. If `X` is a BEDMatrix object, loci are always on the columns (`loci_on_cols` is ignored).
`m_chunk_max`	BEDMatrix-specific, sets the maximum number of loci to process at the time. If memory usage is excessive, set to a lower value than default (expected only for extremely large numbers of individuals).
`fes`	If `TRUE`, causal coefficients are inversely proportional to the square root of `p_anc * ( 1 - p_anc )` (estimated when `p_anc` is unavailable), which ensures fixed effect sizes (FES) per causal locus. Signs (+/-) are drawn randomly with equal probability. If `FALSE` (the default), random coefficients (RC) are drawn from a standard Normal distribution. In both cases coefficients are rescaled to result in the desired heritability.
`fes_kinship_method`	String specifying the bias correction method for the case when `fes = TRUE` and `kinship` is provided while `p_anc` is absent (option is ignored otherwise). `original` corresponds to a simple formula with substantial bias. `mle` replaces sample allele frequencies with maximum likelihood estimates assuming a Beta distribution with a known variance scale given by `kinship` (see `p_anc_est_beta_mle()`). `bayesian` calculates the expectation over the posterior of a desired inverse variance term, assuming the same Beta distribution as the `mle` option (see `inv_var_est_bayesian()`).
`causal_indexes`	If provided, will use the loci at these indexes as causal loci. When `NULL` (default), causal indexes are drawn randomly. Thus, parameters `m_loci` and `maf_cut` are ignored and have no effect if this parameter is set.
`old_center_scale`	If `TRUE` (default `FALSE`), simulates traits as its more frequently done in the literature, namely genotypes are centered and scaled ("standardized") and coefficients are drawn from a normal distribution with mean zero and variance equal to the heritability times `sigma_sq` divided by the number of causal loci. True ancestral allele frequencies `p_anc` are used if available, otherwise sample estimates are used. This mode is for comparisons only, it is not recommended as it can result in mispecified heritabilities, particularly when true ancestral allele frequencies are not available.
`old_sample_var`	If `TRUE` (default `FALSE`), `old_center_scale = TRUE` is required, in which case sample variance estimates are used to scale the genotypes. Otherwise, if `old_sample_var = FALSE` and `old_center_scale = TRUE`, genotype variance is calculated using the genetics-motivated formula `2 * p_anc * ( 1 - p_anc )` or the sample frequency plug-in estimate if `p_anc` is not available.

Details

To center and scale the trait and locus coefficients vector correctly to the desired parameters (mean, variance, heritability), the parametric ancestral allele frequencies (p_anc) must be known. This is necessary since in the heritability model the genotypes are random variables (with means given by p_anc and a covariance structure given by p_anc and the kinship matrix), so these genotype distribution parameters are required. If p_anc are known (true for simulated genotypes), then the trait will have the specified mean and covariance matrix in agreement with cov_trait(). To simulate traits using real genotypes, where p_anc is unknown, a compromise that works well in practice is possible if the mean kinship is known (see package vignette). We recommend estimating the mean kinship using the popkin package!

Value

A named list containing:

trait: length-n vector of the simulated trait
causal_indexes: length-m_causal vector of causal locus indexes
causal_coeffs: length-m_causal vector of coefficients at the causal loci. They are in the scale of the given genotypes except if old_center_scale = TRUE, in which case they are in the scale of standardized genotypes.
group_effects: length-n vector of simulated environment group effects, or 0 (scalar) if not simulated

However, if herit = 0 then causal_indexes and causal_coeffs will have zero length regardless of m_causal.

Examples

# construct a dummy genotype matrix
X <- matrix(
    data = c(
        0, 1, 2,
        1, 2, 1,
        0, 0, 1
    ),
    nrow = 3,
    byrow = TRUE
)
# made up ancestral allele frequency vector for example
p_anc <- c(0.5, 0.6, 0.2)
# made up mean kinship
kinship <- 0.2
# desired heritability
herit <- 0.8

# create simulated trait and associated data
# default is *random coefficients* (RC) model
obj <- sim_trait(X = X, m_causal = 2, herit = herit, p_anc = p_anc)

# trait vector
obj$trait
# randomly-picked causal locus indexes
obj$causal_indexes
# regression coefficients vector
obj$causal_coeffs

# *fixed effect sizes* (FES) model
obj <- sim_trait(X = X, m_causal = 2, herit = herit, p_anc = p_anc, fes = TRUE)

# either model, can apply to real data by replacing `p_anc` with `kinship`
obj <- sim_trait(X = X, m_causal = 2, herit = herit, kinship = kinship)

# construct a dummy genotype matrix
X <- matrix(
    data = c(
        0, 1, 2,
        1, 2, 1,
        0, 0, 1
    ),
    nrow = 3,
    byrow = TRUE
)
# made up ancestral allele frequency vector for example
p_anc <- c(0.5, 0.6, 0.2)
# made up mean kinship
kinship <- 0.2
# desired heritability
herit <- 0.8

# create simulated trait and associated data
# default is *random coefficients* (RC) model
obj <- sim_trait(X = X, m_causal = 2, herit = herit, p_anc = p_anc)

# trait vector
obj$trait
# randomly-picked causal locus indexes
obj$causal_indexes
# regression coefficients vector
obj$causal_coeffs

# *fixed effect sizes* (FES) model
obj <- sim_trait(X = X, m_causal = 2, herit = herit, p_anc = p_anc, fes = TRUE)

# either model, can apply to real data by replacing `p_anc` with `kinship`
obj <- sim_trait(X = X, m_causal = 2, herit = herit, kinship = kinship)

Simulate traits from a kinship matrix under the infinitesimal model

Description

Simulate matrix of trait replicates given a kinship matrix and model parameters (the desired heritability, group effects, total variance scale, and mean). Although these traits have the covariance structure of genetic traits, and have heritabilities that can be estimated, they do not have causal loci (an association test against any locus should fail). Below n is the number of individuals.

Usage

sim_trait_mvn(
  rep,
  kinship,
  herit,
  mu = 0,
  sigma_sq = 1,
  labs = NULL,
  labs_sigma_sq = NULL,
  tol = 1e-06
)
sim_trait_mvn(
  rep,
  kinship,
  herit,
  mu = 0,
  sigma_sq = 1,
  labs = NULL,
  labs_sigma_sq = NULL,
  tol = 1e-06
)

Arguments

`rep`	The number of replicate traits to simulate. Simulating all you need at once is more efficient than simulating each separately (the kinship matrix is eigendecomposed once per run, shared across replicates).
`kinship`	The `n`-by-`n` kinship matrix of the individuals to simulate from.
`herit`	The desired heritability (proportion of trait variance due to genetics).
`mu`	The desired parametric mean value of the trait (scalar, default 0).
`sigma_sq`	The desired parametric variance factor of the trait (scalar, default 1). Corresponds to the variance of an outbred individual.
`labs`	Optional labels assigning individuals to groups, to simulate environment group effects. Values can be numeric or strings, simply assigning the same values to individuals in the same group. If vector (single environment), length must be number of individuals. If matrix (multiple environments), individuals must be along rows, and environments along columns. The environments are not required to be nested. If this is non-`NULL`, then `labs_sigma_sq` must also be given!
`labs_sigma_sq`	Optional vector of environment effect variance proportions, one value for each environment given in `labs` (a scalar if `labs` is a vector, otherwise its length should be the number of columns of `labs`). Ignored unless `labs` is also given. As these are variance proportions, each value must be non-negative and `sum(labs_sigma_sq) + herit <= 1` is required so residual variance is non-negative.
`tol`	Tolerance factor for an internal test of positive semi-definiteness of the trait covariance matrix. Procedure fails if any eigenvalues are smaller than `-tol` times the absolute value of the largest eigenvalue. Increase this value only if you are getting errors but you're sure your covariance matrix (the output of `cov_trait()`) is positive semi-definite.

Value

A rep-by-n matrix containing the simulated traits along the rows, individuals along the columns.

Examples

# create a dummy kinship matrix
# make sure it is positive definite!
kinship <- matrix(
    data = c(
        0.6, 0.1, 0.0,
        0.1, 0.5, 0.0,
        0.0, 0.0, 0.5
    ),
    nrow = 3
)
# draw simulated traits (matrix)
traits <- sim_trait_mvn( rep = 10, kinship = kinship, herit = 0.8 )
traits

# create a dummy kinship matrix
# make sure it is positive definite!
kinship <- matrix(
    data = c(
        0.6, 0.1, 0.0,
        0.1, 0.5, 0.0,
        0.0, 0.0, 0.5
    ),
    nrow = 3
)
# draw simulated traits (matrix)
traits <- sim_trait_mvn( rep = 10, kinship = kinship, herit = 0.8 )
traits

simtrait: simulate complex traits from genotypes

Description

This package enables simulation of complex (polygenic and continuous) traits from a simulated or real genotype matrix. The focus is on constructing the mean and covariance structure of the data to yield the desired heritability. The main function is sim_trait(), which returns the simulated trait and the vector of causal loci (randomly selected) and their coefficients. The causal coefficients are constructed under two models: random coefficients (RC) and fixed effect sizes (FES). The function cov_trait() computes the expected covariance matrix of the trait given the model parameters (namely the desired heritability and the true kinship matrix). Infinitesimal traits (without causal loci) can also be simulated using sim_trait_mvn().

Details

Package also provides some functions for evaluating genetic association approaches. pval_srmsd() and pval_infl() quantify null p-value accuracy, while pval_aucpr() quantifies predictive power.

The recommended inputs are simulated genotypes with known ancestral allele frequencies. The bnpsd package simulates genotypes for admixed individuals, resulting in a complex population structure.

For real data it is necessary to estimate the kinship matrix. popkin::popkin()' provides high-accuracy kinship estimates.

Author(s)

Maintainer: Alejandro Ochoa [email protected] (ORCID)

Authors:

Grace E. Rhodes [email protected] (ORCID)

Examples

# construct a dummy genotype matrix
X <- matrix(
    data = c(
        0, 1, 2,
        1, 2, 1,
        0, 0, 1
    ),
    nrow = 3,
    byrow = TRUE
)
# made up ancestral allele frequency vector for example
p_anc <- c(0.5, 0.6, 0.2)
# desired heritability
herit <- 0.8
# create a dummy kinship matrix for example
# make sure it is positive definite!
kinship <- matrix(
    data = c(
        0.6, 0.1, 0.0,
        0.1, 0.5, 0.0,
        0.0, 0.0, 0.5
    ),
    nrow = 3
)

# create simulated trait and associated data
# default is *random coefficients* (RC) model
obj <- sim_trait(X = X, m_causal = 2, herit = herit, p_anc = p_anc)
# trait vector
obj$trait
# randomly-picked causal locus indeces
obj$causal_indexes
# regression coefficient vector
obj$causal_coeffs

# *fixed effect sizes* (FES) model
obj <- sim_trait(X = X, m_causal = 2, herit = herit, p_anc = p_anc, fes = TRUE)

# either model, can apply to real data by replacing `p_anc` with `kinship`
obj <- sim_trait(X = X, m_causal = 2, herit = herit, kinship = kinship)

# covariance of simulated traits
V <- cov_trait(kinship = kinship, herit = herit)

# draw simulated traits (matrix of replicates) from infinitesimal model
traits <- sim_trait_mvn( rep = 10, kinship = kinship, herit = herit )
traits

# Metrics for genetic association approaches

# simulate truly null p-values, which should be uniform
pvals <- runif(10)
# for toy example, take these p-value to be truly causal
causal_indexes <- c(1, 5, 7)

# calculate desired measures
# this one quantifies p-value uniformity
pval_srmsd( pvals, causal_indexes )
# related, calculates inflation factors
pval_infl( pvals )
# this one quantifies predictive power
pval_aucpr( pvals, causal_indexes )


# construct a dummy genotype matrix
X <- matrix(
    data = c(
        0, 1, 2,
        1, 2, 1,
        0, 0, 1
    ),
    nrow = 3,
    byrow = TRUE
)
# made up ancestral allele frequency vector for example
p_anc <- c(0.5, 0.6, 0.2)
# desired heritability
herit <- 0.8
# create a dummy kinship matrix for example
# make sure it is positive definite!
kinship <- matrix(
    data = c(
        0.6, 0.1, 0.0,
        0.1, 0.5, 0.0,
        0.0, 0.0, 0.5
    ),
    nrow = 3
)

# create simulated trait and associated data
# default is *random coefficients* (RC) model
obj <- sim_trait(X = X, m_causal = 2, herit = herit, p_anc = p_anc)
# trait vector
obj$trait
# randomly-picked causal locus indeces
obj$causal_indexes
# regression coefficient vector
obj$causal_coeffs

# *fixed effect sizes* (FES) model
obj <- sim_trait(X = X, m_causal = 2, herit = herit, p_anc = p_anc, fes = TRUE)

# either model, can apply to real data by replacing `p_anc` with `kinship`
obj <- sim_trait(X = X, m_causal = 2, herit = herit, kinship = kinship)

# covariance of simulated traits
V <- cov_trait(kinship = kinship, herit = herit)

# draw simulated traits (matrix of replicates) from infinitesimal model
traits <- sim_trait_mvn( rep = 10, kinship = kinship, herit = herit )
traits

# Metrics for genetic association approaches

# simulate truly null p-values, which should be uniform
pvals <- runif(10)
# for toy example, take these p-value to be truly causal
causal_indexes <- c(1, 5, 7)

# calculate desired measures
# this one quantifies p-value uniformity
pval_srmsd( pvals, causal_indexes )
# related, calculates inflation factors
pval_infl( pvals )
# this one quantifies predictive power
pval_aucpr( pvals, causal_indexes )

Package 'simtrait'

Help Index

Compute locus allele frequencies

Description

Usage

Arguments

Value

Examples

The model covariance matrix of the trait

Description

Usage

Arguments

Value

See Also

Examples

Per-locus heritability contribution from allele frequency and causal coefficient

Description

Usage

Arguments

Value

See Also

Examples

Calculate expectations of inverse variance terms over the posterior distribution of the ancestral allele frequency

Description

Usage

Arguments

Value

See Also

Examples

Calculate maximum likelihood estimates of allele frequencies from Beta model

Description

Usage

Arguments

Value

See Also

Examples

Area under the precision-recall curve

Description

Usage

Arguments

Value

See Also

Examples

Apply genomic control transformation to a list of p-values

Description

Usage

Arguments

Value

See Also

Examples

Calculate inflation factor from p-values

Description

Usage

Arguments

Value

See Also

Examples

Estimate calibrated power

Description

Usage

Arguments

Value

See Also

Examples

Signed RMSD measure of null p-value uniformity

Description

Usage

Arguments

Value

See Also

Examples

Estimate type I error rate

Description

Usage

Arguments

Value

See Also

Examples

Root mean square deviation

Description