Title: | Simulate Complex Traits from Genotypes |
---|---|
Description: | Simulate complex traits given a SNP genotype matrix and model parameters (the desired heritability, optional environment group effects, number of causal loci, and either the true ancestral allele frequencies used to generate the genotypes or the mean kinship for a real dataset). Emphasis is on avoiding common biases due to the use of estimated allele frequencies. The code selects random loci to be causal, constructs coefficients for these loci and random independent non-genetic effects, and can optionally generate random group effects. Traits can follow three models: random coefficients, fixed effect sizes, and infinitesimal (multivariate normal). GWAS method benchmarking functions are also provided. Described in Yao and Ochoa (2023) <doi:10.7554/eLife.79238>. |
Authors: | Alejandro Ochoa [aut, cre] , Grace Rhodes [aut] |
Maintainer: | Alejandro Ochoa <[email protected]> |
License: | GPL-3 |
Version: | 1.1.7.9000 |
Built: | 2024-11-20 05:24:21 UTC |
Source: | https://github.com/ochoalab/simtrait |
On a regular matrix, this is essentially a wrapper for colMeans()
or rowMeans()
depending on loci_on_cols
.
On a BEDMatrix object, the locus allele frequencies are computed keeping memory usage low.
allele_freqs( X, loci_on_cols = FALSE, fold = FALSE, m_chunk_max = 1000, subset_ind = NULL, want_counts = FALSE )
allele_freqs( X, loci_on_cols = FALSE, fold = FALSE, m_chunk_max = 1000, subset_ind = NULL, want_counts = FALSE )
X |
The genotype matrix (regular R matrix or BEDMatrix object). Missing values are ignored in averages. |
loci_on_cols |
If |
fold |
If |
m_chunk_max |
BEDMatrix-specific, sets the maximum number of loci to process at the time. If memory usage is excessive, set to a lower value than default (expected only for extremely large numbers of individuals). |
subset_ind |
Optionally subset individuals by providing their indexes (negative indexes to exclude) or a boolean vector (in other words, the usual ways to subset matrices). Most useful for BEDMatrix inputs, to subset chunks and retain low memory usage. |
want_counts |
If |
If want_counts = FALSE
, the vector of estimated ancestral allele frequencies, one per locus.
Names are set to the locus names, if present.
If want_counts = TRUE
, a named list containing both the estimated ancestral allele frequencies p_anc_est
and the allele counts
matrix, with loci along the rows (also with names if present), and alleles along the columns.
# Construct toy data X <- matrix( c(0, 1, 2, 1, 0, 1, 1, NA, 2), nrow = 3, byrow = TRUE ) # row means allele_freqs(X) c(1/2, 1/3, 3/4) # row means, in minor allele frequencies allele_freqs(X, fold = TRUE) c(1/2, 1/3, 1/4) # col means allele_freqs(X, loci_on_cols = TRUE) c(1/3, 1/4, 5/6)
# Construct toy data X <- matrix( c(0, 1, 2, 1, 0, 1, 1, NA, 2), nrow = 3, byrow = TRUE ) # row means allele_freqs(X) c(1/2, 1/3, 3/4) # row means, in minor allele frequencies allele_freqs(X, fold = TRUE) c(1/2, 1/3, 1/4) # col means allele_freqs(X, loci_on_cols = TRUE) c(1/3, 1/4, 5/6)
This function returns the expected covariance matrix of trait vectors simulated via sim_trait()
and sim_trait_mvn()
.
Below there are n
individuals.
cov_trait(kinship, herit, sigma_sq = 1, labs = NULL, labs_sigma_sq = NULL)
cov_trait(kinship, herit, sigma_sq = 1, labs = NULL, labs_sigma_sq = NULL)
kinship |
The |
herit |
The desired heritability (proportion of trait variance due to genetics). |
sigma_sq |
The desired parametric variance factor of the trait (scalar, default 1). Corresponds to the variance of an outbred individual. |
labs |
Optional labels assigning individuals to groups, to simulate environment group effects.
Values can be numeric or strings, simply assigning the same values to individuals in the same group.
If vector (single environment), length must be number of individuals.
If matrix (multiple environments), individuals must be along rows, and environments along columns.
The environments are not required to be nested.
If this is non- |
labs_sigma_sq |
Optional vector of environment effect variance proportions, one value for each environment given in |
The n
-by-n
trait covariance matrix, which under no environment effects equals
sigma_sq * ( herit * 2 * kinship + sigma_sq_residual * I )
,
where I
is the n
-by-n
identity matrix and sigma_sq_residual = 1 - herit
.
If there are labels, covariance will include the specified block diagonal effects and sigma_sq_residual = 1 - herit - sum(labs_sigma_sq)
.
# create a dummy kinship matrix kinship <- matrix( data = c( 0.6, 0.1, 0.0, 0.1, 0.6, 0.1, 0.0, 0.1, 0.6 ), nrow = 3, byrow = TRUE ) # covariance of simulated traits V <- cov_trait(kinship = kinship, herit = 0.8)
# create a dummy kinship matrix kinship <- matrix( data = c( 0.6, 0.1, 0.0, 0.1, 0.6, 0.1, 0.0, 0.1, 0.6 ), nrow = 3, byrow = TRUE ) # covariance of simulated traits V <- cov_trait(kinship = kinship, herit = 0.8)
Calculates the vector of per-locus heritability values, with each causal locus i
calculated as
h_i^2 = 2 * p_i * ( 1 - p_i ) * beta_i^2 / sigma_sq
,
where p_i
is the ancestral allele frequency,
beta_i
is the causal coefficient,
and sigma_sq
is the trait variance scale.
These are all assumed to be true parameters (not estimated).
These per-locus heritabilities equal per-locus effect sizes divided by sigma_sq
.
herit_loci(p_anc, causal_coeffs, causal_indexes = NULL, sigma_sq = 1)
herit_loci(p_anc, causal_coeffs, causal_indexes = NULL, sigma_sq = 1)
p_anc |
The ancestral allele frequency vector. |
causal_coeffs |
The vector of causal coefficients. |
causal_indexes |
The optional vector of causal indexes.
If |
sigma_sq |
The parametric variance factor of the trait (default 1). This factor corresponds to the variance of an outbred individual. |
The vector of per-locus heritability contributions.
The sum of these values gives the overall heritability.
This value can be greater than one (or wrong, more generally) if sigma_sq
is misspecified.
sim_trait()
generates random traits by drawing causal loci and their coefficients to fit a desired heritability.
cov_trait()
calculates the covariance structure of the random traits.
# create toy random data m_loci <- 10 # ancestral allele frequencies p_anc <- runif( m_loci ) # causal loci causal_coeffs <- rnorm( m_loci ) / m_loci # resulting heritability contributions vector herit_loci( p_anc, causal_coeffs )
# create toy random data m_loci <- 10 # ancestral allele frequencies p_anc <- runif( m_loci ) # causal loci causal_coeffs <- rnorm( m_loci ) / m_loci # resulting heritability contributions vector herit_loci( p_anc, causal_coeffs )
This function assumes that each sample allele frequency x
in the input has a Beta distribution with mean p
and variance p*(1-p)*kinship, where
pis the true allele frequency and is unknown, while
kinshipis the mean kinship of the sample and is known. Using this assumption, an expectation over the posterior distribution of
pis calculated for what we call an inverse variance term, namely
1 / ( p * ( 1 - p ) )^g, where
p * ( 1 - p )corresponds to the Bernoulli variance, and
gis some desired exponent. Note that this function estimates these posteriors each from a single data point
x'.
inv_var_est_bayesian(p_anc_est, kinship, g)
inv_var_est_bayesian(p_anc_est, kinship, g)
p_anc_est |
A vector of sample allele frequencies (each refered to as |
kinship |
The mean kinship coefficient of the data |
g |
The exponent of the variance term |
A vector of expectations of 1 / ( p * ( 1 - p ) )^g
over the marginalized, unknown true ancestral allelel frequencies p
, weighted by how likely they are in the posterior given each input sample allele frequency.
The length of the vector is the same as the input p_anc_est
, and each element in the output was estimated from each corresponding element in the input only.
p_anc_est_beta_mle()
for another way to get less biased estimates of the ancestral allele frequencies and related terms.
# select a relatively high value for example kinship <- 0.1 # try a grid of values, including edge cases (0,1) m_loci <- 1000 p_anc_est <- 0 : (m_loci - 1) / (m_loci - 1) # calculate the desired estimates! # here g=0.5 estimates `1 / sqrt( p * ( 1 - p ) )` inv_var_est <- inv_var_est_bayesian( p_anc_est, kinship, 0.5 )
# select a relatively high value for example kinship <- 0.1 # try a grid of values, including edge cases (0,1) m_loci <- 1000 p_anc_est <- 0 : (m_loci - 1) / (m_loci - 1) # calculate the desired estimates! # here g=0.5 estimates `1 / sqrt( p * ( 1 - p ) )` inv_var_est <- inv_var_est_bayesian( p_anc_est, kinship, 0.5 )
For each sample allele frequency x
in the input, this function calculates the maximum likelihood estimate of the true allele frequency passuming that the sample was drawn from a Beta distribution with mean
p and variance p*(1-p)*kinship
, where kinship
is the mean kinship of the sample and is known.
Note that, oddly, this function estimates the unknown p
from a single data point x
.
Nevertheless, this procedure results in favorable shrinkage of estimate towards 0.5.
p_anc_est_beta_mle(p_anc_est, kinship)
p_anc_est_beta_mle(p_anc_est, kinship)
p_anc_est |
A vector of sample allele frequencies (each refered to as |
kinship |
The mean kinship coefficient of the data |
A vector of maximum likelihood estimates of allele frequencies.
The length of the vector is the same as the input p_anc_est
, and each element in the output was estimated from each corresponding element in the input only.
inv_var_est_bayesian()
for another way to get unbiased estimates of a specific class of inverse variance terms of interest in this project.
# select a relatively high value for example kinship <- 0.1 # try a grid of values, including edge cases (0,1) m_loci <- 1000 p_anc_est <- 0 : (m_loci - 1) / (m_loci - 1) # calculate the desired estimates! p_anc_mle <- p_anc_est_beta_mle( p_anc_est, kinship ) # notice values tend to be shrunk towards 0.5, # except values near the edges of the range are shrunk less plot( p_anc_est, p_anc_mle ) abline(0,1)
# select a relatively high value for example kinship <- 0.1 # try a grid of values, including edge cases (0,1) m_loci <- 1000 p_anc_est <- 0 : (m_loci - 1) / (m_loci - 1) # calculate the desired estimates! p_anc_mle <- p_anc_est_beta_mle( p_anc_est, kinship ) # notice values tend to be shrunk towards 0.5, # except values near the edges of the range are shrunk less plot( p_anc_est, p_anc_mle ) abline(0,1)
Calculates the Precision-Recall (PR) Area Under the Curve (AUC) given a vector of p-values and the true classes (causal (alternative) vs non-causal (null)).
This is a wrapper around PRROC::pr.curve()
, which actually calculates the AUC (see that for details).
pval_aucpr(pvals, causal_indexes, curve = FALSE)
pval_aucpr(pvals, causal_indexes, curve = FALSE)
pvals |
The vector of association p-values to analyze.
|
causal_indexes |
The vector of causal indexes, defining the true classes used for AUC calculation.
Values of |
curve |
If |
If curve = FALSE
, returns the PR AUC scalar value.
If curve = TRUE
, returns the PRROC
object as returned by PRROC::pr.curve()
, which can be plotted directly, and which contains the AUC under the named value auc.integral
.
However, if the input pvals
is NULL
(taken for case of singular association test, which is rare but may happen), then the returned value is NA
.
PRROC::pr.curve()
, which is used internally by this function.
pval_power_calib()
for calibrated power estimates.
# simulate truly null p-values, which should be uniform pvals <- runif(10) # for toy example, take the first two p-values to be truly causal causal_indexes <- 1:2 # calculate desired measure pval_aucpr( pvals, causal_indexes )
# simulate truly null p-values, which should be uniform pvals <- runif(10) # for toy example, take the first two p-values to be truly causal causal_indexes <- 1:2 # calculate desired measure pval_aucpr( pvals, causal_indexes )
The inflation factor is defined as the median association test statistic divided by the expected median under the null hypothesis, which is typically assumed to have a chi-squared distribution.
This function takes a p-value distribution and maps its median back to the chi-squared value (using the quantile function) in order to compute the inflation factor in the chi-squared scale.
The full p-value distribution (a mix of null and alternative cases) is used to calculate the desired median value (the true causal_loci
is not needed, unlike pval_srmsd()
).
pval_infl(pvals, df = 1)
pval_infl(pvals, df = 1)
pvals |
The vector of association p-values to analyze.
This function assumes all p-values are provided (a mix of null and alternative tests).
|
df |
The degrees of freedom of the assumed chi-squared distribution (default 1). |
The inflation factor
pval_srmsd()
, a more robust measure of null p-value accuracy, but which requires knowing the true causal loci.
pval_type_1_err()
for classical type I error rate estimates.
# simulate truly null p-values, which should be uniform pvals <- runif(10) # calculate desired measure pval_infl( pvals )
# simulate truly null p-values, which should be uniform pvals <- runif(10) # calculate desired measure pval_infl( pvals )
Given a significance level alpha
and p-values with known causal status, this function estimates the calibrated power.
First it estimates the p-value threshold at which the desired type I error of alpha
is achieved, then it uses this p-value threshold (not alpha
) to estimate statistical power.
Note that these simple empirical estimates are likely to be inaccurate unless the number of p-values is much larger than 1/alpha
.
pval_power_calib(pvals, causal_indexes, alpha = 0.05)
pval_power_calib(pvals, causal_indexes, alpha = 0.05)
pvals |
The vector of association p-values to analyze.
This function assumes all p-values are provided (a mix of null and alternative tests).
|
causal_indexes |
The vector of causal indexes, defining the true classes used for calibrated power estimation.
Values of |
alpha |
The desired significance level (default 0.05). May be a vector. |
The calibrated power estimates at each alpha
pval_aucpr()
, a robust proxy for calibrated power that integrates across significance thresholds.
# simulate truly null p-values, which should be uniform pvals <- runif(10) # for toy example, take the first two p-values to be truly causal causal_indexes <- 1:2 # estimate desired measure pval_power_calib( pvals, causal_indexes )
# simulate truly null p-values, which should be uniform pvals <- runif(10) # for toy example, take the first two p-values to be truly causal causal_indexes <- 1:2 # estimate desired measure pval_power_calib( pvals, causal_indexes )
Quantifies null p-value uniformity by computing the RMSD (root mean square deviation) between the sorted observed null (truly non-causal) p-values and their expected quantiles under a uniform distribution.
Meant as a more robust alternative to the "inflation factor" common in the GWAS literature, which compares median values only and uses all p-values (not just null p-values).
Our signed RMSD, to correspond with the inflation factor, includes a sign that depends on the median null p-value:
positive if this median is <= 0.5
(corresponds with test statistic inflation), negative otherwise (test statistic deflation).
Zero corresponds to uniform null p-values, which arises in expectation only if test statistics have their assumed null distribution (there is no misspecification, including inflation).
pval_srmsd(pvals, causal_indexes, detailed = FALSE)
pval_srmsd(pvals, causal_indexes, detailed = FALSE)
pvals |
The vector of association p-values to analyze.
This function assumes all p-values are provided (a mix of null and alternative tests).
|
causal_indexes |
The vector of causal indexes, whose p-values will be omitted.
Values of |
detailed |
If |
If detailed
is FALSE
, returns the signed RMSD between the observed p-value order statistics and their expectation under true uniformity.
If detailed
is TRUE
, returns data useful for plots, a named list containing:
srmsd
: The signed RMSD between the observed p-value order statistics and their expectation under true uniformity.
pvals_null
: Sorted null p-values (observed order statistics). If any input null p-values were NA
, these have been removed here (removed by sort()
).
pvals_unif
: Expected order statistics assuming uniform distribution, same length as pvals_null
.
If the input pvals
is NULL
(taken for case of singular association test, which is rare but may happen), then the returned value is NA
if detailed
was FALSE
, or otherwise the list contains NA
, NULL
and NULL
for the above three items.
rmsd()
for the generic root-mean-square deviation function.
pval_infl()
for the more traditional inflation factor, which focuses on the median of the full distribution (combination of causal and null cases).
pval_type_1_err()
for classical type I error rate estimates.
# simulate truly null p-values, which should be uniform pvals <- runif(10) # for toy example, take the first p-value to be truly causal (will be ignored below) causal_indexes <- 1 # calculate desired measure pval_srmsd( pvals, causal_indexes )
# simulate truly null p-values, which should be uniform pvals <- runif(10) # for toy example, take the first p-value to be truly causal (will be ignored below) causal_indexes <- 1 # calculate desired measure pval_srmsd( pvals, causal_indexes )
Given a significance level and p-values with known causal status, this function estimates the type I error rate, defined as the proportion of null p-values that are below or equal to the threshold.
Note that these simple empirical estimates are likely to be zero unless the number of p-values is much larger than 1/alpha
.
pval_type_1_err(pvals, causal_indexes, alpha = 0.05)
pval_type_1_err(pvals, causal_indexes, alpha = 0.05)
pvals |
The vector of association p-values to analyze.
This function assumes all p-values are provided (a mix of null and alternative tests).
|
causal_indexes |
The vector of causal indexes, whose p-values will be omitted.
Values of |
alpha |
The desired significance level (default 0.05). May be a vector. |
The type I error rate estimates at each alpha
pval_srmsd()
to directly quantify null p-value uniformity, a more robust alternative to type I error rate.
pval_infl()
for the more traditional inflation factor, which focuses on the median of the full distribution (combination of causal and null cases).
# simulate truly null p-values, which should be uniform pvals <- runif(10) # for toy example, take the first p-value to be truly causal (will be ignored below) causal_indexes <- 1 # estimate desired measure pval_type_1_err( pvals, causal_indexes )
# simulate truly null p-values, which should be uniform pvals <- runif(10) # for toy example, take the first p-value to be truly causal (will be ignored below) causal_indexes <- 1 # estimate desired measure pval_type_1_err( pvals, causal_indexes )
Calculates the euclidean distance between two vectors x
and y
divided by the square root of the lengths of the vectors.
NA
values are ignored by default when calculating the mean squares (so the denominator is the number of non-NA
differences).
rmsd(x, y, na.rm = TRUE)
rmsd(x, y, na.rm = TRUE)
x |
The first vector to compare (required). |
y |
The second vector to compare (required).
Lengths of |
na.rm |
If |
the square root of the mean square difference between x
and y
, after removing NA
comparisons (cases where either is NA
).
x <- rnorm(10) y <- rnorm(10) rmsd( x, y )
x <- rnorm(10) y <- rnorm(10) rmsd( x, y )
Simulate a complex trait given a SNP genotype matrix and model parameters, which are minimally: the number of causal loci, the heritability, and either the true ancestral allele frequencies used to generate the genotypes or the mean kinship of all individuals.
An optional minimum marginal allele frequency for the causal loci can be set.
The output traits have by default a zero mean and unit variance (for outbred individuals), but those parameters can be modified.
The code selects random loci to be causal, constructs coefficients for these loci (scaled appropriately) and random Normal independent non-genetic effects and random environment group effects if specified.
There are two models for constructing causal coefficients: random coefficients (RC; default) and fixed effect sizes (FES; i.e., coefficients roughly inversely proportional to allele frequency; use fes = TRUE
).
Suppose there are m
loci and n
individuals.
sim_trait( X, m_causal, herit, p_anc = NULL, kinship = NULL, mu = 0, sigma_sq = 1, labs = NULL, labs_sigma_sq = NULL, maf_cut = NA, mac_cut = NA, loci_on_cols = FALSE, m_chunk_max = 1000, fes = FALSE, fes_kinship_method = c("original", "mle", "bayesian"), causal_indexes = NULL )
sim_trait( X, m_causal, herit, p_anc = NULL, kinship = NULL, mu = 0, sigma_sq = 1, labs = NULL, labs_sigma_sq = NULL, maf_cut = NA, mac_cut = NA, loci_on_cols = FALSE, m_chunk_max = 1000, fes = FALSE, fes_kinship_method = c("original", "mle", "bayesian"), causal_indexes = NULL )
X |
The |
m_causal |
The desired number of causal loci.
Ignored if |
herit |
The desired heritability (proportion of trait variance due to genetics). |
p_anc |
The length- |
kinship |
The mean kinship value of the individuals in the data.
The |
mu |
The desired parametric mean value of the trait (scalar, default 0). |
sigma_sq |
The desired parametric variance factor of the trait (scalar, default 1). Corresponds to the variance of an outbred individual. |
labs |
Optional labels assigning individuals to groups, to simulate environment group effects.
Values can be numeric or strings, simply assigning the same values to individuals in the same group.
If vector (single environment), length must be number of individuals.
If matrix (multiple environments), individuals must be along rows, and environments along columns.
The environments are not required to be nested.
If this is non- |
labs_sigma_sq |
Optional vector of environment effect variance proportions, one value for each environment given in |
maf_cut |
The optional minimum allele frequency threshold (default |
mac_cut |
The optional minimum allele count threshold (default |
loci_on_cols |
If |
m_chunk_max |
BEDMatrix-specific, sets the maximum number of loci to process at the time. If memory usage is excessive, set to a lower value than default (expected only for extremely large numbers of individuals). |
fes |
If |
fes_kinship_method |
String specifying the bias correction method for the case when |
causal_indexes |
If provided, will use the loci at these indexes as causal loci.
When |
To center and scale the trait and locus coefficients vector correctly to the desired parameters (mean, variance, heritability), the parametric ancestral allele frequencies (p_anc
) must be known.
This is necessary since in the heritability model the genotypes are random variables (with means given by p_anc
and a covariance structure given by p_anc
and the kinship matrix), so these genotype distribution parameters are required.
If p_anc
are known (true for simulated genotypes), then the trait will have the specified mean and covariance matrix in agreement with cov_trait()
.
To simulate traits using real genotypes, where p_anc
is unknown, a compromise that works well in practice is possible if the mean kinship
is known (see package vignette).
We recommend estimating the mean kinship using the popkin
package!
A named list containing:
trait
: length-n
vector of the simulated trait
causal_indexes
: length-m_causal
vector of causal locus indexes
causal_coeffs
: length-m_causal
vector of coefficients at the causal loci
group_effects
: length-n
vector of simulated environment group effects, or 0 (scalar) if not simulated
However, if herit = 0
then causal_indexes
and causal_coeffs
will have zero length regardless of m_causal
.
# construct a dummy genotype matrix X <- matrix( data = c( 0, 1, 2, 1, 2, 1, 0, 0, 1 ), nrow = 3, byrow = TRUE ) # made up ancestral allele frequency vector for example p_anc <- c(0.5, 0.6, 0.2) # made up mean kinship kinship <- 0.2 # desired heritability herit <- 0.8 # create simulated trait and associated data # default is *random coefficients* (RC) model obj <- sim_trait(X = X, m_causal = 2, herit = herit, p_anc = p_anc) # trait vector obj$trait # randomly-picked causal locus indexes obj$causal_indexes # regression coefficients vector obj$causal_coeffs # *fixed effect sizes* (FES) model obj <- sim_trait(X = X, m_causal = 2, herit = herit, p_anc = p_anc, fes = TRUE) # either model, can apply to real data by replacing `p_anc` with `kinship` obj <- sim_trait(X = X, m_causal = 2, herit = herit, kinship = kinship)
# construct a dummy genotype matrix X <- matrix( data = c( 0, 1, 2, 1, 2, 1, 0, 0, 1 ), nrow = 3, byrow = TRUE ) # made up ancestral allele frequency vector for example p_anc <- c(0.5, 0.6, 0.2) # made up mean kinship kinship <- 0.2 # desired heritability herit <- 0.8 # create simulated trait and associated data # default is *random coefficients* (RC) model obj <- sim_trait(X = X, m_causal = 2, herit = herit, p_anc = p_anc) # trait vector obj$trait # randomly-picked causal locus indexes obj$causal_indexes # regression coefficients vector obj$causal_coeffs # *fixed effect sizes* (FES) model obj <- sim_trait(X = X, m_causal = 2, herit = herit, p_anc = p_anc, fes = TRUE) # either model, can apply to real data by replacing `p_anc` with `kinship` obj <- sim_trait(X = X, m_causal = 2, herit = herit, kinship = kinship)
Simulate matrix of trait replicates given a kinship matrix and model parameters (the desired heritability, group effects, total variance scale, and mean).
Although these traits have the covariance structure of genetic traits, and have heritabilities that can be estimated, they do not have causal loci (an association test against any locus should fail).
Below n
is the number of individuals.
sim_trait_mvn( rep, kinship, herit, mu = 0, sigma_sq = 1, labs = NULL, labs_sigma_sq = NULL, tol = 1e-06 )
sim_trait_mvn( rep, kinship, herit, mu = 0, sigma_sq = 1, labs = NULL, labs_sigma_sq = NULL, tol = 1e-06 )
rep |
The number of replicate traits to simulate. Simulating all you need at once is more efficient than simulating each separately (the kinship matrix is eigendecomposed once per run, shared across replicates). |
kinship |
The |
herit |
The desired heritability (proportion of trait variance due to genetics). |
mu |
The desired parametric mean value of the trait (scalar, default 0). |
sigma_sq |
The desired parametric variance factor of the trait (scalar, default 1). Corresponds to the variance of an outbred individual. |
labs |
Optional labels assigning individuals to groups, to simulate environment group effects.
Values can be numeric or strings, simply assigning the same values to individuals in the same group.
If vector (single environment), length must be number of individuals.
If matrix (multiple environments), individuals must be along rows, and environments along columns.
The environments are not required to be nested.
If this is non- |
labs_sigma_sq |
Optional vector of environment effect variance proportions, one value for each environment given in |
tol |
Tolerance factor for an internal test of positive semi-definiteness of the trait covariance matrix.
Procedure fails if any eigenvalues are smaller than |
A rep
-by-n
matrix containing the simulated traits along the rows, individuals along the columns.
# create a dummy kinship matrix # make sure it is positive definite! kinship <- matrix( data = c( 0.6, 0.1, 0.0, 0.1, 0.5, 0.0, 0.0, 0.0, 0.5 ), nrow = 3 ) # draw simulated traits (matrix) traits <- sim_trait_mvn( rep = 10, kinship = kinship, herit = 0.8 ) traits
# create a dummy kinship matrix # make sure it is positive definite! kinship <- matrix( data = c( 0.6, 0.1, 0.0, 0.1, 0.5, 0.0, 0.0, 0.0, 0.5 ), nrow = 3 ) # draw simulated traits (matrix) traits <- sim_trait_mvn( rep = 10, kinship = kinship, herit = 0.8 ) traits
This package enables simulation of complex (polygenic and continuous) traits from a simulated or real genotype matrix.
The focus is on constructing the mean and covariance structure of the data to yield the desired heritability.
The main function is sim_trait()
, which returns the simulated trait and the vector of causal loci (randomly selected) and their coefficients.
The causal coefficients are constructed under two models: random coefficients (RC) and fixed effect sizes (FES).
The function cov_trait()
computes the expected covariance matrix of the trait given the model parameters (namely the desired heritability and the true kinship matrix).
Infinitesimal traits (without causal loci) can also be simulated using sim_trait_mvn()
.
Package also provides some functions for evaluating genetic association approaches.
pval_srmsd()
and pval_infl()
quantify null p-value accuracy, while pval_aucpr()
quantifies predictive power.
The recommended inputs are simulated genotypes with known ancestral allele frequencies.
The bnpsd
package simulates genotypes for admixed individuals, resulting in a complex population structure.
For real data it is necessary to estimate the kinship matrix.
popkin::popkin()
' provides high-accuracy kinship estimates.
Maintainer: Alejandro Ochoa [email protected] (ORCID)
Authors:
Grace Rhodes
Useful links:
# construct a dummy genotype matrix X <- matrix( data = c( 0, 1, 2, 1, 2, 1, 0, 0, 1 ), nrow = 3, byrow = TRUE ) # made up ancestral allele frequency vector for example p_anc <- c(0.5, 0.6, 0.2) # desired heritability herit <- 0.8 # create a dummy kinship matrix for example # make sure it is positive definite! kinship <- matrix( data = c( 0.6, 0.1, 0.0, 0.1, 0.5, 0.0, 0.0, 0.0, 0.5 ), nrow = 3 ) # create simulated trait and associated data # default is *random coefficients* (RC) model obj <- sim_trait(X = X, m_causal = 2, herit = herit, p_anc = p_anc) # trait vector obj$trait # randomly-picked causal locus indeces obj$causal_indexes # regression coefficient vector obj$causal_coeffs # *fixed effect sizes* (FES) model obj <- sim_trait(X = X, m_causal = 2, herit = herit, p_anc = p_anc, fes = TRUE) # either model, can apply to real data by replacing `p_anc` with `kinship` obj <- sim_trait(X = X, m_causal = 2, herit = herit, kinship = kinship) # covariance of simulated traits V <- cov_trait(kinship = kinship, herit = herit) # draw simulated traits (matrix of replicates) from infinitesimal model traits <- sim_trait_mvn( rep = 10, kinship = kinship, herit = herit ) traits # Metrics for genetic association approaches # simulate truly null p-values, which should be uniform pvals <- runif(10) # for toy example, take these p-value to be truly causal causal_indexes <- c(1, 5, 7) # calculate desired measures # this one quantifies p-value uniformity pval_srmsd( pvals, causal_indexes ) # related, calculates inflation factors pval_infl( pvals ) # this one quantifies predictive power pval_aucpr( pvals, causal_indexes )
# construct a dummy genotype matrix X <- matrix( data = c( 0, 1, 2, 1, 2, 1, 0, 0, 1 ), nrow = 3, byrow = TRUE ) # made up ancestral allele frequency vector for example p_anc <- c(0.5, 0.6, 0.2) # desired heritability herit <- 0.8 # create a dummy kinship matrix for example # make sure it is positive definite! kinship <- matrix( data = c( 0.6, 0.1, 0.0, 0.1, 0.5, 0.0, 0.0, 0.0, 0.5 ), nrow = 3 ) # create simulated trait and associated data # default is *random coefficients* (RC) model obj <- sim_trait(X = X, m_causal = 2, herit = herit, p_anc = p_anc) # trait vector obj$trait # randomly-picked causal locus indeces obj$causal_indexes # regression coefficient vector obj$causal_coeffs # *fixed effect sizes* (FES) model obj <- sim_trait(X = X, m_causal = 2, herit = herit, p_anc = p_anc, fes = TRUE) # either model, can apply to real data by replacing `p_anc` with `kinship` obj <- sim_trait(X = X, m_causal = 2, herit = herit, kinship = kinship) # covariance of simulated traits V <- cov_trait(kinship = kinship, herit = herit) # draw simulated traits (matrix of replicates) from infinitesimal model traits <- sim_trait_mvn( rep = 10, kinship = kinship, herit = herit ) traits # Metrics for genetic association approaches # simulate truly null p-values, which should be uniform pvals <- runif(10) # for toy example, take these p-value to be truly causal causal_indexes <- c(1, 5, 7) # calculate desired measures # this one quantifies p-value uniformity pval_srmsd( pvals, causal_indexes ) # related, calculates inflation factors pval_infl( pvals ) # this one quantifies predictive power pval_aucpr( pvals, causal_indexes )