--- title: "`simtrait`: Simulate Complex Traits from Genotypes" author: "Alejandro Ochoa" date: "`r Sys.Date()`" output: rmarkdown::html_vignette bibliography: simtrait.bib vignette: > %\VignetteIndexEntry{`simtrait`: Simulate Complex Traits from Genotypes} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- $\DeclareMathOperator{\E}{E}$ $\DeclareMathOperator{\Cov}{Cov}$ $\DeclareMathOperator{\Var}{Var}$ ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r, cache = FALSE, include = FALSE} ## copied from examples from the "simmer" R package ## after: https://www.enchufa2.es/archives/suggests-and-vignettes.html ## by Iñaki Úcar required <- c("popkin", "bnpsd") # only suggested since simtrait doesn't need them to run... if (!all(sapply(required, requireNamespace, quietly = TRUE))) knitr::opts_chunk$set(eval = FALSE) ``` # Introduction This vignette has three main parts: 1. Practical examples that show how to use the functions and demonstrate that the random traits generated by this package have the desired covariance structure. 2. The mathematical trait model that motivated this package. 3. The algorithm implementation, which follows straightforwardly from the model when ancestral allele frequencies are known. However, more painful details are necessary when ancestral allele frequencies must be estimated from the genotypes, which induces biases that fortunately can be corrected. # Sample usage In this section we first simulated an admixed population using `bnpsd`, then we simulate traits using `simtrait`. In particular, we simulate a large number of traits to demonstrate that their sample covariance matrix is as expected. ## Load libraries required for this vignette ```{r} library(popkin) # to create plots of our covariance matrices library(bnpsd) # to simulate an admixed population library(simtrait) # this package ``` ## Simulate an admixed population ```{r} # dimensions of data/model # number of loci m_loci <- 10000 # number of individuals, smaller than usual for easier visualizations n_ind <- 30 # number of intermediate subpops (source populations for admixed individuals) k_subpops <- 3 # define population structure # FST values for 3 subpopulations (proportional/unnormalized) inbr_subpops <- 1 : k_subpops bias_coeff <- 0.5 # bias coeff of standard Fst estimator Fst <- 0.3 # desired final Fst obj <- admix_prop_1d_linear( n_ind = n_ind, k_subpops = k_subpops, bias_coeff = bias_coeff, coanc_subpops = inbr_subpops, fst = Fst ) admix_proportions <- obj$admix_proportions # rescaled Fst vector for intermediate subpops inbr_subpops <- obj$coanc_subpops # get pop structure parameters of the admixed individuals coancestry <- coanc_admix(admix_proportions, inbr_subpops) kinship <- coanc_to_kinship(coancestry) # draw allele freqs and genotypes out <- draw_all_admix(admix_proportions, inbr_subpops, m_loci) X <- out$X # genotypes p_anc <- out$p_anc # ancestral AFs ``` ## Simulate a trait from *random coefficients* (RC) model First we simulate one trait. Note that we pick non-default values for the mean `mu` and variance factor `sigma_sq` to validate that these can be anything we want. ```{r} # parameters of simulation m_causal <- 100 herit <- 0.8 # default 0, let's try a non-trivial case mu <- 1 # default 1, also let's see that this more complicated case works well sigma_sq <- 1.5 # create simulated trait # case of exact p_anc obj <- sim_trait( X = X, m_causal = m_causal, herit = herit, p_anc = p_anc, mu = mu, sigma_sq = sigma_sq ) # trait vector length(obj$trait) n_ind obj$trait # randomly-picked causal locus indexes length( obj$causal_indexes ) m_causal head( obj$causal_indexes ) # show partially... # locus coefficients vector length( obj$causal_coeffs ) m_causal head( obj$causal_coeffs ) # show partially... ``` ## Compare sample covariance of trait to theoretical expectation The interesting validation is simulation a large number of random traits, from which we can estimate a sample covariance matrix to compare to the desired theoretical one. We shall compare this to other versions of the simulation. We distinguish this version as having *random coefficients* (RC) and employing true allele frequencies `p_anc`. ```{r} # the theoretical covariance matrix of the trait is calculated by cov_trait V <- cov_trait(kinship = kinship, herit = herit, sigma_sq = sigma_sq) # simulate these many traits n_traits <- 1000 # store in this matrix, initialize with zeroes Y_rc_freq <- matrix(data = 0, nrow = n_traits, ncol = n_ind) # start loop for (i in 1 : n_traits) { obj <- sim_trait( X = X, m_causal = m_causal, herit = herit, p_anc = p_anc, mu = mu, sigma_sq = sigma_sq ) Y_rc_freq[i,] <- obj$trait # store in i^th row } # estimate sample covariance V_rc_freq <- cov(Y_rc_freq) ``` First let's verify that the mean is as expected. Below the red line marks the desired mean. ```{r, fig.width = 3, fig.align = 'center'} par_orig <- par(mgp = c(2, 0.5, 0)) # reduce margins from default par(mar = c(3.5, 3, 0, 0) + 0.2) # visualize distribution boxplot( list( 'RC freq' = rowMeans(Y_rc_freq) ), xlab = "Trait Type", ylab = 'Sample Mean' ) # red line marks expected mean abline(h = mu, col = 'red') par( par_orig ) # reset `par` ``` Now let's visualize the covariance matrices using `plot_popkin` from the `popkin` package. Since both matrices have large diagonals, we shrink them somewhat using `inbr_diag` also from the `popkin` package. ```{r, fig.width = 6, fig.height = 2.8, fig.align = 'center'} plot_popkin( list(V, V_rc_freq), titles = c('Theoretical', 'RC freq'), leg_title = 'Covariance', panel_letters_adj = 0, # set margin for title (top is non-zero) mar = c(0, 2) ) ``` This plot verifies that the empirical covariance matches the theoretical expectation! ## Simulated trait without ancestral allele frequencies For real data, true ancestral allele frequencies are unknown. A reasonable trait can still be simulated in these cases, but this solution no longer has theoretical guarantees to yield the desired mean value in particular. This solution relies on a known mean kinship to compensate for the biases of estimated ancestral allele frequencies. A good kinship matrix estimate can be obtained using the `popkin` package. For simplicity, here we use the *true* kinship matrix rather than an *estimate*: ```{r} # store this in new matrix Y_rc_kin <- matrix(data = 0, nrow = n_traits, ncol = n_ind) # start loop for (i in 1 : n_traits) { obj <- sim_trait( X = X, m_causal = m_causal, herit = herit, # whole kinship matrix can be passed instead of just mean kinship = kinship, mu = mu, sigma_sq = sigma_sq ) Y_rc_kin[i,] <- obj$trait # store in i^th row } # estimate sample covariance V_rc_kin <- cov(Y_rc_kin) ``` First let's verify the means again. Recall the red line marks the desired mean. Below the original sample (simulated using the true `p_anc`) is shown first as "RC freq", while the new sample based on the kinship matrix is "RC kinship": ```{r, fig.width = 3, fig.align = 'center'} par_orig <- par(mgp = c(2, 0.5, 0)) # reduce margins from default par(mar = c(3.5, 3, 0, 0) + 0.2) # visualize distribution boxplot( list( "RC freq" = rowMeans(Y_rc_freq), "RC kinship" = rowMeans(Y_rc_kin) ), xlab = "Trait Type", ylab = 'Sample Mean' ) # red line marks expected mean abline(h = mu, col = 'red') par( par_orig ) # reset `par` ``` Now we compare all three matrices: ```{r, fig.width = 7, fig.height = 2.35, fig.align = 'center'} plot_popkin( list(V, V_rc_freq, V_rc_kin), titles = c('Theoretical', 'RC freq', 'RC kinship'), leg_title = 'Covariance', panel_letters_adj = 0, mar = c(0, 2) ) ``` This plot shows again good agreement between the sample covariance matrix of traits simulated without true ancestral allele frequencies ("RC kinship") and the desired "theoretical" covariance matrix. ## Simulated trait from infinitesimal model An alternative approach for simulating traits is by drawing them from a Multivariate Normal (MVN) model with the desired mean and covariance structure. This is often called the *infinitesimal* model, since it follows from the central limit theorem under the assumption that there are infinite causal loci, each with an infinitesimal effect size. A trait simulated this way has no use in GWAS tests, as there are no causal loci (in other words, the null hypothesis holds across the genome). However, these traits have a heritability that can be estimated, and in fact this infinitesimal model is assumed by approaches that estimate heritability by fitting variance components, such as GCTA [@yang_gcta:_2011]. We draw the MVN traits this way: ```{r} # This function simulates trait replicates in one call, # generating a matrix comparable to the previous ones. Y_mvn <- sim_trait_mvn( rep = n_traits, kinship = kinship, herit = herit, mu = mu, sigma_sq = sigma_sq ) # estimate sample covariance V_mvn <- cov(Y_mvn) ``` First let's verify the means again. Recall the red line marks the desired mean. The new sample is denoted as "MVN", and the other two are as in the previous sections (traits simulated from genotypes, using the true `p_anc` or with bias corrections from the kinship matrix): ```{r, fig.width = 4, fig.align = 'center'} par_orig <- par(mgp = c(2, 0.5, 0)) # reduce margins from default par(mar = c(3.5, 3, 0, 0) + 0.2) # visualize distribution boxplot( list( "RC freq" = rowMeans(Y_rc_freq), "RC kinship" = rowMeans(Y_rc_kin), "MVN" = rowMeans(Y_mvn) ), xlab = "Trait Type", ylab = 'Sample Mean' ) # red line marks expected mean abline(h = mu, col = 'red') par( par_orig ) # reset `par` ``` Now we compare all four covariance matrices: ```{r, fig.width = 7, fig.height = 2, fig.align = 'center'} plot_popkin( list(V, V_rc_freq, V_rc_kin, V_mvn), titles = c('Theoretical', 'RC freq', 'RC kinship', 'MVN'), leg_title = 'Covariance', panel_letters_adj = 0, mar = c(0, 2), leg_width = 0.4 ) ``` This plot shows again good agreement between the sample covariance matrix of traits simulated under the infinitesimal model ("MVN") and the desired "theoretical" covariance matrix and first two simulations. ## Simulated trait from *fixed effect sizes* (FES) model This package can also simulate traits from a model where coefficients are larger for rarer variants, which may be more realistic for disease traits where selection prevents common variants from having large coefficients, while allowing rare variants to have larger coefficients. The effect size of locus $i$ is its variance contribution, equal to $2 \beta^2_i p_i(1-p_i)$ for outbred individuals, where $\beta_i$ is the locus coefficient and $p_i$ is the ancestral allele frequency. The limit of strong negative and other modes of selection forces effect sizes to be equal for all loci, so the coefficient at locus $i$ is proportional to $1 / \sqrt{p_i(1-p_i)}$. As in the other models, the coefficients are rescaled to yield the desired heritability and variance factor. This is related to previous models proposed in the literature, for example [@speed_improved_2012]. This time we simulate both the true allele frequency version, "freq", and the "kinship" version that unbiases sample allele frequencies. ```{r} # store this in new matrix Y_fes_freq <- matrix(data = 0, nrow = n_traits, ncol = n_ind) Y_fes_kin <- matrix(data = 0, nrow = n_traits, ncol = n_ind) # start loop for (i in 1 : n_traits) { obj <- sim_trait( X = X, m_causal = m_causal, herit = herit, p_anc = p_anc, mu = mu, sigma_sq = sigma_sq, fes = TRUE # only diff from orig run ) Y_fes_freq[i,] <- obj$trait # store in i^th row obj <- sim_trait( X = X, m_causal = m_causal, herit = herit, kinship = kinship, mu = mu, sigma_sq = sigma_sq, fes = TRUE # only diff from orig run ) Y_fes_kin[i,] <- obj$trait # store in i^th row } # estimate sample covariance V_fes_freq <- cov(Y_fes_freq) V_fes_kin <- cov(Y_fes_kin) ``` First let's verify the means again. Recall the red line marks the desired mean. The new samples are labeled as "FES freq" and "FES kinship", and the first three are as in the previous sections (traits simulated from genotypes and the *random coefficients* (RC) model, using the true `p_anc` or with bias corrections from the kinship matrix, and the MVN traits): ```{r, fig.width = 6, fig.align = 'center'} par_orig <- par(mgp = c(2, 0.5, 0)) # reduce margins from default par(mar = c(3.5, 3, 0, 0) + 0.2) # visualize distribution boxplot( list( "RC freq" = rowMeans(Y_rc_freq), "RC kinship" = rowMeans(Y_rc_kin), "MVN" = rowMeans(Y_mvn), "FES freq" = rowMeans(Y_fes_freq), "FES kinship" = rowMeans(Y_fes_kin) ), xlab = "Trait Type", ylab = 'Sample Mean' ) # red line marks expected mean abline(h = mu, col = 'red') par( par_orig ) # reset `par` ``` Now we compare all covariance matrices: ```{r, fig.width = 7, fig.height = 4, fig.align = 'center'} plot_popkin( list( V, V_rc_freq, V_rc_kin, V_mvn, V_fes_freq, V_fes_kin ), titles = c('Theoretical', 'RC freq', 'RC kinship', 'MVN', 'FES freq', 'FES kinship'), leg_title = 'Covariance', panel_letters_adj = 0, mar = c(0, 2), leg_width = 0.4, layout_rows = 2 ) ``` These plots show again good agreement between the sample covariance matrix of traits simulated under this "fixed effect sizes" model ("FES freq" and "FES kinship") and the desired "theoretical" covariance matrix and first three simulations. In closing these simulation results, let's make more direct and detailed comparisons between the empirical covariances matrices and the desired theoretical one: ```{r, fig.width = 7, fig.height = 5, fig.align = 'center'} par_orig <- par(mgp = c(2, 0.5, 0)) # create multipanel figure par( mfrow = c(2, 3) ) # reduce margins from default par(mar = c(3.5, 3, 0, 0) + 0.2) plot( V, V_rc_freq, xlab = 'Theoretical Cov', ylab = 'RC freq Cov' ) abline( 0, 1, lty = 2, col = 'gray' ) plot( V, V_rc_kin, xlab = 'Theoretical Cov', ylab = 'RC kinship Cov' ) abline( 0, 1, lty = 2, col = 'gray' ) plot( V, V_mvn, xlab = 'Theoretical Cov', ylab = 'MVN Cov' ) abline( 0, 1, lty = 2, col = 'gray' ) plot( V, V_fes_freq, xlab = 'Theoretical Cov', ylab = 'FES freq Cov' ) abline( 0, 1, lty = 2, col = 'gray' ) plot( V, V_fes_kin, xlab = 'Theoretical Cov', ylab = 'FES kinship Cov' ) abline( 0, 1, lty = 2, col = 'gray' ) barplot( c( rmsd( V, V_rc_freq ), rmsd( V, V_rc_kin ), rmsd( V, V_mvn ), rmsd( V, V_fes_freq ), rmsd( V, V_fes_kin ) ), names.arg = c('RC freq', 'RC kin', 'MVN', 'FES freq', 'FES kin'), ylab = 'RMSD from Theoretical', las = 3 ) par( par_orig ) # reset `par` ``` Empirical covariance matrices estimate the theoretical one without bias when ancestral allele frequencies are used ("freq" versions) and with the MVN model, all of which have a comparable small mean squared error (RMSD in figure above). However, use of estimated allele frequencies, despite the necessary kinship bias correction ("kin" versions), still resulted in empirical covariances that are upwardly biased estimators of the desired theoretical values, which is slight for "RC kin" but severe for "FES kin". ## Environment group effects We can also simulate random group effects not determined by genetics (i.e., functions of the kinship matrix), but rather a result of the environment (any categorical covariates, which may represent geography, culture, etc). The code can simulate several such levels of groups, each level drawing random Normal effects with mean zero and a given fixed variance parameter. For the illustration we shall simulate two levels, each determined by the index or coordinate of the individual, but these two levels are not nested. ```{r} # first level, first half is all one group called "a" # second half is another group called "b" n_half <- round( n_ind / 2 ) labs1 <- c( rep.int( 'a', n_half ), rep.int( 'b', n_ind - n_half ) ) # second level will be in thirds instead # each level is independent, so group names can repeat across levels n_third <- round( n_ind / 3 ) labs2 <- c( rep.int( 'a', n_third ), rep.int( 'b', n_third ), rep.int( 'c', n_ind - 2 * n_third ) ) # combine! labs <- cbind( labs1, labs2 ) # set heritability and variance effects for these levels # reduce heritability to allow for large group variances herit <- 0.3 labs_sigma_sq <- c( 0.3, 0.2 ) ``` Let's redraw all traits to have these block effects, and validate them by comparing their empirical covariances to the desired, theoretical covariance. It is done as before except the parameters `labs` and `labs_sigma_sq` are now passed to the functions `cov_trait`, `sim_trait`, and `sim_trait_mvn`: ```{r} V <- cov_trait( kinship = kinship, herit = herit, sigma_sq = sigma_sq, labs = labs, labs_sigma_sq = labs_sigma_sq ) # store in this matrix, initialize with zeroes Y_rc_freq <- matrix(data = 0, nrow = n_traits, ncol = n_ind) Y_rc_kin <- matrix(data = 0, nrow = n_traits, ncol = n_ind) Y_fes_freq <- matrix(data = 0, nrow = n_traits, ncol = n_ind) Y_fes_kin <- matrix(data = 0, nrow = n_traits, ncol = n_ind) # start loop for (i in 1 : n_traits) { Y_rc_freq[i,] <- sim_trait( X=X, m_causal=m_causal, herit=herit, mu=mu, sigma_sq=sigma_sq, labs=labs, labs_sigma_sq=labs_sigma_sq, p_anc=p_anc )$trait Y_rc_kin[i,] <- sim_trait( X=X, m_causal=m_causal, herit=herit, mu=mu, sigma_sq=sigma_sq, labs=labs, labs_sigma_sq=labs_sigma_sq, kinship=kinship )$trait Y_fes_freq[i,] <- sim_trait( X=X, m_causal=m_causal, herit=herit, mu=mu, sigma_sq=sigma_sq, labs=labs, labs_sigma_sq=labs_sigma_sq, p_anc=p_anc, fes=TRUE )$trait Y_fes_kin[i,] <- sim_trait( X=X, m_causal=m_causal, herit=herit, mu=mu, sigma_sq=sigma_sq, labs=labs, labs_sigma_sq=labs_sigma_sq, kinship=kinship, fes=TRUE )$trait } Y_mvn <- sim_trait_mvn( rep = n_traits, kinship = kinship, herit = herit, mu = mu, sigma_sq = sigma_sq, labs = labs, labs_sigma_sq = labs_sigma_sq ) # estimate sample covariance V_rc_freq <- cov(Y_rc_freq) V_rc_kin <- cov(Y_rc_kin) V_fes_freq <- cov(Y_fes_freq) V_fes_kin <- cov(Y_fes_kin) V_mvn <- cov(Y_mvn) ``` We verify the means as before. Recalling that the red line marks the desired mean, we see that all means are as requested: ```{r, fig.width = 6, fig.align = 'center'} par_orig <- par(mgp = c(2, 0.5, 0)) # reduce margins from default par(mar = c(3.5, 3, 0, 0) + 0.2) # visualize distribution boxplot( list( "RC freq" = rowMeans(Y_rc_freq), "RC kinship" = rowMeans(Y_rc_kin), "MVN" = rowMeans(Y_mvn), "FES freq" = rowMeans(Y_fes_freq), "FES kinship" = rowMeans(Y_fes_kin) ), xlab = "Trait Type", ylab = 'Sample Mean' ) # red line marks expected mean abline(h = mu, col = 'red') par( par_orig ) # reset `par` ``` Now we compare all covariance matrices. We clearly see the group effects (covariance diagonal blocks), and in this case the kinship effect is much reduced (we reduced the heritability), so all simulations (which only differ in their genetic effect) agree even more than in the earlier high heritability example: ```{r, fig.width = 7, fig.height = 4, fig.align = 'center'} plot_popkin( list( V, V_rc_freq, V_rc_kin, V_mvn, V_fes_freq, V_fes_kin ), titles = c('Theoretical', 'RC freq', 'RC kinship', 'MVN', 'FES freq', 'FES kinship'), leg_title = 'Covariance', panel_letters_adj = 0, mar = c(0, 2), leg_width = 0.4, layout_rows = 2 ) ``` Lastly, here is the direct comparisons between the empirical covariances matrices and the desired theoretical one: ```{r, fig.width = 7, fig.height = 5, fig.align = 'center'} par_orig <- par(mgp = c(2, 0.5, 0)) # create multipanel figure par( mfrow = c(2, 3) ) # reduce margins from default par(mar = c(3.5, 3, 0, 0) + 0.2) plot( V, V_rc_freq, xlab = 'Theoretical Cov', ylab = 'RC freq Cov' ) abline( 0, 1, lty = 2, col = 'gray' ) plot( V, V_rc_kin, xlab = 'Theoretical Cov', ylab = 'RC kinship Cov' ) abline( 0, 1, lty = 2, col = 'gray' ) plot( V, V_mvn, xlab = 'Theoretical Cov', ylab = 'MVN Cov' ) abline( 0, 1, lty = 2, col = 'gray' ) plot( V, V_fes_freq, xlab = 'Theoretical Cov', ylab = 'FES freq Cov' ) abline( 0, 1, lty = 2, col = 'gray' ) plot( V, V_fes_kin, xlab = 'Theoretical Cov', ylab = 'FES kinship Cov' ) abline( 0, 1, lty = 2, col = 'gray' ) barplot( c( rmsd( V, V_rc_freq ), rmsd( V, V_rc_kin ), rmsd( V, V_mvn ), rmsd( V, V_fes_freq ), rmsd( V, V_fes_kin ) ), names.arg = c('RC freq', 'RC kin', 'MVN', 'FES freq', 'FES kin'), ylab = 'RMSD from Theoretical', las = 3 ) par( par_orig ) # reset `par` ``` All errors are low except "FES kin" still has a noticeable upward bias and the largest error, although this is still a small error compared to the previous high heritability simulation. # Model Here is a summary of the trait model, which explains what this package does internally. Suppose there are $n$ individuals and $m$ (causal) loci. For simplicity we shall assume that every locus has a coefficient, although in practice many of these coefficients will be zero. The following variables are part of the model: | Variable | Dimensions | Description | |-------------------------|----------------|--------------------------------------------------------------| | $\mathbf{X}$ | $m \times n$ | Genotype matrix | | $\mathbf{x}_i$ | $n \times 1$ | Genotype vector at locus $i$ | | $\mathbf{y}$ | $n \times 1$ | Trait | | $\boldsymbol{\beta}$ | $m \times 1$ | Locus coefficients | | $\boldsymbol{\epsilon}$ | $n \times 1$ | Residual effects | | $\mathbf{Z}$ | $n \times d$ | Environment effects design matrix (for all environments) | | $\mathbf{Z}_e$ | $n \times G_e$ | Environment effects design matrix for environment $e$ | | $\boldsymbol{\eta}$ | $d \times 1$ | Environment coefficients (for all environments) | | $\boldsymbol{\eta}_e$ | $G_e \times 1$ | Environment coefficients for environment $e$ | | $\mathbf{p}$ | $m \times 1$ | Ancestral allele frequencies | | $\mathbf{\Phi}$ | $n \times n$ | Kinship matrix | | $\alpha$ | $1 \times 1$ | Intercept coefficient | | $\mu$ | $1 \times 1$ | Trait mean | | $h^2$ | $1 \times 1$ | Heritability (variance proportion of genetic effects) | | $\sigma^2_\eta$ | $1 \times 1$ | Total environment variance proportion (for all environments) | | $\sigma^2_{\eta e}$ | $1 \times 1$ | Variance proportion for environment $e$ | | $\sigma^2_\epsilon$ | $1 \times 1$ | Variance proportion for residual effects | | $\sigma^2$ | $1 \times 1$ | Total trait variance factor | | $\mathbf{1}_n$ | $n \times 1$ | Vector of ones | | $\mathbf{0}_n$ | $n \times 1$ | Vector of zeroes | | $\mathbf{I}_n$ | $n \times n$ | Identity matrix | We assume the linear polygenic model for a quantitative trait: $$ \mathbf{y} = \mathbf{1}_n \alpha + \mathbf{X}' \boldsymbol{\beta} + \mathbf{Z} \boldsymbol{\eta} + \boldsymbol{\epsilon} . $$ To analyze the covariance structure of the trait, we shall assume that $\alpha$, $\boldsymbol{\beta}$, and $\mathbf{Z} = ( \mathbf{Z}_e )$ are fixed parameters, while $\mathbf{X} = (\mathbf{x}_i')$, $\boldsymbol{\eta} = (\boldsymbol{\eta}_e)$, and $\boldsymbol{\epsilon}$ are random with expectations and covariances of \begin{align*} \E[\mathbf{X}] &= 2 \mathbf{p} \mathbf{1}_n' , \\ \Cov(\mathbf{x}_i) &= 4 p_i (1-p_i) \mathbf{\Phi} , \\ \E[ \boldsymbol{\eta}_e] &= \mathbf{0}_{G_e} , \\ \Cov( \boldsymbol{\eta}_e ) &= \sigma^2_{\eta e} \sigma^2 \mathbf{I}_{G_e} , \\ \E[\boldsymbol{\epsilon}] &= \mathbf{0}_n , \\ \Cov(\boldsymbol{\epsilon}) &= \sigma^2_\epsilon \sigma^2 \mathbf{I}_n , \end{align*} where $\mathbf{p} = (p_i)$ and $G_e$ is the number of groups of environment $e$. The number of columns $d$ of $\mathbf{Z}$ equals the number of groups over all environments: $$ d = \sum_e G_e. $$ The expectation of the trait is therefore \begin{align*} \E[\mathbf{y}] &= \mathbf{1}_n \alpha + \E[\mathbf{X}'] \boldsymbol{\beta} + \mathbf{Z} \E[ \boldsymbol{\eta} ] + \E[\boldsymbol{\epsilon}] \\ &= \mathbf{1}_n \alpha + 2 \mathbf{1}_n \mathbf{p}' \boldsymbol{\beta} , \end{align*} which can be written as \begin{align*} \E[\mathbf{y}] = \mu \mathbf{1}_n , \quad \text{where} \quad \mu = \alpha + 2 \mathbf{p}' \boldsymbol{\beta} . \end{align*} The covariance matrix of the trait is \begin{align*} \Cov(\mathbf{y}) &= \sum_{i=1}^m \Cov(\mathbf{x}_i) \beta_i^2 + \mathbf{Z} \Cov( \boldsymbol{\eta} ) \mathbf{Z}' + \Cov(\boldsymbol{\epsilon}) \\ &= \mathbf{\Phi} \left( \sum_{i=1}^m 4 p_i (1-p_i) \beta_i^2 \right) + \left( \sum_e \sigma^2_{\eta e} \sigma^2 \mathbf{Z}_e \mathbf{Z}_e' \right) + \sigma^2_\epsilon \sigma^2 \mathbf{I}_n , \end{align*} where $\boldsymbol{\beta} = (\beta_i)$. Therefore, we can write the covariance in terms of the heritability, the other variance components, and the overall variance scale: \begin{align*} \Cov(\mathbf{y}) &= \sigma^2 \left( h^2 \left( 2 \mathbf{\Phi} \right) + \left( \sum_e \sigma^2_{\eta e} \mathbf{Z}_e \mathbf{Z}_e' \right) + \sigma^2_\epsilon \mathbf{I}_n \right) , \quad \text{where} \\ \sigma^2 h^2 &= \sum_{i=1}^m 2 p_i (1-p_i) \beta_i^2 . \end{align*} Let's consider the diagonal of the above covariance matrices, which corresponds to the vector of variances for individuals. The residual effects covariance matrix is simply $\mathbf{I}_n$, whose diagonal is $\mathbf{1}_n$, as desired, so the entire contribution to the variance comes from the factors $\sigma^2_\epsilon \sigma^2$. The environment effect covariance matrices $\mathbf{Z}_e \mathbf{Z}_e'$ all similarly have diagonal values equal to $\mathbf{1}_n$, so again their contribution to the variance is given by the factors $\sigma^2_{\eta e} \sigma^2$. Lastly, the genetic covariance matrix $2 \mathbf{\Phi}$ features the kinship matrix multiplied by 2 as it is customary, since it results in diagonal values of 1 for outbred individuals, for whom the variance is given by the factor $h^2 \sigma^2$. In this model we want $\sigma^2$ to be the total variance (for outbred individuals), so the rest of the factors must sum to one, so they correspond to proportions of variance: $$ h^2 + \sigma^2_\epsilon + \sum_e \sigma^2_{\eta e} = 1. $$ Given that equation, it makes sense to define the total environment variance proportion as $$ \sigma^2_\eta = \sum_e \sigma^2_{\eta e}, $$ which is also a variance proportion. Note that population structure increases the total variance averaged over individuals, because in those cases the average diagonal value of $2 \mathbf{\Phi}$ exceeds 1. # Algorithm In all cases the user sets the heritability and other parameters but not the coefficients $\boldsymbol{\beta}$ directly. To choose $\boldsymbol{\beta}$, the algorithm initially draws random coefficients or sets them using a formula (depending on the model) and scales them to yield the desired covariance structure. The clever parts of our algorithm are entirely about how to scale $\boldsymbol{\beta}$ correctly, since the variance of genotypes is non-trivial when there is population structure, and particularly when key parameters such as ancestral allele frequencies are not known, in which case key variance factors are estimated carefuly to avoid or reduce biases. The user provides a genotype matrix and sets the number of causal loci. The algorithm selects random loci to be the causal ones. From this moment on $\mathbf{X}$ will contain only those causal loci. ## Constructing environment and residual effects Simulating the residual and environment effects is straightforward. Since $h^2$ and every $\sigma^2_{\eta e}$ are provided, then $\sigma^2_\eta$ is calculated as above and $\sigma^2_\epsilon$ is set to $$ \sigma^2_\epsilon = 1 - h^2 - \sigma^2_\eta. $$ These input parameters are required to be non-negative, and so is the resulting $\sigma^2_\epsilon$. The user sets an environment group for each individual and each environment $e$. The environment design matrix $\mathbf{Z}_e$ has for each row one individual and each column one group, and the cell has a value of 1 if the individual belongs to that group, or a value of 0 otherwise. Lastly, the residual effects and environment coefficients ($\boldsymbol{\epsilon}$ and $\boldsymbol{\eta}_e$) are drawn independently from Normal distributions with mean zero and desired variance parameter ($\sigma^2_\epsilon \sigma^2$ and $\sigma^2_{\eta e} \sigma^2$, respectively), in agreement with the assumed moments given earlier. ## Constructing coefficients Under the *random coefficients* (RC) model, the initial coefficients are drawn independently from a standard normal distribution: $$ \beta_i \sim \text{N}(0,1). $$ Under the *fixed effect sizes* (FES) model, the initial coefficients are $$ \beta_i = 1 / \sqrt{p_i(1-p_i)}. $$ (When $p_i$ are unknown, their sample estimates are used for this step.) Lastly, the sign of $\beta_i$ is drawn randomly (it is negative with probability 0.5). Again, whichever form these coefficients take, they are rescaled to result in the desired heritability and variance factor using the procedures described next. Below we divide the algorithm into two steps: (1) scaling the coefficients, and (2) centering the trait. Each step forks into two cases: whether the true ancestral allele frequencies are known or not (the latter requires a known mean kinship). ## Scaling coefficients ### Scaling using known ancestral allele frequencies Here we assume that $\mathbf{p} = (p_i)$ is provided by the user. The user has also provided the desired values of both $h^2$ and $\sigma^2$. The initial genetic variance factor is $$ \sigma^2_0 = \sum_{i=1}^m 2 p_i (1-p_i) \beta_i^2. $$ We obtain the desired variance by dividing each $\beta_i$ by $\sigma_0$ (which results in a variance of 1) and then multiply by $h \sigma$ (which finally results in the desired variance of $h^2 \sigma^2$). Combining both steps, the update is $$ \boldsymbol{\beta} \leftarrow \boldsymbol{\beta} \frac{ h \sigma }{\sigma_0}. $$ ### Scaling using a known kinship matrix When $\mathbf{p}$ isn't known, sample estimates $\mathbf{\hat{p}}$ are constructed from the genotype data. Let $$ \hat{p}_i = \frac{1}{2n} \mathbf{1}_n' \mathbf{x}_i . $$ Although this estimator is unbiased ($\E[\mathbf{\hat{p}}] = \mathbf{p}$), the resulting variance estimates of interest are downwardly biased [@Ochoa083923]: $$ \E \left[ \hat{p}_i \left( 1-\hat{p}_i \right) \right] = p_i(1-p_i) (1 - \bar{\varphi}), $$ where $\bar{\varphi} = \frac{1}{n^2} \mathbf{1}_n' \mathbf{\Phi} \mathbf{1}_n$ is the mean kinship coefficient in the data. Therefore the initial genetic variance factor, estimated as $$ \hat{\sigma}^2_0 = \sum_{i=1}^m 2 \hat{p}_i (1-\hat{p}_i) \beta_i^2, $$ has an expectation of $$ \E \left[ \hat{\sigma}^2_0 \right] = \sigma^2_0 (1 - \bar{\varphi}) $$ Since this additional factor $(1 - \bar{\varphi})$ is known in this setting, the adjusted update $$ \boldsymbol{\beta} \leftarrow \boldsymbol{\beta} \frac{ h \sigma \sqrt{1-\bar{\varphi}} }{\hat{\sigma}_0} $$ also results in the desired variance. ## Centering the trait ### Centering using known ancestral allele frequencies This is the preferred approach as it is the only case that guarantees success. Given our model, we obtain the desired overall trait mean $\mu$ by choosing the intercept to be $$ \alpha = \mu - 2 \mathbf{p}' \boldsymbol{\beta} $$ ### Centering without ancestral allele frequencies The solution that this version of the algorithm takes is to choose the intercept \begin{align*} \alpha &= \mu - 2 \hat{\bar{p}} \mathbf{1}_m' \boldsymbol{\beta} , \quad \text{where} \\ \hat{\bar{p}} &= \frac{1}{m} \mathbf{1}_m' \mathbf{\hat{p}} = \frac{1}{2 m n} \mathbf{1}_m' \mathbf{X} \mathbf{1}_n = \frac{1}{2} \bar{X} . \end{align*} This works very well in practice since $\boldsymbol{\beta}$ is drawn randomly, so it is uncorrelated with the true $\mathbf{p}$ (this is true in FES too since the sign of each coefficient is random). In this setting it suffices to consider each coefficient $\beta_i$ as acting on the average locus, which is treated as having a random ancestral allele frequency $p_i$, and all that matters is the global mean of $p_i$ values. ### How NOT to center the trait vector Now let's discuss why the obvious way of centering the trait without known ancestral allele frequencies doesn't work. Why not use the sample allele frequencies as $$ \alpha = \mu - 2 \mathbf{\hat{p}}' \boldsymbol{\beta} \quad ? $$ Centering the trait this way is equivalent to centering genotypes at each locus: $$ \mathbf{y} = \mathbf{1}_n \alpha + \sum_{i=1}^m (\mathbf{x}_i - 2 \hat{p}_i \mathbf{1}_n) \beta_i + \boldsymbol{\epsilon}. $$ However, this operation introduces a distortion in the covariance of the genotypes [@Ochoa083923]: $$ \Cov \left( \mathbf{x}_i - 2 \hat{p}_i \mathbf{1}_n \right) = p_i (1-p_i) \left( \mathbf{\Phi} + \bar{\varphi} \mathbf{1}_n\mathbf{1}_n' - \boldsymbol{\varphi} \mathbf{1}_n' - \mathbf{1}_n \boldsymbol{\varphi}' \right), $$ where $\boldsymbol{\varphi} = \frac{1}{n} \mathbf{\Phi} \mathbf{1}_n$. These undesirable distortions propagate to the trait, which we confirmed in simulations (not shown). It is not clear how these distortions can be corrected for after centering the trait as shown above. Note that the intercept version we chose instead does not induce this genotype centering, which prevents the undesirable distortions in the trait covariance. # References