simtrait
: Simulate Complex
Traits from Genotypes
This vignette has three main parts:
Practical examples that show how to use the functions and demonstrate that the random traits generated by this package have the desired covariance structure.
The mathematical trait model that motivated this package.
The algorithm implementation, which follows straightforwardly from the model when ancestral allele frequencies are known. However, more painful details are necessary when ancestral allele frequencies must be estimated from the genotypes, which induces biases that fortunately can be corrected.
In this section we first simulated an admixed population using
bnpsd
, then we simulate traits using simtrait
.
In particular, we simulate a large number of traits to demonstrate that
their sample covariance matrix is as expected.
# dimensions of data/model
# number of loci
m_loci <- 10000
# number of individuals, smaller than usual for easier visualizations
n_ind <- 30
# number of intermediate subpops (source populations for admixed individuals)
k_subpops <- 3
# define population structure
# FST values for 3 subpopulations (proportional/unnormalized)
inbr_subpops <- 1 : k_subpops
bias_coeff <- 0.5 # bias coeff of standard Fst estimator
Fst <- 0.3 # desired final Fst
obj <- admix_prop_1d_linear(
n_ind = n_ind,
k_subpops = k_subpops,
bias_coeff = bias_coeff,
coanc_subpops = inbr_subpops,
fst = Fst
)
admix_proportions <- obj$admix_proportions
# rescaled Fst vector for intermediate subpops
inbr_subpops <- obj$coanc_subpops
# get pop structure parameters of the admixed individuals
coancestry <- coanc_admix(admix_proportions, inbr_subpops)
kinship <- coanc_to_kinship(coancestry)
# draw allele freqs and genotypes
out <- draw_all_admix(admix_proportions, inbr_subpops, m_loci)
X <- out$X # genotypes
p_anc <- out$p_anc # ancestral AFs
First we simulate one trait. Note that we pick non-default values for
the mean mu
and variance factor sigma_sq
to
validate that these can be anything we want.
# parameters of simulation
m_causal <- 100
herit <- 0.8
# default 0, let's try a non-trivial case
mu <- 1
# default 1, also let's see that this more complicated case works well
sigma_sq <- 1.5
# create simulated trait
# case of exact p_anc
obj <- sim_trait(
X = X,
m_causal = m_causal,
herit = herit,
p_anc = p_anc,
mu = mu,
sigma_sq = sigma_sq
)
# trait vector
length(obj$trait)
#> [1] 30
n_ind
#> [1] 30
obj$trait
#> [1] 0.7523315 0.3896250 2.4669657 0.1523110 1.8550673 -1.4234850
#> [7] -0.6163766 2.7113115 0.7417712 2.6601989 -1.2612429 2.4072987
#> [13] 1.9458118 2.5660450 -0.4201859 2.1266628 1.5284778 3.9891700
#> [19] 1.2832391 1.8223375 2.7176131 3.8630625 5.4899098 3.8373502
#> [25] 5.2091437 3.4807752 6.5574708 4.9708362 5.6760872 5.3295379
# randomly-picked causal locus indexes
length( obj$causal_indexes )
#> [1] 100
m_causal
#> [1] 100
head( obj$causal_indexes ) # show partially...
#> [1] 4376 3415 9362 9609 90 6624
# locus coefficients vector
length( obj$causal_coeffs )
#> [1] 100
m_causal
#> [1] 100
head( obj$causal_coeffs ) # show partially...
#> [1] -0.30007639 -0.13979990 -0.01161261 -0.07489658 0.21135443 0.15671915
The interesting validation is simulation a large number of random
traits, from which we can estimate a sample covariance matrix to compare
to the desired theoretical one. We shall compare this to other versions
of the simulation. We distinguish this version as having random
coefficients (RC) and employing true allele frequencies
p_anc
.
# the theoretical covariance matrix of the trait is calculated by cov_trait
V <- cov_trait(kinship = kinship, herit = herit, sigma_sq = sigma_sq)
# simulate these many traits
n_traits <- 1000
# store in this matrix, initialize with zeroes
Y_rc_freq <- matrix(data = 0, nrow = n_traits, ncol = n_ind)
# start loop
for (i in 1 : n_traits) {
obj <- sim_trait(
X = X,
m_causal = m_causal,
herit = herit,
p_anc = p_anc,
mu = mu,
sigma_sq = sigma_sq
)
Y_rc_freq[i,] <- obj$trait # store in i^th row
}
# estimate sample covariance
V_rc_freq <- cov(Y_rc_freq)
First let’s verify that the mean is as expected. Below the red line marks the desired mean.
par_orig <- par(mgp = c(2, 0.5, 0))
# reduce margins from default
par(mar = c(3.5, 3, 0, 0) + 0.2)
# visualize distribution
boxplot(
list(
'RC freq' = rowMeans(Y_rc_freq)
),
xlab = "Trait Type",
ylab = 'Sample Mean'
)
# red line marks expected mean
abline(h = mu, col = 'red')
Now let’s visualize the covariance matrices using
plot_popkin
from the popkin
package. Since
both matrices have large diagonals, we shrink them somewhat using
inbr_diag
also from the popkin
package.
plot_popkin(
list(V, V_rc_freq),
titles = c('Theoretical', 'RC freq'),
leg_title = 'Covariance',
panel_letters_adj = 0,
# set margin for title (top is non-zero)
mar = c(0, 2)
)
This plot verifies that the empirical covariance matches the theoretical expectation!
For real data, true ancestral allele frequencies are unknown. A
reasonable trait can still be simulated in these cases, but this
solution no longer has theoretical guarantees to yield the desired mean
value in particular. This solution relies on a known mean kinship to
compensate for the biases of estimated ancestral allele frequencies. A
good kinship matrix estimate can be obtained using the
popkin
package.
For simplicity, here we use the true kinship matrix rather than an estimate:
# store this in new matrix
Y_rc_kin <- matrix(data = 0, nrow = n_traits, ncol = n_ind)
# start loop
for (i in 1 : n_traits) {
obj <- sim_trait(
X = X,
m_causal = m_causal,
herit = herit,
# whole kinship matrix can be passed instead of just mean
kinship = kinship,
mu = mu,
sigma_sq = sigma_sq
)
Y_rc_kin[i,] <- obj$trait # store in i^th row
}
# estimate sample covariance
V_rc_kin <- cov(Y_rc_kin)
First let’s verify the means again. Recall the red line marks the
desired mean. Below the original sample (simulated using the true
p_anc
) is shown first as “RC freq”, while the new sample
based on the kinship matrix is “RC kinship”:
par_orig <- par(mgp = c(2, 0.5, 0))
# reduce margins from default
par(mar = c(3.5, 3, 0, 0) + 0.2)
# visualize distribution
boxplot(
list(
"RC freq" = rowMeans(Y_rc_freq),
"RC kinship" = rowMeans(Y_rc_kin)
),
xlab = "Trait Type",
ylab = 'Sample Mean'
)
# red line marks expected mean
abline(h = mu, col = 'red')
Now we compare all three matrices:
plot_popkin(
list(V, V_rc_freq, V_rc_kin),
titles = c('Theoretical', 'RC freq', 'RC kinship'),
leg_title = 'Covariance',
panel_letters_adj = 0,
mar = c(0, 2)
)
This plot shows again good agreement between the sample covariance matrix of traits simulated without true ancestral allele frequencies (“RC kinship”) and the desired “theoretical” covariance matrix.
An alternative approach for simulating traits is by drawing them from a Multivariate Normal (MVN) model with the desired mean and covariance structure. This is often called the infinitesimal model, since it follows from the central limit theorem under the assumption that there are infinite causal loci, each with an infinitesimal effect size. A trait simulated this way has no use in GWAS tests, as there are no causal loci (in other words, the null hypothesis holds across the genome). However, these traits have a heritability that can be estimated, and in fact this infinitesimal model is assumed by approaches that estimate heritability by fitting variance components, such as GCTA (Yang et al. 2011).
We draw the MVN traits this way:
# This function simulates trait replicates in one call,
# generating a matrix comparable to the previous ones.
Y_mvn <- sim_trait_mvn(
rep = n_traits,
kinship = kinship,
herit = herit,
mu = mu,
sigma_sq = sigma_sq
)
# estimate sample covariance
V_mvn <- cov(Y_mvn)
First let’s verify the means again. Recall the red line marks the
desired mean. The new sample is denoted as “MVN”, and the other two are
as in the previous sections (traits simulated from genotypes, using the
true p_anc
or with bias corrections from the kinship
matrix):
par_orig <- par(mgp = c(2, 0.5, 0))
# reduce margins from default
par(mar = c(3.5, 3, 0, 0) + 0.2)
# visualize distribution
boxplot(
list(
"RC freq" = rowMeans(Y_rc_freq),
"RC kinship" = rowMeans(Y_rc_kin),
"MVN" = rowMeans(Y_mvn)
),
xlab = "Trait Type",
ylab = 'Sample Mean'
)
# red line marks expected mean
abline(h = mu, col = 'red')
Now we compare all four covariance matrices:
plot_popkin(
list(V, V_rc_freq, V_rc_kin, V_mvn),
titles = c('Theoretical', 'RC freq', 'RC kinship', 'MVN'),
leg_title = 'Covariance',
panel_letters_adj = 0,
mar = c(0, 2),
leg_width = 0.4
)
This plot shows again good agreement between the sample covariance matrix of traits simulated under the infinitesimal model (“MVN”) and the desired “theoretical” covariance matrix and first two simulations.
This package can also simulate traits from a model where coefficients are larger for rarer variants, which may be more realistic for disease traits where selection prevents common variants from having large coefficients, while allowing rare variants to have larger coefficients. The effect size of locus i is its variance contribution, equal to 2βi2pi(1 − pi) for outbred individuals, where βi is the locus coefficient and pi is the ancestral allele frequency. The limit of strong negative and other modes of selection forces effect sizes to be equal for all loci, so the coefficient at locus i is proportional to $1 / \sqrt{p_i(1-p_i)}$. As in the other models, the coefficients are rescaled to yield the desired heritability and variance factor. This is related to previous models proposed in the literature, for example (Speed et al. 2012).
This time we simulate both the true allele frequency version, “freq”, and the “kinship” version that unbiases sample allele frequencies.
# store this in new matrix
Y_fes_freq <- matrix(data = 0, nrow = n_traits, ncol = n_ind)
Y_fes_kin <- matrix(data = 0, nrow = n_traits, ncol = n_ind)
# start loop
for (i in 1 : n_traits) {
obj <- sim_trait(
X = X,
m_causal = m_causal,
herit = herit,
p_anc = p_anc,
mu = mu,
sigma_sq = sigma_sq,
fes = TRUE # only diff from orig run
)
Y_fes_freq[i,] <- obj$trait # store in i^th row
obj <- sim_trait(
X = X,
m_causal = m_causal,
herit = herit,
kinship = kinship,
mu = mu,
sigma_sq = sigma_sq,
fes = TRUE # only diff from orig run
)
Y_fes_kin[i,] <- obj$trait # store in i^th row
}
# estimate sample covariance
V_fes_freq <- cov(Y_fes_freq)
V_fes_kin <- cov(Y_fes_kin)
First let’s verify the means again. Recall the red line marks the
desired mean. The new samples are labeled as “FES freq” and “FES
kinship”, and the first three are as in the previous sections (traits
simulated from genotypes and the random coefficients (RC)
model, using the true p_anc
or with bias corrections from
the kinship matrix, and the MVN traits):
par_orig <- par(mgp = c(2, 0.5, 0))
# reduce margins from default
par(mar = c(3.5, 3, 0, 0) + 0.2)
# visualize distribution
boxplot(
list(
"RC freq" = rowMeans(Y_rc_freq),
"RC kinship" = rowMeans(Y_rc_kin),
"MVN" = rowMeans(Y_mvn),
"FES freq" = rowMeans(Y_fes_freq),
"FES kinship" = rowMeans(Y_fes_kin)
),
xlab = "Trait Type",
ylab = 'Sample Mean'
)
# red line marks expected mean
abline(h = mu, col = 'red')
Now we compare all covariance matrices:
plot_popkin(
list( V, V_rc_freq, V_rc_kin, V_mvn, V_fes_freq, V_fes_kin ),
titles = c('Theoretical', 'RC freq', 'RC kinship', 'MVN', 'FES freq', 'FES kinship'),
leg_title = 'Covariance',
panel_letters_adj = 0,
mar = c(0, 2),
leg_width = 0.4,
layout_rows = 2
)
These plots show again good agreement between the sample covariance matrix of traits simulated under this “fixed effect sizes” model (“FES freq” and “FES kinship”) and the desired “theoretical” covariance matrix and first three simulations.
In closing these simulation results, let’s make more direct and detailed comparisons between the empirical covariances matrices and the desired theoretical one:
par_orig <- par(mgp = c(2, 0.5, 0))
# create multipanel figure
par( mfrow = c(2, 3) )
# reduce margins from default
par(mar = c(3.5, 3, 0, 0) + 0.2)
plot( V, V_rc_freq, xlab = 'Theoretical Cov', ylab = 'RC freq Cov' )
abline( 0, 1, lty = 2, col = 'gray' )
plot( V, V_rc_kin, xlab = 'Theoretical Cov', ylab = 'RC kinship Cov' )
abline( 0, 1, lty = 2, col = 'gray' )
plot( V, V_mvn, xlab = 'Theoretical Cov', ylab = 'MVN Cov' )
abline( 0, 1, lty = 2, col = 'gray' )
plot( V, V_fes_freq, xlab = 'Theoretical Cov', ylab = 'FES freq Cov' )
abline( 0, 1, lty = 2, col = 'gray' )
plot( V, V_fes_kin, xlab = 'Theoretical Cov', ylab = 'FES kinship Cov' )
abline( 0, 1, lty = 2, col = 'gray' )
barplot(
c(
rmsd( V, V_rc_freq ),
rmsd( V, V_rc_kin ),
rmsd( V, V_mvn ),
rmsd( V, V_fes_freq ),
rmsd( V, V_fes_kin )
),
names.arg = c('RC freq', 'RC kin', 'MVN', 'FES freq', 'FES kin'),
ylab = 'RMSD from Theoretical',
las = 3
)
Empirical covariance matrices estimate the theoretical one without bias when ancestral allele frequencies are used (“freq” versions) and with the MVN model, all of which have a comparable small mean squared error (RMSD in figure above). However, use of estimated allele frequencies, despite the necessary kinship bias correction (“kin” versions), still resulted in empirical covariances that are upwardly biased estimators of the desired theoretical values, which is slight for “RC kin” but severe for “FES kin”.
We can also simulate random group effects not determined by genetics (i.e., functions of the kinship matrix), but rather a result of the environment (any categorical covariates, which may represent geography, culture, etc). The code can simulate several such levels of groups, each level drawing random Normal effects with mean zero and a given fixed variance parameter. For the illustration we shall simulate two levels, each determined by the index or coordinate of the individual, but these two levels are not nested.
# first level, first half is all one group called "a"
# second half is another group called "b"
n_half <- round( n_ind / 2 )
labs1 <- c(
rep.int( 'a', n_half ),
rep.int( 'b', n_ind - n_half )
)
# second level will be in thirds instead
# each level is independent, so group names can repeat across levels
n_third <- round( n_ind / 3 )
labs2 <- c(
rep.int( 'a', n_third ),
rep.int( 'b', n_third ),
rep.int( 'c', n_ind - 2 * n_third )
)
# combine!
labs <- cbind( labs1, labs2 )
# set heritability and variance effects for these levels
# reduce heritability to allow for large group variances
herit <- 0.3
labs_sigma_sq <- c( 0.3, 0.2 )
Let’s redraw all traits to have these block effects, and validate
them by comparing their empirical covariances to the desired,
theoretical covariance. It is done as before except the parameters
labs
and labs_sigma_sq
are now passed to the
functions cov_trait
, sim_trait
, and
sim_trait_mvn
:
V <- cov_trait( kinship = kinship, herit = herit, sigma_sq = sigma_sq,
labs = labs, labs_sigma_sq = labs_sigma_sq )
# store in this matrix, initialize with zeroes
Y_rc_freq <- matrix(data = 0, nrow = n_traits, ncol = n_ind)
Y_rc_kin <- matrix(data = 0, nrow = n_traits, ncol = n_ind)
Y_fes_freq <- matrix(data = 0, nrow = n_traits, ncol = n_ind)
Y_fes_kin <- matrix(data = 0, nrow = n_traits, ncol = n_ind)
# start loop
for (i in 1 : n_traits) {
Y_rc_freq[i,] <- sim_trait( X=X, m_causal=m_causal, herit=herit, mu=mu, sigma_sq=sigma_sq,
labs=labs, labs_sigma_sq=labs_sigma_sq, p_anc=p_anc )$trait
Y_rc_kin[i,] <- sim_trait( X=X, m_causal=m_causal, herit=herit, mu=mu, sigma_sq=sigma_sq,
labs=labs, labs_sigma_sq=labs_sigma_sq, kinship=kinship )$trait
Y_fes_freq[i,] <- sim_trait( X=X, m_causal=m_causal, herit=herit, mu=mu, sigma_sq=sigma_sq,
labs=labs, labs_sigma_sq=labs_sigma_sq, p_anc=p_anc, fes=TRUE )$trait
Y_fes_kin[i,] <- sim_trait( X=X, m_causal=m_causal, herit=herit, mu=mu, sigma_sq=sigma_sq,
labs=labs, labs_sigma_sq=labs_sigma_sq, kinship=kinship, fes=TRUE )$trait
}
Y_mvn <- sim_trait_mvn( rep = n_traits, kinship = kinship, herit = herit,
mu = mu, sigma_sq = sigma_sq, labs = labs, labs_sigma_sq = labs_sigma_sq )
# estimate sample covariance
V_rc_freq <- cov(Y_rc_freq)
V_rc_kin <- cov(Y_rc_kin)
V_fes_freq <- cov(Y_fes_freq)
V_fes_kin <- cov(Y_fes_kin)
V_mvn <- cov(Y_mvn)
We verify the means as before. Recalling that the red line marks the desired mean, we see that all means are as requested:
par_orig <- par(mgp = c(2, 0.5, 0))
# reduce margins from default
par(mar = c(3.5, 3, 0, 0) + 0.2)
# visualize distribution
boxplot(
list(
"RC freq" = rowMeans(Y_rc_freq),
"RC kinship" = rowMeans(Y_rc_kin),
"MVN" = rowMeans(Y_mvn),
"FES freq" = rowMeans(Y_fes_freq),
"FES kinship" = rowMeans(Y_fes_kin)
),
xlab = "Trait Type",
ylab = 'Sample Mean'
)
# red line marks expected mean
abline(h = mu, col = 'red')
Now we compare all covariance matrices. We clearly see the group effects (covariance diagonal blocks), and in this case the kinship effect is much reduced (we reduced the heritability), so all simulations (which only differ in their genetic effect) agree even more than in the earlier high heritability example:
plot_popkin(
list( V, V_rc_freq, V_rc_kin, V_mvn, V_fes_freq, V_fes_kin ),
titles = c('Theoretical', 'RC freq', 'RC kinship', 'MVN', 'FES freq', 'FES kinship'),
leg_title = 'Covariance',
panel_letters_adj = 0,
mar = c(0, 2),
leg_width = 0.4,
layout_rows = 2
)
Lastly, here is the direct comparisons between the empirical covariances matrices and the desired theoretical one:
par_orig <- par(mgp = c(2, 0.5, 0))
# create multipanel figure
par( mfrow = c(2, 3) )
# reduce margins from default
par(mar = c(3.5, 3, 0, 0) + 0.2)
plot( V, V_rc_freq, xlab = 'Theoretical Cov', ylab = 'RC freq Cov' )
abline( 0, 1, lty = 2, col = 'gray' )
plot( V, V_rc_kin, xlab = 'Theoretical Cov', ylab = 'RC kinship Cov' )
abline( 0, 1, lty = 2, col = 'gray' )
plot( V, V_mvn, xlab = 'Theoretical Cov', ylab = 'MVN Cov' )
abline( 0, 1, lty = 2, col = 'gray' )
plot( V, V_fes_freq, xlab = 'Theoretical Cov', ylab = 'FES freq Cov' )
abline( 0, 1, lty = 2, col = 'gray' )
plot( V, V_fes_kin, xlab = 'Theoretical Cov', ylab = 'FES kinship Cov' )
abline( 0, 1, lty = 2, col = 'gray' )
barplot(
c(
rmsd( V, V_rc_freq ),
rmsd( V, V_rc_kin ),
rmsd( V, V_mvn ),
rmsd( V, V_fes_freq ),
rmsd( V, V_fes_kin )
),
names.arg = c('RC freq', 'RC kin', 'MVN', 'FES freq', 'FES kin'),
ylab = 'RMSD from Theoretical',
las = 3
)
All errors are low except “FES kin” still has a noticeable upward bias and the largest error, although this is still a small error compared to the previous high heritability simulation.
Here is a summary of the trait model, which explains what this package does internally.
Suppose there are n individuals and m (causal) loci. For simplicity we shall assume that every locus has a coefficient, although in practice many of these coefficients will be zero. The following variables are part of the model:
Variable | Dimensions | Description |
---|---|---|
X | m × n | Genotype matrix |
xi | n × 1 | Genotype vector at locus i |
y | n × 1 | Trait |
β | m × 1 | Locus coefficients |
ϵ | n × 1 | Residual effects |
Z | n × d | Environment effects design matrix (for all environments) |
Ze | n × Ge | Environment effects design matrix for environment e |
η | d × 1 | Environment coefficients (for all environments) |
ηe | Ge × 1 | Environment coefficients for environment e |
p | m × 1 | Ancestral allele frequencies |
Φ | n × n | Kinship matrix |
α | 1 × 1 | Intercept coefficient |
μ | 1 × 1 | Trait mean |
h2 | 1 × 1 | Heritability (variance proportion of genetic effects) |
ση2 | 1 × 1 | Total environment variance proportion (for all environments) |
σηe2 | 1 × 1 | Variance proportion for environment e |
σϵ2 | 1 × 1 | Variance proportion for residual effects |
σ2 | 1 × 1 | Total trait variance factor |
1n | n × 1 | Vector of ones |
0n | n × 1 | Vector of zeroes |
In | n × n | Identity matrix |
We assume the linear polygenic model for a quantitative trait: y = 1nα + X′β + Zη + ϵ. To analyze the covariance structure of the trait, we shall assume that α, β, and Z = (Ze) are fixed parameters, while X = (xi′), η = (ηe), and ϵ are random with expectations and covariances of where p = (pi) and Ge is the number of groups of environment e. The number of columns d of Z equals the number of groups over all environments: d = ∑eGe. The expectation of the trait is therefore which can be written as The covariance matrix of the trait is where β = (βi). Therefore, we can write the covariance in terms of the heritability, the other variance components, and the overall variance scale:
Let’s consider the diagonal of the above covariance matrices, which corresponds to the vector of variances for individuals. The residual effects covariance matrix is simply In, whose diagonal is 1n, as desired, so the entire contribution to the variance comes from the factors σϵ2σ2. The environment effect covariance matrices ZeZe′ all similarly have diagonal values equal to 1n, so again their contribution to the variance is given by the factors σηe2σ2. Lastly, the genetic covariance matrix 2Φ features the kinship matrix multiplied by 2 as it is customary, since it results in diagonal values of 1 for outbred individuals, for whom the variance is given by the factor h2σ2. In this model we want σ2 to be the total variance (for outbred individuals), so the rest of the factors must sum to one, so they correspond to proportions of variance: h2 + σϵ2 + ∑eσηe2 = 1. Given that equation, it makes sense to define the total environment variance proportion as ση2 = ∑eσηe2, which is also a variance proportion. Note that population structure increases the total variance averaged over individuals, because in those cases the average diagonal value of 2Φ exceeds 1.
In all cases the user sets the heritability and other parameters but not the coefficients β directly. To choose β, the algorithm initially draws random coefficients or sets them using a formula (depending on the model) and scales them to yield the desired covariance structure. The clever parts of our algorithm are entirely about how to scale β correctly, since the variance of genotypes is non-trivial when there is population structure, and particularly when key parameters such as ancestral allele frequencies are not known, in which case key variance factors are estimated carefuly to avoid or reduce biases.
The user provides a genotype matrix and sets the number of causal loci. The algorithm selects random loci to be the causal ones. From this moment on X will contain only those causal loci.
Simulating the residual and environment effects is straightforward. Since h2 and every σηe2 are provided, then ση2 is calculated as above and σϵ2 is set to σϵ2 = 1 − h2 − ση2. These input parameters are required to be non-negative, and so is the resulting σϵ2. The user sets an environment group for each individual and each environment e. The environment design matrix Ze has for each row one individual and each column one group, and the cell has a value of 1 if the individual belongs to that group, or a value of 0 otherwise. Lastly, the residual effects and environment coefficients (ϵ and ηe) are drawn independently from Normal distributions with mean zero and desired variance parameter (σϵ2σ2 and σηe2σ2, respectively), in agreement with the assumed moments given earlier.
Under the random coefficients (RC) model, the initial coefficients are drawn independently from a standard normal distribution: βi ∼ N(0, 1).
Under the fixed effect sizes (FES) model, the initial coefficients are $$ \beta_i = 1 / \sqrt{p_i(1-p_i)}. $$ (When pi are unknown, their sample estimates are used for this step.) Lastly, the sign of βi is drawn randomly (it is negative with probability 0.5).
Again, whichever form these coefficients take, they are rescaled to result in the desired heritability and variance factor using the procedures described next.
Below we divide the algorithm into two steps: (1) scaling the coefficients, and (2) centering the trait. Each step forks into two cases: whether the true ancestral allele frequencies are known or not (the latter requires a known mean kinship).
Here we assume that p = (pi) is provided by the user. The user has also provided the desired values of both h2 and σ2. The initial genetic variance factor is $$ \sigma^2_0 = \sum_{i=1}^m 2 p_i (1-p_i) \beta_i^2. $$ We obtain the desired variance by dividing each βi by σ0 (which results in a variance of 1) and then multiply by hσ (which finally results in the desired variance of h2σ2). Combining both steps, the update is $$ \boldsymbol{\beta} \leftarrow \boldsymbol{\beta} \frac{ h \sigma }{\sigma_0}. $$
When p isn’t known, sample estimates p̂ are constructed from the genotype data. Let $$ \hat{p}_i = \frac{1}{2n} \mathbf{1}_n' \mathbf{x}_i . $$ Although this estimator is unbiased ($\E[\mathbf{\hat{p}}] = \mathbf{p}$), the resulting variance estimates of interest are downwardly biased (Ochoa and Storey 2016): $$ \E \left[ \hat{p}_i \left( 1-\hat{p}_i \right) \right] = p_i(1-p_i) (1 - \bar{\varphi}), $$ where $\bar{\varphi} = \frac{1}{n^2} \mathbf{1}_n' \mathbf{\Phi} \mathbf{1}_n$ is the mean kinship coefficient in the data. Therefore the initial genetic variance factor, estimated as $$ \hat{\sigma}^2_0 = \sum_{i=1}^m 2 \hat{p}_i (1-\hat{p}_i) \beta_i^2, $$ has an expectation of $$ \E \left[ \hat{\sigma}^2_0 \right] = \sigma^2_0 (1 - \bar{\varphi}) $$ Since this additional factor (1 − φ̄) is known in this setting, the adjusted update $$ \boldsymbol{\beta} \leftarrow \boldsymbol{\beta} \frac{ h \sigma \sqrt{1-\bar{\varphi}} }{\hat{\sigma}_0} $$ also results in the desired variance.
This is the preferred approach as it is the only case that guarantees success. Given our model, we obtain the desired overall trait mean μ by choosing the intercept to be α = μ − 2p′β
The solution that this version of the algorithm takes is to choose the intercept This works very well in practice since β is drawn randomly, so it is uncorrelated with the true p (this is true in FES too since the sign of each coefficient is random). In this setting it suffices to consider each coefficient βi as acting on the average locus, which is treated as having a random ancestral allele frequency pi, and all that matters is the global mean of pi values.
Now let’s discuss why the obvious way of centering the trait without known ancestral allele frequencies doesn’t work. Why not use the sample allele frequencies as α = μ − 2p̂′β ? Centering the trait this way is equivalent to centering genotypes at each locus: $$ \mathbf{y} = \mathbf{1}_n \alpha + \sum_{i=1}^m (\mathbf{x}_i - 2 \hat{p}_i \mathbf{1}_n) \beta_i + \boldsymbol{\epsilon}. $$ However, this operation introduces a distortion in the covariance of the genotypes (Ochoa and Storey 2016): $$ \Cov \left( \mathbf{x}_i - 2 \hat{p}_i \mathbf{1}_n \right) = p_i (1-p_i) \left( \mathbf{\Phi} + \bar{\varphi} \mathbf{1}_n\mathbf{1}_n' - \boldsymbol{\varphi} \mathbf{1}_n' - \mathbf{1}_n \boldsymbol{\varphi}' \right), $$ where $\boldsymbol{\varphi} = \frac{1}{n} \mathbf{\Phi} \mathbf{1}_n$. These undesirable distortions propagate to the trait, which we confirmed in simulations (not shown). It is not clear how these distortions can be corrected for after centering the trait as shown above.
Note that the intercept version we chose instead does not induce this genotype centering, which prevents the undesirable distortions in the trait covariance.