simfam
: simulate and model
family pedigrees with structured foundersThe simfam
package is for constructing large random
families—with population studies in mind—with realistic constraints
including avoidance of close relative pairings but otherwise biased for
closest pairs in a 1-dimensional geography. There is also code to draw
genotypes across the pedigree starting from genotypes for the founders.
Our model allows for the founders to be related or structured—which
arises in practice when there are different ancestries among
founders—and given known parameters for these founders (including known
kinship matrices and ancestry proportions) we provide code to calculate
the true kinship and expected admixture proportions of all
descendants.
This vignette focuses on two examples. One is small, necessary to
actually visualize the pedigree. The second is larger, where the
pedigree cannot be fully visualized but whose population parameters are
more realistic. This vignette draws from other general packages to
visualize the data (popkin
and kinship2
), as
well as to construct/simulate structured founders (bnpsd
,
which is based on the admixture model).
Let’s start with a 3-generation pedigree with 15 individuals per generation. (The code can also handle variable numbers of individuals per generation, which is not demonstrated in this vignette for simplicity).
library(simfam)
# data dimensions
n <- 15 # number of individuals per generation
G <- 3 # number of generations
# the result changes in every run unless we set the seed
set.seed(1)
# draw the random pedigree with those properties
data <- sim_pedigree( n, G )
# save these objects separately
fam <- data$fam
ids <- data$ids
kinship_local_G <- data$kinship_local
The function returns a list with two objects. The first element in
the list is the most important: fam
is a table that
describes the pedigree in a standard notation in genetics research: the
plink FAM
format. Every row corresponds to one individual, and columns name
individuals (id
), their parents (pat
and
mat
, founders have parents set to NA
), and
their sex
(1
=male, 2
=female). The
FAM table format also requires a family ID (fam
) and a
phenotype (pheno
) column, both of which take on dummy
values in the output of sim_pedigree
:
# the top of the table
head( fam )
#> # A tibble: 6 × 6
#> fam id pat mat sex pheno
#> <chr> <chr> <chr> <chr> <int> <dbl>
#> 1 fam1 1-1 <NA> <NA> 1 0
#> 2 fam1 1-2 <NA> <NA> 2 0
#> 3 fam1 1-3 <NA> <NA> 1 0
#> 4 fam1 1-4 <NA> <NA> 1 0
#> 5 fam1 1-5 <NA> <NA> 2 0
#> 6 fam1 1-6 <NA> <NA> 1 0
# and the bottom
tail( fam )
#> # A tibble: 6 × 6
#> fam id pat mat sex pheno
#> <chr> <chr> <chr> <chr> <int> <dbl>
#> 1 fam1 3-10 2-7 2-10 2 0
#> 2 fam1 3-11 2-7 2-10 1 0
#> 3 fam1 3-12 2-7 2-10 2 0
#> 4 fam1 3-13 2-14 2-11 2 0
#> 5 fam1 3-14 2-14 2-11 2 0
#> 6 fam1 3-15 2-14 2-11 2 0
The IDs (columns id
, pat
, mat
)
are formatted as g-i
, where the first number is the
generation and the second number is an index within the generation (1 to
n
). Every individual belongs to a generation in the
simulation (founders are the first generation), and only individuals in
the same generation may be paired. Sex is drawn at random for each
individual prior to pairing. Every individual has a unique index within
the generation, that places them in a 1D geography—an important detail
since pairings are strongly biased to be between closest individuals.
Founders are ordered as they are constructed, and children are ordered
according to the mean index of their parents. Pairings occur randomly
and iteratively: while there are available male-female pairs, a male is
drawn at random from the available males and he is paired with the
closest unpaired female that is not a close relative (by default they
must be less related than second cousins, i.e. have a pedigree kinship
coefficient of less than 1/4^3
).
The second element in the return list of sim_pedigree
is
ids
, the list of IDs separated by generation. This is very
handy as often we want to copy the names of the founders
(ids[[ 1 ]]
) to other objects, or want to filter data to
contain the last generation only (ids[[ G ]]
).
The third element in the return list of sim_pedigree
is
the local (i.e., pedigree-based) kinship matrix of the individuals. This
data can be recalculated from the fam
table alone, so it is
redundant, but it is calculated because the pairing algorithm requires
it to avoid close relatives, so it is returned mostly because it’s there
already. By default, only individuals in the last generation are
included in the matrix that is returned, so this is a n * n
(here 15 x 15) matrix:
kinship_local_G
#> 3-1 3-2 3-3 3-4 3-5 3-6 3-7 3-8 3-9 3-10
#> 3-1 0.5000 0.2500 0.2500 0.1250 0.1250 0.1250 0.0625 0.0625 0.0625 0.0625
#> 3-2 0.2500 0.5000 0.2500 0.1250 0.1250 0.1250 0.0625 0.0625 0.0625 0.0625
#> 3-3 0.2500 0.2500 0.5000 0.1250 0.1250 0.1250 0.0625 0.0625 0.0625 0.0625
#> 3-4 0.1250 0.1250 0.1250 0.5000 0.2500 0.2500 0.0625 0.0625 0.0625 0.0625
#> 3-5 0.1250 0.1250 0.1250 0.2500 0.5000 0.2500 0.0625 0.0625 0.0625 0.0625
#> 3-6 0.1250 0.1250 0.1250 0.2500 0.2500 0.5000 0.0625 0.0625 0.0625 0.0625
#> 3-7 0.0625 0.0625 0.0625 0.0625 0.0625 0.0625 0.5000 0.2500 0.1250 0.1250
#> 3-8 0.0625 0.0625 0.0625 0.0625 0.0625 0.0625 0.2500 0.5000 0.1250 0.1250
#> 3-9 0.0625 0.0625 0.0625 0.0625 0.0625 0.0625 0.1250 0.1250 0.5000 0.2500
#> 3-10 0.0625 0.0625 0.0625 0.0625 0.0625 0.0625 0.1250 0.1250 0.2500 0.5000
#> 3-11 0.0625 0.0625 0.0625 0.0625 0.0625 0.0625 0.1250 0.1250 0.2500 0.2500
#> 3-12 0.0625 0.0625 0.0625 0.0625 0.0625 0.0625 0.1250 0.1250 0.2500 0.2500
#> 3-13 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0625 0.0625 0.0625 0.0625
#> 3-14 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0625 0.0625 0.0625 0.0625
#> 3-15 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0625 0.0625 0.0625 0.0625
#> 3-11 3-12 3-13 3-14 3-15
#> 3-1 0.0625 0.0625 0.0000 0.0000 0.0000
#> 3-2 0.0625 0.0625 0.0000 0.0000 0.0000
#> 3-3 0.0625 0.0625 0.0000 0.0000 0.0000
#> 3-4 0.0625 0.0625 0.0000 0.0000 0.0000
#> 3-5 0.0625 0.0625 0.0000 0.0000 0.0000
#> 3-6 0.0625 0.0625 0.0000 0.0000 0.0000
#> 3-7 0.1250 0.1250 0.0625 0.0625 0.0625
#> 3-8 0.1250 0.1250 0.0625 0.0625 0.0625
#> 3-9 0.2500 0.2500 0.0625 0.0625 0.0625
#> 3-10 0.2500 0.2500 0.0625 0.0625 0.0625
#> 3-11 0.5000 0.2500 0.0625 0.0625 0.0625
#> 3-12 0.2500 0.5000 0.0625 0.0625 0.0625
#> 3-13 0.0625 0.0625 0.5000 0.2500 0.2500
#> 3-14 0.0625 0.0625 0.2500 0.5000 0.2500
#> 3-15 0.0625 0.0625 0.2500 0.2500 0.5000
This kinship matrix is later visualized more conveniently as a
heatmap using the popkin
package. If you’re interested, a
list of local kinship matrices (one per generation) is returned in place
of this if sim_pedigree
is called with the
full = TRUE
option.
We use the excellent kinship2
package to visualize our
randomly-constructed pedigree:
# load the package
library(kinship2)
#> Loading required package: Matrix
#> Loading required package: quadprog
# Convert the `fam` table into a `pedigree` object required by the `kinship2` package.
obj <- pedigree(
id = fam$id,
dadid = fam$pat,
momid = fam$mat,
sex = fam$sex,
affected = fam$pheno
)
# plot the pedigree
plot( obj, cex = 0.7 )
#> Did not plot the following people: 1-3 1-4 1-6 1-7 1-11 1-13 1-15
In these kinds of pedigree plots, node shapes represent sex (box=male, circle=female), and the dashed curved lines (if any) represent individuals copied in two places (once next to its parents, the second time next to its mate) to simplify the plot (the two nodes connected by a dashed line have the same individual ID). It is immediate clear that only individuals with opposite sex were paired, and no close relatives were paired (i.e. no siblings or first or second cousins).
You will notice several things from this random pedigree. One is that some individuals do not get paired in each generation, which happens because there are too many constraints: the number of males and females is random so they may be unequal, and close-relative avoidance further limits pairings. However, this problem is exacerbated in this toy example because the population is small; in a large population most people are paired (see further below). Here, founders that weren’t paired are not plotted (see message below figure). Pairings are always between people of the same generation (first number in ID), and usually but not always between people with close indexes (second number in ID).
When there are pairings, sim_pedigree
enforces a minimum
number of children per family (default 1). Each generation has exactly
the desired number of total individuals (15 per generation in this
example), which is achieved by randomly drawing family sizes from a
modified Poisson distribution (the number of children in excess of the
minimum is drawn from Poisson with target mean per family minus minimum
size) followed by smaller random adjustments in the direction of the
discrepancy until the target population size is met.
If we are only interested in the last generation, then individuals
from previous generations that were unpaired are irrelevant, since for
example they don’t determine the genotypes of the last generation. So
for simplicity or efficiency, we may want to remove those individuals
without descendants. We do this with the function
prune_fam
!
# replace original with pruned FAM table
# containing only ancestors of last generation
fam <- prune_fam( fam, ids[[ G ]] )
# inspect
fam
#> # A tibble: 33 × 6
#> fam id pat mat sex pheno
#> <chr> <chr> <chr> <chr> <int> <dbl>
#> 1 fam1 1-1 <NA> <NA> 1 0
#> 2 fam1 1-2 <NA> <NA> 2 0
#> 3 fam1 1-5 <NA> <NA> 2 0
#> 4 fam1 1-8 <NA> <NA> 1 0
#> 5 fam1 1-9 <NA> <NA> 2 0
#> 6 fam1 1-10 <NA> <NA> 2 0
#> 7 fam1 1-12 <NA> <NA> 1 0
#> 8 fam1 1-14 <NA> <NA> 1 0
#> 9 fam1 2-2 1-1 1-2 2 0
#> 10 fam1 2-3 1-1 1-2 2 0
#> # ℹ 23 more rows
# and plot pedigree
obj <- pedigree(
id = fam$id,
dadid = fam$pat,
momid = fam$mat,
sex = fam$sex,
affected = fam$pheno
)
plot( obj, cex = 0.7 )
Inspection of the top of the table shows that founders that were originally unpaired are now removed (the plot also doesn’t report people not plotted). The plot further shows that many individuals in the second generation (childless uncles and aunts of the last generation) were also removed, but nobody in the third generation was removed (as desired).
We will draw genotypes in two ways to reflect the traditional
(unrelated founders) and the intended use of simfam
:
structured founders. Let’s do unrelated founders first. Note that, to
keep this vignette runtime low, the number of loci is much lower than
any real dataset, so data will look artificially noisier here.
# number of loci to draw
m <- 5000
# draw uniform ancestral allele frequencies
p_anc <- runif( m )
# draw independent (unstructured) genotypes for founders (includes founders without descendants)
X_1 <- matrix(
rbinom( n * m, 2, p_anc ),
nrow = m,
ncol = n
)
# `simfam` requires names for the founders on `X_1`, to validate identities and align
colnames( X_1 ) <- ids[[ 1 ]]
Before moving on, it’s important to understand that these founders
have a kinship matrix, albeit a trivial one, with zero kinship
(unrelated) between individuals and a self-kinship of 1/2 (expected for
outbred individuals). We can verify this directly using a kinship
estimator applied to the genotypes we just drew. We use
popkin
, which is the only unbiased kinship estimator.
popkin
also provides heatmap visualization of kinship
matrices via plot_popkin
.
library(popkin)
# estimate kinship
kinship_est_1 <- popkin( X_1 )
# true/expected kinship
kinship_1 <- diag( n ) / 2
# assign same IDs as `X_1` and `fam`; `simfam` generally requires these to validate/align
colnames( kinship_1 ) <- colnames( X_1 )
rownames( kinship_1 ) <- colnames( X_1 )
# plot them together for comparison
plot_popkin(
list( kinship_1, kinship_est_1 ),
titles = c('Expected', 'Estimated'),
names = TRUE,
leg_width = 0.2,
mar = c(3, 2)
)
Now we draw genotypes across the pedigree using
geno_fam
! Since we have genotypes of the founders, we can
simulate the random inheritance of alleles to every descendant, which
produces the correct pedigree-based kinship/covariance structure at
every locus. However, this code draws loci independently from each other
(there’s no linkage disequilibrium). We also construct the total
expected kinship across this pedigree with a similar function,
kinship_fam
.
X <- geno_fam( X_1, fam )
# expected kinship of entire pruned pedigree, starting from true kinship of founders.
# Called "local" kinship because it ignores the ancestral population structure (will be reused)
# NOTE: agrees with `kinship2::kinship( fam )`, but only because founders are unstructured
kinship_local <- kinship_fam( kinship_1, fam )
We can view and validate the resulting data in several ways. The more
complete validation is to estimate kinship for all X
(all 3
generations) and see that it agrees with kinship_fam
:
# estimate kinship
kinship_local_est <- popkin( X )
# plot them together for comparison
plot_popkin(
list( kinship_local, kinship_local_est ),
titles = c('Expected', 'Estimated'),
names = TRUE,
names_cex = 0.8,
leg_width = 0.2,
mar = c(3, 2)
)
Above you can see that only founders in the pruned FAM table were kept
in the output. However, in a real dataset previous generations might not
be available (especially if G
were much larger than 3), so
we might only have the genotype data of a subset of more recent
relatives. Here we subset to keep the last generation, and compare not
only to kinship_fam
, but also to the original
kinship_local_G
that was calculated by
sim_pedigree
to avoid pairing close relatives.
# indexes (really logicals) marking individuals in last generation,
# to subset data (which is aligned with `fam`)
indexes_G <- fam$id %in% ids[[ G ]]
# estimate kinship
kinship_local_est <- popkin( X )
# plot them together for comparison
plot_popkin(
list(
kinship_local[ indexes_G, indexes_G ],
kinship_local_est[ indexes_G, indexes_G ],
kinship_local_G
),
titles = c('Expected', 'Estimated', 'Local'),
names = TRUE,
mar = c(3, 2)
)
Now we simulate data from the much more interesting case, where
founders have differing mixed ancestries! This is much more like real
human data. For this we use another external package,
bnpsd
, to construct the admixture parameters of the
founders, which also includes calculating their true population kinship
matrix.
library(bnpsd)
# additional admixture parameters:
# number of ancestries
K <- 3
# define population structure
# FST values for k=3 subpopulations
inbr_subpops <- c(0.2, 0.4, 0.6)
# admixture proportions from 1D geography (founders)
admix_proportions_1 <- admix_prop_1d_linear( n, K, sigma = 1 )
# add founder names to `admix_proportions_1` (will propagate to other vars of interest)
rownames( admix_proportions_1 ) <- ids[[ 1 ]]
# also name subpopulations simply as S1, S2, S3
colnames( admix_proportions_1 ) <- paste0( 'S', 1:K )
# calculate true kinship matrix of the admixed founders
# NOTE: overwrites `kinship` for unstructured founders
kinship_1 <- coanc_to_kinship( coanc_admix( admix_proportions_1, inbr_subpops ) )
# draw genotypes from admixture model
# NOTE: overwrites `X_1` for unstructured founders
X_1 <- draw_all_admix( admix_proportions_1, inbr_subpops, m )$X
We can again verify the population kinship of the founders, which is no longer full of zeroes but rather takes on continuous values that depend on the admixture proportions of each pair of founders:
# estimate kinship (NOTE: overwrites `kinship_est_1` for unstructured founders)
kinship_est_1 <- popkin( X_1 )
# plot them together for comparison
plot_popkin(
list( kinship_1, kinship_est_1 ),
titles = c('Expected', 'Estimated'),
names = TRUE,
leg_width = 0.2,
mar = c(3, 2)
)
Note that, unlike the corresponding demonstration in the
bnpsd
package, here the diagonal plots self-kinship values
instead of inbreeding coefficients, as kinship matrices are the central
focus of this simfam
package.
We draw genotypes across the pedigree the same way as before:
X <- geno_fam( X_1, fam )
# expected kinship of entire pruned pedigree, starting from true kinship of founders
# NOTE: DOESN'T agree with `kinship2::kinship( fam )` anymore because founders are structured!
kinship_total <- kinship_fam( kinship_1, fam )
We again compare the expected kinship matrix to the estimate from genotypes:
# estimate kinship
kinship_total_est <- popkin( X )
# plot them together for comparison
plot_popkin(
list( kinship_total, kinship_total_est ),
titles = c('Expected', 'Estimated'),
names = TRUE,
names_cex = 0.8,
leg_width = 0.2,
mar = c(3, 2)
)
Lastly, we again subset individuals in the last generation only and compare their total to their local kinship, which are related but different things when a population is structured. Although undoubtedly the total kinship (panels A-B below) capture much of the same correlation patterns as the local kinship (panel C), the total kinship captures additional covariance due to the structure of the founders that the local kinship ignores (the local kinship uses only the pedigree information and ignores the ancestry differences of the founders).
In this admixture model the structure comes from fractional shared ancestry due to admixture as well as the relatedness between ancestries. We can look at these relatedness components in isolation from the pedigree and see that their contribution to the total kinship matrix is nearly additive (will explain more precisely shortly)! First propagate the admixture proportions across the pedigree, under the simple model that the ancestry of a child is the average of the ancestry of its parents:
# admixture proportions of pruned family
admix_proportions <- admix_fam( admix_proportions_1, fam )
# visualize as bar plot
# for nice colors
library(RColorBrewer)
# colors for independent subpopulations
col_subpops <- brewer.pal( K, "Paired" )
# shrink default margins
par_orig <- par(mar = c(4, 4, 0.5, 0) + 0.2)
barplot(
t( admix_proportions ),
col = col_subpops,
legend.text = TRUE,
args.legend = list( cex = 0.7, bty = 'n' ),
border = NA,
space = 0,
ylab = 'Admixture prop.',
las = 2
)
mtext('Individuals', side = 1, line = 3)
The above figure shows that admixture proportions in the first generation vary a lot: for example, 1-1 has a large proportion of S1 ancestry, while 1-14 has mostly S3 instead. In contrast, in the third generation ancestry is much more averaged out, with fewer extremes. This effect will be weaker in larger populations, an example we go into shortly.
This code calculates the kinship expected from the admixture structure only, which will be visualized shortly.
We also demonstrate a good approximation of the total kinship that treats the structural (here admixture) and pedigree relatedness (in this case treating founders as unstructured) as independent effects, which roughly says that the kinship components satisfy (pseudocode):
# These quantities are independent
(1 - total) = (1 - struct) * (1 - family).
# solved for total, reveals near additivity:
total = struct + family - struct * family.
So when both the structural and family kinship values are much
smaller than 1, their sum is a good match of the total kinship. Only
when these values approach 1 does it become necessary to subtract their
product, which is necessary since all of these values are probabilities
and the total kinship cannot exceed 1. The same applies to inbreeding
coefficient, but if we start from kinship matrices we need to transform
the diagonal values back and forth (since, in pseudocode,
kinship[i,i] = ( 1 + inbreeding[i] ) / 2
). We plot all of
the matrices and verify that the approximation is very good, so the
conceptual partitioning into structural and family effects is
justified:
# this is the approximation of the total kinship:
# NOTE: a bit complicated because diagonal has to be transformed.
# Approximation operates on inbreeding coefficients, not self-kinship,
# so `popkin::inbr_diag` converts self-kinship to inbreeding,
# `bnpsd::coanc_to_kinship` converts inbreeding back to self-kinship.
# Off-diagonal values are unmodified by both functions.
kinship_total_approx <- coanc_to_kinship(
inbr_diag( kinship_local ) +
inbr_diag( kinship_admix ) -
inbr_diag( kinship_admix ) * inbr_diag( kinship_local )
)
# plot all model kinship matrices (no estimates from genotypes this time)
plot_popkin(
list( kinship_admix, kinship_local, kinship_total, kinship_total_approx ),
titles = c('Admixture only', 'Family only', 'Total', 'Total approx.'),
names = TRUE,
names_cex = 0.8,
leg_width = 0.2,
layout_rows = 2,
mar = c(3, 2)
)
Here we will simulate a much larger population size and a pedigree
with a larger number of generations, to produce data that may better
resemble real human data. We will make no attempt of visualize pedigrees
or show individual labels. The whole code, redundant with our earlier
analysis, is presented with minor explanations. Note, however, use of
the newly-introduced functions geno_last_gen
,
kinship_last_gen
, and admix_last_gen
to
directly construct data for the last generation only, which also saves
memory compared to the more general geno_fam
,
kinship_fam
, and admix_fam
shown earlier.
library(simfam)
# data dimensions
n <- 500 # number of individuals per generation
G <- 10 # number of generations
K <- 3 # number of ancestries
m <- 5000 # number of loci
# draw the random pedigree with those properties.
data <- sim_pedigree( n, G )
fam <- data$fam
ids <- data$ids
# use this local kinship calculation in plot
kinship_local_G <- data$kinship_local
# prune pedigree to speed up simulations/etc
fam <- prune_fam( fam, ids[[ G ]] )
### admixture model
# define population structure
# FST values for k=3 subpopulations
inbr_subpops <- c(0.2, 0.4, 0.6)
# admixture proportions from 1D geography (founders)
admix_proportions_1 <- admix_prop_1d_linear( n, K, sigma = 1 )
# add founder names to `admix_proportions_1` (will propagate to other vars of interest)
rownames( admix_proportions_1 ) <- ids[[ 1 ]]
# also name subpopulations simply as S1, S2, S3
colnames( admix_proportions_1 ) <- paste0( 'S', 1:K )
# draw genotypes from admixture model
# admixed founders
X_1 <- draw_all_admix( admix_proportions_1, inbr_subpops, m )$X
# draw genotypes, returning last generation only
X_G <- geno_last_gen( X_1, fam, ids )
# total kinship estimated from genotypes
kinship_total_G_est <- popkin( X_G )
# calculate true total kinship matrix
kinship_total_1 <- coanc_to_kinship( coanc_admix( admix_proportions_1, inbr_subpops ) )
kinship_total_G <- kinship_last_gen( kinship_total_1, fam, ids )
# kinship due to admixture only
admix_proportions_G <- admix_last_gen( admix_proportions_1, fam, ids )
kinship_admix_G <- coanc_to_kinship( coanc_admix( admix_proportions_G, inbr_subpops ) )
# total kinship approximation from components
kinship_total_approx_G <- coanc_to_kinship(
inbr_diag( kinship_local_G ) +
inbr_diag( kinship_admix_G ) -
inbr_diag( kinship_admix_G ) * inbr_diag( kinship_local_G )
)
The first important validation is that the total structure (estimated from genotypes) agrees with the theoretical calculations based on the pedigree and the known founder total kinship. For simplicity we visualize the last generation only:
plot_popkin(
list( kinship_total_G, kinship_total_G_est ),
titles = c('Expected', 'Estimated'),
leg_width = 0.2,
mar = c(3, 2)
)
The next demonstration is that the component partitioning approximation works again (limited to last generation):
# plot all model kinship matrices (no estimates from genotypes this time)
plot_popkin(
list( kinship_admix_G, kinship_local_G, kinship_total_G, kinship_total_approx_G ),
titles = c('Admixture only', 'Family only', 'Total', 'Total approx.'),
leg_width = 0.2,
layout_rows = 2,
mar = c(3, 2)
)
Lastly, let’s determine how many individuals in each generation were ancestors of individuals in the last generation. The bar plot below shows that, in this large simulation, most individuals in the first generation (founders), and in every subsequent generation, had descendants in the last generation.
# the IDs in `fam` (the pruned FAM table) contain the individuals of interest
frac_per_gen <- vector('numeric', G)
for ( g in 1 : G ) {
# this is the desired fraction
frac_per_gen[g] <- sum( fam$id %in% ids[[ g ]] ) / n
}
# visualize as a barplot
barplot(
frac_per_gen,
names.arg = 1 : G,
xlab = 'Generation',
ylab = 'Frac. individuals with descendants'
)
# add a line marking the maximum fraction of individuals per generation
abline( h = 1, lty = 2, col = 'red' )
All previous examples had independent loci. Here we use new functions to simulate recombination breaks and therefore a realistic linkage disequilibrium (LD) structure consistent with the pedigree.
For comparison, let’s start from the same pedigree used in the last “large” example. Here we shall simulate the first two human autosomes (coordinates in genome version hg38) with a real recombination map (slightly simplified) provided with this package. The steps are lengthy primarily due to the non-trivial construction of ancestral haplotypes.
# real human recombination map is also used to define chromosome numbers and lengths
# however, to keep example brief we use only first two chromosomes
map <- recomb_map_hg38[ 1L:2L ]
# tricky part is need to define ancestral haplotypes, following the desired genetic structure
# will simulate each chromosome separately
haplos_anc <- vector( 'list', length( map ) )
for ( chr in 1L : length( map ) ) {
# Positions matter for genetic map, so for this example choose random positions with a
# desired density, here 100 per Kb on average, keep it sparse for small example
pos_max <- max( map[[ chr ]]$pos )
m_loci_chr <- round( pos_max / 100000L )
pos_chr <- sample.int( pos_max, m_loci_chr )
# Draw haplotypes. Here we use admixture code to get genotypes
# with same structure as before for first generation
X_chr <- draw_all_admix( admix_proportions_1, inbr_subpops, m_loci_chr )$X
# and randomly split genotypes into haplotypes
# (there was no LD/phase so this works well enough for our purposes)
H1_chr <- matrix( rbinom( X_chr, 1L, X_chr/2L ), nrow = m_loci_chr )
# second haplotype is complement of first given genotypes
H2_chr <- X_chr - H1_chr
# column names are required for code later, identifying ancestors (need _pat/mat suffixes)
colnames( H1_chr ) <- paste0( colnames( X_chr ), '_pat' )
colnames( H2_chr ) <- paste0( colnames( X_chr ), '_mat' )
# final data all together
X_chr <- cbind( H1_chr, H2_chr )
# add to structure, in a list
haplos_anc[[ chr ]] <- list( X = X_chr, pos = pos_chr )
}
# initialize trivial founders chromosomes (unrecombined chromosomes with unique labels)
# must use IDs of first generation from pedigree as input
founders <- recomb_init_founders( ids[[ 1 ]], map )
# draw recombination breaks along pedigree, with coordinates in genetic distance (centiMorgans)
inds <- recomb_last_gen( founders, fam, ids )
# map recombination break coordinates to base pairs
inds <- recomb_map_inds( inds, map )
# determine haplotypes of descendants given ancestral haplotypes
haplos <- recomb_haplo_inds( inds, haplos_anc )
# and reduce haplotype data to genotypes, same standard matrix format as previous examples
X_G_recomb <- recomb_geno_inds( haplos )
# total kinship estimated from genotypes (simulated with recombination)
kinship_total_G_recomb_est <- popkin( X_G_recomb )
First we validate that the kinship structure of these genotypes with recombination matches the earlier estimate from the simulation with independent loci as well as the expected kinship from the pedigree. Below it can be seen that the coarse structure is the same, although the data with recombination appears noisier, as expected because LD increases estimation variance. Simulating more chromosomes would reduce the variance considerably.
plot_popkin(
list( kinship_total_G, kinship_total_G_est, kinship_total_G_recomb_est ),
titles = c('Expected', 'Estimated (indep loci)', 'Estimated (recomb)'),
mar = c(3, 2)
)
Now let’s confirm the presence of LD. The below plot shows that genotype correlations are noisy overall (because the number of individuals is small in our toy example) but nevertheless, correlations within the two chromosomes in the simulation with recombination (two diagonal blocks in panel B below) are noticeably larger than correlations between chromosomes, as well as the correlations in the earlier simulation with independent loci (panel A).
# Notice that the number of loci is similar for both simulations
# (rows of the genotype matrices)
dim( X_G )
#> [1] 5000 500
dim( X_G_recomb )
#> [1] 4912 500
# Calculate correlation matrices between loci (hence transposition).
# Exclude fixed loci to prevent warnings about zero standard deviations.
ld_G <- abs( cor( t( X_G[ !fixed_loci( X_G ), ] ) ) )
ld_G_recomb <- abs( cor( t( X_G_recomb[ !fixed_loci( X_G_recomb ), ] ) ) )
# cap extreme values, to be able to see the relatively weak LD signal
ld_max <- 0.3
ld_G[ ld_G > ld_max ] <- ld_max
ld_G_recomb[ ld_G_recomb > ld_max ] <- ld_max
# plot!
plot_popkin(
list( ld_G, ld_G_recomb ),
titles = c('Indep loci', 'Recomb'),
leg_width = 0.2,
mar = c(3, 2),
leg_title = 'Absolute Correlation',
ylab = 'Loci'
)