Title: | Simulate and Model Family Pedigrees with Structured Founders |
---|---|
Description: | The focus is on simulating and modeling families with founders drawn from a structured population (for example, with different ancestries or other potentially non-family relatedness), in contrast to traditional pedigree analysis that treats all founders as equally unrelated. Main function simulates a random pedigree for many generations, avoiding close relatives, pairing closest individuals according to a 1D geography and their randomly-drawn sex, and with variable children sizes to result in a target population size per generation. Auxiliary functions calculate kinship matrices, admixture matrices, and draw random genotypes across arbitrary pedigree structures starting from the corresponding founder values. The code is built around the plink FAM table format for pedigrees. There are functions that simulate independent loci and also functions that use an explicit recombination model to simulate linkage disequilibrium (LD) in the pedigree, as well as population analogs resembling the Li-Stephens model. Described in Yao and Ochoa (2023) <doi:10.7554/eLife.79238>. |
Authors: | Alejandro Ochoa [aut, cre] |
Maintainer: | Alejandro Ochoa <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.1.13.9000 |
Built: | 2024-11-19 15:28:55 UTC |
Source: | https://github.com/ochoalab/simfam |
Calculates a full admixture proportions matrix (for all individuals in the provided pedigree FAM table) starting from the admixture proportions of the founders as provided.
admix_fam(admix, fam, missing_vals = c("", 0))
admix_fam(admix, fam, missing_vals = c("", 0))
admix |
The admixture proportions matrix of the founders (individuals along rows and ancestries along columns).
This matrix must have row names that identify each founder (matching codes in |
fam |
The pedigree data.frame, in plink FAM format.
Only columns |
missing_vals |
The list of ID values treated as missing.
|
The admixture proportions matrix of the entire fam
table, based on the admixture of the founders.
These are expectations, calculated for each individual as the average ancestry proportion of the parents.
The rows of this admixture matrix correspond to fam$id
in that order.
The columns (ancestries) are the same as in the input admix
.
Plink FAM format reference: https://www.cog-genomics.org/plink/1.9/formats#fam
# The smallest pedigree, two parents and a child. # A minimal fam table with the three required columns. # Note "mother" and "father" have missing parent IDs, while "child" does not library(tibble) fam <- tibble( id = c('father', 'mother', 'child'), pat = c(NA, NA, 'father'), mat = c(NA, NA, 'mother') ) # admixture proportions of the parents admix <- rbind( c(0.3, 0.3, 0.4), c(0.5, 0.25, 0.25) ) # Name the parents with same codes as in `fam` # (order can be different) rownames( admix ) <- c('mother', 'father') # name ancestries too colnames( admix ) <- c('African', 'European', 'Asian') # Calculate the full admixture proportions matrix admix_all <- admix_fam( admix, fam ) # This is a 3x3 matrix with row names matching fam$id. # The parent submatrix equals the input (reordered), # but now there's admixture to the child too (averages of parents) admix_all
# The smallest pedigree, two parents and a child. # A minimal fam table with the three required columns. # Note "mother" and "father" have missing parent IDs, while "child" does not library(tibble) fam <- tibble( id = c('father', 'mother', 'child'), pat = c(NA, NA, 'father'), mat = c(NA, NA, 'mother') ) # admixture proportions of the parents admix <- rbind( c(0.3, 0.3, 0.4), c(0.5, 0.25, 0.25) ) # Name the parents with same codes as in `fam` # (order can be different) rownames( admix ) <- c('mother', 'father') # name ancestries too colnames( admix ) <- c('African', 'European', 'Asian') # Calculate the full admixture proportions matrix admix_all <- admix_fam( admix, fam ) # This is a 3x3 matrix with row names matching fam$id. # The parent submatrix equals the input (reordered), # but now there's admixture to the child too (averages of parents) admix_all
A wrapper around the more general admix_fam()
, specialized to save memory when only the last generation is desired (admix_fam()
returns admixture for the entire pedigree in a single matrix).
This function assumes that generations are non-overlapping (met by the output of sim_pedigree()
), in which case each generation g
can be drawn from generation g-1
data only.
That way, only two consecutive generations need be in memory at any given time.
The partitioning of individuals into generations is given by the ids
parameter (again matches the output of sim_pedigree()
).
admix_last_gen(admix, fam, ids, missing_vals = c("", 0))
admix_last_gen(admix, fam, ids, missing_vals = c("", 0))
admix |
The admixture proportions matrix of the founders (individuals along rows and ancestries along columns).
This matrix must have row names that identify each founder (matching codes in |
fam |
The pedigree data.frame, in plink FAM format.
Only columns |
ids |
A list containing vectors of IDs for each generation.
All these IDs must be present in |
missing_vals |
The list of ID values treated as missing.
|
The admixture proportions matrix of the last generation (the intersection of ids[ length(ids) ]
and fam$id
).
The rows of this matrix are last-generation individuals in the order that they appear in fam$id
.
Plink FAM format reference: https://www.cog-genomics.org/plink/1.9/formats#fam
# A small pedigree, two parents and two children. # A minimal fam table with the three required columns. # Note "mother" and "father" have missing parent IDs, while children do not library(tibble) fam <- tibble( id = c('father', 'mother', 'child', 'sib'), pat = c(NA, NA, 'father', 'father'), mat = c(NA, NA, 'mother', 'mother') ) # need an `ids` list separating the generations ids <- list( c('father', 'mother'), c('child', 'sib') ) # admixture proportions of the parents admix <- rbind( c(0.3, 0.3, 0.4), c(0.5, 0.25, 0.25) ) # Name the parents with same codes as in `fam` # (order can be different) rownames( admix ) <- c('mother', 'father') # name ancestries too colnames( admix ) <- c('African', 'European', 'Asian') # calculate the admixture matrix of the children admix2 <- admix_last_gen( admix, fam, ids ) admix2
# A small pedigree, two parents and two children. # A minimal fam table with the three required columns. # Note "mother" and "father" have missing parent IDs, while children do not library(tibble) fam <- tibble( id = c('father', 'mother', 'child', 'sib'), pat = c(NA, NA, 'father', 'father'), mat = c(NA, NA, 'mother', 'mother') ) # need an `ids` list separating the generations ids <- list( c('father', 'mother'), c('child', 'sib') ) # admixture proportions of the parents admix <- rbind( c(0.3, 0.3, 0.4), c(0.5, 0.25, 0.25) ) # Name the parents with same codes as in `fam` # (order can be different) rownames( admix ) <- c('mother', 'father') # name ancestries too colnames( admix ) <- c('African', 'European', 'Asian') # calculate the admixture matrix of the children admix2 <- admix_last_gen( admix, fam, ids ) admix2
Given a table of base pair positions (a data frame with chromosome and position values), and a genetic map (see recomb_map_hg
), this function calculates genetic positions.
If genetic positions existed in input, they are overwritten.
bim_add_posg(bim, map)
bim_add_posg(bim, map)
bim |
The table of variants, which is a data.frame/tibble with at least two columns: |
map |
The genetic map, a list of chromosomes each of which is a data.frame/tibble with columns |
Base pair positions are converted to genetic positions from the provided map using linear interpolation, using stats::approx()
with options rule = 2
(out of range cases are set to nearest end's value) and ties = list( 'ordered', mean )
(assume data is ordered, interpolate ties in base pair positions in map using mean of genetic positions).
Output will be incorrect, without throwing errors, if genetic map is not ordered.
The bim
input with new or overwritten column posg
of genetic positions in cM.
Rows with values of chr
that are not numeric or are out of range (for given map
) are unedited if the posg
column was present, or assigned NA
otherwise.
recomb_map_hg
for simplified human recombination maps included in this package.
# let's define a very simple table of base pair positions, with minimal information library(tibble) bim <- tibble( chr = c(1, 1, 2, 2), pos = c(50, 200, 30, 123) * 1000000 ) # use latest human recombination map map <- recomb_map_hg38 # now use this function to add genetic positions column to `bim`! bim <- bim_add_posg( bim, map )
# let's define a very simple table of base pair positions, with minimal information library(tibble) bim <- tibble( chr = c(1, 1, 2, 2), pos = c(50, 200, 30, 123) * 1000000 ) # use latest human recombination map map <- recomb_map_hg38 # now use this function to add genetic positions column to `bim`! bim <- bim_add_posg( bim, map )
Each individual has their sex drawn between male and female with equal probability. Sex is encoded numerically following the convention for plink FAM files (see below).
draw_sex(n)
draw_sex(n)
n |
The number of individuals. |
The length-n
vector of integer sex assignments: 1L
corresponds to male, 2L
to female.
Plink FAM format reference: https://www.cog-genomics.org/plink/1.9/formats#fam
draw_sex( 10 )
draw_sex( 10 )
G
-generations deepCreates an idealized pedigree listing all ancestors of one individual starting from G
generations ago, without inbreeding (a binary tree).
IDs are automatically generated strings indicating generation and individual number within generation.
Useful for simple simulations of individuals with explicit ancestors.
fam_ancestors(G)
fam_ancestors(G)
G |
The desired number of generations.
|
A list with two named elements:
fam
: a tibble describing the pedigree, with the following columns
id
: The ID of each individual, a string in the format "g-i" joining with a dash the generation number ("g", numbered backward in time) and the individual number within the generation ("i").
pat
: The paternal ID. For individual "g-i" parent is (g+1)"-"(2*i-1), except for last generation it is NA
(their parents are missing).
mat
: The maternal ID. For individual "g-i" parent is (g+1)"-"(2*i), except for last generation it is NA
(their parents are missing).
sex
: 1 (male) for all odd-numbered individuals, 2 (female) for even-numbered individuals, consistent with pedigree structure. Side-effect is first-generation individual ("1-1") is always male (edit afterwards as desired).
ids
: A list containing vectors of IDs separated by generation, but here starting from the last generation (highest "g"), to be consistent with output of sim_pedigree()
and the expected input of all *_last_gen
functions.
sim_pedigree()
to simulate a random pedigree with a given number of generations, generation sizes, and other parameters.
# construct the 8-generation ancestor tree of one individual: data <- fam_ancestors( 8 ) # this is the pedigree fam <- data$fam # and this is the handy list of IDs by discrete generation, # used by `*_last_gen` functions to reduce memory usage ids <- data$ids
# construct the 8-generation ancestor tree of one individual: data <- fam_ancestors( 8 ) # this is the pedigree fam <- data$fam # and this is the handy list of IDs by discrete generation, # used by `*_last_gen` functions to reduce memory usage ids <- data$ids
Constructs a random genotype matrix (for all individuals in the provided pedigree FAM table) starting from the genotype matrix of the founders as provided.
geno_fam(X, fam, missing_vals = c("", 0))
geno_fam(X, fam, missing_vals = c("", 0))
X |
The genotype matrix of the founders (loci along rows, individuals along columns).
This matrix must have column names that identify each founder (matching codes in |
fam |
The pedigree data.frame, in plink FAM format.
Only columns |
missing_vals |
The list of ID values treated as missing.
|
The random genotype matrix of the entire fam
table, starting from the genotypes of the founders.
The columns of this matrix correspond to fam$id
in that order.
The rows (loci) are the same as in the input X
.
Plink FAM format reference: https://www.cog-genomics.org/plink/1.9/formats#fam
# The smallest pedigree, two parents and a child. # A minimal fam table with the three required columns. # Note "mother" and "father" have missing parent IDs, while "child" does not library(tibble) fam <- tibble( id = c('father', 'mother', 'child'), pat = c(NA, NA, 'father'), mat = c(NA, NA, 'mother') ) # genotypes of the parents at 4 loci X <- cbind( c(1, 2, 0, 2), c(0, 2, 2, 1) ) # Name the parents with same codes as in `fam` # (order can be different) colnames( X ) <- c('mother', 'father') # name loci too rownames( X ) <- paste0( 'rs', 1:4 ) # Draw the full genotype matrix X_all <- geno_fam( X, fam ) # This is a 4x3 matrix with column names matching fam$id. # The parent submatrix equals the input (reordered), # but now there's random genotypes for the child too X_all
# The smallest pedigree, two parents and a child. # A minimal fam table with the three required columns. # Note "mother" and "father" have missing parent IDs, while "child" does not library(tibble) fam <- tibble( id = c('father', 'mother', 'child'), pat = c(NA, NA, 'father'), mat = c(NA, NA, 'mother') ) # genotypes of the parents at 4 loci X <- cbind( c(1, 2, 0, 2), c(0, 2, 2, 1) ) # Name the parents with same codes as in `fam` # (order can be different) colnames( X ) <- c('mother', 'father') # name loci too rownames( X ) <- paste0( 'rs', 1:4 ) # Draw the full genotype matrix X_all <- geno_fam( X, fam ) # This is a 4x3 matrix with column names matching fam$id. # The parent submatrix equals the input (reordered), # but now there's random genotypes for the child too X_all
A wrapper around the more general geno_fam()
, specialized to save memory when only the last generation is desired (geno_fam()
returns genotypes for the entire pedigree in a single matrix).
This function assumes that generations are non-overlapping (met by the output of sim_pedigree()
), in which case each generation g
can be drawn from generation g-1
data only.
That way, only two consecutive generations need be in memory at any given time.
The partitioning of individuals into generations is given by the ids
parameter (again matches the output of sim_pedigree()
).
geno_last_gen(X, fam, ids, missing_vals = c("", 0))
geno_last_gen(X, fam, ids, missing_vals = c("", 0))
X |
The genotype matrix of the founders (loci along rows, individuals along columns).
This matrix must have column names that identify each founder (matching codes in |
fam |
The pedigree data.frame, in plink FAM format.
Only columns |
ids |
A list containing vectors of IDs for each generation.
All these IDs must be present in |
missing_vals |
The list of ID values treated as missing.
|
The random genotype matrix of the last generation (the intersection of ids[ length(ids) ]
and fam$id
).
The columns of this matrix are last-generation individuals in the order that they appear in fam$id
.
The rows (loci) are the same as in the input X
.
Plink FAM format reference: https://www.cog-genomics.org/plink/1.9/formats#fam
# A small pedigree, two parents and two children. # A minimal fam table with the three required columns. # Note "mother" and "father" have missing parent IDs, while children do not library(tibble) fam <- tibble( id = c('father', 'mother', 'child', 'sib'), pat = c(NA, NA, 'father', 'father'), mat = c(NA, NA, 'mother', 'mother') ) # need an `ids` list separating the generations ids <- list( c('father', 'mother'), c('child', 'sib') ) # genotypes of the parents at 4 loci X <- cbind( c(1, 2, 0, 2), c(0, 2, 2, 1) ) # Name the parents with same codes as in `fam` # (order can be different) colnames( X ) <- c('mother', 'father') # name loci too rownames( X ) <- paste0( 'rs', 1:4 ) # Draw the genotype matrix of the children X2 <- geno_last_gen( X, fam, ids ) X2
# A small pedigree, two parents and two children. # A minimal fam table with the three required columns. # Note "mother" and "father" have missing parent IDs, while children do not library(tibble) fam <- tibble( id = c('father', 'mother', 'child', 'sib'), pat = c(NA, NA, 'father', 'father'), mat = c(NA, NA, 'mother', 'mother') ) # need an `ids` list separating the generations ids <- list( c('father', 'mother'), c('child', 'sib') ) # genotypes of the parents at 4 loci X <- cbind( c(1, 2, 0, 2), c(0, 2, 2, 1) ) # Name the parents with same codes as in `fam` # (order can be different) colnames( X ) <- c('mother', 'father') # name loci too rownames( X ) <- paste0( 'rs', 1:4 ) # Draw the genotype matrix of the children X2 <- geno_last_gen( X, fam, ids ) X2
This function in essence combines pop_recomb()
to simulate founders of known ancestries with LD (following a Li-Stephens-like model), draws recombination breaks of focal last-generation descendants from the specified pedigree using recomb_last_gen()
, and their genomes from the founders variants using recomb_haplo_inds()
.
However, since a limited portion of founder sequences is actually inherited, the simulation is made much more efficient by simulating only those subsequences that were inherited, which saves time, and utilizing sparse matrices, which saves memory too.
See below for a more detailed algorithm.
geno_last_gen_admix_recomb( anc_haps, bim, map, G, fam, ids, founders_anc, indexes_chr_ends = NULL, loci_on_cols = FALSE, missing_vals = c("", 0) )
geno_last_gen_admix_recomb( anc_haps, bim, map, G, fam, ids, founders_anc, indexes_chr_ends = NULL, loci_on_cols = FALSE, missing_vals = c("", 0) )
anc_haps |
A named list that maps the code used for each ancestry to its haplotype matrix.
Each of the haplotype matrices the argument |
bim |
The table of variants of |
map |
The genetic map, a list of chromosomes each of which is a data.frame/tibble with columns |
G |
Number of generations since most recent common ancestor of population (to multiply standard recombination rate) |
fam |
The pedigree data.frame, in plink FAM format.
Only columns |
ids |
A list containing vectors of IDs for each generation.
All these IDs must be present in |
founders_anc |
a named vector that maps every founder haplotype (the names of this vector) to its ancestry code.
Ancestry codes must match the codes used in |
indexes_chr_ends |
Optional vector mapping each chromosome (index in vector) to the last index in the |
loci_on_cols |
If |
missing_vals |
The list of ID values treated as missing.
|
This function wraps around several exported package functions to achieve its objectives, which are roughly grouped into the following 4 phases.
Phase 1 simulates recombination in the family without explicit sequences.
In particular, it initializes the founder haplotype structure (without variants yet) using recomb_init_founders()
, then simulates recombination breaks along the pedigree and identifies all the founder haplotype blocks in the focal individuals using recomb_last_gen()
, and maps recombination breaks in cM to basepairs using recomb_map_inds()
.
Phase 2 reorganizes this data to identify the unique founder blocks that were inherited, first by making the data tidy with tidy_recomb_map_inds()
, then applying recomb_founder_blocks_inherited()
.
Phase 3 initializes founder haplotypes using sparse matrices from the package Matrix
, and draws inherited founder subsequences according to their known ancestries and using the Li-Stephens-like haplotype model of pop_recomb()
.
Phase 4 constructs the genotype matrices of focal individuals using the haplotypes of the founders drawn in phase 3 and the known origin of focal blocks from founders from phase 1, first constructing this data at the phased haplotype level with recomb_haplo_inds()
, reencoding as unphased genotypes using recomb_geno_inds()
, and constructing the corresponding local ancestry dosages using recomb_admix_inds()
.
A named list with three elements:
X
: the genotype matrix of the focal individuals, as returned by recomb_geno_inds()
.
Ls
: a list, mapping each ancestry to its matrix of local ancestry dosages, as returned by recomb_admix_inds()
.
haplos
: a phased version of the haplotypes and local ancestries of the focal individuals, structured as nested lists, as returned by recomb_haplo_inds()
.
recomb_init_founders()
, recomb_last_gen()
, recomb_map_inds()
, tidy_recomb_map_inds()
, recomb_founder_blocks_inherited()
, pop_recomb()
, recomb_haplo_inds()
, recomb_geno_inds()
, recomb_admix_inds()
.
library(tibble) # simulate random haplotypes for example # this toy data has 10 SNPs per chromosome, in fixed positions for simplicity bim <- tibble( chr = rep( 1 : 22, each = 10 ), pos = rep( (1:10) * 1e6, 22 ) ) # and random haplotype data to go with this n_ind_hap <- 10 m_loci <- nrow( bim ) # NOTE ancestry labels can be anything but must match `founders_anc` below anc_haps <- list( 'AFR' = matrix( rbinom( m_loci * n_ind_hap, 1L, 0.5 ), nrow = m_loci, ncol = n_ind_hap ), 'EUR' = matrix( rbinom( m_loci * n_ind_hap, 1L, 0.2 ), nrow = m_loci, ncol = n_ind_hap ) ) # now simulate a very small family with one individual, 2 parents, 4 implicit grandparents data <- fam_ancestors( 2 ) fam <- data$fam ids <- data$ids # select ancestries for each of the 4 grandparents / founder haplotypes (unadmixed) founders_anc <- c('AFR', 'AFR', 'AFR', 'EUR') # set names of founders with _pat/mat, needed to match recombination structure # order is odd but choices were random so that doesn't matter names( founders_anc ) <- c( paste0( ids[[1]], '_pat' ), paste0( ids[[1]], '_mat' ) ) # this performs the simulation! data <- geno_last_gen_admix_recomb( anc_haps, bim, recomb_map_hg38, 10, fam, ids, founders_anc ) # this is the genotype matrix for the one admixed individual data$X # the corresponding local ancestry dosage matrices # names match input labels data$Ls$AFR data$Ls$EUR # if desired, a more complete but more complicated structure holding phased haplotypes # and phased local ancestry information data$haplos
library(tibble) # simulate random haplotypes for example # this toy data has 10 SNPs per chromosome, in fixed positions for simplicity bim <- tibble( chr = rep( 1 : 22, each = 10 ), pos = rep( (1:10) * 1e6, 22 ) ) # and random haplotype data to go with this n_ind_hap <- 10 m_loci <- nrow( bim ) # NOTE ancestry labels can be anything but must match `founders_anc` below anc_haps <- list( 'AFR' = matrix( rbinom( m_loci * n_ind_hap, 1L, 0.5 ), nrow = m_loci, ncol = n_ind_hap ), 'EUR' = matrix( rbinom( m_loci * n_ind_hap, 1L, 0.2 ), nrow = m_loci, ncol = n_ind_hap ) ) # now simulate a very small family with one individual, 2 parents, 4 implicit grandparents data <- fam_ancestors( 2 ) fam <- data$fam ids <- data$ids # select ancestries for each of the 4 grandparents / founder haplotypes (unadmixed) founders_anc <- c('AFR', 'AFR', 'AFR', 'EUR') # set names of founders with _pat/mat, needed to match recombination structure # order is odd but choices were random so that doesn't matter names( founders_anc ) <- c( paste0( ids[[1]], '_pat' ), paste0( ids[[1]], '_mat' ) ) # this performs the simulation! data <- geno_last_gen_admix_recomb( anc_haps, bim, recomb_map_hg38, 10, fam, ids, founders_anc ) # this is the genotype matrix for the one admixed individual data$X # the corresponding local ancestry dosage matrices # names match input labels data$Ls$AFR data$Ls$EUR # if desired, a more complete but more complicated structure holding phased haplotypes # and phased local ancestry information data$haplos
This function tests that chromosomes appear in their numerical order (dies with an error otherwise), and returns the last index of each chromosome.
indexes_chr(chrs)
indexes_chr(chrs)
chrs |
Vector of integer chromosomes as it appears in a |
Vector of end indexes per chromosome.
The numerical value of the chromosome is its index in this vector.
Chromosomes that were not observed have NA
values as their end indexes.
# the end of chr1 is at index 4, for chr2 it's 7, and for chr3 it's 9 indexes_chr( c(1,1,1,1,2,2,2,3,3) )
# the end of chr1 is at index 4, for chr2 it's 7, and for chr3 it's 9 indexes_chr( c(1,1,1,1,2,2,2,3,3) )
Calculates a full kinship matrix (between all individuals in the provided pedigree FAM table) taking into account the relatedness of the founders as provided.
Output agrees with kinship2::kinship()
but only when founders are unrelated/outbred (in other words, that function does not allow relatedness between founders).
kinship_fam(kinship, fam, missing_vals = c("", 0))
kinship_fam(kinship, fam, missing_vals = c("", 0))
kinship |
The kinship matrix of the founders.
This matrix must have column and row names that identify each founder (matching codes in |
fam |
The pedigree data.frame, in plink FAM format.
Only columns |
missing_vals |
The list of ID values treated as missing.
|
The kinship matrix of the entire fam
table, taking the relatedness of the founders into account.
The rows and columns of this kinship matrix correspond to fam$id
in that order.
Plink FAM format reference: https://www.cog-genomics.org/plink/1.9/formats#fam
# The smallest pedigree, two parents and a child. # A minimal fam table with the three required columns. # Note "mother" and "father" have missing parent IDs, while "child" does not library(tibble) fam <- tibble( id = c('father', 'mother', 'child'), pat = c(NA, NA, 'father'), mat = c(NA, NA, 'mother') ) # Kinship of the parents, here two unrelated/outbred individuals: kinship <- diag(2)/2 # Name the parents with same codes as in `fam` # (order can be different) colnames( kinship ) <- c('mother', 'father') rownames( kinship ) <- c('mother', 'father') # For a clearer example, make the father slightly inbred # (a self-kinship value that exceeds 1/2): kinship[2,2] <- 0.6 # Calculate the full kinship matrix kinship_all <- kinship_fam( kinship, fam ) # This is a 3x3 matrix with row/col names matching fam$id. # The parent submatrix equals the input (reordered), # but now there's relatedness to the child too kinship_all
# The smallest pedigree, two parents and a child. # A minimal fam table with the three required columns. # Note "mother" and "father" have missing parent IDs, while "child" does not library(tibble) fam <- tibble( id = c('father', 'mother', 'child'), pat = c(NA, NA, 'father'), mat = c(NA, NA, 'mother') ) # Kinship of the parents, here two unrelated/outbred individuals: kinship <- diag(2)/2 # Name the parents with same codes as in `fam` # (order can be different) colnames( kinship ) <- c('mother', 'father') rownames( kinship ) <- c('mother', 'father') # For a clearer example, make the father slightly inbred # (a self-kinship value that exceeds 1/2): kinship[2,2] <- 0.6 # Calculate the full kinship matrix kinship_all <- kinship_fam( kinship, fam ) # This is a 3x3 matrix with row/col names matching fam$id. # The parent submatrix equals the input (reordered), # but now there's relatedness to the child too kinship_all
A wrapper around the more general kinship_fam()
, specialized to save memory when only the last generation is desired (kinship_fam()
returns kinship for the entire pedigree in a single matrix).
This function assumes that generations are non-overlapping (met by the output of sim_pedigree()
), in which case each generation g
can be drawn from generation g-1
data only.
That way, only two consecutive generations need be in memory at any given time.
The partitioning of individuals into generations is given by the ids
parameter (again matches the output of sim_pedigree()
).
kinship_last_gen(kinship, fam, ids, missing_vals = c("", 0))
kinship_last_gen(kinship, fam, ids, missing_vals = c("", 0))
kinship |
The kinship matrix of the founders.
This matrix must have column and row names that identify each founder (matching codes in |
fam |
The pedigree data.frame, in plink FAM format.
Only columns |
ids |
A list containing vectors of IDs for each generation.
All these IDs must be present in |
missing_vals |
The list of ID values treated as missing.
|
The kinship matrix of the last generation (the intersection of ids[ length(ids) ]
and fam$id
).
The columns/rows of this matrix are last-generation individuals in the order that they appear in fam$id
.
Plink FAM format reference: https://www.cog-genomics.org/plink/1.9/formats#fam
# A small pedigree, two parents and two children. # A minimal fam table with the three required columns. # Note "mother" and "father" have missing parent IDs, while children do not library(tibble) fam <- tibble( id = c('father', 'mother', 'child', 'sib'), pat = c(NA, NA, 'father', 'father'), mat = c(NA, NA, 'mother', 'mother') ) # need an `ids` list separating the generations ids <- list( c('father', 'mother'), c('child', 'sib') ) # Kinship of the parents, here two unrelated/outbred individuals: kinship <- diag(2)/2 # Name the parents with same codes as in `fam` # (order can be different) colnames( kinship ) <- c('mother', 'father') rownames( kinship ) <- c('mother', 'father') # For a clearer example, make the father slightly inbred # (a self-kinship value that exceeds 1/2): kinship[2,2] <- 0.6 # calculate the kinship matrix of the children kinship2 <- kinship_last_gen( kinship, fam, ids ) kinship2
# A small pedigree, two parents and two children. # A minimal fam table with the three required columns. # Note "mother" and "father" have missing parent IDs, while children do not library(tibble) fam <- tibble( id = c('father', 'mother', 'child', 'sib'), pat = c(NA, NA, 'father', 'father'), mat = c(NA, NA, 'mother', 'mother') ) # need an `ids` list separating the generations ids <- list( c('father', 'mother'), c('child', 'sib') ) # Kinship of the parents, here two unrelated/outbred individuals: kinship <- diag(2)/2 # Name the parents with same codes as in `fam` # (order can be different) colnames( kinship ) <- c('mother', 'father') rownames( kinship ) <- c('mother', 'father') # For a clearer example, make the father slightly inbred # (a self-kinship value that exceeds 1/2): kinship[2,2] <- 0.6 # calculate the kinship matrix of the children kinship2 <- kinship_last_gen( kinship, fam, ids ) kinship2
Each new genome has breaks drawn from the recombination model accelerated for the desired number of generations G
, and haplotypes are drawn and copied randomly from the population.
This results in a population with individuals drawn independently and identically distributed but with LD within individuals, since the ancestral LD is preserved to some extent (attenuated by recombination after G
generations).
If there is no LD in haps
, then the output will not have LD either except when the number of columns of haps
is very small (which resembles a bottleneck).
This model does not introduce mutations (unlike the original Li-Stephens).
Genotypes, when requested, are simply sums of independently drawn haplotype values.
pop_recomb( haps, bim, map, G, n_ind, geno = TRUE, loci_on_cols = FALSE, indexes_loci = NULL, indexes_chr_ends = NULL )
pop_recomb( haps, bim, map, G, n_ind, geno = TRUE, loci_on_cols = FALSE, indexes_loci = NULL, indexes_chr_ends = NULL )
haps |
Regular matrix or |
bim |
The table of variants of |
map |
The genetic map, a list of chromosomes each of which is a data.frame/tibble with columns |
G |
Number of generations since most recent common ancestor of population (to multiply standard recombination rate) |
n_ind |
Number of individuals (if geno = TRUE) or haplotypes (half individuals, if geno = FALSE) desired in output |
geno |
If |
loci_on_cols |
If |
indexes_loci |
Vector of index ranges (start, end) of loci to simulate, to request to simulate only a subset of the loci in |
indexes_chr_ends |
Optional vector mapping each chromosome (index in vector) to the last index in the |
A matrix with the same number of rows as haps
and n_ind
columns, with values copied from haps
in (recombination) blocks if geno = FALSE
, or sums of two such values drawn independently when geno = TRUE
.
# simulate a tiny population with few SNPs for example library(tibble) bim <- tibble( chr = c(2, 2, 3, 3, 22), pos = c(100, 121, 53, 154, 66) * 1e6 ) m_loci <- nrow( bim ) # Most often, haplotypes are binary data as simulated here. # Here haplotypes will be totally unstructured, but to have LD in the output use real human data # or data simulated to have LD n_ind_haps <- 5 haps <- matrix( rbinom( m_loci * n_ind_haps, 1, 0.5 ), nrow = m_loci, ncol = n_ind_haps ) # makes sense to have a lot of recombination at the population level G <- 500 # ask for small output for example n_ind <- 7 # use the recombination map for the same genome build as your data! map <- recomb_map_hg38 # simulate genotypes! (Usually more convenient, but phase information is lost) X <- pop_recomb( haps, bim, map, G, n_ind ) # simulate haplotypes instead (preserves true phase) H <- pop_recomb( haps, bim, map, G, n_ind, geno = FALSE )
# simulate a tiny population with few SNPs for example library(tibble) bim <- tibble( chr = c(2, 2, 3, 3, 22), pos = c(100, 121, 53, 154, 66) * 1e6 ) m_loci <- nrow( bim ) # Most often, haplotypes are binary data as simulated here. # Here haplotypes will be totally unstructured, but to have LD in the output use real human data # or data simulated to have LD n_ind_haps <- 5 haps <- matrix( rbinom( m_loci * n_ind_haps, 1, 0.5 ), nrow = m_loci, ncol = n_ind_haps ) # makes sense to have a lot of recombination at the population level G <- 500 # ask for small output for example n_ind <- 7 # use the recombination map for the same genome build as your data! map <- recomb_map_hg38 # simulate genotypes! (Usually more convenient, but phase information is lost) X <- pop_recomb( haps, bim, map, G, n_ind ) # simulate haplotypes instead (preserves true phase) H <- pop_recomb( haps, bim, map, G, n_ind, geno = FALSE )
This function accepts an input pedigree and a list of individuals of interest, and returns the subset of the pedigree including only the individuals of interest and their direct ancestors. This is useful in simulations, to avoid modeling/drawing genotypes of individuals without descendants in the last generation.
prune_fam(fam, ids, missing_vals = c("", 0))
prune_fam(fam, ids, missing_vals = c("", 0))
fam |
The pedigree data.frame, in plink FAM format.
Only columns |
ids |
The list of individuals of interest, whose ancestors we want to keep.
All must be present in |
missing_vals |
The list of ID values treated as missing.
|
The filtered FAM table with non-ancestors of ids
excluded.
IDs that are NA
-equivalent (see missing_vals
) will be mapped to NA
.
# construct a family with three founders, but one "bob" has no descendants library(tibble) fam <- tibble( id = c('mom', 'dad', 'bob', 'child'), pat = c( NA, NA, NA, 'dad'), mat = c( NA, NA, NA, 'mom') ) # only want 'child' and its ancestors ids <- 'child' fam2 <- prune_fam( fam, ids ) # the filtered pedigree has "bob" removed: fam2
# construct a family with three founders, but one "bob" has no descendants library(tibble) fam <- tibble( id = c('mom', 'dad', 'bob', 'child'), pat = c( NA, NA, NA, 'dad'), mat = c( NA, NA, NA, 'mom') ) # only want 'child' and its ancestors ids <- 'child' fam2 <- prune_fam( fam, ids ) # the filtered pedigree has "bob" removed: fam2
This function accepts haplotype data, such as the output from recomb_haplo_inds()
with ret_anc = TRUE
(required), and reduces it to a list of population ancestry dosage matrices.
In this context, "ancestors/ancestry" refer to haplotype blocks from specific ancestor individuals, whereas "population ancestry" groups these ancestors into populations (such as African, European, etc.).
Although the haplotype data separates individuals and chromosomes into lists (the way it is simulated), the output matrices concatenates data from all chromosomes into a single matrix, as it appears in simpler simulations and real data, and matching the format of recomb_geno_inds()
.
recomb_admix_inds(haplos, anc_map, pops = sort(unique(anc_map$pop)))
recomb_admix_inds(haplos, anc_map, pops = sort(unique(anc_map$pop)))
haplos |
A list of diploid individuals, each of which is a list with two haploid individuals named |
anc_map |
A data.frame or tibble with two columns: |
pops |
Optional order of populations in output, by default sorted alphabetically from |
A named list of population ancestry dosage matrices, ordered as in pops
, each of which counts populations in both alleles (in 0, 1, 2), with individuals along columns in same order as haplos
list, and loci along rows in order of appearance concatenating chromosomes in numerical order.
recomb_fam()
for drawing recombination (ancestor) blocks, defined in terms of genetic distance.
recomb_map_inds()
for transforming genetic to basepair coordinates given a genetic map.
recomb_haplo_inds()
for determining haplotypes of descendants given ancestral haplotypes (creates input to this function).
# Lengthy code creates individuals with recombination data to map # The smallest pedigree, two parents and a child (minimal fam table). library(tibble) fam <- tibble( id = c('father', 'mother', 'child'), pat = c(NA, NA, 'father'), mat = c(NA, NA, 'mother') ) # use latest human recombination map, but just first two chrs to keep this example fast map <- recomb_map_hg38[ 1L:2L ] # initialize parents with this other function founders <- recomb_init_founders( c('father', 'mother'), map ) # draw recombination breaks for child inds <- recomb_fam( founders, fam ) # now add base pair coordinates to recombination breaks inds <- recomb_map_inds( inds, map ) # also need ancestral haplotypes # these should be simulated carefully as needed, but for this example we make random data haplo <- vector( 'list', length( map ) ) # names of ancestor haplotypes for this scenario # (founders of fam$id but each with "_pat" and "_mat" suffixes) anc_names <- c( 'father_pat', 'father_mat', 'mother_pat', 'mother_mat' ) n_ind <- length( anc_names ) # number of loci per chr, for toy test m_loci <- 10L for ( chr in 1L : length( map ) ) { # draw random positions pos_chr <- sample.int( max( map[[ chr ]]$pos ), m_loci ) # draw haplotypes X_chr <- matrix( rbinom( m_loci * n_ind, 1L, 0.5 ), nrow = m_loci, ncol = n_ind ) # required column names! colnames( X_chr ) <- anc_names # add to structure, in a list haplo[[ chr ]] <- list( X = X_chr, pos = pos_chr ) } # determine haplotypes and per-position ancestries of descendants given ancestral haplotypes haplos <- recomb_haplo_inds( inds, haplo, ret_anc = TRUE ) # define individual to population ancestry map # take four ancestral haplotypes from above, assign them population labels anc_map <- tibble( anc = anc_names, pop = c('African', 'European', 'African', 'African') ) # finally, run desired function! # convert haplotypes structure to list of population ancestry dosage matrices Xs <- recomb_admix_inds( haplos, anc_map )
# Lengthy code creates individuals with recombination data to map # The smallest pedigree, two parents and a child (minimal fam table). library(tibble) fam <- tibble( id = c('father', 'mother', 'child'), pat = c(NA, NA, 'father'), mat = c(NA, NA, 'mother') ) # use latest human recombination map, but just first two chrs to keep this example fast map <- recomb_map_hg38[ 1L:2L ] # initialize parents with this other function founders <- recomb_init_founders( c('father', 'mother'), map ) # draw recombination breaks for child inds <- recomb_fam( founders, fam ) # now add base pair coordinates to recombination breaks inds <- recomb_map_inds( inds, map ) # also need ancestral haplotypes # these should be simulated carefully as needed, but for this example we make random data haplo <- vector( 'list', length( map ) ) # names of ancestor haplotypes for this scenario # (founders of fam$id but each with "_pat" and "_mat" suffixes) anc_names <- c( 'father_pat', 'father_mat', 'mother_pat', 'mother_mat' ) n_ind <- length( anc_names ) # number of loci per chr, for toy test m_loci <- 10L for ( chr in 1L : length( map ) ) { # draw random positions pos_chr <- sample.int( max( map[[ chr ]]$pos ), m_loci ) # draw haplotypes X_chr <- matrix( rbinom( m_loci * n_ind, 1L, 0.5 ), nrow = m_loci, ncol = n_ind ) # required column names! colnames( X_chr ) <- anc_names # add to structure, in a list haplo[[ chr ]] <- list( X = X_chr, pos = pos_chr ) } # determine haplotypes and per-position ancestries of descendants given ancestral haplotypes haplos <- recomb_haplo_inds( inds, haplo, ret_anc = TRUE ) # define individual to population ancestry map # take four ancestral haplotypes from above, assign them population labels anc_map <- tibble( anc = anc_names, pop = c('African', 'European', 'African', 'African') ) # finally, run desired function! # convert haplotypes structure to list of population ancestry dosage matrices Xs <- recomb_admix_inds( haplos, anc_map )
Create random recombination breaks for all autosomes of all individuals in the provided pedigree FAM table. Recombination lengths follow an exponential distribution with mean of 100 centiMorgans (cM). The output specifies identical-by-descent (IBD) blocks as ranges per chromosome (per individual) and the founder chromosome they arose from (are IBD with). All calculations are in terms of genetic distance (not base pairs), and no genotypes are constructed/drawn in this step.
recomb_fam(founders, fam, missing_vals = c("", 0))
recomb_fam(founders, fam, missing_vals = c("", 0))
founders |
The named list of founders with their chromosomes.
For unstructured founders, initialize with |
fam |
The pedigree data.frame, in plink FAM format.
Only columns |
missing_vals |
The list of ID values treated as missing.
|
The list of individuals with recombined chromosomes of the entire fam
table, in the same format as founders
above.
The names of this list correspond to fam$id
in that order.
recomb_init_founders()
to initialize founders
for this function.
Plink FAM format reference: https://www.cog-genomics.org/plink/1.9/formats#fam
# The smallest pedigree, two parents and a child. # A minimal fam table with the three required columns. # Note "mother" and "father" have missing parent IDs, while "child" does not library(tibble) fam <- tibble( id = c('father', 'mother', 'child'), pat = c(NA, NA, 'father'), mat = c(NA, NA, 'mother') ) # initialize parents with this other function # Name the parents with same codes as in `fam` # (order can be different) ids <- c('mother', 'father') # simulate three chromosomes with these lengths in cM lengs <- c(50, 100, 150) founders <- recomb_init_founders( ids, lengs ) # draw recombination breaks for the whole fam table now: inds <- recomb_fam( founders, fam ) # This is a length-3 list with names matching fam$id. # The parent data equals the input (reordered), # but now there's data to the child too inds
# The smallest pedigree, two parents and a child. # A minimal fam table with the three required columns. # Note "mother" and "father" have missing parent IDs, while "child" does not library(tibble) fam <- tibble( id = c('father', 'mother', 'child'), pat = c(NA, NA, 'father'), mat = c(NA, NA, 'mother') ) # initialize parents with this other function # Name the parents with same codes as in `fam` # (order can be different) ids <- c('mother', 'father') # simulate three chromosomes with these lengths in cM lengs <- c(50, 100, 150) founders <- recomb_init_founders( ids, lengs ) # draw recombination breaks for the whole fam table now: inds <- recomb_fam( founders, fam ) # This is a length-3 list with names matching fam$id. # The parent data equals the input (reordered), # but now there's data to the child too inds
Transforms a tidy table describing the founder blocks inherited by each individual into a more summarized table describing the contiguous blocks of each founder inherited by at least one individual. This is especially useful when the focal individuals are several generations removed from the founders so only a small fraction of founder chromosomes are inherited.
recomb_founder_blocks_inherited(inds)
recomb_founder_blocks_inherited(inds)
inds |
The tidy table of recombination block data inherited in individuals from founders, such as the return value of |
A table in a similar format as the input (columns "anc", "chr", "start", and "end") where each row is a contiguous IBD block whose each basepair is inherited by at least one individual.
# manually construct a toy sample input, # with individuals marked for clarity although they are not required # note first two individuals library(tibble) inds <- tibble( ind = letters[1:3], parent = c('pat', 'mat', 'mat'), chr = c(1, 1, 2), start = c( 1, 800, 5), end = c(1000, 2000, 100), anc = c('f1', 'f1', 'f2') ) # the new table merges the first two rows, # because they overlapped from the same ancestor, # while the third row stays unchanged (after removing individual info) founder_blocks <- recomb_founder_blocks_inherited( inds )
# manually construct a toy sample input, # with individuals marked for clarity although they are not required # note first two individuals library(tibble) inds <- tibble( ind = letters[1:3], parent = c('pat', 'mat', 'mat'), chr = c(1, 1, 2), start = c( 1, 800, 5), end = c(1000, 2000, 100), anc = c('f1', 'f1', 'f2') ) # the new table merges the first two rows, # because they overlapped from the same ancestor, # while the third row stays unchanged (after removing individual info) founder_blocks <- recomb_founder_blocks_inherited( inds )
This function accepts haplotype data, such as the output from recomb_haplo_inds()
, and reduces it to a genotype matrix.
The haplotype data is more detailed because it is phased, while phase is lost in the genotype representation.
Moreover, the haplotype data separates individuals and chromosomes into lists (the way it is simulated), but the output genotype matrix concatenates data from all chromosomes into a single matrix, as it appears in simpler simulations and real data.
recomb_geno_inds(haplos)
recomb_geno_inds(haplos)
haplos |
A list of diploid individuals, each of which is a list with two haploid individuals named "pat" and "mat", each of which is a list of chromosomes.
Each chromosome can be a list, in which case the named element "x" must give the haplotype vector (ideally with values in zero and one counting reference alleles, including NA), otherwise the chromosome must be this vector (accommodating both output formats from |
The genotype matrix, which is the sum of the haplotype values (with values in 0, 1, 2, and NA, counting reference alleles), with individuals along columns in same order as haplos
list, and loci along rows in order of appearance concatenating chromosomes in numerical order.
recomb_fam()
for drawing recombination (ancestor) blocks, defined in terms of genetic distance.
recomb_map_inds()
for transforming genetic to basepair coordinates given a genetic map.
recomb_haplo_inds()
for determining haplotypes of descendants given ancestral haplotypes (creates input to this function).
# Lengthy code creates individuals with recombination data to map # The smallest pedigree, two parents and a child (minimal fam table). library(tibble) fam <- tibble( id = c('father', 'mother', 'child'), pat = c(NA, NA, 'father'), mat = c(NA, NA, 'mother') ) # use latest human recombination map, but just first two chrs to keep this example fast map <- recomb_map_hg38[ 1L:2L ] # initialize parents with this other function founders <- recomb_init_founders( c('father', 'mother'), map ) # draw recombination breaks for child inds <- recomb_fam( founders, fam ) # now add base pair coordinates to recombination breaks inds <- recomb_map_inds( inds, map ) # also need ancestral haplotypes # these should be simulated carefully as needed, but for this example we make random data haplo <- vector( 'list', length( map ) ) # names of ancestor haplotypes for this scenario # (founders of fam$id but each with "_pat" and "_mat" suffixes) anc_names <- c( 'father_pat', 'father_mat', 'mother_pat', 'mother_mat' ) n_ind <- length( anc_names ) # number of loci per chr, for toy test m_loci <- 10L for ( chr in 1L : length( map ) ) { # draw random positions pos_chr <- sample.int( max( map[[ chr ]]$pos ), m_loci ) # draw haplotypes X_chr <- matrix( rbinom( m_loci * n_ind, 1L, 0.5 ), nrow = m_loci, ncol = n_ind ) # required column names! colnames( X_chr ) <- anc_names # add to structure, in a list haplo[[ chr ]] <- list( X = X_chr, pos = pos_chr ) } # determine haplotypes of descendants given ancestral haplotypes haplos <- recomb_haplo_inds( inds, haplo ) # finally, run desired function! # convert haplotypes structure to a plain genotype matrix X <- recomb_geno_inds( haplos )
# Lengthy code creates individuals with recombination data to map # The smallest pedigree, two parents and a child (minimal fam table). library(tibble) fam <- tibble( id = c('father', 'mother', 'child'), pat = c(NA, NA, 'father'), mat = c(NA, NA, 'mother') ) # use latest human recombination map, but just first two chrs to keep this example fast map <- recomb_map_hg38[ 1L:2L ] # initialize parents with this other function founders <- recomb_init_founders( c('father', 'mother'), map ) # draw recombination breaks for child inds <- recomb_fam( founders, fam ) # now add base pair coordinates to recombination breaks inds <- recomb_map_inds( inds, map ) # also need ancestral haplotypes # these should be simulated carefully as needed, but for this example we make random data haplo <- vector( 'list', length( map ) ) # names of ancestor haplotypes for this scenario # (founders of fam$id but each with "_pat" and "_mat" suffixes) anc_names <- c( 'father_pat', 'father_mat', 'mother_pat', 'mother_mat' ) n_ind <- length( anc_names ) # number of loci per chr, for toy test m_loci <- 10L for ( chr in 1L : length( map ) ) { # draw random positions pos_chr <- sample.int( max( map[[ chr ]]$pos ), m_loci ) # draw haplotypes X_chr <- matrix( rbinom( m_loci * n_ind, 1L, 0.5 ), nrow = m_loci, ncol = n_ind ) # required column names! colnames( X_chr ) <- anc_names # add to structure, in a list haplo[[ chr ]] <- list( X = X_chr, pos = pos_chr ) } # determine haplotypes of descendants given ancestral haplotypes haplos <- recomb_haplo_inds( inds, haplo ) # finally, run desired function! # convert haplotypes structure to a plain genotype matrix X <- recomb_geno_inds( haplos )
Construct haplotypes of individuals given their ancestral blocks and the ancestral haplotype variants
recomb_haplo_inds(inds, haplo, ret_anc = FALSE)
recomb_haplo_inds(inds, haplo, ret_anc = FALSE)
inds |
A list of individuals in the same format as the output of |
haplo |
The ancestral haplotypes, which is a list of chromosomes, each of which is a list with two named elements: |
ret_anc |
If |
A list of diploid individuals, each of which is a list with two haploid individuals named pat
and mat
, each of which is a list of chromosomes.
If ret_anc = FALSE
(default), each chromosome is a haplotype (vector of values copied from ancestors in haplo
);
if ret_anc = TRUE
, each chromosome is a list with named elements x
for the haplotype vector and anc
for the vector of ancestor name per position.
recomb_fam()
for drawing recombination (ancestor) blocks, defined in terms of genetic distance.
recomb_map_inds()
for transforming genetic to basepair coordinates given a genetic map.
recomb_geno_inds()
for transforming the output of this function from haplotypes (a nested lists structure) to a plain genotype matrix.
# Lengthy code creates individuals with recombination data to map # The smallest pedigree, two parents and a child (minimal fam table). library(tibble) fam <- tibble( id = c('father', 'mother', 'child'), pat = c(NA, NA, 'father'), mat = c(NA, NA, 'mother') ) # use latest human recombination map, but just first two chrs to keep this example fast map <- recomb_map_hg38[ 1L:2L ] # initialize parents with this other function founders <- recomb_init_founders( c('father', 'mother'), map ) # draw recombination breaks for child inds <- recomb_fam( founders, fam ) # now add base pair coordinates to recombination breaks inds <- recomb_map_inds( inds, map ) # also need ancestral haplotypes # these should be simulated carefully as needed, but for this example we make random data haplo <- vector( 'list', length( map ) ) # names of ancestor haplotypes for this scenario # (founders of fam$id but each with "_pat" and "_mat" suffixes) anc_names <- c( 'father_pat', 'father_mat', 'mother_pat', 'mother_mat' ) n_ind <- length( anc_names ) # number of loci per chr, for toy test m_loci <- 10L for ( chr in 1L : length( map ) ) { # draw random positions pos_chr <- sample.int( max( map[[ chr ]]$pos ), m_loci ) # draw haplotypes X_chr <- matrix( rbinom( m_loci * n_ind, 1L, 0.5 ), nrow = m_loci, ncol = n_ind ) # required column names! colnames( X_chr ) <- anc_names # add to structure, in a list haplo[[ chr ]] <- list( X = X_chr, pos = pos_chr ) } # finally, run desired function! # determine haplotypes of descendants given ancestral haplotypes data <- recomb_haplo_inds( inds, haplo )
# Lengthy code creates individuals with recombination data to map # The smallest pedigree, two parents and a child (minimal fam table). library(tibble) fam <- tibble( id = c('father', 'mother', 'child'), pat = c(NA, NA, 'father'), mat = c(NA, NA, 'mother') ) # use latest human recombination map, but just first two chrs to keep this example fast map <- recomb_map_hg38[ 1L:2L ] # initialize parents with this other function founders <- recomb_init_founders( c('father', 'mother'), map ) # draw recombination breaks for child inds <- recomb_fam( founders, fam ) # now add base pair coordinates to recombination breaks inds <- recomb_map_inds( inds, map ) # also need ancestral haplotypes # these should be simulated carefully as needed, but for this example we make random data haplo <- vector( 'list', length( map ) ) # names of ancestor haplotypes for this scenario # (founders of fam$id but each with "_pat" and "_mat" suffixes) anc_names <- c( 'father_pat', 'father_mat', 'mother_pat', 'mother_mat' ) n_ind <- length( anc_names ) # number of loci per chr, for toy test m_loci <- 10L for ( chr in 1L : length( map ) ) { # draw random positions pos_chr <- sample.int( max( map[[ chr ]]$pos ), m_loci ) # draw haplotypes X_chr <- matrix( rbinom( m_loci * n_ind, 1L, 0.5 ), nrow = m_loci, ncol = n_ind ) # required column names! colnames( X_chr ) <- anc_names # add to structure, in a list haplo[[ chr ]] <- list( X = X_chr, pos = pos_chr ) } # finally, run desired function! # determine haplotypes of descendants given ancestral haplotypes data <- recomb_haplo_inds( inds, haplo )
This function initializes what is otherwise a tedious structure for founders, to be used for simulating recombination in a pedigree. The genetic structure is trivial, in that these "founder" chromosomes are each of a single ancestral individual (none are recombined).
recomb_init_founders(ids, lengs)
recomb_init_founders(ids, lengs)
ids |
The list of IDs to use for each individual |
lengs |
The lengths of each chromosome in centiMorgans (cM).
If this vector is named, the output inherits these chromosome names.
If it is a list, it is assumed to be a recombination map (see |
A named list of diploid individuals, each of which is a list with two haploid individuals named pat
and mat
, each of which is a list of chromosomes (inherits names of lengs
if present), each of which is a tibble with a single row and two columns: posg
equals the chromosome length, and anc
equals the ID of the individual (from ids
) concatenated with either _pat
or _mat
depending on which parent it is.
recomb_fam()
to simulate recombination across a pedigree using the founders initialized here.
# version with explicit recombination lengths ancs <- recomb_init_founders( c('a', 'b'), c(100, 200) ) ancs # version using genetic map (uses provided human map) from which lengths are extracted ancs <- recomb_init_founders( c('a', 'b'), recomb_map_hg38 ) ancs
# version with explicit recombination lengths ancs <- recomb_init_founders( c('a', 'b'), c(100, 200) ) ancs # version using genetic map (uses provided human map) from which lengths are extracted ancs <- recomb_init_founders( c('a', 'b'), recomb_map_hg38 ) ancs
A wrapper around the more general recomb_fam()
, specialized to save memory when only the last generation is desired (recomb_fam()
returns recombination blocks for the entire pedigree).
This function assumes that generations are non-overlapping (met by the output of sim_pedigree()
), in which case each generation g
can be drawn from generation g-1
data only.
That way, only two consecutive generations need be in memory at any given time.
The partitioning of individuals into generations is given by the ids
parameter (again matches the output of sim_pedigree()
).
recomb_last_gen(founders, fam, ids, missing_vals = c("", 0))
recomb_last_gen(founders, fam, ids, missing_vals = c("", 0))
founders |
The named list of founders with their chromosomes.
For unstructured founders, initialize with |
fam |
The pedigree data.frame, in plink FAM format.
Only columns |
ids |
A list containing vectors of IDs for each generation.
All these IDs must be present in |
missing_vals |
The list of ID values treated as missing.
|
The list of individuals with recombined chromosomes of the last generation (the intersection of ids[ length(ids) ]
and fam$id
), in the same format as founders
above.
The names of this list are last-generation individuals in the order that they appear in fam$id
.
Plink FAM format reference: https://www.cog-genomics.org/plink/1.9/formats#fam
# A small pedigree, two parents and two children. # A minimal fam table with the three required columns. # Note "mother" and "father" have missing parent IDs, while children do not library(tibble) fam <- tibble( id = c('father', 'mother', 'child', 'sib'), pat = c(NA, NA, 'father', 'father'), mat = c(NA, NA, 'mother', 'mother') ) # need an `ids` list separating the generations ids <- list( c('father', 'mother'), c('child', 'sib') ) # initialize parents with this other function # simulate three chromosomes with these lengths in cM lengs <- c(50, 100, 150) founders <- recomb_init_founders( ids[[1]], lengs ) # draw recombination breaks for the children inds <- recomb_last_gen( founders, fam, ids )
# A small pedigree, two parents and two children. # A minimal fam table with the three required columns. # Note "mother" and "father" have missing parent IDs, while children do not library(tibble) fam <- tibble( id = c('father', 'mother', 'child', 'sib'), pat = c(NA, NA, 'father', 'father'), mat = c(NA, NA, 'mother', 'mother') ) # need an `ids` list separating the generations ids <- list( c('father', 'mother'), c('child', 'sib') ) # initialize parents with this other function # simulate three chromosomes with these lengths in cM lengs <- c(50, 100, 150) founders <- recomb_init_founders( ids[[1]], lengs ) # draw recombination breaks for the children inds <- recomb_last_gen( founders, fam, ids )
Given an existing recombination map and a chromosome length in base pairs, extrapolates the map to ensure all positions are covered, and shifts to ensure position one in basepairs corresponds to position 0 in genetic position.
Recombination rates are extrapolated from the first and last 10Mb of data by default (separately per end).
Therefore fixes the fact that common maps start genetic position zero at base pair position >> 1
and do not extend to ends (some SNPs from modern projects fall out of range without fixes).
recomb_map_fix_ends_chr(map, pos_length, pos_delta = 10000000L)
recomb_map_fix_ends_chr(map, pos_length, pos_delta = 10000000L)
map |
A tibble with two columns: |
pos_length |
The length of the chromosome in base pairs. |
pos_delta |
The size of the window used to extrapolate recombination rates. |
The extrapolated recombination map, shifted so the first non-trivial position maps to the genetic distance expected from the extrapolated rate at the beginning, then added a first trivial position (pos=1, posg=0
) and final basepair position at length of chromosome and expected genetic position from end extrapolation.
recomb_map_simplify_chr()
to simplify recombination maps to a desired numerical accuracy.
library(tibble) # create a toy recombination map with at least 10Mb at each end map <- tibble( pos = c( 3L, 15L, 100L, 120L ) * 1e6L, posg = c( 0, 10.4, 90.1, 110 ) ) # and length pos_length <- 150L * 1e6L # apply function! map_fixed <- recomb_map_fix_ends_chr( map, pos_length ) # inspect map_fixed
library(tibble) # create a toy recombination map with at least 10Mb at each end map <- tibble( pos = c( 3L, 15L, 100L, 120L ) * 1e6L, posg = c( 0, 10.4, 90.1, 110 ) ) # and length pos_length <- 150L * 1e6L # apply function! map_fixed <- recomb_map_fix_ends_chr( map, pos_length ) # inspect map_fixed
Human genetic recombination maps for builds 38 (GRCh38/hg38) and 37 (GRCh37/hg19, below suffixed as hg37 for simplicity although technically incorrect).
Processed each first with recomb_map_fix_ends_chr()
to shift and extrapolate to sequence ends, then simplified with recomb_map_simplify_chr()
to remove all values that can be interpolated with an error of up to tol = 0.1
, in order to reduce their sizes and interpolation runtime.
Defaults were used, which resulted in extrapolated recombination rates close to and centered around the average of 1e-6 cM/base).
Autosomes only.
recomb_map_hg38 recomb_map_hg37
recomb_map_hg38 recomb_map_hg37
A list with 22 elements (autosomes, not named), each a tibble with two columns defining the recombination map at that chromosome:
pos
: position in base pairs
posg
: position in centiMorgans (cM)
An object of class list
of length 22.
Raw genetic maps downloaded from this location prior to above processing: https://bochet.gcc.biostat.washington.edu/beagle/genetic_maps/
Chromosome lengths from: https://www.ncbi.nlm.nih.gov/grc/human/data
Given a list of individuals with recombination breaks given in genetic distance (such as the output of recomb_fam()
), and a genetic map (see recomb_map_hg
), this function determines all positions in base pair coordinates.
If base pair positions existed in input, they are overwritten.
recomb_map_inds(inds, map)
recomb_map_inds(inds, map)
inds |
The list of individuals, each of which is a list with two haploid individuals named |
map |
The genetic map, a list of chromosomes each of which is a data.frame/tibble with columns |
Genetic positions are converted to base pair positions from the provided map using linear interpolation, using stats::approx()
with options rule = 2
(out of range cases are set to nearest end's value) and ties = list( 'ordered', mean )
(assume data is ordered, interpolate ties in genetic position in map using mean of base pair positions).
Output will be incorrect, without throwing errors, if genetic map is not ordered.
Base pair positions are rounded to integers.
The input list of individuals, with each chromosome added column pos
corresponding to end coordinate in base pairs.
Each chromosome has columns reordered so pos
, posg
, and anc
appear first, and any additional columns appear afterwards.
recomb_fam()
for drawing recombination breaks of individuals from a pedigree.
recomb_map_hg
for simplified human recombination maps included in this package.
# Lengthy code creates individuals with recombination data to map # The smallest pedigree, two parents and a child (minimal fam table). library(tibble) fam <- tibble( id = c('father', 'mother', 'child'), pat = c(NA, NA, 'father'), mat = c(NA, NA, 'mother') ) # use latest human recombination map, but just first two chrs to keep this example fast map <- recomb_map_hg38[ 1:2 ] # initialize parents with this other function founders <- recomb_init_founders( c('father', 'mother'), map ) # draw recombination breaks for child inds <- recomb_fam( founders, fam ) # now use this function to add base pair coordinates for recombination breaks! inds <- recomb_map_inds( inds, map )
# Lengthy code creates individuals with recombination data to map # The smallest pedigree, two parents and a child (minimal fam table). library(tibble) fam <- tibble( id = c('father', 'mother', 'child'), pat = c(NA, NA, 'father'), mat = c(NA, NA, 'mother') ) # use latest human recombination map, but just first two chrs to keep this example fast map <- recomb_map_hg38[ 1:2 ] # initialize parents with this other function founders <- recomb_init_founders( c('father', 'mother'), map ) # draw recombination breaks for child inds <- recomb_fam( founders, fam ) # now use this function to add base pair coordinates for recombination breaks! inds <- recomb_map_inds( inds, map )
Given an input recombination map, this function iteratively removes rows that can be interpolated to less than a given error tol
.
This is a heuristic that works very well in practice, resulting in average interpolation errors well below tol
, and maximum final errors no greater than 3 * tol
in our internal benchmarks (expected in extremely concave or convex regions of the map; final errors are rarely above tol
with few exceptions).
recomb_map_simplify_chr(map, tol = 0.1)
recomb_map_simplify_chr(map, tol = 0.1)
map |
A tibble with two columns: |
tol |
Tolerance of interpolation errors, in cM. |
This function reduces recombination map sizes drastically, in order to include them in packages, and also makes linear interpolation faster. This simplification operation can be justified as the precision of many existing maps is both limited and overstated, and a high accuracy is not needed for simulations with many other approximations in place.
The recombination map with rows (positions) removed (if they are interpolated with errors below tol
in most cases).
recomb_map_fix_ends_chr()
to shift and extrapolate recombination map to ends of chromosome.
library(tibble) # create a toy recombination map to simplify # in this case all middle rows can be interpolated from the ends with practically no error map <- tibble( pos = c( 1L, 1e6L, 2e6L, 3e6L ), posg = c( 0.0, 1.0, 2.0, 3.0 ) ) # simplify map! map_simple <- recomb_map_simplify_chr( map ) # inspect map_simple
library(tibble) # create a toy recombination map to simplify # in this case all middle rows can be interpolated from the ends with practically no error map <- tibble( pos = c( 1L, 1e6L, 2e6L, 3e6L ), posg = c( 0.0, 1.0, 2.0, 3.0 ) ) # simplify map! map_simple <- recomb_map_simplify_chr( map ) # inspect map_simple
Specify the number of individuals per generation, and some other optional parameters, and a single pedigree with those properties will be simulated, where close relatives are never paired, sex is drawn randomly per individual and pairings are strictly across opposite-sex individuals, and otherwise closest individuals (on an underlying 1D geography given by their index) are paired in a random order. Pairs are reordered based on the average of their indexes, where their children are placed (determines their indexes in the 1D geography). The procedure may leave some individuals unpaired in the next generation, and family sizes vary randomly (with a fixed minimum family size) to achieve the desired population size in each generation.
sim_pedigree( n, G = length(n), sex = draw_sex(n[1]), kinship_local = diag(n[1])/2, cutoff = 1/4^3, children_min = 1L, full = FALSE )
sim_pedigree( n, G = length(n), sex = draw_sex(n[1]), kinship_local = diag(n[1])/2, cutoff = 1/4^3, children_min = 1L, full = FALSE )
n |
The number of individuals per generation.
If scalar, the number of generations |
G |
The number of generations (optional).
Note |
sex |
The numeric sex values for the founders (1L for male, 2L for female).
By default they are drawn randomly using |
kinship_local |
The local kinship matrix of the founder population. The default value is half the identity matrix, which corresponds to locally unrelated and locally outbred founders. This "local" kinship is the basis for all kinship calculations used to decide on close relative avoidance. The goal is to make a decision to not pair close relatives based on the pedigree only (and not based on population structure, which otherwise increases all kinship values), so the default value is appropriate. |
cutoff |
Local kinship values strictly less than |
children_min |
The minimum number of children per family.
Must be 0 or larger, but not exceed the average number of children per family in each generation (varies depending on how many individuals were left unpaired, but this upper limit is approximately |
full |
If |
A list with these named elements:
fam
: the pedigree, a tibble in plink FAM format. Following the column naming convention of the related genio
package, it contains columns:
fam
: Family ID, trivial "fam1" for all individuals
id
: Individual ID, in this case a code of format (in regular expression) "(\d+)-(\d+)" where the first integer is the generation number and the second integer is the index number (1 to n[g]
for generation g
).
pat
: Paternal ID. Matches an id
except for founders, which have fathers set to NA
.
mat
: Maternal ID. Matches an id
except for founders, which have mothers set to NA
.
sex
: integers 1L (male) or 2L (female) which were drawn randomly; no other values occur in these outputs.
pheno
: Phenotype, here all 0 (missing value).
ids
: a list of IDs for each generation (indexed in the list by generation).
kinship_local
: if full = FALSE
, the local kinship matrix of the last generation, otherwise a list of local kinship matrices for every generation.
Plink FAM format reference: https://www.cog-genomics.org/plink/1.9/formats#fam
# number of individuals for each generation n <- c(15, 20, 25) # create random pedigree with 3 generations, etc data <- sim_pedigree( n ) # this is the FAM table defining the entire pedigree, # which is the most important piece of information desired! data$fam # the IDs separated by generation data$ids # bonus: the local kinship matrix of the final generation data$kinship_local
# number of individuals for each generation n <- c(15, 20, 25) # create random pedigree with 3 generations, etc data <- sim_pedigree( n ) # this is the FAM table defining the entire pedigree, # which is the most important piece of information desired! data$fam # the IDs separated by generation data$ids # bonus: the local kinship matrix of the final generation data$kinship_local
This function takes a compact structure of nested lists and tables describing the founder blocks of identity by descent (IBD) inherited in a pedigree with recombination, and outputs the same data organized as a single table following tidy conventions, which is easy to manipulate externally although it is also larger and more redundant.
tidy_recomb_map_inds(inds)
tidy_recomb_map_inds(inds)
inds |
The output value from |
A table with one row per IBD block, and the following columns: "ind" containing the label of the individual as given in the input object, "parent" equal to either "pat" or "mat", "chr" the chromosome, "start" and "end" the range of the block in basepairs, and "anc" the label of the founder individual.
# simulate the ancestors of one person to 3 generations obj <- fam_ancestors( 3 ) fam <- obj$fam ids <- obj$ids # initialize founders founders <- recomb_init_founders( ids[[ 1 ]], recomb_map_hg38 ) # draw recombination breaks along pedigree, with coordinates in genetic distance (centiMorgans), # with information for last generation only inds <- recomb_last_gen( founders, fam, ids ) # map recombination break coordinates to base pairs inds <- recomb_map_inds( inds, recomb_map_hg38 ) # now that the input structure is ready, this function returns a tidy table version! inds_tidy <- tidy_recomb_map_inds( inds )
# simulate the ancestors of one person to 3 generations obj <- fam_ancestors( 3 ) fam <- obj$fam ids <- obj$ids # initialize founders founders <- recomb_init_founders( ids[[ 1 ]], recomb_map_hg38 ) # draw recombination breaks along pedigree, with coordinates in genetic distance (centiMorgans), # with information for last generation only inds <- recomb_last_gen( founders, fam, ids ) # map recombination break coordinates to base pairs inds <- recomb_map_inds( inds, recomb_map_hg38 ) # now that the input structure is ready, this function returns a tidy table version! inds_tidy <- tidy_recomb_map_inds( inds )