Title: | Genetics Input/Output Functions |
---|---|
Description: | Implements readers and writers for file formats associated with genetics data. Reading and writing Plink BED/BIM/FAM and GCTA binary GRM formats is fully supported, including a lightning-fast BED reader and writer implementations. Other functions are 'readr' wrappers that are more constrained, user-friendly, and efficient for these particular applications; handles Plink and Eigenstrat tables (FAM, BIM, IND, and SNP files). There are also make functions for FAM and BIM tables with default values to go with simulated genotype data. |
Authors: | Alejandro Ochoa [aut, cre] |
Maintainer: | Alejandro Ochoa <[email protected]> |
License: | GPL-3 |
Version: | 1.1.4.9000 |
Built: | 2024-11-22 03:06:18 UTC |
Source: | https://github.com/ochoalab/genio |
This function returns the number of lines in a file.
It is intended to result in fast retrieval of numbers of individuals (from FAM or equivalent files) or loci (BIM or equivalent files) when the input files are extremely large and no other information is required from these files.
This code uses C++ to quickly counts lines (like linux's wc -l
but this one is cross-platform).
count_lines(file, ext = NA, verbose = TRUE)
count_lines(file, ext = NA, verbose = TRUE)
file |
The input file path to read (a string). |
ext |
An optional extension.
If |
verbose |
If |
Note: this function does not work correctly with compressed files (they are not uncompressed prior to counting newlines).
The number of lines in the file.
# count number of individuals from an existing plink *.fam file file <- system.file("extdata", 'sample.fam', package = "genio", mustWork = TRUE) n_ind <- count_lines(file) n_ind # count number of loci from an existing plink *.bim file file <- system.file("extdata", 'sample.bim', package = "genio", mustWork = TRUE) m_loci <- count_lines(file) m_loci
# count number of individuals from an existing plink *.fam file file <- system.file("extdata", 'sample.fam', package = "genio", mustWork = TRUE) n_ind <- count_lines(file) n_ind # count number of loci from an existing plink *.bim file file <- system.file("extdata", 'sample.bim', package = "genio", mustWork = TRUE) m_loci <- count_lines(file) m_loci
This function deletes each of the GCTA binary GRM files (grm.bin
, grm.N.bin
, and grm.id
extensions) given the shared base file path, warning if any of the files did not exist or if any were not successfully deleted.
delete_files_grm(file)
delete_files_grm(file)
file |
The shared file path (excluding extensions: |
Nothing
# if you want to delete "data.grm.bin", "data.grm.N.bin" and "data.grm.id", run like this: # delete_files_grm("data") # The following example is more awkward # because (only for these examples) the package must create *temporary* files to actually delete # create dummy GRM files file <- tempfile('delete-me-test') # no extension # add each extension and create empty files file.create( paste0(file, '.grm.bin') ) file.create( paste0(file, '.grm.N.bin') ) file.create( paste0(file, '.grm.id') ) # delete the GRM files we just created delete_files_grm(file)
# if you want to delete "data.grm.bin", "data.grm.N.bin" and "data.grm.id", run like this: # delete_files_grm("data") # The following example is more awkward # because (only for these examples) the package must create *temporary* files to actually delete # create dummy GRM files file <- tempfile('delete-me-test') # no extension # add each extension and create empty files file.create( paste0(file, '.grm.bin') ) file.create( paste0(file, '.grm.N.bin') ) file.create( paste0(file, '.grm.id') ) # delete the GRM files we just created delete_files_grm(file)
This function deletes a PHEN files given the base file path (without extension), warning if the file did not exist or if it was not successfully deleted.
delete_files_phen(file)
delete_files_phen(file)
file |
The base file path (excluding |
Nothing
# if you want to delete "data.phen", run like this: # delete_files_phen("data") # The following example is more awkward # because (only for these examples) the package must create a *temporary* file to actually delete # create dummy PHEN files file <- tempfile('delete-me-test') # no extension # add extension and create an empty file file.create( paste0(file, '.phen') ) # delete the PHEN file we just created delete_files_phen(file)
# if you want to delete "data.phen", run like this: # delete_files_phen("data") # The following example is more awkward # because (only for these examples) the package must create a *temporary* file to actually delete # create dummy PHEN files file <- tempfile('delete-me-test') # no extension # add extension and create an empty file file.create( paste0(file, '.phen') ) # delete the PHEN file we just created delete_files_phen(file)
This function deletes each of the Plink binary files (bed
, bim
, fam
extensions) given the shared base file path, warning if any of the files did not exist or if any were not successfully deleted.
delete_files_plink(file)
delete_files_plink(file)
file |
The shared file path (excluding extensions: |
Nothing
# if you want to delete "data.bed", "data.bim" and "data.fam", run like this: # delete_files_plink("data") # The following example is more awkward # because (only for these examples) the package must create *temporary* files to actually delete # create dummy BED/BIM/FAM files file <- tempfile('delete-me-test') # no extension # add each extension and create empty files file.create( paste0(file, '.bed') ) file.create( paste0(file, '.bim') ) file.create( paste0(file, '.fam') ) # delete the BED/BIM/FAM files we just created delete_files_plink(file)
# if you want to delete "data.bed", "data.bim" and "data.fam", run like this: # delete_files_plink("data") # The following example is more awkward # because (only for these examples) the package must create *temporary* files to actually delete # create dummy BED/BIM/FAM files file <- tempfile('delete-me-test') # no extension # add each extension and create empty files file.create( paste0(file, '.bed') ) file.create( paste0(file, '.bim') ) file.create( paste0(file, '.fam') ) # delete the BED/BIM/FAM files we just created delete_files_plink(file)
This package fully supports reading and writing Plink BED/BIM/FAM and GCTA GRM files, as illustrated below. These functions make it easy to create dummy annotation tables to go with simulated genotype data too. Lastly, there is functionality to read and write Eigenstrat tables.
Maintainer: Alejandro Ochoa [email protected] (ORCID)
Useful links:
# read existing BED/BIM/FAM files # first get path to BED file file <- system.file( "extdata", 'sample.bed', package = "genio", mustWork = TRUE ) # read genotypes and annotation tables plink_data <- read_plink( file ) # genotypes X <- plink_data$X # locus annotations bim <- plink_data$bim # individual annotations fam <- plink_data$fam # the same works without .bed extension file <- sub( '\\.bed$', '', file ) # remove extension plink_data <- read_plink( file ) # write data into new BED/BIM/FAM files file_out <- tempfile( 'delete-me-example' ) write_plink( file_out, X, bim, fam ) # delete example files when done delete_files_plink( file_out ) # read sample GRM files file <- system.file( "extdata", 'sample.grm.bin', package = "genio", mustWork = TRUE ) file <- sub( '\\.grm\\.bin$', '', file ) # remove extension from this path on purpose obj <- read_grm( file ) # the kinship matrix kinship <- obj$kinship # the pair sample sizes matrix M <- obj$M # the fam and ID tibble fam <- obj$fam # write data into new GRM files write_grm( file_out, kinship, M = M, fam = fam ) # delete example files when done delete_files_grm( file_out ) # other functions not shown here allow reading and writing individual files, # creating dummy tables to go with simulated genotypes, # requiring the existence of these files, # and reading and writing of Eigenstrat tables too.
# read existing BED/BIM/FAM files # first get path to BED file file <- system.file( "extdata", 'sample.bed', package = "genio", mustWork = TRUE ) # read genotypes and annotation tables plink_data <- read_plink( file ) # genotypes X <- plink_data$X # locus annotations bim <- plink_data$bim # individual annotations fam <- plink_data$fam # the same works without .bed extension file <- sub( '\\.bed$', '', file ) # remove extension plink_data <- read_plink( file ) # write data into new BED/BIM/FAM files file_out <- tempfile( 'delete-me-example' ) write_plink( file_out, X, bim, fam ) # delete example files when done delete_files_plink( file_out ) # read sample GRM files file <- system.file( "extdata", 'sample.grm.bin', package = "genio", mustWork = TRUE ) file <- sub( '\\.grm\\.bin$', '', file ) # remove extension from this path on purpose obj <- read_grm( file ) # the kinship matrix kinship <- obj$kinship # the pair sample sizes matrix M <- obj$M # the fam and ID tibble fam <- obj$fam # write data into new GRM files write_grm( file_out, kinship, M = M, fam = fam ) # delete example files when done delete_files_grm( file_out ) # other functions not shown here allow reading and writing individual files, # creating dummy tables to go with simulated genotypes, # requiring the existence of these files, # and reading and writing of Eigenstrat tables too.
Given the genotype matrix X
and bim
table (as they are parsed by read_plink()
, this outputs a matrix of the same dimensions as X
but with the numeric codes (all values in 0, 1, 2) translated to human-readable character codes (such as 'A/A', 'A/G', 'G/G', depending on which are the two alleles at the locus as given in the bim
table, see return value).
geno_to_char(X, bim)
geno_to_char(X, bim)
X |
The genotype matrix.
It must have values only in 0, 1, 2, and |
bim |
The variant table.
It is required to have the same number of rows as |
The genotype matrix reencoded as strings.
At one locus, if the two alleles (alt and ref) are 'A' and 'B', then the genotypes in the input are encoded as characters as: 0 -> 'A/A', 1 -> 'B/A', and 2 -> 'B/B'.
Thus, the numeric encoding counts the reference allele dosage.
NA
values in input X
remain NA
in the output.
If the input genotype matrix had row and column names, these are inherited by the output matrix.
read_plink()
,
read_bed()
,
read_bim()
.
# a numeric/dosage genotype matrix with two loci (rows) # and three individuals (columns) X <- rbind( 0:2, c(0, NA, 2) ) # corresponding variant table (minimal case with just two required columns) library(tibble) bim <- tibble( alt = c('C', 'GT'), ref = c('A', 'G') ) # genotype matrix translated as characters X_char <- geno_to_char( X, bim ) X_char
# a numeric/dosage genotype matrix with two loci (rows) # and three individuals (columns) X <- rbind( 0:2, c(0, NA, 2) ) # corresponding variant table (minimal case with just two required columns) library(tibble) bim <- tibble( alt = c('C', 'GT'), ref = c('A', 'G') ) # genotype matrix translated as characters X_char <- geno_to_char( X, bim ) X_char
Given an existing plink-formatted BED (binary) file, this function reads it, transforms genotypes on the go, and writes a new BED file such that heterozygotes are encoded as 2 and homozygotes as 0.
In other words, it transforms the numerical genotype values c( 0, 1, 2, NA )
into c( 0, 2, 0, NA )
.
Heterozygotes are encoded as 2, rather than 1, so existing code for calculating allele frequencies and related quantities, such as kinship estimates, works on this data as intended.
Intended to transform extremely large files that should not be loaded entirely into memory at once.
het_reencode_bed( file_in, file_out, m_loci = NA, n_ind = NA, make_bim_fam = TRUE, verbose = TRUE )
het_reencode_bed( file_in, file_out, m_loci = NA, n_ind = NA, make_bim_fam = TRUE, verbose = TRUE )
file_in |
Input file path.
*.bed extension may be omitted (will be added automatically if |
file_out |
Output file path. *.bed extension may be omitted (will be added automatically if it is missing). |
m_loci |
Number of loci in the input genotype table.
If |
n_ind |
Number of individuals in the input genotype table.
If |
make_bim_fam |
If |
verbose |
If |
read_bed()
and write_bed()
, from which much of the code of this function is derived, which explains additional BED format requirements.
# define input and output, both of which will also work without extension # read an existing Plink *.bed file file_in <- system.file("extdata", 'sample.bed', package = "genio", mustWork = TRUE) # write to a *temporary* location for this example file_out <- tempfile('delete-me-example') # in default mode, deduces dimensions from paired *.bim and *.fam tables het_reencode_bed( file_in, file_out ) # delete output when done delete_files_plink( file_out )
# define input and output, both of which will also work without extension # read an existing Plink *.bed file file_in <- system.file("extdata", 'sample.bed', package = "genio", mustWork = TRUE) # write to a *temporary* location for this example file_out <- tempfile('delete-me-example') # in default mode, deduces dimensions from paired *.bim and *.fam tables het_reencode_bed( file_in, file_out ) # delete output when done delete_files_plink( file_out )
This function takes an existing IND tibble and creates a FAM tibble with the same information and dummy values for missing data.
In particular, the output FAM tibble will contain these columns with these contents
(IND only contain id
, sex
, and label
, so there is no loss of information):
fam
: IND label
id
: IND id
pat
: 0
(missing paternal ID)
mat
: 0
(missing maternal ID)
sex
: IND sex
converted to Plink integer codes via sex_to_int()
peno
: 0
(missing phenotype)
ind_to_fam(ind)
ind_to_fam(ind)
ind |
The input Eigenstrat IND tibble to convert. |
A Plink FAM tibble.
Eigenstrat IND format reference: https://github.com/DReichLab/EIG/tree/master/CONVERTF
Plink FAM format reference: https://www.cog-genomics.org/plink/1.9/formats#fam
# create a sample IND tibble library(tibble) ind <- tibble( id = 1:3, sex = c('U', 'M', 'F'), label = c(1, 1, 2) ) # convert to FAM fam <- ind_to_fam(ind) # inspect: fam
# create a sample IND tibble library(tibble) ind <- tibble( id = 1:3, sex = c('U', 'M', 'F'), label = c(1, 1, 2) ) # convert to FAM fam <- ind_to_fam(ind) # inspect: fam
This function simplifies the creation of Plink BIM-formatted tibbles, which autocompletes missing information if a partial tibble is provided, or generates a completely made up tibble if the number of individuals is provided. The default values are most useful for simulated genotypes, where IDs can be made up but must be unique, and there are no chromosomes, positions, or particular reference or alternative alleles.
make_bim(tib, n = NA)
make_bim(tib, n = NA)
tib |
The input tibble (optional).
Missing columns will be autocompleted with reasonable values that are accepted by Plink and other external software.
If missing, all will be autocompleted, but |
n |
The desired number of loci (rows).
Required if |
Autocompleted column values:
chr
: 1
(all data is on a single chromosome)
id
: 1:n
posg
: 0
(missing)
pos
: 1:n
ref
: 1
alt
: 2
Note that n
is either given directly or obtained from the input tibble.
The input tibble with autocompleted columns and columns in default order, or the made up tibble if only the number of individuals was provided. The output begins with the standard columns in standard order: chr, id, posg, pos, ref, alt. Additional columns in the input tibble are preserved but placed after the standard columns.
Plink BIM format reference: https://www.cog-genomics.org/plink/1.9/formats#bim
# create a synthetic tibble for 10 loci # (most common use case) bim <- make_bim(n = 10) # manually create a partial tibble with only chromosomes defined library(tibble) bim <- tibble(chr = 0:2) # autocomplete the rest of the columns bim <- make_bim(bim)
# create a synthetic tibble for 10 loci # (most common use case) bim <- make_bim(n = 10) # manually create a partial tibble with only chromosomes defined library(tibble) bim <- tibble(chr = 0:2) # autocomplete the rest of the columns bim <- make_bim(bim)
This function simplifies the creation of Plink FAM-formatted tibbles, which autocompletes missing information if a partial tibble is provided, or generates a completely made up tibble if the number of individuals is provided. The default values are most useful for simulated genotypes, where IDs can be made up but must be unique, and there are no parents, families, gender, or phenotype.
make_fam(tib, n = NA)
make_fam(tib, n = NA)
tib |
The input tibble (optional).
Missing columns will be autocompleted with reasonable values that are accepted by Plink and other external software.
If missing, all will be autocompleted, but |
n |
The desired number of individuals (rows).
Required if |
Autocompleted column values:
fam
: 1:n
id
: 1:n
pat
: 0
(missing)
mat
: 0
(missing)
sex
: 0
(missing)
pheno
: 0
(missing)
Note that n
is either given directly or obtained from the input tibble.
The input tibble with autocompleted columns and columns in default order, or the made up tibble if only the number of individuals was provided.
The output begins with the standard columns in standard order: fam
, id
, pat
, mat
, sex
, pheno
.
Additional columns in the input tibble are preserved but placed after the standard columns.
Plink FAM format reference: https://www.cog-genomics.org/plink/1.9/formats#fam
# create a synthetic tibble for 10 individuals # (most common use case) fam <- make_fam(n = 10) # manually create a partial tibble with only phenotypes defined library(tibble) fam <- tibble(pheno = 0:2) # autocomplete the rest of the columns fam <- make_fam(fam)
# create a synthetic tibble for 10 individuals # (most common use case) fam <- make_fam(n = 10) # manually create a partial tibble with only phenotypes defined library(tibble) fam <- tibble(pheno = 0:2) # autocomplete the rest of the columns fam <- make_fam(fam)
This function reads genotypes encoded in a Plink-formatted BED (binary) file, returning them in a standard R matrix containing genotypes encoded numerically as dosages (values in c( 0, 1, 2, NA )
).
Each genotype per locus (m
loci) and individual (n
total) counts the number of reference alleles, or NA
for missing data.
No *.fam or *.bim files are read by this basic function.
Since BED does not encode the data dimensions internally, these values must be provided by the user.
read_bed( file, names_loci = NULL, names_ind = NULL, m_loci = NA, n_ind = NA, ext = "bed", verbose = TRUE )
read_bed( file, names_loci = NULL, names_ind = NULL, m_loci = NA, n_ind = NA, ext = "bed", verbose = TRUE )
file |
Input file path.
*.bed extension may be omitted (will be added automatically if |
names_loci |
Vector of loci names, to become the row names of the genotype matrix.
If provided, its length sets |
names_ind |
Vector of individual names, to become the column names of the genotype matrix.
If provided, its length sets |
m_loci |
Number of loci in the input genotype table.
Required if |
n_ind |
Number of individuals in the input genotype table.
Required if |
ext |
The desired file extension (default "bed").
Ignored if |
verbose |
If |
The code enforces several checks to validate data given the requested dimensions. Errors are thrown if file terminates too early or does not terminate after genotype matrix is filled. In addition, as each locus is encoded in an integer number of bytes, and each byte contains up to four individuals, bytes with fewer than four are padded. To agree with other software (plink2, BEDMatrix), byte padding values are ignored (may take on any value without causing errors).
This function only supports locus-major BED files, which are the standard for modern data. Format is validated via the BED file's magic numbers (first three bytes of file). Older BED files can be converted using Plink.
The m
-by-n
genotype matrix.
read_plink()
for reading a set of BED/BIM/FAM files.
geno_to_char()
for translating numerical genotypes into more human-readable character encodings.
Plink BED format reference: https://www.cog-genomics.org/plink/1.9/formats#bed
# first obtain data dimensions from BIM and FAM files # all file paths file_bed <- system.file("extdata", 'sample.bed', package = "genio", mustWork = TRUE) file_bim <- system.file("extdata", 'sample.bim', package = "genio", mustWork = TRUE) file_fam <- system.file("extdata", 'sample.fam', package = "genio", mustWork = TRUE) # read annotation tables bim <- read_bim(file_bim) fam <- read_fam(file_fam) # read an existing Plink *.bim file # pass locus and individual IDs as vectors, setting data dimensions too X <- read_bed(file_bed, bim$id, fam$id) X # can specify without extension file_bed <- sub('\\.bed$', '', file_bed) # remove extension from this path on purpose file_bed # verify .bed is missing X <- read_bed(file_bed, bim$id, fam$id) # loads too! X
# first obtain data dimensions from BIM and FAM files # all file paths file_bed <- system.file("extdata", 'sample.bed', package = "genio", mustWork = TRUE) file_bim <- system.file("extdata", 'sample.bim', package = "genio", mustWork = TRUE) file_fam <- system.file("extdata", 'sample.fam', package = "genio", mustWork = TRUE) # read annotation tables bim <- read_bim(file_bim) fam <- read_fam(file_fam) # read an existing Plink *.bim file # pass locus and individual IDs as vectors, setting data dimensions too X <- read_bed(file_bed, bim$id, fam$id) X # can specify without extension file_bed <- sub('\\.bed$', '', file_bed) # remove extension from this path on purpose file_bed # verify .bed is missing X <- read_bed(file_bed, bim$id, fam$id) # loads too! X
This function reads a standard Plink *.bim file into a tibble with named columns.
It uses readr::read_table()
to do it efficiently.
read_bim(file, verbose = TRUE)
read_bim(file, verbose = TRUE)
file |
Input file (whatever is accepted by |
verbose |
If |
A tibble with columns: chr
, id
, posg
, pos
, alt
, ref
.
read_plink()
for reading a set of BED/BIM/FAM files.
Plink BIM format references: https://www.cog-genomics.org/plink/1.9/formats#bim https://www.cog-genomics.org/plink/2.0/formats#bim
# to read "data.bim", run like this: # bim <- read_bim("data") # this also works # bim <- read_bim("data.bim") # The following example is more awkward # because package sample data has to be specified in this weird way: # read an existing Plink *.bim file file <- system.file("extdata", 'sample.bim', package = "genio", mustWork = TRUE) bim <- read_bim(file) bim # can specify without extension file <- sub('\\.bim$', '', file) # remove extension from this path on purpose file # verify .bim is missing bim <- read_bim(file) # loads too! bim
# to read "data.bim", run like this: # bim <- read_bim("data") # this also works # bim <- read_bim("data.bim") # The following example is more awkward # because package sample data has to be specified in this weird way: # read an existing Plink *.bim file file <- system.file("extdata", 'sample.bim', package = "genio", mustWork = TRUE) bim <- read_bim(file) bim # can specify without extension file <- sub('\\.bim$', '', file) # remove extension from this path on purpose file # verify .bim is missing bim <- read_bim(file) # loads too! bim
This function reads a Plink eigenvec file, parsing columns strictly. First two must be 'fam' and 'id', which are strings, and all remaining columns (eigenvectors) must be numeric.
read_eigenvec( file, ext = "eigenvec", plink2 = FALSE, comment = if (plink2) "" else "#", verbose = TRUE )
read_eigenvec( file, ext = "eigenvec", plink2 = FALSE, comment = if (plink2) "" else "#", verbose = TRUE )
file |
The input file path, potentially excluding extension. |
ext |
File extension (default "eigenvec") can be changed if desired.
Set to |
plink2 |
If |
comment |
A string used to identify comments.
Any text after the comment characters will be silently ignored.
Passed to |
verbose |
If |
A list with two elements:
eigenvec
: A numeric R matrix containing the parsed eigenvectors.
If plink2 = TRUE
, the original column names will be preserved in this matrix.
fam
: A tibble with two columns, fam
and id
, which are the first two columns of the parsed file.
These column names are always the same even if plink2 = TRUE
(i.e. they won't be FID
or IID
).
write_eigenvec()
for writing an eigenvec file.
Plink 1 eigenvec format reference: https://www.cog-genomics.org/plink/1.9/formats#eigenvec
Plink 2 eigenvec format reference: https://www.cog-genomics.org/plink/2.0/formats#eigenvec
GCTA eigenvec format reference: https://cnsgenomics.com/software/gcta/#PCA
# to read "data.eigenvec", run like this: # data <- read_eigenvec("data") # this also works # data <- read_eigenvec("data.eigenvec") # # either way you get a list with these two items: # numeric eigenvector matrix # data$eigenvec # fam/id tibble # data$fam # The following example is more awkward # because package sample data has to be specified in this weird way: # read an existing *.eigenvec file created by GCTA file <- system.file("extdata", 'sample-gcta.eigenvec', package = "genio", mustWork = TRUE) data <- read_eigenvec(file) # numeric eigenvector matrix data$eigenvec # fam/id tibble data$fam # can specify without extension file <- sub('\\.eigenvec$', '', file) # remove extension from this path on purpose file # verify .eigenvec is missing data <- read_eigenvec(file) # load it anyway! data$eigenvec # read an existing *.eigenvec file created by Plink 2 file <- system.file("extdata", 'sample-plink2.eigenvec', package = "genio", mustWork = TRUE) # this version ignores header data <- read_eigenvec(file) # numeric eigenvector matrix data$eigenvec # fam/id tibble data$fam # this version uses header data <- read_eigenvec(file, plink2 = TRUE) # numeric eigenvector matrix data$eigenvec # fam/id tibble data$fam
# to read "data.eigenvec", run like this: # data <- read_eigenvec("data") # this also works # data <- read_eigenvec("data.eigenvec") # # either way you get a list with these two items: # numeric eigenvector matrix # data$eigenvec # fam/id tibble # data$fam # The following example is more awkward # because package sample data has to be specified in this weird way: # read an existing *.eigenvec file created by GCTA file <- system.file("extdata", 'sample-gcta.eigenvec', package = "genio", mustWork = TRUE) data <- read_eigenvec(file) # numeric eigenvector matrix data$eigenvec # fam/id tibble data$fam # can specify without extension file <- sub('\\.eigenvec$', '', file) # remove extension from this path on purpose file # verify .eigenvec is missing data <- read_eigenvec(file) # load it anyway! data$eigenvec # read an existing *.eigenvec file created by Plink 2 file <- system.file("extdata", 'sample-plink2.eigenvec', package = "genio", mustWork = TRUE) # this version ignores header data <- read_eigenvec(file) # numeric eigenvector matrix data$eigenvec # fam/id tibble data$fam # this version uses header data <- read_eigenvec(file, plink2 = TRUE) # numeric eigenvector matrix data$eigenvec # fam/id tibble data$fam
This function reads a standard Plink *.fam file into a tibble with named columns.
It uses readr::read_table()
to do it efficiently.
read_fam(file, verbose = TRUE)
read_fam(file, verbose = TRUE)
file |
Input file (whatever is accepted by |
verbose |
If |
A tibble with columns: fam
, id
, pat
, mat
, sex
, pheno
.
read_plink()
for reading a set of BED/BIM/FAM files.
Plink FAM format reference: https://www.cog-genomics.org/plink/1.9/formats#fam
# to read "data.fam", run like this: # fam <- read_fam("data") # this also works # fam <- read_fam("data.fam") # The following example is more awkward # because package sample data has to be specified in this weird way: # read an existing Plink *.fam file file <- system.file("extdata", 'sample.fam', package = "genio", mustWork = TRUE) fam <- read_fam(file) fam # can specify without extension file <- sub('\\.fam$', '', file) # remove extension from this path on purpose file # verify .fam is missing fam <- read_fam(file) # load it anyway! fam
# to read "data.fam", run like this: # fam <- read_fam("data") # this also works # fam <- read_fam("data.fam") # The following example is more awkward # because package sample data has to be specified in this weird way: # read an existing Plink *.fam file file <- system.file("extdata", 'sample.fam', package = "genio", mustWork = TRUE) fam <- read_fam(file) fam # can specify without extension file <- sub('\\.fam$', '', file) # remove extension from this path on purpose file # verify .fam is missing fam <- read_fam(file) # load it anyway! fam
This function reads a GCTA Genetic Relatedness Matrix (GRM, i.e. kinship) set of files in their binary format, returning the kinship matrix and, if available, the corresponding matrix of pair sample sizes (non-trivial under missingness) and individuals table. Setting some options allows reading plink2 binary kinship formats such as "king" (see examples).
read_grm( name, n_ind = NA, verbose = TRUE, ext = "grm", shape = c("triangle", "strict_triangle", "square"), size_bytes = 4, comment = "#" )
read_grm( name, n_ind = NA, verbose = TRUE, ext = "grm", shape = c("triangle", "strict_triangle", "square"), size_bytes = 4, comment = "#" )
name |
The base name of the input files.
Files with that base, plus shared extension (default "grm", see |
n_ind |
The number of individuals, required if the file with the extension |
verbose |
If |
ext |
Shared extension for all three inputs (see |
shape |
The shape of the information to read (may be abbreviated).
Default "triangle" assumes there are |
size_bytes |
The number of bytes per number encoded. Default 4 corresponds to GCTA GRM and plink2 "bin4", whereas plink2 "bin" requires a value of 8. |
comment |
Character to start comments in |
A list with named elements:
kinship
: The symmetric n
-times-n
kinship matrix (GRM). Has IDs as row and column names if the file with extension .<ext>.id
exists. If shape='strict_triangle'
, diagonal will have missing values.
M
: The symmetric n
-times-n
matrix of pair sample sizes (number of non-missing loci pairs), if the file with extension .<ext>.N.bin
exists. Has IDs as row and column names if the file with extension .<ext>.id
was available. If shape='strict_triangle'
, diagonal will have missing values.
fam
: A tibble with two columns: fam
and id
, same as in Plink FAM files. Returned if the file with extension .<ext>.id
exists.
Greatly adapted from sample code from GCTA: https://cnsgenomics.com/software/gcta/#MakingaGRM
# to read "data.grm.bin" and etc, run like this: # obj <- read_grm("data") # obj$kinship # the kinship matrix # obj$M # the pair sample sizes matrix # obj$fam # the fam and ID tibble # The following example is more awkward # because package sample data has to be specified in this weird way: # read an existing set of GRM files file <- system.file("extdata", 'sample.grm.bin', package = "genio", mustWork = TRUE) file <- sub('\\.grm\\.bin$', '', file) # remove extension from this path on purpose obj <- read_grm(file) obj$kinship # the kinship matrix obj$M # the pair sample sizes matrix obj$fam # the fam and ID tibble # Read sample plink2 KING-robust files (several variants). # Read both base.king.bin and base.king.id files. # All generated with "plink2 <input> --make-king <options> --out base" # (replace "base" with actual base name) with these options: # #1) "triangle bin" # data <- read_grm( 'base', ext = 'king', shape = 'strict', size_bytes = 8 ) # #2) "triangle bin4" # data <- read_grm( 'base', ext = 'king', shape = 'strict' ) # #3) "square bin" # data <- read_grm( 'base', ext = 'king', shape = 'square', size_bytes = 8 ) # #4) "square bin4" # data <- read_grm( 'base', ext = 'king', shape = 'square' )
# to read "data.grm.bin" and etc, run like this: # obj <- read_grm("data") # obj$kinship # the kinship matrix # obj$M # the pair sample sizes matrix # obj$fam # the fam and ID tibble # The following example is more awkward # because package sample data has to be specified in this weird way: # read an existing set of GRM files file <- system.file("extdata", 'sample.grm.bin', package = "genio", mustWork = TRUE) file <- sub('\\.grm\\.bin$', '', file) # remove extension from this path on purpose obj <- read_grm(file) obj$kinship # the kinship matrix obj$M # the pair sample sizes matrix obj$fam # the fam and ID tibble # Read sample plink2 KING-robust files (several variants). # Read both base.king.bin and base.king.id files. # All generated with "plink2 <input> --make-king <options> --out base" # (replace "base" with actual base name) with these options: # #1) "triangle bin" # data <- read_grm( 'base', ext = 'king', shape = 'strict', size_bytes = 8 ) # #2) "triangle bin4" # data <- read_grm( 'base', ext = 'king', shape = 'strict' ) # #3) "square bin" # data <- read_grm( 'base', ext = 'king', shape = 'square', size_bytes = 8 ) # #4) "square bin4" # data <- read_grm( 'base', ext = 'king', shape = 'square' )
This function reads a standard Eigenstrat *.ind file into a tibble.
It uses readr::read_table()
to do it efficiently.
read_ind(file, verbose = TRUE)
read_ind(file, verbose = TRUE)
file |
Input file (whatever is accepted by |
verbose |
If |
A tibble with columns: id
, sex
, label
.
Eigenstrat IND format reference: https://github.com/DReichLab/EIG/tree/master/CONVERTF
# to read "data.ind", run like this: # ind <- read_ind("data") # this also works # ind <- read_ind("data.ind") # The following example is more awkward # because package sample data has to be specified in this weird way: # read an existing Eigenstrat *.ind file file <- system.file("extdata", 'sample.ind', package = "genio", mustWork = TRUE) ind <- read_ind(file) ind # can specify without extension file <- sub('\\.ind$', '', file) # remove extension from this path on purpose file # verify .ind is missing ind <- read_ind(file) # load it anyway! ind
# to read "data.ind", run like this: # ind <- read_ind("data") # this also works # ind <- read_ind("data.ind") # The following example is more awkward # because package sample data has to be specified in this weird way: # read an existing Eigenstrat *.ind file file <- system.file("extdata", 'sample.ind', package = "genio", mustWork = TRUE) ind <- read_ind(file) ind # can specify without extension file <- sub('\\.ind$', '', file) # remove extension from this path on purpose file # verify .ind is missing ind <- read_ind(file) # load it anyway! ind
Reads a matrix file under strict assumptions that it is entirely numeric and there are no row or column names present in this file.
It uses readr::read_table()
to do it efficiently.
Intended for outputs such as those of admixture inference approaches.
read_matrix(file, ext = "txt", verbose = TRUE)
read_matrix(file, ext = "txt", verbose = TRUE)
file |
Input file (whatever is accepted by |
ext |
The desired file extension.
Ignored if |
verbose |
If |
A numeric matrix without row or column names.
write_matrix()
, the inverse function.
# to read "data.txt", run like this: # mat <- read_matrix("data") # this also works # mat <- read_matrix("data.txt") # The following example is more awkward # because package sample data has to be specified in this weird way: # read an existing matrix *.txt file file <- system.file("extdata", 'sample-Q3.txt', package = "genio", mustWork = TRUE) mat <- read_matrix(file) mat # can specify without extension file <- sub('\\.txt$', '', file) # remove extension from this path on purpose file # verify .txt is missing mat <- read_matrix(file) # load it anyway! mat
# to read "data.txt", run like this: # mat <- read_matrix("data") # this also works # mat <- read_matrix("data.txt") # The following example is more awkward # because package sample data has to be specified in this weird way: # read an existing matrix *.txt file file <- system.file("extdata", 'sample-Q3.txt', package = "genio", mustWork = TRUE) mat <- read_matrix(file) mat # can specify without extension file <- sub('\\.txt$', '', file) # remove extension from this path on purpose file # verify .txt is missing mat <- read_matrix(file) # load it anyway! mat
This function reads a standard *.phen file into a tibble.
It uses readr::read_table()
to do it efficiently.
GCTA and EMMAX use this format.
read_phen(file, verbose = TRUE)
read_phen(file, verbose = TRUE)
file |
Input file (whatever is accepted by |
verbose |
If |
A tibble with columns: fam
, id
, pheno
.
GCTA PHEN format reference: https://cnsgenomics.com/software/gcta/#GREMLanalysis
# to read "data.phen", run like this: # phen <- read_phen("data") # this also works # phen <- read_phen("data.phen") # The following example is more awkward # because package sample data has to be specified in this weird way: # read an existing plink *.phen file file <- system.file("extdata", 'sample.phen', package = "genio", mustWork = TRUE) phen <- read_phen(file) phen # can specify without extension file <- sub('\\.phen$', '', file) # remove extension from this path on purpose file # verify .phen is missing phen <- read_phen(file) # load it anyway! phen
# to read "data.phen", run like this: # phen <- read_phen("data") # this also works # phen <- read_phen("data.phen") # The following example is more awkward # because package sample data has to be specified in this weird way: # read an existing plink *.phen file file <- system.file("extdata", 'sample.phen', package = "genio", mustWork = TRUE) phen <- read_phen(file) phen # can specify without extension file <- sub('\\.phen$', '', file) # remove extension from this path on purpose file # verify .phen is missing phen <- read_phen(file) # load it anyway! phen
This function reads a genotype matrix (X
, encoded as reference allele dosages) and its associated locus (bim
) and individual (fam
) data tables in the three Plink files in BED, BIM, and FAM formats, respectively.
All inputs must exist or an error is thrown.
This function is a wrapper around the more basic functions
read_bed()
,
read_bim()
,
read_fam()
,
which simplifies data parsing and additionally better guarantees data integrity.
Below suppose there are m
loci and n
individuals.
read_plink(file, verbose = TRUE)
read_plink(file, verbose = TRUE)
file |
Input file path, without extensions (each of .bed, .bim, .fam extensions will be added automatically as needed). Alternatively, input file path may have .bed extension (but not .bim, .fam, or other extensions). |
verbose |
If |
A named list with items in this order: X
(genotype matrix, see description in return value of read_bed()
), bim
(tibble, see read_bim()
), fam
(tibble, see read_fam()
).
X
has row and column names corresponding to the id
values of the bim
and fam
tibbles.
read_bed()
,
read_bim()
, and
read_fam()
for individual parsers of each input table, including a description of each object returned.
geno_to_char()
for translating numerical genotypes into more human-readable character encodings.
Plink BED/BIM/FAM format reference: https://www.cog-genomics.org/plink/1.9/formats
# to read "data.bed" etc, run like this: # obj <- read_plink("data") # this also works # obj <- read_plink("data.bed") # # you get a list with these three items: # genotypes # obj$X # locus annotations # obj$bim # individual annotations # obj$fam # The following example is more awkward # because package sample data has to be specified in this weird way: # first get path to BED file file <- system.file("extdata", 'sample.bed', package = "genio", mustWork = TRUE) # read genotypes and annotation tables plink_data <- read_plink(file) # genotypes plink_data$X # locus annotations plink_data$bim # individual annotations plink_data$fam # the same works without .bed extension file <- sub('\\.bed$', '', file) # remove extension # it works! plink_data <- read_plink(file)
# to read "data.bed" etc, run like this: # obj <- read_plink("data") # this also works # obj <- read_plink("data.bed") # # you get a list with these three items: # genotypes # obj$X # locus annotations # obj$bim # individual annotations # obj$fam # The following example is more awkward # because package sample data has to be specified in this weird way: # first get path to BED file file <- system.file("extdata", 'sample.bed', package = "genio", mustWork = TRUE) # read genotypes and annotation tables plink_data <- read_plink(file) # genotypes plink_data$X # locus annotations plink_data$bim # individual annotations plink_data$fam # the same works without .bed extension file <- sub('\\.bed$', '', file) # remove extension # it works! plink_data <- read_plink(file)
This function reads a standard Eigenstrat *.snp file into a tibble.
It uses readr::read_table()
to do it efficiently.
read_snp(file, verbose = TRUE)
read_snp(file, verbose = TRUE)
file |
Input file (whatever is accepted by |
verbose |
If |
A tibble with columns: id
, chr
, posg
, pos
, ref
, alt
Eigenstrat SNP format reference: https://github.com/DReichLab/EIG/tree/master/CONVERTF
# to read "data.snp", run like this: # snp <- read_snp("data") # this also works # snp <- read_snp("data.snp") # The following example is more awkward # because package sample data has to be specified in this weird way: # read an existing Eigenstrat *.snp file file <- system.file("extdata", 'sample.snp', package = "genio", mustWork = TRUE) snp <- read_snp(file) snp # can specify without extension file <- sub('\\.snp$', '', file) # remove extension from this path on purpose file # verify .snp is missing snp <- read_snp(file) # load it anyway! snp
# to read "data.snp", run like this: # snp <- read_snp("data") # this also works # snp <- read_snp("data.snp") # The following example is more awkward # because package sample data has to be specified in this weird way: # read an existing Eigenstrat *.snp file file <- system.file("extdata", 'sample.snp', package = "genio", mustWork = TRUE) snp <- read_snp(file) snp # can specify without extension file <- sub('\\.snp$', '', file) # remove extension from this path on purpose file # verify .snp is missing snp <- read_snp(file) # load it anyway! snp
This function checks that each of the GCTA binary GRM files (grm.bin
, grm.N.bin
, and grm.id
extensions) are present, given the shared base file path, stopping with an informative message if any of the files is missing.
This function aids troubleshooting, as various downstream external software report missing files differently and sometimes using confusing or obscure messages.
require_files_grm(file)
require_files_grm(file)
file |
The shared file path (excluding extensions: |
Nothing
# to require all of "data.grm.bin", "data.grm.N.bin", and "data.grm.id", run like this: # (stops if any of the three files is missing) # require_files_grm("data") # The following example is more awkward # because package sample data has to be specified in this weird way: # check that the samples we want exist # start with bed file file <- system.file("extdata", 'sample.grm.bin', package = "genio", mustWork = TRUE) # remove extension file <- sub('\\.grm\\.bin$', '', file) # since all sample.grm.{bin,N.bin,id} files exist, this will not stop with error messages: require_files_grm(file)
# to require all of "data.grm.bin", "data.grm.N.bin", and "data.grm.id", run like this: # (stops if any of the three files is missing) # require_files_grm("data") # The following example is more awkward # because package sample data has to be specified in this weird way: # check that the samples we want exist # start with bed file file <- system.file("extdata", 'sample.grm.bin', package = "genio", mustWork = TRUE) # remove extension file <- sub('\\.grm\\.bin$', '', file) # since all sample.grm.{bin,N.bin,id} files exist, this will not stop with error messages: require_files_grm(file)
This function checks that the PHEN file is present, given the base file path, stopping with an informative message if the file is missing. This function aids troubleshooting, as various downstream external software report missing files differently and sometimes using confusing or obscure messages.
require_files_phen(file)
require_files_phen(file)
file |
The base file path (excluding |
Nothing
# to require "data.phen", run like this: # (stops if file is missing) # require_files_phen("data") # The following example is more awkward # because package sample data has to be specified in this weird way: # check that the samples we want exist # get path to an existing phen file file <- system.file("extdata", 'sample.phen', package = "genio", mustWork = TRUE) # remove extension file <- sub('\\.phen$', '', file) # since sample.phen file exist, this will not stop with error messages: require_files_phen(file)
# to require "data.phen", run like this: # (stops if file is missing) # require_files_phen("data") # The following example is more awkward # because package sample data has to be specified in this weird way: # check that the samples we want exist # get path to an existing phen file file <- system.file("extdata", 'sample.phen', package = "genio", mustWork = TRUE) # remove extension file <- sub('\\.phen$', '', file) # since sample.phen file exist, this will not stop with error messages: require_files_phen(file)
This function checks that each of the Plink binary files (BED/BIM/FAM extensions) are present, given the shared base file path, stopping with an informative message if any of the files is missing. This function aids troubleshooting, as various downstream external software report missing files differently and sometimes using confusing or obscure messages.
require_files_plink(file)
require_files_plink(file)
file |
The shared file path (excluding extensions |
Nothing
# to require all of "data.bed", "data.bim", and "data.fam", run like this: # (stops if any of the three files is missing) # require_files_plink("data") # The following example is more awkward # because package sample data has to be specified in this weird way: # check that the samples we want exist # start with bed file file <- system.file("extdata", 'sample.bed', package = "genio", mustWork = TRUE) # remove extension file <- sub('\\.bed$', '', file) # since all sample.{bed,bim,fam} files exist, this will not stop with error messages: require_files_plink(file)
# to require all of "data.bed", "data.bim", and "data.fam", run like this: # (stops if any of the three files is missing) # require_files_plink("data") # The following example is more awkward # because package sample data has to be specified in this weird way: # check that the samples we want exist # start with bed file file <- system.file("extdata", 'sample.bed', package = "genio", mustWork = TRUE) # remove extension file <- sub('\\.bed$', '', file) # since all sample.{bed,bim,fam} files exist, this will not stop with error messages: require_files_plink(file)
This function accepts the integer sex codes accepted by Plink and turns them into the character codes accepted by Eigenstrat.
Only upper-case characters are returned.
Cases outside the table below are mapped to U
(unknown) with a warning.
The correspondence is:
0
: U
(unknown)
1
: M
(male)
2
: F
(female)
sex_to_char(sex)
sex_to_char(sex)
sex |
Integer vector of sex codes |
The converted character vector of sex codes
Eigenstrat IND format reference: https://github.com/DReichLab/EIG/tree/master/CONVERTF
Plink FAM format reference: https://www.cog-genomics.org/plink/1.9/formats#fam
# verify the mapping above sex_int <- 0:2 sex_char <- c('U', 'M', 'F') # expected values stopifnot( all( sex_to_char( sex_int ) == sex_char ) )
# verify the mapping above sex_int <- 0:2 sex_char <- c('U', 'M', 'F') # expected values stopifnot( all( sex_to_char( sex_int ) == sex_char ) )
This function accepts the character sex codes accepted by Eigenstrat and turns them into the integer codes accepted by Plink.
Matching is case insensitive.
Cases outside the table below are mapped to 0
(unknown) with a warning.
The correspondence is:
U
: 0
(unknown)
M
: 1
(male)
F
: 2
(female)
sex_to_int(sex)
sex_to_int(sex)
sex |
Character vector of sex codes |
The converted numeric vector of sex codes
Eigenstrat IND format reference: https://github.com/DReichLab/EIG/tree/master/CONVERTF
Plink FAM format reference: https://www.cog-genomics.org/plink/1.9/formats#fam
# verify the mapping above sex_char <- c('U', 'm', 'f') # mixed case works! sex_int <- 0:2 # expected values stopifnot( all( sex_to_int( sex_char ) == sex_int ) )
# verify the mapping above sex_char <- c('U', 'm', 'f') # mixed case works! sex_int <- 0:2 # expected values stopifnot( all( sex_to_int( sex_char ) == sex_int ) )
To save memory, simulate small chunks of variants at the time and write them to file as you go.
This is a wrapper around write_plink()
and readr::write_lines()
(for ancestral allele frequencies, optional) with append = TRUE
that simplifies looping somewhat.
The function always appends to the output, so it can be called several times if convenient, for example to simulate separate chromosomes.
sim_and_write_plink( sim_chunk, m_loci, fam, file, file_p_anc = NA, n_data_cut = 10^6 )
sim_and_write_plink( sim_chunk, m_loci, fam, file, file_p_anc = NA, n_data_cut = 10^6 )
sim_chunk |
A function that generates a small number of loci at the time, to be called iteratively until the whole genome is complete. It should accept a single parameter, the number of loci to simulate at one time, and returns a list with these named elements:
|
m_loci |
Total number of loci to simulate. |
fam |
Sample table of simulation to write. |
file |
Output file path, without extensions (each of .bed, .bim, .fam extensions will be added automatically as needed). |
file_p_anc |
Complete file path (with extensions) of vector of ancestral allele frequencies, if |
n_data_cut |
Number of cells (individuals times loci) to aim to simulate at the time. Actual number may be smaller to ensure that the number of loci is an integer, except if the number of individuals is greater than |
# some global constants that will be accessed by simulator function n <- 10 # and a global variable updated as we go m_last <- 0 # define a trivial but complete genotype simulator function my_sim_chunk <- function( m_chunk ) { # construct ancestral allele frequencies p_anc <- runif( m_chunk ) # simulate genotypes from HWE X <- matrix( rbinom( m_chunk * n, 2, p_anc ), m_chunk, n ) # construct a trivial BIM table bim <- make_bim( n = m_chunk ) # but make sure count continues across chunks without repeats # (so IDs and positions don't clash!) bim$id <- m_last + ( 1 : m_chunk ) # update global value (use <<-) for next round m_last <<- m_last + m_chunk # return all of these elements in a named list! return( list( X = X, bim = bim, p_anc = p_anc ) ) } # the fam table is created fully now fam <- make_fam( n = n ) # set other parameters m_loci <- 100 # this is only necessary for example files to be in a *temporary* location # (don't use `tempfile` in real cases) # plink files path without extension file <- tempfile('test') # p_anc file should have extension filep <- tempfile('test-p-anc.txt.gz') # simulate and write as we go! sim_and_write_plink( my_sim_chunk, m_loci, fam, file, filep ) # clean up: delete sample outputs delete_files_plink( file ) file.remove( filep )
# some global constants that will be accessed by simulator function n <- 10 # and a global variable updated as we go m_last <- 0 # define a trivial but complete genotype simulator function my_sim_chunk <- function( m_chunk ) { # construct ancestral allele frequencies p_anc <- runif( m_chunk ) # simulate genotypes from HWE X <- matrix( rbinom( m_chunk * n, 2, p_anc ), m_chunk, n ) # construct a trivial BIM table bim <- make_bim( n = m_chunk ) # but make sure count continues across chunks without repeats # (so IDs and positions don't clash!) bim$id <- m_last + ( 1 : m_chunk ) # update global value (use <<-) for next round m_last <<- m_last + m_chunk # return all of these elements in a named list! return( list( X = X, bim = bim, p_anc = p_anc ) ) } # the fam table is created fully now fam <- make_fam( n = n ) # set other parameters m_loci <- 100 # this is only necessary for example files to be in a *temporary* location # (don't use `tempfile` in real cases) # plink files path without extension file <- tempfile('test') # p_anc file should have extension filep <- tempfile('test-p-anc.txt.gz') # simulate and write as we go! sim_and_write_plink( my_sim_chunk, m_loci, fam, file, filep ) # clean up: delete sample outputs delete_files_plink( file ) file.remove( filep )
This function creates a symbolic (soft) link to a file, in a solution that works for all major operating systems, so a file can have two names without actually duplicating data. Although the two paths can be specified directly, this function automatically handles a conversion for a common but troublesome case when the link is not in the current directory, in which case the file must be relative to the parent directory of the link, although it is more natural to specify the file relative to the current directory.
symlink(file, link, adjust_path = TRUE, verbose = TRUE)
symlink(file, link, adjust_path = TRUE, verbose = TRUE)
file |
The file that will be linked. This function does not require this file to exist, but the link will be broken in that case. |
link |
The path to the link to the file. If this points to an existing file, or an existing link, it will be overwritten. |
adjust_path |
If |
verbose |
If |
# in this example, for the existing file, use this file provided by the package. # Note that it is an absolute path, so it will not be edited. file <- system.file("extdata", 'sample.bed', package = "genio", mustWork = TRUE) # this is the path to the link link <- tempfile('delete-me-example', fileext = '.bed') # create the symbolic link! symlink( file, link ) # delete example link when done file.remove( link )
# in this example, for the existing file, use this file provided by the package. # Note that it is an absolute path, so it will not be edited. file <- system.file("extdata", 'sample.bed', package = "genio", mustWork = TRUE) # this is the path to the link link <- tempfile('delete-me-example', fileext = '.bed') # create the symbolic link! symlink( file, link ) # delete example link when done file.remove( link )
A square symmetric kinship matrix is transformed into a tibble, with a row per unique element in the kinship matrix, and three columns: ID of row, ID of column, and the kinship value.
tidy_kinship(kinship, sort = TRUE)
tidy_kinship(kinship, sort = TRUE)
kinship |
The |
sort |
If |
A tibble with n * ( n + 1 ) / 2
rows (the upper triangle, including the diagonal), and 3 columns with names: id1
, id2
, kinship
.
# create a symmetric matrix kinship <- matrix( c( 0.5, 0.1, 0.0, 0.1, 0.5, 0.2, 0.0, 0.2, 0.6 ), nrow = 3 ) # add names (best for tidy version) colnames(kinship) <- paste0('pop', 1:3) rownames(kinship) <- paste0('pop', 1:3) # this returns tidy version kinship_tidy <- tidy_kinship( kinship ) # test colnames stopifnot( colnames( kinship_tidy ) == c('id1', 'id2', 'kinship') ) # test row number stopifnot( nrow( kinship_tidy ) == 6 ) # inspect it kinship_tidy
# create a symmetric matrix kinship <- matrix( c( 0.5, 0.1, 0.0, 0.1, 0.5, 0.2, 0.0, 0.2, 0.6 ), nrow = 3 ) # add names (best for tidy version) colnames(kinship) <- paste0('pop', 1:3) rownames(kinship) <- paste0('pop', 1:3) # this returns tidy version kinship_tidy <- tidy_kinship( kinship ) # test colnames stopifnot( colnames( kinship_tidy ) == c('id1', 'id2', 'kinship') ) # test row number stopifnot( nrow( kinship_tidy ) == 6 ) # inspect it kinship_tidy
This function accepts a standard R matrix containing genotypes (values in c( 0, 1, 2, NA )
) and writes it into a Plink-formatted BED (binary) file.
Each genotype per locus (m
loci) and individual (n
total) counts the number of alternative alleles or NA
for missing data.
No *.fam or *.bim files are created by this basic function.
write_bed(file, X, verbose = TRUE, append = FALSE)
write_bed(file, X, verbose = TRUE, append = FALSE)
file |
Output file path. .bed extension may be omitted (will be added automatically if it is missing). |
X |
The |
verbose |
If |
append |
If |
Genotypes with values outside of [0, 2] cause an error, in which case the partial output is deleted. However, beware that decimals get truncated internally, so values that truncate to 0, 1, or 2 will not raise errors. The BED format does not accept fractional dosages, so such data will not be written as expected.
Nothing
write_plink()
for writing a set of BED/BIM/FAM files.
Plink BED format reference: https://www.cog-genomics.org/plink/1.9/formats#bed
# to write an existing matrix `X` into file "data.bed", run like this: # write_bed("data", X) # this also works # write_bed("data.bed", X) # The following example is more detailed but also more awkward # because (only for these examples) the package must create the file in a *temporary* location file_out <- tempfile('delete-me-example', fileext = '.bed') # will also work without extension # create 10 random genotypes X <- rbinom(10, 2, 0.5) # replace 3 random genotypes with missing values X[sample(10, 3)] <- NA # turn into 5x2 matrix X <- matrix(X, nrow = 5, ncol = 2) # write this data to file in BED format # (only *.bed gets created, no *.fam or *.bim in this call) write_bed(file_out, X) # delete output when done file.remove(file_out)
# to write an existing matrix `X` into file "data.bed", run like this: # write_bed("data", X) # this also works # write_bed("data.bed", X) # The following example is more detailed but also more awkward # because (only for these examples) the package must create the file in a *temporary* location file_out <- tempfile('delete-me-example', fileext = '.bed') # will also work without extension # create 10 random genotypes X <- rbinom(10, 2, 0.5) # replace 3 random genotypes with missing values X[sample(10, 3)] <- NA # turn into 5x2 matrix X <- matrix(X, nrow = 5, ncol = 2) # write this data to file in BED format # (only *.bed gets created, no *.fam or *.bim in this call) write_bed(file_out, X) # delete output when done file.remove(file_out)
This function writes a tibble with the right columns into a standard Plink *.bim file.
It uses readr::write_tsv()
to do it efficiently.
write_bim(file, tib, verbose = TRUE, append = FALSE)
write_bim(file, tib, verbose = TRUE, append = FALSE)
file |
Output file (whatever is accepted by |
tib |
The tibble or data.frame to write.
It must contain these columns: |
verbose |
If |
append |
If |
The output tib
invisibly (what readr::write_tsv()
returns).
write_plink()
for writing a set of BED/BIM/FAM files.
Plink BIM format references: https://www.cog-genomics.org/plink/1.9/formats#bim https://www.cog-genomics.org/plink/2.0/formats#bim
# to write an existing table `bim` into file "data.bim", run like this: # write_bim("data", bim) # this also works # write_bim("data.bim", bim) # The following example is more detailed but also more awkward # because (only for these examples) the package must create the file in a *temporary* location # create a dummy tibble with the right columns library(tibble) tib <- tibble( chr = 1:3, id = 1:3, posg = 0, pos = 1:3, alt = 'B', ref = 'A' ) # a dummy file file_out <- tempfile('delete-me-example', fileext = '.bim') # will also work without extension # write the table out in *.bim format (no header, columns in right order) write_bim(file_out, tib) # example cleanup file.remove(file_out)
# to write an existing table `bim` into file "data.bim", run like this: # write_bim("data", bim) # this also works # write_bim("data.bim", bim) # The following example is more detailed but also more awkward # because (only for these examples) the package must create the file in a *temporary* location # create a dummy tibble with the right columns library(tibble) tib <- tibble( chr = 1:3, id = 1:3, posg = 0, pos = 1:3, alt = 'B', ref = 'A' ) # a dummy file file_out <- tempfile('delete-me-example', fileext = '.bim') # will also work without extension # write the table out in *.bim format (no header, columns in right order) write_bim(file_out, tib) # example cleanup file.remove(file_out)
This function writes eigenvectors in Plink 1 (same as GCTA) format (table with no header, with first two columns being fam
and id
), which is a subset of Plink 2 format (which optionally allows column names and does not require fam column).
Main expected case is eigenvec
passed as a numeric matrix and fam
provided to complete first two missing columns.
However, input eigenvec
may also be a data.frame already containing the fam
and id
columns, and other reasonable intermediate cases are also handled.
If both eigenvec
and fam
are provided and contain overlapping columns, those in eigenvec
get overwritten with a warning.
write_eigenvec( file, eigenvec, fam = NULL, ext = "eigenvec", plink2 = FALSE, verbose = TRUE )
write_eigenvec( file, eigenvec, fam = NULL, ext = "eigenvec", plink2 = FALSE, verbose = TRUE )
file |
The output file name (possibly without extension) |
eigenvec |
A matrix or tibble containing the eigenvectors to include in the file.
Column names other than |
fam |
An optional |
ext |
Output file extension. Since the general "covariates" file format in GCTA and Plink are the same as this, this function may be used to write more general covariates files if desired, in which case users may wish to change this extension for clarity. |
plink2 |
If |
verbose |
If |
Invisibly, the final eigenvec
data.frame or tibble written to file, starting with columns fam
and id
(merged from the fam
input, if it was passed) followed by the rest of columns in the input eigenvec
.
Column names are instead #FID
, IID
, etc if plink2 = TRUE
.
read_eigenvec()
for reading an eigenvec file.
Plink 1 eigenvec format reference: https://www.cog-genomics.org/plink/1.9/formats#eigenvec
Plink 2 eigenvec format reference: https://www.cog-genomics.org/plink/2.0/formats#eigenvec
GCTA eigenvec format reference: https://cnsgenomics.com/software/gcta/#PCA
# to write an existing matrix `eigenvec` and optional `fam` tibble into file "data.eigenvec", # run like this: # write_eigenvec("data", eigenvec, fam = fam) # this also works # write_eigenvec("data.eigenvec", eigenvec, fam = fam) # The following example is more detailed but also more awkward # because (only for these examples) the package must create the file in a *temporary* location # create dummy eigenvectors matrix, in this case from a small identity matrix # number of individuals n <- 10 eigenvec <- eigen( diag( n ) )$vectors # subset columns to use top 3 eigenvectors only eigenvec <- eigenvec[ , 1:3 ] # dummy fam data library(tibble) fam <- tibble( fam = 1:n, id = 1:n ) # write this data to .eigenvec file # output path without extension file <- tempfile('delete-me-example') eigenvec_final <- write_eigenvec( file, eigenvec, fam = fam ) # inspect the tibble that was written to file (returned invisibly) eigenvec_final # remove temporary file (add extension before deletion) file.remove( paste0( file, '.eigenvec' ) )
# to write an existing matrix `eigenvec` and optional `fam` tibble into file "data.eigenvec", # run like this: # write_eigenvec("data", eigenvec, fam = fam) # this also works # write_eigenvec("data.eigenvec", eigenvec, fam = fam) # The following example is more detailed but also more awkward # because (only for these examples) the package must create the file in a *temporary* location # create dummy eigenvectors matrix, in this case from a small identity matrix # number of individuals n <- 10 eigenvec <- eigen( diag( n ) )$vectors # subset columns to use top 3 eigenvectors only eigenvec <- eigenvec[ , 1:3 ] # dummy fam data library(tibble) fam <- tibble( fam = 1:n, id = 1:n ) # write this data to .eigenvec file # output path without extension file <- tempfile('delete-me-example') eigenvec_final <- write_eigenvec( file, eigenvec, fam = fam ) # inspect the tibble that was written to file (returned invisibly) eigenvec_final # remove temporary file (add extension before deletion) file.remove( paste0( file, '.eigenvec' ) )
This function writes a tibble with the right columns into a standard Plink *.fam file.
It uses readr::write_tsv()
to do it efficiently.
write_fam(file, tib, verbose = TRUE)
write_fam(file, tib, verbose = TRUE)
file |
Output file (whatever is accepted by |
tib |
The tibble or data.frame to write.
It must contain these columns: |
verbose |
If |
The output tib
invisibly (what readr::write_tsv()
returns).
write_plink()
for writing a set of BED/BIM/FAM files.
Plink FAM format reference: https://www.cog-genomics.org/plink/1.9/formats#fam
# to write an existing table `fam` into file "data.fam", run like this: # write_fam("data", fam) # this also works # write_fam("data.fam", fam) # The following example is more detailed but also more awkward # because (only for these examples) the package must create the file in a *temporary* location # create a dummy tibble with the right columns library(tibble) tib <- tibble( fam = 1:3, id = 1:3, pat = 0, mat = 0, sex = 1, pheno = 1 ) # a dummy file file_out <- tempfile('delete-me-example', fileext = '.fam') # will also work without extension # write the table out in *.fam format (no header, columns in right order) write_fam(file_out, tib) # delete output when done file.remove(file_out)
# to write an existing table `fam` into file "data.fam", run like this: # write_fam("data", fam) # this also works # write_fam("data.fam", fam) # The following example is more detailed but also more awkward # because (only for these examples) the package must create the file in a *temporary* location # create a dummy tibble with the right columns library(tibble) tib <- tibble( fam = 1:3, id = 1:3, pat = 0, mat = 0, sex = 1, pheno = 1 ) # a dummy file file_out <- tempfile('delete-me-example', fileext = '.fam') # will also work without extension # write the table out in *.fam format (no header, columns in right order) write_fam(file_out, tib) # delete output when done file.remove(file_out)
This function writes a GCTA Genetic Relatedness Matrix (GRM, i.e. kinship) set of files in their binary format, given a kinship matrix and, if available, the corresponding matrix of pair sample sizes (non-trivial under missingness) and individuals table.
Setting some options allows writing plink2 binary kinship formats such as "king" (follow examples in read_grm()
).
write_grm( name, kinship, M = NULL, fam = NULL, verbose = TRUE, ext = "grm", shape = c("triangle", "strict_triangle", "square"), size_bytes = 4 )
write_grm( name, kinship, M = NULL, fam = NULL, verbose = TRUE, ext = "grm", shape = c("triangle", "strict_triangle", "square"), size_bytes = 4 )
name |
The base name of the output files.
Files with that base, plus shared extension (default "grm", see |
kinship |
The symmetric |
M |
The optional symmetric |
fam |
The optional data.frame or tibble with individual annotations (columns with names |
verbose |
If |
ext |
Shared extension for all three outputs (see |
shape |
The shape of the information to write (may be abbreviated).
Default "triangle" assumes there are |
size_bytes |
The number of bytes per number encoded. Default 4 corresponds to GCTA GRM and plink2 "bin4", whereas plink2 "bin" requires a value of 8. |
# to write existing data `kinship`, `M`, and `fam` into files "data.grm.bin" etc, run like this: # write_grm("data", kinship, M = M, fam = fam ) # The following example is more detailed but also more awkward # because (only for these examples) the package must create the file in a *temporary* location # create dummy data to write # kinship for 3 individuals kinship <- matrix( c( 0.6, 0.2, 0.0, 0.2, 0.5, 0.1, 0.0, 0.1, 0.5 ), nrow = 3 ) # pair sample sizes matrix M <- matrix( c( 10, 9, 8, 9, 9, 7, 8, 7, 8 ), nrow = 3 ) # individual annotations table library(tibble) fam <- tibble( fam = 1:3, id = 1:3 ) # dummy files to write and delete name <- tempfile('delete-me-example') # no extension # write the data now! write_grm( name, kinship, M = M, fam = fam ) # delete outputs when done delete_files_grm( name )
# to write existing data `kinship`, `M`, and `fam` into files "data.grm.bin" etc, run like this: # write_grm("data", kinship, M = M, fam = fam ) # The following example is more detailed but also more awkward # because (only for these examples) the package must create the file in a *temporary* location # create dummy data to write # kinship for 3 individuals kinship <- matrix( c( 0.6, 0.2, 0.0, 0.2, 0.5, 0.1, 0.0, 0.1, 0.5 ), nrow = 3 ) # pair sample sizes matrix M <- matrix( c( 10, 9, 8, 9, 9, 7, 8, 7, 8 ), nrow = 3 ) # individual annotations table library(tibble) fam <- tibble( fam = 1:3, id = 1:3 ) # dummy files to write and delete name <- tempfile('delete-me-example') # no extension # write the data now! write_grm( name, kinship, M = M, fam = fam ) # delete outputs when done delete_files_grm( name )
This function writes a tibble with the right columns into a standard Eigenstrat *.ind file.
It uses readr::write_tsv()
to do it efficiently.
write_ind(file, tib, verbose = TRUE)
write_ind(file, tib, verbose = TRUE)
file |
Output file (whatever is accepted by |
tib |
The tibble or data.frame to write.
It must contain these columns: |
verbose |
If |
The output tib
invisibly (what readr::write_tsv()
returns).
Eigenstrat IND format reference: https://github.com/DReichLab/EIG/tree/master/CONVERTF
# to write an existing table `ind` into file "data.ind", run like this: # write_ind("data", ind) # this also works # write_ind("data.ind", ind) # The following example is more detailed but also more awkward # because (only for these examples) the package must create the file in a *temporary* location # create a dummy tibble with the right columns library(tibble) tib <- tibble( id = 1:3, sex = 1, label = 1 ) # a dummy file file_out <- tempfile('delete-me-example', fileext = '.ind') # will also work without extension # write the table out in *.ind format (no header, columns in right order) write_ind(file_out, tib) # delete output when done file.remove(file_out)
# to write an existing table `ind` into file "data.ind", run like this: # write_ind("data", ind) # this also works # write_ind("data.ind", ind) # The following example is more detailed but also more awkward # because (only for these examples) the package must create the file in a *temporary* location # create a dummy tibble with the right columns library(tibble) tib <- tibble( id = 1:3, sex = 1, label = 1 ) # a dummy file file_out <- tempfile('delete-me-example', fileext = '.ind') # will also work without extension # write the table out in *.ind format (no header, columns in right order) write_ind(file_out, tib) # delete output when done file.remove(file_out)
The inverse function of read_matrix()
, this writes what is intended to be a numeric matrix to a tab-delimited file without row or column names present.
It uses readr::write_tsv()
to do it efficiently.
Intended for outputs such as those of admixture inference approaches.
write_matrix(file, x, ext = "txt", verbose = TRUE, append = FALSE)
write_matrix(file, x, ext = "txt", verbose = TRUE, append = FALSE)
file |
Output file (whatever is accepted by |
x |
The matrix to write.
Unlike |
ext |
The desired file extension.
If |
verbose |
If |
append |
If |
The output x
, coerced into data.frame, invisibly (what readr::write_tsv()
returns).
read_matrix()
, the inverse function.
# to write an existing matrix `x` into file "data.txt", run like this: # write_matrix( "data", x ) # this also works # write_matrix( "data.txt", x ) # The following example is more detailed but also more awkward # because (only for these examples) the package must create the file in a *temporary* location # create a dummy matrix with the right columns x <- rbind( 1:3, (0:2)/10, -1:1 ) # a dummy file file_out <- tempfile('delete-me-example', fileext = '.txt') # will also work without extension # write the matrix without header write_matrix( file_out, x ) # delete output when done file.remove( file_out )
# to write an existing matrix `x` into file "data.txt", run like this: # write_matrix( "data", x ) # this also works # write_matrix( "data.txt", x ) # The following example is more detailed but also more awkward # because (only for these examples) the package must create the file in a *temporary* location # create a dummy matrix with the right columns x <- rbind( 1:3, (0:2)/10, -1:1 ) # a dummy file file_out <- tempfile('delete-me-example', fileext = '.txt') # will also work without extension # write the matrix without header write_matrix( file_out, x ) # delete output when done file.remove( file_out )
This function writes a tibble with the right columns into a standard *.phen file.
It uses readr::write_tsv()
to do it efficiently.
GCTA and EMMAX use this format.
write_phen(file, tib, verbose = TRUE)
write_phen(file, tib, verbose = TRUE)
file |
Output file (whatever is accepted by |
tib |
The tibble or data.frame to write.
It must contain these columns: |
verbose |
If |
The output tib
invisibly (what readr::write_tsv()
returns).
GCTA PHEN format reference: https://cnsgenomics.com/software/gcta/#GREMLanalysis
# to write an existing table `phen` into file "data.phen", run like this: # write_phen("data", phen) # this also works # write_phen("data.phen", phen) # The following example is more detailed but also more awkward # because (only for these examples) the package must create the file in a *temporary* location # create a dummy tibble with the right columns library(tibble) tib <- tibble( fam = 1:3, id = 1:3, pheno = 1 ) # a dummy file file_out <- tempfile('delete-me-example', fileext = '.phen') # will also work without extension # write the table out in *.phen format (no header, columns in right order) write_phen(file_out, tib) # delete output when done file.remove(file_out)
# to write an existing table `phen` into file "data.phen", run like this: # write_phen("data", phen) # this also works # write_phen("data.phen", phen) # The following example is more detailed but also more awkward # because (only for these examples) the package must create the file in a *temporary* location # create a dummy tibble with the right columns library(tibble) tib <- tibble( fam = 1:3, id = 1:3, pheno = 1 ) # a dummy file file_out <- tempfile('delete-me-example', fileext = '.phen') # will also work without extension # write the table out in *.phen format (no header, columns in right order) write_phen(file_out, tib) # delete output when done file.remove(file_out)
This function writes a genotype matrix (X
) and its associated locus (bim
) and individual (fam
) data tables into three Plink files in BED, BIM, and FAM formats, respectively.
This function is a wrapper around the more basic functions
write_bed()
,
write_bim()
,
write_fam()
,
but additionally tests that the data dimensions agree (or stops with an error).
Also checks that the genotype row and column names agree with the bim and fam tables if they are all present.
In addition, if bim = NULL
or fam = NULL
, these are auto-generated using
make_bim()
and
make_fam()
,
which is useful behavior for simulated data.
Lastly, the phenotype can be provided as a separate argument and incorporated automatically if fam = NULL
(a common scenario for simulated genotypes and traits).
Below suppose there are m
loci and n
individuals.
write_plink( file, X, bim = NULL, fam = NULL, pheno = NULL, verbose = TRUE, append = FALSE, write_phen = FALSE )
write_plink( file, X, bim = NULL, fam = NULL, pheno = NULL, verbose = TRUE, append = FALSE, write_phen = FALSE )
file |
Output file path, without extensions (each of .bed, .bim, .fam extensions will be added automatically as needed). |
X |
The |
bim |
The tibble or data.frame containing locus information.
It must contain |
fam |
The tibble or data.frame containing individual information.
It must contain |
pheno |
The phenotype to write into the FAM file assuming |
verbose |
If |
append |
If |
write_phen |
If |
Invisibly, a named list with items in this order: X
(genotype matrix), bim
(tibble), fam
(tibble).
This is most useful when either BIM or FAM tables were auto-generated.
write_bed()
,
write_bim()
,
write_fam()
,
make_bim()
,
make_fam()
.
Plink BED/BIM/FAM format reference: https://www.cog-genomics.org/plink/1.9/formats
# to write existing data `X`, `bim`, `fam` into files "data.bed", "data.bim", and "data.fam", # run like this: # write_plink("data", X, bim = bim, fam = fam) # The following example is more detailed but also more awkward # because (only for these examples) the package must create the file in a *temporary* location # here is an example for a simulation # create 10 random genotypes X <- rbinom(10, 2, 0.5) # replace 3 random genotypes with missing values X[sample(10, 3)] <- NA # turn into 5x2 matrix X <- matrix(X, nrow = 5, ncol = 2) # simulate a trait for two individuals pheno <- rnorm(2) # write this data to BED/BIM/FAM files # output path without extension file_out <- tempfile('delete-me-example') # here all of the BIM and FAM columns except `pheno` are autogenerated write_plink(file_out, X, pheno = pheno) # delete all three outputs when done delete_files_plink( file_out )
# to write existing data `X`, `bim`, `fam` into files "data.bed", "data.bim", and "data.fam", # run like this: # write_plink("data", X, bim = bim, fam = fam) # The following example is more detailed but also more awkward # because (only for these examples) the package must create the file in a *temporary* location # here is an example for a simulation # create 10 random genotypes X <- rbinom(10, 2, 0.5) # replace 3 random genotypes with missing values X[sample(10, 3)] <- NA # turn into 5x2 matrix X <- matrix(X, nrow = 5, ncol = 2) # simulate a trait for two individuals pheno <- rnorm(2) # write this data to BED/BIM/FAM files # output path without extension file_out <- tempfile('delete-me-example') # here all of the BIM and FAM columns except `pheno` are autogenerated write_plink(file_out, X, pheno = pheno) # delete all three outputs when done delete_files_plink( file_out )
This function writes a tibble with the right columns into a standard Eigenstrat *.snp file.
It uses readr::write_tsv()
to do it efficiently.
write_snp(file, tib, verbose = TRUE)
write_snp(file, tib, verbose = TRUE)
file |
Output file (whatever is accepted by |
tib |
The tibble or data.frame to write.
It must contain these columns: |
verbose |
If |
The output tib
invisibly (what readr::write_tsv()
returns).
Eigenstrat SNP format reference: https://github.com/DReichLab/EIG/tree/master/CONVERTF
# to write an existing table `snp` into file "data.snp", run like this: # write_snp("data", snp) # this also works # write_snp("data.snp", snp) # The following example is more detailed but also more awkward # because (only for these examples) the package must create the file in a *temporary* location # create a dummy tibble with the right columns library(tibble) tib <- tibble( id = 1:3, chr = 1:3, posg = 0, pos = 1:3, ref = 'A', alt = 'B' ) # a dummy file file_out <- tempfile('delete-me-example', fileext = '.snp') # will also work without extension # write the table out in *.snp format (no header, columns in right order) write_snp(file_out, tib) # delete output when done file.remove(file_out)
# to write an existing table `snp` into file "data.snp", run like this: # write_snp("data", snp) # this also works # write_snp("data.snp", snp) # The following example is more detailed but also more awkward # because (only for these examples) the package must create the file in a *temporary* location # create a dummy tibble with the right columns library(tibble) tib <- tibble( id = 1:3, chr = 1:3, posg = 0, pos = 1:3, ref = 'A', alt = 'B' ) # a dummy file file_out <- tempfile('delete-me-example', fileext = '.snp') # will also work without extension # write the table out in *.snp format (no header, columns in right order) write_snp(file_out, tib) # delete output when done file.remove(file_out)