API for linkage disequilibrium
LD statistics class and function
- class moments.LD.LDstats(data, num_pops=None, pop_ids=None)[source]
Represents linkage disequilibrium statistics as a list of arrays, where each entry in the list is an array of statistics for a corresponding recombination rate. The final entry in the list is always the heterozygosity statistics. Thus, if we have an LDstats object for 3 recombination rate values, the list will have length 4.
LDstats are represented as a list of statistics over two locus pairs for a given recombination distance.
- Parameters:
data (list of arrays) – A list of LD and heterozygosity stats.
num_pops (int) – Number of populations. For one population, higher order statistics may be computed.
pop_ids (list of strings, optional) – Population IDs in order that statistics are represented here.
- H(pops=None)[source]
Returns heterozygosity statistics for the populations given.
- Parameters:
pops (list of ints, optional) – The indexes of populations to return stats for.
- H2(X, Y=None, phased=True)[source]
Note: the H2 name may change! This is sometimes called “D+”.
This is the statistics E[2*fAB*fab + 2*fAb*faB], which measures the probability of polymorphism between two sampled genome copies at two loci.
This is closely related to pi2=p(1-p)q(1-q), which is inherently a four-haplotype statistic. Instead, H2 is a two-haplotype statistic, which can be measured with just a single diploid.
In the one population case, it equals 4D^2+2Dz+4pi2. In the two population case, it equals 4D1*D2+Dz(1,2,2)+Dz(2,1,1)+4pi2(1,2,1,2).
At steady state, the solution is theta**2 * (36 + 14*rho + rho**2) / (18 + 13*rho + rho**2).
To compare to unphased data estimated assuming that double heterozygotes have equal probability of being in coupling or repulsion LD, the statistic equals D1*D2+1/2(Dz(1,2,2)+Dz(2,1,1)+4pi2.
Note: This unphased case (setting phased=False) is only relevant to comparing to data where a single diploid individual exists from each population, and it is only needed for cross-population H2!
- LD(pops=None)[source]
Returns LD stats for populations given (if None, returns all).
- Parameters:
pops (list of ints, optional) – The indexes of populations to return stats for.
- admix(pop0, pop1, f, new_id='Adm')[source]
Admixture between pop0 and pop1, given by indexes. f is the fraction contributed by pop0, so pop1 contributes 1-f. If new_id is not specified, the admixed population’s name is ‘Adm’. Otherwise, we can set it with new_id=new_pop_id.
- Parameters:
pop0 (int) – First population to admix.
pop1 (int) – Second population to admix.
f (float) – The fraction of ancestry contributed by pop0, so pop1 contributes 1 - f.
new_id (str, optional) – The name of the admixed population.
- f2(X, Y)[source]
Returns \(f_2(X, Y) = (X-Y)^2\).
X, and Y can be specified as population ID strings, or as indexes (but these cannot be mixed).
- Parameters:
X – One of the populations, as index or population ID.
Y – The other population, as index or population ID.
- f3(X, Y, Z)[source]
Returns \(f_3(X; Y, Z) = (X-Y)(X-Z)\). A significantly negative \(f_3\) of this form suggests that population X is the result of admixture between ancient populations related to Y and Z. A positive value suggests that X is an outgroup to Y and Z.
X, Y, and Z can be specified as population ID strings, or as indexes (but these cannot be mixed).
- Parameters:
X – The “test” population, as index or population ID.
Y – The first reference population, as index or population ID.
Z – The second reference population, as index or population ID.
- f4(X, Y, Z, W)[source]
Returns \(f_4(X, Y; Z, W) = (X-Y)(Z-W)\).
X, Y, Z and W can be specified as population ID strings, or as indexes (but these cannot be mixed).
- Parameters:
X – A population index or ID.
Y – A population index or ID.
Z – A population index or ID.
W – A population index or ID.
- Returns:
Patterson’s f4 statistic (pX-pY)*(pZ-pW).
- static from_demes(g, sampled_demes, sample_times=None, rho=None, theta=0.001, r=None, u=None)[source]
Takes a deme graph and computes the LD stats.
demes
is a package for specifying demographic models in a user-friendly, human-readable YAML format. This function automatically parses the demographic description and returns a LD for the specified populations and recombination and mutation rates.- Parameters:
g (
demes.DemeGraph
) – Ademes
DemeGraph from which to compute the LD.sampled_demes (list of strings) – A list of deme IDs to take samples from. We can repeat demes, as long as the sampling of repeated deme IDs occurs at distinct times.
sample_times (list of floats, optional) – If None, assumes all sampling occurs at the end of the existence of the sampled deme. If there are ancient samples,
sample_times
must be a list of same length assampled_demes
, giving the sampling times for each sampled deme. Sampling times are given in time units of the original deme graph, so might not necessarily be generations (e.g. ifg.time_units
is years)rho – The population-size scaled recombination rate(s). Can be None, a non-negative float, or a list of values.
theta – The population-size scaled mutation rate. This defaults to
theta=0.001
, which is very roughly what is observed in humans.r – The raw recombination rate. Can be None, a non-negative float, or a list of values.
u – The raw per-base mutation rate.
theta
is set to4 * Ne * u
, whereNe
is the reference population size from the root deme.
- Returns:
A
moments.LD
LD statistics object, with number of populations equal to the length ofsampled_demes
.- Return type:
- static from_file(fid, return_statistics=False, return_comments=False)[source]
Read LD statistics from file.
- Parameters:
fid (str) – The file name to read from or an open file object.
return_statistics (bool, optional) – If true, returns statistics writen to file.
return_comments (bool, optional) – If true, the return value is (y, comments), where comments is a list of strings containing the comments from the file (without #’s).
- integrate(nu, tf, dt=None, dt_fac=0.02, rho=None, theta=0.001, m=None, selfing=None, selfing_rate=None, frozen=None)[source]
Integrates the LD statistics forward in time. When integrating LD statistics for a list of recombination rates and mutation rate, they must be passed as keywork arguments to this function. We can integrate either single-population LD statistics up to order 10, or multi-population LD statistics but only for order 2 (which includes \(D^2\), \(Dz\), and \(\pi_2\)).
- Parameters:
nu (list or function) – The relative population size, may be a function of time, given as a list [nu1, nu2, …]
tf (float) – Total time to integrate
dt (float) – Integration timestep. This is deprecated! Use dt_fac instead.
dt_fac (float) – The integration time step factor, so that dt is determined by tf * dt_fac. Note: Should also build in adaptive time-stepping… this is to come.
rho (float or list of floats) – Can be a single recombination rate or list of recombination rates (in which case we are integrating a list of LD stats for each rate)
theta – The per base population-scaled mutation rate (4N*mu) if we pass [theta1, theta2], differing mutation rates at left and right locus, implemented in the ISM=True model
m (array) – The migration matrix (num_pops x num_pops, storing m_ij migration rates where m_ij is probability that a lineage in i had parent in j m_ii is unused, and found by summing off diag elements in the ith row
selfing (list of floats) – A list of selfing probabilities, same length as nu.
selfing_rate (list of floats) – Alias for selfing.
frozen (list of bools) – A list of True and False same length as nu. True implies that a lineage is frozen (as in ancient samples). False integrates as normal.
- marginalize(pops)[source]
Marginalize over the LDstats, removing moments for given populations.
- Parameters:
pops (int or list of ints) – The index or list of indexes of populations to marginalize.
- merge(pop0, pop1, f, new_id='Merged')[source]
Merger of populations pop0 and pop1, with fraction f from pop0 and 1-f from pop1. Places new population at the end, then marginalizes pop0 and pop1. To admix two populations and keep one or both, use pulse migrate or admix, respectively.
- Parameters:
pop0 (int) – First population to merge.
pop1 (int) – Second population to merge.
f (float) – The fraction of ancestry contributed by pop0, so pop1 contributes 1 - f.
new_id (str, optional) – The name of the merged population.
- names()[source]
Returns the set of LD and heterozygosity statistics names for the number of populations represented by the LDstats.
Note that this will always return the full set of statistics,
- pulse_migrate(pop0, pop1, f)[source]
Pulse migration/admixure event from pop0 to pop1, with fraction f replacement. We use the admix function above. We want to keep the original population names the same, if they are given in the LDstats object, so we use new_pop=self.pop_ids[pop1].
We admix pop0 and pop1 with fraction f and 1-f, then swap the new admixed population with pop1, then marginalize the original pop1.
- Parameters:
pop0 (int) – The index of the source population.
pop1 (int) – The index of the target population.
f (float) – The fraction of ancestry contributed by the source population.
- split(pop_to_split, new_ids=None)[source]
Splits the population given into two child populations. One child population keeps the same index and the second child population is placed at the end of the list of present populations. If new_ids is given, we can set the population IDs of the child populations, but only if the input LDstats have population IDs available.
- Parameters:
pop_to_split (int) – The index of the population to split.
new_ids (list of strings, optional) – List of child population names, of length two.
- static steady_state(nus, m=None, rho=None, theta=0.001, selfing_rate=None, pop_ids=None)[source]
Computes the steady state solution for any number of populations. The number of populations is determined by the length of
nus
, which is a list with relative population sizes (often, these will be set to 1, meaning sizes are equal to some reference or ancestral population size).If the solution for more than one population is desired, we must provide
m
, an n-by-n migration matrix, and there must be migrations that connect all populations so that a steady state solution exists. This corresponds to an island model with potentially asymmetric migration and allows for unequal population sizes.- Parameters:
nus (list-like) – The relative population sizes, with one or two entries, corresponding to a steady state solution with one or two populations, resp.
m (array-like) – A migration matrix, only provided when the length of nus is 2 or more.
rho – The population-size scaled recombination rate(s). Can be None, a non-negative float, or a list of values.
theta (float) – The population-size scaled mutation rate
selfing_rate (number or list of numbers) – Self-fertilization rate(s), given as a number (for a single population, or list of numbers (for two populations). Selfing rates must be between 0 and 1.
pop_ids (list of strings) – The population IDs.
- Returns:
A
moments.LD
LD statistics object.- Return type:
- swap_pops(pop0, pop1)[source]
Swaps pop0 and pop1 in the order of the population in the LDstats.
- Parameters:
pop0 (int) – The index of the first population to swap.
pop1 (int) – The index of the second population to swap.
- to_file(fid, precision=16, statistics='ALL', comment_lines=[])[source]
Write LD statistics to file.
The file format is:
# Any number of comment lines beginning with a ‘#’
A single line containing an integer giving the number of populations.
On the same line, optional, the names of those populations. If names are given, there needs to be the same number of pop_ids as the integer number of populations. For example, the line could be ‘3 YRI CEU CHB’.
A single line giving the names of the LD statistics, in the order they appear for each recombination rate distance or bin. Optionally, this line could read ALL, indicating that every statistic in the basis is given, and in the ‘correct’ order.
A single line giving the names of the heterozygosity statistics, in the order they appear in the final row of data. Optionally, this line could read ALL.
A line giving the number of recombination rate bins/distances we have data for (so we know how many to read)
One line for each row of LD statistics.
A single line for the heterozygosity statistics.
- Parameters:
fid (str) – The file name to write to or an open file object.
precision (int) – The precision with which to write out entries of the LD stats. (They are formated via %.<p>g, where <p> is the precision.)
statistics (list of list of strings) – Defaults to ‘ALL’, meaning all statistics are given in the LDstats object. Otherwise, list of two lists, first giving present LD stats, and the second giving present het stats.
comment_lines (list of srtings) – List of strings to be used as comment lines in the header of the output file. I use comment lines mainly to record the recombination bins or distances given in the LDstats (something like “‘edges = ‘ + str(r_edges)”.
Demographic functions
- moments.LD.Demographics1D.bottlegrowth(params, order=2, rho=None, theta=0.001, pop_ids=None)[source]
Exponential growth (or decay) model after size change.
- Parameters:
params (list) – The relative initial and final sizes of the final epoch and its integration time in genetic units: (nuB, nuF, T).
order (int) – The maximum order of the LD statistics. Defaults to 2.
rho (float or list of floats, optional) – Population-scaled recombination rate (4Nr), given as scalar or list of rhos.
theta (float) – Population-scaled mutation rate (4Nu). Defaults to 0.001.
pop_ids (lits of str, optional) – List of population IDs of length 1.
- moments.LD.Demographics1D.growth(params, order=2, rho=None, theta=0.001, pop_ids=None)[source]
Exponential growth (or decay) model.
- Parameters:
params (list) – The relative final size and integration time of recent epoch, in genetic units: (nuF, T)
order (int) – The maximum order of the LD statistics. Defaults to 2.
rho (float or list of floats, optional) – Population-scaled recombination rate (4Nr), given as scalar or list of rhos.
theta (float) – Population-scaled mutation rate (4Nu). Defaults to 0.001.
pop_ids (lits of str, optional) – List of population IDs of length 1.
- moments.LD.Demographics1D.snm(params=None, order=2, rho=None, theta=0.001, pop_ids=None)[source]
Equilibrium neutral model. Does not take demographic parameters.
- Parameters:
params – Unused.
order (int) – The maximum order of the LD statistics. Defaults to 2.
rho (float or list of floats, optional) – Population-scaled recombination rate (4Nr), given as scalar or list of rhos.
theta (float) – Population-scaled mutation rate (4Nu). Defaults to 0.001.
pop_ids (lits of str, optional) – List of population IDs of length 1.
- moments.LD.Demographics1D.three_epoch(params, order=2, rho=None, theta=0.001, pop_ids=None)[source]
Three epoch model with constant sized epochs.
- Parameters:
params (list) – The relative sizes and integration times of recent epochs, in genetic units: (nu1, nu2, T1, T2).
order (int) – The maximum order of the LD statistics. Defaults to 2.
rho (float or list of floats, optional) – Population-scaled recombination rate (4Nr), given as scalar or list of rhos.
theta (float) – Population-scaled mutation rate (4Nu). Defaults to 0.001.
pop_ids (lits of str, optional) – List of population IDs of length 1.
- moments.LD.Demographics1D.two_epoch(params, order=2, rho=None, theta=0.001, pop_ids=None)[source]
Two epoch model with a single size change and constant sized epochs.
- Parameters:
params (list) – The relative size and integration time of recent epoch, in genetic units: (nu, T).
order (int) – The maximum order of the LD statistics. Defaults to 2.
rho (float or list of floats, optional) – Population-scaled recombination rate (4Nr), given as scalar or list of rhos.
theta (float) – Population-scaled mutation rate (4Nu). Defaults to 0.001.
pop_ids (lits of str, optional) – List of population IDs of length 1.
- moments.LD.Demographics2D.island_model(params, rho=None, theta=0.001, pop_ids=None)[source]
Split into two populations of specifed size, which then have their own relative constant sizes and symmetric migration between populations.
nu1: Relative size of population 1 after split.
nu2: Relative size of population 2 after split.
m12: Migration rate, from 2 to 1 forward in time (in units 2*Ne*m)
m21: Migration rate, from 1 to 2 forward in time (in units 2*Ne*m)
- Parameters:
params – The input parameters: (nu1, nu2, m12, m21)
rho (float or list of floats, optional) – Population-scaled recombination rate (4Nr), given as scalar or list of rhos.
theta (float) – Population-scaled mutation rate (4Nu). Defaults to 0.001.
pop_ids (lits of str, optional) – List of population IDs of length 1.
- moments.LD.Demographics2D.snm(params=None, rho=None, theta=0.001, pop_ids=None)[source]
Equilibrium neutral model. Neutral steady state followed by split in the immediate past.
- Parameters:
params – Unused.
rho (float or list of floats, optional) – Population-scaled recombination rate (4Nr), given as scalar or list of rhos.
theta (float) – Population-scaled mutation rate (4Nu). Defaults to 0.001.
pop_ids (lits of str, optional) – List of population IDs of length 2.
- moments.LD.Demographics2D.split_mig(params, rho=None, theta=0.001, pop_ids=None)[source]
Split into two populations of specifed size, which then have their own relative constant sizes and symmetric migration between populations.
nu1: Size of population 1 after split.
nu2: Size of population 2 after split.
T: Time in the past of split (in units of 2*Na generations)
m: Migration rate between populations (2*Na*m)
- Parameters:
params – The input parameters: (nu1, nu2, T, m)
rho (float or list of floats, optional) – Population-scaled recombination rate (4Nr), given as scalar or list of rhos.
theta (float) – Population-scaled mutation rate (4Nu). Defaults to 0.001.
pop_ids (lits of str, optional) – List of population IDs of length 1.
Three-population demographic models.
- moments.LD.Demographics3D.out_of_Africa(params, rho=None, theta=0.001, pop_ids=['YRI', 'CEU', 'CHB'])[source]
The Gutenkunst et al (2009) out-of-Africa that has been reinferred a number of times.
- Parameters:
params – List of parameters, in the order (nuA, TA, nuB, TB, nuEu0, nuEuF, nuAs0, nuAsF, TF, mAfB, mAfEu, mAfAs, mEuAs).
rho – Recombination rate or list of recombination rates (population-size scaled).
theta – Population-size scaled mutation rate.
pop_ids – List of population IDs.
Utility functions
- moments.LD.Util.het_names(num_pops)[source]
Returns the heterozygosity statistic representation names.
- Parameters:
num_pops (int) – Number of populations.
- moments.LD.Util.ld_names(num_pops)[source]
Returns the LD statistic representation names.
- Parameters:
num_pops (int) – Number of populations.
- moments.LD.Util.map_moment(mom)[source]
There are repeated moments with equal expectations, so we collapse them into the same moment.
- Parameters:
mom (str) – The moment to map to its “canonical” name.
- moments.LD.Util.moment_names(num_pops)[source]
Returns a tuple of length two with LD and heterozygosity moment names.
- Parameters:
num_pops (int) – Number of populations
- moments.LD.Util.perturb_params(params, fold=1, lower_bound=None, upper_bound=None)[source]
Generate a perturbed set of parameters. Each element of params is randomly perturbed by the given factor of 2 up or down.
- Parameters:
params (list) – A list of input parameters.
fold (float) – Number of factors of 2 to perturb by.
lower_bound (list) – If not None, the resulting parameter set is adjusted to have all value greater than lower_bound. Must have equal length to
params
.upper_bound (list) – If not None, the resulting parameter set is adjusted to have all value less than upper_bound. Must have equal length to
params
.
- moments.LD.Util.rescale_params(params, types, Ne=None, gens=1, uncerts=None, time_offset=0)[source]
Rescale parameters to physical units, so that times are in generations or years, sizes in effective instead of relative sizes, and migration probabilities in per-generation units.
For generation times of events to be correctly rescaled, times in the parameters list must be specified so that earlier epochs are earlier in the list, because we return rescaled cumulative times. All time parameters must refer to consecutive epochs. Epochs need not start at contemporary time, and we can specify the time offset using time_offset.
If uncertainties are not given (uncerts = None), the return value is an array of rescaled parameters. If uncertainties are given, the return value has length two: the first entry is an array of rescaled parameters, and the second entry is an array of rescaled uncertainties.
- Parameters:
params (list) – List of parameters.
types (list) – List of parameter types. Times are given by “T”, sizes by “nu”, effective size by “Ne”, migration rates by “m”, and fractions by “x” or “f”.
Ne (float) – The effective population size, typically as the last entry in
params
.gens (float) – The generation time.
uncerts (list) – List of uncertainties, same length as
params
.time_offset (int or float) – The amount of time added to each rescaled time point. This lets us have consecutive epochs that stop short of time 0 (final sampling time).
Parsing functions
- moments.LD.Parsing.bootstrap_data(all_data, normalization=0)[source]
Returns bootstrapped variances for LD statistics. This function operates on data that is sums (i.e. the direct output of
compute_ld_statistics()
), instead of mean statistics.We first check that all ‘stats’, ‘bins’, ‘pops’ (if present), match across all regions
If there are N total regions, we compute N bootstrap replicates by sampling N times with replacement and summing over all ‘sums’.
- Parameters:
all_data (dict) – A dictionary (with arbitrary keys), where each value is LD statistics computed from a distinct region. all_data[reg] stats from each region has keys, ‘bins’, ‘sums’, ‘stats’, and optional ‘pops’.
normalization (int) – we work with \(\sigma_d^2\) statistics, and by default we use population 0 to normalize stats
- moments.LD.Parsing.compute_average_stats(Gs, genotypes=True)[source]
Takes the outputs of
compute_pairwise_stats
and returns the average value for each statistic.- Parameters:
Gs – A genotype matrix, of size L-by-n, where L is the number of loci and n is the sample size. Missing data is encoded as -1.
genotypes – If True, use 0, 1, 2 genotypes. If False, use 0, 1 phased haplotypes.
- moments.LD.Parsing.compute_average_stats_between(Gs1, Gs2, genotypes=True)[source]
Takes the outputs of compute_pairwise_stats_between and returns the average value for each statistic.
- Parameters:
Gs1 – A genotype matrices, of size L1 by n, where L1 is the number of loci and n is the sample size. Missing data is encoded as -1.
Gs2 – A genotype matrices, of size L2 by n, where L1 is the number of loci and n is the sample size. Missing data is encoded as -1.
- moments.LD.Parsing.compute_ld_statistics(vcf_file, bed_file=None, chromosome=None, rec_map_file=None, map_name=None, map_sep=None, pop_file=None, pops=None, cM=True, r_bins=None, bp_bins=None, min_bp=None, use_genotypes=True, use_h5=True, stats_to_compute=None, ac_filter=True, report=True, report_spacing=1000, use_cache=True)[source]
Computes LD statistics for a given VCF. Binning can be done by base pair or recombination distances, the latter requiring a recombination map. For more than one population, we include a population file that maps samples to populations, and specify with populations to compute statistics fro.
If data is phased, we can set
use_genotypes
to False, and there are other options for masking data.Note
Currently, the recombination map is not given in HapMap format. Future versions will accept HapMap formatted recombination maps and deprecate some of the boutique handling of map options here.
- Parameters:
vcf_file (str) – The input VCF file name.
bed_file (str) – An optional bed file that specifies regions over which to compute LD statistics. If None, computes statistics for all positions in VCF.
chromosome (str) – If None, treats all positions in VCF as coming from same chromosome. If multiple chromosomes are reported in the same VCF, we need to specify which chromosome to keep variants from.
rec_map_file (str) – The input recombination map. The format is {pos} {map (cM)} {additional maps}
map_name (str) – If None, takes the first map column, otherwise takes the specified map column with the name matching the recombination map file header.
map_sep (str) – Deprecated! We now read the recombination map, splitting by any white space. Previous behaviour: Tells pandas how to parse the recombination map.
pop_file (str) – A file the specifies the population for each sample in the VCF. Each sample is listed on its own line, in the format “{sample} {pop}”. The first line must be “sample pop”.
pops (list(str)) – List of populations to compute statistics for. If none are given, it treates every sample as coming from the same population.
cM (bool) – If True, the recombination map is specified in cM. If False, the map is given in units of Morgans.
r_bins (list(float)) – A list of raw recombination rate bin edges.
bp_bins (list(float)) – If
r_bins
are not given, a list of bp bin edges (for use when no recombination map is specified).min_bp (int, float) – The minimum bp allowed for a segment specified by the bed file.
use_genotypes (bool) – If True, we assume the data in the VCF is unphased. Otherwise, we use phased information.
use_h5 (bool) – If True, we use the h5 format.
stats_to_compute (list) – If given, we compute only the statistics specified. Otherwise, we compute all possible statistics for the populations given.
ac_filter – Ensure at least two samples are present per population. This prevents computed heterozygosity statistics from returning NaN when some loci have too few called samples.
report (bool) – If True, we report the progress of our parsing.
report_spacing (int) – We track the number of “left” variants we compute, and report our progress with the given spacing.
use_cache (bool) – If True, cache intermediate results.
- moments.LD.Parsing.compute_pairwise_stats(Gs, genotypes=True)[source]
Computes \(D^2\), \(Dz\), \(\pi_2\), and \(D\) for every pair of loci within a block of SNPs, coded as a genotype matrix.
- Parameters:
Gs – A genotype matrix, of size L-by-n, where L is the number of loci and n is the sample size. Missing data is encoded as -1.
genotypes – If True, use 0, 1, 2 genotypes. If False, use 0, 1 phased haplotypes.
- moments.LD.Parsing.compute_pairwise_stats_between(Gs1, Gs2, genotypes=True)[source]
Computes \(D^2\), \(Dz\), \(\pi_2\), and \(D\) for every pair of loci between two blocks of SNPs, coded as genotype matrices.
The Gs are matrices, where rows correspond to loci and columns to individuals. Both matrices must have the same number of individuals. If Gs1 has length L1 and Gs2 has length L2, we compute all pairwise counts, which has size (L1*L2, 9).
We use the sparse genotype matrix representation, where we first “sparsify” the genotype matrix, and then count two-locus genotype configurations from that, from which we compute two-locus statistics
- Parameters:
Gs1 – A genotype matrices, of size L1 by n, where L1 is the number of loci and n is the sample size. Missing data is encoded as -1.
Gs2 – A genotype matrices, of size L2 by n, where L1 is the number of loci and n is the sample size. Missing data is encoded as -1.
genotypes – If True, use 0, 1, 2 genotypes. If False, use 0, 1 phased haplotypes.
- moments.LD.Parsing.get_bootstrap_sets(all_data, num_bootstraps=None, normalization=0, remove_norm_stats=True, remove_Dz=False)[source]
From a dictionary of all the regional data, resample with replacement to construct bootstrap data.
Returns a list of bootstrapped datasets of mean statistics.
- Parameters:
all_data (dict) – Dictionary of regional LD statistics. Keys are region identifiers and must be unique, and the items are the outputs of
compute_ld_statistics
.num_bootstraps (int) – The number of bootstrap replicates to compute. If None, it computes the same number as the nubmer of regions in
all_data
.normalization (int) – The index of the population to normalize by. Defaults to 0.
remove_norm_stats (bool) – If we should remove the stat used for normalization.
remove_Dz (bool) – If we should remove Dz statistics.
- moments.LD.Parsing.get_genotypes(vcf_file, bed_file=None, chromosome=None, min_bp=None, use_h5=True, report=True)[source]
Given a vcf file, we extract the biallelic SNP genotypes. If bed_file is None, we use all valid variants. Otherwise we filter genotypes by the given bed file. If chromosome is given, filters to keep snps only in given chrom (useful for vcfs spanning multiple chromosomes).
If use_h5 is True, we try to load the h5 file, which has the same path/name as vcf_file, but with {fname}.h5 instead of {fname}.vcf or {fname}.vcf.gz. If the h5 file does not exist, we create it and save it as {fname}.h5
Returns (biallelic positions, biallelic genotypes, biallelic allele counts, sampled ids).
- Parameters:
vcf_file (str) – A VCF-formatted file.
bed_file (str, optional) – A bed file specifying regions to compute statistics from. The chromosome name formatting must match the chromosome name formatting of the input VCF (i.e., both carry the leading “chr” or both omit it).
min_bp (int, optional) – only used with bed file, filters out features that are smaller than
min_bp
.chromosome (int or str, optional) – Chromosome to compute LD statistics from.
use_h5 (bool, optional) – If use_h5 is True, we try to load the h5 file, which has the same path/name as vcf_file, but with .h5 instead of .vcf or .vcf.gz extension. If the h5 file does not exist, we create it and save it with .h5 extension. Defaults to True.
report (bool, optional) – Prints progress updates if True, silent otherwise. Defaults to True.
- moments.LD.Parsing.means_from_region_data(all_data, stats, norm_idx=0)[source]
Get means over all parsed regions.
- Parameters:
all_data (dict) – A dictionary with keys as unique identifiers of the regions and values as reported stats from
compute_ld_statistics
.stats (list of lists) – The list of LD and H statistics that are present in the data replicates.
norm_idx (int, optional) – The index of the population to normalize by.
- moments.LD.Parsing.subset_data(data, pops_to, normalization=0, r_min=None, r_max=None, remove_Dz=False)[source]
Take the output data and get r_edges, ms, vcs, and stats to pass to inference machinery.
pops_to
are the subset of the populations to marginalize the data to.r_min
andr_max
trim bins that fall outside of this range, andremove_Dz
allows us to remove all \(\sigma_{Dz}\) statistics.- Parameters:
data – The output of
bootstrap_data
, which contains bins, statistics, populations, means, and variance-covariance matrices.pops_to – A list of populations to subset to.
normalization – The population index that the original data was normalized by.
r_min – The minimum recombination distance to keep.
r_max – The maximum recombination distance to keep.
remove_Dz – If True, remove all Dz statistics. Otherwise keep them.
Inference and computing confidence intervals
Inference methods
- moments.LD.Inference.bin_stats(model_func, params, rho=[], theta=0.001, spread=None, kwargs={})[source]
Computes LD statist for a given model function over bins defined by
rho
. Here,rho
gives the bin edges, and we assume no spaces between bins. That is, if the length of the input recombination rates is \(l\), the number of bins is \(l-1\).- Parameters:
model_func – The model function that takes parameters in the form
model_func(params, rho=rho, theta=theta, **kwargs)
.params (list of floats) – The parameters to evaluate the model at.
rho (list of floats) – The scaled recombination rate bin edges.
theta (float, optional) – The mutation rate
spread (list of arrays) – A list of length rho-1 (number of bins), where each entry is an array of length rho+1 (number of bins plus amount outside bin range to each side). Each array must sum to one.
kwargs – Extra keyword arguments to pass to
model_func
.
- moments.LD.Inference.ll_over_bins(xs, mus, Sigmas)[source]
Compute the composite log-likelihood over LD and heterozygosity statistics, given data and expectations. Inputs must be in the same order, and we assume each bin is independent, so we sum _ll(x, mu, Sigma) over each bin.
- Parameters:
xs – A list of data arrays.
mus – A list of model function output arrays, same length as
xs
.Sigmas – A list of var-cov matrices, same length as
xs
.
- moments.LD.Inference.optimize_log_fmin(p0, data, model_func, rs=None, theta=None, u=2e-08, Ne=None, lower_bound=None, upper_bound=None, verbose=0, flush_delay=0.5, normalization=0, func_args=[], func_kwargs={}, fixed_params=None, use_afs=False, Leff=None, multinom=False, ns=None, statistics=None, pass_Ne=False, spread=None, maxiter=None, maxfun=None)[source]
Optimize (using the log of) the parameters using a downhill simplex algorithm. Initial parameters
p0
, the data[means, varcovs]
, the demographicmodel_func
, andrs
to specify recombination bin edges are required.Ne
must either be specified as a keyword argument or is included as the last parameter inp0
.- Parameters:
p0 (list) – The initial guess for demographic parameters, demography parameters plus (optionally) Ne.
data (list) –
The parsed data[means, varcovs, fs]. The frequency spectrum fs is optional, and used only if use_afs=True.
Means: The list of mean statistics within each bin (has length
len(rs)
orlen(rs) - 1
if using AFS). If we are not using the AFS, which is typical, the heterozygosity statistics come last.varcovs: The list of varcov matrices matching the data in
means
.
model_func (list) – The demographic model to compute statistics for a given rho. If we are using AFS, it’s a list of the two models [LD func, AFS func]. If we’re using LD stats alone, we pass a single LD model as a list: [LD func].
rs (list) – The list of raw recombination rates defining bin edges.
theta (float, optional) – The population scaled per base mutation rate (4*Ne*mu, not 4*Ne*mu*L).
u (float, optional) – The raw per base mutation rate. Cannot be used with
theta
.Ne (float, optional) – The fixed effective population size to scale u and r. If
Ne
is a parameter to fit, it should be the last parameter inp0
.lower_bound (list, optional) – Defaults to
None
. Constraints on the lower bounds during optimization. These are given as lists of the same length of the parameters.upper_bound (list, optional) – Defaults to
None
. Constraints on the upper bounds during optimization. These are given as lists of the same length of the parameters.verbose (int, optional) – If an integer greater than 0, prints updates of the optimization procedure at intervals given by that spacing.
func_args (list, optional) – Additional arguments to be passed to
model_func
.func_kwargs (dict, optional) – Additional keyword arguments to be passed to
model_func
.fixed_params (list, optional) – Defaults to
None
. To fix some parameters, this should be a list of equal length asp0
, withNone
for parameters to be fit and fixed values at corresponding indexes.use_afs (bool, optional) – Defaults to
False
. We can pass a model to compute the frequency spectrum and use that instead of heterozygosity statistics for single-locus data.Leff (float, optional) – The effective length of genome from which the fs was generated (only used if fitting to afs).
multinom (bool, optional) – Only used if we are fitting the AFS. If
True
, the likelihood is computed for an optimally rescaled FS. IfFalse
, the likelihood is computed for a fixed scaling of the FS found by theta=4*Ne*u and Leffns (list of ints, optional) – The sample size, which is only needed if we are using the frequency spectrum, as the sample size does not affect mean LD statistics.
statistics (list, optional) – Defaults to
None
, which assumes that all statistics are present and in the conventional default order. If the data is missing some statistics, we must specify which statistics are present using the subset of statistic names given bymoments.LD.Util.moment_names(num_pops)
.pass_Ne (bool, optional) – Defaults to
False
. IfTrue
, the demographic model includesNe
as a parameter (in the final position of input parameters).maxiter (int) – Defaults to None. Maximum number of iterations to perform.
maxfun (int) – Defaults to None. Maximum number of function evaluations to make.
- moments.LD.Inference.optimize_log_lbfgsb(p0, data, model_func, rs=None, theta=None, u=2e-08, Ne=None, lower_bound=None, upper_bound=None, verbose=0, flush_delay=0.5, normalization=0, func_args=[], func_kwargs={}, fixed_params=None, use_afs=False, Leff=None, multinom=False, ns=None, statistics=None, pass_Ne=False, spread=None, maxiter=40000, epsilon=0.001, pgtol=1e-05)[source]
Optimize (using the log of) the parameters using the modified Powell’s method, which optimizes slices of parameter space sequentially. Initial parameters
p0
, the data[means, varcovs]
, the demographicmodel_func
, andrs
to specify recombination bin edges are required.Ne
must either be specified as a keyword argument or is included as the last parameter inp0
.It is best at burrowing down a single minimum. This method is better than optimize_log if the optimum lies at one or more of the parameter bounds. However, if your optimum is not on the bounds, this method may be much slower.
Because this works in log(params), it cannot explore values of params < 0. It should also perform better when parameters range over scales.
The L-BFGS-B method was developed by Ciyou Zhu, Richard Byrd, and Jorge Nocedal. The algorithm is described in:
R. H. Byrd, P. Lu and J. Nocedal. A Limited Memory Algorithm for Bound Constrained Optimization, (1995), SIAM Journal on Scientific and Statistical Computing , 16, 5, pp. 1190-1208.
C. Zhu, R. H. Byrd and J. Nocedal. L-BFGS-B: Algorithm 778: L-BFGS-B, FORTRAN routines for large scale bound constrained optimization (1997), ACM Transactions on Mathematical Software, Vol 23, Num. 4, pp. 550-560.
- Parameters:
p0 (list) – The initial guess for demographic parameters, demography parameters plus (optionally) Ne.
data (list) –
The parsed data[means, varcovs, fs]. The frequency spectrum fs is optional, and used only if use_afs=True.
Means: The list of mean statistics within each bin (has length
len(rs)
orlen(rs) - 1
if using AFS). If we are not using the AFS, which is typical, the heterozygosity statistics come last.varcovs: The list of varcov matrices matching the data in
means
.
model_func (list) – The demographic model to compute statistics for a given rho. If we are using AFS, it’s a list of the two models [LD func, AFS func]. If we’re using LD stats alone, we pass a single LD model as a list: [LD func].
rs (list) – The list of raw recombination rates defining bin edges.
theta (float, optional) – The population scaled per base mutation rate (4*Ne*mu, not 4*Ne*mu*L).
u (float, optional) – The raw per base mutation rate. Cannot be used with
theta
.Ne (float, optional) – The fixed effective population size to scale u and r. If
Ne
is a parameter to fit, it should be the last parameter inp0
.lower_bound (list, optional) – Defaults to
None
. Constraints on the lower bounds during optimization. These are given as lists of the same length of the parameters.upper_bound (list, optional) – Defaults to
None
. Constraints on the upper bounds during optimization. These are given as lists of the same length of the parameters.verbose (int, optional) – If an integer greater than 0, prints updates of the optimization procedure at intervals given by that spacing.
func_args (list, optional) – Additional arguments to be passed to
model_func
.func_kwargs (dict, optional) – Additional keyword arguments to be passed to
model_func
.fixed_params (list, optional) – Defaults to
None
. To fix some parameters, this should be a list of equal length asp0
, withNone
for parameters to be fit and fixed values at corresponding indexes.use_afs (bool, optional) – Defaults to
False
. We can pass a model to compute the frequency spectrum and use that instead of heterozygosity statistics for single-locus data.Leff (float, optional) – The effective length of genome from which the fs was generated (only used if fitting to afs).
multinom (bool, optional) – Only used if we are fitting the AFS. If
True
, the likelihood is computed for an optimally rescaled FS. IfFalse
, the likelihood is computed for a fixed scaling of the FS found by theta=4*Ne*u and Leffns (list of ints, optional) – The sample size, which is only needed if we are using the frequency spectrum, as the sample size does not affect mean LD statistics.
statistics (list, optional) – Defaults to
None
, which assumes that all statistics are present and in the conventional default order. If the data is missing some statistics, we must specify which statistics are present using the subset of statistic names given bymoments.LD.Util.moment_names(num_pops)
.pass_Ne (bool, optional) – Defaults to
False
. IfTrue
, the demographic model includesNe
as a parameter (in the final position of input parameters).maxiter (int) – Defaults to 40,000. Maximum number of iterations to perform.
epsilon – Step-size to use for finite-difference derivatives.
pgtol (float) – Convergence criterion for optimization. For more info, see help(scipy.optimize.fmin_l_bfgs_b)
- moments.LD.Inference.optimize_log_powell(p0, data, model_func, rs=None, theta=None, u=2e-08, Ne=None, lower_bound=None, upper_bound=None, verbose=0, flush_delay=0.5, normalization=0, func_args=[], func_kwargs={}, fixed_params=None, use_afs=False, Leff=None, multinom=False, ns=None, statistics=None, pass_Ne=False, spread=None, maxiter=None, maxfun=None)[source]
Optimize (using the log of) the parameters using the modified Powell’s method, which optimizes slices of parameter space sequentially. Initial parameters
p0
, the data[means, varcovs]
, the demographicmodel_func
, andrs
to specify recombination bin edges are required.Ne
must either be specified as a keyword argument or is included as the last parameter inp0
.- Parameters:
p0 (list) – The initial guess for demographic parameters, demography parameters plus (optionally) Ne.
data (list) –
The parsed data[means, varcovs, fs]. The frequency spectrum fs is optional, and used only if use_afs=True.
Means: The list of mean statistics within each bin (has length
len(rs)
orlen(rs) - 1
if using AFS). If we are not using the AFS, which is typical, the heterozygosity statistics come last.varcovs: The list of varcov matrices matching the data in
means
.
model_func (list) – The demographic model to compute statistics for a given rho. If we are using AFS, it’s a list of the two models [LD func, AFS func]. If we’re using LD stats alone, we pass a single LD model as a list: [LD func].
rs (list) – The list of raw recombination rates defining bin edges.
theta (float, optional) – The population scaled per base mutation rate (4*Ne*mu, not 4*Ne*mu*L).
u (float, optional) – The raw per base mutation rate. Cannot be used with
theta
.Ne (float, optional) – The fixed effective population size to scale u and r. If
Ne
is a parameter to fit, it should be the last parameter inp0
.lower_bound (list, optional) – Defaults to
None
. Constraints on the lower bounds during optimization. These are given as lists of the same length of the parameters.upper_bound (list, optional) – Defaults to
None
. Constraints on the upper bounds during optimization. These are given as lists of the same length of the parameters.verbose (int, optional) – If an integer greater than 0, prints updates of the optimization procedure at intervals given by that spacing.
func_args (list, optional) – Additional arguments to be passed to
model_func
.func_kwargs (dict, optional) – Additional keyword arguments to be passed to
model_func
.fixed_params (list, optional) – Defaults to
None
. To fix some parameters, this should be a list of equal length asp0
, withNone
for parameters to be fit and fixed values at corresponding indexes.use_afs (bool, optional) – Defaults to
False
. We can pass a model to compute the frequency spectrum and use that instead of heterozygosity statistics for single-locus data.Leff (float, optional) – The effective length of genome from which the fs was generated (only used if fitting to afs).
multinom (bool, optional) – Only used if we are fitting the AFS. If
True
, the likelihood is computed for an optimally rescaled FS. IfFalse
, the likelihood is computed for a fixed scaling of the FS found by theta=4*Ne*u and Leffns (list of ints, optional) – The sample size, which is only needed if we are using the frequency spectrum, as the sample size does not affect mean LD statistics.
statistics (list, optional) – Defaults to
None
, which assumes that all statistics are present and in the conventional default order. If the data is missing some statistics, we must specify which statistics are present using the subset of statistic names given bymoments.LD.Util.moment_names(num_pops)
.pass_Ne (bool, optional) – Defaults to
False
. IfTrue
, the demographic model includesNe
as a parameter (in the final position of input parameters).maxiter (int) – Defaults to None. Maximum number of iterations to perform.
maxfun (int) – Defaults to None. Maximum number of function evaluations to make.
- moments.LD.Inference.remove_nonpresent_statistics(y, statistics=[[], []])[source]
Removes data not found in the given set of statistics.
- Parameters:
y (
LDstats
object.) – LD statistics.statistics – A list of lists for two and one locus statistics to keep.
- moments.LD.Inference.remove_normalized_data(means, varcovs, normalization=0, num_pops=1, statistics=None)[source]
Returns data means and covariance matrices with the normalizing statistics removed.
- Parameters:
means (list of arrays) – List of means normalized statistics, where each entry is the full set of statistics for a given recombination distance.
varcovs (list of arrays) – List of the corresponding variance covariance matrices.
normalization (int) – The index of the normalizing population.
num_pops (int) – The number of populations in the data set.
- moments.LD.Inference.remove_normalized_lds(y, normalization=0)[source]
Returns LD statistics with the normalizing statistic removed.
- Parameters:
y (
LDstats
object) – An LDstats object that has been normalized to get \(\sigma_D^2\)-formatted statistics.normalization (int) – The index of the normalizing population.
- moments.LD.Inference.sigmaD2(y, normalization=0)[source]
Compute the \(\sigma_D^2\) statistics normalizing by the heterozygosities in a given population.
- Parameters:
y (
LDstats
object) – The input data.normalization (int, optional) – The index of the normalizing population (normalized by pi2_i_i_i_i and H_i_i), default set to 0.
Confidence intervals
Parameter uncertainties are computed using Godambe information, described in Coffman et al, MBE (2016). doi: https://doi.org/10.1093/molbev/msv255
If you use moments.LD.Godambe to compute parameter uncertainties, please cite that paper. This was first developed by Alec Coffman for computing uncertainties from inferences performed with dadi, modified here to handle LD decay curves.
- moments.LD.Godambe.FIM_uncert(model_func, p0, ms, vcs, log=False, eps=0.01, r_edges=None, normalization=0, pass_Ne=False, statistics=None, verbose=0)[source]
Parameter uncertainties from Fisher Information Matrix. This approach typically underestimates the size of the true confidence intervals, as it does not take into account linkage between loci that causes data to be non-independent.
Returns standard deviations of parameter values.
- Parameters:
model_func – Model function
p0 – Best-fit parameters for model_func, with inferred Ne in last entry of parameter list.
ms – See below..
vcs – Original means and covariances of statistics from data. If statistics are not give, we remove the normalizing statistics. Otherwise, these need to be pared down so that the normalizing statistics are removed.
eps – Fractional stepsize to use when taking finite-difference derivatives. Note that if eps*param is < 1e-12, then the step size for that parameter will simply be eps, to avoid numerical issues with small parameter perturbations.
log – If True, assume log-normal distribution of parameters. Returned values are then the standard deviations of the logs of the parameter values, which can be interpreted as relative parameter uncertainties.
return_GIM – If true, also return the full GIM.
r_edges – The bin edges for LD statistics.
normalization – The index of the population that we normalized by.
pass_Ne – If True, Ne is a parameter in the model function, and by convention is the last entry in the parameters list. If False, Ne is only used to scale recombination rates.
statistics – Statistics that we have included given as a list of lists: [ld_stats, h_stats]. If statistics is not given, we assume all statistics are included except for the normalizing statistic in each bin
verbose (int, optional) – If an integer greater than 0, prints updates of the number of function calls and tested parameters at intervals given by that spacing.
- moments.LD.Godambe.GIM_uncert(model_func, all_boot, p0, ms, vcs, log=False, eps=0.01, return_GIM=False, r_edges=None, normalization=0, pass_Ne=False, statistics=None, verbose=0)[source]
Parameter uncertainties from Godambe Information Matrix (GIM). If you use this method, please cite Coffman et al., MBE (2016).
Returns standard deviations of parameter values.
- Parameters:
model_func – Model function
all_boot – List of bootstrap LD stat means [m0, m1, m2, …]
p0 – Best-fit parameters for model_func, with inferred Ne in last entry of parameter list.
ms – See below..
vcs – Original means and covariances of statistics from data. If statistics are not give, we remove the normalizing statistics. Otherwise, these need to be pared down so that the normalizing statistics are removed.
eps – Fractional stepsize to use when taking finite-difference derivatives. Note that if eps*param is < 1e-12, then the step size for that parameter will simply be eps, to avoid numerical issues with small parameter perturbations.
log – If True, assume log-normal distribution of parameters. Returned values are then the standard deviations of the logs of the parameter values, which can be interpreted as relative parameter uncertainties.
return_GIM – If true, also return the full GIM.
r_edges – The bin edges for LD statistics.
normalization – The index of the population that we normalized by.
pass_Ne – If True, Ne is a parameter in the model function, and by convention is the last entry in the parameters list. If False, Ne is only used to scale recombination rates.
statistics – Statistics that we have included given as a list of lists: [ld_stats, h_stats]. If statistics is not given, we assume all statistics are included except for the normalizing statistic in each bin
verbose (int, optional) – If an integer greater than 0, prints updates of the number of function calls and tested parameters at intervals given by that spacing.
Plotting
Todo
These docs are still needed.