Inferring demography with LD
As described in the linkage disequilibrium and LD Parsing sections, we use a family of normalized LD and heterozygosity statistics to compare model expectations with data. We optimize demographic model parameters to find the expected binned LD and heterozygosity statistics that maximize a composite likelihood over all pairs of SNPs and recombination bins.
In this section, we’ll describe the likelihood framework, how to define demographic models that can be used in inference, how to run optimization using moments’ built-in inference functions, and how to compute confidence intervals. We include a short example, following the parsing of data simulated under an isolation-with-migration model, to illustrate the main features and options.
Likelihood framework
For a given recombination distance bin indexed by \(i\), we have a set of computed LD statistic means \(D_i\) from the data, along with the variance-covariance matrix \(\Sigma_i\) as returned by moments.LD.Parsing.bootstrap_data. We assume a multivariate Gaussian likelihood function, so that a model parameterized by \(\Theta\) with expected statistics \(\mu_i(\Theta)\) has likelihood

\[\mathcal{L}_i(\Theta) = \frac{1}{\sqrt{(2\pi)^{k_i}\,\lvert\Sigma_i\rvert}} \exp\left(-\frac{1}{2}\left(D_i - \mu_i(\Theta)\right)^{T}\Sigma_i^{-1}\left(D_i - \mu_i(\Theta)\right)\right),\]

where \(k_i\) is the number of statistics in bin \(i\).
The likelihood is computed similarly for the heterozygosity statistics, given their variance-covariance matrix. The composite likelihood of the two-locus data across recombination bins and the single-locus heterozygosity (indexed by \(i = n+1\), where \(n\) is the total number of recombination bins) is then

\[\mathcal{L}(\Theta) = \prod_{i=1}^{n+1} \mathcal{L}_i(\Theta).\]
In practice, we work with the log of the likelihood, so that products turn into sums and we can drop constant factors:

\[\log\mathcal{L}(\Theta) = -\frac{1}{2}\sum_{i=1}^{n+1} \left(D_i - \mu_i(\Theta)\right)^{T}\Sigma_i^{-1}\left(D_i - \mu_i(\Theta)\right) + \mathrm{const},\]

where the constant does not depend on \(\Theta\).
As the data \(\{D_i, \Sigma_i\}\) are fixed, we search for the model parameters \(\Theta\) whose expectations \(\{\mu_i(\Theta)\}\) maximize \(\log\mathcal{L}\).
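To make this concrete, below is a minimal sketch of the composite log-likelihood using plain numpy. This is not the moments.LD internal implementation; means, varcovs, and expectations are hypothetical lists ordered with the heterozygosity entries last.

import numpy as np

def composite_loglik(means, varcovs, expectations):
    # Gaussian composite log-likelihood, up to an additive constant.
    # means, varcovs: data means D_i and covariance matrices Sigma_i per bin,
    # with the heterozygosity statistics as the final entry of each list.
    # expectations: model expectations mu_i(Theta), in the same order.
    ll = 0.0
    for D, Sigma, mu in zip(means, varcovs, expectations):
        resid = np.asarray(D) - np.asarray(mu)
        # -(1/2) (D - mu)^T Sigma^{-1} (D - mu), constants dropped
        ll -= 0.5 * resid @ np.linalg.solve(np.asarray(Sigma), resid)
    return ll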
Defining demographic models
There are a handful of built-in demographic models for one-, two-, and three-population scenarios that can be used in inference (see here). However, these are far from comprehensive, and it is likely that custom demographic models will need to be written for a given inference problem. For inspiration, moments.LD.Demographics1D, Demographics2D, and Demographics3D can be used as starting points and as illustrations of how to structure model functions.
Demographic models all require a params positional argument and rho and (optionally) theta keyword arguments. theta, the population-size scaled mutation rate, does not play a role in inference using relative statistics, as the mutation rate cancels in \(\sigma_d^2\)-type statistics.
For example, the IM model we simulated data under in the LD Parsing section could be parameterized as
import moments.LD

def model_func(params, rho=None, theta=0.001):
    # isolation-with-migration: an equilibrium population splits into two demes
    # of relative sizes nu0 and nu1 that exchange migrants at symmetric rate M
    nu0, nu1, T, M = params
    # equilibrium LD and heterozygosity statistics for a single population
    y = moments.LD.Numerics.steady_state([1], rho=rho, theta=theta)
    y = moments.LD.LDstats(y, num_pops=1)
    y = y.split(0)
    y.integrate([nu0, nu1], T, m=[[0, M], [M, 0]], rho=rho, theta=theta)
    return y
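Once defined, the model function can be evaluated on its own. The parameter values and the choice of rho below are placeholders for illustration:

# expected two-population LD statistics at scaled recombination rate rho = 4*Ne*r
y = model_func([0.2, 2.0, 0.075, 2.0], rho=1.0)
print(y.num_pops)  # 2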
In the input demographic model for the simulations, the ancestral effective population size was 10,000, the size of deme0 was 2,000, and the size of deme1 was 20,000. The populations split 1,500 generations ago and exchanged migrants symmetrically at a rate of 0.0001 per generation. Converted into genetic units, nu0 = 0.2, nu1 = 2, T = 1500 / 2 / 10000 = 0.075, and M = 2 * 10000 * 0.0001 = 2.0.
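As a quick check, the same conversions written out in Python (values taken from the simulation setup above):

Ne = 10000             # ancestral effective population size
N0, N1 = 2000, 20000   # sizes of deme0 and deme1
T_gens = 1500          # divergence time in generations
m = 1e-4               # per-generation migration rate

nu0 = N0 / Ne          # 0.2
nu1 = N1 / Ne          # 2.0
T = T_gens / (2 * Ne)  # 0.075
M = 2 * Ne * m         # 2.0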
Running optimization
Optimization with moments.LD, much like moments optimization with the SFS, includes a handful of functions that serve as wrappers for scipy optimizers, with options specific to working with LD statistics. The two primary functions in moments.LD.Inference are

- optimize_log_fmin: Uses the downhill simplex algorithm on the log of the parameters.
- optimize_log_powell: Uses the modified Powell’s method, which optimizes slices of parameter space sequentially.
Each optimization method accepts the same arguments. Required positional arguments are

- p0: The initial guess for the parameters in model_func.
- data: Structured as a list of lists of data means and data var-cov matrices, i.e., data = [[means[0], means[1], ...], [varcovs[0], varcovs[1], ...]], with the final entry of each list being the means and var-covs of the heterozygosity statistics.
- model_func: The demographic model to be fit (see the section above). Importantly, this is a list, where the first entry is the LD model, which is always used, and the optional second entry is a demographic model for the SFS (a rarely used option that can be ignored). So we would usually set model_func to [model_func_ld].
Additionally, we will almost always pass the list of unscaled recombination bin edges as rs = [r0, r1, ..., rn], which defines n recombination bins.
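To illustrate how these inputs fit together, here is a toy sketch with dummy arrays; in practice the means and var-covs come from the parsing and bootstrapping steps, and the sizes used here are arbitrary.

import numpy as np

# two recombination bins with 3 LD statistics each, plus a final entry
# holding the heterozygosity statistics (1 statistic here)
means = [np.zeros(3), np.zeros(3), np.zeros(1)]
varcovs = [np.eye(3), np.eye(3), np.eye(1)]
data = [means, varcovs]

# unscaled recombination bin edges defining the two bins above
rs = [0, 1e-5, 1e-3]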
The effective population size plays a different role in LD inference than it does in SFS inference. For the site frequency spectrum, \(N_e\) merely acts as a linear scaling factor and is absorbed by the scaled mutation rate \(\theta\), which is treated as a free parameter. Here, \(N_e\) instead rescales recombination rates, and because we use a recombination map to determine the binning of data by recombination distances separating loci, \(N_e\) is a parameter that must be either passed as a fixed value or simultaneously fit in the optimization.
If Ne is a fixed value, we specify the population size using that keyword argument. Otherwise, if Ne is to be fit, our list of parameters to fit by convention includes Ne in the final position. Typically, Ne is not a parameter of the demographic model itself, as we work in rescaled genetic units, so the parameters that get passed to model_func are params[:-1]. However, it is also possible to write a demographic model that uses Ne as a parameter. In this case we set pass_Ne to True, so that Ne both rescales recombination rates and is a model parameter, and all params are passed to model_func (see the sketch after the following list).
- Ne: The effective population size, used to rescale rs to get rhos = 4 * Ne * rs.
- pass_Ne: Defaults to False. If True, the demographic model includes Ne as a parameter (in the final position of the input parameters).
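For example, a model intended for use with pass_Ne=True might take sizes, times, and migration rates in physical units and convert them internally using Ne. This is a sketch of one way to write such a model (not a built-in function):

def model_func_Ne(params, rho=None, theta=0.001):
    # physical-unit parameters, with Ne in the final position (for pass_Ne=True)
    N0, N1, T_gens, m, Ne = params
    # convert to genetic units
    nu0, nu1 = N0 / Ne, N1 / Ne
    T = T_gens / (2 * Ne)
    M = 2 * Ne * m
    y = moments.LD.Numerics.steady_state([1], rho=rho, theta=theta)
    y = moments.LD.LDstats(y, num_pops=1)
    y = y.split(0)
    y.integrate([nu0, nu1], T, m=[[0, M], [M, 0]], rho=rho, theta=theta)
    return y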
Other commonly used options include
- fixed_params: Defaults to None. To fix some parameters, this should be a list of the same length as p0, with None for parameters to be fit and fixed values at the corresponding indexes.
- lower_bound: Defaults to None. Constraints on the lower bounds during optimization, given as a list of the same length as the parameters.
- upper_bound: Defaults to None. Constraints on the upper bounds during optimization, given as a list of the same length as the parameters.
- statistics: Defaults to None, which assumes that all statistics are present and in the conventional default order. If the data is missing some statistics, we must specify which statistics are present using the subset of statistic names given by moments.LD.Util.moment_names(num_pops).
- normalization: Defaults to 0. The index of the population to normalize by, which should match the population index that we normalized by when parsing the data.
- verbose: If an integer greater than 0, prints updates of the optimization procedure at intervals of that spacing.
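For instance, a hypothetical call that fixes the migration rate and bounds the remaining parameters might look like the following, assuming data, rs, and model_func as set up in the example below (the specific values here are illustrative):

opt_params, LL = moments.LD.Inference.optimize_log_fmin(
    [0.1, 2.0, 0.075, 2.0, 10000],               # [nu0, nu1, T, M, Ne]
    [data["means"], data["varcovs"]],
    [model_func],
    rs=rs,
    fixed_params=[None, None, None, 2.0, None],  # fix M, fit the rest
    lower_bound=[1e-3, 1e-3, 1e-3, 0, 100],
    upper_bound=[100, 100, 1, 50, 1e6],
    verbose=50,
)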
Example
Using the data simulated in the Parsing section, we can refit the demographic model under a parameterized IM model. For this, we could use the moments.LD.Demographics2D.split_mig model as our model_func, which is equivalent to the function we defined above (which we use in this example). After loading the data and setting up the inference options, we’ll use optimize_log_fmin to fit the model.
import moments.LD
import pickle

# load the bootstrapped means and var-cov matrices from the Parsing section
data = pickle.load(open("data/means.varcovs.split_mig.100_reps.bp", "rb"))

# unscaled recombination bin edges used when parsing the data
rs = [0, 1e-6, 2e-6, 5e-6, 1e-5, 2e-5, 5e-5, 1e-4, 2e-4, 5e-4, 1e-3]

# initial guess: [nu0, nu1, T, M, Ne], with Ne in the final position
p_guess = [0.1, 2.0, 0.075, 2.0, 10000]
p0 = moments.LD.Util.perturb_params(p_guess, fold=0.2)

# run optimization from the perturbed starting point
opt_params, LL = moments.LD.Inference.optimize_log_fmin(
    p0,
    [data["means"], data["varcovs"]],
    [model_func],
    rs=rs,
    verbose=40,
)
# get physical units, rescaling by Ne
physical_units = moments.LD.Util.rescale_params(
opt_params, ["nu", "nu", "T", "m", "Ne"]
)
print("best fit parameters:")
print(f" N(deme0) : {physical_units[0]:.1f}")
print(f" N(deme1) : {physical_units[1]:.1f}")
print(f" Div. time (gen) : {physical_units[2]:.1f}")
print(f" Migration rate : {physical_units[3]:.6f}")
print(f" N(ancestral) : {physical_units[4]:.1f}")
40 , -214.304 , array([ 0.140269 , 1.95203 , 0.0514585 , 2.07543 , 12430 ])
80 , -130.319 , array([ 0.152596 , 1.99971 , 0.0535232 , 2.1097 , 11455.5 ])
120 , -117.615 , array([ 0.182618 , 2.02436 , 0.0667741 , 2.19032 , 10861.4 ])
160 , -83.6186 , array([ 0.170586 , 1.75416 , 0.0596334 , 1.87652 , 11330.6 ])
200 , -75.9597 , array([ 0.186329 , 1.82722 , 0.0679001 , 1.98702 , 10967.4 ])
240 , -75.9067 , array([ 0.186728 , 1.833 , 0.0682559 , 1.99477 , 10945.4 ])
280 , -75.881 , array([ 0.186734 , 1.83297 , 0.0682596 , 1.9916 , 10953.7 ])
320 , -75.5628 , array([ 0.186649 , 1.84801 , 0.0681039 , 1.95793 , 10964.4 ])
360 , -74.0139 , array([ 0.190583 , 1.91655 , 0.06896 , 1.85567 , 10778.8 ])
400 , -72.8789 , array([ 0.193579 , 2.06523 , 0.0684988 , 1.66751 , 10568.9 ])
440 , -72.8701 , array([ 0.193062 , 2.06041 , 0.068301 , 1.66962 , 10574 ])
480 , -72.87 , array([ 0.193113 , 2.06037 , 0.0683224 , 1.67006 , 10573.5 ])
best fit parameters:
N(deme0) : 2041.9
N(deme1) : 21786.0
Div. time (gen) : 1444.8
Migration rate : 0.000079
N(ancestral) : 10573.4
These should be pretty close to the input demographic parameters from the simulations! They won’t be spot on, as we only used 100 Mb of simulated data, but we should be in the ballpark.
Computing confidence intervals
When running demographic inference, we get a point estimate for the best-fit demographic parameters. Because the underlying true values are unknown, it’s also important to estimate confidence intervals. A CI gives a range expected to contain the true value with a specified probability (e.g., 95%), and it tells us which parameters in our demographic model are tightly constrained and which parameters we have little power to pin down.
moments.LD can estimate confidence intervals using either the Fisher Information Matrix (FIM) or the Godambe Information Matrix (GIM). In almost all cases when using real data (or even most simulated data), the FIM will estimate a much smaller CI than the GIM. This occurs because the FIM assumes all data points are independent, when in reality linkage causes data points to be correlated, sometimes strongly, between pairs of loci and between recombination bins. The Godambe method uses bootstrap-resampled replicates of the data to account for this correlation and does a much better job of estimating the true underlying CIs [Coffman2016].
Note
If you use the Godambe approach to estimate confidence intervals, please cite [Coffman2016]. Alec originally implemented this approach in dadi, and moments has more-or-less used this same implementation here.
To create bootstrap replicates from the dictionary of per-region data sums (e.g., rep_data = {0: ld_stats_0, 1: ld_stats_1, ...}), we use
num_boots = 100
norm_idx = 0
bootstrap_sets = moments.LD.Parsing.get_bootstrap_sets(
rep_data, num_bootstraps=num_boots, normalization=norm_idx)
These bootstrap sets can then be used as inputs to the moments.LD.Godambe methods. The two CI estimation methods are

- FIM_uncert: Uses the Fisher Information Matrix. Usage is FIM_uncert(model_func, opt_params, means, varcovs, r_edges=rs).
- GIM_uncert: Uses the Godambe Information Matrix. Usage is GIM_uncert(model_func, bootstrap_sets, opt_params, means, varcovs, r_edges=rs).
In each case, the model function is the same as used in inference (some manipulation may be needed if we had any fixed parameters), the means and var-covs are the same data passed to the inference function, and r_edges are the bin edges used in the inference. Additional options for some corner cases are described in the API reference for LD methods.
Example
We’ll use both the FIM and GIM to compute uncertainties from the above example inference.
Using the FIM approach:
# using FIM
uncerts_FIM = moments.LD.Godambe.FIM_uncert(
model_func,
opt_params,
data["means"],
data["varcovs"],
r_edges=rs,
)
# lower and upper CIs, in genetic units
lower = opt_params - 1.96 * uncerts_FIM
upper = opt_params + 1.96 * uncerts_FIM
# convert to physical units
lower_pu = moments.LD.Util.rescale_params(lower, ["nu", "nu", "T", "m", "Ne"])
upper_pu = moments.LD.Util.rescale_params(upper, ["nu", "nu", "T", "m", "Ne"])
print("95% CIs:")
print(f" N(deme0) : {lower_pu[0]:.1f} - {upper_pu[0]:.1f}")
print(f" N(deme1) : {lower_pu[1]:.1f} - {upper_pu[1]:.1f}")
print(f" Div. time (gen) : {lower_pu[2]:.1f} - {upper_pu[2]:.1f}")
print(f" Migration rate : {lower_pu[3]:.6f} - {upper_pu[3]:.6f}")
print(f" N(ancestral) : {lower_pu[4]:.1f} - {upper_pu[4]:.1f}")
95% CIs:
N(deme0) : 1825.7 - 2269.2
N(deme1) : 18859.9 - 24879.1
Div. time (gen) : 1266.8 - 1632.7
Migration rate : 0.000069 - 0.000088
N(ancestral) : 10166.7 - 10980.1
And using the GIM approach:
bootstrap_sets = pickle.load(open("data/bootstrap_sets.split_mig.100_reps.bp", "rb"))
# using GIM
uncerts_GIM = moments.LD.Godambe.GIM_uncert(
model_func,
bootstrap_sets,
opt_params,
data["means"],
data["varcovs"],
r_edges=rs,
)
# lower and upper CIs, in genetic units
lower = opt_params - 1.96 * uncerts_GIM
upper = opt_params + 1.96 * uncerts_GIM
# convert to physical units
lower_pu = moments.LD.Util.rescale_params(lower, ["nu", "nu", "T", "m", "Ne"])
upper_pu = moments.LD.Util.rescale_params(upper, ["nu", "nu", "T", "m", "Ne"])
print("95% CIs:")
print(f" N(deme0) : {lower_pu[0]:.1f} - {upper_pu[0]:.1f}")
print(f" N(deme1) : {lower_pu[1]:.1f} - {upper_pu[1]:.1f}")
print(f" Div. time (gen) : {lower_pu[2]:.1f} - {upper_pu[2]:.1f}")
print(f" Migration rate : {lower_pu[3]:.6f} - {upper_pu[3]:.6f}")
print(f" N(ancestral) : {lower_pu[4]:.1f} - {upper_pu[4]:.1f}")
95% CIs:
N(deme0) : 1570.7 - 2562.0
N(deme1) : 17083.3 - 26961.9
Div. time (gen) : 1020.2 - 1917.5
Migration rate : 0.000059 - 0.000096
N(ancestral) : 9846.2 - 11300.6
We can see above that the FIM uncertainties are considerably smaller (i.e., more tightly constrained) than the GIM uncertainties. However, the GIM uncertainties are to be preferred here, as they more accurately estimate the true underlying uncertainty in the demographic inference.