.. _sec_ld_inference:

============================
Inferring demography with LD
============================

As described in the :ref:`linkage disequilibrium ` and :ref:`LD Parsing `
sections, we use a family of normalized LD and heterozygosity statistics to
compare model expectations to data. We optimize demographic model parameters
to find the expected binned LD and heterozygosity statistics that maximize a
composite likelihood over all pairs of SNPs and recombination bins. In this
section, we'll describe the likelihood framework, how to define demographic
models that can be used in inference, how to run optimization using
``moments``' built-in inference functions, and how to compute confidence
intervals. We include a short example, following the :ref:`parsing ` of data
simulated under an isolation-with-migration model, to illustrate the main
features and options.

********************
Likelihood framework
********************

For a given recombination distance bin indexed by :math:`i`, we have a set of
computed LD statistic means :math:`D_i` from the data, along with their
variance-covariance matrix :math:`\Sigma_i`, as returned by
``moments.LD.Parsing.bootstrap_data``. We assume a multivariate Gaussian
likelihood function, so that a model parameterized by :math:`\Theta`, with
expected statistics :math:`\mu_i(\Theta)`, has likelihood

.. math::
    \mathcal{L}_i(\Theta | D_i) = P(D_i | \mu_i, \Sigma_i) =
    \frac{\exp\left(-\frac{1}{2}(D_i-\mu_i)^T\Sigma_i^{-1}(D_i-\mu_i)\right)}
    {(2\pi)^{k/2}|\Sigma_i|^{1/2}}.

The likelihood is computed similarly for heterozygosity statistics, given
their variance-covariance matrix. Then the composite likelihood of two-locus
data across recombination bins and single-locus heterozygosity (indexed by
:math:`i=n+1`, where :math:`n` is the total number of recombination bins) is

.. math::
    \mathcal{L} = \prod_{i=1}^{n+1}\mathcal{L}_i.

In practice, we work with the log of the likelihood, so that products turn
into sums and we can drop constant factors:

.. math::
    \log\mathcal{L} \propto -\frac{1}{2}\sum_{i=1}^{n+1}
    (D_i-\mu_i)^T\Sigma_i^{-1}(D_i-\mu_i).

As the data :math:`\{D_i, \Sigma_i\}` are fixed, we search for the model
parameters :math:`\Theta` whose expectations :math:`\{\mu_i\}` maximize
:math:`\log\mathcal{L}`.
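To make the objective concrete, here is a minimal ``numpy`` sketch of this
composite log-likelihood (dropping the constant terms), assuming ``means``,
``varcovs``, and ``model_means`` are equal-length lists of arrays ordered as
the recombination bins followed by the heterozygosity statistics. The
function name here is ours for illustration only; ``moments.LD.Inference``
performs this computation internally.

.. code-block:: python

    import numpy as np

    def composite_log_likelihood(means, varcovs, model_means):
        # sum the quadratic forms -(1/2) (D_i - mu_i)^T Sigma_i^{-1} (D_i - mu_i)
        # over recombination bins plus the heterozygosity entry
        ll = 0.0
        for D, Sigma, mu in zip(means, varcovs, model_means):
            resid = np.asarray(D) - np.asarray(mu)
            ll -= 0.5 * resid @ np.linalg.solve(np.asarray(Sigma), resid)
        return ll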
***************************
Defining demographic models
***************************

There are a handful of built-in demographic models for one-, two-, and
three-population scenarios that can be used in inference (see :ref:`here `).
However, these are far from comprehensive, and it is likely that custom
demographic models will need to be written for a given inference problem. For
inspiration, ``moments.LD.Demographics1D``, ``Demographics2D``, and
``Demographics3D`` can be used as starting points and as illustrations of how
to structure model functions.

Demographic models all require a ``params`` positional argument and ``rho``
and (optionally) ``theta`` keyword arguments. ``theta``, the population
size-scaled mutation rate, does not play a role in inference using relative
statistics, as the mutation rate cancels in :math:`\sigma_d^2`-type
statistics. For example, the IM model we simulated data under in the LD
Parsing section could be parameterized as

.. jupyter-execute::

    import moments.LD

    def model_func(params, rho=None, theta=0.001):
        nu0, nu1, T, M = params
        y = moments.LD.Numerics.steady_state([1], rho=rho, theta=theta)
        y = moments.LD.LDstats(y, num_pops=1)
        y = y.split(0)
        y.integrate([nu0, nu1], T, m=[[0, M], [M, 0]], rho=rho, theta=theta)
        return y

In the demographic model used for the simulations, the ancestral effective
population size was 10,000, the size of deme0 was 2,000, and the size of
deme1 was 20,000. The populations split 1,500 generations ago and exchanged
migrants symmetrically at a rate of 0.0001 per generation. Converted into
genetic units, ``nu0 = 0.2``, ``nu1 = 2``, ``T = 1500 / 2 / 10000 = 0.075``,
and ``M = 2 * 10000 * 0.0001 = 2.0``.

********************
Running optimization
********************

Optimization with ``moments.LD``, much like ``moments`` optimization with the
SFS, includes a handful of functions that serve as wrappers for ``scipy``
optimizers, with options specific to working with LD statistics. The two
primary functions in ``moments.LD.Inference`` are

- ``optimize_log_fmin``: Uses the downhill simplex algorithm on the log of
  the parameters.
- ``optimize_log_powell``: Uses the modified Powell's method, which optimizes
  slices of parameter space sequentially.

Each optimization method accepts the same arguments. Required positional
arguments are

- ``p0``: The initial guess for the parameters in ``model_func``.
- ``data``: Structured as a list of lists of data means and data var-cov
  matrices. That is, ``data = [[means[0], means[1], ...], [varcovs[0],
  varcovs[1], ...]]``, with the final entry of each list the means and
  varcovs of the heterozygosity statistics.
- ``model_func``: The demographic model to be fit (see the section above).
  Importantly, this is a *list*, where the first entry is the LD model, which
  is always used, and the optional second entry is a demographic model for
  the SFS (a rarely used option that can be ignored). So usually, we would
  set ``model_func`` as ``[model_func_ld]``.

Additionally, we will almost always pass the list of unscaled recombination
bin edges as ``rs = [r0, r1, ..., rn]``, which defines *n* recombination
bins.

The effective population size plays a different role in LD inference than it
does in SFS inference. For the site frequency spectrum, :math:`N_e` merely
acts as a linear scaling factor and is absorbed by the scaled mutation rate
:math:`\theta`, which is treated as a free parameter. Here, :math:`N_e`
instead rescales recombination rates, and because we use a recombination map
to determine the binning of data by the recombination distances separating
loci, :math:`N_e` must either be passed as a fixed value or be fit
simultaneously in the optimization. If ``Ne`` is a fixed value, we specify
the population size using that keyword argument. Otherwise, if ``Ne`` is to
be fit, our list of parameters to fit by convention includes ``Ne`` in the
final position. Typically, ``Ne`` is not a parameter of the demographic
model, as we work in rescaled genetic units, so the parameters that get
passed to ``model_func`` are ``params[:-1]``. However, it is also possible to
write a demographic model that uses ``Ne`` as a parameter. In this case, we
set ``pass_Ne`` to ``True``, so that ``Ne`` both rescales recombination rates
and is a model parameter, and all ``params`` are passed to ``model_func``.

- ``Ne``: The effective population size, used to rescale ``rs`` to get
  ``rhos = 4 * Ne * rs`` (see the short sketch below).
- ``pass_Ne``: Defaults to ``False``. If ``True``, the demographic model
  includes ``Ne`` as a parameter (in the final position of the input
  parameters).
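As a concrete check of these rescalings, here is a short sketch using the
simulated values quoted above (plain arithmetic, not part of the
``moments.LD`` API):

.. code-block:: python

    Ne = 10_000  # ancestral effective population size from the simulation

    # rescale unscaled recombination bin edges to population-scaled rhos
    rs = [0, 1e-6, 2e-6, 5e-6, 1e-5]  # a few example bin edges
    rhos = [4 * Ne * r for r in rs]

    # convert the simulated physical parameters to genetic units
    nu0 = 2_000 / Ne      # relative size of deme0: 0.2
    nu1 = 20_000 / Ne     # relative size of deme1: 2.0
    T = 1_500 / (2 * Ne)  # divergence time in units of 2*Ne generations: 0.075
    M = 2 * Ne * 0.0001   # population-scaled migration rate: 2.0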
Other commonly used options include

- ``fixed_params``: Defaults to ``None``. To fix some parameters, this should
  be a list of the same length as ``p0``, with ``None`` for parameters to be
  fit and fixed values at the corresponding indexes (see the sketch following
  the example below).
- ``lower_bound``: Defaults to ``None``. Constraints on the lower bounds
  during optimization, given as a list of the same length as the parameters.
- ``upper_bound``: Defaults to ``None``. Constraints on the upper bounds
  during optimization, given as a list of the same length as the parameters.
- ``statistics``: Defaults to ``None``, which assumes that all statistics are
  present and in the conventional default order. If the data are missing some
  statistics, we must specify which statistics are present using the subset
  of statistic names given by ``moments.LD.Util.moment_names(num_pops)``.
- ``normalization``: Defaults to ``0``. The index of the population to
  normalize by, which should match the population index that we normalized by
  when parsing the data.
- ``verbose``: If an integer greater than 0, prints updates of the
  optimization procedure at intervals of that spacing.

Example
-------

Using the data simulated in the :ref:`Parsing ` section, we can refit the
demographic model under a parameterized IM model. For this, we could use the
``moments.LD.Demographics2D.split_mig`` model, which is equivalent to the
``model_func`` we defined above; in this example, we use our own function.
After loading the data and setting up the inference options, we'll use
``optimize_log_fmin`` to fit the model.

.. jupyter-execute::

    import moments.LD
    import pickle

    data = pickle.load(open("data/means.varcovs.split_mig.100_reps.bp", "rb"))

    rs = [0, 1e-6, 2e-6, 5e-6, 1e-5, 2e-5, 5e-5, 1e-4, 2e-4, 5e-4, 1e-3]

    p_guess = [0.1, 2.0, 0.075, 2.0, 10000]
    p0 = moments.LD.Util.perturb_params(p_guess, fold=0.2)

    # run optimization
    opt_params, LL = moments.LD.Inference.optimize_log_fmin(
        p0,
        [data["means"], data["varcovs"]],
        [model_func],
        rs=rs,
        verbose=40,
    )

    # get physical units, rescaling by Ne
    physical_units = moments.LD.Util.rescale_params(
        opt_params, ["nu", "nu", "T", "m", "Ne"]
    )

    print("best fit parameters:")
    print(f"  N(deme0)         :  {physical_units[0]:.1f}")
    print(f"  N(deme1)         :  {physical_units[1]:.1f}")
    print(f"  Div. time (gen)  :  {physical_units[2]:.1f}")
    print(f"  Migration rate   :  {physical_units[3]:.6f}")
    print(f"  N(ancestral)     :  {physical_units[4]:.1f}")

These should be pretty close to the input demographic parameters from the
simulations! They won't be spot on, as this was only using 100 Mb of
simulated data, but we should be in the ballpark.
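As an illustration of the ``fixed_params`` option described above, the same
fit could be rerun with the migration rate held at a fixed value. This sketch
reuses ``data``, ``rs``, ``p0``, and ``model_func`` from the example; only
the keyword argument is new:

.. code-block:: python

    # hold the migration rate (index 3) fixed at 2.0 and fit the rest
    opt_params_fixed, LL_fixed = moments.LD.Inference.optimize_log_fmin(
        p0,
        [data["means"], data["varcovs"]],
        [model_func],
        rs=rs,
        fixed_params=[None, None, None, 2.0, None],
    )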
******************************
Computing confidence intervals
******************************

When running demographic inference, we get a point estimate for the *best
fit* demographic parameters. However, for an unknown underlying true value,
it's important to also estimate a confidence interval. The CI tells us the
probability that the true value lies within some range, and it provides some
information about which parameters in our demographic model are tightly
constrained and which parameters we have little power to pin down.

``moments.LD`` can estimate confidence intervals using either the Fisher
Information Matrix (FIM) or the Godambe Information Matrix (GIM). In almost
all cases when using real data (or even most simulated data), the FIM will
estimate a much smaller CI than the GIM. This occurs because the FIM assumes
all data points that we've used are independent, when in reality linkage
sometimes causes data points to be highly correlated between pairs of loci
and between recombination bins. The Godambe method uses bootstrap-resampled
replicates of the data to account for this correlation, and it does a much
better job of estimating the true underlying CIs [Coffman2016]_.

.. note::
    If you use the Godambe approach to estimate confidence intervals, please
    cite [Coffman2016]_. Alec originally implemented this approach in
    ``dadi``, and ``moments`` has more-or-less used this same implementation
    here.

To create bootstrap replicates from the dictionary of data sums computed over
regions, where ``rep_data = {0: ld_stats_0, 1: ld_stats_1, ...}``, we use

.. code-block:: python

    num_boots = 100
    norm_idx = 0
    bootstrap_sets = moments.LD.Parsing.get_bootstrap_sets(
        rep_data, num_bootstraps=num_boots, normalization=norm_idx)

These bootstrap sets can then be used as the inputs to the
``moments.LD.Godambe`` methods. The two CI estimation methods are

- ``FIM_uncert``: Uses the Fisher Information Matrix. Usage is
  ``FIM_uncert(model_func, opt_params, means, varcovs, r_edges=rs)``.
- ``GIM_uncert``: Uses the Godambe Information Matrix. Usage is
  ``GIM_uncert(model_func, bootstrap_sets, opt_params, means, varcovs, r_edges=rs)``.

In each case, the model function is the same as used in inference (some
manipulation may be needed if we had any fixed parameters), the means and
varcovs are the same data as input to the inference function, and
``r_edges`` are the bin edges used in the inference. Additional options for
some corner cases are described in the :ref:`API reference for LD methods `.

Example
-------

We'll use both the FIM and GIM to compute uncertainties from the above
example inference. Using the FIM approach:

.. jupyter-execute::

    # using FIM
    uncerts_FIM = moments.LD.Godambe.FIM_uncert(
        model_func,
        opt_params,
        data["means"],
        data["varcovs"],
        r_edges=rs,
    )

    # lower and upper CIs, in genetic units
    lower = opt_params - 1.96 * uncerts_FIM
    upper = opt_params + 1.96 * uncerts_FIM

    # convert to physical units
    lower_pu = moments.LD.Util.rescale_params(lower, ["nu", "nu", "T", "m", "Ne"])
    upper_pu = moments.LD.Util.rescale_params(upper, ["nu", "nu", "T", "m", "Ne"])

    print("95% CIs:")
    print(f"  N(deme0)         :  {lower_pu[0]:.1f} - {upper_pu[0]:.1f}")
    print(f"  N(deme1)         :  {lower_pu[1]:.1f} - {upper_pu[1]:.1f}")
    print(f"  Div. time (gen)  :  {lower_pu[2]:.1f} - {upper_pu[2]:.1f}")
    print(f"  Migration rate   :  {lower_pu[3]:.6f} - {upper_pu[3]:.6f}")
    print(f"  N(ancestral)     :  {lower_pu[4]:.1f} - {upper_pu[4]:.1f}")

And using the GIM approach:

.. jupyter-execute::

    bootstrap_sets = pickle.load(
        open("data/bootstrap_sets.split_mig.100_reps.bp", "rb"))

    # using GIM
    uncerts_GIM = moments.LD.Godambe.GIM_uncert(
        model_func,
        bootstrap_sets,
        opt_params,
        data["means"],
        data["varcovs"],
        r_edges=rs,
    )

    # lower and upper CIs, in genetic units
    lower = opt_params - 1.96 * uncerts_GIM
    upper = opt_params + 1.96 * uncerts_GIM

    # convert to physical units
    lower_pu = moments.LD.Util.rescale_params(lower, ["nu", "nu", "T", "m", "Ne"])
    upper_pu = moments.LD.Util.rescale_params(upper, ["nu", "nu", "T", "m", "Ne"])

    print("95% CIs:")
    print(f"  N(deme0)         :  {lower_pu[0]:.1f} - {upper_pu[0]:.1f}")
    print(f"  N(deme1)         :  {lower_pu[1]:.1f} - {upper_pu[1]:.1f}")
    print(f"  Div. time (gen)  :  {lower_pu[2]:.1f} - {upper_pu[2]:.1f}")
    print(f"  Migration rate   :  {lower_pu[3]:.6f} - {upper_pu[3]:.6f}")
    print(f"  N(ancestral)     :  {lower_pu[4]:.1f} - {upper_pu[4]:.1f}")
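To quantify the difference between the two approaches, we can compare the
per-parameter standard errors directly (a quick sketch, assuming both
uncertainty vectors are array-like, as their use above suggests):

.. code-block:: python

    import numpy as np

    # ratio of GIM to FIM standard errors for each parameter; values well
    # above 1 indicate how much the FIM understates the uncertainty
    print(np.array(uncerts_GIM) / np.array(uncerts_FIM))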
We can see above that the FIM uncertainties are considerably smaller (i.e.,
more constrained) than the GIM uncertainties. However, the GIM uncertainties
are to be preferred here, as they more accurately estimate the underlying
true uncertainty in the demographic inference.

**********
References
**********

.. [Coffman2016] Coffman, Alec J., et al. "Computationally efficient
    composite likelihood statistics for demographic inference." *Molecular
    Biology and Evolution* 33.2 (2016): 591-593.