.. _sec_sfs:

.. jupyter-execute::
    :hide-code:

    import matplotlib, matplotlib.pylab as plt
    plt.rcParams['legend.title_fontsize'] = 'xx-small'
    matplotlib.rc('xtick', labelsize=9)
    matplotlib.rc('ytick', labelsize=9)
    matplotlib.rc('axes', labelsize=12)
    matplotlib.rc('axes', titlesize=12)
    matplotlib.rc('legend', fontsize=10)

===========================
The Site Frequency Spectrum
===========================

This page describes the Site Frequency Spectrum (SFS), how to compute its
expectation using ``moments``, manipulate spectra, implement demographic models
using the ``moments`` API, and compute and save spectra from a VCF.

If you use the SFS methods in ``moments`` in your research, please cite

- [Jouganous2017]_ Jouganous, J., Long, W., Ragsdale, A. P., & Gravel, S.
  (2017). Inferring the joint demographic history of multiple populations:
  beyond the diffusion approximation. *Genetics*, 206(3), 1549-1567.

*******
The SFS
*******

A site-frequency spectrum is a :math:`p`-dimensional histogram, where
:math:`p` is the number of populations for which we have data. Thus, the shape
of the SFS is :math:`(n_0+1) \times (n_1+1) \times \ldots \times (n_{p-1}+1)`,
where :math:`n_i` is the haploid sample size in population :math:`i`. An entry
of the SFS (call it ``fs``) stores the number, density, or probability for SNP
frequencies given by the index of that entry. That is, ``fs[j, k, l]`` is the
number (or density) of mutations with allele frequencies ``j`` in population 0,
``k`` in population 1, and ``l`` in population 2. (Note that all indexing, as
is typical in Python, is zero-based.)

Examples
========

It can be helpful to visualize site-frequency spectra if you are new to working
with them. In the single-population case, a SFS is a one-dimensional array. For
variable biallelic loci and steady-state demography (no historical size
changes, migrants, etc.), the SFS is proportional to :math:`1/i`, with total
size depending on the mutation rate and sequence length. Historical size
changes and demographic events perturb the SFS away from this shape, as does
selection: negative selection skews the SFS toward lower frequencies, while
positive selection skews it toward higher frequencies.

.. jupyter-execute::

    import moments
    import numpy as np
    import matplotlib.pylab as plt

    sample_size = 40

    # A neutral SFS
    fs_neu = moments.Demographics1D.snm([sample_size])

    # SFS under negative selection
    fs_neg = moments.Spectrum(
        moments.LinearSystem_1D.steady_state_1D(sample_size, gamma=-10)
    )

    # SFS under positive selection
    fs_pos = moments.Spectrum(
        moments.LinearSystem_1D.steady_state_1D(sample_size, gamma=10)
    )

    fig, ax = plt.subplots(1, 1, figsize=(8, 4))
    ax.plot(fs_neu, ".-", ms=8, lw=1, label="Neutral")
    ax.plot(fs_neg, "x-", ms=8, lw=1, label="Neg. selection")
    ax.plot(fs_pos, "+-", ms=8, lw=1, label="Pos. selection")
    ax.set_xlabel("Allele frequency")
    ax.set_ylabel("Density")
    ax.legend();

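As a quick check of the :math:`1/i` shape described above (a minimal sketch
reusing the arrays defined in the previous cell), multiplying each entry of the
neutral SFS by its allele count should give a roughly constant value, equal to
the default :math:`\theta` of 1:

.. jupyter-execute::

    # Each unmasked entry of the neutral steady-state SFS is ~ theta / i, so
    # multiplying entry i by i should return values close to theta (here, 1)
    i = np.arange(1, sample_size)
    print(fs_neu[1:-1] * i)
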
Multi-population SFS can be illustrated as multi-dimensional histograms, such
as 2D heat maps. Here, we consider a very simple model of a population split,
in which both derived populations are the same size as the ancestral population
and do not exchange migrants. Allele frequencies in populations that split more
recently will still be quite similar, while more distantly related populations
are expected to show larger allele frequency differences.

.. jupyter-execute::

    sample_sizes = [50, 50]

    # parameters of `split_mig` are (nu0, nu1, T, m)
    # T is measured in units of 2Ne generations
    fs_recent = moments.Demographics2D.split_mig((1, 1, 0.02, 0), sample_sizes)
    fs_older = moments.Demographics2D.split_mig((1, 1, 0.15, 0), sample_sizes)

    # assume theta = 20000, and then resample to fake data
    fs_recent = (20000 * fs_recent).sample()
    fs_older = (20000 * fs_older).sample()

    moments.Plotting.plot_single_2d_sfs(fs_recent)
    moments.Plotting.plot_single_2d_sfs(fs_older)

*******************************
Spectrum objects in ``moments``
*******************************

SFS are stored as ``moments.Spectrum`` objects. If you are familiar with
``dadi``'s Spectrum objects, then you will already know your way around a
``moments.Spectrum`` object. ``moments`` builds on the ``dadi`` SFS
construction, manipulation, and demographic specification, with minor
adjustments that reflect the differences between the simulation engines and
parameterizations.

``Spectrum`` objects are a subclass of ``numpy.masked_array``, so that standard
array manipulation is possible. Indexing also works the same way as for a
typical array, so that ``fs[2, 3, 5]`` returns the entry in the SFS
corresponding to allele frequencies :math:`(2, 3, 5)` (here, in a
three-population SFS). Similarly, we can check whether the SFS is masked at a
given entry. For example, ``fs.mask[0, 0]`` returns whether the "fixed" bin
(where no samples carry the derived allele) is ignored.

A ``Spectrum`` object has a few additional useful attributes:

- ``fs.pop_ids``: A list of population IDs (as strings) for each population in
  the SFS.
- ``fs.sample_sizes``: A list of sample sizes (as integers) corresponding to
  the shape of the SFS.
- ``fs.folded``: If True, the SFS is folded, meaning we polarize allele
  frequencies by the minor allele frequency. If False, the SFS is polarized by
  the derived allele.

****************
Manipulating SFS
****************

Along with standard array manipulations, there are operations specific to SFS.
Some of these are equivalent to standard array operations, but we ensure that
the masking and population IDs are updated properly.

Folding
=======

Folding a SFS removes information about how SNPs are polarized, so that the
Spectrum stores counts of mutations with a given minor allele frequency. To
fold a SFS, we call ``fold()``, which returns a folded Spectrum object. For
example, the standard neutral model of sample size 10,

.. jupyter-execute::

    fs = moments.Demographics1D.snm([10])
    fs

can be folded to the minor allele frequency, which updates the allele counts in
the minor allele frequency bins and the mask:

.. jupyter-execute::

    fs_folded = fs.fold()
    fs_folded

When folding a multi-dimensional SFS, note that the folding occurs over the
global minor allele frequency.

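For example, folding a small two-population spectrum (a minimal sketch) masks
every entry in which the derived allele is in the global majority, with its
count added to the complementary minor-allele entry:

.. jupyter-execute::

    # Fold a joint SFS: entries are combined based on the total derived
    # allele count across both populations
    fs2 = moments.Demographics2D.snm([4, 4])
    fs2_folded = fs2.fold()
    fs2_folded
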
Projecting
==========

SFS projection takes a Spectrum of some sample size and reduces the sample size
in one or more populations. The output Spectrum sums over all possible
down-samplings, so that it is equivalent to having sampled the smaller sample
size to begin with.

.. jupyter-execute::

    fs_proj = fs.project([6])
    fs_proj

For multi-dimensional frequency spectra, we must pass a list of sample sizes
with length equal to the dimension of the SFS:

.. jupyter-execute::

    fs = moments.Spectrum(np.random.rand(121).reshape((11, 11)))
    fs_proj = fs.project([6, 4])
    fs_proj

Marginalizing
=============

If a population goes extinct, or if we want to subset a SFS to some focal
populations, we use the ``marginalize( )`` function. This function takes a list
of population indexes as input and removes those indexes from the output SFS.
The array operation is simply a sum over those axes, but the marginalization
function also preserves population IDs if they are given. For example, given a
three-population spectrum

.. jupyter-execute::

    fs = moments.Spectrum(np.ones((5, 5, 5)), pop_ids=["A", "B", "C"])
    fs

we can view the one-population SFS, here for the first population:

.. jupyter-execute::

    fs_marg = fs.marginalize([1, 2])
    fs_marg

or the joint two-population SFS for population indexes 1 and 2:

.. jupyter-execute::

    fs_marg = fs.marginalize([0])
    fs_marg

Note that the population IDs stay consistent after marginalizing.

Resampling
==========

We can resample a new SFS from a given Spectrum using two approaches. First, a
standard assumption is that the entries of an "expected" SFS give the expected
counts within each bin, and that the data follow a Poisson distribution with
rates equal to the bin values. Then ``sample()`` creates a Poisson-sampled SFS:

.. jupyter-execute::

    fs = moments.Demographics1D.snm([10]) * 1000
    fs_pois = fs.sample()
    fs_pois

Alternatively, we can resample while enforcing that we obtain a SFS with the
same number of segregating sites:

.. jupyter-execute::

    fs_fixed = fs.fixed_size_sample(np.rint(fs.S()))
    print("number of sites in input:", f"{fs.S():.2f}")
    print("number of sites in resampled SFS:", fs_fixed.S())
    fs_fixed

******************
Demographic events
******************

When defining demographic models with multiple populations, we need to apply
demographic events such as population splits, mergers, and admixtures. These
operations often change the dimension or size of the SFS, so they do not act
in-place. Instead, they return a new Spectrum object, similar to the
manipulations in the previous section.

Population splits and branches
==============================

New in ``moments`` version 1.1, the Spectrum class includes functions to
directly apply demographic events. A population split is called using
``fs.split(idx, n0, n1)``, where the population indexed by ``idx`` splits into
``n0`` and ``n1`` lineages. The ``split`` function also takes a ``new_ids``
keyword argument, where we can specify the population IDs of the two new
populations after the split. Note that ``n0`` and ``n1`` cannot sum to more
than the current sample size of the population that we are splitting.

For example, to split a single population with 6 tracked lineages into two
populations with 3 lineages in each population:

.. jupyter-execute::

    fs = moments.Demographics1D.snm([6])
    fs_split = fs.split(0, 3, 3)
    fs_split

If we use ``new_ids``, we can also keep track of population IDs after a split
event:

.. jupyter-execute::

    fs = moments.Demographics2D.snm([6, 2], pop_ids=["A", "B"])
    fs

.. jupyter-execute::

    fs_split = fs.split(0, 4, 2, new_ids=["C", "D"])
    fs_split

As of version 1.1.5, we can apply a "branch" event. This is conceptually
similar to a split, but simpler in that a child population branches off from a
parental population. In this case, we just need to give the sample size of the
new child population (and its new population ID), and the parental population
is left with the same number of lineages minus the size of the new population,
while its population ID (if given) remains unchanged.

.. jupyter-execute::

    fs = moments.Demographics1D.snm([5], pop_ids=["A"])
    fs_branch = fs.branch(0, 2, new_id="B")
    fs_branch

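Branching also works for spectra with more than one population. As a small
sketch, we can branch a new population ``C`` off of population ``B`` in a
two-population spectrum and check the resulting IDs and sample sizes:

.. jupyter-execute::

    # Population "B" (4 lineages) donates 2 lineages to the new population "C"
    fs_ab = moments.Demographics2D.snm([6, 4], pop_ids=["A", "B"])
    fs_abc = fs_ab.branch(1, 2, new_id="C")
    print(fs_abc.pop_ids)
    print(fs_abc.sample_sizes)
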
.. note::

    Previous versions of ``moments`` required calling functions such as
    ``moments.Manips.split_1D_to_2D(fs, n0, n1)`` or
    ``moments.Manips.split_3D_to_4D_2(fs, n0, n1)``. The new API
    (``fs.split(idx, n0, n1)``) wraps the different split functions in
    ``moments.Manips`` so that we don't need to worry about picking the correct
    split function.

Admixture and mergers
=====================

Here, we consider two types of admixture events. First, two populations mix
with given proportions to form a new population (which we will call an "admix"
event). Second, one population contributes some proportion of ancestry to
another population in the SFS (which we call a "pulse migration" event). In
both cases, lineages within the SFS are moved from one or more populations to
another, so the size, and possibly the dimension, of the SFS can change.

To mix two populations with a given proportion, we use
``fs.admix(idx0, idx1, num_lineages, proportion)``, where ``proportion`` is the
proportion of the new population that comes from the population indexed by
``idx0``, and ``1 - proportion`` comes from the population indexed by ``idx1``.
The number of lineages is the sample size of the new admixed population, and
the sample sizes in the source populations necessarily decrease by that same
amount. Note that if the sample size of a source population equals the number
of lineages that are moved, that source population no longer exists and the
dimension of the SFS decreases by one.

For example, in a two-population SFS, we can look at a few different scenarios
of admixture and sample sizes:

.. jupyter-execute::

    fs = moments.Spectrum(np.ones((11, 11)))
    print("original SFS has sample size", fs.sample_sizes)

    fs_admix = fs.admix(0, 1, 10, 0.25)
    print("admix SFS has size", fs_admix.sample_sizes, "after moving 10 lineages")

    fs_admix2 = fs.admix(0, 1, 5, 0.5)
    print("second admix SFS has size", fs_admix2.sample_sizes, "after moving 5 lineages")

And to account for population IDs after admixture:

.. jupyter-execute::

    fs = moments.Spectrum(np.ones((9, 7)), pop_ids=["A", "B"])
    print("original SFS has size", fs.sample_sizes, "and pop ids", fs.pop_ids)

    fs_admix = fs.admix(0, 1, 4, 0.25, new_id="C")
    print("admix SFS has size", fs_admix.sample_sizes,
          "and pop ids", fs_admix.pop_ids,
          "after moving 4 lineages into new population C")

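For pulse migration events, the Spectrum class also provides a
``pulse_migrate`` method. Below is a minimal sketch that assumes the call
signature ``fs.pulse_migrate(idx_from, idx_to, keep_from, proportion)``, with
``keep_from`` the number of lineages retained in the source population after
the event; check the API reference for the exact arguments:

.. code-block:: python

    # Assumed signature: pulse_migrate(idx_from, idx_to, keep_from, proportion)
    # Here, 10% of population B's ancestry is replaced by migrants from A,
    # and 10 lineages are kept in A after the event
    fs = moments.Spectrum(np.ones((21, 11)), pop_ids=["A", "B"])
    fs_pulse = fs.pulse_migrate(0, 1, 10, 0.1)
    print(fs_pulse.sample_sizes)
    print(fs_pulse.pop_ids)
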
***********
Integration
***********

``moments`` integrates the SFS forward in time by calling ``fs.integrate( )``.
At a minimum, we need to pass the population size(s) ``nu`` and the integration
time ``T``. All parameters are scaled by a reference effective population size
:math:`N_e`, so that time is measured in units of :math:`2N_e` generations,
sizes are relative to this same :math:`N_e`, and mutation rates, migration
rates, and selection coefficients are scaled by :math:`2N_e`.

Size functions
==============

The ``integrate( )`` function can take either a list of relative sizes, with
length equal to the number of populations represented by the SFS, or a function
that returns such a list of population sizes over time. For example, to
integrate a two-population SFS for 0.05 time units, with the first population
at relative size 2.0 (double the reference size) and the second at size 0.1
(one-tenth the reference size):

.. jupyter-execute::

    fs = moments.Demographics2D.snm([10, 10])
    fs.integrate([2.0, 0.1], 0.05)

To specify a size function that changes over time, for example an exponential
growth model, we can instead pass a function to the integration method:

.. jupyter-execute::

    fs = moments.Demographics1D.snm([10])

    nu0 = 0.5
    nuF = 2.0
    T = 0.2
    nu_func = lambda t: [nu0 * np.exp(np.log(nuF / nu0) * t / T)]

    print("size at start of epoch:", nu_func(0))
    print("size at end of epoch:", nu_func(T))

    fs.integrate(nu_func, T)

Integration time and time units
===============================

Unlike coalescent simulators such as ``msprime``, integration times in
``moments`` are given in units of :math:`2N_e` generations. Thus, typical
integration times for many demographic scenarios can be much smaller than one.
Times are not cumulative when integrating multiple epochs: each time
``integrate( )`` is called, time internally starts from zero by default. Thus,
when defining multiple epochs with size functions, keep in mind that time for
each epoch runs from zero to that epoch's integration time ``T``.

Migration rates
===============

Migration between populations is specified by the migration matrix, which has
shape :math:`p \times p`, where :math:`p` is the number of populations
represented by the SFS. The :math:`i`-th row of the migration matrix gives the
migration rates from each other population *into* the population indexed by
:math:`i`. Because rates are rescaled by the effective population size, the
entry ``M[i, j]`` gives the migration rate ``2*Ne*m_ij``, where ``m_ij`` is the
per-generation probability of a lineage in population ``i`` having its parent
in population ``j``. Note that the diagonal elements of ``M`` are ignored.

For example, to integrate a two-population SFS with migration:

.. jupyter-execute::

    fs = moments.Demographics2D.snm([10, 10])
    M = np.array([
        [0, 2.0],
        [0.75, 0]
    ])
    fs.integrate([2, 3], 0.05, m=M)

Mutation rates and mutation model
=================================

By default, ``moments`` uses an infinite-sites model (ISM). In this case, the
mutation rate :math:`\theta` is the population-size-scaled mutation rate
multiplied by the number of loci: ``theta = 4*Ne*u*L``. By default, ``theta``
is set to 1. Luckily, we do not often need to worry about setting ``theta``,
because the ISM guarantees that the expected count in each frequency bin of the
SFS scales linearly with the mutation rate. This means that we can happily
integrate with the default ``theta`` and simply rescale the SFS at the end:

.. jupyter-execute::

    theta = 100

    fs_theta = moments.LinearSystem_1D.steady_state_1D(20) * theta
    fs_theta = moments.Spectrum(fs_theta)
    fs_theta.integrate([2.0], 0.1, theta=theta)

    fs = moments.Demographics1D.two_epoch((2.0, 0.1), [20])  # default theta = 1
    fs = theta * fs

    print(fs_theta.S())
    print(fs.S())

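For concreteness, here is a small worked example of this rescaling; the
parameter values below are illustrative assumptions, not defaults:

.. jupyter-execute::

    # Illustrative (assumed) parameter values
    Ne = 1e4      # reference effective population size
    u = 1.25e-8   # per-base, per-generation mutation rate
    L = 1e6       # number of callable sites
    theta = 4 * Ne * u * L
    print("theta =", theta)

    # Rescale a model SFS that was computed with the default theta = 1
    model = moments.Demographics1D.two_epoch((2.0, 0.1), [20])
    model = theta * model
    print("expected number of segregating sites:", model.S())
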
Reversible mutations
--------------------

Unlike ``dadi``, which solves the diffusion equation directly and can only
simulate under the ISM, the moments-based engine in ``moments`` lets us
accurately track the density of the "fixed" bins. That is, we can compute not
just the distribution of segregating mutation frequencies, but also the
probability that a locus is monomorphic in a sample for the derived or
ancestral allele.

To compute a SFS in which we track monomorphic loci, we use a reversible
mutation model, which we specify by setting ``finite_genome=True``. When
simulating under the finite genome model, the mutation rate is no longer scaled
by the number of loci, ``L``. Instead, the mutation rates are simply
``theta_fd=4*Ne*u`` and ``theta_bd=4*Ne*v``, where ``u`` and ``v`` are the
forward and backward mutation rates, respectively. Therefore, ``theta_fd`` and
``theta_bd`` are typically much less than 1 (and in fact the model breaks down
for scaled mutation rates around 1).

To simulate under the reversible mutation model, we first initialize the
steady-state SFS with ``mask_corners=False``, and then apply demographic events
as normal and integrate using ``finite_genome=True``:

.. jupyter-execute::

    theta_fd = 0.0005  # 4*Ne*u, with Ne = 1e4 and u = 1.25e-8
    theta_bd = 0.001   # the backward mutation rate is double the forward rate

    fs = moments.LinearSystem_1D.steady_state_1D_reversible(
        20, theta_fd=theta_fd, theta_bd=theta_bd)  # sample size = 20
    fs = moments.Spectrum(fs, mask_corners=False)

    fs.integrate(
        [5.0], 0.2, finite_genome=True, theta_fd=theta_fd, theta_bd=theta_bd)

Note that if the forward and backward mutation rates are equal, we can use
``theta`` to set both rates (and in this case ``theta`` must still be less
than 1).

Illustration: ancestral state misidentification
-----------------------------------------------

In SFS analyses, a typical confounder is the misidentification of the ancestral
allele. This occurs because polarization requires estimating the ancestral
state of a locus, which is typically done by comparing to one or more outgroup
species in a sequence alignment. For humans, we typically use chimpanzee and
other great apes to infer the ancestral allele. At longer evolutionary
timescales, it is not uncommon for multiple independent mutations to occur at
the same locus, so that when comparing to an outgroup species we classify some
derived mutations as ancestral and some ancestral mutations as derived. For
humans, the rate of ancestral misidentification is typically in the 1-3% range,
depending on the method used to polarize alleles.

For example, we can simulate using rough parameters
(:math:`u=1.25\times 10^{-8}`, :math:`N_e=10^4`, divergence of 6 million years,
and a generation time of 25 years) and symmetric mutation rates to see the
effect of polarizing based on the allele carried by a chimp sequence. Here, if
the chimp carries the derived allele, we will instead assume the ancestral
allele is derived:

.. jupyter-execute::

    Ne = 1e4
    u = 1.25e-8
    theta = 4 * Ne * u
    generation_time = 25
    divergence_years = 6e6
    T = divergence_years / generation_time / 2 / Ne

    fs = moments.LinearSystem_1D.steady_state_1D_reversible(
        101, theta_fd=theta, theta_bd=theta)
    fs = moments.Spectrum(fs, mask_corners=False)
    fs = fs.split(0, 100, 1)
    fs.integrate([1, 1], T, finite_genome=True, theta=theta)

    fs_polarized = fs[:, 0] + fs[::-1, 1]
    fs_polarized.mask_corners()

Then, visualizing using ``moments.Plotting.plot_1d_fs(fs_polarized)``, we can
see the uptick at high-frequency variants due to ancestral misidentification -
that is, recurrent mutations along the lineage leading from humans to chimps:

.. figure:: ../figures/ancestral_misid.png
    :scale: 80 %
    :align: center

    Excess of high-frequency derived mutations due to ancestral
    misspecification.

Selection and dominance
=======================

One of the great benefits of forward simulators is their ability to include the
effects of selection and dominance at little extra cost. In the selection model
implemented in ``moments``, genotype fitnesses are given relative to the
ancestral homozygous genotype (i.e. the relative fitness of *aa* is 1), so that
heterozygous genotypes (*Aa*) have relative fitness :math:`1+2hs` and
homozygous derived genotypes (*AA*) have relative fitness :math:`1+2s`. When
:math:`h=1/2`, selection is additive (or genic), which corresponds to haploid
copies of the derived allele having average fitness :math:`1+s`. If ``h`` is
unspecified, the selection model defaults to additivity (:math:`h=1/2`), and if
``gamma`` is unspecified, we default to neutrality.

.. note::

    We assume :math:`|s| \ll 1`, so that :math:`s^2` and higher order terms can
    be ignored. For strong selection in a moments framework, see recent
    advances from [Krukov2021]_.

``moments`` takes scaled selection coefficients :math:`\gamma = 2 N_e s` and
dominance coefficients :math:`h` as keyword parameters when initializing the
SFS and integrating. The reference :math:`N_e` is often taken as the ancestral
effective population size.

.. jupyter-execute::

    gamma = -5
    h = 0.1
    ns = 30

    fs = moments.LinearSystem_1D.steady_state_1D(ns, gamma=gamma, h=h)
    fs = moments.Spectrum(fs)
    print("Tajima's D (before expansion):", fs.Tajima_D())

    fs.integrate([3], 0.2, gamma=gamma, h=h)
    print("Tajima's D (after expansion):", fs.Tajima_D())

Simulating selection with multiple populations works similarly. We can specify
``gamma`` and ``h`` as scalar values, which implies that the allele has the
same selection and dominance effect in each population. We can instead simulate
population-specific selection and dominance coefficients by setting ``gamma``
and/or ``h`` as a list with length equal to the number of populations in the
spectrum, with indexing matching the ordering of the populations in the
Spectrum object.

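For example, a minimal sketch in which the variant is deleterious in the first
population but neutral in the second:

.. jupyter-execute::

    fs = moments.Demographics2D.snm([10, 10])
    # gamma and h are given per population, in the same order as the SFS dimensions
    fs.integrate([1, 1], 0.1, gamma=[-5, 0], h=[0.5, 0.5])
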
Ancient samples and frozen populations
======================================

So far, in all of the examples we have seen, every population is integrated to
the same end time. If one or more of the sampled populations are
non-contemporary, we need to "freeze" those populations at their time of
sampling. This is done by specifying which populations to freeze using the
``frozen`` argument. For example, if we sample two populations that split
100 kya, and one population consists of ancient samples from 20 kya, we
integrate the first 80 thousand years as normal, and then the last 20 thousand
years with the ancient population frozen:

.. jupyter-execute::

    Ne = 1e4
    generation_time = 25
    T1 = 80e3 / 2 / Ne / generation_time
    T2 = 20e3 / 2 / Ne / generation_time
    migrate = 0.5

    fs = moments.Demographics2D.snm([10, 10])
    fs.integrate([1, 1], T1, m=[[0, migrate], [migrate, 0]])
    fs.integrate([1, 1], T2, m=[[0, migrate], [migrate, 0]], frozen=[False, True])

****************************
Computing summary statistics
****************************

``moments`` allows us to compute a handful of summary statistics from the SFS.
For single populations, we can get Watterson's :math:`\theta`, the diversity
:math:`\pi`, or Tajima's :math:`D` directly from the SFS:

.. jupyter-execute::

    fs = moments.Demographics1D.two_epoch((3.0, 0.2), [20])
    print("Watterson's theta:", fs.Watterson_theta())
    print("Diversity:", fs.pi())

    snm = moments.Demographics1D.snm([20])
    print("Tajima's D at steady state:", snm.Tajima_D())
    print("Tajima's D after expansion:", fs.Tajima_D())

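For a multi-population SFS, the same single-population statistics can be
computed from the marginal spectra, combining functions shown above (a small
sketch):

.. jupyter-execute::

    # parameters of `split_mig` are (nu0, nu1, T, m)
    fs = moments.Demographics2D.split_mig((2.0, 0.5, 0.1, 1.0), [20, 20])
    print("diversity in population 0:", fs.marginalize([1]).pi())
    print("diversity in population 1:", fs.marginalize([0]).pi())
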
For multi-population spectra, we can also compute FST using Weir and
Cockerham's (1984) method, which generalizes to any number of populations
greater than one:

.. jupyter-execute::

    fs = moments.Demographics2D.snm([10, 10])
    print("FST immediately after split:", fs.Fst())

    fs.integrate([1, 1], 0.05)
    print("FST after isolation of 0.05*2*Ne gens:", fs.Fst())

    fs.integrate([1, 1], 0.05)
    print("FST after isolation of 0.1*2*Ne gens:", fs.Fst())

Note that FST is sensitive to sample sizes: smaller sample sizes artificially
inflate the "true" divergence.

.. jupyter-execute::

    print("10 samples each:", moments.Demographics2D.snm([10, 10]).Fst())
    print("100 samples each:", moments.Demographics2D.snm([100, 100]).Fst())

********************
Compute SFS from VCF
********************

``moments`` supports computing a SFS from files in VCF format, given a
population information file. This takes two steps: we first parse the VCF into
a data dictionary using ``moments.Misc.make_data_dict_vcf``, and we then pass
that data dictionary to the Spectrum class:

.. code-block:: python

    data_dict = moments.Misc.make_data_dict_vcf(vcf_filename, popinfo_filename)
    fs = moments.Spectrum.from_data_dict(data_dict)

****************
Plotting the SFS
****************

``moments`` comes pre-installed with a number of plotting functions, which can
be called from ``moments.Plotting``. These include functions to plot individual
SFS, or to compare two SFS (for example, to compare a model to data). These
functions can be used out-of-the-box, or serve as inspiration for your own
``matplotlib`` adventures.

To see what plotting functions are available and to view their documentation,
head to the :ref:`moments API `.

**********
References
**********

.. [Jouganous2017]
    Jouganous, Julien, et al. "Inferring the joint demographic history of
    multiple populations: beyond the diffusion approximation."
    *Genetics* 206.3 (2017): 1549-1567.

.. [Krukov2021]
    Krukov, Ivan, and Simon Gravel. "Taming strong selection with large sample
    sizes." *bioRxiv* (2021), doi: 10.1101/2021.03.30.437711.