API for site frequency spectra

The Spectrum object

class moments.Spectrum(data, mask=False, mask_corners=True, data_folded=None, check_folding=True, dtype=<class 'float'>, copy=True, fill_value=nan, keep_mask=True, shrink=True, pop_ids=None)[source]

Represents a single-locus biallelic frequency spectrum.

Spectra are represented by masked arrays. The masking allows us to ignore specific entries in the spectrum. When simulating under the standard infinite sites model (ISM), the entries we mask are the bins specifying absent or fixed variants. When using a reversible mutation model (i.e. the finite genome model), we track the density of variants in fixed bins, setting mask_corners to False.

Parameters:
  • data (array) – An array with dimension equal to the number of populations. Each dimension has length \(n_i+1\), where \(n_i\) is the sample size for the i-th population.

  • mask – An optional array of the same size as data. ‘True’ entries in this array are masked in the Spectrum. These represent missing data categories. (For example, you may not trust your singleton SNP calling.)

  • mask_corners – If True (default), the ‘observed in none’ and ‘observed in all’ entries of the FS will be masked. Typically these entries are masked. In the defaul infinite sites model, moments does not reliably calculate the fixed-bin entries, so you will almost always want mask_corners=True. The exception is if we are simulating under the finite genome model, in which case we track the probability of a site to be fixed for either allele.

  • data_folded (bool, optional) – If True, it is assumed that the input data is folded. An error will be raised if the input data and mask are not consistent with a folded Spectrum.

  • check_folding (bool, optional) – If True and data_folded=True, the data and mask will be checked to ensure they are consistent with a folded Spectrum. If they are not, a warning will be printed.

  • pop_ids (list of strings, optional) – Optional list of strings containing the population labels, with length equal to the dimension of data.

Returns:

A frequency spectrum object, as a masked array.

Fst(pairwise=False)[source]

Wright’s Fst between the populations represented in the fs.

This estimate of Fst assumes random mating, because we don’t have heterozygote frequencies in the fs.

Calculation is by the method of Weir and Cockerham Evolution 38:1358 (1984). For a single SNP, the relevant formula is at the top of page 1363. To combine results between SNPs, we use the weighted average indicated by equation 10.

Parameters:

pairwise (bool) – Defaults to False. If True, returns a dictionary of all pairwise Fst within the multi-dimensional spectrum.

S()[source]

Returns the number of segregating sites in the frequency spectrum.

Tajima_D()[source]

Returns Tajima’s D.

Following Gillespie “Population Genetics: A Concise Guide” pg. 45

Watterson_theta()[source]

Returns Watterson’s estimator of theta.

Note

This function is only sensible for 1-dimensional spectra.

Zengs_E()[source]

Returns Zeng et al.’s E statistic.

From Zeng et al., “Statistical Tests for Detecting Positive Selection by Utilizing High-Frequency Variants.” Genetics, 2016.

admix(idx0, idx1, num_lineages, proportion, new_id=None)[source]

Returns a new frequency spectrum with an admixed population that arose through admixture from indexed populations with given number of lineages and proportions from parental populations. This serves as a wrapper for Manips.admix_into_new, with the added feature of handling pop_ids.

If the number of lineages that move are equal to the number of lineages previously present in a source population, that source population is marginalized.

Parameters:
  • idx0 (int) – Index of first source population.

  • idx1 (int) – Index of second source population.

  • num_lineages (int) – Number of lineages in the new population. Cannot be greater than the number of existing lineages in either source populations.

  • proportion (float) – The proportion of lineages that come from the first source population (1-proportion acestry comes from the second source population). Must be a number between 0 and 1.

  • new_id (str, optional) – The ID of the new population. Can only be used if the population IDs are specified in the input SFS.

branch(idx, n, new_id=None)[source]

A “branch” event, where a population gives rise to a child population, while persisting. This is conceptually similar to the split event. The number of lineages in the new population is provided, and the number of lineages in the source/parental population is the original sample size minus the number requested for the branched population. Returns a new frequency spectrum.

Parameters:
  • idx (int) – The index of the population to branch.

  • n (int) – The sample size of the new population.

  • new_id – The population ID of the branch populations. The parental population retains its original population ID. Can only be used if pop_ids are given for the input spectrum.

f2(X, Y)[source]

Returns \(f_2(X, Y) = (pX-pY)^2\), where \(p\) are observed allele frequencies in populations X and Y.

X, and Y can be specified as population ID strings, or as indexes (but these cannot be mixed).

Parameters:
  • X – One of the populations, as index or population ID.

  • Y – The other population, as index or population ID.

Returns:

Patterson’s f2 statistics.

f3(X, Y, Z)[source]

Returns \(f_3(X; Y, Z) = (X-Y)(X-Z)\). A significantly negative \(f_3\) suggests that population X is the result of admixture between ancient populations related to Y and Z. A positive value suggests that X is an outgroup to Y and Z.

X, Y, and Z can be specified as population ID strings, or as indexes (but these cannot be mixed).

Parameters:
  • X – The “test” population, as index or population ID.

  • Y – The first reference population, as index or population ID.

  • Z – The second reference population, as index or population ID.

Returns:

Patterson’s f3 statistic.

f4(X, Y, Z, W)[source]

Returns \(f_4(X, Y; Z, W) = (X-Y)(Z-W)\).

X, Y, Z and W can be specified as population ID strings, or as indexes (but these cannot be mixed).

Parameters:
  • X – A population index or ID.

  • Y – A population index or ID.

  • Z – A population index or ID.

  • W – A population index or ID.

Returns:

Patterson’s f4 statistic (pX-pY)*(pZ-pW).

fixed_size_sample(nsamples, include_masked=False)[source]

Generate a resampled SFS from the current one. Thus, the resampled SFS follows a multinomial distribution given by the proportion of sites in each bin in the original SFS.

Parameters:
  • nsamples (int) – Number of samples to include in the new SFS.

  • include_masked (bool, optional) – If True, use all bins from the SFS. Otherwise, use only non-masked bins. Defaults to False.

fold()[source]

Returns a folded frequency spectrum.

The folded fs assumes that information on which allele is ancestral or derived is unavailable. Thus the fs is in terms of minor allele frequency. This makes the fs into a “triangular” array. If a masked cell is folded into non-masked cell, the destination cell is masked as well.

Folding is not done in-place. The return value is a new Spectrum object.

static from_angsd(sfs_file, sample_sizes, pop_ids=None, folded=False, mask_corners=True)[source]

Convert ANGSD output to a moments Spectrum object. The sample sizes are given as number of haploid genome copies (twice the number of sampled diploid individuals).

Parameters:
  • sfs_file (string) – The n-dimensional SFS from ANGSD. This should be a file with a single line of numbers, as entries in the SFS.

  • sample_sizes (list) – A list of integers with length equal to the number of population, storing the haploid sample size in each population. The order must match the population order provided to ANGSD.

  • pop_ids (list) – A list of strings equal with length equal to the number of population, specifying the population name for each.

  • folded (bool) – If False (default), we assume ancestral states are known, returning an unfolded SFS. If True, the returned SFS is folded.

  • mask_corners (bool) – If True (default), mask the fixed bins in the SFS. If False, the fixed bins will remain unmasked.

Returns:

A moments site frequency spectrum.

Return type:

moments.Spectrum

static from_data_dict(data_dict, pop_ids, projections, mask_corners=True, polarized=True)[source]

Spectrum from a dictionary of polymorphisms.

The data dictionary should be organized as:

{snp_id: {
    'segregating': ['A','T'],
    'calls': {
        'YRI': (23,3),
        'CEU': (7,3)
    },
    'outgroup_allele': 'T'
}}

The ‘calls’ entry gives the successful calls in each population, in the order that the alleles are specified in ‘segregating’. Non-diallelic polymorphisms are skipped.

Parameters:
  • pop_ids – list of which populations to make fs for.

  • projections – list of sample sizes to project down to for each population.

  • polarized – If True, the data are assumed to be correctly polarized by ‘outgroup_allele’. SNPs in which the ‘outgroup_allele’ information is missing or ‘-’ or not concordant with the segregating alleles will be ignored. If False, any ‘outgroup_allele’ info present is ignored, and the returned spectrum is folded.

static from_demes(g, sampled_demes=None, sample_sizes=None, sample_times=None, samples=None, unsampled_n=4, gamma=None, h=None, theta=1)[source]

Takes a deme graph and computes the SFS. demes is a package for specifying demographic models in a user-friendly, human-readable YAML format. This function automatically parses the demographic description and returns a SFS for the specified populations and sample sizes.

Note

If a deme sample time is requested that is earlier than the deme’s end time, for example to simulate ancient samples, we must create a new population for that ancient sample. This can cause large slow-downs, as the computation cost of computing the SFS grows quickly in the number of populations.

Parameters:
  • g (str or demes.DemeGraph) – A demes DemeGraph from which to compute the SFS. The DemeGraph can either be specified as a YAML file, in which case g is a string, or as a DemeGraph object.

  • sampled_demes (list of strings) – A list of deme IDs to take samples from. We can repeat demes, as long as the sampling of repeated deme IDs occurs at distinct times.

  • sample_sizes (list of ints) – A list of the same length as sampled_demes, giving the sample sizes for each sampled deme.

  • sample_times (list of floats, optional) – If None, assumes all sampling occurs at the end of the existence of the sampled deme. If there are ancient samples, sample_times must be a list of same length as sampled_demes, giving the sampling times for each sampled deme. Sampling times are given in time units of the original deme graph, so might not necessarily be generations (e.g. if g.time_units is years)

  • unsampled_n (int, optional) – The default sample size of unsampled demes, which must be greater than or equal to 4.

  • gamma (float or dict) – The scaled selection coefficient(s), 2*Ne*s. Defaults to None, which implies neutrality. Can be given as a scalar value, in which case all populations have the same selection coefficient. Alternatively, can be given as a dictionary, with keys given as population names in the input Demes model. Any population missing from this dictionary will be assigned a selection coefficient of zero. A non-zero default selection coefficient can be provided, using the key _default. See the Demes exension documentation for more details and examples.

  • h (float or dict) – The dominance coefficient(s). Defaults to additivity (or genic selection). Can be given as a scalar value, in which case all populations have the same dominance coefficient. Alternatively, can be given as a dictionary, with keys given as population names in the input Demes model. Any population missing from this dictionary will be assigned a dominance coefficient of 1/2 (additivity). A different default dominance coefficient can be provided, using the key _default. See the Demes exension documentation for more details and examples.

  • theta (scalar) – The population-size scaled mutation rate (4*Ne*u or 4*Ne*u*L). The default value is 1. For more control of mutation rates and mutation models (including using a reversible mutation model), please use moments.Demes.SFS, which has options to specify theta, the per-base mutation rate u, and/or a reversible mutation model that allows for different forward and backward mutation rates.

Returns:

A moments site frequency spectrum, with dimension equal to the length of sampled_demes, and shape equal to sample_sizes plus one in each dimension, indexing the allele frequency in each deme from 0 to n[i], where i is the deme index.

Return type:

moments.Spectrum

static from_file(fid, mask_corners=True, return_comments=False)[source]

Read frequency spectrum from file.

See to_file for details on the file format.

Parameters:
  • fid (string) – string with file name to read from or an open file object.

  • mask_corners (bool, optional) – If True, mask the ‘absent in all samples’ and ‘fixed in all samples’ entries.

  • return_comments (bool, optional) – If true, the return value is (fs, comments), where comments is a list of strings containing the comments from the file (without #’s).

static from_ms_file(fid, average=True, mask_corners=True, return_header=False, pop_assignments=None, pop_ids=None, bootstrap_segments=1)[source]

Read frequency spectrum from file of ms output.

Parameters:
  • fid – string with file name to read from or an open file object.

  • average – If True, the returned fs is the average over the runs in the ms file. If False, the returned fs is the sum.

  • mask_corners – If True, mask the ‘absent in all samples’ and ‘fixed in all samples’ entries.

  • return_header – If True, the return value is (fs, (command,seeds), where command and seeds are strings containing the ms commandline and the seeds used.

  • pop_assignments – If None, the assignments of samples to populations is done automatically, using the assignment in the ms command line. To manually assign populations, pass a list of the from [6,8]. This example places the first 6 samples into population 1, and the next 8 into population 2.

  • pop_ids – Optional list of strings containing the population labels. If pop_ids is None, labels will be “pop0”, “pop1”, …

  • bootstrap_segments – If bootstrap_segments is an integer greater than 1, the data will be broken up into that many segments based on SNP position. Instead of single FS, a list of spectra will be returned, one for each segment.

static fromfile(fid, mask_corners=True, return_comments=False)

Read frequency spectrum from file.

See to_file for details on the file format.

Parameters:
  • fid (string) – string with file name to read from or an open file object.

  • mask_corners (bool, optional) – If True, mask the ‘absent in all samples’ and ‘fixed in all samples’ entries.

  • return_comments (bool, optional) – If true, the return value is (fs, comments), where comments is a list of strings containing the comments from the file (without #’s).

genotype_matrix(num_sites=None, sample_sizes=None, diploid_genotypes=False)[source]

Generate a genotype matrix of independent loci. For multi-population spectra, the individual columns are filled in the sample order as the populations in the SFS.

Note

Sites in the output genotype matrix are necessarily separated by infinite recombination. The SFS assumes all loci are segregating independently, so there is no linkage between them.

Returns a genotype matrix of size number of sites by total sample size.

Parameters:
  • num_sites – Defaults to None, in which case we take a poisson sample from the SFS. Otherwise, we take a fixed number of sites.

  • sample_sizes – The sample size in each population, as a list with length of the number of dimension (populations) in the SFS.

Diploid_genotypes:

Defaults to False, in which case we return a haplotype matrix of size (num_sites x sum(sample_sizes)). If True, we return a diploid genotype matrix (filled with 0, 1, 2) of size (num_sites x sum(sample_sizes)/2).

integrate(Npop, tf, dt_fac=0.02, gamma=None, h=None, overdominance=None, m=None, theta=1.0, adapt_dt=False, finite_genome=False, theta_fd=None, theta_bd=None, frozen=[False])[source]

Method to simulate the spectrum’s evolution for a given set of demographic parameters. The SFS is integrated forward-in-time, and the integration occurs in-place, meaning you need only call fs.integrate( ), and the fs is updated.

Parameters:
  • Npop (list or function that returns a list) – List of populations’ relative effective sizes. Can be given as a list of positive values for constant sizes, or as a function that returns a list of sizes at a given time.

  • tf (float) – The total integration time in genetic units.

  • dt_fac (float, optional) – The timestep factor, default is 0.02. This parameter typically does not need to be adjusted.

  • gamma (float or list of floats or function that returns a float or list of floats, optional) – Scaled selection coefficient(s), which can be either a number or a vector gamma = [gamma1,…,gammap], allowing for different selection coefficients in each population. This can also be given as function returning a number or such a vector, allowing for selection coefficients to change over time.

  • h (float or list of floats or function that returns a float or list of floats, optional) – Dominance coefficient(s) (h=1/2 implies additive selection). Can be given as a number or a vector h = [h1,…,hp], or a function that returns a number or vector of coefficients, allowing for dominance coefficients to change over time.

  • overdominance (float or list of floats or function that returns a float or list of floats, optional) – Scaled selection coefficient(s) that is applied only to heterozygotes, in a selection system with fitnesses 1:1+s:1. Underdominance can be modeled by passing a negative value. Not that this is a symmetric under/over-dominance model, in which homozygotes for either the ancestral or derived allele have equal fitness. gamma, h, and overdominance can be combined (additively) to implement non-symmetric selection scenarios.

  • m (array-like, optional) – The migration rates matrix as a 2-D array with shape nxn, where n is the number of populations. The entry of the migration matrix m[i,j] is the migration rate from pop j to pop i in genetic units, that is, normalized by \(2N_e\). m may be either a 2-D array, or a function that returns a 2-D array (with dimensions equal to (num pops)x(num pops)).

  • theta (float, optional) – The scaled mutation rate \(4 N_e u\), which defaults to 1. theta can be used in the reversible model in the case of symmetric mutation rates. In this case, theta must be set to << 1.

  • adapt_dt (bool, optional) – flag to allow dt correction avoiding negative entries.

  • finite_genome (bool, optional) – If True, simulate under the finite-genome model with reversible mutations. If using this model, we can specify the forward and backward mutation rates, which are per-base rates that are not scaled by number of mutable loci. If theta_fd and theta_bd are not specified, we assume equal forward and backward mutation rates provided by theta, which must be set to less that 1. Defaults to False.

  • theta_fd (float, optional) – The forward mutation rate \(4 Ne u\).

  • theta_bd (float, optional) – The backward mutation rate \(4 Ne v\).

  • frozen (list of bools) – Specifies the populations that are “frozen”, meaning samples from that population no longer change due or contribute to migration to other populations. This feature is most often used to indicate ancient samples, for example, ancient DNA. The frozen parameter is given as a list of same length as number of pops, with True for frozen populations at the corresponding index, and False for populations that continue to evolve.

log()[source]

Returns the natural logarithm of the entries of the frequency spectrum.

Only necessary because np.ma.log now fails to propagate extra attributes after np 1.10.

marginalize(over, mask_corners=None)[source]

Reduced dimensionality spectrum summing over the set of populations given by over.

marginalize does not act in-place, so the input frequency spectrum will not be altered.

Parameters:
  • over (list of integers) – List of axes to sum over. For example (0,2) will marginalize populations 0 and 2.

  • mask_corners (bool, optional) – If True, the fixed bins of the resulting spectrum will be masked. The default behavior is to mask the corners only if at least one of the corners of the input frequency spectrum is masked. If either corner is masked, the output frequency spectrum masks the fixed bins.

mask_corners()[source]

Mask the ‘seen in 0 samples’ and ‘seen in all samples’ entries.

pi()[source]

Returns the estimated expected number of pairwise differences between two chromosomes in the population.

Note

This estimate includes a factor of sample_size / (sample_size - 1) to make \(\mathbb{E}[\pi] = \theta\).

project(ns)[source]

Project to smaller sample size.

project does not act in-place, so that the input frequency spectrum is not changed.

Parameters:

ns (list of integers) – Sample sizes for new spectrum.

pulse_migrate(idx_from, idx_to, keep_from, proportion)[source]

Mass migration (pulse admixture) between two existing populations. The target (destination) population has the same number of lineages in the output SFS, and the source population has keep_from number of lineages after the pulse event. The proportion is the expected ancestry proportion in the target population that comes from the source population.

This serves as a wrapper for Manips.admix_inplace.

Depending on the proportion and number of lineages, because this is an approximate operation, we often need a large number of lineages from the source population to maintain accuracy.

Parameters:
  • idx_from (int) – Index of source population.

  • idx_to (int) – Index of targeet population.

  • keep_from (int) – Number of lineages to keep in source population.

  • proportion (float) – Ancestry proportion of source population that migrates to target population.

sample()[source]

Generate a Poisson-sampled fs from the current one.

Entries where the current fs is masked or 0 will be masked in the output sampled fs.

scramble_pop_ids(mask_corners=True)[source]

Spectrum corresponding to scrambling individuals among populations.

This is useful for assessing how diverged populations are. Essentially, it pools all the individuals represented in the fs and generates new populations of random individuals (without replacement) from that pool. If this fs is significantly different from the original, that implies population structure.

split(idx, n0, n1, new_ids=None)[source]

Splits a population in the SFS into two populations, with the extra population placed at the end. Returns a new frequency spectrum.

Parameters:
  • idx (int) – The index of the population to split.

  • n0 (int) – The sample size of the first split population.

  • n1 (int) – The sample size of the second split population.

  • new_ids (list of strings, optional) – The population IDs of the split populations. Can only be used if pop_ids are given for the input spectrum.

swap_axes(ax1, ax2)[source]

Uses np’s swapaxes function, but also swaps pop_ids as appropriate if pop_ids are given.

Note

fs.swapaxes(ax1, ax2) will still work, but if population ids are given, it won’t swap the pop_ids entries as expected.

Parameters:
  • ax1 (int) – The index of the first population to swap.

  • ax2 (int) – The index of the second population to swap.

theta_L()[source]

Returns theta_L as defined by Zeng et al. “Statistical Tests for Detecting Positive Selection by Utilizing High-Frequency Variants” (2006) Genetics

Note

This function is only sensible for 1-dimensional spectra.

to_file(fid, precision=16, comment_lines=[], foldmaskinfo=True)[source]

Write frequency spectrum to file.

The file format is:

  • Any number of comment lines beginning with a ‘#’

  • A single line containing N integers giving the dimensions of the fs array. So this line would be ‘5 5 3’ for an SFS that was 5x5x3. (That would be 4x4x2 samples.)

  • On the same line, the string ‘folded’ or ‘unfolded’ denoting the folding status of the array

  • On the same line, optional strings each containing the population labels in quotes separated by spaces, e.g. “pop 1” “pop 2”

  • A single line giving the array elements. The order of elements is e.g.: fs[0,0,0] fs[0,0,1] fs[0,0,2] … fs[0,1,0] fs[0,1,1] …

  • A single line giving the elements of the mask in the same order as the data line. ‘1’ indicates masked, ‘0’ indicates unmasked.

Parameters:
  • fid (string) – string with file name to write to or an open file object.

  • precision (int, optional) – precision with which to write out entries of the SFS. (They are formated via %.<p>g, where <p> is the precision.) Defaults to 16.

  • comment_lines (list of strings, optional) – list of strings to be used as comment lines in the header of the output file.

  • foldmaskinfo (bool, optional) – If False, folding and mask and population label information will not be saved.

tofile(fid, precision=16, comment_lines=[], foldmaskinfo=True)

Write frequency spectrum to file.

The file format is:

  • Any number of comment lines beginning with a ‘#’

  • A single line containing N integers giving the dimensions of the fs array. So this line would be ‘5 5 3’ for an SFS that was 5x5x3. (That would be 4x4x2 samples.)

  • On the same line, the string ‘folded’ or ‘unfolded’ denoting the folding status of the array

  • On the same line, optional strings each containing the population labels in quotes separated by spaces, e.g. “pop 1” “pop 2”

  • A single line giving the array elements. The order of elements is e.g.: fs[0,0,0] fs[0,0,1] fs[0,0,2] … fs[0,1,0] fs[0,1,1] …

  • A single line giving the elements of the mask in the same order as the data line. ‘1’ indicates masked, ‘0’ indicates unmasked.

Parameters:
  • fid (string) – string with file name to write to or an open file object.

  • precision (int, optional) – precision with which to write out entries of the SFS. (They are formated via %.<p>g, where <p> is the precision.) Defaults to 16.

  • comment_lines (list of strings, optional) – list of strings to be used as comment lines in the header of the output file.

  • foldmaskinfo (bool, optional) – If False, folding and mask and population label information will not be saved.

unfold()[source]

Returns an unfolded frequency spectrum.

It is assumed that each state of a SNP is equally likely to be ancestral.

Unfolding is not done in-place. The return value is a new Spectrum object.

unmask_all()[source]

Unmask all entires of the frequency spectrum.

Miscellaneous functions

moments.Misc.perturb_params(params, fold=1, lower_bound=None, upper_bound=None)[source]

Generate a perturbed set of parameters. Each element of params is randomly perturbed fold factors of 2 up or down.

Parameters:
  • fold (float, optional) – Number of factors of 2 to perturb by, defaults to 1.

  • lower_bound (list of floats, optional) – If not None, the resulting parameter set is adjusted to have all value greater than lower_bound.

  • upper_bound (list of floats, optional) – If not None, the resulting parameter set is adjusted to have all value less than upper_bound.

moments.Misc.make_data_dict_vcf(vcf_filename, popinfo_filename, filter=True, flanking_info=[None, None], skip_multiallelic=True)[source]

Parse a VCF file containing genomic sequence information, along with a file identifying the population of each sample, and store the information in a properly formatted dictionary.

Each file may be zipped (.zip) or gzipped (.gz). If a file is zipped, it must be the only file in the archive, and the two files cannot be zipped together. Both files must be present for the function to work.

Parameters:
  • vcf_filename (str) – Name of VCF file to work with. The function currently works for biallelic SNPs only, so if REF or ALT is anything other than a single base pair (A, C, T, or G), the allele will be skipped. Additionally, genotype information must be present in the FORMAT field GT, and genotype info must be known for every sample, else the SNP will be skipped. If the ancestral allele is known it should be specified in INFO field ‘AA’. Otherwise, it will be set to ‘-‘.

  • popinfo_filename (str) – Name of file containing the population assignments for each sample in the VCF. If a sample in the VCF file does not have a corresponding entry in this file, it will be skipped. See _get_popinfo for information on how this file must be formatted.

  • filter (bool, optional) – If set to True, alleles will be skipped if they have not passed all filters (i.e. either ‘PASS’ or ‘.’ must be present in FILTER column.

  • flanking_info (list of strings, optional) – Flanking information for the reference and/or ancestral allele can be provided as field(s) in the INFO column. To add this information to the dict, flanking_info should specify the names of the fields that contain this info as a list (e.g. [‘RFL’, ‘AFL’].) If context info is given for only one allele, set the other item in the list to None, (e.g. [‘RFL’, None]). Information can be provided as a 3 base-pair sequence or 2 base-pair sequence, where the first base-pair is the one immediately preceding the SNP, and the last base-pair is the one immediately following the SNP.

  • skip_multiallelic (bool, optional) – If True, only keep biallelic sites, and skip sites that have more than one ALT allele.

moments.Misc.count_data_dict(data_dict, pop_ids)[source]

Summarize data in data_dict by mapping SNP configurations to counts.

Returns a dictionary with keys (successful_calls, derived_calls, polarized) mapping to counts of SNPs. Here successful_calls is a tuple with the number of good calls per population, derived_calls is a tuple of derived calls per pop, and polarized indicates whether that SNP was polarized using an ancestral state.

Parameters:
  • data_dict (data dictionary) – data_dict formatted as in Misc.make_data_dict

  • pop_ids (list of strings) – IDs of populations to collect data for

moments.Misc.bootstrap(data_dict, pop_ids, projections, mask_corners=True, polarized=True, bed_filename=None, num_boots=100, save_dir=None)[source]

Use a non-parametric bootstrap on SNP information contained in a dictionary to generate new data sets. The new data is created by sampling with replacement from independent units of the original data. These units can simply be chromosomes, or they can be regions specified in a BED file.

This function either returns a list of all the newly created SFS, or writes them to disk in a specified directory.

See moments.Spectrum.from_data_dict() for more details about the options for creating spectra.

Parameters:
  • data_dict (dict of SNP information) – Dictionary containing properly formatted SNP information (i.e. created using one of the make_data_dict methods).

  • pop_ids (list of strings) – List of population IDs.

  • projections (list of ints) – Projection sizes for the given population IDs.

  • mask_corners (bool, optional) – If True, mask the invariant bins of the SFS.

  • polarized (bool, optional) – If True, we assume we know the ancestral allele. If False, return folded spectra.

  • bed_filename (string as path to bed file) – If None, chromosomes will be used as the units for resampling. Otherwise, this should be the filename of a BED file specifying the regions to be used as resampling units. Chromosome names must be consistent between the BED file and the data dictionary, or bootstrap will not work. For example, if an entry in the data dict has ID X_Y, then the value in in the chromosome field of the BED file must also be X (not chrX, chromosomeX, etc.). If the name field is provided in the BED file, then any regions with the same name will be considered to be part of the same unit. This may be useful for sampling as one unit a gene that is located across non-continuous regions.

  • num_boots (int, optional) – Number of resampled SFS to generate.

  • save_dir (str, optional) – If None, the SFS are returned as a list. Otherwise this should be a string specifying the name of a new directory under which all of the new SFS should be saved.

Demographic functions

Single-population demographic models.

moments.Demographics1D.bottlegrowth(params, ns, pop_ids=None)[source]

Instantanous size change followed by exponential growth.

params = (nuB, nuF, T)

Parameters:
  • params

    Tuple of length three specifying (nuB, nuF, T).

    • nuB: Ratio of population size after instantanous change to ancient population size.

    • nuF: Ratio of contemporary to ancient population size.

    • T: Time in the past at which instantaneous change happened and growth began (in units of 2*Na generations).

  • ns – Number of samples in resulting Spectrum.

  • pop_ids – Optional list of length one specifying the population ID.

moments.Demographics1D.growth(params, ns, pop_ids=None)[source]

Exponential growth beginning some time ago.

params = (nu, T)

Parameters:
  • params

    Tupe of length two, specifying (nu, t).

    • nu: the final population size.

    • T: the time in the past at which growth began (in units of 2*Ne generations).

  • ns – Number of samples in resulting Spectrum. Must be a list of length one.

  • pop_ids – Optional list of length one specifying the population ID.

moments.Demographics1D.snm(ns, pop_ids=None)[source]

Standard neutral model with theta=1.

Parameters:
  • ns – Number of samples in resulting Spectrum. Must be a list of length one.

  • pop_ids – Optional list of length one specifying the population ID.

moments.Demographics1D.three_epoch(params, ns, pop_ids=None)[source]

Three epoch model of constant sizes.

params = (nuB, nuF, TB, TF)

Parameters:
  • params

    Tuple of length four specifying (nuB, nuF, TB, TF).

    • nuB: Ratio of bottleneck population size to ancient pop size.

    • nuF: Ratio of contemporary to ancient pop size.

    • TB: Length of bottleneck (in units of 2*Na generations).

    • TF: Time since bottleneck recovery (in units of 2*Na generations).

  • ns – Number of samples in resulting Spectrum.

  • pop_ids – Optional list of length one specifying the population ID.

moments.Demographics1D.two_epoch(params, ns, pop_ids=None)[source]

Instantaneous size change some time ago.

params = (nu, T)

Parameters:
  • params

    Tuple of length two, specifying (nu, T).

    • nu: the ratio of contemporary to ancient population size.

    • T: the time in the past at which size change happened (in units of 2*Ne generations).

  • ns – Number of samples in resulting Spectrum. Must be a list of length one.

  • pop_ids – Optional list of length one specifying the population ID.

Two-population demographic models.

moments.Demographics2D.IM(params, ns, pop_ids=None)[source]

Isolation-with-migration model with exponential pop growth.

params = (s, nu1, nu2, T, m12, m21)

ns = [n1, n2]

Parameters:
  • params

    Tuple of length 6.

    • s: Size of pop 1 after split. (Pop 2 has size 1-s.)

    • nu1: Final size of pop 1.

    • nu2: Final size of pop 2.

    • T: Time in the past of split (in units of 2*Na generations)

    • m12: Migration from pop 2 to pop 1 (2 * Na * m12)

    • m21: Migration from pop 1 to pop 2

  • ns – List of population sizes in first and second populations.

  • pop_ids – List of population IDs.

moments.Demographics2D.IM_pre(params, ns, pop_ids=None)[source]

params = (nuPre, TPre, s, nu1, nu2, T, m12, m21)

ns = [n1, n2]

Isolation-with-migration model with exponential pop growth and a size change prior to split.

  • nuPre: Size after first size change

  • TPre: Time before split of first size change.

  • s: Fraction of nuPre that goes to pop1. (Pop 2 has size nuPre*(1-s).)

  • nu1: Final size of pop 1.

  • nu2: Final size of pop 2.

  • T: Time in the past of split (in units of 2*Na generations)

  • m12: Migration from pop 2 to pop 1 (2*Na*m12)

  • m21: Migration from pop 1 to pop 2

  • n1, n2: Sample sizes of resulting Spectrum.

Parameters:
  • ns – List of population sizes in first and second populations.

  • pop_ids – List of population IDs.

moments.Demographics2D.bottlegrowth(params, ns, pop_ids=None)[source]

params = (nuB, nuF, T)

ns = [n1, n2]

Instantanous size change followed by exponential growth with no population split.

  • nuB: Ratio of population size after instantanous change to ancient population size

  • nuF: Ratio of contempoary to ancient population size

  • T: Time in the past at which instantaneous change happened and growth began (in units of 2*Na generations)

  • n1, n2: Sample sizes of resulting Spectrum.

Parameters:
  • params – List of parameters, (nuB, nuF, T).

  • ns – List of population sizes in first and second populations.

  • pop_ids – List of population IDs.

moments.Demographics2D.bottlegrowth_split(params, ns, pop_ids=None)[source]

params = (nuB, nuF, T, Ts)

ns = [n1, n2]

Instantanous size change followed by exponential growth then split.

  • nuB: Ratio of population size after instantanous change to ancient population size

  • nuF: Ratio of contempoary to ancient population size

  • T: Time in the past at which instantaneous change happened and growth began (in units of 2*Na generations)

  • Ts: Time in the past at which the two populations split.

  • n1, n2: Sample sizes of resulting Spectrum.

Parameters:
  • ns – List of population sizes in first and second populations.

  • pop_ids – List of population IDs.

moments.Demographics2D.bottlegrowth_split_mig(params, ns, pop_ids=None)[source]

params = (nuB, nuF, m, T, Ts) ns = [n1, n2]

Instantanous size change followed by exponential growth then split with migration.

  • nuB: Ratio of population size after instantanous change to ancient population size

  • nuF: Ratio of contempoary to ancient population size

  • m: Migration rate between the two populations (2*Na*m).

  • T: Time in the past at which instantaneous change happened and growth began (in units of 2*Na generations)

  • Ts: Time in the past at which the two populations split.

  • n1, n2: Sample sizes of resulting Spectrum.

Parameters:
  • ns – List of population sizes in first and second populations.

  • pop_ids – List of population IDs.

moments.Demographics2D.snm(ns, pop_ids=None)[source]

ns = [n1, n2]

Standard neutral model with a split but no divergence.

Parameters:
  • ns – List of population sizes in first and second populations.

  • pop_ids – List of population IDs.

moments.Demographics2D.split_mig(params, ns, pop_ids=None)[source]

Split into two populations of specifed size, with migration.

params = (nu1, nu2, T, m)

ns = [n1, n2]

Parameters:
  • params

    Tuple of length 4.

    • nu1: Size of population 1 after split.

    • nu2: Size of population 2 after split.

    • T: Time in the past of split (in units of 2*Na generations)

    • m: Migration rate between populations (2*Na*m)

  • ns – List of length two specifying sample sizes n1 and n2.

  • pop_ids – List of population IDs.

Three-population demographic models.

moments.Demographics3D.out_of_Africa(params, ns, pop_ids=['YRI', 'CEU', 'CHB'])[source]

The Gutenkunst et al (2009) out-of-Africa that has been reinferred a number of times.

Parameters:
  • params (list of floats) – List of parameters, in the order (nuA, TA, nuB, TB, nuEu0, nuEuF, nuAs0, nuAsF, TF, mAfB, mAfEu, mAfAs, mEuAs).

  • ns (list of ints) – List of population sizes in each population, in order given by pop_ids.

  • pop_ids (list of strings, optional) – List of population IDs, defaults to [“YRI”, “CEU”, “CHB”].

Inference functions

moments.Inference.ll(model, data)[source]

The log-likelihood of the data given the model sfs.

Evaluate the log-likelihood of the data given the model. This is based on Poisson statistics, where the probability of observing k entries in a cell given that the mean number is given by the model is \(P(k) = exp(-model) * model^k / k!\).

Note: If either the model or the data is a masked array, the return ll will ignore any elements that are masked in either the model or the data.

Parameters:
  • model – The model Spectrum object.

  • data – The data Spectrum object, with same size as model.

moments.Inference.ll_multinom(model, data)[source]

Log-likelihood of the data given the model, with optimal rescaling.

Evaluate the log-likelihood of the data given the model. This is based on Poisson statistics, where the probability of observing k entries in a cell given that the mean number is given by the model is \(P(k) = exp(-model) * model^k / k!\).

model is optimally scaled to maximize ll before calculation.

Note: If either the model or the data is a masked array, the return ll will ignore any elements that are masked in either the model or the data.

Parameters:
  • model – The model Spectrum object.

  • data – The data Spectrum object, with same size as model.

moments.Inference.optimal_sfs_scaling(model, data)[source]

Optimal multiplicative scaling factor between model and data.

This scaling is based on only those entries that are masked in neither model nor data.

Parameters:
  • model – The model Spectrum object.

  • data – The data Spectrum object, with same size as model.

moments.Inference.optimally_scaled_sfs(model, data)[source]

Optimially scale model sfs to data sfs.

Returns a new scaled model sfs.

Parameters:
  • model – The model Spectrum object.

  • data – The data Spectrum object, with same size as model.

moments.Inference.linear_Poisson_residual(model, data, mask=None)[source]

Return the Poisson residuals, (model - data)/sqrt(model), of model and data.

mask sets the level in model below which the returned residual array is masked. The default of 0 excludes values where the residuals are not defined.

In the limit that the mean of the Poisson distribution is large, these residuals are normally distributed. (If the mean is small, the Anscombe residuals are better.)

Parameters:
  • model – The model Spectrum object.

  • data – The data Spectrum object, with same size as model.

  • mask – Optional mask, with same size as model.

moments.Inference.Anscombe_Poisson_residual(model, data, mask=None)[source]

Return the Anscombe Poisson residuals between model and data.

mask sets the level in model below which the returned residual array is masked. This excludes very small values where the residuals are not normal. 1e-2 seems to be a good default for the NIEHS human data. (model = 1e-2, data = 0, yields a residual of ~1.5.)

Residuals defined in this manner are more normally distributed than the linear residuals when the mean is small. See this reference below for justification: Pierce DA and Schafer DW, “Residuals in generalized linear models” Journal of the American Statistical Association, 81(396)977-986 (1986).

Note that I tried implementing the “adjusted deviance” residuals, but they always looked very biased for the cases where the data was 0.

Parameters:
  • model – The model Spectrum object.

  • data – The data Spectrum object, with same size as model.

  • mask – Optional mask, with same size as model.

moments.Inference.optimize_log(p0, data, model_func, lower_bound=None, upper_bound=None, verbose=0, flush_delay=0.5, epsilon=0.001, gtol=1e-05, multinom=True, maxiter=None, full_output=False, func_args=[], func_kwargs={}, fixed_params=None, ll_scale=1, output_file=None)[source]

Optimize log(params) to fit model to data using the BFGS method. This optimization method works well when we start reasonably close to the optimum.

Because this works in log(params), it cannot explore values of params < 0. However, it should perform well when parameters range over different orders of magnitude.

Parameters:
  • p0 – Initial parameters.

  • data – Data SFS.

  • model_func – Function to evaluate model spectrum. Should take arguments model_func(params, (n1,n2...)).

  • lower_bound – Lower bound on parameter values. If not None, must be of same length as p0.

  • upper_bound – Upper bound on parameter values. If not None, must be of same length as p0.

  • verbose – If > 0, print optimization status every verbose steps.

  • output_file – Stream verbose output into this filename. If None, stream to standard out.

  • flush_delay – Standard output will be flushed once every <flush_delay> minutes. This is useful to avoid overloading I/O on clusters.

  • epsilon – Step-size to use for finite-difference derivatives.

  • gtol – Convergence criterion for optimization. For more info, see help(scipy.optimize.fmin_bfgs)

  • multinom – If True, do a multinomial fit where model is optimially scaled to data at each step. If False, assume theta is a parameter and do no scaling.

  • maxiter – Maximum iterations to run for.

  • full_output – If True, return full outputs as in described in help(scipy.optimize.fmin_bfgs)

  • func_args – Additional arguments to model_func. It is assumed that model_func’s first argument is an array of parameters to optimize, that its second argument is an array of sample sizes for the sfs, and that its last argument is the list of grid points to use in evaluation. Using func_args. For example, you could define your model function as def func((p1,p2), ns, f1, f2): .... If you wanted to fix f1=0.1 and f2=0.2 in the optimization, you would pass func_args = [0.1,0.2] (and ignore the fixed_params argument).

  • func_kwargs – Additional keyword arguments to model_func.

  • fixed_params – If not None, should be a list used to fix model parameters at particular values. For example, if the model parameters are (nu1,nu2,T,m), then fixed_params = [0.5,None,None,2] ll hold nu1=0.5 and m=2. The optimizer will only change T and m. Note that the bounds lists must include all parameters. Optimization will fail if the fixed values lie outside their bounds. A full-length p0 should be passed in; values corresponding to fixed parameters are ignored. For example, suppose your model function is def func((p1,f1,p2,f2), ns): ... If you wanted to fix f1=0.1 and f2=0.2 in the optimization, you would pass fixed_params = [None,0.1,None,0.2] (and ignore the func_args argument).

  • ll_scale – The bfgs algorithm may fail if your initial log-likelihood is too large. (This appears to be a flaw in the scipy implementation.) To overcome this, pass ll_scale > 1, which will simply reduce the magnitude of the log-likelihood. Once in a region of reasonable likelihood, you’ll probably want to re-optimize with ll_scale=1.

moments.Inference.optimize_log_fmin(p0, data, model_func, lower_bound=None, upper_bound=None, verbose=0, flush_delay=0.5, multinom=True, maxiter=None, maxfun=None, full_output=False, func_args=[], func_kwargs={}, fixed_params=None, output_file=None)[source]

Optimize log(params) to fit model to data using Nelder-Mead. This optimization method may work better than BFGS when far from a minimum. It is much slower, but more robust, because it doesn’t use gradient information.

Because this works in log(params), it cannot explore values of params < 0. It should also perform better when parameters range over large scales.

Parameters:
  • p0 – Initial parameters.

  • data – Spectrum with data.

  • model_function – Function to evaluate model spectrum. Should take arguments (params, (n1,n2…))

  • lower_bound – Lower bound on parameter values. If not None, must be of same length as p0. A parameter can be declared unbound by assigning a bound of None.

  • upper_bound – Upper bound on parameter values. If not None, must be of same length as p0. A parameter can be declared unbound by assigning a bound of None.

  • verbose – If True, print optimization status every <verbose> steps.

  • output_file – Stream verbose output into this filename. If None, stream to standard out.

  • flush_delay – Standard output will be flushed once every <flush_delay> minutes. This is useful to avoid overloading I/O on clusters.

  • multinom – If True, do a multinomial fit where model is optimially scaled to data at each step. If False, assume theta is a parameter and do no scaling.

  • maxiter – Maximum number of iterations to run optimization.

  • maxfun – Maximum number of objective function calls to perform.

  • full_output – If True, return full outputs as in described in help(scipy.optimize.fmin_bfgs)

  • func_args – Additional arguments to model_func. It is assumed that model_func’s first argument is an array of parameters to optimize, that its second argument is an array of sample sizes for the sfs, and that its last argument is the list of grid points to use in evaluation.

  • func_kwargs – Additional keyword arguments to model_func.

  • fixed_params – If not None, should be a list used to fix model parameters at particular values. For example, if the model parameters are (nu1,nu2,T,m), then fixed_params = [0.5,None,None,2] will hold nu1=0.5 and m=2. The optimizer will only change T and m. Note that the bounds lists must include all parameters. Optimization will fail if the fixed values lie outside their bounds. A full-length p0 should be passed in; values corresponding to fixed parameters are ignored.

moments.Inference.optimize_log_powell(p0, data, model_func, lower_bound=None, upper_bound=None, verbose=0, flush_delay=0.5, multinom=True, maxiter=None, full_output=False, func_args=[], func_kwargs={}, fixed_params=None, output_file=None)[source]

Optimize log(params) to fit model to data using Powell’s conjugate direction method.

This method works without calculating any derivatives, and optimizes along one direction at a time. May be useful as an initial search for an approximate solution, followed by further optimization using a gradient optimizer.

Because this works in log(params), it cannot explore values of params < 0.

Parameters:
  • p0 – Initial parameters.

  • data – Spectrum with data.

  • model_function – Function to evaluate model spectrum. Should take arguments (params, (n1,n2…))

  • lower_bound – Lower bound on parameter values. If not None, must be of same length as p0. A parameter can be declared unbound by assigning a bound of None.

  • upper_bound – Upper bound on parameter values. If not None, must be of same length as p0. A parameter can be declared unbound by assigning a bound of None.

  • verbose – If True, print optimization status every <verbose> steps. output_file: Stream verbose output into this filename. If None, stream to standard out.

  • flush_delay – Standard output will be flushed once every <flush_delay> minutes. This is useful to avoid overloading I/O on clusters. multinom: If True, do a multinomial fit where model is optimially scaled to data at each step. If False, assume theta is a parameter and do no scaling.

  • maxiter – Maximum iterations to run for.

  • full_output – If True, return full outputs as in described in help(scipy.optimize.fmin_bfgs)

  • func_args – Additional arguments to model_func. It is assumed that model_func’s first argument is an array of parameters to optimize, that its second argument is an array of sample sizes for the sfs, and that its last argument is the list of grid points to use in evaluation.

  • func_kwargs – Additional keyword arguments to model_func.

  • fixed_params – If not None, should be a list used to fix model parameters at particular values. For example, if the model parameters are (nu1,nu2,T,m), then fixed_params = [0.5,None,None,2] will hold nu1=0.5 and m=2. The optimizer will only change T and m. Note that the bounds lists must include all parameters. Optimization will fail if the fixed values lie outside their bounds. A full-length p0 should be passed in; values corresponding to fixed parameters are ignored. (See help(moments.Inference.optimize_log for examples of func_args and fixed_params usage.)

moments.Inference.optimize_log_lbfgsb(p0, data, model_func, lower_bound=None, upper_bound=None, verbose=0, flush_delay=0.5, epsilon=0.001, pgtol=1e-05, multinom=True, maxiter=100000.0, full_output=False, func_args=[], func_kwargs={}, fixed_params=None, ll_scale=1, output_file=None)[source]

Optimize log(params) to fit model to data using the L-BFGS-B method.

This optimization method works well when we start reasonably close to the optimum. It is best at burrowing down a single minimum. This method is better than optimize_log if the optimum lies at one or more of the parameter bounds. However, if your optimum is not on the bounds, this method may be much slower.

Because this works in log(params), it cannot explore values of params < 0. It should also perform better when parameters range over scales.

The L-BFGS-B method was developed by Ciyou Zhu, Richard Byrd, and Jorge Nocedal. The algorithm is described in:

  • R. H. Byrd, P. Lu and J. Nocedal. A Limited Memory Algorithm for Bound Constrained Optimization, (1995), SIAM Journal on Scientific and Statistical Computing , 16, 5, pp. 1190-1208.

  • C. Zhu, R. H. Byrd and J. Nocedal. L-BFGS-B: Algorithm 778: L-BFGS-B, FORTRAN routines for large scale bound constrained optimization (1997), ACM Transactions on Mathematical Software, Vol 23, Num. 4, pp. 550-560.

Parameters:
  • p0 – Initial parameters.

  • data – Spectrum with data.

  • model_function – Function to evaluate model spectrum. Should take arguments (params, (n1,n2…))

  • lower_bound – Lower bound on parameter values. If not None, must be of same length as p0. A parameter can be declared unbound by assigning a bound of None.

  • upper_bound – Upper bound on parameter values. If not None, must be of same length as p0. A parameter can be declared unbound by assigning a bound of None.

  • verbose – If > 0, print optimization status every <verbose> steps.

  • output_file – Stream verbose output into this filename. If None, stream to standard out.

  • flush_delay – Standard output will be flushed once every <flush_delay> minutes. This is useful to avoid overloading I/O on clusters.

  • epsilon – Step-size to use for finite-difference derivatives.

  • pgtol – Convergence criterion for optimization. For more info, see help(scipy.optimize.fmin_l_bfgs_b)

  • multinom – If True, do a multinomial fit where model is optimially scaled to data at each step. If False, assume theta is a parameter and do no scaling.

  • maxiter – Maximum algorithm iterations to run.

  • full_output – If True, return full outputs as in described in help(scipy.optimize.fmin_bfgs)

  • func_args – Additional arguments to model_func. It is assumed that model_func’s first argument is an array of parameters to optimize, that its second argument is an array of sample sizes for the sfs, and that its last argument is the list of grid points to use in evaluation.

  • func_kwargs – Additional keyword arguments to model_func.

  • fixed_params – If not None, should be a list used to fix model parameters at particular values. For example, if the model parameters are (nu1,nu2,T,m), then fixed_params = [0.5,None,None,2] will hold nu1=0.5 and m=2. The optimizer will only change T and m. Note that the bounds lists must include all parameters. Optimization will fail if the fixed values lie outside their bounds. A full-length p0 should be passed in; values corresponding to fixed parameters are ignored.

  • ll_scale – The bfgs algorithm may fail if your initial log-likelihood is too large. (This appears to be a flaw in the scipy implementation.) To overcome this, pass ll_scale > 1, which will simply reduce the magnitude of the log-likelihood. Once in a region of reasonable likelihood, you’ll probably want to re-optimize with ll_scale=1.

Uncertainty functions

Parameter uncertainties and likelihood ratio tests using Godambe information.

moments.Godambe.FIM_uncert(func_ex, p0, data, log=False, multinom=True, eps=0.01)[source]

Parameter uncertainties from Fisher Information Matrix. Returns standard deviations of parameter values.

Parameters:
  • func_ex (demographic model) – Model function

  • p0 (list-like) – Best-fit parameters for func_ex

  • data (spectrum object) – Original data frequency spectrum

  • log (bool) – If True, assume log-normal distribution of parameters. Returned values are then the standard deviations of the logs of the parameter values, which can be interpreted as relative parameter uncertainties.

  • multinom (bool) – If True, assume model is defined without an explicit parameter for theta. Because uncertainty in theta must be accounted for to get correct uncertainties for other parameters, this function will automatically consider theta if multinom=True. In that case, the final entry of the returned uncertainties will correspond to theta.

  • eps (float) – Fractional stepsize to use when taking finite-difference derivatives

moments.Godambe.GIM_uncert(func_ex, all_boot, p0, data, log=False, multinom=True, eps=0.01, return_GIM=False)[source]

Parameter uncertainties from Godambe Information Matrix (GIM). Returns standard deviations of parameter values. Bootstrap data is typically generated by splitting the genome into N chunks and sampling with replacement from those chunks N times.

Parameters:
  • func_ex (demographic model) – Model function

  • all_boot (list of spectra) – List of bootstrap frequency spectra

  • p0 (list-like) – Best-fit parameters for func_ex

  • data (spectrum object) – Original data frequency spectrum

  • log (bool) – If True, assume log-normal distribution of parameters. Returned values are then the standard deviations of the logs of the parameter values, which can be interpreted as relative parameter uncertainties.

  • multinom (bool) – If True, assume model is defined without an explicit parameter for theta. Because uncertainty in theta must be accounted for to get correct uncertainties for other parameters, this function will automatically consider theta if multinom=True. In that case, the final entry of the returned uncertainties will correspond to theta.

  • eps (float) – Fractional stepsize to use when taking finite-difference derivatives

  • return_GIM – If True, also return the full GIM.

moments.Godambe.LRT_adjust(func_ex, all_boot, p0, data, nested_indices, multinom=True, eps=0.01)[source]

First-order moment matching adjustment factor for likelihood ratio test.

Parameters:
  • func_ex (demographic model) – Model function for complex model

  • all_boot (list of spectra) – List of bootstrap frequency spectra

  • p0 (list-like) – Best-fit parameters for the simple model, with nested parameter explicity defined. Although equal to values for simple model, should be in a list form that can be taken in by the complex model you’d like to evaluate.

  • data (spectrum object) – Original data frequency spectrum

  • nested_indices (list of ints) – List of positions of nested parameters in complex model parameter list

  • multinom (bool) – If True, assume model is defined without an explicit parameter for theta. Because uncertainty in theta must be accounted for to get correct uncertainties for other parameters, this function will automatically consider theta if multinom=True.

  • eps (float) – Fractional stepsize to use when taking finite-difference derivatives

Plotting features

Single-population plotting

moments.Plotting.plot_1d_comp_Poisson(model, data, fig_num=None, residual='Anscombe', plot_masked=False, out=None, show=True, labels=['Model', 'Data'])[source]

Poisson comparison between 1d model and data.

Parameters:
  • model – 1-dimensional model SFS

  • data – 1-dimensional data SFS

  • fig_num – Clear and use figure fig_num for display. If None, an new figure window is created.

  • residual – ‘Anscombe’ for Anscombe residuals, which are more normally distributed for Poisson sampling. ‘linear’ for the linear residuals, which can be less biased.

  • plot_masked – Additionally plots (in open circles) results for points in the model or data that were masked.

  • out – Output filename to save figure, if given.

  • show – If True, execute pylab.show command to make sure plot displays.

  • labels – A list of strings of length two, labels for the first and second input frequency spectra. Defaults to “Model” and “Data”.

moments.Plotting.plot_1d_comp_multinom(model, data, fig_num=None, residual='Anscombe', plot_masked=False, out=None, show=True, labels=['Model', 'Data'])[source]

Multinomial comparison between 1d model and data.

Parameters:
  • model – 1-dimensional model SFS

  • data – 1-dimensional data SFS

  • fig_num – Clear and use figure fig_num for display. If None, an new figure window is created.

  • residual – ‘Anscombe’ for Anscombe residuals, which are more normally distributed for Poisson sampling. ‘linear’ for the linear residuals, which can be less biased.

  • plot_masked – Additionally plots (in open circles) results for points in the model or data that were masked.

  • out – Output filename to save figure, if given.

  • show – If True, displays figure. Set to False to supress.

moments.Plotting.plot_1d_fs(fs, fig_num=None, show=True, ax=None, out=None, ms=3, lw=1)[source]

Plot a 1-dimensional frequency spectrum.

Note that all the plotting is done with pylab. To see additional pylab methods: “import pylab; help(pylab)”. Pylab’s many functions are documented at http://matplotlib.sourceforge.net/contents.html

Parameters:
  • fs – A single-population Spectrum

  • fig_num – If used, clear and use figure fig_num for display. If None, a new figure window is created.

  • show – If True, execute pylab.show command to make sure plot displays.

  • ax – If None, uses new or specified figure. Otherwise plots in axes object that is given after clearing.

  • out – If file name is given, saves before showing.

Multi-population plotting

moments.Plotting.plot_2d_comp_Poisson(model, data, vmin=None, vmax=None, resid_range=None, fig_num=None, pop_ids=None, residual='Anscombe', adjust=True, out=None, show=True)[source]

Poisson comparison between 2d model and data.

Parameters:
  • model – 2-dimensional model SFS

  • data – 2-dimensional data SFS

  • vmin – Minimum value plotted.

  • vmax – Maximum value plotted.

  • resid_range – Residual plot saturates at +- resid_range.

  • fig_num – Clear and use figure fig_num for display. If None, an new figure window is created.

  • pop_ids – If not None, override pop_ids stored in Spectrum.

  • residual – ‘Anscombe’ for Anscombe residuals, which are more normally distributed for Poisson sampling. ‘linear’ for the linear residuals, which can be less biased.

  • adjust – Should method use automatic ‘subplots_adjust’? For advanced manipulation of plots, it may be useful to make this False.

  • out – Output filename to save figure, if given.

  • show – If True, execute pylab.show command to make sure plot displays.

moments.Plotting.plot_2d_comp_multinom(model, data, vmin=None, vmax=None, resid_range=None, fig_num=None, pop_ids=None, residual='Anscombe', adjust=True, out=None, show=True)[source]

Multinomial comparison between 2d model and data.

Parameters:
  • model – 2-dimensional model SFS

  • data – 2-dimensional data SFS

  • vmin – Minimum value plotted.

  • vmax – Maximum value plotted.

  • resid_range – Residual plot saturates at +- resid_range.

  • fig_num – Clear and use figure fig_num for display. If None, an new figure window is created.

  • pop_ids – If not None, override pop_ids stored in Spectrum.

  • residual – ‘Anscombe’ for Anscombe residuals, which are more normally distributed for Poisson sampling. ‘linear’ for the linear residuals, which can be less biased.

  • adjust – Should method use automatic ‘subplots_adjust’? For advanced manipulation of plots, it may be useful to make this False.

  • out – Output filename to save figure, if given.

  • show – If True, execute pylab.show command to make sure plot displays.

moments.Plotting.plot_2d_resid(resid, resid_range=None, ax=None, pop_ids=None, extend='neither', colorbar=True, out=None, show=True)[source]

Linear heatmap of 2d residual array.

Parameters:
  • sfs – Residual array to plot.

  • resid_range – Values > resid range or < resid_range saturate the color spectrum.

  • ax – Axes object to plot into. If None, the result of pylab.gca() is used.

  • pop_ids – If not None, override pop_ids stored in Spectrum.

  • extend – Whether the colorbar should have ‘extension’ arrows. See help(pylab.colorbar) for more details.

  • colorbar – Should we plot a colorbar?

  • out – Output filename to save figure, if given.

  • show – If True, execute pylab.show command to make sure plot displays.

moments.Plotting.plot_3d_comp_Poisson(model, data, vmin=None, vmax=None, resid_range=None, fig_num=None, pop_ids=None, residual='Anscombe', adjust=True, out=None, show=True)[source]

Poisson comparison between 3d model and data.

Parameters:
  • model – 3-dimensional model SFS

  • data – 3-dimensional data SFS

  • vmin – Minimum value plotted.

  • vmax – Maximum value plotted.

  • resid_range – Residual plot saturates at +- resid_range.

  • fig_num – Clear and use figure fig_num for display. If None, an new figure window is created.

  • pop_ids – If not None, override pop_ids stored in Spectrum.

  • residual – ‘Anscombe’ for Anscombe residuals, which are more normally distributed for Poisson sampling. ‘linear’ for the linear residuals, which can be less biased.

  • adjust – Should method use automatic ‘subplots_adjust’? For advanced manipulation of plots, it may be useful to make this False.

  • out – Output filename to save figure, if given.

  • show – If True, execute pylab.show command to make sure plot displays.

moments.Plotting.plot_3d_comp_multinom(model, data, vmin=None, vmax=None, resid_range=None, fig_num=None, pop_ids=None, residual='Anscombe', adjust=True, out=None, show=True)[source]

Multinomial comparison between 3d model and data.

Parameters:
  • model – 3-dimensional model SFS

  • data – 3-dimensional data SFS

  • vmin – Minimum value plotted.

  • vmax – Maximum value plotted.

  • resid_range – Residual plot saturates at +- resid_range.

  • fig_num – Clear and use figure fig_num for display. If None, an new figure window is created.

  • pop_ids – If not None, override pop_ids stored in Spectrum.

  • residual – ‘Anscombe’ for Anscombe residuals, which are more normally distributed for Poisson sampling. ‘linear’ for the linear residuals, which can be less biased.

  • adjust – Should method use automatic ‘subplots_adjust’? For advanced manipulation of plots, it may be useful to make this False.

  • out – Output filename to save figure, if given.

  • show – If True, execute pylab.show command to make sure plot displays.

moments.Plotting.plot_3d_spectrum(fs, fignum=None, vmin=None, vmax=None, pop_ids=None, out=None, show=True)[source]

Logarithmic heatmap of single 3d FS.

Note that this method is slow, because it relies on matplotlib’s software rendering. For faster and better looking plots, use plot_3d_spectrum_mayavi.

Parameters:
  • fs – FS to plot

  • vmin – Values in fs below vmin are masked in plot.

  • vmax – Values in fs above vmax saturate the color spectrum.

  • fignum – Figure number to plot into. If None, a new figure will be created.

  • pop_ids – If not None, override pop_ids stored in Spectrum.

  • out – Output filename to save figure, if given.

  • show – If True, execute pylab.show command to make sure plot displays.

moments.Plotting.plot_4d_comp_Poisson(model, data, vmin=None, vmax=None, resid_range=None, fig_num=None, pop_ids=None, residual='Anscombe', adjust=True, out=None, show=True)[source]

Poisson comparison between 4d model and data.

Parameters:
  • model – 4-dimensional model SFS

  • data – 4-dimensional data SFS

  • vmin – Minimum value plotted.

  • vmax – Maximum value plotted.

  • resid_range – Residual plot saturates at +- resid_range.

  • fig_num – Clear and use figure fig_num for display. If None, an new figure window is created.

  • pop_ids – If not None, override pop_ids stored in Spectrum.

  • residual – ‘Anscombe’ for Anscombe residuals, which are more normally distributed for Poisson sampling. ‘linear’ for the linear residuals, which can be less biased.

  • adjust – Should method use automatic ‘subplots_adjust’? For advanced manipulation of plots, it may be useful to make this False.

  • out – Output filename to save figure, if given.

  • show – If True, execute pylab.show command to make sure plot displays.

moments.Plotting.plot_4d_comp_multinom(model, data, vmin=None, vmax=None, resid_range=None, fig_num=None, pop_ids=None, residual='Anscombe', adjust=True, out=None, show=True)[source]

Multinomial comparison between 4d model and data.

Parameters:
  • model – 4-dimensional model SFS

  • data – 4-dimensional data SFS

  • vmin – Minimum value plotted.

  • vmax – Maximum value plotted.

  • resid_range – Residual plot saturates at +- resid_range.

  • fig_num – Clear and use figure fig_num for display. If None, an new figure window is created.

  • pop_ids – If not None, override pop_ids stored in Spectrum.

  • residual – ‘Anscombe’ for Anscombe residuals, which are more normally distributed for Poisson sampling. ‘linear’ for the linear residuals, which can be less biased.

  • adjust – Should method use automatic ‘subplots_adjust’? For advanced manipulation of plots, it may be useful to make this False.

  • out – Output filename to save figure, if given.

  • show – If True, execute pylab.show command to make sure plot displays.

moments.Plotting.plot_single_2d_sfs(sfs, vmin=None, vmax=None, ax=None, pop_ids=None, extend='neither', colorbar=True, cmap=<matplotlib.colors.LinearSegmentedColormap object>, out=None, show=True)[source]

Heatmap of single 2d SFS.

If vmax is greater than a factor of 10, plot on log scale.

Returns colorbar that is created.

Parameters:
  • sfs – SFS to plot

  • vmin – Values in sfs below vmin are masked in plot.

  • vmax – Values in sfs above vmax saturate the color spectrum.

  • ax – Axes object to plot into. If None, the result of pylab.gca() is used.

  • pop_ids – If not None, override pop_ids stored in Spectrum.

  • extend – Whether the colorbar should have ‘extension’ arrows. See help(pylab.colorbar) for more details.

  • colorbar – Should we plot a colorbar?

  • cmap – Pylab colormap to use for plotting.

  • out – Output filename to save figure, if given.

  • show – If True, execute pylab.show command to make sure plot displays.