.. _sec_sfs_parsing: .. jupyter-execute:: :hide-code: =============================== Parsing the SFS from a VCF file =============================== As described in the `SFS section `_, the site frequencey spectrum (SFS) is a multidimensional histogram that records the counts, frequencies or probabilities of the observed counts of SNPs in one or more populations. Here we show how to use ``moments.Spectrum.from_vcf`` to compute the SFS from a VCF file. Basic usage ----------- The only argument required to compute the SFS is the path to the desired VCF. The function ``moments.Spectrum.from_vcf`` returns an SFS array in the form of a ``moments.Spectrum`` instance. A minimal example of usage is: .. code-block:: python import moments vcf_file = "path/to/vcf_file.vcf.gz" fs = moments.Spectrum.from_vcf(vcf_file) Specifying populations ---------------------- By default, all samples present in the VCF are collected into a single population ``"ALL"``. There are two ways to group samples into multiple populations. We can specify a ``pop_mapping``, a dictionary that directly maps population names to lists of VCF sample IDs: .. code-block:: python pop_mapping = {"pop_A": ["id_0", "id_1", "id_2"], "pop_B": ["id_4", "id_5"]} fs = moments.Spectrum.from_vcf(vcf_file, pop_mapping=pop_mapping) Or we can specify the path to a ``pop_file``, a whitespace-separated file following the pattern ``sample_id pop_id`` on each line. Optionally, we can parse only a subset of the populations listed in this file by passing a list of population names to the argument ``pops``. If nothing is passed to this argument, then all populations in the file will be included. .. code-block:: id_0 pop_A id_1 pop_A id_2 pop_A id_4 pop_B id_5 pop_B id_6 pop_C .. code-block:: python pop_file = 'path/to/pop_file.txt' pops = ['pop_A', 'pop_B'] fs = moments.Spectrum.from_vcf(vcf_file, pop_file=pop_file, pops=pops) Any VCF samples that are not mapped to a population are ignored. Using ancestral allele information ---------------------------------- By default, reference alleles are interpreted as ancestral alleles. In this case, we will most likely wish to pass ``True`` to the argument ``folded`` to fold the returned SFS, as there is no general correspondence between ancestral and reference alleles. We can provide inferred ancestral alleles for polarizing alleles in two different ways. If the input VCF file has ``AA`` (ancestral allele) information in its ``INFO`` field, we can pass ``True`` to ``use_AA`` to obtain ancestral allele assignments from this subfield. Sites where ``AA`` is absent or has missing data are skipped and an alert message is printed at the first such site. Otherwise, we can pass the path to a FASTA-format file containing estimated ancestral alleles to ``anc_seq_file``. Sites with invalid or missing data are skipped, raising a single alert message as described above. Often, FASTA files represent low-confidence ancestral allele assignments with lower-case nucleotide codes. By default, sites assigned these are skipped as though the data were missing, but the assignments may be taken as valid by passing ``True`` to ``allow_low_confidence``. Biallelic and multiallelic sites -------------------------------- By default, multiallelic sites are skipped. We can pass ``True``to ``allow_multiallelic`` to include derived alleles at multiallelic sites as distinct entries in the SFS. When we provide an ancestral sequence and ``allow_multiallelic`` is ``False``, biallelic sites where the reference and alternate alleles both differ from the ancestral allele are skipped, because these sites represent recurrent mutations and the relationship between the derived alleles is unclear. Conversely, when ``allow_multiallelic`` is ``True``, this exception is ignored and all the derived alleles at such sites are counted. Using filters ------------- It is often desirable to set quality thresholds or categorical requirements for the inclusion of sites in the returned SFS. We can do this by passing a flat dictionary representing the desired quantitative/categorical filters to ``filters``. Valid key-value combinations are listed here. ``"QUAL"`` specifies a lower bound on the VCF ``QUAL`` column and should map to a float or integer. ``"FILTER"`` should map to a string or list/set/tuple of strings. To pass, sites must have a ``FILTER`` entry equal to the value of ``"FILTER"`` if it is a string, or to one of its elements if it is a string, tuple or list. ``"INFO/SUBFIELD"`` imposes a filter on ``SUBFIELD`` in the ``INFO`` column. Its value may be a float or integer, in which case it imposes a minimum threshold on that entry. It may also map to a string or list/set/tuple of strings, with equivalent behavior to ``"FILTER"``. ``"SAMPLE/SUBFIELD"`` works in the same way, but imposes filters on individual samples rather than lines. The fields ``"SAMPLE"`` and ``"FORMAT"`` are equivalent and refer to subfields enumerated in the ``FORMAT`` VCF column. Their types and behavior are the same as for ``INFO`` subfields, but filtering occurs at the sample level. An arbitrary example is: .. code-block:: python filter_dict = { "QUAL": 30, "FILTER": "PASS", "INFO/DP": 30, "INFO/DB": "DB" "FORMAT/GQ": 30 "FORMAT/DP": 30 } fs = moments.Spectrum.from_vcf(vcf_file, filters=filter_dict) The types of filters are not explicitly checked for consistency with their definitions in the VCF file, so care should be taken when specifying them. Inappropriately typed filters will generally raise errors. Lines or samples with absent fields/missing data are not skipped, but one-time alert messages are printed for each unique exception. Projecting to a smaller sample size ----------------------------------- We may wish to reduce the size the output SFS to reduce the space it occupies in memory, to make computing an expected SFS for the same shape faster, to allow sites where some samples are filtered out or missing to be retained in output, or for other reasons. We can accomplish this by passing a dictionary of desired haploid sample sizes to ``sample_sizes``. Any VCF sites with exactly this number of observed alleles will be retained without alteration in the output SFS, and the SFS from all sites with sample-size configurations larger than ``sample_sizes`` will be projected down to match. The output SFS is a sum over these cases. Projection is a procedure for reducing the size of the SFS by summing over the possible subsamplings of an entry. An example usage with the population file shown above is: .. code-block:: python sample_sizes = {"pop_A": 4, "pop_B": 2, "pop_C": 2} fs = moments.Spectrum.from_vcf( vcf_file, pop_file=pop_file, sample_sizes=sample_sizes ) Note that sample sizes can be equal to, but not greater than, the total haploid sample size of a population. Specifying regions ------------------ We can subset parsing to a genomic window by using the ``interval`` argument, which should be a 2-list of integers. This interval should be one-indexed and half-open. Additionally, we provide a mask file in BED format with the argument ``bed_file``, which will filter out sites that fall outside its region intervals. ``bed_file`` can be given alongside ``interval``, so that only sites which fall within a BED interval and within ``interval`` are parsed. Note that BED file intervals are half-open and zero-indexed. Also note that ``moments.Spectrum.from_vcf`` does not support VCF files that contain sites from multiple chromosomes. An example where both arguments are passed is: .. code-block:: python bed_file = "path/to/bed_file.bed.gz" interval = [1, 10000001] fs = moments.Spectrum.from_vcf(vcf_file, bed_file=bed_file, interval=interval) *********** Computing L *********** We can compute the effective sequence length corresponding to our SFS, :math:`L`, with the ``moments.Parsing.compute_L`` function. Its only required argument is ``bed_file``, the path to the BED file that was used to parse the SFS. We can also give an ``interval``, restricting sites to a one-indexed, half-open interval. Also, if we used an ancestral sequence from an external FASTA file, it can be passed to ``anc_seq_file``, with the interpretation of low-confidence allele assignements modulated by ``allow_low_confidence``. Providing a FASTA file will restrict sites counted in :math:`L` to those with inferred ancestral states. A maximal example is: .. code-block:: python bed_file = "path/to/bed_file.bed.gz" interval = [1, 10000001] anc_seq = "path/to/anc_seq_file.fa.gz" L = moments.Parsing.compute_L( bed_file, interval=interval, anc_seq_file=anc_seq, allow_low_confidence=False ) ****************** Bootstrapping data ****************** TODO: Currently, Misc.bootstrap() works with the data dict to create bootstrap replicates. We should replace this function to work with independently parsed "tally" dictionaries from different regions, and show some example code blocks here.