Parsing the SFS from a VCF file
As described in the SFS section, the site frequencey spectrum (SFS) is
a multidimensional histogram that records the counts, frequencies or probabilities
of the observed counts of SNPs in one or more populations. Here we show how to use
moments.Spectrum.from_vcf
to compute the SFS from a VCF file.
Basic usage
The only argument required to compute the SFS is the path to the desired VCF.
The function moments.Spectrum.from_vcf
returns an SFS array in the form of a
moments.Spectrum
instance. A minimal example of usage is:
import moments
vcf_file = "path/to/vcf_file.vcf.gz"
fs = moments.Spectrum.from_vcf(vcf_file)
Specifying populations
By default, all samples present in the VCF are collected into a single population "ALL"
.
There are two ways to group samples into multiple populations. We can specify a pop_mapping
,
a dictionary that directly maps population names to lists of VCF sample IDs:
pop_mapping = {"pop_A": ["id_0", "id_1", "id_2"], "pop_B": ["id_4", "id_5"]}
fs = moments.Spectrum.from_vcf(vcf_file, pop_mapping=pop_mapping)
Or we can specify the path to a pop_file
, a whitespace-separated file following the
pattern sample_id pop_id
on each line. Optionally, we can parse only a subset of
the populations listed in this file by passing a list of population names to the argument pops
.
If nothing is passed to this argument, then all populations in the file will be included.
id_0 pop_A
id_1 pop_A
id_2 pop_A
id_4 pop_B
id_5 pop_B
id_6 pop_C
pop_file = 'path/to/pop_file.txt'
pops = ['pop_A', 'pop_B']
fs = moments.Spectrum.from_vcf(vcf_file, pop_file=pop_file, pops=pops)
Any VCF samples that are not mapped to a population are ignored.
Using ancestral allele information
By default, reference alleles are interpreted as ancestral alleles. In this case,
we will most likely wish to pass True
to the argument folded
to fold the
returned SFS, as there is no general correspondence between ancestral and reference
alleles. We can provide inferred ancestral alleles for polarizing alleles in two different
ways. If the input VCF file has AA
(ancestral allele) information in its INFO
field,
we can pass True
to use_AA
to obtain ancestral allele assignments from this
subfield. Sites where AA
is absent or has missing data are skipped and an alert
message is printed at the first such site.
Otherwise, we can pass the path to a FASTA-format file containing estimated ancestral alleles
to anc_seq_file
. Sites with invalid or missing data are skipped, raising a single alert
message as described above. Often, FASTA files represent low-confidence ancestral allele
assignments with lower-case nucleotide codes. By default, sites assigned these are skipped
as though the data were missing, but the assignments may be taken as valid by passing
True
to allow_low_confidence
.
Biallelic and multiallelic sites
By default, multiallelic sites are skipped. We can pass True``to ``allow_multiallelic
to include derived alleles at multiallelic sites as distinct entries in the SFS.
When we provide an ancestral sequence and allow_multiallelic
is False
,
biallelic sites where the reference and alternate alleles both differ from
the ancestral allele are skipped, because these sites represent recurrent mutations and
the relationship between the derived alleles is unclear. Conversely, when allow_multiallelic
is True
, this exception is ignored and all the derived alleles at such sites are counted.
Using filters
It is often desirable to set quality thresholds or categorical requirements for
the inclusion of sites in the returned SFS. We can do this by passing a flat
dictionary representing the desired quantitative/categorical filters to filters
.
Valid key-value combinations are listed here. "QUAL"
specifies a lower bound
on the VCF QUAL
column and should map to a float or integer. "FILTER"
should map to a string or list/set/tuple of strings. To pass, sites must have a FILTER
entry equal to the value of "FILTER"
if it is a string, or to one of its
elements if it is a string, tuple or list.
"INFO/SUBFIELD"
imposes a filter on SUBFIELD
in the INFO
column. Its
value may be a float or integer, in which case it imposes a minimum threshold on that
entry. It may also map to a string or list/set/tuple of strings, with equivalent
behavior to "FILTER"
. "SAMPLE/SUBFIELD"
works in the same way, but imposes
filters on individual samples rather than lines. The fields "SAMPLE"
and "FORMAT"
are equivalent and refer to subfields enumerated in the FORMAT
VCF column.
Their types and behavior are the same as for INFO
subfields, but filtering occurs
at the sample level. An arbitrary example is:
filter_dict = {
"QUAL": 30,
"FILTER": "PASS",
"INFO/DP": 30,
"INFO/DB": "DB"
"FORMAT/GQ": 30
"FORMAT/DP": 30
}
fs = moments.Spectrum.from_vcf(vcf_file, filters=filter_dict)
The types of filters are not explicitly checked for consistency with their definitions in the VCF file, so care should be taken when specifying them. Inappropriately typed filters will generally raise errors. Lines or samples with absent fields/missing data are not skipped, but one-time alert messages are printed for each unique exception.
Projecting to a smaller sample size
We may wish to reduce the size the output SFS to reduce the space it occupies in
memory, to make computing an expected SFS for the same shape faster, to allow sites
where some samples are filtered out or missing to be retained in output, or for
other reasons. We can accomplish this by passing a dictionary of desired haploid sample sizes to
sample_sizes
. Any VCF sites with exactly this number of observed alleles will be
retained without alteration in the output SFS, and the SFS from all sites with
sample-size configurations larger than sample_sizes
will be projected down to
match. The output SFS is a sum over these cases. Projection is a procedure for reducing
the size of the SFS by summing over the possible subsamplings of an entry. An example
usage with the population file shown above is:
sample_sizes = {"pop_A": 4, "pop_B": 2, "pop_C": 2}
fs = moments.Spectrum.from_vcf(
vcf_file,
pop_file=pop_file,
sample_sizes=sample_sizes
)
Note that sample sizes can be equal to, but not greater than, the total haploid sample size of a population.
Specifying regions
We can subset parsing to a genomic window by using the interval
argument, which
should be a 2-list of integers. This interval should be one-indexed and half-open.
Additionally, we provide a mask file in BED format with the argument bed_file
,
which will filter out sites that fall outside its region intervals.
bed_file
can be given alongside interval
, so that only sites which fall
within a BED interval and within interval
are parsed. Note that BED file
intervals are half-open and zero-indexed. Also note that moments.Spectrum.from_vcf
does not support VCF files that contain sites from multiple chromosomes. An
example where both arguments are passed is:
bed_file = "path/to/bed_file.bed.gz"
interval = [1, 10000001]
fs = moments.Spectrum.from_vcf(vcf_file, bed_file=bed_file, interval=interval)
Computing L
We can compute the effective sequence length corresponding to our SFS, \(L\),
with the moments.Parsing.compute_L
function. Its only required argument is
bed_file
, the path to the BED file that was used to parse the SFS. We can also
give an interval
, restricting sites to a one-indexed, half-open interval. Also,
if we used an ancestral sequence from an external FASTA file, it can be passed to
anc_seq_file
, with the interpretation of low-confidence allele assignements modulated
by allow_low_confidence
. Providing a FASTA file will restrict sites counted in
\(L\) to those with inferred ancestral states. A maximal example is:
bed_file = "path/to/bed_file.bed.gz"
interval = [1, 10000001]
anc_seq = "path/to/anc_seq_file.fa.gz"
L = moments.Parsing.compute_L(
bed_file,
interval=interval,
anc_seq_file=anc_seq,
allow_low_confidence=False
)
Bootstrapping data
TODO: Currently, Misc.bootstrap() works with the data dict to create bootstrap replicates. We should replace this function to work with independently parsed “tally” dictionaries from different regions, and show some example code blocks here.