prior_filter.Rd
prior_filter
filters loci prior to analysis using maelstRom:
a filter throwing out samples showing high allele counts for alleles not present in the ref_alleles column (if checkref_filter == TRUE)
a filter throwing out loci showing a low minor allele fraction (if prior_allelefreq_filter == TRUE, governed by min_PriorAlleleFreq)
a filter for median coverage (governed by min_median_cov, only looking at ref_count and var_count)
a filter for the number of samples (governed by min_nr_samples)
Also, samples having zero-counts in both "ref_count" and "var_count" are discarded, as they're useless in further analyses.
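The coverage-related steps above can be sketched in base R as follows. This is a minimal illustration, not the maelstRom implementation: the helper name sketch_coverage_filter is hypothetical, and both the order of the checks and the use of strict "<" comparisons are assumptions.

```r
# Hypothetical helper, for illustration only (not part of maelstRom).
sketch_coverage_filter <- function(data_pos, min_median_cov = 5, min_nr_samples = 30) {
  # Per-sample coverage, only looking at ref_count and var_count.
  coverage <- data_pos$ref_count + data_pos$var_count

  # Discard samples with zero counts in both columns.
  keep <- coverage > 0
  data_pos <- data_pos[keep, , drop = FALSE]
  coverage <- coverage[keep]

  # Reject the locus (return NULL) when no samples remain, too few samples
  # remain, or the median coverage is too low. Strict "<" is an assumption.
  if (nrow(data_pos) == 0 ||
      nrow(data_pos) < min_nr_samples ||
      stats::median(coverage) < min_median_cov) {
    return(NULL)
  }
  data_pos
}

# Toy data: the second sample has no reads at all and is dropped.
toy <- data.frame(ref_count = c(10, 0, 7, 3), var_count = c(2, 0, 5, 9))
kept <- sketch_coverage_filter(toy, min_median_cov = 5, min_nr_samples = 3)
rejected <- sketch_coverage_filter(toy, min_median_cov = 5, min_nr_samples = 30)
```

With the toy thresholds, the locus is retained in the first call; with the default min_nr_samples of 30 it is rejected because only three samples carry reads.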
prior_filter(
data_pos,
min_median_cov = 5,
min_nr_samples = 30,
checkref_filter = FALSE,
prior_allelefreq_filter = FALSE,
min_PriorAlleleFreq = 0.1
)
Data frame. Data frame of a SNP position with columns: "ref_alleles", "A", "T", "C", "G", "ref", "var", "ref_count" and "var_count".
Number. Minimal median coverage necessary to retain locus (default is 5).
Number. Minimal number of samples necessary to retain locus (default is 30).
Logical. Should samples with a high allele count for alleles not present in the ref_alleles column of data_pos be filtered out? (default is FALSE).
"High" means either the highest allele count in that sample, or the second highest if there is heuristic evidence that the sample is not homozygous for its most common allele.
The check extracts, for each sample, its first and second most common alleles, then uses the entire population data in data_pos
to obtain genotype probabilities for homozygosity of the most common allele and heterozygosity of these two alleles, based on these alleles' observed population frequencies and assuming
Hardy-Weinberg equilibrium (inbreeding parameter = 0). These are then multiplied by that sample's corresponding evidence for homozygosity of the most common allele
(using a multinomial distribution assuming equal observed frequencies for all other alleles, which should be low and are dictated by e.g. sequencing errors)
and evidence for heterozygosity of its two most common alleles (using a multinomial distribution assuming equal expression of these two alleles, which should be high,
and equal expression of the two remaining nucleotides, which should be low and dictated by e.g. the sequencing error rate).
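The heuristic described above might look roughly like the following base R sketch. It is an illustration under stated assumptions, not the package's code: the helper name sketch_genotype_evidence and the seq_err parameter (a fixed sequencing error rate of 0.01) are hypothetical, and how maelstRom sets the "low" multinomial probabilities or compares the two resulting scores is not specified here.

```r
# Hypothetical helper, for illustration only (not part of maelstRom).
# counts:   named read counts of one sample, e.g. c(A = 20, T = 0, C = 18, G = 1)
# pop_freq: named population allele frequencies over the same nucleotides
# seq_err:  assumed total probability mass given to "unexpected" nucleotides
sketch_genotype_evidence <- function(counts, pop_freq, seq_err = 0.01) {
  # First and second most common allele in this sample.
  top <- names(sort(counts, decreasing = TRUE))[1:2]

  # Genotype probabilities under Hardy-Weinberg equilibrium (no inbreeding).
  prior_hom <- pop_freq[[top[1]]]^2
  prior_het <- 2 * pop_freq[[top[1]]] * pop_freq[[top[2]]]

  # Homozygous model: the three other alleles share seq_err equally.
  p_hom <- setNames(rep(seq_err / 3, length(counts)), names(counts))
  p_hom[top[1]] <- 1 - seq_err

  # Heterozygous model: the two top alleles are equally expressed,
  # the two remaining nucleotides share seq_err equally.
  p_het <- setNames(rep(seq_err / 2, length(counts)), names(counts))
  p_het[top] <- (1 - seq_err) / 2

  # Prior genotype probability times the sample's multinomial evidence.
  c(hom = prior_hom * dmultinom(counts, prob = p_hom),
    het = prior_het * dmultinom(counts, prob = p_het))
}

pop_freq <- c(A = 0.6, T = 0, C = 0.4, G = 0)
ev_balanced <- sketch_genotype_evidence(c(A = 20, T = 0, C = 18, G = 1), pop_freq)
ev_skewed   <- sketch_genotype_evidence(c(A = 40, T = 0, C = 1, G = 0), pop_freq)
```

In this sketch the balanced sample scores higher under the heterozygous model, so its second most common allele would also count as "high", while the skewed sample scores higher under the homozygous model.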
Logical. Should loci be filtered on a minimal prior minor allele frequency? This frequency is simply determined as the relative occurrence of the less common (variant) allele across nucleotides (default is FALSE).
Number. Minimal prior minor allele frequency a SNP needs in order to be retained if prior_allelefreq_filter is TRUE (default is 0.1).
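As a rough illustration of the prior minor allele frequency filter, the sketch below pools ref_count and var_count over all samples and takes the share of the less common allele. The helper name sketch_prior_maf is hypothetical, and pooling these two columns (rather than, say, the per-nucleotide "A", "T", "C", "G" columns) is an assumption about how the frequency is computed.

```r
# Hypothetical helper, for illustration only (not part of maelstRom).
sketch_prior_maf <- function(data_pos) {
  ref_total <- sum(data_pos$ref_count)
  var_total <- sum(data_pos$var_count)
  # Fraction of all reads that belong to the less common allele.
  min(ref_total, var_total) / (ref_total + var_total)
}

toy <- data.frame(ref_count = c(10, 8, 12), var_count = c(0, 2, 1))
maf <- sketch_prior_maf(toy)

# The locus is retained only if the frequency reaches the threshold.
min_PriorAlleleFreq <- 0.1
keep_locus <- maf >= min_PriorAlleleFreq
```

Here 3 of 33 pooled reads carry the variant allele, so the toy locus falls below the default threshold of 0.1 and would be filtered out.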
The data as a data frame with total and median coverage, or NULL if the SNP position was filtered out.