EMfit_betabinom_robust.Rd
EMfit_betabinom_robust
estimates, per locus, parameters of an assumed beta-binomial mixture model via robust expectation maximization (genotype
frequencies, allelic bias i.e. heterozygous pi parameter, homozygous and heterozygous allelic divergence i.e. their respective theta- (overdispersion) parameters.
Through this fit, per-samples genotype probabilities are obtained as well. Optionally, a fit assuming no allelic bias (heterozygous pi-parameter = 0.5) is
performed as well, for the purpose of significant allelic bias detection via likelihood ratio test.
EMfit_betabinom_robust(
data_counts,
allelefreq = 0.5,
SE,
inbr = 0,
dltaco = 10^-6,
HWE = FALSE,
p_InitEst = FALSE,
ThetaInits = "moment",
ReEstThetas = "moment",
NoSplitHom = TRUE,
NoSplitHet = TRUE,
ResetThetaMin = 10^-10,
ResetThetaMax = 10^-1,
DistRob = "Cook",
CookMargin = 5,
LikEmpNum = 1000,
LikMargin = 0,
NumHetMin = 5,
MaxOutFrac = 0.5,
thetaTRY = c(10^-1, 10^-3, 10^-7),
fitH0 = TRUE,
FirstFewFixed = NULL,
epsabs = 0.001
)
Data frame. Data frame of a SNP with reference and variant counts ("ref_count" and "var_count", respectively) for each sample ("sample").
Number. Allele frequency. Only used when pInitEst is TRUE (default = 0.5)
Number. Sequencing error rate.
Number. Degree of inbreeding (default = 0).
Number. Minimal difference between 2 iterations (default = 0.001).
Logical. Should HWE be used for allele frequency estimation, not recommended (default = FALSE).
Logical. Calculate initial estimates of pr, pv and prv from allelefreq, not recommended (default = FALSE).
Numeric vector, "moment" or "TryThree". A vector of length two containing the initial dispersion estimate for the homozygous peaks, followed by the one for the heterozygous peak,
or "moment", in which case moment initial estimates are generated. In case of "TryThree", three initial values given by the thetaTRY
input argument are tried out and the one yielding the best fit is retained.
String. Accepts the methods "moment" or "simple" as arguments, in which case dispersions are re-estimated during every EM-step using a moment estimate or a simple custom procedure respectively, to be used as a starting
Logical. If TRUE, don't allow the beta-binomial fits for homozygotes to be bimodal
Logical. If TRUE, don't allow the beta-binomial fit for heterozygotes to be bimodal
Number. Initial theta values in numeric optimization get capped at this minimum (e.g. in case the moment estimate is even lower)
Number. Initial theta values in numeric optimization get capped at this maximum (e.g. in case the moment estimate is even higher)
One of "Cook", "CookEmp" or "LikelyEmp"; removes outlying heterozygous samples anyway, either by checking whether the difference of their leave-one-out maximum
likelihood estimates of pi and/or theta with the respective full-data estimates deviate more than CookMargin
observed standard deviations from their
observed mean ("Cook"). Or using a similar procedure but by calculating this standard deviation and mean via simulation assuming the fitted full-data model
is 100
100
observed likelihood distance of a particular sample should not be more extreme than LikMargin
of these simulated distanced for it to be not considered an outlier.
While outliers are disregarded in the final model fit and all parameter estimates, they remain in the final output to be genotyped, but are marked as such.
Number. See DistRob
.
Integer. See DistRob
.
Integer. See DistRob
.
Integer. Minimal amount of expected heterozygotes required to even try and perform outlier-removal on these (default 5).
Number. Maximum allowed fraction of detected outlying samples; if more, the outlier-detection procedure is disregarded (default 0.5)
Numeric vector. See ReEstThetas
, in case this argument equals "TryThree".
Logical. If TRUE, performs a fit assuming no allelic bias (heterozygous pi-parameter = 0.5) and a subsequent likihood ratio test for significant allelic bias.
A list containing the following components:
The estimated Allelic Bias (estimated reference read fraction for heterozygous samples).
The test statistic of the likelihood ratio test against perfectly balanced Allelic Bias (0.5). Not included if fitH0==FALSE
.
The p-value of the likelihood ratio test against perfectly balanced Allelic Bias (0.5). Not included if fitH0==FALSE
.
The number of iterations.
Indicates the quality of a locus. An "!" indicates bad data or a bad fit due to no apparent heterozygosity.
The variant homozygote genotype frequency for the locus.
The reference homozygote genotype frequency for the locus.
The heterozyous genotype frequency for the locus.
The variant homozygote genotype frequency with Allelic Bias = 0.5. Not included if fitH0==FALSE
.
The reference homozygote genotype frequency with Allelic Bias = 0.5. Not included if fitH0==FALSE
.
The heterozyous genotype frequency with Allelic Bias = 0.5. Not included if fitH0==FALSE
.
Data frame. Input data frame with extra columns: allelefreq, (most likely) genotype, genotype probabilities (prr, prv, pvv), and whether the data point is an outlier ("Outlier", equals 1 if so) per sample.
Final overdispersion estimate for the homozygous peaks.
Final overdispersion estimate for the heterozygous peak.
Final overdispersion estimate for the homozygous peaks with Allelic Bias = 0.5. Not included if fitH0==FALSE
.
Final overdispersion estimate for the heterozygous peak with Allelic Bias = 0.5. Not included if fitH0==FALSE
.
Provides some information on the outlier removal procedurs. Either there were not enough heterozygotes to start the procedure ("NotEnoughHet", see NumHetMin
),
or there were no outliers detected ("NoOutliers"), or there were too many detected and the procedure was disregarded ("TooManyOutliers"; see MaxOutFrac
),
or there was at least one outlier detected while passing all previous checks, after which the fit was re-done on the non-outlying data ("ReFittingDone").
Various Goodness-Of-Fit heuristics.