EMfit_betabinom_robust estimates, per locus, parameters of an assumed beta-binomial mixture model via robust expectation maximization (genotype frequencies, allelic bias i.e. heterozygous pi parameter, homozygous and heterozygous allelic divergence i.e. their respective theta- (overdispersion) parameters. Through this fit, per-samples genotype probabilities are obtained as well. Optionally, a fit assuming no allelic bias (heterozygous pi-parameter = 0.5) is performed as well, for the purpose of significant allelic bias detection via likelihood ratio test.

EMfit_betabinom_robust(
  data_counts,
  allelefreq = 0.5,
  SE,
  inbr = 0,
  dltaco = 10^-6,
  HWE = FALSE,
  p_InitEst = FALSE,
  ThetaInits = "moment",
  ReEstThetas = "moment",
  NoSplitHom = TRUE,
  NoSplitHet = TRUE,
  ResetThetaMin = 10^-10,
  ResetThetaMax = 10^-1,
  DistRob = "Cook",
  CookMargin = 5,
  LikEmpNum = 1000,
  LikMargin = 0,
  NumHetMin = 5,
  MaxOutFrac = 0.5,
  thetaTRY = c(10^-1, 10^-3, 10^-7),
  fitH0 = TRUE,
  FirstFewFixed = NULL,
  epsabs = 0.001
)

Arguments

data_counts

Data frame. Data frame of a SNP with reference and variant counts ("ref_count" and "var_count", respectively) for each sample ("sample").

allelefreq

Number. Allele frequency. Only used when pInitEst is TRUE (default = 0.5)

SE

Number. Sequencing error rate.

inbr

Number. Degree of inbreeding (default = 0).

dltaco

Number. Minimal difference between 2 iterations (default = 0.001).

HWE

Logical. Should HWE be used for allele frequency estimation, not recommended (default = FALSE).

p_InitEst

Logical. Calculate initial estimates of pr, pv and prv from allelefreq, not recommended (default = FALSE).

ThetaInits

Numeric vector, "moment" or "TryThree". A vector of length two containing the initial dispersion estimate for the homozygous peaks, followed by the one for the heterozygous peak, or "moment", in which case moment initial estimates are generated. In case of "TryThree", three initial values given by the thetaTRY input argument are tried out and the one yielding the best fit is retained.

ReEstThetas

String. Accepts the methods "moment" or "simple" as arguments, in which case dispersions are re-estimated during every EM-step using a moment estimate or a simple custom procedure respectively, to be used as a starting

NoSplitHom

Logical. If TRUE, don't allow the beta-binomial fits for homozygotes to be bimodal

NoSplitHet

Logical. If TRUE, don't allow the beta-binomial fit for heterozygotes to be bimodal

ResetThetaMin

Number. Initial theta values in numeric optimization get capped at this minimum (e.g. in case the moment estimate is even lower)

ResetThetaMax

Number. Initial theta values in numeric optimization get capped at this maximum (e.g. in case the moment estimate is even higher)

DistRob

One of "Cook", "CookEmp" or "LikelyEmp"; removes outlying heterozygous samples anyway, either by checking whether the difference of their leave-one-out maximum likelihood estimates of pi and/or theta with the respective full-data estimates deviate more than CookMargin observed standard deviations from their observed mean ("Cook"). Or using a similar procedure but by calculating this standard deviation and mean via simulation assuming the fitted full-data model is 100 100 observed likelihood distance of a particular sample should not be more extreme than LikMargin of these simulated distanced for it to be not considered an outlier. While outliers are disregarded in the final model fit and all parameter estimates, they remain in the final output to be genotyped, but are marked as such.

CookMargin

Number. See DistRob.

LikEmpNum

Integer. See DistRob.

LikMargin

Integer. See DistRob.

NumHetMin

Integer. Minimal amount of expected heterozygotes required to even try and perform outlier-removal on these (default 5).

MaxOutFrac

Number. Maximum allowed fraction of detected outlying samples; if more, the outlier-detection procedure is disregarded (default 0.5)

thetaTRY

Numeric vector. See ReEstThetas, in case this argument equals "TryThree".

fitH0

Logical. If TRUE, performs a fit assuming no allelic bias (heterozygous pi-parameter = 0.5) and a subsequent likihood ratio test for significant allelic bias.

Value

A list containing the following components:

AB

The estimated Allelic Bias (estimated reference read fraction for heterozygous samples).

AB_lrt

The test statistic of the likelihood ratio test against perfectly balanced Allelic Bias (0.5). Not included if fitH0==FALSE.

AB_p

The p-value of the likelihood ratio test against perfectly balanced Allelic Bias (0.5). Not included if fitH0==FALSE.

nrep

The number of iterations.

quality

Indicates the quality of a locus. An "!" indicates bad data or a bad fit due to no apparent heterozygosity.

rho_vv

The variant homozygote genotype frequency for the locus.

rho_rr

The reference homozygote genotype frequency for the locus.

rho_rv

The heterozyous genotype frequency for the locus.

rho_vv_H0

The variant homozygote genotype frequency with Allelic Bias = 0.5. Not included if fitH0==FALSE.

rho_rr_H0

The reference homozygote genotype frequency with Allelic Bias = 0.5. Not included if fitH0==FALSE.

rho_rv_H0

The heterozyous genotype frequency with Allelic Bias = 0.5. Not included if fitH0==FALSE.

data_hash

Data frame. Input data frame with extra columns: allelefreq, (most likely) genotype, genotype probabilities (prr, prv, pvv), and whether the data point is an outlier ("Outlier", equals 1 if so) per sample.

theta_hom

Final overdispersion estimate for the homozygous peaks.

theta_het

Final overdispersion estimate for the heterozygous peak.

theta_hom_H0

Final overdispersion estimate for the homozygous peaks with Allelic Bias = 0.5. Not included if fitH0==FALSE.

theta_het_H0

Final overdispersion estimate for the heterozygous peak with Allelic Bias = 0.5. Not included if fitH0==FALSE.

RobFlag

Provides some information on the outlier removal procedurs. Either there were not enough heterozygotes to start the procedure ("NotEnoughHet", see NumHetMin), or there were no outliers detected ("NoOutliers"), or there were too many detected and the procedure was disregarded ("TooManyOutliers"; see MaxOutFrac), or there was at least one outlier detected while passing all previous checks, after which the fit was re-done on the non-outlying data ("ReFittingDone").

GOF, GOFaltMEAN, GOFaltMEDIAN, GOFaltPERDIST, GOFaltPERDIST_S, GOFaltONLYHET, GOFexactMEAN, GOFexactMEANLOG

Various Goodness-Of-Fit heuristics.