EMfit_betabinom estimates, per locus, parameters of an assumed beta-binomial mixture model via expectation maximization (genotype frequencies, allelic bias i.e. heterozygous pi parameter, homozygous and heterozygous allelic divergence i.e. their respective theta- (overdispersion) parameters. Through this fit, per-samples genotype probabilities are obtained as well. Optionally, a fit assuming no allelic bias (heterozygous pi-parameter = 0.5) is performed as well, for the purpose of significant allelic bias detection via likelihood ratio test.

EMfit_betabinom(
  data_counts,
  allelefreq = 0.5,
  SE,
  inbr = 0,
  dltaco = 10^-6,
  HWE = FALSE,
  p_InitEst = FALSE,
  probshift_init = NULL,
  ThetaInits = "moment",
  ReEstThetas = "moment",
  NoSplitHom = TRUE,
  NoSplitHet = TRUE,
  ResetThetaMin = 10^-10,
  ResetThetaMax = 10^-1,
  thetaTRY = c(10^-1, 10^-3, 10^-7),
  fitH0 = TRUE,
  FirstFewFixed = NULL,
  epsabs = 0.001,
  MaxIt = 100
)

Arguments

data_counts

Data frame. Data frame of a SNP with reference and variant counts ("ref_count" and "var_count", respectively) for each sample ("sample").

allelefreq

Number. Allele frequency. Only used when pInitEst is TRUE (default = 0.5)

SE

Number. Sequencing error rate.

inbr

Number. Degree of inbreeding (default = 0).

dltaco

Number. Minimal difference between 2 iterations (default = 0.001).

HWE

Logical. Should HWE be used for allele frequency estimation, not recommended (default = FALSE).

p_InitEst

Logical. Calculate initial estimates of pr, pv and prv from allelefreq, not recommended (default = FALSE).

probshift_init

Number. Initial estimate for the allelic bias in heterozygotes; it's recommended to leave this at NULL, in which case this initial estimate will be 0.5

ThetaInits

Numeric vector, "moment" or "TryThree". A vector of length two containing the initial dispersion estimate for the homozygous peaks, followed by the one for the heterozygous peak, or "moment", in which case moment initial estimates are generated. In case of "TryThree", three initial values given by the thetaTRY input argument are tried out and the one yielding the best fit is retained.

ReEstThetas

String. Accepts the methods "moment" or "simple" as arguments, in which case dispersions are re-estimated during every EM-step using a moment estimate or a simple custom procedure respectively, to be used as a starting

NoSplitHom

Logical. If TRUE, don't allow the beta-binomial fits for homozygotes to be bimodal

NoSplitHet

Logical. If TRUE, don't allow the beta-binomial fit for heterozygotes to be bimodal

ResetThetaMin

Number. Initial theta values in numeric optimization get capped at this minimum (e.g. in case the moment estimate is even lower)

ResetThetaMax

Number. Initial theta values in numeric optimization get capped at this maximum (e.g. in case the moment estimate is even higher)

thetaTRY

Numeric vector. See ReEstThetas, in case this argument equals "TryThree".

fitH0

Logical. If TRUE, performs a fit assuming no allelic bias (heterozygous pi-parameter = 0.5) and a subsequent likihood ratio test for significant allelic bias.

Value

A list containing the following components:

AB

The estimated Allelic Bias (estimated reference read fraction for heterozygous samples).

AB_lrt

The test statistic of the likelihood ratio test against perfectly balanced Allelic Bias (0.5). Not included if fitH0==FALSE.

AB_p

The p-value of the likelihood ratio test against perfectly balanced Allelic Bias (0.5). Not included if fitH0==FALSE.

nrep

The number of iterations.

quality

Indicates the quality of a locus. An "!" indicates bad data or a bad fit due to no apparent heterozygosity.

rho_vv

The variant homozygote genotype frequency for the locus.

rho_rr

The reference homozygote genotype frequency for the locus.

rho_rv

The heterozyous genotype frequency for the locus.

rho_vv_H0

The variant homozygote genotype frequency with Allelic Bias = 0.5. Not included if fitH0==FALSE.

rho_rr_H0

The reference homozygote genotype frequency with Allelic Bias = 0.5. Not included if fitH0==FALSE.

rho_rv_H0

The heterozyous genotype frequency with Allelic Bias = 0.5. Not included if fitH0==FALSE.

data_hash

Data frame. Input data frame with extra columns: allelefreq, (most likely) genotype, genotype probabilities (prr, prv, pvv), and whether the data point is an outlier ("Outlier", equals 1 if so) per sample.

theta_hom

Final overdispersion estimate for the homozygous peaks.

theta_het

Final overdispersion estimate for the heterozygous peak.

theta_hom_H0

Final overdispersion estimate for the homozygous peaks with Allelic Bias = 0.5. Not included if fitH0==FALSE.

theta_het_H0

Final overdispersion estimate for the heterozygous peak with Allelic Bias = 0.5. Not included if fitH0==FALSE.

GOF, GOFaltMEAN, GOFaltMEDIAN, GOFaltPERDIST, GOFaltPERDIST_S, GOFaltONLYHET, GOFexactMEAN, GOFexactMEANLOG

Various Goodness-Of-Fit heuristics.