EMfit_betabinom_popcomb, similar to EMfit_betabinom, fits a beta-binomial mixture model to input reference- and allele RNAseq counts, but with those counts originating from two different populations (a control- and case-population). All distribution models are jointly fitted except for the heterozygous overdispersion parameters, which is allowed to differ between the populations. The main purpose of this fit is getting this model's likelihood to be used in a likelihood ratio test, checking for statistical evidence that this heterozygous overdispersion parameter truly differs between the populations, which would indicate differential allelic divergence.

EMfit_betabinom_popcomb(
  data_counts,
  allelefreq = 0.5,
  SE,
  inbr = 0,
  dltaco = 10^-6,
  HWE = FALSE,
  p_InitEst = FALSE,
  p_inits = c(1/3, 1/3, 1/3),
  ThetaInits = "moment",
  ReEstThetas = "moment",
  NoSplitHom = TRUE,
  NoSplitHet = TRUE,
  ResetThetaMin = 10^-10,
  ResetThetaMax = 10^-1,
  thetaTRY = c(10^-1, 10^-3, 10^-7),
  probshift_init = 0.5,
  FirstFewFixed = NULL,
  MemLim = 2048,
  Xtra = 5,
  epsabs = 0.001,
  MaxIt = 100
)

Arguments

data_counts

Data frame. Data frame of a SNP with reference and variant counts ("ref_count" and "var_count", respectively) for each sample ("sample"). Also required is the "isCase" column, equaling 1 for case-samples and 0 for control-samples

allelefreq

Number. Allele frequency. Only used when pInitEst is TRUE (default = 0.5)

SE

Number. Sequencing error rate.

inbr

Number. Degree of inbreeding (default = 0).

dltaco

Number. Minimal difference between 2 iterations (default = 0.001).

HWE

Logical. Should HWE be used for allele frequency estimation, not recommended (default = FALSE).

p_InitEst

Logical. Calculate initial estimates of pr, pv and prv from allelefreq, not recommended (default = FALSE).

ThetaInits

Numeric vector, "moment" or "TryThree". A vector of length two containing the initial dispersion estimate for the homozygous peaks, followed by the one for the heterozygous peak, or "moment", in which case moment initial estimates are generated. In case of "TryThree", three initial values given by the thetaTRY input argument are tried out and the one yielding the best fit is retained.

ReEstThetas

String. Accepts the methods "moment" or "simple" as arguments, in which case dispersions are re-estimated during every EM-step using a moment estimate or a simple custom procedure respectively, to be used as a starting

NoSplitHom

Logical. If TRUE, don't allow the beta-binomial fits for homozygotes to be bimodal

NoSplitHet

Logical. If TRUE, don't allow the beta-binomial fit for heterozygotes to be bimodal

ResetThetaMin

Number. Initial theta values in numeric optimization get capped at this minimum (e.g. in case the moment estimate is even lower)

ResetThetaMax

Number. Initial theta values in numeric optimization get capped at this maximum (e.g. in case the moment estimate is even higher)

thetaTRY

Numeric vector. See ReEstThetas, in case this argument equals "TryThree".

probshift_init

Number. Initial estimate position heterozygous peak (allelic bias; expected fraction of reference reads)

Value

A list containing the following components:

ParamVec

A vector containing fitted model parameters: genotype frequencies, heterozygous peak position (allelic bias), and overdispersion of the homozygous peaks and of the separate heterozygous control- and case-peaks.

GenoDF

A dataframe containing per-samples genotypes likelihoods according to the fitted model.

genotypeN

A character vector containing the most likely per-sample genotypes according to the fitted model.

nrep

The number of iterations.

quality

An "!" indicates a bad quality locus or fit (no apparent heterozygotes)