AdvancedFitter attempts to detect relationships between the beta-binomial parameters (pi, theta) and any independent variable(s) suspected to play a role, e.g. increased theta (overdispersion) as a function of age. It is currently more of an exploratory function, however: it fits these relationships while keeping the supplied "WeightCol" invariant, i.e. assuming these weights reflect the "true" probability that a certain observation belongs to the genotype under study (heterozygotes). There is of course no such thing (in reality, an observation has one true underlying genotype, not "a probability") and, ideally, the parameter-variable relationships studied here would be taken into account during the complete beta-binomial mixture model EM fit.

AdvancedFitter(
  ThetaForm = Theta ~ 1,
  PiForm = NULL,
  SNPdata,
  RefCol = "ref_count",
  VarCol = "var_count",
  WeightCol = NULL,
  Pi_start,
  Theta_start,
  PiLink = "identity",
  ThetaLink = "identity",
  center = TRUE,
  ResetThetaMin = 10^-10,
  ResetThetaMax = 10^-1
)

Arguments

ThetaForm

An object describing the (linear) relationship between theta and any independent variables present in `SNPdata`, like `Theta ~ Var1 + Var2`. The "ThetaLink" input (see below) additionally allows this linear relationship to be fit on a transformed scale, i.e. as a generalized linear relationship.

PiForm

An object describing the (linear) relationship between pi and any independent variables present in `SNPdata`, like `Pi ~ Var1 + Var2`. The "PiLink" input (see below) additionally allows this linear relationship to be fit on a transformed scale, i.e. as a generalized linear relationship.

SNPdata

Dataframe. A dataframe containing reference and variant allele counts, in columns named as given by the "RefCol" and "VarCol" inputs. This dataframe must also contain any independent variables used in the "ThetaForm" and "PiForm" input arguments.

RefCol

String. Name of the column in SNPdata containing reference allele counts.

VarCol

String. Name of the column in SNPdata containing variant allele counts.

WeightCol

String. Optional name of a column of SNPdata containing per-sample weights which, if specified, are used in a weighted maximum likelihood fit (maximizing sum(sample-weights * sample-log-likelihoods)).
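
The weighted objective described for WeightCol can be sketched in base R as follows. This is only an illustration, not the package's implementation: it assumes the beta-binomial is parameterized with alpha = pi/theta and beta = (1 - pi)/theta, and the helper names `dbetabinom_log` and `weighted_loglik`, as well as the toy counts and weights, are hypothetical.

```r
# Beta-binomial log-density.
# ASSUMPTION: alpha = pi/theta, beta = (1 - pi)/theta (a mean/overdispersion
# parameterization); the package's own parameterization may differ.
dbetabinom_log <- function(ref, var, pi, theta) {
  a <- pi / theta
  b <- (1 - pi) / theta
  lchoose(ref + var, ref) + lbeta(ref + a, var + b) - lbeta(a, b)
}

# Weighted log-likelihood: sum(sample-weights * sample-log-likelihoods)
weighted_loglik <- function(ref, var, w, pi, theta) {
  sum(w * dbetabinom_log(ref, var, pi, theta))
}

# Hypothetical toy data: allele counts and per-sample heterozygote weights
ref <- c(12, 8, 15, 3)
var <- c(10, 9, 2, 14)
w   <- c(0.9, 0.95, 0.1, 0.2)

weighted_loglik(ref, var, w, pi = 0.5, theta = 0.05)
```

Samples with a weight close to zero contribute almost nothing to the objective, which is why the fit depends so strongly on the weights being treated as fixed.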

Pi_start

Number. Starting pi value for numerical optimization. When fitting PiForm, this starting value is used as the intercept, and all independent variables start with a regression coefficient of zero.

Theta_start

Number. Starting theta value for numerical optimization. When fitting ThetaForm, this starting value is used as the intercept, and all independent variables start with a regression coefficient of zero.

PiLink

String. One of "identity", "log" or "sqrt"; the linear relationship specified by PiForm is fit as PiLink(pi) ~ PiForm

ThetaLink

String. One of "identity", "log" or "sqrt"; the linear relationship specified by ThetaForm is fit as ThetaLink(theta) ~ ThetaForm
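
To make the role of the link concrete, the sketch below maps a linear predictor back to theta for each of the three documented link names. It is a minimal illustration under stated assumptions: the `link_funs` list, the `predict_theta` helper, and the coefficient values are hypothetical and not part of the package.

```r
# Link functions and their inverses for the three documented options
link_funs <- list(
  identity = list(link = function(x) x, inverse = function(eta) eta),
  log      = list(link = log,           inverse = exp),
  sqrt     = list(link = sqrt,          inverse = function(eta) eta^2)
)

# ThetaLink(theta) = X %*% coefs, so theta = inverse_link(X %*% coefs)
predict_theta <- function(coefs, X, ThetaLink = "identity") {
  eta <- as.vector(X %*% coefs)
  link_funs[[ThetaLink]]$inverse(eta)
}

age <- c(20, 40, 60)          # hypothetical independent variable
X   <- cbind(1, age)          # intercept column plus covariate

# With a log link, theta stays positive for any coefficient values
predict_theta(c(log(0.01), 0.02), X, ThetaLink = "log")
```

A log or sqrt link keeps the fitted parameter non-negative for any coefficients, whereas the identity link can produce values outside the valid range for extreme covariate values.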

center

Logical. If TRUE, centers all explanatory (independent) variables relative to their mean before performing the maximum likelihood fit. This is recommended when the supplied Pi_start is an expected/mean pi-value across all samples (e.g. the result of a single pi-fit on these samples).
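
The effect of centering can be illustrated with a small sketch. It assumes an identity link and uses hypothetical values for the covariate and coefficients; it is not the package's code.

```r
age <- c(25, 40, 55, 70)             # hypothetical independent variable
age_centered <- age - mean(age)      # what center = TRUE is described as doing

b0 <- 0.5     # intercept, e.g. a mean pi from an earlier single pi-fit
b1 <- 0.002   # hypothetical regression coefficient

# Identity link: pi_i = b0 + b1 * (age_i - mean(age))
pi_centered <- b0 + b1 * age_centered

# Without centering, the same intercept refers to age = 0 instead
pi_uncentered <- b0 + b1 * age

mean(pi_centered)    # equals b0: the intercept is the average pi
mean(pi_uncentered)  # shifted away from b0 by b1 * mean(age)
```

With centered variables the intercept keeps its meaning as the parameter value for an average sample, so a mean pi estimate remains a sensible starting intercept even once the regression coefficients move away from zero.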

ResetThetaMin

Number. When the supplied Theta_start value is lower than this input, it is reset to this input. Default 10^-10; it is not recommended to change this value.

ResetThetaMax

Number. When the supplied Theta_start value is higher than this input, it is reset to this input. Default 10^-1; it is not recommended to change this value.
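
The resetting behaviour described for ResetThetaMin and ResetThetaMax amounts to clamping the starting value to an interval. A minimal sketch, with a hypothetical helper name `clamp_theta_start`:

```r
# Clamp Theta_start to [ResetThetaMin, ResetThetaMax]
clamp_theta_start <- function(Theta_start,
                              ResetThetaMin = 10^-10,
                              ResetThetaMax = 10^-1) {
  min(max(Theta_start, ResetThetaMin), ResetThetaMax)
}

clamp_theta_start(0)      # too low: reset to ResetThetaMin
clamp_theta_start(0.5)    # too high: reset to ResetThetaMax
clamp_theta_start(0.01)   # within bounds: unchanged
```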

Value

A list containing the following components:

first list-item

The negative of the maximized log-likelihood.

second list-item

Optimized parameter values, given in the order (1) pi-intercept, (2) theta-intercept, (3) pi regression coefficients, (4) theta regression coefficients.

optional third list-item

If theta depends on only one independent variable, the mean of that variable across samples.
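
Because the optimized parameters are returned as a single vector in a fixed order, it can help to split them into named pieces. The helper below is a hypothetical sketch, not part of the package: it assumes both intercepts are always present, and that the caller knows how many regression coefficients were fit for pi (`n_pi`) and theta (`n_theta`).

```r
# Split a parameter vector ordered as:
# (1) pi-intercept, (2) theta-intercept,
# (3) pi regression coefficients, (4) theta regression coefficients
split_params <- function(par, n_pi, n_theta) {
  stopifnot(length(par) == 2 + n_pi + n_theta)
  list(
    pi_intercept    = par[1],
    theta_intercept = par[2],
    pi_coefs        = par[seq_len(n_pi) + 2],
    theta_coefs     = par[seq_len(n_theta) + 2 + n_pi]
  )
}

# Hypothetical result: one pi coefficient, two theta coefficients
par <- c(0.5, 0.02, 0.001, 0.0003, -0.0001)
split_params(par, n_pi = 1, n_theta = 2)
```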