Highly Interpretable Prediction Models for SNP Data

Introduction

Single Nucleotide Polymorphisms (SNPs)

  • Common variation at a single position in a DNA sequence among individuals
  • Two possible alleles resulting in three possible genotypes
    • homozygous A
    • homozygous B
    • heterozygous AB

Interpretability

An interpretable model is […] following a domain-specific set of constraints that make reasoning processes understandable. […] Logical models […] are among the most popular algorithms […], since their statements provide human-understandable reasons for each prediction.

Rudin et al. (2021) Interpretable Machine Learning

Domain-specific constraints

It is well-known that in the analysis of genetic features such as SNPs, interactions, e.g., gene-gene interactions […] and gene-environment interactions […], play a crucial role.

Lau et al. (2024) logicDT: a procedure for identifying response‑associated interactions between binary predictors

Case Control Studies on Gene-Gene Interactions

SNP1  SNP2  SNPP  Case
0     2     1     0
1     2     0     0
2     0     0     1
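
The table above can be held as plain rows of genotype codes. The following is a minimal sketch, not part of the original slides. It assumes that the codes 0, 1 and 2 stand for the three genotypes (the slide does not say which code maps to which genotype) and that a logic-based method works on binary indicators of the form "SNP has genotype g". The helper name `to_indicators` is hypothetical.

```python
# Rows as shown on the slide: genotype codes per SNP plus case status (1 = case).
columns = ["SNP1", "SNP2", "SNPP", "Case"]
rows = [
    (0, 2, 1, 0),
    (1, 2, 0, 0),
    (2, 0, 0, 1),
]


def to_indicators(row, snp_names, genotypes=(0, 1, 2)):
    """Expand each genotype code into binary indicators named 'SNP=g'.

    Hypothetical helper; the 0/1/2 coding is an assumption.
    """
    return {
        f"{name}={g}": int(value == g)
        for name, value in zip(snp_names, row)
        for g in genotypes
    }


snp_names = columns[:-1]
X = [to_indicators(r[:-1], snp_names) for r in rows]  # binary predictors
y = [r[-1] for r in rows]                             # case/control labels
```

Each observation then has exactly one active indicator per SNP, which is the form in which conditions such as "SNP6 ∈ {1}" can be evaluated.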

State of the Art

Microsoft’s AutoML

  • AveragedPerceptron
  • SdcaLogisticRegression
  • SdcaNonCalibrated
  • SymbolicSgdLogisticRegression
  • LbfgsLogisticRegression
  • LightGbm
  • FastTree
  • FastForest
  • Gam
  • FieldAwareFactorizationMachine
  • Prior
  • LinearSvm

logicDT

  • Decision-tree-based method
  • High interpretability
  • High prediction performance
  • Necessity of hyperparameter optimization

Further Established Learning Methods

logicFS

  • Reports important interactions
  • High interpretability
  • Interactions are considered separately

GPAS / logicGP

  • GPAS is a special case of the unreleased logicGP
  • Models based on Disjunctive Normal Form of logical expressions
  • High interpretability
  • No automatic model selection

Automatic Model Selection

Algorithm for GPAS

Consolidation

  • Discard all models that appear in only one run on the folds
  • Discard every model for which a model of the same size with a better average accuracy on the validation data exists

Selection

  • Discard models that are not at least 1% more accurate than every smaller model
  • Choose the model with the greatest accuracy
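
The consolidation and selection steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. Each candidate is assumed to be recorded as a tuple (model, size, run, validation accuracy), "1%" is read as an absolute difference of 0.01, and every model is compared against all smaller consolidated models. The function names `consolidate` and `select` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def consolidate(found):
    """found: iterable of (model, size, run_id, validation_accuracy) tuples."""
    accs, runs, sizes = defaultdict(list), defaultdict(set), {}
    for model, size, run_id, acc in found:
        accs[model].append(acc)
        runs[model].add(run_id)
        sizes[model] = size
    # Discard models that appear in only one run; average the rest.
    recurring = {m: mean(a) for m, a in accs.items() if len(runs[m]) > 1}
    # Per model size, keep only the model with the best average accuracy.
    best = {}
    for model, acc in recurring.items():
        size = sizes[model]
        if size not in best or acc > best[size][1]:
            best[size] = (model, acc)
    return sorted((size, model, acc) for size, (model, acc) in best.items())


def select(consolidated, min_gain=0.01):
    """Return the chosen (size, model, accuracy) tuple, or None if empty."""
    if not consolidated:
        return None
    kept = []
    for size, model, acc in consolidated:
        smaller = [a for s, _, a in consolidated if s < size]
        # Keep a model only if it beats every smaller model by min_gain
        # (assumption: absolute gain, compared against all smaller models).
        if not smaller or acc >= max(smaller) + min_gain:
            kept.append((size, model, acc))
    # Choose the most accurate of the remaining models.
    return max(kept, key=lambda t: t[2])


# Made-up candidates for illustration only.
found = [
    ("A", 1, 0, 0.60), ("A", 1, 1, 0.62),
    ("B", 1, 0, 0.55), ("B", 1, 2, 0.57),
    ("A&B", 2, 0, 0.70), ("A&B", 2, 1, 0.72),
    ("A&B&C", 3, 1, 0.715), ("A&B&C", 3, 2, 0.715),
    ("D", 2, 2, 0.90),  # appears in one run only, so it is discarded
]
chosen = select(consolidate(found))
```

With these made-up numbers the largest model has the highest average accuracy, but it improves on the size-2 model by less than 0.01, so the size-2 model is chosen.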

Experiments

Simulated Data

  • Standard simulation of R package scrime
  • 100 runs on training and test data with 1,000 observations each

Results

logicGP

Introduction

  • logicGP is a generalization of GPAS
  • Multiclass classification is possible
  • Offers better predictions for interactions with fewer SNPs
  • Future versions will enable the analysis of gene-environment interactions

Example Model

wControl  wCase  Condition
1.38      0.74   SNP6 ∈ {1}
0.35      1.78   (SNP9 ∈ {1})(SNP3 ∈ {1})(SNP10 ∈ {1})
0.74      1.35   SNP7 ∈ {1}
0.62      0.38   All above false
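
To make the table concrete, the sketch below checks which conditions hold for one observation and returns the associated weight pairs. It is an illustration only. It assumes that the juxtaposed terms in the second row are combined with a logical AND (as in a conjunction of a Disjunctive Normal Form), and it deliberately does not combine the weights into a prediction, since the slide does not state how that is done. The names `rules`, `default` and `matching_weights` are hypothetical.

```python
# Each rule: (wControl, wCase, literals); literals within a rule are ANDed
# (assumption based on the DNF structure mentioned earlier).
rules = [
    (1.38, 0.74, [("SNP6", {1})]),
    (0.35, 1.78, [("SNP9", {1}), ("SNP3", {1}), ("SNP10", {1})]),
    (0.74, 1.35, [("SNP7", {1})]),
]
default = (0.62, 0.38)  # row "All above false"


def matching_weights(genotypes):
    """Return the (wControl, wCase) pairs of all rules whose condition holds,
    or the default pair if no rule holds."""
    hits = [
        (w_control, w_case)
        for w_control, w_case, literals in rules
        if all(genotypes[snp] in allowed for snp, allowed in literals)
    ]
    return hits or [default]


# Made-up observation: genotype code per SNP.
person = {"SNP3": 1, "SNP6": 0, "SNP7": 1, "SNP9": 1, "SNP10": 1}
weights = matching_weights(person)
```

For this made-up observation the second and third conditions hold, so their two weight pairs are returned; an observation matching no condition receives only the "All above false" pair.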

First Results

Thank you for your attention!