Oral Presentation 41st Lorne Genome Conference 2020

MaxHiC: a robust background correction model to identify statistically significant interactions in Hi-C and capture Hi-C experiments (#28)

Hamid Alinejad-Rokny 1 2 , Rassa Ghavami 3 , Hamid Reza Rabiee 3 , Narges Rezaei 4 , Kin Tung Tam 2 , Alistair Forrest 2
  1. Systems Biology and Health Data Analytics Lab, The Graduate School of Biomedical Engineering, The University of New South Wales, Sydney, NSW, Australia
  2. Harry Perkins Institute of Medical Research, The University of Western Australia, Perth, WA, Australia
  3. Computational Biology and Bioinformatics, Sharif University of Technology, Tehran, Iran
  4. Center for Complex Biological Systems, University of California Irvine, Irvine, California, USA

Hi-C is a genome-wide chromosome conformation capture technology used to detect interactions between pairs of genomic regions and to understand higher-order chromatin structure. Conceptually, Hi-C data count interaction frequencies between every position in the genome and every other position. Biologically meaningful interactions are expected to occur more frequently than random (background) interactions. To identify biologically relevant interactions, several background models that take into account biases such as distance, GC content and mappability have been proposed. Here we introduce MaxHiC, a robust machine learning-based background correction tool that uses a negative binomial model and a maximum likelihood technique to deal with these complex biases and to robustly identify statistically significant interactions in both Hi-C and capture Hi-C experiments. We systematically benchmark MaxHiC against all major Hi-C background correction tools and demonstrate, using multiple publicly available Hi-C and capture Hi-C datasets, that 1) the interacting regions identified by MaxHiC have significantly greater levels of regulatory features (e.g. active chromatin histone marks, CTCF binding sites, DNase accessibility) and of disease-associated SNPs from genome-wide association studies than those identified by existing models, and 2) the pairs of interacting regions are more likely to be linked by eQTL pairs and more likely to recover known enhancer-promoter pairs than those from any of the existing methods. Using MaxHiC we also find that H3K27ac-H3K27ac loops are the preferred structural loops at shorter genomic distances (less than 200 kb), whereas CTCF-CTCF loops are the preferred structural loops at longer distances (more than 200 kb). Lastly, we use MaxHiC to generate the first map of promoter-enhancer interactions in neuromuscular diseases and identify 2,253 putative regions interacting with neuromuscular disease-associated genes.
Ultimately, this map will help diagnose the currently undiagnosed fraction of patients who carry mutations in regulatory regions, and will enhance our understanding of skeletal muscle gene regulation.
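
The abstract names a negative binomial background model fitted by maximum likelihood but does not give its form. The sketch below is not the MaxHiC implementation. It only illustrates the general idea under simplifying assumptions: each pair of regions has a known expected background count (for example, derived from genomic distance), a single shared dispersion parameter is estimated by a grid-search maximum likelihood fit, and significance is the upper-tail probability of the observed count. All function names, the grid, and the toy data are hypothetical.

```python
import math


def nb_logpmf(k, mu, r):
    """Log-probability of count k under a negative binomial with mean mu > 0 and size r > 0."""
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))


def fit_dispersion(counts, means, grid=None):
    """Grid-search maximum likelihood estimate of the shared size parameter r.

    Assumption: a coarse log-spaced grid stands in for a proper optimiser.
    """
    if grid is None:
        grid = [10 ** (e / 10) for e in range(-20, 31)]  # 0.01 to 1000

    def loglik(r):
        return sum(nb_logpmf(k, mu, r) for k, mu in zip(counts, means))

    return max(grid, key=loglik)


def nb_upper_tail(k, mu, r):
    """P(X >= k): probability of at least k reads arising from background alone."""
    below = sum(math.exp(nb_logpmf(j, mu, r)) for j in range(k))
    return max(0.0, 1.0 - below)


# Toy data (hypothetical): observed read counts per region pair and the
# background mean expected for each pair.
counts = [12, 9, 15, 4, 6, 3, 40]
means = [10.0, 10.0, 10.0, 5.0, 5.0, 5.0, 5.0]

r_hat = fit_dispersion(counts, means)
p_values = [nb_upper_tail(k, mu, r_hat) for k, mu in zip(counts, means)]

for k, mu, p in zip(counts, means, p_values):
    print(f"observed={k:3d} expected={mu:5.1f} p={p:.3g}")
```

In this toy example the last pair (40 reads against an expected 5) receives the smallest p-value. A real analysis would also estimate the expected counts from the data, account for the other biases the abstract lists, and correct for multiple testing.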