Research

My research philosophy is to combine deep theoretical work in high-dimensional statistics with practical work in large-scale, data-driven genetics and genomics. By pairing solid statistical theory with cutting-edge biomedical applications, I hope this approach will strengthen the critical role of statistics in the coming era of personalized medicine. The ultimate goal of my research is to develop statistical methods that help uncover the mysteries of life by analyzing the fast-growing big data generated by modern biomedical experiments and practice. Currently, my statistical development focuses on the following areas.

Signal detection theory

My statistical research emphasizes signal detection theory, which I believe is one of the core techniques for mining the big data generated in genetics, systems biology, and other biomedical research. To develop the most powerful methods for detecting weak and sparse effects, a key step is to design effective procedures that test the profile of significance evidence across the targeted factors. The Kolmogorov-Smirnov type statistic, which tests the empirical distribution profile, is one such procedure; the Higher Criticism and several goodness-of-fit test statistics were developed from it and have been proved asymptotically optimal. Following this idea, we can create various optimal statistics, based on different ways of comparing the profiles, to address specific data types. For example, one important scenario is to design optimal methods under discrete (rather than Gaussian) distribution assumptions. Discreteness is a representative characteristic of next-generation sequencing data, such as RNA-seq (discrete counts of sequence reads) and DNA-seq (discrete counts of rare mutations). Furthermore, analytical power calculation for these optimal methods, especially when the sample size is finite, is critical for study design and practical data analysis.
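As a rough illustration of the profile-testing idea, here is a minimal sketch of the Higher Criticism statistic in its standard p-value form: sort the p-values, compare each ordered p-value with its expected rank fraction under the null, standardize, and take the maximum over the smallest fraction of p-values. The function name `higher_criticism` and the parameter `alpha0` are illustrative choices, not taken from any particular software package, and the sketch assumes independent p-values that are uniform under the null.

```python
import math


def higher_criticism(p_values, alpha0=0.5):
    """Higher Criticism statistic for a list of p-values.

    Computes max over 1 <= i <= alpha0 * n of
        sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i))),
    where p_(1) <= ... <= p_(n) are the sorted p-values.
    Assumption: p-values are independent and uniform under the null.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("p_values must not be empty")
    sorted_p = sorted(p_values)
    k_max = max(1, int(alpha0 * n))  # only scan the smallest p-values
    best = -math.inf
    for i in range(1, k_max + 1):
        p = sorted_p[i - 1]
        if p <= 0.0 or p >= 1.0:
            continue  # skip degenerate values to avoid division by zero
        z = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
        best = max(best, z)
    return best


# Hypothetical data: an evenly spread "null-like" profile, and the same
# profile with its five smallest p-values replaced by very small ones.
null_like = [(i - 0.5) / 100 for i in range(1, 101)]
with_signal = [1e-6] * 5 + null_like[5:]

print(higher_criticism(null_like))
print(higher_criticism(with_signal))
```

A handful of very small p-values among many unremarkable ones pushes the statistic far above its value on the evenly spread profile, which is the weak-and-sparse setting described above.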

Statistical genetics

The above research on signal detection theory can help tackle one of the most critical problems in genetics: the missing heritability. I have several ongoing research efforts. First, we are trying to address one of the key obstacles impeding new gene discovery: gene-gene interactions, which are ubiquitous and important to biological mechanisms but are often ignored by traditional methods. This is mostly due to the difficulty of exploring the extremely large, high-dimensional parameter space that carries the full signal of possible high-order interactions. The idea is to design optimal dimension reduction strategies by studying detection boundaries and optimal tests for interaction signals at various levels of linear projection. Second, I study statistical tests for family data, which are very common in sequencing studies. The idea is to incorporate the correlation among family members into association tests through multi-level models or extended Hotelling's T² type tests. Third, I am developing methods for genetic signal detection based on L0-norm penalized model selection algorithms that have been proved optimal in our theoretical study. Fourth, we are developing high-performance computational tools for genetic data analysis, based on SQL databases and GPU parallel computation. In applying these theoretical and methodological results, we will pay particular attention to the features of rare variants arising in next-generation sequencing studies.
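To make the L0-norm penalty concrete, the following is a minimal sketch in the simplest possible setting, an orthogonal (sequence) model with one z-score per factor. It is not the algorithm referred to above. In this special case, minimizing the residual sum of squares plus a penalty on the number of nonzero effects reduces to hard thresholding: a factor is kept exactly when its squared z-score exceeds the penalty. The function name `l0_select` and the default penalty of 2 log n (the classical universal threshold) are assumptions made for illustration.

```python
import math


def l0_select(z_scores, penalty=None):
    """Indices selected by L0-penalized least squares in the orthogonal case.

    Minimizes sum_i (z_i - theta_i)^2 + penalty * #{i : theta_i != 0}.
    Coordinate by coordinate, keeping theta_i = z_i costs `penalty`, while
    setting theta_i = 0 costs z_i^2, so index i is kept iff z_i^2 > penalty.
    Assumption: default penalty is 2 * log(n), an illustrative choice.
    """
    n = len(z_scores)
    if n == 0:
        return []
    if penalty is None:
        penalty = 2.0 * math.log(n)
    return [i for i, z in enumerate(z_scores) if z * z > penalty]


# Hypothetical z-scores for eight factors; two of them carry strong signals.
z = [0.3, -4.5, 1.2, 5.0, -0.8, 0.1, 2.0, -1.1]
print(l0_select(z))               # default penalty 2 * log(8)
print(l0_select(z, penalty=1.0))  # a smaller penalty keeps more factors
```

With correlated predictors, as in real genetic data, the L0 problem no longer separates by coordinate and needs a search or approximation strategy; the sketch only shows how the penalty trades fit against the number of selected factors.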

Epigenetics and systems biology

Genetic variation is certainly not the whole story behind differences in terminal phenotypes. It is important to understand how genes are regulated and how cells are programmed for their various functions and their effects on complex diseases. At the same time, big data have been collected on different aspects of biological systems: DNA, RNA, regulatory factors, proteins and their interactions, etc., at the cell, tissue, organ, or higher system levels. Complex biological processes can be better understood when the data from these processes are examined as a whole. I target the statistical problems that emerge from integrating and analyzing data on such cooperatively functioning components. I have been working with field experts on several projects in this direction: (1) improving gene expression prediction by incorporating high-dimensional chromatin structure into gene regulation modeling; (2) combining protein-protein interaction networks and DNA sequencing data to discover novel disease mechanisms of autism spectrum disorder (ASD); and (3) establishing sound statistical models to predict the survival time of female reproductive cancer patients based on high-dimensional biomarkers selected from large-scale SNP, mRNA, small RNA, and protein data.

Other collaborations

Drawing on my expertise in statistical modeling, experimental design, and data analysis, I have helped researchers with successful grant applications and paper publications. I would like to contribute this expertise to the research community, in particular by helping scientific researchers design statistically sound experiments, solve statistical problems in their research and data analysis, and prepare grant applications.


Here are the details regarding my:

Publications

Grants

Software tools

Patent

Professional services in relevant fields



© Zheyang Wu 2014