Fan Zhang

PhD Candidate
Teaching Assistant, Research Assistant
Worcester Polytechnic Institute

Address: 60 Prescott St, Worcester, MA 01605
Email: fzhang [at]
Google scholar profile
fzhang code on github

Research Summary

My research focuses on developing statistical methods for characterizing genomic heterogeneity in mixed samples, with the aim of improving the diagnosis and treatment of cancer. My background in computer science, statistics, and genetics has helped me contribute to two research projects in large-scale clinical data analysis.


Rare variant detection in deep, heterogeneous next-generation sequencing data

Statistical method development for variant detection in heterogeneous next-generation sequencing data

Massively parallel sequencing data generated by next-generation sequencing technologies are routinely used to interrogate the extensive genomic heterogeneity in tumor samples. Characterizing this heterogeneity remains a major barrier to personalized treatment and to overcoming drug resistance. A number of computational methods have recently been developed to detect genetic variants in massive genomic data sets. Yet the noise inherent in the biological processes involved in next-generation sequencing makes statistical methods necessary for identifying true rare variants. There is therefore a need for accurate and scalable statistical methods to uncover variants in mixed samples. With such methods, we will be able to identify disease-related variants and use them to predict treatment response for an individual patient.

This research focuses on developing accurate and scalable statistical methods to quantify the contribution of various genomic factors to genetic diseases. We developed a novel Bayesian statistical model for rare variant detection in low-depth heterogeneous next-generation sequencing data (Figure 1). I also proposed a variational expectation maximization (EM) inference algorithm (Figure 2) that detects rare variants more efficiently while showing comparable accuracy. The overall flowchart for calling variants with our statistical model is shown in Figure 3 below.

The code and data sets are available on our website: Rare Variant Detection.

Figure 1. A. Graphical model representation of the model. B. Graphical model representation of the variational approximation to the posterior distribution.

Figure 2. Flowchart of the variational EM algorithm.

Figure 3. The overall flowchart of identifying variants by our statistical model, variational RVD2.
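
For readers unfamiliar with variational EM, the generic form of the algorithm can be written as follows. This is the textbook formulation, not the specific RVD2 model: the symbols x (observed data), z (latent variables), \phi (model parameters), and \mathcal{Q} (the variational family) are placeholders chosen here for illustration.

```latex
% Evidence lower bound (ELBO) on the log marginal likelihood
\log p(x \mid \phi) \;\ge\; \mathcal{L}(q, \phi)
  = \mathbb{E}_{q(z)}\!\left[\log p(x, z \mid \phi)\right]
  - \mathbb{E}_{q(z)}\!\left[\log q(z)\right]

% Variational E-step: tighten the bound over the approximating family
q^{(t+1)} = \arg\max_{q \in \mathcal{Q}} \; \mathcal{L}\!\left(q, \phi^{(t)}\right)

% M-step: update the model parameters with q held fixed
\phi^{(t+1)} = \arg\max_{\phi} \; \mathcal{L}\!\left(q^{(t+1)}, \phi\right)
```

Each step cannot decrease the bound, so iterating the two steps until the bound stops improving gives a deterministic alternative to sampling-based inference.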

We demonstrate that our variational algorithm has higher specificity than many state-of-the-art algorithms (Figure 4).

Figure 4. Sensitivity/Specificity comparison with other variant detection methods. NRAF stands for non-reference allele frequency.
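
The quantities named in the Figure 4 caption can be computed as in the minimal sketch below. It assumes variant calls and known true variants are given as sets of positions; the function names and the toy data are hypothetical and are not taken from the RVD2 code.

```python
def sensitivity_specificity(called, truth, all_positions):
    """Return (sensitivity, specificity) for a set of called positions.

    called, truth, and all_positions are iterables of position identifiers;
    this interface is an illustrative assumption, not the RVD2 API.
    """
    called, truth, all_positions = set(called), set(truth), set(all_positions)
    tp = len(called & truth)                  # true variants that were called
    fn = len(truth - called)                  # true variants that were missed
    fp = len(called - truth)                  # calls at non-variant positions
    tn = len(all_positions - called - truth)  # non-variant positions left uncalled
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec


def nraf(ref_count, nonref_count):
    """Non-reference allele frequency at one position from read counts."""
    depth = ref_count + nonref_count
    return nonref_count / depth if depth else float("nan")


# Toy example with made-up positions: 10 positions, 3 true variants, 3 calls.
sens, spec = sensitivity_specificity(
    called={2, 5, 9}, truth={2, 5, 8}, all_positions=range(1, 11)
)
rare = nraf(ref_count=990, nonref_count=10)  # 10 non-reference reads out of 1000
```

In the toy example, two of the three true variants are recovered and one of the seven non-variant positions is called in error.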

In an analysis of a directed evolution longitudinal yeast data set, we are able to identify a time-series trend in non-reference allele frequency and detect novel variants that have not yet been reported. Figure 5 shows a segment of DNA sequencing data at position chr04:1014850 in generation 448, viewed in IGV.

Figure 5. Evidence of a detected variant in gene MTH1.

This research is significant because it provides an accurate and scalable statistical method that can be extended to study drug resistance by characterizing tumor heterogeneity.

Scalable deterministic global optimization algorithm development for molecular subtype classification

Mixed-membership models are popular for analyzing data sets that have within-sample heterogeneity (Figure 6). Several sampling and variational inference algorithms have been developed for mixed-membership models, but they provide only approximate, locally optimal estimates rather than globally optimal ones. There is therefore a need for a global optimization algorithm that accurately estimates the tumor subtype distribution in heterogeneous tumor samples.

Figure 6. Inter-tumor and intra-tumor heterogeneity (Burrell, R.A., 2013).

My research focuses on developing a global optimization framework that aims to reach the globally optimal solution for molecular subtype classification in mixed samples (Figure 7). I am working on a global optimization algorithm for a sparse mixed-membership matrix factorization problem using deterministic strategies based on Benders’ decomposition. I have recently developed several extensions that improve the computational efficiency of solving this biconvex optimization problem on larger data sets. We aim to provide a global view of latent correlated patterns of genomic subtypes in The Cancer Genome Atlas (TCGA) glioblastoma data. This research will yield a novel exact statistical inference algorithm that will help explain the molecular mechanisms behind subtype co-occurrence patterns and thus bring insights into personalized medicine.

Figure 7. A sparse mixed-membership matrix factorization problem.
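
To illustrate why a biconvex factorization problem calls for global rather than local optimization, here is a toy sketch. It assumes a scalar squared-error objective, (y - x * theta)^2, as a stand-in; the actual sparse mixed-membership objective and its constraints are not reproduced here, and the function names are hypothetical.

```python
def loss(x, theta, y=1.0):
    """Squared reconstruction error for a 1x1 factorization y ~ x * theta."""
    return (y - x * theta) ** 2


def midpoint_gap(p, q):
    """Loss at the midpoint of p and q minus the average loss at p and q.

    A convex function never gives a positive gap, so a positive value
    shows that convexity fails along the segment from p to q.
    """
    mid = tuple((a + b) / 2 for a, b in zip(p, q))
    return loss(*mid) - (loss(*p) + loss(*q)) / 2


# Jointly in (x, theta): (1, 1) and (-1, -1) both fit y = 1 exactly,
# but their midpoint (0, 0) does not, so the loss is not jointly convex.
joint_gap = midpoint_gap((1.0, 1.0), (-1.0, -1.0))

# With x held fixed at 2, the loss is a convex quadratic in theta,
# so the gap along theta is non-positive.
fixed_x_gap = midpoint_gap((2.0, -1.0), (2.0, 3.0))
```

Because the objective is convex in each block separately but not jointly, alternating or local updates can settle at different solutions depending on the starting point. That is the motivation for deterministic global strategies.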

Last updated 2016-06-13