> Graphical Models
Copula-Gaussian Graphical Models with Hidden Variables
Ongoing since 2010
JD, Yu Hang, Wang Xueou (currently at NUS), Zhang Xu (currently at UCLA), Xu Shiyan (currently at SAP)
Sparse graphical models provide an effective way to capture the statistical structure of high-dimensional data such as gene expression data, multi-electrode brain recordings, and stock market data. A sparse graph displays the most significant interactions between the variables concerned, for instance genes, brain areas and stocks, which may aid in making interpretations of the data at hand.
Gaussian graphical models are a popular tool in this field because, for Gaussian random variables, the graph structure is given directly by the non-zero entries of the precision matrix (inverse covariance matrix). However, Gaussian graphical models have well-known limitations in real-world problems where the data are manifestly non-Gaussian. Moreover, the standard lasso-based learning method for Gaussian graphical models does not apply when hidden variables or long-range dependence are present, which has motivated research on graphical models tailored to specific practical problems (e.g., Chandrasekaran et al., Choi et al.).
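For reference, the two facts used in the paragraph above can be written compactly. The symbols K (precision matrix), S (sample covariance) and \lambda (sparsity penalty) are our notation, introduced for illustration; the second expression is the standard graphical-lasso objective, not a formula quoted from the cited papers.

```latex
% x ~ N(\mu, K^{-1}): a missing edge is a zero in the precision matrix
x_i \perp x_j \mid x_{\setminus \{i,j\}} \iff K_{ij} = 0
% Graphical lasso: sparse estimate of K from the sample covariance S
\hat{K} = \arg\max_{K \succ 0} \; \log\det K - \operatorname{tr}(S K) - \lambda \sum_{i \neq j} |K_{ij}|
```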
In our study, we introduce copula-based graphical models whose marginals are modeled flexibly using the empirical distribution while the interactions between variables are captured efficiently in the latent layer via graphical models.
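The marginal step of such a copula model can be sketched as follows. This is a minimal illustration, assuming rank-based normal scores with the common n + 1 rescaling and continuous data without ties; the function name `normal_scores` is ours and does not come from the papers.

```python
from statistics import NormalDist

def normal_scores(samples):
    """Map one variable's samples to latent Gaussian values via its empirical CDF."""
    n = len(samples)
    order = sorted(range(n), key=lambda i: samples[i])
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    std_normal = NormalDist()
    # Dividing by n + 1 (an assumed convention) keeps the empirical CDF inside (0, 1).
    return [std_normal.inv_cdf(r / (n + 1)) for r in ranks]

# Heavily skewed toy data: only the ordering matters for the latent values.
skewed = [0.1, 0.2, 0.4, 1.5, 30.0]
z = normal_scores(skewed)
```

Because only ranks enter the transform, the latent values are unaffected by the shape of the marginal distribution; the dependence structure is then learned on `z` rather than on the raw data.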
More explicitly, we propose two types of copula-based graphical models, namely the copula Gaussian hidden variable graphical model (CGHVGM) and the copula Gaussian multiscale graphical model (CSIM). These are copula-based extensions of Gaussian hidden variable graphical models (Chandrasekaran et al., 2012) and Gaussian multiscale graphical models (Choi et al., 2010), respectively, and therefore handle non-Gaussian data together with hidden variables or long-range dependence.
CGHVGM is well suited to data that are arbitrarily distributed and affected by hidden variables. For instance, in computational biology, one typically measures the expression of a limited subset of proteins chosen among the large number of proteins of an organism. The observed proteins may be strongly affected by proteins that have not been measured (yellow nodes in Fig. 1(a)), and these unobserved proteins may in turn be treated as hidden variables in a statistical model. Results of different methods to infer the network between the observed proteins are shown in Fig. 1.
Fig. 1. Results of different methods on the cell signaling data.
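Why hidden variables break the sparsity assumption can be seen from a short calculation. The block notation below is ours (O for observed, H for hidden variables); it states the standard Schur-complement identity that underlies the sparse-plus-low-rank formulation of Chandrasekaran et al. (2012).

```latex
% Joint precision matrix over observed (O) and hidden (H) variables
K = \begin{pmatrix} K_O & K_{OH} \\ K_{HO} & K_H \end{pmatrix}
% Precision matrix of the observed variables after marginalizing out H
\tilde{K}_O = \Sigma_O^{-1} = K_O - K_{OH} K_H^{-1} K_{HO}
```

Even when K_O is sparse, the correction term is generally dense, so a plain sparse estimate of the observed precision matrix is misleading. With only a few hidden variables the correction term has low rank, which is what makes the decomposition identifiable in practice.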
On the other hand, CSIM is useful when long-range dependence exists between variables that are far apart. The graphical structure of CSIM is shown in Fig. 2. The bottom scale represents the observed non-Gaussian variables, while the remaining scales model the complex dependencies between these variables in the latent Gaussian layer. In this model, the dependencies are divided into two types: long-range dependence, captured by the quadtree, and short-range dependence within each scale, modeled by the sparse in-scale conditional covariance. We apply the model to infer missing data in several geophysical applications, including surface temperature and Asian rainfall patterns. The result for the latter is shown in Fig. 3.
Fig. 2. Graphical model structure of CSIM.
Fig. 3. Mean absolute error (MAE) for inferring missing data in the Asian precipitation data set, shown as a function of the number of observed sites. The MAE decreases as the number of observations grows.
Venkat Chandrasekaran, Pablo A. Parrilo, Alan S. Willsky, Latent variable graphical model selection via convex optimization, Annals of Statistics, Vol. 40, No. 4, pp. 1935-1967, 2012.
Myung Jin Choi, Venkat Chandrasekaran, and Alan S. Willsky, Gaussian Multiresolution Models: Exploiting Sparse Markov and Covariance Structure, IEEE Transactions on Signal Processing, Vol. 58, pp. 1012-1024, 2010.
Hang Yu, Justin Dauwels, Xu Zhang, Shiyan Xu, and Wayne Isaac T. Uy, Copula Gaussian Multiscale Graphical Models with Application to Geophysical Modeling, Fusion 2012 - 15th International Conference on Information Fusion, July 09 - 12 2012, Singapore. [ PDF ]
Hang Yu, Justin Dauwels, Xueou Wang, Copula Gaussian Graphical Model with Hidden Variables, ICASSP 2012, Mar 25-30, 2012, Kyoto, Japan. [ PDF ]
J. Dauwels, Hang Yu, Xueou Wang, F. Vialatte, C. Latchoumane, J. Jeong, and A. Cichocki, Inferring Brain Networks through Graphical Models with Hidden Variables, Machine Learning and Interpretation in Neuroimaging, Lecture Notes in Computer Science 2012, pp 194-201 Springer. [ PDF ]
Copula-Gaussian Graphical Models for Discrete Data
Ongoing since 2010
JD, Yu Hang, Xu Shiyan, Wang Xueou
Copula Gaussian graphical models are powerful tools to describe dependencies between a large number of heterogeneous variables. However, they are only applicable to continuous random variables whereas discrete data in this digital age are becoming more prevalent. A typical example involving data of a discrete nature is the Facebook inference problem. Facebook has over 845 million monthly active users as of December 31, 2011. By using statistical models to analyze a large group of users' information, like personal profiles and historical activities (all of which are discrete data), it is possible to extract the users' own network and infer their personal preferences. In this way, Facebook can provide users with more relevant and accurate information, such as friend recommendations and personalized advertisements. Therefore, it is necessary to extend copula Gaussian graphical models to analyze discrete data.
However, applying copula-based graphical models to discrete data poses an obvious challenge: discrete data cannot be transformed directly into Gaussian data because the mapping is one-to-many. A common approach is to apply Markov chain Monte Carlo (MCMC) methods to simulate both the latent Gaussian variables and the posterior distribution of the precision matrix. Different priors have been selected for the precision matrix (or the covariance matrix), including the covariance selection prior, the inverse-Wishart distribution and the G-Wishart distribution.
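The one-to-many mapping can be made concrete: each discrete level corresponds to an interval of the latent Gaussian variable rather than to a single value. The sketch below assumes that the interval boundaries are set from the empirical cumulative frequencies; `latent_intervals` is a hypothetical helper written for this illustration.

```python
from collections import Counter
from statistics import NormalDist

def latent_intervals(values):
    """Return {level: (lower, upper)} bounds on the latent standard Gaussian."""
    n = len(values)
    counts = Counter(values)
    levels = sorted(counts)
    std_normal = NormalDist()
    intervals = {}
    cumulative = 0
    lower = float("-inf")
    for k, level in enumerate(levels):
        cumulative += counts[level]
        # The last level is unbounded above; every other level ends at the
        # Gaussian quantile of its cumulative relative frequency.
        is_last = k == len(levels) - 1
        upper = float("inf") if is_last else std_normal.inv_cdf(cumulative / n)
        intervals[level] = (lower, upper)
        lower = upper
    return intervals

# Toy binary survey answers (no = 0, yes = 1), made up for illustration.
answers = [0, 0, 0, 1, 1, 0, 1, 0]
bounds = latent_intervals(answers)
```

Any latent value inside the interval of the observed level is consistent with the data, which is why the latent variables have to be sampled or integrated out instead of being computed directly.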
In this work, low-complexity algorithms are proposed for learning copula Gaussian graphical models from discrete data. We aim to infer a point estimate of the precision matrix (inverse covariance matrix) in the latent layer, which characterizes the graphical model structure. The proposed method is reliable and, at the same time, much more efficient than the full MCMC approach, since it avoids the costly Monte Carlo simulation of the posterior distribution of the precision matrix. Instead, we apply Monte Carlo expectation maximization. In the E-step, efficient Gibbs sampling for truncated Gaussians is used to compute the posterior distribution of the latent Gaussian variables. In the M-step, the precision matrix is estimated by first learning its structure through stability selection and then learning its parameters using iterative proportional fitting (IPF). The model is validated on two social survey data sets; Fig. 1 displays the result for one of them. The corresponding eight variables are as follows: a) Wife economically active (no, yes); b) Age of wife > 38 (no, yes); c) Husband unemployed (no, yes); d) Child <= 4 (no, yes); e) Wife's education, high-school+ (no, yes); f) Husband's education, high-school+ (no, yes); g) Asian origin (no, yes); h) Other household member working (no, yes). The result shows that the variables connected to a ("Wife economically active") are c ("Husband unemployed"), d ("Child <= 4"), e ("Wife's education"), and g ("Asian origin"), which agrees with common sense and with previous research (e.g., Whittaker's method).
Fig. 1 Results of different methods on the Rochdale data: (a) Whittaker's method; (b) glasso; (c) continuous copula glasso; (d) discrete copula glasso.
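A single Gibbs update of the kind used in the E-step can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes a zero-mean latent Gaussian with a known precision matrix, takes the truncation interval implied by the observed discrete value as given, and samples by inverting the conditional CDF. The function name `gibbs_update` and the toy numbers are ours.

```python
import random
from statistics import NormalDist

def gibbs_update(j, z, precision, lower, upper, rng):
    """Resample z[j] from its Gaussian conditional, truncated to (lower, upper)."""
    k_jj = precision[j][j]
    # Conditional of a zero-mean Gaussian with precision matrix K:
    # mean = -(1 / K_jj) * sum_{k != j} K_jk * z_k, variance = 1 / K_jj.
    mean = -sum(precision[j][k] * z[k] for k in range(len(z)) if k != j) / k_jj
    cond = NormalDist(mean, (1.0 / k_jj) ** 0.5)
    p_lo = 0.0 if lower == float("-inf") else cond.cdf(lower)
    p_hi = 1.0 if upper == float("inf") else cond.cdf(upper)
    # Inverse-CDF sampling restricted to the truncation interval.
    u = rng.uniform(p_lo, p_hi)
    # Guard against u = 0 or 1, where inv_cdf is undefined; for intervals far in
    # the tails a dedicated truncated-Gaussian sampler would be preferable.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return cond.inv_cdf(u)

rng = random.Random(1)
latent = [0.8, 1.2]                      # current latent state (toy values)
toy_precision = [[1.0, -0.5], [-0.5, 1.0]]
sample = gibbs_update(0, latent, toy_precision, 0.3, float("inf"), rng)
```

Sweeping this update over all variables and all samples yields draws of the latent Gaussian layer, from which the sufficient statistics needed in the M-step can be averaged.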
J. Dauwels, Hang Yu, Shiyan Xu, and Xueou Wang, Copula Gaussian Graphical Model for Discrete Data, ICASSP, May 26-31, 2013, Vancouver, Canada. [ PDF ]
Change Points Detection and Copula-Gaussian Graphical Model Inference for Piecewise-stationary Time Series
Ongoing since 2013
JD, Yu Hang, Li Chenyang
Graphical models are powerful tools to describe complex systems. Sparse graphical models in particular are currently popular, as they allow us to infer network structure from multiple time series (e.g., functional brain networks from multichannel electroencephalograms). So far, most of the literature deals with stationary time series. However, real-life time series are often non-stationary, and statistical models designed for stationary data may not yield accurate results. For example, during epileptic seizures, functional brain networks are shown to evolve through a sequence of distinct topologies. Inferring such evolving networks in the framework of graphical models has received little attention until now.
In this work, techniques are proposed to infer graphical models from piecewise stationary time series: first, change points are detected in the time series, and then a graphical model is inferred for each stationary segment.
We establish Gaussian copula graphical models for non-stationary, in particular piecewise stationary, time series. Since those graphical models rely on copulas, they are applicable to non-Gaussian data. In order to reduce the computational complexity, we decouple change point detection from graphical model inference. Specifically, we first detect the change points by minimizing a cost function defined on the covariance matrix using the low-complexity Pruned Exact Linear Time (PELT) method. We next learn the graphical model, with and without hidden variables, from the covariance of each stationary time segment. The procedure also infers the number of change points automatically. We apply the proposed model to scalp electroencephalograms (EEG) recorded during epileptic seizures. The results for one patient are summarized in Fig. 1. Previous studies analyzed the dynamics of functional networks throughout the entire seizure in intracranial electrocorticogram (ECoG) recordings and found that the networks are dense at seizure onset and termination, but sparse during the middle portion of the seizure. Interestingly, although scalp EEG is noisier than intracranial ECoG recordings, the proposed method finds the same pattern.
Fig. 1. Results of functional networks resulting from graphical models without hidden variables (a)-(g) and with hidden variables (h)-(n) (with the vertical lines denoting the onset and end of the seizure)
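The change point step can be illustrated with a compact PELT implementation. To keep the sketch short, we deliberately swap the covariance-based cost of the paper for a univariate squared-error cost (changes in mean); the penalty value and the synthetic signal are arbitrary choices made for this example.

```python
import random

def pelt(x, penalty):
    """Return the change points (segment start indices) that minimize
    the sum of segment costs plus `penalty` per change point."""
    n = len(x)
    s1, s2 = [0.0], [0.0]  # prefix sums of x and x**2
    for v in x:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def cost(a, b):
        # Squared error of x[a:b] around its own mean (toy cost, see lead-in).
        return s2[b] - s2[a] - (s1[b] - s1[a]) ** 2 / (b - a)

    best = [0.0] * (n + 1)   # best[t]: optimal penalized cost of x[:t]
    best[0] = -penalty
    prev = [0] * (n + 1)     # prev[t]: start of the last segment of x[:t]
    candidates = [0]
    for t in range(1, n + 1):
        vals = [best[a] + cost(a, t) for a in candidates]
        i = min(range(len(vals)), key=vals.__getitem__)
        best[t] = vals[i] + penalty
        prev[t] = candidates[i]
        # Pruning: a candidate that is already worse than best[t] without the
        # penalty can never become optimal for a later end point.
        candidates = [a for a, v in zip(candidates, vals) if v <= best[t]] + [t]

    change_points, t = [], n
    while prev[t] > 0:
        t = prev[t]
        change_points.append(t)
    return change_points[::-1]

rng = random.Random(0)
signal = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(5, 1) for _ in range(100)]
detected = pelt(signal, penalty=30.0)
```

The number of change points is not specified in advance; it follows from the penalty, which mirrors the automated selection described above. The pruning step is what keeps the expected run time close to linear when change points occur regularly.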
Hang Yu, Chenyang Li, and Justin Dauwels, Network Inference and Change Points Detection for Piecewise-Stationary Time Series, ICASSP, May 4-9 2014, Florence, Italy. [ PDF ]
Spatial Copula-Gaussian Graphical Models for Extreme Events
Ongoing since 2010
JD, Philip Jonathan (Shell Technology Centre Thornton, UK), Yu Hang, Choo Zheng (currently at University of Oxford), Zhou Qiao, Wayne Isaac Tan Uy (currently at Cornell University)
Natural extreme events, such as heat waves, high rainfall and snowfall, tides and windstorms, have a devastating impact on our society, causing loss of property and life. In the 2010 flooding in Pakistan, for example, 20 million people were directly affected by the tragedy, mostly through destruction of property, livelihood, and infrastructure, and the death toll was close to 2,000. There is thus a pressing need to study such events and to develop models that help predict them.
In this work, we plan to construct two types of statistical models, namely, marginal models and joint models. The objective of marginal analysis is to model the marginal distribution at each location accurately and further select the optimal locations or set design criteria based on the accurate model. On the other hand, a joint model is required when analyzing the extreme events globally. Previous works have shown that there are very strong correlations between extreme events occurring in different places with many of the maxima occurring simultaneously or as part of the same sequence of a few successive days. The model can then be used to further predict the probability of an extreme event occurring at one location conditioned on the extreme events that have occurred at other locations.
With regard to marginal analysis, we employ the generalized Pareto (GP) distribution to model the marginals of peaks over threshold, in accordance with extreme value theory. It has been shown that taking into account the spatial dependence between the GP parameters at different measuring sites improves the accuracy of the estimates. Here, we use a Gauss-Markov random field prior with thin-membrane structure to model the GP parameters. We then follow an empirical Bayes approach to infer the model parameters. The GP parameters are inferred by Gaussian message passing, resulting in (approximate) posterior marginal distributions. The smoothness parameters of the thin-membrane models, which are hyperparameters in the overall model, are instead inferred by expectation maximization (EM), resulting in point estimates. The plots in Fig. 1 compare the mean square error (MSE) of the local and MRF-GP estimates.
Fig. 1 Mean square error for ML and MRF-GP estimates in two case studies
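The effect of a thin-membrane prior can be sketched in a few lines. The example assumes that each site provides a noisy local estimate with known precision and computes the posterior mean by Gauss-Seidel sweeps; this plain iterative solver stands in for the Gaussian message passing used in the work, and the chain graph, weights and values are made up for illustration.

```python
def smooth_thin_membrane(y, weights, neighbors, alpha, sweeps=200):
    """Posterior mean of z under a thin-membrane prior and Gaussian observations.

    Prior:       p(z) proportional to exp(-alpha / 2 * sum over edges (z_i - z_j)^2)
    Observation: y_i ~ N(z_i, 1 / weights[i])
    """
    z = list(y)
    for _ in range(sweeps):
        for i, nbrs in enumerate(neighbors):
            # Coordinate-wise solution of (alpha * L + W) z = W y (Gauss-Seidel),
            # where L is the graph Laplacian and W = diag(weights).
            z[i] = (weights[i] * y[i] + alpha * sum(z[j] for j in nbrs)) / (
                weights[i] + alpha * len(nbrs)
            )
    return z

def roughness(z, neighbors):
    """Sum of squared differences over all edges, the quantity the prior penalizes."""
    return sum((z[i] - z[j]) ** 2 for i, nbrs in enumerate(neighbors) for j in nbrs if i < j)

# Chain of 6 sites; the numbers stand in for noisy per-site parameter estimates.
chain = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
local = [1.0, 1.6, 0.9, 1.4, 1.1, 1.5]
smoothed = smooth_thin_membrane(local, [1.0] * 6, chain, alpha=2.0)
```

A larger smoothness parameter `alpha` pulls neighboring estimates closer together, while `alpha = 0` returns the local estimates unchanged. In the approach described above, this parameter is not hand-tuned but learned by EM.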
We further build the model for joint analysis by removing the conditional independence assumption of the marginal analysis model. The proposed model is derived as follows. The block maxima at each location are assumed to follow a Generalized Extreme Value (GEV) distribution. Spatial dependence is then modeled in two complementary ways. The GEV parameters are first coupled through a thin-membrane model, a specific type of Gaussian graphical model often used as smoothness prior. The extreme events, on the other hand, are coupled through a copula Gaussian graphical model with the precision matrix corresponding to a (generalized) thin-membrane model. We then derive inference and interpolation algorithms for the proposed model and validate the approach on synthetic data as well as real data related to hurricanes in the Gulf of Mexico. The numerical results we have obtained suggest that it can accurately describe extreme events in a spatial domain, and can reliably interpolate extreme values at arbitrary sites.
Fig. 2. Interpolation of the maximum wave heights caused by a storm (irregular grid covering the Gulf of Mexico). (a) True graph; (b) Observed sites (black) and unobserved sites (red) on the grid; Interpolation by (c) the copula MRF-GEV model; (d) MRF-GEV model; (e) copula GEV model; (f) Gaussian model.
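The two layers of spatial dependence can be summarized in formulas. The notation is ours: \mu_i, \sigma_i, \xi_i are the GEV location, scale and shape at site i, \Phi is the standard normal CDF, and K denotes the precision matrix of the latent Gaussian layer. Normalization details of the copula are omitted, so this should be read as a sketch of the model structure.

```latex
% GEV marginal for the block maximum x_i at site i (case \xi_i \neq 0)
F_i(x_i) = \exp\left\{ -\left[ 1 + \xi_i \, \frac{x_i - \mu_i}{\sigma_i} \right]^{-1/\xi_i} \right\},
\qquad 1 + \xi_i \, \frac{x_i - \mu_i}{\sigma_i} > 0
% Gaussian copula coupling of the extremes across sites
z_i = \Phi^{-1}\big( F_i(x_i) \big), \qquad z \sim \mathcal{N}\big( 0, K^{-1} \big)
```

The first layer of dependence acts on the parameters (\mu_i, \sigma_i, \xi_i) through the thin-membrane prior; the second acts on the transformed extremes z_i through the sparsity pattern of K.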
Hang Yu, Zheng Choo, Wayne Isaac T. Uy, Justin Dauwels, Philip Jonathan, Modeling Extreme Events in Spatial Domain by Copula Graphical Models, Fusion 2012 - 15th International Conference on Information Fusion, July 09 - 12 2012, Singapore. [ PDF ]
Hang Yu, Zheng Choo, Justin Dauwels, Philip Jonathan, and Qiao Zhou, Modeling Spatial Extreme Events using Markov Random Field Priors, 2012 IEEE International Symposium on Information Theory, July 1 to July 6, 2012, MIT, Cambridge, MA. [ PDF ]
Spatial Copula-Gaussian Graphical Models for Extreme Events with Multiple Covariates
2013 - 2014
JD, Philip Jonathan, Yu Hang, Cheng Jingjing
Extreme-value theory governs the statistical behavior of extreme values of variables, such as extreme wave heights during hurricanes. The theory provides closed-form distribution functions for the extremes of single variables (marginals), such as block maxima (monthly or annually) and peaks over a sufficiently high threshold. The main challenge in fitting such distributions to measurements is the lack of data, as extreme events are by definition very rare. The problem can be alleviated by assuming that all the collected data (e.g., extreme wave heights at different measuring sites) are stationary and follow the same distribution. After combining all the data, the resulting sample size is sufficiently large to yield apparently reliable estimates. However, there usually exists clear heterogeneity in the extreme-value data caused by the underlying mechanisms that drive the weather events. Extreme temperature, for example, is greatly influenced by the altitude of the measuring site. The latter can be regarded as a covariate. Accommodating heterogeneity in the model is essential since the estimated model will be unreliable otherwise. In order to handle both heterogeneity as well as the problem of small sample size, the interactions among extreme events with different covariate values are often exploited. For instance, extreme temperatures at similar altitudes behave similarly, implying that the parameters of the corresponding extreme-value distributions vary smoothly with the covariate (i.e., altitude). Such prior knowledge may help to improve the fitting of extreme-value distributions.
In this work, we aim to exploit graphical models to incorporate multiple covariates in the extreme-value model. The interdependencies between extreme values with different covariates are often highly structured. This structure can in turn be leveraged by the graphical model framework to yield very efficient algorithms.
The proposed model is derived from the following ideas: Motivated by the extreme value theory, the extreme events are assumed to follow the fat-tailed Generalized Extreme Value (GEV) distributions. The parameters of those GEV distributions are further assumed to depend smoothly on the covariates. As an example, we model the storm-wise maxima of significant wave heights in the Gulf of Mexico (see Fig. 6), where the covariates are longitude, latitude, and wind direction. To facilitate the use of graphical models, we discretize the continuous covariates within a finite range. In the example of extreme wave heights in the Gulf of Mexico, space is discretized as a finite homogeneous two-dimensional lattice, and the wind direction is discretized in a finite number of equal-sized sectors. More generally, the GEV distributed variables (and hence also the GEV parameters) are defined on a finite number of points indexed by the (discretized) covariates. We characterize the dependence between the GEV parameters through a graphical model prior, in particular, a multidimensional thin-membrane model where edges are only present between pairs of neighboring points (see Fig. 1d). We demonstrate that the multidimensional model can be constructed flexibly from one-dimensional thin-membrane models for each covariate (cf. Fig. 1a-b). The proposed model can therefore easily be extended to cope with an arbitrary number of covariates. We follow the empirical Bayes approach to learn the parameters and hyper-parameters. Specifically, both the smoothed GEV parameters and the smoothness parameters are inferred via Expectation Maximization. A major challenge lies in the scalability of the algorithm since the dimension of the model is usually quite large. We therefore take advantage of the special pattern of the eigenvalues and eigenvectors corresponding to the one-dimensional thin-membrane models, and derive an efficient inference algorithm specialized for the proposed model.
Fig. 1. Thin-membrane models: (a) chain graph; (b) circle graph; (c) lattice; (d) spatio-directional model.
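One way to read the construction from one-dimensional building blocks, stated here as our assumption and not as the exact formulation of the work, is a Kronecker sum of one-dimensional graph Laplacians with unit edge weights. Its eigenstructure then follows directly from that of the building blocks.

```latex
% L_a (n_a x n_a) and L_b (n_b x n_b): Laplacians of two one-dimensional thin-membrane models
K = \alpha_a \, (L_a \otimes I_{n_b}) + \alpha_b \, (I_{n_a} \otimes L_b)
% If L_a u_k = \lambda_k u_k and L_b v_l = \nu_l v_l, then
K \, (u_k \otimes v_l) = (\alpha_a \lambda_k + \alpha_b \nu_l) \, (u_k \otimes v_l)
% Closed-form spectra of the chain and circle graphs with n nodes
\lambda_k^{\mathrm{chain}} = 2 - 2\cos\left( \frac{\pi k}{n} \right), \qquad
\lambda_k^{\mathrm{circle}} = 2 - 2\cos\left( \frac{2 \pi k}{n} \right), \qquad k = 0, \dots, n-1
```

Under this reading, the full precision matrix never has to be decomposed numerically: the eigenvalues are sums of known one-dimensional eigenvalues, and adding a covariate adds one more Kronecker factor.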
We further investigate the extreme wave heights in the Gulf of Mexico. To test the influence of the number of directional sectors on the results, we consider 6, 8, 10, 12, 15 and 18 sectors in turn and compute the AIC score for each choice. The results are summarized in Fig. 2. The proposed spatio-directional model (SDM) always achieves the best AIC score. Moreover, the performance of this model is not sensitive to the chosen number of directional sectors. In practice, we propose to choose the number that minimizes the AIC score, which is 12 for this data set.
Fig. 2. AIC for various models, including the proposed SDM.
Spatio-temporal Copula-Gaussian Graphical Models for Extreme Events
2013 - 2014
JD, Yu Hang, Zhang Liaofan (currently at CMU)
Analysis of multiple extreme-value time series has found applications and permeated the literature in a wide variety of domains, ranging from finance to climatology. For example, extreme precipitation can characterize climate change and cause flood or flash-flood related hazards. Therefore, assessing the temporal pattern of such events and making reliable predictions of future trends is crucial for risk management and disaster prevention.
In this work, we propose to exploit undirected graphical models (i.e., Markov random fields) to capture the spatial and temporal dependencies among GEV parameters. We aim to estimate the temporal pattern of extreme events, such as the trend or seasonality of the data in time. Furthermore, we intend to predict the distribution of extreme events in the future based on the current trend. In the example of extreme rainfall, forecasting whether the size of extremes will increase in the future is key for flood warning and strategic planning.
Toward this goal, we first assume that the block maximum (specifically, the monthly maximum) at each site and time position (i.e., each month) follows a GEV distribution. We further assume that the shape and scale parameters of the GEV distributions are constant over the spatio-temporal domain; the remaining location parameters therefore characterize the variation and dependence in the data. This dependence is encoded by imposing a Gauss-Markov random field (GMRF) prior on the location parameters, as shown in Fig. 1.
Fig. 1. Neighborhood structure of the spatio-temporal graphical model.
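A neighborhood structure of this kind can be generated programmatically. The sketch below assumes the simplest possible choice, namely that each location parameter is linked to its four spatial neighbors on a regular lattice and to the same site at the previous and next time point; the actual structure in Fig. 1 may contain further links (for instance seasonal ones). The function name and the grid size are illustrative.

```python
def spatio_temporal_edges(rows, cols, steps):
    """List the undirected edges of a rows x cols lattice replicated over `steps` time points."""
    def node(r, c, t):
        # Flatten (row, column, time) into a single node index.
        return (t * rows + r) * cols + c

    edges = []
    for t in range(steps):
        for r in range(rows):
            for c in range(cols):
                if c + 1 < cols:   # spatial neighbor to the east
                    edges.append((node(r, c, t), node(r, c + 1, t)))
                if r + 1 < rows:   # spatial neighbor to the south
                    edges.append((node(r, c, t), node(r + 1, c, t)))
                if t + 1 < steps:  # same site at the next time point
                    edges.append((node(r, c, t), node(r, c, t + 1)))
    return edges

# Example: a 5 x 5 grid observed over 120 months gives 3000 location parameters.
edges = spatio_temporal_edges(rows=5, cols=5, steps=120)
```

Each edge contributes one non-zero off-diagonal entry (and its mirror) to the GMRF precision matrix, so the matrix stays very sparse even though the model has thousands of parameters.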
We follow the empirical Bayes approach to estimate all the parameters. In particular, both the GEV parameters and the hyperparameters of the GMRF prior are learnt via expectation maximization (EM). To cope with the complicated functional form of GEV distributions, we recast the problem as a tractable one using a series of approximations. First, for each shape parameter in a given discrete domain, the GEV density is represented as a mixture of Gaussians. We then select the shape parameter that maximizes the evidence of the data. In addition, we employ parallel expectation propagation (EP) to compute the sufficient statistics in the E-step. Interestingly, the Gaussian mixture representation of GEV distributions enables a fast implementation of the EP algorithm. We validate the proposed model using both synthetic and real data. Results on synthetic data show that the proposed model can automatically recover the underlying pattern of location parameters across both space and time, although only one sample is observed at each location and time point. Moreover, we consider extreme precipitation in Thailand, which has been the major cause of severe floods in recent years. We select a 5 × 5 grid with spacing 0.75°, and then fit the proposed spatio-temporal model using 120 monthly maxima (i.e., 10 years). We then predict the location parameters in the 11th year. We also compute the 95% confidence interval of the extreme rainfall amount given the predicted GEV parameters. The result for one arbitrarily chosen location is shown in Fig. 2. The observed extremes lie within the predicted 95% confidence interval, which supports the reliability of the prediction.
Fig. 2. Estimation and prediction of location parameters across time at one location for extreme precipitation in Thailand.
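Given predicted GEV parameters, an interval for the extreme rainfall amount follows from the GEV quantile function. The sketch below assumes a central interval obtained from the 2.5% and 97.5% quantiles; the parameter values are placeholders and are not estimates from the Thailand data.

```python
import math

def gev_quantile(p, mu, sigma, xi):
    """Quantile of the GEV distribution with location mu, scale sigma and shape xi."""
    if abs(xi) < 1e-12:  # Gumbel limit
        return mu - sigma * math.log(-math.log(p))
    return mu + sigma / xi * ((-math.log(p)) ** (-xi) - 1.0)

def gev_cdf(x, mu, sigma, xi):
    """CDF of the GEV distribution, including the region outside its support."""
    if abs(xi) < 1e-12:
        return math.exp(-math.exp(-(x - mu) / sigma))
    s = 1.0 + xi * (x - mu) / sigma
    if s <= 0.0:
        return 0.0 if xi > 0 else 1.0
    return math.exp(-(s ** (-1.0 / xi)))

def central_interval(mu, sigma, xi, level=0.95):
    """Interval between the (1 - level) / 2 and (1 + level) / 2 quantiles."""
    tail = (1.0 - level) / 2.0
    return gev_quantile(tail, mu, sigma, xi), gev_quantile(1.0 - tail, mu, sigma, xi)

# Placeholder parameters for one site and month.
low, high = central_interval(mu=50.0, sigma=10.0, xi=0.1)
```

With a positive shape parameter the interval is strongly asymmetric around the location parameter, reflecting the heavy upper tail of the distribution.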
Spatial Graphical Models for Extreme Events via Ensemble-of-Trees of Pairwise Copulas
Ongoing since 2012
JD, Yu Hang, Wayne Isaac Tan Uy
Assessing the risk of extreme events in a spatial domain, such as hurricanes, floods and droughts, is of great practical importance. To assess the likelihood of such events, statisticians have developed extreme-value theory, yielding statistical models that can reliably capture extreme events occurring in a spatial domain. Unfortunately, the existing extreme-value models are often limited to tens of variables. Yet many practical statistical problems, for instance in Earth Sciences, involve thousands or millions of sites (variables). Graphical models can easily handle such large numbers of variables; nevertheless, they have not yet been applied in the realm of extreme-event analysis. In this work, we intend to address this gap by introducing an extreme-value graphical model, i.e., ensemble-of-trees of pairwise copulas.
We propose a class of graphical models for extreme-value analysis in a spatial domain. Our main idea is to construct graphical models with pairwise copulas as building blocks. Concretely, we develop ensemble-of-trees models of pairwise copulas (ETPC). As a starting point, the sites in the spatial domain are arranged on a lattice. The probability density function (PDF) of the ETPC model is a weighted sum of the PDFs of all possible spanning trees on that lattice. The PDFs of these trees are in turn constructed from pairwise copulas.
Fig. 1. ETPC model: the lattice (a) and several decomposed spanning trees (b) - (g).
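In formulas, the construction reads as follows. The notation is ours: f_i and F_i are the marginal density and CDF at site i, c_{ij} is the pairwise copula density on edge (i, j), and w_T is the weight of spanning tree T. The first line is the standard factorization of a tree-structured copula model.

```latex
% Density of a single spanning tree T with edge set E_T
p_T(x) = \prod_{i} f_i(x_i) \prod_{(i,j) \in E_T} c_{ij}\big( F_i(x_i), F_j(x_j) \big)
% Ensemble over all spanning trees of the lattice
p(x) = \sum_{T} w_T \, p_T(x), \qquad w_T \ge 0, \quad \sum_{T} w_T = 1
```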
In this setting, the extremes can be modeled as asymptotically tail dependent or independent by choosing the pairwise copulas appropriately. We have proven that tail dependence in the ETPC model is preserved if all the pairwise copulas in the model are tail dependent. We propose efficient learning algorithms, and also derive scalable inference algorithms to impute extreme values at unobserved locations.
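The distinction between tail-dependent and tail-independent copulas can be checked numerically. The sketch below evaluates P(U2 > u | U1 > u) close to u = 1 for a Gumbel copula, which is tail dependent, and for the independence copula. Both are illustrative choices on our part: the Gumbel family is not necessarily the one used in the ETPC model, and the independence copula stands in for a tail-independent copula because it needs no bivariate normal CDF.

```python
def upper_tail_conditional(copula_diag, u):
    """P(U2 > u | U1 > u), computed from the copula on the diagonal, C(u, u)."""
    return (1.0 - 2.0 * u + copula_diag(u)) / (1.0 - u)

def gumbel_diag(theta):
    # Gumbel copula on the diagonal: C(u, u) = u ** (2 ** (1 / theta)), theta >= 1.
    return lambda u: u ** (2.0 ** (1.0 / theta))

def independence_diag(u):
    # Independence copula: C(u, u) = u * u.
    return u * u

u = 0.9999
gumbel_value = upper_tail_conditional(gumbel_diag(2.0), u)
independent_value = upper_tail_conditional(independence_diag, u)
# Known limit for the Gumbel copula as u -> 1: 2 - 2 ** (1 / theta).
gumbel_limit = 2.0 - 2.0 ** (1.0 / 2.0)
```

For the Gumbel copula the conditional exceedance probability stays close to its positive limit, while for the independence copula it shrinks to zero as u approaches 1. A model built on tail-independent copulas therefore treats joint extremes at neighboring sites as increasingly unrelated, however strong the dependence in the bulk of the data.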
To assess our model, we consider the extreme precipitation from 1900 to 2011 in the Japanese archipelago. We select four 10×10 regular grids in South Japan and extract the extreme precipitation values. We learn the model and then impute the rainfall amounts at unobserved sites given the sites on the boundary. Fig. 2 shows color plots of the imputation results for the most challenging case, i.e., a missing 6×6 area for the 99th quantile inside Region 1.
Fig. 2. Color plots of the imputed 6×6 area in the first region for the 99th quantile.
As shown in Fig. 2, the ETPCG model and the CGGM (i.e., the models based on the tail-independent Gaussian copula) are unable to recover reliable estimates of the maximum precipitation values. While all three models underestimate the largest precipitation amount, the ETPCm model yields more reliable estimates. Intuitively, this is because the rainfall amounts in this case are extreme and lie in the tail of the distribution, which the Gaussian copula cannot model. This also suggests that the proposed model can reliably simulate extreme events in a spatial domain given the boundary conditions.