PUBLICATIONS

All/ Time Series/ Transfer Learning/ Sparse Learning/ Ensemble Learning/
Structured Prediction/ Topic Models/ Anomaly Detection/ Application/
Application Fields: Climate/ Healthcare/ Social Media/

Learning Influence Functions from Incomplete Observations

Xinran He, Ke Xu, David Kempe and Yan Liu. Advances in Neural Information Processing Systems (NIPS), 2016.
We study the problem of learning influence functions under incomplete observations of node activations. Incomplete observations are a major concern as most (online and real-world) social networks are not fully observable. We establish both proper and improper PAC learnability of influence functions under uniformly randomly missing observations. Proper PAC learnability under the Discrete-Time Linear Threshold (DLT) and Discrete-Time Independent Cascade (DIC) models is established by essentially reducing incomplete observations to complete observations in a modified graph. Our improper PAC learnability result applies for the DLT and DIC models as well as the Continuous-Time Independent Cascade (CIC) model. It is based on a parametrization in terms of reachability features, and also gives rise to an efficient and practical heuristic. Experiments on synthetic and real-world datasets demonstrate the ability of our method to compensate even for fairly large loss rates.


SPALS: Fast Alternating Least Squares via Implicit Leverage Scores Sampling

Dehua Cheng, Richard Peng, Ioakeim Perros, and Yan Liu. Advances in Neural Information Processing Systems (NIPS), 2016.
Tensor CANDECOMP/PARAFAC (CP) decomposition is a powerful but computationally challenging tool in modern data analytics. In this paper, we show ways of sampling intermediate steps of alternating minimization algorithms for computing low rank tensor CP decompositions, leading to the sparse alternating least squares (SPALS) method. Specifically, we sample the the Khatri-Rao product, which arises as an intermediate object during the iterations of alternating least squares. This product captures the interactions between different tensor modes, and form the main computational bottleneck for solving many tensor related tasks. By exploiting the spectral structures of the matrix Khatri-Rao product, we provide efficient access to its statistical leverage scores. When applied to the tensor CP decomposition, our method leads to the first algorithm that runs in sublinear time per-iteration and approximates the output of deterministic alternating least squares algorithms. Empirical evaluations of this approach show significantly speedups over existing randomized and deterministic routines for performing CP decomposition. On a tensor of the size 2.4m by 6.6m by 92k with over 2 billion nonzeros formed by Amazon product reviews, our routine converges in two minutes to the same error as deterministic ALS.


Interpretable Deep Models for ICU Outcome Prediction

Z. Che, S. Purushotham, R. Khemani, and Y. Liu. American Medical Informatics Assocation Annual Symposium (AMIA), 2016.
Exponential surge in health care data, such as longitudinal data from electronic health records (EHR), sensor data from intensive care unit (ICU), etc., is providing new opportunities to discover meaningful data-driven characteristics and patterns of diseases. Recently, deep learning models have been employed for many computational phenotyping and healthcare prediction tasks to achieve state-of-the-art performance. However, deep models lack interpretability which is crucial for wide adoption in medical research and clinical decision-making. In this paper, we introduce a simple yet powerful knowledge-distillation approach called interpretable mimic learning, which uses gradient boosting trees to learn interpretable models and at the same time achieves strong prediction performance as deep learning models. Experiment results on Pediatric ICU dataset for acute lung injury (ALI) show that our proposed method not only outperforms state-of-the-art approaches for morality and ventilator free days prediction tasks but can also provide interpretable models to clinicians.


Latent Space Model for Road Networks to Predict Time-Varying Traffic

D. Deng, C. Shahabi, U. Demiryurek, L. Zhu, R. Yu and Yan Liu. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD'16), 2016.
Real-time traffic prediction from high-fidelity spatiotemporal traffic sensor datasets is an important problem for intelligent transportation systems and sustainability. However, it is challenging due to the complex topological dependencies and high dynamism associated with changing road conditions. In this paper, we propose a Latent Space Model for Road Networks (LSM-RN) to address these challenges holistically. In particular, given a series of road network snapshots, we learn the attributes of vertices in latent spaces which capture both topological and temporal properties. As these latent attributes are time-dependent, they can estimate how traffic patterns form and evolve. In addition, we present an incremental online algorithm which sequentially and adaptively learns the latent attributes from the temporal graph changes. Our framework enables real-time traffic prediction by 1) exploiting real-time sensor readings to adjust/update the existing latent spaces, and 2) training as data arrives and making predictions on-the-fly. By conducting extensive experiments with a large volume of real-world traffic sensor data, we demonstrate the superiority of our framework for real-time traffic prediction on large road networks over competitors as well as baseline graph-based LSM’s.


Learning from Multiway Data: Simple and Efficient Tensor Regression

R. Yu and Y. Liu. International conference on Machine Learning (ICML'16), 2016.
Tensor regression has shown to be advantageous in learning tasks with multi-directional relatedness. Given massive multiway data, traditional methods are often too slow to operate on or suffer from memory bottleneck. In this paper, we introduce subsampled tensor projected gradient to solve the problem. Our algorithm is impressively simple and efficient. It is built upon projected gradient method with fast tensor power iterations, leveraging randomized sketching for further acceleration. Theoretical analysis shows that our algorithm converges to the correct solution in fixed number of iterations. The memory requirement grows linearly with the size of the problem. We demonstrate superior empirical performance on both multi-linear multi-task learning and spatio-temporal applications.


Geographic Segmentation via Latent Poisson Factor Model

R. Yu, A. Gelfand, S. Rajan, C. Shahabi and Y. Liu.ACM International Conference on Web Search and Data Mining (WSDM), 2016.
Discovering latent structures in spatial data is of critical importance to understanding the user behavior of locationbased services. In this paper, we study the problem of geographic segmentation of spatial data, which involves dividing a collection of observations into distinct geo-spatial regions and uncovering abstract correlation structures in the data. We introduce a novel, Latent Poisson Factor (LPF) model to describe spatial count data. The model describes the spatial counts as a Poisson distribution with a mean that factors over a joint item-location latent space. The latent factors are constrained with weak labels to help uncover interesting spatial dependencies. We study the LPF model on a mobile app usage data set and a news article readership data set. We empirically demonstrate its effectiveness on a variety of prediction tasks on these two data sets.


Timeline Summarization from Social Media with Life Cycle Models

Yi Chang, Jiliang Tang, Dawei Yin, Makoto Yamada, Yan Liu. Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI), 2016.
The popularity of social media shatters the barrier for online users to create and share information at any place at any time. As a consequence, it has be- come increasing difficult to locate relevance infor- mation about an entity. Timeline has been proven to provide an effective and efficient access to un- derstand an entity by displaying a list of episodes about the entity in chronological order. However, summarizing the timeline about an entity with so- cial media data faces new challenges. First, key timeline episodes about the entity are typically un- available in existing social media services. Second, the short, noisy and informal nature of social me- dia posts determines that only content-based sum- marization could be insufficient. In this paper, we investigate the problem of timeline summarization and propose a novel framework Timeline-Sumy, which consists of episode detecting and summary ranking. In episode detecting, we explicitly model temporal information with life cycle models to de- tect timeline episodes since episodes usually ex- hibit sudden-rise-and-heavy-tail patterns on time- series. In summary ranking, we rank social me- dia posts in each episode via a learning-to-rank ap- proach. The experimental results on social media datasets demonstrate the effectiveness of the pro- posed framework.


Causal Phenotype Discovery via Deep Networks

D. Kale, Z. Che, M. Bahadori, W. Li, Y. Liu, and R. Wetzel. American Medical Informatics Assocation Annual Symposium (AMIA), 2015.
The rapid growth of digital health databases has attracted many researchers interested in using modern computational methods to discover and model patterns of health and illness in a research program known as computational phenotyping. Much of the work in this area has focused on traditional statistical learning paradigms, such as classification, prediction, clustering, pattern mining. In this paper, we propose a related but different paradigm called causal phenotype discovery, which aims to discover latent representations of illness that are causally predictive. We illustrate this idea with a two-stage framework that combines the latent representation learning power of deep neural networks with state-of-the-art tools from causal inference. We apply this framework to two large ICU time series data sets and show that it can learn features that are predictively useful, that capture complex physiologic patterns associated with critical illnesses, and that are potentially more clinically meaningful than manually designed features.


Deep Computational Phenotyping

Z. Che*, D. Kale*, W. Li, M. T. Bahadori, and Y. Liu (*Equal Contributions). ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD'15), 2015.
We apply deep learning to the problem of discovery and detection of characteristic patterns of physiology in clinical time series data. We propose two novel modifications to standard neural net training that address challenges and exploit properties that are peculiar to medical data. First, we develop a general framework for using prior knowledge to regularize parameters in the topmost layers. Second, we describe a scalable procedure for training a collection of neural networks of different sizes but with partially shared architectures. We demonstrate the empirical efficacy of both techniques on two real-world hospital data sets and show that the resulting neural nets learn interpretable and clinically relevant features.


Accelerated Online Low Rank Tensor Learning for Multivariate Spatiotemporal Streams

R. Yu, D. Cheng and Y. Liu. International conference on Machine Learning (ICML'15), 2015.
In this paper, we propose an online accelerated low-rank tensor learning algorithm (ALTO) to solve the problem. At each iteration, we project the current tensor to the subspace of low-rank tensors in order to perform efficient tensor decomposition, then recover the decomposition of the new tensor. By randomly glancing at additional subspaces, we successfully avoid local optima at negligible extra computational cost. We evaluate our method on two tasks in streaming multivariate spatio-temporal analysis: online forecasting and multi-model ensemble.


Functional Subspace Clustering with Application to Time Series

M. T. Bahadori, D. Kale, Y. Fan, and Y. Liu. International conference on Machine Learning (ICML'15), 2015.
Functional data, where samples are random functions, are increasingly common and important in a variety of applications, such as health care and traffic analysis. They are naturally high dimensional and lie along complex manifolds. These properties warrant use of subspace assumption, but most state-of-the-art subspace learning algorithms are limited to linear or other simple settings. To address these challenges, we propose a new functional learning framework for clustering, Functional Subspace Clustering (FSC). FSC assumes that each subspace is composed of deformed basis functions and formulates the subspace learning problem as a sparse regression over functionals. The resulting problem can be efficiently solved via greedy variable selection given the access to a fast deformation oracle. We provide theoretical guarantees for FSC and show how it can be applied to time series with warped alignments.


HawkesTopic: A Joint Model for Network Inference and Topic Modeling from Text-Based Cascades

X. He, T. Rekatsinas, J. Foulds, L. Getoor, Y. Liu. International conference on Machine Learning (ICML'15), 2015.
In this work, we develop the HawkesTopic model (HTM) for analyzing text-based cascades. HTM combines Hawkes processes and topic modeling to simultaneously reason about the information diffusion pathways and the topics characterizing the observed textual information. We show how to jointly infer them with a mean-field variational inference algorithm and validate our approach on both synthetic and real-world data sets, including a news media dataset for modeling information diffusion, and an ArXiv publication dataset for modeling scientific influence.


Efficient Sampling for Gaussian Graphical Models via Spectral Sparsification

Dehua Cheng, Yu Cheng, Yan Liu, Richard Peng, and Shang-Hua Teng. The 28th Conference on Learning Theory (COLT '15), 2015.
Motivated by a sampling problem basic to computational statistical inference, we develop a toolset based on spectral sparsification for a family of fundamental problems involving Gaussian sampling, matrix functionals, and reversible Markov chains. Drawing on the connection between Gaussian graphical models and the recent breakthroughs in spectral graph theory, we give the first nearly linear time algorithm for the following basic matrix problem: Given an n-by-n Laplacian matrix M and a constant -1<= p <= 1, provide efficient access to a sparse n-by-n linear operator C such that M^p \approx C C^T, where \approx denotes spectral similarity. When p is set to -1, this gives the first parallel sampling algorithm that is essentially optimal both in total work and randomness for Gaussian random fields with symmetric diagonally dominant (SDD) precision matrices. It only requires nearly linear work and 2n i.i.d. random univariate Gaussian samples to generate an n-dimensional i.i.d. Gaussian random sample in polylogarithmic depth.


Model Selection for Topic Models via Spectral Decomposition

Dehua Cheng*, Xinran He*, and Y. Liu (*Equal Contributions). The 18th International Conference on Artificial Intelligence and Statistics (AISTATS '15), 2015.
Topic models have achieved significant successes in analyzing large-scale text corpus. In practical applications, we are always confronted with the challenge of model selection, i.e., how to appropriately set the number of topics. Following recent advances in topic model inference via tensor decomposition, we make a first attempt to provide theoretical analysis on model selection in latent Dirichlet allocation. Under mild conditions, we derive the upper bound and lower bound on the number of topics given a text collection of finite size. Experimental results demonstrate that our bounds are accurate and tight. Furthermore, using Gaussian mixture model as an example, we show that our methodology can be easily generalized to model selection analysis for other latent models.


Hierarchical Active Transfer Learning

D. Kale, M. Ghazvininejad, A. Ramakrishna, J. He, and Y. Liu. 2015 SIAM International Conference on Data Mining (SDM '15), 2015.
We describe a unified active transfer learning framework called Hierarchical Active Transfer Learning (HATL) that exploits cluster structure shared between different data domains to perform transfer learning by imputing labels for unlabeled target data and to generate effective label queries during active learning. The resulting framework can also be used to perform unsupervised and semi-supervised transfer learning. We derive an intuitive and useful upper bound on HATL’s error and present results on synthetic data that confirm both intuition and our analysis. Finally, we demonstrate HATL’s empirical effectiveness on a benchmark data set for sentiment classification.


Spectral Sparsification of Random-Walk Matrix Polynomials

Dehua Cheng, Yu Cheng, Yan Liu, Richard Peng, and Shang-Hua Teng. arXiv, 2015.
In this paper, we develop the first nearly linear time algorithm for this sparsification problem: For any G with n vertices and m edges, d coefficients α, and ϵ>0, our algorithm runs in time O(d^2 m log2n/ϵ^2) to construct a Laplacian matrix L=D−A with O(nlogn/ϵ^2) non-zeros.. Matrix polynomials arise in mathematical analysis of matrix functions as well as numerical solutions of matrix equations. Our work is particularly motivated by the algorithmic problems for speeding up the classic Newton's method in applications such as computing the inverse square-root of the precision matrix of a Gaussian random field, as well as computing the qth-root transition (for q≥1) in a time-reversible Markov model. The key algorithmic step for both applications is the construction of a spectral sparsifier of a constant degree random-walk matrix-polynomials introduced by Newton's method. Our algorithm can also be used to build efficient data structures for effective resistances for multi-step time-reversible Markov models, and we anticipate that it could be useful for other tasks in network analysis.


Computational Discovery of Physiomes in Critically Ill Children Using Deep Learning

D. Kale, Z. Che, Y. Liu, and R. Wetzel. Workshop on Data Mining for Medical Informatics (DMMI'14) at AMIA, 2014.
Critical illness is dynamic and complex, but physicians often diagnose and treat based on parsimonious, static sets of symptoms and signs. The increasing volume of digital health data offers an opportunity to use computational methods to learn richer descriptors of illness that incorporate temporal dynamics and more variables. In this work, we use deep neural networks to mine patterns from multivariate clinical time series. We apply this to a large pediatric intensive care unit (PICU) database from Children's Hospital Los Angeles (CHLA) to learn patterns that are associated with known critical illnesses.


Scalable Parallel Factorizations of SDD Matrices and Efficient Sampling for Gaussian Graphical Models

Dehua Cheng, Yu Cheng, Yan Liu, Richard Peng, and Shang-Hua Teng. arXiv, 2014.
Motivated by a sampling problem basic to computational statistical inference, we develop a nearly optimal algorithm for a fundamental problem in spectral graph theory and numerical analysis. Given an n×n SDDM matrix M, and a constant −1≤p≤1, our algorithm gives efficient access to a sparse n×n linear operator C such that M^p≈CC^T. The solution is based on factoring M into a product of simple and sparse matrices using squaring and spectral sparsification. For M with m non-zero entries, our algorithm takes work nearly-linear in m, and polylogarithmic depth on a parallel machine with m processors. This gives the first sampling algorithm that only requires nearly linear work and n i.i.d. random univariate Gaussian samples to generate i.i.d. random samples for n-dimensional Gaussian random fields with SDDM precision matrices. For sampling this natural subclass of Gaussian random fields, it is optimal in the randomness and nearly optimal in the work and parallel complexity. In addition, our sampling algorithm can be directly extended to Gaussian random fields with SDD precision matrices.


Ups and Downs in Buzzes: Life Cycle Modeling for Temporal Pattern Discovery

Y. Chang, M. Yamada, A. Ortega, Y. Liu. IEEE International Conference on Data Mining (ICDM'14), 2014.
Buzz modeling is a challenging task because a buzz time-series usually exhibits sudden spikes and heavy tails, which fails most existing time-series models. To deal with buzz time-series sequences, we propose a novel time-series modeling approach which captures the rise and fade temporal patterns via Product Life Cycle (PLC) models, a classical concept in economics. More specifically, we propose a mixture of PLC models to capture the multiple peaks in buzz time-series and furthermore develop a probabilistic graphical model (K-MPLC).


An Examination of Multivariate Time Series Hashing with Applications to Health Care

D. Kale*, D. Gong*, Z. Che*, G. Medioni, R. Wetzel, P. Ross, and Y. Liu. IEEE International Conference on Data Mining (ICDM'14), 2014. (* Equal contributions)
We demonstrated that kernelized locality sensitive hashing (KLSH) is a robust solution to the problem of similarity search over large scale databases of variable length multivariate time series data. Combining KLSH with a variety of different time series similarity measures (capturing both shape and structure), we demonstrate that the speed and accuracy of this approach on three large real world medical data sets.


Fast Multivariate Spatio-temporal Analysis via Low Rank Tensor Learning

M. T. Bahadori*, R. Yu*, and Y. Liu. Spotlight in Conference on Neural Information Processing Systems (NIPS'14), 2014. (* Equal contributions)
Accurate and efficient analysis of the multivariate spatio-temporal data is critical to climatology, geology, and sociology applications. Existing models usually underestimate the inter-dependence among variables, space and time and are often computationally expensive. We present a unified low rank tensor learning framework for the two main tasks in multivariate spatio-temporal analysis: cokriging and forecasting. It incorporates the commonalities of the variables as well as the spatial correlations. We develop an efficient greedy algorithm to solve the problem and show its rate of convergence.


What is Tumblr: A Statistical Overview and Comparison

Y. Chang, L. Tang, Y. Inagaki, Y. Liu. SIGKDD Explorations 2014 journal paper.
We provide some pioneer analysis on Tumblr from a variety of aspects. We study the social network structure among Tumblr users, analyze its user generated content, and describe reblogging patterns to analyze its user behavior. We aim to provide a comprehensive statistical overview of Tumblr and compare it with other popular social services. We find Tumblr has more rich content than other microblogging platforms, and it contains hybrid characteristics of social networking, traditional blogosphere, and social media.


GLAD: Group Anomaly Detection in Social Media Analysis

R. Yu, X. He, Y. Liu. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD'14), 2014.
In this paper, we take a generative approach by proposing a hierarchical Bayes model: Group Latent Anomaly Detection (GLAD) model. GLAD takes both pair-wise and point-wise data as input, automatically infers the groups and detects group anomalies simultaneously. To account for the dynamic properties of the social media data, we further generalize GLAD to its dynamic extension d-GLAD. We conduct extensive experiments to evaluate our models on both synthetic and real world datasets. The empirical results demonstrate that our approach is effective and robust in discovering latent groups and detecting group anomalies.


Parallel Gibbs Sampling for Hierarchical Dirichlet Processes via Gamma Processes Equivalence

D.Cheng, and Y. Liu, ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD'14), 2014.
In this paper, we propose an effective parallel Gibbs sampling algorithm for HDP by exploring its connections with the gamma-gamma-Poisson process. Specifically, we develop a novel framework that combines bootstrap and Reversible Jump MCMC algorithm to enable parallel variable updates. We also provide theoretical convergence analysis based on Gibbs sampling with asynchronous variable updates.


FBLG: A Simple and Effective Approach for Temporal Dependence Discovery from Time Series Data

D. Cheng, M. Bahadori, and Y. Liu, ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD'14), 2014.
In this paper, we propose a simple and effective approach for temporal dependence discovery from time series data. We first create a new time series by reversing the time stamps of original time series and combine both time series to improve the performance of temporal dependence recovery. We then provide theoretical justification for the proposed algorithm for several existing time series models.


Linking Heterogeneous Input Spaces with Pivots for Multi-Task Learning

J. He, Y. Liu and Q. Yang. SIAM Conference on Data Mining (SDM'14), 2014.
In this paper, we address a general setting for multi-task learning (MTL) where different tasks have heterogeneous input spaces. we first propose a learning scheme for multiple tasks and analyze its generalization performance. Then we focus on the problems where only a limited number of the pivots are available, and propose a general framework to leverage the pivot information. We further propose an effective optimization algorithm to find both the mappings and the prediction model.


High-resolution Functional Annotation of Human Transcriptome: Predicting Isoform Functions by a Novel Multiple Instance-based Label Propagation Method

W. Li, S. Kang, C. Liu, S. Zhang, Y. Shi, Y. Liu and X. Zhou. Nucleic acids research, 2014.
By integrating multiple human RNA-seq data sets, we carried out the first systematic prediction of isoform functions, enabling high-resolution functional annotation of human transcriptome. We modelled the gene–isoform relationships as multiple instance data and developed a novel label propagation method to predict functions. Our method achieved an average area under the receiver operating characteristic curve of 0.67 and assigned functions to 15572 isoforms.


Bayesian Regularization via Graph Laplacian

F. Liu, S. Chakraborty, F. Li, Y. Liu, A. C. Lozano. Journal of Baysian Statistics, 2014.
We propose a novel Bayesian approach, which explicitly models the dependence structure between variables through a graph Laplacian matrix. We generalize the graph Laplacian to allow both positive and negative correlations. A prior distribution for the graph Laplacian is then proposed, which allows conjugacy and thereby greatly simplifies the computation. We show that the proposed Bayesian model leads to proper posterior distribution.


Accelerating Active Learning with Transfer Learning

D. Kale, Y. Liu. IEEE International conference on Data Mining (ICDM'13), 2013.
we present a simple and principled transfer active learning framework that leverages pre-existing labeled data from related tasks to improve the performance of an active learner. We derive an intuitive bound on the generalization error for the classifiers learned by this algorithm that provides insight into the algorithm’s behavior and the problem in general. We provide experimental results using several well-known transfer learning data sets that confirm our theoretical analysis.


Fast Structure Learning in Generalized Stochastic Proceeses with Latent Factors

T. Bahadori, Y. Liu. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD'13), 2013.
In this work, we analyze a flexible stochastic process model, the generalized linear auto-regressive process (GLARP) and identify the conditions under which the impact of hidden variables appears as an additive term to the evolution matrix estimated with the maximum likelihood. In particular, we examine three examples, including two popular models for count data, i.e, Poisson and Conwey-Maxwell Poisson vector auto-regressive processes, and one powerful model for extreme value data, i.e., Gumbel vector auto-regressive processes. We demonstrate that the impact of hidden factors can be separated out via convex optimization in these three models.


An Examination of Practical Granger Causality Inference

T. Bahadori, Y. Liu. SIAM Conference on Data Mining (SDM'13), 2013.
Learning temporal causal structures among multiple time series is one of the major tasks in mining time series data. Granger causality is one of the most popular techniques in uncovering the temporal dependencies among time series; however it faces two main challenges: (i) the spurious effect of unobserved time series and (ii) the computational challenges in high dimensional settings. In this paper, we utilize the confounder path delays to find a subset of time series that via conditioning on them we are able to cancel out the spurious confounder effects. After study of consistency of different Granger causality techniques, we propose Copula-Granger and show that while it is consistent in high dimensions, it can efficiently capture non-linearity in the data. Extensive experiments on a synthetic and a social networking dataset confirm our theoretical results.


An Entropy-Based Model to Infer Social Strength from Spatiotemporal Data

V. Pham, C. Shahabi, Y. Liu. ACM SIGMOD Conference on Management of Data (SIGMOD'13), 2013.
we are interested in inferring these social connections by analyzing people’s location information, which is useful in a variety of application domains from sales and marketing to intelligence analysis. In particular, we propose an entropy-based model (EBM) that not only infers social connections but also estimates the strength of social connections by analyzing people’s co-occurrences in space and time.


Towards Twitter Context Summarization with User Influence Models

Y. Chang, X. Wang, Q. Mei and Y. Liu. International Conference of Web Search and Data Mining (WSDM'13), 2013.
"we propose the task of Twitter context summarization, which generates a succinct summary from a large but noisy Twitter context tree. Given that there are rich user interactions in Twitter, we thus study how to improve summarization methods by leveraging such signals. In particular, we study how user influence models, which project user interaction information onto a Twitter context tree, can help Twitter context summarization within a supervised learning framework."


Granger Graphical Models for Time-Series Anomaly Detection

H. Qiu, Y. Liu, N. Subrahmanya, W. Li. IEEE International conference on Data Mining (ICDM'12), 2012.
we propose Granger graphical models as an effective and scalable approach for anomaly detection whose results can be readily interpreted. Specifically, Granger graphical models are a family of graphical models that exploit the temporal dependencies between variables by applying L1-regularized learning to Granger causality. Our goal is to efficiently compute a robust “correlation anomaly” score for each variable via Granger graphical models that can provide insights on the possible reasons of anomalies.


Sparse-GEV: Sparse Latent Space Model for Multivariate Extreme Value Time Series Modeling

Y. Liu, M. T. Bahadori, and H. Li. International conference on Machine Learning (ICML'12), 2012.
The extreme value time series data usually exhibit a heavy-tailed distribution rather than a Gaussian distribution. This poses great challenges to existing approaches due to the significantly different assumptions on the data distributions and the lack of sufficient past data on extreme events. In this paper, we propose the Sparse-GEV model, a latent state model based on the theory of extreme value modeling to automatically learn sparse temporal dependence and make predictions. Our model is theoretically significant because it is among the first models to learn sparse temporal dependencies among multivariate extreme value time series.


Collaborative Topic Regression with Social Matrix Factorization for Recommendation Systems

S. Purushotham, Y. Liu, C-C. Kuo. International conference on Machine Learning (ICML'12), 2012.
We propose a novel hierarchical Bayesian model which jointly incorporates topic modeling and probabilistic matrix factorization of social networks. A major advantage of our model is to automatically infer useful latent topics and social information as well as their importance to collaborative filtering from the training data. Empirical experiments on two large-scale datasets show that our algorithm provides a more effective recommendation system than the state-of-the art approaches.


Community Discovery and Profiling with Social Messages

W. Zhou, H. Jin, Y. Liu. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD'12), 2012.
We aim to automatically discovering and profiling users' communities by taking into account both the contacts and the topics. More specifically, we propose a community profiling model called COCOMP, where the communities labels are latent, and each social document corresponds to an information sharing activity among the most probable community members regarding the most relevant community issues. Experiment results on several social communication datasets, including emails and Twitter messages, demonstrate that the model can discover users' communities effectively, and provide concrete semantics.


Transfer Topic Modeling with Ease and Scalability

J.-H. Kang, J. Ma, and Y. Liu. SIAM Conference on Data Mining (SDM'12), 2012.
We propose a transfer learning approach that utilizes abundant labeled documents from other domains (such as Yahoo! News or Wikipedia) to improve topic modeling, with better model fitting and result interpretation. Specifically, we develop Transfer Hierarchical LDA (thLDA) model, which incorporates the label information from other domains via informative priors.


Granger Causality Analysis in Irregular Time Series

M. T. Bahadori, and Y. Liu. SIAM Conference on Data Mining (SDM'12), 2012.
Learning temporal causal structures between time series is one of the key tools for analyzing time series data. In many real-world applications, we are confronted with Irregular Time Series, whose observations are not sampled at equally-spaced time stamps. The irregularity in sampling intervals violates the basic assumptions behind many models for structure learning. In this paper, we propose a nonparametric generalization of the Granger graphical models called Generalized Lasso Granger (GLG) to uncover the temporal dependencies from irregular time series.


Multiple Instance Learning on Structured Data

D. Zhang, Y. Liu, S. Luo, J. Zhang and R. Lawrence. Conference on Neural Information Processing Systems (NIPS'11), 2011.
This paper explores the research problem as multiple instance learning on structured data (MILSD) and formulates a novel framework that considers additional structure information. In particular, an effective and efficient optimization algorithm has been proposed to solve the original non-convex optimization problem by using a combination of Concave-Convex Constraint Programming (CCCP) method and an adapted Cutting Plane method, which deals with two sets of constraints caused by learning on instances within individual bags and learning on structured data.


Learning with Minimum Supervision: A General Framework for Transductive Transfer Learning

M. T. Bahadori, Y. Liu, and D. Zhang. IEEE International Conference on Data Mining (ICDM'11), 2011.
In this work, we propose a general framework to solve the problem by mapping the input features in both the source domain and the target domain into a shared latent space and simultaneously minimizing the feature reconstruction loss and prediction loss. We develop one specific example of the framework, namely latent large-margin transductive transfer learning algorithm, and analyze its theoretic bound of classification loss via Rademacher complexity. We also provide a unified view of several popular transfer learning algorithms under our framework.


A Framework for Efficient Text Analytics through Automatic Configuration and Customization of Scientific Workflows

M. Hauder, Y. Gil, Y. Liu. E-Science, 2011.
We have developed a framework to assist scientists with data analysis tasks in particular machine learning and data mining. It takes advantage of the unique capabilities of the Wings workflow system to reason about semantic constraints. We show how the framework can rule out invalid workflows and help scientists to explore the design space.


Latent Graphical Models for Quantifying and Predicting Patent Quality

Y. Liu, P. Hseuh, R. Lawrence, S. Meliksetian, C. Perlich, A. Veen. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD'11), 2011.
The number of patents led each year has increased dramatically in recent years, raising concerns that patents of questionable validity are restricting the issuance of truly innovative patents. In this paper, we develop a latent graphical model to infer patent quality from related measurements. In addition, we extract advanced lexical features via natural language processing techniques to capture the quality measures such as clarity of claims, originality, and importance of cited prior art.


Serendipitous Learning: Learning Beyond the Predefined Label Space

D. Zhang, Y. Liu, Luo Si. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD'11), 2011.
This paper explores this novel and practical learning scenario, which is named as Serendipitous Learning (SL). The basic idea is to leverage the knowledge in the labeled examples to help identify the unknown classes. In particular, a maximum margin formulation is proposed to model both the classification loss on the known classes and the clustering performance on the unknown ones.


Multi-View Transfer Learning with a Large Margin Approach

D. Zhang, J. He, Y. Liu, L. Si, R. D. Lawrence. Proceedings of the 17th ACM Conf. on Knowledge Discovery & Data Mining (SIGKDD), 2011.
To better leverage both the labeled data from the source domain and the features from different views, this paper proposes a general framework: Multi-View Transfer Learning with a Large Margin Approach (MVTL-LM). On one hand, labeled data from the source domain is effectively utilized to construct a large margin classifier; on the other hand, the data from both domains is employed to impose consistencies among multiple views.


Detecting Multilingual and Multi-Regional Query Intent in Web Search

Y. Chang, R. Zhang, S. Reddy, Y. Liu. AAAI Conference on Artificial Intelligence (AAAI'11), 2011.
We introduce a query intent probabilistic model, whose input is the number of clicks on documents from different regions and in different language, while the output of this model is a smoothed probabilistic distribution of multilingual and multi-regional query intent. Based on an editorial test to evaluate the accuracy of the intent classifier, our probabilistic model could improve the accuracy of multilingual intent detection for 15%, and improve multi-regional intent detection for 18%.


Transfer Latent Semantic Learning: Microblog Mining with Less Supervision

D. Zhang, Y. Liu, R. D. Lawrence, V. Chenthamarakshan. AAAI Conference on Artificial Intelligence (AAAI'11), 2011.
We introduce a novel transfer learning approach, namely transfer latent semantic learning, that utilizes a large number of related tagged documents with rich information from other sources (source domain) to help build a robust latent semantic space for the abbreviated texts (target domain). This is achieved by simultaneously minimizing the document reconstruction error and the classification error of the labeled examples from the source domain by building a classifier with hinge loss in the latent semantic space.


Learning Temporal Graphs for Relational Time-Series Analysis

Y. Liu, A. Niculescu-Mizil, A. Lozano, Y. Lu. International conference on Machine Learning (ICML'10), 2010.
We examine learning tasks where one is presented with multiple multivariate time-series, as well as a relational graph among the different time-series. We propose an L1 regularized hidden Markov random field regression framework to leverage the information provided by the relational graph and jointly infer more accurate temporal causal structures for all time-series.


Learning Spatial-Temporal Varying Graphs with Applications to Climate Data Analysis

X. Chen, Y. Liu, H. Liu, J.G. Carbonell. AAAI Conference on Artificial Intelligence (AAAI'10), 2010.
We develop a methodology to learn dynamic graph structures from spatial-temporal data so that the graph structures at adjacent time or locations are similar. Experimental results demonstrate that our method not only recovers the underlying graph well but also captures the smooth variation properties on both synthetic data and climate data.


Temporal Graphical Models for Cross-Species Gene Regulatory Network Discovery

Y.Liu, A. Niculescu-Mizil, A. Lozano, Y. Lu. International conference on Computational Systems Bioinformatics (CSB'10), 2010.
We present hidden Markov random field regression with L1 penalty to jointly uncover the regulatory networks for multiple species. The algorithm provides a framework for sharing information across species via hidden component graphs and can conveniently incorporate domain knowledge over evolution relationship between species.


Collaboration Analytics: Mining Work Patterns from Collaboration Activities

Q. Wang, H. Jin. Y. Liu. ACM International Conference on Information and Knowledge Management (CIKM'10), 2010.
In this article, we present an actionable solution of user analytics, namely collaboration analytics, by focusing on mining a person's work patterns from her collaboration activities. Our solution effectively makes use of a user's heterogeneous data collected from various collaboration tools to derive an integrated description of the user's collaborative work.


Medical data mining: insights from winning two competitions

S. Rosset, C. Perlich, G. Swirszcz, P. Melville, Y. Liu. Data Mining and Knowledge Discovery 20.3, 2010.
Two major data mining competitions in 2008 presented challenges in medical domains: KDD Cup 2008, which concerned cancer detection from mammography data; and Informs DataMining Challenge 2008, dealing with diagnosis of pneumonia based on patient information from hospital files. Our team won both of these competitions, and in this paper we share our lessons learned and insights.


Topic-Link LDA: Joint Models of Topic and Author Community

Y. Liu, A. Niculescu-Mizil, W. Gryc. International conference on Machine Learning (ICML'09), 2009.
We develop a Bayesian hierarchical approach that performs topic modeling and author community discovery in one unified framework. The effectiveness of our model is demonstrated on two blog data sets in different domains and one research paper citation data from CiteSeer.


Learning Dynamic Temporal Graphs for Oil-production Equipment Monitoring System

Y. Liu, J.R. Kalagnanam, O. Johnsen. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD'09), 2009.
We develop a dynamic temporal graphical models based on hidden Markov model regression and lasso-type algorithms. Our method is able to integrate two usually separate tasks, i.e. inferring underlying states and learning temporal graphs, in one unified model. The output temporal graphs provide better understanding about complex systems, i.e. how their dependency graphs evolve over time, and achieve more accurate predictions.


Spatial-temporal causal modeling for climate change attribution

A. Lozano, H. Li, A. Niculescu-Mizil, Y. Liu, C. Perlich, J. Hosking, N. Abe. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD'09), 2009.
We develop a novel method to infer causality from spatial-temporal data, as well as a procedure to incorporate extreme value modeling into our method in order to address the attribution of extreme climate events, such as heatwaves. Our experimental results on a real world dataset indicate that changes in temperature are not solely accounted for by solar radiance, but attributed more significantly to CO2 and other greenhouse gases.


Grouped Graphical Granger Modeling Methods for Temporal Causal Modeling

A.C. Lozano, N. Abe, Y. Liu, S. Rosset. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD'09), 2009.
With the surge of interest in model selection methodologies for regression, such as the Lasso, as practical alternatives to solving structural learning of graphical models, the question arises whether and how to combine these two notions into a practically viable approach for temporal causal modeling. In this paper, we examine a host of related algorithms that, loosely speaking, fall under the category of graphical Granger methods, and characterize their relative performance from multiple viewpoints.


Grouped Graphical Granger Modeling for Gene Expression Regulatory Networks Discovery

A.C. Lozano, N. Abe, Y. Liu, S. Rosset. International Conference on Intelligent Systems for Molecular Biology (ISMB'09), 2009.
we propose a novel methodology which we term 'grouped graphical Granger modeling method', which overcomes the limitations mentioned above by applying a regression method suited for high-dimensional and large data, and by leveraging the group structure among the lagged temporal variables according to the time series they belong to. We demonstrate the effectiveness of the proposed methodology on both simulated and actual gene expression data, specifically the human cancer cell (HeLa S3) cycle data.


Who is the Expert? Analyzing Gaze Data to Predict Expertise Level in Collaborative Applications

Y. Liu, P. Hsueh, J. Lai, M. Sangin, M.A. Nussli, P. Dillenbourg. IEEE International Conference on Multimedia and Expo (ICME'09), 2009.
We analyze complex gaze tracking data in a collaborative task and apply machine learning models to automatically predict skill-level differences between participants. Specifically, we present findings that address the two primary challenges for this prediction task: (1) extracting meaningful features from the gaze information, and (2) casting the prediction task as a machine learning (ML) problem.


Proximity-Based Anomaly Detection using Sparse Structure Learning

T. Ide, A. Lozano, N. Abe, Y. Liu. SIAM Conference on Data Mining (SDM'09), 2009.
In many applications involving real-valued time-series data, such as physical sensor data and economic metrics, discovering changes and anomalies in the way variables depend on one another is of particular importance. Our goal is to robustly compute the “correlation anomaly” score of each variable by comparing the test data with reference data, even when some of the variables are highly correlated (and thus collinearity exists).


Graph-based Rare Category Detection

J. He, Y. Liu, R. Lawrence. IEEE International Conference on Data Mining (ICDM'08), 2008.
we develop a new graph-based method for rare category detection named GRADE. It makes use of the global similarity matrix motivated by the manifold ranking algorithm, which results in more compact clusters for the minority classes; by selecting examples from the regions where probability density changes the most, it relaxes the assumption that the majority classes and the minority classes are separable.


Winner’s Report: KDD CUP Breast Cancer Identification

C. Perlich, P. Melville, Y. Liu, G. Swirszcz, R. Lawrence. KDD Workshop on Mining Medical Data, 2008.
We describe the ideas and methodologies that we developed in addressing the KDD Cup 2008 on early breast cancer detection, and discuss how they contributed to our success.


Intelligent System for Workforce Classification

Y. Liu, Z. Kou, C. Perlich, R. Lawrence. KDD Workshop on Data Mining for Business Applications, 2008.
We examine several machine learning algorithms for assigning job-categories to a large employee population based on their online directory profiles and reporting structures.


Temporal Causal Modeling with Graphical Granger Methods

A. Arnold, Y. Liu, N. Abe. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD'07), 2007.
we examine a host of related algorithms that, loosely speaking, fall under the category of graphical Granger methods, and characterize their relative performance from multiple viewpoints. Our experiments show, for instance, that the Lasso algorithm exhibits consistent gain over the canonical pairwise graphical Granger method.


Making the Most of Your Data: KDD Cup 2007 "How Many Ratings" Winners Report

S. Rosset, C. Perlich, Y. Liu. KDD Cup and Workshop 2007. First place for KDD Cup 2007 of Netflix “How Many Ratings in 2006” task.
We describe the ideas and methodologies that we developed in addressing the KDD Cup 2007 How Many Ratings task, and discuss how they contributed to our success.


Predicting Who Rated What in Large-Scale Datasets

Y. Liu, Z. Kou. Second Runner-up for KDD Cup 2007 of Netflix “Who Rated What in 2006” task.
KDD Cup 2007 focuses on movie rating behaviors. The goal of the task “Who Rated What” is to predict whether “existing” users will review “existing” movies in the future. We cast the task as a link prediction problem and address it via a simple classification approach.


Looking for Great Ideas: Analyzing the Innovation Jam

M. Helander, R. Lawrence, Y. Liu, C. Perlich, C. Reddy, S. Rosset. KDD Workshop on Web Mining and Social Network Analysis (WebKDD/SNAKDD'07), 2007.
We discuss the Innovation Jam that IBM carried out in 2006, with the objective of identifying innovative and promising “Big Ideas” through a moderated on-line discussion between IBM worldwide employees and external contributors. We describe the data available and investigate several analytical approaches to address the challenge of understanding “how innovation happens” and to facilitate the success of future Jams. We demonstrate the social network structure of data and its time dependence, and discuss the results of both supervised and unsupervised learning applied to this data.


Finding New Customers Using Unstructured and Structured Data

P. Melville, Y. Liu, R. Lawrence, I. Khabibrakhmanov, C. Pendus, T. Bowden. KDD Workshop on Web Mining Multiple Information Sources (MMIS'07), 2007.
Identifying new customers is a critical task for any sales oriented company. Of particular interest are companies that sell to other businesses, for which there is a wealth of structured information available through financial and firmographic databases. We demonstrate that the content of company web sites can often be an even richer source of information in identifying particular business alignments. We show how supervised learning can be used to build effective predictive models on unstructured web content as well as on structured firmographic data. We also explore methods to leverage the strengths of both sources by combining these data sources. Extensive empirical evaluation on a real-world marketing case study show promising results of our modeling efforts.


Harmonium Models for Semantic Video Representation and Classification

J. Yang, Y. Liu, E. P. Xing, A. Hauptmann. SIAM International Conference on Data Mining (SDM'07), 2007. Winner of the Best Application Paper Award.
we propose a probabilistic approach for video classification using intermediate semantic representations derived from the multi-modal features. Based on a class of bipartite undirected graphical models named harmonium, our approach represents video data as latent semantic topics derived by jointly modeling the transcript keywords and color-histogram features, and perform classification using these latent topics under a unified framework.


Protein Quaternary Fold Recognition Using Conditional Graphical Models

Y. Liu, J G. Carbonell, V. Gopalakrishnan, P. Weigele. International Joint Conference in Artificial Intelligence (IJCAI'07), 2007.
The complexity associated with modeling the quaternary fold poses major theoretical and computational challenges to current machine learning methods. We propose methods to address these challenges and show how (1) domain knowledge is encoded and utilized to characterize structural properties using segmentation conditional graphical models; and (2) model complexity is handled through efficient inference algorithms.


Graph-Based Semi-Supervised Learning as a Generative Model

J. He, J. Carbonell, Y. Liu. International Joint Conference in Artificial Intelligence (IJCAI'07), 2007.
This paper proposes and develops a new graph-based semi-supervised learning method. Different from previous graph-based methods that are based on discriminative models, our method is essentially a generative model in that the class conditional probabilities are estimated by graph propagation and the class priors are estimated by linear regression.


Semi-supervised Learning of Attribute-Value Pairs from Product Descriptions

K. Probst, R. Ghani, M. Krema, A. Fano, Y. Liu. International Joint Conference in Artificial Intelligence (IJCAI'07), 2007.
We describe an approach to extract attribute-value pairs from product descriptions. This allows us to represent products as sets of such attribute-value pairs to augment product databases. Such a representation is useful for a variety of tasks where treating a product as a set of attribute-value pairs is more useful than as an atomic entity.


Text Mining to Extract Product Attributes

R. Ghani, K. Probst, Y. Liu, M. Krema, and A. Fano. SIGKDD Explorations, 2006.
A detailed version for the paper “K. Probst, R. Ghani, M. Krema, A. Fano, Y. Liu. Semi-supervised Learning of Attribute-Value Pairs from Product Descriptions. International Joint Conference in Artificial Intelligence (IJCAI'07)”


Protein Fold Recognition Using Segmentation Conditional Random Fields (SCRFs)

Y. Liu, J. Carbonell, P. Weigele, V. Gopalakrishnan. Journal of Computational Biology, 2006.
A journal version of Liu’s thesis work on conditional graphical models on protein structural motif recognition.


Predicting Protein Folds with Structural Repeats Using a Chain Graph Model

Y. Liu, E. P. Xing, J. Carbonell. International conference on Machine Learning (ICML'05), 2005.
We propose a chain graph model built on a causally connected series of segmentation conditional random fields (SCRFs) to address these issues. Specifically, the SCRF model captures long-range interactions within recurring structural units and the Bayesian network backbone decomposes cross-repeat interactions into locally computable modules consisting of repeat-specific SCRFs and a model for sequence motifs.


Segmentation Conditional Random Fields (SCRFs): A New Approach for Protein Fold Recognition

Y. Liu, J. Carbonell, P. Weigele, V. Gopalakrishnan. International conference on Research in Computational Molecular Biology (RECOMB'05), 2005.
Protein fold recognition is an important step towards understanding protein three-dimensional structures and their functions. A conditional graphical model, i.e. segmentation conditional random fields (SCRFs), is proposed to solve the problem. In contrast to traditional graphical models such as hidden markov model (HMM), SCRFs follow a discriminative approach. It has the flexibility to include overlapping or long-range interaction features over the whole sequence, as well as global optimally solutions for the parameters.


Kernel Conditional Random Fields: Representation and Clique Selection

J. Lafferty, X. Zhu, Y. Liu. International Conference on Machine Learning (ICML'04), 2004.
Kernel conditional random fields (KCRFs) are introduced as a framework for discriminative modeling of graph-structured data. A representer theorem for conditional graphical models is given which shows how kernel conditional random fields arise from risk minimization procedures defined using Mercer kernels on labeled graphs.


Comparison of Probabilistic Combination Methods for Protein Secondary Structure Prediction

Y. Liu, J. Carbonell, J. Klein-Seetharaman, V. Gopalakrishnan. Bioinformatics, 2004.
Recent analysis by information theory indicates that the correlation between neighboring secondary structures are much stronger than that of neighboring amino acids. In this article, we focus on the combination problem for sequences, i.e. combining the scores or assignments from single or multiple prediction systems under the constraint of a whole sequence, as a target for improvement in protein secondary structure prediction.


Context Sensitive Vocabulary and Its Application in Protein Secondary Structure Prediction

Y. Liu, J. Carbonell, J. Klein-Seetharaman, V. Gopalakrishnan. International Conference on Research and Development in Information Retrieval (SIGIR'04), 2004.
we present a new method that applies information retrieval techniques to solve the problem: we extract a context sensitive biological vocabulary for protein sequences and apply text classification methods to predict protein secondary structure.


A New Boosting Algorithm Using Input-Dependent Regularizer

R. Jin, Y. Liu, S. Luo, J. Carbonell and A. Hauptmann. International Conference on Machine Learning (ICML'03), 2003.
we present a new boosting algorithm, “WeightBoost”, which tries to solve these two problems by introducing an input-dependent regularization factor to the combination weight. Similarly to AdaBoost, we derive a learning procedure for WeightBoost, which is guaranteed to minimize training errors.


A New Pairwise Ensemble Approach for Text Classification

Y. Liu, J. Carbonell and R. Jin. European Conference on Machine Learning (ECML'03), 2003.
we present a new pairwise ensemble approach, which uses pairwise Support Vector Machine (SVM) classifiers as base classifiers and “input-dependent latent variable” method for model combination. This new approach better captures the characteristics of genre classification, including its heterogeneous nature.


On Predicting Rare Class with SVM Ensemble in Scene Classification

R Yan, Y. Liu, R. Jin and A. Hauptmann. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'03), 2003.
We propose SVM ensembles to address the rare class problem. Various classifier combination strategies are investigated, including majority voting, sum rule, neural network gater and hierarchical SVMs.


Boosting to Correct the Inductive Bias for Text Classification

Y. Liu, Y, Yang and J. Carbonell. ACM International Conference on Information and Knowledge Management (CIKM'02), 2002.
This paper studies the effects of boosting in the context of different classification methods for text categorization, including Decision Trees, Naive Bayes, Support Vector Machines (SVMs) and a Rocchio-style classifier. We identify the inductive biases of each classifier and explore how boosting, as an error-driven resampling mechanism, reacts to those biases.