Causality Analysis in Large-scale Time Series Data

 

Yan Liu

Computer Science Department

Viterbi School of Engineering

University of Southern California

 

 

Overview

 

In the era of data deluge, we are confronted with large-scale time series data, i.e., sequences of observations of variables of interest over a period of time. For example, terabytes of neural activity time series data are produced to record the collective response of neurons to different stimuli; petabytes of climate and meteorological data, such as temperature, solar radiation, and precipitation, are collected over the years; and exabytes of social media content are generated over time on the Internet.

 

A major task in time series data analysis is to uncover the temporal causal relationships among the time series. For example, in climatology, we want to identify the factors that impact the climate patterns of certain regions. In social networks, we are interested in identifying the patterns of influence among users and how topics activate or suppress each other. Therefore, developing effective and scalable data mining algorithms that uncover temporal dependency structures between time series and reveal insights from data has become a key problem in machine learning and data mining.

 

This tutorial aims to give participants broad and comprehensive coverage of the foundations and recent developments in causality analysis for large-scale time series data. We will provide both theoretical and practical results as well as illustrative demos. In contrast to previous tutorials on causality analysis, we will focus on the emerging approaches to causality analysis for time series data, with an emphasis on scalability and practicality. We will also offer useful and complementary information to members of the CIKM community who are preparing to pursue this research area.

 

To summarize, we will

      Present a balanced review of causality analysis for large-scale time series data, covering topics of both practical and theoretical interest

      Describe state-of-the-art and emerging analysis technologies for large-scale time series data in order to identify recent and future trends

      Provide a good starting point for researchers entering this active research area, including tutorial slides, a supplementary survey paper, implementation packages, and a data repository with real application datasets, by looking at both system- and algorithmic-level developments

 

 

Materials

 

Slides [PDF]

 

Sample dataset #1 [CSV]

 

Sample dataset #2 [CSV]

 

Scope

 

The tutorial will consist of two lectures and a break in the middle:

 

Lecture 1: Introduction to Granger Causality (90 mins)

 

Overview for Causality Analysis from Time Series Data (20 mins)

Granger causality (40 mins)

        Definition

        Identification and learning

        Applications

Known Issues of Granger causality compared with true causality analysis (30 mins)

        Non-linear extensions

        Latent factors

        Instantaneous causation
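
As a rough illustration of the definition and identification topics in Lecture 1, the standard bivariate Granger test compares two least-squares regressions of y on its own past: one with and one without lagged values of x. The sketch below is not taken from the tutorial materials; the function names (lagged_design, rss, granger_f_stat), the lag order p=2, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def lagged_design(series_list, target, p):
    """Build regressors (intercept plus lags 1..p of each series) for target[p:]."""
    n = len(target)
    cols = [np.ones(n - p)]
    for s in series_list:
        for k in range(1, p + 1):
            # Row i corresponds to time t = p + i; this column holds s[t - k].
            cols.append(s[p - k : n - k])
    return np.column_stack(cols), target[p:]

def rss(X, y):
    """Residual sum of squares of the ordinary least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def granger_f_stat(x, y, p=2):
    """F-statistic for the hypothesis 'x Granger-causes y' using p lags."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X_restricted, t = lagged_design([y], y, p)   # past of y only
    X_full, _ = lagged_design([y, x], y, p)      # past of y and past of x
    rss_r, rss_f = rss(X_restricted, t), rss(X_full, t)
    dof = len(t) - X_full.shape[1]
    return ((rss_r - rss_f) / p) / (rss_f / dof)

# Synthetic example (assumed for illustration): x drives y with a one-step delay.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.normal()

f_xy = granger_f_stat(x, y)  # tests x -> y
f_yx = granger_f_stat(y, x)  # tests y -> x
print(f"F(x -> y) = {f_xy:.1f}, F(y -> x) = {f_yx:.2f}")
```

Under this construction the statistic for x -> y should be far larger than the one for y -> x. In practice the statistic would be compared against an F distribution with (p, dof) degrees of freedom, and the issues listed above (non-linearity, latent factors, instantaneous causation) limit how far such a result can be read as true causation.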

 

Break (30 mins)

 

Lecture 2: Alternative Approaches and New Trends (80 mins)

 

Practical Issues in Granger causality (30 mins)

        Time lag

        Group effect

        Non-stationarity

        Collinearity

        Scalability

 

Alternative Approaches (30 mins)

        Randomization test

        Auto-correlation and cross-correlation

        Transfer entropy

 

Illustration Examples (20 mins)

        Demo
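
As one small illustration of the cross-correlation approach listed under Alternative Approaches, the sketch below scans a range of lags and reports the lag at which two standardized series are most strongly correlated. It is not the tutorial's demo; the function name cross_correlation, the lag window, and the synthetic data are assumptions made for this example.

```python
import numpy as np

def cross_correlation(x, y, max_lag):
    """Return {lag: average of x[t] * y[t + lag]} for standardized x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    n = len(x)
    out = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[: n - lag], y[lag:]      # pairs (x[t], y[t + lag])
        else:
            a, b = x[-lag:], y[: n + lag]     # same pairing for negative lags
        out[lag] = float(np.mean(a * b))
    return out

# Synthetic example (assumed for illustration): y is a noisy copy of x delayed by 3 steps.
rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)
y = np.zeros(n)
y[3:] = x[:-3]
y += 0.5 * rng.normal(size=n)

ccf = cross_correlation(x, y, max_lag=10)
best_lag = max(ccf, key=lambda k: abs(ccf[k]))
print("lag with strongest cross-correlation:", best_lag)
```

A peak at a positive lag suggests that x leads y, but cross-correlation alone does not control for each series' own history, which is one reason it is treated here as an alternative to, rather than a replacement for, Granger-style tests.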

 

 

In addition, we plan to distribute the following materials:

      Lecture slides

      Demo

      Survey paper for details on the topic

      Implementation packages

      Data repository

 

Format of tutorial (1/2 day or 1 day)

½ day

 

Prerequisite knowledge of audience

 

Linear algebra, basic statistical regression analysis

 

Relevant references

 

A general survey paper on the topic: http://www-bcf.usc.edu/~liu32/granger.pdf