Data visualization techniques are
important for data analysis, since the human eye has frequently been
advocated as the ultimate data-mining tool. However, there has been
surprisingly little work on visualizing massive time series datasets. To
this end, we developed VizTree, a time series pattern discovery and
visualization system based on augmenting suffix trees. VizTree visually
summarizes both the global and local structures of time series data at the
same time. In addition, it provides novel interactive solutions to many
pattern discovery problems, including the discovery of frequently occurring
patterns (motif discovery), surprising patterns (anomaly detection), and
query by content. The interactive paradigm allows users to visually
explore the time series and perform real-time hypothesis testing.
|Step 1: Subsequence Extraction and Discretization (via SAX)|
Using a sliding window of size n (user input), extract subsequences from the time series. Then discretize each subsequence via SAX. In the accompanying figure, which shows the discretization of just one subsequence, the subsequence is converted to the string "baabccbc" (see the SAX website for more details). Do this for every subsequence (or for selected subsequences if the numerosity option is turned on).
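The extraction and discretization step can be sketched in Python as follows. This is a minimal illustration of the usual SAX recipe (z-normalize, average over equal-width segments, map the averages to letters using equiprobable Gaussian breakpoints), not VizTree's actual code. The function names sax_word and sliding_sax and their parameters are hypothetical, and the sketch assumes the window is at least as long as the word.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax_word(subseq, word_len, alphabet_size):
    """Discretize one subsequence into a SAX word (hypothetical helper)."""
    mu, sigma = mean(subseq), pstdev(subseq)
    # z-normalize; a perfectly flat window maps to all zeros
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in subseq]
    # piecewise aggregate approximation: mean of each of word_len segments
    n = len(z)
    paa = [mean(z[i * n // word_len:(i + 1) * n // word_len])
           for i in range(word_len)]
    # breakpoints that cut N(0, 1) into alphabet_size equiprobable regions
    cuts = [NormalDist().inv_cdf(k / alphabet_size)
            for k in range(1, alphabet_size)]
    # region 0 -> 'a', region 1 -> 'b', and so on
    return "".join(chr(ord("a") + bisect_right(cuts, v)) for v in paa)


def sliding_sax(series, window, word_len, alphabet_size):
    """Return the SAX word of every sliding window over the series."""
    return [sax_word(series[i:i + window], word_len, alphabet_size)
            for i in range(len(series) - window + 1)]
```

For example, sliding_sax(series, window=n, word_len=8, alphabet_size=3) would produce one eight-letter word over the alphabet {a, b, c} per window position, the same shape as the "baabccbc" string mentioned above.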
|Step 2: Insertion|
Insert the resulting strings into a depth-limited, augmented suffix tree. The frequencies of the strings are encoded as the
thickness of branches. The following figure summarizes these steps.
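The insertion step can be approximated with a counting trie. The sketch below is a simplification under stated assumptions: it inserts whole SAX words into a depth-limited trie and keeps a counter on each branch, rather than building a full augmented suffix tree. The names Node, build_tree and branch_count are hypothetical and do not come from VizTree.

```python
class Node:
    """One branch of the tree (hypothetical structure)."""

    def __init__(self):
        self.count = 0      # number of inserted words passing through this branch
        self.children = {}  # symbol -> Node


def build_tree(words, max_depth):
    """Insert each word, truncated to max_depth symbols, and count branch usage."""
    root = Node()
    for word in words:
        node = root
        for symbol in word[:max_depth]:
            node = node.children.setdefault(symbol, Node())
            node.count += 1
    return root


def branch_count(root, prefix):
    """Return how many inserted words start with the given prefix."""
    node = root
    for symbol in prefix:
        node = node.children.get(symbol)
        if node is None:
            return 0
    return node.count
```

A renderer could then draw each branch with a line width proportional to its count, so that frequent prefixes appear thick and rare ones thin.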
The design of VizTree follows the Visual Information Seeking Mantra: "Overview, zoom & filter, details-on-demand" championed by Dr. Ben Shneiderman.
Since the frequency of each string (pattern) is encoded in the line thickness, thick branches represent frequent patterns, whereas thin branches represent infrequent or potentially anomalous patterns.
The example above shows a power consumption time series. A "normal" weekly pattern has 5 peaks, one for each weekday. If we click on the branch "bab" (a thin branch), we can see that the subsequence mapped to this particular string (shown in the Detail 1 window) indeed has a different pattern: it has 3 peaks instead of 5, the result of a short Christmas week.
If training data is available, we can visualize the distributional differences of patterns between the training (reference) data, A, and the testing data, B. The steps are summarized as follows.
Blue lines: under-represented patterns (pattern is more common in A)
Green lines: over-represented patterns (pattern is more common in B)
Red lines: surprising patterns
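One way to picture the comparison is as a difference of relative pattern frequencies between a reference set A and a test set B. The sketch below is an assumption, not the scoring VizTree actually uses: it treats a negative difference as "more common in A" (blue), a positive difference as "more common in B" (green), and ranks patterns by absolute difference as a stand-in for surprise. The function names are hypothetical, and both word lists are assumed to be non-empty.

```python
from collections import Counter


def pattern_differences(words_a, words_b):
    """Relative frequency in B minus relative frequency in A, per pattern."""
    freq_a, freq_b = Counter(words_a), Counter(words_b)
    diffs = {}
    for word in freq_a.keys() | freq_b.keys():
        # a Counter returns 0 for patterns that never occur in one dataset
        diffs[word] = freq_b[word] / len(words_b) - freq_a[word] / len(words_a)
    return diffs


def rank_surprising(diffs, top_k):
    """Patterns with the largest absolute difference first (assumed measure)."""
    return sorted(diffs, key=lambda w: (-abs(diffs[w]), w))[:top_k]
```

Under this assumed measure, a pattern that appears only in the test data and never in the reference data gets a large positive score and rises toward the top of the ranking.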
For the example below, the blue ECG data is the reference data, and the green ECG data is the testing data. The resulting tree shows the differences in pattern distributions between the two datasets. The surprising patterns are ranked. Clicking on the branch ranked #1 shows the anomalous heartbeat in the green time series (also highlighted in the time series window).