Implementing Clustering Algorithms for Traffic Data Analysis

Students: Yeojin Jung & Branden Kang
Supervisor: Andras Gyorgy & Monica Menendez

Our research was extended from the previous traffic data analysis and visualization project conducted by Anel Orazgaliyeva under the supervision of Professor Monica Menendez and Professor Andras Gyorgy. Simulated traffic data of Zurich, Switzerland was generated, and congestion, congestion speed, flow, and densities were computed accordingly. The project ultimately aims to understand the traffic trends in the city by looking at different parameters and transforming them to extract new information.

Initially, dynamic time warping was used to analyze the sequential congestion data. However, it could only compute the similarities of congestion at each nodes (intersections of roads) and edges (roads) and failed to tell how the congestion spreads. Thus, we decided to cluster the edges and nodes in the network to see if clusters behave similarly with the collected traffic data (e.g., geographical locations, congestion, flow, density). One key aspect of the traffic data clustering was that it takes multiple input metrics to aggregate the roads and intersections. In other words, we were trying to identify the regions that are relevant to each other not only because of their geographical proximity but also by their similarities in traffic behaviors.

In order to implement multidimensional clustering that takes in multiple distance metrics, our algorithm works in three-folds. First, each data is transformed to a two-dimensional coordinate system using multi-dimensional scaling. Then, all the coordinates associated with each edge or node are concatenated to form a single higher-dimensional coordinate. For instance, if two metrics are used for clustering, each edge or node would be epresented by a four-dimensional coordinate.

Lastly, this higher-dimensional coordinate is reduced to two-dimension using the multi-dimensional scaling in order to be fed into the common clustering algorithm (e.g., spectral clustering). All the raw traffic data were scaled prior to running multidimensional scaling in order to ensure that all the metrics were comparable. Here, we used geographical distances and congestion similarities from dynamic time warping as the two inputs for the multi-dimensional clustering. Multiple plots with varying numbers of clusters were plotted with each dot representing the midpoint location of each edge and color-coded by their clustered labels. With the insight of a traffic data analysis researcher based in Zurich, we identified that four to seven clusters are appropriate. Moreover, we added an additional parameter lambda to control the relative weights of the metrics so that one metric is multiplied by lambda while the other by (1-lambda). We confirmed that our algorithm results in the identical clustering throughout multiple runs and that varying lambda values helps tune the weights of input metrics.

Application of the traffic data clustering is promising as it can be used to compare traffic behaviors at different times (e.g. morning peak, evening peak) to identify patterns that repeat themselves and/or to run it for longer duration (e.g. several days) to examine if the same clusters are detected.