Clustering High-dimensional Data

Objectives

Clustering real-world data sets is often hampered by the so-called curse of dimensionality: many real-world data sets consist of a very high dimensional feature space. In general, most of the common algorithms fail to generate meaningful results because of the inherent sparsity of the data space. Usually, clusters cannot be found in the original feature space because several features may be irrelevant for clustering due to correlation and/or redundancy. However, clusters are usually embedded in lower dimensional subspaces. In addition, different sets of features may be relevant for different sets of objects. Thus, objects can often be clustered differently in varying subspaces of the original feature space.

Techniques

Arbitrarily-oriented Correlation Clustering

Correlation clustering aims at partitioning the data objects into distinct sets of points that exhibit a dense, arbitrarily linear correlation, so called correlation sets. Points that do not belong to any correlation set are classified as noise.

4C (Computing Correlation Connected Clusters) 4C computes so-called correlation connected clusters, i.e.sets of neighboring points that exhibit an arbitrary linear correlation in a subset of the attributes. For that purpose, 4C examines the correlation in the neighborhood of each point and groups points with similar local correlation into clusters.
COPAC (COrrelation PArtition Clustering) Description
HICO (HIrachical Correlation Clustering) Description
ERiC (Exploring Relationships among Correlation clusters) Description
Increasing the Robustness of PCA-based Algorithms Description
CASH (Clustering in Arbitrary Subspaces based on the Hough transform) Description

Axis-parallel Projected Clustering

Projected clustering aims at partitioning the data into a set of clusters and a noise set while allowing the clusters to exist in different subspaces. Thus, for each cluster, a different set of features is relevant, whith the remaining features being irrelevant. The points of each cluster are densely packed in the subspace spanned by the corresponding relevant features but are sparsely distributed in the subspace spanned by the other dimensions.

PreDeCon (subspace PREference weighted DEnsity CONnnected clustering) PreDeCon extends the 'full-dimensional' clustering algorithm DBSCAN with the concept of subspace preference weights. For each point, the local neighborhood is examined to determine a subspace tendency, i.e. an axis parallel projection onto which the point is located in a dense area. If the point is noise, this subspace is the entire feature space. Neighboring points having a similar subspace preference weight are grouped into clusters using the density-connected clustering notion.
HiSC (HIerarchical Subspace Clustering) Description
DiSH (DerIving Subspace Hierarchies) Description

Axis-parallel Subspace Clustering

Subspace clustering aims at computing all clusters in all subspaces of the feature space. The information of objects clustered differently in varying subspaces is conserved. Objects may be assigned to several clusters (in different subspaces). Subspace ranking aims at identifying all subspaces of a (high dimensional) feature space that contain interesting clustering structures. The subspaces should be ranked according to this interestingness.

SUBCLU (density-based SUBspace CLUstering) SUBCLU is a subspace clustering algorithm that relies on the density-based clustering notion of DBSCAN. It efficiently computes all clusters DBSCAN would have found if applied to the set of all possible subspaces of the feature space. SUBCLU uses an Apriori-like subspace generation procedure which is based on the monotonicity property of density-connected sets and which ensures that only those subspaces are examined that contain clusters.
RIS (Ranking Interesting Subspaces) RIS is a subspace ranking algorithm that uses a quality criterion to rate the interestingness of subspaces. This criterion is based on the monotonicity of core points which are the central concept of the density-based clustering notion of DBSCAN. An Apriori-like subspace generation method (similar to SUBCLU) is used to compute all relevant subspaces and rank them by interestingness. The clusters can be computed in the generated subspaces using any clustering method of choice.
SURFING (SUbspaces Relevant For clusterING) SURFING is a subspace ranking algorithm that does not rely on a global density threshold. It computes the interstingness of a subspace based on the distribution of the k-nearest neighbors of all data points in the corresponding projection. An efficient, bottom-up subspace expansion heuristics ensures that less interesting subspaces are not generated for examination.
FIRES (FIlter/REfinement Subspace clustering) Description

Publications

2015
36	A. Zimek, J. Vreeken The Blind Men and the Elephant: On Meeting the Problem of Multiple Truths in Data from Clustering and Pattern Mining Perspectives Machine Learning, 98(1–2): 121–155, 2015. EE (springerlink) Bibtex (DBLP)
2014
35	A. Zimek, I. Assent, J. Vreeken Frequent Pattern Mining Algorithms for Data Clustering In C. C. Aggarwal, J. Han (ed.): Frequent Pattern Mining, Springer: 403–423, 2014. EE (Springer)
2013
34	K. Sim, V. Gopalkrishnan, A. Zimek, G. Cong A survey on enhanced subspace clustering Data Mining and Knowledge Discovery, 26(2): 332–397, 2013. EE (springerlink) Bibtex (DBLP)
33	A. Zimek Clustering High-Dimensional Data In C. C. Aggarwal, C. K. Reddy (ed.): Data Clustering: Algorithms and Applications, CRC Press: 201–230, 2013.
32	E. Achtert, H.-P. Kriegel, E. Schubert, A. Zimek Interactive Data Mining with 3D-Parallel-Coordinate-Trees In Proceedings of the ACM International Conference on Management of Data (SIGMOD), New York City, NY: 1009–1012, 2013. ELKI, EE (ACM) Bibtex (DBLP)
2012
31	H.-P. Kriegel, P. Kröger, A. Zimek Subspace Clustering Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(4): 351–364, 2012. EE (Wiley) Bibtex (DBLP)
30	E. Ntoutsi, A. Zimek, T. Palpanas, P. Kröger, H.-P. Kriegel Density-based Projected Clustering over High Dimensional Data Streams In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA: 987–998, 2012. EE (SIAM) Bibtex (DBLP)
2011
29	H.-P. Kriegel, P. Kröger, E. Ntoutsi, A. Zimek Density Based Subspace Clustering over Dynamic Data In Proceedings of the 23rd International Conference on Scientific and Statistical Database Management (SSDBM), Portland, OR: 387–404, 2011. EE (springerlink) Bibtex (DBLP)
28	H.-P. Kriegel, E. Ntoutsi, M. Spiliopoulou, G. Tsoumakas, A. Zimek Mining Complex Dynamic Data Tutorial at the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), Athens, Greece, 2011.
2010
27	I. Färber, S. Günnemann, H.-P. Kriegel, P. Kröger, E. Müller, E. Schubert, T. Seidl, A. Zimek On Using Class-Labels in Evaluation of Clusterings In MultiClust: 1st International Workshop on Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with KDD 2010, Washington, DC, 2010. pdf
26	H.-P. Kriegel, A. Zimek Subspace Clustering, Ensemble Clustering, Alternative Clustering, Multiview Clustering: What Can We Learn From Each Other? In MultiClust: 1st International Workshop on Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with KDD 2010, Washington, DC, 2010. pdf
2009
25	H.-P. Kriegel, P. Kröger, A. Zimek Clustering High Dimensional Data: A Survey on Subspace Clustering, Pattern-based Clustering, and Correlation Clustering ACM Transactions on Knowledge Discovery from Data (TKDD), 3(1): 1–58, 2009. EE (ACM) Bibtex (DBLP)
24	A. Zimek Correlation Clustering ACM SIGKDD Explorations, 11(1): 53–54, 2009. EE (SIGKDD Explorations), EE (ACM) Bibtex (DBLP)
23	G. Moise, A. Zimek, P. Kröger, H.-P. Kriegel, J. Sander Subspace and Projected Clustering: Experimental Evaluation and Analysis Knowledge and Information Systems (KAIS), 21(3): 299–326, 2009. EE (springerlink) Bibtex (DBLP)
22	P. Kröger, A. Zimek Subspace Clustering Techniques In L. Liu, M. T. Özsu (ed.): Encyclopedia of Database Systems, Springer: 2873–2875, 2009. EE (springerlink) Bibtex (DBLP)
2008
21	H.-P. Kriegel, P. Kröger, A. Zimek Detecting clusters in moderate-to-high dimensional data: subspace clustering, pattern-based clustering, and correlation clustering Proceedings of the VLDB Endowment, 1(2): 1528–1529, 2008. EE (VLDB endowment) Bibtex (DBLP)
20	E. Achtert, C. Böhm, J. David, P. Kröger, A. Zimek Global Correlation Clustering Based on the Hough Transform Statistical Analysis and Data Mining, 1(3): 111–127, 2008. EE (Wiley InterScience), code Bibtex (DBLP)
19	H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek A General Framework for Increasing the Robustness of PCA-based Correlation Clustering Algorithms In Proceedings of the 20th International Conference on Scientific and Statistical Database Management (SSDBM), Hong Kong, China: 418–435, 2008. pdf, code Bibtex (DBLP)
18	E. Achtert, H.-P. Kriegel, A. Zimek ELKI: A Software System for Evaluation of Subspace Clustering Algorithms In Proceedings of the 20th International Conference on Scientific and Statistical Database Management (SSDBM), Hong Kong, China: 580–585, 2008. EE (SpringerLink), pdf, ELKI Bibtex (DBLP)
17	E. Achtert, C. Böhm, J. David, P. Kröger, A. Zimek Robust Clustering in Arbitrarily Oriented Subspaces In Proceedings of the 8th SIAM International Conference on Data Mining (SDM), Atlanta, GA: 763–774, 2008. pdf, award, code Bibtex (DBLP)
16	H.-P. Kriegel, P. Kröger, A. Zimek Detecting Clusters in Moderate-to-High Dimensional Data: Subspace Clustering, Pattern-based Clustering, and Correlation Clustering Tutorial at the 14th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Las Vegas, NV, 2008. pdf
15	H.-P. Kriegel, P. Kröger, A. Zimek Detecting Clusters in Moderate-to-High Dimensional Data: Subspace Clustering, Pattern-based Clustering, and Correlation Clustering Tutorial at the 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Osaka, Japan, 2008. pdf
14	H.-P. Kriegel, P. Kröger, A. Zimek Detecting Clusters in Moderate-to-High Dimensional Data: Subspace Clustering, Pattern-based Clustering, and Correlation Clustering Tutorial at the 34nd International Conference on Very Large Data Bases (VLDB), Auckland, New Zealand, 2008. pdf
13	A. Zimek Correlation Clustering PhD Thesis, Ludwig-Maximilians-Universität München, Munich, Germany, 2008. EE (Universitätsbibliothek), Runner-Up of the 2009 SIGKDD Doctoral Dissertation Award
2007
12	E. Achtert, C. Böhm, H.-P. Kriegel, P. Kröger, I. Müller-Gorman, A. Zimek Detection and Visualization of Subspace Cluster Hierarchies In Proceedings of the 12th International Conference on Database Systems for Advanced Applications (DASFAA), Bangkok, Thailand: 152–163, 2007. EE (SpringerLink), pdf, code Bibtex (DBLP)
11	E. Achtert, C. Böhm, H.-P. Kriegel, P. Kröger, A. Zimek On Exploring Complex Relationships of Correlation Clusters In Proceedings of the 19th International Conference on Scientific and Statistical Database Management (SSDBM), Banff, Canada: 7–16, 2007. EE (IEEE Xplore), pdf, code Bibtex (DBLP)
10	E. Achtert, C. Böhm, H.-P. Kriegel, P. Kröger, A. Zimek Robust, Complete, and Efficient Correlation Clustering In Proceedings of the 7th SIAM International Conference on Data Mining (SDM), Minneapolis, MN: 413–418, 2007. pdf, poster, code Bibtex (DBLP)
9	H.-P. Kriegel, P. Kröger, A. Zimek Detecting Clusters in Moderate-to-High Dimensional Data: Subspace Clustering, Pattern-based Clustering, and Correlation Clustering Tutorial at the 7th International Conference on Data Mining (ICDM), Omaha, NE, 2007. pdf
2006
8	E. Achtert, C. Böhm, H.-P. Kriegel, P. Kröger, I. Müller-Gorman, A. Zimek Finding Hierarchies of Subspace Clusters In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Berlin, Germany: 446–453, 2006. pdf, poster, code Bibtex (DBLP)
7	E. Achtert, C. Böhm, P. Kröger, A. Zimek Mining Hierarchies of Correlation Clusters In Proceedings of the 18th International Conference on Scientific and Statistical Database Management (SSDBM), Vienna, Austria: 119–128, 2006. EE (IEEE Xplore), pdf, code Bibtex (DBLP)
2005
6	H.-P. Kriegel, P. Kröger, M. Renz, S. Wurst A Generic Framework for Efficient Subspace Clustering of High-Dimensional Data In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM), Houston, TX: 250–257, 2005. pdf Bibtex (DBLP)
2004
5	C. Böhm, K. Kailing, P. Kröger, H.-P. Kriegel Immer größere und komplexere Datenmengen: Herausforderungen für Clustering-Algorithmen Datenbank-Spektrum, 4(9): 11–17, 2004. pdf Bibtex (DBLP)
4	C. Böhm, K. Kailing, H.-P. Kriegel, P. Kröger Density Connected Clustering with Local Subspace Preferences In Proceedings of the 4th IEEE International Conference on Data Mining (ICDM), Brighton, UK: 27–34, 2004. pdf Bibtex (DBLP)
3	C. Baumgartner, K. Kailing, H.-P. Kriegel, P. Kröger, C. Plant Subspace Selection for Clustering High-Dimensional Data In Proceedings of the 4th IEEE International Conference on Data Mining (ICDM), Brighton, UK: 11–18, 2004. pdf Bibtex (DBLP)
2	K. Kailing, H.-P. Kriegel, P. Kröger Density-Connected Subspace Clustering for High-Dimensional Data In Proceedings of the 4th SIAM International Conference on Data Mining (SDM), Lake Buena Vista, FL: 246–257, 2004. pdf Bibtex (DBLP)
1	C. Böhm, K. Kailing, P. Kröger, A. Zimek Computing Clusters of Correlation Connected Objects In Proceedings of the ACM International Conference on Management of Data (SIGMOD), Paris, France: 455–466, 2004. pdf, code Bibtex (DBLP)

Team

Scientific Head:	Prof. Dr. Hans-Peter Kriegel
Project Leaders:	PD Dr. Peer Kröger PD Dr. Arthur Zimek
Current Members:	Dr. Erich Schubert

Von „http://fogo.dbs.ifi.lmu.de/cms/Clustering_High-dimensional_Data“