Scalable model-based clustering by working on data summaries

Huidong JIN, Man Leung WONG, Kwong Sak LEUNG

Research output: Book Chapters | Papers in Conference Proceedings > Conference paper (refereed) > Research > peer-review

6 Citations (Scopus)

Abstract

The scalability problem in data mining involves the development of methods for handling large databases with limited computational resources. In this paper, we present a two-phase scalable model-based clustering framework: first, a large data set is summarized into sub-clusters; then, clusters are generated directly from the summary statistics of the sub-clusters by a specifically designed Expectation-Maximization (EM) algorithm. Taking Gaussian mixture models as an example, we establish a provably convergent EM algorithm, EMADS, which explicitly incorporates the cardinality, mean, and covariance information of each sub-cluster. Combined with different data summarization procedures, EMADS is used to construct two clustering systems: gEMADS and bEMADS. The experimental results demonstrate that they run several orders of magnitude faster than the classic EM algorithm with little loss of accuracy. They also generate significantly better results than other model-based clustering systems using similar computational resources.
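
The two-phase idea in the abstract can be illustrated with a minimal one-dimensional sketch. It is not the paper's EMADS algorithm: the fixed-width binning in `summarize`, the function names, and the update rules in `em_on_summaries` are assumptions chosen for illustration. Each sub-cluster is reduced to its cardinality, mean, and variance, and the mixture is then fitted from those summaries alone, with the E-step using the average log-density of a sub-cluster's points under each component.

```python
import math
import random


def summarize(data, bin_width):
    """Phase 1 (assumed): collapse points into per-bin (count, mean, variance)."""
    bins = {}
    for x in data:
        key = math.floor(x / bin_width)
        n, s, ss = bins.get(key, (0, 0.0, 0.0))
        bins[key] = (n + 1, s + x, ss + x * x)
    summaries = []
    for key in sorted(bins):
        n, s, ss = bins[key]
        mean = s / n
        summaries.append((n, mean, max(ss / n - mean * mean, 0.0)))
    return summaries


def em_on_summaries(summaries, init_means, iters=50, min_var=1e-6):
    """Phase 2 (assumed): fit a 1-D Gaussian mixture from summaries only."""
    k = len(init_means)
    total = sum(n for n, _, _ in summaries)
    weights = [1.0 / k] * k
    means = list(init_means)
    variances = [1.0] * k
    for _ in range(iters):
        # E-step: one responsibility vector per sub-cluster. The average
        # squared distance of its points to a component mean is
        # (m - mean_j)^2 + v, so the sub-cluster variance enters explicitly.
        resp = []
        for n, m, v in summaries:
            logs = [
                math.log(weights[j])
                - 0.5 * math.log(2.0 * math.pi * variances[j])
                - ((m - means[j]) ** 2 + v) / (2.0 * variances[j])
                for j in range(k)
            ]
            top = max(logs)
            probs = [math.exp(value - top) for value in logs]
            norm = sum(probs)
            resp.append([p / norm for p in probs])
        # M-step: moment updates weighted by cardinality times responsibility.
        for j in range(k):
            mass = sum(n * r[j] for (n, _, _), r in zip(summaries, resp))
            if mass < 1e-12:
                weights[j] = 1e-12  # keep log(weights[j]) finite
                continue
            weights[j] = mass / total
            means[j] = sum(
                n * r[j] * m for (n, m, _), r in zip(summaries, resp)
            ) / mass
            var = sum(
                n * r[j] * (v + (m - means[j]) ** 2)
                for (n, m, v), r in zip(summaries, resp)
            ) / mass
            variances[j] = max(var, min_var)
    return weights, means, variances


if __name__ == "__main__":
    random.seed(0)
    data = [random.gauss(-5, 1) for _ in range(2000)]
    data += [random.gauss(5, 1) for _ in range(2000)]
    subclusters = summarize(data, bin_width=0.5)
    print(len(subclusters), "sub-clusters summarize", len(data), "points")
    print(em_on_summaries(subclusters, init_means=[-1.0, 1.0]))
```

In this sketch each EM iteration costs time proportional to the number of sub-clusters rather than the number of data points, which is the source of the speed-up the abstract describes.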
Original language: English
Title of host publication: Proceedings - IEEE International Conference on Data Mining, ICDM
Pages: 91-98
Number of pages: 8
DOIs: https://doi.org/10.1109/ICDM.2003.1250907
Publication status: Published - 1 Jan 2003

Fingerprint

Data mining
Scalability
Statistics

Bibliographical note

Paper presented at the 3rd IEEE International Conference on Data Mining, Nov 19-22, 2003, Melbourne, Florida. ISBN of the source publication: 9780769519784

Cite this

JIN, H., WONG, M. L., & LEUNG, K. S. (2003). Scalable model-based clustering by working on data summaries. In Proceedings - IEEE International Conference on Data Mining, ICDM (pp. 91-98). https://doi.org/10.1109/ICDM.2003.1250907
JIN, Huidong ; WONG, Man Leung ; LEUNG, Kwong Sak. / Scalable model-based clustering by working on data summaries. Proceedings - IEEE International Conference on Data Mining, ICDM. 2003. pp. 91-98
@inproceedings{9f32926051684c2dac283f9c1e4a2a79,
title = "Scalable model-based clustering by working on data summaries",
abstract = "The scalability problem in data mining involves the development of methods for handling large databases with limited computational resources. In this paper, we present a two-phase scalable model-based clustering framework: first, a large data set is summarized into sub-clusters; then, clusters are generated directly from the summary statistics of the sub-clusters by a specifically designed Expectation-Maximization (EM) algorithm. Taking Gaussian mixture models as an example, we establish a provably convergent EM algorithm, EMADS, which explicitly incorporates the cardinality, mean, and covariance information of each sub-cluster. Combined with different data summarization procedures, EMADS is used to construct two clustering systems: gEMADS and bEMADS. The experimental results demonstrate that they run several orders of magnitude faster than the classic EM algorithm with little loss of accuracy. They also generate significantly better results than other model-based clustering systems using similar computational resources.",
author = "JIN, Huidong and WONG, {Man Leung} and LEUNG, {Kwong Sak}",
note = "Paper presented at the 3rd IEEE International Conference on Data Mining, Nov 19-22, 2003, Melbourne, Florida. ISBN of the source publication: 9780769519784",
year = "2003",
month = "1",
day = "1",
doi = "10.1109/ICDM.2003.1250907",
language = "English",
isbn = "9780769519784",
pages = "91--98",
booktitle = "Proceedings - IEEE International Conference on Data Mining, ICDM",

}

JIN, H, WONG, ML & LEUNG, KS 2003, Scalable model-based clustering by working on data summaries. in Proceedings - IEEE International Conference on Data Mining, ICDM. pp. 91-98. https://doi.org/10.1109/ICDM.2003.1250907

Scalable model-based clustering by working on data summaries. / JIN, Huidong; WONG, Man Leung; LEUNG, Kwong Sak.

Proceedings - IEEE International Conference on Data Mining, ICDM. 2003. p. 91-98.


TY - GEN

T1 - Scalable model-based clustering by working on data summaries

AU - JIN, Huidong

AU - WONG, Man Leung

AU - LEUNG, Kwong Sak

N1 - Paper presented at the 3rd IEEE International Conference on Data Mining, Nov 19-22, 2003, Melbourne, Florida. ISBN of the source publication: 9780769519784

PY - 2003/1/1

Y1 - 2003/1/1

AB - The scalability problem in data mining involves the development of methods for handling large databases with limited computational resources. In this paper, we present a two-phase scalable model-based clustering framework: first, a large data set is summarized into sub-clusters; then, clusters are generated directly from the summary statistics of the sub-clusters by a specifically designed Expectation-Maximization (EM) algorithm. Taking Gaussian mixture models as an example, we establish a provably convergent EM algorithm, EMADS, which explicitly incorporates the cardinality, mean, and covariance information of each sub-cluster. Combined with different data summarization procedures, EMADS is used to construct two clustering systems: gEMADS and bEMADS. The experimental results demonstrate that they run several orders of magnitude faster than the classic EM algorithm with little loss of accuracy. They also generate significantly better results than other model-based clustering systems using similar computational resources.

UR - http://commons.ln.edu.hk/sw_master/6819

U2 - 10.1109/ICDM.2003.1250907

DO - 10.1109/ICDM.2003.1250907

M3 - Conference paper (refereed)

SN - 9780769519784

SP - 91

EP - 98

BT - Proceedings - IEEE International Conference on Data Mining, ICDM

ER -

JIN H, WONG ML, LEUNG KS. Scalable model-based clustering by working on data summaries. In Proceedings - IEEE International Conference on Data Mining, ICDM. 2003. p. 91-98 https://doi.org/10.1109/ICDM.2003.1250907