Discovering knowledge from noisy databases using genetic programming

Man Leung WONG, Kwong Sak LEUNG, C. Y. Jack CHENG

Research output: Journal Publications; Journal Article (refereed); Research; peer-review

6 Citations (Scopus)

Abstract

In data mining, we emphasize the need for learning from huge, incomplete, and imperfect data sets. To handle noise in the problem domain, existing learning systems avoid overfitting the imperfect training examples by excluding insignificant patterns. The problem is that these systems use a limiting attribute-value language for representing the training examples and the induced knowledge. Moreover, some important patterns are ignored because they are statistically insignificant. In this article, we present a framework that combines Genetic Programming and Inductive Logic Programming to induce knowledge represented in various knowledge representation formalisms from noisy databases. The framework is based on a formalism of logic grammars, and it can specify the search space declaratively. An implementation of the framework, LOGENPRO (The Logic grammar based GENetic PROgramming system), has been developed. The performance of LOGENPRO is evaluated on the chess end-game domain. We compare LOGENPRO with FOIL and other learning systems in detail, and find its performance is significantly better than that of the others. This result indicates that the Darwinian principle of natural selection is a plausible noise handling method that can avoid overfitting and identify important patterns at the same time. Moreover, the system is applied to one real-life medical database. The knowledge discovered provides insights to and allows better understanding of the medical domains.
Original language: English
Pages (from-to): 870-881
Number of pages: 12
Journal: Journal of the American Society for Information Science
Volume: 51
Issue number: 9
DOI: 10.1002/(SICI)1097-4571(2000)51:9<870::AID-ASI90>3.0.CO;2-R
Publication status: Published - 1 Jan 2000

Fingerprint

Genetic programming
Learning systems
programming
Inductive logic programming (ILP)
Knowledge representation
Data mining
grammar
learning
logic
performance
Data base
language
Overfitting
Values

Cite this

@article{c9e12ce2e83a4d658f201cf99a52e82a,
title = "Discovering knowledge from noisy databases using genetic programming",
abstract = "In data mining, we emphasize the need for learning from huge, incomplete, and imperfect data sets. To handle noise in the problem domain, existing learning systems avoid overfitting the imperfect training examples by excluding insignificant patterns. The problem is that these systems use a limiting attribute-value language for representing the training examples and the induced knowledge. Moreover, some important patterns are ignored because they are statistically insignificant. In this article, we present a framework that combines Genetic Programming and Inductive Logic Programming to induce knowledge represented in various knowledge representation formalisms from noisy databases. The framework is based on a formalism of logic grammars, and it can specify the search space declaratively. An implementation of the framework, LOGENPRO (The Logic grammar based GENetic PROgramming system), has been developed. The performance of LOGENPRO is evaluated on the chess end-game domain. We compare LOGENPRO with FOIL and other learning systems in detail, and find its performance is significantly better than that of the others. This result indicates that the Darwinian principle of natural selection is a plausible noise handling method that can avoid overfitting and identify important patterns at the same time. Moreover, the system is applied to one real-life medical database. The knowledge discovered provides insights to and allows better understanding of the medical domains.",
author = "WONG, {Man Leung} and LEUNG, {Kwong Sak} and CHENG, {C. Y., Jack}",
year = "2000",
month = "1",
day = "1",
doi = "10.1002/(SICI)1097-4571(2000)51:9<870::AID-ASI90>3.0.CO;2-R",
language = "English",
volume = "51",
pages = "870--881",
journal = "Journal of the American Society for Information Science",
issn = "2330-1635",
publisher = "John Wiley and Sons Ltd",
number = "9",

}

Discovering knowledge from noisy databases using genetic programming. / WONG, Man Leung; LEUNG, Kwong Sak; CHENG, C. Y., Jack.

In: Journal of the American Society for Information Science, Vol. 51, No. 9, 01.01.2000, p. 870-881.

TY - JOUR

T1 - Discovering knowledge from noisy databases using genetic programming

AU - WONG, Man Leung

AU - LEUNG, Kwong Sak

AU - CHENG, C. Y., Jack

PY - 2000/1/1

Y1 - 2000/1/1

N2 - In data mining, we emphasize the need for learning from huge, incomplete, and imperfect data sets. To handle noise in the problem domain, existing learning systems avoid overfitting the imperfect training examples by excluding insignificant patterns. The problem is that these systems use a limiting attribute-value language for representing the training examples and the induced knowledge. Moreover, some important patterns are ignored because they are statistically insignificant. In this article, we present a framework that combines Genetic Programming and Inductive Logic Programming to induce knowledge represented in various knowledge representation formalisms from noisy databases. The framework is based on a formalism of logic grammars, and it can specify the search space declaratively. An implementation of the framework, LOGENPRO (The Logic grammar based GENetic PROgramming system), has been developed. The performance of LOGENPRO is evaluated on the chess end-game domain. We compare LOGENPRO with FOIL and other learning systems in detail, and find its performance is significantly better than that of the others. This result indicates that the Darwinian principle of natural selection is a plausible noise handling method that can avoid overfitting and identify important patterns at the same time. Moreover, the system is applied to one real-life medical database. The knowledge discovered provides insights to and allows better understanding of the medical domains.

UR - http://commons.ln.edu.hk/sw_master/2233

U2 - 10.1002/(SICI)1097-4571(2000)51:9<870::AID-ASI90>3.0.CO;2-R

DO - 10.1002/(SICI)1097-4571(2000)51:9<870::AID-ASI90>3.0.CO;2-R

M3 - Journal Article (refereed)

VL - 51

SP - 870

EP - 881

JO - Journal of the American Society for Information Science

JF - Journal of the American Society for Information Science

SN - 2330-1635

IS - 9

ER -