Text categorization can greatly reduce information clutter, and it also provides a more efficient search strategy and more effective search results for information retrieval. In recent years, Convolutional Neural Networks (CNNs) have been widely applied to this task. However, most existing CNN models have difficulty extracting longer n-gram features, for the following reason: a standard CNN extracts n-gram features through convolution filters of fixed window size, so its number of parameters grows with the length of the n-grams. Meanwhile, term weighting schemes, which assign reasonable weight values to words, have exhibited excellent performance in traditional bag-of-words models. Intuitively, considering the weight of each word in an n-gram feature may be beneficial for text classification. In this paper, we propose the weighted n-grams CNN model, a variant of the CNN that introduces a weighted n-grams layer. The parameters of the weighted n-grams layer are initialized by term weighting schemes. With only a fixed number of added parameters, the model can generate weighted n-gram features of any length. We compare the proposed model with other popular and recent CNN models on five text classification datasets. The experimental results show that the proposed model achieves comparable or even superior performance.
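The parameter-growth argument in the abstract can be made concrete with a small calculation. The sketch below is not taken from the paper: it assumes a generic one-layer text CNN in which each filter spans `window` consecutive word embeddings of size `embed_dim` and has one bias term. The function name `conv_params` and the sizes used (300-dimensional embeddings, 100 filters) are illustrative assumptions only.

```python
def conv_params(window: int, embed_dim: int, num_filters: int) -> int:
    """Trainable parameters in one bank of 1-D convolution filters.

    Each filter holds window * embed_dim weights plus one bias
    (an assumed, generic text-CNN layout, not the paper's model).
    """
    return window * embed_dim * num_filters + num_filters


if __name__ == "__main__":
    # Hypothetical sizes chosen only to show the linear growth in window size.
    for n in (3, 5, 7):
        print(f"{n}-gram filters: {conv_params(n, embed_dim=300, num_filters=100)} parameters")
```

Under these assumptions, each additional word in the window adds `embed_dim * num_filters` weights, which is the growth that a layer with a fixed number of added parameters would avoid.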
|Title of host publication||Information Retrieval Technology - 15th Asia Information Retrieval Societies Conference, AIRS 2019, Proceedings|
|Editors||Fu Lee WANG, Haoran XIE, Wai LAM, Aixin SUN, Lun-Wei KU, Tianyong HAO, Wei CHEN, Tak-Lam WONG, Xiaohui TAO|
|Number of pages||12|
|Publication status||E-pub ahead of print - 27 Feb 2020|
|Event||The 15th Asia Information Retrieval Societies Conference - Open University of Hong Kong, Hong Kong, Hong Kong|
Duration: 7 Nov 2019 → 9 Nov 2019
|Name||Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)|
|Conference||The 15th Asia Information Retrieval Societies Conference|
|Period||7/11/19 → 9/11/19|
Bibliographical note: This work was supported by the Fundamental Research Funds for the Central Universities, SCUT (No. 2017ZD048, D2182480), the Science and Technology Planning Project of Guangdong Province (No. 2017B050506004), and the Science and Technology Programs of Guangzhou (No. 201704030076, 201707010223, 201802010027, 201902010046).
- CNN model
- Text classification
- Weighted n-grams features