Abstract
Scientific articles contain rich knowledge that can significantly assist scientific research, but this knowledge is difficult to extract precisely because of the complexity of the articles' discourse structure. To provide researchers in a specific academic domain with more accurate research knowledge, it is necessary to study the discourse structure of that domain's scientific articles and to develop an approach that automatically annotates discourse information in them. Unfortunately, few works have studied the discourse structure of domain-specific scientific articles or the corresponding automatic discourse annotation. To fill this gap, we take scientific articles from the wastewater-based epidemiology domain as a case study of how to annotate discourse information automatically and efficiently. This paper makes three contributions. First, we propose a two-layer hierarchical discourse model to cover all potential discourses in scientific articles across domains. Specifically, the first layer defines four core discourse concepts that describe the main process of scientific research and can be applied to scientific articles in any domain. The second layer defines a fine-grained, domain-specific structure that can accurately describe the entire research content of a specific domain. Second, based on the proposed model, we build a corpus of 100 annotated scientific articles in the wastewater-based epidemiology domain. Third, based on the model and dataset, we propose a simple yet efficient Top-K resampling-based approach to train a more effective classifier for automatic annotation. Extensive experiments verify the effectiveness and efficiency of the proposed hierarchical discourse model and the Top-K resampling-based classification approach.
Original language | English |
---|---|
Title of host publication | Proceedings of the 2022 IEEE International Conference on Systems, Man and Cybernetics (SMC) |
Publisher | IEEE |
Pages | 774-781 |
Number of pages | 8 |
ISBN (Print) | 9781665452588 |
Publication status | Published - Oct 2022 |
Externally published | Yes |
Event | 2022 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2022 - Prague, Czech Republic. Duration: 9 Oct 2022 → 12 Oct 2022 |
Conference
Conference | 2022 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2022 |
---|---|
Country/Territory | Czech Republic |
City | Prague |
Period | 9/10/22 → 12/10/22 |
Funding
This work was supported in part by the National Key Research and Development Program of China under Grant 2019YFB2102102, in part by the National Natural Science Foundation of China (NSFC) under Grant 62176094 and Grant 61873097, in part by the Key-Area Research and Development of Guangdong Province under Grant 2020B010166002, and in part by the Guangdong Natural Science Foundation Research Team under Grant 2018B030312003.
Keywords
- automatic annotation
- discourse
- scientific articles
- text classification