Effectiveness of natural language processing techniques in categorizing scientific articles by research methodology.
The code and supporting files used in my bachelor thesis.
Abstract: With the ever-growing number of published scientific articles, it is increasingly challenging for researchers to find, review and use relevant research. This study explores the potential of unsupervised text classification models, specifically a zero-shot classification model (GPTNLI) and a similarity-based classification model (Lbl2vec), to streamline the literature review process. These models predict an article's methodological approach from readily available information such as the title, keywords and abstract, thereby adding a filter to scientific database searches. To accomplish this, an extensive and well-structured definition is established for each class. The findings demonstrate that the GPTNLI model using GPT-4 outperforms the other models in accuracy and F1 score while showing less variability in its performance. A binomial test shows that the model performs significantly better than a random-guess strategy. Although these results are promising, the study has limitations, for instance the small test datasets and the lack of a cost-benefit analysis. Future research could improve the performance of the models by incorporating more sections of each article, fine-tuning further and adding self-learning capabilities.
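The binomial test mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis code: the function name `binomial_p_value`, the number of classes and the counts below are all hypothetical, and the sketch assumes a one-sided exact test in which a random guesser picks each class with equal probability.

```python
from math import comb


def binomial_p_value(correct: int, total: int, chance: float) -> float:
    """One-sided exact binomial test.

    Returns P(X >= correct) for X ~ Binomial(total, chance), i.e. the
    probability that random guessing gets at least `correct` predictions
    right out of `total`.
    """
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    if not 0.0 < chance < 1.0:
        raise ValueError("chance must lie strictly between 0 and 1")
    return sum(
        comb(total, k) * chance**k * (1.0 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical numbers for illustration only (not results from the thesis):
# 4 methodology classes, 50 test articles, 30 classified correctly.
num_classes = 4
p_value = binomial_p_value(correct=30, total=50, chance=1 / num_classes)
print(f"p-value against random guessing: {p_value:.3g}")
```

A small p-value here would indicate that the observed number of correct predictions is unlikely under uniform random guessing. If the classes are imbalanced, a different baseline (for example, always predicting the majority class) may be the more appropriate comparison.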