How has IxD&A Evolved in the Last Decade? A Topic Modeling Approach to Identify Salient Themes and Cultural Shifts

Razvan Paroiu, Mihai Dascalu, Carlo Giovannella
pp. 269–288
(https://doi.org/10.55612/s-5002-067-013)

Abstract

The volume of published articles has grown exponentially, making it considerably harder to survey the literature, even within a single scientific journal archive or a repository of educational resources. We therefore developed, and present in this paper, a toolkit that processes full-text manuscripts starting from their PDFs, extracts topics, and computes correlations between manuscripts based on their shared topics. The effectiveness of the toolkit is demonstrated by applying it to an extensive text collection: the full-text articles of the "Interaction Design and Architecture(s)" (IxD&A) journal published between 2013 and 2024 (N = 450). Topic modeling was performed with BERTopic, using Llama as the LLM to generate coherent topic labels. Introduced as a modern substitute for Latent Dirichlet Allocation, BERTopic identifies topics by combining Transformer-based embedding models with dimensionality reduction and clustering methods such as UMAP and HDBSCAN. We extracted 246 topics from the entire corpus, which were then automatically filtered for specificity using Llama-3.1. In-depth visualizations and analyses are presented, covering the main topics and their evolution over time in step with technological advances, while maintaining a focus on a human-centered, smart learning perspective. We release our toolkit as open source on GitHub so that users can easily apply our method to other journals, repositories of educational materials, and other relevant contexts. In addition, we have integrated the generated views into the IxD&A journal webpage (https://ixdea.org/background-toi/) to make the toolkit's outputs directly accessible to end users, in our case the readers of the IxD&A journal.
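To make the pipeline concrete, the minimal Python sketch below shows how such a BERTopic setup can be assembled, with UMAP for dimensionality reduction, HDBSCAN for clustering, and an instruction-tuned Llama-3.1 model used to filter out overly generic topics. The model names, hyperparameters, prompt wording, and placeholder document list are illustrative assumptions, not the authors' actual configuration, which ships with the open-source toolkit.

    # Sketch of a BERTopic pipeline: Transformer embeddings -> UMAP -> HDBSCAN.
    # Assumes the bertopic, umap-learn, hdbscan, sentence-transformers, and
    # transformers packages; all settings below are illustrative assumptions.
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer
    from umap import UMAP
    from hdbscan import HDBSCAN
    from transformers import pipeline

    documents = ["..."]  # placeholder: plain text extracted from each article PDF

    # 1. Embed each manuscript with a Transformer-based sentence encoder.
    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

    # 2. Reduce embedding dimensionality with UMAP before clustering.
    umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine", random_state=42)

    # 3. Cluster the reduced embeddings with HDBSCAN (outliers get topic -1).
    hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean", prediction_data=True)

    topic_model = BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
    )
    topics, probs = topic_model.fit_transform(documents)
    print(topic_model.get_topic_info().head())  # top keywords per extracted topic

    # 4. Hypothetical specificity filter: ask a Llama-3.1 instruct model whether
    #    a topic's keywords name a specific theme; the prompt is an assumption.
    llm = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

    def keep_topic(keywords):
        prompt = ("Keywords of a topic extracted from a journal corpus: "
                  + ", ".join(keywords)
                  + ". Is this topic specific rather than generic? Answer YES or NO: ")
        answer = llm(prompt, max_new_tokens=3)[0]["generated_text"][len(prompt):]
        return "YES" in answer.upper()

Correlations between manuscripts can then be derived from the resulting topic assignments, for example by comparing which topics two documents share.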

Keywords: topic modeling, BERTopic, text analysis, Llama 3.1, scientific journal archives

