Text-mining and Analysis in Digital Scholarship Research: Selected Text Mining Projects

Digital Scholarship Research Projects

Digital scholarship projects that employ different text mining techniques help researchers discover hidden patterns in data.

Text Data Preparation: a Practice in R using the Sheng Xuanhuai Collection

This project was initiated by the Library and created by Dr. Yun Tai. It uses R to process and analyse the Sheng Xuanhuai Collection, which is owned by the Art Museum of CUHK, to demonstrate how computational text processing and analysis can be done for Chinese texts.

R word segmentation packages (e.g. jiebaR) were selected for this project and initially applied to two volumes as a proof of concept. The project describes the entire computational text processing workflow, from setting up the R environment to creating a matrix of word counts (a term-document matrix, TDM) and a word cloud.
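As a rough illustration of what a term-document matrix contains, the sketch below counts how often each word occurs in each document. It is written in Python rather than R and is not the project's own code. It assumes the texts have already been segmented into word lists (the step that jiebaR performs in the project). The function name and the sample tokens are made up for the example.

```python
from collections import Counter

def term_document_matrix(docs):
    """Build a term-document matrix from pre-segmented documents.

    docs maps a document name to its list of word tokens.
    Returns (terms, names, matrix), where matrix[i][j] is the number
    of times terms[i] occurs in the document names[j].
    """
    counts = {name: Counter(tokens) for name, tokens in docs.items()}
    names = list(docs)
    # Every distinct word across all documents becomes one row.
    terms = sorted(set().union(*counts.values()))
    matrix = [[counts[name][term] for name in names] for term in terms]
    return terms, names, matrix

# Hypothetical, already-segmented tokens standing in for two volumes.
sample = {
    "vol1": ["電報", "鐵路", "電報"],
    "vol2": ["鐵路", "銀行"],
}
terms, names, matrix = term_document_matrix(sample)
for term, row in zip(terms, matrix):
    print(term, row)
```

Summing each row gives the overall word frequencies, which is the kind of input a word cloud is typically drawn from.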


Author Frequency of The Chinese Student Weekly

The Chinese Student Weekly (henceforth The Weekly) was a magazine published in Hong Kong between 1952 and 1974, with a total of 1,128 issues. It started as a single-sheet publication in the 1950s and expanded to three to four sheets in the 1960s. The space it provided for established as well as student authors nurtured many important contemporary Hong Kong literary authors.

With the authorization of Mr. Lam Yut Hang in 2003, the CUHK Library digitised the whole collection of The Weekly (except some 1967 issues) and deposited it in the Hong Kong Literature Database. Among nearly 80 thousand articles, which author published the largest number of works? Who published most frequently in each year? How did the ranking of frequent authors change across the twenty-two years of publication? These questions are difficult to explore using traditional research methods in the humanities. Therefore, we used a bar chart race and word clouds to visualise the author frequency of The Weekly. Besides presenting some of the visualisations in this report, we also introduce the methods and code used to create the dataset and the visualisations.
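The questions above reduce to counting articles per author per year. The following minimal Python sketch shows one way such a dataset could be prepared. It is not the project's code: the record format of (year, author) pairs, the function names, and the placeholder authors are all assumptions, as is the choice to rank authors by cumulative counts for each frame of a bar chart race.

```python
from collections import Counter, defaultdict

def yearly_author_counts(records):
    """Count articles per author within each year.

    records is an iterable of (year, author) pairs, one per article.
    """
    per_year = defaultdict(Counter)
    for year, author in records:
        per_year[year][author] += 1
    return dict(per_year)

def cumulative_top_authors(per_year, n=3):
    """For each year in order, list the n most published authors so far."""
    running = Counter()
    frames = {}
    for year in sorted(per_year):
        running.update(per_year[year])  # add this year's counts to the totals
        frames[year] = running.most_common(n)
    return frames

# Placeholder records; real data would come from the article metadata.
records = [
    (1952, "Author A"), (1952, "Author A"), (1952, "Author B"),
    (1953, "Author B"), (1953, "Author B"), (1953, "Author B"),
    (1953, "Author C"),
]
per_year = yearly_author_counts(records)
frames = cumulative_top_authors(per_year, n=2)
for year, ranking in frames.items():
    print(year, ranking)
```

The per-year counts answer who published most in a given year, while the cumulative frames show how the ranking shifts over time.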


"(Re-)Mining" An Annotated Bibliography of the Classical Writings of Hong Kong Poets for Social Network Study

"An Annotated Bibliography of the Classical Writings of Hong Kong Poets" (The Bibliography) 《香港古典詩文集經眼錄》(經眼錄) was published in 2011 by CUHK Library. Compiled by the Library's Research Associate, YW Chau, it collects information on over 800 titles written by 514 Hong Kong classical poets from the 1840s to recent years. "Hong Kong poets" refers to Chinese poets who fled from China and stayed in Hong Kong, or who were born in Hong Kong. This project is an experiment in "re-"mining and processing the content of The Bibliography. We explored displaying the content with digital scholarship tools through a poet network visualisation and a geo-spatial display of poet origins.

Other Research Projects

Digital Humanities for Authorship Attribution Problem of Shiji: Based on Frequency of Function Words Within the Hereditary Houses Section 《史記》作者數位化研究初探-以三十世家虛字字頻為例 --- By Shih-Wen Chyu 邱詩雯

The author of Shiji (史記) is Sima Qian, but part of it was rewritten from an old manuscript by his father, Sima Tan. In addition to Sima Qian and Sima Tan, Chu Shaosun is also one of the authors of Shiji. This study used the term frequency statistics tool on DocuSky to count words in 30 chapters of Shiji (the Hereditary Houses section). First, Chyu took the function words (虛字) in Shiji as the subject of statistical analysis, and then used the writing of Chu Shaosun as a control group to compare the differences in word frequency among the three authors. The project argues that digital methods can be used to demonstrate the similarities and differences between the authors of Shiji.

From: Journal of Digital Archives and Digital Humanities Vol.2, pp. 49-69 (October, 2018) 
DOI: 10.6853/DADH.201810_2.0003
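To give a sense of how function-word frequencies can be compared between texts, here is a minimal Python sketch. It is not the DocuSky tool or the method used in the study: the per-1,000-character normalisation, the absolute-difference distance, the function names, and the short list of single-character function words are all assumptions made for illustration.

```python
from collections import Counter

def function_word_profile(text, function_words):
    """Frequency of each function word per 1,000 characters of text.

    Assumes every function word is a single character.
    """
    total = len(text)
    counts = Counter(ch for ch in text if ch in function_words)
    return {w: (1000 * counts[w] / total if total else 0.0)
            for w in function_words}

def profile_distance(p, q):
    """Sum of absolute differences between two profiles (same keys)."""
    return sum(abs(p[w] - q[w]) for w in p)

# Illustrative function words and made-up snippets, not real Shiji data.
function_words = ["之", "也", "而"]
chapter_a = function_word_profile("之乎者也之", function_words)
chapter_b = function_word_profile("而今而後", function_words)
print(chapter_a)
print(profile_distance(chapter_a, chapter_b))
```

A smaller distance means two texts use the chosen function words at more similar rates. Such a figure is only suggestive evidence about authorship, and it depends heavily on text length and on which words are chosen.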


Digital Model of Spatio-temporal Narratives of Chinese Classical Narrative Literature: A Case Study on The Tale of Li Wa 中国古典叙事文学的时空叙事数字模型研究——以《李娃传》为例 --- By MA Zhaoyi, HE Jie*, LIU Shuaishuai 马昭仪, 何捷*, 刘帅帅

Literary cartography enables the representation of literary space and the mapping between diachronic text and synchronic space. It serves as an 'ancillary science' in literary geography methodologies. Previous cartographic practices usually focus on either of these two items rather than considering text and space as an interactive whole. In order to reconstruct the traditional relationship between linear narrative and juxtaposed space in Chinese classical narrative literature, this paper proposes a framework of digital models integrating theories of spatial narrative and methods from literary cartography, computational narratology, and geo-narrative. It thereby reveals the spatio-temporal narrative of a Chinese classical novel, The Tale of Li Wa, a love story set in Chang'an, the capital of the Tang Dynasty (618-907 AD), which has been diversely interpreted by literary critics and historians for approximately 900 years.

From: Journal of Geo-information Science, 2020,22(5):967-977.
DOI: 10.12082/dqxxkx.2020.190730