Text-mining and Analysis in Digital Scholarship Research: Concepts of Text Mining


This guide provides information on text mining and analysis, including types and tools, project examples, and useful websites. You are welcome to contact us for advice or information on conducting your research.

Digital Scholarship Projects

The Library welcomes CUHK faculty members and researchers to collaborate with us in conducting digital scholarship research. Please visit our Digital Scholarship Projects page for the projects conducted.

Contact us
Tel.: (852) 3943 9954

Digital Scholarship Lab

Located on the G/F of the University Library, The Chinese University of Hong Kong, the Digital Scholarship Lab aims to provide a cutting-edge space that supports digital scholarship research.

What is Text Mining?

(Source: Elsevier. (2015). What is Text Mining?

Text mining can be broadly defined as a knowledge-intensive process in which a user interacts with a document collection over time by using a suite of analysis tools. In a manner analogous to data mining, text mining seeks to extract useful information from data sources through the identification and exploration of interesting patterns. In the case of text mining, however, the data sources are document collections, and interesting patterns are found not among formalized database records but in the unstructured textual data in the documents in these collections. (Feldman and Sanger, 2007:1)

Five Major Phases of Text Mining

(Source: KDnuggets. A General Approach to Preprocessing Text Data,

  • Data Collection
    • After defining the problem statement, researchers need to collect data that can help solve it. Traditionally, data had to be prepared manually, but researchers can now use technical methods such as web scraping to save time.

  • Preprocessing
    • Preprocessing is a fundamental stage of data analysis. Although a lot of information is available in various data sources and on the Web, raw text is rarely ready for analysis as it stands. Text preprocessing is the practice of cleaning and transforming text data into a usable format. This step usually involves:
      • Segmentation
      • Normalization (Stemming, Lemmatization)
      • Tokenization
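
The three preprocessing steps above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the function names are hypothetical, the sentence splitter assumes English punctuation, and `crude_stem` is a toy suffix stripper standing in for a real stemmer or lemmatizer.

```python
import re

def segment(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Lowercase a sentence and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())

def crude_stem(token):
    """Toy normalizer: strip a few common English suffixes.

    This is NOT the Porter stemmer; it only illustrates the idea.
    """
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Return one list of normalized tokens per sentence."""
    return [[crude_stem(t) for t in tokenize(s)] for s in segment(text)]

sample = "Researchers collected documents. They analyzed the texts!"
print(preprocess(sample))
```

In practice, a dedicated NLP library would normally replace the toy stemmer, and languages without whitespace word boundaries (such as Chinese) need a dedicated word segmenter.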

  • Feature Engineering
    • Feature extraction is part of the dimensionality reduction process. In this step, the initial set of raw data is divided and reduced to more manageable groups. It usually involves techniques such as:
      • TF-IDF (Term Frequency – Inverse Document Frequency)
      • Word2Vec
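
As a rough illustration of TF-IDF, the sketch below weights each term by its frequency within a document multiplied by the logarithm of the inverse document frequency. It assumes one common variant of the formula (raw term frequency divided by document length, and `log(N / df)` with no smoothing); libraries often use slightly different definitions. The function name and sample documents are made up for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical, already-tokenized documents.
docs = [
    ["text", "mining", "finds", "patterns"],
    ["data", "mining", "finds", "records"],
    ["text", "documents", "are", "unstructured"],
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

Terms that occur in only one document (such as "patterns") receive a higher weight than terms shared across documents (such as "mining"), and a term that occurs in every document gets a weight of zero under this variant.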

  • Modeling and Analyzing 
    • Text analytics uses machine learning algorithms and artificial intelligence (AI) to understand the meaning of text documents. Common machine learning techniques include:
      • Classification (Support Vector Machine, Naive Bayes) 
      • Clustering (KMeans)
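
To make the classification idea concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing, written with the standard library only. It covers just one of the techniques listed above; the function names and the tiny training set are hypothetical, and a real project would more likely rely on an established machine learning library.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (tokens, label) pairs. Returns the fitted counts."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def classify(tokens, model):
    """Return the label with the highest log-probability for the tokens."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for t in tokens:
            if t in vocab:  # ignore words never seen in training
                # Add-one smoothing avoids zero probabilities.
                score += math.log(
                    (word_counts[label][t] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data for illustration only.
training = [
    (["great", "film", "loved", "it"], "pos"),
    (["wonderful", "acting", "great", "story"], "pos"),
    (["boring", "film", "hated", "it"], "neg"),
    (["terrible", "acting", "boring", "story"], "neg"),
]
model = train_naive_bayes(training)
print(classify(["great", "story"], model))
print(classify(["boring", "film"], model))
```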

  • Interpretation and Visualization
    • Visualization techniques simplify the process of finding relevant information and help present textual information more attractively.
      • Word Cloud
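
A word cloud is essentially a picture of word frequencies, so the data behind it can be sketched as a simple count. The example below is an illustration only: the function name and the very short stop-word list are assumptions, and rendering the actual image would require a separate plotting or word-cloud tool.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "are"}

def word_frequencies(text, top_n=5):
    """Return the top_n most frequent non-stop-words as (word, count) pairs."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(top_n)

sample = "Text mining is the mining of text. Mining finds patterns in text."
print(word_frequencies(sample, top_n=3))
```

The counts returned here would determine the relative size of each word in the cloud.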

Digital Scholarship Librarian

Kitty Siu
Digital Initiatives Team
The Chinese University of Hong Kong Library, The Chinese University of Hong Kong, Shatin, N.T.
(852) 3943 9731
