Text-mining and Analysis in Digital Scholarship Research: Concepts of Text Mining

Introduction

This guide provides information on text mining and analysis, including types and tools, project examples, and useful websites. You are welcome to contact us for advice or information on your research.

Digital Scholarship Projects

The Library welcomes CUHK faculty members and researchers to collaborate with us on digital scholarship research. Please visit our Digital Scholarship Projects page to see the projects conducted so far.

Contact us
Email: dslab@lib.cuhk.edu.hk
Tel.: (852) 3943 9954

Digital Scholarship Lab

Located on the G/F of the University Library, The Chinese University of Hong Kong, the Digital Scholarship Lab aims to provide a cutting-edge space that supports digital scholarship research.

What is Text Mining?


(Source: Elsevier. (2015). What is Text Mining? https://www.youtube.com/watch?v=I3cjbB38Z4A)

Text mining can be broadly defined as a knowledge-intensive process in which a user interacts with a document collection over time by using a suite of analysis tools. In a manner analogous to data mining, text mining seeks to extract useful information from data sources through the identification and exploration of interesting patterns. In the case of text mining, however, the data sources are document collections, and interesting patterns are found not among formalized database records but in the unstructured textual data in the documents in these collections. (Feldman and Sanger, 2007:1)

Five Major Phases of Text Mining

(Source: KDnuggets. A General Approach to Preprocessing Text Data, https://www.kdnuggets.com/2017/12/general-approach-preprocessing-text-data.html)

  • Data Collection
    • Once researchers have defined the problem, they need to collect data that can help solve it. Traditionally, data was prepared manually. Today, researchers can use technical methods such as web scraping to save time.
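
As a minimal sketch of the scraping idea above, the following Python example pulls the visible text out of an HTML page using only the standard library. The class name `TextExtractor`, the helper `extract_text`, and the sample HTML are hypothetical and used for illustration only. A real project would also need to download the pages and respect each site's terms of use.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML string as one line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page content standing in for a downloaded document.
sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Text Mining</h1><p>Patterns in documents.</p></body></html>"
)
page_text = extract_text(sample_html)
print(page_text)
```

The extracted text can then be saved as one document in the collection to be analyzed.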

  • Preprocessing
    • Preprocessing is a fundamental stage of data analysis. Although a lot of information is available in various data sources and on the Web, it is often not in a form that can be analyzed directly. Text preprocessing is the practice of cleaning and transforming text data into a usable format. This step usually involves:
      • Segmentation
      • Normalization (Stemming, Lemmatization)
      • Tokenization
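
The three preprocessing steps listed above can be sketched in a few lines of Python. This is a simplified illustration, assuming English text: `segment`, `tokenize`, and `toy_stem` are hypothetical helper names, and `toy_stem` is a crude suffix stripper standing in for a real stemmer or lemmatizer, not an implementation of any published algorithm.

```python
import re


def segment(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Lowercase the sentence and extract runs of letters or digits."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def toy_stem(token):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters would remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


text = "Text mining finds patterns. It explores document collections!"
sentences = segment(text)
tokens = [[toy_stem(t) for t in tokenize(s)] for s in sentences]
print(tokens)
```

Note that the toy stemmer reduces "mining" to "min", which shows why research projects usually rely on an established stemmer or lemmatizer instead.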

  • Feature Engineering
    • Feature engineering turns preprocessed text into numerical features that a model can work with. Feature extraction is part of the dimensionality reduction process: the initial raw data is divided and reduced to a smaller, more manageable set of features. Common techniques include:
      • TF - IDF (Term Frequency – Inverse Document Frequency)
      • Word2Vec
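
To make TF-IDF concrete, here is a small sketch that computes one common variant by hand: term frequency is a term's count divided by the document length, and inverse document frequency is the natural log of (number of documents / number of documents containing the term). Libraries often use slightly different smoothing, so treat the exact numbers as specific to this variant. The function name `tf_idf` and the sample documents are hypothetical. Word2Vec is not shown because it requires a trained model.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already tokenized documents.
docs = [
    ["text", "mining", "tools"],
    ["data", "mining", "tools"],
    ["text", "analysis"],
]
weights = tf_idf(docs)
print(weights[1])
```

In the second document, "data" receives a higher weight than "mining" because "data" appears in only one document, while "mining" appears in two.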

  • Modeling and Analyzing 
    • Text analytics uses machine learning algorithms and artificial intelligence (AI) to understand the meaning of text documents. Common machine learning techniques include:
      • Classification (Support Vector Machine, Naive Bayes) 
      • Clustering (KMeans)
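
As an illustration of classification, the sketch below implements a small multinomial Naive Bayes classifier with add-one (Laplace) smoothing in plain Python. The function names, labels, and training documents are all hypothetical, and a real project would more likely use an established machine learning library. Clustering is not shown here.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Count documents and terms per class from (tokens, label) pairs."""
    class_doc_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_doc_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_doc_counts, term_counts, vocab


def classify(tokens, model):
    """Return the label with the highest log posterior score."""
    class_doc_counts, term_counts, vocab = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_doc_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_terms = sum(term_counts[label].values())
        for token in tokens:
            if token not in vocab:
                continue  # ignore terms never seen in training
            # Add-one smoothing avoids zero probabilities.
            score += math.log(
                (term_counts[label][token] + 1) / (total_terms + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: tokenized documents with subject labels.
training = [
    (["poem", "verse", "rhyme"], "literature"),
    (["novel", "poem", "chapter"], "literature"),
    (["gene", "protein", "cell"], "biology"),
    (["cell", "dna", "gene"], "biology"),
]
model = train_naive_bayes(training)
print(classify(["poem", "novel"], model))
print(classify(["dna", "cell"], model))
```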

  • Interpretation and Visualization
    • Visualization simplifies the process of finding relevant information and presents textual information in a more accessible and engaging way. A common example is:
      • Word Cloud
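
A word cloud is built on word frequencies: the more often a word occurs, the larger it is drawn. The sketch below shows only that underlying calculation, not the drawing itself. The stopword list, the function name `word_sizes`, and the size range are hypothetical choices for illustration.

```python
from collections import Counter

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "of", "and", "in", "a", "to"}


def word_sizes(tokens, top_n=5, base_size=10, max_size=40):
    """Map the most frequent non-stopwords to font sizes.

    The most frequent word gets max_size; other words are scaled
    linearly by their count relative to the highest count.
    """
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    top = counts.most_common(top_n)
    if not top:
        return {}
    highest = top[0][1]
    return {
        word: base_size + (max_size - base_size) * count / highest
        for word, count in top
    }


tokens = "the mining of text and the mining of data in the text mining lab".split()
sizes = word_sizes(tokens)
print(sizes)
```

A plotting or word cloud library can then place each word on a canvas at the computed size.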

Digital Scholarship Librarian

Kitty Siu
Contact:
Research Support and Digital Initiatives, University Library, The Chinese University of Hong Kong, Shatin, N.T.
(852) 3943 9731