This guide provides information on text mining and analysis, including types and tools, project examples, and useful websites. You are welcome to contact us for advice or information on your research.
The Library welcomes CUHK faculty members and researchers to collaborate with us in conducting digital scholarship research. Please visit our Digital Scholarship Projects page for the projects conducted.
Text mining can be broadly defined as a knowledge-intensive process in which a user interacts with a document collection over time by using a suite of analysis tools. In a manner analogous to data mining, text mining seeks to extract useful information from data sources through the identification and exploration of interesting patterns. In the case of text mining, however, the data sources are document collections, and interesting patterns are found not among formalized database records but in the unstructured textual data in the documents in these collections. (Feldman and Sanger, 2007:1)
After researchers have defined the problem statement in the previous section, they need to collect data that can help solve the defined problem. Traditionally, data had to be prepared manually. Nowadays, researchers can use technical methods such as web scraping to reduce the time this takes.
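As a rough illustration of the scraping idea, the sketch below pulls paragraph text out of an HTML page using only the Python standard library. The `ParagraphExtractor` class, the `extract_paragraphs` helper, and the `SAMPLE_HTML` string are hypothetical names made up for this example. A real project would first download the page (and check the site's terms of use) and would often rely on a dedicated scraping library instead.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._buffer = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            # Join the collected pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the paragraph texts of an HTML string as a list."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Assumed sample input; a real page would be fetched from the Web.
SAMPLE_HTML = """
<html><body>
  <h1>Sample page</h1>
  <p>Text mining finds   patterns in <b>documents</b>.</p>
  <p>Scraping collects the documents.</p>
</body></html>
"""

paragraphs = extract_paragraphs(SAMPLE_HTML)
for paragraph in paragraphs:
    print(paragraph)
```

The heading is skipped and the markup inside each paragraph is dropped, so only plain text is left for the later steps.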
Preprocessing is a fundamental stage of data analysis. A lot of information is available in various data sources and on the Web, but it is rarely in a form that can be analyzed directly. Text preprocessing is the practice of cleaning and transforming text data into a usable format. This step usually involves:
Normalization (Stemming, Lemmatization)
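To make the difference between stemming and lemmatization concrete, here is a minimal sketch in plain Python. Everything in it is a simplifying assumption: `toy_stem` strips a handful of suffixes and is not the Porter stemmer or any other published algorithm, and `LEMMA_TABLE` is a tiny hand-made lookup standing in for the dictionary a real lemmatizer would use.

```python
import re

# Assumed suffix list for the toy stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

# Assumed miniature dictionary; real lemmatizers use large lexicons.
LEMMA_TABLE = {"mice": "mouse", "ran": "run", "better": "good"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(token):
    """Strip the first matching suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def toy_lemmatize(token):
    """Map a token to its dictionary form when the table knows it."""
    return LEMMA_TABLE.get(token, token)


tokens = tokenize("The mice ran; researchers were mining texts.")
stems = [toy_stem(t) for t in tokens]
lemmas = [toy_lemmatize(t) for t in tokens]

print(tokens)
print(stems)
print(lemmas)
```

The stemmer cuts "mining" down to "min", which is not a word, while the lookup turns "mice" into "mouse". Stemming is a crude, rule-based truncation, whereas lemmatization returns valid dictionary forms but needs linguistic resources to do so.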
Feature extraction is a part of the dimensionality reduction process. In this step, an initial set of raw data is divided and reduced to more manageable groups. It usually involves techniques such as:
TF-IDF (Term Frequency – Inverse Document Frequency)
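The sketch below shows one way TF-IDF weights could be computed by hand. It assumes one common variant: term frequency is the term count divided by the document length, and inverse document frequency is the natural logarithm of the number of documents divided by the number of documents containing the term. Libraries often apply smoothing or normalization on top of this, so their numbers may differ. The `tf_idf` function and the three sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length,
    idf = ln(number of documents / documents containing the term).
    """
    n_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Assumed sample corpus, already tokenized and lowercased.
docs = [
    ["text", "mining", "finds", "patterns"],
    ["data", "mining", "finds", "records"],
    ["text", "analysis", "tools"],
]

weights = tf_idf(docs)
for term, weight in sorted(weights[0].items()):
    print(f"{term}: {weight:.3f}")
```

In the first document, "patterns" appears nowhere else in the corpus and therefore receives a higher weight than "mining", which also occurs in the second document. A term found in every document would get a weight of zero under this variant.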
Modeling and Analyzing
Text analytics involves using machine learning algorithms and artificial intelligence (AI) to understand the meaning of text documents. Below are the machine learning techniques: