Text-mining and Analysis in Digital Scholarship Research: Tools of Text Mining

Data Collection

ParseHub

  • ParseHub is a free and powerful web scraping tool. With its advanced web scraper, extracting data is as easy as clicking on the data you need. You can use the data collected with ParseHub to power your products, do research, create visualizations, and make key business decisions.
  • Official Webpage: https://www.parsehub.com/

Beautiful Soup

  • Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
  • Official Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
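
As a rough illustration of the "navigating and searching the parse tree" idea described above, the sketch below parses a small, made-up HTML snippet (not taken from any real site) and pulls out a heading and a list of links. It assumes the `beautifulsoup4` package is installed and uses Python's built-in `html.parser`; the titles and paths in the snippet are purely hypothetical.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML snippet standing in for a downloaded web page.
html = """
<html><body>
  <h1>Sample Catalogue</h1>
  <ul>
    <li><a href="/texts/1">Analects</a></li>
    <li><a href="/texts/2">Mencius</a></li>
  </ul>
</body></html>
"""

# "html.parser" ships with Python; other parsers such as lxml could be used instead.
soup = BeautifulSoup(html, "html.parser")

# Navigate directly to the first <h1> and read its text.
title = soup.h1.get_text(strip=True)

# Search the tree for every <a> tag and collect (link text, href) pairs.
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a")]

print(title)
print(links)
```

In practice the HTML string would usually come from a downloaded page or a saved file rather than being typed inline.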

Selenium

  • Selenium is many things, but at its core it is a toolset for web browser automation that remotely controls browser instances and emulates a user's interaction with the browser. It allows users to simulate common activities performed by end users: entering text into fields, selecting drop-down values, checking boxes, and clicking links in documents.
  • Official Webpage: https://www.selenium.dev/

Preprocessing

CORPRO (庫博中文獨立語料庫分析工具)

  • CORPRO is a linguistics-based text analysis tool. It is designed to support humanities research with functions such as word segmentation, category construction, and auxiliary word search. Its purpose is to give humanities scholars the opportunity to bring their own subject knowledge into dialogue with the text corpus as they review it.
  • Website for software download: http://nlp.cse.ntou.edu.tw/CORPRO/
  • Tutorial playlist

 

OpenRefine

  • OpenRefine is a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data.
  • Official Webpage: https://openrefine.org/

Source: Web Scraper (2020). Data Cleaning with OpenRefine

CKIP Tagger

Chinese Classics OCR

In recent years, technology for recognizing Chinese classical texts has advanced considerably. A few platforms have been developed for individual use and help to boost research on Chinese classics.

古籍酷AI服務 (https://ocr.gj.cool/)

GJ COOL (古籍酷) was developed by Master Xianchao (賢超法師) of Longquan Temple (龍泉寺) in Beijing and provides a free OCR service for images of Chinese classics.

 

中研院文字辨識與校對平台 (https://ocr.ascdc.tw)

The ASCDC OCR Platform is developed by the Academia Sinica Center for Digital Cultures (ASCDC), Taiwan, using image processing and deep learning technology. It covers image processing, text detection, and text recognition, with fine-tuning based on user feedback.

 

Chinese Historical documents Automatic Transcription (CHAT) models (https://github.com/colibrisson/CHAT_models#chinese-historical-documents-automatic-transcription-chat-models)

This is part of an ongoing project by the Numerica Sinologica consortium to build open-source digital tools for pre-modern Chinese studies. The repository contains segmentation and transcription models trained using the kraken OCR engine.

 

Feature Extraction / Modeling

KNIME

  • KNIME is a free and open-source data analytics platform. It integrates various components for machine learning and data mining through its modular data pipelining "Lego of Analytics" concept.
  • Official Webpage: https://www.knime.com/

Weka

  • Weka is a collection of machine learning algorithms for data mining tasks. It contains tools for data preparation, classification, regression, clustering, association rules mining, and visualization.
  • Official Webpage: https://www.cs.waikato.ac.nz/ml/weka/

Scikit-Learn

  • Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.
  • Official Webpage: https://scikit-learn.org/stable
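
To make the "data preprocessing plus model fitting" description more concrete for text mining, here is a minimal sketch that turns a few invented sentences into TF-IDF features and fits a Naive Bayes classifier in a single pipeline. The documents and the labels `voyage` and `poetry` are hypothetical toy data, and the choice of `TfidfVectorizer` with `MultinomialNB` is just one common combination, not a recommendation from this guide.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: four short documents with hand-assigned topics.
docs = [
    "the ship sailed across the sea",
    "sailors and ships at the harbor",
    "the poem praises spring flowers",
    "verses about flowers and the moon",
]
labels = ["voyage", "voyage", "poetry", "poetry"]

# Preprocessing (TF-IDF weighting) and the classifier are chained in one pipeline,
# so raw strings can be passed straight to fit() and predict().
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)

# Classify two unseen sentences.
predictions = model.predict(["a ship in the harbor", "flowers under the moon"])
print(list(predictions))
```

A real project would need far more documents and a held-out test set to evaluate the model; four sentences only show the shape of the workflow.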

TensorFlow

  • TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state of the art in ML and lets developers easily build and deploy ML-powered applications.
  • Official Webpage: https://www.tensorflow.org/

Visualization

Voyant Tools

  • Voyant Tools is an open-source, web-based application for performing text analysis. It supports scholarly reading and interpretation of texts or corpora, particularly by scholars in the digital humanities, but also by students and the general public.
  • Official Webpage: https://voyant-tools.org/

Tableau

  • Tableau is a visual analytics platform that helps people and organizations use their data to solve problems. It helps researchers quickly transform and shape data for analysis.
  • Official Webpage: https://www.tableau.com/

Matplotlib

  • Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
  • Official Webpage: https://matplotlib.org/
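
As a small text-mining flavored example of a static visualization, the sketch below counts words in an invented sentence and draws the three most frequent ones as a bar chart. The sample sentence is hypothetical, and the `Agg` backend is selected only so that the script can run on a machine without a display; the figure is rendered to an in-memory PNG rather than shown on screen.

```python
import io
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt

# Hypothetical sample text; a real project would load a corpus instead.
text = "the cat sat on the mat and the cat slept"

# Count whitespace-separated tokens and keep the three most frequent.
counts = Counter(text.split()).most_common(3)
words, freqs = zip(*counts)

# Draw one bar per word.
fig, ax = plt.subplots()
ax.bar(words, freqs)
ax.set_xlabel("Word")
ax.set_ylabel("Frequency")
ax.set_title("Most frequent words")

# Render the figure as PNG bytes held in memory.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

To write the chart to disk instead, a file name can be passed to `fig.savefig` in place of the buffer.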

Seaborn

  • Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
  • Official Webpage: https://seaborn.pydata.org/