Recent years have seen an increase in the amount of textual data that is accessible in electronic form. Examples include annual reports, company press releases, and newspaper articles, as well as user-generated content on social media such as blogs, forums, and tweets. All of this textual data is generated by humans and may thus contain information about the authors' opinions and preferences. Textual data analysis is the process of deriving high-quality information from text, which can subsequently be used for economic decision making. Employing computers to process textual data allows for (1) analysing digital information more quickly than is possible for the human mind, (2) detecting high-dimensional patterns, and (3) conducting structural analyses on textual data.
While this course primarily focuses on applications in finance, textual data analysis can be applied to other fields as well.
In this course the following topics are covered:
- Introduction to R: The course is based on the free software environment R. Students will get an introduction to the basic concepts in R that are relevant for text mining. Students will learn to use web resources such as R vignettes and stackoverflow.com effectively.
- Data collection: Students will learn how to write R scripts to collect data from different sources. We will use various Application Programming Interfaces (APIs) and the Electronic Data Gathering, Analysis, and Retrieval system (EDGAR) for firm data. For unstructured data, we will program simple web scrapers (a minimal scraping sketch follows this topic list).
- Preprocessing and structuring of data: Most of the retrieved data contains unwanted markup, such as HTML tags, or non-words that have to be removed before the analysis. Students will learn how to use regular expressions in R to filter the data and prepare it for analysis (see the clean-up sketch after this list).
- Data analysis and application: During the course, students will learn to analyse textual data in various ways and apply the results in a finance/business context. In particular, the course will introduce the following concepts:
- Sentiment Analysis evaluates the tone (positive/negative) of a document. This can be used to judge the attitude expressed in a given firm announcement, which in turn could feed into an automated trading algorithm (a dictionary-based sketch follows this list).
- Social Media Analysis is used to extract structural information from data provided by consumers themselves (e.g. tweets). For example, this can be used to measure customers' attitudes towards new products.
- Document Clustering evaluates the similarity of texts. This can be used to cluster companies in order to study economic linkages such as product competition or supplier relationships (see the similarity sketch after this list).
- Machine Learning applications for textual data. We will apply algorithms to detect which textual features (e.g. keywords) can predict certain outcomes, such as product ratings based on written customer reviews (a toy example follows this list).
- Other methods from Natural Language Processing (NLP) and Corpus Linguistics: termness, part-of-speech tagging, collocations, representativeness, etc.
- NLP is developing rapidly, so we will discuss current topics such as the language models BERT/FinBERT and ChatGPT.
- The course will include a guest lecture.
- Critical understanding: While computers process text more efficiently than humans, they are unable to understand it in the way the human mind does. Students will be introduced to the limitations of the text-mining approaches used in class.
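
To give a flavour of the kind of code developed in class, a few illustrative sketches follow. The first is a minimal web scraper of the sort built in the data-collection part of the course; the URL and the choice of the rvest package are placeholder assumptions, not the actual class material.

```r
# Minimal web-scraping sketch using the rvest package
# (the URL below is a placeholder, not a real course example)
library(rvest)

url  <- "https://www.example.com/press-release"
page <- read_html(url)                    # download and parse the HTML

# Extract the text of all paragraph (<p>) nodes
paragraphs <- html_text(html_nodes(page, "p"))
head(paragraphs)
```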
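Next, a small illustration of the regular-expression clean-up covered under preprocessing; the raw string is invented for the example.

```r
# Strip HTML tags and non-letters from a raw snippet using base R regex
raw <- "<p>Net income <b>rose</b> by 12%!</p>"

no_tags <- gsub("<[^>]+>", " ", raw)                    # remove HTML tags
letters <- gsub("[^A-Za-z ]", " ", no_tags)             # keep letters and spaces
clean   <- tolower(trimws(gsub("\\s+", " ", letters)))  # squeeze whitespace
clean
# [1] "net income rose by"
```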
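The sentiment idea can be sketched with a dictionary-based count; the word lists and sentence below are invented, whereas a real application would typically rely on an established finance dictionary such as Loughran-McDonald.

```r
# Count positive and negative dictionary hits in a toy sentence
positive <- c("gain", "growth", "improve", "strong")
negative <- c("loss", "decline", "weak", "litigation")

doc    <- "strong revenue growth offset a small litigation loss"
tokens <- unlist(strsplit(tolower(doc), "\\s+"))

score <- sum(tokens %in% positive) - sum(tokens %in% negative)
score  # 2 positive hits minus 2 negative hits = 0
```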
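For document similarity, the sketch below uses three invented "business descriptions": it builds a small document-term matrix, computes cosine similarity, and clusters the documents.

```r
# Term counts, cosine similarity, and hierarchical clustering on toy documents
docs <- c(firmA = "oil gas drilling pipeline",
          firmB = "oil pipeline refinery gas",
          firmC = "software cloud data platform")

tokens <- strsplit(docs, " ")
vocab  <- sort(unique(unlist(tokens)))
dtm    <- t(sapply(tokens, function(x) table(factor(x, levels = vocab))))

# Cosine similarity between the row vectors of the document-term matrix
norms  <- sqrt(rowSums(dtm^2))
cosine <- (dtm %*% t(dtm)) / (norms %*% t(norms))
round(cosine, 2)

# Cluster documents on cosine distance (1 - similarity)
plot(hclust(as.dist(1 - cosine)))
```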
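Finally, a toy version of the machine-learning idea: linking simple keyword features to product ratings with a linear model. The reviews and ratings are invented; class applications work with real review data and richer feature sets.

```r
# Do the words "great" and "poor" help predict a product rating?
reviews <- c("great battery and great screen",
             "battery died fast, poor quality",
             "decent screen, average battery",
             "great quality but poor support")
rating  <- c(5, 1, 3, 4)

# Simple keyword indicator features
d <- data.frame(rating,
                great = as.integer(grepl("great", reviews)),
                poor  = as.integer(grepl("poor",  reviews)))

# Linear model linking keyword presence to the rating
summary(lm(rating ~ great + poor, data = d))
```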
BAN432 and BAN443 are complementary courses that differ in focus. In BAN432 we approach textual analysis bottom up: how to obtain data, how to clean it, and how to apply it in different settings. We implement these steps with code that we develop in class. BAN443 focuses on the application of large language models (LLMs) in a business context.