An introduction to natural language processing

With so much financial news and data in unstructured formats across the Web, social media and newswires, how can traders and investors make better use of this information to achieve #SmarterTrading? Saeed Amen, co-author of The Book of Alternative Data, provides an introduction to natural language processing.

There is simply too much text being generated in unstructured formats, meaning traders or investors are unable to incorporate it all into their decision-making processes.
Our introduction to natural language processing considers the different stages of understanding and interpreting unstructured text.
Examples of natural language processing from Refinitiv include datasets based on the Thomson Reuters newswires, and the MarketPsych indices.

Traders like working with well-structured time series data. Typically, this might be a price series, or some economic data such as the unemployment rate. However, in practice, a lot of data that could be useful for financial markets is not in such a format.

A lot of this unstructured data is likely to be text, rather than numerical data. This might be text on the web, social media or newswires containing market commentary, analysis or central bank communications.

There is simply too much text being generated for any investor/trader to read and incorporate into their decision-making process. How can we instead automate the process of understanding and interpreting text?

By using natural language processing (NLP), it is possible to structure text into a more usable form for trading. NLP has a computer carry out many of the same tasks a human reader performs, in order to mimic the way a human understands text.

Understanding natural language processing

Before we start discussing some of these tasks and algorithms, let’s first think of the various stages of understanding language.

First, we have phonetics, which deals with the specific sounds generated by the human voice. Then we have phonology, which deals with the sounds used in a specific language. These are a subset of the sounds a human voice can produce.

While languages might share certain sounds, there are some that might be very specific to a particular language.

After that stage, we have morphology. Here, we seek to understand how words are constructed and how they can be decomposed. For example, the words monkey and monkeys are clearly related, and we can break down monkeys into its root (monkey) and the suffix ‘s’.
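As a rough illustration of the monkeys example, the sketch below splits a regular English plural into a root and an ‘s’ suffix. The function name split_plural and its rules are assumptions made for this example only. It handles just the plain ‘s’ suffix, so words such as boxes or cities would need a proper stemmer or lemmatizer.

```python
def split_plural(word):
    """Naively split a regular English plural into (root, suffix).

    Only the plain 's' suffix is handled; any other word is returned
    unchanged with an empty suffix.
    """
    looks_plural = (
        len(word) > 3
        and word.endswith("s")
        and not word.endswith(("ss", "us", "is"))
    )
    if looks_plural:
        return word[:-1], "s"
    return word, ""

print(split_plural("monkeys"))
print(split_plural("monkey"))
```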

We also have different forms of the same word, which can sometimes be identical or at least very similar, like run (verb), running (adjective) and run (noun). For certain languages such as Arabic, where verbs usually consist of three root letters, morphology can be very important.

A step above this, we have syntax, where we examine how words are combined to make a sentence.

Semantics then seeks to ask questions about meaning, such as who, what, why, where and when. Pragmatics involves understanding a text with context, and often requires additional information not necessarily within the text.

Measuring sentiment

Normalization, such as converting text to lowercase and stripping punctuation, is one of the first tasks we might apply to a text.
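A minimal sketch of what normalization might look like, using only the Python standard library. The function name normalize and the particular steps chosen (Unicode folding, lowercasing, replacing punctuation with spaces) are assumptions for illustration, not a fixed recipe.

```python
import string
import unicodedata

def normalize(text):
    """Fold Unicode variants, lowercase, replace punctuation with spaces
    and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Map every ASCII punctuation character to a space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    return " ".join(text.split())

print(normalize("  The Fed HIKED rates -- again!  "))
# prints: the fed hiked rates again
```

Note that this is lossy: a ticker such as EUR/USD would be split into two words, so in a trading context we might want to protect such symbols before stripping punctuation.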

Word segmentation, or tokenization, is used to identify words. In English, we might use a space character to split words. However, this might split up entities (e.g. “Burger King” is a specific entity, which shouldn’t be split).
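One simple way to avoid splitting such entities is to match a list of known names before falling back to ordinary words. The sketch below assumes a small hand-written entity list (ENTITIES is hypothetical); in practice the list would come from a named entity recognition step or a reference database.

```python
import re

# Hypothetical list of multi-word entities that should stay in one piece
ENTITIES = ["Burger King", "Bank of England"]

def tokenize(text, entities=ENTITIES):
    """Split text into word tokens, keeping known multi-word entities whole."""
    parts = []
    if entities:
        # Longest names first, so a longer entity wins over a shorter one
        ordered = sorted(entities, key=len, reverse=True)
        names = "|".join(re.escape(e) for e in ordered)
        parts.append(r"\b(?:" + names + r")\b")
    # Fallback: any run of letters, digits or underscores
    parts.append(r"\w+")
    return re.findall("|".join(parts), text)

print(tokenize("Burger King shares rose after the Bank of England decision"))
```

With an empty entity list, the same function falls back to plain word splitting, so “Burger King” would come out as two separate tokens.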

In order to make sense of a text, we need to create a vectorized representation of it. Word embeddings are one such representation. Bag-of-words is a simpler form, where we just count how often each word occurs, but this ignores grammar and word order.
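A bag-of-words can be sketched with a simple counter. The helper name bag_of_words is an assumption for this example. The second part shows what ignoring word order means in practice: two sentences with opposite meanings end up with the same representation.

```python
from collections import Counter

def bag_of_words(tokens):
    """Count how often each (lowercased) token occurs."""
    return Counter(t.lower() for t in tokens)

tokens = "stocks rally as bond yields fall and stocks hit record".split()
bow = bag_of_words(tokens)
print(bow["stocks"])  # prints: 2

# Word order is lost: both sentences map to the same bag
same = bag_of_words("dog bites man".split()) == bag_of_words("man bites dog".split())
print(same)  # prints: True
```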

If we have sentiment scores for individual words, we can simply multiply each word’s frequency by its sentiment score and sum the results to come up with a sentiment for a whole article.
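The calculation can be sketched as follows. The word scores in LEXICON are made up for illustration, and words missing from the lexicon are assumed to be neutral (score 0).

```python
from collections import Counter

# Hypothetical word-level sentiment scores, for illustration only
LEXICON = {"rally": 1.0, "gain": 1.0, "record": 0.5, "fall": -1.0, "loss": -1.0}

def article_sentiment(tokens, lexicon=LEXICON):
    """Sum of (word frequency x word sentiment) over all words in the article."""
    counts = Counter(t.lower() for t in tokens)
    # Words that are not in the lexicon contribute nothing
    return sum(freq * lexicon.get(word, 0.0) for word, freq in counts.items())

tokens = "stocks rally to record high despite loss at bank".split()
print(article_sentiment(tokens))
```

Here rally (+1.0), record (+0.5) and loss (-1.0) each occur once, giving a mildly positive article score of 0.5. In practice we might also divide by the number of words, so that long and short articles are comparable.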

There are more sophisticated embedding algorithms, such as word2vec and BERT, which are often used, and we can build sentiment-based tools on top of these, too. Obviously, in an investment or trading context, understanding the sentiment of an article is very important.

Structured datasets

Other higher-level NLP tasks include topic modeling, which gives an idea of what a text is about. Being able to tag a text in this way (e.g. is this article about EUR/USD?) is a key part of structuring it for use by traders.

However, doing every single NLP task, such as sentiment and entity tagging, can be very time-consuming and challenging.
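To give a feel for tagging, the sketch below uses plain keyword matching, which is far simpler than topic modeling proper and is only a stand-in for it. The tags and keyword sets in TAG_KEYWORDS are hypothetical.

```python
# Hypothetical keyword sets for each tag, for illustration only
TAG_KEYWORDS = {
    "EUR/USD": {"eur/usd", "euro", "ecb"},
    "USD/JPY": {"usd/jpy", "yen", "boj"},
}

def tag_article(text, tag_keywords=TAG_KEYWORDS):
    """Return the sorted tags whose keywords appear in the text."""
    # Lowercase, split on whitespace and trim punctuation around each word
    words = {w.strip(".,;:!?()\"'") for w in text.lower().split()}
    return sorted(tag for tag, kws in tag_keywords.items() if words & kws)

print(tag_article("The euro slipped after the ECB held rates."))
# prints: ['EUR/USD']
```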

We might therefore choose to use a structured dataset, such as those available from Refinitiv, which include datasets based on the Thomson Reuters newswires. There are also the MarketPsych indices, which use both newswires and social media.

Once we have a structured dataset, it’s possible to create a time series for sentiment and other tags, which can be used in systematic trading rules or by discretionary traders to understand market sentiment.
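As a minimal sketch of that last step, suppose each article has already been reduced to a date, a tag and a sentiment score. The records below are invented, and the record layout is an assumption rather than the format of any particular vendor dataset. Averaging the scores by day gives a daily sentiment time series for one tag.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical structured records: (publication date, tag, sentiment score)
records = [
    (date(2020, 3, 2), "EUR/USD", 0.5),
    (date(2020, 3, 2), "EUR/USD", -0.25),
    (date(2020, 3, 3), "EUR/USD", -0.5),
    (date(2020, 3, 3), "USD/JPY", 0.25),
]

def daily_sentiment(records, tag):
    """Average sentiment per day for one tag, in date order."""
    by_day = defaultdict(list)
    for day, rec_tag, score in records:
        if rec_tag == tag:
            by_day[day].append(score)
    return {day: mean(scores) for day, scores in sorted(by_day.items())}

for day, score in daily_sentiment(records, "EUR/USD").items():
    print(day, score)
```

A systematic rule could then, for example, compare the latest daily value against its recent average, while a discretionary trader might simply chart the series.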

For an in-depth review of natural language processing, listen to our recent NLP and Quant Investing webinar.

The Book of Alternative Data

Alexander Denev and Saeed Amen are co-authors of The Book of Alternative Data (Wiley), which is available for pre-order on Amazon. The book discusses natural language processing and many other topics related to alternative data. Saeed is currently developing a course to teach alternative data, based on the book, which can be taught in-house at firms.
