In 1998, Merrill Lynch claimed that between 80% and 90% of all potentially usable business information originated in unstructured form.
This unstructured data, mostly text, comes from public filings, research reports, and the internet, e.g., public news, blogs, etc. Extracting insight from this textual data has long been a challenge that the buy side, the sell side, and data vendors have all tried to address, either in house or through third-party products.
The principal technique is ‘sentiment analysis’, which extracts the polarity, or opinion, towards an entity, e.g., a company, from a piece of text using a variety of algorithms. These have evolved from simple positive and negative word lists to more mature rule-based and machine-learning approaches. But limitations still exist.
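The earliest of these approaches, a positive/negative word list, can be sketched in a few lines. The word lists below are hypothetical examples for illustration, not any vendor's lexicon:

```python
# Minimal word-list sentiment sketch. POSITIVE/NEGATIVE are tiny
# illustrative sets; real lexicons contain thousands of terms.
POSITIVE = {"beat", "growth", "upgrade", "profit"}
NEGATIVE = {"miss", "loss", "downgrade", "lawsuit"}

def word_list_sentiment(text: str) -> float:
    """Polarity in [-1, 1]: (pos - neg) / matched words; 0.0 if none match."""
    tokens = text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    return (pos - neg) / total if total else 0.0
```

The limitations the article mentions are visible even here: negation (“did not beat”), sarcasm and context all defeat a bag-of-words count, which is why rule-based and machine-learning methods followed.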
Vendor or in-house developed products attempt to explain how an indicator is derived, but the user never fully understands the detail of how the results were calculated. The underlying drivers in the news flow that produced a given result are simply not addressed. This makes the results difficult for business users, be they research analysts, economists, traders or portfolio managers, to interpret, so they lack the confidence to make critical investment decisions based on this approach. Modifying how the sentiment data is derived is at best extremely difficult, if not impossible; it is very time consuming and typically involves expensive custom projects. The knowledge shared between vendor and user remains, at best, isolated.
Less alpha when used by more users
The most widely known sentiment products come from third-party vendors. Processing unstructured data requires significant collaborative effort and specialized skillsets, such as natural language processing (NLP), financial domain knowledge and massively parallel processing (MPP) systems, which are very expensive and difficult to build and maintain. In reality, every sentiment analytics vendor publishes the same results to all of its clients, so even when alpha exists in sentiment data, it is diluted as more and more clients purchase the same feed.
It might sound trivial to distribute sentiment data in the same way as pricing or reference data and leave the user to consume, integrate and use it. This is simply not the case. Firstly, sentiment analytics is about processing and responding to real-time stories, events that are happening right now. Sentiment fades when it takes a long time to consume.
Secondly, building in-house consumption capability is costly. This obstructs the adoption of sentiment data, especially in financial institutions that are forced to spend most of their IT budgets on regulatory or compliance projects.
Quantitative fundamental analysis instead of sentiment analysis
Financial markets are not fully efficient. The price of any financial instrument is affected by expectation, or sentiment, and, more importantly, by macroeconomic, company-related or political events, or even the weather. All these events occur daily and in real time.
People have traditionally relied on fundamental analysis frameworks, e.g., multi-factor modelling, to understand the importance of certain events and to predict future prices. The problem is the model itself: it usually takes fewer than a dozen factors into consideration. Ideally, a fundamental analysis model should consider hundreds, if not thousands, of factors. Simple regression is not suitable at that scale, so more advanced techniques, e.g., machine learning algorithms, should be applied.
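At its core, a linear multi-factor model reduces a predicted return to a weighted sum of factor exposures. A minimal sketch, where the factor names and numbers are purely illustrative assumptions:

```python
# Toy linear multi-factor prediction: expected return = sum over factors
# of (exposure to factor) * (factor return). Exposures would normally be
# estimated by regression; here they are hypothetical inputs.
def factor_model_predict(exposures: dict, factor_returns: dict) -> float:
    """Return the model's predicted return for one instrument."""
    return sum(beta * factor_returns.get(name, 0.0)
               for name, beta in exposures.items())

predicted = factor_model_predict(
    {"value": 0.5, "momentum": 0.3},       # hypothetical exposures
    {"value": 0.02, "momentum": -0.01},    # hypothetical factor returns
)
```

With a dozen factors this linear sum is easy to estimate; with hundreds or thousands, estimation by simple regression becomes unstable, which is the argument above for machine-learning methods.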
The best predictions rely on a timely response to real-time events, yet fundamental analysis builds its models on lagging data, so the model hardly reflects real-world complexity. A proper text analysis system must run in real time to extract maximum value from information and deliver optimal alpha.
Customization is a must
There are two stages in a text analytics cycle. The first, accurately capturing all the events happening in real time along with their relevant features, requires no customization. The second is to build the model and derive value from the refined data, which works best when users can easily embed their own analytical logic in the engine. This is how each user’s domain knowledge or specific trading strategy is translated into alpha unique to that user, alpha that is not diluted however many users subscribe to the same platform.
Since text analytics systems are distributed, run in real time, and require enormous computing power, the traditional buy-implement-maintain-upgrade model of software deployment is no longer practical. All the user requires is an indicator, derived from his or her customised analytical model, that runs on top of real-time streaming data. As long as the vendor provides institutional-grade security and passes all compliance checks, the user needs only a login and a web browser to consume that 80%-90% of potentially usable business information. Until now this has never been achieved.
To support the modelling process, vendors must provide back-testing functions that help users understand the data, verify their models and evaluate performance before committing real money. For quantitative clients, or those who need to integrate the data into in-house systems, a variety of connections, e.g., APIs, should be provided.
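As an illustration of what such a back-test checks, here is a toy sketch, on entirely hypothetical data, that trades the next period in the direction of the current sentiment signal and sums the resulting returns:

```python
# Toy back-test: position for period t+1 is the sign of the sentiment
# signal at t (+1 long, -1 short, 0 flat); PnL is the sum of
# position * next-period return. Data below is invented for illustration.
def backtest(signals: list, returns: list) -> float:
    """Total PnL of a sign-following strategy over aligned series."""
    pnl = 0.0
    for t in range(len(signals) - 1):
        position = 1 if signals[t] > 0 else -1 if signals[t] < 0 else 0
        pnl += position * returns[t + 1]
    return pnl

pnl = backtest([0.5, -0.2, 0.0, 0.3],      # hypothetical sentiment signals
               [0.00, 0.01, -0.02, 0.00])  # hypothetical period returns
```

A production back-test would of course also account for transaction costs, slippage and point-in-time data, but even this sketch shows the basic question: does the signal lead the return?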
We believe the features listed above are mandatory if text analytics is to go mainstream and reshape investment research practices.
Such systems will free business users from intense but low-value manual work and let them focus on proactive, value-added modelling and analytics. In addition, the system should surface information that could materially improve the accuracy of users’ forecasts. With this type of functionality, teams would be forecasting against leading, rather than lagging, indicators, and accuracy should increase significantly as a result.
Click here to learn more about ORBIT Financial Technology