Finding machine learning ready data

Joe Rothermich
Head of StarMine Quantitative Research, Refinitiv

To be machine learning ready, investment firms need high-quality structured and unstructured data — all in tailor-made form. How can they ensure this information is clean, true, and legal in order to achieve their AI goal of #MLReadyData in the search for alpha?


  1. Poor data quality is seen as the biggest barrier to being machine learning ready — well ahead of any other factor in a recent Refinitiv survey.
  2. Organizations will typically use data from a variety of direct or third-party sources and this can expose them to risks.
  3. PermID from Refinitiv creates universal, machine-readable identifiers that help to eliminate mapping inconsistencies, reduce operational risk and streamline processes.

Without the right data, even the most sophisticated and carefully constructed machine learning algorithm can fail to deliver useful results. ‘Garbage in, garbage out’, as the old adage goes.

This view is backed up by our recent “Smarter Humans, Smarter Machines” survey of senior personnel in global finance, including C-suite executives and data scientists.

In the survey, 43 percent of respondents identified poor data quality as the biggest barrier to machine learning adoption — well ahead of any other factors.

Data quality is the biggest barrier to ML adoption

So what is stopping organizations from getting the data they need?

Machine learning and the search for alpha

In a highly competitive sector, investment companies are seeking to get an edge by accessing new data sets, which might range from pricing information and news sentiment to data generated by GPS and mobile phone providers.

Getting this type of raw data machine learning ready can be challenging, from dealing with missing or incomplete records to identifying accurate information about the coverage, history and population of that data.

As well as being expensive and time-consuming, this process also requires careful judgment. For example, should an outlier be excluded because it is wrong or included because it carries valuable insight?
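
One way to keep that judgment call explicit is to flag suspicious values for review rather than delete them automatically. The sketch below is illustrative only: it uses a modified z-score based on the median absolute deviation, and the function name `flag_outliers`, the threshold of 3.5 and the sample returns are assumptions, not something taken from the survey or from any Refinitiv product.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return one boolean per value: True if its modified z-score exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        # All values (or most of them) are identical; nothing can be scored.
        return [False] * len(values)
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]

# Hypothetical daily returns; the last one is either an error or a real event.
daily_returns = [0.01, -0.02, 0.015, 0.005, -0.01, 0.45]
flags = flag_outliers(daily_returns)

for value, is_outlier in zip(daily_returns, flags):
    # Flagged values are routed to a person for review, not silently dropped.
    print(value, "REVIEW" if is_outlier else "ok")
```

Flagging instead of filtering leaves the decision about whether the value is wrong or informative with someone who understands the data.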

While many cutting-edge firms like to work with raw data when they create new trading strategies, such information can produce unreliable results unless it is handled very carefully.

That may explain why our survey also revealed that the use of machine learning for investment idea generation was well behind its use for risk management and compliance.

Top challenges when working with new data for ML

Getting the data right

When building and training models, it’s vital to know that the data you are using is point-in-time, and hasn’t been corrected and backdated due to subsequent events.
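
To make the point-in-time idea concrete, here is a minimal sketch in which every value is stored together with the date it became known, and a lookup returns only what was available on a given day. The `eps_history` records and the `value_as_of` helper are hypothetical and exist only for illustration.

```python
from datetime import date

# Hypothetical revision history for one reported figure:
# (value, date on which that value became publicly known).
eps_history = [
    (1.10, date(2019, 2, 1)),   # as originally reported
    (0.95, date(2019, 6, 15)),  # later restated
]

def value_as_of(history, as_of):
    """Return the latest value published on or before `as_of`, or None."""
    known = [(published, value) for value, published in history
             if published <= as_of]
    if not known:
        return None
    # max() picks the most recent publication date among the known values.
    return max(known)[1]

# A model trained on March 2019 data should see the original figure,
# not the restatement that only appeared in June.
print(value_as_of(eps_history, date(2019, 3, 1)))
print(value_as_of(eps_history, date(2019, 7, 1)))
```

Storing revisions rather than overwriting them is what makes this lookup possible; a table that holds only the latest value cannot answer the question.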

Similarly, adjusting for so-called ‘survivor bias’ is crucial to ensure that you benchmark against all companies active at the time, not just those still around today.
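
As a rough illustration, a benchmark universe can be rebuilt for any historical date from listing and delisting dates, so that companies which later disappeared are still included. The company names, dates and the `universe_on` helper below are invented for the example.

```python
from datetime import date

# Hypothetical listing records: (company, listed_on, delisted_on or None).
listings = [
    ("AlphaCo", date(2005, 1, 3), None),
    ("BetaCorp", date(2008, 5, 1), date(2015, 9, 30)),
    ("GammaInc", date(2017, 3, 1), None),
]

def universe_on(listings, as_of):
    """Return the sorted names of companies that were active on `as_of`."""
    return sorted(
        name
        for name, listed, delisted in listings
        if listed <= as_of and (delisted is None or as_of < delisted)
    )

# BetaCorp no longer exists today, but it belongs in a 2012 backtest.
print(universe_on(listings, date(2012, 1, 1)))
print(universe_on(listings, date(2020, 1, 1)))
```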

Organizations will typically use data from a variety of direct or third-party sources, and this can expose them to risks. Such data may conceal bias in the way it was collected: for example, mobile phone data from a vendor whose install base is drawn primarily from one demographic.

Data that is found to include personally identifiable information could also put those storing and processing it in breach of data privacy regulations.

Avoiding mapping errors

You might expect that identifying individual companies within a data set would be a straightforward task.

However, the use of different symbologies and the need to connect brand names with company names can cloud the picture.

In some cases, organizations may deliberately disguise their identity, for example in order to not reveal their strategic direction in their patenting activity.

To help our customers work across data sets more easily, we developed PermID, which creates universal, machine-readable identifiers that help to eliminate mapping inconsistencies, reduce operational risk, and streamline processes.
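
The general idea of mapping many symbologies onto one permanent identifier can be sketched as a simple alias table. This is not the PermID service or its API: the `ALIASES` table, the `ID-0001` style identifiers and the `resolve` helper are all made up for illustration.

```python
# Hypothetical cross-reference: every known alias points to one permanent ID.
ALIASES = {
    ("ticker", "EXM"): "ID-0001",
    ("name", "example motors"): "ID-0001",
    ("brand", "zoomcar"): "ID-0001",
    ("ticker", "SMP"): "ID-0002",
}

def resolve(kind, raw):
    """Normalize a raw label and return its permanent ID, or None if unknown."""
    cleaned = raw.strip()
    key = cleaned.upper() if kind == "ticker" else cleaned.lower()
    return ALIASES.get((kind, key))

# Labels as they might appear in three different source data sets.
records = [("ticker", "exm "), ("brand", "ZoomCar"), ("name", "Unknown Ltd")]
resolved = [(kind, raw, resolve(kind, raw)) for kind, raw in records]

# Anything that fails to map is kept for manual review instead of being guessed.
unmatched = [r for r in resolved if r[2] is None]

for row in resolved:
    print(row)
```

Once every record carries the same identifier, data sets that use tickers, legal names or brand names can be joined on that one key.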

Machine learning ready data

Financial services firms must plan and allocate the time and resources needed to clean data, test for biases, look for outliers, and review statistical summaries. Here, it is important to create reproducible processes rather than modify data on an ad-hoc basis.
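
A reproducible process, as opposed to an ad-hoc edit, might look like the sketch below: the raw input is never modified, every dropped row is logged, and the statistical summary is computed by code that can be rerun. The field names and the `clean` and `summarize` helpers are assumptions made for the example.

```python
from statistics import mean, median

def clean(rows):
    """Drop rows with a missing price; return (kept_rows, audit_log).

    The input list is left untouched so the step can be rerun at any time.
    """
    kept, log = [], []
    for i, row in enumerate(rows):
        if row.get("price") is None:
            log.append(f"row {i}: dropped, missing price")
        else:
            kept.append(dict(row))  # copy, so later steps cannot alter raw data
    return kept, log

def summarize(rows):
    """Return basic summary statistics for the price field."""
    prices = [r["price"] for r in rows]
    return {
        "count": len(prices),
        "min": min(prices),
        "max": max(prices),
        "mean": mean(prices),
        "median": median(prices),
    }

# Hypothetical raw records, one of which has a missing value.
raw = [
    {"id": "A", "price": 10.0},
    {"id": "B", "price": None},
    {"id": "C", "price": 14.0},
]

cleaned, audit = clean(raw)
summary = summarize(cleaned)
print(audit)
print(summary)
```

Because each step is a function with a logged outcome, the same cleaning can be repeated on next month's data and audited afterwards.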

It is also vital to be sure of your data’s provenance, and that it has been curated, normalized and tagged according to your requirements.

This is where Refinitiv can help with machine learning ready data (#MLReadyData).

We can provide the volume and range of high-quality structured and unstructured data you need — all delivered in the form you require it.

Our experts combine their technical and financial knowledge to refine, map and clean data to your requirements, so that it is ready to feed directly into your machine learning models.

We predict that artificial intelligence will be one of the greatest enablers of competitive advantage in the financial services sector. By providing access to high-quality data and expertise, Refinitiv can help you take full advantage of the opportunities ahead.
