The power of machine learning is undeniable. Yet it is no secret that computers and humans use different languages. To represent real-world problems as interpretable data for machine learning models, especially in fields like sustainability, a combination of both languages can be necessary. In this blog post, I will share what I learned from using Natural Language Processing (NLP) to develop an encoding solution for categorical features in financial data, to be used later for predicting sustainability metrics.
Computers prefer numbers. Consequently, to leverage the power of machine learning, information often needs to be translated into numbers or, to use techier terms, non-numeric data needs to be encoded. Humans, on the other hand, use a different language, in which words and sentences carry meanings that vary with their context. Words are much like variables but, I would argue, often far richer in information that is difficult to enumerate or encode.
Encoding categorical features
Applying machine learning to a real-life problem often involves using information expressed in both languages. The numeric features are understood as they are, while the categorical ones (the non-numeric values) are better understood by humans and therefore need to be translated before they are used. This translation can be a challenge. I will explain why with an example.
Let’s say we have a feature called Animal containing different animal types as values. To translate this into numeric data, we could simply assign numbers to the different values and let dog = 1, horse = 2, … and, finally, cat = 10. However, a problem may arise when we start feeding these values into a machine learning model with a specific task. Say we want to use our encodings in two different problem settings: the first is about animals that enjoy a lot of exercise, and the second is about which animals are most likely to be kept as house pets. In the first setting, it seems reasonable to place dog and horse numerically closer to one another than dog and cat, to indicate that they are somewhat similar when it comes to enjoying exercise. In the second setting, however, dogs and cats are more similar, since both are more likely to be kept as house pets. This toy problem demonstrates something very important: the choice of an appropriate encoding depends on the context of the problem. Furthermore, it is important to remember that neither the data nor the model is intelligent on its own. Therefore, when translating data expressed in human language into numbers, there is every reason to be careful not to lose important meaning in the translation.
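As a minimal sketch of the toy example, the snippet below uses the three integer codes mentioned in the text and measures how far apart two animals end up. The function name code_distance is only an illustrative helper.

```python
# Integer codes taken from the toy example; the numbers themselves are arbitrary.
animal_codes = {"dog": 1, "horse": 2, "cat": 10}


def code_distance(a: str, b: str) -> int:
    """Absolute difference between the integer codes of two animals."""
    return abs(animal_codes[a] - animal_codes[b])


# With these codes, dog sits next to horse and far from cat,
# whether or not that ordering suits the task at hand.
print(code_distance("dog", "horse"))  # 1
print(code_distance("dog", "cat"))    # 9
```

A model that treats the codes as ordinary numbers will inherit this ordering, which may fit the exercise setting but not the house pet setting.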
The project setting
A few years ago, SEB began developing the Impact Metric Tool, software for checking listed companies against the sustainability criteria of the EU Taxonomy. Among other things, this tool includes a prediction model for CO2 emission intensity. The model was to be trained on financial and emission data from ~3500 companies, containing both numeric and categorical features. This leads us to the main question of this article: How should the categorical features be encoded for this problem?
There are many ways to answer this, and I will tell you how I tried to do it.
The interesting company-level categorical features in the data set were country, sector, industry, and industry descriptions. From now on, I will focus mainly on the different industries.
One main issue with the current approach, one-hot encoding, was that it did not preserve any relationships between the categories. Instead of enumerating industries as in the previous toy example, it creates a unique vector for each option in which every element is either one or zero. Without going into detail, it could look something like dog = [1, 0, 0], horse = [0, 1, 0], and cat = [0, 0, 1]. In this problem setting, this means that industries such as Regional Banks, Major Banks, and Contract Drilling are treated as equally similar to one another, even though a human would easily see that companies within the two banking industries share more similarities in CO2 emissions with each other than with companies in the contract drilling industry.
The idea was to solve this through human language, more specifically the meanings of the industries’ names and descriptions. Since a human can clearly distinguish some similarity and dissimilarity in a CO2 emission setting just by reading the industry name, would it be possible to preserve more information about the industry values by relying on the semantics of the industry names? Could I, in other words, improve the computer’s performance on the task by helping it understand this bit of human language?
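To make the "equally similar" point concrete, here is a small sketch using only the standard library. The helpers one_hot and euclidean are illustrative names, not part of the project code.

```python
import math


def one_hot(categories):
    """Map each category to a vector with a single 1 and 0s elsewhere."""
    return {
        name: [1 if i == j else 0 for j in range(len(categories))]
        for i, name in enumerate(categories)
    }


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


industries = ["Regional Banks", "Major Banks", "Contract Drilling"]
encoded = one_hot(industries)

d_banks = euclidean(encoded["Regional Banks"], encoded["Major Banks"])
d_drill = euclidean(encoded["Regional Banks"], encoded["Contract Drilling"])

# Every pair of distinct one-hot vectors is the same distance apart.
print(d_banks == d_drill)  # True
```

Whatever distance measure is used, one-hot vectors give the model no hint that the two banking industries belong together.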
In the quickly developing field of Natural Language Processing (NLP), the number of available resources for this type of task is enormous. My weapon of choice became a pre-trained Sentence Transformer, an NLP tool that encodes text into high-dimensional vectors positioned according to their semantic similarity. Simply put, this would place similar industries closer to one another and keep dissimilar ones far apart. For example, Regional Banks would be closer to Major Banks than to Contract Drilling. This way, the semantic similarity in human language would be preserved even when expressed in numbers.
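How "closer" might be measured can be sketched with cosine similarity, a common choice for comparing sentence embeddings. The three-dimensional vectors below are made-up stand-ins, not output from any model; real embeddings would come from a pre-trained Sentence Transformer and have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical placeholder vectors, chosen by hand for illustration only.
embeddings = {
    "Regional Banks": [0.9, 0.1, 0.2],
    "Major Banks": [0.8, 0.2, 0.1],
    "Contract Drilling": [0.1, 0.9, 0.3],
}

sim_banks = cosine_similarity(embeddings["Regional Banks"], embeddings["Major Banks"])
sim_drill = cosine_similarity(embeddings["Regional Banks"], embeddings["Contract Drilling"])

# With these placeholder vectors, the two banking industries score higher.
print(sim_banks > sim_drill)  # True
```

Unlike one-hot vectors, embeddings can give different similarity scores to different pairs, which is what makes them candidates for carrying meaning into the model.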
How did it work in practice?
A key factor to remember here is that the correlation between semantic similarity and CO2 emission similarity was only hypothesized. As it turned out, semantic similarity of the names and descriptions was not always a good indicator of carbon emissions. Conjecture relying on human language might not always be right! This meant that, even though the Sentence Transformer did its job in preserving the relations based on the meaning of the industry names and descriptions, it did not significantly improve the model's performance in predicting emissions. The two banking industries did indeed share similarities in both semantics and CO2 emissions. Yet in other cases the detected similarity was spurious, such as when comparing the two industries Electricity Installation and Production of Electricity.
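One way such spurious similarities could be surfaced is to flag industry pairs that score high on semantic similarity while differing widely in emission intensity. The sketch below is an assumption about how that check might look, not the method used in the project; the function name, both thresholds, and all numbers are placeholders invented for illustration.

```python
def flag_mismatches(similarity, emissions, sim_threshold=0.7, ratio_threshold=2.0):
    """Return pairs that look alike semantically but differ in emissions.

    Assumes every emission value is strictly positive.
    """
    flagged = []
    for (a, b), sim in similarity.items():
        high = max(emissions[a], emissions[b])
        low = min(emissions[a], emissions[b])
        if sim >= sim_threshold and high / low >= ratio_threshold:
            flagged.append((a, b))
    return flagged


# Placeholder values only; these are NOT results or data from the project.
similarity = {
    ("Regional Banks", "Major Banks"): 0.9,
    ("Electricity Installation", "Production of Electricity"): 0.8,
}
emissions = {
    "Regional Banks": 1.0,
    "Major Banks": 1.2,
    "Electricity Installation": 5.0,
    "Production of Electricity": 500.0,
}

print(flag_mismatches(similarity, emissions))
# [('Electricity Installation', 'Production of Electricity')]
```

The thresholds decide what counts as a mismatch, so they would need to be chosen together with people who know the industries.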
Adding more info, but to whom?
After some sample evaluations, it became clear that there were too many mismatches between semantic similarity and emission similarity. In other words, there were several cases where industries were considered similar semantically but not with regard to emissions, and vice versa. Although it is possible to fine-tune the encodings to take these cases into account, doing so would likely require more company emission data than was available. Despite this, I must admit that, as a data-oriented person, I still found this alternative tempting. But I then remembered that if the only tool you have is a hammer, everything looks like a nail, and decided not to feed more into the numerical solution. Instead, I chose to identify ways of visualizing the data and the semantic similarities to help the project group make more informed decisions. In the end, it became a choice between providing more information to the model or to the people, and I went with the latter by building a dashboard. Ultimately, this decision can be tied back to my title question. My results could not clearly say whether someone can judge the CO2 emissions of an industry by its name; but if we were to try, we should use a combination of human and computer language skills and leverage the strengths of both kinds of language.
This project was conducted during a summer internship – a part of the award for winning the competition “Tech Talent of the Year” – at SEB Group Data Analytics.