While people often call the data in companies’ possession the “new oil” or a new source of wealth, legislators have – with good reason – started to strongly protect the customer’s right to privacy (key concepts from the GDPR: the “right of access”, the “right to erasure”, etc.). For both parties to emerge as winners in this balancing act, it is critically important to think systematically about how a company manages data and its lifecycle.
Let’s shed some light on seven topics that deserve focus:
- How is data generated and where is it stored? The bigger the company, the more varied the data sources. The more data sources, the greater the risk that the data remains isolated and the full potential of the “source of wealth” goes unused. Additional value can (only) be achieved by combining data sets and looking for insights from their interaction. Therefore, it is important to ensure that each new data generator is also centrally accessible (stored in a data warehouse, connectable via APIs, or integrated with some other central solution). Before collecting data, one must consider the purpose for which the data will be used. This matters because a universal, reusable structure and a reasonable data volume can then be put in place early on, avoiding later problems with excessive volumes or an incomplete structure. It is also essential that we do not conflict with the requirements of the General Data Protection Regulation (GDPR) when processing personal data. When collecting personal data, we must have a clear purpose, and the person (data subject) whose data is collected must be aware of that purpose; without this, personal data cannot be processed.
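To make the purpose requirement concrete, here is a minimal sketch of refusing to ingest personal data that lacks a declared, approved purpose. All names and fields are our own illustration, not any real pipeline:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class PersonalDataRecord:
    """A collected data point, tagged with the metadata the GDPR expects us to know."""
    subject_id: str          # the data subject the record concerns
    attribute: str           # e.g. "email", "postal_address"
    value: str
    purpose: str             # the declared purpose of collection
    source: str              # where the data came from (subject, register, derived)
    collected_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def ingest(record: PersonalDataRecord, approved_purposes: set[str]) -> PersonalDataRecord:
    """Refuse to store personal data collected without an approved purpose."""
    if record.purpose not in approved_purposes:
        raise ValueError(f"no approved purpose: {record.purpose!r}")
    return record
```

Attaching purpose and source at the moment of collection, rather than afterwards, is what makes the data both centrally combinable and auditable later.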
- Do we know what data we have collected and where it is? Imagine a warehouse with no structure and no description of how the items are stored. Something can be found in there, but is it the right item, and can it be found at the crucial moment? The same applies to data. Each data point is like an item in the warehouse: we need to know its origin and whether we can trust the source (the data may come from a data subject or an external register, or we may have created it ourselves by combining existing data). We must know where the data is kept (for example, whether it is collected only in a system/application or also sent to the data warehouse; data can also exist in unstructured form, such as a Word document, whose content cannot be placed on a warehouse shelf so easily). We also need a common understanding and clear definitions of the data points. This is especially important when a commonly used term can be interpreted in different ways, when different terms describe the same phenomenon, or when the difference lies in tiny additional details. Defining terms often becomes harder still when several countries are involved and the same term has a different meaning in each. Going further: when does data reach its “best before” date? Where is the data forwarded, and for what purpose is it used? What quality standards are set for the data, and how are they met? Are there legal restrictions on, or even consequences of, data processing? (For example, poor data quality may lead to incorrect figures in statutory reporting.) To get an overview of all this, it must be clear who the owner of the data is, i.e. whom do we ask for additional information, who knows/makes the rules, and who puts together the bigger picture?
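The “warehouse label” idea above can be sketched as a minimal data catalog entry. Every field name here is a hypothetical illustration of the metadata the paragraph asks for (origin, location, owner, “best before” date, consumers), not a real catalog schema:

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    name: str             # canonical term, with one agreed definition
    definition: str       # what the term means, to avoid conflicting interpretations
    origin: str           # data subject, external register, or derived by us
    location: str         # system/application and/or data warehouse table
    owner: str            # whom we ask, who makes the rules
    retention: str        # the "best before" rule for this data
    consumers: list[str]  # where the data is forwarded and why

catalog: dict[str, CatalogEntry] = {}

def register(entry: CatalogEntry) -> None:
    """One term, one definition: refuse silent redefinitions of an existing term."""
    if entry.name in catalog:
        raise ValueError(f"term {entry.name!r} already defined; reuse it, do not redefine")
    catalog[entry.name] = entry
```

The point of the duplicate check is exactly the definition problem described above: two teams (or two countries) quietly maintaining different meanings for the same term is how the warehouse loses its labels.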
- High-quality data is the basis of any data processing. It can be said without exaggeration that if data quality is not ensured, we might as well drop data processing altogether: a conclusion based on erroneous data will only amplify the magnitude of the mistakes. While sometimes a control is built into the data itself (such as check digits or checksums in reference numbers or personal ID codes), logical control rules must otherwise be defined by us. Better still, limit the possibility of entering erroneous data into the system in the first place, or interface the systems so that data is requested from the source systems. At the same time, whenever we detect errors, they should be resolved as close to the source as possible. Saving systematic, governed, high-quality data in a connectable format has also given rise to a new concept alongside big data: smart data.
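As an example of a control built into the data itself, the widely used Luhn (mod 10) check digit, found for instance in payment card numbers, can be validated in a few lines. This is a generic illustration of the idea, not the specific scheme of any particular national ID code or reference number format:

```python
def luhn_valid(number: str) -> bool:
    """Validate a digit string against the Luhn (mod 10) checksum.

    Doubling every second digit from the right (subtracting 9 when the
    result exceeds 9) and summing must give a multiple of 10.
    """
    digits = [int(c) for c in number]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9      # equivalent to summing the two digits of d
        total += d
    return total % 10 == 0
```

A check like this catches any single mistyped digit and most adjacent transpositions at entry time, which is exactly the “stop errors at the source” principle described above.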
- To keep data relevant and compliant with (external) rules, the deletion of data must also be considered. It is quite easy to keep collecting new data sets. But how do we ensure that new data, stored in massive data sets, still follows the same principles? On the one hand, the control mechanisms described above can help. On the other hand, we sometimes need to take a critical look and ask ourselves: is the data still required in its previous form? For personal data, the requirements of the GDPR are of assistance here, but we also need to consider other principles arising from law and from reasonableness. And if we detect a real need for deletion, how do we carry it out in a systematic and automated way?
- In addition to legislative restrictions, ethical aspects must be kept in mind when handling data. As mentioned above, a conclusion based on erroneous data may only magnify the scale of the error. Similarly, wrong conclusions can be drawn if we prepare an analysis or train a model on biased data; the result can be models that, unintentionally, appear discriminatory. Since data sets can also contain sensitive or confidential information, one should consider restrictions on how the data is accessed and by whom. Has access to data been restricted and granted only on the basis of actual need, and are these rights reviewed regularly?
- Data literacy is ever more important today, not least to prevent potential ethical mistakes like those above. The ability to find and process the required data with the relevant tools is needed in nearly every field and position. Thus, in today's organizations, the "data providers" cannot be only top-level analysts: everyone must have basic skills and access to the data they need, a process also called data democratization, i.e. bringing data and skills to the level of the end user. The ability to visualize data adequately, so that the content is highlighted rather than distorted, is becoming increasingly important. In bigger organizations, however, it is also reasonable to employ data professionals with specific skills – data stewards, data scientists, data engineers, data privacy specialists, etc. – in competence centres.
- Finally, even if the data management principles and skills are in place, we still need up-to-date tools. As data volumes grow geometrically, the solutions used must be scalable and human intervention must be kept to a minimum. There is an increasing need to process unstructured data (such as text, sound, images, location, emotion) in real time. At the same time, solutions should remain simple and intuitive, and easy to interface with one another.
Lennart Kitt, Baltic Head of Customer Analytics and Data Science, SEB Baltic Division
Enel Pitk, Baltic Data Governance Manager, SEB Baltic Division