Project GDELT: Sentiment Analysis of World News in the Cloud

5 January 2022 10:51

Have you ever wondered how to quantify the sentiment and coverage of news articles, or how some topics relate to others? In machine learning, sentiment analysis uses models to quantify the tone of a text. From this, one can infer whether an article carries a positive or negative message, which in turn can help guide decision-making. Imagine having this knowledge from billions of articles in real time. This was the aim of project GDELT.

GDELT is a collection of over 7.5 billion news articles, updated every 15 minutes to be “a real-time open data global graph over human society as seen through the eyes of the world's news media.” Although GDELT allows for extensive exploration, working with the dataset, which today amounts to 14 TB, is rarely trivial, especially for users without the required coding experience. In “project GDELT” I aimed to build an app that lets SEB employees explore world news quantitatively in the cloud, in terms of both frequency and sentiment. The idea was to support macroeconomic analysis of world news: searching for a combination of keywords, phrases, or people yields a dashboard displaying relevant information about the topic.

From a technical standpoint, multiple requirements had to be satisfied for this to work. In short, the app is hosted on GCP using Cloud Run, with the data accessed through BigQuery. For continuous deployment (CD), Cloud Build and build triggers were used. The entire infrastructure was implemented using Terraform. The application logic is written in Python, and the front end and all visualizations are built in Dash. In what follows, I go over these components in more detail.

User experience

From the user's point of view, there are two main components to the app. First, given a timeframe, a search for a topic, person, or location returns both sentiment and frequency statistics from GDELT to interact with. Second, the user might want to compare topics with each other or find related ones. To address this, I assumed that a relation between topics in news articles would also be visible in Google searches, and therefore implemented support for Google Trends in the app. Google Trends is a tool that provides statistics on Google search queries, including relational information between multiple queries. For example, by searching for inflation, deflation, and central bank, one could perhaps infer from the search data that the relationship between inflation and central bank has recently grown stronger. Given more context, this could indicate that central banks have been discussing inflation more lately. These topics can then be passed into a GDELT query, which returns their overall frequency and sentiment. This tells the user not only how much these topics are discussed at scale, but also whether the news takes a positive or negative stance on them. Such insights can support further analysis to discover trends and possibly detect underlying patterns in communication ahead of time.
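To make the frequency and sentiment statistics concrete, here is a minimal sketch of how they could be computed per day. It is not the app's actual code: the function name and the sample records are hypothetical, and it assumes each article has already been reduced to a publication date and a single tone score, where negative values mean negative coverage.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def daily_frequency_and_tone(articles):
    """Group (publication_date, tone) pairs by day.

    Returns a list of (day, article_count, mean_tone) tuples sorted by day.
    """
    tones_by_day = defaultdict(list)
    for published, tone in articles:
        tones_by_day[published].append(tone)
    return [
        (day, len(tones), mean(tones))
        for day, tones in sorted(tones_by_day.items())
    ]


# Hypothetical records: (publication date, tone score).
sample_articles = [
    (date(2021, 7, 1), -2.0),
    (date(2021, 7, 1), 1.0),
    (date(2021, 7, 2), 3.5),
]

for day, count, avg_tone in daily_frequency_and_tone(sample_articles):
    print(day, count, avg_tone)
```

The article count per day is the frequency signal and the mean tone is the sentiment signal; a dashboard would typically plot the two side by side.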

Queries

In theory, BigQuery allows fast operations on large datasets. In practice, every live update in the app would trigger a new query, which is expensive in terms of both time and cost, so partitioning was needed to keep the app interactive. When the user selects a topic, a query is executed in BigQuery to fetch a subset of the data, which is then preprocessed in a Python environment.
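As an illustration of why partitioning matters, the sketch below builds a query that is restricted to a date range on the partition column, so that only the matching partitions are scanned. The table name, the column names (`V2Themes`, `V2Tone`), and the helper `build_topic_query` are assumptions based on the public GDELT tables in BigQuery, not the exact query used in the app.

```python
from datetime import date

# Assumed name of the public, ingestion-time partitioned GDELT table.
GKG_TABLE = "gdelt-bq.gdeltv2.gkg_partitioned"


def build_topic_query(topic: str, start: date, end: date):
    """Return SQL text and named parameter values for a date-bounded topic search."""
    if start >= end:
        raise ValueError("start must be earlier than end")
    sql = f"""
        SELECT
          DATE(_PARTITIONTIME) AS day,
          COUNT(*) AS articles,
          -- Assumption: the first comma-separated value of V2Tone is the average tone.
          AVG(CAST(SPLIT(V2Tone, ',')[OFFSET(0)] AS FLOAT64)) AS avg_tone
        FROM `{GKG_TABLE}`
        WHERE _PARTITIONTIME >= TIMESTAMP(@start)
          AND _PARTITIONTIME < TIMESTAMP(@end)
          AND LOWER(V2Themes) LIKE CONCAT('%', LOWER(@topic), '%')
        GROUP BY day
        ORDER BY day
    """
    params = {"topic": topic, "start": start, "end": end}
    return sql, params


sql, params = build_topic_query("inflation", date(2021, 6, 1), date(2021, 7, 1))
print(sql)
print(params)
```

With a BigQuery client library, the values in `params` would be passed as named query parameters (a string and two dates) rather than being formatted into the SQL text, which avoids injection problems with user-entered topics.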

Visualization

Some reflection went into the choice of framework for the data visualization. Tableau offered a satisfying drag-and-drop method for quickly creating dashboards, but lacked the flexibility to handle dynamic, changing data, which is what different queries produce. The final choice was Dash, a framework built on top of plotly.js and React.js that enables interactive web applications written in pure Python. Dash provides Python components that mirror HTML elements, along with callback functions, which simplifies the development process. Because it was created as an extension of the plotting library Plotly, it is well suited to data visualization.
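The following is a minimal sketch of the Dash pattern described here: HTML-like components declared in Python, plus a callback that redraws a graph when the user picks another topic. The component ids, the sample data, and the helper names are hypothetical, and the sketch assumes Dash 2.x is installed; the real app would feed the callback with query results instead of a hard-coded dictionary.

```python
def tone_figure(topic, days, avg_tones):
    """Build a Plotly-style figure dict for the average tone of a topic over time."""
    return {
        "data": [
            {"type": "scatter", "mode": "lines", "x": list(days), "y": list(avg_tones), "name": topic}
        ],
        "layout": {
            "title": {"text": f"Average tone: {topic}"},
            "yaxis": {"title": {"text": "Tone"}},
        },
    }


# Hypothetical stand-in for query results, keyed by topic.
SAMPLE_DATA = {
    "inflation": (["2021-07-01", "2021-07-02"], [-1.2, -0.4]),
    "central bank": (["2021-07-01", "2021-07-02"], [0.3, 0.8]),
}


def build_app():
    """Wire the figure builder into a small Dash app (requires the dash package)."""
    # Imported lazily so the figure logic above can be used without Dash installed.
    from dash import Dash, Input, Output, dcc, html

    app = Dash(__name__)
    app.layout = html.Div([
        html.H1("World news sentiment"),
        dcc.Dropdown(
            id="topic",
            options=[{"label": name, "value": name} for name in SAMPLE_DATA],
            value="inflation",
            clearable=False,
        ),
        dcc.Graph(id="tone-graph"),
    ])

    # The callback runs whenever the dropdown value changes and returns a new figure.
    @app.callback(Output("tone-graph", "figure"), Input("topic", "value"))
    def update_graph(topic):
        days, tones = SAMPLE_DATA[topic]
        return tone_figure(topic, days, tones)

    return app
```

Keeping the figure construction in a plain function, separate from the callback wiring, makes it possible to test the plotting logic without starting a web server.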

DevOps and infrastructure-as-code (IaC)

SEB usually performs most of its CI/CD in GitLab, but given the short timeframe of this project, we decided to set everything up in Google Cloud Platform (GCP). The app has a complete CD pipeline, meaning that code modifications automatically trigger a rebuild of the environment. Build triggers detect repository changes, and Cloud Build runs each step of the deployment pipeline as a Docker container, so redeployment and updates run seamlessly. In addition, with its transition into the cloud, SEB places strong emphasis on IaC. This principle holds that, rather than being configured and created by hand in the GCP UI, the entire environment should be described in code. Terraform is an open-source tool that lets developers define the environment in configuration files. These files are then backed up and sent to GCP, where everything is built according to the specified code, allowing for easy and safe version control.
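To give a feel for a pipeline in which each step runs as a container, the sketch below generates a Cloud Build configuration with three steps: build the image, push it, and deploy it to Cloud Run. It is an assumption about what such a pipeline might look like, not the project's actual configuration. The service name and region are hypothetical, and the config is written as JSON from Python only to keep the examples in one language; in practice it would more commonly live in a YAML file.

```python
import json


def cloud_build_config(service: str, region: str) -> dict:
    """Return a Cloud Build config with build, push, and deploy steps."""
    # $PROJECT_ID and $SHORT_SHA are substitutions filled in by Cloud Build
    # when a trigger starts the build.
    image = f"gcr.io/$PROJECT_ID/{service}:$SHORT_SHA"
    return {
        "steps": [
            # Step 1: build the container image from the repository's Dockerfile.
            {"name": "gcr.io/cloud-builders/docker", "args": ["build", "-t", image, "."]},
            # Step 2: push the image to the registry.
            {"name": "gcr.io/cloud-builders/docker", "args": ["push", image]},
            # Step 3: deploy the pushed image to Cloud Run.
            {
                "name": "gcr.io/google.com/cloudsdktool/cloud-sdk",
                "entrypoint": "gcloud",
                "args": ["run", "deploy", service, "--image", image, "--region", region],
            },
        ],
        "images": [image],
    }


# Hypothetical service name and region.
config = cloud_build_config("gdelt-dashboard", "europe-north1")
print(json.dumps(config, indent=2))
```

The trigger itself, along with the Cloud Run service and other resources, is the kind of thing that would be declared in the Terraform configuration files rather than created by hand.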

I built the app as part of my paid summer internship at SEB, which was the award for winning the competition “Tech Talent of the Year”. Despite the limited timeframe, the first version enabled all the analysis and data storytelling I set out to achieve. Furthermore, since everything runs in the cloud, it is easily scaled up or shut down on demand, allowing for flexibility. SEB's initiative to push for a more cloud-driven bank creates a wide range of opportunities. For the bank, it makes working with data more scalable and flexible. Moreover, it helps reshape the image of banks as non-tech-savvy, as it offers many intriguing problems for those interested in building a cloud-driven bank at scale.

Data Scientist

Johan Hammarstedt

