Hello again! In the previous part, we discussed how making mistakes can help Machine Learning models learn better solutions to complex problems and set the stage for my involvement with SEB over the summer.
Now, it's time to get technical and go into a detailed perspective of my work. A quick reminder from last time:
My work during the summer was focused on developing infrastructure and Proof-of-Concepts to help SEB develop the capabilities to deliver functional and robust Machine Learning models based on text data. In particular, my work largely consisted of:
Developing infrastructure in terms of a Dashboard and Data Warehouse to help the Bank annotate and store internal text data in a consumable manner.
Using Adversarial Machine Learning to help improve our existing QnA systems
Building context-aware Question Answering systems using Large Language Models
Let’s fire up the Batmobile and take a closer look at some aspects of this work!
The Dashboard
To help the bank organise its text data and work with it more easily, I put together a dashboard built with Streamlit, connected to and deployed on Google Cloud Platform (GCP). It enables domain experts within the bank to show us which parts of their data are important, which concepts could make for interesting questions, and so on, without them having to know how or where we store the data for downstream consumption and usage.
The picture shows the dashboard, where a domain expert can answer questions for us and verify that the text is human-generated. Their annotations and inputs are then added to the raw data and stored on GCP for downstream consumption.
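To give a sense of how little is needed to get something like this running, here is a stripped-down sketch of such an annotation dashboard: Streamlit for the UI and a Google Cloud Storage bucket as the landing zone for annotations. The bucket name, form fields and record format here are my own illustrative assumptions, not the actual internal setup.

```python
import json

import streamlit as st
from google.cloud import storage  # assumes GCP credentials are already configured

BUCKET = "text-annotation-warehouse"  # hypothetical bucket name


def save_annotation(record: dict) -> None:
    """Write one annotation as a JSON blob for downstream consumption."""
    bucket = storage.Client().bucket(BUCKET)
    blob = bucket.blob(f"annotations/{record['doc_id']}.json")
    blob.upload_from_string(json.dumps(record), content_type="application/json")


st.title("Text annotation")
doc_id = st.text_input("Document ID")
text = st.text_area("Text to annotate")
is_human = st.radio("Is this text human-generated?", ["Yes", "No", "Unsure"])
question = st.text_input("A question this text could help answer")

if st.button("Submit"):
    save_annotation({
        "doc_id": doc_id,
        "text": text,
        "human_generated": is_human,
        "candidate_question": question,
    })
    st.success("Annotation stored, thank you!")
```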
Adversarial Machine Learning
Now for the most exciting part: how do we go about fooling the question-answering system? First, we rank the words in order of their importance: how much does each word influence the model's decision? We estimate this using the gradient of the loss and sort the words by their influence. Then we take the word with the highest influence on the model's decision and replace it with a synonym. We repeat this process until the model fails, that is, until it gives us an incorrect answer.
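To make the procedure concrete, here is a rough, self-contained sketch of that greedy attack loop: rank the question words by a gradient-based importance score, then swap the most influential ones for WordNet synonyms until the model's answer changes. The model checkpoint, the WordNet synonym source and the various simplifications are my own assumptions for illustration, not necessarily the exact setup we used.

```python
import torch
from nltk.corpus import wordnet  # requires nltk.download("wordnet")
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

MODEL = "distilbert-base-cased-distilled-squad"  # assumed baseline QA model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL).eval()


def answer(question: str, context: str) -> str:
    """Return the model's current answer span as a string."""
    enc = tok(question, context, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**enc)
    s, e = out.start_logits.argmax().item(), out.end_logits.argmax().item()
    return tok.decode(enc["input_ids"][0][s:e + 1]).strip()


def rank_question_words(question: str, context: str) -> list[str]:
    """Rank question words by the gradient of the loss w.r.t. their embeddings,
    using the model's own prediction as the target (most influential first)."""
    enc = tok(question, context, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**enc)
    embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
    loss = model(inputs_embeds=embeds,
                 attention_mask=enc["attention_mask"],
                 start_positions=out.start_logits.argmax(-1),
                 end_positions=out.end_logits.argmax(-1)).loss
    loss.backward()
    grads = embeds.grad.norm(dim=-1).squeeze(0)  # one importance score per token

    # Keep only whole-word question tokens; sub-word pieces are skipped for simplicity.
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
    scored = [(grads[i].item(), t) for i, t in enumerate(tokens)
              if enc.sequence_ids(0)[i] == 0 and t.isalpha()]
    return [t for _, t in sorted(scored, reverse=True)]


def attack(question: str, context: str, max_words: int = 5):
    """Greedily swap influential question words for synonyms until the answer flips."""
    original = answer(question, context)
    for word in rank_question_words(question, context)[:max_words]:
        synonyms = {lemma.name().replace("_", " ")
                    for syn in wordnet.synsets(word) for lemma in syn.lemmas()} - {word}
        for replacement in synonyms:
            candidate = question.replace(word, replacement, 1)
            if answer(candidate, context) != original:
                return candidate  # an adversarial question that fools the model
    return None  # no fooling substitution found within the budget
```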
An Adversarial Example
Context: "Super Bowl 50 was an American football game to ... for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to ..."
Question: Which NFL team represented the AFC at Super Bowl 50?
Model Answer: Denver Broncos
Context (unchanged): "Super Bowl 50 was an American football game to ... for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to ..."
Question (adversarial): Whose NFL team represented the AFC at Super Bowl 50? ["Which" → "Whose"]
Model Answer: Carolina Panthers
Here, changing the word "Which" to "Whose" ends up fooling the model, possibly because Carolina is a reasonably common first name and the answer to a question starting with "Whose" very frequently includes a person's name. The same question, if put to a human, would most likely elicit a correct response.
Experimental Setup
After generating some adversarial examples, we trained baselines with different amounts of training data, evaluated them, and then fine-tuned them on the adversarial data.
Evaluating the fine-tuned models showed better performance on the test set, giving us increased confidence in our hypothesis:
The Question Answering System improves with the amount of training data, but even by fine-tuning on only ~100 adversarial examples, the model consistently outperforms the baseline.
A natural question to ask would be the following:
What's so special about your "Adversarial" Data? How do we know the same effect could not have been achieved just by using more samples from the training set?
We asked precisely this question and ran control experiments to figure it out. In the control, we took around 100 randomly chosen examples from the training dataset that had not yet been seen by any of the baselines and fine-tuned on them instead of the adversarial data. The results showed that the resulting model tends to perform very similarly to the baselines, sometimes even deteriorating, and never as well as the model fine-tuned on adversarial data. Intuitively, adversarial examples are guaranteed to make the model slip up, like a challenging question on a test. The examples used in the control experiments are not guaranteed to be as "challenging", and thus perhaps add less value in terms of how much the model can learn from them.
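In case it helps to see the shape of that comparison, here is a condensed sketch of the fine-tuning step using the Hugging Face Trainer. The checkpoint paths, sample sizes and hyperparameters are illustrative assumptions, and the datasets are assumed to be already tokenised with answer start/end positions.

```python
from transformers import AutoModelForQuestionAnswering, Trainer, TrainingArguments


def fine_tune(base_checkpoint: str, dataset, output_dir: str):
    """Fine-tune an already-trained baseline QA model on a small (~100 example) dataset."""
    model = AutoModelForQuestionAnswering.from_pretrained(base_checkpoint)
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=3,
                             per_device_train_batch_size=8, learning_rate=3e-5)
    Trainer(model=model, args=args, train_dataset=dataset).train()
    return model


# adversarial_100: ~100 examples generated by the attack loop above
# random_100:      ~100 unseen examples drawn at random from the training set
# treatment = fine_tune("path/to/baseline", adversarial_100, "ft-adversarial")
# control   = fine_tune("path/to/baseline", random_100, "ft-control")
# Both are then evaluated on the same held-out test set and compared to the baseline.
```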
Generative AI (LLMs) for Question Answering
Today, models like GPT-3.5 (ChatGPT!) display such a general sense of awareness and such realistic-sounding dialogue that it is tempting to trust whatever the model says as being true. However, that trust is misplaced. Large Language Models (LLMs) are very good at generating language that sounds natural, but they are by no means required to generate content that is factually correct all the time. In a certain sense, LLMs hallucinate a reality based on the prompt given to them.
To ensure reliability in systems like this, it is important to carefully engineer a prompt with all the right information, and only the relevant information. The models also have a maximum amount of text they can process at the same time. A good way of working within these constraints is the following (a minimal sketch follows the list):
Split the input into chunks of a manageable size and store the chunks.
Given a question, find out the top 3 or 4 chunks in the input that are most "related" to the question.
Patch together the relevant chunks and the question into a prompt, and give that to the LLM as input.
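Here is the minimal sketch referred to above, showing the three steps with a small sentence-embedding model providing the "relatedness" ranking. The embedding model, chunk size and prompt wording are illustrative assumptions; the prototype's actual components may differ.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed small embedding model


def chunk(text: str, size: int = 200) -> list[str]:
    """Step 1: split the input into chunks of roughly `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def top_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Step 2: pick the k chunks most 'related' to the question (cosine similarity)."""
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q_vec  # vectors are normalised, so this is cosine similarity
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]


def build_prompt(question: str, documentation: str) -> str:
    """Step 3: patch the relevant chunks and the question together into a prompt."""
    context = "\n\n".join(top_chunks(question, chunk(documentation)))
    return ("Answer the question using only the documentation excerpts below. "
            "If the answer is not in the excerpts, say so.\n\n"
            f"{context}\n\nQuestion: {question}\nAnswer:")


# The resulting prompt is then sent to the LLM (for example, via the OpenAI chat API).
```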
We did precisely this and developed a prototype question-answering system that can be pointed at large code documentation and asked to explain how to use certain features, without the user having to spend hours sifting through the documentation manually.
Conclusion
The journey of enhancing question-answering systems through Adversarial Machine Learning and integrating them with Large Language Models has been both challenging and rewarding. By exposing the model to challenging examples and guiding its learning process, we've witnessed visible improvements in its performance. This approach not only helps us build more robust systems, but also sheds light on the underlying mechanisms of AI learning. As we continue to enhance these techniques and explore new possibilities, the outlook for AI-driven decision-making at SEB looks promising. This could lead to faster and more accurate access to vital information across the bank's diverse domains.
SEB's Tech Talent of the Year, 2023
Shekhar Devm Upadhyay
Tech Talent of the Year 2024
Are you passionate about cybersecurity? Are you curious about the AI advancements that can be leveraged within cybersecurity? If so, you might be on your way to becoming the next Tech Talent of the Year! The last day to apply is January 14th, 2024.
Information about the award and how to apply
Contact us
Do you have feedback or thoughts about future blog articles? Get in contact with us at the e-mail address below.