Podcast transcript: Cybersecurity in the age of Generative AI: Solving the ethical dilemma
10 min | 05 July 2023
In conversation with:
Prashant Choudhary
EY India Cybersecurity Partner
Pallavi: Hello, this is Pallavi Janakiraman, welcoming you to a new episode of the EY India Insights podcast series on ‘Generative AI unplugged’. In today's episode, we explore the ethical dilemma of Generative AI in cybersecurity, focusing on the delicate balance between privacy and protection.
Joining us is a distinguished guest, Prashant Choudhary, Cybersecurity Partner at EY India. He comes with a wealth of experience in delivering cybersecurity solutions to financial services clients across the globe. He is also a client service partner for leading GCCs and financial sector clients at EY India. Welcome to our podcast, Prashant.
Prashant: Thank you, Pallavi, and what an exciting topic! If I look back to November 2022, when OpenAI launched its chatbot ChatGPT, it took the world by storm almost overnight. And today, if I had to give an analogy, we are seeing a very basic version of Generative AI and its capabilities, which is like looking at a Nokia candy bar phone with absolutely no idea of the smartphone it might evolve into. So, we should not judge Generative AI and its capabilities by what it is able to do today. We should base our conversations on what it can become and do in the days to come.
Pallavi: Like other AI models, Generative AI depends on data to work at its best. To train Generative AI in cybersecurity, the data it needs will come from the people in the organizations that have to be protected. Will this not create privacy issues?
Prashant: Of course, there are privacy challenges. While some challenges have been discovered, many more are still emerging as more and more use cases pop up. And it is pervasive across all Generative AI models – ChatGPT, BERT, DALL-E, Midjourney, and so on. The whole model is that you use training data, and the AI then produces output on the basis of the data that was used to train it. In this business, the source of data is the internet, and there is a lot of web scraping involved, which brings in the data to train these base models or Large Language Models (LLMs).
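To make that scraping step concrete, here is a minimal Python sketch of the kind of pipeline being described: fetch public pages and strip them down to raw text for a training corpus. The URLs and helper names are illustrative assumptions, not any vendor's actual pipeline.

```python
# Minimal sketch of the web-scraping step described above: pull public
# web pages and extract their visible text as raw LLM training data.
# The URL list and all names here are illustrative assumptions.
import requests
from bs4 import BeautifulSoup

SEED_URLS = [
    "https://example.com/about",       # hypothetical business page
    "https://example.com/blog/post1",  # hypothetical blog post
]

def scrape_page_text(url: str) -> str:
    """Fetch a page and return its visible text, stripped of markup."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Drop script/style tags so only human-readable prose remains.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())

# Each scraped document would then be cleaned, deduplicated, and tokenized
# before joining the training corpus.
corpus = [scrape_page_text(url) for url in SEED_URLS]
```

Notice that nothing in this flow checks for consent or authorization, which is precisely the gap discussed next.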
The biggest question is that the data available on the internet belongs to a lot of people. While some data is available with proper consent and authorization and can be used for these purposes, some of it is there because nobody ever thought, when making a website and putting in details about their business and themselves, that somebody would scrape it and use it to train an LLM. That aspect itself is still under consideration.
If you go by our privacy regulations, the GDPR (General Data Protection Regulation), or even other leading regulations across the globe, there is always this element involved – if you are using my data, you need to have a legal basis.
Now, the question is: what is the legal basis to scrape the data on my website, or any other data source I have, and then use it to train your Generative AI? The second question is of copyright infringement, because my data could be copyrighted. How are you ensuring that the copyright is not being infringed? Then there are other concerns such as bias, public safety, and the volume of sensitive information that goes into training these LLMs. As we peel the onion, when it comes to privacy and Generative AI, there is a long way to go.
Pallavi: Building on that, do you think generating synthetic data could be the solution to dealing with copyright and privacy issues?
Prashant: Synthetic data is a very interesting conversation when it comes to addressing all the copyright, legal, and other concerns around training LLMs. There are multiple interpretations of synthetic data, but for this conversation, I am assuming that synthetic data is data generated by a computer, which you then use to train the LLM. At first, it appears to be a good method because this data is cheap, it is scalable, and you can generate infinite variants. The challenge is that computer-generated or synthetic data will always carry the reflections of the algorithm with which it was generated. When you use this data to train the LLM, it will be trained on the basis of that input data. The LLM, hence, may or may not reflect the true state of the world. So we come back to the point that we can use synthetic data, but we will have to go back and look at how exactly the synthetic data itself was generated.
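A toy generator makes that caveat visible: every record below reflects distributions hard-coded by whoever wrote the generator, so a model trained on it learns those choices rather than reality. All field names and parameters here are illustrative assumptions.

```python
# Toy synthetic-data generator, illustrating the caveat above: each record
# reflects the distributions baked into the generator, so a model trained
# on this data inherits those assumptions, not the real world.
import random

FIRST_NAMES = ["Asha", "Ravi", "Meera", "Karan"]
CITIES = ["Mumbai", "Pune", "Chennai", "Delhi"]

def synthetic_customer(rng: random.Random) -> dict:
    return {
        "name": rng.choice(FIRST_NAMES),
        "city": rng.choice(CITIES),
        # Age drawn from a normal distribution we chose; real customers may
        # be distributed very differently. This is the "reflection of the
        # algorithm" that ends up baked into the trained model.
        "age": max(18, int(rng.gauss(35, 10))),
    }

rng = random.Random(42)  # fixed seed: cheap, reproducible, infinitely scalable
records = [synthetic_customer(rng) for _ in range(1000)]
```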
There are also other conversations taking place, such as using anonymized or tokenized versions of Personally Identifiable Information (PII) and other sensitive information. In that approach, you still scrape the data, but you find out what is personally identifiable information or what else falls into the sensitive data set. Then you replace it with anonymous labels so that exact individuals cannot be identified. If that takes care of PII, privacy, and some other issues, you use this to train your LLM.
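Here is a minimal sketch of that anonymization step, assuming simple regex-based PII detection (production systems use far more sophisticated detectors): each detected span is replaced with an anonymous label before the text joins the training set.

```python
# Minimal sketch of the PII tokenization approach described above: scan
# scraped text for PII and replace each hit with an anonymous label before
# the text is used for training. The regexes are simplistic, illustrative
# assumptions, not a production-grade PII detector.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{8,}\d"),
}

def tokenize_pii(text: str) -> str:
    """Replace each detected PII span with a label like <EMAIL_1>."""
    counters = {label: 0 for label in PII_PATTERNS}
    for label, pattern in PII_PATTERNS.items():
        def repl(match, label=label):
            counters[label] += 1
            return f"<{label}_{counters[label]}>"
        text = pattern.sub(repl, text)
    return text

print(tokenize_pii("Contact Asha at asha@example.com or +91 98765 43210."))
# -> Contact Asha at <EMAIL_1> or <PHONE_1>.
```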
Pallavi: Lawmakers across the world are looking at regulations and regulatory bodies to keep Generative AI from going astray or being misused by rogue actors. Could you give us an outline of what is happening around the world in this area?
Prashant: When it comes to regulations for LLMs or Generative AI, one action that is making headlines these days is the decision by OpenAI's management to reach out to most of the key governments and heads of government and persuade them to put robust regulation in place to control Generative AI. In fact, there was another interesting call, which came from quite a few influential people, saying that we should pause all development in Generative AI technology until we really get a handle on how to control this technology for the benefit of humanity.
I would say that the broad spectrum of input coming in is that we need to look at this technology very seriously and put guardrails around it. Across the globe, regulatory inputs are coming in, though at a nascent stage. For instance, in the US, NIST (National Institute of Standards and Technology) has come out with the AI Risk Management Framework; the European Parliament is pushing for the EU Artificial Intelligence Act; the European Union Agency for Cybersecurity is talking about cybersecurity for AI; and the US Securities and Exchange Commission (SEC) is talking about AI, cybersecurity, and risk management. So, a lot of regulation is coming in. The worry is that we should not overregulate it. We should also not under-regulate it. It will be a fine balancing act.
The other point is that this technology is pervasive. It would make sense to have consistent regulation across the globe.
Pallavi: Thank you, Prashant, for providing all these valuable insights on cybersecurity issues. It is clear that as Generative AI evolves, we must navigate the delicate balance between privacy and protection to ensure its responsible and secure implementation in the cybersecurity domain.
Prashant: Thank you, Pallavi. It was fabulous interacting with you today, and let's do it again in the days to come. Thank you.
Pallavi: And on that note, we come to the end of this episode. We hope you found our discussion on Generative AI in cybersecurity insightful. To know more about Generative AI, visit our website, and do leave us suggestions on topics related to Generative AI that you would like us to explore. Thanks for tuning in. Goodbye.