Data loss prevention vendors tackle gen AI data risks

Data loss prevention (DLP) vendors are racing to add support for generative AI use cases to their platforms, following the popularity and increasing adoption of ChatGPT since its release in November 2022. The tool quickly became the fastest-growing app in history, and a board-level agenda item for companies.

According to an IDC report released in August, 65% of companies have already deployed generative AI, 19% are actively exploring it, and 13% are still considering it. Only 3% say they are not planning to use generative AI.

The single biggest roadblock to adoption, according to IDC, was the concern that proprietary information would leak into the large language models owned by generative AI technology providers. “Employees across industries are finding new and innovative ways to perform their tasks at work faster,” says Kayne McGladrey, IEEE senior member and cybersecurity strategist at Ascent Solutions. “However, this can lead to the sharing of confidential or regulated information unintentionally. For instance, if a physician sends personal health information to an AI tool to assist in drafting an insurance letter, they may be in violation of HIPAA regulations.”

The problem is that many public AI platforms continue to train on their interactions with users. If a user uploads company secrets to the AI, the model may later surface those secrets in responses to other people who ask about related topics. It’s not just public AIs that have this problem: an internal large language model that ingested sensitive company data might serve that data back to employees who shouldn’t be allowed to see it.

According to a Netskope report released in July, many industries are already starting to use DLP tools to help secure generative AI. In financial services, for example, 19% of organizations use data loss prevention tools, and in healthcare the figure is 21%. In the technology vertical, 26% of companies are using DLP to reduce the risks of ChatGPT and similar apps.

How companies are ensuring safety when using private and public gen AI

Technology consulting company Persistent Systems, for example, uses DLP extensively to protect sensitive data from generative AI systems. The company has 23,000 employees in 21 countries, and, depending on the type of employee and the kind of work they do, there are different safeguards in place for sensitive data, says company CIO Debashis Singh, who declined to name the specific vendors the company was using.

Engineers, for example, work in secure virtual environments, separated from customer data. Other enterprise applications with generative AI components run in private environments so that the AI models are completely isolated. Blocking all public generative AI platforms isn’t always possible. For example, the secure private AI environment might have a bottleneck, Singh says, and employees might be forced to use the public AI alternative to get their work done. “In some cases, we allow ChatGPT to be accessible, but the documents are filtered by classifications, plus traditional DLP engines,” he says.

In addition, a public AI model might have access to more information, or different kinds of information, than one running in a private environment. “There are multiple providers, and we work with all the different providers to create the right solutions with the right platforms to align our solutions with the needs of our customers,” Singh says. “The power that ChatGPT brings in is something you can’t close your eyes and ignore it.”

Some of the vendors adding AI chatbots to their platforms are not traditional AI vendors, he says. For example, a travel service might add generative AI to help users choose the best airline. “We have to be extremely careful. We have realigned our legal contracting terms with all our SaaS operators. Whenever you’re providing anything in SaaS and a ChatGPT environment has to be integrated, we want it in a private environment, not in the public environment, so everything stays inside our instance,” Singh explains.

And although this costs money, Singh says he doesn’t mind spending extra to protect the company’s data. “We make this part of our selection criteria for partners and we are not compromising on it.”

In addition, with every SaaS integration, there are DLP controls in place to see what kind of data is going out. There are also additional DLP controls to prevent classified documents from being cut and pasted into AI chatbots, and there are URL filters in place to restrict which chatbots employees are allowed to use. Of course, people can always get to the chatbot on their phones. So, for employees working with particularly sensitive data, there are even stricter controls, including a prohibition on certain devices in secure environments. “No cameras, no cellphones are allowed,” Singh says.
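
As a rough illustration of the kind of controls Singh describes, classification filters plus URL restrictions on which chatbots are reachable, a policy check might look something like the sketch below. The labels, hosts, and patterns are hypothetical, not Persistent’s actual rules.

```python
import re

# Hypothetical labels and allow-list for illustration only; not Persistent's actual policy.
BLOCKED_CLASSIFICATIONS = {"confidential", "restricted"}
ALLOWED_AI_HOSTS = {"chat.openai.com"}  # URL filter: only sanctioned chatbots are reachable

def allow_paste(text: str, classification: str, destination_host: str) -> bool:
    """Return True only if clipboard content may be pasted into the destination AI app."""
    if destination_host not in ALLOWED_AI_HOSTS:
        return False  # unsanctioned chatbots are blocked outright by the URL filter
    if classification.lower() in BLOCKED_CLASSIFICATIONS:
        return False  # classified documents never leave via a chatbot
    # Fall back to a traditional DLP pattern check, e.g. card-like number sequences
    if re.search(r"\b(?:\d[ -]?){13,16}\b", text):
        return False
    return True

print(allow_paste("Summarize our public product FAQ", "internal", "chat.openai.com"))  # True
```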

Now that desktop apps like Photoshop are also getting generative AI functionality, Persistent is looking at those security risks as well. “I expect pretty soon that every single character you type, every attachment, every image, every single thing will be scanned and if it’s aligned to our policies it goes through, and if it isn’t, it will not,” Singh says. “It will bring DLP from the perimeter to all the applications that people are using.” Yes, it might impact performance, but it’s something that the whole CISO community will be facing soon. DLP vendors are quickly adding monitoring, blocking, and access control tools for generative AI apps.

Skyhigh now tracks more than 500 AI apps

Skyhigh Security offers DLP tools as part of its cloud access security broker (CASB) product, which in turn is part of its security service edge (SSE) platform. The company has been rapidly adding support for generative AI use cases over the past year.

The company now tracks more than 500 different AI cloud service providers in its registry, says Tracy Holden, Skyhigh’s director of corporate marketing. That’s up 130% since January. “This cloud registry quickly identifies new generative AI apps and the corresponding risk level for each app,” she says.

During the first half of 2023, close to a million end users accessed ChatGPT through corporate infrastructures. “The volume of users has increased by 1,500% from January to June 2023, demonstrating the unprecedented momentum and adoption of generative AI applications across organizations and industries,” Holden says.

The company also has direct API integration with many enterprise apps that are adding generative AI features, including Box, Google, Microsoft, Salesforce, ServiceNow, Slack, Workday, Workplace and Zoom. That gives Skyhigh better insight into, and control over, the data flows.

Dan Meacham, CSO, CISO and VP of cybersecurity and operations at Legendary Entertainment, says he uses DLP technology to help protect his company, and Skyhigh is one of the vendors. Legendary Entertainment is the company behind television shows such as The Expanse and Lost in Space, and films including the Batman and Superman movies, Watchmen, Inception, The Hangover, Pacific Rim, Jurassic World, Dune, and many more.

There is DLP technology built into the Box and Microsoft document platforms that Legendary Entertainment uses. Both of those platforms are adding generative AI to help customers interact with their documents.

Meacham says that there are two kinds of generative AI he worries about. First, there’s the AI that’s built into the tools the company already uses, like Microsoft Copilot. This is less of a threat when it comes to sensitive data. “You already have Microsoft, and you trust them, and you have a contract,” he says. “Plus, they already have your data. Now they’re just doing generative AI on that data.”

Legendary has contracts in place with its enterprise vendors to ensure that its data is protected, and that it isn’t used to train AIs or in other questionable ways. “There are a couple of products we have that added AI, and we weren’t happy with that, and we were able to turn those off,” he says. “Because those clauses were already in our contracts. We’re content creators, and we’re really sensitive about that stuff.”

Second, and more worrisome, are the standalone AI apps. “I’ll take this script and upload it to generative AI online, and you don’t know where it’s going,” he says. To combat this, Legendary uses proxy servers and DLP tools to protect regulated data from being uploaded to AI apps. Some of this kind of data is easy to catch, Meacham says. “Like email addresses. Or I’ll let you go to the site, but once you exceed this amount of data exfiltration, we’ll shut you down.”
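
A minimal sketch of that kind of proxy-side check, pairing a simple regex for email addresses with a per-user exfiltration budget, might look like this; the pattern and threshold are illustrative only.

```python
import re

EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
DAILY_UPLOAD_CAP = 50_000  # bytes per user per day toward AI apps; an arbitrary example threshold

uploaded_today: dict[str, int] = {}  # user -> bytes already sent to AI apps today

def check_upload(user: str, payload: str) -> str:
    """Decide what the proxy should do with an outbound request to an AI app."""
    if EMAIL_RE.search(payload):
        return "block"  # easy-to-catch regulated data, such as email addresses
    total = uploaded_today.get(user, 0) + len(payload.encode())
    if total > DAILY_UPLOAD_CAP:
        return "block"  # shut the user down once they exceed the exfiltration budget
    uploaded_today[user] = total
    return "allow"

print(check_upload("alice", "Please summarize this scene description."))  # allow
```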

The company uses Skyhigh to handle this. The problem with the data limiting approach, he admits, is that users will just work in smaller chunks. “You need intelligence on your side to figure out what they’re doing,” he says. It’s coming, he says, but not there yet. “We’re starting to see natural language processing used to generate policies and scripts. Now you don’t need to know regex — it will develop it all for you.”

But there are also new, complex use cases emerging. For example, in the old days, if someone tried to send a super-secret script for a new movie to an untrustworthy person, a hash or a fingerprint on the document helped make sure it didn’t get out.

“We’ve been working on the external collaboration part for the last few years,” he says. In addition to fingerprinting, security technologies include user behavior analytics, relationship monitoring and knowing who’s in whose circle. “But that is about the assets themselves, not the concepts inside those assets.”

But if someone is having a discussion about the script with an AI, that’s going to be harder to catch, he says.

It would be nice to have an intelligent tool that can identify these sensitive topics and stop the discussion. But he’s not going to go and create one, he says. “We’d rather work on movies and let someone else do it — and we’ll buy it from them.” He says that Skyhigh has this on its roadmap. Skyhigh isn’t the only DLP vendor with generative AI in its crosshairs. Most major DLP providers have issued announcements or released features to address these emerging concerns.

Zscaler offers fine-grained predefined gen AI controls

As of May, Zscaler had already identified hundreds of generative AI tools and sites and created an AI apps category to make it easier for companies to block access, or to give warnings to users visiting the sites, or to enable fine-grained DLP controls.

The app that enterprises most want to see blocked by the platform is ChatGPT, says Deepen Desai, Zscaler’s global CISO and head of security research and operations, followed by Drift, a sales and marketing platform that has added generative AI tools.

The big problem, he says, is that users aren’t just sending out files. “It is important for DLP vendors to cover the detection of sensitive data in text and forms without generating too many false positives,” he says.

In addition, developers are using gen AI to debug code and write unit test cases. “It is important to detect sensitive pieces of information in source code such as AWS Keys, sensitive tokens, encryption keys and prevent GenAI tools from learning this sensitive data,” Desai says. Gen AI tools can also generate images, and sensitive information can be leaked via those images, he adds.
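
A rough sketch of what Desai describes, scanning a code snippet for credential-like strings before it reaches a gen AI tool, could look like the following; the patterns are illustrative and far simpler than a production secret scanner.

```python
import re

# Illustrative secret patterns; real scanners use much larger rule sets plus entropy checks.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "generic_credential": re.compile(r"(?i)\b(?:api[_-]?key|token|secret)\s*[:=]\s*\S{16,}"),
}

def find_secrets(code: str) -> list[str]:
    """Return the names of any secret patterns found in a code snippet bound for a gen AI tool."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(code)]

snippet = 'aws_access_key_id = "AKIAIOSFODNN7EXAMPLE"'
print(find_secrets(snippet))  # ['aws_access_key_id']
```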

Of course, context is important. ChatGPT intended for public use is by default configured in a way that allows the AI to learn from user-submitted information. ChatGPT running in a private environment is isolated and doesn’t carry the same level of risk. “Context while taking actions is critical with these tools,” Desai says.

Cloudflare’s DLP service extended to gen AI

Cloudflare extended its SASE platform, Cloudflare One, to include data loss prevention for generative AI in May. This includes simple checks for social security numbers or credit card numbers, but the company also offers custom scans for specific teams and granular rules for particular individuals. In addition, Cloudflare can help organizations see when employees are using AI services.
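
Those “simple checks” typically combine a pattern match with a validation step to cut false positives. A hedged sketch, not Cloudflare’s implementation, might look like this:

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_valid(number: str) -> bool:
    """Luhn checksum, used to weed out random digit runs that only look like card numbers."""
    digits = [int(d) for d in re.sub(r"\D", "", number)][::-1]
    total = 0
    for i, d in enumerate(digits):
        if i % 2 == 1:
            d = d * 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def contains_pii(text: str) -> bool:
    if SSN_RE.search(text):
        return True
    return any(luhn_valid(m.group()) for m in CARD_RE.finditer(text))

print(contains_pii("My card number is 4111 1111 1111 1111"))  # True
```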

In September, the company announced that it was offering data exposure visibility for OpenAI, Bard, and GitHub Copilot and showcased a case study in which Applied Systems used Cloudflare One to secure data in AI environments, including ChatGPT.

In addition, its AI gateway supports model providers such as OpenAI, Hugging Face, and Replicate, with plans to add more in the future. It sits between AI applications and the third-party models they connect to and, in the future, will include data loss prevention — so that, for example, it can edit requests that include sensitive data like API keys, or delete those requests, or log and alert on them.
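
Conceptually, such a gateway is a filter in front of the model provider. The following hypothetical sketch shows that kind of DLP hook, which edits, drops, or just logs requests containing credential-like strings; the key formats and behavior are assumptions, not Cloudflare’s shipping feature.

```python
import re

# Illustrative credential shapes only (an OpenAI-style "sk-" token and an AWS access key ID).
API_KEY_RE = re.compile(r"\b(?:sk-[A-Za-z0-9]{20,}|AKIA[0-9A-Z]{16})\b")

def gateway_filter(prompt: str, mode: str = "redact") -> str | None:
    """Inspect a prompt on its way to a third-party model; edit, drop, or log-and-forward it."""
    hits = API_KEY_RE.findall(prompt)
    if not hits:
        return prompt
    print(f"ALERT: {len(hits)} credential-like string(s) found in outbound prompt")  # log and alert
    if mode == "drop":
        return None  # the request is not forwarded at all
    if mode == "redact":
        return API_KEY_RE.sub("[REDACTED]", prompt)
    return prompt  # log-only mode

print(gateway_filter("Why does this fail? client = Client(api_key='sk-abcdefghijklmnopqrstuvwxyz')"))
```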

For those companies that are using generative AI and taking steps to secure it, the main approaches include running enterprise-safe large language models in secure environments, using trusted third parties that embed generative AI into their tools in a safe and secure way, and using security tools such as data loss prevention to stop the leakage of sensitive data through unapproved channels.

According to a Gartner survey released in September, 34% of organizations are already using or deploying such tools, and another 56% say they are exploring them. Among them are privacy-enhancing technologies that create anonymized versions of information for use in training AI models.

Cyberhaven for AI

As of March of this year, 4% of workers had already uploaded sensitive data to ChatGPT, and, on average, 11% of the data flowing to ChatGPT is sensitive, according to Cyberhaven. In a single week in February, the average 100,000-person company had 43 leaks of sensitive project files, 75 leaks of regulated personal data, 70 leaks of regulated health care data, 130 leaks of client data, 119 leaks of source code, and 150 leaks of confidential documents.

Cyberhaven says it automatically logs data moving to AI tools so that companies can understand what’s going on and helps them develop security policies to control those data flows. One particular challenge of data loss prevention for AI is that sensitive data is typically cut-and-pasted from an open window in an enterprise app or document, directly into an app like ChatGPT. DLP tools that look for file transfers won’t catch this.

Cyberhaven allows companies to automatically block this cut-and-paste of sensitive data, alert users about why the action was blocked, and then redirect them to a safe alternative like a private AI system, or let them provide an explanation and override the block.
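
The decision logic behind that kind of workflow, block by default, explain why, point to a sanctioned alternative, and allow a justified override, can be sketched as follows; this is an illustration, not Cyberhaven’s code.

```python
from dataclasses import dataclass

@dataclass
class PasteDecision:
    action: str        # "allow", "block", or "override"
    message: str = ""

def evaluate_paste(classification: str, destination: str,
                   user_justification: str | None = None) -> PasteDecision:
    """Decide what to do when a user pastes content into an AI app."""
    if destination != "public_ai" or classification == "public":
        return PasteDecision("allow")
    if user_justification:  # the user explained the business need and overrode the block
        return PasteDecision("override", "Allowed with justification; logged for review")
    return PasteDecision(
        "block",
        f"This content is classified {classification}. Try the approved internal AI assistant instead.",
    )

print(evaluate_paste("confidential", "public_ai").action)  # block
```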

Google’s Sensitive Data Protection protects custom models from using sensitive data

Google’s Sensitive Data Protection services include Cloud Data Loss Prevention technologies, allowing companies to detect sensitive data and prevent it from being used to train generative AI models. “Organizations can use Google Cloud’s Sensitive Data Protection to add additional layers of data protection throughout the lifecycle of a generative AI model, from training to tuning to inference,” the company said in a blog post.

For example, companies might want to use transcripts of customer service conversations to train their AIs. This tool would replace a customer’s email address with just a description of the data type — like “email_address” — or replace actual customer data with generated random data.
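
As a minimal sketch, assuming the google-cloud-dlp Python client, de-identifying a transcript before it is used for training might look roughly like this; the project ID and transcript are placeholders, and the exact request shape may differ by client version.

```python
from google.cloud import dlp_v2

def redact_transcript(project_id: str, text: str) -> str:
    """Replace detected email addresses with the infoType name, e.g. [EMAIL_ADDRESS]."""
    client = dlp_v2.DlpServiceClient()
    response = client.deidentify_content(
        request={
            "parent": f"projects/{project_id}",
            "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
            "deidentify_config": {
                "info_type_transformations": {
                    "transformations": [
                        {"primitive_transformation": {"replace_with_info_type_config": {}}}
                    ]
                }
            },
            "item": {"value": text},
        }
    )
    return response.item.value

# redact_transcript("my-project", "Customer jane@example.com asked about a refund.")
# -> "Customer [EMAIL_ADDRESS] asked about a refund."
```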

Code42’s Incydr offers generative AI training module

In September, DLP vendor Code42 released Insider Risk Management Program Launchpad, which includes resources focused on generative AI to help customers “tackle the safe use of generative AI,” says Dave Capuano, Code42’s SVP of product management. The company also gives customers visibility into the use of ChatGPT and other generative AI tools, detects copy-and-paste activity, and can block it.

Fortra adds gen AI-specific features to Digital Guardian

Fortra has already added specific generative AI-related features to its Digital Guardian DLP tool, says Wade Barisoff, director of product for data protection at Fortra. “This allows our customers to choose how they want to manage employee access to GenAI from outright blocking access at the extreme, to blocking only specific content being posted in these various tools, to simply monitoring traffic and content being posted to these tools.”

How companies deploy DLP for generative AI varies widely, he says. “Educational institutions, for example, are blocking access nearly 100%,” he says. “Media and entertainment are near 100%, manufacturing — specifically sensitive industries, military industrial for example — are near 100%.”

Services companies are mainly focused not on blocking use of the tools but on blocking sensitive data from being posted to them, he says. “This sensitive data could include customer information or source code for company-created products. Software companies tend to either allow with monitoring or allow with blocking.”

But a vast number of companies haven’t even started to control access to generative AI, he says. “The largest challenge is that we know employees want to use it, so companies are faced with determining the right balance of usage,” Barisoff says.

DoControl helps block AI apps, prevents data loss

Different AI tools pose different risks, even within the same company. “An AI tool that monitors a user’s typing in documents for spelling or grammar problems might be acceptable for someone in marketing, but not acceptable when used by someone in finance, HR, or corporate strategy,” says Tim Davis, solutions consulting leader at DoControl, a SaaS data loss prevention company.

DoControl can evaluate the risks involved with a particular AI tool, understanding not just the tool itself, but also the role and risk level of the user. If the tool is too risky, he says, the user can get immediate education about the risks, and be guided towards approved alternatives. “If a user feels there is a legitimate business need for their requested application, DoControl can automate the process of creating exceptions in the organization’s ticketing system,” says Davis.
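
The underlying idea, that the same AI tool can be acceptable for one role and off-limits for another, reduces to a risk-versus-tolerance comparison. Here is a hypothetical sketch; the app names, roles, and scores are invented for illustration.

```python
# Hypothetical risk matrix; apps, roles, and thresholds are invented for illustration.
APP_RISK = {"grammar-helper": 4, "unknown-ai-notetaker": 8}
ROLE_RISK_TOLERANCE = {"marketing": 5, "finance": 2, "hr": 2, "corporate_strategy": 1}

def handle_ai_app_request(app: str, role: str) -> str:
    risk = APP_RISK.get(app, 10)              # unknown apps are treated as highest risk
    tolerance = ROLE_RISK_TOLERANCE.get(role, 3)
    if risk <= tolerance:
        return "allow"
    # Too risky for this role: show the user why, point to approved alternatives,
    # and open an exception ticket if they assert a legitimate business need.
    return "block-and-educate"

print(handle_ai_app_request("grammar-helper", "marketing"))  # allow
print(handle_ai_app_request("grammar-helper", "finance"))    # block-and-educate
```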

Among the company’s clients, so far 100% have some form of generative AI installed and 58% have five or more AI apps. In addition, 24% of companies have AI apps with extensive data permissions, and 12% have high-risk AI shadow apps.

Palo Alto Networks protects against main gen AI apps

Enterprises are increasingly concerned about AI-based chatbots and assistants like ChatGPT, Google Bard, and GitHub Copilot, says Taylor Ettema, Palo Alto Networks’ VP of product management. “Palo Alto Networks data security solution enables customers to safeguard their sensitive data from data exfiltration and unintended exposure through these applications,” he says. For example, companies can block users from entering sensitive data into these apps, view the flagged data in a unified console, or simply restrict the usage of specific apps altogether.

All the usual data security issues come up with generative AI, Ettema says, including protecting health care data, financial data, and company secrets. “Additionally, we are seeing the emergence of scenarios in which software developers can upload proprietary code to help find and fix bugs. And corporate communications or marketing teams can ask for help crafting sensitive press releases and campaigns.” Catching these cases can pose unique challenges and requires solutions with natural language understanding, contextual analysis, and dynamic policy enforcement.

Symantec adds out-of-the-box gen AI classifications

Symantec, now part of Broadcom, has added generative AI support to its DLP solution in the form of out-of-the-box capabilities to classify the entire spectrum of generative AI applications and to monitor and control them, either individually or as a class, says Bruce Ong, director of data loss prevention at Symantec.

ChatGPT is the biggest area of concern, but companies are also starting to worry about Google’s Bard and Microsoft’s Copilot. “Further concerns are often about special new and purpose-built GenAI applications and GenAI functionality integrated into vertical applications that seem to come online daily. Additionally, grass-root level, unofficial, unsanctioned AI apps increase additional customer data loss risks,” Ong says.

Users can upload drug formulas, design drawings, patent applications, source code and other types of sensitive information to these platforms, often in formats that standard DLP can’t catch. Symantec uses optical character recognition to analyze potentially sensitive images, he says.
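
A rough illustration of that approach, not Symantec’s implementation, assuming the pytesseract and Pillow packages with a local Tesseract install, would OCR the image and then run ordinary DLP pattern matching over the extracted text:

```python
import re

import pytesseract            # assumes the Tesseract OCR engine is installed locally
from PIL import Image

# Illustrative patterns; a real deployment would use the organization's own classifiers.
SENSITIVE_PATTERNS = [
    re.compile(r"(?i)\bconfidential\b|\btrade secret\b"),
    re.compile(r"\b[A-Z]{2,5}-\d{4,}\b"),   # e.g. internal document or drawing identifiers
]

def image_contains_sensitive_text(path: str) -> bool:
    """OCR the image, then run DLP-style pattern matching over the extracted text."""
    text = pytesseract.image_to_string(Image.open(path))
    return any(pattern.search(text) for pattern in SENSITIVE_PATTERNS)

# if image_contains_sensitive_text("design_drawing.png"):
#     print("Block upload to the gen AI app")
```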

Forcepoint categorizes gen AI apps, offers granular control

To make it easier for Forcepoint ONE SSE customers to manage gen AI data risks, Forcepoint allows IT departments to manage who can access generative AI sites as a category, or explicitly by name of individual apps. Forcepoint DLP offers granular controls over what kind of information can be uploaded to these sites, says Forcepoint VP Jim Fulton. Companies can also set restrictions on whether users can copy-and-paste large blocks of text or upload files. “This ensures that groups that have a business need to use gen AI sites can do so without being able to accidentally or maliciously upload sensitive data,” he says.

GTB zeroes in on law firms’ ChatGPT challenge

In June, two New York lawyers and their law firm were fined after the lawyers submitted a brief written by ChatGPT that included fictitious case citations. But law firms’ risks in using generative AI go beyond the apps’ well-known facility for making stuff up. The tools also pose a risk of disclosing sensitive client information to the AI models.

To address this risk, DLP vendor GTB Technologies announced a gen AI DLP solution in August specifically designed for law firms. It isn’t just about ChatGPT. “Our solution covers all AI apps,” says GTB director Wendy Cohen. The solution prevents sensitive data from being shared through these apps with real-time monitoring, in a way that safeguards attorney-client privilege, so that law firms can use AI while staying fully compliant with industry regulations.

Next DLP adds policy templates for ChatGPT, Hugging Face, Bard, Claude, and more

Next DLP introduced ChatGPT policy templates to its Reveal platform in April, offering pre-configured policies to educate employees about ChatGPT use or block the sharing of sensitive information. In September, Next DLP, which according to GigaOm is a leader in the DLP space, followed up with policy templates for several other major generative AI platforms, including Hugging Face, Bard, Claude, Dall-E, Copy.AI, Rytr, Tome, and Lumen 5.

In addition, after reviewing activity from hundreds of companies in July, Next DLP discovered that in 97% of companies at least one employee had used ChatGPT and that, overall, 8% of all employees had used it. “Generative AI is running rampant inside of organizations and CISOs have no visibility or protection into how employees are using these tools,” said John Stringer, Next DLP’s head of product, in a statement.

The future of DLP is generative AI

Generative AI isn’t just the latest use case for DLP technologies. It also has the potential to revolutionize the way DLP works, if used correctly. Traditionally, DLP was rules-based, which made it static and labor-intensive, says Rik Turner, principal analyst for emerging technologies at Omdia. But the old-school DLP vendors have mostly been acquired and are now part of bigger platforms, or have evolved into data security posture management, and they use AI to augment or replace the old rules-based approach. Now, with generative AI, there’s an opportunity for them to go even further.

DLP tools that use generative AI themselves have to be constructed in such a way that they don’t retain the sensitive data that they find, says Rebecca Herold, IEEE member and an information security and compliance expert. To date, she hasn’t seen any vendors successfully accomplish this. All security vendors say that they are adding generative AI, but the earliest implementations seem to be around adding chatbots to user interfaces, she says, adding that she’s hopeful “that there will be some documented, validated DLP tools for multiple aspects of AI capabilities in the coming six to twelve months, beyond simply providing chatbot capabilities.”

Skyhigh, for example, is looking at generative AI for DLP to create new policies on the fly, says Arnie Lopez, the company’s VP of worldwide systems engineering. “We don’t have anything on the roadmap committed yet, but we’re looking at it — as is every company.” Skyhigh does use older AI techniques and machine learning to help it discover the AI tools used within a particular company, he says. “There are all kinds of AI tools — anyone can get access to them. My 70-year-old mother-in-law is using AI to find recipes.”

AI tools have unique aspects that make them detectable, especially if Skyhigh sees them in use two or three times, says Lopez. Machine learning is also used to do risk scoring of the AI tools.

But, at the end of the day, there is no perfect solution, says Dan Benjamin, CEO at Dig Security, a cloud data security company. “Any organization that thinks there is is fooling themselves. We try to funnel people to private ChatGPT. But if someone uses a VPN or does something from a personal computer, you can’t block them from public ChatGPT.”

A company needs to make it difficult for employees to deliberately exfiltrate data and provide training so that they don’t do it accidentally. “But eventually, if they want to, you can’t block it. You can make it harder, but there’s no one-size-fits-all solution to data security,” Benjamin says.
