Skyhawk Security ranks accuracy of LLM cyberthreat predictions

Cloud security vendor Skyhawk has unveiled a new benchmark for evaluating the ability of generative AI large language models (LLMs) to identify and score cybersecurity threats within cloud logs and telemetry. The free resource analyzes the performance of ChatGPT, Google Bard, and Anthropic Claude, as well as open LLMs based on Meta's Llama 2, to see how accurately they predict the maliciousness of an attack sequence, according to the firm.
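To see what the benchmark is measuring, consider the underlying task: handing an LLM a sequence of suspicious cloud-log events and asking it for a maliciousness score. The sketch below is illustrative only; the prompt wording, event format, and model name are assumptions, not Skyhawk's methodology.

```python
# Illustrative only: the prompt, event format, and model name are
# assumptions for demonstration, not Skyhawk's actual methodology.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

attack_sequence = [
    "ConsoleLogin from unrecognized IP 203.0.113.7 (MFA disabled)",
    "CreateAccessKey for IAM user 'admin'",
    "PutBucketPolicy granting public read on bucket 'prod-backups'",
]

prompt = (
    "Rate the maliciousness of this cloud activity sequence on a scale "
    "from 0.0 (benign) to 1.0 (malicious). Reply with the number only.\n\n"
    + "\n".join(attack_sequence)
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat-capable model; the benchmark compares several
    messages=[{"role": "user", "content": prompt}],
)
print("Predicted maliciousness:", response.choices[0].message.content.strip())
```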

Generative AI chatbots and LLMs can be a double-edged sword from a risk perspective, but with proper use, they can help improve an organization’s cybersecurity in key ways. Among these is their potential to identify and dissect potential security threats faster and in higher volumes than human security analysts.

Generative AI models can significantly enhance the scanning and filtering of security vulnerabilities, according to a Cloud Security Alliance (CSA) report exploring the cybersecurity implications of LLMs. In the paper, CSA demonstrated that OpenAI’s Codex API is an effective vulnerability scanner for programming languages such as C, C#, Java, and JavaScript. “We can anticipate that LLMs, like those in the Codex family, will become a standard component of future vulnerability scanners,” the paper read. For example, a scanner could be developed to detect and flag insecure code patterns in various languages, helping developers address potential vulnerabilities before they become critical security risks. The report found that generative AI/LLMs have notable threat-filtering capabilities, too, explaining and adding valuable context to threat identifiers that might otherwise be missed by human security personnel.
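The Codex API has since been retired in favor of OpenAI's general-purpose models, but the pattern the CSA describes is straightforward to sketch. The following is a minimal, hedged example of LLM-assisted vulnerability flagging; the model name and prompt are assumptions, not the CSA's exact setup:

```python
# Sketch of LLM-assisted vulnerability flagging in the spirit of the
# CSA's Codex experiment. Model name and prompt wording are assumptions;
# Codex itself has been retired in favor of general-purpose models.
from openai import OpenAI

client = OpenAI()

suspect_code = '''
char buf[16];
strcpy(buf, user_input);  /* no bounds check */
'''

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "List any security vulnerabilities in this C code, "
                   "citing a CWE ID for each:\n" + suspect_code,
    }],
)
print(response.choices[0].message.content)
```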

LLM cyberthreat predictions rated in three ways

“The importance of swiftly and effectively detecting cloud security threats cannot be overstated. We firmly believe that harnessing generative AI can greatly benefit security teams in that regard; however, not all LLMs are created equal,” said Amir Shachar, director of AI and research at Skyhawk.

Skyhawk’s benchmark tests LLM output on attack sequences extracted and created by the company’s machine-learning models, comparing it against a sample of hundreds of human-labeled sequences and scoring it on three metrics: precision, recall, and F1 score, Skyhawk said in a press release. The closer the scores are to one, the more accurate the LLM’s predictions. The results are viewable on Skyhawk’s website.
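For readers unfamiliar with the metrics, a quick worked example shows how the three scores relate. The labels below are toy data standing in for Skyhawk's undisclosed tagged flows (1 = malicious, 0 = benign):

```python
# Toy data: human analysts' verdicts vs. an LLM's predictions on the
# same attack sequences (1 = malicious, 0 = benign). Illustrative only.
human_labels    = [1, 1, 0, 1, 0, 0, 1, 0]
llm_predictions = [1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(1 for y, p in zip(human_labels, llm_predictions) if y == 1 and p == 1)
fp = sum(1 for y, p in zip(human_labels, llm_predictions) if y == 0 and p == 1)
fn = sum(1 for y, p in zip(human_labels, llm_predictions) if y == 1 and p == 0)

precision = tp / (tp + fp)  # share of flagged sequences that were truly malicious
recall    = tp / (tp + fn)  # share of malicious sequences the LLM caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# -> precision=0.75 recall=0.75 f1=0.75
```

Precision penalizes false alarms, recall penalizes missed attacks, and F1 balances the two, which is why scores near one on all three indicate a dependable detector.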

“We can’t disclose the specifics of the tagged flows used in the scoring process because we have to protect our customers and our secret sauce,” Shachar tells CSO. “Overall, though, our conclusion is that LLMs can be very powerful and effective in threat detection, if you use them wisely.”

It’s important for organizations to understand that they can’t just throw data [at an LLM] and expect it to do the work for them, Shachar says. “We meticulously built our technology to be able to incorporate LLMs into real-time threat detection by utilizing the right concepts from the ground up, and now we’re leveraging that to provide a glimpse into LLM performance to the broader industry to strengthen the security community.”

Skyhawk said its data will be regularly updated and available to view free of charge via its website.