Malware continues to be one of the most effective attack vectors in use today, and it is often combatted with machine learning-powered security tools for intrusion detection and prevention systems.
According to Nidhi Rastogi, Assistant Professor at the Rochester Institute of Technology, machine learning security tools are not nearly as effective as they could be, because several limitations often hinder them. Rastogi presented her views on the limitations of machine learning for security, and a potential solution known as contextual security, at a session on February 2 at the Enigma 2022 Conference.
A key challenge for contemporary machine learning security comes from false alerts. Rastogi explained that false alerts both waste organizations' time and create security gaps that could expose an organization to unnecessary risk.
"It is very difficult to get rid of false positives and false negatives," Rastogi said.
Why Machine Learning Models Generate False Alerts
Among the primary reasons machine learning models tend to generate false alerts is a lack of sufficient representative data.
Machine learning, by definition, is an approach in which a machine learns how to do something, usually by training on a data set. If the training data set is not representative of the threats the model will face, the model cannot identify all malware accurately.
Rastogi said that one possible way to improve machine learning security models is to integrate a continuous learning model. In that approach, as new attack vectors and vulnerabilities are discovered, the new data is continuously being used to train the machine learning system.
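The continuous learning idea can be sketched in a few lines. This is a minimal, illustrative example, not Rastogi's actual system: a toy token-based detector whose feature names and scoring rule are hypothetical, showing only how new labeled intelligence can be folded into a model without retraining from scratch.

```python
# Minimal sketch of continuous learning for a malware detector.
# All feature names and the scoring rule are hypothetical.

class ContinuousDetector:
    """Toy detector updated incrementally as newly labeled samples arrive."""

    def __init__(self):
        self.malicious_tokens = {}  # token -> times seen in malware
        self.benign_tokens = {}     # token -> times seen in clean files

    def update(self, tokens, is_malware):
        """Fold one newly labeled sample into the model (continuous learning)."""
        counts = self.malicious_tokens if is_malware else self.benign_tokens
        for t in tokens:
            counts[t] = counts.get(t, 0) + 1

    def score(self, tokens):
        """Crude maliciousness score: fraction of tokens seen more often in malware."""
        hits = sum(1 for t in tokens
                   if self.malicious_tokens.get(t, 0) > self.benign_tokens.get(t, 0))
        return hits / max(len(tokens), 1)

detector = ContinuousDetector()
detector.update(["packer_upx", "calls_virtualalloc"], is_malware=True)
detector.update(["signed_binary"], is_malware=False)

# A new attack vector is reported: fold it in without rebuilding the model.
detector.update(["new_exploit_token"], is_malware=True)
print(detector.score(["new_exploit_token", "calls_virtualalloc"]))  # -> 1.0
```

A production system would use a real incremental learner rather than token counts, but the shape is the same: every newly discovered attack vector becomes fresh training data.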
Adding Context to Boost Malware Detection Efficacy
However, getting the right data to train a model is often easier said than done. Rastogi suggests providing additional context as an opportunity to improve malware detection and machine learning models.
The additional context can be derived from third-party and open source threat intelligence (OSINT) sources, which provide threat reports and analysis on new and often novel attacks. The challenge with OSINT is that it usually comes as unstructured data: blog posts and other formats that do not lend themselves to training a machine learning model.
"These reports are written in human-understandable language and provide context which otherwise wouldn't be possible to capture in code," Rastogi said.
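Before such reports can feed a model, the structured indicators buried in the prose have to be extracted. As a simple illustration (the report text below is invented), even basic pattern matching can pull machine-usable fields out of human-readable intelligence:

```python
import re

# Hypothetical OSINT threat report: unstructured prose containing indicators.
report = """Analysts observed the dropper beaconing to 203.0.113.7 and
downloading a payload with SHA-256
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855."""

# Extract structured indicators of compromise from the unstructured text.
ips = re.findall(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", report)
hashes = re.findall(r"\b[0-9a-f]{64}\b", report)

print(ips)     # -> ['203.0.113.7']
print(hashes)  # -> ['e3b0c442...b855']
```

Real pipelines go well beyond regular expressions, using natural language processing to capture the relationships and context the report describes, not just the raw indicators.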
Using Knowledge Graphs for Contextual Security
So how can unstructured data help to inform machine learning and improve malware detection? Rastogi and her team are attempting to use an approach known as a knowledge graph.
A knowledge graph is built on what is known as a graph database, which maps the relationships between different data points. According to Rastogi, the biggest advantage of knowledge graphs is that they make it possible to capture and better understand unstructured information written in a language understood by humans.
"All of this combined data on a knowledge graph can help to identify or infer attack patterns when a malware threat is evolving," she said. "That's the advantage of using knowledge graphs, and that's what our research is pursuing."
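The inference Rastogi describes can be sketched with a toy graph. This is not her team's implementation: the node names, relations, and query below are invented to show how linking facts lets a graph surface threats that share an indicator with a new sample.

```python
from collections import defaultdict

# Toy knowledge graph: subject -> [(relation, object)] adjacency list.
# All node and relation names are hypothetical.
edges = defaultdict(list)

def add_fact(subj, rel, obj):
    """Add one (subject, relation, object) triple to the graph."""
    edges[subj].append((rel, obj))

# Facts pulled from (invented) threat reports:
add_fact("MalwareFamilyX", "uses_technique", "phishing_attachment")
add_fact("MalwareFamilyX", "contacts", "203.0.113.7")
add_fact("UnknownSampleY", "contacts", "203.0.113.7")

def related_threats(indicator):
    """Infer which graph entities share a network indicator."""
    return [s for s, rels in edges.items()
            if ("contacts", indicator) in rels]

# The shared C2 address links the unknown sample to a known family.
print(related_threats("203.0.113.7"))  # -> ['MalwareFamilyX', 'UnknownSampleY']
```

A real deployment would use a graph database and richer relation types, but the principle is the same: connections between facts, not isolated indicators, are what let the system infer an evolving attack pattern.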
By adding context and data lineage, which help track the source of the data and its trustworthiness, the overall accuracy of malware detection could be improved, Rastogi said.
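One way to picture data lineage is to attach provenance and a trust score to every fact. The sketch below is an illustrative assumption, not a method from the talk: it records where each claim came from and combines independent sources into an overall confidence.

```python
# Illustrative data-lineage sketch: each fact carries its source and a
# trust score in [0, 1]. Source names and scores are invented.
facts = []

def record_fact(statement, source, trust):
    """Store a fact together with its provenance and trustworthiness."""
    facts.append({"statement": statement, "source": source, "trust": trust})

record_fact("203.0.113.7 is a C2 server", source="vendor_report", trust=0.9)
record_fact("203.0.113.7 is a C2 server", source="anonymous_paste", trust=0.3)

def confidence(statement):
    """Combine independent sources: 1 - product of (1 - trust_i)."""
    p = 1.0
    for f in facts:
        if f["statement"] == statement:
            p *= 1.0 - f["trust"]
    return 1.0 - p

print(round(confidence("203.0.113.7 is a C2 server"), 2))  # -> 0.93
```

Tracking lineage this way means an analyst can see not just that an indicator was flagged, but which sources support the claim and how much weight each carries.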
"We need to go beyond measuring the performance of machine learning models using accuracy and precision scores," Rastogi said. "We want to be able to help analysts by inference with confidence and context."