Within the AI hype bubble, information security vendors make sensational claims about their technology, often suggesting that their algorithms are smarter than those of rivals. These boasts play on the misperception that algorithms are the only thing that sets successful AI/machine learning solutions apart, which is generally not the case.
The fuel that improves the effectiveness of AI is often the volume, velocity and variety of data used to generate and feed the models that underpin the solution’s ability to detect and counter threats.
These systems use multiple models, combining rules and algorithms to simulate intelligence, understand context, and make decisions when faced with both known and previously unseen situations.
An AI system can process large amounts of real-time, historical, structured and unstructured data much faster, and in more intensive ways, than humans can. This speed and depth replaces manual effort with the potential to make a rapid, accurate decision, again based on the training the system has received.
For example, imagine a malicious user logs in to a network-connected PC with admin rights and immediately runs a tool to search for open file shares across the network. The user then copies several files from a shared volume to a new folder and starts sending them to a previously unused FTP server. This could be perfectly reasonable activity, or it could be a signal that credentials have been compromised and a data breach is taking place.
In this scenario, each of these steps might only be noticed in hindsight, after separate alerts and examination of the associated log files. The steps might also take place days apart and never be correlated as a sequence. Worst of all, unless the security team has visibility into real-time wire data, it will not be able to rebuild the transactions; in fact, some of these actions are unlikely to be captured in logs or by agents at all.
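To make the correlation problem concrete, here is a minimal sketch of how events recorded days apart could be grouped per user and host and matched against a suspicious sequence. The event records, field names and the sequence itself are illustrative assumptions, not any product’s actual schema or logic.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical event records; field names and values are purely illustrative.
events = [
    {"ts": "2024-03-01 22:04", "user": "admin", "host": "prn-07", "action": "share_scan"},
    {"ts": "2024-03-03 22:10", "user": "admin", "host": "prn-07", "action": "bulk_copy"},
    {"ts": "2024-03-05 22:15", "user": "admin", "host": "prn-07", "action": "ftp_upload"},
]

# The sequence we treat as suspicious when it occurs in order, even days apart.
SUSPICIOUS_SEQUENCE = ["share_scan", "bulk_copy", "ftp_upload"]

def correlate(events):
    """Group events per (user, host) and flag any pair whose actions
    contain the suspicious sequence in chronological order."""
    timelines = defaultdict(list)
    for e in sorted(events, key=lambda e: datetime.strptime(e["ts"], "%Y-%m-%d %H:%M")):
        timelines[(e["user"], e["host"])].append(e["action"])

    incidents = []
    for (user, host), actions in timelines.items():
        idx = 0
        for action in actions:
            if action == SUSPICIOUS_SEQUENCE[idx]:
                idx += 1
                if idx == len(SUSPICIOUS_SEQUENCE):
                    incidents.append((user, host))
                    break
    return incidents

print(correlate(events))  # [('admin', 'prn-07')]
```

Even this toy version shows the point: none of the three events is alarming on its own, and only a view that spans days and sources reveals the pattern.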
However, a machine learning solution could spot this as an issue, generate an alert, and potentially quarantine that PC from the rest of the network automatically. For this automated response to be effective, and to be permitted in the risk-aware culture of security, the solution needs to establish a high level of confidence that this is an attack and not just a genuine admin going about legitimate duties.
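A hedged sketch of what that confidence gate might look like: act autonomously only when the model’s score clears a threshold, and otherwise hand the decision to an analyst. The threshold value and function names are assumptions for illustration, not defaults from any real product.

```python
# Illustrative risk-appetite setting; a real deployment would tune this.
QUARANTINE_THRESHOLD = 0.9

def respond(confidence, host):
    """Quarantine automatically only when confidence clears the bar;
    otherwise raise an alert for human review."""
    if confidence >= QUARANTINE_THRESHOLD:
        return f"quarantine {host} and notify the SOC"
    return f"raise an alert on {host} for analyst review"

print(respond(0.95, "prn-07"))  # quarantine prn-07 and notify the SOC
print(respond(0.60, "prn-07"))  # raise an alert on prn-07 for analyst review
```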
Having more data for the machine learning to analyze allows the AI to make better judgements, and this should start with baseline data about users, devices, systems on the network, and workflow patterns. In this example, if the models had been fed network device discovery information that made the AI aware that the ‘PC’ the malicious user logged in from is actually a print server, then any task outside of managing print jobs would be considered highly suspicious.
A historical understanding of user behavior, along with real-time access to current network flows, is also beneficial in training the underlying models. For example, if the “admin” account in our scenario had always logged in between 9am and 10am and usually logged out between 6pm and 7pm, activity taking place at 10pm would break the established pattern and raise another red flag. Likewise, if this admin had never previously used FTP or had any interaction with this file server, more red flags would follow.
In parallel, has an RDP session recently been initiated with that PC, indicating that an external hacker might be spoofing an internal network connection? Again, a red flag.
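Pulling the signals from the last few paragraphs together, the sketch below shows one simple way such red flags could be combined into an overall anomaly score. The baseline profile, field names and weights are all hypothetical; a real system would learn them from historical network and log data rather than hard-code them.

```python
# Hypothetical learned baseline for the "admin" account and the host it used.
baseline = {
    "device_role": "print_server",
    "typical_login_hours": range(9, 19),   # 9am to 7pm
    "ftp_servers_seen": set(),             # never used FTP before
    "file_servers_seen": {"fs-01"},
}

# Hypothetical observations from the current activity.
observed = {
    "login_hour": 22,
    "ftp_server": "198.51.100.7",
    "file_server": "fs-02",
    "recent_rdp_session": True,
    "actions": {"share_scan", "bulk_copy", "ftp_upload"},
}

def anomaly_score(baseline, observed):
    """Add an illustrative weight for each deviation from the baseline."""
    score = 0.0
    if baseline["device_role"] == "print_server" and observed["actions"] - {"print_job"}:
        score += 0.35   # non-print activity on a print server
    if observed["login_hour"] not in baseline["typical_login_hours"]:
        score += 0.20   # activity outside established hours
    if observed["ftp_server"] not in baseline["ftp_servers_seen"]:
        score += 0.20   # first-ever FTP use for this account
    if observed["file_server"] not in baseline["file_servers_seen"]:
        score += 0.10   # unfamiliar file server
    if observed["recent_rdp_session"]:
        score += 0.15   # inbound RDP shortly before the activity
    return min(score, 1.0)

print(anomaly_score(baseline, observed))  # 1.0, i.e. high confidence of an attack
```

The exact arithmetic is beside the point; what matters is that each extra source of baseline and real-time data gives the model another deviation to weigh, which is what drives the score high enough to justify an automated response.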
Gaining these insights requires both a broad array of baseline data and a constant flow of real-time information beyond what is available from historical logs or agents. The last point is particularly critical, because the next generation of IoT devices often has neither logs nor agents, and application owners really don’t want agents running on their finely tuned systems.
What does it all mean in terms of practical application? Garbage in, garbage out; quality in, quality out. Security operations now include many data and behavioral analytics applications, running in harness with SIEMs, whether in series or in parallel.
All analytics will be more effective when provided with rich, high-fidelity sources of data. That data can be used to build higher-resolution models that find patterns and real-time correlations, identify anomalies, and predict and prevent security issues. And the more relevant the data the better: not to drown the system, as alerts do today, but to facilitate the training and optimization of the system for maximum accuracy and confidence.
This is only a broad summary of the issue, but the next time somebody tries to convince you that it’s all about the algorithm, remember that data may in fact hold the key.