Limitations of Machine Learning Algorithms in Malware Detection

The use of machine learning keeps growing but, despite the recent hype, the technology is not new. Researchers have been experimenting with artificial neural networks since the 1950s, and machine learning is not new in cybersecurity either.

At Bitdefender, for example, machine learning algorithms have been in use since 2008, when machine learning-based detection first appeared in our antimalware engines. In 2010, the first OSC algorithm was developed to reduce the number of false positives. Since then, 10 percent of the company’s 72 patents have covered machine learning for detecting malware and online threats, anomaly-based detection and deep learning.

While many marketers present it as a universal solution to cyberattacks, the truth is that machine learning has its limitations, and infrastructures need multi-level security technologies. A major issue is that hackers are intelligent and sometimes as skilled as security researchers. They keep developing increasingly sophisticated malware variants to confuse the algorithm, and may turn to machine learning themselves to refine their samples. They leverage encryption and obfuscation in their tools, and they never play by the rules. A single machine learning algorithm therefore can’t handle everything by itself; it needs support from other technologies to be fully effective.

Machine learning is great at recognizing patterns to handle security incidents, and it is breaking ground in a number of sectors, especially cybersecurity. However, it has flaws and limitations that require constant improvements and engine updates from human operators. Because statistical analysis is at the core of its predictions, a machine learning system may misclassify files if its original learning process was corrupted. Correcting this takes time, since engineers need to replace the flawed input with an error-free data set and adjust the model. When not trained properly, a machine learning system risks running inefficiently and making limited predictions.

Machine learning algorithms need to be taught to analyze data patterns and draw conclusions in order to detect anomalies and identify malware threats. They are fed large amounts of samples, and if that data set is corrupt or incorrectly labeled, the algorithm won’t be able to distinguish between clean and malicious files, so the solution will deliver unreliable results. This is what’s called the “garbage in, garbage out” problem in machine learning. Engineers are still required to step in and fine-tune the algorithm to keep it from producing unreliable results.
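The “garbage in, garbage out” effect can be reproduced with a toy model. The sketch below is only an illustration under stated assumptions, not how any production engine works: each file is reduced to a single hypothetical "suspicion score" drawn from synthetic distributions, the classifier is a simple nearest-centroid threshold, and the function names are invented for this example. It trains once on correct labels and once on a training set in which 80 percent of the malicious samples were mislabeled as clean, then compares how many malicious test files each version detects.

```python
import random
from statistics import mean

CLEAN, MALICIOUS = 0, 1

def make_samples(n, rng):
    """Synthetic (score, label) pairs: clean scores cluster near 0, malicious near 3."""
    clean = [(rng.gauss(0.0, 1.0), CLEAN) for _ in range(n)]
    malicious = [(rng.gauss(3.0, 1.0), MALICIOUS) for _ in range(n)]
    return clean + malicious

def mislabel_malicious(samples, fraction, rng):
    """Relabel roughly `fraction` of the malicious samples as clean (corrupted training data)."""
    return [
        (score, CLEAN if label == MALICIOUS and rng.random() < fraction else label)
        for score, label in samples
    ]

def train_threshold(samples):
    """Nearest-centroid training: the decision threshold is the midpoint of the two class means."""
    clean_mean = mean(s for s, y in samples if y == CLEAN)
    malicious_mean = mean(s for s, y in samples if y == MALICIOUS)
    return (clean_mean + malicious_mean) / 2

def detection_rate(threshold, samples):
    """Share of truly malicious samples whose score exceeds the threshold."""
    malicious_scores = [s for s, y in samples if y == MALICIOUS]
    return sum(s > threshold for s in malicious_scores) / len(malicious_scores)

rng = random.Random(42)          # fixed seed so the run is repeatable
train = make_samples(2000, rng)
test = make_samples(2000, rng)   # test labels are always correct

clean_threshold = train_threshold(train)
noisy_threshold = train_threshold(mislabel_malicious(train, 0.8, rng))

rate_clean_labels = detection_rate(clean_threshold, test)
rate_noisy_labels = detection_rate(noisy_threshold, test)
print(f"trained on correct labels:   {rate_clean_labels:.1%} of malware detected")
print(f"trained on corrupted labels: {rate_noisy_labels:.1%} of malware detected")
```

The mislabeled samples pull the "clean" centroid toward the malicious cluster, which raises the decision threshold and lets more malware through. Nothing in the algorithm changed between the two runs; only the labels did.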

The worrying factor is that, unlike in the past, not only is the number of malware threats rising aggressively, but so is the amount of data that has to be analyzed as a result. Machine learning algorithms depend on the data used to train them. If the training data is of good quality and comes from multiple sources, the algorithm can build a robust defense against sophisticated attacks.

Machine learning is a great tool for automating cybersecurity operations, basing its decisions on predictive modeling and detection theory. But, as a number of researchers have discussed, it is best to test different algorithms and combine them with other technologies to achieve a high detection rate and a low false positive rate.
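To make the two metrics concrete, here is a minimal sketch of how a detection rate and a false positive rate can be measured, and how combining detectors can move both. Everything in it is an assumption made for illustration: the three detectors are simulated, each is given an 85 percent hit rate and a 10 percent false alarm rate, their errors are assumed to be independent, and majority voting stands in for whatever combination logic a real product would use.

```python
import random

def simulate_detector(labels, hit_rate, false_alarm_rate, rng):
    """Hypothetical detector: flags malicious files with hit_rate, clean files with false_alarm_rate."""
    return [
        rng.random() < (hit_rate if is_malicious else false_alarm_rate)
        for is_malicious in labels
    ]

def majority_vote(*verdict_lists):
    """Flag a file only when more than half of the detectors flag it."""
    return [sum(votes) * 2 > len(votes) for votes in zip(*verdict_lists)]

def rates(labels, verdicts):
    """Return (detection rate, false positive rate) for boolean labels and verdicts."""
    on_malicious = [v for y, v in zip(labels, verdicts) if y]
    on_clean = [v for y, v in zip(labels, verdicts) if not y]
    return sum(on_malicious) / len(on_malicious), sum(on_clean) / len(on_clean)

rng = random.Random(7)                       # fixed seed so the run is repeatable
labels = [True] * 5000 + [False] * 5000      # True = malicious, False = clean
detectors = [simulate_detector(labels, 0.85, 0.10, rng) for _ in range(3)]

single_dr, single_fpr = rates(labels, detectors[0])
combined_dr, combined_fpr = rates(labels, majority_vote(*detectors))
print(f"single detector: detection {single_dr:.1%}, false positives {single_fpr:.1%}")
print(f"majority of 3:   detection {combined_dr:.1%}, false positives {combined_fpr:.1%}")
```

Under these assumptions the majority vote is expected to detect about 94 percent of malicious files with a false positive rate near 2.8 percent, better than any single detector on both counts. The gain depends on the detectors making different mistakes; detectors trained on the same data with the same features tend to fail on the same files and benefit far less from being combined.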