Applying AI: getting underneath machine learning

If you attended last year’s RSA conference, you may have left with the idea that all you needed to build a complete cyber-security solution was a machine learning engine (or better yet, “advanced next-gen Artificial Intelligence”).

Every cyber-security company uses machine learning (or AI) because it is a powerful technique for malware analysis. But it is by no means the only one. Applied naïvely, it may not even work effectively. Sometimes, a powerful scanning engine is all that is required (it’s ‘cheap’), or even just a great database of known malware hashes (it’s fast).

At Avira, we’ve been applying machine learning to malware detection and threat prevention for a decade. But even when it’s the right technology to apply to the problem, it is far from the only technology that you need. It’s not a magic ‘black box’, and it can only ever be as good as the data it is fed.

Our experience shows that successfully applying any machine learning system (and this includes deep learning) to malware analysis requires expertise in the problem domain and an in-depth understanding of the applied technologies and algorithms.

Over the coming months, we will look at some of the technologies needed to support the application of machine learning to malware analysis. We will discuss the importance of a rich data set (and how we build it). But first, let’s explore the different types of machine learning and how they contribute to successful malware analysis.

We will start this series by delving into some of the techniques used in machine learning – here we’ll look at supervised and unsupervised machine learning. In our next blog, we’ll focus on Deep Learning. We’ll then move on to some of the approaches needed to maximize the potential of machine learning. This will include content extraction techniques used to obtain great data.

Supervised and unsupervised machine learning – which is better?

Simply put, neither is better – they have different uses.

If we do not know much about the data we need to analyze, we often use unsupervised machine learning. For example, we use it to cluster data or to look for data points that lie away from others. In other words, it allows us to explore data for meaningful patterns.
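To make the two uses above concrete, here is a minimal sketch in plain Python. It is not our production approach: the feature vectors, the `radius` value, and the function names `cluster_by_distance` and `find_outliers` are all assumptions chosen for illustration. It groups points that lie close together and then reports any point that ends up alone.

```python
import math

def cluster_by_distance(points, radius):
    """Group points so that any two points within `radius` share a cluster.

    A simple single-linkage style grouping, used here only for illustration.
    Returns one integer cluster label per point.
    """
    labels = [None] * len(points)
    next_label = 0
    for start in range(len(points)):
        if labels[start] is not None:
            continue  # already reached from an earlier point
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j, candidate in enumerate(points):
                if labels[j] is None and math.dist(points[i], candidate) <= radius:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels

def find_outliers(labels):
    """Return the indices of points that are the only member of their cluster."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return [i for i, label in enumerate(labels) if counts[label] == 1]

# Hypothetical two-dimensional feature vectors, one per sample (no labels needed).
samples = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25),
           (0.9, 0.8), (0.85, 0.9),
           (0.5, 3.0)]

labels = cluster_by_distance(samples, radius=0.3)
outliers = find_outliers(labels)
print("cluster labels:", labels)
print("outlier indices:", outliers)
```

Note that no "correct answer" is supplied anywhere: the structure comes from the data alone, and a person still has to decide what the groups and the isolated point mean.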

When there is a specific property of the data we care about, we apply supervised machine learning. For example, this may be a label indicating whether a file is malicious or not. Or it may be a continuous quantity, like the probability of network traffic being abnormal. In these cases, we already have a collection of data for which we know the correct answer. The goal is to find a general way (i.e. train a model) to determine that property for new, unknown data.
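As a rough illustration of "train a model on data with known answers, then apply it to new data", the sketch below fits a tiny logistic regression with gradient descent in plain Python. The two features, the training values, and the hyperparameters are invented for the example, and logistic regression is simply one convenient choice here, not a description of the models we use.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(features, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and a bias with full-batch gradient descent on the log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            prediction = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = prediction - y  # gradient of the log loss w.r.t. the logit
            for k, xi in enumerate(x):
                grad_w[k] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

def predict_probability(model, x):
    """Probability that sample x belongs to the positive (malicious) class."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Hypothetical labelled data: two made-up features per file, label 1 = malicious.
train_x = [[0.9, 0.8], [0.8, 0.9], [0.85, 0.7],
           [0.1, 0.2], [0.2, 0.1], [0.15, 0.3]]
train_y = [1, 1, 1, 0, 0, 0]

model = train_logistic_regression(train_x, train_y)

# New, unseen samples: the model outputs a continuous probability for each.
p_suspicious = predict_probability(model, [0.95, 0.9])
p_harmless = predict_probability(model, [0.1, 0.1])
print(f"suspicious-looking sample: {p_suspicious:.3f}")
print(f"harmless-looking sample:   {p_harmless:.3f}")
```

The output is a probability, which can be thresholded to obtain a malicious/clean label or used directly as a continuous score.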

The key difference here is that if the data has not first been analyzed or labelled, we usually use unsupervised machine learning. Supervised machine learning tends to be more involved because we must first label the data. It also tends to take more time than unsupervised machine learning. The essential thing is that supervised and unsupervised machine learning have different objectives.

Look at it this way: unsupervised learning asks “Tell me something about this data.” Supervised learning asks “Tell me something about how to get from this data to the labels.”

Applying unsupervised and supervised machine learning

We apply a number of unsupervised and supervised machine learning techniques to aid malware analysis. (We also use other machine learning techniques, but more about those in other articles.) For malware analysis, we take a ‘coarse-to-fine’ strategy: initially, we explore the data, looking to group similar data into clusters. We then take a fine-grained approach and, through the more complex techniques of supervised learning, determine whether the data is actually malware or not.

Supervised and unsupervised machine learning techniques offer significant benefits in terms of accurate and fast classification of malware. They offer very low false positive rates and very fast retraining times, but they do require a large training base, extensive data expertise, and resources.
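For readers unfamiliar with the metric, the false positive rate is the share of clean files that a classifier wrongly flags as malicious. The short sketch below computes it from a list of true labels and predictions. The data is made up and the function name `false_positive_rate` is ours for this example.

```python
def false_positive_rate(true_labels, predicted_labels):
    """Fraction of truly clean samples (label 0) that were flagged as malicious (label 1)."""
    pairs = list(zip(true_labels, predicted_labels))
    false_positives = sum(1 for truth, pred in pairs if truth == 0 and pred == 1)
    true_negatives = sum(1 for truth, pred in pairs if truth == 0 and pred == 0)
    clean_total = false_positives + true_negatives
    return false_positives / clean_total if clean_total else 0.0

# Hypothetical evaluation set: 6 clean files (0) and 2 malicious files (1).
truth = [0, 0, 0, 0, 1, 1, 0, 0]
preds = [0, 0, 1, 0, 1, 0, 0, 0]

fpr = false_positive_rate(truth, preds)
print(f"false positive rate: {fpr:.3f}")
```

Here one of the six clean files is flagged, so the rate is one in six. The missed malicious file is a false negative and does not affect this particular metric.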

Of course, not everyone has access to extensive datasets, and those who do not may choose a different approach. Or they may not be looking just to identify malware; they could be looking for patterns in network traffic or device behavior that suggest something is wrong. Other approaches may be more effective in that case and, indeed, this is one reason why we also use other machine learning/AI techniques.

In the next blog, we’ll dive into Deep Learning. In the meantime, for a more in-depth exploration of machine learning techniques, take a look at our white paper on NightVision, our machine learning system.

Thomas Bühler

Thomas Bühler is an AI researcher at Avira with a decade of experience in Machine Learning, both in industry and academic research. He enjoys wrapping his head around maths and algorithms and is passionate about building large-scale ML systems for fighting threats in the cyber-security space. His ML research has been published at top-tier international venues, and he is a regular reviewer for scientific journals and conferences in Machine Learning.