LAS VEGAS – BLACK HAT USA 2015 – Researchers at Invincea have tested the effectiveness of an automated analysis system that focuses on the images embedded in malware.
Desktop icons and other embedded images are often used by malware creators to lure users. A classic example is the PDF icon attached to a malicious executable to trick users into thinking it's a harmless document and opening it.
As part of a Defense Advanced Research Projects Agency (DARPA) project focusing on new types of malware analysis, Invincea researchers have demonstrated that an automated system designed to analyze images embedded in malware could improve threat detection rates, help researchers understand how new malware tricks users, and help determine which adversary is behind a given threat.
Alex Long, research engineer at Invincea Labs, presented the results of their work on Wednesday at the Black Hat conference in Las Vegas.
“Using the images in malware to analyze the sample puts malware authors in a ‘catch-22’ dilemma, because images are a huge part of how they manipulate users. We’re basically saying to malware authors, ‘You can keep using images to increase your chances of tricking a user, but we’re also going to be using images to make it easier for us to detect and understand your malware,’” Long told SecurityWeek.
“Given the vast amount of research going into malware detection approaches, and the promising preliminary results our work has shown, we believe that this relatively simple idea has the potential to complement other approaches very effectively and should be continued further as a new signal in malware analysis,” he added.
According to Long, more than half of the two million malware samples provided by DARPA had at least one image embedded.
The automated analysis of malware images has two main stages: identifying malware samples using visually similar image sets, and classifying the images (e.g. fake antiviruses, installers, game-related threats).
For the first component, Invincea relied on a technique known as “Average Hash.” This technique involves reducing an image to grayscale, scaling it to a small fixed size, and normalizing its contrast; after the average pixel value of the reduced image is computed, a binary hash is generated by comparing each pixel’s value to that average.
This allows the analysis system to efficiently compare a malware image with images from a given set regardless of their contrast, scale, or color scheme.
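The steps above can be sketched in pure Python. This is a minimal illustration of the Average Hash idea, not Invincea's code; the block-averaging resize and function names are the author's own, and real implementations typically use an image library for the grayscale and resizing steps.

```python
def average_hash(pixels, hash_size=8):
    """Average Hash over a 2D grayscale pixel grid (a list of rows).

    Shrinking to hash_size x hash_size discards fine detail, scale,
    and aspect ratio, so visually similar images hash alike.
    """
    h, w = len(pixels), len(pixels[0])
    small = []
    for by in range(hash_size):
        row = []
        for bx in range(hash_size):
            # Average each block of source pixels to shrink the image.
            ys = range(by * h // hash_size, (by + 1) * h // hash_size)
            xs = range(bx * w // hash_size, (bx + 1) * w // hash_size)
            block = [pixels[y][x] for y in ys for x in xs]
            row.append(sum(block) / len(block))
        small.append(row)
    # Compare every pixel to the global average: brighter -> 1, darker -> 0.
    flat = [p for row in small for p in row]
    avg = sum(flat) / len(flat)
    return [1 if p > avg else 0 for p in flat]

def hamming(h1, h2):
    # Number of differing bits; a small distance means visually similar.
    return sum(a != b for a, b in zip(h1, h2))
```

Because the resize step normalizes scale and the grayscale step normalizes color, two variants of the same lure icon produce hashes with a small Hamming distance even if one has been recolored or rescaled.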
The classification of malware images into categories relies on the Google Image Search API and user-defined queries.
“For the image classification work, we used Google Image Search results to get images representing the various semantic classes of interest. So for example, if you want training data for Internet Explorer icons, you do a search for ‘internet explorer’ with some advanced search settings to narrow down the results to just icons,” Long explained.
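Once the search results for each query have been hashed into a labeled training set, an unknown malware image can be assigned the label of its nearest training example. A minimal nearest-neighbor sketch, assuming hashes are bit lists as produced by an Average Hash step (the Google Image Search querying itself is omitted, and the labels here are hypothetical):

```python
def hamming(h1, h2):
    # Number of differing bits between two equal-length bit lists.
    return sum(a != b for a, b in zip(h1, h2))

def classify(query_hash, training):
    """Return the label of the closest training hash (1-NN by Hamming).

    training: list of (label, hash) pairs, e.g. built from the search
    query 'internet explorer' restricted to icon-sized results.
    """
    return min(training, key=lambda lh: hamming(query_hash, lh[1]))[0]
```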
For some of the tested categories, such as fake word processors, researchers obtained very good results, but in other categories the malware images were misclassified at an unacceptable rate.
“The goal in our research was to make this process entirely automated. Malware authors are using automated processes to produce an essentially endless stream of polymorphic variants from a single malware sample, so malware analysts must begin to rely more on automated approaches as well,” Long said.
“Using our approach, the extraction, comparison, and visualization of matching images is done completely automatically so an analyst can go from receiving 200,000 fresh new malware samples that he knows nothing about, to seeing a ‘social network’ of their shared images with literally the push of a button,” the expert noted. “[The image classification] process was also entirely automated, so choosing how you want to classify malware images is as simple as adding a few words (like ‘anti-virus’) for the new search query, to the list of queries in our system.”
Processing images often requires a lot of system resources, but Invincea has attempted to develop an analysis system that works efficiently.
“Staying in the theme of scalability, we focused on approaches that would be computationally cheap. Average hash is a good example of that, as the entire algorithm takes 20-30 lines of code in Python and is nearly instantaneous to run,” Long told SecurityWeek. “In order to maintain effective performance when performing image matching across potentially millions of images, we used the open source library FLANN, short for ‘Fast Library for Approximate Nearest Neighbors.’ This approximates the results of kNN without having to perform the costly pair-wise comparisons between every possible pair of images. This allows us to do nearest neighbor calculations across hundreds of thousands of images in a few seconds, making any performance impact virtually negligible.”
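The point of an index like FLANN is to answer all-pairs nearest-neighbor queries without the quadratic brute-force comparison. As an illustrative stand-in (SciPy's exact `cKDTree` spatial index rather than FLANN's approximate randomized trees), the same "match every image against the collection" query might look like this; the data here is random, not real malware hashes:

```python
import numpy as np
from scipy.spatial import cKDTree

# One 64-bit Average Hash per malware image, as 0/1 vectors. On binary
# vectors, squared Euclidean distance equals Hamming distance, so the
# neighbor ranking is the same.
rng = np.random.default_rng(0)
hashes = rng.integers(0, 2, size=(1000, 64)).astype(float)

tree = cKDTree(hashes)            # build the index once
# For each image, find its 2 nearest neighbors: itself at distance 0,
# plus its closest other image -- the edges of the "social network."
dist, idx = tree.query(hashes, k=2)
```

Linking each image to its nearest neighbors under a distance threshold is one simple way to produce the "social network" of shared images the researchers describe.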
Before image analysis can be integrated into a commercial solution, the overall accuracy of the system needs to be improved, the researcher said.
“This work was performed near the tail end of a four-year DARPA-backed program, so our main goal was just publishing the idea into the community. We wanted to demonstrate that the concept had potential, which I believe we did,” Long said.