Other research

Malware Detection for IoT devices

From July 2019 to May 2020, I worked as an intern at McAfee with Dr. Abhishek Tripathi. This was my first project there, and it also contributed towards my undergraduate thesis. The project aimed to develop a malware detection system targeted at IoT devices.

IoT devices

The Internet of Things (IoT) refers to smart devices that collect and share data over the internet. These devices span a wide spectrum of gadgets, from small wearable smart watches to larger air conditioning systems, and are used in fields including medicine, entertainment and governance.

Motivation to develop malware detection for these devices:

  • IoT devices are becoming increasingly popular
    A 2017 report by Business Insider projected that there would be 30 billion IoT devices by 2020 [link]. The trend appears to have continued upwards since then.

  • Lack of security standards
    The surge in IoT devices has been too rapid for security standards to keep pace, leaving these devices highly vulnerable to malware attacks.

  • Low compute resources on such devices
    Most of these devices were designed to collect data and send it to a centralized server that processes the data and sends back instructions. Because of the limited compute resources on IoT devices, it is difficult to run traditional malware detection methods on the device itself.

TLS Features

TLS (Transport Layer Security) is a cryptographic protocol that encrypts data transmitted over the internet. While the increased use of TLS ensures greater security, it has also allowed malware to encrypt its traffic, making it harder to detect. Earlier methods of detecting malware from network packets used features such as port numbers, IP addresses or patterns in the payload. With encrypted traffic, the payload can no longer be inspected through simple feature extraction, so these methods lose much of their signal.

Hence we rely on features that can be extracted from encrypted network traffic, i.e., TLS features. Each TLS-encrypted exchange begins with a TLS handshake in which a few packets are exchanged unencrypted. We extract our features from this series of packets.
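To make the idea of handshake features concrete, here is a minimal sketch that reads a few fields from an unencrypted ClientHello message, the first packet of the handshake. The function name `parse_client_hello` and the choice of fields (offered cipher suites and extension types) are assumptions for illustration only; the features actually used in the project are not listed here.

```python
import struct


def parse_client_hello(record: bytes) -> dict:
    """Extract a few fields from a raw TLS record holding a ClientHello.

    Illustrative sketch: assumes the whole ClientHello fits in one record.
    """
    # TLS record header: content type (1 byte), version (2), length (2).
    content_type, _record_version, record_len = struct.unpack_from("!BHH", record, 0)
    if content_type != 0x16:  # 0x16 = handshake
        raise ValueError("not a TLS handshake record")
    body = record[5:5 + record_len]
    if len(body) < 4 or body[0] != 0x01:  # 0x01 = ClientHello
        raise ValueError("not a ClientHello message")

    pos = 4  # skip handshake type (1 byte) and handshake length (3 bytes)
    (client_version,) = struct.unpack_from("!H", body, pos)
    pos += 2 + 32  # client version + 32-byte random

    session_id_len = body[pos]
    pos += 1 + session_id_len

    (suites_len,) = struct.unpack_from("!H", body, pos)
    pos += 2
    cipher_suites = list(struct.unpack_from(f"!{suites_len // 2}H", body, pos))
    pos += suites_len

    compression_len = body[pos]
    pos += 1 + compression_len

    # Extensions are optional: a 2-byte total length, then (type, length, data) entries.
    extensions = []
    if pos + 2 <= len(body):
        (ext_total,) = struct.unpack_from("!H", body, pos)
        pos += 2
        end = pos + ext_total
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", body, pos)
            extensions.append(ext_type)
            pos += 4 + ext_len

    return {
        "client_version": client_version,
        "cipher_suites": cipher_suites,
        "extensions": extensions,
    }
```

Values such as the number of offered cipher suites or the set of extension types could then be turned into a numeric feature vector for a classifier.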

Basic TLS handshake from where we extract features. [Image source link]

Models developed

Logistic Regression

A logistic regression model needs only a weighted sum of the input features followed by a sigmoid function to produce a prediction. This can easily be done on a router with limited memory. To ensure that predictions are made within a few milliseconds (so that malware is not passed from the router to the device), we retained only the important features identified during model training. Using only this limited set of features at prediction time drastically reduced the prediction time.
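The prediction step is small enough to sketch with only the standard library. The feature names, the weights, and the use of absolute weight as the importance criterion below are all hypothetical; they only illustrate how a reduced feature set keeps inference cheap.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def top_features(weights: dict, k: int) -> dict:
    """Keep the k weights with the largest magnitude (one possible importance criterion)."""
    ranked = sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return dict(ranked[:k])


def predict_malware_probability(features: dict, weights: dict, bias: float) -> float:
    """Weighted sum over the retained features, followed by a sigmoid."""
    z = bias + sum(w * features.get(name, 0.0) for name, w in weights.items())
    return sigmoid(z)


# Hypothetical trained weights over TLS handshake features.
all_weights = {"num_cipher_suites": -0.8, "num_extensions": -1.2, "has_sni": 0.05}
kept = top_features(all_weights, k=2)  # drops "has_sni"
probability = predict_malware_probability(
    {"num_cipher_suites": 0.3, "num_extensions": 0.1}, kept, bias=0.5
)
```

At prediction time the cost is one multiplication and addition per retained feature plus a single exponential, which is why the model suits a router.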

Deep Learning (Autoencoders)

In the real world, malware packets are usually an anomaly among mostly benign packets, so we decided to experiment with autoencoders, which perform well at anomaly detection.

Although malware data is much rarer than benign data in practice, this was not the case in the dataset we had access to: due to privacy issues, benign data is surprisingly more difficult to collect. Hence we developed two autoencoders, one trained on malware data and one on benign data. A combination of the predictions from both models was used to determine whether a packet was malicious.
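One simple way to combine two such models is to compare their reconstruction errors: a sample is flagged as malicious if the malware-trained model reconstructs it better than the benign-trained one. The sketch below assumes this rule, which is not necessarily the combination used in the project. `mean_reconstructor` is a deliberately trivial stand-in for a trained autoencoder, included only so that the example runs without a deep learning library.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Reconstructor = Callable[[Vector], Vector]


def reconstruction_error(x: Vector, reconstruct: Reconstructor) -> float:
    """Mean squared error between a sample and its reconstruction."""
    x_hat = reconstruct(x)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def is_malicious(x: Vector, benign_ae: Reconstructor, malware_ae: Reconstructor) -> bool:
    """Assumed rule: malicious if the malware model reconstructs x with lower error."""
    return reconstruction_error(x, malware_ae) < reconstruction_error(x, benign_ae)


def mean_reconstructor(samples: List[Vector]) -> Reconstructor:
    """Stand-in for a trained autoencoder: always outputs the training mean."""
    n = len(samples)
    mean = [sum(column) / n for column in zip(*samples)]
    return lambda x: mean


# Toy feature vectors; real inputs would be TLS handshake features.
benign_ae = mean_reconstructor([[0.0, 0.1], [0.1, 0.0], [0.0, 0.0]])
malware_ae = mean_reconstructor([[1.0, 0.9], [0.9, 1.0], [1.0, 1.0]])
flagged = is_malicious([0.95, 1.05], benign_ae, malware_ae)
```

In a real system each `Reconstructor` would wrap a trained network's forward pass, and the comparison could be replaced by a threshold on the difference between the two errors.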

This project contributed towards my undergraduate thesis, available at this link.