Android Malware Detection

Using custom Deep Neural Networks

“Artificial intelligence refers to a machine's ability to learn, adapt, and solve complex problems automatically that benefit society.”
— Amit Ray, Compassionate Artificial Intelligence

Read the notebooks

Click on either of the two models to view its implementation

Problem description

The widespread proliferation of Android devices has led to a concerning increase in malware threats, which pose significant risks to users' personal data and digital security. Malicious apps often disguise themselves as legitimate software, making them difficult to identify without specialized tools. The provided dataset contains some of the features that an application may have and the services that it may use. Given this input, I developed an AI model that tries to find patterns among these features that reveal whether an application is malicious or not. This was done for educational purposes, to better learn the two biggest ML frameworks: TensorFlow and PyTorch.

Approach

While learning more about neural networks and the frameworks used to implement them, I created this model first in TensorFlow and later in PyTorch. There is no significant difference between the two implementations, and no technical reason to prefer one over the other here; both frameworks were used purely for educational purposes.

Pre-processing

The dataset was fairly clean, yet some pre-processing steps had to be taken before feeding the data to the model. Briefly put, a few missing values were replaced with the mean value of their respective column, and the labels were encoded. For more details, please check the two notebooks.
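The two pre-processing steps described above can be sketched with pandas roughly as follows. This is a minimal illustration, not the code from the notebooks: the toy DataFrame, its column names (`feature_a`, `feature_b`, `label`), the label strings, and the 0/1 mapping are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
df = pd.DataFrame({
    "feature_a": [1.0, None, 0.0, 1.0],
    "feature_b": [0.0, 1.0, None, 1.0],
    "label": ["malware", "goodware", "goodware", "malware"],
})

feature_cols = ["feature_a", "feature_b"]

# Replace each missing value with the mean of its own column
df[feature_cols] = df[feature_cols].fillna(df[feature_cols].mean())

# Encode the string labels as integers (assumed mapping, not taken from the notebooks)
df["label"] = df["label"].map({"goodware": 0, "malware": 1})

print(df)
```

In a real pipeline the column means would normally be computed on the training split only and then reused for the test split, so that no information leaks from the test data.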

Model

This is a binary classification problem. In other words, the model outputs whether the given attributes belong to Android malware or goodware. To tackle this, a neural network was used with an input layer of 241 features and 3 hidden layers between the input and the output. For further details, please check the two notebooks.
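As a rough sketch of such an architecture in PyTorch: only the 241 input features and the 3 hidden layers come from the description above. The class name `MalwareClassifier`, the hidden layer sizes (128, 64, 32), the ReLU activations, and the single-logit output are assumptions for illustration; the notebooks may differ.

```python
import torch
from torch import nn


class MalwareClassifier(nn.Module):
    """Feed-forward binary classifier: 241 inputs, 3 hidden layers, 1 output logit."""

    def __init__(self, n_features: int = 241, hidden=(128, 64, 32)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden[0]), nn.ReLU(),
            nn.Linear(hidden[0], hidden[1]), nn.ReLU(),
            nn.Linear(hidden[1], hidden[2]), nn.ReLU(),
            nn.Linear(hidden[2], 1),  # single logit; sigmoid is applied outside the model
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


model = MalwareClassifier()
batch = torch.rand(4, 241)            # 4 fake apps with 241 features each
probs = torch.sigmoid(model(batch))   # estimated probability of the positive class
preds = (probs >= 0.5).int()          # 1 = malware, 0 = goodware (assumed encoding)
print(preds.flatten().tolist())
```

Returning a raw logit and applying the sigmoid outside the model is one common design choice, since it pairs well with `nn.BCEWithLogitsLoss` during training.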

TensorFlow vs PyTorch model

The main difference between the two models was the amount of control over each step of the pipeline. With TensorFlow, most of the steps were handled at a high level of abstraction. For instance, neither the training method nor the splitting of the dataset had to be implemented by hand; TensorFlow required only a few method calls to cover these steps. In PyTorch, more effort was necessary to achieve the same output, yet this allowed for more control over the output. For instance, the training method had to be written manually. Additionally, in PyTorch, a manual seed was added in order to "ensure" reproducibility.
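To make the PyTorch side of this comparison concrete, here is a minimal sketch of a hand-written training loop with a manual seed. Everything in it is an assumption for illustration: the `train` helper, the synthetic data, the small stand-in network, the Adam optimizer, and the hyperparameters are not taken from the notebooks. In Keras, by contrast, the loop body would typically be hidden behind a single `model.fit(...)` call.

```python
import torch
from torch import nn


def train(seed: int, epochs: int = 20) -> list[float]:
    """Train a tiny stand-in network and return the loss at every epoch."""
    torch.manual_seed(seed)  # seeds the data generation and weight initialisation below

    # Synthetic stand-in data: 64 samples, 241 features, binary labels
    X = torch.rand(64, 241)
    y = (X[:, 0] > 0.5).float().unsqueeze(1)

    model = nn.Sequential(nn.Linear(241, 16), nn.ReLU(), nn.Linear(16, 1))
    loss_fn = nn.BCEWithLogitsLoss()  # expects raw logits, applies sigmoid internally
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

    losses = []
    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()          # clear gradients from the previous step
        loss = loss_fn(model(X), y)    # forward pass on the full batch
        loss.backward()                # backpropagation
        optimizer.step()               # weight update
        losses.append(loss.item())
    return losses


run_a = train(seed=42)
run_b = train(seed=42)
print(run_a == run_b)  # the same seed should give the same loss curve on the same machine
```

The quotation marks around "ensure" are justified: a fixed seed makes runs repeatable on the same machine and library version, but PyTorch does not promise identical results across platforms, releases, or devices.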

Results

Accuracy: 99.89%
Precision: 99.46%
Recall: 100%
F1-Score: 99.73%
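For readers unfamiliar with these metrics, the sketch below shows how they are derived from the four cells of a confusion matrix. The `classification_metrics` helper and the counts passed to it are made up for illustration and are not this project's confusion matrix.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)   # of the apps flagged as malware, how many really are
    recall = tp / (tp + fn)      # of the real malware apps, how many were flagged
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts: no false negatives, a couple of false positives
metrics = classification_metrics(tp=8, fp=2, fn=0, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Read this way, a recall of 100% means that no malware sample in the test set was missed, while a precision slightly below 100% means that a few goodware samples were flagged as malware.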

Confusion Matrix

Technologies used

Docs Website (this page)

Pico CSS
Vanilla JS
Jupyter notebooks as HTML

AI Models

Python
TensorFlow
PyTorch
NumPy
Pandas
scikit-learn
Matplotlib
Seaborn

Credits

Special thanks to

Freepik and sentavio for the featured image. More here
Joakim Arvidsson for the dataset. More here