Android Malware Detection
Using custom Deep Neural Networks
“Artificial intelligence refers to a machine's ability to learn, adapt, and solve complex problems automatically that benefit society.”
Read the notebooks
Click either of the two models to view its implementation
Problem description
The widespread proliferation of Android devices has led to a concerning increase in malware threats, which pose significant risks to users' personal data and digital security. Malicious apps often disguise themselves as legitimate software, making them difficult to identify without specialized tools. The provided dataset contains some of the features that an application may have or services that it may use. Given this input, I developed an AI model that tries to find patterns among these features that may reveal whether an application is malicious. This was done for educational purposes, to better learn the two biggest ML frameworks: TensorFlow and PyTorch.
Approach
While learning more about the implementation of neural networks and their frameworks, I created this model first in TensorFlow and later in PyTorch. There is no significant difference between the two implementations; both frameworks were used purely for educational purposes.
Pre-processing
The dataset was fairly clean, yet some pre-processing steps were needed before feeding the data to the model. In brief, a few missing values were replaced with the mean of their respective column, and the labels were encoded. For more details, please check the two notebooks.
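The two steps can be sketched with pandas as follows. This is a minimal illustration on toy data, not the notebooks' actual code: the column names (`feature_a`, `feature_b`, `label`) and the label strings are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real dataset has 241 feature columns.
df = pd.DataFrame({
    "feature_a": [1.0, np.nan, 0.0, 1.0],
    "feature_b": [0.0, 1.0, np.nan, 1.0],
    "label": ["malware", "goodware", "goodware", "malware"],
})

feature_cols = [c for c in df.columns if c != "label"]

# Replace each missing value with the mean of its own column.
df[feature_cols] = df[feature_cols].fillna(df[feature_cols].mean())

# Encode the string labels as integers. Categories are sorted
# alphabetically, so "goodware" becomes 0 and "malware" becomes 1.
df["label"] = df["label"].astype("category").cat.codes

X = df[feature_cols].to_numpy(dtype=np.float32)
y = df["label"].to_numpy(dtype=np.float32)
```

If the imputation is done before the train/test split, the column means include test rows; computing them on the training split only avoids that leakage.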
Model
The problem is one of binary classification. In other words, the model outputs whether the given attributes belong to Android malware or goodware. To tackle this, a neural network was used with an input layer of 241 features and 3 hidden layers. For further details, please check the two notebooks.
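A network of this shape might look like the PyTorch sketch below. Only the 241 input features and the three hidden layers come from the description above; the class name `MalwareClassifier`, the hidden layer widths, the ReLU activations, and the single-logit output are assumptions chosen for illustration.

```python
import torch
from torch import nn


class MalwareClassifier(nn.Module):
    """Feed-forward binary classifier: 241 inputs, 3 hidden layers, 1 logit."""

    def __init__(self, n_features=241, hidden=(128, 64, 32)):
        super().__init__()
        layers = []
        in_dim = n_features
        # Hidden widths are assumed; the source only states there are three.
        for out_dim in hidden:
            layers += [nn.Linear(in_dim, out_dim), nn.ReLU()]
            in_dim = out_dim
        # One output logit; a sigmoid turns it into P(malware).
        layers.append(nn.Linear(in_dim, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)


model = MalwareClassifier()
batch = torch.rand(4, 241)           # 4 random stand-in samples
probs = torch.sigmoid(model(batch))  # one probability per sample
```

Thresholding `probs` at 0.5 would give the malware/goodware decision.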
TensorFlow vs. PyTorch model
The main difference between the two implementations was how much control each framework gave over the individual steps. With TensorFlow, most steps were handled by high-level abstractions: for instance, neither the training method nor the dataset split had to be implemented by hand, and a few method calls covered both. In PyTorch, more effort was needed to reach the same result, but this allowed finer control over the process. For instance, the training method had to be written manually. Additionally, a manual seed was set in PyTorch to help ensure reproducibility.
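The manual PyTorch steps mentioned above (seeding, splitting, and training) can be sketched as follows. This is an illustrative example on synthetic data, not the notebook's code: the function name `run_experiment`, the small stand-in network, the 80/20 split, the Adam optimizer, and all hyperparameters are assumptions.

```python
import torch
from torch import nn


def run_experiment(seed, epochs=20):
    # Fix the seed so weight init, data, and the shuffle are repeatable.
    torch.manual_seed(seed)

    # Synthetic stand-in data with 241 features and a binary label.
    X = torch.rand(200, 241)
    y = (X[:, 0] > 0.5).float()

    # Manual 80/20 train/test split via a random permutation.
    perm = torch.randperm(len(X))
    split = int(0.8 * len(X))
    train_idx = perm[:split]

    # Small stand-in network (not the project's architecture).
    model = nn.Sequential(nn.Linear(241, 16), nn.ReLU(), nn.Linear(16, 1))
    loss_fn = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Hand-written training loop: forward, loss, backward, update.
    model.train()
    losses = []
    for _ in range(epochs):
        optimizer.zero_grad()
        logits = model(X[train_idx]).squeeze(-1)
        loss = loss_fn(logits, y[train_idx])
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses


losses_a = run_experiment(seed=42)
losses_b = run_experiment(seed=42)
```

On CPU, two runs with the same seed should produce identical loss curves; on GPU, some operations are nondeterministic by default, which is one reason a seed alone does not fully guarantee reproducibility.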
Results
- Accuracy: 99.89%
- Precision: 99.46%
- Recall: 100%
- F1-Score: 99.73%
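For reference, the four metrics relate to confusion-matrix counts as in the sketch below. The counts passed to the hypothetical helper `classification_metrics` are made up for illustration and are not the project's confusion matrix. The last line only checks that the reported precision (99.46%) and recall (100%) are consistent with the reported F1-score.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary-classification metrics from raw counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of predicted malware that is malware
    recall = tp / (tp + fn)     # share of actual malware that was caught
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative counts only (NOT the project's results).
example = classification_metrics(tp=90, fp=10, fn=0, tn=900)

# F1 is the harmonic mean of precision and recall; plugging in the
# reported precision and recall gives about 0.9973, matching the list above.
reported_f1 = 2 * 0.9946 * 1.0 / (0.9946 + 1.0)
```

A recall of 100% means no malware sample was classified as goodware (zero false negatives), so the remaining errors are all goodware flagged as malware.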
Confusion Matrix
Technologies used
Docs Website (this page)
- Pico CSS
- Vanilla JS
- Jupyter notebooks as HTML
AI Models
- Python
- TensorFlow
- PyTorch
- NumPy
- Pandas
- scikit-learn
- Matplotlib
- Seaborn
Credits
Special thanks to