TC-CNN: A Tree-Classified Model Using AI for Identifying Malware Intrusions in Open Networks

The proliferation of internet-connected devices has led to dramatic increases in network traffic. The internet has grown along with this usage, but this rapid growth has also produced new threats, leaving networks vulnerable to intruders, attackers, and malicious users. Network security has therefore become a critical concern: heavy use of ICT (Information and Communications Technology) has been accompanied by a manifold growth in threats to ICTs. Securing data is a major issue, especially when it is transmitted across open networks. IDSs (Intrusion Detection Systems) are methods, techniques, or algorithms that detect intrusions while data is in transit, and they are useful for identifying harmful operations. Automated threat detection and prevention is an effective way to reduce the workload of human monitors by scanning network and server activity and alerting monitors to suspicious behavior. IDSs monitor systems continually from the perspective of threats. The technique proposed in this paper detects suspicious activities using AI (Artificial Intelligence) and concurrently analyzes networks to defend against harmful activities. Experimental results of the proposed algorithm on the UNSW_NB15_training-set show good performance, with accuracy above 96%.


Index Terms - IDSs, ANNs, Machine Learning, Artificial Intelligence, Network Security
Introduction: Fast-paced technological growth and advancement have encouraged organizations to adopt ICTs. Organizations have expanded, users have multiplied, and handheld devices and mobiles now number in the billions. These factors have contributed to the enormous growth of internet usage. Millions of packets are transmitted over the internet, where network traffic is heterogeneous and consists of flows from multiple applications and utilities. Traffic from handheld devices, mainly smartphones, has also increased, with a fourfold increase expected globally [1]. The current digital age has thus created an environment where every action is routed through the internet, making it vulnerable in terms of security, and ICTs may be compromised. The internet is full of dangers, including malware and DDoS attacks. Figure 1 displays details of malware attacks; protection against malware is also important. Network services such as monitoring, accounting, control, and optimization can improve security standards. Policies such as bring-your-own-device have also been adopted by corporations to manage access to their resources [2]. Considering these factors, the ability to detect and prevent attacks on network systems becomes significant. Networks can be protected against such attacks, making IDSs an essential layer of ICTs from the perspective of global cyber safety. IDSs can also help detect intrusions and identify attack types when used properly. Studying data traffic to detect malware is time-consuming, but the study in [13] used DNNs (Deep Neural Networks) for this purpose and analyzed process behavior.
The study used RNNs for extracting features, while CNNs were used for classification. RNN training was done with LSTM, as RNNs are known to produce errors while processing their layers; the LSTM used in the study helps mitigate the errors created by RNN learning and thus extracts only the required features. To obtain malware and process log files for training and validation, they generated a dataset using Cuckoo Sandbox to run malware emulations.
The study used this knowledge to trace malware processes for injections. Their system showed 91.89% accuracy in detecting malware, but was not tested on large-scale data.
Two important characteristics were studied in [14]. The study's dataset was created from the malware samples of an anti-virus vendor and compared against different SHA1 hashes generated from previously collected malware samples. The study extracted features and constructed them using NNs, which were then classified by an RNN. During NN construction, network communications were captured for training, root-node vectors were exposed for classification, and predictions were made from this phase. The study reduced analysis time by 66% while comparing 96.9% of URLs.

Data Cleaning: The study in [15] addressed data cleaning, which is the removal of incorrect, incomplete, wrongly formatted, or corrupt data found in the samples. This cleaning process differs across datasets and across the features of a dataset. Data cleaning in this work is performed as depicted in Figure 3.

The study in [19] proposed a correlation-based implementation for selecting suitable features. The dataset was discretized and then used to compute feature-class and feature-feature correlations using symmetric uncertainty. Correlation is a well-known similarity measure between two features: if two features are linearly dependent, their correlation coefficient is ±1, and if they are uncorrelated, it is 0. The association between features is found using the correlation method. There are two broad categories of measures for the correlation between two random variables: one based on classical linear correlation and the other based on information theory. The linear correlation coefficient is the familiar measure and is the one used by TC-CNN to assess correlations between features. Assuming r is the linear correlation coefficient between a pair of variables (X, Y), r can be computed using equation (1):

$$ r = \frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i}(x_i-\bar{x})^2}\,\sqrt{\sum_{i}(y_i-\bar{y})^2}} \tag{1} $$
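To make the filtering step concrete, the sketch below drops one feature from every highly correlated pair using the Pearson r of equation (1). This is a minimal illustration rather than the paper's exact procedure: the 0.95 threshold and the pandas-based implementation are assumptions, and the symmetric-uncertainty step of [19] is not reproduced here.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.95):
    """Drop one feature from every pair whose |r| exceeds the threshold."""
    corr = df.corr().abs()  # pairwise |r| from equation (1)
    # Keep only the upper triangle so each feature pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop
```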

UNSW-NB15 Dataset
Feature Selection: Feature selection is a prominent step before MLTs learn from samples. It is the process of reducing the feature set by choosing the most relevant features from the original set according to an evaluation criterion, while removing redundant features. Assume Fs = {Fs1, Fs2, Fs3, ..., Fsn} is a feature set, where n is the number of features in the dataset. Feature selection Fsel can then be defined as the process of selecting from Fs a subset of the most discriminatory features whose count is greater than or equal to 1 and at most n. Feature selection methods involve the generation of subsets, and feature selection is also a form of dimensionality reduction. Many feature selection methods, such as filters [20], wrappers [21], and hybrid methods [22], have been used in studies for a long time. The wrapper model uses the predictive accuracy of a predetermined learning algorithm to determine the goodness of the selected subsets; these methods are computationally expensive for data with a large number of features. The filter model separates feature selection from classifier learning and selects feature subsets independently of any learning algorithm, relying on measures of the general characteristics of the training data such as distance, information, dependency, and consistency. TC-CNN uses ERTC (Extremely Randomized Trees Classifier), an ensemble learning technique that aggregates the results of multiple de-correlated decision trees collected in a "forest" to output its classification result. It parallels the RF (Random Forest) classifier in operation but differs in its manner of construction. The ETF (Extra Trees Forest) is constructed from the original sample. Each tree receives a random sample of k features from the feature set, from which it selects the best feature to split the data based on the Gini index. This random sampling of features leads to the creation of multiple de-correlated decision trees. To perform feature selection using this forest structure, the normalized total reduction in the split criterion contributed by each feature, i.e. its Gini importance, is computed during the construction of the forest. ETF is thus used in this study to select the best features: features whose importance values are greater than the mean importance value are retained, as in the sketch below.
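The following is a minimal scikit-learn sketch of this selection step. The estimator settings (100 trees, fixed random seed) are illustrative assumptions, while the "importance greater than the mean importance" rule follows the text via SelectFromModel's "mean" threshold.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

def select_features(X_train, y_train):
    # Forest of de-correlated trees; Gini importance is computed per feature
    ert = ExtraTreesClassifier(n_estimators=100, criterion="gini", random_state=42)
    ert.fit(X_train, y_train)
    # threshold="mean" keeps only features whose importance exceeds the
    # mean importance, matching the selection rule described above
    selector = SelectFromModel(ert, threshold="mean", prefit=True)
    return selector, selector.transform(X_train)
```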
Classification: TC-CNN uses a CNN for classifying intrusions in the dataset. A CNN, or ConvNet, is a DLT and a part of AI. CNNs are preferred as they filter the characteristics relevant for classifying data; moreover, they require very limited pre-processing compared to other classification algorithms. CNNs are analogous to the neuron connectivity patterns of the human brain and capture spatial and temporal dependencies in the input data. They learn appropriate feature representations from inputs and differ from MLPs in their weight sharing and pooling. CNN layers contain convolution kernels which generate varied feature maps; neighboring neuron regions are connected to a neuron's feature map in the next layer, and while generating the feature map, the kernel is shared across all input spatial locations. Convolution and pooling layers feed into fully connected layers, which are then used for classifying the data [19-21]. The weight sharing of CNNs helps the model learn the same pattern occurring at different input positions without learning a separate detector for each position. This makes CNN models robust to input translations [22]. Pooling layers reduce the computational burden by reducing the number of connections between convolution layers [23]. Mathematically, each CNN layer has a set of kernels which convolve the input data. The convolution between the data and a kernel produces a new feature map x_k, and the transformation can be defined as equation (2):

$$ x_k^{l} = \sigma\left( w_k^{l-1} * x^{l-1} + b_k^{l-1} \right) \tag{2} $$

where l is the convolution layer, W = {w_1, w_2, ..., w_n} are the n kernels, and B = {b_1, b_2, ..., b_n} are the n biases. During learning, CNNs use a sliding window through which the bias and weight values for the various features of the input data are optimized irrespective of their position within the input.
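As an illustration of equation (2), the toy sketch below slides a single kernel w_k over a 2-D input to produce one feature map, with ReLU standing in for the non-linearity sigma. The shapes and values are arbitrary assumptions made only for demonstration.

```python
import numpy as np

def conv2d_feature_map(x, w, b):
    """Valid 2-D convolution of input x with kernel w, plus bias b and ReLU."""
    kh, kw = w.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    fmap = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same weights w are shared across all spatial positions (i, j)
            fmap[i, j] = np.sum(x[i:i + kh, j:j + kw] * w) + b
    return np.maximum(fmap, 0.0)  # sigma = ReLU

x = np.random.rand(7, 6)   # toy input
w = np.random.randn(3, 3)  # one kernel w_k
print(conv2d_feature_map(x, w, 0.1).shape)  # (5, 4)
```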

Results and Discussions
The proposed model was evaluated using Python 3 on the UNSW-NB15 dataset compiled by the University of New South Wales [24]. The proposed system was implemented on an AMD Radeon processor with 16 GB of RAM running 64-bit Windows 10. Keras and TensorFlow were used as the software frameworks, and CNN training was performed on GPU-enabled TensorFlow using a single Nvidia GK110BGL Tesla K40. This dataset was chosen because it reflects a more modern and complex threat environment. The full dataset contains a total of 2,540,044 records and is divided into a training set and a test set according to the hierarchical sampling method, namely UNSW_NB15_training-set.csv and UNSW_NB15_testing-set.csv. The training dataset consists of 175,341 records, whereas the testing dataset contains 82,332 records. The number of features in the partitioned dataset differs from that of the full dataset: it has 43 features plus the class labels. The partitioned dataset contains ten categories, one normal and nine attacks, namely generic, exploits, fuzzers, DoS, reconnaissance, analysis, backdoor, shellcode, and worms. Figure 7 depicts the output of preprocessing. Feature selection in this work also acts as a dimensionality reduction phase to improve the accuracy of the classifier; Figure 9 depicts the output of feature selection. The network uses two convolution layers, two pooling layers, and fully connected layers. The kernel sizes of the convolution layers are [4 x 4] and [3 x 3] respectively, and the pooling size for both pooling layers is [2 x 2]. Furthermore, three fully connected layers with 50, 20, and 2 neurons are used. To prevent overfitting, a dropout rate of 0.2 is applied. The data was split into a 70/30 combination, where 70% of the data was used for training and 30% for testing; Figure 10 depicts this split and CNN learning on the test samples. The Rectified Linear Unit (ReLU) activation function is used in all layers except the last, which uses the Softmax activation function. For optimization, the Adaptive Moment Estimation (Adam) method is used, and the number of epochs is set to 20. Figure 11 depicts the output of CNN learning, which achieved 97.39% accuracy.
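Below is a minimal Keras sketch of the architecture described above. The kernel sizes, pooling sizes, fully connected layer widths, dropout rate, activations, and optimizer follow the text; the input reshape to a 7 x 6 x 1 grid (i.e. 42 selected features) and the 32/64 filter counts are assumptions, since the text does not state them.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_tc_cnn(input_shape=(7, 6, 1), n_classes=2):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        # Two convolution blocks with the kernel and pooling sizes from the text
        layers.Conv2D(32, (4, 4), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        # Three fully connected layers with 50, 20 and 2 neurons
        layers.Dense(50, activation="relu"),
        layers.Dropout(0.2),  # dropout rate given in the text
        layers.Dense(20, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",  # Adam, as described above
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_tc_cnn()
model.summary()
# Training would follow the 70/30 split and 20 epochs described above, e.g.:
# history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=20)
```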

Fig. 11 - CNN Learning Epochs
Model Evaluation: A learning curve is a plot of model learning performance over experience or time. Learning curves are a widely used diagnostic tool in machine learning for algorithms that learn incrementally from a training dataset. The model can be evaluated on the training dataset and on a hold-out validation dataset after each update during training, and plots of the measured performance show the learning curves. Reviewing learning curves during training can be used to diagnose problems with learning, such as an underfit or overfit model, as well as whether the training and validation datasets are suitably representative. A loss function is used to optimize a machine learning algorithm. The loss is calculated on the training and validation sets, and its interpretation reflects how well the model is doing on these two sets: it is the sum of the errors made for each example in the training or validation set. The loss value indicates how poorly or how well a model behaves after each iteration of optimization. An accuracy metric is used to measure the algorithm's performance in an interpretable way. The accuracy of a model is usually determined after the model parameters are learned and is expressed as a percentage: it measures how accurate the model's predictions are compared to the true data. Figure 12 depicts the training and validation accuracy, while Figure 13 depicts the training and validation loss of TC-CNN.
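As a sketch of how plots like Figures 12 and 13 can be produced, the code below draws accuracy and loss learning curves from a Keras History object, as returned by model.fit; the variable name history is an assumption.

```python
import matplotlib.pyplot as plt

def plot_learning_curves(history):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    # Accuracy on the training and validation sets per epoch (cf. Figure 12)
    ax1.plot(history.history["accuracy"], label="train")
    ax1.plot(history.history["val_accuracy"], label="validation")
    ax1.set_xlabel("epoch"); ax1.set_ylabel("accuracy"); ax1.legend()
    # Loss on the training and validation sets per epoch (cf. Figure 13)
    ax2.plot(history.history["loss"], label="train")
    ax2.plot(history.history["val_loss"], label="validation")
    ax2.set_xlabel("epoch"); ax2.set_ylabel("loss"); ax2.legend()
    plt.tight_layout()
    plt.show()
```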

Conclusion
This paper has addressed the problem of classifying network attacks and identifying malware. The dataset used provided samples for classifying and identifying the corresponding elements. The first part identified important features in the dataset using correlation- and tree-based classification to minimize dimensionality and remove unwanted features. The second part used deep learning to identify and classify attacks on networks in the form of malware. The approach as a whole produced an accuracy of 97% in training, implying that it can be deployed in network systems to identify malware. Moreover, the depicted results also imply that TC-CNN is a promising approach that can classify malware samples much faster than solutions relying on manual feature extraction, and is thus more scalable. It can be concluded that TC-CNN is a viable, implementable technique for identifying malware in open networks.
Future scope: Even though both approaches have been applied successfully, there is still a large margin for improvement. One possible modification is to expand the vocabulary of malware to viruses by studying their patterns and using them as samples for CNN learning. The study also aims to improve by introducing and evaluating other AI methods for the generic identification of attacks on networks, and specifically on mobiles, where the internet is an open and unprotected area of network access.