Application of Graph Theory in DNA Similarity Analysis of Evolutionarily Close Species

DNA is a complex molecule that carries biological information passed down from generation to generation. Over the course of evolution, different species have arisen from a common ancestor through DNA sequence rearrangements. DNA sequence similarity analysis is a major challenge because the number of sequences in DNA databases is increasing rapidly. In this research, we build on a mathematical method that uses graph theory to analyze the similarity of two DNA sequences. The method models each DNA sequence as a weighted directed graph, constructs its adjacency matrix, and converts the matrix to a representative vector for the graph. Similarity is then determined from these vectors using distance measurements such as Euclidean, Cosine, and Correlation. Taking this as the base method, we check whether it is applicable to any DNA fragment in the genomes considered and whether molecular similarity coefficients can be used as distance measurements. We also obtain similarities using the graph spectrum instead of the representative vector, and compare the results of the two representations. The modified method is tested using the mitochondrial DNA of Human, Gorilla, and Orangutan. It gives the same result when the number of nucleotides in the DNA fragments is increased.


I. Introduction
Deoxyribonucleic acid (DNA) is a complex molecule that carries the biological information that makes every species distinctive. It includes the instructions an organism needs to develop, live, and reproduce. These instructions are found inside every living cell and are passed from a parent organism to its offspring during reproduction. DNA is made of chemical building blocks known as nucleotides. Each nucleotide contains a phosphate group, a sugar group, and a nitrogen base. Nucleotides are arranged in two long strands that form a spiral called a double helix, whose structure is somewhat like a ladder built from the base pairs. There are four types of nitrogen base: adenine (A), thymine (T), guanine (G), and cytosine (C). The biological instructions, or genetic code, contained in a DNA strand are determined by the order, or sequence, of these bases.

DNA sequence analysis is very important in biology because the number of DNA sequences in DNA databases is increasing rapidly. It is therefore essential to find similarities and dissimilarities among DNA sequences in order to learn where species originate and to identify homologous sequences. However, it is hard to obtain this information from a DNA sequence directly, because rearrangements occur during evolution. Analyzing large amounts of genomic DNA sequence data is a major challenge for bio-scientists.

In mathematical biology, mathematical models are applied to various modeling and calculation problems. At the microscopic scale, DNA and RNA structures, protein sequences, and other biological networks can be represented as graphs. Graph theory has thus established itself as a valuable mathematical tool for determining biological properties, because it represents such networks easily. With the development of graph theory, graphs have been put to different uses on different biological structures; for example, graphs are used to predict similarities between DNA sequences [3]. In 2011, a novel method based on graph theory was introduced for similarity calculations [4]. That method starts from a weighted directed graph for each DNA sequence, represents the graph by its adjacency matrix, and then converts the matrix to a representative vector. Three distance measurements were defined to calculate the similarity between vectors. In 2017, this method was used to calculate the similarities between Human, Gorilla, and Orangutan using the Cosine, Correlation, and Euclidean distances [2].

This research modifies that graph-theoretic model for representing DNA sequences mathematically for similarity analysis. We check whether the method is applicable to any region of the genomes and apply molecular similarity measurements as distance measurements. The method is then repeated using the spectrum of the graph as the vector, and the results of the two vector representations are compared. The modified method is verified using the genomes of Human, Gorilla, and Orangutan.

II. Research Methodology
The materials used in this research are mitochondrial DNA sequences of three evolutionarily close species (Human, Orangutan, and Gorilla), downloaded from GenBank at the National Center for Biotechnology Information (NCBI). We use the Clustal Omega program as the multiple sequence alignment tool.

The first step in this study is to find the regions of the genomes to which the novel method can be applied. The genomes of the three species were aligned using Clustal Omega to detect conserved regions and DNA variations, and regions containing both were used in the rest of the study. A sample of the aligned genomes is shown in Figure 1.

We then model a weighted directed graph for each randomly selected DNA region, containing conserved regions and DNA variations, of each species. The weight function of an arc is a decreasing function of the distance between the two nucleotides it connects. This reflects the fact that two nucleotides separated by a smaller distance have a stronger interactive relationship than two nucleotides separated by a larger distance. Since the maximum weight of an arc is 1 for any value of the weight parameter, arcs with weights not less than 0.1 are considered relatively significant.

When assigning the weights of edges, it is important to choose the weight parameter so that the weights used to construct the representative vector are not less than 0.1. The parameter can be chosen according to the length of the DNA sequence: the values suited to a long sequence differ from those suited to a short one. An example of constructing a weighted directed graph for a given DNA sequence is given below.

Suppose a DNA sequence with 8 nucleotides is given. Because it is a very short sequence, the weight parameter is chosen accordingly.

Figure 2: The weighted directed graph for the example sequence

Theorem. There is a one-to-one mapping between a DNA sequence and its corresponding weighted directed multigraph.

In the multigraph there are several parallel edges that connect one vertex to another in the same direction. We can therefore simplify the multigraph by merging parallel edges into one edge. The vertex set is not changed by this simplification. Consider the set of all edges from one vertex to another in the multigraph. If this set is not empty, a single edge is assigned between the same pair of vertices, in the same direction, in the simplified graph, and its weight is determined by the weights of the parallel edges it replaces. Figure 3 shows the simplified weighted directed graph for the example DNA sequence.
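The construction and simplification steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure of the study: the arc weight is assumed to have the inverse-power form 1 / (j - i) ** alpha for positions i < j, the merged edge weight is assumed to be the sum of the parallel arc weights, and the names `multigraph_arcs`, `merge_parallel_arcs`, and the example sequence are hypothetical.

```python
from collections import defaultdict


def multigraph_arcs(seq, alpha=1.0, min_weight=0.1):
    """Return arcs (source, target, weight) for every position pair i < j.

    The weight 1 / (j - i) ** alpha is an assumed form for this sketch.
    Arcs lighter than min_weight are dropped (the 0.1 significance cutoff).
    """
    arcs = []
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):
            weight = 1.0 / (j - i) ** alpha
            if weight >= min_weight:
                arcs.append((seq[i], seq[j], weight))
    return arcs


def merge_parallel_arcs(arcs):
    """Collapse parallel arcs into one edge per ordered vertex pair.

    Summing the parallel weights is an assumption of this sketch.
    """
    merged = defaultdict(float)
    for source, target, weight in arcs:
        merged[(source, target)] += weight
    return dict(merged)


example = "ATGGTGCA"  # hypothetical 8-nucleotide sequence, for illustration only
arcs = multigraph_arcs(example, alpha=1.0)
simple_graph = merge_parallel_arcs(arcs)
print(len(arcs), "arcs merged into", len(simple_graph), "edges")
```

Under the assumed weight form, the 0.1 cutoff keeps only pairs with j - i not greater than 10 ** (1 / alpha): at most 10 positions apart for alpha = 1 and at most 100 for alpha = 0.5. This is one way to read the advice to choose the parameter according to the sequence length.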

Adjacency matrix and representative vector -
The adjacency matrix of the simplified weighted directed graph has one row and one column for each nucleotide, and each entry is the weight of the edge from the row nucleotide to the column nucleotide. Every DNA sequence therefore has a 4 × 4 square adjacency matrix, and each element of the matrix gives the interaction between a pair of nucleotides in the sequence. We then convert the matrix to a 16-dimensional row vector, which is the representative vector of the DNA sequence.
The comparison of DNA sequences is thus converted to a comparison of 16-dimensional vectors. This is one of the two vector representations of a DNA sequence used in this study. The representative vector is used to compute the distance measurements from which similarity is determined.
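A minimal sketch of the step from sequence to representative vector is given here. It assumes the vertex order A, C, G, T, a row-by-row flattening of the matrix, and the same inverse-power arc weight and summed merging as a plausible reading of the method; these choices, the function names, and the example sequence are assumptions for illustration.

```python
NUCLEOTIDES = "ACGT"  # assumed row and column order


def adjacency_matrix(seq, alpha=1.0, min_weight=0.1):
    """4 x 4 matrix whose entry [r][c] is the merged weight of the edge r -> c."""
    index = {base: k for k, base in enumerate(NUCLEOTIDES)}
    matrix = [[0.0] * 4 for _ in range(4)]
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):
            weight = 1.0 / (j - i) ** alpha  # assumed weight form
            if weight >= min_weight:
                matrix[index[seq[i]]][index[seq[j]]] += weight
    return matrix


def representative_vector(matrix):
    """Flatten the matrix row by row into a 16-dimensional vector."""
    return [entry for row in matrix for entry in row]


matrix = adjacency_matrix("ATGGTGCA")  # hypothetical example sequence
vector = representative_vector(matrix)
print(len(vector))  # 16
```

Any fixed ordering of the 16 entries works, as long as the same ordering is used for every sequence being compared.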

Similarity measurements-
The representative vector is used to determine the degree of similarity between two sequences.
When the distance is in the range [0,1], the common relationship between distance (dissimilarity) and similarity is
similarity = 1 - distance
A smaller distance indicates that the corresponding DNA sequences are more similar. We determine the degree of similarity using the following distance measurements. The first four can be interpreted through the vector structure [4]. The last three can be interpreted through the molecular structure of the DNA molecule. We take two different DNA sequences of the same length, together with their corresponding 16-dimensional representative vectors.

1. Euclidean distance: the distance between the endpoints of the two vectors. It is the shortest, straight-line distance between the two points.
2. Cosine distance: also called angular distance; it is based on the cosine of the angle between the two vectors.
3. Correlation distance: based on the linear correlation coefficient, which measures the dependence between the two vectors.
4. Manhattan distance (city-block distance): the distance traveled from the endpoint of one vector to the endpoint of the other along a grid-like path.

In addition to the above distance measurements, we can determine the similarities among DNA sequences using molecular similarity coefficients and compare them with the above results. Molecular similarity mainly focuses on the structural features of compounds and their representations, such as shared substructures, ring systems, and topologies. We can apply these similarity coefficients to DNA. The coefficients applied here are the Soergel distance, the Jaccard distance, and the Dice distance [1].
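The seven measurements can be sketched with their commonly used definitions for real-valued vectors. The forms below are standard textbook versions and are an assumption here; in particular, the continuous (Tanimoto-style) Jaccard and Dice forms and the "1 - similarity" conversions may differ in detail from those used in the study. The vectors `x` and `y` are small hypothetical examples.

```python
import math


def _dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def cosine_distance(x, y):
    # 1 - cos(angle); undefined if either vector is all zeros.
    return 1.0 - _dot(x, y) / math.sqrt(_dot(x, x) * _dot(y, y))


def correlation_distance(x, y):
    # 1 - Pearson correlation: cosine distance of the mean-centred vectors.
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return cosine_distance([a - mean_x for a in x], [b - mean_y for b in y])


def soergel_distance(x, y):
    # Intended for non-negative entries, such as edge weights.
    return manhattan(x, y) / sum(max(a, b) for a, b in zip(x, y))


def jaccard_distance(x, y):
    # 1 - continuous Jaccard (Tanimoto) coefficient.
    dot = _dot(x, y)
    return 1.0 - dot / (_dot(x, x) + _dot(y, y) - dot)


def dice_distance(x, y):
    # 1 - continuous Dice coefficient.
    return 1.0 - 2.0 * _dot(x, y) / (_dot(x, x) + _dot(y, y))


x = [1.0, 0.0, 2.0, 1.0]  # hypothetical vectors, shortened for readability
y = [0.0, 1.0, 2.0, 2.0]
for measure in (euclidean, manhattan, cosine_distance, correlation_distance,
                soergel_distance, jaccard_distance, dice_distance):
    print(measure.__name__, round(measure(x, y), 4))
```

The Euclidean and Manhattan distances are not confined to [0,1], so the relation similarity = 1 - distance applies to them only after some normalization; for ranking pairs of sequences, the raw distances can be compared directly.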
The spectrum of the graph was used as another vector representation of the DNA sequence.

Spectrum as a vector-
The spectrum of a graph is used extensively in graph theory to characterize the properties of a graph and to extract information from its structure. The spectrum can change with even a small change in the graph structure. The graph spectrum is derived from a matrix representation of the graph and depends on the form of that matrix [5]. In this research, each spectrum is obtained from the adjacency matrix of the corresponding graph. We use the spectrum as a row vector, since the spectrum can be considered a measure of graph similarity. The comparison between two DNA sequences is then converted to a comparison between two row vectors. Euclidean, Cosine, and Correlation are the three distance measurements used to measure similarity from the spectrum.
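A minimal sketch of the spectrum-based representation, using NumPy, is shown here. The adjacency matrix of a directed graph is generally not symmetric, so its eigenvalues may be complex, and eigenvalue routines return them in no particular order. Taking the moduli and sorting them in descending order is an assumption made here so that two spectra can be compared entry by entry; the function names and the two matrices are hypothetical.

```python
import numpy as np


def spectrum_vector(adjacency):
    """Eigenvalue moduli of the adjacency matrix, sorted in descending order.

    Using moduli and a fixed sort order is an assumption of this sketch.
    """
    eigenvalues = np.linalg.eigvals(np.asarray(adjacency, dtype=float))
    return np.sort(np.abs(eigenvalues))[::-1]


def spectral_distance(adjacency_a, adjacency_b):
    """Euclidean distance between the two spectrum vectors."""
    difference = spectrum_vector(adjacency_a) - spectrum_vector(adjacency_b)
    return float(np.linalg.norm(difference))


# Hypothetical 4 x 4 weighted adjacency matrices (rows and columns: A, C, G, T).
first = [[0.5, 1.0, 0.0, 1.2],
         [1.0, 0.0, 0.3, 0.0],
         [0.2, 1.5, 1.0, 0.7],
         [0.0, 0.4, 2.0, 0.1]]
second = [[0.5, 1.0, 0.0, 1.0],
          [1.0, 0.2, 0.3, 0.0],
          [0.2, 1.5, 1.0, 0.9],
          [0.0, 0.4, 2.0, 0.1]]
print(spectrum_vector(first))
print(spectral_distance(first, second))
```

The spectrum has only 4 entries, against 16 for the representative vector, so it retains less information about the graph. This is consistent with the spectrum-based comparison being the more fragile of the two on very short fragments.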

III. Results and Discussion
We randomly chose regions throughout the whole genomes of the three species, including conserved regions and DNA variations, to check whether the same similarity result can be obtained from any area of the genome. Each DNA sequence used here is 12 nucleotides long, and the same value of the weight parameter is used in all the calculations. Table 1 shows the pairwise distance measurements for one randomly chosen region. All cases predict the same result: the DNA sequences of Human and Gorilla are the most similar, because this pair has the smallest value for every distance measurement (Table 1). The results from the representative vector and the results from the spectrum agree.
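The pairwise comparison described above can be sketched end to end as follows. The three 12-nucleotide fragments are invented placeholders, not GenBank data, and no conclusion about real species should be drawn from them; the inverse-power weight, the summed merging of parallel arcs, and the A, C, G, T ordering are the same assumptions as before, and only the Euclidean distance is shown.

```python
import math
from itertools import combinations


def representative_vector(seq, alpha=1.0, min_weight=0.1):
    """16-dimensional vector of merged edge weights (assumed weight form)."""
    index = {base: k for k, base in enumerate("ACGT")}
    vector = [0.0] * 16
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):
            weight = 1.0 / (j - i) ** alpha
            if weight >= min_weight:
                vector[4 * index[seq[i]] + index[seq[j]]] += weight
    return vector


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pairwise_distances(fragments):
    """Map each unordered pair of fragment names to its Euclidean distance."""
    vectors = {name: representative_vector(seq) for name, seq in fragments.items()}
    return {(a, b): euclidean(vectors[a], vectors[b])
            for a, b in combinations(sorted(vectors), 2)}


# Placeholder fragments of length 12; replace with aligned regions of real genomes.
fragments = {
    "seq_a": "ATGGTGCACCTG",
    "seq_b": "ATGGTGCATCTG",
    "seq_c": "CCATAGGTACGA",
}
distances = pairwise_distances(fragments)
closest_pair = min(distances, key=distances.get)
print(closest_pair, distances[closest_pair])
```

The pair with the smallest distance is reported as the most similar; repeating the calculation with the other distance measurements and with the spectrum gives the remaining rows of such a comparison.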

Although most of the aligned sequences give a positive result, the methods fail for some sequences of the same length. Even the representative-vector-based method fails in some regions when the sequence is very short. In some DNA regions, the representative-vector-based method succeeds but the spectrum-based method fails. Positive results can be obtained for the failed regions when they are reused with a larger number of nucleotides. Building the weighted directed graph becomes more difficult as the number of nucleotides in the sequence increases. However, as the number of nucleotides increases, the method is more likely to give accurate results than it is for short DNA sequences. A short sequence may simply not contain enough DNA variations and mutations for the similarities to be analyzed.

IV. Conclusion
We applied the novel method to determine the evolutionary closeness between Human, Gorilla, and Orangutan. A unique weighted directed graph for each species is constructed from its DNA fragments. The weights of the arcs are the entries of the adjacency matrix of the weighted graph. The adjacency matrix written in vector form is called the representative vector. The representative vectors of each pair of DNA sequences are used to calculate distance measurements such as Euclidean, Cosine, Correlation, and Manhattan. The molecular similarity measurements Soergel, Jaccard, and Dice are also applicable for analyzing similarity using the representative vector. Instead of the representative vector, we also used the spectrum of each graph, obtained from the adjacency matrix of the weighted graph, since the spectrum characterizes the properties of the graph. Using the spectrum, the Euclidean, Cosine, and Correlation distances agreed with the distances obtained from the representative vector. The spectrum-based method was not as accurate for very short DNA fragments. The calculations using both vectors become more accurate as the number of nucleotides in the DNA fragments increases. This research therefore concludes that the method can be applied to any DNA fragments with conserved regions and DNA variations in the aligned genomes of different species to detect similarity. Molecular similarity measurements are also applicable as distance measurements.

Figure 1: Sample area of aligned genomes