神刀安全网

Graph Analytics over Relational Datasets

The analysis of interconnection structures of entities connected through relationships has proven to be of immense value in understanding the inner-workings of networks in a variety of different data domains including finance, health care, business, computer science, etc.

These analyses have emerged in the form of Graph Analytics — the analysis of the characteristics in these graph structures through various graph algorithms . Some examples of insights offered by graph analytics include finding clusters of entities closely connected to each-other, calculating optimal paths between entities (the definition of optimal depending on the dataset and use case), understanding the hierarchy of entities within an organization as well as figuring out the impact each entity has inside the network.

Graph structured data is a specialized type of dataset in terms of the way we need to access it; therefore it needs to be stored in ways that complements these access patterns. This has sparked the emergence of a wide variety of specialized graph databases such as Neo4j, OrientDB, Titan etc. These systems are highly optimized specifically for efficient graph storage, traversal, and analysis.

Despite the existence of these graph database management systems however, users do not typically store their data in a graph database unless they are strictly dealing with graph workloads. The reason for this is that while graph structured data may complement graph analysis, it ultimately undermines traditional relational analytics through the use of declarative languages like SQL that have been heavily leveraged by companies and organizations for decades. Users thus typically store their data in Relational Database Management System that backed by many years of research and development, are highly robust and optimized for these generalized analyses and a wide variety of different types of queries.

So how do we get the best of both worlds?

GraphGen (currently under development at the University of Maryland), is a system that enables users to efficiently conduct graph analytics on many different types of graphs that exist inside of relational datasets. The main idea behind GraphGen is to enable users to effortlessly leverage graph analytics on top of their data, without the need to go through a labor-intensive and time-consuming Extract, Transform and Load (ETL) process. It allows users to declaratively express any set of entities and relationships between them that exists in the underlying dataset, and extract it into memory where it can be analyzed — all this through a simple language abstraction! For more information on the project please visit the GraphGen project webpage

In this blog post, we will go through the entire process of writing an extraction query, to extracting and analyzing the extracted graph from within a relational database. GraphGen is natively written in Java and includes many native optimizations for efficient extraction and memory usage, but for the purposes of this post we will use graphgenpy , a Python wrapper over the GraphGen system, that enables quickly extracting and serializing the extracted graph to disk instead. We will demonstrate an end-to-end analysis on the openfootball football.db database, restricted to the 2014 World Cup , from extraction to analysis of a graph between players . You can download the pre-built version of the football.db sqlite database for the 2014 World Cup that we’ll be using in this tutorial from here: worldcup2014.db .

Installing graphgenpy

To install graphgenpy onto your system, simply download and uncompress the packaged zip file linked in the GraphGen webpage , into your workspace, and enter the graphgen-pkg folder.

If you’re using a virtual environment ( virtualenv ) — which we highly recommend — then you can simply

python setup.py install

Note : You may need to use sudo or Administrator privileges if you’re installing to your global site-packges.

Using graphgenpy Without Installation

If you’d prefer to try out graphgenpy in your local workspace without having to install it simply first install the requirements using pip

pip install -r requirements.txt

and then export your PYTHONPATH to include the graphgenpy folder in your local workspace

export PYTHONPATH=$PYTHONPATH:{full path of downloaded graphgenpy folder}

After that you can immediately import graphgenpy to begin exploring graphs in your relational datasets.

Extracting a World Cup 2014 Graph

Relational datasets, as their name suggests, often include a wide variety of underlying entities and relationships between them. One of the biggest strong points of GraphGen is that it enables users to explore any type of graph that may exist within the dataset, be it graphs with different types of entities (nodes), or different ways in which they are connected to each-other (edges).

For the sake of this example, we will extract a graph of players where there is an edge between them if they’ve played against each-other.

First, we need to inspect the database schema , in order to find which relations (tables) we need to utilize in order to extract this information. A subset of the schema can be seen below

Graph Analytics over Relational Datasets

As we can see here, the tables that hold the information about the ways we want the player nodes to be connected in this graph are included in the persons , rosters and games tables. The persons table holds the data for each player, the rosters table tells us which team each player is a part of, and the games table holds all the data about each game that occurred; including which teams played against each-other.

GraphGen uses a custom written declarative language (that gets translated to SQL queries under the hood). The query that extracts the graph we want in this language will look like this:

Nodes(id, name) :- persons(id,_,name). Edges(id1, id2) :- rosters(_,id1,team1),rosters(_,id2,team2), games(_,_,_,_,_,team1,team2).

The intuition behind the above query is that we are declaring how we want the nodes and edges in this graph to be defined in terms of the tables in this database. Here, we are declaring that we want the nodes to have two properties, an id and a name , which should be taken from the persons table. We are also stating that we want node with id1 and node with id2 , to be connected to each other, if there is a game in the games table, where id1’s team has played against id2’s team. For a more in-depth analysis on the language and on writing queries for GraphGen, please refer to the GraphGen Project webpage .

The python code that allows us to do this is quite simple

from graphgenpy import GraphGenerator  # extraction query string extractionQuery = """ Nodes(id, name) :- persons(id,_,name). Edges(id1, id2) :- rosters(_,id1,team1),rosters(_,id2,team2), games(_,_,_,_,_,team1,team2). """ # Absolute path to the db file or dbname (if using PostgreSQL) dbpath = ".../wolrdcup2014.db" # The name of the output graph file graph_name = "extracted_graph_football"  # Credentials for connecting to the database # PostgreSQL Usage: GraphGenerator(db_name, host, port, username, password). # SQLite Usage: GraphGenerator(db_path) gg = GraphGenerator(dbpath)  # Evaluate graph extraction query and serialize the resulting graph to disk. Returns the file path of the extracted graph. # Currently supported graph formats: GML and JSON fpath = gg.generateGraph(extractionQuery,graph_name,GraphGenerator.GML)

Note 1: By definition, this query will include all individuals in the "persons" table in the database, even if they did not participate in any of the games in the games table.

Note 2: Currently, the language is limited to only extracting undirected graphs with no edge properties.

We have now extracted the graph we wanted in a matter of seconds, and serialized it onto disk in a standard graph format — all without the need to write a single SQL query or script for wrangling the data into a graph format.

Most importantly however, the capabilities of the above language, allow us to essentially view this relational dataset from any angle, and explore all the types of graphs, that we may think of, and find interesting within.

Now we can go ahead and conduct analysis on the extracted graph.

Analyzing the Graph with networkx

As mentioned before, there exist a wide diversity of graph databases, and graph analytics engines that are optimized for graph traversal and analysis, and we are now in the position to import our extracted graph and utilize any of them. For the purposes of this post, we will use a widely adopted graph analysis library written and distributed in Python, known as NetworkX .

NetworkX is a very rich library that implements a wide array of graph algorithms for many different types of analyses including clustering, communities, centrality, distance measures and many more. It also parses several different types of standard graph formats; for our purposes we will use the GML format as we extracted in the previous section’s code.

Here’s how we import the graph into a networkx Graph object:

import networkx as nx  # 'id' here is the node property we want to parse in as the identifier in the networkx Graph structure. G = nx.read_gml(fpath,'id')

Finding major players (PageRank)

One classic type of analysis on any graph is to identify the key or most "important" entities in the graph. To do this, we will run the PageRank algorithm on top of the graph. This will assign a score to each node based on the structure of the incoming edges– we can then find the node with the highest PageRank score.

# run PageRank on the graph. algorithm returns a dictionary with the scores d = nx.pagerank(G) # find max score max_score = 0 max_node = None for node,score in d.iteritems():     if score > max_score:         max_score = score         max_node = node print 'Max Pagerank player is {}:{}, Score:{}'.format(max_node,G.node[max_node]['name'],max_score)

Betweenness Centrality

Another interesting aspect that we may want to analyze is finding the key intermediary nodes in terms of how information may flow inside the graph, commonly known as Betweenness Centrality . Vertices with high betweenness centrality, means that they have a large influence in the connectivity of their neighbors with the other nodes in the graph. The code here is identical to the way we ran PageRank

d = nx.betweenness_centrality(G) max_score = 0 max_node = None for node,score in d.iteritems():     if score > max_score:         max_score = score         max_node = node print 'Max betweenness_centrality player is {}:{}, Score:{}'.format(max_node,G.node[max_node]['name'],max_score)

Finding Triangles

Another common analysis is how many "triangles" does each node in the graph participate in if any. A triangle is a subgraph that includes exactly 3 nodes connected via exactly 3 edges (or a complete subgraph with 3 nodes). Triangle counting is used in a wide variety of graph mining and analysis algorithms, and can be done using networkx .

# Count all the triangles each node in the graph is a part of print nx.triangles(G)

It’s interesting to see that in this specific graph, there actually isn’t a single triangle; every node in the graph is part of exactly 0 triangles.

Conclusion

Analysis of graph structured data has proven its worth time and again, being able to provide invaluable insights about the relationships between entities, as well as enable optimizations over a network of interconnected objects. Graph analytics however is but one direction in which we’d like to leverage the information in our datasets, and typically do not want to center our entire data collection workflow around graph analysis. A majority of users and businesses rely on the robust and mature features and analytical capabilities provided by relational databases. GraphGen brings the best for both worlds, by enabling effortless extraction of many different types of graphs that exist inside of a relational datasets, thus opening graph analytics up to any dataset that exists inside a SQL-powered database engine.

转载本站任何文章请注明:转载至神刀安全网,谢谢神刀安全网 » Graph Analytics over Relational Datasets

分享到:更多 ()

评论 抢沙发

  • 昵称 (必填)
  • 邮箱 (必填)
  • 网址
分享按钮