In this tutorial, I’m going to introduce you to two of my favorite packages for working with and visualizing networks - tidygraph and ggraph, both developed by Thomas Lin Pedersen.
These packages take igraph networks, and then use tools from the tidyverse to make it easier to manipulate and visualize them. An igraph network is a complicated object. tidygraph extends the tidy paradigm to networks by representing networks as two tables: a table of nodes and node attributes and a table of edges and edge attributes.
We’ll load all the packages we need:
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(tidygraph)
##
## Attaching package: 'tidygraph'
## The following object is masked from 'package:igraph':
##
## groups
## The following object is masked from 'package:stats':
##
## filter
library(ggraph)
## Loading required package: ggplot2
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::%--%() masks igraph::%--%()
## ✖ dplyr::as_data_frame() masks tibble::as_data_frame(), igraph::as_data_frame()
## ✖ purrr::compose() masks igraph::compose()
## ✖ tidyr::crossing() masks igraph::crossing()
## ✖ dplyr::filter() masks tidygraph::filter(), stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::simplify() masks igraph::simplify()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
set_graph_style() # This sets the default style to the graph style
tidygraph network

This tutorial assumes that you know how to create an igraph network. Once you’ve got an igraph network object, convert it to a tidygraph network with as_tbl_graph(), like so:
G <- erdos.renyi.game(50, .4)
G <- as_tbl_graph(G)
We can then look at the tidygraph object, and see the two dataframes.
G
## # A tbl_graph: 50 nodes and 508 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 50 × 0 (active)
## #
## # Edge Data: 508 × 2
## from to
## <int> <int>
## 1 1 3
## 2 1 4
## 3 1 7
## # ℹ 505 more rows
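To make the two-table idea concrete, here is a minimal sketch that pulls each table out as an ordinary tibble. The small random graph and the names g, node_tbl, and edge_tbl are my own illustrative choices, not part of the example above.

```r
library(tidygraph)

# A small random graph (assumption: any igraph object would work here)
g <- as_tbl_graph(igraph::sample_gnp(10, 0.4))

# Pull each table out as an ordinary tibble
node_tbl <- as_tibble(g, active = "nodes")
edge_tbl <- as_tibble(g, active = "edges")

dim(node_tbl)    # one row per node; no attribute columns yet
names(edge_tbl)  # the endpoints of each edge: "from" and "to"
```

The node table starts out with no columns because a random graph has no node attributes until we add some.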
Because a network is really composed of two tibbles, we can perform many tidyverse/dplyr operations on them. To tell tidygraph which table an operation should apply to, we use activate(nodes) or activate(edges).
For example, the code below activates the nodes table and then uses mutate to create a variable called degree.
(Note that the code throughout this tutorial uses “pipes”. Pipes (|>) let you express a sequence of operations, by taking the output of the previous operation and using it as the input of the next operation.)
create_notable('zachary') |>
  activate(nodes) |>
  mutate(degree = centrality_degree())
## # A tbl_graph: 34 nodes and 78 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 34 × 1 (active)
## degree
## <dbl>
## 1 16
## 2 9
## 3 10
## 4 6
## 5 3
## 6 4
## 7 4
## 8 4
## 9 5
## 10 2
## # ℹ 24 more rows
## #
## # Edge Data: 78 × 2
## from to
## <int> <int>
## 1 1 2
## 2 1 3
## 3 1 4
## # ℹ 75 more rows
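Other dplyr verbs work on the active table in the same way. As a rough sketch, the code below keeps only the nodes with a degree of at least 10; the cutoff and the name hubs are arbitrary choices for illustration.

```r
library(tidygraph)

hubs <- create_notable("zachary") |>
  activate(nodes) |>
  mutate(degree = centrality_degree()) |>
  filter(degree >= 10)  # dropping a node also drops the edges attached to it

hubs
```

Note that the degree column still holds the values computed on the full network, because mutate ran before filter.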
Because the networks are stored as data frames, we can export them as tibbles and then do things like use ggplot to graph attributes of a network. The code below creates an edge attribute called bw, which is a measure of edge betweenness, and then makes a histogram of the distribution of bw.
create_notable('zachary') |>
  activate(edges) |>
  mutate(bw = centrality_edge_betweenness()) |>
  as_tibble() |>
  ggplot() +
  geom_histogram(aes(x = bw)) +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
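Once the edges have been exported with as_tibble(), any ordinary data frame operation applies, not just plotting. A small sketch, assuming we want the five edges with the highest betweenness (the name top_edges is mine):

```r
library(tidygraph)
library(dplyr)

top_edges <- create_notable("zachary") |>
  activate(edges) |>
  mutate(bw = centrality_edge_betweenness()) |>
  as_tibble() |>        # from here on this is a plain tibble, not a network
  arrange(desc(bw)) |>  # highest betweenness first
  head(5)

top_edges
```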
The companion package to tidygraph is ggraph. ggraph is a set of tools based on ggplot2. The key idea behind both ggraph and ggplot2 is that you can build a plot by adding layers according to a “grammar of graphics” that lets you add to and change things about the plot.
ggraph includes tons of really cool types of plots, but for this tutorial I am going to focus on standard plots that show nodes as circles and edges as lines. Each of the plots below is built from three key components: a call to ggraph(), which sets the layout; an edge geom; and a node geom.
There are lots of different “geoms” for displaying nodes and edges (full list here). We are going to focus on using the simplest - geom_node_point() and geom_edge_fan().
The primary way to gain understanding or make an argument through network plots is through changing the color, size, etc. of nodes and edges.
If you want an aesthetic to vary with the value of a variable, then you need to put it in a “mapping”. This is the first “argument” to the node or edge geom, and appears within aes(). Aesthetics that apply to all of the nodes or edges appear outside of the mapping.
For example, in this graph the geom_edge_fan has width and color set to .2 and 'lightblue', respectively. These apply to all of the edges.
On the other hand, the geom_node_point has color set to group inside aes(). This means that the color should vary based on what the group variable is set to for each node.
create_notable('zachary') |>
  activate(nodes) |>
  mutate(group = as.factor(group_infomap())) |> # Creates a `group` variable based on the infomap algorithm
  ggraph(layout = 'stress') +
  geom_edge_fan(width = .2, color = 'lightblue') +
  geom_node_point(aes(color = group)) +
  coord_fixed() +
  theme_graph()
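The same split between mapped and fixed aesthetics works for other properties too. Here is a sketch that maps node size to degree while keeping one fixed node color; the color 'steelblue' and the name p are arbitrary choices of mine.

```r
library(tidygraph)
library(ggraph)

p <- create_notable("zachary") |>
  activate(nodes) |>
  mutate(degree = centrality_degree()) |>
  ggraph(layout = "stress") +
  # Outside aes(): every edge gets the same width and color
  geom_edge_fan(width = 0.2, color = "lightblue") +
  # Inside aes(): size varies by node; outside: one color for all nodes
  geom_node_point(aes(size = degree), color = "steelblue") +
  coord_fixed()

p
```

Moving color = 'steelblue' inside aes() would be a mistake here: ggraph would treat it as a variable with a single value rather than as a color.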
Often, we want to color things based on variables that already exist in our data. For these examples, let’s move to a new dataset. The following code loads in data from a Dutch school collected by Andrea Knecht and described here. I have cleaned it up a bit, keeping just Wave 2 of the data and converting it into CSV files - one for the nodes and one for the edges.
This code downloads these CSV files and creates a network from them called G. If we look at the node data, we can see that there are a lot of attributes about each student that we might want to visualize in a plot.
nodes <- read_csv('https://raw.githubusercontent.com/jdfoote/Communication-and-Social-Networks/spring-2021/resources/school_graph_nodes.csv')
edges <- read_csv('https://raw.githubusercontent.com/jdfoote/Communication-and-Social-Networks/spring-2021/resources/school_graph_edges.csv')
G <- graph_from_data_frame(d = edges, vertices = nodes) |> as_tbl_graph()
G
## # A tbl_graph: 26 nodes and 203 edges
## #
## # A directed multigraph with 1 component
## #
## # Node Data: 26 × 7 (active)
## name delinquency alcohol_use sex age ethnicity religion
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 2 4 F 12 1 2
## 2 2 NA 2 F 12 1 2
## 3 3 2 1 F 12 2 3
## 4 4 2 1 M 12 1 2
## 5 5 1 1 M 12 1 2
## 6 6 1 1 F 12 1 NA
## 7 7 2 3 F 12 1 2
## 8 8 1 1 F 13 1 2
## 9 9 2 3 F 12 1 2
## 10 10 2 2 F 12 1 1
## # ℹ 16 more rows
## #
## # Edge Data: 203 × 3
## from to type
## <int> <int> <chr>
## 1 1 3 friendship
## 2 1 12 friendship
## 3 3 1 friendship
## # ℹ 200 more rows
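If your nodes and edges are already in data frames, tidygraph can also build the network directly with tbl_graph(), skipping the igraph step. The sketch below uses made-up toy data (toy_nodes, toy_edges, and the 'advice' edge type are all hypothetical) rather than the school files.

```r
library(tidygraph)

# Hypothetical toy data standing in for the node and edge CSV files
toy_nodes <- data.frame(
  name = c("a", "b", "c", "d"),
  sex  = c("F", "M", "F", "M")
)
toy_edges <- data.frame(
  from = c("a", "a", "b", "c"),
  to   = c("b", "c", "d", "d"),
  type = c("friendship", "friendship", "advice", "friendship")
)

# from/to are matched against the `name` column of the node table
toy_G <- tbl_graph(nodes = toy_nodes, edges = toy_edges, directed = TRUE)
toy_G
```

Either route ends with the same kind of object, so everything else in this tutorial applies unchanged.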
For example, we may want to visualize alcohol use. This is how you would change the color of nodes based on alcohol use. The scale_color_viridis() at the bottom changes from the default color scale to the viridis palette, which is prettier and easier to read.
G |>
  ggraph(layout = 'stress') +
  geom_edge_fan(width = .5, color = 'gray') +
  geom_node_point(aes(color = alcohol_use), size = 3) +
  scale_color_viridis()
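Some attributes, like ethnicity in the node table above, are stored as numbers but are really categories. One way to handle that, sketched here on a made-up toy network so that it runs without downloading anything, is to wrap the variable in as.factor() inside the mapping. Edge attributes can be mapped the same way. All of the data and names below are hypothetical.

```r
library(tidygraph)
library(ggraph)

# Hypothetical toy network with a numeric category code on the nodes
toy <- tbl_graph(
  nodes = data.frame(name = c("a", "b", "c", "d"),
                     ethnicity = c(1, 1, 2, 2)),
  edges = data.frame(from = c("a", "a", "b", "c"),
                     to   = c("b", "c", "d", "d"),
                     type = c("friendship", "friendship", "advice", "friendship"))
)

p_cat <- toy |>
  ggraph(layout = "stress") +
  geom_edge_fan(aes(color = type)) +                           # edge color mapped to an edge attribute
  geom_node_point(aes(color = as.factor(ethnicity)), size = 3) # discrete colors, not a gradient

p_cat
```

Without as.factor(), the codes would be treated as a continuous variable and shown on a gradient, which suggests an ordering that the categories don't have.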