In this tutorial, I’m going to introduce you to two of my favorite packages for working with and visualizing networks - tidygraph and ggraph, both developed by Thomas Lin Pederson.
These packages take igraph
networks, and then use tools
from the tidyverse
to make it easier to manipulate and
visualize them. An igraph
network is a complicated object.
tidygraph
extends the tidy
paradigm to
networks by representing networks as two tables—a table of nodes and
node attributes and a table of edges and edge attributes.
We’ll load all the packages we need
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(tidygraph)
##
## Attaching package: 'tidygraph'
## The following object is masked from 'package:igraph':
##
## groups
## The following object is masked from 'package:stats':
##
## filter
library(ggraph)
## Loading required package: ggplot2
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::%--%() masks igraph::%--%()
## ✖ dplyr::as_data_frame() masks tibble::as_data_frame(), igraph::as_data_frame()
## ✖ purrr::compose() masks igraph::compose()
## ✖ tidyr::crossing() masks igraph::crossing()
## ✖ dplyr::filter() masks tidygraph::filter(), stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::simplify() masks igraph::simplify()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
set_graph_style() # This sets the default style to the graph style
tidygraph
networkThis tutorial assumes that you know how to create an
igraph
network. Once you’ve got an igraph network object,
convert it to a tidygraph
network with
as_tbl_graph()
, like so:
G <- erdos.renyi.game(50, .4)
G <- as_tbl_graph(G)
We can then look at the tidygraph
object, and see the
two dataframes.
G
## # A tbl_graph: 50 nodes and 508 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 50 × 0 (active)
## #
## # Edge Data: 508 × 2
## from to
## <int> <int>
## 1 1 3
## 2 1 4
## 3 1 7
## # ℹ 505 more rows
Because a network is really composed of two tibbles, we can perform
many tidyverse
/dplyr
operations on them. In
order to know which table to use, we have to use
activate(nodes)
or activate(edges)
.
For example, the code below activates the nodes table and then uses
mutate
to create a variable called degree
.
(Note that the code throughout this tutorial uses “pipes”. Pipes
(|>
) let you express a sequence of operations, by taking
the output of the previous operation and using it as the input of the
next operation.)
create_notable('zachary') |>
activate(nodes) |>
mutate(degree = centrality_degree())
## # A tbl_graph: 34 nodes and 78 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 34 × 1 (active)
## degree
## <dbl>
## 1 16
## 2 9
## 3 10
## 4 6
## 5 3
## 6 4
## 7 4
## 8 4
## 9 5
## 10 2
## # ℹ 24 more rows
## #
## # Edge Data: 78 × 2
## from to
## <int> <int>
## 1 1 2
## 2 1 3
## 3 1 4
## # ℹ 75 more rows
Because the networks are just stored as data frames, that means that
we can export them as tibbles and then do things like use
ggplot
to graph attributes of a network. This code below
creates an edge attribute called bw
which is a measure of
edge betweenness, and then makes a histogram of the distribution of
bw
.
create_notable('zachary') |>
activate(edges) |>
mutate(bw = centrality_edge_betweenness()) |>
as_tibble() |>
ggplot() +
geom_histogram(aes(x=bw)) +
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The companion package to tidygraph
is ggraph.
ggraph
is a set of tools based on ggplot2
. The
key idea behind both ggraph
and ggplot2
is
that you can build a plot by adding layers according to a “grammar of
graphics” that let you add to and change things about the plot.
ggraph
includes tons of really cool types of plots but
for this tutorial I am going to focus on standard plots that show nodes
as circles and edges as lines. There are three key components that
should be part of any of these plots:
There are a lots of different “geoms” for displaying nodes and edges
(full
list here). We are going to focus on using the simplest -
geom_node_point()
and geom_edge_fan()
.
The primary way to gain understanding or make an argument through network plots is through changing the color, size, etc. of nodes and edges.
If you want to change things based on a value that changes, then you
need to put it in a “mapping”. This is the first “argument” to the node
or edge geom, and appears within aes()
. Aesthetics that
apply to all of the nodes or edges appear outside of the mapping.
For example, in this graph the geom_edge_fan
has
color
and width
set to .2
and
'lightblue'
, respectively. These apply to all of the
edges.
On the other hand, the geom_node_point
has
color
set to group
. This means that the color
should vary based on what the group
variable is set to for
each node.
create_notable('zachary') |>
activate(nodes) |>
mutate(group = as.factor(group_infomap())) |> # Creates a `group` variable based on the infomap algorithm
ggraph(layout = 'stress') +
geom_edge_fan(width = .2, color = 'lightblue') +
geom_node_point(aes(color = group)) +
coord_fixed() +
theme_graph()
Often, we want to color things based on variables that already exist in our data. For these examples, let’s move to a new dataset. The following code loads in data from a Dutch school collected by Andrea Knecht and described here. I have cleaned it up a bit, using just Wave 2 from the data and changed it into CSV files - one for the nodes and one for the edges.
This code downloads these CSV files and creates a network from them
called G
. If we look at the node data, we can see that
there are a lot of attributes about each student that we might want to
visualize in a plot.
nodes = read_csv('https://raw.githubusercontent.com/jdfoote/Communication-and-Social-Networks/spring-2021/resources/school_graph_nodes.csv')
edges = read_csv('https://raw.githubusercontent.com/jdfoote/Communication-and-Social-Networks/spring-2021/resources/school_graph_edges.csv')
G = graph_from_data_frame(d=edges, v = nodes) |> as_tbl_graph()
G
## # A tbl_graph: 26 nodes and 203 edges
## #
## # A directed multigraph with 1 component
## #
## # Node Data: 26 × 7 (active)
## name delinquency alcohol_use sex age ethnicity religion
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 2 4 F 12 1 2
## 2 2 NA 2 F 12 1 2
## 3 3 2 1 F 12 2 3
## 4 4 2 1 M 12 1 2
## 5 5 1 1 M 12 1 2
## 6 6 1 1 F 12 1 NA
## 7 7 2 3 F 12 1 2
## 8 8 1 1 F 13 1 2
## 9 9 2 3 F 12 1 2
## 10 10 2 2 F 12 1 1
## # ℹ 16 more rows
## #
## # Edge Data: 203 × 3
## from to type
## <int> <int> <chr>
## 1 1 3 friendship
## 2 1 12 friendship
## 3 3 1 friendship
## # ℹ 200 more rows
For example, we may want to visualize alcohol use. This is how you
would change the color of nodes based on alcohol use. The
scale_color_viridis()
at the bottom changes from the
default color scale to the viridis
pallette which is prettier and easier to read.
G |>
ggraph(layout = 'stress') +
geom_edge_fan(width = .5, color = 'gray') +
geom_node_point(aes(color=alcohol_use), size = 3) +
scale_color_viridis()
You may notice that there are as many as four edges between nodes. This is because this is actually a directed graph with two different types of relationships - whether two people went to primary school together and whether they are friends. We can enhance the graph by adding color based on the type of relationship.
G |>
ggraph(layout = 'stress') +
geom_edge_fan(aes(color=type),width = .5) +
geom_node_point(aes(color=alcohol_use), size = 3) +
scale_color_viridis()
If you want to choose your own colors, you can do that with
scale_color_manual
for nodes or
scale_edge_color_manual
for edges. You will need to create
what’s called a “named vector” where you assign a color to each of the
possible values for the measure.
You can either use a name for a color or if you want a specific color
you can use a hexadecimal color value. In this example I mix a named
color (lightgray
) with a hex code for Purdue Gold
('#ceb888'
), and assign them to friendship
and
primary_school
, respectively.
Notice that this also includes scale_color_viridis
,
which changes the color scale used for the nodes. In general, to change
something about the edge colors use scale_edge_color...
,
and for the nodes, it’s just scale_color...
.
You can learn a lot more about color, color palettes, etc. here
G |>
ggraph(layout = 'stress') +
geom_edge_fan(aes(color=type),width = .5) +
geom_node_point(aes(color=alcohol_use), size = 3) +
scale_color_viridis() +
scale_edge_color_manual(values = c('friendship' = '#ceb888', 'primary_school' = 'lightgray'))
The other common approach to highlighting aspects of a graph is to change the size of nodes or the width of edges. Let’s change our previous graph so that now the size of the nodes is based on delinquency. (Note that there is some missing data with delinquency so the node in the bottom right just disappears. We’ll talk in a moment about how to deal with that.)
G |>
ggraph(layout = 'stress') +
geom_edge_fan(aes(color=type), width = .5) +
geom_node_point(aes(color=alcohol_use, size = delinquency)) +
scale_color_viridis() +
scale_edge_color_manual(values = c('friendship' = '#ceb888', 'primary_school' = 'gray'))
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
There are lots of layout options included with ggraph. You can let
ggraph pick for you or you can look through some here and here.
tidygraph
has lots of neat, non-sociometric layouts like
dendograms and treemaps, but I’m going to focus on more traditional
layouts.
Layouts are defined inside the ggraph()
function, which
has to be called before making any plot. The code below makes a tidy
graph of the Zachary karate network using
create_notable('zachary')
. It then sets 'kk'
as the layout, and adds a layer for the nodes and a layer for the
edges.
create_notable('zachary') |>
ggraph(layout = 'kk') +
geom_node_point() +
geom_edge_fan()
The following few plots show how changes the layout can really change the look and the interpretation of the plot. These all show the same data.
create_notable('zachary') |>
ggraph(layout = 'circle') +
geom_node_point() +
geom_edge_fan()
In this one, we get very fancy. This shows an ego network for node
3
. We will get to what some of the different commands in
this code mean but like a good programmer what I really did was just
find an
example that was sort of like what I wanted to do, and played around
with adding and changing things until I made it look like what I
wanted.
create_notable('zachary') |>
mutate(d = distances(.G(), to=3)) |>
ggraph(layout = 'focus', focus = 3) +
geom_edge_fan() +
ggforce::geom_circle(aes(x0 = 0, y0 = 0, r = r), data.frame(r = 1:3), colour = 'grey') +
geom_node_point(aes(color = as.factor(d)), size = 3) +
coord_fixed() +
scale_color_viridis_d() +
labs(color='Distance from Node 3')
The one thing that I do not like about ggraph
is that it
is not easy to make directed graphs that show arrows.
Here is the most basic way I can figure out to do arrows that look ok.
You need to add an arrow
parameter with an
arrow()
function that gives the length and width of the
arrows, and then start_cap
and end_cap
sizes
that say how far away from the node center to start and end the arrows,
which you just have to play around with until it looks how you want it
to. It isn’t great.
G |>
ggraph(layout = 'stress') +
geom_edge_fan(arrow = arrow(length = unit(3, 'mm')),
end_cap = circle(1,'mm'),
start_cap = circle(1, 'mm')
) +
geom_node_point()
The better way to show direction is by adjusting the
alpha
of edges.
geom_edge_fan(aes(alpha = stat(index)))
will draw a
gradient from the start node to the end node. If you want a leged, you
need to make sure to include the final line as a layer -
scale_edge_alpha('Edge direction', guide = 'edge_direction')
G |>
ggraph(layout='stress') +
geom_edge_fan(aes(alpha = stat(index))) + # This does gradients for directed edges
geom_node_point(color = '#ceb888', size = 5) +
scale_edge_alpha('Edge direction', guide = 'edge_direction') # Adds the "Edge direction" legend
## Warning: `stat(index)` was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(index)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Finally, one strategy for visualization that ggraph
makes really easy is the ability to “facet” the data. This basically
means splitting apart the data based on some measure and visualizing
each part of the data with that measure. For example, this code facets
the graph into two subgraphs - one for females and one for males (Note
that this is just one more layer - we can keep all of the crazy fancy
visualizations we already added):
G |>
ggraph(layout = 'stress') +
geom_edge_fan(aes(color=type), width = .5) +
geom_node_point(aes(color=alcohol_use, size = delinquency)) +
scale_color_viridis() +
scale_edge_color_manual(values = c('friendship' = '#ceb888', 'primary_school' = 'gray')) +
facet_nodes(~sex)
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
Or if you wanted to facet based on the edge type:
G |>
ggraph(layout = 'stress') +
geom_edge_fan(aes(color=type), width = .5) +
geom_node_point(aes(color=alcohol_use, size = delinquency)) +
scale_color_viridis() +
scale_edge_color_manual(values = c('friendship' = '#ceb888', 'primary_school' = 'gray')) +
facet_edges(~type)
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
Facets use a strange new syntax. The tilde (~
) can be
read as “by”, and facets can have two dimensions. Typically you will
just want one, then you just put the tilde before the measure (like we
did with facet_edges(~type)
above). However, you can also
facet across multiple dimensions, in which case you put one measure on
either side of the tilde. For example, it’s not a very useful plot but
this plot facets by sex and ethnicity:
G |>
ggraph(layout = 'stress') +
geom_edge_fan(aes(color=type), width = .5) +
geom_node_point(aes(color=alcohol_use, size = delinquency)) +
scale_color_viridis() +
scale_edge_color_manual(values = c('friendship' = '#ceb888', 'primary_school' = 'gray')) +
facet_nodes(ethnicity~sex)
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
tidygraph
gives you a few other important tools. Often,
the network data you have won’t be exactly what you want to visualize.
You will often want to filter either the edges or the nodes. For
example, in this data we may want to just look at friendships. We can do
that with filter
.
First, we need to “activate” the edges dataframe, then we filter it.
For numeric measures, you can filter using less than, greater than, etc.
(<, >, >=, <=
). There are complicated things
you can do with text measures but we’ll focus on the two simplest.
==
filters only to those items where the value is an exact
match and !=
filters only to those where it isn’t. So, the
code below keeps only those edges where the type is equal to
'friendship'
.
G |>
activate(edges) |> # Remember to activate whatever you want to filter
filter(type == 'friendship') |>
ggraph(layout = 'stress') +
geom_edge_fan(color = 'gray', width = .5) +
geom_node_point(aes(color=alcohol_use), size = 3) +
scale_color_viridis()
One of the other powerful tools from tidygraph
is the
ability to create new measures. This has the odd name of
mutate
, but it really just means to create a new
variable.
One way of using this is to create network measures and apply them to nodes.
In this example I color nodes by “coreness”, which is one of my favorite measures of how integrated someone is in a network. If your coreness score is 3, that means you are part of a group of at least 3 nodes where all of them are also connected to at least 3 others.
There are lots and lots of different measures that you can calculate - you can learn about them here
G |>
activate(edges) |> # Remember to activate whatever you want to filter
filter(type == 'friendship') |>
activate(nodes) |>
mutate(coreness = node_coreness(mode = 'all')) |>
ggraph(layout = 'stress') +
geom_edge_fan(color = 'gray', width = .5) +
geom_node_point(aes(color=coreness), size = 3) +
scale_color_viridis()
This measure does a really good job of showing just where the core network is, and who is (and isn’t) part of it.
There are similar measures for edges, and we can mix and match filter and mutate. For example, this code filters to just the edges which are mutual (both people marked the other as a friend), and then creates the coreness measure for coloring the nodes.
G |>
activate(edges) |> # Remember to activate whatever you want to filter
filter(type == 'friendship') |>
filter(edge_is_mutual()) |>
activate(nodes) |>
mutate(coreness = node_coreness(mode = 'all')) |>
as.undirected() |> # This line and the following line collapse the edges into one edge
simplify() |>
ggraph(layout = 'stress') +
geom_edge_fan(color = 'gray', width = .5) +
geom_node_point(aes(color=coreness), size = 3) +
scale_color_viridis()
One visualization that I really like colors the edges based on how similar two nodes are. For example, we may want to show whether people have similar drinking behavior to their friends.
So, we use mutate
to create a new edge variable based on
the absolute distance between drinking behavior for the two nodes on
either side of the edge.
G |>
activate(edges) |>
filter(type == 'friendship') |>
mutate(drinking_diff = abs(.N()$alcohol_use[from] - .N()$alcohol_use[to])) |>
ggraph(layout = 'stress') +
geom_edge_fan(aes(color = drinking_diff), width = .5) +
geom_node_point(aes(color=alcohol_use), size = 3) +
scale_color_viridis() +
scale_edge_color_viridis()
## Warning: `guide_colourbar()` cannot be used for edge_colour.
## ℹ Use one of colour, color, or fill instead.
This is tricky to understand.
Basically, we need to modify the edge variable based on something
about the nodes. However, when we activate the edge table we don’t have
information about the nodes. .N()
lets us access the nodes
table from the edge table, and from
and to
refer to the node that an edge is coming from and the node it is going
to, respectively.
So, .N()$alcohol_use
refers to the
alcohol_use
column in the nodes table, and
.N()$alcohol_use[from]
refers to the value of
alcohol_use
for the from node.
Putting it all together,
abs(.N()$alcohol_use[from] - .N()$alcohol_use[to])
subtracts the alcohol_use
of the “to” node from that of the
“from” node, and takes the absolute value. It saves that as
drinking_diff
and then colors edges based on that.
In this tutorial, I have built every visualization starting with a network object. However, you don’t have to do that. You can create a plot and iteratively add to it.
You do this by saving a plot as a variable.
For example, this saves a plot with just edges to the variable
p
p <- G |>
activate(edges) |>
filter(type == 'friendship') |>
ggraph(layout = 'stress') +
geom_edge_fan(color = 'gray', width = .5)
p
And now we can add in the nodes on this plot.
p +
geom_node_point(aes(color=alcohol_use), size = 3) +
scale_color_viridis()
Note that all we can do is add layers - we can’t take away or alter layers that have already been added to an existing plot.
We can do a similar thing by saving a filtered or mutated version of a network. For example, we keep filtering to just the friendship edges. Let’s just save a new version of that graph that is already filtered.
friend_G <- G |>
activate(edges) |>
filter(type == 'friendship')
One thing that’s worth keeping in mind is that later layers will be drawn over the top of earlier layers. In practice, this means that edge layers should typically be drawn first.
Edges drawn last :(
friend_G |>
ggraph(layout = 'stress') +
geom_node_point(aes(color=alcohol_use), size = 3) +
geom_edge_fan(color = 'gray', width = .5) +
scale_color_viridis()
Edges drawn first =)
friend_G |>
ggraph(layout = 'stress') +
geom_edge_fan(color = 'gray', width = .5) +
geom_node_point(aes(color=alcohol_use), size = 3) +
scale_color_viridis()
By default, legends will be given the name of the variable that they
represent. Sometimes this is fine, but you will often want to change the
legend titles. This is confusing but the most consistent way to do this
is with the name
parameter in the scale...
commands.
You can also change the title with the labs
command.
For example, we take our super complicated plot above and add some labels for each of the legends, plus a title.
G |>
ggraph(layout = 'stress') +
geom_edge_fan(aes(color=type), width = .5) +
geom_node_point(aes(color=alcohol_use, size = delinquency)) +
scale_color_viridis(name = 'Alcohol Use') +
scale_edge_color_manual(values = c('friendship' = '#ceb888', 'primary_school' = 'lightgray'), name = 'Edge Type') +
scale_size(name = 'Delinquency') +
labs(title = 'The relationship between delinquency and alcohol use')
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
NA
represents missing data in R. In network data, this
is typically node attributes that we don’t know. For example, maybe one
of the people didn’t fill out the survey so we don’t have demographic
information about them.
Typically, we will still want to display that person in a plot. If we
are using color, then this works OK - if you look in the plot above you
will notice some gray nodes - these are nodes where
alcohol_use
data was NA
.
However, if we resize nodes based on alcohol_use
we get
an error. ggraph
doesn’t have a default to fall back on
when size is missing, so it gives us a warning and makes them size
0.
friend_G |>
ggraph(layout = 'stress') +
geom_edge_fan(color = 'gray', width = .5) +
geom_node_point(aes(size=alcohol_use))
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
There isn’t a clear way to fix this problem. One option is to filter out the missing data. Another is to just choose a size for the missing data. Either of these have the danger of being misleading.
Filtering out nodes with missing alcohol_use
friend_G |>
activate(nodes) |>
filter(!is.na(alcohol_use)) |>
ggraph(layout = 'stress') +
geom_edge_fan(color = 'gray', width = .5) +
geom_node_point(aes(size=alcohol_use))
Adding size “1” for all of the missing data
friend_G |>
activate(nodes) |>
mutate(alcohol_use = replace_na(alcohol_use, 1)) |>
ggraph(layout = 'stress') +
geom_edge_fan(color = 'gray', width = .5) +
geom_node_point(aes(size=alcohol_use))
Finally, when we filter out edges we may want to keep the nodes in
the same place. We can do that by saving the original position of the
nodes and then passing that directly as the layout to
ggraph()
like so:
l = create_layout(G, 'stress') |> select(x,y) |> as.matrix()
G |>
activate(edges) |>
ggraph(layout = l) +
geom_edge_fan() +
geom_node_point()
G |>
activate(edges) |>
filter(type == 'friendship') |>
ggraph(layout = l) +
geom_edge_fan() +
geom_node_point()
It gets trickier if you are filtering out nodes, but it is possible. See here for an example.
Find a mistake or a typo? Suggestions for other things to cover? Send me an email at [email protected] or a pull request at https://github.com/jdfoote/Communication-and-Social-Networks/blob/spring-2021/week_6/ggraph_walkthrough.Rmd