Let’s try and find out for ourselves just how much Facebook knows about me, given I waived my privacy to them back in 2009. If you download your own data through your profile settings and have some familiarity with R, you should be able to run the same graphs on your own messages. The markdown file is available on my github page, and there’s a code dump as the appendix.
First the data needs to be rvested (hah, good one Hadley) from the downloaded messages.htm file. This initially confused me, but it turns out xpaths make this pretty easy. Cheers then, xpaths.
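As a minimal sketch, assuming the export sits at messages.htm in the working directory (my full paths are in the appendix):

library(XML)
# Parse the downloaded export into an internal document
html = htmlTreeParse("messages.htm", useInternalNodes = TRUE)
# Each conversation lives in a div with class "thread"
threads = xpathSApply(html, '/html/body/div/div/div[@class="thread"]')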
The first thing we can scrape is a table that’s boring to look at, but useful for identifying the participants of each conversation.
Any thread with more than 2 participants is a group thread. The rest of the analysis can be done grouped by thread to keep that separation.
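The appendix builds this with dplyr, but the classification itself is a one-liner; an equivalent data.table sketch, assuming n_people has already been counted per thread:

# Two participants means a private chat; anything more is a group
dt_participants[, convo := ifelse(n_people == 2, "private", "group")]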
I tried a couple of htmlwidgets options to display this table, but they didn’t work instantly (and my god, iframes are ugly). This actually looks fine though, in my opinion.
dt_participants[, .(thread, n_people, convo)] %>%
head(10)
## thread n_people convo
## 1: 1 2 private
## 2: 2 2 private
## 3: 3 2 private
## 4: 4 8 group
## 5: 5 4 group
## 6: 6 10 group
## 7: 7 2 private
## 8: 8 2 private
## 9: 9 8 group
## 10: 10 2 private
It’s then easy to use xpaths to grab the few fields offered to us.
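These are the relative xpaths from the appendix, one per field inside each thread node; the user field is scraped like so, and meta and message work the same way:

xpath_user = './div/div/span[@class="user"]'
xpath_meta = './div/div/span[@class="meta"]'
xpath_msg = './p'
# Pull the sender out of every thread
l_users = lapply(threads, xpathSApply, path = xpath_user, xmlValue)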
And the nice table we’re left with is…
dt %>%
`[`(sample(1:.N, 20), -"message") %>%
`[`(, id := 1:.N) %>%
`[`(, .(thread, year, timestamp, user = strsplit(user, " ")[[1]][1], message = "hidden"), id) %>%
head(10)
## id thread year timestamp user message
## 1: 1 511 2012 2012-12-28 01:36:00 Hanna hidden
## 2: 2 58 2013 2013-08-01 13:36:00 Akhil hidden
## 3: 3 258 2015 2015-09-14 22:03:00 Biyi hidden
## 4: 4 500 2012 2012-06-28 00:14:00 Akhil hidden
## 5: 5 21 2013 2013-03-13 22:26:00 Akhil hidden
## 6: 6 485 2014 2014-11-03 19:37:00 Joe hidden
## 7: 7 557 2016 2016-09-21 22:46:00 Akhil hidden
## 8: 8 258 2013 2013-08-26 02:07:00 Akhil hidden
## 9: 9 514 2015 2015-03-01 12:08:00 Caroline hidden
## 10: 10 357 2011 2011-08-16 20:01:00 Akhil hidden
You may note that I have hidden the message contents, mine included. If anyone wants to offer up anything they actually wrote back in 2009, be my guest.
It’s pretty easy to see where I stopped using MSN messenger.
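That chart is nothing fancy, just message counts per year (the same call as in the appendix):

qplot(year, N, data = dt[, .N, year],
  geom = "col",
  main = "Messages per year", xlab = "Year", ylab = "Count") +
  theme_minimal()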
This bit was extremely annoying as I had to actually do something by hand: aliasing everyone who has changed their name. I also matched the Facebook IDs to the standardised names by hand while I was at it. I did this for the top 100 people I spoke to overall, plus the top 20 people I spoke to per year.
# Top 100 people overall
dt[, .N, user][order(-N)][1:100]
# Top 20 people per year
dt[, .N, .(year, user)][order(-N), head(.SD, 20), year]
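The alias fix itself follows the pattern below, a sketch with shortened file paths (the real ones are in the appendix): dump the messy names to a csv, edit it by hand into names_edit.csv, then join the standardised names back on.

# Messy names out, hand-edited aliases back in
dt_users = dt[, .N, user][order(-N)]
write.csv(dt_users, "names.csv") # edit by hand into names_edit.csv
dt_users = fread("names_edit.csv")
dt = dt_users[, .(user, name, display_name)][dt, on = "user"]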
The people I found most interesting here were those I’d spoken to frequently in only 1 or 2 years. That’s not to say my #dayones aren’t interesting.
More to come! Plans I’m already working on include message/character ratios and a quick ngram analysis (a keeno may have noticed tidytext sitting up there in the library calls).
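As a taste of where that ngram analysis might start, here’s a sketch using tidytext’s unnest_tokens (the bigram column name is just illustrative, and it isn’t run here since the messages stay hidden):

# Split every message into bigrams, ready for counting
dt %>%
  tibble::as_tibble() %>%
  tidytext::unnest_tokens(bigram, message, token = "ngrams", n = 2)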
This is surprisingly time-consuming; even just this much has taken about 6 hours in total already!
The code will be available on github for you to run on your own data. It’s surprisingly revealing…
Stay tuned…
#########################################################
# Setup and helper functions #
#########################################################
libs = c(
"data.table", "XML", "purrr", "magrittr", "tibble", "tidytext", "ggplot2"
)
sink = sapply(libs, purrr::quietly(library), character.only = TRUE)
# A colour palette from https://www.r-bloggers.com/the-paul-tol-21-color-salute/
palette = c(
"#771155", "#AA4488", "#CC99BB", "#114477", "#4477AA", "#77AADD", "#117777",
"#44AAAA", "#77CCCC", "#117744", "#44AA77", "#88CCAA", "#777711", "#AAAA44",
"#DDDD77", "#774411", "#AA7744", "#DDAA77", "#771122", "#AA4455", "#DD7788"
)
# I like it better shuffled as I'm red/green colour deficient
palette = sample(palette, length(palette), replace = FALSE)
# Some random helpers
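# fmt_time matches stamps like "Saturday, May 27, 2017 at 10:30PM"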
fmt_time = "%A, %B %d, %Y at %I:%M%p"
split_people = function(people) strsplit(people, ", ")[[1]]
# The basic table
make_dt = function(i, u, t, m) {
data.table(thread = i, user = u, timestamp = t, message = m)
}
#########################################################
# Scraping initial threads from the HTML #
#########################################################
# Read in the whole messages htm file
html = XML::htmlTreeParse(
"~/blog/blogdown_data/2017-05-27-facebook-message-analysis_files/messages.htm",
useInternalNodes = TRUE
)
# Scrape the threads
threads = XML::xpathSApply(html, '/html/body/div/div/div[@class="thread"]')
# Thread metadata
l_participants = lapply(threads, xpathSApply, 'text()', xmlValue)
# Splits comma-separated lists of people
l_participants = lapply(l_participants, split_people)
# Store these in a nested table for neatness
l_participants = purrr::map2(
seq_along(l_participants), l_participants,
.f = ~tibble(thread = .x, people = list(.y))
)
dt_participants = purrr::reduce(l_participants, dplyr::bind_rows)
dt_participants = dt_participants %>%
dplyr::rowwise() %>%
dplyr::mutate(n_people = length(people)) %>%
dplyr::mutate(convo = ifelse(n_people == 2, "private", "group")) %>%
setDT()
dt_participants[, .(thread, n_people, convo)] %>%
head(10)
#########################################################
# Scraping each message within the thread's context #
# I used a save/load cheat to not have to rerun this #
#########################################################
# xpaths
xpath_user = './div/div/span[@class="user"]'
xpath_meta = './div/div/span[@class="meta"]'
xpath_msg = './p'
# Any field with useful info
l_users = lapply(threads, xpathSApply, path = xpath_user, xmlValue)
l_meta = lapply(threads, xpathSApply, path = xpath_meta, xmlValue)
l_msg = lapply(threads, xpathSApply, path = xpath_msg, xmlValue)
# Bind into table retaining thread info
# Cast the timestamp
dt = pmap(.l = list(seq_along(l_users), l_users, l_meta, l_msg), make_dt)
dt = rbindlist(dt)
dt[, timestamp := as.POSIXct(strptime(timestamp, format = fmt_time))]
# Somehow this first one is wrong. Not worth finding out why
dt = dt[order(timestamp)][-1]
dt[, year := lubridate::year(timestamp)]
# Is the message to a group?
dt = dt_participants[, .(thread, convo)][dt, on = "thread"]
# save(dt, file = "~/blog/blogdown_data/2017-05-27-facebook-message-analysis_files/dt.rda")
# load( file = "~/blog/blogdown_data/2017-05-27-facebook-message-analysis_files/dt.rda")
dt %>%
`[`(sample(1:.N, 20), -"message") %>%
`[`(, id := 1:.N) %>%
`[`(, .(thread, year, timestamp, user = strsplit(user, " ")[[1]][1], message = "hidden"), id) %>%
head(10)
#########################################################
# Quick plot messages per year #
#########################################################
qplot(year, N, data = dt[, .N, year],
geom = "col",
main = "Messages per year", xlab = "Year", ylab = "Count") +
theme_minimal()
#########################################################
# Scrape out messy names to manually fix in csv #
#########################################################
# Sadly you have to edit this csv by hand for the best results when combining aliases.
# I could ping Facebook via the profile urls below, but given I have to fix
# nicknames by hand anyway, I just did the top 100 in the csv (~5 mins)
dt_users = dt[, .N, user][order(-N)]
dt_users[, url := stringr::str_extract(user, "[0-9]+")]
dt_users[, url := paste0("www.facebook.com/", url)]
dt_users %>% write.csv("~/blog/blogdown_data/2017-05-27-facebook-message-analysis_files/names.csv")
# Join on the aliases so I can group people messages from the same people
# Only use first names in plots in case any m8s don't want their full name shown
dt_users = fread("~/blog/blogdown_data/2017-05-27-facebook-message-analysis_files/names_edit.csv")
dt_users[, display_name := strsplit(name, " ")[[1]][1], V1]
dt_users[stringr::str_detect(display_name, "@facebook"), display_name := "Deleted"]
dt = dt_users[, .(user, name, display_name)][dt, on = "user"]
#########################################################
# Plot to show who I spoke to most per year #
#########################################################
# Look at most talked to people per year
dt_per_year = dt[name != "Akhil Nair", .N, .(convo, year, display_name, name)][
  order(year, -N), head(.SD, 20), .(year)]
dt_per_year = dt_per_year[year > 2008]
dt_per_year[, year := factor(year, levels = 2009:lubridate::year(Sys.Date()))]
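# Flag people with fewer than 3 appearances in the yearly top 20 as "Infrequent"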
dt_per_year = dt_per_year[, .(Infrequent = .N), name][Infrequent < 3][dt_per_year, on = "name"]
dt_per_year[, Infrequent := !is.na(Infrequent)]
dt_per_year[, Infrequent := ifelse(Infrequent == TRUE, "True", "False")]
dt_per_year[, convo := factor(convo, levels = c("group", "private"))]
# Already we can see a fun graph
ggplot(dt_per_year) +
geom_col(
aes(
x = reorder(display_name, -N),
y = N,
fill = Infrequent,
alpha = convo
)
) +
theme_minimal() +
theme(
text = element_text(size = 16),
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)
) +
facet_wrap(~year, scales = 'free', ncol = 2) +
labs(
title = 'Most common people per year',
x = 'Name',
y = 'Count'
) +
scale_fill_manual(values = c("steelblue", "#666666")) +
scale_alpha_discrete(range = c(0.5, 1))
#########################################################
# Plot to show how I spoke to people through the years #
#########################################################
label_names = dt_users[, display_name]
names(label_names) = dt_users[, name]
# How else do you readably chain data.table...?
idx_known4years = dt_per_year %>%
`[`(, .(name, year)) %>%
unique() %>%
`[`(, .N, name) %>%
`[`(N > 4, name)
ggplot(dt_per_year[name %in% idx_known4years]) +
geom_col(
aes(
x = year,
y = N,
alpha = convo
),
fill = "steelblue"
) +
theme_minimal() +
theme(
text = element_text(size = 16),
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)
) +
facet_wrap(~name, scales = 'free', ncol = 2,
labeller = as_labeller(label_names[names(label_names) %in% idx_known4years])) +
labs(
title = 'Most common people per year',
x = 'Name',
y = 'Count'
) +
scale_alpha_discrete(range = c(0.5, 1)) +
scale_x_discrete(drop = FALSE)
#########################################################
# Plot I was working on when I got bored #
#########################################################
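# Take the top 21 names so 20 remain once my own messages are dropped below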
idx_user = dt[, .N, name][order(-N)][1:21, name]
dt[, convo := factor(convo, levels = c("group", "private"))]
ggplot(dt[name %in% idx_user][name != "Akhil Nair" & year >= 2010]) +
geom_histogram(aes(x = timestamp, fill = display_name, alpha = convo),
binwidth = 60 * 60 * 24 * 14) +
facet_grid(name ~ ., scales = "free_y", space = "free_y",
labeller = as_labeller(label_names[names(label_names) %in% idx_user])) +
theme_minimal() +
theme(
text = element_text(size = 16),
strip.text.y = element_text(angle = 0),
axis.text.y = element_blank(),
legend.position = "none"
) +
labs(
x = "Time",
y = "Message Count",
title = "Facebook Messages sent since I got Facebook Messenger",
subtitle = "Top 20 people shown"
) +
scale_fill_manual(values = palette) +
scale_alpha_discrete(range = c(0.5, 1))