pacman::p_load(tidyverse, readtext, quanteda, tidytext, jsonlite, DT)
In-class Exercise 5: Visualizing and Analyzing Text Data
1 Overview
We will visualize and analyze text data.
2 Getting Started
2.1 Loading the required packages
readtext - imports and handles plain and formatted text files
quanteda - quantitative analysis of textual data
tidytext - converts text to and from tidy formats
jsonlite - parses JSON files
2.2 Loading the data
Using readtext
readtext() reads all the files and adds their contents into a tibble data frame. We use the articles/ folder and mc1.json data from the VAST Challenge.
text_data <- readtext("data/articles/*")
glimpse(text_data)
Rows: 338
Columns: 2
$ doc_id <chr> "Alvarez PLC__0__0__Haacklee Herald.txt", "Alvarez PLC__0__0__L…
$ text <chr> "Marine Sanctuary Aid Boosts Alvarez PLC's Sustainable Fishing …
3 Tokenizing text data
This will extract the words (tokens) from the text data.
usenet_words <- text_data %>%
  unnest_tokens(word, text) %>%
  filter(str_detect(word, "[a-z']$"),
         !word %in% stop_words$word)
glimpse(usenet_words)
Rows: 50,147
Columns: 2
$ doc_id <chr> "Alvarez PLC__0__0__Haacklee Herald.txt", "Alvarez PLC__0__0__H…
$ word <chr> "marine", "sanctuary", "aid", "boosts", "alvarez", "plc's", "su…
3.1 Sorting data by frequency
We can sort the words by frequency using count().
usenet_words %>%
  count(word, sort = TRUE)
readtext object consisting of 3261 documents and 0 docvars.
# A data frame: 3,261 × 3
word n text
<chr> <int> <chr>
1 fishing 2177 "\"\"..."
2 sustainable 1525 "\"\"..."
3 company 1036 "\"\"..."
4 practices 838 "\"\"..."
5 industry 715 "\"\"..."
6 transactions 696 "\"\"..."
# ℹ 3,255 more rows
The 3 most common words from the text sources are fishing, sustainable, and company.
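Since tidyverse is loaded, we could also visualize these frequencies. A minimal sketch of a bar chart of the ten most frequent words, building on the count() output above:

```r
# Plot the ten most frequent words as a horizontal bar chart,
# ordered so the most frequent word appears at the top
usenet_words %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 10) %>%
  ggplot(aes(x = n, y = fct_reorder(word, n))) +
  geom_col() +
  labs(x = "Frequency", y = NULL,
       title = "Top 10 words across all articles")
```

fct_reorder() (from forcats, part of tidyverse) sorts the bars by frequency rather than alphabetically.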
3.2 Extracting company and news source information
The file name seems to be in the format CompanyName__0__Publisher. We can extract the company name and the publisher.
text_data_split <- text_data %>%
  separate_wider_delim("doc_id",
                       delim = "__0__",
                       names = c("X", "Y"),
                       too_few = "align_end")
glimpse(text_data_split)
Rows: 338
Columns: 3
$ X <chr> "Alvarez PLC", "Alvarez PLC", "Alvarez PLC", "Alvarez PLC", "Alva…
$ Y <chr> "0__Haacklee Herald.txt", "0__Lomark Daily.txt", "0__The News Buo…
$ text <chr> "Marine Sanctuary Aid Boosts Alvarez PLC's Sustainable Fishing Ef…
The code above hardcodes 0 in the delimiter, but the number can vary across file names. As a result, some rows are not parsed as expected, e.g.
NA | Cervantes-Kramer__1__1__Haacklee Herald.txt | C |
Looking at the data a second time, we can determine that the delimiter should be something like __<num>__<num>__.
We can adjust the code to use regex() with the \d+ token, which matches any number. We will also change "X" and "Y" to "Company" and "Publisher", respectively, so we can more easily understand what the data represents.
text_data_company_publisher <- text_data %>%
  separate_wider_delim("doc_id",
                       delim = regex("__\\d+__\\d+__"),
                       names = c("Company", "Publisher"),
                       too_few = "align_end")
datatable(text_data_company_publisher[1:2])
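With the company and publisher separated, one way to sanity-check the split (a sketch, assuming all file names matched the regex) is to count articles per publisher:

```r
# Count how many articles each publisher carries;
# an NA publisher would indicate a file name that failed to parse
text_data_company_publisher %>%
  count(Publisher, sort = TRUE)
```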
4 Working with json files
We can use fromJSON() to read a JSON file.
mc1_data <- fromJSON("data/mc1.json")
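fromJSON() returns a nested list, so a useful next step is to inspect its structure before analyzing it. A sketch; the element names below are assumptions to verify against the actual file (VAST Challenge graph files typically hold nodes and links):

```r
# Show the top-level structure of the parsed JSON, one level deep
str(mc1_data, max.level = 1)

# If the file holds a node-link graph, the elements can be
# converted to tibbles for further analysis, e.g.:
# as_tibble(mc1_data$nodes)
```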