In-class Exercise 5: Visualizing and Analyzing Text Data

Author

Kristine Joy Paas

Published

May 11, 2024

Modified

May 11, 2024

1 Overview

In this exercise, we will visualize and analyze text data.

2 Getting Started

2.1 Loading the required packages

  • readtext - Import and handling for plain and formatted text

  • quanteda - Quantitative text data analysis

  • tidytext - converts text to and from tidy formats

  • jsonlite - for processing JSON files

pacman::p_load(tidyverse, readtext, quanteda, tidytext, jsonlite, DT)

2.2 Loading the data

readtext() reads all the files and adds their contents to a tibble data frame.

We use the articles/ and mc1.json data from the VAST Challenge.

text_data = readtext("data/articles/*")
glimpse(text_data)
Rows: 338
Columns: 2
$ doc_id <chr> "Alvarez PLC__0__0__Haacklee Herald.txt", "Alvarez PLC__0__0__L…
$ text   <chr> "Marine Sanctuary Aid Boosts Alvarez PLC's Sustainable Fishing …

3 Tokenizing text data

This extracts the words (tokens) from the text data, keeping only word-like tokens and removing common stop words.

usenet_words <- text_data %>%
  unnest_tokens(word, text) %>%
  filter(str_detect(word, "[a-z']$"),
         !word %in% stop_words$word)
glimpse(usenet_words)
Rows: 50,147
Columns: 2
$ doc_id <chr> "Alvarez PLC__0__0__Haacklee Herald.txt", "Alvarez PLC__0__0__H…
$ word   <chr> "marine", "sanctuary", "aid", "boosts", "alvarez", "plc's", "su…

3.1 Sorting data by frequency

We can count the words and sort them by frequency using count().

usenet_words %>%
  count(word, sort = TRUE)
readtext object consisting of 3261 documents and 0 docvars.
# A data frame: 3,261 × 3
  word             n text     
  <chr>        <int> <chr>    
1 fishing       2177 "\"\"..."
2 sustainable   1525 "\"\"..."
3 company       1036 "\"\"..."
4 practices      838 "\"\"..."
5 industry       715 "\"\"..."
6 transactions   696 "\"\"..."
# ℹ 3,255 more rows

The 3 most common words from the text sources are fishing, sustainable, and company.
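Since the goal of this exercise is also to visualize the text data, one possible sketch (not part of the original output) is a bar chart of the most frequent words, using the ggplot2 package that pacman already loaded via tidyverse:

```r
# Plot the 10 most frequent words as a horizontal bar chart,
# reusing the usenet_words tibble created above
usenet_words %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 10) %>%
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Frequency", y = NULL,
       title = "Most common words in the articles")
```

Ordering the words with reorder() keeps the bars sorted by frequency rather than alphabetically.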

3.2 Extracting company and news source information

The file name seems to be in the format CompanyName__0__Publisher. We can extract the company name and the publisher from it.

text_data_split <- text_data %>%
  separate_wider_delim("doc_id",
                       delim = "__0__",
                       names = c("X", "Y"),
                       too_few = "align_end")
glimpse(text_data_split)
Rows: 338
Columns: 3
$ X    <chr> "Alvarez PLC", "Alvarez PLC", "Alvarez PLC", "Alvarez PLC", "Alva…
$ Y    <chr> "0__Haacklee Herald.txt", "0__Lomark Daily.txt", "0__The News Buo…
$ text <chr> "Marine Sanctuary Aid Boosts Alvarez PLC's Sustainable Fishing Ef…
Tip

The code above hardcodes 0 in the delimiter, but the number between the underscores can vary. As a result, some file names are not parsed as expected, e.g.

NA Cervantes-Kramer__1__1__Haacklee Herald.txt C

Looking at the data a second time, we can determine that the delimiter should be something like __<num>__<num>__.

We can adjust the code to use a regular expression with the \d+ token, which matches any sequence of digits. We will also rename “X” and “Y” to “Company” and “Publisher”, respectively, so the columns more clearly describe what the data represents.

text_data_company_publisher <- text_data %>%
  separate_wider_delim("doc_id",
                       delim = regex("__\\d+__\\d+__"),
                       names = c("Company", "Publisher"),
                       too_few = "align_end")
datatable(text_data_company_publisher[1:2])
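With the Company and Publisher columns separated cleanly, a small follow-up sketch (assuming the regex split above succeeded for all rows) is to summarize how many articles each publisher contributed:

```r
# Count the number of articles per publisher
text_data_company_publisher %>%
  count(Publisher, sort = TRUE)
```

The same pattern with count(Company) would show how much coverage each company received.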

4 Working with json files

We can use fromJSON() to read a JSON file.

mc1_data <- fromJSON("data/mc1.json")
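The structure of mc1.json is not shown here, so as a first step we can inspect what fromJSON() returned before working with it; this is only an exploratory sketch, since the element names depend on the file's contents:

```r
# Inspect the top-level structure of the parsed JSON
# (element names and types depend on the file itself)
str(mc1_data, max.level = 1)
```

fromJSON() simplifies JSON arrays to data frames or vectors where it can, so str() at max.level = 1 gives a quick overview of the top-level components.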

5 Reflections