Tag Archives: HDTMT

Ulysses Interactive Visualization with scattertext (HDTMT)

How did they make that?

Inspired by the book Ulysses by Numbers by Eric Bulson, I have created a visualization of a sentiment analysis of the novel Ulysses. The visualization can be found here. The code for the analysis and the plotting is here.

This visualization plots words by their frequency (X axis) and shows the top positive and negative words in a sidebar table, along with related characteristics (adjectives). The Y axis displays a score computed from the sentiment categories (‘Positive’ and ‘Negative’).

The visualization is interactive, allowing us to search for specific words and retrieve the passages of the text where they occur.


This HDTMT is intended to show how to run this analysis, and it is divided into six sections, as follows:

  • Downloading and importing Ulysses data
  • Tokenizing data into sentences
  • Performing sentiment analysis
  • Plotting the visualization with scattertext
  • Deploying the HTML to GitHub Pages
  • Next steps and improvements


Downloading and importing Ulysses data

We start by importing Requests, the Python library for making HTTP requests, which lets us connect to the GitHub repo where the Ulysses data is stored – here. The data comes from the Project Gutenberg EBook of August 1, 2008 [EBook #4300], updated on October 30, 2018.
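In code, this step can be sketched as follows. The Gutenberg URL below is an assumption for illustration; the original post reads the same text from a GitHub repo instead.

```python
import requests

def fetch_text(url: str) -> str:
    """Download a plain-text file over HTTP and return its contents."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text

# EBook #4300 is Ulysses on Project Gutenberg; this exact URL is an
# assumption -- the original analysis pulls the text from a GitHub repo.
# ulysses = fetch_text("https://www.gutenberg.org/files/4300/4300-0.txt")
```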


Tokenizing data into sentences

After importing the data, the next step involves cleaning the text and tokenizing it into sentences. The raw data contains noise such as line breaks (‘\n’) and misplaced carriage returns (‘\r’). The function first rejoins lines split by stray carriage returns (‘\r’) so that sentences are connected properly. Then it tokenizes the text into sentences, splitting the string on line breaks (‘\n’) and sentence-final punctuation marks. This ensures that the text is properly segmented into individual sentences for further analysis.
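A minimal sketch of this cleaning step; the exact splitting rules in the original notebook may differ:

```python
import re

def tokenize_sentences(raw: str) -> list[str]:
    """Clean line-break noise and split text into sentences.

    Carriage returns and newlines that break a sentence mid-way are
    rejoined, then the text is split on sentence-final punctuation.
    """
    text = raw.replace("\r", " ")          # rejoin misplaced line breaks
    text = re.sub(r"\s+", " ", text)       # collapse '\n' and extra spaces
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s.strip() for s in sentences if s.strip()]

tokenize_sentences("Stately, plump Buck Mulligan came\r\nfrom the stairhead. He held a bowl.")
# -> two cleanly joined sentences
```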


Performing sentiment analysis

The next step is performing a sentiment analysis that classifies each sentence as positive, negative, or neutral in tone. This task is performed using the SentimentIntensityAnalyzer class from the nltk.sentiment module in NLTK (the Natural Language Toolkit).

The SentimentIntensityAnalyzer uses a lexicon-based approach, analyzing the sentiment of a text by calculating a compound score: an aggregated value that combines the positive, negative, and neutral scores. The compound score ranges from -1 (extremely negative) to 1 (extremely positive). If the compound score is greater than 0.05, the text can be considered positive; if it is lower than -0.05, negative; values in between are neutral.


Plotting the visualization with scattertext

The next step is to import scattertext, a Python library that analyzes category-labeled text and generates an interactive HTML visualization. The library expects a specific data type, a corpus, created here with the CorpusFromPandas function. This takes a Pandas DataFrame (t) as input and specifies the category column (‘sentiment’) and text column (‘text’). The whitespace_nlp_with_sentences tokenizer is used for text processing. The corpus is then built, and the unigram corpus is extracted from it.

Then, a BetaPosterior object is created from the corpus to compute term scores based on the sentiment categories (‘Positive’ and ‘Negative’). Next, its get_score_df() method is called to obtain a DataFrame with the scores. The top positive and negative terms are printed based on the ‘cat_p’ (category posterior) and ‘ncat_p’ (non-category posterior) scores, respectively.

The produce_frequency_explorer function generates an interactive HTML visualization of the corpus. It specifies the corpus, the positive category, the name for the negative category, and the term_scorer as the BetaPosterior object. The grey_threshold parameter sets a threshold for terms to be displayed in gray if they have a score below it.

Finally, the resulting HTML visualization is saved to a file named ‘ulysses_sentiment_visualization.html’ using the open function.


Deploying the HTML to GitHub Pages

Lastly, we can deploy the HTML output from the scattertext visualization to a GitHub Pages site. We start by creating a GitHub repo and uploading the HTML file. Then we go to Settings and change the deployment source – I’m deploying from a branch (main, / root) in this case.
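The steps above can be sketched as shell commands; the repo name and `<username>` are placeholders, not the actual repo:

```shell
# Sketch of the deployment steps; <username> and the repo name are placeholders.
git init ulysses-viz && cd ulysses-viz
cp ../ulysses_sentiment_visualization.html index.html   # Pages serves index.html at the site root
git add index.html && git commit -m "Add scattertext visualization"
git remote add origin git@github.com:<username>/ulysses-viz.git
git push -u origin main
# Then in the repo: Settings → Pages → Source: "Deploy from a branch" → Branch: main, folder: / (root)
```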

I’m pointing my GitHub Pages site to my own domain, livia.work, but you can also use the default GitHub address. This tutorial by Professor Ellie Frymire has more details about the deployment step that can be helpful – here.

HDTMT: Locating the beautiful, picturesque, sublime and majestic

Locating the beautiful, picturesque, sublime and majestic: spatially analysing the application of aesthetic terminology in descriptions of the English Lake District

https://doi.org/10.1016/j.jhg.2017.01.006

Authors/Project Team:
Christopher Donaldson – Lancaster University
Ian N. Gregory – Lancaster University
Joanna E. Taylor – University of Manchester

WHAT IT IS

An investigation of the geographies associated with the use of a set of aesthetic terms (“beautiful,” “picturesque,” “sublime,” and “majestic”) in writing about the English Lake District, a region in the northwest of England with a long and prestigious history of representation in English-language travel writing and landscape description, notably in the 18th and 19th centuries. The Lake District has been a particular focus within the field of spatial humanities for well over a decade, motivated in part by “an awareness of the braided nature of the region’s socio-spatial and cultural histories; and an understanding of this rural, touristic landscape as a repeatedly rewritten and imaginatively overdetermined space” (Cooper and Gregory 90).

Focusing on the four aforementioned terms, which exemplify a new language of landscape appreciation emerging in late 18th century British letters, Donaldson and his co-authors intend to “demonstrate what a geographically orientated interpretation of aesthetic diction can reveal about the ways regions like the Lake District were perceived in the past” (44).

Through this case study, the authors introduce the method of “geographical text analysis,” which they locate at the nexus of aesthetics, physical geography, and literary study. The project combines corpus linguistics with geographic information systems (GIS) in a novel fashion.

Primary Data Source:

  • Corpus of Lake District Writing, 1622-1900 (Github)

The corpus contains 80 manually digitized texts totaling over 1.5 million word tokens.

Natural language processing (NLP) techniques were used to identify place names and assign these names geographic coordinates—a method called “geoparsing.” But the project members also went beyond what was possible at the time with out-of-the-box NLP libraries and geoparser tools in order to deeply annotate the texts, linking place-name variants and differentiating a wide range of topographical features. As such, the corpus “forms a challenging testbed for geographical text analysis methods” (Rayson et al.).

What you’d need to know to conduct “geographical text analysis”:

Step 1: Geoparsing

If your corpus is not already annotated, you will need to “geoparse”—convert place-names into geographic identifiers.

Geoparsing involves two stages of NLP:

  • Named Entity Recognition (NER) – a method for automatically extracting place names from text data
  • Named Entity Disambiguation (NED) – a method for linking the extracted terms to existing knowledge, enabling cross-referencing and connections to metadata such as geo-spatial information
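A drastically simplified sketch of these two stages. The gazetteer, place names, and coordinates below are illustrative; real geoparsers use trained NER models and full gazetteers such as GeoNames:

```python
# Toy gazetteer: place name -> approximate (lat, lon). Illustrative only.
GAZETTEER = {
    "Windermere": (54.37, -2.94),
    "Keswick": (54.60, -3.13),
    "Grasmere": (54.46, -3.02),
}

def geoparse(text: str) -> list[tuple[str, tuple[float, float]]]:
    """Return (place, (lat, lon)) pairs found in `text`.

    Stage 1 (NER) is reduced to gazetteer lookup; stage 2 (NED) to a
    direct dictionary match -- both far simpler than production tools.
    """
    found = []
    for place, coords in GAZETTEER.items():
        if place in text:
            found.append((place, coords))
    return found

geoparse("The vale of Grasmere is beautiful; the crags beyond Keswick, sublime.")
```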

Tools:

Step 2: Collocation analysis

The authors go about identifying the specific geographies associated with “beautiful,” “picturesque,” “sublime,” and “majestic” by noting when those terms appear alongside placenames. Thus, the authors develop a dataset of placename co-occurrences or “PNCs” extracted from their corpus. They then assess the frequency of co-occurrence to determine the statistical significance of the association between a given place and one of the aesthetic terms.
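The core counting step can be sketched as a window-based co-occurrence count. The window size, term list, and toy gazetteer below are assumptions, and the significance testing the authors apply afterwards is omitted:

```python
import re
from collections import Counter

AESTHETIC_TERMS = {"beautiful", "picturesque", "sublime", "majestic"}
PLACES = {"Windermere", "Keswick", "Grasmere"}  # illustrative, not a full gazetteer

def pnc_counts(text: str, window: int = 10) -> Counter:
    """Count place-name co-occurrences (PNCs): a place name appearing
    within `window` tokens of one of the aesthetic terms."""
    tokens = re.findall(r"[A-Za-z]+", text)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok.lower() in AESTHETIC_TERMS:
            neighbourhood = tokens[max(0, i - window): i + window + 1]
            for other in neighbourhood:
                if other in PLACES:
                    counts[(other, tok.lower())] += 1
    return counts

pnc_counts("The sublime crags above Keswick rise into the mist.")
```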

Tools:

Step 3: Spatial analysis

With the statistically significant PNCs identified, the authors use geoparsing tools to assign latitude/longitude (mappable) coordinates to each PNC. This enables researchers to analyse the spatial distribution of PNCs through GIS software such as ArcGIS, creating standard dot maps as well as density-smoothed maps. They also use Kulldorff’s spatial scan statistic (a method originally developed for epidemiology) to identify clusters.

With sophisticated GIS, they can map the spatial coordinates of the PNCs onto topographical and geological datasets, enabling a rich understanding of how places described as “majestic,” for example, map onto different elevations or different geological formations.

Digital terrain models (DTMs) and digital elevation models (DEMs) are vector and raster maps that can be imported into GIS tools if they are not already included. National geological surveys provide geology data in the form of GIS lines and polygons that can be matched with PNC spatial metadata.
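As a toy illustration of this overlay: the grid values and coordinates below are invented, and real DEMs are continuous rasters handled inside GIS software rather than Python dictionaries:

```python
# Toy nearest-cell elevation lookup standing in for a real DEM raster.
# All values below are invented for illustration.
ELEVATION_GRID = {  # (lat, lon) rounded to 0.1 degrees -> elevation in metres
    (54.6, -3.1): 250,
    (54.5, -3.0): 120,
}

def elevation_at(lat: float, lon: float):
    """Look up the elevation of the grid cell containing a point."""
    return ELEVATION_GRID.get((round(lat, 1), round(lon, 1)))

# PNCs with illustrative coordinates: (place, aesthetic term, lat, lon)
pncs = [("Keswick", "sublime", 54.60, -3.13),
        ("Grasmere", "beautiful", 54.46, -3.02)]
for place, term, lat, lon in pncs:
    print(place, term, elevation_at(lat, lon))
```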

Tools:

Results

Donaldson et al.’s geographic analysis yields some striking findings on how the four aesthetic terms are applied to the Lake District landscape, which the authors summarize as follows:

As we have seen, whereas beautiful and, more especially, picturesque are often associated with geographical features set within, and framed by, their environment, majestic is more typically associated with features that rise above or extend beyond their surroundings. Sublime, true to Burke’s influential conception of the term, stands apart from these other terms in being associated with formations that are massed together in ways that make them difficult to differentiate […] The distinctive geographies associated with the terms beautiful and picturesque, on the one hand, and majestic and sublime, on the other, confirm that the authors of the works in our corpus were, as a whole, relatively discerning about the ways they used aesthetic terminology.

(Donaldson et al. 59)

References Cited:

Cooper, David, and Ian N. Gregory. “Mapping the English Lake District: A Literary GIS.” Transactions of the Institute of British Geographers, vol. 36, no. 1, 2011, pp. 89–108.

Donaldson, Christopher, et al. “Locating the Beautiful, Picturesque, Sublime and Majestic: Spatially Analysing the Application of Aesthetic Terminology in Descriptions of the English Lake District.” Journal of Historical Geography, vol. 56, Apr. 2017, pp. 43–60. ScienceDirect, https://doi.org/10.1016/j.jhg.2017.01.006.

Rayson, Paul, et al. “A Deeply Annotated Testbed for Geographical Text Analysis: The Corpus of Lake District Writing.” Proceedings of the 1st ACM SIGSPATIAL Workshop on Geospatial Humanities, Association for Computing Machinery, 2017, pp. 9–15. ACM Digital Library, https://doi.org/10.1145/3149858.3149865.