Building human context into how we understand and process textual data (response blog post to week 12 readings)

In the article “Automatically Processing Tweets from Gang-Involved Youth: Towards Detecting Loss and Aggression,” a team of computer scientists and social workers describe their use of natural language processing (NLP) to analyze tweets from known gang members in Chicago. Their goal is to understand when a tweet conveys a feeling of loss or aggression — which might precede an act of violence — with the hope that this automatic detection can support community outreach groups’ intervention efforts. 

The team realizes they cannot easily use existing NLP techniques, given how significantly the textual data (i.e., the vernacular and slang found in the tweets) differ from the language those systems are trained on. Existing part-of-speech taggers, for instance, cannot reliably handle the abbreviations and non-standard spellings found in the tweets.
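
To make the mismatch concrete, here is a minimal sketch of the kind of failure the authors describe. The example tweet is invented (it is not drawn from the study's corpus), and NLTK's default English tagger stands in here for the general class of off-the-shelf taggers trained on standard edited text:

```python
# Minimal sketch: an off-the-shelf part-of-speech tagger applied to
# tweet-style spelling. NLTK's default English tagger was trained on
# standard edited text, so non-standard tokens tend to fall back to
# generic guesses rather than the parts of speech a human reader
# would assign.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Invented example in tweet-style spelling; not taken from the study's data.
tweet = "dey tryna act hard but dem boyz aint bout it"

tokens = nltk.word_tokenize(tweet)
print(nltk.pos_tag(tokens))
```

Run on text full of local slang, a tagger like this produces output that looks plausible on the surface but does not reflect how the writers and their community actually use the words.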

To create a more accurate model, the researchers hand-label the tweets, specifically turning to domain experts including social workers and two “informants” — 18-year-old African American men from a Chicago neighborhood with more frequent violence — to interpret each tweet. They call this act of decoding a “deep read” (2198), and the computer scientists work closely with the domain experts to understand the data and tune the model appropriately.
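
In supervised-learning terms, the product of this “deep read” is a small, expert-labeled corpus. The sketch below is purely illustrative: the tweet texts are invented, and the three-way label set is a simplification built around the loss and aggression categories discussed above plus a catch-all “other,” not a reproduction of the paper's annotation scheme:

```python
# Hypothetical illustration of how hand-labeled tweets might be organized
# before model training. All texts and labels below are invented.
from collections import Counter

labeled_tweets = [
    {"text": "miss you bro it aint been the same", "label": "loss"},
    {"text": "they kno what happen if they slide thru", "label": "aggression"},
    {"text": "new kicks came in today", "label": "other"},
]

# Inspecting the label distribution is a routine first step once
# annotation is done; it says nothing about the real study's data.
print(Counter(t["label"] for t in labeled_tweets))
```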

I feel this represents a feminist approach to working with textual data, as it places primary importance on respecting the context of the data, and on the people behind it. Using existing NLP tools off the shelf, or forgoing the step of consulting domain experts, would not only introduce significant errors into the results but could also misrepresent and even harm the communities the researchers are trying to support. As Catherine D’Ignazio and Lauren Klein write in Data Feminism, “Refusing to acknowledge context is a power play to avoid power. It’s a way to assert authoritativeness and mastery without being required to address the complexity of what the data actually represent” (ch. 6). 

Notably, the authors of the research article conclude by pointing to future research directions, explicitly stating that they aim to “extend our corpus to include more authors, more time periods, and greater geographical variation.” This suggests an iterative, process-oriented approach to modeling the data, in line with Richard Jean So’s article “All Models Are Wrong,” in which he encourages researchers to assume from the start that a model is wrong, while still treating it as useful for exploration and as something that can be improved through iteration (669). The researchers working with this Twitter data could (and should) continue to refine their model to better represent and engage with the people whose data they are modeling.