Weapons of Math Destruction – a deep dive into the recidivism algorithm

Automation occupies an ever-growing share of today’s productive systems, and algorithms are at its core. Algorithms are logical sequences of instructions that enable autonomous execution by machines. The expansion of these calculation methods is largely due to software able to collect ever more significant amounts of data from different sources. They are embedded in users’ everyday tools such as Google Search, social media networks, the movie and music recommendations of Netflix and Spotify, personal assistants, video games, and surveillance and security systems.

Computers are far more efficient at repeating endless tasks without making mistakes. They connect information faster and establish protocols to log input and output data. They don’t get distracted, tired, or sick. They don’t gossip or miscommunicate with each other. The Post-Accident Review Meeting on the Chernobyl Accident (1986) found that poor team communication and sleep deprivation were among the major causes of the disaster.

In 2017, the Blue Brain Project revealed that brain structures are able to process information in up to 11 dimensions. Computers, on the other hand, can process an enormous number of dimensions and uncover patterns that the human brain could not imagine. The concept of big data goes beyond the number of cases in a dataset; it also involves the number of features/variables able to describe a phenomenon.

Of all the advantages of computers, the most important one is their inability to be creative, at least so far. If I go to bed trusting that my phone is not planning revenge or plotting against me with other machines, it is because computers don’t have wills of their own. Computers don’t have an agenda. Human beings do. Public opinion has become more aware of the impact of automation on the global economy. According to a 2019 Pew Research Center study, 76% of Americans believe that workplace automation is more likely to increase inequality between rich and poor people, and 48% say it has mostly hurt workers. Fully 85% favor limiting machines to dangerous or unhealthy jobs.

Computers uncover patterns; they don’t create new ones. Machines use data to find patterns from past events, which means their predictions will replicate the current reality. If we rely on algorithms alone, the world will continue as it is. In Weapons of Math Destruction (2016), Cathy O’Neil adds a new layer to this discussion, exploring how automation propagates inequality when biased data are fed into models. O’Neil introduces the concept of “weapons of math destruction” (WMDs), referring to big data algorithms that perpetuate existing inequality. She highlights three main characteristics of WMDs: they are opaque, making it challenging to understand their inner workings and question their outcomes; they are scalable, allowing biases to be magnified when applied to large populations; and they are difficult to contest, often used by powerful institutions that hinder individuals from challenging their results. Extending her own example: if we based educational decision-making policies on college data from the early 1960s, we would not see the level of female enrollment we do today, because the models would have been trained primarily on successful men, thus perpetuating gender and racial biases.

This article explores one of the examples she gives in her book: the recidivism algorithm. An illustrative case was published in May 2016 by the nonprofit ProPublica. The article “Machine Bias” exposed the impact of biased data on the risk scores of COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), a commercial software package used to predict the probability of a convicted person committing new crimes. The algorithms used to predict recidivism were logistic regression and survival analysis, the same models used to predict the probability of treatment success among cancer patients.
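
To make the mechanics concrete, here is a minimal sketch of a logistic-regression risk score. Every feature, number, and label below is invented for illustration; COMPAS’s actual inputs and weights are proprietary and not public.

```python
# Minimal sketch of a recidivism-style risk score via logistic regression.
# All features, data, and labels are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features per defendant:
# [prior_arrests, age_at_first_arrest, residential_moves_last_year]
X = np.array([
    [0, 34, 0],
    [2, 19, 3],
    [5, 16, 4],
    [1, 28, 1],
    [4, 17, 5],
    [0, 40, 0],
])
y = np.array([0, 1, 1, 0, 1, 0])  # 1 = re-arrested within two years

model = LogisticRegression().fit(X, y)

# The "risk score" is simply the predicted probability of recidivism.
new_defendant = np.array([[3, 18, 2]])
risk = model.predict_proba(new_defendant)[0, 1]
print(f"Predicted recidivism risk: {risk:.2f}")
```

Note that such a model can only learn from past arrest records, so any bias in who gets arrested is reproduced, at scale, in the score.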

“The question, however, is whether we’ve eliminated human bias or simply camouflaged it with technology. The new recidivism models are complicated and mathematical. But embedded within these models are a host of assumptions, some of them prejudicial. And while Walter Quijano’s words were transcribed for the record, which could later be read and challenged in court, the workings of a recidivism model are tucked away in algorithms, intelligible only to a tiny elite”.

To calculate risk scores, COMPAS analyzes data and variables related to substance abuse, family relationships and criminal history, financial problems, residential instability, and social adjustment. The scores are built using data from several sources, but mainly from a survey of 137 questions. Some of the questions include “How many of your friends have been arrested?”, “How often have you moved in the last twelve months?”, “In your neighborhood, have some of your friends and family been crime victims?”, “Were you ever suspended or expelled from school?”, “How often do you have barely enough money to get by?”, and “I have never felt sad about things in my life”.

According to the Division of Criminal Justice Services of the State of New York (2012), the “[COMPAS-Probation Risk] Recidivism Scale worked effectively and achieved satisfactory predictive accuracy”. The Board of Parole currently uses the score for decision-making. Data compiled by the nonprofit Vera Institute of Justice show that 40% of parole applicants in New York were granted parole in 2020. In 2014, the grant rate reached 67% in Connecticut, 63% in Massachusetts, and 52% in Kentucky.

Former U.S. Attorney General Eric Holder commented about the scores that “although these measures were crafted with the best of intentions, I am concerned that they inadvertently undermine our efforts to ensure individualized and equal justice […] may exacerbate unwarranted and unjust disparities that are already far too common in our criminal justice system and in our society.”

Race, nationality, and skin color were often used in making such predictions until about the 1970s, when it became politically unacceptable, according to a survey of risk assessment tools by Columbia University law professor Bernard Harcourt. Even so, these tools still target underprivileged communities with little access to welfare. In 2019, the poverty rates of African-Americans and people of Hispanic origin were 18.8% and 15.7%, respectively, compared to 7.3% for white people.

Assessments of social projects have shown a decrease in violence among vulnerable communities assisted by income-transfer programs in different parts of the world. In the US, the NGO Advance Peace conducted an 18-month program in California targeting the community members at the highest risk of perpetrating gun violence or being victimized by it. The program includes trauma-informed therapy, employment, and training. The results show a 55% decrease in firearm violence after the implementation of the program in Richmond. In Stockton, gun homicides and assaults declined by 21%, saving an estimated $42.3M-$110M in city expenses over the two-year program.

In this sense, relying on algorithms will propagate the current system. Predictions reinforce a dual society in which the wealthy are privileged to receive personalized, humane, and regulated attention, while vulnerable groups are condemned to the verdicts of “smart machines”. There is no transparency in those machines, and no effort from companies or governments to educate the public about how the decisions are made. In effect, a scoring system is created to evaluate the vulnerable. Social transformation will come instead from new policies directed at reducing inequality and promoting well-being.

Abstract – Identifying Sexism in Text

Examining social and cultural questions using computational text analysis carries significant challenges. Texts are socially and culturally situated: they reflect the ideas of both authors and their target audiences. Such interpretation is hard to incorporate into computational approaches.

Research proceeds from identifying questions through data selection, conceptualization, and operationalization, and ends with analysis and interpretation of results.

The person who produces sexist language is not given any space for productive change, but may simply become more entrenched in their position (Post-feminist analysis, 236)

Attempts at reforming sexism in language can fail if they simply focus on the eradication of certain phrases and words.

Because of this, approaching a digital text collection from a literary textual critic’s perspective might require questioning the context behind the digital text. Rather than working on raw text and relying on results produced by machine processing, it makes more sense to understand the environment, reasoning, and validity of the information provided in the text.

Instead of taking data at face value and looking toward future insights, data scientists can first interrogate the context, limitations, and validity of the data under use. That said, one feminist strategy for considering context is to consider the cooking process that produces “raw” data (Klein and D’Ignazio, “The Numbers Don’t Speak for Themselves”, 14).

Researchers too easily attribute phraseological differences to gender when in fact other intersecting variables might be at play. Insofar as gender can be counted as performative language use in its social context, it is important to avoid dataset and interpretational bias (“These Are Not the Stereotypes You Are Looking For”, 15).

  • D’Ignazio, Catherine, and Lauren Klein. Data Feminism. “3. On Rational, Scientific, Objective Viewpoints from Mythical, Imaginary, Impossible Standpoints” and “6. The Numbers Don’t Speak for Themselves.” Published on: Mar 16, 2020
  • Nguyen, Dong, Maria Liakata, Simon DeDeo, Jacob Eisenstein, David Mimno, Rebekah Tromble, and Jane Winters. “How We Do Things With Words: Analyzing Text as Social and Cultural Data”. Published online 2020 Aug 25
  • Koolen, C., & van Cranenburgh, A. (2017). “These Are Not the Stereotypes You Are Looking For: Bias and Fairness in Authorial Gender Attribution”. In Proceedings of the First Ethics in NLP Workshop (pp. 12-22)

Abstract Proposal; Men in Wonderland: The Lost Girlhood of the Victorian Gentleman

Catherine Robson published Men in Wonderland: The Lost Girlhood of the Victorian Gentleman with Princeton University Press in 2001. She is an assistant professor of English at the University of California, Davis, where she specializes in nineteenth-century British literature and culture. In this book, she explores the fascination with little girls in Victorian culture through 19th-century literature by British male authors. In doing so, she reveals the link between the idealization of little girls and a widespread fantasy of male development. Robson’s argument is that the concept of ‘little girls’ during this era offered an adult male the best opportunity to reconnect with his own lost self.

The individual authors whose work she explores include Wordsworth, De Quincey, Dickens, Ruskin, and Carroll. Alongside these works of literature, she compares cultural artifacts of the era, including conduct books, government reports, fine art, and popular journalism.

Complementing Robson’s close reading of this literature, a computational text analysis of the same works could reveal patterns and findings that coincide with her analysis. Such distant readings of childhood and masculinity in the Victorian era could also contribute to our own contemporary Western understanding of masculinity and femininity in pop culture.

Unveiling the Patient Journey: A Gender Perspective on Chronic Disease-Centered Care

Abstract

Healthcare is an industry in which customers want to deal with humans, not machines; they want to connect with real people at their most vulnerable moments. In this context, women are more likely to be at the center of the patient journey: from taking their loved ones to doctors’ appointments to being the primary caregivers of children with chronic diseases (such as asthma).

The burden of care still falls on women as unpaid work. However, the picture is not any better for men. In 2019, the Cleveland Clinic conducted a survey which found that 72% of respondents preferred doing household chores like cleaning the bathroom to going to the doctor; 65% said they avoid going to the doctor for as long as possible; and 20% admitted they are not always honest with their doctors about their health. On average, men die younger than women in the United States: American women had a life expectancy of 79 years in 2021, compared to 73 for men (CDC, 2022).

The goal of this research is to explore gender differences in the patient journey by applying a corpus-linguistic approach: manually creating and annotating a dataset about chronic disease in Portuguese and English using social media data from Facebook, Instagram, YouTube, and Twitter. I then apply text analysis methods to describe the dataset. Lastly, I compare the classification results of generative AI against a traditional machine-learning text analysis.
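
As a rough illustration of the “traditional” side of that comparison, the sketch below trains a TF-IDF bag-of-words classifier on a handful of invented posts. The texts, labels, and categories are placeholders, not the study’s actual annotated data.

```python
# Minimal sketch of a traditional text-classification baseline
# (TF-IDF features + logistic regression). Posts and stage labels
# are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = [
    "Finally got my son's asthma inhaler refilled today",
    "Third specialist appointment this month, still no diagnosis",
    "New treatment plan starts tomorrow, fingers crossed",
    "Symptoms came back last night, calling the clinic",
]
stages = ["treatment", "diagnosis", "treatment", "symptoms"]

baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(posts, stages)

# Predict the patient-journey stage of an unseen post.
print(baseline.predict(["Waiting on test results for my daughter"]))
```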

This analysis also weighs the benefits and drawbacks of performing such an analysis. Despite the investment in language-model resources, it is valuable to use AI to uncover gender inequalities. The final goal is to open a discussion about how to lift the burden from women while also empowering men to feel comfortable about their own health. It also opens space to discuss new methods exploring different gender classifications.

Goal: This proposal describes a study of how corpus-linguistic and text analysis methods can be used to support research on language and communication within the context of healthcare, using social media data about chronic disease in English and Brazilian Portuguese.

The specific goals involve:

  1. Performing a literature review based on previous studies and benchmark datasets in the healthcare field – process finished.
  2. Creating a dataset of social media posts from 2020 to 2023 from networks such as Twitter, YouTube, Facebook, and local media channels. The dataset is composed of around 7k posts and specifies the patient’s gender, type of treatment/medication, and number of likes and comments – process already finished: dataset here.
  3. Categorizing the corpus according to gender and the patient journey framework, from initial symptoms to diagnosis, treatment, and follow-up care – process already finished: dataset here.
  4. Documenting the dataset and creating a codebook explaining the categories and the criteria for the categorization process – process already finished: code book and the connection of the ontologies.
  5. Applying categorization based on GPT-3 results – in progress.
  6. Comparing the manual classification with the GPT-3 results (a comparison sketch follows this list).
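
A minimal sketch of step 6, comparing the manual annotations against model output with simple agreement metrics. The two label lists are placeholders standing in for the real dataset and the GPT-3 output.

```python
# Minimal sketch of comparing manual annotations with GPT-3 labels.
# Both label lists are invented placeholders.
from sklearn.metrics import classification_report, cohen_kappa_score

manual_labels = ["symptoms", "diagnosis", "treatment", "follow-up", "treatment"]
gpt3_labels = ["symptoms", "treatment", "treatment", "follow-up", "diagnosis"]

# Cohen's kappa measures agreement beyond chance; the report breaks
# precision and recall down per patient-journey stage.
print(cohen_kappa_score(manual_labels, gpt3_labels))
print(classification_report(manual_labels, gpt3_labels, zero_division=0))
```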

Literature review

Several linguistic analyses and corpus analysis studies have investigated the patient journey in healthcare, exploring different aspects of communication between patients and healthcare providers, patient experience, and clinical outcomes. One area of research has focused on the use of language by healthcare providers to diagnose and treat patients. For example, a study by Roter and Hall found that physicians used a directive communication style, using commands and suggestions, more often than a collaborative communication style when interacting with patients. This style can create a power imbalance between the physician and patient, potentially leading to dissatisfaction or miscommunication. 

Another area of research has investigated patient experience and satisfaction. A corpus analysis study by Gavin Brookes and Paul Baker examined patient comments to identify factors influencing patient satisfaction with healthcare services during cancer treatment. They found that factors such as communication, empathy, and professionalism were key drivers of patient satisfaction.

Finally, several studies have investigated the use of language in electronic health records (EHRs) to improve patient care and outcomes. A corpus analysis study by Xi Yang and colleagues examined the use of EHRs and found that natural language processing techniques could effectively identify relevant patient information from unstructured clinical notes.

Overall, the literature on linguistic analyses and corpus analysis studies on healthcare patient journey suggests that communication and language play a critical role in patient care and outcomes. Effective communication between patients and healthcare providers, as well as clear and concise language in patient education materials and EHRs, can lead to improved patient satisfaction, empowerment, and self-management.

Method overview

  • Data collection: collecting data based on keywords on social media (a minimal sketch follows this list);
  • Coding data: using qualitative coding and annotation;
  • Data analysis: performing linguistic and statistical analysis.
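
The collection step might look like the following sketch, which filters an exported batch of posts by keyword. The file name, column names, and keywords are hypothetical; real collection would go through each platform’s API and terms of service.

```python
# Minimal sketch of keyword-based filtering over an exported batch of
# social media posts. File name, columns, and keywords are hypothetical.
import pandas as pd

KEYWORDS = ["asthma", "inhaler", "chronic", "diagnosis"]  # assumed search terms

posts = pd.read_csv("posts_export.csv")  # hypothetical export file
mask = posts["text"].str.lower().str.contains("|".join(KEYWORDS), na=False)
corpus = posts[mask]

print(f"Kept {len(corpus)} of {len(posts)} posts matching the keywords")
```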

References

Roter, D., and Hall, J. “Doctors Talking with Patients—Patients Talking with Doctors: Improving Communication in Medical Visits.” Clinical and Experimental Optometry, 78(2), pp. 79–80.

Yang, X., Chen, A., PourNejatian, N. et al. A large language model for electronic health records. npj Digit. Med. 5, 194 (2022). https://doi.org/10.1038/s41746-022-00742-2

Peterson KJ, Liu H. The Sublanguage of Clinical Problem Lists: A Corpus Analysis. AMIA Annu Symp Proc. 2018 Dec 5;2018:1451-1460. PMID: 30815190; PMCID: PMC6371258.

Adolphs, S., Brown, B., Carter, R., Crawford, C., and Sahota, O. (2004) ‘Applying Corpus Linguistics in a health care context’, Journal of Applied Linguistics, 1(1): 9-28.

Adolphs, S., Atkins, S., Harvey, K. (forthcoming). ‘Caught between professional requirements and interpersonal needs: vague language in healthcare contexts’. In J. Cutting (ed.) Vague Language Explored. Basingstoke: Palgrave.

Skelton, J.R., Wearn, A.M., and Hobbs, F.D.R. (2002) ‘“I” and “we”: a concordancing analysis of how doctors and patients use first person pronouns in primary care consultations’, Family Practice, 19(5): 484-488.

Biber, D. and Conrad, S. (2004) ‘Corpus-Based Comparisons of Registers’, in C. Coffin, A. Hewings, and K. O’Halloran (eds) Applying English Grammar: Functional and Corpus Approaches. London: Arnold.

Abstract: Is analyzing gender using computational text analysis ethical?

Often in the tech world we hear of algorithms that can accurately predict the gender of the person who wrote a particular document, tweet, etc. Is this inherently unethical? To find an answer, we can use the following questions as a jumping-off point.

In every project there is the potential for biases to be introduced. Some may ask how this could be possible if an algorithm is doing all the work, but that idea is mistaken. There are people behind every algorithm: it is trained on data provided by people whose thoughts, feelings, and opinions can be translated into the training material. Does the training data perpetuate gender stereotypes or other biases?
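
One concrete way to probe that question is to inspect what a trained model has actually learned. The sketch below fits a tiny classifier on invented example texts and prints its most heavily weighted features; in a real audit, stereotyped words surfacing at the top would be a red flag. All texts and labels here are placeholders.

```python
# Minimal sketch of auditing a text classifier for stereotyped features.
# Training texts and author-gender labels are invented placeholders.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "loved the shopping trip and brunch with friends",
    "watched the game and grilled steaks all afternoon",
    "spent the day gardening and baking cookies",
    "fixed the car engine then hit the gym",
]
labels = ["F", "M", "F", "M"]

vec = CountVectorizer()
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

# Features with the largest absolute weights drive the predictions;
# if these are stereotype words, the model has learned the stereotype.
weights = clf.coef_[0]
top = np.argsort(np.abs(weights))[::-1][:5]
for i in top:
    print(vec.get_feature_names_out()[i], round(weights[i], 2))
```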

Another element to consider is privacy. When collecting information about the genders of authors, how is that data being used within the project? Was consent even obtained from the individuals providing the data? Was this communicated to the participants? If the data were exposed, would it cause harm? Would it be possible to anonymize the data and still produce significant results?

It is also important to consider social and political context when attempting to analyze gender using computational text analysis. Do the results perpetuate power dynamics between socially constructed gender roles? If so, they could reinforce what has been ingrained in our society. Constructs, however, change over time: have historical and cultural contexts been taken into account to eliminate misunderstandings of the results? And since gender does not stand on its own, was an intersectional approach taken within the experiment? Other social categories such as race, social class, and sexuality are highly intertwined with it.

Proposal – Atilio Barreda II

Proposal:

In light of Nan Z. Da’s “The Computational Case against Computational Literary Studies,” which exposes the limitations of traditional computational methods in literary studies, my research project will focus on understanding the humanistic and philosophical concepts embedded in techniques like cosine similarity, Euclidean distance, and Latent Dirichlet Allocation (LDA). I aim to deconstruct the foundations of these methods and examine the potential for developing text analysis approaches that are sensitive to humanistic and philosophical dimensions.
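
To ground that discussion, here is a small sketch of two of the measures in question, showing how cosine similarity and Euclidean distance can judge the same pair of documents differently. The word-count vectors are toy values, invented for illustration.

```python
# Toy illustration: cosine similarity vs. Euclidean distance on
# word-count vectors. The counts are invented for illustration.
import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

# Two "documents" with identical word proportions but different lengths,
# and a third with different proportions.
short_text = np.array([2.0, 1.0, 0.0])
long_text = np.array([20.0, 10.0, 0.0])
other_text = np.array([2.0, 1.0, 3.0])

print(cosine_similarity(short_text, long_text))   # 1.0: same "direction"
print(euclidean_distance(short_text, long_text))  # ~20.1: very "far"
print(euclidean_distance(short_text, other_text)) # 3.0: much "closer"
```

The choice between the two measures is not neutral: cosine similarity treats documents of very different lengths as identical, an assumption that may or may not be defensible for a given literary question, and exactly the kind of embedded commitment this project aims to deconstruct.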

To achieve this, I will begin by examining the underlying assumptions and principles of these methods, assessing their ability to capture the intricacies of literary works. I will draw on examples from Da’s critique and other instances within computational literary studies to identify common pitfalls and limitations of these techniques.

Next, I will explore interdisciplinary methodologies that can complement and improve upon traditional computational methods. By incorporating insights from fields such as linguistics, philosophy, and literary theory, I hope to begin developing a more robust and nuanced analytical framework.

Feminist Instructions

“It wasn’t a match, I say. It was a lesson.” Claudia Rankine, Citizen: An American Lyric (Graywolf, 2014)

In her introduction to Living a Feminist Life, scholar and activist Sara Ahmed adopts bell hooks’ definition of feminist work: “the movement to end sexism, sexual exploitation and sexual oppression”(hooks, 2000, cited in Ahmed, 2016).  I in turn take Ahmed’s description of “a scene of feminist instruction” as a starting point for an imagined feminist text analysis. Ahmed writes,

we hear histories in words; we reassemble histories by putting them into words . . . . attending to the same words across different contexts, allowing them to create ripples or new patterns like texture on a ground. I make arguments by listening for resonances . . . . The repetition is the scene of a feminist instruction.

Hence, text analysis, with its focus on repeated words as quantifiable data points that reveal the workings of a text or texts, would by its very nature seem to be feminist.

And yet, as Koen Leurs and Sayan Bhattacharyya demonstrate, it’s not at all that simple.

In “Text Analysis for Thought in the Black Atlantic,” Sayan Bhattacharyya points out that “many methods of text analysis prove problematic, because they make an unwarranted assumption about the stability and constancy of the relation between words and their meanings across time.” Proposing Glissant’s notion of “archipelagic thinking in space (and its counterpart in time)” as a way of “pay[ing] attention to variation within, as well as to the specificity of, word-concepts” (Bhattacharyya, 80), Bhattacharyya traces a genealogy of Glissant’s metaphor back through the writings of Aimé Césaire and C. L. R. James, and thus suggests that the Digital Black Atlantic as “the body of interdisciplinary scholarship that examines connections between African diasporic communities and technology” (Introduction, Risam and Baker Josephs) can, like Paul Gilroy’s eponymous challenge to Eurocentric white supremacist studies, “perform a similar decentering of the epistemological assumptions that underlie digital humanities in general by problematizing its tools” (Bhattacharyya 82). In particular, “[b]y taking the relationships between words (expressed as co-occurrences of words), rather than the words themselves, as the basic unit of representation” (Bhattacharyya 81), word vectors “are not only a convenient technology to capture semantic relationships but also are . . . productive for problematizing concepts in the text and even for raising epistemological questions about the status of concepts themselves in relation to the text” (Bhattacharyya 81).
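
By way of illustration, the sketch below trains two tiny word2vec models on two hypothetical period-specific corpora and compares a word’s nearest neighbors in each, a crude stand-in for the instability of word-meaning across time that Bhattacharyya describes. The “corpora” here are placeholder sentences, far too small for meaningful vectors.

```python
# Minimal sketch: train word vectors on two period-specific corpora and
# compare the same word's neighbors in each. The corpora are placeholder
# token lists, far too small for meaningful results.
from gensim.models import Word2Vec

corpus_1900s = [
    ["the", "atlantic", "crossing", "was", "long", "and", "perilous"],
    ["ships", "carried", "cargo", "across", "the", "atlantic"],
]
corpus_2000s = [
    ["the", "atlantic", "published", "a", "new", "essay", "online"],
    ["read", "the", "atlantic", "article", "on", "your", "phone"],
]

model_a = Word2Vec(corpus_1900s, vector_size=50, min_count=1, seed=1)
model_b = Word2Vec(corpus_2000s, vector_size=50, min_count=1, seed=1)

# The same surface word occupies different neighborhoods in each corpus,
# so treating its meaning as stable across corpora is an assumption.
print(model_a.wv.most_similar("atlantic", topn=3))
print(model_b.wv.most_similar("atlantic", topn=3))
```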

Likewise, in “Feminist Data Studies: Using Digital Methods for Ethical, Reflexive and Situated Socio-Cultural Research,” Koen Leurs points out that “Digital data is performative and context-specific” (Leurs, 143); as a result, a would-be feminist data researcher needs to “consider . . . text, users and materiality from a relational perspective” (Leurs 133). Asking, “[h]ow can we draw on user-generated data to understand agency vis-à-vis structures of individuality and collectives across intersecting axes of difference?” as well as “[h]ow can we strategically mobilise digital methods in a non-exploitative way to illuminate everyday power struggles, agency and meaning-making?” (Leurs 133), Leurs offers a case study and “road map” for the self-interrogating, “research participant-centered” (132), “alternative data-analysis practice” (139) that might better align with feminist and post-colonial ethics.

While Leurs himself demonstrates how Facebook TouchGraph’s visualizations of users’ relationships, even when jointly created, can generate alienation, hostility, and confusion in participants, necessitating adaptive understandings of data and collaboration, my presentation will focus in particular on his section entitled “Dependencies and relationalities” to explore whether what I tentatively term “relational textual analysis” might afford an epistemological as well as material model for a feminist textual analytic practice.

Ahmed, Sara, Living a Feminist Life. Duke University Press, 2016. Project MUSE. muse.jhu.edu/book/69122.

Bhattacharyya, Sayan. “Text Analysis for Thought in the Black Atlantic,” in The Digital Black Atlantic, Roopika Risam and Kelly Baker Josephs, eds., pp. 77-83.

Leurs, Koen. “Feminist Data Studies: Using Digital Methods for Ethical, Reflexive and Situated Socio-Cultural Research.” Feminist Review, 2017, pp. 130–154. The Feminist Review Collective. 0141-7789/17.

Abstract for roundtable

Data-Driven Feminist Text Analysis: Exploring the Significance of Computational Methods and Digital Humanities Tools in Literary and Cultural Studies

This roundtable examines the role of feminist text analysis in literary and cultural studies, with a particular focus on the use of data and code-based tools to support this approach. Drawing on established feminist theories and practices of text analysis, we argue that feminist text analysis is a crucial lens for understanding how gender and power dynamics shape the production and reception of literature and other cultural artifacts.

We explore the ways in which computational methods and digital humanities tools can support feminist text analysis, including text mining, machine learning, and other data-driven approaches. Machine learning algorithms, for example, can be trained to identify and classify gendered language and stereotypes in texts, which can then be used to quantify and analyze patterns of gender bias and discrimination. This can enable feminist text analysts to identify and critique problematic representations of gender in literature and other cultural artifacts more efficiently and effectively.
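
As a deliberately simple stand-in for that quantification step (a lexicon count rather than a trained model), the sketch below tallies gendered terms in a text. The word lists are illustrative placeholders, not a validated lexicon.

```python
# Deliberately simple stand-in for quantifying gendered language:
# a lexicon count rather than a trained model. The word lists are
# illustrative placeholders, not a validated lexicon.
from collections import Counter
import re

GENDERED_TERMS = {
    "feminine": {"she", "her", "mother", "nurse", "hysterical"},
    "masculine": {"he", "his", "father", "doctor", "rational"},
}

def gendered_counts(text):
    tokens = Counter(re.findall(r"[a-z']+", text.lower()))
    return {
        label: sum(tokens[w] for w in words)
        for label, words in GENDERED_TERMS.items()
    }

sample = "She was hysterical, the doctor said, while he remained rational."
print(gendered_counts(sample))  # {'feminine': 2, 'masculine': 3}
```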

Considering the challenges and limitations of these tools is also crucial, including the potential for bias and the need for critical awareness of what they cannot do. To support this argument, we present examples of feminist text analyses that have successfully navigated these challenges, including studies on the representation of gender in children’s books, the use of the word “hysterical” on Twitter, and the gendering of job titles in academia. These examples demonstrate the potential of feminist text analysis to uncover patterns of gender bias and inequality, and to contribute to the promotion of gender equality and social justice. Ultimately, we argue that feminist text analysis is an essential approach to literary and cultural studies that can help us create more inclusive and equitable representations of gender in our culture.

Roundtable Abstract: The Days of the LLM Are Numbered

A recently leaked Google document claims, “We Have No Moat, And Neither Does OpenAI”. This is quite a big claim against OpenAI, the mastermind behind society’s enchanting creative LLM, ChatGPT. The document mostly refers to the significant rise of boutique indie open-source model developers and tweakers, who have leapfrogged toward deploying models that are only fractionally less effective than renowned LLMs but have significantly lower computational cost. Even OpenAI CEO Sam Altman has acknowledged that the era of ever-larger language models is over. In this paper, some of the open-source language models that pose significant challenges to LLMs will be analyzed. The analysis will look at the underlying technology and results of these models, and answer questions such as what these open-source models are doing differently to challenge the norms and why they have lower computational requirements.

Abstract: Investigating Indices of Impermanence

The human response to uncertainty, impermanence, and the unknowable is centrally located at the emergence of cultural and social practice. Discomfort tied to the fluctuating state of the knowable can lead to an outsized cultural emphasis on rigid systems of dissecting, parsing, naming, and sorting, be that at the level of culturally defined gender roles or linguistic analysis. Digital text analysis, despite its rootedness in concepts of precision, empirical process, and logic, offers outcomes akin to a blurry photograph of a subject in motion. It does provide evidence of materiality, but the details, origin, trajectory, and ongoing development of its subject fail to manifest under its lens. Feminism does not ask us to disregard digital text analysis because of its limitations; it asks us to consider its process and outcomes as circumscribed evidence of an iteration of ongoing knowledge creation, impacted by the interventions of researchers, authors, and editors.

As exemplified by Standpoint Theory, which asks us to recognize knowledge as stemming from social position and therefore unfixed and subjective, and by the practice of acknowledging what is not named, not performed, and not visible in representations of information and experience, Feminism pushes back against the concept of a finite and universally experienced perception of the world. This work argues that these feminist practices, inherently linked to Barthes’s discussion of a “work” as an iteration or “fragment of substance” in relation to a “text” as an evolving formulation or “methodological field”, are critical in examining the limitations of digital text analysis in documenting the complex, transient, and embedded knowledge referenced in the literary works it seeks to investigate. Although text analysis may capture evidence of subjectivity and social performance, unearthing the depth of the underlying “methodological field” from which the work was derived requires a complex contextual framework outside the purview of current digital text analysis tools.

  • Eckert, Penelope, and Sally McConnell-Ginet. Language and Gender. 2nd Edition. Cambridge: Cambridge UP, 2013. pp. 1-36.
  • D’Ignazio, Catherine, and Lauren Klein. “Chapter Two: On Rational, Scientific, Objective Viewpoints from Mythical, Imaginary, Impossible Standpoints” and “Chapter Six: The Numbers Don’t Speak for Themselves.” Data Feminism. Cambridge, MA: MIT Press, 2020.
  • Barthes, Roland. The Rustle of Language. (R. Howard, Trans.). Farrar, Straus and Giroux, Inc., 1986.
  • McGann, Jerome J. “Introduction: Texts and Textualities,” “The Textual Condition,” and “How to Read a Book.” The Textual Condition. Princeton, NJ: Princeton UP, 1991. Print. Princeton Studies in Culture/Power/History.