Author Archives: Muhammad Rakibul Islam (Rakib)

Book Review – “Raw Data” Is an Oxymoron

Data has become such a phenomenon in the modern world that people began (and continue) claiming that it provides an objective set of facts, the fundamental stuff of truth itself, and that, consequently, the methods and tools for analyzing it would truly revolutionize everything and allow us a "raw" and "unfiltered" understanding of the world around us.


The book “Raw Data” Is an Oxymoron stands as an answer to that claim, critiquing and analyzing the fundamental element of the process itself: data. While much has changed since the book was released in 2013 (like the advent of large language models and other AI-powered automated tools released daily), it can be taken as an early work on the subject that seeks to uncover what data really is.


The book is actually a compilation of essays offering different perspectives on data through different lenses of observation. While it is divided into chapters, each an essay by a different author (or authors), each pair of chapters can be taken as a separate “topic” or “section” of the book (as outlined in the introduction).


The first section takes a historical and cultural look at some of data's early beginnings (including how the word came to be, its meanings, and its development over time) and offers a critique of early modern arithmetic. (I believe this chapter is included because data is tied to math, but it felt a little out of place because it delved more into a critique of arithmetic than of data.) The second section looks at the formulation, creation, and collection of data in different disciplines (one being economics, the other astronomy). The third section comprises essays on databases, data aggregation, and information management, exploring the ethical implications of data and how it can be used to influence and manipulate. The fourth and final section concerns the use of data in society: a sort of exploration of the future of data, how it came to impact people in the (then) current world, and the challenges that large (and ever-growing) sets of data pose as data itself seems to be becoming an organism of its own.


As the title suggests, what the book argues through its essays and chapters is that data is never “raw” and is rather “cooked.” Even considered alone, data is not raw: it is a reflection of a world which itself has biases and issues, and those are transferred over. Data is also “conceptualized” and “collected,” and that process embeds the biases of the humans, or human-made tools, doing the collecting. Data must then be manipulated, cleaned, structured, and processed before any analysis can be performed and observations made; these steps further “cook” the data we began with. When an analysis is performed using particular models run through particular tools, both the models and the tools carry embedded biases and ignore contexts, and this applies to any field, whether economics, astronomy, or others. Finally, even after the analysis is done, the results have to be interpreted, which brings in the biases of the humans doing the interpreting.


So what we essentially understand is that data can never be raw or objective: at every stage it contains, grows, and evolves its limitations and biases. This is important to consider, since the analyses and conclusions we draw from data have wide implications in public policy, politics, education, business, and beyond.


The book connects to our class conversation at every stage we have discussed. By simply replacing “data” with “text” (the element of critique in our class), we can still apply the observation that text itself, and the processes performed to collect, process, clean, analyze, and interpret it, all carry hidden biases, which gives rise to the need to approach the problem from a feminist perspective.


Two quotes really showed the duality that data (or text, in our class discussions) possesses: “Data require our participation. Data need us.” and “Yet if data are somehow subject to us, we are also subject to data.” This connects to our discussions that data requires human labor and participation and does not exist without some form of human intervention; quite equally, we are subject to data, and in being subject to it, we are ourselves shaped by it. This is further discussed in an essay that writes, “The work of producing, preserving, and sharing data reshapes the organizational, technological, and cultural worlds around them.”


Another interesting idea can be summarized by the quote “When phenomena are variously reduced to data, they are divided and classified, processes that work to obscure – or as if to obscure – ambiguity, conflict, and contradiction.” That is a point of concern for us, especially as feminist scholars critiquing the topic: when we reduce real-world phenomena to data, we are in many cases oversimplifying and losing the depth and complexity of the subject (for example, classification algorithms can force real-world phenomena into a binary when they actually lie on a spectrum).


The chapter on economic theory and data also tied into our conversations about how “all models are wrong, but some are useful,” showing that economists’ use of data and modeling are really just approximations of reality, useful only “for the formulation in which they appear.”


Something I had not thought about earlier, but now completely agree with, can be summarized by the following sentence from the book: “Raw data is the material for informational patterns still to come, its value unknown or uncertain until it is converted into the currency of information.” It made me think about how data is only valuable and useful to us once it can be converted into that “currency of information,” in other words, when it provides information that is useful or profitable (in whichever sense the word applies, whether purely business or academic). This suggests that data has, in most cases, become a utilitarian as well as a capitalist tool.


The book also touches on how power dynamics influence the production, analysis, and interpretation of data, and how data can be used to influence, manipulate, and exert power, which is an important factor to consider when building a feminist critique.


While the book is now dated, given the many technological leaps (and dangers) of the last decade, it provides a good introduction to the topic, especially as a relatively early work in the field of data studies. It offers an introductory, theoretical understanding of data, however, rather than a practical framework for developing strategies to deconstruct and counteract the issues it raises (something that later works, like D’Ignazio and Klein’s Data Feminism, achieve).


However, it does produce early thoughts and perspectives that can be applied to the landscape of computational text analysis.


For example, the book’s central idea that data is “cooked,” or in other words constructed, is a core topic in computational text analysis (as well as among feminist scholars) and prompts all CTA work to be well documented and thoughtful about the ways biases and assumptions influence it.


The book also discusses the importance of context in understanding data, which is crucial and a hotly debated topic in the field.


It also touches not just on how humans influence data, but on how data evolves to almost mimic a living organism that begins to make us its subjects and to influence and control us. This is important to consider in modern computational text analysis, especially from a feminist perspective, when algorithms and AI dominate the field and influence society in every manner. The book itself, however, does not consider the current state of the topic.


Finally, the book draws on ideas from a diverse set of disciplines, be it an anthropological account, an economic analysis, a humanities critique, or a scientific development. That matters for computational text analysis, which is itself highly interdisciplinary and benefits from incorporating insights from different disciplines for a more nuanced understanding. Building a feminist text analysis likewise requires us to bring diversity to the perspectives of participants.

Abstract for Roundtable

“Data Feminism” by D’Ignazio & Klein identifies that data is not objective and that it reinforces existing social inequalities. Consequently, studying the hidden biases within a text is an important step in building a feminist analysis.

Intersectional feminist theories inform us that social inequalities are better reflected “not by a single axis of social division, but by many axes that work together and influence each other” (Collins & Bilge 2016, p. 2).

Following this idea, to properly critique text analysis, a feminist model should be bi-directional and multi-dimensional (spatial rather than scalar), encoding in itself the context in which words are used so as to relate them to the social divisions at play.

Models like LDA can decode topics but are not context-aware or spatial. Earlier word-embedding models like word2vec are spatial but not context-aware.

The word-embedding model BERT can be suitable in this case. BERT is context-aware and can capture the sense in which a word is used. Being multi-dimensional, it allows intersectional analysis to be performed, uncovering relationships between different contextual uses of words. With its sentiment-analysis and opinion-mining capabilities, we can uncover the sentiments and opinions expressed in the text concerning different social identities. Given the model’s customizability, it can also be fine-tuned to a specific domain.

However, BERT is notoriously computationally intensive, which is an issue for feminist scholars in terms of both environmental impact and accessibility. To achieve compression, we can use BERT to create context-aware word embeddings but apply knowledge distillation and pruning to reduce the computational burden while optimizing for accuracy.

HDTMT: Text Mining Oral Histories in Historical Archaeology

Paper Title: Text Mining Oral Histories in Historical Archaeology

Authors: Madeline Brown & Paul Shackel

Link: https://link.springer.com/article/10.1007/s10761-022-00680-5

Introduction:

Oral histories help provide important context for the past. They offer firsthand accounts from people who experienced particular historical events in the course of their own lives, a valuable resource for researchers exploring the social context, culture, politics, economics, etc. of the period being studied.

Text mining oral records and interviews is a relatively new method of retrieving important contextual information about the communities being studied in historical archaeology.

While the qualitative interpretation of texts typically conducted in historical archaeology is crucial, text mining and NLP can provide valuable insights that enhance and complement these methods.

The authors outline the following benefits:

(1) Rapid and efficient analysis of large volumes of data

(2) Reproducible workflows

(3) Reducing the potential for observer bias

(4) Structured analysis of subgroup differences

The authors also stress that they do not propose text mining and NLP as replacements for traditional techniques, but as supplementary tools.

This project not only demonstrates such techniques for historical archaeology; it also provides a reusable framework and identifies areas of improvement that can further enable the use of text analysis and NLP methodologies in this area of study.

Data Source:

26 oral interviews acquired in 1973 and transcribed in 1981, from the anthracite coal-mining region of Pennsylvania. They were collected by the Pennsylvania Historical and Museum Commission and are now on file at the Pennsylvania State Archives.

Link to Data Source:

Available on Github: https://github.com/maddiebrown/OralHistories

How Did They Make That:

– Interview transcripts that were available in PDF were converted to text format for analysis.

– Text was formatted so that everything was standardized, i.e., the same font, text size, etc.

– Converting PDF to text sometimes introduces extra characters and spaces; these were removed to tidy the text.

– Line breaks were added to signify speaker change.

– The text was then imported into R for further cleaning and then analysis.

– In R, metadata was removed. The text was then standardized: all terms were converted to lowercase (which, for example, avoids counting “coal” and “Coal” as separate terms), and numbers and apostrophes were removed.

– The text was then tokenized into individual words and bigrams, and stop words were removed “using tidytext’s stop_words lexicons: onix, SMART, and snowball.”

– After this, certain slang terms, names, and common words that did not affect the meaning of the text were removed.

– For the actual text analysis, the authors focused on word frequency and n-grams, and they discuss the future potential of sentiment analysis and tagged-lexicon analysis.
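The authors did this cleaning and tokenizing in R with tidytext; as a rough illustration, here is a minimal Python sketch of the same steps. The transcript snippet and stop-word list are fabricated for the example and are not taken from the interviews.

```python
import re
from collections import Counter

def clean_text(raw):
    """Standardize a transcript chunk: lowercase, drop numbers and apostrophes,
    and collapse the stray whitespace that PDF-to-text conversion leaves behind."""
    text = raw.lower()
    text = re.sub(r"[0-9']", "", text)        # remove numbers and apostrophes
    text = re.sub(r"\s+", " ", text).strip()  # collapse extra spaces
    return text

def tokenize(text, stop_words):
    """Split into word tokens and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text) if w not in stop_words]

def bigrams(tokens):
    """Pair each token with its successor."""
    return list(zip(tokens, tokens[1:]))

# Toy example (not the authors' data): a fabricated interview snippet.
stop_words = {"the", "in", "was", "and", "a", "we"}
raw = "The Coal  mines in 1973 ... we worked the coal mines and the coal was hard."
tokens = tokenize(clean_text(raw), stop_words)

word_freq = Counter(tokens)          # "coal" and "Coal" now count as one term
bigram_freq = Counter(bigrams(tokens))

print(word_freq.most_common(2))
print(bigram_freq.most_common(1))
```

The real pipeline additionally strips metadata and uses curated stop-word lexicons, but the frequency and n-gram counts work the same way.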

Tools/skills you would need to do the project yourself:

– R Programming Language

– Text mining and NLP concepts (for example what stop words to remove and why)

– R Packages Used: tidyverse, textclean, tidytext, and cleanNLP

Tutorials:
(I used Google search to find primary resources; interested readers are encouraged to seek out further resources for a deeper understanding)

– R: https://www.datacamp.com/courses/free-introduction-to-r

– Text mining with R: https://www.tidytextmining.com

– NLP with R: https://s-ai-f.github.io/Natural-Language-Processing/

– tidyverse R package: https://www.tidyverse.org/learn/

– textclean R package: https://cran.r-project.org/web/packages/textclean/index.html

– tidytext R package: https://cran.r-project.org/web/packages/tidytext/index.html

– cleanNLP R package: https://cran.r-project.org/web/packages/cleanNLP/index.html

Response Blog Post – Week 9

AI Ethics: Identifying Your Ethnicity and Gender

This reading showcases the use of NLP and Machine Learning to build an AI model that can accurately identify gender based on a single tweet.

But it begins with questions around the ethics of allowing AI to detect ethnicity and gender, follows up with more questions, and then, before even tackling the pros and cons properly, dives straight into building a model to detect gender from a tweet.

The article then showcases how that is achieved: how the data is cleaned and processed, how the model is built, and finally how it is applied to showcase its accuracy. Yet even the model leaves more questions than answers. What percentage of model accuracy is acceptable? Are tweets from celebrities the best training dataset for predicting gender for everyone, or just for a particular niche of people?

But coming back to the ethical question of allowing AI to predict gender or ethnicity: in my personal opinion it can be very unethical and harmful in many ways, and here is one example I had in mind.

Let’s say you are applying for a job with your resume and cover letter, and the employer runs them through an AI model. The model has been trained on the employer's own dataset of employees and tries to detect those who are likely to be more “successful” at their jobs. Consider this in the finance industry, which is dominated mostly by white men. Since the model has been trained on a dataset that carries coded bias in both ethnicity and gender, it can cause a problem if somebody of a different ethnicity or gender applies (for example, a non-binary Hispanic person). This is why I personally would not advocate using AI to detect ethnicity and gender unless the training dataset is adequately representative of a diverse demographic, or there is an absolute need.

Beyond Fact-Checking: Lexical patterns as Lie Detectors in Donald Trump’s Tweets

As the title suggests, this paper sets out to build a model to detect lies in Donald Trump’s tweets.

But why does that matter, and why can’t we simply label things as lies in a journalistic context? Isn’t fact simply different from opinion, lies, or misinformation? Because, as the author explains, “the line between factual claims and statements of opinion can be difficult to draw” (Graves, 2016, p. 92). There can be “opinionated factual statements,” where a statement is indeed factual but draws on a substance of opinion; misinformation posed as questions, where you do not directly make a misinformed statement but ask a question that leads toward misinformed thinking; and misinformation woven into “otherwise truthful statements” (Clementson, 2016, p. 253), where the statement you are making is true but needs context, without which misinformation is propagated.

An even greater difficulty is understanding not only whether a statement is true or false but, if it is false, whether it was an “honest mistake or strategic misrepresentation.” There are debates on this question itself, including whether and exactly when journalism should present something as an intentional lie.

The research paper’s primary aim is to see whether any language patterns emerge in Donald Trump’s tweets that identify false information. It does this by comparing the language of his tweets when he shares true information and when he shares false information, finding the lexical patterns in each to create a model for detection.

The authors focused on patterns associated with lying in language, like the use of words expressing negative emotion, certain pronouns, etc. The study found that many of Trump’s tweets containing false or misleading information did indeed exhibit a high number of such patterns. This can be used in journalism and scholarship beyond simple fact-checking when needed.
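As a rough sketch of the kind of marker counting involved, the following Python snippet tallies hypothetical deception-associated lexical markers per tweet and compares average rates between groups. The word lists are invented for illustration and are not the paper’s actual lexicons.

```python
# Hypothetical marker lexicons for illustration only; the paper's real
# lexical categories and word lists are not reproduced here.
NEGATIVE_EMOTION = {"sad", "terrible", "disaster", "bad", "worst"}
THIRD_PERSON = {"he", "she", "they", "them", "his", "her", "their"}

def marker_counts(tweet):
    """Count deception-associated lexical markers in one tweet."""
    words = tweet.lower().split()
    return {
        "negative_emotion": sum(w.strip(".,!?") in NEGATIVE_EMOTION for w in words),
        "third_person": sum(w.strip(".,!?") in THIRD_PERSON for w in words),
    }

def compare(tweets_true, tweets_false):
    """Average marker rate per tweet for each group, to expose differing patterns."""
    def avg(tweets, key):
        return sum(marker_counts(t)[key] for t in tweets) / len(tweets)
    return {key: (avg(tweets_true, key), avg(tweets_false, key))
            for key in ("negative_emotion", "third_person")}

print(compare(["It was fine."], ["He is the worst, sad!"]))
```

A real study would use validated lexicons (and fact-checked labels for the true/false split), but the comparison logic is the same.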

Gendered Language in Teacher Reviews

This is an interactive visualization with a chart that lets you explore the words used to describe male and female teachers in a dataset of reviews from RateMyProfessor.com. A word or two-word phrase can be entered, and the chart shows the usage of that word (per million words of text) broken down by gender (male and female), which can be further broken down into “positive” or “negative” ratings, enabling exploration of the context in which the word was used.

So, basically, when a word or two-word phrase is typed in, the tool searches the database of RateMyProfessor.com reviews, counts how many times the words were used per million words of text, breaks that down by gender and further by positive or negative review, and creates a chart from it, while also showing which department/subject the professor teaches.
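A minimal Python sketch of the per-million-words rate the chart presumably computes; the toy review corpora standing in for the male- and female-professor review pools are fabricated.

```python
def per_million(term, corpus_tokens):
    """Rate of a term per million words of text, the unit used by the chart."""
    hits = sum(1 for w in corpus_tokens if w == term)
    return hits / len(corpus_tokens) * 1_000_000

# Fabricated toy corpora of review words, 1,000 tokens each.
male_reviews = ["funny", "smart", "funny", "boring"] * 250
female_reviews = ["smart", "helpful", "funny", "helpful"] * 250

print(per_million("funny", male_reviews))    # rate in the "male" pool
print(per_million("funny", female_reviews))  # rate in the "female" pool
```

Normalizing per million words matters because the two review pools are different sizes; raw counts alone would not be comparable.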

I personally think the visualization matches its intention and carries out the task quite swiftly. The data visualized matches the question asked, at least in its literal meaning.

I also particularly liked the breakdown by department. For example, searching for “programming” shows computer science, followed by engineering, as the departments where the word appears most, while “aesthetic” shows up most for fine arts.

It also made me stop and think: while we are trying to find out how certain words carry gender bias in reviews, I, as the person typing in the words, am deciding which words I think will show a gender bias (making me reflect on my own biases in gendered language).

Film Dialogue

This project comes in a few parts.

The first analyzes 30 Disney films and breaks down the screenplay dialogue by gender, showing that in 22 of the 30 films males dominate the dialogue, even in a film like Mulan, where the lead character is female.

The second visualization covers a larger dataset of 2,000 screenplays, for which the researchers linked characters with at least 100 words of dialogue to their respective IMDB pages, which indicate the gender of the actor. This is used to place screenplays on a scale from 100% of words male to 100% of words female; the chart is interactive, so you can hover over a film to see its name and the breakdown of dialogue percentage by gender.
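The underlying computation for that scale can be sketched as follows in Python; the character data here is fabricated, not taken from the project.

```python
def gender_word_share(characters):
    """characters: list of (gender, word_count) pairs.
    Returns the fraction of dialogue words spoken by female characters,
    i.e., the film's position on the 100%-male-to-100%-female scale."""
    total = sum(words for _, words in characters)
    female = sum(words for gender, words in characters if gender == "F")
    return female / total

# Fabricated cast: (gender, words of dialogue) per character.
cast = [("M", 3000), ("F", 1500), ("M", 500)]
print(f"{gender_word_share(cast):.0%} of words are female")
```

The project's extra step, linking each character to an IMDB page to obtain the actor's gender, is what makes this simple ratio computable at the scale of 2,000 screenplays.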

The third visualization considers only high-grossing films, defined as films ranking in the top 2,500 of the US box office. It shows the same general scale from 100% male to 100% female, but it does a better job of showing just how few films have female-dominated speech, or even a 50-50 gender balance.

In this visualization you can explore further by clicking on a film title: it shows the exact percentage breakdown of dialogue by gender, the top five characters (by number of words), and a minute-by-minute breakdown of male and female lines from start to end, showing the increase (or decrease) of each against the timeline of the film.

The fourth visualization is a bar chart of the percentage of dialogue by actor’s age, broken down by gender. It shows that women have fewer lines available once they are over the age of 40, while for men it is the exact opposite: more roles become available with age.

The last visualization is more interactive: one can search for a particular film and see its gender dialogue breakdown and the number of words per character. There are also filtering options, such as years and genres.

The visualizations do look really cool overall, and I just loved how clean and aesthetic everything is.

Gender Analyzer

This is a tool where one can enter text and it tries to analyze whether the text was written by a man or a woman. I frankly do not know how good this tool is; I have been very confused, as text I entered from women writers was labeled masculine and text from male writers was labeled feminine. It does claim an accuracy of about 70%, which means it is far from perfect. But I did not understand the use case, or why such a tool is even needed in the context it is supposed to serve.

Response Blog Post – Week 8

My key takeaway from this week’s readings is summarized by the title of Catherine D’Ignazio and Lauren Klein’s chapter in Data Feminism, “What Gets Counted Counts.” In the modern world, public policy and decision making, from the regional to the national to the international level, is based on conclusions drawn from the data being collected. What is in that data matters, because it dictates the takeaways we have and how we understand the world around us.

Being counted is the equivalent of having representation in the data age. When census surveys are conducted, for example, the allocation of resources, and even the amount of representation a region gets, is based on the results. This poses a problem in cities like New York City, where there is a significant undocumented immigrant population that the census fails to properly “count,” leading to fewer resources relative to the actual population of such regions.

In fact, during the last census drive I briefly volunteered with a community organization helping collect survey forms, and I saw in person how many people in the community, particularly undocumented immigrants and those seeking asylum, did not want to fill out the survey for fear that it would make them vulnerable. This creates a problem: most of these people already belong to a demographic underrepresented in politics, and without being counted in the census they will only become more underrepresented.

I also really liked the point that a lack of data, or missing data, can itself be representative of a situation. This is best described by the Guardian interactive “Does the New Congress Reflect You,” where users can select their demographic and see how many people like them are in Congress. Clicking “trans + nonbinary” leads to a blank map, showing that there are no such people in Congress. That is powerful: it shows the lack of representation for trans and nonbinary folks in the 2018 Congress.

Overall, the chapter presents the fact that data is not impartial; biases and inequalities are embedded in the whole process, from collection to analysis to the conclusions drawn, which in turn perpetuates those biases and inequalities. Hence, as data feminists, it is important to consider the misrepresentations and injustices embedded in the data process in order to confront and correct them.