
HDTMT: Text Mining Oral Histories in Historical Archaeology

Paper Title: Text Mining Oral Histories in Historical Archaeology

Authors: Madeline Brown & Paul Shackel

Link: https://link.springer.com/article/10.1007/s10761-022-00680-5

Introduction:

Oral histories provide important context for the past. They offer firsthand accounts from people who experienced particular historical events in the course of their own lives – valuable insight for researchers exploring the social context, culture, politics, economics, and so on of the period being studied.

Text mining oral records and interviews is a relatively new method of retrieving important contextual information about the communities being studied in historical archaeology.

While the qualitative interpretation of texts typically conducted in historical archaeology remains crucial, text mining and NLP can provide valuable insights that enhance and complement these methods.

The authors outline the following benefits:

(1) Rapid and efficient analysis of large volumes of data

(2) Reproducible workflows

(3) Reducing the potential for observer bias

(4) Structured analysis of subgroup differences

The authors also stress that they do not propose text mining and NLP as replacements for traditional techniques, but rather as supplementary tools.

This project not only shows an implementation of such techniques for historical archaeology; it also provides a reusable framework and identifies areas of improvement that could further enable the use of text analysis and NLP methodologies in this field.

Data Source:

Twenty-six oral interviews from the anthracite coal-mining region of Pennsylvania, acquired in 1973 and transcribed in 1981. They were collected by the Pennsylvania Historical and Museum Commission and are now on file at the Pennsylvania State Archives.

Link to Data Source:

Available on Github: https://github.com/maddiebrown/OralHistories

How Did They Make That:

– Interview transcripts that were available in PDF were converted to text format for analysis.

– The text was formatted so that everything was standardized, i.e., the same font, text size, and so on.

– Stray characters and extra spaces, which sometimes appear when a PDF is converted to text, were removed to tidy the text.

– Line breaks were added to signify speaker change.

– The text was then imported into R for further cleaning and then analysis.

– In R, metadata was removed and the text was standardized: all terms were converted to lowercase (which, for example, avoids counting “coal” and “Coal” as separate terms), and numbers and apostrophes were removed.

– The text was then tokenized into individual words and bigrams, and stop words were removed “using tidytext’s stop_words lexicons: onix, SMART, and snowball.”

– After this, certain slang terms, names, and common words that did not contribute to the meaning of the text were removed.

– For the analysis itself, the authors focused on word frequencies and n-grams, and they discussed the future potential of sentiment analysis and tagged-lexicon analysis; a minimal sketch of a comparable workflow in R appears after this list.
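The authors’ full code and data are in the GitHub repository linked above; what follows is only a minimal sketch of what a comparable tidytext workflow might look like, not their actual script. The "transcripts/" folder name is a placeholder, and the cleaning steps (lowercasing, stripping digits and apostrophes, removing stop words) simply mirror the steps described above.

```r
library(tidyverse)
library(tidytext)

# Read each plain-text transcript into a one-row-per-interview tibble
transcripts <- tibble(file = list.files("transcripts", full.names = TRUE)) %>%
  mutate(interview = basename(file),
         text = map_chr(file, ~ paste(readLines(.x), collapse = "\n")))

# Standardize: lowercase, then drop numbers and apostrophes
clean <- transcripts %>%
  mutate(text = str_to_lower(text),
         text = str_remove_all(text, "[0-9']"))

# Tokenize into single words and remove stop words (onix, SMART, snowball)
words <- clean %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# Word frequencies across the corpus
word_counts <- words %>% count(word, sort = TRUE)

# Bigrams, with stop words filtered from either position
bigrams <- clean %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("w1", "w2"), sep = " ") %>%
  filter(!w1 %in% stop_words$word, !w2 %in% stop_words$word) %>%
  count(w1, w2, sort = TRUE)
```

From word_counts and bigrams, frequency tables and n-gram plots follow directly; sentiment or tagged-lexicon analysis would build on the same tokenized data.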

Tools/skills you would need to do the project yourself:

– R Programming Language

– Text mining and NLP concepts (for example what stop words to remove and why)

– R Packages Used: tidyverse, textclean, tidytext, and cleanNLP

Tutorials:
(I used Google to find these starting resources; interested readers are encouraged to seek out additional material for a deeper understanding.)

– R: https://www.datacamp.com/courses/free-introduction-to-r

– Text mining with R: https://www.tidytextmining.com

– NLP with R: https://s-ai-f.github.io/Natural-Language-Processing/

– tidyverse R package: https://www.tidyverse.org/learn/

– textclean R package: https://cran.r-project.org/web/packages/textclean/index.html

– tidytext R package: https://cran.r-project.org/web/packages/tidytext/index.html

– cleanNLP R package: https://cran.r-project.org/web/packages/cleanNLP/index.html

Response Blog Post – Week 9

AI Ethics: Identifying Your Ethnicity and Gender

This reading showcases the use of NLP and machine learning to build an AI model that identifies gender based on a single tweet.

It begins with questions around the ethics of allowing AI to detect ethnicity and gender, follows up with more questions, and then, before the author even properly weighs the pros and cons, dives straight into building a model to detect gender from a tweet.

The article then shows how that is achieved: how the data is cleaned and processed, how the model is built, and finally how it is applied to demonstrate its accuracy. However, even the model itself leaves more questions than answers, such as: what percentage of model accuracy is acceptable? Are tweets from celebrities the best training dataset to predict gender for everyone, or just for a particular niche of people?

Coming back to the ethical question of allowing AI to predict gender or ethnicity: in my personal opinion, it can be very unethical and harmful in many ways, and here is one example I had in mind.

Let’s say you are applying for a job with your resume and cover letter, and the employer runs them through an AI model. That model has been trained on the employer’s own dataset of employees and tries to detect who is likely to be more “successful” at the job. Consider this in the finance industry, which is dominated mostly by white men. Since the model has been trained on a dataset that carries coded bias in both ethnicity and gender, it can cause problems when somebody with a different ethnicity or gender applies (for example, a non-binary Hispanic person). This is why I personally would not advocate using AI to detect ethnicity and gender unless the training dataset itself is adequately representative of a diverse demographic, or there is an absolute need.

Beyond Fact-Checking: Lexical Patterns as Lie Detectors in Donald Trump’s Tweets

As the title suggests, this paper is about building a model to detect lies in Donald Trump’s tweets.

But why does that matter, and why can’t journalists simply label things as lies? Aren’t facts just different from opinions, lies, and misinformation? Because, as the author explains, “the line between factual claims and statements of opinion can be difficult to draw” (Graves, 2016, p. 92). There can be “opinionated factual statements,” where a statement is indeed factual but rests on a substance of opinion; misinformation framed as questions, where you do not directly make a misinformed statement but ask a question that plants a misinformed thought; and misinformation woven into “otherwise truthful statements” (Clementson, 2016, p. 253), where the statement you make is true but needs more context, and without it misinformation is propagated.

An even greater difficulty is understanding not only whether a statement is true or false but, if it is false, whether it was an “honest mistake or strategic misrepresentation.” There is debate over this question itself: whether, and exactly when, journalism should present something as an intentional lie.

The research paper’s primary aim is to see whether any language patterns emerge in Donald Trump’s tweets that could identify false information. It does this by comparing the language of his tweets when he shares true versus false information and finding the lexical patterns in each to create a model for detection.

They focused on patterns associated with lying in language, such as the use of negative-emotion words, certain pronouns, and so on. The study found that many of Trump’s tweets containing false or misleading information did indeed show a high number of such patterns. This can be used in journalism and scholarship to go beyond simple fact-checking when needed.

Gendered Language in Teacher Reviews

This is an interactive visualization with a chart that allows exploration of the words used to describe male and female teachers, drawn from a dataset of reviews from RateMyProfessor.com. A word or two-word phrase can be entered, and the chart shows how often it is used (per million words of text) broken down by gender (male & female); this can be further broken down into “positive” or “negative” ratings, which enables further exploration of the context in which the word was being used.

So, basically, when a word or two-word phrase is typed in, the tool queries the database of reviews from RateMyProfessor.com, looks up how many times those words have been used per million words of text, breaks the counts down by gender and then by positive or negative review, and creates a chart from them, while also showing which department / subject the professor teaches.
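For what it’s worth, the “per million words” figure is just a rate. Here is a hypothetical back-of-the-envelope version in R, with invented numbers rather than the site’s actual data or code:

```r
library(dplyr)

# Invented counts for one search term, purely to illustrate the rate calculation
reviews <- tibble::tibble(
  gender      = c("male", "female"),
  term_count  = c(412, 127),      # times the term appears in reviews of that gender
  total_words = c(8.2e6, 6.9e6)   # total words of review text for that gender
)

reviews %>%
  mutate(uses_per_million = term_count / total_words * 1e6)
```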

I personally think the visualization does match the intention and carries out the task quite swiftly. The data visualized matches the question asked in at least its literal meaning.

I also particularly liked the breakdown by department. For example, searching for “programming” shows computer science, followed by engineering, as the departments where the word appears most, while “aesthetic” shows up most for fine arts.

It also made me stop and think: while we are trying to find out how certain words can carry gender bias in reviews, I, as the person typing in words, am making decisions about which words I think will show a gender bias (which made me reflect on my own biases around gendered language).

Film Dialogue

This project comes in a few parts.

The first analyzes 30 Disney films and breaks down the screenplay dialogue by gender, showing that in 22 of the 30 films male characters dominate the dialogue, even in a film like Mulan where the lead character is female.

The second visualization draws on a larger dataset of 2,000 screenplays, in which the researchers linked every character with at least 100 words of dialogue to their respective IMDB page, which identifies the actor’s gender. This is then used to plot the screenplays on a scale from 100% of words spoken by men to 100% spoken by women; the chart is interactive, so you can hover over a film to see its name and the breakdown of dialogue percentage by gender.

The third visualization considers only high-grossing films, defined as films ranking in the top 2,500 at the US box office. It shows the same general scale from 100% male to 100% female as before, but it does a better job of showing just how few films have female-dominated speech or even a 50-50 gender balance.

In this visualization you can explore further by clicking on a film title: it shows the exact percentage breakdown of dialogue by gender, the top five characters (by number of words), and a breakdown of the male and female lines by minute of the film from start to end, showing how each rises or falls across the film’s timeline.

The fourth visualization is a bar chart of the percentage of dialogue by the actor’s age, broken down by gender. It shows that women have fewer lines available once they are over the age of 40, while for men it is the exact opposite: they have more roles available with age.

The last visualization is the most interactive one, where you can search for a particular film and see its gender dialogue breakdown and the number of words per character. There are also some filtering options, such as year and genre.

The visualizations do look really cool overall, and I just loved how clean and aesthetic everything is.

Gender Analyzer

This is a tool where one can enter text, and it tries to analyze whether the text was written by a man or a woman. I frankly do not know how good the tool is; I was quite confused after pasting in text by women writers that was labeled masculine and text by male writers that was labeled feminine. It does state an accuracy of about 70%, so it is far from perfect. More than that, I did not understand the use case, or why you would even need this in the context it is supposed to be used.

Response Blog Post – Week 8

My key takeaway from this week’s readings is summarized by the title of Catherine D’Ignazio and Lauren Klein’s chapter in Data Feminism, “What Gets Counted Counts.” In the modern world, public policy and decision-making at the regional, national, and international levels are based on conclusions drawn from the data being collected. And what is in that data matters, because it dictates what takeaways we have and how we understand the world around us.

Being counted is the equivalent of having representation in the data age. When census surveys are conducted, for example, the allocation of resources and even the amount of representation a region gets are based on the results. This poses a problem in cities like New York City, where there is a significant undocumented immigrant population that the census fails to properly “count,” leading to fewer resources than the actual population of such regions warrants.

In fact, during the last census drive I briefly volunteered with a community organization that was helping collect the survey forms, and I saw in person how many people in the community, particularly undocumented immigrants and those seeking asylum, do not want to fill out the survey for fear that it makes them vulnerable. This creates a problem because most of these people already belong to demographics that are underrepresented in politics, and without being counted in the census they will only become more underrepresented.

I also really liked the point that sometimes lack of data or missing data is representative of a situation itself. This is best described by the Guardian interactive “Does the New Congress Reflect You” where users can select their demographic and see how many people like them are in Congress. Clicking on “trans + nonbinary” leads to a blank map showing that there are no people in Congress like them. That is powerful as it shows the lack of representation for trans & nonbinary folks in the 2018 Congress.

Overall, the chapter presents the fact that data is not impartial; biases and inequalities are embedded throughout the process, from collection to analysis to the conclusions drawn, which in turn perpetuates those same biases and inequalities. Hence, as data feminists, it is important to consider the misrepresentations and injustices embedded in the data process in order to confront and correct them.

Book Report & Review Blog Post_YW

Weapons of Math Destruction – Cathy O’Neil

a.) summarizes the main takeaways of the book for classmates who have not had the opportunity to read it

Cathy O’Neil’s book “Weapons of Math Destruction” examines how mathematical models and algorithms are frequently employed to support and maintain systemic injustice and inequality. According to O’Neil, these “Weapons of Math Destruction” (WMDs) have serious adverse effects on people and society as a whole, especially on vulnerable and marginalized groups. The book offers numerous examples of WMDs being used in various fields, including hiring, advertising, education, and criminal justice. O’Neil, for instance, demonstrates how discriminatory and unfair predictive models may be when used to assess a teacher’s performance or a person’s creditworthiness.

The key lessons to be learned from the book include the necessity for increased accountability and openness in the design and use of algorithms, as well as the significance of adding moral considerations into algorithmic decision-making. The book also emphasizes the potential for algorithms to reinforce and magnify prejudices, highlighting the significance of diversity and inclusion in the technology sector.

In the Introduction to the book, the author sets the stage for her argument by describing how algorithms are increasingly being used to make decisions that have significant consequences for people’s lives, such as who gets hired or fired, who gets a loan or a mortgage, and who gets sent to prison. She notes that these algorithms are often proprietary and secret, meaning that the people affected by them have no way of knowing how they work or of challenging their decisions.

There’s a popular saying that “men lie, women lie, but numbers don’t.” Because people tend to believe that ‘numbers don’t lie,’ many are cowed into submission by anything that is based on numbers. However, in “Weapons of Math Destruction” the author shows how warped mathematical and statistical models embedded in algorithms are used against ordinary people. These ‘algorithmic’ decisions tend to entrench existing inequalities by empowering the rich and powerful against the helpless masses. She debunks the notion of algorithmic neutrality with the argument that algorithms are based on data obtained from the recorded behaviors and choices of people – much of which is flawed.

The author confirms Kate Crawford’s observation about the use of obscuring by mystification to conceal the truth from the people affected. When confronted, computer scientists tend to give answers suggesting that the internal operations of the algorithms are ‘unknowable,’ thereby slamming the door on all questioning. In line with the theory of political economy, the author observes that the effectiveness of algorithms is evaluated on their ability to bring in the relevant currency – political power for politicians, money for business – but never on their effect on the people affected. Examples include the use of value-added modelling against teachers, and scheduling software that optimizes profits while exploiting desperate people and worsening workers’ working conditions and social lives. Another example is political microtargeting, which undermines democracy and gives politicians an avenue to be elusive by being ‘many things to many people.’

b.) connects the book to our class conversations:

The book makes various connections to our class discussions on feminism and feminist text analysis. First, it draws attention to the ways in which algorithms can perpetuate systemic prejudices and discrimination, which can have serious adverse effects on disadvantaged and vulnerable communities, including women. Second, the book emphasizes the significance of including diverse viewpoints and voices in algorithmic decision-making processes, which is consistent with the feminist tenet of intersectionality. Finally, the book advocates for algorithmic decision-making to be more open and accountable, which is crucial for guaranteeing fairness and equity for all people, particularly women and other underrepresented groups.

c.) suggests what perspectives or new avenues of research and thought the book adds to the landscape of computational text analysis.

The book expands the field of computational text analysis by introducing a number of fresh viewpoints and lines of inquiry. One of the book’s major contributions is shining a light on the unfair application of mathematical models and algorithms in decision-making procedures like hiring, lending, and criminal justice, which can have a big impact on people’s lives. The book casts doubt on the idea of algorithmic neutrality by demonstrating how algorithms are built on inaccurate data derived from observed human actions and decisions, producing biased results that frequently worsen already-existing disparities.

Moreover, the impact of algorithmic decision-making on people, which reduces them to insignificant numbers and ignores their personal histories, psychological conditions, and interpersonal interactions, is highlighted in the book. It exposes the potential biases and inequities inherent in algorithmic judgments and emphasizes the necessity to address the ethical implications of using only algorithms to analyze human tales.

 Since many algorithms employed in significant decision-making processes are private and secret, it can be challenging for those who are affected by these judgments to understand how they operate or to contest them. This is why the book examines the topic of transparency and accountability in algorithmic decision-making. The book highlights the need for greater accountability and transparency in the creation and usage of algorithms and urges readers to evaluate these tools’ effects on society with greater knowledge and critical thought. The use of computational text analysis in domains like education, where algorithms are employed to evaluate professors and lecturers, and the potential biases and limitations of such evaluations are also raised in the book. It promotes deeper study and reflection on the creation of moral and just algorithms that take into account the intricate social and cultural influences on text data and analysis.

d.) Own critical reflections

In Chapter 7, “Sweating Bullets,” the author highlights an important issue: the unfair use of past records and WMDs to screen job candidates, resulting in the blacklisting of some and the disregard of many others. When we rely on an algorithmic product to analyze human stories, individuals become mere numbers. For instance, a hard-working teacher’s efforts for a day are reduced to 8 hours in the database. The practice of “clopening” operates on the same principle: the machine does not care about an individual’s mental stress, personal preferences, or relationships; it only considers the additional hours worked. The Cataphora software system operates in the same manner. During the 2008 recession, companies used the software’s output to lay off employees whose circles on the chart were small and dim.

While I agree with most of the author’s statements, I remain optimistic that with advancements in AI the damage caused by WMDs can be reduced. Although I am unsure of how this can be achieved, the author has identified many of the problems, and solutions may exist.

This chapter’s example of Tim Clifford’s teacher evaluation case reminded me of the Student Evaluation of Teaching that is conducted every semester at City Tech, as well as at all other CUNY undergraduate colleges. These evaluations allow students to provide feedback on their classes before the final exams to eliminate potential bias. The feedback is then gathered and analyzed to help instructors improve their teaching. Prior to the pandemic, City Tech used a paper version of the evaluations: professors would receive forms for each class and ask students to fill them out in class. Instructors had to leave the room while students filled out the forms, and a student would then deliver the completed forms to the Assessment and Research office. However, this evaluation process put pressure on some instructors, particularly adjuncts and those who had not yet received tenure. Some instructors chose not to distribute the forms to students, or filled out and submitted the forms themselves.

Despite the potential for bias from students, I believe that the Student Evaluation of Teaching questions are reasonable and can help instructors improve their teaching methods. At the same time, I recognize that the evaluation process may not be entirely fair to instructors, and that algorithms used to evaluate teaching may also be subject to biases and inequalities. Therefore, it is crucial to prioritize the development of ethical and fair algorithms that account for the biases and inequalities present in our society.

Response blog post_Week 4_2.27.23_YWei

Post-feminist text analysis by Sara Mills

Speaking in Tongues: Dialogics, Dialectics, and the Black Woman Writer’s Literary Tradition

The author analyzes popular cultural texts using post-feminist theories. She investigates how gender and power are represented in these texts and how they reflect and influence societal beliefs and values about gender.

Sara Mills’ post-feminist text analysis work is relevant to the course goal of learning feminist text analysis because it highlights the ongoing struggle for gender equality in language and discourse. She also demonstrates how language is used to subtly undermine feminist goals and promote traditional gender roles by analyzing popular media texts such as advertisements. She demonstrates, for example, how women are frequently objectified and reduced to their physical appearance, whereas men are portrayed as powerful and dominant. And we have many scholars investigating how language constructs and reinforces gender roles, stereotypes, and power dynamics in feminist text analysis. Mills’ research expands on this foundation by investigating how postfeminist discourses that claim to have achieved gender equality actually perpetuate sexist attitudes and limit women’s agency.

By contrast, Mae G. Henderson’s article emphasizes how black women writers use language to challenge dominant cultural narratives and give voice to marginalized perspectives. Though Henderson’s article is not explicitly feminist, it can be viewed as part of a broader feminist project that seeks to amplify marginalized voices and challenge dominant cultural discourses.

In both articles, the importance of examining how language and media representations perpetuate stereotypes and power imbalances is stressed. In addition, they emphasize the importance of diverse representations that reflect marginalized perspectives and experiences. Both articles engage in feminist text analysis in critiques and challenges of dominant cultural narratives and promote greater diversity and inclusivity in media and literature.

Response blog post_Week 2 _2.6.23_YW

Sex and gender are often separated because they refer to distinct aspects of a person’s identity: gender is a social construct that can vary across cultures and can change over time. By separating the two concepts, it is possible to understand and address the ways in which gender and sex intersect and how they impact an individual’s experiences and opportunities. As we discussed in class today, gender is a question that is constantly asked when we need to fill out forms or sign up for something. This reminds me of all the data analysis we did at City Tech for student enrollment, graduation, and retention, and the surveys we hand out to gather data. We always included the variable of gender in our analyses. I thought it was interesting that the most recent Enrollment Dashboard, which we just updated for Spring 2022, had five demographic categories under gender: Men, Women, Non-binary Persons, Gender Nonconforming Persons, and Unspecified. Prior to Spring 2022, the data we received from the CUNY IRDB database only had the gender categories of Men and Women. This shift towards a more inclusive understanding of gender has led to an increase in the number of gender variables in data analysis, allowing for a more nuanced and accurate representation of gender identities. And I believe this is important for ensuring that data analysis is inclusive and respectful of all gender identities, and for providing a more complete picture of the experiences and perspectives of individuals who identify outside of the male/female binary.

Blog Post 2: Supervised Learning Readings

By definition, supervised learning is generally used to classify data or make predictions, whereas unsupervised learning is generally used to understand relationships within datasets. Supervised learning is therefore much more resource-intensive, because it depends on labelled data. Various examples of supervised learning are given in the assigned reading, such as spam detection as part of an email firewall, distinguishing between conglomerate and non-profit novels, and Spotify’s recommended-songs model. These differences made me compare the notebook we did previously with the one we did today. In unsupervised learning we do not have any labelled training dataset; having one is the advantage of supervised learning, which is why it tends to be the better predictor.
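To make that contrast concrete, here is a small illustrative sketch in R. It is my own toy example on the built-in iris data, not something from the readings or our notebooks: the supervised model needs labels to learn from, while k-means finds groups without any.

```r
# Supervised: learn from labelled examples, then predict labels for held-out data
data(iris)
train_idx <- sample(nrow(iris), 100)
model <- glm(Species == "setosa" ~ Sepal.Length + Sepal.Width,
             data = iris[train_idx, ], family = binomial)      # labels required
preds <- predict(model, newdata = iris[-train_idx, ], type = "response")

# Unsupervised: find structure (clusters) without using any labels
clusters <- kmeans(iris[, 1:4], centers = 3)
table(clusters$cluster, iris$Species)   # compare discovered groups to the true labels
```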

In “Against Conglomeration: Nonprofit Publishing and American Literature After 1980,” Sinykin and Roland discuss how ‘multiculturalism’ evolved in the world of literature. It was started by the government to include the diverse population that defined the new America; however, in the process of establishing ‘multiculturalism,’ things ended up being categorized into specific titles and reputations given to authors (African American / Asian American / Indian American) who never sought such prejudiced and racist labels, labels that created categories in the name of diversity. But is this really multiculturalism? Aren’t we categorizing people according to their race and expecting them to create their work on a cultural basis? Non-profits did this because there was money to be gained, and it was the government that promoted it, so the practice became standardized through that profit motive. Still, apart from all the downsides, we cannot deny that, thanks to non-profits, chances were given to those who were considered outsiders (non-white people) in the literary field.

We are experiencing a similar situation in the current period, where non-profits are collecting data to improve society and reduce discrimination. However, they face a lot of challenges in doing so. For example, machine learning (ML) algorithms are built to pick candidates for hiring. To make unbiased decisions, the algorithm has to be taught not to discriminate against candidates by gender or race. Under a supervised learning process, these algorithms would need data on gender and race in order to check for that bias. So, in reality, it is very hard to remove gender- and race-specific data, because it is needed to fight against discrimination; however, that same data is often misused in exactly this setting. As Ben Schmidt states in his article, “the most important rule for thinking about artificial intelligence is that it’s deleterious effects are most likely in places where decision makers are perfectly happy to let changes in algorithms drive changes in society. Racial discrimination is the most obvious field where this happens”. Therefore, this is a very promising area for feminist scholars to work on. I have made a similar argument in the Notebook as well.

In conclusion, we can say that supervised learning is a very useful branch of machine learning if used properly; otherwise, it can create many societal issues, such as discrimination and racialization, by sorting things into groups.

Blog Post 1: Topic Modelling

Topic modeling is a machine learning technique that automatically analyzes text data to find clusters of words across a set of texts. It is called ‘unsupervised’ machine learning because it does not require a predefined list of tags or training data that has already been classified by humans. Topic modelling helps identify common themes in the texts. A text can carry multiple perspectives, which makes it hard to address the text at all its possible levels simultaneously; topic modelling helps to achieve this goal.
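As a quick illustration of what this looks like in practice (my own minimal sketch, not taken from the readings), an LDA model can be fit in R with the topicmodels package; the AssociatedPress document-term matrix used here is just a convenient example dataset that ships with that package.

```r
# Minimal LDA sketch: fit a topic model to a document-term matrix, then
# inspect the most probable words in each discovered topic.
library(topicmodels)
library(tidytext)
library(dplyr)

data("AssociatedPress")                                  # example document-term matrix
lda_fit <- LDA(AssociatedPress, k = 4, control = list(seed = 123))

# Per-topic word probabilities ("beta"); show the top five words per topic
tidy(lda_fit, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 5) %>%
  arrange(topic, desc(beta))
```

Nothing in the fit is labelled by a human; the topics emerge from word co-occurrence alone, which is exactly why the results still need interpretation afterwards.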

After reading the various assigned articles, I can see both the advantages and the disadvantages of LDA topic models. Lisa Rhody’s article compares LDA to produce at a market, but what I wonder is: isn’t produce at a market a very simple concept compared to LDA? The size of a topic reflects an estimate of how much of each kind of topic (in poetry) is present. But would it be unfair if the algorithm somehow misinterpreted certain words (or co-occurrences of words) as something else and thus produced a false estimate of the topics? Though the authors reflect on this, saying that LDA does a pretty good job with its method of discovery, there is still no guarantee of 100 percent accuracy. Therefore, this may lead to some loss of authenticity or accuracy in evaluating themes.

I want to make some comparisons with what we learned in our previous readings. We worked on clustering algorithms, which are also unsupervised machine learning, similar to topic modelling. However, on the contrary side, typical clustering algorithms like k-means rely on a distance measure between points, whereas the LDA topic model does not perform any distance measurement. This means LDA lacks the ability to describe how topics relate to one another and instead just estimates probabilities. Matthew Jockers’s article touches on something similar when it states that “the manner in which the computer (or dear Hemingway) does the calculation is perhaps less elegant and involves a good degree of mathematical magic”. This suggests how the narrative feature/structure (for example, of poetry) is lost, along with the relation between two topics, when only the probability of the topics/themes is calculated.

In conclusion, it would not be wrong to say that topic modeling is more of an “exploratory” data analysis tool than an “explanatory” one. Topic modeling can reveal patterns and prompt questions, but it is less appropriate for testing and confirming them.