
How Did We Make It: The DirectHERS Search Engine

DirectHERS is a project I was part of for the course DH Methodology and Practice. As a team, we built a text encoding project to represent Women Directors. I was in charge of building the search engine and was part of the dev team with Gemma. When looking at possibilities for building the search engine, we considered three main routes:

1) Building a search engine with vanilla JavaScript and Ajax, running within the limited capacity of GitHub Pages. Although this option seemed feasible, the downside was the latency of information processing: without a dedicated virtual machine (VM) in the cloud, the crawler would be slow at indexing and returning results. This solution would also have required optimization and code refactoring to reach acceptable performance.

2) Incorporating a basic search within GitHub Pages, knowing that it would be limited to keyword search only but would still give us a functional engine producing the desired output.

3) Creating a search engine within Tableau, leveraging Tableau Public’s resources, and later embedding it into our GitHub page. This solution seemed very palatable, as it would meet our requirement for minimal but efficient computing.

After various attempts, the team settled on a hybrid option, which also turned out to be the easiest to integrate into our website. We did use JavaScript, but instead of letting a crawler dynamically crawl through XML files, we built a common structure for our directors: one long, structured XML file with tags that are relevant to all the directors. We then ingest the file directly with JavaScript to make it searchable. This uniformity improves query runtime significantly. Since our sources do not change dynamically, we do not need a dynamic crawler like a traditional search engine (for instance, Google). To fine-tune the runtime further, we used a dictionary-based (key, value) approach, where the key is an XML tag for a director and the value is the information contained in that tag. The solution works well for cross-director search and can serve as a unique pedagogical tool for researching the directors.
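The sketch below illustrates the idea. The production version is client-side JavaScript; this Python version is only a stand-in, and the file name directors.xml and its tag names are hypothetical placeholders for our shared encoding structure.

```python
# Minimal sketch of the dictionary-based search idea, in Python for
# illustration (the real site does this in client-side JavaScript).
# "directors.xml" and the tag names are hypothetical stand-ins for the
# project's common encoding structure.
import xml.etree.ElementTree as ET

def load_directors(path="directors.xml"):
    """Parse the shared XML structure into one dict per director,
    {tag_name: text}, mirroring the (key, value) approach."""
    root = ET.parse(path).getroot()
    directors = []
    for director in root.findall("director"):
        record = {child.tag: (child.text or "") for child in director}
        directors.append(record)
    return directors

def search(directors, query):
    """Return every (director, tag) pair whose value contains the query."""
    q = query.lower()
    hits = []
    for record in directors:
        for tag, value in record.items():
            if q in value.lower():
                hits.append((record.get("name", "unknown"), tag))
    return hits

# Usage (assuming directors.xml exists):
#   results = search(load_directors(), "montage")
```

Because every director shares the same tags, the lookup never has to discover the structure at query time, which is where the runtime gain comes from.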

Book Review: What Is ChatGPT Doing … and Why Does It Work? 

https://www.goodreads.com/book/show/123245371-what-is-chatgpt-doing-and-why-does-it-work

ChatGPT is a sophisticated computational model of natural language processing that is capable of producing coherent and semantically meaningful text through incremental word addition. Conceptually, it works as if it had scanned an extensive corpus of human-authored text for occurrences of the text in question; the system then generates a ranked list of potential next words, each with a corresponding probability. The fascinating thing is that when we give ChatGPT a prompt like “Compose an essay,” all it does is ask, “Given the text so far, what should the next word be?” over and over again, adding one word at a time. However, if the model always picked the highest-ranked word, the essay it produced would not be creative. The system uses a “temperature” parameter to decide how often lower-ranked but still probable words are chosen. “Temperature” is one of the model’s tunable parameters, ranging from 0 to 1, where 0 produces the flattest essay with no creativity and 1 the most creative.
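To make the “temperature” idea concrete, here is a toy sketch. The candidate words and scores are invented; a real model ranks roughly 50,000 vocabulary entries at every step.

```python
# Toy illustration of temperature-based sampling over a ranked list of
# candidate next words. Words and scores are made up for the example.
import math
import random

candidates = {"cat": 4.0, "dog": 3.2, "hat": 2.5, "idea": 1.1}  # word: score

def sample_next_word(scores, temperature):
    if temperature == 0:  # temperature 0: always take the top-ranked word
        return max(scores, key=scores.get)
    # Higher temperature flattens the distribution, letting lower-ranked
    # words through more often; lower temperature sharpens it.
    weights = [math.exp(s / temperature) for s in scores.values()]
    return random.choices(list(scores), weights=weights)[0]

print(sample_next_word(candidates, 0))    # deterministic: "cat"
print(sample_next_word(candidates, 0.8))  # occasionally a lower-ranked word
```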

The foundation of neural networks is numerical, so in order to use them for textual analysis, we need a way to represent text as numbers. This concept is fundamental to ChatGPT, which employs embeddings to represent words as arrays of numbers. To create such an embedding, we must examine vast quantities of text and ask how similar the contexts are in which various words appear. Discovering word embeddings begins with a trainable task over words, such as word prediction. For instance, to solve the “the ___ cat” problem, suppose that among the 50,000 most common English words, “the” is number 914 and “cat” (with a space before it) is number 3542. The input is then {914, 3542}, and the output is a list of approximately 50,000 numbers giving the probability of each potential “fill-in” word. By intercepting the embedding layer, just before the network reaches its conclusion about which words are appropriate, we can read off the numerical representation it has learned for each word.
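A toy, untrained version of that setup might look like the following. All the weights here are random (training is what would make the probabilities meaningful), and the vocabulary size and embedding width are illustrative.

```python
# A toy, untrained version of the "the ___ cat" setup: token ids in,
# one probability per vocabulary word out. Weights are random, so the
# probabilities are meaningless until trained.
import numpy as np

VOCAB_SIZE, EMBED_DIM = 50_000, 64
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))    # one vector per word
output_weights = rng.normal(size=(EMBED_DIM, VOCAB_SIZE))

def next_word_probs(token_ids):
    # Embed the context words (e.g. 914 = "the", 3542 = " cat") and average
    # them into one context vector -- the layer one could "intercept".
    context = embeddings[token_ids].mean(axis=0)
    logits = context @ output_weights
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()                # ~50,000 probabilities summing to 1

probs = next_word_probs([914, 3542])
print(probs.shape, probs.sum())  # (50000,) 1.0
```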

Human language, and the cognitive processes involved in producing it, have always appeared to be the pinnacle of complexity. However, ChatGPT has shown that a completely artificial neural network, with a large number of connected nodes resembling the connected neurons of a brain, can generate human language with amazing fidelity. The underlying reason is that language is fundamentally simpler than it appears, and ChatGPT effectively captures the essence of human language and the reasoning behind it. In addition, ChatGPT’s training has “implicitly discovered” whatever linguistic (and cognitive) patterns make this possible.

Syntax and parse trees are two well-known examples of what can be considered “laws of language”: there are (relatively) clear grammatical norms for how words of various types can be combined. Nouns may take adjectives before them and verbs after them, for example, but two nouns usually cannot sit immediately next to each other. ChatGPT has no explicit “knowledge” of such principles, yet through its training it implicitly “discovers” and then applies them. The crucial point is that a neural net can be trained to generate “grammatically correct” sequences, and there are several ways to handle sequences in neural nets, including the transformer networks that ChatGPT uses. Like Aristotle, who “discovered syllogistic logic” by studying many instances of rhetoric, ChatGPT, with its 175 billion parameters, is expected to do the same by studying vast quantities of text from the web.

Roundtable Abstract: The Days of the LLM Are Numbered

A recently leaked Google document claims, “We Have No Moat, And Neither Does OpenAI.” This is quite a big claim against OpenAI, the mastermind behind society’s enchanting creative LLM, ChatGPT. The document mostly refers to the significant rise of boutique, indie, open-source model developers and tweakers, who have leapfrogged ahead by deploying models that are only fractionally less effective than the renowned LLMs at a significantly lower computational cost. Even OpenAI CEO Sam Altman has acknowledged that the era of giant models is over. This paper will analyze some of the open-source language models that pose significant challenges to the large proprietary LLMs, examining their underlying technology and results, and asking what these open-source models do differently to challenge the norms and why they require less computation.

Response Blog on Topic Modeling

Anyone who is invited to a buffet, or planning to attend one, should have an idea of what is served on the menu. That background knowledge is what I am referring to. It is perhaps more important than the computational process, or even the decision about how many latent topics are waiting to be discovered within a seemingly large pool of documents, aka the corpus. The magic is fascinating but overwhelming if the magician cannot communicate with the audience. In the case of topic modeling, the magic is the machine: it has computational ability but no ability to express anything on its own (not sentient!!!). There is a role for a magician, an expert who enchants the audience by putting the result into context.

Let me provide an example. A while back, I conducted a topic modeling experiment on the DigitalNZ archive of historical newspapers. The result is the interactive illustration of topics linked below. I decided to uncover 20 topics that were most prevalent during the New Zealand Wars of the 1800s.
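For readers curious about the mechanics, here is roughly how such a model can be built with gensim and exported as an interactive page with pyLDAvis. This is only a sketch: the toy texts stand in for the real tokenized newspaper articles, and the preprocessing shown is not the pipeline I actually used.

```python
# Minimal gensim LDA sketch; the toy "texts" stand in for the real
# tokenized DigitalNZ newspaper articles.
from gensim import corpora
from gensim.models import LdaModel
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  # pyLDAvis >= 3; older releases use pyLDAvis.gensim

# One list of preprocessed tokens per article (stand-in data).
texts = [
    ["gun", "colonial_secretary", "news", "mail", "night"],
    ["land", "acre", "sale", "town", "district"],
    ["sail", "master", "passage", "port", "brig"],
]

dictionary = corpora.Dictionary(texts)               # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in texts]  # bag-of-words vectors

# 20 topics, matching the experiment described above.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20,
               passes=10, random_state=42)

# Export the interactive visualization linked below.
vis = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "pyldavis-topic-modeling-visualization.html")
```

The Mallet model behind Table 2 can be trained through gensim’s LdaMallet wrapper (available in gensim 3.x), which calls a local MALLET installation.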

[Figure: interactive visualization of the LDA topics]

The interactive visualization is available at the following URL:

https://zicoabhidey.github.io/pyldavis-topic-modeling-visualization#topic=0&lambda=1&term=

I played the role of the magician, demystifying the result and presenting it to a broader audience that is by no means composed of historians. I used intuition, and borrowed the superpower of Google to support that intuition, to derive the “bowl” each topic represents. Below is the result I came up with. I could not label all 20 of the topics I was hoping to find.

Topics | Explanation
“gun”, “heavy”, “colonial_secretary”, “news”, “urge”, “tax”, “thank”, “mail”, “night” | Implying political movement and communication during the pre-independence declaration period.
“bill”, “payment”, “say”, “issue”, “sum”, “notice”, “pay”, “deed”, “amount”, “person” | Business-related affairs after the independence declaration.
“distance”, “iron”, “firm”, “dress”, “black”, “mill”, “cloth”, “box”, “wool”, “bar” | Representing industrial affairs, mostly related to garments.
“vessel”, “day”, “take”, “place”, “leave”, “fire”, “ship”, “native”, “water”, “captain” | Representing maritime activities or war from a port city like Wellington.
“land”, “acre”, “company”, “town”, “sale”, “road”, “country”, “plan”, “district”, “section” | Representing real-estate-related activities.
“year”, “make”, “receive”, “take”, “last”, “state”, “new”, “colony”, “great”, “give” | No clear association.
“sail”, “master”, “day”, “passage”, “auckland”, “port”, “brig”, “passenger”, “agent”, “freight” | Representing shipping activities related to Auckland port.
“say”, “go”, “court”, “take”, “kill”, “prisoner”, “try”, “come”, “witness”, “give” | Representing judicial activities and crime news.
“boy”, “pull”, “flag_staff”, “mount_albert”, “white_pendant”, “descriptive_signal”, “lip”, “battle”, “bride”, “signals_use” | Representing traditional stories about Maori myth and legend regarding Mount Albert.

Table 1: Some of the topics and explanations from the gensim LDA model
Topics | Explanation
“land”, “company”, “purchase”, “colony”, “claim”, “price”, “acre”, “make”, “system”, “title” | Representing real-estate-related activities.
“native”, “man”, “fire”, “captain”, “leave”, “place”, “officer”, “arrive”, “chief”, “make” | Representing news regarding the New Zealand Wars.
“government”, “native”, “country”, “settler”, “colony”, “man”, “act”, “people”, “law” | Representing news about the sovereignty treaty signed in 1835.
“mile”, “water”, “river”, “vessel”, “foot”, “island”, “native”, “side”, “boat”, “harbour” | Representing maritime activities from a port city like Wellington.
“settlement”, “company”, “make”, “war”, “place”, “port_nicholson”, “settler”, “state”, “colonist”, “colony” | Representing news about Port Nicholson during the war in Wellington, 1839.

Table 2: Some of the topics and explanations from the gensim Mallet model

After working long hours on this project, I am delighted to have produced an interactive visualization of the Mallet model, which is genuinely hard to do. Still, there is a lingering disappointment that I did not have a historian’s knowledge. A historian specializing in New Zealand’s history might have judged the topics better.

Attending a buffet without knowing the menu is like sailing a boat without a compass; but isn’t that what distant reading is? A calculated leap of faith.