Blog Post 1: Topic Modeling

Topic modeling is a machine learning technique that automatically analyzes text data to find clusters of words that tend to occur together across a set of texts. It is a form of "unsupervised" machine learning because it does not require a predefined list of tags or training data that humans have already classified. Topic modeling helps to identify the common themes of the texts. A text can be read from multiple perspectives, which makes it hard to address all of its possible levels at the same time. Topic modeling helps with this by treating each text as a mixture of several themes rather than assigning it a single one.
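To make the idea of "finding clusters of words without labels" more concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling, written in plain Python. It is a toy illustration, not the implementation used in any of the readings: the function name fit_lda, the four tiny "poems", and the values of alpha, beta, and the iteration count are all assumptions chosen for the example.

```python
import random
from collections import Counter

def fit_lda(docs, num_topics=2, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimized)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    # Start by giving every word token a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight each topic by how common it is in this document
                # and how often it is assigned to this word elsewhere.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Turn the final counts into per-document topic proportions.
    doc_mix = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    top_words = [
        [w for w, c in topic_word[k].most_common(3) if c > 0]
        for k in range(num_topics)
    ]
    return doc_mix, top_words

# Hypothetical miniature corpus: two "sea" poems and two "love" poems.
docs = [
    "sea wave tide sea shore".split(),
    "wave shore sea tide wave".split(),
    "love heart kiss love heart".split(),
    "heart kiss love kiss heart".split(),
]
doc_mix, top_words = fit_lda(docs)
for mix in doc_mix:
    print([round(p, 2) for p in mix])
print(top_words)
```

No labels are supplied anywhere: the sampler only sees which words appear together in which documents. Each document comes back as a list of proportions that sums to 1, which is the "mixture of themes" view described above. On a corpus this small the result depends heavily on the random seed and the chosen parameters.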

After reading the assigned articles, I can see both the advantages and the disadvantages of LDA topic models. Lisa Rhody's article compares LDA to produce at a market. But what I am wondering is: isn't produce at a market a very simple concept compared to LDA? The size of each topic reflects an estimate of how much of that kind of topic is present (here, in poetry). However, what happens if the algorithm misreads the co-occurrence of certain words as something else and therefore arrives at a false estimate of the topics? Although the authors address this and say that LDA does a pretty good job with its method of discovery, nothing guarantees 100 percent accuracy. This may therefore lead to some loss of authenticity or accuracy in the evaluation of themes.

I want to make some comparisons with what we learned in our previous readings. We worked on clustering algorithms, which are also a form of unsupervised machine learning, similar to topic modeling. However, there is a contrast: typical clustering algorithms like K-means rely on a distance measure between data points and assign each one to a single cluster, whereas an LDA topic model does not measure distance at all. This means that LDA does not directly tell us how topics relate to one another; it only estimates probabilities. Matthew Jockers's article makes a similar point, stating that “the manner in which the computer (or dear Hemingway) does the calculation is perhaps less elegant and involves a good degree of mathematical magic”. This tells us how the narrative structure (for example, of poetry) is lost, along with the relation between two topics, when only the probabilities of the topics/themes are calculated.
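The contrast between a distance-based assignment and a probability-based one can be sketched in a few lines of Python. This is a simplification under stated assumptions: the first function is only the assignment step of K-means, and the second estimates one document's topic proportions with a simple EM loop over topics that are held fixed, which is not full LDA inference. The vocabulary, centroids, and topic probabilities are hypothetical values invented for the example.

```python
import math

def nearest_centroid(point, centroids):
    """K-means assignment step: return the index of the closest centroid."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def topic_proportions(word_counts, topic_word_probs, steps=50):
    """Estimate one document's topic proportions with fixed topics (EM)."""
    k = len(topic_word_probs)
    theta = [1.0 / k] * k          # start from an even mixture
    total = sum(word_counts)
    for _ in range(steps):
        new_theta = [0.0] * k
        for w, n in enumerate(word_counts):
            if n == 0:
                continue
            # Share of word w's occurrences credited to each topic.
            joint = [theta[j] * topic_word_probs[j][w] for j in range(k)]
            norm = sum(joint)
            for j in range(k):
                new_theta[j] += n * joint[j] / norm
        theta = [t / total for t in new_theta]
    return theta

# Hypothetical vocabulary order: ["sea", "wave", "love", "heart"].
doc = [3, 2, 1, 0]                       # word counts for one poem
centroids = [[3, 3, 0, 0], [0, 0, 3, 3]]
topics = [
    [0.45, 0.45, 0.05, 0.05],            # hypothetical "sea" topic
    [0.05, 0.05, 0.45, 0.45],            # hypothetical "love" topic
]

label = nearest_centroid(doc, centroids)  # one hard label
mix = topic_proportions(doc, topics)      # proportions that sum to 1
print(label, [round(p, 2) for p in mix])
```

The distance-based step puts the poem in exactly one cluster, while the probability-based step keeps a small share for the second theme. Neither output says anything about how the two topics relate to each other, which is the limitation discussed above.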

In conclusion, it would not be wrong to say that topic modeling is more an “exploratory” method of data analysis than an “explanatory” one. Topic modeling can reveal patterns and prompt questions, but it is less suited to testing and confirming them.