Natural Language Processing (NLP) is a fascinating field: the first gateway towards a human-friendly user experience. Decades before personal and mobile computing, sci-fi movies shaped the public vision of interacting with intelligent machines the way we interact with fellow human beings. That vision is becoming reality today thanks to advances in NLP through machine learning and deep learning, and the wide adoption of conversational interfaces via chatbots (Messenger, Telegram, etc.) and voice assistants (Alexa, Google Assistant).
Today, we will explore a fundamental concept in NLP: the language model. A language model is a probability distribution over sequences of words, which affords the ability to predict the next word in a sequence. In this post, we will build a language model in two ways: first from scratch using a word-pair frequency approach, then with an n-gram approach that expands the model using an NLP library. Along the way, we will look at the sentiments exhibited across the chapters and uncover the relationships between the characters. We will end the tour with some pointers for expanding this analysis and applying the lessons to real-world problems.
Before jumping onto the task of language modeling, it is imperative to inspect the data and clean it as necessary.
import requests
# read the text directly from the internet
response = requests.get("https://www.gutenberg.org/files/11/11-0.txt")
rawtxt = response.text
# inspect the first and last 500 characters
print(f"header>>\n{rawtxt[:500]}")
print(f"footer>>\n{rawtxt[-500:]}")
The Unicode text file is littered with end-of-line (EOL) characters that should be cleaned in preprocessing. Before looking at summary statistics of the text, the Project Gutenberg header and footer should also be removed.
int_txt = rawtxt.split("*** START OF THIS PROJECT GUTENBERG EBOOK ALICE’S ADVENTURES IN WONDERLAND ***")[1].split("*** END OF THIS PROJECT GUTENBERG EBOOK ALICE’S ADVENTURES IN WONDERLAND ***")[0]
print(f"header>>{int_txt[:50]}")
print(f"footer>>{int_txt[-50:]}")
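The stray EOL characters mentioned above could also be normalized explicitly. Here is a minimal sketch (the helper name `normalize_whitespace` is ours; the `.lower().split()` used later handles whitespace implicitly anyway):

```python
import re

def normalize_whitespace(text):
    """Collapse runs of whitespace (including \\r\\n line breaks) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()
```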
Here's our first attempt at the language-modeling problem: predicting the next word using frequency. First, tally all consecutive word pairs and their frequencies across the whole book. Then write a function that looks up a word and returns its most likely successor by frequency. We first accomplish the task from scratch, without any NLP library.
from collections import Counter
txt_vec = int_txt.lower().split()
txt_pairs = zip(txt_vec[:-1], txt_vec[1:])
cnt = Counter(txt_pairs)
def getMostLikelySuccessor(word):
    """Return the highest-frequency successor of `word` (prefix-matched, case-insensitive)."""
    return sorted([(cnt[pair], pair) for pair in cnt if pair[0].startswith(word.lower())], reverse=True)[0][1][1]
[getMostLikelySuccessor(x) for x in ["Alice", "was", "at"]] == ["was", "a", "the"]
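Note that `startswith` matches any word beginning with the query ("at" would also match "ate"). A stricter variant (our own tweak, not part of the original; shown with a tiny stand-in corpus so the snippet is self-contained) compares words exactly:

```python
from collections import Counter

# tiny stand-in corpus; in the post, the counts come from the full book text
toy_words = "alice was at the table and alice was not amused".split()
pair_counts = Counter(zip(toy_words[:-1], toy_words[1:]))

def get_most_likely_successor_exact(word):
    """Most frequent successor of `word`, matching the word exactly (no prefix match)."""
    candidates = [(freq, pair) for pair, freq in pair_counts.items() if pair[0] == word.lower()]
    return max(candidates)[1][1]
```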
Next, we perform the same task and further explore the text using the textacy library, which affords convenient text-preprocessing methods.
import textacy
doc = textacy.Doc(textacy.preprocess_text(int_txt))
bot2 = doc.to_bag_of_terms(ngrams=2, as_strings=True, normalize="lower", named_entities=False, filter_stops=False)
def getNextWord(word, bot, n):
    """Return the n-th word of the highest-frequency n-gram that starts with `word`."""
    try:
        return sorted([(freq, term) for term, freq in bot.items() if term.startswith(word.lower())], reverse=True)[0][1].split(" ")[n - 1]
    except IndexError:
        # no n-gram starts with `word`: signal the end of the line
        return "<EOL>"
[getNextWord(x, bot2, 2) for x in ["Alice", "was", "at"]] == ["was", "a", "the"]
What we just built is a bigram language model based on frequency. However, bigrams are not very good at preserving coherence across a longer sequence of words.
def textGen(seed, bot, m, n):
    '''seed: the initial word(s); bot: bag of terms; m: number of words to generate; n: n-gram size'''
    result = seed.split(" ")
    seed = " ".join(result[-n + 1:])
    for i in range(m):
        if "<EOL>" not in result:
            new_word = getNextWord(seed, bot, n)
            result.append(new_word)
            if n > 2:
                # look up with the last n-1 words of the sequence
                seed = " ".join(result[-n + 1:])
            else:
                seed = new_word
    return " ".join(result)
textGen("Alice", bot2, 10, 2)
Let's modify the current language model to include larger n-grams, say trigrams (n = 3) and 4-grams (n = 4), and compare the results.
[f'{n}-gram: {textGen("Alice was going to", doc.to_bag_of_terms(ngrams=n, as_strings=True, normalize="lower", named_entities=False, filter_stops=False), 20, n)}' for n in range(2, 5)]
In the case of the 4-gram, it has become so restrictive (due to a large n and a relatively short story) that an original line from the book was returned.
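One common remedy for this determinism (a sketch, not part of the original pipeline) is to sample the next word in proportion to its frequency rather than always taking the single most frequent candidate:

```python
import random
from collections import Counter

# stand-in bigram counts; in the post these would come from the bag of terms
bigram_counts = Counter({("alice", "was"): 5, ("alice", "said"): 3, ("alice", "thought"): 2})

def sample_next_word(word, counts):
    """Sample a successor of `word` with probability proportional to its frequency."""
    candidates = [(pair[1], freq) for pair, freq in counts.items() if pair[0] == word]
    words, freqs = zip(*candidates)
    return random.choices(words, weights=freqs, k=1)[0]
```

Sampling makes longer generations less likely to reproduce the book verbatim, at the cost of occasional incoherence.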
It is often helpful to compute summary statistics to get an idea of the "shape" of the data; in particular, spotting classes that are imbalanced due to data availability helps avoid biases. Here we inspect basic word counts for each chapter of the text. The chapters are well balanced, except for a slightly shorter Chapter 3 (A Caucus-Race and a Long Tale). Chapter 6 (Pig and Pepper) shows a relative lack of word diversity per sentence; curiously enough, that is the chapter where the quote "How do you know I'm mad?" lives.
import pandas as pd
chapters = textacy.preprocess_text(int_txt).split("CHAPTER")[1:]
corp = textacy.Corpus("en", chapters)
ts = [textacy.TextStats(x) for x in corp]
basic_counts = pd.DataFrame([x.basic_counts for x in ts])
basic_counts["chapter"] = range(1, 13)
basic_counts["unique_word_ratio"] = basic_counts['n_unique_words'] / basic_counts['n_sents']
basic_counts
Sentiment analysis concerns quantifying the polarity of emotion expressed in a text, ranging from negative (-1) through neutral (0) to positive (1). Subjectivity measures, on a continuous scale, whether the text is objective (0) or subjective (1). We will briefly explore the sentiments across the chapters using the TextBlob library, which provides a simplified interface for sentiment analysis at various levels of the text (chapter, sentence, etc.).
from textblob import TextBlob
tb = [TextBlob(chapter) for chapter in chapters]
sentiment = pd.DataFrame([chapter.sentiment for chapter in tb])
sentiment["chapter"] = range(1, 13)
sentiment.head()
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
polarity = [sentence.sentiment.polarity for chapter in tb for sentence in chapter.sentences]
plt.scatter(x=list(range(len(polarity))), y=polarity);
# relationship between subjectivity and polarity
subjectivity = [sentence.sentiment.subjectivity for chapter in tb for sentence in chapter.sentences]
plt.scatter(x=polarity, y=subjectivity);
We observe a slight positive trend in sentiment polarity as the story progresses, and a positive correspondence between subjectivity and the magnitude of polarity.
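To put a number on that correspondence, we could correlate subjectivity with the absolute value of polarity (a hypothetical helper of ours; `polarity` and `subjectivity` are the lists computed above):

```python
import numpy as np

def abs_polarity_corr(polarity, subjectivity):
    """Pearson correlation between the magnitude of polarity and subjectivity."""
    return np.corrcoef(np.abs(polarity), subjectivity)[0, 1]

# in the notebook: abs_polarity_corr(polarity, subjectivity)
```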
It would be of interest to correlate characters and the associated sentiment variation in the plot. A first step would be to count the character appearances per chapter. The list of characters is extracted from this site.
from collections import defaultdict
chars = ['alice', 'rabbit', 'caterpillar', 'cat', 'cheshire', 'queen', 'sister', 'dinah', 'mouse', 'duck', 'dodo', 'lory', 'eaglet', 'crab', 'mary', 'pat', 'bill', 'guinea', 'puppy', 'pigeon', 'frog-footman', 'fish-footman', 'duchess', 'baby', 'cook', 'march', 'dormouse', 'elsie', 'lacie', 'tillie', 'five', 'seven', 'two', 'knave', 'king', 'flamingos', 'gryphon', 'turtle', 'juror']
char_counts = defaultdict(list)
for chapter in tb:
    lemmas = chapter.words.lemmatize()  # lemmatize once per chapter, not once per character
    for char in chars:
        char_counts[char].append(lemmas.count(char))
char_counts = pd.DataFrame(char_counts)
char_counts["chapter"] = range(1, 13)
char_counts.head()
# WIP: to collect and plot major character appearances across chapters
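As a sketch of that work-in-progress step, we could rank characters by total mentions and plot only the major ones (the helper `top_characters` is our own suggestion, operating on the `char_counts` DataFrame built above):

```python
import pandas as pd

def top_characters(char_counts, top=5):
    """Return the `top` most-mentioned characters overall, in descending order."""
    totals = char_counts.drop(columns="chapter").sum().sort_values(ascending=False)
    return list(totals.head(top).index)

# in the notebook, the plot could then be:
# char_counts.plot(x="chapter", y=top_characters(char_counts), figsize=(10, 5))
```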
It would be very informative to visualize the characters' social network in Wonderland. Our first approach identifies and counts co-occurrences of characters within nearby words, then computes centrality metrics that show the relative importance of the nodes. Betweenness quantifies the number of times a node acts as a bridge along the shortest path between two other nodes. The number of cliques (complete sub-graphs) indicates how many tightly related communities a node (character) belongs to. The graph is visualized with Cytoscape, where the size of a node indicates its betweenness, the size of a node's label (the character's name) indicates the number of cliques the character is in, and the thickness of an edge illustrates the co-occurrence count.
import networkx as nx
g = doc.to_semantic_network()
g.remove_edges_from(g.selfloop_edges())
from networkx.algorithms import community
sub_g = g.subgraph(chars)
for node, attr in nx.betweenness_centrality(sub_g).items():
    sub_g.nodes[node]['betweenness'] = attr
for node, attr in nx.number_of_cliques(sub_g).items():
    sub_g.nodes[node]['n_cliques'] = attr
# nx.write_graphml(sub_g, "data/sub_g_clique.graphml")
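For readers curious what the co-occurrence step looks like under the hood, here is a from-scratch counter over a sliding word window (a simplified sketch of the idea, not textacy's exact algorithm):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(words, vocab, window=10):
    """Count how often two distinct `vocab` words appear within `window` words of each other."""
    counts = Counter()
    # keep only the positions of the words we care about
    positions = [(i, w) for i, w in enumerate(words) if w in vocab]
    for (i, w1), (j, w2) in combinations(positions, 2):
        if w1 != w2 and j - i <= window:
            counts[tuple(sorted((w1, w2)))] += 1
    return counts
```

The resulting counts could serve as edge weights when building the graph with `networkx`.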
Thank you for taking this quick tour with me through Wonderland. There is a lot more to explore, for example:
Feel free to read this if you want to dive deeper into the source material and its literary analyses.