Bag-of-Words Models

Bag-of-Words Model from Wikipedia: The bag-of-words model is a model of text which uses a representation of text that is based on an unordered collection (or “bag”) of words. […] It disregards word order […] but captures multiplicity.
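As a concrete illustration (a minimal sketch using only the standard library), the bag-of-words representation of a sentence is simply the multiset of its words: order is lost, multiplicity is kept.

from collections import Counter

sentence = "the cat sat on the mat"
bow = Counter(sentence.split())
# Word order is discarded, but counts (multiplicity) are preserved,
# e.g. Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
print(bow)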

Introduction

  1. Preparing text data (pre-processing)

    • Standardization: removing irrelevant information, such as punctuation, special characters, case (upper/lower), and stop words.

    • Tokenization (text splitting)

    • Stemming/Lemmatization

  2. Encode texts into numerical vectors (feature extraction)

    • Bag of Words Vectorization-based Models: consider phrases as sets of words. Words are encoded as vectors independently of the context in which they appear in the corpus.

    • Embedding: phrases are sequences of words. Words are encoded as vectors integrating their context of appearance in the corpus.

  3. Predictive analysis (a minimal end-to-end sketch of steps 1-3 follows this list)

    • Text classification: “What’s the topic of this text?”

    • Content filtering: “Does this text contain abuse?”, spam detection, etc.

    • Sentiment analysis: “Does this text sound positive or negative?”

  4. Generate new text

    • Translation

    • Chatbot/summarization
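As mentioned above, here is a minimal end-to-end sketch of steps 1-3 (standardization/tokenization, bag-of-words encoding, and classification) with scikit-learn; the toy sentences and labels are hypothetical and only for illustration:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus and sentiment labels (illustration only)
docs = ["I love this movie", "This film was terrible",
        "What a great performance", "Awful plot and bad acting"]
labels = ["positive", "negative", "positive", "negative"]

# Steps 1-2: CountVectorizer lowercases, tokenizes and builds the
# bag-of-words matrix; step 3: a simple Naive Bayes classifier.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(docs, labels)
print(pipeline.predict(["great movie"]))  # expected 'positive' on this toy corpus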

Preparing text data

Standardization and Tokenization

# Example usage

text = """Check out the new http://example.com website! It's awesome.
Hé, it is for programmers that like to program with programming language.
"""

The Do It Yourself way

Basic standardization consists of:

  • Lower-casing words

  • Removing numbers

  • Removing punctuation

# import regex
import re

# Convert to lower case
lower_string = text.lower()

# Remove numbers
no_number_string = re.sub(r'\d+','', lower_string)

# Remove all punctuation except words and space
no_punc_string = re.sub(r'[^\w\s]','', no_number_string)

# Remove white spaces
no_wspace_string = no_punc_string.strip()

# Tokenization
print(no_wspace_string.split())

NLTK can be used to perform more sophisticated standardization, including:

  • Lower-casing words

  • Removing URLs

  • Stripping accents

  • Removing stop words: stop words are commonly used words that are often removed from text during preprocessing to focus on the more informative words. They typically include articles, prepositions, conjunctions, and pronouns such as “the,” “is,” “in,” “and,” “but,” “on,” etc. The rationale for removing stop words is that they occur very frequently in the language and generally do not contribute significant meaning to the analysis or understanding of the text. By eliminating stop words, NLP models can reduce the dimensionality of the data and improve computational efficiency without losing important information.

import nltk
import re
import string
import unicodedata
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

def strip_accents(text):
    # Normalize the text to NFKD form and strip accents
    text = unicodedata.normalize('NFKD', text)
    text = ''.join([c for c in text if not unicodedata.combining(c)])
    return text

def standardize_tokenize(text, stemming=False, lemmatization=False):
    # Convert to lowercase
    text = text.lower()

    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)

    # Remove numbers
    text = re.sub(r'\d+', '', text)

    # Remove punctuation
    # string.punctuation provides a string of all punctuation characters.
    # str.maketrans() creates a translation table that maps each punctuation
    # character to None.
    # text.translate(translator) uses this translation table to remove all
    # punctuation characters from the input string.
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Strip accents
    text = strip_accents(text)

    # Tokenize the text
    words = word_tokenize(text)

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]

    # Remove repeated words
    words = list(dict.fromkeys(words))

    # Initialize stemmer and lemmatizer
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    # Apply stemming and lemmatization

    words = [stemmer.stem(word) for word in words] if stemming \
        else words

    words = [lemmatizer.lemmatize(word) for word in words] if lemmatization \
        else words

    return words

# Create callable with default values
import functools
standardize_tokenize_stemming = \
    functools.partial(standardize_tokenize, stemming=True)
standardize_tokenize_lemmatization = \
    functools.partial(standardize_tokenize, lemmatization=True)
standardize_tokenize_stemming_lemmatization = \
    functools.partial(standardize_tokenize, stemming=True, lemmatization=True)
standardize_tokenize(text)

Stemming and lemmatization

Stemming and lemmatization are techniques used to reduce words to their base or root form, which helps in standardizing text and improving the performance of various NLP tasks.

Stemming is the process of reducing a word to its base or root form, often by removing suffixes or prefixes. The resulting stem may not be a valid word but is intended to capture the word’s core meaning. Stemming algorithms, such as the Porter Stemmer or Snowball Stemmer, use heuristic rules to chop off common morphological endings from words.

Example: The words “running,” “runner,” and “ran” might all be reduced to “run.”
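As a quick check, the example words can be passed directly to NLTK's Porter and Snowball stemmers (a minimal sketch, assuming NLTK is installed); being purely rule-based, they reduce “running” to “run” but leave the irregular form “ran” unchanged.

from nltk.stem import PorterStemmer, SnowballStemmer

words = ["running", "runner", "ran"]
porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Heuristic suffix stripping: 'running' -> 'run'; irregular forms such
# as 'ran' are not handled because no suffix rule matches them.
print([porter.stem(w) for w in words])
print([snowball.stem(w) for w in words])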

# standardize_tokenize(text, stemming=True)
standardize_tokenize_stemming(text)

Lemmatization is the process of reducing a word to its lemma, which is its canonical or dictionary form. Unlike stemming, lemmatization considers the word’s part of speech and uses a more comprehensive approach to ensure that the transformed word is a valid word in the language. Lemmatization typically requires more linguistic knowledge and is implemented using libraries like WordNet.

Example: The words “running” and “ran” would both be reduced to “run,” while “better” would be reduced to “good.”
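The same forms can be lemmatized directly with NLTK's ``WordNetLemmatizer``; note that the part of speech must be passed explicitly (the default is noun). A minimal sketch:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# With the part of speech given as verb ('v'), inflected and irregular
# verb forms are mapped to their dictionary form.
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("ran", pos="v"))      # run
# Adjectives ('a') use WordNet's exception lists, e.g. comparatives:
print(lemmatizer.lemmatize("better", pos="a"))   # good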

# standardize_tokenize(text, lemmatization=True)
standardize_tokenize_lemmatization(text)

While both stemming and lemmatization aim to reduce words to a common form, lemmatization is generally more accurate and produces words that are meaningful in the context of the language. However, stemming is faster and simpler to implement. The choice between the two depends on the specific requirements and constraints of the NLP task at hand.

# standardize_tokenize(text, stemming=True, lemmatization=True)
standardize_tokenize_stemming_lemmatization(text)

The scikit-learn analyzer is simpler than the NLTK pipeline above and will be sufficient most of the time.

from sklearn.feature_extraction.text import CountVectorizer

analyzer = CountVectorizer(strip_accents='unicode', stop_words='english').build_analyzer()
analyzer(text)

Bag of Words (BOWs) Encoding

Source: text feature extraction with scikit-learn

Simple Count Vectorization

CountVectorizer: “Convert a collection of text documents to a matrix of token counts.” Note that ``CountVectorizer`` performs both the standardization and the tokenization.

It creates one feature (column) for each token (word) in the corpus and returns one row per document (here, per sentence), counting the occurrences of each token.

corpus = [
    'This is the first document. This DOCUMENT is in english.',
    'in French, some letters have accents, like é.',
    'Is this document in French?',
]

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(strip_accents='unicode', stop_words='english')
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())

# Note that the shape of the array is:
# number of sentences by number of distinct tokens
print(X.toarray())

Word n-grams are contiguous sequences of ‘n’ words from a given text. They are used to capture the context and structure of language by considering the relationships between words within these sequences. The value of ‘n’ determines the length of the word sequence:

  • Unigram (1-gram): A single word (e.g., “natural”).

  • Bigram (2-gram): A sequence of two words (e.g., “natural language”).

  • Trigram (3-gram): A sequence of three words (e.g., “natural language processing”).

vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2),
                              strip_accents='unicode', stop_words='english')
X2 = vectorizer2.fit_transform(corpus)
print(vectorizer2.get_feature_names_out())
print(X2.toarray())

TF-IDF Vectorization approach:

TF-IDF (Term Frequency-Inverse Document Frequency) feature extraction:

“TF-IDF (Term Frequency-Inverse Document Frequency) integrates two metrics: Term Frequency (TF) and Inverse Document Frequency (IDF). This method is employed when working with multiple documents, operating on the principle that rare words provide more insight into a document’s content than frequently occurring words across the entire document set.”

“A challenge with relying solely on word frequency is that commonly used words may overshadow the document, despite offering less “informational content” compared to rarer, potentially domain-specific terms. To address this, one can adjust the frequency of words by considering their prevalence across all documents, thereby reducing the scores of frequently used words that are common across the corpus.”

Term Frequency: gives larger weight to frequent words. Given a token \(t\) (term, word) and a document \(d\):

\[TF(t, d) = \frac{\text{number of times } t \text{ appears in } d}{\text{total number of terms in } d}\]

Inverse Document Frequency: gives more importance to rare “meaningful” words that appear in few documents.

If \(N\) is the total number of documents, and \(df\) is the number of documents containing token \(t\), then:

\[IDF(t) = \frac{N}{1 + df}\]

\(IDF(t) \approx 1\) if \(t\) appears in all documents, while \(IDF(t) = N/2\) if \(t\) is a rare, meaningful word that appears in only one document.

Finally:

\[TF\text{-}IDF(t, d) = TF(t, d) \cdot IDF(t)\]
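To make the formulas concrete, here is a minimal plain-Python sketch that computes TF, IDF and TF-IDF exactly as defined above on a toy corpus (note that scikit-learn's ``TfidfVectorizer`` below uses a smoothed, log-scaled IDF and L2 normalization, so its values will differ):

corpus_toy = [
    "the cat sat on the mat".split(),
    "the dog ate my homework".split(),
    "the bird sang".split(),
]
N = len(corpus_toy)

def tf(t, d):
    # number of times t appears in d / total number of terms in d
    return d.count(t) / len(d)

def idf(t):
    # N / (1 + number of documents containing t)
    df = sum(1 for d in corpus_toy if t in d)
    return N / (1 + df)

def tf_idf(t, d):
    return tf(t, d) * idf(t)

print(idf("the"))                    # appears in every document: 3/4, close to 1
print(idf("cat"))                    # appears in a single document: 3/2
print(tf_idf("cat", corpus_toy[0]))  # 1/6 * 3/2 = 0.25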

TfidfVectorizer:

Convert a collection of raw documents to a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(strip_accents='unicode', stop_words='english')
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray().round(3))
print(X.shape)

Lab 1: Sentiment Analysis of Financial data

Sources: Sentiment Analysis of Financial data

The data is intended for advancing financial sentiment analysis research. It combines two datasets (FiQA and Financial PhraseBank) into one easy-to-use CSV file, providing financial sentences with sentiment labels. Citation: Malo, Pekka, et al. “Good debt or bad debt: Detecting semantic orientations in economic texts.” Journal of the Association for Information Science and Technology 65.4 (2014): 782-796.

Import libraries

import numpy as np
import pandas as pd

# Plot
import matplotlib.pyplot as plt
%matplotlib inline
from wordcloud import WordCloud

# ML
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.feature_extraction.text import CountVectorizer

Load the Dataset

data = pd.read_csv('../datasets/FinancialSentimentAnalysis.csv')

print("Shape:", data.shape, "columns:", data.columns)
print(data.describe())
data.head()

Target variable

y = data['Sentiment']
y.value_counts(), y.value_counts(normalize=True).round(2)

Input data: BOWs encoding

Choose tokenizer

text = 'Tesla to recall 2,700 Model X SUVs over seat issue https://t.co/OdPraN59Xq $TSLA https://t.co/xvn4blIwpy https://t.co/ThfvWTnRPs'
vectorizer = CountVectorizer(stop_words='english', strip_accents='unicode')

tokenizer_sklearn = vectorizer.build_analyzer()
print(" ".join(tokenizer_sklearn(text)))
print("Shape: ", CountVectorizer(tokenizer=tokenizer_sklearn).fit_transform(data['Sentence']).shape)

print(" ".join(standardize_tokenize(text)))
print("Shape: ", CountVectorizer(tokenizer=standardize_tokenize).fit_transform(data['Sentence']).shape)

print(" ".join(standardize_tokenize_stemming(text)))
print("Shape: ", CountVectorizer(tokenizer=standardize_tokenize_stemming).fit_transform(data['Sentence']).shape)

print(" ".join(standardize_tokenize_lemmatization(text)))
print("Shape: ", CountVectorizer(tokenizer=standardize_tokenize_lemmatization).fit_transform(data['Sentence']).shape)

print(" ".join(standardize_tokenize_stemming_lemmatization(text)))
print("Shape: ", CountVectorizer(tokenizer=standardize_tokenize_stemming_lemmatization).fit_transform(data['Sentence']).shape)
# vectorizer = CountVectorizer(stop_words='english', strip_accents='unicode')
# vectorizer = CountVectorizer(tokenizer=standardize_tokenize)
# vectorizer = CountVectorizer(tokenizer=standardize_tokenize_stemming)
# vectorizer = CountVectorizer(tokenizer=standardize_tokenize_lemmatization)
vectorizer = CountVectorizer(tokenizer=standardize_tokenize_stemming_lemmatization)
# vectorizer = TfidfVectorizer(stop_words='english', strip_accents='unicode')
# vectorizer = TfidfVectorizer(tokenizer=standardize_tokenize_stemming_lemmatization)


# Retrieve the analyzer to store transformed sentences in dataframe
tokenizer = vectorizer.build_analyzer()
data['Sentence_stdz'] = [" ".join(tokenizer(s)) for s in data['Sentence']]

X = vectorizer.fit_transform(data['Sentence'])
# print("Tokens:", vectorizer.get_feature_names_out())
print("Nb of tokens:", len(vectorizer.get_feature_names_out()))
print("Dimension of input data", X.shape)

Classification with scikit-learn models

# clf = LogisticRegression(class_weight='balanced', max_iter=3000)
# clf = GradientBoostingClassifier()
clf = MultinomialNB()

from sklearn.model_selection import train_test_split
idx = np.arange(y.shape[0])
X_train, X_test, x_str_train, x_str_test, y_train, y_test, idx_train, idx_test = \
    train_test_split(X, data['Sentence'], y, idx, test_size=0.25, random_state=5, stratify=y)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

Display prediction performances

print(metrics.balanced_accuracy_score(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred, normalize='true')
cm_ = metrics.ConfusionMatrixDisplay(cm, display_labels=clf.classes_)

cm_.plot()
plt.show()

Print some samples

probas = pd.DataFrame(clf.predict_proba(X), columns=clf.classes_)
df = pd.concat([data, probas], axis=1)
df['SentimentPred'] = clf.predict(X)

df.to_excel("/tmp/test.xlsx")

# Keep only test samples that were correctly classified (sorting by predicted probability is done below)
df = df.iloc[idx_test]
df = df[df['SentimentPred'] == df['Sentiment']]

Positive sentences

sentence_positive = df[df['Sentiment'] == 'positive'].sort_values(by='positive', ascending=False)['Sentence_stdz']
print("Most positive sentence", sentence_positive[:5])

plt.figure(figsize = (20,20))
wc = WordCloud(max_words = 1000 , width = 1600 , height = 800,
collocations=False).generate(" ".join(sentence_positive))
plt.imshow(wc)

Negative sentences

sentence_negative = df[df['Sentiment'] == 'negative'].sort_values(by='negative', ascending=False)['Sentence_stdz']
print("Most negative sentence", sentence_negative[:5])

plt.figure(figsize = (20,20))
wc = WordCloud(max_words = 1000 , width = 1600 , height = 800,
collocations=False).generate(" ".join(sentence_negative))
plt.imshow(wc)

Lab 2: Twitter Sentiment Analysis

Step-1: Import the Necessary Dependencies

Install some packages:

conda install wordcloud
conda install nltk

# utilities
import re
import numpy as np
import pandas as pd
# plotting
import seaborn as sns
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# nltk
from nltk.stem import WordNetLemmatizer
# sklearn
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report

Step-2: Read and Load the Dataset

Download the dataset from Kaggle

# Importing the dataset
DATASET_COLUMNS=['target','ids','date','flag','user','text']
DATASET_ENCODING = "ISO-8859-1"
df = pd.read_csv('~/data/NLP/training.1600000.processed.noemoticon.csv',
                 encoding=DATASET_ENCODING, names=DATASET_COLUMNS)
df.sample(5)

Step-3: Exploratory Data Analysis

print("Columns names:", df.columns)
print("Shape of data:", df.shape)
print("type of data:\n", df.dtypes)
df.head()

Step-4: Data Visualization of Target Variables

  • Selecting the text and Target column for our further analysis

  • Replacing the values to ease understanding (re-coding the positive sentiment label 4 as 1)

# Copy the subset to avoid pandas' SettingWithCopyWarning when modifying it
data = df[['text','target']].copy()
data['target'] = data['target'].replace(4, 1)
print(data['target'].unique())

import seaborn as sns
sns.countplot(x='target', data=data)

print("Count and proportion of target")
data.target.value_counts(),  data.target.value_counts(normalize=True).round(2)

Step-5: Data Preprocessing

5.4: Separating positive and negative tweets

5.5: Taking 20,000 positive and 20,000 negative samples from the data so we can run it easily on our machine

5.6: Combining positive and negative tweets

data_pos = data[data['target'] == 1]
data_neg = data[data['target'] == 0]
data_pos = data_pos.iloc[:20000]
data_neg = data_neg.iloc[:20000]
dataset = pd.concat([data_pos, data_neg])

5.7: Text pre-processing

def standardize_stemming_lemmatization(text):
    out =  " ".join(standardize_tokenize_stemming_lemmatization(text))
    return out

dataset['text_stdz'] = dataset['text'].apply(lambda x: standardize_stemming_lemmatization(x))

QC, check for empty standardized strings

rm = dataset['text_stdz'].isnull() | (dataset['text_stdz'].str.len() == 0)

print(rm.sum(), "rows are empty or null, to be removed")
dataset = dataset[~rm]
print(dataset.shape)

# Save dataset to excel file to explore
dataset.to_excel('/tmp/test.xlsx', sheet_name='data', index=False)

5.18: Plot a cloud of words for negative tweets

data_neg = dataset.loc[dataset.target == 0, 'text_stdz']
plt.figure(figsize = (20,20))
wc = WordCloud(max_words = 1000 , width = 1600 , height = 800,
               collocations=False).generate(" ".join(data_neg))
plt.imshow(wc)

5.19: Plot a cloud of words for positive tweets

data_pos = dataset.loc[dataset.target == 1, 'text_stdz']
plt.figure(figsize = (20,20))
wc = WordCloud(max_words = 1000 , width = 1600 , height = 800,
               collocations=False).generate(" ".join(data_pos))
plt.imshow(wc)

Step-6: Splitting Our Data Into Train and Test Subsets

X, y = dataset.text_stdz, dataset.target
# Keep 95% of the data for training and 5% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=26105111)

Step-7: Transforming the Dataset Using TF-IDF Vectorizer

vectoriser = TfidfVectorizer(ngram_range=(1,2), max_features=500000)
vectoriser.fit(X_train)
#print('No. of feature_words: ', len(vectoriser.get_feature_names_out()))

X_train = vectoriser.transform(X_train)
X_test  = vectoriser.transform(X_test)

Step-8: Function for Model Evaluation

After training the models, we apply evaluation measures to check how they perform, using the following evaluation parameters:

  • Accuracy Score

  • Confusion Matrix with Plot

  • ROC-AUC Curve

def model_Evaluate(model):
    # Predict values for Test dataset
    y_pred = model.predict(X_test)
    # Print the evaluation metrics for the dataset.
    print(classification_report(y_test, y_pred))
    # Compute and plot the Confusion matrix
    cf_matrix = confusion_matrix(y_test, y_pred)
    categories = ['Negative','Positive']
    group_names = ['True Neg','False Pos', 'False Neg','True Pos']
    group_percentages = ['{0:.2%}'.format(value) for value in cf_matrix.flatten() / np.sum(cf_matrix)]
    labels = [f'{v1}\n{v2}' for v1, v2 in zip(group_names,group_percentages)]
    labels = np.asarray(labels).reshape(2,2)
    sns.heatmap(cf_matrix, annot = labels, cmap = 'Blues',fmt = '',
    xticklabels = categories, yticklabels = categories)
    plt.xlabel("Predicted values", fontdict = {'size':14}, labelpad = 10)
    plt.ylabel("Actual values" , fontdict = {'size':14}, labelpad = 10)
    plt.title ("Confusion Matrix", fontdict = {'size':18}, pad = 20)

Step-9: Model Building

For this problem, we use three different models:

  • Bernoulli Naive Bayes Classifier

  • SVM (Support Vector Machine)

  • Logistic Regression

The idea behind choosing these models is to try classifiers ranging from simple to more complex, and to find out which one performs best on this dataset.

BNBmodel = BernoulliNB()
BNBmodel.fit(X_train, y_train)
model_Evaluate(BNBmodel)
y_pred1 = BNBmodel.predict(X_test)

8.2: Plot the ROC-AUC Curve for model-1

from sklearn.metrics import roc_curve, auc
# Use predicted probabilities rather than hard 0/1 predictions to obtain a meaningful ROC curve
y_prob1 = BNBmodel.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob1)
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=1, label='ROC curve (area = %0.2f)' % roc_auc)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC CURVE')
plt.legend(loc="lower right")
plt.show()