Standard NLP Workflow for Document Similarity and Failure-Mode Recognition

by Tulasi Ram Ponaganti · 6 min read · Oct 13, 2020

My research as a data scientist at the Volvo Group challenged my skills and knowledge, pushing me to bring them up to the same level and to come up with new ideas that had a real impact on how we approached the goal.

Major Tasks

Packages and Tools Used:

Gensim, Pandas, NumPy, Scikit-learn, t-SNE, Word2Vec, fastText

Azure Databricks, code-server, Atlassian Jira, Azure Data Lake.

How organizations' data looks in real-world business:

Most of the data in the world is unstructured because, in human communication, messages are transmitted in words, not in tables or other structured formats. Every day we produce unstructured data from emails, SMS, tweets, feedback, social media posts, blogs, articles, documents, customer reports, and more.

As we all know, text is the most unstructured form of all the available data, and extracting meaning from text is hard. Computers cannot yet truly understand text, even in English, the way humans do, but they can already do a lot with it. In some areas, what you can do with NLP on a computer or machine already seems like magic.

NLP helps us organize massive chunks of text data and solve a wide range of problems such as document similarity, chatbots, machine translation, text summarization, named entity recognition (NER), topic modeling and topic segmentation, semantic parsing, question answering (Q&A), relationship extraction, sentiment analysis, and speech recognition.

NLP algorithms are based on machine learning algorithms. Doing anything complicated in machine learning usually means building a pipeline. The idea is to break up your problem into very small pieces and then use machine learning to solve each smaller piece separately. Then by chaining together several machine learning models that feed into each other, you can do very complicated things.

You might be able to solve lots of problems and also save a lot of time by applying NLP techniques to your own projects. Using NLP, we’ll break down the process of understanding text (English) into small chunks of words and see how each one works.

NLP Pipeline:

Preprocessing pipeline

Based on my experience with this NLP project, I implemented a plan and approach that I will explain in the stepwise phases below.

Step 1: Text Pre-Processing Pipeline

Implemented a data preprocessing pipeline, which is a major step in turning unstructured data into structured data. The main operations were (a minimal code sketch follows the list):

  • Handled missing values
  • Removed HTTP links
  • Converted all letters to lower or upper case
  • Converted numbers into words or removed numbers
  • Removed punctuation, accent marks, and other diacritics
  • Removed extra white space
  • Expanded contractions
  • Expanded abbreviations
  • Removed stop words, sparse terms, and particular words
  • Text canonicalization
  • Handled mistyped and bad words
  • Tokenization
  • Stemming
  • Lemmatization
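
To make these steps concrete, here is a minimal preprocessing sketch in Python with NLTK. The regular expressions, the preprocess helper, and the example sentence are illustrative assumptions, not the production pipeline used at Volvo.

```python
# Minimal preprocessing sketch (illustrative, not the production pipeline).
# Requires: pip install nltk, then nltk.download("stopwords") and nltk.download("wordnet").
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    text = text.lower()                                  # lower-casing
    text = re.sub(r"http\S+|www\.\S+", " ", text)        # remove HTTP links
    text = re.sub(r"\d+", " ", text)                     # remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)                # remove punctuation and leftover symbols
    tokens = text.split()                                # tokenization + white-space cleanup
    tokens = [t for t in tokens if t not in stop_words]  # remove stop words
    return [lemmatizer.lemmatize(t) for t in tokens]     # lemmatization

print(preprocess("Check https://example.com: the trucks were serviced 3 times!"))
# -> ['check', 'truck', 'serviced', 'time']
```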

Step 2: Creating the Bag-of-Words Model (the Volvo Dictionary)

A “bag of words” (BoW) is a simple and fundamental technique in natural language processing (NLP) and information retrieval for representing and analyzing text data. It converts text documents into numerical vectors that can be used for text-analysis tasks.

After tokenization, we check the frequency of words using the bag-of-words model. This model simply transforms tokens into count vectors; it does not tell us how important a token is within the corpus.
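
As a small illustration (the repair-style documents below are toy data, not the Volvo corpus), a bag-of-words dictionary and corpus can be built with Gensim roughly like this:

```python
# Building a dictionary and bag-of-words corpus with Gensim (toy documents).
from gensim.corpora import Dictionary

docs = [
    ["engine", "oil", "leak", "repair"],
    ["brake", "pad", "replacement", "repair"],
    ["engine", "noise", "diagnosis"],
]

dictionary = Dictionary(docs)                        # maps each unique token to an integer id
bow_corpus = [dictionary.doc2bow(d) for d in docs]   # per document: list of (token_id, count)

print(dictionary.token2id)
print(bow_corpus[0])   # e.g. [(0, 1), (1, 1), (2, 1), (3, 1)] - counts only, no importance
```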

So, we check the importance of tokens using a TF-IDF model.

Step 3: Creating TF-IDF Model

TF-IDF is one of the best metrics to determine how significant a term is to a document within a collection or corpus. It is a weighting scheme that assigns each word in a document a weight based on its term frequency (tf) and inverse document frequency (idf). Words with higher weights are deemed more significant.

tf-idf(t, d) = tf(t, d) * idf(t)

Term Frequency (TF): In document d, the term frequency is the number of occurrences of a given word t. A word becomes more relevant the more often it appears in the text, which is intuitive. Since the ordering of terms is not significant in the bag-of-words model, we can describe a document with a vector that has one entry per term, whose value is that term's frequency.

Calculation:

tf(t, d) = (number of occurrences of t in d) / (total number of words in d)

Document Frequency (DF): the number of documents in the corpus in which the term t appears.

Calculation:

df(t) = number of documents in the corpus that contain the term t

Inverse Document Frequency (IDF):

The IDF of a word is the logarithm of the total number of documents in the corpus divided by the number of documents that contain the word.

idf(t) = log (N / df(t))
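
Continuing the toy bag-of-words example above, a TF-IDF weighting can be computed with Gensim's TfidfModel; this is only a sketch of the idea, not the exact weighting used in the project.

```python
# TF-IDF weighting on top of a Gensim bag-of-words corpus (toy documents).
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

docs = [
    ["engine", "oil", "leak", "repair"],
    ["brake", "pad", "replacement", "repair"],
    ["engine", "noise", "diagnosis"],
]
dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(d) for d in docs]

tfidf = TfidfModel(bow_corpus)        # learns idf weights from document frequencies
for doc in tfidf[bow_corpus]:
    # tokens that occur in fewer documents (e.g. "oil", "leak") get higher weights
    print([(dictionary[i], round(w, 3)) for i, w in doc])
```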

Step 4: Word-Embedding Models

Word2Vec:

Word2vec is a family of models used to produce word embeddings. Word2Vec (W2V) is an algorithm that accepts a text corpus as input and outputs a vector representation for each word.

There are two architectures in this algorithm, namely CBOW and Skip-Gram:

  • CBOW (continuous bag of words): The CBOW model predicts the current word given the context words within a specific window. The input layer contains the context words and the output layer contains the current word. The hidden layer has the number of dimensions in which we want to represent the current word present at the output layer.

  • Skip-Gram: The skip-gram model predicts the surrounding context words within a specific window, given the current word. The input layer contains the current word and the output layer contains the context words. The hidden layer has the number of dimensions in which we want to represent the current word present at the input layer. (A minimal training sketch for both architectures follows.)
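
As a hedged sketch of how such a model can be trained with Gensim (the sentences are toy data and the hyperparameters are illustrative), note that the sg flag switches between the two architectures:

```python
# Training a small Word2Vec model with Gensim (toy sentences, illustrative settings).
from gensim.models import Word2Vec

sentences = [
    ["engine", "oil", "leak", "repair"],
    ["brake", "pad", "replacement", "repair"],
    ["engine", "noise", "diagnosis", "repair"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the hidden layer / word vectors
    window=2,         # context window size
    min_count=1,      # keep every token in this tiny example
    sg=1,             # 1 = Skip-Gram, 0 = CBOW
    epochs=50,
)

print(model.wv["engine"][:5])            # first few dimensions of the word vector
print(model.wv.most_similar("engine"))   # nearest neighbors in the embedding space
```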

Here I would like to share the result of the Word2Vec model on a sample dataset from Kaggle, which shows the word representations in vector space.

Word2vec representation

Overview of the fastText model:

fastText can be used both for text classification and for creating word embeddings.

Here is a small example from daily life so that you can understand it easily.

We all use Facebook, and you have probably experienced at some point that you make a post and Facebook starts showing you ads exactly related to it.

For example, if you make a Facebook post saying you are going to quit your job to start a new venture of your own, Facebook suddenly starts showing you ads related to exactly that.

So how does Facebook know exactly what to show?

Well, it's the magic of its NLP library: fastText.

fastText is a free, open-source, lightweight library created by the Facebook Research team for efficient learning of word representations and sentence classification.

This library has gained a lot of attention in the NLP community, as it has shown state-of-the-art results in various NLP domains.
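
For comparison with Word2Vec, a fastText model can be trained through Gensim in much the same way; the snippet below is a sketch with toy sentences, and its main point is that fastText also produces vectors for unseen or misspelled tokens via character n-grams.

```python
# Training a small fastText model with Gensim (toy sentences, illustrative settings).
from gensim.models import FastText

sentences = [
    ["engine", "oil", "leak", "repair"],
    ["brake", "pad", "replacement", "repair"],
    ["engine", "noise", "diagnosis", "repair"],
]

model = FastText(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv.most_similar("engine"))   # nearest neighbors in the embedding space
print(model.wv["enginee"][:5])           # a misspelled/unseen token still gets a vector
                                         # built from its character n-grams
```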

Here I would like to share the result of my fastText model on a sample dataset from Kaggle, which shows the word representations in vector space.

Comparison of the two models: Of the two models above, the fastText model shows more nearest neighbors, covering more tokens in the vector space. This implies that a search engine built on it will perform better and that keyword search will be more efficient.

Finally, every NLP project built from scratch should go through the NLP pipeline as a first step and then be tested with word-embedding models that transform words into numerical vectors in a vector space.

In the next article, I will discuss how document-level embedding models work and how to implement state-of-the-art models such as BERT and XLNet for deep learning on complex text data.
