Dinesh Asanka
Combining six datasets to for the demonstrations.

Latent Dirichlet Allocation in Text Analytics

July 8, 2021 by

Introduction

Until now, we have discussed a few topics in Analytics in Azure Machine Learning many aspects in the last couple of articles, we will be discussing the Latent Dirichlet Allocation in Text Analytics in this article.

Before this article, we have discussed the most commonly used machine learning techniques such as Regression analysis, Classification Analysis, Clustering, Recommender Systems and Anomaly detection of Time Series in Azure Machine Learning by using different sample datasets including data access to SQL Azure. Further, we have discussed the basic cleaning techniques, feature selection techniques and Principal component analysis, Comparing Models and Cross-Validation and Hyper Tune parameters in this article series as data engineering techniques to date. In the first article on Text Analytics, we had a detailed discussion on Language detection and Preprocessing of text in order to organize textual data for better analytics and how to recognize Named entities in Text Analytics. In the last article of the series, we have discussed how to perform Filter Based Feature Selection in Text Analytics.

What is Latent Dirichlet Allocation

Latent Dirichlet Allocation or LDA is a statistical technique that was introduced in 2003 from a research paper. LDA is used for topic modelling in text documents. LDA is more often analog to PCA that we covered before. If you remember in PCA, we used to generate a single value for the existing values in a dataset. LDA will generate a topic for documents by analyzing the content of the document. This technique can be used to cluster documents as well which is an important task in text analytics.

In the azure machine learning, there is a special LDA control. You can download and view the complicated Azure Machine Learning experiment from here. This experiment has more than 50 fifty controls and lets us see what the azure machine learning controls are used in the experiment.

Control

Purpose

Dataset

Six datasets were used.

Select Columns in Dataset

Remove unnecessary columns to improve readability.

Add Rows

Combine two different data streams into one data stream. Since only two data streams can be added by Add Rows control, multiple controls were used.

Detect Languages

Detect English language text.

Split Data

Split data depending on a criterion

Enter Data Manually

Enter data for as stop words

Preprocess Text

Preprocessing text so that the content can be modelled.

Clean Missing Data

Remove empty rows if exists

Latent Dirichlet Allocation

Apply LDA for the dataset and this is the new control that we are introducing from this article.

Apply SQL Transformation

Include Topic Name as a column

Dataset

Since we need a rich dataset, we will be using the same dataset that was used in the entity recognition article. In this, we have combined six different datasets and unnecessary columns were removed from Select Columns in Dataset control as shown in the below figure.

Combining six datasets to for the demonstrations.

We have joined the different news datasets and add them together using the Add Rows control.

After the dataset is created, next is to apply text analytics techniques that were performed in the previous articles.

Preprocessing of contents for modelling.

First, we need to filter the English text by using the Detect Languages and Split Data controls. After English contents were filtered, we have performed Preprocess Text so that special characters are removed and stop words such as “a, and, the, is” are removed, etc. With all these processes, there can be situations where empty text will present. These empty records will cause errors therefore by using the Clean Missing Data control, those empty records are removed.

After the processing of text data, now we are ready to apply Latent Dirichlet Allocation as shown in the below figure.

Applying LDA to the processed data set.

Now let us see how to configure the Latent Dirichlet Allocation in Azure Machine Learning as shown in the below figure.

Configuration of LDA.

First, we have selected the attribute that we need to apply the LDA which is processed_media_content in this example. We would be looking at finding the main five topics to the model which is configured at the Number of topics to the model parameter. For the LDA, we are going to use bi-gram, therefore, we have configured 2 for the N-gram option.

Let us analyse the first two outputs of the Latent Dirichlet Allocation control in Azure Machine Learning. Let us look at the first output which is the transformed dataset.

Transformed dataset in LDA

The first column in the figure is the original content in the source and the preprocessed text can be seen in the second column. The next five columns are columns that are added from the LDA control. Each weighted says how close each content to each topic. If you have selected a different number of topics, those topics will be displayed here.

Ideally what you need is the content and the topic name in one column so that you can identify to what topic each content belongs. We can assign content to a Topic that has more than 0.5 weightage. To achieve this, we have used split data control with the following settings as shown in the below figure.

Configuration of Split Data control.

Then the unnecessary columns are removed from Select Columns in Dataset control as we did in many situations in previous articles.

Then by using Apply SQL Transformation control, the necessary topic is added as shown in the below figure.

Applying SQL Query Script to add topic name to a column.

The same approach will be carried out for the other four topics as well as shown in the following figure. Please note that the only selected portion of the experiment is shown.

Adding Topic to a column.

All the data streams are added to a single data stream by using multiple Add Rows controls. The following is the final output from the above part of the experiment.

Each content is assigned to a single topic.

You can see that each content is assigned to the most relevant Topic.

Distribution of topics for the contents.

The above figure shows the topic distribution in the content.

Similarly, we can follow the same approach to the second output of the Latent Dirichlet Allocation control that is the Feature Topic Matrix.

This output will display the different features of topics. The first column of the following figure shows the bi-grams and weightage for each topic.

Feature topic matrix in LDA.

As we did in the previous section, we will be using Split Data and Add rows controls to generate the relevant Columns. It is important to note that in the Feature topic matrix of Latent Dirichlet Allocation, there is additional space in the column. If you have used the same components, make sure that you change the configurations accordingly.

You can see how each N-gram is contributed to different topics as shown in the following figure.

Topic allocation for each N-Gram

Following is the distribution for topics for each N-Gram feature.

Distribution for topics.

Please note that you can achieve this using python script as well. Python script would be much easier when you have topics in the range of 10 or more.

Summary

In this article, we looked at a topic modelling technique in Text Mining which is called, Latent Dirichlet Allocation. In Azure Machine Learning, there is a specific control named Latent Dirichlet Allocation. This control has the option of providing several topics. The control provides the different weights for the content and the N-grams.

By using the output of the LDA, and using the other controls such as Split Data, Add Rows, Select Columns in the Dataset and SQL Transformation controls, we were able to generate the most relevant topic for each content and the bi-gram feature. This feature can be extended to cluster the document as a similar Topic.

Reference: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/latent-dirichlet-allocation

Table of contents

Introduction to Azure Machine Learning using Azure ML Studio
Data Cleansing in Azure Machine Learning
Prediction in Azure Machine Learning
Feature Selection in Azure Machine Learning
Data Reduction Technique: Principal Component Analysis in Azure Machine Learning
Prediction with Regression in Azure Machine Learning
Prediction with Classification in Azure Machine Learning
Comparing models in Azure Machine Learning
Cross Validation in Azure Machine Learning
Clustering in Azure Machine Learning
Tune Model Hyperparameters for Azure Machine Learning models
Time Series Anomaly Detection in Azure Machine Learning
Designing Recommender Systems in Azure Machine Learning
Language Detection in Azure Machine Learning with basic Text Analytics Techniques
Azure Machine Learning: Named Entity Recognition in Text Analytics
Filter based Feature Selection in Text Analytics
Latent Dirichlet Allocation in Text Analytics
Recommender Systems for Customer Reviews
AutoML in Azure Machine Learning
AutoML in Azure Machine Learning for Regression and Time Series
Building Ensemble Classifiers in Azure Machine Learning
Text Classification in Azure Machine Learning using Word Vectors
Dinesh Asanka
168 Views