Dinesh Asanka
Distribution of documents for positive and negative labels.

Text Classification in Azure Machine Learning using Word Vectors

October 1, 2021 by

This article discusses how to perform Text classification in Azure Machine using a popular word vector technique in Text Mining. This article is part of the Azure Machine learning series during which we have discussed many aspects such as data cleaning and feature selection techniques of Machine Learning. We further discussed how to perform classification in Azure Machine Learning and we will be utilizing some of those experiences here. Further, we have discussed several machine learning tasks in Azure Machine Learning such as Clustering, Regression, Recommender System and Time Series Anomaly Detection. With respect to text mining, we have discussed techniques such as Language Detection, Named Entity Recognition, LDA, Text Recommendations, etc.

In this article, we will be looking at how to utilize word vectors in order to perform Text Classification in Azure Machine and we will be utilizing the popular WEKA tool as well.

What is a Word Vector

As we know documents have words and sentences and modelling them to analytics is typically a complex task. Therefore, these documents need to be converted to word vectors so that they can be modeled.

Waikato Environment for Knowledge Analysis (WEKA) has rich capabilities to build a word vector for text data. This article will use the word vectors that were created from the WEKA and use them in Azure Machine Learning in order to build the classification models.

We have selected the popular IMDB 2,000 reviews dataset with its ratings as shown on the screen from WEKA.

IMDB Dataset for Text Analytics

There are 2,000 reviews of which 1,000 of them are labeled as positive while the rest of the 1,000 reviews are labeled as negative as shown in the below screenshot.

Distribution of documents for positive and negative labels.

Our task is to find out what are the keywords that make positive or negative reviews. With this modeling, we will be able to classify unknown reviews into either Positive or negative classes.

After the file is loaded into the WEKA, you need to use a filter, Weka-> filters -> unsupervised -> attribute -> StringToWordVector as shown in the below image.

Using the filter option in WEKA.

Once the required filter is selected, you need to configure the selected filter.

Selected filter in WEKA

The following are the configurations for the selected StringToWordVector filter so that different types of word vectors can be derived.

Different options for StringtoWordVector in WEKA.

Though there are a lot of configurations for the WordVector filtering, we will be using only a few of these configurations. First, we will look at stemmer, stopwordhandler, tokenizer and no of words to keep as shown in the below figure.

Configuring stemmer, stopword, tokenizer and words.

The tokenization technique will decide how the keywords are selected. Since the AlphabeticTokenizer is selected, all the numeric and special characters will be dropped. Stopword configuration will drop the word that does not have any semantic meaning such as I, we, you, a, an, the etc. With the usage of lovinsStemmer stemmer, words will be converted to the base form. Please note that these are classical techniques that are processed in text mining analytics. These are mandatory techniques that should be performed on text documents.

Apart from the above techniques, there are few other techniques that would be dependent on the dataset. Those four configurations should be verified with different configurations using the WEKA.

Word Count: Output word counts rather than 0 or 1(indicating presence or absence of a word).

Term Frequency (TF): Term frequency is transformed to log (frequency +1 )

Inverse Document Frequency (IDF): If the same word is repeated in many documents, IDF will reduce the weightage of the frequency of the word counts.

Document Normalization: This parameter will consider the size of the document.

With four parameters, there are sixteen combinations as shown in the below table.

Combination

IDF

TF

World Count

Document Normalization

1

False

False

False

False

2

False

True

False

False

3

False

True

True

True

4

False

True

True

True

5

False

True

False

True

6

False

False

False

True

7

False

False

True

False

8

False

False

True

True

9

True

False

False

False

10

True

True

False

False

11

True

True

True

False

12

True

True

True

True

13

True

True

False

True

14

True

False

False

True

15

True

False

True

False

16

True

False

True

True

With the above combinations, sixteen files are created and these files are uploaded to the Word Vector for Different Configuration IMDB | Kaggle.

Now files are ready to use for Text Classification in Azure Machine Learning. All these sixteen files were uploaded to Azure Machine Learning for you to use.

Sixteen files that is used to Text Classification in Azure Machine Learning.

After the sixteen datasets were uploaded, next is to perform Text Classification in Azure Machine Learning. It will be the same as the text classification that was covered before and shown in the below figure.

Text Classification of single dataset in Azure Machine Learning.

As shown in the above figure, a Two-class neural network is used for text classification in Azure Machine Learning. The Split data control is used to split data between 70/30 for training and testing where the Train Model and Score Model were used. After the Evaluate model control. Then it is connected to a Convert to Dataset control. Similar 16 units are built for each dataset and connecting the same two-class neural network algorithm.

Multiple Apply SQL Transformation controls were used to combine datasets with the following scripts:

SELECT ‘IDF(F)TF(F)WC(F)NOR(F)’ AS Type,* from t1
UNION ALL
SELECT ‘IDF(F)TF(T)WC(F)NOR(F)’ AS Type,* from t2
UNION ALL
SELECT ‘IDF(F)TF(T)WC(T)NOR(F)’ AS Type,* from t3

Though we can use Add Rows control, we have used Apply SQL Transformation controls to reduce the number of controls as there is a limit of 100 controls. Add Rows control has only two inputs while Apply SQL Transformation has three inputs. By means of UNION, it was possible to reduce the number of controls.

Finally, this is the output for all datasets and their evaluation parameters for the Text Classification in Azure Machine Learning.

Evaluation parameters for sixteen different datasets.

From these combinations, you can choose the better combinations with the highest accuracy or precision or recall or F1-Score. As you can see IDF – True, TF – True, Word Count – False and Document Normalization – False is the better combination as it has the highest accuracy as well as the highest precision.

You can refer to the shared experiment as listed in the Reference section. Since this is a bulky experiment there are few things to point out. Since this is the free version of Azure Machine learning, not more than 100 controls cannot be used. Further, an experiment cannot contain more than 100 MB of data. Since we have 16 datasets, the size is more than 100 MB and removed some datasets from the experiment and you may add them back again from the shared dataset. Further, SQL Transformation will not work for columns more than 1024 which is another blocking concern for Text Classification in Azure Machine Learning.

As discussed in the last article, we could use Ensemble Classification for multiple configurations. Since accuracies are almost similar there is no need to perform ensemble classification.

In case we need to label unlabeled document, first, we need to transform with necessary parameters in WEKA and then input those values to the Azure Machine Learning in order to find the

Conclusion

This article discusses how to utilize Word Vector in WEKA with different text analytics parameters such as Term Frequency, Inverse Document Frequency, Document Normalization and Word count. We have used word vectors that were created from WEKA and loaded them into Azure Machine Learning. Using standard controls, it was possible to perform Text Classification in Azure Machine Learning. There are a few limitations with respect to free azure machine account such as the maximum number of controls etc.

References

Table of contents

Introduction to Azure Machine Learning using Azure ML Studio
Data Cleansing in Azure Machine Learning
Prediction in Azure Machine Learning
Feature Selection in Azure Machine Learning
Data Reduction Technique: Principal Component Analysis in Azure Machine Learning
Prediction with Regression in Azure Machine Learning
Prediction with Classification in Azure Machine Learning
Comparing models in Azure Machine Learning
Cross Validation in Azure Machine Learning
Clustering in Azure Machine Learning
Tune Model Hyperparameters for Azure Machine Learning models
Time Series Anomaly Detection in Azure Machine Learning
Designing Recommender Systems in Azure Machine Learning
Language Detection in Azure Machine Learning with basic Text Analytics Techniques
Azure Machine Learning: Named Entity Recognition in Text Analytics
Filter based Feature Selection in Text Analytics
Latent Dirichlet Allocation in Text Analytics
Recommender Systems for Customer Reviews
AutoML in Azure Machine Learning
AutoML in Azure Machine Learning for Regression and Time Series
Building Ensemble Classifiers in Azure Machine Learning
Text Classification in Azure Machine Learning using Word Vectors
Dinesh Asanka
168 Views