Rajendra Gupta
Term Extraction Transformation in SSIS

Term Extraction Transformation in SSIS

September 2, 2019 by

This article explores the Term extraction transformation in SSIS and its usage scenario.

Introduction

Developers face new requirements and challenges each day. They write complex codes to fulfill these requirements. Recently my developer friend approached me for text mining. The requirement is to extract the following information from text stored in the SQL table.

  • Noun
  • Noun phrases
  • Nouns and noun phrases

To explain this situation, create the following SQL table.

In this table, I inserted my published article titles on SQLShack (total 137) using the following query in the appendix.

Data in the SQL table looks like below.

Article list

In this article, we will explore how to extract the required information using the SSIS package transformations.

Prerequisites

  • You should have basic knowledge of SSIS package configuration such as data flow, execute a task, connection manager configurations. You can go through SSIS articles on SQLShack for this
  • This article uses term Extraction transformation in SSIS. It is an Enterprise level feature. You can configure it in other environments but cannot execute this without the Enterprise edition. Read more about Editions and supported features

Overview of Term Extraction Transformation in SSIS

We use the Term Extraction Transformation for extracting terms from the source data. We can extract four kinds of data from this transformation.

  • Nouns Only: It extracts a single-word noun, for example, Hat, Bottle, Table, and Car
  • Noun Phrases: Noun Phrases is a combination of Noun and Noun or an adjective. For Examples: Red Car, Hot bottle
  • Nouns and Noun Phrases: It extracts both the Nouns and Noun Phrases in this option

Few important facts about the Term Extraction Transformation in SSIS

  • We can extract information from English Text only. It uses its English dictionary for this
  • It does not extract the articles and pronouns. For example, the transformation fetches Car from the following words

    The Car, Your Car, That Car

  • It generates a score for each extracted term. We can use score type as Frequency or TF-IDF
    • Frequency: It represents several occurrences of a term in the text. It is a real number such as 1.0, 2.0
    • TF-IDF: It is a short form of Term Frequency and Inverse Document Frequency. We break into two parts to understand
      • TF: Term Frequency: It shows the frequency of a term in the text

        TF(t) = (Frequency of a term in the text) / (Total number of words in the text).

      • IDF: Inverse Document Frequency: It uses a mathematics log function to calculate the inverse document frequency

        IDF(t) = log_e(Total number of rows / Number of rows having term in it).

    • Suppose a document contains 100 words, and the SQL appears three times.

      TF: Term Frequency: for the term SQL, the TF value is 3/100=0.03

      IDF: Inverse Document Frequency: Suppose we have 200 rows in the SQL table and word SQL appears in 50 documents.

      We calculate the value for the IDF using the following function.

      Log(200/50)= 1.39

      Therefore, the value of TDIDF is 0.03*1.39 = 0.0417

      The Term Extraction Transformation in SSIS works with the text column having data type as DT_WSTR or DT_NTEXT. If we have text data in other columns, we can convert them to the appropriate format and use the transformation

      Extraction of Terms for Term Extraction Transformation in SSIS

      SSIS Term Extraction splits the text into smaller words. It considers the following points for the term extraction.

      • It uses the spaces, line breaks and word terminators such as Dot (.), Comma (,) and question mark (?)
      • We can have words with a hyphen or underscore in the text such as cross-region, read-only. Term extractions consider these as a single word
      • It ignores the special characters such as @,#. For example, it considers #SQL as SQL
      • It does not break the word using the apostrophe. For example, it considers word Children’s as Children
      • It splits the email address, postal address and time expressions into separate words
      • It ignores acronyms in a word and considers it as a single word. For example, it considers X.Y.Z as XYZ
      • It also converts the plural noun into a singular noun. For example:
        • Cars become Car
        • Women become Woman
        • Babies become Baby
        • Tooth becomes Teeth
      • By Default, the term extraction performs case insensitive search. It follows the following rules for the case insensitive search
        • It normalizes the capital letters into non-capitalized characters
        • Word Car, CAR and car all get converts to car
        • Ram drives car or Ram drives a CAR are similar for the text extraction
      • If we use the case sensitive search, the extraction considers word car and Car as separate words
      • Create an SSIS package for Term Extraction Transformation in SSIS

        In the Data Tools for Visual Studio, create an integration package and configure a data flow task. This data flow task selects the data from the source table we created earlier.

        Once we have configured the data flow task for the source table, you get the following package configuration.

        Configure a data flow task

        Drag the Term Extraction from the SSIS Toolbox and connect it with the source table using the green arrow.

        Term Extraction Transformation in SSIS

        Double-click on the term extractions, and it opens the term extraction transformation editor.

        Select the required columns

        In the first tab, Term Extraction, select the input columns. It shows the available columns from the configured data source. Put a check on the input column.

        Select the article column containing text

        It gives the two output columns Term and Score.

        • Term: It extracts terms such as Nouns from the text
        • Score: In this column, we get the frequency of a term in the text

        We can change the name of these columns as well. Type the new names in the text box, as shown below.

        term extraction transformation editor

        We will skip the configuration in the tab Exclusion as of now. Click on the Advanced tab, and it opens the following configurations.

        term extraction transformation editor advanced section

        We have following configuration available in the advanced section.

        • Term Type: we can select the term type as the following options
          • Noun
          • Noun phrase
          • Noun and noun phrase
        • Scope type: As defined earlier, we can select scope type as Frequency or TFIDF
        • Frequency threshold: By default, it shows the frequency threshold 2 in the screenshot. It shows that a particular term should repeat a minimum two times. If it appears less than two times, it ignores the term
        • The maximum length of item: in this column, we define the maximum length of a word or phrase. By default, its value is 12. It is available for Noun phrase, Noun and noun phrase term types
        • Use case-sensitive term extraction: We can perform a case-sensitive term extraction using this option

        Let’s use the following options for term extractions.

        Select the parameters in the configuration

        Click OK, and we can see the following SSIS data flow configuration. You see a warning message that rows sent to the error will be lost. We have not configured the error output, and we can ignore this warning message.

        Warning message in the term extraction

        If we do not want this warning message, double click on term extraction and click on Configure Error Output.

        Configure error output

        In the Error column, select Ignore failure option as shown below.

        Ignore failure

        We do not see a warning message for the term extraction.

        Warning message disappears

        Let’s add an OLE DB destination to store the output in a SQL table. You can use an existing table or create a new table.

        Select the table or view

        In the mapping, verify the mapping between source and destination tables.

        Mapping between source and destination

        The Term Extraction Transformation in SSIS configuration is now complete as shown in the following image.

        Term Extraction Transformation in SSIS configuration

        Execute the SSIS package. In the following screenshot, we can see the package executed successfully.

        Package successful execution

        Let’s check the records in the destination table. We can see the extracted Noun terms and their frequency.

        Output of term extraction

        Now, change the scope type as TFIDF for the Noun.

        Term type Noun and Scope TFIDF

        Execute the SSIS package, and we get the following output. You can notice the difference in the terms and their scores as compared to the frequency scope type.

        Output of TFIDF score

        Let’s modify the term extractions for

        • Score type: Frequency
        • Term type: Noun and noun phase

        Score type: Frequency and Term type: Noun and noun phase

        It extracts 25 rows from the source table and inserts into the destination table.

        Package successful execution

        In the output, you can see we get the noun phrase and noun. Previously, we get only a single word noun.

        Output of term extraction

        By default, term extraction transformation in SSIS is a case-insensitive activity. We can perform case-sensitive term extraction with the option Use case-sensitive term extraction.

        • Term type: Noun
        • Scope type: Frequency
        • Enable case-sensitive search

        Term type Noun and Score Frequency

        Execute the package to perform Term Extraction in SSIS and view the result in the destination table.

        perform Term Extraction in SSIS

        Before executing this package, I made a few changes in the article list and changed some word from SQL to Sql. It should treat both the word separate due to a case-sensitive search.

        In the output, you can notice both terms are different, and we get the frequency for both terms separately.

        Frequency

        Previously, we ignored the configuration in the Exclusion tab. We can specify the terms that we do not want to available in the output.

        Create an exclusion table and insert excluded terms. For example, we do not want BI term in the output.

        In the exclusion tab, select the OLE DB connection manager and select the table from the drop-down list.

        Exclusion terms

        Execute the package, and you do not get the term specified in the exclusion table in the output.

        Output after exclusion

        Conclusion

        The term extraction transformation in SSIS is a great idea to extract the relevant terms such as nouns, noun phrases from the specified text. You can also prepare a Word cloud in the Power BI Desktop after extracting information from the text. You can also use this transformation to analyze comments received from a blog post or website as well. You should explore this transformation to fulfill your requirements.

        Appendix

        Rajendra Gupta
        Latest posts by Rajendra Gupta (see all)
168 Views