Rajendra Gupta
Web Scraping

Web Scraping for SQL Machine learning using R scripts

December 14, 2020 by

In this article, we will explore Web Scraping using R Scripts for SQL Machine Learning.

What is the Web Scraping?

The web is a significant source of data in the digital era. All required information is available on the websites. For example, users can browse SQLShack for all useful SQL Server related articles. In the Power BI Desktop articles, we explored the way to fetch data directly from the URL.

Web Scraping is a process to extract the data from the websites and save it locally for further analysis. You can extract the information in a table, spreadsheet, CSV, JSON.

You can manually copy data from a website; however, if you regularly use it for your analysis, it requires automation. For this automation, usually, we depend on the developers to read the data from the website and insert it into SQL tables.

SQL Machine Learning language helps you in web scrapping with a small piece of code. In the previous articles for SQL Server R scripts, we explored the useful open-source libraries for adding new functionality in R.

Import RVest library for SQL Server R scripts

In this article, we use the rvest library for web scrapping. First, we need to install it using the Microsoft R client.

Navigate to the bin folder (C:\Program Files\Microsoft SQL Server\MSSQL15.INST1\R_SERVICES\bin) in your SQL instance and run the following command.

install.packages(“rvest”)

Import RVest library for SQL Server R scripts

Once you installed the external library, you should use the sp_execute_external_script stored procedure and check its version.

As shown below, its version is 0.3.2 for my SQL instance.

Rvest library version

A quick overview of HTML and CSS

It is necessary to understand the HTML and CSS used for a web page. It is not mandatory; however, if you have the necessary knowledge, it would help you to scrape data from a web page.

HTML: Hypertext Markup Language:

HTML is a markup language to define the content and structure of a web page. In the HTML, we use different tags and add the values inside it. In the below script, we have two tags – <head> and <body>. There are some predefined tags in HTML such as <p>. It is used to specify a paragraph in HTML.

  • <head>
  • </head>
  • <body>
  • <p>
  • Welcome!!
  • </p>
  • <p>
  • To SQLShack 
  • </p>
  • </body>
  • </html>

CSS: Cascading Style Sheets

CSS provides style sheets for a HTML document display. For example, in the below code, we use the style to define the color of the texts. Here, we apply the style CSS for the p tags in HTML.

  • <p style=”color:blue >Welcome</p>
  • <p style=”color:blue >SQLShack</p>

In CSS, we can also define the classes. Suppose you use the font colors in many places on your web page. If you need to change the color from blue to red or add additional styles, you need to modify it at each place. Instead of doing it, you can define a class and define styles in a separate class.

  • <p class=”myfont” >Welcome</p>
  • <p class=”myfont” >SQLShack</p></p>

In the below code, you can view the myfont class in CSS.

SQL Server R scripts for web scraping

For the demonstration purpose, I opened the following SQLShack page for my articles.

Sample page

You can call the web page directly from the SQL Machine learning R script or save the web page in a local directory in case you do not have internet connectivity.

Look at the below screenshot, and it fetches some data from the specified web page.

specified web page

To understand the script and scrapped data, right-click on the HTML page and click on View Source. I copied the generated HTML in the Visual Studio Code for better formatting and analysis.

In the first line, we use the read_html() function and reads the data from the HTML web page. The read_html() function accepts both web pages and HTML pages saved in the local directory. It converts the HTML into the XML. Later, we assign the extracted data into the mydata variable.

Next, we use html_nodes() function to search for the div and abh_description strings in the HTML.

Now, if you look at the VS Code for HTML file, it looks for both strings and fetches the data with its corresponding node tag a.

 VS Code for HTML file

Let’s convert the code into the SQL Machine learning R scripts and fetch more data.

The SQL R script returns the same result as we saw earlier in the R client console.

R client console

In the above section, we read a particular portion or tag from the SQLShack HTML page. Suppose, I want to fetch the titles of my articles on the current page. SQLShack displays 25 articles titles on a page; therefore, my output should also return 25 titles of articles.

Here, I modified the search pattern to div and entry-title.

It returns the 25 articles lists on the current page as shown below.

SQLShack HTML page

You can verify that article titles are in the entry_title tag; therefore, it reads the whole HTML document and returns the relevant information.

C:\Users\u369656\AppData\Local\Microsoft\Windows\INetCache\Content.MSO\EBE8307.tmp

Now, I am interested in analyzing the category tags. Each article on the homepage comprises of following:

  • Article title with hyperlink
  • Article Published date and author name
  • Brief description
  • Categories in which articles fit property

For example, in the below image, we have the article category – Azure Data Studio.

analyzing the category tags

To fetch this information from the web page using the SQL Machine Learning R scripts, locate the category tag and its corresponding div class. Here, you can note that the article category comes under the entry-meta.

fetch this information from the web page

Execute the following R script, and you get all categories in the specified HTML web page.

get all categories in the specified HTML web page

SQLShack page contains the following useful information on the right-hand side of the page.

  • Popular: It shows the popular articles on SQLShack

    Popular

  • Trending: Here, we get current trending articles on SQLShack, as shown below

    Trending

  • Solutions: In this section, it lists out the useful solutions for the SQLShack and SQL Community readers

    Solutions

  • Categories and tips: In this section, we get the articles categories and the number of articles in each category. Suppose you want to read backup-related articles so that you can click on the backup category and it lists all articles belongs to the specified category

    Categories and tips

Now, suppose we want to fetch all this information in a SQL table. It is not possible to copy the information manually and then insert it into the SQL table.

First, create a SQL table in which we insert the data using the SQL R script.

In the below R script, we insert the output into the SQLShack table. We modified the filters, and it now looks for the div and widget.

You can verify it through the HTML file of the web page.

HTML file of the web page

The script did not display the output on the SSMS screen. It inserted data into the specified table, therefore, execute a select statement on [SQLShack] table and view the output.

In the following screenshots, we can verify each category texts inserted into the SQL table.

category texts

category texts image

Conclusion

In this article, we explored the useful web scraping using SQL Machine Learning R scripts. You use this R for fetching data from web pages and insert into the SQL table and perform further analysis on the extracted data.

Rajendra Gupta
Machine learning, R

About Rajendra Gupta

As an MCSA certified and Microsoft Certified Trainer in Gurgaon, India, with 13 years of experience, Rajendra works for a variety of large companies focusing on performance optimization, monitoring, high availability, and disaster recovery strategies and implementation. He is the author of hundreds of authoritative articles on SQL Server, Azure, MySQL, Linux, Power BI, Performance tuning, AWS/Amazon RDS, Git, and related technologies that have been viewed by over 10m readers to date. He is the creator of one of the biggest free online collections of articles on a single topic, with his 50-part series on SQL Server Always On Availability Groups. Based on his contribution to the SQL Server community, he has been recognized with various awards including the prestigious “Best author of the year" continuously in 2020 and 2021 at SQLShack. Raj is always interested in new challenges so if you need consulting help on any subject covered in his writings, he can be reached at rajendra.gupta16@gmail.com View all posts by Rajendra Gupta

533 Views