Scrape Website Metadata using Azure Functions

A Quick Guide to Scrape Website Metadata Using open-graph-scraper and Azure Function

Decoupling parts of your web application and handing some of its workloads to serverless resources has never been easier. The same goes for production applications that use serverless code to update data dynamically in real time.

Azure Functions provides a simple experience for writing serverless code and integrating it with your existing codebase, which in turn makes it fairly easy for newcomers to get started with serverless.

I'll show you how you can use Azure Functions to pull metadata from multiple websites. I'll be using a Node.js package called open-graph-scraper that pulls the meta information from websites.

What is Header Meta Data?

Have you ever typed a URL into a Twitter or LinkedIn post and wondered how the URL automatically gets its card-shaped preview? Let's see exactly how this is done!

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/e87b36a6-2541-45f9-b8bd-e1fac7888d46/2020-11-22_16-55-21.gif

Meta information is used by social media and other Open Graph-aware platforms to create a display card when a URL is shared. Another reason to use meta information is to improve your SEO visibility. Many platforms, like Ghost and WordPress, do this for you automatically. However, when working with custom-built websites, you have to add the information manually so that search engines can crawl and recognize the details of your website. You would often go to websites like heymeta.com or metatags.io to confirm that your website displays the card information.

This article will guide you through deploying an Azure Function that returns the meta information of any website you query and shows it in the browser. The project is based on the same idea: pulling live data from your website's meta information and confirming whether it will display correctly on social media platforms. You can use it to query a website directly for SEO testing, or add it to your portfolio as a serverless cloud project.
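
To make this concrete, here is a minimal sketch of what Open Graph tags look like and how they can be read. The sample HTML, the extractOgTags helper, and the regular expression are illustrative assumptions; this is not how any platform or open-graph-scraper parses pages, and a real tool should use a proper HTML parser.

```javascript
// Hypothetical page head with Open Graph tags (sample data for illustration).
const sampleHtml = `
<head>
  <title>My Project</title>
  <meta property="og:title" content="My Project" />
  <meta property="og:description" content="A short summary shown on the card." />
  <meta property="og:image" content="https://example.com/cover.png" />
</head>`;

// Naive extractor: collects <meta property="og:*" content="..."> pairs.
// Assumes double-quoted attributes with property written before content.
function extractOgTags(html) {
  const pattern = /<meta\s+property="og:([^"]+)"\s+content="([^"]*)"/gi;
  const tags = {};
  for (const match of html.matchAll(pattern)) {
    tags[match[1]] = match[2];
  }
  return tags;
}

const ogTags = extractOgTags(sampleHtml);
console.log(ogTags);
```

The title, description, and image values collected here are the pieces a platform would typically combine into the preview card.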

Table of Contents

  1. Prerequisites
  2. Azure Function Overview
  3. Setting up the Function Locally
  4. Testing the Function Locally
  5. Deploying Function to Azure
  6. How Is This Useful At All?
  7. Conclusion
  8. Reference

Prerequisites

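If you plan to follow along, please make sure you have the following prerequisites in place:

  • An Azure subscription
  • Azure Functions Core Tools
  • Node.js version 12
  • VS Code with the Azure Functions extension
  • The Az PowerShell module or the Azure CLI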

Azure Function Overview

Azure Functions is an on-demand serverless compute service. Functions let you write a block of code that runs on demand when needed. Resource allocation is handled by the service itself, which makes sure there is enough compute available to support the runtime of the code. Some common uses of Azure Functions are building a web API, processing file uploads, running serverless workflows, responding to database changes, running scheduled tasks, and processing data in real time.

That being said, you can offload application tasks to a function and let it handle the workload for you without worrying about memory, CPU, and other resource utilization.
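
As a rough illustration of what such a block of code looks like, here is a minimal sketch of an HTTP-triggered function written in the same async function (context, req) style used later in this guide. The name query parameter and the greeting are placeholders, not part of the project.

```javascript
// Minimal HTTP-triggered handler: the runtime supplies `context` and `req`,
// and the handler answers by setting `context.res`.
const hello = async function (context, req) {
  const name = (req.query && req.query.name) || "world"; // placeholder parameter
  context.res = {
    status: 200,
    body: { message: `Hello, ${name}!` },
  };
};
```

Inside a function's index.js, a handler like this would be exported with module.exports = hello; so the runtime can find it.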

Setting up the Function Locally

The first step towards using Azure Functions is to install the Azure Functions Core Tools. Follow the link and install the tools before proceeding. Since we are going to create a JavaScript function, you also need to install Node.js version 12 at this point.

Once you have installed the Core Tools and Node.js, follow the rest of the steps to create your function.

Creating a New Function

After installing the Azure Functions Core Tools, you can test the installation by running the first command below in any terminal. Use the remaining commands to confirm your installation of the Az PowerShell module or the Azure CLI.

func --version

# PowerShell Module
PS> (Get-Module -ListAvailable Az).Version

# CLI Module
$> az --version
Validate Function Core Tools Installation

Now it's time to create your function. Run the following command to initialize the function project.

PS> func init
Initiate Function

This will ask you to confirm the worker runtime and its language. Choose "node" as the runtime and "javascript" as the language. The end result should look something like this:

MetaFunction> func init
Use the up/down arrow keys to select a worker runtime:node
Use the up/down arrow keys to select a language:javascript
Writing package.json
Writing .gitignore
Writing host.json
Writing local.settings.json
Writing C:\Users\{user}\Documents\MetaFunction\.vscode\extensions.json
MetaFunction> ls

    Directory: C:\Users\{user}\Documents\MetaFunction

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d-----        11/21/2020  10:39 PM                .vscode
-a----        11/21/2020  10:39 PM            437 .gitignore
-a----        11/21/2020  10:39 PM            302 host.json
-a----        11/21/2020  10:39 PM            147 local.settings.json
-a----        11/21/2020  10:39 PM            138 package.json
Function Init Results

The next step is to create the function itself with an HTTP trigger. Use the following command to do so. Feel free to use any name for the function instead of FetchMetaData, which is the one I'm using. When prompted for the template, choose HTTP trigger and hit Enter.

PS> func new --name FetchMetaData
Use the up/down arrow keys to select a template: HTTP trigger
Creating New Function

The whole terminal session should look similar to this.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/df90946b-6c67-4e49-a50e-0d147054c57a/01-FuncInitCreate.png
Func Init PowerShell View

Installing NPM modules and VS Code Extension

Once the function is ready, open the folder that contains the function components in VS Code. You need to add a Node package for use in the script. Run the following command in the terminal, from inside the folder, to install the package required to run the scraper.

PS> npm install open-graph-scraper
Install NPM Package

Next, use the following screenshot as a reference to search for the Azure Functions extension, which you'll need to push the code to your Azure subscription. I'd highly recommend installing both Azure Functions and Azure Tools to get the best experience while working with Azure resources in VS Code.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/dd88a2c6-d4e1-464a-a94d-0d4305a70771/02-VSCode_Extension.png
Installing VS Code Extensions

Writing Code to Pull Header Data

Now it is time to write the code that fetches the header metadata. You'll modify the existing index.js file inside the folder named after your function. In my case, it is called FetchMetaData. Follow these steps in VS Code:

  • On the left side panel, navigate inside your function folder and open the index.js file.
  • Replace the code with the following snippet.
module.exports = async function (context, req) {
  const ogs = require("open-graph-scraper");
  const options = { url: `https://${req.query.url}` };
  const data = await ogs(options, (error, results, response) => {
    return results; // This contains all of the Open Graph results
  });
  console.log(data);

  context.res = {
    body: data,
  };
};
Code to Fetch Header Data

Here's a breakdown of each line if you need a detailed explanation.

  • The ogs variable loads the open-graph-scraper NPM package used by the script.
const ogs = require("open-graph-scraper");
  • The options variable builds the target address from the url query parameter of the request, so you can query any webpage by passing its address at runtime. You'll see the final request URL, with the website passed in as a parameter, when testing the function.
const options = { url: `https://${req.query.url}` };
  • The data variable performs the request against the URL and fetches the results, which contain all the Open Graph information.
const data = await ogs(options, (error, results, response) => {
  return results; // This contains all of the Open Graph results
});
  • This part is optional; it prints the results in the terminal on the host machine during testing.
console.log(data);
  • Finally, the data received from open-graph-scraper is returned in the response as JSON.
context.res = {
  body: data,
};
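
The snippet above mixes await with a callback and does not handle a missing url parameter or a failed scrape. As one possible variation, the sketch below uses only the promise form and adds basic validation. It assumes the installed open-graph-scraper version resolves to an object with a result property and rejects on failure, which is worth checking against the package documentation for your version. The createHandler name and the injected scrape argument are hypothetical, introduced here so the handler can be exercised without network access.

```javascript
// Builds the Azure Functions handler around a scraper function.
// `scrape` is expected to behave like open-graph-scraper's promise API
// (assumption: resolves to { result, ... } and rejects on failure).
function createHandler(scrape) {
  return async function (context, req) {
    const site = req.query && req.query.url;
    if (!site) {
      context.res = { status: 400, body: { error: "Missing 'url' query parameter" } };
      return;
    }
    try {
      const { result } = await scrape({ url: `https://${site}` });
      context.res = { status: 200, body: result };
    } catch (err) {
      // Log the failure and return a generic error to the caller.
      if (context.log) context.log(`Scrape failed for ${site}`);
      context.res = { status: 502, body: { error: "Could not fetch metadata" } };
    }
  };
}

// In index.js this could be wired up as (assuming the package exports the
// scraper function directly, as it does in the snippet above):
// module.exports = createHandler(require("open-graph-scraper"));
```

Passing the scraper in as an argument is only a convenience for testing; the behavior against real websites still depends on the package version you install.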

Testing the Function Locally

Now it is time to run the function locally and test the functionality of the code. To run the code, simply run the following command in your terminal while at the root of the function folder.

PS> func start

Output:
Azure Functions Core Tools
Core Tools Version:       3.0.2996 Commit hash: c54cdc36323e9543ba11fb61dd107616e9022bba
Function Runtime Version: 3.0.14916.0

Functions:

        FetchMetaData: [GET,POST] http://localhost:7071/api/FetchMetaData

For detailed output, run func with --verbose flag.
[2020-11-22T23:41:11.100Z] Worker process started and initialized.
Trigger Function Locally

You should see the function URL in the terminal, followed by logs each time you trigger the function. Take the URL shown in the terminal and append the query string that tells the function which website to inspect. Remember that the code reads req.query.url? You need to pass in a parameter named url so that it can be received and processed by the code. Because the code adds https:// itself, pass only the domain. Note that the local host listens on plain HTTP.

http://localhost:7071/api/{funcName}?url={website.com}
Function URL
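
If you prefer to build the request URL in code rather than by hand, a small sketch using the standard URL class could look like this. The buildLocalUrl helper is a hypothetical name; port 7071 and the function name are taken from the terminal output above.

```javascript
// Builds the local test URL for the function. The handler adds "https://"
// on its own, so any scheme the user typed is stripped first.
function buildLocalUrl(funcName, site) {
  const bareSite = site.replace(/^https?:\/\//i, "");
  const url = new URL(`http://localhost:7071/api/${funcName}`);
  url.searchParams.set("url", bareSite); // value is percent-encoded as needed
  return url.toString();
}

console.log(buildLocalUrl("FetchMetaData", "https://github.com"));
// prints "http://localhost:7071/api/FetchMetaData?url=github.com"
```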

Here's an example of results that you'll see when a URL is passed in as a request parameter. It should also print the same information in your terminal session.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/6575a825-478d-49fa-b1b9-580128e44087/03-URLRequest.png
Function Trigger Web Results

Deploying Function to Azure

You made it this far! The last step is to deploy this to an Azure Function resource in your subscription. I'll be using the VS Code extension, as it lets me create the resource and deploy the function without leaving VS Code. Follow the steps below if you are also using VS Code:

Before you publish this code to Azure, you should change the function auth type so that it can be called without any token. I'm doing this for testing only. DO NOT do this if you don't want your function to be open to the internet without a security token.

  • Open the function.json file and change the authLevel to anonymous to make it accessible to everyone.
https://s3-us-west-2.amazonaws.com/secure.notion-static.com/0071e042-ec45-41b9-841e-fb1f915d1693/04-01_ChangeAuthType.png
Auth Setting for Function
  • Navigate to the Azure Extension on the left side blade and click the highlighted arrow to start the deployment.
https://s3-us-west-2.amazonaws.com/secure.notion-static.com/593c460e-9a88-4e9d-b1fa-c8ebb2b7beec/04-FunctionDeploy.png
Deploy Local Function
  • If you don't see any resources or subscriptions in this panel, make sure you are logged in to your Azure account in VS Code. Press Ctrl+Shift+P to open the command search and type "sign in", which should give you an option to sign in to Azure.
  • Select the folder that contains your function code. It should show you the folder automatically based on your VS Code open folder.
https://s3-us-west-2.amazonaws.com/secure.notion-static.com/5af75bff-2fac-4ac6-8d59-a4e0dd551569/17-AzureSignIn.png
Select function folder
  • Next step is to choose the subscription where you'd like to host the function.
https://s3-us-west-2.amazonaws.com/secure.notion-static.com/63b21cf6-7312-45b7-a850-740331458bc2/05-SelectFolder.png
Select Subscription
  • If you already have a function app that you'd like to reuse, this is the time to choose it. If you want to use a brand new one, click "+ Create new Function App".
https://s3-us-west-2.amazonaws.com/secure.notion-static.com/e2c7c464-3b94-41d3-94d7-c9426068f114/06-ChooseSub.png
Create New Function
  • Choose a globally unique name for your function.
https://s3-us-west-2.amazonaws.com/secure.notion-static.com/322b1e8a-4fcc-461f-9ecf-dc2b8915b7ac/07-CreateNewFunc.png
Function Name
  • Select the runtime environment to be "Node.js 12 LTS"
https://s3-us-west-2.amazonaws.com/secure.notion-static.com/e7bcac88-4b0e-42bc-a264-3640769f3c99/08-EnterFuncName.png
Node JS Runtime
  • Finally choose a location where you'd like to host the Azure resource.
Function Location
  • Keep an eye on the bottom left of the screen for the progress on the resource deployment.
https://s3-us-west-2.amazonaws.com/secure.notion-static.com/ad3383ba-d948-4cc0-99bf-7f57e88ad276/09-FunRuntime.png
  • Once the function resource is deployed, there is one last step: publishing your code to the newly created function. Click "Yes" when prompted to initialize the project.
https://s3-us-west-2.amazonaws.com/secure.notion-static.com/c99f88f5-6439-4c69-a285-54249bca4983/12-InitializeProject.png
  • Again, keep an eye on the bottom left for the code initialization on your subscription. Click "View Output" to see the results for your function app.
https://s3-us-west-2.amazonaws.com/secure.notion-static.com/ddb826b4-0056-4bd1-b4c9-7ab1d04fdf15/13-CompleteDeployment.png
  • The output terminal contains a URL that you can access directly now using a web browser.
https://s3-us-west-2.amazonaws.com/secure.notion-static.com/8e471196-eb06-427f-86aa-013aedf72339/14-OutputURL.png
  • Use the URL mentioned in the output terminal and pass in the URL parameter with the address to query a website for its header meta-data as shown below.
https://s3-us-west-2.amazonaws.com/secure.notion-static.com/f255e57f-d4bd-4718-bc13-93bc6556bcf8/16-FunctionLiveTrigger.png
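
If you would rather keep authLevel at function instead of anonymous, callers need to send a function key with each request. The sketch below shows one way to assemble such a request. The app name, the key value, and the buildKeyedRequest helper are placeholders, and it assumes the key is sent in the x-functions-key header.

```javascript
// Assembles the URL and request options for a key-protected function.
// All argument values used below are placeholders.
function buildKeyedRequest(appName, funcName, site, functionKey) {
  const url = new URL(`https://${appName}.azurewebsites.net/api/${funcName}`);
  url.searchParams.set("url", site);
  return {
    url: url.toString(),
    options: { method: "GET", headers: { "x-functions-key": functionKey } },
  };
}

const request = buildKeyedRequest("my-meta-app", "FetchMetaData", "github.com", "<your-function-key>");
console.log(request.url);
```

The resulting pair can be passed to fetch(request.url, request.options) on a runtime that provides fetch.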

How Is This Useful At All?

I'm glad you asked! I built my portfolio using dynamic data: it uses this same Azure Function to pull the meta header information from my project websites and displays the results in the form of cards. If you are still wondering, check out the URL below to see this in action.

Parveen Singh
Cloud Engineer | Solutions Consultant
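
As a rough sketch of that idea, the snippet below calls the deployed function for several sites and maps each response to the fields a card needs. The function URL, the site list, and the toCard and loadCards helpers are hypothetical. The ogTitle, ogDescription, and ogUrl field names are assumed from the result shape open-graph-scraper usually returns, and fetchJson is injected so the logic can run without network access.

```javascript
// Maps one scraper result to the fields a card needs (field names assumed).
function toCard(site, meta) {
  return {
    site,
    title: meta.ogTitle || site,
    description: meta.ogDescription || "",
    link: meta.ogUrl || `https://${site}`,
  };
}

// Queries the function once per site. `fetchJson` takes a URL and resolves
// to the parsed JSON body; it is injected to keep the sketch testable.
async function loadCards(functionUrl, sites, fetchJson) {
  const requests = sites.map(async (site) => {
    const meta = await fetchJson(`${functionUrl}?url=${encodeURIComponent(site)}`);
    return toCard(site, meta);
  });
  return Promise.all(requests);
}

// Example wiring on a runtime with a global fetch (placeholder URL):
// loadCards("https://my-meta-app.azurewebsites.net/api/FetchMetaData",
//   ["github.com"], (u) => fetch(u).then((r) => r.json())).then(console.log);
```

Falling back to the site name and a plain link keeps a card usable even when a website is missing some of its Open Graph tags.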

Conclusion

I hope you enjoyed the article and learned something new about Azure Functions while also getting exposed to a little bit of Node.js. If you are looking for more ideas on cloud projects you can build and showcase in your portfolio, please check out my other blog articles and follow me on Twitter for regular updates.

Reference