Guide to web scraping with Node

In this post, we’ll be making our first small web scraping app.

Before we get started, let’s talk briefly about what web scraping is. The simplest definition of web scraping is “extracting data from websites”, which is somewhat implied by the name. It has always been something of a legal grey area. A legal discussion is beyond the scope of this post, though I recommend this blog post, which goes into more detail on that.

So, to introduce today’s project: we’ll be building a simple GitHub follower counter that shows, in the terminal, how many followers a user has on GitHub.

Initialising

First, let’s make a directory for this repository.

$ mkdir github-follower-count

Let’s navigate to it.

$ cd github-follower-count

Open it in your code editor. If you’re using Visual Studio Code, you can simply run code . from this directory.

Initialise npm.

$ npm init -y

Create the starting file.

$ touch index.js

Install puppeteer.

$ npm i --save puppeteer

You can learn more about puppeteer here

Getting Started With The Code

First things first, let’s require puppeteer.

const puppeteer = require('puppeteer')

Let’s now set up a way for the script to take a command-line argument, so it can output the follower count for any user.

let username = process.argv[2]

if (username == null) return console.log('Error! Please specify a user!')
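
The check above only catches a missing argument. As a rough sketch, the argument could also be validated before the profile URL is built. The helper name profileUrl and the allowed-character pattern are assumptions made for illustration, not part of the original script.

```javascript
// Hypothetical helper: turn a command-line argument into a GitHub profile URL.
// The allowed-character pattern is an assumption made for this sketch.
function profileUrl(username) {
  if (typeof username !== 'string' || username.trim() === '') {
    throw new Error('Please specify a user!')
  }
  const name = username.trim()
  if (!/^[A-Za-z0-9-]+$/.test(name)) {
    throw new Error(`"${name}" does not look like a GitHub username.`)
  }
  return `https://github.com/${name}`
}
```

Rejecting anything outside letters, digits, and hyphens keeps stray slashes or query strings out of the URL that the browser will later visit.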

Next, let’s create our function.

async function getFollowers(user=`https://github.com/${username}`) {

}

Inside it, let’s launch the browser, open a new tab, and navigate to the URL.

   let browser = await puppeteer.launch()
   let page = await browser.newPage()
   await page.goto(user)

Still inside the function, let’s evaluate the page.

   let githubFollowers = await page.evaluate(() => {

   })

Inside the callback, let’s get the follower count. If we navigate over to GitHub and right-click > View page source (or press Ctrl+U), we can see the code of the website.

In there, we can see that the span element with the class of text-bold text-gray-dark holds the current follower count.

Back to our code, let’s do

      var followerCount = document.querySelector('span.text-bold').innerHTML

Now, let’s output the results. There is a catch, however: if a user does not exist, the follower count will come back as “optional”. To prevent this, we can do…

      if (followerCount == 'optional') return('Error! Incorrect username, make sure to double check your spelling.')
      else return(`That user has a total of ${followerCount} followers!`)

Next, back to our function, let’s output this.

   let githubFollowers = await page.evaluate(() => {
      var followerCount = document.querySelector('span.text-bold').innerHTML

      if (followerCount == 'optional') return('Error! Incorrect username, make sure to double check your spelling.')
      else return(`That user has a total of ${followerCount} followers!`)
   })

   console.log(githubFollowers)
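
One thing the snippet above does not guard against is document.querySelector returning null when no element matches, in which case reading .innerHTML throws. A minimal sketch of a more defensive message builder follows. The name followerMessage is hypothetical, and treating the text “optional” as “user not found” simply mirrors the check above.

```javascript
// Hypothetical helper: build the output message from the scraped text.
// `text` is the element's text, or null when the selector matched nothing.
function followerMessage(text) {
  const count = text == null ? '' : String(text).trim()
  if (count === '' || count === 'optional') {
    return 'Error! Incorrect username, make sure to double check your spelling.'
  }
  return `That user has a total of ${count} followers!`
}
```

With this approach, page.evaluate would only return the element’s text (or null), and the message would be built afterwards on the Node side.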

Make sure to close the browser as well, still inside the function.

await browser.close()

At the bottom, don’t forget to call this function.

getFollowers()

And you should be good to go! Make sure to run node index.js followed by a username to test it out!
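
For reference, here is one way the steps could fit together so that the browser is closed even when a step throws. This is a sketch under assumptions rather than the exact script built above: the function takes the puppeteer module as a launcher parameter and the username as a second parameter (both choices made for this example), and it returns the raw text instead of logging it.

```javascript
// Sketch: the same steps, with the browser closed even if a step throws.
// `launcher` is expected to be the puppeteer module; passing it in is a
// choice made for this sketch so the function can be exercised without it.
async function getFollowers(launcher, username) {
  const browser = await launcher.launch()
  try {
    const page = await browser.newPage()
    await page.goto(`https://github.com/${username}`)
    // Runs in the page context; returns null if the selector matches nothing.
    return await page.evaluate(() => {
      const el = document.querySelector('span.text-bold')
      return el ? el.innerHTML : null
    })
  } finally {
    // Always release the browser, whether or not the steps above succeeded.
    await browser.close()
  }
}
```

It could then be called as getFollowers(require('puppeteer'), process.argv[2]).then(console.log).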

from Tumblr https://generouspiratequeen.tumblr.com/post/635562631759069184
