Web Scraping with Python

Photo by Markus Spiske on Unsplash

What is web scraping?

Web scraping is the process of collecting or gathering information from the internet (websites). There are several approaches, from the naive copy-paste method to fully automated programs. Web scraping is becoming more popular nowadays as the internet keeps growing, and there is a lot of information we can get from it. People often gather this information for specific purposes, e.g. research or analysis. In this article, we will learn how to scrape and collect data from websites.

Why scrape the web?

Sometimes we face situations where we need to surf the web back and forth to gather as much information as we can — for example, buying something from an online marketplace. While doing so, we usually scan every price and other details of the posted products, looking for the one that suits us best. Scanning the web manually like this is inefficient. Now that technology has evolved so much, why don’t we use it to ease our work? Hold on a sec! You’re about to learn how to build an automated web scraper with Python. Keep scrolling down!

Step 1: Inspect the Website

Open the website you want to scrape in your favorite browser. Take a glimpse at the page and try to analyze its structure, then open your browser’s developer tools. The picture below shows how to open the developer tools in Google Chrome.

How to Go to Developer Tools on Google Chrome
Web Inspector on Google Chrome

Step 2: Scrape the Website Using Selenium

Now you can start coding the scraper. In this article we will do it in Python, using the Selenium library to drive the browser. If you haven’t installed the library yet, install it first with the pip command.

pip install selenium
from selenium import webdriver as wd

# Launch Chrome through its webdriver
driver = wd.Chrome('path to your webdriver')

# Load the page, then grab its full HTML source
link = "the website link"
driver.get(link)
content = driver.page_source

Step 3: Parse the HTML Code Using BeautifulSoup4

Once you have scraped the page, you need to pick out the information you want to collect, using the HTML tags you identified earlier in step 1. Note that Selenium returns raw HTML, which you won’t want to keep as-is because it’s messy and hard to read. You need to parse that HTML, and this is where the BeautifulSoup4 library comes in. It will parse the HTML and return the information you need as text or strings. If you haven’t installed this library yet, you can easily get it by executing the command below.

pip install beautifulsoup4
from bs4 import BeautifulSoup

# Parse the raw HTML that Selenium returned
soup = BeautifulSoup(content, 'html.parser')

# Grab a single element by its HTML id
example = soup.find(id='someHTML-ID')

# Collect every matching <div>, then pull the details out of each one
x = soup.find_all('div', attrs={'class': 'someHTML-Class'})
info_data = []
for i in x:
    info1 = i.find('span', attrs={'class': 'someClass1'})
    info2 = i.find('div', attrs={'class': 'someClass2'})
    info3 = i.find('span', attrs={'class': 'someClass3'})
    info_data.append([info1, info2, info3])
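To see this parsing pattern end to end without a live website, here is a minimal, self-contained sketch that runs the same find_all/find loop over a hard-coded HTML snippet. The tag structure, class names, and product data are all invented for illustration — on a real site you would use the classes you found in step 1 and feed in driver.page_source instead:

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source: a tiny made-up product listing
content = """
<div class="product"><span class="name">Pen</span><span class="price">$2</span></div>
<div class="product"><span class="name">Notebook</span><span class="price">$5</span></div>
"""

soup = BeautifulSoup(content, 'html.parser')

info_data = []
for item in soup.find_all('div', attrs={'class': 'product'}):
    # .get_text() strips the tags and keeps only the readable string
    name = item.find('span', attrs={'class': 'name'}).get_text()
    price = item.find('span', attrs={'class': 'price'}).get_text()
    info_data.append([name, price])

print(info_data)  # [['Pen', '$2'], ['Notebook', '$5']]
```

Note that find returns the whole tag object, so calling .get_text() (or reading .text) is what turns it into a plain string you can store or analyze.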
