Basic web scraping crawler with Python and BeautifulSoup

Learn how to scrape data from websites with Python and save it to a CSV file for further processing.

Automated Crawling of Data with Python

In this tutorial, we'll explore a basic example of web scraping using Python and the BeautifulSoup library. Our goal is to scrape data from a website, extract the necessary information from the HTML content, and save it to a CSV file for further processing. Whether you're a beginner or looking to refresh your web scraping skills, this guide will walk you through the entire process step by step.

What You'll Learn

Fetching webpage data with the requests library.
Parsing HTML with BeautifulSoup.
Extracting specific data using CSS selectors.
Logging output to the console to track progress.
Saving the scraped data into a CSV file.
Implementing a page limit to avoid infinite loops.

Prerequisites

Before you begin, ensure you have the following Python packages installed:

requests
beautifulsoup4

You can install these packages using pip:

pip install requests beautifulsoup4

The Example Script

Below is the complete Python script that scrapes the data from a target website, logs each row to the console, and writes the final output to a CSV file. In this example, we scrape a colour guide website and extract details such as the hex value of the colour, the colour code, name, variation, car brand, year, and the product codes for spray and touch-up paints.

import requests
from bs4 import BeautifulSoup
import csv
import time

def scrape_motip_colors():
    # The page number is appended to this URL on every iteration
    base_url = "https://www.motip.com/en-en/colourguide?page="
    page = 1
    all_data = []

    while True:
        if page > 240:  # Stop after page 240 to prevent redirection loops
            print(f"Reached the last allowed page: {page-1}. Stopping.")
            break

        url = f"{base_url}{page}"
        response = requests.get(url)
        if response.status_code != 200:
            print(f"Failed to retrieve page {page}. Status code: {response.status_code}")
            break

        soup = BeautifulSoup(response.text, 'html.parser')
        # Each row with the full set of data is contained in .color-guide-table__row
        rows = soup.select('.color-guide-table__row')
        if not rows:
            print(f"No more rows found. Stopping at page {page}.")
            break

        for row in rows:
            try:
                # Extract the hex color value from the style attribute
                bg_span = row.select_one('.color-guide-row__bg')
                hex_value = ""
                if bg_span and bg_span.has_attr('style'):
                    # Expecting style like "background-color: #7c0017;"
                    style = bg_span['style']
                    hex_value = style.split('background-color:')[-1].strip().rstrip(';')

                # Extract the Colour code
                code_div = row.select_one('div[data-label="Colour code"] span')
                color_code = code_div.get_text(strip=True) if code_div else ""

                # Extract the Colour name
                name_div = row.select_one('div[data-label="Colour name"] span')
                color_name = name_div.get_text(strip=True) if name_div else ""

                # Extract Variation
                variation_div = row.select_one('div[data-label="Variation"]')
                variation_text = variation_div.get_text(strip=True) if variation_div else ""

                # Extract Car Brand
                brand_div = row.select_one('div[data-label="Car Brand"] span')
                car_brand = brand_div.get_text(strip=True) if brand_div else ""

                # Extract Year
                year_div = row.select_one('div[data-label="Year"] span')
                year_text = year_div.get_text(strip=True) if year_div else ""

                # Extract Spray (400 ML) product code
                spray_div = row.select_one('div[data-label="site.color_guide.label.spray"] span')
                spray_text = spray_div.get_text(strip=True) if spray_div else ""

                # Extract Touch-up (12 ML) product code
                touchup_div = row.select_one('div[data-label="site.color_guide.label.touch_up"] span')
                touch_up_text = touchup_div.get_text(strip=True) if touchup_div else ""

                row_data = [hex_value, color_code, color_name, variation_text, car_brand, year_text, spray_text, touch_up_text]
                all_data.append(row_data)
                # Log the scraped data for each row to the console
                print(f"Scraped row on page {page}: {row_data}")
            except Exception as e:
                # Skip a malformed row but keep processing the rest of the page
                print(f"Error processing a row on page {page}: {e}")
                continue

        print(f"Page {page} scraped successfully.")
        page += 1
        time.sleep(1)  # Be nice to the server

    # Write the scraped data to CSV
    with open('motip_colors.csv', 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(["Hex Value", "Color Code", "Colour Name", "Variation", "Car Brand", "Year", "Spray 400 ML", "Touch-up 12 ML"])
        writer.writerows(all_data)

    print("Data saved to motip_colors.csv")

if __name__ == "__main__":
    scrape_motip_colors()

Figure 1: Python web scraping of data with BeautifulSoup.

How the code works

Initialization

The script starts by setting up the base URL and initializing the page counter and data list.

Looping through pages

A while loop fetches pages one at a time and stops after page 240, or earlier if a request fails or a page contains no rows. The hard limit guards against redirects that could otherwise keep the loop running indefinitely.

For each page, a GET request is made using the requests library.
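The stopping logic can be isolated from the network code, which makes it easier to reason about. The following is a minimal sketch, not the script's actual structure: crawl_pages and fake_fetch are hypothetical names, and fetch_page stands for any callable that returns a page's HTML or None on failure.

```python
from typing import Callable, List, Optional

MAX_PAGES = 240  # same hard limit as the script above


def crawl_pages(fetch_page: Callable[[int], Optional[str]],
                max_pages: int = MAX_PAGES) -> List[str]:
    """Collect page bodies until the limit is hit or a fetch fails.

    fetch_page is a hypothetical callable that returns the HTML for a
    page number, or None when the request did not succeed.
    """
    pages = []
    for page in range(1, max_pages + 1):
        html = fetch_page(page)
        if html is None:  # mirrors the status_code != 200 branch
            break
        pages.append(html)
    return pages


# Stand-in fetcher so the sketch runs without network access:
# it pretends that only pages 1 to 3 exist.
def fake_fetch(page: int) -> Optional[str]:
    return f"<html>page {page}</html>" if page <= 3 else None


print(len(crawl_pages(fake_fetch)))               # 3
print(len(crawl_pages(fake_fetch, max_pages=2)))  # 2
```

In a real run, fetch_page would presumably wrap requests.get and return response.text only when the status code is 200.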

Parsing the HTML

BeautifulSoup parses the returned HTML.

Each row of data is located using the CSS selector .color-guide-table__row.
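To see the row selection in isolation, it helps to run it against a small inline document. The sketch below assumes markup shaped like the rows the script expects; SAMPLE_HTML, its values, and the text_or_empty helper are made up for illustration and are not taken from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, shaped like the rows the script looks for.
SAMPLE_HTML = """
<div class="color-guide-table">
  <div class="color-guide-table__row">
    <div data-label="Colour code"><span>ABC123</span></div>
    <div data-label="Colour name"><span>Sample Red</span></div>
  </div>
  <div class="color-guide-table__row">
    <div data-label="Colour code"><span>XYZ9</span></div>
  </div>
</div>
"""


def text_or_empty(row, selector: str) -> str:
    """Return the stripped text of the first match, or '' if none."""
    node = row.select_one(selector)
    return node.get_text(strip=True) if node else ""


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = soup.select(".color-guide-table__row")

codes = [text_or_empty(r, 'div[data-label="Colour code"] span') for r in rows]
names = [text_or_empty(r, 'div[data-label="Colour name"] span') for r in rows]

print(len(rows))  # 2
print(codes)      # ['ABC123', 'XYZ9']
print(names)      # ['Sample Red', '']
```

A page without matching rows yields an empty list from select, which is the condition the script uses to stop.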

Extracting data

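Specific details are extracted with CSS attribute selectors such as div[data-label="Colour code"] span. The hex value is a special case: it is parsed out of the inline style attribute of the .color-guide-row__bg element. Any field that is missing from a row is stored as an empty string.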

The script processes each row, and the scraped information is logged to the console for visibility.
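The script pulls the hex value out of the style attribute with split and strip, which works when background-color is the only declaration. As a possible alternative, a regular expression also copes with additional declarations in the same attribute. This is a sketch under that assumption; hex_from_style is a hypothetical helper, not part of the original script.

```python
import re

# Matches a 6- or 3-digit hex colour in a background-color declaration.
_HEX_RE = re.compile(
    r"background-color:\s*(#[0-9a-fA-F]{6}|#[0-9a-fA-F]{3})\b"
)


def hex_from_style(style: str) -> str:
    """Return the lowercase hex colour from an inline style, or ''."""
    match = _HEX_RE.search(style)
    return match.group(1).lower() if match else ""


print(hex_from_style("background-color: #7c0017;"))                    # #7c0017
print(hex_from_style("color: red; background-color:#FFF; margin: 0"))  # #fff
print(repr(hex_from_style("display: none")))                           # ''
```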

Saving the data

After all pages have been scraped, the data is written to a CSV file (motip_colors.csv), which can be used for further processing or analysis.
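The CSV step can be tried without touching the file system by writing to an in-memory buffer. The sketch below reuses the header from the script; rows_to_csv and the sample row are hypothetical and only illustrate how csv.writer quotes fields that contain commas.

```python
import csv
import io

HEADER = ["Hex Value", "Color Code", "Colour Name", "Variation",
          "Car Brand", "Year", "Spray 400 ML", "Touch-up 12 ML"]


def rows_to_csv(rows, header=HEADER) -> str:
    """Serialize a header plus data rows to CSV text."""
    buffer = io.StringIO(newline="")  # let the csv module control line endings
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up row; the name contains a comma, so csv.writer quotes it.
sample_rows = [["#7c0017", "ABC123", "Red, metallic", "", "Brand", "2001", "", ""]]
text = rows_to_csv(sample_rows)
print(text)

# Reading the text back recovers the original values.
parsed = list(csv.reader(io.StringIO(text, newline="")))
print(parsed[1][2])  # Red, metallic
```

Swapping the buffer for open('motip_colors.csv', 'w', newline='', encoding='utf-8') gives the file-based version used in the script.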

Rate limiting

A delay (time.sleep(1)) is included between page requests to avoid overloading the server.
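One way to keep the delay out of the main loop body is a small generator that pauses between items. This is a sketch, not what the script does: throttled is a hypothetical helper, and unlike the script it does not sleep after the final item. The sleep function is injectable so the behaviour can be checked without actually waiting.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def throttled(items: Iterable[T], delay: float = 1.0,
              sleep: Callable[[float], None] = time.sleep) -> Iterator[T]:
    """Yield items, pausing `delay` seconds between consecutive ones."""
    first = True
    for item in items:
        if not first:
            sleep(delay)
        first = False
        yield item


# Record the requested pauses instead of sleeping.
pauses = []
pages = list(throttled(range(1, 4), delay=1.0, sleep=pauses.append))
print(pages)   # [1, 2, 3]
print(pauses)  # [1.0, 1.0]
```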

Conclusion

This tutorial demonstrates a fundamental web scraping technique using Python and BeautifulSoup. By understanding each step, from sending requests and parsing HTML to logging data and saving it into a CSV file, beginners can build a solid foundation for more advanced crawling projects. Experiment with the code, and consider expanding it by adding error handling, dynamic user-agent headers, or multi-threading for enhanced performance.

Figure 2: Importing the scraped CSV data into Google Sheets.