Code Reuse with Objects, A Lecture

We covered web scraping in my Introduction to Programming (for Data Science) class earlier this semester, and one of the things we talked about is how it is often difficult to write a useful and generic web scraper. Each project and each site will have its own requirements and idiosyncrasies. But another thing I've been trying to teach them all semester is to look at a project or problem and ask, "How can we make this better?" and in following my own advice, I realized this was a great opportunity to both review a project I know some students struggled with, while also demonstrating why one might define an object and how inheritence is useful, concepts we had only abstractly discussed in lecture.

The original web scraping project was a simple script centered around a "main loop" placed in a if __name__ == '__main__' block. We talked about what needed to happen in that loop in lecture, and I had them fill out the actual code for a lab assignment. The code outline for that assignment looked something like this: Disclaimer: All of this code is simplified. If you were making a general purpose web scraper or crawler, you'd want to handle an assortment of edge cases, like files that arn't HTML, different HTTP status codes, or links that point to specific anchors within a page.

if __name__ == '__main__':
    # [TODO] Load the state file or start fresh if it cannot be read
    to_visit: list = [urljoin(DOMAIN, '/index.html')]
    data: dict[str, dict] = {}
    # Main Loop
    while len(to_visit) > 0:
        try:
            pass
            # [TODO] Process files from to_visit
            #        This requires:
            #        - Popping a link from the list
            #        - Checking to see if it has already been processed
            #        - Downloading the file the link points to
            #        - Add the current file to data, using the url as the
            #          key, and a dictionary containing book data
            #        - Extract links from the HTML
            #          - Use urljoin(full_url_of_current_doc, link_ref)
            #            to create the full url for a link
            #          - Check to see if this full url is already in data
            #          - If not, append to to_visit
        except KeyboardInterrupt:
            save_state(STATE_FILENAME, to_visit, data)
            is_finished = False
            break
    else:
        is_finished = True
    if is_finished:
        write_spreadsheet(OUTPUT_FILENAME, data)

...where save_state saves the state variables to a JSON file, and write_spreadsheet creates a CSV file with the extracted data once we're finished.

Now, if we were programming in Java, this would, of course, already be a class, but we could take a similar approach in Python. That is, we could create a WebScraper class with all the generic code from our project, and maybe some abstract functions (i.e. raises NotImplementedError) for extracting data and then create a subclass of WebScraper for each web scraping project. That could look something like this:

class WebScraper:

    # User Agent from Chrome Browser on Win 10/11
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)
            AppleWebKit/537.36 (KHTML, like Gecko)
        Chrome/130.0.0.0 Safari/537.36'}

    default_sleep = 5.0 # These may need tuning
    sigma = 1.0

    def get(url: str) -> requests.Response:
        """Waits a random amount of time, then send a GET request"""
        time.sleep(abs(random.gauss(WebScraper.default_sleep,
                                WebScraper.sigma)))
        return requests.get(url, headers=WebScraper.headers)

    def __init__(self, domain, start='/index.html',
             state_filename='webscrape.dat',
         output_filename='output.csv'):
        self._links = [urljoin(domain, start)]
        self._data = {}
        self._links_processed = 0
        self._state_filename = state_filename
        self._output_filename = output_filename

    def save_state(self) -> None:
        with open(self._state_filename, 'wb') as statefile:
            pickle.dump(self, statefile)

    def load_state(state_filename='webscrape.dat') -> WebScraper:
        with open(state_filename, 'rb') as statefile:
            return pickle.load(statefile)

    def extract_links(page, from_url) -> list:
        links = page.find_all('a')
        return [urljoin(from_url, link['href'])
            for link in links
        if 'href' in link.attrs]

    def extract_data(self, page) -> dict:
        raise NotImplementedError

    def crawl(self):
        while len(self._links) > 0:
            # main loop for crawling a website

In this implementation, we move our various global constants to either class or instance data members. Values such as the user agent string is unlikely to vary from project to project, so we define headers for all WebScrapers. The domain, on the other hand, will vary, so it becomes a parameter to the initialization method (we don't actually use it beyond setting the initial link).

Since we're implementing this as a class, we can simply pickle the current object instead of storing the state variables in a JSON file. When we want to restart our program, rather than creating a new WebScraper, we can call the class method (or more properly, the static method) WebScraper.load_state which will return a copy of the instance of WebScraper that we stored before.

For a specific project then, we can subclass WebScraper and overload or implement methods as needed. For the lab, we scraped the practice site http://books.toscrape.com, so lets make a BookScraper.

class BookScraper (WebScraper):

    def extract_data(self, page) -> dict:
        book_data = {}
        # toscrape.com specific data collection
        return book_data

books = BookScraper('http://books.toscrape.com')
books.crawl()

This is good and all, but there are a handful of reasons I might not want to reuse code via inheritence. You'll notice, for example, that we made some decisions writing that class that aren't necessarily general. Like what if I really wanted to store state in JSON for one or more projects? Most solutions to make that happen are going to depart from the interface defined by our original WebScraper class one way or another.

Luckily, there is another option. What if we could make an object that lets us, say... iterate over all the pages in a website? Our core, common function in this code is crawling through the website, after all. Everything else, from file types to data extracted, depends on the requirements of the specific project. So instead of a WebScraper and ever more specific subclasses, lets implement a SiteCrawler.

class SiteCrawler:

    # class data and methods for get requests omitted for brevity

    def __init__(self, domain, start='/index.html'):
        self._to_visit = [urljoin(domain, start)]
        self._visited = {}

    def __iter__(self):
        while len(self._links) > 0:
            link = self._to_visit.pop()
            response = get(link)
            if response.ok:
                # Add link to _visited
                # Parse the page
                # Extract links and add any new ones to _to_visit
                yield page

Now our object has only one job: crawl through a website and collect (and yield) pages. We might want to add some code to get specific pages or our state variable for serialization, but ultimately, we'll leave the practical decisions unrelated to webcrawling to project-specific code.