Task-3 @Catseye Systems, Enhanced Web Scraping: Extracting Event and Company Logos with Python

Kaustubh Patil

October 22, 2024

Extracting Event Logos and Data with Python: A Journey in Web Scraping

In my recent internship, I worked on a task that involved scraping event details, including logos, from websites. This post walks through the code I wrote for it, along with key takeaways from the experience.

1. Problem Statement

The objective was to extract various event-related data from websites, including event names, descriptions, dates, sponsors, and most importantly, the event logos. While the task seemed simple, handling inconsistent website structures posed challenges.

2. Understanding the Logo Extraction Process

One of the core tasks was extracting the event logo, which could be in different places across different websites, such as the navbar or within specific logo-related classes. Here's how I approached the problem:

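# Event logo function: returns a dict with the logo URL under "value" and a status "message"
def getEventLogo(soup, path):
    # Note: `path` is accepted but not used by the lookup shown here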
    try:
        # Look for logo in common navbar classes
        navbar_classes = ['navbar', 'nav', 'header', 'topbar']
        for class_name in navbar_classes:
            navbar = soup.find(class_=lambda x: x and class_name in x.lower())
            if navbar:
                logo_img = navbar.find('img', src=True)
                if logo_img:
                    logo_url = logo_img['src']
                    log_message(level='info', message=f"Event logo found in navbar with class '{class_name}'")
                    return {"value": logo_url, "message": "Event logo found in navbar"}
 
        # If not found in navbar, look for common logo classes
        logo_classes = ['logo', 'brand-logo', 'site-logo']
        for class_name in logo_classes:
            logo_img = soup.find('img', class_=lambda x: x and class_name in x.lower(), src=True)
            if logo_img:
                logo_url = logo_img['src']
                log_message(level='info', message=f"Event logo found with class '{class_name}'")
                return {"value": logo_url, "message": "Event logo found with class name"}
 
        return {"value": None, "message": "Event logo not found"}
    except Exception as e:
        log_message(level="error", message=f"Exception occurred while searching event logo: {e}")
        return {"value": None, "message": f"Exception occurred while searching event logo: {e}"}

Breakdown of the Code:

Navbar Search: The function first searches for the event logo within the website's navbar or header area, using common class names like navbar, nav, header, and topbar. It returns the src of the first image found there.

Logo Class Search: If no logo is found in the navbar, it searches for images whose class contains a common logo name such as logo, brand-logo, or site-logo.

Error Handling: If an error occurs during this process, the function logs the error for debugging and returns a value of None along with the error message.
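The lambda passed to class_ in both searches is an ordinary predicate over class-name strings, so its matching behavior can be checked without parsing any HTML. The sketch below rebuilds it as a small helper (class_contains is a name made up for this illustration, not part of the original code). It shows that substring matching also accepts names such as card-header or Sponsor-Logo, which is worth keeping in mind on pages with many images.

```python
def class_contains(fragment):
    """Build the same kind of predicate the post passes to class_=..."""
    def predicate(class_value):
        # class_value may be None for tags that have no class attribute
        return bool(class_value) and fragment in class_value.lower()
    return predicate


is_header = class_contains("header")
is_logo = class_contains("logo")

print(is_header("site-header"))  # True
print(is_header("card-header"))  # True: any class containing "header" matches
print(is_header(None))           # False
print(is_logo("Sponsor-Logo"))   # True: the comparison is case-insensitive
print(is_logo("hero"))           # False
```

If the broad match picks up the wrong image on a given site, one option is to compare against exact class names instead of substrings.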

3. Collecting Event Data

Beyond just the logo, we needed to gather other details of the event. The structure of the data was as follows:

data = {
    "event_name": eventName,
    "event_description": eventDescription,
    "event_start_date": eventDate,
    "event_location": eventLocation,
    "parent_company": eventParentCompany,
    "banner_url": eventBannerURL,
    "linkedin_url": eventLinkedinURL,
    "event_social_links": eventSocialLinks,
    "event_agenda": eventAgenda,
    "speakers": speakers,
    "users": [] if isIdentifiedUser else speakers,
    "sponsors_data": sponsors,
    "event_logo": eventLogo,
}

Each field corresponds to one event detail, such as the name, date, or sponsors; event_logo holds the logo retrieved by the getEventLogo function.
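A src attribute is often a relative path, so the stored logo value may need to be resolved against the page address before it is usable elsewhere. Here is a minimal sketch using the standard library's urljoin, assuming the page URL is known to the caller. The names resolve_logo and page_url and the sample result dicts are hypothetical and not part of the original code.

```python
from urllib.parse import urljoin


def resolve_logo(logo_result, page_url):
    """Return a copy of a getEventLogo-style result with an absolute URL.

    logo_result is assumed to have the {"value": ..., "message": ...} shape
    shown in the post; page_url is a hypothetical address of the scraped page.
    """
    value = logo_result.get("value")
    if not value:
        # Nothing was found, so pass the result through unchanged
        return dict(logo_result)
    return {**logo_result, "value": urljoin(page_url, value)}


page_url = "https://example.com/events/2024/"

found = {"value": "/static/logo.png", "message": "Event logo found in navbar"}
missing = {"value": None, "message": "Event logo not found"}

print(resolve_logo(found, page_url)["value"])    # https://example.com/static/logo.png
print(resolve_logo(missing, page_url)["value"])  # None
```

URLs that are already absolute pass through urljoin unchanged, so the helper can be applied to every result.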

4. Retrieving the Data

To integrate everything into one process, the data retrieval function was designed to load the HTML from the file and pass it through various functions, including the one for extracting the event logo:

def retriveEventData(requestData):
    soup, error = load_html_from_file(
        file_path=str(requestData['file_path']).strip(),
        scroll_required=True,
        request_id=str(requestData['request_id']).strip(),
        seleniumOnly=str(requestData['selenium_only']).strip(),
    )
    if error:
        return None, error

    # Extract the event logo from the parsed page
    eventLogo = getEventLogo(soup=soup, path=requestData['data']['event_logo_path'])
    # ...rest of the code for other event data...

This function drives the whole retrieval: it loads the page, returns early with the error if loading fails, and otherwise runs each extractor, including the one for the logo.
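The (result, error) pair returned by load_html_from_file lets the caller stop early without raising. That helper is not shown in this post, so the sketch below uses a hypothetical stand-in, load_text, that only reads a file, together with a hypothetical retrieve wrapper, to illustrate the same early-return pattern:

```python
import os
import tempfile


def load_text(file_path):
    """Hypothetical stand-in for the post's loader: returns (text, error)."""
    try:
        with open(file_path, encoding="utf-8") as handle:
            return handle.read(), None
    except OSError as exc:
        return None, f"Could not read {file_path}: {exc}"


def retrieve(request_data):
    """Return (data, error), stopping early when loading fails."""
    html, error = load_text(str(request_data["file_path"]).strip())
    if error:
        return None, error
    # Real extractors would run here; this sketch only records the size
    return {"html_length": len(html)}, None


# Write a tiny page to a temporary file so the example is self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    page_path = os.path.join(tmp_dir, "event.html")
    with open(page_path, "w", encoding="utf-8") as handle:
        handle.write("<html><body>Event</body></html>")

    ok_data, ok_error = retrieve({"file_path": f" {page_path} "})
    bad_data, bad_error = retrieve({"file_path": os.path.join(tmp_dir, "missing.html")})

print(ok_data, ok_error)
print(bad_data, bad_error)
```

Callers then only need to check the second element of the pair before using the first.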

Conclusion

This project taught me the importance of error handling and working with inconsistent data. Whether you're looking for an event logo hidden deep within a website’s structure or trying to capture additional data, Python's BeautifulSoup library and careful structuring of logic can make the task manageable.