Competitor Analysis ETL

An ETL pipeline that uses Selenium to scrape the Snapchat desktop UI, pushes the data into a Google Sheets database, and loads it into reporting.

Purpose

This project uses web scraping to extract performance data for "competitor" channels on a social media platform by mimicking an "anonymous" user of the desktop app. Because API availability is sparse for this particular platform, the ability to scrape information on competitors that are receiving significant distribution from the platform is valuable when conducting competitive/market research in the online media industry. The data is then cleaned and transformed for interpretability, and loaded into a Google Sheets file that serves as a database which any BI tool (Looker, Tableau, etc.) can access for reporting.

Extract

The following code uses Selenium WebDriver with a headless Firefox browser to scrape taxonomic and performance data (e.g. subscriber counts) from channels being distributed in the "Suggested For You" sidebar. See Copy_Snapchat_Scrape.ipynb for more details.

  • The functions pictured below are used first to configure the Selenium WebDriver, disabling caching, memory, cookies, etc. in an attempt to mimic an "anonymous" user on the platform. In addition, the proxy_identification() function navigates to a website where the proxy IP used by the driver can be identified for data collection purposes, demonstrated further in the code base. See below:

(screenshot: Selenium driver configuration and proxy_identification() functions)
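
In lieu of the screenshot, the sketch below illustrates the kind of configuration described above. The specific Firefox preferences, the IP-lookup URL, and the configure_browser() helper name are assumptions rather than the notebook's exact code:

  from json import loads
  from selenium import webdriver
  from selenium.webdriver.firefox.options import Options
  from selenium.webdriver.common.by import By

  def configure_browser():
    #Headless Firefox with caching and cookies disabled to mimic an "anonymous" user
    options = Options()
    options.add_argument('--headless')
    options.set_preference('browser.cache.disk.enable', False)
    options.set_preference('browser.cache.memory.enable', False)
    options.set_preference('network.cookie.cookieBehavior', 2)    #block all cookies
    options.set_preference('devtools.jsonview.enabled', False)    #show raw JSON responses
    return webdriver.Firefox(options=options)

  def proxy_identification(browser):
    #Visit an IP-lookup page (illustrative URL) and record the country/city
    #reported for the IP the driver is currently using
    browser.get('https://ipinfo.io/json')
    info = loads(browser.find_element(By.TAG_NAME, 'body').text)
    return info.get('country'), info.get('city')

The country and city returned here are what the scraping loop below appends as ip_country and ip_city.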

  • The sfy_scrape() function is only one piece of the pipeline, but it is isolated here so its purpose can be demonstrated (a rough sketch of it follows the loop below). It gathers links to all relevant channels by scrolling a (somewhat arbitrary) number of pixels down the page in a for loop to load new suggested channels. The collected channels are stored in a list that the full_scrape() function then loops through, visiting each one and gathering all relevant information of interest:
  #Note: `browser` (the configured headless Firefox driver), sfy_scrape(), and the
  #proxy details `country`/`city` are created earlier in the notebook, as are the
  #empty destination lists (page_list, type_list, subs_list, etc.)
  import time
  import numpy as np
  import pandas as pd
  from selenium.webdriver.common.by import By
  from selenium.common.exceptions import NoSuchElementException

  #Metrics Scrape
  creator_list = sfy_scrape()
  #Visit each profile in the creator list and scrape the desired metrics
  for creator in creator_list:
    browser.get(creator)

    #Get metrics of interest
    webpage = browser.title
    page_list.append(webpage)

    distro_type = 'SFY'
    type_list.append(distro_type)

    content_source = 'Creator Show'
    source_list.append(content_source)

    #Append ip country and city from above steps
    ip_country.append(country)
    ip_city.append(city)

    #Get channel name
    try:
      channel = browser.find_element(By.CLASS_NAME, "PublicProfileDetailsCard_inlineDiv__V12Dg").text
      channel_list.append(channel)
    except NoSuchElementException:
      channel=np.nan
      subs=np.nan
      description = np.nan
      num_eps = np.nan
      landing_episode = np.nan
      landing_info = np.nan
      landing_thumbnail = np.nan

      channel_list.append(channel)
      subs_list.append(subs)
      descrp_list.append(description)
      num_eps_list.append(num_eps)
      landing_ep_list.append(landing_episode)
      landing_info_list.append(landing_info)
      landing_thumb_list.append(landing_thumbnail)
      final_links.append(np.nan)
      time_list.append(np.nan)

      continue

    #Get subscriber base
    try:
      subs = browser.find_element(By.CLASS_NAME, "PublicProfileDetailsCard_desktopSubscriberTextOnMedia__l0rjj").text
    except NoSuchElementException:
      subs=np.nan
    subs_list.append(subs)

    #Get channel description
    try:
      description = browser.find_element(By.CLASS_NAME, "PublicProfileCard_desktopTitle__9ik6D").text
    except NoSuchElementException:
      description = np.nan
    descrp_list.append(description)

    #Get the most recent episodes location, and respective metrics
    #Scroll to bottom of page
    SCROLL_PAUSE_TIME = 5
    # Get scroll height
    last_height = browser.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down to bottom
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait to load page
        time.sleep(SCROLL_PAUSE_TIME)
        # Calculate new scroll height and compare with last scroll height
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    episodes = browser.find_elements(By.CLASS_NAME, 'StoryListTile_title__uu0Lo')
    num_eps = len(episodes)
    num_eps_list.append(num_eps)

    if num_eps > 0:
      landing_episode = episodes[(num_eps-1)].text

      all_episodes_info = browser.find_elements(By.CLASS_NAME, 'StoryListTile_storyInfo__XnOTC')
      landing_info = all_episodes_info[(num_eps-1)].text

      multiple = browser.find_elements(By.CSS_SELECTOR, ".StoryListTile_thumbnail__NYD_G [src]")
      landing_thumbnail = multiple[(num_eps-1)].get_attribute('src')

    else:
      landing_episode = np.nan
      landing_info = np.nan
      landing_thumbnail = np.nan

    #Append all recent episode metrics to lists
    landing_ep_list.append(landing_episode)
    landing_info_list.append(landing_info)
    landing_thumb_list.append(landing_thumbnail)


    #Append link to final link list
    final_links.append(creator)

    #Timestamp floored to the minute, shifted back 4 hours (UTC-4 offset)
    utc = pd.Timestamp.today().floor('min').to_numpy()
    timestamp = utc - np.timedelta64(4, 'h')
    time_list.append(timestamp)

    #Clear cookies between profiles to remain "anonymous", then pause before the next visit
    browser.delete_all_cookies()
    time.sleep(10)
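
For reference, the sfy_scrape() helper called at the top of this loop might look roughly like the sketch below; the landing URL, scroll distance, and link selector are assumptions rather than the notebook's actual values:

  def sfy_scrape(scrolls=15, step=800):
    #Load a page that renders the "Suggested For You" sidebar, then scroll down in
    #fixed pixel steps so that new suggested channels load, collecting their links
    browser.get('https://www.snapchat.com/')   #illustrative landing URL
    creators = []
    for _ in range(scrolls):
      browser.execute_script(f"window.scrollBy(0, {step});")
      time.sleep(2)
      for tile in browser.find_elements(By.CSS_SELECTOR, "a[href*='/add/']"):   #assumed selector
        link = tile.get_attribute('href')
        if link and link not in creators:
          creators.append(link)
    return creators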

Transform

Data-type adjustments and regex-based formatting are applied to some of the fields to prepare the data for loading and interpretation. The final table looks something like the one pictured below:

(screenshot: sample of the final cleaned table)
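
As an illustration of that cleaning step, the sketch below parses a raw subscriber string into an integer, assuming the scraped lists have been assembled into a dataframe df; the column name and the "1.2m Subscribers" string format are assumptions:

  import re

  def parse_subscribers(raw):
    #"1.2m Subscribers" -> 1200000, "850k Subscribers" -> 850000
    if pd.isna(raw):
      return np.nan
    match = re.search(r'([\d.]+)\s*([km]?)', str(raw).lower())
    if not match:
      return np.nan
    value, suffix = float(match.group(1)), match.group(2)
    return int(value * {'k': 1_000, 'm': 1_000_000}.get(suffix, 1))

  df['subscribers'] = df['subscribers'].apply(parse_subscribers)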

Load

Finally, using Google Colab authentication and gspread, the dataframe is pushed into a pre-existing Google Sheet that serves as a database, where the information can be accessed by a BI tool such as Looker for greater visibility and interpretation. If run on a schedule (daily, for instance), differential metrics such as subscriber growth per day can be calculated to monitor the performance of competitor channels of interest.
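
A minimal sketch of that load step in Colab is shown below; the spreadsheet and worksheet names are placeholders, and df is the cleaned dataframe from the Transform step:

  import gspread
  from google.colab import auth
  from google.auth import default

  auth.authenticate_user()
  creds, _ = default()
  gc = gspread.authorize(creds)

  #Append today's scrape to the pre-existing sheet (names are placeholders)
  worksheet = gc.open('Competitor Analytics DB').worksheet('raw_data')
  worksheet.append_rows(df.astype(str).values.tolist(), value_input_option='USER_ENTERED')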

(screenshot: the scraped data loaded into the Google Sheet)
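
With daily snapshots accumulating in the sheet, per-day subscriber growth can then be derived for each channel, for example (history is a hypothetical dataframe of the accumulated rows, and the column names are assumptions):

  history = history.sort_values(['channel', 'timestamp'])
  history['daily_subscriber_growth'] = history.groupby('channel')['subscribers'].diff()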
