
scrapers's Introduction

Hi there, I'm Ashish 👋

⚡ I love applied maths, programming, data science, and books

  • 🌱 I'm addicted to learning and growing every day

  • 🌍 I am currently sharing a little of my knowledge with the world through my blog.

  • ✍️ I am currently working on mixed data clustering

  • Connect with me on:

  • 📫 Learn more about me on:

Ashish's GitHub stats

scrapers's People

Contributors

dependabot[bot], duttashi


scrapers's Issues

Get job listings data from a job portal

Task

Write plain code to acquire job listings data from www.indeed.com.my

  • given a search keyword, the code should browse to the results page and all subsequent pages.
  • the code should then extract data such as job title, job location, company name, company rating and job post date.
    • arrange the extracted data in a pandas dataframe and write it to disk.
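The shape of this pipeline can be sketched as below. The fetch-and-parse step is stubbed out, since indeed.com.my's actual markup and pagination parameters are not known here; `parse_page` is a hypothetical placeholder for the part that would use requests + beautifulsoup.

```python
# Pipeline sketch: paginate -> collect records -> pandas -> disk.
# parse_page is a stand-in for real fetching/parsing of indeed.com.my.
import pandas as pd

def parse_page(keyword, page):
    # Placeholder: a real version would request the results page for
    # `keyword` at `page` and return one dict per job listing.
    if page > 1:  # pretend there is only one page of results
        return []
    return [{"job_title": f"{keyword} engineer",
             "job_location": "Kuala Lumpur",
             "company_name": "ACME",
             "company_rating": 4.1,
             "post_date": "2019-02-25"}]

def scrape(keyword):
    records, page = [], 1
    while True:  # browse to all subsequent pages until one comes back empty
        rows = parse_page(keyword, page)
        if not rows:
            break
        records.extend(rows)
        page += 1
    return pd.DataFrame(records)

df = scrape("data")
df.to_csv("jobs.csv", index=False)  # write the extracted data to disk
print(df.columns.tolist())
```

Only the loop structure and the pandas write-out are meant literally; the field names mirror the bullets above.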

scraping data from parent and children webpages

Given the website https://www.gsmarena.com/, browse and scrape data for all the phones listed on the home page, including the child webpages of this website.

The idea is to practice selenium, requests, beautifulsoup and mysql. Do check the robots.txt file for this website.
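The robots.txt check can be done with the standard library alone. The rules below are made up for illustration, not gsmarena.com's actual policy; in practice you would feed the parser https://www.gsmarena.com/robots.txt.

```python
# Checking robots.txt rules before scraping, using only the stdlib.
# The Disallow line here is illustrative, not gsmarena.com's real policy.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
])
allowed = rp.can_fetch("*", "https://www.gsmarena.com/samsung-phones-9.php")
blocked = rp.can_fetch("*", "https://www.gsmarena.com/admin/")
print(allowed, blocked)
```

In real use, `rp.set_url("https://www.gsmarena.com/robots.txt")` followed by `rp.read()` fetches the live file instead of the hand-written lines.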

Tweepy: tweepy.error.tweeperror 'code' 215 'message' 'bad authentication data.' or TweepError: Twitter error response: status code = 400

I'm unsure what is going wrong here. The following code was working until 2 days ago. I'm using tweepy version 3.6.0 on Python 3 in a Jupyter notebook. Now, when I execute the code given below, I keep getting the error `TweepError: [{'code': 215, 'message': 'Bad Authentication data.'}]`. What am I doing wrong?

import tweepy

# Read the Twitter credential file, one credential per line
creds_file = "twitter_creds.txt"
with open(creds_file, 'r') as f:
    mylist = [line.rstrip('\n') for line in f]

ckey = mylist[0]     # consumer key
csecret = mylist[1]  # consumer secret
atoken = mylist[2]   # access token
asecret = mylist[3]  # access token secret
# print("ckey: " + ckey, "\ncsecret: " + csecret)

# OAuth process, using the keys and tokens
auth = tweepy.OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
# Creation of the actual interface, using authentication
api = tweepy.API(auth)

# Collect tweets on #MRT
for tweet in tweepy.Cursor(api.search, q="MRT", count=100,
                           lang="en", rpp=100,
                           since="2017-04-03").items():
    print(tweet.created_at, tweet.text)

See this discussion

write pandas dataframe to mysql database

Motivation

Once the desired data has been scraped off a website, it is arranged into a tabular format such as a dataframe. The next step is to save this data into a database.
So, how do I write a pandas dataframe to a mysql database, and retrieve it back?
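A minimal sketch of the round trip uses `DataFrame.to_sql` and `pandas.read_sql`. It is demonstrated here with the stdlib sqlite3 driver so it runs anywhere; for MySQL you would instead pass a SQLAlchemy engine (e.g. one created from a `mysql+pymysql://user:pw@host/db` URL, assuming a pymysql setup).

```python
# Write a DataFrame to a SQL database and read it back.
# sqlite3 stands in for MySQL here; for MySQL, replace `conn` with a
# SQLAlchemy engine, e.g. create_engine("mysql+pymysql://user:pw@host/db").
import sqlite3
import pandas as pd

df = pd.DataFrame({"item": ["RedMi Note"], "price_rm": [499.0]})

conn = sqlite3.connect(":memory:")
df.to_sql("deals", conn, index=False, if_exists="replace")  # write
back = pd.read_sql("SELECT * FROM deals", conn)             # retrieve
conn.close()
print(back)
```

`if_exists="replace"` drops and recreates the table on rerun; use `"append"` to accumulate scraped batches instead.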

File was loaded in the wrong encoding: 'UTF-8'

After scraping the deals of the day from the eBay Malaysia (http://deals.ebay.com.my/) webpage, the csv writer is not writing the results properly. The problem is with the item-description field: instead of writing the complete item-description string, it is broken into individual characters separated by commas, which is not desirable.

The desired result should look like: RedMi Note XX for sale, RM 0.0

Some possible solutions posted on SO are 1, 2, but the problem is not fixed.
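Without seeing the repo's exact writer code this is a guess, but one common cause of exactly this symptom is passing a bare string to `csv.writer.writerow()`, which expects a sequence of fields and so iterates the string character by character:

```python
# Reproducing the "one comma-separated character per field" symptom
# and the usual fix: wrap the fields in a list before writerow().
import csv
import io

desc = "RedMi Note XX for sale"

buggy = io.StringIO()
csv.writer(buggy).writerow(desc)              # a string IS a sequence of chars
fixed = io.StringIO()
csv.writer(fixed).writerow([desc, "RM 0.0"])  # one list element per field

print(buggy.getvalue().strip())
print(fixed.getvalue().strip())
```

As for the encoding message in the title: open the output file with `open(path, "w", newline="", encoding="utf-8")` so the csv module controls line endings and the file is genuinely UTF-8.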

extract day, month, year from a variable in a dataframe

Issue:

In the script scrape_mudah_EDA.R, the dataframe df has a column called iDate. It's a character variable with a structure like

str(df$iDate)
 chr [1:4100] "Mon Feb 25 2019" "Mon Feb 25 2019" "Mon Feb 25 2019" "Mon Feb 25 2019" "Mon Feb 25 2019" ...

Question

How can I extract or split this variable into separate columns for day, date, month and year?
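The script in question is R, where `as.Date(df$iDate, format = "%a %b %d %Y")` followed by `format()` calls would do the split (assuming an English locale for the `%a`/`%b` names). For the Python scrapers in this repo, the pandas equivalent of the same split looks like:

```python
# Split a "Mon Feb 25 2019"-style string column into weekday/day/month/year.
import pandas as pd

df = pd.DataFrame({"iDate": ["Mon Feb 25 2019", "Tue Feb 26 2019"]})
parsed = pd.to_datetime(df["iDate"], format="%a %b %d %Y")
df["day"] = parsed.dt.day_name()   # weekday name, e.g. "Monday"
df["date"] = parsed.dt.day         # day of the month
df["month"] = parsed.dt.month
df["year"] = parsed.dt.year
print(df[["day", "date", "month", "year"]])
```

The `%a %b %d %Y` format string mirrors the structure shown by `str(df$iDate)` above.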

TODO

Find the XPath for extracting the restaurant distance, and add the tags featured on the webpage (such as 'discount') to the item listings.

How to extract html content from javascript rendered website using selenium

Background

Modern e-commerce websites are JavaScript-rendered. In simple terms, this means the webpage content is generated by a scripting language such as JavaScript (JS), which acts as a wrapper around the native HTML. Such webpages are difficult to scrape because a Python library like beautifulsoup, which is meant for static HTML content, cannot see the JS-rendered HTML.

Minimum reproducible example

Suppose we need to scrape the product name from an e-commerce website like www.shopee.com.my. I'm assuming that the XPath Helper Chrome extension is already installed in the Chrome browser. On navigating to the webpage, right-click on the item name and choose Inspect; this shows the HTML. We find that the XPath for the product name is //div[@class="_1NoI8_ _16BAGk"]

Next, we write a Python script to extract this data:

from requests import get
from bs4 import BeautifulSoup

url = "https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales"
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for item_name in soup.find_all('div', class_='_1NoI8_ _16BAGk'):
    print(item_name.get_text())

The above code returns nothing. So the question is: how do we extract data from a JS-rendered website?
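The usual approach is to let a real browser execute the JS with Selenium, then parse `driver.page_source` (the HTML after rendering). The Selenium calls need a browser plus a chromedriver, so they are shown commented out; the parsing stage is demonstrated on a stand-in snippet with the stdlib parser, reusing the class name found above.

```python
# Render with Selenium, then parse the resulting HTML.
# The Selenium part (commented, since it needs a browser + chromedriver):
#
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get("https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales")
#   html = driver.page_source   # HTML *after* the JS has run
#   driver.quit()
#
# The parsing stage alone, on a stand-in snippet:
from html.parser import HTMLParser

class ItemNameParser(HTMLParser):
    """Collects text inside <div class="_1NoI8_ _16BAGk"> elements.
    Minimal sketch: assumes no nested <div>s inside the item div."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.names = []
    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("class") == "_1NoI8_ _16BAGk":
            self.depth += 1
    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
    def handle_data(self, data):
        if self.depth:
            self.names.append(data.strip())

html = '<div class="_1NoI8_ _16BAGk">RedMi Note XX</div>'
p = ItemNameParser()
p.feed(html)
print(p.names)
```

With Selenium you can also skip the second parser entirely and locate the elements in the rendered page directly via the driver's XPath finders, using the same //div[@class="_1NoI8_ _16BAGk"] expression.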
