
scrapers's Introduction

Hi there, I'm Ashish 👋

⚡ I love applied maths, programming, data science, and books

  • 🌱 I'm addicted to learning and growing every day

  • 🌍 I am currently sharing a little of my knowledge with the world through my blog.

  • ✍️ I am currently working on mixed data clustering

  • Connect with me on:

  • 📫 Learn more about me on:

Ashish's GitHub stats

scrapers's People

Contributors

dependabot[bot], duttashi


scrapers's Issues

Get job listings data from a job portal

Task

Write plain code to acquire job listings data from www.indeed.com.my

  • given a search keyword, the code should browse to the results page and all subsequent pages.
  • the code should then extract data such as job title, job location, company name, company rating and job post date.
    • arrange the extracted data in a pandas dataframe and write it to disk.
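The shape of this pipeline can be sketched as below. The fetch-and-parse step is stubbed out, since indeed.com.my's actual markup and pagination parameters are not known here; `parse_page` is a hypothetical placeholder for the part that would use requests + beautifulsoup.

```python
# Pipeline sketch: paginate -> collect records -> pandas -> disk.
# parse_page is a stand-in for real fetching/parsing of indeed.com.my.
import pandas as pd

def parse_page(keyword, page):
    # Placeholder: a real version would request the results page for
    # `keyword` at `page` and return one dict per job listing.
    if page > 1:  # pretend there is only one page of results
        return []
    return [{"job_title": f"{keyword} engineer",
             "job_location": "Kuala Lumpur",
             "company_name": "ACME",
             "company_rating": 4.1,
             "post_date": "2019-02-25"}]

def scrape(keyword):
    records, page = [], 1
    while True:  # browse to all subsequent pages until one comes back empty
        rows = parse_page(keyword, page)
        if not rows:
            break
        records.extend(rows)
        page += 1
    return pd.DataFrame(records)

df = scrape("data")
df.to_csv("jobs.csv", index=False)  # write the extracted data to disk
print(df.columns.tolist())
```

Only the loop structure and the pandas write-out are meant literally; the field names mirror the bullets above.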

scraping data from parent and children webpages

Given the website https://www.gsmarena.com/, browse and scrape data for all the phones listed on the home page, including the child webpages of this website.

The idea is to practice selenium, requests, beautifulsoup and mysql. Do check the robots.txt file for this website.
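The robots.txt check can be done with the standard library alone. The rules below are made up for illustration, not gsmarena.com's actual policy; in practice you would feed the parser https://www.gsmarena.com/robots.txt.

```python
# Checking robots.txt rules before scraping, using only the stdlib.
# The Disallow line here is illustrative, not gsmarena.com's real policy.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
])
allowed = rp.can_fetch("*", "https://www.gsmarena.com/samsung-phones-9.php")
blocked = rp.can_fetch("*", "https://www.gsmarena.com/admin/")
print(allowed, blocked)
```

In real use, `rp.set_url("https://www.gsmarena.com/robots.txt")` followed by `rp.read()` fetches the live file instead of the hand-written lines.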

Tweepy: tweepy.error.tweeperror 'code' 215 'message' 'bad authentication data.' or TweepError: Twitter error response: status code = 400

I'm unsure what is going wrong here. The following code was working until 2 days ago. I'm using tweepy version 3.6.0 on Python 3 in a Jupyter notebook. Now, when I execute the code given below, I keep getting the error `TweepError: [{'code': 215, 'message': 'Bad Authentication data.'}]`. What am I doing wrong?

import tweepy

# Read the Twitter credential file, one credential per line
creds_file = "twitter_creds.txt"
with open(creds_file, 'r') as f:
    mylist = [line.rstrip('\n') for line in f]

ckey = mylist[0]     # consumer key
csecret = mylist[1]  # consumer secret
atoken = mylist[2]   # access token
asecret = mylist[3]  # access token secret
# print("ckey: " + ckey, "\ncsecret: " + csecret)

# OAuth process, using the keys and tokens
auth = tweepy.OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
# Creation of the actual interface, using authentication
api = tweepy.API(auth)

# Collect tweets on #MRT
for tweet in tweepy.Cursor(api.search, q="MRT", count=100,
                           lang="en", rpp=100,
                           since="2017-04-03").items():
    print(tweet.created_at, tweet.text)

See this discussion

write pandas dataframe to mysql database

Motivation

Once the desired data has been scraped off a website, it is arranged into a tabular format such as a dataframe. The next step is to save this data into a database.
So, how do I write a pandas dataframe to a mysql database, and retrieve it back?
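A minimal sketch of the round trip uses `DataFrame.to_sql` and `pandas.read_sql`. It is demonstrated here with the stdlib sqlite3 driver so it runs anywhere; for MySQL you would instead pass a SQLAlchemy engine (e.g. one created from a `mysql+pymysql://user:pw@host/db` URL, assuming a pymysql setup).

```python
# Write a DataFrame to a SQL database and read it back.
# sqlite3 stands in for MySQL here; for MySQL, replace `conn` with a
# SQLAlchemy engine, e.g. create_engine("mysql+pymysql://user:pw@host/db").
import sqlite3
import pandas as pd

df = pd.DataFrame({"item": ["RedMi Note"], "price_rm": [499.0]})

conn = sqlite3.connect(":memory:")
df.to_sql("deals", conn, index=False, if_exists="replace")  # write
back = pd.read_sql("SELECT * FROM deals", conn)             # retrieve
conn.close()
print(back)
```

`if_exists="replace"` drops and recreates the table on rerun; use `"append"` to accumulate scraped batches instead.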

File was loaded in the wrong encoding: 'UTF-8'

After scraping the deals of the day from the eBay Malaysia (http://deals.ebay.com.my/) webpage, the csv writer is not writing the results properly. The problem is with the item-description field: instead of writing the complete item-description string, it is broken into individual characters separated by commas, which is not desirable.

The desired result should look like: RedMi Note XX for sale, RM 0.0

Some possible solutions posted on SO are 1, 2, but the problem is not fixed.
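Without seeing the repo's exact writer code this is a guess, but one common cause of exactly this symptom is passing a bare string to `csv.writer.writerow()`, which expects a sequence of fields and so iterates the string character by character:

```python
# Reproducing the "one comma-separated character per field" symptom
# and the usual fix: wrap the fields in a list before writerow().
import csv
import io

desc = "RedMi Note XX for sale"

buggy = io.StringIO()
csv.writer(buggy).writerow(desc)              # a string IS a sequence of chars
fixed = io.StringIO()
csv.writer(fixed).writerow([desc, "RM 0.0"])  # one list element per field

print(buggy.getvalue().strip())
print(fixed.getvalue().strip())
```

As for the encoding message in the title: open the output file with `open(path, "w", newline="", encoding="utf-8")` so the csv module controls line endings and the file is genuinely UTF-8.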

extract day, month, year from a variable in a dataframe

Issue:

In the script scrape_mudah_EDA.R, the dataframe df has a column called iDate. It's a character variable with a structure like

str(df$iDate)
 chr [1:4100] "Mon Feb 25 2019" "Mon Feb 25 2019" "Mon Feb 25 2019" "Mon Feb 25 2019" "Mon Feb 25 2019" ...

Question

How can I extract or split this variable into separate columns for day, date, month and year?
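The script in question is R, where `as.Date(df$iDate, format = "%a %b %d %Y")` followed by `format()` calls would do the split (assuming an English locale for the `%a`/`%b` names). For the Python scrapers in this repo, the pandas equivalent of the same split looks like:

```python
# Split a "Mon Feb 25 2019"-style string column into weekday/day/month/year.
import pandas as pd

df = pd.DataFrame({"iDate": ["Mon Feb 25 2019", "Tue Feb 26 2019"]})
parsed = pd.to_datetime(df["iDate"], format="%a %b %d %Y")
df["day"] = parsed.dt.day_name()   # weekday name, e.g. "Monday"
df["date"] = parsed.dt.day         # day of the month
df["month"] = parsed.dt.month
df["year"] = parsed.dt.year
print(df[["day", "date", "month", "year"]])
```

The `%a %b %d %Y` format string mirrors the structure shown by `str(df$iDate)` above.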

TODO

Find the XPath for extracting the restaurant distance, and add the tags featured on the webpage (such as 'discount') to the item listings.

How to extract html content from javascript rendered website using selenium

Background

Modern e-commerce websites are JavaScript-rendered. In simple terms, this means the webpage content is generated by a scripting language such as JavaScript (JS), which acts as a wrapper around the native HTML. Such webpages are difficult to scrape because a Python library like beautifulsoup, which is meant for static HTML content, cannot see the JS-rendered HTML.

Minimum reproducible example

Suppose we need to scrape the product name from an e-commerce website like www.shopee.com.my. I'm assuming that the XPath Helper Chrome extension is already installed in the Chrome browser. On navigating to the webpage, right-click on the item name and choose Inspect; this shows the HTML. We find that the XPath for the product name is //div[@class="_1NoI8_ _16BAGk"]

Next, we write a Python script to extract this data:

from requests import get
from bs4 import BeautifulSoup

url = "https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales"
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for item_name in soup.find_all('div', class_='_1NoI8_ _16BAGk'):
    print(item_name.get_text())

The above code returns nothing. So the question is: how do we extract data from a JS-rendered website?
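The usual approach is to let a real browser execute the JS with Selenium, then parse `driver.page_source` (the HTML after rendering). The Selenium calls need a browser plus a chromedriver, so they are shown commented out; the parsing stage is demonstrated on a stand-in snippet with the stdlib parser, reusing the class name found above.

```python
# Render with Selenium, then parse the resulting HTML.
# The Selenium part (commented, since it needs a browser + chromedriver):
#
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get("https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales")
#   html = driver.page_source   # HTML *after* the JS has run
#   driver.quit()
#
# The parsing stage alone, on a stand-in snippet:
from html.parser import HTMLParser

class ItemNameParser(HTMLParser):
    """Collects text inside <div class="_1NoI8_ _16BAGk"> elements.
    Minimal sketch: assumes no nested <div>s inside the item div."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.names = []
    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("class") == "_1NoI8_ _16BAGk":
            self.depth += 1
    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
    def handle_data(self, data):
        if self.depth:
            self.names.append(data.strip())

html = '<div class="_1NoI8_ _16BAGk">RedMi Note XX</div>'
p = ItemNameParser()
p.feed(html)
print(p.names)
```

With Selenium you can also skip the second parser entirely and locate the elements in the rendered page directly via the driver's XPath finders, using the same //div[@class="_1NoI8_ _16BAGk"] expression.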
