web-scraping's Introduction

Basics of Web scraping

A series of simple projects that I did while practicing web scraping and parsing.

Welcome to the Web scraping Mission. In this mission, you will learn the core concepts of web scraping and get comfortable scraping different types of websites and their data. The problem statement is simple: scrape data from the Wikipedia home page and parse it using several web scraping techniques. Along the way you will get familiar with Python modules for web scraping and with the processes of data extraction and data processing.

This mission will be useful for graduates, postgraduates, and research students who either have an interest in this subject or have it as part of their curriculum.

Web scraping is the automated process of extracting information from the web. This mission covers how web scraping compares with web crawling, why you might opt for web scraping, and the components and working of a web scraper.
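The fetch-then-parse flow of a scraper can be sketched with the standard library alone. This is a minimal illustration, not the project's own code: `LinkCollector`, `fetch_html`, `extract_links`, and the `User-Agent` string are hypothetical names chosen for this example, and `html.parser` stands in for a fuller parsing library such as Beautiful Soup.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href leave _href as None and are skipped.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def fetch_html(url, timeout=10):
    """Download a page and return its body as text (needs network access)."""
    # The User-Agent value is an arbitrary placeholder for this sketch.
    request = Request(url, headers={"User-Agent": "web-scraping-practice/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_links(html):
    """Parse HTML text and return the links it contains."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```

With network access, `extract_links(fetch_html("https://en.wikipedia.org/wiki/Main_Page"))` would return the links found on the Wikipedia home page. Keeping the download and the parsing in separate functions makes it possible to try the parser on a saved HTML string first.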

KEY POINTS & OBJECTIVES –

  • Use of Python
  • Creating a virtual environment
  • Working with a virtual environment
  • Web scraping libraries
  • Legality
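On the legality point, one common first check is a site's robots.txt file, which states which paths the site asks crawlers to avoid. The sketch below uses the standard library's `urllib.robotparser`; the `is_allowed` helper and the sample rules are made up for illustration. Passing this check is not a legal clearance, so a site's terms of use still need to be read separately.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, written out only to show the check in action.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_RULES, "https://example.org/public/page.html"))
print(is_allowed(SAMPLE_RULES, "https://example.org/private/data.html"))
```

For a live site, `RobotFileParser` can also download the file itself: call `set_url()` with the address of the site's robots.txt, then `read()`, before calling `can_fetch()`.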

Starters Pack:

  • We need a Python IDE and should be familiar with using it.
  • Virtualenv is a tool for creating isolated Python environments. With virtualenv, we can create a folder that contains all the executables needed to use the packages our Python project requires. Inside it we can add and modify Python modules without affecting any global installation.
  • We install the Python modules and libraries we need with the pip command.
  • We should always check whether scraping a given website is legal.
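The environment setup described above can also be scripted. The sketch below uses the standard library's `venv` module in place of the virtualenv tool (the two serve the same purpose, but this is a substitution, not the tool named above). The helper names and the `scraper-env` folder are hypothetical, and the package names are the ones published on PyPI for the libraries this project uses.

```python
import os
import subprocess
import venv
from pathlib import Path


def create_env(env_dir, with_pip=True):
    """Create an isolated environment in env_dir and return its path."""
    env_path = Path(env_dir)
    venv.create(env_path, with_pip=with_pip)
    return env_path


def env_python(env_path):
    """Return the path of the environment's own Python interpreter."""
    if os.name == "nt":
        return Path(env_path) / "Scripts" / "python.exe"
    return Path(env_path) / "bin" / "python"


def install_packages(env_path, packages):
    """Install packages into the environment with its own pip."""
    # Running pip through the environment's interpreter keeps the
    # global installation untouched. Needs network access.
    subprocess.run(
        [str(env_python(env_path)), "-m", "pip", "install", *packages],
        check=True,
    )
```

A typical run would be `env = create_env("scraper-env")` followed by `install_packages(env, ["requests", "urllib3", "selenium", "beautifulsoup4"])`.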

Requirements -

  • Requests:- An HTTP library used for accessing web pages.
  • urllib3:- An HTTP client library used for retrieving data from URLs.
  • Selenium:- An open source automated testing suite for web applications across different browsers and platforms.
  • Beautiful Soup:- A library for parsing HTML and XML documents and extracting data from them.

Resources

  1. https://realpython.com/python-web-scraping-practical-introduction/
  2. https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_getting_started_with_python.htm
  3. https://www.promptcloud.com/blog/scraping-dynamic-websites-web-scraping/
  4. https://www.webharvy.com/articles/what-is-web-scraping.html#:~:text=Web%20Scraping%20(also%20termed%20Screen,in%20table%20(spreadsheet)%20format.
  5. https://www.dataquest.io/blog/web-scraping-tutorial-python/
  6. https://medium.com/@pknerd/scraping-dynamic-websites-using-scraper-api-and-python-a8d041fc97ac

Guidelines for Contributing

  • Raise an issue about the topic you will be contributing to.
  • Fork this repo to your own GitHub profile.
  • Clone the repo to your machine using the command $ git clone FORK_URL
  • Create a new branch using the command $ git checkout -b BRANCH_NAME
  • Make the changes that you wish to implement.
  • Test the changes.
  • Once you are satisfied with the testing, commit the changes.
  • Change to the master branch with the command $ git checkout master
  • Merge the branch where you made changes into the master branch. Use the command $ git merge BRANCH_NAME
  • Push the changes to your fork with $ git push -u origin master
  • Create a pull request from your fork.
  • If there are no conflicts, you can merge the PR.
  • If there are conflicts and you are uncomfortable resolving them, contact the maintainers for support.
  • Delete the fork once you are done.

Happy Contributing! 😁

Contributors

  • garimasingh128
  • samarthsinghhappy