web-scraping-example-with-ecs

Web Scraping using Amazon ECS

This demo builds a stack that uses Amazon ECS, Selenium, Requests, and BeautifulSoup to extract content from a given URL. The worker runs on Amazon ECS with Python 3.6: Selenium navigates the page, the Requests library fetches the HTML, and BeautifulSoup extracts elements such as all of the text and the URLs on the page. The output is then saved as a .txt file in an S3 bucket.
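
The extraction step described above can be sketched roughly as follows. This is an illustration only, not the worker code from this repository: to stay dependency-free it swaps in Python's standard-library html.parser for BeautifulSoup, it parses an inline HTML string instead of fetching a page with Selenium or Requests, and the names PageExtractor and extract_page are hypothetical.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects visible text fragments and link targets from an HTML document."""

    # The content of these tags is not visible page text, so it is skipped.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.texts = []
        self.urls = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            # attrs is a list of (name, value) pairs.
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.texts.append(text)


def extract_page(html):
    """Return (texts, urls) extracted from an HTML string."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return parser.texts, parser.urls


if __name__ == "__main__":
    # Hypothetical sample page standing in for a fetched document.
    sample = (
        "<html><body><h1>Welcome</h1>"
        '<a href="https://www.python.org/">Python</a>'
        "</body></html>"
    )
    texts, urls = extract_page(sample)
    # The worker in this demo writes its output to S3; here we only print it.
    print("\n".join(texts))
    print("\n".join(urls))
```

With BeautifulSoup, the same two results would typically come from the parsed document's text and its anchor tags; the stdlib parser is used here only so the sketch runs without extra packages.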

Prerequisites

  • AWS Account
  • AWS CLI installed, with AWS credentials preconfigured
  • Docker
  • Preconfigured VPC with a minimum of 2 public subnets

This demo was tested in the us-east-1 region.

Setup instructions

First, we need to set up the foundation for our solution: the S3 bucket that stores our output and the ECR repository that stores our scraping worker Docker image.

A script was developed to help with that task. Simply run:

    ./setup.sh

CloudFormation

    aws cloudformation create-stack \
        --stack-name ecs-demo-scraping \
        --template-body file://cloudformation/ecs-stack.yaml \
        --parameters ParameterKey=ClusterName,ParameterValue=ecs-cluster-demo \
        ParameterKey=ServiceName,ParameterValue=scraping-worker \
        ParameterKey=ImageUrl,ParameterValue=<IMAGE_URL> \
        ParameterKey=BucketName,ParameterValue=<S3_BUCKET> \
        ParameterKey=VpcId,ParameterValue=<VPC_ID> \
        ParameterKey=VpcCidr,ParameterValue=<VPC_CIDR> \
        ParameterKey=PubSubnet1Id,ParameterValue=<PUB_SUBNET_1_ID> \
        ParameterKey=PubSubnet2Id,ParameterValue=<PUB_SUBNET_2_ID> \
        --capabilities CAPABILITY_IAM
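
Before running the command above, the parameter list can be assembled and checked programmatically. The sketch below is an illustration under stated assumptions: the helper name build_parameters and all sample values are hypothetical, and only the parameter keys come from the command itself. It rejects any value that is still an angle-bracket placeholder and returns the ParameterKey/ParameterValue list shape used for CloudFormation stack parameters.

```python
import json
import re

# Parameter keys taken from the create-stack command above.
PARAMETER_KEYS = (
    "ClusterName",
    "ServiceName",
    "ImageUrl",
    "BucketName",
    "VpcId",
    "VpcCidr",
    "PubSubnet1Id",
    "PubSubnet2Id",
)

# Matches a value that is still a template placeholder such as "<VPC_ID>".
PLACEHOLDER = re.compile(r"^<[^<>]+>$")


def build_parameters(values):
    """Return the parameters as a list of ParameterKey/ParameterValue dicts.

    Raises ValueError if a key is missing or a placeholder was left unreplaced.
    """
    missing = [key for key in PARAMETER_KEYS if key not in values]
    if missing:
        raise ValueError("missing parameters: " + ", ".join(missing))
    unreplaced = [key for key in PARAMETER_KEYS if PLACEHOLDER.match(values[key])]
    if unreplaced:
        raise ValueError("placeholders not replaced: " + ", ".join(unreplaced))
    return [
        {"ParameterKey": key, "ParameterValue": values[key]}
        for key in PARAMETER_KEYS
    ]


# Hypothetical example values; substitute the ones from your own account.
example_values = {
    "ClusterName": "ecs-cluster-demo",
    "ServiceName": "scraping-worker",
    "ImageUrl": "123456789012.dkr.ecr.us-east-1.amazonaws.com/scraping-worker:latest",
    "BucketName": "my-scraping-output-bucket",
    "VpcId": "vpc-0123456789abcdef0",
    "VpcCidr": "10.0.0.0/16",
    "PubSubnet1Id": "subnet-0123456789abcdef0",
    "PubSubnet2Id": "subnet-0fedcba9876543210",
}

if __name__ == "__main__":
    print(json.dumps(build_parameters(example_values), indent=2))
```

The printed JSON could be saved to a file (for example a hypothetical parameters.json) and passed to the AWS CLI as --parameters file://parameters.json instead of the inline key/value pairs.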

Values to be replaced:

<IMAGE_URL> - The URI of the ECR image uploaded by the setup.sh script (ImageUrl).

<S3_BUCKET> - The bucket we created for the output files (BucketName).

<VPC_ID> - ID of the VPC in which the ECS cluster will be provisioned.

<VPC_CIDR> - CIDR block of the VPC in which the ECS cluster will be provisioned.

<PUB_SUBNET_1_ID> - ID of the first public subnet used to provision the ECS cluster.

<PUB_SUBNET_2_ID> - ID of the second public subnet used to provision the ECS cluster.

Testing the solution

Check your S3 bucket for the output file. You will see that our solution looks for "I love Python" on the official Python website, because those were the parameters set in our code. This is just a demonstration; you can change what the worker searches for.
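
The phrase check mentioned above amounts to a text search over the extracted page content. Here is a minimal sketch, assuming the extracted text is available as a plain string. The function name find_phrase and the case-insensitive matching are assumptions; the worker's actual matching logic is not shown in this README.

```python
def find_phrase(text, phrase):
    """Return the lines of `text` that contain `phrase`, ignoring case."""
    needle = phrase.casefold()
    return [
        line.strip()
        for line in text.splitlines()
        if needle in line.casefold()
    ]


if __name__ == "__main__":
    # Hypothetical page text standing in for the content of the output file.
    page_text = "Welcome to Python.org\nI love Python\nDownloads"
    for match in find_phrase(page_text, "I love Python"):
        print(match)
```

Changing the phrase argument (or the URL the worker visits) is the kind of adjustment the paragraph above refers to.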

Our ECS cluster will keep running web scraping tasks one after another: because we set the desired number of running tasks to 1, each time a scraping task finishes a new one is started to perform web scraping again. Don't forget to clean up everything we provisioned, because leaving it running may incur additional costs.

Cleaning up:

  • Delete the CloudFormation stack.

    aws cloudformation delete-stack \
        --stack-name ecs-demo-scraping

  • Delete all the files inside of the provisioned S3 bucket.

    aws s3 rm s3://<S3_BUCKET> --recursive

  • Delete the provisioned S3 bucket.

    aws s3 rb s3://<S3_BUCKET> --force

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
