
🕷 Serverless Web Crawler and Search Engine with Step Functions and Kendra

Overview

This sample demonstrates how to create a serverless web crawler (or web scraper) using AWS Lambda and AWS Step Functions. It scales to crawl large websites that would time out if crawled by a single Lambda function. The web crawler is written in TypeScript, and uses Puppeteer to extract content and URLs from a given webpage. The sample is described in more detail in the AWS Architecture Blog post: https://aws.amazon.com/blogs/architecture/scaling-up-a-serverless-web-crawler-and-search-engine/
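To make the crawling step concrete, here is a minimal sketch of the kind of content and URL extraction Puppeteer enables. The function name, return shape, and selectors are illustrative assumptions, not the project's actual code:

// Illustrative sketch only -- not the project's actual crawler code.
// Extracts the text content and the links found on a single page.
import puppeteer from 'puppeteer';

export interface PageResult {
  content: string;
  urls: string[];
}

export const crawlPage = async (url: string): Promise<PageResult> => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle0' });

    // The visible text of the page body becomes the content we index later
    const content = await page.evaluate(() => document.body.innerText);

    // Every absolute href on the page is a candidate for the URL queue
    const urls = await page.evaluate(() =>
      Array.from(document.querySelectorAll('a[href]'), (a) => (a as HTMLAnchorElement).href),
    );

    return { content, urls };
  } finally {
    await browser.close();
  }
};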

Additionally, this sample demonstrates an example use case for the crawler by indexing crawled content into Amazon Kendra, providing machine-learning-powered search over the crawled content. The CloudFormation stack for the Kendra resources is optional; you can deploy just the web crawler if you like. Make sure to review Kendra's pricing and free tier before deploying the Kendra part of the sample.

The AWS Cloud Development Kit (CDK) is used to define the infrastructure for this sample as code.
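For orientation, a minimal sketch of what a CDK stack for the crawler might look like is shown below (written against CDK v2 here; the construct IDs, asset path, and handler name are assumptions, not the project's actual infrastructure code):

// Illustrative sketch only: the rough shape of a CDK stack for this sample.
import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';

class WebCrawlerStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);

    // The Lambda that kicks off a crawl (see the Architecture section below)
    new lambda.Function(this, 'StartCrawlFunction', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'startCrawl.handler',
      code: lambda.Code.fromAsset('lambda/dist'),
    });
  }
}

const app = new cdk.App();
new WebCrawlerStack(app, 'WebCrawlerStack');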

Architecture

[Architecture diagram]

  • The Start Crawl Lambda is invoked with details of the website to crawl.
  • The Start Crawl Lambda creates a DynamoDB table which will be used as the URL queue for the crawl.
  • The Start Crawl Lambda writes the initial URLs to the queue (see the sketch after this list).
  • The Start Crawl Lambda triggers an execution of the web crawler state machine (see the section below).
  • The Web Crawler State Machine crawls the website, visiting URLs it discovers and optionally writing content to S3.
  • Kendra provides us with the ability to search our crawled content in S3.
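As a rough illustration of the second and third steps, here is how seeding the DynamoDB URL queue might look using the AWS SDK for JavaScript v3. The table schema (a url key plus a visited flag) is an assumption for illustration, not the project's actual schema:

// Illustrative sketch only: writing each start path into the URL queue table.
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, PutCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

export const queueInitialUrls = async (
  tableName: string,
  baseUrl: string,
  startPaths: string[],
): Promise<void> => {
  for (const path of startPaths) {
    await ddb.send(new PutCommand({
      TableName: tableName,
      // Resolve each start path against the base URL; mark it as not yet visited
      Item: { url: new URL(path, baseUrl).href, visited: false },
    }));
  }
};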

The Web Crawler

The web crawler is best explained by the AWS Step Functions State Machine diagram:

[State machine diagram]

  • Read Queued Urls: Reads all non-visited URLs from the URL queue DynamoDB table.
  • Crawl Page And Queue Urls: Visits a single webpage, extracts its content, and writes newly discovered URLs to the URL queue. This step is executed in parallel across a batch of URLs; the batch size is configured in lambda/src/config/constants.ts.
  • Continue Execution: Spawns a new state machine execution as the current one approaches the Step Functions execution history limit (25,000 events for standard workflows), so that large crawls are not cut short (see the sketch after this list).
  • Complete Crawl: Deletes the URL queue DynamoDB table and triggers a sync of the Kendra data source, if applicable.
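The Continue Execution step uses a "continue as new" pattern: rather than letting the current execution exhaust its event history, it starts a sibling execution that picks up where this one left off. A minimal sketch with the AWS SDK for JavaScript v3 (the crawl state shape is an assumption):

// Illustrative sketch only: hand the crawl state (e.g. the queue table name)
// to a brand-new execution, whose event history starts again from zero.
import { SFNClient, StartExecutionCommand } from '@aws-sdk/client-sfn';

const sfn = new SFNClient({});

export const continueExecution = async (
  stateMachineArn: string,
  crawlState: Record<string, unknown>,
): Promise<void> => {
  await sfn.send(new StartExecutionCommand({
    stateMachineArn,
    input: JSON.stringify(crawlState),
  }));
};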

Project Structure

  • The infrastructure directory contains the CDK code which defines the AWS infrastructure.
  • The lambda directory contains the source code for the web crawler Lambda functions.

Prerequisites

  • The aws-cli must be installed and configured with an AWS account and profile (see https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html for instructions on how to do this on your preferred development platform). Please ensure your profile is configured with a default AWS region.
  • This project requires Node.js ≥ 10.17.0 and NPM. To check that both are available on your machine, run:
npm -v && node -v
  • The AWS CDK CLI is also required. Install it globally with:
npm i -g aws-cdk

Deploy

This repository provides a utility script to build and deploy the sample.

To deploy the web crawler on its own, run:

./deploy --profile <YOUR_AWS_PROFILE>

Or you can deploy the web crawler with Kendra too:

./deploy --profile <YOUR_AWS_PROFILE> --with-kendra

Note that if deploying with Kendra, you must ensure your profile is configured with one of the AWS regions that support Kendra. See the AWS Regional Services List for details.

Run The Crawler

When the infrastructure has been deployed, you can trigger a run of the crawler with the included utility script:

./crawl --profile <YOUR_AWS_PROFILE> --name lambda-docs --base-url https://docs.aws.amazon.com/ --start-paths /lambda --keywords lambda/latest/dg

You can play with the arguments above to try different websites.

  • --base-url specifies the target website to crawl; only URLs starting with the base URL will be queued.
  • --start-paths specifies one or more paths in the website to start at.
  • --keywords filters the URLs which are queued to only those containing one or more of the given keywords (i.e. in the example above, only URLs containing lambda/latest/dg are added to the queue). A sketch of this filter follows the list.
  • --name is optional, and is used to help identify which Step Functions execution or DynamoDB table corresponds to which crawl.
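Conceptually, the base URL and keyword filtering behave like the predicate below (a sketch, not the project's actual implementation):

// A URL is queued if it is on the target site and, when keywords are
// supplied, contains at least one of them.
export const shouldQueueUrl = (
  url: string,
  baseUrl: string,
  keywords?: string[],
): boolean =>
  url.startsWith(baseUrl) &&
  (!keywords || keywords.length === 0 || keywords.some((k) => url.includes(k)));

// For the example crawl above:
// shouldQueueUrl('https://docs.aws.amazon.com/lambda/latest/dg/welcome.html',
//   'https://docs.aws.amazon.com/', ['lambda/latest/dg'])  // true
// shouldQueueUrl('https://docs.aws.amazon.com/s3/index.html',
//   'https://docs.aws.amazon.com/', ['lambda/latest/dg'])  // false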

The crawl script will print a link to the AWS console so you can watch your Step Function State Machine execution in action.

Search Crawled Content

If you also deployed the Kendra stack (--with-kendra), you can visit the Kendra console to see an example search page for the Kendra index; the crawl script will print a link to this page. Note that once the crawler has completed, it takes a few minutes for Kendra to index the newly stored content.

[Kendra search page screenshot]
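Beyond the console search page, the index can also be queried programmatically. A minimal sketch using the AWS SDK for JavaScript v3 (the index ID would come from the deployed Kendra stack):

// Illustrative sketch only: query the index and pull out the fields
// a search page would typically display.
import { KendraClient, QueryCommand } from '@aws-sdk/client-kendra';

const kendra = new KendraClient({});

export const search = async (indexId: string, queryText: string) => {
  const { ResultItems } = await kendra.send(
    new QueryCommand({ IndexId: indexId, QueryText: queryText }),
  );
  return (ResultItems ?? []).map((item) => ({
    title: item.DocumentTitle?.Text,
    excerpt: item.DocumentExcerpt?.Text,
    uri: item.DocumentURI,
  }));
};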

Run The Crawler Locally

If you're playing with the core crawler logic, it might be handy to test it out locally.

You can run the crawler locally with:

./local-crawl --base-url https://docs.aws.amazon.com/ --start-paths /lambda --keywords lambda/latest/dg

Cleaning Up

You can clean up all your resources when you're done via the destroy script.

If you deployed just the web crawler:

./destroy --profile <YOUR_AWS_PROFILE>

Or if you deployed the web crawler with Kendra too:

./destroy --profile <YOUR_AWS_PROFILE> --with-kendra

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
