sjdirect / abotx

Cross Platform C# Web crawler framework, headless browser, parallel crawler. Please star this project!

Home Page: https://abotx.org

AbotX

Please star this project!!

A powerful C# web crawler that makes advanced crawling features easy to use. AbotX builds upon Abot C# Web Crawler Framework by providing a powerful set of wrappers and extensions.

Features

Crawl multiple sites concurrently (ParallelCrawlerEngine)
Pause/resume live crawls (CrawlerX & ParallelCrawlerEngine)
Render javascript before processing (CrawlerX & ParallelCrawlerEngine)
Simplified pluggability/extensibility (CrawlerX & ParallelCrawlerEngine)
Avoid getting blocked by sites (AutoThrottling)
Automatically tune speed/concurrency (AutoTuning)

AbotX used to be a commercial product but is now FREE! Use the AbotX.Lic file in the root of this repository.

Technical Details

Version 2.x targets .NET Standard 2.0 (compatible with .NET Framework 4.6.1+ or .NET Core 2+)
Version 1.x targets .NET Framework 4.0 (support ends soon, please upgrade)

Installing AbotX

Install AbotX using NuGet

PM> Install-Package AbotX

If you have an AbotX.lic file, make sure it ends up in the bin directory of your application (i.e. in the same directory as the AbotX.dll file).

Quick Start

AbotX adds advanced functionality, shortcuts and configurations to the rock solid Abot C# Web Crawler. It is recommended that you start with Abot's documentation and quick start before coming here.

AbotX has two main entry points: CrawlerX and ParallelCrawlerEngine. CrawlerX is a single crawler instance (a child of Abot's PoliteWebCrawler class), while ParallelCrawlerEngine creates and manages multiple instances of CrawlerX. If you just want to crawl a single site, CrawlerX is where you want to start. If you want to crawl a configurable number of sites concurrently within the same process, the ParallelCrawlerEngine is what you are after.

Using AbotX

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Abot2;
using AbotX2.Crawler;
using AbotX2.Parallel;
using AbotX2.Poco;
using Serilog;

namespace AbotX2.Demo
{
    class Program
    {
        static async Task Main(string[] args)
        {
            //Use Serilog to log
            Log.Logger = new LoggerConfiguration()
                .MinimumLevel.Information()
                .Enrich.WithThreadId()
                .WriteTo.Console(outputTemplate: Constants.LogFormatTemplate)
                .CreateLogger();

            var siteToCrawl = new Uri("YourSiteHere");

            //Uncomment to demo major features
            //await DemoCrawlerX_PauseResumeStop(siteToCrawl);
            //await DemoCrawlerX_JavascriptRendering(siteToCrawl);
            //await DemoCrawlerX_AutoTuning(siteToCrawl);
            //await DemoCrawlerX_Throttling(siteToCrawl);
            //await DemoParallelCrawlerEngine();
        }

        private static async Task DemoCrawlerX_PauseResumeStop(Uri siteToCrawl)
        {
            using (var crawler = new CrawlerX(GetSafeConfig()))
            {
                crawler.PageCrawlCompleted += (sender, args) =>
                {
                    //Check out args.CrawledPage for any info you need
                };
                var crawlTask = crawler.CrawlAsync(siteToCrawl);

                crawler.Pause();    //Suspend all operations

                Thread.Sleep(7000);

                crawler.Resume();   //Resume as if nothing happened

                crawler.Stop(true); //Stop or abort the crawl

                await crawlTask;
            }
        }

        private static async Task DemoCrawlerX_JavascriptRendering(Uri siteToCrawl)
        {
            var pathToPhantomJSExeFolder = @"[YourNugetPackagesLocationAbsolutePath]\PhantomJS.2.1.1\tools\phantomjs";
            var config = new CrawlConfigurationX
            {
                IsJavascriptRenderingEnabled = true,
                JavascriptRendererPath = pathToPhantomJSExeFolder,
                IsSendingCookiesEnabled = true,
                MaxConcurrentThreads = 1,
                MaxPagesToCrawl = 1,
                JavascriptRenderingWaitTimeInMilliseconds = 3000,
                CrawlTimeoutSeconds = 20
            };

            using (var crawler = new CrawlerX(config))
            {
                crawler.PageCrawlCompleted += (sender, args) =>
                {
                    //JS should be fully rendered here args.CrawledPage.Content.Text
                };

                await crawler.CrawlAsync(siteToCrawl);
            }
        }

        private static async Task DemoCrawlerX_AutoTuning(Uri siteToCrawl)
        {
            var config = GetSafeConfig();
            config.AutoTuning = new AutoTuningConfig
            {
                IsEnabled = true,
                CpuThresholdHigh = 85,
                CpuThresholdMed = 65,
                MinAdjustmentWaitTimeInSecs = 10
            };
            //Optional, configure how aggressively to speed up or down during throttling
            config.Accelerator = new AcceleratorConfig();
            config.Decelerator = new DeceleratorConfig();

            //Now the crawl is able to "AutoTune" itself if the host machine
            //is showing signs of stress.
            using (var crawler = new CrawlerX(config))
            {
                crawler.PageCrawlCompleted += (sender, args) =>
                {
                    //Check out args.CrawledPage for any info you need
                };
                await crawler.CrawlAsync(siteToCrawl);
            }
        }

        private static async Task DemoCrawlerX_Throttling(Uri siteToCrawl)
        {
            var config = GetSafeConfig();
            config.AutoThrottling = new AutoThrottlingConfig
            {
                IsEnabled = true,
                ThresholdHigh = 2,
                ThresholdMed = 2,
                MinAdjustmentWaitTimeInSecs = 10
            };
            //Optional, configure how aggressively to speed up or down during throttling
            config.Accelerator = new AcceleratorConfig();
            config.Decelerator = new DeceleratorConfig();

            //Now the crawl is able to "Throttle" itself if the site being crawled
            //is showing signs of stress.
            using (var crawler = new CrawlerX(config))
            {
                crawler.PageCrawlCompleted += (sender, args) =>
                {
                    //Check out args.CrawledPage for any info you need
                };
                await crawler.CrawlAsync(siteToCrawl);
            }
        }

        private static async Task DemoParallelCrawlerEngine()
        {
            var siteToCrawlProvider = new SiteToCrawlProvider();
            siteToCrawlProvider.AddSitesToCrawl(new List<SiteToCrawl>
            {
                new SiteToCrawl{ Uri = new Uri("YOURSITE1") },
                new SiteToCrawl{ Uri = new Uri("YOURSITE2") },
                new SiteToCrawl{ Uri = new Uri("YOURSITE3") },
                new SiteToCrawl{ Uri = new Uri("YOURSITE4") },
                new SiteToCrawl{ Uri = new Uri("YOURSITE5") }
            });

            var config = GetSafeConfig();
            config.MaxConcurrentSiteCrawls = 3;
                
            var crawlEngine = new ParallelCrawlerEngine(
                config, 
                new ParallelImplementationOverride(config, 
                    new ParallelImplementationContainer()
                    {
                        SiteToCrawlProvider = siteToCrawlProvider,
                        WebCrawlerFactory = new WebCrawlerFactory(config)//Same config will be used for every crawler
                    })
                );                
            
            var crawlCounts = new Dictionary<Guid, int>();
            var siteStartingEvents = 0;
            var allSitesCompletedEvents = 0;
            crawlEngine.CrawlerInstanceCreated += (sender, eventArgs) =>
            {
                var crawlId = Guid.NewGuid();
                eventArgs.Crawler.CrawlBag.CrawlId = crawlId;
            };
            crawlEngine.SiteCrawlStarting += (sender, args) =>
            {
                Interlocked.Increment(ref siteStartingEvents);
            };
            crawlEngine.SiteCrawlCompleted += (sender, eventArgs) =>
            {
                lock (crawlCounts)
                {
                    crawlCounts.Add(eventArgs.CrawledSite.SiteToCrawl.Id, eventArgs.CrawledSite.CrawlResult.CrawlContext.CrawledCount);
                }
            };
            crawlEngine.AllCrawlsCompleted += (sender, eventArgs) =>
            {
                Interlocked.Increment(ref allSitesCompletedEvents);
            };

            await crawlEngine.StartAsync();
        }

        private static CrawlConfigurationX GetSafeConfig()
        {
            /*The following settings will help not get your ip banned
             by the sites you are trying to crawl. The idea is to crawl
             only 10 pages and wait 2 seconds between http requests
             */
            return new CrawlConfigurationX
            {
                MaxPagesToCrawl = 10,
                MinCrawlDelayPerDomainMilliSeconds = 2000
            };
        }
    }
}

CrawlerX

CrawlerX is an object that represents an individual crawler that crawls a single site at a time. It is a subclass of Abot's PoliteWebCrawler and adds some useful functionality.

Example Usage

Create an instance and register for events...

var crawler = new CrawlerX();
crawler.PageCrawlStarting += crawler_ProcessPageCrawlStarting;
crawler.PageCrawlCompleted += crawler_ProcessPageCrawlCompleted;
crawler.PageCrawlDisallowed += crawler_PageCrawlDisallowed;
crawler.PageLinksCrawlDisallowed += crawler_PageLinksCrawlDisallowed;

Working with some common events...

void crawler_ProcessPageCrawlStarting(object sender, PageCrawlStartingArgs e)
{
    PageToCrawl pageToCrawl = e.PageToCrawl;
    Console.WriteLine("About to crawl link {0} which was found on page {1}", pageToCrawl.Uri.AbsoluteUri, pageToCrawl.ParentUri.AbsoluteUri);
}

void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;

    if (crawledPage.WebException != null || crawledPage.HttpWebResponse.StatusCode != HttpStatusCode.OK)
        Console.WriteLine("Crawl of page failed {0}", crawledPage.Uri.AbsoluteUri);
    else
        Console.WriteLine("Crawl of page succeeded {0}", crawledPage.Uri.AbsoluteUri);

    if (string.IsNullOrEmpty(crawledPage.Content.Text))
        Console.WriteLine("Page had no content {0}", crawledPage.Uri.AbsoluteUri);
}

void crawler_PageLinksCrawlDisallowed(object sender, PageLinksCrawlDisallowedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;
    Console.WriteLine("Did not crawl the links on page {0} due to {1}", crawledPage.Uri.AbsoluteUri, e.DisallowedReason);
}

void crawler_PageCrawlDisallowed(object sender, PageCrawlDisallowedArgs e)
{
    PageToCrawl pageToCrawl = e.PageToCrawl;
    Console.WriteLine("Did not crawl page {0} due to {1}", pageToCrawl.Uri.AbsoluteUri, e.DisallowedReason);
}

Run the crawl synchronously

var result = crawler.Crawl(new Uri("YourSiteHere"));

Run the crawl asynchronously

var result = await crawler.CrawlAsync(new Uri("YourSiteHere"));

Easy Override

CrawlerX has default implementations for all its dependencies. However, there are times when you may want to override one or all of those implementations. Below is an example of how you would plug in your own implementations. The new ImplementationOverride class makes plugging in nested dependencies much easier than it used to be with Abot. It will handle finding exactly where that implementation is needed.

var impls = new ImplementationOverride(config, new ImplementationContainer {
    HyperlinkParser = new YourImpl1(),
    PageRequester = new YourImpl2()
});

var crawler = new CrawlerX(config, impls);

Pause And Resume

Pause and resume work as you would expect. However, be aware that any in-progress http requests will be finished and processed, and any events related to them will be fired.

var crawler = new CrawlerX();

crawler.PageCrawlCompleted += (sender, args) =>
{
    //You will be interested in args.CrawledPage & args.CrawlContext
};

var crawlerTask = crawler.CrawlAsync(new Uri("http://blahblahblah.com"));

System.Threading.Thread.Sleep(3000);
crawler.Pause();
System.Threading.Thread.Sleep(10000);
crawler.Resume();

var result = crawlerTask.Result;

Stop

Stopping the crawl is as simple as calling Stop(). The call to Stop() tells AbotX to not make any new http requests but to finish any that are in progress. Any events and processing of the in progress requests will finish before CrawlerX stops the crawl.

var crawler = new CrawlerX();

crawler.PageCrawlCompleted += (sender, args) =>
{
    //You will be interested in args.CrawledPage & args.CrawlContext
};

var crawlerTask = crawler.CrawlAsync(new Uri("http://blahblahblah.com"));

System.Threading.Thread.Sleep(3000);
crawler.Stop();
var result = crawlerTask.Result;

By passing true to the Stop() method, AbotX will stop the crawl more abruptly. Anything in progress will be aborted.

crawler.Stop(true);

Speed Up

CrawlerX can be "sped up" by calling the SpeedUp() method. The call to SpeedUp() tells AbotX to increase the number of concurrent http requests to the site currently being crawled. You can call this method as many times as you like. Adjustments are made instantly so you should see more concurrency immediately.

crawler.CrawlAsync(new Uri("http://localhost:1111/"));

System.Threading.Thread.Sleep(3000);
crawler.SpeedUp();

System.Threading.Thread.Sleep(3000);
crawler.SpeedUp();

See the "Configure Speed Up And Slow Down" section for more details on how to control exactly what happens when SpeedUp() is called.

Slow Down

CrawlerX can be "slowed down" by calling the SlowDown() method. The call to SlowDown() tells AbotX to reduce the number of concurrent http requests to the site currently being crawled. You can call this method as many times as you like. Any currently executing http requests will finish normally before any adjustments are made.

crawler.CrawlAsync(new Uri("http://localhost:1111/"));

System.Threading.Thread.Sleep(3000);
crawler.SlowDown();

System.Threading.Thread.Sleep(3000);
crawler.SlowDown();

See the "Configure Speed Up And Slow Down" section for more details on how to control exactly what happens when SlowDown() is called.

Parallel Crawler Engine

A crawler instance can crawl a single site quickly. However, if you have to crawl 10,000 sites quickly you need the ParallelCrawlerEngine. It allows you to crawl a configurable number of sites concurrently to maximize throughput.

Example Usage

The concurrency is configurable by setting MaxConcurrentSiteCrawls in the config. The default value is 3, so the following block of code will crawl three sites simultaneously.

static void Main(string[] args)
{
    var siteToCrawlProvider = new SiteToCrawlProvider();
    siteToCrawlProvider.AddSitesToCrawl(new List<SiteToCrawl>
    {
        new SiteToCrawl{ Uri = new Uri("http://somesitetocrawl1.com/") },
        new SiteToCrawl{ Uri = new Uri("http://somesitetocrawl2.com/") },
        new SiteToCrawl{ Uri = new Uri("http://somesitetocrawl3.com/") },
    });

    //Use the default configuration (MaxConcurrentSiteCrawls defaults to 3)
    var config = new CrawlConfigurationX();

    //Create the crawl engine instance
    var impls = new ParallelImplementationOverride(
        config,
        new ParallelImplementationContainer
        {
            SiteToCrawlProvider = siteToCrawlProvider,
            WebCrawlerFactory = yourWebCrawlerFactory //YOU NEED TO IMPLEMENT THIS!!!!
        }
    );

    var crawlEngine = new ParallelCrawlerEngine(config, impls);

    //Register for site level events
    crawlEngine.AllCrawlsCompleted += (sender, eventArgs) =>
    {
        Console.WriteLine("Completed crawling all sites");
    };
    crawlEngine.SiteCrawlCompleted += (sender, eventArgs) =>
    {
        Console.WriteLine("Completed crawling site {0}", eventArgs.CrawledSite.SiteToCrawl.Uri);       
    };
    crawlEngine.CrawlerInstanceCreated += (sender, eventArgs) =>
    {
        //Register for crawler level events. These are Abot's events!!!
        eventArgs.Crawler.PageCrawlCompleted += (abotSender, abotEventArgs) =>
        {
            Console.WriteLine("You have the crawled page here in abotEventArgs.CrawledPage...");
        };
    };

    crawlEngine.StartAsync();

    Console.WriteLine("Press enter key to stop");
    Console.Read();
}

Easy Override Of Default Implementations

ParallelCrawlerEngine allows easy override of one or all of its dependent implementations. Below is an example of how you would plug in your own implementations (same as above). The new ParallelImplementationOverride class makes plugging in nested dependencies much easier than it used to be. It will handle finding exactly where that implementation is needed.

var impls = new ParallelImplementationOverride(config, new ParallelImplementationContainer {
    SiteToCrawlProvider = yourSiteToCrawlProvider,
    WebCrawlerFactory = yourFactory,
    //...(Excluded)
});

var crawlEngine = new ParallelCrawlerEngine(config, impls);

Pause And Resume

Pause and resume on the ParallelCrawlerEngine simply relay the command to each active CrawlerX instance. However, be aware that any in-progress http requests will be finished and processed, and any events related to them will be fired.

crawlEngine.StartAsync();

System.Threading.Thread.Sleep(3000);
crawlEngine.Pause();
System.Threading.Thread.Sleep(10000);
crawlEngine.Resume();

Stop

crawlEngine.StartAsync();

System.Threading.Thread.Sleep(3000);
crawlEngine.Stop();

By passing true to the Stop() method, it will stop each CrawlerX instance more abruptly. Anything in progress will be aborted.

crawlEngine.Stop(true);

Speed Up

The ParallelCrawlerEngine can be "sped up" by calling the SpeedUp() method. The call to SpeedUp() tells AbotX to increase the number of concurrent site crawls that are currently running. You can call this method as many times as you like. Adjustments are made instantly so you should see more concurrency immediately.

crawlEngine.StartAsync();

System.Threading.Thread.Sleep(3000);
crawlEngine.SpeedUp();

System.Threading.Thread.Sleep(3000);
crawlEngine.SpeedUp();

See the "Configure Speed Up And Slow Down" section for more details on how to control exactly what happens when SpeedUp() is called.

Slow Down

The ParallelCrawlerEngine can be "slowed down" by calling the SlowDown() method. The call to SlowDown() tells AbotX to reduce the number of concurrent site crawls that are currently running. You can call this method as many times as you like. Any currently executing crawls will finish normally before any adjustments are made.

crawlEngine.StartAsync();

System.Threading.Thread.Sleep(3000);
crawlEngine.SlowDown();

System.Threading.Thread.Sleep(3000);
crawlEngine.SlowDown();

See the "Configure Speed Up And Slow Down" section for more details on how to control exactly what happens when SlowDown() is called.

Configure Speed Up And Slow Down

Multiple features trigger AbotX to speed up or to slow down crawling. The Accelerator and Decelerator are two independently configurable components that determine exactly how aggressively AbotX reacts to a situation that triggers a SpeedUp or SlowDown. The defaults work fine for most cases, but the following options give you further control.

Accelerator

Name	Description	Used By
config.Accelerator.ConcurrentSiteCrawlsIncrement	The number to increment the MaxConcurrentSiteCrawls for each call to the SpeedUp() method. This deals with site crawl concurrency, NOT the number of concurrent http requests to a single site crawl.	ParallelCrawlerEngine
config.Accelerator.ConcurrentRequestIncrement	The number to increment the MaxConcurrentThreads for each call to the SpeedUp() method. This deals with the number of concurrent http requests for a single crawl.	CrawlerX
config.Accelerator.DelayDecrementInMilliseconds	If there is a configured (manual or programmatically determined) delay in between requests to a site, this is the number of milliseconds to remove from that configured value on every call to the SpeedUp() method.	CrawlerX
config.Accelerator.MinDelayInMilliseconds	If there is a configured (manual or programmatically determined) delay in between requests to a site, this is the minimum number of milliseconds to delay no matter how many calls are made to the SpeedUp() method.	CrawlerX
config.Accelerator.ConcurrentSiteCrawlsMax	The maximum number of concurrent site crawls to allow no matter how many calls are made to the SpeedUp() method.	ParallelCrawlerEngine
config.Accelerator.ConcurrentRequestMax	The maximum number of concurrent http requests to a single site no matter how many calls are made to the SpeedUp() method.	CrawlerX

Decelerator

Name	Description	Used By
config.Decelerator.ConcurrentSiteCrawlsDecrement	The number to decrement the MaxConcurrentSiteCrawls for each call to the SlowDown() method. This deals with site crawl concurrency, NOT the number of concurrent http requests to a single site crawl.	ParallelCrawlerEngine
config.Decelerator.ConcurrentRequestDecrement	The number to decrement the MaxConcurrentThreads for each call to the SlowDown() method. This deals with the number of concurrent http requests for a single crawl.	CrawlerX
config.Decelerator.DelayIncrementInMilliseconds	If there is a configured (manual or programmatically determined) delay in between requests to a site, this is the number of milliseconds to add to that configured value on every call to the SlowDown() method.	CrawlerX
config.Decelerator.MaxDelayInMilliseconds	The maximum value the delay can be.	CrawlerX
config.Decelerator.ConcurrentSiteCrawlsMin	The minimum number of concurrent site crawls to allow no matter how many calls are made to the SlowDown() method.	ParallelCrawlerEngine
config.Decelerator.ConcurrentRequestMin	The minimum number of concurrent http requests to a single site no matter how many calls are made to the SlowDown() method.	CrawlerX
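
To make the interplay between the increment/decrement settings and the min/max settings more concrete, here is a minimal sketch. It is not AbotX's implementation. It assumes that each SpeedUp() or SlowDown() call applies the configured step once and then clamps the result to the configured bound. The SpeedControlSketch class and its method names are hypothetical.

```csharp
using System;

// Hypothetical helper, not part of AbotX. Models one SpeedUp()/SlowDown() step
// under the assumption that the step is applied once and then clamped.
public static class SpeedControlSketch
{
    // Concurrency grows by `increment` but never exceeds `max`.
    public static int SpeedUpConcurrency(int current, int increment, int max)
        => Math.Min(current + increment, max);

    // Concurrency shrinks by `decrement` but never drops below `min`.
    public static int SlowDownConcurrency(int current, int decrement, int min)
        => Math.Max(current - decrement, min);

    // Delay shrinks by `decrementMs` but never drops below `minDelayMs`.
    public static int SpeedUpDelay(int currentMs, int decrementMs, int minDelayMs)
        => Math.Max(currentMs - decrementMs, minDelayMs);

    // Delay grows by `incrementMs` but never exceeds `maxDelayMs`.
    public static int SlowDownDelay(int currentMs, int incrementMs, int maxDelayMs)
        => Math.Min(currentMs + incrementMs, maxDelayMs);
}
```

Under this assumption, with ConcurrentRequestDecrement = 2 and ConcurrentRequestMin = 1, three SlowDown() calls starting from 4 concurrent requests would go 4, 2, 1, 1. The minimum acts as a floor rather than stopping the crawl.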

Javascript Rendering

Many web pages on the internet today use javascript to create the final page rendering. Most web crawlers do not render the javascript but instead just process the raw html sent back by the server. Use this feature to render javascript before processing.

Additional Installation Step

If you plan to use Javascript rendering there is an additional step for the time being. Unfortunately, NuGet has proven to be a train wreck as .NET has advanced (.NET Core vs Standard, PackageReference vs Packages.config, dotnet pack vs nuget pack, etc.). This has caused some packages that AbotX depends on to no longer install correctly. Specifically, the PhantomJS package no longer adds the phantomjs.exe file to your project and marks it for output to the bin directory.

The workaround is to manually add this file to your project and set it as "Content" and "Copy If Newer". This will make sure the phantomjs.exe file is in the bin directory when AbotX needs it. This package is already referenced by AbotX so you will have a copy of this file at "[YourNugetPackagesLocationAbsolutePath]\PhantomJS.2.1.1\tools\phantomjs". Another option is to tell AbotX where to look for the file by using the CrawlConfigurationX.JavascriptRendererPath config value. This path is the DIRECTORY that contains the phantomjs.exe file, not the path to the file itself.
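
Because JavascriptRendererPath points at a directory rather than at the executable, a wrong value is easy to miss until a crawl starts. The following is a minimal sketch of a pre-flight check. PhantomJsPathSketch and RequireRendererDirectory are hypothetical names and are not part of AbotX. The sketch only uses System.IO to confirm that phantomjs.exe is present in the given directory.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of AbotX. Verifies that a directory contains
// phantomjs.exe before it is assigned to JavascriptRendererPath.
public static class PhantomJsPathSketch
{
    public const string ExeName = "phantomjs.exe";

    // Returns the directory unchanged when the executable is present.
    // Otherwise throws, so the problem surfaces before the crawl starts.
    public static string RequireRendererDirectory(string directory)
    {
        var exePath = Path.Combine(directory, ExeName);
        if (!File.Exists(exePath))
            throw new FileNotFoundException(
                "Expected " + ExeName + " in the renderer directory.", exePath);
        return directory;
    }
}
```

It could be used when building the configuration, for example JavascriptRendererPath = PhantomJsPathSketch.RequireRendererDirectory(pathToPhantomJSExeFolder).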

Performance Considerations

Rendering javascript is a much slower operation than just requesting the page source. The browser has to make the initial request to the web server for the page source, then request, wait for and load all the external resources. Care must be taken in how you configure AbotX when this feature is enabled. A modern machine with an Intel i7 processor and 8+ GB of RAM could crawl 30-50 sites concurrently, with each of those crawls spawning 10+ threads. However, if javascript rendering is enabled, that same configuration would overwhelm the host machine.

Safe Configuration

The following is an example how to configure Abot/AbotX to run with javascript rendering enabled for a modern host machine that has an Intel I7 processor and at least 16GB of ram. If it has 4 cores and 8 logical processors, it should be able to handle this configuration under normal circumstances.

var config = new CrawlConfigurationX
{
    IsJavascriptRenderingEnabled = true,
    JavascriptRenderingWaitTimeInMilliseconds = 3000, //How long to wait for js to process 
    MaxConcurrentSiteCrawls = 1,                      //Only crawl a single site at a time
    MaxConcurrentThreads = 8,                         //Logical processor count to avoid cpu thrashing
};
var crawler = new CrawlerX(config);

//Add optional decision whether javascript should be rendered
crawler.ShouldRenderPageJavascript((crawledPage, crawlContext) =>
{
    if(crawledPage.Uri.AbsoluteUri.Contains("ghost"))
        return new CrawlDecision {Allow = false, Reason = "scared to render ghost javascript"};

    return new CrawlDecision { Allow = true };
}); //You can implement IDecisionMakerX interface for even more control
var crawlerTask = crawler.CrawlAsync(new Uri("http://blahblahblah.com"));

Auto Throttling

Most websites you crawl cannot or will not handle the load of a web crawler. Auto Throttling automatically slows down the crawl speed if the website being crawled is showing signs of stress or unwillingness to respond to the frequency of http requests.

Example Usage

var config = new CrawlConfigurationX
{
    AutoThrottling = new AutoThrottlingConfig
    {
        IsEnabled = true,
        ThresholdHigh = 10,                 //default
        ThresholdMed = 5,                   //default
        ThresholdTimeInMilliseconds = 5000, //default
        MinAdjustmentWaitTimeInSecs = 30    //default
    },
    Decelerator = new DeceleratorConfig
    {
        ConcurrentSiteCrawlsDecrement = 2,      //default
        ConcurrentRequestDecrement = 2,         //default
        DelayIncrementInMilliseconds = 2000,    //default
        MaxDelayInMilliseconds = 15000,         //default
        ConcurrentSiteCrawlsMin = 1,            //default
        ConcurrentRequestMin = 1                //default
    },
    MaxRetryCount = 3,
};

Using CrawlerX (single instance of a crawler)

var crawler = new CrawlerX(config);
crawler.CrawlAsync(new Uri(url));

Using ParallelCrawlerEngine (multiple instances of crawlers)

var crawlEngine = new ParallelCrawlerEngine(config);

Configure the sensitivity to what will trigger throttling

Name	Description	Used By
config.AutoThrottling.IsEnabled	Whether to enable the AutoThrottling feature	CrawlerX
config.AutoThrottling.ThresholdHigh	The number of "stressed" requests before considering a crawl as under high stress	CrawlerX
config.AutoThrottling.ThresholdMed	The number of "stressed" requests before considering a crawl as under medium stress	CrawlerX
config.AutoThrottling.ThresholdTimeInMilliseconds	The number of elapsed milliseconds in response time that would consider the response "stressed"	CrawlerX
config.AutoThrottling.MinAdjustmentWaitTimeInSecs	The minimum number of seconds since the last throttled request to wait before attempting to check/adjust throttling again. We want to give the last adjustment a chance to work before adjusting again.	CrawlerX
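
The way these thresholds relate to each other can be illustrated with a small model. This is a sketch under stated assumptions, not AbotX's actual algorithm. It assumes a response counts as "stressed" when its elapsed time is at least ThresholdTimeInMilliseconds, and that the count of stressed responses is compared directly against ThresholdMed and ThresholdHigh. The exact comparison and the window over which AbotX counts are not documented here. StressLevel and ThrottleSketch are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model, not AbotX code.
public enum StressLevel { Low, Medium, High }

public static class ThrottleSketch
{
    // Counts responses at or above thresholdTimeMs and maps that count
    // to a stress level using the medium and high thresholds.
    public static StressLevel Classify(
        IEnumerable<double> responseTimesMs,
        double thresholdTimeMs,
        int thresholdMed,
        int thresholdHigh)
    {
        int stressed = responseTimesMs.Count(ms => ms >= thresholdTimeMs);
        if (stressed >= thresholdHigh) return StressLevel.High;
        if (stressed >= thresholdMed) return StressLevel.Medium;
        return StressLevel.Low;
    }
}
```

With the defaults shown above (5000 ms, 5 and 10), this model would treat five slow responses as medium stress and ten as high stress.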

See the "Configure Speed Up And Slow Down" section for more details on how to control exactly what happens during AutoThrottling in regards to slowing down the crawl (Decelerator).

Auto Tuning

It's difficult to predict what your machine can handle when the sites you will crawl/process all require different levels of machine resources. Auto tuning automatically monitors the host machine's resource usage and adjusts the crawl speed and concurrency to maximize throughput without overrunning it.

Example Usage

var config = new CrawlConfigurationX
{
    AutoTuning = new AutoTuningConfig
    {
        IsEnabled = true,
        CpuThresholdHigh = 85,              //default
        CpuThresholdMed = 65,               //default
        MinAdjustmentWaitTimeInSecs = 30    //default
    },
    Accelerator = new AcceleratorConfig
    {
        ConcurrentSiteCrawlsIncrement = 2,      //default
        ConcurrentRequestIncrement = 2,         //default
        DelayDecrementInMilliseconds = 2000,    //default
        MinDelayInMilliseconds = 0              //default
    },
    Decelerator = new DeceleratorConfig
    {
        ConcurrentSiteCrawlsDecrement = 2,      //default
        ConcurrentRequestDecrement = 2,         //default
        DelayIncrementInMilliseconds = 2000,    //default
        MaxDelayInMilliseconds = 15000,         //default
        ConcurrentSiteCrawlsMin = 1,            //default
        ConcurrentRequestMin = 1                //default
    },
    MaxRetryCount = 3,
};

//These reference config, so they must be set after config has been constructed
config.Accelerator.ConcurrentSiteCrawlsMax = config.MaxConcurrentSiteCrawls;   //default is 0
config.Accelerator.ConcurrentRequestMax = config.MaxConcurrentThreads;         //default is 0

Using CrawlerX (single instance of a crawler)

var crawler = new CrawlerX(config);
crawler.CrawlAsync(new Uri(url));

Using ParallelCrawlerEngine (multiple instances of crawlers)

var crawlEngine = new ParallelCrawlerEngine(config);

Configure the sensitivity to what will trigger tuning

Name	Description	Used By
config.AutoTuning.IsEnabled	Whether to enable the AutoTuning feature	CrawlerX & ParallelCrawlerEngine
config.AutoTuning.CpuThresholdHigh	The avg cpu percentage before considering a host as under high stress	CrawlerX & ParallelCrawlerEngine
config.AutoTuning.CpuThresholdMed	The avg cpu percentage before considering a host as under medium stress	CrawlerX & ParallelCrawlerEngine
config.AutoTuning.MinAdjustmentWaitTimeInSecs	The minimum number of seconds since the last tuned action to wait before attempting to check/adjust tuning again. We want to give the last adjustment a chance to work before adjusting again.	CrawlerX & ParallelCrawlerEngine
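
The purpose of MinAdjustmentWaitTimeInSecs, giving the last adjustment a chance to work, can be pictured as a simple cooldown gate. The sketch below is an illustration only and is not how AbotX is known to implement it. AdjustmentCooldown and TryAdjust are hypothetical names. The caller passes in the current time so the behaviour is easy to follow.

```csharp
using System;

// Hypothetical helper, not AbotX code. Permits an adjustment only when at
// least the configured wait time has elapsed since the previous one.
public sealed class AdjustmentCooldown
{
    private readonly TimeSpan _minWait;
    private DateTime? _lastAdjustmentUtc;

    public AdjustmentCooldown(int minAdjustmentWaitTimeInSecs)
    {
        _minWait = TimeSpan.FromSeconds(minAdjustmentWaitTimeInSecs);
    }

    // Returns true and records nowUtc when an adjustment is allowed.
    // Returns false, leaving the recorded time unchanged, while cooling down.
    public bool TryAdjust(DateTime nowUtc)
    {
        if (_lastAdjustmentUtc.HasValue && nowUtc - _lastAdjustmentUtc.Value < _minWait)
            return false;

        _lastAdjustmentUtc = nowUtc;
        return true;
    }
}
```

With a 30 second wait, an adjustment attempted 10 seconds after the previous one would be skipped in this model, while one attempted 30 seconds later would go through.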

See the "Configure Speed Up And Slow Down" section for more details on how to control exactly what happens during AutoTuning in regards to speeding up and slowing down the crawl (Accelerator & Decelerator).

abotx's Issues

custom implementations for IThreadManager doesn't work

Hi,

When using a custom implementation for IThreadManager, injected through AbotX.Poco.ImplementationContainer.ThreadManager, I get the following exception:

2018-06-12 06:04:00,229 [12] FATAL - [AbotLogger] (0) System.InvalidOperationException: Cannot call DoWork() after AbortAll() or Dispose() have been called.
   bei Abot.Util.ThreadManager.DoWork(Action action)
   bei Abot.Crawler.WebCrawler.CrawlSite()
   bei Abot.Crawler.WebCrawler.Crawl(Uri uri, CancellationTokenSource cancellationTokenSource)

The exception is thrown when the crawler crawls the second (and following) seeds in the same thread. While stepping through the code, I saw that Abot.Crawler.WebCrawler gets the thread manager injected through the constructor and disposes it while executing Abot.Crawler.WebCrawler.Crawl(...). When not using a custom thread manager, the default one is constructed in AbotX.Parallel.WebCrawlerFactory.CreateInstance(SiteToCrawl) through the AbotX.Core.ImplementationOverride constructor for every crawl. When using the implementation override, the instance passed to AbotX.Poco.ImplementationContainer.ThreadManager is reused for every crawl. So this instance gets disposed after the first crawl and throws the exception above for all following crawls.
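
The failure mode described above can be reproduced without AbotX. The following is a generic illustration, not AbotX or Abot code. DisposableWorker stands in for a thread manager and CrawlSketch.RunCrawl stands in for a crawl that disposes its injected dependency when it finishes. All names are hypothetical. It shows why a shared instance fails on the second crawl, and how supplying a fresh instance per crawl through a factory delegate avoids that.

```csharp
using System;

// Stand-in for a disposable dependency such as a thread manager.
public sealed class DisposableWorker : IDisposable
{
    private bool _disposed;

    public void DoWork(Action action)
    {
        if (_disposed)
            throw new InvalidOperationException("Cannot call DoWork() after Dispose().");
        action();
    }

    public void Dispose() => _disposed = true;
}

public static class CrawlSketch
{
    // Stand-in for a crawl that disposes its worker when it finishes.
    public static void RunCrawl(DisposableWorker worker, Action work)
    {
        using (worker)
        {
            worker.DoWork(work);
        }
    }

    // Safer pattern: ask a factory for a fresh worker on every crawl,
    // so no crawl ever receives an already disposed instance.
    public static void RunCrawls(Func<DisposableWorker> workerFactory, int crawlCount, Action work)
    {
        for (var i = 0; i < crawlCount; i++)
            RunCrawl(workerFactory(), work);
    }
}
```

Calling RunCrawl twice with the same DisposableWorker throws InvalidOperationException on the second call, while RunCrawls(() => new DisposableWorker(), 3, work) completes all three crawls.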

When I try to instantiate the ParallelCrawlerEngine, I get "Your current AbotX license does not include Auto Tuning."

User reported...

-------------------------------User Reported------------------------------

When I try to instantiate the ParallelCrawlerEngine, I get the below exception. However, I checked the value of AutoTuning.IsEnabled, and it's set to false. So is AutoThrottling.IsEnabled.

The version of AbotX installed is: 1.2.28, installed via NuGet. This also occurred with the previous version, which I believe was 1.1.x

Also, beginning with version 1.2.28, I now get a warning about my call of the ParallelCrawlerEngine's constructor being obsolete. Here is my current constructor call:

var crawlEngine = new ParallelCrawlerEngine(cx, abotFactory, null, siteToCrawlProvider);

Intellisense says that I should be using a constructor that refers to a 'ParallelImplementationOverride' - what's that about?

-------------------------------Stack Trace------------------------------

System.UnauthorizedAccessException was unhandled by user code
HResult=-2147024891
Message=Your current AbotX license does not include Auto Tuning. Please change AutoTuning.IsEnabled to false or upgrade your license.
Source=AbotX
StackTrace:
at AbotX.Core.HostStressAnalyzerCpu.CheckLicense()
at AbotX.Core.HostStressAnalyzerCpu..ctor(CrawlConfigurationX config, ICpuSampler cpuSampler)
at AbotX.Parallel.ParallelImplementationOverride..ctor(CrawlConfigurationX config, ParallelImplementationContainer impls)
at AbotX.Parallel.ParallelCrawlerEngine..ctor(CrawlConfigurationX config, IWebCrawlerFactory webCrawlerFactory, IRateLimiter rateLimiter, ISiteToCrawlProvider siteToCrawlProvider)
at lsSiteCrawler.Crawler.MultipleSitesCrawler.Crawl(Boolean bLoadCSS) in C:\Users\rjones\Documents\Visual Studio 2015\Projects\lsSiteCrawler\src\lsSiteCrawler\Crawler\SiteCrawler.cs:line 797
at lsSiteCrawler.Controllers.HarvestedLinksController.Post() in C:\Users\rjones\Documents\Visual Studio 2015\Projects\lsSiteCrawler\src\lsSiteCrawler\Controllers\HarvestedLinksController.cs:line 218
at lambda_method(Closure , Object , Object[] )
at Microsoft.AspNetCore.Mvc.Internal.ControllerActionInvoker.d__28.MoveNext()
InnerException:

Javascript rendering does not work on Azure Web App or Api managed PAAS

Email contents....

Steven,

Just wanted you to know that even though I’ve managed to get Abot / AbotX running on an Azure WebAPI instance, there is a huge problem with javascript rendering. My WebAPI code was exhibiting horrible performance issues that only seemed to happen when I published the code to an Azure WebAPI instance. When running on my development machine (a Windows 10 VM) it worked just fine. So I opened a support case with Azure support, and they got back to me with why my code was running so slow. It seems that phantomjs.exe is wanting to execute some code that Microsoft disallows in the WebAPI ‘sandboxed’ instance. They identified the code as: NTUserSystemParametersInfo(). They said that the phantomjs executable would attempt this call hundreds of times, each time a failure, before giving up.

So, my fix looks like it’s going to require that I change my code from a WebAPI project to something else that can run on a ‘pure’ Windows VM world (Microsoft support said that there is no problem running phantomjs.exe on a real Windows VM). But before going down that path, I thought you’d like to know this information, because I believe that at one time you had told me that I am the first of your customers to try to run Abot / AbotX on an Azure WebAPI instance. Also, I thought you might have an idea of a workaround (short of disabling javascript rendering!) that would save me the effort of re-writing my code.

Rob Jones

If paid feature is enabled, Abotx doesn't do anything and no warning or exception is present

I copied some code from here and Abotx didn't work. I tried a few things and found out that if I turned off autotuning, it works. There's no warning or exception to tell me that autotuning is a paid feature and I had to find out the hard way.

I suggest making paid features need a license through an exception message so it's very clear from the start.

Error when Initializing CrawlerX/ParallelCrawlEngine (System.FormatException)

I configured AbotX as described at http://abotx.org/Learn/Configuration, but I still get an unhandled exception of type 'System.FormatException' in mscorlib.dll:

Additional information: The datetime represented by the string is out of range.

This happens when I run the following line:
var crawler = new CrawlerX(); // Error when initializing the CrawlerX object
Here is the configuration in my app.config file:

</ ConfigSections>

</ ExtensionValues>
</ Abot>

</ Configuration>

Could you send me an example using AbotX?

Thanks

Problems with Javascript rendering (phantomjs.exe not found)

Recent changes to nuget have made some target framework installations ignore the install.ps1 of the PhantomJS 2.1.1 nuget package that AbotX relies on. The side effect is that the phantomjs.exe file does not get copied to the output directory which then fails during javascript rendering.

The workaround is to manually copy the executable from the nuget installation directory "packages\PhantomJS.2.1.1\tools\phantomjs\phantomjs.exe" to your project root and then set it as "content" and "copy if newer".

See this stackoverflow answer for more details on how to set up your solution to copy to the output directory...

Javascript rendering - detecting window.location changes

What would be your recommended way of dealing with window.location changes on the page? I'm crawling sites that have a method that looks something like the following probably to break crawlers:

function iframeOnLoad() {
  var reqUrl = 'https://domain.com/page_i_want';
  // Redirect the whole page three seconds after the iframe has loaded.
  setTimeout(function() { window.location = reqUrl; }, 3000);
}

<iframe onload="iframeOnLoad()" />

Assuming PhantomJS is rendering this, is it possible to detect URL changes when window.location is set via JS? I could maybe write some custom add-ons, but I'm not sure if this is already handled somehow.

The DateTime represented by the string is out of range

From customer...

We get the following exception “The DateTime represented by the string is out of range” (when the expiration date is being parsed as DateTime).
I saw others had the same problem in the following thread: dnauck/Portable.Licensing#26.
I tried setting the date to 200 years from now as an experiment, but then I get an exception saying that the expiration date does not match the signature.

Need to implement some type of release notes

Hey Steven.

I see all of these updates for AbotX in my NuGet package manager, but I'm having a difficult time finding the revision notes.
Am I missing something obvious, or don't you publish revision notes?

AbotX not respecting CrawlConfigurationX MaxPagesToCrawlPerDomain and MaxCrawlDepth

I'm testing this and the crawler is currently on page ~55,000 and several layers deep for one of the three domains that I am crawling on the test. The code I used to load the configuration is below. I load from the app config xml and then override some of the settings in the method to customize the crawl based on user input for specific test crawls that I'm running. The two values in question are hard coded to 1000 and 1 respectively for this test. Am I doing something wrong?

var config = AbotXConfigurationSectionHandler.LoadFromXml().Convert();
config.CrawlTimeoutSeconds = timeoutMilliseconds / 1000;
config.HttpRequestTimeoutInSeconds = timeoutMilliseconds / 1000;
config.JavascriptRenderingWaitTimeInMilliseconds = timeoutMilliseconds;
config.MaxCrawlDepth = 1; //set for testing only
config.JavascriptRenderingWaitTimeInMilliseconds = javascriptTimeout;
config.MaxPagesToCrawlPerDomain = 1000; //set for testing only
ParallelImplementationOverride impls = new ParallelImplementationOverride(config);
impls.SiteToCrawlProvider.AddSitesToCrawl(sites);
ParallelCrawlerEngine crawlEngine = new ParallelCrawlerEngine(config, impls);

Constructing wrong URLs to crawl from anchor tags without scheme

The ParallelCrawlerEngine is getting the wrong URLs to crawl. Upon checking the page in the Parent URI, I could not find where it gets the wrong URL. It's probably the <a> anchor tag without the scheme "https://"

<a href="www.thelawyermag.com/au/best-in-law/best-legal-tech-and-legal-service-providers-in-australia-and-new-zealand-service-provider-awards/467481"> 
    bla bla
</a>

Parent URI:
https://www.thelawyermag.com/au/best-in-law/best-in-law-2023/468046

Parsed Hyperlink (Wrong URL):
https://www.thelawyermag.com/au/best-in-law/best-in-law-2023/www.thelawyermag.com/au/best-in-law/best-legal-tech-and-legal-service-providers-in-australia-and-new-zealand-service-provider-awards/467481
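The URL reported above is what standard relative-reference resolution produces: an href with no scheme and no leading slash is a relative path, so it is appended to the parent's directory. Browsers resolve such a link the same way. The sketch below reproduces this with `System.Uri` and adds a hypothetical workaround; the `HrefResolver` class and its "www." heuristic are assumptions for illustration, not part of Abot or AbotX.

```csharp
using System;

public static class HrefResolver
{
    // Standard resolution: an href without a scheme is treated as a relative path.
    public static Uri ResolveStandard(Uri parent, string href)
    {
        return new Uri(parent, href);
    }

    // Hypothetical heuristic: treat hrefs starting with "www." as
    // scheme-less absolute URLs and borrow the parent's scheme.
    public static Uri ResolveWithWwwHeuristic(Uri parent, string href)
    {
        if (href.StartsWith("www.", StringComparison.OrdinalIgnoreCase))
        {
            return new Uri(parent.Scheme + "://" + href);
        }
        return new Uri(parent, href);
    }
}
```

For example, with a parent of `https://example.com/a/b/c` and an href of `www.example.com/x/1`, `ResolveStandard` yields `https://example.com/a/b/www.example.com/x/1`, while `ResolveWithWwwHeuristic` yields `https://www.example.com/x/1`. The heuristic would misfire on a site that really has a relative path beginning with "www.", so it is a trade-off rather than a fix.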

Javascript rendering even if ShouldRenderJavascript returns false

CrawlDecisionMakerX.ShouldRenderJavascript() is not virtual, so there is no way to override it. However, even if we replace it and return false, the following log output shows that JavaScript rendering is still attempted.

[2017-07-09 16:46:17,733] [4] [DEBUG] - Page [https://XXX/sitemap.xml] did not have javascript rendered, [not an html page] - [AbotLogger]
[2017-07-09 16:46:17,762] [4] [DEBUG] - Rendering javascript for page [https://XXXX/sitemap.xml] - [AbotLogger]

Robots.txt is not reloaded when uri scheme is changed (http/https)

Hello,

We found Abot a few days ago, and we are trying its free version to see if it can meet our needs.

Everything worked fine until we noticed that it crawls URLs that are disallowed in robots.txt.

After some debugging, we found that it binds robots.txt to the initial URI scheme provided with the site to crawl. For example, for the site https://mysite.com, disallowed URLs are only honored for https. If there is a link to http://mysite.com/somepage, Abot will ignore robots.txt and crawl it.

*Assuming we have the following robots.txt
User-agent: *
Disallow: /somepage

Could you help us deal with this issue?
Thank you

Incompatibility with net5.0

Consider this piece of code from AbotX readme page:

var crawlEngine = new ParallelCrawlerEngine(
                config, 
                new ParallelImplementationOverride(config, 
                    new ParallelImplementationContainer()
                    {
                        SiteToCrawlProvider = siteToCrawlProvider,
                        WebCrawlerFactory = new WebCrawlerFactory(config)//Same config will be used for every crawler
                    })
                );

It works fine on netcoreapp3.1. However, on net5.0, the line new ParallelImplementationOverride(config... raises the exception System.Security.Cryptography.CryptographicException: ASN1 corrupted data, originating from System.Security.Cryptography.Algorithms, and the execution of the code is fatally interrupted.

This makes it impossible to use the parallel crawler on .net 5.

What happened to some of the properties that were in v1?

I had a program that used v1 of abotx. When I upgraded to v2, some of the properties are giving errors because they don't exist anymore. What happened to those properties?

Like IsExternalPageLinksCrawlingEnabled, DownloadableContentTypes, MaxCrawlDepth, IsExternalPageCrawlingEnabled...

crash when stop

Hello, I get an error when trying to stop the crawl.
System.OperationCanceledException: 'The operation was canceled.'
error at
[DoesNotReturn]
private void ThrowOperationCanceledException() =>
throw new OperationCanceledException(SR.OperationCanceled, this);

Can you show me how to fix this? Thanks.
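In .NET, cancelling a token is normally reported as an `OperationCanceledException`, so seeing one while stopping is often expected rather than a crash. The sketch below shows one common way to handle it, assuming the exception surfaces from an awaited call. `LongRunningAsync` is a hypothetical placeholder for the crawl; how AbotX itself surfaces cancellation is not confirmed by this report.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CancellationDemo
{
    // Hypothetical long-running operation that observes a cancellation token.
    private static async Task LongRunningAsync(CancellationToken token)
    {
        await Task.Delay(Timeout.Infinite, token);
    }

    // Returns true when the operation ended because it was cancelled.
    public static async Task<bool> RunUntilCancelledAsync(TimeSpan cancelAfter)
    {
        using (var cts = new CancellationTokenSource(cancelAfter))
        {
            try
            {
                await LongRunningAsync(cts.Token);
                return false;
            }
            catch (OperationCanceledException)
            {
                // Expected path: cancellation is reported as an exception.
                return true;
            }
        }
    }
}
```

If the debugger is configured to break on thrown exceptions, it may stop at the throw site even though the exception is handled further up the call stack.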

crawledPage.HttpRequestException. on https://aanhangwagenspattyn.be/ while ok in browser or Postman

This site (one example of several) opens without problems in Firefox, but Abot has a problem with it.
The error message refers to an "invalid or unrecognized response".
After some more digging:
When using Postman with the exact same request, the site responds normally.
It looks as if HttpClient.SendAsync in the PageRequester causes the problem.

thanks in advance

Ghislain

Output of Abot2Demo

Did not crawl the links on page https://aanhangwagenspattyn.be/ due to Page has no content
ERR: Crawl of page failed crawledPage.HttpRequestException.InnerException = System.IO.IOException: The server returned an invalid or unrecognized response.
at System.Net.Http.HttpConnection.FillAsync()
at System.Net.Http.HttpConnection.ReadNextResponseHeaderLineAsync(Boolean foldedHeadersAllowed)
at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
Page had no content https://aanhangwagenspattyn.be/

User agent

The user agent (UA) works correctly in Abot, but it does not work correctly in AbotX.

Parallel engine not working

I am trying to test the parallel engine, but it is not working: it returns after the first page crawl. I am testing with a license and would consider upgrading if it works.

Separate crawlerX's crawl each others sites

I have a Class with public static AlwaysOnSiteToCrawlProvider _siteToCrawlProviderX = new AlwaysOnSiteToCrawlProvider(); and a function that sets up a globalCrawlEngine from a singleton.

If I step through code I can see globalCrawlEngine.parallelCrawler.CrawlerInstanceCreated getting called and all of the data is correct, however by the time PageCrawlCompletedAsync => {} gets called the SiteBag (CrawlBag) data doesn't match the url that has been crawled.

This tends to only happen after a fresh compile of the project and when two requests are made in relatively quick succession.

Any ideas?

Javascript rendering does not work when cookies are required for ajax/cors calls

Currently the cookie handling transfer from AbotX to PhantomJS fails.

Javascript rendering just returns empty html document

If JavaScript is enabled it gives me an empty html content: "\r\n" in PageCrawlCompletedEvent.

Releasing the source code for Abotx

Abotx.org website doesn't exist anymore.
If the author has no more monetization interest in abotx and providing support, how about releasing its source code?

Text issue on abotx.org website

In the example it uses :
crawler.CrawlConfigurationX.IsJavascriptRenderingEnabled = true;
crawler.CrawlConfigurationX.MaxConcurrentSiteCrawls = 1; //Only crawl a single site at a time
crawler.CrawlConfigurationX.MaxConcurrentThreads = 8;

Instead of the CrawlConfigurationX class.

AbotX.lic

The beta license file for Abotx.lic does not seem to allow me to use the javascript rendering.

[2016-02-17 21:01:09,557] [6] [FATAL] - System.UnauthorizedAccessException: Your current AbotX license does not include rendering of javascript. Please change IsJavascriptRenderingEnabled to false or upgrade your license.
at AbotX.Core.PhantomJsRenderer.IsLicensed()

But I have placed the lic file in the bin directory.

Access to the registry key 'Global' is denied.

Email received...

I'm running your latest code (1.2.44) and it runs just fine when not running under IIS. When I place my code under IIS (ASPNET Core 1.0) I get the following exception:

Access to the registry key 'Global' is denied.

This occurs when the below code is executed:

ParallelImplementationContainer implContainer = new ParallelImplementationContainer();
implContainer.SiteToCrawlProvider = siteToCrawlProvider;
implContainer.WebCrawlerFactory = abotFactory;

So is there any registry access going on in AbotX that I don't know about? Or is this something going on inside of IIS (running on Server 2012 R2)?

Cannot terminate an individual CrawlerX instance in a parallel crawl

Calling e.CrawlContext.CancellationTokenSource.Cancel() in the pageCrawlCompleted event stops all concurrent crawls, not just the one that was intended to be cancelled.

Current AbotX license does not include rendering of javascript.

Hello!
I downloaded the .lic from the repository and saved it in the root of the project.
When I try to reproduce the example from README, I get an exception:
System.UnauthorizedAccessException: Your current AbotX license does not include rendering of javascript. Please change IsJavascriptRenderingEnabled to false or upgrade your license.

Add Elapsed property on the AllCrawlsCompleted event

We would like to be able to retrieve the TimeSpan Elapsed property in the AllCrawlsCompleted event for the ParallelCrawlerEngine. Just like the other events like SiteCrawlCompleted.

For now, we're using a workaround by declaring a Stopwatch variable in the class scope which is started before crawler.StartAsync() is called. Then it is stopped inside AllCrawlsCompleted event to get the TimeSpan for the Elapsed time.
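The Stopwatch workaround described above might look like the following minimal sketch. `FakeEngine` is a hypothetical stand-in that exposes only an `AllCrawlsCompleted` event and a `StartAsync` method, so that the example is self-contained; it is not the real ParallelCrawlerEngine.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical engine exposing only a completion event.
public sealed class FakeEngine
{
    public event EventHandler AllCrawlsCompleted;

    public async Task StartAsync()
    {
        await Task.Delay(20); // simulate crawl work
        AllCrawlsCompleted?.Invoke(this, EventArgs.Empty);
    }
}

public static class ElapsedWorkaround
{
    // Starts a stopwatch before the crawl and stops it in the completion handler.
    public static async Task<TimeSpan> MeasureAsync(FakeEngine engine)
    {
        var stopwatch = new Stopwatch();
        var elapsed = TimeSpan.Zero;

        engine.AllCrawlsCompleted += (sender, args) =>
        {
            stopwatch.Stop();
            elapsed = stopwatch.Elapsed;
        };

        stopwatch.Start();
        await engine.StartAsync();
        return elapsed;
    }
}
```

The measured time includes any setup that happens between `stopwatch.Start()` and the first request, so it may differ slightly from an Elapsed value computed inside the engine.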

abotx.org SSL certificate failing

When calling crawler.Crawl, getting NullReferenceException at RenderJavascript

NullReferenceException

   at AbotX.Crawler.CrawlerX.RenderJavascript(CrawledPage crawledPage, CrawlContext crawlContext)
   at AbotX.Crawler.CrawlerX.CrawlThePage(PageToCrawl pageToCrawl)
   at Abot.Crawler.WebCrawler.ProcessPage(PageToCrawl pageToCrawl)

How would I crawl a single site with multiple pages in parallel?

Hi,

Thanks for the product!

Apologies for the many questions.

How would I crawl a single site with multiple pages in parallel?
Do I need AbotX or Abot would do?
Do I need to loop through the list of sites if I can only do 3 at a time for the free version?
Is it ideal to have this in a job that keeps track of runs?
Also, it doesn't say in which part of the code I get the crawled data. Is it in crawlEngine.SiteCrawlCompleted, after the lock(crawlCounts){...} statement?

Example

        private static async Task DemoParallelCrawlerEngine()
        {
            var siteToCrawlProvider = new SiteToCrawlProvider();
            siteToCrawlProvider.AddSitesToCrawl(new List<SiteToCrawl>
            {
                new SiteToCrawl{ Uri = new Uri("YOURSITE1") },
                new SiteToCrawl{ Uri = new Uri("YOURSITE2") },
                new SiteToCrawl{ Uri = new Uri("YOURSITE3") },
                new SiteToCrawl{ Uri = new Uri("YOURSITE4") },
                new SiteToCrawl{ Uri = new Uri("YOURSITE5") }
            });

            var config = GetSafeConfig();
            config.MaxConcurrentSiteCrawls = 3;
                
            var crawlEngine = new ParallelCrawlerEngine(
                config, 
                new ParallelImplementationOverride(config, 
                    new ParallelImplementationContainer()
                    {
                        SiteToCrawlProvider = siteToCrawlProvider,
                        WebCrawlerFactory = new WebCrawlerFactory(config)//Same config will be used for every crawler
                    })
                );                
            
            var crawlCounts = new Dictionary<Guid, int>();
            var siteStartingEvents = 0;
            var allSitesCompletedEvents = 0;
            crawlEngine.CrawlerInstanceCreated += (sender, eventArgs) =>
            {
                var crawlId = Guid.NewGuid();
                eventArgs.Crawler.CrawlBag.CrawlId = crawlId;
            };
            crawlEngine.SiteCrawlStarting += (sender, args) =>
            {
                Interlocked.Increment(ref siteStartingEvents);
            };
            crawlEngine.SiteCrawlCompleted += (sender, eventArgs) =>
            {
                lock (crawlCounts)
                {
                    crawlCounts.Add(eventArgs.CrawledSite.SiteToCrawl.Id, eventArgs.CrawledSite.CrawlResult.CrawlContext.CrawledCount);
                }
            };
            crawlEngine.AllCrawlsCompleted += (sender, eventArgs) =>
            {
                Interlocked.Increment(ref allSitesCompletedEvents);
            };

            await crawlEngine.StartAsync();
        }

Crawl through a proxy server

Hi,

I'm trying to figure out how to configure the crawler to use a proxy server/port for connecting to the destination website, but I can't find any information on that.

Is there any way of doing that?

Implementation override ignoring shortcut delegates

Hi There,

I am using AbotX, specifically the ImplementationOverride. While the Scheduler seems to be replaced, the other helper methods (ShouldScheduleLink, ShouldCrawlPage, etc.) appear to be ignored.
Is this a known issue?

var implementationOverride = new ImplementationOverride(config) {
    Scheduler = new MyScheduler(),
    ShouldScheduleLink = crawler_ShouldScheduleLink,
    ShouldCrawlPage = crawler_ShouldCrawlPage,
    ShouldDownloadPageContent = crawler_ShouldDownloadPageContent,
    ShouldCrawlPageLinks = crawler_ShouldCrawlPageLinks,
}; 
var crawler = new CrawlerX(config, implementationOverride);

AbotX produces huge amount of warnings on Linux

We are using AbotX in an application running on a containerized Ubuntu.
Almost on every page crawl, a warning is logged which reads as Cpu sampling implementation is not supported on this platform. Current implementation uses PerformanceCounter which is only valid on Windows.

Since we are logging warning-level messages too, this is making our logs useless and it's causing problems for our logging server too.

I can see that System.Diagnostics.PerformanceCounter is referenced by AbotX. Since that counter is a Windows-only API, the warnings make me suspect that something is not working as expected on Linux, which might have other consequences too.

Just to give you a feeling of what it currently looks like for us in the logs:

Please advise on what can be done about this.
