
dotnetspider's Introduction

DotnetSpider

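Disclaimer: This framework is intended to help developers simplify their development workflow and improve efficiency. Do not use it for anything that violates national law. The authors of the framework bear no responsibility for anything users do with it.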

Build Status NuGet Member project of .NET Core Community GitHub license

DotnetSpider is a .NET Standard web crawling library: a lightweight, efficient, and fast high-level web crawling and scraping framework.

To get the latest beta packages, add the MyGet feed:

<add key="myget.org" value="https://www.myget.org/F/zlzforever/api/v3/index.json" protocolVersion="3" />

DESIGN

[Image: design diagram]

DEVELOPMENT ENVIRONMENT

  1. Visual Studio 2017 (15.3 or later) or JetBrains Rider

  2. .NET Core 2.2 or later

  3. Docker

  4. MySQL

     docker run --name mysql -d -p 3306:3306 --restart always -e MYSQL_ROOT_PASSWORD=1qazZAQ! mysql:5.7
    
  5. Redis (optional)

     docker run --name redis -d -p 6379:6379 --restart always redis
    
  6. SQL Server

     docker run --name sqlserver -d -p 1433:1433 --restart always  -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=1qazZAQ!' mcr.microsoft.com/mssql/server:2017-latest
    
  7. PostgreSQL (optional)

     docker run --name postgres -d  -p 5432:5432 --restart always -e POSTGRES_PASSWORD=1qazZAQ! postgres
    
  8. MongoDB (optional)

     docker run --name mongo -d -p 27017:27017 --restart always mongo
    
  9. RabbitMQ

    docker run -d --restart always --name rabbitmq -p 4369:4369 -p 5671-5672:5671-5672 -p 25672:25672 -p 15671-15672:15671-15672 \
           -e RABBITMQ_DEFAULT_USER=user -e RABBITMQ_DEFAULT_PASS=password \
           rabbitmq:3-management
    
  10. Docker remote API for Mac

    docker run -d  --restart always --name socat -v /var/run/docker.sock:/var/run/docker.sock -p 2376:2375 bobrik/socat TCP4-LISTEN:2375,fork,reuseaddr UNIX-CONNECT:/var/run/docker.sock
    
  11. HBase

    docker run -d --restart always --name hbase -p 20550:8080 -p 8085:8085 -p 9090:9090 -p 9095:9095 -p 16010:16010 dajobe/hbase
    

MORE DOCUMENTS

https://github.com/dotnetcore/DotnetSpider/wiki

SAMPLES

See the DotnetSpider.Sample project in the solution.

BASIC USAGE

Basic usage code

ADDITIONAL USAGE: Configurable Entity Spider

View complete code

public class EntitySpider : Spider
{
    public EntitySpider(IOptions<SpiderOptions> options, SpiderServices services, ILogger<Spider> logger) : base(
        options, services, logger)
    {
    }

    #region Nested type: CnblogsEntry

    [Schema("cnblogs", "news")]
    [EntitySelector(Expression = ".//div[@class='news_block']", Type = SelectorType.XPath)]
    [GlobalValueSelector(Expression = ".//a[@class='current']", Name = "类别", Type = SelectorType.XPath)]
    [FollowRequestSelector(XPaths = new[]
    {
        "//div[@class='pager']"
    })]
    public class CnblogsEntry : EntityBase<CnblogsEntry>
    {
        public int Id { get; set; }

        [Required]
        [StringLength(200)]
        [ValueSelector(Expression = "类别", Type = SelectorType.Environment)]
        public string Category { get; set; }

        [Required]
        [StringLength(200)]
        [ValueSelector(Expression = "网站", Type = SelectorType.Environment)]
        public string WebSite { get; set; }

        [StringLength(200)]
        [ValueSelector(Expression = "//title")]
        [ReplaceFormatter(NewValue = "", OldValue = " - 博客园")]
        public string Title { get; set; }

        [StringLength(40)]
        [ValueSelector(Expression = "GUID", Type = SelectorType.Environment)]
        public string Guid { get; set; }

        [ValueSelector(Expression = ".//h2[@class='news_entry']/a")]
        public string News { get; set; }

        [ValueSelector(Expression = ".//h2[@class='news_entry']/a/@href")]
        public string Url { get; set; }

        [ValueSelector(Expression = ".//div[@class='entry_summary']")]
        public string PlainText { get; set; }

        [ValueSelector(Expression = "DATETIME", Type = SelectorType.Environment)]
        public DateTime CreationTime { get; set; }

        protected override void Configure()
        {
            HasIndex(x => x.Title);
            HasIndex(x => new
            {
                x.WebSite,
                x.Guid
            }, true);
        }
    }

    #endregion

    public static async Task RunAsync()
    {
        var builder = Builder.CreateDefaultBuilder<EntitySpider>();
        builder.UseSerilog();
        builder.UseQueueDistinctBfsScheduler<HashSetDuplicateRemover>();
        await builder.Build()
            .RunAsync();
    }

    protected override async Task InitializeAsync(CancellationToken stoppingToken)
    {
        AddDataFlow(new DataParser<CnblogsEntry>());
        AddDataFlow(GetDefaultStorage());
        await AddRequestsAsync(new Request("https://news.cnblogs.com/n/page/1/", new Dictionary<string, string>
        {
            {
                "网站", "博客园"
            }
        }), new Request("https://news.cnblogs.com/n/page/2/", new Dictionary<string, string>
        {
            {
                "网站", "博客园"
            }
        }));
    }

    protected override (string Id, string Name) GetIdAndName()
    {
        return (ObjectId.NewId.ToString(), "博客园");
    }
}

Distributed spider

Read this document

Puppeteer downloader

Coming soon

NOTICE

When you use the Redis scheduler, update your Redis config:

timeout 0
tcp-keepalive 60

Dependencies

Package | License
Bert.RateLimiters | Apache 2.0
MessagePack | MIT
Newtonsoft.Json | MIT
Dapper | Apache 2.0
HtmlAgilityPack | MIT
ZCJ.HashedWheelTimer | MIT
murmurhash | Apache 2.0
Serilog.AspNetCore | Apache 2.0
Serilog.Sinks.Console | Apache 2.0
Serilog.Sinks.RollingFile | Apache 2.0
Serilog.Sinks.PeriodicBatching | Apache 2.0
MongoDB.Driver | Apache 2.0
MySqlConnector | MIT
AutoMapper.Extensions.Microsoft.DependencyInjection | MIT
Docker.DotNet | MIT
BuildBundlerMinifier | Apache 2.0
Pomelo.EntityFrameworkCore.MySql | MIT
Quartz.AspNetCore | Apache 2.0
Quartz.AspNetCore.MySqlConnector | Apache 2.0
Npgsql | PostgreSQL License
RabbitMQ.Client | Apache 2.0
Polly | BSD 3-Clause

AREAS FOR IMPROVEMENT

QQ Group: 477731655 Email: [email protected]

dotnetspider's People

Contributors

ananck, capadong, gtxck, hajiuxbz, jeffward01, rangi376w, velka-dev, zlzforever


dotnetspider's Issues

How should login be handled after v3.0?

In 1.x you could use this:
spider.Downloader = new WebDriverDownloader(Browser.Chrome, new Option()
{
    Login = new LoginHandler()
    {
        Url = "https://wstj.bjchfp.gov.cn/apex/f?p=701:LOGIN_DESKTOP:5475297590252",
        UserSelector = new Selector() { Type = SelectorType.Css, Expression = "#formlogin input[name='userID']" },
        PassSelector = new Selector() { Type = SelectorType.Css, Expression = "#formlogin input[name='password']" },
        SubmitSelector = new Selector() { Type = SelectorType.Css, Expression = "#formlogin input[type='submit']" },
        User = "400820454-C",
        Password = "XWZY0454-"
    }
});
What about 3.x and later?
How is the WebDriverCommonCookieInjector interface used? Could you provide documentation that explains it clearly?

How to schedule the spider to run a daily job at 1 a.m., and is there a duplicate content check?

Hi,
I need to run the spider every day at 1 a.m. or at some other specific time. Is there a scheduler available for this?
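
A framework-independent way to get a fixed daily run time can be sketched with only the .NET base class library. The names `DailySchedule`, `DelayUntilNext`, and `RunDailyAsync` are hypothetical and are not part of DotnetSpider; the `job` delegate is assumed to start the spider.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not a DotnetSpider API.
public static class DailySchedule
{
    // Returns how long to wait from `now` until the next occurrence of `timeOfDay`.
    // If `now` is exactly at or past today's slot, the next slot is tomorrow.
    public static TimeSpan DelayUntilNext(DateTime now, TimeSpan timeOfDay)
    {
        DateTime next = now.Date + timeOfDay;
        if (next <= now)
        {
            next = next.AddDays(1);
        }
        return next - now;
    }

    // Runs `job` once a day at `timeOfDay` (local time) until the token is cancelled.
    // Task.Delay throws TaskCanceledException when the token is cancelled while waiting.
    public static async Task RunDailyAsync(TimeSpan timeOfDay, Func<Task> job, CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            await Task.Delay(DelayUntilNext(DateTime.Now, timeOfDay), token);
            await job();
        }
    }
}
```

For a 1 a.m. run, the call would be `DailySchedule.RunDailyAsync(TimeSpan.FromHours(1), job, token)`. An operating-system scheduler (cron, Windows Task Scheduler) that launches the process is an equally valid choice and survives process restarts.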

My other question: is there a duplicate content check? For example, I crawl www.abc.com/aa.html every day for the XPath '/html/body/div[3]/div/div[2]/section'. If the content at '/html/body/div[3]/div/div[2]/section' is exactly the same as in my last crawl, I want to ignore it.

Thank you.
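
One way to approximate such a content check outside the framework is to hash the extracted text and compare it with the hash from the previous crawl. This is a minimal sketch: `ContentChangeDetector` is a hypothetical class, not a DotnetSpider API, and it keeps hashes in memory only, so a daily job would need to persist them (for example in a database table) between runs.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper, not a DotnetSpider API.
public sealed class ContentChangeDetector
{
    // Maps a key (for example URL plus XPath) to the hash of the last content seen.
    private readonly Dictionary<string, string> _lastHashByKey = new Dictionary<string, string>();

    // Returns true when the content differs from the last content seen for this key
    // (or when the key is new), and records the new hash.
    public bool HasChanged(string key, string content)
    {
        string hash = ComputeHash(content);
        if (_lastHashByKey.TryGetValue(key, out string previous) && previous == hash)
        {
            return false;
        }
        _lastHashByKey[key] = hash;
        return true;
    }

    private static string ComputeHash(string content)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] bytes = sha.ComputeHash(Encoding.UTF8.GetBytes(content ?? string.Empty));
            return BitConverter.ToString(bytes);
        }
    }
}
```

A pipeline step could call `HasChanged(url + xpath, extractedText)` and skip storage when it returns false.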

Bug in HttpClientDownloader

In the DowloadContent method of HttpClientDownloader:
// TODO: in proxy mode, reconsider request.DownloaderGroup
var proxy = spider.Site.HttpProxyPool.GetProxy();
request.Proxy = proxy;
httpClientItem = HttpClientPool.GetHttpClient(spider, this, CookieContainer, proxy?.GetHashCode(), CookieInjector);
httpClientItem.Handler.Proxy = httpClientItem.Handler.Proxy ?? proxy;

The problem is this line: "httpClientItem.Handler.Proxy = httpClientItem.Handler.Proxy ?? proxy;"

If the HttpClient instance is reused, httpClientItem.Handler.Proxy cannot be modified.
It throws an exception: "This instance has already started one or more requests. Properties can only be modified before sending the first request."
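
The exception comes from `HttpClientHandler`, which rejects property changes once a request has been sent. A common workaround, sketched here under the assumption that one client per proxy address is acceptable, is to set the proxy when the handler is created and cache the client by proxy address. `PerProxyHttpClientCache` is a hypothetical class and is not how DotnetSpider's HttpClientPool is implemented.

```csharp
using System;
using System.Collections.Concurrent;
using System.Net;
using System.Net.Http;

// Hypothetical helper, not a DotnetSpider API.
public sealed class PerProxyHttpClientCache
{
    private readonly ConcurrentDictionary<string, HttpClient> _clients =
        new ConcurrentDictionary<string, HttpClient>();

    // Returns the cached client for this proxy, creating it on first use.
    // A null proxyUri means "no explicit proxy".
    public HttpClient GetClient(Uri proxyUri)
    {
        string key = proxyUri == null ? "direct" : proxyUri.AbsoluteUri;
        // Under contention GetOrAdd may run the factory more than once;
        // only one result is kept and handed out.
        return _clients.GetOrAdd(key, _ => Create(proxyUri));
    }

    private static HttpClient Create(Uri proxyUri)
    {
        var handler = new HttpClientHandler();
        if (proxyUri != null)
        {
            // The proxy is assigned before any request is sent and never changed afterwards.
            handler.Proxy = new WebProxy(proxyUri);
            handler.UseProxy = true;
        }
        return new HttpClient(handler);
    }
}
```

Because the handler's `Proxy` is never touched after creation, reusing a client for later requests does not hit the "already started one or more requests" check.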

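Could you publish releases?

Could you publish a release whenever a NuGet package is published, so that the code for each version can be seen? Right now it is not easy to find the source for a particular NuGet version.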

Data is inserted twice when fetching data

Using AddEntityType();
to fetch the content of all p elements under busstoplist
http://wapapp.dy4g.cn/bus/auto/test.php?t=linhtml&busline=1
[Image: 123]

        /// <summary>
        /// Fetches bus station information
        /// </summary>
        [Schema("dybus", "BusStation")]
        [Entity(Expression = ".//div[@class='busstoplist']/div//p", Type = SelectorType.XPath)]
        class BusStation : BaseEntity
        {
            /// <summary>
            /// Bus line information
            /// </summary>
            [Column]
            [Field(Expression = "Keyword", Type = SelectorType.Enviroment)]
            public string Keyword { get; set; }

            /// <summary>
            /// Unique station ID
            /// </summary>
            [Column]
            [Field(Expression = "./@Id")]
            public string BusStationId { get; set; }

            /// <summary>
            /// Route number of the bus line
            /// </summary>
            [Column]
            [Field(Expression = "./strong/text()")]
            public string StationNumber { get; set; }

            /// <summary>
            /// Station name
            /// </summary>
            [Column]
            [Field(Expression = "./span/text()")]
            public string Name { get; set; }

            /// <summary>
            /// Station direction
            /// </summary>
            [Column]
            [Field(Expression = "../@class")]
            public string BusDirection { get; set; }
        }

[Image: 2222]

The data can be fetched,
but each fetched record is inserted twice.

Could you take a look and tell me whether I am using it incorrectly?

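Requesting documentation and examples

There is no documentation, which makes the library painful to use, and the examples seem incomplete too. Could the maintainers please sort this out?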

A cookie was added to Headers, but it is never sent in the request headers

var site = new Site
{
    CycleRetryTimes = 1,
    SleepTime = 200,
    Headers = new Dictionary<string, string>()
    {
        { "Accept", "*/*" },
        { "Referer", "https://ad.tt.com/login/" },
        { "Cookie", "tt_webid=6582711285758166536" },
        { "Connection", "keep-alive" },
        { "Content-Type", "application/x-www-form-urlencoded" },
        { "User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36" }
    }
};

The bundled sample fails to run

Is the DotnetSpider library being updated so fast that the bundled samples can no longer keep up?

After starting the AfterDownloadCompleteHandlerSpider sample:
protected override void OnInit(params string[] arguments)
{
    AddRequest($"http://api.search.sina.com.cn/?c=news&t=&q=赵丽颖&pf=2136012948&ps=2130770082&page=0&stime={DateTime.Now.AddYears(-7).AddDays(-1).ToString("yyyy-MM-dd")}&etime={DateTime.Now.AddDays(1).ToString("yyyy-MM-dd")}&sort=rel&highlight=1&num=10&ie=utf-8&callback=jQuery1720001955628746606708_1508996230766&_=1508996681484", new Dictionary<string, dynamic> { { "keyword", "赵丽颖" } });
    AddPipeline(new ConsoleEntityPipeline());
    Downloader.AddAfterDownloadCompleteHandler(new ReplaceHandler());
    AddEntityType();
}

Downloader is null, which causes a runtime error. Could you update the sample?

Could you please add some comments?

Could you please add comments to classes, methods, and properties, so that I can hover over them in Visual Studio and get a rough idea of what each one is used for?

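The "first simple spider" example throws an error

1. The example at https://github.com/dotnetcore/DotnetSpider/wiki/1.-第一个简单的爬虫 fails with: "Download https://github.com/zlzforever failed: One or more errors occurred."
Environment: net45, DotnetSpider2.Core.2.4.4.
2. How do I crawl pages that take search parameters? For example, on http://list.youku.com/category/show/c_96_s_1_d_1_p_{i}.html, with the search terms "后来的我们", "少林足球", and "羞羞的铁拳", how can I cleanly crawl these three pages?
3. Does this component switch IPs automatically while crawling? How do I switch IPs?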

NuGet-Packages

Hi
I have a question regarding NuGet packages. Which projects are intended to be published as NuGet packages? Only DotnetSpider.Core?
Best
f

MySQL database

I found that running the project's examples requires a database. Could you upload the database?

408 errors start appearing at 100 threads

100M dedicated telecom line,
2.6 KB per request.

With 100 threads, requests already time out.

Left 0 Success 6100 Error 0 Total 5973 Dowload 109 Extract 0 Pipeline 15367

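Best practices

Hello, I would like to ask about best practices for using the DotnetSpider framework.

Scenario: I need to collect all first-level category links from a website's home page, build a new second-level link from each first-level link, and then crawl data from all of the second-level links.

Problem: I created a Spider and, inside it, a Processor that crawls all first-level links on the home page. How do I hand those first-level links directly to a new Spider that runs the new crawl?

    Any pointers on best practices would be much appreciated. Thank you.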

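Two pipelines fail to read data

At line 66 of JsonFileEntityPipeline, calling entry.ToString() directly returns no data.
[Image: json]

At line 95 of ExcelEntityPipeline, data[column] returns no data, because data is an entity class instance.
[Image: excel]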

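Cannot generate new URLs from the start page

If no ResultItem is obtained from a page,
do the pages added with AddTargetRequest have no effect? Is this a bug?

A page does not necessarily produce results, but it can still yield new URLs.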

It should not be netcoreapp1.0

The class library's project.json should not target netcoreapp1.0. It should target the netstandard1.6 platform standard.

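Is it possible to fetch a page after its AJAX content has loaded?

The page only has content after its AJAX calls finish, and I need to crawl it once the AJAX rendering is complete. Is this supported? If so, how exactly should it be configured?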

New Package for last commits (NLog/Serilog)

Hi. First of all: cool project!
When is the next NuGet package update planned? I would really love to use my own logger, which is not possible with the current version but should be with the next one (commit ae9bb7e) :)
thanks

Sample not working

When trying to run the BaseUsage sample, I get an error in the CheckIfSettingsCorrect() method in Spider.cs.

if (Site.RemoveOutboundLinks && (Site.Domains == null || Site.Domains.Length == 0))
{
    throw new SpiderException($"When you want remove outbound links, the domains should not be null or empty.");
}

I guess the Domains property is required when RemoveOutboundLinks is true, but I don't know what the purpose of that property is.

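Is the logic of IsAutoIncrementPrimary in TableInfo wrong?

In TableInfo:

internal bool IsAutoIncrementPrimary => Primary.Count == 1 && Columns.Count(f => f.DataType == DataType.Int || f.DataType == DataType.Long) == 1;

This determines whether the database table should use an auto-increment primary key. The current logic is:

If there is exactly one primary key and the table has exactly one integer column, that primary key is auto-increment.
This has the following problems:

  • When the user does not want an auto-increment primary key, should it still be forced?
  • If my only primary key is of type string, does auto-increment still need to be set? (MySQL reports an error directly.)
    Clearly the original design intent was to set auto-increment only when there is a single primary key and that key is an integer.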
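
The design intent described in this issue (auto-increment only when the single primary key column is itself an integer) could be sketched as follows. The `Column` class, the `DataType` enum, and `PrimaryKeyRules` are hypothetical stand-ins defined here for illustration; they are not the library's actual `TableInfo` types.

```csharp
using System.Collections.Generic;

// Hypothetical stand-ins for the library's column metadata.
public enum DataType { Int, Long, String }

public sealed class Column
{
    public string Name { get; set; }
    public DataType DataType { get; set; }
}

public static class PrimaryKeyRules
{
    // Auto-increment applies only when there is exactly one primary key column
    // and that column itself is an integer type. Other integer columns in the
    // table no longer influence the decision.
    public static bool IsAutoIncrementPrimary(IReadOnlyList<Column> primary)
    {
        return primary.Count == 1
            && (primary[0].DataType == DataType.Int || primary[0].DataType == DataType.Long);
    }
}
```

Under this rule a single string primary key never becomes auto-increment. Letting users opt out explicitly, the first problem raised, would still need a separate setting.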

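A question about DefaultProxyValidator

I scraped free proxies from the web. When validation runs var host = Dns.GetHostEntry(httpProxy.Host), the vast majority of the proxies throw an exception, yet most of them are actually usable. Consider changing the validation approach, for example by requesting http://httpbin.org/ip directly through the proxy.
[Image: tempsnip]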

Two errors in DbRequestBuilder

While using the DbRequestBuilder class I ran into two errors. Could the author confirm them?

  1. In the QueryDatas method, the cast var dataItem = item as Dictionary<string, dynamic> fails and returns null. Fix: change it to var dataItem = (item as IDictionary<string, dynamic>).ToDictionary(kvp => kvp.Key, kvp => kvp.Value)
  2. After Build is called, the generated Requests are not added to _requests
    [Image: 1]
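
The first error is the usual behavior of `as` when an object implements `IDictionary<string, object>` without being a concrete `Dictionary<string, object>`. The sketch below illustrates it with `ExpandoObject` as a stand-in row type; what type the rows in `QueryDatas` actually have is an assumption here, and `RowConversion` is a hypothetical helper.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not a DotnetSpider API.
public static class RowConversion
{
    // Converts a row object to a Dictionary, or returns null when the object
    // does not expose a string-keyed dictionary interface at all.
    public static Dictionary<string, object> ToDictionary(object item)
    {
        // Already the concrete type: no copy needed.
        if (item is Dictionary<string, object> exact)
        {
            return exact;
        }
        // Implements the interface only: copy the entries into a new Dictionary.
        if (item is IDictionary<string, object> map)
        {
            return map.ToDictionary(kvp => kvp.Key, kvp => kvp.Value);
        }
        return null;
    }
}
```

At run time `Dictionary<string, dynamic>` and `Dictionary<string, object>` are the same type, so the same reasoning applies to the cast quoted in the issue.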

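EntitySpider: how should one-to-many relationships be handled?

Suppose an article has several authors and each author has a picture. How should the field type be defined in EntitySpider?

The demos currently seem to use only single-valued types.
[PropertyDefine(Expression = ".//div[@Class='p-name']/a/em", Length = 100)]
public string Name { get; set; }

Can something like this be supported?
[PropertyDefine(Expression = ".//div[@Class='p-name']/a/em", Length = 100)]
public List<string> Name { get; set; }

Alternatively, being able to define a callback function that runs after each field is parsed successfully would also work.
Thanks!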

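Could the documentation be updated for DotnetSpider.Core?

First, thank you for open-sourcing this library. After reading the issues I found that DotnetSpider.Core2 is not the latest version, so I switched to DotnetSpider.Core. However, all of the documentation is for DotnetSpider.Core2, and the comments in DotnetSpider.Core are less complete than those in DotnetSpider.Core2.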

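waitCount causes the Spider to finish early

The Spider runs two threads and has a single entry URL. The first thread takes the URL and processes it while the second thread loops, waiting. If that URL happens to take a long time to process, the second thread waits waitCount times and then sets the Spider's status to Finished, even though the first URL is still being processed.

So shouldn't the check be: only once all threads are idle, wait waitCount times and then treat the spider as finished?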
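
The proposed check can be sketched as a small counter that only starts counting empty polls when no worker is busy. `IdleTracker` and its members are hypothetical names for illustration and do not reflect how the Spider class is actually implemented.

```csharp
using System.Threading;

// Hypothetical helper, not a DotnetSpider API.
public sealed class IdleTracker
{
    private readonly int _waitCount;
    private int _activeWorkers;
    private int _idlePolls;

    public IdleTracker(int waitCount)
    {
        _waitCount = waitCount;
    }

    // Called by a worker when it takes a request from the queue.
    public void BeginWork()
    {
        Interlocked.Increment(ref _activeWorkers);
        Interlocked.Exchange(ref _idlePolls, 0);
    }

    // Called by a worker when it has finished processing a request.
    public void EndWork()
    {
        Interlocked.Decrement(ref _activeWorkers);
    }

    // Called by a worker that found the queue empty. Returns true only after
    // waitCount consecutive empty polls during which no worker was busy.
    public bool ShouldFinish()
    {
        if (Volatile.Read(ref _activeWorkers) > 0)
        {
            // Someone is still processing and may add new requests: restart the count.
            Interlocked.Exchange(ref _idlePolls, 0);
            return false;
        }
        return Interlocked.Increment(ref _idlePolls) >= _waitCount;
    }
}
```

With this counter, a slow first URL keeps `ShouldFinish` returning false for the waiting thread, no matter how many times it polls.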

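Using a proxy fails with: "This instance has already started one or more requests. Properties can only be modified before sending the first request."

When a proxy is used, crawling the next link fails with: "This instance has already started one or more requests. Properties can only be modified before sending the first request." I see that the proxy class implements the IDisposable interface. Is the error caused by resources not being released? Where should they be released? My proxy class is as follows: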

public class HttpProxyPool : IHttpProxyPool
    {
        public void Dispose()
        {           
        }

        public UseSpecifiedUriWebProxy GetProxy()
        {
            var uri = new Uri("http://125.126.162.105:45504");
            return new UseSpecifiedUriWebProxy(uri);           
        }
        public void ReturnProxy(UseSpecifiedUriWebProxy proxy, HttpStatusCode statusCode)
        {            
        }
    }

Code that sets the proxy:
site.HttpProxyPool = new HttpProxyPool();

PageProcessor: dynamically added URLs are not parsed unless AddResultItem is called

By design, when ResultItem is empty, dynamically added URLs are not put into the parse queue by default. To change this, configure the spider's SkipTargetRequestsWhenResultIsEmpty.
Example

Spider spider = Spider.Create(
	new QueueDuplicateRemovedScheduler(),
	new xxxProcessor(),
	new yyyProcessor()).
	AddPipeline(new MyPipeline());
// Add the initial crawl URL
spider.AddRequests("xxxxx");
// Do not skip target requests when ResultItem is empty
spider.SkipTargetRequestsWhenResultIsEmpty = false; // default is true
// Start the spider
spider.Run();
