slotix / dataflowkit
Extract structured data from web sites. Web sites scraping.
Home Page: https://dataflowkit.com
License: BSD 3-Clause "New" or "Revised" License
Great work with this. I am wondering whether you have deployed multiple instances behind a load balancer. Have you found a good way to do this, e.g. Traefik, or some other Kubernetes or Docker Swarm integration?
Let's consider the following case:
Domain: http://example.com .
The robots.txt file for this domain is located at http://example.com/robots.txt and has no access restrictions.
Let's assume that we have a link like http://adv.example.com/click?item=1 to be scraped. It redirects to http://example.com/item1. For security reasons, a second file, http://adv.example.com/robots.txt, contains
User-agent: *
Disallow: /
which forbids everyone from accessing http://adv.example.com/click?item=1. But the redirect target http://example.com/item1 is open for crawling according to http://example.com/robots.txt.
To respect robots.txt we have to parse it BEFORE downloading the corresponding page. But following the rules listed in http://adv.example.com/robots.txt prevents us from reaching the final redirected page http://example.com/item1: fetching stops and returns the error "Forbidden by robots.txt".
So the only solution that comes to my mind is to download the page, build the robots.txt link from the final redirected response, and check whether processing that page is allowed by that robots.txt.
Please have a look at robotstxt.mw.go
func (mw robotstxtMiddleware) Fetch(req interface{}) (output interface{}, err error) {}
Please share your ideas about the most elegant solution.
Allow users to get result files in GZip format.
Solution: add a 'compressor' field to the Payload that specifies the compression method: 'gz' for GZip, and so on.
The result file's extension should be 'gz' if a 'compressor' is applied, and the native 'format' extension if not.
type Payload struct {
	Compressor string
}
Hi,
I am trying to deploy Dataflowkit locally and am getting some errors. Could you point me in the right direction? Thanks.
go get -u github.com/slotix/dataflowkit
package github.com/slotix/dataflowkit: no Go files in /opt/go/gopath/src/github.com/slotix/dataflowkit1.
./build_docker_images.sh
rm -f parse.d
CGO_ENABLED=0 \
GOOS=linux GOARCH=amd64 \
go build \
-ldflags "-s -w -X main.Release=1.0.0 \
-X main.Commit=096cadd -X main.BuildTime=2020-02-12_13:48:38" \
-a -installsuffix cgo \
-o parse.d
docker build -t slotix/dfk-parse:1.0.0 .
Sending build context to Docker daemon 13.34MB
Step 1/5 : FROM alpine:latest
---> e7d92cdc71fe
Step 2/5 : RUN apk update && apk add ca-certificates && rm -rf /var/cache/apk/*
---> Running in 82e93e366823
fetch http://dl-cdn.alpinelinux.org/alpine/v3.11/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.11/community/x86_64/APKINDEX.tar.gz
v3.11.3-59-gf70c7aa335 [http://dl-cdn.alpinelinux.org/alpine/v3.11/main]
v3.11.3-60-gb3a10d424a [http://dl-cdn.alpinelinux.org/alpine/v3.11/community]
OK: 11262 distinct packages available
(1/1) Installing ca-certificates (20191127-r1)
Executing busybox-1.31.1-r9.trigger
Executing ca-certificates-20191127-r1.trigger
OK: 6 MiB in 15 packages
Removing intermediate container 82e93e366823
---> 4d80a7286186
Step 3/5 : COPY parse.d /
---> 286d8b08db13
Step 4/5 : EXPOSE 8002
---> Running in a5e72b1c1871
Removing intermediate container a5e72b1c1871
---> 736529c4a6f9
Step 5/5 : ENTRYPOINT ./parse.d
---> Running in 1172c19b8c11
Removing intermediate container 1172c19b8c11
---> 447c06723384
Successfully built 447c06723384
Successfully tagged slotix/dfk-parse:1.0.0
docker tag slotix/dfk-parse:1.0.0 slotix/dfk-parse:latest
docker push slotix/dfk-parse:1.0.0
The push refers to repository [docker.io/slotix/dfk-parse]
ad3827963233: Preparing
3b36fe6a41bd: Preparing
5216338b40a7: Preparing
denied: requested access to the resource is denied
make: *** [Makefile:30: push] Error 1
rm -f fetch.d
CGO_ENABLED=0 \
GOOS=linux GOARCH=amd64 \
go build \
-ldflags "-s -w -X main.Release=1.0.0 \
-X main.Commit=096cadd -X main.BuildTime=2020-02-12_13:48:53" \
-a -installsuffix cgo \
-o fetch.d
docker build -t slotix/dfk-fetch:1.0.0 .
Sending build context to Docker daemon 14.37MB
Step 1/5 : FROM alpine:latest
---> e7d92cdc71fe
Step 2/5 : RUN apk update && apk add ca-certificates && rm -rf /var/cache/apk/*
---> Using cache
---> 4d80a7286186
Step 3/5 : COPY fetch.d /
---> 4486675434a9
Step 4/5 : EXPOSE 8000
---> Running in b38c5b5be50c
Removing intermediate container b38c5b5be50c
---> a8f18189e717
Step 5/5 : ENTRYPOINT ./fetch.d
---> Running in 800cb67a52a1
Removing intermediate container 800cb67a52a1
---> 145c6d53aa06
Successfully built 145c6d53aa06
Successfully tagged slotix/dfk-fetch:1.0.0
docker tag slotix/dfk-fetch:1.0.0 slotix/dfk-fetch:latest
docker push slotix/dfk-fetch:1.0.0
The push refers to repository [docker.io/slotix/dfk-fetch]
6134c7560bc9: Preparing
3b36fe6a41bd: Preparing
5216338b40a7: Preparing
denied: requested access to the resource is denied
make: *** [Makefile:29: push] Error 1
rm -f testserver
CGO_ENABLED=0 \
GOOS=linux GOARCH=amd64 \
go build \
-ldflags "-s -w -X main.Release=1.0.0 \
-X main.Commit=096cadd -X main.BuildTime=2020-02-12_13:49:07" \
-a -installsuffix cgo \
-o testserver
docker build -t slotix/dfk-testserver:1.0.0 .
Sending build context to Docker daemon 9.451MB
Step 1/6 : FROM alpine:latest
---> e7d92cdc71fe
Step 2/6 : RUN apk update && apk add ca-certificates && rm -rf /var/cache/apk/*
---> Using cache
---> 4d80a7286186
Step 3/6 : COPY testserver /
---> 11023bdec3be
Step 4/6 : COPY web /web
---> 792c372decae
Step 5/6 : EXPOSE 12345
---> Running in 68552375095d
Removing intermediate container 68552375095d
---> 92c4a2fb6531
Step 6/6 : ENTRYPOINT ./testserver
---> Running in e86fad37fd33
Removing intermediate container e86fad37fd33
---> 78aaa71181d3
Successfully built 78aaa71181d3
Successfully tagged slotix/dfk-testserver:1.0.0
docker tag slotix/dfk-testserver:1.0.0 slotix/dfk-testserver:latest
docker push slotix/dfk-testserver:1.0.0
The push refers to repository [docker.io/slotix/dfk-testserver]
35457e462bc2: Preparing
3db1ec4fc644: Preparing
3b36fe6a41bd: Preparing
5216338b40a7: Preparing
denied: requested access to the resource is denied
make: *** [Makefile:29: push] Error 1
Is your feature request related to a problem? Please describe.
Here are some use cases of using JSON lines:
Describe the solution you'd like
Add new parameter here
type JSONEncoder struct {
	JSONLines bool
}
Implement encoding to JSON Lines in the function
func (e JSONEncoder) encode(ctx context.Context, w *bufio.Writer, payloadMD5 string, keys *map[int][]int) error {}
How can I use a proxy to avoid IP blocking?
Please give an example or a guide for Docker or standalone deployment.
I read it was used for this.
Is the script public?
I want to get an idea of a production example and any issues that come up.
Great toolkit and really useful in golang.
I would like to scrape a website behind a login form (e.g. http://quotes.toscrape.com/login). Is Dataflowkit able to submit forms and keep session information during scraping? If yes, then how?
Currently a Parse Task returns little information besides the output file path.
It should also report extra details such as request counts by type (initial, paginator, details), response count, error count, and elapsed time.
If STORAGE_TYPE is MongoDB, we need to check whether the MongoDB server is alive.
So we should implement the healthcheck interface for MongoDB.
Most websites return the same result whether they are fetched via a browser or by direct download. Can you add an option to bypass the browser fetch?
Will the UI available for toscrape data be published as a general service? Without that UI the whole service is... well, useful only to some extent...