ulixee / secret-agent

The web scraper that's nearly impossible to block - now called @ulixee/hero

Home Page: https://secretagent.dev

License: MIT License

Dockerfile 0.12% TypeScript 89.30% JavaScript 3.02% Shell 0.09% Go 1.26% HTML 0.58% CSS 0.05% Vue 4.34% SCSS 1.22%
scraping automated browser chromium secretagent proxy mitm mitmproxy stealth playwright

secret-agent's Introduction

SecretAgent 2.0 === Hero

🔔 SecretAgent 2.0 is named "Hero" and is currently in alpha testing. We're ready for developers to begin switching over; it's a pretty easy transition (migration guide). Follow along with development of Hero here or check out the latest npm packages @ulixee/hero-playground.


SecretAgent

SecretAgent is a web browser that's built for scraping.

  • Built for scraping - it's the first modern headless browser designed specifically for scraping instead of just automated testing.
  • Designed for web developers - We've recreated a fully compliant DOM directly in NodeJS, allowing you to bypass the headaches of previous scraper tools.
  • Powered by Chrome - The powerful Chrome engine sits under the hood, allowing for lightning fast rendering.
  • Emulates any modern browser - BrowserEmulators make it easy to disguise your script as practically any browser.
  • Avoids detection along the entire stack - Don't be blocked because of TLS fingerprints in your networking stack.

Check out our website for more details.

Installation

npm i --save secret-agent

or

yarn add secret-agent

Usage

SecretAgent provides access to the W3C DOM specification without the need for Puppeteer's complicated evaluate callbacks and multi-context switching:

const agent = require('secret-agent');

(async () => {
  await agent.goto('https://example.org');
  const title = await agent.document.title;
  const intro = await agent.document.querySelector('p').textContent;
  await agent.close();
})();

Browse the full API docs.

Contributing

We'd love your help in making SecretAgent a better tool. Please don't hesitate to send a pull request.

License

MIT

secret-agent's People

Contributors

blakebyrnes, calebjclark, daleevans, dependabot[bot], jmannanc

secret-agent's Issues

can't run the example in node.js 12.14.1

My node.js version is 12.14.1. I just tried to install it and run the example script from the documentation, but I got this error.

2020-07-22T02:37:00.770Z ERROR UnhandledClientError {
  clientError: TypeError: SecretAgent.createBrowser is not a function
      at /media/mine/Work/fiverr/keadam/index.js:4:37
      at Object.<anonymous> (/media/mine/Work/fiverr/keadam/index.js:9:3)
      at Module._compile (internal/modules/cjs/loader.js:955:30)
      at Object.Module._extensions..js (internal/modules/cjs/loader.js:991:10)
      at Module.load (internal/modules/cjs/loader.js:811:32)
      at Function.Module._load (internal/modules/cjs/loader.js:723:14)
      at Function.Module.runMain (internal/modules/cjs/loader.js:1043:10)
      at internal/main/run_main_module.js:17:11
}

Polyfills need to adapt to development OS/browser engine

This ticket is to refactor the polyfills that backfill headed Chrome functionality so that they are ready to go on Windows and Mac, not just Docker.

Right now, when you run on Windows or Mac, the polyfills don't actually simulate a real browser, which can cause issues with aggressively blocking sites.

Allow configuration of window/screen/viewport size

We need to allow configuration of the viewport that a session will use. Things we need to decide:

  1. How is this configured?
     • I think it's likely part of creating a session
     • We should probably store this with userProfile
  2. Do we have a list of default screen sizes that are rotated through? What about window positions/locations?
     • If we randomize this, could it result in some unexpected behavior? Do we have fixed screen sizes, but randomized locations?
     • Or should we tweak +- some number of pixels?
  3. Ideally we would be capturing screen combinations in Double Agent. Maybe this should be part of our eventual "user sampling" to determine how common various features are with real users.
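To make the "fixed screen sizes, but randomized locations" option concrete, here is a minimal sketch. Everything in it is an assumption for discussion: the SCREEN_SIZES list, the pickViewport name, and the 50-pixel maximum offset are hypothetical and not part of SecretAgent. The random source is injectable so that a session could reproduce its choice.

```javascript
// Hypothetical defaults; real values would ideally come from Double Agent sampling.
const SCREEN_SIZES = [
  { width: 1920, height: 1080 },
  { width: 1440, height: 900 },
  { width: 1366, height: 768 },
];

// Picks one fixed screen size and a window position offset by 0..maxOffset pixels.
// `random` must return a number in [0, 1), like Math.random.
function pickViewport(random = Math.random, maxOffset = 50) {
  const screen = SCREEN_SIZES[Math.floor(random() * SCREEN_SIZES.length)];
  return {
    screen: { ...screen },
    windowPosition: {
      x: Math.floor(random() * (maxOffset + 1)),
      y: Math.floor(random() * (maxOffset + 1)),
    },
  };
}

console.log(pickViewport());
```

Storing the returned object with userProfile would keep the same screen and position across runs of the same profile.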

Defaulting to random emulatorID causes inconsistent results

By default SA should use a consistent emulatorID on every run. To randomly cycle through emulators, the user must explicitly set a default flag.

Proposal:

If emulatorId is null or undefined when calling SecretAgent.createBrowser() then default emulatorId will be used (for now, it will be chrome-80).

If emulatorId is set to "random" then a random emulator will be picked.

If emulatorId is set to a valid/known emulator then use that.

Otherwise (if emulatorId is invalid/unknown) then throw an error.
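The four rules of the proposal can be sketched as a single resolver function. This is only an illustration: resolveEmulatorId, KNOWN_EMULATOR_IDS, and the ids other than chrome-80 are hypothetical names, and the real list would depend on which emulators are installed.

```javascript
// Hypothetical list of known emulator ids (an assumption for this sketch).
const KNOWN_EMULATOR_IDS = ['chrome-79', 'chrome-80', 'safari-13'];
const DEFAULT_EMULATOR_ID = 'chrome-80';

function resolveEmulatorId(emulatorId, knownIds = KNOWN_EMULATOR_IDS, random = Math.random) {
  // Rule 1: null or undefined falls back to the consistent default.
  if (emulatorId === null || emulatorId === undefined) return DEFAULT_EMULATOR_ID;
  // Rule 2: "random" is an explicit opt-in to random selection.
  if (emulatorId === 'random') {
    return knownIds[Math.floor(random() * knownIds.length)];
  }
  // Rule 3: a valid/known emulator id is used as given.
  if (knownIds.includes(emulatorId)) return emulatorId;
  // Rule 4: anything else is an error.
  throw new Error(`Unknown emulatorId: ${emulatorId}`);
}

console.log(resolveEmulatorId());         // default id
console.log(resolveEmulatorId('random')); // one of the known ids
```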

WaitForElement timeouts under heavy load

When running the Double Agent test suite, Secret Agent will experience timeouts for 3 or 4 sessions (out of ~40). We need to track down what exactly is happening here.

Things to note:

  • Other requests are still going through, so this is not a full-stop of the system
  • There's no consistency in which sessions fail to load
  • The hang is frequently on the "run-page" - eg, http://a1.dlf.org:3001/run-page?sessionid=32. Some of the fetch requests don't complete (no response handler ever called, but double agent records a response)
[35.x] Error on http://a1.dlf.org:3001/run-page?sessionid=35 TimeoutError: Timeout waiting for element document,querySelector,#goto-results-page to be visible
    at Timer.throwIfExpired (/app/shared/commons/Timer.ts:37:13)
    at Window.waitForElement (/app/core/lib/Window.ts:381:15)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
    at SessionState.runCommand (/app/shared/session-state/index.ts:124:14)
    at Window.runCommand (/app/core/lib/Window.ts:188:16)
    at Core.waitForElement (/app/core/index.ts:68:5)
    at CoreClient.coreClient.pipeOutgoingCommand (/app/full-client/index.ts:17:18)
    at CoreCommandQueue.processQueue (/app/client/lib/CoreCommandQueue.ts:46:28)

praise(gg): keep up the good work.

Hi,

This isn't an issue, feature request, or bug fix. I know that maintaining an open-source project can be daunting and challenging, especially without a lot of external validation or users willing to submit praise or feedback.

I am just saying thank you for your continued efforts in making scraping/automation an amazing experience!

I've not had a chance to use secret-agent yet but I am going to be using it in my next project.

Hopefully, this isn't spamming the issues feed. If it is, feel free to close/remove this issue but don't stop the good work.

All the best.

ERR_HTTP2_HEADER_SINGLE_VALUE & Certificate signed by unknown authority

First of all, thank you for a great job! Keep it up!

Now to the problem. I'm trying to make a request to street-beat.ru, but I get these two errors:

Error 1
2020-12-23T10:27:10.171Z ERROR [node_modules/@secret-agent/core/index] UnhandledErrorOrRejection { clientError: 'TypeError [ERR_HTTP2_HEADER_SINGLE_VALUE]: Header field "x-content-type-options" must only have a single value', context: {} }

Error 2
2020-12-22T19:48:44.558Z ERROR [node_modules/@secret-agent/mitm/handlers/HttpRequestHandler] MitmHttpRequest.ProxyToServer.RequestHandlerError { request: 'GET: https://cdn1.imshop.io/assets/app/b2.min.css', error: 'Error: 2020/12/22 23:48:44 Error on tls handshake with 91.209.77.149:45785. Args main.ConnectArgs{Host:"92.223.124.254", Port:"443", IsSsl:true, Servername:"cdn1.imshop.io", RejectUnauthorized:true, ProxyUrl:"", ClientHelloId:"Chrome83", TcpTtl:0, TcpWindowSize:0, Debug:false}, tlsConn.Handshake error: x509: certificate signed by unknown authority', context: {} }

In the Replay window the page seems to be fully loaded, but neither waitForAllContentLoaded nor waitForElement is triggered...

The errors occur with and without a proxy. I am using the example code from the Usage Example section.
Secret-Agent v1.2.0-alpha.4, nodeJS v14.15.1, macOS Big Sur 11.0.1

UnhandledErrorOrRejection when trying to connect to unreachable proxy

When the proxy designated in upstreamProxyUrl is unreachable for whatever reason, the process dies instead of throwing an error.

How to reproduce:

const SecretAgent = require('secret-agent');

(async () => {
    try {
        const agent = await new SecretAgent({
            showReplay: false,
            upstreamProxyUrl: 'http://127.0.0.1:65535'
        });
        await agent.goto('http://api.ipify.org/');
    }
    catch(err) {
        console.error(`Error: ${err}`)
    }
})();

Expected Behavior

Catchable error?

Current Behavior

Getting a couple of errors in the console log.

ts-node ./src/_playground.ts

2020-12-04T10:53:48.829Z ERROR [@secret-agent/core/index] UnhandledErrorOrRejection {
  clientError: '2020/12/04 13:53:48 Dial (proxy/remote) Error: dial tcp 127.0.0.1:65535: connect: connection refused',
  context: {}
}
2020-12-04T10:53:48.831Z ERROR [@secret-agent/core/index] UnhandledErrorOrRejection {
  clientError: "TypeError: Cannot create property 'stack' on string '2020/12/04 13:53:48 Dial (proxy/remote) Error: dial tcp 127.0.0.1:65535: connect: connection refused'",
  context: {}
}

Context

SecretAgent version: 1.2.0-alpha.2
ts-node: v9.1.0
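The second log entry ("Cannot create property 'stack' on string") suggests that a plain string, rather than an Error object, reached code that tried to set its stack. As a rough sketch of one way to guard against that, assuming nothing about SecretAgent's internals, a small helper can wrap any thrown or rejected value in an Error first. The toError name is hypothetical.

```javascript
// Hypothetical helper, not part of the SecretAgent API: make sure whatever was
// thrown or rejected is a real Error before anything touches its properties.
function toError(thrown) {
  if (thrown instanceof Error) return thrown;
  return new Error(String(thrown));
}

const raw = 'Dial (proxy/remote) Error: dial tcp 127.0.0.1:65535: connect: connection refused';
const error = toError(raw);

// `error` is an object now, so extending its stack cannot fail the way it does on a string.
error.stack += '\n    at <hypothetical client call site>';
console.error(error.message);
```

With a wrapper like this at the boundary where the proxy error is received, the rejection could reach the caller's try/catch as an ordinary Error.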

Support Dom Recording & Replay for Shadow DOM

Shadow DOM is not currently being recorded by our PageEventsRecorder, and therefore doesn't show up in Replay. This can make some sites appear unstyled (commonly those using React Styled Components).

Emulator polyfills need to handle secure vs non-secure

The Chrome 79 emulator has several failures in its polyfills. Many are the result of things not being available in a non-secure environment, such as Bluetooth, registerProtocolHandler, etc.

Polyfills can access the window location, so they should be altered to handle secure vs non-secure pages. This will require new data for http property orders.

How do I execute JS in context of page?

For my particular case, I want to execute a jQuery AJAX request inside the browser for my target website. I'm sorry if this is mentioned in the documentation but I couldn't find anything related to it via google search.

How to reload tab?

What is the equivalent of Puppeteer's page.reload? I can't find a specific method. Should it be performed as agent.document.location.reload()?

It seems that sites such as brandshop.ru (with Variti bot protection) can detect this. The first load is successful, but a reload done this way does not go through...

How to open a new tab?

I found a browser.closeTab function. Is there a way to open a new tab manually? How does secret-agent manage the opening of new tabs?

Can't skip replay download

I tried adding cross-env SA_REPLAY_SKIP_BINARY_DOWNLOAD=true npm install to package.json and ran it. It ignores the env variable and downloads Replay anyway.

Add ability to navigate "back"

Secret agent scripts sometimes have the need to go "Back" to the previous url vs a new "goto" to the previous url. We should add the ability to navigate back in the navigation stack.

Create updated Chrome emulator plugins

We need to create updated Chrome plugins. When we add these, we should add the ability for emulators to choose their own rendering engine. To consider:

  1. Create a central repository that emulators can tap into. This would mean Emulators or SA is responsible for providing paths to different engines.
  2. Or emulators package their own rendering engine. This might be preferable if, for example, someone wanted to use a patched version of a browser

Provide args to Chrome

Hello,

Thank you for your excellent work.

Is it possible to provide args like --no-sandbox to the browser?

Emulator Tcp Settings not working in docker

When running Secret Agent in Docker, the Tcp settings to emulate Windows 7/10 don't take. We need to investigate whether this has to do with running in Docker "on a mac", or whether the Go configuration is not working in Linux/Docker-slim.

Secret Agent is unable to complete Double Agent tests after latest push

A few things:

  1. Certain requests are hanging up in the Mitm. This appears to have been introduced once we started capturing postData.
  2. JsPath is not found in the window after a number of pages have loaded. Appears to be a timing issue with running the NewDocument initialization scripts
  3. Xhr Requests going through the Mitm are throwing an error that their "resourceType" cannot be found.

Support iframes in Replay

Right now, iframe content is being recorded during each session, but we don't replay anything other than the main frame. To get this to work, we need some way to link the frames created on the page to the domChange records (they're recorded in different contexts and so stored with overlapping ids and elements - ie, a body for each frame).

AwaitedDom Iterators should be accessible by index

When you run a document.querySelectorAll, you should be able to get an index out of the resultset:

const results = await document.querySelectorAll('a');
const href = await results[0].getAttribute('href');

How to add particular header to every request?

I want to add cache-control and pragma headers to every request with value 'no-cache'. It seems like my target site (street-beat.ru) somehow understands that I am using proxy, so I can't pass through to the actual site page. I thought that requests caching by proxy may be the problem.

Mobile view

I'd like to see a mobile view for this library, I think it'll be awesome!

UnhandledErrorOrRejection

const SecretAgent = require('secret-agent');

(async () => {
  const agent = new SecretAgent();
  console.log(agent);
  await agent.goto('https://example.org');
  const title = await agent.document.title;
  const intro = await agent.document.querySelector('p').textContent;
  await agent.close();

  console.log('Retrieved from https://example.org', {
    title,
    intro,
  });
})();

Why am I getting this error?

UnhandledErrorOrRejection { clientError: 'Error: spawn ENOMEM', context: {} }

Invalid cookie fields

I got this error when I tried to set cookies using the setItem method (agent.activeTab.cookieStorage.setItem('test', 'test')):

clientError: 'Error: Storage.setCookies: Invalid cookie fields'

Used simple example code from docs. MacOS.

AwaitedDOM - docs ?

Hello, how do I use an AwaitedDOM result as a specific element type?

agent.document.querySelector("canvas") => returns ISuperElement

I want to call toDataURL() of HTMLCanvasElement, but how would I do that?

Shutting down example scripts

Should we add SecretAgent.shutdown to website examples? Or should we try to detect if we're the only thing sitting in the event loop and shutdown?

Create a consistent data directory for .sessions and .mitm-ca

At the moment SA creates two directories (.sessions and .mitm-ca) in your CWD (current working directory) every time a script is run. This creates a confusing clutter of directories.

Both directories need to be organized under a single .secret-agent directory. The big question is, where should this new directory go?

A few options:

  • Keep it in the CWD. Seems like a bad idea as the same script could have multiple .secret-agent directories depending on where it was run each time.

  • Put it in the scriptEntrypoint directory. This is better than the previous option, but still seems a bit overkill (multiple .secret-agent directories if you have scripts in multiple directories). However, the bigger challenge is the data dir is created inside Core, which means it doesn't have access to scriptEntrypoint if working with remote client.

  • Put it in the "project root". This would be ideal, except that determining the "project root" is tricky. It seems the project root can change substantially depending on how the script was launched, which in some circumstances, could make it frustrating to find the data directory. See https://stackoverflow.com/questions/10265798/determine-project-root-from-a-running-node-js-application

  • Put it in /tmp. The pro is that it's consistently in the same place every time. The con is that data from ALL scripts and projects is going to be jumbled together into the same directory.

Thoughts?
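To help weigh the "project root" option, here is a minimal sketch of one common heuristic: walk up from a starting directory until a package.json is found, and fall back to the OS temp directory otherwise. The findProjectRoot and resolveDataDir names are hypothetical, and this does not address the remote-client case where Core cannot see the script's location.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Walks upward from startDir and returns the first directory containing
// a package.json, or null if the filesystem root is reached without a match.
function findProjectRoot(startDir) {
  let dir = path.resolve(startDir);
  while (true) {
    if (fs.existsSync(path.join(dir, 'package.json'))) return dir;
    const parent = path.dirname(dir);
    if (parent === dir) return null; // reached the filesystem root
    dir = parent;
  }
}

// Returns the single data directory that would hold .sessions and .mitm-ca.
function resolveDataDir(startDir = process.cwd()) {
  const root = findProjectRoot(startDir) || os.tmpdir();
  return path.join(root, '.secret-agent');
}

console.log(resolveDataDir());
```

The heuristic still depends on where the walk starts, so scripts launched from outside any package would land in the temp directory fallback.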

Documentation: Install Secret-Agent

In https://secretagent.dev/docs#installation it is mentioned that the download size is quite substantial and also why. But if this is all the case, why is it not simply recommended to download this instead as a global dependency (-g)? Is there a reason why you would want to have this as a local dependency?

I ask, as I cannot imagine that if you have 10 projects using it, that you want all of them using this dependency downloaded locally.

DNS configuration

Is there any way to enforce using a local DNS resolver?
In one of my projects, I must use a local DNS resolver (installed on the same host). It seems that SecretAgent uses preconfigured DNS servers such as Cloudflare over TLS, and I can't find a way to change that.

Cannot read property 'split' of undefined

Updated Secret Agent to 1.3.0-alpha.0 and now getting this error. MacOS 11.1. The simplest example code.

TypeError: Cannot read property 'split' of undefined
    at extractOsVersion (/Users/dmz/.nvm/versions/node/v14.15.1/lib/node_modules/secret-agent/emulate-browsers/base/lib/getLocalOperatingSystemMeta.ts:27:21)
    at Object.getLocalOperatingSystemMeta [as default] (/Users/dmz/.nvm/versions/node/v14.15.1/lib/node_modules/secret-agent/emulate-browsers/base/lib/getLocalOperatingSystemMeta.ts:10:19)
    at Object.<anonymous> (/Users/dmz/.nvm/versions/node/v14.15.1/lib/node_modules/secret-agent/emulate-browsers/base/lib/DomDiffLoader.ts:5:48)
    at Module._compile (internal/modules/cjs/loader.js:1063:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1092:10)
    at Module.load (internal/modules/cjs/loader.js:928:32)
    at Function.Module._load (internal/modules/cjs/loader.js:769:14)
    at Module.require (internal/modules/cjs/loader.js:952:19)
    at require (internal/modules/cjs/helpers.js:88:18)
    at Object.<anonymous> (/Users/dmz/.nvm/versions/node/v14.15.1/lib/node_modules/secret-agent/emulate-browsers/base/index.ts:8:1)

MitmProxy.RequestWithoutSessionId

I got this warning while recursively running (with timeout throttling) an instance of Secret Agent.

Iteration goes like this:

  1. Create Secret Agent instance with userProfile (if first iteration, userProfile = null);
  2. agent.goto(url);
  3. Scrape page;
  4. agent.close();
  5. Recursive call.

After some time this warning appears:
2021-01-28T11:57:09.227Z WARN [node_modules/@secret-agent/mitm/lib/MitmProxy] MitmProxy.RequestWithoutSessionId { isSSL: true, host: 'example.ru', url: '/', context: {} }
2021-01-28T11:57:09.331Z WARN [node_modules/@secret-agent/mitm/lib/MitmProxy] MitmProxy.RequestWithoutSessionId { isSSL: true, host: 'example.ru', url: '/example', context: {} }

The full iteration is wrapped in a try/catch block, but it catches nothing. The iteration just freezes until I restart the node process.

Secret agent v1.2.0-alpha.5. MacOS.
