ambiretech / adex-market

AdEx Market: a scraper that aggregates ad campaign information from the validator network

JavaScript 96.79% Shell 3.03% Dockerfile 0.19%

adex-market's Issues

API to add more validators to scrape (or to submit campaigns)

if you create a campaign with validators that the market doesn't know about, it won't be crawled

so, introduce one of the two:

an API to submit campaigns; this means that any new validators will be "discovered" and the market will continue crawling them
an API to add validators to crawl

integration tests

Integration tests for the routes (e.g. getting channels by earner) and the scrape loop (e.g. getting USD estimation)

/campaign: filter by depositAsset

e.g. ?depositAsset=addr

POST ad slot: limit on number of slots/hostnames for an account

Impose a limit on the maximum number of slots and maximum number of hostnames that a publisher can use

This is useful to prevent various types of abuse

Default values

hostnames: 20
ad slots: 40
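
A minimal sketch of how such a check could look, assuming each stored slot carries a `website` URL. The function name, the issue codes and the slot shape are hypothetical and not taken from the market's code:

```javascript
// Defaults from the issue text above.
const DEFAULT_LIMITS = { maxHostnames: 20, maxSlots: 40 }

// existingSlots: slots already owned by the publisher, each { website } (assumed shape).
// Returns an array of issue codes; an empty array means the new slot is allowed.
function checkSlotLimits (existingSlots, newSlot, limits = DEFAULT_LIMITS) {
  const issues = []
  if (existingSlots.length + 1 > limits.maxSlots) issues.push('SLOT_LIMIT_REACHED')
  // Count distinct hostnames including the one being added
  const hostnames = new Set(existingSlots.map(s => new URL(s.website).hostname))
  hostnames.add(new URL(newSlot.website).hostname)
  if (hostnames.size > limits.maxHostnames) issues.push('HOSTNAME_LIMIT_REACHED')
  return issues
}
```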

tests for /units-for-slot

test:

returns units with prices
applies targeting rules properly (see the tests in the JS validator, integration.js)
applies min CPM
applies limit for number of campaigns earning from

ensure lastApproved is always saved

save the full lastApproved object

do not return it from /campaigns, but save it

auto-categorization: tweak confidence thresholds

rather than using confident from webshrinker, use score: see if it's over a certain threshold which we'll adjust

adUnit (advertiser) auto-categorization

Definitions

adUnit object is the MongoDB object for an ad unit
unit.categories is set for each unit in campaignSpec.units
adUnitCategories refers to the targeting variable which is normally read from unit.categories

adUnit/advertiser auto-categorization based on AIP31:

Add a new route to the market, POST /unit-categorize, that simply categorizes a unit by its URL (targetUrl) and image and saves this to a collection. It should categorize using:

Google Vision/Google Natural Language
Webshrinker

For security, it should either compute the ipfs ID separately or store its results separately for targetUrl/mediaUrl

this should be used in a few separate ways:

when the advertiser is uploading an ad unit, it should be categorized, with the results saved directly in the adUnit object
when the advertiser is creating a campaign, this can be used for suggesting default categories
when the advertiser is creating a campaign, we should set the unit.categories variable (in the units in campaignSpec) so that the publisher can exclude certain types of content (via adSlot.rules)

Basically, add everything in the adUnit mongo object, but govern it through the platform, which will use the market API beforehand to help auto categorization when creating a campaign, and to set unit.categories.

Further remarks/concerns:

adUnitCategories can be shimmed through the AdView getter (targetingInputGetter) from the old targeting tags if it's not set
if we need overrides, they can be stored directly in the adUnit Mongo object and used when unit.categories is generated on the platform when generating the campaignSpec

[supermarket] add `type` to `/units` route

Part of #101 and AmbireTech/adex-supermarket#9

We need a type query parameter for the route /units to optimize for the Supermarket route /units-for-slot/${slotId}.

drop reliance on web3/ethers

it shouldn't be needed; should use the relayer to check all identity-related things

Update README.md environment variables and query parameters

Implement additional test cases

HTTP Routes

optimization: once the status is irreversible, stop updating this campaign

states like Expired, Exhausted (and possibly others?) are permanent; once they're reached, stop updating the campaign

Market Health

each campaign/channel should have a "market health"

which is an aggregate of the

health as reported by each validator
off-chain state, as determined per: AmbireTech/adex-validator#17

so a healthy channel would satisfy:

recent heartbeat for all validators
there is a recent NewState
last ApproveState is recent and reports healthy
recent NewState and ApproveState have the same stateRoot value

if the state is unhealthy, there should be good information on why it happens, for example "Channel is unhealthy cause validator A is offline" or "Channel is unhealthy cause validator A reports unhealthy"

NewState/ApproveState signatures should be checked
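
The healthy-channel conditions above could be sketched roughly as follows. The input shape, the recency threshold and the wording of the reasons are assumptions for illustration, and signature checking is left out:

```javascript
// Assumed input shape (not the market's actual schema):
// { heartbeats: { [validatorId]: timestampMs },
//   newState: { stateRoot, timestamp },
//   approveState: { stateRoot, timestamp, isHealthy } }
const RECENCY_MS = 60 * 1000 // hypothetical threshold

function getHealthIssues (channel, now = Date.now(), recencyMs = RECENCY_MS) {
  const isRecent = ts => typeof ts === 'number' && now - ts <= recencyMs
  const issues = []
  for (const [id, ts] of Object.entries(channel.heartbeats)) {
    if (!isRecent(ts)) issues.push(`validator ${id} is offline`)
  }
  const { newState, approveState } = channel
  if (!newState || !isRecent(newState.timestamp)) issues.push('no recent NewState')
  if (!approveState || !isRecent(approveState.timestamp)) {
    issues.push('no recent ApproveState')
  } else if (!approveState.isHealthy) {
    issues.push('follower reports unhealthy')
  }
  if (newState && approveState && newState.stateRoot !== approveState.stateRoot) {
    issues.push('NewState and ApproveState stateRoot mismatch')
  }
  return issues // an empty array means the channel is healthy
}
```

Returning the list of reasons rather than a boolean keeps the "why is it unhealthy" information available to callers.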

use the same authentication method as the validators

solves #12

described here: https://github.com/AdExNetwork/aips/issues/32

Use the Relayer to verify EWT authentication tokens.

This will ensure we don't have to have blockchain-specific code in the Market, and we can use the same authentication method as the validator uses, so that tokens can be reused and the user will have to sign fewer messages on login

Fix: isDisconnected Implementation

There is an issue with the way isDisconnected is implemented.
https://github.com/AdExNetwork/adex-market/blob/master/lib/getStatus.js#L54
it does a util.isDeepStrictEqual(h1, h2) but the signatures & timestamp on the messages would be different

hence the channel would return disconnected

An approach would be to get the leader propagated heartbeat messages from the follower and then do length comparison and check if the difference is within the allowed difference. And the same for the follower from the leader.

enforce publisherAddr limits

The market should take a publisherAddr query parameter to enforce two limits:

Max number of channels the publisher is earning from; can be implemented via counting the Active channels the publisher is earning from, and if they’re >= maxChannels, add a filter to the query that only returns them (with .limit(maxChannels), sorted by earnings)
Max allowed earnings for limited accounts - we will stop returning results entirely if the account is limited and the user's total balance (on chain plus outstanding) is over that

adUnit.created is a string rather than date

It needs to be an ISODate (meaning you have to pass a Date when saving)

crawl all channel pages

see https://github.com/AdExNetwork/adex-validator/blob/master/bin/validatorWorker.js#L62

we need to crawl all pages from the validator

NOTE: total might change to totalPages soon: AmbireTech/adex-validator#180
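
A rough sketch of the page loop, with the HTTP call injected so that the example stays self-contained. It assumes zero-indexed pages and a response of the form `{ channels, totalPages }`; both are assumptions, and the field may still be called `total` as noted above:

```javascript
// fetchPage(page) is a caller-supplied function (hypothetical) that resolves to
// { channels: [...], totalPages: n } for the given zero-indexed page.
async function crawlAllChannels (fetchPage) {
  const first = await fetchPage(0)
  const remaining = Math.max(first.totalPages - 1, 0)
  // Fetch the remaining pages in parallel
  const rest = await Promise.all(
    Array.from({ length: remaining }, (_, i) => fetchPage(i + 1))
  )
  return [first, ...rest].flatMap(res => res.channels)
}
```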

API to get all channels by earner

get all channels where a certain address has earned funds

this requires

storing lastApproved in the DB
querying lastApproved.newState.balances with $exists and projecting only that key

This API will merely return the latest channel balance, not the outstanding (non-withdrawn) amount. The reason for this decision is that the market must be blockchain-agnostic, and outstanding is defined as balances[addr] - onChainWithdrawn[addr]
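
As an illustration, the filter and projection could be built like this. The field path is the one named in this issue, while the helper names are hypothetical; `outstanding` is shown only as something a blockchain-aware client could compute on its own side:

```javascript
// Builds a MongoDB filter and projection for channels where addr has a balance.
function earnerQuery (addr) {
  const key = `lastApproved.newState.balances.${addr}`
  return {
    filter: { [key]: { $exists: true } },
    projection: { [key]: 1 }
  }
}

// Client-side only: balances[addr] - onChainWithdrawn[addr].
// Amounts are assumed to be decimal strings, so BigInt is used.
function outstanding (balance, onChainWithdrawn) {
  return (BigInt(balance) - BigInt(onChainWithdrawn)).toString()
}
```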

State backups

the market should save the full state tree for some (all?) channels it cares about, as a failsafe against failing nodes; it should, ofc, also save the signature that goes with it

an easy way to do this would be to always save the full NewState/ApproveState validator msg pairs - both of them, together, contain a full state tree (required to build proofs) and the two required signatures

auto-categorization of adSlots/websites

use the Alexa API to automatically categorize adSlots into relevant categories on POST; see lib/publisherVerification.js for an example on how to use the Alexa API

Map the results to relevant tags from the options on the Platform; if they do not match directly, just add extra targeting tags for each Alexa categorization.

@simzzz I am not familiar with the Alexa API and what it returns exactly, so you'll need to research and come up with the best way to implement this

NOTE: this will replace the tags entered by the user (override them); we'll make relevant changes in the Platform (we won't ask the user to provide tags for the slot)

NOTE: once this is implemented and a PR is opened, we'll test it and if it's not working sufficiently well (due to Alexa's API) we'll restructure into a different design: we'll still allow users to submit tags for the adSlot, we'll keep auto-categorizations from Alexa in the websites collection and merge them with the user-provided tags when doing GET /slot/

NEW IDEA: use webshrinker - turns out it's a much more adequate API

support .well-known ownership verification

as an alternative to the DNS TXT record, support .well-known/adex.txt, with content equivalent to the DNS TXT record (adex-publisher=)

research whether this sparse format is acceptable for .well-known files: https://tools.ietf.org/html/rfc8615
implement it
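
A small sketch of parsing such a file, assuming its body holds one `adex-publisher=<address>` line in the same form as the TXT record. Fetching the file and the function name are outside what the issue specifies:

```javascript
// Returns the publisher address found in the file body, or null if there is none.
function parseAdexTxt (body) {
  for (const line of body.split(/\r?\n/)) {
    const match = line.trim().match(/^adex-publisher=(\S+)$/)
    if (match) return match[1]
  }
  return null
}
```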

benchmarking tool

depends on #77 and #76

Benchmark the market server, against a pre-filled DB of campaigns (copied from production), using a set of ~10 requests on /campaigns with various real-world parameters (one with just ?status, the others with ?status&limitForPublisher= for different publishers - some that hit the limit, others who don't)

see how many requests per second the market can handle on /campaigns

should be accessible with npm run benchmark

Validate and add to ipfs units and slots

Validate slots and units

lastApprovedSigs/lastApprovedBalances

instead of returning lastApproved, return lastApprovedSigs: [sig1, sig2] and lastApprovedBalances: {...}

change the way data is stored in mongo

adSlot GET: return recommended earning limit

The /slot GET route should return a recommendedEarningLimitUSD (as a float that represents a USD value) based on the relevant entry in websites.

For now, use the following parameters:

lower than 10000 rank: 10k lifetime earning
lower than 100k: 5k lifetime earnings
lower than 300k: 1k lifetime earning
higher than 300k or none: 100 USD
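
The tiers above map to a simple lookup. This is only a sketch: the function name is hypothetical, and a rank of exactly 300000 is treated as falling into the lowest tier, which the issue does not specify:

```javascript
// rank: Alexa rank from the websites entry, or null/undefined when unknown.
function recommendedEarningLimitUSD (rank) {
  if (typeof rank !== 'number' || !(rank > 0)) return 100
  if (rank < 10000) return 10000
  if (rank < 100000) return 5000
  if (rank < 300000) return 1000
  return 100
}
```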

automatic updating of Cloudflare WAF

See scripts/get-waf

Use the cloudflare npm module to automate updating of the rule

see https://api.cloudflare.com/#account-level-firewall-access-rule-update-access-rule, namely PATCH accounts/:account_identifier/firewall/access_rules/rules/:identifier

Post Request Validation

There should be request body validation for /user/
https://github.com/AdExNetwork/adex-market/blob/master/routes/users.js#L11

implement clustering

like the validator, an ability to run in clustered mode + take MAX_WORKERS from the environment (services/cluster can just be copy-pasted)

however, this depends on #76, since the scraper should not run in multiple processes

Bad test assertion text for is_unhealthy

Issue:

Test actually has a recent Heartbeat for both validators.

As part of #5 for security audit

https://github.com/AdExNetwork/adex-market/blob/ee8edadf6a0555c38ffcf435084ab65249e80bd1/test/index.js#L143-L147

https://github.com/AdExNetwork/adex-market/blob/ee8edadf6a0555c38ffcf435084ab65249e80bd1/test/validatorTestMessages.js#L330-L336

https://github.com/AdExNetwork/adex-market/blob/ee8edadf6a0555c38ffcf435084ab65249e80bd1/test/validatorTestMessages.js#L508

creating an ad slot overrides the previous `verifiedForce`

bug: Rejected messages are ignored

When there is RejectState, the campaign state should be changed to something that reflects that

When there's no recent NewState/ApproveState pair, it can either mean that the channel is genuinely not updated (no new events) or that there is a new NewState but no new ApproveState (but a RejectState instead, or the follower is offline)

USD price estimation for campaigns

could be another field in the status that's calculated by the status-loop, using the same algorithm as this: https://github.com/SpankChain/uniprice

essentially, we get how much the token is in DAI by calling the uniswap contract

publisher verification: require a minimum alexa rank

The query will be as follows:

one of the two

verifiedForce
min alexa rank AND (verifiedIntegration or verifiedOwnership)

Or alternatively one of three

verifiedForce
verifiedOwnership
min rank AND verifiedIntegration

This query has to be changed in two places

routes/adSlot
scripts/get-waf

So this should be unified via lib/publisherVerification
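
One way the unified check might look, sketched as a plain predicate for the first variant (verifiedForce, or rank plus either verification). The `rank` field name and the threshold are assumptions; a real implementation would more likely express this as a MongoDB query shared by both call sites:

```javascript
const MAX_RANK = 100000 // hypothetical threshold; a lower rank number is better

// site: { verifiedForce, verifiedIntegration, verifiedOwnership, rank } (assumed shape)
function isPublisherVerified (site, maxRank = MAX_RANK) {
  if (site.verifiedForce) return true
  const hasRank = typeof site.rank === 'number' && site.rank <= maxRank
  return hasRank && Boolean(site.verifiedIntegration || site.verifiedOwnership)
}
```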

save more information in the campaign status

percentage of funds distributed (where 100% means Exhausted)
last heartbeat time for each validator
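
The percentage could be derived from the latest balances and the deposit, for example as below. This assumes token amounts are decimal strings and that the deposit is available as `depositAmount`; the function name is hypothetical:

```javascript
// balances: { [address]: amountString }, depositAmount: amountString
function distributedPercent (balances, depositAmount) {
  const total = Object.values(balances).reduce((sum, v) => sum + BigInt(v), 0n)
  const deposit = BigInt(depositAmount)
  if (deposit === 0n) return 0
  // Work in basis points so that large token amounts never go through floating point
  return Number(total * 10000n / deposit) / 100
}
```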

CPC/CPA pricing for /units-for-slot

Problem

It's currently possible to set CLICK pricing for a campaign with zero IMPRESSION pricing, essentially implementing a CPC campaign.

However, the price returned from /units-for-slot (see AIP31) is respected by the AdView as the final price, and it's used to sort the available bids.

Research

https://support.google.com/google-ads/thread/1452036?hl=en - using historic CTR
https://blog.rontar.com/behind-the-scenes-how-advertising-auctions-and-cost-per-click-work - same

Solution

Use the slot average CTR to calculate a per-impression price to shim the price value.

For example, if price.CLICK is 100 and the CTR is 0.01, the returned price will be 1

It's important to use a default value in case the slot average is not available, or there's not enough data to gather it (e.g. less than 2000 impressions).

The same can be applied to custom acquisition events in the future, except it will require the rate between that event and impressions.

*The challenge is that the Market doesn't know the CTR. We can store an expectedCtr in the targetingRules by using { set: ['expectedCtr', 0.002] } and update it via a script on the validator
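
A sketch of the shim, using plain numbers for readability (real prices are token amounts and would need big-number arithmetic). The default CTR reuses the 0.002 from the expectedCtr example and is an assumption, as is the `stats` shape:

```javascript
const DEFAULT_CTR = 0.002 // hypothetical default
const MIN_IMPRESSIONS = 2000 // below this, the slot average is not trusted

// stats: { impressions, clicks } for the slot, or undefined when unavailable
function impressionPriceFromClick (clickPrice, stats) {
  const enoughData = Boolean(stats) && stats.impressions >= MIN_IMPRESSIONS
  const ctr = enoughData ? stats.clicks / stats.impressions : DEFAULT_CTR
  return clickPrice * ctr
}
```

With a click price of 100 and 100 clicks over 10000 impressions (CTR 0.01), this yields a per-impression price of 1, matching the example above.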

GET /units-for-slot/{slotId}: the "supermarket"

Problem

There are a few things about the current market that can be easily optimized/improved:

the adview manager needs to request info about the slot first, and then all campaigns; this can be done in one request
we can do the targeting server-side and only return relevant campaigns; some targeting rules can only be applied on the client side (e.g. AdEx Profile, frequency capping) - so those rules will simply be ignored (see AIP31)
because of those limitations, we have to set cache times high, which leads to this issue, which happens because campaigns are still returned for some time after they've exhausted/expired

Solution

A new route that returns all units matching a certain ad slot. We will build it as a separate component called the "supermarket", run it separately, and route to it at the NGINX level

This gives us 2 advantages:

🚄 Speed: by doing 1 request instead of 2, and using an in-memory data structure, and Rust, we'd be able to cut cache times down to only a few seconds and deliver fresher data; furthermore, AdEx ads will load faster, and fewer KB will be sent over the network
🧐 Traceability: if there are no viable ad units, it will return the precise reasons, which means easier debugging of "why are my ads not showing"; also allows better internal stats

Functionality:

implements a route /units-for-slot, which returns all ad units which match this slot; it will apply targeting as per AIP31 AmbireTech/adex-supermarket#9
pulls up-to-date campaign data often and directly from validators to avoid trailing impressions Supermarket impl
applies targeting logic server-side using the adview manager rust code
[] (separated in own issue #170) returns issues and/or stats: a list of possible reasons why there are no returned units: NO_ACTIVE_CAMPAIGNS, CAMPAIGNS_NOT_SOUND, NO_DEPOSITASSET_CAMPAIGNS, NO_UNITS_FOR_SIZE, NO_UNITS_FOR_TARGETING, NO_UNITS_FOR_ADSLOTRULES, SLOT_NOT_VERIFIED (if acceptedReferrers.length == 0)

Tech design

The supermarket will function in-memory, without a database. It will pull all data it needs from the validators.

Recommendations for internal data structures:

active: HashMap<CampaignId, CampaignWithInfo> where CampaignWithInfo holds the channel, latest balance, latest status, etc.
finalized: Set<CampaignId> - a set of finalized campaigns (exhausted/expired, whatever)

That way only active (non-finalized) campaigns are kept in memory and updated, and once a campaign becomes inactive (which is irreversible), it will be flagged in finalized. Note that unsound (Unhealthy, Invalid, etc.) campaigns are not finalized - only Exhausted/Closed/Withdraw/Expired are finalized.

On start-up and every few minutes, we will get all known campaigns from a configurable list of validators. Every few seconds (10-20), we will update our active campaigns from the validators (update their latest balance tree/messages/status).
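
The active/finalized split can be illustrated in JavaScript with a Map and a Set (the supermarket itself is planned in Rust, so this is only a sketch of the idea). The class and method names are hypothetical:

```javascript
// Statuses that are irreversible, per the description above.
const FINALIZED_STATUSES = new Set(['Exhausted', 'Closed', 'Withdraw', 'Expired'])

class CampaignStore {
  constructor () {
    this.active = new Map() // campaignId -> { channel, balances, status, ... }
    this.finalized = new Set() // campaignId
  }

  // Returns true if the campaign is (still) tracked as active after the update.
  update (id, info) {
    if (this.finalized.has(id)) return false // finalization is irreversible
    if (FINALIZED_STATUSES.has(info.status)) {
      this.active.delete(id)
      this.finalized.add(id)
      return false
    }
    this.active.set(id, info) // unsound campaigns stay here and keep being updated
    return true
  }
}
```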

Ad slots can be retrieved from the market on demand without a cache: this means every request to /units-for-slot will first request the slot from the market. The market endpoint will be configurable. Later on, we can cache that too if needed.

Applying earning limits

There are a few concepts of earning limits within AdEx: the limits recommended by the Market per slot. Those are based on the Alexa rank of the website in question and may include quick account limits too in the future.

We cannot apply earning limits because we can't compute all lifetime earnings because we drop finalized campaigns from memory (also, we only crawl active ones from validators). But the Market can apply limits at an ad slot level, by sneaking in { onlyShowIf: false } within adSlot.rules.

Prerequisites

https://github.com/AdExNetwork/aips/issues/31 and its matching engine - although that's mostly implemented

Add slots and units to ipfs

Add units and slots to ipfs

Make a helper function which generates mock campaigns

can be configurable
makes testing simpler
low priority

optimize and security audit the market health function

perform a detailed (security focused) review on the market health function and its tests - @samparsky and @elpiel - you should each do that
rewrite it in rust in a more efficient way and use that in the Supermarket - @elpiel you should do that

it can be optimized by:

not querying the validators if not needed (e.g. channel is expired)
by using the heartbeats returned in last-approved
only retrieving latest NewState/RejectState individually when the ones in last-approved are not recent (to distinguish between Invalid and just not having a new New/Approve pair)

In terms of types, this can be represented much more cleanly as an enum of:

Initializing
Waiting
Active
Finalized - should contain another enum containing { Expired, Exhausted }
Unsound - should contain another enum with a struct { disconnected, offline, rejectedState, unhealthy } - where all of those are booleans

This will be implemented in the supermarket first, then the logic should be backported to the JS implementation - we'll figure out how to translate the type considering JS does not have sum types
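
One possible JS translation is a tagged object with a `type` discriminator. This is a sketch of the idea only; the constructor names and field layout are assumptions:

```javascript
// Each constructor returns a plain object tagged with `type`.
const Status = {
  initializing: () => ({ type: 'Initializing' }),
  waiting: () => ({ type: 'Waiting' }),
  active: () => ({ type: 'Active' }),
  finalized: reason => ({ type: 'Finalized', reason }), // reason: 'Expired' | 'Exhausted'
  unsound: flags => ({
    type: 'Unsound',
    disconnected: false,
    offline: false,
    rejectedState: false,
    unhealthy: false,
    ...flags // override only the flags that apply
  })
}

function isFinal (status) {
  return status.type === 'Finalized'
}
```

Plain objects like these serialize directly to JSON, so they could be stored in the campaign status without extra mapping.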

/units-for-slot: return basic issues

Return issues from the /units-for-slot route, as they're specified in #101

if campaignsActive.length is 0, do a .count() on campaigns with the same query but w/o depositAsset, and depending on that return NO_ACTIVE_CAMPAIGNS or NO_DEPOSITASSET_CAMPAIGNS
if all of the campaigns don't have units with the proper type, add NO_UNITS_FOR_SIZE
if the slot has no acceptedReferrers.length, add SLOT_NOT_VERIFIED
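
The three checks could be combined along these lines. The counts are passed in rather than queried so the sketch stays self-contained, and the `units`/`type` field names are assumptions about the campaign and slot shapes:

```javascript
// activeCount: active campaigns matching the query including depositAsset
// activeCountAnyAsset: result of the same .count() without the depositAsset filter
// campaigns: [{ units: [{ type }] }], slot: { type, acceptedReferrers }
function getSlotIssues ({ activeCount, activeCountAnyAsset, campaigns, slot }) {
  const issues = []
  if (activeCount === 0) {
    issues.push(activeCountAnyAsset > 0 ? 'NO_DEPOSITASSET_CAMPAIGNS' : 'NO_ACTIVE_CAMPAIGNS')
  } else if (campaigns.every(c => !c.units.some(u => u.type === slot.type))) {
    issues.push('NO_UNITS_FOR_SIZE')
  }
  if (!slot.acceptedReferrers || slot.acceptedReferrers.length === 0) {
    issues.push('SLOT_NOT_VERIFIED')
  }
  return issues
}
```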

Tagging this with ux because using that, we'd be able to inform publishers why they don't get any ads.

/units-for-slot implementation (JS)

Implement /units-for-slot in JS before the supermarket is ready

This route should:

get all active campaigns
apply a global system-wide min CPM
get the slot
get all units from those campaigns which match the slot ad type
apply targeting
return the units, plus their price, and the targeting input variables
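
The steps above can be sketched as one in-memory pass. Everything here is simplified: campaigns and the slot are passed in instead of being loaded from the DB, prices are plain numbers, and targeting is delegated to a caller-supplied `matchesTargeting` function (hypothetical) rather than the AIP31 rule engine:

```javascript
// campaigns: [{ id, status, pricePerImpression, units: [{ id, type }] }] (assumed shape)
// minCpm: global minimum price per 1000 impressions
function unitsForSlot ({ campaigns, slot, minCpm, matchesTargeting }) {
  const result = []
  for (const campaign of campaigns) {
    if (campaign.status !== 'Active') continue
    if (campaign.pricePerImpression * 1000 < minCpm) continue
    for (const unit of campaign.units) {
      if (unit.type !== slot.type) continue
      // Input variables handed to the targeting rules and returned to the caller
      const targetingInput = { adSlotId: slot.id, adUnitId: unit.id, campaignId: campaign.id }
      if (!matchesTargeting(campaign, targetingInput)) continue
      result.push({ unit, price: campaign.pricePerImpression, targetingInput })
    }
  }
  return result
}
```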

/campaigns?all - get all campaigns regardless of status

default depositAsset

if a depositAsset is not provided, default to DAI (either mainnet or testnet depending on the config)
related to #37

split campaign scraper and server in separate binaries

make a directory bin/ where there would be two separate things: server and scraper which can be started individually

the server would only handle requests, and the scraper would only scrape campaigns

npm start will default to the server

verify publishers on POST adSlot

call verifyPublisher and save the result every time a new adSlot is created; first check if the record exists - if it does, then it's an error (can do that with a single insert)
a convenience method that takes input, checks the blacklist, and saves the successful result or returns issues: an array of strings with problems that the publisher must address (e.g. no DNS record); this should be used by scripts and POST; the existence of duplicates should also be an issue, but we will still save the record
if there is a verification error, send it back to the user with the message; return this in the form of issues array of err message strings
in the platform, show the error and explain how to add DNS TXT record: AmbireTech/adex-platform#394

/campaigns: fast filter by ?byEarner

query parameter that returns campaigns where there's a certain earner

do that by doing something like (not sure if correct):

```
{ [`status.lastApprovedBalances.${req.params.earner}`]: { $exists: true } }
```

this is required here: https://github.com/AdExNetwork/adex-relayer/blob/master/routineAuthsLoop.js#L42

this is basically the same as #17 but I'm reopening it cause the previous implementation appears to be querying everything that has `balances` and filtering it server side; we want to filter it on a DB query level, for performance

"Waiting" status

if a campaign status would be Ready but the activeFrom has not commenced yet, consider that a Waiting status
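
In code this is a small adjustment applied after the status is computed. A sketch, assuming `activeFrom` is a millisecond timestamp and using a hypothetical function name:

```javascript
// Maps Ready to Waiting while the campaign has not started yet; other statuses pass through.
function withWaiting (status, activeFrom, now = Date.now()) {
  return status === 'Ready' && activeFrom > now ? 'Waiting' : status
}
```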
