topics's Issues

First-party Leagues Implementing the Governed Harvesting of Topics (FLIGHT)

This issue was spawned from an earlier Twitter discussion.

The problem

As currently proposed, Topics are available based on where specific 3Ps are present. This gating creates several issues:

  • The rich get richer: 3Ps with a presence on the greatest number of sites are more likely to benefit than those with a small presence. In a competitive bidding environment, the entrenchment effects are likely to be strong.
  • Arbitrary value transfer: sites that are good predictors of valuable topics transfer value to sites that aren't, based simply on having selected the same provider. This type of fungibility makes it more profitable to produce low-quality content (if not worse) and monetise off others' high-quality work than to produce high-quality work oneself. First parties should have an active say in who they share their work with.
  • No incentive to monitor bad actors: 3Ps are incentivised to work with as many 1Ps as possible, and 1Ps are incentivised to work with the 3Ps that have the most other 1Ps. This removes incentives (and, for the 1P, responsibility) to exclude bad actors that rely on the shared distribution of value to produce hate or disinformation.

Approach

I am deliberately not proposing a technical mechanism at this point, but rather focusing on the desirable outcomes first.

The idea is that instead of gating the shared elaboration of Topics through whoever happens to have the same 3P, it is done through explicit groups (leagues) of first-parties. Behaviour on any member of a given league contributes to the elaboration of topics that can be used for IG targeting on any other. This has several consequences:

  • It encourages collaboration across first parties, making groups of any size with whatever internal rules they wish to have.
  • It avoids structurally anticompetitive effects.
  • First parties get to choose who they share value with, meaning they can deploy values-based choices to the inclusion or exclusion of actors that intermediaries struggle with.
  • Presumably, a form of league identifier would be included in bids. This would enable bidders to assess topic quality (and some leagues will provide a better signal for specific topics) as well as brand safety (it's easier to filter out leagues that accept hate and disinfo).
  • Local monitoring: league members have a shared interest in the quality of their league as represented to bidders, and therefore are encouraged to check that other members aren't misbehaving. (Monitoring is a key function of commons.) This can help move away from the dynamic in which intermediaries are bad at monitoring and enforcement (for whatever reasons) and other actors have no leverage other than complaining.
  • It provides a logical and useful place off of which to potentially hang custom topic classification models, specific taxonomies, or various approaches to elaborating more precise topics (eg. a shared model that is simply used to push topics into a meta).

In order for this to work, a league would have to be something that you can't join at high speed or frequency. This is significantly more complex than gating on origin and a method call; my assumption is that whatever technical mechanism underlies topic leagues would be shared with other approaches that also benefit from grouping first parties for whatever reason.

I realise this is more hand waving than the usual fare (though maybe by waving hands fast enough one may… take flight) but I'm trying to hold off jumping into solution space while we discuss the potential value of the overall approach.

Should sites receive historic topics every visit, or first visit only?

  • For sites that users frequently visit there is no difference in privacy. For infrequently visited sites, this becomes a trade-off between topic dissemination rate and utility.
  • How might one define “first visit”?
    • It could be: does the site have any cookies or other storage for the user? If so, it’s not first visit.

Document the controller, basis for processing, and related issues

These are some questions related to #31 that would help users of the Topics API understand when it could be called. (This is not just about compliance -- understanding these answers will help sites and third-party services manage the interactions of scripts on a page that call Topics API with scripts that handle consent and/or opt-outs/objections from the user. Covering this material at an early stage will help to evaluate how practical this proposal is to implement.)

  • Who is the controller? (Is any caller of document.browsingTopics() a controller?)

  • What is the basis for processing?

  • Is the Topics API data "obtained from the data subject" because it is provided by the browser, which operates as the agent of the user (data subject)?

  • Because Topics API is an exchange of value for value (a trade of information about the user's activity on the current site for information about the user's activity on previous sites) is it considered a "sale" of personal information in California?

Callers getting topics according to a priority list

A caller may not get the same signal from every topic when selecting an ad; for instance, "Auto insurance" may be more useful than "Vegan Cuisine".

Would it be possible for callers to provide a ranked priority list of topics, for example at a .well-known location, and for the API to return topics, if eligible, according to this priority list?
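
For illustration, here is one shape such a priority list could take, together with how the browser (or the caller) might apply it. The .well-known path, the JSON format, and the topic IDs are assumptions for the sake of the example, not part of the proposal:

    // Hypothetical file at https://adtech.example/.well-known/topics-priority.json,
    // listing taxonomy topic IDs in descending order of value to this caller
    // (the IDs below are placeholders):
    //   { "priority": [123, 57, 289] }
    //
    // The browser could then filter the caller-eligible topics against the list
    // and return the highest-ranked match:
    function pickByPriority(eligibleTopics, priority) {
      for (const id of priority) {
        if (eligibleTopics.includes(id)) return id;
      }
      return null; // fall back to the API's default selection
    }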

Provide Topics API for not adding current page's topics

The Topics API provides one zero-argument function document.browsingTopics(), which serves three logically distinct purposes:

  1. Getting topics about the user. This is the obvious one.
  2. Determining caller eligibility to receive the topic from future calls of the API.
  3. Building up the user's set of top 5 topics for the epoch of the current call.

It would be useful to provide a little more control over these three different aspects of the API. In particular, there is some tension between the first two and the last use case. For the first two use cases, there is no downside (aside potentially from some latency) to calling the API. Each ad tech is incentivized to call the API whenever possible, either to get useful signals or to enable nonempty responses for future calls to the API.

On the other hand, there are potential downsides to calling the API when it comes to the third point. Consider, for example, a very large publisher site whose topics at the domain/subdomain level are generic and not commercially relevant. The ad tech might like to call the API to get useful signals, but with the current API it may not be worth the risk of potentially contaminating the user's future top 5 topics with those generic, not commercially relevant topics.

It would be beneficial to provide an argument that controls this behavior, something like browsingTopics(add_current_topics=true). Since eligibility is determined per API caller, there should be no ecosystem concern about "freeloaders" getting other callers' topics without contributing. There also does not seem to be any detrimental effect on user privacy. While the concern mentioned above might be partially mitigated by improved Topics ranking and commercially focused taxonomy changes, it seems best to give API callers this flexibility in how they use the API.
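
As a rough sketch, the suggested control might surface in JavaScript as an options argument; the option name below is purely illustrative, since the current explainer only defines the zero-argument form:

    // Current behaviour: get topics and let this page contribute to the
    // user's top topics for the epoch.
    const topics = await document.browsingTopics({ addCurrentTopics: true });  // hypothetical option

    // Proposed alternative: get topics without letting this page's generic,
    // not commercially relevant topics influence the user's future top 5.
    const topicsOnly = await document.browsingTopics({ addCurrentTopics: false });  // hypothetical option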

Cover dynamic pricing use case

Topics may be useful for retail, travel, and other sites to identify more or less price-sensitive users.

Existing sources of data for dynamic or personalized pricing create a risk for the seller that they are inadvertently selecting members of protected groups for higher prices. However, Topics are intended to be non-sensitive (#4) so could be practical to use for dynamic pricing in cases where other data sources are not.

Personalised Pricing: The Demise of the Fixed Price?, by Joost Poort and Frederik Zuiderveen Borgesius, covers some of the incentives for retailers to adopt personalized pricing systems.

Price discrimination can benefit both buyers and sellers, leading to an increase of both consumer and producer welfare. Price discrimination can help the seller to recoup his fixed costs without losing many potential customers and make a good or service accessible to buyers with a smaller purse, even if it will lead to higher prices for other customers.

(Based on a previous issue from a previous proposal: WICG/floc#105 )

Related issues: #25 #42 #77 #82

Topics performance and measurement

With FLoC, it was possible to measure ad performance, as the FLoC ID was the same on the publisher and the advertiser website. It was less efficient than with cookies, and less precise, but it was at least possible to do something at the cohort level.

How do topics interact with the different measurement proposals? The topics will not be the same across the publisher and advertiser websites.

Have you thought about how measurement is supposed to work with topics, especially during the discovery phase (which topics are best suited for a given ad)?

Possible topics security model flaws

Related to #38, #11 and #30.

If a publisher is not allowed to see all the topics returned, it seems a new entity will arise which has the responsibility of reflecting topics back to the publisher, or SSPs will do so as a condition of integration.

There are not a lot of obvious ways to restrict the topics from being shared amongst callers. One way seems to be to isolate the topics call inside an iframe. However, that would appear to break the bid request: how would the SSP correlate the topics to it?
Another is to append the topics to a header in the bid request. In either case, the SSP has the opportunity to reflect the information back to the publisher, and the publisher can accumulate the topics and deliver all of them into OpenRTB.

This seems to defeat the purpose of limiting the information available to each caller, as the publisher is able to easily determine a lot more information about the user than they have now in the world of 3PC.

Finally, the publisher is now exposed to an enormous incremental security and performance risk profile if all Topics callers must run as third-party JS on their page. Publishers typically limit the number of parties that are allowed to do this prior to an auction. SSPs through advertisers are not able to run code unless they win. Typically only a video player, a header bidding wrapper, and an ad server have the privilege. It seems that in the world of Topics, publishers will be incentivized to run as many third parties as possible to try to get every conceivable topic, or to ensure they send out a bid request that includes all possible topics.

Apart from security, it seems this will also contribute to or entrench the existing bid jamming problem in ad tech. If a user is a travel or finance enthusiast, the publisher is encouraged to basically spam the Topics API with different third-party callers (and, as a result, the bid stream) until they can be reasonably assured they would have gotten that topic and its high-value bid back.

aggregation using topics API

Hi there, thanks for sharing this awesome proposal. I have one quick follow-up question regarding how to do aggregation using this API.
In the original FLoC proposal, each website was able to view a user's cohort ID. One benefit I can think of is that the website could learn which groups of users visited its site most frequently. With this aggregate data, the advertiser could adjust its bidding strategy for users from different groups.
For example, it could bid higher in the auction for users from a group that visited its site most frequently.

In this API, is there any plan for making such aggregation possible as well?
One option I can think of is that the advertiser also fetches the topics when the user visits its own website, and uses the topics as an identifier of that user instead of the cohort ID.

Contextual Ads vs Topics

Why not just allow contextual targeting as it exists today? Once a user exits a webpage, the topic they browsed is no longer relevant, so there is no need to track the user. What added advantage does the Topics API provide?

Exchange Support

The Topics API restricts learning about topics to those callers that have observed the user on pages about those topics.

It sounds like, if a publisher issues an ad request to an ad-exchange server, then DSPs participating in the exchange can only receive topics known to the ad exchange?

I wonder if browsingTopics() can be extended to take an array of "reader" domains and respond with a mapping from each reader domain to the set of topics known to that reader domain, with each reader's set of topics encrypted with a public key published by that reader at a /.well-known/ path?

With this approach could the x-origin iframe also be avoided?
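
A sketch of what such an extension might look like from the caller's side; the option name, response shape, and key-publishing path below are assumptions for illustration only:

    // Hypothetical extension: request topics on behalf of several "reader" domains.
    const perReader = await document.browsingTopics({
      readers: ['dsp-a.example', 'dsp-b.example']   // hypothetical option
    });
    // Hypothetical response shape: each reader's topics encrypted with the public key
    // that reader published at a /.well-known/ path, e.g.
    //   { 'dsp-a.example': '<ciphertext only dsp-a.example can decrypt>',
    //     'dsp-b.example': '<ciphertext only dsp-b.example can decrypt>' }
    // The ad exchange could forward each blob in the bid request without being able
    // to read it, which is what would make the cross-origin iframe unnecessary.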

unfair restriction on topics filtering per caller

It seems quite unfair to limit the list of topics returned to a caller based on that caller's presence on sites mapped to those same topics. This will create a large entry barrier for smaller actors, who will have a hard time accessing less common topics, versus large actors, such as Google, who already benefit from a very large footprint and won't be limited at all.
It also seems odd to defend that mechanism on the basis of not providing more data than cookies would, when the same principle is not enforced on other proposals. For example, the conversion measurement API will provide cross-device functionality, which wasn't possible using third-party cookies. One could argue that this is more privacy-invasive as well.
Third-party cookies shouldn't be used as the benchmark for privacy. Rather, we should consider whether such a feature follows users' privacy expectations, or regulatory principles such as the GDPR.
It seems it would be more confusing for a user to know which topics they belong to but not which callers can access them, versus simply providing all callers with the same level of access.

Will we be able to limit callers?

Right now it appears anyone who can drop an iframe on a page is able to become a caller on my domain. This would include anyone who currently drops user syncs or potentially even advertisers.

As a publisher, will we be able to limit domains that are allowed to be callers and get access to browsingTopics on our site?
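
One direction, assuming the API ends up gated behind a Permissions Policy feature (Chrome's implementation exposes one named browsing-topics, though treat the exact syntax below as illustrative), would let the publisher restrict which contexts may call it:

    Permissions-Policy: browsing-topics=(self "https://trusted-ssp.example")

    <!-- Or per frame, so an arbitrary iframe dropped on the page cannot call the API: -->
    <iframe src="https://unknown-vendor.example/slot" allow="browsing-topics 'none'"></iframe>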

Privacy risk: can third-parties distributed and colluding on enough sites still infer top user topics? Can this lead to risk of unicity? I.e. allow tracking individuals

Suppose there are P third-parties {p1, p2, p3, …} who each have their code available to call the Topics API on many of the sites users visit. And suppose they either share information, or are even the same higher level party. I.e. ad-tech company A has servers p1,p2,p3,... all calling the Topics API on each of these sites. Each p calling the API on a given site sees a random 3 of the 5 top topics, with a 5% chance of the random topic. With enough simultaneous calls from p1,p2,p3…, they can learn the top 5 topics for that user by what is probabilistically returned.

For each site they might then create a pseudo-identifier that concatenates the top topics. E.g. If they learn the top topics for the user returned by that site are t1,t2,t3,t4,t5, then they might construct a string "t1-t2-t3-t4-t5".

Assuming these third parties are well distributed across the sites with the various topics a user visits, they might have full access to the user's topics. And they might then be able to gather a consistent pseudo-identifier for a user across the sites they visit ("t1-t2-t3-t4-t5").

Such pseudo-identifiers will likely be shared across many users who share the same interests, in any given week. Yet with 350 topics, there are 350**5 top topic combinations. Could some be unique?
Even if this kind of unicity is rare, the top topics change each week/epoch, and after a sufficient number of weeks, the sequence of pseudo-identifiers collected across these weeks might uniquely identify users.
i.e. this could allow the kind of cross-site tracking that removing third-party cookies is meant to do away with.
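
A quick back-of-the-envelope check of the space of such weekly pseudo-identifiers, using only the 350-topic taxonomy and top-5 figures already mentioned above:

    const numTopics = 350;
    const orderedTuples = numTopics ** 5;                  // ≈ 5.25e12 ordered "t1-t2-t3-t4-t5" strings
    const unorderedSets =                                  // C(350, 5) ≈ 4.25e10 distinct top-5 sets
      (350 * 349 * 348 * 347 * 346) / (5 * 4 * 3 * 2 * 1);
    // Both numbers dwarf the number of browser users, so some weekly top-5 sets could
    // indeed be rare or unique, and a sequence of such sets across epochs narrows
    // users down much further.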

An assumption here is that the separate calls by p1,p2,p3,... for when a user visits a domain can be connected by the caller. This might be done by combining the timing of the calls with fingerprinting data.

This is a bit convoluted and makes a lot of assumptions. Are any of the assumptions and privacy concerns valid?

Opportunity to reduce complexity: Topic API restrictions

Not every API caller will receive a topic. Only callers that observed the user visit a site about the topic in question within the past three weeks can receive the topic. If the caller (specifically the site of the calling context) did not call the API in the past for that user on a site about that topic, then the topic will not be included in the array returned by the API.

While I appreciate the spirit of what you're trying to achieve here, I think in practice this restriction won't amount to much other than making the proposal more difficult to read and, for browsers, to implement. Here's why I think that:

Ad Request Flow

  • In a typical ad request you have multiple technology provider types in play. For the sake of simplicity I'm only going to mention the two that matter for this particular issue: SSPs and DSPs
  • Most SSPs now integrate with sites via header bidding which is client side code that will be able to read this Topics API (side note: the leading header bidding library, Prebid, will almost assuredly take it upon itself to query the Topics API, and because of the broad deployment of Prebid and limited Topics taxonomy it is likely that all Topics would be covered on day one of this API going live)
  • SSPs then take what they know about the site/visit, whether supplied by Prebid library or their own header bidding wrapper, which would now include the Topic(s) shared by the API, and use it to populate an ad request out to one or many DSPs. The SSP is incentivized to get as many bid responses as possible to maximize monetization potential for the publisher and therefore would include the Topic(s) supplied to it by the API knowing that in the future many DSP customers will include targeted Topics in their campaigns
  • Those DSPs may or may not have called the Topics API on a site with the Topic now carried in the ad request. The SSP has no current way of knowing that and the browser doesn't have any idea what DSPs will be called.
  • The DSPs on the other end of the request now have the Topic(s)

Ad Response Flow

  • After a few days of operation at any scale it is likely that a DSP or other buy-side tech that is only delivered to a site via an ad (or pixel on advertiser's site) will have seen all of the several hundred Topics via its ability to call the Topics API once on page
  • If the restriction was intended to reduce the amount of parties in the ecosystem that could see all Topics nearly all the time then it only achieves this for likely a few minutes or at most hours at each site/user/Topic refresh

Again, I appreciate the spirit of wanting to limit what could be known about a given site/user/Topic to what is readily observable by an API caller, but the realities of the data flows and systems in the ecosystem mean that the restriction doesn't hold up well. Thus I am suggesting this piece of the proposal be revisited.

Bigger picture, companies landing on sites/apps via the ads themselves are one of the more complicating factors of privacy and data protection. Restrictions here probably are better delivered via things like Fenced Frames.

Clarify colluding sites case

If the user is known to be the same across colluding sites (e.g., because they’re logged into each with a persistent identifier), then it is possible for those sites to join their topics for the user together. This could also be achieved via adding topics to URLs when navigating between cooperating sites.

This analysis needs to be clarified.

Is the idea that the two sites already have the same identity for the user, e.g. the same registered email address? If so, they can join topics on the backend.

However, if this analysis is just pointing out that two sites can collude to join topics, then the following should be reflected in the text:

  • It doesn’t matter if the user is logged in or has registered with PII. The colluding sites only need a persistent identifier such as a unique ID in a cookie. In fact, only one of the sites needs such a persistent identifier.
  • Sharing via navigational tracking should be the only way to collude under the assumption of full partitioning, no network layer tracking, and no shared user ID or PII. I.e. the “also” in “this can also be achieved” should be dropped since it’s not presenting an alternative but rather the way this attack can be carried out. If you are considering other ways they can collude such as IP address tracking, that should be highlighted.
  • The sites don’t need to collude. Only one or two parties with scripting powers over both sites is/are needed. A social network with script running on other sites can do it by themselves (i.e. no real collusion) or two ad tech vendors, one on each site, can collude.

It’s possible that you cover the above elsewhere and really intend to talk about the shared PII case here. If so, that needs to be clarified.

Classifier corpus

As far as I can envision, there are three high-level sets of data (corpora) for a given site that could be used by the classifier. In all cases, the output of the classifier might produce multiple strong signals about what the content of the site is (and "strong" will also need to be defined).

  1. Looking at the homepage of a site and using the content/other signals there to determine the topics for the site
  2. Looking at all of the content on a site and using that content/other signals to determine the topics for the site
  3. Looking at all of the content on a site and weighting it by usage and then using that content/other signals to determine the topics for the site.

While the first and second options might be appealing methods as they are simple, they probably will give a very inaccurate view of the content of many sites. I think that the third would give the most accurate view of what a site is actually about.

API for browser extensions?

As a browser user, I might choose to install an extension that will

  • add a topic to my top 5 topics
  • remove a topic from my local topic information store, if that topic is present (otherwise do nothing)

It looks like the extension API should prevent extensions from seeing the user's topics or deducing any information about them, to limit incentives to submit malicious extensions.

Related: #78

Is this proposal ready for W3C privacy review?

Hi folks!
I'm an invited expert of PING (the W3C Privacy Interest Group).
I've heard that Topics is the replacement for FLoC.
Is it ready for privacy review now, or is it too early?

Thanks!

Expected volume of Topics Origin Trial

One major drawback of the FLoC Origin Trial was the limited volume observed, to the point that drawing any meaningful conclusion was a challenge. What will be the expected volume of the future Topics API OT, and in particular:

  • What percentage of qualified Chrome users will have topics computed?
  • Will the experiment run on production Chrome versions (i.e. representative traffic), rather than only Beta versions?

What is the performance impact?

This topic came up in the Web-Adv W3C group and I do not see an issue addressing it here. What is the performance impact of this system in terms of delaying the time to first ad call, if the ad system depends on calling this API?

Proposal: make browser API return topics as opaque blobs that are able to be decrypted only by a CSP granted server

I'm concerned about allowing the browser JavaScript context direct access to the topics. There would be no way to ensure that the topics are delivered only to the parties the origin really intends, or that they are not modified to disrupt behavior. For instance, if I have a CSP that allows a CDN, that CDN could scrape the topics without the origin knowing about it. It might make sense to return a blob that's opaque to the JavaScript context and can be decrypted by an origin server given the key as a restricted header. Example below.

  1. The origin server (origin.example.com) issues a CSP restricting access to the topics requested through document.browsingTopics() to itself only.
  2. The JavaScript performs await document.browsingTopics(); and retrieves an opaque blob that is encrypted for the origin server only.
  3. The JavaScript attempts to fetch('https://anotherorigin.example.com', { includingTopics: true, body: JSON.stringify({ topics }) }).
  4. The user agent detects that anotherorigin.example.com is not permitted to access the topics and either omits the header carrying the decryption key or fails the fetch call outright.
  5. The JavaScript attempts to fetch('https://origin.example.com', { includingTopics: true, body: JSON.stringify({ topics }) }).
  6. This time it is permitted by the user agent. The user agent includes a Topics-Key header in the request, and its value can be used by the receiving server to decrypt the topics.
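
Putting the steps above into code, as a sketch of the proposal; the includingTopics fetch option and the Topics-Key header are the hypothetical mechanisms described in the steps, not existing APIs:

    // Steps 1-2: the page script obtains an opaque, encrypted blob instead of readable topics.
    const topics = await document.browsingTopics(); // opaque blob under this proposal

    // Steps 3-4: sending it to an origin outside the CSP grant; the user agent withholds
    // the decryption key (or fails the request outright).
    await fetch('https://anotherorigin.example.com', {
      method: 'POST',
      includingTopics: true,                 // hypothetical option
      body: JSON.stringify({ topics })
    });

    // Steps 5-6: sending it to the permitted origin; the user agent attaches a Topics-Key
    // header whose value lets origin.example.com decrypt the blob server-side.
    await fetch('https://origin.example.com', {
      method: 'POST',
      includingTopics: true,                 // hypothetical option
      body: JSON.stringify({ topics })
    });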

Topic Aggregation and Ranking

The Topics API involves ranking the "top topics" for a user's browsing activity in one epoch. How would the API aggregate and rank a user's browsing activity to find the top 5 topics described in the explainer?

Currently the explainer says https://github.com/jkarlin/topics#specific-details: "at the end of an epoch, the browser calculates the list of eligible pages visited by the user in the previous week" and "the topics are accumulated".

More specifically, can you clarify:

  1. If a user visits the same site multiple times, does that increase the ranking for the site's topics? If so,

    • Over what time period?
    • What counts as a separate visit?
    • Is there some maximum saturation?
  2. Will the classifier output be one-hot (unit-weight) or some fractional weight? Having weights may make the classification seem less arbitrary.

  3. Will the classifier output take the taxonomy hierarchy into account? For example, if the site is classified as /Computers & Electronics/Consumer Electronics/Cameras & Camcorders, we can also include /Computers & Electronics/Consumer Electronics and /Computers & Electronics. This also should apply for per-caller eligibility.

  4. How will the weights (unit or otherwise) be aggregated?

    • Could we consider some TF-IDF based approach? The IDF could be implemented as a rescaling of the classifier output weight. Using a global IDF would allow the topics that are more likely to be useful to ad tech to rank higher.
    • Can we consider deduping/removing redundant topics? For example, if the more specific /Computers & Electronics/Consumer Electronics/Cameras & Camcorders is in the top 5, then it implies the more general /Computers & Electronics. Having a more diverse top 5 should improve the long-term utility.

One concern with the Topics API is that users' top topics will be dominated by broad, signal-poor topics. Choosing the taxonomy and weighting algorithm thoughtfully may help.
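
To make the weighting ideas in points 3 and 4 concrete, here is a minimal sketch assuming the classifier emits per-topic weights and that a global per-topic document-frequency table is available; all names and the exact formula are illustrative, not a proposal for the actual implementation:

    // Rescale a classifier weight by inverse document frequency so that broad,
    // signal-poor topics rank below more specific, rarer ones.
    function idfWeight(weight, topicSiteCount, totalSites) {
      return weight * Math.log(totalSites / (1 + topicSiteCount));
    }

    // Expand a specific topic to its ancestors (useful for per-caller eligibility),
    // so that redundant ancestors can also be deduped out of the final top 5.
    function ancestors(topic) {
      // "/Computers & Electronics/Consumer Electronics/Cameras & Camcorders"
      //   -> ["/Computers & Electronics", "/Computers & Electronics/Consumer Electronics"]
      const parts = topic.split('/').filter(Boolean);
      return parts.slice(0, -1).map((_, i) => '/' + parts.slice(0, i + 1).join('/'));
    }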

Reach issue

Since a user can have up to 15 topics computed by the browser, and only 3 are returned on a given site, doesn't that mean that advertisers interested in a specific topic will see their reach divided by 5 compared to today's cookie-based approach?
For example, a user has the "sport" topic. As per the current design, only 1 site out of 5 (on average) will get access to the "sport" topic for that user.
If an advertiser targets the sport topic, it can therefore reach that user only on the sites that got access to the sport topic.
Isn't there a risk of severely impacting the advertiser's reach?

Should sites be able to set their own topics via response headers?

The classifier is likely to be wrong from time to time, and sites might wish to adjust the topics returned for their site. One way to accomplish that is to allow sites to set their own topics via response headers.

The concern with this is that sites might decide that some topics are more valuable than others and list only the valuable ones, polluting the input to the API. How real is this risk?
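
If this were adopted, a site might declare its preferred topics with something like the following; the header name and value format are purely hypothetical, and the browser would remain free to ignore them or cross-check them against its own classifier:

    # Hypothetical response header on pages of a sports publisher:
    Browsing-Topics-Suggestion: "/Sports", "/News"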

Are hostnames for topics ideal?

We propose choosing topics of interest based only on website hostnames, rather than additional information like the full URL or contents of visited websites.

This is a difficult trade-off: topics based on more specific browsing activity might be more useful in picking relevant ads, but also might unintentionally pick up data with heightened privacy expectations.

Let's assume subdomains in this context are separate hostnames and we're using the terms interchangeably.

If publishers want to serve more relevant advertising, they're incentivized to send narrower topics. And if they are rewarded for sending narrower topics, then they're compelled to granularize their site into as many subdomains as possible without suffering traffic losses.

Making subdomain-heavy site architecture monetarily advantageous for publishers raises some concerns. @gui-poa voiced this in another issue. A few of mine:

  1. Most mainstream CMSs sit on a single subdomain and allow for content creation/organization via directories only. "Multisite" options exist, but are typically an enterprise solution. In this way, large publishers who can engineer their own hostname-first CMS and afford devOps to manage multiple name systems may be afforded an unfair advantage over small/independent publishers

  2. Subdomains have their taxonomical role in isolating use-cases (support.example.com) and localization (es.example.com, de.example.com)… but in my experience, using them for subcategorization (i.e. arts.example.com) complicates breadcrumbs, wrecks sitemaps, dirties or breaks analytics, and causes cross-origin problems

  3. The SEO debate around subdomains vs subfolders is tired and isn't worth rehashing here (just search "subdomains SEO")

Nightmare scenario: Publisher sites link internally to pages hosted on granularly categorized subdomains to appease the Topics API, all of which are cross-canonicalized to a single subdirectory-first site that the publisher feels is better optimized for search engines. Users only ever see this subdirectory site as a landing page... it exists for bots.

What alternatives are there to the proposed "one topic per hostname" rule? Can one respect heightened privacy expectations (avoiding hyper-targeted, mappable Topic API sends) without nudging publishers toward subdomains?

Please consider opt in instead of opt out

Hei,

I can read "The Topics API will have a user opt-out mechanism". I would strongly advise going with opt-in instead of opt-out, in keeping with the stated privacy goals.

Just a note that opt-out is very much not compatible with the GDPR:

Consent should be given by a clear affirmative act establishing a freely given, specific, informed and unambiguous indication of the data subject's agreement to the processing of personal data relating to him or her, such as by a written statement, including by electronic means, or an oral statement. This could include ticking a box when visiting an internet website, choosing technical settings for information society services or another statement or conduct which clearly indicates in this context the data subject's acceptance of the proposed processing of his or her personal data. Silence, pre-ticked boxes or inactivity should not therefore constitute consent.
https://eur-lex.europa.eu/eli/reg/2016/679/oj

FLoC was opt-out (and used the ad-blocking EasyList to track people for ads...), so it couldn't be enabled in Europe.

Fruit example clarification

Using your example where the top five topics for the first week are:

Top Topic  | Parties That Can Learn About the Topic
Apples     | T, R, S
Bananas    | S
Cantaloupe | T, S
Emblica    | S
Grapes     | T, R, S

and that the user browses primarily the same (types of) sites in subsequent weeks so that the 5 topics (and parties that can learn about them) are identical for weeks 2 and 3, then is the following correct (ignoring the 5% random topic)?

  1. The topic selected for a site in the first week has a 20% chance of being selected again in the second week and a 4% chance of being selected for weeks 1, 2 and 3. In this case, the array will have a single value.
  2. Because S can learn about all topics, they are guaranteed to see topics in their array. At week three they have a 96% chance of seeing at least 2 distinct values.
  3. R has a 40% chance of seeing a value (Apples or Grapes) for the first week, a 78.4% chance of seeing Apples or Grapes in at least one week, and a 19.2% chance of seeing both Apples and Grapes by week 3.
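
For what it's worth, a quick sketch of the arithmetic under the stated assumptions (one topic drawn uniformly from the same 5 each week, identical observer sets, random topic ignored):

    const pRepeatWeek2  = 1 / 5;                               // 1: 20% chance week 2 matches week 1
    const pRepeatAll3   = (1 / 5) ** 2;                        // 1: 4% chance weeks 2 and 3 both match week 1
    const pTwoDistinctS = 1 - (1 / 5) ** 2;                    // 2: 96% chance S sees at least 2 distinct values
    const pRWeek1       = 2 / 5;                               // 3: 40% chance R gets Apples or Grapes in week 1
    const pRAnyWeek     = 1 - (3 / 5) ** 3;                    // 3: 78.4% chance of at least one hit across 3 weeks
    const pRBoth        = 1 - 2 * (4 / 5) ** 3 + (3 / 5) ** 3; // 3: 19.2% chance of both, by inclusion-exclusion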

Site-seeded topics

The topics will be inferred by the browser. The browser will leverage a classifier model to map site hostnames to topics. The classifier weights will be public, perhaps built by an external partner, and will improve over time.

As others have already pointed out, this poses a challenge for sites that may not have descriptive hostnames, or span a wide array of topics under the same hostname (e.g. a publisher covering sports, business, entertainment, etc), a merchant with a large catalog of items (home goods, clothing, etc), and so on. Given that the current proposal considers hostname, not just domain name, this might create pressure for sites to adopt more subdomains to help with classification (e.g. sports.pub.com, homeware.shop.com, ...), but that's a costly undertaking with its own side effects.

Separately, there are open questions on misclassification (#2) and the ability to set (#1) topics.

My hunch is they're all semi-related and we could, perhaps, try to address them by enabling sites to "seed" a set of suggested topics. Going down this route would effectively translate the current proposal into a weakly-supervised classifier model: it doesn't make strict guarantees about the outcome of the classification but allows the site to influence and provide input signals.

More concretely, the rough model here could be...

  1. Site suggested topics MUST be from a set of valid topics
  2. Site suggested topics MAY differ across pages of the hostname

By restricting suggested topics to the predefined list we're not introducing any new labels, segments, etc. At the same time, enabling sites to provide page-level scoped topics would, I think, address the challenge for multi-topic sites. For example, a publisher or merchant could advertise relevant topics for each section of their site (which paths and pages get which topics is controlled by the site owner). Downstream, the browser can introspect the page-level browsing history of the visitor, build an aggregate count of topics observed for the visitor, apply its own filters/validation, and feed the resulting set as input into the classifier model.

As noted above, this makes no strict guarantees about the final output of the classification, but it enables the site to make suggestions, browser to audit/filter suggestions, classifier to act on suggestions. The net result is that a reader who spends most of their time on the sports section of pub.com, or a buyer on the homewares section of a large merchant, might then receive a relevant classification for the {site, user} tuple.
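
To make the shape of this concrete, one purely illustrative way a site could surface page-level suggestions is shown below (the markup is an assumption; the proposal above deliberately leaves the mechanism open). The browser would aggregate these per-page hints across the visitor's history, filter them, and feed them to its classifier:

    <!-- On pages under pub.com/sports/... -->
    <meta name="browsing-topics-suggestion" content="/Sports">

    <!-- On pages under shop.com/homeware/... -->
    <meta name="browsing-topics-suggestion" content="/Home & Garden">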

Product sale only, or is job advertising considered?

The taxonomy_v1 file proposes interests which seem to work just fine for product sales. Do you have any thoughts on how topics may work for job advertising? The browsing pattern that emerges when a user is in the middle of a career or job change is not based on a categorization of interests alone.

What topic taxonomy should be used long term?

  • Who should create and maintain it?
  • Eventually it would be good if this was produced externally to the browser and became an industry standard.
  • The taxonomy should be publicly available for transparency.
  • If the number of topics increases, we’ll need to balance that with the ability of sites to observe topics (e.g., if there are more topics, there is less of a chance that an ad tech has seen the chosen topic in the past).

Add a random percentage of empty responses

If the user opts out of the Topics API, or is in incognito mode, or the user has cleared all of their history, the list of topics returned will be empty...

Seeing how sites now try to detect incognito mode and/or ad blockers and give users of those modes a substandard experience, consider making empty responses more normal by providing them in perhaps 0.5%-1% of cases. That might be high enough to discourage the provision of substandard experiences.

publish tools and datasets to enable external researchers to evaluate proposal w.r.t. privacy

It's great that this proposal is incorporating public feedback.

It would be even better if this proposal was published with a set of tools and datasets for external researchers and the web community to better evaluate the proposal with empirical tests.

For example, a colleague and I recently did a post-mortem analysis of FLoC:
https://arxiv.org/pdf/2201.13402.pdf
Our analysis required us to re-implement FLoC and to leverage a proprietary dataset of browsing histories. These hurdles make analysis by and for the public inaccessible to many external researchers and community members.

It would be helpful for the Chrome developers to publish tools, example datasets, and code so that their proposals can be more easily interrogated by researchers. Will such be made available?

Aggregate Conversion Reporting

Assuming that an aggregate conversion reporting mechanism is also supported by the browser, when a user converts on an advertiser's site, then for each ad viewed and/or clicked the aggregation report should include:

  • the number of times each ad and/or campaign ID was viewed/clicked (not new to this proposal)
  • the topics that were shared on each site where an ad was displayed, with a count of the number of times each topic was present
  • if not too unique, the number of times each topic was present for each ad/campaign

This will allow the advertiser to understand the value of each topic or at least topics that are common for people who later convert on their site.
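
As an illustration of what such an aggregated report might contain (the field names and numbers are made up, and the aggregation mechanism itself is out of scope for this issue):

    {
      "campaignId": "example-campaign",
      "conversions": 1200,
      "adViews": { "creative-a": 540, "creative-b": 930 },
      "topicCountsOnImpressionSites": {
        "/Autos & Vehicles": 310,
        "/Arts & Entertainment": 95
      }
    }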

Privacy Risk: Aggregation and Household Tracking

It sounds to me like this will make it easier to track households, and through that: individuals.

I'll give an example:
Companies A, B, and C have access to the topics API on different sites (along with passive "PII" like IP address, browser version, etc.). They then send this data to the open auction (RTB stream).
At this point, companies M and N bidding on the auction can aggregate the topics from A, B, and C. They also group it with other topics from the user's home IP address (for example, from another device, roommate's device, etc.). Side note: all this info together should make it relatively "easy" to keep tracking the household through daily IP changes: multiple devices + browsers + top topics.

Now, another party, Company X wants to know whether they should bid on an ad placement from a specific IP. They ask companies M and N what that IP's top topics are, for a price. The result is that they get a list of the top 10+ topics of all users at that IP. For a larger price they can filter by device info.
I'll make that last point clearer: any advertiser/third-party can have a pretty good understanding of any residential address' topics ("browsing history"). Not just the top 5.

Also, if a household's aggregate data is "unique enough" compared to other households, even a user in the household who opts out of tracking can be classified/targeted based on passive data.

Minimum domain activity before classification

A domain should have to be viewed by some minimum number of unique users in order to be used for classification, particularly if the associated Topic is very low volume. For example:

if a user visits bluesmusic.com, but that domain only has 100 unique visitors in a week, that domain should not be used to classify into the Blues Music topic.

Very low usage domains may not be strong enough signals to the classifier to accurately represent a topic, and therefore should be disqualified.

Enable a site to set an optional section name

Allow callers to specify a section name that the classifier can use to develop a topics list, to improve personalization for users of large, multi-topic sites. Callers could populate the section name in Topics API calls using the existing schema.org articleSection property already in use.

If the topic list is per-hostname, a user of a large general-interest site may receive inadequate personalization compared to a user of multiple niche sites with only a few topics per site.

A section can be any subdivision of a site, including a "channel", "group", or "space".

This is separate from the question of allowing publishers to specify individual topics. The publisher-provided "section" is just an identifier applied to a subset of pages on that site, and the actual topics for pages in that section would still have to be determined by the classifier.
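
For reference, the schema.org property mentioned above is already expressible in page markup, for example via JSON-LD; how that value would then be passed into a Topics API call is an open design question, so only the markup side is sketched here:

    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "NewsArticle",
      "headline": "Transfer window roundup",
      "articleSection": "Sports"
    }
    </script>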

Should the random responses be subject to filtering?

The random response for an epoch is drawn uniformly at random from the full taxonomy (though we should probably remove the 5 topics for the epoch). Should that random response have been seen before by the caller in order to be returned?

If yes: Then the plausible deniability is limited to whether or not this was a top topic for the caller.

If no: Then the plausible deniability is increased as the caller cannot 100% know that they actually observed the topic for the user. On the other hand utility drops some since there is more noise in the system.

Minimum user activity before classification

As documented, the Topics API has no particular minimums set about how many page visits/unique topic visits a user needs to have before they are classified. To prevent users from being classified into Topics with too little data, minimums should be put in place. For example:

In a week, if the user has not had at least 50 page visits, they cannot be classified into any topics.
In a week, if the user has not had at least 10 page visits to a particular topic, they should not be classified into that topic.

Those might not be exactly right, but some minimum usage component should be built into the system.

Sites ad funding goal

The explainer says that the Topics API is meant to support interest-based advertising to display more relevant ads, "helping to fund the sites that the user visits".

Compared to the same kind of advertising with third-party cookies, what's the goal in terms of sites funding level?

Reconciliation of usage -> topics realities

As a component of the testing of the Topics system, some reconciliation should be done to ensure that each topic is of material size to be useful AND that there are enough domains/usage to qualify a large enough group of users. This would be above and beyond any privacy requirements.

For example, there appear to be very few domains that could cleanly map to the "Blues Music" topic. If origin trials and other testing prove that fewer than 50,000 users might be qualified into that segment, it should be removed from the taxonomy.

50,000 users might be enough to guarantee privacy, but is likely not enough to be useful from an advertising perspective. Since the number of segments is going to be low (in the hundreds, in theory), each one should be of maximal benefit from an advertising perspective, and very small segments will be of limited value overall. So in the example, Blues Music would be removed and replaced with another topic that might achieve higher scale.

Clarification regarding multiple callers on a site

The explainer says:

Whatever topic is returned, will continue to be returned for any caller on that site for the remainder of the three weeks.

Could you please tell if the following example is correct:

  • AdTech A and AdTech B are two third parties integrated on Site S
  • A user lands on S and both A and B call the Topics API on page load
  • Whichever call is resolved first sets the topics to be returned for that week (on that site and for that user)
  • Let's say the call made by A is resolved first and the topics Banana, Apple, and Grapes are returned. If B is only eligible to get Banana, B's call will return only Banana, and no other B-eligible topic that the user might have

Economics of Topics spec, and exchange/buyer mismatch

Hey,

I think this spec is quite a big step in the right direction with regard to privacy. But there are a few items I'd like clarification on, specifically around the economics that a spec like this creates.

Today it's possible to receive a variety of topics, as well as to buy using third-party and first-party data. However, in the near future all of this will go away, replaced by the Privacy Sandbox, at which point a lot of demand will be concentrated in specs like this one. Having just hundreds of categories will mean that all buyers have a limited set of topics to buy from. That wouldn't be too bad an issue if each buyer could see a different set of categories, as the spec implies is possible; however, since buying is still done through exchanges or SSPs, in practice all buyers will see the exchange's view of the topics, and everyone in the same exchange will receive the same 3 topics.

At the scale we're all operating at, 100s of possible categories and just the same 3 topics could end up causing significant price inflation, and that typically is something that favors bigger advertisers with bigger budgets capable of sustaining higher CPMs.

Second, and still connected: a buyer ad tech should be able to see topics while the browser is visiting the advertiser's site, and could use this data to determine which topics are correlated with certain actions. However, this set of topics will not match what is available from the exchange/SSP because of the different install base. It's possible, for example, that a big exchange like Google's will have a lot of generic news or shopping sites that capture the top sites for a given browser, while a small ad tech vendor will instead see a lot of nuance in their topics. That nuance will be erased due to differences in footprint, further compressing demand onto the most generic topics and increasing their price, not just by reducing the number of topics but also by reducing their variability.

I don't know how someone would build a solution that would work for you as well.
