privacysandbox / aggregation-service
This repository contains instructions and scripts to set up and test the Privacy Sandbox Aggregation Service.
License: Apache License 2.0
Hello!
I am following the guide outlined here: https://github.com/privacysandbox/aggregation-service/blob/main/docs/gcp-aggregation-service.md#adtech-setup-terraform
And I am now at the stage where I am trying to deploy the individual environments:
GOOGLE_IMPERSONATE_SERVICE_ACCOUNT="aggregation-service-deploy-sa@ag-edgekit-prod.iam.gserviceaccount.com" terraform plan
However I am faced with this error:
╷
│ Error: invalid value for member (IAM members must have one of the values outlined here: https://cloud.google.com/billing/docs/reference/rest/v1/Policy#Binding)
│
│ with module.job_service.module.autoscaling.google_cloud_run_service_iam_member.worker_scale_in_sched_iam,
│ on ../../coordinator-services-and-shared-libraries/operator/terraform/gcp/modules/autoscaling/workerscalein.tf line 104, in resource "google_cloud_run_service_iam_member" "worker_scale_in_sched_iam":
│ 104: member = "serviceAccount:${var.worker_service_account}"
│
╵
╷
│ Error: invalid value for member (IAM members must have one of the values outlined here: https://cloud.google.com/billing/docs/reference/rest/v1/Policy#Binding)
│
│ with module.job_service.module.worker.google_spanner_database_iam_member.worker_jobmetadatadb_iam,
│ on ../../coordinator-services-and-shared-libraries/operator/terraform/gcp/modules/worker/main.tf line 98, in resource "google_spanner_database_iam_member" "worker_jobmetadatadb_iam":
│ 98: member = "serviceAccount:${local.worker_service_account_email}"
│
╵
╷
│ Error: invalid value for member (IAM members must have one of the values outlined here: https://cloud.google.com/billing/docs/reference/rest/v1/Policy#Binding)
│
│ with module.job_service.module.worker.google_pubsub_subscription_iam_member.worker_jobqueue_iam,
│ on ../../coordinator-services-and-shared-libraries/operator/terraform/gcp/modules/worker/main.tf line 104, in resource "google_pubsub_subscription_iam_member" "worker_jobqueue_iam":
│ 104: member = "serviceAccount:${local.worker_service_account_email}"
│
╵
I am new to Terraform and have not been able to find a way to log the values of serviceAccount:${var.worker_service_account} and serviceAccount:${local.worker_service_account_email}.
Any help here would be greatly appreciated!
EDIT: The output below seems to show that the Terraform state does correctly store the two service accounts created in the adtech_setup step.
terraform state show 'module.adtech_setup.google_service_account.deploy_service_account[0]'
# module.adtech_setup.google_service_account.deploy_service_account[0]:
resource "google_service_account" "deploy_service_account" {
account_id = "aggregation-service-deploy-sa"
disabled = false
display_name = "Deploy Service Account"
email = "aggregation-service-deploy-sa@ag-edgekit-prod.iam.gserviceaccount.com"
id = "projects/ag-edgekit-prod/serviceAccounts/aggregation-service-deploy-sa@ag-edgekit-prod.iam.gserviceaccount.com"
member = "serviceAccount:aggregation-service-deploy-sa@ag-edgekit-prod.iam.gserviceaccount.com"
name = "projects/ag-edgekit-prod/serviceAccounts/aggregation-service-deploy-sa@ag-edgekit-prod.iam.gserviceaccount.com"
project = "ag-edgekit-prod"
unique_id = "106307936135287037408"
}
Hello,
Currently, the aggregation service sums the values on the set of keys declared in the output domain files. This explicit declaration of keys means that the encoding must be done well at report creation time (e.g. on the source and trigger side for ARA, or in Shared Storage for the Private Aggregation API). This is quite inflexible in its use.
To bring in some flexibility, I propose adding a system to the aggregation service where predeclared sets of keys would be summed by the service. These sets of keys would constitute a partition of the key space, so that the service does not violate the DP limit. A simple check by the aggregation service could reject the query if a key appears in two sets.
Here is what the output domain file would look like. I am not sure "super bucket" is a great name, but it is the only one I could think of right now.
| Super bucket | Bucket |
|---|---|
| 0x123 | 0x456 |
| 0x123 | 0x789 |
| 0x124 | 0xaef |
| 0x125 | 0x12e |
The aggregation service would provide the output only on the "super buckets".
The operational benefits of this added flexibility would be huge. Currently, one has to decide on an encoding before knowing what one can measure. For ARA, or PAA for Fledge, this means having a very good idea beforehand of the size and the performance of the campaign. Once the campaign is running, adjustments have to be made if the volume estimate was not good (or if the settings of the campaign are changed). Encoding changes can be difficult to track, especially in ARA, where sources and triggers both contribute to the keys, but at different points in time. This proposal allows a fixed encoding to be declared up front, with the encoding actually used adjusted after the fact (using the volume of reports as a proxy).
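The proposed partition check and roll-up can be sketched in a few lines (all names and data here are hypothetical, purely to illustrate the idea):

```python
from collections import defaultdict

# Hypothetical partition: super bucket -> buckets, as in the table above.
partition = {
    0x123: [0x456, 0x789],
    0x124: [0xAEF],
    0x125: [0x12E],
}

# Reject the query if any bucket appears in two sets (the DP-limit check).
seen = set()
for buckets in partition.values():
    for b in buckets:
        if b in seen:
            raise ValueError(f"bucket {hex(b)} is in two super buckets")
        seen.add(b)

# Roll per-bucket sums up to the super-bucket level (illustrative values).
bucket_values = {0x456: 10, 0x789: 5, 0xAEF: 7, 0x12E: 2}
super_sums = defaultdict(int)
for sb, buckets in partition.items():
    for b in buckets:
        super_sums[sb] += bucket_values.get(b, 0)

print({hex(k): v for k, v in super_sums.items()})
```

The output would then only be reported on the super buckets, as described above.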
The sample provided here uses an out-of-date shared_info, which also doesn't contain a version.
Better to use the one from the sampledata dir - here is the plaintext
"{\"api\":\"attribution-reporting\",\"version\":\"0.1\",\"scheduled_report_time\":1698872400.000000000,\"reporting_origin\":\"http://adtech.localhost:3000\",\"source_registration_time\":1698796800.000000000,\"attribution_destination\":\"dest.com\",\"debug_mode\":\"enabled\",\"report_id\":\"b360383a-108d-4ae3-96bd-aecde1c3c30b\"}"
This one has an allowed version, an actual 'api' key, and attribution_destination moved inside shared_info.
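As a quick sanity check before batching, the shared_info plaintext above can be parsed and the fields this comment calls out verified (a sketch; the field list mirrors the sample, not an exhaustive spec):

```python
import json

# The shared_info plaintext quoted above, unescaped.
shared_info = json.loads(
    '{"api":"attribution-reporting","version":"0.1",'
    '"scheduled_report_time":1698872400.0,'
    '"reporting_origin":"http://adtech.localhost:3000",'
    '"source_registration_time":1698796800.0,'
    '"attribution_destination":"dest.com",'
    '"debug_mode":"enabled",'
    '"report_id":"b360383a-108d-4ae3-96bd-aecde1c3c30b"}'
)

# The fields the comment calls out: a version, an 'api' key, and
# attribution_destination inside shared_info.
required = ["api", "version", "attribution_destination"]
missing = [f for f in required if f not in shared_info]
assert not missing, f"missing shared_info fields: {missing}"
```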
The response I get from the getJob API doesn't include debug_privacy_epsilon as a double but as a string.
e.g.
{
...
"job_parameters": {
"debug_privacy_epsilon": "64.0",
...
}
...
}
The API specifications in https://github.com/privacysandbox/aggregation-service/blob/main/docs/api.md state that we should expect a double value. It would be helpful if either the specifications or the API response were changed to match the other.
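In the meantime, a defensive workaround on the client side (a sketch, not an official client) is to accept either type when reading the field:

```python
import json

# Response shaped like the getJob excerpt above.
response = json.loads('{"job_parameters": {"debug_privacy_epsilon": "64.0"}}')

# The docs promise a double, but the service currently returns a string;
# accept both so a later fix on either side does not break parsing.
raw = response["job_parameters"]["debug_privacy_epsilon"]
epsilon = float(raw) if isinstance(raw, str) else float(raw)
print(epsilon)
```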
Hello,
I'm trying to build and deploy images based on the steps here:
https://github.com/privacysandbox/aggregation-service/blob/2b3d5c450d0be4e2ce0f4cb49444f3f049508917/build-scripts/gcp/cloudbuild.yaml
This uploads the compiled JAR files to the bucket; however, I cannot use these directly in Cloud Functions and have to download them, zip them, and re-upload them (Terraform does this automatically for users). Ideally I'd like to skip this step, and was hoping to be able to upload those JAR files directly in zipped form.
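Until direct zipped uploads are supported, the zip step can at least be scripted; a minimal sketch with Python's standard library (file names hypothetical; fetching from and pushing back to the bucket would use gsutil or the GCS client and is omitted):

```python
import tempfile
import zipfile
from pathlib import Path

def zip_jar(jar_path: str) -> Path:
    """Wrap a single JAR in a zip archive, as Cloud Functions expects."""
    jar = Path(jar_path)
    zip_path = jar.with_suffix(".zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(jar, arcname=jar.name)
    return zip_path

# Demo on a throwaway file standing in for a compiled JAR (name hypothetical).
tmp = Path(tempfile.mkdtemp())
jar = tmp / "WorkerScaleInCloudFunction.jar"
jar.write_bytes(b"PK\x03\x04")  # placeholder content, not a real JAR
out = zip_jar(str(jar))
print(out.name, zipfile.ZipFile(out).namelist())
```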
I am trying to process a job without an output domain. I found the domain_optional flag in the AggregationWorkerArgs class (link). I can’t set this flag as a JobParameter. Can you guide me on how to set the flag?
Hi,
The Aggregation service team is looking for your feedback to improve debugging support in the service.
Adtechs can already get metrics for their jobs (status, errors, execution time, etc.) from the cloud metadata store (DynamoDB on AWS and Spanner on GCP).
We are exploring other metrics, traces, and logs that can provide a better understanding of job processing within the Trusted Execution Environment without impacting privacy. We are considering providing CPU and memory metrics and total execution time traces for the adtech deployment, and would benefit from your feedback on other metrics that adtechs may find useful.
We are also considering adding logs that give information about job processing for debugging purposes, such as "Job at data reading stage". This is subject to review and approval considering user privacy.
Your inputs will be reviewed by the Privacy Sandbox team. We welcome any feedback on debugging Aggregation Service jobs.
Thank you!
For Aggregation Service releases (e.g. Aggregation Service v2.0.0), can a more complete set of binaries be published? The use case is to enable adtechs to more easily customize and build Aggregation Service AMI images to meet their deployment requirements.
For Aggregation Service v2.0.0 this set would include:
When specifying "enable_user_provided_vpc = true", creating the environment by following the instructions at https://github.com/privacysandbox/aggregation-service/tree/main#set-up-your-deployment-environment fails with this error:
Out of index vpc[0], 182: dynamodb_vpc_endpoint_id = module.vpc[0].dynamodb_vpc_endpoint_id
In file terraform/aws/applications/operator-service/main.tf, lines 182 & 183 refer to module.vpc[0], while module.vpc is not created when "enable_user_provided_vpc = true":
module "vpc" {
  count = var.enable_user_provided_vpc ? 0 : 1
  ...
}
The documentation states that:
Note: The prebuilt Amazon Machine Image (AMI) for the aggregation service is only available in the us-east-1 region. If you like to deploy the aggregation service in a different region you need to copy the released AMI to your account or build it using our provided scripts.
When I try to copy the AMI to my account I'm getting the following error:
Failed to copy ami-036942f537f7a7c2b
You do not have permission to access the storage of this ami
Can you give me some guidance or tell me if it's a configuration error?
Context:
The various links to get the local testing tool do not work (see for instance here https://github.com/privacysandbox/aggregation-service/blob/main/COLLECTING.md#produce-a-summary-report-locally).
Even replacing {VERSION} with 0.4.0 in the link does not solve the issue.
Thanks a lot!
P.S. I could get the previous release (i.e. 0.3.0) using the link available before the 0.4.0 release. See the associated diff of the release.
Hi all!
We recently published a proposal for the aggregation service release and deprecation plan. This plan outlines a standardized cadence for feature releases, in addition to a strategy for patches:
https://github.com/privacysandbox/aggregation-service/wiki/Aggregation-Service-release-and-deprecation-plan
We're opening this issue to solicit general feedback on the proposal.
cc @hostirosti
Hello aggregation service team,
We (Criteo) would like to seek clarification on a couple of points to ensure we have a comprehensive understanding of certain features.
Your insights will greatly assist us in optimizing our utilization of the platform:
Batch Size Limit (30k reports):
Could you kindly provide more details about the batch size limit of 30,000?
We are a little unsure how this limit behaves: it is our understanding that the aggregation service is expected to handle loads of up to tens (even hundreds) of thousands of reports. However, when we provide it with batches of 50k+ reports, our aggregations fail.
Is the 30k limit enforced per Avro file within the batch, or per batch overall?
If it is per overall batch, do you have any suggestions for aggregating batches of more than 30k reports?
If we need to split these larger aggregations over several smaller requests, that will greatly increase the noise levels we see in our final results, and would work against the idea of the aggregation service, which encourages adtechs to aggregate as many reports as possible to increase privacy.
Understanding the specifics of this limit should greatly help us in tailoring our processes more effectively.
Debug Information on Privacy Budget Exhaustion:
We've been considering ways to enhance our debugging capabilities, especially in situations where the privacy budget is exhausted. Would it be possible to obtain more detailed debug information in such cases, specifically regarding the occurrence of duplicates? We believe that having for instance the report_ids of the duplicates wouldn't compromise privacy, and would significantly contribute to our troubleshooting efforts.
I have the aggregation service set up, but our system for producing encrypted reports is not ready to go yet. This repo's sampledata directory has a sample report, but it is unencrypted and so only works with the local testing tool, not with AWS Nitro Enclaves.
Could you provide, either in the repo or in a zip file in this thread, an encrypted sample output.avro and an accompanying domain.avro that we can use to test our AWS aggregation service to make sure everything is running properly?
Running the step "Building artifacts" from https://github.com/privacysandbox/aggregation-service/blob/main/build-scripts/aws/README.md#building-artifacts
While building the artifacts in region eu-west-1, the CodeBuild run failed with the error below:
amazon-ebs.sample-ami: Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
754 | ==> amazon-ebs.sample-ami: Existing lock /var/run/yum.pid: another copy is running as pid 3465.
755 | ==> amazon-ebs.sample-ami: Another app is currently holding the yum lock; waiting for it to exit...
Hi,
I have managed to get the full flow running to aggregate debug reports in the browser and process them locally with the provided tool.
The final file output I have is:
[{"bucket": "d0ZHnRzgTJMAAAAAAAAAAA==", "metric": 195000}]
Which looks correct in terms of there should be a single key and the metric value is correct.
The issue I have now is decoding this bucket to get my original input data back. I assumed the steps would be:
But this causes the following error:
_cbor2.CBORDecodeEOF: premature end of stream (expected to read 23 bytes, got 15 instead)
Would really appreciate any help on how to get the input data back out of this bucket.
Best,
D
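For reference, the bucket value in the local tool's JSON output appears to be the base64 encoding of the raw 16-byte (128-bit) big-endian aggregation key, not a CBOR value, which would explain the cbor2 EOF error. A sketch under that assumption:

```python
import base64

# The bucket string from the summary report above.
encoded = "d0ZHnRzgTJMAAAAAAAAAAA=="
raw = base64.b64decode(encoded)
assert len(raw) == 16  # aggregation keys are 128-bit

# Interpret the 16 bytes as a big-endian integer to recover the key.
bucket = int.from_bytes(raw, "big")
print(hex(bucket))
```

Recovering the original input data would then mean reversing whatever encoding was used to build the 128-bit key in the first place.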
I am trying to follow the instructions in Testing locally using Local Testing Tool but when I run the following command with the sampledata:
java -jar LocalTestingTool_2.0.0.jar \
--input_data_avro_file sampledata/output_debug_reports.avro \
--domain_avro_file sampledata/output_domain.avro \
--output_directory .
I get the error below:
2023-10-31 12:21:57:506 -0700 [WorkerPullWorkService] INFO com.google.aggregate.adtech.worker.WorkerPullWorkService - Aggregation worker started
2023-10-31 12:21:57:545 -0700 [WorkerPullWorkService] INFO com.google.aggregate.adtech.worker.WorkerPullWorkService - Item pulled
2023-10-31 12:21:57:555 -0700 [WorkerPullWorkService] INFO com.google.aggregate.adtech.worker.aggregation.concurrent.ConcurrentAggregationProcessor - Reports shards detected by blob storage client: [output_debug_reports.avro]
2023-10-31 12:21:57:566 -0700 [WorkerPullWorkService] INFO com.google.aggregate.adtech.worker.aggregation.concurrent.ConcurrentAggregationProcessor - Reports shards to be used: [DataLocation{blobStoreDataLocation=BlobStoreDataLocation{bucket=/Users/jonaquino/projects/aggregation-service/sampledata, key=output_debug_reports.avro}}]
2023-10-31 12:21:57:566 -0700 [WorkerPullWorkService] INFO com.google.aggregate.adtech.worker.aggregation.domain.OutputDomainProcessor - Output domain shards detected by blob storage client: [output_domain.avro]
2023-10-31 12:21:57:567 -0700 [WorkerPullWorkService] INFO com.google.aggregate.adtech.worker.aggregation.domain.OutputDomainProcessor - Output domain shards to be used: [DataLocation{blobStoreDataLocation=BlobStoreDataLocation{bucket=/Users/jonaquino/projects/aggregation-service/sampledata, key=output_domain.avro}}]
2023-10-31 12:21:57:575 -0700 [WorkerPullWorkService] INFO com.google.aggregate.adtech.worker.aggregation.concurrent.ConcurrentAggregationProcessor - Job parameters didn't have a report error threshold configured. Taking the default percentage value 10.000000
return_code: "REPORTS_WITH_ERRORS_EXCEEDED_THRESHOLD"
return_message: "Aggregation job failed early because the number of reports excluded from aggregation exceeded threshold."
error_summary {
error_counts {
category: "REQUIRED_SHAREDINFO_FIELD_INVALID"
count: 1
description: "One or more required SharedInfo fields are empty or invalid."
}
error_counts {
category: "NUM_REPORTS_WITH_ERRORS"
count: 1
description: "Total number of reports that had an error. These reports were not considered in aggregation. See additional error messages for details on specific reasons."
}
}
finished_at {
seconds: 1698780117
nanos: 679576000
}
CustomMetric{nameSpace=scp/worker, name=WorkerJobCompletion, value=1.0, unit=Count, labels={Type=Success}}
2023-10-31 12:21:57:732 -0700 [WorkerPullWorkService] INFO com.google.aggregate.adtech.worker.WorkerPullWorkService - No job pulled.
Running the step "Building artifacts" from https://github.com/privacysandbox/aggregation-service/blob/main/build-scripts/aws/README.md#building-artifacts
While building the artifacts in region eu-west-1, the CodeBuild run failed with the error below:
836 | --> amazon-ebs.sample-ami: AMIs were created:
837 | us-east-1: ami-069b14bccedc04571
....
[Container] 2023/05/09 15:34:31 Running command bash build-scripts/aws/set_ami_to_public.sh set_ami_to_public_by_prefix aggregation-service-enclave_$(cat VERSION) $AWS_DEFAULT_REGION $AWS_ACCOUNT_ID
841 |
842 | An error occurred (InvalidAMIID.Malformed) when calling the ModifyImageAttribute operation: Invalid id: "" (expecting "ami-...")
843 |
844 | An error occurred (MissingParameter) when calling the ModifySnapshotAttribute operation: Value () for parameter snapshotId is invalid. Parameter may not be null or empty.
845
The reason is that the AMI was created in us-east-1 instead of eu-west-1.
The way the browser and the adtech's servers interact over the network makes it inherently unavoidable that some reports will be received by the adtech but not marked as received by the browser (e.g. when a timeout happens), and hence retried and received several times by the adtech. As mentioned in your documentation:
The browser is free to utilize techniques like retries to minimize data loss.
Sometimes these duplicates reach upwards of hundreds of reports each day, for several days (sometimes several months) in a row, all having the same report_id.
The aggregation service enforces the no-duplicates rule based on a combination of information:
Instead, each aggregatable report will be assigned a shared ID. This ID is generated from the combined data points: API version, reporting origin, destination site, source registration time and scheduled report time. These data points come from the report's shared_info field.
The aggregation service will enforce that all aggregatable reports with the same ID must be included in the same batch. Conversely, if more than one batch is submitted with the same ID, only one batch will be accepted for aggregation and the others will be rejected.
As an adtech company trying to provide timely reporting to clients, it is paramount to use all of the available information (in this case, reports) so that our reporting is as precise as possible.
In this scenario, however, if we try to batch together all of our reports for a chosen client on a chosen day, even after deduplicating that day's reports by report_id (or by the overall shared_info field), we may have a batch accepted on day 1 and then all subsequent batches for the next month rejected, because they all contain that same shared_info-based ID.
This means that we have to check further back in the data for possible duplicate reports. To implement this check efficiently, we would benefit from a more precise description of the retry policy, namely for how long retries can happen.
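A sketch of the historical check described above, grouping reports by the same tuple the quoted text says the shared ID is derived from (field names assumed to match ARA's shared_info; the data is purely illustrative):

```python
from collections import defaultdict

def shared_id_key(shared_info: dict) -> tuple:
    # The data points the quoted text says the shared ID is generated from:
    # API version, reporting origin, destination site,
    # source registration time, scheduled report time.
    return (
        shared_info.get("version"),
        shared_info.get("reporting_origin"),
        shared_info.get("attribution_destination"),
        shared_info.get("source_registration_time"),
        shared_info.get("scheduled_report_time"),
    )

def find_cross_batch_collisions(batches: dict) -> dict:
    """Map each shared-ID tuple to the set of batches it appears in."""
    seen = defaultdict(set)
    for batch_name, reports in batches.items():
        for report in reports:
            seen[shared_id_key(report)].add(batch_name)
    return {k: v for k, v in seen.items() if len(v) > 1}

# Two days of batches sharing one shared-ID tuple: a retried duplicate
# arriving the next day would cause the second batch to be rejected.
day1 = [{"version": "0.1", "reporting_origin": "https://a.example",
         "attribution_destination": "dest.com",
         "source_registration_time": 1698796800,
         "scheduled_report_time": 1698872400}]
day2 = [dict(day1[0])]
collisions = find_cross_batch_collisions({"day1": day1, "day2": day2})
print(collisions)
```

Running such a check before submitting a batch would at least surface which earlier batch a rejected shared ID collides with.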
I guess the questions this issue raises are as follows:
When running the terraform code on step https://github.com/privacysandbox/aggregation-service/blob/main/build-scripts/aws/README.md#configure-codebuild-setup
I got the following error:
│ Error: error creating S3 bucket ACL for aggregation-service-artifacts: AccessControlListNotSupported: The bucket does not allow ACLs
To resolve this error, I had to add the following resource to build-scripts/aws/terraform/codebuild.tf:
resource "aws_s3_bucket_ownership_controls" "artifacts_output_ownership_controls" {
  bucket = aws_s3_bucket.artifacts_output.id
  rule {
    object_ownership = "BucketOwnerEnforced"
  }
}
Hi aggregation service team, we (Adform) are facing a privacy budget exhaustion issue due to duplicate reports. We are following the batching criteria
mentioned at
and
Based on the above rules, we tried to reverse-engineer the batch data to check whether we have any duplicate reports across all our batches, but we couldn't find any.
We also looked at #35 and cross-verified our assumptions against the code as well.
Is there any other way we can get more debug information about which batches contain duplicate reports with the same key?
Can you please provide any guidance on how to proceed with debugging this issue?
Hello All!
Having spent the past few days trying to get the AS live, I have been jotting down various questions, suggestions, and bugs which I think could be a great addition to the documentation and workflow.
Maybe for those who use Terraform in their projects this is not required, but we do not use Terraform and essentially followed the instructions to get all the resources built. I have since had to traverse the GCP console to try and understand what the scripts created. A high-level overview diagram with the main data flows, table names, etc. would be extremely useful.
Similar to the point above, the Terraform scripts are spread over many files, so it is not clear exactly what will be created. It would be great to have a single-file config showing the names of all the resources, as they are very obscure in the context of our overall infra. For example, prod-jobmd is the name of a newly created Cloud Spanner instance, which is a pretty unhelpful name. At the very least everything should be prefixed with aggregation-service, or, even better, users should be able to transparently set this as a first step.
It would be good to have an understanding of the cost of the full set up at idle, and maybe have some suggestions for development and staging setups which can minimise costs by using more serverless infra for example.
I would suggest dropping the use of Cloud Functions and migrating fully to Cloud Run. The docs seem to use these interchangeably, and although they sort of are (gen2 functions are powered by Cloud Run), I think this can cause extra confusion. There is also a small typo in the endpoint:
This is the value in the docs
https://<environment>-<region>-frontend-service-<cloud-funtion-id>-uc.a.run.app/v1alpha/createJob
But -uc. was -ew. in my case, so this does not seem to be a value that can be hardcoded in the docs in this manner.
Running the jobs stores a nice error in the DB, which is awesome! But even with this nice error, it would be great to have a document showing common errors and their solutions. For example, my latest error is:
{"errorSummary":{"errorCounts":[{"category":"DECRYPTION_KEY_NOT_FOUND","count":"445","description":"Could not find decryption key on private key endpoint."},{"category":"NUM_REPORTS_WITH_ERRORS","count":"445","description":"Total number of reports that had an error. These reports were not considered in aggregation. See additional error messages for details on specific reasons."}]},"finishedAt":"2024-04-30T13:17:24.233681575Z","returnCode":"REPORTS_WITH_ERRORS_EXCEEDED_THRESHOLD","returnMessage":"Aggregation job failed early because the number of reports excluded from aggregation exceeded threshold."}
Which is very clear, but it still does not leave me any path open to try and rectify the issue, apart from troubling people over email or in this repo :)
This was addressed in #48 but needs to be added to the repo.
There are quite a few flows in which data must be converted from one format to another, for example some hashed string into a byte array. Whilst it is possible to figure this out from disparate pieces of information available in the repository, it would be very useful to have a few examples for various platforms, e.g.:
-- Convert hashes to domain avro for processing.
CAST(FROM_HEX(SUBSTR(reports.hashed_key, 3)) AS BYTES) AS bucket
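For reference, an equivalent of that SQL in Python, assuming hashed_key is a '0x'-prefixed hex string of a 128-bit bucket (the left-padding to 16 bytes is an assumption here):

```python
def hashed_key_to_bucket_bytes(hashed_key: str) -> bytes:
    """Strip the '0x' prefix and decode the hex into raw bucket bytes."""
    hex_part = hashed_key[2:]
    # Left-pad to 32 hex chars so the bucket is a full 16 bytes (128 bits).
    return bytes.fromhex(hex_part.zfill(32))

bucket = hashed_key_to_bucket_bytes("0x559")
print(bucket.hex())
```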
I hope you do not mind if I keep updating this issue as I hopefully near completion of getting the service up!
All the best!
D
Hi team,
I'm trying to set up our deployment environment, but I encountered this error. Could you please take a look? Thanks a lot!
These are the roles of our service accounts. Do I need to add some additional role permissions?
our projectId: ecs-1709881683838
Error: Error creating function: googleapi: Error 403: Could not create Cloud Run service dev-us-west2-worker-scale-in. Permission 'iam.serviceaccounts.actAs' denied on service account worker-sa-aggregation-service@microsites-sa.iam.gserviceaccount.com (or it may not exist).
│
│ with module.job_service.module.autoscaling.google_cloudfunctions2_function.worker_scale_in_cloudfunction,
│ on ../../coordinator-services-and-shared-libraries/operator/terraform/gcp/modules/autoscaling/workerscalein.tf line 35, in resource "google_cloudfunctions2_function" "worker_scale_in_cloudfunction":
│ 35: resource "google_cloudfunctions2_function" "worker_scale_in_cloudfunction" {
│
╵
╷
│ Error: Error creating function: googleapi: Error 403: Could not create Cloud Run service dev-us-west2-frontend-service. Permission 'iam.serviceaccounts.actAs' denied on service account [email protected] (or it may not exist).
│
│ with module.job_service.module.frontend.google_cloudfunctions2_function.frontend_service_cloudfunction,
│ on ../../coordinator-services-and-shared-libraries/operator/terraform/gcp/modules/frontend/main.tf line 43, in resource "google_cloudfunctions2_function" "frontend_service_cloudfunction":
│ 43: resource "google_cloudfunctions2_function" "frontend_service_cloudfunction" {
│
╵
╷
│ Error: Error creating instance template: googleapi: Error 409: The resource 'projects/ecs-1709881683838/global/instanceTemplates/dev-collector' already exists, alreadyExists
│
│ with module.job_service.module.worker.google_compute_instance_template.collector,
│ on ../../coordinator-services-and-shared-libraries/operator/terraform/gcp/modules/worker/collector.tf line 49, in resource "google_compute_instance_template" "collector":
│ 49: resource "google_compute_instance_template" "collector" {
Hello,
While executing a /createJob request with the following payload:
{
  "job_request_id": "Job-1010",
  "input_data_blob_prefix": "reports/inputs/input.avro",
  "input_data_bucket_name": "test-android-sandbox",
  "output_data_blob_prefix": "reports/output/result_1.avro",
  "output_data_bucket_name": "test-android-sandbox",
  "job_parameters": {
    "output_domain_blob_prefix": "reports/domains/domain.avro",
    "output_domain_bucket_name": "test-android-sandbox",
    "debug_privacy_epsilon": 30
  }
}
The response to this request is 202.
When executing /getJob?job_request_id=Job-1010, the response is:
{
  "job_status": "IN_PROGRESS",
  "request_received_at": "2023-06-12T15:14:17.891601Z",
  "request_updated_at": "2023-06-12T15:14:23.222830Z",
  "job_request_id": "Job-1010",
  "input_data_blob_prefix": "reports/inputs/input.avro",
  "input_data_bucket_name": "test-android-sandbox",
  "output_data_blob_prefix": "reports/output/result_1.avro",
  "output_data_bucket_name": "test-android-sandbox",
  "postback_url": "",
  "result_info": {
    "return_code": "",
    "return_message": "",
    "error_summary": {
      "error_counts": [],
      "error_messages": [ "Missing required properties: jobKey" ]
    },
    "finished_at": "1970-01-01T00:00:00Z"
  },
  "job_parameters": {
    "debug_privacy_epsilon": "30",
    "output_domain_bucket_name": "test-android-sandbox",
    "output_domain_blob_prefix": "reports/domains/domain.avro"
  },
  "request_processing_started_at": "2023-06-12T15:14:23.133071Z"
}
The error is Missing required properties: jobKey, and the job stays in status IN_PROGRESS.
When running the same /createJob request without the job_request_id property, the response from /createJob is:
{ "code": 3, "message": "Missing required properties: jobRequestId\r\n in: {\n \"input_data_blob_prefix\": \"reports/inputs/input.avro\",\n \"input_data_bucket_name\": \"test-android-sandbox\",\n \"output_data_blob_prefix\": \"reports/output/result_1.avro\",\n \"output_data_bucket_name\": \"test-android-sandbox\",\n \"job_parameters\": {\n \"output_domain_blob_prefix\": \"reports/domains/domain.avro\",\n \"output_domain_bucket_name\": \"test-android-sandbox\"\n }\n}", "details": [ { "reason": "JSON_ERROR", "domain": "", "metadata": {} } ] }
Hi aggregation-service team,
I'm really confused about the file "output_domain.avro" used for producing a summary report locally. In your Node.js example (code), how can I generate an "output_domain.avro" for the aggregation report?
Here is your sample doc: https://github.com/privacysandbox/aggregation-service/blob/main/docs/collecting.md#collecting-and-batching-aggregatable-reports
{
"bucket": "\u0005Y"
}
Will this "output_domain.avro" work for your Node.js example?
If convenient, could you also explain what this domain file should be generated from? Thanks a lot!!
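For what it's worth, the sample record's "\u0005Y" is just the JSON rendering of the two bytes 0x05 0x59, i.e. bucket 1369; the domain file is a list of such bucket byte values in Avro, one record per key you expect in the summary report. A sketch of generating one (fastavro is a third-party package; the single-field bytes schema follows the one shown in collecting.md, and writing the full 16 bytes per 128-bit key is an assumption on my part):

```python
def bucket_bytes(bucket: int) -> bytes:
    """Encode a bucket ID as a 16-byte big-endian value (keys are 128-bit)."""
    return bucket.to_bytes(16, "big")

def write_output_domain(path: str, buckets: list) -> None:
    # fastavro is a third-party dependency: pip install fastavro
    import fastavro

    schema = {
        "type": "record",
        "name": "AggregationBucket",
        "fields": [{"name": "bucket", "type": "bytes"}],
    }
    records = [{"bucket": bucket_bytes(b)} for b in buckets]
    with open(path, "wb") as out:
        fastavro.writer(out, fastavro.parse_schema(schema), records)

print(bucket_bytes(1369).hex())
```

The buckets to list are whichever keys your reports can contribute to; keys absent from the domain are dropped from the summary.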
Tried to kick off a build of the build container using the git hash for v2.4.2 and got the error below.
I believe it's due to a missing "-y" on the apt-get install here:
https://github.com/privacysandbox/aggregation-service/blame/22c2a42ea98b88e5dd3451446db2b7a152760274/build-scripts/gcp/build-container/Dockerfile#L63
Google Ldap: evgenyy@ if you want to reach out internally
Step 9/12 : RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.asc] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | tee /usr/share/keyrings/cloud.google.asc && apt-get update && apt-get install google-cloud-cli && apt-get -y autoclean && apt-get -y autoremove
---> Running in e691327d6e48
deb [signed-by=/usr/share/keyrings/cloud.google.asc] https://packages.cloud.google.com/apt cloud-sdk main
-----BEGIN PGP PUBLIC KEY BLOCK-----
...
-----END PGP PUBLIC KEY BLOCK-----
Hit:1 https://download.docker.com/linux/debian bookworm InRelease
Hit:2 http://deb.debian.org/debian bookworm InRelease
Hit:3 http://deb.debian.org/debian bookworm-updates InRelease
Get:4 https://packages.cloud.google.com/apt cloud-sdk InRelease [6361 B]
Hit:5 http://deb.debian.org/debian-security bookworm-security InRelease
Get:6 https://packages.cloud.google.com/apt cloud-sdk/main amd64 Packages [629 kB]
Fetched 636 kB in 1s (1239 kB/s)
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
google-cloud-cli-anthoscli
Suggested packages:
google-cloud-cli-app-engine-java google-cloud-cli-app-engine-python
google-cloud-cli-pubsub-emulator google-cloud-cli-bigtable-emulator
google-cloud-cli-datastore-emulator kubectl
The following NEW packages will be installed:
google-cloud-cli google-cloud-cli-anthoscli
0 upgraded, 2 newly installed, 0 to remove and 0 not upgraded.
Need to get 106 MB of archives.
After this operation, 609 MB of additional disk space will be used.
Do you want to continue? [Y/n] Abort.
The command '/bin/sh -c echo "deb [signed-by=/usr/share/keyrings/cloud.google.asc] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | tee /usr/share/keyrings/cloud.google.asc && apt-get update && apt-get install google-cloud-cli && apt-get -y autoclean && apt-get -y autoremove' returned a non-zero code: 1
ERROR
ERROR: build step 0 "gcr.io/cloud-builders/docker" failed: step exited with non-zero status: 1
I got an API Gateway error on the AWS console after deploying the aggregation service using Terraform:
The API with ID my-api-id doesn’t include a route with path /* having an integration arn:aws:lambda:us-east-1:my-aws-account-id:function:stg-create-job.
I changed the Source ARN of the Lambda's permission from arn:aws:execute-api:us-east-1:my-aws-account-id:my-api-id/*/**
to arn:aws:execute-api:us-east-1:my-aws-account-id:my-api-id/*/*/v1alpha/getJob
and that resolved the error.
https://github.com/privacysandbox/control-plane-shared-libraries/blob/9efe5591acc18e46263399d9785432a146d9675c/operator/terraform/aws/modules/frontend/api_gateway.tf#L62
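For reference, the change described above amounts to narrowing the `source_arn` of the `aws_lambda_permission` resource. A sketch of the edit, using the placeholder account and API IDs from the report rather than real values:

```hcl
# Before:
# source_arn = "arn:aws:execute-api:us-east-1:my-aws-account-id:my-api-id/*/**"
# After:
source_arn = "arn:aws:execute-api:us-east-1:my-aws-account-id:my-api-id/*/*/v1alpha/getJob"
```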
Hello,
I'm currently experimenting with the Private Aggregation API and I'm struggling to validate that my final output is correct.
From my worklet, I perform the following histogram contribution:
privateAggregation.contributeToHistogram({ bucket: BigInt(1369), value: 128 });
Which is correctly triggering a POST request with the following body:
{
aggregation_service_payloads: [
{
debug_cleartext_payload: 'omRkYXRhgaJldmFsdWVEAAAAgGZidWNrZXRQAAAAAAAAAAAAAAAAAAAFWWlvcGVyYXRpb25paGlzdG9ncmFt',
key_id: 'bca09245-2ef0-4fdf-a4fa-226306fc2a09',
payload: 'RVd7QRTTUmPp0i1zBev+4W8lJK8gLIIod6LUjPkfbxCOHsQLBW/jRn642YZ2HYpYkiMK9+PprU5CUi9W7TwJToQ4UXiUbJUgYwliqBFC+aAcwsKJ3Hg46joHZXV5E0ZheeFTqqvLtiJxlVpzFcWd'
}
],
debug_key: '777',
shared_info: '{"api":"shared-storage","debug_mode":"enabled","report_id":"aaa889f1-2adc-4796-9e46-c652a08e18ca","reporting_origin":"http://adtech.localhost:3000","scheduled_report_time":"1698074105","version":"0.1"}'
}
I've set up a small Node.js server handling requests on /.well-known/private-aggregation/debug/report-shared-storage, basically doing this:
const encoder = avro.createFileEncoder(
`${REPORT_UPLOAD_PATH}/debug/aggregation_report_${Date.now()}.avro`,
reportType
);
reportContent.aggregation_service_payloads.forEach((payload) => {
console.log(
"Decoded data from debug_cleartext_payload:",
readDataFromCleartextPayload(payload.debug_cleartext_payload)
);
encoder.write({
payload: convertPayloadToBytes(payload.debug_cleartext_payload),
key_id: payload.key_id,
shared_info: reportContent.shared_info,
});
});
encoder.end();
As you can see, at this point I'm printing the decoded data to the console, and I see the expected values:
Decoded data from debug_cleartext_payload: { value: 128, bucket: 1369 }
However, now I'm trying to generate a summary report with the local test tool by running the following command:
java -jar LocalTestingTool_2.0.0.jar --input_data_avro_file aggregation_report_1698071597075.avro --domain_avro_file output_domain.avro --no_noising --json_output --output_directory ./results
No matter what value I pass to the contributeToHistogram method, I always get 0 in the metric field:
[ {
"bucket" : "MTM2OQ==", // 1369 base64 encoded
"metric" : 0
} ]
Am I doing something wrong?
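One thing worth checking (my hypothesis, not confirmed from the output above): the bucket "MTM2OQ==" in your summary report decodes to the ASCII string "1369", which suggests output_domain.avro stores the bucket as decimal text. As far as I understand, domain keys must instead be 16-byte big-endian byte strings matching the report's 128-bit bucket, so a text-encoded key never matches the contribution and is emitted with a zero metric. A minimal Node.js helper for the expected encoding (the function name is mine, not from any API):

```javascript
// Encode a bucket key as the 16-byte big-endian byte string expected
// in the output domain's "bucket" field (Avro type "bytes").
function bucketToBytes(bucket /* BigInt */) {
  const bytes = Buffer.alloc(16); // 128-bit key, zero-padded on the left
  let v = bucket;
  for (let i = 15; i >= 0; i--) {
    bytes[i] = Number(v & 0xffn); // lowest byte goes in the last slot
    v >>= 8n;
  }
  return bytes;
}
```

Writing `bucketToBytes(1369n)` into the domain file instead of the string "1369" should make the domain key match the contribution.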
Apart from this issue, I wonder how this would work in a real-life application. Currently this example handles one report at a time, which is sent instantly because of debug_mode, but in a real situation, how are we supposed to process a large number of reports at once? Can we pass a list of files to the --input_data_avro_file flag? Should we batch the reports prior to converting them to Avro, based on the shared_info data? If yes, based on which field?
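On the batching question: one plausible approach (my reading of the collection guidance, so treat it as an assumption rather than official advice) is to group reports by the scheduled_report_time field inside shared_info, e.g. into hourly batches, and write each batch to a single Avro file. A sketch with hypothetical helper names:

```javascript
// Compute an hourly batch key from the scheduled_report_time field
// inside a report's shared_info JSON string (epoch seconds).
function batchKey(report) {
  const sharedInfo = JSON.parse(report.shared_info);
  const t = Number(sharedInfo.scheduled_report_time);
  return t - (t % 3600); // truncate to the start of the hour
}

// Group collected reports into a Map keyed by hourly batch.
function groupReportsByHour(reports) {
  const batches = new Map();
  for (const report of reports) {
    const key = batchKey(report);
    if (!batches.has(key)) batches.set(key, []);
    batches.get(key).push(report);
  }
  return batches;
}
```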
Thanks in advance!
In the AWS instructions, there are two options for using the AMI in a region other than us-east-1:
If you like to deploy the aggregation service in a different region you need to copy the released AMI to your account or build it using our provided scripts.
I have been having a lot of trouble building the AMI using the provided scripts, so I would like to try simply copying the AMI (the first option), but I don't see instructions for this. What is the AMI name and where do I get it from? Do I need to change any parameters to point to the new region? What step should I move on to after copying the AMI?
Could you add instructions for copying the AMI and subsequent steps?
I am trying to follow the instructions to build the AMI because I want it in a different region than us-east-1.
But when I run
aws codebuild start-build --project-name aggregation-service-artifacts-build --region us-west-2
I get this error:
Build 'amazon-ebs.sample-ami' errored after 936 milliseconds 511 microseconds: VPCIdNotSpecified: No default VPC for this user
status code: 400, request id: fffa8013-121f-4855-a665-70e36030a4e7x
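The VPCIdNotSpecified error means Packer tried to launch the build instance in a region (us-west-2 here) where the account has no default VPC. If recreating one is acceptable for your account, the AWS CLI can restore it (this is a suggestion from me, not a step from the official guide; the alternative is to point the Packer template at an explicit subnet):

```shell
aws ec2 create-default-vpc --region us-west-2
```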
Hi all!
The Aggregation service team is currently exploring options for adtechs who may want to migrate from one cloud provider to another. This gives adtechs flexibility in using a cloud provider of their choice to optimize for cost or other business needs. Our proposed migration solution would enable adtechs to re-encrypt their reports from a source cloud provider (let’s call this Cloud A) to a destination cloud provider (let’s call this Cloud B) and enable them to use Cloud B to process reports originally encrypted for Cloud A as part of the migration. After migration is completed, use of Cloud A for processing reports will be disabled and the adtech will only be able to use Cloud B to process their reports.
In the short-term, this solution will support migration of aggregation service jobs from AWS to GCP and vice versa. As we support more cloud options in the future, this solution would be extensible to moving from any supported cloud provider to another.
Depiction of the re-encryption flow:
For any adtechs considering a migration, we encourage completing this migration before third-party cookie deprecation to take advantage of feature benefits such as:
After third-party cookie deprecation, we plan to continue to support cloud migration with the re-encryption feature, but may not be able to give the additional benefits outlined above to preserve privacy.
We welcome any feedback on this proposal.
Thank you!
We are seeking feedback on consolidating coordinator services for attribution reporting and other workloads. Please review and comment on the main issue, WICG/protected-auction-services-discussion#69.
In the instructions for building the AMI (Building aggregation service artifacts), part of the instructions is to put a github_personal_access_token in codebuild.auto.tfvars.
Can you provide more information on this token?
Hi Aggregation Service testers,
We have discovered an issue that broke the AWS worker build, caused by an incompatible Docker engine version upgrade. We are planning to release a new patch next week. Meanwhile, if you encounter issues building the AWS worker, you can use the following workaround:
Create a patch file at <repo_root>/build_defs/shared_libraries/pin_pkr_docker.patch
with the following content:
diff --git a/operator/worker/aws/setup_enclave.sh b/operator/worker/aws/setup_enclave.sh
index e4bd30371..8bf2e0fb1 100644
--- a/operator/worker/aws/setup_enclave.sh
+++ b/operator/worker/aws/setup_enclave.sh
@@ -19,7 +19,7 @@ sudo yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/late
#
# Builds enclave image inside the /home/ec2-user directory as part of automatic
# AMI generation.
-sudo yum install docker -y
+sudo yum install docker-24.0.5-1.amzn2023.0.3 -y
sudo systemctl enable docker
sudo systemctl start docker
Then add the new patch to the patches list under the shared_libraries rule in the WORKSPACE file. The shared_libraries rule should now become:
git_repository(
name = "shared_libraries",
patch_args = [
"-p1",
],
remote = "https://github.com/privacysandbox/coordinator-services-and-shared-libraries",
patches = [
"//build_defs/shared_libraries:coordinator.patch",
"//build_defs/shared_libraries:gcs_storage_client.patch",
"//build_defs/shared_libraries:dependency_update.patch",
"//build_defs/shared_libraries:key_cache_ttl.patch",
"//build_defs/shared_libraries:pin_pkr_docker.patch",
],
tag = COORDINATOR_VERSION,
workspace_file = "@shared_libraries_workspace//file",
)
Thank you!
When an aggregatable report is created by sendHistogramReport() (i.e. called inside the reportWin function), it contains shared_info without attribution_destination or source_registration_time. This seems logical, as these keys are strictly related to attribution logic. Example:
"shared_info": "{\"api\":\"fledge\",\"debug_mode\":\"enabled\",\"report_id\":\"9ae1a0d0-8cf5-4951-b752-e932bf0f7705\",\"reporting_origin\":\"https://fledge-eu.creativecdn.com\",\"scheduled_report_time\":\"1668771714\",\"version\":\"0.1\"}"
More readable form:
{
"api": "fledge",
"debug_mode": "enabled",
"report_id": "9ae1a0d0-8cf5-4951-b752-e932bf0f7705",
"reporting_origin": "https://fledge-eu.creativecdn.com",
"scheduled_report_time": "1668771714",
"version": "0.1"
}
(note: version 0.1; values for privacy_budget_key, attribution_destination, and source_registration_time are missing)
At the same time, the Aggregation Service expects both attribution_destination and source_registration_time for shared_info.version == 0.1 (since aggregation service version 0.4):
see SharedInfo.getPrivacyBudgetKey()
Tested on Chrome:
The following exception was printed:
CustomMetric{nameSpace=scp/worker, name=WorkerJobError, value=1.0, unit=Count, labels={Type=JobHandlingError}}
2022-11-22 09:10:54:120 +0100 [WorkerPullWorkService] ERROR com.google.aggregate.adtech.worker.WorkerPullWorkService - Exception occurred in worker
com.google.aggregate.adtech.worker.JobProcessor$AggregationJobProcessException: java.util.concurrent.ExecutionException: java.util.NoSuchElementException: No value present
at com.google.aggregate.adtech.worker.aggregation.concurrent.ConcurrentAggregationProcessor.process(ConcurrentAggregationProcessor.java:400)
at com.google.aggregate.adtech.worker.WorkerPullWorkService.run(WorkerPullWorkService.java:145)
at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:67)
at com.google.common.util.concurrent.Callables.lambda$threadRenaming$3(Callables.java:103)
at java.base/java.lang.Thread.run(Thread.java:1589)
Caused by: java.util.concurrent.ExecutionException: java.util.NoSuchElementException: No value present
at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:588)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:567)
at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:113)
at com.google.aggregate.adtech.worker.aggregation.concurrent.ConcurrentAggregationProcessor.process(ConcurrentAggregationProcessor.java:295)
... 4 more
Caused by: java.util.NoSuchElementException: No value present
at java.base/java.util.Optional.get(Optional.java:143)
at com.google.aggregate.adtech.worker.model.SharedInfo.getPrivacyBudgetKey(SharedInfo.java:161)
at com.google.aggregate.adtech.worker.aggregation.engine.AggregationEngine.accept(AggregationEngine.java:88)
at com.google.aggregate.adtech.worker.aggregation.engine.AggregationEngine.accept(AggregationEngine.java:49)
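Given the mismatch described above, a small pre-flight check on collected reports could surface the problem before a batch ever reaches the worker. This is a hypothetical helper of my own, mirroring the fields the stack trace suggests SharedInfo.getPrivacyBudgetKey() requires for version "0.1":

```javascript
// Hypothetical pre-flight check: return the shared_info fields that are
// missing relative to what the aggregation service (version discussed in
// this issue) expects for shared_info version "0.1".
function missingSharedInfoFields(sharedInfoJson) {
  const info = JSON.parse(sharedInfoJson);
  const required = [
    'api', 'report_id', 'reporting_origin', 'scheduled_report_time',
    'version', 'attribution_destination', 'source_registration_time',
  ];
  return required.filter((field) => !(field in info));
}
```

Running it over the fledge report above would flag attribution_destination and source_registration_time as absent, matching the NoSuchElementException seen in the worker.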
We are working on adding the ability to generate debug summary reports from encrypted aggregatable reports with the AWS-based aggregation service. This capability will be time-limited and phased out at a later date.
We would like to hear from you about what capabilities you'd like to see in these debug summary reports.
Some ideas we are considering:
- epsilon
- output domain, with an annotation hinting at the omission
Questions:
Hello,
One interesting evolution of the aggregation service would be to enable querying aggregates of keys. I think this was mentioned in the aggregate attribution API at a time when the aggregation was supposed to be performed by MPC rather than TEEs.
In other words, I would love to be able to query a bit mask (e.g. for an 8-bit key, 01100*01 would match both 01100101 and 01100001).
This would enable greater flexibility for decoding (i.e. choosing which encoded variables to retrieve depending on the number of reports), and negate the need to adapt the encoding depending on the expected traffic to the destination website.
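To make the proposal concrete, the wildcard expansion could look like this (purely illustrative; no such query API exists today):

```javascript
// Expand a bit-mask pattern such as "01100*01" into all concrete keys,
// replacing each "*" with both 0 and 1 recursively.
function expandMask(pattern) {
  const star = pattern.indexOf('*');
  if (star === -1) return [pattern]; // no wildcards left
  return ['0', '1'].flatMap((bit) =>
    expandMask(pattern.slice(0, star) + bit + pattern.slice(star + 1)));
}
```

A mask query could then be answered by summing the aggregates of every key in the expansion.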
Thanks!
P.S. I can cross-post on https://github.com/WICG/attribution-reporting-api if needed.
I am able to trigger the aggregation job with the /createJob endpoint deployed via Terraform in AWS. When running /getJob with the request ID, I get the error below:
"result_info": {
  "return_code": "REPORTS_WITH_ERRORS_EXCEEDED_THRESHOLD",
  "return_message": "Aggregation job failed early because the number of reports excluded from aggregation exceeded threshold.",
  "error_summary": {
    "error_counts": [
      {
        "category": "DECRYPTION_KEY_NOT_FOUND",
        "count": 1,
        "description": "Could not find decryption key on private key endpoint."
      },
      {
        "category": "NUM_REPORTS_WITH_ERRORS",
        "count": 1,
        "description": "Total number of reports that had an error. These reports were not considered in aggregation. See additional error messages for details on specific reasons."
      }
    ],
    "error_messages": []
  },
  "finished_at": "2024-05-0
I could see @ydennisy also had a similar issue, but I could not find the solution for it.
Hi, I work in the Google Ad Traffic Quality Team. I am using the local aggregation service tool to simulate noise on locally generated aggregatable reports. However, due to the contribution budget limits, I am unable to create multiple aggregatable reports that will correctly represent my data. What is the best way for me to test this locally, can I manually create an aggregatable report with very high values (corresponding to a raw summary report) for testing?
Hello everyone, I'm currently trying to create a version of attribution-reporting in Node.js. So far so good: I managed to complete the entire journey (trigger interactions with creatives, conversion on the final website, generate event and aggregatable reports).
But I got to the part where I must store the aggregatable reports before sending them to the aggregation service, and I wanted to know if anyone else has done this step of collecting the reports in Node.js.
Below is the code responsible for collecting and storing the reports (I took the documentation code written in Go as a reference).
*Spoiler: each report record I receive generates an .avro file.
const avro = require('avsc');

const REPORTS_AVRO_SCHEMA = {
  "name": "AvroAggregatableReport",
  "type": "record",
  "fields": [
    { "name": "payload", "type": "bytes" },
    { "name": "key_id", "type": "string" },
    { "name": "shared_info", "type": "string" }
  ]
};

const RECORD_SCHEMA = avro.Type.forSchema(REPORTS_AVRO_SCHEMA);

const registerAggregateReport = (req, res) => {
  try {
    // const report = req.body;
    // Example to illustrate what the request body would be
    const report = {
      "aggregation_coordinator_origin": "https://publickeyservice.msmt.aws.privacysandboxservices.com",
      "aggregation_service_payloads": [
        {
          "key_id": "bbe6351f-5619-4c98-84b2-4a74fa1ae254",
          "payload": "7K9SQLdROKqITmnrkgIDulfEXDAR76XUP4vc6uzxPwDycQql3AhR3dxeXdEw2gbUaIAldnu33RSN4SAFcFFKgDQkvnhFzPoxJjO2Yfw4osJ1S0Odp0smu0rC5k5GuG4oIu9YQofCPNmSD7KRVJ9Y6Lucz3BXoI3RQhpQkO31RDyxVJdBbJ8JiS2KBtu8naUf5Z+/mNNKp39ObsNbo7kQKI0TwyRJDSJKqv42Yi3ctoAhOT0eaaUtMfho67i9XaEtVnh8wB4Mi+nzlAfVsGIavP6aXWDe44IgKZvTS/zEKjI68+nzWkyfdRNOf7jtb2XnoB7k5iM+Yu9Ayk5ic/aT1eA1iPEzLvW/tNLcohne3UL2DefZoTLb5l9aludA7Qlf0g+kW9nuvUSmHBuTjE/fTY5s9uRExHH+b2Hjm2sL9DyrFZUFqcl/KLS+McgOT8I0ZTpPRmr+njW8+4b01Hsc2MpY3KKAn1jUDUE45pGbhj/Gqlb1ikJO9nNKS/nnWJgR7+3P8JEpHC2fkfEase4+vrNxZujWolYfTUxswJpiEZs1+fCOroEyyEY6Zjvx5qLbk+7wMNqCeCltDPA6c8WtAPtMreIUvKbco6XUUzaGSnvWLz6/WJqCxG4hjPOfcYAWXIwSboqvNyBHrRr4H5V7C0unSkIjd0j/GeB3ywgnKEqiihuvZ5PPw+O5aYqJdaR3QEFZtpLj+3Uv4OGn2+CvU1thV0A0H1XViP846Tfmb0jVejN1+ih+VO5cf/7T2TPz6oGO9sa6qitWtll5vhwxVyG3vniCo3xghGnUcHSP5ogfp6qgDGSgsGFqSvdiuOpQU+MG/HrCDUjvce0GoXJP6674UcurGxR9UKAnVwZyKRIj/q9qzUgxhWEFC3ssADMmxhZBs3X+rrAxKfhXD12MfuUluRTCzpCKZ9/YapnJQYjngGx7GIkfW6tw8eSCC8yO41vWyHGRz4nKlgNeQkwYafGPzXqUXjyEyiupMUlmSsU/zT52wdCQYLJbQg7xhNuLebb8qh9LW07jMho4Vo9DBP9l463uqA8hcZnJ"
        }
      ],
      "shared_info": "{\"api\":\"attribution-reporting\",\"attribution_destination\":\"https://cliente.com\",\"report_id\":\"4d82121f-7d62-4fa4-bda4-a70c9e850089\",\"reporting_origin\":\"https://attribution.ads.uol.com.br\",\"scheduled_report_time\":\"1714764978\",\"source_registration_time\":\"0\",\"version\":\"0.1\"}"
    };

    report.aggregation_service_payloads.forEach((payload) => {
      // The payload arrives base64-encoded and must be stored as raw bytes.
      const payloadBytes = Buffer.from(payload.payload, 'base64');
      const record = {
        payload: payloadBytes,
        key_id: payload.key_id,
        shared_info: report.shared_info,
      };
      const outputFilename = `./reports/output_reports_${Date.now()}.avro`;
      const encoder = avro.createFileEncoder(outputFilename, RECORD_SCHEMA);
      encoder.write(record);
      encoder.end();
    });

    res.status(200).send('Report received successfully.');
  } catch (e) {
    console.error('Error processing report:', e);
    res.status(400).send('Failed to process report.');
  }
};

module.exports = {
  registerAggregateReport
};
*English is not my native language, so take it easy.
As the death of third-party cookies is something that will affect everyone, it would be nice to have references in more commonly used languages such as Node.js, Java, etc. I hope this post can contribute in some way to that.
Hi all!
We are currently exploring migration from origin enrollment to site enrollment for the Aggregation Service (current form using origin here) for the following reasons:
As a follow up to this proposal, we would like to support multiple origins in a batch of aggregatable reports. Do adtechs have a preference or blocking concern with either specifying a list of origins or the site in the createJob request?