googlecloudplatform / dlp-dataflow-deidentification
Multi Cloud Data Tokenization Solution Using Dataflow and Cloud DLP
License: Apache License 2.0
To whom it may concern,
The input bucket parser needs to be updated to parse the GCS folder structure when the input file is nested inside a folder in a GCS bucket.
Hello,
I am working on a Google DLP project and have followed the tutorials to get the DLP demo project running in our Google Cloud account.
I have tried both approaches and get an error when my Dataflow job runs.
I executed the commands below in Google Cloud Shell:
gcloud config set project bizsysit-298723
sh deploy-data-tokeninzation-solution.sh
All the resources are deployed, but when I start the job with the command below, it runs and then fails.
gcloud dataflow jobs run demo-dlp-deid-pipeline-20210524-195542 \
--gcs-location gs://dataflow-templates/latest/Stream_DLP_GCS_Text_to_BigQuery \
--region=us-central1 \
--parameters inputFilePattern=gs://bizsysit-298723-demo-data/CCRecords_1564602825.csv,dlpProjectId=bizsysit-298723,deidentifyTemplateName=projects/bizsysit-298723/deidentifyTemplates/dlp-demo-deid-latest-1621886108196,inspectTemplateName=projects/bizsysit-298723/inspectTemplates/dlp-demo-inspect-latest-1621886108196,datasetName=demo_dataset,batchSize=500
Error message from worker: java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: java.nio.charset.MalformedInputException: Input length = 1 org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:187) org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:108) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.lambda$onTrigger$1(ReduceFnRunner.java:1060) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnContextFactory$OnTriggerContextImpl.output(ReduceFnContextFactory.java:445) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SystemReduceFn.onTrigger(SystemReduceFn.java:130) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.onTrigger(ReduceFnRunner.java:1063) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.emit(ReduceFnRunner.java:934) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.onTimers(ReduceFnRunner.java:795) org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowViaWindowSetFn.processElement(StreamingGroupAlsoByWindowViaWindowSetFn.java:95) org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowViaWindowSetFn.processElement(StreamingGroupAlsoByWindowViaWindowSetFn.java:42) org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement(GroupAlsoByWindowFnRunner.java:121) org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.processElement(GroupAlsoByWindowFnRunner.java:73) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.LateDataDroppingDoFnRunner.processElement(LateDataDroppingDoFnRunner.java:81) 
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement(GroupAlsoByWindowsParDoFn.java:137) org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44) org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49) org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:212) org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:163) org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:92) org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1426) org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1100(StreamingDataflowWorker.java:163) org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:1105) java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) java.base/java.lang.Thread.run(Thread.java:834) Caused by: org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: java.nio.charset.MalformedInputException: Input length = 1 org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:39) com.google.cloud.teleport.templates.DLPTextToBigQueryStreaming$1$DoFnInvoker.invokeProcessElement(Unknown Source) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:232) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:191) org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:339) 
org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44) org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49) org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:185) org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:108) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.lambda$onTrigger$1(ReduceFnRunner.java:1060) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnContextFactory$OnTriggerContextImpl.output(ReduceFnContextFactory.java:445) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SystemReduceFn.onTrigger(SystemReduceFn.java:130) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.onTrigger(ReduceFnRunner.java:1063) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.emit(ReduceFnRunner.java:934) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.onTimers(ReduceFnRunner.java:795) org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowViaWindowSetFn.processElement(StreamingGroupAlsoByWindowViaWindowSetFn.java:95) org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowViaWindowSetFn.processElement(StreamingGroupAlsoByWindowViaWindowSetFn.java:42) org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement(GroupAlsoByWindowFnRunner.java:121) org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.processElement(GroupAlsoByWindowFnRunner.java:73) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.LateDataDroppingDoFnRunner.processElement(LateDataDroppingDoFnRunner.java:81) 
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement(GroupAlsoByWindowsParDoFn.java:137) org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44) org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49) org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:212) org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:163) org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:92) org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1426) org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1100(StreamingDataflowWorker.java:163) org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:1105) java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) java.base/java.lang.Thread.run(Thread.java:834) Caused by: java.lang.RuntimeException: java.nio.charset.MalformedInputException: Input length = 1 com.google.cloud.teleport.templates.DLPTextToBigQueryStreaming.getFileHeaders(DLPTextToBigQueryStreaming.java:748) com.google.cloud.teleport.templates.DLPTextToBigQueryStreaming.access$100(DLPTextToBigQueryStreaming.java:138) com.google.cloud.teleport.templates.DLPTextToBigQueryStreaming$1.lambda$processElement$0(DLPTextToBigQueryStreaming.java:236) java.base/java.lang.Iterable.forEach(Iterable.java:75) com.google.cloud.teleport.templates.DLPTextToBigQueryStreaming$1.processElement(DLPTextToBigQueryStreaming.java:233) Caused by: java.nio.charset.MalformedInputException: Input length = 1 java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274) 
java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339) java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) java.base/java.io.BufferedReader.read1(BufferedReader.java:210) java.base/java.io.BufferedReader.read(BufferedReader.java:287) java.base/java.io.BufferedReader.fill(BufferedReader.java:161) java.base/java.io.BufferedReader.read(BufferedReader.java:182) org.apache.commons.csv.ExtendedBufferedReader.read(ExtendedBufferedReader.java:58) org.apache.commons.csv.Lexer.parseSimpleToken(Lexer.java:215) org.apache.commons.csv.Lexer.nextToken(Lexer.java:167) org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:674) org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:628) com.google.cloud.teleport.templates.DLPTextToBigQueryStreaming.getFileHeaders(DLPTextToBigQueryStreaming.java:741) com.google.cloud.teleport.templates.DLPTextToBigQueryStreaming.access$100(DLPTextToBigQueryStreaming.java:138) com.google.cloud.teleport.templates.DLPTextToBigQueryStreaming$1.lambda$processElement$0(DLPTextToBigQueryStreaming.java:236) java.base/java.lang.Iterable.forEach(Iterable.java:75) com.google.cloud.teleport.templates.DLPTextToBigQueryStreaming$1.processElement(DLPTextToBigQueryStreaming.java:233) com.google.cloud.teleport.templates.DLPTextToBigQueryStreaming$1$DoFnInvoker.invokeProcessElement(Unknown Source) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:232) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:191) org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:339) org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44) org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49) 
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:185) org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:108) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.lambda$onTrigger$1(ReduceFnRunner.java:1060) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnContextFactory$OnTriggerContextImpl.output(ReduceFnContextFactory.java:445) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SystemReduceFn.onTrigger(SystemReduceFn.java:130) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.onTrigger(ReduceFnRunner.java:1063) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.emit(ReduceFnRunner.java:934) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.onTimers(ReduceFnRunner.java:795) org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowViaWindowSetFn.processElement(StreamingGroupAlsoByWindowViaWindowSetFn.java:95) org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowViaWindowSetFn.processElement(StreamingGroupAlsoByWindowViaWindowSetFn.java:42) org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement(GroupAlsoByWindowFnRunner.java:121) org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.processElement(GroupAlsoByWindowFnRunner.java:73) org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.LateDataDroppingDoFnRunner.processElement(LateDataDroppingDoFnRunner.java:81) org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement(GroupAlsoByWindowsParDoFn.java:137) org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44) 
org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49) org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:212) org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:163) org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:92) org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1426) org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1100(StreamingDataflowWorker.java:163) org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:1105) java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
I followed all the steps from the Google tutorial.
I get the same error as above with the 1st approach.
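The root cause in the trace is java.nio.charset.MalformedInputException thrown while reading the CSV headers, which typically means the input file is not valid UTF-8 (for example, it was exported as ISO-8859-1 or Windows-1252). One possible fix, sketched below with a hypothetical helper and assuming the source encoding is ISO-8859-1, is to re-encode the file to UTF-8 before uploading it to GCS:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReencodeCsv {
    // Hypothetical helper: decode the file as ISO-8859-1 (which accepts any
    // byte sequence) and re-write it as UTF-8 so downstream UTF-8 readers
    // such as the Dataflow template's header parser do not fail.
    static void toUtf8(Path in, Path out) throws IOException {
        String content = new String(Files.readAllBytes(in), StandardCharsets.ISO_8859_1);
        Files.write(out, content.getBytes(StandardCharsets.UTF_8));
    }
}
```

Alternatively, `iconv -f ISO-8859-1 -t UTF-8` achieves the same from the shell before the `gsutil cp` step.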
Hi everyone,
I used the Gradle run command to de-identify the data, providing the template name, source (GCS), and destination (BigQuery). The de-identification is successful.
For de-identification I manually created a de-identification template that just encrypts 2 columns using a deterministic cryptographic token.
When I ran the Gradle command for re-identification I used the same template, since it just encrypts the 2 columns, and during re-identification it should decrypt the data.
So I used a query similar to the one below and stored it in GCS.
export QUERY="select Name,AadharCardNum,Pancard from Project_ID-1.workerDetails.WorkerDetails;"
And the gradle command which I used is:
gradle run -DmainClass=com.google.swarm.tokenization.DLPTextToBigQueryStreamingV2 -Pargs=" \
--region=region \
--project=project_id \
--tempLocation=gs://bucket/temp \
--numWorkers=1 \
--maxNumWorkers=2 \
--runner=DataflowRunner \
--tableRef=project_id:dataset.table \
--dataset=dataset \
--topic=projects/project_id/topics/name \
--autoscalingAlgorithm=THROUGHPUT_BASED \
--workerMachineType=n1-standard-1 \
--deidentifyTemplateName=projects/project_id/locations/location/deidentifyTemplates/name \
--DLPMethod=REID \
--keyRange=1024 \
--queryPath=gs://bucket/query.sql \
--DLPParent=projects/project_id/locations/location"
Here the data is written to Pub/Sub, but it is not decrypted; the same encrypted data is written to Pub/Sub.
I couldn't figure out why it is not decrypting the data.
The parent is set with only a project id, meaning all calls go to "global" instead of the specified region.
Build failing with the error:
Step #0: > Task :compileJava
Step #0: /workspace/src/main/java/com/google/swarm/tokenization/common/SanitizeFileNameDoFn.java:37: error: cannot find symbol
Step #0: Arrays.asList("csv", "avro", "json", "txt").stream().collect(Collectors.toUnmodifiableSet());;
Step #0: ^
Step #0: symbol: method toUnmodifiableSet()
Step #0: location: class Collectors
Step #0: 1 error
Step #0:
Step #0: > Task :compileJava FAILED
Step #0:
Step #0: FAILURE: Build failed with an exception.
Step #0:
Step #0: * What went wrong:
Step #0: Execution failed for task ':compileJava'.
Step #0: > Compilation failed; see the compiler error output for details.
Step #0:
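The compile error occurs because Collectors.toUnmodifiableSet() was only added in Java 10, so the build fails on a Java 8 toolchain. Either build with JDK 11+, or replace the line with a Java 8-compatible equivalent, sketched here (class and field names are illustrative, not the repo's actual code):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class SupportedExtensions {
    // Java 8-compatible replacement for
    // Arrays.asList(...).stream().collect(Collectors.toUnmodifiableSet())
    static final Set<String> EXTENSIONS =
        Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList("csv", "avro", "json", "txt")));
}
```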
My test file to de-identify has a name that is not compatible with BigQuery's rules for table names.
This would work better if it let me specify the output table, or if it normalized the file name.
2020-06-08 22:12:29.631 PDTError message from worker: java.lang.RuntimeException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request { "code" : 400, "errors" : [ { "domain" : "global", "message" : "Invalid table ID "dlp-sample-files_100k". Table IDs must be alphanumeric (plus underscores) and must be at most 1024 characters long. Also, Table decorators cannot be used.", "reason" : "invalid" } ], "message" : "Invalid table ID "dlp-sample-files_100k". Table IDs must be alphanumeric (plus underscores) and must be at most 1024 characters long. Also, Table decorators cannot be used.", "status" : "INVALID_ARGUMENT" } org.apache.beam.sdk.io.gcp.bigquery.CreateTables$CreateTablesFn.tryCreateTable(CreateTables.java:208) org.apache.beam.sdk.io.gcp.bigquery.CreateTables$CreateTablesFn.getTableDestination(CreateTables.java:160) org.apache.beam.sdk.io.gcp.bigquery.CreateTables$CreateTablesFn.lambda$processElement$0(CreateTables.java:113) java.util.HashMap.computeIfAbsent(HashMap.java:1127) org.apache.beam.sdk.io.gcp.bigquery.CreateTables$CreateTablesFn.processElement(CreateTables.java:112) Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request { "code" : 400, "errors" : [ { "domain" : "global", "message" : "Invalid table ID "dlp-sample-files_100k". Table IDs must be alphanumeric (plus underscores) and must be at most 1024 characters long. Also, Table decorators cannot be used.", "reason" : "invalid" } ], "message" : "Invalid table ID "dlp-sample-files_100k". Table IDs must be alphanumeric (plus underscores) and must be at most 1024 characters long. Also, Table decorators cannot be used.", "status" : "INVALID_ARGUMENT" } com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:150)
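Per the error, BigQuery table IDs must contain only letters, digits, and underscores. A minimal sketch of the kind of normalization the pipeline could apply to a file name before using it as a table ID (helper name is hypothetical, not part of the repo):

```java
public class TableIdSanitizer {
    // Hypothetical helper: strip the file extension, then replace every
    // character that BigQuery disallows in table IDs with an underscore.
    static String sanitize(String fileName) {
        String base = fileName.replaceFirst("\\.[^.]+$", "");
        return base.replaceAll("[^a-zA-Z0-9_]", "_");
    }
}
```

For example, a file named dlp-sample-files_100k.csv would map to the valid table ID dlp_sample_files_100k.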
When I run the script, this line returns the following error:
+ jobId=demo-dlp-deid-pipeline-20210908-064259
+ gcloud dataflow jobs run demo-dlp-deid-pipeline-20210908-064259 --gcs-location gs://dataflow-templates/latest/Stream_DLP_GCS_Text_to_BigQuery --parameters --region=us-central1,inputFilePattern=gs://<project_id>-demo2-demo-data/CCRecords_1564602825.csv,dlpProjectId=<project_id>,deidentifyTemplateName=projects/<project_id>/deidentifyTemplates/dlp-demo-deid-latest-1631083353071,inspectTemplateName=projects/<project_id>/inspectTemplates/dlp-demo-inspect-latest-1631083353071,datasetName=demo_dataset,batchSize=500
ERROR: (gcloud.dataflow.jobs.run) argument --parameters: expected one argument
Usage: gcloud dataflow jobs run JOB_NAME --gcs-location=GCS_LOCATION [optional flags]
optional flags may be --additional-experiments | --dataflow-kms-key |
--disable-public-ips | --enable-streaming-engine |
--help | --max-workers | --network | --num-workers |
--parameters | --region | --service-account-email |
--staging-location | --subnetwork |
--worker-machine-type | --worker-region |
--worker-zone | --zone
The corrected command that I tried:
gcloud dataflow jobs run ${jobId} --gcs-location gs://dataflow-templates/latest/Stream_DLP_GCS_Text_to_BigQuery --region=us-central1 --parameters "inputFilePattern=gs://${DATA_STORAGE_BUCKET}/CCRecords_1564602825.csv,dlpProjectId=${PROJECT_ID},deidentifyTemplateName=${DEID_TEMPLATE_NAME},inspectTemplateName=${INSPECT_TEMPLATE_NAME},datasetName=${BQ_DATASET_NAME},batchSize=500"
The template generated from "gradle run -DmainClass=com.google.swarm.tokenization.CSVBatchPipeline -Pargs="--streaming --project=mycegcpdemo-rico --runner=DataflowRunner --templateLocation=gs://input-tangerine-data/upload_structure_template --gcpTempLocation=gs://input-tangerine-data/structure_template"" has the input argument "tablespec" named "labels". As a result, the user has to edit the metadata file to rename the "labels" input variable back to "tablespec".
Hi there, I cloned the repo and followed the README instructions, and the runs are failing.
Exception in thread "main" java.lang.IllegalArgumentException: com.google.swarm.tokenization.common.CSVContentProcessorDoFn, @ProcessElement processElement(ProcessContext, OffsetRangeTracker): Has tracker type OffsetRangeTracker, but the DoFn's tracker type must be of type RestrictionTracker.
at org.apache.beam.sdk.transforms.reflect.DoFnSignatures$ErrorReporter.throwIllegalArgument(DoFnSignatures.java:1495)
at org.apache.beam.sdk.transforms.reflect.DoFnSignatures$ErrorReporter.checkArgument(DoFnSignatures.java:1500)
at org.apache.beam.sdk.transforms.reflect.DoFnSignatures.verifySplittableMethods(DoFnSignatures.java:595)
at org.apache.beam.sdk.transforms.reflect.DoFnSignatures.parseSignature(DoFnSignatures.java:474)
at org.apache.beam.sdk.transforms.reflect.DoFnSignatures.lambda$getSignature$0(DoFnSignatures.java:140)
at java.base/java.util.HashMap.computeIfAbsent(HashMap.java:1133)
at org.apache.beam.sdk.transforms.reflect.DoFnSignatures.getSignature(DoFnSignatures.java:140)
at org.apache.beam.sdk.transforms.ParDo.validate(ParDo.java:547)
at org.apache.beam.sdk.transforms.ParDo.of(ParDo.java:397)
at com.google.swarm.tokenization.CSVStreamingPipeline.doTokenization(CSVStreamingPipeline.java:71)
at com.google.swarm.tokenization.CSVStreamingPipeline.main(CSVStreamingPipeline.java:109)
Hi,
I have been trying to create a Dataflow job to de-identify data from GCS to BigQuery in the asia-south1 region using the Gradle build commands.
Our organization policy restricts resource creation in the asia-south1 region, and whenever we create a job we have to add "--additional-experiments=enable_secure_boot".
So I have cloned the git repo in cloud shell (It has Gradle 6.0)
And ran the commands
gradle spotlessApply
gradle build
gradle run -DmainClass=com.google.swarm.tokenization.DLPTextToBigQueryStreamingV2 -Pargs=" --region=<region> --project=<project_id> --streaming --enableStreamingEngine --tempLocation=gs://<bucket>/temp --numWorkers=1 --maxNumWorkers=2 --runner=DataflowRunner --filePattern=gs://<bucket>/<file>.csv --dataset=<dataset> --inspectTemplateName=<inspect_template> --deidentifyTemplateName=<deid_template> --DLPMethod=DEID"
A Dataflow job is created, but it gets aborted with this error:
Startup of the worker pool in zone asia-south1-a failed to bring up any of the desired 5 workers. CONDITION_NOT_MET: Instance 'dlptexttobigquerystreamin' creation failed: Constraint constraints/compute.requireShieldedVm violated for project projectName Secure Boot is not enabled in the 'shielded_instance_config' field. See https://cloud.google.com/resource-manager/docs/organization-policy/org-policy-constraints for more information.
I tried to add a new parameter to DLPTextToBigQueryStreamingV2PipelineOptions, but wasn't successful.
I tried to clone the Dataflow job, which was not possible because cloning is disabled.
Can anyone help me figure out how to add the extra required parameter?
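One possible workaround: the Dataflow Java SDK exposes an experiments pipeline option, so appending the flag to the existing -Pargs string may be enough, without editing DLPTextToBigQueryStreamingV2PipelineOptions at all. This is a sketch and has not been verified against this pipeline's option parser (the placeholder stands in for the args already shown above):

```
gradle run -DmainClass=com.google.swarm.tokenization.DLPTextToBigQueryStreamingV2 -Pargs=" \
<existing args> \
--experiments=enable_secure_boot"
```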
For the use case with Google-managed-key GCS objects, there is no need to pass this argument, yet it is currently required even though it is marked as optional in the template. Use the default project id as the dlpProjectId if none is supplied.
We are trying to deploy all resources in a regional location (us-east1). We made the following changes so that deidentify template and inspect template are created in us-east1 instead of global.
args: ['kms', 'keyrings', 'create', '${_KEY_RING_NAME}', '--location=us-east1']
...
args: ['kms', 'keys', 'create' ,'${_KEY_NAME}', '--location=us-east1','--purpose=encryption','--keyring=${_KEY_RING_NAME}']
KEK_API="${API_ROOT_URL}/v1/projects/${PROJECT_ID}/locations/us-east1/keyRings/${KEY_RING_NAME}/cryptoKeys/${KEY_NAME}:encrypt"
INSPECT_TEMPLATE_API="${DLP_API_ROOT_URL}/v2/projects/${PROJECT_ID}/locations/us-east1/inspectTemplates"
...
TEMPLATES_LAUNCH_API="${DF_API_ROOT_URL}/v1b3/projects/${PROJECT_ID}/locations/us-east1/templates:launch"
DEID_TEMPLATE_API="${API_ROOT_URL}/v2/projects/${PROJECT_ID}/locations/us-east1/deidentifyTemplates"
INSPECT_TEMPLATE_API="${API_ROOT_URL}/v2/projects/${PROJECT_ID}/locations/us-east1/inspectTemplates"
The problem is if we submit the Dataflow job using these templates, the job will report permission error.
export DEID_TEMPLATE_NAME=projects/{PROJECT-ID}/locations/us-east1/deidentifyTemplates/{TEMPLATE-NAME}
export INSPECT_TEMPLATE_NAME=projects/{PROJECT-ID}/locations/us-east1/inspectTemplates/{TEMPLATE-NAME}
export JOB_ID=my-deid-job-us-east1
gcloud dataflow jobs run ${JOB_ID} \
--gcs-location gs://dataflow-templates-us-east1/latest/Stream_DLP_GCS_Text_to_BigQuery \
--region ${REGION} \
--parameters "inputFilePattern=gs://${DATA_STORAGE_BUCKET}/CCRecords_1564602825.csv,dlpProjectId=${PROJECT_ID},deidentifyTemplateName=${DEID_TEMPLATE_NAME},inspectTemplateName=${INSPECT_TEMPLATE_NAME},datasetName=deid_dataset,batchSize=500"
Error:
2020-11-13 17:39:11.721 HKTError message from worker: com.google.api.gax.rpc.PermissionDeniedException: io.grpc.StatusRuntimeException: PERMISSION_DENIED: Not authorized to access requested inspect template.
2020-11-13 17:27:09.146 HKTError message from worker: com.google.api.gax.rpc.PermissionDeniedException: io.grpc.StatusRuntimeException: PERMISSION_DENIED: Not authorized to access requested deidentify template.
It seems the Dataflow template cannot work with a deidentify or inspect template in a regional location. (The global location works as expected.)
Any input is appreciated.
Thanks,
Qian
Investigate what's causing this to fail. gradle run still works.
java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;
com.google.swarm.tokenization.common.FileReader.tryToEnsureNumberOfBytesInBuffer(FileReader.java:163)
com.google.swarm.tokenization.common.FileReader.findDelimiterBounds(FileReader.java:88)
com.google.swarm.tokenization.common.FileReader.startReading(FileReader.java:77)
com.google.swarm.tokenization.common.FileReader.<init>(FileReader.java:48)
com.google.swarm.tokenization.common.CSVFileReaderSplitDoFn.processElement(CSVFileReaderSplitDoFn.java:57)
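This NoSuchMethodError is a known Java 8/9 incompatibility: JDK 9 gave ByteBuffer.flip() a covariant ByteBuffer return type, so bytecode compiled on JDK 9+ references flip()Ljava/nio/ByteBuffer;, which does not exist on a Java 8 runtime. Building with `--release 8` fixes it at the toolchain level; in code, casting to Buffer forces the Java 8 signature, as in this sketch (helper name is illustrative, not the repo's FileReader):

```java
import java.nio.Buffer;
import java.nio.ByteBuffer;

public class FlipCompat {
    // Casting to Buffer links the Java 8 Buffer.flip() signature, so
    // bytecode built on JDK 9+ still runs on a Java 8 worker.
    static ByteBuffer fillAndFlip(byte[] data) {
        ByteBuffer buf = ByteBuffer.allocate(data.length);
        buf.put(data);
        ((Buffer) buf).flip();
        return buf;
    }
}
```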
Let's say in my use case I have 100 columns and I only want to mask 2 of them, so I decide to send only those 2 columns to the DLP handler and get results back for just those 2 columns.
Say I send 1,000 records (2 columns each).
Does the call to the DLP table guarantee order? If I send 1,000 records (2 columns) and scale to DLP's maximum capability, is it guaranteed that results come back in the same order? (I need to stitch the records of these 2 columns back into the 1,000 100-column records.) If DLP preserves order, I can loop through both lists in order and merge them; otherwise I need to join them on some key, which I'd have to identify.
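Within a single DLP deidentify request on a table, rows generally come back in request order, but once the pipeline batches requests and fans them out across workers, relying on order is fragile. A safer pattern is to key each 2-column slice by its row index before the DLP call and merge by that key afterwards. A minimal sketch (names hypothetical; the DLP call itself is elided):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RowRejoin {
    // Hypothetical merge step: `masked` maps row index -> the 2 de-identified
    // column values returned by DLP; colA/colB are their positions in the
    // full row. Lookup is by key, so DLP response order does not matter.
    static List<String[]> merge(List<String[]> fullRows,
                                Map<Integer, String[]> masked,
                                int colA, int colB) {
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < fullRows.size(); i++) {
            String[] row = fullRows.get(i).clone();
            String[] m = masked.get(i);
            row[colA] = m[0];
            row[colB] = m[1];
            out.add(row);
        }
        return out;
    }
}
```

In Beam terms the same idea is a KV<rowKey, columns> pair through the DLP transform followed by a CoGroupByKey with the untouched columns.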
Hi,
I have successfully run the V2 Inspect and Deidentify pipeline.
Now I want to test the Re-identification from BigQuery pipeline. I tried running this Gradle command:
gradle run -DmainClass=com.google.swarm.tokenization.DLPTextToBigQueryStreamingV2 -Pargs=" \
--region=<region> \
--project=<project_id> \
--streaming \
--enableStreamingEngine \
--tempLocation=gs://<bucket>/temp \
--numWorkers=1 \
--maxNumWorkers=2 \
--runner=DataflowRunner \
--tableRef=<project_id>:<dataset>.<table> \
--topic=projects/<project_id>/topics/<name> \
--autoscalingAlgorithm=THROUGHPUT_BASED \
--workerMachineType=n1-standard-1 \
--deidentifyTemplateName=projects/<project_id>/locations/<location>/deidentifyTemplates/<name> \
--DLPMethod=REID \
--keyRange=1024 \
--queryPath=gs://<bucket>/file.sql \
--DLPParent=projects/<project_id>/locations/<location>"
Running the command throws this error:
Exception in thread "main" java.lang.NullPointerException: Null datasetId
I have tried to add the following parameter to the above Gradle command:
--dataset=<datasetname>
But this throws the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Auto-sharding is only applicable to unbounded input.
I am not able to continue from here on, and I cannot find any specific info about this error either. Any help is appreciated.
Hi - I get an error with the last line in deploy-data-tokeninzation-solution.sh.
gcloud config set project <project_id>
sh deploy-data-tokeninzation-solution.sh
...
gcloud dataflow jobs run ${jobId} --gcs-location gs://dataflow-templates/latest/Stream_DLP_GCS_Text_to_BigQuery --parameters --region=us-central1,inputFilePattern=gs://${DATA_STORAGE_BUCKET}/CCRecords_1564602825.csv,dlpProjectId=${PROJECT_ID},deidentifyTemplateName=${DEID_TEMPLATE_NAME},inspectTemplateName=${INSPECT_TEMPLATE_NAME},datasetName=${BQ_DATASET_NAME},batchSize=500
ERROR: (gcloud.dataflow.jobs.run) argument --parameters: expected one argument
Usage: gcloud dataflow jobs run JOB_NAME --gcs-location=GCS_LOCATION [optional flags]
optional flags may be --additional-experiments | --dataflow-kms-key |
--disable-public-ips | --enable-streaming-engine |
--help | --max-workers | --network | --num-workers |
--parameters | --region | --service-account-email |
--staging-location | --subnetwork |
--worker-machine-type | --worker-region |
--worker-zone | --zone
If I switch the args around, pulling --region out of --parameters, then it works:
gcloud dataflow jobs run ${jobId} --gcs-location gs://dataflow-templates/latest/Stream_DLP_GCS_Text_to_BigQuery --region=us-central1 --parameters=inputFilePattern=gs://${DATA_STORAGE_BUCKET}/CCRecords_1564602825.csv,dlpProjectId=${PROJECT_ID},deidentifyTemplateName=${DEID_TEMPLATE_NAME},inspectTemplateName=${INSPECT_TEMPLATE_NAME},datasetName=${BQ_DATASET_NAME},batchSize=500
createTime: '2021-08-11T11:14:10.713985Z'
currentStateTime: '1970-01-01T00:00:00Z'
id: 2021-08-11_04_14_08-3628396078995258394
location: us-central1
name: demo-dlp-deid-pipeline-20210811-111101
projectId: <project-id>
startTime: '2021-08-11T11:14:10.713985Z'
type: JOB_TYPE_STREAMING
To whom it may concern,
Please rename the key ring and decryption key argument parameters to something more intuitive. Currently they are:
--fileDecryptKeyName=String
--fileDecryptKey=String
These names are not intuitive. Thanks.
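As an illustrative sketch only: the clearer names used below (--kmsKeyRingName and --kmsKeyName) are a naming suggestion, not flags the pipeline currently accepts, and the parser is plain Java rather than the project's Beam PipelineOptions machinery. It just shows that renamed flags would still be ordinary --key=value pairs:

```java
import java.util.HashMap;
import java.util.Map;

public class FlagNames {
    // Parse --key=value style arguments into a map. The flag names
    // exercised in usage (--kmsKeyRingName, --kmsKeyName) are hypothetical
    // replacements for --fileDecryptKeyName / --fileDecryptKey.
    static Map<String, String> parse(String[] args) {
        Map<String, String> flags = new HashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (arg.startsWith("--") && eq > 2) {
                flags.put(arg.substring(2, eq), arg.substring(eq + 1));
            }
        }
        return flags;
    }
}
```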
"Error message from worker: java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.lang.ArrayIndexOutOfBoundsException: 1
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:184)
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:108)
org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowReshuffleFn.processElement(StreamingGroupAlsoByWindowReshuffleFn.java:57)
org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowReshuffleFn.processElement(StreamingGroupAlsoByWindowReshuffleFn.java:39)
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement(GroupAlsoByWindowFnRunner.java:121)
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.processElement(GroupAlsoByWindowFnRunner.java:73)
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement(GroupAlsoByWindowsParDoFn.java:134)
org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1369)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1100(StreamingDataflowWorker.java:154)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:1088)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.beam.sdk.util.UserCodeException: java.lang.ArrayIndexOutOfBoundsException: 1
org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:36)
com.google.swarm.tokenization.common.SanitizeFileNameDoFn$DoFnInvoker.invokeProcessElement(Unknown Source)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:227)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:186)
org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:335)
org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
org.apache.beam.runners.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:280)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:267)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.access$900(SimpleDoFnRunner.java:79)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:413)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:401)
org.apache.beam.sdk.io.FileIO$ReadMatches$ToReadableFileFn.process(FileIO.java:801)
org.apache.beam.sdk.io.FileIO$ReadMatches$ToReadableFileFn$DoFnInvoker.invokeProcessElement(Unknown Source)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:227)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:186)
org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:335)
org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
org.apache.beam.runners.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:280)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:267)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.access$900(SimpleDoFnRunner.java:79)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:413)
org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.output(DoFnOutputReceivers.java:73)
org.apache.beam.sdk.transforms.MapElements$1.processElement(MapElements.java:139)
org.apache.beam.sdk.transforms.MapElements$1$DoFnInvoker.invokeProcessElement(Unknown Source)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:227)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:186)
org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:335)
org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
org.apache.beam.runners.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:280)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:267)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.access$900(SimpleDoFnRunner.java:79)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:413)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:401)
org.apache.beam.runners.dataflow.ReshuffleOverrideFactory$ReshuffleWithOnlyTrigger$1.processElement(ReshuffleOverrideFactory.java:86)
org.apache.beam.runners.dataflow.ReshuffleOverrideFactory$ReshuffleWithOnlyTrigger$1$DoFnInvoker.invokeProcessElement(Unknown Source)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:227)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:186)
org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:335)
org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:182)
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:108)
org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowReshuffleFn.processElement(StreamingGroupAlsoByWindowReshuffleFn.java:57)
org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowReshuffleFn.processElement(StreamingGroupAlsoByWindowReshuffleFn.java:39)
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement(GroupAlsoByWindowFnRunner.java:121)
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.processElement(GroupAlsoByWindowFnRunner.java:73)
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement(GroupAlsoByWindowsParDoFn.java:134)
org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1369)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1100(StreamingDataflowWorker.java:154)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:1088)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
com.google.swarm.tokenization.common.SanitizeFileNameDoFn.sanitizeFileName(SanitizeFileNameDoFn.java:50)
com.google.swarm.tokenization.common.SanitizeFileNameDoFn.processElement(SanitizeFileNameDoFn.java:66)"
timestamp: "2021-09-02T05:14:47.945344080Z"
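The trace above ends in SanitizeFileNameDoFn.sanitizeFileName throwing ArrayIndexOutOfBoundsException: 1, which is consistent with indexing into a split of the object name at a fixed position. A minimal sketch of a more tolerant approach (an assumption about the fix, not the repo's actual code: the method name and the idea that a table name is derived from the file stem are taken from the trace and the README context) might look like this:

```java
public class SanitizeSketch {
    // Hypothetical tolerant variant: keep only the last path segment so
    // nested GCS object names ("folder/sub/file.csv") and flat names
    // ("file.csv") both work, instead of reading a split array at a fixed
    // index (which throws ArrayIndexOutOfBoundsException when the expected
    // separator is missing).
    static String sanitizeFileName(String objectName) {
        String base = objectName.substring(objectName.lastIndexOf('/') + 1);
        int dot = base.lastIndexOf('.');
        String stem = (dot >= 0) ? base.substring(0, dot) : base;
        // Replace characters that BigQuery table names do not allow.
        return stem.replaceAll("[^A-Za-z0-9_]", "_");
    }
}
```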
Exception in thread "main" java.lang.UnsupportedOperationException: Cannot define class using reflection: Cannot define nest member class java.lang.reflect.AccessibleObject$Cache + within different package then class org.apache.beam.vendor.bytebuddy.v1_9_3.net.bytebuddy.mirror.AccessibleObject
at org.apache.beam.vendor.bytebuddy.v1_9_3.net.bytebuddy.dynamic.loading.ClassInjector$UsingReflection$Dispatcher$Initializable$Unavailable.defineClass(ClassInjector.java:410)
at org.apache.beam.vendor.bytebuddy.v1_9_3.net.bytebuddy.dynamic.loading.ClassInjector$UsingReflection.injectRaw(ClassInjector.java:235)
at org.apache.beam.vendor.bytebuddy.v1_9_3.net.bytebuddy.dynamic.loading.ClassInjector$AbstractBase.inject(ClassInjector.java:111)
at org.apache.beam.vendor.bytebuddy.v1_9_3.net.bytebuddy.dynamic.loading.ClassLoadingStrategy$Default$InjectionDispatcher.load(ClassLoadingStrategy.java:232)
at org.apache.beam.vendor.bytebuddy.v1_9_3.net.bytebuddy.dynamic.loading.ClassLoadingStrategy$Default.load(ClassLoadingStrategy.java:143)
at org.apache.beam.vendor.bytebuddy.v1_9_3.net.bytebuddy.dynamic.TypeResolutionStrategy$Passive.initialize(TypeResolutionStrategy.java:100)
at org.apache.beam.vendor.bytebuddy.v1_9_3.net.bytebuddy.dynamic.DynamicType$Default$Unloaded.load(DynamicType.java:5623)
at org.apache.beam.sdk.transforms.reflect.ByteBuddyDoFnInvokerFactory.generateInvokerClass(ByteBuddyDoFnInvokerFactory.java:353)
at org.apache.beam.sdk.transforms.reflect.ByteBuddyDoFnInvokerFactory.getByteBuddyInvokerConstructor(ByteBuddyDoFnInvokerFactory.java:249)
at org.apache.beam.sdk.transforms.reflect.ByteBuddyDoFnInvokerFactory.newByteBuddyInvoker(ByteBuddyDoFnInvokerFactory.java:222)
at org.apache.beam.sdk.transforms.reflect.ByteBuddyDoFnInvokerFactory.newByteBuddyInvoker(ByteBuddyDoFnInvokerFactory.java:153)
at org.apache.beam.sdk.transforms.reflect.DoFnInvokers.invokerFor(DoFnInvokers.java:35)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.translateParDo(ParDoTranslation.java:206)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.translateParDo(ParDoTranslation.java:191)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation$ParDoTranslator.translate(ParDoTranslation.java:128)
at org.apache.beam.repackaged.direct_java.runners.core.construction.PTransformTranslation.toProto(PTransformTranslation.java:225)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.getParDoPayload(ParDoTranslation.java:689)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.isSplittable(ParDoTranslation.java:704)
at org.apache.beam.repackaged.direct_java.runners.core.construction.PTransformMatchers$6.matches(PTransformMatchers.java:269)
at org.apache.beam.sdk.Pipeline$2.visitPrimitiveTransform(Pipeline.java:282)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:665)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:657)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:657)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:657)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:657)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:657)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.access$600(TransformHierarchy.java:317)
at org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:251)
at org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:460)
at org.apache.beam.sdk.Pipeline.replace(Pipeline.java:260)
at org.apache.beam.sdk.Pipeline.replaceAll(Pipeline.java:210)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:170)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:67)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:315)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:301)
at com.google.swarm.tokenization.CSVStreamingPipeline.doTokenization(CSVStreamingPipeline.java:108)
at com.google.swarm.tokenization.CSVStreamingPipeline.main(CSVStreamingPipeline.java:116)
To whom it may concern,
Please add the windowing time frame as an input parameter.
We have a use case where we want to run the DLP inspect, de-identify, and re-identify processes on a BigQuery table entirely within BigQuery, e.g. with BigQuery stored procedures or with BigQuery commands and scripts. We are wondering how we can achieve that.
We do not want to use Dataflow or any other GCP service for DLP.
We appreciate your help and support.
Himanshu
We are trying to run the re-identification Dataflow job by executing the command below:
gradle run -DmainClass=com.google.swarm.tokenization.DLPTextToBigQueryStreamingV2
-Pargs="
--region=${REGION}
--project=${PROJECT_ID}
--stagingLocation=gs://${DATAFLOW_TEMP_BUCKET}/staging
--tempLocation=gs://${DATAFLOW_TEMP_BUCKET}/temp
--maxNumWorkers=2
--runner=DataflowRunner
--tableRef=${PROJECT_ID}:deid_dataset.CCRecords_1564602825
--dataset=deid_dataset
--topic=projects/${PROJECT_ID}/topics/${TOPIC_ID}
--autoscalingAlgorithm=THROUGHPUT_BASED
--workerMachineType=n1-standard-2
--deidentifyTemplateName=${REID_TEMPLATE_NAME}
--DLPMethod=REID
--keyRange=1024
--queryPath=gs://${DATA_STORAGE_BUCKET}/reid_query.sql"
Attaching a screenshot of the error:
Ran a test with the Dataflow runner against a 4.8 GB file, and it fails with an out-of-memory error.
2018-08-07 (15:02:31) java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.lang.OutOfMemoryError: ...
java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.lang.OutOfMemoryError: GC overhead limit exceeded
com.google.cloud.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:183)
com.google.cloud.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:102)
com.google.cloud.dataflow.worker.StreamingGroupAlsoByWindowReshuffleFn.processElement(StreamingGroupAlsoByWindowReshuffleFn.java:55)
com.google.cloud.dataflow.worker.StreamingGroupAlsoByWindowReshuffleFn.processElement(StreamingGroupAlsoByWindowReshuffleFn.java:37)
com.google.cloud.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement(GroupAlsoByWindowFnRunner.java:115)
com.google.cloud.dataflow.worker.GroupAlsoByWindowFnRunner.processElement(GroupAlsoByWindowFnRunner.java:73)
com.google.cloud.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement(GroupAlsoByWindowsParDoFn.java:133)
com.google.cloud.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:43)
com.google.cloud.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:48)
com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:200)
com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:158)
com.google.cloud.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:75)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1227)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:136)
com.google.cloud.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:966)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.beam.sdk.util.UserCodeException: java.lang.OutOfMemoryError: GC overhead limit exceeded
org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:34)
com.google.swarm.tokenization.CSVBatchPipeline$CSVFileReader$DoFnInvoker.invokeProcessElement(Unknown Source)
org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:185)
org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:149)
com.google.cloud.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:323)
com.google.cloud.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:43)
com.google.cloud.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:48)
com.google.cloud.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:271)
org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:219)
org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:69)
org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:517)
org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:505)
org.apache.beam.sdk.io.FileIO$ReadMatches$ToReadableFileFn.process(FileIO.java:743)
org.apache.beam.sdk.io.FileIO$ReadMatches$ToReadableFileFn$DoFnInvoker.invokeProcessElement(Unknown Source)
org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:185)
org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:149)
com.google.cloud.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:323)
com.google.cloud.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:43)
com.google.cloud.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:48)
com.google.cloud.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:271)
org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:219)
org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:69)
org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:517)
org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.output(DoFnOutputReceivers.java:71)
org.apache.beam.sdk.transforms.MapElements$1.processElement(MapElements.java:128)
org.apache.beam.sdk.transforms.MapElements$1$DoFnInvoker.invokeProcessElement(Unknown Source)
org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:185)
org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:149)
com.google.cloud.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:323)
com.google.cloud.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:43)
com.google.cloud.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:48)
com.google.cloud.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:271)
org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:219)
org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:69)
org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:517)
org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:505)
org.apache.beam.runners.dataflow.ReshuffleOverrideFactory$ReshuffleWithOnlyTrigger$1.processElement(ReshuffleOverrideFactory.java:85)
org.apache.beam.runners.dataflow.ReshuffleOverrideFactory$ReshuffleWithOnlyTrigger$1$DoFnInvoker.invokeProcessElement(Unknown Source)
org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:185)
org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:149)
com.google.cloud.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:323)
com.google.cloud.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:43)
com.google.cloud.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:48)
com.google.cloud.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:181)
com.google.cloud.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:102)
com.google.cloud.dataflow.worker.StreamingGroupAlsoByWindowReshuffleFn.processElement(StreamingGroupAlsoByWindowReshuffleFn.java:55)
com.google.cloud.dataflow.worker.StreamingGroupAlsoByWindowReshuffleFn.processElement(StreamingGroupAlsoByWindowReshuffleFn.java:37)
com.google.cloud.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement(GroupAlsoByWindowFnRunner.java:115)
com.google.cloud.dataflow.worker.GroupAlsoByWindowFnRunner.processElement(GroupAlsoByWindowFnRunner.java:73)
com.google.cloud.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement(GroupAlsoByWindowsParDoFn.java:133)
com.google.cloud.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:43)
com.google.cloud.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:48)
com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:200)
com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:158)
com.google.cloud.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:75)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1227)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:136)
com.google.cloud.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:966)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.String.substring(String.java:1969)
java.lang.String.split(String.java:2353)
java.lang.String.split(String.java:2422)
com.google.swarm.tokenization.common.Util.convertCsvRowToTableRow(Util.java:37)
com.google.swarm.tokenization.common.Util.lambda$createDLPTable$0(Util.java:65)
com.google.swarm.tokenization.common.Util$$Lambda$79/491724826.accept(Unknown Source)
java.util.ArrayList.forEach(ArrayList.java:1251)
com.google.swarm.tokenization.common.Util.createDLPTable(Util.java:64)
com.google.swarm.tokenization.CSVBatchPipeline$CSVFileReader.processElement(CSVBatchPipeline.java:212)
com.google.swarm.tokenization.CSVBatchPipeline$CSVFileReader$DoFnInvoker.invokeProcessElement(Unknown Source)
org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:185)
org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:149)
com.google.cloud.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:323)
com.google.cloud.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:43)
com.google.cloud.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:48)
com.google.cloud.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:271)
org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:219)
org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:69)
org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:517)
org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:505)
org.apache.beam.sdk.io.FileIO$ReadMatches$ToReadableFileFn.process(FileIO.java:743)
org.apache.beam.sdk.io.FileIO$ReadMatches$ToReadableFileFn$DoFnInvoker.invokeProcessElement(Unknown Source)
org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:185)
org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:149)
com.google.cloud.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:323)
com.google.cloud.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:43)
com.google.cloud.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:48)
com.google.cloud.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:271)
org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:219)
org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:69)
org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:517)
org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.output(DoFnOutputReceivers.java:71)
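The OOM above is thrown while convertCsvRowToTableRow runs over every row of the file in one DoFn call. A hedged sketch of bounded batching (plain java.io, not the pipeline's actual Beam code; countBatches is an illustrative name, and batchSize only mirrors the pipeline's existing batchSize parameter) shows the memory-bounding idea:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class BatchedCsvRead {
    // Stream rows in bounded batches so at most batchSize lines are held
    // in memory at once; each full batch would be handed to the DLP API
    // and then cleared. Returns how many batches were emitted.
    static int countBatches(Reader source, int batchSize) throws IOException {
        BufferedReader reader = new BufferedReader(source);
        List<String> batch = new ArrayList<>(batchSize);
        int batches = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            batch.add(line);
            if (batch.size() == batchSize) {
                batches++;      // process the batch here, then release it
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            batches++;          // final partial batch
        }
        return batches;
    }

    // Convenience wrapper for in-memory input (used for a quick check).
    static int countBatches(String content, int batchSize) {
        try {
            return countBatches(new StringReader(content), batchSize);
        } catch (IOException e) {   // cannot happen for a StringReader
            throw new IllegalStateException(e);
        }
    }
}
```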
For the last template, which references gcr.io/cloud-solutions-images/dlp-dataflow-v2-solutions, the repository does not exist.
I am referring to:
gcloud beta dataflow flex-template run "dlp-reid-demo" --project=${PROJECT_ID} --region=${REGION} --template-file-gcs-location=gs://dataflow-dlp-solution-sample-data/dynamic_template_dlp_v2.json --parameters=^~^streaming=true~enableStreamingEngine=true~tempLocation=gs://${DATAFLOW_TEMP_BUCKET}/temp~numWorkers=1~maxNumWorkers=2~runner=DataflowRunner~tableRef=${PROJECT_ID}:deid_dataset.CCRecords_1564602825~dataset=deid_dataset~autoscalingAlgorithm=THROUGHPUT_BASED~workerMachineType=n1-highmem-8~topic=projects/${PROJECT_ID}/topics/${TOPIC_ID}~deidentifyTemplateName=${REID_TEMPLATE_NAME}~DLPMethod=REID~keyRange=1024~queryPath=gs://${DATA_STORAGE_BUCKET}/reid_query.sql
Used the following command to run the job on the Dataflow runner:
gcloud dataflow jobs run test-run-3 --gcs-location gs://templates_test_as/dlp-tokenization --parameters inputFile=gs://input_as/test2.csv,project=testbatch-211413,batchSize=4700,deidentifyTemplateName=projects/testbatch-211413/deidentifyTemplates/1771891382411767128,outputFile=gs://output_as/template_def_run,inspectTemplateName=projects/testbatch-211413/inspectTemplates/1771891382411767128
The job doesn't fail and exit, but the Read step shows the following logs; it doesn't process any data, so no output is generated.
error_excerpt.txt
An excerpt of the error is attached. The KMS API is enabled in the console for this project, but accessing the link mentioned in the error screenshot directly returns the following error message:
"The API "cloudkms.googleapis.com" doesn't exist or you don't have permission to access it
Tracking Number: 4532490161882610346"
Please review whether this is a real issue.
This is my build command:
gradle run \
-DmainClass=com.google.swarm.tokenization.CSVStreamingPipeline \
-Dexec.args="--runner=DataflowRunner \
--project=<GCP-PROJECT> \
--templateLocation=gs://<BUCKET>/dlp_templates \
--stagingLocation=gs://<BUCKET>/dlp_staging"
Error:
Exception in thread "main" java.lang.UnsupportedOperationException: Cannot define class using reflection: Cannot define nest member class java.lang.reflect.AccessibleObject$Cache + within different package then class org.apache.beam.vendor.bytebuddy.v1_9_3.net.bytebuddy.mirror.AccessibleObject
at org.apache.beam.vendor.bytebuddy.v1_9_3.net.bytebuddy.dynamic.loading.ClassInjector$UsingReflection$Dispatcher$Initializable$Unavailable.defineClass(ClassInjector.java:410)
at org.apache.beam.vendor.bytebuddy.v1_9_3.net.bytebuddy.dynamic.loading.ClassInjector$UsingReflection.injectRaw(ClassInjector.java:235)
at org.apache.beam.vendor.bytebuddy.v1_9_3.net.bytebuddy.dynamic.loading.ClassInjector$AbstractBase.inject(ClassInjector.java:111)
at org.apache.beam.vendor.bytebuddy.v1_9_3.net.bytebuddy.dynamic.loading.ClassLoadingStrategy$Default$InjectionDispatcher.load(ClassLoadingStrategy.java:232)
at org.apache.beam.vendor.bytebuddy.v1_9_3.net.bytebuddy.dynamic.loading.ClassLoadingStrategy$Default.load(ClassLoadingStrategy.java:143)
at org.apache.beam.vendor.bytebuddy.v1_9_3.net.bytebuddy.dynamic.TypeResolutionStrategy$Passive.initialize(TypeResolutionStrategy.java:100)
at org.apache.beam.vendor.bytebuddy.v1_9_3.net.bytebuddy.dynamic.DynamicType$Default$Unloaded.load(DynamicType.java:5623)
at org.apache.beam.sdk.transforms.reflect.ByteBuddyDoFnInvokerFactory.generateInvokerClass(ByteBuddyDoFnInvokerFactory.java:353)
at org.apache.beam.sdk.transforms.reflect.ByteBuddyDoFnInvokerFactory.getByteBuddyInvokerConstructor(ByteBuddyDoFnInvokerFactory.java:249)
at org.apache.beam.sdk.transforms.reflect.ByteBuddyDoFnInvokerFactory.newByteBuddyInvoker(ByteBuddyDoFnInvokerFactory.java:222)
at org.apache.beam.sdk.transforms.reflect.ByteBuddyDoFnInvokerFactory.newByteBuddyInvoker(ByteBuddyDoFnInvokerFactory.java:153)
at org.apache.beam.sdk.transforms.reflect.DoFnInvokers.invokerFor(DoFnInvokers.java:35)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.translateParDo(ParDoTranslation.java:206)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.translateParDo(ParDoTranslation.java:191)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation$ParDoTranslator.translate(ParDoTranslation.java:128)
at org.apache.beam.repackaged.direct_java.runners.core.construction.PTransformTranslation.toProto(PTransformTranslation.java:225)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.getParDoPayload(ParDoTranslation.java:689)
at org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.isSplittable(ParDoTranslation.java:704)
at org.apache.beam.repackaged.direct_java.runners.core.construction.PTransformMatchers$6.matches(PTransformMatchers.java:269)
at org.apache.beam.sdk.Pipeline$2.visitPrimitiveTransform(Pipeline.java:282)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:665)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:657)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:657)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:657)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:657)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:657)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.access$600(TransformHierarchy.java:317)
at org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:251)
at org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:460)
at org.apache.beam.sdk.Pipeline.replace(Pipeline.java:260)
at org.apache.beam.sdk.Pipeline.replaceAll(Pipeline.java:210)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:170)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:67)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:315)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:301)
at com.google.swarm.tokenization.CSVStreamingPipeline.doTokenization(CSVStreamingPipeline.java:108)
at com.google.swarm.tokenization.CSVStreamingPipeline.main(CSVStreamingPipeline.java:116)
I wanted to add JSON support and create a template from it, but the existing code also fails while trying to create the template. I also tried building with the Beam version upgraded from 2.16 to 2.19, but I still see the ByteBuddy error.
Can anyone please help me out?
Please update the template_metadata file to:
{
"name": "dlp-tokenization",
"description": "DLP Data Tokenization Pipeline",
"parameters": [{
"name": "inputFile",
"label": "GCS File Path to Tokenize",
"help_text": "gs://MyBucket/object",
"regexes": ["^gs://[^\n\r]+$"],
"is_optional": false
},
{
"name": "outputFile",
"label": "Location of GCS path where Tokenized Data will be written",
"help_text": "Path and filename prefix for writing output files. ex: gs://MyBucket/object",
"regexes": ["^gs://[^\n\r]+$"],
"is_optional": false
},
{
"name": "tableSpec",
"label": "BQ Table Spec ",
"help_text": "<project_id>:<dataset_id>.<table_id>",
"is_optional": false
},
{
"name": "project",
"label": "Name of the Host Project",
"help_text": "project_id",
"is_optional": true
},
{
"name": "batchSize",
"label": "batch size in number of rows",
"help_text": "4700, 200",
"is_optional": false
},
{
"name": "pollingInterval",
"label": "batch size in number of rows",
"help_text": "in seconds: 10, 60",
"is_optional": false
},
{
"name": "inspectTemplateName",
"label": "inspect template name",
"help_text": "null, projects/{dlp_prject_name}/inspectTemplates/{name}",
"is_optional": true
},
{
"name": "deidentifyTemplateName",
"label": "deidentify template name",
"help_text": "null, projects/{dlp_prject_name}/deidentifyTemplates/{name}",
"is_optional": false
},
{
"name": "csek",
"label": "Client Supplied Encryption key (KMS Wrapped)",
"help_text": "CiQAbkxly/0bahEV7baFtLUmYF5pSx0+qdeleHOZmIPBVc7cnRISSQD7JBqXna11NmNa9NzAQuYBnUNnYZ81xAoUYtBFWqzHGklPMRlDgSxGxgzhqQB4zesAboXaHuTBEZM/4VD/C8HsicP6Boh6XXk=",
"is_optional": true
},
{
"name": "csekhash",
"label": "Hash of CSEK",
"help_text": "lzjD1iV85ZqaF/C+uGrVWsLq2bdN7nGIruTjT/mgNIE=",
"is_optional": true
},
{
"name": "fileDecryptKeyName",
"label": "Key Ring For Input File Encryption",
"help_text": "gcs-bucket-encryption",
"is_optional": true
},
{
"name": "fileDecryptKey",
"label": "Key Name For Input File Encryption",
"help_text": "data-file-key",
"is_optional": true
},
{
"name": "workerMachineType",
"label": "Machine Type",
"help_text": "n1-highmem-4",
"is_optional": true
},
{
"name": "numWorkers",
"label": "Number of Workers",
"help_text": "number 1",
"is_optional": true
}
]
}
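The `regexes` entries above let you reject bad parameter values before a job is ever launched. A minimal sketch of that client-side check, assuming Python and the standard library only (the abridged metadata and the `validate` helper are illustrative, not part of the pipeline code):

```python
import json
import re

# Abridged copy of the template_metadata above; only the fields the
# check needs. The raw string keeps the JSON "\n"/"\r" escapes intact.
METADATA = json.loads(r"""
{
  "name": "dlp-tokenization",
  "parameters": [
    {"name": "inputFile", "regexes": ["^gs://[^\n\r]+$"], "is_optional": false}
  ]
}
""")

def validate(param_name, value, metadata=METADATA):
    """Return True if value matches every regex declared for param_name."""
    for param in metadata["parameters"]:
        if param["name"] == param_name:
            return all(bool(re.match(rx, value)) for rx in param.get("regexes", []))
    raise KeyError(param_name)

print(validate("inputFile", "gs://MyBucket/object"))   # True
print(validate("inputFile", "/local/path/object"))     # False
```

Running a check like this locally makes the `^gs://[^\n\r]+$` pattern fail fast on a non-GCS path instead of surfacing as a template launch error.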