
goe's Issues

Remove all traces of Incremental Update

Incremental Update commands have been removed, as has much of the code, but a lot still remains.

We need to clear up any remaining:

  • Options, option defaults and configuration
  • Code, including references to incremental update throughout the code base
  • Documentation. We probably shouldn't touch the original documentation, so this only applies if relevant documentation has already been migrated to this repo

Remove synthetic bucket partitioning

We have support for a synthetic bucket partition column on a Hadoop backend to support parallel reading of data through a connector. As the connector no longer exists we no longer need to support this synthetic partition column.

We need to be careful removing the hidden hash column option/metadata though; I believe it is used for functionality other than just synthetic bucketing. This needs checking.

While doing this, also remove any "num location files" options/defaults/configuration and code references. Location files and the number of buckets are intertwined in the code base and would be best removed in one activity.

Reinstate OFFLOAD_SCN metadata

As part of issue #18 we removed the OFFLOAD_SCN metadata attribute.

In hindsight I think we should have left it in place. There's been talk of the GOE batch data copy tool being able to work in tandem with a CDC tool in the future. For that to be true we ought to continue to capture the start SCN of the offload operation. In truth it may never be used but there's no sense in expending effort erasing it when it "might" be needed.

Add support for BigQuery parameterized data types

GOE work was paused around the same time that BigQuery introduced parameterized data types for string types and precision/scale for numeric types.

A customer has experienced a problem with Informatica: if no length is specified on a table's STRING data type, Informatica assumes a length of 255. Columns declared in Oracle as VARCHAR2(<255) become STRING without a length in BigQuery, so Informatica assumes 255 and truncates data it subsequently adds to the table.

We should look to support:

  1. STRING(L)
  2. NUMERIC(P,S)
  3. BIGNUMERIC(P,S)
  4. BYTES(L)
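
For illustration, a minimal sketch of how the four types above could be emitted through the google-cloud-bigquery client, assuming a reasonably recent client version (for the max_length/precision/scale keyword arguments) and hypothetical table/column names:

from google.cloud import bigquery

client = bigquery.Client()

schema = [
    # Oracle NUMBER(10,2) could map to NUMERIC(10,2) rather than bare NUMERIC.
    bigquery.SchemaField("AMOUNT", "NUMERIC", precision=10, scale=2),
    # Oracle VARCHAR2(100) could map to STRING(100); a declared length stops
    # tools such as Informatica assuming a 255 default.
    bigquery.SchemaField("NAME", "STRING", max_length=100),
    bigquery.SchemaField("DOC", "BYTES", max_length=4000),
]

table = bigquery.Table("project-01.owner.tab_example", schema=schema)
client.create_table(table)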

Redefine software installation and packaging

We need to spend some time understanding how a customer moves from a cloned github repo to a working installation.

What do they need? Off the top of my head:

  • A Python virtual environment with required packages
  • The gluentlib egg file, not sure if this should be in the virtual env or in a lib directory which we include in PYTHONPATH
  • An Oracle client which we no longer bundle. If we switch to python-oracledb then they probably don't need a client
  • An OFFLOAD_HOME containing log, run, setup, conf and bin directories
  • offload.env templates, currently generated by templates/conf/make into the conf directory
  • An OFFLOAD_HOME containing a version file
  • The contents of the setup and bin directories as currently produced by the Makefile

I may have missed something.

We still need a make package option which produces a tarball that can be extracted to deploy GOE. There's no way customers will be prepared to run make commands on a production system; they should be able to run it once and produce a tarball which can be deployed to a server.

Remove data governance code and tests

First up, we need to discuss whether this is the right thing to do. The feature has never been used and was limited to a Cloudera product that no longer exists, which I think makes it clear that it is a waste of bits and bytes in this repo.

Assuming the answer is "yes" to removing it then:

  • Remove any options/option defaults/configuration file entries
  • Remove the code
  • Remove references to the data governance APIs from any Offload code that creates new objects
  • Remove tests, just stories I believe
  • Remove any documentation, of which there may not be any

pass_tool command

When moving to the new repo we left all C++ artefacts behind; therefore we no longer have pass_tool, which is used to obfuscate passwords.

We need to provide a Python version of the same tool.
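
A minimal sketch of what a Python replacement could look like, assuming we are free to choose a new scheme rather than reproduce the original C++ algorithm; the CLI shape, key file location and use of the cryptography package's Fernet are all assumptions:

import argparse
from pathlib import Path

from cryptography.fernet import Fernet


def main():
    parser = argparse.ArgumentParser(description="Obfuscate a password for use in offload.env")
    parser.add_argument("password", help="Clear-text password to obfuscate")
    parser.add_argument("--key-file", default="conf/pass.key", help="Symmetric key location (hypothetical default)")
    args = parser.parse_args()

    key_path = Path(args.key_file)
    if not key_path.exists():
        # First run: generate and store a key so the password can be recovered later.
        key_path.write_bytes(Fernet.generate_key())

    token = Fernet(key_path.read_bytes()).encrypt(args.password.encode())
    print(token.decode())


if __name__ == "__main__":
    main()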

Update cx-Oracle to python-oracledb package

Check for any version limitations; we support Oracle 10g upwards. There's a chance we will need to cater for both cx-Oracle and python-oracledb if there's a mismatch in version support.
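
If both drivers do need to be supported, one common pattern is an import fallback; this is only a sketch, with hypothetical connection details:

try:
    import oracledb as oracle_driver  # preferred: cx_Oracle's successor
except ImportError:
    import cx_Oracle as oracle_driver  # fall back for environments that still need it

# Both drivers accept the same basic keyword arguments for a simple connection.
connection = oracle_driver.connect(user="goe_adm", password="...", dsn="dbhost/service")
cursor = connection.cursor()
cursor.execute("SELECT banner FROM v$version")
print(cursor.fetchone()[0])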

Development controls

Before we start to work as a team we need to understand how to control development. For example how to block changes made directly on the main branch and how to enforce pull requests and approvals.

Review Oracle supporting objects

We've already removed a lot of PL/SQL code from the ADM schema but should conduct a further review and remove anything we no longer require.

Other thoughts:

  • Shall we change the default account prefix from "GLUENT" to "GOE"?
  • Now is the right time to remove any metadata attributes that have become redundant. This is a fresh start so no upgrades are required.
  • We should review the repo schema table structures. For example should metadata now be keyed on the source table and no longer on the hybrid view? I can foresee a future where Offload needs to reinstate the ability to update a HWM in a dependent view but I think these would be recorded in a new child table
  • We can reset the repo schema version to an initial version, we do not need to honour upgrades from existing GDP installations

Review support for Conductor tool

In goe/conductor/hybrid_view_service.py we support an API for an externally written tool called Conductor.

We should review whether this tool is still required.

If it is, then the tool and goe/conductor/hybrid_view_service.py should be refactored to be keyed on frontend owner/table; hybrid view is no longer a supported attribute.

If it is no longer required then we should remove the API and any related tests.

Redefine Spark Listener management

Previously we would build Spark Listener jar files and bundle them. By sharing the repo with end users we are moving that burden to them, including installing Scala and its prerequisites.

As JAR files are portable, I wonder if we should continue to include the Makefile and supporting code/instructions in the repo but not run the build as part of installation. Instead we could run it internally and commit the jar files to the repo in a location where installation can use them.

I've tested the currently built files against Dataproc with Java 11 and 17 and it works okay. However it does not work for Scala 2.13. So I think we don't need to worry about Java version but we do need to cater for Scala 2.13.

The current strategy for picking the correct listener is for the end user to copy the correctly versioned file into place. We should review this.

Remove schema sync

Remove schema sync and all related code/config/tests.

Also consider documentation. We probably shouldn't touch the original documentation, so this only applies if relevant documentation has already been migrated to this repo.

Refactor story integration tests

  • I'd like to get away from our reliance on pre-built schemas.
  • Each set of tests should create its own test tables with minimal data
  • We need to bear in mind future requirements for matching SQL Server/Teradata test tables too
  • There's no TeamCity anymore
  • I'd ideally like the tests to run from pytest and not need a custom test harness

I've already refactored the data_type_controls story; I've stopped calling them stories and call them scenarios instead.

Refactor connect to be in gluentlib

At the moment scripts/connect.py contains all of the connect code.

We should break this up into a few .py files and locate them in gluentlib/connect.

Redefine integration tests

We have the existing test framework, a lot of which should be retained.

These are the thoughts in my head that I'm capturing here:

  • I'd like to get away from our reliance on pre-built schemas. Each set of tests should create its own test tables with minimal data. We need to bear in mind future requirements for matching SQL Server/Teradata test tables too.
  • There's no TeamCity anymore
  • DVT tests run in Cloud Build Run, perhaps that's something we could imitate?
  • Cody has some good ideas regarding testing
  • I'd like to make the stories self-sufficient and not rely on an existing SH schema

Clean up redundant offload env variables

Examples I've seen that need to be rectified (there may be others):

  • CONNECTOR_SQL_ENGINE
  • LD_LIBRARY_PATH (no longer needed if we're not bundling our own Python)
  • CLASSPATH (no longer needed if we're not bundling our own Python)
  • DATAD_ADDRESS
  • METAD_POOL_SIZE
  • METAD_POOL_TIMEOUT
  • METAD_AUTOSTART

For each of these we need to remove any:

  • Options in code
  • Option defaults
  • Code that relies on those options

Change Query Import to support Oracle partitioned tables

This will make testing much easier and should be pretty easy.

  1. Check that the segment size used for the small table threshold check works across partitions
  2. Check that the transport source SQL is the same for Query Import as for Spark by Query

If it gets gnarly then abort. But worth doing if it's trivial.

Redefine software installation

We need to spend some time understanding how a customer moves from a cloned github repo to a working installation.

We have already added a Makefile recipe to create a package containing many required artefacts.

make package

The tarball contains:

  • An OFFLOAD_HOME containing log, run, setup, conf, lib and bin directories
  • The gluentlib egg file in the lib directory which we include in PYTHONPATH in offload.env
  • offload.env templates
  • version file
  • Contents of setup and bin

What we do not include is:

  1. An Oracle client, which we no longer bundle. If we switch to python-oracledb then they probably don't need a client; for now I think it has to be a documented prereq.
  2. A Python virtual environment with required packages

It is item 2 above that this issue concerns. We need to think about how we should ensure the customer has the correct Python and packages.

Refactor gluentlib_contrib

Currently code shared between offload and Console is in its own repo, gluentlib_contrib.

I don't think this is required any longer and the code could be moved, thereby removing an unnecessary complication.

Extra GCP project controls are required

Here's a known customer configuration:

  • Gluent node in project-01
  • BigQuery in project-02
  • KMS in project-03

Currently GOE cannot cope with that; all components need to be in the same project.

We should add the following environment variables:

  • BIGQUERY_DATASET_PROJECT
  • GOOGLE_KMS_KEY_RING_PROJECT

If the variables are blank then we fall back on the existing behaviour, which is to use the default project from the BigQuery Python client; that is probably the project of the Gluent node/service account.
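
A minimal sketch of the fallback behaviour described above, assuming the new variables are plain environment variables and using a hypothetical key ring/key name:

import os

from google.cloud import bigquery

client = bigquery.Client()  # default project comes from the node's service account / ADC

dataset_project = os.environ.get("BIGQUERY_DATASET_PROJECT") or client.project
kms_project = os.environ.get("GOOGLE_KMS_KEY_RING_PROJECT") or client.project

dataset_ref = bigquery.DatasetReference(dataset_project, "owner_load")
kms_key_name = (
    "projects/%s/locations/us/keyRings/goe-ring/cryptoKeys/goe-key" % kms_project
)
print(dataset_ref, kms_key_name)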

Redefine unit tests

A lot of unit tests brought across from the original code base are borderline integration tests and require database connections/database objects to complete. These should all be reviewed and changed to use mock objects where possible, or removed if they cannot be sufficiently simplified.
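
For illustration, the kind of change intended, using a toy function and unittest.mock rather than a live connection (the function below is hypothetical, not taken from gluentlib):

from unittest import mock


def describe_table(conn, owner, table_name):
    """Toy stand-in for code that currently needs a live Oracle connection."""
    cursor = conn.cursor()
    cursor.execute(
        "SELECT column_name FROM all_tab_columns WHERE owner = :1 AND table_name = :2",
        [owner, table_name],
    )
    return [row[0] for row in cursor.fetchall()]


def test_describe_table_with_mocked_connection():
    fake_cursor = mock.MagicMock()
    fake_cursor.fetchall.return_value = [("ID",), ("NAME",)]
    fake_conn = mock.MagicMock()
    fake_conn.cursor.return_value = fake_cursor

    assert describe_table(fake_conn, "SH", "SALES") == ["ID", "NAME"]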

Oracle SQL installation script is working directory sensitive

To be able to run the Oracle install_offload.sql (and upgrade/uninstall) scripts you must start SQL*Plus from the directory in which the scripts are located.

It would be nice to be able to run from anywhere by supplying a full or relative path.

Additional BigQuery staged data check for data larger than 100MB

Tested with an Oracle CLOB value larger than 100MB. Offload fails with:

Load staged data
Fri Dec  8 14:42:57 2023
BigQuery SQL: INSERT INTO `project-01.owner.tab_clob`
SELECT `ID` AS `ID`
,      `DATA` AS `DATA`
,      PARSE_DATETIME('%F %H:%M:%E*S', `LOAD_DATE`) AS `LOAD_DATE`
FROM   `project-01.owner_load.tab_clob`
Step time: 0:00:15
Unhandled exception in offload_table(): 400 Cannot query rows larger than 100MB limit.

We should add a LENGTH check to the Validate staged data step for any BLOB/CLOB columns.

DBA_TABLES.AVG_ROW_LEN does not reflect out-of-row LOB space used, which makes it tricky for us to warn the user at the start of an Offload.
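
A sketch of the kind of check that could be added to the Validate staged data step, assuming we know which staged columns came from BLOB/CLOB sources (the helper and its inputs are illustrative only):

ROW_LIMIT_BYTES = 100 * 1024 * 1024  # BigQuery's 100MB row limit


def oversized_lob_check_sql(load_table, lob_columns):
    """Return SQL counting staged rows whose LOB columns approach the row limit."""
    predicates = " OR ".join(
        "LENGTH(`%s`) > %d" % (col, ROW_LIMIT_BYTES) for col in lob_columns
    )
    return "SELECT COUNT(*) FROM `%s` WHERE %s" % (load_table, predicates)


print(oversized_lob_check_sql("project-01.owner_load.tab_clob", ["DATA"]))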

Refactor offload metadata and interactions in orchestration

The GOE metadata repository and supporting database objects were refactored by issue 18. Minimal work was done in the Python code as an interim "shim" to enable orchestration to use the existing method args and metadata dict keys. The orchestration therefore needs to be refactored to complete the integration.

Things to note:

  1. There are a number of TODO comments in the code for immediate attention (search for TODO: Issue 18)
  2. The existing metadata dict has a number of attributes that have been remapped to the new Oracle type attributes when getting/saving. In some cases, the developer may wish to remove the existing metadata attributes in favour of the new ones, though in other cases the old key names may be so ubiquitous in the orchestration code that it's easier to retain the old name (for example, INCREMENTAL_% dict keys are now named OFFLOAD_% in the Oracle type; see the remapping sketch after this list)
  3. The existing metadata dict has a number of attributes that are no longer relevant and need to be removed
  4. The various metadata methods (get, save etc) are keyed on hybrid_[owner|view] args. The "shim" prevented this from being an issue with the first push, but the hybrid_[owner|view] concept needs to be removed and all metadata methods should be keyed on frontend owner/name
  5. The current command execution UUID is part of the metadata that needs to be saved. Historically, the execution_id value was added as an additional arg to the save_offload_metadata methods, but this "loose" arg needs to be cleaned up. The execution UUID should be in the metadata dict
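
For point 2, a minimal sketch of the kind of shim-removal helper that might be useful during the refactor; the key names follow the INCREMENTAL_%/OFFLOAD_% pattern noted above and the function itself is hypothetical:

def remap_legacy_metadata_keys(metadata):
    """Rename legacy INCREMENTAL_% dict keys to their OFFLOAD_% equivalents."""
    remapped = {}
    for key, value in metadata.items():
        if key.startswith("INCREMENTAL_"):
            key = "OFFLOAD_" + key[len("INCREMENTAL_"):]
        remapped[key] = value
    return remapped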

Fix OFFLOAD_HOME check in Makefile

There is a check in the install recipe in the Makefile to ensure the OFFLOAD_HOME variable is set. This check is evaluated for all targets of the Makefile, not just the intended one.

ifndef OFFLOAD_HOME
$(error OFFLOAD_HOME is not set)
endif

Oracle installation only works as SYS user

When trying to install Oracle objects as the SYSTEM user it fails with:

PL/SQL procedure successfully completed.

Creating grants...
Error executing SQL : GRANT SELECT ON GLUENT_ADM.OFFLOAD_OBJECTS TO GLUENT_APP
Error: ORA-01720: grant option does not exist for
'GLUENT_REPO.OFFLOAD_METADATA_V'
DECLARE
*
ERROR at line 1:
ORA-01720: grant option does not exist for 'GLUENT_REPO.OFFLOAD_METADATA_V'
ORA-06512: at line 21
ORA-06512: at line 16
ORA-06512: at line 38
ORA-06512: at line 38


Disconnected from Oracle Database 21c Express Edition Release 21.0.0.0.0 - Production

Repo connection leak when running integration tests

When we run integration tests we are leaking OrchestrationRepo client connections and eventually tests fail because we exceed the Oracle processes parameter.

It's not a problem at the moment because most integration tests have not been converted to the new repo.

However, this is a prerequisite for issue #42.

Update all third party packages

The third party package versions have been frozen for some time; we need to roll them forward as far as we can and then freeze them again.
We probably need to get tests up and running before doing this.
We should also not lose sight of the fact that we need to test on multiple Python versions; versions frozen for Python 3.11 might not be suitable on Python 3.6.
This could be tricky; we need to have ranges of versions rather than exact versions.
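
As an illustration only, ranges with environment markers could look something like this in a setup.py install_requires list; the package names and bounds here are assumptions, not a proposed freeze:

install_requires = [
    "google-cloud-bigquery>=3.0,<4.0",
    'oracledb>=1.0; python_version >= "3.7"',
    'cx_Oracle>=7.0,<9.0; python_version < "3.7"',
]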

Documentation

We need to bring across some of the original Offload documentation to the repo.

We need to add notes on the new Dataproc configuration variables.

Refactor remaining story integration tests

In issue #35 we converted a few essential story tests to be functional in this new repo.

This issue is to work on converting the remaining tests in tests/integration/test_sets/stories/test_stories/, creating new tests in tests/integration/scenarios/.

You can use the existing tests as templates for how new ones should look:

$ ls -1 tests/integration/scenarios/test_*.py
tests/integration/scenarios/test_cli_api.py
tests/integration/scenarios/test_data_type_controls.py
tests/integration/scenarios/test_identifiers.py
tests/integration/scenarios/test_offload_basic.py
tests/integration/scenarios/test_offload_rpa.py

The tests yet to convert are (the ones with question marks might be irrelevant and ripe for removal):

  • agg_validate.py (?)
  • backend_stats.py
  • cli_api_nossh.py
  • hybrid_view_service.py (?)
  • listener_orchestration.py
  • mssql.py (?)
  • netezza.py (?)
  • offload_backend_part.py
  • offload_esoteric.py (this should be renamed test_offload_advanced.py or similar)
  • offload_full.py
  • offload_hash_bucket.py (only need the part of this relating to the Synapse hash column)
  • offload_list_rpa.py
  • offload_lpa.py
  • offload_part_fn.py
  • offload_pbo.py (intra day only)
  • offload_sorting.py
  • offload_subpart.py
  • offload_transport.py
  • orchestration_locks.py
  • orchestration_step_control.py
  • schema_sync.py (?)

Decide what --force option should do

The --force option used to make an offload regenerate hybrid objects even if there was nothing to offload; without it, Offload would exit early.

We need to decide what to do now there is no hybrid schema.

I think it should force a metadata update if one doesn't already happen. So we need to check that and act accordingly.

Remove python path check from command executables

When running executables we see this message:

Python path is unexpected, have you sourced ../conf/offload.env?

This is because we have a check verifying that the Python we are using is the one inside the installation directory.

In the new GOE we do not bundle a Python interpreter with the software, so I think the check is redundant.

In issue #31 we are evaluating how to package Python dependencies, which might result in an executable of some kind, but I still think this check is redundant.

Unless you can think of a good reason not to then remove the check on sys.exec_prefix from each executable.

PBO cannot filter out negative dates

Using "--data-sample-percent=0 --offload-predicate=column(dt_close) is null or column(dt_close) >= datetime(0001-01-01 00:00:00)" did not filter out the negative dates

Update connect version checks

Connect is currently reporting two problems which are no longer problems:

Oracle version
Oracle 10.2 is now supported (we haven't tested 10.1 or prior):

Oracle version
10.2.0.1.0
Oracle version Failed

Linux
Debian is supported; we are no longer limited to Red Hat variants:

Operating system version
Unsupported operating system
Operating system version Failed

I think the file we need to interrogate on Debian is /etc/os-release. I've not thought about how we pick our supported versions, likewise for Red Hat based releases. This might link into how we build and pick Python package versions, which could become complex. So perhaps for now we test on a recent-ish release and at least draw a line in the sand.
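
A rough sketch of reading /etc/os-release for the connect check; the set of accepted distribution IDs below is an assumption for illustration:

def read_os_release(path="/etc/os-release"):
    info = {}
    with open(path) as f:
        for line in f:
            if "=" in line:
                key, _, value = line.rstrip().partition("=")
                info[key] = value.strip('"')
    return info


os_info = read_os_release()
supported_ids = {"debian", "ubuntu", "rhel", "centos", "rocky", "ol"}  # assumption
if os_info.get("ID") not in supported_ids:
    print("Unsupported operating system: %s" % os_info.get("PRETTY_NAME", "unknown"))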

offload.env
I don't think we need to worry about offload.env permissions since we became Offload only.

Configuration file permissions
/home/neiljohnson_google_com/goe/offload/conf/offload.env has permissions: 755
Configuration file permissions Warning
Expected permissions are 640

Rethink Oracle component version mismatch exception

When the software version does not match the database package version commands will not run due to this exception:

gluent.OffloadException: Mismatch between Oracle component version (6.0.7-DEV (027076c)) and binary version (6.0.8-DEV (878092c))

The database code is now much simplified and will rarely change, even more rarely will a change be a breaking change. We've seen in recent customer engagements that getting access to a privileged database account is often difficult and time consuming.

Thoughts:

  1. Ideally we would change the logic behind this message to be based on a repo version. We only advance the repo version for breaking changes and let customers overlay GOE software without running the database upgrade script in most cases
  2. If that is going to take too long we could look to downgrade the exception to a warning and create another issue for 1 above
  3. Or we could leave the exception in place but downgrade to a warning if in force mode

Offload Transport SCN improvement

There are two things to think about here:

  1. The Oracle SCN is used twice in Offload but is inconsistent
  2. We need to think about how we support external tools

Inconsistent SCN
The pre-offload SCN (OFFLOAD_SNAPSHOT) is not aligned with the SCN used in Offload Transport; we need to sync these up.

I think the SCN picked by Offload transport should be stored in state and exposed in the object returned by offload_transport_factory. This should then be used for metadata if required.

It is important to align the data extraction and offload metadata before any CDC tool can be integrated.

External tool support
We should add an option to allow an Offload snapshot value (SCN) to be passed into an Offload. This should then be stored in metadata and used for Offload Transport queries. It is only stored in metadata for the initial offload but different values can be passed in for incremental offloads as long as they are higher/more recent than the initial snapshot value.

This allows external tools to request an SCN of Offload; I'm not sure how we cater for the opposite direction. Perhaps we don't for now. At least the value is in metadata if we want to add a utility tool later.
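
A minimal sketch of the idea: pick (or accept) a single SCN up front, store it for metadata and reuse it in the transport query; the option plumbing and function names are assumptions:

def get_current_scn(cursor):
    cursor.execute("SELECT current_scn FROM v$database")
    return cursor.fetchone()[0]


def resolve_offload_snapshot(cursor, requested_scn=None):
    """Use an externally supplied SCN when given, otherwise capture one now."""
    scn = requested_scn if requested_scn is not None else get_current_scn(cursor)
    transport_clause = "AS OF SCN %d" % scn  # appended to the transport source SQL
    return scn, transport_clause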

Open source and third party licences

There are a few things to think about here.

1) Open source licence
If we are moving to open source which licence should we use (Apache 2.0?). We should make sure we have the correct content in the top-level LICENSE file. This might help: https://choosealicense.com/

2) File headers
We include a Gluent copyright notice in most files in the repo. This is auto updated at make time.
We should decide what to do with these. We could do something like this file, which is a mix of company specific copyright and open source: https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/develop/data_validation/app.py

3) Third party packages
We need to investigate whether we are required to bundle third party package licences even when this repo is classified as open source. If we do then we need a fresh assessment and to collect the required files.

Need a functioning initial code base

The initial code base will not work until a few necessary changes have been made:

  • Python requirements: we no longer bundle a pre-built Python and initially need to understand how to build a virtual environment we can use to progress further development. Eventually this will become a full build system config but for now we just need to be able to progress
  • Once we can run the code we need to fix any fallout so we can at least Offload from Oracle to BigQuery

This work can be completed on the main branch but after that we should work on feature branches only.

Exception when offloading non-partitioned table with --offload-type=INCREMENTAL (GOE-2333)

Customer ran command:

$OFFLOAD_HOME/bin/offload -t 'WH.DIM_TMP' --target-name=SANDBOX.DIM_TMP -x -vv --offload-type=INCREMENTAL --reset-backend-table

The incorrect use of --offload-type=INCREMENTAL should cause Offload to switch back to FULL and issue a warning.
Unfortunately there’s a bug in the code that outputs the warning, causing:

Find data to offload
Wed Jun 29 03:29:41 2022
Offload source data type: OffloadSourceDataFull
Step time: 0:00:00
Exception caught at top-level
Wed Jun 29 03:29:41 2022
Traceback (most recent call last):
  File "/opt/gluent/offload/bin/offload", line 29, in <module>
    offload(options)
  File "target/offload/bin/gluent.py", line 6584, in offload
  File "target/offload/bin/gluent.py", line 6276, in offload_table
  File "target/offload/bin/gluent.py", line 380, in step
  File "target/offload/bin/gluent.py", line 6277, in <lambda>
  File "target/offload/bin/gluent.py", line 2780, in get_offload_data_manager
  File "build/bdist.linux-x86_64/egg/gluentlib/offload/offload_source_data.py", line 103, in offload_source_data_factory
  File "build/bdist.linux-x86_64/egg/gluentlib/offload/offload_source_data.py", line 1160, in __init__
  File "build/bdist.linux-x86_64/egg/gluentlib/offload/offload_source_data.py", line 557, in __init__
  File "build/bdist.linux-x86_64/egg/gluentlib/offload/offload_source_data.py", line 589, in _set_offload_type
  File "build/bdist.linux-x86_64/egg/gluentlib/offload/offload_source_data.py", line 176, in get_offload_type_for_config
TypeError: warning() got an unexpected keyword argument 'ansi_type'
