The SAP HANA Data Downloader is a Python script that allows you to download data from SAP HANA databases and store it in parquet files. This can be useful when working with large datasets that are too big to be stored in memory or when you want to work with data in a format that can be easily shared and analyzed.
Overall, the HANA-downloader script provides a powerful and flexible tool for downloading and processing data from SAP HANA systems, allowing users to customize the extraction to fit their specific needs.
- Install Python 3 and all the libraries listed in `requirements.txt`
- Download and install the SAP BTP Neo SDK from here and start it with your BTP sub-account info
- Take into account the Java version required by the SDK. You could use a JVM provided by SAP here.
When the tunnel is started using the SAP BTP Neo SDK, the DB instance in the cloud is mapped to your localhost on a specific port. Generally the first connection will take port `30015`. Take note of this port to set an environment variable.
- Set the credentials, host and port used to connect to the SAP HANA DB instance as environment variables, with exactly the following names, as in this example:

  ```
  export SAP_HANA_HOST=<localhost>
  export SAP_HANA_PORT=<300xx>
  export SAP_HANA_USER=<DB User>
  export SAP_HANA_PASSWORD=<Password of DB user>
  ```
- To save the parquet files to a cloud blob storage you should set:
  - For Google Cloud Platform, the environment variable `GOOGLE_APPLICATION_CREDENTIALS` with the path to the service account JSON.
  - For AWS S3, follow the credential authentication methods used by boto, which are specified here: https://s3fs.readthedocs.io/en/latest/#credentials
Note: To make sure the Python script recognizes the environment variables independently of any terminal session, add the variables to your `~/.bashrc` file and then execute `source ~/.bashrc`. Update the port and credentials in that file if needed.
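As a quick sanity check that the tunnel and the environment variables are set up correctly, you can open a connection with SAP's official Python driver (`hdbcli`), assuming it is installed. This is only an illustrative sketch, not part of the downloader itself:

```python
# Illustrative connectivity check using the SAP HANA Python driver (hdbcli).
# Not part of the downloader script; it only verifies that the tunnel and the
# environment variables above are set correctly.
import os
from hdbcli import dbapi

conn = dbapi.connect(
    address=os.environ["SAP_HANA_HOST"],    # e.g. localhost when using the BTP tunnel
    port=int(os.environ["SAP_HANA_PORT"]),  # e.g. 30015
    user=os.environ["SAP_HANA_USER"],
    password=os.environ["SAP_HANA_PASSWORD"],
)

cursor = conn.cursor()
cursor.execute("SELECT CURRENT_UTCTIMESTAMP FROM DUMMY")
print(cursor.fetchone())
cursor.close()
conn.close()
```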
The SAP HANA Data Downloader can be used in two ways: to configure the data download and to download the data.
This mode creates several JSON files that store the behaviour, the parameters and the origin of the data to be downloaded. The configuration-file approach decouples the process from manual input and automates the extraction. To configure the data download, call the script with the "configure" parameter followed by the following options:
- `--config_dir` or `-cd`: The directory where the configuration files will be stored.
- `--table` or `-t`: The name of the table to be downloaded.
- `--table_schema` or `-ts`: The schema of the table to be downloaded.
- `--limit_mode` or `-lm`: If the data has to be split into chunks because of a best practice, data volume, network quota or performance reason, this parameter is used to calculate those batches. It is optional and the possible values are `["records", "B", "KB", "MB", "GB", "TB"]`; it acts as the size unit of the batch.
- `--limit_num` or `-ln`: If the `--limit_mode` flag was set, set here the size of the batch as an integer number.
- `--group` or `-g`: The fields used to group the data files. It is a way to generate all the data of a certain subset together. It can be used with or without the `--limit_mode` flag.
- `--query` or `-q`: The ability to send a custom SQL query as a string, which will override the flags `--table_schema`, `--table` and `--group`. The flags `--limit_num` and `--limit_mode` can still be used, but `--limit_mode` must be set to `records`.
- `--null_treatment` or `-nt`: The treatment to apply to null values. This can be provided as a string in JSON format or as the path to a JSON file that contains the rules. It is optional (see the sketch after this list). The structure of the JSON can be broken down as follows:
  - A curly brace pair (an object) for each field of the table
  - For each object (aka field) the following key-value pairs:
    - The `field` key holds the name of the table field as a string
    - The `null_const` key also expects a string: the constant to be used as a replacement for all null values
    - The `nulls` key expects an array of regular expressions as strings. The script will recognize every value of the given `field` that matches one of these regexes and replace it with the `null_const`.
    - Finally there is the key `not_nulls`, which can be used together with or instead of `nulls`. Because listing the null values requires knowing beforehand which values should be treated as null, it is sometimes more feasible to state the rule that an expected "good" value must satisfy. This key also expects an array of strings, each one a regular expression. The script will compare each data value of the column against all the expressions in this array with the AND operator, meaning that all the expressions must match for the value to pass; otherwise it will be replaced by the value of `null_const`.
  - See an example here
- `--download_dir` or `-t`: Path where the data files will be downloaded. For a Google Cloud Storage bucket specify the path as `gs://bucket/path`, and for an AWS S3 bucket specify the path as `s3://bucket/path`.
- `--download_mode` or `-t`: File system or cloud blob storage (data lake) where the files will be written.
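To make the `--null_treatment` structure concrete, here is a small sketch that builds the same rule used in the example command below as a Python object and serializes it, either to a JSON string or to a file whose path can be passed to the flag:

```python
# Sketch of a null_treatment rule set (same rule as the example command below).
import json

null_rules = [
    {
        "field": "FMHRDATE",        # column the rule applies to
        "null_const": "19000101",   # replacement written for every "null" value
        # values matching ANY of these regexes are treated as null
        "nulls": [r"^0*$", r"^.*\s+$"],
        # a value must match ALL of these regexes to be kept,
        # otherwise it is replaced by null_const
        "not_nulls": [r"^(19\d\d|20[0-2]\d)(0[1-9]|1[012])(0[1-9]|[12]\d|3[01])$"],
    },
]

# Pass the rules inline as a JSON string ...
rules_as_string = json.dumps(null_rules)
# ... or save them to a file and pass its path to --null_treatment
with open("null_treatment.json", "w") as fh:
    json.dump(null_rules, fh, indent=2)
```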
Example:
```
python HANA_downloader.py "configure" --config_dir "~/test/configuration" \
    --table "T001" --table_schema "ECC" --group MANDT LAND1 \
    --null_treatment "[{'field': 'FMHRDATE','null_const': '19000101','nulls': ['^0*$','^.*\s+$'],'not_nulls': ['^(19\d\d|20[0-2]\d)(0[1-9]|1[012])(0[1-9]|[12]\d|3[01])$']}]" \
    --download_dir "~/test/data" --download_mode local
```
To download the data stored in a table, you can call the script with the "download" parameter followed by the following options:
- `--table` or `-t`: The name of the table to be downloaded.
- `--table_schema` or `-ts`: The schema of the table to be downloaded.
- `--limit_mode` or `-lm`: If the data has to be split into chunks because of a best practice, data volume, network quota or performance reason, this parameter is used to calculate those batches.
- `--limit_num` or `-ln`: If the `--limit_mode` flag was set, set here the size of the batch as an integer number.
- `--download_dir` or `-t`: The directory where the downloaded data will be stored. For cloud blob storage specify the URI as `gs://bucket/path` or `s3://bucket/path`.
- `--download_mode` or `-t`: The download mode. This can be either "local" or, for cloud blob storage, "GCS" (Google Cloud Storage) or "S3" (from AWS).
- `--query` or `-q`: The ability to send a custom SQL query as a string, which will override the flags `--table_schema` and `--table`. The flags `--limit_num` and `--limit_mode` can still be used, but `--limit_mode` must be set to `records`.
- `--config_dir` or `-cd`: The directory where the configuration files are stored. This argument can be the path of a directory or even a single file. If it is populated, then the rest of the flags are optional and will be overwritten with the corresponding values present in the configuration file.
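For reference, when `--limit_mode` is `records` the batching boils down to paging the result set; a minimal sketch of that idea (the script's actual implementation may differ) could look like this:

```python
# Minimal sketch of record-based batching with LIMIT/OFFSET paging.
# This only illustrates the idea behind --limit_mode "records" and
# --limit_num; it is not the downloader's actual implementation.
def fetch_in_batches(conn, schema, table, batch_size):
    """Yield lists of rows, each at most `batch_size` rows long."""
    cursor = conn.cursor()
    offset = 0
    while True:
        cursor.execute(
            f'SELECT * FROM "{schema}"."{table}" '
            f"LIMIT {int(batch_size)} OFFSET {int(offset)}"
        )
        rows = cursor.fetchall()
        if not rows:
            break
        yield rows
        offset += batch_size
    cursor.close()
```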
Example:
```
python HANA_downloader.py "download" --config_dir "~/test/configuration"
```
Or a simplified version in one step, saving to a Google Cloud Storage bucket:
```
python HANA_downloader.py "download" --table "T001" --table_schema "ECC" --group MANDT LAND1 \
    --null_treatment "[{'field': 'FMHRDATE','null_const': '19000101','nulls': ['^0*$','^.*\s+$'],'not_nulls': ['^(19\d\d|20[0-2]\d)(0[1-9]|1[012])(0[1-9]|[12]\d|3[01])$']}]" \
    --download_dir "gs://my_bucket/subfolder" --download_mode GCS
```
Once the data is downloaded, it will be stored in parquet files compressed with snappy. The file names follow the pattern `data_<table_schema>_<table>_[<grouped_fields-values>]_<timestamp>.parquet.snappy`.
You can use the `parq` CLI to inspect the parquet files. For example:

```
parq data_ECC_T001_20230317002353886538.parquet.snappy
```
You can also view the schema of the parquet file with:
```
parq data_ECC_T001_20230317002353886538.parquet.snappy -s
```
And view part of the data with:
```
parq data_ECC_T001_20230317002353886538.parquet.snappy -head
```
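If you prefer to inspect the files programmatically instead of with `parq`, pandas and pyarrow can read them directly (file name taken from the example above):

```python
# Programmatic inspection of a downloaded parquet file with pyarrow/pandas.
import pandas as pd
import pyarrow.parquet as pq

path = "data_ECC_T001_20230317002353886538.parquet.snappy"

print(pq.read_schema(path))  # column names and types
df = pd.read_parquet(path)
print(df.head())             # first few rows of the data
```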
- Create it as a Docker image
- Add a checksum to detect whether the data present on the system differs from the data at the time the configuration file was created