Git Product home page Git Product logo

dblink's Introduction

DBLINK()

DBLINK() is a Vertica User Defined Transform Function coded in C++ to run SQL against other databases.

For example, the following statement runs a row count in PostgreSQL and retrieves the result (6,001,215) in Vertica:

=> SELECT DBLINK(USING PARAMETERS
    cid='pgdb',
    query='SELECT COUNT(*) FROM tpch.lineitem'
) OVER();
count
--------
6,001,215
--- 1 row selected in 0.228s (prep 0.17s, exec 0.211s, fetch 0.000/0.000s)

Usage

DBLINK() is a Vertica function that pushes SQL to other databases and retrieves the result of the remote execution back in Vertica. DBLINK() can push any type of SQL commands that the remote database accepts:

  • DDL statements. For example, CREATE a table in the remote database.
  • DML statements to manipulate data in the remote database (INSERT, UPDATE, etc.).
  • DQL statements to SELECT data from the remote database using the SQL dialect and functions available on the remote database.
  • DCL statements like GRANT and REVOKE.

Examples

The following statement creates a table in Vertica named public.customer that contains 10% of randomly selected data from the PostreSQL table tpch.customer:

=> CREATE TABLE public.customer AS
    SELECT DBLINK(USING PARAMETERS
        cid='pgdb',
        query='SELECT * FROM tpch.customer WHERE RANDOM() < 0.1')
OVER();

This statement creates an empty table in Vertica corresponding to the table definition in the remote database:

=> CREATE TABLE public.customer AS
    SELECT DBLINK(USING PARAMETERS
        cid='pgdb',
        query='SELECT * FROM tpch.customer LIMIT 0')
OVER();

This statement will group-by the result of a JOIN between the Vertica table tpch.nation and the MySQL table tpch.region:

=> SELECT r.r_name, count(*)
FROM tpch.nation n
    LEFT OUTER JOIN
        ( SELECT DBLINK(USING PARAMETERS
            cid='mypg’,
            query='SELECT r_name, r_regionkey FROM tpch.region'
            ) OVER()) r
    ON n.n_regionkey = r.r_regionkey
GROUP BY 1 ;

This statement drops a PostgreSQL table if exists:

=> SELECT DBLINK(USING PARAMETERS
	cid='pgdb',
	query='DROP TABLE IF EXISTS public.t1') OVER();

Sometimes the SQL that you want to push to the remote database is quite complex. In these cases, you might find useful to write the SQL in a file using your preferred editor, and then pass the file containing the SQL text to DBLINK() using the following syntax:

=> SELECT DBLINK(USING PARAMETERS
	cid='mysql',
	query='@/tmp/myscript.sql') OVER()";

Installation

You can install DBLINK() from the latest released binaries without needing to clone this repository. Or, you can clone the repository and build from source. Whichever way you chose, ODBC has to be configured on all nodes in order to use DBLINK()..

Prerequisites

DBLINK() uses ODBC to interact with the remote databases. You must install and configure the following on all nodes in your cluster:

Installing pre-built binaries

  1. Download the appropriate build of ldblink.so from the latest release. Example: ldblink.so.centos-v23.3.0 -> binary for vertica v23.3.0 on CentOS
  2. Rename the binary to ldblink.so
  3. Copy ldblink.so to the initiator node.
  4. Execute this SQL substituting the full path of ldblink.so on the initiator node:
	    CREATE OR REPLACE LIBRARY DBLink AS '/full/path/to/ldblink.so' LANGUAGE 'C++';
	    CREATE OR REPLACE TRANSFORM FUNCTION dblink AS LANGUAGE 'C++' NAME 'DBLinkFactory' LIBRARY DBLink ;
            GRANT EXECUTE ON TRANSFORM FUNCTION dblink() TO PUBLIC ;
  1. Create a Connection Identifier Database (a simple text file) under /usr/local/etc/dblink.cids. You can use a different location by changing the DBLINK_CIDS define in the source code. For details, see Configure DBLINK().

  2. (optional) Delete ldblink.so from initiator node.

Build DBLINK() From Source

Before you run make commands, review the Makefile and make any necessary changes.

On your own build environment

  • First you need to Setup a C++ Development Environment
  • Then to build using the installed Vertica SDK and devtoolset (centos): just run

    make

  • Install the library in Vertica (as dbadmin):

    make install

  • Create a Connection Identifier Database (a simple text file) under /usr/local/etc/dblink.cids. You can use a different location by changing the DBLINK_CIDS define in the source code. For details, see Configure DBLINK().

Without setting up a build environment

This is mainly for testing and release, or when you want to build dblink for a specific vertica version on centos/ubuntu but you do not have the tools to do it. It uses the verticasdk docker image owned by Vertica to build dblink for any specified vertica version.

  1. Compile the DBLINK source code with for the appropriate Vertica version and Linux distribution.

    • To build without needing to set up a build environment, specify the version and target OS. For example:
      $ make -C docker-dblink VERTICA_VERSION=12.0.2 OSTAG=ubuntu
      
      it will create ldblink.s0.ubuntu-v12.0.2 in the docker-dblink directory
  2. Install the library in Vertica (as dbadmin): Follow Installing pre-built libraries from step 2.

  3. Create a Connection Identifier Database (a simple text file) under /usr/local/etc/dblink.cids. You can use a different location by changing the DBLINK_CIDS define in the source code. For details, see Configure DBLINK().

Uninstall DBLINK()

You can uninstall the library with DROP LIBRARY DBLink in vsql or by running make clean.

Configure DBLINK()

DBLINK() requires two parameters and accepts one optional parameter with the following syntax:

DBLINK(USING PARAMETERS cid=value, query=value[, rowset=value]);

Parameters

Name Required Description
cid No Connection Identifier Database. Identifies an entry in the connection identifier database.
connect_secret No The ODBC connection string containing the DSN and credentials.
query Yes The query being pushed on the remote database. If the first character of this parameter is @, the rest is interpreted as the name of the file containing the query.
rowset No Number of rows retrieved from the remote database during each SQLFetch() cycle. Default is 100.

For example, the following query retrieves data from the remote database 500 rows at a time:

=> SELECT DBLINK(USING PARAMETERS
    cid='pgdb',
    query='SELECT c_custkey, c_nationkey, c_phone FROM tpch.customer ORDER BY 1',
    rowset=500) OVER();
 c_custkey | c_nationkey |     c_phone
-----------+-------------+-----------------
         1 |          15 | 25-989-741-2988
         2 |          13 | 23-768-687-3665
         3 |           1 | 11-719-748-3364
         4 |           4 | 14-128-190-5944
         5 |           3 | 13-750-942-6364
         6 |          20 | 30-114-968-4951
         7 |          18 | 28-190-982-9759
...

Connection parameters

Connection Identifier Database

One way to specify the connection parameters is to use a Connection Identifier Database -- a simple text file containing the codes used with cid. The cid file must exist in the same location on all vertica nodes. For example:

$ cat /usr/local/etc/dblink.cids
# Vertica DBLINK Configuration File
#
# Connection IDs lines have the following format:
#    <mnemonic code>:<ODBC configuration>
# and are terminated by a SINGLE '\n' (ASCII dec 10, ASCII hex 0x0a)
# Be aware of this! Windows editors might end lines with \r\n. In
# this case the Carriage Return is considered part of the ODBC config
# and can cause undefined ODBC Driver Behavior.
#
# Lines starting with '#' are considered comments
#
# Sample configuration:

pgdb:UID=mauro;PWD=xxx;DSN=pmf
myver:UID=mauro;PWD=xxx;DSN=vmf
mysql:USER=mauro;PASSWORD=xxx;DSN=mmf

Then use the cid parameter to pick our connection:

SELECT DBLINK(USING PARAMETERS cid='myconnecction', query=...) ...

DBLINK Parameters

Another methods you can use to specify the connection parameters is to use connect_secret. This way you don't have to create a the dblink.cids database however defining the connection parameters in the command line is not safe before Vertica 12.0.4. All queries are recorded under v_monitor.query_requests and in the log file, and that can expose passwords in the connect_secret parameter.

SELECT DBLINK(USING PARAMETERS connect_secret='UID=mauro;PWD=secret;DSN=pmf', query=...) ...

Session Parameters

Lastly, a final method is to set a UDPARAMETER in the session after connecting to Vertica. The value for this SESSION PARAMETER won't be recorded in query_requests.

ALTER SESSION SET UDPARAMETER FOR ldblink dblink_secret = 'UID=mauro;PWD=secret;...' ;
SELECT DBLINK(USING PARAMETERS query='my first query') ...
SELECT DBLINK(USING PARAMETERS query='my second query') ...
SELECT DBLINK(USING PARAMETERS query='my third query') ...

Configure the ODBC Layer

You must install and configure the ODBC layer on all nodes for each database that you want to connect to.

The ODBC configuration depends on the specific ODBC Driver Manager and Database ODBC Drivers. The following is an example configuration file that configures unixODBC and the the PostgreSQL/MySQL drivers:

$ cat /etc/odbc.ini
[ODBC Data Sources]
PSQLODBC  = PostgreSQL ODBC
MYODBC  = MySQL ODBC

[pmf]
Description  = PostgreSQL mftest
Driver = PSQLODBC
Trace  = No
TraceFile  = sql.log
Database = pmf
Servername = mftest
UserName =
Password =
Port = 5432
SSLmode  = allow
ReadOnly = 0
Protocol = 7.4-1
FakeOidIndex = 0
ShowOidColumn  = 0
RowVersioning  = 0
ShowSystemTables = 0
ConnSettings =
Fetch  = 1000
Socket = 4096
UnknownSizes = 0
MaxVarcharSize = 1024
MaxLongVarcharSize = 8190
Debug  = 0
CommLog  = 0
Optimizer  = 0
Ksqo = 0
UseDeclareFetch  = 0
TextAsLongVarchar  = 1
UnknownsAsLongVarchar  = 0
BoolsAsChar  = 1
Parse  = 0
CancelAsFreeStmt = 0
ExtraSysTablePrefixes  = dd_
LFConversion = 0
UpdatableCursors = 0
DisallowPremature  = 0
TrueIsMinus1 = 0
BI = 0
ByteaAsLongVarBinary = 0
LowerCaseIdentifier  = 0
GssAuthUseGSS  = 0
XaOpt  = 1
UseServerSidePrepare = 0

[mmf]
Description  = MySQL mftest
Driver = MYODBC
SERVER = mftest
PORT = 3306
SQL-MODE = 'ANSI_QUOTES'

$ cat /etc/odbcinst.ini
[ODBC]
Trace=off
Tracefile=/tmp/uodbc.trc

[PSQLODBC]
Description=PostgreSQL ODBC Driver
Driver64=/usr/lib64/psqlodbcw.so
UsageCount=1

[MYODBC]
Driver=/usr/lib64/libmyodbc8w.so
UsageCount=1

[MySQL ODBC 8.0 ANSI Driver]
Driver=/usr/lib64/libmyodbc8a.so
UsageCount=1

Report an issue

To ask a question, start a discussion. To report an issue, open up an issue and provide following information:

  • The command that you ran and the associated output as shown on your screen by using the standard Vertica SQL client vsql.
  • Vertica version: SELECT VERSION();.
  • DBLINK library metadata by running statement the following as dbadmin:
    => SELECT * FROM USER_LIBRARIES WHERE lib_name = 'ldblink';
    
  • Attach the following ODBC configuration files:
    • odbc.ini (please remove passwords or other confidential information)
    • odbcinst.ini
  • ODBC Driver Manager version and configuration. For example, with unixODBC, the output of the command odbcinst -j.
  • ODBC traces obtained while running the command (see 1.). To enable the ODBC traces you have to set Trace = on in odbcinst.ini. Do not forget to switch ODBC tracing off at the end.

dblink's People

Contributors

mfelici avatar nerdlogic avatar oualib avatar roypaulin avatar rs-vertica avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

dblink's Issues

UDx side process exited abnormally if calling dblink multiple times in one query

=> SELECT l.id, l.description, tab_a.value value_a, tab_b.value value_b
-> FROM tab_local l
-> LEFT JOIN (SELECT DBLINK(USING PARAMETERS cid='orcl', query='SELECT * FROM tab1') OVER()) tab_a
->   ON tab_a.id = l.id
-> LEFT JOIN (SELECT DBLINK(USING PARAMETERS cid='orcl', query='SELECT * FROM tab1') OVER()) tab_b
->   ON tab_b.id = l.id;

ERROR 8092:  Failure in UDx RPC call InvokeProcessPartition() in User Defined Object [dblink]: UDx side process has exited abnormally

dblink has several global variables and doesn't consider being called multiple times in one query.

For example, in the first call, DBLinkFactory.getReturnType() sets the data type array pointer to 'Odt' global variable. Then, the same method sets a different pointer to that variable in the second call. After that, 'Obt' global variable is cleared by clean() method called by processPartition(). But the cleared pointer is for the second call. So the second call failed when accessing this variable.

Remove docker dependencies from Makefile

We should remove any dependencies from docker in the Makefile. In case we need to take care of a "non standard" environment or a specific need than we will have to isolate the required changes:

  • either creating a different Makefile (for example Makefile.docker)
  • or adding specific rules
  • or - finally - adding ad-hoc directives and variables
    Default rules and standard behavior for the Makefile should work on a "normal" Linux system.

Connection using a DSN

As already discussed, it would be nice to be able to use directly a DSN as a parameter.

Adding Unit Tests

@mfelici you'll be able to do it when the environment is ready.

@roypaulin This is the process (via Github Action):

  1. install Vertica
  2. install DBLink UDx
  3. install Postgres
  4. create the different Connection ID needed by DBLINK
  5. run a test script (it will be provided by @mfelici)
  6. if the script fails it raises an error otherwise it tells us it is working
  7. it closes and kills everything
  8. we can then create a protected branch: the code can only be merged if the tests succeeded and one reviewer approves

Adding a License File

I think we can start with the Apache 2.0 License.
We need to add the license in the folder.

Each file of the project should start with:

# (c) Copyright [2018-2022] Micro Focus or one of its affiliates.
# Licensed under the Apache License, Version 2.0 (the "License");
# You may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.