Git Product home page Git Product logo

php-etl's Introduction

PHP ETL

Build Status Latest Stable Version Latest Unstable Version License

Extract, Transform and Load data using PHP.

Installation

In your application's folder, run:

composer require marquine/php-etl

Documentation

Documentation can be found here.

Example

In the example below, we will extract data from a csv file, trim white spaces from the name and email columns and then insert the values into the users table:

use Marquine\Etl\Etl;

$etl = new Etl;

$etl->extract('csv', '/path/to/users.csv')
    ->transform('trim', ['columns' => ['name', 'email']])
    ->load('insert', 'users')
    ->run();

License

PHP ETL is licensed under the MIT license.

php-etl's People

Contributors

antoniors avatar gilmojoa avatar hersoncruz avatar joshstaff avatar leomarquine avatar tristanhall avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

php-etl's Issues

Validator Step

Hi. Have you planned to implement a validation step? This could be very useful in ensuring that the extracted rows are valid. Thank you.

Connection Cleanup

Problem

I have been using PHP ETL and the Parallel extension's functional API to ingest large CSV datasets. It has been reliable and fast, but there can be several hundred database connections left open until the ingest is complete. I poked around the source code for this ETL package, but there are no methods or documentation for removing connections.

For context, this is how the ingest script works:

  • Stream contents of the CSV file into a temporary $batch array
  • Once 5K rows have been loaded into the batch, push a task into $tasks array
  • At end of file, load remaining batch into a task
  • Each task runs ETL in a parallel process using a closure
    • Initialize new ETL instance
    • addConnection() for each target database
    • extract() & transform() the batch data
    • load() the batch into the DB
    • run() ETL
    • unset($etl)
    • return (also tried exit)
  • Await completion of all tasks
  • Move to next dataset...

Though I can't share any actual code, those closure bullet points are essentially what happens.

Inside the ETL closure, I have tried exit and return after $etl->run() completes, and I have tried unsetting the ETL instance in the closure. Still, the processes and DB connections remain open.

Documentation for the Parallel extension could be more robust.

Request

The Manager class would benefit from a removeConnection or destroyConnection method, where the conn would be removed from $connections. Would that terminate the PDO connection?

I'm happy to open a PR if this would work. I would also take advice on using persistent connections with ETL and Parallel.

Uncaught PDOException: SQLSTATE[HY093]

I have a csv file i configured the leomarquine php-etl
but in the process i have this error
Fatal error: Uncaught PDOException: SQLSTATE[HY093]: Invalid parameter number: parameter was not defined in C:\xampp\htdocs\etl\vendor\marquine\php-etl\src\Loaders\Table.php:162

when i saw the Table.php:162
i found the error in the execute function of PDO

the function of insert in the Class Table is :

protected function insert($items) {
    if (empty($items)) {
        return;
    }

    $statement = Etl::database($this->connection)->prepareInsert(
            $this->table, $this->columns
    );
    $callback = function ($items) use ($statement) {
        foreach ($items as $item) {
            /* if ($this->timestamps) {
              $item['created_at'] = $this->time;
              $item['updated_at'] = $this->time;
              } */
            
           // i want to see what is in the $item value i have an array ;the value of this in the bottom ***
            var_dump($item);

            $statement->execute($item);
        }
        //die();
    };

    Etl::database($this->connection)->transaction(
            $items, $callback, $this->transaction
    );
}

*** output of var_dump($item)
array(1) {
["COD_HEXC,TECH,ELIG_FTTB,REFR_NRO,GEST_INFR,TYP_LOGM,NB_LOGT,REFR_PM,NUMR_VOIE,SS_NUMR_VOIE,NOM_VOIE,COD_POST,VILL,COD_INSEE,DAT_PREM_COMM_ADRS,DAT_DERN_COMM_ADRS,DEBIT_NET_PREC,DEBIT_NET,SELF_INST,ZDV_id,NOM_DE_ZDV,REGION,TYPE_ZDV,NB_FYR_FIX,NB_FYR_THD,NB_FYR_ADSL,NB_FYR_MOB_MONO,NB_FYR_MOB_MULTI,NB_FYR_MOB_MULTI_THD,NB_FYR_MOB_MULTI_ADSL,NB_CLNT_NC,week,cohorte,secteur_vad,COURTIER,Non_commercialisable,type,etat,Date_Visite_Last,Date_Visite_Next,Date_Arrivee,MOIS_LIVRAISON,POSITION,idVendeur,id_polygons"]=>
string(243) "62133227TS,FTTB,1,NA,NA,NA,1,NA,143,,"AVENUE DE LA REPUBLIQUE",62420,"BILLY MONTIGNY",62133,07/11/2016,07/11/2016,30,30Mo,0,89,"HENIN CARVIN",NORD-EST,"WAR ZONE THD",0,0,0,0,0,0,0,0,0,2016_S0,Heninois,CONNECT,,0,0,now(),now(),now()," "," ",0,0"
}
the error returned of the line 162 ( $statement->execute($item); )


Fatal error: Uncaught PDOException: SQLSTATE[HY093]: Invalid parameter number: parameter was not defined in C:\xampp\htdocs\etl\vendor\marquine\php-etl\src\Loaders\Table.php:162
Stack trace:
#0 C:\xampp\htdocs\etl\vendor\marquine\php-etl\src\Loaders\Table.php(162): PDOStatement->execute(Array)
#1 [internal function]: Marquine\Etl\Loaders\Table->Marquine\Etl\Loaders{closure}(Array)
#2 C:\xampp\htdocs\etl\vendor\marquine\php-etl\src\Database\Connection.php(141): call_user_func(Object(Closure), Array)
#3 C:\xampp\htdocs\etl\vendor\marquine\php-etl\src\Loaders\Table.php(168): Marquine\Etl\Database\Connection->transaction(Array, Object(Closure), 100)
#4 C:\xampp\htdocs\etl\vendor\marquine\php-etl\src\Loaders\Table.php(123): Marquine\Etl\Loaders\Table->insert(Array)
#5 C:\xampp\htdocs\etl\vendor\marquine\php-etl\src\Job.php(95): Marquine\Etl\Loaders\Table->load('pvr_fdr_copie', Array)
#6 C:\xampp\htdocs\etl\insert.php(150): Marquine\Etl\Job->load('table', 'pvr_fdr_copie')
#7 {main}
thrown in C:\xampp\htdocs\etl\vendor\marquine\php-etl\src\Loaders\Table.php on line 162

NOTE : the SUM of columns is equal of values

Can commands be PHP function callbacks?

Hi, I really like php-etl! It has made it really easy for me to fetch JSON data through APIs without me having to build my own interpreter. It saved me a couple of days already!

I would like to run a PHP function as a callback when an ETL job has finished, I, instead of as a command (which is a shell command).

Is there a way to do this?

Replace XMLReader with SimpleXML

SimpleXML is available by default in PHP since 5.1.3, and allows XPath queries. Switching the XML Extractor to using this module has a number of advantages:

  • Simpler code base; the extract() function no longer requires helper functions per node
  • Filtering from within the loop param; it becomes possible to only include nodes that match various criteria, which can be useful in certain XML structures
  • parent node querying from columns; useful when trying to convert nested documents to a flat relational structure

Performance seems to be comparable in my limited testing.

If this is a valuable enhancement to the existing XML extractor, I can submit a pull request. I could package this as a new extractor type ("simplexml") if that would be better for backward compatibility.

MySqlConnector unix_socket should check for not empty

In MySqlConnector you have:

        if (isset($unix_socket)) {
            $dsn['unix_socket'] = $unix_socket;
        }

I think it really should be this:

        if (!empty($unix_socket)) {
            $dsn['unix_socket'] = $unix_socket;
        }

The standard database config file that comes with Laravel defines unix_socket but leaves it empty. The code you have now sees that unix_socket is set and tries to use it even though it is an empty string.

I will do a pull request if you like.

Autobinding doesn't work - sometimes- with Symfony

The autobinding sometimes doesn't work with Symfony 5.

Unresolvable dependency resolving [Parameter #0 [ <required> string $assetsDir ]] in class App\File \CsvFileHandler

For some classes it works: classes I created and I call without any interaction with ETL classes.
For some it doesn't : when using custom tranformers/loaders inheriting the native ones.

It seems it is related to the "custom" DI system of the library... I wonder if in fact it is fully compatible with anything else than Laravel...

Updating column type array :(

Version: marquine/php-etl v1.2.0
Module: Loading
Loading TABLE with column data into array

Exemple:
Query select fone[1] from contact

Return log:
malformed array literal: "1" at character 35
Array value must start with "{" or dimension information.
STATEMENT: update contact set fone = '(82)9999-9999' where id = '1'

The column fone should be: fone[1]

PDOException: No handler for this scheme

Pulling the latest version of this package via Composer (v2.2.0) has an old version of the MySqlConnector class included.

The code tagged in this release has a version of this class from before the unix socket fix was applied in v1.2.

Specifically, this commit isn't included in v2.2.0

aa1ce7b

jscon_decode transformer is only decoding integers

is the json_decode transformer working as expected to your knowledge? I have been trying to debug issue with using the json_decode transformer for days now and problem I experience is it can only decode integers in the json string. The rest of the values are coming out as NULL. I have verified i can call the standard json_decode on the file contents so its not that the encoding of the file is wrong.
just to re-iterate - i have tested the etl flow and when i am just using integer values for the mapped columns in my database it runs through without a problem, but when i try to pass string vales they are decoded as NULL.

Xml multiple entries only get last value

When loading an xml document with the extract for a multiple value name node:

<item>
      <do>
            <code>A</code>
            <code>B</code>
            <code>C</code>
            <code>D</code>
      </do>
</item>
<item>
      <do>
            <code>A</code>
            <code>B</code>
            <code>C</code>
            <code>D</code>
      </do>
</item>

I iterate on and get all key value but for i get only one key "code" with the last value, do you think it possible to get an array value "do" => ["code" => [0 => "A", 1 => "B"...] to get all values ?

Uncaught PDOException: SQLSTATE[HY093] / Quoted elcosure issue

Hi,
It seems that the makeColumns function doesn't work in my case, my first column keep his double quote even if I specify the enclosure option :
this line (from Csv class / makColumns)
$columns = array_flip(str_getcsv(fgets($handle), $this->delimiter, $this->enclosure));
return
array(3) { [""FOO""]=> int(0) ["BAR"]=> int(1) ["FUZ"]=> int(2) }
So it trigger a PDOException: SQLSTATE[HY093]

Feature Request: Unique Transformer

A common issue that can come up is non-distinct values. My use case is an XML document where I am getting two columns as attributes. These can appear in the attributes more than once. This causes additional table updates during the load cycle.

PDOException: No handler for this scheme

Background

  • I am using this package with Laravel 5.7
  • I have resolved the issue describe below. This is just an informative ticket.

Issue

After configuring my first pipeline I was greeted with the following error when attempting to execute it.

PDOException with message 'SQLSTATE[HY000] [2002] No handler for this scheme'

Details

The default database config for Laravel 5.7 sets the unix_socket option to an empty string. Whenever this package attempts to establish a connection it only checks to see if a socket option `isset. It doesn't check to see if the provided option is truthy.

This results in it generating adns option that looks something like the following: mysql:unix_socket=;dbname=default. I'm not sure if that's ever considered a valid option, however it certainly sent me down the rabbit hole.

Solution

The simplest solution, for me, was to completely remove the unix_socket option from my database configuration. Doing so resulted in a dns option that looks more like the following: mysql:host=127.0.0.1;port=3306;dbname=default which worked great for me.

I'm not sure if this is documented somewhere. I might have missed an aside or skimmed over something I should have read thoroughly, however, I wanted to share my findings just in case anybody else runs into this issue.

Code References

The following is the code that checks to see if a unix_socket option has been provided. The same method also checks for other settings, such as a host, port, etc. As you can see, it only checks to see if the option isset, and it does not check to ensure a non-empty string is provided. (Again, I'm not sure if an empty string is ever acceptable. If it isn't, it would be great to see this check extended to also check for truthiness.)

https://github.com/leomarquine/php-etl/blob/master/src/Database/Connectors/MySqlConnector.php#L36-L38

SQL commands not using db scheme

I use postgresql.
I have DB with some schemas: public, transit, log, debug.
I use Laravel.
I define schema via .env file:
DB_SCHEMA=transit

To create valid sql i need to use write schema in each load function:
$etl->load('insert_update', 'transit.user2')

Unable to insert the data from XML file into database

Unable to insert the data from XML file into database
I am trying to use this module to extract and import data from XML files. But It is not inserting the data into the table and not showing any error too

Could you please tell whether it supports to insert data from XML to Database?

Resolving ETL from Laravel's IoC Container makes nonfunctioning instance

When I manually make an ETL instance

(new Marquine\Etl\Etl)->extract('csv', './file.csv')->transform('rename_columns', ['columns' => ['email_address' => 'email']])->toArray()

I get...

[
     [
       "id" => "1",
       "name" => "Name 1",
       "email" => "[email protected]",
     ],
     //...
]

but when use the IoC...

app(Marquine\Etl\Etl::class)->extract('csv', './file.csv')->transform('rename_columns', ['columns' => ['email_address' => 'email']])->toArray()

I get an error...

InvalidArgumentException with message 'The step [csv] is not a valid extractor.'

Boolean values appear to extract as strings

I'm using the table extractor on a PostgreSQL 7.8 database and it appears Boolean values are being extracted as strings. I'm trying to load the data back into a PostgreSQL 9.3 database and PostgreSQL is complaining:

Invalid text representation: 7 ERROR: invalid input syntax for type boolean: ""

Do I have to create a transformer to convert these strings to bool, or is there some additional configuration options I missed?

Duplicates on insert with multiple keys lookup?

Thanks for the library. It is making my code very readable!
I am using v1.1. I can't use the latest master due to PHP version constraints just yet.
With the code below I am getting duplicates after the job runs multiple times.

// Make items unique
$items = array_intersect_key($items, array_unique(array_map("serialize", $items)));

$loadOptions = [
  'keys' => ['account_id', 'project_id'],
  'insert' => true,
  'update' => true,
  'delete' => false, // TODO Soft delete
];
$loaderJob = new Job();
$loaderJob->extract('ArrayData', $items);
$loaderJob->load('Table', 'account_project', $loadOptions);

Extracting from multi-level JSON objects

I am extracting data where there is more than one level of objects with properties. In PHP this would be a multi-dimension array.

Sample data:

{
  "id": "ocd-bill/ab2cc592-ebbb-4350-aecc-215f9423c39f",
  "created_at": "2017-03-31T03:38:59.726143+00:00",
  "debug": null,
  "updated_at": "2017-04-27T00:34:25.132760+00:00",
  "abstracts": [
  ],
  "from_organization": {
    "name": "Chicago City Council",
    "id": "ocd-organization/ef168607-9135-4177-ad8e-c1f7a4806c3a"
  },
  "sponsorships": [
    {
      "entity_name": "Emanuel, Rahm",
      "entity_type": "person",
      "classification": "Primary",
      "primary": true,
      "entity_id": "ocd-person/f649753d-081d-4f22-8dcf-3af71de0e6ca"
    }
  ],
  "classification": [
    "ordinance"
  ],
  "subject": [
    "Sale",
    "Loan & Security"
  ],
  "identifier": "O 2017-3049",
  "other_titles": [
    
  ],
  "title": "Sale of City-owned property and multi-family loan agreement with New West Englewood Homes LLC at 2101-2111 W 63rd St",
  "related_bills": [
    
  ]
}

screenshot 2017-07-31 22 29 01

When I extract the sponsorships array, I want to preserve it as an array and load it into a JSON type field in a PostgreSQL database.

To do that, I added this code to /src/Loaders/Table.php before line 168 in function insert(). I don't think it will work on a 3-dimension array (array[array[array[]]]), but it works on my 2-dimension array.

// handle arrays that should be turned into JSON strings (only for Postgres)
                foreach($item as $key2=>$val):
                	if(is_array($val)):
                		$item[$key2] = json_encode($val);
                	endif;
                endforeach;

Callback transformer

Thanks for sharing PHP ETL. I enjoyed the approach to data handling and it helped me migrate between two databases provided by different systems.

The number of transformers is limited and it seems unreasonable to expect the library to provide something that covers every need.

Would you be interested in a transformer which lets the user provide a callback with an anonymous function á la:

<?php

use Marquine\Etl\Row;
use Marquine\Etl\Transformers\Transformer;

class CallbackTransformer extends Transformer
{

  /** @var callable */
  protected $callback;

  protected $availableOptions = [
    'callback'
  ];

  /**
   * Transform the given row.
   *
   * @param \Marquine\Etl\Row $row
   *
   * @return void
   */
  public function transform(Row $row)
  {
    call_user_func($this->callback, $row);
  }
}

Table.php needsUpdate() function is not returning the correct result if two fields have the same value

In my example below the incoming data has an updated value for client_name which is the same as client_key. The old data in the table has a value for client_key but a NULL for client_name. This should trigger an update but it doesn't.

I suggest that the issue is in this line and the change is that we need to use array_diff_assoc instead of array_diff.

return ! empty(array_diff($new, $old));

array(2) {
  ["client_key"]=>
  string(5) "Actual"
  ["client_name"]=>
  string(5) "Actual"
}
array(4) {
  ["id"]=>
  string(1) "1"
  ["client_key"]=>
  string(5) "Actual"
  ["client_name"]=>
  NULL
  ["client_access_key"]=>
  string(0) ""
}

Add customizer collumn

Hi, need customizer with one collumn in php-etl with my configuration Extract, is possible?

Example
In Model with Xml:
'organization'=>'parameter[0]', 'field01'=>'field01', 'field02'=>'field02' ...
organization not exist in my XML, is customizer is possible?

Package flow/jsonpath is abandoned, you should avoid using it. Use softcreatr/jsonpath instead.

Hi,

We have some issues with version 2.3.1.

When using composer 2, we get multiple errors connected with jsonpath package:

Class Flow\JSONPath\Test\JSONPathTest located in ./vendor/flow/jsonpath/tests/JSONPathTest.php does not comply with psr-0 autoloading standard. Skipping. Class Flow\JSONPath\Test\JSONPathTestClass located in ./vendor/flow/jsonpath/tests/JSONPathTest.php does not comply with psr-0 autoloading standard. Skipping. Class Flow\JSONPath\Test\JSONPathLexerTest located in ./vendor/flow/jsonpath/tests/JSONPathLexerTest.php does not comply with psr-0 autoloading standard. Skipping. Class Flow\JSONPath\Test\JSONPathBenchmarks located in ./vendor/flow/jsonpath/tests/JSONPathBenchmarks.php does not comply with psr-0 autoloading standard. Skipping. Class Flow\JSONPath\Test\JSONPathArrayAccessTest located in ./vendor/flow/jsonpath/tests/JSONPathArrayAccessTest.php does not comply with psr-0 autoloading standard. Skipping.

We suggest switching to: softcreatr/jsonpath - flow/jsonpath seems to be abandoned.

Dependency conflic Laravel 8.29

I am trying to install and this is the output:

Your requirements could not be resolved to an installable set of packages.

  Problem 1
    - marquine/php-etl v2.3.0 requires illuminate/container ^5.0|^6.0 -> found illuminate/container[v5.0.0, ..., 5.8.x-dev, v6.0.0, ..., 6.x-dev] but these were not loaded, likely because it conflicts with another require.
    - marquine/php-etl v2.3.1 requires illuminate/container ^5.0|^6.0|^7.0 -> found illuminate/container[v5.0.0, ..., 5.8.x-dev, v6.0.0, ..., 6.x-dev, v7.0.0, ..., 7.x-dev] but these were not loaded, likely because it conflicts with another require.
    - Root composer.json requires marquine/php-etl ^2.3 -> satisfiable by marquine/php-etl[v2.3.0, v2.3.1].


Installation failed, reverting ./composer.json and ./composer.lock to their original content.

Using -W switch:

Your requirements could not be resolved to an installable set of packages.

  Problem 1
    - marquine/php-etl v2.3.0 requires illuminate/container ^5.0|^6.0 -> found illuminate/container[v5.0.0, ..., 5.8.x-dev, v6.0.0, ..., 6.x-dev] but these were not loaded, likely because it conflicts with another require.
    - marquine/php-etl v2.3.1 requires illuminate/container ^5.0|^6.0|^7.0 -> found illuminate/container[v5.0.0, ..., 5.8.x-dev, v6.0.0, ..., 6.x-dev, v7.0.0, ..., 7.x-dev] but these were not loaded, likely because it conflicts with another require.
    - Root composer.json requires marquine/php-etl ^2.3 -> satisfiable by marquine/php-etl[v2.3.0, v2.3.1].


Installation failed, reverting ./composer.json and ./composer.lock to their original content.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.