
php-etl's People

Contributors

arthurhoaro, aurelienrasselet, devfrey, ecourtial, gilmojoa, hersoncruz, kdebisschop, leneko, leomarquine, nicolas-masson-wizaplace, ollyollyollyltd, tristanhall, vincentmarmiesse


php-etl's Issues

Complete example, with DB connection?

Does anyone have a complete example, no matter how simple, that includes database connections? I cannot figure out how to get my database configuration into the Table class via the Manager and ConnectionFactory so that my program can read from one database and write to another.
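A minimal sketch of what such wiring might look like, assuming the Manager is constructed with a ConnectionFactory and exposes addConnection(array $config, string $name), and that the Table extractor and Insert loader take the Manager in their constructors (connection settings below are illustrative):

    <?php

    use Wizaplace\Etl\Database\ConnectionFactory;
    use Wizaplace\Etl\Database\Manager;
    use Wizaplace\Etl\Etl;
    use Wizaplace\Etl\Extractors\Table;
    use Wizaplace\Etl\Loaders\Insert;

    // Register one named connection per database on a shared manager.
    $manager = new Manager(new ConnectionFactory());
    $manager->addConnection([
        'driver'   => 'mysql',
        'host'     => 'localhost',
        'database' => 'source_db',
        'username' => 'reader',
        'password' => 'secret',
    ], 'source');
    $manager->addConnection([
        'driver'   => 'mysql',
        'host'     => 'localhost',
        'database' => 'target_db',
        'username' => 'writer',
        'password' => 'secret',
    ], 'target');

    // Read from one connection, write to the other.
    (new Etl())
        ->extract(new Table($manager), 'users', ['connection' => 'source'])
        ->load(new Insert($manager), 'users', ['connection' => 'target'])
        ->run();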

Aggregator columns messed up

Has anyone encountered this problem: when there is no match between the two sources, the resulting CSV file has its columns mixed up?
PS: the data is extracted and aggregated from two queries, not CSV files. Here are the two data sources:
1.
"id";"cId"
3260311;5
3260312;4
3260313;3
2.
"id";"email"
3260312;"[email protected]";
3260315;"[email protected]";

Thx.

Niche extractors (plugin strategy)

I've been working on an Extractor class for Google Analytics. It is all set, to the extent that it works fine; the only remaining work is that I have not found a way to unit test it meaningfully.

Initially I had thought that it made sense to contribute here as a new Extractor. But on reflection, I'm not so sure.

It would add a limited-use dependency on google/apiclient that all users would incur, even though relatively few are likely to use the extractor.

I am thinking a separate repo for limited-use plugins makes more sense. I wonder what other people think.

Aggregator: add the possibility to not raise an exception on invalid rows

With the following input sources (a JSON file and a CSV file):

[
  {
    "id": "1",
    "type": "premium",
    "subscriptionNumber": "p123"
  },
  {
    "id": "2",
    "type": "standard"
  },
  {
    "id": "3",
    "type": "gold",
    "subscriptionNumber": "g234"
  }
]
1;John;Doe;[email protected]
2;Jane;Does;[email protected]
3;Bart;Simpson;[email protected]

I would like to extract the data into a CSV with the following columns:

  • id
  • email
  • type
  • subscriptionNumber

The result is the following:

1;[email protected];premium;p123
3;[email protected];gold;g234

The code works as expected, but it throws an exception because there is one invalid row; throwing is currently the only behaviour available when invalid rows are discarded.

Could we consider adding a flag (or something similar) to keep discarding invalid rows but without throwing an exception? In some use cases, it is perfectly normal to have invalid rows.
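A sketch of what such an option could look like; the 'strict' key is hypothetical and not part of the current API, and the exact Aggregator usage may differ:

    // Sketch only: the 'strict' key is hypothetical. When false, invalid
    // (incomplete) rows would be discarded without raising an exception.
    $etl->extract($aggregator, [$jsonIterator, $csvIterator], [
        'index' => ['id'],
        'columns' => ['id', 'email', 'type', 'subscriptionNumber'],
        'strict' => false,
    ]);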

Is it possible to transform a JSON that has more than one level of depth?

Hello everyone,

I just found out about your PHP ETL library, and I was wondering if it's possible to use the RenameColumns transformer to rename columns in a JSON source that has more than one level of depth.

For example, given the following JSON file:

[
  {
    "id": "110",
    "person": [
      {
        "idPerson": "89",
        "addresses": {
          "home": [
                {
                  "name": "10 rue des châtaigniers",
                  "country": "France"
                }
            ],
         "work": [
                {
                  "name": "12 rue des bois",
                  "country": "Belgium"
                }
            ]
        }
      }
    ]
  }
]

I want to load only the id, the idPerson, and just the country of the "home" address, matching my database columns:

$etl->extract($extractor, 'fakeDataSource/personAddress.json')
    ->transform($transform, ['columns' => ???]) // What do I need to put here?
    ->load($loader, 'schema.mytable', ['key' => ['id'], 'columns' => ['id' => 'id', 'id_personne' => 'idPerson', 'pays' => 'country']])
    ->run();
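If the Json extractor's JSON-path support for its 'columns' option reaches into nested structures (an assumption, not verified against this file), the extraction itself might select the nested fields, making the transform step unnecessary; a sketch:

    // Sketch: assumes the Json extractor resolves JSON-path expressions in
    // 'columns' against nested structures (unverified; paths illustrative).
    $etl->extract($extractor, 'fakeDataSource/personAddress.json', [
            'columns' => [
                'id'       => '$..id',
                'idPerson' => '$..person[*].idPerson',
                'country'  => '$..person[*].addresses.home[*].country',
            ],
        ])
        ->load($loader, 'schema.mytable', ['key' => ['id'], 'columns' => ['id' => 'id', 'id_personne' => 'idPerson', 'pays' => 'country']])
        ->run();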

Thanks in advance for your response. I'm sorry if this is the wrong place to ask this kind of question, so don't hesitate to tell me.

And thank you for this library, it could really save me time at work. Good job! 😃

Add the possibility to chain extractors

Imagine the following use case: we need data coming from two different sources, for instance two CSV files.
Both have a field in common, e.g. email.

First, sort both files by email (for instance in Excel).
Then, edit the ETL so that Extractors can be chained, getting the data from the two sources and merging them into one row. Since the two CSVs are sorted by email, we assume that each line in one file matches the corresponding line in the other file, but we still throw an exception if the emails don't match.

Note that we must be able to specify the common field between the two files.
Question: should we be able to merge n files instead of only two? (A hypothetical sketch follows.)
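A purely hypothetical sketch of what such chaining could look like; none of the methods or options below exist today:

    // Hypothetical API (not implemented): merge two email-sorted CSVs
    // line by line, throwing if the match keys diverge.
    (new Etl())
        ->extract(new Csv(), 'customers.csv')
        ->chainExtract(new Csv(), 'subscriptions.csv', ['match' => 'email'])
        ->load($loader, 'merged.csv')
        ->run();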

JSON Array to PostgreSQL Array error

When extracting data from a JSON file and loading it into a PostgreSQL database, array values are not handled correctly.

For example, when trying to load the value ["value1", "value2"] from a JSON file into a varchar[] field in a PostgreSQL database, this error occurs:

Array to string conversion

The JSON array should be transformed into a PostgreSQL array literal string so it can be properly inserted into the database: "{value1, value2}".
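A minimal sketch of such a conversion (the function name is illustrative, not part of the library):

    // Convert a PHP array to a PostgreSQL array literal, quoting each
    // element and escaping embedded backslashes and double quotes.
    function toPostgresArrayLiteral(array $values): string
    {
        $quoted = array_map(
            fn ($v) => '"' . str_replace(['\\', '"'], ['\\\\', '\\"'], (string) $v) . '"',
            $values
        );

        return '{' . implode(',', $quoted) . '}';
    }

    // toPostgresArrayLiteral(['value1', 'value2']) === '{"value1","value2"}'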

Csv Extractor: throw a proper error when the number of fields doesn't match the number of columns

Example implementation (does not require a doc update):

    /**
     * Extract data from the input.
     */
    public function extract(): \Generator
    {
        $handle = fopen($this->input, 'r');

        $columns = $this->makeColumns($handle);
        $rowIndex = 1;

        while ($row = fgets($handle)) {
            $rowIndex++;

            yield new Row($this->makeRow($row, $columns, $rowIndex));
        }

        fclose($handle);
    }

    /**
     * Converts the row string to array.
     */
    protected function makeRow(string $row, array $columns, int $rowIndex): array
    {
        $row = str_getcsv($row, $this->delimiter, $this->enclosure);

        $data = [];

        $rowCount = count($row);
        $columnsCount = count($columns);
        if ($rowCount !== $columnsCount) {
            throw new \LogicException("Row with index #{$rowIndex} only contains $rowCount elements while $columnsCount were expected.");
        }

        foreach ($columns as $column => $index) {
            $data[$column] = $row[$index - 1];
        }

        return $data;
    }

Create a component factory

In some use cases, we might need many instances of the same component (ETL, CSV Extractor...).

So far there are two ways to proceed:

  • Use DI. But your constructor could become huge.
  • Use DI + clone. But that becomes difficult to test.

We might consider creating a factory: a class with one public method get(), taking two parameters (see the sketch after this list):

  • The class name of the requested component
  • An optional array of options
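A minimal sketch of such a factory; the class name is only a suggestion, and the options() call assumes components expose that method:

    // Sketch: a naive component factory. Names are illustrative only.
    class ComponentFactory
    {
        /**
         * @param class-string $className component class to instantiate
         * @param array        $options   optional component options
         */
        public function get(string $className, array $options = []): object
        {
            $component = new $className();

            // Apply options when the component supports them.
            if (method_exists($component, 'options')) {
                $component->options($options);
            }

            return $component;
        }
    }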

Type hinting and PHPDoc

In another issue, a question was raised about a function's PHPDoc. In particular, the function had been like this:

/**
 * @param string $a
 * @param array $b
 */
public function foo (string $a, $b) {...}

I removed "@param string $a" in order to pass the "no_superfluous_phpdoc_tags" inspection (which is implied by @symfony in the distributed .php_cs.dist file). That raised a question among the reviewers, and one reviewer stated that they would like to see PHPDoc type hints for all parameters.

I'm creating this issue as a locus for any follow-up conversation.

If we want to see PHPdoc type hints for all parameters, we would have to follow one of these courses:

  • remove php-native type hints (which would reduce code reliability)
  • provide comments on each variable (which tends to generate a lot of repetitive and uninformative text)
  • override the configuration and set "no_superfluous_phpdoc_tags" to false

I can live with any choice that makes the code and tests comply with the inspection profiles we ship alongside. But I prefer the configuration we have. In my mind, the desired ideal is that nearly all variables are self-documenting and are constrained by php-native types. With the current codebase, we can verify that things like

ETL::load(Loader $loader, string $output, $options = []): Etl {}

can be refactored to

ETL::load(Loader $loader, string $output, array $options = []): Etl

It seems implied by the code, but since the refactoring could introduce a breaking change, it also seems worth explicitly verifying.

There are about a dozen places in the code where the variable type is mixed. With the current inspection profiles, those variables would need to retain their PHPDoc. In all other cases, PHPDoc would be provided for a variable only when its name is not adequately self-documenting.

Allow multi-line CSV header

I came across a bug where the CSV extractor assumes the header is a single line (it uses fgets when parsing the header).

Using the following CSV file:

Column 1,Column 2,"Multi-Line
Column 3"
Value 1,Value 2,Value 3

Running the following code:

(new Etl())
    ->extract(new Csv(), __DIR__.'/test.csv')
    ->load($loader, '', ['index' => 'Column 1'])
    ->run();

I get the following notice when the extractor tries to parse the first row (which is actually the second line of the header):

Notice: Undefined offset: 1
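For reference, PHP's fgetcsv() handles line breaks inside enclosures, so reading the header with it instead of fgets() should fix this; a sketch:

    // fgetcsv() keeps reading past a line break inside an enclosure, so
    // the multi-line header above parses as a single record:
    $handle = fopen(__DIR__ . '/test.csv', 'r');
    $header = fgetcsv($handle, 0, ',', '"');
    // ['Column 1', 'Column 2', "Multi-Line\nColumn 3"]
    fclose($handle);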

Thanks for the hard work on this library. I'm happy to make a pull request if you're open to that.

Create an Extractor that uses a generator callback

The idea here is to use a Generator callback to yield multiple rows for each row of an input.

In particular, one can use $pipeline->extract(...)->toIterator() as the input to this Extractor. In such a configuration, two extractors can be chained to return N x M rows by selecting N rows of JSON arrays from an SQL database, each containing M elements in its JSON array.

I have this code and its tests; working on docs now.
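For discussion, a hypothetical sketch of the idea; the class and property names are placeholders, not the proposed implementation:

    use Wizaplace\Etl\Extractors\Extractor;
    use Wizaplace\Etl\Row;

    // Hypothetical sketch: yield several Rows per input row via a
    // callback that returns a Generator.
    class GeneratorCallbackExtractor extends Extractor
    {
        /** @var callable(mixed): \Generator */
        protected $callback;

        public function extract(): \Generator
        {
            foreach ($this->input as $inputRow) {
                foreach (($this->callback)($inputRow) as $attributes) {
                    yield new Row($attributes);
                }
            }
        }
    }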

Use PHP 7.4 type hints

This is for the 2.x branch/release. PR will be added shortly.

The work will replace PHPDoc type annotations with PHP 7.4 typed properties. This is a compatibility-breaking change.
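For illustration, the kind of change involved (a generic example, not a specific diff from the PR):

    // Before (1.x): the type lives only in PHPDoc.
    /** @var array */
    protected $columns = [];

    // After (2.x): PHP 7.4 typed property; the PHPDoc line is superfluous.
    protected array $columns = [];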

Though we don't have a published code style guide, previous conversations have suggested our desire is to avoid duplication and excess code that does not carry information. So I've removed or streamlined parameter comments where appropriate (i.e., most parameters need no doc, or only a single line).

Also, in a few places, methods can use @inheritdoc; I do so when the existing comment just parrots the parent's doc. (Note: all of this plays nicely with phpDocumentor. Using @inheritdoc for inherited properties does not seem to, so I have left comments there even if they are largely redundant.)

Also, since our php-cs-fixer profile uses the Symfony rules, and since most of our variables are in camelCase, I refactored the handful of snake_case variables so the project would be consistent overall.

In the absence of a formal style guide, I am definitely making assumptions about what our desired style looks like. As a result, I expect and welcome comments.

Insert and InsertUpdate loaders are inconsistent

When 'columns' => [] in Insert loader, all columns are selected.

When 'columns' => [] in InsertUpdate loader, no columns are selected.

I think InsertUpdate is wrong and

if (is_array($this->columns) ?? [] !== $this->columns) {

should be

if (is_array($this->columns) && [] !== $this->columns) {

This bug will more or less automatically get resolved when we start using PHP 7.4 type hints for properties.

But I have a simple patch for the current code if we want to apply it...

Conditional in Database\Transaction::shouldBeginTransaction() may have an extra clause

After refactoring for PHP 7.4, the current function shouldBeginTransaction() in \Wizaplace\Etl\Database\Transaction is:

return !$this->open && (0 === $this->size || 1 === $this->count);

It's not clear to me that 0 === $this->size should be part of that condition and in fact

return !$this->open && 1 === $this->count;

also passes all our current tests. Either the tests are not diagnostic or the code is inaccurate. We should determine which and resolve the inconsistency.

Is it possible to extract an XML that has more than one loop?

I am using an XML file that has more than one level of depth. For example:

<users>
  <user>
    <profile>
    </profile>
    <addresses>
      <address></address>
    </addresses>
  </user>
</users>

I need to get the profile details and the address details for each user. How can I extract them? Could anyone suggest a solution?
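If the Xml extractor's 'loop' option behaves as in the original marquine documentation (an assumption; the column paths below are illustrative), one loop per <user> element might look like:

    // Sketch: assumes the Xml extractor iterates '/users/user' via the
    // 'loop' option and resolves column paths relative to each looped
    // element (unverified).
    $etl->extract(new Xml(), 'users.xml', [
        'loop' => '/users/user',
        'columns' => [
            'profile' => '/profile',
            'address' => '/addresses/address',
        ],
    ]);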

Option to skip insert for InsertUpdate (update only)

Thank you for forking this and continuing the work!

Wondering if it would be possible to have an option for InsertUpdate to only do an update? That is, if the item isn't in the table, do not insert it?

We've got one ETL that populates the table from an API service. But another API service reports which items were deleted (and only their index). If an item was deleted before it was extracted, it wouldn't exist in our table; in that case we don't want to try to insert it, just skip it. Is this possible?
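A sketch of what such an option might look like; the 'insert' key is hypothetical and not part of the current API:

    // Hypothetical 'insert' flag (not implemented): when false, rows whose
    // key is missing from the table are skipped instead of inserted.
    $etl->load(new InsertUpdate($manager), 'items', [
        'key' => ['id'],
        'insert' => false, // update-only mode
    ]);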

An odd idea for a transformer... a progress meter

It is not a normal ETL construct by any means, but what do you think of the idea of a transformer that periodically outputs some sort of progress indicator (and does nothing else)?

[Feeling a bit annoyed with myself for not having a better way than counting rows in a database to measure progress :-( ]
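A rough sketch of the idea, assuming transformers receive one Row per call; the class name, namespace, and reporting interval are illustrative:

    use Wizaplace\Etl\Row;
    use Wizaplace\Etl\Transformers\Transformer;

    // Sketch: a pass-through transformer that only reports progress.
    class ProgressMeter extends Transformer
    {
        protected int $count = 0;

        public function transform(Row $row): void
        {
            if (0 === ++$this->count % 1000) {
                fwrite(STDERR, "Processed {$this->count} rows\n");
            }
            // The row itself is deliberately left untouched.
        }
    }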

Boolean value update issues

With a PostgreSQL database, updates are not handled correctly in some cases.

Working :

  • old value : false => new value : true
  • old value : true => new value : NULL
  • old value : NULL => new value : true
  • old value : NULL => new value : false

Not working :

  • old value : true => new value : false
  • old value : false => new value : NULL

After some searching, I found an issue in the update(array $row, array $current) function of src/Loaders/InsertUpdate.php.
This section uses a loose comparison between the old values array and the new values array:
if ($row == array_intersect_key($current, $row)) { return; }

The $current array gets booleans from the database as PHP booleans, while the $row array uses strings ('true', 't', 'false', 'f') or NULL to represent booleans.

This prevents the database update when it should happen.
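This matches PHP's loose-comparison semantics, which explain exactly the two failing cases above; a minimal demonstration:

    // Any non-empty string is loosely equal to true, and NULL to false:
    var_dump('f' == true);   // bool(true)  -> true => false update skipped
    var_dump(null == false); // bool(true)  -> false => NULL update skipped
    var_dump(null == true);  // bool(false) -> true => NULL detected (works)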

Add Validator Step

Hi. Have you planned to implement a validation step? It could be very useful for ensuring that the extracted rows are valid. Thank you.

file name handling in CsvLoader

The CsvLoader appends a fileCounter and the .csv file extension in every case (whether the linePerFile option is used or not).

This forces class consumers to provide an incomplete file path.
E.g., the input file name /path/to/my/file produces:

  • /path/to/my/file_0.csv
  • /path/to/my/file_1.csv
  • /path/to/my/file_2.csv

To give CsvLoader class consumers full control over the file path, I suggest making the 0 suffix silent in the file name generation, and using pathinfo to extract the file name extension.

E.g., the input file name /path/to/my/file.csv would produce:

  • /path/to/my/file.csv
  • /path/to/my/file_1.csv
  • /path/to/my/file_2.csv

And/or the linePerFile option could control whether this first _0 file name suffix is added. A sketch of the suggested naming logic follows.
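A sketch of the suggestion; $this->output and $this->fileCounter are assumed property names, not necessarily the actual ones:

    // Sketch: keep the consumer's path intact for the first file, suffix
    // later chunks, and reuse the original extension via pathinfo().
    $info = pathinfo($this->output);
    $base = ($info['dirname'] ?? '.') . '/' . $info['filename'];
    $extension = isset($info['extension']) ? '.' . $info['extension'] : '.csv';

    $path = 0 === $this->fileCounter
        ? $base . $extension
        : $base . '_' . $this->fileCounter . $extension;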

Remove usage of the php function extract

Maybe related to #2
We added some TODO annotations where the extract PHP function is used. It creates variables from array keys, which then forces callers to check for their existence. This needs to be refactored.
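For illustration, the kind of refactoring this implies (the option keys here are only examples):

    // Before: extract() creates $columns, $delimiter, ... implicitly,
    // and each one must then be checked for existence.
    extract($options);

    // After: explicit reads with defaults, so every variable is visible.
    $columns = $options['columns'] ?? [];
    $delimiter = $options['delimiter'] ?? ';';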

Create a CSV splitter

Create two CSV splitter classes.
One splitting the CSV into n parts.
One splitting the CSV into files of at most n lines.

Be careful with multiline CSVs.

Needs a documentation update.

Add a CSV Loader

So far we only have SQL Loaders.
We would need to create a basic CSV one, enabling ETL pipelines to output to CSV.

Needs a documentation update.

Tests do not comply with PSR12

We call for PSR-12 in phpcs, but the tests do not meet that standard, primarily due to snake_case test names.

I'm happy to create a patch to address this. It should not create any backward compatibility issues and could land as a 1.2.x release or as 2.x if we decide to create a new major release.

Adding a multiline CSV extractor

The current CSV extractor does not work with files containing entries that span more than one line (multiline entries).

Needs a documentation update.

Syntax error on Load if column is a reserved word (on more recent MariaDB)

Earlier versions of MariaDB were more forgiving, but now I'm getting a syntax error because one of my columns is a reserved word. Oops. The quick fix is to rename the column, but I thought I'd bring it up here so that php-etl could put backticks around column names when building queries, which would avoid the error.
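A minimal sketch of the quoting helper this implies (the function name is illustrative):

    // Quote an identifier for MySQL/MariaDB, escaping embedded backticks.
    function quoteIdentifier(string $name): string
    {
        return '`' . str_replace('`', '``', $name) . '`';
    }

    // quoteIdentifier('order') === '`order`'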

Fix tests in UniqueRowsTest

So far these tests don't test anything: the Rows in the $expected array are the very same Row object references as in $data. $expected needs to be filled with freshly constructed Rows for the equality assertion to be meaningful.
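In other words, something along these lines (a sketch, assuming Row takes its attributes as a constructor array):

    use Wizaplace\Etl\Row;

    // Fresh Row instances: assertEquals() then compares values, not identity.
    $data = [new Row(['id' => '1']), new Row(['id' => '1']), new Row(['id' => '2'])];
    $expected = [new Row(['id' => '1']), new Row(['id' => '2'])];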

Improve the way SQL requests are handled

Following @ArthurHoaro's comment:

I didn't really take a deep look at the implementation, but it looks like that there is a lot of code for SQL query building.
As a further improvement, maybe it would make sense to use a proper SQL query builder as a dependency rather than implementing one in this lib.

Use constants

Development experience with this library has shown how often option keys such as 'columns' or 'delimiter' are repeated as string literals.
The idea would be to define and use constants for them.
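A small sketch of the idea (the Options class name is only a suggestion):

    // Sketch: centralise option keys as constants instead of bare strings.
    final class Options
    {
        public const COLUMNS = 'columns';
        public const DELIMITER = 'delimiter';
    }

    // Usage: $etl->extract(new Csv(), 'file.csv', [Options::DELIMITER => ';']);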
